hardy crag Jan 4, 2019, 10:51 PM

#

also you would need some kind of evaluation to know how good your algorithm performed/how satisfied you are with the result

#

the optimizer you are using is then a class which represents the algorithm, so it needs the model parameters in the constructor, some training methods that take data and a run method, which just evaluates data and does not train.
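
A minimal sketch of that layout, assuming a deliberately trivial "algorithm" (a running mean). The names `MeanRegressor`, `train`, `run` and `mean_absolute_error` are made up for illustration and do not come from any framework:

```python
class MeanRegressor:
    """Toy model that always predicts the mean of what it was trained on."""

    def __init__(self, initial_guess=0.0):
        # Model parameters live in the constructor.
        self.mean = initial_guess
        self.count = 0

    def train(self, targets):
        # Training method: takes data and updates the parameters.
        for value in targets:
            self.count += 1
            self.mean += (value - self.mean) / self.count

    def run(self, n_samples):
        # Run method: only evaluates, it never changes the parameters.
        return [self.mean] * n_samples


def mean_absolute_error(predictions, targets):
    # Evaluation: how far off are the predictions on average?
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


model = MeanRegressor()
model.train([2.0, 4.0, 6.0])
predictions = model.run(2)
error = mean_absolute_error(predictions, [3.0, 5.0])
print(predictions, error)
```

Swapping in a real algorithm would only change the bodies of `train` and `run`; the evaluation step stays separate so different models can be compared the same way.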

#

(or you can just use a framework for the model that you can compare your solution to, and for creating the rest of the script)

#

does that make sense?

chrome lily Jan 4, 2019, 11:14 PM

#

Thank you so much, it makes some sense
I'm just a bit clueless on how to begin writing the code, as that's my main issue. I've been coding for around 5 months now, first-year university student
Still learning the ropes

#

@hardy crag

hardy crag Jan 4, 2019, 11:26 PM

#

just build it step by step. first, create the input, and just print it back to the console, and so on

chrome lily Jan 4, 2019, 11:38 PM

#

Okay thank you so much!

hardy crag Jan 5, 2019, 12:03 AM

#

yw

brisk lynx Jan 5, 2019, 7:27 AM

#

can someone help me here? Im learning basic data science and did this for model accuracy

#

```
    def get_model_accuracy(self):
        x = self.selected_columns[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
        y = self.selected_columns[["CO2EMISSIONS"]]
        linear_reg = LinearRegression().fit(x, y)
        accuracy_value = -cross_val_score(linear_reg, x, y, cv=10, scoring="neg_mean_squared_error").mean()
        print("model accuracy was:", format(accuracy_value ** 0.5, ".2f"))
```

#

from what I learned the accuracy values need to be between [0, 1]

#

but the value im getting is 24

#

what did I do wrong

#

?

placid snow Jan 5, 2019, 7:30 AM

#

Accuracy would be calculated from Number of correct predictions / Total number of predictions -> ```
(TP + TN) / (TP + TN + FP + FN)``` So either you got the formula wrong, or some of your values don't add up
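
As a rough illustration of that formula for a two-class problem (the labels below are invented, and `confusion_counts` is a hypothetical helper, not a library function):

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true/false positives and negatives for a binary problem.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy_from_counts(tp, tn, fp, fn):
    # Correct predictions divided by all predictions.
    return (tp + tn) / (tp + tn + fp + fn)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy_from_counts(*counts)
print(counts, acc)
```

Note that this only makes sense for classification. A regression target such as CO2 emissions has no "correct" or "incorrect" predictions to count.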

#

I haven't touched this in ages so can't really remember much more about it

brisk lynx Jan 5, 2019, 7:31 AM

#

hum, I thought the module would do the formula for me

placid snow Jan 5, 2019, 7:31 AM

#

tp being true positives, tn true negatives and so on

#

Humm, I can look through my old course work for how I did it.

#

I believe i had a bit of accuracy calculated

#

https://github.com/tagptroll1/NaiveBayes284/blob/master/NaiveBayes.py#L113 I only calculated accuracy when i did the model manually, so no sklearn or any other lib

#

¯_(ツ)_/¯

#

It's also pretty old, so not the best of python code in general.

brisk lynx Jan 5, 2019, 7:38 AM

#

oh I just found out LinearRegression() has a score() method

#

but Ill look into your repo

placid snow Jan 5, 2019, 7:39 AM

#

I forget if score is accuracy or something else, but that could be it
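
For what it's worth, `score()` on scikit-learn regressors returns the coefficient of determination (R²), not a classification accuracy, while the earlier value of 24 was a root mean squared error expressed in the units of the target. A small sketch of the two numbers on made-up values, written in plain Python so nothing here depends on a particular library version:

```python
import math


def rmse(y_true, y_pred):
    # Root mean squared error: same units as the target, not bounded by 1.
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    # Coefficient of determination: 1.0 is a perfect fit.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical CO2 emission values and predictions.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]

error = rmse(actual, predicted)
r2 = r_squared(actual, predicted)
print(error, r2)
```

So an RMSE above 1 is not a bug by itself; it just is not a value on the [0, 1] accuracy scale.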

brisk lynx Jan 5, 2019, 7:41 AM

#

yeah score did it

#

it returned 86.4%

#

hum, thats not good is it?

#

```
    def get_model_accuracy(self):
        x = self.selected_columns[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
        y = self.selected_columns[["CO2EMISSIONS"]]
        linear_reg = LinearRegression().fit(x, y)
        print("model accuracy was:", format(linear_reg.score(x, y) * 100, ".2f"), "%")
        print("The predicted co2 emission values were", linear_reg.predict(x))
```

placid snow Jan 5, 2019, 7:42 AM

#

That's fairly good, from what i remember

brisk lynx Jan 5, 2019, 7:42 AM

#

well, thanks for the help, Ill check your repo now chibli, thanks

placid snow Jan 5, 2019, 7:42 AM

#

Having too high of a score could mean overfitting

brisk lynx Jan 5, 2019, 7:43 AM

#

that means the data is way too specific to the training set and not to a general situation, right?
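
One way to see overfitting is to score a model on data it was not trained on. The sketch below uses invented noisy data and a 1-nearest-neighbour rule, chosen only because it memorises its training set; none of it comes from the code above:

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    # 1-nearest-neighbour: copy the label of the closest training point.
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def accuracy(train_x, train_y, xs, ys):
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)
# Hypothetical data: label is 1 when x > 0.5, but about 30% of labels are flipped.
xs = [rng.random() for _ in range(200)]
ys = [int(x > 0.5) ^ int(rng.random() < 0.3) for x in xs]

train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(train_acc, test_acc)
```

The training score looks perfect because every point is its own nearest neighbour; the held-out score is the one that says how the model generalises.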

placid snow Jan 5, 2019, 7:44 AM

#

Something like that :P

#

It's a bit blurry for me

brisk lynx Jan 5, 2019, 7:44 AM

#

thanks

placid snow Jan 5, 2019, 7:48 AM

#

If you're interested this was my exam in that course, very small training set and I used the wrong model but hey 🤷🏽
https://github.com/tagptroll1/machinelearningexam

thin terrace Jan 5, 2019, 7:22 PM

#

Anyone can give some directions on how to select nr of epochs, batch size, hidden layer size etc. I made a dataset and now I'm trying to evaluate how well my model can fit it. I can't make any sense of the accuracy results. Sometimes the accuracy can be everything from 25-95% and sometimes it hits a reoccurring 28.42% - even if i run with the exact same hyperparameters..

#

```
X = np.array(data.drop(['H', 'D', 'A'], axis=1))  # Features
y = np.array(data[['H', 'D', 'A']])  # Labels

# 10-fold Cross-validation
kf = KFold(n_splits=10)  # Shuffle=True?
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

model = Sequential()
model.add(Dense(51, activation='relu'))
model.add(Dense(25, activation='relu'))  # Experiment with hidden layer size
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X, y, epochs=1000, batch_size=18)  # Experiment with epochs and batch_size

scores = model.evaluate(X, y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
```

#

Like, why is every single epoch 28.42%?

📎 unknown.png

hasty maple Jan 5, 2019, 7:44 PM

#

that's the best the model can do I guess

thin terrace Jan 5, 2019, 8:26 PM

#

It's not, cuz running it again it might just as well hit 95%

#

Or 62 or 54 or whatever

velvet anchor Jan 5, 2019, 9:09 PM

#

There isn’t a magic way to check hyper parameters I don’t believe

#

It just kind of is trial and error

thin terrace Jan 5, 2019, 9:26 PM

#

Well how can I trial and error if it always gives me different results even if I tweak the hyperparameters or not?

#

I can't really tell if it was the tweaking who made it better/worse or just the random change

hardy crag Jan 5, 2019, 9:28 PM

#

well if the model is randomly initialized, the minimum it finds may be different from run to run

#

have you checked whether the 28.xx% is the result of the model always predicting the same output for any input?
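
A quick way to run that check is to compare the accuracy with the share of each class in the labels. As a side note, 28.42% of 380 samples is exactly 108 samples, which would be consistent with every prediction landing on one class that occurs 108 times. The class counts below are hypothetical, picked only to match that arithmetic:

```python
from collections import Counter


def class_shares(labels):
    # Fraction of the dataset that belongs to each class.
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical season of 380 matches: H = home win, D = draw, A = away win.
labels = ["H"] * 171 + ["D"] * 101 + ["A"] * 108
# A collapsed model that outputs the same class for every input.
predictions = ["A"] * 380

shares = class_shares(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(Counter(predictions))
print(round(accuracy * 100, 2))
```

If the accuracy of a run equals one of the class shares and `Counter(predictions)` has a single key, the network has collapsed to a constant output for that run.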

#

also: how big are your training and testing set?

#

there is a lot of "intuition" and "experience" and reading tips and tricks for x architecture involved in finding hyperparameters

thin terrace Jan 5, 2019, 10:47 PM

#

Dataset is 380 samples (representing one season of EPL football games) which can be increased by adding more seasons. I've tried splitting it with the validation_split parameter of model.fit() by 10%, 20% and 30%.

#

Can you point me to any good resources of these tips and tricks? @hardy crag

hardy crag Jan 5, 2019, 11:15 PM

#

@thin terrace so the model always predicts Away win?

#

if you have an imbalanced dataset, do you correct for that?

#

an example of tips and tricks (and a good blog in general): https://towardsdatascience.com/deep-learning-tips-and-tricks-1ef708ec5f53

small pumice Jan 6, 2019, 1:00 AM

#

Hi,
I am trying to use satellite data and a neural network to predict whether a wildfire will occur in a given area within a month. I am using Google Earth Engine to collect the data. I have a question about the neural network: I will not have as many occurrences of fires as I will fires not occurring. Therefore, I will have to use the same amount of fire occurrences as non-fires. Then, however, the result that the neural network outputs for whether a fire will occur within a month will be skewed. How do I account for this?

hardy crag Jan 6, 2019, 10:50 AM

#

you can correct in your loss function for imbalanced classes

#

like this stackoverflow

#

https://stackoverflow.com/questions/35155655/loss-function-for-class-imbalanced-binary-classifier-in-tensor-flow
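
A rough sketch of what "correcting in the loss" can look like, assuming a binary fire / no-fire label and the common heuristic of weighting each class inversely to its frequency. `class_weights` and `weighted_log_loss` are illustrative names, not functions from TensorFlow or any other library:

```python
import math


def class_weights(labels):
    # Weight each class by n_samples / (n_classes * class_count).
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    # Binary cross-entropy where each sample is scaled by its class weight.
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
    return total / len(labels)


# Hypothetical data: 1 fire for every 9 non-fires.
labels = [1] + [0] * 9
weights = class_weights(labels)

missed_fire = weighted_log_loss([1], [0.1], weights)
false_alarm = weighted_log_loss([0], [0.9], weights)
print(weights, missed_fire, false_alarm)
```

With these weights a missed fire costs roughly nine times as much as an equally confident false alarm, so the full imbalanced dataset can be kept instead of discarding non-fire samples.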

lapis sequoia Jan 6, 2019, 12:20 PM

#

Hi. I am trying to work on document similarity to retrieve top n documents for a given query

#

at present, i am using tf idf vectorizer

#

Does someone know something better than tf idf that can capture semantic meaning as well?

bronze osprey Jan 6, 2019, 1:41 PM

#

Can i convert .mlmodel (Apple Core ML model) to .tflite (TensorFlow Lite model) using python?

winter cliff Jan 8, 2019, 10:48 PM

#

What is some example code of putting eye tracker data into a csv file?

supple ferry Jan 9, 2019, 6:32 PM

#

Hey there! Does anyone have experience with the Logit model in Python? I am currently using the statsmodels version of it. What I'm trying to do, when fitting my model, is specify a logarithmic function so that it fits not on the original X but on log(X). Is there any easy way to do that other than manually creating columns in the df?

delicate nymph Jan 9, 2019, 9:05 PM

#

hi

#

Can anyone explain to me something about BSpline?

deft harbor Jan 10, 2019, 6:20 AM

#

Does anyone have good resources for dataviz or "storytelling"?

meager laurel Jan 10, 2019, 8:11 AM

#

So i got redirected here and I needed advice

#

how i can use image analysis
to identify the differences between these two
and other images
Like these two are different stages of a microstructure
of a material
and I wanted to create an app that will be able to distinguish between these two

📎 unknown.png

polar acorn Jan 10, 2019, 8:39 AM

#

How many examples do you have of each stage?

chilly shuttle Jan 10, 2019, 9:02 AM

#

You don't even need ml. Just run hough transform and observe the much more prominent lines in the left image

meager laurel Jan 10, 2019, 9:22 AM

#

@chilly shuttle can you elaborate more please

#

Like you're saying to use Hough transform, and to find the image with more lines than dark shapes?

chilly shuttle Jan 10, 2019, 9:24 AM

#

Run them through hough transform and look at the differences in output, it should be quite obvious after that
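
To give an idea of what the Hough transform does, here is a bare-bones version in plain Python on two tiny synthetic binary images (not the microstructure pictures). Every set pixel votes for all lines (rho, theta) that pass through it, so a straight line shows up as one bin with many votes. In practice a library routine such as OpenCV's `cv2.HoughLines` on an edge image would be the usual choice; this is only a sketch of the idea:

```python
import math


def hough_votes(image, n_theta=180):
    # Accumulate votes in (rho, theta) space for each non-zero pixel.
    height, width = len(image), len(image[0])
    max_rho = math.ceil(math.hypot(height, width))
    # Row index is rho shifted by max_rho so negative rho values fit.
    votes = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for t in range(n_theta):
                theta = math.pi * t / n_theta
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                votes[rho + max_rho][t] += 1
    return votes


def peak_votes(image):
    # Height of the strongest bin: large when many pixels are collinear.
    return max(max(row) for row in hough_votes(image))


# Synthetic 10x10 images: one horizontal line versus four scattered dots.
line_image = [[1 if y == 3 else 0 for x in range(10)] for y in range(10)]
dots_image = [[0] * 10 for _ in range(10)]
for x, y in [(0, 0), (7, 3), (2, 8), (5, 5)]:
    dots_image[y][x] = 1

peak_line = peak_votes(line_image)
peak_dots = peak_votes(dots_image)
print(peak_line, peak_dots)
```

An image dominated by straight features should produce a few very tall peaks, while blob-like shapes spread their votes out.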

meager laurel Jan 10, 2019, 9:29 AM

#

@chilly shuttle ohh ok, thank you!

chilly shuttle Jan 10, 2019, 9:30 AM

#

if you want more help you need to give a lot more information about your problem, e.g. how many samples of each class you have as pptt asked

#

but the two you showed don't need ml to discern, you just need to look for straight lines

meager laurel Jan 10, 2019, 9:32 AM

#

@chilly shuttle yea can I get back to you on that in a later day because

#

We just got the challenge today

#

From our materials uni course which is making a challenge for a hackathon

chilly shuttle Jan 10, 2019, 9:32 AM

#

if it's a challenge why are you coming straight here for help

#

take it on as a challenge

meager laurel Jan 10, 2019, 9:32 AM

#

And I'm planning to learn

#

Noo like

#

Ik what kinda thing to find out

#

It's just idk about the technology

#

Needed to find it out

#

Which I'm planning to learn

#

Before the hackathon

#

That's why they showed the challenge

#

Rn

#

The hackathon is in a few weeks

#

So we can kinda prepare for it

#

Like I just needed to know what kind of technology I needed to learn to find the differences in the two images

#

And I was planning to learn those stuff

#

So I can use it in the actual hackathon

#

That's why, sorry for the miscommunication

chilly shuttle Jan 10, 2019, 9:34 AM

#

i don't mind, we'll answer whatever questions

meager laurel Jan 10, 2019, 9:35 AM

#

Thanks! The only questions I plan on asking are related to the technology help itself

#

Hence why I came to the Python discord to start off

chilly shuttle Jan 10, 2019, 9:35 AM

#

just saying if it's a challenge i'd personally feel like any success would be diminished if I got my hand held through it

meager laurel Jan 10, 2019, 9:35 AM

#

Cuz I want to get involved in computer vision

#

And this challenge was something for me to start off with

#

Yea that's true, well like I just wanted to know where to start off from

#

Cuz I'm clueless right now

#

About machine learning

#

Like I got a course from Udemy, and I'm gonna use that to learn and from there try to solve the problem

#

Because I don't know yet, I still have to get clarification from them, whether they have more samples and whether the app they want is supposed to scan different materials that have diff microstructures, so the samples would vary between each material

#

I just got the idea of this a few hours ago, so I was just desperately trying to find a place to start

polar acorn Jan 10, 2019, 9:55 AM

#

Best of luck! But remember struggling and/or failing a bit is a great way to learn.

vestal ravine Jan 10, 2019, 2:34 PM

#

Hello everyone
i'm tryin to build a live forex chart
i don't have any experience with that yet
but i do know a little about python
is there anyone here, who can help me?

narrow ivy Jan 10, 2019, 4:10 PM

#

Hey guys, is there any free stock price API which allows me to get data more frequent than 1 day ? I want to make live updating chart using dash, but I need some great api that will make my chart to change for example each minute. Thanks in advance 😃

#

@vestal ravine I guess we're looking for the same thing 😄 Try to learn really useful libraries like pandas and numpy, and then matplotlib/plotly to create charts. After that you will be in my position - trying to find some great API to get live data, as frequent as possible.

vestal ravine Jan 10, 2019, 4:13 PM

#

ill take a look, Thnx.

meager laurel Jan 10, 2019, 5:19 PM

#

@polar acorn Thankks!!!

silk hill Jan 10, 2019, 5:24 PM

#

@mikey770 alpha vantage has an intraday api

small ore Jan 11, 2019, 7:52 AM

#

@meager laurel Is that a schematic diagram from a textbook or an actual black and white picture of the microstructure? Or an actual picture filtered to make the image simple for the task? A real-life problem would be to recognize the process (heat treatment / natural process of formation etc.) of any given material. In that case you will need a lot of sample pictures of each material structure

meager laurel Jan 11, 2019, 7:55 AM

#

@small ore well our materials Prof posted that on the slide for the lecture and I cropped the image from there, so I'm assuming it's an actual picture.

#

📎 20190110_032058-1.jpg

#

This is the challenge part

#

And we just started our materials course, so I don't know much about how the microstructures change. But I remember hearing he said the temperature affects it. I'll have to find out more info about it.

small ore Jan 11, 2019, 7:58 AM

#

Those may even be differently heat-treated steel. I too donno how a photograph of a microstructure is taken. I only remember seeing them in textbooks. Not even good ones

#

And my very limited and outdated knowledge tells me that recognizing microstructures was a forte of highly skilled and trained manpower. So this challenge seems to me like they are trying to remove the discrepancies that come with human decisions and the high cost involved (although who knows, machines may prove to be inferior in certain cases)

void anvil Jan 11, 2019, 4:36 PM

#

^

Steel microstructure is a massive fucking pain in the ass. Good luck getting anything beyond meh results if they give you a real world dataset.

#

Photographs come from SEMs

feral lodge Jan 11, 2019, 5:06 PM

#

https://www.researchgate.net/publication/317711703_Advanced_Steel_Microstructural_Classification_by_Deep_Learning_Methods Here's some recent research you can have a look at, it's a very nontrivial problem looks like! Judging from this table, looks like you need pretty sophisticated methods to even reach 50% accuracy on real-world data @meager laurel

📎 unknown.png

meager laurel Jan 11, 2019, 6:36 PM

#

@small ore @void anvil @feral lodge ohh ok, I'll look more into that, and ask my prof what he's expecting and about this stuff too

void anvil Jan 11, 2019, 6:44 PM

#

I have a friend who works in a steel mill. ML is useless in > 99% of the cases. If you get an actual breakthrough you'll get hundreds of millions of dollars every year.

small ore Jan 11, 2019, 7:56 PM

#

Well, unless you are working for a research firm, you may spend a million trying to get decent sample photographs

meager laurel Jan 11, 2019, 7:57 PM

#

Idk I think I recall the prof saying to find the difference by checking the amount of white in the picture or the size or something. Idk if that's fully true now

small ore Jan 11, 2019, 7:59 PM

#

True. They might be seeking something simple. I can't imagine someone throwing their students something so complex

thin terrace Jan 11, 2019, 7:59 PM

#

Any suggestions to which metrics to use in ML classification?

polar acorn Jan 11, 2019, 11:20 PM

#

Depends on the case. With nicely balanced classes accuracy might do fine. Screening for cancer? Maybe keep an eye on the f1 score and the false negative rate etc. etc.
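
A small sketch of why accuracy alone can mislead on a screening-style problem. The numbers are invented (5 sick patients out of 100) and `binary_metrics` is a hypothetical helper:

```python
def binary_metrics(y_true, y_pred):
    # Confusion counts, treating label 1 as the positive class.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "f1": f1,
        # Share of actual positives that the model missed.
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Hypothetical screening: 5 sick patients, and a model that says "healthy" for all.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The accuracy comes out at 95% even though every sick patient was missed, which the F1 score and false negative rate make obvious.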

thin terrace Jan 12, 2019, 3:22 PM

#

@polar acorn balance is like 45/25/30%

void anvil Jan 12, 2019, 3:27 PM

#

Depends strongly on what you're trying to predict

#

What input(s) do you have and what output(s) are you looking for?

thin terrace Jan 12, 2019, 3:29 PM

#

Predicting football matches. Input is team ratings and some history data from previous games. output is home team win / draw / away team win

#

From previous work I've basically only seen accuracy being used

void anvil Jan 12, 2019, 6:48 PM

#

You might be better off using a (-1,1) classifying system like they use in finance and putting draw in at 0

thin terrace Jan 12, 2019, 7:53 PM

#

@void anvil why is that?

void anvil Jan 12, 2019, 8:18 PM

#

experience

latent flicker Jan 13, 2019, 4:15 AM

#

Is there an R discord?

heavy apex Jan 13, 2019, 5:12 AM

#

Within data science, what tools are best in which situation? Currently doing a visualization class that is strictly Tableau, with only optional python and R learning. I'm obviously going to be putting in the extra optional work, but which tool tends to hold the most weight in a working environment?

carmine lava Jan 13, 2019, 5:25 AM

#

@feral lodge object tracking python codes or research paper please share if you know

quiet crest Jan 13, 2019, 10:00 PM

#

Does anyone have jupyter notebook slowing down after some period of time and was able to fix it? I have been looking it up, but couldnt find much

orchid lintel Jan 14, 2019, 4:18 AM

#

@thin terrace To expand on that answer, it's because there's an element of ordinality to those classes. ie, you're not predicting something truly Categorical like, say, "hair color". A win is "closer" to a draw than it is to a loss, and so it makes sense to put them on a scale.
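
As an illustration of that scale, the three outcomes can be mapped onto -1, 0, 1 and a continuous prediction mapped back to the nearest outcome. The letters H/D/A follow the column names used earlier in the chat; the helper names are made up:

```python
# Away win, draw, home win placed on one ordered scale.
OUTCOME_TO_SCORE = {"A": -1, "D": 0, "H": 1}


def encode(outcomes):
    return [OUTCOME_TO_SCORE[o] for o in outcomes]


def decode(score):
    # Pick the outcome whose score is closest to the model output.
    return min(OUTCOME_TO_SCORE, key=lambda o: abs(OUTCOME_TO_SCORE[o] - score))


def squared_error(true_outcome, predicted_outcome):
    return (OUTCOME_TO_SCORE[true_outcome] - OUTCOME_TO_SCORE[predicted_outcome]) ** 2


encoded = encode(["H", "D", "A"])
print(encoded, decode(0.7), decode(-0.2))
print(squared_error("H", "D"), squared_error("H", "A"))
```

On this scale, calling a home win a draw is penalised less than calling it an away win, which a plain three-way categorical label cannot express.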

#

@heavy apex Matplotlib itself is very powerful but honestly can be a pain. Seaborn's a much-easier wrapper around that, but it's limited in the types of visualizations it can do. Plotly is very good for both exploratory data analysis and presenting findings - it's also got a thing called Dash that can replace Tableau in a lot of instances. There's also Bokeh for interactive visualization. My favorite is probably Altair.

thin terrace Jan 14, 2019, 10:19 AM

#

@void anvil @orchid lintel which activation function do you use on the output layer for such a label?

thin terrace Jan 14, 2019, 10:35 AM

#

(and loss function)

vague jetty Jan 14, 2019, 11:08 PM

#

Anyone know a good free alternative to prodigy (https://prodi.gy/)?

#

Jk, looks like https://github.com/chakki-works/doccano is a good option

void anvil Jan 15, 2019, 1:03 AM

#

@vague jetty https://github.com/zalandoresearch/flair

#

is also a great one as well

#

same with spacy

vague jetty Jan 15, 2019, 1:14 AM

#

Prodigy is spaCy's version

#

Does anyone have experience with Doccano? I'm having trouble uploading a dataset.

vague jetty Jan 15, 2019, 1:36 AM

#

Looks like Doccano isn't working for me. Any other suggestions for an easy interface for data annotation?

Specifically, I'm looking to annotate classification, not keywords in text.

vague jetty Jan 15, 2019, 4:58 AM

#

Nvm, looks like Doccano started working for me.

chilly shuttle Jan 15, 2019, 12:00 PM

#

that's pretty cool

#

i guess you couple something like that with something like mechanical turk to generate labelled datasets

meager laurel Jan 15, 2019, 4:41 PM

#

Oh by the way, my professor got back to me and he said

#

"Sorry for the delay in getting back to you. Attached are two sample files similar to the ones that will be provided for the challenge. What we would like you to do is to measure the fractions of the light and dark phases. The fraction could simply be expressed as number of pixels of a given color over the total number of pixels. The challenge will be the lighting conditions. Sometimes you’ll get very good contrast between the phases (e.g. micro1.png) and sometimes there is less contrast (micro2_prec.png).

To do this, you can start by using Python libraries for reading an image and for interrogating the data."

#

📎 Capture_2019-01-15-11-41-16.png 📎 Capture_2019-01-15-11-41-21.png

#

That's what he wanted
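
A possible starting point for the phase-fraction measurement, sketched on tiny hand-made greyscale grids rather than the real files. Since the contrast varies between images, a fixed cut-off is risky, so this sketch picks the threshold per image with Otsu's method; that choice is an assumption, not something the professor asked for:

```python
def otsu_threshold(pixels):
    # Choose the grey level that maximises the between-class variance.
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def phase_fractions(image):
    # Returns (light fraction, dark fraction) as pixel counts over total pixels.
    pixels = [value for row in image for value in row]
    threshold = otsu_threshold(pixels)
    light = sum(value > threshold for value in pixels) / len(pixels)
    return light, 1 - light


# Hypothetical 4x4 images: same layout, good contrast versus poor contrast.
layout = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
good_contrast = [[200 if cell else 20 for cell in row] for row in layout]
poor_contrast = [[140 if cell else 110 for cell in row] for row in layout]

print(phase_fractions(good_contrast), phase_fractions(poor_contrast))
```

For the real PNGs, the pixel grid could come from an image library, for example Pillow's `Image.open(path).convert("L")`; real micrographs will also have noise and uneven lighting that this sketch ignores.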

vague jetty Jan 15, 2019, 6:32 PM

#

Woop woop, got a follow-up email from a company for an ML research internship. They want me to submit a research proposal, but I have no idea where to start...

rugged leaf Jan 16, 2019, 1:12 AM

#

@vague jetty ay nice job

void anvil Jan 16, 2019, 5:49 PM

#

Basically here's a problem, here's what it'll do for your company, here are some methods that I could try to apply, and here's why it'll be worth your money

small ore Jan 16, 2019, 5:57 PM

#

Like I want to learn deep ocean diving, want budget for a sonar measurement of dophin and whale sounds, so that I can help devise a way to distinguish between ships/subs and other creatures 😄 😛

vague jetty Jan 16, 2019, 8:19 PM

#

So I'm really new to sentiment analysis and am playing around with a project doing sentiment on a hockey forum. I've scraped a bunch of posts and am working on annotating them to make the training and test sets. I imagine this is a vague, common question with no good answer, but how do I label a post like "He wasn't great at the u18's. He started off the year in Kingston pretty bad, but got a lot better later on near the new year. His consistency could use some work from what I've seen/heard. He can be dominant but he can also have some bad showings." It's pretty clearly both positive and negative. Right now I'm applying labels to the entire post. I imagine selecting the instances of positive text and negative text in the post would be better than blanket labeling the entire post, but it will take a lot more time to do that. Should I label it both positive and negative?

void anvil Jan 16, 2019, 8:24 PM

#

You can count sentiment items per post (10x bad, 3x good) if you want. What's harder to do is pick up sarcasm (e.g. PK Subban is the worst role model in the league).
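
A toy version of counting sentiment items per post, using the example post from above. The word lists are tiny and hypothetical, and the one-word negation check is a crude assumption; sarcasm would still slip through:

```python
import re

# Hypothetical mini-lexicons, just enough for the example post.
POSITIVE = {"great", "better", "dominant"}
NEGATIVE = {"bad", "worst", "sucks"}
NEGATIONS = {"not", "wasn't", "isn't", "never"}


def count_sentiment(post):
    tokens = re.findall(r"[a-z']+", post.lower())
    counts = {"positive": 0, "negative": 0}
    for i, token in enumerate(tokens):
        # Flip the polarity when the previous word is a negation.
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        if token in POSITIVE:
            counts["negative" if negated else "positive"] += 1
        elif token in NEGATIVE:
            counts["positive" if negated else "negative"] += 1
    return counts


post = (
    "He wasn't great at the u18's. He started off the year in Kingston pretty bad, "
    "but got a lot better later on near the new year. His consistency could use some "
    "work from what I've seen/heard. He can be dominant but he can also have some bad showings."
)
result = count_sentiment(post)
print(result)
```

Keeping both counts preserves the mixed signal of the post instead of forcing it into a single positive or negative label.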

#

It's also a bit more difficult to pick out what's being said good / bad about the specific person in the post (e.g. player XYZ is doing badly so far this year but will pick up once he's off injury. He's still better than ABC who sucks.)

vague jetty Jan 16, 2019, 8:39 PM

#

Yeah, that's a big issue I'm running into. Eventually I might try playing around with disambiguation, but I want to keep things simple right now. Do you recommend counting sentiment items over selecting the actual text?

void anvil Jan 16, 2019, 8:43 PM

#

Counting sentiment items is > than just having a -1, 1 for good / bad on the post. It's not necessarily the best option.

obtuse kettle Jan 16, 2019, 11:11 PM

#

Not sure if this is the place to ask. I’m a computer science student who’s currently taking calc. I love Microsoft’s ability to annotate equations (I think it uses LaTeX). My issue is that during work, if I get a chance to study, I’m not able to use downloadable apps. Do you guys have a good note taking web app or web word processor that supports math equations? Google docs is super limited :/

desert oar Jan 16, 2019, 11:37 PM

#

that is an odd one indeed. you could use markdown in a jupyter notebook, i think there are free notebook servers out there

#

otherwise i think google docs is your best bet

terse pewter Jan 17, 2019, 3:12 AM

#

Hey does anyone know if date is a discrete or continuous numerical type?

#

I would think that it would be discrete

small ore Jan 17, 2019, 5:42 AM

#

@obtuse kettle If Jupyter Labs is okay for your purpose ( It can in addition to taking notes with equations also run various programs inline. Can also have graphs, tables, etc) then Microsoft Azure is one option. Requires a login

obtuse kettle Jan 17, 2019, 11:41 AM

#

@small ore Why azure over digital ocean or other vps?

small ore Jan 17, 2019, 11:42 AM

#

I didnt say over anything else. That was the only one I know which has support for jupyter notebooks

obtuse kettle Jan 17, 2019, 12:54 PM

#

I see. Thank you for your input, you as well salt 😃

obtuse kettle Jan 17, 2019, 3:12 PM

#

Jupyter is a no go =/

#

Can't access my server from work

vague jetty Jan 17, 2019, 5:21 PM

#

There's OverLeaf, but it's probably OverKill

#

(☞ﾟヮﾟ)☞

feral lodge Jan 17, 2019, 8:48 PM

#

@obtuse kettle I also recommend overleaf for notes, I always use it for math assignments etc. It's like google docs for latex

obtuse kettle Jan 17, 2019, 8:48 PM

#

Does it have a web app, free?

feral lodge Jan 17, 2019, 8:49 PM

#

@terse pewter Date should be a discrete variable. If it's date/time its probably continuous though

#

@obtuse kettle https://www.overleaf.com/ No charge, you just need to make an account

terse pewter Jan 17, 2019, 8:50 PM

#

Thank you!

feral lodge Jan 17, 2019, 8:50 PM

#

ctrl-enter to compile the pdf so you can see what you're doing

terse pewter Jan 17, 2019, 8:50 PM

#

That makes sense since time can be hours minutes seconds ms,etc

feral lodge Jan 17, 2019, 8:51 PM

#

Agreed!

#

@carmine lava https://www.pyimagesearch.com/2015/09/14/ball-tracking-with-opencv/ This blog entry is a few years old, I hope it's not outdated. I think it's a nice first exposure to object tracking though! Regarding papers, it might be difficult to find something general to learn the basics from; you'll probably mostly find papers exploring specifics and trying to improve the state-of-the-art for difficult problems

carmine lava Jan 17, 2019, 8:56 PM

#

@feral lodge thanks but i have seen it

#

Now what I am trying to do is give a unique id to an object and track it: when the object is in the frame it should get id 1, and if he goes out of the frame and comes back, the id should be the same @feral lodge 😱 if you find something close to it please let me know, I'd really appreciate that

obtuse kettle Jan 17, 2019, 9:00 PM

#

Overleaf seems really cool so far! Thank you for the suggestion! But what's the catch... ALl this for free? I can store as many files as I want?

feral lodge Jan 17, 2019, 9:08 PM

#

100 MB storage for starters, you can raise it to 1GB for free if you do their referral stuff. If there is some nefarious dark side to overleaf I haven't seen it yet 👌 @obtuse kettle

obtuse kettle Jan 17, 2019, 9:09 PM

#

I see. Thank you again for this ^_^ time to learn this magic

feral lodge Jan 17, 2019, 9:21 PM

#

@carmine lava I can't help you much, I have hardly touched object tracking 😖 I don't think I've ever seen work on recognizing and labeling certain individuals among some object class; only class labeling (i.e., labeling objects as a coffee cup rather than coffee cup #345 and coffee cup #346). I'll keep my eyes open though if I see something! What kinds of objects are you trying to label?

carmine lava Jan 17, 2019, 9:25 PM

#

Person

#

I am training on persons, but the thing is, when he goes out of frame and comes back the id changes; my requirement is that he should keep the same id

#

@Slandön# what are you using for detection i mean which lib faster rcnn or SSD which one

#

@feral lodge any tutorial on using faster rcnn

feral lodge Jan 17, 2019, 9:36 PM

#

Sorry my friend, I have no idea about the libraries! If there's a big difference you can probably find some benchmarks or discussions online

#

Same blog as before; he labels individual faces in this one it seems! https://www.pyimagesearch.com/2018/09/24/opencv-face-recognition/

#

Couldn't check very carefully though, duty calls 👺

hardy crag Jan 18, 2019, 12:00 AM

#

@carmine lava the problem with individual people is that you need a good dataset. If you have a good dataset you can use tensorflows object detection, which includes several faster_rcnn architectures

#

see e.g. https://medium.com/@WuStangDan/step-by-step-tensorflow-object-detection-api-tutorial-part-1-selecting-a-model-a02b6aabe39e or https://cloud.google.com/solutions/creating-object-detection-application-tensorflow?hl=en

spring yoke Jan 18, 2019, 7:50 AM

#

./m

crude flame Jan 18, 2019, 4:11 PM

#

anyone tried the book "Datascience from Scratch" by Joel Grus? I'm thinking of getting it, since I want to get some datascience book and it seems to put more emphasis on actually understanding things rather than just learning to feed a black box

#

uhm also I might be missing something, but when I check out the recently pinned guide on r/learnmachinelearning, the PhD-version of it just includes no guide (?)

desert oar Jan 18, 2019, 10:17 PM

#

any book teaching you to feed a black box isnt worth your time or money

#

unfortunately i dont know that book in particular, nor do i have any sane route to data science because my own personal path was so winding

#

but any good data science book will have both math/theory and applied content

#

my gripe with some "machine learning" books is they dont actually spend much time on applications

#

it should be a push-and-pull of: learn a concept, learn the math behind it, and finally learn to apply it

#

book exercises ideally would be both practice math/theory problems (e.g. derive some expression, prove a theorem), and practice mini-projects (simulate or download a dataset, then implement a model and make inferences on it)

void anvil Jan 19, 2019, 5:57 AM

#

Honestly you don't need to know fuckall about the math behind machine learning because of the big, beautiful packages set up by people. 99% of the effort is in data manipulation and experimental verification.

ripe lava Jan 19, 2019, 10:17 AM

#

@crude flame I can highly recommend An Introduction to Statistical Learning by James et al. All the main models used in supervised and unsupervised learning are described. I found it very intuitive and with just the right amount of math to get started (if you want to go really technical you can check Elements of Statistical Learning, the advanced book that they have written).

#

and it's free!

#

you can find it there: http://www-bcf.usc.edu/~gareth/ISL/

#

application exercises are in R though... But it should be easy to replicate most of the results using scikit-learn

wide oxide Jan 19, 2019, 10:35 AM

#

Introduction to statistical learning is a tough book to start with

#

@ripe lava

ripe lava Jan 19, 2019, 11:42 AM

#

I think anyone who once had a basic stats course should be fine

wide oxide Jan 19, 2019, 11:55 AM

#

These days I am very much into behaviour analysis, and psychology.

#

But, I am completely clueless. I don't know which course to take or which thing to follow.

crude flame Jan 19, 2019, 1:43 PM

#

I should have mentioned that I'm doing a PhD in mathematical physics, so I'm fine with technical stuff. Thanks for the suggestion!

orchid lintel Jan 20, 2019, 2:37 AM

#

@crude flame If you really want the foundations of the algos, Elements of Statistical Learning is where you wanna be. There's a free online Stanford class on it too, with guest lectures by like the actual guy who came up with CART and stuff. https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about

Statistical Learning

Learn some of the main tools used in statistical modeling and data science. We cover both traditional as well as exciting new methods, and how to use them in R.

#

@wide oxide If I were starting out now, I'd be doing DataQuest, I really like their material.

wraith crow Jan 20, 2019, 5:14 AM

#

Anyone that is familiar with K-mean clustering? I'm having trouble with using a template in my exercise.

ripe lava Jan 20, 2019, 5:43 AM

#

I might be repeating myself.... but go check out the ISLR book, its very well explained in there

wide oxide Jan 20, 2019, 5:46 AM

#

@orchid lintel Let me check their material. I've completed like 2-3 courses on Datacamp for Data Scientist with Python.

#

Their material was more like learn python for data science

reef bone Jan 20, 2019, 5:46 AM

#

@wraith crow Most people here will be familiar with k-means so go ahead and ask your question

#

Oh my god

#

Who did I ping

#

Oh you changed your name

#

All good

#

mr_thumbers

wraith crow Jan 20, 2019, 5:48 AM

#

Hehe 😃

#

I can't get this function to work:

#

📎 unknown.png

reef bone Jan 20, 2019, 5:49 AM

#

Are you using sklearn's k-means implementation?

wraith crow Jan 20, 2019, 5:50 AM

#

Hmm, the bot is not letting me upload the function

#

No, it's a function I have gotten from a template in the exercise

#

class KMeans:
    """ Simple K-Means implementation. Note that you can access
    the cluster means and the cluster assignments once you have
    called the "fit" function. The cluster means are stored in the
    variable 'cluster_means' and the assignments to the cluster
    means in 'cluster_assignments'. You can also use the function
    'assign_to_clusters' to obtain such assignments for a new set
    X of points.
    """

    def __init__(self, n_clusters=2, max_iter=30, seed=0, verbose=0):
        """ Constructor for the model.

        Parameters
        ----------
        n_clusters : int
            The number of clusters that should be
            found via the K-Means clustering approach.
        max_iter : int
            The maximum number of iterations (stopping condition)
        seed : int
            Number that is used to initialize the random
            number generator.
        """

        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed

    def fit(self, X):
        """
        Fits the K-Means model. The final cluster assignments
        (i.e., the indices) and the cluster means are stored
        in the variables 'cluster_assignments' and 'cluster_means',
        respectively, see the end of this function.

        Parameters
        ----------
        X : Array of shape [n_samples, n_features]
        """

wide oxide Jan 20, 2019, 5:51 AM

#

(Which course are you taking? @wraith crow )

wraith crow Jan 20, 2019, 5:51 AM

#

It's modelling and analysis of data

#

got it as a 'side-subject'

reef bone Jan 20, 2019, 5:51 AM

#

Oh this looks like a python problem, let's make sure you instantiate KMeans first, as you're calling the method

wraith crow Jan 20, 2019, 5:52 AM

#

Ah, I figured I might be calling it wrong?

reef bone Jan 20, 2019, 5:52 AM

#

Try doing

kmeans = KMeans()
kmeans.fit(data)

wraith crow Jan 20, 2019, 5:56 AM

#

Oh my god, that was it. Thanks a lot!

wide oxide Jan 20, 2019, 5:57 AM

#

Why there is no subject like that for us ;__;

#

I want to learn data science and I end up learning models but, no practical stuff, like implementing..

wraith crow Jan 20, 2019, 5:57 AM

#

Oh one thing, can I also initialise it using the init before kmeans.fit(data)?

#

For example if I need some other parameters than the one in the code, like KMeans(n_clusters=2, max_iter=30, seed=0).

#

Hmm seems to work

#

When I write it like
kmeans = KMeans()
kmeans.init(2,30,0)
kmeans.fit(data)

reef bone Jan 20, 2019, 6:01 AM

#

This is a bit of a weird situation

#

Normally the method init would actually be called __init__ which makes it a constructor - a method that is automatically called when you instantiate a new object, but in this case it isn't

#

If it was named that, you could instantiate like this:

kmeans = KMeans(2, 30, 0)

#

And it would automatically call the init function with those values

#

But since it's not named properly, the method has to be called separately (just as you have done)

#

Oh wait

#

I'm dumb

wraith crow Jan 20, 2019, 6:04 AM

#

well, they write I shouldn't change the original cell which says:
def init(self, n_clusters=2, max_iter=100, seed=0, verbose=0):
""" Constructor for the model.
So I need a way to change the parameters for the different exercises without changing that one in the original cell.

reef bone Jan 20, 2019, 6:04 AM

#

It probably is called __init__, but Discord is making it look like that because the underscores translate to underline, right?

wraith crow Jan 20, 2019, 6:05 AM

#

📎 unknown.png

reef bone Jan 20, 2019, 6:05 AM

#

Yeah I should have realized, sorry

#

It's 6am for me

wraith crow Jan 20, 2019, 6:05 AM

#

no worries!

reef bone Jan 20, 2019, 6:05 AM

#

So what you've done is fine

wraith crow Jan 20, 2019, 6:05 AM

#

It's 07:05 here 😕

reef bone Jan 20, 2019, 6:05 AM

#

But you can also do

means = KMeans(2, 30, 0)

And it will automatically call the __init__ function and pass those values

#

That's the intended purpose of the method
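
A minimal sketch of the constructor behaviour described above. `TinyKMeans` is a hypothetical stand-in, not the course template; it only shows that `__init__` runs automatically when the class is called, with either positional or keyword arguments.

```python
class TinyKMeans:
    """Hypothetical stand-in for the course's KMeans template."""

    def __init__(self, n_clusters=2, max_iter=30, seed=0):
        # Runs automatically whenever TinyKMeans(...) is called.
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed


default_model = TinyKMeans()                          # all defaults
custom_model = TinyKMeans(4, 30, 0)                   # positional arguments
keyword_model = TinyKMeans(n_clusters=4, max_iter=5)  # keyword arguments

print(default_model.n_clusters, custom_model.n_clusters, keyword_model.max_iter)
```

Passing the parameters by keyword leaves the original cell untouched and makes it harder to mix up the argument order.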

wraith crow Jan 20, 2019, 6:08 AM

#

okay great, thank you! I've been stressing a lot over this, so you have been my saviour 😄

Seems like means = KMeans(2, 30, 0) doesn't work, because when I change it to means = KMeans(4, 30, 0) for example, it doesn't give me 4 cluster means, but still 2

#

Like this:

#

📎 unknown.png

reef bone Jan 20, 2019, 6:09 AM

#

The variable on the second line is called means

#

But then you're fitting to kmeans

#

So they are different

wraith crow Jan 20, 2019, 6:09 AM

#

ah that did it

#

just needed that k there

reef bone Jan 20, 2019, 6:10 AM

#

I need to catch some sleep now but feel free to ask questions here, very smart people lurk this chat

wraith crow Jan 20, 2019, 6:12 AM

#

I have to go to sleep as well now, thanks a lot for your help! You made my day (or night)! 😃

#

Goodnight

wide oxide Jan 20, 2019, 6:12 AM

#

@reef bone How did you learn data science?

reef bone Jan 20, 2019, 6:17 AM

#

I'm still very much in the process of learning, but I picked data sciencey modules in uni and was blessed with incredible lecturers that made the field very accessible to me and eventually ended up doing a deep learning dissertation in my 3rd year

wide oxide Jan 20, 2019, 6:19 AM

#

How should a beginner start?

#

I have tried studying from Datacamp, Introduction to Statistical Learning and many other courses like Andrew NG.

#

For datacamp, I completed 3-4 courses and ended up learning nothing practical.

#

ISL: Completed 4 chapters, understood 25% of it.

#

Andrew NG's course was good but, assignments were different.

reef bone Jan 20, 2019, 6:25 AM

#

I've heard Introduction to Statistical Learning is a very good book. Apart from that however I don't think I'm the right person to answer this question, because I went the uni route so it was very different for me (learning from lectures, being driven by marked assignments, and the ability to ask for help directly in lab sessions). The book I learned the most from was Bishop's Pattern Recognition and Machine Learning - an incredible book, but a little math heavy and might be difficult to read for a beginner. I don't know much about those courses you have mentioned so I can't comment on them.

wide oxide Jan 20, 2019, 6:27 AM

#

No problem, thank you!

#

This is what we are going to learn this semester in Mathematics:

#

Helpful for DS?

📎 unknown.png

reef bone Jan 20, 2019, 6:41 AM

#

Looks like you'll get a thorough introduction to regression which is an important technique

#

Linear regression on its own is powerful

#

And then logistic regression introduces you to the sigmoid curve which forms the backbone of neural networks (although nowadays the ReLU activation seems to be more popular as it learns faster)

#

I think you'll learn a lot and you'll also get a good mathematical background that will pay off once you look into more complicated things

#

Out of interest, which level of education is this?

wide oxide Jan 20, 2019, 6:45 AM

#

I am in 2nd semester of B.E. Computer Science Engineering

#

The thing is we are learning nothing more than formulas. The professor taught us how to calculate central tendency for individual, discrete, and continuous series but, didn't tell us where to use which one. No concept of outliers and nothing ;_;

#

Few commands in R and Python can calculate the same things

reef bone Jan 20, 2019, 6:48 AM

#

That's annoying, but I'm a firm believer that having a good grasp on the mathematical background pays off

wide oxide Jan 20, 2019, 6:49 AM

#

I love Mathematics so I try to find things by myself

reef bone Jan 20, 2019, 6:49 AM

#

And you seem to be very driven on your own which is great

#

In university you shouldn't be afraid to go speak to your lecturers, maybe there is a reason why some things aren't covered yet, and I'm sure the lecturer wouldn't mind setting up a meeting with you and going over those things in more depth

#

I really need to get some sleep now, I wish you luck on your journey!

wide oxide Jan 20, 2019, 6:57 AM

#

Thank you!

orchid lintel Jan 20, 2019, 8:23 AM

#

@wide oxide Awesomely useful DS code snippets: https://chrisalbon.com/

Chris Albon

wide oxide Jan 20, 2019, 1:19 PM

#

@orchid lintel Thank you!

#

how do I use it?

wraith crow Jan 20, 2019, 2:53 PM

#

@reef bone Can I ask you again?

reef bone Jan 20, 2019, 2:53 PM

#

Sure, no guarantee I'll have the answer though

wraith crow Jan 20, 2019, 2:56 PM

#

I just did the K-mean cluster for a simple data set, and now I have to do it for an image.

#

📎 unknown.png

#

I'm wondering if this step is critical or not:
data = china / 255.0 # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

Using the example from: https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

#

📎 unknown.png

#

getting an error with reshape

#

📎 unknown.png

reef bone Jan 20, 2019, 3:03 PM

#

Hmm

#

data = china / 255.0 so seeing as all values in china will be in the range 0 - 255 (that's a range you'll see quite often, since these are the numbers we can represent using 8 unsigned bits), dividing them by 255 will rescale them to 0 to 1 range, which in this case is only useful for the visualisation part (it basically becomes a coefficient we can multiply other values with easily)

#

data = data.reshape(427 * 640, 3) in here, we start with data being a 3D array with dimensions (427, 640, 3), so it's an image 427 pixels in height, 640 pixels in width, and with 3 colour channels (r, g, b)

#

By calling reshape(427 * 640, 3) on it, we get a 2D array with one row per pixel; the rows of the image are laid end to end, kinda like this (each letter is one pixel):
x x x
y y y
z z z
Turns into:
x x x y y y z z z
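
A small sketch of that reshape on a made-up array (the 2x3 "image" here is hypothetical, not the china picture): flattening to one row per pixel, then reshaping back to the original layout.

```python
import numpy as np

# Hypothetical tiny "image": 2 rows, 3 columns, 3 colour channels.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)

height, width, channels = image.shape
pixels = image.reshape(height * width, channels)    # one row per pixel: (6, 3)

# Reshaping with the original dimensions undoes the flattening.
restored = pixels.reshape(height, width, channels)

print(pixels.shape, restored.shape)
```

Reading the dimensions from `image.shape` avoids hard-coding numbers such as 427 and 640, which is a common source of reshape errors when the image changes.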

wraith crow Jan 20, 2019, 3:12 PM

#

Okay, and that's important for the "X : Array of shape [n_samples, n_features]" requirement right?

reef bone Jan 20, 2019, 3:13 PM

#

that sounds right

#

i probably can't explain the error you're getting without seeing more of the code

#

mainly what the variables hold

#

it comes from trying to reshape an array in a way that's not possible

wraith crow Jan 20, 2019, 3:20 PM

#

📎 unknown.png

#

Looks like this one is the issue:

#

cph_image_tiny_recolored = new_colors.reshape(cph_image_tiny.shape)

#

Yeah, just the reshaping that's not working

#

How would you reverse the reshaping of:
data = data.reshape(166 * 250, 3)
after it was processed by the function?

reef bone Jan 20, 2019, 3:29 PM

#

reshape(166, 250, 3)

#

but you're trying to reshape the cluster means

wraith crow Jan 20, 2019, 3:29 PM

#

Yeah

reef bone Jan 20, 2019, 3:29 PM

#

can you print the shape of cluster means and cluster assignments

wraith crow Jan 20, 2019, 3:31 PM

#

The first is assignments:

#

📎 unknown.png

reef bone Jan 20, 2019, 3:35 PM

#

i'm trying to understand the entire process of how that's supposed to work

#

i don't think the means are particularly useful here

wraith crow Jan 20, 2019, 3:35 PM

#

"TODO: Segment the image by searching for 5 clusters

in the RGB space (see slide 21 of L13); use

'max_iter=5' and 'seed=0' as parameters."

reef bone Jan 20, 2019, 3:35 PM

#

when you do print(CA.shape), what do you get?

wraith crow Jan 20, 2019, 3:36 PM

#

This is the slide they refer to:

#

📎 unknown.png

#

I get (41500,)

#

Then #cph_image_tiny_recolored = new_colors.reshape(cph_image_tiny.shape) should be the correct idea right? Try to get the cluster means from the colours picture. But is assignments missing there?

reef bone Jan 20, 2019, 3:53 PM

#

sorry i'm lost

#

i'm very confused about why they make you reshape the image data in the first place

wraith crow Jan 20, 2019, 3:56 PM

#

if I don't reshape it before calling the function, the shape of cluster means become (5, 250, 3)

#

If it's reshaped

#

"X : Array of shape [n_samples, n_features]"

#

then data have n samples (pixels) and 3 features

#

but that reshape was from the website I linked, so I'm not sure it's needed

#

Hmm, it does say consider each pixel as a point in R^3..

#

📎 unknown.png

#

@reef bone

reef bone Jan 20, 2019, 4:09 PM

#

oh right

#

can you do

#

on the line where you have new_colors = CM do new_colors = CM[CA]

#

that should give you an array of shape (41500, 3)

#

and then you can reshape this back to (166, 250, 3)

wraith crow Jan 20, 2019, 4:27 PM

#

That worked, thank you!! 😄

#

Look at that beautiful thing

#

📎 unknown.png

reef bone Jan 20, 2019, 4:28 PM

#

nice! try playing with the K number (n_clusters) and it should look better as you go higher

#

sorry I took so long I was a little confused about what they want you to do

#

basically with new_colors = CM[CA] we are using the assignments as indices for the means, so to each sample (pixel) we're assigning its mean in the 3D colour space
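
A minimal sketch of that indexing step with made-up values. `CM` and `CA` follow the names used in the chat, but the numbers and the 2x3 image size are assumptions for illustration.

```python
import numpy as np

# Hypothetical values: 3 cluster means in RGB space, 6 pixel assignments.
CM = np.array([[0.1, 0.1, 0.1],
               [0.9, 0.2, 0.2],
               [0.2, 0.3, 0.9]])
CA = np.array([0, 2, 2, 1, 0, 1])

new_colors = CM[CA]                       # shape (6, 3): row i is CM[CA[i]]
recolored = new_colors.reshape(2, 3, 3)   # back to (height, width, channels)

# The same result written as an explicit loop.
looped = np.array([CM[k] for k in CA])

print(new_colors.shape, recolored.shape)
```

Indexing with an integer array picks one row of `CM` per entry of `CA`, which is why no loop is needed.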

wraith crow Jan 20, 2019, 4:30 PM

#

Yeah, it's a great solution

reef bone Jan 20, 2019, 4:31 PM

#

numpy can be a little esoteric sometimes (intuition would tell you to use a loop here) but once you learn the ins and outs it's an amazing tool

wraith crow Jan 20, 2019, 4:35 PM

#

Should it be easy to select a random subset to fit the k-means model?

#

TODO: Segment the image 'copenhagen.jgp' in a similar

fashion (using n_clusters=16, max_iter=5, seed=0). Instead

of using all data points/pixels, consider a random subset of

size 5000 to fit the k-means model and to find suitable cluster

centers. You can use the numpy.random.choice function to select

a random subset of indices (without replacement).

#

Ah, otherwise it has to fit for 300k points 😛

reef bone Jan 20, 2019, 4:38 PM

#

sure, that should be fairly easy

wraith crow Jan 20, 2019, 4:39 PM

#

📎 unknown.png

reef bone Jan 20, 2019, 4:49 PM

#

ok so numpy.random.choice() only works for 1D arrays, and we want to draw from a 2D array

#

so we can grab the indices from the 0th axis (samples) like this

#

indices = np.random.choice(data[0], 5000)

#

and then slice into the data array with the indices like this

#

chosen_data = data[indices, :] that will select only the randomly chosen indices

#

sorry I have to run off now so I don't have time to go over this in more detail

#

but you should be able to figure out the rest

wraith crow Jan 20, 2019, 4:51 PM

#

Okay thanks, I will try!

reef bone Jan 20, 2019, 4:52 PM

#

if you run into any problems just ask here and someone will help I'm sure

#

numpy also has fairly good docs so feel free to refer to this if you're having trouble understanding it

#

https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html

wraith crow Jan 20, 2019, 4:53 PM

#

Thank you again! 😃

#

get a "arrays used as indices must be of integer (or boolean) type"

#

I think I got it
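
That error is what `np.random.choice(data[0], 5000)` tends to produce: `data[0]` is the first row of `data` (three float colour values), so the draws are floats rather than row indices. A minimal sketch, assuming a made-up `data` array, that draws integer row indices from `data.shape[0]` and samples without replacement as the exercise text asks:

```python
import numpy as np

np.random.seed(0)
data = np.random.rand(20000, 3)     # made-up stand-in for the reshaped image

# Passing an int n makes choice() draw from 0..n-1, i.e. valid row indices.
indices = np.random.choice(data.shape[0], size=5000, replace=False)

chosen_data = data[indices, :]      # the randomly selected rows

print(indices.dtype.kind, chosen_data.shape)
```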

wraith crow Jan 20, 2019, 5:33 PM

#

Hmm, how can I reshape it back into the full image?

#

from the 5000 points

#

📎 unknown.png

#

Hmm, I only have 5000 assignments, but I need 272640 to assign every pixel.

#

@reef bone Are you back?

reef bone Jan 20, 2019, 6:18 PM

#

Does your KMeans class have a predict method? Or similar? You want to fit using the subset (5000 samples) and then predict the closest cluster for each data point (all samples)

wraith crow Jan 20, 2019, 6:21 PM

#

Doesn't seem like it has a predict method

#

📎 Class_Kmeans.txt

reef bone Jan 20, 2019, 6:27 PM

#

I think they want you to use this method

#

📎 Screenshot_20190120-182656.png

#

This is a bit of a struggle I'm on my phone

#

You want to pass it all data (X) and the means given by your fitting

#

And it will return the full assignment indices which we have worked with before

wraith crow Jan 20, 2019, 6:30 PM

#

Hmm, so I can use assign_to_clusters(data,means) where means is the one I got from using the full function?

#

on the 5000 points

reef bone Jan 20, 2019, 6:31 PM

#

Yes

#

mr_thumbers

#

It will return the full assignments

#

So you want to store them in a variable

#

assignments = kmeans.assign_to_clusters(data, means)

wraith crow Jan 20, 2019, 6:34 PM

#

Trying run it now

#

Seems like it uses a lot of computation though, which seems like I'm missing the point of using 5000 points

#

Hmm, still running

#

never finished, maybe it ran into an infinite loop. I'll just write a comment and use the full data when plotting the image

reef bone Jan 20, 2019, 6:43 PM

#

Thats odd, kmeans is generally quite fast

#

If you show your code I can take a look

wraith crow Jan 20, 2019, 6:50 PM

#

📎 unknown.png

#

Had uncommented the two yellow dots, and commented the yellow cross before

reef bone Jan 20, 2019, 6:53 PM

#

I think you should do assign_to_clusters(data, CM)

#

Not CA

wraith crow Jan 20, 2019, 6:55 PM

#

oh.. yeah that was a mistake. I'll try to run it again

#

it worked, and was much quicker

reef bone Jan 20, 2019, 6:56 PM

#

yeboa

#

The predictions are fast, its the fitting that takes time
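
A rough sketch of the fit-on-a-subset, assign-everything idea. The `assign_to_clusters` function below is a hypothetical stand-in written for this example (the template's own method is not shown in the chat), and the data, subset size, and initialisation are assumptions.

```python
import numpy as np


def assign_to_clusters(X, means):
    """Hypothetical stand-in: index of the nearest mean for every row of X."""
    # Squared Euclidean distances, shape (n_samples, n_clusters).
    distances = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)


rng = np.random.RandomState(0)
data = rng.rand(20000, 3)                   # made-up "pixels" in RGB space

# Fit on a small random subset only (a few plain k-means iterations).
subset = data[rng.choice(data.shape[0], size=500, replace=False)]
means = subset[:4].copy()                   # naive initialisation, 4 clusters
for _ in range(5):
    labels = assign_to_clusters(subset, means)
    for k in range(len(means)):
        if np.any(labels == k):             # leave empty clusters unchanged
            means[k] = subset[labels == k].mean(axis=0)

# Assign every sample to its nearest mean; no refitting involved.
assignments = assign_to_clusters(data, means)
new_colors = means[assignments]

print(assignments.shape, new_colors.shape)
```

The expensive loop only ever sees the 500 sampled rows; the full data is touched once, in the final assignment.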

wraith crow Jan 20, 2019, 6:56 PM

#

Have you seen this equation before?

#

📎 unknown.png

reef bone Jan 20, 2019, 7:30 PM

#

Possibly, but I dont recognize it off the top of my head

#

What is it for?

wraith crow Jan 20, 2019, 7:36 PM

#

it's non-linear regression

lapis sequoia Jan 20, 2019, 7:47 PM

#

Hi guys, I wanted to start studying Data Science with Python or R through DataCamp. I know coding is about skill development and not where you study, but I wanted to ask if that website is good enough to get at least a good grasp on what Data Science is like.

wraith crow Jan 20, 2019, 9:09 PM

#

Anyone that can help with this?

#

📎 unknown.png

#

have to implement this approach

#

📎 unknown.png

hardy crag Jan 20, 2019, 9:28 PM

#

@lapis sequoia never tried it , but their podcast is fairly good and the host seems competent enough

#

@wraith crow hard to say what you should do without the additional code you've been provided

wraith crow Jan 20, 2019, 9:35 PM

#

This is the 'dummy version'

#

📎 unknown.png

#

this is most of the function

#

📎 unknown.png

#

not sure where and how to construct that new equation

#

TODO: Implement the non-linear regression approach;

generate corresponding plots for sigma=0.1,

sigma=1.0, and sigma=10.0 by computing, for each

xbar in X_plot, the corresponding prediction

wraith crow Jan 20, 2019, 10:02 PM

#

@reef bone Do you remember PCA well?

reef bone Jan 20, 2019, 10:02 PM

#

I have some idea

wraith crow Jan 20, 2019, 10:03 PM

#

📎 unknown.png

#

I missed the earlier assigment with PCA, so I'm a bit lost there

reef bone Jan 20, 2019, 10:04 PM

#

PCA is a fairly complex algorithm in comparison with kmeans, are they asking you to implement it yourself or are you using some library?

wraith crow Jan 20, 2019, 10:04 PM

#

We have some templates available

#

Does this one seem fitting?

#

📎 unknown.png

#

I just don't see training data in that example

reef bone Jan 20, 2019, 10:06 PM

#

PCA is an unsupervised algorithm so training data will be similar to what you had with clustering

wraith crow Jan 20, 2019, 10:06 PM

#

📎 L9_PCA.ipynb

reef bone Jan 20, 2019, 10:07 PM

#

Sorry I can't go over the entire thing with you today

#

The code you have shown looks good, most importantly it lets you extract the eigenvalues and eigenvectors

#

Because they are sorted by eigenvalues, the components that describe the data the most will be at the top

wraith crow Jan 20, 2019, 10:09 PM

#

But they used "data" as input there, but I have 4 different 'data' as in trainset, testset, trainlabels and testlabels

reef bone Jan 20, 2019, 10:11 PM

#

Looks like you're passing something called diatoms to the pca function

#

By definition PCA ignores labels, it's unsupervised

#

The labels might be useful later on, for example PCA can sometimes be used as preprocessing technique to reduce dimensionality before you implement some kind of a classifier

#

But PCA itself only decorrelates data and reduces its dimensionality
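
A small sketch of those two points on made-up data, assuming samples are stored in rows (the lecture template may use the transposed layout). No labels are involved anywhere.

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up data: 200 samples (rows), 3 features, the first two strongly correlated.
base = rng.randn(200, 1)
X = np.hstack([base, 2 * base + 0.1 * rng.randn(200, 1), rng.randn(200, 1)])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)      # (3, 3) covariance matrix
evals, evecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues

order = np.argsort(evals)[::-1]             # sort by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

k = 2
projected = X_centered @ evecs[:, :k]       # reduced data, shape (200, 2)
projected_cov = np.cov(projected, rowvar=False)

print(projected.shape)
```

The off-diagonal entries of `projected_cov` come out numerically zero, which is the decorrelation; keeping only the first `k` columns of `evecs` is the dimensionality reduction.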

wraith crow Jan 20, 2019, 10:14 PM

#

So if I should try to run

#

import numpy as np
import numpy.matlib  # needed for np.matlib.repmat

def pca(data):
    # Extract data dimensions
    d, N = data.shape
    # First, center the data
    center = np.mean(data, 1)
    centers = np.matlib.repmat(center, N, 1)
    data_cent = data - np.transpose(centers)

    # Compute covariance and its eigenvalues from centered data
    Sigma = np.cov(data_cent)
    evals, evecs = np.linalg.eigh(Sigma)

    # Return eigenvalues and eigenvectors and -- for the sake of the lecture -- also the centered data
    return np.flip(evals,0), np.flip(evecs, 1), data_cent

PCevals, PCevecs, data_cent = pca(testset)

# PCevals is a vector of eigenvalues in decreasing order. To verify, uncomment:
# print(PCevals)

# PCevecs is a matrix whose columns are the eigenvectors listed in the order of decreasing eigenvalues

#

do you have an idea of what data should be?

reef bone Jan 20, 2019, 10:15 PM

#

Take a look at what diatoms is in the template code

#

It will be defined in one of the previous cells

wraith crow Jan 20, 2019, 10:17 PM

#

📎 unknown.png

#

classes seems a bit like labels

native rivet Jan 21, 2019, 2:44 AM

#

Hi guys

#

I need help

#

Im new to machine learning

#

But cant fig out from where to start

#

I want to learn ML

#

Please help me

#

I have 0% knowledge about machine learning

lapis sequoia Jan 21, 2019, 4:34 AM

#

@native rivet I'm not certain on exactly where to start since I barely got into data science myself, but I was told DataCamp machine learning courses are solid

supple ferry Jan 21, 2019, 1:14 PM

#

Hey there everyone. I have a Pandas related problem and would be glad if someone can help me out. I posted this question on stackoverflow:
https://stackoverflow.com/questions/54288604/applying-a-function-which-involves-multiple-boolean-operations-on-multiindexed-d

Stack Overflow

Applying a function which involves multiple boolean operations on ...

I have a question regarding Pandas Multiindex and applying a function. I have the following multi-indexed dataset as a result of this code:

grouped_df = df.groupby(by = ["individual", "cluster"])["

#

I keep trying to solve it for the past hours, but no success

wide oxide Jan 21, 2019, 1:44 PM

#

@native rivet DataCamp's "Become data scientist with python" course is more like "Learn Python for data science"

#

I did 4 courses on it and left it

native rivet Jan 21, 2019, 1:45 PM

#

@wide oxide do you know python than right

#

Sorry i mean machine learning

wide oxide Jan 21, 2019, 1:48 PM

#

Nope

small ore Jan 21, 2019, 1:50 PM

#

@native rivet Depending on how much math you already know and how much you can handle, there are different courses. For an understanding of the principles behind ML, Andrew Ng's course on Coursera and one by Columbia University on edX are nice. The latter is heavier on math than the former. There is a tonne of other material and courses too

native rivet Jan 21, 2019, 1:51 PM

#

Bro i want to go from scratch like

#

All algebra , calculus etc

#

From zero

reef bone Jan 21, 2019, 1:52 PM

#

http://vmls-book.stanford.edu/

#

This book covers a lot of the very basic math

native rivet Jan 21, 2019, 1:53 PM

#

I need something like ml course step by step

#

From maths to intermediate

wide oxide Jan 21, 2019, 1:53 PM

#

I am good with high school level mathematics

native rivet Jan 21, 2019, 1:54 PM

#

Can you just teach me basics which req to ml

#

After that i can enrol udacity nano degree

wide oxide Jan 21, 2019, 1:54 PM

#

and then I have done this last semester:

📎 unknown.png

#

Should I start with course by Columbia?

small ore Jan 21, 2019, 1:57 PM

#

@native rivet Andrew Ng's course is manageable with very little math knowledge. He even has a couple of lessons on matrix multiplication and such and then covers the required math as he goes. But learning math on your own from any source would be good for a better understanding

vapid mauve Jan 21, 2019, 1:57 PM

#

Are numpy.append, numpy.stack manipulations slower than normal lists? Am I supposed to be using normal lists and append, and turning that into a numpy array using numpy.array?
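
For reference, `numpy.append` does not modify the array in place; each call allocates and returns a new array. A minimal sketch comparing the two patterns from the question (the values are made up, and this only shows that both give the same result, not a benchmark):

```python
import numpy as np

values = [i * 0.5 for i in range(1000)]

# Pattern 1: grow an array with np.append (a full copy on every call).
grown = np.array([])
for v in values:
    grown = np.append(grown, v)

# Pattern 2: collect in a Python list, convert once at the end.
collected = []
for v in values:
    collected.append(v)
converted = np.array(collected)

print(grown.shape, converted.shape)
```

Because pattern 1 copies the whole array on each iteration, collecting in a list and converting once is usually the better choice when the final size is not known up front.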

native rivet Jan 21, 2019, 1:58 PM

#

@small ore its not with python

wide oxide Jan 21, 2019, 1:58 PM

#

There are codes available that implement the same on Python

small ore Jan 21, 2019, 1:58 PM

#

@wide oxide Some statistics and probability would be good but if you are sharp enough you can manage by going through some YT videos

native rivet Jan 21, 2019, 1:58 PM

#

So cant do any exercise properly

#

Do you guys recommend the Udacity Nanodegree?

small ore Jan 21, 2019, 1:59 PM

#

@native rivet Well, you said from scratch. So given the kind of courses I have seen it is best to take a course that isnt tool specific and then take a small course/read a tutorial for python modules

wide oxide Jan 21, 2019, 2:00 PM

#

@small ore We are studying this, this semester.

📎 unknown.png

native rivet Jan 21, 2019, 2:00 PM

#

Udacity will not take me from scratch?

small ore Jan 21, 2019, 2:01 PM

#

I have no idea abt the Udacity course

wide oxide Jan 21, 2019, 2:02 PM

#

course will start in February

📎 unknown.png

small ore Jan 21, 2019, 2:03 PM

#

@wide oxide Those topics in your Math course will certainly help

#

Also there are old columbia courses you can audit if you dont want to take the live course. See 📌 for a link

wide oxide Jan 21, 2019, 2:06 PM

#

Thank you very much!

#

Do you have something for data science? @small ore

#

or the same is good to start with it as well?

small ore Jan 21, 2019, 2:12 PM

#

There is a tonne of material. Courses/Texts/online material/free datasets/Blogs/etc. But good to start with learning the math and then basics of ML. Data science as far as I understand has many elements to it. Data exploration and ML should start you out well. You can learn data exploration when you start to learn some python ways of doing it

wide oxide Jan 21, 2019, 2:14 PM

#

I will start with the course then. Thank you very much!

#

I want to get into behavioural analysis and higher mathematics.

wide oxide Jan 21, 2019, 5:01 PM

#

Is it important to understand these functions?

📎 unknown.png

void anvil Jan 21, 2019, 5:04 PM

#

It depends on what side of ML you want to be on

#

there's the data science side where you don't need to know the math and treat each algorithm as a grey box, knowing what inputs / outputs / assumptions you make feeding / getting results from each model

wide oxide Jan 21, 2019, 5:05 PM

#

I am just starting

void anvil Jan 21, 2019, 5:05 PM

#

then there's the development side where you're developing faster or better ML algorithms where it's super important

#

Obviously if you understand how the algorithm works you will know what's happening much better

wide oxide Jan 21, 2019, 5:06 PM

#

From where I can learn this level Mathematics?

void anvil Jan 21, 2019, 5:06 PM

#

Calculus 1-3, Differential Equations, Statistics, Stochastic Calculus

#

they're Freshman - Jr year math

#

stochastic calculus, depending on the course, can be a grad level class

#

you'll also probably want to take classes on econometrics

wide oxide Jan 21, 2019, 5:08 PM

#

from where?

#

Statistics is already so vast

void anvil Jan 21, 2019, 5:08 PM

#

MIT OCW is a good place to start

#

introduction to statistics

#

Look for statistics for engineers probably

#

https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/

MIT OpenCourseWare

Introduction to Probability and Statistics

This course provides an elementary introduction to probability and statistics with applications. Topics include: basic combinatorics, random variables, probability distributions, Bayesian inference, hypothesis testing, confidence intervals, and linear regression. The Spring 2...

#

https://ocw.mit.edu/courses/civil-and-environmental-engineering/1-151-probability-and-statistics-in-engineering-spring-2005/lecture-notes/

There are two parts to the lecture notes for this class: The Brief Note, which is a summary of the topics discussed in class, and the Application Example, which gives real-world examples of the topics covered.

wide oxide Jan 21, 2019, 5:11 PM

#

Last one has no lecture videos

#

@void anvil what do you think about this?

#

https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-041-probabilistic-systems-analysis-and-applied-probability-fall-2010/

MIT OpenCourseWare

Probabilistic Systems Analysis and Applied Probability

Welcome to 6.041/6.431, a subject on the modeling and analysis of random phenomena and processes, including the basics of statistical inference. Nowadays, there is broad consensus that the ability to think probabilistically is a fundamental component of scientific literacy. F...

#

I have completed 5 videos of this before

#

did assignments as well (in semester breaks) and then semester started so had to leave

void anvil Jan 21, 2019, 5:13 PM

#

that's fine as well

#

it's a harder course

wide oxide Jan 21, 2019, 5:15 PM

#

I was able to understand 60-70% of the lectures

#

and was able to solve 50-60% of the assignments

#

Do you still think that I should follow the same course?

#

(BTW we're learning this : this semester)

📎 unknown.png

sharp gorge Jan 21, 2019, 11:22 PM

#

hey, could anyone recommend a reliable source for any sort of data? The actual contents of the data doesn't really matter as long as I can scrape it easily or get it through an API

reef bone Jan 21, 2019, 11:24 PM

#

Kaggle has datasets and competitions too

#

https://www.kaggle.com/

sharp gorge Jan 21, 2019, 11:33 PM

#

thank you! <3

heavy apex Jan 22, 2019, 6:30 AM

#

Omg, thank you @reef bone, but been looking for someplace to get some data to play around with more.

void anvil Jan 22, 2019, 1:24 PM

#

https://data.worldbank.org/

World Bank Open Data | Data

World Bank Open Data from The World Bank: Data

acoustic bone Jan 23, 2019, 11:22 AM

#

not "data science", but can someone direct me some resources on how matplotlib does its thing? like things will only work if you assign them to a variable

seems like an odd design

simple crag Jan 23, 2019, 2:09 PM

#

?

night fulcrum Jan 23, 2019, 11:30 PM

#

hello, does somebody here have experience with openai gym?
i have a question about it, i cant run ale anymore because my ubuntu doesnt start anymore since i've installed my new gpu,
if somebody knows, does the space invaders environment give the current score as score or the delta between the last step?

lapis sequoia Jan 23, 2019, 11:33 PM

#

is this related to python

night fulcrum Jan 23, 2019, 11:40 PM

#

yes

#

openai gym, tensorflow and keras are all in python

#

@lapis sequoia so?

lapis sequoia Jan 23, 2019, 11:57 PM

#

you dont have to ping me

ripe lava Jan 24, 2019, 4:54 AM

#

Hi guys! I am trying to figure out how to select candidate features to feed into a gradient boosted tree model (I am using XGBoost). Blindly feeding all my candidate variables doesn't do the trick as it significantly increases the search space and seems to deteriorate the validation score compared to a carefully selected list of candidate features. My idea was to train the model on all potential features and then make a selection based on feature importance (i.e. discard all the variables that don't contribute much). What do you guys think of this approach? (asking here because somehow I can't find anything online on that topic)
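
The "fit on everything, then keep the important features" idea can be sketched with scikit-learn's `SelectFromModel`. This is only a rough illustration under stated assumptions: `GradientBoostingRegressor` stands in for XGBoost, the data is synthetic, and the `"median"` threshold is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 candidate features, only the first 3 carry signal
# (shuffle=False keeps the informative columns first).
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=3,
    noise=1.0, shuffle=False, random_state=0,
)

# Fit on all candidates, then keep the features whose importance is at
# least the median importance. The threshold is a tunable assumption.
selector = SelectFromModel(
    GradientBoostingRegressor(random_state=0), threshold="median"
)
selector.fit(X, y)

mask = selector.get_support()        # boolean mask over the 10 columns
X_selected = selector.transform(X)   # original columns, not transformed

print("kept columns:", np.flatnonzero(mask))
```

Unlike PCA, this keeps the original variables untouched. The selection is probably best judged on held-out data, since importances computed on the training set can flatter noisy features.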

agile epoch Jan 24, 2019, 5:57 AM

#

https://en.m.wikipedia.org/wiki/Principal_component_analysis

Principal component analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated vari...

#

I'm not super sure what you're talking about but it kinda sounds like pca

ripe lava Jan 24, 2019, 7:59 AM

#

PCA transforms the variables which I would like to avoid.

languid adder Jan 24, 2019, 1:48 PM

#

I'm trying to plot a groupby on a pandas DF with seaborn but i notice i have to use the .head(n=1000) to get it to work. If I omit the .head() I get an error saying the groupBySeries isn't callable. Here's the full code:

names = ["Sepal L","Sepal W","Petal L","Petal W","Class"]
irises = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data",names=names)
df = irises.groupby("Class")
sns.countplot(x=df["Class"].head(1000),label="Count")
plt.show()

agile epoch Jan 24, 2019, 4:40 PM

#

irises.groupby("Class") is a pandas group by object not a dataframe. You need to apply something to it to get a proper dataframe back. Apparently head works which I did not know. Try .head(len(df)) lol

late garnet Jan 24, 2019, 6:31 PM

#

@languid adder A count plot by nature will automatically do what you are trying to do with pandas; count the number of classes. You can simply do this.

sns.countplot(x='Class', data=irises)
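
For reference, a small sketch of why the groupby result could not be plotted directly: it is a lazy GroupBy object until an aggregation is applied. The tiny frame below is a made-up stand-in for the iris data, not the real file.

```python
import pandas as pd

# Small stand-in for the iris data; only the "Class" column matters here.
irises = pd.DataFrame({
    "Sepal L": [5.1, 4.9, 7.0, 6.4, 6.3],
    "Class": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

grouped = irises.groupby("Class")   # a DataFrameGroupBy, not a DataFrame
print(type(grouped).__name__)

counts = grouped.size()             # Series: number of rows per class
print(counts)

# The same numbers without an explicit groupby.
same_counts = irises["Class"].value_counts().sort_index()
```

`sns.countplot(x='Class', data=irises)` does this counting internally, which is why the groupby step is not needed for the plot.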

agile epoch Jan 24, 2019, 9:06 PM

#

That's so weird that .head() doesn't error that

#

Is that a bug?

languid adder Jan 25, 2019, 7:54 AM

#

ah great @late garnet that's what I was after

#

.head is in the official documentation of the group by object so that's why I tried it. That's also why I was confused I couldn't work with it like a DF

#

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.head.html

languid adder Jan 26, 2019, 10:29 AM

#

I'm trying to clean up some data as was wondering if this is a good way or if there is a better way
I want to create a new column in my DF that is based upon 2 other column:

def splitIntoName(value):
    arr = value.split(", ")
    if len(arr) > 1:
        return arr[1]
    return None
df["Name"] = pd.Series([splitIntoName(x) for x in df.description.dropna()])
df["Name"].fillna(df["OtherName"],inplace=True)

spark nimbus Jan 26, 2019, 1:56 PM

#

with numpy array of floats x, and threshold float f, how would I make x the same array as it was before, but with any value below f turned to 0?

desert oar Jan 26, 2019, 4:03 PM

#

@languid adder that seems like a reasonable way to go

#

err wait no

#

df["Name"] = pd.Series([splitIntoName(x) for x in df.description.dropna()])

this line is bad, dont do this

#

df["Name"] = df["description"].map(splitIntoName, na_action='ignore')

do this instead

#

also dont use inplace

#

so do this:

df["Name"] = df["description"] \
    .map(splitIntoName, na_action='ignore') \
    .fillna(df["OtherName"])

#

@spark nimbus you arent allowed to think when you write python code, otherwise you think too hard and ask silly questions ;)

x[x < f] = 0

spark nimbus Jan 26, 2019, 4:09 PM

#

that actually works?

desert oar Jan 26, 2019, 4:09 PM

#

why wouldnt it

#

have you read the numpy subsetting docs?

#

which are admittedly not easy to read

#

https://docs.scipy.org/doc/numpy/user/basics.indexing.html#assignment-vs-referencing

#

Most of the following examples show the use of indexing when referencing data in an array. The examples work just as well when assigning to an array. See the section at the end for specific examples and explanations on how assignments work.

#

https://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays

#

b = y > 20
y[b]

etc
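
A minimal sketch of the mask assignment discussed above, with made-up values for `x` and `f`. The `np.where` line is an addition that was not mentioned in the chat; it shows a variant that leaves the original array untouched.

```python
import numpy as np

x = np.array([0.2, 1.5, 0.7, 3.0])
f = 1.0

# Non-mutating variant: build a new array and leave x as it is.
clipped_copy = np.where(x < f, 0.0, x)

# In-place variant from the chat: the boolean mask x < f selects the
# elements that get overwritten.
x[x < f] = 0

print(x)
print(clipped_copy)
```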

languid adder Jan 26, 2019, 4:16 PM

#

@desert oar thanks. I'll dive deeper into the map function as I seem to do a lot of these types of mappings so it might be more useful to use .map instead of the construct with pd.Series([])

desert oar Jan 26, 2019, 4:16 PM

#

@languid adder the bigger problem with your code is the .dropna() will make the whole thing misaligned

#

but yeah .map and .apply are there for a reason. use them

languid adder Jan 26, 2019, 4:17 PM

#

ah yes the indexes are off.

#

didn't think about that...

desert oar Jan 26, 2019, 4:18 PM

#

like you could do...

pd.Series([foo(x) if pd.notnull(x) else x for x in df['y']], index=df.index)

or

df['y'].map(foo, na_action='ignore')

i know which one i prefer 😉
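
To make the misalignment concrete, here is a small sketch with invented rows. `split_into_name` is a renamed copy of the function from the question; the data is hypothetical.

```python
import pandas as pd

def split_into_name(value):
    parts = value.split(", ")
    return parts[1] if len(parts) > 1 else None

df = pd.DataFrame({
    "description": ["Smith, Anna", None, "Jones, Bob"],
    "OtherName": ["a", "Carol", "b"],
})

# Problematic: dropna() shortens the input and the new Series gets a
# fresh 0..n-1 index, so "Bob" ends up on row 1 instead of row 2.
misaligned = pd.Series(
    [split_into_name(x) for x in df["description"].dropna()]
)

# Aligned: map keeps the original index and skips missing values,
# which fillna then replaces from the other column.
aligned = (
    df["description"]
    .map(split_into_name, na_action="ignore")
    .fillna(df["OtherName"])
)

print(misaligned)
print(aligned)
```

Assigning `misaligned` to `df["Name"]` would attach "Bob" to the row with the missing description, which is the index problem pointed out above.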

languid adder Jan 26, 2019, 4:18 PM

#

yes the latter is more readable

worldly sigil Jan 27, 2019, 4:56 AM

#

How do you deploy your models? I’ve been experimenting with Flask. Looking to see what others are doing.

placid snow Jan 27, 2019, 11:58 AM

#

Seems like something that belongs in #web-development ?

hexed juniper Jan 27, 2019, 2:20 PM

#

so this is a bit of a "meta-data-science" question so i hope people wont jump at me 😃 I want to support a small open source data science community and look for the best team collaboration tool (or combination of tools - chat, file sharing, planning etc). do people have some recommendation? I guess discord is not sufficient for that.

south quest Jan 27, 2019, 2:26 PM

#

We aren't a data science server but I can answer as a large Python community:
Chat: Discord, GitHub Discussions
Collaboration: GitHub mainly, for some internal things we do use Dropbox as well
Planning: We use GitHub Issues and Projects (Projects are very similar to trello but link very well to GitHub repos & issues)

hexed juniper Jan 27, 2019, 2:39 PM

#

thanks @south quest thats quite reassuring that this combination works well (I'd like a small number of tools)

south quest Jan 27, 2019, 2:40 PM

#

Yeah, we don't really have a huge number of tools and it has served us well since our move to GitHub. We now use Azure for CI, GitHub for basically all project management and Discord for development chat

#

discord has some nice github webhooks so it all integrates nicely

hexed juniper Jan 27, 2019, 2:53 PM

#

I am new to discord but I quite like what I see so far, esp the python server looks super well organized so I'll be copying some best practices 😃

desert oar Jan 27, 2019, 6:32 PM

#

@worldly sigil flask works well for my team but i have built up quite a bit of additional structure around it over time

#

If you are just receiving and serving JSON any half decent web framework will do. I like flask because it has been around for a while so it is fairly well tested in the field, but it's also generally simple to just get up and running
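
For the "receiving and serving JSON" case, a bare-bones Flask endpoint might look like the sketch below. Everything model-related is an assumption: `predict` is a hypothetical placeholder for a trained model, and the `/predict` route and `features` key are invented for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Hypothetical stand-in for a trained model's predict method.
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(silent=True)  # None if the body is not valid JSON
    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list) or not features:
        return jsonify({"error": "expected a non-empty 'features' list"}), 400
    return jsonify({"prediction": predict(features)})

# Exercise the endpoint without starting a server.
with app.test_client() as client:
    response = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
    print(response.status_code, response.get_json())
```

In a real deployment the model would typically be loaded once at startup rather than per request, and the app would sit behind a production WSGI server.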

worldly sigil Jan 27, 2019, 6:44 PM

#

@desert oar that's awesome to hear. Yeah, right now it's just receiving and serving JSON so I'll keep it simple til I need something more tailored to my scenario. thanks for the insight.

hexed juniper Jan 27, 2019, 10:24 PM

#

@worldly sigil there is the lean way of flask and the fat way of django+drf. the good news is that once you adopt a rest/json architecture its really easy to switch

small ore Jan 27, 2019, 10:27 PM

#

Noob question. Why do you guys use flask/json for data-science?

earnest prawn Jan 27, 2019, 10:59 PM

#

It's about deploying models to the real world like when you built an auto complete model you want to provide it as a service somehow

languid adder Jan 28, 2019, 10:17 AM

#

how can you improve resource utilization when optimizing or training a model.

#

For example i have following code:

from sklearn.model_selection import GridSearchCV
paramGrid = [
    {"n_estimators":[3,10,30],"max_features":[2,4,6,8]},
    {"bootstrap":[False],"n_estimators":[3,10],"max_features":[2,3,4]}
]
forestReg = RandomForestRegressor()
gridSearch = GridSearchCV(forestReg,paramGrid,cv=5,scoring="neg_mean_squared_error")
gridSearch.fit(housing_prep,housingLabels)

#

this runs a long time as it needs to do a lot of iterations however I notice my CPU never goes over 10%

#

so I was wondering how I can give it more resources so it's done quicker

languid adder Jan 28, 2019, 12:28 PM

#

found out the n_jobs parameter does just what I wanted. Much quicker now 😃
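
A sketch of the same grid search with `n_jobs` set, using synthetic data in place of `housing_prep` and `housingLabels` (which are not shown in the chat) and a slightly smaller grid so it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for housing_prep / housingLabels.
X, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=42)

param_grid = [
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3]},
]

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,  # run the cross-validation fits on all available cores
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

With the default `n_jobs=None` the fits run one at a time, which would match the low CPU usage described above.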

languid adder Jan 29, 2019, 8:16 AM

#

another question about performance...
I have a 6 core i7 processor. Is it best to use hyperthreading or not? At the moment it's disabled so I have 6 threads but was wondering if enabling hyperthreading would improve performance?

lapis sequoia Jan 29, 2019, 9:37 AM

#

it's better to run it on the cloud :v

languid adder Jan 29, 2019, 10:29 AM

#

$$$ 😛

violet bison Jan 29, 2019, 10:50 AM

#

Hello chaps! I'd like a few opinions on something if you can: I have rather heavy CSVs (between 200MB and above 2GB) that come from a cloud provider and contain the billing information. How would you go about storing this information and extracting data from it? Put it into a postgres DB and run custom python code alongside? Send it to an ELK stack? Something else? The idea would be to get the biggest cost centers and see what I can do with that.

chilly shuttle Jan 29, 2019, 11:10 AM

#

how many of those CSVs do you have that you need to coalesce?

#

if they total more than your memory, your options are spark, dask, or some out of memory dbms

#

as far as out of memory dbms go, I've been in love with Clickhouse recently so i'll shill that

worldly sigil Jan 29, 2019, 12:55 PM

#

@languid adder have you considered using Dask added onto your Scikit-Learn code? You seem like you're digging for better performance, and Dask (or Pyspark) are a pretty fantastic blitz of performance.

languid adder Jan 29, 2019, 1:02 PM

#

I haven't. Working my way through Hands on machine learning with scikit-learn and TF at the moment

#

will have a look at it afterwards, thanks for the tip!

lapis sequoia Jan 29, 2019, 1:35 PM

#

@violet bison bigquery

#

why do people complicate things so much.. use something that doesn't have you running around trying to find more resources everytime.. focus on the objective

violet bison Jan 29, 2019, 1:40 PM

#

thank you chaps, I'm in the EU and personal google cloud accounts cannot be done, also, I have 20 GB of ram, it's cool that all the CSVs should fit into it, but thanks for the heads up

#

ooh snap GCP private accounts works now...

desert oar Jan 29, 2019, 3:43 PM

#

nothing wrong with postgres

#

like why even bother loading 20 CSVs into memory if you just wanna do some queries and aggregates

#

spark? why

#

apt-get install postgresql, youre up and running in 10 minutes
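
The "load it into a database and run aggregates" workflow can be sketched as follows. Note the substitution: this uses Python's built-in `sqlite3` rather than Postgres so that it runs without a server, and the column names (`cost_center`, `service`, `cost`) are hypothetical, since the real billing schema is not shown.

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-in for a billing export; columns are hypothetical.
billing_csv = io.StringIO(
    "cost_center,service,cost\n"
    "team-a,compute,120.50\n"
    "team-b,storage,15.25\n"
    "team-a,storage,30.00\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE billing (cost_center TEXT, service TEXT, cost REAL)")

# Stream rows from the CSV reader into the table; the file is never
# held in memory as a whole.
reader = csv.DictReader(billing_csv)
conn.executemany(
    "INSERT INTO billing VALUES (:cost_center, :service, :cost)", reader
)

# Biggest cost centers first.
top_cost_centers = conn.execute(
    "SELECT cost_center, SUM(cost) AS total "
    "FROM billing GROUP BY cost_center ORDER BY total DESC"
).fetchall()
print(top_cost_centers)
```

With Postgres the SQL would look much the same, and bulk loading would more likely go through its `COPY` command than row-by-row inserts.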

#

@violet bison

#

grab a 1bn row sample and analyze it in R using data.table

#

can do that on a laptop

#

source: i have done it on a laptop

violet bison Jan 29, 2019, 4:05 PM

#

I don't know R that's the thing, but I just wanted a few graphs and a tool capable of ingesting csv

carmine lava Jan 29, 2019, 4:09 PM

#

Any one using Intel movidius

languid adder Jan 29, 2019, 4:55 PM

#

is R so much better than python for DS and ML?

desert oar Jan 29, 2019, 5:01 PM

#

For machine learning specifically, python is better because it's more of a general purpose programming language

#

For data science in general, R is a great tool to have at your disposal

#

Also machine learning libraries tend to be written with python in mind nowadays

#

R is better for many kinds of statistical analysis

#

"Better" being my subjective opinion

#

The R data.table library is ridiculously efficient at large in memory data processing

#

I haven't tried a comparable task with pandas, but im not sure i trust it for that purpose

languid adder Jan 29, 2019, 5:34 PM

#

ok so that makes sense 😃
so for someone starting out in the ML/DS landscape, it is worth having both tools at their disposal?

serene veldt Jan 29, 2019, 5:40 PM

#

sorry to bump in

#

anyone has any idea how to make an efficient RMSE in torch?

#

there are some version but its for loss functions

#

im just trying to get the RMSE from two tensors

desert oar Jan 29, 2019, 6:56 PM

#

@languid adder learn one, then see about picking up the other once youre comfortable with one

#

@serene veldt you wanna use RMSE as an input to a layer?

serene veldt Jan 29, 2019, 7:20 PM

#

nop, its not for networks

#

i have 2 tensors

#

1 expected values 1 results

#

wants the rmse of that

#

there will be no propagation whatsoever of that value

#

@desert oar
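
For the plain "RMSE of two tensors" case, a minimal sketch using only basic tensor operations could look like this. The tensor values are made up, and `torch.no_grad()` is optional; it is included only because no gradient is needed here.

```python
import torch

def rmse(expected: torch.Tensor, results: torch.Tensor) -> torch.Tensor:
    """Root mean squared error between two same-shaped tensors."""
    return torch.sqrt(torch.mean((expected - results) ** 2))

expected = torch.tensor([1.0, 2.0, 3.0])
results = torch.tensor([1.0, 2.0, 5.0])

# no_grad skips autograd bookkeeping, since nothing is backpropagated.
with torch.no_grad():
    value = rmse(expected, results).item()
print(value)
```

The whole computation is vectorized, so it runs on whatever device the two tensors live on without a Python loop.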

lapis sequoia Jan 30, 2019, 2:42 AM

#

@languid adder R is for people in finance.. it's better for small data.. and time series

#

it sucks for network analysis in particular

#

plus.. R studio costs a few hundred dollars for licensing

lyric canopy Jan 30, 2019, 6:50 AM

#

I think R is used the most in academic statistics

#

It has many more packages available for that than Python has

#

More and more are being published in Python and there's a trend to also publish one for Python these days

chilly shuttle Jan 30, 2019, 8:13 AM

#

R is for people who can't code

#

R is less and less in favour outside academia because it's a lot of work (sometimes rewrite in python) to productionise

#

https://editor.aifiddle.io

languid adder Jan 30, 2019, 8:28 AM

#

thx for the info guys. I ordered "An introduction to statistical learning" which uses R to show the examples. I wasn't planning on learning R and my intention was to code the examples in python instead so I'll stick with that plan

#

FYI I'm working for a major software company and our team needs to focus on DS/ML more so that's why I'm retraining myself...

lyric canopy Jan 30, 2019, 8:29 AM

#

I disagree with the notion of @chilly shuttle I feel it's too much gatekeeping "programming"

#

I don't care for such exclusive sentiments

#

R is fine as a tool and a lot of people use it

#

It's just not the right tool for every job

chilly shuttle Jan 30, 2019, 8:51 AM

#

eh, I stated the business reality

#

if you're in academia R is fine

#

as for commercial settings, I increasingly shy away from taking on data scientists that don't know anything except R

lyric canopy Jan 30, 2019, 8:52 AM

#

R is widely used here in commercial settings as well, so I don't think it's a business reality

chilly shuttle Jan 30, 2019, 8:52 AM

#

how do you productionise your R code?

lyric canopy Jan 30, 2019, 8:52 AM

#

It's one of the main requirements on job advertisements for professional statistics here.

#

That's usually not the main goal of R

chilly shuttle Jan 30, 2019, 8:52 AM

#

correct.

#

to productionise R code often means to re-implement it in python

lyric canopy Jan 30, 2019, 8:53 AM

#

As I said, it's not the right tool for every job

chilly shuttle Jan 30, 2019, 8:53 AM

#

why take on that burden in a commercial setting when you can take on data scientists who can do both in one shot?

lyric canopy Jan 30, 2019, 8:53 AM

#

Maybe I don't have a software development application in mind, but that doesn't mean that "R is for people who can't program"

#

Because not every project is a software development project

#

I know it's easy to only see the big data and machine learning aspect of it, but there's a whole world of people whose job it is to research rather than develop

#

For that, R is a very nice tool and it offers much more tools than Python does

polar acorn Jan 30, 2019, 8:55 AM

#

Also Rstudio has a free version that works fine

lyric canopy Jan 30, 2019, 8:55 AM

#

A lot of cutting-edge statistical models have no implementation in Python at the moment

chilly shuttle Jan 30, 2019, 8:55 AM

#

like what

lyric canopy Jan 30, 2019, 8:58 AM

#

New developments in regression trees, several multidimensional scaling techniques, and most of the other models currently being developed in statistical science (as opposed to machine learning), as a lot of those researchers publish their packages in R. Some have started to switch to Python, but there's still a distinction between smaller datasets (like used in statistical learning) and larger dataset (more machine learning perspective)

chilly shuttle Jan 30, 2019, 8:58 AM

#

so.. all academia?

#

we are pretty much in agreement then

lyric canopy Jan 30, 2019, 8:59 AM

#

Now you purposefully misrepresent what I said.

#

That those models are developed in academia doesn't mean that they're exclusively used in it

#

My research department does a lot of consultancy work for external organizations that use those models

chilly shuttle Jan 30, 2019, 9:01 AM

#

i'm seriously curious, can you provide one anecdote?

lyric canopy Jan 30, 2019, 9:01 AM

#

Sure

chilly shuttle Jan 30, 2019, 9:02 AM

#

we've had to do non-bread and butter stuff maybe twice last year for external clients, and we're one of the biggest consultancies that exist. But I can imagine more specialized consultancies picking up the more exotic stuff

lyric canopy Jan 30, 2019, 9:04 AM

#

One of our students recently started an internship with the Dutch Ministry of Social Affairs and Employment to use those statistical learning models to predict which companies should be inspected for "employing" illegal immigrants/victims of human trafficking.

#

All of that is done in R

chilly shuttle Jan 30, 2019, 9:05 AM

#

and they couldn't address it with ML because?

lyric canopy Jan 30, 2019, 9:06 AM

#

It's basically the scale of the data you have; the datasets are smaller than what you'd normally use for ML

#

The distinction is a bit vague, though

chilly shuttle Jan 30, 2019, 9:06 AM

#

i should rephrase that as, 'they couldn't use existing techniques because?'

lyric canopy Jan 30, 2019, 9:07 AM

#

Because techniques evolve constantly. It's basically still in its infancy

chilly shuttle Jan 30, 2019, 9:08 AM

#

but that sounds like academia

lyric canopy Jan 30, 2019, 9:08 AM

#

The development of the techniques is, but the application of the techniques that stem from it is not

chilly shuttle Jan 30, 2019, 9:09 AM

#

and I'd be willing to take a wager that this student is a stats/similar academic who isn't CS proficient at programming?

lyric canopy Jan 30, 2019, 9:09 AM

#

The student isn't, but the people who currently run the project are (and who are currently already using R for it)

chilly shuttle Jan 30, 2019, 9:10 AM

#

I see

lyric canopy Jan 30, 2019, 9:11 AM

#

Now, honestly, I would really like to see the use of Python spread as that would eliminate the delay in the spread of new techniques

#

Because, as you said, R is academia focussed

#

But, it's not just for people who can't program

chilly shuttle Jan 30, 2019, 9:12 AM

#

anyway, my observations from fairly high up at a big4 consultancy:
we hire less and less data scientists who can't write code (i.e. R only)

r&d type work that can be done by pure stats folk is tiny in volume compared to commodity data platform deployments and the like

data science is increasingly becoming commoditised where creating a complex model is now a couple of clicks in a GUI. What isn't point and click at this point is data engineering

lyric canopy Jan 30, 2019, 9:14 AM

#

Now, another point is whether the use of R in academia is a good thing

#

I'd rather that we all switch to Python

chilly shuttle Jan 30, 2019, 9:15 AM

#

that is happening at the university i'm affiliated with (as part of a broader theme to add software engineering concepts to pretty much every discipline). Not sure how widespread that movement is though

lyric canopy Jan 30, 2019, 9:15 AM

#

As the stuff we do in R isn't less complex than what you'd have to write in Python, but R is not a very nice language (IMO) and Python is a general programming language.

#

The biggest issue is that there's a lot that's only implemented in R at the moment

#

The newest techniques for Multiple Imputation have very nice implementations in R (e.g., MICE is one of those packages), but support in Python is incomplete

chilly shuttle Jan 30, 2019, 9:17 AM

#

yeah I'm not disagreeing with that

#

there's a clear trend of <thing> gets invented, reference implementation in R, stable python implementation 1-2 years down the track

#

it's just not something with a huge job market

lyric canopy Jan 30, 2019, 9:18 AM

#

But, as you say, I'm a lot more focussed on applying research than to software development

#

Yeah, probably

#

Still, we have about 20 master students a year and they all receive multiple job offers before they even graduate

#

So, I guess the supply part is also relatively small

chilly shuttle Jan 30, 2019, 9:19 AM

#

and some of them will be picked up by employers like mine just to pad out the credentials on proposals

#

and never do a day of enjoyable work in their life

#

i die a little every time i see a PhD stats person being put on a project to do fucking.. excel mangling

lyric canopy Jan 30, 2019, 9:22 AM

#

I guess here most students are gobbled up after they finish the master of science and before a PhD

#

I can probably dig up a list with the kind of projects I'm talking about if you like

#

But I need to do some work first

#

(and I have to check which projects are confidential first)

chilly shuttle Jan 30, 2019, 9:24 AM

#

its ok they sound like the kind of r&d projects we do once or twice a year

#

last one was modelling road network utilisation after a transition to self-driving cars, back when people still thought self-driving cars are about to take over the world

lyric canopy Jan 30, 2019, 9:29 AM

#

Most of the stuff we do is smaller scale and there's a lot of (semi-) public sector work (if that term makes sense in English)

#

It's not really related to what I do, though, so I have to look up which projects we have at the moment

lapis sequoia Jan 30, 2019, 9:30 AM

#

R studio does have a free version..but if you work at a company, you still have to fork over $ for a license..

#

R was nice.. manageable.. not for large data definitely.. I had to drop it because my python skills are very domain specific and being the dumbass that I am I didn't want to confuse syntax and lose progress v.v

chilly shuttle Jan 30, 2019, 9:31 AM

#

you also have to fork over $ to monitor compliance for said licensing

lyric canopy Jan 30, 2019, 9:39 AM

#

You can use the open-source RStudio (license AGPL v3) for commercial purposes without the commercial license.

chilly shuttle Jan 30, 2019, 9:42 AM

#

soooort of

#

you can use it as an individual or a small company for free

#

as a large company, the risk associated with taking on foss for critical work far outweighs the cost of paying the commercial licensing

#

the main reason foss is shied away from in large enterprise is because when something goes wrong, you want to have a support contract with the vendor so you can blame/sue them

lapis sequoia Jan 30, 2019, 9:50 AM

#

Oui

#

is why people use R in other environments

slate rock Jan 30, 2019, 11:26 AM

#

R was my first and only programming language, so when I think about writing a script my mind goes to R. Is there any book, video, tutorial, whatever that I can use to grasp the Python's way?

lapis sequoia Jan 30, 2019, 11:30 AM

#

https://jakevdp.github.io/PythonDataScienceHandbook/

#

I want this book.. but it's not released yet.. these guys seem to be taking forever

#

https://www.amazon.com/Deep-Learning-Text-Approach-Processing/dp/1491984414

chilly shuttle Jan 30, 2019, 3:29 PM

#

R is a programming language in only the loosest meaning of the term. Investing the effort to learn python or lua or c pays huge dividends because every subsequent programming language is quite familiar

languid adder Jan 30, 2019, 6:48 PM

#

i found a machine learning course on coursera but it used python 2 and graphlab... is that still worth following? Considering the examples aren't really focussed on python 3 and scikit learn or tensorflow?

small ore Jan 30, 2019, 6:58 PM

#

What are your objectives? Learn what is behind ML or to learn the packages?

languid adder Jan 30, 2019, 6:59 PM

#

both actually... If a course is using outdated packages (no commits over 2 years) then I assume the course might be outdated as well?

small ore Jan 30, 2019, 7:03 PM

#

The one by Andrew Ng on Coursera uses Matlab/Octave. It is just about what is behind ML but uses Matlab/Octave to do exercises and assignments. These are easier than python in some ways, and it is enough to know what goes in and what comes out and infer from it. There is a more mathematical approach in a Columbia Univ course on edX. It does not use any tool. See pin for a link. There is another edX course (I think from Microsoft) which teaches you to do ML using Python (uses Jupyterlabs on Azure). It uses basic libs like pandas, matplotlib and seaborn

#

And scikit-learn

languid adder Jan 30, 2019, 7:05 PM

#

yeah I noticed the course from Andrew Ng (founder of Coursera) but was looking into a python course so I could combine the theory + practical knowledge at the same time

small ore Jan 30, 2019, 7:06 PM

#

I suggest taking two different courses for these. The ones that teach the packages don't cover what goes on behind the scenes or why you are doing what

#

Or at least does not do it well

languid adder Jan 30, 2019, 7:06 PM

#

so which one do you suggest for the theory?

#

Andrew Ng covers more neural networks (deep learning) and doesn't seem to cover things like linear regression, classification and so on

small ore Jan 30, 2019, 7:07 PM

#

Andrew Ng starts with linear regression and classification and then goes on to neural networks( Which is again a way of solving the regression/classification problem)

languid adder Jan 30, 2019, 7:08 PM

#

ah then I must be looking at a different course

small ore Jan 30, 2019, 7:09 PM

#

If you want a more involved mathematical approach to get the intuitions behind each method, the course from Columbia is good. I suggest going with that course if you can manage to do some reading on the side from a standard text.

languid adder Jan 30, 2019, 7:11 PM

#

link to the course? Can't seem to find it

small ore Jan 30, 2019, 7:14 PM

#

Which one?

languid adder Jan 30, 2019, 7:14 PM

#

the one from columbia

small ore Jan 30, 2019, 7:14 PM

#

It is in the pins 📌

#

Top right of you screen where you mute

small ore Jan 31, 2019, 6:01 AM

#

afaik this aint the place to ask abt regex

willow siren Jan 31, 2019, 6:01 AM

#

Which channel?

small ore Jan 31, 2019, 6:01 AM

#

One of the help channels please

willow siren Jan 31, 2019, 6:01 AM

#

Alright, thank you

lapis sequoia Feb 1, 2019, 3:54 AM

#

Any A.I or ML server on discord??

languid adder Feb 1, 2019, 9:53 AM

#

think you'll have more luck on slack for that: https://towardsdatascience.com/15-data-science-slack-communities-to-join-8fac301bd6ce

Towards Data Science

15 Data Science Slack Communities to Join – Towards Data Science

Reach out in Slack to level up in your career

lapis sequoia Feb 1, 2019, 9:53 AM

#

I dunno

#

always found slack annoying

languid adder Feb 1, 2019, 9:56 AM

#

i don't find much differences between discord and slack... I mainly use discord for hobby specific things and slack for professional things 😛

storm monolith Feb 1, 2019, 4:15 PM

#

Hello, I have this numpy array :np_array = np.array([1, 2, 3]) I was wondering what is the most performance efficient way of obtaining the int 123 from it ?

charred tinsel Feb 1, 2019, 4:20 PM

#

np_array[0] i think. If you want the 1

#

Im not that experienced but thats what I learned so far

placid snow Feb 1, 2019, 4:48 PM

#

"".join(map(str, np.array([1, 2, 3]))) perhaps?

#

You could always try the performance difference between options with the timeit module
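
A sketch of that `timeit` comparison, with one alternative added for contrast. `via_powers` is not from the chat and assumes every element is a single digit from 0 to 9.

```python
import timeit
import numpy as np

np_array = np.array([1, 2, 3])

def via_string(arr):
    # Join the digits as text, then parse the result as an int.
    return int("".join(map(str, arr)))

def via_powers(arr):
    # 1*100 + 2*10 + 3*1, assuming every element is a single digit 0-9.
    powers = 10 ** np.arange(len(arr) - 1, -1, -1)
    return int(arr @ powers)

for fn in (via_string, via_powers):
    seconds = timeit.timeit(lambda: fn(np_array), number=10_000)
    print(f"{fn.__name__}: {fn(np_array)} in {seconds:.4f}s")
```

Which one wins probably depends on array length, so it is worth timing with arrays of realistic size.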

storm monolith Feb 1, 2019, 7:44 PM

#

I am trying to create an empty pandas df and later fill in the values, but the performance isnt great - upwards of 100 ms for 35 rows over 3 columns...

#

Can you help me set up the dtypes for it ? maybe thats the problem

#

pd.DataFrame(index=range(0, 35), columns=['MMID', 'Price', 'Size'], dtype={'MMID': object, 'Price': np.float64, 'Size': np.int64})

#

I keep getting TypeError: data type not understood

#

in MMID column i want to store strings, Price - floats and ints inside Size

half basalt Feb 2, 2019, 6:12 AM

#

I believe you can only specify one sort of dtype and not a dtype for every column

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame

chilly shuttle Feb 2, 2019, 8:10 AM

#

You can totally specify a dtype per column

languid adder Feb 2, 2019, 8:48 AM

#

yeah plenty of DF have different dtype per column

undone spoke Feb 2, 2019, 10:36 AM

#

any #api / requests question: I know how to set a ?per_page= , but I can't figure out how to ask an api what its maximum ?per_page - anybody? Thanks!

half basalt Feb 2, 2019, 12:32 PM

#

"Parameters:

dtype : dtype, default None

Data type to force. Only a single dtype is allowed. If None, infer

copy : boolean, default False

Copy data from inputs. Only affects DataFrame / 2d ndarray input

"

half basalt Feb 2, 2019, 12:55 PM

#

Nevermind...what I said before that was wrong
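
To reconcile the two views: the `dtype` argument of the `DataFrame` constructor takes a single dtype, but per-column dtypes are still possible by passing one typed array per column or by calling `astype` with a dict. A minimal sketch, with placeholder values filling the preallocated frame:

```python
import numpy as np
import pandas as pd

n_rows = 35

# One preallocated array per column, each created with its own dtype.
df = pd.DataFrame({
    "MMID": np.full(n_rows, "", dtype=object),    # strings
    "Price": np.zeros(n_rows, dtype=np.float64),
    "Size": np.zeros(n_rows, dtype=np.int64),
})
print(df.dtypes)

# An existing frame can be converted column by column with astype.
converted = pd.DataFrame({"Price": ["1.5", "2.0"], "Size": ["3", "4"]}).astype(
    {"Price": np.float64, "Size": np.int64}
)
print(converted.dtypes)
```

Zeros are used rather than an empty frame because an integer column cannot hold the NaN values that an index-only constructor fills in. If speed matters, collecting the rows in plain lists and building the frame once at the end is usually faster than filling cells one by one.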

languid adder Feb 2, 2019, 9:31 PM

#

i'm trying to apply a OneHotEncoder but fail to see how I add the encoded values to my dataset

#

at the moment I have this:

ms = df.MSZoning
encoded, cats = ms.factorize()
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder(categories ="auto",sparse=False)
msOneHot = ohc.fit_transform(encoded.reshape(-1,1))

#

msOneHot now contains the columns with the binary values so I just need to add it to the dataset

#

I don't understand why the fit_transform doesn't add these automatically to my dataset?

languid adder Feb 2, 2019, 9:58 PM

#

lol... just realized the get_dummies does exactly that:

df = pd.concat([df,pd.get_dummies(df.MSZoning, prefix='zoning')],axis=1)

fervent solar Feb 3, 2019, 3:45 PM

#

WHY USE INPLACE ? AND WHY SUBSET["PRICE "] WORKS??

📎 Screenshot_131.png

desert oar Feb 3, 2019, 3:46 PM

#

Did you try reading the docs..?

fervent solar Feb 3, 2019, 3:47 PM

#

yes i got inplace from there but still don't understand why used subset['price']

placid snow Feb 3, 2019, 3:52 PM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
Which columns to consider if you're dropping rows for instance
and inplace means that it modifies the df, and doesnt return a new one
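
A small sketch of both parameters. The frame and its `price` column are invented here, since the original screenshot is not available.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [100.0, np.nan, 250.0],
    "color": ["red", "blue", None],
})

# subset=["price"]: only a missing price drops the row; the missing
# color in the last row is ignored.
cleaned = df.dropna(subset=["price"])
print(cleaned)

# inplace=True would instead modify df itself and return None:
# df.dropna(subset=["price"], inplace=True)
```

Without `subset`, the last row would be dropped too, because its `color` is missing.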

fervent solar Feb 3, 2019, 4:08 PM

#

what is the good way to read documentation ?

#

and when to read it ?

placid snow Feb 3, 2019, 4:09 PM

#

Just comes with reading many, and experimenting with the meaning of them. You read them when you dont know what something does

languid adder Feb 3, 2019, 4:48 PM

#

how do I make sure I split my train and test data so my test data contains at least one of each categorical data.
At the moment I believe this isn't the case because I use OneHotEncoder and I have less features on my test data than on my training data...
I use this to split it up at the moment:

from sklearn.model_selection import train_test_split
trainSet, testSet = train_test_split(housing,test_size=0.2,random_state=42)

#

or... how can I add the missing columns with all 0 values?
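
One way to add the missing columns, sketched with pandas only. The frames and the `zoning` prefix are invented; the idea is to force the test frame onto the training frame's columns:

```python
import pandas as pd

# Invented example: the test split happens to lack the "FV" category.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})
test = pd.DataFrame({"MSZoning": ["RM", "RL"]})

train_d = pd.get_dummies(train, columns=["MSZoning"], prefix="zoning")
test_d = pd.get_dummies(test, columns=["MSZoning"], prefix="zoning")

# Reindex onto the training columns: columns missing from the test frame are
# created and filled with 0, and columns that exist only in the test frame
# are dropped.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(test_d)
```

The same `reindex` call works for an unlabelled prediction set, as long as the training columns are the reference.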

fervent solar Feb 3, 2019, 6:05 PM

#

why this error ?

📎 Screenshot_132.png

polar acorn Feb 3, 2019, 6:09 PM

#

@languid adder check this answer out
https://stackoverflow.com/questions/37425961/dummy-variables-when-not-all-categories-are-present

Stack Overflow

Dummy variables when not all categories are present

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

What happens is ...

languid adder Feb 3, 2019, 6:10 PM

#

cool thanks. Meanwhile I built my own code to fill up the gaps by comparing the columns.

#

Always good to see other solutions as well

polar acorn Feb 3, 2019, 6:11 PM

#

But why don't you add dummies before you split the data?

languid adder Feb 3, 2019, 6:13 PM

#

well.. because I split my training data into training and test.
I also have a different file that I need to make predictions for and I don't have the labels for them

#

if I add the dummies before splitting, I might have the same issue when I predict on my unknown dataset

#

also, my split up test set has some values that aren't in my training so it gives a more accurate representation of the unknown set

terse pewter Feb 3, 2019, 6:25 PM

#

@fervent solar I’m not entirely sure, but I think you tried to calculate a mean from your data frame but there are some values such as the one you see in the error that cannot be converted to a numerical value

#

I suggest looking at your data and cleaning out all the bad rows
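
The screenshot isn't available, so this is only a sketch of the situation described: a column that should be numeric but contains a placeholder string. The `'?'` value and the numbers are assumptions.

```python
import pandas as pd

# Invented column: numbers read as strings, with a '?' placeholder mixed in.
df = pd.DataFrame({"price": ["13495", "?", "16500"]})

# errors='coerce' turns anything unparseable into NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# mean() skips NaN by default.
print(df["price"].mean())
```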

languid adder Feb 3, 2019, 7:04 PM

#

my model's RMSLE is 0.00054004, measured on an unseen dataset that I have labels for.
my RMSLE on a different dataset is all of a sudden 1.37941
Do I need to conclude my model is rubbish or that I might have made an error in importing/transforming the second set?

polar acorn Feb 3, 2019, 7:14 PM

#

What's your training error? It might provide some pointers about which of the RMSLE values is wrong

languid adder Feb 3, 2019, 7:16 PM

#

I'm doing the housing predictions from Kaggle and my predictions for the submissions are all very low (41k mean with std of 300)
I haven't looked at the errors on my training or test set but because my test set performs rather well and the submission set gives these weird results, I might be inclined to think there's a problem with applying the submission set to my model

polar acorn Feb 3, 2019, 7:21 PM

#

That hypothesis could be tested if a simpler model performs similarly on test and submission set I guess.

languid adder Feb 3, 2019, 7:31 PM

#

I tried changing from RandomForest to DecisionTreeRegressor and I get similar low values.
It's weird because the predictions in that model seem almost identical...
array([52373.58333333, 52373.58333333, 52373.58333333, ...,
52373.58333333, 52373.58333333, 52373.58333333])

#

I also did a test with SGDRegressor but my initial RMSE on the training set is around 3e+20 while my predictions are around 2e+19 so again way lower

#

on the DecisionTreeRegressor I get a constant for all my predictions on the Kaggle submission set while my testing set gives a very nice RMSE of around 9k

#

so this probably shows I'm applying the submission set wrong

polar acorn Feb 3, 2019, 7:49 PM

#

Probably, you should make sure you clean the data in the same way. And maybe also look at the submission set to see where it's different

languid adder Feb 3, 2019, 7:50 PM

#

i have one method that I apply to both my test set and kaggle set so to make sure they are transformed in the same way

polar acorn Feb 3, 2019, 7:58 PM

#

Hold on is this what you're doing?
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

House Prices: Advanced Regression Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting

#

Because I don't think that challenge has two test sets. The sample submission set is a sample of what your submission should look like, it only contains ID and price.

languid adder Feb 3, 2019, 8:06 PM

#

yes

#

but I split up the trainset so I have my own test set to validate my model against an unseen set

polar acorn Feb 3, 2019, 8:09 PM

#

👍 of course, I was a bit perplexed about where you got your second test set from.

languid adder Feb 3, 2019, 8:09 PM

#

from common practice 😉

#

so that's also why I am surprised. My own test set performs rather well. 9k RMSE while the submission set doesn't work at all...

polar acorn Feb 3, 2019, 8:11 PM

#

Hmm and you have looked at the size of both test sets after transforming them to check if there are any obvious differences?

languid adder Feb 3, 2019, 8:11 PM

#

if I have different columns, I would get an error in my model.
The shape looks good

#

when I do a .describe() of both my trainSet and my kaggleSet after transformation, they look similar

#

the means of different columns I looked at are in similar ranges, same for std

#

i'll restart the kernel... I've seen similar issues that some things are in memory from trying and I remember spending a lot of time chasing ghosts... will try that and see.

polar acorn Feb 3, 2019, 8:17 PM

#

Hmm I see that a few other submissions get an RMSLE of around 0.1 with some basic models. If your model gets you 0.00054004 on the first unseen test set, that seems very low; there might be some information leakage somewhere.
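
For scale, a sketch of RMSLE using the usual `log1p` definition (the prices are invented). A uniform 10% overestimate already gives about 0.095, so a score near 0.0005 would mean predictions within a small fraction of a percent of the true prices.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Invented prices; every prediction is 10% too high.
actual = np.array([100_000.0, 200_000.0, 300_000.0])
print(rmsle(actual, actual * 1.1))
```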

languid adder Feb 3, 2019, 8:17 PM

#

ah yes you are right... So maybe the model is flawed after all. Overfitting like hell 😛

#

ow... now my predictions look better

#

now the predictions of my kaggle set is between a normal range. plenty of houses in the 100-500k mark 😄

#

will upload and see the result

#

ow darn spoke too early 😛 was plotting the training price 😦

#

yup same issue so will look into my model.
And yes... because my RMSLE is so small on my training set while the top guys at kaggle are at 0.1 and my training RMSLE has 0.0004 I indeed need to look into my model

#

but what I still don't understand is why it performs so well against my own test set

#

I'm sure I don't use my test data for training

languid adder Feb 3, 2019, 8:48 PM

#

ah think I found something... I sent a non-standardized dataset to my Random Forest

#

when I standardize my set and calculate the RMSLE I now get 1.81040718 which is more in line with what I see on my submission 😛

#

when I apply that new model to the kaggle set, my housing prices are still off but at least they are all in the range of 181k which is better than in 40k so I bet if I upload that, I improved a lot 😛

languid adder Feb 3, 2019, 9:16 PM

#

it keeps nagging me why my own test set does perform well but as soon as I apply it to the kaggle set it goes haywire

#

for example a DecisionTreeRegressor seems to work really well on the train data + test set but when I apply it to the kaggle set, I get the same prediction for each row

#

doesn't make any sense, right?
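
One way this symptom can arise (an assumption, since the notebook isn't shown): the tree was fitted on standardized features but the new rows are passed in unscaled, so every row lands on the same side of each split and ends up in the same leaf. A numpy sketch with an invented feature and a hypothetical split threshold:

```python
import numpy as np

# Invented living-area values for a training set and a new set.
train_area = np.array([800.0, 1200.0, 1500.0, 2000.0, 2600.0])
new_area = np.array([900.0, 1700.0, 2400.0])

# Standardization statistics come from the training data only.
mu, sigma = train_area.mean(), train_area.std()

# Hypothetical split a tree might learn on the standardized feature.
threshold = 0.0

# Unscaled rows all fall on the same side of the split...
raw_side = new_area > threshold
# ...while rows scaled with the training statistics are separated.
scaled_side = (new_area - mu) / sigma > threshold

print(raw_side)
print(scaled_side)
```

Reusing `mu` and `sigma` from the training data for every later dataset keeps the features on the scale the model was fitted with.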

wind osprey Feb 3, 2019, 9:19 PM

#

hello anyone able to assist with some data science/probability/linear algebra questions as it relates to the L2 Norm?

languid adder Feb 3, 2019, 9:38 PM

#

alright, scored a 0.37 on kaggle now... looks a bit better

#

top 4000 😛

polar acorn Feb 3, 2019, 9:53 PM

#

👏 Nice!

small ore Feb 3, 2019, 10:20 PM

#

@wind osprey

#

!t ask

arctic wedgeBOT Feb 3, 2019, 10:20 PM

#

ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

fervent solar Feb 4, 2019, 12:10 AM

#

df.dropna(axis=0,inplace=True,subset=['price']) this code is not cleaning empty values @terse pewter

desert oar Feb 4, 2019, 5:02 AM

#

i really dont recommend using inplace

#

it might actually be deprecated now, i cant remember

potent phoenix Feb 4, 2019, 6:12 AM

#

If I want to classify a CSGO match stream at any given frame as either showing the game or not showing the game (showing commentary, commercial, et cetera), should I just get a random collection of frames and label them as "game" and "non-game"?

#

Then use a decision tree classifier to create a model?

river plume Feb 4, 2019, 6:35 AM

#

which website/ course do you guys recommend to someone who wants to learn numpy ?

lapis sequoia Feb 4, 2019, 6:37 AM

#

@river plume datacamp

#

@potent phoenix what would be non-game?.. you need to define the problem better.. if there's very little difference in the encoded images, it won't make a difference.. and you'll get false classifications every time..

river plume Feb 4, 2019, 6:38 AM

#

@lapis sequoia alright I'll check it out

#

Thanks

lapis sequoia Feb 4, 2019, 6:38 AM

#

np

languid adder Feb 4, 2019, 8:01 AM

#

can someone recommend a book that covers data analysis with regards to ML so you have a better understanding of how to interpret and manipulate data so it's optimized for your model?

#

would this be any good? https://www.amazon.de/Think-Stats-Allen-B-Downey/dp/1491907339/ref=sr_1_1?ie=UTF8&qid=1549267476&sr=8-1&keywords=think+stats

small ore Feb 4, 2019, 8:11 AM

#

The usually recommended ESL and PRML wont serve your purpose?

languid adder Feb 4, 2019, 8:15 AM

#

isn't ESL a math heavy book? I was more looking at a text book

small ore Feb 4, 2019, 8:16 AM

#

Both are quite math heavy.

languid adder Feb 4, 2019, 8:17 AM

#

i already have Introduction to Statistical learning which is from the same authors as ESL

small ore Feb 4, 2019, 8:19 AM

#

I would have read that if it were python instead of R. Now that you have pointed a book that looks promising from what it says on the back cover, I would like to take a look at its contents. I donno how to access that from amazon

#

By contents I mean the contents page. Not the entire text contents

languid adder Feb 4, 2019, 8:20 AM

#

if you google the book, you find a pdf. Not sure how legal that is... 😦

#

ah 😛

#

http://www-bcf.usc.edu/~gareth/ISL/index.html

#

you should get some more info on the book website.

small ore Feb 4, 2019, 8:21 AM

#

Oh. I meant the other book. Think Stats

languid adder Feb 4, 2019, 8:22 AM

#

ah ok.

#

http://shop.oreilly.com/product/0636920020745.do

Think Stats

If you know how to program, you have the skills to turn data into knowledge using the tools of probability and statistics. This concise introduction shows you how to perform statistical analysis co...

#

here you'll find the ToC

#

the more I dabble into DS/ML the more I realize how little I know and how much I have still to learn 😃

small ore Feb 4, 2019, 8:30 AM

#

That book seems to cover a lot more topics than ESL and PRML. And they are all surely highly mathematical topics (although it might be taught with a different approach, as the back cover claims). Overwhelming number of topics there

languid adder Feb 4, 2019, 8:31 AM

#

i do notice from the learning i've done so far that it's a good approach to get some high level overview of topics so you know what's out there and if you need further information/clarification, the more detailed books will guide you

small ore Feb 4, 2019, 8:33 AM

#

That ToC does not look like high level overview to me. But it might be different to you

languid adder Feb 4, 2019, 8:34 AM

#

i'll let you know in a few days 😃 Book should arrive tomorrow

small ore Feb 4, 2019, 8:36 AM

#

I would certainly be eager to know that 😃 . Thank you

languid adder Feb 4, 2019, 8:37 AM

#

found this kernel on Kaggle about the housing price competition and it's really good in explaining how to explore the data: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

small ore Feb 4, 2019, 8:38 AM

#

There seems to be another book by the same name by Allen B. Downey

languid adder Feb 4, 2019, 8:39 AM

#

he has a couple of books with o'reilly

#

and think stats has a 2nd edition if that's what you mean

#

I ordered the second edition

small ore Feb 4, 2019, 8:41 AM

#

Oh wait. I misread. Sorry. I tried to discern from the german amazon site. I thought Taschenbuch was the author. Whatever that means

languid adder Feb 4, 2019, 8:41 AM

#

yeah I'm not German but i have to use the german amazon as they deliver to Belgium

#

i read a review about that book that says it's an intro to many of the statistical concepts and the code is to explain the concept using code vs math

terse pewter Feb 4, 2019, 9:44 AM

#

So I'm actually doing an R program, but my question is for statistics related

#

I did a KNN application on a dataset and the min-max normalization is rendering better results than a z-score standardization

#

why is this the case?

#

My prof had said that generally z-score standardization is the better way to go, but based on some various K values tested on both normalization methods, the min-max rendered the best results
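
For reference, the two rescalings side by side, sketched in Python rather than R with an invented feature. KNN distances depend on the relative ranges of the features, so which rescaling helps more is data-dependent; an outlier, for instance, affects the two differently.

```python
import numpy as np

# Invented feature with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())  # bounded to [0, 1]
z_score = (x - x.mean()) / x.std()             # mean 0, std 1, unbounded

print(min_max)
print(z_score)
```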

languid adder Feb 4, 2019, 11:41 AM

#

is it common practice to do data cleanup and feature engineering on a combined set of the train and test even though you don't have the labels for the test set?

#

i've seen it in several examples where people combine both sets before changing NA values to something else or before they call get_dummies

lapis sequoia Feb 4, 2019, 12:07 PM

#

yeah usually people drop NAs.. but it depends.. on the type of variables..

#

some will fill with the median or mode

languid adder Feb 4, 2019, 12:12 PM

#

yeah that I know. My question is more about the practice of combining both the training and (unlabelled) test set before you do these operations

polar acorn Feb 4, 2019, 1:27 PM

#

Depends I think, if it's just for analysis of that dataset or I'm completely sure the test data contain no surprises then I might do that. But if it's for training a model that I plan to use many times on different data, then I'd make a proper data wrangling pipeline and use it on both the training and testing set to see that it works. I don't know what everybody else does though, but for me it depends.

storm monolith Feb 4, 2019, 3:14 PM

#

```
df = pd.read_csv(r'‪C:\Users\damia\Desktop\500.csv')
print(df)
```
CSV contains:
```
Symbol
NCI
BGH
VYP
```
I get this error: `UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character`

late garnet Feb 4, 2019, 3:24 PM

#

@storm monolith - specify the encoding when you read in the csv. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

storm monolith Feb 4, 2019, 3:31 PM

#

@late garnet Didnt help , i specified utf-8. Notepad++ says its UTF-8

polar acorn Feb 4, 2019, 3:53 PM

#

You could try, utf_8_sig

desert cradle Feb 4, 2019, 5:15 PM

#

where did you specify utf-8 and what error (if different) did you get when you specified it

fervent solar Feb 4, 2019, 7:13 PM

#

kindly anyone tell why my dropna is not working

📎 Screenshot_134.png

#

its not dropping values

#

i printed the price before dropna and after dropna still not working
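
The screenshot isn't available, so this is only a guess at one common cause: the "empty" cells are blank strings rather than NaN, and `dropna` only removes real missing values. A sketch with an invented column:

```python
import numpy as np
import pandas as pd

# Invented column where the "empty" cells are blank strings, not NaN.
df = pd.DataFrame({"price": ["100", "", "  ", "250"]})

# Blank or whitespace-only strings are not NaN, so dropna removes nothing.
print(len(df.dropna(subset=["price"])))  # 4

# Convert blank strings to NaN first, then drop, and keep the returned frame.
df["price"] = df["price"].replace(r"^\s*$", np.nan, regex=True)
df = df.dropna(subset=["price"])
print(len(df))  # 2
```

Checking `df["price"].isna().sum()` before dropping shows how many values pandas actually treats as missing.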

ivory galleon Feb 4, 2019, 7:17 PM

#

Any advice on how to debug an assertion error, from statsmodels.formula.api.ols, assert pytype not in (tokenize.NL, tokenize.NEWLINE)?

#

I'm not entirely sure what sort of error I'm even dealing with. I've verified that my column names are right, but I'm not sure what else could be going on.

#

The problem is somewhere in patsy, but I've poked at Patsy a bit and I'm not seeing an obvious solution.

#

Same problem arises when I type in a manual formula ("Employees ~ Time") and when I use ModelDesc, and patsy worked on a slightly different dataframe earlier in my notebook.

small pumice Feb 4, 2019, 7:33 PM

#

I'm building a neural network using Keras. I have two variables for my input. The training data is split into two Numpy arrays, one with all the data for the first variable, and one with all the data for the second variable. The output data is also a Numpy array. How should I format the data to give to the network?
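
A sketch of one common layout, assuming each sample has a single scalar per variable (array names and values are hypothetical): stack the two arrays as columns so that each row is one sample.

```python
import numpy as np

# Hypothetical arrays: one scalar per sample for each input variable.
var1 = np.array([0.1, 0.2, 0.3, 0.4])
var2 = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 1, 1, 0])

# Stack as columns so each row is one sample: shape (n_samples, 2).
X = np.column_stack([var1, var2])
print(X.shape)  # (4, 2)

# X and y can then be passed to model.fit(X, y) for a model whose
# first layer expects 2 input features.
```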

storm monolith Feb 4, 2019, 8:10 PM

#

@desert cradle df = pd.read_csv(r'‪C:\Users\damia\Desktop\500.csv',encoding='utf-8') Same error, want full paste ?

#

File "pandas\_libs\parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 686, in pandas._libs.parsers.TextReader._setup_parser_source
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

desert cradle Feb 4, 2019, 8:32 PM

#

ok, your actual problem

#

there is an invisible character in your filename

#

📎 unknown.png

#

that's what i got when i pasted the string from the line you just typed into my prompt @storm monolith

#

delete and retype the string constant

#

thought it was weird that it was an encode error
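
A sketch for spotting this kind of character with the standard library. The path below starts with U+202A written as an escape; that code point is an assumption for illustration, not necessarily the exact character in the original message.

```python
import unicodedata

# Example path with an invisible U+202A (LEFT-TO-RIGHT EMBEDDING) in front.
path = "\u202aC:\\Users\\damia\\Desktop\\500.csv"

for i, ch in enumerate(path):
    # Unicode category "Cf" covers invisible formatting characters.
    if unicodedata.category(ch) == "Cf":
        print(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))

# Strip such characters before using the path.
clean = "".join(ch for ch in path if unicodedata.category(ch) != "Cf")
print(repr(clean))
```

Printing `repr(path)` or `ascii(path)` is a quicker check: the hidden character shows up as an escape sequence.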

storm monolith Feb 4, 2019, 8:34 PM

#

lol, never would've noticed

#

thx

ivory galleon Feb 4, 2019, 9:55 PM

#

I'm still stuck, trying to figure out what the tokenization is even doing that's making it go wrong.

#

I just wanted to run an ols on a dataframe!

languid adder Feb 5, 2019, 9:11 AM

#

question regarding OneHotEncoder. Say I have a column that can have 4 different values, so with OneHotEncoder I transform that into 4 new columns.
If I calculate the correlation of those 4 new columns vs my label and notice only 2 of those columns have a decent correlation value, is it good to remove the other 2 or is that bad considering their origin?

#

the same question if I use a PolynomialFeatures transform? this can increase the amount of features exponentially so dropping the additional columns that don't have a good correlation should help the model, right?

chilly shuttle Feb 5, 2019, 12:22 PM

#

if you're only looking for linear relationships, sure

#

but corr won't tell you much about nonlinear or stateful relationships, which can be learned by a range of ML techniques
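
A tiny numpy illustration of that point, with invented data: `y` is completely determined by `x`, yet the Pearson correlation is essentially zero because the relationship is not linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
y = x ** 2  # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(r)  # essentially 0
```

A correlation filter would drop `x` here even though a tree or a neural network could use it.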

languid adder Feb 5, 2019, 12:23 PM

#

ah yes, if there is a higher-order relationship the correlation won't point it out and I might wrongly drop those

chilly shuttle Feb 5, 2019, 12:23 PM

#

i observed there's a bit of a miss in current ML curriculum around feature selection

#

most of the feature selection techniques presented rely on linear relations and are worthless for stuff like decision trees or ann

languid adder Feb 5, 2019, 12:25 PM

#

yeah most of the stuff you find talks about correlation and if you're lucky they mention it doesn't apply to nonlinear relationships

chilly shuttle Feb 5, 2019, 12:26 PM

#

pca is a huge offender

#

it's still taught as one of or THE primary feature selection technique, but it's actually useless for popular ml techniques

ivory galleon Feb 5, 2019, 11:30 PM

#

OK, I've confirmed that
'response_terms = [patsy.Term([patsy.LookupFactor("Employees")])]
model_terms = [patsy.Term([])]
model_terms += [patsy.Term([patsy.LookupFactor("Time")])]'
works and
'ols(formula="Employees ~ Time",data=longer_df).fit().summary()'
doesn't.

old axle Feb 5, 2019, 11:56 PM

#

whats the command to install numpy and pandas?

ivory galleon Feb 6, 2019, 12:00 AM

#

@old axle, do you have pip?

old axle Feb 6, 2019, 12:01 AM

#

yep

#

and git

#

is the package called numpy itself

#

numpy/pandas

ivory galleon Feb 6, 2019, 12:01 AM

#

pip install pandas and pip install numpy should do the trick.

old axle Feb 6, 2019, 12:01 AM

#

i thought some had different names

#

okay thanks

ivory galleon Feb 6, 2019, 12:04 AM

#

Some do, but it's always worth checking the defaults.

old axle Feb 6, 2019, 12:07 AM

#

ok

ivory galleon Feb 6, 2019, 12:56 AM

#

(My problem has been fixed! For the record, I hadn't updated a library that broke on the 3.6->3.7 upgrade.)

languid adder Feb 6, 2019, 10:15 AM

#

if you have a dataset of around 1500 records and one of the categorical features has 2 values. One value has 1400 occurrences and the other around 100.
I notice that those 100 records have a capped value for my label while if a record has the other value, the label can go much higher (over double the value of the other category)
I'm not sure that based upon the high difference in frequency of these categories, It would be wise to make this conclusion and include it in my model

languid adder Feb 6, 2019, 10:31 AM

#

if you have following distribution of categories:

#

📎 unknown.png

#

do I need to disregard that feature? Knowing that one category takes around 90% of the dataset so when it comes to probability... it has more opportunity to generate more outliers or wider distributions

carmine lava Feb 6, 2019, 2:15 PM

#

Best course to start deeplearning 🤔

#

Any one

polar acorn Feb 6, 2019, 2:27 PM

#

@carmine lava check pinned messages.

carmine lava Feb 6, 2019, 2:31 PM

#

@polar acorn link

polar acorn Feb 6, 2019, 2:32 PM

#

Pinned messages in this channel, top right above the chat window.

lime lava Feb 6, 2019, 3:11 PM

#

Hi, Pandas question: I have two tables, one with a yearly value for the last 10 years and another with several price entries by year. I want to create a new column that divides the second table price by the first table value if years match
is there a simple way to do this?

solar oracle Feb 6, 2019, 3:15 PM

#

So you have yearly data and let's say monthly data for the same thing?

lime lava Feb 6, 2019, 3:17 PM

#

actually

#

I have a yearly index in one table

#

and a lot of transactions for each year in another

#

first table is like 10 rows

#

second one is like 300k
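
A sketch of one simple way to do this with `merge`. The column names and values are invented: a small yearly table is joined onto the transactions, then the two columns are divided.

```python
import pandas as pd

# Invented column names and values.
index_df = pd.DataFrame({"year": [2017, 2018], "index_value": [2.0, 4.0]})
tx = pd.DataFrame({"year": [2017, 2017, 2018, 2019],
                   "price": [10.0, 20.0, 40.0, 50.0]})

# A left merge copies each year's index value onto every matching transaction;
# years with no match (2019 here) get NaN.
merged = tx.merge(index_df, on="year", how="left")
merged["adjusted"] = merged["price"] / merged["index_value"]
print(merged)
```

With roughly 10 rows on one side and 300k on the other, the merge stays cheap, and the left join keeps every transaction.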

river plume Feb 6, 2019, 3:29 PM

#

i have a dataframe column with such values: "Eraser (5)"; where the round brackets contain the price of object in int

#

How can I add another df column after extracting the price and storing it in the new column

#

Like df.Object should be "Eraser" and df.Price should be 5

desert cradle Feb 6, 2019, 3:37 PM

#

```
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({'TheColumn': ['Eraser (5)']})
>>> r = re.compile(r'^(.*?)\s*\((\d+)\)$')
>>> ms = df['TheColumn'].apply(r.match)
>>> df['Object'] = ms.apply(lambda x: x.group(1))
>>> df['Price'] = ms.apply(lambda x: int(x.group(2)))
>>> df
    TheColumn  Object  Price
0  Eraser (5)  Eraser      5
```
@river plume

river plume Feb 6, 2019, 5:00 PM

#

thanks @desert cradle

river plume Feb 6, 2019, 5:16 PM

#

df['Price'] = df['TheColumn'].str.extract(r'.*\((.*)\).*', expand=True)

#

This worked too @desert cradle
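
Both columns can also be pulled out in one vectorized call. A sketch using named groups, which `str.extract` turns into column names; the second row is invented to show a multi-word object.

```python
import pandas as pd

df = pd.DataFrame({"TheColumn": ["Eraser (5)", "Pencil case (12)"]})

# Named groups become column names; no apply() or compiled regex needed.
parts = df["TheColumn"].str.extract(r"^(?P<Object>.*?)\s*\((?P<Price>\d+)\)$")
df["Object"] = parts["Object"]
df["Price"] = parts["Price"].astype(int)
print(df)
```

Rows that don't match the pattern come back as NaN, in which case `astype(int)` raises and the rows need handling first.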


