#data-science-and-ml
the optimizer you are using is then a class which represents the algorithm, so it needs the model parameters in the constructor, some training methods that take data and a run method, which just evaluates data and does not train.
(or you can just use a framework for the model that you can compare your solution to, and for creating the rest of the script)
does that make sense?
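A minimal sketch of the class layout described above, assuming a toy one-variable linear model trained with plain gradient descent. The class name, the `train_step`/`train`/`run` method names, and the learning rate are all made up for illustration; the actual assignment is not shown in this chat.

```python
class GradientDescentOptimizer:
    """Hypothetical optimizer class for the model y = w * x + b."""

    def __init__(self, w=0.0, b=0.0, learning_rate=0.01):
        # Model parameters go into the constructor.
        self.w = w
        self.b = b
        self.learning_rate = learning_rate

    def run(self, x):
        """Evaluate the model on one input without training."""
        return self.w * x + self.b

    def train_step(self, xs, ys):
        """One gradient-descent step on mean squared error.

        Returns the loss measured before the parameter update.
        """
        n = len(xs)
        errors = [self.run(x) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b
        return loss

    def train(self, xs, ys, epochs=1000):
        """Repeat train_step and return the last measured loss."""
        loss = None
        for _ in range(epochs):
            loss = self.train_step(xs, ys)
        return loss


# Toy data generated from y = 2x + 1.
optimizer = GradientDescentOptimizer(learning_rate=0.05)
optimizer.train([0, 1, 2, 3], [1, 3, 5, 7], epochs=2000)
print(round(optimizer.w, 2), round(optimizer.b, 2), optimizer.run(10))
```

Keeping `run` separate from the training methods means the same object can be used for evaluation once training is done.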
Thank you so much, it makes some sense
I'm just a bit clueless on how to begin writing the code, as that's my main issue. Been coding for around 5 months now, first-year university student
Still learning the ropes
@hardy crag
just build it step by step. first, create the input and just print it back to the console, and so on
Okay thank you so much!
yw
can someone help me here? I'm learning basic data science and did this for model accuracy
```python
def get_model_accuracy(self):
    x = self.selected_columns[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
    y = self.selected_columns[["CO2EMISSIONS"]]
    linear_reg = LinearRegression().fit(x, y)
    accuracy_value = -cross_val_score(linear_reg, x, y, cv=10, scoring="neg_mean_squared_error").mean()
    print(
        "model accuracy was:", format(accuracy_value ** 0.5, ".2f")
    )
```
from what I learned the accuracy values need to be between [0, 1]
but the value I'm getting is 24
what did I do wrong?
Accuracy would be calculated as number of correct predictions / total number of predictions:
```
(TP + TN) / (TP + TN + FP + FN)
```
So either you got the formula wrong, or some of your values don't add up
I haven't touched this in ages so can't really remember much more about it
hum, I thought the module would do the formula for me
tp being true positives, tn true negatives and so on
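The formula above can be written out as a small helper. This is only a sketch with made-up counts, and it applies to classification problems; the CO2 example in this thread is a regression, so it has no true/false positives to count.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Made-up confusion counts for 100 predictions.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```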
Humm, I can look through my old course work for how I did it.
I believe i had a bit of accuracy calculated
https://github.com/tagptroll1/NaiveBayes284/blob/master/NaiveBayes.py#L113 I only calculated accuracy when i did the model manually, so no sklearn or any other lib
¯_(ツ)_/¯
It's also pretty old, so not the best of python code in general.
oh I just found out LinearRegression() has a score() method
but I'll look into your repo
I forget if score is accuracy or something else, but that could be it
yeah score did it
it returned 86.4%
hum, that's not good, is it?
```python
def get_model_accuracy(self):
    x = self.selected_columns[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
    y = self.selected_columns[["CO2EMISSIONS"]]
    linear_reg = LinearRegression().fit(x, y)
    print("model accuracy was:", format(linear_reg.score(x, y) * 100, ".2f"), "%")
    print("The predicted co2 emission values were", linear_reg.predict(x))
```
That's fairly good, from what i remember
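For context on the two numbers in this exchange: the first snippet takes the square root of a mean squared error, so the 24 is a root mean squared error (RMSE) in the units of CO2EMISSIONS, not something bounded by [0, 1]. scikit-learn documents `LinearRegression.score` as returning the coefficient of determination R², which is at most 1, so 86.4% is an R² rather than a classification accuracy. Note also that `score(x, y)` here is computed on the same data the model was fit on, unlike the cross-validated value. A small sketch of both metrics on made-up numbers:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up emission values and predictions, purely illustrative.
actual = [200, 250, 300, 350]
predicted = [210, 240, 310, 340]
print(rmse(actual, predicted))                  # 10.0
print(round(r_squared(actual, predicted), 3))   # 0.968
```

An RMSE of 10 and an R² of 0.968 describe the same fit; they are just on different scales.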
well, thanks for the help, I'll check your repo now chibli, thanks
Having too high of a score could mean overfitting
that means the model is way too specific to the training set and not to a general situation, right?
thanks
If you're interested this was my exam in that course, very small training set and I used the wrong model but hey 🤷🏽
https://github.com/tagptroll1/machinelearningexam
Can anyone give some directions on how to select the number of epochs, batch size, hidden layer size, etc.? I made a dataset and now I'm trying to evaluate how well my model can fit it. I can't make any sense of the accuracy results. Sometimes the accuracy can be anything from 25-95% and sometimes it hits a recurring 28.42%, even if I run with the exact same hyperparameters.
```python
X = np.array(data.drop(['H', 'D', 'A'], axis=1))  # Features
y = np.array(data[['H', 'D', 'A']])  # Labels

# 10-fold Cross-validation
kf = KFold(n_splits=10)  # Shuffle=True?
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(51, activation='relu'))
    model.add(Dense(25, activation='relu'))  # Experiment with hidden layer size
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X, y, epochs=1000, batch_size=18)  # Experiment with epochs and batch_size
    scores = model.evaluate(X, y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
```
Like, why is every single epoch 28.42%?
that's the best the model can do I guess
It's not, cuz running it again it might just as well hit 95%
Or 62 or 54 or whatever
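One thing visible in the posted code: the folds are split into X_train/X_test, but `fit` and `evaluate` are both called on the full X and y, so the reported number is accuracy on data the model has already seen. Below is a dependency-free sketch of the intended pattern, where each fold trains on its training indices and is scored on the held-out indices. `k_fold_indices`, `MajorityClassModel`, and `cross_validate` are hypothetical stand-ins, not Keras or scikit-learn APIs; the stand-in model just predicts the most common training label.

```python
from collections import Counter


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


class MajorityClassModel:
    """Stand-in for the real network: predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def cross_validate(X, y, n_splits=10):
    """Return one held-out accuracy per fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_splits):
        model = MajorityClassModel()  # fresh model for every fold
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = model.predict([X[i] for i in test_idx])
        truth = [y[i] for i in test_idx]
        correct = sum(p == t for p, t in zip(predictions, truth))
        scores.append(correct / len(truth))
    return scores
```

With unshuffled folds and data sorted by label, some folds contain only one class, which is one reason the `Shuffle=True?` comment in the posted code is worth following up on.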
There isn't a magic way to pick hyperparameters, I don't believe
It just kind of is trial and error
Well, how can I do trial and error if it gives me different results every run, whether I tweak the hyperparameters or not?
I can't really tell whether it was the tweaking that made it better/worse or just the random variation
well if the model is randomly initialized, the minimum it finds may be different from run to run
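One way to separate the effect of a tweak from run-to-run randomness is to repeat each configuration over the same set of seeds and compare the mean and spread instead of single runs. The sketch below uses a made-up `noisy_score` function in place of real training; how to seed an actual framework depends on that framework and is not shown here.

```python
import random
import statistics


def noisy_score(hidden_units, seed):
    """Hypothetical stand-in for 'train with this seed, return validation accuracy'."""
    rng = random.Random(seed)
    return 0.5 + 0.001 * hidden_units + rng.gauss(0, 0.05)


def summarize(hidden_units, seeds):
    """Mean and sample standard deviation of the score across seeds."""
    scores = [noisy_score(hidden_units, seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


seeds = range(10)
for hidden_units in (10, 25, 60):
    mean, spread = summarize(hidden_units, seeds)
    print(f"hidden={hidden_units:3d}  mean={mean:.3f}  std={spread:.3f}")
```

If the difference between two configurations is small compared with the standard deviation across seeds, the tweak probably did not matter.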
have you checked whether the 28.xx% is the result of the model always predicting the same output for any input?
also: how big are your training and testing set?
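A quick way to run that check is to count how often each class is predicted and compare the accuracy with the share of each class in the labels. In the sketch below the helper name and the 172/100/108 split are made up; the only number taken from the chat is the 380-sample season. With that made-up split, a model that always predicts the 108-sample class scores 108 / 380 = 28.42%, so the recurring figure would be consistent with a constant prediction, though the chat does not confirm it.

```python
from collections import Counter


def constant_prediction_report(y_true, y_pred):
    """Summarize predicted classes and compare accuracy with class shares."""
    n = len(y_true)
    predicted_counts = Counter(y_pred)
    true_counts = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {
        "predicted_counts": dict(predicted_counts),
        "is_constant": len(predicted_counts) == 1,
        "accuracy": accuracy,
        "class_shares": {label: count / n for label, count in true_counts.items()},
    }


# Hypothetical season of 380 games; the H/D/A split is invented for illustration.
y_true = ["H"] * 172 + ["D"] * 100 + ["A"] * 108
y_pred = ["A"] * 380  # a model that always predicts an away win
report = constant_prediction_report(y_true, y_pred)
print(report["is_constant"], round(report["accuracy"] * 100, 2))  # True 28.42
```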
there is a lot of "intuition" and "experience" and reading tips and tricks for x architecture involved in finding hyperparameters
Dataset is 380 samples (representing one season of EPL football games), which can be increased by adding more seasons. I've tried splitting it with the validation_split parameter of model.fit() by 10%, 20% and 30%.
Can you point me to any good resources of these tips and tricks? @hardy crag
@thin terrace so the model always predicts Away win?
if you have an imbalanced dataset, do you correct for that?
an example of tips and tricks (and a good blog in general): https://towardsdatascience.com/deep-learning-tips-and-tricks-1ef708ec5f53
Hi,
I am trying to use satellite data and a neural network to predict whether a wildfire will occur in a given area within a month. I am using Google Earth Engine to collect the data. I have a question about the neural network: I will not have as many occurrences of fires as I will fires not occurring. Therefore, I will have to use the same amount of fire occurrences as non-fires. Then, however, the result that the neural network outputs for whether a fire will occur within a month will be skewed. How do I account for this?
you can correct in your loss function for imbalanced classes
like this stackoverflow
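The Stack Overflow link is not included in this log, so here is only a generic sketch of one common correction: weight each class inversely to its frequency, so that errors on the rare class count more in the loss. The helper name and the 90/10 split are made up. Keras `model.fit` accepts a dictionary like this through its `class_weight` argument, keyed by integer class index rather than by string label.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up imbalance: 90 areas without a fire, 10 with one.
labels = ["no_fire"] * 90 + ["fire"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With weights like these the full dataset can be kept, instead of throwing away non-fire samples to force a 50/50 split.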
Hi. I am trying to work on document similarity to retrieve the top n documents for a given query
At present, I am using a TF-IDF vectorizer
Does someone know something better than TF-IDF that can capture semantic meaning as well?
Can I convert a .mlmodel (Apple Core ML model) to a .tflite (TensorFlow Lite model) using Python?
What is some example code of putting eye tracker data into a csv file?
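A minimal sketch using only the standard library `csv` module. The question does not say what the eye tracker delivers, so the field names (`timestamp_ms`, `gaze_x`, `gaze_y`) and the sample values are assumptions; real devices will have their own SDK and fields.

```python
import csv
import io

FIELDNAMES = ["timestamp_ms", "gaze_x", "gaze_y"]  # assumed fields


def write_gaze_samples(samples, file_obj):
    """Write a list of sample dictionaries as CSV rows with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(samples)


# Made-up samples standing in for whatever the tracker produces.
samples = [
    {"timestamp_ms": 0, "gaze_x": 512.0, "gaze_y": 384.0},
    {"timestamp_ms": 16, "gaze_x": 515.5, "gaze_y": 380.2},
]

# Written to an in-memory buffer here; for a real file use
# open("gaze.csv", "w", newline="") and pass the file object instead.
buffer = io.StringIO()
write_gaze_samples(samples, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```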
Hey there! Does anyone have experience with the Logit model in Python? I am currently using the statsmodels version of it. When fitting my model, I'd like to specify a logarithmic transform so that it fits on log(X) instead of the original X. Is there an easy way to do that other than manually creating columns in the df?
Does anyone have good resources for dataviz or "storytelling"?
So I got redirected here and I need advice
on how I can use image analysis
to identify the differences between these two
and other images
Like these two are different stages of a microstructure
of a material
and I wanted to create an app that will be able to distinguish between these two
How many examples do you have of each stage?
You don't even need ML. Just run a Hough transform and observe the much more prominent lines in the left image
@chilly shuttle can you elaborate more please
Like you're saying to use Hough transform, and to find the image with more lines than dark shapes?
Run them through a Hough transform and look at the differences in the output, it should be quite obvious after that
@chilly shuttle ohh ok, thank you!
if you want more help you need to give a lot more information about your problem, e.g. how many samples of each class you have as pptt asked
but the two you showed don't need ml to discern, you just need to look for straight lines
@chilly shuttle yea can I get back to you on that in a later day because
We just got the challenge today
From our materials uni course which is making a challenge for a hackathon
if it's a challenge why are you coming straight here for help
take it on as a challenge
And I'm planning to learn
Noo like
Ik what kinda thing to find out
It's just idk about the technology
Needed to find it out
Which I'm planning to learn
Before the hackathon
That's why they showed the challenge
Rn
The hackathon is in a few weeks
So we can kinda prepare for it
Like I just needed to know what kind of technology I needed to learn to find the differences in the two images
And I was planning to learn that stuff
So I can use it in the actual hackathon
That's why, sorry for the miscommunication
i don't mind, we'll answer whatever questions
Thanks! The only questions I plan on asking are related to the technology help itself
Hence why I came to the Python discord to start off
just saying if it's a challenge i'd personally feel like any success would be diminished if I got my hand held through it
Cuz I want to get involved in computer vision
And this challenge was something for me to start off with
Yea that's true, well like I just wanted to know where to start off from
Cuz I'm clueless right now
About machine learning
Like I got a course from Udemy, and I'm gonna use that to learn and from there try to solve the problem
Because I don't know yet, I still have to get clarification from them on whether they have more samples and whether the app they want is supposed to scan different materials that have different microstructures, so the samples would vary between each material
I just got the idea of this a few hours ago, so I was just desperately trying to find a place to start
Best of luck! But remember struggling and/or failing a bit is a great way to learn.
Hello everyone
i'm tryin to build a live forex chart
i don't have any experience with that yet
but i do know a little about python
is there anyone here, who can help me?
Hey guys, is there any free stock price API which allows me to get data more frequent than 1 day ? I want to make live updating chart using dash, but I need some great api that will make my chart to change for example each minute. Thanks in advance 😃
@vestal ravine I guess we're looking for the same thing 😄 Try to learn really useful libraries like pandas and numpy and then matplotlib/plotly for creating charts. After that you will be in my position: trying to find some great API to get live data, as frequent as possible.
ill take a look, Thnx.
@polar acorn Thanks!!!
@mikey770 Alpha Vantage has an intraday API
@meager laurel Is that a schematic diagram from a textbook or an actual black and white picture of the microstructure? Or an actual picture filtered to make the image simple for the task? A real-life problem would be to recognize the process (heat treatment, natural process of formation, etc.) of any given material. In that case you will need a lot of sample pictures of each material structure
@small ore well our materials Prof posted that on the slide for the lecture and I cropped the image from there, so I'm assuming it's an actual picture.
This is the challenge part
And we just started our materials course, so I don't know much about how the microstructures change. But I remember hearing he said the temperature affects it. I'll have to find out more info about it.
Those may even be differently heat-treated steels. I don't know how a photograph of a microstructure is taken either. I only remember seeing them in textbooks. Not even good ones
And my very limited and outdated knowledge tells me that recognizing microstructures was a forte of highly skilled and trained manpower. So this challenge seems to me like they are trying to remove the discrepancies that come with human decisions and the high cost involved (although who knows, machines may prove to be inferior in certain cases)
^
Steel microstructure is a massive fucking pain in the ass. Good luck getting anything beyond meh results if they give you a real world dataset.
Photographs come from SEMs
https://www.researchgate.net/publication/317711703_Advanced_Steel_Microstructural_Classification_by_Deep_Learning_Methods Here's some recent research you can have a look at, it's a very nontrivial problem looks like! Judging from this table, looks like you need pretty sophisticated methods to even reach 50% accuracy on real-world data @meager laurel
@small ore @void anvil @feral lodge ohh ok, I'll look more into that, and ask my prof what he's expecting and about this stuff too
I have a friend who works in a steel mill. ML is useless in > 99% of the cases. If you get an actual breakthrough you'll get hundreds of millions of dollars every year.
Well, unless you are working for a research firm, you may spend a million trying to get decent sample photographs
Idk I think I recall the prof saying to find the difference by checking the amount of white in the picture or the size or something. Idk if that's fully true now
True. They might be seeking something simple. I can't imagine someone throwing their students something so complex
Any suggestions to which metrics to use in ML classification?
Depends on the case. With nicely balanced classes accuracy might do fine. Screening for cancer? Maybe keep an eye on the f1 score and the false negative rate etc. etc.
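A small sketch of the metrics mentioned above, computed from confusion counts for one positive class. The counts are made up; they are chosen so that accuracy looks high while a third of the positives are still missed, which is the situation where F1 and the false negative rate are more informative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and false negative rate for one positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    false_negative_rate = fn / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negative_rate": false_negative_rate,
    }


# Made-up screening result: 12 true cases among 100 people, 4 of them missed.
metrics = classification_metrics(tp=8, tn=86, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a three-class problem such as home win / draw / away win, the same counts can be computed once per class, treating that class as the positive one.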
@polar acorn balance is like 45/25/30%
Depends strongly on what you're trying to predict
What input(s) do you have and what output(s) are you looking for?
Predicting football matches. Input is team ratings and some history data from previous games. output is home team win / draw / away team win
From previous work I've basically only seen accuracy being used
You might be better off using a (-1,1) classifying system like they use in finance and putting draw in at 0
@void anvil why is that?
experience
Is there an R discord?
Within data science, which tools are best in which situation? I'm currently doing a visualization class that is strictly Tableau, with only optional Python and R learning. I'm obviously going to be putting in the extra optional work, but which tool tends to hold the most weight in a working environment?
@feral lodge if you know any object tracking Python code or research papers, please share
Has anyone had Jupyter Notebook slow down after some period of time and been able to fix it? I have been looking it up, but couldn't find much
@thin terrace To expand on that answer, it's because there's an element of ordinality to those classes. ie, you're not predicting something truly Categorical like, say, "hair color". A win is "closer" to a draw than it is to a loss, and so it makes sense to put them on a scale.
@heavy apex Matplotlib itself is very powerful but honestly can be a pain. Seaborn's a much-easier wrapper around that, but it's limited in the types of visualizations it can do. Plotly is very good for both exploratory data analysis and presenting findings - it's also got a thing called Dash that can replace Tableau in a lot of instances. There's also Bokeh for interactive visualization. My favorite is probably Altair.
@void anvil @orchid lintel which activation function do you use on the output layer for such a label?
(and loss function)
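The ordinal encoding described above can be sketched as a mapping onto a scale from -1 to 1 with the draw at 0. The question about the output layer is not answered in this chat; one common pairing for a single output on that scale would be a tanh activation with a mean squared error loss, but that is an assumption here, as are the names and the `draw_band` cut-off used to turn a continuous prediction back into a class.

```python
OUTCOME_TO_SCORE = {"A": -1, "D": 0, "H": 1}  # away win, draw, home win


def encode(outcomes):
    """Map outcome labels onto the ordinal scale."""
    return [OUTCOME_TO_SCORE[outcome] for outcome in outcomes]


def decode(prediction, draw_band=0.5):
    """Turn a continuous prediction in [-1, 1] back into a label.

    draw_band is a hypothetical threshold: anything within
    +/- draw_band of zero is called a draw.
    """
    if prediction > draw_band:
        return "H"
    if prediction < -draw_band:
        return "A"
    return "D"


print(encode(["H", "D", "A", "H"]))            # [1, 0, -1, 1]
print([decode(p) for p in (0.8, 0.1, -0.7)])   # ['H', 'D', 'A']
```

With this encoding, predicting a draw for a home win is a smaller error than predicting an away win, which a plain three-class accuracy does not capture.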
Anyone know a good free alternative to prodigy (https://prodi.gy/)?
Jk, looks like https://github.com/chakki-works/doccano is a good option
@vague jetty https://github.com/zalandoresearch/flair
is also a great one as well
same with spacy
Prodigy is spaCy's version
Does anyone have experience with Doccano? I'm having trouble uploading a dataset.
Looks like Doccano isn't working for me. Any other suggestions for an easy interface for data annotation?
Specifically, I'm looking to annotate classification, not keywords in text.
Nvm, looks like Doccano started working for me.
that's pretty cool
i guess you couple something like that with something like mechanical turk to generate labelled datasets
Oh by the way, my professor got back to me and he said
"Sorry for the delay in getting back to you. Attached are two sample files similar to the ones that will be provided for the challenge. What we would like you to do is to measure the fractions of the light and dark phases. The fraction could simply be expressed as number of pixels of a given color over the total number of pixels. The challenge will be the lighting conditions. Sometimes you’ll get very good contrast between the phases (e.g. micro1.png) and sometimes there is less contrast (micro2_prec.png).
To do this, you can start by using Python libraries for reading an image and for interrogating the data."
That's what he wanted
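A sketch of the measurement the professor describes, assuming the image has already been loaded as an 8-bit grayscale NumPy array (the loading step and the sample files are not shown here). Because the contrast varies between images, the sketch picks the light/dark cut-off per image with Otsu's method instead of a fixed value. That choice is my assumption, not something the professor asked for.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the 8-bit level that maximises between-class variance (Otsu's method)."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    weight_dark = np.cumsum(prob)           # share of pixels with value <= t
    weight_light = 1.0 - weight_dark
    cum_mean = np.cumsum(prob * levels)
    total_mean = cum_mean[-1]
    numerator = (total_mean * weight_dark - cum_mean) ** 2
    denominator = weight_dark * weight_light
    between = np.zeros(256)
    valid = denominator > 0                 # skip thresholds that leave one class empty
    between[valid] = numerator[valid] / denominator[valid]
    return int(np.argmax(between))


def phase_fractions(gray):
    """Fractions of light and dark pixels, split at the Otsu threshold."""
    gray = np.asarray(gray, dtype=np.uint8)
    threshold = otsu_threshold(gray)
    light = np.count_nonzero(gray > threshold) / gray.size
    return {"threshold": threshold, "light": light, "dark": 1.0 - light}


# Synthetic 10x10 "micrograph": 70% dark pixels (50) and 30% light pixels (200).
demo = np.array([[50] * 7 + [200] * 3] * 10, dtype=np.uint8)
print(phase_fractions(demo))
```

On low-contrast images it is worth plotting the histogram first to see whether two peaks are visible at all.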
Woop woop, got a follow-up email from a company for an ML research internship. They want me to submit a research proposal, but I have no idea where to start...
@vague jetty ay nice job
Basically here's a problem, here's what it'll do for your company, here are some methods that I could try to apply, and here's why it'll be worth your money
Like I want to learn deep ocean diving, want budget for a sonar measurement of dolphin and whale sounds, so that I can help devise a way to distinguish between ships/subs and other creatures 😄 😛
So I'm really new to sentiment analysis and am playing around with a project doing sentiment on a hockey forum. I've scraped a bunch of posts and am working on annotating them to make the training and test sets. I imagine this is a vague, common question with no good answer, but how do I label a post like "He wasn't great at the u18's. He started off the year in Kingston pretty bad, but got a lot better later on near the new year. His consistency could use some work from what I've seen/heard. He can be dominant but he can also have some bad showings." It's pretty clearly both positive and negative. Right now I'm applying labels to the entire post. I imagine selecting the instances of positive text and negative text in the post would be better than blanket labeling the entire post, but it will take a lot more time to do that. Should I label it both positive and negative?
You can count sentiment items per post (10x bad, 3x good) if you want. What's harder to do is pick up sarcasm (e.g. PK Subban is the worst role model in the league).
It's also a bit more difficult to pick out what's being said good / bad about the specific person in the post (e.g. player XYZ is doing badly so far this year but will pick up once he's off injury. He's still better than ABC who sucks.)
Yeah, that's a big issue I'm running into. Eventually I might try playing around with disambiguation, but I want to keep things simple right now. Do you recommend counting sentiment items over selecting the actual text?
Counting sentiment items is better than just having a -1/1 for good/bad on the post. It's not necessarily the best option.
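A minimal sketch of counting sentiment items per post, assuming a tiny hand-made word list. The lexicon and the function name are invented for illustration; a real project would use a proper lexicon or the annotated spans.

```python
import re
from collections import Counter

# Tiny hypothetical lexicon, only for illustration.
POSITIVE = {"better", "dominant", "great"}
NEGATIVE = {"bad", "worst", "sucks"}


def count_sentiment_items(post):
    """Count positive and negative lexicon hits in one post."""
    words = re.findall(r"[a-z']+", post.lower())
    counts = Counter()
    for word in words:
        if word in POSITIVE:
            counts["positive"] += 1
        elif word in NEGATIVE:
            counts["negative"] += 1
    return counts


post = (
    "He wasn't great at the u18's. He started off the year in Kingston pretty bad, "
    "but got a lot better later on near the new year. "
    "He can be dominant but he can also have some bad showings."
)
print(count_sentiment_items(post))
```

Note that "wasn't great" is counted as a positive hit here: plain word counting ignores negation, and it has the same blind spot for sarcasm and for who the sentiment is about, as discussed above.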
Not sure if this is the place to ask. I'm a computer science student who's currently taking calc. I love Microsoft's ability to annotate equations (think it uses LaTeX). My issue is that during work, if I get a chance to study, I'm not able to use downloadable apps. Do you guys know a good note-taking web app or web word processor that supports math equations? Google Docs is super limited :/
that is an odd one indeed. you could use markdown in a jupyter notebook, i think there are free notebook servers out there
otherwise i think google docs is your best bet
Hey does anyone know if date is a discrete or continuous numerical type?
I would think that it would be discrete
@obtuse kettle If JupyterLab is okay for your purpose (in addition to taking notes with equations, it can also run various programs inline and can have graphs, tables, etc.), then Microsoft Azure is one option. Requires a login
@small ore Why azure over digital ocean or other vps?
I didn't say over anything else. That was the only one I know which has support for Jupyter notebooks
I see. Thank you for your input, you as well salt 😃
@obtuse kettle I also recommend overleaf for notes, I always use it for math assignments etc. It's like google docs for latex
Does it have a web app, free?
@terse pewter Date should be a discrete variable. If it's date/time it's probably continuous though
@obtuse kettle https://www.overleaf.com/ No charge, you just need to make an account
Thank you!
ctrl-enter to compile the pdf so you can see what you're doing
That makes sense since time can be hours minutes seconds ms,etc
Agreed!
@carmine lava https://www.pyimagesearch.com/2015/09/14/ball-tracking-with-opencv/ This blog entry is a few years old, I hope it's not outdated. I think it's a nice first exposure to object tracking though! Regarding papers, it might be difficult to find something general to learn the basics from; you'll probably mostly find papers exploring specifics and trying to improve the state-of-the-art for difficult problems
@feral lodge thanks but i have seen it
Now what I am trying to do is give each object a unique ID and track it: when the object is in the frame it should get ID 1, and if it leaves the frame and comes back the ID should be the same. @feral lodge 😱 If you find something close to that, please let me know, I'd really appreciate it
Overleaf seems really cool so far! Thank you for the suggestion! But what's the catch... All this for free? I can store as many files as I want?
100 MB storage for starters, you can raise it to 1GB for free if you do their referral stuff. If there is some nefarious dark side to overleaf I haven't seen it yet 👌 @obtuse kettle
I see. Thank you again for this ^_^ time to learn this magic
@carmine lava I can't help you much, I have hardly touched object tracking 😖 I don't think I've ever seen work on recognizing and labeling certain individuals among some object class; only class labeling (i.e., labeling objects as a coffee cup rather than coffee cup #345 and coffee cup #346). I'll keep my eyes open though if I see something! What kinds of objects are you trying to label?
Person
I am training on persons, but the thing is when he goes out of frame and comes back the ID changes. My requirement is that he should keep the same ID
@Slandön# what are you using for detection? I mean which lib, Faster R-CNN or SSD, which one?
@feral lodge any tutorial on using Faster R-CNN?
Sorry my friend, I have no idea about the libraries! If there's a big difference you can probably find some benchmarks or discussions online
Same blog as before; he labels individual faces in this one it seems! https://www.pyimagesearch.com/2018/09/24/opencv-face-recognition/
Couldn't check very carefully though, duty calls 👺
@carmine lava the problem with individual people is that you need a good dataset. If you have a good dataset you can use TensorFlow's object detection API, which includes several faster_rcnn architectures
./m
Has anyone tried the book "Data Science from Scratch" by Joel Grus? I'm thinking of getting it, since I want to get some data science book and it seems to put more emphasis on actually understanding things rather than just learning to feed a black box
uhm also I might be missing something, but when I check out the recently pinned guide on r/learnmachinelearning, the PhD-version of it just includes no guide (?)
any book teaching you to feed a black box isn't worth your time or money
unfortunately I don't know that book in particular, nor do I have any sane route to data science, because my own personal path was so winding
but any good data science book will have both math/theory and applied content
my gripe with some "machine learning" books is they don't actually spend much time on applications
it should be a push-and-pull of: learn a concept, learn the math behind it, and finally learn to apply it
book exercises ideally would be both practice math/theory problems (e.g. derive some expression, prove a theorem), and practice mini-projects (simulate or download a dataset, then implement a model and make inferences on it)
Honestly you don't need to know fuckall about the math behind machine learning because of the big, beautiful packages set up by people. 99% of the effort is in data manipulation and experimental verification.
@crude flame I can highly recommend An Introduction to Statistical Learning by James et al. All the main models used in supervised and unsupervised learning are described. I found it very intuitive and with just the right amount of math to get started (if you want to go really technical you can check Elements of Statistical Learning, the advanced book that they have written).
and it's free!
you can find it there: http://www-bcf.usc.edu/~gareth/ISL/
application exercises are in R though... But it should be easy to replicate most of the results using scikit-learn
I think anyone who once had a basic stats course should be fine
These days I am very much into behaviour analysis, and psychology.
But, I am completely clueless. I don't know which course to take or which thing to follow.
I should have mentioned that I'm doing a PhD in mathematical physics, so I'm fine with technical stuff. Thanks for the suggestion!
@crude flame If you really want the foundations of the algos, Elements of Statistical Learning is where you wanna be. There's a free online Stanford class on it too, with guest lectures by like the actual guy who came up with CART and stuff. https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about
Learn some of the main tools used in statistical modeling and data science. We cover both traditional as well as exciting new methods, and how to use them in R.
@wide oxide If I were starting out now, I'd be doing DataQuest, I really like their material.
Is anyone here familiar with k-means clustering? I'm having trouble using a template in my exercise.
I might be repeating myself... but go check out the ISLR book, it's very well explained in there
@orchid lintel Let me check their material. I've completed like 2-3 courses on Datacamp for Data Scientist with Python.
Their material was more like learn python for data science
@wraith crow Most people here will be familiar with k-means so go ahead and ask your question
Oh my god
Who did I ping
Oh you changed your name
All good

Are you using sklearn's k-means implementation?
Hmm, the bot is not letting me upload the function
No, it's a function I have gotten from a template in the exercise
class KMeans:
    """ Simple K-Means implementation. Note that you can access
    the cluster means and the cluster assignments once you have
    called the "fit" function. The cluster means are stored in the
    variable 'cluster_means' and the assignments to the cluster
    means in 'cluster_assignments'. You can also use the function
    'assign_to_clusters' to obtain such assignments for a new set
    X of points.
    """

    def __init__(self, n_clusters=2, max_iter=30, seed=0, verbose=0):
        """ Constructor for the model.

        Parameters
        ----------
        n_clusters : int
            The number of clusters that should be
            found via the K-Means clustering approach.
        max_iter : int
            The maximum number of iterations (stopping condition)
        seed : int
            Number that is used to initialize the random
            number generator.
        """
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed

    def fit(self, X):
        """
        Fits the K-Means model. The final cluster assignments
        (i.e., the indices) and the cluster means are stored
        in the variables 'cluster_assignments' and 'cluster_means',
        respectively, see the end of this function.

        Parameters
        ----------
        X : Array of shape [n_samples, n_features]
        """
(Which course are you taking? @wraith crow )
Oh this looks like a python problem, let's make sure you instantiate KMeans first, as you're calling the method
Ah, I figured I might be calling it wrong?
Try doing
kmeans = KMeans()
kmeans.fit(data)
Oh my god, that was it. Thanks a lot!
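The original error message is not shown in the chat, so this is a guess at what happened: calling `fit` on the class itself instead of on an instance. A minimal sketch with a hypothetical `TinyModel` class shows the difference.

```python
class TinyModel:
    """Hypothetical stand-in for the exercise's KMeans class."""

    def fit(self, X):
        self.n_samples_ = len(X)
        return self


data = [[0.0, 1.0], [1.0, 0.0]]

try:
    # Called on the class: `data` is bound to `self`, so X is missing.
    TinyModel.fit(data)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Instantiate first, then call the method on the instance.
model = TinyModel()
model.fit(data)
print(model.n_samples_)  # 2
```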
Why there is no subject like that for us ;__;
I want to learn data science and I end up learning models but, no practical stuff, like implementing..
Oh one thing, can I also initialise it using the init before kmeans.fit(data)?
For example if I need some other parameters than the ones in the code, like KMeans(n_clusters=2, max_iter=30, seed=0).
Hmm seems to work
When I write it like
kmeans = KMeans()
kmeans.init(2,30,0)
kmeans.fit(data)
This is a bit of a weird situation
Normally the method init would actually be called __init__ which makes it a constructor - a method that is automatically called when you instantiate a new object, but in this case it isn't
If it was named that, you could instantiate like this:
kmeans = KMeans(2, 30, 0)
And it would automatically call the init function with those values
But since it's not named properly, the method has to be called separately (just as you have done)
Oh wait
I'm dumb
well, they write I shouldn't change the original cell which says:
def init(self, n_clusters=2, max_iter=100, seed=0, verbose=0):
""" Constructor for the model.
So I need a way to change the parameters for the different exercises without changing that one in the original cell.
It probably is called __init__, but Discord is making it look like that because the underscores translate to underline formatting, right?
no worries!
So what you've done is fine
It's 07:05 here 😕
But you can also do
means = KMeans(2, 30, 0)
And it will automatically call the __init__ function and pass those values
That's the intended purpose of the method
okay great, thank you! I've been stressing a lot over this, so you have been my saviour 😄
Seems like means = KMeans(2, 30, 0) doesn't work, because when I change it to means = KMeans(4, 30, 0) for example, it doesn't give me 4 cluster means, but still 2
Like this:
The variable on the second line is called means
But then you're fitting to kmeans
So they are different
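A small sketch of both points from this exchange: arguments given at instantiation are passed to `__init__`, and each variable holds its own independent object. `TinyKMeans` is a hypothetical placeholder whose `fit` just takes the first `n_clusters` points as "means"; it is not a real k-means implementation and not the exercise template.

```python
class TinyKMeans:
    """Placeholder class that only mimics the template's constructor."""

    def __init__(self, n_clusters=2, max_iter=30, seed=0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed

    def fit(self, X):
        # Placeholder logic: NOT k-means, just enough to show n_clusters is used.
        self.cluster_means = X[: self.n_clusters]
        return self


data = [[0, 0], [1, 1], [5, 5], [6, 6], [9, 9]]

kmeans = TinyKMeans()                 # defaults: 2 clusters
means = TinyKMeans(4, 30, 0)          # same as TinyKMeans(n_clusters=4, max_iter=30, seed=0)

kmeans.fit(data)                      # fits the 2-cluster object
print(len(kmeans.cluster_means))      # 2

means.fit(data)                       # fits the 4-cluster object
print(len(means.cluster_means))       # 4
```

Creating `means` with 4 clusters and then calling `kmeans.fit(data)` still uses the older 2-cluster object, which matches the behaviour described above.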
I need to catch some sleep now but feel free to ask questions here, very smart people lurk this chat
I have to go to sleep as well now, thanks a lot for your help! You made my day (or night)! 😃
Goodnight
@reef bone How did you learn data science?
I'm still very much in the process of learning, but I picked data sciencey modules in uni and was blessed with incredible lecturers that made the field very accessible to me and eventually ended up doing a deep learning dissertation in my 3rd year
How should a beginner start?
I have tried studying from DataCamp, Introduction to Statistical Learning, and many other courses like Andrew Ng's.
For DataCamp, I completed 3-4 courses and ended up learning nothing practical.
ISL: Completed 4 chapters, understood 25% of it.
Andrew Ng's course was good, but the assignments were different.
I've heard Introduction to Statistical Learning is a very good book. Apart from that however I don't think I'm the right person to answer this question, because I went the uni route so it was very different for me (learning from lectures, being driven by marked assignments, and the ability to ask for help directly in lab sessions). The book I learned the most from was Bishop's Pattern Recognition and Machine Learning - an incredible book, but a little math heavy and might be difficult to read for a beginner. I don't know much about those courses you have mentioned so I can't comment on them.
No problem, thank you!
This is what we are going to learn this semester in Mathematics:
Helpful for DS?
Looks like you'll get a thorough introduction to regression which is an important technique
Linear regression on its own is powerful
And then logistic regression introduces you to the sigmoid curve which forms the backbone of neural networks (although nowadays the ReLU activation seems to be more popular as it learns faster)
I think you'll learn a lot and you'll also get a good mathematical background that will pay off once you look into more complicated things
Out of interest, which level of education is this?
I am in 2nd semester of B.E. Computer Science Engineering
The thing is we are learning nothing more than formulas. The professor taught us how to calculate central tendency for individual, discrete, and continuous series but, didn't tell us where to use which one. No concept of outliers and nothing ;_;
Few commands in R and Python can calculate the same things
That's annoying, but I'm a firm believer that having a good grasp on the mathematical background pays off
I love Mathematics so I try to find things by myself
And you seem to be very driven on your own which is great
In university you shouldn't be afraid to go speak to your lecturers, maybe there is a reason why some things aren't covered yet, and I'm sure the lecturer wouldn't mind setting up a meeting with you and going over those things in more depth
I really need to get some sleep now, I wish you luck on your journey!
Thank you!
@wide oxide Awesomely useful DS code snippets: https://chrisalbon.com/
@reef bone Can I ask you again?
Sure, no guarantee I'll have the answer though
I just did the K-means clustering for a simple data set, and now I have to do it for an image.
I'm wondering if this step is critical or not:
data = china / 255.0 # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape
Using the example from: https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
getting an error with reshape
Hmm
data = china / 255.0 so seeing as all values in china will be in the range 0 - 255 (that's a range you'll see quite often, since these are the numbers we can represent using 8 unsigned bits), dividing them by 255 will rescale them to 0 to 1 range, which in this case is only useful for the visualisation part (it basically becomes a coefficient we can multiply other values with easily)
data = data.reshape(427 * 640, 3) in here, we start with data being a 3D array with dimensions (427, 640, 3), so it's an image 427 pixels in height, 640 pixels in width, and with 3 colour channels (r, g, b)
By calling reshape(427 * 640, 3) on it, we retrieve a 2D array, kinda like this:
x x x
y y y
z z z
Turns into:
x x x y y y z z z
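To make the reshape concrete, here is a small sketch that uses a random array as a stand-in for the `china` image (the real one comes from the linked handbook example). It shows that the result is a 2D array with one row per pixel, and that reshaping back gives the original layout.

```python
import numpy as np

# Stand-in for the image: random 8-bit RGB values with the shape from the example
rng = np.random.default_rng(0)
china = rng.integers(0, 256, size=(427, 640, 3), dtype=np.uint8)

data = china / 255.0                # rescale to the 0...1 range
h, w, c = data.shape
flat = data.reshape(h * w, c)       # one row per pixel, one column per channel
print(flat.shape)                   # (273280, 3)

restored = flat.reshape(h, w, c)    # undo the flattening
print(np.array_equal(restored, data))
```

Pixels are laid out row by row, so the pixel at `(i, j)` ends up at row `i * w + j` of the flattened array.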
Okay, and that's important for the "X : Array of shape [n_samples, n_features]" requirement right?
that sounds right
i probably can't explain the error you're getting without seeing more of the code
mainly what the variables hold
it comes from trying to reshape an array in a way that's not possible
Looks like this one is the issue:
cph_image_tiny_recolored = new_colors.reshape(cph_image_tiny.shape)
Yeah, it's just the reshaping that's not working
How would you reverse the reshaping of:
data = data.reshape(166 * 250, 3)
after it was processed by the function?
Yeah
can you print the shape of cluster means and cluster assignments
i'm trying to understand the entire process of how that's supposed to work
i don't think the means are particularly useful here
"TODO: Segment the image by searching for 5 clusters
in the RGB space (see slide 21 of L13); use
'max_iter=5' and 'seed=0' as parameters."
when you do print(CA.shape), what do you get?
This is the slide they refer to:
I get (41500,)
Then #cph_image_tiny_recolored = new_colors.reshape(cph_image_tiny.shape) should be the correct idea right? Try to get the cluster means from the colours picture. But are the assignments missing there?
sorry i'm lost
i'm very confused about why they make you reshape the image data in the first place
if I don't reshape it before calling the function, the shape of cluster means become (5, 250, 3)
If it's reshaped
"X : Array of shape [n_samples, n_features]"
then data have n samples (pixels) and 3 features
but that reshape was from the website I linked, so I'm not sure it's needed
Hmm, it does say consider each pixel as a point in R^3..
@reef bone
oh right
can you do
on the line where you have new_colors = CM do new_colors = CM[CA]
that should give you an array of shape (41500, 3)
and then you can reshape this back to (166, 250, 3)
nice! try playing with the K number (n_clusters) and it should look better as you go higher
sorry I took so long I was a little confused about what they want you to do
basically with new_colors = CM[CA] we are using the assignments as indices for the means, so to each sample (pixel) we're assigning its mean in the 3D colour space
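A small sketch of that indexing trick, with random stand-ins for the cluster means (`CM`) and cluster assignments (`CA`), since the real ones come from the course's k-means function:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, k = 166, 250, 5
CM = rng.random((k, 3))                  # stand-in cluster means, one RGB colour per cluster
CA = rng.integers(0, k, size=h * w)      # stand-in assignments, one cluster index per pixel

new_colors = CM[CA]                      # fancy indexing: row i becomes CM[CA[i]]
print(new_colors.shape)                  # (41500, 3)

recolored = new_colors.reshape(h, w, 3)  # back to the image layout
print(recolored.shape)                   # (166, 250, 3)
```

The single indexing expression does the same job as a loop over all pixels that looks up the mean for each assignment.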
Yeah, it's a great solution
numpy can be a little esoteric sometimes (intuition would tell you to use a loop here) but once you learn the ins and outs it's an amazing tool
Should it be easy to select a random subset to fit the k-means model?
TODO: Segment the image 'copenhagen.jpg' in a similar
fashion (using n_clusters=16, max_iter=5, seed=0). Instead
of using all data points/pixels, consider a random subset of
size 5000 to fit the k-means model and to find suitable cluster
centers. You can use the numpy.random.choice function to select
a random subset of indices (without replacement).
Ah, otherwise it has to fit for 300k points 😛
sure, that should be fairly easy
ok so numpy.random.choice() only works for 1D arrays, and we want to draw from a 2D array
so we can grab the indices from the 0th axis (samples) like this
indices = np.random.choice(data.shape[0], 5000, replace=False)
and then slice into the data array with the indices like this
chosen_data = data[indices, :] that will select only the randomly chosen indices
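A minimal sketch of the subset selection, assuming `data` is the flattened `(n_pixels, 3)` array (a random stand-in here). `np.random.choice` needs the number of rows, an integer, as its first argument rather than a row of the data, and `replace=False` gives sampling without replacement as the TODO asks.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((272640, 3))   # stand-in for the flattened image

# Passing an integer n draws from range(n), so the result is a set of valid row indices
indices = np.random.choice(data.shape[0], 5000, replace=False)
chosen_data = data[indices, :]   # keep only the randomly chosen rows
print(chosen_data.shape)
```

Drawing from the values in `data[0]` instead would return floats, and floats cannot be used as indices.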
sorry I have to run off now so I don't have time to go over this in more detail
but you should be able to figure out the rest
Okay thanks, I will try!
if you run into any problems just ask here and someone will help I'm sure
numpy also has fairly good docs so feel free to refer to this if you're having trouble understanding it
Thank you again! 😃
get a "arrays used as indices must be of integer (or boolean) type"
I think I got it
Hmm, how can I reshape it back into the full image?
from the 5000 points
Hmm, I only have 5000 assignments, but I need 272640 to assign every pixel.
@reef bone Are you back?
Does your KMeans class have a predict method? Or similar? You want to fit using the subset (5000 samples) and then predict the closest cluster for each data point (all samples)
I think they want you to use this method
This is a bit of a struggle I'm on my phone
You want to pass it all data (X) and the means given by your fitting
And it will return the full assignment indices which we have worked with before
Hmm, so I can use assign_to_clusters(data,means) where means is the one I got from using the full function?
on the 5000 points
Yes

It will return the full assignments
So you want to store them in a variable
assignments = kmeans.assign_to_clusters(data, means)
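The course's `assign_to_clusters` isn't shown here, so the function below is only a hypothetical stand-in with the same name: it assigns every sample to its nearest mean. The means are a placeholder too (in the real exercise they would come from fitting k-means on the 5000-point subset).

```python
import numpy as np

def assign_to_clusters(data, means):
    # Hypothetical stand-in for the course helper of the same name:
    # squared Euclidean distance from every sample to every mean, shape (n_samples, k)
    dists = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
data = rng.random((20000, 3))    # stand-in for all pixels
subset = data[rng.choice(len(data), 5000, replace=False)]
means = subset[:16]              # placeholder for means fitted on the subset
assignments = assign_to_clusters(data, means)
print(assignments.shape)         # (20000,)
```

The expensive iterative fitting only sees the subset; the full data goes through a single nearest-mean pass, which is why this should be much quicker than fitting on every pixel.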
Trying run it now
Seems like it uses a lot of computation though, which seems like I'm missing the point of using 5000 points
Hmm, still running
it never finished, maybe it ran into an infinite loop. I'll just write a comment and use the full data when plotting the image
oh.. yeah that was a mistake. I'll try to run it again
it worked, and was much quicker
it's non-linear regression
Hi guys, I wanted to start studying Data Science with Python or R through DataCamp. I know coding is about skill development and not where you study, but I wanted to ask if that website is good enough to get at least a good grasp on what Data Science is like.
@lapis sequoia never tried it , but their podcast is fairly good and the host seems competent enough
@wraith crow hard to say what you should do without the additional code you've been provided
This is the 'dummy version'
this is most of the function
not sure where and how to construct that new equation
TODO: Implement the non-linear regression approach;
generate corresponding plots for sigma=0.1,
sigma=1.0, and sigma=10.0 by computing, for each
xbar in X_plot, the corresponding prediction
@reef bone Do you remember PCA well?
I have some idea
PCA is a fairly complex algorithm in comparison with kmeans, are they asking you to implement it yourself or are you using some library?
We have some templates available
Does this one seem fitting?
I just don't see training data in that example
PCA is an unsupervised algorithm so training data will be similar to what you had with clustering
Sorry I can't go over the entire thing with you today
The code you have shown looks good, most importantly it lets you extract the eigenvalues and eigenvectors
Because they are sorted by eigenvalues, the components that describe the data the most will be at the top
But they used "data" as input there, but I have 4 different 'data' as in trainset, testset, trainlabels and testlabels
Looks like you're passing something called diatoms to the pca function
By definition PCA ignores labels, it's unsupervised
The labels might be useful later on, for example PCA can sometimes be used as preprocessing technique to reduce dimensionality before you implement some kind of a classifier
But PCA itself only decorrelates data and reduces its dimensionality
So if I should try to run
def pca(data):
    # Extract data dimensions
    d, N = data.shape
    # First, center the data
    center = np.mean(data, 1)
    centers = np.matlib.repmat(center, N, 1)
    data_cent = data - np.transpose(centers)
    # Compute covariance and its eigenvalues from centered data
    Sigma = np.cov(data_cent)
    evals, evecs = np.linalg.eigh(Sigma)
    # Return eigenvalues and eigenvectors and -- for the sake of the lecture -- also the centered data
    return np.flip(evals,0), np.flip(evecs, 1), data_cent
PCevals, PCevecs, data_cent = pca(testset)
PCevals is a vector of eigenvalues in decreasing order. To verify, uncomment:
print(PCevals)
PCevecs is a matrix whose columns are the eigenvectors listed in the order of decreasing eigenvalues
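A sketch of what can be done with those return values, using a compact re-implementation of the template and random toy data (both are stand-ins; the real input would be the `(d, N)` array from the exercise):

```python
import numpy as np

def pca(data):
    # data has shape (d, N): one column per sample, as in the template above
    data_cent = data - data.mean(axis=1, keepdims=True)
    evals, evecs = np.linalg.eigh(np.cov(data_cent))
    # eigh returns ascending order, so flip to get decreasing eigenvalues
    return evals[::-1], evecs[:, ::-1], data_cent

rng = np.random.default_rng(0)
toy = rng.normal(size=(4, 200)) * np.array([[5.0], [2.0], [1.0], [0.1]])

PCevals, PCevecs, data_cent = pca(toy)
explained = PCevals / PCevals.sum()         # share of variance per component
projected = PCevecs[:, :2].T @ data_cent    # coordinates on the first two components
print(explained)
print(projected.shape)                      # (2, 200)
```

No labels appear anywhere in this, which matches the point that PCA is unsupervised.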
do you have an idea of what data should be?
Take a look at what diatoms is in the template code
It will be defined in one of the previous cells
Hi guys
I need help
Im new to machine learning
But cant fig out from where to start
I want to learn ML
Please help me
I have 0% knowledge about machine learning
@native rivet I'm not certain on exactly where to start since I barely got into data science myself, but I was told DataCamp machine learning courses are solid
Hey there everyone. I have a Pandas related problem and would be glad if someone can help me out. I posted this question on stackoverflow:
https://stackoverflow.com/questions/54288604/applying-a-function-which-involves-multiple-boolean-operations-on-multiindexed-d
I keep trying to solve it for the past hours, but no success
@native rivet DataCamp's "Become data scientist with python" course is more like "Learn Python for data science"
I did 4 courses on it and left it
Nope
@native rivet Depending on how much math you already know and how much you can handle, there are different courses. For an understanding of the principles behind ML, Andrew Ng's course on Coursera and one by Columbia University on edX are nice. The latter is heavier on math than the former. There is a ton of other material and courses too
I am good with high school level mathematics
Can you just teach me basics which req to ml
After that i can enrol udacity nano degree
@native rivet Andrew Ng's course is manageable with very little math knowledge. He even has a couple of lessons on matrix multiplication and such and then covers the required math as he goes through. But learning math on your own from any source would be good for a better understanding
Are numpy.append, numpy.stack manipulations slower than normal lists? Am I supposed to be using normal lists and append, and turning that into a numpy array using numpy.array?
@small ore its not with python
There are codes available that implement the same on Python
@wide oxide Some statistics and probability would be good but if you are sharp enough you can manage by going through some YT videos
@native rivet Well, you said from scratch. So given the kind of courses I have seen it is best to take a course that isnt tool specific and then take a small course/read a tutorial for python modules
@small ore We are studying this, this semester.
Udacity will not take me from scratch?
I have no idea abt the Udacity course
course will start in February
@wide oxide Those topics in your Math course will certainly help
Also there are old columbia courses you can audit if you dont want to take the live course. See 📌 for a link
Thank you very much!
Do you have something for data science? @small ore
or the same is good to start with it as well?
There is a tonne of material. Courses/Texts/online material/free datasets/Blogs/etc. But good to start with learning the math and then basics of ML. Data science as far as I understand has many elements to it. Data exploration and ML should start you out well. You can learn data exploration when you start to learn some python ways of doing it
I will start with the course then. Thank you very much!
I want to get into behavioural analysis and higher mathematics.
Is it important to understand these functions?
It depends on what side of ML you want to be on
there's the data science side where you don't need to know the math and treat each algorithm as a grey box, knowing what inputs / outputs / assumptions you make feeding / getting results from each model
I am just starting
then there's the development side where you're developing faster or better ML algorithms where it's super important
Obviously if you understand how the algorithm works you will know what's happening much better
From where I can learn this level Mathematics?
Calculus 1-3, Differential Equations, Statistics, Stochastic Calculus
they're Freshman - Jr year math
stochastic calculus, depending on the course, can be a grad level class
you'll also probably want to take classes on econometrics
MIT OCW is a good place to start
introduction to statistics
Look for statistics for engineers probably
Last one has no lecture videos
@void anvil what do you think about this?
I have completed 5 videos of this before
did assignments as well (in semester breaks) and then semester started so had to leave
I was able to understand 60-70% of the lectures
and was able to solve 50-60% of the assignments
Do you still think that I should follow the same course?
(BTW we're learning this : this semester)
hey, could anyone recommend a reliable source for any sort of data? The actual contents of the data doesn't really matter as long as I can scrape it easily or get it through an API
thank you! <3
Omg, thank you @reef bone, but been looking for someplace to get some data to play around with more.
World Bank Open Data from The World Bank: Data
not "data science", but can someone direct me some resources on how matplotlib does its thing? like things will only work if you assign them to a variable
seems like an odd design
?
hello, does somebody here have experience with openai gym?
i have a question about it, i cant run ale anymore because my ubuntu doesnt start anymore since i've installed my new gpu,
if somebody knows, does the space invaders environment give the current score as score or the delta between the last step?
is this related to python
you dont have to ping me
Hi guys! I am trying to figure out how to select candidate features to feed into an gradient boosted tree model (I am using XGBoost). Blindly feeding all my candidate variables doesn't do the trick as it significantly increases the search space and seems to deteriorate the validation score compared to a carefully selected list of candidate features. My idea was to train the model on all potential features and then make a selection based on feature importance (i.e. discard all the variables that don't contribute too much). What do you guys think of this approach? (asking here because somehow I can't find anything online on that topic)
I'm not super sure what you're talking about but it kinda sounds like pca
PCA transforms the variables which I would like to avoid.
I'm trying to plot a groupby on a pandas DF with seaborn but I notice I have to use .head(n=1000) to get it to work. If I omit the .head() I get an error saying the SeriesGroupBy isn't callable. Here's the full code:
names = ["Sepal L","Sepal W","Petal L","Petal W","Class"]
irises = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data",names=names)
df = irises.groupby("Class")
sns.countplot(x=df["Class"].head(1000),label="Count")
plt.show()
irises.groupby("Class") is a pandas group by object not a dataframe. You need to apply something to it to get a proper dataframe back. Apparently head works which I did not know. Try .head(len(df)) lol
@languid adder A count plot by nature will automatically do what you are trying to do with pandas; count the number of classes. You can simply do this.
sns.countplot(x='Class', data=irises)
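To see the difference without plotting anything, here is a pandas-only sketch on a tiny made-up frame (a stand-in for the iris data). It shows that `groupby` returns a group-by object rather than a DataFrame, and that applying an aggregation such as `.size()` yields the per-class counts, the same tallies a count plot would draw.

```python
import pandas as pd

# Toy stand-in for the iris data
irises = pd.DataFrame({"Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"]})

grouped = irises.groupby("Class")
print(type(grouped).__name__)     # DataFrameGroupBy, not a DataFrame
counts = grouped.size()           # applying an aggregation gives a Series back
print(counts)
```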
ah great @late garnet that's what I was after
.head is in the official documentation of the group by object so that's why I tried it. That's also why I was confused I couldn't work with it like a DF
I'm trying to clean up some data and was wondering if this is a good way or if there is a better way
I want to create a new column in my DF that is based upon 2 other columns:
def splitIntoName(value):
    arr = value.split(", ")
    if len(arr) > 1:
        return arr[1]
    return None
df["Name"] = pd.Series([splitIntoName(x) for x in df.description.dropna()])
df["Name"].fillna(df["OtherName"],inplace=True)
with a numpy array of floats x and a threshold float f, how would I keep x the same array as before, but with any value below f turned to 0?
@languid adder that seems like a reasonable way to go
err wait no
df["Name"] = pd.Series([splitIntoName(x) for x in df.description.dropna()])
this line is bad, dont do this
df["Name"] = df["description"].map(splitIntoName, na_action='ignore')
do this instead
also dont use inplace
so do this:
df["Name"] = df["description"] \
.map(splitIntoName, na_action='ignore') \
.fillna(df["OtherName"])
@spark nimbus you arent allowed to think when you write python code, otherwise you think too hard and ask silly questions ;)
x[x < f] = 0
that actually works?
why wouldnt it
have you read the numpy subsetting docs?
which are admittedly not easy to read
Most of the following examples show the use of indexing when referencing data in an array. The examples work just as well when assigning to an array. See the section at the end for specific examples and explanations on how assignments work.
b = y > 20
y[b]
etc
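A tiny sketch of the mask assignment on made-up numbers, in case it helps to see the intermediate boolean array:

```python
import numpy as np

x = np.array([0.2, 1.5, 0.7, 3.0])
f = 1.0

mask = x < f        # boolean array: [True, False, True, False]
x[mask] = 0         # assignment through the mask modifies x in place
print(x)
```

Only the positions where the mask is True are overwritten; everything else in `x` is left untouched.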
@desert oar thanks. I'll dive deeper into the map function as I seem to do a lot of these types of mappings so it might be more useful to use .map instead of the construct with pd.Series([])
@languid adder the bigger problem with your code is the .dropna() will make the whole thing misaligned
but yeah .map and .apply are there for a reason. use them
like you could do...
pd.Series([foo(x) if pd.notnull(x) else x for x in df['y']], index=df.index)
or
df['y'].map(foo, na_action='ignore')
i know which one i prefer 😉
yes, the latter is more readable
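The misalignment is easy to reproduce on a toy frame. The data and the `bad`/`good` column names below are made up for illustration; `split_into_name` mirrors the helper from the question.

```python
import pandas as pd

def split_into_name(value):
    parts = value.split(", ")
    return parts[1] if len(parts) > 1 else None

df = pd.DataFrame({"description": [None, "Smith, Anna", "Jones, Bob"]})

# A new Series gets a fresh 0..n-1 index, so after dropna the values shift up a row
misaligned = pd.Series([split_into_name(x) for x in df["description"].dropna()])
df["bad"] = misaligned

# map keeps the original index and skips missing values
df["good"] = df["description"].map(split_into_name, na_action="ignore")
print(df)
```

In the `bad` column "Anna" lands on the row with the missing description, while `good` keeps every name next to the row it came from.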
How do you deploy your models? I’ve been experimenting with Flask. Looking to see what others are doing.
Seems like something that belongs in #web-development ?
so this is a bit of a "meta-data-science" question so i hope people wont jump at me 😃 I want to support a small open source data science community and look for the best team collaboration tool (or combination of tools - chat, file sharing, planning etc). do people have some recommendation? I guess discord is not sufficient for that.
We aren't a data science server but I can answer as a large Python community:
Chat: Discord, GitHub Discussions
Collaboration: GitHub mainly, for some internal things we do use Dropbox as well
Planning: We use GitHub Issues and Projects (Projects are very similar to trello but link very well to GitHub repos & issues)
thanks @south quest thats quite reassuring that this combination works well (I'd like a small number of tools)
Yeah, we don't really have a huge number of tools and it has served us well since our move to GitHub. We now use Azure for CI, GitHub for basically all project management and Discord for development chat
- discord has some nice github webhooks so it all integrates nicely
I am new to discord but I quite like what I see so far, esp the python server looks super well organized so I'll be copying some best practices 😃
@worldly sigil flask works well for my team but i have built up quite a bit of additional structure around it over time
If you are just receiving and serving JSON any half decent web framework will do. I like flask because it has been around for a while so it is fairly well tested in the field, but it's also generally simple to just get up and running
@desert oar that's awesome to hear. Yeah, right now it's just receiving and serving JSON so I'll keep it simple til I need something more tailored to my scenario. thanks for the insight.
@worldly sigil there is the lean way of flask and the fat way of django+drf. the good news is that once you adopt a rest/json architecture its really easy to switch
Noob question. Why do you guys use flask/json for data-science?
It's about deploying models to the real world like when you built an auto complete model you want to provide it as a service somehow
how can you improve resource utilization when optimizing or training a model.
For example i have following code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
paramGrid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
forestReg = RandomForestRegressor()
gridSearch = GridSearchCV(forestReg, paramGrid, cv=5, scoring="neg_mean_squared_error")
gridSearch.fit(housing_prep, housingLabels)
this runs a long time as it needs to do a lot of iterations however I notice my CPU never goes over 10%
so I was wondering how I can give it more resources so it's done quicker
found out the n_jobs parameter does just what I wanted. Much quicker now 😃
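For anyone reading along, a minimal sketch of the same idea on synthetic data (the data set and the reduced grid are stand-ins, not the housing data from the question). `n_jobs=-1` asks scikit-learn to spread the cross-validation fits over all available cores.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in data
X, y = make_regression(n_samples=200, n_features=8, random_state=0)

param_grid = {"n_estimators": [3, 10], "max_features": [2, 4]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
    n_jobs=-1,   # use all available CPU cores for the fits
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Each parameter combination and fold is an independent fit, which is why a grid search parallelises so well.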
another question about performance...
I have a 6 core i7 processor. Is it best to use hyperthreading or not? At the moment it's disabled so I have 6 threads but was wondering if enabling hyperthreading would improve performance?
it's better to run it on the cloud :v
$$$ 😛
Hello chaps! I'd like a few opinions on something if you can: I have rather heavy CSVs (from 200MB to above 2GB) that come from a cloud provider and contain the billing information. How would you go about storing this information and extracting data from it? Put it into a postgres DB and run custom python code alongside? Send it to an ELK stack? Something else? The idea would be to find the biggest cost centers and see what I can do with that.
how many of those CSVs do you have that you need to coalesce?
if they total more than your memory, your options are spark, dask, or some out of memory dbms
as far as out of memory dbms go, I've been in love with Clickhouse recently so i'll shill that
@languid adder have you considered using Dask added onto your Scikit-Learn code? You seem like you're digging for better performance, and Dask (or Pyspark) are a pretty fantastic blitz of performance.
I haven't. Working my way through Hands on machine learning with scikit-learn and TF at the moment
will have a look at it afterwards, thanks for the tip!
@violet bison bigquery
why do people complicate things so much.. use something that doesn't have you running around trying to find more resources everytime.. focus on the objective
thank you chaps, I'm in the EU and personnal google cloud accounts cannot be done, also, I have 20 GB of ram, it's cool that all the CSVs should fit into it, but thanks for the heads up
ooh snap GCP private accounts works now...
nothing wrong with postgres
like why even bother loading 20 CSVs into memory if you just wanna do some queries and aggregates
spark? why
apt-get install postgresql, youre up and running in 10 minutes
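A rough sketch of the "load it into a database and aggregate" route. It uses SQLite from the standard library instead of postgres only so that it runs without a server, and the `service`/`cost` columns are made up since the real billing schema isn't shown.

```python
import csv
import io
import sqlite3

# Toy stand-in for a billing export; the column names are made up
raw = io.StringIO("service,cost\ncompute,12.5\nstorage,3.0\ncompute,7.5\n")

conn = sqlite3.connect(":memory:")   # SQLite here only so the sketch runs without a server
conn.execute("CREATE TABLE billing (service TEXT, cost REAL)")
rows = ((r["service"], float(r["cost"])) for r in csv.DictReader(raw))
conn.executemany("INSERT INTO billing VALUES (?, ?)", rows)

# Biggest cost centres first
top = conn.execute(
    "SELECT service, SUM(cost) AS total FROM billing GROUP BY service ORDER BY total DESC"
).fetchall()
print(top)
```

The rows are streamed in one at a time, so the whole CSV never has to sit in memory, and the aggregate is a single GROUP BY query.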
@violet bison
grab a 1bn row sample and analyze it in R using data.table
can do that on a laptop
source: i have done it on a laptop
I don't know R that's the thing, but I just wanted a few graphs and a tool capable of ingesting csv
Any one using Intel movidius
is R so much better than python for DS and ML?
For machine learning specifically, python is better because it's more of a general purpose programming language
For data science in general, R is a great tool to have at your disposal
Also machine learning libraries tend to be written with python in mind nowadays
R is better for many kinds of statistical analysis
"Better" being my subjective opinion
The R data.table library is ridiculously efficient at large in memory data processing
I haven't tried a comparable task with pandas, but im not sure i trust it for that purpose
ok so that makes sense 😃
so for someone starting out in the ML/DS landscape, it is worth having both tools at their disposal?
sorry to bump in
anyone has any ide ahow to make an efficient RMSE in torch?
there are some version but its for loss functions
im just trying to get the RMSE from two tensors
@languid adder learn one, then see about picking up the other once youre comfortable with one
@serene veldt you wanna use RMSE as an input to a layer?
nop, its not for networks
i have 2 tensors
1 expected values 1 results
wants the rmse of that
there will be no propagation whatsoever of that value
@desert oar
@languid adder R is for people in finance.. it's better for small data.. and time series
it sucks for network analysis in particular
plus.. R studio costs a few hundred dollars for licensing
I think R is used the most in academic statistics
It has many more packages available for that than Python has
More and more are being published in Python and there's a trend to also publish one for Python these days
R is for people who can't code
R is less and less in favour outside academia because it's a lot of work (sometimes rewrite in python) to productionise
thx for the info guys. I ordered "An introduction to statistical learning" which uses R to show the examples. I wasn't planning on learning R and my intention was to code the examples in python instead so I'll stick with that plan
FYI I'm working for a major software company and our team needs to focus on DS/ML more so that's why I'm retraining myself...
I disagree with the notion of @chilly shuttle I feel it's too much gatekeeping "programming"
I don't care for such exclusive sentiments
R is fine as a tool and a lot of people use it
It's just not the right tool for every job
eh, I stated the business reality
if you're in academia R is fine
as for commercial settings, I increasingly shy away from taking on data scientists that don't know anything except R
R is widely used here in commercial settings as well, so I don't think it's a business reality
how do you productionise your R code?
It's one of the main requirements on job advertisements for professional statistics here.
That's usually not the main goal of R
As I said, it's not the right tool for every job
why take on that burden in a commercial setting when you can take on data scientists who can do both in one shot?
Maybe I don't have a software development application in mind, but that doesn't mean that "R is for people who can't program"
Because not every project is a software development project
I know it's easy to only see the big data and machine learning aspect of it, but there's a whole world of people whose job it is to research rather than develop
For that, R is a very nice tool and it offers much more tools than Python does
Also Rstudio has a free version that works fine
A lot of cutting-edge statistical models have no implementation in Python at the moment
like what
New developments in regression trees, several multidimensional scaling techniques, and most of the other models currently being developed in statistical science (as opposed to machine learning), as a lot of those researchers publish their packages in R. Some have started to switch to Python, but there's still a distinction between smaller datasets (like used in statistical learning) and larger dataset (more machine learning perspective)
Now you purposefully misrepresent what I said.
That those models are developed in academia doesn't mean that they're exclusively used in it
My research department does a lot of consultancy work for external organizations that use those models
i'm seriously curious, can you provide one anecdote?
Sure
we've had to do non-bread and butter stuff maybe twice last year for external clients, and we're one of the biggest consultancies that exist. But I can imagine more specialized consultancies picking up the more exotic stuff
One of our students recently started an internship with the Dutch Ministry of Social Affairs and Employment to use those statistical learning models to predict which companies should be inspected for "employing" illegal immigrants/victims of human trafficking.
All of that is done in R
and they couldn't address it with ML because?
It's basically the scale of the data you have; the datasets are smaller than what you'd normally use for ML
The distinction is a bit vague, though
i should rephrase that as, 'they couldn't use existing techniques because?'
Because techniques evolve constantly. It's basically still in its infancy
but that sounds like academia
The development of the techniques is, but the application of the techniques that stem from it is not
and I'd be willing to take a wager that this student is a stats/similar academic who isn't CS proficient at programming?
The student isn't, but the people who currently run the project are (and who are currently already using R for it)
I see
Now, honestly, I would really like to see the use of Python spread as that would eliminate the delay in the spread of new techniques
Because, as you said, R is academia focussed
But, it's not just for people who can't program
anyway, my observations from fairly high up at a big4 consultancy:
we hire less and less data scientists who can't write code (i.e. R only)
r&d type work that can be done by pure stats folk is tiny in volume compared to commodity data platform deployments and the like
data science is increasingly becoming commoditised where creating a complex model is now a couple of clicks in a GUI. What isn't point and click at this point is data engineering
Now, another point is whether the use of R in academia is a good thing
I'd rather that we all switch to Python
that is happening at the university i'm affiliated with (as part of a broader theme to add software engineering concepts to pretty much every discipline). Not sure how widespread that movement is though
As the stuff we do in R isn't less complex than what you'd have to write in Python, but R is not a very nice language (IMO) and Python is a general programming language.
The biggest issue is that there's a lot that's only implemented in R at the moment
The newest techniques for Multiple Imputation have very nice implementations in R (e.g., MICE is one of those packages), but support in Python is incomplete
yeah I'm not disagreeing with that
there's a clear trend of <thing> gets invented, reference implementation in R, stable python implementation 1-2 years down the track
it's just not something with a huge job market
But, as you say, I'm a lot more focussed on applying research than to software development
Yeah, probably
Still, we have about 20 master students a year and they all receive multiple job offers before they even graduate
So, I guess the supply part is also relatively small
and some of them will be picked up by employers like mine just to pad out the credentials on proposals
and never do a day of enjoyable work in their life
i die a little every time i see a PhD stats person being put on a project to do fucking.. excel mangling
I guess here most students are gobbled up after they finish the master of science and before a PhD
I can probably dig up a list with the kind of projects I'm talking about if you like
But I need to do some work first
(and I have to check which projects are confidential first)
its ok they sound like the kind of r&d projects we do once or twice a year
last one was modelling road network utilisation after a transition to self-driving cars, back when people still thought self-driving cars are about to take over the world
Most of the stuff we do is smaller scale and there's a lot of (semi-) public sector work (if that term makes sense in English)
It's not really related to what I do, though, so I have to look up which projects we have at the moment
R studio does have a free version.. but if you work at a company, you still have to fork over $ for a license..
R was nice.. manageable.. not for large data definitely.. I had to drop it because my python skills are very domain specific and being the dumbass that I am I didn't want to confuse syntax and lose progress v.v
you also have to fork over $ to monitor compliance for said licensing
You can use the open-source RStudio (license AGPL v3) for commercial purposes without the commercial license.
soooort of
you can use it as an individual or a small company for free
as a large company, the risk associated with taking on foss for critical work far outweighs the cost of paying the commercial licensing
the main reason foss is shied away from in large enterprise is because when something goes wrong, you want to have a support contract with the vendor so you can blame/sue them
R was my first and only programming language, so when I think about writing a script my mind goes to R. Is there any book, video, tutorial, whatever that I can use to grasp the Python way?
I want this book.. but it's not released yet.. these guys seem to be taking forever
R is a programming language in only the loosest meaning of the term. Investing the effort to learn python or lua or c pays huge dividends because every subsequent programming language is quite familiar
i found a machine learning course on coursera but it used python 2 and graphlab... is that still worth following? Considering the examples aren't really focussed on python 3 and scikit-learn or tensorflow?
What are your objectives? Learn what is behind ML or to learn the packages?
both actually... If a course is using outdated packages (no commits in over 2 years) then I assume the course might be outdated as well?
The one by Andrew Ng on Coursera uses Matlab/Octave. It is just about what is behind ML, but uses Matlab/Octave for the exercises and assignments. These are easier than Python in some ways, and to the point enough that you only need to know what goes in and what comes out and infer from it. There is a more mathematical approach in a Columbia University course on edX; it does not use any tool. See the pins for a link. There is another edX course (I think from Microsoft) which teaches you to do ML using Python (it uses Jupyterlabs on Azure). It uses basic libs like pandas, matplotlib and seaborn
And scikit-learn
yeah I noticed the course from Andrew Ng (co-founder of Coursera) but was looking into a python course so I could combine the theory + practical knowledge at the same time
I suggest taking two different courses for these. The ones that teach the packages don't cover what goes on behind the scenes or why you are doing what
Or at least does not do it well
so which one do you suggest for the theory?
Andrew Ng covers more neural networks (deep learning) and doesn't seem to cover things like linear regression, classification and so on
Andrew Ng starts with linear regression and classification and then goes on to neural networks (which is again a way of solving the regression/classification problem)
ah then I must be looking at a different course
If you want a more involved mathematical approach to get the intuitions behind each method, the course from Columbia is good. I suggest going with that course if you can manage to do some reading on the side from a standard text.
link to the course? Can't seem to find it
Which one?
the one from columbia
afaik this aint the place to ask abt regex
Which channel?
One of the help channels please
Alright, thank you
Any A.I or ML server on discord??
think you'll have more luck on slack for that: https://towardsdatascience.com/15-data-science-slack-communities-to-join-8fac301bd6ce
i don't find much difference between discord and slack... I mainly use discord for hobby specific things and slack for professional things 😛
Hello, I have this numpy array :np_array = np.array([1, 2, 3]) I was wondering what is the most performance efficient way of obtaining the int 123 from it ?
np_array[0] i think. If you want the 1
Im not that experienced but thats what I learned so far
`int("".join(map(str, np.array([1, 2, 3]))))` perhaps?
You could always try the performance difference between options with the timeit module
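A rough sketch of how the two approaches could be compared with timeit. The helper names `via_string` and `via_powers` are made up for illustration, and the arithmetic version assumes every element is a single non-negative digit.

```python
import timeit

import numpy as np

np_array = np.array([1, 2, 3])


def via_string(arr):
    # Join the digits as text, then parse the result as one int.
    return int("".join(map(str, arr)))


def via_powers(arr):
    # Weight each digit by its power of ten: [100, 10, 1] for three digits.
    powers = 10 ** np.arange(len(arr) - 1, -1, -1)
    return int(arr @ powers)


for func in (via_string, via_powers):
    seconds = timeit.timeit(lambda: func(np_array), number=10_000)
    print(f"{func.__name__}: {func(np_array)} in {seconds:.4f}s for 10,000 calls")
```

Which one wins probably depends on the array length, so it is worth timing on data of the real size.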
I am trying to create an empty pandas df and later fill in the values, but the performance isnt great - upwards of 100 ms for 35 rows over 3 columns...
Can you help me set up the dtypes for it ? maybe thats the problem
pd.DataFrame(index=range(0, 35), columns=['MMID', 'Price', 'Size'], dtype={'MMID': object, 'Price': np.float64, 'Size': np.int64})
I keep getting TypeError: data type not understood
in the MMID column i want to store strings, floats in Price and ints inside Size
You can totally specify a dtype per column
yeah plenty of DF have different dtype per column
any #api / requests question: I know how to set a ?per_page= , but I can't figure out how to ask an api what its maximum ?per_page - anybody? Thanks!
"Parameters:
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
"
Nevermind...what I said before that was wrong
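Since the constructor's `dtype` only takes a single type, one possible workaround is to build each column with its own dtype, or to cast afterwards with `astype` and a dict. A minimal sketch, assuming placeholder values ("" and 0) are acceptable until the real data is filled in; the sample rows are invented:

```python
import numpy as np
import pandas as pd

n_rows = 35

# Option 1: preallocate one column per dtype and build the frame from them.
preallocated = pd.DataFrame({
    "MMID": pd.Series([""] * n_rows, dtype=object),
    "Price": np.zeros(n_rows, dtype=np.float64),
    "Size": np.zeros(n_rows, dtype=np.int64),
})
print(preallocated.dtypes)

# Option 2: collect rows in a plain list, build once, then cast per column.
rows = [("ABC", 1.5, 100), ("XYZ", 2.25, 200)]  # invented sample rows
from_rows = pd.DataFrame(rows, columns=["MMID", "Price", "Size"]).astype(
    {"Price": np.float64, "Size": np.int64}
)
print(from_rows.dtypes)
```

Building the frame once from a list of rows is usually much cheaper than writing cell by cell into an empty frame, which may be where the 100 ms goes.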
i'm trying to apply a OneHotEncoder but fail to see how I add the encoded values to my dataset
at the moment I have this:
ms = df.MSZoning
encoded, cats = ms.factorize()
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder(categories ="auto",sparse=False)
msOneHot = ohc.fit_transform(encoded.reshape(-1,1))
msOneHot now contains the columns with the binary values so I just need to add it to the dataset
I don't understand why the fit_transform doesn't add these automatically to my dataset?
lol... just realized the get_dummies does exactly that:
df = pd.concat([df,pd.get_dummies(df.MSZoning, prefix='zoning')],axis=1)
WHY USE INPLACE ? AND WHY SUBSET["PRICE "] WORKS??
Did you try reading the docs..?
yes i got inplace from there but still don't understand why used subset['price']
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
Which columns to consider if you're dropping rows for instance
and inplace means that it modifies the df, and doesnt return a new one
Just comes with reading many, and experimenting with the meaning of them. You read them when you dont know what something does
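A small sketch of what `subset` and `inplace` do in `dropna`, on an invented frame (the column names and values are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "color": ["red", "blue", None],
})

# subset=["price"]: only NaN in the price column causes a row to be dropped.
# The row with a missing color but a valid price is kept.
cleaned = df.dropna(subset=["price"])
print(len(df), len(cleaned))  # df itself is unchanged here

# inplace=True: df is modified directly instead of getting a new frame back.
df.dropna(subset=["price"], inplace=True)
print(len(df))
```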
how do I make sure I split my train and test data so my test data contains at least one of each categorical data.
At the moment I believe this isn't the case because I use OneHotEncoder and I have less features on my test data than on my training data...
I use this to split it up at the moment:
from sklearn.model_selection import train_test_split
trainSet, testSet = train_test_split(housing,test_size=0.2,random_state=42)
or... how can I add the missing columns with all 0 values?
why this error ?
@languid adder check this answer out
https://stackoverflow.com/questions/37425961/dummy-variables-when-not-all-categories-are-present
cool thanks. Meanwhile I built my own code to fill up the gaps by comparing the columns.
Always good to see other solutions as well
But why don't you add dummies before you split the data?
well.. because I split my training data into training and test.
I also have a different file that I need to make predictions for and I don't have the labels for them
if I add the dummies before splitting, I might have the same issue when I predict on my unknown dataset
also, my split up test set has some values that aren't in my training so it gives a more accurate representation of the unknown set
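One way to add the missing columns as all zeros, along the lines of the linked answer, is to reindex the test dummies against the training columns. A minimal sketch with invented zoning values:

```python
import pandas as pd

train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})
test = pd.DataFrame({"MSZoning": ["RL", "C (all)"]})  # "C (all)" never seen in train

train_dummies = pd.get_dummies(train["MSZoning"], prefix="zoning", dtype=int)

# Reindex against the training columns: categories missing from the test set
# become all-zero columns, and categories unseen in training are dropped.
test_dummies = pd.get_dummies(test["MSZoning"], prefix="zoning", dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies)
```

The same reindex can be applied to the unlabelled prediction file, so every set ends up with exactly the columns the model was trained on.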
@fervent solar I’m not entirely sure, but I think you tried to calculate a mean from your data frame but there are some values such as the one you see in the error that cannot be converted to a numerical value
I suggest looking at your data and cleaning out all the bad rows
my RMSLE of a model is 0.00054004 which is based upon an unseen dataset where i have labels for.
my RMSLE of a different dataset all of a sudden is 1.37941
Do I need to conclude my model is rubbish or that I Might have made an error in importing/transforming the second set?
Whats your training error? It might provide some pointers about which of the RMSLE values are wrong
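For reference, a minimal sketch of RMSLE in NumPy, assuming the usual log(1 + x) definition. The function name `rmsle` and the sample prices are invented for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    # Root mean squared error between log(1 + prediction) and log(1 + truth).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


true_prices = [180_000, 250_000, 120_000]
print(rmsle(true_prices, true_prices))                # perfect predictions
print(rmsle(true_prices, [41_000, 41_000, 41_000]))   # predictions far too low
```

If the metric is computed on values that were already log-transformed or standardized, the result is not comparable across datasets, which might explain part of a gap like this.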
I'm doing the housing predictions from Kaggle and my predictions for the submissions are all very low (41k mean with std of 300)
I haven't looked at the errors on my training or test set but because my test set performs rather well and the submission set gives these weird results, I might be inclined to think there's a problem with applying the submission set to my model
That hypothesis could be tested if a simpler model performs similarly on test and submission set I guess.
I tried changing from RandomForest to DecisionTreeRegressor and I get similar low values.
It's weird because the predictions in that model seems almost identical...
array([52373.58333333, 52373.58333333, 52373.58333333, ...,
52373.58333333, 52373.58333333, 52373.58333333])
I also did a test with SGDRegressor but my initial RMSE on the training set is around 3e+20 while my predictions are around 2e+19 so again way lower
on the DecisionTreeRegressor I get a constant for all my predictions on the Kaggle submission set while my testing set gives a very nice RMSE of around 9k
so this probably shows I'm applying the submission set wrong
Probably, you should make sure you clean the data in the same way. And maybe also look at the submission set to see where it's different
i have one method that I apply to both my test set and kaggle set so to make sure they are transformed in the same way
Hold on is this what you're doing?
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Because I don't think that challenge has two test sets. The sample submission set is a sample of what your submission should look like, it only contains ID and price.
yes
but I split up the trainset so I have my own test set to validate my model against an unseen set
👍 of course, I was a bit perplexed about where you got your second test set from.
from common practice 😉
so that's also why I am surprised. My own test set performs rather well. 9k RMSE while the submission set doesn't work at all...
Hmm and you have looked at the size of both test sets after transforming them to check if there are any obvious differences?
if I have different columns, I would get an error in my model.
The shape looks good
when I do a .describe() of both my trainSet and my kaggleSet after transformation, they look similar
the means of different columns I looked at are in similar ranges, same for std
i'll restart the kernel... I've seen similar issues that some things are in memory from trying and I remember spending a lot of time chasing ghosts... will try that and see.
Hmm I see that a few other submissions get a RMSLE of around 0.1 with some basic models. If your model gets you 0.00054004 on the first unseen test set. That seems very low, there might be some information leakage somewhere.
ah yes you are right... So maybe the model is flawed after all. Overfitting like hell 😛
ow... now my predictions look better
now the predictions of my kaggle set is between a normal range. plenty of houses in the 100-500k mark 😄
will upload and see the result
ow darn spoke too early 😛 was plotting the training price 😦
yup same issue so will look into my model.
And yes... because my RMSLE is so small on my training set while the top guys at kaggle are at 0.1 and my training RMSLE has 0.0004 I indeed need to look into my model
but what I still don't understand is why it performs so well against my own test set
I'm sure I don't use my test data for training
ah think I found something... I send a non standardized dataset to my Random Forest
when I standardize my set and calculate the RMSLE I now get 1.81040718 which is more in line with what I see on my submission 😛
when I apply that new model to the kaggle set, my housing prices are still off but at least they are all in the range of 181k which is better than in 40k so I bet if I upload that, I improved a lot 😛
it keeps nagging me why my own test set does perform well but as soon as I apply it to the kaggle set it goes haywire
for example a DecisionTreeRegressor seem to work really well on the train data + test set but when I apply it to the kaggle set, I get the same prediction for each row
doesn't make any sense, right?
hello anyone able to assist with some data science/probability/linear algebra questions as it relates to the L2 Norm?
👏 Nice!
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
df.dropna(axis=0,inplace=True,subset=['price']) this code is not cleaning empty values @terse pewter
i really dont recommend using inplace
it might actually be deprecated now, i cant remember
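One possible cause, which is only a guess without seeing the data: `dropna` only removes real missing values (NaN/None), so cells holding an empty string or whitespace survive it. A sketch with an invented `price` column:

```python
import pandas as pd

df = pd.DataFrame({"price": ["10", "", " ", None]})

# Only the None row is dropped; "" and " " are ordinary strings, not NaN.
after_dropna = df.dropna(subset=["price"])
print(len(after_dropna))

# Treat blank strings as missing too, and assign the result instead of
# relying on inplace.
is_blank = df["price"].isna() | (df["price"].str.strip() == "")
cleaned = df[~is_blank]
print(cleaned)
```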
If I want to classify a CSGO match stream at any given frame as either showing the game or not showing the game (showing commentary, commercial, et cetera), should I just get a random collection of frames and label them as "game" and "non-game"?
Then use a decision tree classifier to create a model?
which website/ course do you guys recommend to someone who wants to learn numpy ?
@river plume datacamp
@potent phoenix what would be non-game?.. you need to define the problem better.. if there's very little difference in the encoded images, it won't make a difference.. and you'll get false classifications every time..
np
can someone recommend a book that covers data analysis with regards to ML so you have a better understanding of how to interpret and manipulate data so it's optimized for your model?
The usually recommended ESL and PRML wont serve your purpose?
isn't ESL a math heavy book? I was more looking at a text book
Both are quite math heavy.
i already have Introduction to Statistical learning which is from the same authors as ESL
I would have read that if it were python instead of R. Now that you have pointed a book that looks promising from what it says on the back cover, I would like to take a look at its contents. I donno how to access that from amazon
By contents I mean the contents page. Not the entire text contents
if you google the book, you find a pdf. Not sure how legal that is... 😦
ah 😛
you should get some more info on the book website.
Oh. I meant the other book. Think Stats
ah ok.
here you'll find the ToC
the more I dabble into DS/ML the more I realize how little I know and how much I have still to learn 😃
That book seems to cover a lot more topics than ESL and PRML. And they are all surely highly mathematical topics (although it might be taught with a different approach, as the back cover claims). Overwhelming number of topics there
i do notice from the learning i've done so far that it's a good approach to get some high level overview of topics so you know what's out there and if you need further information/clarification, the more detailed books will guide you
That ToC does not look like high level overview to me. But it might be different to you
i'll let you know in a few days 😃 Book should arrive tomorrow
I would certainly be eager to know that 😃 . Thank you
found this kernel on Kaggle about the housing price competition and it's really good in explaining how to explore the data: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
There seems to be another book by the same name by Allen B. Downey
he has a couple of books with o'reilly
and think stats has a 2nd edition if that's what you mean
I ordered the second edition
Oh wait. I misread. Sorry. I tried to discern from the german amazon site. I thought Taschenbuch was the author. Whatever that means
yeah I'm not German but i have to use the german amazon as they deliver to Belgium
i read a review about that book that says it's an intro to many of the statistical concepts and the code is to explain the concept using code vs math
So I'm actually doing an R program, but my question is for statistics related
I did a KNN application on a dataset and the min-max normalization is rendering better results than a z-score standardization
why is this the case?
My prof had said that generally z-score standardization is the better way to go, but based on some various K values tested on both normalization methods, the min-max rendered the best results
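For comparison, a sketch of the two scalings in NumPy (the channel's language rather than R), on an invented feature with one outlier. KNN depends on distances, so which scaling works better tends to depend on how each feature is distributed, and neither is guaranteed to win.

```python
import numpy as np


def min_max(values):
    # Rescale to the [0, 1] range using the observed minimum and maximum.
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def z_score(values):
    # Centre on the mean and divide by the standard deviation.
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


feature = [1, 2, 3, 4, 100]  # invented data with one outlier
print(min_max(feature))
print(z_score(feature))
```

Printing both for each real feature shows how the outliers and ranges are treated differently, which is usually where the difference in KNN results comes from.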
is it common practice to do data cleanup and feature engineering on a combined set of the train and test even although you don't have the labels for the test set?
i've seen it in several examples where people combine both sets before changing NA values to something else or before they call get_dummies
yeah usually people drop NAs.. but it depends.. on the type of variables..
some will fill with the median or mode
yeah that I know. My question more about the practice of combining both the training and (unlabelled) test set before you do these operations
Depends I think, if it's just for analysis of that dataset or I'm completely sure the test data contain no surprises then I might do that. But if it's for training a model that I plan to use many times on different data, then I'd make a proper data wrangling pipeline and use it on both the training and testing set to see that it works. I don't know what everybody else does though, but for me it depends.
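A minimal sketch of the "same wrangling on both sets" idea for the median fill mentioned above: the medians are learned from the training set only and then reused on the test set. The column name and values are invented.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 100.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# Learn the fill values from the training data only.
train_medians = train.median(numeric_only=True)

# Apply the same fill values to both sets.
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(test_filled)
```

Combining both sets first would make the fill value depend on the test rows as well, which is the part people disagree about.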
```
df = pd.read_csv(r'C:\Users\damia\Desktop\500.csv')
print(df)
```
CSV contains:
```
Symbol
NCI
BGH
VYP
```
I get this error: `UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character`
@storm monolith - specify the encoding when you read in the csv. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
@late garnet Didnt help , i specified utf-8. Notepad++ says its UTF-8
You could try, utf_8_sig
where did you specify utf-8 and what error (if different) did you get when you specified it
kindly anyone tell why my dropna is not working
its not dropping values
i printed the price before dropna and after dropna still not working
Any advice on how to debug an assertion error, from statsmodels.formula.api.ols, assert pytype not in (tokenize.NL, tokenize.NEWLINE)?
I'm not entirely sure what sort of error I'm even dealing with. I've verified that my column names are right, but I'm not sure what else could be going on.
The problem is somewhere in patsy, but I've poked at Patsy a bit and I'm not seeing an obvious solution.
Same problem arises when I type in a manual formula ("Employees ~ Time") and when I use ModelDesc, and patsy worked on a slightly different dataframe earlier in my notebook.
I'm building a neural network using Keras. I have two variables for my input. The training data is split into two Numpy arrays, one with all the data for the first variable, and one with all the data for the second variable. The output data is also a Numpy array. How should I format the data to give to the network?
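One common layout, sketched with NumPy only: stack the two variable arrays as columns so each row is one sample with two features. The array names and values are invented, and this assumes a model whose first layer expects an input shape of `(2,)`.

```python
import numpy as np

first_variable = np.array([0.1, 0.2, 0.3, 0.4])
second_variable = np.array([1.0, 2.0, 3.0, 4.0])
targets = np.array([1.1, 2.2, 3.3, 4.4])

# One row per sample, one column per input variable.
features = np.column_stack([first_variable, second_variable])

print(features.shape)  # (4, 2)
print(targets.shape)   # (4,)
```

`features` and `targets` could then be passed together as the inputs and outputs when fitting the model.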
@desert cradle df = pd.read_csv(r'C:\Users\damia\Desktop\500.csv',encoding='utf-8') Same error, want full paste ?
File "pandas\_libs\parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__ File "pandas\_libs\parsers.pyx", line 686, in pandas._libs.parsers.TextReader._setup_parser_source UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
ok, your actual problem
there is an invisible character in your filename
that's what i got when i pasted the string from the line you just typed into my prompt @storm monolith
delete and retype the string constant
thought it was weird that it was an encode error
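A quick way to spot such a character without retyping, as a sketch. The helper name `find_hidden_chars` is made up, and the U+202A in the example is only an assumption about what kind of character might sneak in when a path is copied; the actual character in the message above is unknown.

```python
import unicodedata


def find_hidden_chars(text):
    # Report position and code point of invisible format (Cf) and
    # control (Cc) characters in the string.
    hidden = []
    for position, char in enumerate(text):
        if unicodedata.category(char) in ("Cf", "Cc"):
            hidden.append((position, f"U+{ord(char):04X}"))
    return hidden


# Hypothetical example: a path pasted with an invisible leading character.
pasted = "\u202aC:\\Users\\damia\\Desktop\\500.csv"
print(find_hidden_chars(pasted))

# Strip those characters to get a usable path.
cleaned = "".join(
    ch for ch in pasted if unicodedata.category(ch) not in ("Cf", "Cc")
)
print(cleaned)
```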
I'm still stuck, trying to figure out what the tokenization is even doing that's making it go wrong.
I just wanted to run an ols on a dataframe!
question regarding OneHotEncoder. If I have a column that can have 4 different values so with OneHotEncoder I transform that into 4 new columns.
If I calculate the correlation of those 4 new columns vs my label and notice only 2 of those columns have a decent correlation value, is it good to remove the other 2 or is that bad considering its origin?
the same question if I use a PolynomialFeatures transform? this can increase the number of features exponentially, so dropping the additional columns that don't have a good correlation should help the model, right?
if you're only looking for linear relationships, sure
but corr won't tell you much about nonlinear or stateful relationships, which can be learned by a range of ML techniques
ah yes, if there is a higher-order relation the correlation won't point it out and I might drop those
i observed there's a bit of a miss in current ML curriculum around feature selection
most of the feature selection techniques presented rely on linear relations and are worthless for stuff like decision trees or ann
yeah most of the stuff you find talks about correlation and if you're lucky they mention it doesn't apply to nonlinear relationships
pca is a huge offender
it's still taught as one of or THE primary feature selection technique, but it's actually useless for popular ml techniques
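A tiny illustration of the point about correlation and nonlinear relationships, on synthetic data: a feature that fully determines the target through a square still shows a Pearson correlation of roughly zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 1.0   # purely linear dependence on x
y_quadratic = x ** 2       # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

Dropping the quadratic feature on the basis of its correlation would throw away something a tree or a neural network could use.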
OK, I've confirmed that
'response_terms = [patsy.Term([patsy.LookupFactor("Employees")])]
model_terms = [patsy.Term([])]
model_terms += [patsy.Term([patsy.LookupFactor("Time")])]'
works and
'ols(formula="Employees ~ Time",data=longer_df).fit().summary()'
doesn't.
whats the command to install numpy and pandas?
@old axle, do you have pip?
pip install pandas and pip install numpy should do the trick.
Some do, but it's always worth checking the defaults.
ok
(My problem has been fixed! For the record, I hadn't updated a library that broke on the 3.6->3.7 upgrade.)
if you have a dataset of around 1500 records and one of the categorical features has 2 values. One value has 1400 occurences and the other around 100.
I notice that those 100 records have a capped value for my label while if a record has the other value, the label can go much higher (over double the value of the other category)
I'm not sure that based upon the high difference in frequency of these categories, It would be wise to make this conclusion and include it in my model
if you have following distribution of categories:
do I need to disregard that feature? Knowing that one category takes around 90% of the dataset so when it comes to probability... it has more opportunity to generate more outliers or wider distributions
@carmine lava check pinned messages.
@polar acorn link
Pinned messages in this channel, top right above the chat window.
Hi, Pandas question: I have two tables, one with a yearly value for the last 10 years and another with several price entries per year. I want to create a new column that divides the second table's price by the first table's value if the years match
is there a simple way to do this?
So you have yearly data and let's say monthly data for the same thing?
actually
I have a yearly index in one table
and a lot of transactions for each year in another
first table is like 10 rows
second one is like 300k
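One possible approach is a left merge on the year followed by a plain division. A sketch with invented column names (`year`, `index_value`, `price`) and values:

```python
import pandas as pd

# Small table: one index value per year.
yearly_index = pd.DataFrame({
    "year": [2017, 2018],
    "index_value": [100.0, 110.0],
})

# Large table: many transactions per year.
transactions = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019],
    "price": [50.0, 200.0, 55.0, 10.0],
})

# The left merge keeps every transaction; years without an index value get NaN.
merged = transactions.merge(yearly_index, on="year", how="left")
merged["adjusted_price"] = merged["price"] / merged["index_value"]

print(merged)
```

With only about 10 rows on the small side, the merge should stay cheap even for 300k transactions.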
i have a dataframe column with such values: "Eraser (5)"; where the round brackets contain the price of object in int
How can i add another df column after extracting the price and storing it in the new column
Like df.Object should be "Eraser" and df.Price should be 5
```
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({'TheColumn': ['Eraser (5)']})
>>> r = re.compile(r'^(.*?)\s*\((\d+)\)$')
>>> ms = df['TheColumn'].apply(r.match)
>>> df['Object'] = ms.apply(lambda x: x.group(1))
>>> df['Price'] = ms.apply(lambda x: int(x.group(2)))
>>> df
    TheColumn  Object  Price
0  Eraser (5)  Eraser      5
```
@river plume
thanks @desert cradle
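A vectorised variant of the same idea using `Series.str.extract` with named groups, as a sketch. It assumes every value matches the "name (number)" pattern; rows that do not match would come back as NaN and make the `astype(int)` cast fail. The second sample row is invented.

```python
import pandas as pd

df = pd.DataFrame({"TheColumn": ["Eraser (5)", "Pencil case (12)"]})

# Named groups become the column names of the extracted frame.
parts = df["TheColumn"].str.extract(r"^(?P<Object>.*?)\s*\((?P<Price>\d+)\)$")

df["Object"] = parts["Object"]
df["Price"] = parts["Price"].astype(int)

print(df)
```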
