#data-science-and-ml | Python | Page 190

fallow summit Nov 3, 2018, 1:45 PM

#

I didn't have multivariate calc in highschool, but we had tons of linear algebra here in Poland

stable tinsel Nov 3, 2018, 1:46 PM

#

@fallow summit well i'm not sure where the existing solutions are but this is the arcade learning environment: https://github.com/mgbellemare/Arcade-Learning-Environment

GitHub

mgbellemare/Arcade-Learning-Environment

The Arcade Learning Environment (ALE) -- a platform for AI research. - mgbellemare/Arcade-Learning-Environment

#

(ALE)

#

ALE is the most popular gaming environment that's being used for AI research right now

#

i would search around and try to find solutions to ALE roms

#

but as @lean ledge points out, you won't understand the solutions until you get through at least linear algebra and bayesian statistics

fallow summit Nov 3, 2018, 1:50 PM

#

dw I'll learn it

#

ALE looks cool

woven tundra Nov 3, 2018, 5:29 PM

#

There's a Udemy course taught by Kirill Eremenko called ML with Python

#

Absolutely zero math, all he does is say how to use the ML libraries and what to put in and what you get out

#

If you just wanna play around with something and start playing with the libraries with no knowledge of the math required to understand what's going on, you could check that out

#

And hey who knows, if it piques your interest you could look into the math behind everything

#

Everybody learns in different ways and maybe tinkering until it fascinates you is your way

#

It certainly was mine for programming in general

#

But if you just wanna learn how to use the libraries you could check out some free stuff out there too. Start with what the usual machine learning models are and look up how to implement each of them in Python

#

I think Google has an intro to ML course as well somewhere on their website

#

https://developers.google.com/machine-learning/crash-course/ml-intro

Haven't done it though. Not sure if it's good.

Google Developers

Introduction to Machine Learning | Machine Learning Crash Cour...

fallow summit Nov 3, 2018, 5:48 PM

#

@woven tundra exactly what I meant 😉 Thanks 😃

woven tundra Nov 3, 2018, 6:33 PM

#

No worries

small ore Nov 3, 2018, 8:01 PM

#

@lean ledge I never argued about the need to learn maths for ML or anything else. It is all about how the math is approached. Remember the discussion was about Andrew Ng's course and the Columbia course on edX. Based on my viewing of a couple of videos of the Columbia course, it is a mere recitation of mathematical formulae without even clearly mentioning what each of the variables stands for. Nor does it in any way make it interesting or get people involved. Of course, if the viewer already knows the math and is clever enough to understand the intuitions and limitations, then it may seem good to them, a mere revision. From my vantage point there is no use in reciting those math formulae without going into the nitty-gritty. It is not about math vs no-math but all about the level of math and the approach. I am personally disappointed with the Columbia course (based on viewing only a couple of lessons). It should either point to a math course as a pre-req or just get to the point and omit the math, rather than show it without the basic details of what stands for what (I have seen a few texts on statistics and ML now and the notations vary hugely. If you already know the math it is about being able to adapt, but that is not the case with everyone). I would love to see a mathy course on ML but with a better beginner-friendly + nuanced approach

#

@fallow summit I have a tonne of links on ML related courses and it all depends on how you want to approach and what you are already proficient with. ( Programming etc). DM me if you want to look at the text wall of links 😄

river plume Nov 3, 2018, 8:10 PM

#

I've completed 3 courses in ML:
1st one was ML A-Z by Kirill Eremenko on Udemy which involved no math at all
2nd one was ML on Coursera by Andrew Ng and
3rd one was Maths for ML on Coursera offered by Imperial college london

#

I wasnt really understanding the math taught in andrew's course so the 3rd one helped me A LOT in understanding Andrew's course

#

I recommend you to check it out if you arent understanding it.

#

https://www.coursera.org/learn/linear-algebra-machine-learning/home/welcome


lean ledge Nov 3, 2018, 8:15 PM

#

@small ore perhaps we're remembering differently because Columbia's course had excellent explanations. It was far from recitations of formulae. And I'm not sure what you expect. ML is essentially a field of maths and it has some bare minimum mathematical prerequisite. It's the same amount any quantitative degree should teach you. It's silly to think you can make progress in differential geometry without any background in calculus and ML isn't any different.

river plume Nov 3, 2018, 8:16 PM

#

Agreed @lean ledge

lean ledge Nov 3, 2018, 8:16 PM

#

But don't fault the course if you don't have a mathematical background because the course was absolutely excellent

river plume Nov 3, 2018, 8:19 PM

#

@lean ledge can you suggest some advanced courses on ML and its applications?

#

specifically Neural Networks since I'm quite fascinated with it

small ore Nov 3, 2018, 8:19 PM

#

I mean. Now that you say it, I am confused. Are there multiple Columbia courses on edX? I know basic calculus and have studied statistics and probability and stuff long ago. I am not sure if that is enough for the "bare minimum" standards. I also do not understand the notation of any set of formulae without them spelling it out and preferably giving it to me in a handout

lean ledge Nov 3, 2018, 8:25 PM

#

There was no special notation that wasn't already standard in maths. Yes, they didn't go through every single symbol like they would in high school, but everything was explained qualitatively and its purpose was explained. They did assume someone can read mathematical equations on their own. E.g. they would explain that they would take the distance and the formula for distance would pop up. I don't think it's unfair to assume someone doing an ML course would know how summation notation and Pythagoras' theorem work without having them explained? Perhaps if every single formula needs to be explained, you aren't as fluent at maths as you should be?

#

@river plume Once you've gone through an introductory ML course, for the most part your doors are open for starting on any specific topic

small ore Nov 3, 2018, 8:30 PM

#

Pythagoras' theorem, the distance formula and summation notation are all easy, and explaining them would make it boring and digress a lot from what is being taught. I am not talking about that at all. If they are teaching probability/expectations/statistics etc and not using the same notation as another text on the topic (which we can refer to, or have to refer to coz the course does not always make it clear), then it is difficult. Also, for me an approach where one explains math through an example, esp one relating to the objective the course is trying to achieve, will make it more involving

lean ledge Nov 3, 2018, 8:31 PM

#

@river plume you can try Goodfellow's Deep Learning book, it's meant to be good. Stanford has a DL course which posts its syllabus and lecture slides online

#

Their probability notation is pretty standard across any text I've seen.

#

There's honestly not that many ways you can write probability stuff differently

#

And until what level do they need to explain everything with examples and qualitative expressions? Everyone comes with different levels of maths and they can't just spend four times as much effort to record more explanations of basic things with more examples and explanations of every single part of the formula just because someone is taking a maths course without prerequisite maths knowledge

small ore Nov 3, 2018, 8:45 PM

#

As you said it is a difference in view-point arising from our different levels of understanding math. But I will certainly say I will not be ( and ask anyone to not be) disheartened at the lack of an advanced knowledge in math to learn ML. Maybe Jobs are a different thing. But if you want to enjoy learning ML that level of math is either not necessary or can be learned up. I find that Andrew Ng sometimes oversimplifies math and would have liked more but I am kinda happy I am learning something there. Columbia one is the other end. Makes it look complex ( if not being complex) and either above my level or requires a lot lot more effort from me.
Btw, I think in the amount of text you have typed for this discussion, you could have taught us a couple of lessons in probability 😉

lean ledge Nov 3, 2018, 8:47 PM

#

I actually could have! I actually think I'll write a blog post or two at some point when my exams are over

small ore Nov 3, 2018, 8:47 PM

#

👍

river plume Nov 3, 2018, 8:48 PM

#

@lean ledge I'll check out the deeplearning.ai courses

#

Thanks

small ore Nov 3, 2018, 8:49 PM

#

Maybe Rags ( If I may take the liberty of calling you so), you could also recommend me a text to read up perhaps. (Sometimes it can be better than a lecture). Preferably something that is available free online

lean ledge Nov 3, 2018, 8:50 PM

#

I was more talking about http://cs230.stanford.edu than deeplearning.ai. Andrew is less wishy washy and shallow when he's teaching actual students.

#

I've never actually done deeplearning.ai myself

#

But CS230 should be better regardless

#

It doesn't have recorded lectures unfortunately

#

I don't actually know a suitable textbook for ML. I went with ESL at first and it was too dense even for me and had a weird way of phrasing things. Much better as a reference when you already know something than as a new way to learn. Bishop's Pattern Recognition and ML is what I've settled on for myself since it clicks with me quite well

#

But likely a bit too intense for you mathematically. Best covered by an upper undergrad in physics, maths, engineering etc, probably even a bit too much for most CS majors.

#

The ISL exists but it had no maths at all and has similar problems as Andrew's course at times. Still better imo. At least it doesn't miss obvious stuff

#

Better at building conceptual understanding

#

They realise that if you don't understand the maths, you won't be able to implement it so they also have examples in R

small ore Nov 3, 2018, 8:56 PM

#

ESL I reckon is Elements of Statistical Learning? What's ISL? and meh, I do not want to learn R

lean ledge Nov 3, 2018, 8:59 PM

#

Introduction to Statistical Learning. Book by the same people behind it, but ESL was meant to be an introductory text for someone just out of uni doing their PhD. ISL is essentially meant for people from non-quantitative degrees that want to get into some basics, e.g. for PhD students in biology or psychology or something where they might use ML for the data but they don't understand ML themselves so they want some basic background on things

small ore Nov 3, 2018, 9:01 PM

#

I am neither here nor there. I hang in-between. 😄 . Anyway, thanks for the recos, Raggy. Looking forward to your blog

lean ledge Nov 3, 2018, 9:06 PM

#

Yah it can be hard to be in between situations. I felt the same when I hadn't found Bishop's book

#

There's probably a book for your level too, just keep searching

tranquil iron Nov 3, 2018, 10:48 PM

#

So as a computer science student, I should be starting off with ESL?

reef bone Nov 3, 2018, 11:06 PM

#

I would recommend Bishop's book if you can handle the math

#

I'm a BSc CS student and it worked for me

tranquil iron Nov 3, 2018, 11:11 PM

#

Pattern Recognition and ML?

reef bone Nov 3, 2018, 11:11 PM

#

Yes

tranquil iron Nov 3, 2018, 11:11 PM

#

Alright I'll check it out

lean ledge Nov 3, 2018, 11:18 PM

#

Bishop's is amazing ❤ ❤ ❤

gritty hawk Nov 4, 2018, 6:39 AM

#

hi I'm using pandas at work to read in an excel file, compare strings with a database and make a new dataframe filled with ids pointing to those rows

#

however

#

read_excel is moving the contents of a column over to another column for some damn reason

#

anyone ever experienced something like this?

gritty hawk Nov 4, 2018, 7:10 AM

#

found the issue: don't have spaces in your column names, people

woven tundra Nov 4, 2018, 10:33 AM

#

i've never had issues with spaces in my column names

#

i've had plenty of issues with duplicate column names however

#

what exactly was your issue (for the benefit of anyone who may have the same problem in the future)

woven tundra Nov 4, 2018, 11:36 AM

#

How do we visualize higher dimensions such as 4D or 5D? I mean, is it even possible to conceive beyond numbers?

lean ledge Nov 4, 2018, 11:54 AM

#

to some extent, yes

#

not very productive though usually

woven tundra Nov 4, 2018, 12:02 PM

#

Would you mind explaining how it can be done to some extent?

lean ledge Nov 4, 2018, 12:58 PM

#

4d -> 3d Projections http://eusebeia.dyndns.org/4d/vis/05-proj-1
3d slices of 4d objects https://www.youtube.com/watch?v=vZp0ETdD37E
Alternative methods https://www.youtube.com/watch?v=zwAD6dRSVyI

YouTube

Miegakure

Designing a 4D World: The Technology behind Miegakure [Hide&Reveal]

We build our 4D world using Tetrahedral (instead of Triangular) Meshes, and show 4D Crystals as an example. See also: How to walk through walls using the 4th...


YouTube

3Blue1Brown

Thinking visually about higher dimensions

How do you think about a sphere in four dimensions? What about ten dimensions? Podcast! https://www.benbenandblue.com/ Problem-driven learning on at https://...


#

wrong ping

#

sorry :c

#

@woven tundra

woven tundra Nov 4, 2018, 12:59 PM

#

Thanks @lean ledge !

lean ledge Nov 4, 2018, 12:59 PM

#

nw

gritty hawk Nov 4, 2018, 2:43 PM

#

hi

#

@woven tundra since you seem to know about pandas

#

mind helping me out with a query?

woven tundra Nov 4, 2018, 3:01 PM

#

Sure @gritty hawk , what's up?

fallow summit Nov 4, 2018, 4:25 PM

#

Hello again!

#

I found this one

#

https://course.fast.ai/

Deep Learning For Coders—36 hours of lessons for free

fast.ai's practical deep learning MOOC for coders. Learn CNNs, RNNs, computer vision, NLP, recommendation systems, pytorch, time series, and much more

#

And it looks quite cool. I will go through this and Andrew Ng course 😉

hardy drift Nov 4, 2018, 4:26 PM

#

i've tried experimenting with pyautogui and pytesseract to recognize these numbers, but neither work. how would you go about it? (it's runescape btw)

📎 unknown.png

#

with pyautogui i manually saved images of each number 0,1,..9 and tried to find them in the image. pytesseract spits out letters (i think for 244 it read it as "eag")

lyric canopy Nov 4, 2018, 5:40 PM

#

One thing you can try is to restrict the characters pytesseract is looking for

#

This may help you: https://stackoverflow.com/a/43710072

Stack Overflow

Pytesser set character whitelist

Does anyone know how to set the character whitelist for Pytesseract? I want it to only output A-z and 0-9. Is this possible? I have the following:

img = Image.open('test.jpg')
result = pytesseract.

#

@hardy drift
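A rough sketch of what restricting the characters could look like, in case it helps. It assumes pytesseract and the Tesseract binary are installed; `digits_config` and `read_number` are made-up helper names, and `--psm 7` (treat the image as a single line of text) is just a guess at what suits a short number. Whether the whitelist is honoured can depend on the Tesseract version and OCR engine, so treat it as something to try rather than a guaranteed fix.

```python
def digits_config(psm=7):
    """Build a Tesseract config string that only allows the digits 0-9."""
    # tessedit_char_whitelist is a Tesseract config variable; --psm picks the
    # page segmentation mode (7 = treat the image as a single text line).
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def read_number(image):
    """OCR a PIL image (or image path) and return the recognised text."""
    # Imported lazily so digits_config() can be used without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(image, config=digits_config()).strip()
```

Usage would be something like `read_number(Image.open("unknown.png"))` with `from PIL import Image`. Cropping to just the number and scaling the crop up before OCR may also help with small game fonts, but that is untested here.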

placid snow Nov 4, 2018, 7:42 PM

#

What are you trying to do with runescape? @hardy drift

lone mist Nov 4, 2018, 7:48 PM

#

my understanding is that tesseract is trained for things like scanned documents, not screenshots of a computer screen

#

hence poor results

small ore Nov 4, 2018, 9:14 PM

#

I thought this maybe helpful to anyone trying to learn ML/AI:

Curated list of tutorials: https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78
Curated list of cheat sheets (helpful when you quickly want to look up a formula or recall a method in an ML module): https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6

#

This list is good too:
http://rise.cse.iitm.ac.in/wiki/index.php/Introduction_to_Machine_Learning

mint pike Nov 4, 2018, 9:36 PM

#

Is machine learning really hard?

reef bone Nov 4, 2018, 9:36 PM

#

This is super nice for anyone with interest in deep learning
https://github.com/floodsung/Deep-Learning-Papers-Reading-Roadmap

mint pike Nov 4, 2018, 9:36 PM

#

I'm a beginner in python and im thinking of diving into it

#

@reef bone thanks

tame jacinth Nov 4, 2018, 9:37 PM

#

well, python is the right language

#

thats all I know

reef bone Nov 4, 2018, 9:39 PM

#

I would probably do a little bit of background reading and then familiarize yourself with something like numpy, try to solve some basic problems and see if you like it

small ore Nov 4, 2018, 9:41 PM

#

The links I have posted above have recommendations for resources/tutorials. I suggest picking one and diving in before you decide if it is hard or not

mint pike Nov 4, 2018, 9:42 PM

#

i will read them! @small ore @reef bone , Does it require lots of computational power if I'm training a neural network like thousands of images, books, etc.....

small ore Nov 4, 2018, 9:43 PM

#

Maybe. Depending on the problem

reef bone Nov 4, 2018, 9:43 PM

#

Don't worry about that now

small ore Nov 4, 2018, 9:44 PM

#

Most problems that do not involve huge database of images can be done on simple laptops/pcs

#

If you need more later you can hire space/computing power on the cloud as per your requirement

reef bone Nov 4, 2018, 9:44 PM

#

At some points it becomes computationally expensive, but you probably won't encounter any issues for a long time if you're just starting out

#

If you have a non-ancient nvidia gpu you can easily accelerate using tensorflow-gpu

#

But for now that is not necessary at all

mint pike Nov 4, 2018, 9:46 PM

#

yea im thinking too far ahead lol. ive got a decent system

#

AWS is pretty good

lean ledge Nov 4, 2018, 10:15 PM

#

Thousands of books isn't enough data, you need millions :P Generally speaking, you often don't make your own NN for that, you get a ResNet model pretrained on ImageNet and wipe and retrain the last few layers

#

Transfer learning is currently the only decent way to train on little data

#

@reef bone The reading list is excellent!

#

I never even thought to try look at the original dropout paper, whoa

small ore Nov 4, 2018, 10:19 PM

#

Btw Raggy, Bishop seems good and I am able to manage till now. Only through first chapter though. I like the presentation

lean ledge Nov 4, 2018, 10:19 PM

#

👌

#

It does ramp up a bit quickly. Try to keep in mind, it's a maths book. Maths books are hard to read because they're always dense af

#

Always worth the effort though

#

the presentation is soooo much better than ESL imo

small ore Nov 4, 2018, 10:21 PM

#

As long as the explanation and clear description of notation ( preferably inline wherever needed) goes on I will be happy to read

late garnet Nov 5, 2018, 1:14 AM

#

@lean ledge what algorithms do you specialize in?

lean ledge Nov 5, 2018, 2:17 AM

#

I wouldnt say I specialise considering I'm technically still fairly new to the field, at least compared to my colleagues. But I lean towards things that borrow from the electrical engineering side of things, e.g. things based on signal theory. Both traditional and ML-based computer vision, time series/stochastic processes etc.

#

Currently do time-series-y forecasting stuff at one job and am about to start an internship at CSIRO's Data61 in their Robotics and Autonomous Systems group for Deep Computer Vision

#

@late garnet

late garnet Nov 5, 2018, 2:32 AM

#

Nice - I might need to bug you about some stuff. I focus on time series, nlp and clustering problems.

#

@lean ledge

#

Primarily anomaly detection in time series - not so much forecasting

#

Specifically, I follow Eamonn Keogh at UCR with matrix profiling techniques. https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

chilly shuttle Nov 5, 2018, 2:34 AM

#

@lean ledge you in australia then?

lean ledge Nov 5, 2018, 2:42 AM

#

@chilly shuttle oui

chilly shuttle Nov 5, 2018, 3:07 AM

#

data61 has some pretty good people, worked with them before

lean ledge Nov 5, 2018, 3:22 AM

#

It does, some very smart people. Where are you? Not brissy by any chance? @chilly shuttle

small ore Nov 5, 2018, 3:38 AM

#

@lean ledge Dumb math question. What does the math notation that looks like modulus notation, but with two vertical lines on either side of the expression, stand for?

chilly shuttle Nov 5, 2018, 3:44 AM

#

|like this| ?

#

@lean ledge nah melbs but I stop by brissy sometimes and know a few folk there

simple crag Nov 5, 2018, 3:49 AM

#

@small ore do you have a picture?

small ore Nov 5, 2018, 3:50 AM

#

I think bicubic showed a better way than a picture. Why didn't I think of it.
It is ||x1*w-x2*w||

#

The lines look rather close by though

simple crag Nov 5, 2018, 3:51 AM

#

Usually that's norm

small ore Nov 5, 2018, 3:51 AM

#

Norm?

simple crag Nov 5, 2018, 3:51 AM

#

https://en.wikipedia.org/wiki/Norm_(mathematics)

Norm (mathematics)

In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space—except for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is ...

#

Or apparently it could also mean nearest integer, never seen that before https://en.wikipedia.org/wiki/Nearest_integer_function

small ore Nov 5, 2018, 3:54 AM

#

I am not sure what a negative-length vector is

#

If such exists

lean ledge Nov 5, 2018, 3:58 AM

#

Yep, || is norm. Vectors can be generalised to things that arent just lists of numbers or geometrical objects and hence we create specific terminology that applies to all sorts of different vectors. Thats why you'll also see dot products like a . b written in inner product notation as <a, b> or (a, b)

#

You might want to study some linear algebra if you haven't already. It's all very relevant to ML. You might even see something like the normal equation and its derivation, which might help you understand linear regression in a slightly deeper way

#

essentially norm is the generalised version of the "length" of a vector @small ore

#

and in the same way you might see |a| as sqrt(a.a), you might see ||a|| = sqrt(<a,a>)
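A tiny sketch of that relationship for plain lists of numbers, using only the standard library. The names `inner`, `norm` and `distance` are made up for illustration, and this is the ordinary Euclidean case only, not the general inner-product-space definition.

```python
import math


def inner(a, b):
    """Standard inner (dot) product <a, b> of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm ||a|| = sqrt(<a, a>)."""
    return math.sqrt(inner(a, a))


def distance(a, b):
    """Distance between two vectors, written ||a - b||."""
    return norm([x - y for x, y in zip(a, b)])


print(norm([3.0, 4.0]))                    # 5.0
print(distance([1.0, 2.0], [4.0, 6.0]))    # 5.0
```

So an expression like ||x1*w - x2*w|| reads as "the length of the difference between those two vectors".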

small ore Nov 5, 2018, 4:04 AM

#

Currently it does not appear relevant. It was used for regularisation weights. Maybe I will dive into linalg later for the deeper meaning

lean ledge Nov 5, 2018, 4:05 AM

#

It's taking the length or equivalently the "size" of the vector and penalising it for being too large

small ore Nov 5, 2018, 4:07 AM

#

Huh. The 'size' itself is penalizing weights here

#

||w||^2
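For what it's worth, here is a minimal sketch of how a ||w||^2 penalty typically enters a loss, assuming a plain squared-error linear model. That model, the name `ridge_loss` and the weight `lam` are assumptions for illustration, not necessarily what the course uses.

```python
def ridge_loss(X, y, w, lam):
    """Squared error of a linear model plus lam * ||w||^2."""
    # Prediction for each row is the dot product of the row with w.
    residuals = [
        sum(x_j * w_j for x_j, w_j in zip(row, w)) - target
        for row, target in zip(X, y)
    ]
    data_term = sum(r * r for r in residuals)
    # ||w||^2 is just the sum of squared weights, so large weights cost more.
    penalty = lam * sum(w_j * w_j for w_j in w)
    return data_term + penalty


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
print(ridge_loss(X, y, [1.0, 2.0], lam=0.5))  # 2.5: perfect fit, penalty only
```

With a perfect fit the data term is zero, so the whole loss is the penalty 0.5 * (1 + 4) = 2.5, which is the "size" of w being charged for.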

late garnet Nov 5, 2018, 1:09 PM

#

@small ore There is a course on Udemy that is pretty good; covering linear algebra. It has a good amount of theory, notation and application. Code examples for matlab and python exist as well. https://www.udemy.com/linear-algebra-theory-and-implementation/

Udemy

Complete linear algebra: theory and implementation

Learn concepts in linear algebra and matrix analysis, and implement them in MATLAB and Python.

twilit bolt Nov 5, 2018, 7:24 PM

#

Alternatively, you can take a linear algebra course via MIT's opencourseware: https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/

MIT OpenCourseWare

Linear Algebra

This course covers matrix theory and linear algebra, emphasizing topics useful in other disciplines such as physics, economics and social sciences, natural sciences, and engineering. It parallels the combination of theory and applications in Professor Strang’s textbook Intr...

lean ledge Nov 5, 2018, 7:46 PM

#

^^^^ THIS

#

Strang is a god

#

Get his book, watch his lectures

#

Udemy is too dodgy for me to trust lol

lyric canopy Nov 5, 2018, 7:57 PM

#

Strang, this one? Great introduction book.

📎 cropped.jpg

lean ledge Nov 5, 2018, 7:59 PM

#

That's the one

#

Nice ISL

#

There's also Linear Algebra and Its Applications

lyric canopy Nov 5, 2018, 8:01 PM

#

The book we used when I first took a uni linear algebra class was Poole - Linear Algebra: A Modern Introduction. Didn't like that one at all.

#

But, this was a long time ago

#

I'm using Strang to refresh things atm

late garnet Nov 5, 2018, 8:26 PM

#

Udemy courses are definitely hit and miss. Glad they have the refund policy.

#

I like this book a lot - https://www.amazon.com/No-bullshit-guide-linear-algebra/dp/0992001021

thorn river Nov 6, 2018, 12:23 PM

#

I have downloaded a .bz2 compressed file of a month of reddit comments from http://files.pushshift.io/
Decompressed to file would be ~24 gb of json objects per line.

I have a list of reddit usernames which I would like to check against the authors in that pushshift file and extract only 3 k:v pairs of those lines where the author corresponds to one of the authors in my list.

Example of structure of the comment file would be:

`{'author': x, 'time_created': 'xy'}`, etc.

In other words: a line with a json object(?).

What would be a good way to approach this? I don't have access to a boatload of RAM.
Reading the file line by line and checking if author corresponds to my list and then extracting the k:v pairs which Im interested in will probably take a lot of time considering the large size of the file.

Any tips?

late garnet Nov 6, 2018, 1:02 PM

#

You can read compressed files in a buffered way. I'm guessing you only want to match the authors with the comments? What do you plan to do after that @thorn river ?

thorn river Nov 6, 2018, 1:04 PM

#

Ideally I would end up with only the comments from the authors in my separate list, and see if they have a certain keyword in their flair text or in the comment text itself

late garnet Nov 6, 2018, 1:05 PM

#

You can probably just iterate over the buffer and process it as you iterate looking for the key words etc.

#

I'm not sure if you are a pandas user, but it has some nice buffering features - http://pandas.pydata.org/pandas-docs/stable/io.html

#

Otherwise you can look up whatever is appropriate for your toolset.

placid snow Nov 6, 2018, 1:09 PM

#

Does that allow for manipulation of large json files as well?

#

I recently had the issue of a json file being too large to be opened normally, and was looking for a solution like this, if it does

thorn river Nov 6, 2018, 1:10 PM

#

I used pd.read_json once but I froze my system, so there might be something more to it

late garnet Nov 6, 2018, 1:11 PM

#

You can try what they suggest here - https://datascience.stackexchange.com/questions/27767/opening-a-20gb-file-for-analysis-with-pandas

Data Science Stack Exchange

Opening a 20GB file for analysis with pandas

I am currently trying to open a file with pandas and python for machine learning purposes it would be ideal for me to have them all in a DataFrame. Now The file is 18GB large and my RAM is 32 GB bu...

#

Just use read_json instead of read_csv.

placid snow Nov 6, 2018, 1:11 PM

#

Thinking about it, my file sort of has a JSON object per line, so it might be worth reading line by line for me.

late garnet Nov 6, 2018, 1:12 PM

#

I'm not sure how it would handle a single json object that is large, but the approach I listed should work for line by line objects.

thorn river Nov 6, 2018, 1:13 PM

#

Thanks for the pointers! I wrote a script which will do what I (hope) i intended to do. If you could have a look and see if this will probably work?

    data = {}
    for i, line in enumerate(fi):
        parser = json.loads(line)
        # 'author' is a single username string, so test it directly
        if parser['author'] in author_set:
            data[i] = (parser['author'], parser['author_flair_text'], parser['body'])

#

oh wow dat formatting

#

c

#

sec

#

welp

#

I hope you understand formatting doesn't seem to work

late garnet Nov 6, 2018, 1:15 PM

#

It looks ok to me, but honestly I would need to test it myself. 😃

#

You can just set a break after the first line to test it.

thorn river Nov 6, 2018, 1:16 PM

#

You mean like: if i == 1: break?

late garnet Nov 6, 2018, 1:16 PM

#

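import bz2
import json

with bz2.open(jsonnew, 'rt') as fi:  # 'rt' yields text lines, one JSON object each
    data = {}
    for i, line in enumerate(fi):
        parser = json.loads(line)
        print(parser)
        break  # stop after the first line while testing; remove for the full run
        # 'author' is a single username string, so test it directly
        if parser['author'] in author_set:
            data[i] = (parser['author'], parser['author_flair_text'], parser['body'])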

thorn river Nov 6, 2018, 1:17 PM

#

ah alright, thanks!

late garnet Nov 6, 2018, 1:17 PM

#

Also - be sure to create a set object out of your users that you are looking for.

thorn river Nov 6, 2018, 1:18 PM

#

I shouldve mentioned that author_set is a set()

late garnet Nov 6, 2018, 1:18 PM

#

Great - I just wanted to be sure

thorn river Nov 6, 2018, 1:18 PM

#

I assume because checking if it is in a set is faster right?

late garnet Nov 6, 2018, 1:19 PM

#

Yes

thorn river Nov 6, 2018, 1:20 PM

#

Alright, will also check out that article you linked. Thanks!
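Putting the pieces of this exchange together, a streaming version might look roughly like the sketch below. It assumes one JSON object per line and the field names `author`, `author_flair_text` and `body` from the script above; `filter_comments` is a made-up name. Only one line is held in memory at a time, so RAM use should stay small regardless of the file size.

```python
import bz2
import json


def filter_comments(path, author_set, keys=("author", "author_flair_text", "body")):
    """Yield the selected fields of every comment whose author is in author_set."""
    # "rt" decompresses on the fly and yields text lines, so the full
    # decompressed file never has to exist on disk or in memory.
    with bz2.open(path, "rt", encoding="utf-8") as fi:
        for line in fi:
            comment = json.loads(line)
            # Set membership is a hash lookup, so this stays fast even
            # with a long list of usernames.
            if comment.get("author") in author_set:
                yield {key: comment.get(key) for key in keys}
```

Something like `for row in filter_comments("comments.bz2", author_set): ...` (with a hypothetical file name) could then write each match to a CSV or collect it in a list.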

late garnet Nov 6, 2018, 1:57 PM

#

FYI - I tested this in pandas and it appears my specific tar.gz that I created has extra information on the first line while iterating - "something.json0000664000175100017510000000034413370313602013031 0ustar tylertyler{"author": "someone", "text": "aldskfjasdf"}"

#

With pandas it breaks with read_json, however with read_csv it works.

#

Weird

#

Nice - I realized that I was being silly and have a tar.gz file, not just a gzip file.

#

import pandas as pd
for chunk in pd.read_json('/home/tyler/src/something.json.gz', lines=True, chunksize=2):
    print('new chunk')
    print(chunk)

#


new chunk
     author         text
0   someone  aldskfjasdf
1  someone2  aldskfjasdf
new chunk
     author         text
2  someone3  aldskfjasdf
3  someone4  aldskfjasdf
new chunk
     author         text
4  someone5  aldskfjasdf

#

@thorn river This works pretty well, however be sure that pandas supports the compression type beforehand.

late garnet Nov 6, 2018, 2:34 PM

#

You may also want to consider multi-threading if you are processing so much data. It is fairly easy to do.

rich swift Nov 6, 2018, 3:33 PM

#

Hi, has anyone ever tried to create Jupyter widgets before? I'm curious what approach, lib, or JS framework people use

thorn river Nov 6, 2018, 4:05 PM

#

@late garnet Thanks! Don't know much about multithreading but I can check it out. I'll try to match what is in the chunks with my list of authors and work that way!
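Matching the chunks against a list of authors could be sketched roughly like this. It assumes pandas is installed and that every chunk has an `author` column; `filter_chunks` is a made-up name and the column list is only an example.

```python
import pandas as pd


def filter_chunks(chunks, author_set, columns):
    """Keep only rows whose author is in author_set, across all chunks."""
    kept = []
    for chunk in chunks:
        # isin builds a boolean mask in one vectorised step per chunk.
        mask = chunk["author"].isin(author_set)
        kept.append(chunk.loc[mask, columns])
    if not kept:
        return pd.DataFrame(columns=columns)
    return pd.concat(kept, ignore_index=True)
```

It would be called with the chunked reader, e.g. `filter_chunks(pd.read_json(path, lines=True, chunksize=100000), author_set, ["author", "author_flair_text", "body"])`, where `path` and the chunk size are placeholders to tune.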

late garnet Nov 6, 2018, 4:10 PM

#

from multiprocessing import Pool

import pandas as pd

def process_chunk(chunk):
    pass

pool = Pool(8)

all_results = []

for chunk in pd.read_json('/home/tyler/src/something.json.gz', lines=True, chunksize=100):
    all_results = all_results + pool.map(my_function, chunk)

#

Something like that will give you a list of results.

#

@thorn river

thorn river Nov 6, 2018, 4:15 PM

#

def process_chunk(chunk):
    pass

what's the reason for this function?

#

What does it do?

silk acorn Nov 6, 2018, 4:16 PM

#

it passes

thorn river Nov 6, 2018, 4:18 PM

#

I suppose you use that in place of my_function right?

late garnet Nov 6, 2018, 4:20 PM

#

Yeah sorry it was quick example

thorn river Nov 6, 2018, 4:21 PM

#

Ah no problem

#

So I could then iterate all_results to check if its in the author_set

#

If I understand your code correctly

late garnet Nov 6, 2018, 4:51 PM

#

The process_chunk would do all of the processing of each row for each chunk. Chunk in this case is a dataframe row.

#

Just return what is processed back. It could be an empty list or list of what you want.

#

@thorn river

#

My example isn't perfect, but conceptually you should have an idea of how to use it.

#

pool.map maps an iterable of items across N worker processes

thorn river Nov 6, 2018, 5:00 PM

#

So I would have to edit the process_chunk function for it to only include authors in author_set?

#

Having a hard time wrapping my head around it

late garnet Nov 6, 2018, 5:33 PM

#

Yes
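
A minimal sketch of what that could look like, assuming each chunk is a DataFrame with an author column (as in the chunk printout earlier) and that author_set is a plain Python set. The helper filter_chunks and its map_fn parameter are illustrative names, not part of the example above. Each worker here receives a whole chunk, because iterating a DataFrame directly yields column labels rather than rows.

```python
from functools import partial

import pandas as pd


def process_chunk(chunk, author_set):
    """Keep only the rows of one chunk whose author is in author_set."""
    return chunk[chunk["author"].isin(author_set)]


def filter_chunks(chunks, author_set, map_fn=map):
    """Apply process_chunk to every chunk and stitch the results together.

    map_fn defaults to the built-in map (serial). The .map method of a
    multiprocessing Pool takes the same (function, iterable) arguments
    and can be passed in instead.
    """
    worker = partial(process_chunk, author_set=author_set)
    results = list(map_fn(worker, chunks))
    return pd.concat(results, ignore_index=True)


# Hypothetical stand-in for the chunks that
# pd.read_json(..., lines=True, chunksize=100) would yield.
chunks = [
    pd.DataFrame({"author": ["someone1", "someone2"], "text": ["a", "b"]}),
    pd.DataFrame({"author": ["someone3", "someone4"], "text": ["c", "d"]}),
]
author_set = {"someone2", "someone3"}

filtered = filter_chunks(chunks, author_set)
print(filtered)
```

To run it in parallel, passing map_fn=pool.map from a Pool(8) should work, since process_chunk is a top-level function and can be pickled. On platforms that spawn worker processes, the Pool has to be created under an if __name__ == "__main__": guard.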

lapis sequoia Nov 6, 2018, 8:03 PM

#

is a gtx 1050 alright for data science?

placid snow Nov 6, 2018, 8:04 PM

#

Hi @lapis sequoia would you mind setting a nickname of the server to something consisting of characters on a normal US/EU keyboard so others can mention you easier?

lapis sequoia Nov 6, 2018, 8:05 PM

#

oh yeah sorry

placid snow Nov 6, 2018, 8:05 PM

#

Thank you :)

lapis sequoia Nov 6, 2018, 8:05 PM

#

there

#

i'll rephrase my question, anybody know of any good gpus for data science > £200?

late garnet Nov 6, 2018, 8:24 PM

#

@Viibrant depending on your needs, maybe AWS could be a cheaper alternative?

#

Disclosure - I'm not sponsored by AWS. 😃

hollow gulch Nov 6, 2018, 8:28 PM

#

anyone know a quick script that I can keep all the columns of this df4, and make a df5 that is the same layout but group by 'Order Number' + 'Item Number' but 'Quantity Ordered' is a sum of those?

📎 unknown.png

late garnet Nov 6, 2018, 8:30 PM

#

Is this an excel or pandas or ? question?

hollow gulch Nov 6, 2018, 8:31 PM

#

i have data available in excel but I am using python to process it and I use the pandas package for it. hope it makes sense

#

the snapshot is out of pandas

#

hope I am posting in the right place

olive trench Nov 6, 2018, 9:07 PM

#

@hollow gulch

import pandas as pd

df4 = pd.DataFrame({'Order Number': [1,1,1,2,2,3,3,3], 'Item Number':[1,1,2,5,5,4,5,5], 'Quantity Ordered':[10,15,5,3,3,7,5,6]})

df5 = df4.groupby(by=['Order Number', 'Item Number'], sort=False, as_index=False)['Quantity Ordered'].sum()

#

I put in some dummy data

hollow gulch Nov 6, 2018, 9:09 PM

#

Thanks @olive trench I tried something similar. Let me see if it keeps all the columns as they were

#

splendid!

#

can you explain the code a little bit?

#

for academic purpose

olive trench Nov 6, 2018, 9:13 PM

#

What exactly do you need?

#

the by parameter specifies which keys to group by, sort=False skips sorting the group keys, and as_index=False keeps the keys as regular columns instead of using them as the index. Then you sum the 'Quantity Ordered' column of the groupby object and it returns a dataframe

hollow gulch Nov 6, 2018, 9:16 PM

#

here's my version df5=df4.groupby (['Order Number','Item Number'])['Quantity Ordered'].sum()

#

perhaps you could help me understand the difference between my version and yours?

olive trench Nov 6, 2018, 9:16 PM

#

without the ['Quantity Ordered'].sum() part it's just a groupby object. If you perform a function on one of the columns it returns a dataframe where the column is the result of the function you used for all the grouped elements

#

you don't have the as_index=False

#

it's true by default, which will put the keys into the index
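
A small sketch of that difference, using made-up numbers in the same column layout as the dummy data above. The variable names keys, as_series and as_frame are only for illustration.

```python
import pandas as pd

df4 = pd.DataFrame({
    "Order Number": [1, 1, 2],
    "Item Number": [1, 1, 5],
    "Quantity Ordered": [10, 15, 3],
})
keys = ["Order Number", "Item Number"]

# Default as_index=True: the group keys become a MultiIndex and the
# result for a single selected column is a Series.
as_series = df4.groupby(keys)["Quantity Ordered"].sum()

# as_index=False: the group keys stay as ordinary columns and the
# result is a DataFrame with the same column layout as the input.
as_frame = df4.groupby(keys, as_index=False)["Quantity Ordered"].sum()

print(as_series)
print(as_frame)
```

Calling as_series.reset_index() gives the same table as as_frame, so the two versions carry the same numbers and differ only in where the keys live.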

hollow gulch Nov 6, 2018, 9:18 PM

#

what does sort=False do

olive trench Nov 6, 2018, 9:19 PM

#

doesn't sort your keys. If you don't need the sorting, it performs better when your dataframe is big

hollow gulch Nov 6, 2018, 9:19 PM

#

I see

#

now, if I want to keep all the other columns, I have to expand the code, right? It depends on which ones I want to keep as is and which I want to aggregate

#

for example I have column A B C D E F
further version will be df5=df4.groupby([ all keep column], sort=False, as_index=False)['D']['F'].sum() to get sum D, F?

olive trench Nov 6, 2018, 9:22 PM

#

I'm not sure if that'll work. If you're summing the column, why don't you do that first, then the groupby and sum again?

hollow gulch Nov 6, 2018, 9:22 PM

#

df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
       'Parent Number', 'Order Date', 'Item Number', 'Description ',
       'Unit Price', 'Extended Price',
       'Foreign Unit Price', 'Foreign Extended Price', 'Ln Ty', 'Last Stat',
       'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)['Quantity Ordered'].sum()

#

I have a large data set that summing is only applied to 'Total Quantity' and 'Extended Price' by 'Order Number' and 'Item Number'.

#

I hope that makes sense

olive trench Nov 6, 2018, 9:25 PM

#

I'm a little confused about what the goal of this exercise is. If you group by all the columns, it's only going to group identical rows

hollow gulch Nov 6, 2018, 9:25 PM

#

so my actual code is something like this

#

df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
       'Parent Number', 'Order Date', 'Item Number', 'Description ',
       'Unit Price',
       'Foreign Unit Price','Ln Ty', 'Last Stat',
       'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)[['Quantity Ordered','Extended Price','Foreign Extended Price']].sum()

#

with the last 3 columns being sum

#

I am just curious how much flexibility i can do with this code

olive trench Nov 6, 2018, 9:27 PM

#

I am not sure if summing the list of columns will work. But as I said, if you prepare a new column that is the sum of ['Quantity Ordered','Extended Price','Foreign Extended Price'] beforehand and then just sum it again over the aggregated elements, it'll do the job

hollow gulch Nov 6, 2018, 9:28 PM

#

I am not sure what that would look like. I am quite new to python

olive trench Nov 6, 2018, 9:29 PM

#

df5['new_column'] = df5['Quantity Ordered']+df5['Extended Price']+...

#

sorry df4 ^

#

and then you do the groupby and use the .sum() on ['new_column']

hollow gulch Nov 6, 2018, 9:30 PM

#

sounds like 3 different steps if i understand correctly

olive trench Nov 6, 2018, 9:31 PM

#

df4['newcol']=df4['Quantity Ordered']+df4['Extended Price']+...
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
       'Parent Number', 'Order Date', 'Item Number', 'Description ',
       'Unit Price',
       'Foreign Unit Price','Ln Ty', 'Last Stat',
       'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)['newcol'].sum()

#

just two

hollow gulch Nov 6, 2018, 9:32 PM

#

isnt that the same as the step above?

#

just written into 2 parts

olive trench Nov 6, 2018, 9:32 PM

#

as I said, I am not sure

hollow gulch Nov 6, 2018, 9:32 PM

#

sorry for the dumb question

olive trench Nov 6, 2018, 9:32 PM

#

I don't have your dataset so it's hard to say

hollow gulch Nov 6, 2018, 9:32 PM

#

but thanks for the great help. I think that code works, I am verifying it

#

it's been a whole day of banging my head against that code

#

you definitely saved the day

olive trench Nov 6, 2018, 9:34 PM

#

import pandas as pd

df4 = pd.DataFrame({'Order Number': [1,1,1,2,2,3,3,3], 'Item Number':[1,1,2,5,5,4,5,5], 'Quantity Ordered':[10,15,5,3,3,7,5,6], 'irr':[5,6,4,5,2,3,5,5]})

df5 = df4.groupby(by=['Order Number', 'Item Number'], sort=False, as_index=False)[['Quantity Ordered', 'irr']].sum()
df5['newcol'] = df5['Quantity Ordered']+df5['irr']

#

it's the same thing as mine but you're doing the column sums after, not before. If you do the ...[['Quantity Ordered', 'irr']].sum(), it does sum, but each column separately, not together

#

haha no worries

hollow gulch Nov 6, 2018, 9:35 PM

#

@@ still dont get it completely yet haha

olive trench Nov 6, 2018, 9:35 PM

#

I had a very frustrating experience myself today that I only just solved haha

#

well ask away

hollow gulch Nov 6, 2018, 9:35 PM

#

does it matter if it does sum before/after?

#

I guess I dont understand whats going on in the background

#

I added you friend btw. 😃 you seem like a nice person.

olive trench Nov 6, 2018, 9:37 PM

#

no. Except if you do it after you'll have the two summed columns in the dataframe

hollow gulch Nov 6, 2018, 9:38 PM

#

that's what I want right? because I wanted to keep those column separated

olive trench Nov 6, 2018, 9:38 PM

#

No worries, and feel free to hit me up if you need help. I am not very experienced myself but I'll try to help with what I can 🙂

hollow gulch Nov 6, 2018, 9:38 PM

#

I want the price.sum() and the quantity.sum() because they were the same order number

#

the order number was split due to different shipping batches

#

but for the analysis purpose, we want to consider them as 1 order.

olive trench Nov 6, 2018, 9:39 PM

#

ohhh

hollow gulch Nov 6, 2018, 9:39 PM

#

I work with supply chain so these things happen a lot in my data

olive trench Nov 6, 2018, 9:39 PM

#

I thought you were trying to sum those columns together

hollow gulch Nov 6, 2018, 9:39 PM

#

^ thus the confusion in us haha

#

excel can't handle this stuff as flexibly as python so I am trying to pick up python to do the heavy work

#

I play a lot with pandas

olive trench Nov 6, 2018, 9:40 PM

#

Yeah then just specify which columns you want to have sum of for the aggregated elements
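
A sketch of that with a tiny made-up table. It assumes the non-summed columns (here only Status) are identical within each order/item group, so taking the first value is safe. That assumption is not confirmed by the real dataset. The agg call with a dict lets each column get its own aggregation.

```python
import pandas as pd

# Made-up rows: order 1 / item 10 was split across two shipping batches.
df4 = pd.DataFrame({
    "Order Number": [1, 1, 2],
    "Item Number": [10, 10, 20],
    "Status": ["open", "open", "closed"],
    "Quantity Ordered": [10, 15, 3],
    "Extended Price": [100.0, 150.0, 30.0],
})

df5 = df4.groupby(
    ["Order Number", "Item Number"], sort=False, as_index=False
).agg({
    "Status": "first",          # carried over unchanged (assumed constant per group)
    "Quantity Ordered": "sum",  # summed across the split rows
    "Extended Price": "sum",
})

print(df5)
```

This groups only by the two keys that identify an order line, which avoids listing every other column in the groupby itself.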

#

my bad I didn't understand

hollow gulch Nov 6, 2018, 9:40 PM

#

noo, you helped a lot

#

😃

olive trench Nov 6, 2018, 9:40 PM

#

Do you not have this data in a database? I feel like this might be easier with SQL

hollow gulch Nov 6, 2018, 9:40 PM

#

trust me, I am ready to bang my head in the cabinet

#

IT doesnt trust us enough to give us access to SQL lol

olive trench Nov 6, 2018, 9:41 PM

#

Oh I know that struggle, gaining access to anything in my company is a nightmare too lol

hollow gulch Nov 6, 2018, 9:41 PM

#

I thought SQL would be the way to go too

olive trench Nov 6, 2018, 9:42 PM

#

if it was read only you should be fine

#

it's odd though that they want you to do a job but don't give you the tools

hollow gulch Nov 6, 2018, 9:42 PM

#

that's something that I am trying to convince them

#

even my boss struggles too lol

#

IT doesnt understand all of our needs

olive trench Nov 6, 2018, 9:43 PM

#

are you like an analyst?

hollow gulch Nov 6, 2018, 9:43 PM

#

yep

#

and you?

olive trench Nov 6, 2018, 9:43 PM

#

I'm a junior data scientist

hollow gulch Nov 6, 2018, 9:43 PM

#

nice 😃

#

let me know how it goes and perhaps we could learn more together

olive trench Nov 6, 2018, 9:44 PM

#

sure thing! I am fairly capable with pandas

#

I am kind of fortunate that our whole company does IT services, so most people are capable in IT. And luckily I don't have to communicate with the business divisions 😄

hollow gulch Nov 6, 2018, 9:48 PM

#

haha

#

I work in business division

#

I do their heavy math

#

i used to do these things in VBA and excel

#

but I think python is the real hammer to get anything done so I am trying to pick up and get everything done in python

#

takes longer and has more of a learning curve, but gives more flexibility in manipulating data

olive trench Nov 6, 2018, 9:55 PM

#

eww VBA. I had the displeasure to work in it and nevermore 😄

python is definitely powerful. I'd like to dive a little bit into big data sometime, so I hope we get a project including it eventually

charred crest Nov 6, 2018, 10:06 PM

#

Hello, is there anyone with some knowledge about reinforcement learning (Q-Learning, Monte Carlo, TD(0), greedy)?

#

Or Am I in the wrong channel ?

late garnet Nov 6, 2018, 10:35 PM

#

@charred crest it is more beneficial to you and everyone else to ask a specific question.

lean ledge Nov 6, 2018, 10:53 PM

#

Wish I had time to look more into RL, it's so relevant to my field

#

Very cool stuff

terse pewter Nov 6, 2018, 11:02 PM

#

Seems like super interesting stuff

#

give the model candy until it likes you and gives good results

chilly shuttle Nov 7, 2018, 10:18 AM

#

someone said something today that made me do a double take

#

'you can train a model on X wide feature vectors but do inference only on <X wide feature vectors'

#

that's... not right, right?

#

i mean you can do it, but running inference with consistently less information is the same as training the model without being aware of that extra information right

woven tundra Nov 7, 2018, 11:08 AM

#

What do you guys think about this certificate?

https://academy.microsoft.com/en-us/professional-program/tracks/artificial-intelligence/

Microsoft Professional Program

Microsoft Professional Program Artificial Intelligence track


olive trench Nov 7, 2018, 1:17 PM

#

@chilly shuttle maybe they included the label vectors in X...?

chilly shuttle Nov 7, 2018, 1:18 PM

#

no, their example was along the lines of 'train on postcode and age, then run inference on postcode only'

#

i... don't think windmills work that way

polar acorn Nov 7, 2018, 6:34 PM

#

They are probably talking about the difference between a design matrix used in pure statistics and the one hot encoded feature matrix often used in machine learning.

#

It's been a long time since I looked at this, but it's something something matrix must be invertible, remove one variable and use it as a base for interpreting the others. It's one of the small differences between pure stats and ml that's easy to trip over.
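
If that is what was meant, the effect can be sketched as follows. The postcode values are made up and with_intercept is a hypothetical helper. A full one-hot encoding plus an intercept column is rank-deficient because the dummy columns add up to the intercept, while dropping one category as the baseline restores full column rank.

```python
import numpy as np
import pandas as pd

postcode = pd.Series(["A", "B", "C", "A", "B", "C"], name="postcode")

# Machine-learning style: one column per category.
full = pd.get_dummies(postcode, dtype=float)
# Statistics style: drop the first category and treat it as the baseline.
reduced = pd.get_dummies(postcode, drop_first=True, dtype=float)


def with_intercept(dummies):
    """Prepend a column of ones, as in a regression design matrix."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])


X_full = with_intercept(full)        # 4 columns
X_reduced = with_intercept(reduced)  # 3 columns

print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

With the full encoding, X'X cannot be inverted, which is why the classical design matrix leaves one level out.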

true acorn Nov 7, 2018, 7:59 PM

#

@lean ledge I'd like to be able to simulate gyroscopic technology in the vacuum of space

#

Are you studying computational science right now?

#

Im teaching myself, but want to really dive into it and learn as much as i can

lean ledge Nov 7, 2018, 8:02 PM

#

@true acorn nah, I'm an engineering student. Gyroscopic shouldn't be too bad

#

Given you have the moment of inertia tensor for the body, it's a simple simulation

#

Do you know the basics of rigid body dynamics?

true acorn Nov 7, 2018, 8:03 PM

#

Hell to the nawh nawh nawh. I just had a discussion in the mathematics discord server about learning math relative to the problem im trying to solve.

#

but id love to learn

lean ledge Nov 7, 2018, 8:05 PM

#

Oof, yeah the thing is, computational science isn't really something people learn on their own. They usually learn it alongside and to aid in another discipline like physics or engineering or chemistry etc. Physics and engineering especially

#

Not having any background in what you're trying to simulate is harder because you don't know how it should behave and it takes longer to understand where to even start

#

But

#

There should be resources

true acorn Nov 7, 2018, 8:06 PM

#

Why couldnt i learn it on my own? What if i told you i was surrounded by some professors who teach this stuff?

#

ah

#

so as long as i have resources aka mentors aka professors who study this stuff, i should be okay?

#

@lean ledge I have you too, silly

#

I just cant afford school, im trying to save, but school is way too expensive so im trying my best to learn on my own

lean ledge Nov 7, 2018, 8:08 PM

#

https://www.cs.cmu.edu/~baraff/sigcourse/

true acorn Nov 7, 2018, 8:08 PM

#

woooow, 1997

lean ledge Nov 7, 2018, 8:09 PM

#

Gyroscopic stuff starts becoming relevant at rigid body dynamics
@true acorn

true acorn Nov 7, 2018, 8:09 PM

#

Fascinating

lean ledge Nov 7, 2018, 8:10 PM

#

Look through those notes and you should be able to start figuring it out

true acorn Nov 7, 2018, 8:10 PM

#

i know the basics of programming and math so if i have a question about a function can i shoot you an IM?

lean ledge Nov 7, 2018, 8:10 PM

#

A gyroscope is just a rigid body with a specific moment of inertia tensor
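
A minimal sketch of that idea, assuming a torque-free body described in its principal-axis frame so the inertia tensor is diagonal. The inertia values, initial spin, step size and the names omega_dot and rk4_step are all made up for illustration. It integrates Euler's rotation equations, I dω/dt = -ω × (I ω), with a classic fourth-order Runge-Kutta step.

```python
import numpy as np


def omega_dot(omega, inertia):
    """Torque-free Euler equations in the body frame.

    inertia holds the three principal moments (the diagonal of the tensor).
    """
    return -np.cross(omega, inertia * omega) / inertia


def rk4_step(omega, inertia, dt):
    """Advance the angular velocity by one Runge-Kutta 4 step."""
    k1 = omega_dot(omega, inertia)
    k2 = omega_dot(omega + 0.5 * dt * k1, inertia)
    k3 = omega_dot(omega + 0.5 * dt * k2, inertia)
    k4 = omega_dot(omega + dt * k3, inertia)
    return omega + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)


inertia = np.array([1.0, 1.0, 2.0])  # symmetric top: I1 == I2
omega = np.array([0.1, 0.0, 5.0])    # fast spin about the symmetry axis

energy_start = 0.5 * np.sum(inertia * omega**2)
for _ in range(1000):                # simulate 1 second with dt = 1 ms
    omega = rk4_step(omega, inertia, 0.001)
energy_end = 0.5 * np.sum(inertia * omega**2)

print(omega, energy_start, energy_end)
```

The spin about the symmetry axis stays constant while the small transverse component rotates around it, which is the wobble of a spinning top. Kinetic energy should stay nearly constant and is a handy sanity check for the integrator.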

#

I'd rather talk here

true acorn Nov 7, 2018, 8:10 PM

#

okay,

#

Damn, why am i just hearing about Euler right now. i was never taught about him in school

lean ledge Nov 7, 2018, 8:12 PM

#

Euler was a smart guy

#

Stuff in maths has to be named after the second person who discovered it because the first is always Euler :P

true acorn Nov 7, 2018, 8:13 PM

#

rofl

#

thanks for the resource @lean ledge

lean ledge Nov 7, 2018, 8:14 PM

#

Nw, happy to help

polar acorn Nov 7, 2018, 8:14 PM

#

Also he had cool hats, Euler that is

late garnet Nov 7, 2018, 8:16 PM

#

@lean ledge you must be pretty familiar with markovian processes and system theory yes?

lean ledge Nov 7, 2018, 8:17 PM

#

To some extent ye. Don't expect too much, my knowledge is full of gaps

#

I am 18, i haven't had the time to know it all in detail

late garnet Nov 7, 2018, 8:18 PM

#

Ahh okay, I'm working on a problem in understanding operational inefficiencies through deriving markov transition matrices. I was hoping someone could provide some advice.

#

Essentially, I have a system with a finite number of states, but many states are interconnected. There are many users of the system and I am comparing user level efficiency of the system to optimize business processes.

lean ledge Nov 7, 2018, 8:22 PM

#

Oof, I have worked a bit on something similar and my coworkers have worked a lot on that, but it's directly connected to a product my company has and another we're working on, so I'm afraid I can't say much.

late garnet Nov 7, 2018, 8:22 PM

#

RIP

lean ledge Nov 7, 2018, 8:23 PM

#

Sorry D:

late garnet Nov 7, 2018, 8:24 PM

#

I think I have the right idea on comparing at a high level to evaluate potential areas of inefficiency

#

What company do you work for?

lean ledge Nov 7, 2018, 8:25 PM

#

http://petradatascience.com

#

Being able to rate and improve worker efficiency is a big topic in a mine where a single extra round of shovelling a day will lead to 9M a year in profit etc

late garnet Nov 7, 2018, 8:29 PM

#

Do you have general advice on how to compare quantitatively other individuals based on some weighting within the matrix of transitions?

#

A paper or general concept would be great 😃

#

I can read and apply it myself - I hope

#

I think in my case I could take the weighted average of the differences between corresponding rows, weighted by the fraction of time spent in the state corresponding to that row
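
One possible reading of that idea as code. This is a sketch under assumptions: states are integer-coded, the per-row difference is taken as the L1 distance between rows, and the weights are the fraction of time the first user spends in each state. The function names and the two toy sequences are made up.

```python
import numpy as np


def transition_matrix(states, n_states):
    """Estimate a row-stochastic transition matrix from a state sequence."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that were never left stay all-zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def weighted_row_distance(p, q, weights):
    """Average the per-row L1 distance between p and q using the given weights."""
    row_diff = np.abs(p - q).sum(axis=1)
    return float(np.average(row_diff, weights=weights))


user_a = [0, 1, 0, 1, 0, 1]
user_b = [0, 0, 1, 1, 0, 0]

p_a = transition_matrix(user_a, 2)
p_b = transition_matrix(user_b, 2)

# Fraction of time user_a spends in each state.
weights_a = np.bincount(user_a, minlength=2) / len(user_a)

score = weighted_row_distance(p_a, p_b, weights_a)
print(score)
```

A score of 0 means both users move between states in the same way. Larger values point at rows (states) where their behaviour diverges, weighted towards the states that matter most in terms of time spent.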

lean ledge Nov 7, 2018, 8:34 PM

#

Sounds like a good start. Should be able to experiment a bit with that. My boss taught me all the relevant stuff so I don't really have any link to papers but with a bit of searching, you should be able to find them on your own

small ore Nov 8, 2018, 12:27 AM

#

I am not even sure if all of this is data science or if we need a new OT-Advanced (Nerds only) channel

lean ledge Nov 8, 2018, 1:04 AM

#

there's lots of stuff here that's not data science because everyone just redirects people here

#

it's weird

delicate nymph Nov 8, 2018, 12:02 PM

#

hello

#

i was told to come here and ask you for advice

#

i have this code https://paste.pydis.com/xuyukoqisu.py

#

and after this command it alters my dataframe

data1 = []  # create empty list
for name, dates in data.groupby(pd.Grouper(freq='D')):  # separate days
    data1.append(dates)

lyric canopy Nov 8, 2018, 12:12 PM

#

Did you check if your DataFrame is still how you want it to be between this:

data = data.drop(['use','hours','date','B','B2','E','A','w','d','year','month','day','sec'], axis=1)
###############################################################################
data = data.set_index('date_time')

And that code you've posted? Because I don't think that code you've posted should alter it.

delicate nymph Nov 8, 2018, 12:13 PM

#

up until the

data = data.set_index('date_time')

#

my code is fine

placid snow Nov 8, 2018, 12:13 PM

#

continuation from https://discordapp.com/channels/267624335836053506/303906556754395136/510055520200032266 btw.

delicate nymph Nov 8, 2018, 12:14 PM

#

it runs without any alteration

#

📎 unknown.png

lyric canopy Nov 8, 2018, 12:18 PM

#

When you say you ran the upper part, does that include or exclude that set_index line?

delicate nymph Nov 8, 2018, 12:18 PM

#

yes

lyric canopy Nov 8, 2018, 12:18 PM

#

Because the code you've posted in the help channel excluded it

delicate nymph Nov 8, 2018, 12:18 PM

#

i run the set index too

#

and its perfect

#

when i run the groupby i get those blue cells in my last column

#

those blue cells are a datetime which is repeated but also wrong

#

they mess my whole frame

lyric canopy Nov 8, 2018, 12:20 PM

#

I don't know what the colors mean, I've never used Spyder

#

Has the actual data in the DF in memory changed?

delicate nymph Nov 8, 2018, 12:21 PM

#

no the original is correct

#

just the new one that divides the days

lyric canopy Nov 8, 2018, 12:22 PM

#

You're not creating a new one anywhere, though. The only thing you get from a groupby is a groupby object.

delicate nymph Nov 8, 2018, 12:22 PM

#

i know

#

😦

lyric canopy Nov 8, 2018, 12:22 PM

#

But, the grouping is wrong ?

#

Or, what is actually going wrong?

delicate nymph Nov 8, 2018, 12:23 PM

#

no its perfect it even creates empty frames from the days i don't have any info

#

it just adds double dates to some dates

lyric canopy Nov 8, 2018, 12:26 PM

#

Right, but if you look at the index, some date_time values are repeated exactly. Aren't those the ones that get grouped together and "doubled"?

delicate nymph Nov 8, 2018, 12:26 PM

#

yes that's them

#

the original doesn't contain them

#

i can't understand why they exist

lyric canopy Nov 8, 2018, 12:29 PM

#

Are you sure they don't exist after you've created that series with:

data['date_time'] = data[['date','time']].astype(str).apply(''.join,1)

?

delicate nymph Nov 8, 2018, 12:29 PM

#

yes

#

i check the dataframe line by line

lyric canopy Nov 8, 2018, 12:29 PM

#

And they show up right after data = data.set_index('date_time') before the loop?

delicate nymph Nov 8, 2018, 12:30 PM

#

let me check it one more time i'll run the program line by line too

lyric canopy Nov 8, 2018, 12:31 PM

#

You can always print out the first, say 30 rows, with DataFrame.head(n=30) at various points to track the changes throughout the script.

#

That way, you don't have to rely on running it line by line and checking the dataframe viewer in Spyder

delicate nymph Nov 8, 2018, 12:33 PM

#

i did it

#

i check the extra lines

#

they don't exist before the groupby

lyric canopy Nov 8, 2018, 12:37 PM

#

Okay, I don't know what's going on.

delicate nymph Nov 8, 2018, 12:38 PM

#

and that is how i throw away 1 month of work

#

thanks anyway

#

have a nice day

lyric canopy Nov 8, 2018, 12:39 PM

#

There are people on this server who are much more fluent in Pandas than I am, maybe one of them will spot what's going on

delicate nymph Nov 8, 2018, 12:40 PM

#

i would appreciate that ty

#

should i post it again or will they see it?

lyric canopy Nov 8, 2018, 12:41 PM

#

Depends on whether the question gets buried.

delicate nymph Nov 8, 2018, 12:42 PM

#

so what do you recommend me to do?

delicate nymph Nov 8, 2018, 1:04 PM

#

@lyric canopy i did it i found my mistake so you don't have to search. thank you for your time. it was a stupid mistake

lyric canopy Nov 8, 2018, 1:05 PM

#

Great! Do you mind sharing the mistake so I can learn from it, too?

delicate nymph Nov 8, 2018, 1:06 PM

#

sure, i didn't convert a string to float [.astype(float)], so instead of calculating a number it was making a sequence, which made the program go nuts

#

so it seemed correct

#

and i was so focused on what seemed to be wrong that i didn't pay attention to its origin
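
One way a missing .astype(float) can produce that kind of result, sketched with made-up column names a and b since the actual script isn't shown here. With string columns, + concatenates the text instead of adding the numbers.

```python
import pandas as pd

# Values read in as text, e.g. from a file where the dtype was not inferred.
raw = pd.DataFrame({"a": ["1.5", "2.0"], "b": ["0.5", "3.0"]})

# On string columns, + joins the text together.
as_text = raw["a"] + raw["b"]

# After converting, + does arithmetic.
as_number = raw["a"].astype(float) + raw["b"].astype(float)

print(as_text.tolist())
print(as_number.tolist())
```

Checking raw.dtypes before doing arithmetic is a quick way to catch this, because string columns show up as object.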

lyric canopy Nov 8, 2018, 1:08 PM

#

Right, thanks!

teal veldt Nov 8, 2018, 3:15 PM

#

Hey, would this be the correct channel for a question related to outliers in a dataset?

#

If not, is there a related discord server that you folks care to recommend?

lyric canopy Nov 8, 2018, 3:15 PM

#

I think you can go ahead

#

What kind of model are we talking about?

teal veldt Nov 8, 2018, 3:16 PM

#

It's actually a dataset that I got from insideairbnb.com

#

I am terribly new at pandas and I thought I'd practice with some real life examples

#

The issue with that dataset (at least the Barcelona one) is that the price data has a lot of stuff that is clearly an outlier

#

Since we're talking about apartments with a listed price of 6000 euro/night

#

Now, is there a practical and scientific way, so to say, to determine where to cut the line for outliers?

#

I suppose I could arbitrarily say that whatever is out of 2 or 3 standard deviations is an outlier

#

But an expert opinion would be much appreciated

lyric canopy Nov 8, 2018, 3:18 PM

#

It actually depends on what you want to do with the data, but just deleting outlying observations is usually a bad idea (although it happens way too often)

#

If there's no substantive reason to delete the data points, you could very well be deleting valid observations

teal veldt Nov 8, 2018, 3:19 PM

#

I mean, just random analysis, like avg price per Neighbourhood, or something related to reviews

lyric canopy Nov 8, 2018, 3:19 PM

#

Well, by just deleting those outliers, you're biasing your sample

teal veldt Nov 8, 2018, 3:19 PM

#

But to do that I need to know what is actual data and what is clearly not significant

lyric canopy Nov 8, 2018, 3:20 PM

#

Do you have any reason for why these values are not "actual data"?

teal veldt Nov 8, 2018, 3:20 PM

#

I mean, I know they're wrong because if I access the relevant listing page on Airbnb I see that the price there is normal

#

So it's either badly parsed from Airbnb, or maybe it was set superhigh while the owner was creating the page

#

To avoid receiving bookings

lyric canopy Nov 8, 2018, 3:21 PM

#

So, how do you know those errors are only happening for those high values in your list?

teal veldt Nov 8, 2018, 3:21 PM

#

Feck

#

I don't

lyric canopy Nov 8, 2018, 3:22 PM

#

Anyway, I assume the distributions are likely skewed anyway, so the average may not be the best measure to describe the central tendency in the data you have

teal veldt Nov 8, 2018, 3:23 PM

#

So something like k-means clustering would be a better idea?

lyric canopy Nov 8, 2018, 3:23 PM

#

Anyway, if you have reasons to assume some values are truly erroneous, then you're justified to delete them

#

There are plenty of alternatives. Something like the median is often seen as more robust than the mean, for instance

#

That's why it is sometimes used with wages
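
A quick illustration with made-up nightly prices, where a single 6000 euro listing sits among otherwise ordinary values.

```python
import pandas as pd

# Hypothetical prices in euro per night; the last one is the suspicious listing.
prices = pd.Series([40, 55, 60, 65, 80, 90, 6000])

mean_price = prices.mean()
median_price = prices.median()

print(mean_price, median_price)
```

The mean gets dragged far above every ordinary listing while the median stays at a typical value, without having to delete anything.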

teal veldt Nov 8, 2018, 3:25 PM

#

Fair enough

#

I don't have any info on how the data was mined so I guess I'll make some assumptions to have a "clean" dataset just for the sake of practice

#

It won't really describe reality, but that's not the point right now

#

Thanks a lot for the help

#

Appreciate it

lyric canopy Nov 8, 2018, 3:27 PM

#

I may be a bit on the fence about it, because it happens way too much in academia. (deleting observations based on some arbitrary cut-off without a substantive reason)

teal veldt Nov 8, 2018, 3:28 PM

#

Nah, it's a totally fair point and it was very revealing, so thanks for that, I'll just ignore it in this specific case since I'm the guy that's still googling boolean indexing

#

So it's not a matter of doing analysis that makes sense for now, it's more like "how does pandas work?"

lyric canopy Nov 8, 2018, 3:29 PM

#

Right, just play around with it, I'd say

#

If you're going to fit a model (e.g., linear regression with ordinary least squares), then you may also fit it twice: Once with and once without the unusual data points to see what kind of effects it has.
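
A sketch of that comparison with made-up data: a clean straight line where the last point has been replaced by one unusual value. np.polyfit with degree 1 is an ordinary least squares line fit, and the keep mask is only an illustration of how a point might be set aside for the second fit.

```python
import numpy as np

x = np.arange(1, 7, dtype=float)
y = 2 * x + 1
y[-1] = 60.0  # one unusual observation

# Fit once with every observation.
slope_all, intercept_all = np.polyfit(x, y, 1)

# Fit again without the unusual point.
keep = np.ones(len(x), dtype=bool)
keep[-1] = False
slope_clean, intercept_clean = np.polyfit(x[keep], y[keep], 1)

print(slope_all, slope_clean)
```

Reporting both fits side by side shows how much the conclusion depends on that single point, which is more informative than quietly dropping it.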

teal veldt Nov 8, 2018, 3:44 PM

#

Right, I'll try that

#

Thanks again for taking the time to explain

late garnet Nov 8, 2018, 5:03 PM

#

@teal veldt - There are many univariate statistical methods to find anomalies in a data set. It is tricky to pick the right algorithm without fully understanding the purpose, as @lyric canopy points out. However, I can provide you with some algorithms that I implemented to give you some hands-on exposure to anomaly detection. In addition, it would be a good idea to read up on each method.

#

https://gist.github.com/tylerwmarrs/11ed103a00312ad89dc3597680f3eec4

Gist

univariate_anomaly_detection.py


#

https://gist.github.com/tylerwmarrs/da878d7406626c0a75f6cfc682fd99ad

Gist

stats.py


lyric canopy Nov 8, 2018, 5:13 PM

#

Right, but in my opinion, none of these methods should be used in an automated way. Detection is fairly easy, but the biggest question is the one that comes after that: Why is this an unusual observation? And how should I deal with it? Far too often, people just delete them from their dataset, probably out of ignorance, but that borders on unethical research practices.

late garnet Nov 8, 2018, 5:13 PM

#

@lyric canopy I completely agree.

lyric canopy Nov 8, 2018, 5:15 PM

#

Interesting code, though.

late garnet Nov 8, 2018, 5:15 PM

#

Generally it is best to use domain specific upper and lower bound thresholds to throw out bad data.

#

Specifically when gathering sensor data

lyric canopy Nov 8, 2018, 5:22 PM

#

Probably, I don't have that much experience with that; I work in the field of social sciences/psychology/cognitive neuroscience for a methodology/statistics department of a university. Unusual observations there are less likely to be caused by sensor error, and much more likely to be valid observations. But since a lot of researchers do know that outliers can be problematic (say influential cases in the GLM), but don't know how to deal with that, they start to throw out observations based on arbitrary rules of thumb, like +/- 3 sd, Bonferroni-corrected significant studentized residuals, arbitrary Cook's D cutoffs and what have you.

There are so many robust techniques and alternative approaches available today that just throwing out observations without a substantive reason for it triggers me.

late garnet Nov 8, 2018, 5:24 PM

#

Interesting, I actually created those anomaly detection algorithms to find unusual patterns in human behavior. 😃

lyric canopy Nov 8, 2018, 5:29 PM

#

That's a great use for it, because those unusual observations usually tell an interesting story.

#

(One of the other problems is that people sometimes only start deleting observations if their original run of a model did not provide a "significant" result, but won't do that if it did. In, for instance, a GLM [generalized linear model], an outlying observation can actually increase model fit if it has a high leverage value but a low residual, i.e., if it's an outlier on the explanatory variables, but in line with the regression plane/model.)

haughty wharf Nov 8, 2018, 6:25 PM

#

Hello,

I'm kinda new here and I have a certain predicament I'm in where I need a bit of guidance/advice. I want to be a Data engineer but I'm not sure about how to do that. I'm currently a Market Science Analyst and after being here for a few months I feel as if the job isn't for me.

lean ledge Nov 8, 2018, 7:54 PM

#

@haughty wharf data engineer is a DevOps like job. Learn SQL, docker, Hadoop, (AWS/GCP/Azure), ETL, REST APIs, Spark/Hive + some understanding of machine learning (doesn't have to be in depth) and maybe stuff like tableau. Your job is to be able to make a data pipeline for data scientists and for deployment

amber kestrel Nov 8, 2018, 8:44 PM

#

is there a way to name a plot in matplotlib so that plt.show() will only make something happen if an argument is passed into it

#

also, is there a way to automatically save generated images of that named plot, and replace old versions with the new one?

late garnet Nov 9, 2018, 4:05 PM

#

@amber kestrel Maybe you just need to add some conditionals before rendering the plot?

olive trench Nov 10, 2018, 8:29 AM

#

@amber kestrel I don't understand the first question, but the second one is plt.savefig('plot.png'). Use it before plt.show()

#

It'll save to your working directory, or you can pass a path in the filename to save it elsewhere
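
Putting both suggestions together as a sketch. The function name make_plot and its show flag are made up, and this is only one guess at what the first question was after. The Agg backend line is there so the example runs without a display. Drop it for interactive use, otherwise plt.show() will not open a window.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for interactive sessions
import matplotlib.pyplot as plt


def make_plot(xs, ys, path="plot.png", show=False):
    """Draw a line plot, save it to path, and display it only if show is True."""
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    fig.savefig(path)  # an existing file at the same path is overwritten
    if show:
        plt.show()
    plt.close(fig)     # free the figure once it has been saved
    return os.path.abspath(path)
```

Calling make_plot([0, 1, 2], [0, 1, 4]) writes plot.png to the working directory without showing anything. Calling it again with new data replaces the old image, and passing show=True also displays it.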

tardy portal Nov 10, 2018, 12:27 PM

#

this is the question
Implement a perceptron for logistic regression. For your training data, generate
2000 training instances in two sets of random data points (1000 in each) from multi-variate normal
distribution with
$$\mu_1 = [1, 0], \quad \mu_2 = [0, 1.5], \quad \Sigma_1 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix} \tag{1}$$
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class,
i.e., 1000 in total

import numpy as np

def generate_data(mean, cov, size):
    return np.random.multivariate_normal(mean, cov, size)

train_data_1 = generate_data([1,0],[[1,0.75],[0.75,1]],1000)
train_data_2 = generate_data([0,1.5],[[1,0.75],[0.75,1]],1000)

test_data_1 = generate_data([1,0],[[1,0.75],[0.75,1]],500)
test_data_2 = generate_data([0,1.5],[[1,0.75],[0.75,1]],500)

so i dont know how to label it

lyric canopy Nov 10, 2018, 12:39 PM

#

I'm not sure what the question wants you to do

tardy portal Nov 10, 2018, 12:39 PM

#

nvm i figured it out

lyric canopy Nov 10, 2018, 12:39 PM

#

Okay

tardy portal Nov 10, 2018, 12:39 PM

#

thanks

#

well the problem is this if you want to look at it

lyric canopy Nov 10, 2018, 12:39 PM

#

You probably needed to generate a dependent variable with a binary coding
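
One way to read "binary coding" here, as a rough sketch assuming NumPy 1.17+ (for `default_rng`): stack the two point clouds and build a label vector of 0s and 1s in the same order. The variable names and the seed are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
cov = [[1, 0.75], [0.75, 1]]

class_0 = rng.multivariate_normal([1, 0], cov, 1000)
class_1 = rng.multivariate_normal([0, 1.5], cov, 1000)

# Stack the two clouds and build a matching 0/1 label vector.
train_data = np.vstack([class_0, class_1])
train_labels = np.concatenate([np.zeros(1000), np.ones(1000)])

print(train_data.shape, train_labels.shape)  # (2000, 2) (2000,)
```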

tardy portal Nov 10, 2018, 12:39 PM

#

(Logistic regression, 40pts) Implement a perceptron for logistic regression. For your training data, generate
2000 training instances in two sets of random data points (1000 in each) from multi-variate normal
distribution with
µ1 = [1, 0], µ2 = [0, 1.5], Σ1 = [[1, 0.75], [0.75, 1]], Σ2 = [[1, 0.75], [0.75, 1]] (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class,
i.e., 1000 in total. Use sigmoid function for your activation function and cross entropy for your objective
function. You will implement a logistic regression for the following questions. Initialize the starting weight
as w = [1, 1, 1]. During training, stop your loop when the objective function (i.e., cross entropy) does not
decrease any more (below certain threshold) or when the gradient is close to 0 or the iteration reaches 10000.
Set your thresholds properly so that the iteration doesn’t reach 10000 for all the learning rate that you will
be using.

Perform batch training using gradient descent. Divide the derivative with the total number of training
dataset as you go through iteration (it is very likely that you will get NaN if you don’t do this.).
Set your learning rate to be η = {1, 0.1, 0.01}. How many iterations did you go through the training
dataset? What is the accuracy that you have? What are the edge weights that were learned?

#

and i am really new to this

#

so i lack knowledge

lyric canopy Nov 10, 2018, 12:42 PM

#

I'm not too big on ML; I've only done logistic regression in the context of the generalized linear model.

#

So, I can't really help you with it

mighty flame Nov 11, 2018, 12:44 AM

#

guys

#

i need help

#

anyone knows about tensorflow and keras?

#

I need to create a dataset with words and I don't understand anything

unborn cave Nov 11, 2018, 4:02 AM

#

I don't think you really need Keras or TF to create a dataset. That would be to process it.

#

Firstly, where are you trying to source the data from? Scrape? API?

#

If all you need to do is output to CSV or DB, you're better off using Scrapy and Pandas

lean ledge Nov 11, 2018, 7:08 AM

#

https://www.tensorflow.org/tutorials/representation/word2vec

TensorFlow

Vector Representations of Words | TensorFlow

cerulean magnet Nov 11, 2018, 8:07 AM

#

@tardy portal Hey, what part of the problem do you need help with

spare karma Nov 11, 2018, 7:12 PM

#

Neural Network friends, am I specifying my layers properly? https://www.reddit.com/r/datascience/comments/9w6grq/my_first_neural_network_exercise_using/

r/datascience - My first Neural Network exercise (using tensorflow...

1 vote and 1 comment so far on Reddit

lean ledge Nov 11, 2018, 8:00 PM

#

I'd suggest an SVM with an appropriate kernel instead of a neural network there 👀

#

Having to use NNs is weird when there's other stuff

#

Looks okay to me but I don't use keras much

cerulean magnet Nov 11, 2018, 8:02 PM

#

@lean ledge Hey man can I ask you a DS question in regards to normalizing/scaling and matplotlib?

lean ledge Nov 11, 2018, 8:04 PM

#

Sure, just ask your questions here

spare karma Nov 11, 2018, 8:15 PM

#

@lean ledge ty

cerulean magnet Nov 11, 2018, 8:29 PM

#

I typed it in help 5 if you dont mind taking a look

#

was few minutes ago

#

ohnvm someone is looking at it

#

Appreciate it though

tardy portal Nov 12, 2018, 5:28 AM

#

Hi, can someone help me write code for online training using gradient descent without using scikit-learn (sklearn)?

#

I was able to write a code for batch training

lunar oyster Nov 12, 2018, 5:29 AM

#

ask your question

tardy portal Nov 12, 2018, 5:29 AM

#

(Logistic regression, 40pts) Implement a perceptron for logistic regression. For your training data, generate 2000 training instances in two sets of random data points (1000 in each) from multi-variate normal distribution with
μ1 = [1, 0], μ2 = [0, 1.5], Σ1 = [[1, 0.75], [0.75, 1]], Σ2 = [[1, 0.75], [0.75, 1]] (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class, i.e., 1000 in total. Use sigmoid function for your activation function and cross entropy for your objective function. You will implement a logistic regression for the following questions. Initialize the starting weight as w = [1,1,1]. During training, stop your loop when the objective function (i.e., cross entropy) does not decrease any more (below certain threshold) or when the gradient is close to 0 or the iteration reaches 10000. Set your thresholds properly so that the iteration doesn’t reach 10000 for all the learning rate that you will be using.

Perform batch training using gradient descent. Divide the derivative with the total number of training dataset as you go through iteration (it is very likely that you will get NaN if you don’t do this.). Set your learning rate to be η = {1, 0.1, 0.01}. How many iterations did you go through the training dataset? What is the accuracy that you have? What are the edge weights that were learned?
Perform online training using gradient descent. Set your learning rate to be η = {1,0.1,0.01}. Set your maximum number of iterations to 10000. How many iterations did you go through your training dataset? What is the accuracy that you have? What are the edge weights that were learned? Compare the learned parameters and accuracy to the ones that you got from batch training. Are they the same? Explain in your report.

#

Do you see the second question

#

I did the first one, but I don't have the mental capabilities to write the second

#

# this program contains both batch training and AUC
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def create_data(mean1: list, mean2: list, sigma1: list, sigma2: list)-> tuple:
    train1 = np.random.multivariate_normal(mean1, sigma1, 1000)
    test1 = np.random.multivariate_normal(mean1, sigma1, 500)

    train2 = np.random.multivariate_normal(mean2, sigma2, 1000)
    test2 = np.random.multivariate_normal(mean2, sigma2, 500)

    
    train_label1 = np.zeros(1000)
    test_label1 = np.zeros(500)

    train_label2 = np.ones(1000)
    test_label2 = np.ones(500)

    training_data = np.vstack([train1, train2])
    test_data = np.vstack([test1, test2])

    training_label = np.concatenate((train_label1, train_label2))
    test_label = np.concatenate((test_label1, test_label2))

    return training_data, training_label, test_data, test_label


def logistic_regression(data: list, labels: list, max_iter: int, lr: float)-> list:
    weights = np.ones(2)
    n_samples = len(labels)
    for i in range(max_iter):
        predictions = sigmoid(np.dot(data, weights))
        error = labels - predictions
        # Average the gradient over the training set, as the assignment asks, to avoid NaN.
        gradient = np.dot(data.T, error) / n_samples
        weights += lr * gradient
    return weights


def calc_metrics(test_data, test_label, weights):
    confusion_matrix = [{'correct': 0, 'incorrect': 0}, {'correct': 0, 'incorrect': 0}]
    acc = 0
    predictions = []
    for i in range(len(test_data)):
        prediction = round(sigmoid(np.dot(test_data[i], weights)))
        predictions.append(prediction)
        if test_label[i] == prediction:
            confusion_matrix[int(test_label[i])]['correct'] += 1
            acc += 1
        else:
            confusion_matrix[int(test_label[i])]['incorrect'] += 1
    return acc, confusion_matrix, predictions

#


def main():
    mean1 = [1, 0]
    mean2 = [0, 1.5]
    cov1 = [[1, 0.75], [0.75, 1]]
    cov2 = [[1, 0.75], [0.75, 1]]
    steps = 10000
    print("Choose from 1, 0.1, 0.01")
    lr = float(input("enter learning rate: "))
    training_data, training_label, test_data, test_label = create_data(mean1, mean2, cov1, cov2)
    weights = logistic_regression(training_data, training_label, steps, lr)
    acc, confusion_matrix, predictions = calc_metrics(test_data, test_label, weights)
    fpr, tpr, th = metrics.roc_curve(test_label, predictions)
    auc = metrics.roc_auc_score(test_label, predictions)
    print('Accuracy: ' + repr(acc/len(test_data)*100) + '%')


main()

#

this is what i have written till now for batch which is 1

#

now i need help for second
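
For the second part, a rough sketch of what online training could look like, assuming "online" means updating the weights after every single example rather than once per pass. The function name, `tol`, `seed` and the shuffling are assumptions of this sketch and not part of the assignment text; the bias column is added so that w = [1, 1, 1] has three entries.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def online_logistic_regression(data, labels, lr, max_epochs=10000, tol=1e-6, seed=0):
    """Online (per-sample) gradient descent for logistic regression.

    Weights are updated after each example instead of once per pass.
    Returns the learned weights [bias, w1, w2] and the number of passes made.
    """
    rng = np.random.default_rng(seed)
    x = np.hstack([np.ones((len(data), 1)), data])  # prepend a bias column
    weights = np.ones(x.shape[1])                   # w = [1, 1, 1]
    previous_loss = np.inf
    for epoch in range(1, max_epochs + 1):
        for i in rng.permutation(len(x)):           # visit samples in random order
            error = sigmoid(x[i] @ weights) - labels[i]
            weights -= lr * error * x[i]            # gradient of one sample's cross entropy
        # Stop once the cross entropy over the whole set barely changes.
        p = np.clip(sigmoid(x @ weights), 1e-12, 1 - 1e-12)
        loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if abs(previous_loss - loss) < tol:
            break
        previous_loss = loss
    return weights, epoch
```

Something like `weights, epochs = online_logistic_regression(training_data, training_label, lr=0.1)` would then give the weights and the pass count to compare against the batch run. With a constant learning rate the per-sample updates keep jittering, so the learned weights will probably be close to, but not identical to, the batch ones.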

chilly shuttle Nov 12, 2018, 9:01 AM

#

why do you invoke repr directly?

olive trench Nov 13, 2018, 8:26 AM

#

Hey guys, what is the most efficient way to create pandas dataframes where you have to dynamically create it row by row?

What I try to do is create a dictionary with its keys as column names and the items are lists with the column values. I append new values to those lists and then call pd.DataFrame on the dictionary at the end. It's a lot faster than using df.append(), especially for large numbers of rows, but I'm wondering if there is a more efficient way to do this.

Anyone have any ideas/workflows that work out for them?

late garnet Nov 13, 2018, 8:30 AM

#

@olive trench it might be easier if you specify your use case.

olive trench Nov 13, 2018, 8:40 AM

#

I need to create a dataframe with certain columns . Then there is a for loop and each loop generates one row of the resulting dataframe

#

the df.append() function gets really slow if I generate each row as a df and append each time if the resulting dataframe is in thousands of rows

polar acorn Nov 13, 2018, 9:31 AM

#

I do the same as you, either append to a dict or if I don't have that many columns I just keep track of a couple of lists and make the dict and df at the end.
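
A small sketch of that pattern, assuming pandas is installed; the column names and loop body are placeholders. The lists grow inside the loop and the DataFrame is built once at the end.

```python
import pandas as pd

# One list per column; appending to a Python list is cheap.
columns = {"step": [], "square": []}
for i in range(5):
    columns["step"].append(i)
    columns["square"].append(i * i)

# Build the frame once, after the loop, instead of appending row by row.
df = pd.DataFrame(columns)
print(df.shape)  # (5, 2)
```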

olive trench Nov 13, 2018, 9:34 AM

#

@polar acorn oh thanks, that's actually even better!

wanton silo Nov 13, 2018, 2:09 PM

#

i am starting to get more into machine learning using python

#

anyone have any recommendations for videos?

lean ledge Nov 13, 2018, 2:29 PM

#

@wanton silo Check pinned

#

Columbia's course is good

#

not python specific, it's language agnostic

wanton silo Nov 13, 2018, 2:30 PM

#

o, thx

small ore Nov 13, 2018, 2:57 PM

#

@wanton silo , check this out. Curated list of tutorials: https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78

wary willow Nov 13, 2018, 4:40 PM

#

How hard would it be to take one of the categories from the google quickdraw dataset (https://github.com/googlecreativelab/quickdraw-dataset), and then use machine learning to create new images that are similar to the images that are there? And what's a good resource to learn how to do something like that?

wary willow Nov 13, 2018, 5:27 PM

#

(aka how can I learn basic image machine learning?)

wanton silo Nov 13, 2018, 5:42 PM

#

https://www.tensorflow.org/tutorials/images/image_recognition

TensorFlow

Image Recognition | TensorFlow

#

@wary willow

reef bone Nov 13, 2018, 5:43 PM

#

that's not an easy question to answer, generative models are generally quite complicated, so "basic" machine learning might not be sufficient to develop good solutions
the good thing is that we have packages like tensorflow, keras to help us, which makes the field much more accessible, but unless you're just taking code from the internet, some grasp on the fundamentals will still be necessary

#

so depending on how deep you want to go, we can recommend articles or books

#

the best resource to start with is generally google

#

also bear in mind that image recognition is not the same as image generation

#

the first would be a more approachable problem to beginners i think

turbid bay Nov 13, 2018, 6:18 PM

#

hey i got a question. I'm trying to make a neural network that can detect whether a word is either real english or just a bunch of letters bunched together. How would i make my input data? because the input layer has to have a fixed number of nodes (so i cant just do the amount of letters). and idk if you can just input the whole word? my first idea was to calculate a value for each word. I went with using A=1, B=2 etc. then calculating a base10 number by treating the letters A-Z as base26 digits (not using the digits 0-9, just the alphabet letters). So the word "hello" would calculate to be (value of "h") 8*(26^character_index) (index starting at zero) + (value of "e") 5*(26^1), etc. This creates extremely large numbers but would this work as a dataset?

#

        word_value += (ord(word[j])-64)*(26**j)
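
The rest of that snippet isn't shown, so here is a self-contained sketch of the encoding as described (A=1 ... Z=26, first letter as the least significant base-26 digit). The function name `word_to_value` and the uppercase conversion are assumptions.

```python
def word_to_value(word: str) -> int:
    """Encode a word as an integer using A=1 ... Z=26 as base-26 digits.

    The first letter is the least significant digit, matching
    word_value += (ord(word[j]) - 64) * (26 ** j).
    """
    word_value = 0
    for j, letter in enumerate(word.upper()):
        if not "A" <= letter <= "Z":
            raise ValueError(f"unsupported character: {letter!r}")
        word_value += (ord(letter) - 64) * (26 ** j)
    return word_value


print(word_to_value("hello"))  # 7073802 = 8 + 5*26 + 12*26**2 + 12*26**3 + 15*26**4
```

Values grow by a factor of 26 per letter, so long words end up many orders of magnitude larger than short ones; a network fed these raw numbers would most likely need them rescaled.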

reef bone Nov 13, 2018, 6:32 PM

#

that's an interesting approach to the problem, you're basically trying to create a hashing function and then check if the hash is valid? i'm not sure where the neural network comes into play

#

you can use something like https://github.com/rfk/pyenchant to check if a word is valid english

GitHub

rfk/pyenchant

spellchecking library for python. Contribute to rfk/pyenchant development by creating an account on GitHub.

#

recurrent neural networks (especially lstms now) are generally the way to go for handling input of variable size, for natural language processing there are models that handle the input character by character, some handle it word by word, google has taken a compromise approach with "wordpieces" -> this is an interesting read https://arxiv.org/abs/1609.08144

#

but i'm still not sure why a neural network would be at all necessary for this problem

#

(not to discourage you)

turbid bay Nov 13, 2018, 6:40 PM

#

well im relatively new to the concept of machine learning neural networks etc. so it may be absolutely pointless and not be correct but i thought it might work

reef bone Nov 13, 2018, 6:42 PM

#

it's not pointless and your idea to hash the word is good

#

and it definitely would work to some extent, but a simple check against a dictionary would be a lot easier to implement and also probably perform better

#

well, it wouldn't perform worse

turbid bay Nov 13, 2018, 6:43 PM

#

ok thankyou. and yh probably checking it against a dictionary would be easier. im actually so stupid i didnt think about that 😂

#

well ive done all the work on setting up the data so im not gonna just restart now. im gonna see how effective a neural network would be. i guess this will be interesting

reef bone Nov 13, 2018, 6:55 PM

#

the reason why it struck me as a little odd is that you will likely be using a dictionary as training data anyway, and in general terms we're trying to get the neural net to learn from the training dataset to then be able to predict unseen values outside of the dataset. in fact, learning the actual values in the dataset is not a good thing because that usually leads to poor performance on unseen data (overfitting), we want the network to extract features from the dataset that can then be used to identify and classify other data. in this case, you already have all the data available to you, so you would actually be trying to get the neural net to learn all the data. which is an odd approach because in that case you can just use the dataset itself to check if the element that you're trying to classify as either (valid) or (invalid) is within this data

#

but go for it, practice is good, i would be interested to know how your neural net will perform once finished

#

^ sorry that's not well written, hopefully it makes sense

turbid bay Nov 13, 2018, 7:03 PM

#

nah im going to use a random probably 40% for testing. plus the dataset has both positive and negative examples, i.e. both real and non real words

#

i mean training

#

maybe more idk

#

i have around 500000 examples

#

and yes i understood what you have written

polar acorn Nov 13, 2018, 7:22 PM

#

@turbid bay you can solve this with machine learning. If you're trying to learn machine learning it's a fun exercise, if it's a real problem you're having you should look to the previous answers. You could for instance train an LSTM. Let each letter be a vector of length 27, A = [1,0...,0], B = [0,1,0,...0] etc. Each word is then a sequence of letters or vectors and this is something we can use an LSTM for.
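
A rough sketch of that letter-to-vector step in plain Python. The message doesn't say what the 27th slot is for, so this sketch assumes it catches anything that isn't a-z; the function names are made up.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters
VECTOR_LENGTH = 27                 # 26 letters plus one spare slot (assumed: "anything else")


def one_hot_letter(letter: str) -> list:
    """Return a length-27 vector with a single 1 at the letter's position."""
    vector = [0] * VECTOR_LENGTH
    index = ALPHABET.find(letter.lower())
    vector[index if index != -1 else VECTOR_LENGTH - 1] = 1
    return vector


def one_hot_word(word: str) -> list:
    """Turn a word into a sequence of one-hot vectors, one per letter."""
    return [one_hot_letter(letter) for letter in word]


print(one_hot_word("ab")[0][:3])  # [1, 0, 0]
print(one_hot_word("ab")[1][:3])  # [0, 1, 0]
```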

#

In addition your model can tell you if a random mess of letters looks sort of like a word or not which is in itself an amusing thing to test out.

turbid bay Nov 14, 2018, 11:10 AM

#

whats an LSTM?

polar acorn Nov 14, 2018, 11:16 AM

#

It's a fancy way of making a neural network work with sequences, such as a sequence of letters, i.e. a word.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

#

If that seems a bit complex you can also just try turning each word into a vector, mapping each letter into a number a -> 1, b -> 2 etc. And then zero pad on the right so that all the words are the same length. So 'Apple' would be [1, 16, 16, 12, 5, 0, 0, ...,0] with enough zeros so that the length is equal to longest word in your list. Then you could use just a normal neural network.
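
The padding idea as a short sketch; the word list and the function name `encode_word` are placeholders.

```python
def encode_word(word: str, length: int) -> list:
    """Map a -> 1, b -> 2, ... and zero-pad on the right up to `length`."""
    codes = [ord(letter) - ord("a") + 1 for letter in word.lower()]
    if len(codes) > length:
        raise ValueError("word is longer than the target length")
    return codes + [0] * (length - len(codes))


words = ["Apple", "fig", "banana"]
longest = max(len(word) for word in words)
encoded = [encode_word(word, longest) for word in words]
print(encoded[0])  # [1, 16, 16, 12, 5, 0]
```

Every row now has the same length, so it can be fed to a fixed-size input layer.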

lean ledge Nov 14, 2018, 12:34 PM

#

@turbid bay An LSTM (long short-term memory) network is a specific type of recurrent neural network (read, neural network that saves some inner state) that represents internal state and what is and isnt stored in it using i, f, o and g gates. They're good because they can remember stuff eg context so good with sequential data like videos, sentences, different parts of images etc. They're good because they allow easier gradient flow compared to vanilla RNNs while being rather flexible through their input/output/forget update mechanisms.

#

data does not at all have to be sequential, it's just a common example

#

you can see the difference in stuff like attention-based models for computer vision where the same data is looked at through different focal lenses etc.

#

They're also rather easy ways of making generative models based on your training data without going through other effort in a markovian chain manner

#

Pretty useful networks

thorn river Nov 14, 2018, 4:00 PM

#

I have a pandas dataframe where I want to drop rows which have a total frequency in the dataframe below a certain threshold.

I.e. if a city in column 'City' has fewer than 5 entries in total, I want to drop the corresponding rows (rows with city count < 5 should be dropped).

However, if that city appears in a certain list of cities, I want to skip row from dropping (even if that city has <5 entries).

I.e.

Counts = {"London": 4, "Birmingham": 3}

I already have the counts of the corresponding city in a single column. How do I drop rows with <5 total entries in the dataframe EXCEPT if they are in to_keep?

late garnet Nov 14, 2018, 4:34 PM

#

@thorn river what does your dataframe look like?

#

It sounds like you need to check out the filtering section of the pandas docs, specifically isin and multiple-criteria filtering.

#

Generally I filter to keep what I want and create a new copy of the dataframe instead of dropping. I'm not sure if that is more efficient.

#

query = (df['City'].isin(to_keep)) & (df['Count'] >= 5)
df = df[query].copy()

thorn river Nov 14, 2018, 4:44 PM

#

I guess filtering works too. I would want to keep all rows whose city count is >= 5, and rows whose city count is < 5 but whose city is in to_keep would be kept as well.
if I understand correctly, your code snippet creates a dataframe with only those cities in to_keep?
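
For what it's worth, combining the two conditions with & keeps only rows that satisfy both. A sketch of the "count is at least 5 OR the city is in to_keep" version, using a made-up frame and a made-up to_keep list:

```python
import pandas as pd

to_keep = ["London"]  # hypothetical list of protected cities
df = pd.DataFrame({
    "City": ["London", "Birmingham", "Leeds", "Leeds"],
    "Count": [4, 3, 7, 7],
})

# Keep a row if its city is protected OR its count meets the threshold.
keep_mask = df["City"].isin(to_keep) | (df["Count"] >= 5)
filtered = df[keep_mask].copy()
print(filtered["City"].tolist())  # ['London', 'Leeds', 'Leeds']
```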

#

I ended up dropping everything <5 since it was so little and it probably not of that much importance, thanks anyway!

past grove Nov 14, 2018, 7:01 PM

#

hi, i have a problem with my genetic algorithm. i am making a GA to solve a sudoku puzzle (not the best way of achieving the solution, i know - but it's enjoyable and is definitely achievable).

My problem is that it never gets to a solution, it gets stuck at a fitness of around 50 (i'm using the fitness function from here https://www.researchgate.net/publication/224180108_Solving_Sudoku_with_genetic_operations_that_preserve_building_blocks, except (f(x) - 162) * -1 in order for fully fit to be = 0).

i'm new to ga's so i just set my crossover function to take 2 parents, and then make child1 the first 3 rows of parent1, second 3 rows of parent2 and third 3 rows of parent1, and child2 to be the inverse. i also did crossover on every individual - not sure if that's normal?

and then mutation, i mutated every individual as well, maybe i shouldnt do that

of course i kept the known values of the puzzle constant throughout

just looking for advice really

ResearchGate

(PDF) Solving Sudoku with genetic operations that preserve buildin...

PDF | Genetic operations that consider effective building blocks are proposed for using genetic algorithms to solve Sudoku puzzles. A stronger local search function is also proposed. Evaluation of the proposed techniques using commercial Sudoku puzzle sets and three puzzles r...

lost patio Nov 14, 2018, 7:07 PM

#

Are you stuck in a local maximum?

#

Maybe increase your mutation rate based on lack of genetic diversity

#

https://stackoverflow.com/questions/46858371/multithreaded-galib247-genetic-algorithm-stuck-in-local-maxima#48337858

Stack Overflow

Multithreaded galib247 genetic algorithm stuck in local maxima

I added multithreading support to galib247 (below), but I'm still seeing problems whereby solutions were getting stuck in local maxima.

Perhaps it's a shortcoming of genetic algorithms in general....

past grove Nov 14, 2018, 7:10 PM

#

i think im mutating too much

#

my mutation rate is essentially 1 since i mutate everything (i think that's how it works), so perhaps i should change that

#

and i mutate quite aggressively (swap 2 values in every box of every individual in the population)
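
A sketch of mutating with a rate instead of always, under an assumed representation (each individual is a list of 9 boxes of 9 digits, plus a same-shaped mask marking the given clues). The names and the representation are guesses, not the actual code being discussed.

```python
import random


def mutate(individual, fixed, mutation_rate, rng=random):
    """Return a copy where each box is mutated with probability `mutation_rate`.

    `individual` is a list of 9 boxes, each a list of 9 digits.
    `fixed` has the same shape and is True where the puzzle gives a clue.
    """
    child = [box[:] for box in individual]
    for box, box_fixed in zip(child, fixed):
        if rng.random() >= mutation_rate:
            continue  # leave this box untouched
        free = [i for i, is_fixed in enumerate(box_fixed) if not is_fixed]
        if len(free) >= 2:
            a, b = rng.sample(free, 2)
            box[a], box[b] = box[b], box[a]  # swap two non-clue cells
    return child
```

With `mutation_rate=1.0` this behaves like the "swap in every box" approach; something much lower leaves most boxes intact each generation, and the clue cells never move either way.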

lost patio Nov 14, 2018, 7:13 PM

#

Unfortunately I've only minimal knowledge on this subject and most of it is in NN

#

I'd work on refining your mutation strategy. Worst case scenario, you learn more about them.

past grove Nov 14, 2018, 7:14 PM

#

yeah im reading An Introduction to Evolutionary Computing rn

late garnet Nov 14, 2018, 8:58 PM

#

@past grove you might be interested in this repository

#

https://github.com/elsander/GoodEnoughAlgs

GitHub

elsander/GoodEnoughAlgs

Simple working examples of heuristic optimization algorithms - elsander/GoodEnoughAlgs

#

There is a talk on them as well

past grove Nov 14, 2018, 9:15 PM

#

thanks i'll take a look @late garnet

rapid spear Nov 14, 2018, 10:06 PM

#

Hello! I'm hoping for an idea/some advice on how to approach a python project
I'm scraping a webcam every 60 seconds (when it updates) of a feed looking at an exit to a business park
I'm trying to write some computer vision approach to detect when cars start queuing along this road at peak times
not necessarily an object detection problem or anything like that - just need to detect if there are cars or if there is just road (and cater for fading sunlight, weather etc)
I tried plotting the mean of the image to look for spikes when the cars started queuing but no success 😦 anybody know of any other approaches?
the aim is to have a graph of timestamps against when there is a traffic jam vs clear roads
if this is the wrong place to ask (an idea problem not really a code problem) then please redirect me elsewhere if you know of a better place
but this has to have been considered by someone before

eager ermine Nov 14, 2018, 11:01 PM

#

Is there any way to write/read a JSON file to/from a website? I'm using Heroku to host it but I can't see the files on Heroku, so I wanna see/edit the data that it makes.

If so how would I do it?

earnest prawn Nov 14, 2018, 11:13 PM

#

every time you push a new version of the software heroku would delete the modified files anyway

#

so there isnt exactly a point in doing that

#

if you want to store data use the free postgresql they provide

eager ermine Nov 14, 2018, 11:37 PM

#

which is why I wanted to write to a different website

past grove Nov 15, 2018, 12:10 AM

#

update: got solutions after 123 and 75 generations

#

still working on crossover and mutation improvements though

unborn cave Nov 15, 2018, 3:10 AM

#

Can anybody lend a hand with a pandas related issue?

foggy minnow Nov 15, 2018, 3:33 AM

#

Why in the hell are these scikit learn algorithms only outputting scores in multiples of .31

#

I mean it’s only giving scores of .31,.62,.93 I don’t understand what the hells goin on

unborn cave Nov 15, 2018, 5:30 AM

#

I have this dataframe

How would I grab values for the key 'coordinates'
I would like to extract all values for coordinates in the col geometry and insert into another df
That same df will also be used to store data extracted from the properties col . Each col in this new dataframe will be a key inside either geometry or properties

#

📎 Screen_Shot_2018-11-15_at_12.26.55_AM.png
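
The screenshot isn't available here, so the following is only a sketch under the assumption that 'geometry' and 'properties' each hold one dict per row (GeoJSON-style). The sample values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the frame in the screenshot: dict-valued columns.
df = pd.DataFrame({
    "geometry": [
        {"type": "Point", "coordinates": [-73.99, 40.73]},
        {"type": "Point", "coordinates": [-73.95, 40.78]},
    ],
    "properties": [
        {"name": "first", "kind": "a"},
        {"name": "second", "kind": "b"},
    ],
})

# Expand each dict column so that every key becomes its own column.
geometry = pd.DataFrame(df["geometry"].tolist(), index=df.index)
properties = pd.DataFrame(df["properties"].tolist(), index=df.index)

# New frame: the coordinates plus one column per key in properties.
flat = pd.concat([geometry[["coordinates"]], properties], axis=1)
print(flat)
```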

turbid bay Nov 15, 2018, 10:20 AM

#

hey im using tensorflow with keras. I want an input layer which has one node that takes on an integer value. How do i write the input layer? i tried inp = tf.keras.layers.Input(1) but that doesnt work

turbid bay Nov 15, 2018, 11:57 AM

#

my input data is singular integer values between a 1 digit number and a 39 digit number (this is a word which i have turned into an integer). And my output values should be a number between 1 and 0. The neural network should decide whether it is a real word (1) or not (0).

#


from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

inp = Input(shape=(1,),dtype='float')

layer_1 = Dense(128, activation="relu")(inp)
layer_2 = Dense(128, activation="relu")(layer_1)
pred = Dense(1,activation="softmax")(layer_2)

model = Model(inp,pred)
model.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',
                          metrics=['accuracy'])
model.fit(X_train, Y_train)

reef bone Nov 15, 2018, 6:17 PM

#

I would recommend using the vector approach suggested above to encode your input, and more than one input neuron

#

You should have 2 output nodes since you're doing classification with 2 classes, the softmax activation makes sense, but applying it to a single neuron doesn't
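
To see why softmax on a single neuron doesn't help, here is a tiny plain-Python softmax (not the Keras implementation): with one logit the output is always 1.0, whatever the input.

```python
import math


def softmax(logits):
    """Normalise a list of logits into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([-3.2]))      # [1.0]
print(softmax([7.5]))       # [1.0]
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

So the options are two softmax outputs as suggested, or, as a common alternative, a single sigmoid output paired with a binary cross-entropy loss.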

#

If your code runs then it's probably fine but generally speaking you would do something like

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense()) # first layer, specify input_shape
model.add() # your second layer
...
model.add() # your final layer

#

And uh, you should have validation and testing data

#

And pass validation data to the model.fit() method alongside your training data

#

So that you can see how it performs on unseen data

#

And then evaluate on testing data when you're done

#

Or when you think you're done

#

And maybe LeakyReLU() for your activation on the hidden layers, but make sure you pass a lower alpha value, keras defaults to 0.3 now which is ridiculously high in my opinion

karmic axle Nov 16, 2018, 12:34 PM

#

Hi, I am interested in NLP, so should i 1st do a basic data-science course like https://www.coursera.org/learn/python-data-analysis/home/welcome or i can start with NLP directly?

Coursera

Coursera | Online Courses From Top Universities. Join for Free

1000+ courses from schools like Stanford and Yale - no application required. Build career skills in data science, computer science, business, and more.

desert oar Nov 16, 2018, 10:57 PM

#

@karmic axle depends on what you wanna do, but its worth taking a basic machine learning or data analysis course

#

you're gonna need probability and stats pretty much no matter what

cerulean magnet Nov 16, 2018, 11:00 PM

#

Anyone on here might be able to help me with trying to create a specific graph using matplotlib and my dataframe? 😃

desert oar Nov 17, 2018, 12:54 AM

#

what graph? @cerulean magnet

small ore Nov 17, 2018, 1:06 AM

#

@desert oar His question got buried in #help-chestnut just before my question coz no one answered for an hr

karmic axle Nov 17, 2018, 4:43 AM

#

@desert oar okay.. i dont know what i want to do with NLP yet 😅 .. maybe will add some nlp to my discord bot.

spare karma Nov 17, 2018, 6:54 PM

#

Anyone have a solid tutorial on installing/using tensorflow-gpu? I've used this tutorial, only to result in python not finding the package after installation. https://www.youtube.com/watch?v=r7-WPbx8VuY

YouTube

sentdex

Installing CPU and GPU TensorFlow on Windows

In this tutorial, we cover how to install both the CPU and GPU version of TensorFlow onto 64bit Windows 10 (also works on Windows 7 and 8). TensorFlow is a P...

#

Any recommendations/advice is welcomed and appreciated.

serene oar Nov 17, 2018, 9:24 PM

#

Hey, why do I get this error when trying to plot different columns in a csv by defining their names?

TypeError: 'DataFrame' object is not callable

#

This doesn't happen when I just call the whole dataframe
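
The code isn't shown, so this is only a guess at the cause: that error usually appears when a DataFrame is used with parentheses, e.g. df('col'), instead of square brackets. A small sketch with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 1, 2], "speed": [3.0, 4.5, 6.0]})  # hypothetical columns

try:
    df("speed")  # parentheses try to call the DataFrame like a function
except TypeError as exc:
    print(exc)   # 'DataFrame' object is not callable

speed = df["speed"]             # one column -> Series
subset = df[["time", "speed"]]  # several columns -> DataFrame
```

If the columns come from a CSV, `subset.plot(x="time", y="speed")` would be one way to plot them by name.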

tall shuttle Nov 17, 2018, 9:25 PM

#

@spare karma yes?

#

is it just an import error

spare karma Nov 17, 2018, 10:57 PM

#

@tall shuttle you call the package by? import tensorflow for both the cpu and gpu versions? (pending what's installed)

tall shuttle Nov 17, 2018, 10:59 PM

#

yes

spare karma Nov 17, 2018, 11:37 PM

#

@tall shuttle

📎 unknown.png

#

Might be tmi, but on the left is the install, on the right is the call

tall shuttle Nov 17, 2018, 11:37 PM

#

I can't read that

spare karma Nov 17, 2018, 11:37 PM

#

rip

tall shuttle Nov 17, 2018, 11:37 PM

#

just give them separately

spare karma Nov 17, 2018, 11:37 PM

#

kk

#

the install:

#

📎 unknown.png

#

the import call in python:

#

📎 unknown.png

tall shuttle Nov 17, 2018, 11:38 PM

#

oh

#

you installed right

#

that is a postinstall error

spare karma Nov 17, 2018, 11:39 PM

#

related to python version?

#

Rather, with respect to you and your time, is this a common issue I can find a solution for online?

tall shuttle Nov 17, 2018, 11:40 PM

#

install cuda

spare karma Nov 17, 2018, 11:40 PM

#

I'll re-install and share a screenie.

tall shuttle Nov 17, 2018, 11:40 PM

#

install cuda for your gpu

spare karma Nov 17, 2018, 11:41 PM

#

this correct?

#

📎 unknown.png

#

(re-installing now)

tall shuttle Nov 17, 2018, 11:42 PM

#

yes

#

then restart

spare karma Nov 17, 2018, 11:42 PM

#

kk

#

one second, sorry.

#

mmk, restarting

📎 unknown.png

#

same error:

📎 unknown.png

#

(when installing cuda, I'm just following the prompts and clicking 'next')

reef bone Nov 17, 2018, 11:54 PM

#

do you have cuDNN

spare karma Nov 17, 2018, 11:55 PM

#

yep. followed the tutorial's instructions of drag/dropping (ill share the time, one sec)

#

https://www.youtube.com/watch?v=r7-WPbx8VuY @11:46

#

Wait. would it matter if Python was installed on a separate drive? (i have all python-related-ness on one drive..)

#

(nvidia's gpu computing toolkit, and corresponding cuDNN files are on a diff drive)

reef bone Nov 17, 2018, 11:58 PM

#

I think it should be fine as long as your path variables are set correctly, I remember I had to tamper with mine

#

But it's been a while since I installed mine

#

Sorry that was more of a shot in the dark

spare karma Nov 17, 2018, 11:59 PM

#

ah gotcha

#

I assume you did something of the same sort as this?

📎 unknown.png

reef bone Nov 18, 2018, 12:01 AM

#

https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#install-windows

cuDNN Installation Guide :: Deep Learning SDK Documentation

This guide provides step-by-step instructions on how to install and check for correct operation of NVIDIA cuDNN v7.4.1 on Linux, Mac OS X, and Microsoft Windows systems.

#

I followed this, at the bottom it mentions the variables you need to have set so that tensorflow can find cudnn

spare karma Nov 18, 2018, 12:02 AM

#

ah, gotcha. thank you!

spare karma Nov 18, 2018, 12:31 AM

#

rip, tried it with both cuda v9.0 and v10.0. Would my python version matter? 3.6.7

📎 unknown.png

young aurora Nov 19, 2018, 1:08 AM

#

So I've got this code outputting this visual:

# code for Question 18
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (8, 10)
fig, ax = plt.subplots()
sns.countplot(x="Q18", palette="Blues", data=t10, ax=ax)
ax.set_ylabel('Number of Respondents', size=15)
ax.set_xlabel('')
ax.set_title('"What percentage of the team\nmeeting do you speak?"', size=20)
plt.suptitle('Top 10%', y=1.01, fontsize=30)

#

📎 unknown.png

#

How in gods name do I get this to sort in logical order? Do I need to setup the pandas dataframe to sort them by the order (e.g. 0%,10%,20%....etc) prior to plotting it with seaborn?

#

Sorry about the transparency. The problem is that it's randomly ordering (or to me at least, randomly) the bars

upper lily Nov 19, 2018, 4:04 AM

#

Not familiar with pandas but

#

data=t10 what is t10?

#

@young aurora

chilly shuttle Nov 19, 2018, 8:43 AM

#

that's not pandas, that's seaborn

#

but t10 would be a pandas dataframe

#

also you're better off using plt.figure(figsize=...) than rcParams

late garnet Nov 19, 2018, 5:56 PM

#

@young aurora you can specify the order in seaborn's countplot. For example, if you wanted to specify labels that are string percentages:

ordering = ['10%', '20%', '30%']
sns.countplot(x='value', data=df, order=ordering)

# you can also generate the ordering
# this gives you 0% to 100% in logical order
ordering = list(map(lambda s: str(s) + '%', range(0, 110, 10)))
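
If the exact set of labels isn't known up front, the ordering could also be derived from the data itself. A minimal sketch, assuming every label looks like '10%'; the sample `labels` list and the `percent_value` helper are made up for illustration, and the resulting list would be passed as `order=` to `sns.countplot`:

```python
# Hypothetical survey answers as they might appear in the column, in arbitrary order.
labels = ['50%', '10%', '100%', '0%', '20%', '10%', '50%']

def percent_value(label):
    """Turn a label such as '20%' into the integer 20 so it sorts numerically."""
    return int(label.rstrip('%'))

# Unique labels sorted by numeric value rather than alphabetically
# (a plain string sort would put '100%' before '20%').
ordering = sorted(set(labels), key=percent_value)
print(ordering)
```

Sorting on the numeric value is the point here: the labels are strings, so the default sort order is not the logical one.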

gritty hawk Nov 20, 2018, 7:37 AM

#

pandas: is there a way to get the sql queries it will run if I do a to_sql()?

chilly shuttle Nov 20, 2018, 9:56 AM

#

it doesn't run the sql in one shot, so not really

#

afaik the only way to do it would be to provide your own dummy connection and intercept the bulk_save_objects or whatever other calls pandas makes to serialise

#

https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/generic.py#L2019

GitHub

pandas-dev/pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more - pandas-dev/pandas

desert oar Nov 20, 2018, 11:06 AM

#

its a bit weird that you cant generate the intermediate sql actually

#

the magic happens somewhere in here https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/io/sql.py


#

ah right it uses sqlalchemy on the backend

chilly shuttle Nov 20, 2018, 1:31 PM

#

it's not that weird, bulk loads are almost never a plaintext sql query

gusty onyx Nov 20, 2018, 5:26 PM

#

Can someone explain to me how data mining works? Like what's the process?

proud raven Nov 20, 2018, 8:55 PM

#

That's a really broad topic. All it boils down to is getting a big heap of data and applying things like stats or machine learning to it to get new information. e.g. If I have a spreadsheet of raw basketball data (player, shot time, distance from net, success-or-failure, away-or-home-game) I could calculate a simple stat. like 3-pointer percentage by player, I could try to see if home games correlate to a higher accuracy, I could render a heatmap that shows how effective each player is based on their distance from net. Etc.

lean ledge Nov 21, 2018, 12:58 AM

#

Generally you have so much more data than that. You do your best to use dimensionality reduction techniques (eg PCA), use data summary reports to find covariance between data (plot all your data on a big covariance diagram, see which bits look the hottest), use some domain specific knowledge to narrow down what you should be looking into, etc

small ore Nov 21, 2018, 2:46 AM

#

"dimensionality reduction technique" . I stumble upon new words for subset selection

desert oar Nov 21, 2018, 4:03 AM

#

its not the same as subset selection

#

its more general

#

e.g. pca or multidimensional scaling.. you aren't choosing subsets of features, you are creating new features with the goal of reducing the number of features required to convey some amount of information
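
To make the difference concrete, here is a rough sketch of PCA on two features in plain Python. The data points are invented, and the closed-form angle only works for the two-feature case; with more features you would normally reach for a library. The thing to notice is that the single new feature mixes both original columns instead of picking one of them:

```python
import math

# Made-up 2-feature data where the two columns move together.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 sample covariance matrix.
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# For a 2x2 symmetric matrix the leading eigenvector has this closed-form angle.
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
direction = (math.cos(theta), math.sin(theta))

# The single new feature is a weighted mix of BOTH original features.
new_feature = [x * direction[0] + y * direction[1] for x, y in centered]

var_new = sum(v * v for v in new_feature) / (n - 1)
print("direction:", direction)
print("share of total variance kept:", var_new / (cxx + cyy))
```

Subset selection would have kept either the first or the second column as-is; here one derived column carries almost all of the variance of both.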

lean ledge Nov 21, 2018, 4:16 AM

#

^^^

polar acorn Nov 21, 2018, 10:53 AM

#

How come the "google it" answer to the data mining question was removed? For broad unspecific questions I find that to be a good answer. Vague non specific questions deserve vague non specific answers.

#

A better question would be, what is your favourite introductory text to data mining? Or does anyone have a simple and well explained example with code that uses some of the most common data mining techniques?

#

And in both cases the answer should hopefully be "look in the pinned messages" 😃

simple crag Nov 21, 2018, 12:35 PM

#

Because, whether intentional or not, it's a hostile and condescending thing to say to someone who is asking a question. As is "vague non specific questions deserve vague non specific answers" (emphasis mine)

#

Ask for more information, don't just dismiss them

polar acorn Nov 21, 2018, 12:49 PM

#

Vague non specific questions do deserve vague non specific answers, though you're probably right the people who ask them deserve a chance to better frame what they're after.

simple crag Nov 21, 2018, 12:58 PM

#

No they don't, and they especially don't here. This might help to better explain the huge issue with that attitude: https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/

Stack Overflow Blog

Jay Hanlon

Stack Overflow Isn't Very Welcoming. It's Time for That to Change....

Let’s start with the painful truth:

proud raven Nov 21, 2018, 1:49 PM

#

We've all asked vague non-specific questions, none of us popped out of the womb asking where we can find a good tutorial on the time complexity of a merge sort. If you have an answer in the form of what a better question would look like, why not make that your response? "Hey friend, that's a broad topic, here are a couple of resources I find handy and you might try these keywords to narrow your focus." You're still encouraging them to do their own work but you're providing a starting point. Or, you know, just don't respond.

polar acorn Nov 21, 2018, 1:57 PM

#

Yeah no I didn't tell him to google it. And tried to come up with two other questions to ask instead. But I thought it was strange that the google answer was removed (if that was the case, he might have deleted it himself for all I know). I didn't know that was the policy.

simple crag Nov 21, 2018, 1:58 PM

#

He did delete it himself

#

After I said it wasn't helpful, because it wasn't

polar acorn Nov 21, 2018, 2:00 PM

#

Sure, fair enough then. I thought for some reason it was removed.

small ore Nov 21, 2018, 3:09 PM

#

This is also a discussion cum help channel. Need not be as strict as a help channel when it comes to topics. A question, however vague may be taken as a good beginning for a discussion on that topic. However I believe "google it" is not a bad answer if put in a polite way

#

Oh. Didnt realize I am one hr late

lucid hornet Nov 21, 2018, 3:15 PM

#

Not really your call to make whether a channel is more or less strict.

#

Or should be, rather

dim osprey Nov 21, 2018, 3:16 PM

#

How can I build a network to reason about depth/scale from objects that I already have hard sizes for?
Like, say I have a picture of someone holding a Comcast Remote:

small ore Nov 21, 2018, 3:16 PM

#

True. Well, let us say, that is my perception

dim osprey Nov 21, 2018, 3:16 PM

#

📎 comcast_remote.jpg

#

All of those remotes are the same size. How can I use it to measure the scale of the rest of the image?

lucid hornet Nov 21, 2018, 3:16 PM

#

Also if google it is all that is said about the topic, then no, it's not helpful

#

@dim osprey I feel like you'd have to have some sort of reference to gauge size

dim osprey Nov 21, 2018, 3:17 PM

#

The remote is the reference. I want to measure his hands.

simple crag Nov 21, 2018, 3:17 PM

#

Generically you'd have to segment the image to find the "remote-shaped" blobs and get the one that's most likely the remote

#

Once the image is segmented you'll have the coordinates relative to the image and can scale that based on the reference measurements you have

dim osprey Nov 21, 2018, 3:20 PM

#

I can't train a model to recognize certain objects that all have the same size?
Like, there's been a lot of work on street sign detection because of the runup to self-driving cars. All stop signs are the same size.

simple crag Nov 21, 2018, 3:20 PM

#

You can, but at some point you have to start with segmenting the images so you can train the classifier

dim osprey Nov 21, 2018, 3:21 PM

#

If I know that the stop sign is x inches, and y pixels, and the pole is y2 pixels, then it must be x2 inches, right?

simple crag Nov 21, 2018, 3:21 PM

#

Yes

#

If the remote blob is 100 pixels and your remote is 10 inches, it's 10 pixels/inch

#

However, that's not going to apply for things in the background

dim osprey Nov 21, 2018, 3:22 PM

#

📎 stop_sign.jpg

simple crag Nov 21, 2018, 3:22 PM

#

But if it's more or less in the same plane as the remote then you'll be ok
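
As a small sketch of that arithmetic (the function names and the 75-pixel hand measurement are made up; the 100 pixel / 10 inch numbers are the ones from the example above):

```python
def pixels_per_inch(reference_pixels, reference_inches):
    """Scale of the image in the plane of the reference object."""
    return reference_pixels / reference_inches

def measure_inches(target_pixels, scale):
    """Convert a pixel length to inches; only valid near the reference's depth."""
    return target_pixels / scale

# Numbers from the example above: a 10-inch remote spanning 100 pixels.
scale = pixels_per_inch(100, 10)
# Hypothetical measurement: a hand spanning 75 pixels at roughly the same depth.
hand_inches = measure_inches(75, scale)
print(scale, hand_inches)
```

The scale is a single number per depth plane, which is why it stops being valid for objects much closer to or farther from the camera than the reference.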

#

I'm sure there are corrections since this is a fairly well developed field but it's not one I'm particularly familiar with

dim osprey Nov 21, 2018, 3:23 PM

#

Aha, but there are already neural networks that reason about scale and depth. A lot, actually.

#

https://arxiv.org/pdf/1803.10039.pdf

#

They do pretty good, depending on how much training data you have.

#

My idea was to add the context from objects that have fixed sizes: Phones, Car tires

#

Look at this:

#

📎 showimage_1.jpeg

#

This is perfect: A CSX engine is always the same size, and so is that tanker car. I can take those hard cues and use them to measure the entire image.

desert cradle Nov 21, 2018, 3:41 PM

#

doors are usually the same height too

dim osprey Nov 21, 2018, 3:41 PM

#

But I'm interested in outdoor scenes.

polar acorn Nov 21, 2018, 7:18 PM

#

Has anybody here built tensorflow from source to optimise for CPU? I won't have access to a GPU for some time and I'm considering if the speed up is worth it.

desert oar Nov 21, 2018, 9:44 PM

#

why build from source?

#

you mean so you can use -march=native -mtune=native -O3 or something?

proud raven Nov 21, 2018, 11:17 PM

#

Building TF from scratch theoretically allows you to optimize for the target system's CPU (according to TF). Though I've only ever seen this in the context of Intel-based CPUs and even then it's only applicable to CNNs. To answer pptt, yes I've done a CPU optimized build. Results were a mixed bag.

#

Dunno about being "worth it" or not. The effort in compiling an optimized build is pretty minimal, if you have an i5 or i7 give it a shot. It can't hurt.

desert oar Nov 22, 2018, 1:02 AM

#

does it use a blas or does it implement its own linear algebra

#

like would MKL vs OpenBLAS make a difference

proud raven Nov 22, 2018, 4:19 AM

#

This got me curious so I went back and ran a simple ConvNet against Python 3.6, Tensorflow 1.7 (MKL, SSE, AVX, and FMA enabled), Tensorflow 1.7 (Just SSE, AVX, and FMA enabled), Tensorflow 1.7 (Standard, No optimization install). With just SSE, AVX, and FMA optimizations the average time between steps decreased by 15% from the no frills install. With MKL enabled the time between steps increased by 260%. I suspect that weirdness with MKL is a broken build on my part. I need to recompile and also potentially shift to Tensorflow 1.12.

#

https://github.com/mind/wheels Has a curated set of .whls up to Tensorflow 1.8.

GitHub

mind/wheels

Performance-optimized wheels for TensorFlow (SSE, AVX, FMA, XLA, MPI) - mind/wheels

desert oar Nov 22, 2018, 4:21 AM

#

very cool

#

thanks for trying that

#

to be fair mkl isn't always faster. i noticed improvement on basic operations, matrix multiply and especially SVD

#

but apparently its not always faster even on intel hardware

proud raven Nov 22, 2018, 4:22 AM

#

Indeed. Though I'll probably be up until 3:00a.m. validating MKL is working as expected.

desert oar Nov 22, 2018, 4:26 AM

#

oof

#

conda is your friend here

lean ledge Nov 22, 2018, 4:53 AM

#

oh relevant, literally compiling TF from source right now

#

been waiting an hour and a half and no end in sight, kms

#

would not recommend for marginal benefits in speed

#

im only doing it because I want to work with TF on 18.04 which only supports CUDA 10 which isnt supported by stable TF releases as of yet

polar acorn Nov 22, 2018, 6:45 AM

#

I ended up not doing it. Looks like it might have cost me more time than it would have saved me anyway.

lean ledge Nov 22, 2018, 10:36 AM

#

my compilation took me 3-4 hours, so unless it's saving you that much time, probably not worth it :p

polar acorn Nov 22, 2018, 11:02 AM

#

The job took me 6 hours, so unless it halves training time then probably not

sage carbon Nov 22, 2018, 1:12 PM

#

hey can i ask a Q here ? O.o

#

O_O

proud raven Nov 22, 2018, 1:31 PM

#

If it's data science related, yes. If it's more of a general Python issue, feel free to use a help channel.

lyric canopy Nov 22, 2018, 1:42 PM

#

In general:

#

!t ask

arctic wedgeBOT Nov 22, 2018, 1:42 PM

#

ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

sage carbon Nov 22, 2018, 1:44 PM

#

I asked it in a help channel but they redirected me here .. So this is my Q.

I need to get a user input and determine his emotion from the answer. For example if user inputs 'I'm sad', he is sad. (:P obviously) .. But this can be more complex than that example (as i think). I need your ideas about how to achieve this. I can make a word list and match them but is it a good idea. Or do i need to involve something like TensorFlow (which is completely new to me) . ? confused
:/

#

😦

proud raven Nov 22, 2018, 2:20 PM

#

The broad term you're looking for is "sentiment analysis". Typically the first place to start, regardless of how you'll use this data, is to get a set of labelled data. Associating words or sentences with a feeling. Individual words are hard because they can have a double meaning: "I'm so happy I could cry" -> Is this person sad or happy?

#

How you then use those labels is sort of up to you. You could, as you say, just match words to sentences and assign a sentiment like happy or sad. For a small school project that might be fine. If you want a program that can more accurately predict sentiment when the user input contains multiple words from your list, as in my example, you would have to go to a deep learning approach in Tensorflow or PyTorch. If you've never tried deep learning before I would recommend you look at Keras or PyTorch first. Tensorflow is daunting for newcomers.
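
For the simple word-list route, a minimal sketch might look like the following. The lexicon and the `guess_sentiment` helper are invented for illustration, and a real project would start from a labelled dataset instead of six hand-picked words:

```python
import re

# Tiny hand-made lexicon; a real project would use a labelled dataset.
LEXICON = {
    'happy': 'happy', 'glad': 'happy', 'great': 'happy',
    'sad': 'sad', 'cry': 'sad', 'miserable': 'sad',
}

def guess_sentiment(text):
    """Count lexicon hits per feeling and return the most frequent one."""
    counts = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        feeling = LEXICON.get(word)
        if feeling:
            counts[feeling] = counts.get(feeling, 0) + 1
    if not counts:
        return 'unknown'
    best = max(counts.values())
    winners = sorted(f for f, c in counts.items() if c == best)
    return winners[0] if len(winners) == 1 else 'ambiguous'

print(guess_sentiment("I'm sad"))
print(guess_sentiment("I'm so happy I could cry"))
```

The second sentence hits one "happy" word and one "sad" word, so plain matching cannot decide. That is the kind of input where a trained model earns its keep.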

spark nimbus Nov 22, 2018, 2:58 PM

#

Using tensorflow, how would I return a "score" value to the model for judging itself?

spark nimbus Nov 22, 2018, 5:22 PM

#

Traceback (most recent call last):
  File "X:\Python\lib\site-packages\rlbot\botmanager\bot_manager.py", line 187, in run
    self.call_agent(agent, self.agent_class_wrapper.get_loaded_class())
  File "X:\Python\lib\site-packages\rlbot\botmanager\bot_manager_struct.py", line 36, in call_agent
    controller_input = agent.get_output(self.game_tick_packet)
  File "X:\Downloads\bot\rlai\tflayers.py", line 85, in get_output
    out = self.models[packet.num_cars].call(arr)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\sequential.py", line 229, in call
    return super(Sequential, self).call(inputs, training=training, mask=mask)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\network.py", line 845, in call
    mask=masks)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\network.py", line 1031, in _run_internal_graph
    output_tensors = layer.call(computed_tensor, **kwargs)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\layers\core.py", line 970, in call
    outputs = gen_math_ops.mat_mul(inputs, self.kernel)
  File "X:\Python\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4856, in mat_mul
    name=name)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "X:\Python\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 1792, in __init__
    control_input_ops)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 1631, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 193 and 48 for 'dense_734/MatMul' (op: 'MatMul') with input shapes: [193], [48,48].

#

arr = np.array(total)  # len(total) = 193
self.models[packet.num_cars].call(arr)

sage carbon Nov 22, 2018, 6:10 PM

#

Thank you very much @proud raven .. This seems more complex than i thought. Actually it's a school kind project but i wanna do it in the right way. I'll look into things you told. 😃 Thanks again !

lapis sequoia Nov 22, 2018, 6:59 PM

#

may I ask a question here? can we ask questions?

silk acorn Nov 22, 2018, 6:59 PM

#

!t ask

arctic wedgeBOT Nov 22, 2018, 6:59 PM

#

ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

lapis sequoia Nov 22, 2018, 7:00 PM

#

okay, I never used a ttest on a data set before. My boss wants me to compare two patients' features with t test, but wants me to do this for each feature

#

is this doable? is this a thing? the only library I found in python kinda uses lists as inputs and gives a single p value

#

also I'm looking for the ttest functions of the stats library in python and there are four different kind of t test, I wonder which one is more suitable for this task.. =/

#

I think I figured it out, thanks!

polar acorn Nov 22, 2018, 8:02 PM

#

@lapis sequoia What did you end up doing?

lapis sequoia Nov 22, 2018, 8:13 PM

#

I was doing the thing wrong, for each feature, I made a list that has the values of neighbors from knn
then for two hypothesis groups, I compared the lists of these features with each other

#

so for 15 features, I had two lists of 5 values each

#

and I had 15 t test results, which was what I needed
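
A rough sketch of the "one test per feature" loop, assuming the two groups are independent samples. The feature names and values are invented, and only the t statistic is computed here with the standard library; `scipy.stats.ttest_ind(a, b, equal_var=False)` would return a p-value as well:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical data: feature name -> (values for group 1, values for group 2),
# five neighbours per group as in the description above.
features = {
    'age':    ([61, 58, 64, 60, 59], [45, 47, 44, 50, 46]),
    'weight': ([70, 72, 69, 71, 73], [71, 70, 72, 69, 73]),
}

# One test per feature, so 15 features would give 15 results.
t_by_feature = {name: welch_t(g1, g2) for name, (g1, g2) in features.items()}
for name, t in t_by_feature.items():
    print(name, round(t, 2))
```

With 15 tests run side by side, some correction for multiple comparisons is usually worth considering before reading too much into any single result.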

polar acorn Nov 22, 2018, 8:16 PM

#

Sounds fair. Was the purpose to test if the patients had a significant difference or if each feature had one?

lapis sequoia Nov 22, 2018, 8:55 PM

#

oh sorry I saw your reply too late, was busy with sending the data to the right places.. the main goal was the former one

signal juniper Nov 22, 2018, 9:51 PM

#

Matplotlib is being helpful and drawing lines directly between my data points for the green line. However I'd like it to stay horizontal until the value changes and then connect with a vertical line. Is there an option to do this or do I have to insert extra points to accomplish it?

📎 unknown.png

hearty token Nov 23, 2018, 7:46 AM

#

Try step instead of plot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.step.html @signal juniper
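
With `plt.step(x, y, where='post')` the value is held until the next x and then jumps, which sounds like what's wanted here. For comparison, the "insert extra points" route from the question could be sketched like this (the helper name and the sample data are made up):

```python
def to_step_points(xs, ys):
    """Insert extra points so a straight-line plot looks like a staircase.

    Each y value is held until the next x, then the line jumps vertically.
    """
    step_xs, step_ys = [xs[0]], [ys[0]]
    for x, prev_y, y in zip(xs[1:], ys, ys[1:]):
        step_xs += [x, x]          # reach the new x at the old height, then jump
        step_ys += [prev_y, y]
    return step_xs, step_ys

# Hypothetical data: the value changes at x = 2 and x = 5.
xs, ys = [0, 2, 5], [1, 3, 2]
print(to_step_points(xs, ys))
```

The output lists can go straight into a normal `plot` call, but `step` saves the bookkeeping.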

signal juniper Nov 23, 2018, 9:43 AM

#

@hearty token Awesome, thank you so much!

inland viper Nov 24, 2018, 7:28 AM

#

So... how about that Tanenbaum?

lapis sequoia Nov 25, 2018, 2:14 AM

#

I'm trying to count the number of entries of a same class using pandas

#

I have a label called CLASS and each entry is a type

#

So I need to count how many of each type I have there

gleaming wadi Nov 25, 2018, 3:35 AM

#

hi , what is the most beautiful visualization tools/library/framework/app in python?

gritty hawk Nov 25, 2018, 10:18 AM

#

@lapis sequoia have you tried value_counts()?

#

df.label.value_counts()

lapis sequoia Nov 25, 2018, 4:46 PM

#

yeah that's it, cheers!

gritty hawk Nov 26, 2018, 7:27 AM

#

GWupuCirCup

spare karma Nov 26, 2018, 2:39 PM

#

Anyone have a good SQL discord (like this one)?

#

Or, lol, anyone know how to use the 'where' clause in queries? I'm trying to filter down to [fldDuration] = 02:00

📎 unknown.png

#

varchar(10)

#

(removed phi, for obvious reasons)

desert oar Nov 26, 2018, 11:05 PM

#

?

#

you need to connect them w/ logical AND and OR

#

WHERE
    (user.age < 30 OR user.age > 40) AND
    address.state = 'NY'
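
Here's a runnable sketch of the same idea using Python's built-in sqlite3. The `visits` table, its rows, and the `state` column are invented; only the `fldDuration` column name and its varchar type come from the question:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical table; only the fldDuration column name comes from the question.
conn.execute('CREATE TABLE visits (id INTEGER, fldDuration VARCHAR(10), state TEXT)')
conn.executemany(
    'INSERT INTO visits VALUES (?, ?, ?)',
    [(1, '02:00', 'NY'), (2, '01:30', 'NY'), (3, '02:00', 'CA'), (4, '02:00', 'NJ')],
)

# The duration is stored as text, so it is compared against a quoted string.
# Parentheses keep the OR group together before AND is applied.
rows = conn.execute(
    """
    SELECT id FROM visits
    WHERE fldDuration = '02:00'
      AND (state = 'NY' OR state = 'CA')
    ORDER BY id
    """
).fetchall()
print(rows)
conn.close()
```

Every condition has to be joined to the previous one with AND or OR; two conditions on consecutive lines with nothing between them is a syntax error.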

prisma comet Nov 27, 2018, 12:24 AM

#

Hi

woven tundra Nov 27, 2018, 2:27 AM

#

@gleaming wadi Dash is pretty good for building interactive visualizations as web apps. If this is just for a notebook however seaborn is good for non-interactive charts and plotly for interactive ones.

lean ledge Nov 27, 2018, 2:45 AM

#

there's also bokeh

unkempt zinc Nov 27, 2018, 1:53 PM

#

hi all- i am working on some NLP project and i have to read my data from PDF documents! i use tika parser to read the files and then proceed with text processing techniques. one thing i can't get my head around is basically how is it possible to get rid of headers and footers of a given document ! any direction would be really appreciated

spare karma Nov 27, 2018, 2:18 PM

#

@desert oar that was it. 12/13 had no and in-between them.

arctic moth Nov 27, 2018, 6:45 PM

#

Hello, may I ask for advice with CVXOPT if anyone has some experience?

#

I am using the quadratic programming solver for the MI-SVM algorithm, but if I get negative weights it gives me this error:
ValueError: Rank(A) < p or Rank([P; A; G]) < n
and even if i try to run it without constraints it still fails

#

this is full error output:

#

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/misc.py", line 1429, in factor
    lapack.potrf(F['S']) 
ArithmeticError: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 2011, in coneqp
    matrix(0.0, (0,1)), 'beta': [], 'v': [], 'r': [], 'rti': []})
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 1981, in kktsolver
    return factor(W, P)
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/misc.py", line 1444, in factor
    lapack.potrf(F['S']) 
ArithmeticError: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 21, in <module>
    main()
  File "main.py", line 19, in main
    model.learning()
  File "/home/frovis/Programovani/Python/Bakalarka/Model.py", line 70, in learning
    sol = solvers.qp(P,q)
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 4487, in qp
    return coneqp(P, q, G, h, None, A,  b, initvals, kktsolver = kktsolver, options = options)
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 2013, in coneqp
    raise ValueError("Rank(A) < p or Rank([P; A; G]) < n")
ValueError: Rank(A) < p or Rank([P; A; G]) < n

fierce saffron Nov 27, 2018, 11:39 PM

#

hi everyone

#

I've got a simple question. Just looking for thoughts

#

I've got a scatter plot containing 3 classes. However, lots of overlapping.

#

I'm not sure how to best display this data

hearty token Nov 28, 2018, 12:23 AM

#

@arctic moth I have no experience with that, but I think you'd have more luck if you posted some code.

desert oar Nov 28, 2018, 4:10 AM

#

@arctic moth do you know what matrix rank is? A code snippet at least would be helpful but somehow youre ending up with an ill conditioned matrix

#

@fierce saffron if its too hard to visualize with 3 colors on the same plot, make 3 plots side by side w/ the same axis dimensions

fierce saffron Nov 28, 2018, 4:15 AM

#

I hadn't thought of that but it's a good idea

thorn river Nov 28, 2018, 12:46 PM

#

I have a dict with the following structure:

'Username' with a key 'label' and a key 'text'. So if i would want to access the text of a username i would do dic["username"]["text"].

I want to tokenize the strings in the [text] part. My goal is to supply the tokenized text to a classifier so i can train and make predictions. How do i do this using spacy?

I have looked at the spacy docs and tutorials but dont know if their pipe operations replaces the text in the dict with the tokenized text

desert oar Nov 28, 2018, 1:34 PM

#

No it just returns a tokenized version

#

Theres no special handling for dicts... list of strings goes in, list of processed docs comes out

#

Its not like C where you pass a preallocated result container and the function fills that container...

thorn river Nov 28, 2018, 1:52 PM

#

Alright thanks. Then how should i provide the tokenized strings to a model?

desert oar Nov 28, 2018, 6:51 PM

#

i recommend finding some text classification 101... but the basic method is "bag of words"

#

one row per document, one column per word, word frequency in matrix cell

#

'the quick brown fox'
'the strong brown bear'

becomes

1 1 1 1 0 0
1 0 1 0 1 1

where the columns correspond to "the", "quick", "brown", "fox", "strong", and "bear" respectively
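
The same matrix can be built in a few lines of plain Python. This is only a sketch of the idea, with whitespace splitting standing in for a real tokenizer; libraries such as scikit-learn's `CountVectorizer` do the same job but typically order the columns differently:

```python
docs = ['the quick brown fox', 'the strong brown bear']

# Vocabulary in first-seen order, so the columns match the example above.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# One row per document, one column per word, cell = how often the word occurs.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is then a fixed-length numeric vector, which is the shape a classifier expects as input.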

silk acorn Nov 28, 2018, 6:54 PM

#

why are there 2 extra columns?

desert oar Nov 28, 2018, 6:54 PM

#

there aren't?

silk acorn Nov 28, 2018, 6:56 PM

#

i must be misinterpreting it then

desert oar Nov 28, 2018, 6:56 PM

#

perhaps. count the words

#

6 unique words (aka 'tokens'), 6 columns

silk acorn Nov 28, 2018, 6:57 PM

#

ah, like that

desert oar Nov 28, 2018, 7:02 PM

#

@thorn river spacy also provides word vectors you can use, look it up in their docs

#

here's a sloppy basic text pipeline using some data thats kinda like what you described @thorn river

import spacy

userdict = {
    'joe': {
        'text': 'milk is okay',
        'photo': '13861361.jpg'
    },
    'rhylli': {
        'text': 'i am the funniest weightlifter in florida',
        'photo': '09370813.jpg'
    }
}

en_core_web_md = spacy.load('en_core_web_md')

text_spacy = en_core_web_md.pipe(userdata['text'] for userdata in userdict.values())

waxen stump Nov 29, 2018, 4:36 AM

#

Seaborn is pretty

#

How does this code work:

#

(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data()
my_full_dataset = tf.data.Dataset.from_tensor_slices(
  (tf.cast(mnist_images[..., tf.newaxis]/255, tf.float32),
   tf.cast(mnist_labels, tf.int64)))

#

load_data() generates a tuple of arrays (train_x, train_y), (test_x, test_y)

#

How does that get parsed into three places?

#

e.g. isn't the first like saying (mnist_images, mnist_labels), _ = (train_x, train_y), (test_x, test_y)

#

er wait, the _ is a throwaway variable so the only data that's passed is train_x, and train_y

#

I think?

desert oar Nov 29, 2018, 1:13 PM

#

Yes, "_" is a valid variable name in python so you just dump stuff there you aint gonna use

#

Your interpretation seems right

lean ledge Nov 29, 2018, 8:33 PM

#

Their problem isn't that its dumping, it's dumping 2 things into 3 places

desert cradle Nov 29, 2018, 8:52 PM

#

no

#

you have a tuple of two tuples. the second goes in _, the first is unpacked further into two variables

#

note that since there isn't anything magical about _, if it's a significant amount of data you should del _ so it can get garbage collected
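
A tiny stand-alone version of that unpacking, with a fake `load_data` (hypothetical, not the TensorFlow one) that returns the same shape of result:

```python
def load_data():
    """Stand-in for a loader that returns ((train_x, train_y), (test_x, test_y))."""
    return ([1, 2, 3], ['a', 'b', 'c']), ([4], ['d'])

# Two items on the right, two targets on the left: a nested pattern and `_`.
# The nested pattern then splits the first item into two names.
(images, labels), _ = load_data()
print(images, labels)   # the training pair
print(_)                # the test pair is still referenced here

del _                   # drop the reference if the test data is not needed
```

So there are only two places at the top level; the three names come from unpacking the first place one level further.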

arctic moth Nov 29, 2018, 9:44 PM

#

@desert oar @hearty token Thanks guys, but I already found out, that I just made a logical error in constraints. But thanks 😃

#

Guys anyone have experience with Fox, Tiger, Elephant and Musk datasets?

#

I want to use them to train my mi-SVM classifier, but I can not find in SVM file which type of bag is positive or which one is negative. It just contains number of the bag and then values and I also can not find anything on the web, that would use those datasets.

placid galleon Nov 29, 2018, 11:38 PM

#

I've been directed here from help.

#

I'm having a problem installing tensorflow

#

📎 unknown.png

lean ledge Nov 29, 2018, 11:46 PM

#

CMU's 10-701 is also an excellent resource on introductory ML for anyone interested. Slides, lectures and additional readings available here: http://www.cs.cmu.edu/~pradeepr/701/

#

@placid galleon try pip -V?

placid galleon Nov 29, 2018, 11:48 PM

#

📎 unknown.png

#

18.1

#

Collecting tensorflow
  Could not find a version that satisfies the requirement tensorflow (from versions: )
No matching distribution found for tensorflow

#

also tried that as well

desert oar Nov 30, 2018, 12:04 AM

#

@arctic moth got a link to a page describing it or something?

#

@placid galleon you can install other packages?

#

what does pip list output? (use https://bpaste.net if its long)

#

also can you try using python3 -m pip instead of pip3?

placid galleon Nov 30, 2018, 12:07 AM

#

Package          Version
---------------- -------------------
aiohttp          3.3.2
altgraph         0.16.1
async-timeout    3.0.0
attrs            18.1.0
auto-py-to-exe   2.4.2
bottle           0.12.13
bottle-websocket 0.2.9
certifi          2018.4.16
cffi             1.11.5
chardet          3.0.4
discord.py       1.0.0a1483+gec3435b
Eel              0.9.10
future           0.17.1
gevent           1.3.7
gevent-websocket 0.10.1
greenlet         0.4.15
idna             2.7
keyboard         0.13.2
macholib         1.11
multidict        4.3.1
numpy            1.14.5
opencv-python    3.4.3.18
pandas           0.23.4
pefile           2018.8.8
Pillow           5.2.0
pip              18.1
PyAutoGUI        0.9.38
pycparser        2.18
PyInstaller      3.4
PyMsgBox         1.0.6
pynput           1.4
pypiwin32        223
PyScreeze        0.1.18
python-dateutil  2.7.3
PyTweening       1.0.3
pytz             2018.5
pywin32          224
pywin32-ctypes   0.2.0
requests         2.19.1
scipy            1.1.0
selenium         3.141.0
setuptools       39.0.1
six              1.11.0
tflearn          0.3.2
tqdm             4.28.1
urllib3          1.23
websockets       6.0
whichcraft       0.5.2
yarl             1.2.6

#

python3 isn't recognised

lean ledge Nov 30, 2018, 12:08 AM

#

@placid galleon TF doesnt support python3.7

placid galleon Nov 30, 2018, 12:08 AM

#

Oh, i'll try down grading

lean ledge Nov 30, 2018, 12:08 AM

#

I'd recommend python3.6 + CUDA 9 if you're trying to use tf

placid galleon Nov 30, 2018, 12:09 AM

#

3.6 exactly or will 3.6.7 work?

lean ledge Nov 30, 2018, 12:12 AM

#

will work

#

i'm currently on 3.6.7

placid galleon Nov 30, 2018, 12:12 AM

#

Ok doke installing now, I'm praying it works ^_^

#

Thanks in advance

lean ledge Nov 30, 2018, 12:13 AM

#

not even 100% sure it will work but it should. pip can't find it because every published wheel is tagged with the Python versions and platforms it supports, and pip skips any wheel whose tags don't match your interpreter

placid galleon Nov 30, 2018, 12:16 AM

#

Still not working 😐

#

same error

lean ledge Nov 30, 2018, 12:17 AM

#

@placid galleon Are you using 64 bit python or?

placid galleon Nov 30, 2018, 12:18 AM

#

I'm not sure how would I check?

#

oh it's 32 bit

#

😐 what the hell

lean ledge Nov 30, 2018, 12:20 AM

#

try 64 bit, that should work

desert oar Nov 30, 2018, 12:20 AM

#

there's basically no reason to use 32 bit python unless you're in a very specific situation and you know you need it

placid galleon Nov 30, 2018, 12:23 AM

#

I ran `import platform; platform.architecture()` and it told me it was 32bit ... installing 64bit now, silly me
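For anyone hitting the same "No matching distribution found" error, here is a minimal sketch of the check described above, using only the standard library. The helper name `interpreter_bits` is made up for illustration; the three checks should agree with each other on a normal CPython install.

```python
import platform
import struct
import sys


def interpreter_bits() -> int:
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


# Three independent ways of asking the same question.
print("Python", platform.python_version())
print("platform.architecture():", platform.architecture()[0])
print("pointer size:", interpreter_bits(), "bits")
print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
```

If this reports 32 bits, pip has no matching tensorflow wheel to offer, which is consistent with what fixed the install here.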

#

Ok, i've installed 64bit

#

but i cant just hit cmd and type python now

#

i need to do it in the exact path

#

😐

#

now pip doesn't work

#

yikes pepe

#

nevermind, sorted the path 😛

#

Yeah that sorted it, installed tensorflow 😄

#

woo

rough pecan Nov 30, 2018, 1:01 AM

#

Heya guys! two quick questions about time forecasting:

What would you use to forecast a dataset of at most 50 datapoints that are unpredictable?
(Every dataset of <= 50 points is different, so you can't know in advance whether it has seasonality or not)

Moltz?
Basic multivariate linear regression?

ARIMA seems like too much, as I don't think it will be accurate with <= 50 points...

I heard there's an interesting way of doing that by converting all the cycles to sine functions, but regarding this and seasonality I'm wondering: is there a way to programmatically check whether there's any seasonality in it at all (as the dataset is different every time it comes in)?

desert oar Nov 30, 2018, 1:11 AM

#

@placid galleon highly recommend conda or at least a virtualenv

#

@rough pecan ive done arima on ~15 datapoints and gotten useful output. depends on the problem

#

can you describe the problem a bit more

#

inputs, nature of data, etc

rough pecan Nov 30, 2018, 1:45 AM

#

I'm starting the project on AliExpress but it could grow from there...

The datapoints are initially the order amount and the date of the orders...

... but I'll be doing a lot of testing with both
A) existing (review count, review score average, etc) datapoints and
B) newly created datapoints from the existing data, such as scores made by myself, sales per country etc... you know, using the data that's there to create more data out of it... feed A/B into different models and see if there's anything interesting I can find...

I'm quite new to this so it would be both interesting and quite rich in my learning process (I believe) to try playing with all these datapoints to see if any of them (again, both the ones out there and the new data I'll create from them) could help in the search for trendy / hot products...

So, the goal: Spot trendy products that have a forecast of increased sales for the following days... @desert oar

#

So, for some reason despite the products having thousands of orders AliExpress only allows you to see the first 50 pages of orders.

We could assume then that this is a univariate forecasting problem but as I said, I'll be doing a lot of testing with more data so it's most likely to end up being multivariate in the long run, unless of course the other datapoints aren't really useful / don't help the forecast at all... we only know for certain after we test... right? 😛

desert oar Nov 30, 2018, 2:20 AM

#

yeah so you have a couple of options

#

basic option: automatically fit arima to each product

#

i actually implemented something like that at work a long time ago, basically you run through a bunch of hypothesis tests then fit a model. nowadays you can probably get away with auto.arima in the R package forecast, or implement it in Python -- it's based on the practices in https://otexts.org/fpp2/

Forecasting: Principles and Practice

2nd edition

#

there are many traditional stats methods for seasonal decomposition as well
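On the earlier question of checking for seasonality programmatically, one simple heuristic (not the only one, and not necessarily what auto.arima does) is to look for a lag whose sample autocorrelation clears the usual white-noise band of 1.96 / sqrt(n). A rough standard-library sketch, with made-up function names and toy data:

```python
import math
from typing import Optional, Sequence


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: no structure to detect
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def detect_seasonal_period(series: Sequence[float],
                           max_lag: Optional[int] = None) -> Optional[int]:
    """Return the lag (>= 2) with the strongest significant autocorrelation, or None."""
    n = len(series)
    if max_lag is None:
        max_lag = n // 2
    threshold = 1.96 / math.sqrt(n)  # approximate 95% band for white noise
    best_lag, best_acf = None, threshold
    for lag in range(2, max_lag + 1):
        acf = autocorrelation(series, lag)
        if acf > best_acf:
            best_lag, best_acf = lag, acf
    return best_lag


# Toy data: a weekly pattern repeated 7 times (49 points).
weekly = [10, 12, 15, 11, 9, 20, 25] * 7
print(detect_seasonal_period(weekly))
```

A strong trend also produces high autocorrelation at short lags, so on real data it is safer to difference or detrend the series before running a check like this.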

#

but yes, one way to encode "cyclical" data (eg. day of month, day of year, hour of day) is to use polar coordinates

#

so 0 and 23 hours are closer than 0 and 4 for example
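A small sketch of that encoding: map the value onto the unit circle and use the sine and cosine of the angle as two features. The function names are illustrative only.

```python
import math
from typing import Tuple


def encode_cyclical(value: float, period: float) -> Tuple[float, float]:
    """Map a cyclical value (e.g. hour of day, period=24) to a point on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def circular_distance(a: float, b: float, period: float) -> float:
    """Straight-line distance between the encoded points of a and b."""
    sin_a, cos_a = encode_cyclical(a, period)
    sin_b, cos_b = encode_cyclical(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Hour 23 ends up closer to hour 0 than hour 4 does.
print(circular_distance(0, 23, 24))
print(circular_distance(0, 4, 24))
```

Both features are needed: with the sine alone, two different hours can map to the same value.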

#

you could probably fit one big linear model or neural network or whatever that way. drop in last 7 periods sales, plus polar encoded day, or something
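A rough sketch of what that feature matrix could look like, assuming NumPy, a plain least-squares fit as the "linear model", and a day-of-week taken from the position in the series (all assumptions for illustration; real data would use actual dates):

```python
import math

import numpy as np


def build_features(sales, n_lags=7, period=7):
    """Rows of [last n_lags sales, sin(day), cos(day), intercept]; target is the next sale."""
    rows, targets = [], []
    for t in range(n_lags, len(sales)):
        angle = 2.0 * math.pi * (t % period) / period  # hypothetical day-of-week index
        rows.append(list(sales[t - n_lags:t]) + [math.sin(angle), math.cos(angle), 1.0])
        targets.append(sales[t])
    return np.array(rows, dtype=float), np.array(targets, dtype=float)


# Toy series with a weekly pattern, standing in for one product's daily orders.
sales = [10, 12, 15, 11, 9, 20, 25] * 6
X, y = build_features(sales)

# Ordinary least squares; lstsq also copes with redundant columns.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
print(X.shape, float(np.max(np.abs(pred - y))))
```

To fit one model across many products, the rows from every product would be stacked into the same matrix, optionally with extra covariate columns.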

#

more advanced option: bayesian arima model where the parameters across all products share a common distribution, allowing you to pool information that way

#

there are methods for doing online/incremental updates of bayesian models but its not as easy as just running some more epochs on a NN

rough pecan Nov 30, 2018, 2:43 AM

#

Why would you want to pool such thing?

desert oar Nov 30, 2018, 2:46 AM

#

shared information across products

#

right? its not like every time you look at a product online, you forget everything you know about other products

#

pooling will be especially useful for sharing information about cyclicality/seasonality

rough pecan Nov 30, 2018, 2:52 AM

#

That sounds useful when forecasting, let's say, the overall website performance, but would it be good when you want the products to compete with one another? It sounds more like an inclusion situation than an opposition one, right?

For competition (which is the main objective), I believe it may be more than enough to keep them separated: together in the same database of course, but separated in their analysis, and then just compare the results between all of them 🤔

desert oar Nov 30, 2018, 2:52 AM

#

and the effect of any covariates

#

what do you mean separated?

#

or competing?

#

you said before that you're trying to forecast when sales are about to increase in some product

#

the more you know about a product, the easier that should be, right?

rough pecan Nov 30, 2018, 2:54 AM

#

Correct...

#

What I mean is that it's more about finding trends of individual products and not of the entire business overall... making the products "compete" with each other, which is why I assume that keeping their analysis separated (let's say an ARIMA fit for each product, without pooling them all together) would be a little more accurate, wouldn't it? How would another product's information benefit the prediction for this one?

#

My potatoes are about to burn in the oven! I have to run to check them out haha... brb.
Sorry if I'm asking too many questions but this is just such a juicy conversation!

Thank you so much for helping me out, you're bringing a lot of value and clarity to my sight 😊 🙏

desert oar Nov 30, 2018, 3:41 AM

#

does forgetting what you know about Product A help you make better predictions about Product B?

rough pecan Nov 30, 2018, 3:58 AM

#

Maybe, and probably?... unless you're trying to predict the trend of a certain category, right?
Or unless products A and B are the exact same product of course...
For that I'll be training an image recognition model to compare all the pictures from two different listings, in case the two listings (A/B) are for the same product; in that case I was planning to "merge" those two. Until now I never thought about merging the data of ALL products 🤔
@desert oar

desert oar Nov 30, 2018, 3:59 AM

#

you arent merging them entirely. you are just sharing common information

#

its effectively a form of regularization

#

eg you shrink all arima coefs towards 0
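To make the shrinkage idea concrete, here is a deliberately tiny sketch: an AR(1) coefficient fitted with a ridge penalty, which pulls the estimate toward 0. This is a simplified stand-in for the Bayesian ARIMA setup mentioned above (a ridge penalty corresponds to a zero-centred Gaussian prior on the coefficient); the function name, penalty values, and data are made up.

```python
from typing import Sequence


def fit_ar1(series: Sequence[float], ridge: float = 0.0) -> float:
    """Least-squares AR(1) coefficient on the mean-centred series, with an optional ridge penalty."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    num = sum(centered[t] * centered[t - 1] for t in range(1, len(centered)))
    den = sum(c * c for c in centered[:-1]) + ridge
    return num / den if den else 0.0


# One short, noisy series standing in for a single product.
series = [3, 5, 4, 6, 5, 7, 6, 8]
plain = fit_ar1(series)               # no shrinkage
shrunk = fit_ar1(series, ridge=10.0)  # pulled toward 0
print(plain, shrunk)
```

With few datapoints per product the unpenalised estimate is noisy, so the penalty trades a little bias for a lot less variance. A hierarchical model goes one step further and learns the amount of shrinkage from all products together.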

rough pecan Nov 30, 2018, 4:00 AM

#

So it's just to make the model (a neural net, for example) more accurate in its predictions, just for the sake of making it smarter, and then you predict the trend separately for each product? 🤔

#

Oh... hell, I hadn't even opened my mind to the possibility of sharing common data for the sake of regularizing it. I thought the factors affecting each product were very different, since the market (people) and behaviors for each product are very different 🤔

desert oar Nov 30, 2018, 4:03 AM

#

yes, you still make separate predictions of course

#

its not a simple process to be sure

#

it takes some doing. but yes the idea is to only share information on common factors

arctic moth Nov 30, 2018, 11:12 AM

#

@desert oar http://www.cs.columbia.edu/~andrews/mil/datasets.html

lime lava Nov 30, 2018, 6:17 PM

#

Hi, I need some help with data manipulation. I have a table with 3 columns. I want to group by a, then remove duplicated groups. That is, once I've done the groupby, I might have 2 groups that contain the exact same “B” rows, and I want to remove those duplicated groups.
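One possible reading of this, sketched with pandas: two groups count as duplicates when their `b` values are identical, and only the first such group is kept. The column names `a`, `b`, `c` and the data are assumptions, since the actual table isn't shown.

```python
import pandas as pd

# Hypothetical table: groups a=1 and a=2 have the same "b" rows.
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 3],
    "b": ["x", "y", "x", "y", "x", "z"],
    "c": [10, 20, 30, 40, 50, 60],
})

# One hashable signature per group: its "b" values in row order.
signatures = df.groupby("a")["b"].apply(tuple)

# Keep the first group for each distinct signature, then filter the rows.
keep = signatures.drop_duplicates().index
deduped = df[df["a"].isin(keep)]
print(deduped)
```

If row order inside a group shouldn't matter, building the signature with `tuple(sorted(...))` instead would treat reordered groups as duplicates too.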

desert oar Nov 30, 2018, 6:26 PM

#

You want to remove duplicates within each group?