#data-science-and-ml
@fallow summit well i'm not sure where the existing solutions are but this is the arcade learning environment: https://github.com/mgbellemare/Arcade-Learning-Environment
(ALE)
ALE is the most popular gaming environment that's being used for AI research right now
i would search around and try to find solutions to ALE roms
but as @lean ledge points out, you won't understand the solutions until you get through at least linear algebra and bayesian statistics
There's a Udemy course taught by Kirill Eremenko called ML with Python
Absolutely zero math, all he does is say how to use the ML libraries and what to put in and what you get out
If you just wanna play around with something and start playing with the libraries with no knowledge of the math required to understand what's going on, you could check that out
And hey who knows, if it piques your interest you could look into the math behind everything
Everybody learns in different ways and maybe tinkering until it fascinates you is your way
It certainly was mine for programming in general
But if you just wanna learn how to use the libraries you could check out some free stuff out there too. Start with what the usual machine learning models are and look up how to implement each of them in Python
I think Google has an intro to ML course as well somewhere on their website
https://developers.google.com/machine-learning/crash-course/ml-intro
Haven't done it though. Not sure if it's good.
@woven tundra exactly what I meant. Thanks!
No worries
@lean ledge I never argued about the need to learn maths for ML or anything else. It is all about how the math is approached. Remember the discussion was about Andrew Ng's course and the Columbia course on edX. Based on my viewing of a couple of videos of the Columbia course, it is mere recitation of mathematical formulae without even clearly mentioning what each of the variables stands for. Nor does it in any way make it interesting or get people involved. Of course, if the viewer already knows the math and is clever enough to understand the intuitions and limitations, then it may seem good to them, a mere revision. From my vantage point there is no point in reciting those formulae without going into the nitty-gritty. It is not about math vs. no math but about the level of math and the approach. I am personally disappointed with the Columbia course (based on viewing only a couple of lessons). It should either point to a math course as a prerequisite or get to the point and omit the math, rather than present it without the basic details of what stands for what. (I have seen a few texts on statistics and ML now and the notations vary hugely. If you already know the math it is about being able to adapt, but that is not the case for everyone.) I would love to see a mathy course on ML but with a more beginner-friendly and nuanced approach
@fallow summit I have a tonne of links on ML-related courses, and it all depends on how you want to approach it and what you are already proficient with (programming etc.). DM me if you want to look at the text wall of links
I've completed 3 courses in ML:
1st one was ML A-Z by Kirill Eremenko on Udemy which involved no math at all
2nd one was ML on Coursera by Andrew Ng and
3rd one was Maths for ML on Coursera offered by Imperial college london
I wasnt really understanding the math taught in andrew's course so the 3rd one helped me A LOT in understanding Andrew's course
I recommend you to check it out if you arent understanding it.
@small ore perhaps we're remembering differently because Columbia's course had excellent explanations. It was far from recitations of formulae. And I'm not sure what you expect. ML is essentially a field of maths and it has some bare minimum mathematical prerequisite. It's the same amount any quantitative degree should teach you. It's silly to think you can make progress in differential geometry without any background in calculus and ML isn't any different.
Agreed @lean ledge
But don't fault the course if you don't have a mathematical background because the course was absolutely excellent
@lean ledge can you suggest some advanced courses on ML and its applications?
specifically Neural Networks since I'm quite fascinated with it
I mean. Now that you say it, I am confused. Are there multiple Columbia courses on edX? I know basic calculus and studied statistics and probability and stuff long ago. I am not sure if that is enough for the "bare minimum" standards. I also do not understand the notation of any set of formulae unless they spell it out, preferably in a handout
There was no special notation that wasn't already standard in maths. Yes, they didn't go through every single symbol like they would in high school, but everything was explained qualitatively and its purpose was explained. They did assume someone can read mathematical equations on their own. E.g. they would explain that they take the distance, and the formula for distance would pop up. I don't think it's unfair to assume someone doing an ML course knows how summation notation and Pythagoras' theorem work without having them explained? Perhaps if every single formula needs to be explained, you aren't as fluent at maths as you should be?
@river plume Once you've gone through an introductory ML course, for the most part your doors are open for starting on any specific topic
Pythagoras' theorem, the distance formula and summation notation are all easy, and explaining them would make the course boring and digress a lot from what is being taught. I am not talking about those at all. If they are teaching probability/expectations/statistics etc. and not using the same notation as another text on the topic (which we can refer to, or have to refer to because the course does not always make it clear), then it is difficult. Also, for me, an approach where one explains the math through an example, especially one relating to the objective the course is trying to achieve, is more involving
@river plume you can try Goodfellow's Deep Learning book, it's meant to be good. Stanford has a DL course which posts its syllabus and lecture slides online
Their probability notation is pretty standard across any text I've seen.
There's honestly not that many ways you can write probability stuff differently
And until what level do they need to explain everything with examples and qualitative expressions? Everyone comes with different levels of maths and they can't just spend four times as much effort to record more explanations of basic things with more examples and explanations of every single part of the formula just because someone is taking a maths course without prerequisite maths knowledge
As you said, it is a difference in viewpoint arising from our different levels of understanding math. But I will certainly say I will not be (and ask anyone not to be) disheartened by the lack of advanced knowledge in math when learning ML. Maybe jobs are a different thing. But if you want to enjoy learning ML, that level of math is either not necessary or can be learned along the way. I find that Andrew Ng sometimes oversimplifies the math and I would have liked more, but I am kinda happy I am learning something there. The Columbia one is the other end. It makes things look complex (if not being complex) and is either above my level or requires a lot more effort from me.
Btw, I think in the amount of text you have typed for this discussion, you could have taught us a couple of lessons in probability
I actually could have! I actually think I'll write a blog post or two at some point when my exams are over
Maybe Rags ( If I may take the liberty of calling you so), you could also recommend me a text to read up perhaps. (Sometimes it can be better than a lecture). Preferably something that is available free online
I was more talking about http://cs230.stanford.edu than deeplearning.ai. Andrew is less wishy washy and shallow when he's teaching actual students.
I've never actually done deeplearning.ai myself
But CS230 should be better regardless
It doesn't have recorded lectures unfortunately
I don't actually know a suitable textbook for ML. I went with ESL at first and it was too dense even for me and had weird way of phrasing things. Much better as a reference when you already know something than as a new way to learn. Bishop's pattern recognition and ML is what I've settled on for myself since it clicks with me quite well
But likely a bit too intense for you mathematically. Best covered by an upper undergrad in physics, maths, engineering etc, probably even a bit too much for most CS majors.
The ISL exists but it had no maths at all and has similar problems as Andrew's course at times. Still better imo. At least it doesn't miss obvious stuff
Better at building conceptual understanding
They realise that if you don't understand the maths, you won't be able to implement it so they also have examples in R
ESL I reckon is Elements of Statistical Learning? What's ISL? And meh, I do not want to learn R
Introduction to Statistical Learning. It's a book by the same people, but ESL was meant to be an introductory text for someone just out of uni doing their PhD. ISL is essentially meant for people from non-quantitative degrees who want to get into some basics, e.g. PhD students in biology or psychology or something, where they might use ML for the data but don't understand ML themselves, so they want some basic background on things
I am neither here nor there. I hang in between. Anyway, thanks for the recos, Raggy. Looking forward to your blog
Yah it can be hard to be in between situations. I felt the same when I hadn't found Bishop's book
There's probably a book for your level too, just keep searching
So as a computer science student, I should be starting off with ESL?
I would recommend Bishop's book if you can handle the math
I'm a BSc CS student and it worked for me
Pattern Recognition and ML?
Yes
Alright I'll check it out
Bishop's is amazing ❤ ❤ ❤
hi I'm using pandas at work to read in an excel file, compare strings with a database and make a new dataframe filled with ids pointing to those rows
however
read_excel is moving the contents of a column over to another column for some damn reason
anyone ever experienced something like this?
found the issue: don't have spaces in your column names, people
i've never had issues with spaces in my column names
i've had plenty of issues with duplicate column names however
what exactly was your issue (for the benefit of anyone who may have the same problem in the future)
How do we visualize higher dimensions such as 4D or 5D? I mean, is it even possible to conceive beyond numbers?
Would you mind explaining how it can be done to some extent?
4d -> 3d Projections http://eusebeia.dyndns.org/4d/vis/05-proj-1
3d slices of 4d objects https://www.youtube.com/watch?v=vZp0ETdD37E
Alternative methods https://www.youtube.com/watch?v=zwAD6dRSVyI
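For anyone who wants to poke at the "4d -> 3d projection" idea in code, here is a minimal sketch with numpy. It assumes a simple perspective projection along the 4th axis; the function names and the camera position `camera_w` are made up for illustration and are not taken from the linked pages.

```python
import itertools
import numpy as np

def tesseract_vertices():
    """Return the 16 vertices of a 4D hypercube with coordinates in {-1, +1}."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=4)))

def project_4d_to_3d(points, camera_w=3.0):
    """Perspective-project 4D points to 3D.

    Each point's (x, y, z) is divided by its distance from a camera sitting
    at w = camera_w on the 4th axis (an assumed, illustrative setup).
    """
    points = np.asarray(points, dtype=float)
    scale = 1.0 / (camera_w - points[:, 3])  # larger w = closer to camera = drawn bigger
    return points[:, :3] * scale[:, None]

verts = tesseract_vertices()
projected = project_4d_to_3d(verts)
print(projected.shape)  # (16, 3)
```

The 8 vertices with w = +1 end up as a larger cube and the 8 with w = -1 as a smaller cube inside it, which is the familiar "cube within a cube" picture of a tesseract. The 3D points can then be drawn with any ordinary 3D plotting tool.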
wrong ping
sorry :c
@woven tundra
Thanks @lean ledge !
nw
hi
@woven tundra since you seem to know about pandas
mind helping me out with a query?
Sure @gritty hawk , what's up?
Hello again!
I found this one
fast.ai's practical deep learning MOOC for coders. Learn CNNs, RNNs, computer vision, NLP, recommendation systems, pytorch, time series, and much more
And it looks quite cool. I will go through this and Andrew Ng course
i've tried experimenting with pyautogui and pytesseract to recognize these numbers, but neither work. how would you go about it? (it's runescape btw)
with pyautogui i manually saved images of each number 0,1,..9 and tried to find them in the image. pytesseract spits out letters (i think for 244 it read it as "eag")
One thing you can try is to restrict the characters pytesseract is looking for
This may help you: https://stackoverflow.com/a/43710072
@hardy drift
What are you trying to do with runescape? @hardy drift
my understanding is that tesseract is trained for things like scanned documents, not screenshots of a computer screen
hence poor results
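A rough sketch of the "restrict the characters" suggestion, in case it helps. It assumes pytesseract and Pillow are installed and that the number has already been cropped out of the screenshot; `digits_only_config` and `read_number` are hypothetical helper names, and the greyscale/upscale step is a guess at preprocessing that often helps with small game fonts. Note that some Tesseract 4.0 builds are reported to ignore the whitelist in LSTM mode, so results may vary by version.

```python
def digits_only_config(psm=7):
    """Build a Tesseract config string that restricts recognition to digits.

    psm=7 asks Tesseract to treat the image as a single line of text.
    """
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"

def read_number(image_path):
    """OCR a cropped screenshot region and return only the digits found."""
    # Imported lazily so the config helper works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")                 # greyscale
    image = image.resize((image.width * 3, image.height * 3))   # upscale small text
    text = pytesseract.image_to_string(image, config=digits_only_config())
    return "".join(ch for ch in text if ch.isdigit())
```

If this still reads "244" as letters, the fixed-template approach (matching saved digit images) with a looser match threshold may be the more reliable route for a game UI.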
I thought this might be helpful to anyone trying to learn ML/AI:
Curated list of tutorials: https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78
Curated list of cheat sheets (helpful when you quickly want to look up a formula or recall a method in an ML module): https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6
This list is good too:
http://rise.cse.iitm.ac.in/wiki/index.php/Introduction_to_Machine_Learning
Is machine learning really hard?
This is super nice for anyone with interest in deep learning
https://github.com/floodsung/Deep-Learning-Papers-Reading-Roadmap
I would probably do a little bit of background reading and then familiarize yourself with something like numpy, try to solve some basic problems and see if you like it
The links I have posted above have recommendations for resources/tutorials. I suggest picking one and diving in before you decide if it is hard or not
i will read them! @small ore @reef bone , Does it require lots of computational power if I'm training a neural network like thousands of images, books, etc.....
Maybe. Depending on the problem
Don't worry about that now
Most problems that do not involve huge database of images can be done on simple laptops/pcs
If you need more later you can hire space/computing power on the cloud as per your requirement
At some points it becomes computationally expensive, but you probably won't encounter any issues for a long time if you're just starting out
If you have a non-ancient nvidia gpu you can easily accelerate using tensorflow-gpu
But for now that is not necessary at all
Thousands of books isn't enough data, you need millions :P Generally speaking, you often don't make your own NN for images; you get a ResNet model pretrained on ImageNet, then wipe and retrain the last few layers
Transfer learning is currently the only decent way to train on little data
@reef bone The reading list is excellent!
I never even thought to try look at the original dropout paper, whoa
Btw Raggy, Bishop seems good and I am able to manage till now. Only through first chapter though. I like the presentation
It does ramp up a bit quickly. Try to keep in mind, it's a maths book. Maths books are hard to read because they're always dense af
Always worth the effort though
the presentation is soooo much better than ESL imo
As long as the explanation and clear description of notation ( preferably inline wherever needed) goes on I will be happy to read
@lean ledge what algorithms do you specialize in?
I wouldnt say I specialise considering I'm technically still fairly new to the field, at least compared to my colleagues. But I lean towards things that borrow from the electrical engineering side of things, e.g. things based on signal theory. Both traditional and ML-based computer vision, time series/stochastic processes etc.
Currently do time-series-y forecasting stuff at one job and am about to start an internship at CSIRO's Data61 in their Robotics and Autonomous Systems group for Deep Computer Vision
@late garnet
Nice - I might need to bug you about some stuff. I focus on time series, nlp and clustering problems.
@lean ledge
Primarily anomaly detection in time series - not so much forecasting
Specifically, I follow Eamonn Keogh at UCR with matrix profiling techniques. https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
@lean ledge you in australia then?
@chilly shuttle oui
data61 has some pretty good people, worked with them before
It does, some very smart people. Where are you? Not brissy by any chance? @chilly shuttle
@lean ledge Dumb math question. What does the notation that looks like modulus notation, but with two vertical lines on either side of the expression, stand for?
|like this| ?
@lean ledge nah melbs but I stop by brissy sometimes and know a few folk there
@small ore do you have a picture?
I think bicubic showed a better way than a picture. Why didn't I think of it.
It is ||x1*w-x2*w||
The lines look rather close by though
Usually that's norm
Norm?
In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is ...
Or apparently it could also mean nearest integer, never seen that before https://en.wikipedia.org/wiki/Nearest_integer_function
Yep, || is norm. Vectors can be generalised to things that arent just lists of numbers or geometrical objects and hence we create specific terminology that applies to all sorts of different vectors. Thats why you'll also see dot products like a . b written in inner product notation as <a, b> or (a, b)
You might want to study some linear algebra if you haven't already. It's all very relevant to ML. You might even come across something like the normal equation and see its derivation, which might help you understand linear regression in a slightly deeper way
essentially norm is the generalised version of the "length" of a vector @small ore
and in the same way you might see |a| as sqrt(a.a), you might see ||a|| = sqrt(<a,a>)
Currently it does not appear relevant. It was used for regularisation weights. Maybe I will dive into linalg later for the deeper meaning
It's taking the length or equivalently the "size" of the vector and penalising it for being too large
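To make the norm and the regularisation penalty concrete, here is a small numpy sketch. `l2_norm` and `ridge_penalty` are illustrative names, and the squared-norm penalty with weight `lam` is an assumption about which regulariser the course used.

```python
import numpy as np

def l2_norm(a):
    """Euclidean norm: ||a|| = sqrt(<a, a>)."""
    a = np.asarray(a, dtype=float)
    return np.sqrt(np.dot(a, a))

def ridge_penalty(w, lam=0.1):
    """Regularisation term lam * ||w||^2: grows as the weight vector gets longer."""
    return lam * l2_norm(w) ** 2

a = np.array([3.0, 4.0])
print(l2_norm(a))         # 5.0
print(np.linalg.norm(a))  # 5.0, numpy's built-in gives the same value
```

An expression like ||x1*w - x2*w|| is then just `l2_norm(x1 * w - x2 * w)`: the length of the difference between the two vectors.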
@small ore There is a course on Udemy that is pretty good; covering linear algebra. It has a good amount of theory, notation and application. Code examples for matlab and python exist as well. https://www.udemy.com/linear-algebra-theory-and-implementation/
Alternatively, you can take a linear algebra course via MIT's opencourseware: https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/
^^^^ THIS
Strang is a god
Get his book, watch his lectures
Udemy is too dodgy for me to trust lol
Strang, this one? Great introduction book.
The book we used when I first took a uni linear algebra class was Poole - Linear Algebra: A Modern Introduction. Didn't like that one at all.
But, this was a long time ago
I'm using Strang to refresh things atm
Udemy courses are definitely hit and miss. Glad they have the refund policy.
I like this book a lot - https://www.amazon.com/No-bullshit-guide-linear-algebra/dp/0992001021
I have downloaded a .bz2 compressed file of a month of reddit comments from http://files.pushshift.io/
Decompressed to file would be ~24 gb of json objects per line.
I have a list of reddit usernames which I would like to check against the authors in that pushshift file and extract only 3 k:v pairs of those lines where the author corresponds to one of the authors in my list.
Example of structure of the comment file would be:
`{'author': x, 'time_created': 'xy'}`, etc.
In other words: a line with a json object(?).
What would be a good way to approach this? I don't have access to a boatload of RAM.
Reading the file line by line and checking if author corresponds to my list and then extracting the k:v pairs which Im interested in will probably take a lot of time considering the large size of the file.
Any tips?
You can read compressed files in a buffered way. I'm guessing you only want to match the authors with the comments? What do you plan to do after that @thorn river ?
Ideally I would end up with only the comments by authors from my separate list, and then see if they have a certain keyword in their flair text or in the comment text itself
You can probably just iterate over the buffer and process it as you iterate looking for the key words etc.
I'm not sure if you are a pandas user, but it has some nice buffering features - http://pandas.pydata.org/pandas-docs/stable/io.html
Otherwise you can look up whatever is appropriate for your toolset.
Does that allow for manipulation of large json files as well?
I recently had the issue of a json file being too large to be opened normally, and was looking for a solution like this, if it does
I used pd.read_json once but I froze my system, so there might be something more to it
You can try what they suggest here - https://datascience.stackexchange.com/questions/27767/opening-a-20gb-file-for-analysis-with-pandas
Just use read_json instead of read_csv.
Thinking about it, my file sort of has one json object per line, so it might be worth reading line by line for me.
I'm not sure how it would handle a single json object that is large, but the approach I listed should work for line by line objects.
Thanks for the pointers! I wrote a script which will (I hope) do what I intended. Could you have a look and see if this will probably work?
data = {}
for i, line in enumerate(fi):
    parser = json.loads(line)
    # 'author' is a single string, so test it directly instead of looping over its characters
    if parser['author'] in author_set:
        data[i] = (parser['author'], parser['author_flair_text'], parser['body'])
oh wow dat formatting
c
sec
welp
I hope you understand formatting doesn't seem to work
It looks ok to me, but honestly I would need to test it myself.
You can just set a break after the first line to test it.
You mean like: if i == 1: break?
with bz2.open(jsonnew, 'rt') as fi:
    data = {}
    for i, line in enumerate(fi):
        parser = json.loads(line)
        print(parser)
        break  # test run: stop after the first line; remove to process the whole file
        # 'author' is a single string, so test it directly
        if parser['author'] in author_set:
            data[i] = (parser['author'], parser['author_flair_text'], parser['body'])
ah alright, thanks!
Also - be sure to create a set object out of your users that you are looking for.
I shouldve mentioned that author_set is a set()
Great - I just wanted to be sure
I assume because checking if it is in a set is faster right?
Yes
Alright, will also check out that article you linked. Thanks!
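Putting the pieces of this thread together, a minimal standard-library sketch could look like the following. It assumes one JSON object per line and the field names used in the script above; `filter_comments` is a made-up helper name. Memory use stays small because only one line is held at a time.

```python
import bz2
import json

def filter_comments(path, author_set, keys=("author", "author_flair_text", "body")):
    """Stream a .bz2 file with one JSON object per line and yield the selected
    fields for every comment whose author is in author_set."""
    with bz2.open(path, "rt", encoding="utf-8") as fi:
        for line in fi:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            comment = json.loads(line)
            if comment.get("author") in author_set:
                # .get() returns None for fields missing from a record
                yield {k: comment.get(k) for k in keys}
```

Usage would be something like `for row in filter_comments("comments.bz2", author_set): ...`, where the file name is hypothetical. Because it is a generator, matches can be written out or checked for keywords as they arrive instead of being collected in one big dict.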
FYI - I tested this in pandas and it appears my specific tar.gz that I created has extra information on the first line while iterating - "something.json0000664000175100017510000000034413370313602013031 0ustar tylertyler{"author": "someone", "text": "aldskfjasdf"}"
With pandas it breaks with read_json, however with read_csv it works.
Weird
Nice - I realized that I was being silly and had a tar.gz file, not just a gzip file.
import pandas as pd
for chunk in pd.read_json('/home/tyler/src/something.json.gz', lines=True, chunksize=2):
    print('new chunk')
    print(chunk)
new chunk
author text
0 someone aldskfjasdf
1 someone2 aldskfjasdf
new chunk
author text
2 someone3 aldskfjasdf
3 someone4 aldskfjasdf
new chunk
author text
4 someone5 aldskfjasdf
@thorn river This works pretty well, however be sure that pandas supports the compression type beforehand.
You may also want to consider multi-threading if you are processing so much data. It is fairly easy to do.
Hi, has anyone ever tried to create Jupyter widgets before? I'm curious what approach, lib, or js-framework people use
@late garnet Thanks! Don't know much about multithreading but I can check it out. I'll try to match what is in the chunks with my list of authors and work that way!
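from multiprocessing import Pool
import pandas as pd

def process_chunk(chunk):
    pass

pool = Pool(8)
all_results = []
for chunk in pd.read_json('/home/tyler/src/something.json.gz', lines=True, chunksize=100):
    # my_function is a placeholder for your own row handler.
    # Iterating a DataFrame yields column names, so pass the rows as dicts.
    all_results = all_results + pool.map(my_function, chunk.to_dict('records'))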
Something like that will give you a list of results.
@thorn river
def process_chunk(chunk):
    pass
what's the reason for this function?
What does it do?
it passes
I suppose you use that in place of my_function right?
Yeah sorry it was quick example
Ah no problem
So I could then iterate all_results to check if its in the author_set
If I understand your code correctly
process_chunk would do all of the processing for each row of each chunk. The "chunk" argument in this case is meant to be a single dataframe row.
Just return what is processed back. It could be an empty list or list of what you want.
@thorn river
My example isn't perfect, but conceptually you should have an idea of how to use it.
pool.map maps an iterable of items across N worker processes (multiprocessing.Pool uses processes rather than threads)
So I would have to edit the process_chunk function for it to only include authors in author_set?
Having a hard time wrapping my head around it
Yes
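Here is one way the edited process_chunk could look, as a sketch only. In this version each worker receives a whole chunk (a DataFrame) rather than a single row; `AUTHOR_SET`, `run_parallel` and the default sizes are illustrative assumptions, not part of the example above.

```python
from multiprocessing import Pool
import pandas as pd

AUTHOR_SET = {"someone", "someone3"}  # hypothetical authors to keep

def process_chunk(chunk):
    """Keep only the rows of this chunk (a DataFrame) whose author is in AUTHOR_SET."""
    matches = chunk[chunk["author"].isin(AUTHOR_SET)]
    return matches.to_dict("records")  # list of plain dicts, cheap to send back

def run_parallel(path, chunksize=100_000, workers=4):
    """Read a line-delimited JSON file in chunks and filter the chunks in a process pool."""
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    all_results = []
    with Pool(workers) as pool:
        # imap hands each chunk to a worker and returns results in order
        for rows in pool.imap(process_chunk, reader):
            all_results.extend(rows)
    return all_results
```

Calling `run_parallel('/home/tyler/src/something.json.gz')` from under an `if __name__ == "__main__":` guard would then give a list of matching rows. Worth keeping in mind: the JSON parsing still happens in the main process, so for a simple membership check like this the pool may not buy much over a plain loop.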
is a gtx 1050 alright for data science?
Hi @lapis sequoia would you mind setting a nickname of the server to something consisting of characters on a normal US/EU keyboard so others can mention you easier?
oh yeah sorry
Thank you :)
there
i'll rephrase my question, anybody know of any good gpus for data science > £200?
@Viibrant depending on your needs, maybe AWS could be a cheaper alternative?
Disclosure - I'm not sponsored by AWS.
anyone know a quick script that I can keep all the columns of this df4, and make a df5 that is the same layout but group by 'Order Number' + 'Item Number' but 'Quantity Ordered' is a sum of those?
Is this an excel or pandas or ? question?
i have data available in excel but I am using python to process it and I use the pandas package for it. hope that makes sense
the snapshot is out of pandas
hope I am posting in the right place
@hollow gulch
import pandas as pd
df4 = pd.DataFrame({'Order Number': [1,1,1,2,2,3,3,3], 'Item Number':[1,1,2,5,5,4,5,5], 'Quantity Ordered':[10,15,5,3,3,7,5,6]})
df5 = df4.groupby(by=['Order Number', 'Item Number'], sort=False, as_index=False)['Quantity Ordered'].sum()
I put in some dummy data
Thanks @olive trench I tried something similar. Let me see if it keep all the columns as it was
splendid!
can you explain the code a little bit?
for academic purpose
What exactly do you need?
the by parameter specifies which keys to group by, sort=False skips sorting the keys, and as_index=False makes it so the keys aren't used as the index. Then you sum the groupby object by quantity and it returns a dataframe
here's my version df5=df4.groupby (['Order Number','Item Number'])['Quantity Ordered'].sum()
perhaps you could help me understand the difference between my version and yours?
without the ['Quantity Ordered'].sum() part it's just a groupby object. If you perform a function on one of the columns it returns a dataframe where the column is the result of the function you used for all the grouped elements
you don't have the as_index=False
it's true by default, which will put the keys into the index
what does sort= false do
doesn't sort your keys. If you don't need them sorted, it gives better performance when your dataframe is big
I see
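A tiny side-by-side of the two versions, with made-up data, might make the as_index and sort difference easier to see:

```python
import pandas as pd

df4 = pd.DataFrame({
    "Order Number": [2, 2, 1, 1],
    "Item Number": [5, 5, 1, 2],
    "Quantity Ordered": [3, 3, 10, 5],
})
keys = ["Order Number", "Item Number"]

# Defaults (as_index=True, sort=True): a Series whose index holds the sorted group keys.
as_series = df4.groupby(keys)["Quantity Ordered"].sum()

# as_index=False, sort=False: a DataFrame with the keys as ordinary columns,
# with groups in order of first appearance.
as_frame = df4.groupby(keys, sort=False, as_index=False)["Quantity Ordered"].sum()

print(type(as_series).__name__)            # Series
print(type(as_frame).__name__)             # DataFrame
print(as_frame["Order Number"].tolist())   # [2, 1, 1]
```

The summed values are the same either way; only the container and the row order differ. Calling `as_series.reset_index()` turns the first result into the same columns as the second.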
now, if I want to keep all the other columns, I have to expand the code, right? It depends on which ones I want to keep as is and which I want to aggregate
for example I have column A B C D E F
further version will be df5=df4.groupby([ all keep column], sort=False, as_index=False)['D']['F'].sum() to get sum D, F?
I'm not sure if that'll work. If you're summing the column, why don't you do that first, then the groupby and sum again?
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
'Parent Number', 'Order Date', 'Item Number', 'Description ',
'Unit Price', 'Extended Price',
'Foreign Unit Price', 'Foreign Extended Price', 'Ln Ty', 'Last Stat',
'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)['Quantity Ordered'].sum()
I have a large data set that summing is only applied to 'Total Quantity' and 'Extended Price' by 'Order Number' and 'Item Number'.
I hope that make sense
I'm a little confused about the goal of this exercise. If you group by all the columns, it's only gonna group identical rows
so my actual code is something like this
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
'Parent Number', 'Order Date', 'Item Number', 'Description ',
'Unit Price',
'Foreign Unit Price','Ln Ty', 'Last Stat',
'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)[['Quantity Ordered','Extended Price','Foreign Extended Price']].sum()
with the last 3 columns being sum
I am just curious how much flexibility i can do with this code
I am not sure if summing the list of columns will work. But as I said, if you prepare a new column that is the sum of ['Quantity Ordered','Extended Price','Foreign Extended Price'] beforehand and then just sum it again over the aggregated elements, it'll do the job
I am not sure what that would look like. I am quite new to python
df5['new_column'] = df5['Quantity Ordered']+df5['Extended Price']+...
sorry df4 ^
and then you do the groupby and use the .sum() on ['new_column']
sounds like 3 different step if i understand correctly
df4['newcol']=df4['Quantity Ordered']+df4['Extended Price']+...
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
'Parent Number', 'Order Date', 'Item Number', 'Description ',
'Unit Price',
'Foreign Unit Price','Ln Ty', 'Last Stat',
'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)['newcol'].sum()
just two
as I said, I am not sure
sorry for the dumb question
I don't have your dataset so it's hard to say
but thanks for the great help. I think that code works, I am verifying it
it's been a whole day of banging my head against that code
you definitely save the day
import pandas as pd
df4 = pd.DataFrame({'Order Number': [1,1,1,2,2,3,3,3], 'Item Number':[1,1,2,5,5,4,5,5], 'Quantity Ordered':[10,15,5,3,3,7,5,6], 'irr':[5,6,4,5,2,3,5,5]})
df5 = df4.groupby(by=['Order Number', 'Item Number'], sort=False, as_index=False)[['Quantity Ordered', 'irr']].sum()
df5['newcol'] = df5['Quantity Ordered']+df5['irr']
it's the same thing as mine but you're doing the column sums after, not before. If you do ...[['Quantity Ordered', 'irr']].sum(), it sums each column within its groups but doesn't add the columns together
haha no worries
@@ still dont get it completely yet haha
I had a very frustrating experience myself today that I only just solved haha
well ask away
does it matter if it does sum before/after?
I guess I dont understand whats going on in the background
I added you as a friend btw. you seem like a nice person.
no. Except if you do it after you'll have the two summed columns in the dataframe
that's what I want right? because I wanted to keep those column separated
No worries, and feel free to hit me up if you need help. I am not very experienced myself but I'll try to help with what I can
I want the price.sum() and the quantity.sum() because they were the same order number
the order number was split due to different shipping batches
but for the analysis purpose, we want to consider them as 1 order.
ohhh
I work with supply chain so these things happen a lot in my data
I thought you were trying to sum those columns together
^ thus the confusion in us haha
excel can't handle this stuff as flexible as python so I am trying to pick up python to do the heavy work
I play alot with panda
Yeah then just specify which columns you want to have sum of for the aggregated elements
my bad I didn't understand
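For the "treat split shipments as one order" case, a sketch along these lines may be closer to what is wanted: group by just the two key columns, sum the quantity and price columns, and carry the remaining columns along. The data is invented, and keeping the first value of the other columns is an assumption that they are identical within an order line.

```python
import pandas as pd

df4 = pd.DataFrame({
    "Order Number": [1, 1, 2],
    "Item Number": [7, 7, 9],
    "Description": ["bolt", "bolt", "nut"],
    "Unit Price": [2.0, 2.0, 0.5],
    "Quantity Ordered": [10, 15, 4],
    "Extended Price": [20.0, 30.0, 2.0],
})

keys = ["Order Number", "Item Number"]
sum_cols = ["Quantity Ordered", "Extended Price"]

# Sum the chosen columns; for every other column keep the first value in the group.
how = {col: ("sum" if col in sum_cols else "first")
       for col in df4.columns if col not in keys}

df5 = df4.groupby(keys, sort=False, as_index=False).agg(how)
print(df5)
```

This avoids listing every descriptive column as a grouping key, so two shipments of the same order line still merge even if some unrelated column happens to differ between them.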
Do you not have this data in a database? I feel like this might be easier with SQL
trust me, I am ready to bang my head in the cabinet
IT doesnt trust us enough to give us access to SQL lol
Oh I know that struggle, gaining access to anything in my company is a nightmare too lol
my thought SQL would be a way to go too
if it was read only you should be fine
it's odd though that they want you to do a job but don't give you the tools
that's something that I am trying to convince them
even my boss struggle too lol
IT doesnt understand all of our needs
are you like an analyst?
I'm a junior data scientist
sure thing! I am fairly capable with pandas
I am kind of fortunate that our whole company is in IT services, so most people are capable in IT. And luckily I don't have to communicate with the business divisions
haha
I work in business division
I do their heavy math
i used to do these things in VBA and excel
but I think python is the real hammer to get anything done so I am trying to pick up and get everything done in python
takes longer and has a steeper learning curve, but gives more flexibility in manipulating data
eww VBA. I had the displeasure of working in it and never again
python is definitely powerful. I'd like to dive a little bit into big data sometime, so I hope we get a project including it eventually
Hello, is there anyone with some knowledge about reinforcement learning (Q-Learning, Monte Carlo, TD(0), greedy)?
Or Am I in the wrong channel ?
@charred crest it is more beneficial to you and everyone else to ask a specific question.
Seems like super interesting stuff
give the model candy until it likes you and gives good results
someone said something today that made me do a double take
'you can train a model on X wide feature vectors but do inference only on <X wide feature vectors'
that's... not right, right?
i mean you can do it, but running inference with consistently less information is the same as training the model without being aware of that extra information right
What do you guys think about this certificate?
https://academy.microsoft.com/en-us/professional-program/tracks/artificial-intelligence/
@chilly shuttle maybe they included the label vectors in X...?
no, their example was along the lines of 'train on postcode and age, then run inference on postcode only'
i... don't think windmills work that way
They are probably talking about the difference between a design matrix used in pure statistics and the one-hot encoded feature matrix often used in machine learning.
It's been a long time since I looked at this, but it's something something matrix must be invertible, so you remove one variable and use it as a base for interpreting the others. It's one of the small differences between pure stats and ML that's easy to trip over.
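To make the invertibility remark concrete, here is a small sketch with a made-up `region` feature. It only illustrates the baseline-category point; whether this is what the original "train on postcode and age" claim referred to is a guess. With an intercept, a full one-hot encoding makes the columns linearly dependent, and dropping one level restores full column rank:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with three levels.
df = pd.DataFrame({"region": ["north", "south", "west", "north", "south", "west"]})

# ML-style one-hot encoding: one column per level.
one_hot = pd.get_dummies(df["region"], dtype=float)
# Stats-style coding: one level dropped and used as the baseline.
dummies = pd.get_dummies(df["region"], drop_first=True, dtype=float)

intercept = np.ones(len(df))
X_full = np.column_stack([intercept, one_hot.to_numpy()])  # intercept + all levels
X_ref = np.column_stack([intercept, dummies.to_numpy()])   # intercept + k-1 levels

rank_full = np.linalg.matrix_rank(X_full)
rank_ref = np.linalg.matrix_rank(X_ref)
print(rank_full, "of", X_full.shape[1], "columns")  # rank deficient: X'X is singular
print(rank_ref, "of", X_ref.shape[1], "columns")    # full column rank
```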
@lean ledge I'd like to be able to simulate gyroscopic technology in the vacuum of space
Are you studying computational science right now?
Im teaching myself, but want to really dive into it and learn as much as i can
@true acorn nah, I'm an engineering student. Gyroscopic shouldn't be too bad
Given you have the moment of inertia tensor for the body, it's a simple simulation
Do you know the basics of rigid body dynamics?
Hell to the nawh nawh nawh. I just had a discussion in the mathematics discord server about learning math relative to the problem im trying to solve.
but id love to learn
Oof, yeah the thing is, computational science isn't really something people learn on their own. They usually learn it alongside and to aid in another discipline like physics or engineering or chemistry etc. Physics and engineering especially
Not having any background in what you're trying to simulate is harder because you don't know how it should behave and it takes longer to understand where to even start
But
There should be resources
Why couldnt i learn it on my own? What if i told you i was surrounded by some professors who teach this stuff?
ah
so as long as i have resources aka mentors aka professors who study this stuff, i should be okay?
@lean ledge I have you too, silly
I just cant afford school, im trying to save, but school is way too expensive so im trying my best to learn on my own
woooow, 1997
Gyroscopic stuff starts becoming relevant at rigid body dynamics
@true acorn
Fascinating
Look through those notes and you should be able to start figuring it out
i know the basics of programming and math so if i have a question about a function can i shoot you an IM?
A gyroscope is just a rigid body with a specific moment of inertia tensor
I'd rather talk here
okay,
Damn, why am i just hearing about Euler right now. i was never taught about him in school
Euler was a smart guy
Stuff in maths has to be named after the second person who discovered it because the first is always Euler :P
Nw, happy to help
Also he had cool hats, Euler that is
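For the gyroscope question above, a rough sketch of what "a rigid body with a moment of inertia tensor" means in code. It integrates the torque-free Euler equations in the body frame with a hand-written RK4 step. The inertia values, time step and initial spin are assumptions picked for illustration, and tracking the orientation in space (for example with quaternions) is left out:

```python
import numpy as np

def omega_dot(omega, inertia):
    """Torque-free Euler equations in the body frame: I w' = (I w) x w."""
    return np.linalg.solve(inertia, np.cross(inertia @ omega, omega))

def rk4_step(omega, inertia, dt):
    """One classical Runge-Kutta step for the angular velocity."""
    k1 = omega_dot(omega, inertia)
    k2 = omega_dot(omega + 0.5 * dt * k1, inertia)
    k3 = omega_dot(omega + 0.5 * dt * k2, inertia)
    k4 = omega_dot(omega + dt * k3, inertia)
    return omega + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(omega0, inertia, dt=1e-3, steps=5000):
    """Return the body-frame angular velocity at every time step."""
    omega = np.asarray(omega0, dtype=float)
    history = [omega]
    for _ in range(steps):
        omega = rk4_step(omega, inertia, dt)
        history.append(omega)
    return np.array(history)

# Assumed values: a disc-like rotor spinning fast about its symmetry axis (z).
inertia = np.diag([1.0, 1.0, 2.0])
history = simulate([0.1, 0.0, 10.0], inertia)

# Rotational kinetic energy should stay constant when no torque acts.
energy = 0.5 * np.einsum("ti,ij,tj->t", history, inertia, history)
print(energy.min(), energy.max())
```

Checking that the energy stays flat is a cheap sanity test before adding torques or orientation tracking.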
@lean ledge you must be pretty familiar with markovian processes and system theory yes?
To some extent ye. Don't expect too much, my knowledge is full of gaps
I am 18, i haven't had the time to know it all in detail
Ahh okay, I'm working on a problem in understanding operational inefficiencies through deriving markov transition matrices. I was hoping someone could provide some advice.
Essentially, I have a system with a finite number of states, but many states are interconnected. There are many users of the system and I am comparing user level efficiency of the system to optimize business processes.
Oof, I have worked a bit on something similar and my coworkers have worked a lot on that, but it's directly connected to a product my company has and another we're working on, so I'm afraid I can't say much.
RIP
Sorry D:
I think I have the right idea on comparing at a high level to evaluate potential areas of inefficiency
What company do you work for?
Being able to rate and improve worker efficiency is a big topic in a mine, where a single extra round of shovelling a day will lead to 9M a year in profit etc
Do you have general advice on how to compare quantitatively other individuals based on some weighting within the matrix of transitions?
A paper or general concept would be great
I can read and apply it myself - I hope
I think in my case I could take the weighted average of the differences between corresponding rows, weighted by the fraction of time spent in the state corresponding to each row
Sounds like a good start. Should be able to experiment a bit with that. My boss taught me all the relevant stuff so I don't really have any link to papers but with a bit of searching, you should be able to find them on your own
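One way the weighted-row idea above could be sketched. The function names are hypothetical, and using total variation distance as the per-row "difference" is an assumption, since the chat does not say which measure was intended:

```python
import numpy as np

def transition_matrix(states, n_states):
    """Row-normalised transition counts for one user's sequence of state indices."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of never-visited states stay all zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def occupancy(states, n_states):
    """Fraction of observed steps spent in each state."""
    return np.bincount(states, minlength=n_states) / len(states)

def weighted_row_distance(p_user, p_ref, weights):
    """Occupancy-weighted average of the total variation distance between rows."""
    row_tv = 0.5 * np.abs(p_user - p_ref).sum(axis=1)
    return float(np.dot(weights, row_tv) / weights.sum())

# Made-up state sequences for a reference user and the user being compared.
reference = [0, 1, 2, 0, 1, 2, 0, 1, 2]
user = [0, 1, 1, 2, 0, 1, 1, 2, 0]

p_ref = transition_matrix(reference, 3)
p_user = transition_matrix(user, 3)
distance = weighted_row_distance(p_user, p_ref, occupancy(user, 3))
print(distance)
```

The score is 0 for identical matrices and at most 1, so users can be ranked by how far they sit from a reference workflow.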
I am not even sure if all of this is data science or if we need a new OT-Advanced (Nerds only) channel
there's lots of stuff here that's not data science because everyone just redirects people here
it's weird
hello
i was told to come here and ask you for advice
i have this code https://paste.pydis.com/xuyukoqisu.py
and after this command it alters my dataframe
data1 = []  # create empty list
for name, dates in data.groupby(pd.Grouper(freq='D')):  # separate days
    data1.append(dates)
Did you check if your DataFrame is still how you want it to be between this:
data = data.drop(['use','hours','date','B','B2','E','A','w','d','year','month','day','sec'], axis=1)
###############################################################################
data = data.set_index('date_time')
And that code you've posted? Because I don't think that code you've posted should alter it.
continuation from https://discordapp.com/channels/267624335836053506/303906556754395136/510055520200032266 btw.
When you say you ran the upper part, does that include or exclude that set_index line?
yes
Because the code you've posted in the help channel excluded it
i run the set index too
and its perfect
when i run the groupby i get those blue cells in my last column
those blue cells are a datetime which is repeated but also wrong
they mess up my whole frame
I don't know what the colors mean, I've never used Spyder
Has the actual data in the DF in memory changed?
You're not creating a new one anywhere, though. The only thing you get from a groupby is a groupby object.
no its perfect it even creates empty frames from the days i don't have any info
it just adds double dates to some dates
Right, but if you look to the indexes, some date_time values are repeated exactly. Aren't those the ones that get grouped together and "doubled"?
yes that's them
the original doesn't contain them
i can't understand why they exist
Are you sure they don't exist after you've created that series with:
data['date_time'] = data[['date','time']].astype(str).apply(''.join,1)
?
And they show up right after data = data.set_index('date_time') before the loop?
let me check it one more time i'll run the program line by line too
You can always print out the first, say 30 rows, with DataFrame.head(n=30) at various points to track the changes throughout the script.
That way, you don't have to rely on running it line by line and checking the dataframe viewer in Spyder
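A small sketch of that kind of check, using a made-up frame with one repeated timestamp in place of the real data. `Index.duplicated(keep=False)` marks every copy of a repeated index value, which makes it easy to see whether the repeats exist before the groupby loop runs:

```python
import pandas as pd

# Hypothetical frame with one repeated timestamp, standing in for the real data.
data = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(
        ["2018-11-01 10:00", "2018-11-01 10:00", "2018-11-01 11:00", "2018-11-03 09:00"]
    ),
)
data.index.name = "date_time"

# Peek at the first rows at any point in the script.
print(data.head(n=30))

# Rows whose index value occurs more than once.
repeated = data[data.index.duplicated(keep=False)]
print(repeated)

# One frame per calendar day, as in the loop above
# (days with no rows can come back as empty frames, as noted in the chat).
daily = [day for _, day in data.groupby(pd.Grouper(freq="D"))]
print([len(day) for day in daily])
```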
Okay, I don't know what's going on.
There are people on this server who are much more fluent in Pandas than I am, maybe one of them will spot what's going on
Depends on whether the question gets buried.
so what do you recommend me to do?
@lyric canopy i did it i found my mistake so you don't have to search. thank you for your time. it was a stupid mistake
Great! Do you mind sharing the mistake so I can learn from it, too?
sure, i didn't convert a string to float [.astype(float)], so instead of calculating a number it was making a sequence, which made the program go nuts
so it seemed like correct
and i was so focused on what seemed to be wrong that i didn't pay attention to its origin
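The mistake described above is easy to reproduce. In this sketch the column name `amount` is hypothetical; the point is that multiplying text repeats it, while multiplying after `.astype(float)` does arithmetic:

```python
import pandas as pd

# Hypothetical column that was read in as text rather than numbers.
df = pd.DataFrame({"amount": ["1.5", "2", "10"]}, dtype=object)

as_text = df["amount"] * 2                  # repeats each string
as_number = df["amount"].astype(float) * 2  # multiplies each value

print(df.dtypes)
print(as_text.tolist())    # ['1.51.5', '22', '1010']
print(as_number.tolist())  # [3.0, 4.0, 20.0]
```

Printing `df.dtypes` right after loading is usually the quickest way to catch this.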
Right, thanks!
Hey, would this be the correct channel for a question related to outliers in a dataset?
If not, is there a related discord server that you folks care to recommend?
It's actually a dataset that I got from insideairbnb.com
I am terribly new at pandas and I thought I'd practice with some real life examples
The issue with that dataset (at least the Barcelona one) is that the price data has a lot of stuff that is clearly an outlier
Since we're talking about apartments with a listed price of 6000 euro/night
Now, is there a practical and scientific way, so to say, to determine where to cut the line for outliers?
I suppose I could arbitrarily say that whatever is out of 2 or 3 standard deviations is an outlier
But an expert opinion would be much appreciated
It actually depends on what you want to do with the data, but just deleting outlying observations is usually a bad idea (although it happens way too often)
If there's no substantive reason to delete the data points, you could very well be deleting valid observations
I mean, just random analysis, like avg price per Neighbourhood, or something related to reviews
Well, by just deleting those outliers, you're biasing your sample
But to do that I need to know what is actual data and what is clearly not significant
Do you have any reason for why these values are not "actual data"?
I mean, I know they're wrong because if I access the relevant listing page on Airbnb I see that the price there is normal
So it's either badly parsed from Airbnb, or maybe it was set superhigh while the owner was creating the page
To avoid receiving bookings
So, how do you know those errors are only happening for those high values in your list?
Anyway, I assume the distributions are likely skewed anyway, so the average may not be the best measure to describe the central tendency in the data you have
So something like k-means clustering would be a better idea?
Anyway, if you have reasons to assume some values are truly erroneous, then you're justified to delete them
There are plenty of alternatives. Something like the median is often seen as more robust than the mean, for instance
That's why it is sometimes used for wages
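A tiny illustration of the robustness point, with made-up nightly prices in which one listing mimics the 6000 euro case:

```python
import numpy as np

# Made-up nightly prices in euro; the last value mimics the 6000 euro listing.
prices = np.array([45, 60, 70, 80, 95, 110, 6000], dtype=float)

mean_price = prices.mean()        # pulled far upwards by the single extreme value
median_price = np.median(prices)  # stays at the middle listing

print(round(mean_price, 2), median_price)
```

In pandas the same comparison per neighbourhood would be a `groupby(...).agg(["mean", "median"])` on the price column.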
Fair enough
I don't have any info on how the data was mined so I guess I'll make some assumptions to have a "clean" dataset just for the sake of practice
It won't really describe reality, but that's not the point right now
Thanks a lot for the help
Appreciate it
I may be a bit touchy about it, because it happens way too much in academia. (deleting observations based on some arbitrary cut-off without a substantive reason)
Nah, it's a totally fair point and it was very revealing, so thanks for that, I'll just ignore it in this specific case since I'm the guy that's still googling boolean indexing
So it's not a matter of doing analysis that make sense for now, it's more like "how does pandas work?"
Right, just play around with it, I'd say
If you're going to fit a model (e.g., linear regression with ordinary least squares), then you may also fit it twice: Once with and once without the unusual data points to see what kind of effects it has.
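The "fit it twice" suggestion could look like the sketch below. The data are simulated, the unusual point is planted by hand, and `np.polyfit` stands in for whatever regression routine is actually used:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
y[-1] += 60.0  # one made-up unusual observation at the edge of x

# Flag the unusual point instead of deleting it from the data.
unusual = np.zeros(x.size, dtype=bool)
unusual[-1] = True

slope_all, intercept_all = np.polyfit(x, y, deg=1)
slope_sub, intercept_sub = np.polyfit(x[~unusual], y[~unusual], deg=1)

print("with the point:   ", slope_all, intercept_all)
print("without the point:", slope_sub, intercept_sub)
```

Reporting both fits shows how much a conclusion depends on the flagged observation, without silently dropping it.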
@teal veldt - There are many univariate statistical methods to find anomalies in a data set. It is tricky to pick the right algorithm without fully understanding the purpose, as @lyric canopy points out. However, I can provide you with some algorithms that I implemented to give you some hands-on exposure to anomaly detection. In addition, it would be a good idea to read up on each method.
Right, but in my opinion, none of these methods should be used in an automated way. Detection is fairly easy, but the biggest question is the one that comes after that: Why is this an unusual observation? And how should I deal with it? Far too often, people just delete them from their dataset, probably out of ignorance, but that borders on unethical research practice.
@lyric canopy I completely agree.
Interesting code, though.
Generally it is best to use domain specific upper and lower bound thresholds to throw out bad data.
Specifically when gathering sensor data
Probably, I don't have that much experience with that; I work in the field of social sciences/psychology/cognitive neuroscience for a methodology/statistics department of a university. Unusual observations there are less likely to be caused by sensor error, and much more likely to be valid observations. But since a lot of researchers do know that outliers can be problematic (say, influential cases in the GLM) but don't know how to deal with that, they start to throw out observations based on arbitrary rules of thumb, like +/- 3 sd, Bonferroni-corrected significant studentized residuals, arbitrary Cook's d cutoffs and what have you.
There are so many robust techniques and alternative approaches available today that just throwing out observations without a substantive reason for it triggers me.
Interesting, I actually created those anomaly detection algorithms to find unusual patterns in human behavior.
That's a great use for it, because those unusual observations usually tell an interesting story.
(One of the other problems is that people sometimes only start deleting observations if their original run of a model did not provide a "significant" result, but won't do that if it did. In, for instance, a GLM [generalized linear model], an outlying observation can actually increase model fit if it has a high leverage value but a low residual, i.e., if it's an outlier on the explanatory variables, but in line with the regression plane/model.)
Hello,
I'm kinda new here and I have a certain predicament I'm in where I need a bit of guidance/advice. I want to be a Data engineer but I'm not sure about how to do that. I'm currently a Market Science Analyst and after being here for a few months I feel as if the job isn't for me.
@haughty wharf data engineer is a DevOps like job. Learn SQL, docker, Hadoop, (AWS/GCP/Azure), ETL, REST APIs, Spark/Hive + some understanding of machine learning (doesn't have to be in depth) and maybe stuff like tableau. Your job is to be able to make a data pipeline for data scientists and for deployment
is there a way to name a plot in matplotlib so that plt.show() will only make something happen if an argument is passed into it
also, is there a way to automatically save generated images of that named plot, and replace old versions with the new one?
@amber kestrel Maybe you just need to add some conditionals before rendering the plot?
@amber kestrel I don't understand the first question, but the second one is plt.savefig('plot.png'). Use it before plt.show()
It'll save to your working directory, or you can pass a path in the filename to save it elsewhere
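Putting the two answers together, a sketch of a helper that always overwrites the saved image and only opens a window when asked. The function name `make_plot`, the figure name `"my_plot"` and the plotted values are made up for illustration:

```python
import matplotlib.pyplot as plt

def make_plot(path="plot.png", show=False):
    """Draw a figure, overwrite the image at `path`, and display it only if asked."""
    # num gives the figure a name; clear=True reuses it instead of piling up figures.
    fig, ax = plt.subplots(num="my_plot", clear=True)
    ax.plot([0, 1, 2], [0, 1, 4])
    fig.savefig(path)  # replaces any existing file with the same name
    if show:
        plt.show()
    return fig
```

Calling `make_plot()` refreshes `plot.png` silently, while `make_plot(show=True)` also displays it.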
this is the question
Implement a perceptron for logistic regression. For your training data, generate 2000 training instances in two sets of random data points (1000 in each) from multi-variate normal distribution with
$\mu_1 = [1, 0]$, $\mu_2 = [0, 1.5]$, $\Sigma_1 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}$ (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class, i.e., 1000 in total
import numpy as np

def generate_data(mean, cov, size):
    return np.random.multivariate_normal(mean, cov, size)
train_data_1 = generate_data([1,0],[[1,0.75],[0.75,1]],1000)
train_data_2 = generate_data([0,1.5],[[1,0.75],[0.75,1]],1000)
test_data_1 = generate_data([1,0],[[1,0.75],[0.75,1]],500)
test_data_2 = generate_data([0,1.5],[[1,0.75],[0.75,1]],500)
so i dont know how to label it
I'm not sure what the question wants you to do
nvm i figured it out
Okay
You probably needed to generate a dependent variable with a binary coding
(Logistic regression, 40pts) Implement a perceptron for logistic regression. For your training data, generate 2000 training instances in two sets of random data points (1000 in each) from multi-variate normal distribution with
$\mu_1 = [1, 0]$, $\mu_2 = [0, 1.5]$, $\Sigma_1 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}$ (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class, i.e., 1000 in total. Use sigmoid function for your activation function and cross entropy for your objective function. You will implement a logistic regression for the following questions. Initialize the starting weight as w = [1, 1, 1]. During training, stop your loop when the objective function (i.e., cross entropy) does not decrease any more (below certain threshold) or when the gradient is close to 0 or the iteration reaches 10000. Set your thresholds properly so that the iteration doesn't reach 10000 for all the learning rate that you will be using.
- Perform batch training using gradient descent. Divide the derivative with the total number of training dataset as you go through iteration (it is very likely that you will get NaN if you don't do this.). Set your learning rate to be $\eta = \{1, 0.1, 0.01\}$. How many iterations did you go through the training dataset? What is the accuracy that you have? What are the edge weights that were learned?
and i am really new to this
so i lack knowledge
I'm not too big on ML; I've only done logistic regression in the context of the generalized linear model.
So, I can't really help you with it
guys
i need help
anyone knows about tensorflow and keras?
I need to create a dataset with words and I don't understand anything
I don't think you really need Keras or TF to create a dataset. That would be to process it.
Firstly, where are you trying to source the data from? Scrape? API?
If all you need to do is output to CSV or DB, you're better off using Scrapy and Pandas
@tardy portal Hey, what part of the problem do you need help with
Neural Network friends, am I specifying my layers properly? https://www.reddit.com/r/datascience/comments/9w6grq/my_first_neural_network_exercise_using/
I'd suggest an SVM with an appropriate kernel instead of a neural network there
Having to use NNs is weird when there's other stuff
Looks okay to me but I don't use keras much
@lean ledge Hey man can I ask you a DS question in regards to normalizing/scaling and matplotlib?
Sure, just ask your questions here
@lean ledge ty
I typed it in help 5 if you don't mind taking a look
it was a few minutes ago
oh nvm, someone is looking at it
Appreciate it though
Hi, can someone help me write code for online training using gradient descent without using scikit-learn (sklearn)?
I was able to write the code for batch training
ask your question
(Logistic regression, 40pts) Implement a perceptron for logistic regression. For your training data, generate 2000 training instances in two sets of random data points (1000 in each) from multi-variate normal distribution with
$\mu_1 = [1, 0]$, $\mu_2 = [0, 1.5]$, $\Sigma_1 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}$, $\Sigma_2 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}$ (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class, i.e., 1000 in total. Use sigmoid function for your activation function and cross entropy for your objective function. You will implement a logistic regression for the following questions. Initialize the starting weight as w = [1, 1, 1]. During training, stop your loop when the objective function (i.e., cross entropy) does not decrease any more (below certain threshold) or when the gradient is close to 0 or the iteration reaches 10000. Set your thresholds properly so that the iteration doesn't reach 10000 for all the learning rate that you will be using.
- Perform batch training using gradient descent. Divide the derivative with the total number of training dataset as you go through iteration (it is very likely that you will get NaN if you don't do this.). Set your learning rate to be $\eta = \{1, 0.1, 0.01\}$. How many iterations did you go through the training dataset? What is the accuracy that you have? What are the edge weights that were learned?
- Perform online training using gradient descent. Set your learning rate to be $\eta = \{1, 0.1, 0.01\}$. Set your maximum number of iterations to 10000. How many iterations did you go through your training dataset? What is the accuracy that you have? What are the edge weights that were learned? Compare the learned parameters and accuracy to the ones that you got from batch training. Are they the same? Explain in your report.
Do you see the second question
I did the first one, but I can't wrap my head around how to write the second
# this program contains both batch training and ROC/AUC
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def create_data(mean1: list, mean2: list, sigma1: list, sigma2: list) -> tuple:
    # 1000 training and 500 test points per class
    train1 = np.random.multivariate_normal(mean1, sigma1, 1000)
    test1 = np.random.multivariate_normal(mean1, sigma1, 500)
    train2 = np.random.multivariate_normal(mean2, sigma2, 1000)
    test2 = np.random.multivariate_normal(mean2, sigma2, 500)
    # first class is labelled 0, second class 1
    train_label1 = np.zeros(1000)
    test_label1 = np.zeros(500)
    train_label2 = np.ones(1000)
    test_label2 = np.ones(500)
    training_data = np.vstack([train1, train2])
    test_data = np.vstack([test1, test2])
    training_label = np.concatenate((train_label1, train_label2))
    test_label = np.concatenate((test_label1, test_label2))
    return training_data, training_label, test_data, test_label


def logistic_regression(data: np.ndarray, labels: np.ndarray, max_iter: int, lr: float) -> np.ndarray:
    # no bias term here; the assignment's w = [1, 1, 1] would need a column of ones in data
    weights = np.ones(2)
    for i in range(max_iter):
        predictions = sigmoid(np.dot(data, weights))
        error = labels - predictions
        # divide by the number of training points, as the assignment asks
        gradient = np.dot(data.T, error) / len(labels)
        weights += lr * gradient
    return weights


def calc_metrics(test_data, test_label, weights):
    # per-class counts of correct and incorrect predictions
    confusion_matrix = [{'correct': 0, 'incorrect': 0}, {'correct': 0, 'incorrect': 0}]
    acc = 0
    predictions = []
    for i in range(len(test_data)):
        prediction = round(sigmoid(np.dot(test_data[i], weights)))
        predictions.append(prediction)
        if test_label[i] == prediction:
            confusion_matrix[int(test_label[i])]['correct'] += 1
            acc += 1
        else:
            confusion_matrix[int(test_label[i])]['incorrect'] += 1
    return acc, confusion_matrix, predictions


def main():
    mean1 = [1, 0]
    mean2 = [0, 1.5]
    cov1 = [[1, 0.75], [0.75, 1]]
    cov2 = [[1, 0.75], [0.75, 1]]
    steps = 10000  # maximum number of iterations in the assignment
    print("Choose from 1, 0.1, 0.01")
    lr = float(input("enter learning rate: "))
    training_data, training_label, test_data, test_label = create_data(mean1, mean2, cov1, cov2)
    weights = logistic_regression(training_data, training_label, steps, lr)
    acc, confusion_matrix, predictions = calc_metrics(test_data, test_label, weights)
    fpr, tpr, th = metrics.roc_curve(test_label, predictions)
    auc = metrics.roc_auc_score(test_label, predictions)
    print('Accuracy: ' + repr(acc/len(test_data)*100) + '%')


main()
this is what i have written till now for batch which is 1
now i need help for second
why do you invoke repr directly?
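For the second (online) part asked about above, a hedged sketch rather than a model answer. It updates the weights after every single sample instead of once per pass. The names `train_online` and `predict`, the stopping tolerance `tol` and the random visiting order are assumptions, not something the assignment text fixes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(data):
    """Prepend a column of ones so the weight vector has three entries."""
    return np.hstack([np.ones((len(data), 1)), data])

def train_online(data, labels, lr, max_epochs=10000, tol=1e-4, seed=0):
    """Logistic regression updated one sample at a time.

    Stops when the training cross entropy changes by less than `tol` between
    passes, or after `max_epochs` passes (both thresholds are assumptions).
    """
    rng = np.random.default_rng(seed)
    X = add_bias(data)
    w = np.ones(X.shape[1])  # w = [1, 1, 1] as in the assignment text
    prev_loss = np.inf
    for epoch in range(1, max_epochs + 1):
        for i in rng.permutation(len(X)):  # visit the samples in random order
            w += lr * (labels[i] - sigmoid(X[i] @ w)) * X[i]
        p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
        loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w, epoch

def predict(data, w):
    """Class 1 when the predicted probability is at least 0.5."""
    return (sigmoid(add_bias(data) @ w) >= 0.5).astype(float)
```

With a large learning rate such as 1 the per-sample updates are noisy, so the loss-based stop may never trigger and the loop runs to `max_epochs`. That difference from batch training is one of the things the report question asks about.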
Hey guys, what is the most efficient way to create pandas dataframes where you have to dynamically create it row by row?
What I try to do is create a dictionary with its keys as column names and the items as lists with the column values. I append new values to those lists and then call pd.DataFrame on the dictionary at the end. It is a lot faster than using DataFrame.append(), especially for large numbers of rows, but I am wondering if there is a more efficient way to do this.
Anyone have any ideas/workflows that work out for them?
@olive trench it might be easier if you specify your use case.
I need to create a dataframe with certain columns . Then there is a for loop and each loop generates one row of the resulting dataframe
the df.append() function gets really slow if I generate each row as a df and append each time if the resulting dataframe is in thousands of rows
I do the same as you, either append to a dict or if I don't have that many columns I just keep track of a couple of lists and make the dict and df at the end.
@polar acorn oh thanks, that's actually even better!
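The pattern described above, sketched with a placeholder loop body. Rows are collected in a plain Python list and the DataFrame is built once at the end (`DataFrame.append` was also removed in pandas 2.0, so this is the safer habit anyway):

```python
import pandas as pd

rows = []
for i in range(5):
    # Stand-in for whatever each loop iteration actually computes.
    rows.append({"step": i, "square": i ** 2})

# Build the frame once, after the loop.
df = pd.DataFrame(rows, columns=["step", "square"])
print(df)
```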
i am starting to get more into machine learning using python
anyone have any recommendations for videos?
@wanton silo Check pinned
Columbia's course is good
not python specific, it's language agnostic
o, thx
@wanton silo , check this out. Curated list of tutorials: https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78
How hard would it be to take one of the categories from the google quickdraw dataset (https://github.com/googlecreativelab/quickdraw-dataset), and then use machine learning to create new images that are similar to the images that are there? And what's a good resource to learn how to do something like that?
(aka how can I learn basic image machine learning?)
that's not an easy question to answer, generative models are generally quite complicated, so "basic" machine learning might not be sufficient to develop good solutions
the good thing is that we have packages like tensorflow, keras to help us, which makes the field much more accessible, but unless you're just taking code from the internet, some grasp on the fundamentals will still be necessary
so depending on how deep you want to go, we can recommend articles or books
the best resource to start with is generally google
also bear in mind that image recognition is not the same as image generation
the first would be a more approachable problem to beginners i think
hey i got a question. I'm trying to make a neural network that can detect whether a word is real english or just a bunch of letters strung together. How would i make my input data? the input layer has to have a fixed number of nodes (so i can't just use the number of letters), and idk if you can just input the whole word? my first idea was to calculate a value for each word. I went with A=1, B=2 etc., then treated the letters A-Z as digits in base 26 to get a base-10 number (not using 1-10, just the alphabet letters). So for the word "hello" it would be (value of "h") 8*(26^0) (the character index starts at zero) + (value of "e") 5*(26^1) etc. This creates extremely large numbers but would this work as a dataset?
```
word_value += (ord(word[j]) - 64) * (26 ** j)
```
that's an interesting approach to the problem, you're basically trying to create a hashing function and then check if the hash is valid? i'm not sure where the neural network comes into play
you can use something like https://github.com/rfk/pyenchant to check if a word is valid english
recurrent neural networks (especially lstms now) are generally the way to go for handling input of variable size, for natural language processing there are models that handle the input character by character, some handle it word by word, google has taken a compromise approach with "wordpieces" -> this is an interesting read https://arxiv.org/abs/1609.08144
but i'm still not sure why a neural network would be at all necessary for this problem
(not to discourage you)
well im relatively new to the concept of machine learning neural networks etc. so it may be absolutely pointless and not be correct but i thought it might work
it's not pointless and your idea to hash the word is good
and it definitely would work to some extent, but a simple check against a dictionary would be a lot easier to implement and also probably perform better
well, it wouldn't perform worse
ok thank you. and yeah, checking it against a dictionary would probably be easier. i'm actually so stupid i didn't think about that
well i've done all the work on setting up the data so i'm not gonna just restart now. i'm gonna see how effective a neural network would be. i guess this will be interesting
the reason why it struck me as a little odd is that you will likely be using a dictionary as training data anyway, and in general terms we're trying to get the neural net to learn from the training dataset to then be able to predict unseen values outside of the dataset. in fact, learning the actual values in the dataset is not a good thing because that usually leads to poor performance on unseen data (overfitting), we want the network to extract features from the dataset that can then be used to identify and classify other data. in this case, you already have all the data available to you, so you would actually be trying to get the neural net to learn all the data. which is an odd approach because in that case you can just use the dataset itself to check if the element that you're trying to classify as either (valid) or (invalid) is within this data
but go for it, practice is good, i would be interested to know how your neural net will perform once finished
^ sorry that's not well written, hopefully it makes sense
nah im going to use a random, probably 40%, split for testing. plus the dataset has both positive and negative examples, i.e. real and non-real words
i mean training
maybe more idk
i have around 500000 examples
and yes i understood what you have written
@turbid bay you can solve this with machine learning. If you're trying to learn machine learning it's a fun exercise; if it's a real problem you're having you should look to the previous answers. You could for instance train an LSTM. Let each letter be a vector of length 27, A = [1,0,...,0], B = [0,1,0,...,0] etc. Each word is then a sequence of letters or vectors and this is something we can use an LSTM for.
In addition your model can tell you if a random mess of letters looks sort of like a word or not which is in itself an amusing thing to test out.
whats an LSTM?
It's a fancy way of making a neural network work with sequences, such as a sequence of letters, i.e. a word.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
If that seems a bit complex you can also just try turning each word into a vector, mapping each letter into a number a -> 1, b -> 2 etc. And then zero pad on the right so that all the words are the same length. So 'Apple' would be [1, 16, 16, 12, 5, 0, 0, ...,0] with enough zeros so that the length is equal to longest word in your list. Then you could use just a normal neural network.
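Both encodings mentioned above can be sketched in a few lines. The helper names are made up, and using column 0 of the one-hot vector for padding is an assumption (the chat only says the vectors have length 27):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def encode_padded(word, max_len):
    """Map a -> 1 ... z -> 26 and right-pad with zeros to a fixed length."""
    codes = [ALPHABET.index(ch) + 1 for ch in word.lower()[:max_len]]
    return np.array(codes + [0] * (max_len - len(codes)), dtype=int)

def encode_one_hot(word, max_len):
    """One row per position, 27 columns: column 0 is padding, 1..26 are letters."""
    padded = encode_padded(word, max_len)
    one_hot = np.zeros((max_len, 27))
    one_hot[np.arange(max_len), padded] = 1.0
    return one_hot

print(encode_padded("Apple", 8))         # [ 1 16 16 12  5  0  0  0]
print(encode_one_hot("Apple", 8).shape)  # (8, 27)
```

The padded integer vector suits a plain feed-forward network, while the sequence of one-hot rows is the kind of input an LSTM would consume. Characters outside a-z raise a `ValueError` here, so real word lists would need cleaning first.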
@turbid bay An LSTM (long short-term memory) network is a specific type of recurrent neural network (read: a neural network that saves some inner state) that controls what is and isn't stored in its internal state using i, f, o and g gates. They're good because they can remember stuff, e.g. context, so they work well with sequential data like videos, sentences, different parts of images etc. They also allow easier gradient flow compared to vanilla RNNs while being rather flexible through their input/output/forget update mechanisms.
data does not at all have to be sequential, it's just a common example
you can see the difference in stuff like attention-based models for computer vision where the same data is looked at through different focal lenses etc.
They're also rather easy ways of making generative models based on your training data without going through other effort in a markovian chain manner
Pretty useful networks
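For reference, the i, f, o and g gates mentioned above are usually written as follows. This is the common textbook formulation, with $x_t$ the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid and $\odot$ element-wise multiplication; the weight names $W$, $U$, $b$ are conventional notation, not something from the chat:

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{candidate values} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

The additive update of $c_t$ is what gives the easier gradient flow compared to a vanilla RNN.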
I have a pandas dataframe where I want to drop rows which have a total frequency in the dataframe below a certain threshold.
I.e. if a column 'city' has in total 5 entries, I want to drop the corresponding rows. (rows with City < 5 should be dropped).
However, if that city appears in a certain list of cities, I want to skip row from dropping (even if that city has <5 entries).
I.e.
```
Counts = {"London": 4, "Birmingham": 3}
```
I already have the counts of the corresponding city in a single column. How do I drop rows with <5 total entries in the dataframe EXCEPT if they are in to_keep?
@thorn river what does your dataframe look like?
It sounds like you need to check out the filtering section of the pandas docs; specifically isin and multiple criteria filtering.
Generally I filter to keep what I want and create a new copy of the dataframe instead of dropping. I'm not sure if that is more efficient.
query = (df['City'].isin(to_keep)) & (df['Count'] >= 5)
df = df[query].copy()
I guess filtering works too. I would want to keep the entire dataframe if they are > 5 and if city count in dataframe is <5 but in to keep it would be kept as well.
if I understand correctly, your code snippet creates a dataframe with only those cities in to_keep?
I ended up dropping everything <5 since it was so little and it probably not of that much importance, thanks anyway!
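For anyone landing on this later: the rule described above ("keep rows whose city has at least 5 entries, plus anything in to_keep") needs an OR between the two conditions rather than an AND. A small sketch with made-up data; the column names `City` and `Count` follow the thread, everything else is an assumption:

```python
import pandas as pd

# Made-up data: London appears 4 times, Leeds 2 times, Paris 5 times.
df = pd.DataFrame({"City": ["London"] * 4 + ["Leeds"] * 2 + ["Paris"] * 5})

# Per-row frequency of that row's city (the thread already had this column).
df["Count"] = df.groupby("City")["City"].transform("count")

to_keep = ["London"]  # hypothetical allow-list

# Keep a row if its city is frequent enough OR explicitly allow-listed.
mask = (df["Count"] >= 5) | df["City"].isin(to_keep)
filtered = df[mask].copy()
```

Here Leeds is dropped (2 entries, not allow-listed) while London survives despite having only 4 entries.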
hi, i have a problem with my genetic algorithm. i am making a GA to solve a sudoku puzzle (not the best way of achieving the solution, i know - but it's enjoyable and is definitely achievable).
My problem is that it never gets to a solution, it gets stuck at a fitness of around 50 (i'm using the fitness function from here https://www.researchgate.net/publication/224180108_Solving_Sudoku_with_genetic_operations_that_preserve_building_blocks, except (f(x) - 162) * -1 in order for fully fit to be = 0).
i'm new to ga's so i just set my crossover function to take 2 parents, and then make child1 the first 3 rows of parent1, second 3 rows of parent2 and third 3 rows of parent1, and child2 to be the inverse. i also did crossover on every individual - not sure if that's normal?
and then mutation, i mutated every individual as well, maybe i shouldnt do that
of course i kept the known values of the puzzle constant throughout
just looking for advice really
Are you stuck in a local optimum?
Maybe increase your mutation rate based on lack of genetic diversity
i think im mutating too much
my mutation rate is essentially 1 since i mutate everything (i think that's how it works), so perhaps i should change that
and i mutate quite aggressively (swap 2 values in every box of every individual in the population)
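One way to tone that down, sketched under assumptions: the board is stored as 9 boxes of 9 digits, `fixed` marks the given clues, and only some individuals get a single swap in one box. The function name and the default rate are made up, not taken from the paper linked above:

```python
import random


def mutate(individual, fixed, mutation_rate=0.1, rng=random):
    """Return a possibly mutated copy of `individual`.

    individual: list of 9 boxes, each a list of 9 digits.
    fixed: same shape, True where the puzzle's given digits sit.
    With probability `mutation_rate`, swap two free cells in ONE random box.
    """
    child = [box[:] for box in individual]
    if rng.random() >= mutation_rate:
        return child  # most individuals pass through unchanged
    b = rng.randrange(9)
    free = [i for i in range(9) if not fixed[b][i]]
    if len(free) >= 2:
        i, j = rng.sample(free, 2)
        child[b][i], child[b][j] = child[b][j], child[b][i]
    return child
```

With a rate well below 1, good individuals have a chance to survive a generation intact, and the rate can be raised later if the population loses diversity.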
Unfortunately I've only minimal knowledge on this subject and most of it is in NN
I'd work on refining your mutation strategy. Worst case scenario, you learn more about them.
yeah im reading An Introduction to Evolutionary Computing rn
@past grove you might be interested in this repository
There is a talk on them as well
thanks i'll take a look @late garnet
Hello! I'm hoping for an idea/some advice on how to approach a python project
I'm scraping a webcam every 60 seconds (when it updates) of a feed looking at an exit to a business park
I'm trying to write some computer vision approach to detect when cars start queuing along this road at peak times
not necessarily an object detection problem or anything like that - just need to detect if there are cars or if there is just road (and cater for fading sunlight, weather etc)
I tried plotting the mean of the image to look for spikes when the cars started queuing but no success 😦 anybody know of any other approaches?
the aim is to have a graph of timestamps against when there is a traffic jam vs clear roads
if this is the wrong place to ask (an idea problem not really a code problem) then please redirect me elsewhere if you know of a better place
but this has to have been considered by someone before
Is there any way to write/read a JSON file to/from a website? I'm using Heroku to host it but I can't see the files on Heroku, so I wanna see/edit the data that it makes.
If so how would I do it?
every time you push a new version of the software heroku would delete the modified files anyway
so there isnt exactly a point in doing that
if you want to store data use the free postgresql they provide
which is why I wanted to write to a different website
update: got solutions after 123 and 75 generations
still working on crossover and mutation improvements though
Can anybody lend a hand with a pandas related issue?
Why in the hell are these scikit learn algorithms only outputting scores in multiples of .31
I mean it's only giving scores of .31, .62, .93. I don't understand what the hell's goin on
I have this dataframe
How would I grab values for the key 'coordinates'
I would like to extract all values for coordinates in the col geometry and insert into another df
That same df will also be used to store data extracted from the properties col . Each col in this new dataframe will be a key inside either geometry or properties
hey im using tensorflow with keras. I want an input layer which has one node that takes on an integer value. How do i write the input layer? i tried inp = tf.keras.layers.Input(1) but that doesnt work
my input data is singular integer values between a 1 digit number and a 39 digit number (this is a word which i have turned into an integer). And my output values should be a number between 1 and 0. The neural network should decide whether it is a real word (1) or not (0).
```
inp = Input(shape=(1,),dtype='float')
layer_1 = Dense(128, activation="relu")(inp)
layer_2 = Dense(128, activation="relu")(layer_1)
pred = Dense(1,activation="softmax")(layer_2)
model = Model(inp,pred)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, Y_train)
```
I would recommend using the vector approach suggested above to encode your input, and more than one input neuron
You should have 2 output nodes since you're doing classification with 2 classes, the softmax activation makes sense, but applying it to a single neuron doesn't
If your code runs then it's probably fine but generally speaking you would do something like
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense()) # first layer, specify input_shape
model.add() # your second layer
...
model.add() # your final layer
And uh, you should have validation and testing data
And pass validation data to the model.fit() method alongside your training data
So that you can see how it performs on unseen data
And then evaluate on testing data when you're done
Or when you think you're done
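On the single-neuron softmax point above, a plain-Python illustration (no Keras needed) of why it can't learn anything: softmax normalises across the outputs, so with one output the result is always 1.0. The helper names here are just for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, the usual activation for a single binary output."""
    return 1.0 / (1.0 + math.exp(-x))


# One output neuron + softmax: the result is 1.0 whatever the logit is.
single = [softmax([z])[0] for z in (-5.0, 0.0, 5.0)]

# Two output neurons + softmax: a real distribution over {not a word, word}.
pair = softmax([0.5, 2.0])

# One output neuron + sigmoid is the other common binary setup; it matches
# a two-way softmax whose logits differ by the same amount (2.0 - 0.5).
prob = sigmoid(1.5)
```

In Keras terms that would mean either `Dense(2, activation="softmax")` with `sparse_categorical_crossentropy`, or `Dense(1, activation="sigmoid")` with `binary_crossentropy`.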
And maybe LeakyReLU() for your activation on the hidden layers, but make sure you pass a lower alpha value, keras defaults to 0.3 now which is ridiculously high in my opinion
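To make the alpha comment concrete, here is what the activation computes, as a plain-Python sketch (the 0.01 default below is my own choice for illustration, not a Keras default):

```python
def leaky_relu(x, alpha=0.01):
    """Pass positive inputs through; scale negative inputs by `alpha`."""
    return x if x > 0 else alpha * x


# The same negative input under a steep and a gentle negative slope.
steep = leaky_relu(-2.0, alpha=0.3)
gentle = leaky_relu(-2.0, alpha=0.01)
unchanged = leaky_relu(3.0, alpha=0.3)
```

With alpha at 0.3, negative inputs keep 30% of their magnitude, which is why a smaller value is often preferred.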
Hi, I am interested in NLP, so should i 1st do a basic data-science course like https://www.coursera.org/learn/python-data-analysis/home/welcome or i can start with NLP directly?
@karmic axle depends on what you wanna do, but its worth taking a basic machine learning or data analysis course
you're gonna need probability and stats pretty much no matter what
Anyone on here might be able to help me with trying to create a specific graph using matplotlib and my dataframe?
what graph? @cerulean magnet
@desert oar His question got buried in #help-chestnut just before my question coz no one answered for an hr
@desert oar okay.. i dont know what i want to do with NLP yet .. maybe will add some nlp to my discord bot.
Anyone have a solid tutorial on installing/using tensorflow-gpu? I've used this tutorial, only to result in python not finding the package after installation. https://www.youtube.com/watch?v=r7-WPbx8VuY
In this tutorial, we cover how to install both the CPU and GPU version of TensorFlow onto 64bit Windows 10 (also works on Windows 7 and 8). TensorFlow is a P...
Any recommendations/advice is welcomed and appreciated.
Hey, why do I get this error when trying to plot different columns in a csv by defining their names?
TypeError: 'DataFrame' object is not callable
This doesn't happen when I just call the whole dataframe
@tall shuttle you call the package by? import tensorflow for both the cpu and gpu versions? (pending what's installed)
yes
@tall shuttle
Might be tmi, but on the left is the install, on the right is the call
I can't read that
rip
just give them separately
related to python version?
Rather, with respect to you and your time, is this a common issue I can find a solution for online?
install cuda
I'll re-install and share a screenie.
install cuda for your gpu
kk
one second, sorry.
mmk, restarting
same error:
(when installing cuda, I'm just following the prompts and clicking 'next')
do you have cuDNN
yep. followed the tutorial's instructions of drag/dropping (ill share the time, one sec)
Wait. would it matter if Python was installed on a separate drive? (i have all python-related-ness on one drive..)
(nvidia's gpu computing toolkit, and corresponding cuDNN files are on a diff drive)
I think it should be fine as long as your path variables are set correctly, I remember I had to tamper with mine
But it's been a while since I installed mine
Sorry that was more of a shot in the dark
This guide provides step-by-step instructions on how to install and check for correct operation of NVIDIA cuDNN v7.4.1 on Linux, Mac OS X, and Microsoft Windows systems.
I followed this, at the bottom it mentions the variables you need to have set so that tensorflow can find cudnn
ah, gotcha. thank you!
rip, tried it with both cuda v9.0 and v10.0. Would my python version matter? 3.6.7
So I've got this code outputting this visual:
```
plt.rcParams['figure.figsize']=(8,10)
fig, ax = plt.subplots()
sns.countplot(x="Q18", palette="Blues", data=t10)
ax.set_ylabel('Number of Respondents', size='15')
ax.set_xlabel('')
ax.set_title('"What percentage of the team\nmeeting do you speak?"', size='20')
plt.suptitle('Top 10%', y=1.01, fontsize=30)
#code for Question 18
```
How in gods name do I get this to sort in logical order? Do I need to setup the pandas dataframe to sort them by the order (e.g. 0%,10%,20%....etc) prior to plotting it with seaborn?
Sorry about the transparency. The problem is that it's randomly ordering (or to me at least, randomly) the bars
that's not pandas, that's seaborn
but t10 would be a pandas dataframe
also you're better off using plt.figure(figsize=...) than rcParams
@young aurora you can specify the order in seaborn's countplot. For example, if you wanted to specify labels that are string percentages:
ordering = ['10%', '20%', '30%']
sns.countplot(x='value', data=df, order=ordering)
# you can also generate the ordering
# this gives you 0% to 100% in logical order
ordering = list(map(lambda s: str(s) + '%', range(0, 110, 10)))
pandas: is there a way to get the sql queries it will run if I do a to_sql()?
it doesn't run the sql in one shot, so not really
afaik the only way to do it would be to provide your own dummy connection and intercept the bulk_save_objects or whatever other calls pandas makes to serialise
its a bit weird that you cant generate the intermediate sql actually
the magic happens somewhere in here https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/io/sql.py
ah right it uses sqlalchemy on the backend
it's not that weird, bulk loads are almost never a plaintext sql query
Can someone explain to me how data mining works? Like what's the process?
That's a really broad topic. All it boils down to is getting a big heap of data and applying things like stats or machine learning to it to get new information. e.g. If I have a spreadsheet of raw basketball data (player, shot time, distance from net, success-or-failure, away-or-home-game) I could calculate a simple stat. like 3-pointer percentage by player, I could try to see if home games correlate to a higher accuracy, I could render a heatmap that shows how effective each player is based on their distance from net. Etc.
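As a toy version of the basketball example, here is the "3-pointer percentage by player" stat computed from a made-up shot log. The rows, the field order and the 23.75 ft cut-off are all assumptions for illustration:

```python
from collections import defaultdict

THREE_POINT_FT = 23.75  # assumed arc distance; adjust for your league

# Made-up shot log: (player, distance_ft, success, home_game)
shots = [
    ("A", 24.0, True, True),
    ("A", 25.0, False, False),
    ("A", 10.0, True, True),
    ("B", 26.0, True, False),
    ("B", 24.5, False, True),
    ("B", 27.0, True, True),
]

made = defaultdict(int)
attempts = defaultdict(int)
for player, distance, success, _home in shots:
    if distance >= THREE_POINT_FT:
        attempts[player] += 1
        made[player] += success  # True counts as 1

three_pt_pct = {p: made[p] / attempts[p] for p in attempts}
```

The same loop could be grouped by the home/away flag instead to look at the home-game question.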
Generally you have so much more data than that. You do your best to use dimensionality reduction techniques (e.g. PCA), use data summary reports to find covariance between data (plot all your data on a big covariance diagram, see which bits look the hottest), use some domain specific knowledge to narrow down what you should be looking into, etc
"dimensionality reduction technique" . I stumble upon new words for subset selection
its not the same as subset selection
its more general
e.g. pca or multidimensional scaling.. you aren't choosing subsets of features, you are creating new features with the goal of reducing the number of features required to convey some amount of information
^^^
How come the "google it" answer to the data mining question was removed? For broad unspecific questions I find that to be a good answer. Vague non specific questions deserve vague non specific answers.
A better question would be, what is your favourite introductory text to data mining? Or does anyone have a simple and well explained example with code that uses some of the most common data mining techniques?
And in both cases the answer should hopefully be "look in the pinned messages"
Because, whether intentional or not, it's a hostile and condescending thing to say to someone who is asking a question. As is "vague non specific questions deserve vague non specific answers" (emphasis mine)
Ask for more information, don't just dismiss them
Vague non specific questions do deserve vague non specific answers, though you're probably right the people who ask them deserve a chance to better frame what they're after.
No they don't, and they especially don't here. This might help to better explain the huge issue with that attitude: https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/
We've all asked vague non-specific questions, none of us popped out of the womb asking where we can find a good tutorial on the time complexity of a merge sort. If you have an answer in the form of what a better question would look like, why not make that your response? "Hey friend, that's a broad topic, here are a couple of resources I find handy and you might try these keywords to narrow your focus." You're still encouraging them to do their own work but you're providing a starting point. Or, you know, just don't respond.
Yeah no I didn't tell him to google it. And tried to come up with two other questions to ask instead. But I thought it was strange that the google answer was removed (if that was the case, he might have deleted it himself for all I know). I didn't know that was the policy.
Sure, fair enough then. I thought for some reason it was removed.
This is also a discussion cum help channel. Need not be as strict as a help channel when it comes to topics. A question, however vague may be taken as a good beginning for a discussion on that topic. However I believe "google it" is not a bad answer if put in a polite way
Oh. Didnt realize I am one hr late
Not really your call to make whether a channel is more or less strict.
Or should be, rather
How can I build a network to reason about depth/scale from objects that I already have hard sizes for?
Like, say I have a picture of someone holding a Comcast Remote:
True. Well, let us say, that is my perception
All of those remotes are the same size. How can I use it to measure the scale of the rest of the image?
Also if google it is all that is said about the topic, then no, it's not helpful
@dim osprey I feel like you'd have to have some sort of reference to gauge size
The remote is the reference. I want to measure his hands.
Generically you'd have to segment the image to find the "remote-shaped" blobs and get the one that's most likely the remote
Once the image is segmented you'll have the coordinates relative to the image and can scale that based on the reference measurements you have
I can't train a model to recognize certain objects that all have the same size?
Like, there's been a lot of work on street sign detection because of the runup to self-driving cars. All stop signs are the same size.
You can, but at some point you have to start with segmenting the images so you can train the classifier
If I know that the stop sign is x inches, and y pixels, and the pole is y2 pixels, then it must be x2 inches, right?
Yes
If the remote blob is 100 pixels and your remote is 10 inches, it's 10 pixels/inch
However, that's not going to apply for things in the background
But if it's more or less in the same plane as the remote then you'll be ok
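The arithmetic above as a tiny sketch. The function names and the 75-pixel hand measurement are made up, and it only holds for objects at roughly the same distance from the camera as the reference:

```python
def pixels_per_inch(ref_pixels, ref_inches):
    """Image scale implied by a reference object of known real size."""
    return ref_pixels / ref_inches


def estimate_inches(target_pixels, ref_pixels, ref_inches):
    """Estimate the real size of a target lying in the reference's plane."""
    return target_pixels / pixels_per_inch(ref_pixels, ref_inches)


scale = pixels_per_inch(100, 10)      # remote: 100 px long, 10 inches long
hand = estimate_inches(75, 100, 10)   # hypothetical hand spanning 75 px
```

So x2 = y2 * x / y in the stop sign notation above.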
I'm sure there are corrections since this is a fairly well developed field but it's not one I'm particularly familiar with
Aha, but there are already neural networks that reason about scale and depth. A lot, actually.
They do pretty good, depending on how much training data you have.
My idea was to add the context from objects that have fixed sizes: Phones, Car tires
Look at this:
This is perfect: A CSX engine is always the same size, and so is that tanker car. I can take those hard cues and use them to measure the entire image.
doors are usually the same height too
But I'm interested in outdoor scenes.
Has anybody here built tensorflow from source to optimise for CPU? I won't have access to a GPU for some time and I'm considering if the speed up is worth it.
why build from source?
you mean so you can use -march=native -mtune=native -O3 or something?
Building TF from scratch theoretically allows you to optimize for the target system's CPU (according to TF). Though I've only ever seen this in the context of Intel-based CPUs and even then it's only applicable to CNNs. To answer pptt, yes I've done a CPU optimized build. Results were a mixed bag.
Dunno about being "worth it" or not. The effort in compiling an optimized build is pretty minimal, if you have an i5 or i7 give it a shot. It can't hurt.
does it use a blas or does it implement its own linear algebra
like would MKL vs OpenBLAS make a difference
This got me curious so I went back and ran a simple ConvNet against Python 3.6, Tensorflow 1.7 (MKL, SSE, AVX, and FMA enabled), Tensorflow 1.7 (Just SSE, AVX, and FMA enabled), Tensorflow 1.7 (Standard, No optimization install). With just SSE, AVX, and FMA optimizations the average time between steps decreased by 15% from the no frills install. With MKL enabled the time between steps increased by 260%. I suspect that weirdness with MKL is a broken build on my part. I need to recompile and also potentially shift to Tensorflow 1.12.
https://github.com/mind/wheels Has a curated set of .whls up to Tensorflow 1.8.
very cool
thanks for trying that
to be fair mkl isn't always faster. i noticed improvement on basic operations, matrix multiply and especially SVD
but apparently its not always faster even on intel hardware
Indeed. Though I'll probably be up until 3:00a.m. validating MKL is working as expected.
oh relevant, literally compiling TF from source right now
been waiting an hour and a half and no end in sight, kms
would not recommend for marginal benefits in speed
im only doing it because I want to work with TF on 18.04 which only supports CUDA 10 which isnt supported by stable TF releases as of yet
I ended up not doing it. Looks like it might have cost me more time than it would have saved me anyway.
my compilation took me 3-4 hours, so unless it's saving you that much time, probably not worth it :p
The job took me 6 hours, so unless it halves training time then probably not
If it's data science related, yes. If it's more of a general Python issue, feel free to use a help channel.
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
I asked it in a help channel but they redirected me here .. So this is my Q.
I need to get a user input and determine his emotion from the answer. For example if user inputs 'I'm sad', he is sad. (:P obviously) .. But this can be more complex than that example (as i think). I need your ideas about how to achieve this. I can make a word list and match them but is it a good idea. Or do i need to involve something like TensorFlow (which is completely new to me) . ? confused
:/
😦
The broad term you're looking for is "sentiment analysis". Typically the first place to start, regardless of how you'll use this data, is to get a set of labelled data. Associating words or sentences with a feeling. Individual words are hard because they can have a double meaning: "I'm so happy I could cry" -> Is this person sad or happy?
How you then use those labels is sort of up to you. You could, as you say, just match words to sentences and assign a sentiment like happy or sad. For a small school project that might be fine. If you want a program that can more accurately predict sentiment when the user input contains multiple words from your list, as in my example, you would have to go to a deep learning approach in Tensorflow or PyTorch. If you've never tried deep learning before I would recommend you look at Keras or PyTorch first. Tensorflow is daunting for newcomers.
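A sketch of the simple word-list approach, mostly to show where it breaks down. The lexicon is a tiny hand-made example and the function name is hypothetical:

```python
import re
from collections import Counter

# Hypothetical, hand-made lexicon: word -> emotion label.
LEXICON = {
    "happy": "happy", "glad": "happy", "joy": "happy",
    "sad": "sad", "cry": "sad", "unhappy": "sad",
}


def guess_emotion(text):
    """Vote over lexicon words; report 'mixed' on a tie, 'unknown' if none hit."""
    words = re.findall(r"[a-z']+", text.lower())
    votes = Counter(LEXICON[w] for w in words if w in LEXICON)
    if not votes:
        return "unknown"
    (top, n), *rest = votes.most_common()
    if rest and rest[0][1] == n:
        return "mixed"
    return top


simple = guess_emotion("I'm sad")
tricky = guess_emotion("I'm so happy I could cry")
```

The second sentence is where word matching gives up ("mixed"), which is the kind of case a trained model is meant to handle.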
Using tensorflow, how would I return a "score" value to the model for judging itself?
Traceback (most recent call last):
File "X:\Python\lib\site-packages\rlbot\botmanager\bot_manager.py", line 187, in run
self.call_agent(agent, self.agent_class_wrapper.get_loaded_class())
File "X:\Python\lib\site-packages\rlbot\botmanager\bot_manager_struct.py", line 36, in call_agent
controller_input = agent.get_output(self.game_tick_packet)
File "X:\Downloads\bot\rlai\tflayers.py", line 85, in get_output
out = self.models[packet.num_cars].call(arr)
File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\sequential.py", line 229, in call
return super(Sequential, self).call(inputs, training=training, mask=mask)
File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\network.py", line 845, in call
mask=masks)
File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\network.py", line 1031, in _run_internal_graph
output_tensors = layer.call(computed_tensor, **kwargs)
File "X:\Python\lib\site-packages\tensorflow\python\keras\layers\core.py", line 970, in call
outputs = gen_math_ops.mat_mul(inputs, self.kernel)
File "X:\Python\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4856, in mat_mul
name=name)
File "X:\Python\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "X:\Python\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
op_def=op_def)
File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 1792, in __init__
control_input_ops)
File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 1631, in _create_c_op
raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 193 and 48 for 'dense_734/MatMul' (op: 'MatMul') with input shapes: [193], [48,48].
arr = np.array(total) # len(total) = 193
self.models[packet.num_cars].call(arr)
Thank you very much @proud raven .. This seems more complex than i thought. Actually it's a school kind of project but i wanna do it the right way. I'll look into the things you told me. Thanks again !
may I ask a question here? can we ask questions?
!t ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.
You can find a much more detailed explanation on our website.
okay, I never used a ttest on a data set before. My boss wants me to compare two patients' features with t test, but wants me to do this for each feature
is this doable? is this a thing? the only library I found in python kinda uses lists as inputs and gives a single p value
also I'm looking for the ttest functions of the stats library in python and there are four different kind of t test, I wonder which one is more suitable for this task.. =/
I think I figured it out, thanks!
@lapis sequoia What did you end up doing?
I was doing the thing wrong, for each feature, I made a list that has the values of neighbors from knn
then for two hypothesis groups, I compared the lists of these features with each other
so for 15 features, I had two lists of 5 values each
and I had 15 t test results, which was what I needed
Sounds fair. Was the purpose to test if the patients had a significant difference or if each feature had one?
oh sorry I saw your reply too late, was busy with sending the data to the right places.. the main goal was the former one
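For reference, the "one t-test per feature" idea can be sketched in plain Python like this. It computes Welch's t statistic only (no p-value), the feature names and numbers are made up, and using the unequal-variance variant is my assumption rather than what was done above:

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Made-up values: feature name -> (group 1 values, group 2 values)
features = {
    "age": ([50, 52, 54, 56, 58], [60, 62, 64, 66, 68]),
    "bmi": ([22, 24, 26, 28, 30], [22, 24, 26, 28, 30]),
}

# One statistic per feature, which is the shape of result described above.
t_stats = {name: welch_t(g1, g2) for name, (g1, g2) in features.items()}
```

If SciPy is available, `scipy.stats.ttest_ind(g1, g2, equal_var=False)` gives the same statistic together with a p-value for each pair of lists.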
Matplotlib is being helpful and drawing lines directly between my data points for the green line. However I'd like it to stay horizontal until the value changes and then connect with a vertical line. Is there an option to do this or do I have to insert extra points to accomplish it?
Try step instead of plot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.step.html @signal juniper
@hearty token Awesome, thank you so much!
So... how about that Tanenbaum?
I'm trying to count the number of entries of a same class using pandas
I have a label called CLASS and each entry is a type
So I need to count how many of each type I have there
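A small sketch of that count, assuming the column really is named `CLASS`; the rows are made up:

```python
import pandas as pd

# Made-up frame with a CLASS column holding each entry's type.
df = pd.DataFrame({"CLASS": ["cat", "dog", "cat", "bird", "cat", "dog"]})

# Number of rows per distinct value, most frequent first.
counts = df["CLASS"].value_counts()
```

`counts` is a Series indexed by the class name, so `counts["cat"]` gives the number of "cat" rows.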
hi , what is the most beautiful visualization tools/library/framework/app in python?
yeah that's it, cheers!

Anyone have a good SQL discord (like this one)?
Or, lol, anyone know how to use the 'where' clause in queries? I'm trying to filter down to [fldDuration] = 02:00
varchar(10)
(removed phi, for obvious reasons)
?
you need to connect them w/ logical AND and OR
WHERE
(user.age < 30 OR user.age > 40) AND
address.state = 'NY'
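The same idea as something runnable, using Python's built-in sqlite3 so it stays on-topic for this server. Only `fldDuration` comes from the question; the table name, the other columns and the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (fldId INTEGER, fldDuration VARCHAR(10), fldState TEXT)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "02:00", "NY"), (2, "01:30", "NY"), (3, "02:00", "CA"), (4, "02:00", "NY")],
)

# fldDuration is text, so compare against the string '02:00'; every
# condition has to be joined to the next with AND / OR.
rows = conn.execute(
    "SELECT fldId FROM visits "
    "WHERE fldDuration = ? AND (fldState = ? OR fldState = ?) "
    "ORDER BY fldId",
    ("02:00", "NY", "NJ"),
).fetchall()
conn.close()
```

Leaving out an AND between two conditions is a syntax error, which is a common cause of a failing WHERE clause.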
Hi
@gleaming wadi Dash is pretty good for building interactive visualizations as web apps. If this is just for a notebook however seaborn is good for non-interactive charts and plotly for interactive ones.
there's also bokeh
hi all- i am working on some NLP project and i have to read my data from PDF documents! i use tika parser to read the files and then proceed with text processing techniques. one thing i can't get my head around is basically how is it possible to get rid of headers and footers of a given document ! any direction would be really appreciated
@desert oar that was it. 12/13 had no and in-between them.
Hello, may I ask for advice with CVXOPT, if anyone has some experience?
I am using quadratic programming solver for MI-SVM algorithm, but if I get negative weights it writes me this error:
ValueError: Rank(A) < p or Rank([P; A; G]) < n
and even if i try to run it without constraints it still fails
this is full error output:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/cvxopt/misc.py", line 1429, in factor
lapack.potrf(F['S'])
ArithmeticError: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 2011, in coneqp
matrix(0.0, (0,1)), 'beta': [], 'v': [], 'r': [], 'rti': []})
File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 1981, in kktsolver
return factor(W, P)
File "/usr/local/lib/python3.6/dist-packages/cvxopt/misc.py", line 1444, in factor
lapack.potrf(F['S'])
ArithmeticError: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 21, in <module>
main()
File "main.py", line 19, in main
model.learning()
File "/home/frovis/Programovani/Python/Bakalarka/Model.py", line 70, in learning
sol = solvers.qp(P,q)
File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 4487, in qp
return coneqp(P, q, G, h, None, A, b, initvals, kktsolver = kktsolver, options = options)
File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 2013, in coneqp
raise ValueError("Rank(A) < p or Rank([P; A; G]) < n")
ValueError: Rank(A) < p or Rank([P; A; G]) < n
hi everyone
I've got a simple question. Just looking for thoughts
I've got a scatter plot containing 3 classes. However, lots of overlapping.
I'm not sure how to best display this data
@arctic moth I have no experience with that, but I think you'd have more luck if you posted some code.
@arctic moth do you know what matrix rank is? A code snippet at least would be helpful but somehow youre ending up with an ill conditioned matrix
@fierce saffron if its too hard to visualize with 3 colors on the same plot, make 3 plots side by side w/ the same axis dimensions
I hadn't thought of that but it's a good idea
I have a dict with the following structure:
'Username' with a key 'label' and a key 'text'. So if i would want to access the text of a username i would do dic["username"]["text"].
I want to tokenize the strings in the [text] part. My goal is to supply the tokenized text to a classifier so i can train and make predictions. How do i do this using spacy?
I have looked at the spacy docs and tutorials but dont know if their pipe operations replaces the text in the dict with the tokenized text
No it just returns a tokenized version
Theres no special handling for dicts... list of strings goes in, list of processed docs comes out
Its not like C where you pass a preallocated result container and the function fills that container...
Alright thanks. Then how should i provide the tokenized strings to a model?
i recommend finding some text classification 101... but the basic method is "bag of words"
one row per document, one column per word, word frequency in matrix cell
'the quick brown fox'
'the strong brown bear'
becomes
1 1 1 1 0 0
1 0 1 0 1 1
where the columns correspond to "the", "quick", "brown", "fox", "strong", and "bear" respectively
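The same matrix built by hand, as a rough sketch (whitespace tokenisation only, columns in order of first appearance):

```python
from collections import Counter

docs = ["the quick brown fox", "the strong brown bear"]

# Vocabulary in order of first appearance: one column per distinct word.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# One row per document, one cell per vocabulary word holding its count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[w] for w in vocab])
```

In practice scikit-learn's `CountVectorizer` does this for you, though it orders its columns alphabetically rather than by first appearance.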
why are there 2 extra columns?
there aren't?
i must be misinterpreting it then
ah, like that
@thorn river spacy also provides word vectors you can use, look it up in their docs
here's a sloppy basic text pipeline using some data thats kinda like what you described @thorn river
import spacy
userdict = {
    'joe': {
        'text': 'milk is okay',
        'photo': '13861361.jpg'
    },
    'rhylli': {
        'text': 'i am the funniest weightlifter in florida',
        'photo': '09370813.jpg'
    }
}
en_core_web_md = spacy.load('en_core_web_md')
text_spacy = en_core_web_md.pipe(userdata['text'] for userdata in userdict.values())
Seaborn is pretty
How does this code work:
```
my_full_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis]/255, tf.float32),
     tf.cast(mnist_labels, tf.int64)))
```
load_data() generates a tuple of arrays (train_x, train_y), (test_x, test_y)
How does that get parsed into three places?
e.g. isn't the first like saying (mnist_images, mnist_labels), _ = (train_x, train_y), (test_x, test_y)
er wait, the _ is a throwaway variable so the only data that's passed is train_x, and train_y
I think?
Yes, "_" is a valid variable name in python so you just dump stuff there you aint gonna use
Your interpretation seems right
Their problem isn't that its dumping, it's dumping 2 things into 3 places
no
you have a tuple of two tuples. the second goes in _, the first is unpacked further into two variables
note that since there isn't anything magical about _, if it's a significant amount of data you should del _ so it can get garbage collected
@desert oar @hearty token Thanks guys, but I already found out that I just made a logical error in the constraints. But thanks
Guys anyone have experience with Fox, Tiger, Elephant and Musk datasets?
I want to use them to train my mi-SVM classifier, but I can not find in SVM file which type of bag is positive or which one is negative. It just contains number of the bag and then values and I also can not find anything on the web, that would use those datasets.
CMU's 10-701 is also an excellent resource on introductory ML for anyone interested. Slides, lectures and additional readings available here: http://www.cs.cmu.edu/~pradeepr/701/
@placid galleon try pip -V?
18.1
```
Collecting tensorflow
  Could not find a version that satisfies the requirement tensorflow (from versions: )
No matching distribution found for tensorflow
```
also tried that as well
@arctic moth got a link to a page describing it or something?
@placid galleon you can install other packages?
what does pip list output? (use https://bpaste.net if it's long)
also can you try using python3 -m pip instead of pip3?
```
Package          Version
---------------- -------------------
aiohttp          3.3.2
altgraph         0.16.1
async-timeout    3.0.0
attrs            18.1.0
auto-py-to-exe   2.4.2
bottle           0.12.13
bottle-websocket 0.2.9
certifi          2018.4.16
cffi             1.11.5
chardet          3.0.4
discord.py       1.0.0a1483+gec3435b
Eel              0.9.10
future           0.17.1
gevent           1.3.7
gevent-websocket 0.10.1
greenlet         0.4.15
idna             2.7
keyboard         0.13.2
macholib         1.11
multidict        4.3.1
numpy            1.14.5
opencv-python    3.4.3.18
pandas           0.23.4
pefile           2018.8.8
Pillow           5.2.0
pip              18.1
PyAutoGUI        0.9.38
pycparser        2.18
PyInstaller      3.4
PyMsgBox         1.0.6
pynput           1.4
pypiwin32        223
PyScreeze        0.1.18
python-dateutil  2.7.3
PyTweening       1.0.3
pytz             2018.5
pywin32          224
pywin32-ctypes   0.2.0
requests         2.19.1
scipy            1.1.0
selenium         3.141.0
setuptools       39.0.1
six              1.11.0
tflearn          0.3.2
tqdm             4.28.1
urllib3          1.23
websockets       6.0
whichcraft       0.5.2
yarl             1.2.6
```
python3 isn't recognised
@placid galleon TF doesn't support python3.7
Oh, i'll try downgrading
I'd recommend python3.6 + CUDA 9 if you're trying to use tf
3.6 exactly or will 3.6.7 work?
not even 100% sure it will work but it should. pip can't find it because packages have to be tagged with the versions they support when they're published, and there's no tensorflow build tagged for your interpreter
@placid galleon Are you using 64 bit python or?
try 64 bit, that should work
there's basically no reason to use 32 bit python unless you're in a very specific situation and you know you need it
I ran `import platform; platform.architecture()` and it told me it was 32bit... installing 64bit now, silly me
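For anyone else hitting this, a small sketch of a few standard-library ways to check whether the running interpreter is 32 or 64 bit. The variable names are just for illustration.

```python
import platform
import struct
import sys

# What the interpreter binary reports, e.g. '64bit' or '32bit'
bits_from_platform = platform.architecture()[0]

# Size of a C pointer in this interpreter, in bits
bits_from_pointer = struct.calcsize("P") * 8

# sys.maxsize only exceeds 2**32 on a 64 bit build
is_64bit = sys.maxsize > 2**32

print(bits_from_platform, bits_from_pointer, is_64bit)
```

Run it with the same interpreter that pip installs into (for example via `python -m pip`), otherwise you may be checking a different Python than the one that failed.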
Ok, i've installed 64bit
but i cant just hit cmd and type python now
i need to do it in the exact path
now pip doesn't work
yikes 
nevermind, sorted the path
Yeah that sorted it, installed tensorflow
woo
Heya guys! two quick questions about time series forecasting:
- What would you use to forecast a dataset of at most 50 datapoints that are unpredictable?
(Every dataset of <= 50 is different, so you can't know if it even has seasonality in it or it doesn't)
Moltz?
Basic multivariate linear regression?
ARIMA seems too much as I don't think it will be accurate with <= 50...
- I heard there's an interesting way of doing that by converting all the cycles to sine functions, but regarding this and seasonality I'm wondering: is there a way to programmatically determine whether there's any seasonality in it at all (as the dataset is different every time it comes)?
@placid galleon highly recommend conda or at least a virtualenv
@rough pecan i've done ARIMA on ~15 datapoints and gotten useful output. depends on the problem
can you describe the problem a bit more
inputs, nature of data, etc
I'm starting the project on AliExpress but it could grow from there...
The datapoints are initially the order amount and the date of the orders...
... but I'll be doing a lot of testing with both
A) existing datapoints (review count, review score average, etc) and
B) newly created datapoints from the existing data, such as scores made by myself, sales per country etc... you know, using the data that's there to create more data out of it... feed A/B into different models and see if there's anything interesting I can find...
I'm quite new to this so it would be both interesting and quite rich in my learning process (I believe) to try out playing with all these datapoints to see if any of those (again, both the ones out there and the new data I'll create from it) could help in the search for trendy / hot products...
So, the goal: Spot trendy products that have a forecast of increased sales for the following days... @desert oar
So, for some reason despite the products having thousands of orders AliExpress only allows you to see the first 50 pages of orders.
We could assume then that this is a univariate forecasting problem, but as I said, I'll be doing a lot of testing with more data, so it's most likely to end up being multivariate in the long run, unless of course the other datapoints aren't really useful in the forecast... we only know for certain after we test... right?
yeah so you have a couple of options
basic option: automatically fit arima to each product
i actually implemented something like that at work a long time ago, basically you run through a bunch of hypothesis tests then fit a model. nowadays you can probably get away with auto.arima in the R package forecast, or implement it in Python. it's based on the practices in https://otexts.org/fpp2/
there are many traditional stats methods for seasonal decomposition as well
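One rough way to check programmatically whether a short series has any seasonality is to look at its autocorrelation at candidate lags. A minimal sketch in plain Python: `autocorrelation`, `guess_period`, and the 0.3 threshold are made up for illustration, and the series is synthetic, not real order data.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant series has no structure to find
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def guess_period(series, max_lag, threshold=0.3):
    """Return the lag with the highest autocorrelation above `threshold`,
    or None if no lag clears it. The threshold is an arbitrary choice."""
    best_lag, best_acf = None, threshold
    for lag in range(2, max_lag + 1):
        acf = autocorrelation(series, lag)
        if acf > best_acf:
            best_lag, best_acf = lag, acf
    return best_lag

# Synthetic example: 49 daily points with a weekly cycle
weekly = [10 + 5 * math.sin(2 * math.pi * t / 7) for t in range(49)]
print(guess_period(weekly, max_lag=14))       # 7
print(guess_period([10.0] * 49, max_lag=14))  # None
```

A trend also produces high autocorrelation at every lag, so difference or detrend the series first. With only ~50 points, treat the result as a hint rather than a proper test.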
but yes, one way to encode "cyclical" data (eg. day of month, day of year, hour of day) is to use polar coordinates
so 0 and 23 hours are closer than 0 and 4 for example
you could probably fit one big linear model or neural network or whatever that way. drop in last 7 periods sales, plus polar encoded day, or something
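A minimal sketch of that cyclical encoding, assuming the value is mapped to a point on the unit circle (cos/sin of the angle). `encode_cyclical` and `distance` are illustrative names, not from any library.

```python
import math

def encode_cyclical(value, period):
    """Map a cyclical value (e.g. hour 0-23 with period 24) to (cos, sin)."""
    angle = 2 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

def distance(p, q):
    """Euclidean distance between two encoded points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

h0, h4, h23 = (encode_cyclical(h, 24) for h in (0, 4, 23))

print(distance(h0, h23))  # about 0.26, so midnight and 23:00 are neighbours
print(distance(h0, h4))   # about 1.0
```

The two resulting columns (cos and sin) can then be fed to the model alongside things like the last 7 periods of sales.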
more advanced option: bayesian arima model where the parameters across all products share a common distribution, allowing you to pool information that way
there are methods for doing online/incremental updates of bayesian models but it's not as easy as just running some more epochs on a NN
Why would you want to pool such a thing?
shared information across products
right? it's not like every time you look at a product online, you forget everything you know about other products
pooling will be especially useful for learning about cyclicality/seasonality
That sounds useful when forecasting, let's say, the overall website performance, but would it be good when you want them to compete with one another? It sounds more like an inclusion situation than an opposition one, right?
For competition (which is the main objective), I believe it may be more than enough to keep them separated: together in the same database of course, but separated in their analysis, and then just compare the results between all of them
and the effect of any covariates
what do you mean separated?
or competing?
you said before that you're trying to forecast when sales are about to increase in some product
the more you know about a product, the easier that should be, right?
Correct...
What I mean is that it's more about finding trends in individual products than in the business overall... the idea is to make the products "compete" with each other, which is why I assume that keeping their analysis separate (let's say an ARIMA fit for each product, without pooling them all together) would be a little more accurate, wouldn't it? How would another product's information benefit the prediction for this one?
My potatoes are about to burn in the oven! I have to run to check them out haha... brb.
Sorry if I'm asking too many questions but this is just such a juicy conversation!
Thank you so much for helping me out, you're bringing a lot of value and clarity to my sight
does forgetting what you know about Product A help you make better predictions about Product B?
Maybe, probably?... unless you're trying to predict the trend of a certain category, right?
Unless product A and B are the exact same products of course...
For this I'll be training an image recognition model to compare the pictures from two different products, in case different publications (A/B) are two publications of the same product. In that case I was planning to "merge" those two; until now I never thought about merging the data of ALL products
@desert oar
you aren't merging them entirely. you are just sharing common information
it's effectively a form of regularization
eg you shrink all arima coefs towards 0
Just to make the model (in case it was for example a neural net) more accurate in its predictions, just for the sake of making it smarter, and then predict the trend separately for each product?
Oh... hell, I hadn't even opened my mind to the possibility of sharing common data for the sake of normalizing it. I thought the factors affecting a product were very different for each one, as the market (people) and behaviors for each product are very different
yes, you still make separate predictions of course
it's not a simple process to be sure
it takes some doing. but yes the idea is to only share information on common factors
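To make the shrinkage idea concrete: this is not a Bayesian ARIMA, just a sketch of partial pooling on a single coefficient per product. Each product's own estimate is pulled toward the pooled value, and products with fewer observations are pulled harder. The product names, coefficient values, and `prior_strength` are all made up for illustration.

```python
def shrink(estimates, counts, prior_strength=10.0):
    """Blend each per-product estimate with the count-weighted pooled mean.

    weight on own estimate = n / (n + prior_strength), so products with
    little data lean mostly on what the other products say.
    """
    total = sum(counts.values())
    pooled = sum(estimates[p] * counts[p] for p in estimates) / total
    shrunk = {}
    for product, own in estimates.items():
        w = counts[product] / (counts[product] + prior_strength)
        shrunk[product] = w * own + (1 - w) * pooled
    return pooled, shrunk

# Hypothetical per-product AR(1) coefficients fitted separately
estimates = {"lamp": 0.9, "mug": 0.2, "cable": 0.5}
# Number of datapoints behind each estimate
counts = {"lamp": 5, "mug": 50, "cable": 45}

pooled, shrunk = shrink(estimates, counts)
print(pooled)   # about 0.37
print(shrunk)   # lamp moves a lot (5 points), mug barely moves (50 points)
```

You still get one prediction model per product. The pooling only decides how much each product's coefficient is trusted on its own.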
Hi, I need some help with data manipulation. I have a table with 3 columns. I want to group by a, then remove duplicated groups. That is, once I've done the groupby, I might have 2 groups that contain the exact same "B" rows, and I want to remove those duplicated groups.
You want to remove duplicates within each group?
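If the goal is to drop whole groups whose B values repeat an earlier group's (one reading of the question, not confirmed), a pandas sketch could look like this. The column names `a`, `b`, `c` and the data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 3],
    "b": ["x", "y", "x", "y", "x", "z"],
    "c": [10, 20, 30, 40, 50, 60],
})

# One hashable "signature" per group: the tuple of its b values, in row order
signatures = df.groupby("a")["b"].apply(tuple)

# Keep only the first group (by sorted key) for each distinct signature
keep = signatures.drop_duplicates().index

# Filter the original rows down to the surviving groups
deduped = df[df["a"].isin(keep)]
print(deduped)
```

Here groups 1 and 2 both have b values ("x", "y"), so group 2 is dropped and groups 1 and 3 remain. If row order within a group shouldn't matter, sort the b values before building the tuple.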