#data-science-and-ml

fallow summit
#

I didn't have multivariate calc in high school, but we had tons of linear algebra here in Poland

stable tinsel
#

(ALE)

#

ALE is the most popular gaming environment that's being used for AI research right now

#

i would search around and try to find solutions to ALE roms

#

but as @lean ledge points out, you won't understand the solutions until you get through at least linear algebra and bayesian statistics

fallow summit
#

dw I'll learn it

#

ALE looks cool

woven tundra
#

There's a Udemy course taught by Kirill Eremenko called ML with Python

#

Absolutely zero math, all he does is say how to use the ML libraries and what to put in and what you get out

#

If you just wanna play around with something and start playing with the libraries with no knowledge of the math required to understand what's going on, you could check that out

#

And hey who knows, if it piques your interest you could look into the math behind everything

#

Everybody learns in different ways and maybe tinkering until it fascinates you is your way

#

It certainly was mine for programming in general

#

But if you just wanna learn how to use the libraries you could check out some free stuff out there too. Start with what the usual machine learning models are and look up how to implement each of them in Python

#

I think Google has an intro to ML course as well somewhere on their website

fallow summit
#

@woven tundra exactly what I meant 😉 Thanks 😃

woven tundra
#

No worries

small ore
#

@lean ledge I never argued about the need to learn maths for ML or anything else. It is all about how the math is approached. Remember the discussion was about Andrew Ng's course and the Columbia course on edX. Based on my viewing of a couple of videos of the Columbia course, it is mere recitation of mathematical formulae without even clearly mentioning what each of those variables stands for. Nor is it in any way making it interesting and getting people involved. Of course if the viewer already knows the math and is clever enough to understand the intuitions and limitations then it may seem good to them and a mere revision. From my vantage point there is no point in reciting those math formulae without going into the nitty-gritty. It is not about math vs no-math but all about the level of math and the approach. I am personally disappointed with the Columbia course (based on viewing only a couple of lessons). It should either point to a math course as a pre-req or just get to the point and omit the math without the basic details of what stands for what (I have seen a few texts on statistics and ML now and the notations vary hugely. If you already know the math it is about being able to adapt, but that is not the case with everyone). I would love to see a mathy course on ML but with a more beginner-friendly and nuanced approach.

#

@fallow summit I have a tonne of links on ML related courses and it all depends on how you want to approach it and what you are already proficient with (programming etc.). DM me if you want to look at the text wall of links 😄

river plume
#

I've completed 3 courses in ML:
1st one was ML A-Z by Kirill Eremenko on Udemy which involved no math at all
2nd one was ML on Coursera by Andrew Ng and
3rd one was Maths for ML on Coursera offered by Imperial college london

#

I wasnt really understanding the math taught in andrew's course so the 3rd one helped me A LOT in understanding Andrew's course

#

I recommend you to check it out if you arent understanding it.

lean ledge
#

@small ore perhaps we're remembering differently because Columbia's course had excellent explanations. It was far from recitations of formulae. And I'm not sure what you expect. ML is essentially a field of maths and it has some bare minimum mathematical prerequisite. It's the same amount any quantitative degree should teach you. It's silly to think you can make progress in differential geometry without any background in calculus and ML isn't any different.

river plume
#

Agreed @lean ledge

lean ledge
#

But don't fault the course if you don't have a mathematical background because the course was absolutely excellent

river plume
#

@lean ledge can you suggest some advanced courses on ML and its applications?

#

specifically Neural Networks since I'm quite fascinated with it

small ore
#

I mean. Now that you say it, I am confused. Are there multiple Columbia courses on edX? I know basic calculus and have studied statistics and probability and stuff long ago. I am not sure if that is enough for the "bare minimum" standards. I also do not understand the notation of any set of formulae without them spelling it out, preferably in a handout

lean ledge
#

There was no special notation that wasn't already standard in maths. Yes, they didn't go through every single symbol like they would in high school but everything was explained qualitatively and its purpose was explained. They did assume someone can read mathematical equations on their own. E.g. they would explain they would take the distance and the formula for distance would pop up. I don't think it's unfair to assume someone doing an ML course would know how summation notation and Pythagoras' theorem work without having them explained? Perhaps if every single formula needs to be explained, you aren't as fluent at maths as you should be?

#

@river plume Once you've gone through an introductory ML course, for the most part your doors are open for starting on any specific topic

small ore
#

Pythagoras' theorem, the distance formula and summation notation are all easy, and trying to explain them would make it boring and digress a lot from what is being taught. I am not talking about that at all. If they are teaching probability/expectations/statistics etc. and not using the same notation as another text on the topic (which we can refer to, or have to refer to because the course does not always make it clear) then it is difficult. Also, for me an approach where one explains the math through an example, especially one relating to the objective the course is trying to achieve, would make it more involving

lean ledge
#

@river plume you can try Goodfellow's deep learning book, it's meant to be good. Stanford has a DL course which posts its syllabus and lecture slides online

#

Their probability notation is pretty standard across any text I've seen.

#

There's honestly not that many ways you can write probability stuff differently

#

And until what level do they need to explain everything with examples and qualitative expressions? Everyone comes with different levels of maths and they can't just spend four times as much effort to record more explanations of basic things with more examples and explanations of every single part of the formula just because someone is taking a maths course without prerequisite maths knowledge

small ore
#

As you said it is a difference in view-point arising from our different levels of understanding math. But I will certainly say I will not be ( and ask anyone to not be) disheartened at the lack of an advanced knowledge in math to learn ML. Maybe Jobs are a different thing. But if you want to enjoy learning ML that level of math is either not necessary or can be learned up. I find that Andrew Ng sometimes oversimplifies math and would have liked more but I am kinda happy I am learning something there. Columbia one is the other end. Makes it look complex ( if not being complex) and either above my level or requires a lot lot more effort from me.
Btw, I think in the amount of text you have typed for this discussion, you could have taught us a couple of lessons in probability 😉

lean ledge
#

I actually could have! I actually think I'll write a blog post or two at some point when my exams are over

small ore
#

👍

river plume
#

Thanks

small ore
#

Maybe Rags ( If I may take the liberty of calling you so), you could also recommend me a text to read up perhaps. (Sometimes it can be better than a lecture). Preferably something that is available free online

lean ledge
#

But CS230 should be better regardless

#

It doesn't have recorded lectures unfortunately

#

I don't actually know a suitable textbook for ML. I went with ESL at first and it was too dense even for me and had weird way of phrasing things. Much better as a reference when you already know something than as a new way to learn. Bishop's pattern recognition and ML is what I've settled on for myself since it clicks with me quite well

#

But likely a bit too intense for you mathematically. Best covered by an upper undergrad in physics, maths, engineering etc, probably even a bit too much for most CS majors.

#

The ISL exists but it had no maths at all and has similar problems as Andrew's course at times. Still better imo. At least it doesn't miss obvious stuff

#

Better at building conceptual understanding

#

They realise that if you don't understand the maths, you won't be able to implement it so they also have examples in R

small ore
#

ESL I reckon is Elements of Statistical Learning? What's ISL? And meh, I do not want to learn R

lean ledge
#

Introduction to statistical learning. Book by the same people behind it but ESL was meant to be an introductory text for someone just out of uni doing their PhD. Isl is essentially meant for people from non quantitative degrees that want to get into some basics, eg for PhD students in biology or psychology or something where they might use ml for the data but they don't understand ML themselves so they want some basic background on things

small ore
#

I am neither here nor there. I hang in-between. 😄 Anyway, thanks for the recos, Raggy. Looking forward to your blog

lean ledge
#

Yah it can be hard to be in between situations. I felt the same when I hadn't found Bishop's book

#

There's probably a book for your level too, just keep searching

tranquil iron
#

So as a computer science student, I should be starting off with ESL?

reef bone
#

I would recommend Bishop's book if you can handle the math

#

I'm a BSc CS student and it worked for me

tranquil iron
#

Pattern Recognition and ML?

reef bone
#

Yes

tranquil iron
#

Alright I'll check it out

lean ledge
#

Bishop's is amazing ❤ ❤ ❤

gritty hawk
#

hi I'm using pandas at work to read in an excel file, compare strings with a database and make a new dataframe filled with ids pointing to those rows

#

however

#

read_excel is moving the contents of a column over to another column for some damn reason

#

anyone ever experienced something like this?

gritty hawk
#

found the issue: don't have spaces in your column names, people

woven tundra
#

i've never had issues with spaces in my column names

#

i've had plenty of issues with duplicate column names however

#

what exactly was your issue (for the benefit of anyone who may have the same problem in the future)

woven tundra
#

How do we visualize higher dimensions such as 4D or 5D? I mean, is it even possible to conceive beyond numbers?

lean ledge
#

to some extent, yes

#

not very productive though usually

woven tundra
#

Would you mind explaining how it can be done to some extent?

lean ledge
#

We build our 4D world using Tetrahedral (instead of Triangular) Meshes, and show 4D Crystals as an example. See also: How to walk through walls using the 4th...

How do you think about a sphere in four dimensions? What about ten dimensions? Podcast! https://www.benbenandblue.com/ Problem-driven learning on at https://...

#

wrong ping

#

sorry :c

#

@woven tundra

woven tundra
#

Thanks @lean ledge !

lean ledge
#

nw

gritty hawk
#

hi

#

@woven tundra since you seem to know about pandas

#

mind helping me out with a query?

woven tundra
#

Sure @gritty hawk , what's up?

fallow summit
#

Hello again!

#

I found this one

#

And it looks quite cool. I will go through this and the Andrew Ng course 😉

hardy drift
#

i've tried experimenting with pyautogui and pytesseract to recognize these numbers, but neither work. how would you go about it? (it's runescape btw)

#

with pyautogui i manually saved images of each number 0,1,..9 and tried to find them in the image. pytesseract spits out letters (i think for 244 it read it as "eag")

lyric canopy
#

One thing you can try is to restrict the characters pytesseract is looking for

#

@hardy drift
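
A minimal sketch of what restricting the characters could look like, assuming the `pytesseract` and Pillow packages plus a local Tesseract install. `tessedit_char_whitelist` and `--psm` are Tesseract options, but how well the whitelist is honoured has varied between Tesseract versions and engines, so treat this as something to try rather than a guaranteed fix. The image path and the greyscale conversion are illustrative assumptions.

```python
def digits_only_config(psm: int = 7) -> str:
    """Build a Tesseract config string that only allows the characters 0-9."""
    # --psm 7 asks Tesseract to treat the image as a single line of text;
    # tessedit_char_whitelist limits the characters it may output.
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def read_number(image_path: str) -> str:
    """OCR a cropped screenshot region; image_path is a hypothetical file name."""
    # Imported here so the config helper works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    # Assumption: converting to greyscale first may help on game UI text.
    image = Image.open(image_path).convert("L")
    return pytesseract.image_to_string(image, config=digits_only_config()).strip()
```

Cropping the screenshot down to just the number before calling `read_number` is likely to matter as much as the whitelist.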

placid snow
#

What are you trying to do with runescape? @hardy drift

lone mist
#

my understanding is that tesseract is trained for things like scanned documents, not screenshots of a computer screen

#

hence poor results

small ore
#

I thought this maybe helpful to anyone trying to learn ML/AI:

Curated list of tutorials: https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78
Curated list of cheat sheets (helpful when you quickly want to look up a formula or recall a method in an ML module): https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6
#

This list is good too:
http://rise.cse.iitm.ac.in/wiki/index.php/Introduction_to_Machine_Learning

mint pike
#

Is machine learning really hard?

reef bone
#

This is super nice for anyone with interest in deep learning
https://github.com/floodsung/Deep-Learning-Papers-Reading-Roadmap

mint pike
#

I'm a beginner in python and im thinking of diving into it

#

@reef bone thanks

tame jacinth
#

well, python is the right language

#

thats all I know

reef bone
#

I would probably do a little bit of background reading and then familiarize yourself with something like numpy, try to solve some basic problems and see if you like it

small ore
#

The links I have posted above have recommendations for resources/tutorials. I suggest picking one and diving in before you decide if it is hard or not

mint pike
#

i will read them! @small ore @reef bone , Does it require lots of computational power if I'm training a neural network like thousands of images, books, etc.....

small ore
#

Maybe. Depending on the problem

reef bone
#

Don't worry about that now

small ore
#

Most problems that do not involve huge database of images can be done on simple laptops/pcs

#

If you need more later you can hire space/computing power on the cloud as per your requirement

reef bone
#

At some points it becomes computationally expensive, but you probably won't encounter any issues for a long time if you're just starting out

#

If you have a non-ancient nvidia gpu you can easily accelerate using tensorflow-gpu

#

But for now that is not necessary at all

mint pike
#

yea im thinking too far ahead lol. ive got a decent system

#

AWS is pretty good

lean ledge
#

Thousands of books isnt enough data, you need millions :P Generally speaking, you often dont make your own NN for that, you get a Resnet model pretrained on Imagenet and wipe and retrain last few layers

#

Transfer learning is currently the only decent way to train on little data
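
A rough sketch of the "pretrained ResNet, retrain the end" idea, assuming PyTorch and torchvision (0.13 or newer for the `weights=` argument). The choice of ResNet-18 and of replacing only the final layer is an assumption made to keep the example small; "the last few layers" could equally mean unfreezing more of the network.

```python
def build_finetune_model(num_classes: int):
    """Return a ResNet-18 with frozen ImageNet features and a fresh classifier head."""
    # Imported lazily; this needs torch and torchvision to be installed.
    import torch.nn as nn
    from torchvision import models

    # Downloads ImageNet-pretrained weights on first use.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze every pretrained parameter so only the new layer gets trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer; new layers are trainable by default.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

An optimiser would then be given only `model.fc.parameters()`, so training touches the new head and leaves the pretrained features alone.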

#

@reef bone The reading list is excellent!

#

I never even thought to try look at the original dropout paper, whoa

small ore
#

Btw Raggy, Bishop seems good and I am able to manage till now. Only through first chapter though. I like the presentation

lean ledge
#

👌

#

It does ramp up a bit quickly. Try to keep in mind, it's a maths book. Maths books are hard to read because they're always dense af

#

Always worth the effort though

#

the presentation is soooo much better than ESL imo

small ore
#

As long as the explanation and clear description of notation ( preferably inline wherever needed) goes on I will be happy to read

late garnet
#

@lean ledge what algorithms do you specialize in?

lean ledge
#

I wouldnt say I specialise considering I'm technically still fairly new to the field, at least compared to my colleagues. But I lean towards things that borrow from the electrical engineering side of things, e.g. things based on signal theory. Both traditional and ML-based computer vision, time series/stochastic processes etc.

#

Currently do time-series-y forecasting stuff at one job and am about to start an internship at CSIRO's Data61 in their Robotics and Autonomous Systems group for Deep Computer Vision

#

@late garnet

late garnet
#

Nice - I might need to bug you about some stuff. I focus on time series, nlp and clustering problems.

#

@lean ledge

#

Primarily anomaly detection in time series - not so much forecasting

chilly shuttle
#

@lean ledge you in australia then?

lean ledge
#

@chilly shuttle oui

chilly shuttle
#

data61 has some pretty good people, worked with them before

lean ledge
#

It does, some very smart people. Where are you? Not brissy by any chance? @chilly shuttle

small ore
#

@lean ledge Dumb math question. What does the math notation that looks like modulus notation, but with two vertical lines on either side of the expression, stand for?

chilly shuttle
#

|like this| ?

#

@lean ledge nah melbs but I stop by brissy sometimes and know a few folk there

simple crag
#

@small ore do you have a picture?

small ore
#

I think bicubic showed a better way than a picture. Why didn't I think of it.
It is ||x1*w-x2*w||

#

The lines look rather close by though

simple crag
#

Usually that's norm

small ore
#

Norm?

simple crag
#

In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is ...

small ore
#

I am not sure what a negative-length vector is

#

If such exists

lean ledge
#

Yep, || is norm. Vectors can be generalised to things that arent just lists of numbers or geometrical objects and hence we create specific terminology that applies to all sorts of different vectors. Thats why you'll also see dot products like a . b written in inner product notation as <a, b> or (a, b)

#

You might want to study some linear algebra if you haven't already. It's all very relevant to ML. You might even see something like the normal equation and its derivation, which might help you understand linear regression in a slightly deeper way

#

essentially norm is the generalised version of the "length" of a vector @small ore

#

and in the same way you might see |a| as sqrt(a.a), you might see ||a|| = sqrt(<a,a>)
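
As a small illustration of that identity for ordinary lists of numbers, here is a sketch using only the standard library. The function names are made up for the example, and it covers only the usual Euclidean case, not the more general vector spaces mentioned above.

```python
import math


def dot(a, b):
    """Standard inner product <a, b> of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm ||a|| = sqrt(<a, a>)."""
    return math.sqrt(dot(a, a))


print(norm([3.0, 4.0]))  # 5.0, the familiar 3-4-5 triangle
print(norm([0.0, 0.0]))  # 0.0, only the zero vector has zero length
```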

small ore
#

Currently it does not appear relevant. It was used for regularisation weights. Maybe I will dive into linalg later for the deeper meaning

lean ledge
#

It's taking the length or equivalently the "size" of the vector and penalising it for being too large

small ore
#

Huh. The 'size' itself is penalizing weights here

#

||w||^2
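
A sketch of how a `||w||^2` term penalises large weights. The conversation does not give the full loss, so the "squared errors plus penalty" form and the `lam` scaling factor are assumptions (this is the usual ridge-style setup).

```python
def l2_penalty(weights, lam):
    """Return lam * ||w||^2, the squared Euclidean norm scaled by lam."""
    return lam * sum(w * w for w in weights)


def ridge_loss(errors, weights, lam):
    """Sum of squared errors plus the L2 penalty on the weights (assumed form)."""
    return sum(e * e for e in errors) + l2_penalty(weights, lam)


small = [0.5, -0.5]
large = [5.0, -5.0]
print(l2_penalty(small, lam=0.5))  # 0.25
print(l2_penalty(large, lam=0.5))  # 25.0
```

With the same errors, the larger weight vector pays a much bigger penalty, which is what pushes the optimiser towards smaller weights.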

late garnet
twilit bolt
#

Alternatively, you can take a linear algebra course via MIT's opencourseware: https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/

lean ledge
#

^^^^ THIS

#

Strang is a god

#

Get his book, watch his lectures

#

Udemy is too dodgy for me to trust lol

lyric canopy
lean ledge
#

That's the one

#

Nice ISL

#

There's also Linear Algebra and Its Applications

lyric canopy
#

The book we used when I first took a uni linear algebra class was Poole - Linear Algebra: A Modern Introduction. Didn't like that one at all.

#

But, this was a long time ago

#

I'm using Strang to refresh things atm

late garnet
#

Udemy courses are definitely hit and miss. Glad they have the refund policy.

thorn river
#

I have downloaded a .bz2 compressed file of a month of reddit comments from http://files.pushshift.io/
Decompressed to file would be ~24 gb of json objects per line.

I have a list of reddit usernames which I would like to check against the authors in that pushshift file and extract only 3 k:v pairs of those lines where the author corresponds to one of the authors in my list.

Example of structure of the comment file would be:

{'author': x, 'time_created': 'xy'}, etc.

In other words: a line with a json object(?).

What would be a good way to approach this? I don't have access to a boatload of RAM.
Reading the file line by line and checking if author corresponds to my list and then extracting the k:v pairs which Im interested in will probably take a lot of time considering the large size of the file.

Any tips?

late garnet
#

You can read compressed files in a buffered way. I'm guessing you only want to match the authors with the comments? What do you plan to do after that @thorn river ?

thorn river
#

Ideally I would end up with only the comments from the authors in my separate list, and see if they have a certain keyword in their flair text or in the comment text itself

late garnet
#

You can probably just iterate over the buffer and process it as you iterate looking for the key words etc.

#

Otherwise you can look up whatever is appropriate for your toolset.
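
A minimal sketch of that buffered approach using only the standard library. The field names `author`, `author_flair_text` and `body` are the ones used later in this thread; the function name and the file path are placeholders.

```python
import bz2
import json


def matching_comments(path, author_set,
                      fields=("author", "author_flair_text", "body")):
    """Yield the selected fields of each comment whose author is in author_set."""
    # "rt" decompresses on the fly and yields text lines, so only one
    # line is held in memory at a time.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            comment = json.loads(line)
            if comment.get("author") in author_set:
                yield {key: comment.get(key) for key in fields}
```

Because it is a generator, the keyword check can be done while looping over `matching_comments(...)` without ever building the full list.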

placid snow
#

Does that allow for manipulation of large json files as well?

#

I recently had the issue of a json file being too large to be opened normally, and was looking for a solution like this, if it does

thorn river
#

I used pd.read_json once but it froze my system, so there might be something more to it

late garnet
#

Just use read_json instead of read_csv.

placid snow
#

Thinking about it, my file sort of has a json object per line, so it might be worth reading line by line for me.

late garnet
#

I'm not sure how it would handle a single json object that is large, but the approach I listed should work for line by line objects.

thorn river
#

Thanks for the pointers! I wrote a script which will do what I (hope) i intended to do. If you could have a look and see if this will probably work?

    data = {}
    for i, line in enumerate(fi):
        parser = json.loads(line)
        # 'author' is a single string, so test it directly against the set
        if parser['author'] in author_set:
            data[i] = (parser['author'], parser['author_flair_text'], parser['body'])
#

oh wow dat formatting

#

c

#

sec

#

welp

#

I hope you understand formatting doesn't seem to work

late garnet
#

It looks ok to me, but honestly I would need to test it myself. 😃

#

You can just set a break after the first line to test it.

thorn river
#

You mean like: if i == 1: break?

late garnet
#
import bz2
import json

with bz2.open(jsonnew, 'r') as fi:
    data = {}
    for i, line in enumerate(fi):
        parser = json.loads(line)
        print(parser)
        break  # stop after the first line while testing; remove to process the whole file
        if parser['author'] in author_set:
            data[i] = (parser['author'], parser['author_flair_text'], parser['body'])
thorn river
#

ah alright, thanks!

late garnet
#

Also - be sure to create a set object out of your users that you are looking for.

thorn river
#

I shouldve mentioned that author_set is a set()

late garnet
#

Great - I just wanted to be sure

thorn river
#

I assume because checking if it is in a set is faster right?

late garnet
#

Yes

thorn river
#

Alright, will also check out that article you linked. Thanks!

late garnet
#

FYI - I tested this in pandas and it appears my specific tar.gz that I created has extra information on the first line while iterating - "something.json0000664000175100017510000000034413370313602013031 0ustar tylertyler{"author": "someone", "text": "aldskfjasdf"}"

#

With pandas it breaks with read_json, however with read_csv it works.

#

Weird

#

Nice - I realized that I was being silly and have a tar.gz file not just a gunzip file.
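
To illustrate the difference: a `.gz` file is a single compressed stream, while a `.tar.gz` is a compressed archive whose member headers show up as junk if it is read as plain gzip. A sketch with the standard library, where the function names and the member name are placeholders:

```python
import gzip
import json
import tarfile


def iter_json_lines_gz(path):
    """Read JSON lines from a plain gzip file (one compressed stream, no archive)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def iter_json_lines_targz(path, member_name):
    """Read JSON lines from one member of a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        # extractfile returns a binary file object for regular files.
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file in the archive")
        for raw in member:
            if raw.strip():
                yield json.loads(raw)
```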

#
import pandas as pd
for chunk in pd.read_json('/home/tyler/src/something.json.gz', lines=True, chunksize=2):
    print('new chunk')
    print(chunk)
#

new chunk
     author         text
0   someone  aldskfjasdf
1  someone2  aldskfjasdf
new chunk
     author         text
2  someone3  aldskfjasdf
3  someone4  aldskfjasdf
new chunk
     author         text
4  someone5  aldskfjasdf
#

@thorn river This works pretty well, however be sure that pandas supports the compression type beforehand.

late garnet
#

You may also want to consider multi-threading if you are processing so much data. It is fairly easy to do.

rich swift
#

Hi, has anyone ever tried to create Jupyter widgets before? I'm curious what approach, lib, or js-framework people use 🤔

thorn river
#

@late garnet Thanks! Don't know much about multithreading but I can check it out. I'll try to match what is in the chunks with my list of authors and work that way!

late garnet
#
from multiprocessing import Pool

import pandas as pd

def process_chunk(chunk):
    pass

pool = Pool(8)

all_results = []

for chunk in pd.read_json('/home/tyler/src/something.json.gz', lines=True, chunksize=100):
    all_results = all_results + pool.map(my_function, chunk)
#

Something like that will give you a list of results.

#

@thorn river

thorn river
#

def process_chunk(chunk):
    pass

what's the reason for this function?

#

What does it do?

silk acorn
#

it passes

thorn river
#

I suppose you use that in place of my_function right?

late garnet
#

Yeah sorry it was quick example

thorn river
#

Ah no problem

#

So I could then iterate all_results to check if its in the author_set

#

If I understand your code correctly

late garnet
#

The process_chunk would do all of the processing of each row for each chunk. Chunk in this case is a dataframe row.

#

Just return what is processed back. It could be an empty list or list of what you want.

#

@thorn river

#

My example isn't perfect, but conceptually you should have an idea of how to use it.

#

pool.map maps an iterable of items across N worker processes (multiprocessing.Pool uses processes rather than threads)

thorn river
#

So I would have to edit the process_chunk function for it to only include authors in author_set?

#

Having a hard time wrapping my head around it

late garnet
#

Yes
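
One way the pieces could fit together, as a sketch rather than the exact code from this thread: `process_chunk` does the author filtering and a pool maps it over chunks. Everything here is an illustrative assumption, including the function names, the use of plain JSON lines instead of pandas chunks, and holding the lines in a list (a 24 GB file would need to be fed in batches instead).

```python
import json
from functools import partial
from multiprocessing import Pool


def process_chunk(lines, authors):
    """Return (author, body) pairs for the JSON lines written by a wanted author."""
    kept = []
    for line in lines:
        record = json.loads(line)
        if record.get("author") in authors:
            kept.append((record["author"], record.get("body")))
    return kept


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def filter_in_parallel(lines, authors, chunk_size=1000, workers=1):
    """Run process_chunk over chunks of lines and flatten the results in order."""
    task = partial(process_chunk, authors=authors)
    chunks = chunked(lines, chunk_size)
    if workers <= 1:
        per_chunk = list(map(task, chunks))  # serial fallback
    else:
        with Pool(workers) as pool:  # worker processes, not threads
            per_chunk = pool.map(task, chunks)
    return [row for rows in per_chunk for row in rows]
```

When using `workers` greater than 1, the call should sit under an `if __name__ == "__main__":` guard so it also works on platforms that start workers by re-importing the script.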

lapis sequoia
#

is a gtx 1050 alright for data science?

placid snow
#

Hi @lapis sequoia would you mind setting a nickname of the server to something consisting of characters on a normal US/EU keyboard so others can mention you easier?

lapis sequoia
#

oh yeah sorry

placid snow
#

Thank you :)

lapis sequoia
#

there

#

i'll rephrase my question, anybody know of any good gpus for data science > £200?

late garnet
#

@Viibrant depending on your needs, maybe AWS could be a cheaper alternative?

#

Disclosure - I'm not sponsored by AWS. 😃

hollow gulch
#

anyone know a quick script that I can keep all the columns of this df4, and make a df5 that is the same layout but group by 'Order Number' + 'Item Number' but 'Quantity Ordered' is a sum of those?

late garnet
#

Is this an excel or pandas or ? question?

hollow gulch
#

i have data available in excel but I am using python to process it and I use the pandas package for it. hope it makes sense

#

the snapshot is out of pandas

#

hope I am posting in the right place

olive trench
#

@hollow gulch

import pandas as pd

df4 = pd.DataFrame({'Order Number': [1,1,1,2,2,3,3,3], 'Item Number':[1,1,2,5,5,4,5,5], 'Quantity Ordered':[10,15,5,3,3,7,5,6]})

df5 = df4.groupby(by=['Order Number', 'Item Number'], sort=False, as_index=False)['Quantity Ordered'].sum()
#

I put in some dummy data

hollow gulch
#

Thanks @olive trench I tried something similar. Let me see if it keep all the columns as it was

#

splendid!

#

can you explain the code a little bit?

#

for academic purpose

olive trench
#

What exactly do you need?

#

The by parameter specifies which keys to group by, sort=False skips sorting the keys, and as_index=False keeps the keys from being used as the index. Then you sum the groupby object by quantity and it returns a dataframe

hollow gulch
#

here's my version df5=df4.groupby (['Order Number','Item Number'])['Quantity Ordered'].sum()

#

perhaps you could help me understand the difference between my version and yours?

olive trench
#

without the ['Quantity Ordered'].sum() part it's just a groupby object. If you perform a function on one of the columns it returns a dataframe where the column is the result of the function you used for all the grouped elements

#

you don't have the as_index=False

#

it's true by default, which will put the keys into the index
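
A small sketch of the difference, with made-up data using the column names from this thread. With the default `as_index=True`, summing a single column gives a Series indexed by the keys; with `as_index=False` the keys stay as ordinary columns of a DataFrame.

```python
import pandas as pd

df4 = pd.DataFrame({
    "Order Number": [2, 2, 1, 1],
    "Item Number": [5, 5, 1, 2],
    "Quantity Ordered": [3, 3, 10, 5],
})
keys = ["Order Number", "Item Number"]

# Default as_index=True: the keys become a MultiIndex and the result is a Series.
as_series = df4.groupby(keys)["Quantity Ordered"].sum()

# as_index=False: the keys stay as ordinary columns and the result is a DataFrame.
# sort=False keeps groups in order of first appearance instead of sorting the keys.
as_frame = df4.groupby(keys, sort=False, as_index=False)["Quantity Ordered"].sum()

print(type(as_series).__name__)           # Series
print(list(as_frame.columns))             # ['Order Number', 'Item Number', 'Quantity Ordered']
print(as_frame["Order Number"].tolist())  # [2, 1, 1]
```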

hollow gulch
#

what does sort= false do

olive trench
#

doesn't sort your keys. If you don't need it, it's better performance if your dataframe is big

hollow gulch
#

I see

#

now, if I want to keep all the other columns, I have to expand the code, right? It depends on which I want to keep as is and which I want to aggregate

#

for example I have column A B C D E F
further version will be df5=df4.groupby([ all keep column], sort=False, as_index=False)['D']['F'].sum() to get sum D, F?

olive trench
#

I'm not sure if that'll work. If you're summing the column, why don't you do that first, then the groupby and sum again?

hollow gulch
#
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
       'Parent Number', 'Order Date', 'Item Number', 'Description ',
       'Unit Price', 'Extended Price',
       'Foreign Unit Price', 'Foreign Extended Price', 'Ln Ty', 'Last Stat',
       'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)['Quantity Ordered'].sum()
#

I have a large data set that summing is only applied to 'Total Quantity' and 'Extended Price' by 'Order Number' and 'Item Number'.

#

I hope that make sense

olive trench
#

I'm a little confused about the goal of this exercise. If you group by all the columns, it's only going to group identical rows

hollow gulch
#

so my actual code is something like this

#
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
       'Parent Number', 'Order Date', 'Item Number', 'Description ',
       'Unit Price',
       'Foreign Unit Price','Ln Ty', 'Last Stat',
       'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)[['Quantity Ordered','Extended Price','Foreign Extended Price']].sum()
#

with the last 3 columns being sum

#

I am just curious how much flexibility i can do with this code

olive trench
#

I am not sure if summing the list of columns will work. But as I said, if you prepare a new column that is the sum of ['Quantity Ordered','Extended Price','Foreign Extended Price'] beforehand and then just sum it again over the aggregated elements, it'll do the job

hollow gulch
#

I am not sure what that would look like. I am quite new to python

olive trench
#

df5['new_column'] = df5['Quantity Ordered']+df5['Extended Price']+...

#

sorry df4 ^

#

and then you do the groupby and use the .sum() on ['new_column']

hollow gulch
#

sounds like 3 different step if i understand correctly

olive trench
#
df4['newcol']=df4['Quantity Ordered']+df4['Extended Price']+...
df5=df4.groupby (['Cur Cod', 'Order Number', 'Sold To Number', 'Ship To Number',
       'Parent Number', 'Order Date', 'Item Number', 'Description ',
       'Unit Price',
       'Foreign Unit Price','Ln Ty', 'Last Stat',
       'Next Stat', 'Adj. Schedule', 'Adj Name', 'Status'],sort=False, as_index=False)['newcol'].sum()
#

just two

hollow gulch
#

isnt that the same as the step above?

#

just written into 2 parts

olive trench
#

as I said, I am not sure

hollow gulch
#

sorry for the dumb question

olive trench
#

I don't have your dataset so it's hard to say

hollow gulch
#

but thanks for the great help. I think that code works, I am verifying it

#

I've spent a whole day banging my head against that code

#

you definitely save the day

olive trench
#
import pandas as pd

df4 = pd.DataFrame({'Order Number': [1,1,1,2,2,3,3,3], 'Item Number':[1,1,2,5,5,4,5,5], 'Quantity Ordered':[10,15,5,3,3,7,5,6], 'irr':[5,6,4,5,2,3,5,5]})

df5 = df4.groupby(by=['Order Number', 'Item Number'], sort=False, as_index=False)[['Quantity Ordered', 'irr']].sum()
df5['newcol'] = df5['Quantity Ordered']+df5['irr']
#

it's the same thing as mine but you're doing the column sums after, not before. If you do the ...[['Quantity Ordered', 'irr']].sum(), it sums each column within the groups, but it doesn't add the two columns together

#

haha no worries

hollow gulch
#

@@ still dont get it completely yet haha

olive trench
#

I had a very frustrating experience myself today that I only just solved haha

#

well ask away

hollow gulch
#

does it matter if it does sum before/after?

#

I guess I dont understand whats going on in the background

#

I added you as a friend btw. 😃 you seem like a nice person.

olive trench
#

no. Except if you do it after you'll have the two summed columns in the dataframe

hollow gulch
#

that's what I want right? because I wanted to keep those column separated

olive trench
#

No worries, and feel free to hit me up if you need help. I am not very experienced myself but I'll try to help with what I can 🙂

hollow gulch
#

I want the price.sum() and the quantity.sum() because they were the same order number

#

the order number was split due to different shipping batches

#

but for the analysis purpose, we want to consider them as 1 order.

olive trench
#

ohhh

hollow gulch
#

I work with supply chain so these things happen a lot in my data

olive trench
#

I thought you were trying to sum those columns together

hollow gulch
#

^ thus the confusion in us haha

#

excel can't handle this stuff as flexible as python so I am trying to pick up python to do the heavy work

#

I play a lot with pandas

olive trench
#

Yeah then just specify which columns you want to have sum of for the aggregated elements
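
A minimal sketch of that, on a toy frame in the spirit of the one above (the values and the reduced column list are assumptions, not the real dataset): selecting a list of columns after the groupby sums each column on its own within each group, so split shipments of the same order collapse into one row.

```python
import pandas as pd

# Toy data: order 1 / item 1 was shipped in two batches.
df4 = pd.DataFrame({
    'Order Number': [1, 1, 1, 2, 2],
    'Item Number': [1, 1, 2, 5, 5],
    'Quantity Ordered': [10, 15, 5, 3, 3],
    'Extended Price': [100.0, 150.0, 50.0, 30.0, 30.0],
})

grouped = df4.groupby(['Order Number', 'Item Number'], sort=False, as_index=False)

# Double brackets select several columns; each one is summed separately.
df5 = grouped[['Quantity Ordered', 'Extended Price']].sum()
print(df5)
```

The quantity and price columns stay separate in `df5`; only the rows that share the same grouping keys are merged.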

#

my bad I didn't understand

hollow gulch
#

noo, you helped alot

#

😃

olive trench
#

Do you not have this data in a database? I feel like this might be easier with SQL

hollow gulch
#

trust me, I am ready to bang my head in the cabinet

#

IT doesnt trust us enough to give us access to SQL lol

olive trench
#

Oh I know that struggle, gaining access to anything in my company is a nightmare too lol

hollow gulch
#

my thought SQL would be a way to go too

olive trench
#

if it was read only you should be fine

#

it's odd though that they want you to do a job but don't give you the tools

hollow gulch
#

that's something that I am trying to convince them

#

even my boss struggle too lol

#

IT doesnt understand all of our needs

olive trench
#

are you like an analyst?

hollow gulch
#

yep

#

and you?

olive trench
#

I'm a junior data scientist

hollow gulch
#

nice 😃

#

let me know how it goes and perhaps we could learn more together

olive trench
#

sure thing! I am fairly capable with pandas

#

I am kind of fortunate that our whole company is in IT services, so most people are capable in IT. And luckily I don't have to communicate with the business divisions 😄

hollow gulch
#

haha

#

I work in business division

#

I do their heavy math

#

i used to do these things in VBA and excel

#

but I think python is the real hammer to get anything done so I am trying to pick up and get everything done in python

#

takes longer, but there's much more to learn and more flexibility in manipulating data

olive trench
#

eww VBA. I had the displeasure to work in it and nevermore 😄

python is definitely powerful. I'd like to dive a little bit into big data sometime, so I hope we get a project including it eventually

charred crest
#

Hello, is there anyone with some knowledge about reinforcement learning (Q-Learning, Monte Carlo, TD(0), greedy)?

#

Or Am I in the wrong channel ?

late garnet
#

@charred crest it is more beneficial to you and everyone else to ask a specific question.

lean ledge
#

Wish I had time to look more into RL, it's so relevant to my field

#

Very cool stuff

terse pewter
#

Seems like super interesting stuff

#

give the model candy until it likes you and gives good results

chilly shuttle
#

someone said something today that made me do a double take

#

'you can train a model on X wide feature vectors but do inference only on <X wide feature vectors'

#

that's... not right, right?

#

i mean you can do it, but running inference with consistently less information is the same as training the model without being aware of that extra information right

woven tundra
olive trench
#

@chilly shuttle maybe they included the label vectors in X...?

chilly shuttle
#

no, their example was along the lines of 'train on postcode and age, then run inference on postcode only'

#

i... don't think windmills work that way

polar acorn
#

They are probably talking about the difference between a design matrix used in pure statistics and the one hot encoded feature matrix often used in machine learning.

#

It's been a long time since I looked at this, but it's something something matrix must be invertible, remove one variable and use it as a base for interpreting the others. It's one of the small differences between pure stats and ML that's easy to trip over.
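
If that reading is right, the point can be sketched like this (the `postcode` values are made up): with an intercept column, a full one-hot encoding is rank deficient, while dropping one category as the baseline gives a full-rank design matrix.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'postcode': ['A', 'B', 'C', 'A', 'B', 'C']})

# Full one-hot encoding: one column per category, plus an intercept.
full = pd.get_dummies(df['postcode']).astype(float)
X_full = np.column_stack([np.ones(len(df)), full.to_numpy()])

# Reference coding: drop one category and use it as the baseline.
ref = pd.get_dummies(df['postcode'], drop_first=True).astype(float)
X_ref = np.column_stack([np.ones(len(df)), ref.to_numpy()])

# The one-hot columns add up to the intercept column, so X_full loses a rank
# and X'X cannot be inverted; X_ref keeps full column rank.
rank_full = np.linalg.matrix_rank(X_full)
rank_ref = np.linalg.matrix_rank(X_ref)
print(X_full.shape[1], rank_full)
print(X_ref.shape[1], rank_ref)
```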

true acorn
#

@lean ledge I'd like to be able to simulate gyroscopic technology in the vacuum of space

#

Are you studying computational science right now?

#

Im teaching myself, but want to really dive into it and learn as much as i can

lean ledge
#

@true acorn nah, I'm an engineering student. Gyroscopic shouldn't be too bad

#

Given you have the moment of inertia tensor for the body, it's a simple simulation

#

Do you know the basics of rigid body dynamics?

true acorn
#

Hell to the nawh nawh nawh. I just had a discussion in the mathematics discord server about learning math relative to the problem im trying to solve.

#

but id love to learn

lean ledge
#

Oof, yeah the thing is, computational science isn't really something people learn on their own. They usually learn it alongside and to aid in another discipline like physics or engineering or chemistry etc. Physics and engineering especially

#

Not having any background in what you're trying to simulate is harder because you don't know how it should behave and it takes longer to understand where to even start

#

But

#

There should be resources

true acorn
#

Why couldnt i learn it on my own? What if i told you i was surrounded by some professors who teach this stuff?

#

ah

#

so as long as i have resources aka mentors aka professors who study this stuff, i should be okay?

#

@lean ledge I have you too, silly

#

I just cant afford school, im trying to save, but school is way too expensive so im trying my best to learn on my own

lean ledge
true acorn
#

woooow, 1997

lean ledge
#

Gyroscopic stuff starts becoming relevant at rigid body dynamics
@true acorn

true acorn
#

Fascinating

lean ledge
#

Look through those notes and you should be able to start figuring it out

true acorn
#

i know the basics of programming and math so if i have a question about a function can i shoot you an IM?

lean ledge
#

A gyroscope is just a rigid body with a specific moment of inertia tensor
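
A rough sketch of what such a simulation could look like, assuming torque-free motion (a fair stand-in for the vacuum of space) and a made-up inertia tensor. It integrates Euler's rigid body equations for the angular velocity in the body frame with a classic Runge-Kutta step; the function names and numbers are illustrative only.

```python
import numpy as np

def omega_dot(omega, inertia):
    """Torque-free Euler equations in the body frame: I w' = -(w x I w)."""
    return np.linalg.solve(inertia, -np.cross(omega, inertia @ omega))

def rk4_step(omega, inertia, dt):
    """Advance the angular velocity by one fourth-order Runge-Kutta step."""
    k1 = omega_dot(omega, inertia)
    k2 = omega_dot(omega + 0.5 * dt * k1, inertia)
    k3 = omega_dot(omega + 0.5 * dt * k2, inertia)
    k4 = omega_dot(omega + dt * k3, inertia)
    return omega + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Hypothetical symmetric rotor: principal moments of inertia in kg m^2.
inertia = np.diag([1.0, 1.0, 2.0])
omega = np.array([0.1, 0.0, 5.0])  # rad/s, mostly spin about the symmetry axis

energy_start = 0.5 * omega @ inertia @ omega
for _ in range(1000):
    omega = rk4_step(omega, inertia, dt=0.001)
energy_end = 0.5 * omega @ inertia @ omega

print(omega, energy_start, energy_end)
```

Rotational kinetic energy should stay (almost) constant without external torque, which makes it a handy sanity check. Tracking the orientation as well would need an extra state such as a quaternion, which is left out here.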

#

I'd rather talk here

true acorn
#

okay,

#

Damn, why am i just hearing about Euler right now. i was never taught about him in school

lean ledge
#

Euler was a smart guy

#

Stuff in maths has to be named after the second person who discovered it because the first is always Euler :P

true acorn
#

rofl

#

thanks for the resource @lean ledge

lean ledge
#

Nw, happy to help

polar acorn
#

Also he had cool hats, Euler that is

late garnet
#

@lean ledge you must be pretty familiar with markovian processes and system theory yes?

lean ledge
#

To some extent ye. Don't expect too much, my knowledge is full of gaps

#

I am 18, i haven't had the time to know it all in detail

late garnet
#

Ahh okay, I'm working on a problem in understanding operational inefficiencies through deriving markov transition matrices. I was hoping someone could provide some advice.

#

Essentially, I have a system with a finite number of states, but many states are interconnected. There are many users of the system and I am comparing user level efficiency of the system to optimize business processes.

lean ledge
#

Oof, I have worked a bit on something similar and my coworkers have worked a lot on that, but it's directly connected to a product my company has and another we're working on, so I'm afraid I can't say much.

late garnet
#

RIP

lean ledge
#

Sorry D:

late garnet
#

I think I have the right idea on comparing at a high level to evaluate potential areas of inefficiency

#

What company do you work for?

lean ledge
#

Being able to rate and improve worker efficiency is a big topic in a mine, where a single extra round of shovelling a day will lead to 9M a year in profit etc

late garnet
#

Do you have general advice on how to compare quantitatively other individuals based on some weighting within the matrix of transitions?

#

A paper or general concept would be great 😃

#

I can read and apply it myself - I hope

#

I think in my case I could take the weighted average of the differences between corresponding row, weighted by the fraction of time spent in the state corresponding to that row
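
One way that idea could be sketched, assuming each user's activity is available as a sequence of state indices (the sequences, the L1 row distance and the function names below are all assumptions for illustration):

```python
import numpy as np

def transition_matrix(states, n_states):
    """Row-normalised transition counts from a sequence of state indices."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that were never left stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def weighted_row_distance(p_user, p_ref, occupancy):
    """Average per-row L1 distance, weighted by time spent in each state."""
    row_dist = np.abs(p_user - p_ref).sum(axis=1)
    weights = np.asarray(occupancy, dtype=float)
    return float(np.sum(weights / weights.sum() * row_dist))

# Hypothetical state sequences for two users of a 3-state system.
user_a = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
user_b = [0, 1, 1, 2, 0, 1, 1, 2, 0, 0]

p_a = transition_matrix(user_a, 3)
p_b = transition_matrix(user_b, 3)
occupancy_b = np.bincount(user_b, minlength=3)  # visits per state for user B
score = weighted_row_distance(p_b, p_a, occupancy_b)
print(score)
```

Here visit counts stand in for "fraction of time spent"; if dwell times per state are recorded, those would be the more natural weights.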

lean ledge
#

Sounds like a good start. Should be able to experiment a bit with that. My boss taught me all the relevant stuff so I don't really have any link to papers but with a bit of searching, you should be able to find them on your own

small ore
#

I am not even sure if all these are data-science or if we need a new OT-Advanced (Nerds only) channel

lean ledge
#

there's lots of stuff here that's not data science because everyone just redirects people here

#

it's weird

delicate nymph
#

hello

#

i was told to come here and ask you for advice

#

and after this command it alters my dataframe

data1 = []                                                                        #create empty list
for name, dates in data.groupby(pd.Grouper(freq='D')):                            #separate days
    data1.append(dates) 
lyric canopy
#

Did you check if your DataFrame is still how you want it to be between this:

data = data.drop(['use','hours','date','B','B2','E','A','w','d','year','month','day','sec'], axis=1)
###############################################################################
data = data.set_index('date_time')   

And that code you've posted? Because I don't think that code you've posted should alter it.

delicate nymph
#

up until the

data = data.set_index('date_time') 
#

my code is fine

placid snow
delicate nymph
#

it runs without any alteration

lyric canopy
#

When you say you ran the upper part, does that include or exclude that set_index line?

delicate nymph
#

yes

lyric canopy
#

Because the code you've posted in the help channel excluded it

delicate nymph
#

i run the set index too

#

and its perfect

#

when i run the groupby i get those blue cells in my last column

#

those blue cells are a datetime which is repeated but also wrong

#

they mess my whole frame

lyric canopy
#

I don't know what the colors mean, I've never used Spyder

#

Has the actual data in the DF in memory changed?

delicate nymph
#

no the original is correct

#

just the new one that divides the days

lyric canopy
#

You're not creating a new one anywhere, though. The only thing you get from a groupby is a groupby object.
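
A small sketch of that point, using a made-up frame with a datetime index: iterating over the groupby hands back per-day sub-frames, and the original frame is left unchanged.

```python
import pandas as pd

# Hypothetical readings indexed by timestamp; 2 January has no rows.
data = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0]},
    index=pd.to_datetime(['2019-01-01 08:00', '2019-01-01 09:00', '2019-01-03 10:00']),
)
before = data.copy()

# Iterating the groupby yields (day, sub-frame) pairs; days without any rows
# can show up as empty frames.
daily = [frame for _, frame in data.groupby(pd.Grouper(freq='D'))]

print([len(frame) for frame in daily])
print(data.equals(before))  # the grouping itself does not modify `data`
```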

delicate nymph
#

i know

#

😦

lyric canopy
#

But, the grouping is wrong ?

#

Or, what is actually going wrong?

delicate nymph
#

no its perfect it even creates empty frames from the days i don't have any info

#

it just adds double dates to some dates

lyric canopy
#

Right, but if you look to the indexes, some date_time values are repeated exactly. Aren't those the ones that get grouped together and "doubled"?

delicate nymph
#

yes that's them

#

the original doesn't contain them

#

i can't understand why they exist

lyric canopy
#

Are you sure they don't exist after you've created that series with:

data['date_time'] = data[['date','time']].astype(str).apply(''.join,1) 

?

delicate nymph
#

yes

#

i check the dataframe line by line

lyric canopy
#

And they show up right after data = data.set_index('date_time') before the loop?

delicate nymph
#

let me check it one more time i'll run the program line by line too

lyric canopy
#

You can always print out the first, say 30 rows, with DataFrame.head(n=30) at various points to track the changes throughout the script.

#

That way, you don't have to rely on running it line by line and checking the dataframe viewer in Spyder

delicate nymph
#

i did it

#

i check the extra lines

#

they don't exist before the groupby

lyric canopy
#

Okay, I don't know what's going on.

delicate nymph
#

and that is how i throw 1 month of work

#

thanks anyway

#

have a nice day

lyric canopy
#

There are people on this server who are much more fluent in Pandas than I am, maybe one of them will spot what's going on

delicate nymph
#

i would appreciate that ty

#

should i post it again or will they see it?

lyric canopy
#

Depends on if the question gets buried.

delicate nymph
#

so what do you recommend me to do?

delicate nymph
#

@lyric canopy i did it i found my mistake so you don't have to search. thank you for your time. it was a stupid mistake

lyric canopy
#

Great! Do you mind sharing the mistake so I can learn from it, too?

delicate nymph
#

sure, i didn't change a string to float [.astype(float)], so instead of calculating a number it was making a sequence, which made the program go nuts
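
A tiny illustration of that kind of mistake (the column names and values are invented): with string columns, `+` concatenates, and only after `.astype(float)` does it add numbers.

```python
import pandas as pd

raw = pd.DataFrame({'a': ['1.5', '2.0'], 'b': ['3', '4']})

# Columns of strings: + concatenates instead of adding.
as_text = raw['a'] + raw['b']

# After conversion, + does arithmetic.
as_number = raw['a'].astype(float) + raw['b'].astype(float)

print(raw.dtypes)
print(as_text.tolist())
print(as_number.tolist())
```

Checking `DataFrame.dtypes` right after loading the data is a cheap way to catch this early.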

#

so it seemed like correct

#

and i was so focused on what seemed to be wrong that i didn't pay attention to its origin

lyric canopy
#

Right, thanks!

teal veldt
#

Hey, would this be the correct channel for a question related to outliers in a dataset?

#

If not, is there a related discord server that you folks care to recommend?

lyric canopy
#

I think you can go ahead

#

What kind of model are we talking about?

teal veldt
#

I am terribly new at pandas and I thought I'd practice with some real life examples

#

The issue with that dataset (at least the Barcelona one) is that the price data has a lot of stuff that is clearly an outlier

#

Since we're talking about apartments with a listed price of 6000 euro/night

#

Now, is there a practical and scientific way, so to say, to determine where to cut the line for outliers?

#

I suppose I could arbitrarily say that whatever is out of 2 or 3 standard deviations is an outlier

#

But an expert opinion would be much appreciated

lyric canopy
#

It actually depends on what you want to do with the data, but just deleting outlying observations is usually a bad idea (although it happens way too often)

#

If there's no substantive reason to delete the data points, you could very well be deleting valid observations

teal veldt
#

I mean, just random analysis, like avg price per Neighbourhood, or something related to reviews

lyric canopy
#

Well, by just deleting those outliers, you're biasing your sample

teal veldt
#

But to do that I need to know what is actual data and what is clearly not significant

lyric canopy
#

Do you have any reason for why these values are not "actual data"?

teal veldt
#

I mean, I know they're wrong because if I access the relevant listing page on Airbnb I see that the price there is normal

#

So it's either badly parsed from Airbnb, or maybe it was set superhigh while the owner was creating the page

#

To avoid receiving bookings

lyric canopy
#

So, how do you know those errors are only happening for those high values in your list?

teal veldt
#

Feck

#

I don't

lyric canopy
#

Anyway, I assume the distributions are likely skewed anyway, so the average may not be the best measure to describe the central tendency in the data you have

teal veldt
#

So something like k-means clustering would be a better idea?

lyric canopy
#

Anyway, if you have reasons to assume some values are truly erroneous, then you're justified to delete them

#

There are plenty of alternatives. Something like the median is often seen as more robust than the mean, for instance

#

That's why it is sometimes used with wages
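
A quick sketch of the difference, with invented nightly prices that include a single 6000 euro listing:

```python
import numpy as np

# Hypothetical nightly prices in euro, with one 6000 euro listing.
prices = np.array([45, 60, 70, 80, 95, 110, 6000], dtype=float)

mean_price = prices.mean()        # pulled far upward by the single extreme value
median_price = np.median(prices)  # stays at a typical listing price

print(mean_price, median_price)
```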

teal veldt
#

Fair enough

#

I don't have any info on how was the data mined so I guess I'll make some assumptions to have a "clean" dataset just for the sake of practice

#

It won't really describe reality, but that's not the point right now

#

Thanks a lot for the help

#

Appreciate it

lyric canopy
#

I may be a bit touchy about it, because it happens way too much in academia. (deleting observations based on some arbitrary cut-off without a substantive reason)

teal veldt
#

Nah, it's a totally fair point and it was very revealing, so thanks for that, I'll just ignore it in this specific case since I'm the guy that's still googling boolean indexing

#

So it's not a matter of doing analysis that make sense for now, it's more like "how does pandas work?"

lyric canopy
#

Right, just play around with it, I'd say

#

If you're going to fit a model (e.g., linear regression with ordinary least squares), then you may also fit it twice: Once with and once without the unusual data points to see what kind of effects it has.
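
As a sketch of that comparison, with invented data and `numpy.polyfit` standing in for whatever model is actually used:

```python
import numpy as np

# Hypothetical data: y = 2x + 1, plus one unusual point at the end.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 * x + 1.0
y[-1] = 40.0  # the unusual observation

# Fit once with all points and once without the unusual one.
slope_all, intercept_all = np.polyfit(x, y, deg=1)
slope_trim, intercept_trim = np.polyfit(x[:-1], y[:-1], deg=1)

print(slope_all, intercept_all)
print(slope_trim, intercept_trim)
```

Reporting both fits side by side shows how much the conclusions depend on that one point, without silently dropping it.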

teal veldt
#

Right, I'll try that

#

Thanks again for taking the time to explain

late garnet
#

@teal veldt - There are many univariate statistical methods to find anomalies in a data set. It is tricky to pick the right algorithm without fully understanding the purpose as @lyric canopy points out. However, I can provide you with some algorithms that I implemented to get you some application into anomaly detection. In addition it would be a good idea to read up on each method.

lyric canopy
#

Right, but in my opinion, none of these methods should be used in an automated way. Detection is fairly easy, but the biggest question is the one that comes after that: Why is this an unusual observation? And how should I deal with it? Far too often, people just delete them from their dataset, probably out of ignorance, but that borders on unethical research practices.

late garnet
#

@lyric canopy I completely agree.

lyric canopy
#

Interesting code, though.

late garnet
#

Generally it is best to use domain specific upper and lower bound thresholds to throw out bad data.

#

Specifically when gathering sensor data
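
For example, a minimal sketch with a made-up temperature sensor (the limits and column names are assumptions): readings outside the physically plausible range are flagged first, and only then filtered.

```python
import pandas as pd

readings = pd.DataFrame({'temp_c': [21.5, 22.0, -999.0, 23.1, 150.0]})

# Hypothetical physical limits for this sensor; the values are assumptions.
LOWER, UPPER = -40.0, 85.0

# Flag rather than delete, so the rejected rows can still be inspected.
readings['valid'] = readings['temp_c'].between(LOWER, UPPER)
clean = readings[readings['valid']]

print(readings)
print(clean)
```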

lyric canopy
#

Probably, I don't have that much experience with that; I work in the field of social sciences/psychology/cognitive neuroscience for a methodology/statistics department of a university. Unusual observations there are less likely to be caused by sensor error, but much more likely to be valid observations. But, since a lot of researchers do know that outliers can be problematic (say influential cases in the GLM), but don't know how to deal with that, they start to throw out observations based on arbitrary rules of thumb, like +/- 3 SD, Bonferroni-corrected significant studentized residuals, arbitrary Cook's D cutoffs and what have you.

There are so many robust techniques and alternative approaches available today that just throwing out observations without a substantive reason for it triggers me.

late garnet
#

Interesting, I actually created those anomaly detection algorithms to find unusual patterns in human behavior. 😃

lyric canopy
#

That's a great use for it, because those unusual observations usually tell an interesting story.

#

(One of the other problems is that people sometimes only start deleting observations if their original run of a model did not provide a "significant" result, but won't do that if it did. In, for instance, a GLM [generalized linear model], an outlying observation can actually increase model fit if it has a high leverage value, but a low residual, i.e., if it's an outlier on the explanatory variables, but in line with the regression plane/model.)

haughty wharf
#

Hello,

I'm kinda new here and I have a certain predicament I'm in where I need a bit of guidance/advice. I want to be a Data engineer but I'm not sure about how to do that. I'm currently a Market Science Analyst and after being here for a few months I feel as if the job isn't for me.

lean ledge
#

@haughty wharf data engineer is a DevOps like job. Learn SQL, docker, Hadoop, (AWS/GCP/Azure), ETL, REST APIs, Spark/Hive + some understanding of machine learning (doesn't have to be in depth) and maybe stuff like tableau. Your job is to be able to make a data pipeline for data scientists and for deployment

amber kestrel
#

is there a way to name a plot in matplotlib so that plt.show() will only make something happen if an argument is passed into it

#

also, is there a way to automatically save generated images of that named plot, and replace old versions with the new one?

late garnet
#

@amber kestrel Maybe you just need to add some conditionals before rendering the plot?

olive trench
#

@amber kestrel I don't understand the first question, but the second one is plt.savefig('plot.png'). Use it before plt.show()

#

It'll save to your working directory, or you can pass a path in the filename to save it elsewhere
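
A sketch of how both parts could fit together, reading the first question as "only show the window when asked to". The function name `make_plot`, the figure name and the `show` flag are made up; the `Agg` backend line is only there so the snippet runs without a display and should be dropped for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

def make_plot(filename="plot.png", show=False):
    fig = plt.figure("my_plot")   # named figure: the same one is reused on each call
    fig.clf()                     # clear whatever was drawn last time
    ax = fig.add_subplot(111)
    ax.plot([0, 1, 2], [0, 1, 4])
    fig.savefig(filename)         # overwrites any existing file with the same name
    if show:                      # only open a window when explicitly requested
        plt.show()
    return fig

fig = make_plot()
```

Because `savefig` writes to the same filename every time, the old image is replaced by the new one on each run.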

tardy portal
#

this is the question
Implement a perceptron for logistic regression. For your training data, generate
2000 training instances in two sets of random data points (1000 in each) from multi-variate normal
distribution with
µ1 = [1, 0], µ2 = [0, 1.5], Σ1 = [[1, 0.75], [0.75, 1]], Σ2 = [[1, 0.75], [0.75, 1]]   (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class,
i.e., 1000 in total

def generate_data(mean,cov,size):
    return (np.random.multivariate_normal(mean,cov,size))

train_data_1 = generate_data([1,0],[[1,0.75],[0.75,1]],1000)
train_data_2 = generate_data([0,1.5],[[1,0.75],[0.75,1]],1000)

test_data_1 = generate_data([1,0],[[1,0.75],[0.75,1]],500)
test_data_2 = generate_data([0,1.5],[[1,0.75],[0.75,1]],500)

so i dont know how to label it

lyric canopy
#

I'm not sure what the question wants you to do

tardy portal
#

nvm i figured it out

lyric canopy
#

Okay

tardy portal
#

thanks

#

well the problem is this if you want to look at it

lyric canopy
#

You probably needed to generate a dependent variable with a binary coding

tardy portal
#

(Logistic regression, 40pts) Implement a perceptron for logistic regression. For your training data, generate
2000 training instances in two sets of random data points (1000 in each) from multi-variate normal
distribution with
µ1 = [1, 0], µ2 = [0, 1.5], Σ1 = [[1, 0.75], [0.75, 1]], Σ2 = [[1, 0.75], [0.75, 1]]   (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class,
i.e., 1000 in total. Use sigmoid function for your activation function and cross entropy for your objective
function. You will implement a logistic regression for the following questions. Initialize the starting weight
as w = [1, 1, 1]. During training, stop your loop when the objective function (i.e., cross entropy) does not
decrease any more (below certain threshold) or when the gradient is close to 0 or the iteration reaches 10000.
Set your thresholds properly so that the iteration doesn't reach 10000 for all the learning rate that you will
be using.

  1. Perform batch training using gradient descent. Divide the derivative with the total number of training
    dataset as you go through iteration (it is very likely that you will get NaN if you don't do this.).
    Set your learning rate to be η = {1, 0.1, 0.01}. How many iterations did you go through the training
    dataset? What is the accuracy that you have? What are the edge weights that were learned?
#

and i am really new to this

#

so i lack knowledge

lyric canopy
#

I'm not too big on ML; I've only done logistic regression in the context of the generalized linear model.

#

So, I can't really help you with it

mighty flame
#

guys

#

i need help

#

anyone knows about tensorflow and keras?

#

I need to create a dataset with words and I don't understand anything

unborn cave
#

I don't think you really need Keras or TF to create a dataset. That would be to process it.

#

Firstly, where are you trying to source the data from? Scrape? API?

#

If all you need to do is output to CSV or DB, you're better off using Scrapy and Pandas

cerulean magnet
#

@tardy portal Hey, what part of the problem do you need help with

spare karma
lean ledge
#

I'd suggest an SVM with an appropriate kernel instead of a neural network there 👀

#

Having to use NNs is weird when there's other stuff

#

Looks okay to me but I don't use keras much

cerulean magnet
#

@lean ledge Hey man can I ask you a DS question in regards to normalizing/scaling and matplotlib?

lean ledge
#

Sure, just ask your questions here

spare karma
#

@lean ledge ty

cerulean magnet
#

I typed it in help 5 if you dont mind taking a look

#

was few minutes ago

#

ohnvm someone is looking at it

#

Appreciate it though

tardy portal
#

Hi can someone help me write a code regarding online training using gradient descent without using scikit and sklearn

#

I was able to write a code for batch training

lunar oyster
#

ask your question

tardy portal
#

(Logistic regression, 40pts) Implement a perceptron for logistic regression. For your training data, generate 2000 training instances in two sets of random data points (1000 in each) from multi-variate normal distribution with
µ1 = [1, 0], µ2 = [0, 1.5], Σ1 = [[1, 0.75], [0.75, 1]], Σ2 = [[1, 0.75], [0.75, 1]]   (1)
and label them 0 and 1. Generate testing data in the same manner but include 500 instances for each class, i.e., 1000 in total. Use sigmoid function for your activation function and cross entropy for your objective function. You will implement a logistic regression for the following questions. Initialize the starting weight as w = [1,1,1]. During training, stop your loop when the objective function (i.e., cross entropy) does not decrease any more (below certain threshold) or when the gradient is close to 0 or the iteration reaches 10000. Set your thresholds properly so that the iteration doesn't reach 10000 for all the learning rate that you will be using.

  1. Perform batch training using gradient descent. Divide the derivative with the total number of training dataset as you go through iteration (it is very likely that you will get NaN if you don't do this.). Set your learning rate to be η = {1, 0.1, 0.01}. How many iterations did you go through the training dataset? What is the accuracy that you have? What are the edge weights that were learned?
  2. Perform online training using gradient descent. Set your learning rate to be η = {1,0.1,0.01}. Set your maximum number of iterations to 10000. How many iterations did you go through your training dataset? What is the accuracy that you have? What are the edge weights that were learned? Compare the learned parameters and accuracy to the ones that you got from batch training. Are they the same? Explain in your report.
#

Do you see the second question

#

I did make the first one, but i don't have the mental capabilities to write the second

#
# This program contains both batch training and AUC
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def create_data(mean1: list, mean2: list, sigma1: list, sigma2: list)-> tuple:
    train1 = np.random.multivariate_normal(mean1, sigma1, 1000)
    test1 = np.random.multivariate_normal(mean1, sigma1, 500)

    train2 = np.random.multivariate_normal(mean2, sigma2, 1000)
    test2 = np.random.multivariate_normal(mean2, sigma2, 500)

    
    train_label1 = np.zeros(1000)
    test_label1 = np.zeros(500)

    train_label2 = np.ones(1000)
    test_label2 = np.ones(500)

    training_data = np.vstack([train1, train2])
    test_data = np.vstack([test1, test2])

    training_label = np.concatenate((train_label1, train_label2))
    test_label = np.concatenate((test_label1, test_label2))

    return training_data, training_label, test_data, test_label


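def logistic_regression(data: np.ndarray, labels: np.ndarray, max_iter: int, lr: float) -> np.ndarray:
    weights = np.ones(2)
    for i in range(max_iter):
        predictions = sigmoid(np.dot(data, weights))
        error = labels - predictions
        # Average the gradient over the training set, as the assignment asks;
        # the raw sum can blow up at larger learning rates.
        gradient = np.dot(data.T, error) / len(labels)
        weights += lr * gradient
    return weights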


def calc_metrics(test_data, test_label, weights):
    confusion_matrix = [{'correct': 0, 'incorrect': 0}, {'correct': 0, 'incorrect': 0}]
    acc = 0
    predictions = []
    for i in range(len(test_data)):
        prediction = round(sigmoid(np.dot(test_data[i], weights)))
        predictions.append(prediction)
        if test_label[i] == prediction:
            confusion_matrix[int(test_label[i])]['correct'] += 1
            acc += 1
        else:
            confusion_matrix[int(test_label[i])]['incorrect'] += 1
    return acc, confusion_matrix, predictions

#

def main():
    mean1 = [1, 0]
    mean2 = [0, 1.5]
    cov1 = [[1, 0.75], [0.75, 1]]
    cov2 = [[1, 0.75], [0.75, 1]]
    steps = 10000
    print ("Choose from 1,0.1,0.01")
    lr = float(input("enter learning rate"))
    training_data, training_label, test_data, test_label = create_data(mean1, mean2, cov1, cov2)
    weights = logistic_regression(training_data, training_label, steps, lr)
    acc, confusion_matrix, predictions = calc_metrics(test_data, test_label, weights)
    fpr, tpr, th = metrics.roc_curve(test_label, predictions)
    auc = metrics.roc_auc_score(test_label, predictions)
    print('Accuracy: ' + repr(acc/len(test_data)*100) + '%')


main()


#

this is what i have written till now for batch which is 1

#

now i need help for second
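
A possible starting point for part 2, offered as a sketch rather than the expected solution. Online training here means updating the weights after every single sample instead of once per pass. The function name `online_logistic_regression`, the stopping threshold `tol`, the random sample order and the seeds are all assumptions; unlike the batch code above, it adds a bias column so that `w` has three entries as in the assignment, and the demo caps the epochs at 200 just to keep it quick.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, y, w, eps=1e-12):
    """Mean cross entropy of the predictions; probabilities are clipped to avoid log(0)."""
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def online_logistic_regression(X, y, lr, max_epochs=10000, tol=1e-5, seed=0):
    """Online (per-sample) gradient descent; returns weights and epochs used."""
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([np.ones(len(X)), X])  # bias column, so w has 3 entries
    w = np.ones(Xb.shape[1])                    # w = [1, 1, 1] as in the assignment
    prev_loss = cross_entropy(Xb, y, w)
    for epoch in range(1, max_epochs + 1):
        for i in rng.permutation(len(y)):       # visit samples in random order
            error = y[i] - sigmoid(Xb[i] @ w)
            w += lr * error * Xb[i]             # update after every single sample
        loss = cross_entropy(Xb, y, w)
        if prev_loss - loss < tol:              # loss no longer decreasing (tol is assumed)
            break
        prev_loss = loss
    return w, epoch

# Training data generated as described in the assignment.
rng = np.random.default_rng(1)
cov = [[1, 0.75], [0.75, 1]]
X = np.vstack([rng.multivariate_normal([1, 0], cov, 1000),
               rng.multivariate_normal([0, 1.5], cov, 1000)])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

w, epochs = online_logistic_regression(X, y, lr=0.01, max_epochs=200)
Xb = np.column_stack([np.ones(len(X)), X])
accuracy = float(np.mean((sigmoid(Xb @ w) >= 0.5) == y))
print(w, epochs, accuracy)
```

Because each update uses one noisy sample, the learned weights will usually wander around the batch solution rather than match it exactly, which is probably what the "are they the same?" part of the question is getting at.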

chilly shuttle
#

why do you invoke repr directly?

olive trench
#

Hey guys, what is the most efficient way to create pandas dataframes where you have to dynamically create it row by row?

What I try to do is create a dictionary with its keys as column names and the items are lists with the column values. I append new values to those lists and then call pd.DataFrame on the dictionary at the end. It's is a lot faster than using pd.append() especially for big amounts of rows, but I am thinking if there is a more efficient way to do this.

Anyone have any ideas/workflows that work out for them?

late garnet
#

@olive trench it might be easier if you specify your use case.

olive trench
#

I need to create a dataframe with certain columns . Then there is a for loop and each loop generates one row of the resulting dataframe

#

the df.append() function gets really slow if I generate each row as a df and append each time if the resulting dataframe is in thousands of rows

polar acorn
#

I do the same as you, either append to a dict or if I don't have that many columns I just keep track of a couple of lists and make the dict and df at the end.
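
Roughly like this, with a placeholder loop body (the column names and values are invented):

```python
import pandas as pd

# One plain Python list per column; appending to a list is cheap.
columns = {'step': [], 'square': []}

for i in range(5):
    # Hypothetical per-iteration result; replace with the real computation.
    columns['step'].append(i)
    columns['square'].append(i ** 2)

# Build the DataFrame once, after the loop.
df = pd.DataFrame(columns)
print(df)
```

Growing the frame inside the loop copies the existing data on every iteration, which is why collecting rows first and constructing the frame once tends to be much faster.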

olive trench
#

@polar acorn oh thanks, that's actually even better!

wanton silo
#

i am starting to get more into machine learning using python

#

anyone have any recommendations for videos?

lean ledge
#

@wanton silo Check pinned

#

Columbia's course is good

#

not python specific, it's language agnostic

wanton silo
#

o, thx

small ore
#

@wanton silo , check this out. Curated list of tutorials: https://medium.com/machine-learning-in-practice/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78

wary willow
#

How hard would it be to take one of the categories from the google quickdraw dataset (https://github.com/googlecreativelab/quickdraw-dataset), and then use machine learning to create new images that are similar to the images that are there? And what's a good resource to learn how to do something like that?

wary willow
#

(aka how can I learn basic image machine learning?)

reef bone
#

that's not an easy question to answer, generative models are generally quite complicated, so "basic" machine learning might not be sufficient to develop good solutions
the good thing is that we have packages like tensorflow, keras to help us, which makes the field much more accessible, but unless you're just taking code from the internet, some grasp on the fundamentals will still be necessary

#

so depending on how deep you want to go, we can recommend articles or books

#

the best resource to start with is generally google

#

also bear in mind that image recognition is not the same as image generation

#

the first would be a more approachable problem to beginners i think

turbid bay
#

hey i got a question. I'm trying to make a neural network that can detect whether a word is real english or just a bunch of letters bunched together. How would i make my input data? because the input layer has to have a fixed number of nodes (so i cant just use the number of letters), and idk if you can just input the whole word? my first idea was to calculate a value for each word. I went with A=1, B=2 etc., then treating the word as a base-26 number (using the letters A-Z as digits instead of 0-9) and converting it to base 10. So for the word "hello" the value would be (value of "h") 8*(26^character_index), with the index starting at zero, + (value of "e") 5*(26^1), etc. This creates extremely large numbers but would this work as a dataset?

#
```
        word_value += (ord(word[j])-64)*(26**j)
```
reef bone
#

that's an interesting approach to the problem, you're basically trying to create a hashing function and then check if the hash is valid? i'm not sure where the neural network comes into play

#

recurrent neural networks (especially lstms now) are generally the way to go for handling input of variable size, for natural language processing there are models that handle the input character by character, some handle it word by word, google has taken a compromise approach with "wordpieces" -> this is an interesting read https://arxiv.org/abs/1609.08144

#

but i'm still not sure why a neural network would be at all necessary for this problem

#

(not to discourage you)

turbid bay
#

well im relatively new to the concept of machine learning neural networks etc. so it may be absolutely pointless and not be correct but i thought it might work

reef bone
#

it's not pointless and your idea to hash the word is good

#

and it definitely would work to some extent, but a simple check against a dictionary would be a lot easier to implement and also probably perform better

#

well, it wouldn't perform worse

turbid bay
#

ok thank you. and yeah, checking it against a dictionary would probably be easier. im actually so stupid i didnt think about that 😂

#

well i've done so much work on setting up the data that im not gonna just restart now. im gonna see how effective a neural network would be. i guess this will be interesting

reef bone
#

the reason why it struck me as a little odd is that you will likely be using a dictionary as training data anyway, and in general terms we're trying to get the neural net to learn from the training dataset to then be able to predict unseen values outside of the dataset. in fact, learning the actual values in the dataset is not a good thing because that usually leads to poor performance on unseen data (overfitting), we want the network to extract features from the dataset that can then be used to identify and classify other data. in this case, you already have all the data available to you, so you would actually be trying to get the neural net to learn all the data. which is an odd approach because in that case you can just use the dataset itself to check if the element that you're trying to classify as either (valid) or (invalid) is within this data

#

but go for it, practice is good, i would be interested to know how your neural net will perform once finished

#

^ sorry that's not well written, hopefully it makes sense

turbid bay
#

nah im going to use a random subset, probably 40%, for testing. plus the dataset has both positive and negative examples, i.e. real and non-real words

#

i mean training

#

maybe more idk

#

i have around 500000 examples

#

and yes i understood what you have written

polar acorn
#

@turbid bay you can solve this with machine learning. If you're trying to learn machine learning it's a fun exercise, if it's a real problem you're having you should look to the previous answers. You could for instance train an LSTM. Let each letter be a vector of length 27, A = [1,0...,0], B = [0,1,0,...0] etc. Each word is then a sequence of letters or vectors and this is something we can use an LSTM for.

#

In addition your model can tell you if a random mess of letters looks sort of like a word or not which is in itself an amusing thing to test out.
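
#

A rough sketch of the one-hot letter encoding described above. The message does not say what the 27th slot is for, so using it for any character outside a-z is an assumption made here:

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
VECTOR_SIZE = 27  # 26 letters plus one catch-all slot (an assumption)

def one_hot_word(word):
    """Return an array of shape (number of characters, 27)."""
    word = word.lower()
    encoded = np.zeros((len(word), VECTOR_SIZE), dtype=np.float32)
    for position, char in enumerate(word):
        index = ALPHABET.find(char)  # -1 when char is not a-z
        if index == -1:
            index = VECTOR_SIZE - 1  # catch-all slot
        encoded[position, index] = 1.0
    return encoded

hello = one_hot_word("hello")  # 5 rows, one per letter
```

Each word becomes a sequence of such rows, which is the kind of variable-length input a recurrent layer can consume.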

turbid bay
#

whats an LSTM?

polar acorn
#

If that seems a bit complex you can also just try turning each word into a vector, mapping each letter into a number a -> 1, b -> 2 etc. And then zero pad on the right so that all the words are the same length. So 'Apple' would be [1, 16, 16, 12, 5, 0, 0, ...,0] with enough zeros so that the length is equal to longest word in your list. Then you could use just a normal neural network.
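
#

The integer-plus-padding encoding above could look roughly like this. It assumes the words contain only the letters a-z; the function name is made up:

```python
import numpy as np

def encode_words(words):
    """Map a->1 ... z->26 and right-pad with zeros to the longest word."""
    words = [w.lower() for w in words]
    max_len = max(len(w) for w in words)
    encoded = np.zeros((len(words), max_len), dtype=np.int64)
    for row, word in enumerate(words):
        for col, char in enumerate(word):
            encoded[row, col] = ord(char) - ord("a") + 1
    return encoded

X = encode_words(["Apple", "banana", "fig"])
```

Every row has the same length, so the array can be fed to a network with a fixed-size input layer.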

lean ledge
#

@turbid bay An LSTM (long short-term memory) network is a specific type of recurrent neural network (read, neural network that saves some inner state) that represents internal state and what is and isnt stored in it using i, f, o and g gates. They're good because they can remember stuff eg context so good with sequential data like videos, sentences, different parts of images etc. They're good because they allow easier gradient flow compared to vanilla RNNs while being rather flexible through their input/output/forget update mechanisms.

#

data does not at all have to be sequential, it's just a common example

#

you can see the difference in stuff like attention-based models for computer vision where the same data is looked at through different focal lenses etc.

#

They're also rather easy ways of making generative models based on your training data without going through other effort in a markovian chain manner

#

Pretty useful networks

thorn river
#

I have a pandas dataframe where I want to drop rows which have a total frequency in the dataframe below a certain threshold.

I.e. if a value in the column 'City' appears fewer than 5 times in total, I want to drop the corresponding rows (rows with a City count < 5 should be dropped).

However, if that city appears in a certain list of cities, I want to keep the row (even if that city has <5 entries).

I.e.

```
Counts = {"London": 4, "Birmingham": 3}
```

I already have the counts of the corresponding city in a single column. How do I drop rows with <5 total entries in the dataframe EXCEPT if they are in to_keep?
late garnet
#

@thorn river what does your dataframe look like?

#

It sounds like you need to check out the filtering section of pandas; specifically on isin and multiple criteria filtering.

#

Generally I filter to keep what I want and create a new copy of the dataframe instead of dropping. I'm not sure if that is more efficient.

#
query = (df['City'].isin(to_keep)) & (df['Count'] >= 5)
df = df[query].copy()
thorn river
#

I guess filtering works too. I would want to keep every row whose city count is >= 5, and rows whose city count is < 5 should be kept as well if the city is in to_keep.
if I understand correctly, your code snippet creates a dataframe with only those cities in to_keep?
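
#

For the requirement as stated (keep a row if its city is frequent enough, or if the city is in to_keep), the two conditions would be combined with `|` rather than `&`. A sketch with made-up data:

```python
import pandas as pd

# toy data: London appears 4 times, Leeds 2 times, Paris 6 times
df = pd.DataFrame({"City": ["London"] * 4 + ["Leeds"] * 2 + ["Paris"] * 6})

# per-row count of how often that row's city occurs in the frame
df["Count"] = df.groupby("City")["City"].transform("count")

to_keep = ["London"]  # hypothetical list of cities that are never dropped

# keep a row when its city is frequent enough OR explicitly listed
keep_mask = (df["Count"] >= 5) | df["City"].isin(to_keep)
filtered = df[keep_mask].copy()
```

Here Leeds is dropped (2 rows, not listed), while London survives despite having only 4 rows.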

#

I ended up dropping everything <5 since it was so little and it's probably not of that much importance, thanks anyway!

past grove
#

hi, i have a problem with my genetic algorithm. i am making a GA to solve a sudoku puzzle (not the best way of achieving the solution, i know - but it's enjoyable and is definitely achievable).

My problem is that it never gets to a solution, it gets stuck at a fitness of around 50 (i'm using the fitness function from here https://www.researchgate.net/publication/224180108_Solving_Sudoku_with_genetic_operations_that_preserve_building_blocks, except (f(x) - 162) * -1 in order for fully fit to be = 0).

i'm new to ga's so i just set my crossover function to take 2 parents, and then make child1 the first 3 rows of parent1, second 3 rows of parent2 and third 3 rows of parent1, and child2 to be the inverse. i also did crossover on every individual - not sure if that's normal?

and then mutation, i mutated every individual as well, maybe i shouldnt do that

of course i kept the known values of the puzzle constant throughout

just looking for advice really

lost patio
#

Are you stuck in a local optimum?

#

Maybe increase your mutation rate based on lack of genetic diversity

past grove
#

i think im mutating too much

#

my mutation rate is essentially 1 since i mutate everything (i think that's how it works), so perhaps i should change that

#

and i mutate quite aggressively (swap 2 values in every box of every individual in the population)
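
#

One way to make the mutation less aggressive is to apply the swap to each box only with some probability. This is a sketch under assumptions, not the operator from the linked paper; the function name, the flat 9-element box layout and the `rate` parameter are all made up for illustration:

```python
import random

def mutate_box(box, fixed, rate, rng=random):
    """With probability `rate`, swap two non-given cells of one 3x3 box.

    box   -- list of 9 digits (one sudoku box)
    fixed -- list of 9 booleans, True where the puzzle supplied the digit
    """
    free = [i for i in range(9) if not fixed[i]]
    if len(free) < 2 or rng.random() >= rate:
        return list(box)  # unchanged copy
    a, b = rng.sample(free, 2)
    mutated = list(box)
    mutated[a], mutated[b] = mutated[b], mutated[a]
    return mutated

example_box = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_fixed = [True, False, False, True, False, False, False, True, False]
child_box = mutate_box(example_box, example_fixed, rate=0.1)
```

With a low rate most boxes pass through unchanged, so good partial solutions have a chance to survive between generations.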

lost patio
#

Unfortunately I've only minimal knowledge on this subject and most of it is in NN

#

I'd work on refining your mutation strategy. Worst case scenario, you learn more about them.

past grove
#

yeah im reading An Introduction to Evolutionary Computing rn

late garnet
#

@past grove you might be interested in this repository

#

There is a talk on them as well

past grove
#

thanks i'll take a look @late garnet

rapid spear
#

Hello! I'm hoping for an idea/some advice on how to approach a python project
I'm scraping a webcam every 60 seconds (when it updates) of a feed looking at an exit to a business park
I'm trying to write some computer vision approach to detect when cars start queuing along this road at peak times
not necessarily an object detection problem or anything like that - just need to detect if there are cars or if there is just road (and cater for fading sunlight, weather etc)
I tried plotting the mean of the image to look for spikes when the cars started queuing but no success 😦 anybody know of any other approaches?
the aim is to have a graph of timestamps against when there is a traffic jam vs clear roads
if this is the wrong place to ask (an idea problem not really a code problem) then please redirect me elsewhere if you know of a better place
but this has to have been considered by someone before

eager ermine
#

Is there any way to write/read a JSON file to/from a website? I'm using Heroku to host it but I can't see the files on Heroku, so I wanna see/edit the data that it makes.

If so how would I do it?

earnest prawn
#

every time you push a new version of the software heroku would delete the modified files anyway

#

so there isnt exactly a point in doing that

#

if you want to store data use the free postgresql they provide

eager ermine
#

which is why I wanted to write to a different website

past grove
#

update: got solutions after 123 and 75 generations

#

still working on crossover and mutation improvements though

unborn cave
#

Can anybody lend a hand with a pandas related issue?

foggy minnow
#

Why in the hell are these scikit learn algorithms only outputting scores in multiples of .31

#

I mean it's only giving scores of .31, .62, .93. I don't understand what the hell's going on

unborn cave
#

I have this dataframe

How would I grab values for the key 'coordinates'
I would like to extract all values for coordinates in the col geometry and insert into another df
That same df will also be used to store data extracted from the properties col . Each col in this new dataframe will be a key inside either geometry or properties
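
#

The dataframe itself is not visible in the log, so this sketch assumes GeoJSON-like data where each cell of `geometry` and `properties` is a dict. The sample values are invented:

```python
import pandas as pd

# assumed layout: one dict per cell in both columns
features = pd.DataFrame({
    "geometry": [
        {"type": "Point", "coordinates": [-0.12, 51.50]},
        {"type": "Point", "coordinates": [2.35, 48.85]},
    ],
    "properties": [
        {"name": "London", "population": 8900000},
        {"name": "Paris", "population": 2100000},
    ],
})

# expand each dict column so that every key becomes its own column
geometry_cols = pd.DataFrame(features["geometry"].tolist(), index=features.index)
property_cols = pd.DataFrame(features["properties"].tolist(), index=features.index)
flat = pd.concat([geometry_cols, property_cols], axis=1)

coordinates = flat["coordinates"]  # just the values under the 'coordinates' key
```

If the cells are JSON strings rather than dicts, they would need to be parsed first (for example with json.loads).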

turbid bay
#

hey im using tensorflow with keras. I want an input layer which has one node that takes on an integer value. How do i write the input layer? i tried inp = tf.keras.layers.Input(1) but that doesnt work

turbid bay
#

my input data is singular integer values between a 1 digit number and a 39 digit number (this is a word which i have turned into an integer). And my output values should be a number between 1 and 0. The neural network should decide whether it is a real word (1) or not (0).

#

```
inp = Input(shape=(1,), dtype='float')

layer_1 = Dense(128, activation="relu")(inp)
layer_2 = Dense(128, activation="relu")(layer_1)
pred = Dense(1, activation="softmax")(layer_2)

model = Model(inp, pred)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, Y_train)
```
reef bone
#

I would recommend using the vector approach suggested above to encode your input, and more than one input neuron

#

You should have 2 output nodes since you're doing classification with 2 classes, the softmax activation makes sense, but applying it to a single neuron doesn't
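
#

To see why softmax on a single neuron doesn't make sense: softmax normalises across the output units, so with only one unit the result is always 1 regardless of the input. A small NumPy illustration (the helper functions are written out here and are not Keras calls):

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# three samples, one output unit each
single_logits = np.array([[-3.0], [0.0], [5.0]])

softmax_out = softmax(single_logits)  # every row is 1.0
sigmoid_out = sigmoid(single_logits)  # varies with the input
```

The usual choices for two classes are therefore two softmax units, or one sigmoid unit trained with binary cross-entropy.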

#

If your code runs then it's probably fine but generally speaking you would do something like

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense()) # first layer, specify input_shape
model.add() # your second layer
...
model.add() # your final layer
#

And uh, you should have validation and testing data

#

And pass validation data to the model.fit() method alongside your training data

#

So that you can see how it performs on unseen data

#

And then evaluate on testing data when you're done

#

Or when you think you're done

#

And maybe LeakyReLU() for your activation on the hidden layers, but make sure you pass a lower alpha value, keras defaults to 0.3 now which is ridiculously high in my opinion

karmic axle
desert oar
#

@karmic axle depends on what you wanna do, but its worth taking a basic machine learning or data analysis course

#

you're gonna need probability and stats pretty much no matter what

cerulean magnet
#

Anyone on here might be able to help me with trying to create a specific graph using matplotlib and my dataframe? 😃

desert oar
#

what graph? @cerulean magnet

small ore
#

@desert oar His question got buried in #help-chestnut just before my question coz no one answered for an hr

karmic axle
#

@desert oar okay.. i dont know what i want to do with NLP yet 😅 .. maybe will add some nlp to my discord bot.

spare karma
#

Any recommendations/advice is welcomed and appreciated.

serene oar
#

Hey, why do I get this error when trying to plot different columns in a csv by defining their names?

TypeError: 'DataFrame' object is not callable
#

This doesn't happen when I just call the whole dataframe
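
#

The code that raised the error isn't shown, so this is only a guess at the cause: the message typically appears when a dataframe is indexed with parentheses instead of square brackets. A small reproduction with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 1, 2], "temp": [20.5, 21.0, 21.4]})

try:
    df("temp")  # parentheses try to *call* the DataFrame
    error_message = None
except TypeError as exc:
    error_message = str(exc)

temp = df["temp"]              # square brackets select one column
subset = df[["time", "temp"]]  # a list of names selects several columns
```

Another possible cause is a variable named like a function (or the other way round) being overwritten earlier in the script.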

tall shuttle
#

@spare karma yes?

#

is it just an import error

spare karma
#

@tall shuttle you call the package by? import tensorflow for both the cpu and gpu versions? (pending what's installed)

tall shuttle
#

yes

spare karma
#

Might be tmi, but on the left is the install, on the right is the call

tall shuttle
#

I can't read that

spare karma
#

rip

tall shuttle
#

just give them separately

spare karma
#

kk

#

the install:

#

the import call in python:

tall shuttle
#

oh

#

you installed right

#

that is a postinstall error

spare karma
#

related to python version?

#

Rather, with respect to you and your time, is this a common issue I can find a solution for online?

tall shuttle
#

install cuda

spare karma
#

I'll re-install and share a screenie.

tall shuttle
#

install cuda for your gpu

spare karma
#

this correct?

#

(re-installing now)

tall shuttle
#

yes

#

then restart

spare karma
#

kk

#

one second, sorry.

#

(when installing cuda, I'm just following the prompts and clicking 'next')

reef bone
#

do you have cuDNN

spare karma
#

yep. followed the tutorial's instructions of drag/dropping (ill share the time, one sec)

#

Wait. would it matter if Python was installed on a separate drive? (i have all python-related-ness on one drive..)

#

(nvidia's gpu computing toolkit, and corresponding cuDNN files are on a diff drive)

reef bone
#

I think it should be fine as long as your path variables are set correctly, I remember I had to tamper with mine

#

But it's been a while since I installed mine

#

Sorry that was more of a shot in the dark

spare karma
#

ah gotcha

reef bone
#

I followed this, at the bottom it mentions the variables you need to have set so that tensorflow can find cudnn

spare karma
#

ah, gotcha. thank you!

spare karma
young aurora
#

So I've got this code outputting this visual:

```
plt.rcParams['figure.figsize']=(8,10)
fig, ax = plt.subplots()
sns.countplot(x="Q18", palette="Blues", data=t10)
ax.set_ylabel('Number of Respondents', size='15')
ax.set_xlabel('')
ax.set_title('"What percentage of the team\nmeeting do you speak?"', size='20')
plt.suptitle('Top 10%', y=1.01, fontsize=30)
#code for Question 18
```
#

How in gods name do I get this to sort in logical order? Do I need to setup the pandas dataframe to sort them by the order (e.g. 0%,10%,20%....etc) prior to plotting it with seaborn?

#

Sorry about the transparency. The problem is that it's randomly ordering (or to me at least, randomly) the bars

upper lily
#

Not familiar with pandas but

#

data=t10 what is t10?

#

@young aurora

chilly shuttle
#

that's not pandas, that's seaborn

#

but t10 would be a pandas dataframe

#

also you're better off using plt.figure(figsize=...) than rcParams

late garnet
#

@young aurora you can specify the order in seaborn's countplot. For example, if you wanted to specify labels that are string percentages:

ordering = ['10%', '20%', '30%']
sns.countplot(x='value', data=df, order=ordering)

# you can also generate the ordering
# this gives you 0% to 100% in logical order
ordering = list(map(lambda s: str(s) + '%', range(0, 110, 10)))
gritty hawk
#

pandas: is there a way to get the sql queries it will run if I do a to_sql()?

chilly shuttle
#

it doesn't run the sql in one shot, so not really

#

afaik the only way to do it would be to provide your own dummy connection and intercept the bulk_save_objects or whatever other calls pandas makes to serialise

desert oar
#

its a bit weird that you cant generate the intermediate sql actually

#

ah right it uses sqlalchemy on the backend

chilly shuttle
#

it's not that weird, bulk loads are almost never a plaintext sql query

gusty onyx
#

Can someone explain to me how data mining works? Like what's the process?

proud raven
#

That's a really broad topic. All it boils down to is getting a big heap of data and applying things like stats or machine learning to it to get new information. e.g. If I have a spreadsheet of raw basketball data (player, shot time, distance from net, success-or-failure, away-or-home-game) I could calculate a simple stat. like 3-pointer percentage by player, I could try to see if home games correlate to a higher accuracy, I could render a heatmap that shows how effective each player is based on their distance from net. Etc.

lean ledge
#

Generally you have so much more data than that. You do your best to use dimensionality reduction techniques (eg PCA), use data summary reports to find covariance between data (plot all your data on a big covariance diagram, see which bits look the hottest), use some domain specific knowledge to narrow down what you should be looking into, etc

small ore
#

"dimensionality reduction technique" . I stumble upon new words for subset selection

desert oar
#

its not the same as subset selection

#

its more general

#

e.g. pca or multidimensional scaling.. you aren't choosing subsets of features, you are creating new features with the goal of reducing the number of features required to convey some amount of information
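
#

A small NumPy sketch of that distinction, using synthetic data. PCA is done here via SVD of the centred data, and the data generation is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
# three observed features that are all noisy copies of one hidden factor
X = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))

# PCA via SVD of the centred data
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained = S**2 / np.sum(S**2)  # fraction of variance per component

first_component = X_centered @ Vt[0]  # new feature: a weighted mix of all 3 columns
subset = X[:, 0]                      # subset selection: keep one column as-is
```

The first component is a combination of all three original columns, which is what makes it a constructed feature rather than a selected one.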

lean ledge
#

^^^

polar acorn
#

How come the "google it" answer to the data mining question was removed? For broad unspecific questions I find that to be a good answer. Vague non specific questions deserve vague non specific answers.

#

A better question would be, what is your favourite introductory text to data mining? Or does anyone have a simple and well explained example with code that uses some of the most common data mining techniques?

#

And in both cases the answer should hopefully be "look in the pinned messages" 😃

simple crag
#

Because, whether intentional or not, it's a hostile and condescending thing to say to someone who is asking a question. As is "vague non specific questions deserve vague non specific answers" (emphasis mine)

#

Ask for more information, don't just dismiss them

polar acorn
#

Vague non specific questions do deserve vague non specific answers, though you're probably right the people who ask them deserve a chance to better frame what they're after.

simple crag
proud raven
#

We've all asked vague non-specific questions, none of us popped out of the womb asking where we can find a good tutorial on the time complexity of a merge sort. If you have an answer in the form of what a better question would look like, why not make that your response? "Hey friend, that's a broad topic, here are a couple of resources I find handy and you might try these keywords to narrow your focus." You're still encouraging them to do their own work but you're providing a starting point. Or, you know, just don't respond.

polar acorn
#

Yeah no I didn't tell him to google it. And tried to come up with two other questions to ask instead. But I thought it was strange that the google answer was removed (if that was the case, he might have deleted it himself for all I know). I didn't know that was the policy.

simple crag
#

He did delete it himself

#

After I said it wasn't helpful, because it wasn't

polar acorn
#

Sure, fair enough then. I thought for some reason it was removed.

small ore
#

This is also a discussion cum help channel. Need not be as strict as a help channel when it comes to topics. A question, however vague may be taken as a good beginning for a discussion on that topic. However I believe "google it" is not a bad answer if put in a polite way

#

Oh. Didnt realize I am one hr late

lucid hornet
#

Not really your call to make whether a channel is more or less strict.

#

Or should be, rather

dim osprey
#

How can I build a network to reason about depth/scale from objects that I already have hard sizes for?
Like, say I have a picture of someone holding a Comcast Remote:

small ore
#

True. Well, let us say, that is my perception

dim osprey
#

All of those remotes are the same size. How can I use it to measure the scale of the rest of the image?

lucid hornet
#

Also if google it is all that is said about the topic, then no, it's not helpful

#

@dim osprey I feel like you'd have to have some sort of reference to gauge size

dim osprey
#

The remote is the reference. I want to measure his hands.

simple crag
#

Generically you'd have to segment the image to find the "remote-shaped" blobs and get the one that's most likely the remote

#

Once the image is segmented you'll have the coordinates relative to the image and can scale that based on the reference measurements you have

dim osprey
#

I can't train a model to recognize certain objects that all have the same size?
Like, there's been a lot of work on street sign detection because of the run-up to self-driving cars. All stop signs are the same size.

simple crag
#

You can, but at some point you have to start with segmenting the images so you can train the classifier

dim osprey
#

If I know that the stop sign is x inches, and y pixels, and the pole is y2 pixels, then it must be x2 inches, right?

simple crag
#

Yes

#

If the remote blob is 100 pixels and your remote is 10 inches, it's 10 pixels/inch

#

However, that's not going to apply for things in the background

dim osprey
simple crag
#

But if it's more or less in the same plane as the remote then you'll be ok
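
#

The scale arithmetic above, written out. The function names and the 75-pixel hand measurement are made up, and the result only holds for objects at roughly the same distance from the camera as the reference:

```python
def pixels_per_inch(reference_pixels, reference_inches):
    """Image scale derived from an object of known physical size."""
    return reference_pixels / reference_inches

def to_inches(length_pixels, scale):
    """Convert a pixel length to inches using that scale."""
    return length_pixels / scale

scale = pixels_per_inch(100, 10)    # remote: 100 px long and 10 in long
hand_inches = to_inches(75, scale)  # hypothetical hand spanning 75 px
```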

#

I'm sure there are corrections since this is a fairly well developed field but it's not one I'm particularly familiar with

dim osprey
#

Aha, but there are already neural networks that reason about scale and depth. A lot, actually.

#

They do pretty good, depending on how much training data you have.

#

My idea was to add the context from objects that have fixed sizes: Phones, Car tires

#

Look at this:

#

This is perfect: A CSX engine is always the same size, and so is that tanker car. I can take those hard cues and use them to measure the entire image.

desert cradle
#

doors are usually the same height too

dim osprey
#

But I'm interested in outdoor scenes.

polar acorn
#

Has anybody here built tensorflow from source to optimise for CPU? I won't have access to a GPU for some time and I'm considering if the speed up is worth it.

desert oar
#

why build from source?

#

you mean so you can use -march=native -mtune=native -O3 or something?

proud raven
#

Building TF from scratch theoretically allows you to optimize for the target system's CPU (according to TF). Though I've only ever seen this in the context of Intel-based CPUs and even then it's only applicable to CNNs. To answer pptt, yes I've done a CPU optimized build. Results were a mixed bag.

#

Dunno about being "worth it" or not. The effort in compiling an optimized build is pretty minimal, if you have an i5 or i7 give it a shot. It can't hurt.

desert oar
#

does it use a blas or does it implement its own linear algebra

#

like would MKL vs OpenBLAS make a difference

proud raven
#

This got me curious so I went back and ran a simple ConvNet against Python 3.6, Tensorflow 1.7 (MKL, SSE, AVX, and FMA enabled), Tensorflow 1.7 (Just SSE, AVX, and FMA enabled), Tensorflow 1.7 (Standard, No optimization install). With just SSE, AVX, and FMA optimizations the average time between steps decreased by 15% from the no frills install. With MKL enabled the time between steps increased by 260%. I suspect that weirdness with MKL is a broken build on my part. I need to recompile and also potentially shift to Tensorflow 1.12.

desert oar
#

very cool

#

thanks for trying that

#

to be fair mkl isn't always faster. i noticed improvement on basic operations, matrix multiply and especially SVD

#

but apparently its not always faster even on intel hardware

proud raven
#

Indeed. Though I'll probably be up until 3:00a.m. validating MKL is working as expected.

desert oar
#

oof

#

conda is your friend here

lean ledge
#

oh relevant, literally compiling TF from source right now

#

been waiting an hour and a half and no end in sight, kms

#

would not recommend for marginal benefits in speed

#

im only doing it because I want to work with TF on 18.04 which only supports CUDA 10 which isnt supported by stable TF releases as of yet

polar acorn
#

I ended up not doing it. Looks like it might have cost me more time than it would have saved me anyway.

lean ledge
#

my compilation took me 3-4 hours, so unless it's saving you that much time, probably not worth it :p

polar acorn
#

The job took me 6 hours, so unless it halves training time then probably not

sage carbon
#

hey can i ask a Q here ? O.o

#

O_O

proud raven
#

If it's data science related, yes. If it's more of a general Python issue, feel free to use a help channel.

lyric canopy
#

In general:

#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

sage carbon
#

I asked it in a help channel but they redirected me here .. So this is my Q.

I need to get a user input and determine his emotion from the answer. For example if user inputs 'I'm sad', he is sad. (:P obviously) .. But this can be more complex than that example (as i think). I need your ideas about how to achieve this. I can make a word list and match them but is it a good idea. Or do i need to involve something like TensorFlow (which is completely new to me) . ? confused
:/

#

😦

proud raven
#

The broad term you're looking for is "sentiment analysis". Typically the first place to start, regardless of how you'll use this data, is to get a set of labelled data. Associating words or sentences with a feeling. Individual words are hard because they can have a double meaning: "I'm so happy I could cry" -> Is this person sad or happy?

#

How you then use those labels is sort of up to you. You could, as you say, just match words to sentences and assign a sentiment like happy or sad. For a small school project that might be fine. If you want a program that can more accurately predict sentiment when the user input contains multiple words from your list, as in my example, you would have to go to a deep learning approach in Tensorflow or PyTorch. If you've never tried deep learning before I would recommend you look at Keras or PyTorch first. Tensorflow is daunting for newcomers.
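
#

A minimal sketch of the word-list approach, mainly to show where it breaks down. The lexicon is tiny and invented, and the scoring rule (sum of word scores) is just one simple choice:

```python
import re

# hypothetical, tiny lexicon; a real one would be far larger
LEXICON = {"happy": 1, "glad": 1, "great": 1, "sad": -1, "cry": -1, "angry": -1}

def score_sentiment(text):
    """Sum the scores of all known words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def label(text):
    total = score_sentiment(text)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "unclear"

simple = label("I'm sad")                    # one matching word
tricky = label("I'm so happy I could cry")   # two words that cancel out
```

The second sentence comes out as "unclear" because "happy" and "cry" cancel, which is the kind of case where a learned model on labelled sentences tends to do better.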

spark nimbus
#

Using tensorflow, how would I return a "score" value to the model for judging itself?

spark nimbus
#
Traceback (most recent call last):
  File "X:\Python\lib\site-packages\rlbot\botmanager\bot_manager.py", line 187, in run
    self.call_agent(agent, self.agent_class_wrapper.get_loaded_class())
  File "X:\Python\lib\site-packages\rlbot\botmanager\bot_manager_struct.py", line 36, in call_agent
    controller_input = agent.get_output(self.game_tick_packet)
  File "X:\Downloads\bot\rlai\tflayers.py", line 85, in get_output
    out = self.models[packet.num_cars].call(arr)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\sequential.py", line 229, in call
    return super(Sequential, self).call(inputs, training=training, mask=mask)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\network.py", line 845, in call
    mask=masks)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\engine\network.py", line 1031, in _run_internal_graph
    output_tensors = layer.call(computed_tensor, **kwargs)
  File "X:\Python\lib\site-packages\tensorflow\python\keras\layers\core.py", line 970, in call
    outputs = gen_math_ops.mat_mul(inputs, self.kernel)
  File "X:\Python\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4856, in mat_mul
    name=name)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "X:\Python\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 1792, in __init__
    control_input_ops)
  File "X:\Python\lib\site-packages\tensorflow\python\framework\ops.py", line 1631, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 193 and 48 for 'dense_734/MatMul' (op: 'MatMul') with input shapes: [193], [48,48].
#
arr = np.array(total)  # len(total) = 193
self.models[packet.num_cars].call(arr)
sage carbon
#

Thank you very much @proud raven .. This seems more complex than i thought. Actually it's a school kind of project but i wanna do it the right way. I'll look into the things you told me. 😃 Thanks again !

lapis sequoia
#

may I ask a question here? can we ask questions?

silk acorn
#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

lapis sequoia
#

okay, I never used a ttest on a data set before. My boss wants me to compare two patients' features with t test, but wants me to do this for each feature

#

is this doable? is this a thing? the only library I found in python kinda uses lists as inputs and gives a single p value

#

also I'm looking for the ttest functions of the stats library in python and there are four different kind of t test, I wonder which one is more suitable for this task.. =/

#

I think I figured it out, thanks!

polar acorn
#

@lapis sequoia What did you end up doing?

lapis sequoia
#

I was doing the thing wrong, for each feature, I made a list that has the values of neighbors from knn
then for two hypothesis groups, I compared the lists of these features with each other

#

so for 15 features, I had two lists of 5 values each

#

and I had 15 t test results, which was what I needed
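
A minimal sketch of the per-feature comparison described above, assuming SciPy is available and that each group is a 5 x 15 array (5 neighbours, 15 features). The data is random and purely illustrative. `scipy.stats.ttest_ind` takes an `axis` argument, so one call gives one t statistic and one p-value per feature. Choosing `equal_var=False` (Welch's variant) is an assumption here, not necessarily what was used in the conversation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 5 neighbours (rows) x 15 features (columns) per group.
group_a = rng.normal(loc=0.0, scale=1.0, size=(5, 15))
group_b = rng.normal(loc=0.5, scale=1.0, size=(5, 15))

# axis=0 runs an independent two-sample t-test down each column,
# so the result holds 15 statistics and 15 p-values.
t_stats, p_values = stats.ttest_ind(group_a, group_b, axis=0, equal_var=False)

for i, (t, p) in enumerate(zip(t_stats, p_values)):
    print(f"feature {i:2d}: t = {t:+.3f}, p = {p:.3f}")
```

With 15 tests run at once, some correction for multiple comparisons is usually worth considering before reading individual p-values.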

polar acorn
#

Sounds fair. Was the purpose to test if the patients had a significant difference or if each feature had one?

lapis sequoia
#

oh sorry I saw your reply too late, was busy with sending the data to the right places.. the main goal was the former one

signal juniper
#

Matplotlib is being helpful and drawing lines directly between my data points for the green line. However I'd like it to stay horizontal until the value changes and then connect with a vertical line. Is there an option to do this or do I have to insert extra points to accomplish it?

hearty token
signal juniper
#

@hearty token Awesome, thank you so much!
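
The reply that was thanked here is not preserved in the log, so the following is only one plausible way to get the behaviour asked for: Matplotlib's `step` with `where="post"` holds each value until the next x and then jumps vertically. The data is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data: the value only changes at x = 2 and x = 4.
x = [0, 1, 2, 3, 4, 5]
y = [1, 1, 3, 3, 2, 2]

fig, ax = plt.subplots()
# where="post": stay horizontal after each point, then connect with a vertical line.
(line,) = ax.step(x, y, where="post", color="green")
# Use fig.savefig(...) or plt.show() to view the result.
```

The same effect is available on a regular `plot` call through `drawstyle="steps-post"`, so no extra points need to be inserted by hand.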

inland viper
#

So... how about that Tanenbaum?

lapis sequoia
#

I'm trying to count the number of entries of a same class using pandas

#

I have a label called CLASS and each entry is a type

#

So I need to count how many of each type I have there

gleaming wadi
#

hi , what is the most beautiful visualization tools/library/framework/app in python?

gritty hawk
#

@lapis sequoia have you tried value_counts()?

#

df.label.value_counts()

lapis sequoia
#

yeah that's it, cheers!
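
For reference, a small sketch of the `value_counts()` suggestion, assuming the column is literally named `CLASS` as in the question (the `label` in the reply looks like a placeholder). The rows are made up.

```python
import pandas as pd

# Made-up data with a CLASS column holding the type of each entry.
df = pd.DataFrame({"CLASS": ["a", "b", "a", "c", "a", "b"]})

# One row per distinct class, sorted by descending count.
counts = df["CLASS"].value_counts()
print(counts)
```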

gritty hawk
spare karma
#

Anyone have a good SQL discord (like this one)?

#

Or, lol, anyone know how to use the 'where' clause in queries? I'm trying to filter down to [fldDuration] = 02:00

#

varchar(10)

#

(removed phi, for obvious reasons)

desert oar
#

?

#

you need to connect them w/ logical AND and OR

#
WHERE
    (user.age < 30 OR user.age > 40) AND
    address.state = 'NY'
prisma comet
#

Hi

woven tundra
#

@gleaming wadi Dash is pretty good for building interactive visualizations as web apps. If this is just for a notebook however seaborn is good for non-interactive charts and plotly for interactive ones.

lean ledge
#

there's also bokeh

unkempt zinc
#

hi all- i am working on some NLP project and i have to read my data from PDF documents! i use tika parser to read the files and then proceed with text processing techniques. one thing i can't get my head around is basically how is it possible to get rid of headers and footers of a given document ! any direction would be really appreciated

spare karma
#

@desert oar that was it. 12/13 had no AND in between them.
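
A self-contained sketch of the fix, using SQLite through Python's `sqlite3` as a stand-in for whatever database was actually in use. Only `fldDuration` and its `varchar(10)` type come from the conversation; the table name `tblSessions` and the other columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; only fldDuration (varchar(10)) is from the conversation.
conn.execute(
    "CREATE TABLE tblSessions (fldId INTEGER, fldDuration VARCHAR(10), fldState VARCHAR(2))"
)
conn.executemany(
    "INSERT INTO tblSessions VALUES (?, ?, ?)",
    [(1, "02:00", "NY"), (2, "01:30", "NY"), (3, "02:00", "CA"), (4, "02:00", "NY")],
)

rows = conn.execute(
    """
    SELECT fldId
    FROM tblSessions
    WHERE fldDuration = ?   -- compared as a string, since the column is varchar
      AND fldState = ?      -- consecutive conditions need an explicit AND / OR
    ORDER BY fldId
    """,
    ("02:00", "NY"),
).fetchall()
print(rows)
```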

arctic moth
#

Hello, may I ask for advice with CVXOPT, if anyone has some experience?

#

I am using the quadratic programming solver for the MI-SVM algorithm, but if I get negative weights it gives me this error:
ValueError: Rank(A) < p or Rank([P; A; G]) < n
and even if i try to run it without constraints it still fails

#

this is full error output:

#
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/misc.py", line 1429, in factor
    lapack.potrf(F['S']) 
ArithmeticError: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 2011, in coneqp
    matrix(0.0, (0,1)), 'beta': [], 'v': [], 'r': [], 'rti': []})
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 1981, in kktsolver
    return factor(W, P)
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/misc.py", line 1444, in factor
    lapack.potrf(F['S']) 
ArithmeticError: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 21, in <module>
    main()
  File "main.py", line 19, in main
    model.learning()
  File "/home/frovis/Programovani/Python/Bakalarka/Model.py", line 70, in learning
    sol = solvers.qp(P,q)
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 4487, in qp
    return coneqp(P, q, G, h, None, A,  b, initvals, kktsolver = kktsolver, options = options)
  File "/usr/local/lib/python3.6/dist-packages/cvxopt/coneprog.py", line 2013, in coneqp
    raise ValueError("Rank(A) < p or Rank([P; A; G]) < n")
ValueError: Rank(A) < p or Rank([P; A; G]) < n
fierce saffron
#

hi everyone

#

I've got a simple question. Just looking for thoughts

#

I've got a scatter plot containing 3 classes. However, lots of overlapping.

#

I'm not sure how to best display this data

hearty token
#

@arctic moth I have no experience with that, but I think you'd have more luck if you posted some code.

desert oar
#

@arctic moth do you know what matrix rank is? A code snippet at least would be helpful but somehow youre ending up with an ill conditioned matrix

#

@fierce saffron if its too hard to visualize with 3 colors on the same plot, make 3 plots side by side w/ the same axis dimensions

fierce saffron
#

I hadn't thought of that but it's a good idea
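
A rough sketch of the side-by-side idea, assuming Matplotlib and NumPy and three made-up, overlapping point clouds. `sharex=True, sharey=True` keeps the axis ranges identical so the panels stay comparable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: three heavily overlapping classes of 2-D points.
classes = {
    "class 0": rng.normal(0.0, 1.0, size=(100, 2)),
    "class 1": rng.normal(0.5, 1.0, size=(100, 2)),
    "class 2": rng.normal(1.0, 1.0, size=(100, 2)),
}

# One panel per class, all sharing the same x and y limits.
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, (name, pts) in zip(axes, classes.items()):
    ax.scatter(pts[:, 0], pts[:, 1], s=10, alpha=0.5)
    ax.set_title(name)
fig.tight_layout()
```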

thorn river
#

I have a dict with the following structure:

'Username' with a key 'label' and a key 'text'. So if i would want to access the text of a username i would do dic["username"]["text"].

I want to tokenize the strings in the [text] part. My goal is to supply the tokenized text to a classifier so I can train it and make predictions. How do I do this using spacy?

I have looked at the spacy docs and tutorials but dont know if their pipe operations replaces the text in the dict with the tokenized text

desert oar
#

No it just returns a tokenized version

#

Theres no special handling for dicts... list of strings goes in, list of processed docs comes out

#

Its not like C where you pass a preallocated result container and the function fills that container...

thorn river
#

Alright thanks. Then how should I provide the tokenized strings to a model?

desert oar
#

i recommend finding some text classification 101... but the basic method is "bag of words"

#

one row per document, one column per word, word frequency in matrix cell

#
'the quick brown fox'
'the strong brown bear'

becomes

1 1 1 1 0 0
1 0 1 0 1 1

where the columns correspond to "the", "quick", "brown", "fox", "strong", and "bear" respectively

silk acorn
#

why are there 2 extra columns?

desert oar
#

there aren't?

silk acorn
#

i must be misinterpreting it then

desert oar
#

perhaps. count the words

#

6 unique words (aka 'tokens'), 6 columns

silk acorn
#

ah, like that
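
The two-sentence example above can be reproduced with a few lines of plain Python. This is only a sketch: it splits on whitespace and orders the columns by first appearance to match the example, whereas a library vectorizer (for instance scikit-learn's `CountVectorizer`) tokenizes differently and sorts its vocabulary.

```python
from collections import Counter

docs = ["the quick brown fox", "the strong brown bear"]
tokenized = [doc.split() for doc in docs]

# Vocabulary in order of first appearance: one column per unique token.
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# One row per document; each cell is how often that word occurs in the document.
matrix = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```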

desert oar
#

@thorn river spacy also provides word vectors you can use, look it up in their docs

#

here's a sloppy basic text pipeline using some data thats kinda like what you described @thorn river

import spacy

userdict = {
    'joe': {
        'text': 'milk is okay',
        'photo': '13861361.jpg'
    },
    'rhylli': {
        'text': 'i am the funniest weightlifter in florida',
        'photo': '09370813.jpg'
    }
}

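# Load the medium English model (it has to be downloaded separately).
en_core_web_md = spacy.load('en_core_web_md')

# pipe() takes an iterable of strings and lazily yields one processed Doc per text.
text_spacy = en_core_web_md.pipe(userdata['text'] for userdata in userdict.values())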
waxen stump
#

Seaborn is pretty

#

How does this code work:

#
my_full_dataset = tf.data.Dataset.from_tensor_slices(
  (tf.cast(mnist_images[..., tf.newaxis]/255, tf.float32),
   tf.cast(mnist_labels, tf.int64)))
#

load_data() generates a tuple of arrays (train_x, train_y), (test_x, test_y)

#

How does that get parsed into three places?

#

e.g. isn't the first like saying (mnist_images, mnist_labels), _ = (train_x, train_y), (test_x, test_y)

#

er wait, the _ is a throwaway variable so the only data that's passed is train_x, and train_y

#

I think?

desert oar
#

Yes, "_" is a valid variable name in python so you just dump stuff there you aint gonna use

#

Your interpretation seems right

lean ledge
#

Their problem isn't that its dumping, it's dumping 2 things into 3 places

desert cradle
#

no

#

you have a tuple of two tuples. the second goes in _, the first is unpacked further into two variables

#

note that since there isn't anything magical about _,if it's a significant amount of data you should del _ so it can get garbage collected
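
The unpacking described above, as a tiny runnable sketch. `load_data` here is a hypothetical stand-in that returns small lists in the same `(train_x, train_y), (test_x, test_y)` shape.

```python
def load_data():
    """Hypothetical stand-in returning ((train_x, train_y), (test_x, test_y))."""
    return ([1, 2, 3], ["a", "b", "c"]), ([4, 5], ["d", "e"])

# The outer tuple has two items: the first is unpacked further into two names,
# the second (the test split) is bound to the ordinary variable `_`.
(images, labels), _ = load_data()

print(images, labels)
print(_)  # the test split is still referenced here

# Dropping the reference lets the test split be garbage collected.
del _
```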

arctic moth
#

@desert oar @hearty token Thanks guys, but I already found out that I had just made a logical error in the constraints. But thanks 😃

#

Guys anyone have experience with Fox, Tiger, Elephant and Musk datasets?

#

I want to use them to train my mi-SVM classifier, but I can not find in SVM file which type of bag is positive or which one is negative. It just contains number of the bag and then values and I also can not find anything on the web, that would use those datasets.

placid galleon
#

I've been directed here from help.

#

I'm having a problem installing tensorflow

lean ledge
#

@placid galleon try pip -V?

placid galleon
#

18.1

#
Collecting tensorflow
  Could not find a version that satisfies the requirement tensorflow (from versions: )
No matching distribution found for tensorflow
#

also tried that as well

desert oar
#

@arctic moth got a link to a page describing it or something?

#

@placid galleon you can install other packages?

#

also can you try using python3 -m pip instead of pip3?

placid galleon
#
Package          Version
---------------- -------------------
aiohttp          3.3.2
altgraph         0.16.1
async-timeout    3.0.0
attrs            18.1.0
auto-py-to-exe   2.4.2
bottle           0.12.13
bottle-websocket 0.2.9
certifi          2018.4.16
cffi             1.11.5
chardet          3.0.4
discord.py       1.0.0a1483+gec3435b
Eel              0.9.10
future           0.17.1
gevent           1.3.7
gevent-websocket 0.10.1
greenlet         0.4.15
idna             2.7
keyboard         0.13.2
macholib         1.11
multidict        4.3.1
numpy            1.14.5
opencv-python    3.4.3.18
pandas           0.23.4
pefile           2018.8.8
Pillow           5.2.0
pip              18.1
PyAutoGUI        0.9.38
pycparser        2.18
PyInstaller      3.4
PyMsgBox         1.0.6
pynput           1.4
pypiwin32        223
PyScreeze        0.1.18
python-dateutil  2.7.3
PyTweening       1.0.3
pytz             2018.5
pywin32          224
pywin32-ctypes   0.2.0
requests         2.19.1
scipy            1.1.0
selenium         3.141.0
setuptools       39.0.1
six              1.11.0
tflearn          0.3.2
tqdm             4.28.1
urllib3          1.23
websockets       6.0
whichcraft       0.5.2
yarl             1.2.6
#

python3 isn't recognised

lean ledge
#

@placid galleon TF doesnt support python3.7

placid galleon
#

Oh, i'll try down grading

lean ledge
#

I'd recommend python3.6 + CUDA 9 if you're trying to use tf

placid galleon
#

3.6 exactly or will 3.6.7 work?

lean ledge
#

will work

#

i'm currently on 3.6.7

placid galleon
#

Ok doke installing now, I'm praying it works ^_^

#

Thanks in advance

lean ledge
#

not even 100% sure it will work but it should. pip cant find it because you have to tag which versions something supports as you publish

placid galleon
#

Still not working

#

same error

lean ledge
#

@placid galleon Are you using 64 bit python or?

placid galleon
#

I'm not sure how would I check?

#

oh it's 32 bit

#

what the hell

lean ledge
#

try 64 bit, that should work

desert oar
#

theres basically no reason to use 32 bit python unless youre in a very specific situation and you know you need it

placid galleon
#

I ran import platform; platform.architecture() and it told me it was 32bit ... installing 64bit now, silly me
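
For anyone wanting to run the same check, a short sketch using only the standard library. It shows `platform.architecture()` next to two other common ways of reading the interpreter's bitness.

```python
import platform
import struct
import sys

# Size of a C pointer in the running interpreter, in bits (32 or 64).
bits_from_pointer = struct.calcsize("P") * 8

print(platform.architecture()[0])   # e.g. '32bit' or '64bit'
print(bits_from_pointer)
print(sys.maxsize > 2**32)          # True on a 64-bit build
```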

#

Ok, i've installed 64bit

#

but i cant just hit cmd and type python now

#

i need to do it in the exact path

#

now pip doesn't work

#

yikes pepe

#

nevermind, sorted the path 😛

#

Yeah that sorted it, installed tensorflow 😄

#

woo

rough pecan
#

Heya guys! two quick questions about time forecasting:

  1. What would you use to forecast a dataset of at most 50 datapoints that are unpredictable?
    (Every dataset of <= 50 is different, so you can't know if it even has seasonality in it or it doesn't)

Moltz?
Basic multivariate linear regression?

Arima seems too much as I don't think it will be accurate with <= 50...

  2. I heard there's an interesting way of doing that by converting all the cycles to asin functions, but regarding this and seasonality I'm wondering, is there a way to programmatically calculate whether there's even seasonality in it (as the dataset is different every time it comes)?
desert oar
#

@placid galleon highly recommend conda or at least a virtualenv

#

@rough pecan ive done arima on ~15 datapoints and gotten useful output. depends on the problem

#

can you describe the problem a bit more

#

inputs, nature of data, etc

rough pecan
#

I'm starting the project on AliExpress but it could grow from there...

The datapoints are initially the order amount and the date of the orders...

... but I'll be doing a lot of testing with both
A) existing (review count, review score average, etc) datapoints and
B) newly created datapoints from the existed data such a scores made by myself, sales per country etc... you know, using the data that's there to create more data out of it... feed A/B into different models and see if there's anything interesting I can find...

I'm quite new to this so it would be both interesting and quite rich in my learning process (I believe) to try out playing with all these datapoints to see if any of those (again, both the ones out there and the new data I'll create from it) could help in the search of trendy / hot products...

So, the goal: Spot trendy products that have a forecast of increased sales for the following days... @desert oar

#

So, for some reason despite the products having thousands of orders AliExpress only allows you to see the first 50 pages of orders.

We could assume then that this is a univariate forecasting problem, but as I said, I'll be doing a lot of testing with more data, so it's most likely to end up being multivariate in the long run, unless of course the other datapoints aren't really useful for the forecast... we only know for certain after we test... right? 😛

desert oar
#

yeah so you have a couple of options

#

basic option: automatically fit arima to each product

#

i actually implemented something like that at work a long time ago, basically you run through a bunch of hypothesis tests then fit a model. nowadays you can probably get away with auto.arima in the R package forecast, or implement it on Python -- it's based on the practices in https://otexts.org/fpp2/

#

there are many traditional stats methods for seasonal decomposition as well
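
On the earlier question of detecting seasonality programmatically, one simple heuristic (a swapped-in illustration, not the procedure from the linked book) is to check the sample autocorrelation at a candidate lag against the rough 95% white-noise bound of 1.96 / sqrt(n). The series below is synthetic, with a period of 7 chosen as an assumption, and any trend should be removed before applying this to real data.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

n = 49
t = np.arange(n)
rng = np.random.default_rng(0)

# Synthetic series: a weekly cycle plus noise, and pure noise for comparison.
weekly = 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, n)
noise = rng.normal(10, 1, n)

bound = 1.96 / np.sqrt(n)  # approximate 95% bound for white noise
for name, series in [("weekly", weekly), ("noise", noise)]:
    r = autocorr(series, 7)
    print(f"{name}: lag-7 autocorrelation = {r:.2f}, exceeds bound: {abs(r) > bound}")
```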

#

but yes, one way to encode "cyclical" data (eg. day of month, day of year, hour of day) is to use polar coordinates

#

so 0 and 23 hours are closer than 0 and 4 for example

#

you could probably fit one big linear model or neural network or whatever that way. drop in last 7 periods sales, plus polar encoded day, or something
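
A minimal sketch of that cyclical encoding, mapping a value onto the unit circle with sine and cosine. The function name `encode_cyclical` is made up for illustration.

```python
import math

def encode_cyclical(value, period):
    """Map a cyclical value (e.g. hour of day) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

h0, h4, h23 = (encode_cyclical(h, 24) for h in (0, 4, 23))

# After encoding, hour 23 is closer to hour 0 than hour 4 is.
print(math.dist(h0, h23))
print(math.dist(h0, h4))
```

The two resulting columns can be fed to a linear model or a neural network alongside lagged sales.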

#

more advanced option: bayesian arima model where the parameters across all products share a common distribution, allowing you to pool information that way

#

there are methods for doing online/incremental updates of bayesian models but its not as easy as just running some more epochs on a NN

rough pecan
#

Why would you want to pool such thing?

desert oar
#

shared information across products

#

right? its not like every time you look at a product online, you forget everything you know about other products

#

sharing information will be especially useful for sharing information about cyclicality/seasonality

rough pecan
#

That smells useful when forecasting, let's say, the overall website performance, yet would it be good to make them compete with one another? It sounds more like an inclusion and not an opposition situation, right?

For competition (which is the main objective), I believe that it may be more than enough to have them separated, and of course together in the same database, but, you understand, separated in their analysis, then just compare the results between all of them 🤔

desert oar
#

and the effect of any covariates

#

what do you mean separated?

#

or competing?

#

you said before that you're trying to forecast when sales are about to increase in some product

#

the more you know about a product, the easier that should be, right?

rough pecan
#

Correct...

#

What I mean is that is more about finding trends of individual products and not the entire business overall... rather to make the products "compete" with each other, which is why I assume that having their analysis separated (let's say an ARIMA fit for each product, without pooling them all together) seems a little bit more accurate isn't it? how other product's information would benefit the prediction of the previous one?

#

My potatoes are about to burn in the oven! I have to run to check them out haha... brb.
Sorry if I'm asking too many questions but this is just such a juicy conversation!

Thank you so much for helping me out, you're bringing a lot of value and clarity to my sight 😊 🙏

desert oar
#

does forgetting what you know about Product A help you make better predictions about Product B?

rough pecan
#

Maybe and probably?... unless you're trying to predict the trend of a certain category doesn't it?
Unless product A and B are the exact same products of course...
For this I'll be training an image recognition model to compare all the pictures from two different products, in case different publications (A/B) are two publications of the same product. In that case I was planning to "merge" those two; until now I never thought about merging the data of ALL products 🤔
@desert oar

desert oar
#

you arent merging them entirely. you are just sharing common information

#

its effectively a form of regularization

#

eg you shrink all arima coefs towards 0
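
To make the shrinkage idea concrete, here is a deliberately simplified sketch: each per-product estimate is pulled toward a shared value (0 here), and products with fewer observations are pulled harder. This is a plain weighted average, not a Bayesian ARIMA fit; the function name and the `prior_strength` parameter are assumptions made for illustration.

```python
def shrink(estimate, n_obs, prior_mean=0.0, prior_strength=10.0):
    """Pull an estimate toward prior_mean; fewer observations means more shrinkage."""
    weight = n_obs / (n_obs + prior_strength)
    return weight * estimate + (1 - weight) * prior_mean

# Same raw coefficient, different amounts of data behind it.
print(shrink(0.8, n_obs=10))   # little data: pulled halfway to 0
print(shrink(0.8, n_obs=90))   # more data: stays close to the raw estimate
```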

rough pecan
#

Just to make the model (in case it was, for example, a neural net) more accurate in its predictions, just for the sake of making it smarter, then predict the trend of each product separately? 🤔

#

Oh... hell, I hadn't even opened my mind to the possibility of sharing common data for the sake of normalizing it. I thought the factors affecting each product were very different, as the market (people) and behaviors for each product are very different 🤔

desert oar
#

yes, you still make separate predictions of course

#

its not a simple process to be sure

#

it takes some doing. but yes the idea is to only share information on common factors

arctic moth
lime lava
#

Hi, I need some help with data manipulation: a table with 3 columns. I want to group by a, then remove duplicated groups. That is, once I've done the groupby, I might have 2 grouped elements that have the exact same “B” rows, and I want to remove those duplicated groups.

desert oar
#

You want to remove duplicates within each group?
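
One possible reading of the request, sketched with pandas: treat two groups as duplicates when they contain the same set of `b` values, and keep only the first such group. The column names `a`, `b`, `c` and the data are assumptions, since the actual table is not shown.

```python
import pandas as pd

# Made-up table: groups a=1 and a=2 contain exactly the same "b" values.
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 3],
    "b": ["x", "y", "x", "y", "x", "z"],
    "c": [10, 20, 30, 40, 50, 60],
})

# Signature of each group: its sorted "b" values as a hashable tuple.
signatures = df.groupby("a")["b"].agg(lambda s: tuple(sorted(s)))

# Keep the first group for each distinct signature, then filter the original rows.
keep = signatures.drop_duplicates().index
deduped = df[df["a"].isin(keep)]
print(deduped)
```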