#data-science-and-ml | Python | Page 235

modest rune Jul 16, 2020, 6:10 PM

#

@void anvil My suggestion, because it will be the fastest, probably the easiest, and likely the cleanest... read as much of the json as you can, then use one of the 3 pandas .json IO functions.
https://pandas.pydata.org/pandas-docs/stable/reference/io.html#json

modern canyon Jul 16, 2020, 6:10 PM

#

What are the best methods for movie recommendation systems?
I'm right now using cosine similarity metric on the IMDb dataset to make recommendations. Although it performs reasonably well, I'd like to enhance the performance. So what are the SOTA methods available for movie recommendation systems?

lapis sequoia Jul 16, 2020, 6:16 PM

#

do yall see auto ML taking over data science in the upcoming decades

turbid oyster Jul 16, 2020, 6:26 PM

#

I see auto ML being a big automater of machine learning - but data science is more than ML

regal flax Jul 16, 2020, 6:59 PM

#

ey

worldly kindle Jul 16, 2020, 6:59 PM

#

@modest rune so you went with converting the arrays to lists?

pastel compass Jul 16, 2020, 7:15 PM

#

Is there any benefit to using an ASCII string over Unicode for text?

idle otter Jul 16, 2020, 7:23 PM

#

how would you say a shape of (2330, 3500, 3) in words?

#

i know it's 2330 arrays of 3500by3 but I am just wondering if there is a standard way of saying it

frank bone Jul 16, 2020, 7:26 PM

#

Is it possible to pass a list of dates to a pandas date series index?

#

i.e. only open market dates for a year of stocks

#

Which is like 250 days out of 365

proper fable Jul 16, 2020, 7:55 PM

#

Guys, if I have a dataset that has 'names' column, how should I deal with it to convert it into numerical data

#

Is it right to use one-hot encoding? Because it has literally 25 unique values

desert oar Jul 16, 2020, 7:58 PM

#

@pastel compass in some specific applications maybe but not in general

#

to be clear: by "Unicode" you probably mean UTF-8

vagrant fiber Jul 16, 2020, 7:59 PM

#

you can select the columns which you need to convert and use .astype()

desert oar Jul 16, 2020, 7:59 PM

#

@frank bone yes

restive obsidian Jul 16, 2020, 8:01 PM

#

hi can someone help me with scipy solve_bvp?

#

data-science seemed to be closest to a channel which might use numerical computation that's why I jumped in here

pastel compass Jul 16, 2020, 8:17 PM

#

to be clear: by "Unicode" you probably mean UTF-8
@desert oar

Ahh I didn't know there was a link between the two

desert oar Jul 16, 2020, 8:17 PM

#

Unicode is an abstract system that basically catalogues every character/symbol used by humans and assigns a number to each one

#

UTF-8 is an encoding for Unicode text

#

so python strings are "unicode"

#

but a file would be "UTF-8"

pastel compass Jul 16, 2020, 8:18 PM

#

Oh that makes sense, I always see "encoding=utf8" but I didn't fully understand

desert oar Jul 16, 2020, 8:18 PM

#

yes

#

so that's a UTF-8 encoded file

#

which means that it contains Unicode text, in UTF-8 format

#

@restive obsidian what's your problem with it? don't ask to ask

pastel compass Jul 16, 2020, 8:19 PM

#

Thanks for the help!

pale thunder Jul 16, 2020, 8:21 PM

#

UTF-8 is a way to represent a sequence of Unicode characters as 8-bit bytes (octets)

desert oar Jul 16, 2020, 8:41 PM

#

^
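A minimal sketch of the distinction described above, using only built-in `str` and `bytes`. The sample text is made up for illustration.

```python
# A Python str is a sequence of Unicode code points.
text = "na\u00efve caf\u00e9 \u2615"

# Encoding turns the abstract text into concrete bytes; UTF-8 is one such encoding.
data = text.encode("utf-8")

print(len(text))                          # number of code points
print(len(data))                          # number of bytes (non-ASCII characters need more than one)
print([hex(ord(ch)) for ch in "\u00e9\u2615"])  # the numbers Unicode assigns to two of the characters

# Decoding with the same encoding recovers the original text.
round_trip = data.decode("utf-8")

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "plain text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```

This is also what `encoding="utf8"` in `open(...)` controls: which byte format is used when reading or writing the file's text.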

modest rune Jul 16, 2020, 9:14 PM

#

@modest rune so you went with converting the arrays to lists?
@worldly kindle

A headache brought upon by too many consecutive days of Pandas forced me to take a break. After a 3 hour nap, I am ready to try again. Upon further reflection, I have decided to take the advice of the experts and attempt to index 2 dataframes instead of putting everything into 1 dataframe (which required the nested arrays).

restive obsidian Jul 16, 2020, 9:32 PM

#

@desert oar I've been stuck for 3 hrs on a problem where I have to solve a system of 2 coupled 2nd-order differential equations

#

can you help? I'm using scipy

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.solve_bvp.html the example in the docs seems confusing

#

the bc(ya, yb) says it should return a (n, ) array but why? Boundary conditions should be also for the differentials.

hardy folio Jul 16, 2020, 9:35 PM

#

Hello, Are there people here who may be a little experienced with web scraping?

restive obsidian Jul 16, 2020, 9:36 PM

#

@hardy folio what do you want to scrape?

#

and with what? scrapy/selenium/requests?

hardy folio Jul 16, 2020, 9:37 PM

#

So when I have scraped web pages before, most of the time I would just scrape information from the page. This website I am looking at actually creates a link and gives you the information in a csv file.

#

http://prntscr.com/tj7xw0


#

is there a way for me to actually use the link it creates and download the csv file instead of just scraping the information it returns on the whole web page

#

or would that be more difficult

#

I have spent quite a bit of time looking into this but so far cant find information that relates

#

it does not look like that creates a link I would use

#

http://prntscr.com/tj7zgj


#

the web page looks like that and the excel picture is a link

restive obsidian Jul 16, 2020, 9:40 PM

#

use a = requests.get(...).content and then with open(....csv, "wb") as f: f.write(a)

hardy folio Jul 16, 2020, 9:43 PM

#

Dont i have to give

#

request.get

#

the actual url link download

#

page = requests.get(URL)

#

like that but if I dont have a URL for the excel link

#

and obviously im still learning a lot so I apologize if im dumb

worldly kindle Jul 16, 2020, 11:48 PM

#

@modest rune good call haha

lapis sequoia Jul 17, 2020, 12:02 AM

#

i love python

fierce saffron Jul 17, 2020, 12:39 AM

#

any idea why a pandas describe works sometimes on a numpy array and doesn't other times?

umbral solar Jul 17, 2020, 12:47 AM

#

Hi! Does anyone know how I might unmask a masked numpy array? I tried ma.getdata() and .data but neither worked (they just returned the same masked array)

frank bone Jul 17, 2020, 1:12 AM

#

```py
df = pd.DataFrame(index=dti, columns=ticker_list)
```

#

anyone know whats wrong with this?

#

but when passed as index it prints this Cannot convert input [['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12', '2012-01-13', '2012-01-17', '2012-01-18', '2012-01-19', '2012-01-20', '2012-01-23', '2012-01-24', '2012-01-25', '2012-01-26', '2012-01-27'..........................'2012-12-28', '2012-12-31']] of type <class 'list'> to Timestamp

#

trying to get a datetime index instead of 0 to n

frank bone Jul 17, 2020, 1:36 AM

#

nvm figured it out 😄

frank bone Jul 17, 2020, 1:58 AM

#

just do this df = pd.DataFrame(index=date_list, columns=ticker_list)
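A small sketch of building a frame on a datetime index from a flat list of date strings. The contents of `date_list` and `ticker_list` below are made-up stand-ins, since the real lists are not shown in the chat, and the explicit `pd.to_datetime` step is an assumption about what was wanted rather than the exact fix that was used.

```python
import pandas as pd

# Hypothetical sample data standing in for the real lists.
date_list = ["2012-01-03", "2012-01-04", "2012-01-05", "2012-01-06", "2012-01-09"]
ticker_list = ["AAA", "BBB"]

# pd.to_datetime converts a flat list of strings into a DatetimeIndex.
# Note the list is passed directly, not wrapped in another list.
dti = pd.to_datetime(date_list)

df = pd.DataFrame(index=dti, columns=ticker_list)
print(type(df.index).__name__)
print(df.shape)
```

Only the dates present in the list become rows, so a list of open-market days gives an index without weekends or holidays.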

coarse spire Jul 17, 2020, 2:40 AM

#

Hi, I'm trying to categorize comments on twitch chat. I ran an unsupervised tweet topic modeling algorithm and got 10 topics but the results don't seem too promising.

Does anyone have suggestions to improve the process? Do I remove emotes?

flat quest Jul 17, 2020, 3:34 AM

#

i'm guessing u used a clustering algorithm? @coarse spire

coarse spire Jul 17, 2020, 3:36 AM

#

Yeah, so I used different embeddings (including BERT embeddings from flair), ran PCA then used AgglomerativeClustering

#

Then I ran TF-IDF on each topic to pick out the most important terms in each

#

I don't know too much about clustering so I just followed this tweet analyzer post. https://towardsdatascience.com/covid-19-with-a-flair-2802a9f4c90f?gi=522b2c9f7c6

#

I guess I should look at different clustering techniques and varying the cluster size

#

I also don't know how to make much sense of my data after running PCA on my own. I should look into that too.

frank bone Jul 17, 2020, 4:07 AM

#

anyone got a clue how to skip NaN values?

#

doing a Simple Moving Average but it breaks as soon as there's 1 NaN value in a time series

#

data['SMA_3'] = data['CREE'].rolling(window=30).mean()

#

id want it to just ignore it and keep going

coarse spire Jul 17, 2020, 4:23 AM

#

(window=30, min_periods=29) would work for 1 NaN

#

You could also replace the NaN with the mean

#

I see people use pygame when they want an easy GUI.

worldly kindle Jul 17, 2020, 4:49 AM

#

any idea why a pandas describe works sometimes on a numpy array and doesn't other times?
looks to be hot topic today

regal hound Jul 17, 2020, 4:59 AM

#

@fierce saffron afaik the .describe() function is not implemented in numpy at all. Therefore it shouldn't work. Maybe you have some DataFrame that you think is a numpy array? If you want to describe a numpy array you can use scipy.stats.describe as a workaround.
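A quick sketch of the point above. It swaps in a different workaround from the one mentioned in the message: instead of `scipy.stats.describe`, the array is wrapped in a pandas `Series` so that pandas' own `describe()` can be used. The sample array is made up.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0, 4.0])

# A plain ndarray has no describe method; only pandas objects do.
print(hasattr(arr, "describe"))

# Wrapping the array in a Series makes describe() available.
summary = pd.Series(arr).describe()
print(summary)
```

If `describe()` "sometimes works", checking `type(obj)` at that point should show whether the object is really an `ndarray` or a `DataFrame`/`Series`.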

frank bone Jul 17, 2020, 5:00 AM

#

@coarse spire does it divide by 30 or 29 in that case?

#

Does it effectively skip it? Or just treats it as a zero?

coarse spire Jul 17, 2020, 5:01 AM

#

30 until it hits the nan then 29 until it fully passes it
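A small sketch of that behaviour, using a window of 3 instead of 30 so the numbers are easy to check by hand. The series values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0])

# Default: min_periods equals the window, so any window touching the NaN is NaN.
strict = s.rolling(window=3).mean()

# With min_periods=2, a window with one NaN still yields a value:
# the mean of the 2 valid observations (the NaN is skipped, not treated as zero).
relaxed = s.rolling(window=3, min_periods=2).mean()

print(pd.DataFrame({"value": s, "strict": strict, "relaxed": relaxed}))
```

In the `relaxed` column the window `[2, 3, NaN]` gives 2.5, i.e. the sum is divided by 2 rather than 3.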

frank bone Jul 17, 2020, 5:02 AM

#

Great thanks 🙂 is there a possibility to just ignore NaN though? So if theres a NaN its like it doesnt even exist

#

If thats possible then 30/30 is always possible unless theres less than 30 datapoints

coarse spire Jul 17, 2020, 5:03 AM

#

Well, dropna will drop the nans before you do the moving average

frank bone Jul 17, 2020, 5:03 AM

#

Tried that one but somehow it didnt work for me

#

Maybe i did it wrong

coarse spire Jul 17, 2020, 5:04 AM

#

It should definitely work but putting it back into the dataframe will require some finesse

#

Easiest thing to do would be replace nan with the mean

#

Then you have no nans
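A sketch of both options mentioned here, with a window of 3 instead of 30 and made-up prices. The column names `SMA_drop` and `SMA_fill` are hypothetical. One observation, offered tentatively: pandas aligns on the index when assigning a Series to a column, so putting the `dropna` result back may need less finesse than expected.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"CREE": [10.0, 11.0, np.nan, 13.0, 14.0, 15.0]})

# Option 1: drop NaNs, then roll. Every window holds 3 real observations.
# Assignment aligns on the index, so the row that held the NaN just stays NaN.
data["SMA_drop"] = data["CREE"].dropna().rolling(window=3).mean()

# Option 2: replace NaNs with the column mean, then roll.
filled = data["CREE"].fillna(data["CREE"].mean())
data["SMA_fill"] = filled.rolling(window=3).mean()

print(data)
```

Option 1 makes a window span one extra calendar row around the gap; option 2 keeps the spacing but injects an artificial value, which matters more as the share of missing rows grows.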

frank bone Jul 17, 2020, 5:04 AM

#

True, might do that instead

#

You have a good example/link on how to do that?

coarse spire Jul 17, 2020, 5:05 AM

#

Nah, if you search around for "replace nan with mean pandas" it should come up

frank bone Jul 17, 2020, 5:06 AM

#

Like is it possible on the go..while executing the SMA function?

#

Alright ill check it out

coarse spire Jul 17, 2020, 5:06 AM

#

Nope, gotta do it before

frank bone Jul 17, 2020, 5:06 AM

#

Thanks 👌🏻

#

Ah damn

coarse spire Jul 17, 2020, 5:06 AM

#

You're welcome good luck

verbal ice Jul 17, 2020, 6:08 AM

#

You could also replace the NaN with the mean
@coarse spire be careful when you do this though, it depends on how many nulls you have. If there are too many and you replace them with the mean it will be useless because you won't get any information out of it

#

Sorry late to the conversation 😅

real radish Jul 17, 2020, 6:10 AM

#

Hi all I hope you are well

#

Is this the right place to ask NLP and CNN questions?

velvet thorn Jul 17, 2020, 6:11 AM

#

yes, it is.

real radish Jul 17, 2020, 6:14 AM

#

Im currently doing a project where I'm using word2vec algorithm for classifying Facebook comments into how aggressive they are... Is there a common tool that I can use to iterate through my corpus of sentences to correct spelling mistakes?

#

At the moment I'm using gensim word2vec, but that could change as I'm only at the data preprocessing stage

agile anvil Jul 17, 2020, 6:22 AM

#

https://paste.pythondiscord.com/imazezuziy.py
discussion: https://www.reddit.com/r/dataisbeautiful/comments/hsqb6o/oc_us_hospital_intensive_care_unit_bed_use_by

📎 deaths-vs-beds_1.mp4

#

I spent 15 hours on those 50 lines, lol.

silk axle Jul 17, 2020, 7:41 AM

#

How would I add my own data to the mnist training dataset? I've got the images and have worked out the labels, but not sure what datatype the images + labels have to be, nor how to actually add them to my dataset correctly. (@me upon response please)

bitter harbor Jul 17, 2020, 8:30 AM

#

@silk axle "Thus, in MNIST training data set, `mnist.train.images` is shaped as a [60000, 784] tensor (60000 images, each involving a 784 element array). Using that syntax, you can refer to any of the pixels in any of the images. As shown above, each element in this tensor represents the intensity value of a pixel in a picture, between 0 and 1."

#

just concatenate your data to the end of the array?

silk axle Jul 17, 2020, 8:48 AM

#

I'm really new to ml + numpy so not sure how to

#

And if it's an array of (28x28) does that mean I need my image as an array 28x28?

#

@bitter harbor

bitter harbor Jul 17, 2020, 8:56 AM

#

yes I'm not sure how mnist orders the pixels but they are individual pixels not images

silk axle Jul 17, 2020, 8:56 AM

#

Right okay, thanks

#

I’ll see if I can figure something out

fleet moth Jul 17, 2020, 9:05 AM

#

I want to have multiple lines, one for every interruption_type and priority combination, on my matplotlib chart. Currently I only have this:
```py
def select(self):
    return pd.read_sql_query(
        "SELECT DISTINCT date, interruption_type, priority, SUM(interventiontime) from interruptions GROUP BY interruption_type, priority;",
        self.conn, index_col="date")

class MplCanvas(FigureCanvasQTAgg):
    def __init__(self, parent=None, width=5, height=4, dpi=100):
        fig = Figure(figsize=(width, height), dpi=dpi)
        self.axes = fig.add_subplot(111)
        super().__init__(fig)
        self.draw()

class StatisticDialog(QMainWindow):

    def __init__(self, *args, **kwargs):
        super(StatisticDialog, self).__init__(*args, **kwargs)
        self.db = OctopusDB()
        self.setWindowTitle("Statistiques des interruptions")
        self.resize(600, 400)
        self.setWindowIcon(QIcon('icon.png'))

        try:
            datas = self.db.select()
            sc = MplCanvas(self, width=5, height=4, dpi=100)
            datas.plot(ax=sc.axes)
```

#

📎 Capture.PNG

#

How can I edit this to get multiple lines (or bars, or another graph that can show the sum by interruption_type, priority and date)?

bitter harbor Jul 17, 2020, 9:08 AM

#

put it into an array

fleet moth Jul 17, 2020, 9:11 AM

#

`datas = self.db.select()` so I must create an array from this line?

bitter harbor Jul 17, 2020, 9:12 AM

#

np.random.random((len(data points), amount of interruption_types))

fleet moth Jul 17, 2020, 9:12 AM

#

```py
datas = self.db.select()
print(datas)
```

returns:

```
           interruption_type                   priority  SUM(interventiontime)
date
17/07/2020             Email      Important, non urgent                     10
17/07/2020      Présentielle          Important, Urgent                     39
17/07/2020      Présentielle      Important, non urgent                     10
17/07/2020      Présentielle  Non important, non urgent                      6
17/07/2020         Téléphone  Non important, non urgent                      4
```

silk axle Jul 17, 2020, 9:13 AM

#

Code
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    resized_image = 1 - resize(image, (28, 28, 1))
    # print(resized_image.shape)

    ## Add data to training sets
    x_train += resized_image  # this is the line that raises the error
    y_train += number_in_image
```
Error
```py
UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```
@bitter harbor

bitter harbor Jul 17, 2020, 9:14 AM

#

oh sorry yannick I thought your line space was time

silk axle Jul 17, 2020, 9:14 AM

#

I'm assuming I have to somehow convert the resized_image to be float32 but idk how

bitter harbor Jul 17, 2020, 9:14 AM

#

plt.bar()

#

https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.bar.html

fleet moth Jul 17, 2020, 9:14 AM

#

no, it's the sum of time for all same interruption_type and same priority for a date

bitter harbor Jul 17, 2020, 9:15 AM

#

separate the bars into urgencies then

fleet moth Jul 17, 2020, 9:16 AM

#

yes

#

https://matplotlib.org/3.2.1/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py this type of graph would be cool. Can you help to adapt my code for the same result @bitter harbor ?

bitter harbor Jul 17, 2020, 9:23 AM

#

No sorry but you're going to have to do a bit of adapting, that graph is 2-dimensional: (scores, men) and (scores, women) so basically it's like saying (scores, human). Your graph is (date, incident type, priority, amount)

#

so even if you split priority into groups, you'll still essentially have a 4-dimensional graph

silk axle Jul 17, 2020, 9:24 AM

#

@bitter harbor do you know how to solve my above issue? https://discordapp.com/channels/267624335836053506/366673247892275221/733612255199100961

bitter harbor Jul 17, 2020, 9:25 AM

#

you don't have to link it I can still see it on my screen

#

look up the mnist docs

silk axle Jul 17, 2020, 9:27 AM

#

? @bitter harbor

📎 unknown.png

bitter harbor Jul 17, 2020, 9:29 AM

#

what do your input photos look like

silk axle Jul 17, 2020, 9:30 AM

#

Do you mean the actual image or the numpy array?

bitter harbor Jul 17, 2020, 9:30 AM

#

actual image

silk axle Jul 17, 2020, 9:30 AM

#

This is an example one

📎 0.png

#

this is example after resizing and inverting (plt.imshow)

📎 unknown.png

bitter harbor Jul 17, 2020, 9:31 AM

#

why is it inverted?

silk axle Jul 17, 2020, 9:31 AM

#

Because the mnist set is apparently inverted

bitter harbor Jul 17, 2020, 9:31 AM

#

and that's no longer a 0

silk axle Jul 17, 2020, 9:31 AM

#

And yea they're not the same image, sorry

#

Gimme a sec

bitter harbor Jul 17, 2020, 9:32 AM

#

no its fine

#

look up an image of the mnist training set

silk axle Jul 17, 2020, 9:32 AM

#

📎 unknown.png

#

That's why I'm inverting ^^

bitter harbor Jul 17, 2020, 9:33 AM

#

look up an image of the mnist training set

#

@fleet moth what you could do is create a bar graph separated by time, split by the types with lengths (y) of the sum, then heat mapped to the priority
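One way to get the data into a shape that a grouped bar chart can use, sketched with pandas only. The rows are copied from the printed output above. The column name `total_time` is a hypothetical rename of `SUM(interventiontime)`, and pivoting is an assumption about the desired layout, not something proposed in the chat.

```python
import pandas as pd

rows = [
    ("17/07/2020", "Email", "Important, non urgent", 10),
    ("17/07/2020", "Présentielle", "Important, Urgent", 39),
    ("17/07/2020", "Présentielle", "Important, non urgent", 10),
    ("17/07/2020", "Présentielle", "Non important, non urgent", 6),
    ("17/07/2020", "Téléphone", "Non important, non urgent", 4),
]
datas = pd.DataFrame(rows, columns=["date", "interruption_type", "priority", "total_time"])

# One row per interruption type, one column per priority; missing pairs become 0.
pivot = datas.pivot_table(
    index="interruption_type",
    columns="priority",
    values="total_time",
    aggfunc="sum",
    fill_value=0,
)
print(pivot)

# With matplotlib available, each priority column becomes its own set of bars:
# pivot.plot(kind="bar", ax=sc.axes)
```

With several dates in the data, using `index=["date", "interruption_type"]` would keep the date dimension as well.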

silk axle Jul 17, 2020, 9:36 AM

#

Idk what you mean by that @bitter harbor

bitter harbor Jul 17, 2020, 9:36 AM

#

@silk axle they're white numbers and the white space is black, that's the images inverted

#

you can't combine numbers of different colours without screwing with the dataset

silk axle Jul 17, 2020, 9:37 AM

#

I'm inverting it though

#

That's the point

#

resized_image = 1 - resize(image, (28, 28, 1))

#

This line resizes to (28, 28) and then inverts it

#

So that the colours do match

#

@bitter harbor

bitter harbor Jul 17, 2020, 9:42 AM

#

please stop pinging

#

what's the purpose of the nn?

silk axle Jul 17, 2020, 9:44 AM

#

To predict what the number is basically

#

But I am inverting it so that it's white numbers and black background

bitter harbor Jul 17, 2020, 9:45 AM

#

ok but your image is purple and yellow

silk axle Jul 17, 2020, 9:45 AM

#

But I inverted it

#

So it's not

bitter harbor Jul 17, 2020, 9:45 AM

#

the inverse/reverse of purple and yellow is yellow and purple

silk axle Jul 17, 2020, 9:45 AM

#

The dataset has yellow numbers and purple background (in the plt.imshow)
My data has purple numbers and yellow background, but I invert it (in the plt.imshow)

#

The dataset is also purple+yellow

#

this is a random element taken from the dataset

📎 unknown.png

#

Idk why it shows as purple and yellow (whether that's a plt.imshow thing or just the python mnist dataset), but it does

#

So the colours do match

#

Either way I don't see how this is relevant to the error I'm getting

bitter harbor Jul 17, 2020, 9:52 AM

#

whats the full error then

silk axle Jul 17, 2020, 9:52 AM

#

https://discordapp.com/channels/267624335836053506/366673247892275221/733612255199100961

#

```
UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```

bitter harbor Jul 17, 2020, 9:52 AM

#

the full error

silk axle Jul 17, 2020, 9:52 AM

#

It's only that because I'm using Google Colab

#

```
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-9-d7ac403beec4> in <module>()
     14 
     15     ## Add data to training sets
---> 16     x_train += resized_image
     17     y_train += number_in_image
     18 

UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```

pale thunder Jul 17, 2020, 9:54 AM

#

Maybe your custom images are from floats 0 to 1 and mnist is 0-255 uint8 or vice versa

#

Print out the dtype of x_train and resized_image

silk axle Jul 17, 2020, 9:55 AM

#

Mnist is 0-255 but I change that somewhere so that it's 0-1 iirc

#

wait nvm mnist is 0-1 I think

#

```
uint8
float32
```

#

First is mnist, second is my image

#

So ig that means MNIST is 0-255? And I have to /255?

pale thunder Jul 17, 2020, 9:57 AM

#

I would *255 and astype('uint8') your images instead
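A sketch of why the in-place add fails and what the suggested conversion does. The arrays are small random stand-ins, not the real MNIST data or the real custom image.

```python
import numpy as np

# Stand-ins: a tiny uint8 "dataset" and one float image with values in [0, 1).
x_train = np.zeros((5, 28, 28), dtype=np.uint8)
image01 = np.random.default_rng(0).random((28, 28)).astype(np.float32)

# An in-place add has to write float results back into the uint8 array,
# which NumPy refuses under the default 'same_kind' casting rule.
try:
    x_train += image01
    cast_failed = False
except TypeError as err:  # UFuncTypeError is a subclass of TypeError
    cast_failed = True
    print(type(err).__name__)

# Scale to the 0-255 range and convert so the dtype matches the dataset.
image_u8 = (255 * image01).astype(np.uint8)
print(image_u8.dtype, image_u8.min(), image_u8.max())
```

Matching the dtype only removes the casting error. `+=` on arrays still adds values element by element; it does not append a new sample.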

silk axle Jul 17, 2020, 9:57 AM

#

Later on I need everything as 0-1 though

#

I think

pale thunder Jul 17, 2020, 9:58 AM

#

Then do that later. I would think learning on uint8 would be faster than 32 bit floats

#

But feel free to try either way

#

It does probably just become floats later regardless

silk axle Jul 17, 2020, 9:58 AM

#

Right yea ig

#

So `reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')`?

pale thunder Jul 17, 2020, 10:00 AM

#

Looks about right

silk axle Jul 17, 2020, 10:00 AM

#

Doesn't seem to convert to uint8

#

```
MNIST dataset: uint8
My image: float32
```

#

I'm so confused lol

#

Okay so it's not converting to uint8

#

```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(f"MNIST dataset: {x_train.dtype}")  # outputs uint8

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')
    print(f"My image: {resized_image.dtype}")  # outputs float32
    # print(resized_image.shape)

    ## Add data to training sets
    x_train += resized_image
    y_train += number_in_image
```

#

@pale thunder

pale thunder Jul 17, 2020, 10:05 AM

#

Where does resized_image come from?

silk axle Jul 17, 2020, 10:05 AM

#

oh

bitter harbor Jul 17, 2020, 10:05 AM

#

255 * abs(1 - resize(image, (28, 28, 1)))

silk axle Jul 17, 2020, 10:05 AM

#

I'm using wrong variable lmao

#

Okay new error

#

```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-287f62ccfef9> in <module>()
     16 
     17     ## Add data to training sets
---> 18     x_train += reversed_image
     19     y_train += number_in_image
     20 

ValueError: operands could not be broadcast together with shapes (60000,28,28) (28,28,1) (60000,28,28)
```

#

And the 1 - resize(...) will never be <0 so don't need to abs it @bitter harbor

#

So the types are matching now, but says the shapes don't match

bitter harbor Jul 17, 2020, 10:07 AM

#

I didn't say add them together, I said concatenate

#

as in extend the array

silk axle Jul 17, 2020, 10:07 AM

#

Isn't that how u concatenate numpy stuff?

pale thunder Jul 17, 2020, 10:07 AM

#

Ah, you cannot append things with + like that. You need some numpy stack function, concat or append. Unfortunately not at a PC, so I cannot test which one works

bitter harbor Jul 17, 2020, 10:08 AM

#

just look up numpy.concatenate

#

(1, 28, 28)

silk axle Jul 17, 2020, 10:09 AM

#

```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-c1eac66c6ac2> in <module>()
     16 
     17     ## Add data to training sets
---> 18     x_train.concatenate(reversed_image)
     19     y_train.concatenate(number_in_image)
     20 

AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
```

bitter harbor Jul 17, 2020, 10:09 AM

#

again look up numpy.concatenate

silk axle Jul 17, 2020, 10:09 AM

#

!d numpy.concatenate

arctic wedgeBOT Jul 17, 2020, 10:09 AM

#

`numpy.concatenate`

```py
numpy.concatenate((a1, a2, ...), axis=0, out=None)
```
Join a sequence of arrays along an existing axis.

Parameters

**a1, a2, …** : sequence of array_like. The arrays must have the same shape, except in the dimension corresponding to *axis* (the first, by default).

**axis** : int, optional. The axis along which the arrays will be joined. If axis is None, arrays are flattened before use. Default is 0.

**out** : ndarray, optional. If provided, the destination to place the result. The shape must be correct, matching that of what concatenate would have returned if no out argument were specified.

Returns

**res** : ndarray. The concatenated array.

See also

[`ma.concatenate`](numpy.ma.concatenate.html#numpy.ma.concatenate "numpy.ma.concatenate") Concatenate function that preserves input masks.

[`array_split`](numpy.array_split.html#numpy.array_split "numpy.array_split") Split an array into multiple sub-arrays of equal or near-equal size.... [read more](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html#numpy.concatenate)

silk axle Jul 17, 2020, 10:09 AM

#

Oh

#

```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-6ffe324b79a0> in <module>()
     16 
     17     ## Add data to training sets
---> 18     np.concatenate(x_train, reversed_image)
     19     np.concatenate(y_train, number_in_image)
     20 

<__array_function__ internals> in concatenate(*args, **kwargs)

TypeError: only integer scalar arrays can be converted to a scalar index
```

pale thunder Jul 17, 2020, 10:10 AM

#

Look at the signature once more

bitter harbor Jul 17, 2020, 10:10 AM

#

the mnist training set is one array of 60000 images, shape (60000, 28, 28). you need to change your image's shape to (1, 28, 28) because you're adding 1 item
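A minimal sketch of the shape rule being described, with small zero-filled stand-ins for the real arrays. `new_image` and `new_label` are hypothetical names.

```python
import numpy as np

# Stand-ins: 6 "images" instead of 60000, plus one new sample shaped (28, 28, 1).
x_train = np.zeros((6, 28, 28), dtype=np.uint8)
y_train = np.zeros(6, dtype=np.uint8)
new_image = np.full((28, 28, 1), 255, dtype=np.uint8)
new_label = 7

# np.concatenate takes ONE sequence of arrays. All of them need the same number
# of dimensions and the same sizes everywhere except along the joining axis.
batch = new_image.reshape(1, 28, 28)  # a batch of 1 image
x_train = np.concatenate((x_train, batch), axis=0)

# The label must also be at least 1-D before it can be joined.
y_train = np.concatenate((y_train, np.array([new_label], dtype=y_train.dtype)))

print(x_train.shape, y_train.shape)
```

Because the extra axis has length 1, reshaping from `(28, 28, 1)` to `(1, 28, 28)` leaves the pixel order untouched.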

silk axle Jul 17, 2020, 10:11 AM

#

🤔

#

Also no clue what you mean by signature @pale thunder

bitter harbor Jul 17, 2020, 10:12 AM

#

look at the parameters of the function

silk axle Jul 17, 2020, 10:12 AM

#

ah

#

```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-ce76f79fc976> in <module>()
     16 
     17     ## Add data to training sets
---> 18     np.concatenate((x_train, reversed_image))
     19     np.concatenate((y_train, number_in_image))
     20 

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 2, the array at index 0 has size 28 and the array at index 1 has size 1
```
I'm assuming this is the thing of needing to reshape?

#

I'm really confused

bitter harbor Jul 17, 2020, 10:15 AM

#

are you going to use a library to build your classifier

silk axle Jul 17, 2020, 10:15 AM

#

I've already built the classifier (tensorflow.keras.models.Sequential)

#

So yes

#

If that's what you mean by classifier

bitter harbor Jul 17, 2020, 10:16 AM

#

ah that makes sense

#

yes the type of nn

silk axle Jul 17, 2020, 10:16 AM

#

yea

#

```py
## Build the CNN model
model = Sequential()
## Add model layers
model.add(Conv2D(32, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
```
this is how I build the CNN

bitter harbor Jul 17, 2020, 10:17 AM

#

why'd you use ReLu

silk axle Jul 17, 2020, 10:17 AM

#

Because the tutorial used relu 🤷

#

I've got no clue what relu/softmax is other than 'a classification algorithm'

#

The tutorial showed how to get it all working, and I'm now extending on it to make it better

bitter harbor Jul 17, 2020, 10:19 AM

#

ok ya I was thinking you just looked it up

#

you should really learn about machine learning before you mess around with prebuild algorithms

silk axle Jul 17, 2020, 10:20 AM

#

I did a while back (about 2 years ago) so I know some basics, like kernels, but most stuff I either didn't learn or forgot

bitter harbor Jul 17, 2020, 10:21 AM

#

because it all involves things like dot multiplication, cost functions, stats, and optimization in machine learning

#

or like matrix manipulation in general

#

3blue1brown has some excellent videos on nn's and linear algebra

silk axle Jul 17, 2020, 10:24 AM

#

I'll check that out, thanks

#

But how do I resolve the above error?

bitter harbor Jul 17, 2020, 10:25 AM

#

print(np.shape(image))

silk axle Jul 17, 2020, 10:26 AM

#

The reversed_image?

#

(28, 28, 1)

bitter harbor Jul 17, 2020, 10:26 AM

#

no the mnist one

silk axle Jul 17, 2020, 10:26 AM

#

Oh, I see

#

mnist is (28, 28)

#

So I need to add the 1?

#

I do that later in the code but ig I should do it here?

#

```py
## Reshape the data to fit the model
x_train = x_train.reshape(list(x_train.shape) + [1])
x_test = x_test.reshape(list(x_test.shape) + [1])
```

bitter harbor Jul 17, 2020, 10:27 AM

#

no

#

mnist database is (60000, 28, 28) each image in that database is (1, 28, 28), the image individually is (28, 28 (number of pixels)) but you're adding your image to the database's list - making it (60001, 28, 28)

silk axle Jul 17, 2020, 10:30 AM

#

Why's that a problem though?

bitter harbor Jul 17, 2020, 10:32 AM

#

you have (28, 28, 1)

#

that's not the same size as (1, 28, 28)

silk axle Jul 17, 2020, 10:33 AM

#

🤔

#

So I want to make both (1, 28, 28, 1)?

#

Since I can't make both (1, 28, 28)

bitter harbor Jul 17, 2020, 10:34 AM

#

what

#

you definitely can

silk axle Jul 17, 2020, 10:34 AM

#

(28, 28, 1) = 28x28 1d
(1, 28, 28) = 1 image 28x28

#

That's not the same

#

The three numbers don't represent the same thing

bitter harbor Jul 17, 2020, 10:35 AM

#

np.resize(image, (1,np.shape(image)))

silk axle Jul 17, 2020, 10:35 AM

#

Which would just make it (1, 28, 28, 1) like I said?

bitter harbor Jul 17, 2020, 10:36 AM

#

no it wouldn't

#

where are you getting the 1 at the end from

silk axle Jul 17, 2020, 10:36 AM

#

The image is (28, 28, 1)

#

Adding a one in front makes it (1, 28, 28, 1)?

bitter harbor Jul 17, 2020, 10:37 AM

#

no you're reshaping it not adding a one

silk axle Jul 17, 2020, 10:37 AM

#

```py
np.concatenate((x_train.reshape(1, x_train.shape), reversed_image))
```
do you mean this?

#

I'm really confused as to what you're saying

bitter harbor Jul 17, 2020, 10:39 AM

#

📎 unknown.png

#

it doesn't change the values

silk axle Jul 17, 2020, 10:40 AM

#

Yea ik, I meant prepending the shape with a 1

#

Ig I just worded badly

bitter harbor Jul 17, 2020, 10:40 AM

#

so when you import the 28 by 28 images as an array, you change the shape so that you 'move' the image into the second+third dimension and 'list' it by making the first equal to 1

#

like I said I'd suggest learning about the topics I mentioned above

#

even a lot of the preprocessing involves them

silk axle Jul 17, 2020, 10:43 AM

#

I still don't get how to solve the issue I've got

#

I need to reshape it, I get that

#

Gimme a sec

#

Surely what you're saying to do would be `np.concatenate((x_train.reshape(tuple([1] + list(x_train.shape))), reversed_image))`?

#

nvm

#

```
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 3 dimension(s)
```

#

I really don't get what I'm doing

bitter harbor Jul 17, 2020, 10:52 AM

#

```py
image = np.reshape(image, (1, 28, 28))
np.concatenate((x_train, image))
```

silk axle Jul 17, 2020, 10:52 AM

#

Right

#

I did that also

#

Got a different error

#

ValueError: cannot reshape array of size 47040000 into shape (1,28,28)

bitter harbor Jul 17, 2020, 10:53 AM

#

sorry other way around

silk axle Jul 17, 2020, 10:54 AM

#

I'm reshaping x_train atm

#

np.concatenate((x_train.reshape(tuple([1] + list(x_train.shape[1:]))), reversed_image))

bitter harbor Jul 17, 2020, 10:54 AM

#

why

silk axle Jul 17, 2020, 10:55 AM

#

wait

#

Reverse image is (28, 28, 1)

#

And I want to make that (1, 28, 28), right?

#

Surely I can just reverse the shape?

#

Okay, that seems to have worked

#

Now error on the next line

#

---> 19     np.concatenate((y_train, number_in_image))
     20 
     21 print(f"Train Shapes: X={x_train.shape}, y={y_train.shape}")

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)

#

Wait nvm, I think I know why

#

Okay I think I got it working now? No errors at least

#

## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
#print(f"MNIST dataset: {x_train.dtype}")
#print(x_train.shape)
## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')
    #print(f"My image: {reversed_image.dtype}")
    #print(resized_image.shape)

    ## Add data to training sets
    np.concatenate((x_train, reversed_image.reshape(tuple(reversed(reversed_image.shape)))))
    np.concatenate((y_train, np.array([number_in_image])))

#

Except it doesn't actually concatenate 🤦

#

Ig I need to assign maybe?

#

yea

#

Seems to have worked :~)

bitter harbor Jul 17, 2020, 11:08 AM

#

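import glob
import os

import numpy as np
from skimage import io
from tensorflow.keras.datasets import mnist  # assumption: the source of mnist.load_data()

(x_train, y_train), (x_test, y_test) = mnist.load_data()
_dir = "/content/drive/My Drive/training numbers/*"
for image_file in glob.glob(_dir):
    # skimage.io.imread reads the file; assumes it is already 28x28 greyscale
    image = io.imread(image_file)
    reversed_image = 1 - np.reshape(image, (1, 28, 28))
    x_train = np.concatenate((x_train, reversed_image))
    # take the label from the file name, not from the full path glob returns
    label = int(os.path.basename(image_file)[0])
    y_train = np.concatenate((y_train, np.array([label])))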

silk axle Jul 17, 2020, 11:10 AM

#

I don't need to /255 because it's already greyscale

#

And concatenate doesn't edit in-place, so I needed to do like x_train = np.concatenate((x_train, reversed_image))
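
For reference, a minimal sketch of the point above: `np.concatenate` returns a new array and leaves its inputs untouched, so the result has to be assigned back. The arrays are small zero-filled stand-ins (an assumption), not the real MNIST data.

```python
import numpy as np

# Stand-ins for the real data (assumption): 5 training images and one custom image
x_train = np.zeros((5, 28, 28), dtype=np.uint8)
reversed_image = np.ones((28, 28, 1), dtype=np.uint8)

# Move the image into a batch of one: (28, 28, 1) -> (1, 28, 28).
# reshape only relabels the axes of the same 784 values; it does not change them.
batch_of_one = reversed_image.reshape(1, 28, 28)

# concatenate returns a new array; x_train itself is unchanged
np.concatenate((x_train, batch_of_one))
shape_without_assignment = x_train.shape  # still (5, 28, 28)

# Assign the result back to actually grow the training set
x_train = np.concatenate((x_train, batch_of_one))
print(shape_without_assignment, x_train.shape)
```

Calling `np.concatenate` once per file copies the whole training set each time; collecting the new images in a list and concatenating once at the end is a common alternative when there are many files.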

#

But yea that's basically what I need ig

#

Thanks for the help 👍

bitter harbor Jul 17, 2020, 11:13 AM

#

that should work now

#

as long as the path is the actual path not just what you have

fleet moth Jul 17, 2020, 11:23 AM

#

Is it possible to create an array of legends and data from my DataFrame?

#

datas = self.db.select().to_numpy() ?

desert oar Jul 17, 2020, 2:32 PM

#

@fleet moth what is the purpose of this?

#

pandas doesn't have a concept of "legends"

mellow tiger Jul 17, 2020, 3:02 PM

#

Howdy. What are some good tools for visualization using Pandas? I'm looking for something more presentable than pandas profiling. Also it needs to be confidentiality compliant, so no Datapane

desert oar Jul 17, 2020, 3:05 PM

#

i just make my own plots w/ matplotlib

#

pandas has a bunch of convenience functions that generate matplotlib plots for you

limpid raft Jul 17, 2020, 3:15 PM

#

Is there a difference between an array of shape (x,1) and an array (x)?

pale thunder Jul 17, 2020, 3:16 PM

#

Yes, you have 2 axes in one case and 1 in the other. It affects quite a bit actually

limpid raft Jul 17, 2020, 3:18 PM

#

but if the length of the second axis is 1, doesn't it become 1D again?

pale thunder Jul 17, 2020, 3:19 PM

#

No, you can have an axis of length 1. It is useful for example when concatenating a (28,28) to a (1000, 28, 28)
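
A short sketch of the difference, using made-up arrays: a `(3, 1)` array and a `(3,)` array hold the same values but broadcast differently, and a length-1 leading axis is what lets a single image be appended to a stack.

```python
import numpy as np

column = np.arange(3).reshape(3, 1)  # shape (3, 1): two axes
flat = np.arange(3)                  # shape (3,): one axis

# Broadcasting treats them differently: (3, 1) + (3,) gives a (3, 3) grid
grid = column + flat

# A length-1 axis is what lets one image join a stack of images
stack = np.zeros((1000, 28, 28))
image = np.ones((28, 28))
stack = np.concatenate((stack, image[np.newaxis, ...]))  # appends a (1, 28, 28) block

print(column.ndim, flat.ndim, grid.shape, stack.shape)
```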

pale marsh Jul 17, 2020, 3:41 PM

#

What's better, having lots of mini dataframes or one big dataframe in pandas ?

verbal ice Jul 17, 2020, 3:44 PM

#

Howdy. What are some good tools for visualization using Pandas? I'm looking for something more presentable than pandas profiling. Also it needs to be confidentiality compliant, so no Datapane
@mellow tiger you can use seaborn

bitter harbor Jul 17, 2020, 3:47 PM

#

@pale marsh it depends on how it’s being generated/used, how large the total dataset is, and what it’s for

pale marsh Jul 17, 2020, 3:49 PM

#

@bitter harbor right now my program takes one big dataframe and splits it into loads of small ones, it gets passed in a list of dataframes to another class which plots graphs (just simple histograms rn) with them using altair

bitter harbor Jul 17, 2020, 3:50 PM

#

What’s the format of the dataframe?

pale marsh Jul 17, 2020, 3:51 PM

#

There are some columns with 15000 rows and I think I'm just using at the most 1/4 of the total dataset I think

#

It gets it from a rosbag

#

Oh wait misread that

bitter harbor Jul 17, 2020, 3:53 PM

#

Can you send the first couple rows?

#

With titles if there are any

pale marsh Jul 17, 2020, 3:54 PM

#

Sorry I don't wanna login to discord on their laptop

#

4000x18 set

bitter harbor Jul 17, 2020, 3:55 PM

#

Of?

#

Numbers?

#

Strings?

pale marsh Jul 17, 2020, 3:56 PM

#

Floats, ints and arrays of floats, and 2D arrays of floats

bitter harbor Jul 17, 2020, 3:57 PM

#

How were you planning to use a histogram for that then?

#

Like unless you do a separate one for each data set idk how possible it’d be

pale marsh Jul 17, 2020, 3:59 PM

#

Right now I split each column into its own dataframe and just histogram it up like that, but with Altair I can just give it a large dataframe and specify which column to use the data from

bitter harbor Jul 17, 2020, 4:01 PM

#

That won’t work with 2d arrays mixed in

#

Or it will, but you won’t be able to use the same function

pale marsh Jul 17, 2020, 4:03 PM

#

Oh yh I was thinking of splitting those off into their own separate dataframe while the single value columns I leave in one big one

#

But idk which is more efficient

#

Think it's supposed to scale up later

bitter harbor Jul 17, 2020, 4:04 PM

#

You’ll have to break them up

#

At least I can’t think of any way to do that

#

Also is there a reason you’re using python 2?

desert oar Jul 17, 2020, 4:07 PM

#

If you are just plotting it doesn't really matter how you organized your data, as long as you understand the code and it's not too complicated for others to understand

#

That said, I don't think I fully understand what you are doing with this data

serene scaffold Jul 17, 2020, 4:20 PM

#

salt rock lamp, do you understand how torch.nn.Softmax works?

#

I'm trying to use it as a loss function

desert oar Jul 17, 2020, 4:22 PM

#

i know how softmax works, i dont know how the intricacies work in torch

#

it's not a loss function

#

it's a "layer"

serene scaffold Jul 17, 2020, 4:22 PM

#

Ah

pale marsh Jul 17, 2020, 4:22 PM

#

Also is there a reason you’re using python 2?
@bitter harbor something to do with the rosbags not being compatible with python 3 I think tho I'm not entirely clear on that

desert oar Jul 17, 2020, 4:22 PM

#

you know what softmax means/does?

serene scaffold Jul 17, 2020, 4:23 PM

#

I remember using it when I took linear algebra but I forgot how it's defined.

desert oar Jul 17, 2020, 4:23 PM

#

ok

#

you've seen a logistic curve?

serene scaffold Jul 17, 2020, 4:23 PM

#

let's see

desert oar Jul 17, 2020, 4:23 PM

#

https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg

#

it compresses the real line to (0, 1)

serene scaffold Jul 17, 2020, 4:23 PM

#

Looks like sigmoid

desert oar Jul 17, 2020, 4:23 PM

#

yeah

#

softmax is the multivariate generalization thereof

#

so it compresses R^n to (0,1)^n

bitter harbor Jul 17, 2020, 4:24 PM

#

Sigmoid is -1,1 softmax/ReLu is 0,1

serene scaffold Jul 17, 2020, 4:25 PM

#

Do I even need it then if my vectors are one dimensional?

desert oar Jul 17, 2020, 4:25 PM

#

what is your output?

#

if you're doing classification you (probably) need it

#

if you're doing regression you (probably) shouldn't use it, unless your regression target has hard upper and lower bounds

serene scaffold Jul 17, 2020, 4:26 PM

#

Mapping between two vector spaces. So the idea is that once the weights are tuned correctly, a length 768 vector will be the right 200 length vector in another space.

desert oar Jul 17, 2020, 4:26 PM

#

both are vector spaces over R though?

#

R^768 -> R^200 ?

serene scaffold Jul 17, 2020, 4:27 PM

#

I'm not sure what that means

desert oar Jul 17, 2020, 4:27 PM

#

ℝ, real numbers

spark stag Jul 17, 2020, 4:27 PM

#

Sigmoid is -1,1 softmax/ReLu is 0,1
@bitter harbor i think sigmoid is also 0-1, hyperbolic tangent is -1, 1, is this what you meant

desert oar Jul 17, 2020, 4:27 PM

#

p sure sigmoid is just "neural network lingo" for logistic function, no?

serene scaffold Jul 17, 2020, 4:27 PM

#

All the elements in the vector are real numbers, yes

desert oar Jul 17, 2020, 4:27 PM

#

the logistic curve is quite literally sigmoid
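
To make the ranges concrete, here is a small NumPy sketch (the helper names `sigmoid` and `softmax` are written out for illustration, not taken from any library): the logistic sigmoid maps into (0, 1), tanh into (-1, 1), and softmax gives entries in (0, 1) that sum to 1.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: maps each real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtracting the max is a common numerical-stability trick; the result is unchanged
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))   # every entry strictly between 0 and 1
print(np.tanh(x))   # every entry strictly between -1 and 1
print(softmax(x))   # entries in (0, 1) that also sum to 1
```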

#

in that case you do not want a sigmoid/logistic/softmax on your output layer. you can have it on the hidden layers

bitter harbor Jul 17, 2020, 4:28 PM

#

Idk why I’ve been normalizing data to -1,1 then

desert oar Jul 17, 2020, 4:28 PM

#

you can also re-parameterize the logistic function to map to (-1, 1)

#

hell you can change the center

#

so you can have (-300, 500) with a center at 100
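
A sketch of that re-parameterization, assuming a hypothetical helper `scaled_logistic`: shifting and stretching the standard logistic moves its output range to any `(low, high)`, with the midpoint of that range as the center.

```python
import numpy as np

def scaled_logistic(x, low, high):
    # Shift and stretch the standard logistic so its outputs fall between low and high
    return low + (high - low) / (1.0 + np.exp(-x))

x = np.array([-50.0, 0.0, 50.0])
print(scaled_logistic(x, -1.0, 1.0))      # approaches -1 and 1, passes through 0
print(scaled_logistic(x, -300.0, 500.0))  # approaches -300 and 500, passes through 100
```

The (-1, 1) version is the same curve as `tanh(x / 2)`.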

#

but why would you bother

#

@bitter harbor idk either 🙂 i dont like normalizing real-valued data. only when it has known and strict upper/lower bounds such as images which are 0-255 for example

#

for classification -1/1 was an old-school thing from when everyone loved SVMs

bitter harbor Jul 17, 2020, 4:30 PM

#

Oh I stg I wasted so much time on doing that to audio

desert oar Jul 17, 2020, 4:30 PM

#

it probably makes sense in specific domains

#

maybe its recommended by audio people

#

i usually work with what youd call "social science data" so thats where my recommendations come from

bitter harbor Jul 17, 2020, 4:31 PM

#

Ah ya idk I remember hearing it somewhere once and just went with it

#

Oh i think it’d be better if you need more certainty with your floats

desert oar Jul 17, 2020, 4:32 PM

#

yeah but why normalize when you can also standardize

#

then you're just shifting and scaling without actually clipping your data

#

which might or might not be relevant depending on your data and model
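
A side-by-side sketch of the two operations on made-up data, taking "normalize" to mean min-max scaling (an assumption about the usage in this thread) and "standardize" to mean z-scoring:

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0, 100.0])  # made-up data with one outlier

# Min-max normalization: rescale into [0, 1] using the observed extremes
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized)
```

With the outlier present, min-max scaling squeezes the other four values close to 0, while the standardized values are not pinned to a fixed interval.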

bitter harbor Jul 17, 2020, 4:33 PM

#

Because standard deviation is gross

#

And I refuse to learn it

desert oar Jul 17, 2020, 4:34 PM

#

hyperlemon

modest rune Jul 17, 2020, 4:34 PM

#

Hopefully an easy question to answer. When writing functions to do some pandas dataframe manipulation. Is it better to: (from a performance perspective)
(a) construct the dataframe outside of the function, pass into the function as a parameter, then modify it.
(b) construct the dataframe inside the function, modify it, then return the dataframe?
(c) both work equally well.

bitter harbor Jul 17, 2020, 4:34 PM

#

Nns are hard enough as is

desert oar Jul 17, 2020, 4:34 PM

#

idk how you expect to do anything with data and not at least know basic stats

#

you dont even need to understand it to scale by it

bitter harbor Jul 17, 2020, 4:35 PM

#

d) use numpy

desert oar Jul 17, 2020, 4:35 PM

#

@modest rune (c)

#

numpy has the same considerations

bitter harbor Jul 17, 2020, 4:35 PM

#

Yes but numpy

desert oar Jul 17, 2020, 4:35 PM

#

...is underneath most pandas ops

#

so if you have mixed data types or you happen to enjoy the use of column names

#

pandas is much easier to work with

#

numpy is for math

bitter harbor Jul 17, 2020, 4:35 PM

#

Yes but numpy

desert oar Jul 17, 2020, 4:35 PM

#

pandas is for data

#

using numpy for data is like using a bit mask instead of kwargs in python

#

it works, but why

bitter harbor Jul 17, 2020, 4:37 PM

#

Math is better anyways and as for data manipulation, I’ve found that with audio/images/not social data there’s barely any sort of stat stuff

desert oar Jul 17, 2020, 4:37 PM

#

yes, that's fine

#

Math is better anyways
so go learn standard deviation 😉

#

i wouldn't use pandas for images nor would i use numpy for HLOC time series ticker data

bitter harbor Jul 17, 2020, 4:38 PM

#

Does pandas have the same sort of flexibility as numpy tho?

#

Because numpy seems to be useful for pretty much everything

modest rune Jul 17, 2020, 4:39 PM

#

OK, well... I took my pandas vectorized math... 3 lines of relatively simple code. Moved it into a function without modifying what it does, and the performance decreased by 35%. I am trying to understand what changed. Any ideas?

desert oar Jul 17, 2020, 4:39 PM

#

what do you mean flexibility? it feels like you're trying to artificially make this into a "vs" argument where none exists

#

pandas is a tool for manipulating tabular data, using numpy under the hood

#

you can use whatever tool you want for whatever purpose you want. i'm just recommending against using plain numpy for most datasets

#

@modest rune can you show your code with some context

#

not just the function

bitter harbor Jul 17, 2020, 4:40 PM

#

Hm ya I didn’t know that but the little data I’ve worked with I’ve used plain numpy

#

I’ll keep that in mind thanks

modern canyon Jul 17, 2020, 4:43 PM

#

Hello y'all, I am building a recommendation system and I have the following features 'genres', 'numVotes', 'averageRating' with the following stats:

mean 7.365355
std 0.588674
min 6.500000
max 9.800000
Name: averageRating, dtype: float64

mean 0.068966
std 0.257881
min 0.000000
max 1.000000
Name: genres (30 classes -> one hot encoded)

mean 75096.11
std 151962.13
min 5000.00
max 2260919.00
Name: numVotes, dtype: float64

How do I normalize these features?
I want to calculate cosine similarity after concatenating all these three features together

bitter harbor Jul 17, 2020, 4:44 PM

#

Normalize the features all together or separately?

modern canyon Jul 17, 2020, 4:44 PM

#

together

#

idk for sure I'm a beginner

bitter harbor Jul 17, 2020, 4:46 PM

#

Normalizing 2 million and 0-1 is gonna give you some pretty small values

modern canyon Jul 17, 2020, 4:47 PM

#

yeah, probably have to normalize separately and weight them accordingly afterwards

#

what do you think?

modest rune Jul 17, 2020, 4:48 PM

#

I had to mess with the code a bit to obscure what it does, but here it is.

Function Call

call_profit_df = self.CalcOptionProfit(options_df, one_df, broker_fees, 
                                      options['underlyingPrice'], investment)

Function

    def CalcOptionProfit(self, options_df, one_df, broker_fees, underlyingPrice, investment):
        profit_df = pd.concat([one_df] * options_df.shape[0], axis=1, ignore_index=True)
        profit_df = (profit_df * underlyingPrice) + options_df['strikePrice']
        profit_df = investment + ((profit_df + options_df['price'] - broker_fees['per_contract']
                            - (broker_fees['percent'] * options_df['price'])) *       
                            options_df['contracts']) + broker_fees['flat']
        
        return profit_df

#

like I said above, the only change between before and now that caused the 30% speed reduction was moving the code from the function call location into the newly created function.

bitter harbor Jul 17, 2020, 4:50 PM

#

yeah, probably have to normalize separately and weight them accordingly afterwards
Why are you normalizing in the first place?

desert oar Jul 17, 2020, 4:51 PM

#

i'd consider normalizing averageRating and standardizing numVotes

modern canyon Jul 17, 2020, 4:51 PM

#

because there are too many outliers in numVotes column

desert oar Jul 17, 2020, 4:51 PM

#

clipping outliers should be considered a separate task

#

from centering/scaling or normalizing the data

modern canyon Jul 17, 2020, 4:51 PM

#

i'd consider normalizing averageRating and standardizing numVotes
@desert oar standardizing? what's that?

desert oar Jul 17, 2020, 4:52 PM

#

subtract sample mean, divide by sample std dev
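
A rough sketch of that suggestion applied to the two numeric columns from the question. The rows are made up, the `cosine_similarity` helper is hypothetical, and the genre one-hot columns and any feature weighting are left out.

```python
import numpy as np
import pandas as pd

# Made-up rows (assumption); only the column names come from the question
movies = pd.DataFrame({
    "averageRating": [6.5, 7.4, 9.8],
    "numVotes": [5000.0, 75000.0, 2260919.0],
})

# Min-max normalize the rating using the observed minimum and maximum
rating = movies["averageRating"]
movies["rating_norm"] = (rating - rating.min()) / (rating.max() - rating.min())

# Standardize the vote count: subtract the sample mean, divide by the sample std dev
votes = movies["numVotes"]
movies["votes_std"] = (votes - votes.mean()) / votes.std()  # pandas std() uses ddof=1

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

features = movies[["rating_norm", "votes_std"]].to_numpy()
print(cosine_similarity(features[0], features[1]))
```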

#

@modest rune i think your function looks fine. it might be slower if you're doing this a very large number of times in a hot loop due to function call overhead

#

but just moving code to a function should not make it slower

modern canyon Jul 17, 2020, 4:52 PM

#

hmm

modest rune Jul 17, 2020, 4:53 PM

#

@desert oar But it does... And I am only calling the function once. I have spent the past 4 days converting all my math to vectorized math... no more loops (at least not yet).

modern canyon Jul 17, 2020, 4:54 PM

#

thanks for the info guys!

desert oar Jul 17, 2020, 4:55 PM

#

@modest rune

        profit_df = pd.concat([one_df] * options_df.shape[0], axis=1, ignore_index=True)

this looks like weird code

#

you're concat-ing the same data frame N times?

#

that looks guaranteed to be slow

modest rune Jul 17, 2020, 4:57 PM

#

I created the function for 1 main reason and 1 secondary reason... (Main Reason): An article I read said it is a good practice so that garbage collection could clean up any unused variables when the variables go out of scope. (2nd Reason): Code cleanliness.

Regarding (Main Reason)... I was worried that my assignment of dataframes to a different name each time I make a modification, might be leading to a lot of copies and making a bigger memory footprint. Is this something I should be worried about, or is pandas good about only making copies when absolutely necessary?

#

@desert oar There might be a better way... but there is a good reason for why I am doing that. Let me see if I can explain.

Profit_Scenarios = [1.2, 1.5, 0.6, 5.0]
Stock_Data = pd.DataFrame( columns = ['ticker', 'Price'],
                            data =  [[  'NFLX',   150.2],
                                     [  'GOOG',   304.1]])
## Desired Results ##
['ticker', 'price']   [          1.2,          1.5,          0.6,          5.0]
[  'NFLX',   150.2]   [f(150.2, 1.2),f(150.2, 1.5),f(150.2, 0.6),f(150.2, 5.0)]
[  'GOOG',   304.1]   [f(304.1, 1.2),f(304.1, 1.5),f(304.1, 0.6),f(304.1, 5.0)]

So, I start this process by first creating a Profit_Scenarios dataframe with the right number of rows to match the Stock_Data dataframe.

#

But... we are getting side tracked... I still don't understand the 30% slowdown

desert oar Jul 17, 2020, 5:08 PM

#

both are good reasons

#

and yes pandas is usually good about not copying data, but not always

#

i dont understand it either. frankly it shouldnt happen

#

it suggests that you made a mistake and changed something during refactoring

#

ah ok so

#

you're just overwriting the same dataframe

modest rune Jul 17, 2020, 5:10 PM

#

it suggests that you made a mistake and changed something during refactoring
@desert oar

Yeah... probably a 10% chance I did something. But, the change was so simple, as I spend more time with eyes on code, I am feeling less like that is happening.

desert oar Jul 17, 2020, 5:10 PM

#

overwriting the same df basically means that the "old" version of the df already goes out of scope

#

x = 1
x = 2

the 1 is out of scope as soon as you re-assign to x

#

so moving to a function in this particular case doesn't help give any hints to the GC
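
A small sketch of what happens on re-assignment, using a hypothetical `Frame` class as a stand-in for a large object. Strictly it is the object losing its last reference that matters, rather than a name going out of scope, and the immediate cleanup shown here is CPython behavior.

```python
import weakref

class Frame:
    """Hypothetical stand-in for a large object such as a DataFrame."""

x = Frame()
old = weakref.ref(x)  # a weak reference does not keep the object alive

x = Frame()           # rebinding x drops the last strong reference to the first Frame

# In CPython the first object is reclaimed right away by reference counting
print(old() is None)
```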

modest rune Jul 17, 2020, 5:12 PM

#

"GC"?

#

oh garbage collection

#

Is there a better way to pull off the math without duplicating those rows? I couldn't think of a different vectorized way to do it.

#

I know I could loop... but that's how I used to do it and it was much much slower

desert oar Jul 17, 2020, 5:14 PM

#

what is one_df

modest rune Jul 17, 2020, 5:14 PM

#

one_df is the same thing as profit_scenarios in my example

desert oar Jul 17, 2020, 5:14 PM

#

ahh

#

ok this actually might be a good case for numpy

#

but have fun keeping track of all those indexes

#

how important is performance? i'd just use a loop personally

modest rune Jul 17, 2020, 5:15 PM

#

Well, it IS working without looping. I am just trying to optimize further.

#

And looping was a 400% slowdown.

desert oar Jul 17, 2020, 5:16 PM

#

profit_scenarios = [1.2, 1.5, 0.6, 5.0]
scenario_outcomes = [compute_profit(stock_data, scenario) for scenario in profit_scenarios]

modest rune Jul 17, 2020, 5:17 PM

#

yeah, that is what I used to do. Took 10 seconds, right now it is taking 2.5 seconds with an even larger data set

desert oar Jul 17, 2020, 5:17 PM

#

Ohhhh so you are just expanding the scenario values to match the data shape

modest rune Jul 17, 2020, 5:17 PM

#

yes

desert oar Jul 17, 2020, 5:18 PM

#

Sec

bitter harbor Jul 17, 2020, 5:18 PM

#

ok this actually might be a good case for numpy
Seems like it

modest rune Jul 17, 2020, 5:18 PM

#

Switching to numpy is on my list of things to do next.

desert oar Jul 17, 2020, 5:19 PM

#

hold on though

#

you might still end up with lots of copies of your data?

#

yeah actually

modest rune Jul 17, 2020, 5:19 PM

#

I am thinking the temporary switch to numpy for this particular function will be trivial... a lot of DataFrame.values and maybe some conversions to float64 here and there.

desert oar Jul 17, 2020, 5:20 PM

#

is options_df repeating the data already? like once per scenario?

modest rune Jul 17, 2020, 5:20 PM

#

you might still end up with lots of copies of your data?
@desert oar

Help me understand. I have difficulties understanding when copies might be occurring and when not.

#

is options_df repeating the data already? like once per scenario?
@desert oar

I don't understand what you mean?

#

What data?

desert oar Jul 17, 2020, 5:21 PM

#

i don't see how your concat code does the same thing as the for loop

#

this might just be me being slow/thick

#

is broker_fees the same shape as options_df?

#

or is broker_fees a dict of scalars

modest rune Jul 17, 2020, 5:23 PM

#

broker_fees is a dict of scalars

desert oar Jul 17, 2020, 5:24 PM

#

are you using calcOptionProfit inside a for loop?

modest rune Jul 17, 2020, 5:24 PM

#

Since the two dataframes have the same height, pandas is looping through options and one behind the scenes.

#

nope... one call

desert oar Jul 17, 2020, 5:25 PM

#

it looks like options_df is something like

profit_scenarios = [1.2, 1.5, 0.6, 5.0]

stock_data = pd.DataFrame( columns = ['ticker', 'price'],
                            data =  [[  'NFLX',   150.2],
                                     [  'GOOG',   304.1]])

options_df = pd.concat([stock_data] * len(profit_scenarios), axis=1)

modest rune Jul 17, 2020, 5:25 PM

#

That is what I have learned about pandas and numpy. The bigger the set of data you can pass at one time, the more time you save.

desert oar Jul 17, 2020, 5:25 PM

#

otherwise this doesn't make sense to me

modest rune Jul 17, 2020, 5:32 PM

#

one_series = [1.2, 1.5, 0.6, 5.0]
options_df = pd.DataFrame( columns = ['ticker', 'Price'],
                            data =  [[  'NFLX',   150.2],
                                     [  'GOOG',   304.1]])

one_df = pd.concat([stock_data] * len(one_series), axis=1)
## Result ##
['ticker', 'price']
[  'NFLX',   150.2]   [          1.2,          1.5,          0.6,          5.0]
[  'GOOG',   304.1]   [          1.2,          1.5,          0.6,          5.0]

new_df = one_df.transpose() * options_df['price']
## Result ##
['ticker', 'price']
[  'NFLX',   150.2]   [    1.2*150.2,    1.5*150.2,    0.6*150.2,    5.0*150.2]
[  'GOOG',   304.1]   [    1.2*304.1,    1.5*304.1,    0.6*304.1,    5.0*304.1]

#

Only, I think you have to use one_df.transpose() to get pandas to do the math right. edited pseudocode to show that.
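
One possible way to get the same grid without building the repeated frame is NumPy broadcasting. This is only a sketch: plain multiplication stands in for the real `f(price, scenario)`, and the names follow the pseudocode above.

```python
import numpy as np
import pandas as pd

profit_scenarios = np.array([1.2, 1.5, 0.6, 5.0])
stock_data = pd.DataFrame({"ticker": ["NFLX", "GOOG"], "price": [150.2, 304.1]})

# (2, 1) * (4,) broadcasts to (2, 4): one row per stock, one column per scenario
prices = stock_data["price"].to_numpy()[:, np.newaxis]
grid = prices * profit_scenarios

# Wrap the result back up so the rows and columns keep their labels
result = pd.DataFrame(grid, index=stock_data["ticker"], columns=profit_scenarios)
print(result)
```

Nothing is concatenated here, so the scenario values are never physically repeated once per stock.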

#

@desert oar Good news! I was chasing a ghost! I am using pyinstrument, which is a great profiler but it isn't deterministic. I reverted back to my non-function implementation and I am seeing the same slowdown. There must be something slowing my laptop down.

#

Sorry for dragging you all along for the journey.

#

I do want to go back to this though...
"you might still end up with lots of copies of your data?"
"is options_df repeating the data already? like once per scenario?"

Do you see something suspicious from a performance perspective that I should be aware of?

#

or at least investigate?

#

Gonna blame the slowdown on windows update. Now the function implementation is just as fast. 🙂

earnest wadi Jul 17, 2020, 5:55 PM

#

I really don't understand this

def forward(self, inputs):
        print (inputs)
        print (self.weights)
        self.output = sigmoid(np.dot(inputs, self.weights) + self.biases)

📎 unknown.png

modest rune Jul 17, 2020, 5:59 PM

#

You need to transpose the second variable in your dot operation

#

so that it is (2,1) and (1,2)

#

The inner dimensions of a dot operation need to be equal
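
A quick illustration of the inner-dimension rule with placeholder arrays of ones: for 2-D inputs, `np.dot(a, b)` needs `a.shape[1] == b.shape[0]`, and the result has shape `(a.shape[0], b.shape[1])`.

```python
import numpy as np

a = np.ones((2, 1))
b = np.ones((1, 2))

print(np.dot(a, b).shape)  # (2, 1) dot (1, 2): inner sizes 1 and 1 match
print(np.dot(b, a).shape)  # (1, 2) dot (2, 1): inner sizes 2 and 2 match

try:
    np.dot(a, a)           # (2, 1) dot (2, 1): inner sizes 1 and 2 differ
except ValueError as err:
    print("ValueError:", err)

print(np.dot(a, a.T).shape)  # transposing the second operand makes it (1, 2)
```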

earnest wadi Jul 17, 2020, 6:07 PM

#

alright, that worked

#

now this

#

📎 unknown.png

modest rune Jul 17, 2020, 6:09 PM

#

I am not well versed in matrix math, only knew the answer because I had just run into the same issue the other day. But... my guess is that, broadcasting (I don't know what broadcasting is), for whatever reason, needs the two values to have the same shape.

#

From what little I know about matrix math. Addition and subtraction can only happen on matrices with the same shape

earnest wadi Jul 17, 2020, 6:09 PM

#

oh, haha, alright, ill have a play around

modest rune Jul 17, 2020, 6:09 PM

#

Yours are not the same shape.

#

(2,1) and (2,2). My guess is that your (2,2) matrix was the result of your dot operation. (2,1) dot (1,2) produces a matrix of shape (2,2)

earnest wadi Jul 17, 2020, 6:11 PM

#

oops

#

self.delta: [[-3.24185123e-14 -3.48982562e-12] [ 7.88860905e-31 3.86714987e-07]]
inputs.t [[1.00000000e+00 1.00000000e+00] [1.86810924e-06 3.86715286e-07]]

def backward(self, inputs, outputs):
        self.delta = (outputs - self.output) * (self.output * (1 - self.output))
        print (self.delta)
        print (inputs.T)
        self.weights += inputs.T.dot(self.delta)

any clue?

#

they both look the same shape to me

#

they both look 2,2

#

tbf idrk what im talking abt

#

haha

modest rune Jul 17, 2020, 6:14 PM

#

That was not helpful, because I cannot know the shape of inputs and outputs by looking at the code. FYI, you can take your numpy array and print the .shape attribute if you want to quickly see its shape. EX: print(inputs.shape)

earnest wadi Jul 17, 2020, 6:14 PM

#

alr

#

they are both (2, 2)

#

hmmmmmmmmmmmmmmmmm

#

oh

#

wait

#

im dum

#

self.weights is (2, 1)

desert oar Jul 17, 2020, 6:17 PM

#

@modest rune my question isn't about performance, it's that your code doesn't look like it would emit the right result. Can you provide some sample data and the expected outputs?

modest rune Jul 17, 2020, 6:18 PM

#

@desert oar only if I still have the code I used to test it out. Let me see. Otherwise, I already validated the data and it will take too much time to mock something up.

desert oar Jul 17, 2020, 6:18 PM

#

Just some made up option ticker data?

earnest wadi Jul 17, 2020, 6:19 PM

#

salt rock, you got any idea about my lil problem?

modest rune Jul 17, 2020, 6:20 PM

#

I still have it...

import pandas as pd
import numpy as np

gain_scenarios = pd.Series([0.34, 0.21, 0.56, 0.11, .54, 1.6, 0.88, 0.01, 0.5])
scalar = 52.0

stock_data = pd.DataFrame(columns =  ['Ticker', 'Shares', 'Cost_Per_Share'],
                             data = [['NFLX'  , 100.0     , 0.10          ],
                                     ['AAPL'  , 150.0    , 0.20           ],
                                     ['GOOG'  , 500.0     , 5.10          ],
                                     ['F'     , 70.0      , 7.10          ],
                                     ['BKSR'  , 130.0     , 0.90          ],
                                     ['AMZN'  , 90.0      , 5.10          ]])

gain_expanded = pd.concat([gain_scenarios] * stock_data.shape[0], axis=1, ignore_index=True)
print(gain_expanded.shape)
print(gain_expanded)
gain_expanded = ((gain_expanded + 1) * scalar)
print(gain_expanded)
gain_expanded = gain_expanded - stock_data['Shares']
print(gain_expanded)

#

@earnest wadi I can help, but, please print the shape of inputs and outputs

#

I mean, let me know what they are.

desert oar Jul 17, 2020, 6:22 PM

#

^ this

earnest wadi Jul 17, 2020, 6:22 PM

#

inputs (2, 2)
outputs (2, 2)
weights (2, 1)

#

weights is the problem

desert oar Jul 17, 2020, 6:23 PM

#

Seems ok

#

inputs @ weights should work

earnest wadi Jul 17, 2020, 6:23 PM

#

?

modest rune Jul 17, 2020, 6:23 PM

#

self.weights and self.delta

earnest wadi Jul 17, 2020, 6:23 PM

#

(2, 2)
(2, 2)
(2, 1)
Traceback (most recent call last):
File "a:/Python/Libraries/test.py", line 13, in <module>
network.run(X, y, epochs=90)
File "a:\Python\Libraries\main.py", line 55, in run
layer.backward(X, y)
File "a:\Python\Libraries\main.py", line 33, in backward
self.weights += inputs.T.dot(self.delta)
ValueError: non-broadcastable output operand with shape (2,1) doesn't match the broadcast shape (2,2)

modest rune Jul 17, 2020, 6:23 PM

#

nevermind, you are defining delta

earnest wadi Jul 17, 2020, 6:24 PM

#

def backward(self, inputs, outputs):
        self.delta = (outputs - self.output) * (self.output * (1 - self.output))
        print (self.delta.shape)
        print (inputs.shape)
        print (self.weights.shape)
        self.weights += inputs.T.dot(self.delta)

modest rune Jul 17, 2020, 6:24 PM

#

what is "self.output", that is a different variable than outputs

earnest wadi Jul 17, 2020, 6:25 PM

#

self.output is the output for the layer in question, outputs is the output of the whole neural network up to said layer

#

tryna use this code to work with my current script

📎 unknown.png

#

this guy's isn't written amazingly, as it is fixed and you have to manually add and remove code to add layers; mine just uses classes and functions

#

class layer_dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = sigmoid(np.dot(inputs, self.weights.T) + self.biases)

    def backward(self, inputs, outputs):
        self.delta = (outputs - self.output) * (self.output * (1 - self.output))
        print (self.delta.shape)
        print (inputs.shape)
        print (self.weights.shape)
        self.weights += inputs.T.dot(self.delta)

modest rune Jul 17, 2020, 6:27 PM

#

Sorry all these numbers flying around... I have lost track of everything. Can you clearly tell me the values of these:
self.output.shape
inputs.shape
outputs.shape (edited to add this)
self.weights.shape (before the function is run)

earnest wadi Jul 17, 2020, 6:27 PM

#

alr

desert oar Jul 17, 2020, 6:28 PM

#

For future reference, the best way to get help is with a minimal reproducible example

#

Sample data + code that reproduces the error

earnest wadi Jul 17, 2020, 6:28 PM

#

self.output.shape (2, 2)
inputs.shape (2, 2)
outputs.shape (2, 1)
self.weights(2, 1)

#

oh

modest rune Jul 17, 2020, 6:29 PM

#

please include the name of the variable... that way I don't have to make assumptions

#

And, if you didn't notice, I snuck in outputs.shape too

earnest wadi Jul 17, 2020, 6:29 PM

#

there

modest rune Jul 17, 2020, 6:29 PM

#

Please, the shape of self.weights, not the values

earnest wadi Jul 17, 2020, 6:30 PM

#

you asked for the values -_-

#

alr

modest rune Jul 17, 2020, 6:30 PM

#

oh, I did 🙂

#

my typo!

earnest wadi Jul 17, 2020, 6:30 PM

#

self.output.shape (2, 2)
inputs.shape (2, 2)
outputs.shape (2, 1)
self.weights.shape(2, 1)

modest rune Jul 17, 2020, 6:30 PM

#

Thanks!

#

which line is line # 13 in your code?

earnest wadi Jul 17, 2020, 6:32 PM

#

network.run(X, y, epochs=90) which is this

#

```python
class network:
    def __init__(self, io):
        total = len(io)
        layers = []
        for i in range(total):
            layers.append(layer_dense(io[i][0], io[i][1]))
        self.layers = layers

    def run(self, X, y, epochs):
        for r in range(epochs):
            i = -1
            for layer in self.layers:
                layer.forward(X)
                X = layer.output
                i += 1
                layer.backward(X, y)
        self.output = self.layers[i].output
```

modest rune Jul 17, 2020, 6:33 PM

#

Dumb question, I didn't need that. Thanks though 🙂

earnest wadi Jul 17, 2020, 6:33 PM

#

lol

modest rune Jul 17, 2020, 6:34 PM

#

I didn't think this was possible:
(outputs - self.output)

#

since outputs is (2,1) and self.output is (2,2)

earnest wadi Jul 17, 2020, 6:35 PM

#

oh yeah

#

uh

modest rune Jul 17, 2020, 6:35 PM

#

but... like I said before, I am not an expert with matrix math, so take what I write with a grain of salt

earnest wadi Jul 17, 2020, 6:35 PM

#

haha

modest rune Jul 17, 2020, 6:36 PM

#

maybe numpy is able to handle that situation and makes some assumptions about what you are trying to do.

earnest wadi Jul 17, 2020, 6:36 PM

#

maybe

modest rune Jul 17, 2020, 6:37 PM

#

@desert oar Were you able to verify my code works on your end?

desert oar Jul 17, 2020, 6:44 PM

#

Numpy will maybe broadcast the mismatched vector

#

No i havent

#

I need to head offline soon. @ me in a few hours

modest rune Jul 17, 2020, 6:45 PM

#

k

earnest wadi Jul 17, 2020, 6:45 PM

#

im gonna take a break as well, unless you have any final ideas?

modest rune Jul 17, 2020, 6:45 PM

#

yes, I have an idea

earnest wadi Jul 17, 2020, 6:45 PM

#

im all ears

modest rune Jul 17, 2020, 6:46 PM

#

the error ValueError: non-broadcastable output operand with shape (2,1) doesn't match the broadcast shape (2,2) is pretty specific

#

It is saying that self.weights is (2,1) and the other variables on that line of code are (2,2), and that normally this won't work because of matrix math. But... NumPy is nice enough to make assumptions for you by doing what they call broadcasting (as salt rock lamp mentioned)

#

BUT

#

They are refusing to do broadcasting for the += operation.

earnest wadi Jul 17, 2020, 6:47 PM

#

so

#

should I do

#

what

modest rune Jul 17, 2020, 6:49 PM

#

Well, if it were me, I would assume that you shouldn't be mixing (2,1) and (2,2) in the first place. Maybe they are all supposed to be (2,1) or all supposed to be (2,2), maybe you ended up with a matrix shape that is incorrect somewhere along the line.

#

This will require you understand what your algorithm is trying to do.

#

Not a clean answer I know, best I can do though. Good LUCK!

earnest wadi Jul 17, 2020, 6:50 PM

#

thanks for your help 😄

#

ahhahahaha

#

@modest rune

#

I made

#

self.weights += blah

#

to self.weights = self.,weights + blah

modern canyon Jul 17, 2020, 6:51 PM

#

clipping outliers should be considered a separate task
@desert oar how do I handle outliers?

earnest wadi Jul 17, 2020, 6:52 PM

#

and it worked

#

hahahahaha

modest rune Jul 17, 2020, 6:53 PM

#

there you go! The error only said += won't work... never said B = B + A won't work
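
A small NumPy sketch of why that rewrite behaves differently, using made-up arrays with the shapes reported earlier in this thread (`weights` and `update` are illustrative names, not the original variables). In-place `+=` has to store its result in the existing (2, 1) array, so NumPy refuses. `weights = weights + update` instead builds a new (2, 2) array and rebinds the name, so the error disappears but the weight matrix quietly changes shape:

```python
import numpy as np

weights = np.zeros((2, 1))   # same shape as self.weights in the thread
update = np.ones((2, 2))     # same shape as inputs.T.dot(self.delta)

in_place_failed = False
try:
    weights += update        # would have to store a (2, 2) result in a (2, 1) array
except ValueError:
    in_place_failed = True

weights = weights + update   # allocates a new (2, 2) array and rebinds the name
new_shape = weights.shape
```

So the rewritten line runs, but it is probably worth checking whether a (2, 2) weight matrix is what the layer is meant to end up with.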

earnest wadi Jul 17, 2020, 6:53 PM

#

xd

desert oar Jul 17, 2020, 6:59 PM

#

@modern canyon that's a big topic

modern canyon Jul 17, 2020, 7:00 PM

#

I see

#

is there any scikit-learn function that can do it for me?

desert oar Jul 17, 2020, 7:02 PM

#

err

#

https://scikit-learn.org/stable/modules/outlier_detection.html

#

actually it looks like they've added a lot more useful functionality for outlier detection

modern canyon Jul 17, 2020, 7:05 PM

#

thanks!

desert oar Jul 17, 2020, 7:07 PM

#

this is much more complete than in the past

#

and this is a nice user guide. lots of pretty pictures

#

the sklearn user guide is turning into quite a nice document

queen barn Jul 17, 2020, 7:11 PM

#

I have a pretty broad question, but I hope that there is a data science pro that's nerdy enough to find some joy in helping me out. I'm working on an analysis with a data set of about 4500 rows, 95 categorical variables (optional configurations for a product), and a binary output "did it fail or not". What kind of approach would help me figure out which of these categorical variables is more correlated to failures?

#

It's not quite a multiple linear regression because while there are multiple factors that need to be considered, I'm more interested in which of the individual factors contribute to failures more often, and thus are better indicators of failures.

bitter harbor Jul 17, 2020, 7:16 PM

#

Is the output for each variable or for each row

queen barn Jul 17, 2020, 7:18 PM

#

No, one ID, many categories, binary result.

#

That's the row configuration.

bitter harbor Jul 17, 2020, 7:20 PM

#

I’d say tag the categories for when it fails, so maybe append them to a list? and checking which item(s) shows up the most in the list

queen barn Jul 17, 2020, 7:25 PM

#

Yeah, I can definitely do that manually, but I'm concerned about that approach for two reasons. 1) gotta be a better way to do that and 2) that really only captures if a category was involved in an outcome that was a failure, not whether or not it's a factor in predicting failures. Will it be more likely that if that variable is involved that a failure will happen? Yes, but how much more likely? How strong is the correlation? Are there other variables that were also involved in the failure that are more likely to be correlated to the failure? Is it the combination of the variables that makes the failure more likely?

bitter harbor Jul 17, 2020, 7:26 PM

#

Those can be answered in post processing

#

```python
factors = []
for row in dataset:  # each row is one product configuration
    if row[3] == failed:  # `failed` is whatever value marks a failure
        factors.append(row[2])
```

queen barn Jul 17, 2020, 7:38 PM

#

I can definitely do that. Do you mind elaborating on how that would be answered in post processing?

bitter harbor Jul 17, 2020, 7:47 PM

#

No sorry but it would depend on what you’re trying to calculate

queen barn Jul 17, 2020, 7:49 PM

#

Do you need me to elaborate more on my end?

bitter harbor Jul 17, 2020, 7:50 PM

#

Nope

#

You’ll have a list and you’ll have the ability to count how many times a factor comes up

#

Correlation might be a bit harder to define considering its 95 variables to 1 outcome

#

Finding math-oriented statistical properties on the other hand would be as simple as just performing basic operations
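
One way the counting idea could be turned into a comparison is to compute the failure rate for each level of each categorical variable and set it against the overall failure rate. This is only a sketch on made-up data: the column names `color`, `motor` and `failed` are hypothetical, and it looks at one variable at a time, so it says nothing about combinations of variables.

```python
import pandas as pd

# Made-up data: two hypothetical option columns and a binary outcome.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "blue", "red"],
    "motor": ["A", "B", "A", "B", "B", "B"],
    "failed": [1, 0, 0, 1, 1, 0],
})

baseline = df["failed"].mean()  # overall failure rate

rows = []
for col in ["color", "motor"]:
    # Failure rate and sample size for every level of this variable.
    stats = df.groupby(col)["failed"].agg(["mean", "count"])
    for level, row in stats.iterrows():
        rows.append({
            "variable": col,
            "level": level,
            "failure_rate": row["mean"],
            "n": int(row["count"]),
            "lift": row["mean"] / baseline,  # >1 means fails more often than average
        })

summary = pd.DataFrame(rows).sort_values("lift", ascending=False)
print(summary)
```

With 95 variables the same loop would run over every categorical column; levels with a small `n` deserve caution because their rates are noisy.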

mossy sand Jul 17, 2020, 8:14 PM

#

Not sure if this would be the correct channel. Was contemplating the idea of webscraping individual house value estimates. Like if a homeowner wanted to check the value of their house.

I know Zillow has an API. Not sure if that would be the best route

Thoughts? Different channel?

bitter harbor Jul 17, 2020, 8:17 PM

#

I’d put your question in new help channel

#

Might get more people viewing it

mossy sand Jul 17, 2020, 8:18 PM

#

Okay, thank you.

terse torrent Jul 17, 2020, 9:53 PM

#

Is there a way to have Pandas import all Excel sheets without having to label each one individually?

desert oar Jul 17, 2020, 10:09 PM

#

@terse torrent https://github.com/pandas-dev/pandas/blob/v1.0.5/pandas/io/excel/_base.py#L784

#

with pd.ExcelFile('mybook.xlsx') as efile:
    dfs = {sheet: efile.parse(sheet) for sheet in efile.sheet_names}

#

that gives you a dict mapping sheet names to data frames
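
For comparison, `pd.read_excel` can return the same kind of dict in one call when `sheet_name=None` is passed. A minimal sketch, where `load_all_sheets` is just an illustrative helper name:

```python
import pandas as pd


def load_all_sheets(path):
    """Return a dict mapping each sheet name to its DataFrame."""
    # sheet_name=None asks pandas to read every sheet in the workbook.
    return pd.read_excel(path, sheet_name=None)
```

Usage would look like `dfs = load_all_sheets("mybook.xlsx")`, followed by `dfs["Sheet1"]` for a single sheet. Reading `.xlsx` files needs an Excel engine such as openpyxl to be installed.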

terse torrent Jul 17, 2020, 10:17 PM

#

Allright thanks. Because I got over 100 sheets

modest rune Jul 17, 2020, 10:28 PM

#

Looking for advice to speedup this function:

def CalcOptionProfit2(self, options_df, profit_df, broker_fees, underlyingPrice, investment):
   profit_df = (options_df['putCall_float'] * (profit_df + 1) * underlyingPrice) - 
                options_df['strikePrice']
   profit_df = investment + ((profit_df - options_df['price'] - broker_fees['per_contract']
                            - (broker_fees['percent'] * options_df['price'])) * 
                            options_df['contracts']) - broker_fees['flat']

options_df: 3,000 x 9, mixed datatypes
profit_df: 3,000 x 100, float64
broker_fees: Dict of scalars
underlyingPrice: Scalar
investment: Scalar

This function runs once. Takes 4 seconds to run. I am starting to think this is not the most efficient way to do this math in pandas. I say that, because I compare these two lines of code to all of the other pandas math I am doing, and I am surprised it takes 4 seconds to run.

I have tried for loops and the apply function (using a custom function). I cannot promise I tried those approaches the right way, but when I did try those approaches, things were significantly slower... like 12 seconds instead of 4.

Any advice or links to documents I should read are greatly appreciated!

#

The rest of my code runs in 0.180 seconds... it is just these 2 lines that are giving me a headache.

modest rune Jul 17, 2020, 10:58 PM

#

Ideas I have:
A. Switch to numpy for the calc. Not sure how to do this.
B. Combine 2 dataframes into one using groupby

lapis sequoia Jul 17, 2020, 11:01 PM

#

I want to output matrix like this
9 8 7 6
10 11 12 5
1 2 3 4

I want to output spiral matrix...

this is my first message in this server... can someone help me?

!!! i=x, j=x
for another example =>
if I give the function the arguments 3, 5, it must print this:

11 10 9 8 7
12 13 14 15 6
1 2 3 4 5
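
Reading the two examples above, the numbers appear to start in the bottom-left corner, run right along the bottom row, then go up the right edge, left along the top, and spiral inward. A sketch under that assumption (the function name `spiral` is made up here):

```python
def spiral(rows, cols):
    """Fill a rows x cols grid with 1..rows*cols, spiralling inward
    from the bottom-left corner: right, up, left, down, and repeat."""
    grid = [[0] * cols for _ in range(rows)]
    moves = [(0, 1), (-1, 0), (0, -1), (1, 0)]  # right, up, left, down
    r, c, d = rows - 1, 0, 0
    for value in range(1, rows * cols + 1):
        grid[r][c] = value
        nr, nc = r + moves[d][0], c + moves[d][1]
        # Turn when the next cell is off the grid or already filled.
        if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc]:
            d = (d + 1) % 4
            nr, nc = r + moves[d][0], c + moves[d][1]
        r, c = nr, nc
    return grid


for line in spiral(3, 5):
    print(" ".join(str(n) for n in line))
```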

modest rune Jul 17, 2020, 11:19 PM

#

Other hints I have to why something is wrong:
A. Each time I add a dataframe column to my equation (ex. `options_df['strikePrice']`) the runtime increases by about 0.5 seconds... that seems like a hefty increase for an item that is on the same row as the other items in the dataframe.
B. I decreased my dataset from 6500 x 9 to 3000 x 9 and only saw a 25% speed improvement. How does that make any sense?

desert oar Jul 18, 2020, 12:38 AM

#

@modest rune this is more helpful let me see what i can do here

pastel compass Jul 18, 2020, 12:39 AM

#

I'm currently training a seq2seq model and I am trying to diagnose my fairly stagnant losses. Are they a result of any of the following?:

My learning rate is too low
I should use a different criterion or optimizer
I messed up somewhere in the code

📎 unknown.png

desert oar Jul 18, 2020, 12:39 AM

#

honestly this is about as efficient as you can make it. the only other option is to use something like numexpr

modest rune Jul 18, 2020, 12:39 AM

#

Since my comment on here, I have started trying to do the math with numpy... seems to be MUCH faster, but I haven't validated my output yet though

desert oar Jul 18, 2020, 12:39 AM

#

@modest rune numexpr "compiles" all your operations so you don't have all these intermediate results

#

and yes raw numpy will be faster

#

because pandas does a lot of work to align indices

#

whereas numpy is purely position-based

#

but numexpr will probably be more efficient than both
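
A tiny illustration of the index-alignment point, on made-up Series: pandas matches values by label before adding, while the underlying NumPy arrays are added purely by position. The alignment step is the extra work being described.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])

by_label = s1 + s2                            # pandas lines up the labels first
by_position = s1.to_numpy() + s2.to_numpy()   # numpy adds element by element

print(by_label)
print(by_position)
```

Here `by_label["a"]` is 1 + 30, whereas the first element of `by_position` is 1 + 10.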

modest rune Jul 18, 2020, 12:41 AM

#

Well... I hope what I am seeing is correct and that my output is correct. Right now, those 2 lines run in 0.011 seconds with Numpy, versus 4 seconds with pandas.

desert oar Jul 18, 2020, 12:41 AM

#

heh

#

thats a bigger difference than i expected

modest rune Jul 18, 2020, 12:41 AM

#

me too

desert oar Jul 18, 2020, 12:41 AM

#

ahhh hold on

#

options_df['putCall_float'] * (profit_df + 1)

this might be a much more expensive broadcasting operation in pandas vs in numpy

#

what's the idea here, you want to multiply each column of the profit_df matrix by the options_df['putCall_float'] vector?

modest rune Jul 18, 2020, 12:42 AM

#

that particular one is sort of a workaround.

#

I have an equation that is ALMOST the same for puts and calls, except I have to negate part of the equation.

desert oar Jul 18, 2020, 12:43 AM

#

this is what i was trying to understand before. what does each row and column of profit_df represent

modest rune Jul 18, 2020, 12:43 AM

#

So, I created a column that stores either 1 or -1 depending on whether or not it is a put or call.

#

profit scenarios. Can't go into more details than that. That is the magic sauce.

desert oar Jul 18, 2020, 12:44 AM

#

i don't really care what they are from that perspective

#

i mean, is each row a list of parameters/scenarios that you want to try, and you're duplicating that list over and over?

#

or something like that

#

is each row an hour? a different ticker label?

#

i just need to know what each one represents, i dont care about the magic sauce so to speak

#

make up things if you want i just need some context for the problem

modest rune Jul 18, 2020, 12:45 AM

#

each element is a percentage. The rows are duplicated, but will all be different in the end after the options_df gets ahold of everything in the math.

desert oar Jul 18, 2020, 12:46 AM

#

i see

#

so each column is one "parameter" (corresponding to the "profit scenarios" in your earlier example)?

modest rune Jul 18, 2020, 12:46 AM

#

yes

desert oar Jul 18, 2020, 12:47 AM

#

profit_df = pd.DataFrame([
    [1, 2, 3],
    [1, 2, 3],
    ...
], columns=['A', 'B', 'C'])

like that?

#

and yes i appreciate that you need to be cautious about revealing the magic sauce, trust me i'm not trying to get you to reveal any of it

modest rune Jul 18, 2020, 12:50 AM

#

Yes, that is basically profit_df, with A, B, C representing different profit scenarios.

desert oar Jul 18, 2020, 12:50 AM

#

great

#

let me spend a minute with the numpy docs

#

that said if < 1 sec is good enough for you

#

i have a code snippet for you

modest rune Jul 18, 2020, 12:51 AM

#

Yes. the current speed is excellent, but, if you have additional suggestions, I'd love to hear them.

#

I am still quite surprised that pandas was so slow. Makes me reluctant to use pandas in the future.

desert oar Jul 18, 2020, 12:56 AM

#

its because of the indexing

#

you can always drop down to numpy for better performance, ill show you

#

i think the organizational benefits of pandas makes it worth using for most things

#

then inside your functions you can switch to numpy

#

or again, numexpr

#

ok this is easier than i thought

idle otter Jul 18, 2020, 1:12 AM

#

does environment.yml have to always be named environment.yml?

desert oar Jul 18, 2020, 1:12 AM

#

no, as long as you always refer to it by name with -f

idle otter Jul 18, 2020, 1:12 AM

#

thank you

desert oar Jul 18, 2020, 1:12 AM

#

conda env update -n myenv -f env1.yaml

idle otter Jul 18, 2020, 1:12 AM

#

ty

desert oar Jul 18, 2020, 1:14 AM

#

so @modest rune numpy broadcasting has two rules: the dimensions that match are preserved, and the dimensions that are 1 are broadcast

#

that's what the solution i'm writing makes use of

#

this is good practice for me too btw i haven't written code like this in a while
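
The two broadcasting rules can be seen with a column vector and a row vector. The values below are made up; the shapes mirror the idea of one value per option row and one value per profit scenario:

```python
import numpy as np

column = np.array([1.0, -1.0, 1.0]).reshape(-1, 1)    # (3, 1): one value per option
row = np.array([0.0, 0.1, 0.2, 0.3]).reshape(1, -1)   # (1, 4): one value per scenario

# The size-1 dimensions are stretched, so (3, 1) * (1, 4) gives (3, 4)
# without ever duplicating the row by hand.
grid = column * (row + 1)
print(grid.shape)
```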

modest rune Jul 18, 2020, 1:16 AM

#

Ahh... I think I need to read up on broadcasting. I am guessing your code does the same as mine but avoids the whole duplicating rows bit. Is my guess correct?

desert oar Jul 18, 2020, 1:24 AM

#

yep hang on

#

@modest rune i think your code is missing a close )

modest rune Jul 18, 2020, 1:25 AM

#

maybe... I definitely had a variable in the wrong place... that putCall_float variable. I had added it without validating the data. Doing all of that now.

#

I gotta run. You mind PMing me the code snippet? That way it is easier for me to find in the future. Post it here too, in the rare chance someone is following our conversation.

desert oar Jul 18, 2020, 1:27 AM

#

sure

#

https://paste.pythondiscord.com/qoduxenezo.py

#

https://repl.it/@maximum__/numpy-broadcasting-sandbox demo of broadcasting

#

import numpy as np
import pandas as pd


def CalcOptionProfit2(self, options_df, profit_scenarios, broker_fees, underlying_price, investment):
    """ Calculate option profit scenarios

    Arguments:
        options_df: DataFrame with columns 'putCall' (bool), 'strikePrice' (float), 'price' (float), and 'contracts' (float)
        profit_scenarios: List of floats
        broker_fees: Dict of broker fees with keys 'percent', 'flat', and 'per_contract'
        underlying_price: Float
        investment: Float

    Returns:
        DataFrame with option profits corresponding to the profit scenarios
    """
    # Capture the original data index to use at the end
    original_index = options_df.index

    # Convert option data into column vectors
    # (note: a bool 'putCall' column becomes 0.0/1.0 here; use a +1/-1 column
    # such as 'putCall_float' if the sign flip between puts and calls is needed)
    putcall_vec = options_df['putCall'].to_numpy('float32').reshape((-1, 1))
    strikeprice_vec = options_df['strikePrice'].to_numpy('float32').reshape((-1, 1))
    price_vec = options_df['price'].to_numpy('float32').reshape((-1, 1))
    contracts_vec = options_df['contracts'].to_numpy('float32').reshape((-1, 1))

    # Convert profit_scenarios into a row vector
    profit_scenarios_vec = np.asarray(profit_scenarios, dtype='float32').reshape((1, -1))

    fee_per_contract = broker_fees['per_contract']
    fee_percent = broker_fees['percent']
    fee_flat = broker_fees['flat']

    result = (putcall_vec * (profit_scenarios_vec + 1) * underlying_price) - strikeprice_vec
    result = investment + ((result - price_vec - fee_per_contract - (fee_percent * price_vec)) * contracts_vec) - fee_flat

    # Wrap the (n_options, n_scenarios) array in a DataFrame on the original index
    return pd.DataFrame(result, index=original_index)

paste contents

frank bone Jul 18, 2020, 2:56 AM

#

how can i pass a variable from function1 to function2? I thought it was as easy as this...apparently not?

```python
def function1():
    a = 1
    function2(a)

def function2(a):
    ...  # do some stuff with variable a from function1

function1()
```

#

seems like i have to return the value then save it as a global variable, my code works now, is that the best/only way to do it?

desert oar Jul 18, 2020, 3:10 AM

#

No

#

Sounds like you need to go through Beazley's python course

#

https://dabeaz-course.github.io/practical-python/

#

But this is more help channel content than anything

#

#❓｜how-to-get-help

frank bone Jul 18, 2020, 3:13 AM

#

thanks for the link, will go through the course. I've only been working by watching a 1 hour tut + googling the rest 😄

#

but quick answer maybe?

desert oar Jul 18, 2020, 3:44 AM

#

it's a bit more than a quick answer

#

i'll try

#

if you define a function that accepts input, you must write that in the definition

#

!e ```python
def function1():
    a = 1
    b = function2(a)
    return b * 7

def function2(a):
    return a + 2

result = function1()
print(result)
```

arctic wedgeBOT Jul 18, 2020, 3:45 AM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

frank bone Jul 18, 2020, 3:47 AM

#

yeah made a typo, meant to write function2(a)

desert oar Jul 18, 2020, 3:48 AM

#

looks like you got the right idea then

frank bone Jul 18, 2020, 3:48 AM

#

hmm i have to recheck maybe i made a typo in my code or forgot something

#

but can you recheck my pseudo code above? correct now?

desert oar Jul 18, 2020, 3:49 AM

#

seems ok

#

you will need to learn about return at some point

frank bone Jul 18, 2020, 3:51 AM

#

return i know but its not necessary for this example

#

just want to pass 1 variable from function to next function when calling the latter

#

alright it worked 😄 thanks for explaining

#

i had the right idea but mustve made a typo..multiple times lol

odd apex Jul 18, 2020, 4:58 AM

#

if anyone could help me with a scatterplot that'd be pretty cool

#

x being one column of target values

#

y being a group of predictor columns

olive fossil Jul 18, 2020, 4:59 AM

#

hi guys

#

do I have to learn any other languages aside from python

#

to be a legitimate data-engineer?

odd apex Jul 18, 2020, 4:59 AM

#

R

#

is very useful, from what I've heard

timber eagle Jul 18, 2020, 5:00 AM

#

Perhaps SQL too

olive fossil Jul 18, 2020, 5:00 AM

#

oh

desert oar Jul 18, 2020, 5:03 AM

#

SQL for sure, java or scala might help

timber eagle Jul 18, 2020, 5:04 AM

#

Oh really? Java???

flat quest Jul 18, 2020, 5:10 AM

#

eh python is usually sufficient @odd apex.

But it would be recommended to learn a form of SQL. It shouldn't be too difficult once you understand Pandas, considering df's and db's are conceptually similar.

verbal ice Jul 18, 2020, 6:30 AM

#

Oh really? Java???
@timber eagle scala/java are useful for big data applications (eg: data ingestion, processing etc)

low finch Jul 18, 2020, 6:36 AM

#

and distributed computing

#

even though i guess that's kinda processing

potent nymph Jul 18, 2020, 8:53 AM

#

Has anyone tried Tech With Tim's Machine Learning tutorial series? Is it good?

frank bone Jul 18, 2020, 9:03 AM

#

anyone have a little experience with extended isolation forest?

#

was able to implement normal isolation forest without problem

#

does extended isolation forest actually work for 1 dimension time series data?

bitter harbor Jul 18, 2020, 9:51 AM

#

@potent nymph idk about that one but personally id recommend 3blue1brown

gaunt tusk Jul 18, 2020, 10:22 AM

#

^

#

3blue1brown has the best introduction to neural networks

#

easily

#

by far

#

speaking of neural networks

#

is anyone able to confirm that these notes are correct?

📎 IMG_7893.jpg

#

just want to double check i have the right formulas for calculating the derivatives

lapis sequoia Jul 18, 2020, 11:12 AM

#

I'd learn SQL, considering even at my beginner level SQL gets data very fast especially if you are dealing with huge corporate databases.

#

Above about 100k rows in excel things start getting slow

#

So if a company is being smart about their data, they should migrate to SQL once their rows reach 100k or more.

#

SQL is not like Excel, instead of clicking a cell or selecting a row, you must write a script.

#

This script can be shared to access the same database on the server.

urban island Jul 18, 2020, 11:57 AM

#

@gaunt tusk isn't a^{L-1} just z^{L}? I take it that a is the current layer and z is the previous layer. Also, is W^{L} just the weight matrix from z^{L} to a^{L}? Because in that case z^{L} is independent of w^{L}

gaunt tusk Jul 18, 2020, 11:58 AM

#

z^{L} is a^{L-1} multiplied by its weight + bias

#

the weighted sum

urban island Jul 18, 2020, 11:59 AM

#

it's only the weighted sum? What about the activation/transition function? Or is your network linear

gaunt tusk Jul 18, 2020, 12:00 PM

#

holdon i'll list out what each one is

#

a{l} = Activation ( σ(z{l}) )
a{l-1} = Previous neuron activation
z{l} = weighted sum ( (a{l-1}*w{l}) + b{l} )
w{l} = weight
b{l} = bias
c0 = cost ( (a{l} - y)^2 )
σ(x) = sigmoid function
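
Since the photographed notes are not visible here, the following is only a sketch of the chain-rule derivatives that follow from the definitions listed above (single neuron, single training example), together with a finite-difference check. The function names and the sample numbers are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cost(w, b, a_prev, y):
    z = a_prev * w + b          # weighted sum
    a = sigmoid(z)              # activation
    return (a - y) ** 2         # squared-error cost


def gradients(w, b, a_prev, y):
    z = a_prev * w + b
    a = sigmoid(z)
    dc_da = 2.0 * (a - y)       # dC/da
    da_dz = a * (1.0 - a)       # sigmoid'(z)
    dc_dw = dc_da * da_dz * a_prev   # dz/dw = a_prev
    dc_db = dc_da * da_dz            # dz/db = 1
    dc_da_prev = dc_da * da_dz * w   # dz/da_prev = w
    return dc_dw, dc_db, dc_da_prev


# Compare the analytic dC/dw with a central finite difference.
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0
eps = 1e-6
numeric_dw = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
analytic_dw = gradients(w, b, a_prev, y)[0]
print(analytic_dw, numeric_dw)
```

A numeric check like this is a cheap way to confirm hand-derived formulas before moving to the matrix version.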

#

using sigmoid for my activation/transition

urban island Jul 18, 2020, 12:08 PM

#

and your cost function is sum of squares?

#

sorry I'm just now waking up

gaunt tusk Jul 18, 2020, 12:09 PM

#

yeah sorry forgot to stick it there, its (a{l} - y)^2

#

all good

#

and this is just for like

#

a single training example

#

i'll end up using it on matrices in my actual thing

#

just trying to lay it out first

#

so yeah the cost will end up being the sum of squares

urban island Jul 18, 2020, 12:17 PM

#

ok yeah, your derivatives look fine

#

@gaunt tusk just wondering, how come you're not using something like MSE for your cost function?

gaunt tusk Jul 18, 2020, 12:23 PM

#

not sure i haven't looked at any other cost functions as of yet

#

whats the benefit of MSE?

dull turtle Jul 18, 2020, 12:32 PM

#

how i can reduce val_loss ?

#

```
Epoch 145/150
32/32 [==============================] - 3s 80ms/step - loss: 0.3107 - accuracy: 0.9277 - val_loss: 0.9093 - val_accuracy: 0.6875
Epoch 146/150
32/32 [==============================] - 3s 85ms/step - loss: 0.3060 - accuracy: 0.9283 - val_loss: 1.8575 - val_accuracy: 0.6228
Epoch 147/150
32/32 [==============================] - 3s 82ms/step - loss: 0.2562 - accuracy: 0.9507 - val_loss: 3.1728 - val_accuracy: 0.6491
Epoch 148/150
32/32 [==============================] - 3s 79ms/step - loss: 0.2472 - accuracy: 0.9473 - val_loss: 3.3467 - val_accuracy: 0.6140
Epoch 149/150
32/32 [==============================] - 3s 79ms/step - loss: 0.3238 - accuracy: 0.9191 - val_loss: 2.0550 - val_accuracy: 0.6404
Epoch 150/150
32/32 [==============================] - 3s 81ms/step - loss: 0.2501 - accuracy: 0.9507 - val_loss: 3.1427 - val_accuracy: 0.5877```

#

free to ping me

urban island Jul 18, 2020, 12:40 PM

#

@gaunt tusk well it depends on the dataset. I've seen MSE used a lot more than least squares. I dont remember the exact properties of each but from my experience MSE usually provides better results in regression type problems

gaunt tusk Jul 18, 2020, 12:42 PM

#

Hmm i'll have a look into it

#

i believe the one i'm using should be fine for what i'm doing atm though

#

just making a simple handwritten digit recogniser

#

using the mnist dataset

urban island Jul 18, 2020, 12:45 PM

#

I've used RELU (and its cousins) way more since it's less expensive (computationally) and provides faster convergence on networks where I dont have to worry about negative values

#

but anyways, yeah you'll prob be fine with your current cost function

bitter harbor Jul 18, 2020, 12:48 PM

#

^ReLu would probably work better for image recognition

#

considering your values are between 0 and 1 already

gaunt tusk Jul 18, 2020, 12:49 PM

#

hmm i'll check it out

urban island Jul 18, 2020, 12:50 PM

#

welp I'm mixing things. Relu is an activation function

gaunt tusk Jul 18, 2020, 12:50 PM

#

i have heard that its a more modernly used activation function

#

and i've been looking around and i believe the formulas for the partial derivatives i have are correct

#

so i believe i'm all set

bitter harbor Jul 18, 2020, 12:50 PM

#

yes it is but you'll still have to use sigmoid if you ever have negative data

gaunt tusk Jul 18, 2020, 12:51 PM

#

would the bias be able to make it negative?

bitter harbor Jul 18, 2020, 12:51 PM

#

idk if there's anything similar to to tho

urban island Jul 18, 2020, 12:51 PM

#

or leaky RELU

#

that has the benefit of dealing with negative values

bitter harbor Jul 18, 2020, 12:51 PM

#

wdym the bias

#

the weights and biases have nothing to do with: 1) Your input/output 2) activation function

#

you can consider them separate items

gaunt tusk Jul 18, 2020, 12:52 PM

#

i thought you passed in the weighted sum to the activation?

bitter harbor Jul 18, 2020, 12:53 PM

#

thats right

gaunt tusk Jul 18, 2020, 12:53 PM

#

so the bias would never be able to be low enough to make it negative is what you're saying?

bitter harbor Jul 18, 2020, 12:54 PM

#

so if your activation function squashes its output into a range of 0,1 like the sigmoid, all values will be between the two

#

whereas if you use tanh it'll all be between -1,1, and ReLU only clips at 0 so it has no upper bound
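
For reference, the output ranges of the activations mentioned in this thread can be checked numerically. This is a small sketch on a made-up input grid; `sigmoid` and `relu` are written out by hand rather than taken from any library. Sigmoid stays strictly between 0 and 1, tanh between -1 and 1, and ReLU is 0 for negative inputs and unbounded above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


x = np.linspace(-10.0, 10.0, 201)  # inputs from -10 to 10

sigmoid_range = (sigmoid(x).min(), sigmoid(x).max())
tanh_range = (np.tanh(x).min(), np.tanh(x).max())
relu_range = (relu(x).min(), relu(x).max())

print("sigmoid:", sigmoid_range)
print("tanh:   ", tanh_range)
print("relu:   ", relu_range)
```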

gaunt tusk Jul 18, 2020, 12:54 PM

#

oh wait yeah my inputs are always going to be between 0 and 1

#

so relu probably would be the better option here yeah

#

i'll have a look at it

#

and one other question actually

bitter harbor Jul 18, 2020, 12:55 PM

#

have a look at cnn's too

gaunt tusk Jul 18, 2020, 12:56 PM

#

https://paste.pythondiscord.com/oqowucuyob.py currently have this so far

bitter harbor Jul 18, 2020, 12:56 PM

#

image recognition usually uses either that or what you have, which is a perceptron

gaunt tusk Jul 18, 2020, 12:56 PM

#

and i'm running a test image through it

#

just testing the forward passing

#

it works fine on the first two neurons

#

but for some reason the last one it throws an error

bitter harbor Jul 18, 2020, 12:57 PM

#

layers or neurons

gaunt tusk Jul 18, 2020, 12:57 PM

#

ah yeah layers

#

Traceback (most recent call last):
  File "/Users/6503/Desktop/CodeStuff/MachineLearning/NeuralStuff.py", line 115, in <module>
    test = thing.feed_forward(list(letssee[0])[0][0])
  File "/Users/6503/Desktop/CodeStuff/MachineLearning/NeuralStuff.py", line 103, in feed_forward
    activation = self.sigmoid(np.dot(weight, activation)+bias)
ValueError: operands could not be broadcast together with shapes (10,16) (10,)

#

it seems to have changed the arrays shape somehow

#

not sure where though

#

i was printing out the shapes just to check what the first two layers were doing

#

```
(784, 1)
(16, 784)
(16, 16)
(10, 16)
``` they look fine

bitter harbor Jul 18, 2020, 12:58 PM

#

by first two layers you mean input+1 hidden?

gaunt tusk Jul 18, 2020, 12:59 PM

#

yeah

#

so it basically just doesn't make it to the output

#

goes through the one hidden layer

bitter harbor Jul 18, 2020, 1:01 PM

#

I'd suggest you look at examples of perceptron in python, I could be wrong but your code seems too short

gaunt tusk Jul 18, 2020, 1:02 PM

#

i mean its not even close to the full thing

#

its just the feedforward part

#

but yeah i'll have a look around

#

thanks for the help and suggestions

#

ah and i think i just found the issue too

#

yep

#

i flattened the bias array earlier for some reason

#

so just removed that
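
A sketch of how a flattened bias could produce exactly that traceback, using random made-up weights with the layer shapes printed above (the sigmoid is left out because it does not change shapes). Adding a `(16,)` bias to a `(16, 1)` column silently broadcasts to `(16, 16)`, and the mistake only surfaces at the last layer as `(10, 16)` versus `(10,)`:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((16, 784)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((10, 16)),
]
x = rng.random((784, 1))  # one flattened 28x28 image as a column


def forward_shapes(biases):
    """Run the weighted sums and record the activation shape per layer."""
    activation = x
    shapes = []
    for w, b in zip(weights, biases):
        activation = np.dot(w, activation) + b
        shapes.append(activation.shape)
    return shapes


column_biases = [np.zeros((16, 1)), np.zeros((16, 1)), np.zeros((10, 1))]
flat_biases = [b.ravel() for b in column_biases]  # shapes (16,), (16,), (10,)

good = forward_shapes(column_biases)

try:
    forward_shapes(flat_biases)
    flat_failed = False
except ValueError:
    flat_failed = True

print(good, flat_failed)
```

Keeping each bias as an `(n, 1)` column keeps every activation an `(n, 1)` column as well.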

#

```
[[0.96463723]
 [0.6023769 ]
 [0.45853454]
 [0.13891415]
 [0.02237485]
 [0.09243328]
 [0.30762676]
 [0.84720262]
 [0.99305502]
 [0.13672393]]
``` now the outputs lookin right

bitter harbor Jul 18, 2020, 1:05 PM

#

also you can look at the hidden layer for sure, just like you can list weights/biases but they won't tell you anything

#

"Neural networks are so-called [black boxes] because they mimic, to a degree, the way the human brain is structured: they're built from layers of interconnected, neuron-like, nodes and comprise an input layer, an output layer and a variable number of intermediate 'hidden' layers -- 'deep' neural nets merely have more than one hidden layer. The nodes themselves carry out relatively simple mathematical operations, but between them, after training, they can process previously unseen data and generate correct results based on what was learned from the training data."

#

tl;dr they're performing functions and you won't be able to tell shit from it

lapis sequoia Jul 18, 2020, 1:30 PM

#

to be fair, a lot of work has been done on interpretability of neural networks.

bitter harbor Jul 18, 2020, 1:32 PM

#

how so

tight stone Jul 18, 2020, 1:50 PM

#

WARNING:tensorflow:From /Users/jaqqen/.local/share/virtualenvs/ShaVas-DrKzIL9u/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

I get this warning plus that update-instruction every time my program hits model.save(path)

I already passed in the *_constraint-arguments to the layers and it looks like this now:

    model.add(Flatten())
    model.add(Dense(50, activation=leaky_relu, kernel_constraint=None, bias_constraint=None))
    model.add(Dropout(.1))
    model.add(Dense(50, activation=leaky_relu, kernel_constraint=None, bias_constraint=None))
    model.add(Dropout(.3))
    model.add(Dense(2, activation=softmax, kernel_constraint=None, bias_constraint=None))

unkempt lotus Jul 18, 2020, 2:27 PM

#

Good evening
If I have a 9x9 numpy array b:
[[0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]]
which I got from the following code:
a = []
for i in range(9):
    a.append(i)
b = []
for i in range(9):
    b.append(a)
b = np.array(b)
I am trying to turn it into 9 3x3 images using .reshape method:
c = b.reshape(9,3,3)

#

However, the result I get if I print c[0], namely the first sample, is:
array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
whereas what I want is the upper left corner of the image, namely:
array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])

#

Researching over stack overflow, the solution might have to do with reshaping, using .swapaxes method, then reshaping again: https://stackoverflow.com/questions/45950264/reshape-array-in-squares-like-an-image
But I couldn't figure out how should I use this for my case
Any help would be very much appreciated!

Stack Overflow

Reshape array in squares like an image

I would like to reshape an array in python (a numpy array at first) in a way that each element at the first index becomes a square/quadrant in that array, not the regular reshaping that takes all
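
For the 9x9 case in the question, the reshape / swapaxes / reshape idea from the linked Stack Overflow thread could look roughly like this. It is only a sketch: `np.tile` is used here as a shortcut to build the same array `b`, and the name `blocks` is made up for illustration.

```python
import numpy as np

# Same 9x9 array as in the question: every row is 0..8
b = np.tile(np.arange(9), (9, 1))

# Step 1: view the array as (block_row, row_in_block, block_col, col_in_block)
# Step 2: swap the middle axes so both block indices come first
# Step 3: merge the two block indices into one axis of length 9
blocks = b.reshape(3, 3, 3, 3).swapaxes(1, 2).reshape(9, 3, 3)

print(blocks[0])  # upper-left 3x3 corner of b
```

Here `blocks[k]` should match the plain slice `b[3*r:3*r+3, 3*c:3*c+3]` with `r, c = divmod(k, 3)`, so the blocks are numbered left to right, then top to bottom.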

dull turtle Jul 18, 2020, 2:58 PM

#

how to reduce val_loss in CNN ?

#

https://paste.pythondiscord.com/gidavuvoti.py my code here

#

Epoch 80/85
32/32 [==============================] - 3s 79ms/step - loss: 0.2400 - accuracy: 0.9551 - val_loss: 3.0391 - val_accuracy: 0.6435
Epoch 81/85
32/32 [==============================] - 2s 77ms/step - loss: 0.2816 - accuracy: 0.9331 - val_loss: 1.7805 - val_accuracy: 0.5913
Epoch 82/85
32/32 [==============================] - 3s 79ms/step - loss: 0.3244 - accuracy: 0.9147 - val_loss: 2.2709 - val_accuracy: 0.6172
Epoch 83/85
32/32 [==============================] - 3s 84ms/step - loss: 0.2983 - accuracy: 0.9395 - val_loss: 1.6353 - val_accuracy: 0.6174
Epoch 84/85
32/32 [==============================] - 2s 76ms/step - loss: 0.3258 - accuracy: 0.9206 - val_loss: 3.8418 - val_accuracy: 0.5913
Epoch 85/85
32/32 [==============================] - 3s 85ms/step - loss: 0.2855 - accuracy: 0.9390 - val_loss: 2.8588 - val_accuracy: 0.6783
training completed...2
Epoch 1/1
9/9 [==============================] - 1s 70ms/step - loss: 3.6322 - accuracy: 0.4427
score :  [1.7107210159301758, 0.572519063949585]

unkempt lotus Jul 18, 2020, 3:34 PM

#

I was once stuck with a validation loss that wouldn't go down; it turned out I needed to shuffle the data. Note that setting shuffle=True only shuffles the data after the validation split, if I'm not mistaken
@dull turtle
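
A minimal sketch of shuffling before the split, assuming the data lives in two NumPy arrays (the names `x` and `y` and the toy values are made up here, not taken from the pasted code). In Keras, `validation_split` takes the last fraction of the samples before any shuffling happens, so if the data is ordered by class the validation set can end up very different from the training set.

```python
import numpy as np

# Toy stand-ins for the real features and labels
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# One shared permutation keeps every sample paired with its label
rng = np.random.default_rng(0)
perm = rng.permutation(len(x))
x_shuffled, y_shuffled = x[perm], y[perm]

# x_shuffled / y_shuffled would then go to model.fit(..., validation_split=...)
```

Whether this helps with the val_loss above is a guess; the large gap between training and validation accuracy could also simply be overfitting.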

dull turtle Jul 18, 2020, 3:35 PM

#

@unkempt lotus can u refer to my code above pasted in link

#

https://paste.pythondiscord.com/gidavuvoti.py @unkempt lotus code here

unkempt lotus Jul 18, 2020, 3:36 PM

#

It is a bit too long, apologies, but I searched for "shuffle" and didn't find anything

#

Not sure if this is your issue, but if you want give it at least a try

bitter harbor Jul 18, 2020, 3:46 PM

#

@unkempt lotus if your array is 9x9 (array here being your 9x9 array):

    array1 = array[0:3, 0:3]
    array2 = array[0:3, 3:6]
    array3 = array[0:3, 6:9]

etc

#

I was trying to think of a way to do this with recursion, but pretty sure it'd be more complicated

unkempt lotus Jul 18, 2020, 4:05 PM

#

@bitter harbor No worries, thank you for the effort, I might consider implementing this

proper basin Jul 18, 2020, 4:05 PM

#

I want to build a KDTree (scikit-learn) of unique points, however calling numpy.unique() on the array of points takes much longer than building the KDTree (over 10x longer). Is there a way to use the KDTree structure to make it unique, rather than the apparently-expensive numpy.unique operation?

bitter harbor Jul 18, 2020, 4:07 PM

#

how long is '10x longer'?

proper basin Jul 18, 2020, 4:07 PM

#

depends on the number of points

#

in my unittest it goes from 1s to 12s

#

120,000 points

bitter harbor Jul 18, 2020, 4:09 PM

#

That's pretty good considering what it's doing

#

maybe check this out?

#

https://stackoverflow.com/questions/8560440/removing-duplicate-columns-and-rows-from-a-numpy-2d-array

Stack Overflow

Removing duplicate columns and rows from a NumPy 2D array

I'm using a 2D shape array to store pairs of longitudes+latitudes. At one point, I have to merge two of these 2D arrays, and then remove any duplicated entry. I've been searching for a function sim...

proper basin Jul 18, 2020, 4:10 PM

#

"I've been searching for a function similar to numpy.unique, but I've had no luck"
Hmm I don't get why he doesn't use numpy.unique()

bitter harbor Jul 18, 2020, 4:10 PM

#

who knows???

#

maybe because of the same problem you're running into

#

if you think about what's happening, it makes sense that it would take 12s

#

because it's looking at the 120000 points, comparing them to each other and returning the result

proper basin Jul 18, 2020, 4:13 PM

#

Well it seems to me that the KDTree could easily remove duplicates

#

during insertion, with little overhead

#

it just doesn't have an argument to do that

bitter harbor Jul 18, 2020, 4:14 PM

#

I think the issue is the time cost in general

proper basin Jul 18, 2020, 4:17 PM

#

Oh from your link I found a solution using set() which is much faster

#

(0.2s)

#

I assume this is more memory-hungry though
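
The exact set() snippet from the linked answer isn't shown here, so this is only a guess at its general shape: turn each row into a hashable tuple, let a set drop the duplicates, then rebuild the array. The toy `points` array is made up for illustration.

```python
import numpy as np

# Toy 2D points with exact duplicates
points = np.array([[0.0, 1.0],
                   [2.0, 3.0],
                   [0.0, 1.0],
                   [4.0, 5.0],
                   [2.0, 3.0]])

# Rows become tuples so they can be hashed; the set keeps one copy of each
unique_points = np.array(list(set(map(tuple, points))))

print(unique_points.shape)  # 3 unique points remain
```

Two caveats: the set does not keep the original row order, and it only removes exact duplicates (points that differ by floating point noise stay). It also builds a Python tuple per point, which is where the extra memory would go.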

bitter harbor Jul 18, 2020, 4:18 PM

#

couldn't tell ya

#

this might help too https://stackoverflow.com/questions/46575364/efficiently-counting-number-of-unique-elements-numpy-python

Stack Overflow

Efficiently counting number of unique elements - NumPy / Python

When running np.unique(), it first flattens the array, sorts the array, then finds the unique values. When I have arrays with shape (10, 3000, 3000), it takes about a second to find the uniques, bu...

raven mulch Jul 18, 2020, 6:42 PM

#

If anyone is interested in learning about making their own deep learning library in python feel free to check this first video out 🙂

#

https://www.youtube.com/watch?v=nNFsHQaD7gQ&t=761s

YouTube

Federico Barbero

Developing a Deep Learning Library - Part 1 - JoelNet Library and N...

Hello!
Today we start a new adventure where we will be expanding on the JoelNet library with the ultimate goal of deploying our own MNIST web classifier (and maybe attacking it using some simple adversarial attacks). The idea is to model the library around the scikit-learn api...

▶ Play video

#

I am an undergraduate researcher in machine learning

flat quest Jul 18, 2020, 7:31 PM

#

oooh
might be something i'll look into. Though I have a feeling autograd will be a pain...

cold shore Jul 18, 2020, 9:53 PM

#

Thanks @raven mulch

raven mulch Jul 18, 2020, 10:12 PM

#

No prob let me know what you think 🙂

desert oar Jul 18, 2020, 10:44 PM

#

You can bypass auto grad by hard coding the layer types

#

Write the gradients out by hand

#

Probably more educational than using an autograd lib

bitter harbor Jul 18, 2020, 10:46 PM

#

^^ i'd suggest doing everything (maybe except matrix manipulation (that's what numpy's for)) by hand

desert oar Jul 18, 2020, 10:46 PM

#

Yeah just use numpy for that

bitter harbor Jul 18, 2020, 10:47 PM

#

doing all that would be painful

#

but doing it all yourself will help you understand everything better

#

just like how it's Very useful to learn linear algebra (or concepts used in ml)
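
As a rough idea of what "write the gradients out by hand" can look like for one layer type, here is a sketch of a dense layer with a hand-derived backward pass, using NumPy only for the matrix math. The class name `Dense`, the toy loss and the shapes are all made up for illustration and are not taken from the video.

```python
import numpy as np

class Dense:
    """Fully connected layer y = x @ W + b with a hand-written backward pass."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x  # cache the input, backward needs it
        return x @ self.W + self.b

    def backward(self, grad_out):
        # grad_out is dLoss/dy with shape (batch, n_out)
        self.dW = self.x.T @ grad_out   # dLoss/dW
        self.db = grad_out.sum(axis=0)  # dLoss/db
        return grad_out @ self.W.T      # dLoss/dx, passed to the previous layer

rng = np.random.default_rng(0)
layer = Dense(3, 2, rng)
x = rng.normal(size=(4, 3))

y = layer.forward(x)
loss = 0.5 * np.sum(y ** 2)   # toy loss, chosen so that dLoss/dy is just y
grad_x = layer.backward(y)
```

A finite-difference check (nudge one weight, recompute the loss, compare the slope with `layer.dW`) is a handy way to confirm a hand-derived gradient before stacking layers.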

fervent crypt Jul 19, 2020, 12:54 AM

#

I'm trying to parse through an html table using pandas but i keep having a problem with values coming out as NaN when td values are there.

https://pastebin.com/zkjf6sqm

This is what part of the html table looks like.

My table ends up looking like this:

https://pastebin.com/ewB5Zf2K

The problem is that Role keeps coming out as NaN when i have things like "Bot Laner" still there.

for my code I think these are the relevant parts.

soup=BeautifulSoup(req.text, 'lxml')

my_table = soup.find('table', {'class':'wikitable'})

pd.read_html(str(my_table))

Any help would be really appreciated thanks!

Pastebin

NA

Pastebin

[ R C ID Name Role Contract Ends ...

flat quest Jul 19, 2020, 2:09 AM

#

true tho its always been interesting to me how tf and pytorch actually compute all their gradients

I know tf uses a graph execution, and that helps them deal with the gradients for non-standard functions, but would be interesting to see if that could be reimplemented. @desert oar
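
The core idea can be reimplemented in a few lines, at least for scalars. This is a toy sketch of reverse-mode differentiation over a recorded graph, not how TensorFlow or PyTorch are actually implemented; the `Value` class and everything in it is made up for illustration.

```python
class Value:
    """Scalar that remembers its inputs and the local derivative w.r.t. each."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the graph so each node is processed after everything that uses it
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            # Chain rule: push this node's gradient back to its inputs
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

x = Value(3.0)
y = Value(4.0)
z = x * y + x   # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)
```

Supporting a "non-standard" function then just means adding another method that records its own local derivatives.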

simple shadow Jul 19, 2020, 2:37 AM

#

hey all, i need help with something
in a specific dataset, there is data like ''Apr 3, 1998 to Apr 24, 1999'' how do i extract Apr 3 1998 and put it in one column and put Apr 24 1999 in another column

#

i am using python pandas

rancid brook Jul 19, 2020, 4:42 AM

#

You could split the string on " to "
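
In pandas that split could be sketched like this. The column name `Aired` is borrowed from a later message in this channel, while `start` and `end` are made-up names for the two new columns.

```python
import pandas as pd

df = pd.DataFrame({"Aired": ["Apr 3, 1998 to Apr 24, 1999",
                             "Oct 20, 1999 to Mar 5, 2000"]})

# Split once on " to " and spread the two halves over two new columns
df[["start", "end"]] = df["Aired"].str.split(" to ", n=1, expand=True)

# Optional: turn the text into real dates
df["start"] = pd.to_datetime(df["start"], format="%b %d, %Y")
df["end"] = pd.to_datetime(df["end"], format="%b %d, %Y")

print(df)
```

Rows that contain only a single date (no " to ") would end up with a missing value in `end`, so those may need separate handling.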

tawdry sedge Jul 19, 2020, 4:42 AM

#

Hey guys need some advice

#

I found this question on stack overflow to find the line that touched most of the rectangles

#

📎 image.png

#

I have no clue at the moment but we need to find the line which touches the most rectangles, and it does not have to go through the corners

#

any clue?

lapis sequoia Jul 19, 2020, 10:43 AM

#

Need advice for a good Data Science book. Any ideas?

chrome barn Jul 19, 2020, 10:55 AM

#

https://github.com/carl24k/awesome-datascience#books

GitHub

carl24k/awesome-datascience

:memo: An awesome Data Science repository to learn and apply for real world problems. - carl24k/awesome-datascience

#

take your pick

verbal ice Jul 19, 2020, 12:20 PM

#

Oh my god this is great thanks for sharing!

#

I must say this is very overwhelming 😅

lapis sequoia Jul 19, 2020, 12:23 PM

#

hey, has anyone tried this course?
https://www.coursera.org/learn/machine-learning

Coursera

Machine Learning by Stanford University | Coursera

Learn Machine Learning from Stanford University. Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, ...

#

what are the prerequisites for this course

rigid glade Jul 19, 2020, 1:25 PM

#

Hi guys. I have a pandas question.

#

📎 unknown.png

#

How do i compare the first -1 to the next row down number?

#

The conditions are: the first number needs to be -1, and if the next row down number is +1 then the count of True goes up by one.

#

Then the comparison starts on the second number to the third number with the same conditions
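
The screenshot isn't available here, so this is a sketch on made-up data: a Series `s` of -1 / +1 values, compared against the same Series shifted up by one row so that each value lines up with the one below it.

```python
import pandas as pd

# Made-up stand-in for the column in the screenshot
s = pd.Series([-1, 1, 1, -1, -1, 1, -1])

# s.shift(-1) holds the "next row down" value for every row
hits = (s == -1) & (s.shift(-1) == 1)

count = int(hits.sum())
print(hits.tolist(), count)
```

The last row has no next row, so its shifted value is NaN and the comparison is simply False there.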

lapis sequoia Jul 19, 2020, 1:29 PM

#

Does anyone on here use Python for chemistry related work? I’m trying to find other packages like Cantera https://cantera.org/ for Python.

Cantera

Cantera's Homepage

gleaming gyro Jul 19, 2020, 3:17 PM

#

are there any astronomy related projects that i can work on?

simple shadow Jul 19, 2020, 3:35 PM

#

anime_new['Aired'].drop(index=anime_new['Aired'][filt].index)
anime_new.dropna(subset = ['Aired'])

if i print out anime_new after the last line, it includes the NA
does anyone know why?

ripe forge Jul 19, 2020, 3:47 PM

#

Oh

#

@simple shadow because by default in pandas, drop and dropna return copies. You never reassigned the result, so the original variable stays the same. Either use inplace=True or reassign it yourself.
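
A small sketch of the difference, on a made-up three-row frame (the `Title` column and the values are invented; only `anime_new` and `Aired` come from the question):

```python
import pandas as pd

anime_new = pd.DataFrame({"Title": ["A", "B", "C"],
                          "Aired": ["Apr 3, 1998", None, "Oct 1, 2004"]})

# Returns a new frame; the result is thrown away, anime_new is untouched
anime_new.dropna(subset=["Aired"])
rows_after_discarded_call = len(anime_new)

# Reassigning keeps the cleaned frame
anime_new = anime_new.dropna(subset=["Aired"])
rows_after_reassign = len(anime_new)

print(rows_after_discarded_call, rows_after_reassign)
```

Note that `anime_new['Aired'].drop(...)` works on a Series pulled out of the frame, so to remove rows from the frame it is probably safer to call `drop` on `anime_new` itself and reassign.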

simple shadow Jul 19, 2020, 3:57 PM

#

thank you!! @ripe forge

digital juniper Jul 19, 2020, 4:10 PM

#

just started learning data science, does anyone have a good server or place to ask for ML specific python questions?

ripe forge Jul 19, 2020, 4:18 PM

#

#data-science-and-ml

bitter harbor Jul 19, 2020, 4:22 PM

#

lmao

digital juniper Jul 19, 2020, 4:24 PM

#

i mean this is a discussion channel so i wasn't sure haha

#

but if anyone knows about the scikit confusion matrix, i'm doing logistic regression on a cancer data set where the target var is either M or B for malignant or benign

#

and this is the matrix

📎 unknown.png

#

but i don't know how to label it with the prediction axis and the actual data axis

#

so idk which way round it is

#

but i do know which one is M and which one is B using the labels parameter

#

there's the code

📎 unknown.png

bitter harbor Jul 19, 2020, 4:31 PM

#

not too familiar with scikit but I do know the actual output has to be a number

#

so instead of M/B you could have 1,0

digital juniper Jul 19, 2020, 4:32 PM

#

yeah, the only thing the labels thing does is switch the order from being B then M or M then B

#

so if i switch them the matrix goes the other way round if that makes sense

#

but idk which axis is which

flat quest Jul 19, 2020, 4:32 PM

#

wait what are u trying to do
print the confusion matrix along with the actual labels?

digital juniper Jul 19, 2020, 4:32 PM

#

yeah so i printed the matrix but i don't know which line is the predicted data and the actual data

flat quest Jul 19, 2020, 4:33 PM

#

it doesn't really matter which is which

They're interchangeable. Both axes have the same labels.

digital juniper Jul 19, 2020, 4:33 PM

#

i thought one axis was predicted M and predicted B then the other was actual M and actual B

#

for the diagonal it doesn't make a difference but for off diagonal elements it matters

flat quest Jul 19, 2020, 4:36 PM

#

ah well

i meant that as long as the confusion matrix function says which axis is which, it doesn't really matter.

If it doesn't state it, the general convention is that the column axis (left to right) is the predicted values and the row axis (top to bottom) is the actual values.
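
For scikit-learn specifically, `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, in the order given by `labels`, and it accepts string labels like "M"/"B" directly. A sketch with made-up predictions, wrapped in a DataFrame so the axes are spelled out:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels, only to show the layout
y_true = ["M", "M", "M", "B", "B", "B"]
y_pred = ["M", "B", "B", "B", "B", "M"]

labels = ["M", "B"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Name the axes so there is no doubt which way round the matrix is
table = pd.DataFrame(cm,
                     index=[f"actual {name}" for name in labels],
                     columns=[f"predicted {name}" for name in labels])
print(table)
```

The toy data is deliberately lopsided (two malignant cases predicted benign, one benign predicted malignant) so the off-diagonal cells differ and the orientation can be read off the printout.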

digital juniper Jul 19, 2020, 4:37 PM

#

ahh thanks, yeah i couldn't find a default on the scikit docs

#

maybe i just missed it but i was wondering if there was some convention

#

cuz i wanted to work out the precision and recall manually, instead of using scikit funcs

flat quest Jul 19, 2020, 4:38 PM

#

ah gotcha. Yeah weird of them to not state it.

Yeah that's the general convention. Unlikely for scikit to use a different axis system

#

ah i see. Do you know the math behind the two functions?

digital juniper Jul 19, 2020, 4:38 PM

#

precision and recall?

flat quest Jul 19, 2020, 4:38 PM

#

yeah

digital juniper Jul 19, 2020, 4:38 PM

#

i mean it's 3 numbers in a fraction haha

#

unless there's something i'm missing

flat quest Jul 19, 2020, 4:39 PM

#

well for binary classification it's really simple. If you're looking to do multilabel as well, it's a little more complex.
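
For the binary case the manual version is short. This sketch assumes rows are actual and columns are predicted, label order M then B, and that M (malignant) is treated as the positive class; the counts are made up.

```python
# rows = actual, columns = predicted, label order ["M", "B"]
cm = [[40, 3],
      [5, 95]]

tp = cm[0][0]  # actual M, predicted M
fn = cm[0][1]  # actual M, predicted B
fp = cm[1][0]  # actual B, predicted M

precision = tp / (tp + fp)  # of everything predicted M, how much really was M
recall = tp / (tp + fn)     # of everything really M, how much was found

print(precision, recall)
```

If the matrix were laid out the other way round, `fn` and `fp` would swap, which is why the axis order matters for the off-diagonal cells.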

digital juniper Jul 19, 2020, 4:40 PM

#

ah yeah i mean i've just started so i'm starting with binary classification