#data-science-and-ml


modest rune
modern canyon
#

What are the best methods for movie recommendation systems?
I'm right now using cosine similarity metric on the IMDb dataset to make recommendations. Although it performs reasonably well, I'd like to enhance the performance. So what are the SOTA methods available for movie recommendation systems?

lapis sequoia
#

do yall see auto ML taking over data science in the upcoming decades

turbid oyster
#

I see auto ML being a big automator of machine learning - but data science is more than ML

regal flax
#

ey

worldly kindle
#

@modest rune so you went with converting the arrays to lists?

pastel compass
#

Is there any benefit to using an ASCII string over Unicode for text?

idle otter
#

how would you say a shape of (2330, 3500, 3) in words?

#

i know it's 2330 arrays of 3500by3 but I am just wondering if there is a standard way of saying it

frank bone
#

Is it possible to pass a list of dates to a pandas date series index?

#

i.e. only open market dates for a year of stocks

#

Which is like 250 days out of 365

proper fable
#

Guys, if I have a dataset that has 'names' column, how should I deal with it to convert it into numerical data

#

Is it right to use one-hot encoding? Because it has literally 25 unique values

desert oar
#

@pastel compass in some specific applications maybe but not in general

#

to be clear: by "Unicode" you probably mean UTF-8

vagrant fiber
#

you can select the columns which you need to convert and use .astype()

desert oar
#

@frank bone yes

restive obsidian
#

hi can someone help me with scipy solve_bvp?

#

data-science seemed to be closest to a channel which might use numerical computation that's why I jumped in here

pastel compass
#

to be clear: by "Unicode" you probably mean UTF-8
@desert oar

Ahh I didn't know there was a link between the two

desert oar
#

Unicode is an abstract system that basically catalogues every character/symbol used by humans, and putting a number on it

#

UTF-8 is an encoding for Unicode text

#

so python strings are "unicode"

#

but a file would be "UTF-8"

pastel compass
#

Oh that makes sense, I always see "encoding=utf8" but I didn't fully understand

desert oar
#

yes

#

so that's a UTF-8 encoded file

#

which means that it contains Unicode text, in UTF-8 format

#

@restive obsidian what's your problem with it? don't ask to ask

pastel compass
#

Thanks for the help!

pale thunder
#

UTF-8 Is a way to represent a sequence of Unicode characters as 8bit bytes (octets)
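
The distinction above can be sketched with built-in Python only: a `str` holds Unicode code points, and encoding turns it into bytes. The sample text is made up for illustration.

```python
text = "h\u00e9llo"                 # a Python str: a sequence of Unicode code points
encoded = text.encode("utf-8")      # bytes: the UTF-8 representation of that text

print(len(text))      # number of code points
print(len(encoded))   # number of bytes; the accented letter takes two bytes in UTF-8
print([hex(ord(ch)) for ch in text])

# Pure-ASCII text is byte-for-byte identical in ASCII and UTF-8
ascii_same = "hello".encode("ascii") == "hello".encode("utf-8")

# Decoding reverses the step and gives back the original str
round_trip = encoded.decode("utf-8")
```

This is also what `encoding="utf8"` in `open()` controls: how the bytes in the file are turned into a `str` and back.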

desert oar
#

^

modest rune
#

@modest rune so you went with converting the arrays to lists?
@worldly kindle

A headache brought upon by too many consecutive days of Pandas forced me to take a break. After a 3 hour nap, I am ready to try again. Upon further reflection, I have decided to take the advice of the experts and attempt to index 2 dataframes instead of putting everything into 1 dataframe (which required the nested arrays).

restive obsidian
#

@desert oar I've been stuck for 3 hrs on a problem where I have to solve a system of 2 coupled 2nd order diff equations

#

can you help, I'm using scipy

#

the bc(ya, yb) says it should return a (n, ) array but why? Boundary conditions should be also for the differentials.
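
A minimal sketch of why `bc` returns an `(n,)` array, assuming the usual rewrite of a second-order equation as a first-order system. The toy problem below is an assumption for illustration, not the asker's system. The state vector already contains the derivatives, so `ya` and `yb` give access to them, and `bc` returns one residual per state component. Two coupled second-order equations would give four state components and four residuals.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Illustrative problem (an assumption, not the asker's system):
#   y'' = -y,  y(0) = 0,  y(pi/2) = 1   -> exact solution y = sin(x)
# Rewrite as a first-order system with state vector [y, y'].
def fun(x, y):
    return np.vstack((y[1], -y[0]))

def bc(ya, yb):
    # ya and yb are the full state vectors at the two ends, so ya[1] and yb[1]
    # (the derivatives) are available here too. One residual per state
    # component: 2 residuals for this 2-component system.
    return np.array([ya[0], yb[0] - 1.0])

x = np.linspace(0.0, np.pi / 2, 11)
y_guess = np.zeros((2, x.size))
sol = solve_bvp(fun, bc, x, y_guess)

print(sol.status)               # 0 means the solver converged
print(sol.sol(np.pi / 4)[0])    # compare with np.sin(np.pi / 4)
```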

hardy folio
#

Hello, Are there people here who may be a little experienced with web scraping?

restive obsidian
#

@hardy folio what do you want to scrape?

#

and with what? scrapy/selenium/requests?

hardy folio
#

So when I have scraped web pages before most of the time I would just scrape information from the page. This website I am looking at actually creates a link and gives you the information in a csv file.

#

is there a way for me to actually use the link it creates and download the csv file instead of just scraping the information it returns on the whole web page

#

or would that be more difficult

#

I have spent quite a bit of time looking into this but so far cant find information that relates

#

it does not look like that creates a link I would use

#

the web page looks like that and the excel picture is a link

restive obsidian
#

use a = requests.get(...).content and then with open(....csv, "wb") as f: f.write(a)
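
The one-liner above could be expanded roughly as follows. This is a sketch: `download_file` and the example URL are hypothetical names, and it assumes the CSV link's real address has already been found (for instance from the link's `href` or the browser's network tab).

```python
import requests

def download_file(url, dest_path, session=None):
    """Fetch `url` and write the raw response bytes to `dest_path`."""
    session = session or requests.Session()
    response = session.get(url, timeout=30)
    response.raise_for_status()          # stop on 4xx/5xx instead of saving an error page
    with open(dest_path, "wb") as f:     # "wb": write the bytes exactly as received
        f.write(response.content)
    return dest_path

# Hypothetical usage; the URL is a placeholder, not the site from the chat:
# download_file("https://example.com/export.csv", "export.csv")
```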

hardy folio
#

Dont i have to give

#

request.get

#

the actual url link download

#

page = requests.get(URL)

#

like that but if I dont have a URL for the excel link

#

and obviously im still learning a lot so I apologize if im dumb

worldly kindle
#

@modest rune good call haha

lapis sequoia
#

i love python

fierce saffron
#

any idea why a pandas describe works sometimes on a numpy array and doesn't other times?

umbral solar
#

Hi! Does anyone know how I might unmask a masked numpy array? I tried ma.getdata() and .data but neither worked (they just returned the same masked array)

frank bone
#
```py
df = pd.DataFrame(index=dti, columns=ticker_list)
```
#

anyone know whats wrong with this?

#

but when passed as index it prints this Cannot convert input [['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12', '2012-01-13', '2012-01-17', '2012-01-18', '2012-01-19', '2012-01-20', '2012-01-23', '2012-01-24', '2012-01-25', '2012-01-26', '2012-01-27'..........................'2012-12-28', '2012-12-31']] of type <class 'list'> to Timestamp

#

trying to get a datetime index instead of 0 to n

frank bone
#

nvm figured it out 😄

frank bone
#

just do this df = pd.DataFrame(index=date_list, columns=ticker_list)
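
Passing the plain list works, but the index then holds strings. A small sketch of getting a real datetime index instead, with made-up sample values standing in for `date_list` and `ticker_list`; the error message above suggests the list had been wrapped in a second list before conversion.

```python
import pandas as pd

# Sample values; date_list and ticker_list stand in for the lists from the chat
date_list = ["2012-01-03", "2012-01-04", "2012-01-05", "2012-01-06"]
ticker_list = ["AAA", "BBB"]

# pd.to_datetime on a flat list of strings returns a DatetimeIndex
dti = pd.to_datetime(date_list)
df = pd.DataFrame(index=dti, columns=ticker_list)

print(type(df.index).__name__)
print(df.shape)
```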

coarse spire
#

Hi, I'm trying to categorize comments on twitch chat. I ran an unsupervised tweet topic modeling algorithm and got 10 topics but the results don't seem too promising.

Does anyone have suggestions to improve the process? Do I remove emotes?

flat quest
#

i'm guessing u used a clustering algorithm? @coarse spire

coarse spire
#

Yeah, so I used different embeddings (including BERT embeddings from flair), ran PCA then used AgglomerativeClustering

#

Then I ran TF-IDF on each topic to pick out the most important terms in each

#

I guess I should look at different clustering techniques and varying the cluster size

#

I also don't know how to make much sense of my data after running PCA on my own. I should look into that too.

frank bone
#

anyone got a clue how to skip NaN values?

#

doing a Simple Moving Average but it breaks as soon as there's 1 NaN value in a time series

#

data['SMA_3'] = data['CREE'].rolling(window=30).mean()

#

id want it to just ignore it and keep going

coarse spire
#

(window=30, min_periods=29) would work for 1 NaN

#

You could also replace the NaN with the mean

#

I see people use pygame when they want an easy GUI.

worldly kindle
#

any idea why a pandas describe works sometimes on a numpy array and doesn't other times?
looks to be hot topic today

regal hound
#

@fierce saffron afaik the .describe() function is not implemented in numpy at all. Therefore it shouldn't work. Maybe you have some DataFrame that you think is a numpy array? If you want to describe a numpy array you can use scipy.stats.describe as a workaround.
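
A short sketch of both routes on a made-up array: wrapping it in a pandas object, or calling `scipy.stats.describe` directly.

```python
import numpy as np
import pandas as pd
from scipy import stats

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # sample data

# A plain ndarray has no .describe() method
has_describe = hasattr(arr, "describe")

# Wrap it in a pandas object first ...
summary = pd.Series(arr).describe()
print(summary["mean"], summary["50%"])

# ... or use scipy.stats.describe, which reports nobs, minmax, mean, variance, ...
result = stats.describe(arr)
print(result.nobs, result.mean)
```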

frank bone
#

@coarse spire does it divide by 30 or 29 in that case?

#

Does it effectively skip it? Or just treats it as a zero?

coarse spire
#

30 until it hits the nan then 29 until it fully passes it
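
A toy illustration of that behaviour with a window of 3 instead of 30: the mean is taken over the non-NaN values in the window, so the NaN is skipped rather than counted as zero.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])   # toy series with one gap

strict = s.rolling(window=3).mean()                  # default min_periods equals the window
lenient = s.rolling(window=3, min_periods=2).mean()  # allow one missing value per window

print(strict.tolist())
print(lenient.tolist())
```

With the default settings every window of this short series touches the gap, so `strict` is all NaN; `lenient` averages the two valid values wherever the gap falls inside the window.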

frank bone
#

Great thanks 🙂 is there a possibility to just ignore NaN though? So if theres a NaN its like it doesnt even exist

#

If thats possible then 30/30 is always possible unless theres less than 30 datapoints

coarse spire
#

Well, dropna will drop the nans before you do the moving average

frank bone
#

Tried that one but somehow it didnt work for me

#

Maybe i did it wrong

coarse spire
#

It should definitely work but putting it back into the dataframe will require some finesse

#

Easiest thing to do would be replace nan with the mean

#

Then you have no nans
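
Both options mentioned here can be sketched on a toy column (the column names are made up). For the `dropna` route, assigning the result back relies on pandas aligning on the index, which is one way to handle the "finesse" part.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"price": [1.0, 2.0, np.nan, 4.0, 5.0]})  # toy column

# Option 1: replace NaN with the column mean, then roll
filled = data["price"].fillna(data["price"].mean())
data["sma_filled"] = filled.rolling(window=3).mean()

# Option 2: drop NaN, roll over the remaining rows, and let pandas align the
# result back on the original index (the dropped row simply stays NaN)
data["sma_dropna"] = data["price"].dropna().rolling(window=3).mean()

print(data)
```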

frank bone
#

True, might do that instead

#

You have a good example/link on how to do that?

coarse spire
#

Nah, if you search around for "replace nan with mean pandas" it should come up

frank bone
#

Like is it possible on the go..while executing the SMA function?

#

Alright ill check it out

coarse spire
#

Nope, gotta do it before

frank bone
#

Thanks 👌🏻

#

Ah damn

coarse spire
#

You're welcome good luck

verbal ice
#

You could also replace the NaN with the mean
@coarse spire be careful when you do this though, it depends on how many nulls you have. If there are too many and you replace them with the mean it will be useless because you won't get any information out of it

#

Sorry late to the conversation 😅

real radish
#

Hi all I hope you are well

#

Is this the right place to ask NLP and CNN questions?

velvet thorn
#

yes, it is.

real radish
#

Im currently doing a project where I'm using word2vec algorithm for classifying Facebook comments into how aggressive they are... Is there a common tool that I can use to iterate through my corpus of sentences to correct spelling mistakes?

#

At the moment I'm using gensim word2vec, but that could change as I'm only at the data preprocessing stage

silk axle
#

How would I add my own data to the mnist training dataset? I've got the images and have worked out the labels, but not sure what datatype the images + labels have to be, nor how to actually add them to my dataset correctly. (@me upon response please)

bitter harbor
#

@silk axle "Thus, in MNIST training data set, `mnist.train.images` is shaped as a [60000, 784] tensor (60000 images, each involving a 784 element array). Using that syntax, you can refer to any of the pixels in any of the images. As shown above, each element in this tensor represents the intensity value of a pixel in a picture, between 0 and 1."

#

just concatenate your data to the end of the array?

silk axle
#

I'm really new to ml + numpy so not sure how to

#

And if it's an array of (28x28) does that mean I need my image as an array 28x28?

#

@bitter harbor

bitter harbor
#

yes I'm not sure how mnist orders the pixels but they are individual pixels not images

silk axle
#

Right okay, thanks

#

I’ll see if I can figure something out

fleet moth
#

I want to have multiple lines for every interruption_type and priority field on my matplotlib chart. Currently I have only this:
```py
def select(self):
    return pd.read_sql_query("SELECT DISTINCT date, interruption_type, priority, SUM(interventiontime) from interruptions GROUP BY interruption_type, priority;",
                             self.conn, index_col="date")

class MplCanvas(FigureCanvasQTAgg):
    def __init__(self, parent=None, width=5, height=4, dpi=100):
        fig = Figure(figsize=(width, height), dpi=dpi)
        self.axes = fig.add_subplot(111)
        super().__init__(fig)
        self.draw()

class StatisticDialog(QMainWindow):

    def __init__(self, *args, **kwargs):
        super(StatisticDialog, self).__init__(*args, **kwargs)
        self.db = OctopusDB()
        self.setWindowTitle("Statistiques des interruptions")
        self.resize(600, 400)
        self.setWindowIcon(QIcon('icon.png'))

        try:
            datas = self.db.select()
            sc = MplCanvas(self, width=5, height=4, dpi=100)
            datas.plot(ax=sc.axes)
```
#

how can I edit this one to get multiple lines (or bars, or another graph that can show the sum by interruption_type, priority and date)?

bitter harbor
#

put it into an array

fleet moth
#

datas = self.db.select() from this line I must create an array so ?

bitter harbor
#

np.random.random((len(data points), amount of interruption_types))

fleet moth
#
```py
datas = self.db.select()
print(datas)
```

returns:
            interruption_type                   priority  SUM(interventiontime)
date
17/07/2020              Email      Important, non urgent                     10
17/07/2020       Présentielle          Important, Urgent                     39
17/07/2020       Présentielle      Important, non urgent                     10
17/07/2020       Présentielle  Non important, non urgent                      6
17/07/2020          Téléphone  Non important, non urgent                      4

silk axle
#

Code
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    resized_image = 1 - resize(image, (28, 28, 1))
    # print(resized_image.shape)

    ## Add data to training sets
    x_train += resized_image  # this is the line that raises the error
    y_train += number_in_image
```
Error
```py
UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```
@bitter harbor

bitter harbor
#

oh sorry yannick I thought your line space was time

silk axle
#

I'm assuming I have to somehow convert the resized_image to be float32 but idk how

bitter harbor
#

plt.bar()

fleet moth
#

no, it's the sum of time for all same interruption_type and same priority for a date

bitter harbor
#

separate the bars into urgencies then

fleet moth
#

yes

bitter harbor
#

No sorry but you're going to have to do a bit of adapting, that graph is 2-dimensional: (scores, men) and (scores, women) so basically it's like saying (scores, human). Your graph is (date, incident type, priority, amount)

#

so even if you split priority into groups, you'll still essentially have a 4-dimensional graph

silk axle
bitter harbor
#

you don't have to link it I can still see it on my screen

#

look up the mnist docs

silk axle
bitter harbor
#

what do your input photos look like

silk axle
#

Do you mean the actual image or the numpy array?

bitter harbor
#

actual image

silk axle
bitter harbor
#

why is it inverted?

silk axle
#

Because the mnist set is apparently inverted

bitter harbor
#

and that's no longer a 0

silk axle
#

And yea they're not the same image, sorry

#

Gimme a sec

bitter harbor
#

no its fine

#

look up an image of the mnist training set

silk axle
#

That's why I'm inverting ^^

bitter harbor
#

look up an image of the mnist training set

#

@fleet moth what you could do is create a bar graph separated by time, split by the types with lengths (y) of the sum, then heat mapped to the priority
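
One possible way to get there with pandas (a sketch, not necessarily the exact chart described above): pivot the query result so that each (interruption_type, priority) pair becomes its own column, which `DataFrame.plot` then draws as its own line or bar. The rows are copied from the printed query result; the column name `total` is a stand-in for `SUM(interventiontime)`.

```python
import pandas as pd

# Rows copied from the printed query result; "total" stands in for SUM(interventiontime)
df = pd.DataFrame(
    {
        "date": ["17/07/2020"] * 5,
        "interruption_type": ["Email", "Présentielle", "Présentielle", "Présentielle", "Téléphone"],
        "priority": [
            "Important, non urgent",
            "Important, Urgent",
            "Important, non urgent",
            "Non important, non urgent",
            "Non important, non urgent",
        ],
        "total": [10, 39, 10, 6, 4],
    }
)

# One row per date, one column per (interruption_type, priority) pair
wide = df.pivot_table(
    index="date",
    columns=["interruption_type", "priority"],
    values="total",
    aggfunc="sum",
)
print(wide)

# wide.plot(kind="bar", ax=sc.axes) would then draw one bar per column;
# sc.axes is the Matplotlib axes from the MplCanvas in the question.
```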

silk axle
#

Idk what you mean by that @bitter harbor

bitter harbor
#

@silk axle they're white numbers and the white space is black, that's the images inverted

#

you can't combine numbers of different colours without screwing with the dataset

silk axle
#

I'm inverting it though

#

That's the point

#

resized_image = 1 - resize(image, (28, 28, 1))

#

This line resizes to (28, 28) and then inverts it

#

So that the colours do match

#

@bitter harbor

bitter harbor
#

please stop pinging

#

what's the purpose of the nn?

silk axle
#

To predict what the number is basically

#

But I am inverting it so that it's white numbers and black background

bitter harbor
#

ok but your image is purple and yellow

silk axle
#

But I inverted it

#

So it's not

bitter harbor
#

the inverse/reverse of purple and yellow is yellow and purple

silk axle
#

The dataset has yellow numbers and purple background (in the plt.imshow)
My data has purple numbers and yellow background, but I invert it (in the plt.imshow)

#

The dataset is also purple+yellow

#

Idk why it shows as purple and yellow (whether that's a plt.imshow thing or just the python mnist dataset), but it does

#

So the colours do match

#

Either way I don't see how this is relevant to the error I'm getting

bitter harbor
#

whats the full error then

silk axle
#
```py
UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```
bitter harbor
#

the full error

silk axle
#

It's only that because I'm using Google Colab

#
```py
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-9-d7ac403beec4> in <module>()
     14 
     15     ## Add data to training sets
---> 16     x_train += resized_image
     17     y_train += number_in_image
     18 

UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'```
pale thunder
#

Maybe your custom images are from floats 0 to 1 and mnist is 0-255 uint8 or vice versa

#

Print out the dtype of x_train and resized_image

silk axle
#

Mnist is 0-255 but I change that somewhere so that it's 0-1 iirc

#

wait nvm mnist is 0-1 I think

#
```
uint8
float32
```
#

First is mnist, second is my image

#

So ig that means MNIST is 0-255? And I have to /255?

pale thunder
#

I would *255 and astype('uint8') your images instead

silk axle
#

Later on I need everything as 0-1 though

#

I think

pale thunder
#

Then do that later. I would think learning on uint8 would be faster than 32 bit floats

#

But feel free to try either way

#

It does probably just become floats later regardless

silk axle
#

Right yea ig

#

So `reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')`?

pale thunder
#

Looks about right
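
A small sketch of the dtype issue, using random stand-in arrays rather than the real data: an in-place `+=` has to keep the left-hand dtype, so float32 values cannot be added into a uint8 array, while scaling to 0-255 and casting makes the two dtypes agree.

```python
import numpy as np

x_train = np.zeros((3, 28, 28), dtype=np.uint8)       # stand-in for the MNIST array
image = np.random.rand(28, 28).astype(np.float32)     # stand-in for a float image in [0, 1)

# In-place ops must keep the left-hand dtype, so adding float32 into uint8 fails
try:
    x_train += image
    in_place_failed = False
except TypeError:                                      # UFuncTypeError is a TypeError subclass
    in_place_failed = True

# Scale to 0-255 first, then cast, so the dtypes agree
image_u8 = (255 * image).astype(np.uint8)
print(in_place_failed, image_u8.dtype, image_u8.min(), image_u8.max())
```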

silk axle
#

Doesn't seem to convert to uint8

#
```
MNIST dataset: uint8
My image: float32
```
#

I'm so confused lol

#

Okay so it's not converting to uint8

#
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(f"MNIST dataset: {x_train.dtype}")  # outputs uint8

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')
    print(f"My image: {resized_image.dtype}")  # outputs float32
    # print(resized_image.shape)

    ## Add data to training sets
    x_train += resized_image
    y_train += number_in_image```
#

@pale thunder

pale thunder
#

Where does resized_image come from?

silk axle
#

oh

bitter harbor
#

255 * abs(1 - resize(image, (28, 28, 1)))

silk axle
#

I'm using wrong variable lmao

#

Okay new error

#
```py
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-287f62ccfef9> in <module>()
     16 
     17     ## Add data to training sets
---> 18     x_train += reversed_image
     19     y_train += number_in_image
     20 

ValueError: operands could not be broadcast together with shapes (60000,28,28) (28,28,1) (60000,28,28) ```
#

And the 1 - resize(...) will never be <0 so don't need to abs it @bitter harbor

#

So the types are matching now, but says the shapes don't match

bitter harbor
#

I didn't say add them together, I said concatenate

#

as in extend the array

silk axle
#

Isn't that how u concatenate numpy stuff?

pale thunder
#

Ah, you cannot append things with + like that. You need some numpy stack function, concat or append. Unfortunately not at a PC, so I cannot test which one works

bitter harbor
#

just look up numpy.concatenate

#

(1, 28, 28)

silk axle
#
```py
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-c1eac66c6ac2> in <module>()
     16 
     17     ## Add data to training sets
---> 18     x_train.concatenate(reversed_image)
     19     y_train.concatenate(number_in_image)
     20 

AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'```
bitter harbor
#

again look up numpy.concatenate

silk axle
#

!d numpy.concatenate

arctic wedgeBOT
#
```py
numpy.concatenate((a1, a2, ...), axis=0, out=None)
```
Join a sequence of arrays along an existing axis.

Parameters

**a1, a2, …** : sequence of array_like. The arrays must have the same shape, except in the dimension corresponding to *axis* (the first, by default).

**axis** : int, optional. The axis along which the arrays will be joined. If axis is None, arrays are flattened before use. Default is 0.

**out** : ndarray, optional. If provided, the destination to place the result. The shape must be correct, matching that of what concatenate would have returned if no out argument were specified.

Returns

**res** : ndarray. The concatenated array.

See also

[`ma.concatenate`](numpy.ma.concatenate.html#numpy.ma.concatenate "numpy.ma.concatenate") : Concatenate function that preserves input masks.

[`array_split`](numpy.array_split.html#numpy.array_split "numpy.array_split") : Split an array into multiple sub-arrays of equal or near-equal size.... [read more](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html#numpy.concatenate)
silk axle
#

Oh

#
```py
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-6ffe324b79a0> in <module>()
     16 
     17     ## Add data to training sets
---> 18     np.concatenate(x_train, reversed_image)
     19     np.concatenate(y_train, number_in_image)
     20 

<__array_function__ internals> in concatenate(*args, **kwargs)

TypeError: only integer scalar arrays can be converted to a scalar index```
pale thunder
#

Look at the signature once more

bitter harbor
#

the mnist is a list (60000 items) of arrays (60000, 28, 28) you need to change the shape to (1, 28, 28) because you're adding 1 item
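
As a sketch of what is being described, with small zero-filled arrays standing in for the real dataset: the new sample gets a leading axis of length 1, and the arrays are passed to `np.concatenate` as a single tuple.

```python
import numpy as np

x_train = np.zeros((5, 28, 28), dtype=np.uint8)    # stand-in for the (60000, 28, 28) array
y_train = np.zeros(5, dtype=np.uint8)
new_image = np.ones((28, 28), dtype=np.uint8)      # one extra sample
new_label = 7

# Give the new sample a leading axis of length 1, then join along axis 0.
# Note the tuple: np.concatenate takes a sequence of arrays as its first argument.
x_train = np.concatenate((x_train, new_image[np.newaxis, ...]))
y_train = np.concatenate((y_train, np.array([new_label], dtype=y_train.dtype)))

print(x_train.shape, y_train.shape)
```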

silk axle
#

🤔

#

Also no clue what you mean by signature @pale thunder

bitter harbor
#

look at the parameters of the function

silk axle
#

ah

#
```py
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-ce76f79fc976> in <module>()
     16 
     17     ## Add data to training sets
---> 18     np.concatenate((x_train, reversed_image))
     19     np.concatenate((y_train, number_in_image))
     20 

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 2, the array at index 0 has size 28 and the array at index 1 has size 1
```
I'm assuming this is the thing of needing to reshape?
#

I'm really confused

bitter harbor
#

are you going to use a library to build your classifier

silk axle
#

I've already built the classifier (tensorflow.keras.models.Sequential)

#

So yes

#

If that's what you mean by classifier

bitter harbor
#

ah that makes sense

#

yes the type of nn

silk axle
#

yea

#
```py
## Build the CNN model
model = Sequential()
## Add model layers
model.add(Conv2D(32, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
```
this is how I build the CNN
bitter harbor
#

why'd you use ReLu

silk axle
#

Because the tutorial used relu 🤷

#

I've got no clue what relu/softmax is other than 'a classification algorithm'

#

The tutorial showed how to get it all working, and I'm now extending on it to make it better

bitter harbor
#

ok ya I was thinking you just looked it up

#

you should really learn about machine learning before you mess around with prebuilt algorithms

silk axle
#

I did a while back (about 2 years ago) so I know some basics, like kernels, but most stuff I either didn't learn or forgot

bitter harbor
#

because it all involves things like dot multiplication, cost functions, stats, and optimization in machine learning

#

or like matrix manipulation in general

#

3blue1brown has some excellent videos on nn's and linear algebra

silk axle
#

I'll check that out, thanks

#

But how do I resolve the above error?

bitter harbor
#

print np.shape(image)

silk axle
#

The reversed_image?

#

(28, 28, 1)

bitter harbor
#

no the mnist one

silk axle
#

Oh, I see

#

mnist is (28, 28)

#

So I need to add the 1?

#

I do that later in the code but ig I should do it here?

#
```py
## Reshape the data to fit the model
x_train = x_train.reshape(list(x_train.shape) + [1])
x_test = x_test.reshape(list(x_test.shape) + [1])```
bitter harbor
#

no

#

mnist database is (60000, 28, 28) each image in that database is (1, 28, 28), the image individually is (28, 28 (number of pixels)) but you're adding your image to the database's list - making it (60001, 28, 28)

silk axle
#

Why's that a problem though?

bitter harbor
#

you have (28, 28, 1)

#

that's not the same size as (1, 28, 28)
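
A sketch of the two shapes on a made-up array. Because the trailing channel axis has length 1, reshaping `(28, 28, 1)` to `(1, 28, 28)` keeps every pixel where it was; a transpose that reverses the axes would also swap rows and columns.

```python
import numpy as np

img = np.arange(28 * 28).reshape(28, 28, 1)   # (height, width, channel), made-up values

# Drop the trailing channel axis, then add a leading sample axis
as_sample = img[:, :, 0][np.newaxis, ...]
print(as_sample.shape)

# Because the channel axis has length 1, a plain reshape gives the same pixels
same = np.array_equal(as_sample, img.reshape(1, 28, 28))

# Reversing the axes with a transpose is different: it also flips rows and columns
flipped = np.array_equal(as_sample, img.transpose(2, 1, 0))
print(same, flipped)
```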

silk axle
#

🤔

#

So I want to make both (1, 28, 28, 1)?

#

Since I can't make both (1, 28, 28)

bitter harbor
#

what

#

you definitely can

silk axle
#

(28, 28, 1) = 28x28 1d
(1, 28, 28) = 1 image 28x28

#

That's not the same

#

The three numbers don't represent the same thing

bitter harbor
#

np.resize(image, (1,np.shape(image)))

silk axle
#

Which would just make it (1, 28, 28, 1) like I said?

bitter harbor
#

no it wouldn't

#

where are you getting the 1 at the end from

silk axle
#

The image is (28, 28, 1)

#

Adding a one in front makes it (1, 28, 28, 1)?

bitter harbor
#

no you're reshaping it not adding a one

silk axle
#
```py
np.concatenate((x_train.reshape(1, x_train.shape), reversed_image))
```
do you mean this?
#

I'm really confused as to what you're saying

bitter harbor
#

it doesn't change the values

silk axle
#

Yea ik, I meant prepending the shape with a 1

#

Ig I just worded badly

bitter harbor
#

so when you import the 28 by 28 images as an array, you change the shape so that you 'move' the image into the second+third dimension and 'list' it by making the first equal to 1

#

like I said I'd suggest learning about the topics I mentioned above

#

even a lot of the preprocessing involves them

silk axle
#

I still don't get how to solve the issue I've got

#

I need to reshape it, I get that

#

Gimme a sec

#

Surely what you're saying to do would be `np.concatenate((x_train.reshape(tuple([1] + list(x_train.shape))), reversed_image))`?

#

nvm

#
```py
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 3 dimension(s)
```
#

I really don't get what I'm doing

bitter harbor
#
```py
image = np.reshape(image, (1, 28, 28))
np.concatenate((x_train, image))
```
silk axle
#

Right

#

I did that also

#

Got a different error

#
```py
ValueError: cannot reshape array of size 47040000 into shape (1,28,28)
```
bitter harbor
#

sorry other way around

silk axle
#

I'm reshaping x_train atm

#
```py
np.concatenate((x_train.reshape(tuple([1] + list(x_train.shape[1:]))), reversed_image))
```
bitter harbor
#

why

silk axle
#

wait

#

Reverse image is (28, 28, 1)

#

And I want to make that (1, 28, 28), right?

#

Surely I can just reverse the shape?

#

Okay, that seems to have worked

#

Now error on the next line

#
```py
---> 19     np.concatenate((y_train, number_in_image))
     20 
     21 print(f"Train Shapes: X={x_train.shape}, y={y_train.shape}")

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)```
#

Wait nvm, I think I know why

#

Okay I think I got it working now? No errors at least

#
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
#print(f"MNIST dataset: {x_train.dtype}")
#print(x_train.shape)
## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')
    #print(f"My image: {reversed_image.dtype}")
    #print(resized_image.shape)

    ## Add data to training sets
    np.concatenate((x_train, reversed_image.reshape(tuple(reversed(reversed_image.shape)))))
    np.concatenate((y_train, np.array([number_in_image])))```
#

Except it doesn't actually concatenate 🤦

#

Ig I need to assign maybe?

#

yea

#

Seems to have worked :~)

bitter harbor
#
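```py
import glob
import os

import numpy as np
from skimage.io import imread
from skimage.transform import resize
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
_dir = "/content/drive/My Drive/training numbers/*"
for image_file in glob.glob(_dir):
    # Read as greyscale, shrink to 28x28 floats in [0, 1], then invert to white-on-black
    image = resize(imread(image_file, as_gray=True), (28, 28))
    reversed_image = (255 * (1 - image)).astype('uint8').reshape(1, 28, 28)
    x_train = np.concatenate((x_train, reversed_image))
    # The label is the first character of the file name, not of the full path
    label = int(os.path.basename(image_file)[0])
    y_train = np.concatenate((y_train, np.array([label], dtype=y_train.dtype)))
```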
silk axle
#

I don't need to /255 because it's already greyscale

#

And concatenate doesn't edit in-place, so I needed to do like x_train = np.concatenate((x_train, reversed_image))
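
A quick sketch of that point on stand-in arrays: `np.concatenate` returns a new array and leaves its inputs alone. Since each call copies everything, it may be cheaper in a loop to gather the new samples first and concatenate once at the end.

```python
import numpy as np

x_train = np.zeros((5, 28, 28), dtype=np.uint8)     # stand-in for the MNIST images
extra = [np.full((28, 28), i, dtype=np.uint8) for i in range(3)]   # made-up new samples

# np.concatenate returns a new array; the inputs are left untouched
result = np.concatenate((x_train, extra[0][np.newaxis]))
print(x_train.shape, result.shape)

# Each call copies the whole array, so inside a loop it can be cheaper to
# gather the new samples first and concatenate once at the end
x_train = np.concatenate((x_train, np.stack(extra)))
print(x_train.shape)
```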

#

But yea that's basically what I need ig

#

Thanks for the help 👍

bitter harbor
#

that should work now

#

as long as the path is the actual path not just what you have

fleet moth
#

Is it possible to create an array of legends and datas from my DataFrame?

#

datas = self.db.select().to_numpy() ?

desert oar
#

@fleet moth what is the purpose of this?

#

pandas doesn't have a concept of "legends"

mellow tiger
#

Howdy. What are some good tools for visualization using Pandas? I'm looking for something more presentable than pandas profiling. Also it needs to be confidentiality compliant, so no Datapane

desert oar
#

i just make my own plots w/ matplotlib

#

pandas has a bunch of convenience functions that generate matplotlib plots for you

limpid raft
#

Is there a difference between an array of shape (x,1) and an array (x)?

pale thunder
#

Yes, you have 2 axes in one case and 1 in the other. It affects quite a bit actually

limpid raft
#

but if the length of the second axis is 1, doesn't it become 1D again?

pale thunder
#

No, you can have a 1 long axis. It is useful for example when concating a (28,28) to a (1000, 28, 28)
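
A tiny sketch of how the extra axis changes behaviour, on made-up values: the two arrays hold the same numbers, but broadcasting treats them differently.

```python
import numpy as np

flat = np.arange(3)            # shape (3,)
column = flat.reshape(3, 1)    # shape (3, 1): same values, one extra axis

print(flat.ndim, column.ndim)

# Broadcasting treats the two very differently
print((flat + flat).shape)
print((flat + column).shape)   # (3,) with (3, 1) broadcasts to a 3x3 grid
```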

pale marsh
#

What's better, having lots of mini dataframes or one big dataframe in pandas ?

verbal ice
#

Howdy. What are some good tools for visualization using Pandas? I'm looking for something more presentable than pandas profiling. Also it needs to be confidentiality compliant, so no Datapane
@mellow tiger you can use seaborn

bitter harbor
#

@pale marsh it depends on how it’s being generated/used, how large the total dataset is, and what it’s for

pale marsh
#

@bitter harbor right now my program takes one big dataframe and splits it into loads of small ones, it gets passed in a list of dataframes to another class which plots graphs (just simple histograms rn) with them using altair

bitter harbor
#

What’s the format of the dataframe?

pale marsh
#

There are some columns with 15000 rows and I think I'm just using at the most 1/4 of the total dataset I think

#

It gets it from a rosbag

#

Oh wait misread that

bitter harbor
#

Can you send the first couple rows?

#

With titles if there are any

pale marsh
#

Sorry I don't wanna login to discord on their laptop

#

4000x18 set

bitter harbor
#

Of?

#

Numbers?

#

Strings?

pale marsh
#

Floats, ints and arrays of floats, and 2D arrays of floats

bitter harbor
#

How were you planning to use a histogram for that then?

#

Like unless you do a separate one for each data set idk how possible it’d be

pale marsh
#

Right now I split each column into its own dataframe and just histogram it up like that, but with Altair I can just give it a large dataframe and specify which column to use the data from

bitter harbor
#

That won’t work with 2d arrays mixed in

#

Or it will, but you won’t be able to use the same function

pale marsh
#

Oh yh I was thinking of splitting those off into their own separate dataframe while the single value columns I leave in one big one

#

But idk which is more efficient

#

Think it's supposed to scale up later

bitter harbor
#

You’ll have to break them up

#

At least I can’t think of any way to do that

#

Also is there a reason you’re using python 2?

desert oar
#

If you are just plotting it doesn't really matter how you organized your data, as long as you understand the code and it's not too complicated for others to understand

#

That said, I don't think I fully understand what you are doing with this data

serene scaffold
#

salt rock lamp, do you understand how torch.Softmax works?

#

I'm trying to use it as a loss function

desert oar
#

i know how softmax works, i dont know how the intricacies work in torch

#

it's not a loss function

#

it's a "layer"

serene scaffold
#

Ah

pale marsh
#

Also is there a reason you’re using python 2?
@bitter harbor something to do with the rosbags not being compatible with python 3 I think tho I'm not entirely clear on that

desert oar
#

you know what softmax means/does?

serene scaffold
#

I remember using it when I took linear algebra but I forgot how it's defined.

desert oar
#

ok

#

you've seen a logistic curve?

serene scaffold
#

let's see

desert oar
#

it compresses the real line to (0, 1)

serene scaffold
#

Looks like sigmoid

desert oar
#

yeah

#

softmax is the multivariate generalization thereof

#

so it compresses R^n to (0,1)^n, with the outputs summing to 1
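
A rough sketch of that relationship, assuming plain NumPy (the helper names `sigmoid` and `softmax` are just for illustration, not torch's API). It shows that softmax outputs stay in (0, 1) and sum to 1, and that the two-class case collapses to the logistic curve:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps each real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Softmax over the last axis; subtracting the max is only for numerical stability."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

scores = np.array([2.0, -1.0, 0.5])
probs = softmax(scores)
print(probs, probs.sum())

# With two classes and one score pinned at 0, softmax reduces to the sigmoid.
x = 1.3
two_class = softmax(np.array([x, 0.0]))
print(two_class[0], sigmoid(x))
```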

bitter harbor
#

Sigmoid is -1,1 softmax/ReLu is 0,1

serene scaffold
#

Do I even need it then if my vectors are one dimensional?

desert oar
#

what is your output?

#

if you're doing classification you (probably) need it

#

if you're doing regression you (probably) shouldn't use it, unless your regression target has hard upper and lower bounds

serene scaffold
#

Mapping between two vector spaces. So the idea is that once the weights are tuned correctly, a length 768 vector will be the right 200 length vector in another space.

desert oar
#

both are vector spaces over R though?

#

R^768 -> R^200 ?

serene scaffold
#

I'm not sure what that means

desert oar
#

ℝ, real numbers

spark stag
#

Sigmoid is -1,1 softmax/ReLu is 0,1
@bitter harbor i think sigmoid is also 0-1, hyperbolic tangent is -1, 1, is this what you meant

desert oar
#

p sure sigmoid is just "neural network lingo" for logistic function, no?

serene scaffold
#

All the elements in the vector are real numbers, yes

desert oar
#

the logistic curve is quite literally sigmoid

#

in that case you do not want a sigmoid/logistic/softmax on your output layer. you can have it on the hidden layers

bitter harbor
#

Idk why I’ve been normalizing data to -1,1 then

desert oar
#

you can also re-parameterize the logistic function to map to (-1, 1)

#

hell you can change the center

#

so you can have (-300, 500) with a center at 100
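
One way to read that, as a small sketch: rescale the logistic curve so it maps into an arbitrary interval. The function name and the `x_mid` / `steepness` parameters are made up for illustration. With the range (-300, 500), the curve's midpoint value is 100:

```python
import math

def scaled_logistic(x, low, high, x_mid=0.0, steepness=1.0):
    """Logistic curve rescaled to (low, high); it crosses the middle of that range at x_mid."""
    return low + (high - low) / (1.0 + math.exp(-steepness * (x - x_mid)))

print(scaled_logistic(0.0, -300, 500))   # middle of the range
print(scaled_logistic(0.0, -1, 1))       # the (-1, 1) variant is centered on 0
print(scaled_logistic(5.0, -300, 500))   # moving toward the upper bound
```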

#

but why would you bother

#

@bitter harbor idk either 🙂 i dont like normalizing real-valued data. only when it has known and strict upper/lower bounds such as images which are 0-255 for example

#

for classification -1/1 was an old-school thing from when everyone loved SVMs

bitter harbor
#

Oh I stg I wasted so much time on doing that to audio

desert oar
#

it probably makes sense in specific domains

#

maybe its recommended by audio people

#

i usually work with what youd call "social science data" so thats where my recommendations come from

bitter harbor
#

Ah ya idk I remember hearing it somewhere once and just went with it

#

Oh i think it’d be better if you need more certainty with your floats

desert oar
#

yeah but why normalize when you can also standardize

#

then you're just shifting and scaling without actually clipping your data

#

which might or might not be relevant depending on your data and model

bitter harbor
#

Because standard deviation is gross

#

And I refuse to learn it

modest rune
#

Hopefully an easy question to answer. When writing functions to do some pandas dataframe manipulation. Is it better to: (from a performance perspective)
(a) construct the dataframe outside of the function, pass into the function as a parameter, then modify it.
(b) construct the dataframe inside the function, modify it, then return the dataframe?
(c) both work equally well.

bitter harbor
#

Nns are hard enough as is

desert oar
#

idk how you expect to do anything with data and not at least know basic stats

#

you dont even need to understand it to scale by it

bitter harbor
#

d) use numpy

desert oar
#

@modest rune (c)
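
A small sketch of why (a) and (b) should cost about the same: passing a DataFrame into a function passes a reference to the same object, not a copy. The column names and the `add_total` helper are invented for the example:

```python
import pandas as pd

def add_total(df):
    """Adds a column in place and returns the very same object (no copy is made here)."""
    df["total"] = df["a"] + df["b"]
    return df

outside = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
result = add_total(outside)

# The function worked on the caller's DataFrame, so both names point at one object.
print(result is outside)
print(outside)
```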

#

numpy has the same considerations

bitter harbor
#

Yes but numpy

desert oar
#

...is underneath most pandas ops

#

so if you have mixed data types or you happen to enjoy the use of column names

#

pandas is much easier to work with

#

numpy is for math

bitter harbor
#

Yes but numpy

desert oar
#

pandas is for data

#

using numpy for data is like using a bit mask instead of kwargs in python

#

it works, but why

bitter harbor
#

Math is better anyways and as for data manipulation, I’ve found that with audio/images/not social data there’s barely any sort of stat stuff

desert oar
#

yes, that's fine

#

Math is better anyways
so go learn standard deviation 😉

#

i wouldn't use pandas for images nor would i use numpy for HLOC time series ticker data

bitter harbor
#

Does pandas have the same sort of flexibility as numpy tho?

#

Because numpy seems to be useful for pretty much everything

modest rune
#

OK, well... I took my pandas vectorized math... 3 lines of relatively simple code. Moved it into a function without modifying what it does, and the performance decreased by 35%. I am trying to understand what changed. Any ideas?

desert oar
#

what do you mean flexibility? it feels like you're trying to artificially make this into a "vs" argument where none exists

#

pandas is a tool for manipulating tabular data, using numpy under the hood

#

you can use whatever tool you want for whatever purpose you want. i'm just recommending against using plain numpy for most datasets

#

@modest rune can you show your code with some context

#

not just the function

bitter harbor
#

Hm ya I didn’t know that but the little data I’ve worked with I’ve used plain numpy

#

I’ll keep that in mind thanks

modern canyon
#

Hello y'all, I am building a recommendation system and I have the following features 'genres', 'numVotes', 'averageRating' with the following stats:

mean 7.365355
std 0.588674
min 6.500000
max 9.800000
Name: averageRating, dtype: float64

mean 0.068966
std 0.257881
min 0.000000
max 1.000000
Name: genres (30 classes -> one hot encoded)

mean 75096.11
std 151962.13
min 5000.00
max 2260919.00
Name: numVotes, dtype: float64

How do I normalize these features?
I want to calculate cosine similarity after concatenating all these three features together

bitter harbor
#

Normalize the features all together or separately?

modern canyon
#

together

#

idk for sure I'm a beginner

bitter harbor
#

Normalizing 2 million and 0-1 is gonna give you some pretty small values

modern canyon
#

yeah, probably have to normalize separately and weight them accordingly afterwards

#

what do you think?

modest rune
#

I had to mess with the code a bit to obscure what it does, but here it is.

Function Call

call_profit_df = self.CalcOptionProfit(options_df, one_df, broker_fees, 
                                      options['underlyingPrice'], investment)

Function

    def CalcOptionProfit(self, options_df, one_df, broker_fees, underlyingPrice, investment):
        profit_df = pd.concat([one_df] * options_df.shape[0], axis=1, ignore_index=True)
        profit_df = (profit_df * underlyingPrice) + options_df['strikePrice']
        profit_df = investment + ((profit_df + options_df['price'] - broker_fees['per_contract']
                            - (broker_fees['percent'] * options_df['price'])) *       
                            options_df['contracts']) + broker_fees['flat']
        
        return profit_df
#

like I said above, the only difference between before and now that caused the 30% speed reduction was moving the code from the function call location into the newly created function.

bitter harbor
#

yeah, probably have to normalize separately and weight them accordingly afterwards
Why are you normalizing in the first place?

desert oar
#

i'd consider normalizing averageRating and standardizing numVotes

modern canyon
#

because there are too many outliers in numVotes column

desert oar
#

clipping outliers should be considered a separate task

#

from centering/scaling or normalizing the data

modern canyon
#

i'd consider normalizing averageRating and standardizing numVotes
@desert oar standardizing? what's that?

desert oar
#

subtract sample mean, divide by sample std dev
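
As a rough sketch of the suggestion above (min-max normalize the rating, standardize the vote count, then concatenate and compare with cosine similarity), assuming NumPy only. The four rows of data are made up, and how to weight the blocks against each other is left open:

```python
import numpy as np

def min_max(x):
    """Rescale to [0, 1] using the observed min and max."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Subtract the sample mean and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up rows in the spirit of the question: rating, vote count, one-hot genre.
ratings = np.array([6.5, 7.2, 9.8, 7.0])
votes = np.array([5000.0, 60000.0, 2260919.0, 12000.0])
genres = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

# Scale each feature on its own, then stack everything into one matrix.
features = np.column_stack([min_max(ratings), standardize(votes), genres])
print(cosine_similarity(features[0], features[1]))
```

Note that standardizing does not clip anything: the row with 2,260,919 votes is still far from the others after scaling.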

#

@modest rune i think your function looks fine. it might be slower if you're doing this a very large number of times in a hot loop due to function call overhead

#

but just moving code to a function should not make it slower

modern canyon
#

hmm

modest rune
#

@desert oar But it does... And I am only calling the function once. I have spent the past 4 days converting all my math to vectorized math... no more loops (at least not yet).

modern canyon
#

thanks for the info guys!

desert oar
#

@modest rune

        profit_df = pd.concat([one_df] * options_df.shape[0], axis=1, ignore_index=True)

this looks like weird code

#

you're concat-ing the same data frame N times?

#

that looks guaranteed to be slow

modest rune
#

I created the function for 1 main reason and 1 secondary reason... (Main Reason): An article I read said it is a good practice so that garbage collection could clean up any unused variables when the variables go out of scope. (2nd Reason): Code cleanliness.

Regarding (Main Reason)... I was worried that my assignment of dataframes to a different name each time I make a modification, might be leading to a lot of copies and making a bigger memory footprint. Is this something I should be worried about, or is pandas good about only making copies when absolutely necessary?

#

@desert oar There might be a better way... but there is a good reason for why I am doing that. Let me see if I can explain.

Profit_Scenarios = [1.2, 1.5, 0.6, 5.0]
Stock_Data = pd.DataFrame( columns = ['ticker', 'Price'],
                            data =  [[  'NFLX',   150.2],
                                     [  'GOOG',   304.1]])
## Desired Results ##
['ticker', 'price']   [          1.2,          1.5,          0.6,          5.0]
[  'NFLX',   150.2]   [f(150.2, 1.2),f(150.2, 1.5),f(150.2, 0.6),f(150.2, 5.0)]
[  'GOOG',   304.1]   [f(304.1, 1.2),f(304.1, 1.5),f(304.1, 0.6),f(304.1, 5.0)]

So, I start this process by first creating a Profit_Scenarios dataframe with the right number of rows to match the Stock_Data dataframe.

#

But... we are getting side tracked... I still don't understand the 30% slowdown

desert oar
#

both are good reasons

#

and yes pandas is usually good about not copying data, but not always

#

i dont understand it either. frankly it shouldnt happen

#

it suggests that you made a mistake and changed something during refactoring

#

ah ok so

#

you're just overwriting the same dataframe

modest rune
#

it suggests that you made a mistake and changed something during refactoring
@desert oar

Yeah... probably a 10% chance I did something. But, the change was so simple, as I spend more time with eyes on code, I am feeling less like that is happening.

desert oar
#

overwriting the same df basically means that the "old" version of the df already goes out of scope

#
x = 1
x = 2

the 1 is out of scope as soon as you re-assign to x

#

so moving to a function in this particular case doesn't help give any hints to the GC
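
A tiny sketch of that point, using a weak reference to watch the old object. `Blob` is a made-up stand-in for a big DataFrame, and the immediate cleanup shown here relies on CPython's reference counting:

```python
import weakref

class Blob:
    """Stand-in for a large object such as a DataFrame."""

x = Blob()
probe = weakref.ref(x)   # lets us observe the first Blob without keeping it alive
x = Blob()               # rebinding x drops the last reference to the first Blob

# On CPython the first Blob is reclaimed as soon as nothing refers to it,
# so the weak reference now resolves to None.
print(probe() is None)
```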

modest rune
#

"GC"?

#

oh garbage collection

#

Is there a better way to pull off the math without duplicating those rows? I couldn't think of a different vectorized way to do it.

#

I know I could loop... but that's how I used to do it and it was much much slower

desert oar
#

what is one_df

modest rune
#

one_df is the same thing as profit_scenarios in my example

desert oar
#

ahh

#

ok this actually might be a good case for numpy

#

but have fun keeping track of all those indexes

#

how important is performance? i'd just use a loop personally

modest rune
#

Well, it IS working without looping. I am just trying to optimize further.

#

And looping was a 400% slowdown.

desert oar
#
profit_scenarios = [1.2, 1.5, 0.6, 5.0]
scenario_outcomes = [compute_profit(stock_data, scenario) for scenario in profit_scenarios]
modest rune
#

yeah, that is what I used to do. Took 10 seconds, right now it is taking 2.5 seconds with an even larger data set

desert oar
#

Ohhhh so you are just expanding the scenario values to match the data shape

modest rune
#

yes

desert oar
#

Sec

bitter harbor
#

ok this actually might be a good case for numpy
Seems like it

modest rune
#

Switching to numpy is on my list of things to do next.

desert oar
#

hold on though

#

you might still end up with lots of copies of your data?

#

yeah actually

modest rune
#

I am thinking the temporary switch to numpy for this particular function will be trivial... a lot of DataFrame.values and maybe some conversions to float64 here and there.

desert oar
#

is options_df repeating the data already? like once per scenario?

modest rune
#

you might still end up with lots of copies of your data?
@desert oar

Help me understand. I have difficulties understanding when copies might be occurring and when not.

#

is options_df repeating the data already? like once per scenario?
@desert oar

I don't understand what you mean?

#

What data?

desert oar
#

i don't see how your concat code does the same thing as the for loop

#

this might just be me being slow/thick

#

is broker_fees the same shape as options_df?

#

or is broker_fees a dict of scalars

modest rune
#

broker_fees is a dict of scalars

desert oar
#

are you using calcOptionProfit inside a for loop?

modest rune
#

Since the two dataframes have the same height, pandas is looping through options and one behind the scenes.

#

nope... one call

desert oar
#

it looks like options_df is something like

profit_scenarios = [1.2, 1.5, 0.6, 5.0]

stock_data = pd.DataFrame( columns = ['ticker', 'price'],
                            data =  [[  'NFLX',   150.2],
                                     [  'GOOG',   304.1]])

options_df = pd.concat([stock_data] * len(profit_scenarios), axis=1)
modest rune
#

That is what I have learned about pandas and numpy. The bigger set of data you can pass at one time, the more time savings.

desert oar
#

otherwise this doesn't make sense to me

modest rune
#
one_series = pd.Series([1.2, 1.5, 0.6, 5.0])
options_df = pd.DataFrame( columns = ['ticker', 'price'],
                            data =  [[  'NFLX',   150.2],
                                     [  'GOOG',   304.1]])

one_df = pd.concat([one_series] * len(options_df), axis=1)
## Result ##
['ticker', 'price']
[  'NFLX',   150.2]   [          1.2,          1.5,          0.6,          5.0]
[  'GOOG',   304.1]   [          1.2,          1.5,          0.6,          5.0]

new_df = one_df.transpose() * options_df['price']
## Result ##
['ticker', 'price']
[  'NFLX',   150.2]   [    1.2*150.2,    1.5*150.2,    0.6*150.2,    5.0*150.2]
[  'GOOG',   304.1]   [    1.2*304.1,    1.5*304.1,    0.6*304.1,    5.0*304.1]
#

Only, I think you have to use one_df.transpose() to get pandas to do the math right. edited pseudocode to show that.
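
For comparison, a minimal NumPy sketch of the same scenario-by-ticker grid that avoids repeating the scenario row with concat. The array names are invented, and plain multiplication stands in for the real f(price, scenario):

```python
import numpy as np

scenarios = np.array([1.2, 1.5, 0.6, 5.0])   # one value per scenario
prices = np.array([150.2, 304.1])            # one value per ticker (NFLX, GOOG)

# prices[:, None] has shape (2, 1) and scenarios[None, :] has shape (1, 4).
# Broadcasting combines them into a (2, 4) grid without building repeated copies first.
grid = prices[:, None] * scenarios[None, :]
print(grid.shape)
print(grid)
```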

#

@desert oar Good news! I was chasing a ghost! I am using pyinstrument, which is a great profiler but it isn't deterministic. I reverted back to my non-function implementation and I am seeing the same slowdown. There must be something slowing my laptop down.

#

Sorry for dragging you all along for the journey.

#

I do want to go back to this though...
"you might still end up with lots of copies of your data?"
"is options_df repeating the data already? like once per scenario?"

Do you see something suspicious from a performance perspective that I should be aware of?

#

or at least investigate?

#

Gonna blame the slowdown on windows update. Now the function implementation is just as fast. 🙂

earnest wadi
#

I really don't understand this

def forward(self, inputs):
    print(inputs)
    print(self.weights)
    self.output = sigmoid(np.dot(inputs, self.weights) + self.biases)
modest rune
#

You need to transpose the second variable in your dot operation

#

so that it is (2,1) and (1,2)

#

The inner dimensions of a dot operation need to be equal
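
A quick sketch of that rule with dummy arrays (the shapes mirror the (2,1) and (1,2) case above; the values are placeholders):

```python
import numpy as np

a = np.ones((2, 1))
b = np.ones((1, 2))

# (2, 1) dot (1, 2): inner dimensions are both 1, result takes the outer dimensions.
print(np.dot(a, b).shape)   # (2, 2)

# (1, 2) dot (2, 1): inner dimensions are both 2.
print(np.dot(b, a).shape)   # (1, 1)

# (2, 1) dot (2, 1): inner dimensions 1 and 2 differ, so NumPy refuses.
mismatch_raised = False
try:
    np.dot(a, a)
except ValueError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```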

earnest wadi
#

alright, that worked

#

now this

modest rune
#

I am not well versed in matrix math, only knew the answer because I had just run into the same issue the other day. But... my guess is that, broadcasting (I don't know what broadcasting is), for whatever reason, needs the two values to have the same shape.

#

From what little I know about matrix math. Addition and subtraction can only happen on matrices with the same shape

earnest wadi
#

oh, haha, alright, ill have a play around

modest rune
#

Yours are not the same shape.

#

(2,1) and (2,2). My guess is that your (2,2) matrix was the result of your dot operation. (2,1) dot (1,2) produces a matrix of shape (2,2)

earnest wadi
#

oops

#

self.delta: [[-3.24185123e-14 -3.48982562e-12] [ 7.88860905e-31 3.86714987e-07]]
inputs.t [[1.00000000e+00 1.00000000e+00] [1.86810924e-06 3.86715286e-07]]

def backward(self, inputs, outputs):
    self.delta = (outputs - self.output) * (self.output * (1 - self.output))
    print(self.delta)
    print(inputs.T)
    self.weights += inputs.T.dot(self.delta)

any clue?
#

they both look the same shape to me

#

they both look 2,2

#

tbf idrk what im talking abt

#

haha

modest rune
#

That was not helpful, because I cannot know the shape of inputs and outputs by looking at the code. FYI, you can take your numpy array and print the .shape attribute if you want to quickly see its shape. EX: print(inputs.shape)

earnest wadi
#

alr

#

they are both (2, 2)

#

hmmmmmmmmmmmmmmmmm

#

oh

#

wait

#

im dum

#

self.weights is (2, 1)

desert oar
#

@modest rune my question isn't about performance, it's that your code doesn't look like it would emit the right result. Can you provide some sample data and the expected outputs?

modest rune
#

@desert oar only if I still have the code I used to test it out. Let me see. Otherwise, I already validated the data and it will take too much time to mock something up.

desert oar
#

Just some made up option ticker data?

earnest wadi
#

salt rock, you got any idea about my lil problem?

modest rune
#

I still have it...

import pandas as pd
import numpy as np

gain_scenarios = pd.Series([0.34, 0.21, 0.56, 0.11, .54, 1.6, 0.88, 0.01, 0.5])
scalar = 52.0

stock_data = pd.DataFrame(columns =  ['Ticker', 'Shares', 'Cost_Per_Share'],
                             data = [['NFLX'  , 100.0     , 0.10          ],
                                     ['AAPL'  , 150.0    , 0.20           ],
                                     ['GOOG'  , 500.0     , 5.10          ],
                                     ['F'     , 70.0      , 7.10          ],
                                     ['BKSR'  , 130.0     , 0.90          ],
                                     ['AMZN'  , 90.0      , 5.10          ]])

gain_expanded = pd.concat([gain_scenarios] * stock_data.shape[0], axis=1, ignore_index=True)
print(gain_expanded.shape)
print(gain_expanded)
gain_expanded = ((gain_expanded + 1) * scalar)
print(gain_expanded)
gain_expanded = gain_expanded - stock_data['Shares']
print(gain_expanded)
#

@earnest wadi I can help, but, please print the shape of inputs and outputs

#

I mean, let me know what they are.

desert oar
#

^ this

earnest wadi
#

inputs (2, 2)
outputs (2, 2)
weights (2, 1)

#

weights is the problem

desert oar
#

Seems ok

#

inputs @ weights should work

earnest wadi
#

?

modest rune
#

self.weights and self.delta

earnest wadi
#

(2, 2)
(2, 2)
(2, 1)
Traceback (most recent call last):
File "a:/Python/Libraries/test.py", line 13, in <module>
network.run(X, y, epochs=90)
File "a:\Python\Libraries\main.py", line 55, in run
layer.backward(X, y)
File "a:\Python\Libraries\main.py", line 33, in backward
self.weights += inputs.T.dot(self.delta)
ValueError: non-broadcastable output operand with shape (2,1) doesn't match the broadcast shape (2,2)

modest rune
#

nevermind, you are defining delta

earnest wadi
#
def backward(self, inputs, outputs):
    self.delta = (outputs - self.output) * (self.output * (1 - self.output))
    print(self.delta.shape)
    print(inputs.shape)
    print(self.weights.shape)
    self.weights += inputs.T.dot(self.delta)
modest rune
#

what is "self.output", that is a different variable than outputs

earnest wadi
#

self.output is the output for the layer in question, outputs is the output of the whole neural network up to said layer

#

this guys isnt written amazingly as it is fixed and you have to manually add and remove code to add layers, mine just uses classes and functions

#
class layer_dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = sigmoid(np.dot(inputs, self.weights.T) + self.biases)

    def backward(self, inputs, outputs):
        self.delta = (outputs - self.output) * (self.output * (1 - self.output))
        print(self.delta.shape)
        print(inputs.shape)
        print(self.weights.shape)
        self.weights += inputs.T.dot(self.delta)
modest rune
#

Sorry all these numbers flying around... I have lost track of everything. Can you clearly tell me the values of these:
self.output.shape
inputs.shape
outputs.shape (edited to add this)
self.weights.shape (before the function is run)

earnest wadi
#

alr

desert oar
#

For future reference, the best way to get help is with a minimal reproducible example

#

Sample data + code that reproduces the error

earnest wadi
#

self.output.shape (2, 2)
inputs.shape (2, 2)
outputs.shape (2, 1)
self.weights(2, 1)

#

oh

modest rune
#

please include the name of the variable... that way I don't have to make assumptions

#

And, if you didn't notice, I snuck in outputs.shape too

earnest wadi
#

there

modest rune
#

Please, the shape of self.weights, not the values

earnest wadi
#

you asked for the values -_-

#

alr

modest rune
#

oh, I did 🙂

#

my typo!

earnest wadi
#

self.output.shape (2, 2)
inputs.shape (2, 2)
outputs.shape (2, 1)
self.weights.shape(2, 1)

modest rune
#

Thanks!

#

which line is line # 13 in your code?

earnest wadi
#

network.run(X, y, epochs=90) which is this

#
class network:
    def __init__(self, io):
        total = len(io)
        layers = []
        for i in range(total):
            layers.append(layer_dense(io[i][0], io[i][1]))
        self.layers = layers

    def run(self, X, y, epochs):
        for r in range(epochs):
            i = -1
            for layer in self.layers:
                layer.forward(X)
                X = layer.output
                i += 1
                layer.backward(X, y)
        self.output = self.layers[i].output
modest rune
#

Dumb, question, I didn't need that. Thanks though 🙂

earnest wadi
#

lol

modest rune
#

I didn't think this was possible:
(outputs - self.output)

#

since outputs is (2,1) and self.output is (2,2)

earnest wadi
#

oh yeah

#

uh

modest rune
#

but... like I said before, I am not an expert with matrix math, so take what I write with a grain of salt

earnest wadi
#

haha

modest rune
#

maybe numpy is able to handle that situation and makes some assumptions about what you are trying to do.

earnest wadi
#

maybe

modest rune
#

@desert oar Were you able to verify my code works on your end?

desert oar
#

Numpy will maybe broadcast the mismatched vector
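
A short sketch of what that broadcasting looks like for the shapes in question. The numbers are placeholders. NumPy compares dimensions from the right and stretches any dimension of size 1:

```python
import numpy as np

outputs = np.array([[1.0], [0.0]])                   # shape (2, 1)
layer_output = np.array([[0.2, 0.9], [0.4, 0.1]])    # shape (2, 2)

# The single column of `outputs` is reused for every column of `layer_output`.
diff = outputs - layer_output
print(diff.shape)   # (2, 2)
print(diff)
```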

#

No i havent

#

I need to head offline soon. @ me in a few hours

modest rune
#

k

earnest wadi
#

im gonna take a break as well, unless you have any final ideas?

modest rune
#

yes, I have an idea

earnest wadi
#

im all ears

modest rune
#

the error ValueError: non-broadcastable output operand with shape (2,1) doesn't match the broadcast shape (2,2) is pretty specific

#

It is saying that self.weights is (2,1) and the other variables on that line of code are (2,2), and that normally this won't work because of Matrix Math. But... Numpy is nice enough to make assumptions for you by doing what they call broadcasting (as salt rock lamp mentioned)

#

BUT

#

They are refusing to do broadcasting for the += operation.

earnest wadi
#

so

#

should I do

#

what

modest rune
#

Well, if it were me, I would assume that you shouldn't be mixing (2,1) and (2,2) in the first place. Maybe they are all supposed to be (2,1) or all supposed to be (2,2), maybe you ended up with a matrix shape that is incorrect somewhere along the line.

#

This will require you understand what your algorithm is trying to do.

#

Not a clean answer I know, best I can do though. Good LUCK!

earnest wadi
#

thanks for your help 😄

#

ahhahahaha

#

@modest rune

#

I made

#

self.weights += blah

#

to self.weights = self.weights + blah

modern canyon
#

clipping outliers should be considered a separate task
@desert oar how do I handle outliers?

earnest wadi
#

and it worked

#

hahahahaha

modest rune
#

there you go! The error only said += won't work... never said B = B + A won't work
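
A small sketch of the difference, with placeholder arrays in the same shapes. One thing worth noticing: the rewritten form runs, but it quietly turns the (2,1) weights into a (2,2) array, so the shape mismatch may still be hiding a bug upstream:

```python
import numpy as np

weights = np.zeros((2, 1))
update = np.ones((2, 2))

# In-place add must write the result back into the existing (2, 1) buffer,
# and a (2, 2) result does not fit there.
in_place_error = None
try:
    weights += update
except ValueError as err:
    in_place_error = str(err)
print(in_place_error)

# The out-of-place form allocates a brand new array, so broadcasting is allowed...
weights = weights + update
# ...but the weights are no longer (2, 1).
print(weights.shape)   # (2, 2)
```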

earnest wadi
#

xd

desert oar
#

@modern canyon that's a big topic

modern canyon
#

I see

#

is there any scikit-learn function that can do it for me?

desert oar
#

err

#

actually it looks like they've added a lot more useful functionality for outlier detection

modern canyon
#

thanks!

desert oar
#

this is much more complete than in the past

#

and this is a nice user guide. lots of pretty pictures

#

the sklearn user guide is turning into quite a nice document

queen barn
#

I have a pretty broad question, but I hope that there is a data science pro that's nerdy enough to find some joy in helping me out. I'm working on an analysis with a data set of about 4500 rows, 95 categorical variables (optional configurations for a product), and a binary output "did it fail or not". What kind of approach would help me figure out which of these categorical variables is more correlated to failures?

#

It's not quite a multiple linear regression because while there are multiple factors that need to be considered, I'm more interested in which of the individual factors contribute to failures more often, and thus are better indicators of failures.

bitter harbor
#

Is the output for each variable or for each row

queen barn
#

No, one ID, many categories, binary result.

#

That's the row configuration.

bitter harbor
#

I’d say tag the categories for when it fails, so maybe append them to a list? and checking which item(s) shows up the most in the list

queen barn
#

Yeah, I can definitely do that manually, but I'm concerned about that approach for two reasons. 1) gotta be a better way to do that and 2) that really only captures if a category was involved in an outcome that was a failure, not whether or not it's a factor in predicting failures. Will it be more likely that if that variable is involved that a failure will happen? Yes, but how much more likely? How strong is the correlation? Are there other variables that were also involved in the failure that are more likely to be correlated to the failure? Is it the combination of the variables that makes the failure more likely?

bitter harbor
#

Those can be answered in post processing

#
factors = []
for column in dataset:
    if column[3] == failed:
        factors.append(column[2])
queen barn
#

I can definitely do that. Do you mind elaborating on how that would be answered in post processing?

bitter harbor
#

No sorry but it would depend on what you’re trying to calculate

queen barn
#

Do you need me to elaborate more on my end?

bitter harbor
#

Nope

#

You’ll have a list and you’ll have the ability to count how many times a factor comes up

#

Correlation might be a bit harder to define considering its 95 variables to 1 outcome

#

Finding math oriented statistical properties on the other hand would be as simple as just performing basic operations
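
A rough sketch of the counting idea, taken one step further: instead of only counting how often an option shows up among failures, divide by how often it shows up at all, which gives a failure rate per option that can be compared with the overall rate. The six rows and the variable names are invented, and this does not capture interactions between options:

```python
from collections import Counter

# Each row: (chosen options, did it fail?). Made-up data for illustration.
rows = [
    ({"color": "red", "motor": "A"}, True),
    ({"color": "red", "motor": "B"}, True),
    ({"color": "blue", "motor": "A"}, False),
    ({"color": "blue", "motor": "B"}, False),
    ({"color": "red", "motor": "A"}, False),
    ({"color": "blue", "motor": "B"}, True),
]

seen = Counter()      # how often each (variable, level) appears
failed = Counter()    # how often it appears in a failed row
for options, did_fail in rows:
    for pair in options.items():
        seen[pair] += 1
        if did_fail:
            failed[pair] += 1

overall_rate = sum(did_fail for _, did_fail in rows) / len(rows)
rates = {pair: failed[pair] / seen[pair] for pair in seen}

# Options sorted by how far their failure rate sits above the overall rate.
for pair, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(rate, 3), "vs overall", round(overall_rate, 3))
```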

mossy sand
#

Not sure if this would be the correct channel. Was contemplating the idea of webscraping individual house value estimates. Like if a homeowner wanted to check the value of their house.

I know Zillow has an API. Not sure if that would be the best route

Thoughts? Different channel?

bitter harbor
#

I’d put your question in new help channel

#

Might get more people viewing it

mossy sand
#

Okay, thank you.

terse torrent
#

Is there a way to have Pandas import all Excel sheets without having to label each one individually?

desert oar
#
with pd.ExcelFile('mybook.xlsx') as efile:
    dfs = {sheet: efile.parse(sheet) for sheet in efile.sheet_names}
#

that gives you a dict mapping sheet names to data frames
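
If one combined frame is more convenient than 100 separate ones, a possible follow-up is to stack the dict with `pd.concat`. This is only a sketch: the small `dfs` dict below is a stand-in for the one built from the workbook, and it assumes every sheet shares the same columns:

```python
import pandas as pd

# Stand-in for the dict produced from the workbook (sheet name -> DataFrame).
dfs = {
    "sheet_a": pd.DataFrame({"value": [1, 2]}),
    "sheet_b": pd.DataFrame({"value": [3]}),
}

# Passing a dict to concat uses its keys as an outer index level;
# reset_index then turns that level into an ordinary "sheet" column.
combined = pd.concat(dfs, names=["sheet", "row"]).reset_index(level="sheet")
print(combined)
```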

terse torrent
#

All right, thanks. Because I got over 100 sheets

modest rune
#

Looking for advice to speedup this function:

def CalcOptionProfit2(self, options_df, profit_df, broker_fees, underlyingPrice, investment):
    profit_df = ((options_df['putCall_float'] * (profit_df + 1) * underlyingPrice)
                 - options_df['strikePrice'])
    profit_df = investment + ((profit_df - options_df['price'] - broker_fees['per_contract']
                               - (broker_fees['percent'] * options_df['price']))
                              * options_df['contracts']) - broker_fees['flat']
    return profit_df

options_df: 3,000 x 9, mixed datatypes
profit_df: 3,000 x 100, float64
broker_fees: Dict of scalars
underlyingPrice: Scalar
investment: Scalar

This function runs once. Takes 4 seconds to run. I am starting to think this is not the most efficient way to do this math in pandas. I say that, because I compare these two lines of code to all of the other pandas math I am doing, and I am surprised it takes 4 seconds to run.

I have tried for loops and the apply function (using a custom function). I cannot promise I tried those approaches the right way, but when I did try those approaches, things were significantly slower... like 12 seconds instead of 4.

Any advice or links to documents I should read are greatly appreciated!

#

The rest of my code runs in 0.180 seconds... it is just these 2 lines that are giving me a headache.

modest rune
#

Ideas I have:
A. Switch to numpy for the calc. Not sure how to do this.
B. Combine 2 dataframes into one using groupby

lapis sequoia
#

I want to output matrix like this
9 8 7 6
10 11 12 5
1 2 3 4

I want to output spiral matrix...

this is my first message in this server... can someone help me?

!!! i=x, j=x
for another example =>
if I give the function the arguments 3, 5, it must print like this:

11 10 9 8 7
12 13 14 15 6
1 2 3 4 5
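
One possible sketch for this, going by the two examples: start at the bottom-left corner, walk right, and turn counter-clockwise (right, up, left, down) whenever the next cell is outside the grid or already filled. The function name `spiral_matrix` is made up:

```python
def spiral_matrix(rows, cols):
    """Fill a rows x cols grid with 1..rows*cols, spiralling inward from the bottom-left."""
    grid = [[0] * cols for _ in range(rows)]
    moves = [(0, 1), (-1, 0), (0, -1), (1, 0)]   # right, up, left, down
    r, c, d = rows - 1, 0, 0
    for n in range(1, rows * cols + 1):
        grid[r][c] = n
        nr, nc = r + moves[d][0], c + moves[d][1]
        # Turn when the next cell is off the grid or already has a number.
        if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] != 0:
            d = (d + 1) % 4
            nr, nc = r + moves[d][0], c + moves[d][1]
        r, c = nr, nc
    return grid

for row in spiral_matrix(3, 5):
    print(" ".join(str(value) for value in row))
```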

modest rune
#

Other hints I have to why something is wrong:
A. Each time I add a dataframe column to my equation (ex. `options_df['strikePrice']`) the runtime increases by about 0.5 seconds... that seems like a hefty increase for an item that is on the same row as the other items in the dataframe.
B. I decreased my dataset from 6500 x 9 to 3000 x 9 and only saw a 25% speed improvement. How does that make any sense?

desert oar
#

@modest rune this is more helpful let me see what i can do here

pastel compass
#

I'm currently training a seq2seq model and I am trying to diagnose my fairly stagnant losses. Are they a result of any of the following?:

  1. My learning rate is too low
  2. I should use a different criterion or optimizer
  3. I messed up somewhere in the code
desert oar
#

honestly this is about as efficient as you can make it. the only other option is to use something like numexpr

modest rune
#

Since my comment on here, I have started trying to do the math with numpy... seems to be MUCH faster, but I haven't validated my output yet though

desert oar
#

@modest rune numexpr "compiles" all your operations so you don't have all these intermediate results

#

and yes raw numpy will be faster

#

because pandas does a lot of work to align indices

#

whereas numpy is purely position-based

#

but numexpr will probably be more efficient than both
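
A small sketch of the index-alignment point, using made-up Series rather than the actual option data: pandas matches values by index label before doing the arithmetic, while the underlying numpy arrays are combined purely by position.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[2, 1, 0])

# pandas lines the two Series up by index label first
aligned = s1 + s2

# numpy just pairs up element 0 with element 0, and so on
positional = s1.to_numpy() + s2.to_numpy()

print(aligned.tolist())     # [31, 22, 13]
print(positional.tolist())  # [11, 22, 33]
```

The alignment step is extra work on every operation, which is one plausible contributor to the timing gap discussed here.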

modest rune
#

Well... I hope what I am seeing is correct and that my output is correct. Right now, those 2 lines run in 0.011 seconds with Numpy, versus 4 seconds with pandas.

desert oar
#

heh

#

thats a bigger difference than i expected

modest rune
#

me too

desert oar
#

ahhh hold on

#
options_df['putCall_float'] * (profit_df + 1)

this might be a much more expensive broadcasting operation in pandas vs in numpy

#

what's the idea here, you want to multiply each column of the profit_df matrix by the options_df['putCall_float'] vector?

modest rune
#

that particular one is sort of a workaround.

#

I have an equation that is ALMOST the same for puts and calls, except I have to negate part of the equation.

desert oar
#

this is what i was trying to understand before. what does each row and column of profit_df represent

modest rune
#

So, I created a column that stores either 1 or -1 depending on whether or not it is a put or call.

#

profit scenarios. Can't go into more details than that. That is the magic sauce.

desert oar
#

i don't really care what they are from that perspective

#

i mean, is each row a list of parameters/scenarios that you want to try, and you're duplicating that list over and over?

#

or something like that

#

is each row an hour? a different ticker label?

#

i just need to know what each one represents, i dont care about the magic sauce so to speak

#

make up things if you want i just need some context for the problem

modest rune
#

each element is a percentage. The rows are duplicated, but will all be different in the end after the options_df gets ahold of everything in the math.

desert oar
#

i see

#

so each column is one "parameter" (corresponding to the "profit scenarios" in your earlier example)?

modest rune
#

yes

desert oar
#
profit_df = pd.DataFrame([
    [1, 2, 3],
    [1, 2, 3],
    ...
], columns=['A', 'B', 'C'])

like that?

#

and yes i appreciate that you need to be cautious about revealing the magic sauce, trust me i'm not trying to get you to reveal any of it

modest rune
#

Yes, that is basically profit_df, with A, B, C representing different profit scenarios.

desert oar
#

great

#

let me spend a minute with the numpy docs

#

that said if < 1 sec is good enough for you

#

i have a code snippet for you

modest rune
#

Yes. the current speed is excellent, but, if you have additional suggestions, I'd love to hear them.

#

I am still quite surprised that pandas was so slow. Makes me reticent to use pandas in the future.

desert oar
#

its because of the indexing

#

you can always drop down to numpy for better performance, ill show you

#

i think the organizational benefits of pandas makes it worth using for most things

#

then inside your functions you can switch to numpy

#

or again, numexpr

#

ok this is easier than i thought

idle otter
#

does environment.yml have to always be named environment.yml?

desert oar
#

no, as long as you always refer to it by name with -f

idle otter
#

thank you

desert oar
#

conda env update -n myenv -f env1.yaml

idle otter
#

ty

desert oar
#

so @modest rune numpy broadcasting has two rules: the dimensions that match are preserved, and the dimensions that are 1 are broadcast

#

that's what the solution i'm writing makes use of
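
The two broadcasting rules can be sketched with toy numbers (the values and variable names here are placeholders, not the real data): a column vector of shape (n, 1) combined with a row vector of shape (1, m) gives an (n, m) result, with no need to duplicate rows by hand.

```python
import numpy as np

# One row per option: +1 or -1, shape (3, 1)
put_call = np.array([1.0, -1.0, 1.0]).reshape(-1, 1)

# One column per profit scenario, shape (1, 2)
scenarios = np.array([0.1, 0.2]).reshape(1, -1)

# The size-1 dimensions are stretched to match: result has shape (3, 2)
grid = put_call * (scenarios + 1)

print(grid.shape)
print(grid)
```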

#

this is good practice for me too btw i haven't written code like this in a while

modest rune
#

Ahh... I think I need to read up on broadcasting. I am guessing your code does the same as mine but avoids the whole duplicating rows bit. Is my guess correct?

desert oar
#

yep hang on

#

@modest rune i think your code is missing a close )

modest rune
#

maybe... I definitely had a variable in the wrong place... that putCall_float variable. I had added it without validating the data. Doing all of that now.

#

I gotta run. You mind PMing me the code snippet? That way it is easier for me to find in the future. Post it here too, on the off chance someone is following our conversation.

desert oar
#

sure

#
def CalcOptionProfit2(self, options_df, profit_scenarios, broker_fees, underlying_price, investment):
    """ Calculate option profit scenarios

    Arguments:
        options_df: DataFrame with columns 'putCall' (bool), 'strikePrice' (float), 'price' (float), and 'contracts' (float)
        profit_scenarios: List of floats
        broker_fees: Dict of broker fees with keys 'percent', 'flat', and 'per_contract'
        underlying_price: Float
        investment: Float

    Returns:
        DataFrame with option profits corresponding to the profit scenarios
    """
    # Capture the original data index to use at the end
    original_index = options_df.index

    # Convert option data into column vectors
    putcall_vec = options_df['putCall'].to_numpy('float32').reshape((-1, 1))
    strikeprice_vec = options_df['strikePrice'].to_numpy('float32').reshape((-1, 1))
    price_vec = options_df['price'].to_numpy('float32').reshape((-1, 1))
    contracts_vec = options_df['contracts'].to_numpy('float32').reshape((-1, 1))

    # Convert profit_scenarios into a row vector
    profit_scenarios_vec = np.asarray(profit_scenarios, dtype='float32').reshape((1, -1))

    fee_per_contract = broker_fees['per_contract']
    fee_percent = broker_fees['percent']
    fee_flat = broker_fees['flat']

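    result = (putcall_vec * (profit_scenarios_vec + 1) * underlying_price) - strikeprice_vec
    result = investment + ((result - price_vec - fee_per_contract - (fee_percent * price_vec)) * contracts_vec) - fee_flat
    # Wrap the (n_options, n_scenarios) array back into a DataFrame on the original index
    return pd.DataFrame(result, index=original_index)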


frank bone
#

how can i pass a variable from function1 to function2? I thought it was as easy as this...apparently not?

def function1():
    a = 1
    function2(a)

def function2(a):
    ...  # do some stuff with variable a from function1

function1()
#

seems like i have to return the value then save it as a global variable, my code works now, is that the best/only way to do it?

desert oar
#

No

#

Sounds like you need to go through Beazley's python course

#

But this is more help channel content than anything

frank bone
#

thanks for the link, will go through the course. I've only been working by watching a 1 hour tut + googling the rest 😄

#

but quick answer maybe?

desert oar
#

it's a bit more than a quick answer

#

i'll try

#

if you define a function that accepts input, you must write that in the definition

#

!e ```python
def function1():
    a = 1
    b = function2(a)
    return b * 7

def function2(a):
    return a + 2

result = function1()
print(result)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

21
frank bone
#

yeah made a typo, meant to write function2(a)

desert oar
#

looks like you got the right idea then

frank bone
#

hmm i have to recheck maybe i made a typo in my code or forgot something

#

but can you recheck my pseudo code above? correct now?

desert oar
#

seems ok

#

you will need to learn about return at some point

frank bone
#

return i know but its not necessary for this example

#

just want to pass 1 variable from function to next function when calling the latter

#

alright it worked 😄 thanks for explaining

#

i had the right idea but mustve made a typo..multiple times lol

odd apex
#

if anyone could help me with a scatterplot that be pretty cool

#

x being one column of target values

#

y being a group of predictor columns

olive fossil
#

hi guys

#

do I have to learn any other languages aside for python

#

to be a legitimate data-engineer?

odd apex
#

R

#

is very useful, for what I heard

timber eagle
#

Perhaps SQL too

olive fossil
#

oh

desert oar
#

SQL for sure, java or scala might help

timber eagle
#

Oh really? Java???

flat quest
#

eh python is usually sufficient @odd apex.

But it would be recommended to learn a form of SQL. It shouldn't be too difficult once you understand Pandas, considering df's and db's are conceptually similar.

verbal ice
#

Oh really? Java???
@timber eagle scala/java are useful for big data applications (eg: data ingestion, processing etc)

low finch
#

and distributed computing

#

even though i guess that's kinda processing

potent nymph
#

Has anyone tried Tech With Tim's Machine Learning tutorial series? Is it good?

frank bone
#

anyone have a little experience with extended isolation forest?

#

was able to implement normal isolation forest without problem

#

does extended isolation forest actually work for 1 dimension time series data?

bitter harbor
#

@potent nymph idk about that one but personally id recommend 3blue1brown

gaunt tusk
#

^

#

3blue1brown has the best introduction to neural networks

#

easily

#

by far

#

speaking of neural networks

#

just want to double check i have the right formulas for calculating the derivatives

lapis sequoia
#

I'd learn SQL, considering even at my beginner level SQL gets data very fast especially if you are dealing with huge corporate databases.

#

Above about 100k rows in excel things start getting slow

#

So if a company is being smart about their data, they should migrate to SQL once their rows reach 100k or more.

#

SQL is not like Excel, instead of clicking a cell or selecting a row, you must write a script.

#

This script can be shared to access the same database on the server.

urban island
#

@gaunt tusk isn't a^{L-1} just z^{L}? I take it that a is the current layer and z is the previous layer. Also, is W^{L} just the weight matrix from z^{L} to a^{L}? Because in that case z^{L} is independent of w^{L}

gaunt tusk
#

z^{L} is a^{L-1} multiplied by its weight + bias

#

the weighted sum

urban island
#

it's only the weighted sum? What about the activation/transition function? Or is your network linear

gaunt tusk
#

holdon i'll list out what each one is

#
a{l} = Activation ( σ(z{l}) )
a{l-1} = Previous neuron activation
z{l} = weighted sum ( (a{l-1}*w{l}) + b{l} )
w{l} = weight
b{l} = bias
c0 = cost ( (a{l} - y)^2 )
σ(x) = sigmoid function
#

using sigmoid for my activation/transition

urban island
#

and your cost function is sum of squares?

#

sorry I'm just now waking up

gaunt tusk
#

yeah sorry forgot to stick it there, its (a{l} - y)^2

#

all good

#

and this is just for like

#

a single training example

#

i'll end up using it on matrices in my actual thing

#

just trying to lay it out first

#

so yeah the cost will end up being the sum of squares

urban island
#

ok yeah, your derivatives look fine
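
The derivatives being checked aren't reproduced in the log. For reference, applying the chain rule to the definitions listed above (one neuron per layer, one training example) gives the following. This is a reconstruction from those definitions, not necessarily the exact expressions that were posted.

```latex
\begin{aligned}
z^{(l)} &= w^{(l)} a^{(l-1)} + b^{(l)}, \qquad
a^{(l)} = \sigma\!\left(z^{(l)}\right), \qquad
C_0 = \left(a^{(l)} - y\right)^2 \\[4pt]
\frac{\partial C_0}{\partial w^{(l)}} &= a^{(l-1)} \, \sigma'\!\left(z^{(l)}\right) \, 2\left(a^{(l)} - y\right) \\
\frac{\partial C_0}{\partial b^{(l)}} &= \sigma'\!\left(z^{(l)}\right) \, 2\left(a^{(l)} - y\right) \\
\frac{\partial C_0}{\partial a^{(l-1)}} &= w^{(l)} \, \sigma'\!\left(z^{(l)}\right) \, 2\left(a^{(l)} - y\right) \\[4pt]
\sigma'(z) &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
\end{aligned}
```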

#

@gaunt tusk just wondering, how come you're not using something like MSE for your cost function?

gaunt tusk
#

not sure i haven't looked at any other cost functions as of yet

#

whats the benefit of MSE?

dull turtle
#

how i can reduce val_loss ?

#
Epoch 145/150
32/32 [==============================] - 3s 80ms/step - loss: 0.3107 - accuracy: 0.9277 - val_loss: 0.9093 - val_accuracy: 0.6875
Epoch 146/150
32/32 [==============================] - 3s 85ms/step - loss: 0.3060 - accuracy: 0.9283 - val_loss: 1.8575 - val_accuracy: 0.6228
Epoch 147/150
32/32 [==============================] - 3s 82ms/step - loss: 0.2562 - accuracy: 0.9507 - val_loss: 3.1728 - val_accuracy: 0.6491
Epoch 148/150
32/32 [==============================] - 3s 79ms/step - loss: 0.2472 - accuracy: 0.9473 - val_loss: 3.3467 - val_accuracy: 0.6140
Epoch 149/150
32/32 [==============================] - 3s 79ms/step - loss: 0.3238 - accuracy: 0.9191 - val_loss: 2.0550 - val_accuracy: 0.6404
Epoch 150/150
32/32 [==============================] - 3s 81ms/step - loss: 0.2501 - accuracy: 0.9507 - val_loss: 3.1427 - val_accuracy: 0.5877
#

free to ping me

urban island
#

@gaunt tusk well it depends on the dataset. I've seen MSE used a lot more than least squares. I dont remember the exact properties of each but from my experience MSE usually provides better results in regression type problems
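
For context on the terminology: mean squared error is the sum of squared errors divided by the number of terms, so the two differ only by a constant factor (which rescales the gradient but does not move the minimum). A small sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.1, 0.6])

squared_errors = (y_pred - y_true) ** 2
sse = squared_errors.sum()    # sum of squared errors
mse = squared_errors.mean()   # mean squared error

# MSE is just SSE scaled by 1/n
print(sse, mse, np.isclose(mse, sse / y_true.size))
```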

gaunt tusk
#

Hmm i'll have a look into it

#

i believe the one i'm using should be fine for what i'm doing atm though

#

just making a simple handwritten digit recogniser

#

using the mnist dataset

urban island
#

I've used RELU (and its cousins) way more since it's less expensive (computationally) and provides faster convergence on networks where I dont have to worry about negative values

#

but anyways, yeah you'll prob be fine with your current cost function

bitter harbor
#

^ReLu would probably work better for image recognition

#

considering your values are between 0 and 1 already

gaunt tusk
#

hmm i'll check it out

urban island
#

welp I'm mixing things. Relu is an activation function

gaunt tusk
#

i have heard that its a more modernly used activation function

#

and i've been looking around and i believe the formulas for the partial derivatives i have are correct

#

so i believe i'm all set

bitter harbor
#

yes it is but you'll still have to use sigmoid if you ever have negative data

gaunt tusk
#

would the bias be able to make it negative?

bitter harbor
#

idk if there's anything similar to to tho

urban island
#

or leaky RELU

#

that has the benefit of dealing with negative values

bitter harbor
#

wdym the bias

#

the weights and biases have nothing to do with: 1) Your input/output 2) activation function

#

you can consider them separate items

gaunt tusk
#

i thought you passed in the weighted sum to the activation?

bitter harbor
#

thats right

gaunt tusk
#

so the bias would never be able to be low enough to make it negative is what you're saying?

bitter harbor
#

so if your activation function normalizes/standardizes has a range of 0,1 like ReLu, all values will be between the two

#

whereas if you use the sigmoid, it'll all be between -1,1
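
The ranges quoted above are worth double-checking numerically. Under the standard definitions, sigmoid outputs lie strictly between 0 and 1, tanh lies between -1 and 1, and ReLU is 0 for negative inputs and unbounded above. All of them accept negative inputs, and the weighted sum fed into them can be negative even when the raw inputs are in [0, 1], since weights and biases can be negative. The `slope=0.01` for leaky ReLU is just a common illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Small non-zero slope for negative inputs instead of a hard zero
    return np.where(z > 0, z, slope * z)

z = np.linspace(-10, 10, 201)  # includes negative pre-activations

for name, fn in [("sigmoid", sigmoid), ("tanh", np.tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu)]:
    out = fn(z)
    print(f"{name:>10}: min={out.min():.4f} max={out.max():.4f}")
```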

gaunt tusk
#

oh wait yeah my inputs are always going to be between 0 and 1

#

so relu probably would be the better option here yeah

#

i'll have a look at it

#

and one other question actually

bitter harbor
#

have a look at cnn's too

gaunt tusk
bitter harbor
#

image recognition usually uses either that or what you have, which is a perceptron

gaunt tusk
#

and i'm running a test image through it

#

just testing the forward passing

#

it works fine on the first two neurons

#

but for some reason the last one it throws an error

bitter harbor
#

layers or neurons

gaunt tusk
#

ah yeah layers

#
Traceback (most recent call last):
  File "/Users/6503/Desktop/CodeStuff/MachineLearning/NeuralStuff.py", line 115, in <module>
    test = thing.feed_forward(list(letssee[0])[0][0])
  File "/Users/6503/Desktop/CodeStuff/MachineLearning/NeuralStuff.py", line 103, in feed_forward
    activation = self.sigmoid(np.dot(weight, activation)+bias)
ValueError: operands could not be broadcast together with shapes (10,16) (10,) 
#

it seems to have changed the arrays shape somehow

#

not sure where though

#

i was printing out the shapes just to check what the first two layers were doing

#
(784, 1)
(16, 784)
(16, 16)
(10, 16)
they look fine
bitter harbor
#

by first two layers you mean input+1 hidden?

gaunt tusk
#

yeah

#

so it basically just doesn't make it to the output

#

goes through the one hidden layer

bitter harbor
#

I'd suggest you look at examples of perceptron in python, I could be wrong but your code seems too short

gaunt tusk
#

i mean its not even close to the full thing

#

its just the feedforward part

#

but yeah i'll have a look around

#

thanks for the help and suggestions

#

ah and i think i just found the issue too

#

yep

#

i flattened the bias array earlier for some reason

#

so just removed that
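
A sketch of why the flattened bias only failed at the last layer. The layer sizes are inferred from the shapes printed above; the random weights and this standalone `feed_forward` are placeholders, not the original class. With a (16,) bias, `(16, 1) + (16,)` silently broadcasts to (16, 16), and the mistake only surfaces when `(10, 16) + (10,)` cannot broadcast.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    activation = x
    for w, b in zip(weights, biases):
        activation = sigmoid(np.dot(w, activation) + b)
    return activation

rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
# Biases kept as column vectors so they add cleanly to (n_out, 1) activations
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

x = rng.random((784, 1))
out = feed_forward(x, weights, biases)
print(out.shape)  # (10, 1)

# Flattened biases reproduce the broadcast error at the output layer
flat_biases = [b.ravel() for b in biases]
flat_failed = False
try:
    feed_forward(x, weights, flat_biases)
except ValueError as err:
    flat_failed = True
    print(err)
```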

#
[[0.96463723]
 [0.6023769 ]
 [0.45853454]
 [0.13891415]
 [0.02237485]
 [0.09243328]
 [0.30762676]
 [0.84720262]
 [0.99305502]
 [0.13672393]]
now the outputs lookin right
bitter harbor
#

also you can look at the hidden layer for sure, just like you can list weights/biases but they won't tell you anything

#

"Neural networks are so-called [black boxes] because they mimic, to a degree, the way the human brain is structured: they're built from layers of interconnected, neuron-like, nodes and comprise an input layer, an output layer and a variable number of intermediate 'hidden' layers -- 'deep' neural nets merely have more than one hidden layer. The nodes themselves carry out relatively simple mathematical operations, but between them, after training, they can process previously unseen data and generate correct results based on what was learned from the training data."

#

tl;dr they're performing functions and you won't be able to tell shit from it

lapis sequoia
#

to be fair, a lot of work has been done on interpretability of neural network.

bitter harbor
#

how so

tight stone
#
WARNING:tensorflow:From /Users/jaqqen/.local/share/virtualenvs/ShaVas-DrKzIL9u/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

I get this warning plus that update-instruction everytime my program hits model.save(path)

I already passed in the *_constraint-arguments to the layers and it looks like this now:

    model.add(Flatten())
    model.add(Dense(50, activation=leaky_relu, kernel_constraint=None, bias_constraint=None))
    model.add(Dropout(.1))
    model.add(Dense(50, activation=leaky_relu, kernel_constraint=None, bias_constraint=None))
    model.add(Dropout(.3))
    model.add(Dense(2, activation=softmax, kernel_constraint=None, bias_constraint=None))
unkempt lotus
#

Good evening
If I have a 9x9 numpy array b:
[[0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]]
which I got from the following code:
a = []
for i in range(9):
    a.append(i)
b = []
for i in range(9):
    b.append(a)
b = np.array(b)
I am trying to turn it into 9 3x3 images using .reshape method:
c = b.reshape(9,3,3)

#

However, the result I get if I print c[0], namely the first sample, is:
array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
whereas what I want is the upper left corner of the image, namely:
array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])

#

Researching over stack overflow, the solution might have to do with reshaping, using .swapaxes method, then reshaping again: https://stackoverflow.com/questions/45950264/reshape-array-in-squares-like-an-image
But I couldn't figure out how should I use this for my case
Any help would be very much appreciated!
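
A minimal sketch of the reshape, swapaxes, reshape approach for this 9x9 case: first split both axes into (block, offset) pairs, then bring the two block axes together, then merge them into a single sample axis.

```python
import numpy as np

b = np.tile(np.arange(9), (9, 1))  # same 9x9 array as above

# (9, 9) -> (block_row, row_in_block, block_col, col_in_block)
blocks = b.reshape(3, 3, 3, 3)

# Move block_col next to block_row: (block_row, block_col, row, col)
blocks = blocks.swapaxes(1, 2)

# Merge the two block axes into one axis of 9 tiles, each 3x3
c = blocks.reshape(9, 3, 3)

print(c[0])  # upper-left 3x3 tile
print(c[1])  # tile to its right
```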

dull turtle
#

how to reduce val_loss in CNN ?

#
Epoch 80/85
32/32 [==============================] - 3s 79ms/step - loss: 0.2400 - accuracy: 0.9551 - val_loss: 3.0391 - val_accuracy: 0.6435
Epoch 81/85
32/32 [==============================] - 2s 77ms/step - loss: 0.2816 - accuracy: 0.9331 - val_loss: 1.7805 - val_accuracy: 0.5913
Epoch 82/85
32/32 [==============================] - 3s 79ms/step - loss: 0.3244 - accuracy: 0.9147 - val_loss: 2.2709 - val_accuracy: 0.6172
Epoch 83/85
32/32 [==============================] - 3s 84ms/step - loss: 0.2983 - accuracy: 0.9395 - val_loss: 1.6353 - val_accuracy: 0.6174
Epoch 84/85
32/32 [==============================] - 2s 76ms/step - loss: 0.3258 - accuracy: 0.9206 - val_loss: 3.8418 - val_accuracy: 0.5913
Epoch 85/85
32/32 [==============================] - 3s 85ms/step - loss: 0.2855 - accuracy: 0.9390 - val_loss: 2.8588 - val_accuracy: 0.6783
training completed...2
Epoch 1/1
9/9 [==============================] - 1s 70ms/step - loss: 3.6322 - accuracy: 0.4427
score :  [1.7107210159301758, 0.572519063949585]
unkempt lotus
#

I once was stuck with validation loss, turns out I need to shuffle the data. Note that setting shuffle=True would only shuffle the data after the validation split, if i'm not mistaken
@dull turtle
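
As far as I know, Keras's `validation_split` takes the last fraction of the arrays before any shuffling, so data stored in class order can leave the validation set unrepresentative. A sketch of shuffling features and labels together beforehand, with placeholder arrays `X` and `y` standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder data stored in class order: first all 0s, then all 1s
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# One shared permutation keeps every row paired with its label
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

print(y_shuffled)
```

The shuffled arrays can then be passed to training with a validation split, or split by hand.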

dull turtle
#

@unkempt lotus can u refer to my code above pasted in link

unkempt lotus
#

It is a bit too long, apologies, but I searched for "shuffle" and didn't find anything

#

Not sure if this is your issue, but if you want give it at least a try

bitter harbor
#

@unkempt lotus if your array is 9x9:
array = array
array1 = array[0:3, 0:3]
array2 = array[0:3, 3:6]
array3 = array[0:3, 6:9]
etc

#

I was trying to think of a way to do this with recursion, but pretty sure it'd be more complicated

unkempt lotus
#

@bitter harbor No worries, thank you for the effort, I might consider implementing this

proper basin
#

I want to build a KDTree (scikit-learn) of unique points, however calling numpy.unique() on the array of points takes much longer than building the KDTree (over 10x longer). Is there a way to use the KDTree structure to make it unique, rather than the apparently-expensive numpy.unique operation?

bitter harbor
#

how long is '10x longer'?

proper basin
#

depends on the number of points

#

in my unittest it goes from 1s to 12s

#

120,000 points

bitter harbor
#

Thats pretty good considering what its doing

#

maybe check this out?

proper basin
#

I've been searching for a function similar to numpy.unique, but I've had no luck
Hmm I don't get why he doesn't use numpy.unique()

bitter harbor
#

who knows???

#

maybe because of the same problem you're running into

#

if you think about whats happening, it makes sense that it would take 12s

#

because its looking at the 120000 points, comparing them to themselves and returning it

proper basin
#

Well it seems to me that the KDTree could easily remove duplicates

#

during insertion, with little overhead

#

it just doesn't have an argument to do that

bitter harbor
#

I think the issue is the time cost in general

proper basin
#

Oh from your link I found a solution using set() which is much faster

#

(0.2s)

#

I assume this is more memory-hungry though
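
The linked solution isn't shown in the log, so this is only a guess at its shape: a sketch comparing `np.unique(..., axis=0)` with deduplication through a Python `set` of tuples, on made-up points. `np.unique` sorts the rows, while the set version does not guarantee any order and does build a Python object per point, which fits the memory concern.

```python
import numpy as np

points = np.array([
    [0.0, 1.0],
    [2.0, 3.0],
    [0.0, 1.0],   # duplicate
    [4.0, 5.0],
    [2.0, 3.0],   # duplicate
])

# Sort-based: returns the unique rows in sorted order
unique_sorted = np.unique(points, axis=0)

# Hash-based: each row becomes a hashable tuple; order is not guaranteed
unique_via_set = np.array(list(set(map(tuple, points))))

print(unique_sorted)
print(unique_via_set)
```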

bitter harbor
#

couldn't tell ya

raven mulch
#

If anyone is interested in learning about making their own deep learning library in python feel free to check this first video out 🙂

#

I am an undergraduate researcher in machine learning

flat quest
#

oooh
might be something i'll look into. Though I have a feeling autograd will be a pain...

cold shore
#

Thanks @raven mulch

raven mulch
#

No prob let me know what you think 🙂

desert oar
#

You can bypass auto grad by hard coding the layer types

#

Write the gradients out by hand

#

Probably more educational than using an autograd lib

bitter harbor
#

^^ i'd suggest doing everything (maybe except matrix manipulation (that's what numpy's for)) by hand

desert oar
#

Yeah just use numpy for that

bitter harbor
#

doing all that would be painful

#

but doing it all yourself will help you understand everything better

#

just like how it's Very useful to learn linear algebra (or concepts used in ml)

fervent crypt
#

I'm trying to parse through an html table using pandas but i keep having a problem with values coming out as NaN when td values are there.

https://pastebin.com/zkjf6sqm

This is what part of the html table looks like.

My table ends up looking like this:

https://pastebin.com/ewB5Zf2K

The problem is that Role keeps coming out as NaN when i have things like "Bot Laner" still there.

for my code I think these are the relevant parts.

soup=BeautifulSoup(req.text, 'lxml')

my_table = soup.find('table', {'class':'wikitable'})

pd.read_html(str(my_table))

Any help would be really appreciated thanks!

flat quest
#

true tho its always been interesting to me how tf and pytorch actually compute all their gradients

I know tf uses a graph execution, and that helps them deal with the gradients for non-standard functions, but would be interesting to see if that could be reimplemented. @desert oar

simple shadow
#

hey all, i need help with something
in a specific dataset, there is data like ''Apr 3, 1998 to Apr 24, 1999'' how do i extract Apr 3 1998 and put it in one column and put Apr 24 1999 in another column

#

i am using python pandas

rancid brook
#

You could split the string on " to "
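
A sketch of that split in pandas, assuming the column is called `Aired` and every value contains " to " (the sample rows are made up). Rows without " to " would end up with a missing value in the second column.

```python
import pandas as pd

df = pd.DataFrame({
    "Aired": ["Apr 3, 1998 to Apr 24, 1999",
              "Oct 20, 1999 to Mar 31, 2000"]
})

# expand=True returns one column per piece instead of a list per row
df[["start", "end"]] = df["Aired"].str.split(" to ", expand=True)

# Optional: turn the text into real datetimes
df["start"] = pd.to_datetime(df["start"], format="%b %d, %Y")
df["end"] = pd.to_datetime(df["end"], format="%b %d, %Y")

print(df)
```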

tawdry sedge
#

Hey guys need some advice

#

I found this question on stack overflow to find the line that touched most of the rectangles

#

I have no clue at the moment but we need to find the line which touches the most of the rectanles and does not have to be the corners

#

any clue?

lapis sequoia
#

Need advice for a good Data Science book. Any ideas?

chrome barn
#

take your pick

verbal ice
#

Oh my god this is great thanks for sharing!

#

I must say this is very overwhelming 😅

lapis sequoia
#

what are the prerequisites for this course

rigid glade
#

Hi guys. I have a pandas question.

#

How do i compare the first -1 to the next row down number?

#

The conditions are: the first number needs to be -1, and if the next row down number is +1 then the count of True goes up by one.

#

Then the comparison starts on the second number to the third number with the same conditions
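
The data isn't shown, so this sketch assumes a single Series of -1/+1 values. `shift(-1)` lines each row up with the row below it, which lets the whole comparison run without a loop.

```python
import pandas as pd

s = pd.Series([-1, 1, 1, -1, -1, 1, -1])  # made-up example values

# True where this row is -1 and the next row down is +1
hits = (s == -1) & (s.shift(-1) == 1)
count = int(hits.sum())

print(hits.tolist())
print(count)  # 2
```

The last row has no row below it, so it compares against a missing value and never counts.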

lapis sequoia
#

Does anyone on here use Python for chemistry related work? I’m trying to find other packages like Cantera https://cantera.org/ for Python.

gleaming gyro
#

is there any astronomy related projects that i can work in?

simple shadow
#
anime_new['Aired'].drop(index=anime_new['Aired'][filt].index)
anime_new.dropna(subset = ['Aired'])

if i print out anime_new after the last line, it includes the NA
does anyone know why?
ripe forge
#

Oh

#

@simple shadow because by default in pandas, drop or dropna returns copies. And you never reassigned. So the original variable stays the same. Either use inplace=True or reassign yourself.
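
A short sketch of the difference, using a made-up three-row frame in place of the real `anime_new`:

```python
import numpy as np
import pandas as pd

anime_new = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Aired": ["Apr 3, 1998", np.nan, "Jan 1, 2001"],
})

# Returns a new frame; the result is thrown away here
anime_new.dropna(subset=["Aired"])
rows_before = len(anime_new)   # still 3

# Reassigning keeps the cleaned copy
anime_new = anime_new.dropna(subset=["Aired"])
rows_after = len(anime_new)    # now 2

print(rows_before, rows_after)
```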

simple shadow
#

thank you!! @ripe forge

digital juniper
#

just started learning data science, does anyone have a good server or place to ask for ML specific python questions?

ripe forge
bitter harbor
#

lmao

digital juniper
#

i mean this is a discussion channel so i wasn't sure haha

#

but if anyone knows about the scikit confusion matrix, i'm doing logistic regression on a cancer data set where the target var is either M or B for malignant or benign

#

but i don't know how to label it with the prediction axis and the actual data axis

#

so idk which way round it is

#

but i do know which one is M and which one is B using the labels parameter

bitter harbor
#

not too familiar with scikit but I do know the actual output has to be a number

#

so instead of M/B you could have 1,0

digital juniper
#

yeah, the only thing the labels thing does is switch the order from being B then M or M then B

#

so if i switch them the matrix goes the other way round if that makes sense

#

but idk which axis is which

flat quest
#

wait what are u trying to do
print the confusion matrix along with the actual labels?

digital juniper
#

yeah so i printed the matrix but i don't know which line is the predicted data and the actual data

flat quest
#

it doesn't really matter which is which

They're interchangeable. Both axes have the same labels.

digital juniper
#

i thought one axis was predicted M and predicted B then the other was actual M and actual B

#

for the diagonal it doesn't make a difference but for off diagonal elements it matters

flat quest
#

ah well

i meant that as long as the confusion matrix generator function said which axis was which, it didn't really matter.

If it doesn't state it. The general notation is the column axis(left -> right) is predicted values. And row axis (top to bottom) is the actual axis.

digital juniper
#

ahh thanks, yeah i couldn't find a default on the scikit docs

#

maybe i just missed it but i was wondering if there was some convention

#

cuz i wanted to work out the precision and recall manually, instead of using scikit funcs

flat quest
#

ah gotcha. Yeah weird of them to not state it.

Yeah that's the general convention. Unlikely for scikit to use a different axis system
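
For what it's worth, scikit-learn's `confusion_matrix` does follow that convention: entry `[i, j]` counts samples whose true class is `i` and predicted class is `j`, so rows are actual and columns are predicted. A sketch of computing precision and recall by hand for the binary M/B case, with made-up labels and M treated as the positive class:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = ["M", "M", "B", "B", "M", "B"]
y_pred = ["M", "B", "B", "B", "M", "M"]

# labels fixes the order: index 0 is B (negative), index 1 is M (positive)
cm = confusion_matrix(y_true, y_pred, labels=["B", "M"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of everything predicted M, how much was M
recall = tp / (tp + fn)     # of everything actually M, how much was found

print(cm)
print(precision, recall)

# Cross-check against the library functions
print(precision_score(y_true, y_pred, pos_label="M"),
      recall_score(y_true, y_pred, pos_label="M"))
```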

#

ah i see. Do you know the math behind the two functions?

digital juniper
#

precision and recall?

flat quest
#

yeah

digital juniper
#

i mean it's 3 numbers in a fraction haha

#

unless there's something i'm missing

flat quest
#

well for binary classification its really simple. If you're looking to do multilabel as well, its a little more complex.

digital juniper
#

ah yeah i mean i've just started so i'm starting with binary classification