#data-science-and-ml
What are the best methods for movie recommendation systems?
I'm currently using a cosine similarity metric on the IMDb dataset to make recommendations. It performs reasonably well, but I'd like to improve it. What are the SOTA methods for movie recommendation systems?
do yall see auto ML taking over data science in the upcoming decades
I see auto ML being a big automater of machine learning - but data science is more than ML
ey
@modest rune so you went with converting the arrays to lists?
Is there any benefit to using an ASCII string over Unicode for text?
how would you say a shape of (2330, 3500, 3) in words?
i know it's 2330 arrays of 3500by3 but I am just wondering if there is a standard way of saying it
Is it possible to pass a list of dates to a pandas date series index?
i.e. only open market dates for a year of stocks
Which is like 250 days out of 365
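For reference, a minimal sketch of indexing a DataFrame by only the open-market dates. The dates and ticker names here are made up for illustration; `pd.to_datetime` turns a plain list of date strings into a `DatetimeIndex`.

```python
import pandas as pd

# Hypothetical list of open-market dates (strings) and placeholder tickers
date_list = ["2012-01-03", "2012-01-04", "2012-01-05", "2012-01-06", "2012-01-09"]
ticker_list = ["AAA", "BBB"]

# Convert the list of strings into a DatetimeIndex
dti = pd.to_datetime(date_list)

# Empty frame with one row per trading day and one column per ticker
df = pd.DataFrame(index=dti, columns=ticker_list)
print(df.index)
```

Weekends and holidays never appear in the index because they were never in the list.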
Guys, if I have a dataset that has 'names' column, how should I deal with it to convert it into numerical data
Is it right to use one-hot encoding? Because it has literally 25 unique values
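A small sketch of what one-hot encoding a column looks like with `pd.get_dummies`. The column values below are invented, and whether a 'names' column is worth encoding at all depends on whether it carries information or is just an identifier.

```python
import pandas as pd

# Toy frame with a categorical 'names' column (values are made up)
df = pd.DataFrame({"names": ["alice", "bob", "alice", "carol"],
                   "score": [1, 2, 3, 4]})

# One new indicator column per unique value of 'names'
encoded = pd.get_dummies(df, columns=["names"])
print(encoded.columns.tolist())
```

With 25 unique values this would produce 25 indicator columns.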
@pastel compass in some specific applications maybe but not in general
to be clear: by "Unicode" you probably mean UTF-8
you can select the columns which you need to convert and use .astype()
@frank bone yes
hi can someone help me with scipy solve_bvp?
data-science seemed to be closest to a channel which might use numerical computation that's why I jumped in here
@desert oar
Ahh I didn't know there was a link between the two
Unicode is an abstract system that basically catalogues every character/symbol used by humans, and putting a number on it
UTF-8 is an encoding for Unicode text
so python strings are "unicode"
but a file would be "UTF-8"
Oh that makes sense, I always see "encoding=utf8" but I didn't fully understand
yes
so that's a UTF-8 encoded file
which means that it contains Unicode text, in UTF-8 format
@restive obsidian what's your problem with it? don't ask to ask
Thanks for the help!
UTF-8 Is a way to represent a sequence of Unicode characters as 8bit bytes (octets)
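To make the distinction concrete, a short sketch: a Python `str` holds Unicode code points, and encoding it as UTF-8 produces `bytes`. The sample text is just an illustration.

```python
# A str is a sequence of Unicode code points
text = "na\u00efve caf\u00e9"      # two non-ASCII characters

# Encoding turns it into bytes; UTF-8 uses 2 bytes for each of these accented letters
encoded = text.encode("utf-8")

# Decoding with the same encoding gives the original text back
decoded = encoded.decode("utf-8")

print(len(text), len(encoded), decoded == text)
```

The character count and the byte count differ, which is why a file needs an `encoding=` argument while an in-memory string does not.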
^
@worldly kindle
A headache brought upon by too many consecutive days of Pandas forced me to take a break. After a 3 hour nap, I am ready to try again. Upon further reflection, I have decided to take the advice of the experts and attempt to index 2 dataframes instead of putting everything into 1 dataframe (which required the nested arrays).
@desert oar I m stuck from 3 hrs on a problem where I have to solve a system of 2 coupled 2nd order diff equations
can you help, I 'm using scipy
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.solve_bvp.html the example in the docs seems confusing
the bc(ya, yb) says it should return a (n, ) array but why? Boundary conditions should be also for the differentials.
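On the `bc(ya, yb)` question, a minimal sketch, assuming the usual trick of rewriting a second-order equation as a first-order system. Then `y[1]` is the derivative, so a boundary condition on the derivative is just a residual on `ya[1]` or `yb[1]`, and `bc` returns one residual per unknown. The toy problem (y'' = 2 with y(0) = 0 and y'(1) = 2, exact solution y = x^2) is made up for illustration; two coupled second-order equations would give 4 unknowns and 4 residuals.

```python
import numpy as np
from scipy.integrate import solve_bvp

# y'' = 2 as a first-order system: y[0] = y, y[1] = y'
def fun(x, y):
    return np.vstack((y[1], 2.0 * np.ones_like(x)))

# One residual per unknown: y(0) = 0 and y'(1) = 2
def bc(ya, yb):
    return np.array([ya[0], yb[1] - 2.0])

x = np.linspace(0.0, 1.0, 11)      # initial mesh
y_guess = np.zeros((2, x.size))    # initial guess for y and y'

sol = solve_bvp(fun, bc, x, y_guess)
print(sol.status, sol.sol(0.5)[0])
```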
Hello, Are there people here who may be a little experienced with web scraping?
@hardy folio what do you want to scrape?
and with what? scrapy/selenium/requests?
So when I have scraped web pages before most of the time I would just scrape information from the page. This website I am looking at actually creates a link and give you the information in a csv file.
is there a way for me to actually use the link it creates and download the csv file instead of just scraping the information it returns on the whole web page
or would that be more difficult
I have spent quite a bit of time looking into this but so far cant find information that relates
it does not look like that creates a link I would use
the web page looks like that and the excel picture is a link
use a = requests.get(...).content and then with open(....csv, "wb") as f: f.write(a)
Dont i have to give
request.get
the actual url link download
page = requests.get(URL)
like that but if I dont have a URL for the excel link
and obviously im still learning a lot so I apologize if im dumb
@modest rune good call haha
i love python
any idea why a pandas describe works sometimes on a numpy array and doesn't other times?
Hi! Does anyone know how I might unmask a masked numpy array? I tried ma.getdata() and .data but neither worked (they just returned the same masked array)
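A sketch of the usual ways to get a plain array out of a masked one, on a made-up three-element array. Which one is wanted depends on whether the masked values should be kept, replaced, or dropped.

```python
import numpy as np

masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# Underlying data as a plain ndarray, masked entries included
raw = np.ma.getdata(masked)

# Plain ndarray with masked entries replaced by a fill value
filled = masked.filled(np.nan)

# Plain ndarray with masked entries dropped
kept = masked.compressed()

print(type(raw), filled, kept)
```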
```py
df = pd.DataFrame(index=dti, columns=ticker_list)
```
anyone know whats wrong with this?
but when passed as index it prints this Cannot convert input [['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09', '2012-01-10', '2012-01-11', '2012-01-12', '2012-01-13', '2012-01-17', '2012-01-18', '2012-01-19', '2012-01-20', '2012-01-23', '2012-01-24', '2012-01-25', '2012-01-26', '2012-01-27'..........................'2012-12-28', '2012-12-31']] of type <class 'list'> to Timestamp
trying to get a datetime index instead of 0 to n
nvm figured it out 😄
just do this df = pd.DataFrame(index=date_list, columns=ticker_list)
Hi, I'm trying to categorize comments on twitch chat. I ran an unsupervised tweet topic modeling algorithm and got 10 topics but the results don't seem too promising.
Does anyone have suggestions to improve the process? Do I remove emotes?
i'm guessing u used a clustering algorithm? @coarse spire
Yeah, so I used different embeddings (including BERT embeddings from flair), ran PCA then used AgglomerativeClustering
Then I ran TF-IDF on each topic to pick out the most important terms in each
I don't know too much about clustering so I just followed this tweet analyzer post. https://towardsdatascience.com/covid-19-with-a-flair-2802a9f4c90f?gi=522b2c9f7c6
I guess I should look at different clustering techniques and varying the cluster size
I also don't know how to make much sense of my data after running PCA on my own. I should look into that too.
anyone got a clue how to skip NaN values?
doing a Simple Moving Average but it breaks as soon as there's 1 NaN value in a time series
data['SMA_3'] = data['CREE'].rolling(window=30).mean()
id want it to just ignore it and keep going
(window=30, min_periods=29) would work for 1 NaN
You could also replace the NaN with the mean
I see people use pygame when they want an easy GUI.
looks to be hot topic today
@fierce saffron afaik the .describe() function is not implemented in numpy at all, so it shouldn't work. Maybe you have some DataFrame that you think is a numpy array? If you want to describe a numpy array you can use scipy.stats.describe as a workaround.
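A quick sketch of the difference, using a made-up array. Wrapping the array in a pandas Series is another workaround besides `scipy.stats.describe`.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0, 4.0])

# A plain ndarray has no describe method
has_describe = hasattr(arr, "describe")

# Wrapping it in a Series gives access to pandas' describe
summary = pd.Series(arr).describe()
print(has_describe, summary["mean"])
```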
@coarse spire does it divide by 30 or 29 in that case?
Does it effectively skip it? Or just treats it as a zero?
30 until it hits the nan then 29 until it fully passes it
Great thanks 🙂 is there a possibility to just ignore NaN though? So if theres a NaN its like it doesnt even exist
If thats possible then 30/30 is always possible unless theres less than 30 datapoints
Well, dropna will drop the nans before you do the moving average
It should definitely work but putting it back into the dataframe will require some finesse
Easiest thing to do would be replace nan with the mean
Then you have no nans
Nah, if you search around for "replace nan with mean pandas" it should come up
Like is it possible on the go..while executing the SMA function?
Alright ill check it out
Nope, gotta do it before
You're welcome good luck
@coarse spire be careful when you do this though. It depends on how many nulls you have: if there are too many and you replace them with the mean, the result will be useless because you won't get any information out of it
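To compare the options discussed above, a sketch on a tiny made-up series with a window of 3 instead of 30. `min_periods` lets a window with a NaN still produce a mean over the values that are present, while `dropna` followed by `reindex` treats the NaN as if it did not exist and then puts the result back on the original index.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])

# Default: any window containing a NaN yields NaN
strict = prices.rolling(window=3).mean()

# Allow one missing value per window; the mean divides by the count of non-NaN values
lenient = prices.rolling(window=3, min_periods=2).mean()

# Ignore the NaN entirely: drop it, roll over the remaining points, realign to the original index
compact = prices.dropna().rolling(window=3).mean().reindex(prices.index)

print(pd.DataFrame({"strict": strict, "lenient": lenient, "compact": compact}))
```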
Sorry late to the conversation 😅
yes, it is.
Im currently doing a project where I'm using word2vec algorithm for classifying Facebook comments into how aggressive they are... Is there a common tool that I can use to iterate through my corpus of sentences to correct spelling mistakes?
At the moment I'm using gensim word2vec, but that could change as I'm only at the data preprocessing stage
I spent 15 hours on those 50 lines, lol.
How would I add my own data to the mnist training dataset? I've got the images and have worked out the labels, but not sure what datatype the images + labels have to be, nor how to actually add them to my dataset correctly. (@me upon response please)
@silk axle "Thus, in MNIST training data set, `mnist.train.images` is shaped as a [60000, 784] tensor (60000 images, each involving a 784 element array). Using that syntax, you can refer to any of the pixels in any of the images. As shown above, each element in this tensor represents the intensity value of a pixel in a picture, between 0 and 1."
just concatenate your data to the end of the array?
I'm really new to ml + numpy so not sure how to
And if it's an array of (28x28) does that mean I need my image as an array 28x28?
@bitter harbor
yes I'm not sure how mnist orders the pixels but they are individual pixels not images
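A sketch of how a (N, 28, 28) array maps to the flattened (N, 784) layout with numpy's default row-major order. Random data stands in for real images here, and the assumption that a flattened MNIST copy uses this same ordering is worth checking against the loader you use.

```python
import numpy as np

# Two fake 28x28 images standing in for MNIST samples
images = np.random.default_rng(0).integers(0, 256, size=(2, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-element row
flat = images.reshape(len(images), -1)

# Going back is the reverse reshape
restored = flat.reshape(-1, 28, 28)

# Pixel (row 3, col 5) of image 0 sits at position 28 * 3 + 5 in the flat row
print(flat.shape, flat[0, 28 * 3 + 5] == images[0, 3, 5])
```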
I want to have multiple lines for every interruption_type and priority field on my matplotlib chart. Currently I have only this:
```py
def select(self):
    return pd.read_sql_query("SELECT DISTINCT date, interruption_type, priority, SUM(interventiontime) from interruptions GROUP BY interruption_type, priority;",
                             self.conn, index_col="date")


class MplCanvas(FigureCanvasQTAgg):
    def __init__(self, parent=None, width=5, height=4, dpi=100):
        fig = Figure(figsize=(width, height), dpi=dpi)
        self.axes = fig.add_subplot(111)
        super().__init__(fig)
        self.draw()


class StatisticDialog(QMainWindow):
    def __init__(self, *args, **kwargs):
        super(StatisticDialog, self).__init__(*args, **kwargs)
        self.db = OctopusDB()
        self.setWindowTitle("Statistiques des interruptions")
        self.resize(600, 400)
        self.setWindowIcon(QIcon('icon.png'))
        try:
            datas = self.db.select()
            sc = MplCanvas(self, width=5, height=4, dpi=100)
            datas.plot(ax=sc.axes)
```
how can I edit this to get multiple lines (or bars, or another graph that can show the sum by interruption_type, priority and date)?
put it into an array
datas = self.db.select() from this line I must create an array so ?
np.random.random((len(data points), amount of interruption_types))
```py
datas = self.db.select()
print(datas)
```
returns:
            interruption_type  priority                   SUM(interventiontime)
date
17/07/2020  Email              Important, non urgent      10
17/07/2020  Présentielle       Important, Urgent          39
17/07/2020  Présentielle       Important, non urgent      10
17/07/2020  Présentielle       Non important, non urgent  6
17/07/2020  Téléphone          Non important, non urgent  4
Code
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    resized_image = 1 - resize(image, (28, 28, 1))
    # print(resized_image.shape)

    ## Add data to training sets
    x_train += resized_image  # this is the line that raises the error
    y_train += number_in_image
```
Error
```py
UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```
@bitter harbor
oh sorry yannick I thought your line space was time
I'm assuming I have to somehow convert the resized_image to be float32 but idk how
no, it's the sum of time for all same interruption_type and same priority for a date
separate the bars into urgencies then
yes
https://matplotlib.org/3.2.1/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py this type of graph would be cool. Can you help to adapt my code for the same result @bitter harbor ?
No sorry, but you're going to have to do a bit of adapting. That graph is 2-dimensional: (scores, men) and (scores, women), so basically it's like saying (scores, human). Your graph is (date, incident type, priority, amount)
so even if you split priority into groups, you'll still essentially have a 4-dimensional graph
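One possible way to get there, sketched with the rows from the printed query result: pivot the frame so each (interruption_type, priority) pair becomes its own column, which pandas then draws as one bar or line per pair. The column name `total` is a stand-in for `SUM(interventiontime)`, and the plotting call is left as a comment because it needs the Qt canvas from the original code.

```python
import pandas as pd

# Rows copied from the printed result; "total" stands in for SUM(interventiontime)
datas = pd.DataFrame({
    "date": ["17/07/2020"] * 5,
    "interruption_type": ["Email", "Présentielle", "Présentielle", "Présentielle", "Téléphone"],
    "priority": ["Important, non urgent", "Important, Urgent", "Important, non urgent",
                 "Non important, non urgent", "Non important, non urgent"],
    "total": [10, 39, 10, 6, 4],
})

# One row per date, one column per (interruption_type, priority) pair
pivot = datas.pivot_table(index="date",
                          columns=["interruption_type", "priority"],
                          values="total",
                          aggfunc="sum")
print(pivot)

# With the canvas from the original code, something like this would draw grouped bars:
# pivot.plot(kind="bar", ax=sc.axes)
```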
@bitter harbor do you know how to solve my above issue? https://discordapp.com/channels/267624335836053506/366673247892275221/733612255199100961
? @bitter harbor
what do your input photos look like
Do you mean the actual image or the numpy array?
actual image
why is it inverted?
Because the mnist set is apparently inverted
and that's no longer a 0
look up an image of the mnist training set
@fleet moth what you could do is create a bar graph separated by time, split by the types with lengths (y) of the sum, then heat mapped to the priority
Idk what you mean by that @bitter harbor
@silk axle they're white numbers and the white space is black, that's the images inverted
you can't combine numbers of different colours without screwing with the dataset
I'm inverting it though
That's the point
resized_image = 1 - resize(image, (28, 28, 1))
This line resizes to (28, 28) and then inverts it
So that the colours do match
@bitter harbor
To predict what the number is basically
But I am inverting it so that it's white numbers and black background
ok but your image is purple and yellow
the inverse/reverse of purple and yellow is yellow and purple
The dataset has yellow numbers and purple background (in the plt.imshow)
My data has purple numbers and yellow background, but I invert it (in the plt.imshow)
The dataset is also purple+yellow
this is a random element taken from the dataset
Idk why it shows as purple and yellow (whether that's a plt.imshow thing or just the python mnist dataset), but it does
So the colours do match
Either way I don't see how this is relevant to the error I'm getting
whats the full error then
`UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'`
the full error
It's only that because I'm using Google Colab
```
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-9-d7ac403beec4> in <module>()
     14
     15     ## Add data to training sets
---> 16     x_train += resized_image
     17     y_train += number_in_image
     18
UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float32') to dtype('uint8') with casting rule 'same_kind'
```
Maybe your custom images are from floats 0 to 1 and mnist is 0-255 uint8 or vice versa
Print out the dtype of x_train and resized_image
Mnist is 0-255 but I change that somewhere so that it's 0-1 iirc
wait nvm mnist is 0-1 I think
```
uint8
float32
```
First is mnist, second is my image
So ig that means MNIST is 0-255? And I have to /255?
I would *255 and astype('uint8') your images instead
Then do that later. I would think learning on uint8 would be faster than 32 bit floats
But feel free to try either way
It does probably just become floats later regardless
Right yea ig
So `reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')`?
Looks about right
Doesn't seem to convert to uint8
```
MNIST dataset: uint8
My image: float32
```
I'm so confused lol
Okay so it's not converting to uint8
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(f"MNIST dataset: {x_train.dtype}")  # outputs uint8

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')
    print(f"My image: {resized_image.dtype}")  # outputs float32
    # print(resized_image.shape)

    ## Add data to training sets
    x_train += resized_image
    y_train += number_in_image
```
@pale thunder
Where does resized_image come from?
oh
255 * abs(1 - resize(image, (28, 28, 1)))
I'm using wrong variable lmao
Okay new error
```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-287f62ccfef9> in <module>()
     16
     17     ## Add data to training sets
---> 18     x_train += reversed_image
     19     y_train += number_in_image
     20
ValueError: operands could not be broadcast together with shapes (60000,28,28) (28,28,1) (60000,28,28)
```
And the 1 - resize(...) will never be <0 so don't need to abs it @bitter harbor
So the types are matching now, but says the shapes don't match
Isn't that how u concatenate numpy stuff?
Ah, you cannot append things with + like that. You need some numpy stack function, concat or append. Unfortunately not at a PC, so I cannot test which one works
```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-c1eac66c6ac2> in <module>()
     16
     17     ## Add data to training sets
---> 18     x_train.concatenate(reversed_image)
     19     y_train.concatenate(number_in_image)
     20
AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
```
again look up numpy.concatenate
!d numpy.concatenate
```
numpy.concatenate((a1, a2, ...), axis=0, out=None)
```
Join a sequence of arrays along an existing axis.
Parameters
**a1, a2, …**: sequence of array_like. The arrays must have the same shape, except in the dimension corresponding to *axis* (the first, by default).
**axis**: int, optional. The axis along which the arrays will be joined. If axis is None, arrays are flattened before use. Default is 0.
**out**: ndarray, optional. If provided, the destination to place the result. The shape must be correct, matching that of what concatenate would have returned if no out argument were specified.
Returns
**res**: ndarray. The concatenated array.
See also
[`ma.concatenate`](numpy.ma.concatenate.html#numpy.ma.concatenate "numpy.ma.concatenate"): Concatenate function that preserves input masks.
[`array_split`](numpy.array_split.html#numpy.array_split "numpy.array_split"): Split an array into multiple sub-arrays of equal or near-equal size.... [read more](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html#numpy.concatenate)
Oh
```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-6ffe324b79a0> in <module>()
     16
     17     ## Add data to training sets
---> 18     np.concatenate(x_train, reversed_image)
     19     np.concatenate(y_train, number_in_image)
     20
<__array_function__ internals> in concatenate(*args, **kwargs)
TypeError: only integer scalar arrays can be converted to a scalar index
```
Look at the signature once more
the mnist is a list (60000 items) of arrays (60000, 28, 28) you need to change the shape to (1, 28, 28) because you're adding 1 item
look at the parameters of the function
ah
```
MNIST dataset: uint8
My image: uint8
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-ce76f79fc976> in <module>()
     16
     17     ## Add data to training sets
---> 18     np.concatenate((x_train, reversed_image))
     19     np.concatenate((y_train, number_in_image))
     20
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 2, the array at index 0 has size 28 and the array at index 1 has size 1
```
I'm assuming this is the thing of needing to reshape?
I'm really confused
are you going to use a library to build your classifier
I've already built the classifier (tensorflow.keras.models.Sequential)
So yes
If that's what you mean by classifier
yea
```py
## Build the CNN model
model = Sequential()

## Add model layers
model.add(Conv2D(32, kernel_size=3, activation='relu', input_shape=(x_train.shape[1:])))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
```
this is how I build the CNN
why'd you use ReLu
Because the tutorial used relu 🤷
I've got no clue what relu/softmax is other than 'a classification algorithm'
The tutorial showed how to get it all working, and I'm now extending on it to make it better
ok ya I was thinking you just looked it up
you should really learn about machine learning before you mess around with prebuilt algorithms
I did a while back (about 2 years ago) so I know some basics, like kernels, but most stuff I either didn't learn or forgot
because it all involves things like dot multiplication, cost functions, stats, and optimization in machine learning
or like matrix manipulation in general
3blue1brown has some excellent videos on nn's and linear algebra
print(np.shape(image))
no the mnist one
Oh, I see
mnist is (28, 28)
So I need to add the 1?
I do that later in the code but ig I should do it here?
```py
## Reshape the data to fit the model
x_train = x_train.reshape(list(x_train.shape) + [1])
x_test = x_test.reshape(list(x_test.shape) + [1])
```
no
mnist database is (60000, 28, 28) each image in that database is (1, 28, 28), the image individually is (28, 28 (number of pixels)) but you're adding your image to the database's list - making it (60001, 28, 28)
Why's that a problem though?
(28, 28, 1) = 28x28 1d
(1, 28, 28) = 1 image 28x28
That's not the same
The three numbers don't represent the same thing
np.resize(image, (1,np.shape(image)))
Which would just make it (1, 28, 28, 1) like I said?
no you're reshaping it not adding a one
`np.concatenate((x_train.reshape(1, x_train.shape), reversed_image))` do you mean this?
I'm really confused as to what you're saying
so when you import the 28 by 28 images as an array, you change the shape so that you 'move' the image into the second+third dimension and 'list' it by making the first equal to 1
like I said I'd suggest learning about the topics I mentioned above
even a lot of the preprocessing involves them
I still don't get how to solve the issue I've got
I need to reshape it, I get that
Gimme a sec
Surely what you're saying to do would be `np.concatenate((x_train.reshape(tuple([1] + list(x_train.shape))), reversed_image))`?
nvm
`ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 3 dimension(s)`
I really don't get what I'm doing
```py
image = np.reshape(image, (1, 28, 28))
np.concatenate((x_train, image))
```
Right
I did that also
Got a different error
`ValueError: cannot reshape array of size 47040000 into shape (1,28,28)`
sorry other way around
I'm reshaping x_train atm
`np.concatenate((x_train.reshape(tuple([1] + list(x_train.shape[1:]))), reversed_image))`
why
wait
Reverse image is (28, 28, 1)
And I want to make that (1, 28, 28), right?
Surely I can just reverse the shape?
Okay, that seems to have worked
Now error on the next line
```
---> 19     np.concatenate((y_train, number_in_image))
     20
     21 print(f"Train Shapes: X={x_train.shape}, y={y_train.shape}")
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)
```
Wait nvm, I think I know why
Okay I think I got it working now? No errors at least
```py
## Load data and display shapes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
#print(f"MNIST dataset: {x_train.dtype}")
#print(x_train.shape)

## Load and add custom training samples
import os
_dir = "/content/drive/My Drive/training numbers"
for image_file in os.listdir(_dir):
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    reversed_image = (255 * (1 - resize(image, (28, 28, 1)))).astype('uint8')
    #print(f"My image: {reversed_image.dtype}")
    #print(resized_image.shape)

    ## Add data to training sets
    np.concatenate((x_train, reversed_image.reshape(tuple(reversed(reversed_image.shape)))))
    np.concatenate((y_train, np.array([number_in_image])))
```
Except it doesn't actually concatenate 🤦
Ig I need to assign maybe?
yea
Seems to have worked :~)
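import glob
import os
import numpy as np
import skimage.io
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
_dir = "/content/drive/My Drive/training numbers/*"
for image_file in glob.glob(_dir):
    # assumes each file is already a 28x28 greyscale image with values in 0-1
    image = skimage.io.imread(image_file)
    reversed_image = 1 - np.reshape(image, (1, 28, 28))
    x_train = np.concatenate((x_train, reversed_image))
    # the label is the first character of the file name, not of the full path
    label = int(os.path.basename(image_file)[0])
    y_train = np.concatenate((y_train, np.array([label])))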
I don't need to /255 because it's already greyscale
And concatenate doesn't edit in-place, so I needed to do like x_train = np.concatenate((x_train, reversed_image))
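To recap the three fixes from this thread (dtype, shape, and reassigning the result), a self-contained sketch. Small zero arrays stand in for the real MNIST data and a random array stands in for the custom image, so only the mechanics are shown here.

```python
import numpy as np

# Stand-ins: 5 "MNIST" images and one custom image with float values in 0-1
x_train = np.zeros((5, 28, 28), dtype=np.uint8)
y_train = np.zeros(5, dtype=np.uint8)
custom = np.random.default_rng(0).random((28, 28, 1)).astype(np.float32)
label = 7

# 1. dtype: invert, rescale to 0-255 and cast to uint8 to match x_train
inverted = (255 * (1 - custom)).astype(np.uint8)

# 2. shape: (28, 28, 1) -> (1, 28, 28), i.e. a batch holding one image
batch = inverted[np.newaxis, :, :, 0]

# 3. np.concatenate returns a new array, so the result has to be assigned
x_train = np.concatenate((x_train, batch))
y_train = np.concatenate((y_train, np.array([label], dtype=y_train.dtype)))

print(x_train.shape, x_train.dtype, y_train[-1])
```

With many custom images it is usually cheaper to collect them in a list and concatenate once after the loop, since each call copies the whole training array.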
But yea that's basically what I need ig
Thanks for the help 👍
that should work now
as long as the path is the actual path not just what you have
Is it possible to create an array of legends and data from my DataFrame?
datas = self.db.select().to_numpy() ?
Howdy. What are some good tools for visualization using Pandas? I'm looking for something more presentable than pandas profiling. Also it needs to be confidentiality compliant, so no Datapane
i just make my own plots w/ matplotlib
pandas has a bunch of convenience functions that generate matplotlib plots for you
Is there a difference between an array of shape (x,1) and an array (x)?
Yes, you have 2 axes in one case and 1 in the other. It affects quite a bit actually
but if the length of the second axis is 1, doesn't it become 1D again?
No, you can have an axis of length 1. It is useful for example when concatenating a (28,28) to a (1000, 28, 28)
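A short sketch of why the extra axis matters, with made-up arrays: the two shapes broadcast differently, and a length-1 leading axis is what lets a single image be joined onto a stack.

```python
import numpy as np

flat = np.arange(3)            # shape (3,), one axis
column = flat.reshape(3, 1)    # shape (3, 1), two axes

# Adding them broadcasts to a 3x3 grid instead of an elementwise sum
outer = flat + column

# A length-1 first axis turns one (28, 28) image into a batch of one
stack = np.zeros((1000, 28, 28))
image = np.ones((28, 28))
combined = np.concatenate((stack, image[np.newaxis]))

print(flat.ndim, column.ndim, outer.shape, combined.shape)
```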
What's better, having lots of mini dataframes or one big dataframe in pandas ?
@mellow tiger you can use seaborn
@pale marsh it depends on how it’s being generated/used, how large the total dataset is, and what it’s for
@bitter harbor right now my program takes one big dataframe and splits it into loads of small ones, it gets passed in a list of dataframes to another class which plots graphs (just simple histograms rn) with them using altair
What’s the format of the dataframe?
There are some columns with 15000 rows and I think I'm just using at the most 1/4 of the total dataset I think
It gets it from a rosbag
Oh wait misread that
Floats, ints and arrays of floats, and 2D arrays of floats
How were you planning to use a histogram for that then?
Like unless you do a separate one for each data set idk how possible it’d be
Right now I split each column into its own dataframe and just histogram it up like that but in with Altair I can just give it a large dataframe and specify which column to use the data from
That won’t work with 2d arrays mixed in
Or it will, but you won’t be able to use the same function
Oh yh I was thinking of splitting those off into their own separate dataframe while the single value columns I leave in one big one
But idk which is more efficient
Think it's supposed to scale up later
You’ll have to break them up
At least I can’t think of any way to do that
Also is there a reason you’re using python 2?
If you are just plotting it doesn't really matter how you organized your data, as long as you understand the code and it's not too complicated for others to understand
That said, I don't think I fully understand what you are doing with this data
salt rock lamp, do you understand how torch.Softmax works?
I'm trying to use it as a loss function
i know how softmax works, i dont know how the intricacies work in torch
it's not a loss function
it's a "layer"
Ah
@bitter harbor something to do with the rosbags not being compatible with python 3 I think tho I'm not entirely clear on that
you know what softmax means/does?
I remember using it when I took linear algebra but I forgot how it's defined.
let's see
Looks like sigmoid
yeah
softmax is the multivariate generalization thereof
so it compresses R^n to (0,1)^n, with the outputs summing to 1
Sigmoid is -1,1 softmax/ReLu is 0,1
Do I even need it then if my vectors are one dimensional?
what is your output?
if you're doing classification you (probably) need it
if you're doing regression you (probably) shouldn't use it, unless your regression target has hard upper and lower bounds
Mapping between two vector spaces. So the idea is that once the weights are tuned correctly, a length 768 vector will be the right 200 length vector in another space.
I'm not sure what that means
ℝ, real numbers
@bitter harbor i think sigmoid is also 0-1, hyperbolic tangent is -1, 1, is this what you meant
p sure sigmoid is just "neural network lingo" for logistic function, no?
All the elements in the vector are real numbers, yes
the logistic curve is quite literally sigmoid
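To pin down the ranges being discussed, a small numpy sketch with hand-written helper functions (they are illustrative, not taken from any library): the logistic sigmoid maps to (0, 1), tanh maps to (-1, 1) and is a rescaled sigmoid, and softmax maps a vector to positive values that sum to 1.

```python
import numpy as np

def sigmoid(z):
    # Logistic function, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability; the output sums to 1
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])

print(sigmoid(z))              # every value strictly between 0 and 1
print(np.tanh(z))              # every value strictly between -1 and 1
print(softmax(z).sum())        # sums to 1 (up to floating-point rounding)

# tanh is the logistic function re-parameterized to (-1, 1)
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))
```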
in that case you do not want a sigmoid/logistic/softmax on your output layer. you can have it on the hidden layers
Idk why I’ve been normalizing data to -1,1 then
you can also re-parameterize the logistic function to map to (-1, 1)
hell you can change the center
so you can have (-300, 500) with a center at 100
but why would you bother
@bitter harbor idk either 🙂 i dont like normalizing real-valued data. only when it has known and strict upper/lower bounds such as images which are 0-255 for example
for classification -1/1 was an old-school thing from when everyone loved SVMs
Oh I stg I wasted so much time on doing that to audio
it probably makes sense in specific domains
maybe its recommended by audio people
i usually work with what youd call "social science data" so thats where my recommendations come from
Ah ya idk I remember hearing it somewhere once and just went with it
Oh i think it’d be better if you need more certainty with your floats
yeah but why normalize when you can also standardize
then you're just shifting and scaling without actually clipping your data
which might or might not be relevant depending on your data and model

Hopefully an easy question to answer. When writing functions to do some pandas dataframe manipulation. Is it better to: (from a performance perspective)
(a) construct the dataframe outside of the function, pass into the function as a parameter, then modify it.
(b) construct the dataframe inside the function, modify it, then return the dataframe?
(c) both work equally well.
Nns are hard enough as is
idk how you expect to do anything with data and not at least know basic stats
you dont even need to understand it to scale by it
d) use numpy
Yes but numpy
...is underneath most pandas ops
so if you have mixed data types or you happen to enjoy the use of column names
pandas is much easier to work with
numpy is for math
Yes but numpy
pandas is for data
using numpy for data is like using a bit mask instead of kwargs in python
it works, but why
Math is better anyways and as for data manipulation, I’ve found that with audio/images/not social data there’s barely any sort of stat stuff
yes, that's fine
Math is better anyways
so go learn standard deviation 😉
i wouldn't use pandas for images nor would i use numpy for HLOC time series ticker data
Does pandas have the same sort of flexibility as numpy tho?
Because numpy seems to be useful for pretty much everything
OK, well... I took my pandas vectorized math... 3 lines of relatively simple code. Moved it into a function without modifying what it does, and the performance decreased by 35%. I am trying to understand what changed. Any ideas?
what do you mean flexibility? it feels like you're trying to artificially make this into a "vs" argument where none exists
pandas is a tool for manipulating tabular data, using numpy under the hood
you can use whatever tool you want for whatever purpose you want. i'm just recommending against using plain numpy for most datasets
@modest rune can you show your code with some context
not just the function
Hm ya I didn’t know that but the little data I’ve worked with I’ve used plain numpy
I’ll keep that in mind thanks
Hello y'all, I am building a recommendation system and I have the following features 'genres', 'numVotes', 'averageRating' with the following stats:
mean 7.365355
std 0.588674
min 6.500000
max 9.800000
Name: averageRating, dtype: float64
mean 0.068966
std 0.257881
min 0.000000
max 1.000000
Name: genres (30 classes -> one hot encoded)
mean 75096.11
std 151962.13
min 5000.00
max 2260919.00
Name: numVotes, dtype: float64
How do I normalize these features?
I want to calculate cosine similarity after concatenating all these three features together
Normalize the features all together or separately?
Normalizing 2 million and 0-1 is gonna give you some pretty small values
yeah, probably have to normalize separately and weight them accordingly afterwards
what do you think?
I had to mess with the code a bit to obscure what it does, but here it is.
Function Call
call_profit_df = self.CalcOptionProfit(options_df, one_df, broker_fees,
                                       options['underlyingPrice'], investment)
Function
def CalcOptionProfit(self, options_df, one_df, broker_fees, underlyingPrice, investment):
    profit_df = pd.concat([one_df] * options_df.shape[0], axis=1, ignore_index=True)
    profit_df = ((profit_df * underlyingPrice) + options_df['strikePrice'])
    profit_df = investment + ((profit_df + options_df['price'] - broker_fees['per_contract']
                               - (broker_fees['percent'] * options_df['price'])) *
                              options_df['contracts']) + broker_fees['flat']
    return profit_df
like I said above, the only difference between before and now that caused the 30% speed reduction, moving the code from the function call location, into the newly created function.
yeah, probably have to normalize separately and weight them accordingly afterwards
Why are you normalizing in the first place?
i'd consider normalizing averageRating and standardizing numVotes
because there are too many outliers in numVotes column
clipping outliers should be considered a separate task
from centering/scaling or normalizing the data
i'd consider normalizing averageRating and standardizing numVotes
@desert oar standardizing? what's that?
subtract sample mean, divide by sample std dev
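A minimal sketch of the two rescalings mentioned above, followed by a cosine similarity on the combined features. It assumes NumPy arrays for the averageRating and numVotes columns; the sample values and the helper names min_max_normalize, standardize, and cosine_similarity are made up for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale values linearly into the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Subtract the sample mean and divide by the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sample values, not taken from the real dataset.
average_rating = np.array([6.5, 7.2, 8.1, 9.8])
num_votes = np.array([5000.0, 42000.0, 310000.0, 2260919.0])

rating_scaled = min_max_normalize(average_rating)
votes_scaled = standardize(num_votes)

# One row per movie; one-hot genre columns could be appended the same way.
features = np.column_stack([rating_scaled, votes_scaled])
print(cosine_similarity(features[0], features[1]))
```

Scaling each feature separately keeps the large numVotes values from dominating the similarity; per-feature weights could be applied to the columns before stacking.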
@modest rune i think your function looks fine. it might be slower if you're doing this a very large number of times in a hot loop due to function call overhead
but just moving code to a function should not make it slower
hmm
@desert oar But it does... And I am only calling the function once. I have spent the past 4 days converting all my math to vectorized math... no more loops (at least not yet).
thanks for the info guys!
@modest rune
profit_df = pd.concat([one_df] * options_df.shape[0], axis=1, ignore_index=True)
this looks like weird code
you're concat-ing the same data frame N times?
that looks guaranteed to be slow
I created the function for 1 main reason and 1 secondary reason... (Main Reason): An article I read said it is a good practice so that garbage collection could clean up any unused variables when the variables go out of scope. (2nd Reason): Code cleanliness.
Regarding (Main Reason)... I was worried that my assignment of dataframes to a different name each time I make a modification, might be leading to a lot of copies and making a bigger memory footprint. Is this something I should be worried about, or is pandas good about only making copies when absolutely necessary?
@desert oar There might be a better way... but there is a good reason for why I am doing that. Let me see if I can explain.
Profit_Scenarios = [1.2, 1.5, 0.6, 5.0]
Stock_Data = pd.DataFrame( columns = ['ticker', 'Price'],
data = [[ 'NFLX', 150.2],
[ 'GOOG', 304.1]])
## Desired Results ##
['ticker', 'price'] [ 1.2, 1.5, 0.6, 5.0]
[ 'NFLX', 150.2] [f(150.2, 1.2),f(150.2, 1.5),f(150.2, 0.6),f(150.2, 5.0)]
[ 'GOOG', 304.1] [f(304.1, 1.2),f(304.1, 1.5),f(304.1, 0.6),f(304.1, 5.0)]
So, I start this process by first creating a Profit_Scenarios dataframe with the right number of rows to match the Stock_Data dataframe.
But... we are getting side tracked... I still don't understand the 30% slowdown
both are good reasons
and yes pandas is usually good about not copying data, but not always
i dont understand it either. frankly it shouldnt happen
it suggests that you made a mistake and changed something during refactoring
ah ok so
you're just overwriting the same dataframe
it suggests that you made a mistake and changed something during refactoring
@desert oar
Yeah... probably a 10% chance I did something. But, the change was so simple, as I spend more time with eyes on code, I am feeling less like that is happening.
overwriting the same df basically means that the "old" version of the df already goes out of scope
x = 1
x = 2
the 1 is out of scope as soon as you re-assign to x
so moving to a function in this particular case doesn't help give any hints to the GC
"GC"?
oh garbage collection
Is there a better way to pull off the math without duplicating those rows? I couldn't think of a difference vectorized way to do it.
I know I could loop... but that how I used to do it and it was much much slower
what is one_df
one_df is the the same thing as profit_scenarios in my example
ahh
ok this actually might be a good case for numpy
but have fun keeping track of all those indexes
how important is performance? i'd just use a loop personally
Well, it IS working without looping. I am just trying to optimize further.
And looping was a 400% slowdown.
profit_scenarios = [1.2, 1.5, 0.6, 5.0]
scenario_outcomes = [compute_profit(stock_data, scenario) for scenario in profit_scenarios]
yeah, that is what I used to do. Took 10 seconds, right now it is taking 2.5 seconds with an even larger data set
Ohhhh so you are just expanding the scenario values to match the data shape
yes
Sec
ok this actually might be a good case for numpy
Seems like it
Switching to numpy is on my list of things to do next.
hold on though
you might still end up with lots of copies of your data?
yeah actually
I am thinking the temporary switch to numpy for this particular function will be trivial... a lot of DataFrame.values and maybe some conversions to float64 here and there.
is options_df repeating the data already? like once per scenario?
you might still end up with lots of copies of your data?
@desert oar
Help me understand. I have difficulties understanding when copies might be occurring and when not.
is options_df repeating the data already? like once per scenario?
@desert oar
I don't understand what you mean?
What data?
i don't see how your concat code does the same thing as the for loop
this might just be me being slow/thick
is broker_fees the same shape as options_df?
or is broker_fees a dict of scalars
broker_fees is a dict of scalars
are you using calcOptionProfit inside a for loop?
Since the two dataframes have the same height, pandas is looping through options and one behind the scenes.
nope... one call
it looks like options_df is something like
profit_scenarios = [1.2, 1.5, 0.6, 5.0]
stock_data = pd.DataFrame( columns = ['ticker', 'price'],
data = [[ 'NFLX', 150.2],
[ 'GOOG', 304.1]])
options_df = pd.concat([stock_data] * len(profit_scenarios), axis=1)
That is what I have learned about pandas and numpy. The bigger the set of data you can pass at one time, the more time savings.
otherwise this doesn't make sense to me
one_series = [1.2, 1.5, 0.6, 5.0]
options_df = pd.DataFrame( columns = ['ticker', 'Price'],
data = [[ 'NFLX', 150.2],
[ 'GOOG', 304.1]])
one_df = pd.concat([stock_data] * len(one_series), axis=1)
## Result ##
['ticker', 'price']
[ 'NFLX', 150.2] [ 1.2, 1.5, 0.6, 5.0]
[ 'GOOG', 304.1] [ 1.2, 1.5, 0.6, 5.0]
new_df = one_df.transpose() * options_df['price']
## Result ##
['ticker', 'price']
[ 'NFLX', 150.2] [ 1.2*150.2, 1.5*150.2, 0.6*150.2, 5.0*150.2]
[ 'GOOG', 304.1] [ 1.2*304.1, 1.5*304.1, 0.6*304.1, 5.0*304.1]
Only, I think you have to use one_df.transpose() to get pandas to do the math right. edited pseudocode to show that.
@desert oar Good news! I was chasing a ghost! I am using pyinstrument, which is a great profiler but it isn't deterministic. I reverted back to my non-function implementation and I am seeing the same slowdown. There must be something slowing my laptop down.
Sorry for dragging you all along for the journey.
I do want to go back to this though...
"you might still end up with lots of copies of your data?"
"is options_df repeating the data already? like once per scenario?"
Do you see something suspicious from a performance perspective that I should be aware of?
or at least investigate?
Gonna blame the slowdown on windows update. Now the function implementation is just as fast. 🙂
I really don't understand this
```python
def forward(self, inputs):
    print(inputs)
    print(self.weights)
    self.output = sigmoid(np.dot(inputs, self.weights) + self.biases)
```
You need to transpose the second variable in your dot operation
so that it is (2,1) and (1,2)
The inner dimensions of a dot operation need to be equal
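A small sketch of that rule with NumPy; the arrays here are placeholders filled with ones, not the actual network values.

```python
import numpy as np

a = np.ones((2, 1))
b = np.ones((1, 2))

# Inner dimensions match (1 and 1), so the result takes the outer dimensions.
print(np.dot(a, b).shape)   # (2, 2)
print(np.dot(b, a).shape)   # (1, 1)

# Inner dimensions differ (1 vs 2), so NumPy raises ValueError.
try:
    np.dot(a, np.ones((2, 2)))
except ValueError as err:
    print("shape mismatch:", err)
```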
I am not well versed in matrix math, only knew the answer because I had just run into the same issue the other day. But... my guess is that, broadcasting (I don't know what broadcasting is), for whatever reason, needs the two values to have the same shape.
From what little I know about matrix math. Addition and subtraction can only happen on matrices with the same shape
oh, haha, alright, ill have a play around
Yours are not the same shape.
(2,1) and (2,2). My guess is that your (2,2) matrix was the result of your dot operation. (2,1) dot (1,2) produces a matrix of shape (2,2)
oops
self.delta: [[-3.24185123e-14 -3.48982562e-12] [ 7.88860905e-31 3.86714987e-07]]
inputs.t [[1.00000000e+00 1.00000000e+00] [1.86810924e-06 3.86715286e-07]]
```python
def backward(self, inputs, outputs):
    self.delta = (outputs - self.output) * (self.output * (1 - self.output))
    print(self.delta)
    print(inputs.T)
    self.weights += inputs.T.dot(self.delta)
```
any clue?
they both look the same shape to me
they both look 2,2
tbf idrk what im talking abt
haha
That was not helpful, because I cannot know the shape of inputs and outputs by looking at the code. FYI, you can take your numpy array and print the .shape attribute if you want to quickly see its shape. EX: print(inputs.shape)
alr
they are both (2, 2)
hmmmmmmmmmmmmmmmmm
oh
wait
im dum
self.weights is (2, 1)
@modest rune my question isn't about performance, it's that your code doesn't look like it would emit the right result. Can you provide some sample data and the expected outputs?
@desert oar only if I still have the code I used to test it out. Let me see. Otherwise, I already validated the data and it will take too much time to mock something up.
Just some made up option ticker data?
salt rock, you got any idea about my lil problem?
I still have it...
import pandas as pd
import numpy as np
gain_scenarios = pd.Series([0.34, 0.21, 0.56, 0.11, .54, 1.6, 0.88, 0.01, 0.5])
scalar = 52.0
stock_data = pd.DataFrame(columns = ['Ticker', 'Shares', 'Cost_Per_Share'],
data = [['NFLX' , 100.0 , 0.10 ],
['AAPL' , 150.0 , 0.20 ],
['GOOG' , 500.0 , 5.10 ],
['F' , 70.0 , 7.10 ],
['BKSR' , 130.0 , 0.90 ],
['AMZN' , 90.0 , 5.10 ]])
gain_expanded = pd.concat([gain_scenarios] * stock_data.shape[0], axis=1, ignore_index=True)
print(gain_expanded.shape)
print(gain_expanded)
gain_expanded = ((gain_expanded + 1) * scalar)
print(gain_expanded)
gain_expanded = gain_expanded - stock_data['Shares']
print(gain_expanded)
@earnest wadi I can help, but, please print the shape of inputs and outputs
I mean, let me know what they are.
^ this
?
self.weights and self.delta
(2, 2)
(2, 2)
(2, 1)
Traceback (most recent call last):
File "a:/Python/Libraries/test.py", line 13, in <module>
network.run(X, y, epochs=90)
File "a:\Python\Libraries\main.py", line 55, in run
layer.backward(X, y)
File "a:\Python\Libraries\main.py", line 33, in backward
self.weights += inputs.T.dot(self.delta)
ValueError: non-broadcastable output operand with shape (2,1) doesn't match the broadcast shape (2,2)
nevermind, you are defining delta
```python
def backward(self, inputs, outputs):
    self.delta = (outputs - self.output) * (self.output * (1 - self.output))
    print(self.delta.shape)
    print(inputs.shape)
    print(self.weights.shape)
    self.weights += inputs.T.dot(self.delta)
```
what is "self.output", that is a different variable than outputs
self.output is the output for the layer in question, outputs is the output of the whole neural network up to said layer
tryna use this code to work with my current script
this guy's code isn't written amazingly, as it is fixed and you have to manually add and remove code to add layers; mine just uses classes and functions
```python
class layer_dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = sigmoid(np.dot(inputs, self.weights.T) + self.biases)

    def backward(self, inputs, outputs):
        self.delta = (outputs - self.output) * (self.output * (1 - self.output))
        print(self.delta.shape)
        print(inputs.shape)
        print(self.weights.shape)
        self.weights += inputs.T.dot(self.delta)
```
Sorry all these numbers flying around... I have lost track of everything. Can you clearly tell me the values of these:
self.output.shape
inputs.shape
outputs.shape (edited to add this)
self.weights.shape (before the function is run)
alr
For future reference, the best way to get help is with a minimal reproducible example
Sample data + code that reproduces the error
self.output.shape (2, 2)
inputs.shape (2, 2)
outputs.shape (2, 1)
self.weights(2, 1)
oh
please include the name of the variable... that way I don't have to make assumptions
And, if you didn't notice, I snuck in outputs.shape too
there
Please, the shape of self.weights, not the values
self.output.shape (2, 2)
inputs.shape (2, 2)
outputs.shape (2, 1)
self.weights.shape(2, 1)
network.run(X, y, epochs=90) which is this
class network:
def __init__(self, io):
total = len(io)
layers = []
for i in range(total):
layers.append(layer_dense(io[i][0], io[i][1]))
self.layers = layers
def run(self, X, y, epochs):
for r in range(epochs):
i = -1
for layer in self.layers:
layer.forward(X)
X = layer.output
i += 1
layer.backward(X, y)
self.output = self.layers[i].output```
Dumb, question, I didn't need that. Thanks though 🙂
lol
I didn't think this was possible:
(outputs - self.output)
since outputs is (2,1) and self.outputs is (2,2)
but... like I said before, I am not an expert with matrix math, so take what I write with a grain of salt
haha
maybe numpy is able to handle that situation and makes some assumptions about what you are trying to do.
maybe
@desert oar Were you able to verify my code works on your end?
Numpy will maybe broadcast the mismatched vector
No i havent
I need to head offline soon. @ me in a few hours
k
im gonna take a break as well, unless you have any final ideas?
yes, I have an idea
im all ears
the error ValueError: non-broadcastable output operand with shape (2,1) doesn't match the broadcast shape (2,2) is pretty specific
It is saying that self.weights is (2,1) and the other variables on that line of code are (2,2), and that normally this won't work because of Matrix Math. But... Numpy is nice enough to make assumptions for you by doing what they call broadcasting (as salt rock lamp mentioned)
BUT
They are refusing to do broadcasting for the += operation.
Well, if it were me, I would assume that you shouldn't be mixing (2,1) and (2,2) in the first place. Maybe they are all supposed to be (2,1) or all supposed to be (2,2), maybe you ended up with a matrix shape that is incorrect somewhere along the line.
This will require you understand what your algorithm is trying to do.
Not a clean answer I know, best I can do though. Good LUCK!
thanks for your help 😄
ahhahahaha
@modest rune
I made
self.weights += blah
to self.weights = self.,weights + blah
clipping outliers should be considered a separate task
@desert oar how do I handle outliers?
there you go! The error only said += won't work... never said B = B + A won't work
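A short sketch of why the two forms behave differently, using placeholder arrays with the shapes from the traceback. The in-place form has to write into the existing (2, 1) array, while the out-of-place form builds a new array, so the workaround runs but silently changes the shape of the weights.

```python
import numpy as np

weights = np.zeros((2, 1))
update = np.ones((2, 2))

# In-place add must write into the existing (2, 1) buffer, so it fails.
try:
    weights += update
except ValueError as err:
    print("in-place add failed:", err)

# Out-of-place add broadcasts (2, 1) against (2, 2) and rebinds the name
# to a brand new (2, 2) array, so the weight matrix changes shape.
weights = weights + update
print(weights.shape)  # (2, 2)
```

If the layer is supposed to keep (2, 1) weights, the shape change is a hint that one of the operands upstream has the wrong shape.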
xd
@modern canyon that's a big topic
err
actually it looks like they've added a lot more useful functionality for outlier detection
thanks!
this is much more complete than in the past
and this is a nice user guide. lots of pretty pictures
the sklearn user guide is turning into quite a nice document
I have a pretty broad question, but I hope that there is a data science pro that's nerdy enough to find some joy in helping me out. I'm working on an analysis with a data set of about 4500 rows, 95 categorical variables (optional configurations for a product), and a binary output "did it fail or not". What kind of approach would help me figure out which of these categorical variables is more correlated to failures?
It's not quite a multiple linear regression because while there are multiple factors that need to be considered, I'm more interested in which of the individual factors contribute to failures more often, and thus are better indicators of failures.
Is the output for each variable or for each row
I’d say tag the categories for when it fails, so maybe append them to a list? and checking which item(s) shows up the most in the list
Yeah, I can definitely do that manually, but I'm concerned about that approach for two reasons. 1) gotta be a better way to do that and 2) that really only captures if a category was involved in an outcome that was a failure, not whether or not it's a factor in predicting failures. Will it be more likely that if that variable is involved that a failure will happen? Yes, but how much more likely? How strong is the correlation? Are there other variables that were also involved in the failure that are more likely to be correlated to the failure? Is it the combination of the variables that makes the failure more likely?
Those can be answered in post processing
```python
factors = []
for column in dataset:
    if column[3] == failed:
        factors.append(column[2])
```
I can definitely do that. Do you mind elaborating on how that would be answered in post processing?
No sorry but it would depend on what you’re trying to calculate
Do you need me to elaborate more on my end?
Nope
You’ll have a list and you’ll have the ability to count how many times a factor comes up
Correlation might be a bit harder to define considering its 95 variables to 1 outcome
Finding math oriented statistical properties on the other hand would be as simple as just performing basic operations
Not sure if this would be the correct channel. Was contemplating the idea of webscraping individual house value estimates. Like if a homeowner wanted to check the value of their house.
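One way to go a step beyond counting is to compare the failure rate for each level of each categorical column. This is only a sketch with pandas; the DataFrame, its column names, and the helper failure_rates are hypothetical stand-ins for the real 95-column data set.

```python
import pandas as pd

# Hypothetical data: two option columns and a binary failure flag.
df = pd.DataFrame({
    "option_a": ["x", "x", "y", "y", "y", "x"],
    "option_b": ["p", "q", "p", "q", "p", "p"],
    "failed":   [1, 0, 1, 1, 1, 0],
})

def failure_rates(data, feature, target="failed"):
    """Failure rate and row count for each level of one categorical column."""
    grouped = data.groupby(feature)[target].agg(["mean", "count"])
    return grouped.rename(columns={"mean": "failure_rate", "count": "rows"})

for column in ["option_a", "option_b"]:
    print(failure_rates(df, column))
```

Levels whose failure rate is far from the overall rate, and that have enough rows to be trusted, are the ones worth a closer look. A chi-squared test of independence or a logistic regression on the one-hot encoded columns are common next steps for putting a number on the strength of the association.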
I know Zillow has an API. Not sure if that would be the best route
Thoughts? Different channel?
Okay, thank you.
Is there a way to have Pandas import all Excel sheets without having to label each one individually?
with pd.ExcelFile('mybook.xlsx') as efile:
    dfs = {sheet: efile.parse(sheet) for sheet in efile.sheet_names}
that gives you a dict mapping sheet names to data frames
Allright thanks. Because I got over 100 sheets
Looking for advice to speedup this function:
def CalcOptionProfit2(self, options_df, profit_df, broker_fees, underlyingPrice, investment):
    profit_df = ((options_df['putCall_float'] * (profit_df + 1) * underlyingPrice)
                 - options_df['strikePrice'])
    profit_df = investment + ((profit_df - options_df['price'] - broker_fees['per_contract']
                               - (broker_fees['percent'] * options_df['price'])) *
                              options_df['contracts']) - broker_fees['flat']
    return profit_df
options_df: 3,000 x 9, mixed datatypes
profit_df: 3,000 x 100, float64
broker_fees: Dict of scalars
underlyingPrice: Scalar
investment: Scalar
This function runs once. Takes 4 seconds to run. I am starting to think this is not the most efficient way to do this math in pandas. I say that, because I compare these two lines of code to all of the other pandas math I am doing, and I am surprised it takes 4 seconds to run.
I have tried for loops and the apply function (using a custom function). I cannot promise I tried those approaches the right way, but when I did try those approaches, things were significantly slower... like 12 seconds instead of 4.
Any advice or links to documents I should read are greatly appreciated!
The rest of my code runs in 0.180 seconds... it is just these 2 lines that are giving me a headache.
Ideas I have:
A. Switch to numpy for the calc. Not sure how to do this.
B. Combine 2 dataframes into one using groupby
I want to output matrix like this
9 8 7 6
10 11 12 5
1 2 3 4
I want to output spiral matrix...
this is my first message in this server... can someone help me?
!!! i=x, j=x
for one more example =>
if I pass 3, 5 as the function arguments, it must print this:
11 10 9 8 7
12 13 14 15 6
1 2 3 4 5
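A possible sketch for the spiral shown in the two examples: it starts at the bottom-left corner, runs right along the bottom row, then up the right side, left along the top, and down the left side, shrinking the boundaries after each pass. The function name spiral_matrix is made up for illustration.

```python
def spiral_matrix(rows, cols):
    """Fill a rows x cols grid counter-clockwise, starting at the bottom-left."""
    grid = [[0] * cols for _ in range(rows)]
    top, bottom, left, right = 0, rows - 1, 0, cols - 1
    n = 1
    while top <= bottom and left <= right:
        # Bottom row, left to right.
        for c in range(left, right + 1):
            grid[bottom][c] = n
            n += 1
        bottom -= 1
        # Right column, bottom to top.
        for r in range(bottom, top - 1, -1):
            grid[r][right] = n
            n += 1
        right -= 1
        if top <= bottom:
            # Top row, right to left.
            for c in range(right, left - 1, -1):
                grid[top][c] = n
                n += 1
            top += 1
        if left <= right:
            # Left column, top to bottom.
            for r in range(top, bottom + 1):
                grid[r][left] = n
                n += 1
            left += 1
    return grid

for row in spiral_matrix(3, 5):
    print(" ".join(str(value) for value in row))
```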
Other hints I have to why something is wrong:
A. Each time I add a dataframe column to my equation (ex. options_df['strikePrice']) the runtime increases by about 0.5 seconds... that seems like a hefty increase for an item that is on the same row as the other items in the dataframe.
B. I decreased my dataset from 6500 x 9 to 3000 x 9 and only saw a 25% speed improvement. How does that make any sense?
@modest rune this is more helpful let me see what i can do here
I'm currently training a seq2seq model and I am trying to diagnose my fairly stagnant losses. Are they a result of any of the following?:
- My learning rate is too low
- I should use a different criterion or optimizer
- I messed up somewhere in the code
honestly this is about as efficient as you can make it. the only other option is to use something like numexpr
Since my comment on here, I have started trying to do the math with numpy... seems to be MUCH faster, but I haven't validated my output yet though
@modest rune numexpr "compiles" all your operations so you don't have all these intermediate results
and yes raw numpy will be faster
because pandas does a lot of work to align indices
whereas numpy is purely position-based
but numexpr will probably be more efficient than both
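A tiny sketch of the difference between label alignment and positional math; the two Series and their labels are invented for the example.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

# pandas matches values by index label before adding.
by_label = left + right
print(by_label["a"], by_label["c"])   # 31.0 13.0

# NumPy arrays have no labels, so values are combined by position.
by_position = left.to_numpy() + right.to_numpy()
print(by_position.tolist())           # [11.0, 22.0, 33.0]
```

The alignment step is what makes pandas safer for labelled data and also what adds overhead on large frames.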
Well... I hope what I am seeing is correct and that my output is correct. Right now, those 2 lines run in 0.011 seconds with Numpy, versus 4 seconds with pandas.
me too
ahhh hold on
options_df['putCall_float'] * (profit_df + 1)
this might be a much more expensive broadcasting operation in pandas vs in numpy
what's the idea here, you want to multiply each column of the profit_df matrix by the options_df['putCall_float'] vector?
that particular one is sort of a workaround.
I have an equation that is ALMOST the same for puts and calls, except I have to negate part of the equation.
this is what i was trying to understand before. what does each row and column of profit_df represent
So, I created a column that stores either 1 or -1 depending on whether or not it is a put or call.
profit scenarios. Can't go into more details than that. That is the magic sauce.
i don't really care what they are from that perspective
i mean, is each row a list of parameters/scenarios that you want to try, and you're duplicating that list over and over?
or something like that
is each row an hour? a different ticker label?
i just need to know what each one represents, i dont care about the magic sauce so to speak
make up things if you want i just need some context for the problem
each element is a percentage. The rows are duplicated, but will all be different in the end after the options_df gets ahold of everything in the math.
i see
so each column is one "parameter" (corresponding to the "profit scenarios" in your earlier example)?
yes
profit_df = pd.DataFrame([
[1, 2, 3],
[1, 2, 3],
...
], columns=['A', 'B', 'C'])
like that?
and yes i appreciate that you need to be cautious about revealing the magic sauce, trust me i'm not trying to get you to reveal any of it
Yes, that is basically profit_df, with A, B, C representing different profit scenarios.
great
let me spend a minute with the numpy docs
that said if < 1 sec is good enough for you
i have a code snippet for you
Yes. the current speed is excellent, but, if you have additional suggestions, I'd love to hear them.
I am still quite surprised that pandas was so slow. Makes me reticent to use pandas in the future.
its because of the indexing
you can always drop down to numpy for better performance, ill show you
i think the organizational benefits of pandas makes it worth using for most things
then inside your functions you can switch to numpy
or again, numexpr
ok this is easier than i thought
does environment.yml have to always be named environment.yml?
no, as long as you always refer to it by name with -f
thank you
conda env update -n myenv -f env1.yaml
ty
so @modest rune numpy broadcasting has two rules: the dimensions that match are preserved, and the dimensions that are 1 are broadcast
that's what the solution i'm writing makes use of
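A minimal sketch of those two rules, reusing the prices and scenario values from the earlier example: a column vector of shape (n, 1) combined with a row vector of shape (1, m) gives an (n, m) grid.

```python
import numpy as np

prices = np.array([150.2, 304.1]).reshape(-1, 1)           # shape (2, 1)
scenarios = np.array([1.2, 1.5, 0.6, 5.0]).reshape(1, -1)  # shape (1, 4)

# The size-1 axes are stretched, giving one row per price and one
# column per scenario, without building repeated copies by hand.
grid = prices * scenarios
print(grid.shape)  # (2, 4)
```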
this is good practice for me too btw i haven't written code like this in a while
Ahh... I think I need to read up on broadcasting. I am guessing your code does the same as mine but avoids the whole duplicating rows bit. Is my guess correct?
maybe... I definitely had a variable in the wrong place... that putCall_float variable. I had added it without validating the data. Doing all of that now.
I gotta run. You mind PMing me the code snippet? that way it is easier for me to find in the future? Post it here too, in the rare chance someone is following our conversation.
sure
https://repl.it/@maximum__/numpy-broadcasting-sandbox demo of broadcasting
def CalcOptionProfit2(self, options_df, profit_scenarios, broker_fees, underlying_price, investment):
    """ Calculate option profit scenarios
    Arguments:
        options_df: DataFrame with columns 'putCall_float' (float, 1 or -1), 'strikePrice' (float), 'price' (float), and 'contracts' (float)
        profit_scenarios: List of floats
        broker_fees: Dict of broker fees with keys 'percent', 'flat', and 'per_contract'
        underlying_price: Float
        investment: Float
    Returns:
        DataFrame with option profits corresponding to the profit scenarios
    """
    # Capture the original data index to use at the end
    original_index = options_df.index
    # Convert option data into column vectors
    putcall_vec = options_df['putCall_float'].to_numpy('float32').reshape((-1, 1))
    strikeprice_vec = options_df['strikePrice'].to_numpy('float32').reshape((-1, 1))
    price_vec = options_df['price'].to_numpy('float32').reshape((-1, 1))
    contracts_vec = options_df['contracts'].to_numpy('float32').reshape((-1, 1))
    # Convert profit_scenarios into a row vector
    profit_scenarios_vec = np.asarray(profit_scenarios, dtype='float32').reshape((1, -1))
    fee_per_contract = broker_fees['per_contract']
    fee_percent = broker_fees['percent']
    fee_flat = broker_fees['flat']
    result = (putcall_vec * (profit_scenarios_vec + 1) * underlying_price) - strikeprice_vec
    result = investment + ((result - price_vec - fee_per_contract - (fee_percent * price_vec)) * contracts_vec) - fee_flat
    # Wrap the (n_options, n_scenarios) array in a DataFrame that keeps the original index
    return pd.DataFrame(result, index=original_index)
how can i pass a variable from function1 to function2? I thought it was as easy as this...apparently not?
```python
def function1():
    a = 1
    function2(a)

def function2(a):
    print(a)  # do some stuff with variable a from function1

function1()
```
seems like i have to return the value then save it as a global variable, my code works now, is that the best/only way to do it?
No
Sounds like you need to go through Beazley's python course
But this is more help channel content than anything
thanks for the link, will go through the course. I've only been working from a 1 hour tut + googling the rest 😄
but quick answer maybe?
it's a bit more than a quick answer
i'll try
if you define a function that accepts input, you must write that in the definition
!e ```python
def function1():
    a = 1
    b = function2(a)
    return b * 7

def function2(a):
    return a + 2

result = function1()
print(result)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
21
yeah made a typo, meant to write function2(a)
looks like you got the right idea then
hmm i have to recheck maybe i made a typo in my code or forgot something
but can you recheck my pseudo code above? correct now?
return i know but its not necessary for this example
just want to pass 1 variable from function to next function when calling the latter
alright it worked 😄 thanks for explaining
i had the right idea but mustve made a typo..multiple times lol
if anyone could help me with a scatterplot that be pretty cool
x being one column of target values
y being a group of predictor columns
hi guys
do I have to learn any other languages aside for python
to be a legitimate data-engineer?
Perhaps SQL too
oh
SQL for sure, java or scala might help
Oh really? Java???
eh python is usually sufficient @odd apex.
But it would be recommended to learn a form of SQL. It shouldn't be too difficult once you understand Pandas, considering df's and db's are conceptually similar.
Oh really? Java???
@timber eagle scala/java are useful for big data applications (eg: data ingestion, processing etc)
Has anyone tried Tech With Tim's Machine Learning tutorial series? Is it good?
anyone have a little experience with extended isolation forest?
was able to implement normal isolation forest without problem
does extended isolation forest actually work for 1 dimension time series data?
@potent nymph idk about that one but personally id recommend 3blue1brown
^
3blue1brown has the best introduction to neural networks
easily
by far
speaking of neural networks
is anyone able to confirm that these notes are correct?
just want to double check i have the right formulas for calculating the derivatives
I'd learn SQL, considering even at my beginner level SQL gets data very fast especially if you are dealing with huge corporate databases.
Above about 100k rows in excel things start getting slow
So if a company is being smart about their data, they should migrate to SQL once their rows reach 100k or more.
SQL is not like Excel, instead of clicking a cell or selecting a row, you must write a script.
This script can be shared to access the same database on the server.
@gaunt tusk isn't a^{L-1} just z^{L}? I take it that a is the current layer and z is the previous layer. Also, is W^{L} just the weight matrix from z^{L} to a^{L}? Because in that case z^{L} is independent of w^{L}
it's only the weighted sum? What about the activation/transition function? Or is your network linear
holdon i'll list out what each one is
a{l} = Activation ( σ(z{l}) )
a{l-1} = Previous neuron activation
z{l} = weighted sum ( (a{l-1}*w{l}) + b{l} )
w{l} = weight
b{l} = bias
c0 = cost ( (a{l} - y)^2 )
σ(x) = sigmoid function
using sigmoid for my activation/transition
yeah sorry forgot to stick it there, its (a{l} - y)^2
all good
and this is just for like
a single training example
i'll end up using it on matrices in my actual thing
just trying to lay it out first
so yeah the cost will end up being the sum of squares
ok yeah, your derivatives look fine
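One way to double check chain-rule derivatives like these is to compare them with a finite-difference estimate. The sketch below assumes the single-neuron definitions listed above (sigmoid activation, cost (a - y)^2); the function names and the test values are arbitrary choices for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cost(a_prev, w, b, y):
    """Squared-error cost for a single neuron and a single example."""
    return (sigmoid(a_prev * w + b) - y) ** 2

def gradients(a_prev, w, b, y):
    """Chain-rule derivatives of the cost with respect to w and b."""
    a = sigmoid(a_prev * w + b)
    dc_da = 2.0 * (a - y)         # dC/da
    da_dz = a * (1.0 - a)         # derivative of sigmoid at z
    dc_dz = dc_da * da_dz
    return dc_dz * a_prev, dc_dz  # dC/dw, dC/db

# Compare against a central finite difference (values are arbitrary).
a_prev, w, b, y, eps = 0.7, 0.3, -0.1, 1.0, 1e-6
dw, db = gradients(a_prev, w, b, y)
dw_numeric = (cost(a_prev, w + eps, b, y) - cost(a_prev, w - eps, b, y)) / (2 * eps)
db_numeric = (cost(a_prev, w, b + eps, y) - cost(a_prev, w, b - eps, y)) / (2 * eps)
print(dw, dw_numeric)
print(db, db_numeric)
```

If the analytic and numeric values agree to several decimal places, the formulas are very likely right; the same check works per element once the code moves to matrices.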
@gaunt tusk just wondering, how come you're not using something like MSE for your cost function?
not sure i haven't looked at any other cost functions as of yet
whats the benefit of MSE?
how can I reduce val_loss?
Epoch 145/150
32/32 [==============================] - 3s 80ms/step - loss: 0.3107 - accuracy: 0.9277 - val_loss: 0.9093 - val_accuracy: 0.6875
Epoch 146/150
32/32 [==============================] - 3s 85ms/step - loss: 0.3060 - accuracy: 0.9283 - val_loss: 1.8575 - val_accuracy: 0.6228
Epoch 147/150
32/32 [==============================] - 3s 82ms/step - loss: 0.2562 - accuracy: 0.9507 - val_loss: 3.1728 - val_accuracy: 0.6491
Epoch 148/150
32/32 [==============================] - 3s 79ms/step - loss: 0.2472 - accuracy: 0.9473 - val_loss: 3.3467 - val_accuracy: 0.6140
Epoch 149/150
32/32 [==============================] - 3s 79ms/step - loss: 0.3238 - accuracy: 0.9191 - val_loss: 2.0550 - val_accuracy: 0.6404
Epoch 150/150
32/32 [==============================] - 3s 81ms/step - loss: 0.2501 - accuracy: 0.9507 - val_loss: 3.1427 - val_accuracy: 0.5877```
feel free to ping me
@gaunt tusk well it depends on the dataset. I've seen MSE used a lot more than least squares. I dont remember the exact properties of each but from my experience MSE usually provides better results in regression type problems
Hmm i'll have a look into it
i believe the one i'm using should be fine for what i'm doing atm though
just making a simple handwritten digit recogniser
using the mnist dataset
I've used RELU (and its cousins) way more since it's less expensive (computationally) and provides faster convergence on networks where I dont have to worry about negative values
but anyways, yeah you'll prob be fine with your current cost function
^ReLu would probably work better for image recognition
considering your values are between 0 and 1 already
hmm i'll check it out
welp I'm mixing things. Relu is an activation function
i have heard that its a more modernly used activation function
and i've been looking around and i believe the formulas for the partial derivatives i have are correct
so i believe i'm all set
yes it is but you'll still have to use sigmoid if you ever have negative data
would the bias be able to make it negative?
idk if there's anything similar to it tho
wdym the bias
the weights and biases have nothing to do with: 1) Your input/output 2) activation function
you can consider them separate items
i thought you passed in the weighted sum to the activation?
thats right
so the bias would never be able to be low enough to make it negative is what you're saying?
so if your activation function squashes to a range of (0, 1), like sigmoid does, all activations will be between the two
whereas ReLU is max(0, x), so it's only bounded below at 0, and tanh is the one that gives you (-1, 1)
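The output ranges are easy to check numerically. A small sketch, assuming the usual definitions (sigmoid as 1/(1+e^-x), ReLU as max(0, x)) and an arbitrary input grid that includes negative weighted sums:

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 2001)   # weighted sums, including negative ones

sig = 1.0 / (1.0 + np.exp(-z))   # stays strictly inside (0, 1)
relu = np.maximum(0.0, z)        # 0 for negative z, unbounded above
tanh = np.tanh(z)                # stays strictly inside (-1, 1)

for name, out in [("sigmoid", sig), ("relu", relu), ("tanh", tanh)]:
    print(name, out.min(), out.max())
```

Note that all three accept negative inputs; what differs is the range of the output.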
oh wait yeah my inputs are always going to be between 0 and 1
so relu probably would be the better option here yeah
i'll have a look at it
and one other question actually
have a look at cnn's too
https://paste.pythondiscord.com/oqowucuyob.py currently have this so far
image recognition usually uses either that or what you have, which is a perceptron
and i'm running a test image through it
just testing the forward passing
it works fine on the first two neurons
but for some reason the last one it throws an error
layers or neurons
ah yeah layers
Traceback (most recent call last):
File "/Users/6503/Desktop/CodeStuff/MachineLearning/NeuralStuff.py", line 115, in <module>
test = thing.feed_forward(list(letssee[0])[0][0])
File "/Users/6503/Desktop/CodeStuff/MachineLearning/NeuralStuff.py", line 103, in feed_forward
activation = self.sigmoid(np.dot(weight, activation)+bias)
ValueError: operands could not be broadcast together with shapes (10,16) (10,)
it seems to have changed the arrays shape somehow
not sure where though
i was printing out the shapes just to check what the first two layers were doing
```
(784, 1)
(16, 784)
(16, 16)
(10, 16)
``` they look fine
by first two layers you mean input+1 hidden?
yeah
so it basically just doesn't make it to the output
goes through the one hidden layer
I'd suggest you look at examples of perceptron in python, I could be wrong but your code seems too short
i mean its not even close to the full thing
its just the feedforward part
but yeah i'll have a look around
thanks for the help and suggestions
ah and i think i just found the issue too
yep
i flattened the bias array earlier for some reason
so just removed that
```
[[0.96463723]
[0.6023769 ]
[0.45853454]
[0.13891415]
[0.02237485]
[0.09243328]
[0.30762676]
[0.84720262]
[0.99305502]
[0.13672393]]
``` now the outputs lookin right
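A sketch of why a flattened bias only blows up at the last layer, assuming the layer sizes implied by the printed shapes (784, 16, 16, 10) and random weights in place of the real ones. With a `(16,)` bias, `(16, 1) + (16,)` silently broadcasts to `(16, 16)`, so the first two layers run without complaint and the mismatch only surfaces at the `(10, 16)` layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, weights, biases):
    # Same update as in the traceback: sigmoid(np.dot(weight, activation) + bias)
    activation = x
    for weight, bias in zip(weights, biases):
        activation = sigmoid(np.dot(weight, activation) + bias)
    return activation

rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]   # assumed from the shapes printed above
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]   # column vectors
x = rng.random((784, 1))    # stand-in for one flattened image

out = feed_forward(x, weights, biases)
print(out.shape)            # column-vector biases keep every activation (n, 1)

# Flattening the biases to shape (n,) reproduces the broadcasting error
flat_biases = [b.ravel() for b in biases]
try:
    feed_forward(x, weights, flat_biases)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Keeping biases as `(n, 1)` columns (or keeping everything 1-D consistently) avoids the silent broadcast.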
also you can look at the hidden layer for sure, just like you can list weights/biases but they won't tell you anything
"Neural networks are so-called [black boxes] because they mimic, to a degree, the way the human brain is structured: they're built from layers of interconnected, neuron-like, nodes and comprise an input layer, an output layer and a variable number of intermediate 'hidden' layers -- 'deep' neural nets merely have more than one hidden layer. The nodes themselves carry out relatively simple mathematical operations, but between them, after training, they can process previously unseen data and generate correct results based on what was learned from the training data."
tl;dr they're performing functions and you won't be able to tell shit from it
to be fair, a lot of work has been done on the interpretability of neural networks.
how so
WARNING:tensorflow:From /Users/jaqqen/.local/share/virtualenvs/ShaVas-DrKzIL9u/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
I get this warning plus that update-instruction everytime my program hits model.save(path)
I already passed in the *_constraint-arguments to the layers and it looks like this now:
model.add(Flatten())
model.add(Dense(50, activation=leaky_relu, kernel_constraint=None, bias_constraint=None))
model.add(Dropout(.1))
model.add(Dense(50, activation=leaky_relu, kernel_constraint=None, bias_constraint=None))
model.add(Dropout(.3))
model.add(Dense(2, activation=softmax, kernel_constraint=None, bias_constraint=None))
Good evening
If I have a 9x9 numpy array b:
[[0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]
 [0 1 2 3 4 5 6 7 8]]
which I got from the following code:
a = []
for i in range(9):
    a.append(i)
b = []
for i in range(9):
    b.append(a)
b = np.array(b)
I am trying to turn it into 9 3x3 images using .reshape method:
c = b.reshape(9,3,3)
However, the result I get if I print c[0], namely the first sample, is:
array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
whereas what I want is the upper left corner of the image, namely:
array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])
Researching over stack overflow, the solution might have to do with reshaping, using .swapaxes method, then reshaping again: https://stackoverflow.com/questions/45950264/reshape-array-in-squares-like-an-image
But I couldn't figure out how I should use this for my case
Any help would be very much appreciated!
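A sketch of the reshape/swapaxes approach from that Stack Overflow link, applied to this 9x9 case. The idea is to split each axis into (block index, index within block), move the two block indices to the front, then merge them:

```python
import numpy as np

b = np.tile(np.arange(9), (9, 1))   # same 9x9 array as above

# axes after the first reshape: (block_row, row_in_block, block_col, col_in_block)
# swapaxes(1, 2) gives:         (block_row, block_col, row_in_block, col_in_block)
c = b.reshape(3, 3, 3, 3).swapaxes(1, 2).reshape(9, 3, 3)

print(c[0])   # upper-left 3x3 corner
print(c[1])   # the block to its right
```

A plain `b.reshape(9, 3, 3)` just chops each row into consecutive groups of three, which is why `c[0]` came out as `[[0, 1, 2], [3, 4, 5], [6, 7, 8]]`.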
how to reduce val_loss in CNN ?
https://paste.pythondiscord.com/gidavuvoti.py my code here
```
Epoch 80/85
32/32 [==============================] - 3s 79ms/step - loss: 0.2400 - accuracy: 0.9551 - val_loss: 3.0391 - val_accuracy: 0.6435
Epoch 81/85
32/32 [==============================] - 2s 77ms/step - loss: 0.2816 - accuracy: 0.9331 - val_loss: 1.7805 - val_accuracy: 0.5913
Epoch 82/85
32/32 [==============================] - 3s 79ms/step - loss: 0.3244 - accuracy: 0.9147 - val_loss: 2.2709 - val_accuracy: 0.6172
Epoch 83/85
32/32 [==============================] - 3s 84ms/step - loss: 0.2983 - accuracy: 0.9395 - val_loss: 1.6353 - val_accuracy: 0.6174
Epoch 84/85
32/32 [==============================] - 2s 76ms/step - loss: 0.3258 - accuracy: 0.9206 - val_loss: 3.8418 - val_accuracy: 0.5913
Epoch 85/85
32/32 [==============================] - 3s 85ms/step - loss: 0.2855 - accuracy: 0.9390 - val_loss: 2.8588 - val_accuracy: 0.6783
training completed...2
Epoch 1/1
9/9 [==============================] - 1s 70ms/step - loss: 3.6322 - accuracy: 0.4427
score : [1.7107210159301758, 0.572519063949585]```
I once was stuck with validation loss, turns out I needed to shuffle the data. Note that setting shuffle=True only shuffles the data after the validation split, if I'm not mistaken
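As far as I know, Keras takes the `validation_split` fraction from the end of the arrays before any shuffling, so data sorted by class can leave the validation set with a single class. A minimal sketch of shuffling features and labels together first, using toy arrays (the names `X` and `y` are placeholders, not from the pasted code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data sorted by class: row i is [2i, 2i+1], first five rows are class 0
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# One permutation applied to both arrays keeps each row paired with its label
perm = rng.permutation(len(X))
X_shuf, y_shuf = X[perm], y[perm]

# Manual 80/20 split after shuffling
n_val = int(0.2 * len(X_shuf))
X_train, X_val = X_shuf[:-n_val], X_shuf[-n_val:]
y_train, y_val = y_shuf[:-n_val], y_shuf[-n_val:]
print(X_train.shape, X_val.shape)
```

The shuffled arrays can then be passed to `fit` with `validation_split`, or the manual split can be passed as `validation_data`.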
@dull turtle
@unkempt lotus can u refer to my code above pasted in link
https://paste.pythondiscord.com/gidavuvoti.py @unkempt lotus code here
It is a bit too long, apologies, but I searched for "shuffle" and didn't find anything
Not sure if this is your issue, but if you want give it at least a try
@unkempt lotus if your array is 9x9:
array = array
array1 = array[0:3, 0:3]
array2 = array[0:3, 3:6]
array3 = array[0:3, 6:9]
etc
I was trying to think of a way to do this with recursion, but pretty sure it'd be more complicated
@bitter harbor No worries, thank you for the effort, I might consider implementing this
I want to build a KDTree (scikit-learn) of unique points, however calling numpy.unique() on the array of points takes much longer than building the KDTree (over 10x longer). Is there a way to use the KDTree structure to make it unique, rather than the apparently-expensive numpy.unique operation?
how long is '10x longer'?
depends on the number of points
in my unittest it goes from 1s to 12s
120,000 points
Thats pretty good considering what its doing
maybe check this out?
I've been searching for a function similar to numpy.unique, but I've had no luck
Hmm I don't get why he doesn't use numpy.unique()
who knows???
maybe because of the same problem you're running into
if you think about whats happening, it makes sense that it would take 12s
because its looking at the 120000 points, comparing them to themselves and returning it
Well it seems to me that the KDTree could easily remove duplicates
during insertion, with little overhead
it just doesn't have an argument to do that
I think the issue is the time cost in general
Oh from your link I found a solution using set() which is much faster
(0.2s)
I assume this is more memory-hungry though
couldn't tell ya
this might help too https://stackoverflow.com/questions/46575364/efficiently-counting-number-of-unique-elements-numpy-python
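A sketch of the two deduplication approaches mentioned here, on random made-up points. It only handles exact duplicates, and which one is faster will depend on the data and NumPy version, so it is worth timing on the real array:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical point cloud with many exact duplicates
points = rng.integers(0, 50, size=(10_000, 3)).astype(float)

# NumPy route: sorts the rows, returns unique rows in sorted order
unique_np = np.unique(points, axis=0)

# Hash-based route: each row becomes a tuple; dict.fromkeys keeps first-seen order
unique_hashed = np.array(list(dict.fromkeys(map(tuple, points))))

print(len(points), len(unique_np), len(unique_hashed))
```

The hash-based version builds a Python tuple per point, so it does use more memory than the array itself.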
If anyone is interested in learning about making their own deep learning library in python feel free to check this first video out 🙂
Hello!
Today we start a new adventure where we will be expanding on the JoelNet library with the ultimate goal of deploying our own MNIST web classifier (and maybe attacking it using some simple adversarial attacks). The idea is to model the library around the scikit-learn api...
I am an undergraduate researcher in machine learning
oooh
might be something i'll look into. Though I have a feeling autograd will be a pain...
Thanks @raven mulch
No prob let me know what you think 🙂
You can bypass auto grad by hard coding the layer types
Write the gradients out by hand
Probably more educational than using an autograd lib
^^ i'd suggest doing everything (maybe except matrix manipulation (that's what numpy's for)) by hand
Yeah just use numpy for that
doing all that would be painful
but doing it all yourself will help you understand everything better
just like how it's Very useful to learn linear algebra (or concepts used in ml)
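A rough sketch of what "hard coding the layer types" could look like: a dense layer with a hand-written backward pass, checked against a finite difference. The class name and method names are hypothetical, not taken from the video or any library:

```python
import numpy as np

class Dense:
    """Fully connected layer with hand-written gradients (no autograd)."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros((n_out, 1))

    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return self.W @ x + self.b

    def backward(self, grad_out):
        # grad_out is dLoss/dOutput; the chain rule gives the rest
        self.dW = grad_out @ self.x.T
        self.db = grad_out.sum(axis=1, keepdims=True)
        return self.W.T @ grad_out      # dLoss/dInput, passed to the previous layer

rng = np.random.default_rng(0)
layer = Dense(4, 3, rng)
x = rng.standard_normal((4, 1))
target = rng.standard_normal((3, 1))

def loss():
    return float(np.sum((layer.forward(x) - target) ** 2))

out = layer.forward(x)
layer.backward(2.0 * (out - target))    # gradient of the squared-error loss

# Finite-difference check on a single weight
eps = 1e-6
layer.W[0, 0] += eps
loss_plus = loss()
layer.W[0, 0] -= 2 * eps
loss_minus = loss()
layer.W[0, 0] += eps                    # restore the weight
numeric = (loss_plus - loss_minus) / (2 * eps)
print(layer.dW[0, 0], numeric)
```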
I'm trying to parse through an html table using pandas but i keep having a problem with values coming out as NaN when td values are there.
This is what part of the html table looks like.
My table ends up looking like this:
The problem is that Role keeps coming out as NaN when i have things like "Bot Laner" still there.
for my code I think these are the relevant parts.
soup=BeautifulSoup(req.text, 'lxml')
my_table = soup.find('table', {'class':'wikitable'})
pd.read_html(str(my_table))
Any help would be really appreciated thanks!
true tho its always been interesting to me how tf and pytorch actually compute all their gradients
I know tf uses a graph execution, and that helps them deal with the gradients for non-standard functions, but would be interesting to see if that could be reimplemented. @desert oar
hey all, i need help with something
in a specific dataset, there is data like ''Apr 3, 1998 to Apr 24, 1999'' how do i extract Apr 3 1998 and put it in one column and put Apr 24 1999 in another column
i am using python pandas
You could split the string on " to "
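A sketch of that split in pandas, assuming the column is called `Aired` and every row has the form "start to end" (both are assumptions; rows with a missing end date would need extra handling):

```python
import pandas as pd

# Made-up rows in the format from the question
df = pd.DataFrame({"Aired": ["Apr 3, 1998 to Apr 24, 1999",
                             "Oct 20, 1999 to Mar 1, 2000"]})

# Split once on " to " and expand the two halves into separate columns
parts = df["Aired"].str.split(" to ", n=1, expand=True)
df["start"] = pd.to_datetime(parts[0], format="%b %d, %Y")
df["end"] = pd.to_datetime(parts[1], format="%b %d, %Y")
print(df)
```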
Hey guys need some advice
I found this question on stack overflow to find the line that touches the most rectangles
I have no clue at the moment but we need to find the line which touches the most rectangles, and it does not have to go through the corners
any clue?
Need advice for a good Data Science book. Any ideas?
take your pick
hey, has anyone tried this course?
https://www.coursera.org/learn/machine-learning
what are the prerequisites for this course
Hi guys. I have a pandas question.
How do i compare the first -1 to the next row down number?
The conditions are: the first number needs to be -1, and if the next row down number is +1 then the count of True goes up by one.
Then the comparison starts on the second number to the third number with the same conditions
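One way to read those conditions is "count the rows that are -1 and are immediately followed by +1". A sketch with `shift`, on a made-up Series since the actual data is not shown:

```python
import pandas as pd

# Hypothetical column of -1 / +1 values
s = pd.Series([-1, 1, -1, -1, 1, 1, -1])

# shift(-1) lines each row up with the value one row below it
hits = (s == -1) & (s.shift(-1) == 1)
count = int(hits.sum())

print(hits.tolist())
print(count)   # 2
```

The last row compares against NaN, so it never counts as a hit.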
Does anyone on here use Python for chemistry related work? I’m trying to find other packages like Cantera https://cantera.org/ for Python.
are there any astronomy related projects that i can work on?
```
anime_new['Aired'].drop(index=anime_new['Aired'][filt].index)
anime_new.dropna(subset = ['Aired'])
```
if i print out anime_new after the last line, it includes the NA
does anyone know why?
Oh
@simple shadow because by default in pandas, drop or dropna returns copies. And you never reassigned. So the original variable stays the same. Either use inplace=True or reassign yourself.
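A small demonstration of the difference, on a made-up frame that reuses the `anime_new` / `Aired` names from the snippet above:

```python
import numpy as np
import pandas as pd

anime_new = pd.DataFrame({
    "Name": ["A", "B", "C"],                          # hypothetical rows
    "Aired": ["Apr 3, 1998", np.nan, "Jan 1, 2001"],
})

# Returns a new DataFrame; the result is discarded, so anime_new is unchanged
anime_new.dropna(subset=["Aired"])
n_before = len(anime_new)

# Reassigning keeps the filtered copy
anime_new = anime_new.dropna(subset=["Aired"])
n_after = len(anime_new)

print(n_before, n_after)   # 3 2
```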
thank you!! @ripe forge
just started learning data science, does anyone have a good server or place to ask for ML specific python questions?
lmao
i mean this is a discussion channel so i wasn't sure haha
but if anyone knows about the scikit confusion matrix, i'm doing logistic regression on a cancer data set where the target var is either M or B for malignant or benign
and this is the matrix
but i don't know how to label it with the prediction axis and the actual data axis
so idk which way round it is
but i do know which one is M and which one is B using the labels parameter
there's the code
not too familiar with scikit but I do know the actual output has to be a number
so instead of M/B you could have 1,0
yeah, the only thing the labels thing does is switch the order from being B then M or M then B
so if i switch them the matrix goes the other way round if that makes sense
but idk which axis is which
wait what are u trying to do
print the confusion matrix along with the actual labels?
yeah so i printed the matrix but i don't know which line is the predicted data and the actual data
it doesn't really matter which is which
They're interchangeable. Both axes have the same labels.
i thought one axis was predicted M and predicted B then the other was actual M and actual B
for the diagonal it doesn't make a difference but for off diagonal elements it matters
ah well
i meant that as long as the confusion matrix generator function said which axis was which, it didn't really matter.
If it doesn't state it, the general convention is that the column axis (left -> right) is the predicted values and the row axis (top -> bottom) is the actual values.
ahh thanks, yeah i couldn't find a default on the scikit docs
maybe i just missed it but i was wondering if there was some convention
cuz i wanted to work out the precision and recall manually, instead of using scikit funcs
ah gotcha. Yeah weird of them to not state it.
Yeah that's the general convention. Unlikely for scikit to use a different axis system
ah i see. Do you know the math behind the two functions?
precision and recall?
yeah
well for binary classification its really simple. If you're looking to do multilabel as well, its a little more complex.
ah yeah i mean i've just started so i'm starting with binary classification
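For the binary case, a sketch of building the matrix and both metrics by hand. The layout used here (rows = actual, columns = predicted) matches how scikit-learn's `confusion_matrix` documents its output, where entry (i, j) counts samples with true label i predicted as label j. The labels and predictions below are made up, and M is treated as the positive class:

```python
import numpy as np

labels = ["B", "M"]               # row/column order, like the labels parameter
y_true = list("MMMBBBBMBB")       # hypothetical actual diagnoses
y_pred = list("MBBBBBMMBB")       # hypothetical model predictions

# cm[i, j] = number of samples with actual label i and predicted label j
cm = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    cm[labels.index(actual), labels.index(predicted)] += 1

tn, fp, fn, tp = cm.ravel()       # valid because the order is [B, M] and M is positive
precision = tp / (tp + fp)        # of everything predicted M, how much was really M
recall = tp / (tp + fn)           # of everything really M, how much was predicted M

print(cm)
print(precision, recall)
```

Swapping the order in `labels` flips both the rows and the columns, which is why the off-diagonal entries are the ones to watch.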