visual violet Jul 31, 2020, 12:11 AM

#

you work with money

#

you know how to fix?

drifting umbra Jul 31, 2020, 12:11 AM

#

what are your data types

#

if pandas
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html

#

if numpy array

#

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html

visual violet Jul 31, 2020, 12:13 AM

#

https://stackoverflow.com/questions/38516481/trying-to-remove-commas-and-dollars-signs-with-pandas-in-python

#

i found it
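
#

A minimal sketch of the kind of fix discussed in the linked question, assuming the money values are stored as strings in a column (the column name `price` and the sample values are made up for illustration): strip the `$` and `,` characters, then convert with `astype`.

```python
import pandas as pd

# Hypothetical sample data: money stored as text.
df = pd.DataFrame({"price": ["$1,234.50", "$99", "$12,000"]})

# Remove dollar signs and thousands separators, then cast to float.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

print(df["price"].tolist())
```

The result has to be assigned back to the column; `str.replace` returns a new Series and does not change the frame in place.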

velvet thorn Jul 31, 2020, 12:19 AM

#

Hey all, I have this dataframe and need to do some subtraction. Every fourth row should be subtracted from the previous three rows. For example: Row 3's values should be subtracted from rows 0,1,2 and then row 7's values should be subtracted from rows 4,5,6 and so forth. How can I accomplish this via something like df.diff()?
@marsh berry groupby and transform
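
#

A rough sketch of the "groupby and transform" idea, assuming the frame has a default 0..n-1 row order and its length is a multiple of four (the columns `a` and `b` are invented): label each block of four rows, broadcast the last row of every block with `transform("last")`, and subtract.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 3 and 7 are the ones to subtract.
df = pd.DataFrame({
    "a": [10, 20, 30, 1, 40, 50, 60, 2],
    "b": [5, 6, 7, 1, 8, 9, 10, 2],
})

position = np.arange(len(df))
block = position // 4                 # 0,0,0,0,1,1,1,1
is_fourth = (position % 4) == 3       # True for rows 3, 7, ...

# Repeat each block's last row across the whole block, then subtract.
last_in_block = df.groupby(block).transform("last")
result = df - last_in_block

# Keep the fourth rows themselves unchanged instead of zeroing them out.
result.loc[is_fourth] = df.loc[is_fourth]

print(result)
```

If the frame's length is not a multiple of four, the final block's "last" row is not a true fourth row, so that case would need separate handling.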

desert parcel Jul 31, 2020, 12:55 AM

#

```py
input = np.array([
                  [[313, 1], #HCL
                   [323, 1],
                   [333, 1],
                   [343, 1]], 
                  [[313, 10e-3], #Ortho
                   [323, 10e-3],
                   [333, 10e-3],
                   [343, 10e-3]], 
                  [[313, 10e-3], #Para
                   [323, 10e-3],
                   [333, 10e-3],
                   [343, 10e-3]]
                  ], dtype='float32')

target = np.array([[[14.76, 16.42, 18.08, 23.41]],
                    [[5.87, 11.14, 13.20, 25.72]],
                    [[2.73, 4.42, 8.04, 13.68]]], dtype='float32')

loss_fn = F.mse_loss
loss = loss_fn(model(input), target)
```

Output:

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: UserWarning: Using a target size (torch.Size([3, 1, 4])) that is different to the input size (torch.Size([3, 4, 4])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
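
The warning is about broadcasting. The sketch below uses plain NumPy arrays as stand-ins for the model output and the target (the zeros and ones are placeholders, not the real values) to show how a `(3, 1, 4)` target is silently stretched against a `(3, 4, 4)` prediction. Which shape is actually intended cannot be told from the snippet above, so no particular reshape is suggested here.

```python
import numpy as np

# Stand-ins with the shapes from the warning message.
pred = np.zeros((3, 4, 4), dtype="float32")    # shape of model(input)
target = np.ones((3, 1, 4), dtype="float32")   # shape of target

# The size-1 axis of target is repeated 4 times, so every one of the
# 4 rows in each prediction is compared against the same target row.
diff = pred - target
print(diff.shape)

# A shape check before computing the loss makes the mismatch explicit.
shapes_match = pred.shape == target.shape
print(shapes_match)
```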

#

📎 unknown.png

#

Actually I'll move this to help channel sorry

grave frost Jul 31, 2020, 7:38 AM

#

Is it a good place to ask Machine Learning Questions here??

#

Hey guys, newbie here. Just wanted some pointers on how to proceed with one of my projects. I am pretty experienced in many parts of ML, but there is one where I am finding almost no help/advice.
I basically want to build a model (PyTorch or TensorFlow/Keras) that takes a number and a string as features and tries to find some highly complex relation between the 2 fields. I have already made a csv file where each row has a number and then a string, separated by a comma. The model will then predict the number based on the string.

So I wanted some pointers on how the model should be built: what layers to use, recommended hidden layers for pretty complex relations, and whether you are aware of some pre-trained NN that can accomplish this task. I tried Googling it but didn't come up with many examples that do that. I ruled out LSTM and GRU because they do not find very complex relations in the data, but I would experiment with them if the experts think so.
So does anybody have an idea how to accomplish this??

thin terrace Jul 31, 2020, 8:56 AM

#

Is imbalance on a feature level also bad, like class imbalance? I mean a feature that has the same value for all instances is useless as it gives no unique information and thus should be dropped. But what about a feature that is 95% value A and 5% value B? Should it also be dropped? How do we approach this? Is there a term for this like "class imbalance"? Any resources for reading more on the topic?

fluid burrow Jul 31, 2020, 9:06 AM

#

can someone help me, i am not able to create a dataframe using pandas

#

i can't seem to figure out the problem

📎 unknown.png

thin terrace Jul 31, 2020, 9:17 AM

#

ValueError: arrays must all be same length

frank bone Jul 31, 2020, 9:17 AM

#

you got 1x age too much
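
#

A small sketch of how that error can be tracked down, assuming the frame is built from a dict of lists (the names and values here are invented, since the screenshot is not available): print the length of each column before calling `pd.DataFrame`.

```python
import pandas as pd

# Hypothetical input with one age too many.
columns = {
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 45, 27, 52],
}

# Compare lengths first; the odd one out is the column to fix.
lengths = {key: len(values) for key, values in columns.items()}
print(lengths)

try:
    pd.DataFrame(columns)
    built = True
except ValueError as err:
    built = False
    print("ValueError:", err)
```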

fluid burrow Jul 31, 2020, 9:30 AM

#

ahh , thanks found it.

grave frost Jul 31, 2020, 9:37 AM

#

@thin terrace Look up down-sampling and up-sampling (Data Augmentation) if unbalanced classes are what you mean....

thin terrace Jul 31, 2020, 9:56 AM

#

It's not

desert oar Jul 31, 2020, 10:48 AM

#

@thin terrace good question. If there isn't already a topic for that on stats.stackexchange.com you might get good answers if you ask it there

fiery frost Jul 31, 2020, 11:20 AM

#

Someone here that know ml, and can let me dm him?

#

for some short q.

velvet thorn Jul 31, 2020, 11:34 AM

#

@thin terrace feature variance

#

"should it be dropped" is not a simple question to answer

spark cape Jul 31, 2020, 11:35 AM

#

If I have two data frames with the same date index and I want to join on some fields in lhs... lhs.join(rhs, on=['field1', 'field2', 'field3']) doesnt work. So it looks like it needs to joined on the index. but if I use groupby, it doesn't work because you can't join on groupby objects. 😕
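
#

`lhs.join(rhs, on=[...])` matches columns of `lhs` against the index of `rhs`, which is probably why it does not behave as expected here. One possible workaround, sketched under the assumption that both frames share a date index named `date` and the three field columns (the values and the `left_val`/`right_val` columns are made up): move the index into a column and use `merge`.

```python
import pandas as pd

dates = pd.Index(pd.to_datetime(["2020-07-30", "2020-07-31"]), name="date")

# Hypothetical frames sharing a date index and three key columns.
lhs = pd.DataFrame(
    {"field1": ["a", "b"], "field2": [1, 2], "field3": ["x", "y"],
     "left_val": [10.0, 20.0]},
    index=dates,
)
rhs = pd.DataFrame(
    {"field1": ["a", "b"], "field2": [1, 2], "field3": ["x", "y"],
     "right_val": [0.1, 0.2]},
    index=dates,
)

# Turn the index into a normal column so it can be part of the merge key.
merged = (
    lhs.reset_index()
    .merge(rhs.reset_index(), on=["date", "field1", "field2", "field3"], how="left")
    .set_index("date")
)
print(merged)
```

A groupby object has to be aggregated first (for example with `.sum()`) so that it becomes a regular DataFrame before it can be merged.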

velvet thorn Jul 31, 2020, 11:35 AM

#

consider this: when you train a model, you're basically deriving a relationship between features (input) and target (output), keeping in mind certain assumptions

#

simplest example: linear regression; the assumption is that the target is a linear function of the features

#

so, going back to your question...

#

how much predictive power does the feature add to your model?

#

it must be clear that even a column of random values may add some predictive power, simply because of the distribution of your target (so, basically, overfitting).

#

say it's a classification problem and even though the feature's value ratio is 95:5, it is so strongly predictive that all of the samples with the minority value for that feature are of the same class.

#

if the training sample roughly reflects real-world data, you basically get confirmed predictions for 5% of incoming samples for free.

#

of course, it's not clear that that is actually the case, and if it's not, then you have overfitting. that is one of the reasons you would consider dropping a feature with low variance.
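
#

A quick way to look at both sides of that argument, sketched on invented data (a 95:5 feature where every minority row happens to belong to class 1): check how lopsided the feature is, then cross-tabulate it against the target.

```python
import pandas as pd

# Hypothetical data: 95 rows with value A, 5 rows with value B.
df = pd.DataFrame({
    "feature": ["A"] * 95 + ["B"] * 5,
    "target": [0] * 50 + [1] * 45 + [1] * 5,
})

# Share of each value: how imbalanced is the feature?
share = df["feature"].value_counts(normalize=True)
print(share)

# Feature value vs. class: does the minority value separate the classes?
table = pd.crosstab(df["feature"], df["target"])
print(table)
```

With only five minority rows, a clean split like this can easily be chance, which is the overfitting concern raised above; checking it on held-out data is one way to tell.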

spark cape Jul 31, 2020, 11:46 AM

#

@fiery frost just ask the channel

fiery frost Jul 31, 2020, 11:56 AM

#

I am trying to build network to do the following action.
clean speech bubbles in manga (Japanese comics)
i have ton of before and after images.
But instead of just deleting, i want the network to recognize the speech bubbles and their edges.

#

Here is before and after:

#

Before:

#

📎 unknown.png

#

After:

#

📎 unknown.png

#

Video demonstrating how it should work.

#

📎 MHkBxWq.mp4

#

Thx for everyone who can help.
please take into consideration i am noobie to this.

#

(I tried to do this by detecting edges, but it is not perfect and requires me to fix it by hand every time.)

molten ravine Jul 31, 2020, 12:02 PM

#

Hi everyone,

I come to you because after a lot of research, I still cannot fix my problem.
So, I would like to read a JSON file and put it in a dictionary. I know the json library with json.load() works.
BUT, above a certain size (500k Ko), json.load() stops working and raises a MemoryError.
I've found the ijson library with its ijson.parse() method, but the output is not a dictionary and I absolutely need the output as a dict.

Do you have any ideas to fix this?

Here there is my code :

```py
import json
def getJsonFile(path):
    #========================================================
    #GET THE JSON FILE
    #========================================================
    data = {}
    try:
        with open(path,'r',encoding='UTF-8') as jFile:
            data = json.load(jFile)
    except Exception as e:
        print(type(e))
    finally:
        return data
```

Hope someone can help me!

Have a nice day !

fiery frost Jul 31, 2020, 12:04 PM

#

@spark cape Can you help me?

molten ravine Jul 31, 2020, 12:05 PM

#

I don't know if it was the good channel to ask my question, i've just joined this discord, sorry for that

fiery frost Jul 31, 2020, 12:06 PM

#

@molten ravine it's problem with your PC, on mine I loaded 5mg of json.

uncut shadow Jul 31, 2020, 12:08 PM

#

@molten ravine What error do you get? I mean, you shouldn't actually use try except there, remove it and then try to run it. If you get any errors then paste them here so we know what might be wrong

fiery frost Jul 31, 2020, 12:10 PM

#

I am trying to build network to do the following action.
clean speech bubbles in manga (Japanese comics)
i have ton of before and after images.
But instead of just deleting, i want the network to recognize the speech bubbles and their edges.
Can someone guide me pls? 🙏

molten ravine Jul 31, 2020, 12:11 PM

#

i have a `MemoryError`

wanton kiln Jul 31, 2020, 12:11 PM

#

@molten ravine Do you see that: https://stackoverflow.com/questions/40399933/memoryerror-when-loading-a-json-file ?

glass gorge Jul 31, 2020, 12:11 PM

#

hello! is it possible for features in a random forest regressor to be a mix of numericals and pd.get_dummies(categoricals)? sorry for the noob question >.<

molten ravine Jul 31, 2020, 12:16 PM

#

@wanton kiln i've already seen this...
As I said, ijson is not a solution for me because I do not know how to convert its output into a dict...
So if someone could show me whether it's possible to convert my JSON object to a dict using ijson, that would be nice. But right now I do not know how to do this

spark cape Jul 31, 2020, 12:17 PM

#

I don't know what 500k Ko is, but if you mean 500kb that's trivially small and the issue is on your end there.

molten ravine Jul 31, 2020, 12:18 PM

#

I mean 500 000 Ko

spark cape Jul 31, 2020, 12:18 PM

#

what is ko

molten ravine Jul 31, 2020, 12:20 PM

#

sorry it's the french writing, i mean KB so i think

#

so yeah, it's a trivially small but i don't understand why i'm getting a memory error

fiery frost Jul 31, 2020, 12:21 PM

#

I am trying to build network to do the following action.
clean speech bubbles in manga (Japanese comics)
i have ton of before and after images.
But instead of just deleting, i want the network to recognize the speech bubbles and their edges.
Help someone? 🙏

wanton kiln Jul 31, 2020, 12:22 PM

#

@molten ravine try these examples: https://www.geeksforgeeks.org/convert-json-to-dictionary-in-python/

fiery frost Jul 31, 2020, 12:22 PM

#

?

molten ravine Jul 31, 2020, 12:23 PM

#

@wanton kiln what you've sent it's completely my code right now


```py
import json
def getJsonFile(path):
    #========================================================
    #GET THE JSON FILE
    #========================================================
    data = {}
    try:
        with open(path,'r',encoding='UTF-8') as jFile:
            data = json.load(jFile)
    except Exception as e:
        print(type(e))
    finally:
        return data
```

fiery frost Jul 31, 2020, 12:24 PM

#

Pls? 🙏 🙏

wanton kiln Jul 31, 2020, 12:26 PM

#

@fiery frost You need to build a deep learning solution using Keras and TensorFlow.

fiery frost Jul 31, 2020, 12:27 PM

#

Can you guide me more pls?
I am very new to this. @wanton kiln

wanton kiln Jul 31, 2020, 12:32 PM

#

@molten ravine I think you need to index into the JSON data like in the example:

```py
# for reading nested data, [0] represents
# the index value of the list
print(data['people1'][0])

# for printing the key-value pairs of a
# nested dictionary, a for loop can be used
print("\nPrinting nested dictionary as a key-value pair\n")
for i in data['people1']:
    print("Name:", i['name'])
    print("Website:", i['website'])
    print("From:", i['from'])
    print()
```

#

@fiery frost ohhh... it's somewhat difficult to address this issue here in chat.

let me think about how I can help you, ok.

molten ravine Jul 31, 2020, 12:37 PM

#

I cannot even index into it if I'm not able to load the JSON file.
The problem is that json.load() gives me a MemoryError. In the example, the data has already been loaded

fiery frost Jul 31, 2020, 12:39 PM

#

@wanton kiln thx!!!

spark cape Jul 31, 2020, 12:48 PM

#

@molten ravine can you use jsonString = jFile.read() ? if not then it's the reading of the data. and if it works then ijson gives a generator that you can pull data through like [x for x in ijson.items(f, prefix='item')]; then you have a list of dicts

#

(json data is often a top level list of json object entries)
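
#

A minimal sketch of the ijson suggestion, assuming the file is a top-level JSON array of objects and that the third-party `ijson` package is installed. The `"item"` prefix is what selects each element of a top-level array. The helper name `iter_records`, the sample data, and the fallback to `json.load` when ijson is missing are choices made for this example only.

```python
import io
import json

try:
    import ijson  # third-party streaming parser: pip install ijson
except ImportError:
    ijson = None


def iter_records(stream):
    """Yield the elements of a top-level JSON array one dict at a time."""
    if ijson is None:
        # Fallback for this sketch only: loads everything at once.
        yield from json.load(stream)
    else:
        # "item" matches each element of the top-level array.
        yield from ijson.items(stream, "item")


# Small in-memory stand-in for a large file opened in binary mode.
sample = io.BytesIO(b'[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')

# Keep only what is needed from each record instead of the whole document.
names_by_id = {record["id"]: record["name"] for record in iter_records(sample)}
print(names_by_id)
```

For a real file the stream would come from `open(path, "rb")`. Collecting every record into one dict still needs enough memory for the final result, so this helps most when only part of each record has to be kept.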

woeful tusk Jul 31, 2020, 12:50 PM

#

I have an excel file with a list of medical exams made per place. I want to make a code to go through it and sum the amount made of exams per place and put it in a dataframe to later write as excel.

spark cape Jul 31, 2020, 12:51 PM

#

exams.groupby('location').agg({'examsTaken': 'sum'}).reset_index(['location'])

woeful tusk Jul 31, 2020, 12:51 PM

#

this would return only the amount?

#

i was going for a for loop, but it would, let's say, pick the first line: hemogram in place A. It would go through the whole sheet and sum the amount done. Then if it picks hemogram in place B, it would repeat again

spark cape Jul 31, 2020, 12:53 PM

#

the pandas way to do that is groupby(...).apply(your function that works on the group you are interested in).reset_index(.../*put the index back how it was */)

#

apply is for your own code; but there are built-ins like sum, etc., so you can also do...
exams.groupby('location').examsTaken.sum()

woeful tusk Jul 31, 2020, 12:56 PM

#

My excel columns are code, exam, amount asked, amount done, value. Is it better to just concat code and exam? Since each code corresponds to exactly one exam

#

Ive deleted the place columns since i dont need it

spark cape Jul 31, 2020, 12:58 PM

#

it's up to you, but you can pass a list to groupby and use groupby(['location', 'code']) to make two layers of grouping if that helps
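
#

A small sketch of both groupings on invented data. The column names `location`, `code` and `examsTaken` follow the snippets above; the real spreadsheet headers may differ.

```python
import pandas as pd

# Hypothetical rows standing in for the Excel sheet.
exams = pd.DataFrame({
    "location": ["A", "A", "B", "B", "A"],
    "code": [1, 1, 1, 2, 2],
    "examsTaken": [3, 2, 5, 1, 4],
})

# One total per location.
per_location = exams.groupby("location", as_index=False)["examsTaken"].sum()

# One total per (location, code) pair: two layers of grouping.
per_location_code = exams.groupby(["location", "code"], as_index=False)["examsTaken"].sum()

print(per_location)
print(per_location_code)

# Writing out would look like this (needs an Excel writer such as openpyxl):
# per_location_code.to_excel("summary.xlsx", index=False)
```

`as_index=False` keeps the grouping keys as ordinary columns, so no `reset_index` is needed before writing the result out.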

woeful tusk Jul 31, 2020, 12:59 PM

#

Ok

#

Ill try

#

Do I need to equate it to a variable? Like, result=groupby...

spark cape Jul 31, 2020, 12:59 PM

#

let me know how you get on so if it's wrong i improve too. 🙂

woeful tusk Jul 31, 2020, 12:59 PM

#

To write as excel later

fiery frost Jul 31, 2020, 12:59 PM

#

Can someone pls guide me how to implement this?

spark cape Jul 31, 2020, 12:59 PM

#

yeah you'll want that @woeful tusk

#

but first dump the results in e.g. jupyter notebook to make sure it's correct. itll be quicker than saving file, loading file in excel. etc

#

@fiery frost your question is like a whole project. it's not really appropriate for this channel. you have a few subproblems: how do you find the speech bubble holes? Once you find them, how do you find the text? Once you do that do you detect language? Once you do that, can you OCR it? Once you OCR'd it, translate it using an online service, I guess. then write the text into the field with appropriate layout (based on the speech bubble dimensions)

fiery frost Jul 31, 2020, 1:02 PM

#

@spark cape it's not that complicated.

#

i dont need to ocr them.

#

just recognize them.

spark cape Jul 31, 2020, 1:02 PM

#

opencv is an image manipulation library for detecting things. to find the speech bubbles

#

the R in OCR is recognize

fiery frost Jul 31, 2020, 1:02 PM

#

i already did it.

#

it's not perfect.

#

there are a lot of loopholes.

#

so i thought moving to neural network.

grave frost Jul 31, 2020, 3:15 PM

#

Hey guys, newbie here. Just wanted some pointers on how to proceed with one of my projects. I am pretty experienced in many parts of ML, but there is one where I am finding almost no help/advice.
I basically want to build a model (PyTorch or TensorFlow/Keras) that takes a number and a string as features and tries to find some highly complex relation between the 2 fields. I have already made a csv file where each row has a number and then a string, separated by a comma. The model will then predict the number based on the string.

So I wanted some pointers on how the model should be built: what layers to use, recommended hidden layers for pretty complex relations, and whether you are aware of some pre-trained NN that can accomplish this task. I tried Googling it but didn't come up with many examples that do that. I ruled out LSTM and GRU because they do not find very complex relations in the data, but I would experiment with them if the experts think so.
So does anybody have an idea how to accomplish this??

untold aspen Jul 31, 2020, 3:18 PM

#

for complex relationship between words look up embedding layers and Word2Vec

#

Word2Vec transforms words into vectors, which lets the neural net capture how close or far apart words are in meaning (like cat & dog, king & queen, etc...)

#

Embedding layer tries to capture this idea as well, but embedding layer is more of a general name for mapping features into high dimensional spaces to find relationship (e.g clusters) between data

acoustic halo Jul 31, 2020, 3:29 PM

#

@grave frost In addition to the above, I would look at BERT, it tends to perform better over most other word embedding techniques and is fine-tuned to the specific task at hand rather than just relying on the embeddings of stuff like word2vec and ELMo

grave frost Jul 31, 2020, 5:03 PM

#

Thanks guys for responding. But would it be a good idea to use embeddings when the relationship is not word-to-word, but actually integer-to-word? Would the NN be able to figure out the relationship?

#

Also @acoustic halo I have a ton of data (10 million rows with 2 data points each) would it be a good idea to retrain BERT from scratch on that data or should I do something else?

acoustic halo Jul 31, 2020, 5:13 PM

#

Can you give an example of one?

#

And Bert can give a classification vector for each string you use

grave frost Jul 31, 2020, 5:16 PM

#

Mainly for simple code breaking purposes. Like 1 --> Ad$r. So basically be able to break complex ciphers and predict the number from the data, Input:- Ad$r ; Output:- 1.....

acoustic halo Jul 31, 2020, 5:17 PM

#

Don't use Bert then

grave frost Jul 31, 2020, 5:17 PM

#

So it isn't a classification task, but rather a prediction or "find the relation between data" task

#

Ohk

acoustic halo Jul 31, 2020, 5:18 PM

#

Bert is for natural language, you might be better using something character level, not sure what though

grave frost Jul 31, 2020, 5:19 PM

#

Me neither. I have been hunting most ML platforms, but looks like I have to go to SE now....

flat quest Jul 31, 2020, 5:31 PM

#

embeddings will work with numbers. You're just projecting the data into a higher dimensional space @grave frost.

character level is probably better for this, but you can still use the BERT architecture. You just encode the characters instead of subwords.
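
A minimal sketch of what "encode the characters instead of subwords" could look like before the data reaches an embedding layer. The reserved ids (0 for padding, 1 for unknown characters), the helper names, and the sample strings are assumptions for this example, not part of any BERT tooling.

```python
PAD_ID = 0   # reserved for padding
UNK_ID = 1   # reserved for characters not seen when building the vocabulary


def build_vocab(strings):
    """Map every distinct character to an integer id, starting at 2."""
    chars = sorted({ch for s in strings for ch in s})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Turn a string into a fixed-length list of character ids."""
    ids = [vocab.get(ch, UNK_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical cipher strings like the "Ad$r" example above.
vocab = build_vocab(["Ad$r", "Zq#1"])
encoded = encode("Ad$r", vocab, max_len=6)
print(vocab)
print(encoded)
```

These integer sequences are what an embedding layer would take as input, with the number paired to each string as the training target.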

grave frost Jul 31, 2020, 5:33 PM

#

Hmm.... that's right but if the relation is very complex, would BERT still be able to find it out or should I boost up the HIdden Layers somehow?

flat quest Jul 31, 2020, 5:35 PM

#

BERT has a large number of hidden layers. I'd try using the architecture as a base, and then see how it performs before trying to boost up the number of hidden layers

desert oar Jul 31, 2020, 5:37 PM

#

isnt bert huge and therefore very difficult to train from scratch?

#

id be skeptical that fine-tuning an existing bert model pre-trained for human language would be effective in a more abstract problem, but maybe there are sources contradicting my uneducated intuition

#

but yes in general if you can use an existing proven architecture you definitely should

flat quest Jul 31, 2020, 5:41 PM

#

fine-tuning i'm not sure if it would work, unless there's some parallels with his particular ciphers and natural language.

Training from scratch would take a long time yes, but the smaller BERT architectures can help minimize that.

grave frost Jul 31, 2020, 5:45 PM

#

Well, I will start at 'Medium' to provide me a jump-off point....

tardy portal Jul 31, 2020, 7:13 PM

#

Hello everyone, I need some assistance and reassurance that I executed a histogram chart correctly
age_distribution = sns.distplot(df_ea['Age'], bins = 20)
in which case it outputted this:

📎 unknown.png

#

To clarify, this histogram chart states that majority of the employees within the dataset are between the ages of 30 and 40

lapis sequoia Jul 31, 2020, 7:14 PM

#

Hello! I am a new member of this channel and just wanted to greet everyone of you. Hello!

rare ice Jul 31, 2020, 7:21 PM

#

How can I add Python Type Hints for the columns of a spark dataframe? For example, suppose I have a dataframe called data with the following schema:

root
 |-- body: string (nullable = true)
 |-- partition: string (nullable = true)
 |-- enqueuedTime: timestamp (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- systemProperties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- event_type: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)

desert oar Jul 31, 2020, 7:36 PM

#

@tardy portal looks fine

tardy portal Jul 31, 2020, 7:37 PM

#

I updated it and made it more pretty

desert oar Jul 31, 2020, 7:37 PM

#

@rare ice spark dataframe columns aren't really strings though - they are placeholders representing strings

#

so im not sure how those type hints would even work

#

do you mean that you want to write a function e.g. process_data that only accepts spark dataframes with a certain schema?

#

if so, i don't think there's a good way to do this

#

i've wanted something like this for pandas as well

#

i haven't found a good answer

rare ice Jul 31, 2020, 7:38 PM

#

@desert oar I was more hoping to get my editor to recognize the schema and be able to autocomplete columns.

desert oar Jul 31, 2020, 7:38 PM

#

ah

#

same problem i suppose

rare ice Jul 31, 2020, 7:39 PM

#

data.ye + 'tab' would show 'year'

desert oar Jul 31, 2020, 7:39 PM

#

yeah, i see

#

that would definitely be nice

#

you could probably make this work with generic types

#

what happens if you add a column to the dataframe? i guess that means its type has changed, right?

#

in general this is the problem of: how to annotate an object that has runtime-generated attributes or keys, but is expected to be mostly static once it's been generated

#

it's also a bit messy because pyspark dataframes also support access with __getitem__ as in data['year']

lapis sequoia Jul 31, 2020, 7:47 PM

#

Can anyone recommend a channel about AI / ML ?

acoustic halo Jul 31, 2020, 8:01 PM

#

@lapis sequoia This one

tardy portal Jul 31, 2020, 8:51 PM

#

Okay so I need assistance with creating a pairplot for my dataset, specifically for the numerical variables. I'm trying to utilize this link: https://seaborn.pydata.org/generated/seaborn.pairplot.html, however, how am I able to specifically target the numerical variables within the dataset?

tardy portal Jul 31, 2020, 9:32 PM

#


df_ea.fillna(value = 0, inplace = True)

num_scatt_plot_matrix = sns.pairplot(df_ea.iloc[:, :17])

#

could someone explain what sns.pairplot(df_ea.iloc[:, :17])

#

that means?

desert oar Jul 31, 2020, 9:42 PM

#

df_ea is your data frame

#

.iloc[:, :17] is "all rows, and the first 17 columns"

tardy portal Jul 31, 2020, 9:57 PM

#

alright so since I want to run this pairplot for only the numerical variables I entered this:
```py
sns.set(style = 'whitegrid')

df_ea.fillna(value = 0, inplace = True)

num_scatt_plot_matrix = sns.pairplot(df_ea, hue = 'Attrition')
```

#

It generated a lot of information, however I came across this error: RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density.

lapis sequoia Jul 31, 2020, 10:16 PM

#

hey if someone here is familliar with mysql could you look in databases

tardy portal Jul 31, 2020, 10:16 PM

#

I managed to figure it out by entering the column names separately, but I wanted to know if there's an easier method I could run this pairplot based on the columns that hold numerical values.
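
#

One possible shortcut, sketched on invented data (only `Age` and `Attrition` appear above; the other columns are made up): let pandas pick the numeric columns with `select_dtypes` and pass them to `pairplot` through its `vars` argument. Dropping constant columns is an extra guess at what triggers the KDE bandwidth error, since a column with no spread has nothing to estimate a density from.

```python
import pandas as pd

# Hypothetical slice of the employee dataset.
df_ea = pd.DataFrame({
    "Age": [34, 41, 29, 52],
    "MonthlyIncome": [4200, 6100, 3900, 8800],
    "StandardHours": [80, 80, 80, 80],          # constant column
    "Department": ["Sales", "R&D", "R&D", "HR"],
    "Attrition": ["Yes", "No", "No", "Yes"],
})

# Columns with a numeric dtype, in their original order.
numeric_cols = df_ea.select_dtypes(include="number").columns.tolist()

# Optionally skip columns that never vary.
varying_cols = [col for col in numeric_cols if df_ea[col].nunique() > 1]

print(numeric_cols)
print(varying_cols)

# Then, with seaborn imported as sns:
# sns.pairplot(df_ea, vars=varying_cols, hue="Attrition")
```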

desert oar Jul 31, 2020, 10:36 PM

#

@plucky crow in code like this

timediff = df1.groupby('masked_id')['date'].apply(lambda y: y - y.iloc[0])

the timediff series will typically end up with an extra index level, corresponding to the grouping variable. so if you want to assign it back to df1 as a column, you need to remove that index level. in this case i suggested .reset_index(level='date'). if it doesn't work by name, just say level=0, since the extra index level will be added as the 0th level
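
An alternative that sidesteps the extra index level, sketched on invented data with the same column names: `transform("first")` returns a Series aligned with the original rows, so the difference can be assigned straight back as a column. It assumes the rows are already sorted by date within each `masked_id`.

```python
import pandas as pd

# Hypothetical data, sorted by date within each id.
df1 = pd.DataFrame({
    "masked_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2020-07-01", "2020-07-03",
        "2020-07-10", "2020-07-11", "2020-07-15",
    ]),
})

# First date of each group, repeated for every row of that group.
first_date = df1.groupby("masked_id")["date"].transform("first")

# Same index as df1, so no reset_index is needed.
df1["timediff"] = df1["date"] - first_date
print(df1)
```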

carmine iron Jul 31, 2020, 11:47 PM

#

is there a way i can add some type of if clause in my df column list comprehension?
```python
group['growth'] = [x[-1] / x[0] - 1 for x in group['cases']]
```

#

I'm wanting to add: if x[-1] - x[0] < 0 then
```python
group['growth'] = 0
```

velvet thorn Aug 1, 2020, 12:30 AM

#

uh

#

if I understand you correctly

#

why not use max
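
#

A short sketch of the `max` suggestion next to the conditional-expression form, on invented data. It assumes every `x[0]` is positive, so that "growth below zero" and "`x[-1] - x[0] < 0`" mean the same thing.

```python
import pandas as pd

# Hypothetical frame: each cell of "cases" holds a list of counts.
group = pd.DataFrame({"cases": [[10, 12, 15], [20, 18, 10], [5, 5, 5]]})

# Clamp negative growth to 0 with max ...
group["growth"] = [max(x[-1] / x[0] - 1, 0) for x in group["cases"]]

# ... or spell the condition out inside the comprehension.
group["growth_if"] = [
    x[-1] / x[0] - 1 if x[-1] - x[0] >= 0 else 0
    for x in group["cases"]
]

print(group)
```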

woeful shore Aug 1, 2020, 2:09 AM

#

ValueError: logits and labels must have the same shape ((None, 6) vs (None, 4))

#

@woeful shore I keep getting this error, how do I make sure my training data and labels are of the same shape. my training data is a list of textual data while my label is ndarray that has been onehot encoded. My labels has four targets

deft harbor Aug 1, 2020, 2:11 AM

#

Print the shapes at different points

#

Also depends on the library you are using

frank bone Aug 1, 2020, 2:21 AM

#

does anyone have a good and well explained resource on how you would go about making a very simple supervised ML function with around 10 parameters classifying around 10 different structures in a dataset

#

I have 0 experience in ML(tensorflow or whatever tool you would use)
My plan is to create a synthetic dataset with the data structures I want to detect and train a model with that, then test on real data

#

how do you 'label' these different structures? lets say we have a csv file with 100 lines and in 30 lines I have different data structures (around 10 different) I want to detect. The real dataset will be millions if not billions of lines but that was just an example

#

each line having around 15-16 columns, of which 4-5 are relevant to define a unique structure

#

and sometimes these structures are comprised of several lines within that 100 lines file

#

i.e. 3 lines = 1 unique data structure

#

and they even overlap sometimes

#

found it very difficult even just trying to structure an algo for that, would probably be very error prone and be hundreds of conditional nesting. Thought ML might be the solution for something like that

#

as i understand it...usually you'd have 1 datapoint and 1 corresponding label. But how would you go about mapping 3 datapoints to 1 label? then 2 datapoints to 1 label? And then make the training session go through all these and also check for overlap and classify the data according to the highest score? with labels that span multiple datapoints getting more weight than single ones, for example

flat quest Aug 1, 2020, 3:34 AM

#

still having trouble understanding the data you're working with
So its multifeatured inputs (each line is an input im assuming)
But some of the lines are grouped together? @frank bone

frank bone Aug 1, 2020, 3:57 AM

#

@flat quest let me give you an example. Specifically the idea is to identify different option strategies that were executed

#

examples of such strategies

#

https://www.investopedia.com/trading/options-strategies/

#

📎 unknown.png

#

example of data

#

usually theyre executed at same time but I want to introduce a tolerance of 3 minutes so that makes it a little more tricky but will get me more data output

#

now the tricky thing is "Label X" could be falsely identified within "Label Y", or any data within all the labels above could be identified as "Label R". But they're all unique instances

#

So in total there will be 10 labels, and each label can be comprised of more than 1 row but there's no such requirement

#

The way I would solve it (as I'm currently understanding ML) is to give more weight to a label the more datapoints it comprises

drifting umbra Aug 1, 2020, 4:15 AM

#

i think calling them labels makes it very confusing for most ML people

#

label is what you want to predict

#

if you think an equation like y = mx + b

#

labels = y = what you want to predict

#

features are your inputs

#

do you have data for inputs and corresponding outputs?

flat quest Aug 1, 2020, 4:17 AM

#

So you're trying to categorize options trade into particular categories? @frank bone

frank bone Aug 1, 2020, 4:19 AM

#

@flat quest exactly. And sometimes an option trade is a single entry, sometimes it can be multiple entries

#

@drifting umbra I was reading this (https://www.tensorflow.org/tutorials/keras/classification) and thought labels is the right term?

drifting umbra Aug 1, 2020, 4:21 AM

#

yes labels is what you want to predict

#

you dont have to call it label

#

you can call it anything

frank bone Aug 1, 2020, 4:21 AM

#

alright

drifting umbra Aug 1, 2020, 4:21 AM

#

you can also predict either:

#

number such as $ in sales, $ stock goes up / down

#

TRUE / FALSE

#

or categories

#

as you have it X, Y, Z, R you have 4 categories

frank bone Aug 1, 2020, 4:23 AM

#

that was just an example, I have around 10 categories and each category can be identified by timestamp, C/P, Strike and maybe TotalTrades (not sure)

#

the problem i see is the grouping of several datapoints into one category

drifting umbra Aug 1, 2020, 4:23 AM

#

i dont really think you need machine learning for this

#

if you are doing what's called "hard coding" rules

#

such as all options with one date

#

go into one category

#

e.g.

if month == "January":
  category = "Label X"

#

also to train your data you will need labels for many more options

#

or many more observations than 4

#

if it truly needs machine learning

#

and the category is not set by strict rules

frank bone Aug 1, 2020, 4:27 AM

#

e.g.

if month == "January":
  category = "Label X"

@drifting umbra not sure you get what im trying to do. It has nothing to do with time

drifting umbra Aug 1, 2020, 4:27 AM

#

i am saying

#

if you are categorizing things with STRICT rules

#

you don't need machine learning

#

such as all bachelors = unmarried

#

all children are young

#

etc

frank bone Aug 1, 2020, 4:28 AM

#

i dont necessarily need machine learning no, but if its easier and faster to implement and is more precise then yes

#

i just dont know how i would write code to correctly identify i.e. 10 different categories when they can overlap

drifting umbra Aug 1, 2020, 4:30 AM

#

if you are using strict rules machine learning will be less precise

frank bone Aug 1, 2020, 4:30 AM

#

seems very complicated and error prone with many many many conditional nested loops but i just started learning python 2 weeks ago so maybe its a knowledge issue 😄

drifting umbra Aug 1, 2020, 4:30 AM

#

this is pretty easy problem

#

do you have more observations

#

than 4

#

need like hundreds of thousands

#

*OR thousands. hundreds OR thousands. sorry

#

long day @ work

frank bone Aug 1, 2020, 4:32 AM

#

i do a prefilter, and out of that prefilter I get the data I colored in the above picture

#

and in that data i need to identify the categories and they're defined by just 4 or 5 observations, or 2* if it's a dual pair, 3* if triple pair

#

if there's a simple solution im all for it.. 😄

drifting umbra Aug 1, 2020, 4:35 AM

#

okay so

#

if the features for the colored things

#

are similar

#

and

#

you have more observations with labels

#

i only see 4 that are labeled

#

how many labeled rows

#

do you have is the question

frank bone Aug 1, 2020, 4:37 AM

#

it was just an example to demonstrate the issue

#

of there being categories with varying amount of csv rows

drifting umbra Aug 1, 2020, 4:37 AM

#

do you have excel or screenshot of actual data

#

if you want to do it as machine learning you will need more than 4 with both

#

input AND output

#

in column J most are missing output

jade walrus Aug 1, 2020, 4:38 AM

#

anyone tried both keras/tensorflow and fast.ai/pytorch? I only tried keras. Easy to use. Never tried the other combination. For those who've tried both, may I know what is your preference?

drifting umbra Aug 1, 2020, 4:39 AM

#

@jade walrus interested in this as well. i am reading Deep Learning with Python by François Chollet (author of Keras)

frank bone Aug 1, 2020, 4:40 AM

#

The actual data doesnt contain column J, I added those manually to demonstrate what I want to be filtered out

#

id create a dataset manually where id mark them

drifting umbra Aug 1, 2020, 4:40 AM

#

ok

frank bone Aug 1, 2020, 4:41 AM

#

or "label" as in the previous link i shared

drifting umbra Aug 1, 2020, 4:41 AM

#

like i said be aware you would need to do hundreds

#

or thousands

#

so maybe

#

what you want

frank bone Aug 1, 2020, 4:41 AM

#

absolutely

drifting umbra Aug 1, 2020, 4:41 AM

#

is to use rules

#

like strict rules

#

how do you decide what is J?

#

for example if strike is greater than 40, it's call, then it's Category Z

#

that could be a rule

frank bone Aug 1, 2020, 4:42 AM

#

let me show you in the example from the screenshot how id categorize it

drifting umbra Aug 1, 2020, 4:44 AM

#

okay i am saying if you are going to manually go through

#

and put something in column j

#

how do you decide what to put in column J ?

frank bone Aug 1, 2020, 4:46 AM

#

im looking at the timestamp with a tolerance of x seconds, if there's a pair (this can be 2 or more) then those 2 (or more) entries are a "pair". After such a pair is identified, i then look at C/P, Strike and Stock Price, if C and P strikes are within 10% of stock price, it's a "Straddle"

#

thats one label
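
The rule described above can be written down directly. This is only a minimal sketch: it assumes each leg of an already-grouped pair is a dict with hypothetical keys `cp` and `strike`, and that the stock price is known. The 10% band comes from the message above; the function name and data layout are illustrative.

```python
def is_straddle(legs, stock_price, band=0.10):
    """Return True if the group holds at least one call and one put
    and every strike lies within `band` (10%) of the stock price."""
    kinds = {leg["cp"] for leg in legs}
    if not {"C", "P"} <= kinds:
        return False  # need both a call and a put
    return all(
        abs(leg["strike"] - stock_price) <= band * stock_price
        for leg in legs
    )


# Hypothetical pair that was already grouped by timestamp.
pair = [{"cp": "C", "strike": 35.0}, {"cp": "P", "strike": 35.0}]
print(is_straddle(pair, stock_price=36.0))
```

Each further category would get its own small function like this, which keeps the conditions flat instead of deeply nested.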

#

and I want to identify like 10-20

#

the problem is the varying degree of numbers of datapoints in a label. For example a label where there's only 1 row, could mistakenly be identified in a pair?

#

the tolerance might also introduce overlapping between pairs if they happen close to each other

#

lets say one pair happens at 10:15 and the other pair at 10:17, if my tolerance connects the two, that's a mistake

#

i think itll be extremely complicated to code all these sometimes deeply nested conditionals and debugging will be hell

#

if you look at straddles, strangles and iron condor youll understand the problem

#

https://www.investopedia.com/trading/options-strategies/


drifting umbra Aug 1, 2020, 4:58 AM

#

can u say category instead of label

#

and ok so

#

"lets say one pair happens at 10:15 and the other pair at 10:17, if my tolerance connects the two, that's a mistake"

#

yeah i understand options strategies i am CFA

#

actually know a lot more about investing than data science lol

frank bone Aug 1, 2020, 4:59 AM

#

great 😄

drifting umbra Aug 1, 2020, 4:59 AM

#

i think u will get much more accurate results hard coding

#

because you have no answer data

#

talking of going thru hundreds or thousands and labeling it yourself is a big mistake imo

#

just create code for conditions

frank bone Aug 1, 2020, 5:00 AM

#

can u say category instead of label
@drifting umbra can u just quickly define it to me. A little confused by terminology

drifting umbra Aug 1, 2020, 5:00 AM

#

such as TIME_2 - TIME_1

#

is that >2 or <2

#

features are your X's. features are inputs
labels are your Ys. labels are your outputs

frank bone Aug 1, 2020, 5:01 AM

#

talking of going thru hundreds or thousands and labeling it yourself is a big mistake imo
@drifting umbra i was thinking of synthetically making them

drifting umbra Aug 1, 2020, 5:01 AM

#

step 1 import this excel into pandas

deft harbor Aug 1, 2020, 5:01 AM

#

@drifting umbra go back to stock market chat

drifting umbra Aug 1, 2020, 5:01 AM

#

pd.read_csv("file.csv")

frank bone Aug 1, 2020, 5:01 AM

#

yeah i have all filtered data in a dataframe already

#

the colored entries from picture above

deft harbor Aug 1, 2020, 5:02 AM

#

Study dask, skip pandas

drifting umbra Aug 1, 2020, 5:02 AM

#

@deft harbor wassup

frank bone Aug 1, 2020, 5:02 AM

#

thats where i extracted pairs using unixtimestamp comparison but im stuck and thats why i looked at ML

deft harbor Aug 1, 2020, 5:02 AM

#

Just checking in to see what's happening here.

drifting umbra Aug 1, 2020, 5:04 AM

#

@deft harbor ugh. new tech. this is why i hate technology and am glad i am not a software developer

#

always new shit to learn

#

i love learning but i question how much new languages and frameworks even add

deft harbor Aug 1, 2020, 5:04 AM

#

I use dask because I found myself trapped using one core all the time

drifting umbra Aug 1, 2020, 5:05 AM

#

https://modin.readthedocs.io/en/latest/

deft harbor Aug 1, 2020, 5:05 AM

#

Dask is really useful for multicore processors and in a bigger world, clustering

drifting umbra Aug 1, 2020, 5:05 AM

#

import modin.pandas as pd

#

🙂

#

makes sense

#

i am using colab because

#

although i have nvidia GPU. it is fk impossible to get tensorflow to work on windows

deft harbor Aug 1, 2020, 5:06 AM

#

Yeah, I use Linux only

drifting umbra Aug 1, 2020, 5:06 AM

#

@frank bone not sure your question

#

if you were to do it manually

#

labeling them

deft harbor Aug 1, 2020, 5:06 AM

#

I've worked with people who developed notebooks in windows, and its a mess

drifting umbra Aug 1, 2020, 5:06 AM

#

how would you decide?

frank bone Aug 1, 2020, 5:07 AM

#

i wouldnt label them manually, id create a synthetic dataset with only condors, only straddles, only strangles and then concat them

#

for training

#

but im all for a hardcoded classic function, it just seems stupidly complicated to me

#

https://modin.readthedocs.io/en/latest/
@drifting umbra wow this is great

#

do you have any suggestions? I mean you understand the problem well since you understand options strategies.
I have a table of different option strategies and I want to put them in correct categories (straddle, strangle, butterfly, etc)

#

defining observations are timestamp (nearness chronologically IN CASE its applied to multiple lines), C/P (defines possible categories), C/P strikes (nearness to stock price -> defines possible categories) and all this can be applied to several datapoints (csv lines) or a single line, which will also change possible categories

drifting umbra Aug 1, 2020, 5:26 AM

#

@frank bone make each row in your data frame an options contract

#

then use logic to build options trades

#

different data frames for different underlying securities

#

if you have all the options contracts in a data frame

#

then you can build strategies from it pretty easy

#

many different strategies

#

btw side comment is that i am skeptical of insights from this because generally options are not super liquid

#

so you may have problems with either
-stale prices
-getting fills / adequate liquidity when you go to implement strategy

#

you can store the options trades in mini data frames, one big data frame, w/e

#

like all the legs

#

that is cleaner than having a column in your df_all_options_AAPL

#

df_AAPL_Oct_Bear_Put could be a data frame with the rows for a bear put spread on aapl

#

or of course you can add columns

frank bone Aug 1, 2020, 5:48 AM

#

maybe i wasnt clear enough in how my current data structure looks

drifting umbra Aug 1, 2020, 5:48 AM

#

screenshot you sent has every options contract as a row

frank bone Aug 1, 2020, 5:48 AM

#

I have a df for a given ticker for a given day with all options trades in it

#

and im trying to find "option strategies" that were executed for each given day

drifting umbra Aug 1, 2020, 5:49 AM

#

wait but

#

ok

#

i will challenge the theoretical basis for this even

#

or really just 2 things

#

1- how do you know what are options strategies?

#

together

#

and not just different traders

#

you have no "trader_ID" variable

#

i think it is just too much guessing

frank bone Aug 1, 2020, 5:50 AM

#

and whats the 2nd thing?

drifting umbra Aug 1, 2020, 5:50 AM

#

that it's just guessing

#

you need answers (labels)

#

a problem you could do

#

with this data

#

if i were you

frank bone Aug 1, 2020, 5:51 AM

#

well im 99.9% certain when I look at the data

drifting umbra Aug 1, 2020, 5:51 AM

#

ok then i would say hard code

#

the rules

frank bone Aug 1, 2020, 5:51 AM

#

that it was a strategy executed

drifting umbra Aug 1, 2020, 5:51 AM

#

yeah so if time is within 2 seconds

#

if different strike prices

#

whatever conditions

#

just think about what you're doing in your head when you are grouping the trades

#

i would try to use this options data to predict underlying stock price movement

#

rest of that day or next day maybe

#

that is something you actually have data for

#

input1 : $ value of puts bought
input2: $ value of calls bought
input3: # of puts bought
input4: # of calls bought

#

input5: % of trades in last 15 min of trading

#

label (what you want to predict) = stock return next day

#

boom, there you can get data for it

#

📎 ML_vs_Rule_Based_Programming.PNG

#

you have data and rules, want to output answers

#

that is fine

frank bone Aug 1, 2020, 5:54 AM

#

thats not my goal though, i just want to identify categories

drifting umbra Aug 1, 2020, 5:54 AM

#

ok well you have no answers right

#

you need to define the rules formally

#

of how you are grouping the options trades into being done by the same person

#

if u can do it mentally need to formalize

#

not even program, just write down

frank bone Aug 1, 2020, 5:55 AM

#

i like what you said to lay down step by step what im going through my head when i categorize it. And it would be somewhat doable but when I think about edge cases things get messy

#

so im wondering if i create synthetic answers and let ML find the rules, if that will be better

drifting umbra Aug 1, 2020, 5:57 AM

#

"i like what you said to lay down step by step what im going through my head when i categorize it"

#

exactly

#

i mean

#

you can

#

i just question how objective your categorizations really are

#

if u cant lay out the rules

#

as u see the 2 paradigms for programming

#

you need either data and rules

#

or data and answers, and the computer comes up with the rules

#

hm

#

wait

#

maybe i'm an idiot

#

there are ways to categorize data into groups

#

like k-means for example

#

📎 Screenshot-2019-08-12-at-1.png

#

but i think you have to tell it how many groups there are

#

also it would not make sense for your data

#

without answers, you could only group similar things together

#

you need either:
data + rules
or
data + answers

frank bone Aug 1, 2020, 6:02 AM

#

yup im following 😄 for the second answer my only problem is how would I put several datapoints in one category and for other categories just one line and for other 4

#

would i train the model once for 1 liners, then 2 line data, then 3 line data and then 4 line data(max) and then when predicting i give more weight to categories with more values?

#

or can i somehow do all in 1?

#

as i understood tensorflow, you train a model with data(1 csv line)='category'

#

can u do data(2 csv lines) = category1, (4 csv lines)=cat2

#

and then how would you apply (model.predict) that to data that only consists of 1 liners

#

ill try to write down rules for 1st case on paper, lets see if i can get clear answers, been trying to do that for 3 days though, all i got was headaches 😄

drifting umbra Aug 1, 2020, 6:14 AM

#

"how would I put several datapoints in one category"

#

put the category they are in

#

on their line in excel

#

just say "Trade1"

#

"Trade2"

#

"Trade3"

#

"as i understood tensorflow, you train a model with data(1 csv line)='category'
can u do data(2 csv lines) = category1, (4 csv lines)=cat2"

#

not sure what this means

#

i just think you are no offense
1- guessing. why do you think these trades belong together?

#

are there any rules you can define? being similar in timestamp makes logical sense

#

you can get python to group them based on that

#

just think this makes no sense idk

#

📎 ML_vs_Rule_Based_Programming.PNG

#

📎 unknown.png

#

we are here

#

lol

frank bone Aug 1, 2020, 6:19 AM

#

i just think you are no offense
1- guessing. why do you think these trades belong together?
@drifting umbra the data at that stage has gone through pre-filtering, in combination with timestamps its highly likely they belong to the same trader. It's not a make or break situation either. I just want high accuracy, never claiming 100%

#

actually the grouping is 100% only defined on timestamp, after that it's just putting them in different categories. thats the easy part

#

but grouping i find hard but ill take a pen & paper and have at it again 😄

drifting umbra Aug 1, 2020, 6:22 AM

#

which is easy part

#

grouping based on time stamp?

#

and i have idea on how to do this maybe

#

what is max number of options legs in one trade?

frank bone Aug 1, 2020, 6:23 AM

#

4 but thats very rare. mostly 1 2 or 3

#

i mean you can take this as an example for a df with different categories

#

20200501    09:41    FAST    C    35.0    20200515    1.45    90    1    
20200501    10:41    FAST    P    35.0    20200515    0.71    350    1
20200501    10:41    FAST    P    32.5    20200515    0.25    600    1    
20200501    10:41    FAST    C    32.5    20200515    3.33    600    2    
20200501    15:40    FAST    P    35.0    20200515    0.75    160    1
20200501    15:40    FAST    P    35.0    20200515    0.75    145    1    
20200501    15:47    FAST    C    37.5    20200515    0.3063    80    5

#

adding a column at [2] with unixtimestamps easy

#

alrdy done that

#

now just trying to figure how to take out pairs and save them in a separate df or list and take them out of original df, so they dont interfere with other pairs

#

and all this dynamically, since this is part of an iteration and number of pairs will always be different

#

got my brain juices flowing a little bit now...trying something 😄
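
One way to pull out the timestamp groups without nested loops is to start a new group whenever the gap to the previous row exceeds the tolerance. A minimal sketch, assuming pandas and using the date, time, C/P and strike values from the sample rows above; the column names and the 3-minute tolerance are illustrative choices, not the actual dataframe layout.

```python
import pandas as pd

# Sample rows from the chat: date, time, C/P and strike only.
df = pd.DataFrame({
    "date": ["20200501"] * 7,
    "time": ["09:41", "10:41", "10:41", "10:41", "15:40", "15:40", "15:47"],
    "cp": ["C", "P", "P", "C", "P", "P", "C"],
    "strike": [35.0, 35.0, 32.5, 32.5, 35.0, 35.0, 37.5],
})

ts = pd.to_datetime(df["date"] + " " + df["time"], format="%Y%m%d %H:%M")
df = df.assign(ts=ts).sort_values("ts", kind="stable").reset_index(drop=True)

tolerance = pd.Timedelta(minutes=3)
# A new group starts whenever the gap to the previous row exceeds the tolerance.
df["group"] = (df["ts"].diff() > tolerance).cumsum()

# Groups with more than one row are candidate multi-leg trades;
# the rest are single-leg rows, kept separately so they do not interfere.
sizes = df.groupby("group")["ts"].transform("size")
multi_leg = df[sizes > 1]
single = df[sizes == 1]
print(df[["time", "cp", "strike", "group"]])
```

Note that gap-based grouping chains rows, so with a 3-minute tolerance a pair at 10:15 and another at 10:17 would be merged, which is the overlap case raised earlier. Comparing each row against the first row of its group is one possible alternative.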

drifting umbra Aug 1, 2020, 6:39 AM

#

yo

#

sorry

#

have idea

#

your input should be 4 rows of these characteristics

#

output is TRUE/FALSE

#

whether they are in a trade together

#

THAT is doable with machine learning!

#

the 4 rows will consist of any related rows which actually form a category

#

plus the number of rows to get to 4

#

randomly chosen from data set

#

pretty sure it will work

#

using true/false as Y is solution i think

#

that way you are predicting which of inputs are in a group together

#

📎 unknown.png

#

so in the future when you give it a bunch of trades it can output predictions TRUE/FALSE of those that are related

frank bone Aug 1, 2020, 6:54 AM

#

but how to further differentiate between the different groups afterwards? that big group will consist of 10-20 subgroups (option strategies)

#

the table you showed actually consists only of those subgroups we're talking about. But I think it should be possible without ML, still writing down I think there's a way..so far

drifting umbra Aug 1, 2020, 7:07 AM

#

each sample

#

should contain how ever many that are in a group

#

and the rest of the rows selected at random from your data set

#

you will feed the data into your model to project in rows of 4

#

splitting it up into 4s

distant dagger Aug 1, 2020, 7:09 AM

#

Is this a good place to ask minor stats questions?

frank bone Aug 1, 2020, 7:09 AM

#

should contain how ever many that are in a group
@drifting umbra Oh I get it

drifting umbra Aug 1, 2020, 7:10 AM

#

@distant dagger sure

#

i think there is a rule against doing hw for people

frank bone Aug 1, 2020, 7:10 AM

#

splitting it up into 4s
@drifting umbra what do you mean by that?

drifting umbra Aug 1, 2020, 7:11 AM

#

okay so you have hundreds of trades right

#

this example you have 8 rows

#

it will be broken into groups with 4 rows

#

the model will output TRUE if it thinks the trade is part of a group

distant dagger Aug 1, 2020, 7:12 AM

#

Yeah, its not homework, I just didn't pay attention in college when they were teaching this stuff. Now I am almost done with my phd and doing some minor projects.

drifting umbra Aug 1, 2020, 7:13 AM

#

very cool

#

@frank bone given 4 rows

#

it will predict WHICH are part of a trade together

#

if TRUE you can tag them with "Trade+counter"

#

where counter increases by 1 everytime

#

so for example it might tag the first 2 out of 4 TRUE

#

add "Trade1" next to those entities in your DF in a new column called "Trade_No_Predicted"

#

you feed in 4 more

#

it predicts all 4 are part of one trade

#

all 4 are predicted TRUE

#

you add "Trade2" next to all in the same column "Trade_No_Predicted"

distant dagger Aug 1, 2020, 7:19 AM

#

Ok...so here's the thing. I have this boxplot.

📎 boxplot.png

#

It doesn't show any outliers. But if I plot a histogram, it gives me this

📎 distplot.png

drifting umbra Aug 1, 2020, 7:20 AM

#

📎 unknown.png

#

but there are observations here

frank bone Aug 1, 2020, 7:21 AM

#

it will predict WHICH are part of a trade together
@drifting umbra ohh sort of a "sliding window" (just for understanding) from top to bottom, at each new row it checks whether trades are part of group. Do I understand correctly?

drifting umbra Aug 1, 2020, 7:21 AM

#

@frank bone yesss

frank bone Aug 1, 2020, 7:21 AM

#

Interesting, and for training data i just label true/false?

#

sorry if i got terminology mixed up with "label" here. But in training data I need to have features + label right

distant dagger Aug 1, 2020, 7:24 AM

#

I am sorry, but i don't get what you are saying. The whiskers are the min and max in the data aren't they?

drifting umbra Aug 1, 2020, 7:24 AM

#

@frank bone no with training data you label with groups

#

later convert to true/false

#

when converting from whole data set into rolling window of 4

#

rows
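
The conversion from group labels to TRUE/FALSE could look roughly like this. It is only one possible reading of the idea: the data is split into non-overlapping windows of 4 rows, and a row is marked True when another row in the same window carries the same trade label. The function name and the use of `None` for unlabeled rows are assumptions made for illustration.

```python
from collections import Counter


def window_targets(trade_ids, size=4):
    """Split per-row trade labels into windows of `size` rows and mark each
    row True when another row in the same window has the same label."""
    windows = []
    for start in range(0, len(trade_ids), size):
        chunk = trade_ids[start:start + size]
        counts = Counter(t for t in chunk if t is not None)
        windows.append([t is not None and counts[t] > 1 for t in chunk])
    return windows


# Hypothetical labels: a 2-leg trade plus two unrelated rows, then a 4-leg trade.
labels = ["Trade1", "Trade1", None, None,
          "Trade2", "Trade2", "Trade2", "Trade2"]
print(window_targets(labels))
```

With these labels the first window marks only its first two rows and the second window marks all four, matching the "first 2 out of 4" and "all 4" cases described above.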

#

@frank bone "training data I need to have features + label right"
Yes

#

@distant dagger yeah so what are you saying about outliers

#

everything looks right to me

frank bone Aug 1, 2020, 7:26 AM

#

can you just give me an example what that would look like (the "group")? Slightly confused but I got the concept. thats smart, thanks 😄

drifting umbra Aug 1, 2020, 7:26 AM

#

@distant dagger

📎 unknown.png

#

are you talking about these

#

i think that is smoothing

distant dagger Aug 1, 2020, 7:26 AM

#

Oh ok...I was confused by the histogram cause it has two peaks. @drifting umbra

drifting umbra Aug 1, 2020, 7:26 AM

#

@frank bone you label "Trade1", "Trade2"

#

in your excel

#

or master data frame

frank bone Aug 1, 2020, 7:27 AM

#

ohhh

#

got it

#

will try that when classic method wont work. Or maybe do a benchmark classic vs. machine

drifting umbra Aug 1, 2020, 7:30 AM

#

📎 unknown.png

#

your data

#

will not be so brain dead easy

#

to see / predict

#

but if you have this type of thing

distant dagger Aug 1, 2020, 7:30 AM

#

@drifting umbra I guess what I was trying to ask was that since there are two peaks in the histogram, is it simply a bimodal histogram or is there something wrong with the way I have processed the data. The boxplot and the histogram are actually log values

#

And from what I have heard, bimodal histograms are not very common

drifting umbra Aug 1, 2020, 7:31 AM

#

bimodal histogram is def possible

#

depends on ur data

#

it suggests

#

u have 2 diff populations

#

or groups sorry

frank bone Aug 1, 2020, 7:32 AM

#

@drifting umbra very nice thanks. Everything noted down 🙂 have you compared classic vs machine implementations before? did one outperform the other in accuracy?

drifting umbra Aug 1, 2020, 7:32 AM

#

@frank bone depends entirely on data and problem ur trying to solve

#

one is not always "better"

#

i think you can use random forest for this really well

#

or if you convert everything to categories

#

either use xgboost or catboost

#

but basically

frank bone Aug 1, 2020, 7:34 AM

#

can this be done on a CPU? or GPU needed?

#

or if you convert everything to categories
@drifting umbra ?

#

not exactly sure what you mean by that? train with categories only?

distant dagger Aug 1, 2020, 7:38 AM

#

@drifting umbra I get it now. Thank you. that was really helpful.

drifting umbra Aug 1, 2020, 7:38 AM

#

📎 unknown.png

#

@frank bone my idea

#

@distant dagger 🙂

#

@frank bone categories is no big deal

#

computer cannot understand text

#

only numbers

#

can you send data again it's too far up

#

dont worry about that

#

that is 1 easy line of code
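
For reference, turning a text column into numbers in pandas can indeed be a single line. A small sketch with a made-up `cp` column; whether integer codes or one-hot columns fit better depends on the model, so both are shown.

```python
import pandas as pd

df = pd.DataFrame({"cp": ["C", "P", "P", "C"],
                   "strike": [35.0, 35.0, 32.5, 32.5]})

# Integer codes: categories are sorted, so "C" -> 0 and "P" -> 1.
df["cp_code"] = df["cp"].astype("category").cat.codes

# One-hot alternative: one indicator column per category.
one_hot = pd.get_dummies(df["cp"], prefix="cp")
print(df)
print(one_hot)
```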

#

xgboost or catboost can run on CPU in less than 60 sec

#

if you need GPU you can use google colab

#

its free and thats what i do

#

i would suggest doing data science in jupyter notebooks

frank bone Aug 1, 2020, 7:46 AM

#

can i feed the training with many separate files instead of 1 big concatenated? I'm wondering if increasing label numbers and passing time (i.e. jan to june) would play a role in the algo's decisions when that shouldnt influence it

drifting umbra Aug 1, 2020, 7:48 AM

#

you mean features include month?

#

no problem with adding month to feature

frank bone Aug 1, 2020, 7:49 AM

#

no since ill have(?) to feed in unix timestamp, that includes the passing of time, when there should be no correlation in that

#

skewing the performance

drifting umbra Aug 1, 2020, 7:49 AM

#

you can easily delete or add columns

frank bone Aug 1, 2020, 7:49 AM

#

i dont want to include passing of time as a feature, but a timestamp would include that?

drifting umbra Aug 1, 2020, 7:50 AM

#

you can extract just the month or day of week

#

or year

#

whatever you want

#

or no time variable

#

all easy

frank bone Aug 1, 2020, 7:50 AM

#

can i make a timestamp of 10:15?

#

without date

#

just the time

drifting umbra Aug 1, 2020, 7:50 AM

#

https://docs.python.org/3.4/library/datetime.html?highlight=datetime

#

yes

#

easy
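
A time of day without the date can be parsed with the standard `datetime` module. A minimal sketch, assuming the times arrive as `HH:MM` strings like in the sample rows; the helper name is made up. Minutes since midnight give a plain number that can be compared or used as a feature without carrying the calendar date along.

```python
from datetime import datetime


def minutes_since_midnight(hhmm):
    """Parse 'HH:MM' and return the time of day as minutes after midnight."""
    t = datetime.strptime(hhmm, "%H:%M").time()
    return t.hour * 60 + t.minute


print(minutes_since_midnight("10:15"))  # 615
```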

frank bone Aug 1, 2020, 7:51 AM

#

lol stupid me

drifting umbra Aug 1, 2020, 7:51 AM

#

naw all good

#

i would suggest keeping some time var

frank bone Aug 1, 2020, 7:51 AM

#

yeah

drifting umbra Aug 1, 2020, 7:51 AM

#

trades which occur closer obv more likely to be part of same trader

#

gotta sleep

#

feel free to PM me

#

good luck with project

frank bone Aug 1, 2020, 7:51 AM

#

thanks again! 😄

drifting umbra Aug 1, 2020, 7:51 AM

#

gl ☮️

frank bone Aug 1, 2020, 7:52 AM

#

u too

high pulsar Aug 1, 2020, 7:52 AM

#

I have sales data from US Cities. I don't have latitude and longitude values, just plain state and city names. How can I visualize them on a map? I tried using Geopy and Geopandas (Nominatim), but it takes too much time as there are around 10k rows. Any better way?

#

I followed this https://towardsdatascience.com/geocode-with-python-161ec1e62b89. So if I use RateLimiter it takes too much time; if I don't, it still takes a lot of time and gives too many nan values.

austere swift Aug 1, 2020, 8:15 AM

#

have you tried plotly?

#

I'm not sure if it works with names but you can do some pretty nice map visualizations with it

chrome barn Aug 1, 2020, 8:29 AM

#

https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/

US Zip Code Latitude and Longitude

The ZIP code database contained in 'zipcode.csv' contains 43204 ZIP codes for the continental United States, Alaska, Hawaii, Puerto Rico, and American Samoa. The database is in comma separated value format, with columns for ZIP code, city, state, latitude, longitude, timezone ...

reef bone Aug 1, 2020, 11:05 AM

#

Hi @tame fractal, I'm not sure why you decided to post a monkey picture here but I deleted it as it was not relevant to the topic of this channel

austere swift Aug 1, 2020, 11:12 AM

#

passive aggression at its finest

high pulsar Aug 1, 2020, 11:13 AM

#

https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/
@chrome barn Thanks man. I used pandas merge with this csv. Got exactly what I wanted. Thank you again.
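
The merge approach might look roughly like the sketch below. The two small frames are stand-ins: the real sales data and the downloaded ZIP code CSV will have different contents and possibly different column names, so `city`, `state`, `latitude` and `longitude` are assumptions here.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Austin", "Denver", "Nowhere"],
    "state": ["TX", "CO", "ZZ"],
    "sales": [120, 80, 5],
})
# Stand-in for the ZIP code table; real column names may differ.
zips = pd.DataFrame({
    "city": ["Austin", "Austin", "Denver"],
    "state": ["TX", "TX", "CO"],
    "latitude": [30.27, 30.29, 39.74],
    "longitude": [-97.74, -97.76, -104.99],
})

# A city has many ZIP codes, so average to one coordinate per (city, state).
coords = zips.groupby(["city", "state"], as_index=False)[
    ["latitude", "longitude"]
].mean()

# Left merge keeps every sales row; unmatched cities get NaN coordinates.
merged = sales.merge(coords, on=["city", "state"], how="left")
print(merged)
```

Because this is a local lookup rather than one geocoding request per row, it avoids the rate limiting that made the Nominatim route slow.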


frank bone Aug 1, 2020, 11:30 AM

#

sounds rather exotic but is it possible in python to skip a loop twice?

acoustic halo Aug 1, 2020, 11:33 AM

#

Use a variable as a flag?

frank bone Aug 1, 2020, 11:34 AM

#

i think that doesnt work? it happens many times in the loop

#

in my case

acoustic halo Aug 1, 2020, 11:35 AM

#

This wouldn't work?

#

skip_flag=False
for loop:
    if skip_flag or need_to_skip:
        skip_flag=not skip_flag
        continue
    do_stuff()

frank bone Aug 1, 2020, 11:39 AM

#

not sure...i have a
```
      df.loc[var[i]]
      do stuff...
```
but sometimes i doesnt exist in the dataframe and it throws some ambiguous error

acoustic halo Aug 1, 2020, 11:40 AM

#

Can you not catch it in a try except statement and put the continue there?

frank bone Aug 1, 2020, 11:40 AM

#

oh i should make a list out of df and then compare, if i isnt in there continue

#

its an index

#

so i can do check = df.index.values.tolist()

#

ill try that

#

worked just fine 😄
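
For reference, the membership check does not need a list: `in` works on the index directly. A small sketch with a made-up dataframe and labels, also showing the try/except variant suggested above.

```python
import pandas as pd

df = pd.DataFrame({"strike": [35.0, 32.5]}, index=[101, 103])
wanted = [101, 102, 103]  # 102 is not in the index

found = []
for key in wanted:
    if key not in df.index:  # label missing: skip this iteration
        continue
    found.append(df.loc[key, "strike"])
print(found)

# Alternative: let the lookup fail and catch the KeyError.
try:
    row = df.loc[102]
except KeyError:
    row = None
```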

frank bone Aug 1, 2020, 12:29 PM

#

dataframes seem to behave differently when calling within len(). If I do df.loc[var] and that var is in the index just once, len returns the number of columns in that row, however if var is in the index > 1 times, len returns the amount of times var is in the df index

#

im only interested in getting the amount of var in index, whether it's once or twice or more, not interested in len of columns in row

#

any ideas how to accomplish that?

desert oar Aug 1, 2020, 1:14 PM

#

You want the number of times a value appears in the index?

#

(df.index == val).sum()

#

Len does not behave differently in your case. loc behaves differently

#

If the value only appears once it returns a series representing a row - the length of which is the number of columns

frank bone Aug 1, 2020, 1:17 PM

#

Thanks a lot! 🙂

desert oar Aug 1, 2020, 1:18 PM

#

You can force loc to return a dataframe by wrapping the value in a list

#

df.loc[[val]]

frank bone Aug 1, 2020, 1:18 PM

#

Double square brackets?

#

Ah

desert oar Aug 1, 2020, 1:18 PM

#

Yes but be careful about what they mean

#

The outer brackets apply to df.loc

#

The inner brackets arent "attached" to anything so they create a new list

frank bone Aug 1, 2020, 1:19 PM

#

Fair enough makes sense

desert oar Aug 1, 2020, 1:20 PM

#

If val is already a list, series, or index, you should not use the second set of brackets

frank bone Aug 1, 2020, 1:20 PM

#

Good to know

desert oar Aug 1, 2020, 1:20 PM

#

Actually using len(df.loc[[val]]) might be quite a bit faster than the method I suggested above

#

Because in theory pandas can optimize individual index value look ups

#

Either with a binary search or a hash table look up
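
A small sketch of the difference, using a made-up dataframe with a duplicated index label. One thing to keep in mind: the boolean-mask version returns 0 for a label that is not in the index, while `df.loc[[val]]` raises a `KeyError` in that case.

```python
import pandas as pd

df = pd.DataFrame(
    {"strike": [35.0, 32.5, 37.5], "size": [90, 350, 600]},
    index=["a", "b", "b"],
)

n_cols = len(df.loc["a"])               # Series for one row: number of columns
n_rows_a = len(df.loc[["a"]])           # DataFrame: number of matching rows
n_rows_b = len(df.loc[["b"]])           # label appears twice
n_mask_b = int((df.index == "b").sum()) # same count via a boolean mask
n_missing = int((df.index == "z").sum())
print(n_cols, n_rows_a, n_rows_b, n_mask_b, n_missing)
```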

frank bone Aug 1, 2020, 1:21 PM

#

Ill try both

eternal sandal Aug 1, 2020, 1:54 PM

#

Does anyone have some code for sorting numbers. Beginner to python, and looking to learn something about lists

tidal bough Aug 1, 2020, 1:56 PM

#

First of all, not really #data-science-and-ml. You better get a help channel (see #❓｜how-to-get-help)
Second, are you asking "how to sort lists" (for which the answer is to use sort/sorted) or "what are some sorting algorithms I can implement" (for which the answer is "bubble sort, insertion sort, merge sort, and the other ones you can find in any Data Structures and Algorithms book")?

eternal sandal Aug 1, 2020, 1:57 PM

#

okay. just joined this server. Thanks for the help

tardy portal Aug 1, 2020, 2:15 PM

#

Question: Split the dataset into training and testing sets: 30 percent testing, 70 percent training. You can use the train_test_split function in the sklearn library.

#

Answer: ```import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

df_ea = pd.read_csv('C:/Users/Bobby/Documents/School/INFO 357/employee_attrition_processed.csv')

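# Use every column except the target as input features.
# A list (not a set) keeps the column order stable and is a valid pandas indexer.
input_var = [col for col in df_ea.columns if col != 'Attrition_Yes']
X = df_ea[input_var]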
Y = df_ea['Attrition_Yes']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)

log_reg = LogisticRegression(random_state = 0, solver = 'lbfgs', multi_class = 'ovr')
log_reg.fit(X_train, Y_train)

des_tree = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', max_depth = 15)
des_tree.fit(X_train, Y_train)```

gleaming gyro Aug 1, 2020, 2:16 PM

#

Is there anyone who uses Astropy here?

tardy portal Aug 1, 2020, 2:16 PM

#

Results: DecisionTreeClassifier(max_depth=15)

#

Not sure if I am on the right path here

gleaming gyro Aug 1, 2020, 2:16 PM

#

uhh

#

i have done this one

#

df_ea is the dataset right?

tardy portal Aug 1, 2020, 2:18 PM

#

yeah

gleaming gyro Aug 1, 2020, 2:18 PM

#

do you have some results already?

tardy portal Aug 1, 2020, 2:18 PM

#

thats correct

gleaming gyro Aug 1, 2020, 2:18 PM

#

like, it should look like this

#

something like that?

#

or is it purely new?

tardy portal Aug 1, 2020, 2:19 PM

#

we're given a dataset and the first nine questions we reprocess it and question 10 is the question I just posted

gleaming gyro Aug 1, 2020, 2:20 PM

#

oh

#

then why you even need fit and all?

#

is it like you skipped this question and the next ones require that?

#

Is there anyone who uses Astropy here?
@gleaming gyro

tardy portal Aug 1, 2020, 2:27 PM

#

No I haven't skipped any questions, but the last two questions rely on this question

tardy portal Aug 1, 2020, 2:53 PM

#

@desert oar would you be able to assist me?

raven mulch Aug 1, 2020, 3:01 PM

#

Hey guys! If anyone is interested I made a second video on my series where I make my own deep learning library from scratch and the plan is to use it to deploy a MNIST classifier on a webserver! This video is about the implementation of Adam, RELU and the new scikit learn API it has. Here it is: https://www.youtube.com/watch?v=UELWdyJVVRg

YouTube

Federico Barbero

Developing a Deep Learning Library - Adam, RELU and Scikit-learn AP...

Welcome back!

In today's video I go over the implementation of the Adam optimizer, the RELU activation function and the scikit-learn inspired API of the library!

Code: https://github.com/Fedzbar/deepfedz
Adam paper: https://arxiv.org/abs/1412.6980


desert oar Aug 1, 2020, 4:19 PM

#

@tardy portal What exactly did you need help with

tardy portal Aug 1, 2020, 4:21 PM

#

@desert oar trying to understand what these lines of code do: `from sklearn.model_selection import train_test_split` and `X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)`

desert oar Aug 1, 2020, 4:22 PM

#

Did you read the docs for that function, or the user guide?

#

Do you know what a train/test split is?

#

This is homework right? Did your teacher explain this at all?

tardy portal Aug 1, 2020, 4:23 PM

#

vaguely

#

hence why i'm in here asking for assistance

desert oar Aug 1, 2020, 4:24 PM

#

Which question is that an answer to

tardy portal Aug 1, 2020, 4:24 PM

#

if my professor explained this at all

desert oar Aug 1, 2020, 4:30 PM

#

I see

#

what did they say?

#

Are you familiar with the idea of train and test sets?

tardy portal Aug 1, 2020, 5:12 PM

#

From what i understand is that it splits the data into test and train models?

#

With that said, is there a specific method I should use for fitting the model?

acoustic halo Aug 1, 2020, 5:23 PM

#

It splits a single larger dataset into two smaller datasets, you use one of these to train a model, and the other to see how well the model works on new data. It's not making a model itself, you would have to pick a model yourself or whatever your homework specifies and feed these datasets into that

tidal bough Aug 1, 2020, 5:26 PM

#

The point of the train/test split is to estimate how well the model performs on data it didn't train on. This is important for catching overfitting: if your model has enough parameters, it can perfectly memorize the training data, but that won't help it on the testing set unless it actually got the general idea right.
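To make the idea concrete, here is a rough sketch of what a 70/30 split does, written with plain NumPy. It is not sklearn's implementation; `simple_train_test_split` and the toy data are made up for illustration.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.30, seed=42):
    """Shuffle row indices, then cut them into a train part and a test part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features each
y = np.arange(10)                  # made-up targets, one per row

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(X_train.shape, X_test.shape)
```

The model is then fitted on `X_train, y_train` only and scored on `X_test, y_test`, which it has never seen.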

fallow sandal Aug 1, 2020, 7:17 PM

#

I know this isn't directly related to Python, but does anyone here have experience with TensorFlow, particularly with object detection? I was wondering how many images in general it would take to get reasonable object detection for just one object? I was watching a tutorial for a deck of cards and for 6 objects they recommended at least 200 images. Would it be the same if I was only training one object?

charred blaze Aug 1, 2020, 8:14 PM

#

not even close

#

and even 200 images seems way too low.

#

for such small amount of data, you would be better served with using an approach like kNN and using some good ol' feature extraction.

flat quest Aug 1, 2020, 8:18 PM

#

no use a pretrained model

#

and train over that with 200 images

fallow sandal Aug 1, 2020, 8:19 PM

#

Yeah sorry, I meant a pre-trained model from the example code

flat quest Aug 1, 2020, 8:19 PM

#

you won't get a good obj detection model with something like KNN or traditional ML techniques.

fallow sandal Aug 1, 2020, 8:20 PM

#

I'm a big noob so sorry if I say anything technically wrong or if there is some clarification i need haha

#

I was just looking up a tutorial and I want to figure out how to train a custom image detection using one of the pre trained models

#

In the example, they were using playing cards which is relatively flat, so im guessing for a 3d object

#

i need multiple angles too?

flat quest Aug 1, 2020, 8:21 PM

#

there's a lot of obj detection models out there. Pick any one of them and train over your images. If you're only training to detect one object you only really need a couple. My models worked with only 5 images.

fallow sandal Aug 1, 2020, 8:21 PM

#

Oh sweet

flat quest Aug 1, 2020, 8:21 PM

#

yeah it would be better to have multiple angles, but it should still perform relatively well from a single angle.

fallow sandal Aug 1, 2020, 8:21 PM

#

so for example

#

I want to try with a toy goat

#

Do you suppose with one of the example training models, I could get it trained with only a few images?

charred blaze Aug 1, 2020, 8:24 PM

#

didn't really consider using pre-trained models... yeah, go with that.

flat quest Aug 1, 2020, 8:25 PM

#

yeah just a few images should work relatively well

fallow sandal Aug 1, 2020, 8:27 PM

#

Ok sweet, thank you so much for the help @charred blaze @flat quest. I really appreciate it, tensorflow object detection seems like a very powerful tool and I really want to get it figured out for projects

#

I have to get the set up installed, was watching videos at 3am and I forgot all the details to get it working

flat quest Aug 1, 2020, 8:30 PM

#

i don't know if tf has updated their object detection api for tf 2.0, but i do know they have a keras hub layer that can do obj detection out

#

should be an RCNN or SSD model

fallow sandal Aug 1, 2020, 8:35 PM

#

Yeah, I think I was watching a tutorial and they recommended the RCNN model

#

(If you have the hardware which I do, or I hope i do)

arctic cliff Aug 1, 2020, 8:42 PM

#

How can I print all the category's names ?

📎 unknown.png

#

Like the different names

#

Not the whole values for sure xD

desert oar Aug 1, 2020, 8:58 PM

#

@arctic cliff .unique()

#

Or .drop_duplicates()
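A small sketch of the two suggestions, assuming a made-up `category` column (the real column name is in the screenshot and not known here):

```python
import pandas as pd

df = pd.DataFrame({"category": ["fruit", "veg", "fruit", "dairy", "veg"]})

names = df["category"].unique()             # NumPy array, first-seen order
deduped = df["category"].drop_duplicates()  # Series, keeps original index labels

print(list(names))
print(list(deduped.index))
```

`unique()` is the shorter choice when only the names are needed; `drop_duplicates()` is handy when the row positions matter too.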

arctic cliff Aug 1, 2020, 8:59 PM

#

Another question

#

df.dropna(inplace = True)

#

What's the inplace for?

desert oar Aug 1, 2020, 9:00 PM

#

That modifies df without returning it

#

Like list.append

arctic cliff Aug 1, 2020, 9:00 PM

#

So I don't have to say df = df.dropna() ?

desert oar Aug 1, 2020, 9:01 PM

#

If you use inplace, then that's correct
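A short sketch of the difference, using a made-up one-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

result = df.dropna()                 # returns a new frame, df is unchanged
rows_before = len(df)

returned = df.dropna(inplace=True)   # modifies df itself and returns None
print(returned, rows_before, len(df), len(result))
```

Because the in-place call returns `None`, writing `df = df.dropna(inplace=True)` would throw the frame away; use one style or the other.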

arctic cliff Aug 1, 2020, 9:01 PM

#

Do I have to full-finish pandas to move on to matplotlib?

#

Because I'm following a book and pandas will take literally too much ..

#

What is left to me

📎 unknown.png

desert oar Aug 1, 2020, 9:10 PM

#

You don't have to

#

You don't need pandas at all to use matplotlib

arctic cliff Aug 1, 2020, 9:17 PM

#

How's that ?

uncut shadow Aug 1, 2020, 9:30 PM

#

They are both different and independent packages

#

U can use pandas without knowing matplotlib

#

And use matplotlib without pandas

#

But both tools will come in handy

#

So knowing How to use them will help

arctic cliff Aug 1, 2020, 9:32 PM

#

I see !

#

Thanks a lot

charred blaze Aug 1, 2020, 9:44 PM

#

consider that a reference guide, not a regular instructional book.

flat quest Aug 1, 2020, 9:57 PM

#

^^. You can't learn everything. By the time you've completed that book, you'll have forgotten more than half of it.

The real key is only remembering the parts you use most frequently, while being able to search up anything you don't know fairly quickly.

desert oar Aug 1, 2020, 10:23 PM

#

@arctic cliff matplotlib depends somewhat on numpy, but pandas not at all

#

Pandas objects all get converted to numpy before plotting

#

Pandas provides their own convenience plotting routines but you dont have to use them

arctic cliff Aug 1, 2020, 10:24 PM

#

Oh

#

@charred blaze, @flat quest, @desert oar I appreciate that !

fervent bridge Aug 1, 2020, 11:36 PM

#

📎 bm_example.png

#

📎 67759994-615b5980-fa38-11e9-9301-53432f11dac6.png

#

NPY VS HDF5, whats your guys preference?

frank bone Aug 1, 2020, 11:58 PM

#

anyone have experience on GPU parallelization & python?

#

My code is just 1 big iteration it goes from A to Z chronologically in a for loop, i.e. A1, A2, A3, A4....A6000....B1, B2.......Z6000
running through the whole dataset would take 45 days on 1 core
Would it be possible to split/parallelize the workload at the first for loop? So I could run A1..A6000 | B1..B6000 | Z1..Z6000 in parallel?

#

Numba looks promising (and very simple), would that work?

#

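```py
import numpy as np
from numba import njit, prange

@njit(parallel=True)  # prange only runs in parallel when parallel=True is set
def some_func(*args):
    out = np.zeros(length_of_output)
    for i in prange(n):
        # independent and parallel loop
        out[i] = ...some stuff...
    return out
```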

#

or is it dependent on what is run inside the for loop?

#

https://towardsdatascience.com/better-parallelization-with-numba-3a41ca69452e

Medium

Better Python Parallelization with Numba on CPU and GPU

Geo-coordinate problem with Numba: 500x faster on multiple cores, 7500x faster on GPU (RTX 2070)

desert oar Aug 2, 2020, 2:09 AM

#

@frank bone what is your actual code

#

Maybe its faster vectorized

#

Numba parallel is one option

#

Chunking it into pieces then parallelizing with joblib is another, ideally also vectorizing

#

And yes the effectiveness of numba is heavily dependent on the code you use

frank bone Aug 2, 2020, 2:11 AM

#

most of code is reading csv files, concat them, calculate new values, isolationforest algo is applied as well

desert oar Aug 2, 2020, 2:11 AM

#

Oh

#

Numba won't know what to do with that

#

Very likely you cant run that with nopython anyway

#

Just chunk into batches and run with like 4-8 cores using joblib
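A rough sketch of the "chunk into batches, one batch per core" idea. It swaps joblib for the standard library's `concurrent.futures` so it has no extra dependency; `make_chunks`, `process_chunk` and `run_parallel` are hypothetical names, and `process_chunk` only stands in for the real per-ticker work.

```python
from concurrent.futures import ProcessPoolExecutor

def make_chunks(items, n_chunks):
    """Split items into n_chunks contiguous batches of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_chunk(tickers):
    """Hypothetical per-batch work; stands in for the real backtest loop."""
    return [len(t) for t in tickers]

def run_parallel(items, n_workers=4):
    """Run process_chunk on each batch in its own process, keeping input order."""
    chunks = make_chunks(items, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
        return [value for batch in results for value in batch]
```

On platforms that spawn worker processes, `run_parallel(...)` should be called from under an `if __name__ == "__main__":` guard. This only pays off when each batch is independent of the others, as in the A1..A6000 | B1..B6000 layout described earlier.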

frank bone Aug 2, 2020, 2:13 AM

#

so its only possible on CPU?

desert oar Aug 2, 2020, 2:13 AM

#

GPUs dont really work like you think

frank bone Aug 2, 2020, 2:13 AM

#

even with a 64 core CPU it would still take almost a day

desert oar Aug 2, 2020, 2:13 AM

#

A GPU is not just like a shit ton of CPU cores

#

It can speed up operations like huge matrix multiplications

#

It wont help read 1m csv files

frank bone Aug 2, 2020, 2:14 AM

#

biggest performance hit is isolationforest, could that be sped up with a GPU?

#

im retraining the model in every iteration

#

so for whole dataset the model is trained approximately 20million times

desert oar Aug 2, 2020, 2:15 AM

#

That seems excessive

#

Isolationforest itself is embarrassingly parallel, so you can parallelize individual model fits

frank bone Aug 2, 2020, 2:16 AM

#

yeah maybe ill just model it once and see if there's a big difference but i did it for max accuracy

desert oar Aug 2, 2020, 2:16 AM

#

But your problem is also embarrassingly parallel

#

What exactly are you trying to do

frank bone Aug 2, 2020, 2:16 AM

#

creating a backtesting algo with a 30 day sliding window

#

so each new day i retrain model

#

so the model always takes the past 30 days into account

#

its a time series obviously

desert oar Aug 2, 2020, 2:19 AM

#

so why do you need to do this 20 million times?

frank bone Aug 2, 2020, 2:20 AM

#

because im testing for 8000 tickers, 250 times a YEAR* for 10 years

desert oar Aug 2, 2020, 2:20 AM

#

That seems excessive but ok

#

If this is meant to be part of your model dev workflow, pick a smaller test sample

#

Then if you are confident your model is "ready" you can just do a big run over the weekend

#

Obviously don't test on your training data and vice versa

frank bone Aug 2, 2020, 2:21 AM

#

well with 1 core itll take 40 days or so

desert oar Aug 2, 2020, 2:22 AM

#

So rent a AWS or GCP machine with more than 1 core 🤷‍♂️

frank bone Aug 2, 2020, 2:22 AM

#

and it has tons of parameters...so id love to test what works best and not wait 40 days or even 1 day each time 😄

desert oar Aug 2, 2020, 2:22 AM

#

But like i said, use a smaller test set

frank bone Aug 2, 2020, 2:22 AM

#

might do that

desert oar Aug 2, 2020, 2:23 AM

#

10 years ago the economy was really different

frank bone Aug 2, 2020, 2:23 AM

#

but home GPU cluster not possible?

desert oar Aug 2, 2020, 2:23 AM

#

Whatever your algo is, i would be pretty surprised if it worked as well 10 years ago as it does today

#

Look up and read what i said about GPUs

#

Unless you find or write a GPU implementation of isolation forests you just wont get value from a GPU

#

There might also be extant faster non GPU implementations compared to whatever is in sklearn

#

But really testing on 6000 tickers over 10 years seems wildly excessive. But im not a finance guy so idk what's considered normal

#

Why not like 1 year

#

Also you probably shouldn't burn all your data testing the first version of your model

tame fractal Aug 2, 2020, 2:26 AM

#

@desert oar wtf are you on about?

#

thats pure nonsense

#

also 8000 tickers, 250 times a YEAR* for 10 years is a small dataset

frank bone Aug 2, 2020, 2:26 AM

#

i was looking at ROCm and saw it has python API

desert oar Aug 2, 2020, 2:26 AM

#

Like i said im not a finance guy

frank bone Aug 2, 2020, 2:26 AM

#

so i was wondering if something is doable

tame fractal Aug 2, 2020, 2:26 AM

#

should easily run on 1 cpu

frank bone Aug 2, 2020, 2:27 AM

#

also 8000 tickers, 250 times a YEAR* for 10 years is a small dataset
@tame fractal that's the amount of iForest fits

#

20m

desert oar Aug 2, 2020, 2:27 AM

#

He said hes reading a csv file and retraining an isolation forest among other things each iteration

tame fractal Aug 2, 2020, 2:27 AM

#

20m rows is very small...

desert oar Aug 2, 2020, 2:27 AM

#

If that is his algorithm then it is not small

frank bone Aug 2, 2020, 2:28 AM

#

not talking about rows, talking about retraining iforest 20m times, not with too much data though each time

#

let me measure how long a fit takes

tame fractal Aug 2, 2020, 2:28 AM

#

iforest?

desert oar Aug 2, 2020, 2:28 AM

#

@frank bone the idea is that your model is meant to work on 30 days of data?

tame fractal Aug 2, 2020, 2:29 AM

#

why would you do that?

frank bone Aug 2, 2020, 2:29 AM

#

the past 30 days, sliding window

tame fractal Aug 2, 2020, 2:29 AM

#

this doesn't make sense

desert oar Aug 2, 2020, 2:29 AM

#

@tame fractal now you see what im trying to work with

#

this isnt like, fitting one model on 20m rows

tame fractal Aug 2, 2020, 2:29 AM

#

no your approach is wrong

desert oar Aug 2, 2020, 2:30 AM

#

Good, if they can build a model that uses more of the data more effectively then they dont have this weird problem

frank bone Aug 2, 2020, 2:32 AM

#

training model takes 0.2s each time

#

just measured

#

its not training on a lot of data

#

just 30 day window each time

#

adds up though

desert oar Aug 2, 2020, 2:33 AM

#

You are using it to look for outliers?

frank bone Aug 2, 2020, 2:34 AM

#

yes

desert oar Aug 2, 2020, 2:34 AM

#

In a 30 day window?

frank bone Aug 2, 2020, 2:34 AM

#

yep

desert oar Aug 2, 2020, 2:35 AM

#

Maybe @tame fractal has an opinion on that. Sounds like you have some experience with this and can help out identifying some more conventional and tractable modeling techniques

frank bone Aug 2, 2020, 2:35 AM

#

i first used simple SMA + SD but i dont find it as accurate as iForest

#

in the worst case ill just get a 128 core server CPU 😄

#

but if there's a way to get it done on a laptop let me know your ideas

modest rune Aug 2, 2020, 2:52 AM

#

Maybe someone has a great data science discord server they can point me to, but I haven't found one. But I need some guidance.

In python, I am trying to calculate the time decay of a stock option, given the stock option's purchase price and the number of days until it expires. For example, if a call option costs 50 cents and I have 20 days till expiration. How will that price decay over the next 20 days assuming the underlying stock price never changes. The output should be a 20 element array containing the price at T=20 days left, T=19 days left... all the way to 0.

I know the black scholes model is capable of calculating what I want, but I need to get the T in the black scholes model on to the left side of the equation. I'd be able to do that, except that T is found inside the CDF function and I have no clue how to handle that.

Good news is that I have simplified the problem because r is zero in my case.

r = risk free interest rate = 0

📎 unknown.png
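One possible way to get the decay curve without solving for T is to evaluate the formula forward for each remaining day. The sketch below is only that: it assumes r = 0, a European call, a made-up spot and strike, a 365-day year, and a volatility backed out by bisection from the 50-cent price. Only the price and the 20 days come from the question; every function name is illustrative.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_price(spot, strike, sigma, t_years):
    """Black-Scholes call price with r = 0; intrinsic value at expiry."""
    if t_years <= 0:
        return max(spot - strike, 0.0)
    vol = sigma * math.sqrt(t_years)
    d1 = (math.log(spot / strike) + 0.5 * vol * vol) / vol
    return spot * norm_cdf(d1) - strike * norm_cdf(d1 - vol)

def implied_vol(price, spot, strike, t_years, lo=1e-6, hi=5.0):
    """Bisect for the sigma that reproduces the observed price."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if call_price(spot, strike, mid, t_years) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed inputs: spot and strike are invented for the example.
spot, strike, price_now, days = 100.0, 102.0, 0.50, 20
sigma = implied_vol(price_now, spot, strike, days / 365)
decay = [call_price(spot, strike, sigma, d / 365) for d in range(days, -1, -1)]
print(len(decay), round(decay[0], 2), decay[-1])
```

Note that counting from T=20 down to T=0 inclusive gives 21 values, and the result depends heavily on the assumed spot and strike.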

frank bone Aug 2, 2020, 2:59 AM

#

There's also a Monte-Carlo approach to option pricing, maybe that would work? Just quickly read about it somewhere, don't know if works for you but maybe worth a look.

#

http://www.codeandfinance.com/pricing-options-monte-carlo.html

Pricing options using Monte Carlo simulations | Code and Finance

Introduction to pricing European options using a Monte Carlo simulation.

#

Black Scholes is only for European Style options i think, monte carlo can also be applied to American Style

modest rune Aug 2, 2020, 3:27 AM

#

Black Scholes is still useful for American style options and there are modified versions of black scholes that are commonly used that take into account dividends and the ability to exercise the option at any time.

#

So, I'd say, that Black Scholes is not only for European Style, rather it was intentionally created for European Options by Black Scholes and Merton, and they were fully aware that with some small modifications it could be made to work in different scenarios.

frank bone Aug 2, 2020, 3:29 AM

#

didnt know that, its more resource saving too

tame fractal Aug 2, 2020, 6:08 AM

#

📎 17rmegjt0he51.png

past rover Aug 2, 2020, 7:24 AM

#

Anyone know of a Discord for NLP?

#

(Natural Language Processing)

desert parcel Aug 2, 2020, 7:26 AM

#

📎 unknown.png

#

does anyone know what nan means

pale thunder Aug 2, 2020, 7:26 AM

#

Not a Number

desert parcel Aug 2, 2020, 7:28 AM

#

Got it

fallow sandal Aug 2, 2020, 7:38 AM

#

I finally got TensorFlow installed, after 3 hours of troubleshooting cause of errors ^_^

frank bone Aug 2, 2020, 7:49 AM

#

how can I execute a php script from within python and pass command line arguments along?
solution: import subprocess, then subprocess.run(["php", "-f", "/path/to/php/script.php", "arg1", "arg_n"], cwd="/path/to/php/", stdout=subprocess.PIPE, check=True)

#

where do i pass command line arguments?

frank bone Aug 2, 2020, 8:08 AM

#

got it

desert parcel Aug 2, 2020, 9:08 AM

#

📎 unknown.png

#

There is a huge gap in my predictions

tidal bough Aug 2, 2020, 9:09 AM

#

..so?

desert parcel Aug 2, 2020, 9:09 AM

#

I don't know what to do

#

https://colab.research.google.com/drive/1eb__paIZoLPdaRSdpcGPIp69lmMgvKMn#scrollTo=-uC7KKYnTTMw

Google Colaboratory

#

lowest loss for lr is 1e-6 anything above gives loss of over 5k anything below is NaN

desert parcel Aug 2, 2020, 10:51 AM

#

I got some where and I found the issue after a lot of digging

#

Can you have floats and ints in a numpy array?

#

Because from messing around I notice that the values will get converted to whatever data type has priority

#

I am not sure what the term is

#

```py
input = np.array([
                  [[313, 1], #HCL
                   [323, 1],
                   [333, 1],
                   [343, 1]], 
                  [[313, 10e-3], #Ortho
                   [323, 10e-3],
                   [333, 10e-3],
                   [343, 10e-3]], 
                  [[313, 10e-3], #Para
                   [323, 10e-3],
                   [333, 10e-3],
                   [343, 10e-3]]
                  ], dtype='int64')
```
Output:

tensor([[[313,   1],
         [323,   1],
         [333,   1],
         [343,   1]],

        [[313,   0],
         [323,   0],
         [333,   0],
         [343,   0]],

        [[313,   0],
         [323,   0],
         [333,   0],
         [343,   0]]])

#

int64 makes everything an int

#

float32 makes everything a float

tidal bough Aug 2, 2020, 10:53 AM

#

Can you have floats and ints in a numpy array?
nope, don't think so. Well, you can, but only by specifying the type as obj and losing all performance advantages.

#

Basically, numpy arrays are wrappers over C arrays, so they need to have a constant dtype.
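A small sketch of the single-dtype rule, using two rows borrowed from the array above:

```python
import numpy as np

values = [[313, 1], [313, 10e-3]]

as_int = np.array(values, dtype="int64")      # 10e-3 (= 0.01) is truncated to 0
as_float = np.array(values, dtype="float32")  # 313 is stored as 313.0
inferred = np.array(values)                   # no dtype given -> float64

print(as_int.tolist())
print(as_float.dtype, inferred.dtype)
```

Storing everything as a float loses nothing here, since whole numbers like 313 are represented exactly.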

desert parcel Aug 2, 2020, 10:53 AM

#

what's a C array?

tidal bough Aug 2, 2020, 10:54 AM

#

an array in the C language 😛

desert parcel Aug 2, 2020, 10:54 AM

#

Oh alright

#

so then what are the options then?

#

Other than making it an object

#

Like making two new arrays?

#

then some how combining them?

tidal bough Aug 2, 2020, 10:55 AM

#

Why would you need an array of mixed ints and floats?

desert parcel Aug 2, 2020, 10:56 AM

#

the tables says so

#

let me get it

#

📎 unknown.png

#

and there are other floats in there

#

but not all of them are floats

#

Oh wait I have an idea

tidal bough Aug 2, 2020, 10:57 AM

#

I don't see a single value that can't be a float here.

desert parcel Aug 2, 2020, 10:58 AM

#

wdym

#

When it becomes a float

#

📎 unknown.png

#

The 313 becomes a 3.13

tidal bough Aug 2, 2020, 10:59 AM

#

what's the problem with converting it to a float?

desert parcel Aug 2, 2020, 10:59 AM

#

I don't want all of them to be a float only some of them

tidal bough Aug 2, 2020, 10:59 AM

#

The 313 becomes a 3.13
uhh, no? Have you even seen scientific notation?

#

3.13e+02 is 313.

desert parcel Aug 2, 2020, 10:59 AM

#

oh.

#

I didn't know that

tidal bough Aug 2, 2020, 10:59 AM

#

3.13e+02 is 3.13 * 10**2
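This can be checked directly in plain Python:

```python
value = float("3.13e+02")   # mantissa 3.13 times 10 to the power 2
formatted = f"{313:.4e}"    # the style seen in tensor printouts
small = 10e-3               # 10 * 10**-3

print(value, formatted, small)
```

So the printed tensor still holds 313; only the display format changed.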

desert parcel Aug 2, 2020, 11:00 AM

#

is 10**2 10^2?

tidal bough Aug 2, 2020, 11:00 AM

#

yeah

desert parcel Aug 2, 2020, 11:00 AM

#

alright

tidal bough Aug 2, 2020, 11:00 AM

#

https://en.wikipedia.org/wiki/Scientific_notation

Scientific notation

Scientific notation (also referred to as scientific form or standard index form, or standard form in the UK) is a way of expressing numbers that are too big or too small to be conveniently written in decimal form. It is commonly used by scientists, mathematicians and engineers...

desert parcel Aug 2, 2020, 11:00 AM

#

alright

#

```py
tensor([[[3.1300e+02, 1.0000e+00],
         [3.2300e+02, 1.0000e+00],
         [3.3300e+02, 1.0000e+00],
         [3.4300e+02, 1.0000e+00]]
```

#

Is this a 2D tensor or a 3D?

#

or is it 1D?

tidal bough Aug 2, 2020, 11:08 AM

#

check .shape 😛

#

it's 3D, though.

#

three brackets.

desert parcel Aug 2, 2020, 11:09 AM

#

Oh

#

I gave [3, 4,2]

#

I just thought that meant 3, 4 by 2s

#

📎 unknown.png

#

```py
input = np.array([[[313.0, 1], #HCL
                   [323.0, 1],
                   [333.0, 1],
                   [343.0, 1]]], dtype='float32')
```

#

Did I do the tensor right?

tidal bough Aug 2, 2020, 11:20 AM

#

3,4,2 means 3 "layers" of 4 rows of 2 elements.
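A quick sketch of how the shape tuple maps to dimensions, with zero-filled placeholder arrays:

```python
import numpy as np

stack = np.zeros((3, 4, 2), dtype="float32")  # 3 layers, each 4 rows by 2 columns
table = np.zeros((4, 2), dtype="float32")     # a single 4x2 table is 2-D

print(stack.ndim, stack.shape)
print(table.ndim, table.shape)
print(stack[0].shape)   # one layer is itself a 4x2 table
```

The number of entries in `.shape` (equivalently `.ndim`) is the number of dimensions, which matches the count of opening brackets in the printout.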

#

that table looks 2d, no idea why you're making a 3d tensor .

gleaming gyro Aug 2, 2020, 2:32 PM

#

is mag() of python2 the same as abs() of python3?

fervent bridge Aug 2, 2020, 2:54 PM

#

Hmm is r+ new to numpy.memmap? I have searched and all links show that previously it wasn't possible to append to npy files. Seems like this has changed? But no one has talked about it?

acoustic halo Aug 2, 2020, 2:56 PM

#

You can't append to them, they are fixed size, r+ just mean read and write

fervent bridge Aug 2, 2020, 3:26 PM

#

@acoustic halo then what is a solution, I have 40,000 image arrays and can't load them into memory to just store it all in one go into a npy file

acoustic halo Aug 2, 2020, 3:26 PM

#

make an empty memmap of the right size first

#

You know how many images and their size

#

np.memmap('filename', dtype='float32', mode='w+', shape=(40000, x, y))  # or whatever dtype you need

fervent bridge Aug 2, 2020, 3:43 PM

#

40000, 277, 277, 3

#

hmm but then what would prevent it from overwriting @acoustic halo

acoustic halo Aug 2, 2020, 3:44 PM

#

You can overwrite it freely

fervent bridge Aug 2, 2020, 3:44 PM

#

is HDF5 any good would it prevent me from having to load in all the files at once? also does it support memmap or something similar?

#

But would overwriting not delete previous data? per iteration?

acoustic halo Aug 2, 2020, 3:44 PM

#

as long as its open in a write mode

#

What do you mean? Yes overwriting will erase old data but why do you want to keep old data ?

#

While it is open in a write mode, you can edit any part of the memmap in any order you want. It won't delete unmodified data, except when you open it as w+, which creates or overwrites the file. To reopen an existing file and keep its contents, open it as r+ instead

fervent bridge Aug 2, 2020, 3:48 PM

#

I want to loop over the image one by one and add into the npy file so that I do not have to load all 40000 images at once, ah ok I get it so I just have to keep it open with With open and loop through in there

acoustic halo Aug 2, 2020, 3:49 PM

#

No, you save it to a variable

fervent bridge Aug 2, 2020, 3:49 PM

#

So just create a file pre-established hmm let me try

acoustic halo Aug 2, 2020, 3:49 PM

#

so x = np.memmap(all the parameters)

#

then treat x as if it were an np array

#

just remember to flush it before exiting to save it all to file permanently

#

Just note that when you do flush it may freeze up your pc because it will max out disk usage for awhile, it'll be several gbs in size for that many images
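Put together, the workflow described above might look like the sketch below. The sizes are tiny stand-ins for the real (40000, 277, 277, 3) array, the temporary path is arbitrary, and the "image" is generated instead of loaded. Note that `np.memmap` writes raw bytes without a `.npy` header, so the dtype and shape must be supplied again on reopening; `np.lib.format.open_memmap` is the variant that produces a true `.npy` file.

```python
import os
import tempfile
import numpy as np

# Small stand-in sizes; the chat's real shape was (40000, 277, 277, 3).
n_images, height, width, channels = 8, 4, 4, 3
path = os.path.join(tempfile.mkdtemp(), "images.dat")

# w+ creates (or overwrites) the file at its full, fixed size.
store = np.memmap(path, dtype="float32", mode="w+",
                  shape=(n_images, height, width, channels))
for i in range(n_images):
    # Stand-in for loading one image from disk; only this one is in memory.
    image = np.full((height, width, channels), i, dtype="float32")
    store[i] = image
store.flush()   # push pending writes to disk
del store

# r+ reopens the existing file for reading and writing without erasing it.
reopened = np.memmap(path, dtype="float32", mode="r+",
                     shape=(n_images, height, width, channels))
print(reopened[5, 0, 0, 0])
```

Each loop iteration writes into its own slot `store[i]`, so nothing written earlier is overwritten.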

pliant sky Aug 2, 2020, 4:17 PM

#

hi new here! found this discord while trying to learn python. currently trying to understand modulo. I understand the concept but i wanted to get an idea of practical applications (so i guess i don't really understand it)

acoustic halo Aug 2, 2020, 4:19 PM

#

This is probably the wrong channel for it, but in essence, modulo gives the remainder after a division ex 5 / 2 = 2 remainder 1, so 5%2 = 1

pliant sky Aug 2, 2020, 4:20 PM

#

oh sorry, which channel? i'm interested in learning python for data science? i understand that modulo gives remainder

acoustic halo Aug 2, 2020, 4:21 PM

#

As for practical applications, converting 24h time to 12 hour time by doing time%12

#

so 13h would be converted to 1, 16 to 4 etc etc
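A small sketch of that clock conversion. One wrinkle: `12 % 12` and `0 % 12` are both 0, so a clock display usually substitutes 12 for a remainder of 0. The function name `to_12_hour` is made up for the example.

```python
def to_12_hour(hour_24):
    """Convert an hour in 0..23 to a (hour, suffix) pair on a 12-hour clock."""
    suffix = "AM" if hour_24 < 12 else "PM"
    hour_12 = hour_24 % 12
    if hour_12 == 0:      # 0 and 12 both leave remainder 0; clocks show 12
        hour_12 = 12
    return hour_12, suffix

for hour in (0, 12, 13, 16):
    print(hour, to_12_hour(hour))
```

Other everyday uses follow the same wrap-around pattern, such as cycling through a fixed list of colours or checking whether a number is even with `n % 2 == 0`.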

pliant sky Aug 2, 2020, 4:22 PM

#

i see thanks. for future reference where should i ask noob questions?

acoustic halo Aug 2, 2020, 4:22 PM

#

Probably the general channel or one of the help channels

#

or maybe the comp-sci channel

pliant sky Aug 2, 2020, 4:23 PM

#

thanks will do

ebon nebula Aug 2, 2020, 5:13 PM

#

Hello all. Has anyone finished/enrolled in the IBM data -science specialization and if so, would you recommend it to others.

visual violet Aug 2, 2020, 6:08 PM

#

📎 unknown.png

#

how do i filter out these things

oblique belfry Aug 2, 2020, 6:10 PM

#

What things?

visual violet Aug 2, 2020, 6:10 PM

#

2/29/2020

#

i cant do

#

df_2020 = df_2020[(df_2020['Date Time'] != '2/29/2020')]

#

bc there it is not exact like that
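If the column holds a date plus a time (an assumption, since the screenshot is not available), one option is to parse it and compare only the date part. The rows and the timestamp format below are made up for illustration.

```python
import pandas as pd

# Made-up rows; the assumption is that 'Date Time' also carries a time part.
df_2020 = pd.DataFrame({
    "Date Time": ["2/28/2020 23:00", "2/29/2020 0:00", "2/29/2020 13:30", "3/1/2020 8:15"],
    "value": [1, 2, 3, 4],
})

stamps = pd.to_datetime(df_2020["Date Time"], format="%m/%d/%Y %H:%M")
leap_day = pd.Timestamp("2020-02-29")

# normalize() sets every time to midnight, so only the date is compared.
df_2020 = df_2020[stamps.dt.normalize() != leap_day]
print(df_2020["value"].tolist())
```

The `format` string has to match the real data; if the strings look different, it needs adjusting.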

fallow sandal Aug 2, 2020, 6:14 PM

#

Anyone here experienced with tensorflow object detection api? I looked at some old tutorials and new tutorials, do I still have to resize my images for training?

lapis sequoia Aug 2, 2020, 8:13 PM

#

@fallow sandal You most probably do need to do that.

fallow sandal Aug 2, 2020, 8:14 PM

#

I was just confused why all the tensorflow v2 zoo pretrained models all had resolutions in their name so thanks for the clarifications

lapis sequoia Aug 2, 2020, 8:16 PM

#

what are some good Machine/deep learning projects for an undergrad

odd yoke Aug 2, 2020, 9:19 PM

#

many operations (almost all really) operate on a known size tensor, this is why we enforce some size in the algorithm itself, resizing directly at the entry point of the model @fallow sandal

#

as for why the image sizes are generally so small, it was found that the performance of the algorithms improves only sub-linearly with image size, while the time it takes to run grows linearly with the number of pixels

#

making an image 512x512 won't make the algorithm's result much better, but it may very well quadruple the time it takes to complete compared to a 256x256
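The factor of four comes straight from the pixel counts:

```python
small = 256 * 256   # pixels in a 256x256 image
large = 512 * 512   # pixels in a 512x512 image

print(small, large, large / small)
```

Doubling each side quadruples the number of pixels, so work that scales with pixel count scales by about the same factor.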

fallow sandal Aug 2, 2020, 9:21 PM

#

ah, less pixels to process?

odd yoke Aug 2, 2020, 9:21 PM

#

yes

fallow sandal Aug 2, 2020, 9:21 PM

#

ok ty so much for the explanation, that was very helpful

drifting umbra Aug 2, 2020, 9:43 PM

#

@visual violet do you want to remove blanks?
you can drop np.nan
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

#

@lapis sequoia
https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/

Analytics Vidhya

Machine Learning Projects | Data Science Projects with Example

This article lists the best machine learning, data science projects for beginners to advanced level with example code to boost your knowledge and skills.

lapis sequoia Aug 2, 2020, 9:46 PM

#

I'm using Hyperas for hyperparameter tuning but it seems that for some reason the data function passed to optim.minimize() is giving an error if I'm doing pd.read_csv in it

drifting umbra Aug 2, 2020, 9:48 PM

#

@lapis sequoia look at what type optim.minimize() wants for an input maybe?

lapis sequoia Aug 2, 2020, 9:49 PM

#

it takes functions and the model thats not the issue

drifting umbra Aug 2, 2020, 9:50 PM

#

sorry

#

i dont understand error and am not super advanced but happy to try help.
do you have exact error?

lapis sequoia Aug 2, 2020, 9:52 PM

#

thanks i figured it out

#

I just imported pandas inside the function

#

for some reason hyperas isn't using global imports

flat quest Aug 2, 2020, 9:56 PM

#

@lapis sequoia the best projects are honestly the ones that you create, rather than something already out there.

Have a question you'd like to answer, collect data, then run an analysis on that collected data, documenting each step of that process.

lapis sequoia Aug 2, 2020, 9:57 PM

#

thats the goal, but to get there i need reference i'm a noob xD

flat quest Aug 2, 2020, 10:04 PM

#

xD
Hmmm do you have any experience with the ds packages?

fallow sandal Aug 2, 2020, 10:06 PM

#

I can't seem to find documentation on how to manually load my own trained model, all the documentation code for object detection has code where it downloads a pretrained model from the internet

odd yoke Aug 2, 2020, 10:11 PM

#

did you look at this https://www.tensorflow.org/tutorials/keras/save_and_load ? I'm assuming you use keras

#

alternatively, tf.saved_model.load, passing the export dir containing checkpoints and metadata as a parameter

fervent bridge Aug 2, 2020, 10:49 PM

#

Ok so I have 40,000 images I need stored as Arrays, came to the conclusion that HDF5 is best as it lets me append to the HDF5 file, figured best to use Pytables also. So any idea on how I would iterate over these images and append to HDF5?

#

I would like to store my X_train, X_test, X_val and labels in the same file but under different directories

odd yoke Aug 2, 2020, 10:55 PM

#

are there any issues with manually iterating through the files and generating the arrays and putting them in batch inside dataset objects ?

fervent bridge Aug 2, 2020, 11:04 PM

#

Batches of 500? 80 batches @odd yoke ?

#

I don't know, you think working with batches is easier?

#

I would have to put my Training, Testing and Validation data all in batches, would TensorFlow easily read these?

odd yoke Aug 2, 2020, 11:12 PM

#

hdf5 has a concept of datasets and groups, datasets are basically files, in this case, i'd just make them contain one image, groups are folder like structures, which contain multiple datasets
processing the files in batches makes it easier to create groups without using up all your ram
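A sketch of the groups-and-batches idea, with some swaps worth flagging: it uses h5py rather than PyTables, and it keeps one resizable dataset per split instead of one dataset per image. The image shape, split sizes and the `fake_batches` helper are all made up. Real code would read and decode the image files where the random arrays are generated.

```python
import os
import tempfile

import h5py
import numpy as np

H, W, C = 32, 32, 3  # hypothetical image shape
BATCH = 500


def fake_batches(n_images, batch_size):
    """Stand-in for loading image files from disk, one batch at a time."""
    rng = np.random.default_rng(0)
    for start in range(0, n_images, batch_size):
        n = min(batch_size, n_images - start)
        images = rng.integers(0, 256, size=(n, H, W, C), dtype=np.uint8)
        labels = rng.integers(0, 10, size=n, dtype=np.int64)
        yield images, labels


path = os.path.join(tempfile.mkdtemp(), "images.h5")
with h5py.File(path, "w") as f:
    for split, n_images in [("train", 1200), ("val", 300), ("test", 300)]:
        group = f.create_group(split)  # folder-like container
        # maxshape=None on the first axis makes the dataset growable.
        x = group.create_dataset("X", shape=(0, H, W, C), maxshape=(None, H, W, C),
                                 dtype="uint8", chunks=(64, H, W, C))
        y = group.create_dataset("y", shape=(0,), maxshape=(None,),
                                 dtype="int64", chunks=(1024,))
        for images, labels in fake_batches(n_images, BATCH):
            old, n = x.shape[0], len(images)
            # "Append": grow both datasets, then fill the new rows.
            x.resize(old + n, axis=0)
            x[old:old + n] = images
            y.resize(old + n, axis=0)
            y[old:old + n] = labels

# Reading back only pulls the requested slice into RAM.
with h5py.File(path, "r") as f:
    shapes = {name: f[name]["X"].shape for name in f}
    first_ten = f["train/X"][:10]
print(shapes)
```

Only one batch is in memory at a time while writing, and slicing on read loads only the rows asked for.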

#

if you're working with tensorflow, one thing I use fairly often is called a TFRecord, it's a file format that contains data in the form of a sequence of dictionary-like structures

#

it's very well supported within tensorflow obviously, which makes it trivial to pass them into models later on
they are also very easy to split into what are called "shards", which is basically just splitting the TFRecords into multiple files, very useful again when your data is too big to fit in RAM, or if you want to shuffle etc

#

(Also, TFRecords are nowhere near as complex as HDF5, so you may be missing out on some features, but at the same time they can also be much easier to use)
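To make the TFRecord description concrete, here is a small sketch that writes two shards and reads them back with tf.data. The image shape, feature keys ("image", "label") and file names are assumptions picked for the example, and the random arrays stand in for real images.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

H, W, C = 8, 8, 3  # hypothetical image shape


def to_example(image, label):
    """Pack one image and its label into a tf.train.Example (a key/value record)."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }))


def parse(serialized):
    """Turn one serialized record back into an (image, label) pair."""
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.uint8), (H, W, C))
    return image, parsed["label"]


rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, H, W, C), dtype=np.uint8)
labels = np.arange(10)

# Write the records round-robin into two shard files.
out_dir = tempfile.mkdtemp()
num_shards = 2
paths = [os.path.join(out_dir, f"train-{i:02d}.tfrecord") for i in range(num_shards)]
for shard, path in enumerate(paths):
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images[shard::num_shards], labels[shard::num_shards]):
            writer.write(to_example(image, label).SerializeToString())

# Read all shards back as one streamed dataset.
dataset = tf.data.TFRecordDataset(paths).map(parse).batch(4)
n_read = sum(int(batch_labels.shape[0]) for _, batch_labels in dataset)
print(n_read)
```

The dataset is streamed from disk, so the full set of images never has to sit in RAM at once.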

fervent bridge Aug 2, 2020, 11:19 PM

#

Hmm, so I was first looking at NPY, then I went to research HDF5 because NPY does not allow me to append. I want to iterate and store all my arrays in the same set, so that it would be the same as if I had them stored in a variable in RAM. I want 3 datasets in HDF5: one each for X_train, X_test, X_val.

#

I will take a look at TFRecords if you recommend it

#

really just want to know what's the best way to proceed

#

@odd yoke

odd yoke Aug 2, 2020, 11:23 PM

#

I think either can work really, there'll be pros and cons for every format, for example, TFRecords are not widely supported across most libraries, let alone most languages, this will tie you to tensorflow

#

I don't have much experience with HDF5 to talk about its pros & cons tho

#

I am confident that if you stick to tensorflow, TFRecords will make it easier to interact with your models, manipulate your splits however you like, etc

#

I'm also not really sure what you mean by "append" here, can you elaborate on that ?

#

Do you want to add images to existing TFRecords/NPY/HDF5 dynamically ?

fervent bridge Aug 2, 2020, 11:26 PM

#

Have an open file and write/append each image array

#

Loop through my images and append to the previous write in the file

odd yoke Aug 2, 2020, 11:27 PM

#

I don't think you can do that with TFRecords either, what you can do however is create a new tfrecord, and add it to the list of the files you use when you generate your tf.data.Dataset object

#

You can also sync up everything every once in a while, by merging them all together

#

Tho this is a costly operation

#

one difference between tfrecords and hdf5 is that with tfrecords the data doesn't have to be homogeneous, you basically store key/value maps, which can contain all the metadata you want of any type

#

now, whether that is useful or not depends on your use-case

fervent bridge Aug 2, 2020, 11:30 PM

#

Eh I just want to pass in these arrays, if my RAM allowed I would have one X_train list with 40,000 arrays and just pass these into my model but I can't so hmm I have to keep researching

#

Thanks, I will look into both HDF5 and TFRecords

fallow sandal Aug 2, 2020, 11:40 PM

#

alternatively, tf.saved_model.load, passing the export dir containing checkpoints and metadata as a parameter
@odd yoke I'm not sure if I'm using keras (still new to this, sorry). I'm trying to modify the notebook file to fit my custom trained model, the last thing I have to modify is that I have to figure out what this function returns as a data type

#

📎 unknown.png

#

since I assume this is where it loads the model

#

📎 unknown.png

#

trying to dissect what load_model() does, and how I can just substitute the path directory to my pretrained model

#

Is this where the thing you mentioned will work? (tensorflow.saved_model.load()?) My tensorflow isn't imported as tf

odd yoke Aug 2, 2020, 11:42 PM

#

you want to replace model_dir with your path

fallow sandal Aug 2, 2020, 11:42 PM

#

Oh I guess it does import it as tf nvm

#

Ok, I will try that

#

Thank you so much @odd yoke

odd yoke Aug 2, 2020, 11:43 PM

#

the current one uses tf.keras.utils.get_file, you can remove that line and pass in your own path directly

#

where your saved model is

fallow sandal Aug 2, 2020, 11:44 PM

#

so just ignore the parameters inside?

#

"(
fname=model_name,
origin=base_url + model_file,
untar=True)"

#

sorry my ''' is broken on my keyboard so i can't do the gist thing

odd yoke Aug 2, 2020, 11:44 PM

#

remove the whole tf.keras.utils.get_file(...) part
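A sketch of what that change could look like. The notebook's load_model() is only shown in screenshots, so this assumes it ends by calling tf.saved_model.load on a directory. The tiny `Doubler` module and the temp directory are stand-ins for a custom trained model and wherever it was exported.

```python
import os
import tempfile

import tensorflow as tf


class Doubler(tf.Module):
    """Stand-in for a custom trained model."""

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def __call__(self, x):
        return 2.0 * x


# Pretend this is where the trained model was exported earlier.
model_dir = os.path.join(tempfile.mkdtemp(), "saved_model")
tf.saved_model.save(Doubler(), model_dir)


def load_model(model_dir):
    # No tf.keras.utils.get_file(...) here: model_dir is already a local
    # path, so nothing needs to be downloaded or untarred.
    return tf.saved_model.load(str(model_dir))


model = load_model(model_dir)
result = model(tf.constant([1.0, 2.0])).numpy().tolist()
print(result)
```

With a real export, model_dir would be the folder that contains saved_model.pb and the variables directory.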