#data-science-and-ml


visual violet
#

you work with money

#

you know how to fix?

visual violet
#

i found it

velvet thorn
#

Hey all, I have this dataframe and need to do some subtraction. Every fourth row should be subtracted from the previous three rows. For example: Row 3's values should be subtracted from rows 0,1,2 and then row 7's values should be subtracted from rows 4,5,6 and so forth. How can I accomplish this via something like df.diff()?
@marsh berry groupby and transform
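
A minimal sketch of the groupby-and-transform idea, assuming the rows come in consecutive blocks of four and the last row of each block is the one to subtract. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 3 and 7 are the baselines for their blocks.
df = pd.DataFrame({
    "a": [10, 20, 30, 1, 40, 50, 60, 2],
    "b": [5, 6, 7, 1, 8, 9, 10, 2],
})

# Label each run of four consecutive rows with the same block number.
block = np.arange(len(df)) // 4

# Broadcast the last row of every block back onto all rows of that block.
baseline = df.groupby(block).transform("last")

# Subtract the baseline; the baseline rows themselves become zero.
result = df - baseline
print(result)
```

If the baseline rows should be kept unchanged rather than zeroed, they can be copied back from `df` afterwards.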

desert parcel
#
```py
input = np.array([
                  [[313, 1], #HCL
                   [323, 1],
                   [333, 1],
                   [343, 1]], 
                  [[313, 10e-3], #Ortho
                   [323, 10e-3],
                   [333, 10e-3],
                   [343, 10e-3]], 
                  [[313, 10e-3], #Para
                   [323, 10e-3],
                   [333, 10e-3],
                   [343, 10e-3]]
                  ], dtype='float32')
 
target = np.array([[[14.76, 16.42, 18.08, 23.41]],
                    [[5.87, 11.14, 13.20, 25.72]],
                    [[2.73, 4.42, 8.04, 13.68]]], dtype='float32')```
```py
loss_fn = F.mse_loss
loss = loss_fn(model(input), target)
```

Output:

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: UserWarning: Using a target size (torch.Size([3, 1, 4])) that is different to the input size (torch.Size([3, 4, 4])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
#

Actually I'll move this to help channel sorry
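
The warning says the prediction has shape (3, 4, 4) while the target has shape (3, 1, 4), so the loss is computed after broadcasting. A rough sketch of one way to line the shapes up, using NumPy as a stand-in for the tensors: it assumes each (temperature, concentration) row should map to a single value, which the thread does not confirm. `pred` and `strict_mse` are hypothetical names.

```python
import numpy as np

# Target from the message above: 3 substances, 4 values each, shape (3, 1, 4).
target = np.array([[[14.76, 16.42, 18.08, 23.41]],
                   [[5.87, 11.14, 13.20, 25.72]],
                   [[2.73, 4.42, 8.04, 13.68]]], dtype="float32")

# Stand-in for a model output with one value per input row (assumption).
pred = np.zeros((3, 4, 1), dtype="float32")

# Swap the last two axes so each target value lines up with its input row.
target_aligned = target.transpose(0, 2, 1)  # shape (3, 4, 1)


def strict_mse(pred, target):
    """Mean squared error that refuses to broadcast mismatched shapes."""
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    return float(np.mean((pred - target) ** 2))


print(pred.shape, target_aligned.shape)
print(strict_mse(pred, target_aligned))
```

Under the same assumption, the model's last layer would need an output width of 1 instead of 4.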

grave frost
#

Is it a good place to ask Machine Learning Questions here??

#

Hey guys, newbie here. Just wanted some pointers on how to proceed with one of my projects. I am pretty experienced in many parts of ML, but there is one that I am finding almost no help/advice on.
I basically want to build a model (PyTorch or TensorFlow/Keras) that takes a number and a string as features and tries to find some highly complex relation between the 2 fields. I have already made a CSV file where each row has a number and then a string, separated by a comma. The model will then predict the number based on the string.

So I wanted some pointers on how the model should be built, like what layers to use, recommended hidden layers for pretty complex relations, or maybe some pre-trained NN you are aware of that can accomplish this task. I tried Googling it but didn't come up with many examples that do that. I ruled out LSTM and GRU because they do not find very complex relations in the data, but I would experiment with them if the experts think so.
So does anybody have an idea how to accomplish this??

thin terrace
#

Is imbalance at the feature level also bad, like class imbalance? I mean, a feature that has the same value for all instances is useless as it gives no unique information and thus should be dropped. But what about a feature that is 95% value A and 5% value B? Should it also be dropped? How do we approach this? Is there a term for this like "class imbalance"? Any resources for reading more on the topic?

fluid burrow
#

can someone help me, i am not able to create a dataframe using pandas

thin terrace
#

ValueError: arrays must all be same length

frank bone
#

you got 1x age too much

fluid burrow
#

ahh , thanks found it.

grave frost
#

@thin terrace Look up down-sampling and up-sampling (Data Augmentation) if unbalanced classes are what you mean....

thin terrace
#

It's not

desert oar
#

@thin terrace good question. If there isn't already a topic for that on stats.stackexchange.com you might get good answers if you ask it there

fiery frost
#

Is there someone here who knows ML and would let me DM them?

#

for some short q.

velvet thorn
#

@thin terrace feature variance

#

"should it be dropped" is not a simple question to answer

spark cape
#

If I have two data frames with the same date index and I want to join on some fields in lhs... lhs.join(rhs, on=['field1', 'field2', 'field3']) doesn't work. So it looks like it needs to be joined on the index. But if I use groupby, it doesn't work because you can't join on groupby objects. 😕
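
One possible way to join on the date index and on columns at the same time is to turn the index into a regular column and use `merge`. This is only a sketch: the frames and the `date` index name are invented, and only `field1` from the question is used.

```python
import pandas as pd

dates = pd.to_datetime(["2020-05-01", "2020-05-01", "2020-05-02"])

# Hypothetical frames that share a date index and a key column.
lhs = pd.DataFrame({"field1": ["x", "y", "x"], "left_val": [1, 2, 3]},
                   index=pd.Index(dates, name="date"))
rhs = pd.DataFrame({"field1": ["x", "y", "x"], "right_val": [10, 20, 30]},
                   index=pd.Index(dates, name="date"))

# Move the index into a column, merge on index plus fields, then restore it.
merged = (
    lhs.reset_index()
       .merge(rhs.reset_index(), on=["date", "field1"], how="inner")
       .set_index("date")
)
print(merged)
```

A groupby object has to be aggregated first (for example with `.sum()`) before the result can be merged this way.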

velvet thorn
#

consider this: when you train a model, you're basically deriving a relationship between features (input) and target (output), keeping in mind certain assumptions

#

simplest example: linear regression; the assumption is that the target is a linear function of the features

#

so, going back to your question...

#

how much predictive power does the feature add to your model?

#

it must be clear that even a column of random values may add some predictive power, simply because of the distribution of your target (so, basically, overfitting).

#

say it's a classification problem and even though the feature's value ratio is 95:5, it is so strongly predictive that all of the samples with the minority value for that feature are of the same class.

#

if the training sample roughly reflects real-world data, you basically get confirmed predictions for 5% of incoming samples for free.

#

of course, it's not clear that that is actually the case, and if it's not, then you have overfitting. that is one of the reasons you would consider dropping a feature with low variance.
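
To make the 95:5 discussion concrete, here is a small sketch that checks how skewed a feature is and how its values relate to the target. The data is synthetic and the column names are placeholders.

```python
import pandas as pd

# Hypothetical data: binary feature with a 95:5 split and a binary target.
df = pd.DataFrame({
    "feature": ["A"] * 95 + ["B"] * 5,
    "target": [0] * 60 + [1] * 35 + [1] * 5,
})

# How skewed is the feature?
shares = df["feature"].value_counts(normalize=True)

# Per feature value, what fraction of rows falls in each class?
by_class = pd.crosstab(df["feature"], df["target"], normalize="index")

print(shares)
print(by_class)
```

Here every row with the minority value B belongs to class 1, which is the "free predictions" case described above. With only five such rows, that pattern could just as well be noise, so it is worth checking on held-out data.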

spark cape
#

@fiery frost just ask the channel

fiery frost
#

I am trying to build a network to do the following:
clean speech bubbles in manga (Japanese comics).
I have a ton of before and after images.
But instead of just deleting, I want the network to recognize the speech bubbles and their edges.

#

Here is before and after:

#

Before:

#

After:

#

A video demonstrating how it should work.

#

Thanks to everyone who can help.
Please take into consideration that I am a newbie at this.

#

(I tried to do this by detecting edges, but it is not perfect and requires me to fix it by hand every time.)

molten ravine
#

Hi everyone,

I come to you because after a lot of research, I still cannot fix my problem.
So, I would like to read a JSON file and put it in a dictionary. I know the json library works with json.load().
BUT, above a certain size (500k Ko), json.load() stops working and raises a MemoryError.
I've found the ijson library with its ijson.parse() method, but the output is not a dictionary and I absolutely need the output as a dict.

Do you have any ideas to fix this?

Here there is my code :

import json
def getJsonFile(path):
    #========================================================
    #GET THE JSON FILE
    #========================================================
    data = {}
    try:
        with open(path,'r',encoding='UTF-8') as jFile:
            data = json.load(jFile)
    except Exception as e:
        print(type(e))
    finally:
        return data

Hope someone can help me!

Have a nice day !

fiery frost
#

@spark cape Can you help me?

molten ravine
#

I don't know if it was the good channel to ask my question, i've just joined this discord, sorry for that

fiery frost
#

@molten ravine it's a problem with your PC; on mine I loaded 5 MB of JSON.

uncut shadow
#

@molten ravine What error do you get? I mean, you shouldn't actually use try except there, remove it and then try to run it. If you get any errors then paste them here so we know what might be wrong

fiery frost
#

I am trying to build a network to do the following:
clean speech bubbles in manga (Japanese comics).
I have a ton of before and after images.
But instead of just deleting, I want the network to recognize the speech bubbles and their edges.
Can someone guide me pls? 🙏

molten ravine
#

i have a
```python
MemoryError
```

wanton kiln
#

@molten ravine Do you see that: https://stackoverflow.com/questions/40399933/memoryerror-when-loading-a-json-file ?

glass gorge
#

hello! is it possible for features in a random forest regressor to be a mix of numericals and pd.get_dummies(categoricals)? sorry for the noob question >.<

molten ravine
#

@wanton kiln i've already seen this...
As I said, ijson is not a solution for me because I do not know how to convert its output into a dict...
So if someone could tell me whether it's possible to get my JSON object as a dict using ijson, that would be nice. But right now I do not know how to do this

spark cape
#

I don't know what 500k Ko is, but if you mean 500 KB, that's trivially small and the issue is on your end.

molten ravine
#

I mean 500 000 Ko

spark cape
#

what is ko

molten ravine
#

sorry it's the french writing, i mean KB so i think

#

so yeah, it's a trivially small but i don't understand why i'm getting a memory error

fiery frost
#

I am trying to build a network to do the following:
clean speech bubbles in manga (Japanese comics).
I have a ton of before and after images.
But instead of just deleting, I want the network to recognize the speech bubbles and their edges.
Help someone? 🙏

wanton kiln
#

@molten ravine try these examples: https://www.geeksforgeeks.org/convert-json-to-dictionary-in-python/

fiery frost
#

?

molten ravine
#

@wanton kiln what you've sent is exactly my code right now:


import json
def getJsonFile(path):
    #========================================================
    #GET THE JSON FILE
    #========================================================
    data = {}
    try:
        with open(path,'r',encoding='UTF-8') as jFile:
            data = json.load(jFile)
    except Exception as e:
        print(type(e))
    finally:
        return data
fiery frost
#

Pls? 🙏 🙏

wanton kiln
#

@fiery frost You need to build a deep learning solution using Keras and TensorFlow.

fiery frost
#

Can you guide me more pls?
I am very new to this. @wanton kiln

wanton kiln
#

@molten ravine I think you need to index into the JSON data like in the example:

# for reading nested data, [0] represents
# the index value of the list
print(data['people1'][0])

# for printing the key-value pairs of a
# nested dictionary, a for loop can be used
print("\nPrinting nested dictionary as a key-value pair\n")
for i in data['people1']:
    print("Name:", i['name'])
    print("Website:", i['website'])
    print("From:", i['from'])
    print()
#

@fiery frost ohhh... it's somewhat difficult to address this issue here in chat.

let me think how can i help you, ok.

molten ravine
#

I cannot even index into it if I'm not able to load the JSON file.
The problem is that json.load() gives me a MemoryError. In the example, the data has already been loaded.

fiery frost
#

@wanton kiln thx!!!

spark cape
#

@molten ravine can you use jsonString = jFile.read() ? If not, then the problem is reading the data itself. If it works, then ijson gives you a generator that you can pull data through, like [x for x in ijson.items(f, 'item')]; for a top-level JSON list, that gives you a list of dicts

#

(json data is often a top level list of json object entries)
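
A small diagnostic sketch that follows the two suggestions above: load the file without the try/except/finally wrapper, so the full traceback is visible, and report whether the interpreter is a 32-bit or 64-bit build. The 32-bit explanation is only a guess, since nothing in this thread confirms it, but a 32-bit process has a limited address space and can run out of memory on a file of this size. `interpreter_bits` and `load_json` are hypothetical helper names.

```python
import json
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of this Python build."""
    return struct.calcsize("P") * 8


def load_json(path):
    """Load a JSON file and let any exception propagate with its traceback."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version}, {interpreter_bits()}-bit")
```

Unlike the original function, `load_json` has no `finally: return data`, which would silently swallow the error and return an empty dict.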

woeful tusk
#

I have an excel file with a list of medical exams made per place. I want to make a code to go through it and sum the amount made of exams per place and put it in a dataframe to later write as excel.

spark cape
#

exams.groupby('location').agg({'examsTaken': 'sum'}).reset_index(['location'])

woeful tusk
#

this would return only the amount?

#

i was going for a for loop but it would lets say pick the first line: hemogram in place A. It would go through the whole sheet and sum the amount done. Than if it pick hemogram in place B, it would repeat again

spark cape
#

the pandas way to do that is groupby(...).apply(your function that works on the group you are interested in).reset_index(.../*put the index back how it was */)

#

apply is for your own code; but there are built-ins like sum, etc., so you can also do...
exams.groupby('location').examsTaken.sum()

woeful tusk
#

My Excel columns are code, exam, amount asked, amount done, value. Is it better to just concat code and exam, since they are unique to each other?

#

Ive deleted the place columns since i dont need it

spark cape
#

it's up to you, but you can pass a list to groupby, e.g. groupby(['location', 'code']), to make two layers of grouping if that helps
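
A short sketch of the groupby approach for the exam counts, with made-up rows. The column names are guesses based on the description (`amount_done` stands in for "amount done").

```python
import pandas as pd

# Hypothetical extract of the Excel sheet.
exams = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "code": [101, 101, 101, 202, 202],
    "exam": ["hemogram", "hemogram", "hemogram", "x-ray", "x-ray"],
    "amount_done": [3, 2, 4, 1, 5],
})

# Total exams done per place.
per_location = exams.groupby("location", as_index=False)["amount_done"].sum()

# Total per place and exam; grouping by code and exam together keeps both columns.
per_location_exam = (
    exams.groupby(["location", "code", "exam"], as_index=False)["amount_done"].sum()
)

print(per_location)
print(per_location_exam)
```

Assigning the result to a variable, as asked further down, makes it available for `per_location_exam.to_excel("summary.xlsx", index=False)` later. The file name is a placeholder, and pandas needs an Excel engine such as openpyxl installed for that call.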

woeful tusk
#

Ok

#

Ill try

#

Do I need to equate it to a variable? Like, result=groupby...

spark cape
#

let me know how you get on so if it's wrong i improve too. 🙂

woeful tusk
#

To write as excel later

fiery frost
#

Can someone pls guide me how to implement this?

spark cape
#

yeah you'll want that @woeful tusk

#

but first dump the results in e.g. jupyter notebook to make sure it's correct. itll be quicker than saving file, loading file in excel. etc

#

@fiery frost your question is like a whole project. it's not really appropriate for this channel. you have a few subproblems: how do you find the speech bubble holes? Once you find them, how do you find the text? Once you do that do you detect language? Once you do that, can you OCR it? Once you OCR'd it, translate it using an online service, I guess. then write the text into the field with appropriate layout (based on the speech bubble dimensions)

fiery frost
#

@spark cape it's not that complicated.

#

i dont need to ocr them.

#

just recognize them.

spark cape
#

opencv is an image manipulation library for detecting things. to find the speech bubbles

#

the R in OCR is recognize

fiery frost
#

i already did it.

#

it's not perfect.

#

there are a lot of loopholes.

#

so i thought moving to neural network.

grave frost
#

Hey guys, newbie here. Just wanted some pointers on how to proceed with one of my projects. I am pretty experienced in many parts of ML, but there is one that I am finding almost no help/advice on.
I basically want to build a model (PyTorch or TensorFlow/Keras) that takes a number and a string as features and tries to find some highly complex relation between the 2 fields. I have already made a CSV file where each row has a number and then a string, separated by a comma. The model will then predict the number based on the string.

So I wanted some pointers on how the model should be built, like what layers to use, recommended hidden layers for pretty complex relations, or maybe some pre-trained NN you are aware of that can accomplish this task. I tried Googling it but didn't come up with many examples that do that. I ruled out LSTM and GRU because they do not find very complex relations in the data, but I would experiment with them if the experts think so.
So does anybody have an idea how to accomplish this??

untold aspen
#

for complex relationship between words look up embedding layers and Word2Vec

#

Word2Vec transforms words into vectors which allows the neural net to understand how close or far the relationship between each words are (like cat & dog, king & queen, etc...)

#

Embedding layer tries to capture this idea as well, but embedding layer is more of a general name for mapping features into high dimensional spaces to find relationship (e.g clusters) between data

acoustic halo
#

@grave frost In addition to the above, I would look at BERT, it tends to perform better over most other word embedding techniques and is fine-tuned to the specific task at hand rather than just relying on the embeddings of stuff like word2vec and ELMo

grave frost
#

Thanks guys for responding. But would it be a good idea to use embeddings when the relationship is not word-to-word, but actually integer-to-word? Would the NN be able to figure out the relationship?

#

Also @acoustic halo I have a ton of data (10 million rows with 2 data points each) would it be a good idea to retrain BERT from scratch on that data or should I do something else?

acoustic halo
#

Can you give an example of one?

#

And Bert can give a classification vector for each string you use

grave frost
#

Mainly for simple code breaking purposes. Like 1 --> Ad$r. So basically be able to break complex ciphers and predict the number from the data, Input:- Ad$r ; Output:- 1.....

acoustic halo
#

Don't use Bert then

grave frost
#

So it isn't a classification task, but rather a prediction or "finding a relation between data" task

#

Ohk

acoustic halo
#

Bert is for natural language, you might be better using something character level, not sure what though

grave frost
#

Me neither. I have been hunting most ML platforms, but looks like I have to go to SE now....

flat quest
#

embeddings will work with numbers. You're just projecting the data into a higher dimensional space @grave frost.

character level is probably better for this, but you can still use the BERT architecture. You just encode the characters instead of subwords.
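
"Encoding the characters instead of subwords" could look roughly like this: every distinct character gets an integer id, and each string becomes a fixed-length list of ids that an embedding layer can consume. The helper names and sample strings are made up, apart from `Ad$r`, which comes from the example earlier in the thread.

```python
def build_vocab(strings):
    """Map every distinct character to an integer id; 0 is reserved for padding."""
    chars = sorted({ch for s in strings for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(s, vocab, max_len):
    """Turn a string into a fixed-length list of ids, padded with zeros.

    Characters missing from the vocabulary raise KeyError in this sketch.
    """
    ids = [vocab[ch] for ch in s[:max_len]]
    return ids + [0] * (max_len - len(ids))


samples = ["Ad$r", "zQ1"]
vocab = build_vocab(samples)
encoded = [encode(s, vocab, max_len=6) for s in samples]
print(vocab)
print(encoded)
```

The resulting id lists are what would be fed into an embedding layer, with the number from the CSV as the prediction target.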

grave frost
#

Hmm.... that's right, but if the relation is very complex, would BERT still be able to find it, or should I boost up the hidden layers somehow?

flat quest
#

BERT has a large number of hidden layers. I'd try using the architecture as a base, and then see how it performs before trying to boost up the number of hidden layers

desert oar
#

isnt bert huge and therefore very difficult to train from scratch?

#

id be skeptical that fine-tuning an existing bert model pre-trained for human language would be effective in a more abstract problem, but maybe there are sources contradicting my uneducated intuition

#

but yes in general if you can use an existing proven architecture you definitely should

flat quest
#

fine-tuning i'm not sure if it would work, unless there's some parallels with his particular ciphers and natural language.

Training from scratch would take a long time yes, but the smaller BERT architectures can help minimize that.

grave frost
#

Well, I will start at 'Medium' to provide me a jump-off point....

tardy portal
#

Hello everyone, I need some assistance and reassurance that I executed a histogram chart correctly
age_distribution = sns.distplot(df_ea['Age'], bins = 20)
in which case it outputted this:

#

To clarify, this histogram chart states that majority of the employees within the dataset are between the ages of 30 and 40

lapis sequoia
#

Hello! I am a new member of this channel and just wanted to greet everyone of you. Hello!

rare ice
#

How can I add Python Type Hints for the columns of a spark dataframe? For example, suppose I have a dataframe called data with the following schema:

root
 |-- body: string (nullable = true)
 |-- partition: string (nullable = true)
 |-- enqueuedTime: timestamp (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- systemProperties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- event_type: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
desert oar
#

@tardy portal looks fine

tardy portal
#

I updated it and made it more pretty

desert oar
#

@rare ice spark dataframe columns aren't really strings though - they are placeholders representing strings

#

so im not sure how those type hints would even work

#

do you mean that you want to write a function e.g. process_data that only accepts spark dataframes with a certain schema?

#

if so, i don't think there's a good way to do this

#

i've wanted something like this for pandas as well

#

i haven't found a good answer

rare ice
#

@desert oar I was more hoping to get my editor to recognize the schema and be able to autocomplete columns.

desert oar
#

ah

#

same problem i suppose

rare ice
desert oar
#

yeah, i see

#

that would definitely be nice

#

you could probably make this work with generic types

#

what happens if you add a column to the dataframe? i guess that means its type has changed, right?

#

in general this is the problem of: how to annotate an object that has runtime-generated attributes or keys, but is expected to be mostly static once it's been generated

#

it's also a bit messy because pyspark dataframes also support access with __getitem__ as in data['year']

lapis sequoia
#

Can anyone recommend a channel about AI / ML ?

acoustic halo
#

@lapis sequoia This one

tardy portal
#

Okay so I need assistance with creating a pairplot for my dataset, specifically for the numerical variables. I'm trying to utilize this link: https://seaborn.pydata.org/generated/seaborn.pairplot.html, however, how am I able to specifically target the numerical variables within the dataset?

tardy portal
#

df_ea.fillna(value = 0, inplace = True)

num_scatt_plot_matrix = sns.pairplot(df_ea.iloc[:, :17])
#

could someone explain what sns.pairplot(df_ea.iloc[:, :17])

#

that means?

desert oar
#

df_ea is your data frame

#

.iloc[:, :17] is "all rows, and the first 17 columns"

tardy portal
#

alright so since I want to run this pairplot for only the numerical variables I entered this:
```py
sns.set(style = 'whitegrid')

df_ea.fillna(value = 0, inplace = True)

num_scatt_plot_matrix = sns.pairplot(df_ea, hue = 'Attrition')
```

#

It generated a lot of information, however I came across this error: RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density.

lapis sequoia
#

hey if someone here is familliar with mysql could you look in databases

tardy portal
#

I managed to figure it out by entering the column names separately, but I wanted to know if there's an easier method I could run this pairplot based on the columns that hold numerical values.
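
One way to pick out the numerical columns without typing their names is `select_dtypes`. A sketch with a stand-in frame: only `Age` and `Attrition` are taken from the messages above, and the other columns are invented.

```python
import pandas as pd

# Stand-in for df_ea with a mix of numeric and text columns.
df_ea = pd.DataFrame({
    "Age": [34, 41, 29],
    "MonthlyIncome": [5000.0, 7200.0, 3900.0],
    "Department": ["Sales", "R&D", "Sales"],
    "Attrition": ["No", "Yes", "No"],
})

# Keep only the names of integer and float columns.
numeric_cols = df_ea.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# With seaborn installed, the selection can be passed through `vars`:
#   sns.pairplot(df_ea, vars=numeric_cols, hue="Attrition")
```

If the KDE bandwidth error comes back, passing `diag_kind="hist"` to `pairplot` may avoid it, since the diagonal is then drawn as histograms instead of density estimates.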

desert oar
#

@plucky crow in code like this

timediff = df1.groupby('masked_id')['date'].apply(lambda y: y - y.iloc[0])

the timediff series will typically end up with an extra index level, corresponding to the grouping variable. so if you want to assign it back to df1 as a column, you need to remove that index level. in this case i suggested .reset_index(level='date'). if it doesn't work by name, just say level=0, since the extra index level will be added as the 0th level
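
An alternative sketch that sidesteps the extra index level: `transform` returns a result aligned with the original rows, so the difference can be assigned straight back as a column. The sample values are invented; `masked_id` and `date` are the names from the snippet above.

```python
import pandas as pd

# Hypothetical data with two ids and two dates each.
df1 = pd.DataFrame({
    "masked_id": [1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-02-10", "2020-02-15"]
    ),
})

# First date of each group, repeated on every row of that group.
first_date = df1.groupby("masked_id")["date"].transform("first")

# Time elapsed since the first date within each group.
df1["timediff"] = df1["date"] - first_date
print(df1)
```

This assumes the rows are already in the order that defines "first" within each id.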

carmine iron
#

is there a way i can add some type of if clause in my df column list comprehension?
```python
group['growth'] = [x[-1] / x[0] - 1 for x in group['cases']]
```

#

I'm wanting to add: if x[-1] - x[0] < 0 then
```python
group['growth'] = 0
```
velvet thorn
#

uh

#

if I understand you correctly

#

why not use max
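
The `max` suggestion could be applied inside the comprehension like this. It is a sketch with invented numbers and assumes the first value of each list is positive, so that a negative difference always means negative growth.

```python
import pandas as pd

# Hypothetical frame: each cell of "cases" holds a list of counts over time.
group = pd.DataFrame({"cases": [[10, 15], [20, 10], [4, 4]]})

# Growth from first to last value, floored at zero.
group["growth"] = [max(x[-1] / x[0] - 1, 0) for x in group["cases"]]
print(group)
```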

woeful shore
#

ValueError: logits and labels must have the same shape ((None, 6) vs (None, 4))

#

@woeful shore I keep getting this error, how do I make sure my training data and labels are of the same shape. my training data is a list of textual data while my label is ndarray that has been onehot encoded. My labels has four targets

deft harbor
#

Print the shapes at different points

#

Also depends on the library you are using

frank bone
#

does anyone have a good and well explained resource on how you would go about making a very simple supervised ML function with around 10 parameters classifying around 10 different structures in a dataset

#

I have 0 experience in ML(tensorflow or whatever tool you would use)
My plan is to create a synthetic dataset with the data structures I want to detect and train a model with that, then test on real data

#

how do you 'label' these different structures? lets say we have a csv file with 100 lines and in 30 lines I have different data structures (around 10 different) I want to detect. The real dataset will be millions if not billions of lines but that was just an example

#

each line having around 15-16 columns, of which 4-5 are relevant to define a unique structure

#

and sometimes these structures are comprised of several lines within that 100 lines file

#

i.e. 3 lines = 1 unique data structure

#

and they even overlap sometimes

#

found it very difficult even just trying to structure an algo for that, would probably be very error prone and be hundreds of conditional nesting. Thought ML might be the solution for something like that

#

as i understand it... usually you'd have 1 datapoint and 1 corresponding label. But how would you go about mapping 3 datapoints to 1 label? then 2 datapoints to 1 label? And then make the training session go through all these and also check for overlap and classify the data according to the highest score? with labels that have multiple datapoints carrying more weight than single ones, for example

flat quest
#

still having trouble understanding the data you're working with
So its multifeatured inputs (each line is an input im assuming)
But some of the lines are grouped together? @frank bone

frank bone
#

@flat quest let me give you an example. Specifically the idea is to identify different option strategies that were executed

#

examples of such strategies

#

example of data

#

usually theyre executed at same time but I want to introduce a tolerance of 3 minutes so that makes it a little more tricky but will get me more data output

#

now the tricky thing is "Label X" could be falsely identified within "Label Y", or any data within all labels above could be identified as "Label R". But they're all unique instances

#

So in total there will be 10 labels, and each label can be comprised of more than 1 row but there's no such requirement

#

The way I would solve it (as I'm currently understanding ML) is to give more weight to a label the more datapoints it comprises

drifting umbra
#

i think calling them labels makes it very confusing for most ML people

#

label is what you want to predict

#

if you think an equation like y = mx + b

#

labels = y = what you want to predict

#

features are your inputs

#

do you have data for inputs and corresponding outputs?

flat quest
#

So you're trying to categorize options trade into particular categories? @frank bone

frank bone
#

@flat quest exactly. And sometimes an option trade is a single entry, sometimes it can be multiple entries

drifting umbra
#

yes labels is what you want to predict

#

you dont have to call it label

#

you can call it anything

frank bone
#

alright

drifting umbra
#

you can also predict either:

#

number such as $ in sales, $ stock goes up / down

#

TRUE / FALSE

#

or categories

#

as you have it X, Y, Z, R you have 4 categories

frank bone
#

that was just an example, I have around 10 categories and each category can be identified by timestamp, C/P, Strike and maybe TotalTrades (not sure)

#

the problem i see is the grouping of several datapoints into one category

drifting umbra
#

i dont really think you need machine learning for this

#

if you are doing what's called "hard coding" rules

#

such as all options with one date

#

go into one category

#

e.g.

if month == "January":
  category = "Label X"
#

also to train your data you will need labels for many more options

#

or many more observations than 4

#

if it truly needs machine learning

#

and the category is not set by strict rules

frank bone
#

@drifting umbra not sure you get what im trying to do. It has nothing to do with time

drifting umbra
#

i am saying

#

if you are categorizing things with STRICT rules

#

you don't need machine learning

#

such as all bachelors = unmarried

#

all children are young

#

etc

frank bone
#

i dont necessarily need machine learning no, but if its easier and faster to implement and is more precise then yes

#

i just dont know how i would write code to correctly identify i.e. 10 different categories when they can overlap

drifting umbra
#

if you are using strict rules machine learning will be less precise

frank bone
#

seems very complicated and error prone with many many many conditional nested loops but i just started learning python 2 weeks ago so maybe its a knowledge issue 😄

drifting umbra
#

this is pretty easy problem

#

do you have more observations

#

than 4

#

need like hundreds of thousands

#

*OR thousands. hundreds OR thousands. sorry

#

long day @ work

frank bone
#

i do a prefilter, and out of that prefilter I get the data I colored in the above picture

#

and in that data i need to identify the categories and they're defined by just 4 or 5 observations or 2* if it a dual pair, 3* if triple pair

#

if there's a simple solution im all for it.. 😄

drifting umbra
#

okay so

#

if the features for the colored things

#

are similar

#

and

#

you have more observations with labels

#

i only see 4 that are labeled

#

how many labeled rows

#

do you have is the question

frank bone
#

it was just an example to demonstrate the issue

#

of there being categories with varying amount of csv rows

drifting umbra
#

do you have excel or screenshot of actual data

#

if you want to do it as machine learning you will need more than 4 with both

#

input AND output

#

in column J most are missing output

jade walrus
#

anyone tried both keras/tensorflow and fast.ai/pytorch? I only tried keras. Easy to use. Never tried the other combination. For those who've tried both, may I know what is your preference?

drifting umbra
#

@jade walrus interested in this as well. i am reading Deep Learning with Python by François Chollet (author of Keras)

frank bone
#

The actual data doesnt contain column J, I added those manually to demonstrate what I want to be filtered out

#

id create a dataset manually where id mark them

drifting umbra
#

ok

frank bone
#

or "label" as in the previous link i shared

drifting umbra
#

like i said be aware you would need to do hundreds

#

or thousands

#

so maybe

#

what you want

frank bone
#

absolutely

drifting umbra
#

is to use rules

#

like strict rules

#

how do you decide what is J?

#

for example if strike is greater than 40, it's call, then it's Category Z

#

that could be a rule

frank bone
#

let me show you in the example from the screenshot how id categorize it

drifting umbra
#

okay i am saying if you are going to manually go through

#

and put something in column j

#

how do you decide what to put in column J ?

frank bone
#

im looking at the timestamp with a tolerance of x seconds, if there's a pair (this can be 2 or more) then those 2 (or more) entries are a "pair". After such a pair is identified, i then look at C/P, Strike and Stock Price, if C and P strikes are within 10% of stock price, it's a "Straddle"

#

thats one label

#

and I want to identify like 10-20

#

the problem is the varying degree of numbers of datapoints in a label. For example a label where there's only 1 row, could mistakenly be identified in a pair?

#

the tolerance might also introduce overlapping between pairs if they happen close to each other

#

lets say one pair happens at 10:15 and the other pair at 10:17, if my tolerance connects the two, that's a mistake

#

i think itll be extremely complicated to code all these sometimes deeply nested conditionals and debugging will be hell

#

if you look at straddles, strangles and iron condor youll understand the problem
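
a rough sketch of how the "straddle" rule described above could be written as a plain function, assuming one group of legs is already sitting in its own DataFrame. The column names (`cp`, `strike`), the `stock_price` argument and the fallback category names are made up for illustration; only the 10% band comes from the rule above.

```python
import pandas as pd

def classify_group(legs: pd.DataFrame, stock_price: float, band: float = 0.10) -> str:
    """Return a category name for one group of option legs (hypothetical rules)."""
    sides = set(legs["cp"])
    # every strike has to sit within +/- band of the stock price
    near_money = ((legs["strike"] - stock_price).abs() <= band * stock_price).all()
    if len(legs) == 2 and sides == {"C", "P"} and near_money:
        return "Straddle"
    if len(legs) == 1:
        return "Single"
    return "Unknown"

pair = pd.DataFrame({"cp": ["C", "P"], "strike": [35.0, 35.0]})
single = pd.DataFrame({"cp": ["C"], "strike": [37.5]})
far = pd.DataFrame({"cp": ["C", "P"], "strike": [35.0, 50.0]})

print(classify_group(pair, stock_price=34.0))
print(classify_group(single, stock_price=34.0))
print(classify_group(far, stock_price=34.0))
```

each extra category would be one more `if` in the same function, so the nesting stays flat as long as the grouping is done in a separate step first.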

drifting umbra
#

can u say category instead of label

#

and ok so

#

"lets say one pair happens at 10:15 and the other pair at 10:17, if my tolerance connects the two, that's a mistake"

#

yeah i understand options strategies i am CFA

#

actually know a lot more about investing than data science lol

frank bone
#

great 😄

drifting umbra
#

i think u will get much more accurate results hard coding

#

because you have no answer data

#

talking of going thru hundreds or thousands and labeling it yourself is a big mistake imo

#

just create code for conditions

frank bone
#

can u say category instead of label
@drifting umbra can u just quickly define it to me. A little confused by terminology

drifting umbra
#

such as TIME_2 - TIME_1

#

is that >2 or <2

#

features are your X's. features are inputs
labels are your Ys. labels are your outputs

frank bone
#

talking of going thru hundreds or thousands and labeling it yourself is a big mistake imo
@drifting umbra i was thinking of synthetically making them

drifting umbra
#

step 1 import this excel into pandas

deft harbor
#

@drifting umbra go back to stock market chat

drifting umbra
#

pd.read_csv("file.csv")

frank bone
#

yeah i have all filtered data in a dataframe already

#

the colored entries from picture above

deft harbor
#

Study dask, skip pandas

drifting umbra
#

@deft harbor wassup

frank bone
#

thats where i extracted pairs using unixtimestamp comparison but im stuck and thats why i looked at ML

deft harbor
#

Just checking in to see what's happening here.

drifting umbra
#

@deft harbor ugh. new tech. this is why i hate technology and am glad i am not a software developer

#

always new shit to learn

#

i love learning but i question how much new languages and frameworks even add

deft harbor
#

I use dask because I found myself trapped using one core all the time

drifting umbra
deft harbor
#

Dask is really useful for multicore processors and in a bigger world, clustering

drifting umbra
#

import modin.pandas as pd

#

🙂

#

makes sense

#

i am using colab because

#

although i have nvidia GPU. it is fk impossible to get tensorflow to work on windows

deft harbor
#

Yeah, I use Linux only

drifting umbra
#

@frank bone not sure your question

#

if you were to do it manually

#

labeling them

deft harbor
#

I've worked with people who developed notebooks in windows, and its a mess

drifting umbra
#

how would you decide?

frank bone
#

i wouldnt label them manually, id create a synthetic dataset with only condors, only straddles, only strangles and then concat them

#

for training

#

but im all for a hardcoded classic function, it just seems stupidly complicated to me

#

do you have any suggestions? I mean you understand the problem well since you understand options strategies.
I have a table of different option strategies and I want to put them in correct categories (straddle, strangle, butterfly, etc)

#

defining observations are timestamp (nearness chronologically IN CASE its applied to multiple lines), C/P (defines possible categories), C/P strikes (nearness to stock price -> defines possible categories) and all this can be applied to several datapoints (csv lines) or a single line, which will also change possible categories

drifting umbra
#

@frank bone make each row in your data frame an options contract

#

then use logic to build options trades

#

different data frames for different underlying securities

#

if you have all the options contracts in a data frame

#

then you can build strategies from it pretty easy

#

many different strategies

#

btw side comment is that i am skeptical of insights from this because generally options are not super liquid

#

so you may have problems with either
-stale prices
-getting fills / adequate liquidity when you go to implement strategy

#

you can store the options trades in mini data frames, one big data frame, w/e

#

like all the legs

#

that is cleaner than having a column in your df_all_options_AAPL

#

df_AAPL_Oct_Bear_Put could be a data frame with the rows for a bear put spread on aapl

#

or of course you can add columns

frank bone
#

maybe i wasnt clear enough in how my current data structure looks

drifting umbra
#

screenshot you sent has every options contract as a row

frank bone
#

I have a df for a given ticker for a given day with all options trades in it

#

and im trying to find "option strategies" that were executed for each given day

drifting umbra
#

wait but

#

ok

#

i will challenge the theoretical basis for this even

#

or really just 2 things

#

1- how do you know what are options strategies?

#

together

#

and not just different traders

#

you have no "trader_ID" variable

#

i think it is just too much guessing

frank bone
#

and whats the 2nd thing?

drifting umbra
#

that it's just guessing

#

you need answers (labels)

#

a problem you could do

#

with this data

#

if i were you

frank bone
#

well im 99.9% certain when I look at the data

drifting umbra
#

ok then i would say hard code

#

the rules

frank bone
#

that it was a strategy executed

drifting umbra
#

yeah so if time is within 2 seconds

#

if different strike prices

#

whatever conditions

#

just think about what you're doing in your head when you are grouping the trades

#

i would try to use this options data to predict underlying stock price movement

#

rest of that day or next day maybe

#

that is something you actually have data for

#

input1 : $ value of puts bought
input2: $ value of calls bought
input3: # of puts bought
input4: # of calls bought

#

input5: % of trades in last 15 min of trading

#

label (what you want to predict) = stock return next day

#

boom, there you can get data for it

#

you have data and rules, want to output answers

#

that is fine

frank bone
#

thats not my goal though, i just want to identify categories

drifting umbra
#

ok well you have no answers right

#

you need to define the rules formally

#

of how you are grouping the options trades into being done by the same person

#

if u can do it mentally need to formalize

#

not even program, just write down

frank bone
#

i like what you said to lay down step by step what im going through my head when i categorize it. And it would be somewhat doable but when I think about edge cases things get messy

#

so im wondering if i create synthetic answers and let ML find the rules, if that will be better

drifting umbra
#

"i like what you said to lay down step by step what im going through my head when i categorize it"

#

exactly

#

i mean

#

you can

#

i just question how objective your categorizations really are

#

if u cant lay out the rules

#

as u see the 2 paradigms for programming

#

you need either data and rules

#

or data and answers, and the computer comes up with the rules

#

hm

#

wait

#

maybe i'm an idiot

#

there are ways to categorize data into groups

#

like k-means for example

#

but i think you have to tell it how many groups there are

#

also it would not make sense for your data

#

without answers, you could only group similar things together

#

you need either:
data + rules
or
data + answers

frank bone
#

yup im following 😄 for the second answer my only problem is how would I put several datapoints in one category and for other categories just one line and for other 4

#

would i train the model once for 1 liners, then 2 line data, then 3 line data and then 4 line data(max) and then when predicting i give more weight to categories with more values?

#

or can i somehow do all in 1?

#

as i understood tensorflow, you train a model with data(1 csv line)='category'

#

can u do data(2 csv lines) = category1, (4 csv lines)=cat2

#

and then how would you apply (model.predict) that to data that only consists of 1 liners

#

ill try to write down rules for 1st case on paper, lets see if i can get clear answers, been trying to do that for 3 days though, all i got was headaches 😄

drifting umbra
#

"how would I put several datapoints in one category"

#

put the category they are in

#

on their line in excel

#

just say "Trade1"

#

"Trade2"

#

"Trade3"

#

"as i understood tensorflow, you train a model with data(1 csv line)='category'
can u do data(2 csv lines) = category1, (4 csv lines)=cat2"

#

not sure what this means

#

i just think you are no offense
1- guessing. why do you think these trades belong together?

#

are there any rules you can define? being similar in timestamp makes logical sense

#

you can get python to group them based on that

#

just think this makes no sense idk

#

we are here

#

lol

frank bone
#

i just think you are no offense
1- guessing. why do you think these trades belong together?
@drifting umbra the data at that stage has gone through pre-filtering, in combination with timestamps its highly likely they belong to the same trader. It's not a make or break situation either. I just want high accuracy, never claiming 100%

#

actually the grouping is 100% only defined on timestamp, after that it's just putting them in different categories. thats the easy part

#

but grouping i find hard but ill take a pen & paper and have at it again 😄

drifting umbra
#

which is easy part

#

grouping based on time stamp?

#

and i have idea on how to do this maybe

#

what is max number of options legs in one trade?

frank bone
#

4 but thats very rare. mostly 1 2 or 3

#

i mean you can take this as an example for a df with different categories

#
20200501    09:41    FAST    C    35.0    20200515    1.45    90    1    
20200501    10:41    FAST    P    35.0    20200515    0.71    350    1
20200501    10:41    FAST    P    32.5    20200515    0.25    600    1    
20200501    10:41    FAST    C    32.5    20200515    3.33    600    2    
20200501    15:40    FAST    P    35.0    20200515    0.75    160    1
20200501    15:40    FAST    P    35.0    20200515    0.75    145    1    
20200501    15:47    FAST    C    37.5    20200515    0.3063    80    5
#

adding a column at [2] with unixtimestamps easy

#

alrdy done that

#

now just trying to figure how to take out pairs and save them in a separate df or list and take them out of original df, so they dont interfere with other pairs

#

and all this dynamically, since this is part of an iteration and number of pairs will always be different

#

got my brain juices flowing a little bit now...trying something 😄
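
one possible way to pull the groups out without nested loops, sketched on a few columns of the rows pasted above. The column names and the 60 second tolerance are assumptions, not something from the real data.

```python
import pandas as pd

# rows copied from the table above; column names are assumed
rows = [
    ("20200501", "09:41", "C", 35.0),
    ("20200501", "10:41", "P", 35.0),
    ("20200501", "10:41", "P", 32.5),
    ("20200501", "10:41", "C", 32.5),
    ("20200501", "15:40", "P", 35.0),
    ("20200501", "15:40", "P", 35.0),
    ("20200501", "15:47", "C", 37.5),
]
df = pd.DataFrame(rows, columns=["date", "time", "cp", "strike"])
df["ts"] = pd.to_datetime(df["date"] + " " + df["time"], format="%Y%m%d %H:%M")
df = df.sort_values("ts", kind="stable").reset_index(drop=True)

tolerance = pd.Timedelta(seconds=60)  # assumed value, tune to the data

# a new group starts whenever the gap to the previous row exceeds the tolerance
new_group = df["ts"].diff() > tolerance
df["group"] = new_group.cumsum()

# one small DataFrame per group; the original df stays untouched
groups = [g for _, g in df.groupby("group")]
print(df[["time", "cp", "strike", "group"]])
```

because this only looks at the gap to the previous row, trades that keep arriving within the tolerance of each other get chained into one group, which is the 10:15 vs 10:17 overlap problem mentioned earlier. A tolerance smaller than the usual spacing between unrelated trades is the simplest guard against that.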

drifting umbra
#

yo

#

sorry

#

have idea

#

your input should be 4 rows of these characteristics

#

output is TRUE/FALSE

#

whether they are in a trade together

#

THAT is doable with machine learning!

#

the 4 rows will consist of any related rows which actually form a category

#

plus the number of rows to get to 4

#

randomly chosen from data set

#

pretty sure it will work

#

using true/false as Y is solution i think

#

that way you are predicting which of inputs are in a group together

#

so in the future when you give it a bunch of trades it can output predictions TRUE/FALSE of those that are related

frank bone
#

but how to further differentiate between the different groups afterwards? that big group will consist of 10-20 subgroups (option strategies)

#

the table you showed actually consists only of those subgroups we're talking about. But I think it should be possible without ML, still writing down I think there's a way..so far

drifting umbra
#

each sample

#

should contain how ever many that are in a group

#

and the rest of the rows selected at random from your data set

#

you will feed the data into your model to project in rows of 4

#

splitting it up into 4s

distant dagger
#

Is this a good place to ask minor stats questions?

frank bone
#

should contain how ever many that are in a group
@drifting umbra Oh I get it

drifting umbra
#

@distant dagger sure

#

i think there is rule no doing hw for people

frank bone
#

splitting it up into 4s
@drifting umbra what do you mean by that?

drifting umbra
#

okay so you have hundreds of trades right

#

this example you have 8 rows

#

it will be broken into groups with 4 rows

#

the model will output TRUE if it thinks the trade is part of a group

distant dagger
#

Yeah, its not homework, I just didn't pay attention in college when they were teaching this stuff. Now I am almost done with my phd and doing some minor projects.

drifting umbra
#

very cool

#

@frank bone given 4 rows

#

it will predict WHICH are part of a trade together

#

if TRUE you can add tag them with "Trade+counter"

#

where counter increases by 1 everytime

#

so for example it might tag the first 2 out of 4 TRUE

#

add "Trade1" next to those entities in your DF in a new column called "Trade_No_Predicted"

#

you feed in 4 more

#

it predicts all 4 are part of one trade

#

all 4 are predicted TRUE

#

you add "Trade2" next to all in the same column "Trade_No_Predicted"

distant dagger
#

It doesn't show any outliers. But if I plot a histogram, it gives me this

drifting umbra
#

but there are observations here

frank bone
#

it will predict WHICH are part of a trade together
@drifting umbra ohh sort of a "sliding window" (just for understanding) from top to bottom, at each new row it checks whether trades are part of group. Do I understand correctly?

drifting umbra
#

@frank bone yesss

frank bone
#

Interesting, and for training data i just label true/false?

#

sorry if i got terminology mixed up with "label" here. But in training data I need to have features + label right

distant dagger
#

I am sorry, but i don't get what you are saying. The whiskers are the min and max in the data aren't they?

drifting umbra
#

@frank bone no with training data you label with groups

#

later convert to true/false

#

when converting from whole data set into rolling window of 4

#

rows
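
a minimal sketch of that conversion, assuming a hand-filled `trade` column ("Trade1", "Trade2", empty for standalone rows) and fixed, non-overlapping windows of 4 rows. The column names and the toy values are invented; a real version would also need feature columns next to the target.

```python
import pandas as pd

# hypothetical hand-labelled data: same name = same trade, None = standalone row
df = pd.DataFrame({
    "strike": [35.0, 35.0, 32.5, 32.5, 35.0, 35.0, 37.5, 40.0],
    "trade": [None, "Trade1", "Trade1", "Trade1", "Trade2", "Trade2", None, None],
})

window = 4
df["window_id"] = [i // window for i in range(len(df))]

# TRUE when another row in the same window carries the same trade name
shares_name = df.duplicated(["window_id", "trade"], keep=False)
df["in_group"] = df["trade"].notna() & shares_name

print(df)
```

one limitation of fixed windows: a trade whose legs straddle a window boundary gets split, so the windows either have to slide row by row or be built around the known groups.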

#

@frank bone "training data I need to have features + label right"
Yes

#

@distant dagger yeah so what are you saying about outliers

#

everything looks right to me

frank bone
#

can you just give me an example what that would look like (the "group")? Slightly confused but I got the concept. thats smart, thanks 😄

drifting umbra
#

are you talking about these

#

i think that is smoothing

distant dagger
#

Oh ok...I was confused by the histogram cause it has two peaks. @drifting umbra

drifting umbra
#

@frank bone you label "Trade1", "Trade2"

#

in your excel

#

or master data frame

frank bone
#

ohhh

#

got it

#

will try that when classic method wont work. Or maybe do a benchmark classic vs. machine

drifting umbra
#

your data

#

will not be so brain dead easy

#

to see / predict

#

but if you have this type of thing

distant dagger
#

@drifting umbra I guess what I was trying to ask was that since there are two peaks in the histogram, is it simply a bimodal histogram or is there something wrong with the way I have processed the data. The boxplot and the histogram are actually log values

#

And from what I have heard, bimodal histograms are not very common

drifting umbra
#

bimodal histogram is def possible

#

depends on ur data

#

it suggests

#

u have 2 diff populations

#

or groups sorry

frank bone
#

@drifting umbra very nice thanks. Everything noted down 🙂 have you compared classic vs machine implementations before? did one outperform the other in accuracy?

drifting umbra
#

@frank bone depends entirely on data and problem ur trying to solve

#

one is not always "better"

#

i think you can use random forest for this really well

#

or if you convert everything to categories

#

either use xgboost or catboost

#

but basically

frank bone
#

can this be done on a CPU? or GPU needed?

#

or if you convert everything to categories
@drifting umbra ?

#

not exactly sure what you mean by that? train with categories only?

distant dagger
#

@drifting umbra I get it now. Thank you. that was really helpful.

drifting umbra
#

@frank bone my idea

#

@distant dagger 🙂

#

@frank bone categories is no big deal

#

computer cannot understand text

#

only numbers

#

can you send data again it's too far up

#

dont worry about that

#

that is 1 easy line of code
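
for reference, two common ways to turn a text column into numbers in pandas. The `cp` column name and the toy rows are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"cp": ["C", "P", "P", "C"], "strike": [35.0, 35.0, 32.5, 32.5]})

# option 1: integer codes, assigned in sorted order of the categories (C -> 0, P -> 1)
df["cp_code"] = df["cp"].astype("category").cat.codes

# option 2: one-hot columns, one per distinct value
one_hot = pd.get_dummies(df["cp"], prefix="cp")

print(df)
print(one_hot)
```

integer codes are usually fine for tree models; one-hot columns avoid implying an order between the values.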

#

xgboost or catboost can run on CPU in less than 60 sec

#

if you need GPU you can use google colab

#

its free and thats what i do

#

i would suggest doing data science in jupyter notebooks

frank bone
#

can i feed the training with many separate files instead of 1 big concatenated one? I'm wondering if increasing label numbers and passing time (i.e. jan to june) would play a role in what the algo decides, when they shouldn't influence it

drifting umbra
#

you mean features include month?

#

no problem with adding month to feature

frank bone
#

no since ill have(?) to feed in unix timestamp, that includes the passing of time, when there should be no correlation in that

#

skewing the performance

drifting umbra
#

you can easily delete or add columns

frank bone
#

i dont want to include passing of time as a feature, but a timestamp would include that?

drifting umbra
#

you can extract just the month or day of week

#

or year

#

whatever you want

#

or no time variable

#

all easy

frank bone
#

can i make a timestamp of 10:15?

#

without date

#

just the time

drifting umbra
#

yes

#

easy
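
a small sketch of a time-only feature, assuming the time arrives as "HH:MM" text like in the rows pasted earlier. The column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"time": ["09:41", "10:41", "15:47"]})

# parse only the clock time, then express it as minutes since midnight
t = pd.to_datetime(df["time"], format="%H:%M")
df["minute_of_day"] = t.dt.hour * 60 + t.dt.minute

print(df)
```

this keeps "how late in the session" as a feature while dropping the calendar date, so jan vs june does not leak in.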

frank bone
#

lol stupid me

drifting umbra
#

naw all good

#

i would suggest keeping some time var

frank bone
#

yeah

drifting umbra
#

trades which occur closer obv more likely to be part of same trader

#

gotta sleep

#

feel free to PM me

#

good luck with project

frank bone
#

thanks again! 😄

drifting umbra
#

gl ☮️

frank bone
#

u too

high pulsar
#

I have sales data from US Cities. I don't have latitude and longitude values, just plain state and city names. How can I visualize them on a map? I tried using Geopy, Geopandas (Nominatim) but it takes too much time, as there are around 10k rows. Any better way?

austere swift
#

have you tried plotly?

#

I'm not sure if it works with names but you can do some pretty nice map visualizations with it

chrome barn
reef bone
#

Hi @tame fractal, I'm not sure why you decided to post a monkey picture here but I deleted it as it was not relevant to the topic of this channel

austere swift
#

passive aggression at its finest

high pulsar
#

https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/
@chrome barn Thanks man. I used pandas merge with this csv. Got exactly what I wanted. Thankyou again.
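
roughly what that merge can look like. Both tables here are tiny stand-ins: the column names of the downloaded csv may differ, and the coordinates are approximate.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin"],
    "state": ["TX", "MA", "TX"],
    "amount": [120.0, 80.0, 45.0],
})

# stand-in for the lookup csv (assumed columns, approximate coordinates)
lookup = pd.DataFrame({
    "city": ["Austin", "Boston"],
    "state": ["TX", "MA"],
    "latitude": [30.27, 42.36],
    "longitude": [-97.74, -71.06],
})

# a zip-level file has many rows per city, so collapse to one point per city first
city_coords = lookup.groupby(["state", "city"], as_index=False)[["latitude", "longitude"]].mean()

# left join keeps every sales row, even if a city is missing from the lookup
merged = sales.merge(city_coords, on=["state", "city"], how="left")
print(merged)
```

a local merge like this avoids one geocoding request per row, which is why it finishes in well under a second for 10k rows.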

frank bone
#

sounds rather exotic but is it possible in python to skip a loop twice?

acoustic halo
#

Use a variable as a flag?

frank bone
#

i think that doesnt work? it happens many times in the loop

#

in my case

acoustic halo
#

This wouldn't work?

#

skip_flag=False for loop: if skip_flag or need_to_skip: skip_flag=not skip_flag continue do_stuff()
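
the same idea written out as runnable code. `process` and `needs_skip` are placeholder names for illustration.

```python
def process(items, needs_skip):
    """Return the items that were handled, skipping a triggering item and the one after it."""
    done = []
    skip_flag = False
    for item in items:
        if skip_flag or needs_skip(item):
            # first skip sets the flag, second skip clears it again
            skip_flag = not skip_flag
            continue
        done.append(item)
    return done

print(process(range(10), lambda x: x == 3))  # 3 and 4 are skipped
```

the flag resets itself, so it works no matter how many times the skip condition comes up during the loop.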

frank bone
#

not sure...i have a

```
      df.loc[var[i]]
      do stuff...
```
but sometimes i doesnt exist in the dataframe and it throws some ambiguous error

acoustic halo
#

Can you not catch it in a try except statement and put the continue there?

frank bone
#

oh i should make a list out of df and then compare, if i isnt in there continue

#

its an index

#

so i can do check = df.index.values.tolist()

#

ill try that

#

worked just fine 😄
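
a minimal sketch of that check, with invented data. Testing against `df.index` directly also works, so the list copy is optional.

```python
import pandas as pd

df = pd.DataFrame({"price": [1.45, 0.71]}, index=[100, 200])
wanted = [100, 150, 200]  # 150 is not in the index

found = []
for key in wanted:
    if key not in df.index:  # skip keys the dataframe does not have
        continue
    found.append(df.loc[key, "price"])

print(found)
```

wrapping the lookup in `try` / `except KeyError` with a `continue` in the except branch would be the other common way to do it.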

frank bone
#

dataframes seem to behave differently when calling within len(). If I do df.loc[var] and that var is in the index just once, len returns the number of columns in that row, however if var is in the index > 1 times, len returns the amount of times var is in the df index

#

im only interested in getting the amount of var in index, whether it's once or twice or more, not interested in len of columns in row

#

any ideas how to accomplish that?

desert oar
#

You want the number of times a value appears in the index?

#
(df.index == val).sum()
#

Len does not behave differently in your case. loc behaves differently

#

If the value only appears once it returns a series representing a row - the length of which is the number of columns

frank bone
#

Thanks a lot! 🙂

desert oar
#

You can force loc to return a dataframe by wrapping the value in a list

#
df.loc[[val]]
frank bone
#

Double square brackets?

#

Ah

desert oar
#

Yes but be careful about what they mean

#

The outer brackets apply to df.loc

#

The inner brackets arent "attached" to anything so they create a new list

frank bone
#

Fair enough makes sense

desert oar
#

If val is already a list, series, or index, you should not use the second set of brackets

frank bone
#

Good to know

desert oar
#

Actually using len(df.loc[[val]]) might be quite a bit faster than the method I suggested above

#

Because in theory pandas can optimize individual index value look ups

#

Either with a binary search or a hash table look up
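
A small demonstration of the behaviour described above, on an invented frame with one unique and one repeated index value:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "y"],
)

print(len(df.loc["x"]))          # 3: a Series for one row, so len is the column count
print(len(df.loc["y"]))          # 2: a DataFrame with two matching rows
print(len(df.loc[["x"]]))        # 1: the list forces a DataFrame, so len is the row count
print((df.index == "x").sum())   # 1: counts matches without touching the row data
```

One difference worth keeping in mind: `df.loc[[val]]` raises a `KeyError` when `val` is not in the index at all, while the comparison-and-sum version simply gives 0.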

frank bone
#

Ill try both

eternal sandal
#

Does anyone have some code for sorting numbers. Beginner to python, and looking to learn something about lists

tidal bough
#

First of all, not really #data-science-and-ml. You better get a help channel (see #❓|how-to-get-help)
Second, are you asking "how to sort lists" (for which the answer is to use sort/sorted) or "what are some sorting algorithms I can implement" (for which the answer is "bubble sort, insertion sort, merge sort, and the other ones you can find in any Data Structures and Algorithms book")?

eternal sandal
#

okay. just joined this server. Thanks for the help

tardy portal
#

Question: Split the dataset into training and testing 30 percent testing 70 percent training You can use the train_test_split function in sklearn library

#

Answer:
```py
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

df_ea = pd.read_csv('C:/Users/Bobby/Documents/School/INFO 357/employee_attrition_processed.csv')

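# every column except the target is an input (a list, since newer pandas rejects a set as an indexer)
input_var = [col for col in df_ea.columns if col != 'Attrition_Yes']
X = df_ea[input_var]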
Y = df_ea['Attrition_Yes']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)

log_reg = LogisticRegression(random_state = 0, solver = 'lbfgs', multi_class = 'ovr')
log_reg.fit(X_train, Y_train)

des_tree = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', max_depth = 15)
des_tree.fit(X_train, Y_train)
```

gleaming gyro
#

Is there anyone who uses Astropy here?

tardy portal
#

Results: DecisionTreeClassifier(max_depth=15)

#

Not sure if I am on the right path here

gleaming gyro
#

uhh

#

i have done this one

#

df_ea is the dataset right?

tardy portal
#

yeah

gleaming gyro
#

do you have some results already?

tardy portal
#

thats correct

gleaming gyro
#

like, it should look like this

#

something like that?

#

or is it purely new?

tardy portal
#

we're given a dataset and the first nine questions we reprocess it and question 10 is the question I just posted

gleaming gyro
#

oh

#

then why you even need fit and all?

#

is it like you skipped this question and the next ones require that?

#

Is there anyone who uses Astropy here?
@gleaming gyro

tardy portal
#

No I haven't skipped any questions, but the last two questions rely on this question

tardy portal
#

@desert oar would you be able to assist me?

raven mulch
#

Hey guys! If anyone is interested I made a second video on my series where I make my own deep learning library from scratch and the plan is to use it to deploy a MNIST classifier on a webserver! This video is about the implementation of Adam, RELU and the new scikit learn API it has. Here it is: https://www.youtube.com/watch?v=UELWdyJVVRg

Welcome back!

In today's video I go over the implementation of the Adam optimizer, the RELU activation function and the scikit-learn inspired API of the library!

Code: https://github.com/Fedzbar/deepfedz
Adam paper: https://arxiv.org/abs/1412.6980

desert oar
#

@tardy portal What exactly did you need help with

tardy portal
#

@desert oar trying to understand what this line of code executes:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)

desert oar
#

Did you read the docs for that function, or the user guide?

#

Do you know what a train/test split is?

#

This is homework right? Did your teacher explain this at all?

tardy portal
#

vaguely

#

hence why i'm in here asking for assistance

desert oar
#

Which question is that an answer to

tardy portal
#

if my professor explained this at all

desert oar
#

I see

#

what did they say?

#

Are you familiar with the idea of train and test sets?

tardy portal
#

From what i understand is that it splits the data into test and train models?

#

With that said, is there a specific method I should use for fitting the model?

acoustic halo
#

It splits a single larger dataset into two smaller datasets, you use one of these to train a model, and the other to see how well the model works on new data. It's not making a model itself, you would have to pick a model yourself or whatever your homework specifies and feed these datasets into that

tidal bough
#

The train/test split is itself to estimate how well the model performs on the data the model didn't train on. This is important to prevent overfitting - if your model has enough parameters, it can perfectly memorize the training data, but it won't help it with the testing set unless it actually got the general idea right.
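
To make the memorization point concrete, here is a small sketch on synthetic data (the data and the target rule are made up; only `train_test_split` and its arguments mirror the homework code). An unrestricted decision tree fits its training rows perfectly, and the held-out 30% shows how much of that carries over.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# invented target: depends on the first feature plus noise
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# 70% of the rows go to training, 30% are held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The gap between the two printed numbers is the part of the training score that came from memorizing noise rather than learning the pattern.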

fallow sandal
#

I know this isn't directly related to Python, but does anyone here have experience with TensorFlow, particularly with object detection? I was wondering how many images in general would it take to get reasonable object detection for just one object? I was watching a tutorial for a deck of cards and for 6 objects they recommended at least 200 images. Would it be the same if I was only training one object?

charred blaze
#

not even close

#

and even 200 images seems way too low.

#

for such small amount of data, you would be better served with using an approach like kNN and using some good ol' feature extraction.

flat quest
#

no use a pretrained model

#

and train over that with 200 images

fallow sandal
#

Yeah sorry, I meant a pre-trained model from the example code

flat quest
#

you won't get a good obj detection model with something like KNN or traditional ML techniques.

fallow sandal
#

I'm a big noob so sorry if I say anything technically wrong or if there is some clarification i need haha

#

I was just looking up a tutorial and I want to figure out how to train a custom image detection using one of the pre trained models

#

In the example, they were using playing cards which is relatively flat, so im guessing for a 3d object

#

i need multiple angles too?

flat quest
#

there's a lot of obj detection models out there. Pick any one of them and train over your images. If you're only training to detect one object you only really need a couple. My models worked with only 5 images.

fallow sandal
#

Oh sweet

flat quest
#

yeah it would be better to have multiple angles, but it should still perform relatively well from a single angle.

fallow sandal
#

so for example

#

I want to try with a toy goat

#

Do you suppose with one of the example training models, I could get it trained with only a few images?

charred blaze
#

didn't really consider using pre-trained models... yeah, go with that.

flat quest
#

yeah just a few images should work relatively well

fallow sandal
#

Ok sweet, thank you so much for the help @charred blaze @flat quest. I really appreciate it, tensorflow object detection seems like a very powerful tool and I really want to get it figured out for projects

#

I have to get the set up installed, was watching videos at 3am and I forgot all the details to get it working

flat quest
#

i don't know if tf has updated their object detection api for tf 2.0, but i do know they have a keras hub layer that can do obj detection out of the box

#

should be an RCNN or SSD model

fallow sandal
#

Yeah, I think I was watching a tutorial and they recommended the RCNN model

#

(If you have the hardware which I do, or I hope i do)

arctic cliff
#

Like the different names

#

Not the whole values for sure xD

desert oar
#

@arctic cliff .unique()

#

Or .drop_duplicates()

arctic cliff
#

Another question

#

df.dropna(inplace = True)

#

What's the inplace for?

desert oar
#

That modifies df without returning it

#

Like list.append

arctic cliff
#

So I don't have to say df = df.dropna() ?

desert oar
#

If you use inplace, then that's correct
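
A tiny example of the difference, on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

cleaned = df.dropna()              # returns a new frame; df still has 3 rows here
rows_before = len(df)

result = df.dropna(inplace=True)   # changes df itself and returns None

print(rows_before, len(cleaned), len(df), result)
```

So `df = df.dropna(inplace=True)` is the combination to avoid, since it would replace `df` with `None`.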

arctic cliff
#

Do I have to full-finish pandas to move on to matplotlib?

#

Because I'm following a book and pandas will take literally too much ..

desert oar
#

You don't have to

#

You don't need pandas at all to use matplotlib

arctic cliff
#

How's that ?

uncut shadow
#

They are both different and independent packages

#

U can use pandas without knowing matplotlib

#

And use matplotlib without pandas

#

But both tools will come in handy

#

So knowing How to use them will help

arctic cliff
#

I see !

#

Thanks a lot

charred blaze
#

consider that a reference guide, not a regular instructional book.

flat quest
#

^^. You can't learn everything. By the time you've completed that book, you'll have forgotten more than half of it.

The real key is only remembering the parts you use most frequently, while being able to search up anything you don't know fairly quickly.

desert oar
#

@arctic cliff matplotlib depends somewhat on numpy, but pandas not at all

#

Pandas objects all get converted to numpy before plotting

#

Pandas provides their own convenience plotting routines but you dont have to use them

arctic cliff
#

Oh

#

@charred blaze, @flat quest, @desert oar I appreciate that !

fervent bridge
#

NPY VS HDF5, whats your guys preference?

frank bone
#

anyone have experience on GPU parallelization & python?

#

My code is just 1 big iteration it goes from A to Z chronologically in a for loop, i.e. A1, A2, A3, A4....A6000....B1, B2.......Z6000
running through the whole dataset would take 45 days on 1 core
Would it be possible to split/parallelize the workload at the first for loop? So I could run A1..A6000 | B1..B6000 | Z1..Z6000 in parallel?

#

Numba looks promising (and very simple), would that work?

#
```py
def some_func(*args):
    out = np.zeros(length_of_output)
    for i in prange(n):
        # independent and parallel loop
        out[i] = ...  # some stuff
    return out
```
#

or is it dependent on what is run inside the for loop?

desert oar
#

@frank bone what is your actual code

#

Maybe its faster vectorized

#

Numba parallel is one option

#

Chunking it into pieces then parallelizing with joblib is another, ideally also vectorizing

#

And yes the effectiveness of numba is heavily dependent on the code you use

frank bone
#

most of code is reading csv files, concat them, calculate new values, isolationforest algo is applied as well

desert oar
#

Oh

#

Numba won't know what to do with that

#

Very likely you cant run that with nopython anyway

#

Just chunk into batches and run with like 4-8 cores using joblib
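
A rough sketch of the "chunk into batches, then run on several cores" idea. It uses the standard library's `ProcessPoolExecutor` rather than joblib so that it has no third-party dependency. `process_batch` is a hypothetical stand-in for the real per-ticker work, which is not shown in the conversation:

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, n_chunks):
    """Split items into up to n_chunks contiguous, nearly equal batches."""
    items = list(items)
    size, extra = divmod(len(items), n_chunks)
    batches, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:
            batches.append(items[start:stop])
        start = stop
    return batches


def process_batch(tickers):
    # Hypothetical placeholder: the real job would read CSVs, fit the model, etc.
    return [t.lower() for t in tickers]


def run_parallel(tickers, n_workers=4):
    """Process each batch in its own worker process and flatten the results."""
    batches = chunked(tickers, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return [row for batch in pool.map(process_batch, batches) for row in batch]
```

`run_parallel` should be called from under an `if __name__ == "__main__":` guard, since worker processes may re-import the script. This only works because the batches are independent of each other, which is the case described here.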

frank bone
#

so its only possible on CPU?

desert oar
#

GPUs dont really work like you think

frank bone
#

even with a 64 core CPU it would still take almost a day

desert oar
#

A GPU is not just like a shit ton of CPU cores

#

It can speed up operations like huge matrix multiplications

#

It wont help read 1m csv files

frank bone
#

biggest performance hit is isolationforest, could that be sped up with a GPU?

#

im retraining the model in every iteration

#

so for whole dataset the model is trained approximately 20million times

desert oar
#

That seems excessive

#

Isolationforest itself is embarrassingly parallel, so you can parallelize individual model fits

frank bone
#

yeah maybe ill just model it once and see if there's a big difference but i did it for max accuracy

desert oar
#

But your problem is also embarrassingly parallel

#

What exactly are you trying to do

frank bone
#

creating a backtesting algo with a 30 day sliding window

#

so each new day i retrain model

#

so the model always takes the past 30 days into account

#

its a time series obviously

desert oar
#

so why do you need to do this 20 million times?

frank bone
#

because im testing for 8000 tickers, 250 times a YEAR* for 10 years

desert oar
#

That seems excessive but ok

#

If this is meant to be part of your model dev workflow, pick a smaller test sample

#

Then if you are confident your model is "ready" you can just do a big run over the weekend

#

Obviously don't test on your training data and vice versa

frank bone
#

well with 1 core itll take 40 days or so

desert oar
#

So rent a AWS or GCP machine with more than 1 core 🤷‍♂️

frank bone
#

and it has tons of parameters...so id love to test what works best and not wait 40 days or even 1 day each time 😄

desert oar
#

But like i said, use a smaller test set

frank bone
#

might do that

desert oar
#

10 years ago the economy was really different

frank bone
#

but home GPU cluster not possible?

desert oar
#

Whatever your algo is, i would be pretty surprised if it worked as well 10 years ago as it does today

#

Look up and read what i said about GPUs

#

Unless you find or write a GPU implementation of isolation forests you just wont get value from a GPU

#

There might also be extant faster non GPU implementations compared to whatever is in sklearn

#

But really testing on 6000 tickers over 10 years seems wildly excessive. But im not a finance guy so idk what's considered normal

#

Why not like 1 year

#

Also you probably shouldn't burn all your data testing the first version of your model

tame fractal
#

@desert oar wtf are you on about?

#

thats pure nonsense

#

also 8000 tickers, 250 times a YEAR* for 10 years is a small dataset

frank bone
#

i was looking at ROCm and saw it has python API

desert oar
#

Like i said im not a finance guy

frank bone
#

so i was wondering if something is doable

tame fractal
#

should easily run on 1 cpu

frank bone
#

also 8000 tickers, 250 times a YEAR* for 10 years is a small dataset
@tame fractal that's the amount of iForest fits

#

20m

desert oar
#

He said hes reading a csv file and retraining an isolation forest among other things each iteration

tame fractal
#

20m rows is very small...

desert oar
#

If that is his algorithm then it is not small

frank bone
#

not talking about rows, talking about retraining iforest 20m times, not with too much data though each time

#

let me measure how long a fit takes

tame fractal
#

iforest?

desert oar
#

@frank bone the idea is that your model is meant to work on 30 days of data?

tame fractal
#

why would you do that?

frank bone
#

the past 30 days, sliding window

tame fractal
#

this doesn't make sense

desert oar
#

@tame fractal now you see what im trying to work with

#

this isnt like, fitting one model on 20m rows

tame fractal
#

no your approach is wrong

desert oar
#

Good, if they can build a model that uses more of the data more effectively then they dont have this weird problem

frank bone
#

training model takes 0.2s each time

#

just measured

#

its not training on a lot of data

#

just 30 day window each time

#

adds up though
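
The numbers quoted in this thread can be checked with a quick back-of-the-envelope calculation. It counts only the model fits at the measured 0.2 s each, ignores CSV reading, and assumes perfect scaling across cores, which is optimistic:

```python
tickers = 8000
trading_days_per_year = 250
years = 10
seconds_per_fit = 0.2  # the figure measured above

fits = tickers * trading_days_per_year * years   # number of model fits
total_seconds = fits * seconds_per_fit
days_on_one_core = total_seconds / 86_400        # 86,400 seconds per day
days_on_64_cores = days_on_one_core / 64         # assumes perfect scaling

print(f"{fits:,} fits, {days_on_one_core:.1f} days on 1 core, "
      f"{days_on_64_cores * 24:.1f} hours on 64 cores")
```

That is 20 million fits and roughly 46 days on a single core, which matches the "45 days" and "almost a day on 64 cores" estimates given earlier.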

desert oar
#

You are using it to look for outliers?

frank bone
#

yes

desert oar
#

In a 30 day window?

frank bone
#

yep

desert oar
#

Maybe @tame fractal has an opinion on that. Sounds like you have some experience with this and can help out identifying some more conventional and tractable modeling techniques

frank bone
#

i first used simple SMA + SD but i dont find it as accurate as iForest

#

in the worst case ill just get a 128 core server CPU 😄

#

but if there's a way to get it done on a laptop let me know your ideas

modest rune
#

Maybe someone has a great data science discord server they can point me to, but I haven't found one. But I need some guidance.

In python, I am trying to calculate the time decay of a stock option, given the stock option's purchase price and the number of days until it expires. For example, if a call option costs 50 cents and I have 20 days till expiration. How will that price decay over the next 20 days assuming the underlying stock price never changes. The output should be a 20 element array containing the price at T=20 days left, T=19 days left... all the way to 0.

I know the Black-Scholes model is capable of calculating what I want, but I need to get the T in the Black-Scholes model onto the left side of the equation. I'd be able to do that, except that T appears inside the CDF function and I have no clue how to handle that.

Good news is that I have simplified the problem because r is zero in my case.

r = risk free interest rate = 0
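
One possible way to approach this without solving for T: back out the implied volatility from today's price numerically, then re-evaluate the Black-Scholes call formula for each remaining day. The sketch below assumes r = 0, no dividends, a European call, and that the implied volatility stays constant over the period. The spot and strike values are made up for illustration, and a 365-day year is also an assumption:

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_price(spot, strike, sigma, t_years):
    """Black-Scholes call price with r = 0 and no dividends."""
    if t_years <= 0:
        return max(spot - strike, 0.0)  # intrinsic value at expiry
    vol_sqrt_t = sigma * math.sqrt(t_years)
    d1 = (math.log(spot / strike) + 0.5 * sigma**2 * t_years) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return spot * norm_cdf(d1) - strike * norm_cdf(d2)


def implied_vol(price, spot, strike, t_years, lo=1e-6, hi=5.0, iters=100):
    """Bisection on sigma; the call price increases monotonically with sigma."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if call_price(spot, strike, mid, t_years) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def decay_curve(price, spot, strike, days_left, days_per_year=365):
    """Prices for days_left, days_left - 1, ..., 0 with the spot held fixed."""
    sigma = implied_vol(price, spot, strike, days_left / days_per_year)
    return [call_price(spot, strike, sigma, d / days_per_year)
            for d in range(days_left, -1, -1)]


# Illustrative inputs only: a 50 cent call, 20 days out, slightly out of the money.
curve = decay_curve(price=0.50, spot=20.0, strike=21.0, days_left=20)
```

Counting from 20 days left down to 0 gives 21 values, not 20. The last one is the intrinsic value, which is 0 for an out-of-the-money call.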

frank bone
#

There's also a Monte-Carlo approach to option pricing, maybe that would work? Just quickly read about it somewhere, don't know if works for you but maybe worth a look.

#

Black Scholes is only for European Style options i think, monte carlo can also be applied to American Style

modest rune
#

Black Scholes is still useful for American style options and there are modified versions of black scholes that are commonly used that take into account dividends and the ability to exercise the option at any time.

#

So, I'd say, that Black Scholes is not only for European Style, rather it was intentionally created for European Options by Black Scholes and Merton, and they were fully aware that with some small modifications it could be made to work in different scenarios.

frank bone
#

didnt know that, its more resource saving too

past rover
#

Anyone know of a Discord for NLP?

#

(Natural Language Processing)

desert parcel
#

does anyone know what nan means

pale thunder
#

Not a Number

desert parcel
#

Got it

fallow sandal
#

I finally got TensorFlow installed, after 3 hours of troubleshooting cause of errors ^_^

frank bone
#

how can I execute a php script from within python and pass command line arguments along?
solution: import subprocess; subprocess.run(["php", "-f", "/path/to/php/script.php", "arg1", "arg_n"], cwd="/path/to/php/", stdout=subprocess.PIPE, check=True)

#

where do i pass command line arguments?

frank bone
#

got it
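
For reference, a small sketch of the pattern in the solution above: each command line argument is its own element of the list, placed after the script path. The helper names `build_php_command` and `run_php` are made up for illustration, and the paths are placeholders:

```python
import subprocess


def build_php_command(script_path, *args):
    """Each command line argument becomes its own list element after the script."""
    return ["php", "-f", script_path, *[str(a) for a in args]]


def run_php(script_path, *args, cwd=None):
    """Run the script and return whatever it printed to stdout."""
    completed = subprocess.run(
        build_php_command(script_path, *args),
        cwd=cwd,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
        text=True,   # decode stdout to str instead of bytes
    )
    return completed.stdout
```

Passing a list (rather than one shell string) avoids quoting problems when an argument contains spaces. On the PHP side the script would typically read the values from `$argv`.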

desert parcel
#

There is a huge gap in my predictions

tidal bough
#

..so?

desert parcel
#

I don't know what to do

#

lowest loss for lr is 1e-6 anything above gives loss of over 5k anything below is NaN

desert parcel
#

I got some where and I found the issue after a lot of digging

#

Can you have floats and ints in a numpy array?

#

Because from messing around I notice that the values will get converted to whatever data type has priority

#

I am not sure what the term is

#
```py
input = np.array([
    [[313, 1],      # HCL
     [323, 1],
     [333, 1],
     [343, 1]],
    [[313, 10e-3],  # Ortho
     [323, 10e-3],
     [333, 10e-3],
     [343, 10e-3]],
    [[313, 10e-3],  # Para
     [323, 10e-3],
     [333, 10e-3],
     [343, 10e-3]]
], dtype='int64')
```
Output:

tensor([[[313, 1],
[323, 1],
[333, 1],
[343, 1]],

    [[313,   0],
     [323,   0],
     [333,   0],
     [343,   0]],

    [[313,   0],
     [323,   0],
     [333,   0],
     [343,   0]]])
#

int64 makes everything an int

#

float32 makes everything a float

tidal bough
#

Can you have floats and ints in a numpy array?
nope, don't think so. Well, you can, but only by specifying the type as obj and losing all performance advantages.

#

Basically, numpy arrays are wrappers over C arrays, so they need to have a constant dtype.
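
A short illustration of the single-dtype behaviour, using a cut-down version of the array from above:

```python
import numpy as np

rows = [[313, 1], [313, 10e-3]]  # ints and floats mixed together

mixed = np.array(rows)                    # numpy picks one dtype that fits all values
as_int = np.array(rows, dtype="int64")    # forces ints, so 10e-3 is truncated to 0
as_f32 = np.array(rows, dtype="float32")  # forces floats, so 313 is stored as 313.0

print(mixed.dtype)   # float64
print(as_int)
print(as_f32)
```

Storing 313 as the float 313.0 loses nothing, whereas forcing `int64` silently turns 0.01 into 0, which is what happened in the output above.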

desert parcel
#

what's a C array?

tidal bough
#

an array in the C language 😛

desert parcel
#

Oh alright

#

so then what are the options then?

#

Other than making it an object

#

Like making two new arrays?

#

then some how combining them?

tidal bough
#

Why would you need an array of mixed ints and floats?

desert parcel
#

the table says so

#

let me get it

#

and there are other floats in there

#

but not all of them are floats

#

Oh wait I have an idea

tidal bough
#

I don't see a single value that can't be a float here.

desert parcel
#

wdym

#

When it becomes a float

#

The 313 becomes a 3.13

tidal bough
#

what's the problem with converting it to a float?

desert parcel
#

I don't want all of them to be a float only some of them

tidal bough
#

The 313 becomes a 3.13
uhh, no? Have you even seen scientific notation?

#

3.13e+02 is 313.

desert parcel
#

oh.

#

I didn't know that

tidal bough
#

3.13e+02 is 3.13 * 10**2

desert parcel
#

is 10**2 10^2?

tidal bough
#

yeah
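
A quick check of the notation in plain Python:

```python
import math

value = 3.13e+02                 # scientific notation literal
same = float("3.13e+02")         # the same number parsed from text
as_text = format(313.0, ".4e")   # how a float is printed with 4 decimals in e-notation
small = 10e-3                    # 10 * 10**-3, i.e. 0.01

print(value, same, as_text, small)
print(math.isclose(3.13 * 10**2, 313.0))
```

The `e+02` part only changes how the number is displayed; the stored value is still 313.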

desert parcel
#

alright

tidal bough
#

Scientific notation (also referred to as scientific form or standard index form, or standard form in the UK) is a way of expressing numbers that are too big or too small to be conveniently written in decimal form. It is commonly used by scientists, mathematicians and engineers...

desert parcel
#

alright

#
```
tensor([[[3.1300e+02, 1.0000e+00],
         [3.2300e+02, 1.0000e+00],
         [3.3300e+02, 1.0000e+00],
         [3.4300e+02, 1.0000e+00]]
```
#

Is this a 2D tensor or a 3D?

#

or is it 1D?

tidal bough
#

check .shape 😛

#

it's 3D, though.

#

three brackets.

desert parcel
#

Oh

#

I gave [3, 4,2]

#

I just thought that meant 3, 4 by 2s

#
```py
input = np.array([[[313.0, 1],  # HCL
                   [323.0, 1],
                   [333.0, 1],
                   [343.0, 1]]], dtype='float32')
```
#

Did I do the tensor right?

tidal bough
#

3,4,2 means 3 "layers" of 4 rows of 2 elements.

#

that table looks 2d, no idea why you're making a 3d tensor .
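
To make the shapes concrete, here is a small comparison of a 3D array with shape (3, 4, 2) and a plain 2D table holding the same kind of rows (the values are taken from the HCL block above):

```python
import numpy as np

# 3 "layers", each with 4 rows of 2 elements
layers = np.zeros((3, 4, 2), dtype="float32")

# a single 2D table: 4 rows of 2 elements
table = np.array([[313.0, 1], [323.0, 1], [333.0, 1], [343.0, 1]], dtype="float32")

print(layers.ndim, layers.shape)  # number of dimensions and the shape
print(table.ndim, table.shape)
print(layers[0].shape)            # one layer of the 3D array is itself a 2D table
```

The number of opening brackets at the start of the printed array equals `ndim`.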

gleaming gyro
#

is mag() of python2 the same as abs() of python3?

fervent bridge
#

Hmm is r+ new to numpy.memmap? I have searched and all links show that previously it wasn't possible to append to npy files. Seems like this has changed? But no one has talked about it?

acoustic halo
#

You can't append to them, they are fixed size, r+ just mean read and write

fervent bridge
#

@acoustic halo then what is a solution, I have 40,000 image arrays and can't load them into memory to just store it all in one go into a npy file

acoustic halo
#

make an empty memmap of the right size first

#

You know how many images and their size

#

np.memmap('filename', dtype='float32 or whatever', mode='w+', shape=(40000, x, y))

fervent bridge
#

40000, 277, 277, 3

#

hmm but then what would prevent it from overwriting @acoustic halo

acoustic halo
#

You can overwrite it freely

fervent bridge
#

is HDF5 any good would it prevent me from having to load in all the files at once? also does it support memmap or something similar?

#

But would overwritting not delete previous data? per iteration?

acoustic halo
#

as long as its open in a write mode

#

What do you mean? Yes overwriting will erase old data but why do you want to keep old data ?

#

While it is open in write mode, you can edit any part of the memmap in any order you want. It won't delete unmodified data, except when you open it with w+, which creates or overwrites the file; to keep existing data, open it with r+ instead

fervent bridge
#

I want to loop over the image one by one and add into the npy file so that I do not have to load all 40000 images at once, ah ok I get it so I just have to keep it open with With open and loop through in there

acoustic halo
#

No, you save it to a variable

fervent bridge
#

So just create a file pre-established hmm let me try

acoustic halo
#

so x = np.memmap(all the parameters)

#

then treat x as if it were an np array

#

just remember to flush it before exiting to save it all to file permanently

#

Just note that when you do flush it may freeze up your pc because it will max out disk usage for awhile, it'll be several gbs in size for that many images
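
A minimal sketch of the workflow described above, with tiny sizes standing in for the real (40000, 277, 277, 3) array. The file location and the placeholder "image" data are assumptions for illustration:

```python
import os
import tempfile

import numpy as np

# Tiny stand-ins for the real sizes (40000, 277, 277, 3)
n_images, height, width, channels = 8, 4, 4, 3
shape = (n_images, height, width, channels)
path = os.path.join(tempfile.mkdtemp(), "images.dat")

# w+ creates (or overwrites) the file at its full, fixed size
store = np.memmap(path, dtype="float32", mode="w+", shape=shape)

for i in range(n_images):
    # Placeholder for loading one real image from disk
    image = np.full((height, width, channels), i, dtype="float32")
    store[i] = image  # only this slice needs to be in memory

store.flush()  # push pending writes to the file
del store

# r+ reopens the existing file for reading and writing without wiping it
reopened = np.memmap(path, dtype="float32", mode="r+", shape=shape)
reopened[0] = 42.0
reopened.flush()
```

A file written this way is raw binary with no header, so the dtype and shape have to be supplied again when reopening. For a real `.npy` file, `np.lib.format.open_memmap` may be the more suitable call.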

pliant sky
#

hi new here! found this discord while trying to learn python. currently trying to understand modulo. I understand the concept but i wanted to get an idea of practical applications (so i guess i don't really understand it)

acoustic halo
#

This is probably the wrong channel for it, but in essence, modulo gives the remainder after a division ex 5 / 2 = 2 remainder 1, so 5%2 = 1

pliant sky
#

oh sorry, which channel? i'm interested in learning python for data science? i understand that modulo gives remainder

acoustic halo
#

As for practical applications, converting 24h time to 12 hour time by doing time%12

#

so 13h would be converted to 1, 16 to 4 etc etc
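
The clock example as a small sketch. Note that a plain `hour % 12` maps 12 and 0 to 0, so the hypothetical helper below turns that 0 back into 12:

```python
def to_12_hour(hour_24):
    """Convert an hour on a 24h clock to the hour shown on a 12h clock."""
    hour = hour_24 % 12
    return 12 if hour == 0 else hour


quotient, remainder = divmod(5, 2)  # 5 / 2 = 2 remainder 1

for h in (0, 12, 13, 16):
    print(h, "->", to_12_hour(h))
```

Other common uses are checking for even numbers (`n % 2 == 0`) and wrapping an index around the end of a list (`i % len(items)`).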

pliant sky
#

i see thanks. for future reference where should i ask noob questions?

acoustic halo
#

Probably the general channel or one of the help channels

#

or maybe the comp-sci channel

pliant sky
#

thanks will do

ebon nebula
#

Hello all. Has anyone finished/enrolled in the IBM data -science specialization and if so, would you recommend it to others.

visual violet
#

how do i filter out these things

oblique belfry
#

What things?

visual violet
#

2/29/2020

#

i cant do

#
df_2020 = df_2020[(df_2020['Date Time'] != '2/29/2020')]
#

bc there it is not exact like that
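
One possible approach, assuming the 'Date Time' cells carry a time of day as well as the date (the sample rows and the exact format string below are guesses, not taken from the real data): parse the column into timestamps, then compare only the date part.

```python
import pandas as pd

# Made-up sample rows; the real column format is an assumption
df_2020 = pd.DataFrame({
    "Date Time": ["2/28/2020 23:00", "2/29/2020 00:00", "2/29/2020 13:30", "3/1/2020 00:00"],
    "value": [1, 2, 3, 4],
})

timestamps = pd.to_datetime(df_2020["Date Time"], format="%m/%d/%Y %H:%M")
leap_day = pd.Timestamp(2020, 2, 29)

# normalize() sets the time of day to midnight, so only the date is compared
df_2020 = df_2020[timestamps.dt.normalize() != leap_day]
```

The string comparison in the original line only matches cells that are exactly '2/29/2020', which is why rows with a time attached slip through.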

fallow sandal
#

Anyone here experienced with tensorflow object detection api? I looked at some old tutorials and new tutorials, do I still have to resize my images for training?

lapis sequoia
#

@fallow sandal You most probably do need to do that.

fallow sandal
#

I was just confused why all the tensorflow v2 zoo pretrained models all had resolutions in their name so thanks for the clarifications

lapis sequoia
#

what are some good Machine/deep learning projects for an undergrad

odd yoke
#

many operations (almost all really) operate on a known size tensor, this is why we enforce some size in the algorithm itself, resizing directly at the entry point of the model @fallow sandal

#

as for why the image sizes are generally so small, it was found that image size only sub-linearly correlates with the performance of the algorithms, but the time it takes to run scales linearly with the number of pixels

#

making an image 512x512 won't make the algorithm's result much better, but it may very well quadruple the time it takes to complete compared to a 256x256
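
The "quadruple" figure follows directly from the pixel counts, assuming the run time grows roughly in proportion to the number of pixels:

```python
def pixel_count(side):
    """Number of pixels in a square image with the given side length."""
    return side * side


ratio = pixel_count(512) / pixel_count(256)
print(pixel_count(256), pixel_count(512), ratio)
```

Doubling the side length gives four times as many pixels to process.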

fallow sandal
#

ah, less pixels to process?

odd yoke
#

yes

fallow sandal
#

ok ty so much for the explanation, that was very helpful

lapis sequoia
#

I'm using Hyperas for hyperparameter tuning, but for some reason the data function passed to optim.minimize() gives an error if I call pd.read_csv in it

drifting umbra
#

@lapis sequoia look at what type optim.minimize() wants for an input maybe?

lapis sequoia
#

it takes functions and the model thats not the issue

drifting umbra
#

sorry

#

i dont understand error and am not super advanced but happy to try help.
do you have exact error?

lapis sequoia
#

thanks i figured it out

#

I just imported pandas inside the function

#

for some reason hyperas isn't using global imports

flat quest
#

@lapis sequoia the best projects are honestly the ones that you create, rather than something already out there.

Have a question you'd like to answer, collect data, then run an analysis on that collected data, documenting each step of that process.

lapis sequoia
#

thats the goal, but to get there i need reference i'm a noob xD

flat quest
#

xD
Hmmm do you have any experience with the ds packages?

fallow sandal
#

I can't seem to find documentation on how to manually load my own trained model, all the documentation code for object detection has code where it downloads a pretrained model from the internet

odd yoke
#

alternatively, tf.saved_model.load, passing the export dir containing checkpoints and metadata as a parameter

fervent bridge
#

Ok so I have 40,000 images I need stored as Arrays, came to the conclusion that HDF5 is best as it lets me append to the HDF5 file, figured best to use Pytables also. So any idea on how I would iterate over these images and append to HDF5?

#

I would like to store my X_train, X_test, X_val and labels in the same file but under different dir

odd yoke
#

are there any issues with manually iterating through the files and generating the arrays and putting them in batch inside dataset objects ?

fervent bridge
#

Batches of 500? 80 batches @odd yoke ?

#

I don't know you think working with batches is easier?

#

I would have to put my training, testing and validation data all in batches, would TensorFlow easily read these?

odd yoke
#

hdf5 has a concept of datasets and groups, datasets are basically files, in this case, i'd just make them contain one image, groups are folder like structures, which contain multiple datasets
processing the files in batches makes it easier to create groups without using up all your ram

#

if you're working with tensorflow, one thing I use fairly often is called a TFRecord, it's a file format that contains data in the form of a sequence of dictionary-like structures

#

it's very well supported within tensorflow obviously, which makes it trivial to pass them into models later on
they are also very easy to split into what are called "shards", which is basically just splitting the TFRecords into multiple files, very useful again when your data is too big to fit in RAM, or if you want to shuffle etc

#

(Also, TFRecords are nowhere near as complex as HDF5, so you may be missing on some feature, but at the same time it can also be much easier to use)

fervent bridge
#

Hmm so I was first looking at NPY correct then I went to research HDF5 as NPY does not allow me to append, as I want to iterate and store all my arrays in the same set so that it would be the same as if I had them stored in a variable in RAM, I want 3 datasets in HDF5 one for X_train, X_test, X_val.

#

I will take a look at TFRecords if you recommend it

#

really just want to know whats the best way to proceed

#

@odd yoke

odd yoke
#

I think either can work really, there'll be pros and cons for every format, for example, TFRecords are not widely supported across most libraries, let alone most languages, this will tie you to tensorflow

#

I don't have much experience with HDF5 to talk about its pros & cons tho

#

I am confident that if you stick to tensorflow, TFRecords will make your life easier to interact with your models, manipulate your splits as you will etc

#

I'm also not really sure what you mean by "append" here, can you elaborate on that ?

#

Do you want to add images to existing TFRecords/NPY/HDF5 dynamically ?

fervent bridge
#

Have an open file and write/append each image array

#

Loop through my images and append to the previous write in the file

odd yoke
#

I don't think you can do that with TFRecords either, what you can do however is create a new tfrecord, and add it to the list of the files you use when you generate your tf.data.Dataset object

#

You can also sync up everything every once in a while, by merging them all together

#

Tho this is a costly operation

#

one difference between tfrecords and hdf5 is that the data doesn't have to be homogenous, you basically store key/value maps, which can contain all the metadata you want of any type

#

now, whether that is useful or not depends on your use-case

fervent bridge
#

Eh I just want to pass in these arrays, if my RAM allowed I would have one X_train list with 40,000 arrays and just pass these into my model but I can't so hmm I have to keep researching

#

Thanks, I will look into both HDF5 and TFRecords

fallow sandal
#

alternatively, tf.saved_model.load, passing the export dir containing checkpoints and metadata as a parameter
@odd yoke I'm not sure if I'm using keras (still new to this, sorry). I'm trying to modify the notebook file to fit my custom trained model, the last thing I have to modify is that I have to figure out what this function returns as a data type

#

since i assume this is where it loads the model

#

trying to dissect what load_model() does, and how I can just substitute the path directory to my pretrained model

#

Is this where the thing you mentioned will work? (tensorflow.saved_model.load()?) I don't have tensorflow imported as tf

odd yoke
#

you want to replace model_dir with your path

fallow sandal
#

Oh I guess it does import it as tf nvm

#

Ok, I will try that

#

Thank you so much @odd yoke

odd yoke
#

the current one uses tf.keras.utils.get_file, you can remove that line and pass in your own path directly

#

where your saved model is

fallow sandal
#

so just ignore the parameters inside?

#

"(
fname=model_name,
origin=base_url + model_file,
untar=True)"

#

sorry my ''' is broken on my keyboard so i can't do the gist thing

odd yoke
#

remove the whole tf.keras.utils.get_file(...) part