#data-science-and-ml
i found it
Hey all, I have this dataframe and need to do some subtraction. Every fourth row should be subtracted from the previous three rows. For example: Row 3's values should be subtracted from rows 0,1,2 and then row 7's values should be subtracted from rows 4,5,6 and so forth. How can I accomplish this via something like df.diff()?
@marsh berry groupby and transform
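A minimal sketch of the groupby-and-transform idea, assuming the frame's length is a multiple of four and that every fourth row (positions 3, 7, ...) is the one to subtract. The column names `a` and `b` and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 8-row frame; rows 3 and 7 are the ones to subtract.
df = pd.DataFrame({"a": [10, 11, 12, 1, 20, 21, 22, 2],
                   "b": [5, 6, 7, 1, 8, 9, 10, 2]})

# Label each block of four consecutive rows: 0,0,0,0,1,1,1,1,...
block = np.arange(len(df)) // 4

# Broadcast the last row of each block to every row of that block.
last_in_block = df.groupby(block).transform("last")

# Subtract it, then drop the fourth rows themselves (they would be all zeros).
keep = np.arange(len(df)) % 4 != 3
result = (df - last_in_block)[keep]
print(result)
```

`df.diff()` only compares a row with a neighbour at a fixed offset, which is why a per-block broadcast is used here instead.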
```py
input = np.array([
    [[313, 1],  # HCL
     [323, 1],
     [333, 1],
     [343, 1]],
    [[313, 10e-3],  # Ortho
     [323, 10e-3],
     [333, 10e-3],
     [343, 10e-3]],
    [[313, 10e-3],  # Para
     [323, 10e-3],
     [333, 10e-3],
     [343, 10e-3]]
], dtype='float32')
target = np.array([[[14.76, 16.42, 18.08, 23.41]],
                   [[5.87, 11.14, 13.20, 25.72]],
                   [[2.73, 4.42, 8.04, 13.68]]], dtype='float32')
```
```py
loss_fn = F.mse_loss
loss = loss_fn(model(input), target)
```
Output:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: UserWarning: Using a target size (torch.Size([3, 1, 4])) that is different to the input size (torch.Size([3, 4, 4])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
Actually I'll move this to help channel sorry
Is this a good place to ask machine learning questions?
Hey guys, newbie here. Just wanted some pointers on how to proceed with one of my projects. I am pretty experienced in many parts of ML, but there is one that I am finding almost no help or advice on.
I basically want to build a model (PyTorch or TensorFlow/Keras) that takes a number and a string as features and tries to find some highly complex relation between the two fields. I have already made a CSV file where each row has a number and then a string, separated by commas. The model will then predict the number based on the string.
So I wanted some pointers on how the model should be built: what layers to use, recommended hidden layers for pretty complex relations, and whether you are aware of some pre-trained NN that can accomplish this task. I tried Googling it but didn't come up with many examples that do that. I ruled out LSTM and GRU because they do not find very complex relations in the data, but I would experiment with them if the experts think so.
So does anybody have an idea how to accomplish this?
Is imbalance at the feature level also bad, like class imbalance? I mean, a feature that has the same value for all instances is useless, as it gives no unique information, and thus should be dropped. But what about a feature that is 95% value A and 5% value B? Should it also be dropped? How do we approach this? Is there a term for this like "class imbalance"? Any resources for reading more on the topic?
can someone help me, i am not able to create a dataframe using pandas
i can't seem to figure out the problem
ValueError: arrays must all be same length
you've got one age value too many
ahh , thanks found it.
@thin terrace Look up down-sampling and up-sampling (Data Augmentation) if unbalanced classes are what you mean....
It's not
@thin terrace good question. If there isn't already a topic for that on stats.stackexchange.com you might get good answers if you ask it there
@thin terrace feature variance
"should it be dropped" is not a simple question to answer
If I have two data frames with the same date index and I want to join on some fields in lhs... lhs.join(rhs, on=['field1', 'field2', 'field3']) doesn't work. So it looks like it needs to be joined on the index. But if I use groupby, it doesn't work because you can't join on groupby objects. 😕
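One possible way around this, sketched under assumptions: `DataFrame.join` matches the `on` columns of the caller against the index of the other frame, so when both the date index and some columns should be keys, it is often simpler to move the index into a column and use `merge`. The index name `date`, the columns `x` and `y`, and the data are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"])
lhs = pd.DataFrame({"field1": ["a", "b", "a"], "x": [1, 2, 3]},
                   index=pd.Index(dates, name="date"))
rhs = pd.DataFrame({"field1": ["a", "b", "a"], "y": [10, 20, 30]},
                   index=pd.Index(dates, name="date"))

# Turn the shared date index into an ordinary column, merge on the
# date plus the extra key field(s), then restore the date index.
merged = (lhs.reset_index()
             .merge(rhs.reset_index(), on=["date", "field1"], how="left")
             .set_index("date"))
print(merged)
```

If one side comes from a groupby, aggregating it first (for example with `.sum()`) yields a regular DataFrame that can be merged the same way.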
consider this: when you train a model, you're basically deriving a relationship between features (input) and target (output), keeping in mind certain assumptions
simplest example: linear regression; the assumption is that the target is a linear function of the features
so, going back to your question...
how much predictive power does the feature add to your model?
it must be clear that even a column of random values may add some predictive power, simply because of the distribution of your target (so, basically, overfitting).
say it's a classification problem and even though the feature's value ratio is 95:5, it is so strongly predictive that all of the samples with the minority value for that feature are of the same class.
if the training sample roughly reflects real-world data, you basically get confirmed predictions for 5% of incoming samples for free.
of course, it's not clear that that is actually the case, and if it's not, then you have overfitting. that is one of the reasons you would consider dropping a feature with low variance.
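A small sketch of the check described above, on made-up data: measure how dominant the majority value of a feature is, then look at how pure the class labels are among rows holding the minority value. The 95:5 split and the labels `neg`/`pos` are invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: (feature value, class label) pairs, 95:5 split.
rows = [("A", "neg")] * 60 + [("A", "pos")] * 35 + [("B", "pos")] * 5

# Share of rows taken by the most common feature value.
value_counts = Counter(value for value, _ in rows)
dominant_share = max(value_counts.values()) / len(rows)

# Class distribution among rows that hold the minority value.
minority_value = min(value_counts, key=value_counts.get)
minority_classes = Counter(label for value, label in rows
                           if value == minority_value)
minority_purity = max(minority_classes.values()) / sum(minority_classes.values())

print(f"dominant value share: {dominant_share:.2f}")
print(f"class purity for {minority_value!r}: {minority_purity:.2f}")
```

A high purity on only five rows can easily be chance, which is the overfitting risk mentioned above; checking the same statistic on held-out data is one way to tell the two cases apart.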
@fiery frost just ask the channel
I am trying to build a network to do the following:
clean the speech bubbles in manga (Japanese comics).
i have a ton of before and after images.
But instead of just deleting, i want the network to recognize the speech bubbles and their edges.
Here is before and after:
Before:
After:
Video demonstrating how it should work.
Thx to everyone who can help.
please take into consideration that i am a newbie at this.
(I tried to do this by detecting edges, but it is not perfect and requires me to adjust it by hand every time.)
Hi everyone,
I come to you because after a lot of research, I still cannot fix my problem.
So, I would like to read a JSON file and put it in a dictionary. I know the json library works with json.load().
BUT, above a certain size (500k Ko), json.load() stops working and raises a MemoryError.
I've found the ijson library with its ijson.parse() method, but the output is not a dictionary and I absolutely need the output as a dict.
Do you have any ideas to fix this?
Here is my code:
import json

def getJsonFile(path):
    #========================================================
    # GET THE JSON FILE
    #========================================================
    data = {}
    try:
        with open(path, 'r', encoding='UTF-8') as jFile:
            data = json.load(jFile)
    except Exception as e:
        print(type(e))
    finally:
        return data
Hope someone can help me!
Have a nice day !
@spark cape Can you help me?
I don't know if this was the right channel for my question; I've just joined this Discord, sorry for that
@molten ravine it's a problem with your PC; on mine I loaded 5 MB of JSON.
@molten ravine What error do you get? I mean, you shouldn't actually use try except there, remove it and then try to run it. If you get any errors then paste them here so we know what might be wrong
Can someone guide me pls? 🙏
i have a MemoryError
@molten ravine Do you see that: https://stackoverflow.com/questions/40399933/memoryerror-when-loading-a-json-file ?
hello! is it possible for features in a random forest regressor to be a mix of numericals and pd.get_dummies(categoricals)? sorry for the noob question >.<
@wanton kiln i've already seen this...
As I said, ijson is not a solution for me because I do not know how to convert its output into a dict...
So if someone could guide me if it's possible to convert my json object as a dict, using ijson, should be nice. But actually i do not know how to do this
I don't know what 500k Ko is, but if you mean 500 KB, that's trivially small and the issue is on your end.
I mean 500 000 Ko
what is ko
sorry, it's the French notation; I mean KB, I think
so yeah, it's trivially small, but I don't understand why I'm getting a memory error
Help someone? 🙏
@molten ravine try these examples: https://www.geeksforgeeks.org/convert-json-to-dictionary-in-python/
?
@wanton kiln what you've sent is exactly my code right now
import json

def getJsonFile(path):
    #========================================================
    # GET THE JSON FILE
    #========================================================
    data = {}
    try:
        with open(path, 'r', encoding='UTF-8') as jFile:
            data = json.load(jFile)
    except Exception as e:
        print(type(e))
    finally:
        return data
Pls? 🙏 🙏
@fiery frost You need to build a deep learning solution using Keras and TensorFlow.
Can you guide me more pls?
I am very new to this. @wanton kiln
@molten ravine I think you need to index into the JSON data like in the example:
# for reading nested data, [0] represents
# the index value of the list
print(data['people1'][0])
# for printing the key-value pairs of the
# nested dictionary, a for loop can be used
print("\nPrinting nested dictionary as a key-value pair\n")
for i in data['people1']:
    print("Name:", i['name'])
    print("Website:", i['website'])
    print("From:", i['from'])
    print()
@fiery frost ohhh... it's somewhat difficult to address this issue here in chat.
let me think how can i help you, ok.
I cannot even index into it if I'm not able to load the JSON file.
The problem is that json.load() raises a MemoryError. In the example, the data has already been loaded.
@wanton kiln thx!!!
@molten ravine can you use jsonString = jFile.read()? If not, then the problem is in reading the data. If it works, then ijson gives you a generator that you can pull data through, like [x for x in ijson.items(f, 'item')]; then you have a list of dicts
(json data is often a top level list of json object entries)
I have an excel file with a list of medical exams made per place. I want to make a code to go through it and sum the amount made of exams per place and put it in a dataframe to later write as excel.
exams.groupby('location').agg({'examsTaken': 'sum'}).reset_index(['location'])
this would return only the amount?
i was going for a for loop, but it would, let's say, pick the first line: hemogram in place A. It would go through the whole sheet and sum the amount done. Then if it picks hemogram in place B, it would repeat again
the pandas way to do that is groupby(...).apply(your function that works on the group you are interested in).reset_index(...) to put the index back how it was
apply is for your own code; but there are built-ins like sum, etc. You can also do...
exams.groupby('location').examsTaken.sum()
My Excel columns are code, exam, amount asked, amount done, value. Is it better to just concat code and exam, since they are unique to each other?
I've deleted the place column since I don't need it
it's up to you, but you can pass a list to groupby, e.g. groupby(['location', 'code']), to make two layers of grouping if that helps
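A short sketch of both groupings discussed here, on an invented table. The column names `location` and `examsTaken` come from the snippets above; the `exam` column and all values are assumptions.

```python
import pandas as pd

# Hypothetical exam log; each row is one batch of exams at one place.
exams = pd.DataFrame({
    "location": ["A", "A", "B", "B", "A"],
    "exam": ["hemogram", "x-ray", "hemogram", "hemogram", "hemogram"],
    "examsTaken": [3, 1, 2, 4, 5],
})

# One total per location.
per_location = exams.groupby("location", as_index=False)["examsTaken"].sum()

# One total per (location, exam) pair: two layers of grouping.
per_location_exam = exams.groupby(["location", "exam"],
                                  as_index=False)["examsTaken"].sum()

print(per_location)
print(per_location_exam)
```

Both results are plain DataFrames, so `per_location_exam.to_excel("totals.xlsx", index=False)` would write them out later (the file name is a placeholder, and an Excel writer such as openpyxl has to be installed).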
let me know how you get on so if it's wrong i improve too. 🙂
To write as excel later
Can someone pls guide me how to implement this?
yeah you'll want that @woeful tusk
but first dump the results in e.g. jupyter notebook to make sure it's correct. itll be quicker than saving file, loading file in excel. etc
@fiery frost your question is like a whole project. it's not really appropriate for this channel. you have a few subproblems: how do you find the speech bubble holes? Once you find them, how do you find the text? Once you do that do you detect language? Once you do that, can you OCR it? Once you OCR'd it, translate it using an online service, I guess. then write the text into the field with appropriate layout (based on the speech bubble dimensions)
@spark cape it's not that complicated.
i dont need to ocr them.
just recognize them.
opencv is an image manipulation library for detecting things. to find the speech bubbles
the R in OCR is recognize
i already did it.
it's not perfect.
there are a lot of loopholes.
so i thought moving to neural network.
for complex relationship between words look up embedding layers and Word2Vec
Word2Vec transforms words into vectors, which lets the neural net capture how close or far apart words are in meaning (like cat & dog, king & queen, etc...)
Embedding layer tries to capture this idea as well, but embedding layer is more of a general name for mapping features into high dimensional spaces to find relationship (e.g clusters) between data
@grave frost In addition to the above, I would look at BERT, it tends to perform better over most other word embedding techniques and is fine-tuned to the specific task at hand rather than just relying on the embeddings of stuff like word2vec and ELMo
Thanks guys for responding. But would it be a good idea to use embeddings when the relationship is not word-to-word, but actually integer-to-word? Would the NN be able to figure out the relationship?
Also @acoustic halo I have a ton of data (10 million rows with 2 data points each) would it be a good idea to retrain BERT from scratch on that data or should I do something else?
Can you give an example of one?
And Bert can give a classification vector for each string you use
Mainly for simple code breaking purposes. Like 1 --> Ad$r. So basically be able to break complex ciphers and predict the number from the data, Input:- Ad$r ; Output:- 1.....
Don't use Bert then
So it isn't a classification task, but rather a prediction or "finding a relation between data" task
Ohk
Bert is for natural language, you might be better using something character level, not sure what though
Me neither. I have been hunting most ML platforms, but looks like I have to go to SE now....
embeddings will work with numbers. You're just projecting the data into a higher dimensional space @grave frost.
character level is probably better for this, but you can still use the BERT architecture. You just encode the characters instead of subwords.
Hmm.... that's right, but if the relation is very complex, would BERT still be able to find it, or should I boost up the hidden layers somehow?
BERT has a large number of hidden layers. I'd try using the architecture as a base, and then see how it performs before trying to boost up the number of hidden layers
isnt bert huge and therefore very difficult to train from scratch?
id be skeptical that fine-tuning an existing bert model pre-trained for human language would be effective in a more abstract problem, but maybe there are sources contradicting my uneducated intuition
but yes in general if you can use an existing proven architecture you definitely should
fine-tuning i'm not sure if it would work, unless there's some parallels with his particular ciphers and natural language.
Training from scratch would take a long time yes, but the smaller BERT architectures can help minimize that.
Well, I will start at 'Medium' to provide me a jump-off point....
Hello everyone, I need some assistance and reassurance that I executed a histogram chart correctly
age_distribution = sns.distplot(df_ea['Age'], bins = 20)
in which case it outputted this:
To clarify, this histogram shows that the majority of the employees in the dataset are between the ages of 30 and 40
Hello! I am a new member of this channel and just wanted to greet everyone of you. Hello!
How can I add Python Type Hints for the columns of a spark dataframe? For example, suppose I have a dataframe called data with the following schema:
root
|-- body: string (nullable = true)
|-- partition: string (nullable = true)
|-- enqueuedTime: timestamp (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- systemProperties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_type: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
@tardy portal looks fine
I updated it and made it more pretty
@rare ice spark dataframe columns aren't really strings though - they are placeholders representing strings
so im not sure how those type hints would even work
do you mean that you want to write a function e.g. process_data that only accepts spark dataframes with a certain schema?
if so, i don't think there's a good way to do this
i've wanted something like this for pandas as well
i haven't found a good answer
@desert oar I was more hoping to get my editor to recognize the schema and be able to autocomplete columns.
yeah, i see
that would definitely be nice
you could probably make this work with generic types
what happens if you add a column to the dataframe? i guess that means its type has changed, right?
in general this is the problem of: how to annotate an object that has runtime-generated attributes or keys, but is expected to be mostly static once it's been generated
it's also a bit messy because pyspark dataframes also support access with __getitem__ as in data['year']
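Not real type hints, but one low-tech workaround for the editor-autocomplete wish, sketched with hypothetical names: keep the column names from the schema above as string constants on a class, so the editor can complete `EventColumns.` and a small helper can check a dataframe's columns at runtime. `EventColumns` and `missing_columns` are invented for this example and are not part of PySpark.

```python
class EventColumns:
    """Top-level column names from the schema, as plain string constants."""
    BODY = "body"
    PARTITION = "partition"
    ENQUEUED_TIME = "enqueuedTime"
    EVENT_TYPE = "event_type"
    YEAR = "year"
    MONTH = "month"
    DAY = "day"


def missing_columns(actual_columns, expected=EventColumns):
    """Return the expected column names that are absent from actual_columns."""
    # Only the upper-case attributes are column constants.
    wanted = [value for key, value in vars(expected).items() if key.isupper()]
    return [name for name in wanted if name not in actual_columns]


# With a real dataframe this would be missing_columns(data.columns).
print(missing_columns(["body", "year"]))
```

Usage would then look like `data.select(EventColumns.YEAR)` or `data[EventColumns.YEAR]`. The constants go stale if a column is added at runtime, which is the same limitation raised above.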
Can anyone recommend a channel about AI / ML ?
@lapis sequoia This one
Okay so I need assistance with creating a pairplot for my dataset, specifically for the numerical variables. I'm trying to utilize this link: https://seaborn.pydata.org/generated/seaborn.pairplot.html, however, how am I able to specifically target the numerical variables within the dataset?
df_ea.fillna(value = 0, inplace = True)
num_scatt_plot_matrix = sns.pairplot(df_ea.iloc[:, :17])
could someone explain what sns.pairplot(df_ea.iloc[:, :17])
that means?
alright so since I want to run this pairplot for only the numerical variables I entered this:
```py
sns.set(style = 'whitegrid')
df_ea.fillna(value = 0, inplace = True)
num_scatt_plot_matrix = sns.pairplot(df_ea, hue = 'Attrition')
```
It generated a lot of information, however I came across this error: RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density.
hey if someone here is familliar with mysql could you look in databases
I managed to figure it out by entering the column names separately, but I wanted to know if there's an easier method I could run this pairplot based on the columns that hold numerical values.
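One way to avoid typing the column names by hand is `DataFrame.select_dtypes`, sketched here on a tiny stand-in for `df_ea`. `Age` and `Attrition` appear in the messages above; `MonthlyIncome`, `Department` and all values are assumptions.

```python
import pandas as pd

# Tiny stand-in for df_ea with a mix of numeric and text columns.
df_ea = pd.DataFrame({
    "Age": [31, 45, 28],
    "MonthlyIncome": [4200.0, 8800.0, 3100.0],
    "Department": ["Sales", "R&D", "Sales"],
    "Attrition": ["No", "Yes", "No"],
})

# Keep only the columns whose dtype is numeric (ints and floats).
numeric_cols = df_ea.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# The list can then be handed to seaborn, keeping the hue column available:
# sns.pairplot(df_ea, vars=numeric_cols, hue="Attrition")
```

Passing the names through `vars` rather than slicing the frame leaves `Attrition` in place for `hue`.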
@plucky crow in code like this
timediff = df1.groupby('masked_id')['date'].apply(lambda y: y - y.iloc[0])
the timediff series will typically end up with an extra index level, corresponding to the grouping variable. so if you want to assign it back to df1 as a column, you need to remove that index level. in this case i suggested .reset_index(level='date'). if it doesn't work by name, just say level=0, since the extra index level will be added as the 0th level
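An alternative that sidesteps the extra index level, sketched on invented data: `transform` returns a result aligned with the original rows, so it can be subtracted and assigned back directly. Only the names `df1`, `masked_id` and `date` come from the snippet above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "masked_id": [1, 1, 2, 2, 1],
    "date": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-02-01",
                            "2020-02-03", "2020-01-10"]),
})

# First date of each group, repeated on every row of that group.
first_date = df1.groupby("masked_id")["date"].transform("first")

# Same index as df1, so no reset_index is needed before assigning.
df1["timediff"] = df1["date"] - first_date
print(df1)
```

This takes the first row of each group in its existing order, like `y.iloc[0]` in the original lambda, so the frame should be sorted by date first if that matters.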
is there a way i can add some type of if clause in my df column list comprehension?
```python
group['growth'] = [x[-1] / x[0] - 1 for x in group['cases']]
```
I'm wanting to add: if x[-1] - x[0] < 0, then
```python
group['growth'] = 0
```
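A conditional expression inside the comprehension is probably what is wanted here. This sketch uses a plain list of lists in place of `group['cases']`; the numbers are invented.

```python
# Stand-in for group['cases']: each element is a sequence of case counts.
cases = [[10, 12, 15], [20, 18, 16], [5, 5, 10]]

# value_if_true if condition else value_if_false, evaluated per element:
# growth is 0 whenever the last value is below the first.
growth = [x[-1] / x[0] - 1 if x[-1] - x[0] >= 0 else 0 for x in cases]
print(growth)
```

The `if ... else ...` here sits in front of the `for` and picks a value for each element; an `if` placed after the `for` would filter elements out instead.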
ValueError: logits and labels must have the same shape ((None, 6) vs (None, 4))
@woeful shore I keep getting this error. How do I make sure my training data and labels are the same shape? My training data is a list of text, while my labels are an ndarray that has been one-hot encoded. My labels have four targets
does anyone have a good and well explained resource on how you would go about making a very simple supervised ML function with around 10 parameters classifying around 10 different structures in a dataset
I have 0 experience in ML(tensorflow or whatever tool you would use)
My plan is to create a synthetic dataset with the data structures I want to detect and train a model with that, then test on real data
how do you 'label' these different structures? lets say we have a csv file with 100 lines and in 30 lines I have different data structures (around 10 different) I want to detect. The real dataset will be millions if not billions of lines but that was just an example
each line having around 15-16 columns, of which 4-5 are relevant to define a unique structure
and sometimes these structures are comprised of several lines within that 100 lines file
i.e. 3 lines = 1 unique data structure
and they even overlap sometimes
found it very difficult even just trying to structure an algo for that, would probably be very error prone and be hundreds of conditional nesting. Thought ML might be the solution for something like that
as i understand it... usually you'd have 1 datapoint and 1 corresponding label. But how would you go about mapping 3 datapoints to 1 label? then 2 datapoints to 1 label? And then make the training go through all these, check for overlap, and classify the data according to the highest score, with labels that span multiple datapoints having more weight than single ones, for example?
still having trouble understanding the data you're working with
So its multifeatured inputs (each line is an input im assuming)
But some of the lines are grouped together? @frank bone
@flat quest let me give you an example. Specifically the idea is to identify different option strategies that were executed
examples of such strategies
example of data
usually theyre executed at same time but I want to introduce a tolerance of 3 minutes so that makes it a little more tricky but will get me more data output
now the tricky thing is "Label X" could be falsely identified within "Label Y", or any data within all the labels above could be identified as "Label R". But they're all unique instances
So in total there will be 10 labels, and each label can be comprised of more than 1 row but there's no such requirement
The way I would solve it (as I'm currently understanding ML) is to give more weight to a label the more datapoints it comprises
i think calling them labels makes it very confusing for most ML people
label is what you want to predict
if you think an equation like y = mx + b
labels = y = what you want to predict
features are your inputs
do you have data for inputs and corresponding outputs?
So you're trying to categorize options trade into particular categories? @frank bone
@flat quest exactly. And sometimes an option trade is a single entry, sometimes it can be multiple entries
@drifting umbra I was reading this (https://www.tensorflow.org/tutorials/keras/classification) and thought labels is the right term?
yes, labels are what you want to predict
you dont have to call it label
you can call it anything
alright
you can also predict either:
number such as $ in sales, $ stock goes up / down
TRUE / FALSE
or categories
as you have it X, Y, Z, R you have 4 categories
that was just an example, I have around 10 categories and each category can be identified by timestamp, C/P, Strike and maybe TotalTrades (not sure)
the problem i see is the grouping of several datapoints into one category
i dont really think you need machine learning for this
if you are doing what's called "hard coding" rules
such as all options with one date
go into one category
e.g.
if month == January:
    category = Label X
also, to train a model you will need labels for many more options
or many more observations than 4
if it truly needs machine learning
and the category is not set by strict rules
e.g.
if month == January: category = Label X
@drifting umbra not sure you get what im trying to do. It has nothing to do with time
i am saying
if you are categorizing things with STRICT rules
you don't need machine learning
such as all bachelors = unmarried
all children are young
etc
i dont necessarily need machine learning no, but if its easier and faster to implement and is more precise then yes
i just dont know how i would write code to correctly identify i.e. 10 different categories when they can overlap
if you are using strict rules, machine learning will be less precise
seems very complicated and error prone with many many many conditional nested loops but i just started learning python 2 weeks ago so maybe its a knowledge issue 😄
this is pretty easy problem
do you have more observations
than 4
need like hundreds of thousands
*OR thousands. hundreds OR thousands. sorry
long day @ work
i do a prefilter, and out of that prefilter I get the data I colored in the above picture
and in that data i need to identify the categories, and they're defined by just 4 or 5 observations, or 2* if it's a dual pair, 3* if it's a triple pair
if there's a simple solution im all for it.. 😄
okay so
if the features for the colored things
are similar
and
you have more observations with labels
i only see 4 that are labeled
how many labeled rows
do you have is the question
it was just an example to demonstrate the issue
of there being categories with varying amount of csv rows
do you have excel or screenshot of actual data
if you want to do it as machine learning you will need more than 4 with both
input AND output
in column J most are missing output
anyone tried both keras/tensorflow and fast.ai/pytorch? I only tried keras. Easy to use. Never tried the other combination. For those who've tried both, may I know what is your preference?
@jade walrus interested in this as well. i am reading Deep Learning with Python by François Chollet (author of Keras)
The actual data doesnt contain column J, I added those manually to demonstrate what I want to be filtered out
id create a dataset manually where id mark them
ok
or "label" as in the previous link i shared
like i said be aware you would need to do hundreds
or thousands
so maybe
what you want
absolutely
is to use rules
like strict rules
how do you decide what is J?
for example if strike is greater than 40, it's call, then it's Category Z
that could be a rule
let me show you in the example from the screenshot how id categorize it
okay i am saying if you are going to manually go through
and put something in column j
how do you decide what to put in column J ?
im looking at the timestamp with a tolerance of x seconds, if there's a pair (this can be 2 or more) then those 2 (or more) entries are a "pair". After such a pair is identified, i then look at C/P, Strike and Stock Price, if C and P strikes are within 10% of stock price, it's a "Straddle"
thats one label
and I want to identify like 10-20
the problem is the varying degree of numbers of datapoints in a label. For example a label where there's only 1 row, could mistakenly be identified in a pair?
the tolerance might also introduce overlapping between pairs if they happen close to each other
lets say one pair happens at 10:15 and the other pair at 10:17, if my tolerance connects the two, that's a mistake
i think itll be extremely complicated to code all these sometimes deeply nested conditionals and debugging will be hell
if you look at straddles, strangles and iron condor youll understand the problem
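A rough sketch of the hard-coded approach for the one rule described above: group trades whose timestamps are close together, then call a call-plus-put pair a "straddle" when both strikes are within 10% of the stock price. The `Trade` fields, the function names, the 5-second tolerance and the sample trades are all assumptions, and only this single category is implemented.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    ts: int             # Unix timestamp in seconds
    kind: str           # "C" for call, "P" for put
    strike: float
    stock_price: float


def group_by_time(trades, tolerance_s):
    """Group time-sorted trades; a gap above tolerance_s starts a new group."""
    groups = []
    for trade in sorted(trades, key=lambda t: t.ts):
        if groups and trade.ts - groups[-1][-1].ts <= tolerance_s:
            groups[-1].append(trade)
        else:
            groups.append([trade])
    return groups


def classify(group, band=0.10):
    """One call plus one put, both strikes within band of the stock price."""
    if len(group) == 1:
        return "single"
    kinds = sorted(t.kind for t in group)
    near = all(abs(t.strike - t.stock_price) <= band * t.stock_price
               for t in group)
    if len(group) == 2 and kinds == ["C", "P"] and near:
        return "straddle"
    return "unclassified"


trades = [
    Trade(1000, "C", 100.0, 102.0),
    Trade(1001, "P", 100.0, 102.0),
    Trade(1200, "C", 130.0, 102.0),
]
labels = [classify(g) for g in group_by_time(trades, tolerance_s=5)]
print(labels)
```

Each further category would be another small rule in `classify`, which keeps the conditions flat instead of deeply nested. The grouping compares each trade with the previous one, so a long run of closely spaced trades can chain into one group; a wide tolerance makes the overlap problem mentioned above more likely.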
can u say category instead of label
and ok so
"lets say one pair happens at 10:15 and the other pair at 10:17, if my tolerance connects the two, that's a mistake"
yeah i understand options strategies i am CFA
actually know a lot more about investing than data science lol
great 😄
i think u will get much more accurate results hard coding
because you have no answer data
talking of going thru hundreds or thousands and labeling it yourself is a big mistake imo
just create code for conditions
@drifting umbra can u just quickly define it for me? A little confused by the terminology
such as TIME_2 - TIME_1
is that >2 or <2
features are your X's. features are inputs
labels are your Ys. labels are your outputs
@drifting umbra i was thinking of synthetically making them
step 1 import this excel into pandas
@drifting umbra go back to stock market chat
pd.read_csv("file.csv")
yeah i have all filtered data in a dataframe already
the colored entries from picture above
Study dask, skip pandas
@deft harbor wassup
thats where i extracted pairs using unixtimestamp comparison but im stuck and thats why i looked at ML
Just checking in to see what's happening here.
@deft harbor ugh. new tech. this is why i hate technology and am glad i am not a software developer
always new shit to learn
i love learning but i question how much new languages and frameworks even add
I use dask because I found myself trapped using one core all the time
Dask is really useful for multicore processors and in a bigger world, clustering
import modin.pandas as pd
🙂
makes sense
i am using colab beacuse
although i have nvidia GPU. it is fk impossible to get tensorflow to work on windows
Yeah, I use Linux only
I've worked with people who developed notebooks in windows, and its a mess
how would you decide?
i wouldnt label them manually, id create a synthetic dataset with only condors, only straddles, only strangles and then concat them
for training
but im all for a hardcoded classic function, it just seems stupidly complicated to me
https://modin.readthedocs.io/en/latest/
@drifting umbra wow this is great
do you have any suggestions? I mean you understand the problem well since you understand options strategies.
I have a table of different option strategies and I want to put them in correct categories (straddle, strangle, butterfly, etc)
defining observations are timestamp (nearness chronologically IN CASE its applied to multiple lines), C/P (defines possible categories), C/P strikes (nearness to stock price -> defines possible categories) and all this can be applied to several datapoints (csv lines) or a single line, which will also change possible categories
@frank bone make each row in your data frame an options contract
then use logic to build options trades
different data frames for different underlying securities
if you have all the options contracts in a data frame
then you can build strategies from it pretty easy
many different strategies
btw side comment is that i am skeptical of insights from this because generally options are not super liquid
so you may have problems with either
-stale prices
-getting fills / adequate liquidity when you go to implement strategy
you can store the options trades in mini data frames, one big data frame, w/e
like all the legs
that is cleaner than having a column in your df_all_options_AAPL
df_AAPL_Oct_Bear_Put could be a data frame with the rows for a bear put spread on aapl
or of course you can add columns
maybe i wasnt clear enough in how my current data structure looks
screenshot you sent has every options contract as a row
I have a df for a given ticker for a given day with all options trades in it
and im trying to find "option strategies" that were executed for each given day
wait but
ok
i will challange the theoretical basis for this even
or really just 2 things
1- how do you know what are options strategies?
together
and not just different traders
you have no "trader_ID" variable
i think it is just too much guessing
and whats the 2nd thing?
that it's just guessing
you need answers (labels)
a problem you could do
with this data
if i were you
well im 99.9% certain when I look at the data
that it was a strategy executed
yeah so if time is within 2 seconds
if different strike prices
whatever conditions
just think about what you're doing in your head when you are grouping the trades
i would try to use this options data to predict underlying stock price movement
rest of that day or next day maybe
that is something you actually have data for
input1 : $ value of puts bought
input2: $ value of calls bought
input3: # of puts bought
input4: # of calls bought
input5: % of trades in last 15 min of trading
label (what you want to predict) = stock return next day
boom, there you can get data for it
you have data and rules, want to output answers
that is fine
thats not my goal though, i just want to identify categories
ok well you have no answers right
you need to define the rules formally
of how you are grouping the options trades into being done by the same person
if u can do it mentally need to formalize
not even program, just write down
i like what you said to lay down step by step what im going through my head when i categorize it. And it would be somewhat doable but when I think about edge cases things get messy
so im wondering if i create synthetic answers and let ML find the rules, if that will be better
"i like what you said to lay down step by step what im going through my head when i categorize it"
exactly
i mean
you can
i just question how objective your categorizations really are
if u cant lay out the rules
as u see the 2 paradigms for programming
you need either data and rules
or data and answers, and the computer comes up with the rules
hm
wait
maybe i'm an idiot
there are ways to categorize data into groups
like k-means for example
but i think you have to tell it how many groups there are
also it would not make sense for your data
without answers, you could only group similar things together
you need either:
data + rules
or
data + answers
yup im following 😄 for the second answer my only problem is how would I put several datapoints in one category and for other categories just one line and for other 4
would i train the model once for 1 liners, then 2 line data, then 3 line data and then 4 line data(max) and then when predicting i give more weight to categories with more values?
or can i somehow do all in 1?
as i understood tensorflow, you train a model with data(1 csv line)='category'
can u do data(2 csv lines) = categgory1, (4 csv lines)=cat2
and then how would you apply (model.predict) that to data that only consits of 1 liners
ill try to write down rules for 1st case on paper, lets see if i can get clear answers, been trying to do that for 3 days though, all i got was headaches 😄
"how would I put several datapoints in one category"
put the category they are in
on their line in excel
just say "Trade1"
"Trade2"
"Trade3"
"as i understood tensorflow, you train a model with data(1 csv line)='category'
can u do data(2 csv lines) = categgory1, (4 csv lines)=cat2"
not sure what this means
i just think you are no offense
1- guessing. why do you think these trades belong together?
are there any rules you can define? being similar in timestamp makes logical sense
you can get python to group them based on that
just think this makes no sense idk
we are here
lol
i just think you are no offense
1- guessing. why do you think these trades belong together?
@drifting umbra the data at that stage has gone through pre-filtering, in combination with timestamps its highly likely they belong to the same trader. It's not a make or break situation either. I just want high accuracy, never claiming 100%
actually the grouping is 100% only defined on timestamp, after that it's just putting them in different categories. thats the easy part
but grouping i find hard but ill take a pen & paper and have at it again 😄
which is easy part
grouping based on time stamp?
and i have an idea on how to do this maybe
what is max number of options legs in one trade?
4 but thats very rare. mostly 1 2 or 3
i mean you can take this as an example for a df with different categories
20200501 09:41 FAST C 35.0 20200515 1.45 90 1
20200501 10:41 FAST P 35.0 20200515 0.71 350 1
20200501 10:41 FAST P 32.5 20200515 0.25 600 1
20200501 10:41 FAST C 32.5 20200515 3.33 600 2
20200501 15:40 FAST P 35.0 20200515 0.75 160 1
20200501 15:40 FAST P 35.0 20200515 0.75 145 1
20200501 15:47 FAST C 37.5 20200515 0.3063 80 5
adding a column at [2] with unixtimestamps easy
alrdy done that
now just trying to figure how to take out pairs and save them in a separate df or list and take them out of original df, so they dont interfere with other pairs
and all this dynamically, since this is part of an iteration and number of pairs will always be different
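A minimal sketch of the timestamp grouping discussed above, assuming the sample rows are already in a DataFrame. The column names (`time`, `ticker`, `cp`, `strike`, `size`, `group_id`) and the 2-second `MAX_GAP` threshold are made up for illustration. Rather than pulling pairs out of the original frame, it tags every row with a group id, so the number of groups can differ on every iteration.

```python
import pandas as pd

# Toy rows modelled on the sample above; column names are hypothetical.
rows = [
    ("20200501 09:41", "FAST", "C", 35.0, 90),
    ("20200501 10:41", "FAST", "P", 35.0, 350),
    ("20200501 10:41", "FAST", "P", 32.5, 600),
    ("20200501 10:41", "FAST", "C", 32.5, 600),
    ("20200501 15:40", "FAST", "P", 35.0, 160),
    ("20200501 15:40", "FAST", "P", 35.0, 145),
    ("20200501 15:47", "FAST", "C", 37.5, 80),
]
df = pd.DataFrame(rows, columns=["time", "ticker", "cp", "strike", "size"])
df["ts"] = pd.to_datetime(df["time"], format="%Y%m%d %H:%M")

MAX_GAP = pd.Timedelta(seconds=2)  # assumed threshold, tune to the data

df = df.sort_values(["ticker", "ts"]).reset_index(drop=True)

# A new group starts at the first row of each ticker, or whenever the gap
# to the previous row of the same ticker exceeds MAX_GAP.
gap = df.groupby("ticker")["ts"].diff()
new_group = gap.isna() | (gap > MAX_GAP)
df["group_id"] = new_group.cumsum()

# Each candidate trade as its own small frame; df itself is left intact.
groups = {gid: g for gid, g in df.groupby("group_id")}
```

Because the groups are just a column, single-leg and multi-leg candidates can be handled by the same categorisation code afterwards, e.g. by looping over `groups.values()`.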
got my brain juices flowing a little bit now...trying something 😄
yo
sorry
have idea
your input should be 4 rows of these characteristics
output is TRUE/FALSE
whether they are in a trade together
THAT is doable with machine learning!
the 4 rows will consist of any related rows which actually form a category
plus the number of rows to get to 4
randomly chosen from data set
pretty sure it will work
using true/false as Y is solution i think
that way you are predicting which of inputs are in a group together
so in the future when you give it a bunch of trades it can output predictions TRUE/FALSE of those that are related
but how to further differentiate between the different groups afterwards? that big group with consist of 10-20 subgroups (option strategies)
the table you showed actually consists only of those subgroups we're talking about. But I think it should be possible without ML, still writing down I think there's a way..so far
each sample
should contain how ever many that are in a group
and the rest of the rows selected at random from your data set
you wil feed the data into your model to project in rows of 4
splitting it up into 4s
Is this a good place to ask minor stats questions?
should contain how ever many that are in a group
@drifting umbra Oh I get it
splitting it up into 4s
@drifting umbra what do you mean by that?
okay so you have hundreds of trades right
this example you have 8 rows
it will be broken into groups with 4 rows
the model will output TRUE if it thinks the trade is part of a group
Yeah, its not homework, I just didn't pay attention in college when they were teaching this stuff. Now I am almost done with my phd and doing some minor projects.
very cool
@frank bone given 4 rows
it will predict WHICH are part of a trade together
if TRUE you can add tag them with "Trade+counter"
where counter increases by 1 everytime
so for example it might tag the first 2 out of 4 TRUE
add "Trade1" next to those entities in your DF in a new column called "Trade_No_Predicted"
you feed in 4 more
it predicts all 4 are part of one trade
all 4 are predicted TRUE
you add "Trade2" next to all in the same column "Trade_No_Predicted"
Ok...so here's the thing. I have this boxplot.
It doesn't show any outliers. But if I plot a histogram, it gives me this
it will predict WHICH are part of a trade together
@drifting umbra ohh sort of a "sliding window" (just for understanding) from top to bottom, at each new row it checks whether trades are part of group. Do I understand correctly?
@frank bone yesss
Interesting, and for training data i just label true/false?
sorry if i got terminology mixed up with "label" here. But in training data I need to have features + label right
I am sorry, but i don't get what you are saying. The whiskers are the min and max in the data aren't they?
@frank bone no with training data you label with groups
later convert to true/false
when converting from whole data set into rolling window of 4
rows
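A rough sketch of that conversion, assuming the training frame already has a hand-made group column (called `trade` here, a made-up name). For each labelled group it builds one sample of 4 rows: the group's own rows plus randomly chosen filler rows, with a TRUE/FALSE target per row. `make_samples` and the toy columns are hypothetical, not something from an existing library.

```python
import numpy as np
import pandas as pd

WINDOW = 4  # max number of legs per trade mentioned in the chat

def make_samples(df, group_col="trade", window=WINDOW, seed=0):
    """Return a list of (rows, targets) pairs, one per labelled group."""
    rng = np.random.default_rng(seed)
    samples = []
    for name, members in df.groupby(group_col):
        others = df[df[group_col] != name]
        n_fill = max(window - len(members), 0)
        # filler rows picked at random from the rest of the data set
        filler = others.sample(n=n_fill, random_state=int(rng.integers(1 << 31)))
        rows = pd.concat([members, filler])
        # shuffle so the position inside the window does not leak the answer
        rows = rows.sample(frac=1, random_state=int(rng.integers(1 << 31)))
        targets = (rows[group_col] == name).to_numpy()
        samples.append((rows.drop(columns=group_col), targets))
    return samples

# Toy data: ts is seconds since the first row, trade is the manual group label.
df = pd.DataFrame({
    "ts": [0, 3600, 3600, 3600, 21540, 21540, 21960],
    "cp": ["C", "P", "P", "C", "P", "P", "C"],
    "strike": [35.0, 35.0, 32.5, 32.5, 35.0, 35.0, 37.5],
    "trade": ["Trade1", "Trade2", "Trade2", "Trade2", "Trade3", "Trade3", "Trade4"],
})
samples = make_samples(df)
```

Each `rows` frame would still need to be flattened into one feature vector (4 rows x number of columns) before it goes into a model.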
@frank bone "training data I need to have features + label right"
Yes
@distant dagger yeah so what are you saying about outliers
everything looks right to me
can you just give me an example what that would look like (the "group")? Slightly confused but I got the concept. thats smart, thanks 😄
Oh ok...I was confused by the histogram cause it has two peaks. @drifting umbra
ohhh
got it
will try that when classic method wont work. Or maybe do a benchmark classic vs. machine
your data
will not be so brain dead easy
to see / predict
but if you have this type of thing
@drifting umbra I guess what I was trying to ask was that since there are two peaks in the histogram, is it simply a bimodal histogram or is there something wrong with the way I have processed the data. The boxplot and the histogram are actually log values
And from what I have heard, bimodal histograms are not very common
bimodal histogram is def possible
depends on ur data
it suggests
u have 2 diff populations
or groups sorry
@drifting umbra very nice thanks. Everything noted down 🙂 have you compared classic vs machine implementations before? did one outperform the other in accuracy?
@frank bone depends entirely on data and problem ur trying to solve
one is not always "better"
i think you can use random forest for this really well
or if you convert everything to categories
either use xgboost or catboost
but basically
can this be done on a CPU? or GPU needed?
or if you convert everything to categories
@drifting umbra ?
not exactly sure what you mean by that? train with categories only?
@drifting umbra I get it now. Thank you. that was really helpful.
@frank bone my idea
@distant dagger 🙂
@frank bone categories is no big deal
computer cannot understand text
only numbers
can you send data again it's too far up
dont worry about that
that is 1 easy line of code
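For reference, a small sketch of turning a text column into numbers with pandas. The frame and the `cp` / `strike` column names are made up for the example; which encoding suits a given model is a judgment call.

```python
import pandas as pd

df = pd.DataFrame({"cp": ["C", "P", "P", "C"], "strike": [35.0, 35.0, 32.5, 32.5]})

# Option 1: integer codes, usually fine for tree models (random forest, xgboost)
df["cp_code"] = df["cp"].astype("category").cat.codes

# Option 2: one column per category (one-hot encoding)
one_hot = pd.get_dummies(df["cp"], prefix="cp")
```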
xgboost or catboost can run on CPU in less than 60 sec
if you need GPU you can use google colab
its free and thats what i do
i would suggest doing data science in jupyter notebooks
can i feed the training with many separate files instead of 1 big concatenated? I'm wondering if increasing label numbers and passing time (i.e. jan to june) would play a role in the algo deciding when that doesnt influence it
no since ill have(?) to feed in unix timestamp, that includes the passing of time, when there should be no correlation in that
skewing the performance
you can easily delete or add columns
i dont want to include passing of time as a feature, but a timestamp would include that?
you can extract just the month or day of week
or year
whatever you want
or no time variable
all easy
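A quick sketch of that idea, assuming a column of unix timestamps in seconds (named `unix_ts` here, a hypothetical name). It keeps calendar parts and the gap to the previous row, then drops the raw timestamp so the model never sees absolute time.

```python
import pandas as pd

df = pd.DataFrame({"unix_ts": [1588326060, 1588329660, 1588347600]})
ts = pd.to_datetime(df["unix_ts"], unit="s")  # interpreted as UTC

df["month"] = ts.dt.month
df["day_of_week"] = ts.dt.dayofweek               # Monday = 0
df["seconds_since_prev"] = df["unix_ts"].diff()   # closeness, not absolute time
df = df.drop(columns="unix_ts")
```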
lol stupid me
yeah
trades which occur closer obv more likely to be part of same trader
gotta sleep
feel free to PM me
good luck with project
thanks again! 😄
gl ☮️
u too
I have sales data from US Cities. I don't have latitude and longitude values, just plain state and city names. How can I visualize them on a map? I tried using Geopy and Geopandas (Nominatim), but it takes too much time as there are around 10k rows. Any better way?
I followed this https://towardsdatascience.com/geocode-with-python-161ec1e62b89. So if I use RateLimiter it takes too much time if I don't it still takes a lot of time and gives too many nan values.
have you tried plotly?
I'm not sure if it works with names but you can do some pretty nice map visualizations with it
Hi @tame fractal, I'm not sure why you decided to post a monkey picture here but I deleted it as it was not relevant to the topic of this channel
passive aggression at its finest
https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/information/
@chrome barn Thanks man. I used pandas merge with this csv. Got exactly what I wanted. Thankyou again.
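Roughly what that merge could look like. The small `lookup` frame stands in for the downloaded CSV; its column names and coordinate values are placeholders, so check them against the real file. Since one city has many zip codes, the sketch averages to one point per (city, state) before merging.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin"],
    "state": ["TX", "MA", "TX"],
    "sales": [100, 250, 75],
})

# Placeholder for the zip-code CSV (made-up, rounded values for illustration)
lookup = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston"],
    "state": ["TX", "TX", "MA"],
    "latitude": [30.33, 30.27, 42.36],
    "longitude": [-97.76, -97.74, -71.06],
})

# One coordinate per (city, state): average over that city's zip codes
coords = lookup.groupby(["city", "state"], as_index=False)[["latitude", "longitude"]].mean()

# Left merge keeps every sales row; unmatched cities would get NaN coordinates
located = sales.merge(coords, on=["city", "state"], how="left")
```

This is a local join, so 10k rows take well under a second, with no rate limiting involved.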
sounds rather exotic but is it possible in python to skip a loop twice?
Use a variable as a flag?
This wouldn't work?
skip_flag = False
for loop:
    if skip_flag or need_to_skip:
        skip_flag = not skip_flag
        continue
    do_stuff()
not sure...i have a
```
df.loc[var[i]]
do stuff...
```
but sometimes i doesnt exist in the dataframe and it throws some ambiguous error
Can you not catch it in a try except statement and put the continue there?
oh i should make a list out of df and then compare, if i isnt in there continue
its an index
so i can do check = df.index.values.tolist()
ill try that
worked just fine 😄
dataframes seem to behave differently when calling within len(). If I do df.loc[var] and that var is in the index just once, len returns the number of colums in that row, however if var is in the index > 1 times, len returns the amount of times var is in the df index
im only interested in getting the amount of var in index, whether it's once or twice or more, not interested in len of colums in row
any ideas how to accomplish that?
You want the number of times a value appears in the index?
(df.index == val).sum()
Len does not behave differently in your case. loc behaves differently
If the value only appears once it returns a series representing a row - the length of which is the number of columns
Thanks a lot! 🙂
You can force loc to return a dataframe by wrapping the value in a list
df.loc[[val]]
Yes but be careful about what they mean
The outer brackets apply to df.loc
The inner brackets arent "attached" to anything so they create a new list
Fair enough makes sense
If val is already a list, series, or index, you should not use the second set of brackets
Good to know
Actually using len(df.loc[[val]]) might be quite a bit faster than the method I suggested above
Because in theory pandas can optimize individual index value look ups
Either with a binary search or a hash table look up
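A tiny demo of the points above on a made-up frame with a duplicated index label. It shows why `len(df.loc[val])` changes meaning, and that checking membership directly on the index avoids building a list first.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "y"],
)

row = df.loc["x"]          # label appears once -> Series; len is the number of columns
frame = df.loc[["x"]]      # list of labels -> always a DataFrame; len is the number of rows
dupes = df.loc[["y"]]      # two matching rows

count_y = (df.index == "y").sum()   # occurrences of a label in the index
has_z = "z" in df.index             # membership test, no tolist() needed
```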
Ill try both
Does anyone have some code for sorting numbers. Beginner to python, and looking to learn something about lists
First of all, not really #data-science-and-ml. You better get a help channel (see #❓|how-to-get-help)
Second, are you asking "how to sort lists" (for which the answer is to use sort/sorted) or "what are some sorting algorithms I can implement" (for which the answer is "bubble sort, insertion sort, merge sort, and the other ones you can find in any Data Structures and Algorithms book")?
okay. just joined this server. Thanks for the help
Question: Split the dataset into training and testing 30 percent testing 70 percent training You can use the train_test_split function in sklearn library
Answer: ```import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
df_ea = pd.read_csv('C:/Users/Bobby/Documents/School/INFO 357/employee_attrition_processed.csv')
# use a list, not a set: a set has no order and newer pandas rejects it as an indexer
input_var = [col for col in df_ea.columns if col != 'Attrition_Yes']
X = df_ea[input_var]
Y = df_ea['Attrition_Yes']
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)
log_reg = LogisticRegression(random_state = 0, solver = 'lbfgs', multi_class = 'ovr')
log_reg.fit(X_train, Y_train)
des_tree = DecisionTreeClassifier(criterion = 'gini', splitter = 'best', max_depth = 15)
des_tree.fit(X_train, Y_train)```
Is there anyone who uses Astropy here?
Results: DecisionTreeClassifier(max_depth=15)
Not sure if I am on the right path here
yeah
do you have some results already?
thats correct
we're given a dataset and the first nine questions we reprocess it and question 10 is the question I just posted
oh
then why you even need fit and all?
is it like you skipped this question and the next ones require that?
Is there anyone who uses Astropy here?
@gleaming gyro
No I haven't skipped any questions, but the last two questions rely on this question
@desert oar would you be able to assist me?
Hey guys! If anyone is interested I made a second video on my series where I make my own deep learning library from scratch and the plan is to use it to deploy a MNIST classifier on a webserver! This video is about the implementation of Adam, RELU and the new scikit learn API it has. Here it is: https://www.youtube.com/watch?v=UELWdyJVVRg
Welcome back!
In today's video I go over the implementation of the Adam optimizer, the RELU activation function and the scikit-learn inspired API of the library!
Code: https://github.com/Fedzbar/deepfedz
Adam paper: https://arxiv.org/abs/1412.6980
@tardy portal What exactly did you need help with
@desert oar trying to understand what this line of code executes from sklearn.model_selection import train_test_split X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 42)
Did you read the docs for that function, or the user guide?
Do you know what a train/test split is?
This is homework right? Did your teacher explain this at all?
Which question is that an answer to
if my professor explained this at all
From what i understand is that it splits the data into test and train models?
With that said, is there a specific method I should use for fitting the model?
It splits a single larger dataset into two smaller datasets, you use one of these to train a model, and the other to see how well the model works on new data. It's not making a model itself, you would have to pick a model yourself or whatever your homework specifies and feed these datasets into that
The train/test split is itself to estimate how well the model performs on the data the model didn't train on. This is important to prevent overfitting - if your model has enough parameters, it can perfectly memorize the training data, but it won't help it with the testing set unless it actually got the general idea right.
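To make that concrete, here is a small sketch on synthetic data (not the homework dataset). An unrestricted decision tree can memorize the training rows, so the number to look at is the score on the held-out 30%. The variable names are only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 rows, 10 numeric features, binary target
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 70% of the rows go to training, 30% are held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = tree.score(X_test, y_test)     # accuracy on data it has not seen
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

A large gap between the two numbers is the usual sign of overfitting.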
I know this isn't directly related to Python, but does anyone here have experience with TensorFlow, particularly with object detection? I was wondering how many images in general it would take to get reasonable object detection for just one object? I was watching a tutorial for a deck of cards and for 6 objects they recommended at least 200 images. Would it be the same if I was only training one object?
not even close
and even 200 images seems way too low.
for such small amount of data, you would be better served with using an approach like kNN and using some good ol' feature extraction.
Yeah sorry, I mean't a pre-trained model from the example code
you won't get a good obj detection model with something like KNN or traditional ML techniques.
I'm a big noob so sorry if I say anything technically wrong or if there is some clarification i need haha
I was just looking up a tutorial and I want to figure out how to train a custom image detection using one of the pre trained models
In the example, they were using playing cards which is relatively flat, so im guessing for a 3d object
i need multiple angles too?
there's a lot of obj detection models out there. Pick any one of them and train over your images. If you're only training to detect one object you only really need a couple. My models worked with only 5 images.
Oh sweet
yeah it would be better to have multiple angles, but it should still perform relatively well from a single angle.
so for example
I want to try with a toy goat
Do you suppose with one of the example training models, I could get it trained with only a few images?
didn't really consider using pre-trained models... yeah, go with that.
yeah just a few images should work relatively well
Ok sweet, thank you so much for the help @charred blaze @flat quest. I really appreciate it, tensorflow object detection seems like a very powerful tool and I really want to get it figured out for projects
I have to get the set up installed, was watching videos at 3am and I forgot all the details to get it working
i don't know if tf has updated their object detection api for tf 2.0, but i do know they have a keras hub layer that can do obj detection out
should be an RCNN or SSD model
Yeah, I think I was watching a tutorial and they recommended the RCNN model
(If you have the hardware which I do, or I hope i do)
How can I print all the category's names ?
Like the different names
Not the whole values for sure xD
So I don't have to say df = df.dropna() ?
If you use inplace, then that's correct
Do I have to full-finish pandas to move on to matplotlib?
Because I'm following a book and pandas will take literally too much ..
What is left to me
How's that ?
They are both different and independent packages
U can use pandas without knowing matplotlib
And use matplotlib without pandas
But both tools will come in handy
So knowing How to use them will help
consider that a reference guide, not a regular instructional book.
^^. You can't learn everything. By the time you've completed that book, you'll have forgotten more than half of it.
The real key is only remembering the parts you use most frequently, while being able to search up anything you don't know fairly quickly.
@arctic cliff matplotlib depends somewhat on numpy, but not on pandas at all
Pandas objects all get converted to numpy before plotting
Pandas provides their own convenience plotting routines but you dont have to use them
anyone have experience on GPU parallelization & python?
My code is just 1 big iteration it goes from A to Z chronologically in a for loop, i.e. A1, A2, A3, A4....A6000....B1, B2.......Z6000
running through the whole dataset would take 45 days on 1 core
Would it be possible to split/parallelize the workload at the first for loop? So I could run A1..A6000 | B1..B6000 | Z1..Z6000 in parallel?
Numba looks promising (and very simple), would that work?
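```py
from numba import njit, prange
import numpy as np

@njit(parallel=True)  # prange only runs in parallel under this decorator
def some_func(*args):
    out = np.zeros(length_of_output)
    for i in prange(n):
        # independent and parallel loop
        out[i] = ...some stuff...
    return out
```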
or is it dependent on what is run inside the for loop?
@frank bone what is your actual code
Maybe its faster vectorized
Numba parallel is one option
Chunking it into pieces then parallelizing with joblib is another, ideally also vectorizing
And yes the effectiveness of numba is heavily dependent on the code you use
most of code is reading csv files, concat them, calculate new values, isolationforest algo is applied as well
Oh
Numba won't know what to do with that
Very likely you cant run that with nopython anyway
Just chunk into batches and run with like 4-8 cores using joblib
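A minimal sketch of the joblib route, assuming each ticker can be processed independently. `count_outliers`, the ticker names and the random data are all stand-ins. In the real job the function would read that ticker's CSVs and run the sliding window.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest

def count_outliers(ticker, values):
    """Fit one IsolationForest on one ticker's window and count flagged points."""
    model = IsolationForest(n_estimators=50, random_state=0)
    labels = model.fit_predict(values.reshape(-1, 1))  # -1 = outlier, 1 = inlier
    return ticker, int((labels == -1).sum())

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 8 tickers with 30 daily values each
data = {f"T{i}": rng.normal(size=30) for i in range(8)}

# One task per ticker, spread over 2 worker processes (raise n_jobs on a bigger machine)
results = Parallel(n_jobs=2)(
    delayed(count_outliers)(ticker, values) for ticker, values in data.items()
)
outliers_per_ticker = dict(results)
```

Splitting at the ticker level keeps each task large enough that process start-up and data transfer do not dominate.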
so its only possible on CPU?
GPUs dont really work like you think
even with a 64 core CPU it would still take almost a day
A GPU is not just like a shit ton of CPU cores
It can speed up operations like huge matrix multiplications
It wont help read 1m csv files
biggest performance hit is isolationforest, could that be sped up with a GPU?
im retraining the model in every iteration
so for whole dataset the model is trained approximately 20million times
That seems excessive
Isolationforest itself is embarrassingly parallel, so you can parallelize individual model fits
yeah maybe ill just model it once and see if there's a big difference but i did it for max accuracy
creating a backtesting algo with a 30 day sliding window
so each new day i retrain model
so the model always takes the past 30 days into account
its a time series obviously
so why do you need to do this 20 million times?
because im testing for 8000 tickers, 250 times a YEAR* for 10 years
That seems excessive but ok
If this is meant to be part of your model dev workflow, pick a smaller test sample
Then if you are confident your model is "ready" you can just do a big run over the weekend
Obviously don't test on your training data and vice versa
well with 1 core itll take 40 days or so
So rent a AWS or GCP machine with more than 1 core 🤷♂️
and it has tons of parameters...so id love to test what works best and not wait 40 days or even 1 day each time 😄
But like i said, use a smaller test set
might do that
10 years ago the economy was really different
but home GPU cluster not possible?
Whatever your algo is, i would be pretty surprised if it worked as well 10 years ago as it does today
Look up and read what i said about GPUs
Unless you find or write a GPU implementation of isolation forests you just wont get value from a GPU
There might also be extant faster non GPU implementations compared to whatever is in sklearn
But really testing on 6000 tickers over 10 years seems wildly excessive. But im not a finance guy so idk what's considered normal
Why not like 1 year
Also you probably shouldn't burn all your data testing the first version of your model
@desert oar wtf are you on about?
thats pure nonsense
also 8000 tickers, 250 times a YEAR* for 10 years is a small dataset
i was looking at ROCm and saw it has python API
Like i said im not a finance guy
so i was wondering if something is doable
should easily run on 1 cpu
also 8000 tickers, 250 times a YEAR* for 10 years is a small dataset
@tame fractal that's the amount of iForest fits
20m
He said hes reading a csv file and retraining an isolation forest among other things each iteration
20m rows is very small...
If that is his algorithm then it is not small
not talking about rows, talking about retraining iforest 20m times, not with too much data though each time
let me measure how long a fit takes
iforest?
@frank bone the idea is that your model is meant to work on 30 days of data?
why would you do that?
the past 30 days, sliding window
this doesn't make sense
@tame fractal now you see what im trying to work with
this isnt like, fitting one model on 20m rows
no your approach is wrong
Good, if they can build a model that uses more of the data more effectively then they dont have this weird problem
training model takes 0.2s each time
just measured
its not training on a lot of data
just 30 day window each time
adds up though
You are using it to look for outliers?
yes
In a 30 day window?
yep
Maybe @tame fractal has an opinion on that. Sounds like you have some experience with this and can help out identifying some more conventional and tractable modeling techniques
i first used simple SMA + SD but i dont find it as accurate as iForest
in the worst case ill just get a 128 core server CPU 😄
but if there's a way to get it done on a laptop let me know your ideas
Maybe someone has a great data science discord server they can point me to, but I haven't found one. But I need some guidance.
In python, I am trying to calculate the time decay of a stock option, given the stock option's purchase price and the number of days until it expires. For example, if a call option costs 50 cents and I have 20 days till expiration. How will that price decay over the next 20 days assuming the underlying stock price never changes. The output should be a 20 element array containing the price at T=20 days left, T=19 days left... all the way to 0.
I know the Black-Scholes model is capable of calculating what I want, but I need to get the T in the Black-Scholes model on to the left side of the equation. I'd be able to do that, except that T is found inside the CDF function and I have no clue how to handle that.
Good news is that I have simplified the problem because r is zero in my case.
r = risk free interest rate = 0
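One possible way around solving for T: keep T as an input, back out the implied volatility from today's price numerically, then reprice the option for each remaining day. The sketch below assumes a call with r = 0, a known spot `S` and strike `K` (the 20.0 and 21.0 are made-up values), and a simple days/365 year fraction. None of that comes from the question itself.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, T, sigma):
    """Black-Scholes call price with r = 0; T is in years."""
    if T <= 0:
        return max(S - K, 0.0)  # at expiry only intrinsic value is left
    d1 = (math.log(S / K) + 0.5 * sigma**2 * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * norm_cdf(d2)

def implied_vol(price, S, K, T, lo=1e-4, hi=5.0, tol=1e-8):
    """Bisection on sigma; the call price increases with sigma."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

S, K = 20.0, 21.0            # assumed spot and strike (not given in the question)
price_now, days = 0.50, 20   # from the question: 50 cents, 20 days left

sigma = implied_vol(price_now, S, K, days / 365)
# Price with 20, 19, ..., 0 days left, holding S and sigma fixed
decay = [bs_call(S, K, d / 365, sigma) for d in range(days, -1, -1)]
```

Note that going from 20 days down to 0 inclusive gives 21 values, so drop one end if exactly 20 are wanted.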
There's also a Monte-Carlo approach to option pricing, maybe that would work? Just quickly read about it somewhere, don't know if works for you but maybe worth a look.
Introduction to pricing European options using a Monte Carlo simulation.
Black Scholes is only for European Style options i think, monte carlo can also be applied to American Style
Black Scholes is still useful for American style options and there are modified versions of black scholes that are commonly used that take into account dividends and the ability to exercise the option at any time.
So, I'd say, that Black Scholes is not only for European Style, rather it was intentionally created for European Options by Black Scholes and Merton, and they were fully aware that with some small modifications it could be made to work in different scenarios.
didnt know that, its more resource saving too
Not a Number
Got it
I finally got TensorFlow installed, after 3 hours of troubleshooting cause of errors ^_^
how can I execute a php script from within python and pass command line arguments along?
solution:
import subprocess
subprocess.run(["php", "-f", "/path/to/php/script.php", "arg1", "arg_n"], cwd="/path/to/php/", stdout=subprocess.PIPE, check=True)
where do i pass command line arguments?
got it
..so?
I don't know what to do
lowest loss for lr is 1e-6 anything above gives loss of over 5k anything below is NaN
I got some where and I found the issue after a lot of digging
Can you have floats and ints in a numpy array?
Because from messing around I notice that the values will get converted to whatever data type has priority
I am not sure what the term is
```py
input = np.array([
    [[313, 1],      # HCL
     [323, 1],
     [333, 1],
     [343, 1]],
    [[313, 10e-3],  # Ortho
     [323, 10e-3],
     [333, 10e-3],
     [343, 10e-3]],
    [[313, 10e-3],  # Para
     [323, 10e-3],
     [333, 10e-3],
     [343, 10e-3]]
], dtype='int64')
```
Output:
tensor([[[313, 1],
[323, 1],
[333, 1],
[343, 1]],
[[313, 0],
[323, 0],
[333, 0],
[343, 0]],
[[313, 0],
[323, 0],
[333, 0],
[343, 0]]])
int64 makes everything an int
float32 makes everything a float
Can you have floats and ints in a numpy array?
nope, don't think so. Well, you can, but only by specifying the dtype as `object` and losing all performance advantages.
Basically, numpy arrays are wrappers over C arrays, so they need to have a constant dtype.
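A short sketch of what the single dtype means in practice, using two rows from the table. Variable names are just for the example.

```python
import numpy as np

values = [[313, 1], [313, 10e-3]]

as_int = np.array(values, dtype="int64")      # 10e-3 (0.01) is truncated to 0
as_float = np.array(values, dtype="float32")  # the ints are stored as 313.0 and 1.0
inferred = np.array(values)                   # no dtype given: everything is upcast to float64
```

So using a float dtype for the whole array loses nothing here, since 313 and 313.0 are the same value.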
what's a C array?
an array in the C language 😛
Oh alright
so then what are the options then?
Other than making it an object
Like making two new arrays?
then some how combining them?
Why would you need an array of mixed ints and floats?
the tables says so
let me get it
and there are other floats in there
but not all of them are floats
Oh wait I have an idea
I don't see a single value that can't be a float here.
what's the problem with converting it to a float?
I don't want all of them to be a float only some of them
The 313 becomes a 3.13
uhh, no? Have you even seen scientific notation?
3.13e+02 is 313.
3.13e+02 is 3.13 * 10**2
is 10**2 10^2?
yeah
alright
Scientific notation (also referred to as scientific form or standard index form, or standard form in the UK) is a way of expressing numbers that are too big or too small to be conveniently written in decimal form. It is commonly used by scientists, mathematicians and engineers...
alright
```
tensor([[[3.1300e+02, 1.0000e+00],
[3.2300e+02, 1.0000e+00],
[3.3300e+02, 1.0000e+00],
[3.4300e+02, 1.0000e+00]]
```
Is this a 2D tensor or a 3D?
or is it 1D?
Oh
I gave [3, 4,2]
I just thought that meant 3, 4 by 2s
```py
input = np.array([[[313.0, 1], #HCL
[323.0, 1],
[333.0, 1],
[343.0, 1]]], dtype='float32')
```
Did I do the tensor right?
3,4,2 means 3 "layers" of 4 rows of 2 elements.
that table looks 2d, no idea why you're making a 3d tensor .
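To see how many dimensions an array has, `ndim` and `shape` can be inspected directly. A small sketch with the HCL rows from above (the names `layers`, `table` and `wrapped` are made up for this example):

```python
import numpy as np

# Shape (3, 4, 2): 3 "layers", each with 4 rows of 2 elements.
layers = np.zeros((3, 4, 2), dtype="float32")
print(layers.ndim, layers[0].shape)

# A single table is 2D: 4 rows of 2 elements.
table = np.array([[313.0, 1], [323.0, 1], [333.0, 1], [343.0, 1]], dtype="float32")
print(table.ndim, table.shape)

# Wrapping it in one more pair of brackets adds a leading axis of length 1.
wrapped = table[np.newaxis, ...]
print(wrapped.ndim, wrapped.shape)
```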
is mag() of python2 the same as abs() of python3?
Hmm is r+ new to numpy.memmap? I have searched and all links show that previously it wasn't possible to append to npy files. Seems like this has changed? But no one has talked about it?
You can't append to them, they are fixed size, r+ just mean read and write
@acoustic halo then what is a solution, I have 40,000 image arrays and can't load them into memory to just store it all in one go into a npy file
make an empty memmap of the right size first
You know how many images and their size
np.memmap('filename', dtype='float32', mode='w+', shape=(40000, x, y)) (float32 or whatever dtype you need)
40000, 277, 277, 3
hmm but then what would prevent it from overwriting @acoustic halo
You can overwrite it freely
is HDF5 any good would it prevent me from having to load in all the files at once? also does it support memmap or something similar?
But would overwriting not delete previous data? per iteration?
as long as its open in a write mode
What do you mean? Yes overwriting will erase old data but why do you want to keep old data ?
While it is open in a write mode, you can edit any part of the memmap in any order you want. It won't delete unmodified data, except when you open it as w+, which creates a fresh file; to reopen an existing file, open it as r+ instead.
I want to loop over the image one by one and add into the npy file so that I do not have to load all 40000 images at once, ah ok I get it so I just have to keep it open with With open and loop through in there
No, you save it to a variable
So just create a file pre-established hmm let me try
so x = np.memmap(all the parameters)
then treat x as if it were an np array
just remember to flush it before exiting to save it all to file permanently
Just note that when you do flush it may freeze up your pc because it will max out disk usage for awhile, it'll be several gbs in size for that many images
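The memmap workflow described above could be sketched roughly like this. The shapes are shrunk so it runs instantly (the real case would be 40000, 277, 277, 3), and `load_image` is a hypothetical stand-in for whatever reads one image from disk. Note that `np.memmap` writes a raw binary file, so the dtype and shape have to be given again when reopening.

```python
import os
import tempfile

import numpy as np

n_images, height, width, channels = 8, 4, 4, 3  # stand-ins for 40000, 277, 277, 3
shape = (n_images, height, width, channels)


def load_image(index):
    """Hypothetical loader: returns one image array filled with its own index."""
    return np.full((height, width, channels), index, dtype="float32")


path = os.path.join(tempfile.mkdtemp(), "images.dat")

# w+ creates (or overwrites) the file at its full size up front.
store = np.memmap(path, dtype="float32", mode="w+", shape=shape)
for i in range(n_images):
    store[i] = load_image(i)  # only one image is held in memory at a time
store.flush()                 # push pending writes to disk
del store

# r+ reopens the existing file for reading and writing without wiping it.
reopened = np.memmap(path, dtype="float32", mode="r+", shape=shape)
reopened[0] = 42.0            # overwrite one slot; the others are untouched
reopened.flush()
print(reopened[0, 0, 0, 0], reopened[3, 0, 0, 0])
```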
hi new here! found this discord while trying to learn python. currently trying to understand modulo. I understand the concept but i wanted to get an idea of practical applications (so i guess i don't really understand it)
This is probably the wrong channel for it, but in essence, modulo gives the remainder after a division ex 5 / 2 = 2 remainder 1, so 5%2 = 1
oh sorry, which channel? i'm interested in learning python for data science? i understand that modulo gives remainder
As for practical applications, converting 24h time to 12 hour time by doing time%12
so 13h would be converted to 1, 16 to 4 etc etc
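A small sketch of that conversion. One detail a bare `time % 12` misses: 0 and 12 both give 0, while a 12-hour clock shows 12, so this version adds a special case (the function name is made up for the example).

```python
def to_12_hour(hour_24):
    """Convert an hour on a 24h clock (0-23) to the hour shown on a 12h clock."""
    hour = hour_24 % 12          # remainder after dividing by 12
    return 12 if hour == 0 else hour


for h in (0, 5, 12, 13, 16, 23):
    print(h, "->", to_12_hour(h))
```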
i see thanks. for future reference where should i ask noob questions?
Probably the general channel or one of the help channels
or maybe the comp-sci channel
thanks will do
Hello all. Has anyone finished/enrolled in the IBM data science specialization and if so, would you recommend it to others?
What things?
2/29/2020
i cant do
df_2020 = df_2020[(df_2020['Date Time'] != '2/29/2020')]
bc there it is not exact like that
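If the column holds a date plus a time, an exact string comparison will never match. One possible approach, sketched here under the assumption that 'Date Time' contains strings like '2/29/2020 12:30' (the sample rows and the 'value' column are invented), is to parse the column and compare only the date part:

```python
import pandas as pd

# Invented sample data; the real column format is an assumption.
df_2020 = pd.DataFrame({
    "Date Time": ["2/28/2020 23:00", "2/29/2020 00:00", "2/29/2020 12:30", "3/1/2020 01:00"],
    "value": [1, 2, 3, 4],
})

stamps = pd.to_datetime(df_2020["Date Time"], format="%m/%d/%Y %H:%M")
leap_day = pd.Timestamp("2020-02-29")

# normalize() sets the time to midnight, so every row on that day compares equal.
df_2020 = df_2020[stamps.dt.normalize() != leap_day]
print(df_2020)
```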
Anyone here experienced with tensorflow object detection api? I looked at some old tutorials and new tutorials, do I still have to resize my images for training?
@fallow sandal You most probably do need to do that.
I was just confused why all the tensorflow v2 zoo pretrained models all had resolutions in their name so thanks for the clarifications
what are some good Machine/deep learning projects for an undergrad
many operations (almost all really) operate on a known size tensor, this is why we enforce some size in the algorithm itself, resizing directly at the entry point of the model @fallow sandal
as for why the image sizes are generally so small, it was found that image size doesn't linearly correlate with the performance of the algorithms (the gains are sub-linear), but the time it takes to run does scale roughly linearly with the number of pixels
making an image 512x512 won't make the algorithm's result much better, but it may very well quadruple the time it takes to complete compared to a 256x256
ah, less pixels to process?
yes
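The "quadruple" figure is just the pixel count, assuming the cost grows roughly in proportion to the number of pixels:

```python
def pixel_count(side):
    """Number of pixels in a square image with the given side length."""
    return side * side


small = pixel_count(256)
large = pixel_count(512)
ratio = large / small
print(small, large, ratio)
```

Doubling the side length gives four times as many pixels to process.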
ok ty so much for the explanation, that was very helpful
@visual violet do you want to remove blanks?
you can drop np.nan
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
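A short sketch of `dropna` on made-up data (the column names and values are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date Time": ["1/1/2020", None, "1/3/2020"],
    "value": [1.0, 2.0, np.nan],
})

no_blanks = df.dropna()                    # drop rows with a missing value in any column
dated = df.dropna(subset=["Date Time"])    # only require 'Date Time' to be present
print(len(no_blanks), len(dated))
```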
@lapis sequoia
https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
I'm using Hyperas for hyperparameter tuning but it seems that for some reason the data function passed to optim.minimize() is giving an error if I'm doing pd.read_csv in it
@lapis sequoia look at what type optim.minimize() wants for an input maybe?
it takes functions and the model thats not the issue
sorry
i dont understand error and am not super advanced but happy to try help.
do you have exact error?
thanks i figured it out
I just imported pandas inside the function
for some reason hyperas isn't using global imports
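The workaround reported above amounts to making the data function self-contained. A sketch of that pattern without Hyperas itself (the inline CSV text stands in for a real file, and the column names are invented):

```python
def data():
    # Imports live inside the function so it does not rely on module-level names.
    import io

    import pandas as pd

    csv_text = "x,y\n1,2\n3,4\n"  # stand-in for a CSV file on disk
    frame = pd.read_csv(io.StringIO(csv_text))
    return frame["x"].to_numpy(), frame["y"].to_numpy()


x, y = data()
print(x, y)
```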
@lapis sequoia the best projects are honestly the ones that you create, rather than something already out there.
Have a question you'd like to answer, collect data, then run an analysis on that collected data, documenting each step of that process.
thats the goal, but to get there i need reference i'm a noob xD
xD
Hmmm do you have any experience with the ds packages?
I can't seem to find documentation on how to manually load my own trained model, all the documentation code for object detection has code where it downloads a pretrained model from the internet
did you look at this https://www.tensorflow.org/tutorials/keras/save_and_load ? I'm assuming you use keras
alternatively, tf.saved_model.load, passing the export dir containing checkpoints and metadata as a parameter
Ok so I have 40,000 images I need stored as Arrays, came to the conclusion that HDF5 is best as it lets me append to the HDF5 file, figured best to use Pytables also. So any idea on how I would iterate over these images and append to HDF5?
I would like to store my X_train, X_test, X_val and labels in the same file but under different dir
are there any issues with manually iterating through the files and generating the arrays and putting them in batch inside dataset objects ?
Batches of 500? 80 batches @odd yoke ?
I don't know you think working with batches is easier?
I would have to put my Training, Testing and Validation data all in batches, would TensorFlow easily read these?
hdf5 has a concept of datasets and groups, datasets are basically files, in this case, i'd just make them contain one image, groups are folder like structures, which contain multiple datasets
processing the files in batches makes it easier to create groups without using up all your ram
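One way the batch-wise writing could look, sketched with h5py rather than PyTables. This is a variation on the suggestion above: instead of one dataset per image, it keeps one resizable dataset per split and grows it a batch at a time. Shapes are shrunk so it runs quickly, and `load_batch` is a hypothetical stand-in for reading image files.

```python
import os
import tempfile

import h5py
import numpy as np

height, width, channels = 4, 4, 3   # stand-ins for 277, 277, 3
batch_size = 2                      # stand-in for something like 500


def load_batch(start, stop):
    """Hypothetical loader: returns images start..stop-1 as one stacked array."""
    return np.stack([
        np.full((height, width, channels), i, dtype="float32")
        for i in range(start, stop)
    ])


path = os.path.join(tempfile.mkdtemp(), "images.h5")
split_sizes = {"X_train": 6, "X_val": 2, "X_test": 2}

with h5py.File(path, "w") as f:
    for split, n_images in split_sizes.items():
        # maxshape=None on the first axis lets the dataset grow later.
        dset = f.create_dataset(
            split,
            shape=(0, height, width, channels),
            maxshape=(None, height, width, channels),
            dtype="float32",
            chunks=(batch_size, height, width, channels),
        )
        for start in range(0, n_images, batch_size):
            batch = load_batch(start, min(start + batch_size, n_images))
            old = dset.shape[0]
            dset.resize(old + batch.shape[0], axis=0)  # make room for the batch
            dset[old:old + batch.shape[0]] = batch     # append it at the end

with h5py.File(path, "r") as f:
    train_shape = f["X_train"].shape
    last_train = f["X_train"][5]    # reads a single image, not the whole dataset
print(train_shape, last_train[0, 0, 0])
```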
if you're working with tensorflow, one thing I use fairly often is called a TFRecord, it's a file format that contains data in the form of a sequence of dictionary-like structures
it's very well supported within tensorflow obviously, which makes it trivial to pass them into models later on
they are also very easy to split into what are called "shards", which is basically just splitting the TFRecords into multiple files, very useful again when your data is too big to fit in RAM, or if you want to shuffle etc
(Also, TFRecords are nowhere near as complex as HDF5, so you may be missing on some feature, but at the same time it can also be much easier to use)
Hmm so I was first looking at NPY correct then I went to research HDF5 as NPY does not allow me to append, as I want to iterate and store all my arrays in the same set so that it would be the same as if I had them stored in a variable in RAM, I want 3 datasets in HDF5 one for X_train, X_test, X_val.
I will take a look at TFRecords if you recommend it
really just want to know whats the best way to proceed
@odd yoke
I think either can work really, there'll be pros and cons for every format, for example, TFRecords are not widely supported across most libraries, let alone most languages, this will tie you to tensorflow
I don't have much experience with HDF5 to talk about its pros & cons tho
I am confident that if you stick to tensorflow, TFRecords will make your life easier to interact with your models, manipulate your splits as you will etc
I'm also not really sure what you mean by "append" here, can you elaborate on that ?
Do you want to add images to existing TFRecords/NPY/HDF5 dynamically ?
Have an open file and write/append each image array
Loop through my images and append to the previous write in the file
I don't think you can do that with TFRecords either, what you can do however is create a new tfrecord, and add it to the list of the files you use when you generate your tf.data.Dataset object
You can also sync up everything every once in a while, by merging them all together
Tho this is a costly operation
one difference between tfrecords and hdf5 is that the data doesn't have to be homogeneous, you basically store key/value maps, which can contain all the metadata you want of any type
now, whether that is useful or not depends on your use-case
Eh I just want to pass in these arrays, if my RAM allowed I would have one X_train list with 40,000 arrays and just pass these into my model but I can't so hmm I have to keep researching
Thanks I will look into both HDF5 and TFRecords
alternatively,
tf.saved_model.load, passing the export dir containing checkpoints and metadata as a parameter
@odd yoke I'm not sure if I'm using keras (still new to this sorry). I'm trying to modify the notebook file to fit my custom trained model, the last thing I have to modify is that I have to figure out what this function returns as a data type
since i assume that is where it loads the model
trying to dissect what load_model() does, and how I can just substitute the path directory to my pretrained model
Is this where the thing you mentioned will work? (tensorflow.saved_model.load()?) I have my tensorflow not as tf
you want to replace model_dir with your path
Oh I guess it does import it as tf nvm
Ok, I will try that
Thank you so much @odd yoke
the current one uses tf.keras.utils.get_file, you can remove that line and pass in your own path directly
where your saved model is
so just ignore the parameters inside?
"(
fname=model_name,
origin=base_url + model_file,
untar=True)"
sorry my ''' is broken on my keyboard so i can't do the gist thing
remove the whole tf.keras.utils.get_file(...) part
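Put together, the change might look roughly like the sketch below. The notebook's full `load_model` isn't shown in the chat, so the structure here is a guess: the "saved_model" sub-folder and the example path are assumptions about how the model was exported, and TensorFlow is imported inside the function so the snippet can be read without it installed.

```python
import pathlib


def resolve_saved_model_dir(export_dir):
    """Return the folder passed to tf.saved_model.load.

    Assumes the export contains a 'saved_model' sub-folder; adjust if yours does not.
    """
    return pathlib.Path(export_dir) / "saved_model"


def load_model(export_dir):
    """Load a locally exported model instead of downloading one with get_file."""
    import tensorflow as tf  # imported here so the helper above works without TensorFlow

    model_dir = resolve_saved_model_dir(export_dir)
    return tf.saved_model.load(str(model_dir))


# Hypothetical path; point this at wherever your own export lives.
print(resolve_saved_model_dir("my_exported_model"))
```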