#data-science-and-ml

1 messages ยท Page 204 of 1

silk forge
#

Traceback (most recent call last):
File "pandas_libs\parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/admin/PycharmProjects/discord/test-ml-spam.py", line 6, in <module>
dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv")
File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)

#

File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas_libs\parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas_libs\parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas_libs\parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte

#

dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv")
print(dta)

#

this is my code

#

but whats wrong tho

still cloud
#

Looks like you need to change encoding on the import there @silk forge

#

Ensure that spam.csv is saved with UTF-8 encoding. If not, then just add the encoding parameter to read_csv.

celest moss
#

I am implementing a spam classifier using naive bayes. I am having a hard time dealing with rare/unseen words in test cases(Both numerator and denominator are zero when calculating posterior probability). Can anyone help me with this issue ?

lapis sequoia
#

do you need to implement it with naive bayes

#

are you converting your word tokens to vectors

#

how are you handling punctuation

#

can you use embeddings

#

there's not much you can do in case of unseen words, aside from including patterns in your training data that accounts for repeated words in a sentence, etc

quartz monolith
#

@celest moss I'm working on something similiar today:
Want to create a text classification pipeline for a hypertext column on my dataframe

  1. Bag-of-words on a column
  2. TFid
  3. sklearn
dull fern
#

@celest moss If you use words embedding like Word2Vec you can actually try to infer the meaning of unknown words from their context

lime cloud
#

Is there a way to create a histogram in matplotlib from the return value of another one?

#

it returns n, bins, patches

lapis sequoia
#

could anyone recommend a serialization format?

#

I need to export my large dataframe .. so I need to compress the hell out of it before I can write it ..

quartz monolith
#

I read something about HDF5 for very big data frames, you can also load the data optimize the data (compress) @lapis sequoia

lapis sequoia
#

thanks..I found this when comparing

#

parquet seems nice..

#

now to figure out how to write df to parquet and read back

#

sweeetttt

long meadow
#

CatBoostError: Invalid type for cat_feature[7,3]=40.5 : cat_features must be integer or string, real number values and NaN values should be converted to string.

void anvil
#

Anyone here use rllib? Trying to find a half decent guide on creating custom actions and environments

earnest prawn
#

id say that the environments and actions are based on openai gym and for how to create new environments yourself for that lib youll find many guides by simply googling around

#

@void anvil

void anvil
#

oh

#

that would explain why I couldn't find anything

#

and just calls

lapis sequoia
#

im having some memory allocation issues

#

I concat a list of dataframes before writing to parquet file

#

the dataframe is huge..

#

need a workaround..

lapis sequoia
#

I ended up using dask

#

have to test it

#

not sure if concat in dask requires reset index though..

supple ferry
#

@long meadow as far as I understood, catboost is using categorical features. So your input should be categorical. Either use bins as category boundaries, or convert everything to string (not the best approach)

quartz monolith
#

@supple ferry @long meadow If you're using pandas
from sklearn.model_selection import train_test_split
for col in [col]:
df[col] = df[col].astype('category')

y = df.target
X = df.drop('y', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.8)

#

but i'm using lightgbm. Don't know if the approach is the same

#

@lapis sequoia let us know about your exp. with dask

supple ferry
#

@quartz monolith OP can also use patsy and create X y data frames from formula

Patsy.dmatrices("y ~ x1 + x2 + c(x3)", return_type ="dataframe" )

Where x3 is category column

quartz monolith
#

@supple ferry whats the benfit from using patsy vs others like labelencoding, ohe...

lapis sequoia
#

@quartz monolith it's pretty good, but after concatenating multiple dataframes, it tries to save them as parts when you try to export it. It has built in support for export to hdf and parquet, but again it tries to save everything in parts.

#

Compression is pretty solid.

silent swan
#

quick poll: what do you guys use for filtering DataFrames

#

df.query, or df[df["somecol"]==someval]]

lapis sequoia
#

depends how large the dataframe is, how often you're going to query

#

for filtering, I would go with the second

#

especially if you're using in down the line

lapis sequoia
#

I found this little gem

supple ferry
#

@silent swan df.filter
Or df.query is fast

supple ferry
#

@quartz monolith patsy is very suitable for those things. Less lines and also allows you to use formulas in your model. Formula like syntax is very good for people coming from R or Matlab

olive willow
#

has anyone read, Big data by Bernard Marr? if yes is it good?

quartz monolith
#
from gensim import models
model = gensim.models.ldamodel(bow_corpus, num_topics=60)
other_texts = [['die anlage steht auf ins bitte vor ort ueberpruefen'],
               ['es besteht gefahr in der unteren anlage'],
               ['alle taster leuchten'],
               ['der antrieb scheint nicht zu funktionieren']]

other_corpus = [dictionary.doc2bow(doc) for doc in other_texts]
unseen_doc = other_corpus[0]
vector = model[unseen_doc]```

`Error: ValueError: need at least one array to concatenate`

solved
strange epoch
leaden bobcat
#

Is anyone available to chat in voice for a few minutes regarding a ML question with point of sale transactional data?

#

DM me if so, I'm finding a ton of ML info regarding interpreting single rows of data, but I've got transactions that encompass up to 20+ rows, and I'm not sure which method to necessarily approach to get ML to read 20+ rows of data as a single entity

lapis sequoia
#

anyone wanna hear me talk about basic ml because I need to revise

lapis sequoia
#

how does an array containing 5 elements of 512 embeddings each yield a 5 element array of 5 values when a dot product is done..

#

im just trying to wrap my head around how the matrix multiplication occurs

void anvil
#

Magic

#

Same way it does in linear regression.

silent swan
#

I don't understand the issue

lapis sequoia
#

what issue

#

I think I understand

#

512 arranged vertically times 512 arranged vertically then product and sum

supple ferry
#

@quartz monolith solution maybe? ๐Ÿ˜

#

@void anvil do you have experience with numpy?

quartz monolith
#

@lapis sequoia im working also on doc2vec and want to plot it

#

maybe we have the same issuse

lapis sequoia
#

what do you need it for

quartz monolith
#

I want to predict a column with lda and for clustering

lapis sequoia
#

that's not very specific, what's the end goal

#

the column is target classes?

quartz monolith
#

Based on a service text I want to classify in keywords

lapis sequoia
#

could you give an example

quartz monolith
#

SVM was 85% to predict on free text the label. I want to try it with doc2vec

lapis sequoia
#

you're still not getting it

#

give me an example of input and output

quartz monolith
#

X = "All buttons light up and the system stops" y="fuse" e.g.

#

I want to create from the knowledge base a sub classification with lda

#

because the label is to general

lapis sequoia
#

ok

#

what you need is a knowledge graph

#

yeah, I get it now

#

but building a knowledge graph is your first step, because you can classify

#

and you can't do that from training example

#

do you have a finite list of categories?

quartz monolith
#

the goal would be based on x to get y= the text

X= are coming from machine error codes and using knowledge text is in the data base

#

Yes i have over 65 categories but its still not specified class

#

my approach would be to use lda on the free text to make sub classes

lapis sequoia
#

that's not straightforward

#

because anything in your query can trigger or be correlated to another category

void anvil
#

@supple ferry which part

#

I assume you mean numba not numpy

supple ferry
#

@void anvil , no i meant numpy :)
I am trying to write a workaround to my problem.
In STATA when you fit clogit (conditional logit) it drops exog variables that have no within group variance and fits the model without them. There is no implementation of that in python

#

do you know maybe an easy way to group an array based on one column and check some column x for variance ? maybe a vectorized way of doing that

void anvil
#

I mean you can select columns with [: ,x]

#
a = np.array([[1,2],[3,4]])
print(a)
a[:, 1]
print(a)```

Gives:

```[[1 2]
 [3 4]]
array([2, 4])```
#

@supple ferry

#

then you can just do np.var or w/e

#

is that what you're looking for?

supple ferry
#

Not exactly. I also want to group the array by one column and then run all these

silk forge
#

yo

#
import discord
import sklearn.naive_bayes as bae
import pandas as pd
import numpy

dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv" , encoding='ISO-8859-1' ,
    converters = {'v2': lambda x: 1 if x == 'ham' else 0})
# v2 - features

clf =bae.GaussianNB()



x = numpy.array([[dta.v2]])
y = numpy.array([[dta.v1]])


clf.fit(x,y)
n = clf.predict([['xd lol imma destroy everyone here immahhhhhhhhhhhhhhhhhhhhhhhhhh kill all of you ..................................................................................................................']])
print(n)
#
C:\Users\admin\PycharmProjects\discord\venv\Scripts\python.exe C:/Users/admin/PycharmProjects/discord/test-ml-spam.py
Traceback (most recent call last):
  File "C:/Users/admin/PycharmProjects/discord/test-ml-spam.py", line 18, in <module>
    clf.fit(x,y)
  File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\sklearn\naive_bayes.py", line 189, in fit
    X, y = check_X_y(X, y)
  File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\sklearn\utils\validation.py", line 719, in check_X_y
    estimator=estimator)
  File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\sklearn\utils\validation.py", line 539, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
#

help

quartz monolith
silk forge
#

what

#

@quartz monolith ?

quartz monolith
#

you want to predict your input right=

silk forge
#

no

#

i want to predict of the message is spam or not

#

@quartz monolith

quartz monolith
#

do you have a training set?

silk forge
#

no

#

its just my message

quartz monolith
#

you need to find some csv or data where some text ist with some column (spam yes or not)

silk forge
#
clf.fit(x,y)
n = clf.predict([['xd lol imma destroy everyone here immahhhhhhhhhhhhhhhhhhhhhhhhhh kill all of you ..................................................................................................................']])
quartz monolith
silk forge
#

@quartz monolith

#

yes

#

i have a dataset

#
dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv" , encoding='ISO-8859-1' ,
    converters = {'v2': lambda x: 1 if x == 'ham' else 0})
#

so i cant do this with naive bayes then?

quartz monolith
#
from sklearn.feature_extraction.text import TfidfVectorizer
stop = set(stopwords.words('english') + ['.', ',', '"', "'", '-', '.-'])
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', ngram_range=(1, 2), stop_words=stop)
features = tfidf.fit_transform(df["spam"]).toarray()
labels = df.text_id

features.shape```
#

first i would go with tfidf

silk forge
#

so naive bayes won't work?

quartz monolith
#

after that you choose your model with e.g.

model = GaussianNB()
model.fit(X_train_tfidf, y_train) 
#

and than you can predict with

print(model.predict(count_vect.transform(["xd lol imma destroy everyone here immahhhhhhhhhhhhhhhhhhhhhhhhhh kill all of you .................................................................................................................."])))```
#

to get you score model.score(X_train_tfidf, y_train)

#

but do the step before tfidf like in the tutorial

#

you data muss be clean

#

binary classifiers suits the best in your case its like 0 or 1

#

hope it helped you

#

@silk forge

vale arrow
#

Question: should i use an activation function in all the middle layers of my ANN? Why or why not?

quartz monolith
#
label_encoders = {}
for col in feature_cols:
    print("Encoding {}".format(col))
    new_le = LabelEncoder()
    train[col] = new_le.fit_transform(train[col])
    label_encoders[col] = new_le

need to encode strings to numeric for decision tree

#

```TypeError: argument must be a string or number````

#

Here is my data frame

severe pasture
#

Hi all, I have a question. I'm running a fresh install of Python 3.7.3. I've been given a .nxs file that contains scan data, and I have some .py scripts that go with the data. How can I read the .nxs file and then run scripts on contained data? Will I need certain packages or libraries? I'd like to have the .nxs data read as arrays that I can manipulate.

silent swan
#

@vale arrow you need a non-linear activation function between layers, because otherwise the layers are just affine functions, and a composition of affine functions is still just a affine function, so you don't get any more model expressivity from additional layers

vale arrow
#

@silent swan thank you!

lapis sequoia
#

Was wondering how I could go about rescaling the y-axis on this histogram

#

Would it make sense to have a non-linear y-axis? As in, have the y-axis increment by no set number?

silent swan
#

log scale?

quartz monolith
#

@lapis sequoia my doc2vec is really good with linearsvc
the good think is he understands what doesnt belong to some text and can give me some good keywords

lapis sequoia
#

@silent swan You mean to convert the y-axis to a log scale?

silent swan
#

yep

leaden bobcat
#

Anyone around that's familiar with scikitlearn? getting an odd error for the preloaded iris database trying to use Random Forest Classifier

silent swan
#

just post the error

crude parcel
#

whats error

#

hey guys... is there a place where you can find out more about the various algorithms being used for nlp? more of a use-case to matching algorithm type thing?

crude bloom
#

They break it into categories and tags, it's the best site I've seen to browse the field in the most up-to-date way

crude parcel
#

interesting, it's quite overwhelming

tulip estuary
#

That is a really cool site. I have never seen that before. (A lot more than NLP on there too.)

crude bloom
#

it's generally my go-to to get a high level idea of a field, the website is very well put together

lilac ferry
#

I've created a dataframe (D1). I now want to create a second dataframe (D2) using the data from D1, condition being it needs to match the value of one of the columns from D1. How do I approach this?

echo storm
#

think you just take the column as a slice or are you linking dfs?

lilac ferry
#

slicing worked, thanks!

lapis sequoia
#

.loc

supple ferry
#

@lilac ferry beware though. Simple slicing returns a view not the copy. So if you assign view to another variable, and edit it later your original dataframe will be changed. Either use copy method or use fancy indexing

lapis sequoia
#

what's fancy indexing

supple ferry
lapis sequoia
#

soooo just a fancy term for indexing

#

people have always used bins in df.. since the dawn of groupbys

supple ferry
#

I think it is very important to know the difference between views and copies

#

sometimes it can screw quite a lot if left unattended

lapis sequoia
#

I always make a copy if I'm doing operations that alter the original..

#

anyways, is there a server that focuses primarily on data engineering

#

i'm having trouble with serialization with multiprocessing..

silent swan
#

imo data science ought to include data engineering

#

otherwise it just becomes stats+ml

lean ledge
#

What's wrong with that?

#

Data science and data engineering are very different roles with different skillsets

lapis sequoia
#

Anyone here familiar with Bokeh?

silent swan
#

data science is a big enough umbrella that everything falls under it

#

is it hardcore stats and optimization?
is it domain knowledge and business analytics?
is it big data and databases?

#

the answer is yes to all of the above

#

data science as a term is (for better or for worse) designed to cast a wide net and encompass as much as possible

lean ledge
#

My argument would be data science is a bad buzzword and should be forgotten in favour of anything a bit more specific

lilac ferry
#

@supple ferry i was unaware of that. Thanks ๐Ÿ˜Š

hollow quartz
supple ferry
#

Matplotlib in-line? @hollow quartz

hollow quartz
#

there are Matplotlib in-line @supple ferry

lapis sequoia
#

I'm not sure how to approach this problem: I am reading from a virtual file system and compiling all the contents I traverse into one dataframe like structure ( I can use either Pandas or Dask for this) , but this issue is I have to write it. I cannot write to local because it's too big, and takes too long.. my formats are limited ( I cannot access any pyarrow engine because it's not in the current codebase I have to use) so I have to save in json, or hdf ... then commit the file to cloud storage

#

I'm having trouble saving fast and coping to cloud storage.. takes too long..

#

some serialization formats that have been light enough : 1. Pandas to messagepack objects dumped into pickle (but I am not able to read this back because it's compressed too much), 2. Writing to hdf to tempfile then copying to cloud storage , but this takes too long just to finish the write to hdf

#

it'd be great if someone can suggest something

silent swan
#

@lean ledge I also agree, but I think it serves some purpose in referring to a vague cloud of things

#

I think it's better than the "AI" buzzword, for example

#

I don't see data science being used for needless hype as much

#

are you currently writing in parts?

desert oar
#

@lapis sequoia what about writing messagepack to a plain file? Instead of pickling

#

And aggressively compressing with xz or lzma

hollow shard
#

I know I've been pestering you guys a lot lately, but my neural network is not working, and indeed displays some very strange behaviour. It appears to minimise the loss before bottlenecking and stopping at 0.5. Could anyone help? dataset is MNIST and it has 1 hidden layer of 64 nodes

silent swan
#

post code/training behavior

hollow shard
#

sure

#

my apologies in advance for super messy code, I've been trying stuff out with it and modifying it for a long time now

#

it currently performs at 7% accuracy, and this is the loss vs examples trained on.

lunar leaf
#

@hollow shard can you explain to me your implementation of the delta computation for the output layer of the network? it should be out_delta = cost_function_derivative(output_activations, true_output_label) * activation_function_derivative(x) but I don't quite understand how it is that your code is equivalent to that

hollow shard
#

sorry, it actually doesnt, but when i modified the code to follow the formula you gave it still doesnt work.

#

heres the new code

lunar leaf
#

I'm still not entirely through your code, it's taking a little bit to get my head around

hollow shard
#

hahaha im so sorry

lunar leaf
#

lol no worries

#

I'm not quite understanding why you have these lines:

hold = np.empty_like(w1)
for j in range(0,10):
    hold[:,j] = (-LOSS[j]*Y[j]*(1-Y[j]))*w1[:,j]
delta = np.sum(hold)*Y1*(1-Y1)
hollow shard
#

ohh

#

give me asec

#

thats some particularly awful code

#

I believe thats just some working for the delta for the first layer

#

im trying to find the tutorial i used 1 sec

lunar leaf
#

mmmmmm okay, I don't think I understand why that is the correct way to compute the delta for the first layer

#

alright

hollow shard
#

uhhm yeah if you could drop me a hint on what the correct way is I'd be very grateful

lunar leaf
#

I'm not completely sure how to wrap the correct example into your code here so you don't have to change much

hollow shard
#

eh thats fine

#

fire away

lunar leaf
#

alright, I'm going to more-or-less pull these out of the book I just linked you, but I will re-format the code to be more readable and in python 3.x instead of 2.7

#

give me a few minutes

hollow shard
#

thats fine, thanks so much ffor the help

lunar leaf
#

this code does not include the gradient descent algorithm, it is just an implementation of backpropagation

#

ah, and of course I reversed the arguments on line 40

#

it should be delta = self.cost_function_derivative(label, activation_list[-1])

hollow shard
#

thanls

lunar leaf
#

I also forgot a couple other things (self arguments for the functions)

#

look, don't judge me ok

#

I did not try to compile this

hollow shard
#

hahaha thats fine

#

look at the code I wrote, its 10 times worse

#

ik the book ur talking about btw

lunar leaf
#

I swear by it haha

hollow shard
#

haha

lunar leaf
#

I must have read the whole thing 5 or 6 times by now

hollow shard
#

wow

#

btw what should my self.num_layers value be?

lunar leaf
#

you can instantiate a network object as:
net = network([784, 64, 10])

hollow shard
#

ah ok

lunar leaf
#

the init function I wrote should build the weight matrices and bias vector automatically from that

hollow shard
#

nice, thanks

lunar leaf
#

I'm now realizing that I've referred to the bias values per node as being "bias vectors" which is incorrect

#

the whole thing is a disaster

#

I should quit coding

hollow shard
#

hahhahaha

lunar leaf
#

it's in the ballpark of being right

hollow shard
#

honestly, its fine, look at my code.

#

Now thats a mess

lunar leaf
#

lol

lunar leaf
#

@hollow shard I've fixed several minor issues and made sure it actually works this time

hollow shard
#

oh wow i really appreciate that

#

I've been messing around with it for a while now

lunar leaf
#

as per the example code at the bottom b,w = net.backprop(np.zeros((784,1)), np.zeros((10,1))) will produce the back-propagated error gradients for the biases and weights of the network for a 784 length vector of zeros at the input and a 10 length vector as the label for that input

#

following this, you can write a gradient descent algorithm to produce an update for the weights and biases with those gradients

#

that should be enough to get you on your feet

#

let me know if you need any further help

hollow shard
#

i cant thank u enough, honestly

#

thanks so much for your time

lunar leaf
#

happy to help

crude bloom
#

@lunar leaf oh that's the book by Michael Nielson, I've come across this before, now I'm more encouraged to give it a read

#

his explanations are very intuitive and I love his other writing

lunar leaf
#

it's absolutely phenomenal, really

#

a lot of the code I found pretty tough to get through, but honestly it's a small complaint compared to how deep his explanations are

#

I have yet to find a better intro to neural networks

hollow shard
#

i skimmed the pdf, but i kinda gave up after looking at the code

#

looks like i might be going back though

lunar leaf
#

I think it depends on a few things

#

multi-variate calculus is pretty crucial if you want to really understand this stuff

hollow shard
#

dw, i have that under my belt

lunar leaf
#

and I think that his book is a fantastic way to really get into the nuts and bolts of neural networks and the fundamentals behind a lot of machine learning

hollow shard
#

thats actuall just what im looking for, a super in depth understanding

#

I hate using stuff I don't understand, and I hate using NN's as some kind of black box

crude bloom
#

Nielsen's the guy to be explaining it, too. He's been a big part of distill.pub which has a few articles explaining ML with visuals and interactivity

hollow shard
#

wow ok

#

ah crap its throwing me an error again

#

for some reason its having trouble with the shapes of the output

#

ill fix it i think

lunar leaf
#

note the shape of the inputs and outputs that I provide in the example code

hollow shard
#

yup, i did

#

like i said, ill fix it, its just a small thing

viral lark
#

Hi, there is someone who speaks Portuguese here? My friend is Portuguese speaker and need helps with python for college

lapis sequoia
#

is anyone experienced with keras? I'm trying to use my own dataset to do Sequence Classification on, but I'm having some issues

lapis sequoia
#

what sort of sequence do you want to classify

#

I'm actually publishing something on github later that solves that

#

in keras

#

@viral lark tell me

#

and dont ping me.. message here

lapis sequoia
#

Sequences of time measurements. Float numbers with 3 decimals.

#

Feel free to ping me

granite sierra
#

I'm trying to iterate over an excel document, and it returns None for when I try to print cell.value

#

and when I try to get the sheet name, it returns this <bound method Workbook.get_sheet_names of <openpyxl.workbook.workbook.Workbook object at 0x000001E814987DA0>>

tulip estuary
#

Do you have a parens at the end of get_sheet_names() ?

#

(It is a method and not an attribute, so you need the parens)

granite sierra
#

That worked, oops

#

and what about the previous question

tulip estuary
#

The first question is harder :).
Have you tried it on another excel file? Could it be that None is what is reutrned if the cell is actually empty? What package are you using?

granite sierra
#

I'm using openpyxl

#

the cells aren't empty though, there are definitely values in there

#

let me test on a different excel sheet

#

yea even on a different excel sheet, it print None.

#

but when I print a pandas dataframe, it works completely fine, but I don't really need a pandas dataframe because I want to be able to iterate over the columns and assign the value to a variable

tulip estuary
#

I have never used openpyxl, I have used another package, but it is eluding me which one. Can you load with pandas and then just iterate on it from pandas? It is hard to debug this type of thing without the actual file, code etc.

#

Alternatively, you could export the excel file as a CSV and load it in and parse.

granite sierra
#

what package would you recommend me using for what Iwant to do.

Basically I am just trying to get values from each column and assign them to a variable

#

but a specific value, so like

Lets say

Height      Width
1            3
2            4
3            5
#

and I am creating a square of those dimensions in a different code

#

so it will create a square of 1x3 first, and then 2x4, and then 3x5

#

do you get me?

tulip estuary
#

If pandas loads the file, use pandas. You can do things like df['Height'] * df['Width']...

granite sierra
#

ok

#

but won't that do the entire column at once, instead of row by row ?

tulip estuary
#

pandas is super powerful at these types of column indexing, iterating etc

granite sierra
#

ok

tulip estuary
#

Yeah, it will do it over the columns, but that is going to be more efficient if that is what you are trying to replicate in the end.

granite sierra
#

ok

#

hmm let me try and I'll be back if I have issues

granite sierra
#

I get a key error when I try and index by column

#

oh wait

#

so if i have width in my code, how do I assign the value of width row 1 to that variable? I tried this

#
for i, row in df.iterrows():
    for j, column in row.iteritems():
        width = df['width']
        length = df['length']
        interdistance = df['interdistance']
tulip estuary
#

Umm... what do you actually want to calculate from df['width'] and df['length']? You said the square of the first... but didn't quite understand. (I am guessing you don't need loops at all, but don't understand what you are trying to calculate ๐Ÿ˜ƒ ).

lapis sequoia
#

what is that..

#

what are you trying to do o.o

granite sierra
#

well its obviously wrong

#

not trying to calculate anything

#

just assign the values of width, length, interdistance to a variable

#

let me create some dummy code to try and explain what I am doing

viral lark
#

Tron, can I send a DM to you?

tulip estuary
#

@granite sierra

from pandas import DataFrame

Data = {'width':  [1, 2, 3],
        'height': [2, 4, 6],
       }
df = DataFrame (Data, columns = ['width','height'])


for ii, row in df.iterrows():
    width = row['width']
    height = row['height']

    print('width is {}, height is {}'.format(width, height))
granite sierra
#

ok got it

#

you beat me to some dummy code haha

tulip estuary
#

๐Ÿ˜ƒ

granite sierra
#

Thanks man, that's what I needed

#

why the double ii though?

tulip estuary
#

It is something I was suggested to use many, many, many years ago. If a language uses i to represent an imaginary number then you can get name collisisions and cause all sorts of issues. I have just followed that all through my coding life.

#

So, I use ii and jj etc for my looping

granite sierra
#

Ahh, got it, thanks for the help man. I'm still new to pretty much all the data related stuff

tulip estuary
#

It is fun, keep playing with it!

granite sierra
#

when I print it, it comes out really strange

#

the top is a print of the dataframe

#

and the bot is where I printed the

print(i, width, length, interdistance)
#

did I do something wrong?

tulip estuary
#

What is the loop and part between the for and print?

granite sierra
#
for i, row in df.iterrows():
    width = row['width']
    length = row['length']
    interdistance = ['interdistance']
    print(i, width, length, interdistance)
#

oops

tulip estuary
#

hahaha... look at the interdistance line

granite sierra
#

I just realised

#

derp

#

I'm so stupid, I need a coffee to wake up

tulip estuary
#

More coffee always helps. Though I don't get why your length is 0...

granite sierra
#

Ia lso fixed the length

#

because for some reason I had the variable named as height

tulip estuary
#

kk

granite sierra
#

because, again, I am an idiot ๐Ÿ˜„

#

Thanks mate, I appreciate it

granite sierra
#

is there an easy way to get the sheet's name

#

i want to be able to use the sheets name in an if statement

#

like

if sheetsname == blah:
    do stuff
elif sheetsname == superblah:
    do some more stuff
tulip estuary
granite sierra
#

yea I read it, every package gives me different answers in terms of sheet names and also it never returns all teh sheet names even though the docs say it would

quartz stream
#

@granite sierra Did you try Pandas ?

granite sierra
#

@quartz stream Yea I did, I also tried other excel packages and they all give me different results when I print the sheet name, and none of them including pandas return all the sheet names

quartz stream
#
xls = pandas.ExcelFile(path)
sheets = xls.sheet_names
#
import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
print xls.sheet_names() # <- remeber: xlrd sheet_names is a function, not a property
#

@granite sierra

granite sierra
#

ok let me try that
@quartz stream Thanks man, sorry, I was at lunch

#

Even with that I don't get all the sheet names, I still only get 1, that's so strange

oblique belfry
#

Does anyone know how to create a keras stateful metric?

quartz stream
#

Yes that is strange @granite sierra

#

can you share the snippets of excel file and your code

granite sierra
#

Sure

#

@quartz stream can I dm you the excel file?

quartz stream
#

Yes

granite sierra
#

Here is the code I've tried

#

Holy moly

#

I'm tilted

#

nvm, I figured out the issue

#

but now that I figured out the issue

#

question

quartz stream
#

@granite sierra

#

See

#

it's working fine

granite sierra
#

Yea I realised, I did a super derp, I pointed the file path to the wrong excel file

granite sierra
#

how do I assign the sheet name to a variable for an if statement

like

if sheet_name = blah:
    do stuff
#

It's always when you have to send your code to someone that you realise the silly mistakes...

quartz stream
#
if sheet_name == sheets[0]:
    print('It's always when you have to send your code to someone that you realise the silly mistakes')
granite sierra
#

oof

simple crag
#

Like, for instance, = is for assignment, not comparison

quartz stream
#

lol

#

just updated

granite sierra
#

yea that I know, was just showing an example of what I wanted

quartz stream
#

so

#

now it's done right ?

#

@granite sierra

granite sierra
#

@quartz stream Yes, thank you good sir, I am so stupid sometimes lmao

quartz stream
#

lol

#

aren't we all?

granite sierra
#

how could I point the filepath to the wrong file xD

quartz stream
#

hahaha

#

it's alright mate

tulip estuary
#

Maybe there should be a channel "here are my dumb mistakes", I could fill it pretty quickly

granite sierra
#

all we need is lemon or someone to make that ;D

desert oar
#

blogging your mistakes and lessons learned is a net positive for society

#

i need to start doing it myself

#

takes some work but it's worth it

surreal nacelle
#

Hey, I decided to finally dive into ml/dl, and I've been learning about linear algebra, I planned on doing so till I understand it perfectly, and then go on to calculus and probabilities. However, there is a long way to go, and I don't want to lose my motivation by exclusively learning maths. I'd like to know if you guys had some ressources teaching both ml/dl alongside the maths required. (I honestly don't see myself spending 2 months learning something without knowing why I'm learning it, and how I'll use it.) Thanks ๐Ÿ˜ƒ

desert oar
#

i dont have a good book reference handy for that purpose but i think that's a really good way to learn

#

linear regression is a great teaching tool because you can derive the same result several different ways: calculus, linear algebra, and probability theory

surreal nacelle
#

I'm not sure to understand what you mean by "linear regression is a great teaching tool" tbh

lunar leaf
#

It provides the foundations, theory, mathematical background, and by-hand implementations for classic feed forward neural networks

#

And a bit of diving into convolutional neural networks

#

Great place to get started

surreal nacelle
#

It looks really great! Thank you

lunar leaf
#

Happy to help

granite sierra
#

Also another book is

surreal nacelle
#

I'm looking at the table of content, and I see that there are some chapters about linear algebra calculus etc, does it teaches these, or briefly explain what they are ? Also, is there any prerequisites to the book ? (apart from python/jupyter etc)

granite sierra
#

I dont think its as focused on the math, but if someone else can chime in

#

It was just recommended to me by a friend

#

but I haven't had much tiem to look into it yet

desert oar
#

imo if you're learning the math don't waste time with "for hackers" type book

#

get a book that uses the math, just make sure its not more math than you can handle

small shore
#

Idk if this is the best place to ask, but if I have links to Images and corresponding data in a file. How can I download those images and save them with the metadata in a pair either in another file or with the image quickly and effectively so image links that donโ€™t respond are removed. Right now I loop through the file and download the image under the same index I save the data to an array, but that is slow. Is there a better way to do this?

surreal nacelle
#

Well, I ordered the o'reilly book, and I started reading the book @lunar leaf mentioned, however, I'm not sure what to think about learning deep learning before taking a simple approach to machine learning.
I also found that : https://machinelearningmastery.com/start-here/
It seems to cover pretty much everything, do you guys have any experience with it ?

desert oar
#

@small shore split up the file into 5 different files, and run the script in 5 terminal windows ๐Ÿ˜ƒ

small shore
#

Lol

#

15 windows to max my thread count

#

Idk if thatโ€™s even right lol

#

I found a dataset with stuff already built ready to download

#

So might just switch to that cause I am lazy

desert oar
#

your network will probably saturate before your CPU does

small shore
#

So

quartz monolith
#

I have a numeric and categorial data set with alot of NaN's. My Labelencoding isnt going to work with that. So how do deal with NaN's in first place and afterwards how to approach numeric and categorical in dataframe. I want to use CART / DT

desert oar
#

@quartz monolith i answered this a while ago but you might not have seen my answer. you need to know why the data is missing

#

NaN is just a computer representation of "not a number"

#

it could mean you wrote log(-1) or it could mean "missing data"

hollow shard
#

btw @lunar leaf the network doesnt really work. Heres the code with the backpropogation algorithm I added, thanks for your time ๐Ÿ‘ http://dpaste.com/0Z8C7H0

dense rose
#

What all should I need to do to get Plotly figures to display in jupyter lab?

lunar leaf
#

@hollow shard I am away on a business trip until tomorrow so I don't have time to give this my full attention at the moment. However, it appears to me like you're performing a weights update once per sample here

#

This may not be a problem, but why don't you try accumulating gradients over many samples before computing a parameter update

hollow shard
#

oh ok, I really appreciate the time youre giving me, thanks for the help

exotic cedar
#

oh you already ordered the book

#

well rip

tame cloak
#

Hi all, new to the channel but great to see an awesome community here. I'm unsure if this is how it works but I have a question on dataset manipulation/grouping. Let me know if I should redirect to one of the help channels

I'm trying to group together a dataset by year. Each year has a few hundred values, so i did this:

df2 = df.groupby(['Year']).sum()

the new "df2" dataframe no longer has a year column, but year is now an index (I believe is the correct terminology), which it doesn't look like I can use for visualizations. I'm trying to make a new year variable with a loop:

for i in range(1950,2018): df2['Year'] = i

which makes the new 'Year' column, but every value ends up being 2017. Can someone assist with what I'm doing wrong here?

desert oar
#

@tame cloak that's correct, Year is now the index of the new dataframe, and the rest of the columns will corespond to columns in the original data.

you will need to clarify what kind of visualization exactly you're trying to produce

tame cloak
#

@desert oar -- something like a simple line graph where I'd have time on the x axis, and another variable like "Points" on the y axis

desert oar
#

@tame cloak keeping year as index should be fine

#

series actually have a plot method

#

so df2['my_column'].plot(); plt.show() or something like that

tame cloak
#

very helpful, thanks!

#

@desert oar if possible, do you know why my for loop was yielding "2017" for all values, and how I could in theory get a column of Years that mimic the key?

desert oar
#

@tame cloak you'll need to give a sample of your data

#

try df['Year'].value_counts() so you can see what the actual distribution is

#

maybe you only have 2017 data..?

tame cloak
#

df2 = df.groupby(['Year']).mean() for i in range(1950,2018): df2['Year'] = i df2[['Age', 'Year']]

                    Age    Year

Year
1950.0 26.131410 2017
1951.0 26.344828 2017
1952.0 26.130769 2017
1953.0 26.018868 2017
1954.0 25.769231 2017
1955.0 25.953704 2017
1956.0 25.813725 2017
1957.0 26.018868 2017

#

@desert oar see above, let me know if that makes sense

#

it goes from 1950 to 2017

desert oar
#

oh

#

that for i in range part

#

not only unnecessary but very wrong

#

just delete it

#

and use .reset_index() if you want the Year column back as part of the dataframe and not as the index

#

but for plotting it will do the right thing as the index

#
df2 = df.groupby(['Year']).mean()
df2['Age'].plot()
plt.show()
#

if you want Year back, do

df2 = df.groupby(['Year']).mean()
df2 = df2.reset_index()
tame cloak
#

Perfect! Thank you so much.

#

That worked.

desert oar
#
for i in range(1950,2018):
    df2['Year'] = i

see if you can figure out why this code is wrong

#

hint: i is a "scalar" and df2['Year'] is a "vector"

tame cloak
#

Some mixing of types that can't occur. Like trying to plug in a constant for a variable

desert oar
#

sorta

#

since i is a scalar, it's broadcasting it

#

it's basically copying i for every element in df2['Year'], which is a Series

tame cloak
#

I see, that makes more sense

#

appreciate it @desert oar

surreal nacelle
#

@exotic cedar Holy shit, this drive is amazing, thanks

exotic cedar
#

np

surreal nacelle
#

@exotic cedar Do you mind me sharing that to some friends ?

exotic cedar
#

ya sure idc

#

was made to help others anyway

surreal nacelle
#

๐Ÿ‘

granite sierra
#

Damn that drive is incredible @exotic cedar

surreal nacelle
#

Hey I'm trying to learn about cross validation using cross_val_score() and kfold, however, I can't figure out what the return values of cross_val_score() are.
It's a list of float, I can see that, but that's all the documentation says about it.
I'm guessing it contains the mean etc etc, but I don't know which index correspond to what.

supple ferry
#

@surreal nacelle , lets say you have 10 fold kfold and cross val score does it to train model on 9 and predict the remaining fold and repeat it so that all of the folds are predicted once. Then it gives you predictions. Depending on model, it will give you the default prediction method. For logistic for example, it will give you the classes, but if you ask for probabilites, you can use method = predict_proba

surreal nacelle
#

Oh, my bad, it totally make sense that the default return values are the 10 predictions. I remembered seeing a piece of code where there was a line which called the mean/std from the results, hence the confusion

#

Thanks for the explanation

#

The mean was calculated using a numpy method. My bad

south quest
#

Hey all, I'm doing some prediction work with keras using a sequential model and a single binary output. I've got 800,000 records to classify for and I've trained on all of them, I have trained on all this and when I use model.evaluate I get a loss value of 0.29 and accuracy 0.91 but when I try predict other data with model.predict I get results that just don't work

#

it predicts between 0.0 and 0.1 but no higher than that

supple ferry
#

is it a classification task ?

#

ah i see it is

#

what are the distribution of the classes ?

lapis sequoia
#

how does one row look like

south quest
#

Let me send one row give me a second

supple ferry
#

what i see is overfitting problem

lapis sequoia
#

yeah, maybe the classes are unbalanced

supple ferry
#

if classes are unbalanced, this will be a problem for you. you can solve it in different ways

#

you can either make sure training and test sets have the same share of classes

lapis sequoia
#

let's see the data

supple ferry
#

or you can artificially generate classes that are way below in count

south quest
#

yeah I'm just trying to find it

lapis sequoia
#

like a row of it to get an idea

supple ferry
#

df.class_column.hist()

#

it will do the thing

#

hist is method not an attribute, my bad

south quest
#

in the code that text column is converted into 3 binary columns

supple ferry
#

what is your y column

south quest
#

and I am trying to predict the last column

lapis sequoia
#

what are you trying to predict.. is the target variable here

supple ferry
#

class column

south quest
#

opened

supple ferry
#

df.opened.hist()

lapis sequoia
#

what is sic number

supple ferry
#

run it and show the picture

south quest
#

one second, need to put that onto the prod one

#

sic number is a value between 0 and 99

lapis sequoia
#

well everything else is 1s and 0s.. you can encode your sic number and job roles too

south quest
#

0 is not a string according to matplotlib

lapis sequoia
#

need to typecast those before you can plot

quartz monolith
#

@desert oar my NaN in my df is "not a number" because my label encoder gives me an error.

south quest
#

okay i can't get matplotlib to work but the type of the opened will have tons more 0s than 1s

#

I know that

#

So is the issue that it is training too much for 0s?

#

(not sure about terminology or anything really in this area of python)

lapis sequoia
#

yes.. your data is unbalanced, you can use things like kfold cv for training

south quest
#

ah okay

#

cheers, will look into it

lapis sequoia
#
class_counts = Counter(list(df['opened']))
df_new = pandas.DataFrame.from_dict(class_counts, orient='index')
df_new.plot(kind='bar')
#

might want to also look at methods for feature selection.. to balance classes.. Essentially finding the most variability to represent your classes

south quest
#

right, makes sense

lapis sequoia
#

this example is not directly related.. but it's the general idea.. a good way to picking just enough data from each class

quartz monolith
#

So PCA and t-SSNE reduces the number of dimensions in a dataset whilst retaining most information after that using it for the model?

lapis sequoia
#

do you understand the curse of dimensionality

vital bison
#

hi can anyone help me open a pickle file ? not sure how to do that?

tulip estuary
#

It should be:

import pickle

with open('filename', 'rb') as f:
    x = pickle.load(f)
desert oar
#

@quartz monolith OK, so you understand why i am asking? Now you need to understand why you are getting an error.

quartz monolith
#

@desert oar CatBoostError: Invalid type for cat_feature[1,0]=nan : cat_features must be integer or string, real number values and NaN values should be converted to string.

desert oar
#

It looks like the missing value is already there

#

Do you know why that value was missing?

#

That's what you need to answer

quartz monolith
#

Yes

desert oar
#

Why is it missing

quartz monolith
#

The employees where lazy and did'nt filled everything out

#

in the knowledge data base

desert oar
#

OK, that is important information

#

So this is truly missing data

#

Fortunately a tree model can handle it usually

#

What data type is this feature? Text?

quartz monolith
#

I need to analyze the knowledge data base with decision tree and to optimize for my thesis

desert oar
#

What is the data type of the feature

#

This also might be a good question for your thesis advisor

quartz monolith
#

only objects

desert oar
#

OK, if the data is categorical then "missing" becomes another category

#

How many missing values are there

quartz monolith
#

give me a sec

tulip estuary
#

@vital bison A longer example of write and read:

import pickle
  
interesting_data = {
  'a': 'this is a',
  'b': 'this is b'
}

# Save
with open('filename.pck', 'wb') as f:
    pickle.dump(interesting_data, f)

# Load
with open('filename.pck', 'rb') as f:
    x = pickle.load(f)

print(x)
print(x['a'])
quartz monolith
#

Controllertype 6256
Controllerstate 31147
FehlernrContr_1 128525
FehlernrContr_2 128559
MotionState 32843
FehlernrDrive_1 133315
FehlernrDrive_2 133393
SWversion 12970
RootCauseComponent 0

#

The label is not missing

#

what makes it so hard is to deal with a lot of nans
when i convert my nan in string and let my decision tree on it i get only 51% and balanced 26%

desert oar
#

@quartz monolith "nan" is how pandas and numpy represent "missing"

#

So yes, it's missing data

#

You need to figure out a way to handle it

#

The way to handle it depends on how many values are missing

#

It also depends on the specific attributes of your task

quartz monolith
#

The feature is categorical and numeric (errornumber and types of machine)

desert oar
#

Which one is numeric?

quartz monolith
#

Is the best solution not to create a new knowledge data base where this kind of matter doesnt occur

desert oar
#

You need to do a combination of feature engineering and possibly excluding values

#

Error number is effectively categorical

#

I don't think you have any numerical features

quartz monolith
#

FehlernrContr 1+2
FehlernrDrive 1+2
thats the most important one
this are the error codes that come from the machine and need to be connected with the label and the label need to be connected with the text

#

I can send you a screenshot in pm

desert oar
#

Thats ok

#

I think you might want to talk to your thesis advisor

#

Discuss your results, discuss the business problem you're trying to solve here, and then figure out a missing data solution

quartz monolith
#

You're right! Imo is the best solution is that employees start to fill everything out. There are way to much values missing and the values that are missing are really important for the label. Thank you

desert oar
#

Yeah unfortunately missing data is a big problem

#

Good luck, I will be curious to know what you decide to do

quartz monolith
#

I will let you know! Maybe the employees will start to fill the missing values out ๐Ÿ˜„ haha just kidding

hollow quartz
#

maturity_udf = udf(lambda cons: "pic" if cons == m else "normal" for m in all_max , StringType()). Why it don't run, I use pyspark

desert oar
#

@hollow quartz what error message are you getting

#

oh

#

"pic" if cons == m else "normal" for m in all_max

#

this isn't valid python syntax

#

can you explain what youre trying to do in words

hollow quartz
#

I have a consumption by day. I want to determine the peak consumption @desert oar

#

that's the error SyntaxError: Generator expression must be parenthesized

tulip estuary
#

Might just need brackets as it is a list comprehension?

all_max = [1, 2, 3, 2, 1, 2, 3]
cons = 3
tt = ["pic" if cons == m else "normal" for m in all_max]
print(tt)
#

With brackets that code should be fine, I think....

hollow quartz
#

If you do that, you will get an arrays

#

I want a value

desert oar
#

you need to explain to us how you want to handle this

#

because for m in all_max means you're iterating

hollow quartz
#

all_max contains the max of all day that i have

tulip estuary
#

is all_max a list?

hollow quartz
#

yes

tulip estuary
#

So, given the list you want to do something to it and get a single value, right?

surreal nacelle
#

@exotic cedar I'm reading the hands on book from your drive (the one I ordered, even tho I ordered an older version) and it's amazing so far. Thanks again

granite sierra
#

@surreal nacelle The hands on machine learning book? Yea I thought it was good

hollow quartz
#

@tulip estuary yes

surreal nacelle
#

I think you're the one who recommended it in the first place

#

so thank you too ๐Ÿ˜„

tulip estuary
#

@hollow quartz Can you give a small example of the values of all_max and then what the expected output is (the single value)?

hollow quartz
#

i find the solution

#

def pic(value):
        if value in all_max:
            return 'pic'
        else:
            return 'normal'
#
pic_udf = udf(pic, StringType())
cons_region_df = cons_region_df.withColumn("Max_cons", pic_udf("consommation"))
tulip estuary
#

Ah, cool. k. You can use a lambda function if you want:

all_max = [1, 2, 3, 2, 1, 2, 3]
cons = 3

tt = lambda x: 'pic' if x in all_max else 'normal'

output = tt(cons)
print(output)
hollow quartz
#

thanks @tulip estuary

tulip estuary
#

๐Ÿ‘

lapis sequoia
#

I would suggest you not use pickle

exotic cedar
#

@granite sierra yep thanks, took a long time to make lol

#

@surreal nacelle ya np, this publisher is super nice, and i try to keep it updated

surreal nacelle
#

I was pleasantly surprised to realize that you had the early version of the next release

granite sierra
#

I'm trying to iterate over sheets in an excel file, and within those sheets, iterate the rows

#

how would I go about doing this?

#

I have this for iterating rows

#
for i, row in df.iterrows():
#

but I cant seem to link it ontop of the sheet iteration

supple ferry
#

@granite sierrawhat do you want to achieve? Why are you using loop.

#

Normally you can vectorize your operations

granite sierra
#

I'm basically going row by row and assigning the value to a variable

#

so for example

#
width         height         interdistance
10            5                20
20            10                20
30            15                25
40            20                10
#

well I just cant seem to figure out how I would navigate multiple sheets, and then assign values to a variable from each row, and it does have to be 1 row at a time because I do stuff with the variables

#

it's not a huge amount of data, hence the iteration not being an issue, its literally only like 20 rows of excel data max

copper umbra
#

Dumb Question of the day: I am pretty new to python. and work mostly in pandas, matplotlib, numpy etc. Just trying to learn the skill to be more effective in my job as a data analyst
I am trying to figure out what i need to develop deliverable offline reports (either in word, excel or pdfs, or somethings else that can be emailed to non-tech-friendly clients). I can print a pd.dataframe to excel obviously but what if i want something more complex like a excel workbook with a title, subtitle, then 3 charts on 1 tab spaced appropriately for the chart size, a listing on another tab. and more charts on a third tab. What the go to library for managing stuff like that

south quest
#

I've been playing with my model all day to try and improve accuracy, it's just a binary classification so I have just been changing the inputs to a final Dense with a sigmoid activation with 1 neuron

#

I can't get the accuracy above 53% and I'm not sure what to do

#

I've tried changing activation functions, changing to so many neuron configurations

earnest prawn
#

my best guess would be that probably your model just doesnt have enough parameters to accurately capture whats going on that and you could try increasing them

void anvil
#

mmm code examples:

    return None```
earnest prawn
#

@south quest

lunar leaf
#

@south quest you may consider increasing the model size like @earnest prawn suggested. Here are a few other things to think about as well:

  1. Is your input data normalized? If so, how?
  2. How many training samples do you have available?
  3. Is this reported accuracy the test accuracy or the training accuracy? A good first step is to ensure your model can overfit the training set before you do anything else.
  4. Did you validate the labels for your data? Perhaps some labels are getting mixed up somewhere.
south quest
#

Cheers everyone, I've just left the office but this are great pointers, will look into them tomorrow ๐Ÿ‘

supple ferry
#

@granite sierra , how you calculate the interdistance?
you can create a function which takes a row as input and generates your variable:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"a":[1, 2, 3, 4], "b": [20, 30, 40, 50]})

In [3]: df
Out[3]:
   a   b
0  1  20
1  2  30
2  3  40
3  4  50

In [4]: def interdistance(row):
   ...:     return row["a"] * row["b"]
   ...:

In [5]: df = df.assign(interdistance = interdistance)

In [6]: df
Out[6]:
   a   b  interdistance
0  1  20             20
1  2  30             60
2  3  40            120
3  4  50            200

Usually, iterating over rows is slow in comparison to vectorized version

granite sierra
#

interdistance isn't a calculation, it's also being assigned to a variable

#

@supple ferry

supple ferry
#

@granite sierra , i dont know your use case exactly, thats why i made a general one

#

to which variable you assign them all ?

granite sierra
#

well width, length and interdistance

supple ferry
#

If you tell me more details I can be more useful ๐Ÿ˜ƒ

granite sierra
#

I create stuff from the cell values

#

so complete hypothetical

#

lets say I am creating a box in a numpy array with the width and length, and then multiple boxes with the inter distance

#

but its all good haha, I figured it out ๐Ÿ˜„

#

I basically just assigned the sheet names to a tuple, and then iterated over the tuple

supple ferry
south quest
#

Thank you

void anvil
#

Anyone use arch package? I'm trying to programatically access the fit coefficients that are dumped with model.summary()

#

I don't see anything with dir() or in the documentation

supple ferry
#

As far as I know, they are using the summary class of Statsmodels. so you can access them via params:

model_1_result = sm.Logit(y, X).fit()
model_1_result.params
void anvil
#

perfect

#

thanks

supple ferry
#

rule of thumb is, on python everyone uses Summary of Statsmodels

void anvil
#

Their forecast is only giving out volatility forecasts, not underlying trend. Do I have to self-calc those or is there a different call rather than forecast()

#

makes sense

supple ferry
#

I havent used that package myself ๐Ÿ˜ƒ

#

I assumed their structure and then checked, it was statsmodels in the backend

#

let me know if it works

void anvil
#

yeah

#

params works

#

but their forecast is only volatility unfortunately

supple ferry
#

unfortunately, i have no exp with that one

void anvil
#

what do you use for arch / garch /arima?

supple ferry
#

for arima statsmodels

#

arch and garch never used

#

statsmodels api is a bit weird, but logic is of R

#

so if you speak R then it is easy

void anvil
#

ok

#

thanks

void anvil
#

some of the time series analysis is really frustrating compared to R

void anvil
#

Any way to silence all the prints?

#

I'd really prefer not to print out 20m lines multiple times

supple ferry
#

there should be verbose

#

for verbosity

#

i dont recall any global setting for that

#

when you fit the model

void anvil
#

there's no verbose

#

I'm using:

    def blockPrint():
        sys.stdout = open(os.devnull, 'w')

    # Restore
    def enablePrint():
        sys.stdout = sys.__stdout__```
desert oar
#

i dont think python has a good time series ecosystem yet

#

i just use RPy2 tbh

#

or i just open R

#

@void anvil not sure the context for this, but you should use the logging module instead. that gives you fine control over what is printed where

void anvil
#

you have any examples?

#

Because "standard" is fine for most things because verbosity is usually an option, I just ran into the case today where I'd be dropping 20m+ print commands

#

hopefully this'll be a one-off type deal

desert oar
#
import logging
import math

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

def gradient_descent(x_init, loss_func, loss_gradient_func, step_size, n_iter, tol):
    loss= math.inf
    x = x_init
    for i in range(n_iter):
        prev_x = x
        x -= step_size * loss_gradient_func(prev_x)
        prev_loss = loss
        loss = loss_func(x)
        loss_change = prev_loss - loss
        logger.debug('Iteration %s, loss %f, change %f, value %f', i, loss, loss_change, x)

        if loss_change < tol:
            logger.debug('Convergence reached, done.')
            break

        if loss_change < 0:
            logger.debug('Loss increased, stopping.')
            x = prev_x
            break

    return x, loss
#

if you set the log level to anything above DEBUG, all those log messages will never be printed

#

err thats buggy anyway

#

good thing i have all the log messages ๐Ÿ˜›

hallow wave
#

When going into a entry level job as a data analyst, what should I know in regards to pandas and matplotlib e.t.c

#

By entry level I mean an internship.

solemn topaz
#

Any pandas experts could help me with expanding deeply nested json data to a single large dataframe?

odd terrace
#

Sorry still my pending question which I haven't resolved yet
I have :

print(self.M.T.shape)```
```(8, 3)
(8, 9082318)```

self.N = np.linalg.lstsq(self.L.T, self.M.T, rcond=None)[0].Twhich is working fine and return(9082318, 3)```

But

I want to perform a kind of sort on M and compute the solution only on the best 8 - n values for example.
Any pointer on how to do that would be extremely appreciated.
Thank you.

surreal nacelle
#

Hey, I'd like to calculate the mean of a feature but only for its members who share the same value with another feature)
Is there a function to do that, or do I have to gather all these feature into an array and calculate the mean myself ? I'm guessing there is a function tho ๐Ÿ˜„

#

It's the kaggle titanic dataset, and there is a NaN for a Fare in the test_set

     PassengerId  Pclass  Sex   Age  SibSp  Parch  Fare  Embarked
152         1044       3    0  60.5      0      0   NaN         1```
I know that it's overkill, but I want to calculate the mean of Fare for all the passenger in the 3rd class
cunning osprey
#

Hey, does anyone know how to code a legend in a scatterplot by color?

#

Im working with the iris data, plotted two variables, colored them by species, now I want the legend to indicate species by the colors, and I'm kind of stumped

hallow wave
#

What do you mean?

cunning osprey
#

Sorry if formatting is bad

#

plt.scatter(Iris[' Petal Length'], Iris['Sepal Length'], c = Iris['Labels'])

hallow wave
#

So you just want a color chart on the side?

cunning osprey
#

Yeah

hallow wave
#

Give me a sec

#
cbar = plt.colorbar()
cbar.set_label('Like/Dislike Ratio')
#

Edit how you desire, cmap = theme, cbar is calling the colorbar method, cbar.set_label, labels the chart

supple ferry
#

@surreal nacelle you can group your data

df.grouoby("fareclass")["fare"].mean()
supple ferry
#

Apologies for errors I am on mobile @surreal nacelle

surreal nacelle
#

Thanks you @supple ferry ๐Ÿ˜ƒ

cunning osprey
#

@hallow wave Sorry, I clarified it stupidely. I just need a legend stating which color belongs to which species

#

That colorbar is useful though, Ill use that in the future

hallow wave
#

cmap = theme

#

I'm not sure you can use legends in scatter

cunning osprey
#

Eh, its alright then, Ill just have to figure this out another time. Not the most important thing to need

hallow wave
#

Ye, from what I see you cant.

#

You use color bars

#

Ok you can

#

Let me elaborate:

#

plt.scatter(view_count, likes, c=ratio, alpha=0.8, edgecolors='black', linewidths=1, cmap='summer', labels=labels)```
#

If there only one type of specie then you use label but as there are more you use labels=<variable>, in the labels list list all the species that you want in order

cunning osprey
#

Hmm, yeah, it displays all my entries on a legend

#

The problem is, there are only 3 species, coded as 0,1,2 in the csv. So putting the entire column in as label just returns a giant legend with the entire column

hallow wave
#

Ye

#

I'm trying to find a solution for it

cunning osprey
#

I mean, I think using bools might work, but I rather not have a huge paragraph of code

#

Don't stress if it's impossible or hard, I'm just a beginner, tryna learn

hallow wave
#

I would personally stick to color maps

#

I found this code though```import numpy as np
np.random.seed(19680801)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for color in ['tab:blue', 'tab:orange', 'tab:green']:
n = 750
x, y = np.random.rand(2, n)
scale = 200.0 * np.random.rand(n)
ax.scatter(x, y, c=color, s=scale, label=color,
alpha=0.3, edgecolors='none')

ax.legend()
ax.grid(True)

plt.show()
Copy to clipboard

#

Just replace the list with what you want and I think you should be good

cunning osprey
#

Yeah, I'll copy the code for later

#

Thank you for the help

quartz monolith
#

looks like my tiles in my loo ^^ jk

cunning osprey
#

Has anyone coded dummy variables into regressions before?

quartz monolith
cunning osprey
#

Useful resource

desert oar
#

@cunning osprey i do it all the time. got a specific question about it?

#

the machine learning hipsters call it "one hot encoding"

silent swan
#

"machine learning hipsters"

hallow wave
#

Do i need to learn machine learning?

#

For a data analysis pos

desert oar
#

@hallow wave you should probably be familiar with what it is and how it works, since it's becoming more and more common. you might find it useful in some cases. but it will depend on the job responsibilities

hallow wave
#

Ok, salt, do you currently have a job as a data analyst?

desert oar
#

data scientist

hallow wave
#

Ahh, if possible could you specify what I would need to know to get an internship as a data analyst.

#

It's fine if you don't.

silent swan
#

data science/analyst jobs vary wildly

#

some just require SQL

#

others require a lot of modeling

hallow wave
#

modeling as in visualisation?

desert oar
#

no modeling would be like building models

#

to predict things or to understand past data

#

and yeah i dont really have a general answer, depends on the internship

hallow wave
#

Hmm, I guess predicting would invlove machine learning.

desert oar
#

in the "philosophical" sense of the term yes

#

but machine learning often nowadays refers to a collection of modeling techniques that are "non statistical"

#

i guess it's a distinction between "machine learning, the problem statement" (prediction, clustering, recommendations, etc.) and "machine learning, the toolset" (gradient boosting, neural networks, cross validation, etc.)

#

there is a lot of overlap between ML and statistics

hallow wave
#

Hmm, do you think I could master the field of data analytics in a general sense within 2 years, and ye, I kind of get that!

desert oar
#

nobody can master anything in 2 years unless they are a prodigy

#

for an internship, i would expect at minimum:

  • comfortable with probability (conditional probability, what a probability distribution is)
  • comfortable with statistics (mean, variance, etc)
  • the basics of linear regression modeling
  • some calc and/or linear algebra
  • one programming language (python R matlab julia etc), or at least Excel skills
  • an understanding of how to produce basic data visualizations
#

but that's my preference, if i personally were going to hire an intern

#

you should know all that stuff by ~2nd year of college or after finishing a masters program

hallow wave
#

Ok, well all i've got so far is semi-complex data visualizations with matplotlib and importing data through csv files with pandas, I did this in 2 days.

desert oar
#

thats a good start

#

there is a lot to know and learn

#

data is hard

#

there is rarely one "correct" way to do something

hallow wave
#

Mhmm, I feel likes I could get the math done within 2 weeks, i've already partly mastered the fundamentals of python and i've started with pandas and matplotlib. I'm just planning to learn them first, and then start building my github repo with projects based around them which will give a me an edge in a job application/role and also enforce my knowledge for my own well being. Yes, I get that.

cunning osprey
#

Hmm, on the note of regressions, since I completely forgot about how they work

hallow wave
#

Youtube ^^^

cunning osprey
#

I have a multiple regression line, and after introducing some new independent variables the other coefficents increased

hallow wave
desert oar
#

you might be able to learn it in 2 weeks, but to remember and understand it? that would be very impressive

hallow wave
#

No, that's why I will reinforce it with projects.

desert oar
#

best of luck, if you have the free time it sounds like you will do well

hallow wave
#

That's exactly how I learnt python.

#

I have atleast 6 hours a day :p And 6 weeks till school starts. And then 2 years :p

desert oar
#

nice

#

when i was in school i played video games and went to bad concerts

hallow wave
#

Thanks for the advice!

cunning osprey
#

Nah, I know how linear regression works. Just forgot how interaction works

desert oar
#

@cunning osprey think of it algebraically

b1*x1 + b2*x2 + b12*x1*x2
(b1*x1 + b12*x1*x2) + b2*x2
(b1 + b12*x2)*x1 + b2*x2
#

sicne you can move around the order of addition and you can freely add or remove parentheses

hallow wave
#

Ye, I used to be a 'chav' but I knew I didn't want to be a bum, I sacrificed my childhood pretty much but that's another story. I'm glad that I choose to just get my head down and focus on work cause now I have hope for the future :p Just gunna keep learning and keep moving forward :p Thanks!

cunning osprey
#

Well true enough, I mean, I got the code down and the dummy variables in place. But I just forgot how to interpret the results

#

After putting in my dummy variables, all the other coefficents increased

desert oar
#

@hallow wave good on you for turning it around! happy to help here when you have more questions

hallow wave
#

Well i'm going to start learning what you recomended, thanks for the support!

#

Actually, I have one question, what is spark?

surreal nacelle
#

How can I check the feature's importance after a simple cross_val_score ?

supple ferry
#

What is your algorithm? @surreal nacelle

surreal nacelle
#

Random Forest

desert oar
#

@surreal nacelle you need to get the actual model object that was fitted. im not sure you can do that with cross_val_score

surreal nacelle
#

Either that or Logistic Regression, they give similar prediction right now

#

I have the model at hands

#

I could .fit it no problems

desert oar
#

if you have the fitted object then you can look up the importances directly, check the documentation for whatever library you're using

#

sometimes you need to explicitly enable importances before fitting

#

again it depends on the software

surreal nacelle
#

Gonna do that thank you

#

I use sklearn

desert oar
#

@hallow wave spark is a distributed computing platform that runs on top of hadoop or yarn (which is a scheduler that runs on hadoop)

#

@surreal nacelle ok, the sklearn docs are very good. you should get used to using and reading them

surreal nacelle
#

Yep ๐Ÿ˜ƒ

desert oar
#

@hallow wave so the idea is that instead of developing some kind of customized high performance cluster, with apache spark you just have a cluster of commodity hardware running hadoop, and you can do distributed computations on fairly big data that way

surreal nacelle
#

Btw, is a medium 'toward data science' subscription worth it ? I keep seeing promising articles, but there is a paywall

desert oar
#

eh? i never had to pay

surreal nacelle
desert oar
#

no

#

that's just medium trying to get you to make an account

#

wait what

#

uhh

#

i never saw that

surreal nacelle
#

Yea, it's only 1 more story

tulip estuary
#

I think you get 5 per day for free

desert oar
#

i guess i never read more than 5 a day?

surreal nacelle
#

This is the website that the screenshot is from

#

Well, 5 per day is not too bad

#

gonna create an account then

tulip estuary
#

I think it is 5

desert oar
#

huh

#

interesting

#

i dont mind paying for good content

#

and TDS is good content

surreal nacelle
#

Alright, I'll try the free account and see if it's worth

desert oar
#

i never even signed up

#

but yeah i guess if you need

silent swan
#

omfg the tensorflow ecosystem

quartz monolith
#

@desert oar spoke about the NaN's in a knowledge database. There are some specific columns which has some error code which come from the machine for e.g. controller or drive. For specific error there are two different columns one is filled out the other is a NaN. The label of the data frame should be a Troubleshooting (label). How to deal with this missing values? its not a missing value its a information which doesnt matter for some troubleshoots, right?

#

Previously there was some NaN which were not at random. I cleaned them up with other experts

desert oar
#

@silent swan omfg the tensorflow docs

#

so much black magic that the docs never explain, or barely explain

#

@quartz monolith yes that is possible. you can leave those missing. but how to handle them with catboost is another question

#

what you can try is to concatenate the two columns

#

so maybe you have a record like (Error1, "Restarted computer") and another record like (Error5, None) because Error5 doesn't require troubleshooting

#

so you can make a new single feature "Error1 - Restarted Computer" and "Error5 - None"

#

but that will make it harder for catboost to learn

#

so in that particular case you can just turn the missing values into an empty string or something else

#

since the missing values are "meaningful"

silent swan
#

and then there're like, 12 different ways of doing the same thing

#

I'm so glad I live in the pytorch sphere

#

but I need to port a model to tf

desert oar
#

i love how TF also has this general key-value store mess

#

like regularization is a hack

#

i have no idea what does what

#

learning TF sucks, keras is a bandaid. i need to try torch ๐Ÿ˜›

silent swan
#

pytorch is 10/10

#

I think keras is aight if you have a very standard workflow and never want to poke at what's underneath

hallow wave
#

Hey salt rock, could you possibly give me an example of how conditional probability is used in work.

desert oar
#

@hallow wave im working on a classifier that has 1500 potential classes, meaning my model emits a probability distribution over the 1500 classes for each piece of data. each class corresponds to one of 3 types: high, medium, and low. i can use the laws of probability (and specifically conditional probability) to derive a distribution over those 3 types based on the distribution over the 1500 classes

#

without having to develop a separate model for the 3 classes

hallow wave
#

Hmm, I don't really understand how you would code it but I sort of get the idea. Your data gives you results for a variables for say x and then you create distributions based upon your results/

desert oar
#

sorta... thats the general idea

hallow wave
#

Well I haven't gone in depth with it yet, will have to start making code revolving around it

quartz monolith
#

sorry i was a bit afk i will get into it in some minutes

quartz monolith
#

@desert oar oh i get it i will let them as missing values but as "" .

#

hwo are you guys traning a model? through cpu or gpu?`

#

because my model really takes a long time

cunning osprey
#

Does anyone know how to find certain words in a text column while filtering other certain words

#

Im trying to search for the words 'view' and perspective' while excluding any rows with the words 'bottom' 'top' 'front' and 'rear'

#

So far, I'm doing
for word in perspective:
df[word] = df.astype(str).sum(axis=1).str.contains(word)

#

Which assigns bools to anything that countains the words, but I'm not trying to repeat counts, and I cant have a true statement if the row contains the filtered words

quartz monolith
#

I dont know if the excluding and removing is working in one search
but excluding or delelting any rows which contain string ist something like this
df[~df["column"].isin(["value, top, front, rear"])]

desert oar
#

thats not quite right

#

"value, top, front, rear" is a single string

#

it looks like you want a list of strings like this ["value", "top", "front", "rear"]

cunning osprey
#

Im gettng new columns, true if the words are contained, but I need to drop them somehow

desert oar
#

@cunning osprey

include_words = ['bottom', 'top', 'front', 'rear']
exclude_words = ['view', 'perspective']

include_pattern = '|'.join(rf'\b{w}\b' for w in include_words)
data['has_include'] = data['text'].str.contains(include_pattern)

exclude_pattern = '|'.join(rf'\b{w}\b' for w in exclude_words)
data['has_exclude'] = data['text'].str.contains(exclude_pattern)

data['is_valid'] = data['has_include'] & ~data['has_exclude']

data_filtered = data.loc[data['is_valid']]
cunning osprey
#

Wow

#

That works beautifully

desert oar
#

str.contains takes regex

#

so that makes your life easier

#

\b in regex means "word boundary"

#

so \bview\b will match "view" but not "preview" or "viewer" for example

cunning osprey
#

df['is_valid'] = df['has_include'] & ~df['has_exclude']

was the part I was missing

#

But the word boundary was not something I thought of, thank you so much Mr. Salt Rock

desert oar
#

๐Ÿ‘

cunning osprey
#

Over 3482 true values dang

surreal nacelle
#

Hey, I'm working on the titanic dataset, and I use logistic regression, I currently have 0.80 average prediction, and I'm not sure where to go from there.
I already filtered irrelevant features, used gridsearchCV to get the best hyperparameters (didn't change much tho) . I'm guessing that the problem reside in the data. How would you go about it ? Scaling ? Creating new features ? (If you've already worked on the titanic set, please don't give me 'solutions', I just want some guidance ๐Ÿ˜ƒ ) Thanks

quartz monolith
#

you can try other models and see how they perform

surreal nacelle
#

I already tried other models, and this one performs the best

desert oar
#

0.80 average prediction? or 0.80 accuracy

#

the next step would be making better features

surreal nacelle
#

accuracy sorry

desert oar
#

thats not bad at all

#

scaling can help

#

new features can help

#

maybe start reading some blogs, since you already have the basics

surreal nacelle
#

I've reduced the dataset to 4 features

desert oar
#

very nice

#

thats a lot better than my first titanic attempt back in the day

#

what features do you have?

surreal nacelle
#

Pclass Sex Age Fare

#

brb 2min

#

Ok I'm back, basically I started by removing the id/ticket_id/cabin features as they were irrelevant and incomplete, then changed the Embarkation point from char to int, same for sex, filled the missing values using median, realized siblings/spouse/husband didn't matter for the model (actually made it a little worse), removed these, and here I am with 4 features and 0.80 accuracy.
Next step is scaling and creating new features I guess

desert oar
#

really good start

#

scaling categorical features isnt that useful

#

scaling fare might help

surreal nacelle
#
   Survived  Pclass  Sex   Age     Fare
0         0       3    0  22.0   7.2500
1         1       1    1  38.0  71.2833
2         1       3    1  26.0   7.9250
3         1       1    1  35.0  53.1000``` I think so too
#

but some values are way above average

desert oar
#

subtract mean and divide off std dev

surreal nacelle
#

won't that mess up the scaling ?

desert oar
#

yeah can use median and median abs dev

#

good catch

#

also better missing data imputation is helpful

#

also heres a hint: you can figure out family members with the last names

surreal nacelle
#

mhm interesting

#

Thanks for the input

#

gonna read on data imputation and figure out a way to use last names efficiently

desert oar
#

there was also a fun blog post out there about how ticket numbers related to cabin position in the ship

surreal nacelle
#

๐Ÿ‘

cunning osprey
#

So I have this regression line:

mod = ols("Sepal_length ~ Petal_length + Sepal_width + Petal_width + C(Labels)", data=Iris)

#

Okay, no idea how to format it. But Labels is a column in the Iris dataframe noting the species of flower in the row, 0 - species 1, 1 - species 2, 2 - species 3

#

Just wanted to know if this was the right way to make labels into a dummy variable for regression

trim leaf
#

has anybody tried to build a stock market/futures trading algorithm

earnest prawn
#

Predicting stock market has been a thing since stock market came up lol

crude bloom
#

I'm looking for a machine learning library in python which is optimized for defining individual neurons and creating your own neural network graph. I'm interested in researching new neural network architectures so I don't want to deal with existing abstractions like convolutions and batch norm

#

anyone know an ML library that lets you do this without much effort?

lean ledge
#

What's wrong with just plain tensorflow or pytorch? @crude bloom

crude bloom
#

nothing's wrong with it, I'm just interested in doing some research

#

basically I want to make a neural network that doesn't have layers in a straightforward way

#

more randomly wired so there's more space for exploration

lean ledge
#

What's stopping you from using them?

crude bloom
#

the base element of both of those is that they have predefined layers like convolutions and dense layers

lean ledge
#

not really

crude bloom
#

actually TF does have a lower level API

#

I'm just used to using the more high level stuff

lean ledge
#

they both work with generic low level operations and have optional APIs on top (nn and keras)

#

both of the frameworks are fundamentally low level

crude bloom
#

u right, thanks ๐Ÿ˜ƒ

#

i've been using too much keras lol

floral lodge
#

Hello fine people of data science. I am reposting my question from the help channels here after it was suggested to me by one of the mods. I would like to know if it would be possible to automate the following task, and if so, any modules, libraries or keywords that I could research and learn that may help me accomplish the implementation. Basically, I want to create a small automation that will speed my workflow significantly inside of a CAD program. I want to take an image like this https://gyazo.com/a6d6433ccea29e7b13257c8fd5a5f359 or like this https://gyazo.com/6c9386d3c4d224597d49ac8d657873a2 and analyze it to detect these vertices and click them in the shown order. The perspective and scaling of the image will be slightly different each time, but it will always look like one of the two examples (whichever works better) . Is this possible? I'm familiar enough with automating the gui but I don't know where to get started on the image analysis required to identify these points, or how to create a function that allows me to drag a box around some screenspace to capture the image for analysis. Any insight, tips or knowledge is greatly appreciated.

near viper
#

its a GAN

#

trasining on mnist

#

with discriminator and generator and such

#

written with keras

surreal nacelle
#

0.84 accuracy, making progress on the titanic set ๐Ÿ˜„

near viper
#

my
my network deevolved

#

it had a blurry pattern in the center

#

and then it just died

surreal nacelle
#

There is one thing I'm wondering about, let's say the algo I use has an accuracy which varies between 0.82 and 0.88. Would saving the 0.88 model be beneficial?
If I load that model to predict, will it perform better than if I saved the 0.82, or will it be just as random as training a new one everytime?

quartz monolith
#

@surreal nacelle you should check your confusion matrix and analyze it

surreal nacelle
#

No idea what that is, but i'll google it ๐Ÿ˜„ thanks

quartz monolith
#
df[['error1', 'error2']] = df['error'].str.extract("[,|/](\w*)")
df.drop(columns =["error"], inplace = True)```
data frame has some columns which contains mostly digits also some letters and the words are divided in `, or /` i want to section it. afterwards delete it
supple ferry
#

@surreal nacelle you should use cross validation for that

surreal nacelle
#

I already use cross validation, but I was wondering if a model that performed better than the other, could be used to build around it

#

apparently not tho

#

managed to get .94% doing that, but it was obviously overfitting

desert oar
#

@floral lodge that should be possible but it's outside my particular area of expertise. You will want some kind of "edge detection" algorithm, maybe look in OpenCV

#

Also does your CAD program provide an API for scripting?

#

I know for instance you can write plug-ins to automate fusion 360 and freecad

#

@surreal nacelle you got slightly different accuracy numbers during cross validation?

surreal nacelle
#

I was thinking about combining the results of 2 models, it's a binary classification so I'd just have to compare the predictions of one and only "accept" the predictions that concurs with the other model.

#

Well not really

#

but the algo used to range from 0.78 to 0.91 at some point

desert oar
#

What do you mean range from

surreal nacelle
#

the accuracy score

desert oar
#

How are you evaluating this

#

Yeah but what is your evaluation procedure

surreal nacelle
#

I'm talking about regular predictions, not cross val btw

#

Well, I kfold then cross_val_score using the kfold