#data-science-and-ml | Python | Page 204

silk forge Jul 8, 2019, 12:06 PM

#

Traceback (most recent call last):
File "pandas_libs\parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/admin/PycharmProjects/discord/test-ml-spam.py", line 6, in <module>
dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv")
File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)

#

File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\pandas\io\parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas_libs\parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas_libs\parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas_libs\parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte

#


dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv")
print(dta)

#

this is my code

#

but whats wrong tho

still cloud Jul 8, 2019, 2:27 PM

#

Looks like you need to change encoding on the import there @silk forge

#

Ensure that spam.csv is saved with UTF-8 encoding. If not, then just add the encoding parameter to read_csv.
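
A minimal sketch of the `encoding` fix suggested above. The file written here is a small stand-in for spam.csv (its name and contents are made up for illustration), saved as Latin-1 so that the default UTF-8 read fails in the same way.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in for spam.csv that is NOT valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "spam_sample.csv")
with open(path, "w", encoding="latin-1", newline="") as fh:
    fh.write("v1,v2\nham,caf\xe9 later?\nspam,WIN \xa3100 now\n")

try:
    pd.read_csv(path)  # default encoding is UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the codec the file was actually saved with avoids the error.
# "latin-1" is an alias of "ISO-8859-1".
dta = pd.read_csv(path, encoding="latin-1")
print(utf8_ok)
print(dta)
```

If the real encoding is unknown, Latin-1 will always decode without raising, but the characters may come out wrong, so it is worth checking a few rows by eye.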

celest moss Jul 8, 2019, 2:36 PM

#

I am implementing a spam classifier using naive bayes. I am having a hard time dealing with rare/unseen words in test cases(Both numerator and denominator are zero when calculating posterior probability). Can anyone help me with this issue ?

lapis sequoia Jul 9, 2019, 1:06 AM

#

do you need to implement it with naive bayes

#

are you converting your word tokens to vectors

#

how are you handling punctuation

#

can you use embeddings

#

there's not much you can do in case of unseen words, aside from including patterns in your training data that accounts for repeated words in a sentence, etc
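
For the zero numerator/denominator problem itself, a common remedy is additive (Laplace) smoothing: add a small constant to every word count so no likelihood is exactly zero. The sketch below is a hand-rolled illustration on toy data, not anyone's actual classifier; the helper names `train_counts` and `log_likelihood` are made up for this example.

```python
import math
from collections import Counter


def train_counts(docs):
    """docs: list of (tokens, label). Returns per-class word counts and the vocabulary."""
    counts = {}
    vocab = set()
    for tokens, label in docs:
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, vocab


def log_likelihood(word, label, counts, vocab, alpha=1.0):
    """log P(word | label) with add-alpha smoothing, so the argument of log is never 0."""
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))


docs = [
    (["win", "cash", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
]
counts, vocab = train_counts(docs)

seen = log_likelihood("cash", "spam", counts, vocab)       # word seen in spam
unseen = log_likelihood("meeting", "spam", counts, vocab)  # never seen in spam, still finite
print(seen, unseen)
```

Words that appear in no training document at all are usually just skipped at prediction time. scikit-learn's `MultinomialNB` applies the same kind of smoothing through its `alpha` parameter.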

quartz monolith Jul 9, 2019, 11:22 AM

#

@celest moss I'm working on something similar today:
I want to create a text classification pipeline for a hypertext column in my dataframe

Bag-of-words on a column
TF-IDF
sklearn

dull fern Jul 9, 2019, 12:01 PM

#

@celest moss If you use words embedding like Word2Vec you can actually try to infer the meaning of unknown words from their context

lime cloud Jul 9, 2019, 1:41 PM

#

Is there a way to create a histogram in matplotlib from the return value of another one?

#

it returns n, bins, patches
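
One way to do this, sketched with random sample data: feed the returned bin edges back in as the data and pass the counts as `weights`. The headless `Agg` backend is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2)
n, bins, patches = ax1.hist(data, bins=10)

# Re-draw from the returned values: one "sample" per bin (its left edge),
# weighted by that bin's count, using the same bin edges.
n2, bins2, _ = ax2.hist(bins[:-1], bins=bins, weights=n)
plt.close(fig)
```

The original samples are not needed for the second call, only `n` and `bins`.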

lapis sequoia Jul 9, 2019, 2:22 PM

#

could anyone recommend a serialization format?

#

I need to export my large dataframe .. so I need to compress the hell out of it before I can write it ..

quartz monolith Jul 9, 2019, 2:30 PM

#

I read something about HDF5 for very big data frames. You can also load the data and then optimize (compress) it @lapis sequoia

#

https://www.dataquest.io/blog/pandas-big-data/

Dataquest

Tutorial: Using Pandas with Large Data Sets in Python –

Python and pandas work together to handle huge data sets with ease. Learn how to harness their power in this in-depth tutorial.

lapis sequoia Jul 9, 2019, 2:38 PM

#

thanks..I found this when comparing

#

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d

Medium

The Best Format to Save Pandas Data - Towards Data Science

A small comparison of various ways to serialize a pandas data frame to the persistent storage

#

parquet seems nice..

#

now to figure out how to write df to parquet and read back

#

sweeetttt

#

https://stackoverflow.com/questions/48083405/what-are-the-differences-between-feather-and-parquet

Stack Overflow

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems.
Both are integrated within Apache Arrow (pyarrow package for python) and are
designed to correspond with Arrow as a colum...

long meadow Jul 9, 2019, 10:12 PM

#

can somebody help me?

📎 unknown.png

#

CatBoostError: Invalid type for cat_feature[7,3]=40.5 : cat_features must be integer or string, real number values and NaN values should be converted to string.

void anvil Jul 10, 2019, 1:56 AM

#

Anyone here use rllib? Trying to find a half decent guide on creating custom actions and environments

earnest prawn Jul 10, 2019, 2:00 AM

#

no I am not using it however looking at the diagram in its docs

📎 rllib-stack.png

#

id say that the environments and actions are based on openai gym and for how to create new environments yourself for that lib youll find many guides by simply googling around

#

@void anvil

#

https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_env.py in fact its even linked to this thing in the first few sentences of the docs which shows that this is the exact procedure to do this

GitHub

ray-project/ray

A fast and simple framework for building and running distributed applications. - ray-project/ray

void anvil Jul 10, 2019, 2:04 AM

#

oh

#

that would explain why I couldn't find anything

#

and just calls

lapis sequoia Jul 10, 2019, 3:05 AM

#

im having some memory allocation issues

#

I concat a list of dataframes before writing to parquet file

#

the dataframe is huge..

#

need a workaround..

lapis sequoia Jul 10, 2019, 3:24 AM

#

I ended up using dask

#

have to test it

#

not sure if concat in dask requires reset index though..

supple ferry Jul 10, 2019, 12:10 PM

#

@long meadow as far as I understood, catboost is using categorical features. So your input should be categorical. Either use bins as category boundaries, or convert everything to string (not the best approach)
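
A rough sketch of the "convert to string" route, which is what the error message asks for. The frame and the `cat_cols` list are invented for illustration; only pandas is used, and passing the columns to CatBoost is left as a comment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age_band": [40.5, 22.0, np.nan, 40.5],  # float column that is meant as a category
    "city": ["Oslo", "Rome", "Oslo", None],
    "income": [52.0, 31.5, 47.0, 60.0],      # genuinely numeric, left untouched
})
cat_cols = ["age_band", "city"]  # hypothetical list of categorical columns

for col in cat_cols:
    # Real numbers and NaN are not accepted as categories, so turn every
    # value into a string and give missing values an explicit label.
    df[col] = df[col].map(lambda v: "missing" if pd.isna(v) else str(v))

print(df)
# The same column names would then be passed as cat_features when fitting.
```

Binning the float column first (for example with `pd.cut`) and converting the bin labels to strings is the other option mentioned above.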

quartz monolith Jul 10, 2019, 6:14 PM

#

@supple ferry @long meadow If you're using pandas
from sklearn.model_selection import train_test_split

# cat_cols: list of the categorical column names
for col in cat_cols:
    df[col] = df[col].astype('category')

y = df.target
X = df.drop('target', axis=1)  # drop the label column that y was taken from

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)

#

but i'm using lightgbm. Don't know if the approach is the same

#

check your X with
X.info()

#

@lapis sequoia let us know about your exp. with dask

supple ferry Jul 10, 2019, 8:18 PM

#

@quartz monolith OP can also use patsy and create X y data frames from formula

patsy.dmatrices("y ~ x1 + x2 + C(x3)", data=df, return_type="dataframe")

Where x3 is category column

quartz monolith Jul 10, 2019, 8:35 PM

#

@supple ferry what's the benefit of using patsy vs others like label encoding, OHE...

lapis sequoia Jul 10, 2019, 11:15 PM

#

@quartz monolith it's pretty good, but after concatenating multiple dataframes, it tries to save them as parts when you try to export it. It has built in support for export to hdf and parquet, but again it tries to save everything in parts.

#

Compression is pretty solid.

silent swan Jul 11, 2019, 1:34 AM

#

quick poll: what do you guys use for filtering DataFrames

#

df.query, or df[df["somecol"]==someval]

lapis sequoia Jul 11, 2019, 2:59 AM

#

depends how large the dataframe is, how often you're going to query

#

for filtering, I would go with the second

#

especially if you're using it down the line

lapis sequoia Jul 11, 2019, 5:30 AM

#

I found this little gem

#

http://ruder.io/optimizing-gradient-descent/

Sebastian Ruder

An overview of gradient descent optimization algorithms

Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as Momentum, Adagrad, and Adam actually work.

supple ferry Jul 11, 2019, 9:34 AM

#

@silent swan df.filter
Or df.query is fast
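
For reference, a small sketch of the options named in the poll, on a made-up frame. Note that `DataFrame.filter` selects by column or index labels and does not test cell values, so it is not a drop-in replacement for the other two.

```python
import pandas as pd

df = pd.DataFrame({"somecol": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
someval = "a"

# Boolean mask: rows where the condition holds.
by_mask = df[df["somecol"] == someval]

# query(): same rows, condition written as a string; @ refers to a local variable.
by_query = df.query("somecol == @someval")

# filter(): picks columns (or index labels) by name, all rows are kept.
by_label = df.filter(items=["value"])

print(by_mask)
print(by_query)
print(by_label)
```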

supple ferry Jul 11, 2019, 12:36 PM

#

@quartz monolith patsy is very suitable for those things. Less lines and also allows you to use formulas in your model. Formula like syntax is very good for people coming from R or Matlab

olive willow Jul 11, 2019, 5:19 PM

#

has anyone read, Big data by Bernard Marr? if yes is it good?

quartz monolith Jul 11, 2019, 7:16 PM

#

```
from gensim import models

# bow_corpus and dictionary come from the earlier preprocessing step
model = models.LdaModel(bow_corpus, num_topics=60)
other_texts = ['die anlage steht auf ins bitte vor ort ueberpruefen',
               'es besteht gefahr in der unteren anlage',
               'alle taster leuchten',
               'der antrieb scheint nicht zu funktionieren']

# doc2bow expects a list of tokens, so split each sentence first
other_corpus = [dictionary.doc2bow(doc.split()) for doc in other_texts]
unseen_doc = other_corpus[0]
vector = model[unseen_doc]
```

`Error: ValueError: need at least one array to concatenate`

solved

strange epoch Jul 11, 2019, 7:22 PM

#

What is a good way to find some combination of values to set a flag? (RE: https://discordapp.com/channels/267624335836053506/439702951246692352/598923574291726367)

leaden bobcat Jul 11, 2019, 9:51 PM

#

Is anyone available to chat in voice for a few minutes regarding a ML question with point of sale transactional data?

#

DM me if so, I'm finding a ton of ML info regarding interpreting single rows of data, but I've got transactions that encompass up to 20+ rows, and I'm not sure which method to necessarily approach to get ML to read 20+ rows of data as a single entity

lapis sequoia Jul 12, 2019, 12:05 AM

#

anyone wanna hear me talk about basic ml because I need to revise

lapis sequoia Jul 12, 2019, 12:56 AM

#

how does an array containing 5 elements of 512 embeddings each yield a 5 element array of 5 values when a dot product is done..

#

im just trying to wrap my head around how the matrix multiplication occurs

void anvil Jul 12, 2019, 1:32 AM

#

Magic

#

Same way it does in linear regression.

silent swan Jul 12, 2019, 2:19 AM

#

I don't understand the issue

lapis sequoia Jul 12, 2019, 6:48 AM

#

what issue

#

I think I understand

#

512 arranged vertically times 512 arranged vertically then product and sum
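
A small numpy sketch of that shape arithmetic, assuming the dot product is taken between the array and its own transpose (random numbers stand in for the real embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 512))  # 5 items, one 512-dim embedding each

# (5, 512) @ (512, 5) -> (5, 5): one score for every pair of items.
sim = emb @ emb.T

# Entry [0, 1] is "multiply elementwise, then sum" over the 512 dimensions.
manual = sum(emb[0, k] * emb[1, k] for k in range(512))

print(sim.shape)
print(np.isclose(sim[0, 1], manual))
```

The 512 dimension is the one that gets summed away, which is why it does not appear in the result.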

supple ferry Jul 12, 2019, 8:13 AM

#

@quartz monolith solution maybe? 😁

#

@void anvil do you have experience with numpy?

quartz monolith Jul 12, 2019, 8:16 AM

#

singsingRee

#

@lapis sequoia im working also on doc2vec and want to plot it

#

maybe we have the same issue

lapis sequoia Jul 12, 2019, 8:21 AM

#

what do you need it for

quartz monolith Jul 12, 2019, 8:23 AM

#

I want to predict a column with lda and for clustering

lapis sequoia Jul 12, 2019, 8:24 AM

#

that's not very specific, what's the end goal

#

the column is target classes?

quartz monolith Jul 12, 2019, 8:27 AM

#

Based on a service text I want to classify in keywords

lapis sequoia Jul 12, 2019, 8:27 AM

#

could you give an example

quartz monolith Jul 12, 2019, 8:28 AM

#

SVM was 85% to predict on free text the label. I want to try it with doc2vec

lapis sequoia Jul 12, 2019, 8:28 AM

#

you're still not getting it

#

give me an example of input and output

quartz monolith Jul 12, 2019, 8:29 AM

#

X = "All buttons light up and the system stops" y="fuse" e.g.

#

I want to create from the knowledge base a sub classification with lda

#

because the label is too general

lapis sequoia Jul 12, 2019, 8:30 AM

#

ok

#

what you need is a knowledge graph

#

yeah, I get it now

#

but building a knowledge graph is your first step, because you can classify

#

and you can't do that from training example

#

do you have a finite list of categories?

quartz monolith Jul 12, 2019, 8:31 AM

#

the goal would be based on x to get y= the text

X= are coming from machine error codes and using knowledge text is in the data base

#

Yes i have over 65 categories but its still not specified class

#

my approach would be to use lda on the free text to make sub classes

lapis sequoia Jul 12, 2019, 1:22 PM

#

that's not straightforward

#

because anything in your query can trigger or be correlated to another category

void anvil Jul 12, 2019, 1:29 PM

#

@supple ferry which part

#

I assume you mean numba not numpy

supple ferry Jul 12, 2019, 1:31 PM

#

@void anvil , no i meant numpy :)
I am trying to write a workaround to my problem.
In STATA when you fit clogit (conditional logit) it drops exog variables that have no within group variance and fits the model without them. There is no implementation of that in python

#

do you know maybe an easy way to group an array based on one column and check some column x for variance ? maybe a vectorized way of doing that

void anvil Jul 12, 2019, 1:45 PM

#

I mean you can select columns with [: ,x]

#

```
import numpy as np

a = np.array([[1,2],[3,4]])
print(a)
a[:, 1]  # select column 1 (the REPL echoes the result)
```

Gives:

```[[1 2]
 [3 4]]
array([2, 4])```

#

@supple ferry

#

then you can just do np.var or w/e

#

is that what you're looking for?

supple ferry Jul 12, 2019, 2:22 PM

#

Not exactly. I also want to group the array by one column and then run all these
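
One possible vectorized route, sketched with pandas on a toy array: group by the id column, compute the within-group variance of each exog column, and drop the columns whose variance is zero in every group. The column names and data are invented for the example.

```python
import numpy as np
import pandas as pd

# Column 0 is the group id, columns 1-2 are exog variables.
arr = np.array([
    [1, 0.5, 3.0],
    [1, 0.7, 3.0],
    [2, 0.1, 4.0],
    [2, 0.9, 4.0],
])
df = pd.DataFrame(arr, columns=["group", "x1", "x2"])

# Variance of every exog column inside each group.
within_var = df.groupby("group")[["x1", "x2"]].var(ddof=0)

# Columns that never vary within any group (x2 here).
mask = np.isclose(within_var.to_numpy(), 0.0).all(axis=0)
no_within_variation = within_var.columns[mask].tolist()

kept = df.drop(columns=no_within_variation)
print(no_within_variation)
print(kept)
```

`kept.to_numpy()` gives back a plain array if the model code expects one.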

silk forge Jul 12, 2019, 2:27 PM

#

yo

#

import discord
import sklearn.naive_bayes as bae
import pandas as pd
import numpy

dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv" , encoding='ISO-8859-1' ,
    converters = {'v2': lambda x: 1 if x == 'ham' else 0})
# v2 - features

clf =bae.GaussianNB()



x = numpy.array([[dta.v2]])
y = numpy.array([[dta.v1]])


clf.fit(x,y)
n = clf.predict([['xd lol imma destroy everyone here immahhhhhhhhhhhhhhhhhhhhhhhhhh kill all of you ..................................................................................................................']])
print(n)

#

C:\Users\admin\PycharmProjects\discord\venv\Scripts\python.exe C:/Users/admin/PycharmProjects/discord/test-ml-spam.py
Traceback (most recent call last):
  File "C:/Users/admin/PycharmProjects/discord/test-ml-spam.py", line 18, in <module>
    clf.fit(x,y)
  File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\sklearn\naive_bayes.py", line 189, in fit
    X, y = check_X_y(X, y)
  File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\sklearn\utils\validation.py", line 719, in check_X_y
    estimator=estimator)
  File "C:\Users\admin\PycharmProjects\discord\venv\lib\site-packages\sklearn\utils\validation.py", line 539, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.

#

help

quartz monolith Jul 12, 2019, 2:31 PM

#

shocked

silk forge Jul 12, 2019, 2:31 PM

#

what

#

@quartz monolith ?

quartz monolith Jul 12, 2019, 2:32 PM

#

you want to predict your input right?

silk forge Jul 12, 2019, 2:32 PM

#

no

#

i want to predict if the message is spam or not

#

@quartz monolith

quartz monolith Jul 12, 2019, 2:33 PM

#

do you have a training set?

silk forge Jul 12, 2019, 2:33 PM

#

no

#

its just my message

quartz monolith Jul 12, 2019, 2:34 PM

#

you need to find some csv or data where each text comes with a label column (spam: yes or no)

silk forge Jul 12, 2019, 2:34 PM

#

clf.fit(x,y)
n = clf.predict([['xd lol imma destroy everyone here immahhhhhhhhhhhhhhhhhhhhhhhhhh kill all of you ..................................................................................................................']])

quartz monolith Jul 12, 2019, 2:34 PM

#

https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

Towards Data Science

Multi-Class Text Classification with Scikit-Learn

There are lots of applications of text classification in the commercial world. For example, news stories are typically organized by topics…

silk forge Jul 12, 2019, 2:34 PM

#

@quartz monolith

#

yes

#

i have a dataset

#

dta = pd.read_csv(filepath_or_buffer="C:/Users/admin/Desktop/ML datasets/SPAM/spam.csv" , encoding='ISO-8859-1' ,
    converters = {'v2': lambda x: 1 if x == 'ham' else 0})

#

so i cant do this with naive bayes then?

quartz monolith Jul 12, 2019, 2:37 PM

#

```
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words expects a list, not a set
stop = stopwords.words('english') + ['.', ',', '"', "'", '-', '.-']
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', ngram_range=(1, 2), stop_words=stop)
features = tfidf.fit_transform(df["spam"]).toarray()
labels = df.text_id

features.shape
```

#

first i would go with tfidf

silk forge Jul 12, 2019, 2:38 PM

#

so naive bayes won't work?

quartz monolith Jul 12, 2019, 2:39 PM

#

after that you choose your model with e.g.

model = GaussianNB()
model.fit(X_train_tfidf, y_train)

#

and then you can predict with

```
# reuse the fitted tfidf vectorizer; GaussianNB needs a dense array
print(model.predict(tfidf.transform(["xd lol imma destroy everyone here immahhhhhhhhhhhhhhhhhhhhhhhhhh kill all of you .................................................................................................................."]).toarray()))
```

#

to get your score: model.score(X_train_tfidf, y_train)

#

but do the steps before tfidf like in the tutorial

#

your data must be clean

#

a binary classifier suits your case best, it's like 0 or 1

#

hope it helped you

#

@silk forge
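
Putting the advice together: the "Found array with dim 3" error comes from `numpy.array([[dta.v2]])`, which wraps the column in two extra lists, and the raw strings also need to be turned into numbers before any Naive Bayes model can use them. The sketch below uses a handful of invented messages in place of spam.csv (assuming `v1` holds the label and `v2` the message text), and swaps in `MultinomialNB`, which works directly on the sparse TF-IDF matrix, instead of `GaussianNB`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the v2 (text) and v1 (label) columns.
texts = [
    "win a free prize now",
    "free cash call now",
    "are we still meeting at noon",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# The vectorizer turns the list of strings into a 2-D feature matrix,
# which is the shape the classifier expects.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

X = clf.named_steps["tfidfvectorizer"].transform(texts)
print(X.shape)  # (number of messages, number of distinct terms)

pred = clf.predict(["free prize, call now"])
print(pred)
```

With the real data the calls would be `clf.fit(dta["v2"], dta["v1"])`, with no `converters` argument needed on `read_csv`.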

vale arrow Jul 12, 2019, 4:44 PM

#

Question: should i use an activation function in all the middle layers of my ANN? Why or why not?

quartz monolith Jul 12, 2019, 4:51 PM

#

label_encoders = {}
for col in feature_cols:
    print("Encoding {}".format(col))
    new_le = LabelEncoder()
    train[col] = new_le.fit_transform(train[col])
    label_encoders[col] = new_le

need to encode strings to numeric for decision tree

#

```TypeError: argument must be a string or number```

#

📎 Bildschirmfoto_2019-07-12_um_18.52.24.png

#

Here is my data frame
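
This TypeError typically shows up when a column mixes strings with numbers or NaN. A hedged sketch of one workaround, converting every value to a string before encoding; the `train` frame and its `machine` column are invented here since the screenshot is not available.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column mixing str, NaN and int, which LabelEncoder cannot sort.
train = pd.DataFrame({"machine": ["A1", np.nan, "B2", 7]})
feature_cols = ["machine"]

label_encoders = {}
for col in feature_cols:
    le = LabelEncoder()
    # Make every value a plain str; missing values get an explicit label.
    as_text = train[col].map(lambda v: "missing" if pd.isna(v) else str(v))
    train[col] = le.fit_transform(as_text)
    label_encoders[col] = le

print(train)
print(list(label_encoders["machine"].classes_))
```

Keeping the fitted encoders in the dict, as in the original loop, lets the same mapping be applied to the test set later.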

severe pasture Jul 12, 2019, 5:07 PM

#

Hi all, I have a question. I'm running a fresh install of Python 3.7.3. I've been given a .nxs file that contains scan data, and I have some .py scripts that go with the data. How can I read the .nxs file and then run scripts on contained data? Will I need certain packages or libraries? I'd like to have the .nxs data read as arrays that I can manipulate.

silent swan Jul 12, 2019, 5:25 PM

#

@vale arrow you need a non-linear activation function between layers, because otherwise the layers are just affine functions, and a composition of affine functions is still just an affine function, so you don't get any more model expressivity from additional layers
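
The collapse can be checked numerically. In this sketch (random weights, arbitrary small sizes) two stacked layers without an activation give exactly the same output as a single layer with weights `W2 @ W1` and bias `W2 @ b1 + b2`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((3, 1))
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))

# Two layers with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True


def relu(z):
    return np.maximum(z, 0)


# With a non-linearity in between, the expression can no longer be
# rewritten as a single W @ x + b in general.
with_relu = W2 @ relu(W1 @ x + b1) + b2
```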

vale arrow Jul 12, 2019, 5:32 PM

#

@silent swan thank you!

lapis sequoia Jul 12, 2019, 8:32 PM

#

📎 Distribution.PNG

#

Was wondering how I could go about rescaling the y-axis on this histogram

#

Would it make sense to have a non-linear y-axis? As in, have the y-axis increment by no set number?

silent swan Jul 12, 2019, 9:19 PM

#

log scale?

lapis sequoia Jul 13, 2019, 9:21 AM

#

https://www.youtube.com/watch?v=MrLPzBxG95I&list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS

YouTube

Kilian Weinberger

Lecture 1 "Supervised Learning Setup" -Cornell CS4780 Machine Lear...

Cornell class CS4780. Official class webpage: http://www.cs.cornell.edu/courses/cs4780/2018fa/ Written lecture notes: http://www.cs.cornell.edu/courses/cs478...

quartz monolith Jul 13, 2019, 10:41 AM

#

https://www.youtube.com/watch?v=KChtdexd5Jo

YouTube

Enthought

Keynote: The New Era in NLP | SciPy 2019 |

Connect with us! ***************** https://twitter.com/enthought https://www.facebook.com/Enthought/ https://www.linkedin.com/company/enthought

quartz monolith Jul 13, 2019, 11:12 AM

#

@lapis sequoia my doc2vec is really good with linearsvc
the good thing is it understands what doesn't belong to a text and can give me some good keywords

lapis sequoia Jul 13, 2019, 2:28 PM

#

@silent swan You mean to convert the y-axis to a log scale?

silent swan Jul 13, 2019, 3:56 PM

#

yep
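
A minimal sketch of the log-scale suggestion, using skewed random data in place of the distribution from the screenshot. The `Agg` backend is only there so it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).exponential(size=10_000)

fig, ax = plt.subplots()
ax.hist(data, bins=50)
ax.set_yscale("log")  # bar heights are unchanged, only the axis spacing is
ax.set_ylabel("count (log scale)")
yscale = ax.get_yscale()
plt.close(fig)
```

Passing `log=True` to `hist` is a shorter way to get the same axis.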

leaden bobcat Jul 13, 2019, 5:22 PM

#

Anyone around that's familiar with scikit-learn? getting an odd error for the preloaded iris dataset trying to use Random Forest Classifier

silent swan Jul 13, 2019, 9:20 PM

#

just post the error

crude parcel Jul 13, 2019, 9:32 PM

#

whats error

#

hey guys... is there a place where you can find out more about the various algorithms being used for nlp? more of a use-case to matching algorithm type thing?

crude bloom Jul 14, 2019, 1:17 AM

#

@crude parcel take a look at the papers with code page for NLP. It has the SOTA papers for each major class of NLP: https://paperswithcode.com/area/natural-language-processing

Browse state-of-the-art in ML

Papers With Code highlights trending ML research and the code to implement it.

#

They break it into categories and tags, it's the best site I've seen to browse the field in the most up-to-date way

crude parcel Jul 14, 2019, 1:20 AM

#

interesting, it's quite overwhelming

tulip estuary Jul 14, 2019, 1:27 AM

#

That is a really cool site. I have never seen that before. (A lot more than NLP on there too.)

crude bloom Jul 14, 2019, 1:29 AM

#

it's generally my go-to to get a high level idea of a field, the website is very well put together

lilac ferry Jul 14, 2019, 3:36 AM

#

I've created a dataframe (D1). I now want to create a second dataframe (D2) using the data from D1, condition being it needs to match the value of one of the columns from D1. How do I approach this?

echo storm Jul 14, 2019, 3:39 AM

#

think you just take the column as a slice or are you linking dfs?

lilac ferry Jul 14, 2019, 3:46 AM

#

slicing worked, thanks!

lapis sequoia Jul 14, 2019, 4:34 AM

#

.loc

supple ferry Jul 14, 2019, 9:10 AM

#

@lilac ferry beware though. Simple slicing returns a view not the copy. So if you assign view to another variable, and edit it later your original dataframe will be changed. Either use copy method or use fancy indexing
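
Whether pandas hands back a view or a copy depends on the operation and the pandas version, so the safe habit is to make the copy explicit. A short sketch on an invented D1:

```python
import pandas as pd

d1 = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo"], "sales": [10, 20, 30]})

# Rows of D1 matching a value in one column, as an independent frame.
d2 = d1.loc[d1["city"] == "Oslo"].copy()

# Editing D2 now cannot touch D1.
d2["sales"] = 0

print(d1)
print(d2)
```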

lapis sequoia Jul 14, 2019, 9:11 AM

#

what's fancy indexing

supple ferry Jul 14, 2019, 9:12 AM

#

@lapis sequoia https://jakevdp.github.io/PythonDataScienceHandbook/02.07-fancy-indexing.html

lapis sequoia Jul 14, 2019, 9:50 AM

#

soooo just a fancy term for indexing

#

people have always used bins in df.. since the dawn of groupbys

supple ferry Jul 14, 2019, 10:33 AM

#

http://pandas-docs.github.io/pandas-docs-travis/user_guide/indexing.html

#

I think it is very important to know the difference between views and copies

#

sometimes it can screw quite a lot if left unattended

lapis sequoia Jul 14, 2019, 10:44 AM

#

I always make a copy if I'm doing operations that alter the original..

#

anyways, is there a server that focuses primarily on data engineering

#

i'm having trouble with serialization with multiprocessing..

silent swan Jul 14, 2019, 5:45 PM

#

imo data science ought to include data engineering

#

otherwise it just becomes stats+ml

lean ledge Jul 14, 2019, 9:53 PM

#

What's wrong with that?

#

Data science and data engineering are very different roles with different skillsets

lapis sequoia Jul 14, 2019, 11:36 PM

#

Anyone here familiar with Bokeh?

silent swan Jul 15, 2019, 4:12 AM

#

data science is a big enough umbrella that everything falls under it

#

is it hardcore stats and optimization?
is it domain knowledge and business analytics?
is it big data and databases?

#

the answer is yes to all of the above

#

data science as a term is (for better or for worse) designed to cast a wide net and encompass as much as possible

lean ledge Jul 15, 2019, 4:58 AM

#

My argument would be data science is a bad buzzword and should be forgotten in favour of anything a bit more specific

lilac ferry Jul 15, 2019, 10:40 AM

#

@supple ferry i was unaware of that. Thanks 😊

hollow quartz Jul 15, 2019, 11:40 AM

#

Hi, I use Ipywidget but I don't see the widget

📎 Annotation_2019-07-15_113702.png

supple ferry Jul 15, 2019, 3:06 PM

#

Matplotlib in-line? @hollow quartz

hollow quartz Jul 15, 2019, 3:07 PM

#

there are Matplotlib in-line @supple ferry

lapis sequoia Jul 15, 2019, 3:10 PM

#

I'm not sure how to approach this problem: I am reading from a virtual file system and compiling all the contents I traverse into one dataframe like structure ( I can use either Pandas or Dask for this) , but this issue is I have to write it. I cannot write to local because it's too big, and takes too long.. my formats are limited ( I cannot access any pyarrow engine because it's not in the current codebase I have to use) so I have to save in json, or hdf ... then commit the file to cloud storage

#

I'm having trouble saving fast and copying to cloud storage.. takes too long..

#

some serialization formats that have been light enough : 1. Pandas to messagepack objects dumped into pickle (but I am not able to read this back because it's compressed too much), 2. Writing to hdf to tempfile then copying to cloud storage , but this takes too long just to finish the write to hdf

#

it'd be great if someone can suggest something

silent swan Jul 15, 2019, 5:26 PM

#

@lean ledge I also agree, but I think it serves some purpose in referring to a vague cloud of things

#

I think it's better than the "AI" buzzword, for example

#

I don't see data science being used for needless hype as much

#

are you currently writing in parts?

desert oar Jul 15, 2019, 6:07 PM

#

@lapis sequoia what about writing messagepack to a plain file? Instead of pickling

#

And aggressively compressing with xz or lzma
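
As a rough illustration of the xz idea without pyarrow: pandas can write an xz-compressed file directly. This sketch swaps pickle in for messagepack (a different format from the one suggested), on a made-up frame, and compares file sizes.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.arange(100_000),
    "label": ["ham", "spam"] * 50_000,
})

tmp = tempfile.mkdtemp()
raw_path = os.path.join(tmp, "frame.pkl")
xz_path = os.path.join(tmp, "frame.pkl.xz")

df.to_pickle(raw_path)                   # uncompressed, for comparison
df.to_pickle(xz_path, compression="xz")  # LZMA-compressed

restored = pd.read_pickle(xz_path, compression="xz")
raw_size = os.path.getsize(raw_path)
xz_size = os.path.getsize(xz_path)
print(raw_size, xz_size)
```

xz trades write speed for size, so it helps the upload more than the write. Pickle files are Python-only and should not be loaded from untrusted sources.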

hollow shard Jul 15, 2019, 8:20 PM

#

I know I've been pestering you guys a lot lately, but my neural network is not working, and indeed displays some very strange behaviour. It appears to minimise the loss before bottlenecking and stopping at 0.5. Could anyone help? dataset is MNIST and it has 1 hidden layer of 64 nodes

silent swan Jul 15, 2019, 8:30 PM

#

post code/training behavior

hollow shard Jul 15, 2019, 8:43 PM

#

sure

#

my apologies in advance for super messy code, I've been trying stuff out with it and modifying it for a long time now

#

http://dpaste.com/0D723Z4

#

it currently performs at 7% accuracy, and this is the loss vs examples trained on.

#

📎 lossvtime1bad.png

lunar leaf Jul 15, 2019, 9:02 PM

#

@hollow shard can you explain to me your implementation of the delta computation for the output layer of the network? it should be out_delta = cost_function_derivative(output_activations, true_output_label) * activation_function_derivative(x) but I don't quite understand how it is that your code is equivalent to that
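
To make that formula concrete, here is a small numpy sketch for a 784-64-10 network with sigmoid activations and a quadratic cost (both are assumptions, as are the random weights). It is not the code from the dpaste links, just the two delta equations written out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_deltas(x, y, w1, b1, w2, b2):
    """Return (delta_hidden, delta_out) for one example with quadratic cost."""
    z1 = w1 @ x + b1
    a1 = sigmoid(z1)
    z2 = w2 @ a1 + b2
    a2 = sigmoid(z2)
    # Output layer: cost derivative times activation derivative.
    delta_out = (a2 - y) * sigmoid_prime(z2)
    # Hidden layer: push the output error back through w2, one value per hidden node.
    delta_hidden = (w2.T @ delta_out) * sigmoid_prime(z1)
    return delta_hidden, delta_out


rng = np.random.default_rng(0)
x = rng.random((784, 1))
y = np.zeros((10, 1))
y[3] = 1.0
w1, b1 = rng.standard_normal((64, 784)) * 0.1, np.zeros((64, 1))
w2, b2 = rng.standard_normal((10, 64)) * 0.1, np.zeros((10, 1))

delta_hidden, delta_out = backprop_deltas(x, y, w1, b1, w2, b2)
print(delta_hidden.shape, delta_out.shape)  # (64, 1) (10, 1)
```

Note that the hidden delta is a vector with one entry per hidden node, produced by a matrix product rather than by summing everything into a single scalar.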

hollow shard Jul 15, 2019, 9:15 PM

#

sorry, it actually doesnt, but when i modified the code to follow the formula you gave it still doesnt work.

#

http://dpaste.com/1DQGF36

#

heres the new code

lunar leaf Jul 15, 2019, 9:15 PM

#

I'm still not entirely through your code, it's taking a little bit to get my head around

hollow shard Jul 15, 2019, 9:16 PM

#

hahaha im so sorry

lunar leaf Jul 15, 2019, 9:16 PM

#

lol no worries

#

I'm not quite understanding why you have these lines:

hold = np.empty_like(w1)
for j in range(0,10):
    hold[:,j] = (-LOSS[j]*Y[j]*(1-Y[j]))*w1[:,j]
delta = np.sum(hold)*Y1*(1-Y1)

hollow shard Jul 15, 2019, 9:18 PM

#

ohh

#

give me a sec

#

thats some particularly awful code

#

I believe thats just some working for the delta for the first layer

#

im trying to find the tutorial i used 1 sec

lunar leaf Jul 15, 2019, 9:20 PM

#

mmmmmm okay, I don't think I understand why that is the correct way to compute the delta for the first layer

#

alright

#

if you're trying to learn how to implement some NN stuff from scratch, I very highly recommend this book: http://neuralnetworksanddeeplearning.com it has helped me a great deal

hollow shard Jul 15, 2019, 9:21 PM

#

uhhm yeah if you could drop me a hint on what the correct way is I'd be very grateful

lunar leaf Jul 15, 2019, 9:21 PM

#

I'm not completely sure how to wrap the correct example into your code here so you don't have to change much

hollow shard Jul 15, 2019, 9:21 PM

#

eh thats fine

#

fire away

lunar leaf Jul 15, 2019, 9:22 PM

#

alright, I'm going to more-or-less pull these out of the book I just linked you, but I will re-format the code to be more readable and in python 3.x instead of 2.7

#

give me a few minutes

hollow shard Jul 15, 2019, 9:22 PM

#

thats fine, thanks so much for the help

#

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ heres the tutorial btw

Matt Mazur

Mazur

A Step by Step Backpropagation Example

Background Backpropagation is a common method for training a neural network. There is no shortage of papers online that attempt to explain how backpropagation works, but few that include an example…

lunar leaf Jul 15, 2019, 9:40 PM

#

@hollow shard look over this and see what you think: http://dpaste.com/3X8TYES

#

this code does not include the gradient descent algorithm, it is just an implementation of backpropagation

#

ah, and of course I reversed the arguments on line 40

#

it should be delta = self.cost_function_derivative(label, activation_list[-1])

hollow shard Jul 15, 2019, 9:43 PM

#

thanks

lunar leaf Jul 15, 2019, 9:44 PM

#

I also forgot a couple other things (self arguments for the functions)

#

look, don't judge me ok

#

I did not try to compile this

hollow shard Jul 15, 2019, 9:45 PM

#

hahaha thats fine

#

look at the code I wrote, its 10 times worse

#

ik the book ur talking about btw

lunar leaf Jul 15, 2019, 9:47 PM

#

I swear by it haha

hollow shard Jul 15, 2019, 9:47 PM

#

haha

lunar leaf Jul 15, 2019, 9:47 PM

#

I must have read the whole thing 5 or 6 times by now

hollow shard Jul 15, 2019, 9:47 PM

#

wow

#

btw what should my self.num_layers value be?

lunar leaf Jul 15, 2019, 9:51 PM

#

you can instantiate a network object as:
net = network([784, 64, 10])

hollow shard Jul 15, 2019, 9:51 PM

#

ah ok

lunar leaf Jul 15, 2019, 9:52 PM

#

the init function I wrote should build the weight matrices and bias vector automatically from that

hollow shard Jul 15, 2019, 9:52 PM

#

nice, thanks

lunar leaf Jul 15, 2019, 9:52 PM

#

I'm now realizing that I've referred to the bias values per node as being "bias vectors" which is incorrect

#

the whole thing is a disaster

#

I should quit coding

hollow shard Jul 15, 2019, 9:52 PM

#

hahhahaha

lunar leaf Jul 15, 2019, 9:52 PM

#

it's in the ballpark of being right

hollow shard Jul 15, 2019, 9:53 PM

#

honestly, its fine, look at my code.

#

Now thats a mess

lunar leaf Jul 15, 2019, 9:53 PM

#

lol

lunar leaf Jul 15, 2019, 10:16 PM

#

@hollow shard I've fixed several minor issues and made sure it actually works this time

#

http://dpaste.com/3KQ4YTA

hollow shard Jul 15, 2019, 10:16 PM

#

oh wow i really appreciate that

#

I've been messing around with it for a while now

lunar leaf Jul 15, 2019, 10:17 PM

#

as per the example code at the bottom b,w = net.backprop(np.zeros((784,1)), np.zeros((10,1))) will produce the back-propagated error gradients for the biases and weights of the network for a 784 length vector of zeros at the input and a 10 length vector as the label for that input

#

following this, you can write a gradient descent algorithm to produce an update for the weights and biases with those gradients
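
A rough sketch of the gradient descent step described here, assuming the gradients come back as lists of arrays with the same shapes as the biases and weights (as `b,w = net.backprop(...)` suggests). The function name `sgd_step` and the learning rate are made up for illustration.

```python
import numpy as np

def sgd_step(weights, biases, grad_w, grad_b, learning_rate=0.1):
    """Move every parameter a small step against its gradient."""
    new_weights = [w - learning_rate * gw for w, gw in zip(weights, grad_w)]
    new_biases = [b - learning_rate * gb for b, gb in zip(biases, grad_b)]
    return new_weights, new_biases

# Toy parameters and gradients standing in for a real network and backprop.
weights = [np.ones((2, 3))]
biases = [np.zeros((2, 1))]
grad_w = [np.full((2, 3), 0.5)]
grad_b = [np.ones((2, 1))]

weights, biases = sgd_step(weights, biases, grad_w, grad_b, learning_rate=0.1)
print(weights[0][0, 0], biases[0][0, 0])
```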

#

that should be enough to get you on your feet

#

let me know if you need any further help

hollow shard Jul 15, 2019, 10:19 PM

#

i cant thank u enough, honestly

#

thanks so much for your time

lunar leaf Jul 15, 2019, 10:19 PM

#

happy to help

crude bloom Jul 15, 2019, 10:23 PM

#

@lunar leaf oh that's the book by Michael Nielsen, I've come across this before, now I'm more encouraged to give it a read

#

his explanations are very intuitive and I love his other writing

lunar leaf Jul 15, 2019, 10:24 PM

#

it's absolutely phenomenal, really

#

a lot of the code I found pretty tough to get through, but honestly it's a small complaint compared to how deep his explanations are

#

I have yet to find a better intro to neural networks

hollow shard Jul 15, 2019, 10:25 PM

#

i skimmed the pdf, but i kinda gave up after looking at the code

#

looks like i might be going back though

lunar leaf Jul 15, 2019, 10:26 PM

#

I think it depends on a few things

#

multi-variate calculus is pretty crucial if you want to really understand this stuff

hollow shard Jul 15, 2019, 10:26 PM

#

dw, i have that under my belt

lunar leaf Jul 15, 2019, 10:27 PM

#

and I think that his book is a fantastic way to really get into the nuts and bolts of neural networks and the fundamentals behind a lot of machine learning

hollow shard Jul 15, 2019, 10:27 PM

#

thats actually just what im looking for, a super in depth understanding

#

I hate using stuff I don't understand, and I hate using NN's as some kind of black box

crude bloom Jul 15, 2019, 10:29 PM

#

Nielsen's the guy to be explaining it, too. He's been a big part of distill.pub which has a few articles explaining ML with visuals and interactivity

hollow shard Jul 15, 2019, 10:29 PM

#

wow ok

#

ah crap its throwing me an error again

#

for some reason its having trouble with the shapes of the output

#

ill fix it i think

lunar leaf Jul 15, 2019, 10:33 PM

#

note the shape of the inputs and outputs that I provide in the example code
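
Shape errors like this often come from passing flat arrays where column vectors are expected. A small sketch, assuming the network wants a `(784, 1)` input and a `(10, 1)` label as in the example call above; the helper names `to_column` and `one_hot` are hypothetical.

```python
import numpy as np

def to_column(values):
    """Reshape a flat sequence into an (n, 1) column vector."""
    return np.asarray(values, dtype=float).reshape(-1, 1)

def one_hot(label, num_classes=10):
    """Encode an integer class label as a (num_classes, 1) column vector."""
    y = np.zeros((num_classes, 1))
    y[label, 0] = 1.0
    return y

pixels = np.zeros(784)   # a flat image, shape (784,)
x = to_column(pixels)
y = one_hot(3)
print(x.shape, y.shape)  # (784, 1) (10, 1)
```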

hollow shard Jul 15, 2019, 10:41 PM

#

yup, i did

#

like i said, ill fix it, its just a small thing

viral lark Jul 16, 2019, 1:07 AM

#

Hi, is there someone here who speaks Portuguese? My friend is a Portuguese speaker and needs help with Python for college

lapis sequoia Jul 16, 2019, 8:46 AM

#

is anyone experienced with keras? I'm trying to use my own dataset to do Sequence Classification on, but I'm having some issues

lapis sequoia Jul 16, 2019, 11:31 AM

#

what sort of sequence do you want to classify

#

I'm actually publishing something on github later that solves that

#

in keras

#

@viral lark tell me

#

and dont ping me.. message here

lapis sequoia Jul 16, 2019, 12:30 PM

#

Sequences of time measurements. Float numbers with 3 decimals.

#

Feel free to ping me

granite sierra Jul 16, 2019, 12:48 PM

#

I'm trying to iterate over an excel document, and it returns None for when I try to print cell.value

#

and when I try to get the sheet name, it returns this <bound method Workbook.get_sheet_names of <openpyxl.workbook.workbook.Workbook object at 0x000001E814987DA0>>

tulip estuary Jul 16, 2019, 12:49 PM

#

Do you have a parens at the end of get_sheet_names() ?

#

(It is a method and not an attribute, so you need the parens)
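
The difference can be seen with any Python object. The class below is a hypothetical stand-in, not openpyxl itself: printing the method without parentheses shows the bound method object, while calling it returns the names.

```python
class FakeWorkbook:
    """Hypothetical stand-in for a workbook object, for illustration only."""

    def __init__(self, names):
        self._names = list(names)

    def get_sheet_names(self):
        return list(self._names)

wb = FakeWorkbook(["Sheet1", "Sheet2"])
print(wb.get_sheet_names)    # the bound method object itself, not the names
print(wb.get_sheet_names())  # ['Sheet1', 'Sheet2']
```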

granite sierra Jul 16, 2019, 12:50 PM

#

That worked, oops

#

and what about the previous question

tulip estuary Jul 16, 2019, 12:52 PM

#

The first question is harder :).
Have you tried it on another excel file? Could it be that None is what is returned if the cell is actually empty? What package are you using?

granite sierra Jul 16, 2019, 12:52 PM

#

I'm using openpyxl

#

the cells aren't empty though, there are definitely values in there

#

let me test on a different excel sheet

#

yea even on a different excel sheet, it print None.

#

but when I print a pandas dataframe, it works completely fine, but I don't really need a pandas dataframe because I want to be able to iterate over the columns and assign the value to a variable

tulip estuary Jul 16, 2019, 12:57 PM

#

I have never used openpyxl, I have used another package, but it is eluding me which one. Can you load with pandas and then just iterate on it from pandas? It is hard to debug this type of thing without the actual file, code etc.

#

Alternatively, you could export the excel file as a CSV and load it in and parse.

granite sierra Jul 16, 2019, 12:58 PM

#

what package would you recommend I use for what I want to do?

Basically I am just trying to get values from each column and assign them to a variable

#

but a specific value, so like

Lets say

Height      Width
1            3
2            4
3            5

#

and I am creating a square of those dimensions in a different code

#

so it will create a square of 1x3 first, and then 2x4, and then 3x5

#

do you get me?

tulip estuary Jul 16, 2019, 1:00 PM

#

If pandas loads the file, use pandas. You can do things like df['Height'] * df['Width']...

granite sierra Jul 16, 2019, 1:01 PM

#

ok

#

but won't that do the entire column at once, instead of row by row ?

tulip estuary Jul 16, 2019, 1:01 PM

#

pandas is super powerful at these types of column indexing, iterating etc

granite sierra Jul 16, 2019, 1:01 PM

#

ok

tulip estuary Jul 16, 2019, 1:02 PM

#

Yeah, it will do it over the columns, but that is going to be more efficient if that is what you are trying to replicate in the end.
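
A short sketch of the point, using the Height/Width table from above: a column-wise operation still produces one result per row, and the values can also be walked pair by pair when each row is needed separately. The `Area` column is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({"Height": [1, 2, 3], "Width": [3, 4, 5]})

# Column-wise: one multiplication, but the result has one entry per row.
df["Area"] = df["Height"] * df["Width"]
print(df["Area"].tolist())  # [3, 8, 15]

# Row by row: pair up the two columns and handle each pair in turn.
for height, width in zip(df["Height"], df["Width"]):
    print(f"{height}x{width}")
```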

granite sierra Jul 16, 2019, 1:02 PM

#

ok

#

hmm let me try and I'll be back if I have issues

granite sierra Jul 16, 2019, 1:30 PM

#

I get a key error when I try and index by column

#

oh wait

#

so if i have width in my code, how do I assign the value of width row 1 to that variable? I tried this

#

for i, row in df.iterrows():
    for j, column in row.iteritems():
        width = df['width']
        length = df['length']
        interdistance = df['interdistance']

tulip estuary Jul 16, 2019, 1:34 PM

#

Umm... what do you actually want to calculate from df['width'] and df['length']? You said the square of the first... but didn't quite understand. (I am guessing you don't need loops at all, but don't understand what you are trying to calculate 😃 ).

lapis sequoia Jul 16, 2019, 1:36 PM

#

what is that..

#

what are you trying to do o.o

granite sierra Jul 16, 2019, 1:44 PM

#

well its obviously wrong

#

not trying to calculate anything

#

just assign the values of width, length, interdistance to a variable

#

let me create some dummy code to try and explain what I am doing

viral lark Jul 16, 2019, 1:47 PM

#

Tron, can I send a DM to you?

tulip estuary Jul 16, 2019, 1:48 PM

#

@granite sierra

from pandas import DataFrame

Data = {'width':  [1, 2, 3],
        'height': [2, 4, 6],
       }
df = DataFrame (Data, columns = ['width','height'])


for ii, row in df.iterrows():
    width = row['width']
    height = row['height']

    print('width is {}, height is {}'.format(width, height))

granite sierra Jul 16, 2019, 1:49 PM

#

ok got it

#

you beat me to some dummy code haha

tulip estuary Jul 16, 2019, 1:49 PM

#

😃

granite sierra Jul 16, 2019, 1:49 PM

#

Thanks man, that's what I needed

#

why the double ii though?

tulip estuary Jul 16, 2019, 1:51 PM

#

It is something that was suggested to me many, many, many years ago. If a language uses i to represent an imaginary number then you can get name collisions and cause all sorts of issues. I have just followed that all through my coding life.

#

So, I use ii and jj etc for my looping

granite sierra Jul 16, 2019, 1:51 PM

#

Ahh, got it, thanks for the help man. I'm still new to pretty much all the data related stuff

tulip estuary Jul 16, 2019, 1:51 PM

#

It is fun, keep playing with it!

granite sierra Jul 16, 2019, 1:53 PM

#

when I print it, it comes out really strange

#

📎 Strange.PNG

#

the top is a print of the dataframe

#

and the bot is where I printed the

print(i, width, length, interdistance)

#

did I do something wrong?

tulip estuary Jul 16, 2019, 1:55 PM

#

What is the loop and part between the for and print?

granite sierra Jul 16, 2019, 1:55 PM

#

for i, row in df.iterrows():
    width = row['width']
    length = row['length']
    interdistance = ['interdistance']
    print(i, width, length, interdistance)

#

oops

tulip estuary Jul 16, 2019, 1:56 PM

#

hahaha... look at the interdistance line

granite sierra Jul 16, 2019, 1:56 PM

#

I just realised

#

derp

#

I'm so stupid, I need a coffee to wake up

tulip estuary Jul 16, 2019, 1:56 PM

#

More coffee always helps. Though I don't get why your length is 0...

granite sierra Jul 16, 2019, 1:56 PM

#

I also fixed the length

#

because for some reason I had the variable named as height

tulip estuary Jul 16, 2019, 1:57 PM

#

kk

granite sierra Jul 16, 2019, 1:57 PM

#

because, again, I am an idiot 😄

#

Thanks mate, I appreciate it

granite sierra Jul 16, 2019, 3:54 PM

#

is there an easy way to get the sheet's name

#

i want to be able to use the sheets name in an if statement

#

like

if sheetsname == blah:
    do stuff
elif sheetsname == superblah:
    do some more stuff

tulip estuary Jul 16, 2019, 4:16 PM

#

Personally, I don't know, you would have to look through the docs https://openpyxl.readthedocs.io/en/stable/index.html

granite sierra Jul 16, 2019, 4:23 PM

#

yea I read it, every package gives me different answers in terms of sheet names and also it never returns all the sheet names even though the docs say it would

quartz stream Jul 17, 2019, 6:51 AM

#

@granite sierra Did you try Pandas ?

granite sierra Jul 17, 2019, 8:34 AM

#

@quartz stream Yea I did, I also tried other excel packages and they all give me different results when I print the sheet name, and none of them including pandas return all the sheet names

quartz stream Jul 17, 2019, 10:09 AM

#

xls = pandas.ExcelFile(path)
sheets = xls.sheet_names

#

import xlrd
xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
print(xls.sheet_names())  # <- remember: xlrd sheet_names is a function, not a property

#

@granite sierra

granite sierra Jul 17, 2019, 11:06 AM

#

ok let me try that
@quartz stream Thanks man, sorry, I was at lunch

#

Even with that I don't get all the sheet names, I still only get 1, that's so strange

oblique belfry Jul 17, 2019, 1:01 PM

#

Does anyone know how to create a keras stateful metric?

quartz stream Jul 17, 2019, 2:27 PM

#

Yes that is strange @granite sierra

#

can you share the snippets of excel file and your code

granite sierra Jul 17, 2019, 2:30 PM

#

Sure

#

@quartz stream can I dm you the excel file?

quartz stream Jul 17, 2019, 2:32 PM

#

Yes

granite sierra Jul 17, 2019, 2:32 PM

#

Here is the code I've tried

#

Holy moly

#

I'm tilted

#

nvm, I figured out the issue

#

but now that I figured out the issue

#

question

quartz stream Jul 17, 2019, 2:35 PM

#

@granite sierra

#

📎 unknown.png

#

See

#

it's working fine

granite sierra Jul 17, 2019, 2:35 PM

#

Yea I realised, I did a super derp, I pointed the file path to the wrong excel file

quartz stream Jul 17, 2019, 2:36 PM

#

https://tenor.com/view/shrug-what-huh-will-smith-gif-10978812

granite sierra Jul 17, 2019, 2:36 PM

#

how do I assign the sheet name to a variable for an if statement

like

if sheet_name = blah:
    do stuff

#

It's always when you have to send your code to someone that you realise the silly mistakes...

quartz stream Jul 17, 2019, 2:37 PM

#

if sheet_name == sheets[0]:
    print("It's always when you have to send your code to someone that you realise the silly mistakes")
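
A sketch of branching on sheet names across a whole workbook. It assumes the sheets are held as a dict of name to DataFrame, which is what `pd.read_excel(path, sheet_name=None)` returns for a real file; the in-memory dict, the function name `describe_sheets` and the sheet names are placeholders so the example runs without an Excel file.

```python
import pandas as pd

def describe_sheets(sheets):
    """Return one message per sheet, branching on the sheet name.

    `sheets` maps sheet name -> DataFrame, the same shape of result that
    pd.read_excel(path, sheet_name=None) produces for a real workbook.
    """
    messages = []
    for sheet_name, frame in sheets.items():
        if sheet_name == "blah":
            messages.append(f"{sheet_name}: do stuff with {len(frame)} rows")
        elif sheet_name == "superblah":
            messages.append(f"{sheet_name}: do some more stuff with {len(frame)} rows")
        else:
            messages.append(f"{sheet_name}: skipped")
    return messages

# Hypothetical stand-in for a workbook with three sheets.
sheets = {
    "blah": pd.DataFrame({"width": [1, 2]}),
    "superblah": pd.DataFrame({"width": [3]}),
    "notes": pd.DataFrame(),
}
for message in describe_sheets(sheets):
    print(message)
```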

granite sierra Jul 17, 2019, 2:37 PM

#

oof

simple crag Jul 17, 2019, 2:38 PM

#

Like, for instance, = is for assignment, not comparison

quartz stream Jul 17, 2019, 2:38 PM

#

lol

#

just updated

granite sierra Jul 17, 2019, 2:38 PM

#

yea that I know, was just showing an example of what I wanted

quartz stream Jul 17, 2019, 2:39 PM

#

so

#

now it's done right ?

#

@granite sierra

granite sierra Jul 17, 2019, 2:39 PM

#

@quartz stream Yes, thank you good sir, I am so stupid sometimes lmao

quartz stream Jul 17, 2019, 2:40 PM

#

lol

#

aren't we all?

granite sierra Jul 17, 2019, 2:40 PM

#

how could I point the filepath to the wrong file xD

quartz stream Jul 17, 2019, 2:40 PM

#

hahaha

#

it's alright mate

tulip estuary Jul 17, 2019, 2:40 PM

#

Maybe there should be a channel "here are my dumb mistakes", I could fill it pretty quickly

granite sierra Jul 17, 2019, 2:41 PM

#

all we need is lemon or someone to make that ;D

desert oar Jul 17, 2019, 3:25 PM

#

blogging your mistakes and lessons learned is a net positive for society

#

i need to start doing it myself

#

takes some work but it's worth it

surreal nacelle Jul 17, 2019, 3:40 PM

#

Hey, I decided to finally dive into ml/dl, and I've been learning about linear algebra, I planned on doing so till I understand it perfectly, and then go on to calculus and probabilities. However, there is a long way to go, and I don't want to lose my motivation by exclusively learning maths. I'd like to know if you guys had some resources teaching both ml/dl alongside the maths required. (I honestly don't see myself spending 2 months learning something without knowing why I'm learning it, and how I'll use it.) Thanks 😃

desert oar Jul 17, 2019, 3:41 PM

#

i dont have a good book reference handy for that purpose but i think that's a really good way to learn

#

linear regression is a great teaching tool because you can derive the same result several different ways: calculus, linear algebra, and probability theory
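
To make that concrete, here is a small sketch on synthetic data (the true slope 2 and intercept 1 are made up). The linear algebra route solves the normal equations directly, the calculus route runs gradient descent on the mean squared error, and both land on the same coefficients. The probability route (maximum likelihood with Gaussian noise) minimizes the same squared error, so it is not computed separately.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Linear algebra: solve the normal equations (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Calculus: follow the negative gradient of the mean squared error.
beta_gd = np.zeros(2)
for _ in range(5000):
    gradient = 2.0 / len(y) * X.T @ (X @ beta_gd - y)
    beta_gd -= 0.1 * gradient

print(beta_normal)
print(beta_gd)
```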

surreal nacelle Jul 17, 2019, 3:48 PM

#

I'm not sure to understand what you mean by "linear regression is a great teaching tool" tbh

lunar leaf Jul 17, 2019, 3:52 PM

#

@surreal nacelle this book is fantastic http://neuralnetworksanddeeplearning.com/chap1.html

#

It provides the foundations, theory, mathematical background, and by-hand implementations for classic feed forward neural networks

#

And a bit of diving into convolutional neural networks

#

Great place to get started

surreal nacelle Jul 17, 2019, 3:53 PM

#

It looks really great! Thank you

lunar leaf Jul 17, 2019, 3:54 PM

#

Happy to help

granite sierra Jul 17, 2019, 3:55 PM

#

Also another book is

#

http://shop.oreilly.com/product/0636920052289.do

Hands-On Machine Learning with Scikit-Learn and TensorFlow

surreal nacelle Jul 17, 2019, 4:01 PM

#

I'm looking at the table of contents, and I see that there are some chapters about linear algebra, calculus etc. Does it teach these, or briefly explain what they are? Also, are there any prerequisites for the book? (apart from python/jupyter etc)

granite sierra Jul 17, 2019, 4:06 PM

#

I dont think its as focused on the math, but if someone else can chime in

#

It was just recommended to me by a friend

#

but I haven't had much time to look into it yet

desert oar Jul 17, 2019, 4:08 PM

#

imo if you're learning the math don't waste time with "for hackers" type book

#

get a book that uses the math, just make sure its not more math than you can handle

small shore Jul 17, 2019, 4:27 PM

#

Idk if this is the best place to ask, but if I have links to Images and corresponding data in a file. How can I download those images and save them with the metadata in a pair either in another file or with the image quickly and effectively so image links that don’t respond are removed. Right now I loop through the file and download the image under the same index I save the data to an array, but that is slow. Is there a better way to do this?

surreal nacelle Jul 17, 2019, 4:56 PM

#

Well, I ordered the o'reilly book, and I started reading the book @lunar leaf mentioned, however, I'm not sure what to think about learning deep learning before taking a simple approach to machine learning.
I also found that : https://machinelearningmastery.com/start-here/
It seems to cover pretty much everything, do you guys have any experience with it ?

desert oar Jul 17, 2019, 5:07 PM

#

@small shore split up the file into 5 different files, and run the script in 5 terminal windows 😃

small shore Jul 17, 2019, 5:14 PM

#

Lol

#

15 windows to max my thread count

#

Idk if that’s even right lol

#

I found a dataset with stuff already built ready to download

#

So might just switch to that cause I am lazy

desert oar Jul 17, 2019, 5:20 PM

#

your network will probably saturate before your CPU does
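
Since the work is network-bound, a thread pool inside one script is a common alternative to several terminal windows. This is a sketch under assumptions: records are `(url, metadata)` pairs, failed downloads are simply dropped, and the names `fetch` and `download_all` are invented here. The fetch function is injectable so the logic can be tried without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()

def download_all(records, fetch_fn=fetch, max_workers=16):
    """Download many (url, metadata) records concurrently.

    Returns (metadata, image_bytes) pairs in the original order,
    leaving out every record whose download raised an error.
    """
    def attempt(record):
        url, metadata = record
        try:
            return metadata, fetch_fn(url)
        except Exception:
            # Broad on purpose in this sketch: any failure means "skip this link".
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, records))
    return [result for result in results if result is not None]
```

Because metadata and bytes come back paired, the two can be written out together afterwards without index bookkeeping.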

small shore Jul 17, 2019, 5:20 PM

#

So

quartz monolith Jul 17, 2019, 6:26 PM

#

I have a numeric and categorical data set with a lot of NaNs. My label encoding isn't going to work with that. So how do I deal with the NaNs in the first place, and afterwards how should I approach numeric and categorical columns in a dataframe? I want to use CART / DT

desert oar Jul 17, 2019, 7:06 PM

#

@quartz monolith i answered this a while ago but you might not have seen my answer. you need to know why the data is missing

#

NaN is just a computer representation of "not a number"

#

it could mean you wrote log(-1) or it could mean "missing data"

hollow shard Jul 17, 2019, 7:49 PM

#

btw @lunar leaf the network doesnt really work. Heres the code with the backpropagation algorithm I added, thanks for your time 👍 http://dpaste.com/0Z8C7H0

dense rose Jul 17, 2019, 8:27 PM

#

What all should I need to do to get Plotly figures to display in jupyter lab?

lunar leaf Jul 17, 2019, 8:32 PM

#

@hollow shard I am away on a business trip until tomorrow so I don't have time to give this my full attention at the moment. However, it appears to me like you're performing a weights update once per sample here

#

This may not be a problem, but why don't you try accumulating gradients over many samples before computing a parameter update
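
A sketch of what accumulating gradients over a mini-batch could look like. It assumes a `backprop_fn(x, y)` that returns `(grad_b, grad_w)` as lists of arrays, matching the order in the earlier `b,w = net.backprop(...)` example; `minibatch_update` and the stand-in backprop below are hypothetical, not the code from either paste.

```python
import numpy as np

def minibatch_update(weights, biases, batch, backprop_fn, learning_rate=0.1):
    """Average the gradients over a batch, then apply a single update."""
    sum_b = [np.zeros_like(b) for b in biases]
    sum_w = [np.zeros_like(w) for w in weights]
    for x, y in batch:
        grad_b, grad_w = backprop_fn(x, y)
        sum_b = [s + g for s, g in zip(sum_b, grad_b)]
        sum_w = [s + g for s, g in zip(sum_w, grad_w)]
    scale = learning_rate / len(batch)
    new_weights = [w - scale * s for w, s in zip(weights, sum_w)]
    new_biases = [b - scale * s for b, s in zip(biases, sum_b)]
    return new_weights, new_biases

# Stand-in backprop that always returns gradients of ones, for demonstration.
def constant_backprop(x, y):
    return [np.ones((2, 1))], [np.ones((2, 2))]

weights = [np.zeros((2, 2))]
biases = [np.zeros((2, 1))]
batch = [(None, None)] * 4  # four dummy samples
weights, biases = minibatch_update(weights, biases, batch, constant_backprop)
print(weights[0])
print(biases[0])
```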

hollow shard Jul 17, 2019, 8:35 PM

#

oh ok, I really appreciate the time youre giving me, thanks for the help

exotic cedar Jul 17, 2019, 9:18 PM

#

@surreal nacelle https://drive.google.com/open?id=1h5wbUhHA2pr_821dB0Izy7aDfCz9KTtt

#

oh you already ordered the book

#

well rip

tame cloak Jul 17, 2019, 11:21 PM

#

Hi all, new to the channel but great to see an awesome community here. I'm unsure if this is how it works but I have a question on dataset manipulation/grouping. Let me know if I should redirect to one of the help channels

I'm trying to group together a dataset by year. Each year has a few hundred values, so i did this:

df2 = df.groupby(['Year']).sum()

the new "df2" dataframe no longer has a year column, but year is now an index (I believe is the correct terminology), which it doesn't look like I can use for visualizations. I'm trying to make a new year variable with a loop:

for i in range(1950,2018): df2['Year'] = i

which makes the new 'Year' column, but every value ends up being 2017. Can someone assist with what I'm doing wrong here?

desert oar Jul 18, 2019, 12:01 AM

#

@tame cloak that's correct, Year is now the index of the new dataframe, and the rest of the columns will correspond to columns in the original data.

you will need to clarify what kind of visualization exactly you're trying to produce

tame cloak Jul 18, 2019, 12:22 AM

#

@desert oar -- something like a simple line graph where I'd have time on the x axis, and another variable like "Points" on the y axis

desert oar Jul 18, 2019, 12:22 AM

#

@tame cloak keeping year as index should be fine

#

series actually have a plot method

#

so df2['my_column'].plot(); plt.show() or something like that

tame cloak Jul 18, 2019, 12:24 AM

#

very helpful, thanks!

#

@desert oar if possible, do you know why my for loop was yielding "2017" for all values, and how I could in theory get a column of Years that mimic the key?

desert oar Jul 18, 2019, 12:26 AM

#

@tame cloak you'll need to give a sample of your data

#

try df['Year'].value_counts() so you can see what the actual distribution is

#

maybe you only have 2017 data..?

tame cloak Jul 18, 2019, 12:30 AM

#

df2 = df.groupby(['Year']).mean()
for i in range(1950,2018):
    df2['Year'] = i
df2[['Age', 'Year']]

              Age  Year
Year
1950.0  26.131410  2017
1951.0  26.344828  2017
1952.0  26.130769  2017
1953.0  26.018868  2017
1954.0  25.769231  2017
1955.0  25.953704  2017
1956.0  25.813725  2017
1957.0  26.018868  2017

#

@desert oar see above, let me know if that makes sense

#

it goes from 1950 to 2017

desert oar Jul 18, 2019, 12:31 AM

#

oh

#

that for i in range part

#

not only unnecessary but very wrong

#

just delete it

#

and use .reset_index() if you want the Year column back as part of the dataframe and not as the index

#

but for plotting it will do the right thing as the index

#

df2 = df.groupby(['Year']).mean()
df2['Age'].plot()
plt.show()

#

if you want Year back, do

df2 = df.groupby(['Year']).mean()
df2 = df2.reset_index()

tame cloak Jul 18, 2019, 12:32 AM

#

Perfect! Thank you so much.

#

That worked.

desert oar Jul 18, 2019, 12:33 AM

#

for i in range(1950,2018):
    df2['Year'] = i

see if you can figure out why this code is wrong

#

hint: i is a "scalar" and df2['Year'] is a "vector"

tame cloak Jul 18, 2019, 12:34 AM

#

Some mixing of types that can't occur. Like trying to plug in a constant for a variable

desert oar Jul 18, 2019, 12:34 AM

#

sorta

#

since i is a scalar, it's broadcasting it

#

it's basically copying i for every element in df2['Year'], which is a Series
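
The effect is easy to reproduce on a tiny frame (the ages here are made up). Each pass through the loop overwrites the whole column with a single number, so only the last value survives; `reset_index()` recovers the real years from the index instead.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"Age": [26.1, 26.3, 26.1]},
    index=pd.Index([1950, 1951, 1952], name="Year"),
)

# Each iteration assigns one scalar to every row of the column.
for i in range(1950, 1953):
    df2["Year"] = i
print(df2["Year"].tolist())    # [1952, 1952, 1952]

# Drop the broken column and turn the index back into a real column.
fixed = df2.drop(columns="Year").reset_index()
print(fixed["Year"].tolist())  # [1950, 1951, 1952]
```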

tame cloak Jul 18, 2019, 12:35 AM

#

I see, that makes more sense

#

appreciate it @desert oar

surreal nacelle Jul 18, 2019, 4:45 AM

#

@exotic cedar Holy shit, this drive is amazing, thanks

exotic cedar Jul 18, 2019, 4:49 AM

#

np

surreal nacelle Jul 18, 2019, 4:58 AM

#

@exotic cedar Do you mind me sharing that to some friends ?

exotic cedar Jul 18, 2019, 4:59 AM

#

ya sure idc

#

was made to help others anyway

surreal nacelle Jul 18, 2019, 5:00 AM

#

👍

granite sierra Jul 18, 2019, 8:14 AM

#

Damn that drive is incredible @exotic cedar

surreal nacelle Jul 18, 2019, 9:07 AM

#

Hey I'm trying to learn about cross validation using cross_val_score() and kfold, however, I can't figure out what the return values of cross_val_score() are.
It's a list of float, I can see that, but that's all the documentation says about it.
I'm guessing it contains the mean etc etc, but I don't know which index correspond to what.

supple ferry Jul 18, 2019, 9:29 AM

#

@surreal nacelle , let's say you have 10-fold KFold. cross_val_score trains the model on 9 folds, scores it on the remaining fold, and repeats so that every fold is held out once. It returns one score per fold, using the estimator's default scorer (accuracy for classifiers) unless you pass scoring=. If you want the held-out predictions instead, use cross_val_predict; for probabilities rather than classes, pass method='predict_proba'

surreal nacelle Jul 18, 2019, 9:32 AM

#

Oh, my bad, it totally makes sense that the default return values are the 10 per-fold scores. I remembered seeing a piece of code where there was a line which called the mean/std from the results, hence the confusion

#

Thanks for the explanation

#

The mean was calculated using a numpy method. My bad
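
For reference, `cross_val_score` returns an array with one score per fold (there is no built-in mean or std entry), while `cross_val_predict` is the function that returns held-out predictions. A small sketch on synthetic data; the dataset, the model and the fold settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold; summarise them yourself with mean/std.
scores = cross_val_score(model, X, y, cv=cv)
print(scores.shape)  # (10,)
print(scores.mean(), scores.std())

# Held-out class probabilities for every sample.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
print(proba.shape)   # (200, 2)
```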

south quest Jul 18, 2019, 10:17 AM

#

Hey all, I'm doing some prediction work with keras using a sequential model and a single binary output. I've got 800,000 records to classify and I've trained on all of them. When I use model.evaluate I get a loss value of 0.29 and accuracy 0.91, but when I try to predict other data with model.predict I get results that just don't work

#

it predicts between 0.0 and 0.1 but no higher than that

supple ferry Jul 18, 2019, 10:17 AM

#

is it a classification task ?

#

ah i see it is

#

what are the distribution of the classes ?

lapis sequoia Jul 18, 2019, 10:17 AM

#

what does one row look like

south quest Jul 18, 2019, 10:18 AM

#

Let me send one row give me a second

supple ferry Jul 18, 2019, 10:18 AM

#

what i see is overfitting problem

lapis sequoia Jul 18, 2019, 10:18 AM

#

yeah, maybe the classes are unbalanced

supple ferry Jul 18, 2019, 10:18 AM

#

if classes are unbalanced, this will be a problem for you. you can solve it in different ways

#

you can either make sure training and test sets have the same share of classes

lapis sequoia Jul 18, 2019, 10:19 AM

#

let's see the data

supple ferry Jul 18, 2019, 10:19 AM

#

or you can artificially generate classes that are way below in count

south quest Jul 18, 2019, 10:19 AM

#

yeah I'm just trying to find it

lapis sequoia Jul 18, 2019, 10:19 AM

#

like a row of it to get an idea

supple ferry Jul 18, 2019, 10:19 AM

#

df.class_column.hist()

#

it will do the thing

#

hist is method not an attribute, my bad

south quest Jul 18, 2019, 10:20 AM

#

📎 unknown.png

#

in the code that text column is converted into 3 binary columns

supple ferry Jul 18, 2019, 10:21 AM

#

what is your y column

south quest Jul 18, 2019, 10:21 AM

#

and I am trying to predict the last column

lapis sequoia Jul 18, 2019, 10:21 AM

#

what are you trying to predict.. is the target variable here

supple ferry Jul 18, 2019, 10:21 AM

#

class column

south quest Jul 18, 2019, 10:21 AM

#

opened

supple ferry Jul 18, 2019, 10:21 AM

#

df.opened.hist()

lapis sequoia Jul 18, 2019, 10:21 AM

#

what is sic number

supple ferry Jul 18, 2019, 10:21 AM

#

run it and show the picture

south quest Jul 18, 2019, 10:22 AM

#

one second, need to put that onto the prod one

#

sic number is a value between 0 and 99

lapis sequoia Jul 18, 2019, 10:23 AM

#

well everything else is 1s and 0s.. you can encode your sic number and job roles too

south quest Jul 18, 2019, 10:23 AM

#

0 is not a string according to matplotlib

lapis sequoia Jul 18, 2019, 10:24 AM

#

need to typecast those before you can plot

quartz monolith Jul 18, 2019, 10:24 AM

#

@desert oar my NaN in my df is "not a number" because my label encoder gives me an error.

south quest Jul 18, 2019, 10:26 AM

#

okay i can't get matplotlib to work but the type of the opened will have tons more 0s than 1s

#

I know that

#

So is the issue that it is training too much for 0s?

#

(not sure about terminology or anything really in this area of python)

lapis sequoia Jul 18, 2019, 10:27 AM

#

yes.. your data is unbalanced, you can use things like kfold cv for training
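
One common way to counter this kind of imbalance is to weight the rare class more heavily during training. The sketch below computes "balanced" weights with NumPy on a made-up 90/10 label split; Keras's `model.fit` accepts such a dict through its `class_weight` argument. The function name is invented here, and whether this fixes the model above is not something the thread confirms. Note also that with roughly nine in ten labels being 0, always predicting 0 already scores about 0.9 accuracy.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical labels: 90 zeros and 10 ones.
y = np.array([0] * 90 + [1] * 10)
class_weight = balanced_class_weights(y)
print(class_weight)
```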

south quest Jul 18, 2019, 10:28 AM

#

ah okay

#

cheers, will look into it

lapis sequoia Jul 18, 2019, 10:29 AM

#

from collections import Counter
import pandas

class_counts = Counter(list(df['opened']))
df_new = pandas.DataFrame.from_dict(class_counts, orient='index')
df_new.plot(kind='bar')

#

might want to also look at methods for feature selection.. to balance classes.. Essentially finding the most variability to represent your classes

south quest Jul 18, 2019, 10:30 AM

#

right, makes sense

lapis sequoia Jul 18, 2019, 10:30 AM

#

https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

Visualising high-dimensional datasets using PCA and t-SNE in Python

#

this example is not directly related.. but it's the general idea.. a good way to picking just enough data from each class

quartz monolith Jul 18, 2019, 10:36 AM

#

So PCA and t-SNE reduce the number of dimensions in a dataset whilst retaining most of the information, and the reduced data is then used for the model?

lapis sequoia Jul 18, 2019, 11:19 AM

#

do you understand the curse of dimensionality

vital bison Jul 18, 2019, 11:47 AM

#

hi can anyone help me open a pickle file ? not sure how to do that?

tulip estuary Jul 18, 2019, 11:48 AM

#

It should be:

import pickle

with open('filename', 'rb') as f:
    x = pickle.load(f)

desert oar Jul 18, 2019, 11:49 AM

#

@quartz monolith OK, so you understand why i am asking? Now you need to understand why you are getting an error.

quartz monolith Jul 18, 2019, 11:49 AM

#

@desert oar CatBoostError: Invalid type for cat_feature[1,0]=nan : cat_features must be integer or string, real number values and NaN values should be converted to string.

desert oar Jul 18, 2019, 11:50 AM

#

It looks like the missing value is already there

#

Do you know why that value was missing?

#

That's what you need to answer

quartz monolith Jul 18, 2019, 11:50 AM

#

Yes

desert oar Jul 18, 2019, 11:50 AM

#

Why is it missing

quartz monolith Jul 18, 2019, 11:50 AM

#

The employees were lazy and didn't fill everything out

#

in the knowledge data base

desert oar Jul 18, 2019, 11:51 AM

#

OK, that is important information

#

So this is truly missing data

#

Fortunately a tree model can handle it usually

#

What data type is this feature? Text?

quartz monolith Jul 18, 2019, 11:52 AM

#

I need to analyze the knowledge data base with decision tree and to optimize for my thesis

desert oar Jul 18, 2019, 11:52 AM

#

What is the data type of the feature

#

This also might be a good question for your thesis advisor

quartz monolith Jul 18, 2019, 11:53 AM

#

only objects

desert oar Jul 18, 2019, 11:53 AM

#

OK, if the data is categorical then "missing" becomes another category

#

How many missing values are there
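
A sketch of both steps on a made-up frame: counting the missing entries per column, then turning "missing" into its own category so an encoder sees only strings. The column names echo the ones in this thread but the values are invented, and the error numbers are kept as text because they act as labels rather than quantities.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Controllertype": ["A", np.nan, "B", "A"],
    "FehlernrContr_1": ["101", np.nan, np.nan, "205"],
})

# How many values are missing in each column?
missing_counts = df.isna().sum()
print(missing_counts)

# Treat "missing" as just another category.
filled = df.fillna("missing").astype(str)
print(filled)
```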

quartz monolith Jul 18, 2019, 11:53 AM

#

give me a sec

tulip estuary Jul 18, 2019, 11:54 AM

#

@vital bison A longer example of write and read:

import pickle
  
interesting_data = {
  'a': 'this is a',
  'b': 'this is b'
}

# Save
with open('filename.pck', 'wb') as f:
    pickle.dump(interesting_data, f)

# Load
with open('filename.pck', 'rb') as f:
    x = pickle.load(f)

print(x)
print(x['a'])

quartz monolith Jul 18, 2019, 11:54 AM

#

Controllertype 6256
Controllerstate 31147
FehlernrContr_1 128525
FehlernrContr_2 128559
MotionState 32843
FehlernrDrive_1 133315
FehlernrDrive_2 133393
SWversion 12970
RootCauseComponent 0

#

The label is not missing

#

what makes it so hard is dealing with a lot of nans
when i convert my nans to strings and run my decision tree on it i only get 51% accuracy and 26% balanced accuracy

desert oar Jul 18, 2019, 11:56 AM

#

@quartz monolith "nan" is how pandas and numpy represent "missing"

#

So yes, it's missing data

#

You need to figure out a way to handle it

#

The way to handle it depends on how many values are missing

#

It also depends on the specific attributes of your task

quartz monolith Jul 18, 2019, 11:57 AM

#

The feature is categorical and numeric (errornumber and types of machine)

desert oar Jul 18, 2019, 11:58 AM

#

Which one is numeric?

quartz monolith Jul 18, 2019, 11:58 AM

#

Is the best solution not to create a new knowledge data base where this kind of matter doesnt occur

desert oar Jul 18, 2019, 11:59 AM

#

You need to do a combination of feature engineering and possibly excluding values

#

Error number is effectively categorical

#

I don't think you have any numerical features

quartz monolith Jul 18, 2019, 12:00 PM

#

FehlernrContr 1+2
FehlernrDrive 1+2
those are the most important ones
these are the error codes that come from the machine and need to be connected with the label, and the label needs to be connected with the text

#

I can send you a screenshot in pm

desert oar Jul 18, 2019, 12:02 PM

#

Thats ok

#

I think you might want to talk to your thesis advisor

#

Discuss your results, discuss the business problem you're trying to solve here, and then figure out a missing data solution

quartz monolith Jul 18, 2019, 12:06 PM

#

You're right! Imo the best solution is that employees start to fill everything out. There are way too many values missing and the values that are missing are really important for the label. Thank you

desert oar Jul 18, 2019, 12:14 PM

#

Yeah unfortunately missing data is a big problem

#

Good luck, I will be curious to know what you decide to do

quartz monolith Jul 18, 2019, 12:14 PM

#

I will let you know! Maybe the employees will start to fill the missing values out 😄 haha just kidding

hollow quartz Jul 18, 2019, 12:32 PM

#

maturity_udf = udf(lambda cons: "pic" if cons == m else "normal" for m in all_max , StringType()). Why doesn't it run? I'm using pyspark

desert oar Jul 18, 2019, 12:37 PM

#

@hollow quartz what error message are you getting

#

oh

#

"pic" if cons == m else "normal" for m in all_max

#

this isn't valid python syntax

#

can you explain what youre trying to do in words

hollow quartz Jul 18, 2019, 12:42 PM

#

I have a consumption by day. I want to determine the peak consumption @desert oar

#

that's the error SyntaxError: Generator expression must be parenthesized

tulip estuary Jul 18, 2019, 12:48 PM

#

Might just need brackets as it is a list comprehension?

all_max = [1, 2, 3, 2, 1, 2, 3]
cons = 3
tt = ["pic" if cons == m else "normal" for m in all_max]
print(tt)

#

With brackets that code should be fine, I think....

hollow quartz Jul 18, 2019, 12:50 PM

#

If you do that, you will get an array

#

I want a value

desert oar Jul 18, 2019, 12:51 PM

#

you need to explain to us how you want to handle this

#

because for m in all_max means you're iterating

hollow quartz Jul 18, 2019, 12:54 PM

#

all_max contains the max of all day that i have

tulip estuary Jul 18, 2019, 12:54 PM

#

is all_max a list?

hollow quartz Jul 18, 2019, 12:55 PM

#

yes

tulip estuary Jul 18, 2019, 12:55 PM

#

So, given the list you want to do something to it and get a single value, right?

surreal nacelle Jul 18, 2019, 1:01 PM

#

@exotic cedar I'm reading the hands on book from your drive (the one I ordered, even tho I ordered an older version) and it's amazing so far. Thanks again

granite sierra Jul 18, 2019, 1:02 PM

#

@surreal nacelle The hands on machine learning book? Yea I thought it was good

hollow quartz Jul 18, 2019, 1:02 PM

#

@tulip estuary yes

surreal nacelle Jul 18, 2019, 1:02 PM

#

I think you're the one who recommended it in the first place

#

so thank you too 😄

tulip estuary Jul 18, 2019, 1:04 PM

#

@hollow quartz Can you give a small example of the values of all_max and then what the expected output is (the single value)?

hollow quartz Jul 18, 2019, 1:05 PM

#

i found the solution

#


def pic(value):
    if value in all_max:
        return 'pic'
    else:
        return 'normal'

#

pic_udf = udf(pic, StringType())
cons_region_df = cons_region_df.withColumn("Max_cons", pic_udf("consommation"))

tulip estuary Jul 18, 2019, 1:07 PM

#

Ah, cool. k. You can use a lambda function if you want:

all_max = [1, 2, 3, 2, 1, 2, 3]
cons = 3

tt = lambda x: 'pic' if x in all_max else 'normal'

output = tt(cons)
print(output)

hollow quartz Jul 18, 2019, 1:08 PM

#

thanks @tulip estuary

tulip estuary Jul 18, 2019, 1:08 PM

#

👍

lapis sequoia Jul 18, 2019, 1:51 PM

#

I would suggest you not use pickle

exotic cedar Jul 18, 2019, 2:57 PM

#

@granite sierra yep thanks, took a long time to make lol

#

@surreal nacelle ya np, this publisher is super nice, and i try to keep it updated

surreal nacelle Jul 18, 2019, 2:59 PM

#

I was pleasantly surprised to realize that you had the early version of the next release

granite sierra Jul 18, 2019, 3:27 PM

#

I'm trying to iterate over sheets in an excel file, and within those sheets, iterate the rows

#

how would I go about doing this?

#

I have this for iterating rows

#

for i, row in df.iterrows():

#

but I cant seem to link it ontop of the sheet iteration

supple ferry Jul 18, 2019, 3:31 PM

#

@granite sierra what do you want to achieve? Why are you using a loop?

#

Normally you can vectorize your operations

granite sierra Jul 18, 2019, 3:32 PM

#

I'm basically going row by row and assigning the value to a variable

#

so for example

#

width         height         interdistance
10            5                20
20            10                20
30            15                25
40            20                10

#

well I just cant seem to figure out how I would navigate multiple sheets, and then assign values to a variable from each row, and it does have to be 1 row at a time because I do stuff with the variables

#

it's not a huge amount of data, hence the iteration not being an issue, its literally only like 20 rows of excel data max

copper umbra Jul 18, 2019, 3:51 PM

#

Dumb Question of the day: I am pretty new to python. and work mostly in pandas, matplotlib, numpy etc. Just trying to learn the skill to be more effective in my job as a data analyst
I am trying to figure out what i need to develop deliverable offline reports (either in word, excel or pdfs, or something else that can be emailed to non-tech-friendly clients). I can print a pd.dataframe to excel obviously, but what if i want something more complex, like an excel workbook with a title, subtitle, then 3 charts on 1 tab spaced appropriately for the chart size, a listing on another tab, and more charts on a third tab? What's the go-to library for managing stuff like that?

south quest Jul 18, 2019, 4:20 PM

#

I've been playing with my model all day to try and improve accuracy, it's just a binary classification so I have just been changing the inputs to a final Dense with a sigmoid activation with 1 neuron

#

I can't get the accuracy above 53% and I'm not sure what to do

#

I've tried changing activation functions, changing to so many neuron configurations

#

Here is my code: https://paste.seph.club/paste/76533a889d.py

earnest prawn Jul 18, 2019, 4:24 PM

#

my best guess would be that your model just doesnt have enough parameters to accurately capture whats going on, and you could try increasing them

void anvil Jul 18, 2019, 4:25 PM

#

mmm code examples:

    return None```

earnest prawn Jul 18, 2019, 4:28 PM

#

@south quest

lunar leaf Jul 18, 2019, 4:32 PM

#

@south quest you may consider increasing the model size like @earnest prawn suggested. Here are a few other things to think about as well:

Is your input data normalized? If so, how?
How many training samples do you have available?
Is this reported accuracy the test accuracy or the training accuracy? A good first step is to ensure your model can overfit the training set before you do anything else.
Did you validate the labels for your data? Perhaps some labels are getting mixed up somewhere.

south quest Jul 18, 2019, 4:36 PM

#

Cheers everyone, I've just left the office but these are great pointers, will look into them tomorrow 👍

supple ferry Jul 18, 2019, 5:01 PM

#

@granite sierra , how do you calculate the interdistance?
you can create a function which takes a row as input and generates your variable:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"a":[1, 2, 3, 4], "b": [20, 30, 40, 50]})

In [3]: df
Out[3]:
   a   b
0  1  20
1  2  30
2  3  40
3  4  50

In [4]: def interdistance(row):
   ...:     return row["a"] * row["b"]
   ...:

In [5]: df = df.assign(interdistance = interdistance)

In [6]: df
Out[6]:
   a   b  interdistance
0  1  20             20
1  2  30             60
2  3  40            120
3  4  50            200

Usually, iterating over rows is slow in comparison to vectorized version

granite sierra Jul 18, 2019, 5:02 PM

#

interdistance isn't a calculation, it's also being assigned to a variable

#

@supple ferry

supple ferry Jul 18, 2019, 5:02 PM

#

@copper umbra , try this
https://openpyxl.readthedocs.io/en/stable/

#

@granite sierra , i dont know your use case exactly, thats why i made a general one

#

which variable do you assign them all to?

granite sierra Jul 18, 2019, 5:03 PM

#

well width, length and interdistance

supple ferry Jul 18, 2019, 5:03 PM

#

If you tell me more details I can be more useful 😃

granite sierra Jul 18, 2019, 5:03 PM

#

I create stuff from the cell values

#

so complete hypothetical

#

lets say I am creating a box in a numpy array with the width and length, and then multiple boxes with the inter distance

#

but its all good haha, I figured it out 😄

#

I basically just assigned the sheet names to a tuple, and then iterated over the tuple
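
For reference, a minimal sketch of the sheet-then-row loop discussed here. It assumes the workbook is loaded with `pd.read_excel(path, sheet_name=None)`, which returns a dict mapping each sheet name to a DataFrame. The file name `boxes.xlsx`, the sheet names, and the helper `iter_sheet_rows` are hypothetical, and the dict below is an in-memory stand-in so the snippet runs without a file.

```python
import pandas as pd


def iter_sheet_rows(sheets):
    """Yield (sheet_name, row_index, row) for every row of every sheet.

    `sheets` is a dict of sheet name -> DataFrame, the shape returned by
    pd.read_excel(path, sheet_name=None).
    """
    for sheet_name, df in sheets.items():
        for i, row in df.iterrows():
            yield sheet_name, i, row


# Stand-in for: sheets = pd.read_excel("boxes.xlsx", sheet_name=None)
# ("boxes.xlsx" is a hypothetical file name.)
sheets = {
    "Sheet1": pd.DataFrame({"width": [10, 20], "height": [5, 10], "interdistance": [20, 20]}),
    "Sheet2": pd.DataFrame({"width": [30, 40], "height": [15, 20], "interdistance": [25, 10]}),
}

for sheet_name, i, row in iter_sheet_rows(sheets):
    # One row at a time, as in the original question.
    width, height, interdistance = row["width"], row["height"], row["interdistance"]
    print(sheet_name, i, width, height, interdistance)
```

With only about 20 rows, `iterrows()` is fine here; the vectorization advice matters mostly for large frames.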

supple ferry Jul 18, 2019, 5:06 PM

#

@south quest , if your classes are unbalanced you can artificially generate samples for the minority class.
SMOTE can help with that. imbalanced-learn has SMOTE in python
https://github.com/scikit-learn-contrib/imbalanced-learn

GitHub

scikit-learn-contrib/imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning - scikit-learn-contrib/imbalanced-learn

south quest Jul 18, 2019, 5:47 PM

#

Thank you

void anvil Jul 18, 2019, 7:02 PM

#

Anyone use the arch package? I'm trying to programmatically access the fit coefficients that are dumped with model.summary()

#

I don't see anything with dir() or in the documentation

#

https://arch.readthedocs.io/en/latest/univariate/forecasting.html

supple ferry Jul 18, 2019, 7:10 PM

#

As far as I know, they are using the summary class of Statsmodels. so you can access them via params:

model_1_result = sm.Logit(y, X).fit()
model_1_result.params

void anvil Jul 18, 2019, 7:10 PM

#

perfect

#

thanks

supple ferry Jul 18, 2019, 7:11 PM

#

rule of thumb: a lot of python stats packages reuse the Summary class from statsmodels

void anvil Jul 18, 2019, 7:11 PM

#

Their forecast is only giving out volatility forecasts, not underlying trend. Do I have to self-calc those or is there a different call rather than forecast()

#

makes sense

supple ferry Jul 18, 2019, 7:11 PM

#

I havent used that package myself 😃

#

I assumed their structure and then checked, it was statsmodels in the backend

#

let me know if it works

void anvil Jul 18, 2019, 7:12 PM

#

yeah

#

params works

#

but their forecast is only volatility unfortunately

supple ferry Jul 18, 2019, 7:14 PM

#

unfortunately, i have no exp with that one

void anvil Jul 18, 2019, 7:15 PM

#

what do you use for arch / garch /arima?

supple ferry Jul 18, 2019, 7:18 PM

#

for arima statsmodels

#

arch and garch never used

#

statsmodels api is a bit weird, but the logic follows R

#

so if you speak R then it is easy

void anvil Jul 18, 2019, 7:23 PM

#

ok

#

thanks

void anvil Jul 18, 2019, 7:40 PM

#

some of the time series analysis is really frustrating compared to R

void anvil Jul 18, 2019, 7:58 PM

#

Any way to silence all the prints?

#

I'd really prefer not to print out 20m lines multiple times

supple ferry Jul 18, 2019, 8:36 PM

#

there should be verbose

#

for verbosity

#

i dont recall any global setting for that

#

when you fit the model

void anvil Jul 18, 2019, 8:58 PM

#

there's no verbose

#

I'm using:

```
import os
import sys

# Disable
def blockPrint():
    sys.stdout = open(os.devnull, 'w')

# Restore
def enablePrint():
    sys.stdout = sys.__stdout__
```
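
A possible alternative to the blockPrint/enablePrint pair, sketched with the standard library's `contextlib.redirect_stdout`. The function `noisy_step` is a made-up stand-in for whatever call is printing. Note this only captures output written through Python's `sys.stdout`; text printed directly by compiled extensions may still show up.

```python
import contextlib
import os


def noisy_step(i):
    # Stand-in for a library call that prints on every invocation.
    print(f"step {i}")
    return i * 2


# Everything printed inside the block goes to os.devnull. stdout is restored
# and the file is closed on exit, even if an exception is raised.
with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
    results = [noisy_step(i) for i in range(3)]

print(results)  # [0, 2, 4]
```

The context manager avoids forgetting to call the restore function and does not leak the open file handle.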

desert oar Jul 18, 2019, 9:41 PM

#

i dont think python has a good time series ecosystem yet

#

i just use RPy2 tbh

#

or i just open R

#

@void anvil not sure the context for this, but you should use the logging module instead. that gives you fine control over what is printed where

void anvil Jul 19, 2019, 3:10 AM

#

you have any examples?

#

Because "standard" is fine for most things because verbosity is usually an option, I just ran into the case today where I'd be dropping 20m+ print commands

#

hopefully this'll be a one-off type deal

desert oar Jul 19, 2019, 3:41 AM

#

import logging
import math

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

def gradient_descent(x_init, loss_func, loss_gradient_func, step_size, n_iter, tol):
    loss= math.inf
    x = x_init
    for i in range(n_iter):
        prev_x = x
        x -= step_size * loss_gradient_func(prev_x)
        prev_loss = loss
        loss = loss_func(x)
        loss_change = prev_loss - loss
        logger.debug('Iteration %s, loss %f, change %f, value %f', i, loss, loss_change, x)

        if loss_change < tol:
            logger.debug('Convergence reached, done.')
            break

        if loss_change < 0:
            logger.debug('Loss increased, stopping.')
            x = prev_x
            break

    return x, loss

#

if you set the log level to anything above DEBUG, all those log messages will never be printed

#

err thats buggy anyway

#

good thing i have all the log messages 😛
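
On the "buggy" remark: one likely culprit (a reading of the code, not something confirmed in the chat) is the order of the two checks. A negative `loss_change` also satisfies `loss_change < tol`, so an increase in loss gets reported as convergence. A sketch with the checks swapped, and with the level raised to INFO so the per-iteration DEBUG lines are dropped:

```python
import logging
import math

logging.basicConfig(level=logging.INFO)  # DEBUG messages are now filtered out
logger = logging.getLogger(__name__)


def gradient_descent(x_init, loss_func, loss_gradient_func, step_size, n_iter, tol):
    x = x_init
    loss = math.inf
    for i in range(n_iter):
        prev_x, prev_loss = x, loss
        x = prev_x - step_size * loss_gradient_func(prev_x)
        loss = loss_func(x)
        loss_change = prev_loss - loss
        logger.debug("Iteration %s, loss %f, change %f, value %f", i, loss, loss_change, x)

        # Check for an increase first; otherwise the tolerance test below
        # would catch it and misreport it as convergence.
        if loss_change < 0:
            logger.info("Loss increased at iteration %s, reverting.", i)
            x, loss = prev_x, prev_loss
            break
        if loss_change < tol:
            logger.info("Converged after %s iterations.", i + 1)
            break
    return x, loss


# Minimise (x - 3)^2; its gradient is 2 * (x - 3).
x_min, final_loss = gradient_descent(
    x_init=0.0,
    loss_func=lambda x: (x - 3) ** 2,
    loss_gradient_func=lambda x: 2 * (x - 3),
    step_size=0.1,
    n_iter=1000,
    tol=1e-12,
)
```

Switching `level=logging.INFO` back to `logging.DEBUG` brings the per-iteration lines back without touching the function.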

hallow wave Jul 19, 2019, 8:26 AM

#

When going into an entry level job as a data analyst, what should I know in regards to pandas and matplotlib etc.

#

By entry level I mean an internship.

solemn topaz Jul 19, 2019, 9:49 AM

#

Any pandas experts could help me with expanding deeply nested json data to a single large dataframe?

odd terrace Jul 19, 2019, 10:37 AM

#

Sorry still my pending question which I haven't resolved yet
I have :

```
print(self.L.T.shape)
print(self.M.T.shape)
```
```
(8, 3)
(8, 9082318)
```

`self.N = np.linalg.lstsq(self.L.T, self.M.T, rcond=None)[0].T` which is working fine and returns `(9082318, 3)`

But

I want to perform a kind of sort on M and compute the solution only on the best 8 - n values for example.
Any pointer on how to do that would be extremely appreciated.
Thank you.

surreal nacelle Jul 19, 2019, 3:22 PM

#

Hey, I'd like to calculate the mean of a feature, but only over the rows that share the same value of another feature.
Is there a function to do that, or do I have to gather all these values into an array and calculate the mean myself? I'm guessing there is a function tho 😄

#

It's the kaggle titanic dataset, and there is a NaN for a Fare in the test_set

```
     PassengerId  Pclass  Sex   Age  SibSp  Parch  Fare  Embarked
152         1044       3    0  60.5      0      0   NaN         1
```
I know that it's overkill, but I want to calculate the mean of Fare for all the passenger in the 3rd class

cunning osprey Jul 19, 2019, 4:17 PM

#

Hey, does anyone know how to code a legend in a scatterplot by color?

#

Im working with the iris data, plotted two variables, colored them by species, now I want the legend to indicate species by the colors, and I'm kind of stumped

hallow wave Jul 19, 2019, 4:18 PM

#

What do you mean?

cunning osprey Jul 19, 2019, 4:18 PM

#

Sorry if formatting is bad

#

plt.scatter(Iris[' Petal Length'], Iris['Sepal Length'], c = Iris['Labels'])

hallow wave Jul 19, 2019, 4:19 PM

#

So you just want a color chart on the side?

cunning osprey Jul 19, 2019, 4:19 PM

#

Yeah

hallow wave Jul 19, 2019, 4:19 PM

#

Give me a sec

#

cbar = plt.colorbar()
cbar.set_label('Like/Dislike Ratio')

#

Edit as you like: cmap sets the color theme, plt.colorbar() creates the colorbar, and cbar.set_label() labels it

supple ferry Jul 19, 2019, 4:21 PM

#

@surreal nacelle you can group your data

df.groupby("Pclass")["Fare"].mean()
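
A small sketch of that idea applied to the missing Fare, assuming the Titanic column names `Pclass` and `Fare` shown above. The numbers are made up so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic test set (values are made up).
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 8.0, 10.0, np.nan],
})

# Mean fare per class; NaN values are skipped by default.
class_means = df.groupby("Pclass")["Fare"].mean()
print(class_means.loc[3])  # 9.0

# Fill each missing Fare with the mean of its own class.
# transform("mean") returns a Series aligned with df's rows.
df["Fare"] = df["Fare"].fillna(df.groupby("Pclass")["Fare"].transform("mean"))
```

Using `transform` keeps the result aligned row by row, so the same line works when several classes have missing fares.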

hallow wave Jul 19, 2019, 4:21 PM

#

https://gyazo.com/344016c3f0d0f584fff3a60c50dcdd9c


supple ferry Jul 19, 2019, 4:22 PM

#

Apologies for errors I am on mobile @surreal nacelle

surreal nacelle Jul 19, 2019, 4:22 PM

#

Thanks you @supple ferry 😃

cunning osprey Jul 19, 2019, 4:24 PM

#

@hallow wave Sorry, I explained it badly. I just need a legend stating which color belongs to which species

#

That colorbar is useful though, Ill use that in the future

hallow wave Jul 19, 2019, 4:26 PM

#

cmap = theme

#

I'm not sure you can use legends in scatter

cunning osprey Jul 19, 2019, 4:28 PM

#

Eh, its alright then, Ill just have to figure this out another time. Not the most important thing to need

hallow wave Jul 19, 2019, 4:29 PM

#

Ye, from what I see you cant.

#

You use color bars

#

Ok you can

#

Let me elaborate:

#


```
plt.scatter(view_count, likes, c=ratio, alpha=0.8, edgecolors='black', linewidths=1, cmap='summer', labels=labels)
```

#

If there only one type of specie then you use label but as there are more you use labels=<variable>, in the labels list list all the species that you want in order

cunning osprey Jul 19, 2019, 4:36 PM

#

Hmm, yeah, it displays all my entries on a legend

#

The problem is, there are only 3 species, coded as 0,1,2 in the csv. So putting the entire column in as label just returns a giant legend with the entire column

hallow wave Jul 19, 2019, 4:38 PM

#

Ye

#

I'm trying to find a solution for it

cunning osprey Jul 19, 2019, 4:38 PM

#

I mean, I think using bools might work, but I rather not have a huge paragraph of code

#

Don't stress if it's impossible or hard, I'm just a beginner, tryna learn

hallow wave Jul 19, 2019, 4:40 PM

#

I would personally stick to color maps

#

I found this code though
```
import numpy as np
np.random.seed(19680801)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for color in ['tab:blue', 'tab:orange', 'tab:green']:
    n = 750
    x, y = np.random.rand(2, n)
    scale = 200.0 * np.random.rand(n)
    ax.scatter(x, y, c=color, s=scale, label=color,
               alpha=0.3, edgecolors='none')

ax.legend()
ax.grid(True)

plt.show()
```

#

Just replace the list with what you want and I think you should be good
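
Adapting that pattern to the iris question might look like the sketch below: one `scatter` call per species, each with its own `label`, so the legend gets exactly three entries. The data values are made up, the column names follow the ones mentioned in the chat, and the code-to-name mapping in `species_names` is an assumption about how the CSV was encoded.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the iris frame; 'Labels' holds the species codes 0/1/2.
iris = pd.DataFrame({
    "Petal Length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "Sepal Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Labels": [0, 0, 1, 1, 2, 2],
})
species_names = {0: "setosa", 1: "versicolor", 2: "virginica"}  # assumed mapping

fig, ax = plt.subplots()
# One scatter call per species, so each species gets a single legend entry
# and its own color from the default color cycle.
for code, group in iris.groupby("Labels"):
    ax.scatter(group["Petal Length"], group["Sepal Length"], label=species_names[code])

ax.set_xlabel("Petal Length")
ax.set_ylabel("Sepal Length")
legend = ax.legend(title="Species")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.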

#

📎 sphx_glr_scatter_with_legend_001.png

cunning osprey Jul 19, 2019, 4:48 PM

#

Yeah, I'll copy the code for later

#

Thank you for the help

quartz monolith Jul 19, 2019, 4:53 PM

#

looks like my tiles in my loo ^^ jk

cunning osprey Jul 19, 2019, 4:53 PM

#

Has anyone coded dummy variables into regressions before?

quartz monolith Jul 19, 2019, 4:58 PM

#

never worked with dummies but i have something in my bookmark, maybe it helps you
http://personal.rhul.ac.uk/uhte/006/ec2203/Lecture 13_Use and Interpretation of Dummy Variables.pdf

cunning osprey Jul 19, 2019, 5:16 PM

#

Useful resource

desert oar Jul 19, 2019, 5:25 PM

#

@cunning osprey i do it all the time. got a specific question about it?

#

the machine learning hipsters call it "one hot encoding"

silent swan Jul 19, 2019, 5:44 PM

#

"machine learning hipsters"

hallow wave Jul 19, 2019, 5:46 PM

#

Do i need to learn machine learning?

#

For a data analysis pos

desert oar Jul 19, 2019, 5:48 PM

#

@hallow wave you should probably be familiar with what it is and how it works, since it's becoming more and more common. you might find it useful in some cases. but it will depend on the job responsibilities

hallow wave Jul 19, 2019, 5:48 PM

#

Ok, salt, do you currently have a job as a data analyst?

desert oar Jul 19, 2019, 5:49 PM

#

data scientist

hallow wave Jul 19, 2019, 5:53 PM

#

Ahh, if possible could you specify what I would need to know to get an internship as a data analyst.

#

It's fine if you don't.

silent swan Jul 19, 2019, 5:54 PM

#

data science/analyst jobs vary wildly

#

some just require SQL

#

others require a lot of modeling

hallow wave Jul 19, 2019, 5:56 PM

#

modeling as in visualisation?

desert oar Jul 19, 2019, 5:57 PM

#

no modeling would be like building models

#

to predict things or to understand past data

#

and yeah i dont really have a general answer, depends on the internship

hallow wave Jul 19, 2019, 5:59 PM

#

Hmm, I guess predicting would involve machine learning.

desert oar Jul 19, 2019, 5:59 PM

#

in the "philosophical" sense of the term yes

#

but machine learning often nowadays refers to a collection of modeling techniques that are "non statistical"

#

i guess it's a distinction between "machine learning, the problem statement" (prediction, clustering, recommendations, etc.) and "machine learning, the toolset" (gradient boosting, neural networks, cross validation, etc.)

#

there is a lot of overlap between ML and statistics

hallow wave Jul 19, 2019, 6:01 PM

#

Hmm, do you think I could master the field of data analytics in a general sense within 2 years, and ye, I kind of get that!

desert oar Jul 19, 2019, 6:02 PM

#

nobody can master anything in 2 years unless they are a prodigy

#

for an internship, i would expect at minimum:

comfortable with probability (conditional probability, what a probability distribution is)
comfortable with statistics (mean, variance, etc)
the basics of linear regression modeling
some calc and/or linear algebra
one programming language (python R matlab julia etc), or at least Excel skills
an understanding of how to produce basic data visualizations

#

but that's my preference, if i personally were going to hire an intern

#

you should know all that stuff by ~2nd year of college or after finishing a masters program

hallow wave Jul 19, 2019, 6:04 PM

#

Ok, well all i've got so far is semi-complex data visualizations with matplotlib and importing data through csv files with pandas, I did this in 2 days.

desert oar Jul 19, 2019, 6:04 PM

#

thats a good start

#

there is a lot to know and learn

#

data is hard

#

there is rarely one "correct" way to do something

hallow wave Jul 19, 2019, 6:06 PM

#

Mhmm, I feel like I could get the math done within 2 weeks, i've already partly mastered the fundamentals of python and i've started with pandas and matplotlib. I'm just planning to learn them first, and then start building my github repo with projects based around them, which will give me an edge in a job application/role and also reinforce my knowledge for my own well being. Yes, I get that.

cunning osprey Jul 19, 2019, 6:07 PM

#

Hmm, on the note of regressions, since I completely forgot about how they work

hallow wave Jul 19, 2019, 6:07 PM

#

Youtube ^^^

cunning osprey Jul 19, 2019, 6:07 PM

#

I have a multiple regression line, and after introducing some new independent variables the other coefficients increased

hallow wave Jul 19, 2019, 6:07 PM

#

https://www.youtube.com/watch?v=zPG4NjIkCjc

YouTube

statisticsfun

An Introduction to Linear Regression Analysis

Tutorial introducing the idea of linear regression analysis and the least square method. Typically used in a statistics class. Playlist on Linear Regression ...


desert oar Jul 19, 2019, 6:07 PM

#

you might be able to learn it in 2 weeks, but to remember and understand it? that would be very impressive

hallow wave Jul 19, 2019, 6:08 PM

#

No, that's why I will reinforce it with projects.

desert oar Jul 19, 2019, 6:08 PM

#

best of luck, if you have the free time it sounds like you will do well

hallow wave Jul 19, 2019, 6:08 PM

#

That's exactly how I learnt python.

#

I have atleast 6 hours a day :p And 6 weeks till school starts. And then 2 years :p

desert oar Jul 19, 2019, 6:09 PM

#

nice

#

when i was in school i played video games and went to bad concerts

hallow wave Jul 19, 2019, 6:09 PM

#

Thanks for the advice!

cunning osprey Jul 19, 2019, 6:09 PM

#

Nah, I know how linear regression works. Just forgot how interaction works

desert oar Jul 19, 2019, 6:11 PM

#

@cunning osprey think of it algebraically

b1*x1 + b2*x2 + b12*x1*x2
(b1*x1 + b12*x1*x2) + b2*x2
(b1 + b12*x2)*x1 + b2*x2

#

since you can move around the order of addition and you can freely add or remove parentheses
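
To make the regrouping concrete: the last line says the slope of x1 is `b1 + b12*x2`, so the effect of x1 changes with x2. A quick numeric check, with made-up coefficients chosen only for illustration:

```python
def predict(x1, x2, b1=2.0, b2=1.0, b12=0.5):
    """Linear model with an interaction term (coefficients are made up)."""
    return b1 * x1 + b2 * x2 + b12 * x1 * x2


def slope_of_x1(x2, b1=2.0, b12=0.5):
    """Effect of a one-unit increase in x1, which depends on x2."""
    return b1 + b12 * x2


effects = {}
for x2 in (0, 1, 4):
    # Difference in prediction when x1 goes from 0 to 1, holding x2 fixed.
    effects[x2] = predict(1, x2) - predict(0, x2)
    print(x2, effects[x2], slope_of_x1(x2))
```

With a 0/1 dummy as x2, this reads as: b1 is the slope of x1 for the baseline group, and b1 + b12 is the slope for the other group.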

hallow wave Jul 19, 2019, 6:12 PM

#

Ye, I used to be a 'chav' but I knew I didn't want to be a bum, I sacrificed my childhood pretty much but that's another story. I'm glad that I choose to just get my head down and focus on work cause now I have hope for the future :p Just gunna keep learning and keep moving forward :p Thanks!

cunning osprey Jul 19, 2019, 6:12 PM

#

Well true enough, I mean, I got the code down and the dummy variables in place. But I just forgot how to interpret the results

#

After putting in my dummy variables, all the other coefficients increased

desert oar Jul 19, 2019, 6:12 PM

#

@hallow wave good on you for turning it around! happy to help here when you have more questions

hallow wave Jul 19, 2019, 6:14 PM

#

Well i'm going to start learning what you recommended, thanks for the support!

#

Actually, I have one question, what is spark?

surreal nacelle Jul 19, 2019, 6:16 PM

#

How can I check the feature's importance after a simple cross_val_score ?

supple ferry Jul 19, 2019, 6:17 PM

#

What is your algorithm? @surreal nacelle

surreal nacelle Jul 19, 2019, 6:17 PM

#

Random Forest

desert oar Jul 19, 2019, 6:17 PM

#

@surreal nacelle you need to get the actual model object that was fitted. im not sure you can do that with cross_val_score

surreal nacelle Jul 19, 2019, 6:17 PM

#

Either that or Logistic Regression, they give similar prediction right now

#

I have the model at hands

#

I could .fit it no problems

desert oar Jul 19, 2019, 6:18 PM

#

if you have the fitted object then you can look up the importances directly, check the documentation for whatever library you're using

#

sometimes you need to explicitly enable importances before fitting

#

again it depends on the software
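
For scikit-learn specifically, a hedged sketch of two ways to get at the importances. `cross_val_score` only returns scores, but `cross_validate(..., return_estimator=True)` also returns the estimator fitted on each fold. The data here is synthetic (from `make_classification`) and stands in for the real feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Keep the estimator fitted on each fold and read its importances.
cv = cross_validate(model, X, y, cv=5, return_estimator=True)
fold_importances = np.array([est.feature_importances_ for est in cv["estimator"]])
print(fold_importances.mean(axis=0))

# Alternatively, fit once on all the training data and read the attribute.
model.fit(X, y)
print(model.feature_importances_)
```

For logistic regression there is no `feature_importances_`; the fitted `coef_` plays a similar role, and it is only comparable across features if they are on the same scale.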

surreal nacelle Jul 19, 2019, 6:18 PM

#

Gonna do that thank you

#

I use sklearn

desert oar Jul 19, 2019, 6:18 PM

#

@hallow wave spark is a distributed computing platform that typically runs on a hadoop cluster, often scheduled by yarn (hadoop's resource manager); it can also run standalone or on other cluster managers

#

@surreal nacelle ok, the sklearn docs are very good. you should get used to using and reading them

surreal nacelle Jul 19, 2019, 6:19 PM

#

Yep 😃

desert oar Jul 19, 2019, 6:19 PM

#

@hallow wave so the idea is that instead of developing some kind of customized high performance cluster, with apache spark you just have a cluster of commodity hardware running hadoop, and you can do distributed computations on fairly big data that way

surreal nacelle Jul 19, 2019, 6:21 PM

#

Btw, is a medium 'toward data science' subscription worth it ? I keep seeing promising articles, but there is a paywall

desert oar Jul 19, 2019, 6:21 PM

#

eh? i never had to pay

#

https://towardsdatascience.com/

Towards Data Science

Sharing concepts, ideas, and codes.

surreal nacelle Jul 19, 2019, 6:21 PM

#

📎 Screen_Shot_2019-07-19_at_8.21.33_PM.png

desert oar Jul 19, 2019, 6:21 PM

#

no

#

that's just medium trying to get you to make an account

#

wait what

#

uhh

#

i never saw that

surreal nacelle Jul 19, 2019, 6:22 PM

#

Yea, it's only 1 more story

tulip estuary Jul 19, 2019, 6:22 PM

#

I think you get 5 per day for free

desert oar Jul 19, 2019, 6:22 PM

#

i just read stuff on https://towardsdatascience.com

#

i guess i never read more than 5 a day?

surreal nacelle Jul 19, 2019, 6:22 PM

#

This is the website that the screenshot is from

#

Well, 5 per day is not too bad

#

gonna create an account then

tulip estuary Jul 19, 2019, 6:23 PM

#

I think it is 5

desert oar Jul 19, 2019, 6:23 PM

#

huh

#

interesting

#

i dont mind paying for good content

#

and TDS is good content

surreal nacelle Jul 19, 2019, 6:23 PM

#

Alright, I'll try the free account and see if it's worth

desert oar Jul 19, 2019, 6:24 PM

#

i never even signed up

#

but yeah i guess if you need

silent swan Jul 19, 2019, 6:33 PM

#

omfg the tensorflow ecosystem

quartz monolith Jul 19, 2019, 6:33 PM

#

@desert oar we spoke about the NaNs in a knowledge database. There are some specific columns which hold error codes that come from the machine, e.g. controller or drive. For a specific error there are two different columns: one is filled out, the other is NaN. The label of the data frame should be the troubleshooting action. How do I deal with these missing values? It's not really a missing value, it's information which doesn't matter for some troubleshoots, right?

#

Previously there was some NaN which were not at random. I cleaned them up with other experts

desert oar Jul 19, 2019, 6:42 PM

#

@silent swan omfg the tensorflow docs

#

so much black magic that the docs never explain, or barely explain

#

@quartz monolith yes that is possible. you can leave those missing. but how to handle them with catboost is another question

#

what you can try is to concatenate the two columns

#

so maybe you have a record like (Error1, "Restarted computer") and another record like (Error5, None) because Error5 doesn't require troubleshooting

#

so you can make a new single feature "Error1 - Restarted Computer" and "Error5 - None"

#

but that will make it harder for catboost to learn

#

so in that particular case you can just turn the missing values into an empty string or something else

#

since the missing values are "meaningful"
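
A minimal pandas sketch of both options mentioned here. The column names `error_code` and `action` and the records are made up for illustration; they are not the real schema.

```python
import numpy as np
import pandas as pd

# Made-up records; column names are hypothetical.
df = pd.DataFrame({
    "error_code": ["Error1", "Error5", "Error1"],
    "action": ["Restarted computer", np.nan, np.nan],
})

# Option 1: keep the columns separate and make "missing" an explicit category.
df["action_filled"] = df["action"].fillna("None")

# Option 2: combine both columns into a single categorical feature.
df["error_action"] = df["error_code"] + " - " + df["action_filled"]
print(df["error_action"].tolist())
```

Option 2 multiplies the number of categories (one per observed combination), which is the reason it can be harder for a model to learn from.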

silent swan Jul 19, 2019, 6:49 PM

#

and then there're like, 12 different ways of doing the same thing

#

I'm so glad I live in the pytorch sphere

#

but I need to port a model to tf

desert oar Jul 19, 2019, 7:03 PM

#

i love how TF also has this general key-value store mess

#

like regularization is a hack

#

i have no idea what does what

#

learning TF sucks, keras is a bandaid. i need to try torch 😛

silent swan Jul 19, 2019, 7:04 PM

#

pytorch is 10/10

#

I think keras is aight if you have a very standard workflow and never want to poke at what's underneath

hallow wave Jul 19, 2019, 7:06 PM

#

Hey salt rock, could you possibly give me an example of how conditional probability is used in work.

desert oar Jul 19, 2019, 7:13 PM

#

@hallow wave im working on a classifier that has 1500 potential classes, meaning my model emits a probability distribution over the 1500 classes for each piece of data. each class corresponds to one of 3 types: high, medium, and low. i can use the laws of probability (and specifically conditional probability) to derive a distribution over those 3 types based on the distribution over the 1500 classes

#

without having to develop a separate model for the 3 classes
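
A toy version of that calculation, assuming each class belongs to exactly one type. It uses the law of total probability, P(type) = Σ over classes of P(type | class) · P(class), where P(type | class) is 1 or 0. The five classes, their probabilities, and the mapping are made up.

```python
from collections import defaultdict

# Toy version: 5 classes instead of 1500. Names and numbers are made up.
class_probs = {"a": 0.40, "b": 0.25, "c": 0.20, "d": 0.10, "e": 0.05}
class_to_type = {"a": "high", "b": "low", "c": "high", "d": "medium", "e": "low"}

# Each class maps to exactly one type, so P(type | class) is 1 or 0 and the
# sum reduces to adding up the probabilities of the classes in that type.
type_probs = defaultdict(float)
for cls, p in class_probs.items():
    type_probs[class_to_type[cls]] += p

print(dict(type_probs))
```

The three type probabilities still sum to 1, so the result is a proper distribution derived from the model's existing output.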

hallow wave Jul 19, 2019, 7:15 PM

#

Hmm, I don't really understand how you would code it but I sort of get the idea. Your data gives you results for a variable, say x, and then you create distributions based upon your results.

desert oar Jul 19, 2019, 7:16 PM

#

sorta... thats the general idea

hallow wave Jul 19, 2019, 7:17 PM

#

Well I haven't gone in depth with it yet, will have to start making code revolving around it

quartz monolith Jul 19, 2019, 7:27 PM

#

sorry i was a bit afk i will get into it in some minutes

quartz monolith Jul 19, 2019, 8:02 PM

#

@desert oar oh i get it, i will leave them as missing values but encode them as "".

#

how are you guys training a model? through cpu or gpu?

#

because my model really takes a long time

cunning osprey Jul 19, 2019, 8:09 PM

#

Does anyone know how to find certain words in a text column while filtering other certain words

#

Im trying to search for the words 'view' and perspective' while excluding any rows with the words 'bottom' 'top' 'front' and 'rear'

#

So far, I'm doing
for word in perspective:
    df[word] = df.astype(str).sum(axis=1).str.contains(word)

#

Which assigns bools to anything that contains the words, but I'm not trying to repeat counts, and I cant have a true statement if the row contains the filtered words

quartz monolith Jul 19, 2019, 8:12 PM

#

I dont know if the excluding and removing works in one search
but excluding or deleting any rows which contain a string is something like this
df[~df["column"].isin(["value, top, front, rear"])]

desert oar Jul 19, 2019, 8:16 PM

#

thats not quite right

#

"value, top, front, rear" is a single string

#

it looks like you want a list of strings like this ["value", "top", "front", "rear"]

cunning osprey Jul 19, 2019, 8:20 PM

#

Im getting new columns, true if the words are contained, but I need to drop them somehow

desert oar Jul 19, 2019, 8:21 PM

#

@cunning osprey

include_words = ['view', 'perspective']
exclude_words = ['bottom', 'top', 'front', 'rear']

include_pattern = '|'.join(rf'\b{w}\b' for w in include_words)
data['has_include'] = data['text'].str.contains(include_pattern)

exclude_pattern = '|'.join(rf'\b{w}\b' for w in exclude_words)
data['has_exclude'] = data['text'].str.contains(exclude_pattern)

data['is_valid'] = data['has_include'] & ~data['has_exclude']

data_filtered = data.loc[data['is_valid']]

cunning osprey Jul 19, 2019, 8:23 PM

#

Wow

#

That works beautifully

desert oar Jul 19, 2019, 8:23 PM

#

str.contains takes regex

#

so that makes your life easier

#

\b in regex means "word boundary"

#

so \bview\b will match "view" but not "preview" or "viewer" for example
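A quick way to check the word-boundary behaviour outside pandas is the standard `re` module. This is only a sketch: the sample strings are made up, and `re.escape` is an extra precaution that was not part of the snippet above.

```python
import re

words = ["view", "perspective"]
# One alternation of \bword\b terms; re.escape guards against regex metacharacters.
pattern = "|".join(rf"\b{re.escape(w)}\b" for w in words)

samples = ["side view", "preview image", "viewer settings", "perspective drawing"]
matches = [s for s in samples if re.search(pattern, s)]
print(matches)  # ['side view', 'perspective drawing']
```

The match is case-sensitive as written, so "View" would need `flags=re.IGNORECASE` (or `case=False` in `str.contains`).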

cunning osprey Jul 19, 2019, 8:26 PM

#

df['is_valid'] = df['has_include'] & ~df['has_exclude']

was the part I was missing

#

But the word boundary was not something I thought of, thank you so much Mr. Salt Rock

desert oar Jul 19, 2019, 8:28 PM

#

👍

cunning osprey Jul 19, 2019, 8:30 PM

#

Over 3482 true values dang

surreal nacelle Jul 19, 2019, 8:33 PM

#

Hey, I'm working on the titanic dataset with logistic regression. I currently have 0.80 average prediction, and I'm not sure where to go from there.
I already filtered irrelevant features and used GridSearchCV to get the best hyperparameters (didn't change much tho). I'm guessing that the problem resides in the data. How would you go about it? Scaling? Creating new features? (If you've already worked on the titanic set, please don't give me 'solutions', I just want some guidance 😃) Thanks

quartz monolith Jul 19, 2019, 8:42 PM

#

you can try other models and see how they perform

surreal nacelle Jul 19, 2019, 8:46 PM

#

I already tried other models, and this one performs the best

desert oar Jul 19, 2019, 8:46 PM

#

0.80 average prediction? or 0.80 accuracy

#

the next step would be making better features

surreal nacelle Jul 19, 2019, 8:47 PM

#

accuracy sorry

desert oar Jul 19, 2019, 8:47 PM

#

thats not bad at all

#

scaling can help

#

new features can help

#

maybe start reading some blogs, since you already have the basics

surreal nacelle Jul 19, 2019, 8:47 PM

#

I've reduced the dataset to 4 features

desert oar Jul 19, 2019, 8:48 PM

#

very nice

#

thats a lot better than my first titanic attempt back in the day

#

what features do you have?

surreal nacelle Jul 19, 2019, 8:49 PM

#

Pclass Sex Age Fare

#

brb 2min

#

Ok I'm back, basically I started by removing the id/ticket_id/cabin features as they were irrelevant and incomplete, then changed the Embarkation point from char to int, same for sex, filled the missing values using median, realized siblings/spouse/husband didn't matter for the model (actually made it a little worse), removed these, and here I am with 4 features and 0.80 accuracy.
Next step is scaling and creating new features I guess

desert oar Jul 19, 2019, 9:33 PM

#

really good start

#

scaling categorical features isn't that useful

#

scaling fare might help

surreal nacelle Jul 19, 2019, 9:34 PM

#

```
   Survived  Pclass  Sex   Age     Fare
0         0       3    0  22.0   7.2500
1         1       1    1  38.0  71.2833
2         1       3    1  26.0   7.9250
3         1       1    1  35.0  53.1000
```
I think so too

#

but some values are way above average

desert oar Jul 19, 2019, 9:34 PM

#

subtract mean and divide by std dev

surreal nacelle Jul 19, 2019, 9:34 PM

#

won't that mess up the scaling ?

desert oar Jul 19, 2019, 9:35 PM

#

yeah can use median and median abs dev
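Both options can be sketched in a few lines. The series below reuses the four fares from the preview plus one made-up large value (512.0) to stand in for an outlier; the variable names are assumptions.

```python
import pandas as pd

# Four fares from the preview above plus one made-up outlier.
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 512.0])

# Standard scaling: subtract the mean, divide by the standard deviation.
z = (fare - fare.mean()) / fare.std()

# Robust scaling: subtract the median, divide by the median absolute deviation.
med = fare.median()
mad = (fare - med).abs().median()
robust = (fare - med) / mad

print(z.round(2).tolist())
print(robust.round(2).tolist())
```

The outlier pulls the mean and standard deviation toward itself, while the median and MAD barely move, which is why the robust version keeps the ordinary fares more spread out.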

#

good catch

#

also better missing data imputation is helpful

#

also here's a hint: you can figure out family members with the last names

surreal nacelle Jul 19, 2019, 9:35 PM

#

mhm interesting

#

Thanks for the input

#

gonna read on data imputation and figure out a way to use last names efficiently
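As a starting point only, and not a full solution, the last name can be pulled out of a "Last, Title. First" style name string with pandas string methods. The rows below are made up, and `LastName` and `SameNameCount` are hypothetical column names; sharing a last name does not guarantee that two passengers are related.

```python
import pandas as pd

# Made-up rows in the "Last, Title. First" format.
df = pd.DataFrame({"Name": ["Smith, Mr. John", "Smith, Mrs. Anna", "Jones, Miss. Mary"]})

# Everything before the first comma is the last name.
df["LastName"] = df["Name"].str.split(",").str[0].str.strip()

# How many rows share each last name (a rough proxy for family size).
df["SameNameCount"] = df.groupby("LastName")["LastName"].transform("count")

print(df)
```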

desert oar Jul 19, 2019, 9:38 PM

#

there was also a fun blog post out there about how ticket numbers related to cabin position in the ship

surreal nacelle Jul 19, 2019, 9:38 PM

#

👍

cunning osprey Jul 19, 2019, 10:01 PM

#

So I have this regression line:

mod = ols("Sepal_length ~ Petal_length + Sepal_width + Petal_width + C(Labels)", data=Iris)

#

Okay, no idea how to format it. But Labels is a column in the Iris dataframe noting the species of flower in the row, 0 - species 1, 1 - species 2, 2 - species 3

#

Just wanted to know if this was the right way to make labels into a dummy variable for regression
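For reference, wrapping a column in `C(...)` in a statsmodels formula marks it as categorical, and the default coding is treatment (reference) coding: one 0/1 column per level except the first. The sketch below builds the equivalent dummy columns by hand with pandas so the result can be inspected; the six label values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the Labels column (0, 1, 2 = three species).
labels = pd.Series([0, 0, 1, 1, 2, 2], name="Labels")

# Treatment coding: drop the first level, which becomes the reference category.
dummies = pd.get_dummies(labels, prefix="Labels", drop_first=True).astype(int)

print(dummies.columns.tolist())  # ['Labels_1', 'Labels_2']
print(dummies)
```

The coefficients on these columns are then read as offsets relative to species 0.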

trim leaf Jul 19, 2019, 11:17 PM

#

has anybody tried to build a stock market/futures trading algorithm

earnest prawn Jul 19, 2019, 11:46 PM

#

Predicting stock market has been a thing since stock market came up lol

crude bloom Jul 20, 2019, 12:19 AM

#

I'm looking for a machine learning library in python which is optimized for defining individual neurons and creating your own neural network graph. I'm interested in researching new neural network architectures so I don't want to deal with existing abstractions like convolutions and batch norm

#

anyone know an ML library that lets you do this without much effort?

lean ledge Jul 20, 2019, 12:26 AM

#

What's wrong with just plain tensorflow or pytorch? @crude bloom

crude bloom Jul 20, 2019, 12:27 AM

#

nothing's wrong with it, I'm just interested in doing some research

#

basically I want to make a neural network that doesn't have layers in a straightforward way

#

more randomly wired so there's more space for exploration

lean ledge Jul 20, 2019, 12:28 AM

#

What's stopping you from using them?

crude bloom Jul 20, 2019, 12:29 AM

#

the base element of both of those is that they have predefined layers like convolutions and dense layers

lean ledge Jul 20, 2019, 12:31 AM

#

not really

crude bloom Jul 20, 2019, 12:31 AM

#

actually TF does have a lower level API

#

I'm just used to using the more high level stuff

lean ledge Jul 20, 2019, 12:31 AM

#

they both work with generic low level operations and have optional APIs on top (nn and keras)

#

both of the frameworks are fundamentally low level

#

https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
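To illustrate the "generic low level operations" point, here is a rough sketch of a randomly wired network built from plain tensors and autograd, with no `nn` layers. The sizes, the 0.5 wiring probability, and the `forward` helper are all assumptions made for this example.

```python
import torch

torch.manual_seed(0)

n_in, n_hidden = 3, 5
n_nodes = n_in + n_hidden  # nodes in topological order; the last one is the output

# Random upper-triangular mask: node j may read from any earlier node i < j.
mask = torch.triu((torch.rand(n_nodes, n_nodes) < 0.5).float(), diagonal=1)
weights = torch.randn(n_nodes, n_nodes, requires_grad=True)

def forward(x):
    # Start with the input features, then compute each neuron one at a time.
    acts = [x[:, i] for i in range(n_in)]
    for j in range(n_in, n_nodes):
        prev = torch.stack(acts, dim=1)   # (batch, j) activations so far
        w_j = (weights * mask)[:j, j]     # incoming edges of node j
        acts.append(torch.tanh(prev @ w_j))
    return acts[-1]

x = torch.randn(8, n_in)
target = torch.randn(8)
loss = ((forward(x) - target) ** 2).mean()
loss.backward()  # autograd handles the irregular graph

print(loss.item(), weights.grad.shape)
```

Edges that the mask switches off receive exactly zero gradient, so only the wired connections are trained.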

crude bloom Jul 20, 2019, 12:33 AM

#

u right, thanks 😃

#

i've been using too much keras lol

floral lodge Jul 20, 2019, 3:34 AM

#

Hello fine people of data science. I am reposting my question from the help channels here after it was suggested to me by one of the mods. I would like to know if it would be possible to automate the following task, and if so, any modules, libraries or keywords that I could research and learn that may help me accomplish the implementation. Basically, I want to create a small automation that will speed my workflow significantly inside of a CAD program. I want to take an image like this https://gyazo.com/a6d6433ccea29e7b13257c8fd5a5f359 or like this https://gyazo.com/6c9386d3c4d224597d49ac8d657873a2 and analyze it to detect these vertices and click them in the shown order. The perspective and scaling of the image will be slightly different each time, but it will always look like one of the two examples (whichever works better) . Is this possible? I'm familiar enough with automating the gui but I don't know where to get started on the image analysis required to identify these points, or how to create a function that allows me to drag a box around some screenspace to capture the image for analysis. Any insight, tips or knowledge is greatly appreciated.

📎 2.jpg

near viper Jul 20, 2019, 6:37 AM

#

https://puu.sh/DV4I0/8f159c8d03.png
yeah uh what

#

its a GAN

#

training on mnist

#

with discriminator and generator and such

#

written with keras

surreal nacelle Jul 20, 2019, 6:50 AM

#

0.84 accuracy, making progress on the titanic set 😄

near viper Jul 20, 2019, 6:55 AM

#

my
my network devolved

#

it had a blurry pattern in the center

#

and then it just died

surreal nacelle Jul 20, 2019, 7:28 AM

#

There is one thing I'm wondering about, let's say the algo I use has an accuracy which varies between 0.82 and 0.88. Would saving the 0.88 model be beneficial?
If I load that model to predict, will it perform better than if I saved the 0.82, or will it be just as random as training a new one every time?
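One way to get a feel for this question is a small simulation. It assumes a model with a fixed true accuracy of 0.85 and a test set of 179 rows (roughly 20% of 891); both numbers are assumptions chosen for illustration. Even with identical skill, the measured score moves around from split to split.

```python
import numpy as np

rng = np.random.default_rng(0)

true_accuracy = 0.85  # assumed fixed skill of the model
n_test = 179          # assumed test size

# Each evaluation draws a fresh test set; the score is the fraction predicted correctly.
scores = rng.binomial(n_test, true_accuracy, size=1000) / n_test

print(scores.min(), scores.mean(), scores.max())
```

Under these assumptions a spread of several points comes from the test split alone, so a single 0.88 run is not evidence that the saved model is better than a 0.82 one.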

quartz monolith Jul 20, 2019, 9:25 AM

#

@surreal nacelle you should check your confusion matrix and analyze it

surreal nacelle Jul 20, 2019, 9:25 AM

#

No idea what that is, but i'll google it 😄 thanks
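For context, a confusion matrix is a table of actual versus predicted classes. A minimal sketch with `pandas.crosstab`, using made-up labels and predictions:

```python
import pandas as pd

# Made-up labels and predictions for a binary problem (1 = survived).
y_true = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
y_pred = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

# Rows are the actual class, columns are the predicted class.
cm = pd.crosstab(y_true, y_pred)
print(cm)

# The diagonal holds the correct predictions.
accuracy = (cm.loc[0, 0] + cm.loc[1, 1]) / cm.values.sum()
print(accuracy)  # 0.75
```

The off-diagonal cells show which kind of mistake the model makes more often, which a single accuracy number hides.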

quartz monolith Jul 20, 2019, 9:37 AM

#

```
df[['error1', 'error2']] = df['error'].str.extract(r"(\w*)[,/](\w*)")
df.drop(columns=["error"], inplace=True)
```
The data frame has some columns which contain mostly digits plus some letters, and the parts are separated by `,` or `/`. I want to split them into sections, then delete the original column.

supple ferry Jul 20, 2019, 11:13 AM

#

@surreal nacelle you should use cross validation for that

surreal nacelle Jul 20, 2019, 11:14 AM

#

I already use cross validation, but I was wondering if a model that performed better than the other, could be used to build around it

#

apparently not tho

#

managed to get 0.94 doing that, but it was obviously overfitting

desert oar Jul 20, 2019, 12:04 PM

#

@floral lodge that should be possible but it's outside my particular area of expertise. You will want some kind of "edge detection" algorithm, maybe look in OpenCV

#

Also does your CAD program provide an API for scripting?

#

I know for instance you can write plug-ins to automate fusion 360 and freecad

#

@surreal nacelle you got slightly different accuracy numbers during cross validation?

surreal nacelle Jul 20, 2019, 12:07 PM

#

I was thinking about combining the results of 2 models. It's a binary classification, so I'd just have to compare the predictions of one and only "accept" the predictions that concur with the other model.
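The "only accept predictions that concur" idea can be sketched with NumPy. The two prediction arrays are made up, and using -1 as the marker for undecided rows is an arbitrary choice for this example.

```python
import numpy as np

# Made-up binary predictions from two hypothetical models on the same rows.
pred_a = np.array([1, 0, 1, 1, 0, 1])
pred_b = np.array([1, 0, 0, 1, 1, 1])

agree = pred_a == pred_b
accepted = np.where(agree, pred_a, -1)  # -1 marks rows left undecided

print(accepted.tolist())  # [1, 0, -1, 1, -1, 1]
print(agree.mean())       # fraction of rows that received a prediction
```

The rows where the models disagree still need a decision from somewhere, so this on its own leaves part of the data unpredicted.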

#

Well not really

#

but the algo used to range from 0.78 to 0.91 at some point

desert oar Jul 20, 2019, 12:07 PM

#

What do you mean range from

surreal nacelle Jul 20, 2019, 12:08 PM

#

the accuracy score

desert oar Jul 20, 2019, 12:08 PM

#

How are you evaluating this

#

Yeah but what is your evaluation procedure

surreal nacelle Jul 20, 2019, 12:08 PM

#

I'm talking about regular predictions, not cross val btw

#

Well, I kfold then cross_val_score using the kfold
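The "kfold then cross_val_score" procedure might look roughly like the sketch below. Synthetic data from `make_classification` stands in for the titanic features, and the estimator, number of folds, and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a 4-feature binary classification problem.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# A fixed random_state makes the folds, and therefore the scores, repeatable.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation across folds gives a steadier picture than any single split.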