earnest prawn Feb 13, 2020, 11:27 AM

#

Well the JSON data presumably contains pixel and format information

#

So you read the pixels, reformat them into a 2D tensor according to the format info, and then feed it into your CNN?

lapis sequoia Feb 13, 2020, 1:47 PM

#

Can someone plz help me get internship for data science

coral yoke Feb 13, 2020, 2:03 PM

#

no?

coral otter Feb 13, 2020, 9:02 PM

#

hi all, i want to make a function to add a suffix to a dataframe name like

#

add the suffix _4 to the dataframe name............function(toto) = toto_4

#

i m a begginer

late jackal Feb 14, 2020, 3:44 AM

#

Could anyone nudge me in the right direction for starting this problem

📎 unknown.png

velvet thorn Feb 14, 2020, 4:34 AM

#

do you know how weights and biases work @late jackal

late jackal Feb 14, 2020, 4:44 AM

#

I know it's like x1w1+....xiwi+b

#

Like a weighted avg plus the constant

#

I'm not sure if they just want us to write out the simple function. Or if they would like us to make some sort of training data

velvet thorn Feb 14, 2020, 4:58 AM

#

why do you think so?

#

like for a) they just want you to calculate the output, i.e. through algebraic substitution

#

the others are logic questions

hybrid scroll Feb 14, 2020, 10:11 AM

#

📎 unknown.png

#

Hello guys, I did AutoML using h2o and the result looks like this
how do I save the model for DRF_1?
so I can share or reload that model without retraining on the data

dire stirrup Feb 14, 2020, 2:51 PM

#

pickle?

somber hamlet Feb 14, 2020, 5:12 PM

#

hello, how can I draw a pyplot graph from a dict? {x:y}

#

Seems like I've to separate them in two list, but I don't find the answer elegant

velvet thorn Feb 14, 2020, 5:18 PM

#

zip(*d.items())?
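A minimal sketch of the unzip idea, assuming a small made-up dict `d`: transposing the `(x, y)` pairs gives the two sequences that `pyplot` expects. The matplotlib calls are left as comments so the snippet runs without it installed.

```python
# Hypothetical example data; any {x: y} dict works the same way.
d = {1: 10, 2: 20, 3: 15}

# d.items() yields (x, y) pairs; zip(*pairs) transposes them into
# one tuple of x values and one tuple of y values.
xs, ys = zip(*d.items())

print(xs)
print(ys)

# With matplotlib installed, the two sequences go straight into plot:
#   import matplotlib.pyplot as plt
#   plt.plot(xs, ys)   # one-liner form: plt.plot(*zip(*d.items()))
#   plt.show()
```

Note that the unpacking fails on an empty dict, and `plt.plot(list(d.keys()), list(d.values()))` is an equally valid spelling.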

mild sierra Feb 14, 2020, 10:12 PM

#

anyone here use luigi

lapis sequoia Feb 14, 2020, 10:42 PM

#

anyone know what to do in feature engineering to beat svm ?

coral yoke Feb 14, 2020, 10:51 PM

#

what?

#

to beat SVM with what?

lapis sequoia Feb 14, 2020, 11:15 PM

#

new columns

coral yoke Feb 14, 2020, 11:23 PM

#

you said beat the SVM as in using another model. what are you trying to say i must be misunderstanding, sorry

deep spire Feb 14, 2020, 11:25 PM

#

So I've been trying to get more organized with my project management (been having some issues with communication/tasks at work). Do any of you guys have any suggestions for tools or methodologies for managing data analytics/data science projects?

lapis sequoia Feb 14, 2020, 11:27 PM

#

I am trying to get a better score

📎 unknown.png

#

but I don't know how

#

this is the data set https://www.kaggle.com/becksddf/churn-in-telecoms-dataset

Churn in Telecom's dataset

coral yoke Feb 14, 2020, 11:28 PM

#

i can't spend my time performing EDA for you man i'm sorry

#

it's up to you to understand your dataset and know how to handle it

#

others may have that time but unfortunately i do not

lapis sequoia Feb 14, 2020, 11:29 PM

#

I have done some eda

#

i combined all minute columns and that was accepted

lapis sequoia Feb 15, 2020, 12:12 AM

#

oh python

#

why do you crash and not tell me

#

📎 unknown.png

lapis sequoia Feb 15, 2020, 1:08 AM

#

yep

📎 unknown.png

#

gg python thanks for telling me nothing again

lapis sequoia Feb 15, 2020, 1:33 AM

#



Model LogisticRegression
    CV scores [0.12244898 0.27083333 0.29166667 0.10416667 0.22916667]
    mean=0.204 std=0.077

Model SVC
    CV scores [0.53061224 0.5625     0.5625     0.39583333 0.625     ]
    mean=0.535 std=0.076

Model DT (prunned=4)
    CV scores [0.42857143 0.52083333 0.52083333 0.4375     0.64583333]
    mean=0.511 std=0.078

#

are these bad numbers?

#

¯_(ツ)_/¯

#

I don't like the data we were given

#

I don't get how I am supposed to build a feature for a categorical data

alpine tiger Feb 15, 2020, 2:43 PM

#

Hey, guys! I'm writing a school paper on a ML project, and I'm a bit confused about the terminology of Hypothesis, Hypothesis Class and Representation.

I have a dataset with a lot of features, though in practice I only use 32-40 variables.
The target value is either a signal (True/1.0) or a background (False/0.0)

I'm using a neural network, with undetermined architecture (Not yet performed Model Selection), though a sigmoid activation in the output node.

With LaTeX notation :

Is it correct to say that my representation is (X^d_i, y_i),
where X is an "n x d"-matrix, y an n-vector, d is an integer in the interval [32, 40], y_i is in {0.0, 1.0} and i ranges from [0, n] where n is the number of samples?

Is my hypothesis class every function : R^d -> [0.0, 1.0]

An instance of (preferably trained) network would be a hypothesis?

uncut shadow Feb 15, 2020, 9:04 PM

#

Hey. I was trying to make my own neural network from scratch, but I have got stuck on a problem which I cannot solve (I have been trying for a few days, but it still doesn't work).
I have a code for a backpropagation:

a1 = np.dot(X, weights1) + b1
hidden = sigmoid(a1)
a2 = np.dot(hidden, weights2.T) + b2
output = sigmoid(a2)
outputs.append(output)

# backpropagation
dloss_yh = - (np.divide(y, output) - np.divide(1 - y, 1 - output))
dloss_y = np.dot(np.divide(1, X.shape[1]), 2*(output - y))
dloss_z2 = dloss_yh * np.dot(output, 1 - output)
dloss_a1 = np.dot(weights2, dloss_z2)
dloss_z1 = np.dot(dloss_a1, np.dot(hidden, 1 - hidden))
dloss_w1 = np.dot(np.divide(1., X.shape[1]), np.dot(dloss_z1, X.T))
dloss_w2 = np.dot(dloss_z2, a1.T)
dloss_b1 = np.dot(dloss_z1, np.ones((dloss_z1.shape[1], 1)))
dloss_b2 = np.dot(dloss_z2, np.ones((dloss_z2.shape[1], 1)))

weights1 -= dloss_w1 * learning_rate
weights2 -= dloss_w2 * learning_rate
b1 -= dloss_b1 * learning_rate
b2 -= dloss_b2 * learning_rate

I wrote it to adjust my weights properly to find the smallest loss. The problem is with numpy and matrices. I get an error when I try to update weights2:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/PC/PycharmProjects/Machine Learning/basic_neural_net.py", line 75, in <module>
    result = fit(X, y, n_epochs=1000)
  File "C:/Users/PC/PycharmProjects/Machine Learning/basic_neural_net.py", line 65, in fit
    weights2 -= dloss_w2 * learning_rate
ValueError: non-broadcastable output operand with shape (1,2) doesn't match the broadcast shape (2,2)

#

How can I change the shape of these losses to make it work?

#

That's what they look like.

📎 unknown.png

#

also X is just a

X = [[1 0]
     [0 0]]

and y is just a

y = [[1]
     [0]]

uncut shadow Feb 15, 2020, 9:40 PM

#

Also when should I use ndarray.T? I mean, is there any specific way to use it to make it work or I have to just transpose matrices randomly?
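A hedged sketch of one way to keep the shapes consistent, not a patch of the code above. It assumes samples are stored as rows, uses hypothetical names (`W1`, `W2`, `dz1`, `dz2`), stores the second weight matrix as `(hidden, outputs)` rather than `(1, 2)`, and uses binary cross-entropy with a sigmoid output. Two things differ from the posted code: the sigmoid derivative is an elementwise product (`*`), not `np.dot`, and the gradient for the second layer uses the hidden activations rather than the pre-activation `a1`. A transpose is only needed where it makes the inner dimensions agree so that each gradient ends up with the same shape as the parameter it updates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, p):
    # Mean binary cross-entropy; eps guards against log(0).
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 0.0]])  # (n, 2): one sample per row
y = np.array([[1.0], [0.0]])            # (n, 1)
n = X.shape[0]

W1 = rng.normal(size=(2, 2))  # (inputs, hidden)
b1 = np.zeros((1, 2))
W2 = rng.normal(size=(2, 1))  # (hidden, outputs)
b2 = np.zeros((1, 1))
learning_rate = 0.5
losses = []

for epoch in range(1000):
    # Forward pass
    hidden = sigmoid(X @ W1 + b1)       # (n, 2)
    output = sigmoid(hidden @ W2 + b2)  # (n, 1)
    losses.append(bce_loss(y, output))

    # Backward pass; each gradient matches its parameter's shape.
    dz2 = (output - y) / n                        # (n, 1)
    dW2 = hidden.T @ dz2                          # (2, n) @ (n, 1) -> (2, 1)
    db2 = dz2.sum(axis=0, keepdims=True)          # (1, 1)
    dz1 = (dz2 @ W2.T) * hidden * (1.0 - hidden)  # elementwise -> (n, 2)
    dW1 = X.T @ dz1                               # (2, n) @ (n, 2) -> (2, 2)
    db1 = dz1.sum(axis=0, keepdims=True)          # (1, 2)

    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2
    b1 -= learning_rate * db1
    b2 -= learning_rate * db2

print(losses[0], losses[-1])
```

Printing `.shape` for every intermediate, as in the comments, is usually the quickest way to see which operand needs the `.T`.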

lapis sequoia Feb 16, 2020, 1:12 PM

#

https://towardsdatascience.com/data-science-career-mistakes-how-to-learn-usable-mathematics-for-the-same-269dd1166263

Medium

Data Science Career Mistakes & How To Learn Usable Mathematics for ...

Thinking of data science as merely a technical profession, like programming, may take you away from your goals.

#

Did some mistakes while starting out in Data Science. Don't want other beginners to do same

tawdry rose Feb 16, 2020, 2:14 PM

#

did you use datacamp ? is it good

lapis sequoia Feb 16, 2020, 2:56 PM

#

@tawdry rose if that question was for me then the answer is "no, I haven't used datacamp"

tawdry rose Feb 16, 2020, 2:57 PM

#

no for everyone

#

including you

granite sierra Feb 16, 2020, 3:01 PM

#

tbh, I've used it a tiny bit, it seems great for learning, but I'm not paying 300$ a year for subscription, if it was paid for, I'd definitely use it

tawdry rose Feb 16, 2020, 3:01 PM

#

courses not very awesome but i liked their project based learning

#

but idunno if its worth it

lapis sequoia Feb 16, 2020, 3:24 PM

#

From my understanding, free stuff is as useful as a course

#

Kaggle Courses and free courses on edX, udacity and coursera

#

from the point of view of my work experience

granite sierra Feb 16, 2020, 3:41 PM

#

Yea that's true, datacamp does have this "skills based learning", where it tests your skills and helps you strengthen your weaker skills, that's kinda useful. Obviously can be done with the free ones as well, but then you have to have a good enough understanding of what you think are your weaknesses and strengths

uncut shadow Feb 16, 2020, 3:53 PM

#

there is a way to get a datacamp subscription for free

#

lol

lapis sequoia Feb 16, 2020, 4:07 PM

#

@granite sierra That's correct. I am aware of my weaknesses:

Python (general)
NumPy (design, data structures and operations/functions/methods)
Pandas (advanced data cleaning and preprocessing)
Statistics (both in general and in Python)

and those all are important things one needs to know for day to day work as data scientist/ML-engineer

I figured this while listening to "Seat next to You" (you know who is behind this song)

granite sierra Feb 16, 2020, 4:11 PM

#

@uncut shadow what's that way to get it for free

uncut shadow Feb 16, 2020, 4:12 PM

#

https://www.quora.com/How-do-I-access-DataCamp-courses-for-free
@granite sierra

granite sierra Feb 16, 2020, 4:13 PM

#

Huh interesting

uncut shadow Feb 16, 2020, 4:13 PM

#

yeah

#

there are also free udemy courses (not only the ones which are always for free). search for "udemy coupons" if you want

#

you might find some interesting ones

tawdry rose Feb 16, 2020, 4:19 PM

#

what is your suggestion about project based learning

#

like datacamp's projects

#

i actually like projects more than courses

lapis sequoia Feb 16, 2020, 4:35 PM

#

Try Kaggle

tawdry rose Feb 16, 2020, 4:36 PM

#

hmm you are so true.

#

what resources you are using

#

on datascience

#

actually im not datascientist im first grade cs student

lapis sequoia Feb 16, 2020, 4:37 PM

#

Focus more on something called "Reproducible Data Science" (you need to know Git and GitHub/GitLab)

#

@tawdry rose, you want to be a data scientist?

#

why not Software Engineer or computer programmer

tawdry rose Feb 16, 2020, 4:37 PM

#

i dunno im trying every field 😄

#

why did you choose data

lapis sequoia Feb 16, 2020, 4:38 PM

#

Because I wanted to change, most of the work in India is service based where you write 100 LoC in a year

#

I worked with a startup, a product based company, and I wrote 1000 LoC a day

#

and I am not an engg grad.

#

I became programmer because I liked Linux along with its all development tools

#

found programming there and started doing it and felt like doing it forever

#

Then I could not find much product based companies (no I was not a genius who got hired by M$/Google from my final year). If I am not writing much code, then why do such a job. Better find one where impact is higher, where I can use technology to solve problems more directly

#

that is where data science came in, I like AI more of course. But for now, I am sticking to data science. All AI of today is ML based and one must have good grounding in data science to make more sense of ML. That is what I believe

#

So yeah...

#

What about you, why you chose CS @tawdry rose

tawdry rose Feb 16, 2020, 4:42 PM

#

because i wanted to be computer scientist and programmer

#

i could go to medicine

#

but i didn't want

lapis sequoia Feb 16, 2020, 4:43 PM

#

may be then you should follow your heart

#

become programmer.

#

Python + Cormen + SICP + Rust is a good combination, since you've still got 4 years before you start looking for a job

tawdry rose Feb 16, 2020, 4:44 PM

#

hmm what is cormen

lapis sequoia Feb 16, 2020, 4:45 PM

#

https://mitpress.mit.edu/books/introduction-algorithms-third-edition

The MIT Press

Introduction to Algorithms, Third Edition

The latest edition of the essential text and professional reference, with substantial new material on such topics as vEB trees, multithreaded algorithms, dynamic programming, and edge-based flow.
Some books on algorithms are rigorous but incomplete; others cove...

#

SICP -- https://en.wikipedia.org/wiki/Structure_and_Interpretation_of_Computer_Programs

Structure and Interpretation of Computer Programs

Structure and Interpretation of Computer Programs (SICP) is a computer science textbook by Massachusetts Institute of Technology (MIT) professors Harold Abelson and Gerald Jay Sussman with Julie Sussman. It is known as the Wizard Book in hacker culture. It teaches fundamental ...

#

dont look anywhere else other than following your heart. You got a long way to go

tawdry rose Feb 16, 2020, 4:46 PM

#

ah i will read 'em thanks 😄

#

this AI looks interesting and cool

#

training machine

#

and artificial intelligence

upbeat jetty Feb 16, 2020, 4:47 PM

#

Ouch, that's pretty hardcore stuff 🙂

#

BTW, i think i saw somewhere on the net a SICP version which uses Python

lapis sequoia Feb 16, 2020, 4:48 PM

#

yeah, there's one with Common Lisp too

tawdry rose Feb 16, 2020, 4:49 PM

#

https://pedrokroger.net/sicp-python/

SICP in Python

Structure and Interpretation of Computer Programs (a.k.a SICP, or “The Wizard Book”) is
considered one of the great computer science books. Some people claim it will
make you a better programmer. It was the entry-level computer science subject at MIT
and it’s still used in uni...

#

maybe this one

lapis sequoia Feb 16, 2020, 4:50 PM

#

unfortunately, the only way to become a good programmer is to do this hard stuff. (sometimes people get lucky from college placements. Lucky in the sense, in college, algorithms are still fresh in your minds)

upbeat jetty Feb 16, 2020, 4:50 PM

#

Probably. Didn't read through it, so take it with a grain of salt and check reviews.

tawdry rose Feb 16, 2020, 4:51 PM

#

actually learning cs in hard way can be little exhausting sometimes 😄

#

im first grade student but still we have 5 assignments (projects with 2-week deadlines) and 6 quizzes (little programming tasks with 2-day deadlines)

upbeat jetty Feb 16, 2020, 4:51 PM

#

Well, while i'm not a programmer, i think if you want to learn software engineering (compared to CS), other books become "bibles"

#

https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882

#

Check that one

#

Balancing practical application and baseline knowledge is hard

#

Loved that book, but the examples are in Java, so some things may be a bit different compared to other languages.

tawdry rose Feb 16, 2020, 4:55 PM

#

seems good book

#

thanks 😄

upbeat jetty Feb 16, 2020, 4:57 PM

#

Also, there's a so-called Gang of Four https://en.wikipedia.org/wiki/Design_Patterns

Design Patterns

Design Patterns: Elements of Reusable Object-Oriented Software (1994) is a software engineering book describing software design patterns. The book was written by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, with a foreword by Grady Booch. The book is divided ...

#

While you probably shouldn't blindly implement everything you see in that book (and esp not in Python, which has its differences), knowing what people mean under certain names (say, singleton) would help understanding code of other people.

#

oops

lapis sequoia Feb 16, 2020, 5:04 PM

#

GoF is high level book. You cant understand it unless you master OOA/M/D first

#

Those are the fundamentals

#

GoF is a fine-fine book, it is good of you to bring it up. Thing is, one first must learn to walk before he starts to run

#

https://www.amazon.com/Object-Oriented-Analysis-Design-Applications-3rd/dp/020189551X

#

This is a prerequisite for that

#

Even any basic OOA/M/D book by great authors like Uncle Bob, Rebecca Wirfs-Brock or James Rumbaugh will be fine

#

I have missed a few authors, you can find them on comp.object

#

May be this discussion should be moved to #algos-and-data-structs

terse crater Feb 16, 2020, 5:11 PM

#

Hey, how do I grab data from video files? Any ideas?

#

It is gameplay footage

#

OCR / ML?

upbeat jetty Feb 16, 2020, 6:16 PM

#

Is it possible to reverse-engineer replay files instead of raw video?

uncut shadow Feb 16, 2020, 7:28 PM

#

Hey. If I have a 2 layer neural network [2, 2, 1] (number of neurons in each layer). What would be the shape of the matrix for biases for hidden layer? The input X has shape (7, 2).
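A small shape check for the [2, 2, 1] case, assuming samples are stored as rows as in the `(7, 2)` input. Under that convention the hidden-layer bias holds one value per hidden unit, so `(1, 2)` (or simply `(2,)`) broadcasts across all 7 rows. The names `W1`, `b1`, `W2`, `b2` are hypothetical; with samples stored as columns the bias would be `(2, 1)` instead.

```python
import numpy as np

X = np.zeros((7, 2))    # 7 samples, 2 input features

W1 = np.zeros((2, 2))   # (inputs, hidden units)
b1 = np.zeros((1, 2))   # one bias per hidden unit
W2 = np.zeros((2, 1))   # (hidden units, outputs)
b2 = np.zeros((1, 1))   # one bias for the single output unit

hidden_pre = X @ W1 + b1           # (7, 2) + (1, 2) broadcasts to (7, 2)
output_pre = hidden_pre @ W2 + b2  # (7, 1) + (1, 1) broadcasts to (7, 1)

print(hidden_pre.shape, output_pre.shape)
```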

oblique belfry Feb 16, 2020, 8:33 PM

#

Has anyone used MXNet? What are your thoughts? I am seeing a lot of interesting blog posts on the platform, but I am not seeing a lot of either research projects or production ready projects. This is a bit concerning.

jolly briar Feb 16, 2020, 11:24 PM

#

I have an SQL group by query that I want to reproduce in pandas - so in the SQL I can create multiple variables as part of the group by operation, but I'm not sure how to do this in pandas.

I'm currently planning to create an assignment for each variable in the SQL groupby, so that's around 10 instances along the lines of

x1 = df.groupby([...]).blah
x2 = df.groupby([...]).blah
...
x10 = df.groupby([...]).blah

whereas the SQL had something along the lines of

select
    count(*) as n_x,
    sum(x1) as sum_x1,
    sum(x2) as sum_x2,
    sum(x3) / count(*) as x3_dens,
    sum(x4) / count(*) as x4_dens,
    sum(x3) / sum(x4) as x3_x4,
    sum(x5) / sum(x6) as x5_x6
from some.table
group by THING

is there a straightforward way to reproduce this in pandas?

jolly briar Feb 16, 2020, 11:54 PM

#

i just created a separate function and applied it to each sub dataframe created by groupby().apply()
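An alternative to `groupby().apply()` that stays closer to the SQL, sketched on made-up data: do one `groupby().agg()` with named aggregations for the counts and sums, then derive the ratios from those aggregated columns. The column names mirror the query above; the values and the intermediate name `agg` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; column names follow the SQL sketch.
df = pd.DataFrame({
    "THING": ["a", "a", "b", "b", "b"],
    "x1": [1, 2, 3, 4, 5],
    "x2": [10, 20, 30, 40, 50],
    "x3": [1, 0, 1, 1, 0],
    "x4": [2, 2, 1, 1, 2],
    "x5": [1, 1, 2, 2, 2],
    "x6": [2, 2, 4, 4, 4],
})

# One pass over the groups: row count plus every sum the query needs.
agg = df.groupby("THING").agg(
    n_x=("x1", "size"),  # size counts rows, like count(*)
    sum_x1=("x1", "sum"),
    sum_x2=("x2", "sum"),
    sum_x3=("x3", "sum"),
    sum_x4=("x4", "sum"),
    sum_x5=("x5", "sum"),
    sum_x6=("x6", "sum"),
)

# Ratios are plain column arithmetic on the aggregated frame.
result = pd.DataFrame({
    "n_x": agg["n_x"],
    "sum_x1": agg["sum_x1"],
    "sum_x2": agg["sum_x2"],
    "x3_dens": agg["sum_x3"] / agg["n_x"],
    "x4_dens": agg["sum_x4"] / agg["n_x"],
    "x3_x4": agg["sum_x3"] / agg["sum_x4"],
    "x5_x6": agg["sum_x5"] / agg["sum_x6"],
})

print(result)
```

This tends to be faster than a Python function applied per group, since the sums are computed by pandas' built-in aggregations.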

worn minnow Feb 17, 2020, 2:45 AM

#

is this a good channel for a bs4 question?

lapis sequoia Feb 17, 2020, 11:02 AM

#

Does anyone use Openturns

#

??

uncut shadow Feb 17, 2020, 2:43 PM

#

Hey. Does anybody know any good tutorial about activation functions? I see that they might change loss a lot so I just want to know which ones to use for a particular problem

#

Also is there anything wrong with using 2 sigmoid functions in 2 layer nn?

uncut shadow Feb 17, 2020, 4:01 PM

#

Because I don't think loss should look like this

📎 loss.png

jolly briar Feb 18, 2020, 12:55 AM

#

when grouping the data is often reduced in size, I'm wondering if it's possible to group data and instead of reducing the size of it introduce duplicates

#

currently I'm merging back in to the original dataframe and introducing dups there anyway

paper niche Feb 18, 2020, 1:25 PM

#

groupby().transform() in pandas?

#

oops a bit late, I realized.
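A minimal sketch of the `transform` idea on made-up data: the group-level value is broadcast back to every original row, so the frame keeps its length and no merge is needed.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 10]})

# transform returns one value per original row, aligned on the index.
df["group_sum"] = df.groupby("key")["value"].transform("sum")

print(df)
```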

bitter skiff Feb 18, 2020, 1:53 PM

#

Hi, somebody used StyleGAN2?
I try to generate a latent space representation out of an image with StyleGAN2.
I original thought that would be covered under "Projecting images to latent space" using "run_projector.py"

But this doesn't seem to generate latent space representations but a lot of png's.
Am I completely on the wrong track or does the projector function generate latent spaces?

trail pagoda Feb 18, 2020, 2:33 PM

#

Transformations and costly i/o operations are an inherent hyperparameter of a model im working on

#

as these transformed data sets are expensive, what is the recommended way to intelligently 'cache' the most used ones so that the model will save and load something it's been asked to do before without filling my hard drive with gigabytes of trash?

#

at the minute I have a folder in my project called 'pickle_jar' which is just a big folder full of serialized objects that can be called on later but it does it with literally everything and isn't sustainable.

oblique belfry Feb 18, 2020, 2:51 PM

#

Can you be a little more specific?

trail pagoda Feb 18, 2020, 7:30 PM

#

the raw data is a text file called a .cif that contains all the information about the crystal structure of some material

#

representing the crystal in a machine-learnable way is an open question and I will be trying a lot of different representations with slight perturbations as my training data to see what works best for a given problem

#

these perturbations are non-trivial and essentially have to be constructed through some CPU-intensive stuff, and as such my model spends more time waiting for data to be prepared than it does training, and GPU utilization is at about 15%

oblique belfry Feb 18, 2020, 8:13 PM

#

@trail pagoda

Yeah...I understand being in that place. I was working on an action recognition project and we had many of the same issues. I think one can run a normal training with basic preprocessing. But if your model requires data from multiple sources and/or requires a LOT of different preprocessing steps, it might be better to do those ahead of time.

Before every training run, we would create a "data cache" for the model to train on. This cache was the data already preprocessed in .hdf5 format. This allowed the model to just load the preprocessed data from the disk in an optimized way. I know creating a lot of these caches can be annoying, but I'd personally rather pony up and get another SSD than have epoch training times on the order of days instead of hours or minutes.

With preprocessing before training, this allowed us to utilize all 24 CPUs on our dev box. Running this script took 10 min instead of 3-4 hours.

So instead of taking days to get results, we spent 15 min prepping the data. Then the first epoch's results came back within 40 min.

We viewed that whole process as a sort of "compilation step" before training.
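For the earlier question about a "pickle_jar" that grows without bound, here is one hedged sketch using only the standard library: key each cached result by a hash of the parameters that produced it, and after every write delete the least recently used files until the folder fits a size budget. The function names (`cache_key`, `prune`, `cached`) and the size limit are hypothetical, and pickle could be swapped for HDF5 or any other format.

```python
import hashlib
import json
import os
import pickle
import tempfile
from pathlib import Path

def cache_key(params):
    # Same parameters (in any key order) always map to the same file name.
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def prune(cache_dir, max_bytes):
    # Delete least recently used files until the folder fits the budget.
    files = sorted(cache_dir.glob("*.pkl"), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in files)
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()

def cached(params, compute, cache_dir, max_bytes=10 * 1024**3):
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (cache_key(params) + ".pkl")
    if path.exists():
        os.utime(path)  # refresh mtime so this entry counts as recently used
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute(params)
    with path.open("wb") as f:
        pickle.dump(result, f)
    prune(cache_dir, max_bytes)
    return result

# Usage: the second call hits the cache even though the key order differs.
calls = []

def expensive(params):
    calls.append(params)
    return [params["scale"] * i for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    first = cached({"scale": 2, "kind": "demo"}, expensive, tmp)
    second = cached({"kind": "demo", "scale": 2}, expensive, tmp)

print(first, len(calls))
```

The parameters must be JSON-serializable for this keying scheme to work, and anything that changes the output (code version, input file hash) would need to be part of `params`.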

hexed rampart Feb 19, 2020, 3:41 AM

#

What exactly is a gate in an LSTM neural network? I could not find a clear answer for this online. From what I understood it is a feed forward neural network whose output is squished through a certain activation function? Thanks in advance.

velvet thorn Feb 19, 2020, 4:09 AM

#

not...really?

#

each LSTM unit has a state, right

#

"gates" are basically rules that affect how that state changes when new data comes in.

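Both descriptions fit the standard LSTM formulation: each gate is a single sigmoid layer over the previous hidden state and the current input, and its output (values between 0 and 1) is multiplied elementwise into the state to decide how much is kept, written, or exposed. A minimal NumPy sketch of one time step, with hypothetical names and random weights standing in for trained ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,);
    the four row blocks hold the forget, input, output and candidate weights.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])          # forget gate: how much old state to keep
    i = sigmoid(z[H:2 * H])      # input gate: how much new candidate to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much state to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate values (not a gate)
    c = f * c_prev + i * g       # gates act by elementwise multiplication
    h = o * np.tanh(c)
    return h, c, (f, i, o)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.normal(size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
x = rng.normal(size=input_size)

h, c, gates = lstm_step(x, np.zeros(hidden_size), np.zeros(hidden_size), W, b)
print(h.shape, c.shape)
```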
lapis sequoia Feb 19, 2020, 4:56 AM

#

anyone wanna team up for hash code

jolly briar Feb 19, 2020, 11:53 AM

#

how to check the kernel that a notebook is using , i'm not sure whether it's using the right env and that's going to make it hard to share with others
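A quick check that can be run in a notebook cell: the interpreter path shows which environment the kernel is actually using. Running `jupyter kernelspec list` in a terminal lists the kernels Jupyter knows about.

```python
import sys

interpreter = sys.executable  # path of the Python binary running this kernel
env_root = sys.prefix         # root folder of the environment it belongs to

print(interpreter)
print(env_root)
```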

lament needle Feb 19, 2020, 12:01 PM

#

any good sources to read transformer ?

crystal sluice Feb 19, 2020, 4:51 PM

#

hey guys, quick question

#

import pandas as pd

file = open('file1.xlsx', 'rb')

df = pd.read_excel(file)

df_media = df.mean()
df_count = df.count()

df_nomes = df['Nome']
df_nome_idade = df[['Nome','Idade']]


filtro = [df['Idade'] > 30]

print(filtro)

This is showing the data like:
1 True
2 False
3 True
4 True

I want it to show the actual values contained in cells

#

any tips?
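A sketch of the usual fix, on made-up rows: the comparison produces a boolean mask, and indexing the dataframe with that mask (instead of wrapping it in a list) returns the matching rows.

```python
import pandas as pd

# Hypothetical rows standing in for the Excel file.
df = pd.DataFrame({
    "Nome": ["Ana", "Bruno", "Carla", "Davi"],
    "Idade": [35, 22, 41, 30],
})

mask = df["Idade"] > 30  # boolean Series: True where the condition holds
filtro = df[mask]        # keeps only the rows where the mask is True

print(filtro)
```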

oblique belfry Feb 19, 2020, 5:45 PM

#

Is there a notable difference using TF/Pytorch with nvidia-docker vs just running it normal? Is there a notable slowdown running models via containers?

vivid cloak Feb 19, 2020, 7:58 PM

#

does anyone know how to find things in a pandas dataframe?

#

like if I want to get the index of something in a column

#

I've looked through the docs section on indexing but didn't come across anything that helped
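One common pattern, sketched on made-up data: build a boolean mask for the column and use it to select from `df.index`. The column names and values here are assumptions.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame(
    {"name": ["x", "y", "z", "y"], "score": [1, 2, 3, 4]},
    index=[10, 11, 12, 13],
)

matches = df.index[df["name"] == "y"]  # index labels of every matching row
first = matches[0]                     # label of the first match
rows = df.loc[df["name"] == "y"]       # the matching rows themselves

print(list(matches), first)
```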

drowsy grove Feb 20, 2020, 12:55 AM

#

Has anyone used pd.read_sql_query with if statements? I keep getting a syntax error.

#

df2 works but can't get df3. Very confusing.

📎 unknown.png

#

I know I can just read the whole table and then just select with pandas functions. But I just want to know what I'm doing wrong here.

#

The error message

📎 unknown.png

drowsy grove Feb 20, 2020, 1:23 AM

#

After getting the data, I need to, per instruction, "removing user PII, while still allowing the application to be hydrated with data for development or testing."

#

I wonder what that means, does it just mean that I need to hash the name column?

#

Thanks.

shell yarrow Feb 20, 2020, 2:34 AM

#

I think if my very old and vague memories of SQL are correct that your if ought to be a WHERE

#

https://www.w3schools.com/SQl/sql_where.asp
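A self-contained sketch of row filtering with `WHERE` through `pd.read_sql_query`. The original query is only in the screenshots, so the table, columns, and in-memory SQLite database here are all assumptions; the `?` placeholder is SQLite's parameter style and differs between database drivers.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ann", 34), ("bob", 19), ("cy", 52)],
)
conn.commit()

# Unfiltered read, then the same read filtered in SQL with WHERE.
df2 = pd.read_sql_query("SELECT * FROM users", conn)
df3 = pd.read_sql_query(
    "SELECT * FROM users WHERE age > ?", conn, params=(30,)
)
conn.close()

print(df3)
```

Passing the value through `params` rather than formatting it into the string also avoids SQL injection.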

drowsy grove Feb 20, 2020, 2:53 AM

#

Jesus. How did I miss that?

#

@shell yarrow I'm thoroughly ashamed.

#

Thank you so much. I will leave my images up there to serve as a reminder for me to be always humble.

shell yarrow Feb 20, 2020, 2:55 AM

#

I've stopped counting my stupid mistakes after passing the 1000000th 🙂

lapis sequoia Feb 20, 2020, 3:13 AM

#

print()

thin terrace Feb 20, 2020, 6:51 AM

#

How can I normalize the features of my dataset in the range of (-0.5, 0.5)? I can only find solutions for between -1 and 1 or 0 and 1.

supple ferry Feb 20, 2020, 7:27 AM

#

@thin terrace , there is an argument called feature_range which you can use to set your range.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
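With scikit-learn this is `MinMaxScaler(feature_range=(-0.5, 0.5)).fit_transform(X)`. The NumPy sketch below spells out the same per-column formula so the arithmetic is visible; the helper name `min_max_scale` is hypothetical.

```python
import numpy as np

def min_max_scale(X, feature_range=(-0.5, 0.5)):
    """Rescale each column of X linearly into feature_range."""
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Constant columns would divide by zero; give them a span of 1
    # so they map to the lower bound.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span * (hi - lo) + lo

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaled = min_max_scale(X)
print(scaled)
```

As with any scaler, the minimum and maximum should come from the training data and be reused for validation and test data.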

thin terrace Feb 20, 2020, 7:28 AM

#

Thanks

pine path Feb 20, 2020, 1:10 PM

#

hey,Im new to datascience ...

#

How do I start learning?

wary apex Feb 20, 2020, 3:47 PM

#

Hey guys, is sentdex a good playlist to go through ? Heard a lot about it

plain turret Feb 20, 2020, 3:51 PM

#

Give it a try to see if it's for you

wary apex Feb 20, 2020, 3:51 PM

#

I see

plain turret Feb 20, 2020, 3:51 PM

#

You only have a youtube video length of time to lose in case not :)

wary apex Feb 20, 2020, 3:52 PM

#

Yeah I'm planning on getting the andrew coursera course as well

#

Got a lot of recommendations for it

plain turret Feb 20, 2020, 3:52 PM

#

Yah

wary apex Feb 20, 2020, 3:53 PM

#

Would I need to separately learn data science or does ML delve into it as well ?

chilly shuttle Feb 20, 2020, 9:58 PM

#

data science delves into ML, not so much other way around

#

5c comment from someone who hires, I see too many people who did some moocs on convnets but have almost 0 understanding of stats. Gotta get that foundation

drowsy grove Feb 21, 2020, 12:42 AM

#

Has anyone dealt with anonymizing personal data or PII before? I'm supposed to remove user PII, while still allowing the application to be hydrated with data for development or testing.

#

Honestly I don't know what that means, but this is the closest I get.

#

The dataframe c below is my result

📎 unknown.png

#

Would this work? I can then just drop the original name column.

chilly shuttle Feb 21, 2020, 4:14 AM

#

it depends on what you need for your models and what requirements you need to meet

#

it's somewhere between difficult and impossible to truly anonymise data in a way that keeps it useful for ML models, and there have recently been some surprising cases of reidentifying supposedly anonymous data

velvet thorn Feb 21, 2020, 5:49 AM

#

good place to start would be deciding on an anonymity metric
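A hedged sketch of the "hash the name column" idea. The screenshot is not available, so the dataframe, column names, and key below are made up. It uses a keyed hash (HMAC) so the tokens cannot be rebuilt by simply hashing a list of known names without the key. This is pseudonymization rather than full anonymization: the remaining columns may still allow re-identification, which is where an agreed anonymity metric comes in.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical key; in practice it would be stored outside the dataset.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"

def pseudonymize(value, key=SECRET_KEY):
    # Same input always gives the same token, so joins and counts still work.
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray", "Ann Lee"],
    "purchases": [3, 1, 7],
})

safe = df.assign(user_token=df["name"].map(pseudonymize)).drop(columns=["name"])
print(safe)
```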

thin terrace Feb 21, 2020, 9:05 AM

#

Hi,

Looking for a way to reshape a B x W x H x 1 grayscale image (np.array) to a B x W x H x 3 RGB image.

I fail miserably in my attempts

chilly shuttle Feb 21, 2020, 9:31 AM

#

https://stackoverflow.com/questions/25876640/subsampling-every-nth-entry-in-a-numpy-array/25876672

Stack Overflow

subsampling every nth entry in a numpy array

I am a beginner with numpy, and I am trying to extract some data from a long numpy array. What I need to do is start from a defined position in my array, and then subsample every nth data point fro...

#

just select 0th 1st and 2nd elements

thin terrace Feb 21, 2020, 9:38 AM

#

It's not that simple, the shape is (1, 28, 28, 1)

#

which means the last 1 which I need to extend to 3 elements is nested deep in 28x28 arrays

#

I guess I can do an ugly nested loop but surely there must be another way

chilly shuttle Feb 21, 2020, 9:55 AM

#

yes...

#

take 3 slices with the method i linked

#

then concat them into 28x28x3

#

DM me your data i'll do it for fun

vital cipher Feb 21, 2020, 12:39 PM

#

hi guys just wanted to know is there any installation steps to download SAS enterprise miner on an ubuntu machine?

#

if so please guide

#

sas 9.4*

velvet thorn Feb 21, 2020, 12:54 PM

#

for that kind of stuff, if I understand correctly, you want np.repeat(a, 3, axis=-1)
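A quick check of the `np.repeat` suggestion on a made-up array with the `(1, 28, 28, 1)` shape mentioned above: repeating along the last axis copies the single gray channel into three identical channels.

```python
import numpy as np

# Hypothetical grayscale batch: B x W x H x 1.
gray = np.arange(28 * 28, dtype=np.float32).reshape(1, 28, 28, 1)

# Repeat the last axis 3 times: B x W x H x 3, all channels equal.
rgb = np.repeat(gray, 3, axis=-1)

print(gray.shape, rgb.shape)
```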

lapis sequoia Feb 21, 2020, 2:00 PM

#

which is the best clustering algorithm when it comes to not knowing how many exact clusters or groups we need?

drowsy grove Feb 21, 2020, 3:58 PM

#

@chilly shuttle Thanks

#

@velvet thorn Does anonymity metric mean how anonymous the data is?

chilly shuttle Feb 21, 2020, 3:59 PM

#

yes but your organisation needs to explicitly define what that is

#

so that when inevitably the data gets leaked and reidentified, you're not on the hook

velvet thorn Feb 21, 2020, 4:03 PM

#

if you don't know how many clusters there are...DBSCAN is nice to start with, I would say

#

can consider hierarchical clustering depending on your use case

#

yes, basically

#

"anonymity" is a nebulous concept - there are several different metrics that aim to objectively represent "how anonymous" some data is

gaunt blade Feb 21, 2020, 4:44 PM

#

I want to try to make crypto price predictor. Anyone has recommendations on what to watch/read/look for? I know how to get history market data and such I am wondering more on ML part, what model to use stuff like that. Preferrably something that'd be good introduction to learning ML etc

lapis sequoia Feb 21, 2020, 5:50 PM

#

can i get some help with R? i need to make a graph based on my data, but i dont know how to do
i have state, city, fatalities, wounded, date as the columns

#

yes where fatalities most likely to occur

#

map of the area? what do you mean?

#

ohh

#

i am a newbie at this, so i would like to try where deaths are most likely to occur with a line graph

#

with states

#

does that make sense?

#

ohh

#

currently, my data is like this:

#

📎 data1x.PNG

#

is there a way in R to total up the fatalities by state?

#

so i can make the x axis

#

thanks!

vast temple Feb 21, 2020, 8:23 PM

#

Hi guys, i have a problem. I have an authors dataset with ~270k names, and another dataset of 1k books with their descriptions. I need to create a new dataframe of the authors found in the book descriptions, and measure the accuracy of that matching. How do i do that? I mean general direction, how to do that kinda stuff. Do i need to use fuzzywuzzy? Do i need to use loops to do that? Do i need to create a very big 'dirty' df with 1k extra rows for each author, and match within?
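One possible direction before reaching for fuzzy matching, sketched with the standard library only: normalize the author names into a lookup dict, then slide short word windows over each description and check them against that dict. This avoids comparing every author with every description. All names and data below are hypothetical, and it only finds exact (normalized) matches; accuracy would still need a hand-labelled sample of descriptions to compare against.

```python
import re

def normalize(text):
    # Lowercase and split into word tokens, dropping punctuation.
    return re.findall(r"\w+", text.lower())

def find_authors(description, author_index, max_words=4):
    """Return the known authors whose normalized name appears in the text."""
    tokens = normalize(description)
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in author_index:
                found.add(author_index[candidate])
    return found

# Hypothetical author list; the real one would have ~270k entries.
authors = ["Jane Austen", "Mark Twain", "Ursula K. Le Guin"]
author_index = {" ".join(normalize(name)): name for name in authors}

description = "A novel by Ursula K. Le Guin, often compared to Mark Twain."
matches = find_authors(description, author_index)
print(sorted(matches))
```

Fuzzy matching could then be limited to the descriptions where this exact pass finds nothing, which keeps the expensive comparisons to a small subset.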

lapis sequoia Feb 21, 2020, 8:47 PM

#

@void anvil in that article, can we put FUN = sum?

#

instead of FUN = mean

rotund knot Feb 21, 2020, 11:00 PM

#

Hi guys, I've got a df that has data stored from twitter scraping, sorted by word and frequency of word. I want to make a front end that will enable a user to search for a keyword, run a python script to append to a bar graph. Is Django my best way forward?

velvet thorn Feb 22, 2020, 3:17 AM

#

@rotund knot it really depends.

#

the fact that you want to display a bar graph makes it a little twitchy

#

you could use Dash

#

which is built for this kind of thing

#

it is possible to use Django

#

or even Flask

#

but then you'd have to code more of the plotting logic yourself

#

so like either Dash alone or Django/Flask/something else with MPL

granite steppe Feb 22, 2020, 4:37 AM

#

hi im just trying to get into data visualization but i just dont know where to start... any suggestions?

shell yarrow Feb 22, 2020, 4:40 AM

#

where do you start?

#

I mean - what's your current situation ? Are you in school or currently working ? Do you have a domain of expertise or trying to gauge possible careers for your higher education ?

granite steppe Feb 22, 2020, 4:43 AM

#

left in 3rd year of uni

#

bscit

#

i dont have a single knowledge in data science @shell yarrow

shell yarrow Feb 22, 2020, 4:45 AM

#

bsc it = bachelor in information tech ?

granite steppe Feb 22, 2020, 4:45 AM

#

y

#

ye

#

most of the time i only did business subjects

shell yarrow Feb 22, 2020, 4:46 AM

#

I'm only gonna be able to give 'spare time / continuous education' kind of advice but...

granite steppe Feb 22, 2020, 4:46 AM

#

sud be fine

#

just throw at me

shell yarrow Feb 22, 2020, 4:47 AM

#

pick a subject matter that interests you and try to find interesting patterns about it

granite steppe Feb 22, 2020, 4:47 AM

#

oh like space

#

got that part

shell yarrow Feb 22, 2020, 4:47 AM

#

I did a couple coursera courses - i'm not good but they were helpful laying out what the field looks like

granite steppe Feb 22, 2020, 4:47 AM

#

i c

#

what about ur math skills?

#

i did math up to high school

shell yarrow Feb 22, 2020, 4:48 AM

#

you can catch up on that (khan academy and many others)

granite steppe Feb 22, 2020, 4:48 AM

#

its still fresh coz it was 2 years ago haha

#

not that long

shell yarrow Feb 22, 2020, 4:49 AM

#

but if you need a formal education / validation, I'm not sure.

#

I don't think the maths are too hard for visualization specifically (you're not expected to do hard stats)

granite steppe Feb 22, 2020, 4:49 AM

#

ohhh

#

i c

#

i might as well start with coursera then

#

thnx heaps bud @shell yarrow

shell yarrow Feb 22, 2020, 4:51 AM

#

hey - also google around as much as you can (trying to avoid buzzfeed and other clickbaits...)

granite steppe Feb 22, 2020, 4:52 AM

#

ye sure thnx for the tips 😄

#

coursera courses for data visualization are not free, gonna look somewhere else haha

#

i feel cheap but it is what it is haha

shell yarrow Feb 22, 2020, 4:57 AM

#

are you working?

#

as in the very judgemental question 'hey do you have a job' ?

#

anyway - youtube has the same for free but you need to search for content

tired copper Feb 22, 2020, 4:59 AM

#

corey schafer's got some good videos on data science

granite steppe Feb 22, 2020, 5:06 AM

#

i have a job but not in IT atm haha and also i worked as a junior ui/ux designer during my uni time

#

does that answer ur question

#

sure will check his video out @tired copper thnx bud

granite steppe Feb 22, 2020, 6:14 AM

#

hi i was wondering, should i be able to use pandas properly before i use matplotlib?

velvet thorn Feb 22, 2020, 7:53 AM

#

hm.

#

preferably.

granite steppe Feb 22, 2020, 8:27 AM

#

oh sweet thnx

rotund knot Feb 22, 2020, 9:43 AM

#

@velvet thorn Thank you most kindly for your advice, I will start with Dash today.

hollow shard Feb 22, 2020, 12:07 PM

#

hello, I've been writing a CNN from scratch to train on the MNIST data, but its been producing strange results, for example the accuracy rising to 30% and then just falling back down to 10%, could anyone please look at my code and find out why this is, because I'm stumped

#

http://dpaste.com/0B3HPW7

#

its very loosely based on this tutorial:

#

https://towardsdatascience.com/a-guide-to-convolutional-neural-networks-from-scratch-f1e3bfc3e2de

Medium

A Guide to Convolutional Neural Networks from Scratch

Convolutional neural networks are the workhorse behind a lot of the progress made in deep learning during the 2010s. These networks have…

tiny flame Feb 22, 2020, 2:46 PM

#

Hello guys

#

Has anyone tried to implement a machine learning algorithm in a language like Scratch

#

In Python it's pretty easy

velvet thorn Feb 22, 2020, 3:00 PM

#

@hollow shard you should probably consider formatting your code better

#

so it's easier to find out what's wrong with it

#

you can check out PEP8

eternal mantle Feb 22, 2020, 4:24 PM

#

I have a pandas DataFrame with three simple columns, plus a separate index/id. One column is a timestamp, stored as the string output of a datetime object. I would like to be able to filter that data by date or time, separately. For example, filter for all rows that occur between this day and that day. Or filter for all rows that occur on any day, but in the morning.

What would be the best way to go about that, assuming very minor knowledge of Pandas? I was originally thinking to split the timestamp into a date, and time. Then make use of the strftime formatting to generate the right string output for writing to disk, and again for reading back into a datetime object when reconstructing the DataFrame

velvet thorn Feb 22, 2020, 4:47 PM

#

I...don't really see what the first part of your question has to do with the second

#

for filtering: .dt accessor

#

for writing/reading, pandas should infer data type automatically, but if it doesn't, pd.to_datetime

eternal mantle Feb 22, 2020, 5:03 PM

#

Hmm, might not have explained that the best. Pandas did not actually determine the datetime format automatically, so I am using pd.to_datetime to create that datetime object. The same in reverse when creating new rows.

I am more wondering if I want to filter the datetime by either the date only or the time only, would it be best to have those columns split or unified?

velvet thorn Feb 22, 2020, 5:10 PM

#

unified.

#

because there is no date type or time type.

#

when loading from disk, did you tell pandas to parse dates?

#

check the documentation; it needs to be enabled
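
A minimal sketch of the advice above, assuming a hypothetical CSV with a single `timestamp` column (the column names and rows are invented for illustration). It keeps date and time unified in one datetime column, asks `read_csv` to parse it, and filters by date range and by time of day:

```python
import io

import pandas as pd

# Hypothetical file contents; in practice this would be a path on disk.
csv_text = """id,timestamp,name,amount
0,2020-01-10 08:15:00,vitamin d,1
1,2020-01-20 21:30:00,zinc,2
2,2020-01-25 09:05:00,vitamin c,1
3,2020-02-03 07:45:00,zinc,1
"""

# parse_dates asks read_csv to convert the named column to datetime64 on load.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["timestamp"])

# Date filter: rows after 2020-01-14 and before 2020-02-01.
after = pd.Timestamp("2020-01-14")
before = pd.Timestamp("2020-02-01")
in_range = df[(df["timestamp"] > after) & (df["timestamp"] < before)]

# Time-of-day filter on any date: rows before noon, via the .dt accessor.
mornings = df[df["timestamp"].dt.hour < 12]

print(in_range.index.tolist())
print(mornings.index.tolist())
```

Because the column stays a single datetime, both filters work on the same data and nothing has to be split or recombined when writing back to disk.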

eternal mantle Feb 22, 2020, 5:15 PM

#

So then with unified, I would essentially combine the command line arguments so that they assemble the proper datetime objects to compare the data against. I'm not sure that makes sense. But I think I understand the basics of how I would filter, just need to figure out how to translate that to code

And looking at the code, it looks like my only argument given to pandas regarding reading data is index_col=0. I could have sworn I used the option for parsing dates at some point. Maybe early on in development. But it doesn't look to be there now. I'll have to mess around with that too

#

Also worth mentioning, I use a slightly different time format but I think it would still be picked up by Pandas auto-detection. Though I understand it's best to explicitly tell Pandas the format so it doesn't waste time trying to guess.

%Y-%m-%d %H:%M:%S is the time format I use. Only real change from ISO is I don't include the timezone info in the middle

velvet thorn Feb 22, 2020, 5:21 PM

#

depends on what your command line arguments look like.

eternal mantle Feb 22, 2020, 5:27 PM

#

At the simplest, I want to be able to select by date. Time ranges haven't been fully decided on. Selection by date would look something like suppylement list --before 2020-02-01 --after 2020-01-14. Supporting a combination of the filters where you can use a combination of --before, --after or --on to select specific ranges.

Time filtering would be similar variations of those arguments. Need to decide if I am going to support the same set of arguments, or different arguments. It may be easiest to just accept a full datetime string and determine which pieces of the datetime it is applicable for. Though I have not had the best luck getting argparse to properly accept spaces in arguments so I may need to fix that, or change the time format in arguments slightly

velvet thorn Feb 22, 2020, 5:42 PM

#

hm

#

okay, two things

#

that kind of filtering is quite trivial, just need to be comfortable with argparse

#

rather than getting argparse to accept spaces, escape the spaces in the arguments you pass to your script.

eternal mantle Feb 22, 2020, 5:47 PM

#

Makes sense. Still figuring out argparse and getting the hang of the finer details. First project I'm trying to tackle on my own so a lot of figuring out and learning new things. I did try some various combinations of escape for arguments coming in from the command line without much luck. Might be worth taking another shot at it as having spaces properly in some args would be very helpful

#

Thanks for all the tips though, def appreciate it.

velvet thorn Feb 22, 2020, 6:14 PM

#

no...

#

what I mean is, for example

#

python script.py --arg a\ b\ c

#

that’s one argument that will be parsed as 'a b c'
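
A small sketch of why this works, assuming a hypothetical `--arg` option. The shell does the splitting before argparse ever sees the arguments; `shlex.split` is used here only to imitate that POSIX shell splitting inside Python:

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
parser.add_argument("--arg")  # hypothetical option name

# Escaped spaces and quotes both produce a single argument containing spaces.
escaped = shlex.split(r"--arg a\ b\ c")
quoted = shlex.split('--arg "a b c"')

print(escaped)
print(parser.parse_args(escaped).arg)
```

If argparse still reports an unrecognized argument, the value is most likely reaching the script already split into several pieces, which points at the quoting on the command line rather than at the parser setup.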

eternal mantle Feb 22, 2020, 6:46 PM

#

I just tried like that, and I get an 'unrecognized argument' error. I believe the issue may stem from the way I set up argparse to handle the modes. It can definitely be done, I did it in the past. Just may need some adjustment to the arg setup for spaces to work. Right now, I cannot enter spaces in arguments -- probably because I am using positional args instead of named args

pulsar stag Feb 22, 2020, 7:31 PM

#

As a programmer/algorithmic trader the majority of my time at work is spent breaking down big data and trying to figure out ways of creating dashboards around this information. With this being said, I've found a tool that I started using before my dashboard creation process to highlight relationships between my data for further investigation. This tool is D-Tale, a Python, React and Flask library built on Plotly & Dash that allows easy data analysis and integrates easily into Jupyter notebooks.

As a fan, I wanted to put together a practice & tutorial on how to use this powerful tool in a comprehensive way so I made this video where I take the Coingecko API to pull all cryptocurrency financial data by date & break it down into price, volume & market cap. Easily adaptable to an endless amount of cryptocurrencies to compare with each other on this tool.

You can find the full tutorial on this subject here:
https://www.youtube.com/watch?v=0RihZNdQc7k&feature=youtu.be

YouTube

Pip Install Python

Pip Install D-Tale: Advanced Python Dashboard

Python Dash Plotly Udemy Course: https://www.youtube.com/redirect?v=psvU4zwO3Ao&event=video_description&redir_token=ie8R3amPq4Qn8G8CvRoYjWuW2L18MTU4MjQ2NTM0OEAxNTgyMzc4OTQ4&q=https%3A%2F%2Fwww.udemy.com%2Fcourse%2Fplotly-dash%2F%3FcouponCode%3DPOTLUCK

---------Useful Links---...


chilly glen Feb 22, 2020, 7:48 PM

#

I have one dumb question not sure if this is the right channel. Why is it important to learn ML from scratch, fundamentals etc since we already have bunch of libraries which are so much helpful that we pretty much don't have to build our own ML algorithm from scratch

PS I'm new to python/ML 😅

hollow shard Feb 22, 2020, 7:52 PM

#

Well because its useful, and for me at least, fun to know how these things work, and the best way to learn how something works is to make it yourself

#

it really provides great insight into how what youll be working with actually functions, which allows you to work more efficiently

#

personally, because its my hobby, i make it a rule to only build stuff from scratch, because i dont see the fun in just writing a few lines and getting results, but i think its good to at least build a simple neural network yourself first @chilly glen

chilly glen Feb 22, 2020, 7:55 PM

#

Ohhh thank you @hollow shard for the honest answer 💯

hollow shard Feb 22, 2020, 7:58 PM

#

Np 👍

chilly glen Feb 22, 2020, 7:59 PM

#

But building libs like tensorflow, numpy from scratch would be pretty tough

#

Right?

hollow shard Feb 22, 2020, 7:59 PM

#

Well I use numpy, but not tensorflow

#

building neural networks is a challenge but not impossible

#

at the end of the day its just simple calculus and its really rewarding, for me at least

#

again, building a normal neural network for mnist with only numpy is practically a rite of passage, but youll most likely want to build cnns and more complex stuff using tensorflow

chilly glen Feb 22, 2020, 8:01 PM

#

Aah ok I'm not too familiar with the jargons

#

lemon_xd

#

Anyway what's the best way to start? I feel like one probably needs to have strong maths

hollow shard Feb 22, 2020, 8:03 PM

#

Well mnist is just a dataset of handwritten numbers

#

one second

#

there are 2 resources that really helped me, 3blue1brown's video series on neural nets, and michael nielsen's book, which comes with code

#

https://m.youtube.com/watch?v=aircAruvnKk

YouTube

3Blue1Brown

But what is a Neural Network? | Deep learning, chapter 1

Home page: https://www.3blue1brown.com/
Brought to you by you: http://3b1b.co/nn1-thanks
Additional funding provided by Amplify Partners

For any early-stage ML entrepreneurs, Amplify would love to hear from you: 3blue1brown@amplifypartners.com

Full playlist: http://3b1b.co/...


#

Nielsens github is linked in the description

chilly glen Feb 22, 2020, 8:05 PM

#

Should I start with neural networks ? Is that a beginning of the roadmap or something ?

hollow shard Feb 22, 2020, 8:06 PM

#

Hm, how much do you know already?

#

i mean i would say yes, but others might say that it would be good to start with simple regression

#

if you really know nothing try the start of andrew ngs machine learning course

chilly glen Feb 22, 2020, 8:10 PM

#

I am a software developer already on the ui side but I know python little bit. Yeah I believe I should start with Coursera Andrew ngs course

hollow shard Feb 22, 2020, 8:11 PM

#

Ok, thats good then

#

its on coursera look ot up

#

*it

jolly briar Feb 22, 2020, 8:32 PM

#

imo simple regression is important, i don't get why people do stuff like GANs and whatever right off the bat... and can't imagine them being any use in the workforce

#

maybe they are though, but I would be surprised i guess

hollow shard Feb 22, 2020, 8:33 PM

#

I think for a lot of people its just a matter of interest

jolly briar Feb 22, 2020, 8:33 PM

#

right - if it's purely interest then fair enough

#

but a lot also say they're interested in work

#

and in the case of the latter, i think they're probably wasting their time

#

that being said - "all roads lead to rome", if someone is having fun and is interested, it's not a waste of time necessarily if it leads to them getting the foundations etc at a later date

hollow shard Feb 22, 2020, 8:38 PM

#

Right, for work purposes i think the majority of cases just need some kind of basic regression

jolly briar Feb 22, 2020, 8:39 PM

#

and ability to actually make datasets

#

that's way more useful for an entry position

hollow shard Feb 22, 2020, 8:40 PM

#

Right, just data processing and visualisation (maybe) is hugely important

#

especially as a lot of peoples knowledge doesn't stretch beyond some simple excel formulas

jolly briar Feb 22, 2020, 8:41 PM

#

yea

#

though index match is probably more useful than GANs for most entry positions

hollow shard Feb 22, 2020, 8:44 PM

#

Im trying to think of an actual scenario where gans would be useful

shell yarrow Feb 22, 2020, 10:33 PM

#

hello data-sciencers, does anyone have a recommendation for a DB when crawling some subreddit on a local computer?

#

i'm only interested in subreddit name, text and timestamps. I don't care (or want) the other metadata

#

the type of operations i'll do with the crawled data are basic NLP things (tokenize, build frequency lists of n-grams, etc.)

eternal mantle Feb 22, 2020, 10:42 PM

#

I've never used it in Python but maybe SQLite could work if you want an actual database

shell yarrow Feb 22, 2020, 10:44 PM

#

here's my consideration, writing to file makes it hard to have several crawlers in parallel

#

so I thought something that'd handle the locks out of the box would help 🙂

#

checking SQLite now and see if it has limitations re- size

#

ok the internet told me it'd be enough for playing at home (I don't intend to have more than 16TB of data for this experiment). Thank you Dexter!

edit: if someone has other considerations, recommendations, I'm all ears. It's still very early in the project and i'm mainly playing around prototypes.
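
For reference, a minimal sketch of the SQLite route using Python's built-in `sqlite3` module. The table and column names are hypothetical, chosen to match the three fields mentioned above, and the rows are made up:

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a crawler would pass a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS posts ("
    " subreddit TEXT NOT NULL,"
    " body TEXT NOT NULL,"
    " created_utc INTEGER NOT NULL)"
)

rows = [
    ("python", "How do I read a CSV?", 1582410000),
    ("datascience", "SQLite or Postgres for a hobby crawl?", 1582410600),
    ("python", "argparse and spaces", 1582411200),
]

# Using the connection as a context manager commits on success, rolls back on error.
with conn:
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)

counts = conn.execute(
    "SELECT subreddit, COUNT(*) FROM posts GROUP BY subreddit ORDER BY subreddit"
).fetchall()
print(counts)
```

SQLite handles file locking itself, but it allows only one writer at a time, so parallel crawlers would take turns writing; for a home-scale experiment that is usually acceptable.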

umbral forge Feb 22, 2020, 11:51 PM

#

Got a question to ask the data scientists here. So let us just say we have an array of X values, each either True or False, randomly distributed. My goal is to perform a search based on a percentage of search coverage, which determines how many indices to skip to achieve that percentage. Of course, if it finds True, then it should stop searching. So the percentage is more or less the maximum search coverage. I don't need it to be a complete linear search. Rather, I just want to check an evenly spread set of samples to see if True is in there.

For example, say I have 100 items, each randomly True or False. If I want to perform a 50% search, then obviously I should be searching at array indices 0, 2, 4, 6, etc., which gives me 50% search coverage. If I have 159 items in an array and I want 30% coverage, then I should be searching roughly 47 or 48 items out of 159. How would I translate that into appropriately evenly distributed indices out of 159? Such an algorithm should obviously work in all scenarios, e.g. an array length of 1314 and 40% coverage will mean skipping X indices to get that coverage.

reef bone Feb 23, 2020, 12:11 AM

#

so we're just trying to calc the step needed between indices? that should be fairly easy to do

umbral forge Feb 23, 2020, 12:11 AM

#

I mean it's really just skip by the 1/coverage, right?

reef bone Feb 23, 2020, 12:12 AM

#

i think so yea

umbral forge Feb 23, 2020, 12:12 AM

#

Bonus point if we do the search front and back towards the middle.

reef bone Feb 23, 2020, 12:13 AM

#

you will need to round somewhere though

#

as you won't always get a whole number

umbral forge Feb 23, 2020, 12:13 AM

#

So basically, assuming it's a 50% coverage... 0, -1, 2, -3, 4, -5... etc?

#

Yeah that's fine.

#

The percentage is just there for a suggestion.

#

Does not have to be exact.

#

Just needs to be close enough.

reef bone Feb 23, 2020, 12:15 AM

#

>>> def get_step(coverage):
...     return int(1 / coverage)
... 
>>> get_step(0.5)
2
>>> get_step(0.25)
4
>>> get_step(0.9)  # Rounding 'error'
1
>>> 
>>> from string import ascii_letters as data
>>> data
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> 
>>> for index in range(0, len(data), get_step(0.33)):
...     print(index, data[index])
... 
0 a
3 d
6 g
9 j
12 m
15 p
18 s
21 v
24 y
27 B
30 E
33 H
36 K
39 N
42 Q
45 T
48 W
51 Z

#

it would be more tricky for bonus points yea, but just for one direction this seems sufficient
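
One possible variation, sketched under the assumption that hitting the requested coverage matters more than using a fixed step: choose how many items to check first, then spread that many indices evenly. The function names are made up for illustration, and `outside_in` is one way to get the front-and-back ordering from the "bonus point":

```python
def sample_indices(length, coverage):
    """Return about coverage * length indices spread evenly over range(length)."""
    if length == 0:
        return []
    count = max(1, round(length * coverage))
    # Integer arithmetic spreads `count` picks over the whole range.
    return [i * length // count for i in range(count)]


def outside_in(indices):
    """Reorder indices so they alternate between the front and the back."""
    ordered = []
    lo, hi = 0, len(indices) - 1
    while lo <= hi:
        ordered.append(indices[lo])
        if lo != hi:
            ordered.append(indices[hi])
        lo += 1
        hi -= 1
    return ordered


def contains_true(flags, coverage):
    """Check the sampled positions, stopping at the first True."""
    order = outside_in(sample_indices(len(flags), coverage))
    return any(flags[i] for i in order)


print(len(sample_indices(159, 0.3)))
print(outside_in([0, 2, 4, 6, 8]))
```

With 159 items and 30% coverage this checks 48 positions, whereas a fixed step of `int(1 / 0.3) == 3` would check 53.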

umbral forge Feb 23, 2020, 12:17 AM

#

Yeah I guess so.

#

From a performance perspective, given a random sample, would searching front and back towards the middle theoretically be faster?

#

or there is negligible difference?

reef bone Feb 23, 2020, 12:19 AM

#

it really depends on the distribution

#

for example, if we had 100 indices, there was only 1 True, and we could only check 10

#

and it's truly random

#

then just checking the first 10 should be just as good as checking every 10th

umbral forge Feb 23, 2020, 12:20 AM

#

Interesting.

#

Ah yes using range()'s step parameter is indeed a good way to do this. Thanks.

reef bone Feb 23, 2020, 12:23 AM

#

yea it's "safe" in that it won't let you step outside of the allowed range

umbral forge Feb 23, 2020, 12:23 AM

#

I don't know why I was having a brain freeze on this.

#

I think I was just thinking too much about the "bonus point" part, haha

reef bone Feb 23, 2020, 12:25 AM

#

it's an interesting question and if you stick around for a while maybe you'll get a response from some of the more statistics-oriented members that lurk here, but if the distribution is truly random then I'm fairly sure it doesn't actually matter how you search

umbral forge Feb 23, 2020, 12:26 AM

#

Right yeah it becomes a question of whether there is any performance to be gained from doing the extra work.

#

Maybe there is not.

#

Assuming a sample size of around 200.

#

If sample size becomes 1000, I wonder if that will make a difference though.

silent swan Feb 23, 2020, 4:51 AM

#

I'm not sure I fully understand the question

#

but if it's truly random it really doesn't matter how you search right?

#

the first K observations will have the same distribution as an evenly distributed K observations

umbral forge Feb 23, 2020, 6:31 AM

#

On that note, is there a better search algorithm for tackling such an issue then? So the perimeter is a random sized array that contains a completely random distribution of True or False. I am not necessarily looking for 100% accuracy when searching this array for True. I just need a search that covers the array in a uniform manner. Maybe it could be 30% coverage, 50% coverage or more. So I guess I was just looking for an efficient way to achieve that @silent swan

elfin hatch Feb 23, 2020, 8:32 AM

#

I've been coding python for a while now but i can't think of any projects to try out (that take a medium to long time)

#

Any help?

uncut shadow Feb 23, 2020, 3:47 PM

#

Hey. I have a small problem, gradient should point to the highest point on the graph but somehow for me it doesn't. What is wrong? https://pastebin.pl/view/13ce5f98
I had to use self.weights -= -np.dot(...) because only then does the loss decrease

Untitled - Pastebin

Pastebin.pl is a website where you can store code/text online for a set period of time and share to anybody on earth

chilly shuttle Feb 23, 2020, 4:28 PM

#

so i haven't tried your actual code

#

but Gradient should point to the highest direction so I should subtract it from weights, but then the loss increases.

#

gradients are not monotonic

#

you can and typically will have a drop in loss function on your way to the global maximum

#

see: local maxima/minima
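
A toy sketch of the sign convention in question, using a made-up one-parameter loss rather than the linked code. The gradient of the loss points toward increasing loss, so the usual update subtracts it:

```python
def loss(w):
    # Simple convex loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the loss with respect to w; it points toward higher loss.
    return 2.0 * (w - 3.0)


w = 0.0
learning_rate = 0.1
history = [loss(w)]
for _ in range(50):
    w -= learning_rate * grad(w)  # step against the gradient
    history.append(loss(w))

print(round(w, 3))
```

If the loss only falls when the sign is flipped, one common cause is that the computed quantity is already the negative gradient, for example when the error term is defined as target minus prediction instead of prediction minus target. That is a guess at a typical cause, not a diagnosis of the pasted code.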

jolly briar Feb 23, 2020, 4:49 PM

#

given a dataframe with a multiindex for columns i want to just select the second level of the index, currently i have [x[1] for x in df.columns], I'm wondering if there's a better way / more pandas-y

stable forum Feb 23, 2020, 6:20 PM

#

@jolly briar df.xs() allows to select data at particular level.
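
For the labels themselves, `get_level_values` is a likely fit, while `xs` selects the data under a given label. A small sketch with made-up column names:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("price", "open"), ("price", "close"), ("volume", "total")]
)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)

# Labels of the second column level, equivalent to [x[1] for x in df.columns].
second_level = df.columns.get_level_values(1)
print(list(second_level))

# xs selects data instead: every column whose second-level label is "close".
close_only = df.xs("close", axis=1, level=1)
print(close_only)
```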

uncut shadow Feb 23, 2020, 10:28 PM

#

Hey. I have a general question. I have seen many times that in deep learning you have to take the sum of weighted inputs. There is one thing, I have never seen in any tutorial/video/repository doing this sum. The only thing people do is
activation(np.dot(X, weights) + bias)
So where is this sum?

fallow vapor Feb 23, 2020, 10:53 PM

#

Does anyone know how to get Vim keybindings in Jupyter?

paper niche Feb 23, 2020, 11:46 PM

#

@fallow vapor there’s an nbextension called “Select Codemirror Keymaps” that does this

fallow vapor Feb 23, 2020, 11:47 PM

#

@paper niche awesome. thank you

velvet thorn Feb 24, 2020, 2:23 AM

#

...in the dot product...?

oblique belfry Feb 24, 2020, 2:54 AM

#

Is Scala good for data science/machine learning? If so, why? I see multiple libraries being written in Scala and I am curious why.

velvet thorn Feb 24, 2020, 3:17 AM

#

Scala is generally better for productionisation than experimentation

#

Spark is written largely in Scala

#

powerful type system leads to stronger compile-time correctness guarantees

oblique belfry Feb 24, 2020, 3:25 AM

#

Gotcha. I see Scala and Spark together a lot. How about for deep learning and neural networks? Mxnet has api bindings for Scala?

I am curious if one could take a trained model, maybe in Onnx, and run it in production in Scala.

silent swan Feb 24, 2020, 7:09 AM

#

@uncut shadow dot product includes a summation
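
To make that concrete, a short sketch with made-up numbers showing that the explicit weighted sum and `np.dot` give the same value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 2.0])
bias = 0.25

# The weighted sum written out: x1*w1 + x2*w2 + x3*w3 + b.
explicit = sum(x_i * w_i for x_i, w_i in zip(x, weights)) + bias

# np.dot multiplies element-wise and sums in a single call.
vectorised = np.dot(x, weights) + bias

print(explicit, vectorised)
```

With a 2-D `X` of shape (samples, features), the same call computes one such sum per row.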

jolly briar Feb 24, 2020, 1:13 PM

#

anyone had <IPython.core.display.Javascript object> appear in notebooks (using gitlab)? They're not there when i run the notebook locally, but appear when I commit to the repo then look at it within gitlab, I don't really get where they're coming from but they're kinda annoying

supple ferry Feb 24, 2020, 4:09 PM

#

@void anvil , considering pandas 1.0 just came out it might take some time

supple ferry Feb 25, 2020, 8:13 AM

#

https://stackoverflow.com/questions/60390097/integer-partitioning-special-case

Stack Overflow

Integer partitioning: Special case

I have a special case of integer partitioning. I have looked at StackOverflow and combined two pieces of code I have found. Yet, it does not fit my purpose well.

I want to have an array of k non-

#

If anyone can give me a helping hand, that would be great

lapis sequoia Feb 25, 2020, 8:15 AM

#

hello everyone, I'm quite new to python (student). Today I got an assignment in which I have to find the repeated values in column H; the rows where the value in H is the same need to be appended as new columns in front of the first such row.
please reply if anybody here can help
if the question is not clear please ask for more clarification

#

this the original csv and some work done

📎 csv.JPG

#

📎 unknown.png

#

in img1 you can see name of person, suppose this name is repeated in the data then all rows where value is true those rows should be get selected

#

and repeated rows should be pasted in front of first value and if more than one then in new columns (i+)

📎 csv2.JPG

summer plover Feb 25, 2020, 8:22 AM

#

I sent spiderMan here because I do not know the tools of the data science, but I know you guys do. if you could please help out that would be great 😄

lapis sequoia Feb 25, 2020, 8:28 AM

#

@summer plover thank so much

vagrant sparrow Feb 25, 2020, 8:33 AM

#

Anyone in here use anaconda? Can you share your experience using it in data science?

dire stirrup Feb 25, 2020, 8:55 AM

#

@vagrant sparrow very convenient

#

most of the ds libraries are alrdy installed in anaconda

vagrant sparrow Feb 25, 2020, 8:58 AM

#

Is it easy to crawl data from youtube comments, twitter, facebook, instagram, etc? Do you have any tips and trick on using anaconda to do that kind of task?

#

@dire stirrup

dire stirrup Feb 25, 2020, 8:59 AM

#

explore beautiful soup @vagrant sparrow

#

they scrape html code parts

#

it is installed in anaconda as well

vagrant sparrow Feb 25, 2020, 9:01 AM

#

Do you recommend to use it with pycharm?

#

@dire stirrup well thanks for your insight and suggestions.. its help alot.. 😀👍

dire stirrup Feb 25, 2020, 9:05 AM

#

Yeah pycharm is fine

lapis sequoia Feb 25, 2020, 9:24 AM

#

here is the csv file if you want to try https://we.tl/t-cL3CcLMP77

Data.zip

1 file sent via WeTransfer, the simplest way to send your files around the world

#

looking for help 👀

velvet thorn Feb 25, 2020, 10:52 AM

#

I actually don't really get what you're trying to do @lapis sequoia

#

do you have an example of what your result should look like

lapis sequoia Feb 25, 2020, 11:27 AM

#

desired output

📎 csv2.JPG

stable forum Feb 25, 2020, 11:47 AM

#

@lapis sequoia Sorry, but the image is unclear as well. What do you mean by "rows need to be appended in new columns in front of the old row"?

#

I mean, when the values are duplicated more than once, do you add the columns, where?

velvet thorn Feb 25, 2020, 12:31 PM

#

yeah, basically, that

#

I'm not sure what you want

lapis sequoia Feb 25, 2020, 12:47 PM

#

@velvet thorn @stable forum please click on open original in the left-bottom of image for clear image

#

when values are found to be duplicated then number of columns == number of times the value is found duplicated

velvet thorn Feb 25, 2020, 12:50 PM

#

no, I am not saying the image is of low quality/resolution

lapis sequoia Feb 25, 2020, 12:50 PM

#

and the values should be pasted into those columns in front of the original row

velvet thorn Feb 25, 2020, 12:50 PM

#

I am saying that your intentions are not obvious from the image

stable forum Feb 25, 2020, 12:50 PM

#

You want to output how many times, the DataFrame['Name on Account'] is duplicated, in new column?

#

Make a simple Excel sheet and just share an image of it, or structure your problem.

#

http://xyproblem.info/

The XY Problem

Asking about your attempted solution rather than your actual problem

#

So in your example, 'cell J' would be the count of item in the .csv, and then it would be the values?

#

And if the count is > 2, it expands horizontally?

lapis sequoia Feb 25, 2020, 12:54 PM

#

yes

#

please allow me to explain

#

as you can see in cell 'H' have values using these values find duplicate

#

if duplicated value found:

#

then copy cell['D','E','F']

#

and paste those values in front of row1

#

📎 csv2.JPG

wide knot Feb 25, 2020, 4:01 PM

#

hi everyone. anyone familiar with scikit's TSNE?

trying to run a 3MB file, but I always run out of memory and my whole computer hangs. 🙂

need ideas on how to handle large datasets for tsne. 😦

jaunty canopy Feb 25, 2020, 4:36 PM

#

reduce n_iter and perplexity. this would reduce exec time and memory usage but the solution would be less valid

#

“Since t-SNE scales quadratically in the number of objects N, its applicability is limited to data sets with only a few thousand input objects; beyond that, learning becomes too slow to be practical (and the memory requirements become too large)”

#

https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

Medium

Visualising high-dimensional datasets using PCA and t-SNE in Python

Update: April 29, 2019. Updated some of the code to not use ggplot but instead use seaborn and matplotlib. I also added an example for a…

wide knot Feb 25, 2020, 4:43 PM

#

@jaunty canopy damn so it's O(n^2) in memory. hmm. right now I'm on perplexity=500, and I still haven't reached the point where the cluster generated is visually appealing.

I guess I'll just reduce my dataset for now (since I plan on increasing the perplexity further). What do you think?

jaunty canopy Feb 25, 2020, 4:45 PM

#

ok

#

but you can try running a PCA first then running a tsne on the output. your choice.

wide knot Feb 25, 2020, 4:48 PM

#

so the thing is my data is only 2 dimensional. is it still advisable to run PCA on this

jaunty canopy Feb 25, 2020, 4:51 PM

#

as i said it all depends on you. but with 2D just reduce the dataset

wide knot Feb 25, 2020, 4:58 PM

#

thanks !

somber lagoon Feb 25, 2020, 5:20 PM

#

so im getting into python. looks like im most interested in processing text. so what kind of career path or jobs should i target?

slim torrent Feb 25, 2020, 8:06 PM

#

ok so after reading countless sql vs nosql comparisons I still have no clue what to use for my project

still abyss Feb 25, 2020, 8:28 PM

#

Hey guys, I'm running Ridge, Lasso and ElasticNet on some data with GridSearchCV. Should I be using the same alpha values for all three?

eternal mantle Feb 25, 2020, 10:37 PM

#

@slim torrent I'd probably just go with SQL then. At least in my case, I already know it. Unless you need specific NoSQL features you can probably do fine without it. Never used SQL with Python myself but I heard PostgreSQL is a popular combo with it. I'm also a fan of SQLite

slim torrent Feb 25, 2020, 10:39 PM

#

@eternal mantle thanks I will keep that in mind and will probably go with sql. although atm I'm trying out mongodb

velvet thorn Feb 26, 2020, 12:52 AM

#

do you intend to hook it up to anything else?

#

like say a web framework or a cloud computing service etc.

lapis sequoia Feb 26, 2020, 2:30 AM

#

@velvet thorn hey man

#

do you know if there's any open source serving option.. for building ML applications

#

like you know how we have kaggle kernels.. there's repos, an environment, hosting for notebooks, etc

jolly briar Feb 26, 2020, 2:50 AM

#

how to replace chars in strings by dict

say

d = {'k':'t', 'a' : 't'}
df = pd.DataFrame({'a': ['dog', 'kat'], 'b' : [1,2]})

i want to do some

df.a.replace(d)

such that kat is converted to ttt

#

( here d contains a single value - i would like this to extend to as many as is needed )

lapis sequoia Feb 26, 2020, 2:58 AM

#

have you considered string translate

#

@jolly briar

jolly briar Feb 26, 2020, 2:59 AM

#

@lapis sequoia no, what is it

#

i just looped over a dict

lapis sequoia Feb 26, 2020, 3:00 AM

#

show me

#

oh you're working with dfs too

jolly briar Feb 26, 2020, 3:00 AM

#

show you what?

#

i know looping works , ive given an example

#

@lapis sequoia updated example

#

obviously you can just do

for key in d:
    df['a'] = df['a'].replace(regex=key, value=d[key])

#

@lapis sequoia do you know or not?

lapis sequoia Feb 26, 2020, 3:07 AM

#

im not sure what you're trying to do here

jolly briar Feb 26, 2020, 3:07 AM

#

i don't know how you couldn't

lapis sequoia Feb 26, 2020, 3:07 AM

#

replacing characters in a column with a dict?

jolly briar Feb 26, 2020, 3:08 AM

#

yes, i don't know what is unclear from the example

lapis sequoia Feb 26, 2020, 3:08 AM

#

if you could show me your expected input and output..

jolly briar Feb 26, 2020, 3:08 AM

#

i have

#

have you looked at the above example?

lapis sequoia Feb 26, 2020, 3:08 AM

#

I think I may be able to suggest an easy way

#

ok wait

jolly briar Feb 26, 2020, 3:08 AM

#

🙄

lapis sequoia Feb 26, 2020, 3:08 AM

#

I meant, an example df

jolly briar Feb 26, 2020, 3:08 AM

#

look up ffs

lapis sequoia Feb 26, 2020, 3:09 AM

#

df = pd.DataFrame({'a': ['dog', 'kat'], 'b' : [1,2]})

#

this?

jolly briar Feb 26, 2020, 3:09 AM

#

that would be a dataframe, yes

lapis sequoia Feb 26, 2020, 3:09 AM

#

ok, and what do you want to do here

jolly briar Feb 26, 2020, 3:09 AM

#

jesus

#

read or just leave it lol

lapis sequoia Feb 26, 2020, 3:17 AM

#

hey man.. this is a very roundabout way of doing this

#

dont write shit code and expect people to understand without telling them what you want to do

#

ideally you should have done something like

jolly briar Feb 26, 2020, 3:19 AM

#

@lapis sequoia explain how it's unclear then

#

rather than failing to read for 10 minutes

#

dont write shit code
it's an example
without telling them what you want to do
it's specified in the example, if ... you... read

lapis sequoia Feb 26, 2020, 3:21 AM

#

translate_dict = {'a': 'X', 'b': 'Y'}
translate_table = str.maketrans(translate_dict)
df["col1"] = df["col1"].str.translate(translate_table)

#

it's hard to read, because I was trying to wrap my head around why someone would do that.. try to understand

jolly briar Feb 26, 2020, 3:21 AM

#

well ask a question then

lapis sequoia Feb 26, 2020, 3:21 AM

#

and an example means, actual sample input and output

jolly briar Feb 26, 2020, 3:21 AM

#

don't imply it's not clear when it is

lapis sequoia Feb 26, 2020, 3:22 AM

#

yes, that's perception.. to you it's clear because you wrote it.. in a way that's not optimal..

jolly briar Feb 26, 2020, 3:22 AM

#

no

#

obviously i'm not going to give all the bloody context in a MWE

#

it has data, and what needs to be done with it

#

it doesn't get a fat lot clearer

jolly briar Feb 26, 2020, 3:45 AM

#

@lapis sequoia I just want to make this completely clear:

an example means, actual sample input

d = {'k':'t', 'a' : 't'}
df = pd.DataFrame({'a': ['dog', 'kat'], 'b' : [1,2]})

** and output

i want to do some

df.a.replace(d)

such that kat is converted to ttt

there may be somethings that I missed, but you completely failed to
highlight any of them and instead made requests such as

show me

there's an example...

im not sure what you're trying to do here

it's explained

replacing characters in a column with a dict?

like in the example? ofc...

if you could show me your expected input and output..

like what's in the example?

I meant, an example df

the one in the example?

ok, and what do you want to do here

perhaps what's in the example?

etc.

lapis sequoia Feb 26, 2020, 3:46 AM

#

calm down man

jolly briar Feb 26, 2020, 3:47 AM

#

@lapis sequoia read, man

velvet thorn Feb 26, 2020, 3:50 AM

#

uh

#

.str.replace with custom function?

jolly briar Feb 26, 2020, 3:52 AM

#

i didn't know str.replace took a custom function - you mean like apply etc?

velvet thorn Feb 26, 2020, 3:52 AM

#

no.

jolly briar Feb 26, 2020, 3:52 AM

#

hrm

velvet thorn Feb 26, 2020, 3:52 AM

#

d = {'k':'t', 'a' : 't'}
df = pd.DataFrame({'a': ['dog', 'kat'], 'b' : [1, 2]})
regex = '|'.join(d)

df['a'].str.replace(regex, lambda match: d[match.group()], regex=True)

#

output:

0    dog
1    ttt
Name: a, dtype: object
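
For single-character mappings like the `d` above, the `str.translate` route mentioned earlier is another option. This is only a minimal sketch: it assumes every key in `d` is exactly one character (which `str.maketrans` requires), and the names `table` and `result` are made up for illustration.

```python
import pandas as pd

d = {'k': 't', 'a': 't'}
df = pd.DataFrame({'a': ['dog', 'kat'], 'b': [1, 2]})

# str.maketrans accepts a dict whose keys are single characters
table = str.maketrans(d)

# Series.str.translate applies the table to every string in the column
result = df['a'].str.translate(table)
print(result.tolist())
```

If the keys can be longer than one character, the regex version above is the one to use, ideally with `re.escape` applied to each key before joining them.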

jolly briar Feb 26, 2020, 3:53 AM

#

@velvet thorn that makes sense

#

not sure if i prefer it to looping over the dict or not though now

#

i thought there was something more 'inbuilt' for this i guess

velvet thorn Feb 26, 2020, 4:43 AM

#

well, there's an obvious difference

#

but anyway it doesn't seem like a common use case to me

umbral forge Feb 26, 2020, 4:56 AM

#

Don't want to hijack the current conversation here but I just put a question in #help-falafel that's data-science related if anybody here can help me 🙂

quartz stream Feb 26, 2020, 7:27 AM

#

Anyone interested in creating a AI model where we can check if the site is phishing or not based on database from https://www.phishtank.com/developer_info.php

#

So it would learn the features of what does a phishing site look like and it would detect the website which are not yet in the database but have similar characteristics to a phishing website

trim ridge Feb 26, 2020, 7:37 AM

#

The only real indicator of a phishing site in the data is the URL containing or imitating another site's, which can be checked with an algorithm. The timestamps don't tell an AI much, and neither does the RIR, since servers are everywhere. Seems overkill.

quartz stream Feb 26, 2020, 7:45 AM

#

What kinda algorithm?

#

@trim ridge

trim ridge Feb 26, 2020, 7:56 AM

#

For determining new phishing sites I would do this.

Go through each organisation and check whether one of its keywords is in the URL or in any text/header tag in the HTML. Probably do this with regex.

I would check if the page has a <form> with an action attribute alongside a username/password input. Then compare the URL/IP of the action to the organisation it's mimicking.

Only consider the RIR if it's Chinese (APNIC).

Probably add some sort of checklist; if the page doesn't meet a set number of criteria then put it up for human review.
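
The checklist above could be sketched roughly like this, using only the standard library. Everything named here (`LoginFormFinder`, `phishing_score`, the example URLs and the `examplebank` brand) is hypothetical, the RIR check is left out, and a real tool would need far more care than three boolean checks.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LoginFormFinder(HTMLParser):
    """Collect the action of every <form> that contains a password input."""

    def __init__(self):
        super().__init__()
        self._in_form = False
        self._action = ""
        self._has_password = False
        self.login_actions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self._action = attrs.get("action") or ""
            self._has_password = False
        elif tag == "input" and self._in_form:
            if (attrs.get("type") or "").lower() == "password":
                self._has_password = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if self._has_password:
                self.login_actions.append(self._action)
            self._in_form = False


def phishing_score(url, html, keyword, official_domain):
    """Return how many of the three checks fire (0 to 3)."""

    def is_official(candidate_url):
        host = urlparse(candidate_url).hostname or ""
        return host == official_domain or host.endswith("." + official_domain)

    # 1. The brand keyword shows up in the URL or page, but not on the official host
    mentions_brand = bool(re.search(re.escape(keyword), url + " " + html, re.IGNORECASE))
    brand_mismatch = mentions_brand and not is_official(url)

    # 2. The page contains a form with a password input
    finder = LoginFormFinder()
    finder.feed(html)
    has_login_form = bool(finder.login_actions)

    # 3. A login form submits to a host other than the official domain
    posts_elsewhere = any(
        not is_official(urljoin(url, action)) for action in finder.login_actions
    )

    return sum([brand_mismatch, has_login_form, posts_elsewhere])


page = (
    '<h1>Examplebank login</h1><form action="/collect">'
    '<input type="text" name="user"><input type="password" name="pw"></form>'
)
score = phishing_score("http://examplebank-secure.test/login", page,
                       "examplebank", "examplebank.com")
legit_score = phishing_score("https://www.examplebank.com/login", page,
                             "examplebank", "examplebank.com")
print(score, legit_score)
```

Pages scoring in the middle of the range would be the natural candidates for the human review step.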

quartz stream Feb 26, 2020, 9:28 AM

#

Thanks a ton !

#

There should be probably a library which does the same

#

LOL

#

@trim ridge

trim ridge Feb 26, 2020, 9:44 AM

#

Seems like a useful tool, will create if not exists.

jolly briar Feb 26, 2020, 10:36 AM

#

@velvet thorn obvious difference in what sense? I get that they're different, i'm not sure what you're referring to though. I doubt it's a common use case, but it's one that I have 🙃

lyric kernel Feb 26, 2020, 5:49 PM

#

does anyone know of some minimal example code for GANs? I want to use a more abstract example as sample code because I can't use the hundreds of lines of code used for the original implementations of the papers.

rigid summit Feb 26, 2020, 9:55 PM

#

Hey, for some reason the results of my kruskal test doesn't display in my console, is there anything I can do to print the statistic and p value?

#

era_900_1100 = df.loc[(df['expected_recovery_amount']<1100) & (df['expected_recovery_amount']>=900)]

by_recovery_strategy = era_900_1100.groupby(['recovery_strategy'])
by_recovery_strategy['age'].describe().unstack()

Level_0_age = era_900_1100.loc[df['recovery_strategy']=="Level 0 Recovery"]['age']
Level_1_age = era_900_1100.loc[df['recovery_strategy']=="Level 1 Recovery"]['age']
stats.kruskal(Level_0_age,Level_1_age)

#

it runs, but nothing shows up in the console

plain turret Feb 26, 2020, 10:47 PM

#

this will print in Jupyter Notebooks but not for regular console if i'm not mistaken

#

try printing it with print()
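
A small sketch of what that looks like, assuming SciPy is installed. The two lists are made-up ages standing in for `Level_0_age` and `Level_1_age`.

```python
from scipy import stats

# Made-up ages standing in for Level_0_age and Level_1_age
level_0_age = [25, 31, 28, 40, 35, 29]
level_1_age = [27, 33, 30, 45, 38, 41]

result = stats.kruskal(level_0_age, level_1_age)

# When the file runs as a script, a bare expression is evaluated and its value
# is thrown away, so the result has to be printed explicitly.
print("statistic:", result.statistic)
print("p-value:", result.pvalue)
```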

cedar briar Feb 26, 2020, 11:19 PM

#

visdom or tensorboard and why?

jolly briar Feb 27, 2020, 12:24 AM

#

https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Export-to-Excel

anyone had any joy with highlighting cells?

#

i just get this kinda thing

📎 unknown.png

kind hollow Feb 27, 2020, 12:25 AM

#

is it me or is that not yellow CBPikaThink

jolly briar Feb 27, 2020, 12:25 AM

#

its not yellow

oblique belfry Feb 27, 2020, 3:39 PM

#

I started at a new company on Monday. When I was hired, I understood the situation that they had basic ML in place but were looking to take things up a notch. After being here most of the week, that is not the case. Their "ML" is just some thresholds. Actually, they offer 4 "ML" offerings, but only one of the offerings is used in production. I asked if they record the decisions of the ML (They save the data, run the ML, and get alerts if the threshold has been hit.), and they said no. They have no clue if what they are offering is even valid.

Given all that, I know I have a long way to go with this company. What are some valuable things I should know in building out this infrastructure? Obviously I want to start logging all these actions for the future so that I can run A/B tests on models and what not.

I guess my thing is, I know what I need to do for my job, but transforming a business into a data-driven business is a tall task and I want to make sure I am not forgetting anything. Also, it is a good way to share common best practices.

upbeat jetty Feb 27, 2020, 5:24 PM

#

What are your favorite packaging practices? Basically, if you saw some console program written in Python for processing/visualising data that you want to use, how would you like it to be packaged and organised? I'm asking here because data science is close to the intended audience.

carmine forge Feb 27, 2020, 9:01 PM

#

hello, can anyone share an impressive jupyter notebook, something visually appealing and scientific in nature, preferably something related to geology or natural sciences

lapis sequoia Feb 28, 2020, 1:55 AM

#

@oblique belfry tell me more

#

you need infrastructure in place for versioning models, recording results, verifying them and updating models..

grand copper Feb 28, 2020, 6:56 AM

#

Hey, does somebody know how to find a second dominant frequency in a signal?

lapis sequoia Feb 28, 2020, 7:57 AM

#

do you have the formula for getting the dominant frequencies

#

@grand copper

grand copper Feb 28, 2020, 7:58 AM

#

Not at the moment. Couldn't really understand it.

grand copper Feb 28, 2020, 8:15 AM

#

czestotliwosc, Data = wav.read(root.filename)
if len(Data.shape) == 2:
    Data = Data[:, 0]
dlugosc = len(Data)
okres_cz = 1.0 / czestotliwosc
sek = dlugosc / float(czestotliwosc)
czas = np.arange(0, sek, okres_cz)
# Fourier transform
FFT = np.abs(fft(Data))
FFT_side = FFT[range(dlugosc // 2)]
czest = np.fft.fftfreq(Data.size, d=(czas[1] - czas[0]))

# Find the maximum frequency in the signal
pos_mask = np.where(czest > 0)
czest_prob = czest[pos_mask]
max = czest_prob[FFT[pos_mask].argmax()]
max = int(max)
print(max)

#

"czestotliwosc" is rate, "dlugosc" is length, "okres_cz" is a period? (you see it in code, 1 divided by frequency/rate)

#

"sek", I have no idea, myself.

#

I'm on the phone, that's why I haven't translated in code.

lapis sequoia Feb 28, 2020, 10:05 AM

#

well the formatting isn't helpful either

#

gonna need comments or at least a formula
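
One possible way to get the second dominant frequency, sketched with NumPy only on a synthetic signal (the sample rate, the 50 Hz and 120 Hz tones, and all variable names are assumptions for the demo). The idea is to rank the local maxima of the magnitude spectrum instead of taking a single `argmax`.

```python
import numpy as np

rate = 1000                      # samples per second (assumed)
t = np.arange(rate) / rate       # one second of samples
# Synthetic test signal: a strong 50 Hz tone plus a weaker 120 Hz tone
signal = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / rate)

# Local maxima: interior bins that are larger than both neighbours
is_peak = (magnitude[1:-1] > magnitude[:-2]) & (magnitude[1:-1] > magnitude[2:])
peak_bins = np.where(is_peak)[0] + 1

# Sort the peaks from strongest to weakest and read off the first two
ranked = peak_bins[np.argsort(magnitude[peak_bins])[::-1]]
dominant, second = freqs[ranked[0]], freqs[ranked[1]]
print(dominant, second)
```

Ranking peaks rather than raw bins matters on real recordings, where the bins right next to the strongest peak are often larger than the second tone. `scipy.signal.find_peaks` is a more configurable alternative if SciPy is available.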

lapis sequoia Feb 28, 2020, 2:26 PM

#

I am trying to use the AdaBoostRegressor with scikit optimize and it needs the predict method to also return the std dev of y at X , I am using the ExtraTreesRegressor as the base estimator

oblique belfry Feb 28, 2020, 4:22 PM

#

@lapis sequoia Sorry. I fell asleep and it has been a crazy morning.

When I first took over the job, I was concerned about model versioning and whatnot. But, that was under the assumption they were already recording all that they were doing.

#

But even with their crude "ml", they do not record the output of their predictions. How can you implement ML if you do not even know what the baseline is?

#

Just a big culture shift.

#

Wasn't prepared for that.

lapis sequoia Feb 28, 2020, 5:03 PM

#

https://www.reddit.com/r/learnmachinelearning/comments/faxrpr/where_can_i_learn_about_activation_functions_loss/. Can anybody give suggestion for this questions which might help.Thanks!

strange stag Feb 29, 2020, 4:04 AM

#

mk, so i have df1 with the keys ['direction','Exp Date','Name','Exp Time','Price'] and df2 with the keys ['Name','Exp Time','Exp Value','Exp Date'] and i am trying to merge these two dataframes, the only difference being that the columns direction and Price, which are not existent in df2

#

.merge and df10 = pd.concat(frames, keys=['Name','Exp Time','Exp Value','Exp Date','Price', 'direction']) not exactly working

velvet thorn Feb 29, 2020, 5:53 AM

#

merge how

#

I’m assuming you want to combine the rows?

#

and the values in the columns from the second dataframe will be null?

strange stag Feb 29, 2020, 6:01 AM

#

lolz... nvm was concating the wrong dfs..

strange stag Feb 29, 2020, 8:17 AM

#

@velvet thorn you still there?

#

anyone know why im getting a Unable to allocate array with shape (23980000,) and data type int64
when trying to df = df.drop(df[df['Type'] == 'Spread'].index) with a jupyter notebook
I know the df is huge (282mb) worth of lines, because im reading in 180-ish csvs with 10-20k lines each

#

2,995,051 rows

#

stackoverflow suggests 64bit python, but i am using this...so

#

perhaps it is because i am doing this...

df = pd.DataFrame()
for csv_file in csv_files:
    df = df.append(pd.read_csv(csv_file))

#

df.info(memory_usage='deep')
says the memory usage is 1.0GB, so this is by no means out of my computing power

velvet thorn Feb 29, 2020, 8:24 AM

#

in IPython?

#

or just a normal interpreter

strange stag Feb 29, 2020, 8:24 AM

#

jupyter notebook

velvet thorn Feb 29, 2020, 8:24 AM

#

in Jupyter stuff hangs around

#

how about this?

#

pd.concat([pd.read_csv(filename) for filename in csv_files])

strange stag Feb 29, 2020, 8:25 AM

#

MemoryError: Unable to allocate array with shape (6200000,) and data type int64

#

when running
df = df.drop(df[df['Type'] == 'Spread'].index)

#

perhaps i should use a loop instead?

#

hmm, samething with a loop

#

just a lesser shape

#

for x in range(len(df)):
    if df[df['Type'][x] == 'Spread']:
        df = df.drop(df[df['Type'] == 'Spread'].index)
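
For comparison, a sketch of the same filtering done with a boolean mask, which avoids building an index to drop. The in-memory CSVs and their columns are made up. One thing worth noting: frames appended or concatenated without `ignore_index=True` keep each file's own 0..n-1 index, so `drop(...index)` removes every row sharing those labels, not just the 'Spread' ones.

```python
import io
import pandas as pd

# Two in-memory CSVs standing in for the real files (hypothetical data)
csv_files = [
    io.StringIO("Name,Type,Price\nA,Spread,1\nB,Binary,2\n"),
    io.StringIO("Name,Type,Price\nC,Binary,3\nD,Spread,4\n"),
]

# Read everything, then concatenate in a single call;
# ignore_index=True gives the combined frame a fresh 0..n-1 index.
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

# Keep only the rows whose Type is not 'Spread'
df = df[df['Type'] != 'Spread']
print(df)
```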

ripe forge Feb 29, 2020, 8:43 AM

#

@strange stag can you check what python is running just to be safe? What os do you use?

#

If you're on windows or Linux, run import platform; platform.architecture()

strange stag Feb 29, 2020, 8:45 AM

#

('32bit', 'WindowsPE')

#

notebook using 32bit?

#

hmm, didnt error when i ran it outside of the notebook

ripe forge Feb 29, 2020, 11:32 AM

#

You most likely have multiple python installs and your notebook is launching with the crappy one

#

Remove the 32 bit python, and only use 64 bit
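
To confirm which interpreter the notebook kernel actually launched, something like this stdlib-only sketch can be pasted into a cell (the variable name `bits` is just for illustration):

```python
import platform
import struct
import sys

# Size of a pointer in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one
bits = struct.calcsize("P") * 8

print("executable:", sys.executable)   # path of the Python running this code
print("architecture:", platform.architecture()[0])
print("pointer size:", bits, "bits")
print("64-bit:", sys.maxsize > 2**32)
```

A 32-bit process is limited to a few GB of address space at most, which would explain allocation failures well below the machine's installed RAM.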

jovial river Feb 29, 2020, 6:59 PM

#

How can I graph a multiple linear regression model? I am having trouble with this because when I go to graph, it tells me that x and y has to be the same size. How can I make them be the same size then?

#

Here are my x and y variables. Data is a pandas dataframe.

y = data['mpg']
x = data[['cyl', 'disp']]
x_train,x_test,y_train,y_test=train_test_split(x,y) # by default will do a 25,75 split for testing and training respectively.
x_train

plt.scatter(x_test, y_test, label='Testing Set') # Error not the same size.
plt.plot(x_test, efficiency_y_pred_model_1, label='Model 1', color = 'orange', linewidth=2)
plt.plot(x_test, efficiency_y_pred_model_2, label='Model 2', color = 'red', linewidth=2)
plt.xlabel('Cyl')
plt.ylabel('Mpg')
plt.title('ROC curve')
plt.legend(loc="best")
plt.show()
print('MSE for model1: {0}'.format(MSE_model1))
print('MSE for model2: {0}'.format(MSE_model2))

strange stag Feb 29, 2020, 7:37 PM

#

anyone know a faster way to do this?

for x in range(len(df)):
    df.loc[x, 'Exp Time'] = datetime.strptime(df.loc[x, 'Exp Time'].split(' ', 1)[1], '%I:%M %p').strftime('%H:%M')
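
A vectorised version of that loop might look like the sketch below. The sample values are hypothetical and only assume the shape the loop implies: some prefix, a space, then a 12-hour time such as `02:30 PM`.

```python
import pandas as pd

# Hypothetical values shaped like "<prefix> HH:MM AM/PM"
df = pd.DataFrame({'Exp Time': ['Sat 02:30 PM', 'Sat 09:05 AM']})

# Drop everything up to the first space, as split(' ', 1)[1] does in the loop
time_part = df['Exp Time'].str.split(' ', n=1).str[1]

# Parse the 12-hour clock once for the whole column, then format as 24-hour
df['Exp Time'] = pd.to_datetime(time_part, format='%I:%M %p').dt.strftime('%H:%M')
print(df['Exp Time'].tolist())
```

This replaces one `.loc` write per row with three column-level operations, which is usually where the speedup comes from.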

analog dawn Feb 29, 2020, 7:52 PM

#

HELLO GUYS DO YOU KNOW ANY GOOD COURSE THAT ARE COMPLETE DATA SCIENCE ?

velvet thorn Mar 1, 2020, 2:25 AM

#

what exactly do you want to do @strange stag

#

@jovial river well, a plot only has 2 axes, but you want to graph 3 different variables (two features and one target)

sand fractal Mar 1, 2020, 3:06 AM

#

Got any good guides for making a word frequency generator?

#

I feed it a CSV file full of comments. I would feed dictionary so it can account for different words

jovial river Mar 1, 2020, 3:15 AM

#

@gm ya I figured I would have to either graph it as a three dimensional plot or plot each feature separately.

velvet thorn Mar 1, 2020, 5:42 AM

#

you could use colour for the target
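
A rough sketch of the colour idea, assuming matplotlib is installed. The six rows are made up and only reuse the column names `cyl`, `disp` and `mpg` from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few made-up rows standing in for the real data
data = pd.DataFrame({
    'cyl': [4, 4, 6, 6, 8, 8],
    'disp': [108.0, 120.3, 160.0, 225.0, 360.0, 440.0],
    'mpg': [22.8, 21.5, 21.0, 18.1, 14.3, 14.7],
})

fig, ax = plt.subplots()
# Two features on the axes, the target encoded as colour
points = ax.scatter(data['cyl'], data['disp'], c=data['mpg'], cmap='viridis')
fig.colorbar(points, ax=ax, label='mpg')
ax.set_xlabel('cyl')
ax.set_ylabel('disp')
```

Call `fig.savefig(...)` or switch to an interactive backend and use `plt.show()` to view the result.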

hollow shard Mar 1, 2020, 1:08 PM

#

hello, I've been writing a CNN from scratch to train on the MNIST data, but its been producing strange results, for example the accuracy rising to 30% and then just falling back down to 10%, could anyone please look at my code and find out why this is, because I'm stumped
http://dpaste.com/0SY73M0 (code updated to follow pep8)
its very loosely based on this tutorial:
https://towardsdatascience.com/a-guide-to-convolutional-neural-networks-from-scratch-f1e3bfc3e2de

Medium

A Guide to Convolutional Neural Networks from Scratch

Convolutional neural networks are the workhorse behind a lot of the progress made in deep learning during the 2010s. These networks have…

uncut shadow Mar 1, 2020, 1:40 PM

#

Hey. I have a question. How to update biases? I initialize them all to be 0 at start, but how do I have to update them during backpropagation? This is my code so you can run it and check https://repl.it/repls/SnoopyGleefulTab

repl.it

SnoopyGleefulTab

Repl.it is a simple yet powerful online IDE, Editor, Compiler, Interpreter, and REPL. Code, compile, run, and host in 50+ programming languages: Clojure, Haskell, Kotlin (beta), QBasic, Forth, LOLCODE, BrainF, Emoticon, Bloop, Unlambda, JavaScript, CoffeeScript, Scheme, APL, L...

hollow shard Mar 1, 2020, 2:51 PM

#

I believe it would just be dfunction in your dense code, but also don't initialize weights or biases as zeros

#

also @uncut shadow if you want I have complete code of a neural network from scratch if you want
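
For a plain dense layer, the bias gradient is the upstream gradient summed over the batch. Here is a minimal NumPy sketch with made-up data, a single linear layer, and mean squared error; it is not taken from the linked repl. As far as I know, zero biases are a common starting point, and it is the weights that need small random values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))              # batch of 8 samples, 3 features
y = rng.normal(size=(8, 2))              # 2 targets per sample

W = rng.normal(scale=0.1, size=(3, 2))   # small random weights
b = np.zeros(2)                          # biases start at zero
lr = 0.1


def loss(W, b):
    return np.mean((X @ W + b - y) ** 2)


before = loss(W, b)
for _ in range(100):
    out = X @ W + b                      # forward pass
    d_out = 2 * (out - y) / out.size     # dLoss/d_out for mean squared error
    dW = X.T @ d_out                     # gradient w.r.t. the weights
    db = d_out.sum(axis=0)               # gradient w.r.t. the biases: sum over the batch
    W -= lr * dW
    b -= lr * db
after = loss(W, b)
print(before, after)
```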

hot badger Mar 1, 2020, 3:09 PM

#

I want to save data which is in train which has 5 rows and 12 columns into csv file how to do it?

oblique belfry Mar 1, 2020, 3:24 PM

#

df.to_csv
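
Spelled out a little, assuming `train` is a pandas DataFrame (the small frame and the output path below are placeholders):

```python
import os
import tempfile
import pandas as pd

# Stand-in for the `train` frame (the real one has 5 rows and 12 columns)
train = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': list('vwxyz')})

path = os.path.join(tempfile.gettempdir(), 'train.csv')  # hypothetical output path
train.to_csv(path, index=False)   # index=False leaves out the row-number column

print(pd.read_csv(path).shape)
```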

harsh sapphire Mar 1, 2020, 4:22 PM

#

Hi! I have a question related to memory for Data Cleaning. Right now I'm running an iterated for loop across a DataFrame that stores audio files. Everything in the for loop is working perfectly. It pulls a file splits it into time segments and generates and saves a spectrogram for each audio file. However, I can't seem to figure out what it is holding in memory. I have plt.close and soundfile.close() after my data read ins and image generation.

I keep getting crashes after it consumes about 14GB of RAM over 2~3 minutes.

It's 50+ lines but I can post or message the code if you need it.

#

Also I have an ipywidget for variable inspection, which doesn't show any objects stored in memory

oblique belfry Mar 1, 2020, 4:42 PM

#

https://developers.google.com/machine-learning/guides/rules-of-ml

Very good guide by the Google team to develop ML infrastructure in a business from scratch.

Google Developers

Rules of Machine Learning: | ML Universal Guides | Google Devel...

uncut shadow Mar 1, 2020, 6:41 PM

#

@hollow shard Yes, I'd like to see those nets from scratch If you can show the repo, thanks

jolly briar Mar 1, 2020, 8:58 PM

#

anyone have any approach of keeping papers they download organised? my downloaded pdfs are a bit of a mess...

late jackal Mar 1, 2020, 10:14 PM

#

would this be the proper channel to ask about dbscan sklearn?

worn stratus Mar 1, 2020, 10:24 PM

#

yes @late jackal

late jackal Mar 1, 2020, 10:30 PM

#

i have some clusters that i plotted but there seems to be lots of noise would you know how i can make a second plot that doesn't include the noise so that i can see the clusters more easily
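
scikit-learn's DBSCAN labels noise points as `-1`, so one way to get a noise-free plot is to mask them out before plotting. A NumPy-only sketch, with `X` and `labels` as made-up stand-ins for the real data and for `DBSCAN(...).fit(X).labels_`:

```python
import numpy as np

# Stand-ins: X is the (n_samples, 2) array that was clustered,
# labels is what DBSCAN(...).fit(X).labels_ would return (noise is -1).
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.0, -3.0]])
labels = np.array([0, 0, 1, 1, -1])

clustered = labels != -1            # boolean mask that is False for noise points
X_clean, labels_clean = X[clustered], labels[clustered]
print(X_clean.shape, labels_clean)
```

The second plot would then be something like `plt.scatter(X_clean[:, 0], X_clean[:, 1], c=labels_clean)`.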

lapis sequoia Mar 1, 2020, 10:56 PM

#

If we're allowed to ask questions here then I'd really appreciate if anyone has any insight to my issue in #help-croissant

oblique belfry Mar 2, 2020, 12:48 AM

#

@jolly briar

I used to keep them all in Dropbox. Now I just keep a private Git repo with notes on them.

jolly briar Mar 2, 2020, 1:23 AM

#

@oblique belfry hrm, yeah i do something similar.

something that i've made good use of recently (couple of months or so) is the
following:

make_note() {
    pushd ~/<where i store my notes>
    clear
    vim daily-notes/$(date +'%d-%m-%Y'.md) -c 'Goyo'
    git add .
    git commit -m "notes update"
    git push
    popd
    clear
}

search_notes() {
    # search through daily notes dir
    egrep -rni ~/<where i store my notes> -e "$1" --color=auto
}

with aliases for each of them.

oblique belfry Mar 2, 2020, 1:27 AM

#

Nice. How does the search work.

#

Does egret look through the files?

#

Sorry. Egrep...texting on the new iPad is a new experience.

jolly briar Mar 2, 2020, 1:42 AM

#

@oblique belfry yeah it looks through all the files

#

haven't got enough for it to be an issue yet speed wise, it's been handy though

oblique belfry Mar 2, 2020, 1:43 AM

#

Nice. I’ll be stealing those bash commands. Lol

I wish there was an easier way.

jolly briar Mar 2, 2020, 1:43 AM

#

tbh i'm not sure if there could be an easier way

#

I mean - here we've just chained a few commands, which is the beauty of terminal stuff i guess

#

people probably use something like evernote or whatever for less than this provides 🤔

#

idk though, as I've never used anything like that lol

oblique belfry Mar 2, 2020, 1:44 AM

#

I tried Dropbox paper since I like markdown.

jolly briar Mar 2, 2020, 1:44 AM

#

yeah i take all these in md

#

and i've been tagging them as well at the end - and making sure (when possible) everything is on one line

#

so that a search brings up the context

#

currently i have 33 files and 2585 lines, apparently...

lapis sequoia Mar 2, 2020, 1:48 AM

#

heh

#

im not in the right place by the looks of it because im nowhere on this level

jolly briar Mar 2, 2020, 1:48 AM

#

@lapis sequoia steal my commands, now you're on the level 🤝

lapis sequoia Mar 2, 2020, 1:49 AM

#

I would if I could understand them XD

#

I do computer science GCSE and we've just been introduced to python

#

currently we are doing an exam which is just mainly text file handling, organising data in a text file and recalling the data back and sorting it

jolly briar Mar 2, 2020, 1:50 AM

#

this isn't python fwiw - these are just a couple of functions that you could put in your bash/zsh rc file

#

exam sounds pretty pragmatic

lapis sequoia Mar 2, 2020, 10:14 AM

#

Is it possible for me to create a dataframe that contains a bunch of different categories, then add in only a few specific categories at a time while leaving the other blank until later? I have a large datasets that I need. The page I'm scraping has a built in html table for some of the data, but not for all of it. Could I grab that table, insert the appropriate data, then leave the rest blank until later?
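
That should be possible. One way, sketched with hypothetical column names, is to reindex the scraped table onto the full column list so the missing columns exist but hold NaN until they are filled:

```python
import pandas as pd

# Hypothetical columns: two come from the page's HTML table, two are filled in later
columns = ['name', 'price', 'rating', 'review_count']

# Stand-in for the table grabbed from the page
scraped = pd.DataFrame({'name': ['widget', 'gadget'], 'price': [9.99, 24.50]})

df = scraped.reindex(columns=columns)   # rating and review_count start as NaN
df['rating'] = [4.5, 3.9]               # fill one of the blank columns later
print(df)
```

`pd.read_html` can usually pull an HTML table straight into a DataFrame, which would replace the hand-built `scraped` frame here.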

weary finch Mar 2, 2020, 5:13 PM

#

Hey guys, looking to learn how to continuously retrain and redeploy ML models in production as new data becomes available. I have very good skills with scraping and would like to use them as a means to update the data to retrain models with.

Does anyone have any ideas of websites I could scrape off to do this?

copper umbra Mar 2, 2020, 7:15 PM

#

Can anyone here help me with a seaborn visual problem

#

📎 unknown.png

#

normal output looks like this

#

📎 unknown.png

#

but then when i try to make the line width determined by an integer value (removing the # from the code), it does this

📎 unknown.png

hollow shard Mar 2, 2020, 7:20 PM

#

Anyone ever seen a cnn behave like this before? (accuracy vs batches trained on)

📎 Figure_1.png

#

the dataset is mnist btw and a keras model with the same architecture got 98% accuracy

uncut shadow Mar 2, 2020, 9:56 PM

#

Hey. I was trying to make a neural network from scratch. The problem is, that it doesn't work like it should. I mean, it's not very accurate. Could somebody check it and suggest what should I change or add?
https://repl.it/repls/HorizontalWarmheartedDegrees

repl.it

HorizontalWarmheartedDegrees

proud iron Mar 3, 2020, 1:41 AM

#

hello, is this a place where i can ask for help regarding coding of machine learning?

#

or do i ask in the many #help channels

velvet thorn Mar 3, 2020, 2:38 AM

#

@proud iron depends on how in-depth your question is

#

simple questions/theoretical questions fit here, I think

proud iron Mar 3, 2020, 2:39 AM

#

ah i see

#

it's posted on #help-apple

#

but it's basically me having trouble with label encoder from sklearn library

#

would a question like that fit here?

#

i'm just making sure, as it seems like this is a place for discussions rather than help

velvet thorn Mar 3, 2020, 2:40 AM

#

yes, that would be fine

#

I think

#

but anyway, the reason is that you're using the wrong class

#

I think what you want is OneHotEncoder

#

LabelEncoder is for targets

#

not features
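
A small sketch of the difference, assuming scikit-learn is installed. The `colour` feature and the `target` column are made up.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature column and target column
colour = np.array([['red'], ['green'], ['blue'], ['green']])   # 2-D: (n_samples, n_features)
target = np.array(['cat', 'dog', 'cat', 'dog'])                # 1-D

# Features: one binary column per category
X = OneHotEncoder().fit_transform(colour).toarray()

# Target: one integer per class
y = LabelEncoder().fit_transform(target)
print(X.shape, y)
```

`LabelEncoder` only takes a 1-D array, which is a frequent source of errors when it is pointed at a 2-D block of feature columns.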

#

@uncut shadow honestly, I don't think many people will be willing to go through your code step by step and find out what exactly is wrong with it

#

it's quite a high-level bug

proud iron Mar 3, 2020, 2:42 AM

#

ah ok thank you!

#

gotta do more reading

lapis sequoia Mar 3, 2020, 11:51 AM

#

I'm making one dict from csv,

#

but getting all values of column in one key

#

but i want only that row should be get values in dict

#

here my code

📎 Dict.JPG

#

please open in original for high quality img

#

sorry if i make any mistake , i'm quite new into this

#

not sure what you're trying to do

#

you want to load your csv as a dict?

#

then?

#

i want to make dict,

📎 Dict1.JPG

#

here column Vincode should be key and other rows as values

velvet thorn Mar 3, 2020, 12:00 PM

#

hm.

lapis sequoia Mar 3, 2020, 12:01 PM

#

actually i'm working on one assignment

📎 Quest.JPG

#

please read this

velvet thorn Mar 3, 2020, 12:01 PM

#

if I understand correctly...

#

you probably want a comprehension.

#

{row['VINCODE']: [row['District'], row['Taluka'], row['VIL_NAME']] for _, row in df.iterrows()}

lapis sequoia Mar 3, 2020, 12:03 PM

#

please read this

actually i'm working on one assignment
@lapis sequoia

#

if I understand correctly...
@velvet thorn

📎 Out.JPG

#

output

#

here how to get individual key and it's value here
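
Once the dict is built, a single entry is looked up by its key. A sketch with made-up rows that reuse the column names from the comprehension above (`lookup` is a hypothetical name):

```python
import pandas as pd

# Made-up rows using the column names from the comprehension above
df = pd.DataFrame({
    'VINCODE': [101, 102],
    'District': ['North', 'South'],
    'Taluka': ['T1', 'T2'],
    'VIL_NAME': ['Alpha', 'Beta'],
})

lookup = {
    row['VINCODE']: [row['District'], row['Taluka'], row['VIL_NAME']]
    for _, row in df.iterrows()
}

print(lookup[101])                  # the values stored for one key
for vincode, values in lookup.items():
    print(vincode, values)          # every key with its values
```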

#

ok I got it

velvet thorn Mar 3, 2020, 12:25 PM

#

@lapis sequoia don't ping me thanks

lapis sequoia Mar 3, 2020, 12:26 PM

#

sorry

late garnet Mar 3, 2020, 12:39 PM

#

I would love to have anyone interested in time series data mining to check out this article and GitHub repository.

If you like what we are providing, a simple github star goes a long way!

“How To Painlessly Analyze Your Time Series” by Andrew Van Benschoten https://link.medium.com/ADrCfELCt4

https://github.com/matrix-profile-foundation/matrixprofile

Medium

How To Painlessly Analyze Your Time Series

An introduction to MPA: the Matrix Profile API

GitHub

matrix-profile-foundation/matrixprofile

A Python 2 and 3 library making time series data mining tasks utilizing matrix profile algorithms accessible to everyone. - matrix-profile-foundation/matrixprofile

velvet thorn Mar 3, 2020, 12:43 PM

#

@late garnet didn't you post this already?

late garnet Mar 3, 2020, 12:44 PM

#

I had a typo, deleted it and reposted instead of editing. I apologize for the spam.

polar acorn Mar 3, 2020, 12:48 PM

#

How similar is it to stumpy?

late garnet Mar 3, 2020, 12:50 PM

#

It is similar with different goals in mind. Stumpy is particular about what implementations it offers while our library is more full-featured. We try to make the barrier to entry low. Basically, you can treat the algorithms like a black box and just review the results. Stumpy requires some academic/technical understanding of the underlying algorithms prior to usage. For more details, you can read the article linked above.

polar acorn Mar 3, 2020, 12:54 PM

#

Looks nice. Are you (as in the organisation) affiliated with research environment that published all those matrix profile papers?

late garnet Mar 3, 2020, 12:55 PM

#

We are not directly affiliated with that group, however we do have web meetings with Eamonn at times and are provided early research results.

polar acorn Mar 3, 2020, 1:00 PM

#

Seems nice. I considered matrix profiles for my current time series problem but looked elsewhere due to d>>>n, but I hope I get an excuse to try it out soon.

late garnet Mar 3, 2020, 1:00 PM

#

d>>>n?

polar acorn Mar 3, 2020, 1:05 PM

#

Many many more features than samples, highly correlated as well.

late garnet Mar 3, 2020, 1:05 PM

#

I see

#

A GitHub star goes a long way in showing your interest in our work. 🙂 It is also highly appreciated!

#

Andrew and I are the original maintainers of the Target repository "matrixprofile-ts".

lyric kernel Mar 3, 2020, 3:13 PM

#

Hey guys,
what is the "by design" way of providing a trained model to make predictions inside another application ? Like a Windowsforms app or something like that ?
Is it always : Set up an API ?

eager heath Mar 3, 2020, 3:16 PM

#

Not always, you can actually deploy the model

lyric kernel Mar 3, 2020, 3:17 PM

#

Can you point to any resources about that ?
The official Tensorflow stuff includes some weird docker & Kubernetes method

#

https://www.tensorflow.org/tfx/serving/serving_basic

TensorFlow

Serving a TensorFlow Model | TFX

late garnet Mar 3, 2020, 3:19 PM

#

Your question is highly dependent on the use case. Are you hoping to integrate your model into a web application, desktop application, mobile application or make a library for many people to use?

lyric kernel Mar 3, 2020, 3:27 PM

#

Lets say a desktop application. Im mostly looking for a point to read all about it. Not 1 specific solution. When i was looking by myself i didnt find what i was looking for

timber shadow Mar 3, 2020, 4:26 PM

#

Can someone help me with why my code isn't plotting anything on the graph?

#

https://pastebin.com/fAFZHp17

Pastebin

[Python] # -*- coding: utf-8 -*- """ Created on Thu Feb 13 20:23:...

#

I currently only have one set of values, but even that isn't being plotted

hollow shard Mar 3, 2020, 5:33 PM

#

@timber shadow if you just want to plot one point use plt.scatter instead of plt.plot

#

also ur .append's are out of the loop

#

idk if thats an issue

timber shadow Mar 3, 2020, 5:35 PM

#

Thanks, the issue was really silly in the end

#

It’s basically that I was only doing one set of data to start, then check it all works

#

But obviously it’s a line graph so it needed a second set of values to plot

hollow shard Mar 3, 2020, 5:36 PM

#

right yeah

#

np 👍

rigid summit Mar 3, 2020, 7:13 PM

#

Hey! I'm wondering how I get a $278 value from this regression - I'll explain more if needed:

📎 actualrecoveryregression2results.png

#

The idea is that the actual_recovery_amount is greater past the $1000 expected_recovery_amount threshold. I'm just unclear how my course got a $278 difference from this output

#

The cost of recovery past $1000 is $50, if that helps

strange stag Mar 3, 2020, 10:21 PM

#

What i mean by merge is... subtract the larger dataframe from the smaller giving me a new dataframe, that consists only of rows that are present in both dataframes
how do i merge these two dataframes...?
df1 ~> Index(['Name', 'Exp Time', 'Exp Value', 'Exp Date', 'Strike'], dtype='object')
df2 ~> Index(['direction', 'Exp Date', 'Name', 'Exp Time', 'Strike'], dtype='object')

df1 having the column Exp Value that does not exist in df2
df2 having the column direction that does not exist in df1

rigid summit Mar 3, 2020, 10:23 PM

#

You need to use pandas .concat like this:

#

pd.concat([df1, df2], join="inner")

#

the "inner" means just the ones that match

strange stag Mar 3, 2020, 10:25 PM

#

ah okay

rigid summit Mar 3, 2020, 10:25 PM

#

Any insight on my stats?

strange stag Mar 3, 2020, 10:25 PM

#

hmm

rigid summit Mar 3, 2020, 10:25 PM

#

Didn't work?

strange stag Mar 3, 2020, 10:25 PM

#

think i need to match on columns

#

and na... not a clue on your problem 😕

#

might have to give it more of a think, but i really dont think i can help w/ that

#

however, in regards to myself, i really am grateful for you giving me a direction on where/what to look for / do

rigid summit Mar 3, 2020, 10:28 PM

#

I think I must just be brain dead. It's the coefficient that gives the amount changed per one unit, so I guess I multiply that by 1000?

#

No worries man, that's the first time I could help someone, I'm learning too

strange stag Mar 3, 2020, 10:30 PM

#

ye, i literlly have no idea what ur talking about lols...

rigid summit Mar 3, 2020, 10:31 PM

#

I'm pretty sure that's it. It's like a dollar change - for $1 of expected amount recovered, $2.something is actually recovered, past the threshold. So for the $1000 it's just $2.something times 1000

#

Must be some variance, and my course answer is just off because of that. But that is dangerous thinking in math derp

strange stag Mar 3, 2020, 10:36 PM

#

ima be selfish and ask if u can help me with my q tho 😛

#

df3 = pd.concat([df, signals_df], join="inner", axis=1, objs=['direction', 'Name', 'Exp Value', 'Exp Date', 'Exp Time', 'Strike'])
TypeError: concat() got multiple values for argument 'objs'

rigid summit Mar 3, 2020, 10:37 PM

#

I don't think you need the objs

#

just the simple one I put up

#

It should merge the right ones auto

strange stag Mar 3, 2020, 10:38 PM

#

does not tho

#

well

#

lemme see if im blind

#

ye..
len(df3) = 1980875
len(df) = 1979663
len(signals_df) = 1212

rigid summit Mar 3, 2020, 10:40 PM

#

try:

strange stag Mar 3, 2020, 10:40 PM

#

so its adding

rigid summit Mar 3, 2020, 10:40 PM

#


pd.concat(objs, axis=0, join='inner', ignore_index=False, keys=None,
          levels=None, names=None, verify_integrity=False, copy=True)

#

oops

#

not outer inner

strange stag Mar 3, 2020, 10:42 PM

#

df5 = pd.concat([signals_df, df], objs=k, axis=0, join='inner', ignore_index=False, keys=None,
          levels=None, names=None, verify_integrity=False, copy=True)

TypeError: concat() got multiple values for argument 'objs'

#

k = set of keys from df & signals_df

rigid summit Mar 3, 2020, 10:43 PM

#

Not sure man, without fucking around with the data

I was just reviewing this:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

#

and this:

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

strange stag Mar 3, 2020, 10:43 PM

#

ight, nw, ill figure it out 🙂

strange stag Mar 3, 2020, 11:00 PM

#

@rigid summit .combine_first o.O

rigid summit Mar 3, 2020, 11:01 PM

#

You got? 👍

strange stag Mar 3, 2020, 11:02 PM

#

dono, crashed my kernel, trying again 😛

#

darn, nvm Update null elements with value in the same location in other.

#

this seems more promising df5 = pd.merge(df, signals_df, on=['Exp Date', 'Name', 'Exp Time'])
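For reference, `pd.concat` with the default `axis=0` stacks rows (which is why `len(df3)` came out as the sum of both lengths), and `join="inner"` only controls which columns are kept. Matching rows on shared key columns is what `pd.merge` does. Below is a minimal sketch of that merge on tiny made-up frames: the values are invented and only the column names come from the chat. Adding `Strike` to the keys is an assumption; it exists in both frames and would otherwise come back as `Strike_x`/`Strike_y`.

```python
import pandas as pd

# Toy stand-ins for df and signals_df; all values are invented for illustration.
df = pd.DataFrame({
    "Name": ["AAA", "BBB", "CCC"],
    "Exp Time": ["10:00", "10:00", "11:00"],
    "Exp Value": [1.5, 2.5, 3.5],
    "Exp Date": ["2020-03-03"] * 3,
    "Strike": [100, 200, 300],
})
signals_df = pd.DataFrame({
    "direction": ["up", "down"],
    "Exp Date": ["2020-03-03", "2020-03-03"],
    "Name": ["BBB", "ZZZ"],
    "Exp Time": ["10:00", "10:00"],
    "Strike": [200, 999],
})

# Assumption: Strike is part of the key as well as the three columns from the chat.
keys = ["Exp Date", "Name", "Exp Time", "Strike"]

# how="inner" keeps only rows whose key values appear in both frames.
df5 = pd.merge(df, signals_df, on=keys, how="inner")
print(df5)
```

Only the `BBB` row survives here, carrying both `Exp Value` from the first frame and `direction` from the second.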

lyric kernel Mar 4, 2020, 1:47 PM

#

When i load data into a df . Then do something alike this:

new_df = df[['Column1', 'Column 3']]
df = []

does this free up the memory of the complete df that i stored at the beginning ? Prob not. How could i achieve this though ?

lapis sequoia Mar 4, 2020, 1:49 PM

#

you can delete the df

lyric kernel Mar 4, 2020, 1:49 PM

#

but does it free the memory ;/?

lapis sequoia Mar 4, 2020, 1:49 PM

#

import gc

#

gc.collect()

#

after that.. and yes
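A minimal sketch of the delete-then-collect suggestion, on a made-up frame. The `.copy()` is an assumption added here so the subset does not depend on the original frame; whether the process's reported memory drops right away also depends on the allocator and the OS.

```python
import gc

import pandas as pd

# Made-up frame standing in for the large one loaded at the start.
df = pd.DataFrame({
    "Column1": range(5),
    "Column 2": range(5),
    "Column 3": range(5),
})

# Double brackets select several columns; .copy() detaches the result from df.
new_df = df[["Column1", "Column 3"]].copy()

# Drop the only reference to the full frame, then ask the collector to run.
del df
gc.collect()

print(new_df.columns.tolist())
```

Rebinding the name (`df = []`) also releases the old frame as long as nothing else references it; `del` just makes the intent explicit.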

lyric kernel Mar 4, 2020, 1:49 PM

#

lemme check

lapis sequoia Mar 4, 2020, 1:49 PM

#

you can also, reduce memory usage by assigning fixed data types for your columns

lyric kernel Mar 4, 2020, 1:50 PM

#

hm ?

lapis sequoia Mar 4, 2020, 1:50 PM

#

df[col].astype(np.int8)

#

8, 16, 32, 64.. same with float but float 16, 32, 64

#

you need to check the min max of each column then assign the smallest of these types that the range fits in
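The min/max check can be sketched like this. The frame and its column names are invented; `pd.to_numeric(..., downcast="integer")` is shown as an alternative that picks the smallest fitting integer type itself.

```python
import numpy as np
import pandas as pd

# Invented example data; pandas stores these as int64/float64 by default.
df = pd.DataFrame({
    "small": [0, 50, 100],
    "big": [0, 1000, 70000],
    "ratio": [0.1, 0.5, 0.9],
})

# Compare the column's range against what the target type can hold.
print(df["small"].min(), df["small"].max())
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)

# 0..100 fits in int8, so the manual cast is safe here.
df["small"] = df["small"].astype(np.int8)

# Let pandas choose the smallest integer type that holds every value.
df["big"] = pd.to_numeric(df["big"], downcast="integer")

# Halving float precision is a trade-off, not a free win.
df["ratio"] = df["ratio"].astype(np.float32)

print(df.dtypes)
```

Casting without checking the range first can silently wrap values, which is why the min/max step comes before `astype`.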

lyric kernel Mar 4, 2020, 1:52 PM

#

Hm would i check the min max before loading ?

lapis sequoia Mar 4, 2020, 1:52 PM

#

i meant after loading

#

then whenever you do operations, you'll be handling relatively lighter data

#

you can also take a look at dask

#

im gonna sleep now

#

byeee

lyric kernel Mar 4, 2020, 1:53 PM

#

thx !

lapis sequoia Mar 4, 2020, 1:54 PM

#

np

blissful pike Mar 4, 2020, 4:04 PM

#

for ML, anyone wanna give a detailed explanation between pytorch and tensorflow? which one should i pick up and should i eventually pick up both for ML?

late drum Mar 4, 2020, 9:18 PM

#

anyone familiar with openpyxl? It seems to refuse to close files it loads so raises errors when used in temporary directories

#

hmm, seems to only happen with "read_only" mode

merry portal Mar 5, 2020, 2:49 AM

#

So I finally figured out how to take a series, and create a multiplication table with it. For given numeric series nums, the below code works, but I'm wondering if there is builtin way to do this. Or if numpy has something
pd.DataFrame([nums]*nums.shape[0], index=nums.index).mul(nums, axis=0)

velvet thorn Mar 5, 2020, 3:16 AM

#

@merry portal what do you mean by a multiplication table?

#

like this?

1 2 3 4  5
1 2 3 4  5
2 4 6 8  10
3 6 9 12 15

#

@lyric kernel why do you want to free memory manually?

merry portal Mar 5, 2020, 4:16 AM

#

@velvet thorn yes, exactly

#

velvet thorn Mar 5, 2020, 4:28 AM

#

ah, okay, I see

#

that's simple

#

>>> a = np.arange(4)
>>> np.multiply.outer(a, a)
array([[0, 0, 0, 0],
       [0, 1, 2, 3],
       [0, 2, 4, 6],
       [0, 3, 6, 9]])

merry portal Mar 5, 2020, 6:25 AM

#

Ah yes, that's it. Then you could pass it into the DataFrame constructor along with the original series as index and columns, and everything is still magic. Thanks!
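Putting the two messages together, a small sketch with an invented series `nums`: the outer product supplies the values and the series' own index labels both axes.

```python
import numpy as np
import pandas as pd

# Invented numeric series with a non-default index.
nums = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Outer product of the values, labelled on both axes by the series index.
values = nums.to_numpy()
table = pd.DataFrame(
    np.multiply.outer(values, values),
    index=nums.index,
    columns=nums.index,
)
print(table)
```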

umbral forge Mar 5, 2020, 6:28 AM

#

# Just curious how many iterations there are.
iteration_count = 0


def breadth_first_search(arr):
    global iteration_count

    # Length of the input array.
    arrayLen = len(arr)

    # Store traversed index here as proof.
    indexArray = []

    # Loop through the level as the divide will be exponential.
    for i in range(1, arrayLen):
        # Break out of loop when indexArray is complete.
        if len(indexArray) == arrayLen:
            break

        # levelArrayLen is the length of array split by the exponent amount.
        levelArrayLen = arrayLen // i

        # Start at 0 at the beginning of a level loop.
        first = 0

        # Loop through and start splitting into logarithmic amount.
        for x in range(1, i + 1):
            # Want to see how many times it ultimately iterated.
            iteration_count += 1

            # last is the length of array multiplied by how many times splitted.
            last = levelArrayLen * x

            # mid point of each split within a level.
            mid = (last - first) // 2 + first

            # Store mid point into indexArray and check its neighbour too.
            if mid not in indexArray:
                indexArray.append(mid)
            elif (mid + 1) not in indexArray and (mid + 1) < arrayLen:
                indexArray.append(mid + 1)
            elif (mid - 1) not in indexArray and (mid - 1) > -1:
                indexArray.append(mid - 1)

            # If the value in that index is True, break out.
            if arr[indexArray[-1]]:
                return indexArray

            first = last + 1

    return indexArray


testArray = [False, False, False, False, False, False, False, False,
             False, False, False, False, False, False, False, False]
print(breadth_first_search(testArray))
print('Iterated %s times.' % iteration_count)

#

So I made this attempt at breadth-first search by dividing the testArray down logarithmically. The idea is to check the middle of each logarithmic split of the testArray to see if it is True. The testArray is supposed to represent a sample which should only contain either True or False, with the possibility of "clusters". The code works about as I expected: it cycles through the indexes logarithmically and adds the middle index value to another array to signify that it has been "checked". The checking part, which may potentially be computationally heavy, doesn't get executed unless the index is not already part of the indexArray, in my attempt to be efficient with the check. However, if you look at the Iterated X times. print statement, you will see that depending on the size of the sample, it can iterate many more times than the actual sample size. For example, if you replace the single True in the testArray and run it, you will see that it iterated 21 times. I am just curious how bad that could be in a scalable sense, and whether you guys can help me think of a way to minimize the extra iterations? I am hoping that just looping through and manipulating an integer (index value) is pretty trivial, and so more iterations aren't the worst thing. Thank you!

lapis sequoia Mar 5, 2020, 7:33 AM

#

I'm trying to rename files in my dir but getting error, can somebody help

📎 Rename.JPG

#

error

📎 Rename1.JPG

velvet thorn Mar 5, 2020, 7:57 AM

#

@lapis sequoia why don't you post your code/errors as text?

lapis sequoia Mar 5, 2020, 8:00 AM

#

sorry, I will

#

'''Python

#

`Python """Rename multiple files in dir"""

import os

FilePath = r"C:\Users\Pocra_Gis\Desktop\2"

# r=root, d=directories, f = files

for root, dirs, files in os.walk(FilePath):
    for filename in files:
        FileRename = (filename[filename.find('(')+1:filename.find(')')] + filename[-4:])
        os.rename(filename, FileRename)
        print(root, FileRename) `

#

thank you "gm" for suggestion

#

here only files in first folders get renamed

lapis sequoia Mar 5, 2020, 8:24 AM

#

i'm getting an error when it goes into subfolders/files

#

if i print print(root,filename)

#

i get all the files present in dir
C:\Users\Pocra_Gis\Desktop\2 1.txt
C:\Users\Pocra_Gis\Desktop\2 2.txt
C:\Users\Pocra_Gis\Desktop\2\1 TestFile(1_1).txt
C:\Users\Pocra_Gis\Desktop\2\1 TestFile(2_2).txt
C:\Users\Pocra_Gis\Desktop\2\2 TestFile(2_1).txt
C:\Users\Pocra_Gis\Desktop\2\2 TestFile(2_2).txt
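A likely cause, going by the code as posted: `os.walk` yields bare file names, so `os.rename(filename, ...)` resolves them against the current working directory and fails for anything in a subfolder. Joining each name with `root` avoids that. The sketch below is one possible fix, not the only one. The function name is made up, files without a `(...)` part are skipped (an assumption about the intent), and the demo runs in a throwaway directory.

```python
import os
import tempfile


def rename_to_parenthesised_part(top):
    """Rename e.g. 'TestFile(1_1).txt' to '1_1.txt' in every folder under top."""
    for root, dirs, files in os.walk(top):
        for filename in files:
            start, end = filename.find("("), filename.find(")")
            if start == -1 or end < start:
                continue  # no "(...)" part, leave the file alone
            stem = filename[start + 1:end]
            ext = os.path.splitext(filename)[1]
            # Build full paths: os.walk only gives names relative to root.
            os.rename(os.path.join(root, filename),
                      os.path.join(root, stem + ext))


# Demo on a temporary directory mirroring the layout from the chat.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "1"))
    open(os.path.join(top, "1.txt"), "w").close()
    open(os.path.join(top, "1", "TestFile(1_1).txt"), "w").close()
    rename_to_parenthesised_part(top)
    top_level = sorted(f for f in os.listdir(top) if f.endswith(".txt"))
    result = sorted(os.listdir(os.path.join(top, "1")))
    print(top_level, result)
```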

lapis sequoia Mar 5, 2020, 8:56 AM

#

can anybody help

lapis sequoia Mar 5, 2020, 9:52 AM

#

Hi morning!

I need some advice on below interview questions I met:

How do you define KPI?
How do you formulate clear problem statement, and hypotheses based on data insights?
Definition of all the metrics you plan to use?
Identification of key stakeholders?

I was working as ML engineer in B2B company. All I know is how to measure the algorithms performance... I am struggling in answering these questions. I'd appreciate some help there. Thanks!

hybrid pecan Mar 5, 2020, 1:18 PM

#

Those are going to depend heavily on the situation

lapis sequoia Mar 5, 2020, 2:00 PM

#

In my experience, the majority of B2C companies asked me such questions without giving a clue. I checked what standard KPIs are out there and found at least 18, depending on business interest: retention rate, churn rate, customer lifetime value, etc. You can imagine companies like these make money by having customers subscribe to their products regularly. So how can you, as a data analyst/scientist, define a KPI or formulate a problem statement?

lapis sequoia Mar 5, 2020, 8:25 PM

#

so goddamn many questions lately

#

would love to try to answer some but it would take forever

lapis sequoia Mar 5, 2020, 9:23 PM

#

Yo would anyone be able to help me create a histogram?

umbral forge Mar 6, 2020, 6:24 AM

#

It would be lovely if somebody can answer my question if they scroll up just a little bit though 😉 It's a bit long but could be interesting to data scientists! As always, I appreciate the help!

velvet thorn Mar 6, 2020, 6:26 AM

#

@umbral forge at this point

#

it's not really a data science question

#

more a computer science question I would say

umbral forge Mar 6, 2020, 6:53 AM

#

You think so? So you'd suggest that I ask it in another channel then? @velvet thorn

#

I certainly can post it in #algos-and-data-structs

velvet thorn Mar 6, 2020, 6:54 AM

#

I would say so...?

#

seems like an algorithm question

umbral forge Mar 6, 2020, 7:00 AM

#

Done and thanks for the suggestion!

merry portal Mar 6, 2020, 7:12 AM

#

When sorting in pandas, how can I break ties by index? Assuming index was not initially sorted, so using a stable sort will not give correct output

velvet thorn Mar 6, 2020, 7:13 AM

#

how do you want to break ties then

merry portal Mar 6, 2020, 7:13 AM

#

Unless pandas internally sorts the index and I'm not aware of it. In which case a stable sort will work

velvet thorn Mar 6, 2020, 7:13 AM

#

oh, what you mean is

#

you want to sort by values, but the lower indexed row will come first?

merry portal Mar 6, 2020, 7:13 AM

#

Sure

velvet thorn Mar 6, 2020, 7:13 AM

#

.sort_index().sort_values()

merry portal Mar 6, 2020, 7:14 AM

#

Do I need to pass mergesort to sort_values as kind of sort to perform?

velvet thorn Mar 6, 2020, 7:14 AM

#

yes

#

default is quicksort (I'm not sure why), which is not stable
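A small sketch of that recipe on an invented series whose index is deliberately out of order. Sorting by index first and then by value with a stable algorithm leaves ties in index order.

```python
import pandas as pd

# Invented data: two ties (value 1 and value 2), index not sorted.
s = pd.Series([2, 1, 2, 1], index=[3, 2, 0, 1])

# Stable sort on values after an index sort: ties keep their index order.
result = s.sort_index().sort_values(kind="mergesort")
print(result)
```

For a DataFrame the same idea applies, with `sort_values(by=..., kind="mergesort")` after `sort_index()`.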

merry portal Mar 6, 2020, 7:15 AM

#

Ok thank you. Just starting with pandas/scientific computing, and you've been a huge help!

velvet thorn Mar 6, 2020, 7:15 AM

#

no worries, it wasn't much

#

incidentally, your other question

#

there are different kinds of indices, and you can check which with .index

#

in particular, the most common is a RangeIndex, which doesn't store individual values for each row

#

only a start and end, and therefore it is sorted by definition

merry portal Mar 6, 2020, 7:16 AM

#

Oh yes. I think I have like int64 or something

velvet thorn Mar 6, 2020, 7:16 AM

#

in this case, you would not need to sort by index

merry portal Mar 6, 2020, 7:16 AM

#

Yep that makes sense

fading abyss Mar 6, 2020, 9:20 AM

#

Hello all!

I just started learning python and I want to take on a little project to practice what I have learned so far.

I am currently working for a destination management company and we are receiving a booking list from tour operators that we are uploading to our system for our reservations department to work on. The problem is that, the text file is inconsistent. I normally need to look for incorrect data then replace it with a correct data. In order to do so, I’m currently using Power Query to split the rows based on the positions and have it loaded in excel as a table. From there, I applied a conditional formatting that highlights the rows with incorrect data which I will then correct. Once done, I will copy the data to a notepad then just remove tab to put back each fields to their correct positions before uploading to our system.

My question is. How will I be able to execute the same task in Python? The goal is to have our end-users upload the received booking list. A script will then parse the text and look for the rows with errors and return it on a table ( some sort of form ) so that our end-users can correct it. Once done, they will apply the changes, then export a new text file with the changes.

What tools/library do I need to accomplish the task? I’ve been reading up about text processing and I keep seeing NLTK. I’m just hoping to know how would you go about the task as an experienced python developer. I hope the above makes sense. I want to include a sample data but unfortunately, there are sensitive information that I can’t share.

dull fern Mar 6, 2020, 11:13 AM

#

Hello @fading abyss, for anything related with excel/power query/tables you can look into the pandas library

#

Then it depends on what kind of text processing you wish to do. Maybe custom rules with regex (re library) can be enough. If you want more complex processing you can check SpaCy

velvet thorn Mar 6, 2020, 12:38 PM

#

@fading abyss it depends on the nature of the errors

oblique belfry Mar 6, 2020, 2:07 PM

#

Has anyone used Comet.ml?

lapis sequoia Mar 6, 2020, 5:24 PM

#

Anyone got any resources on how to prepare data for a neural network? I'm using pytorch

oblique belfry Mar 6, 2020, 5:24 PM

#

Well...what type of data do you have?

lapis sequoia Mar 6, 2020, 5:25 PM

#

a spreadsheet of football games

#

dates, teams, results and statistics

#

Project for a job opportunity. Hopefully goes well

oblique belfry Mar 6, 2020, 5:26 PM

#

Congrats. I hope it goes well.
I assume this is some type of regression like problem?

#

I know more about CV problems, but the general start of data prep is:

Checking outliers and null data
One-hot encoding things that need to be one-hot encoded
Scale data
Setup a Pytorch DataLoader

There are others on here that can help you out more than I can. Good luck.
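The first three steps in that list can be sketched with pandas alone. The frame below is invented (it is not the football spreadsheet from the chat), dropping the incomplete row is only one way to handle nulls, and the DataLoader step is left out because it depends on how the tensors are built.

```python
import pandas as pd

# Invented stand-in for a games table.
games = pd.DataFrame({
    "home_team": ["A", "B", "A", "C"],
    "shots": [10.0, None, 14.0, 8.0],
    "goals": [1, 2, 3, 0],
})

# 1. Null data: count it, then drop (or fill) the affected rows.
print(games.isna().sum())
games = games.dropna(subset=["shots"])

# 2. One-hot encode the categorical column.
games = pd.get_dummies(games, columns=["home_team"])

# 3. Scale the numeric feature to zero mean and unit standard deviation.
games["shots"] = (games["shots"] - games["shots"].mean()) / games["shots"].std()

print(games)
```

In a real project the scaling statistics would be computed on the training split only and reused for validation and test data.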

lapis sequoia Mar 6, 2020, 5:29 PM

#

Thanks

#

Was looking at Sentdex's videos on pytorch but we don't have similar data. Plus his data is already prepared and ready to be fed to the NN

oblique belfry Mar 6, 2020, 5:30 PM

#

I wish there were more posts on data prep. That is always the "magic" of those tutorials.

lapis sequoia Mar 6, 2020, 5:31 PM

#

Yeah usually they go "A NN is very easy to set up! Just a few lines of code"

#

But the real work is data preparation

fading abyss Mar 6, 2020, 5:33 PM

#

@velvet thorn and @dull fern , nothing sort of complicated errors. Just some incorrect numbering on some rows.

📎 SPOILER_Capture.PNG

#

Like for example in the screenshot, the left most 7-digit-number is the booking reference field. The right most highlighted numbers are the number of members in each room and the highlighted numbers in the middle is the number of rooms under the booking reference.

#

That booking should be 3 rooms, but it was numbered as 1 and 2. So that field should be replaced by 3. Then the number of members is 1 through 5, but it should be 1,1,2,1,2. Meaning 1 single occupancy room and 2 double occupancy for a total of 3 rooms.

#

Anyhow, I'm not planning to implement some sort of algorithm to automatically replace those numbers as of yet. The goal for now is to output the data in a tabular form and have the end-users an option to replace the numbers to a correct one and re-export it so it's ready to be uploaded in our system.

dull fern Mar 6, 2020, 5:49 PM

#

So the screenshot is what you get as an input ? And you would like to export it to an excel file, correct it and then back to that format, right ?

lapis sequoia Mar 6, 2020, 5:57 PM

#

Could I please get some help? I'm trying to use a query to get only a select few cities to show up in a column but I just keep running into issues

#

donations_new.query('Contribution_City == "MANCHESTER"'and 'Contribution_City == "NASHUA"', inplace = True)

#

Now whenever I use donations_new.head() nothing shows up
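A probable explanation: in Python, `'x' and 'y'` between two non-empty strings evaluates to the second string, so only the NASHUA condition ever reached `query`. A row also cannot be both cities at once, so the condition needs `or` inside a single query string. The sketch below uses an invented frame with the same column name. Since `inplace=True` overwrites the frame, the original data may need reloading before retrying.

```python
import pandas as pd

# Invented data using the column name from the chat.
donations_new = pd.DataFrame({
    "Contribution_City": ["MANCHESTER", "NASHUA", "CONCORD", "NASHUA"],
    "Amount": [10, 20, 30, 40],
})

# Option 1: one query string, with `or` inside the string.
either = donations_new.query(
    'Contribution_City == "MANCHESTER" or Contribution_City == "NASHUA"'
)

# Option 2: isin() with a list of wanted cities; easier to extend.
wanted = ["MANCHESTER", "NASHUA"]
same = donations_new[donations_new["Contribution_City"].isin(wanted)]

print(either)
```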

fading abyss Mar 6, 2020, 6:23 PM

#

So the screenshot is what you get as an input ? And you would like to export it to an excel file, correct it and then back to that format, right ?
@dull fern Yes. That's the input file we are receiving. No, not in Excel. I'd like to keep them from working in Excel, as those numbers can't be replaced directly with a number. There is whitespace before each number to keep it in the correct position when uploading to our system, and they might end up putting the number in the wrong position. What I have in mind is to output it in tabular form (web browser) and have them correct the numbers there. Once done, they will submit the changes and my script will do the necessary actions to position each field, then output it again to a text file that is ready to be uploaded.

lone blaze Mar 6, 2020, 7:38 PM

#

Hello!

eager heath Mar 6, 2020, 7:39 PM

#

Hey!

dull fern Mar 6, 2020, 7:58 PM

#

@fading abyss Alright, so basic string formatting should be enough for the processing part, you could use the string method split to catch each "field" of a row in a list. Then the method format or join should work to put them back together.
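Since the positions matter, slicing by fixed character positions may be safer than `split`. The sketch below assumes a made-up layout (field names, widths and right-alignment are all hypothetical, as the real file was not shared) and shows a parse, edit and re-format round trip that keeps every field at its original width.

```python
# Hypothetical layout: (field name, start, end) character positions.
LAYOUT = [("booking_ref", 0, 7), ("rooms", 7, 11), ("members", 11, 15)]


def parse_row(line):
    """Cut a fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}


def format_row(fields):
    """Rebuild the line, right-justifying each value to its original width."""
    return "".join(
        fields[name].rjust(end - start) for name, start, end in LAYOUT
    )


row = "1234567   2   5"
fields = parse_row(row)
fields["rooms"] = "3"  # the correction an end-user would make
fixed = format_row(fields)
print(repr(fixed))
```

Because the padding is recomputed from the layout, a corrected value cannot shift the fields that follow it.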

silk cedar Mar 6, 2020, 8:29 PM

#

Hello, new to the discord world, if I have questions on manipulating large excel sheets using pandas would this be the best channel?

runic rock Mar 6, 2020, 9:01 PM

#

Hello, are you learning python?

#

For data science

silk cedar Mar 6, 2020, 9:50 PM

#

Trying to haha

#

Seems that pandas is not optimized for sorting so it is very slow, I am also relatively new to Python/programming in general so I am sure there is probably a better way just in the case that I am using it

jolly briar Mar 6, 2020, 10:37 PM

#

@silk cedar what's the question

merry portal Mar 6, 2020, 10:37 PM

#

If I have a pandas dataframe that is sparse on both rows and columns (15x more row than column), which of the scipy.sparse matrix representation should I be using for math operations? I'm guessing block sparse matrix or compressed sparse row matrix, but not sure. https://docs.scipy.org/doc/scipy/reference/sparse.html
I'm doing: mean, subtraction, element-wise or row/column wise multiplication and dot product with another matrix or vector

silk cedar Mar 6, 2020, 11:08 PM

#

Essentially I am looking through one excel and if the serial matches in another excel write a date of manufacture to the original spread sheet

#

for s1 in df.itertuples():
    idx = s1.Index
    print(s1)
    for s2 in dfDoM.itertuples():
        idx2 = s2.Index
        if df.loc[idx]['Serial'] == dfDoM.loc[idx2]['BARCODE']:
            df.at[idx, 'DoM'] = dfDoM.loc[idx2]['PACKING_DATE']
            break

#

problem is the second excel is 61k rows and the first is 5k

#

I am sure there is a better way using pandas, but not sure how

#

Also even better would be probably just analyze lists but I was trying to remain in pandas because going from excel to list back to excel sounded tedious

#

@jolly briar

jolly briar Mar 6, 2020, 11:14 PM

#

@silk cedar not sure i fully follow - can't you merge ?

silk cedar Mar 6, 2020, 11:23 PM

#

Maybe? Haha sorry still pretty new to this

jolly briar Mar 6, 2020, 11:24 PM

#

@silk cedar https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

#

see if the examples there look like they'd be of use

silk cedar Mar 6, 2020, 11:25 PM

#

Thanks! Off the top of your head do the formats of the two df's have to match?

jolly briar Mar 6, 2020, 11:25 PM

#

what - the types of what you'd be merging on?

#

if df.loc[idx]['Serial'] == dfDoM.loc[idx2]['BARCODE'] works then i'd expect merging to work as well
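A sketch of how the nested loop could become a single left merge, on tiny invented frames that reuse the column names from the chat. Keeping only the first row per `BARCODE` mirrors the `break` in the loop; that part is an assumption about the intended behaviour.

```python
import pandas as pd

# Invented stand-ins for the two spreadsheets.
df = pd.DataFrame({"Serial": ["S1", "S2", "S3"]})
dfDoM = pd.DataFrame({
    "BARCODE": ["S3", "S1", "S9"],
    "PACKING_DATE": ["2020-01-03", "2020-01-01", "2020-01-09"],
})

# One row per barcode, so the merge cannot multiply rows of df.
lookup = dfDoM[["BARCODE", "PACKING_DATE"]].drop_duplicates("BARCODE")

# A left merge keeps every row of df, in order; unmatched serials get NaN.
merged = df.merge(lookup, how="left", left_on="Serial", right_on="BARCODE")

# Assign by position so a non-default index on df does not misalign values.
df["DoM"] = merged["PACKING_DATE"].to_numpy()
print(df)
```

This does the matching in one pass through a hash join instead of comparing every one of the 5k rows against up to 61k rows.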

unborn shuttle Mar 6, 2020, 11:31 PM

#

Hello all 🙂 I am seeking direction towards a sub-community within the data-science group that covers topics in bioinformatics, preferably using Python + regex to do pattern searches in DNA sequence snippets. If anyone would be able to direct me to the correct "group"/sub-community, I would appreciate it. Please and thank you!

silk cedar Mar 6, 2020, 11:51 PM

#

Thanks again @jolly briar I will try and get that working

jolly briar Mar 6, 2020, 11:51 PM

#

@silk cedar np - sounds like a merge is what you're after tho

#

pandas docs are nice btw - esp the new ones - check out the getting started and user guide

merry portal Mar 7, 2020, 2:36 AM

#

Given a mask, how do I know what the positions of the true values are? Feels trivial but haven't been able to find efficient solution
Like to print out. Double for loop is not terrible runtime (under one min), but feels wrong

velvet thorn Mar 7, 2020, 3:40 AM

#

what kind of mask do you mean

ripe forge Mar 7, 2020, 5:55 AM

#

If you're ever looping when it comes to pandas or numpy, there's probably a better way

vale fog Mar 7, 2020, 1:59 PM

#

Anyone done any fast.ai courses?

lapis sequoia Mar 7, 2020, 3:06 PM

#

I want to work on a model that can detect subject, object, verb etc. in a sentence. Where can I find a dataset, or where can I scrape the raw data?

somber hamlet Mar 7, 2020, 4:23 PM

#

novels

#

or dictionaries

lapis sequoia Mar 7, 2020, 4:48 PM

#

In that case the problem is that i need to label it myself which would take so much time for creating huge dataset

merry portal Mar 7, 2020, 6:25 PM

#

@velvet thorn @ripe forge: (my_matrix > 5) -> I want the index/column of each true occurrence

#

Instead of giant matrix full of true/false

ripe forge Mar 7, 2020, 6:29 PM

#

well, it gives what it's supposed to give. the question should be, how to get what you want

#

without knowing more about what you're doing, take a look at np.where

merry portal Mar 7, 2020, 6:35 PM

#

   aa bb cc
a  1   2 3
b  5   1 5
c  5   5 5

so from m < 5 I would want something like [(a,aa), (a,bb), (a,cc), (b,bb)]

#

Which is not what np.where returns. My understanding is np.where is conditional replacement of values
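For what it's worth, `np.where` has two forms: with three arguments it does conditional replacement, but with a single argument it returns the row and column positions of the True entries. A sketch using the example matrix from the message above, mapping those positions back to the labels:

```python
import numpy as np
import pandas as pd

# The example matrix from the chat.
m = pd.DataFrame(
    [[1, 2, 3], [5, 1, 5], [5, 5, 5]],
    index=["a", "b", "c"],
    columns=["aa", "bb", "cc"],
)

mask = m < 5

# Single-argument np.where: integer positions of the True entries.
rows, cols = np.where(mask.to_numpy())

# Translate positions into (index label, column label) pairs.
positions = list(zip(m.index[rows], m.columns[cols]))
print(positions)
# [('a', 'aa'), ('a', 'bb'), ('a', 'cc'), ('b', 'bb')]
```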

lapis sequoia Mar 7, 2020, 8:28 PM

#

Hey guys, does anyone have some tips on how to optimize randomforest regressions most effectively? random gridsearch? gridsearchcv? a combination thereof? I know how to do those, but I'm an absolute beginner in terms of randomforests... i only did a parameter grid test for n_estimators and max_features and the results are rather unsatisfactory. Thanks

lapis sequoia Mar 7, 2020, 9:32 PM

#

Does anyone know what values I should use to replace NaN in a column of averages?

#

The data I have puts NaN if there is no prior data to calculate the average

#

Need it for a NN btw

chrome rampart Mar 7, 2020, 11:01 PM

#

Hello, would you recommend learning another language next to Python like Java? And should I learn about algorithms and computer science before getting into Data science (AI)?

slim elm Mar 7, 2020, 11:19 PM

#

Do you know any programming language pyro?

#

Does anyone know what values I should use to replace NaN in a column of averages?
@lapis sequoia Hey deusex, if you are using pandas and this is a pandas data frame you can use the .fillna() method to replace all NaN values with whatever you specify inside the parentheses.
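Which fill value is right depends on the data, so the following is only one common option, shown on an invented column: fill with the column mean and add a flag column so the network can still tell which rows had no prior data.

```python
import pandas as pd

# Invented column of running averages with gaps where no prior data existed.
df = pd.DataFrame({"avg_goals": [None, 1.0, 2.0, None, 3.0]})

# Record where the gaps were before filling them.
df["avg_goals_missing"] = df["avg_goals"].isna().astype(int)

# Fill the gaps with the mean of the observed values (an assumption, not a rule).
df["avg_goals"] = df["avg_goals"].fillna(df["avg_goals"].mean())

print(df)
```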

chrome rampart Mar 7, 2020, 11:30 PM

#

@slim elm yes, I know python and I'm learning pandas

lapis sequoia Mar 7, 2020, 11:57 PM

#

If you are learning python, finish learning that one first.

#

Would be better to start from compsci and algorithms though @chrome rampart

chrome rampart Mar 7, 2020, 11:58 PM

#

I finished Python syntax and OOP

#

but I can't find sources for CS and algorithms

lapis sequoia Mar 7, 2020, 11:59 PM

#

MIT has free courses

#

search their web and youtube

chrome rampart Mar 7, 2020, 11:59 PM

#

There's this CS50 course but it's so long

#

weeks

lapis sequoia Mar 7, 2020, 11:59 PM

#

lol you can't learn compsci in days

#data-science-and-ml
