#data-science-and-ml | Python | Page 248

desert oar Aug 26, 2020, 4:50 PM

#

makes sense then

lapis sequoia Aug 26, 2020, 4:58 PM

#

Is it possible to use the .map function on a series of booleans?

#

Because now I might be thinking of making a new series on whether or not something is discontinued, and I was going to replace "False" with "No" and "True" with "Yes"

tidal bough Aug 26, 2020, 5:00 PM

#

don't see why it wouldn't be possible

lapis sequoia Aug 26, 2020, 5:00 PM

#

Well, I'm trying, but it's replacing the entire series with NaN, and in the documentation they show it only working with object dtypes, not booleans

desert oar Aug 26, 2020, 5:01 PM

#

of course you could, but why would you want to

#

oh i see

#

yeah perfectly valid

lapis sequoia Aug 26, 2020, 5:01 PM

#

Nvm I am stupid

#

Put True and False in ''

#

-_-

desert oar Aug 26, 2020, 5:01 PM

#

pd.Series([True, True, False]).map({True: 'hello', False: 'goodbye'})

#

yeah, True and "True" are different
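A minimal sketch of the difference described above, using a hypothetical `discontinued` Series: the dictionary keys must be the booleans `True`/`False`, because `.map` turns any value without a matching key into NaN.

```python
import pandas as pd

discontinued = pd.Series([True, False, True])

# Keys are the booleans True/False, matching the values in the Series.
labels = discontinued.map({True: "Yes", False: "No"})
print(labels.tolist())

# Keys are the strings "True"/"False"; no value matches, so every entry is NaN.
wrong = discontinued.map({"True": "Yes", "False": "No"})
print(wrong.isna().all())
```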

lapis sequoia Aug 26, 2020, 5:01 PM

#

should have known, haha

solar bluff Aug 26, 2020, 5:04 PM

#

I use .map all the time to map series that contain enumerated values into their corresponding strings

#

(i look at a lot of data that's produced by C++ code pushing structs into hdf5 files)

lapis sequoia Aug 26, 2020, 5:08 PM

#

is there a way to use the inplace argument for mapping as well? or do i have to create a new series?

#

The documentation seems to point to there not being a method

solar bluff Aug 26, 2020, 5:08 PM

#

df["series"] = df["series"].map(dict) works just fine

lapis sequoia Aug 26, 2020, 5:09 PM

#

Ah, true

#

Thank you both for the help!

solar bluff Aug 26, 2020, 5:16 PM

#

❤️

#

credit goes to @desert oar, a true champ

desert oar Aug 26, 2020, 5:16 PM

#

hah, i feel strongly about helping people w/ pandas

#

because its really hard to learn from the official docs..

#

this might be a good learning resource https://tomaugspurger.github.io/modern-1-intro.html @lapis sequoia

datas-frame – Modern Pandas (Part 1)

Posts and writings by Tom Augspurger

#

this is the most recent post in the series https://tomaugspurger.github.io/modern-8-scaling.html

datas-frame – Modern Pandas (Part 8): Scaling

Posts and writings by Tom Augspurger

lapis sequoia Aug 26, 2020, 5:18 PM

#

you too? whew, i thought i was the only one that struggled with documentation reading here

desert oar Aug 26, 2020, 5:18 PM

#

its awful

#

they tried, but

#

its amazing anyone knows anything about pandas

#

it needs a serious overhaul imo, ive been wanting to write my own guide for a while

solar bluff Aug 26, 2020, 5:26 PM

#

I LOVE that series from Tom Augspurger

#

I also feel that the McKinney book is confusing, and that's sad because he's the creator

#

best book I know of on how to use Pandas is the Pandas 1.x Cookbook by Ted Petrou and Matt Harrison

glass wyvern Aug 26, 2020, 5:43 PM

#

Does anyone have experience with sparse matrices? I want to solve some linear systems of equations. The coefficient matrix is predominantly diagonal and the rest of the elements are 0 thus I think sparse matrices are the way to go. Thanks!

desert oar Aug 26, 2020, 6:40 PM

#

@glass wyvern yeah, what's your question exactly?

#

sparse matrices can be good for what you just described, if the coefficient matrix is very big

#

not much value in it for a small matrix, the main benefit of a sparse matrix is saving memory
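For a concrete starting point, here is a rough sketch of solving a mostly diagonal system with SciPy. The tridiagonal matrix, its values, and the size `n` are made-up assumptions for illustration, not glass wyvern's actual system.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 1000  # hypothetical system size

# Tridiagonal coefficient matrix: 4 on the main diagonal, -1 on the neighbours.
# Only the non-zero entries are stored; CSC is a format spsolve handles well.
A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

x = spsolve(A, b)

# Check the residual rather than trusting the solve blindly.
print(np.allclose(A @ x, b))
```

A dense `n x n` array would hold a million floats here, while the sparse version stores roughly three thousand, which is where the memory saving mentioned above comes from.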

merry ridge Aug 26, 2020, 7:32 PM

#

This is kind of a dumb question but is anyone familiar with the term imputation? How is this conjugated? Is the base word Impute? It is very difficult for me to see it as anything less than a misspelling of the word "input" but other people in data science in my work group insist that this is a common word.

#

None of them speak English as a first language, and I am skeptical of the way they are using it in a sentence

desert oar Aug 26, 2020, 7:33 PM

#

yes, imputation is the "verbal action" form of "impute"

#

"missing data imputation" is the act of "imputing missing data"

merry ridge Aug 26, 2020, 7:34 PM

#

Alright, thank you for the reassurance

#

I have been in a constant battle of made-up words until now.
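To make the term concrete, one of the simplest forms of imputation is filling each missing value with the mean of the observed values. The `ages` Series below is invented purely for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0, np.nan])

# Mean imputation: replace each missing value with the mean of the observed ones.
imputed = ages.fillna(ages.mean())
print(imputed.tolist())
```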

sudden cedar Aug 26, 2020, 7:36 PM

#

does anyone know how i would retrieve the neural net with the highest fitness score in NEAT

amber anvil Aug 26, 2020, 8:25 PM

#

Not sure if this is the right sub-server, but can someone maybe help me with a little problem i'm encountering with pandas?
It's a problem with data frames: I'm trying to map a column with numbers N. These N are also present as keys (K) in a dictionary with values V. Whenever I try to substitute the N in the dataframe by the V from the dictionary with df.map, the indexes of dict are mapped and not the key-reflecting values...
Anyone know how to solve this?

merry ridge Aug 26, 2020, 8:26 PM

#

Are you using df = df.map?

#

I think df.map creates a copy of the dataframe off the top of my head

#

There should be an optional argument inplace = True to make it update the dataframe itself otherwise, but I would need to check the documentation. I have the memory of a goldfish

amber anvil Aug 26, 2020, 8:30 PM

#

main_df['column_with_N'].replace(dictionary, inplace=True) works, but its super slow. from what i read df.map would be faster, but its not doing the job rn

merry ridge Aug 26, 2020, 8:33 PM

#

Looking at the documentation, it is clear I have no idea what I am talking about. So disregard

amber anvil Aug 26, 2020, 8:33 PM

#

main_df['column_with_N'].map(dictionary) shows a proper mapping in a series, but it does not substitute the values in main_df

#

ahah no worries, thx for ur help

#

🙂

merry ridge Aug 26, 2020, 8:42 PM

#

When I try writing main_df['column_with_N'] = main_df['column_with_N'].map(dictionary) it seems to output what you want

#

Not sure if this is what you are after @amber anvil I am more of a Matlab person.

desert oar Aug 26, 2020, 8:54 PM

#

@amber anvil you need to assign the result back to the original, as hexicle pointed out

#

.map does not work "in place"
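The assign-back pattern can be sketched as follows. The frame contents and the dictionary are hypothetical stand-ins for `main_df` and `dictionary` from the messages above.

```python
import pandas as pd

# Hypothetical stand-ins for main_df and dictionary from the messages above.
main_df = pd.DataFrame({"column_with_N": [1, 2, 3, 2]})
dictionary = {1: "one", 2: "two", 3: "three"}

# .map returns a new Series and leaves main_df untouched...
mapped = main_df["column_with_N"].map(dictionary)

# ...so the result has to be assigned back to the column.
main_df["column_with_N"] = mapped
print(main_df["column_with_N"].tolist())
```

One behavioural difference to keep in mind when switching from `.replace`: values that have no key in the dictionary become NaN with `.map`, whereas `.replace` leaves them as they were.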

muted sapphire Aug 26, 2020, 10:02 PM

#

I want to ask a very simple question regarding something I encountered in anaconda, but I dont know if this is the correct place. May I or should I move to a help channel?

#

(I ask here because anaconda is considered a popular python distribution amongst data scientists)

solar bluff Aug 27, 2020, 12:03 AM

#

@muted sapphire what's the question? (I am not an admin or mod but if I can help I will)

desert oar Aug 27, 2020, 12:04 AM

#

@muted sapphire yep this place is fine, otherwise #tools-and-devops is ok

muted sapphire Aug 27, 2020, 12:04 AM

#

Hey thanks greghouse. I just wanted to know if its normal, every time that you create a new environment, to reinstall jupyter notebook?

#

It happened to me yesterday and it seemed weird that I had to reinstall whats already in my pc

#

Thanks guys

velvet thorn Aug 27, 2020, 12:10 AM

#

Hey thanks greghouse. I just wanted to know if its normal, every time that you create a new environment, to reinstall jupyter notebook?
@muted sapphire that’s what a virtual environment is

#

it effectively acts like a new “container” for installed packages

muted sapphire Aug 27, 2020, 12:11 AM

#

Packages I can understand. But jupyter, i mean its like an IDE, isnt it?

velvet thorn Aug 27, 2020, 12:11 AM

#

Jupyter is a package too

muted sapphire Aug 27, 2020, 12:11 AM

#

And to be honest a friend of mine doesnt have to install it when he makes a new environment so I was unsure whether i made a mistake or not

velvet thorn Aug 27, 2020, 12:12 AM

#

it’s possible to do that too

muted sapphire Aug 27, 2020, 12:13 AM

#

I see. Do you know how? I didnt know jupyter behaves like a package tbh. I just considered it an IDE, like pycharm

velvet thorn Aug 27, 2020, 12:14 AM

#

okay, first

#

“package” and “IDE” are not mutually exclusive

#

a package is just a Python module container

#

and you can write an IDE in Python

#

which would make it a package too

#

anyway, to answer your question...

muted sapphire Aug 27, 2020, 12:18 AM

#

Thank you for the valuable information, I hadnt thought about it this way but makes sense.

#

Yes please, go on

velvet thorn Aug 27, 2020, 12:18 AM

#

I believe you cannot customise it directly, but it depends on your version of Anaconda...? (I’ve never had a need to do this)

muted sapphire Aug 27, 2020, 12:18 AM

#

I have the latest, he doesnt

#

Maybe thats a reason, i dont know

#

As long as it is "normal" and its not a mistake by me, i dont mind it installing it

velvet thorn Aug 27, 2020, 12:19 AM

#

IMO

#

new environments coming with stuff that is not necessary is an antipattern

#

you won’t always be doing stuff that needs Jupyter

#

and by “necessary” I mean for Python to run

muted sapphire Aug 27, 2020, 12:22 AM

#

This is true, it makes sense for it NOT to come with it installed.

#

Perhaps I just want to test something in the console or w/e.

#

You are right, i was mainly confused because I didnt consider jupyter as a package you know?

sudden cedar Aug 27, 2020, 12:24 AM

#

does anyone know how i would retrieve the neural net with the highest fitness score in NEAT
as in save that data genomes data and only run it by itself

muted sapphire Aug 27, 2020, 12:24 AM

#

Thank you anyway 🙂 @velvet thorn You were very helpful

velvet thorn Aug 27, 2020, 12:24 AM

#

Thank you anyway 🙂 @velvet thorn You were very helpful
@muted sapphire np!

graceful glacier Aug 27, 2020, 1:00 AM

#

i wrote code to filter words from a given Pandas series that contain at least two vowels

#

import pandas as pd
from collections import Counter

color_series = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])
print("Original Series:")
print(color_series)
print("\nFiltered words:")
# Boolean mask: True where a word has at least two vowels (case-insensitive)
mask = color_series.map(lambda c: sum(Counter(c.lower()).get(v, 0) for v in 'aeiou') >= 2)
print(color_series[mask])

velvet thorn Aug 27, 2020, 1:02 AM

#

hm.

#

I'm sure there's a better way

#

let me think

graceful glacier Aug 27, 2020, 1:02 AM

#

any suggestions on if i can use regex to solve this

velvet thorn Aug 27, 2020, 1:04 AM

#

>>> import re
>>> colours = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])
>>> colours.str.count('[aeiou]', flags=re.I)
0    1
1    2
2    3
3    1
4    2
5    2
dtype: int64

#

there you go
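The snippet above returns the vowel counts. To finish the original task, the counts can be turned into a boolean mask. A short sketch, where the names `vowel_counts` and `filtered` are just illustrative:

```python
import re
import pandas as pd

colours = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])

# Count vowels per word (case-insensitive), then keep words with two or more.
vowel_counts = colours.str.count('[aeiou]', flags=re.I)
filtered = colours[vowel_counts >= 2]
print(filtered.tolist())
```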

graceful glacier Aug 27, 2020, 1:05 AM

#

thanks

velvet thorn Aug 27, 2020, 1:05 AM

#

yw

atomic forge Aug 27, 2020, 1:44 AM

#

is Pandas worth learning

velvet thorn Aug 27, 2020, 1:44 AM

#

yes

#

for data analysis

atomic forge Aug 27, 2020, 1:44 AM

#

hm

#

u need to learn that

#

and

#

matplotlib

#

for graphing

#

then again mysql does the job of pandas so

velvet thorn Aug 27, 2020, 1:45 AM

#

no

#

SQL does not do the job of pandas...

#

and pandas doesn't do the job of SQL either

atomic forge Aug 27, 2020, 1:47 AM

#

they both deal with

#

data bases

#

thro code

velvet thorn Aug 27, 2020, 1:47 AM

#

SQL deals with databases

#

pandas doesn't

#

the abstraction is different.

atomic forge Aug 27, 2020, 1:48 AM

#

then how would u describe a pandas dataframe

velvet thorn Aug 27, 2020, 1:48 AM

#

in particular, SQL focuses strongly on guarantees that databases provide, like ACID

#

the DataFrame is an abstraction representing tabular data

atomic forge Aug 27, 2020, 1:48 AM

#

and dont say a dictionary of series cuz it rly isnt :/

#

well i mean it IS but

#

acc yea

velvet thorn Aug 27, 2020, 1:48 AM

#

and

atomic forge Aug 27, 2020, 1:48 AM

#

ic where ur coming from

velvet thorn Aug 27, 2020, 1:49 AM

#

pandas doesn't need a database

#

it's (more or less) source-agnostic

#

SQL deals only with databases

atomic forge Aug 27, 2020, 1:49 AM

#

huh

#

ic ic

#

well imma need a resource to learn pandas anyway so if u dont mind

#

!resources

arctic wedgeBOT Aug 27, 2020, 1:50 AM

#

Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

velvet thorn Aug 27, 2020, 1:51 AM

#

hm I don't really have one, sorry

#

also pandas is a lot more suited to quick experimentation than SQL

atomic forge Aug 27, 2020, 1:54 AM

#

h m

#

*puts pandas in code test as if the name is code

velvet thorn Aug 27, 2020, 1:54 AM

#

because the name of the package is pandas

atomic forge Aug 27, 2020, 1:55 AM

#

well

#

ok so

#

if im not wrong

#

from what ive learned

#

if u have a dataframe and u wanna only do when a certain condition is true

#

df.iloc[df["column"] > 5]

#

?

#

or was it df.loc

#

damn ot

velvet thorn Aug 27, 2020, 1:56 AM

#

okay

#

so, if you just want to filter on rows

#

you can do df[df['column'] > 5]

#

.loc is for when you want to filter on rows and columns

#

so say you want all the rows where column_1 > 5, and only the column column_2

#

that would be df.loc[df['column_1'] > 5, ['column_2']]

#

df.loc[row_indexer, col_indexer]

#

.iloc, on the other hand, is for positional indexing

atomic forge Aug 27, 2020, 1:58 AM

#

ic

velvet thorn Aug 27, 2020, 1:58 AM

#

so say you want the 3rd row

atomic forge Aug 27, 2020, 1:58 AM

#

ohhh

velvet thorn Aug 27, 2020, 1:58 AM

#

df.iloc[2]

#

3rd row, 1st column?

atomic forge Aug 27, 2020, 1:58 AM

#

thats the

#

3rd column

#

o nvm

velvet thorn Aug 27, 2020, 1:58 AM

#

df.iloc[2, 0]
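The three access styles described above, side by side on a small made-up frame (the column values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "column_1": [3, 7, 9, 4],
    "column_2": ["a", "b", "c", "d"],
})

rows = df[df["column_1"] > 5]                      # boolean mask: filters rows only
subset = df.loc[df["column_1"] > 5, ["column_2"]]  # label-based: rows and columns
third_row = df.iloc[2]                             # position-based: 3rd row as a Series
cell = df.iloc[2, 0]                               # 3rd row, 1st column

print(subset["column_2"].tolist(), cell)
```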

atomic forge Aug 27, 2020, 1:58 AM

#

ohhhh

#

ic ic

#

and if u want to get by row name?

#

ok ok i got it thz

#

rhx

#

rhx

#

thx

velvet thorn Aug 27, 2020, 2:01 AM

#

rows don't have names, normally.

atomic forge Aug 27, 2020, 2:02 AM

#

but if u want

#

like say

#

u have a list of states

#

and their population

#

and area

#

and u want

#

the U.S's row

velvet thorn Aug 27, 2020, 2:02 AM

#

you would have a column

#

called "state" or something like that

#

then df[df['state'] == 'US']
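A sketch of that filter, plus one alternative that was not suggested in the chat: if label lookups are needed often, the column can be made the index, and then `.loc` does work "by row name". The data is invented and the numbers are placeholders.

```python
import pandas as pd

# Hypothetical data; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "state": ["A", "B", "US"],
    "population": [10, 20, 30],
    "area": [1.5, 2.5, 4.0],
})

by_filter = df[df["state"] == "US"]          # the approach from the message above

# Alternative: make "state" the index so rows can be looked up by label.
by_label = df.set_index("state").loc["US"]
print(by_label["population"])
```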

atomic forge Aug 27, 2020, 2:03 AM

#

o

#

ic

#

and if im not wrong

#

df["columnname"] would get u a series

velvet thorn Aug 27, 2020, 2:03 AM

#

yes, that's correct

#

and that Series represents a column

atomic forge Aug 27, 2020, 2:04 AM

#

yay im understanding this

velvet thorn Aug 27, 2020, 2:04 AM

#

yup, good job

atomic forge Aug 27, 2020, 2:04 AM

#

ill keep grinding my data science book then

#

matplotlib

#

cant wait

sudden cedar Aug 27, 2020, 2:16 AM

#

does anyone know how i would retrieve the neural net with the highest fitness score in NEAT
as in save that data genomes data and only run it by itself

crude karma Aug 27, 2020, 2:36 AM

#

this might be a lousy question but can you treat data frames like arrays

#

like if i import a .csv file into jupyter... can i treat the data as an array and index stuff out of it

stable sequoia Aug 27, 2020, 2:41 AM

#

if you read the csv using pandas this might help you. Take a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
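A small sketch of array-style access after reading a CSV. The CSV text is a stand-in for a real file, and the column names are assumptions.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; io.StringIO lets read_csv treat a string as a file.
csv_text = "time,value\n0,1.5\n1,2.5\n2,3.5\n"
df = pd.read_csv(io.StringIO(csv_text))

first_two_rows = df.iloc[:2]   # slice rows by position, like a list
cell = df.iloc[1, 1]           # row 1, column 1 (both zero-based)
array = df.to_numpy()          # plain NumPy array if real array semantics are needed

print(cell, array.shape)
```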

crude karma Aug 27, 2020, 2:51 AM

#

thxx

desert rapids Aug 27, 2020, 3:10 AM

#

Im learning ds now. would any of you be able to send me intro classes udemy or whatever it may be?

crude karma Aug 27, 2020, 3:16 AM

#

i used freecodecamp

lapis sequoia Aug 27, 2020, 5:21 AM

#

Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?

lapis sequoia Aug 27, 2020, 7:56 AM

#

Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?
same question

velvet thorn Aug 27, 2020, 8:32 AM

#

Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?
@lapis sequoia why do you want someone to tell you what to do

#

the world is full of interesting questions

#

find one close to your heart.

#

that's not how you should end your university journey IMO

lapis sequoia Aug 27, 2020, 8:39 AM

#

@velvet thorn i need some Suggestions only

velvet thorn Aug 27, 2020, 8:39 AM

#

then maybe you should tell us what you're interested in

#

because AI is so wide

#

like games? make a game AI

#

interested in photography? how about some kind of smart filter

#

food? maybe an NLP project for parsing recipes?

#

health? ML for mental health, given a daily questionnaire?

#

there are a ton of ideas out there; this is just what I came up with off the top of my head.

lapis sequoia Aug 27, 2020, 8:45 AM

#

I think when he asked that he meant what kind of project would be respectable enough to pass, sure there's lots of ideas but skills+cliche+some more factors narrow down the spectrum

#

Maybe ask some final-years what they made and get to know what kind of stuff works, usually it should be deployable too

velvet thorn Aug 27, 2020, 8:50 AM

#

I think when he asked that he meant what kind of project would be respectable enough to pass, sure there's lots of ideas but skills+cliche+some more factors narrow down the spectrum
@lapis sequoia it's really hard to say because standards vary widely across institutions

lapis sequoia Aug 27, 2020, 8:51 AM

#

That's true

velvet thorn Aug 27, 2020, 8:51 AM

#

but, yeah, honestly

lapis sequoia Aug 27, 2020, 8:51 AM

#

which is why it's best to ask around

velvet thorn Aug 27, 2020, 8:51 AM

#

I have seen way too many people whose first instinct is to come to a community and ask for help with something they should have spent some time thinking about first

#

so I suppose I'm a little jaded

lapis sequoia Aug 27, 2020, 8:52 AM

#

Welll

#

here it's a big issue

hearty token Aug 27, 2020, 11:12 AM

#

need some help with webscraping in #help-burrito !

desert oar Aug 27, 2020, 1:33 PM

#

I agree with gm

#

Learning to ask for help is good. But learn to try and think for yourself first.

#

If they said "hi im debating between X Y and Z topics and my advisor is ambivalent, can someone give me insight into any of these domains for an undergrad thesis?"

#

Im sure we would all be happy to help

#

Their question is one step above the people who just ask for homework answers

#

Part of writing a thesis is picking a topic, that's part of research

lapis sequoia Aug 27, 2020, 3:08 PM

#

Sounds weird but anyone from India here?

ripe forge Aug 27, 2020, 3:52 PM

#

We've got folks from all over the world, though that question doesn't seem on topic for here.

obsidian mica Aug 27, 2020, 6:02 PM

#

how could i update a dataframe in real time

#

if i am passing in input from a file and adding to it

lapis sequoia Aug 27, 2020, 6:15 PM

#

Is it possible to reference "NaN" in pandas? it's automatically filling in blank cells as such and I would like to map it to "" because that seems easier to work with outside of the pandas module. I can't seem to find out how to reference "NaN", though

tidal bough Aug 27, 2020, 6:16 PM

#

so you want to replace NaN cells with empty strings?

#

What's the datatype of those cells?

lapis sequoia Aug 27, 2020, 6:17 PM

#

Yes, simply because I can't seem to reference the "NaN" in other statements, like if statements or loops

#

objects

#

If there's a way to reference "NaN" outside of pandas, that would be nice

#

I tried numpy.nan, but no dice

tidal bough Aug 27, 2020, 6:18 PM

#

it's pandas.nan I believe

#

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

#

looks like it's generally np.nan

lapis sequoia Aug 27, 2020, 6:20 PM

#

just tried that, but apparently the module doesnt have that attribute

onyx juniper Aug 27, 2020, 6:34 PM

#

hello, do you guys have any resource recommendations for the math side of data science which will accompany me throughout my data science learning journey? i know linear algebra, calculus and linear programming, however, i really need help with the statistics

lapis sequoia Aug 27, 2020, 6:34 PM

#

I need to map it to a string because I'm using pandas with regular expressions

#

Oh, sorry.

#

found it. there's a .fillna method
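A likely reason the `numpy.nan` comparison gave "no dice" is that NaN never compares equal to anything, itself included, so `==` checks always fail. A short sketch of the usual alternatives, using a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["abc", np.nan, "def"], dtype="object")

# NaN never compares equal to anything, including itself,
# so `value == np.nan` is always False. pd.isna is the reliable test.
print(np.nan == np.nan)
print(pd.isna(s).tolist())

# Replace missing values with empty strings before applying regexes.
cleaned = s.fillna("")
print(cleaned.tolist())
```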

fallow sable Aug 27, 2020, 6:45 PM

#

the book data science from scratch goes a bit into it and has additional resources if you want to learn more which probably answers your question @onyx juniper

lapis sequoia Aug 27, 2020, 7:22 PM

#

Is there a way in pandas to get the index of a value, column, or row?

crude karma Aug 27, 2020, 8:03 PM

#

Hi, i am trying to plot a stock market graph on python with the date on the x axis and the price on the y axis. However I get an error that says KeyError: 'Date'.. but in my CSV file there is a column called date? Could it be that the jupyter notebook cannot recognize my DTG format?

woven radish Aug 27, 2020, 8:26 PM

#

@crude karma make sure capitalization is the same, but for debugging you might need to post a code snippet, like the section of your graphing code and what your df.head() looks like
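One way to check the capitalization point is to print the exact column names before indexing. The CSV below is hypothetical and assumes the header is a lowercase `date`.

```python
import io
import pandas as pd

# Hypothetical CSV whose header is lowercase "date".
csv_text = "date,price\n2020-08-26,10.0\n2020-08-27,11.5\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())     # inspect the exact column names first
has_capital_date = "Date" in df.columns   # lookups are case-sensitive
print(has_capital_date)

# One option: strip stray whitespace, normalise case, then parse the dates.
df.columns = df.columns.str.strip().str.lower()
df["date"] = pd.to_datetime(df["date"])
```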

crude karma Aug 27, 2020, 8:27 PM

#

okay i figured it out but this doesnt look like a stock chart.. how do i combine both highs and lows

📎 unknown.png

woven radish Aug 27, 2020, 8:29 PM

#

Is this close to what you’re looking for? https://www.byteacademy.co/blog/time-series-python?hs_amp=true

Visualizing Time Series Data of Stock Prices

Unleashing the power of Panadas to visualise a time series data of Stock PRices

crude karma Aug 27, 2020, 8:30 PM

#

oh danng okay ill read that and figure it out thanks buddy

lapis sequoia Aug 27, 2020, 8:45 PM

#

hey, this is my situation: i have a dataframe and need to update each row. for each row i need to make a request to retrieve the new data and replace the old data. The thing is, that if I do this sequentially, it will probably take 15-20 days. That is why I want to use multithreading so that it will only take a few hours if I parallelize the requests. I know this is probably some basic stuff for you, but what is the best way to pass the data from a pandas dataframe to each thread in python?

#

it is not good practice to create variables in a for loop for each file and row, right?

#

that's why i was thinking to create a variable for each row manually instead of doing it dynamically

#

then i would pass each variable with the datarow to a thread, make the request, update and then replace the row in the dataframe with the updated row

#

but that would mean I would need to create 200 variables by hand... so i am sure there must be some better way to do this if creating them dynamically is bad practice

#

how would you go about this?

velvet thorn Aug 27, 2020, 11:55 PM

#

Is there a way in pandas to get the index of a value, column, or row?
@lapis sequoia which do you want?

lapis sequoia Aug 27, 2020, 11:55 PM

#

I guess either. or are the methods really different from each other?

velvet thorn Aug 27, 2020, 11:56 PM

#

hey, this is my situation: i have a dataframe and need to update each row. for each row i need to make a request to retrieve the new data and replace the old data. The thing is, that if I do this sequentially, it will probably take 15-20 days. That is why I want to use multithreading so that it will only take a few hours if I parallelize the requests. I know this is probably some basic stuff for you, but what is the best way to pass the data from a pandas dataframe to each thread in python?
@lapis sequoia does any row depend on any other row?
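If the rows are independent of each other, one common pattern is a thread pool fed with one dict per row, so no per-row variables are needed. This is only a sketch: `fetch_update` is a hypothetical placeholder for the real request, and the frame is made up.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_update(row: dict) -> dict:
    """Hypothetical stand-in for the real network request for one row."""
    return {**row, "value": row["value"] * 2}


df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

# One plain dict per row; a list replaces the 200 hand-made variables.
records = df.to_dict("records")

# executor.map runs the calls concurrently and yields results in input order.
with ThreadPoolExecutor(max_workers=8) as executor:
    updated = list(executor.map(fetch_update, records))

# Rebuild the frame, reusing the original index.
df = pd.DataFrame(updated, index=df.index)
print(df["value"].tolist())
```

Threads suit this case because the work is waiting on network responses rather than computing. The `max_workers` value is arbitrary and would need tuning against whatever rate limits the remote service has.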

#

I guess either. or are the methods really different from each other?
@lapis sequoia hm...let's take a step back

#

why do you want to do that?

lapis sequoia Aug 27, 2020, 11:56 PM

#

I just think it could be useful sometimes, like if you want to sort something

velvet thorn Aug 27, 2020, 11:57 PM

#

.sort_values()?

lapis sequoia Aug 27, 2020, 11:58 PM

#

Oh, I guess there's a method for that but

#

Is there really never a good time to return the index of something? just feels like something that could come in handy

velvet thorn Aug 27, 2020, 11:58 PM

#

I don't think I have ever needed to do that, but you could filter and then access .index

#

like literally ever as far as I can remember
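For reference, "filter and then access .index" looks roughly like this. The `get_loc` calls are an addition not mentioned in the chat, and they return the integer position of a label. The frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"x": [2, 8, 5, 9]}, index=["a", "b", "c", "d"])

# Index labels of the rows that pass the filter.
labels = df[df["x"] > 5].index
print(labels.tolist())

# Integer position of a column label and of a row label.
col_pos = df.columns.get_loc("x")
row_pos = df.index.get_loc("c")
print(col_pos, row_pos)
```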

lapis sequoia Aug 27, 2020, 11:59 PM

#

oh, I didn't realize the .index method returned a value

velvet thorn Aug 28, 2020, 12:00 AM

#

also native Python nan is float('nan')

marble briar Aug 28, 2020, 12:53 AM

#

I have a dataset with 100 labels how do i calculate the accuracy?

crude karma Aug 28, 2020, 2:17 AM

#

how come when i specify a figsize, it says 'list' object has no attribute 'loc'

#

when i do df= plt.plot(df.loc[:,'Time'],df.loc[:,'VO2']) it works but when i add a figsize, that error shows up

still verge Aug 28, 2020, 5:15 AM

#

where are you adding figsize?

crude karma Aug 28, 2020, 5:43 AM

#

oh i figured it out

#

i added at the end

#

but im supposed to add at the beginning
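A sketch of the ordering described here, with made-up data. It also shows a probable source of the `'list' object has no attribute 'loc'` message: `plt.plot` returns a list of line objects, so writing `df = plt.plot(...)` rebinds `df` to that list and any later `df.loc` fails.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the column names from the message above.
df = pd.DataFrame({"Time": [0, 1, 2, 3], "VO2": [0.5, 1.2, 1.9, 2.4]})

# Set the figure size first, then plot onto that figure.
plt.figure(figsize=(10, 4))
lines = plt.plot(df.loc[:, "Time"], df.loc[:, "VO2"])

# plt.plot returns a list of Line2D objects; keeping it in its own name
# means df stays a DataFrame.
print(type(lines), type(df))
```

In a Jupyter notebook the `matplotlib.use("Agg")` line should be dropped so that figures render inline.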

still verge Aug 28, 2020, 5:44 AM

#

😄

austere swift Aug 28, 2020, 7:37 AM

#

so I'm trying to implement keras tuner as an automatic hyperparameter tuner in my model and for the weight regularization I was wondering what would be a good minimum and maximum value to have?

#

and a good value to step too

#

Ping me if you have an answer and thank you

solid lagoon Aug 28, 2020, 7:44 AM

#

hello, i have a dataframe with a column which takes only two values, say A and B, and want to create a column A_1,A_2,A_3....A_countA,B_1,B_2,....B_countB

#

how do I achieve this?

#

t = pd.Series(["a", "b", "b", "b", "b", "a"])
t
0    a
1    b
2    b
3    b
4    b
5    a
dtype: object
func(t)
0    a    a_1
1    b    b_1
2    b    b_2
3    b    b_3
4    b    b_4
5    a    a_2
dtype: object

#

can someone tell me how i can achieve func?

still verge Aug 28, 2020, 7:50 AM

#

try determining the indices of all the letters and store that into another list

#

so that you can use those indices to append to the letters

solid lagoon Aug 28, 2020, 7:52 AM

#

i have trouble getting the indices

still verge Aug 28, 2020, 7:55 AM

#

prob best if you had a function that went through the list and kept individual counters

#

and appending them to a larger list

solid lagoon Aug 28, 2020, 7:56 AM

#

you mean have a global counter

#

did it thanks

#

for reference
t = pd.DataFrame({'A': ["a", "b", "b", "b", "b", "a"]})
counter = {}
def func(x):
    # running count per value, starting at 1 to match the a_1, b_1, ... output above
    ix = counter.get(x, 0) + 1
    counter[x] = ix
    return '{0}_{1}'.format(x, ix)
t.A.apply(func)

velvet thorn Aug 28, 2020, 9:35 AM

#

uh...

#

@solid lagoon a bit late but well

#

you should actually use cumcount

#

>>> t + (t.groupby(t).cumcount() + 1).map(lambda v: f'_{v}')
0    a_1
1    b_1
2    b_2
3    b_3
4    b_4
5    a_2
dtype: object

solid lagoon Aug 28, 2020, 9:38 AM

#

thanks man, I knew I had seen this somewhere way before

velvet thorn Aug 28, 2020, 9:39 AM

#

yeah I know because I myself spent time coding exactly that

#

and then a while later I found there was something for this

lapis sequoia Aug 28, 2020, 10:32 AM

#

Hey guys is SQLite common for data analysis? I’ve just learned yesterday that Python has a sqlite library built in. Really only need a database to store data in and query what I need. I don’t have admin access on my work laptop so can’t try others without requesting, but is it at least common use?

velvet thorn Aug 28, 2020, 10:33 AM

#

Hey guys is SQLite common for data analysis? I’ve just learned yesterday that Python has a sqlite library built in. Really only need a database to store data in and query what I need. I don’t have admin access on my work laptop so can’t try others without requesting, but is it at least common use?
@lapis sequoia pandas?

lapis sequoia Aug 28, 2020, 10:37 AM

#

@velvet thorn I’m trying to avoid reading in the data every time and then selecting what I need. So came across this SQLite database I could potentially use to store the data and then query what I need. Was just wondering if SQLite is commonly used?

velvet thorn Aug 28, 2020, 10:37 AM

#

for small datasets

lapis sequoia Aug 28, 2020, 10:37 AM

#

What’s generally considered small?

velvet thorn Aug 28, 2020, 10:38 AM

#

well

#

anything under a gigabyte

#

but honestly

#

I don't really see the problem with reading data into memory every time...?

#

although if you don't have to do interactive analysis

#

SQLite might be just what you need

#

I presume you're good with SQL so why not
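A minimal sketch of the "store once, query what you need" workflow using the built-in `sqlite3` module together with pandas. The table name, columns, and data are all assumptions.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "north"], "sales": [100, 250, 175]})

# ":memory:" keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")

# Write once...
df.to_sql("sales", conn, index=False, if_exists="replace")

# ...then pull back only the rows that are needed.
north = pd.read_sql_query(
    "SELECT region, sales FROM sales WHERE region = ?", conn, params=("north",)
)
conn.close()
print(north["sales"].tolist())
```

No admin rights are needed for this, since SQLite is just a file (or memory) accessed through the standard library.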

lapis sequoia Aug 28, 2020, 10:42 AM

#

I’ve just been finding it super slow and there is certain repetitive analysis I do, that I know exactly what I need.

Well, I know the basics, but it can’t be that hard to pick up!

velvet thorn Aug 28, 2020, 10:43 AM

#

if you find pandas slow

#

generally one of two things is true

you're using it wrongly
your data is too big

#

anything above a gigabyte (on disk) starts to poke into "bad for pandas" territory (you can consider something like dask I suppose)

lapis sequoia Aug 28, 2020, 10:46 AM

#

Right I see. Yes I’m going over a gigabyte. I’m super new to this kind of stuff so more than likely not being optimal! Literally just ordered Python for Data Analysis by Wes McKinney!

velvet thorn Aug 28, 2020, 11:04 AM

#

okay so

#

very simple rule of thumb

#

if you have a for loop in your pandas code, you're probably doing something wrong

desert oar Aug 28, 2020, 11:57 AM

#

+1 although i do tend to loop over .columns occasionally
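To illustrate the rule of thumb, here is a row-by-row loop next to its vectorised equivalent, followed by the kind of column loop that is generally harmless. The frame and column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-by-row loop: works, but slow on large frames.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorised equivalent: one operation over whole columns.
df["total"] = df["price"] * df["qty"]

# Looping over columns (rather than rows) is usually cheap and fine.
for col in df.columns:
    print(col, df[col].dtype)
```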

lapis sequoia Aug 28, 2020, 12:13 PM

#

if you have a for loop in your pandas code, you're probably doing something wrong
@velvet thorn
Definitely no for loops!

desert oar Aug 28, 2020, 12:13 PM

#

@lapis sequoia can you give an example of something that's slow

#

And can you give an example of a different tool where the same operation is not slow

velvet thorn Aug 28, 2020, 12:26 PM

#

+1 although i do tend to loop over .columns occasionally
@desert oar oh yeah that's perfectly fine

bitter fiber Aug 28, 2020, 2:07 PM

#

any1 have a good pyspark resource for me to learn? Im thinking w3 schools for hiveql first.

desert oar Aug 28, 2020, 3:35 PM

#

@bitter fiber hiveql is basically just sql. i wouldnt start there

#

i dont know of specific resources, but it helps if you think of pyspark as a declarative interface to a query engine

bitter fiber Aug 28, 2020, 3:36 PM

#

I know sql just wanted to learn how to setup the environment and special quirks

desert oar Aug 28, 2020, 3:36 PM

#

ah, i cant say i know much about setting up the env

bitter fiber Aug 28, 2020, 3:36 PM

#

Right.. I have 6 raspberry pi's and 1 main computer that i wanted to interface together for a hobby of mine

desert oar Aug 28, 2020, 3:36 PM

#

but yeah, spark is weird because you have to think of it more like constructing a query or constructing a program that is to be compiled and executed, rather than executing code line by line as in python

bitter fiber Aug 28, 2020, 3:36 PM

#

I was thinking maybe it would be useful to create a datamine

desert oar Aug 28, 2020, 3:37 PM

#

i think typically you deploy on yarn or mesos, although it does support "standalone" cluster mode

#

and i guess it supports k8s too

#

https://spark.apache.org/docs/latest/index.html#launching-on-a-cluster

Overview - Spark 3.0.0 Documentation

Apache Spark 3.0.0 documentation homepage

#

the docs are decent albeit sometimes disorganized

#

i would start by practicing w/ pyspark itself on a local cluster before you try to actually deploy on your rpi farm

bitter fiber Aug 28, 2020, 3:40 PM

#

What does standalone cluster mode mean?

#

ok so on my own computer in a local cluster meaning running just on my workstation?

#

my workstation that im working on first has 16 physical and 32 total with virtual cpus I want to learn how to utilize everything.

desert oar Aug 28, 2020, 3:44 PM

#

"standalone cluster" would be spark running directly on the machines without an engine like yarn/mesos/kubernetes underneath it

bitter fiber Aug 28, 2020, 3:44 PM

#

ah..

desert oar Aug 28, 2020, 3:44 PM

#

"local cluster" is 1 machine

#

i think for making use of a single high-core workstation spark probably isn't the best unless you have tons of RAM

bitter fiber Aug 28, 2020, 3:45 PM

#

256 GBS of ram

desert oar Aug 28, 2020, 3:45 PM

#

oh yeah

#

go for it, see how it works

bitter fiber Aug 28, 2020, 3:45 PM

#

I bought a 1500 dollar refurbished machine

desert oar Aug 28, 2020, 3:46 PM

#

i have a similar machine at work, its nice but we never use spark on it

#

for big stuff there i just use dask or i just yolo 30 GBs of data into memory with pandas or data.table

bitter fiber Aug 28, 2020, 3:46 PM

#

Thats what I do for work. pandas

#

i would like to start a data mine in my house that consumes many public apis

desert oar Aug 28, 2020, 3:46 PM

#

thats a fun project

bitter fiber Aug 28, 2020, 3:47 PM

#

I was originally thinking of running with a LAN mongodb to not worry about schemas

#

and just ingest everything into my big computer

desert oar Aug 28, 2020, 3:47 PM

#

spark/hdfs is probably better for that

#

dump it all to a NAS

bitter fiber Aug 28, 2020, 3:47 PM

#

Right..

#

My brother has a NAS

floral mantle Aug 28, 2020, 3:47 PM

#

Any tips or starting points on downloading main posts and comments from a Facebook group I’m a member of? Doing some text analysis and word cloud type stuff - want to know if it’s doable, link examples, and see if anyone’s aware if it’s against any sort of TOS

desert oar Aug 28, 2020, 3:47 PM

#

read in with rpis and save on hdfs running on a NAS or something? idk

#

@floral mantle that's probably against facebook's TOS, and you should check to see if they have any provisions about "automation" or "crawling"

floral mantle Aug 28, 2020, 3:48 PM

#

I think you’d have to use their Graph API to do it and I see references for it

bitter fiber Aug 28, 2020, 3:48 PM

#

only 2 TB of harddisk on my workstation though..

floral mantle Aug 28, 2020, 3:48 PM

#

So need a Dev key etc.

bitter fiber Aug 28, 2020, 3:48 PM

#

facebook is tough. you need to get verified app permission

#

its not like twitter which is more open

desert oar Aug 28, 2020, 3:49 PM

#

yeah you can use the graph API

#

(if you can get access)

bitter fiber Aug 28, 2020, 3:49 PM

#

yeah i would say learning the Graph api is very valuable in the marketplace though

desert oar Aug 28, 2020, 3:49 PM

#

i dont know what goes into getting that kind of permission

#

@bitter fiber or just work for Big Corp where they contract out all that stuff 😛

#

(but then you end up doing half the work for the contractor anyway because they dont know wtf theyre doing)

bitter fiber Aug 28, 2020, 3:50 PM

#

Lol they hired some guy to just maintain the facebook api and he barely works now

#

at my job and they cant fire him because everyone else is too lazy to work on that.

#

its more legal stuff than anything lmao

#

@desert oar i had another question; a claim that people say about hadoop is that you use 1/10th the server cost; is that because you compress the data more or something across a cluster?

#

built in backups?

desert oar Aug 28, 2020, 3:55 PM

#

i dont know what that even means

#

like if you pay for 300 GB of storage hadoop only lets you use 30 GB?

#

i dont work with hadoop directly ever so i honestly have no idea, but that's a questionable claim

bitter fiber Aug 28, 2020, 5:06 PM

#

Gotcha..

polar acorn Aug 28, 2020, 8:51 PM

#

Might be true if you have a lot of replication going on. Storing 30 GB of data might end up taking up a lot more than just that.

#

Not sure how this applies to hadoop directly but this blog post gives an overview of why storing 30 GB of data might need 300 GB. https://jrs-s.net/2016/11/08/depressing-storage-calculator/

jovial lotus Aug 28, 2020, 10:34 PM

#

Hi all, I have a machine learning algorithm that I am trying to code. I have had very little experience with it so I am getting stuck on what type of algorithm I should use. I am trying to make a program that if a song is playing (SONG X), it recommends the next song (song Y). In order to do so, I have a set of variables that song Y should fit or be closest to (variables a,b,c,d,e,f....). All of the variables are percentages. Given a list of songs that song Y could be in, I want to find the best match for song Y in the list. If it was only one variable, all I would do is find the song in the list of songs that has the closest variable value. But what do I do once I start comparing multiple variables?

tidal bough Aug 28, 2020, 10:35 PM

#

So, find the closest point to a given one in a high-dimensional space, based on a predefined metric?

jovial lotus Aug 28, 2020, 10:35 PM

#

I believe so yes ?

tidal bough Aug 28, 2020, 10:37 PM

#

I mean, that's it. It's no different from the one-variable case. The only thing you need is to design the metric function. For that, you could use just sum of squared differences (euclidean metric), but note that you'd want to normalize all of your parameters then (so that they're all 0 mean and 1 variance), otherwise certain parameters will affect the distance more than others.

jovial lotus Aug 28, 2020, 10:40 PM

#

So sum of the squared differences and then find the song that has the smallest sum. Which should be the song that is least different?

tidal bough Aug 28, 2020, 10:41 PM

#

Pretty much. I mean, that's just finding the closest point in space to this one.

#

To calculate the distances efficiently, you can use https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html#scipy.spatial.distance.cdist

#

EDIT: fixed link, it's cdist you want.
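
A minimal sketch of the cdist lookup suggested above. The feature values and the names `songs` and `target` are made up for illustration; each row is one song and each column one of the percentage variables.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical feature matrix: one row per candidate song, one column per variable.
songs = np.array([
    [0.80, 0.10, 0.50],
    [0.20, 0.90, 0.40],
    [0.70, 0.20, 0.60],
])

# Hypothetical target values that "song Y" should be closest to (kept 2-D for cdist).
target = np.array([[0.78, 0.12, 0.50]])

# distances[0, j] is the Euclidean distance between the target and song j.
distances = cdist(target, songs, metric="euclidean")

# Index of the best-matching song.
best = int(np.argmin(distances))
print(best, distances[0, best])
```

cdist accepts several targets at once, so the same call also works when the first argument has more than one row.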

jovial lotus Aug 28, 2020, 10:44 PM

#

Wow that is literally so much help, thank you

#

I have been trying a bunch of different complicated algorithms for the past few days

#

So there is one caveat, one of the variables isnt a percentage like the others are. That variable being bpm(beats per minute) which doesnt really have a max or a min so there isnt a way for me to represent it in a percentile manner.

tidal bough Aug 28, 2020, 10:46 PM

#

Sure there is 🙂

jovial lotus Aug 28, 2020, 10:46 PM

#

How so?

tidal bough Aug 28, 2020, 10:47 PM

#

For your entire dataset, for each variable, calculate the mean and standard deviation for that variable

#

Then subtract the mean and divide by std.

#

Every variable will then end up with 0 mean and 1 std.

#

It means they'll then lose obvious meaning (having 1 on the bpm score would mean "it's around 1 standard deviation more than the mean among all songs"), but that'd put them all into a similar range.

#

~~scikit-learn has a function for this transformation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html~~

#

wait

#

wrong one, hold on

#

ah, there it is
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale
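
The subtract-the-mean, divide-by-std step described above can be sketched in plain NumPy. The numbers and the column layout (two percentage features plus BPM) are invented for illustration.

```python
import numpy as np

# Hypothetical data: columns are [energy, acousticness, bpm].
X = np.array([
    [0.8, 0.1, 120.0],
    [0.2, 0.9,  90.0],
    [0.7, 0.2, 150.0],
    [0.4, 0.5, 100.0],
])

# Per-column statistics computed over the whole dataset.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardized features: every column now has mean 0 and std 1,
# so BPM no longer dominates the distance just because its numbers are bigger.
Z = (X - mean) / std

# A new query point has to be scaled with the SAME mean and std.
query = np.array([0.5, 0.4, 110.0])
query_z = (query - mean) / std
print(Z.round(2))
print(query_z.round(2))
```

A column with zero variance would cause a division by zero here, so constant features would need to be dropped or handled separately.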

jovial lotus Aug 28, 2020, 10:50 PM

#

I'm writing all of this down. I think that is all I need so really really thank you

tidal bough Aug 28, 2020, 10:50 PM

#

and scikit-learn is nice in that it has detailed User Guides

#

Here's one for data standardization: https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

#

oh, and by the way, scikit-learn actually has an entire module for efficiently (without considering every single other point) finding the closest neighbours of a point:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors - list of functions
https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors - User Guide for it.

jovial lotus Aug 28, 2020, 10:55 PM

#

So, I am trying to have this recommender program iterate through the entire list so that each song is "perfectly" played after another and that it creates a playlist/mix. The best move would be to add all of the summed differences through the entire playlist and then compare that with other versions of the playlist, possibly every single version of the playlist. I feel like that would be too brute force. Do you have any advice on how I should do that?

tidal bough Aug 28, 2020, 10:56 PM

#

Why not just start from a random (or user-defined) point and then traverse the graph of songs, always choosing the closest non-explored point?

jovial lotus Aug 28, 2020, 10:57 PM

#

Holy crap, okay, better to look back into my algorithm textbooks haha

tidal bough Aug 28, 2020, 10:57 PM

#

The best move would be to add all of the summed differences through the entire playlist and then compare that with other versions of the playlist, possibly every single version of the playlist.
In geometry terms, this can be rephrased as "I want to find the shortest-length path that visits all of my points exactly once". Do you happen to know how that task is called, perhaps? 🙂

jovial lotus Aug 28, 2020, 10:58 PM

#

eularian or something right?

#

eu- something haha

#

eulerian path

tidal bough Aug 28, 2020, 10:59 PM

#

Not quite: an Eulerian path traverses every edge exactly once, while a path visiting all nodes exactly once is a Hamiltonian path. But the problem of finding the one with the shortest length is very (in?)famous under the name of the https://en.wikipedia.org/wiki/Travelling_salesman_problem

#

it's, uhm, a very very hard problem. NP-hard, even (its decision version is NP-complete).

#

So you probably shouldn't bother. Just always go to the closest unexplored neighbour or something.
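
The "always go to the closest unexplored neighbour" idea can be sketched as a greedy loop. This is a heuristic, not an optimal TSP solution, and the function name `greedy_playlist` and the toy features are hypothetical.

```python
import numpy as np

def greedy_playlist(features, start=0):
    """Order songs by repeatedly jumping to the nearest not-yet-played song."""
    features = np.asarray(features, dtype=float)
    unvisited = set(range(len(features)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        current = features[order[-1]]
        candidates = np.array(sorted(unvisited))
        # Euclidean distance from the current song to every remaining song.
        dists = np.linalg.norm(features[candidates] - current, axis=1)
        nxt = int(candidates[np.argmin(dists)])
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Toy (already standardized) features: two clusters of two songs each.
toy = [[0, 0], [10, 10], [1, 1], [9, 9]]
print(greedy_playlist(toy, start=0))
```

Each step scans all remaining songs, so the whole playlist costs on the order of n² distance evaluations, which is usually fine for a personal music library.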

jovial lotus Aug 28, 2020, 11:01 PM

#

yeah, just that every single other song is technically a neighbor

#

unless I could categorize the bpms as neighbors since I want the songs to flow into each other...

tidal bough Aug 28, 2020, 11:02 PM

#

yup, it's like Euclidean TSP on a plane (where you can go to any city you want and the distance is just the euclidean distance between them), but it's in n-dimensional space instead 😅

#

Nevertheless, whenever your problem turns out to be a subclass of TSP, that's generally a sign that you might want to simplify it.

#

(TSP isn't easily solved even for points on a plane)

jovial lotus Aug 28, 2020, 11:04 PM

#

Okay yeah, I think this is a great starting point, thank you

#

Mind if I add you in case I have any other questions?

tidal bough Aug 28, 2020, 11:08 PM

#

sure

modest rune Aug 29, 2020, 12:46 AM

#

I am having a long back and forth on the coursera forums for Andrew Ng's machine learning course. Either I am just dense and need someone else to explain things to me (most likely), or the other person is wrong. Anyone on here willing to help me out. Here is the discussion (I had to save it to a PDF since the forum post is behind a user/password wall on coursera).

https://gofile.io/d/5ElxoJ

#

Here is the coursera link, in case you have credentials and can view the forum (Just in case you are wary about opening some random dude's PDF from a file sharing site you may not be familiar with)
https://www.coursera.org/learn/machine-learning/discussions/weeks/1/threads/5WdAbuk8EeqXNhLj2fFeZQ

tidal bough Aug 29, 2020, 1:28 AM

#

I don't think either of you are really wrong. You're basically asking why use the least squared error function of all things. The answer is something like "it's the provably best way under certain assumptions to minimize the mean error".

#

The data you made doesn't really fit these assumptions, so it unsurprisingly is a very bad fit under LSE. You could potentially achieve that orange line by detecting outliers - for example, if one searched for a subset of points of size around 70% of the total that had the least average squared error when fitting a line to it, then one would obtain the red line:

📎 unknown.png

#

So, I guess, I could also say that your concerns are valid, but they pretty much never occur in practice. You don't usually have to fit a line to a dataset that's obviously non-linear.

lapis sequoia Aug 29, 2020, 1:51 AM

#

Any suggestions on how to use NumPy to get rid of the for-loops in the function shown below?

def mu_davidson(mus, mws, xs):
    mus = np.asarray(mus)
    mws = np.asarray(mws)
    xs = np.asarray(xs)

    a = 0.375
    e = (2 * np.sqrt(mws * np.array([mws]).T)) / (mws + np.array([mws]).T)

    f = 0.0
    n = len(mus)
    for i in range(n):
        for j in range(n):
            f = f + xs[i] * xs[j] * e[i, j]**a / np.sqrt(mus[i] * mus[j])

    mu_mix = 1 / f
    return mu_mix

Here's an example of using the function:

mus = [179.75, 363.87]
mws = [2.016, 28.014]
xs = [0.85, 0.15]
mu_mix = mu_davidson(mus, mws, xs)
print(f'mu_mix = {mu_mix}')

velvet thorn Aug 29, 2020, 2:06 AM

#

@lapis sequoia what's that supposed to do?

modest rune Aug 29, 2020, 2:09 AM

#

@tidal bough thanks! FYI, I am taking Professor Ng's course so I can make sense of what you told me a few days ago. I almost have it all sorted out.

Regarding the recent discussion, this is the latest update, which I think clears up my confusion and explains the other dude's opinion:

Found this, and I think it answers my question:

https://www.mathworks.com/help/stats/examples/fitting-an-orthogonal-regression-using-principal-components-analysis.html

"PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as Orthogonal Regression or Total Least Squares, and is appropriate when there is no natural distinction between predictor and response variables, or when all variables are measured with error. This is in contrast to the usual regression assumption that predictor variables are measured exactly, and only the response variable has an error component."

My interpretation of this statement is... Normally, we assume zero error in the X axis (the input), only error in the Y axis (the output). But, in the case that the X value is also susceptible to error, then PCA is a better fit.

So, for the example of the square feet of a home vs predicted home price. There is a negligible error in the square feet measurement that can be assumed to be zero, while there is much error in the price values. In that case, do not use PCA.

However, if the city required the use of a specific contractor to make square feet measurements on homes and that contractor was known to intentionally add error into their measurements just to throw everyone off, then PCA would be the better method to use.

IF I understood everything correctly, that explanation clears up my confusion. Please let me know if I am understanding this correctly.

lapis sequoia Aug 29, 2020, 2:10 AM

#

@velvet thorn See my edit. I added an example of using the function.

modest rune Aug 29, 2020, 2:11 AM

#

As far as I can tell PCA matches what I was trying to do with my intuitive fitting using the shortest perpendicular distance.

velvet thorn Aug 29, 2020, 2:11 AM

#

@velvet thorn See my edit. I added an example of using the function.
@lapis sequoia I mean, what are the for loops intended to achieve?

#

understanding how to optimise your algorithm from a high-level description would be simpler than trying to figure it out from your code

lapis sequoia Aug 29, 2020, 2:13 AM

#

The for-loops are part of a summation equation.

📎 screenshot.png

velvet thorn Aug 29, 2020, 2:14 AM

#

okay, it is too early for me to read math or iterative numpy code, so I will leave this to someone else...

#

hopefully someone else will come along

#

never mind I got bored and did it

#

@lapis sequoia ((xs * xs.T) * (e ** a) / ((mus * mus.T) ** 0.5)).sum(axis=None)

#

you need to make xs and mus 2D

#

[:, np.newaxis]

tidal bough Aug 29, 2020, 2:29 AM

#

As far as I can tell PCA matches what I was trying to do with my intuitive fitting using the shortest perpendicular distance.
@modest rune Pretty much. PCA is used for dimensionality reduction - basically, take a lot of points in n-dimensional space, and find an m-dimensional (m<n) subspace to project the points to such that the distances from the points to their projections are minimized. For n=3,m=2, it's finding a plane in 3d space that the data most closely matches. For n=2,m=1, it's your example. Unlike LSE, PCA indeed doesn't have any direction bias - in fact, PCA is perfectly fine with fitting a vertical line, something LSE can't do at all, because, well, the latter assumes that y is a function of x.
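
A small sketch of that contrast, using only NumPy. The helper names `ols_slope` and `pca_direction` are made up here; the PCA line is taken as the first right-singular vector of the centered points.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least squares: minimizes vertical distances, y as a function of x."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

def pca_direction(x, y):
    """First principal direction: minimizes perpendicular distances to the line."""
    pts = np.column_stack([x, y])
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit vector; its sign is arbitrary

# Noise-free points on y = 2x + 1: both methods agree on the slope.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
d = pca_direction(x, y)
print(ols_slope(x, y), d[1] / d[0])

# A vertical line: PCA returns the direction (0, +-1); a y = f(x) fit has no valid slope here.
xv = np.zeros(4)
yv = np.array([0.0, 1.0, 2.0, 3.0])
dv = pca_direction(xv, yv)
print(dv)
```

On noisy data the two slopes generally differ, because the two methods measure the error along different directions.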

#

@lapis sequoia

Any suggestions on how to use NumPy to get rid of the for-loops in the function shown below?
I'd say like this:

prod = np.multiply.outer(xs,xs) # prod[i,j] = xs[i]*xs[j]
prod *= e**a
prod /= np.sqrt(np.multiply.outer(mus,mus))
f = np.sum(prod)

#

outer is one of my favorite numpy features; I've done so much stuff to mimic its behavior before I found it 🙂

velvet thorn Aug 29, 2020, 2:33 AM

#

yeah, I should have used the outer product

#

...too early.

#

perhaps I should go back to sleep

lapis sequoia Aug 29, 2020, 2:35 AM

#

@tidal bough and @velvet thorn thanks, I had no idea outer existed

tidal bough Aug 29, 2020, 2:36 AM

#

it can be applied to any numpy ufunc

lapis sequoia Aug 29, 2020, 2:41 AM

#

What's the difference between np.outer and np.multiply.outer? They appear to do the same thing.

velvet thorn Aug 29, 2020, 2:44 AM

#

What's the difference between np.outer and np.multiply.outer? They appear to do the same thing.
@lapis sequoia effectively, nothing

#

outer is a method on all numpy ufuncs

#

so, for example, you could have np.add.outer

#

however, because np.multiply.outer on 1-D inputs is a special operation known as the outer product, it also gets a dedicated top-level function, np.outer (which flattens its inputs first)

#

you can tell this if you look at their signatures.

#

np.multiply.outer takes the generic ufunc.outer signature

#

whereas np.outer is different
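
A quick sketch of where the two actually differ: for 1-D inputs they give the same array, but np.outer flattens higher-dimensional inputs while ufunc.outer keeps their shapes. The arrays `a` and `b` are arbitrary examples.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # a 2-D input
b = np.arange(4)                 # a 1-D input

# np.outer flattens both inputs first, so the result is always 2-D.
flat = np.outer(a, b)

# ufunc.outer keeps the input dimensions: result shape is a.shape + b.shape.
full = np.multiply.outer(a, b)

print(flat.shape, full.shape)

# For 1-D inputs the two are interchangeable.
same = np.array_equal(np.outer(b, b), np.multiply.outer(b, b))
print(same)

# The same .outer method exists on other ufuncs, e.g. a table of pairwise sums.
sums = np.add.outer(b, b)
print(sums)
```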

lapis sequoia Aug 29, 2020, 2:46 AM

#

Ah, I see. Thanks again.

velvet thorn Aug 29, 2020, 2:46 AM

#

np

lapis sequoia Aug 29, 2020, 2:55 AM

#

I revised my previous function based on help from @tidal bough and @velvet thorn. This looks much cleaner.

def mu_davidson(mus, mws, xs):
    a = 0.375
    e = 2 * np.outer(mws, mws)**0.5 / np.add.outer(mws, mws)
    f = np.sum(np.outer(xs, xs) * e**a / np.outer(mus, mus)**0.5)
    mu_mix = 1 / f
    return mu_mix
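
A sanity check, sketched under the assumption that the loop version and the revised version are meant to be numerically identical. The two function names are hypothetical renames so both can live in one file; the inputs are the example values posted earlier.

```python
import numpy as np

def mu_davidson_loop(mus, mws, xs):
    """Original double-loop formulation."""
    mus, mws, xs = np.asarray(mus), np.asarray(mws), np.asarray(xs)
    a = 0.375
    e = (2 * np.sqrt(mws * np.array([mws]).T)) / (mws + np.array([mws]).T)
    f = 0.0
    n = len(mus)
    for i in range(n):
        for j in range(n):
            f += xs[i] * xs[j] * e[i, j]**a / np.sqrt(mus[i] * mus[j])
    return 1 / f

def mu_davidson_vec(mus, mws, xs):
    """Vectorized formulation using outer products."""
    a = 0.375
    e = 2 * np.outer(mws, mws)**0.5 / np.add.outer(mws, mws)
    f = np.sum(np.outer(xs, xs) * e**a / np.outer(mus, mus)**0.5)
    return 1 / f

mus = [179.75, 363.87]
mws = [2.016, 28.014]
xs = [0.85, 0.15]

loop_result = mu_davidson_loop(mus, mws, xs)
vec_result = mu_davidson_vec(mus, mws, xs)
print(loop_result, vec_result, np.isclose(loop_result, vec_result))
```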

graceful glacier Aug 29, 2020, 3:28 AM

#

can anyone who's familiar with SQL help me figure out why the last statement is printing 1?

#

📎 unknown.png

#

the first table is derived from the 'hacker_news' table. it shows the top users and their score

graceful glacier Aug 29, 2020, 3:45 AM

#

update: i found my mistake, it was a miscalculation i made with the JOIN statement

tall sierra Aug 29, 2020, 12:21 PM

#

Hi guys, I make youtube videos where I vulgarize Artificial Intelligence terms and news for non-experts. My goal is to demystify the AI “Black box” for everyone and sensitize people about the risks. Give it a check if you can, I am actually posting a new video in 2 hours! 😁 I would love any feedback (especially negative, but pertinent) in order to improve my videos and vulgarizing skills! Thank you!
Here's the channel: https://www.youtube.com/c/WhatsAI

hallow briar Aug 29, 2020, 3:49 PM

#

Got anything on LSTMs? Cause I'd love to know wtf those are doing

midnight goblet Aug 29, 2020, 5:01 PM

#

yeah LSTM is amazing but I'm looking for YOLO v4 @hallow briar

grand pike Aug 29, 2020, 7:38 PM

#

Hi All- I am fairly new to discord and have a question regarding time series analysis and handling missing data in a df

#

I have a df, indexed by date, to capture the spread history of various bonds. However, for some bonds, the spread levels become unstable as the bond reaches maturity/is close to being paid off (as shown by the sudden drops in the plot)

#

sample spreads

📎 unstable_spreads.jpg

#

Now, I want to apply some data quality checking to stabilize such bonds

#

specifically, the DQ check that I want to apply is that across the bonds (which are columns in the df), each time there are 10 consecutive NaN i.e. missing data, cut off the rest of the data as the last 20 days of data

#

However, I am struggling to find a clean way of defining a function to perform this DQ check. Any thoughts on what an ideal approach may be?

#

corresponding df

📎 118580522_949058188927859_6128813426355704531_n.jpg

#

*as well as the last 20 days of data
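
One possible reading of that check, as a rough sketch: for each column, find the first run of 10 consecutive NaNs that occurs after the bond's first valid value, then blank out everything from 20 rows before that run onward. That interpretation, the function name `truncate_after_nan_run`, and the toy series are all assumptions.

```python
import numpy as np
import pandas as pd

def truncate_after_nan_run(col, run_length=10, trim=20):
    """Blank a column from `trim` rows before its first long NaN run onward."""
    values = col.to_numpy(dtype=float)
    valid = np.flatnonzero(~np.isnan(values))
    if valid.size == 0 or len(values) < run_length:
        return col
    nan_flags = np.isnan(values).astype(int)
    nan_flags[:valid[0]] = 0  # ignore NaNs before the bond starts trading
    # windows[i] counts the NaNs in values[i : i + run_length].
    windows = np.convolve(nan_flags, np.ones(run_length, dtype=int), mode="valid")
    hits = np.flatnonzero(windows == run_length)
    if hits.size == 0:
        return col
    cut = max(int(hits[0]) - trim, 0)
    out = col.copy()
    out.iloc[cut:] = np.nan
    return out

# Toy spread history: 30 good rows, a 10-row gap, then 10 unstable rows.
s = pd.Series(np.arange(50, dtype=float))
s.iloc[30:40] = np.nan
cleaned = truncate_after_nan_run(s)
print(cleaned.notna().sum())
```

For a whole dataframe of bonds, `df.apply(truncate_after_nan_run)` would run the same check column by column.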

lapis sequoia Aug 29, 2020, 8:27 PM

#

Hi. The binned distribution of one of the columns in a dataframe is shown below in blue. I've tried removing outliers using IQR and variations of IQR (tuning the quantiles) and in red you see the binned distribution of the subset of elements which lie in the quantiles [0.05, 0.95]
My question is why the red distribution is so much smaller. The filtering removed only about 100 elements. Shouldnt the red be about as high as the blue distribution?

📎 unknown.png

#

Zoomed in on x E [0, 5]

📎 unknown.png

tidal bough Aug 29, 2020, 10:07 PM

#

@lapis sequoia It looks like the bins of the red one are smaller and the histogram is not normalized (density=True isn't passed), so the smaller the bins, the lower they will be (because fewer elements fall into them).

#

Pass density=True when examining distributions. Here's a comparison:

#

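import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0, 500, 10000)
plt.close()
plt.figure()
plt.hist(X, bins=50)   # raw counts: wider bins collect more points
plt.hist(X, bins=100)  # narrower bins, so every bar is lower
plt.show()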

produces:

📎 unknown.png

#

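import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0, 500, 10000)
plt.close()
plt.figure()
plt.hist(X, bins=50, density=True)   # normalized so the bar areas sum to 1
plt.hist(X, bins=100, density=True)  # heights are now comparable across bin widths
plt.show()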

produces:

📎 unknown.png

tight stone Aug 29, 2020, 10:33 PM

#

A more specific question to DL/ML:
Are there any publications/projects/demos of a program that uses hand gestures to control e.g. a web page or similar?

tidal bough Aug 29, 2020, 10:38 PM

#

gesture control on google scholar gives a lot of promising patents:
https://patents.google.com/patent/US9640181B2/en
https://patents.google.com/patent/US8448083B1/en

but I'm actually unable to find studies, huh.

plucky cairn Aug 29, 2020, 10:53 PM

#

hey beginner question on good algorithms to try for classifying text into one of two categories

tidal bough Aug 29, 2020, 10:56 PM

#

scikit-learn is an amazing library for ready solutions (rather than making your own). Check out logistic regression, and every topic with "classification" in it:
https://scikit-learn.org/stable/supervised_learning.html
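
As a starting point for the two-category text case, here is a minimal sketch that pairs TF-IDF features with logistic regression in a scikit-learn pipeline. The tiny training set and the labels "pos"/"neg" are invented, and a real project would need far more data and a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: two categories, a couple of examples each.
texts = [
    "great wonderful film",
    "loved brilliant acting",
    "terrible boring plot",
    "awful dull script",
]
labels = ["pos", "pos", "neg", "neg"]

# The vectorizer turns each text into a weighted bag-of-words vector,
# and logistic regression learns a linear boundary between the two classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

predictions = model.predict(["brilliant wonderful", "boring awful"])
print(list(predictions))
```

Swapping `LogisticRegression()` for another classifier from the linked page only changes one line, which makes the pipeline convenient for comparing algorithms.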

plucky cairn Aug 29, 2020, 10:59 PM

#

that page is exactly what i needed thanks

naive jay Aug 29, 2020, 11:25 PM

#

hey guys can anyone link me some sites where i can find some data sets, specifically im looking for server logs

tight stone Aug 29, 2020, 11:26 PM

#

@tidal bough Wow, thanks for those. Exactly what I am searching for

ripe mortar Aug 30, 2020, 2:09 AM

#

Hello. I'm testing some Pandas and threading/multiprocessing. I find it odd that threading is a bit faster than multiprocessing. The function I passed to multiprocessing.Process and threading.Thread sums() a dataframe and threading finished first. Is this right? I thought multiprocessing would finish faster.

still otter Aug 30, 2020, 2:10 AM

#

are you actually doing parallel work?

ripe mortar Aug 30, 2020, 2:15 AM

#

are you actually doing parallel work?
@still otter I'm counting the number of votes in the dataframe per candidate and I pass it on to a function that filters the dataframe by candidate name and sum() them up.

#

This is the threading version. The multiprocessing one is similar.

📎 Screen_Shot_2020-08-30_at_10.16.43_AM.png

still otter Aug 30, 2020, 2:23 AM

#

hm. well in general Thread is faster because it has less overhead, but threads can't run pure-python code in parallel because of the GIL. So which is faster depends on how you are doing the computation and how much data you're working with

#

i don't know much about pandas but it's possible that sum() is run in native code that releases the GIL, which means it can be run concurrently with Threads, in which case the main downside of Thread is sidestepped and Thread will almost certainly be faster in this case
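
A rough sketch of the thread-based version of that vote count, using NumPy arrays in place of the dataframe from the screenshot (which isn't reproduced here). The candidate names and data are made up, and whether the threads actually run in parallel depends on the summing code releasing the GIL, which this example does not measure.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-candidate vote counts, one array per candidate.
votes = {name: rng.integers(0, 100, size=100_000) for name in ["A", "B", "C"]}

def count_votes(name):
    """Sum one candidate's votes; the heavy lifting happens inside NumPy."""
    return name, int(votes[name].sum())

# Baseline: one candidate after another.
serial = dict(count_votes(name) for name in votes)

# Same work handed to a small thread pool; threads share `votes` without copying.
with ThreadPoolExecutor(max_workers=3) as pool:
    threaded = dict(pool.map(count_votes, votes))

print(serial == threaded)
```

A process pool would need to pickle and send the data to each worker, and that extra cost is one plausible reason multiprocessing can come out slower for a job this small.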

ripe mortar Aug 30, 2020, 2:33 AM

#

Thank you!

upper vessel Aug 30, 2020, 4:45 AM

#

anyone know a method to reduce mode collapse in GANs, without adding another neural network?

frail arch Aug 30, 2020, 4:59 AM

#

does anyone know how to add a custom function into the model in Keras? as in I want to pass the output of a layer through my function and use its output for another layer

hasty grail Aug 30, 2020, 5:20 AM

#

@frail arch Can't you just use the functional API? Can you provide an example of what you want to achieve?

frail arch Aug 30, 2020, 6:20 AM

#

@hasty grail for eg. say, I want to take output of a layer, add something to it, pass it to a dictionary and use the dictionary's output as input for the next layer

hasty grail Aug 30, 2020, 6:53 AM

#

Does the functional API not work for that?

lapis sequoia Aug 30, 2020, 10:04 AM

#

scikit-learn is an amazing library for ready solutions (rather than making your own). Check out logistic regression, and every topic with "classification" in it:
https://scikit-learn.org/stable/supervised_learning.html
@tidal bough Thanks I also needed this page

gaunt tusk Aug 30, 2020, 10:27 AM

#

Anyone got any good resources on reinforcement learning?

lapis sequoia Aug 30, 2020, 10:45 AM

#

hello, i am going in to year 12 and am looking to do CS at uni, can someone explain what a job in data-science would entail

vivid wren Aug 30, 2020, 11:04 AM

#

I'm working on a pixel art editing program and wanted to know what a good method for finding similar neighbors with bucket fill? I have the pixels mapped with a dictionary in f"{x}x{y}" format. I made my own function which figures out all the valid neighbors recursively but don't know if there is a more efficient method.

#

def bucket_fill(id, layer):
    to_fill = [id]

    def find_neighbors(neighbor_id):
        x, y = neighbor_id.split("x")
        x, y = int(x), int(y)
        l = None if x == 0 else f"{x-1}x{y}"
        r = None if x == layer.width - 1 else f"{x+1}x{y}"
        t = None if y == 0 else f"{x}x{y-1}"
        b = None if y == layer.height - 1 else f"{x}x{y+1}"
        neighbors = [l, r, t, b]
        return [n for n in neighbors if n]

    def check_neighbors(neighbor_list): #Check if color matches, and not already in the to-fill list, returns new pixels to check after adding them to to-fill
        new_neighbors = []
        for n in neighbor_list:
            if not n in to_fill:
                if layer.pixeldict[n].color  == layer.pixeldict[id].color:
                    to_fill.append(n)
                    new_neighbors.append(n)
                    print(f"Added {n}")
        return new_neighbors

    def check(neighbor_id, i): #Recursively check a pixel and its neighbors
        print(f"Check recursion {i}")
        neighbors = find_neighbors(neighbor_id)
        neighbor_list = check_neighbors(neighbors)
        print(neighbor_list)
        for n in neighbor_list:
            check(n, i + 1)

    check(id, 0)
    return to_fill

hasty grail Aug 30, 2020, 12:24 PM

#

Am not entirely sure why you need to store them in a dictionary. Wouldn't a 2-D array do pretty much the same thing?

#

Seems that you are performing a recursive (depth-first style) graph search, which is perfectly valid imo
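
An iterative variant of the same flood fill, sketched with a queue, a set of visited pixels, and a plain 2-D list instead of the "{x}x{y}" dictionary. The grid layout (`grid[y][x]` holding a colour) is an assumption, not the editor's actual data model.

```python
from collections import deque

def bucket_fill(grid, start):
    """Return the set of (x, y) pixels connected to `start` that share its colour."""
    height, width = len(grid), len(grid[0])
    sx, sy = start
    target = grid[sy][sx]
    seen = {start}            # set membership is O(1), unlike `n in list`
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        # Visit the four direct neighbours: left, right, top, bottom.
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and (nx, ny) not in seen and grid[ny][nx] == target:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return seen

# Toy 3x3 layer: "r" and "b" stand in for two colours.
grid = [
    ["r", "r", "b"],
    ["b", "r", "b"],
    ["r", "b", "b"],
]
print(sorted(bucket_fill(grid, (0, 0))))
```

Because it uses an explicit queue rather than recursion, this version cannot hit Python's recursion limit on large same-coloured regions.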

modest rune Aug 30, 2020, 1:28 PM

#

This I think is a super easy question. In numpy, what is the best way to create a 2D array (let's call it G) with dimensions Mx2, where each column is a feature with a defined linspace of values I want to predict on, and G contains every possible combination of the 2 features' linspace values?

#

for example:

# probably would create these using linspace, unless a function exists that does
# everything at once.
ages = [45;50;55;60]
nose_pimples = [0;1;2]

# Desired Result
G = 
[ 0, 45;
  0, 50;
  0, 55;
  0, 60;
  1, 45;
  1, 50;
  1, 55;
  1, 60;
  2, 45;
  2, 50;
  2, 55;
  2, 60  ]

hasty grail Aug 30, 2020, 1:50 PM

#

np.stack(list(itertools.product(nose_pimples, ages)))

#

or you can use meshgrid I guess
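
For reference, a sketch of the meshgrid route, using the ages and nose_pimples values from the question. The `indexing="ij"` choice is what makes the first feature vary slowest, matching the desired row order.

```python
import numpy as np

ages = np.array([45, 50, 55, 60])
nose_pimples = np.array([0, 1, 2])

# With indexing="ij", p[i, j] == nose_pimples[i] and a[i, j] == ages[j].
p, a = np.meshgrid(nose_pimples, ages, indexing="ij")

# Flatten both grids and pair them up column-wise: one row per combination.
G = np.column_stack([p.ravel(), a.ravel()])
print(G.shape)
print(G[:5])
```

The result has one row for every (nose_pimples, ages) pair, so M here is 3 * 4 = 12.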

arctic wedgeBOT Aug 30, 2020, 1:53 PM

#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

hasty grail Aug 30, 2020, 1:55 PM

#

!eval

import numpy as np
print(np.mgrid[0:3:1, 45:61:5].reshape(2, -1).T)

arctic wedgeBOT Aug 30, 2020, 1:55 PM

#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

hasty grail Aug 30, 2020, 1:55 PM

#

^

modest rune Aug 30, 2020, 2:15 PM

#

thanks

tidal bough Aug 30, 2020, 2:34 PM

#

Anyone got any good resources on reinforcement learning?
@gaunt tusk https://www.coursera.org/learn/practical-rl/ I'm doing this coursera course on it.

Also:

• Sutton, Barto - Reinforcement Learning: An Introduction
• Berkeley - CS285: Deep Reinforcement Learning
(copied from the AI discord server)

frail arch Aug 30, 2020, 3:02 PM

#

how to install caffe in windows? I am getting error CMake Error: CMake was unable to find a build program corresponding to "Ninja". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool.

#

I have latest CMake installed

raven mulch Aug 30, 2020, 3:10 PM

#

In this video we go over the distinction between invariance and sensitivity based adversarial perturbations. The former being a much less studied attack which is able to break "robust" models!

I encourage you to create discussions here or on the youtube comment section about the paper and share related work, we can all learn from each other!

Video: https://www.youtube.com/watch?v=NhZY2tnDTZg

Fundamental Tradeoffs between Invariance and Sensitivity to Adversa...

Paper: https://arxiv.org/abs/2002.04599

crimson umbra Aug 30, 2020, 3:11 PM

#

anyone know where i can get lecture videos and slides for the latest cs109 courses with a recent Python 3.x version
Or is the 2015 version the only one that's free for all

weak kiln Aug 30, 2020, 3:26 PM

#

If you do enjoy it please consider subscribing and promoting the channel! It encourages me to put more effort into these videos I have other videos which span related topics.
@raven mulch

I think it's great that you're creating YouTube content and sharing it with our members in a channel that has a relevant topic - but "remember to subscribe" crosses the line over into straight up advertising, and violates our rules. Maybe you can use this channel to ask for feedback, instead. I wouldn't have any problem with that.

raven mulch Aug 30, 2020, 3:27 PM

#

Sorry I will edit that part out

#

Done

weak kiln Aug 30, 2020, 3:28 PM

#

Just try to keep that in mind for the next video, though. We technically don't allow advertising, but I think it's a shame to completely block content creators who are making things that may be relevant to the interests of our members - so you're basically walking a bit of a tightrope with these posts.

raven mulch Aug 30, 2020, 3:30 PM

#

Yep definitely. I appreciate it! I’m mainly looking to gain a following of people to discuss papers I make videos on, I understand how advertising can become annoying though

velvet thorn Aug 30, 2020, 3:39 PM

#

Yep definitely. I appreciate it! I’m mainly looking to gain a following of people to discuss papers I make videos on, I understand how advertising can become annoying though
@raven mulch honestly, it's a p good topic

#

but I was not sure if it was against the rules

tawny pivot Aug 30, 2020, 4:23 PM

#

Hi, I have a dataframe with multiple columns and NaN values at the beginning, and I tried this:

#

📎 unknown.png

#

I need the timestamp of each column's first value

#

any idea?

novel remnant Aug 30, 2020, 4:50 PM

#

do you want the timestamp (which is in index?) for the first non nan value per dataframe column?

modest rune Aug 30, 2020, 5:00 PM

#

@tidal bough i finally figured out how to generate a surface plot for IV. I mean, I actually understand what the heck I am doing and how the math works under the hood. Thanks for your help earlier. And you were right, I was a few characters away from having working code. But, I was about 10 hours of learning away from actually understanding what was going on. Anywho, coursera, for free, has an excellent intro course on machine learning by Stanford's Dr. Andrew Ng. I'm only 2 weeks into the 8 Week course,but prof. Ng explains everything in a way I can understand and doesn't make assumptions about my math background.

tidal bough Aug 30, 2020, 5:01 PM

#

yeah, I very much liked how that course gives you an understanding of how it works under the hood

#

you won't need to actually implement these algorithms, most likely - just use premade algorithms from libraries like scikit-learn or pytorch - but it's going to be useful if you would want to understand ML articles or code some advanced (and so non-standard) algorithm.

modest rune Aug 30, 2020, 5:04 PM

#

Yeah. I hate blindly using a library without enough understanding of the underlying principles. I just can't be confident in my usage.

tawny pivot Aug 30, 2020, 5:39 PM

#

do you want the timestamp (which is in index?) for the first non nan value per dataframe column?
@novel remnant yes i need a new dataframe that contains: column names and first non nan value's index

lapis sequoia Aug 30, 2020, 5:48 PM

#

Anyone here experienced with tensorflow?

novel remnant Aug 30, 2020, 5:53 PM

#

@tawny pivot

something like this then?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', '2020-01-05', freq='d'),
    'a': [np.nan, np.nan, 1, 2, 3],
    'b': [np.nan, 1, 2, 3, 4],
    'c': [1, 2, 3, 4, 5]
})
df.set_index('time', inplace=True, drop=True)

# This is the part that you want
new_dict = {}

for col in df.columns:
    new_dict[col] = df[~pd.isna(df[col])].index[0]
    
pd.DataFrame.from_dict(new_dict, orient='index').T
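A shorter variant of the loop above, as a sketch: pandas has `Series.first_valid_index`, which returns the index label of the first non-NaN entry, so it can be applied once per column. The sample frame mirrors the one above, with the dates placed directly in the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [np.nan, np.nan, 1, 2, 3],
        "b": [np.nan, 1, 2, 3, 4],
        "c": [1, 2, 3, 4, 5],
    },
    index=pd.date_range("2020-01-01", "2020-01-05", freq="D", name="time"),
)

# One timestamp per column: the index label of its first non-NaN value.
first_valid = df.apply(pd.Series.first_valid_index)

# Same one-row layout as the loop version: column names across the top.
result = first_valid.to_frame().T
print(result)
```

A column that is entirely NaN has no valid index, so it would need separate handling.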

#

@lapis sequoia pure tensorflow or tensorflow keras?

lapis sequoia Aug 30, 2020, 5:58 PM

#

uhh like a simple doubt

#

first i download the dataset using this

#

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

#

now i need to use this function to resize the images to 64 x 64

#

https://www.tensorflow.org/api_docs/python/tf/image/resize_with_pad

TensorFlow

tf.image.resize_with_pad | TensorFlow Core v2.3.0

#

but i am getting an error 🤔

novel remnant Aug 30, 2020, 6:07 PM

#

what error are you getting?
the images are grayscale do you reshape them first to shape (-1, 28, 28, 1) before resizing?

plucky cairn Aug 30, 2020, 6:32 PM

#

Are there best practices for text pre-processing?

#

This is what I need to do

📎 unknown.png

#

this is what i'm doing

📎 unknown.png

#

but applying this using a pandas transform is super slow on 2k text bodies

#

and i will need to do it on 16k on the out-of-sample texts

lapis sequoia Aug 30, 2020, 6:38 PM

#

I got this error

#

output dimensions must be positive [Op:ResizeBilinear]

novel remnant Aug 30, 2020, 6:50 PM

#

I'm not getting any errors on my part, can you share the part of your code that throws the error?

tawny pivot Aug 30, 2020, 7:02 PM

#

@tawny pivot

something like this then?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', '2020-01-05', freq='d'),
    'a': [np.nan, np.nan, 1, 2, 3],
    'b': [np.nan, 1, 2, 3, 4],
    'c': [1, 2, 3, 4, 5]
})
df.set_index('time', inplace=True, drop=True)

# This is the part that you want
new_dict = {}

for col in df.columns:
    new_dict[col] = df[~pd.isna(df[col])].index[0]
    
pd.DataFrame.from_dict(new_dict, orient='index').T

@novel remnant this works for me thank you ^_^

novel remnant Aug 30, 2020, 7:03 PM

#

cheers!

solid aurora Aug 30, 2020, 8:20 PM

#

What is the best technique for finding feature importance in a dataset?

#

Let's say I have a trained SKLearn model with a good enough (~80%) accuracy

#

There seem to be several ways I can find feature importance:

#

sklearn's .feature_importance_, which I'm not sure how it works

#

Recursive Feature Elimination

#

Permutation feature importance

#

Which of the above will give the "best" results?

#

And when would I want to use one over the other?

#

And should I be doing RFE/PFI with a cross-validation set? or using accuracy from the training set itself?

plucky cairn Aug 30, 2020, 8:36 PM

#

can someone help me understand sparse matrices and how to manipulate them. from what i understand a sparse matrix basically only gives the non-zero entries to save memory

#

can i use standard numpy functions on a sparse matrix?

#

particularly i want to do something like
np.sum(np.multiply(x!=0,(y>0)[:,None]),axis=0)

tidal bough Aug 30, 2020, 8:47 PM

#

Yup, you pretty much can.

plucky cairn Aug 30, 2020, 8:54 PM

#

okay, can i also ask why after using a count_vectorizer i would have columns that sum to zero?

#

that would mean the word doesn't show up in any documents right?

#

something like this

cv = CountVectorizer()
bagofwords = cv.fit_transform(text)
np.min(np.sum(bagofwords,axis=0))

#

returns zero

tidal bough Aug 30, 2020, 9:00 PM

#

I think so, yeah. It's weird if that's the case

plucky cairn Aug 30, 2020, 9:01 PM

#

hmm, must mean something weird is going on

lapis sequoia Aug 30, 2020, 9:19 PM

#

I'm having some trouble wrapping my head around how to approach a problem I currently have with pandas and my dataframe.

Basically I have 4 columns with a datetime index that are all daily values from different shop locations. I want to resample it into monthly columns, but without losing each daily value by just using resample.mean. I have several years' worth of data, and it would be nice to have each column in the final df be labeled Month Year. I'm a little stuck. Any help would be appreciated.

sudden kernel Aug 30, 2020, 9:44 PM

#

would be easier to visualise what you want if you showed us a sample of your data and how you want it to look like

lapis sequoia Aug 30, 2020, 9:47 PM

#

One moment

#

📎 Screen_Shot_2020-08-30_at_5.47.24_PM.png

#

raw data is formatted like this

#

I want to turn it into this

📎 Screen_Shot_2020-08-30_at_5.47.49_PM.png

#

I can do it manually via

a = df.loc['2011-08']
a = a.unstack().reset_index(drop=True)

But its a huge hassle to do for large datasets and I know there is some way my beginner brain isn't seeing

#

The key is to preserve the data and not just use resample.mean or some other thing that doesn't allow me to keep all data.

sudden kernel Aug 30, 2020, 9:55 PM

#

so you basically want to reshape all rows from 2016-04 in the original df, to a single column in the new df

lapis sequoia Aug 30, 2020, 9:56 PM

#

yes, but my data goes back to 1993, till today

#

so I need a solution that isnt using .loc 444 times

#

I have a sample csv with data from 2006 till 2020 with random int in it to try to figure this out
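One way to avoid repeating `.loc` per month, as a rough sketch: group the rows by calendar month and apply the same `unstack().reset_index(drop=True)` step to each group. The shop column names and the two-month range here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: two shops, daily values.
idx = pd.date_range("2020-01-01", "2020-02-29", freq="D")
df = pd.DataFrame(
    {"shop_a": np.arange(len(idx)), "shop_b": np.arange(len(idx)) + 100},
    index=idx,
)

monthly = {}
for period, group in df.groupby(df.index.to_period("M")):
    # Same as the manual step: stack every shop's values into one column.
    monthly[period.strftime("%B %Y")] = group.unstack().reset_index(drop=True)

# Months of different length are padded with NaN at the bottom.
wide = pd.DataFrame(monthly)
print(wide.head())
```

Every daily value is kept; the only loss is the original date labels, which the manual version drops as well.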

arctic wedgeBOT Aug 30, 2020, 10:08 PM

#

Hey @lapis sequoia!

It looks like you tried to attach file type(s) that we do not allow (.csv). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

gaunt tusk Aug 30, 2020, 10:26 PM

#

@tidal bough Thank you for those resources, both look nice

still verge Aug 31, 2020, 7:12 AM

#

anyone have pyspark experience and want to share what it was like for you?

velvet thorn Aug 31, 2020, 7:14 AM

#

anyone have pyspark experience and want to share what it was like for you?
@still verge what was what like?

#

working with PySpark?

#

like working with pandas but much more tiring and bothersome

still verge Aug 31, 2020, 7:14 AM

#

yeah

#

what makes it tiring?

velvet thorn Aug 31, 2020, 7:15 AM

#

it being distributed means that stuff runs slower

#

on small datasets

#

of course, you wouldn't be able to do that kind of stuff on large datasets with native pandas (would need, like, dask or something)

#

but, yeah.

#

the abstractions are not as convenient

#

e.g. selecting specific rows and columns

still verge Aug 31, 2020, 7:16 AM

#

many people told me not to use it if the dataset is small, is it that bad?

velvet thorn Aug 31, 2020, 7:16 AM

#

you have to litter your code with a lot of the function operators

#

many people told me not to use it if the dataset is small, is it that bad?
@still verge without a reason, I'd say you shouldn't

still verge Aug 31, 2020, 7:17 AM

#

thanks for the input!

frail arch Aug 31, 2020, 10:07 AM

#

can someone help me with caffe installation?

#

is it supported for Python 3.8?

lapis sequoia Aug 31, 2020, 1:07 PM

#

hello, if you are a little bit familiar with multithreading, can you help me understand what i am doing wrong here?

#

import _thread
from threading import Thread, Lock

mutex = Lock()

df = pd.read_csv(f"dftest_1.csv")
df = df.reset_index(drop=True)
df['id']='NaN'
df['new_score'] = 'NaN'

for index, row in df.iterrows():
    s = row['full_link']
    s = s[38:44]
    df.at[index, 'id'] = s

def get_new_data(index, row):
    global df
    submission = reddit.submission(row['id'])
    print(submission.score)
    mutex.acquire() 
    df.at[index, 'new_score'] = submission.score
    mutex.release()

for index, row in df.iterrows():
    _thread.start_new_thread(get_new_data, (index, row))

#

I am loading a csv, then create two new columns filled with 'NaN'. then i create the ID from the full link. so far so good

#

now, I try to update the column 'new_score'. I do this using _thread so the requests i make with reddit.submission() happen all at the same time.

#

in the get_new_data() function I make the request and print the submission.score. it works and i can see the scores one after another and almost instantly - so the multithreading seems to work

#

then i lock the dataframe, write the new value and release it again

#

but the dataframe that is returned doesnt have the new values

#

no error

#

but also no new values
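A likely cause, stated as an assumption since there is no traceback: `_thread.start_new_thread` returns immediately and nothing waits for the threads, so the dataframe is inspected (or the script ends) before the writes happen. A minimal sketch with `concurrent.futures.ThreadPoolExecutor`, which does wait; `fetch_score` is a hypothetical stand-in for `reddit.submission(...).score`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_score(submission_id):
    # Hypothetical stand-in for reddit.submission(submission_id).score
    return len(submission_id) * 10


ids = ["abc123", "de45", "f6"]

# The with-block only exits once every worker has finished,
# and map() returns results in the same order as `ids`.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(fetch_score, ids))

new_score = dict(zip(ids, scores))
print(new_score)
```

With a dataframe, the results can then be assigned in one go from the main thread, e.g. `df['new_score'] = scores`, so no lock is needed.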

ivory panther Aug 31, 2020, 2:03 PM

#

Try to use Ray for multithreading

#

Any idea how to convert this data frame into a time series, taking the months' columns as index?

📎 unknown.png

velvet thorn Aug 31, 2020, 2:18 PM

#

Any idea how to convert this data frame into a time series, taking the months' columns as index?
@ivory panther which is the month column?

ivory panther Aug 31, 2020, 2:19 PM

#

Enero, Febrero, Marzo ... (January, February, March, etc)

velvet thorn Aug 31, 2020, 2:19 PM

#

what do the numbers represent then

#

since I see a 2

neon path Aug 31, 2020, 2:20 PM

#

Looks like murder counts to me

ivory panther Aug 31, 2020, 2:20 PM

#

The number of crimes

velvet thorn Aug 31, 2020, 2:20 PM

#

no, I mean

#

what do you expect the result to look like

#

in general for "how do I convert this to that" questions sample output is very useful in helping people understand what you expect

#

because "time series" is rather vague

ivory panther Aug 31, 2020, 2:23 PM

#

Have date instead of just year. For example 2015/January, Aguascalientes, Homicidio, 2 (crimes)

#

Something similar to this

📎 unknown.png
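A sketch of that reshape with `DataFrame.melt`, assuming one column per month next to the year, state and crime type. The `Year`, `State` and `Crime` column names and the numbers are hypothetical; only the Spanish month headers come from the chat.

```python
import pandas as pd

# Hypothetical column names; only the month headers come from the chat.
wide = pd.DataFrame(
    {
        "Year": [2015, 2015],
        "State": ["Aguascalientes", "Aguascalientes"],
        "Crime": ["Homicidio", "Robo"],
        "Enero": [2, 5],
        "Febrero": [1, 7],
    }
)
month_number = {"Enero": 1, "Febrero": 2}  # extend through Diciembre = 12

# One row per (year, state, crime, month) instead of one column per month.
long = wide.melt(
    id_vars=["Year", "State", "Crime"], var_name="Month", value_name="Count"
)

# Build a real date (first day of the month) from year and month number.
long["Date"] = pd.to_datetime(
    pd.DataFrame(
        {"year": long["Year"], "month": long["Month"].map(month_number), "day": 1}
    )
)
series = long.set_index("Date").sort_index()[["State", "Crime", "Count"]]
print(series)
```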

bleak swift Aug 31, 2020, 2:46 PM

#

i pressed anaconda navigator but it isnt working
(i cant open my anaconda navigator how to fix?)

jolly sinew Aug 31, 2020, 2:50 PM

#

what are you trying to get to anaconda for?

#

I don't really use the navigator, but you can open a terminal on a mac and type jupyter notebook and it'll open notebooks

#

open anaconda prompt / miniconda prompt on windows to do the same thing

bleak swift Aug 31, 2020, 2:53 PM

#

thanks

serene scaffold Aug 31, 2020, 3:15 PM

#

I don't know that I like matplotlib

tidal bough Aug 31, 2020, 3:16 PM

#

For that matter, what nice higher-level plotting libraries than matplotlib are there? I don't quite get how they can be easier to use than the latter.

serene scaffold Aug 31, 2020, 3:20 PM

#

import typing as t
from plotter import Plot, Point, Color

class Car(Point):
    value: int
    speed: int
    type: Color

cars: t.List[Car]
car_chart = Plot(cars)
car_chart.show()

#

that's the kind of API I'd expect to see

tidal bough Aug 31, 2020, 3:23 PM

#

ah, interesting

serene scaffold Aug 31, 2020, 3:24 PM

#

I guess I could make one but that requires work.

atomic forge Aug 31, 2020, 4:20 PM

#

hi Everyone

#

so I am a python coder and

#

I have experience with Numpy but

#

Im trying to learn Pandas

#

and eventually matplotlib

#

does anyone have any good resources for learning these libraries

#

book, yt tutorials, anything

tidal bough Aug 31, 2020, 4:28 PM

#

for matplotlib and numpy, pretty much docs

#

pandas docs... aren't that nice

atomic forge Aug 31, 2020, 4:28 PM

#

uhh

#

then what do i use

#

for pandas

#

well matplotlib as well if smth makes it easier to learn

#

https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS ?

YouTube

Corey Schafer

Python Pandas Tutorial (Part 1): Getting Started with Data Analysis...

In this video, we will be learning how to get started with Pandas using Python.


In t...


#

?

wintry sapphire Aug 31, 2020, 4:42 PM

#

Hello

#

is anyone familiar with pandas

#

I need some help

marsh seal Aug 31, 2020, 5:26 PM

#

Hello, I want to iterate over a period of time. how can i compare a day's information with its previous day? One of the things i want to compare is the close of a stock with its previous day

wintry sapphire Aug 31, 2020, 5:33 PM

#

Hey @marsh seal

#

I think I am doing something similar

#

do you know how to call on the previous row data?

marsh seal Aug 31, 2020, 5:46 PM

#

hey thanks for a quick reply @wintry sapphire no i don't

novel remnant Aug 31, 2020, 6:07 PM

#

use shift and create a new column with the shifted values for which you can compare with the original values

#

this way you can vectorize the operations for quick results

marsh seal Aug 31, 2020, 6:13 PM

#

@novel remnant Hi potaki, could you show me an example please

novel remnant Aug 31, 2020, 6:14 PM

#

sure one moment

#

for example if you want to subtract the value of the previous day from the value of the current day

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', '2020-01-10', freq='d'),
    'a': np.arange(10)
})
df.set_index('date', inplace=True, drop=True)
df['a_previous'] = df.a.shift()
df['a_minus_previous'] = df.a - df.a_previous
df

lapis sequoia Aug 31, 2020, 6:41 PM

#

Series of Data Science Articles for getting started with Data Science / Machine Learning, includes step-by-step implementations:

#

https://medium.com/@linguisticmaz/the-data-series-d27c1f34c627

Medium

The Data Series

The Data Series | Episode 1

#

Please consider reading if you are interested and subbing to my channel to help build your knowledge and skills in data science

#

https://www.youtube.com/channel/UCiFF3AvbzLWdRyRnQMEttqw?view_as=subscriber

YouTube

Mazen Ahmed

Data Science Educator

silk saddle Aug 31, 2020, 6:47 PM

#

hey, im kinda starting out on python, ik some basics n stuff and after some help from ppl i wanna get into machine learning, i think? xd, i dont rly know what it is, anyone got resources on what it is

lapis sequoia Aug 31, 2020, 6:55 PM

#

hey @silk saddle

#

https://medium.com/ai-in-plain-english/introduction-to-machine-learning-6b84f64d783e?source=your_stories_page---------------------------

Medium

Introduction To Machine Learning

The Data Series | Episode 3

#

I release at least one episode every week

#

if you guys arent sure about anything or don't understand any concepts leave a comment on my channel or article

#

and i will get back to you as soon as possible

silk saddle Aug 31, 2020, 6:57 PM

#

tysm ❤️

arctic canopy Sep 1, 2020, 12:25 AM

#

Sup guys, I'm learning the math needed for ML (which will take 2-3 months), but in these 2-3 months I don't want to just learn math without programming (I already have about 6 months of experience with Python). Can you give me any advice on what kind of projects I should work on right now? I'm kinda lost.

desert oar Sep 1, 2020, 12:26 AM

#

since you are learning machine learning, you can try implementing some algorithms from scratch

#

maybe start with linear regression with OLS and/or gradient descent

#

principal components

#

things like that

#

maximum likelihood even

arctic canopy Sep 1, 2020, 12:26 AM

#

but things like this don't need math?

desert oar Sep 1, 2020, 12:26 AM

#

sure they do

#

you need to know and understand the equations in order to implement them

arctic canopy Sep 1, 2020, 12:27 AM

#

but Im still learning the math so how I can deal with them?

velvet thorn Sep 1, 2020, 12:32 AM

#

you can do the simpler things.

#

what specifically are you learning now?

#

alternatively, you can work on general projects that are not specifically related to ML

arctic canopy Sep 1, 2020, 12:35 AM

#

currently learning calculus

velvet thorn Sep 1, 2020, 12:35 AM

#

hm.

#

what have you done already

arctic canopy Sep 1, 2020, 12:38 AM

#

calculus* btw thanks for the advice, I will try to work on projects not about ML.

#

if you mean with math nothing much but If you mean python projects, I have made a website and some automation stuff

velvet thorn Sep 1, 2020, 12:40 AM

#

yup, that's cool!

#

if you wanna do ML

#

it's important to also be a good programmer.

#

have you worked with visualisation tools?

#

in particular, matplotlib

arctic canopy Sep 1, 2020, 12:40 AM

#

not really

#

should I have a look at it?

velvet thorn Sep 1, 2020, 12:46 AM

#

if you want

#

just thought it might fit into calculus

#

it's kind of hard to think of a programming project that can focus on that

arctic canopy Sep 1, 2020, 12:56 AM

#

yeah I think my question is kinda weird haha, thanks for answering. I think I will try making bots for some platforms that is the idea that just came into my mind.

tidal bough Sep 1, 2020, 12:56 AM

#

the https://www.coursera.org/learn/machine-learning course doesn't really assume any background knowledge - it needs linear algebra, but it teaches it in process. It has plenty of implementation tasks.

#

no Reinforcement Learning there, sadly, for that you need a more serious course, from the Advanced ML specialization.

nimble solar Sep 1, 2020, 1:04 AM

#

hi. i am trying to install a local package on disk
pip install /directory/my_package
but when i run jupyter, and import my_package , it says it is not found

#

is there a work around to fix this?

#

i tried installing the package with the same pip as in the which jupyter directory
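A common cause, offered as a guess: the `pip` on the shell path belongs to a different Python than the one the Jupyter kernel runs. A small sketch to check from inside a notebook cell which interpreter the kernel uses, and to build an install command tied to it. The package path is the one from the message above.

```python
import sys

# Interpreter that the current kernel (or script) is actually running on.
print(sys.executable)

# Command that installs the local package into that exact interpreter.
cmd = [sys.executable, "-m", "pip", "install", "/directory/my_package"]
print(" ".join(cmd))
```

Running the printed command in a terminal, or `%pip install /directory/my_package` in a cell on recent IPython versions, should put the package where the kernel can import it.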

velvet thorn Sep 1, 2020, 1:11 AM

#

yeah I think my question is kinda weird haha, thanks for answering. I think I will try making bots for some platforms that is the idea that just came into my mind.
@arctic canopy that's fine too! as long as you're practicing programming and learning new things, don't worry

#

there are a ton of interesting concepts that I picked up along the way that became relevant months later

arctic canopy Sep 1, 2020, 1:15 AM

#

thanks a lot man, that means a lot

velvet thorn Sep 1, 2020, 1:17 AM

#

yw

#

feel free to ask if you need any other help

dark agate Sep 1, 2020, 1:40 AM

#

If you had $5.2K in tuition reimbursement from your employer for accredited coursework, what course/degree/boot camp would you use it for?

#

^for someone who has Python basics down but wants to pivot careers into data science

desert oar Sep 1, 2020, 2:24 AM

#

@nimble solar are you using a venv or other environment?

wintry sapphire Sep 1, 2020, 4:10 AM

#

📎 image0.png

#

Hi guys, I am trying to achieve this in a Dataframe

#

but i keep getting NaN

#

does anyone know how to do it?

flat quest Sep 1, 2020, 4:19 AM

#

@arctic canopy. Yeah like salt was saying try reimplementing some algorithms or papers. For the first few you might want to follow a guide.

As for the actual math, if you've covered calculus you can reimplement many of the basic algorithms without much difficulty. linear, logistic should all be doable

#

I mean ur dividing by zero @wintry sapphire. Pandas doesn't know how to deal with anything divided by 0

wintry sapphire Sep 1, 2020, 4:21 AM

#

Oh

#

@flat quest so if I leave it as NaN

#

maybe

#

I should bfill this right?

flat quest Sep 1, 2020, 4:26 AM

#

well yeah but depends on the problem

wintry sapphire Sep 1, 2020, 4:30 AM

#

Hmm alright

#

cause I want to find the percentage change

#

@flat quest

#

How do I

#

fill my 1 Jan number with my second Jan?

wintry sapphire Sep 1, 2020, 4:47 AM

#

Hey @flat quest, do you happen to know why

#

even after I fill my 1 Jan with a number

#

i still get an error?

flat quest Sep 1, 2020, 4:48 AM

#

not sure i totally follow what ur trying to do
fill 1jan number with second jan?

wintry sapphire Sep 1, 2020, 4:49 AM

#

@flat quest Alright so here is my output

#

📎 unknown.png

#

Bascially in option 1

#

column

#

I want to 2019-01-01

#

to be my initial which is 10,000

#

for the next date, 2019-01-02, I would want it to be the value under StkB_close 2Jan - 1 Jan divide by 1 Jan

#

times the value above in option 1 in 1 jan

#

Meaning

#

In my option 1, for 2 Jan

#

the value would be

#

(101.12 - 101.12) / 101.12 * 10000

#

Assuming the final value is 30000

#

Then for 3 Jan

#

it would be

#

(97.40 - 101.12) / 101.12 * 30000

#

not sure i totally follow what ur trying to do
fill 1jan number with second jan?
@flat quest this is what I'm trying to do

#

But I keep getting an NaN

velvet thorn Sep 1, 2020, 5:09 AM

#

@wintry sapphire df / df.shift(1) - 1
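For reference, the daily percentage change is (today - yesterday) / yesterday, which pandas exposes as `pct_change`. A sketch under the assumption that `Option_1` is meant to be a portfolio that starts at 10,000 and follows `StkB_close`; the close values are the ones quoted above.

```python
import pandas as pd

close = pd.Series(
    [101.12, 101.12, 97.40],
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    name="StkB_close",
)

# (today - yesterday) / yesterday; the first row has no "yesterday", hence NaN.
daily_return = close.pct_change()

# Treat the first day as a 0% move, then compound from the initial 10,000.
option_1 = 10_000 * (1 + daily_return.fillna(0)).cumprod()
print(option_1)
```

The NaN in the first row is expected and comes from the shift, not from a division by zero.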

wintry sapphire Sep 1, 2020, 5:13 AM

#

ohhh

#

@velvet thorn do you know

#

How to print out

#

certain rows and columns only

#

like in my DF, I have 5 columns = A B C D E

#

But I only wanted rows 5, 6 from columns C D E

velvet thorn Sep 1, 2020, 5:21 AM

#

hm.

#

are you new to pandas?

#

that is a very basic operation

wintry sapphire Sep 1, 2020, 5:21 AM

#

is it

velvet thorn Sep 1, 2020, 5:21 AM

#

I suggest you read this

#

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

wintry sapphire Sep 1, 2020, 5:21 AM

#

the .loc?

#

@velvet thorn so what I did was

#

for i, one_d in enumerate(date_check):
    print(portfolios.loc[one_d, 'Option_1'])

#

where date_check is the dates which I want to find

#

dates are the

#

index

#

but I want it to be from

#

several columns

#

not just

#

option_1
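A sketch of the multi-column version: `.loc` accepts a list of row labels and a list of column names in one call, so the loop is not needed. `Option_2` and `Option_3` and the numbers are hypothetical.

```python
import pandas as pd

portfolios = pd.DataFrame(
    {"Option_1": [1, 2, 3], "Option_2": [4, 5, 6], "Option_3": [7, 8, 9]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
)
date_check = pd.to_datetime(["2019-01-01", "2019-01-03"])

# Label-based: the chosen dates, and only the listed columns.
subset = portfolios.loc[date_check, ["Option_1", "Option_3"]]

# Position-based variant: rows 0 and 2, columns from position 1 onward.
by_position = portfolios.iloc[[0, 2], 1:]
print(subset)
```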

copper hemlock Sep 1, 2020, 7:53 AM

#

i can't seem to calculate the input features for my linear layer in pytorch

#

i apply the formula but i get size mismatch error

#

self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)        
self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
self.conv3 = nn.Conv2d(in_channels=12, out_channels=24, kernel_size=5)
        
self.fc1 = nn.Linear(in_features=?????, out_features=360)


#forward method for pooling
tensor = F.relu(self.conv1(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)
print(tensor.size())        

tensor = F.relu(self.conv2(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)

tensor = F.relu(self.conv3(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)

#

can someone ELI5?

#

CxHxW = 1x40x40

#

according to my calculations its supposed to be 24*1*1 but i get size mismatch error

#

nvm im dumb, i was calculating correct, error was elsewhere 😄
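For anyone following along, the arithmetic can be checked without PyTorch. This sketch applies the usual output-size formula, floor((size + 2*padding - kernel) / stride) + 1, to the three conv + pool stages above, assuming the default stride of 1 and no padding for the conv layers.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output length of a convolution or pooling window along one dimension.
    return (size + 2 * padding - kernel) // stride + 1


size = 40  # H = W = 40
for _ in range(3):
    size = conv_out(size, kernel=5)            # Conv2d, kernel_size=5
    size = conv_out(size, kernel=2, stride=2)  # max_pool2d, 2x2, stride 2

in_features = 24 * size * size  # conv3 has 24 output channels
print(size, in_features)
```

The spatial size goes 40, 36, 18, 14, 7, 3, 1, so the flattened tensor has 24 values per sample.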

hushed flax Sep 1, 2020, 8:16 AM

#

Wow

supple frigate Sep 1, 2020, 9:58 AM

#

Hello guys, what do i need to know to getting a start with data science? i learned a little bit of pandas and numpy

lapis sequoia Sep 1, 2020, 10:06 AM

#

scikitlearn is nice to know and is fun to work with

#

https://scikit-learn.org/stable/

#

@supple frigate it contains also datasets that you can work with and make predictions or stuff like that

#

I have another question though:

I have two different CSVs with time series data. One Table is continuous, starting in 01.01.2017 at 00:00. From there each row represents one hour (1. Table). The data looks kind of like this:

Table aka df1:

Date,                   Volume
2017-02-03 12-PM,       9787.51
2017-02-03 01-PM,       9792.01
2017-02-03 02-PM,       9803.94
2017-02-03 03-PM,       9573.99

The other table contains events that happened and are serialized by UNIX datetime in seconds. I was able to convert it to datetime and group it by hour with this code:

df['datetime'] = pd.to_datetime(df['created_utc'], unit='s')
df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %I-%p")
df['date_by_hour'] = df['datetime'].apply(lambda x: x.strftime('%Y-%m-%d %H:00'))

This resulted in this data:

Table aka df2:

created_utc,    score,      compound,   datetime,               date_by_hour
1486120391,     156,        0.125,      2017-02-03 12:13:11,    2017-02-03 12:00:00
1486125540,     1863,       0.475,      2017-02-03 13:39:00,    2017-02-03 13:00:00
1486126013,     863,        0.889,      2017-02-03 13:46:53,    2017-02-03 13:00:00
1486130203,     23,         0.295,      2017-02-03 14:56:43,    2017-02-03 14:00:00

Now I need to map the events (2.table) to the Time Series of the 1. Table. If multiple events happened in one hour, i need to make an addition of the scores and calculate the mean average of the compound. In the end i want to have a dataframe like this:

#

Final Dataframe

Date,                   Volume,         score,      compound,
2017-02-03 12-PM,       9787.51,        156,        0.125,
2017-02-03 01-PM,       9792.01,        2726,       0.682,
2017-02-03 02-PM,       9803.94,        23,         0.295,
2017-02-03 03-PM,       9573.99,        0,          0,

I know my code below does not work and is wrong, but I wanted to show what I was thinking how I could achieve this. I thought I could loop through each row of my events table df2 and compare if the datetime matches. If so, I would calculate score and compound. The issue is that I know that one should not loop through a dataframe and I don't know how to loop through another dataframe at the same time and perform the right calculations based on the previous rows...

for index, row in df2.iterrows():
    memory_score = 0
    memory_compound = 0
    if df1['Date'] == df2['date_by_hour']:
        df1['score'] = row['score'] + memory_score
        df1['compound'] = (row['compound'] + memory_compound) / 2

How can I get to my Final Dataframe? There must be some pandas magic that I could use to make this work and map the time series data to the right hours.

velvet thorn Sep 1, 2020, 10:32 AM

#

@lapis sequoia how about a join?
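A sketch of what the join could look like: aggregate the events to one row per hour first, then left-join that onto the hourly table. The data is copied from the tables above; the `hour` helper column is an addition for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Date": ["2017-02-03 12-PM", "2017-02-03 01-PM",
                 "2017-02-03 02-PM", "2017-02-03 03-PM"],
        "Volume": [9787.51, 9792.01, 9803.94, 9573.99],
    }
)
df2 = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2017-02-03 12:13:11", "2017-02-03 13:39:00",
             "2017-02-03 13:46:53", "2017-02-03 14:56:43"]
        ),
        "score": [156, 1863, 863, 23],
        "compound": [0.125, 0.475, 0.889, 0.295],
    }
)

# Put both tables on the same hourly key.
df1["hour"] = pd.to_datetime(df1["Date"], format="%Y-%m-%d %I-%p")
df2["hour"] = df2["datetime"].dt.floor("h")

# One row per hour: summed score, mean compound.
hourly = df2.groupby("hour").agg(
    score=("score", "sum"), compound=("compound", "mean")
)

# Left join keeps every hour of df1; hours without events become 0.
final = df1.merge(hourly, how="left", left_on="hour", right_index=True).fillna(0)
print(final)
```

No row-by-row loop is needed: the `groupby` handles the "several events in one hour" case and the left join handles the hours with no events.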

lapis sequoia Sep 1, 2020, 10:50 AM

#

someone to help me with pandas

#

should be fairly straight forward

eager root Sep 1, 2020, 11:18 AM

#

Can you suggest another way of having the functionality the class searchNodeMaker(dict) provides here https://py3.codeskulptor.org/#user305_MIkOQUGRta9iO5Z_1.py ? I am trying to write that code into C++ and I don't know how to convert that.

distant moss Sep 1, 2020, 12:24 PM

#

what is ξ(never defined before) in this context?

📎 unknown.png

high urchin Sep 1, 2020, 12:40 PM

#

hey every1, im using xlsxwriter to make a report in excel. On that i have a pie chart, and i'm not able to put the legend inside the pie instead of outside, like in this image. Idk if its because xlsxwriter uses chart styles from Excel 2007.

📎 pie_chart_ex.JPG

raven torrent Sep 1, 2020, 12:55 PM

#

Hey everyone, is there any way I can turn my deep learning model (regression) in google colab into a coreML model

desert oar Sep 1, 2020, 1:19 PM

#

@distant moss what paper is that?

#

i dont know the answer btw but thats pretty bad to just not define notation like that

distant moss Sep 1, 2020, 1:30 PM

#

I would think it's some kind of a known operator in matrix computations or smth

#

https://people.eecs.berkeley.edu/~jrs/meshpapers/Sorkine.pdf @desert oar page 4

#

or maybe the ξ is the sparse Cholesky factorization they're computing....

tidal bough Sep 1, 2020, 2:06 PM

#

yeah, I'd think it's the decomposition or something

high urchin Sep 1, 2020, 2:19 PM

#

Do you guys know how to remove the bold and cell borders when you use pandas to_excel? I'm trying to overwrite it but the first column doesnt change

cerulean flint Sep 1, 2020, 2:47 PM

#

Does anyone have experience with connecting microsoft forms and answers to python?

gusty oak Sep 1, 2020, 4:00 PM

#

How do I make line charts and how do I save them as .png and send them into an embed?

tidal bough Sep 1, 2020, 4:02 PM

#

How do I make line charts
With matplotlib, probably.

how do I save them as .png
plt.savefig

pale thunder Sep 1, 2020, 4:03 PM

#

you can set a different matplotlib backend if all you want is images

tidal bough Sep 1, 2020, 4:03 PM

#

yup, matplotlib.use("AGG") if you only want pngs

gusty oak Sep 1, 2020, 4:08 PM

#

alright

bold bane Sep 1, 2020, 5:11 PM

#

is web scraping data science?

#

because im about to ask a question

unique wolf Sep 1, 2020, 6:14 PM

#

def Diff2(old_list, new_list): 
    li_dif = [[i for i in old_list if i not in new_list],[i for i in new_list if i not in old_list]]
    return li_dif

Is there a more efficient way to find differences in 2 lists than my current function? I want to know added and removed items separately^

#

I'll post it in general :/

desert oar Sep 1, 2020, 6:50 PM

#

@bold bane not in and of itself. but it can be useful in data science projects

#

@unique wolf you could use a set maybe? return list(set(old_list) ^ set(new_list))

tidal bough Sep 1, 2020, 6:54 PM

#

well, for the exact same output(as sets), use

def Diff2(old_list, new_list):
    s1,s2 = map(set,(old_list,new_list))
    return s1 - s2, s2 - s1

#

symmetric difference s1 ^ s2 is equivalent to (s1-s2) | (s2-s1) (elements that are in one of the sets)

rare ice Sep 1, 2020, 7:53 PM

#

Any ideas on how I can visualize a JSON object in a dynamic tree in a Jupyter Notebook? Javascript (more specifically React) has visualization components like https://github.com/storybookjs/react-treebeard. Is there something similar for Jupyter?

lapis sequoia Sep 1, 2020, 8:15 PM

#

hey, quick question: how do i convert AM/PM datetime to 24 hour datetime? Like, I want to convert 2020-05-19 01-PM to 2020-05-19 13:00:00

#

do i use also something like

df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %I-%p")

desert oar Sep 1, 2020, 8:29 PM

#

@lapis sequoia keep it stored as a "datetime" internally. when you want to print it, use df['datetime'].dt.strftime https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html?highlight=strftime#pandas.Series.dt.strftime

#

the strftime format spec is here https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
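
A short sketch of that advice, assuming the column holds strings shaped like `2020-05-19 01-PM`. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2020-05-19 01-PM", "2020-05-19 11-AM"]})

# %I is the hour on a 12-hour clock and %p is the AM/PM marker
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %I-%p")
print(df["datetime"].iloc[0])  # 2020-05-19 13:00:00

# Keep the column as datetimes; only turn it into text for display or export
as_text = df["datetime"].dt.strftime("%Y-%m-%d %H:%M:%S")
```

Once parsed, the values have no AM/PM or 24-hour "format" of their own; that only exists in the string produced by `strftime`.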

lapis sequoia Sep 1, 2020, 8:37 PM

#

Thanks @desert oar

#

It works!

amber moat Sep 1, 2020, 10:06 PM

#

Any ideas on how I can visualize a JSON object in a dynamic tree in a Jupyter Notebook?
@rare ice I think it's not possible. But you can load the JSON object into a dictionary with the json module and print it. It'll look like a json tree
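
A minimal sketch of the print-it approach, using only the standard library. The JSON text is made up.

```python
import json

raw = '{"name": "example", "tags": ["a", "b"], "nested": {"depth": 1}}'

data = json.loads(raw)               # parse the JSON text into dicts and lists
pretty = json.dumps(data, indent=2)  # serialize again with indentation
print(pretty)
```

This gives a static, indented view rather than a collapsible tree. In JupyterLab, `IPython.display.JSON(data)` may render an expandable tree, but whether it does depends on the front end, so treat that as something to try rather than a guarantee.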

desert oar Sep 1, 2020, 10:09 PM

#

it would be a nice notebook or lab extension though!

#

@velvet thorn i just found a use case for "auxiliary" pandas indexes, a query like this:

all_urls.groupby(['url_source'])['homepage_eval'].value_counts()

#

although like you said you can always set_index first

velvet thorn Sep 1, 2020, 10:57 PM

#

although like you said you can always set_index first
@desert oar can you elaborate

desert oar Sep 1, 2020, 10:57 PM

#

@velvet thorn all_urls.set_index('url_source').groupby(level=0)

#

but that has a lot of less-desirable properties e.g. if you want to use the original index inside each group
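
A tiny sketch comparing the two spellings. The column names come from the query above; the rows are made up.

```python
import pandas as pd

all_urls = pd.DataFrame({
    "url_source": ["a", "a", "b", "b", "b"],
    "homepage_eval": ["ok", "bad", "ok", "ok", "bad"],
})

# Group by a regular column; the original row index stays available
by_column = all_urls.groupby("url_source")["homepage_eval"].value_counts()

# Group by an index level; set_index replaces the original row index
by_level = (
    all_urls.set_index("url_source")
    .groupby(level=0)["homepage_eval"]
    .value_counts()
)

print(by_column.sort_index())
print(by_level.sort_index())
```

Both produce the same counts here. The difference is what is left to work with inside each group: after `set_index`, the original row labels are gone unless they were kept as a column first.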

#

also i currently have a 3-level column index with names, yikes

desert oar Sep 2, 2020, 12:08 AM

#

is there a jupyter notebook or lab extension that allows you to bookmark a cell?

dusk aspen Sep 2, 2020, 12:12 AM

#

hi, so I want to be able to input 3 images to python, then decide one of them is the main one. then i want to find which one is the closest to the main one. does anyone know how i would do that?

velvet thorn Sep 2, 2020, 12:20 AM

#

is there a jupyter notebook or lab extension that allows you to bookmark a cell?
@desert oar not sure about this but you can use HTML

#

like how you would do a table of contents but for one cell

#

hi, so I want to be able to input 3 images to python, then decide one of them is the main one. then i want to find which one is the closest to the main one. does anyone know how i would do that?
@dusk aspen you can try a Siamese network

dusk aspen Sep 2, 2020, 12:21 AM

#

ok, i can try it

velvet thorn Sep 2, 2020, 12:28 AM

#

@lapis sequoia you can try asking here too next time

#

anyway, I believe you have that problem because you create a new BeautifulSoup for each page (in your loop)...but you only ever extract stuff after the loop ends?

still otter Sep 2, 2020, 9:23 AM

#

I have a sort of design question. I have a lot of separate files that are being generated in real time. I will be making scripts to extract some plottable data out of these files. What can I use/do to save this arbitrary plottable data in a central location which can also notify plotters that there is new data to plot?

desert oar Sep 2, 2020, 11:46 AM

#

@still otter what does "central" mean in this case? where/how are the plots being generated?

chilly pasture Sep 2, 2020, 2:00 PM

#

I have a 300 mb text file (glove embeddings), what is the fastest way to upload it in colab every time? my google drive is full so that is not an option.. Does hadoop or spark help for this?

desert oar Sep 2, 2020, 2:19 PM

#

@chilly pasture google cloud storage?

chilly pasture Sep 2, 2020, 2:45 PM

#

This is the first time i am hearing about it. thanks

#

i mean i thought it was a general term for google drive

kindred gyro Sep 2, 2020, 2:45 PM

#

Hewo, can I ask about if I can get the tweets from Tweepy package by year? Or is it just by tags

tall bronze Sep 2, 2020, 2:53 PM

#

Hi everyone. Not sure if this is the place to ask - How often is Cython used in data science for computationally expensive programs?

torpid gull Sep 2, 2020, 2:54 PM

#

Hello everyone!!

tall bronze Sep 2, 2020, 2:54 PM

#

I never heard of it as a total beginner, despite lots of people's dislike for Python's performance. But, as soon as I started my position in computational research, Cython was immediately brought to light.

eager heath Sep 2, 2020, 3:09 PM

#

Hey guys, if you were to use an algorithm to make a text based on a large dataset, in a Markov chain fashion that'd actually yield (mostly) grammatically correct results and could be tuned based on user inputs (supervised learning?), what algorithm would you choose?

desert oar Sep 2, 2020, 3:56 PM

#

@tall bronze i use it at work occasionally. sometimes numba can help improve performance too

#

cython is good when you have to process a lot of data in a "production" setting and/or for writing libraries with better performance than pure python alone. it's not necessarily that useful in data science, moreso in ML engineering

lapis sequoia Sep 2, 2020, 4:17 PM

#

hey guys I posted a data science related question in help-nitrogen, any help would be highly appreciated!

grave frost Sep 2, 2020, 4:24 PM

#

@eager heath Why not explore Deep Learning models? They are pretty efficient and highly accurate. Personally, I don't think general algos ever provide great performance or accuracy.....

eager heath Sep 2, 2020, 4:38 PM

#

Because it is way more work, I never did anything deep learning related lemon_eyes

#

But I think it would be a good introduction, wouldn't it?

#

Would you have any great resource to get me started please?

spring flame Sep 2, 2020, 6:14 PM

#

Hi, I know basic python and some external libraries for research. But I am interested in building AIs. I am not aware of how to proceed using python. Can someone suggest online sources where i can begin learning AI using Python?

desert oar Sep 2, 2020, 6:26 PM

#

@spring flame maybe https://course.fast.ai/ could be a good place to start

Practical Deep Learning for Coders

Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD - the book and the course

spring flame Sep 2, 2020, 6:27 PM

#

Thanks a lot

viral scroll Sep 2, 2020, 6:52 PM

#

Hi All,

Is there any function in pandas to calculate cumulative average/mean just like there is cumsum for cumulative sum

#

?

novel remnant Sep 2, 2020, 6:53 PM

#

global average or expanding average?

#

I think you're looking for expanding average

viral scroll Sep 2, 2020, 6:56 PM

#

expanding average

novel remnant Sep 2, 2020, 6:57 PM

#

it's simple then call series.expanding().mean()

viral scroll Sep 2, 2020, 6:57 PM

#

like for every month all the rows of previous months should be included in the mean

novel remnant Sep 2, 2020, 6:58 PM

#

expanding does that
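
A quick sketch of the expanding mean on made-up numbers, next to the cumsum-based definition it corresponds to.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 60.0, 30.0])

# Mean of all values from the first row up to and including the current row
running_mean = s.expanding().mean()
print(running_mean.tolist())  # [10.0, 15.0, 30.0, 30.0]

# Same thing by definition: cumulative sum divided by the running count
by_hand = s.cumsum() / pd.Series(range(1, len(s) + 1), index=s.index)
```

For the "every month includes all previous months" case, the rows would need to be sorted by date first so that "previous" means earlier in time.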

viral scroll Sep 2, 2020, 6:58 PM

#

great thanks....let me try

novel remnant Sep 2, 2020, 6:58 PM

#

alright cheers

stiff stratus Sep 2, 2020, 7:41 PM

#

What would you call seaborn, is it a wrapper for matplotlib? If so, please define what a wrapper is.

#

I am searching for a technical term

#

Please answer me in #help-lollipop

still otter Sep 2, 2020, 9:17 PM

#

@desert oar By "central" I mean if a plotter client (backed by matplotlib/bokeh/whatever) wants to get some data to plot, regardless of what data it wants or which script is creating the data, the plotter client will always access the same location to get its data.

For example, I was thinking maybe I can use a single sqlite db to store all this extracted data. If I want to make a new script to extract some additional data, the script would add a new table to the db, which it would then populate. Plotters can then be easily updated to plot the new data, without caring about the scripts that made it. Does this design seem good for this goal? Or is there perhaps a tool/library that is better fit for this purpose?

serene cipher Sep 2, 2020, 11:21 PM

#

hello! can anyone help me with excel filtering?

desert oar Sep 2, 2020, 11:59 PM

#

@still otter ok. what are the plotters though? are they all separate programs? how are they accessing the data? is this all happening on a single machine? on a network?

#

why not just keep a bunch of parquet files somewhere and watch for updates?

velvet thorn Sep 3, 2020, 12:00 AM

#

@still otter so basically you want a central store for data + push notifications?

still otter Sep 3, 2020, 12:01 AM

#

everything is on a single machine for now, and probably will be for a while

#

@velvet thorn yeah, basically. it's probably a simple thing to solve but i'm unfamiliar with what's available to me

velvet thorn Sep 3, 2020, 12:02 AM

#

hm.

#

to give good advice on design it would be important to understand (quite a bit) more about the architecture and your needs

#

like so you have another script running

#

that's constantly updating plots?

still otter Sep 3, 2020, 12:04 AM

#

so, i haven't really decided on the plotter quite yet

#

i have a basic plotter right now which is just a simple python script that makes a matplotlib plot manually from the files i have

#

basically parses the relevant data from the files without storing it anywhere

#

but it's a bit slow

#

and has to re-parse everything if i close and reopen it

#

also, this is kind of poor timing for me, i need to leave for a while 😓

velvet thorn Sep 3, 2020, 12:06 AM

#

why do you want push notifications then

still otter Sep 3, 2020, 12:06 AM

#

thanks for the responses though, i'll have to read about parquet

velvet thorn Sep 3, 2020, 12:06 AM

#

sounds like pull would be a better strategy

still otter Sep 3, 2020, 12:07 AM

#

maybe, i just like the sound of plots updating immediately when a new file is created

#

anyway, thanks for now, will be back later if anyone is around

lapis sequoia Sep 3, 2020, 12:08 AM

#

Any tips on how to get rid of the for-loops in the first function? My attempt in the second function gives the wrong result because I'm not accounting for j != i. But I'm not aware of anything in NumPy to account for that.

import numpy as np


def mu_brokaw(mus, mws, xs):
    n = len(mus)
    mu_mix = 0.0

    for i in range(n):
        d = 0.0

        for j in range(n):

            if j != i:
                mij = ((4 * mws[i] * mws[j]) / ((mws[i] + mws[j])**2))**0.25

                num = (mws[i] / mws[j]) - (mws[i] / mws[j])**0.45
                den = 2 * (1 + mws[i] / mws[j]) + ((1 + (mws[i] / mws[j])**0.45) / (1 + mij)) * mij
                aij = mij * ((mws[j] / mws[i])**0.5) * (1 + num / den)

                sij = 1.0
                d = d + sij * aij * xs[j] / np.sqrt(mus[j])

        mu_mix = mu_mix + (xs[i] * np.sqrt(mus[i])) / (xs[i] / np.sqrt(mus[i]) + d)

    return mu_mix


def mu_brokaw2(mus, mws, xs):
    mij = ((4 * np.outer(mws, mws)) / ((np.add.outer(mws, mws))**2))**0.25

    num = np.divide.outer(mws, mws) - np.divide.outer(mws, mws)**0.45
    den = 2 * (1 + np.divide.outer(mws, mws)) + (1 + np.divide.outer(mws, mws)**0.45) / (1 + mij) * mij
    aij = mij * (np.divide.outer(mws, mws)**0.5) * (1 + num / den)

    sij = 1.0
    d = np.sum(sij * aij * xs / np.sqrt(mus))

    mu_mix = np.sum((xs * np.sqrt(mus)) / (xs / np.sqrt(mus) + d))
    return mu_mix


if __name__ == '__main__':
    # dynamic gas viscosity in µP
    mu_h2 = 179.75
    mu_n2 = 363.87

    # molecular weight in g/mol
    mw_h2 = 2.016
    mw_n2 = 28.014

    # mole fraction
    x_h2 = 0.85
    x_n2 = 0.15

    mu_mix = mu_brokaw([mu_h2, mu_n2], [mw_h2, mw_n2], [x_h2, x_n2])
    print(f'mu_mix = {mu_mix:.4f}')

    mu_mix2 = mu_brokaw2([mu_h2, mu_n2], [mw_h2, mw_n2], [x_h2, x_n2])
    print(f'mu_mix2 = {mu_mix2:.4f}')
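
One possible loop-free version, as a sketch rather than a vetted implementation (`mu_brokaw_vec` is a hypothetical name). Compared with the loop, the second function above appears to differ in three places: the `j == i` terms are not excluded, `d` is collapsed to one scalar instead of one value per `i`, and the square-root factor uses `mws[i] / mws[j]` where the loop uses `mws[j] / mws[i]`. Zeroing the diagonal of `aij` handles the first point.

```python
import numpy as np


def mu_brokaw_vec(mus, mws, xs):
    mus, mws, xs = (np.asarray(a, dtype=float) for a in (mus, mws, xs))

    # ratio[i, j] = mws[i] / mws[j], built by broadcasting a column against a row
    ratio = mws[:, None] / mws[None, :]
    mij = (4 * mws[:, None] * mws[None, :] / (mws[:, None] + mws[None, :])**2)**0.25

    num = ratio - ratio**0.45
    den = 2 * (1 + ratio) + ((1 + ratio**0.45) / (1 + mij)) * mij
    # (1 / ratio)**0.5 is (mws[j] / mws[i])**0.5, matching the loop version
    aij = mij * (1 / ratio)**0.5 * (1 + num / den)

    sij = 1.0
    np.fill_diagonal(aij, 0.0)  # drop the j == i terms

    # d[i] = sum over j of sij * aij[i, j] * xs[j] / sqrt(mus[j])
    d = (sij * aij) @ (xs / np.sqrt(mus))

    return np.sum(xs * np.sqrt(mus) / (xs / np.sqrt(mus) + d))
```

Calling it with the same H2/N2 inputs as `mu_brokaw` and comparing with `np.isclose` is a reasonable way to check it before trusting it on larger mixtures.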

desert oar Sep 3, 2020, 12:27 AM

#

honestly this seems like it's really overengineered @still otter

#

whatever the plotter process is, i would just poll the directory every 30 seconds for new files

#

you can get fancier by sending notifications over a socket or something but i'd start with something really simple like polling for new files
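
A standard-library sketch of the polling idea. The names `find_new_files`, `poll_forever`, and `handle_file` are hypothetical, and the 30-second interval is just the figure mentioned above.

```python
import os
import time


def find_new_files(directory, seen):
    """Return paths in `directory` not yet in `seen`, and record them as seen."""
    current = {entry.path for entry in os.scandir(directory) if entry.is_file()}
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


def poll_forever(directory, handle_file, interval_seconds=30):
    """Call handle_file(path) for each new file, checking every interval_seconds."""
    seen = set()
    while True:
        for path in find_new_files(directory, seen):
            handle_file(path)
        time.sleep(interval_seconds)
```

A plotter would call something like `poll_forever('./plot-data', plot_from_file)`. One caveat: a file can show up in the listing while it is still being written, so the writer may need to write to a temporary name and rename when done.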

#

i think inotify can watch for directories too

#

yep

#

https://github.com/dsoprea/PyInotify

GitHub

dsoprea/PyInotify

An efficient and elegant inotify (Linux filesystem activity monitor) library for Python. Python 2 and 3 compatible. - dsoprea/PyInotify

#

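import os

import inotify.adapters

from my_library import plot_from_file

def main():
    i = inotify.adapters.Inotify()
    i.add_watch('./plot-data')

    for _, type_names, path, filename in i.event_gen(yield_nones=False):
        # only react once a file has been written and closed, not on every open/access event
        if 'IN_CLOSE_WRITE' in type_names:
            # filename is relative to the watched directory, so join it with path
            plot_from_file(os.path.join(path, filename))

if __name__ == '__main__':
    main()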

velvet thorn Sep 3, 2020, 1:30 AM

#

^ agree about the overengineered part

lapis sequoia Sep 3, 2020, 2:54 AM

#

How do I make a table like this in python? I have all the data already in a couple of df's. I just need it to print out a decent looking table.

📎 Screen_Shot_2020-09-02_at_10.52.16_PM.png

fast bluff Sep 3, 2020, 4:49 AM

#

Ok I'm big noob but I need some help. I'm working on a project to monitor market data and I'm running into a stupid problem. I already know what's causing it, just not sure how to get around it

#

ts = ForeignExchange(key='secret',output_format='pandas')
avdf = ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact')
#This returns a tuple which is the root of all my issues (I think)
avdf.drop(['open','high','low'])
#Returns "tuple object has no attribute drop"
#So I tried converting it into a list a few ways
df = list(avdf)
#This worked but I'm still having the same issue
df.drop(['open','high','low'])
#Returns "list object has no attribute drop"
#So I thought maybe it was because I had to directly link it to pandas. So I tried this
df = pd.DataFrame(ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact'), columns=avdf.columns, index=avdf.index)
#But still no luck.. It returns "tuple has no object columns"
#Getting annoyed and it's probably a super easy fix so if anyone could help me out, that would be greatly appreciated :D

#

I'm on python 3.8 using pandas 1.0.5 and alpha_vantage 2.2.0

#

Im gonna head to bed. If anyone has the time to respond please @ me in the message

#

Added " to the pd line

#

Now I get this traceback

arctic wedgeBOT Sep 3, 2020, 5:10 AM

#

Hey @fast bluff!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

fast bluff Sep 3, 2020, 5:12 AM

#

Would rather not use that so hopefully this works

#

https://pastebin.com/sbh5TZ7s

Pastebin

Traceback1 - Pastebin.com

Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.

lapis sequoia Sep 3, 2020, 7:23 AM

#

Hey guys! Anyone who works in analytics? Maybe you can help me out here: https://stackoverflow.com/questions/63513587/best-way-to-get-selling-windows-for-each-product-category-in-pandas I'm stuck with this for 12 days

Stack Overflow

Best way to get 'selling windows' for each product category in pandas?

So my dataframe has sales details of many products for many years and graph kind of look like this:

And I'm trying to find out selling windows of each product.
What I've tried so far:
The approach I

grizzled saffron Sep 3, 2020, 8:35 AM

#

Hi everyone, I'm trying to get a count column that get total 'Unit Sold' by tags from 'tags' in this df:

📎 Screenshot_1.png

#

I want to make new two columns one for tag names and second for how much unit sold
For example:

📎 Screenshot_2.png

#

How can I do that? can I split tags by ","?

molten hamlet Sep 3, 2020, 10:48 AM

#

split tags?

#

text in tag?

grizzled saffron Sep 3, 2020, 10:50 AM

#

yes I want to split the tags.. and count the unit sold for each tag

dense wharf Sep 3, 2020, 11:09 AM

#

Hi everyone.

I'm new to python and I've just started to follow a youtube channel by Corey Schafer, trying to learn Pandas. I use Pycharm Community.

I'm kind of looking for a dataframe interface that resembles the Jupyter Notebook or if it's possible, any kind of an external window like the one on matplotlib. Not really liking the one that's there on Pycharm.

Is there any way I can do that?

Thanks!

velvet thorn Sep 3, 2020, 12:43 PM

#

Hi everyone.

I'm new to python and I've just started to follow a youtube channel by Corey Schafer, trying to learn Pandas. I use Pycharm Community.

I'm kind of looking for a dataframe interface that resembles the Jupyter Notebook or if it's possible, any kind of an external window like the one on matplotlib. Not really liking the one that's there on Pycharm.

Is there any way I can do that?

Thanks!
@dense wharf PyCharm has Jupyter integration, but only for the Professional version

#

if you're a student you can get it for free though

#

yes I want to split the tags.. and count the unit sold for each tag
@grizzled saffron df['tags'].str.get_dummies(','), then groupby columns

dense wharf Sep 3, 2020, 12:44 PM

#

Thank you! I'll look into it

grizzled saffron Sep 3, 2020, 12:50 PM

#

@velvet thorn thanks for the reply.. I am getting an error:

#

📎 Screenshot_3.png

velvet thorn Sep 3, 2020, 12:51 PM

#

uh.

#

do you know what that does?

#

okay

#

just run df['tags'].str.get_dummies(',') by itself

#

and you should understand what you're doing wrong

grizzled saffron Sep 3, 2020, 12:52 PM

#

it made every values=0

#

and..

#

it made new columns by the tag names

#

I want to make one column named 'tag' then get this dummies to the row of 'tag'

#

then count the values of each tags in all rows

velvet thorn Sep 3, 2020, 1:57 PM

#

yes, that's why I said

#

groupby column

#

after that

grizzled saffron Sep 3, 2020, 2:24 PM

#

@velvet thorn Im sorry Im pretty new with pandas.. can you write here an example for the code..
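
A small sketch of one way to get there from `get_dummies`. The real dataframe is only visible in the screenshots, so the rows here are made up and the column names `tags` and `Unit Sold` are taken from the messages above. It multiplies the indicator columns by the units and sums them, which is a swapped-in alternative to the groupby that was suggested.

```python
import pandas as pd

df = pd.DataFrame({
    "tags": ["a,b", "b", "a,c"],
    "Unit Sold": [10, 5, 2],
})

# One 0/1 indicator column per tag
dummies = df["tags"].str.get_dummies(",")

# Weight each indicator by the row's units, then total per tag
per_tag = (
    dummies.mul(df["Unit Sold"], axis=0)
    .sum()
    .rename_axis("tag")
    .reset_index(name="Unit Sold")
)
print(per_tag)
```

If the real tags have spaces after the commas, they would need stripping first, otherwise `"b"` and `" b"` count as different tags.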

wild pine Sep 3, 2020, 2:35 PM

#

hey guys. I'm trying to write an implementation of the NEAT algorithm, and there's something i don't quite understand:
between speciation, killing off the weakest genomes and repopulation, what happens to the existing species? i know that none of the genomes from the previous generation survives, but do the species?
i mean, do i just wipe all the species and respeciate each new generation from scratch, or do i somehow keep a representative genome from each and let them live until they have been underperforming for too long?

flat quest Sep 3, 2020, 2:54 PM

#

i haven't personally dived into genetic algorithms so can't really say that much here. But surely ur keeping the best genomes from the previous generation?

@wild pine

wild pine Sep 3, 2020, 2:56 PM

#

unless you're implementing some sort of elitism, where you let a couple of the best performing genomes survive unchanged, i don't think that's usually the case.
i mean if you think about it, evolution is about finding out who gets to reproduce, not finding out who gets to live forever.

#

generally the idea is that everyone dies, but only the best performing organisms pass on their genes to the next generation

wintry sapphire Sep 3, 2020, 3:20 PM

#

Hi all

#

how do I print out this

#

📎 unknown.png

#

from my dataframe

#

currently this is my dataframe

#

I want it to print out

#

On 2019-03-29, Option A is XXX, Option B is XXX, Option C is XXX

#

@velvet thorn any suggestions? 🙂
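
A sketch of one way to build that sentence. The actual dataframe is only in the screenshot, so this assumes one row per date with the date as the index and one column per option; the values are made up.

```python
import pandas as pd

# Assumed layout: dates in the index, one column per option
df = pd.DataFrame(
    {"Option A": [1.5, 1.6], "Option B": [2.5, 2.4], "Option C": [3.5, 3.7]},
    index=pd.to_datetime(["2019-03-29", "2019-04-01"]),
)

lines = []
for date, row in df.iterrows():
    # "Option A is 1.5, Option B is 2.5, ..."
    parts = ", ".join(f"{name} is {value}" for name, value in row.items())
    lines.append(f"On {date.date()}, {parts}")

print("\n".join(lines))
```

If the real data is in long form (one row per date and option), it would need a pivot into this shape first.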

supple minnow Sep 3, 2020, 4:03 PM

#

does anybody have any experience with the DEAP library? I wanna know how can u set a specific chromosome in an already defined population for a genetic algorithm?

runic stream Sep 3, 2020, 4:39 PM

#

Hey all! So I was thinking about making a project related to AI. Anybody want to collaborate? Actually I'm a final undergrad student and a project would really help me get a hands on experience about the topics I learnt. Thoughts?

steel roost Sep 3, 2020, 5:03 PM

#

where can i get data? Like is there a site or something that i can use?

grizzled saffron Sep 3, 2020, 5:07 PM

#

@steel roost Kaggle, Data.world

fast bluff Sep 3, 2020, 5:08 PM

#

Could someone please peep my message from last night if you have the time

#

to sum up

#

avdf = ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact')
#This works fine but returns a tuple which I can't work with.
df = pd.DataFrame(ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact'), columns="avdf.columns", index="avdf.index")
#This returns an error involving my arguments w/ index (full traceback posted w/ pastebin link above)
df = list(avdf)
#Tried this
df.drop(['open','high','low'])
#"list has no attribute drop"
See earlier post for further explanation

#

I tried a bunch of stuff and I think I'm on the right track with the pd.DataFrame but I think I'm having a problem passing the columns and index
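
A sketch of the likely fix, under the assumption (suggested by the "tuple object" error) that the call returns a `(data, meta_data)` pair whose first element is already a DataFrame. `fake_get_intraday` is a hypothetical stand-in so the example runs without an API key, and the real column labels may differ from the plain `open`/`high`/`low` used here.

```python
import pandas as pd


def fake_get_intraday():
    """Hypothetical stand-in for the API call: returns (data, meta_data)."""
    data = pd.DataFrame(
        {"open": [1.0, 1.1], "high": [1.2, 1.3], "low": [0.9, 1.0], "close": [1.1, 1.2]},
        index=pd.to_datetime(["2020-09-03 05:00", "2020-09-03 05:01"]),
    )
    meta_data = {"note": "illustrative metadata"}
    return data, meta_data


# Unpack the tuple instead of converting it to a list or re-wrapping it
avdf, meta_data = fake_get_intraday()

print(avdf.columns.tolist())  # check the real column names before dropping

# drop() returns a new DataFrame; columns= names the axis explicitly
closes = avdf.drop(columns=["open", "high", "low"])
```

Two details worth noting: `drop(['open', 'high', 'low'])` without `columns=` looks for row labels, and passing `columns="avdf.columns"` hands pandas a literal string rather than the column index.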

flat quest Sep 3, 2020, 5:20 PM

#

well yes @wild pine the one that reproduces is the one that continues. But with most ML problems, u want the best producing rather than the one that happens to survive the best.

Ah so took a brief look at Neat. So its an evolution of the neural architecture itself. Based on what I can tell it looks like they're making mutations and then forming species based on a certain threshold difference. It looks like organisms are only eliminated based on their performance compared to individuals within their same species.

So by that logic, no species should completely die out. However, it might be beneficial to at some point remove the species completely if their performance is too terrible.

oblique vine Sep 3, 2020, 5:22 PM

#

https://www.coursera.org/learn/machine-learning/
Is this course still good to learn ML? I mean, it is 9 years old, lots of things have changed

Coursera

Machine Learning

Learn Machine Learning from Stanford University. Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, ...

fast bluff Sep 3, 2020, 5:23 PM

#

Complete guess, but I assume the fundamentals are still about the same so it couldn't hurt

steel roost Sep 3, 2020, 5:45 PM

#

is there a way to speed up pandas readers?

#

i have a data file that's extremely huge, but appears to freeze when just trying to print the dataframe

#

import pandas as pd
import numpy
data_file = '/home/doomedapple7565/Downloads/Parking_Violations_Issued_-_Fiscal_Year_2017.csv'

# want sheet 1 to be new york
# want sheet 2 to be new jersey
#and i want a count of the number of tickets for each license plate
#and i want the first and last ticket of each license plate

df = pd.read_csv(data_file)
print(df[0])

#

i took this file: https://www.kaggle.com/new-york-city/nyc-parking-tickets?select=Parking_Violations_Issued_-_Fiscal_Year_2017.csv

NYC Parking Tickets

42.3M Rows of Parking Ticket Data, Aug 2013-June 2017
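
A sketch of two common ways to cut the load time: read only the columns that are needed, and process the file in chunks. The tiny in-memory CSV stands in for the real file, and the column names `Plate ID`, `Registration State`, and `Issue Date` are assumptions about that dataset.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; column names are assumed
csv_text = io.StringIO(
    "Plate ID,Registration State,Issue Date\n"
    "ABC123,NY,07/01/2016\n"
    "XYZ789,NJ,07/02/2016\n"
    "ABC123,NY,07/05/2016\n"
)

partial_counts = []
# usecols skips unneeded columns; chunksize yields DataFrames of at most that many rows
for chunk in pd.read_csv(csv_text, usecols=["Plate ID", "Registration State"], chunksize=2):
    partial_counts.append(chunk["Plate ID"].value_counts())

# Combine the per-chunk counts into one total per plate
counts = pd.concat(partial_counts).groupby(level=0).sum().sort_values(ascending=False)
print(counts.head())
```

With the real file, `csv_text` would be the path and `chunksize` something much larger, such as a few hundred thousand rows. Separately, `print(df[0])` looks for a column labelled `0` and will likely raise a `KeyError`; `df.head()` is the usual way to peek at the first rows.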

wild pine Sep 3, 2020, 6:14 PM

#

@flat quest so basically each generation will consist of a mix of survivors from the previous generation, along with their offspring?
tbh that also makes more sense to me and was my first intuition, until i read this response on a related question on stackexchange:
The neural networks with the worst performance are killed off after speciation. None of the neural networks survive - the entire population is replaced with the offspring of the nets remaining after the culling stage.
i suppose there're several different approaches.

I guess i'll try to eliminate half of each species and let them reproduce until the population limit is reached, keeping the best performing genomes from each generation (possibly mutating them slightly). I can always just rewrite it if it turns out to be a disaster.
thanks a lot for your input! I've been working really hard on the rest of the code, and it's been so frustrating to be stuck on such a minor detail. at least i feel like i can get back to coding now.

fast bluff Sep 3, 2020, 6:17 PM

#

Can someone please help me ;-;