#data-science-and-ml


desert oar
#

makes sense then

lapis sequoia
#

Is it possible to use the .map function on a series of booleans?

#

Because now I might be thinking of making a new series on whether or not something is discontinued, and I was going to replace "False" with "No" and "True" with "Yes"

tidal bough
#

don't see why it wouldn't be possible

lapis sequoia
#

Well, I'm trying, but it's replacing the entire series with NaN, and in the documentation they show it only working with object dtypes, not booleans

desert oar
#

of course you could, but why would you want to

#

oh i see

#

yeah perfectly valid

lapis sequoia
#

Nvm I am stupid

#

Put True and False in ''

#

-_-

desert oar
#
pd.Series([True, True, False]).map({True: 'hello', False: 'goodbye'})
#

yeah, True and "True" are different
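
A minimal sketch of the pitfall above, using a made-up `discontinued` series (the data is an assumption, not from the original question). With string keys, `.map` finds no match for the boolean values, so every entry comes back as NaN; with boolean keys it works.

```python
import pandas as pd

# Hypothetical example data, not from the original question.
discontinued = pd.Series([True, False, True])

# String keys never match the boolean values, so every entry becomes NaN.
wrong = discontinued.map({"True": "Yes", "False": "No"})

# Boolean keys match as expected.
right = discontinued.map({True: "Yes", False: "No"})

print(wrong.isna().all())   # True
print(right.tolist())       # ['Yes', 'No', 'Yes']
```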

lapis sequoia
#

should have known, haha

solar bluff
#

I use .map all the time to map series that contain enumerated values into their corresponding strings

#

(i look at a lot of data that's produced by C++ code pushing structs into hdf5 files)

lapis sequoia
#

is there a way to use the inplace argument for mapping as well? or do i have to create a new series?

#

The documentation seems to point to there not being a method

solar bluff
#

df["series"] = df["series"].map(dict) works just fine

lapis sequoia
#

Ah, true

#

Thank you both for the help!

solar bluff
#

❤️

#

credit goes to @desert oar, a true champ

desert oar
#

hah, i feel strongly about helping people w/ pandas

#

because its really hard to learn from the official docs..

lapis sequoia
#

you too? whew, i thought i was the only one that struggled with documentation reading here

desert oar
#

its awful

#

they tried, but

#

its amazing anyone knows anything about pandas

#

it needs a serious overhaul imo, ive been wanting to write my own guide for a while

solar bluff
#

I LOVE that series from Tom Augspurger

#

I also feel that the McKinney book is confusing, and that's sad because he's the creator

#

best book I know of on how to use Pandas is the Pandas 1.x Cookbook by Ted Petrou and Matt Harrison

glass wyvern
#

Does anyone have experience with sparse matrices? I want to solve some linear systems of equations. The coefficient matrix is predominantly diagonal and the rest of the elements are 0 thus I think sparse matrices are the way to go. Thanks!

desert oar
#

@glass wyvern yeah, what's your question exactly?

#

sparse matrices can be good for what you just described, if the coefficient matrix is very big

#

not much value in it for a small matrix, the main benefit of a sparse matrix is saving memory
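
A small sketch of what solving such a system could look like with SciPy's sparse tools. The tridiagonal matrix and the size `n` are assumptions chosen for illustration; the real coefficient matrix would replace them.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 1000  # hypothetical system size

# Tridiagonal coefficient matrix: 4 on the main diagonal, -1 next to it.
A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Solve A x = b without ever building the dense n-by-n array.
x = spsolve(A, b)

print(np.allclose(A @ x, b))  # True
```

Only the non-zero diagonals are stored, which is where the memory saving for large systems comes from.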

merry ridge
#

This is kind of a dumb question but is anyone familiar with the term imputation? How is this conjugated? Is the base word Impute? It is very difficult for me to see it as anything less than a misspelling of the word "input" but other people in data science in my work group insists that this is a common word.

#

None of them speak English as a first language, and I am skeptical of the way they are using it in a sentence

desert oar
#

yes, imputation is the "verbal action" form of "impute"

#

"missing data imputation" is the act of "imputing missing data"

merry ridge
#

Alright, thank you for the reassurance

#

I have been in a constant battle of made-up words until now.

sudden cedar
#

does anyone know how i would retrieve the neural net with the highest fitness score in NEAT

amber anvil
#

Not sure if this is the right sub-server, but can someone maybe help me with a little problem i'm encountering with pandas?
It's a problem with data frames: I'm trying to map a column with numbers N. These N are also present as keys (K) in a dictionary with values V. Whenever I try to substitute the N in the dataframe by the V from the dictionary with df.map, the indexes of dict are mapped and not the key-reflecting values...
Anyone know how to solve this?

merry ridge
#

Are you using df = df.map?

#

I think df.map creates a copy of the dataframe off the top of my head

#

There should be an optional argument inplace = True to make it update the dataframe itself otherwise, but I would need to check the documentation. I have the memory of a goldfish

amber anvil
#

main_df['column_with_N'].replace(dictionary, inplace=True) works, but its super slow. from what i read df.map would be faster, but its not doing the job rn

merry ridge
#

Looking at the documentation, it is clear I have no idea what I am talking about. So disregard

amber anvil
#

main_df['column_with_N'].map(dictionary) shows a proper mapping in a series, but it does not substitute the values in main_df

#

ahah no worries, thx for ur help

#

🙂

merry ridge
#

When I try writing main_df['column_with_N'] = main_df['column_with_N'].map(dictionary) it seems to output what you want

#

Not sure if this is what you are after @amber anvil I am more of a Matlab person.

desert oar
#

@amber anvil you need to assign the result back to the original, as hexicle pointed out

#

.map does not work "in place"
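
A short sketch of the assign-back pattern discussed above. `main_df` and `dictionary` here are small hypothetical stand-ins for the ones in the question.

```python
import pandas as pd

# Hypothetical stand-ins for main_df and dictionary from the question.
main_df = pd.DataFrame({"column_with_N": [1, 2, 3, 2]})
dictionary = {1: "one", 2: "two", 3: "three"}

# .map returns a new Series and leaves main_df untouched...
mapped = main_df["column_with_N"].map(dictionary)

# ...so assign the result back to the column to update the frame.
main_df["column_with_N"] = mapped

print(main_df["column_with_N"].tolist())  # ['one', 'two', 'three', 'two']
```

One difference worth keeping in mind: values missing from the dictionary become NaN with `.map`, whereas `.replace` leaves them unchanged.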

muted sapphire
#

I want to ask a very simple question regarding something I encountered in anaconda, but I dont know if this is the correct place. May I or should I move to a help channel?

#

(I ask here because anaconda is considered a popular python distribution amongst data scientists)

solar bluff
#

@muted sapphire what's the question? (I am not an admin or mod but if I can help I will)

desert oar
muted sapphire
#

Hey thanks greghouse. I just wanted to know if it's normal, every time that you create a new environment, to reinstall jupyter notebook?

#

It happened to me yesterday and it seemed weird that I had to reinstall whats already in my pc

#

Thanks guys

velvet thorn
#

Hey thanks greghouse. I just wanted to know if it's normal, every time that you create a new environment, to reinstall jupyter notebook?
@muted sapphire that’s what a virtual environment is

#

it effectively acts like a new “container” for installed packages

muted sapphire
#

Packages I can understand. But jupyter, i mean its like an IDE, isnt it?

velvet thorn
#

Jupyter is a package too

muted sapphire
#

And to be honest a friend of mine doesnt have to install it when he makes a new environment so I was unsure whether i made a mistake or not

velvet thorn
#

it’s possible to do that too

muted sapphire
#

I see. Do you know how? I didnt know jupyter behaves like a package tbh. I just considered it an IDE, like pycharm

velvet thorn
#

okay, first

#

“package” and “IDE” are not mutually exclusive

#

a package is just a Python module container

#

and you can write an IDE in Python

#

which would make it a package too

#

anyway, to answer your question...

muted sapphire
#

Thank you for the valuable information, I hadnt thought about it this way but makes sense.

#

Yes please, go on

velvet thorn
#

I believe you cannot customise it directly, but it depends on your version of Anaconda...? (I’ve never had a need to do this)

muted sapphire
#

I have the latest, he doesnt

#

Maybe thats a reason, i dont know

#

As long as it is "normal" and its not a mistake by me, i dont mind it installing it

velvet thorn
#

IMO

#

new environments coming with stuff that is not necessary is an antipattern

#

you won’t always be doing stuff that needs Jupyter

#

and by “necessary” I mean for Python to run

muted sapphire
#

This is true, it makes sense for it NOT to come with it installed.

#

Perhaps I just want to test something in the console or w/e.

#

You are right, i was mainly confused because I didnt consider jupyter as a package you know?

sudden cedar
#

does anyone know how i would retrieve the neural net with the highest fitness score in NEAT
as in save that data genomes data and only run it by itself

muted sapphire
#

Thank you anyway 🙂 @velvet thorn You were very helpful

velvet thorn
#

Thank you anyway 🙂 @velvet thorn You were very helpful
@muted sapphire np!

graceful glacier
#

i wrote code to filter words from a given Pandas series that contain at least two vowels

#

import pandas as pd
from collections import Counter
color_series = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])
print("Original Series:")
print(color_series)
print("\nFiltered words:")
result = mask = color_series.map(lambda c: sum([Counter(c.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
print(color_series[result])

velvet thorn
#

hm.

#

I'm sure there's a better way

#

let me think

graceful glacier
#

any suggestions on whether i can use regex to solve this?

velvet thorn
#
>>> import re
>>> colours = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])
>>> colours.str.count('[aeiou]', flags=re.I)
0    1
1    2
2    3
3    1
4    2
5    2
dtype: int64
#

there you go
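
To finish the original task, the counts can be turned into a boolean mask. This is a sketch built on the same `colours` series shown above.

```python
import re
import pandas as pd

colours = pd.Series(['Red', 'Green', 'Orange', 'Pink', 'Yellow', 'White'])

# Count vowels case-insensitively, then keep words with at least two.
vowel_counts = colours.str.count('[aeiou]', flags=re.I)
filtered = colours[vowel_counts >= 2]

print(filtered.tolist())  # ['Green', 'Orange', 'Yellow', 'White']
```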

graceful glacier
#

thanks

velvet thorn
#

yw

atomic forge
#

is Pandas worth learning

velvet thorn
#

yes

#

for data analysis

atomic forge
#

hm

#

u need to learn that

#

and

#

matplotlib

#

for graphing

#

then again mysql does the job of pandas so

velvet thorn
#

no

#

SQL does not do the job of pandas...

#

and pandas doesn't do the job of SQL either

atomic forge
#

they both deal with

#

data bases

#

thro code

velvet thorn
#

SQL deals with databases

#

pandas doesn't

#

the abstraction is different.

atomic forge
#

then how would u describe a pandas dataframe

velvet thorn
#

in particular, SQL focuses strongly on guarantees that databases provide, like ACID

#

the DataFrame is an abstraction representing tabular data

atomic forge
#

and dont say a dictionary of series cuz it rly isnt :/

#

well i mean it IS but

#

acc yea

velvet thorn
#

and

atomic forge
#

ic where ur comming from

velvet thorn
#

pandas doesn't need a database

#

it's (more or less) source-agnostic

#

SQL deals only with databases

atomic forge
#

huh

#

ic ic

#

well imma need a resource to learn pandas anyway so if u dont mind

#

!resources

arctic wedge BOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

velvet thorn
#

hm I don't really have one, sorry

#

also pandas is a lot more suited to quick experimentation than SQL

atomic forge
#

h m

#

*puts pandas in code test as if the name is code

velvet thorn
#

because the name of the package is pandas

atomic forge
#

well

#

ok so

#

if im not wrong

#

from what ive learned

#

if u have a dataframe and u wanna only get the rows where a certain condition is true

#
df.iloc[df["column"] > 5]
#

?

#

or was it df.loc

#

damn ot

velvet thorn
#

okay

#

so, if you just want to filter on rows

#

you can do df[df['column'] > 5]

#

.loc is for when you want to filter on rows and columns

#

so say you want all the rows where column_1 > 5, and only the column column_2

#

that would be df.loc[df['column_1'] > 5, ['column_2']]

#

df.loc[row_indexer, col_indexer]

#

.iloc, on the other hand, is for positional indexing

atomic forge
#

ic

velvet thorn
#

so say you want the 3rd row

atomic forge
#

ohhh

velvet thorn
#

df.iloc[2]

#

3rd row, 1st column?

atomic forge
#

thats the

#

3rd columb

#

o nvm

velvet thorn
#

df.iloc[2, 0]
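
The indexing forms above, side by side, on a small hypothetical frame (the column values are made up). Note that `.iloc` is purely positional and does not accept a boolean Series as a mask, which is why the filter goes through plain `[]` or `.loc`.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"column_1": [3, 8, 6, 1], "column_2": ["a", "b", "c", "d"]})

rows_only = df[df["column_1"] > 5]                        # boolean mask on rows
rows_and_col = df.loc[df["column_1"] > 5, ["column_2"]]   # mask plus column labels
third_row = df.iloc[2]                                    # 3rd row by position
cell = df.iloc[2, 0]                                      # 3rd row, 1st column

print(rows_only.index.tolist())           # [1, 2]
print(rows_and_col["column_2"].tolist())  # ['b', 'c']
print(cell)                               # 6
```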

atomic forge
#

ohhhh

#

ic ic

#

and if u want to get by row name?

#

ok ok i got it thz

#

rhx

#

rhx

#

thx

velvet thorn
#

rows don't have names, normally.

atomic forge
#

but if u want

#

like say

#

u have a list of states

#

and their population

#

and area

#

and u want

#

the U.S's row

velvet thorn
#

you would have a column

#

called "state" or something like that

#

then df[df['state'] == 'US']
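
A sketch of both ways to get that row. Rows have no names by default, but the index can carry labels, so setting the `state` column as the index allows a lookup with `.loc`. The numbers below are placeholders, not real statistics.

```python
import pandas as pd

# Hypothetical data; the figures are placeholders, not real statistics.
df = pd.DataFrame({
    "state": ["US", "CA", "MX"],
    "population": [300, 40, 130],
    "area": [98, 100, 20],
})

# Filter on the column...
by_column = df[df["state"] == "US"]

# ...or make the column the index and look the row up by label.
by_label = df.set_index("state").loc["US"]

print(by_column["population"].iloc[0])  # 300
print(by_label["population"])           # 300
```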

atomic forge
#

o

#

ic

#

and if im not wrong

#

df["columnname"] would get u a series

velvet thorn
#

yes, that's correct

#

and that Series represents a column

atomic forge
#

yay im understanding this

velvet thorn
#

yup, good job

atomic forge
#

ill keep grinding my data science book then

#

matplotlib

#

cant wait

sudden cedar
#

does anyone know how i would retrieve the neural net with the highest fitness score in NEAT
as in save that data genomes data and only run it by itself

crude karma
#

this might be a lousy question but can you treat data frames like arrays

#

like if i import a .csv file into jupyter... can i treat the data as an array and index stuff out of it

stable sequoia
crude karma
#

thxx

desert rapids
#

Im learning ds now. would any of you be able to send me intro classes udemy or whatever it may be?

crude karma
#

i used freecodecamp

lapis sequoia
#

Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?

lapis sequoia
#

Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?
same question

velvet thorn
#

Hey, guys i have a major project to do in final year of my ug degree
So can anyone tell a good topic for a major project on ai?
@lapis sequoia why do you want someone to tell you what to do

#

the world is full of interesting questions

#

find one close to your heart.

#

that's not how you should end your university journey IMO

lapis sequoia
#

@velvet thorn i need some Suggestions only

velvet thorn
#

then maybe you should tell us what you're interested in

#

because AI is so wide

#

like games? make a game AI

#

interested in photography? how about some kind of smart filter

#

food? maybe an NLP project for parsing recipes?

#

health? ML for mental health, given a daily questionnaire?

#

there are a ton of ideas out there; this is just what I came up with off the top of my head.

lapis sequoia
#

I think when he asked that he meant what kind of project would be respectable enough to pass, sure there's lots of ideas but skills+cliche+some more factors narrow down the spectrum

#

Maybe ask some final-years what they made and get to know what kind of stuff works, usually it should be deployable too

velvet thorn
#

I think when he asked that he meant what kind of project would be respectable enough to pass, sure there's lots of ideas but skills+cliche+some more factors narrow down the spectrum
@lapis sequoia it's really hard to say because standards vary widely across institutions

lapis sequoia
#

That's true

velvet thorn
#

but, yeah, honestly

lapis sequoia
#

which is why it's best to ask around

velvet thorn
#

I have seen way too many people whose first instinct is to come to a community and ask for help with something they should have spent some time thinking about first

#

so I suppose I'm a little jaded

lapis sequoia
#

Welll

#

here it's a big issue

hearty token
desert oar
#

I agree with gm

#

Learning to ask for help is good. But learn to try and think for yourself first.

#

If they said "hi im debating between X Y and Z topics and my advisor is ambivalent, can someone give me insight into any of these domains for an undergrad thesis?"

#

Im sure we would all be happy to help

#

Their question is one step above the people who just ask for homework answers

#

Part of writing a thesis is picking a topic, that's part of research

lapis sequoia
#

Sounds weird but anyone from India here?

ripe forge
#

We've got folks from all over the world, though that question doesn't seem on topic for here.

obsidian mica
#

how could i update a dataframe in real time

#

if i am passing in input from a file and adding to it

lapis sequoia
#

Is it possible to reference "NaN" in pandas? it's automatically filling in blank cells as such and I would like to map it to "" because that seems easier to work with outside of the pandas module. I can't seem to find out how to reference "NaN", though

tidal bough
#

so you want to replace NaN cells with empty strings?

#

What's the datatype of those cells?

lapis sequoia
#

Yes, simply because I can't seem to reference the "NaN" in other statements, like if statements or loops

#

objects

#

If there's a way to reference "NaN" outside of pandas, that would be nice

#

I tried numpy.nan, but no dice

tidal bough
#

it's pandas.nan I believe

#

looks like it's generally np.nan

lapis sequoia
#

just tried that, but apparently the module doesnt have that attribute

onyx juniper
#

hello, do you guys have any resource recommendations for the math side of data science which will accompany me throughout my data science learning journey? i know linear algebra, calculus and linear programming, however, i really need help with the statistics

lapis sequoia
#

I need to map it to a string because I'm using pandas with regular expressions

#

Oh, sorry.

#

found it. there's a .fillna method
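
A minimal sketch of why NaN is hard to "reference" directly, and of the `.fillna` fix. The series is hypothetical example data.

```python
import numpy as np
import pandas as pd

# Hypothetical object column with a blank cell read in as NaN.
s = pd.Series(["abc", np.nan, "xyz"], dtype=object)

# NaN never compares equal to anything, itself included,
# so `value == np.nan` is always False; use pd.isna instead.
print(np.nan == np.nan)      # False
print(pd.isna(s).tolist())   # [False, True, False]

# Replace the missing cells with empty strings.
cleaned = s.fillna("")
print(cleaned.tolist())      # ['abc', '', 'xyz']
```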

fallow sable
#

the book data science from scratch goes a bit into it and has additional resources if you want to learn more which probably answers your question @onyx juniper

lapis sequoia
#

Is there a way in pandas to get the index of a value, column, or row?

crude karma
#

Hi, i am trying to plot a stock market graph on python with the date on the x axis and the price on the y axis. However I get an error that says KeyError: 'Date'.. but in my CSV file there is a column called date? Could it be that the jupyter notebook cannot recognize my DTG format?

woven radish
#

@crude karma make sure capitalization is the same, but for debugging you might need to post a code snippet, like the section of your graphing code and what your df.head() looks like
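
One way to debug a `KeyError` like this is to print the exact column names, since a lowercase header or stray whitespace is enough to cause it. The CSV text below is a made-up example of such a header, not the actual file.

```python
import io
import pandas as pd

# Hypothetical CSV whose header is lowercase and padded with a space.
csv_text = "date , price\n2020-01-01,10\n2020-01-02,11\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())   # ['date ', ' price']

# Normalise the header, then parse the dates.
df.columns = df.columns.str.strip().str.capitalize()
df["Date"] = pd.to_datetime(df["Date"])

print(df.columns.tolist())   # ['Date', 'Price']
```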

crude karma
#

okay i figured it out but this doesnt look like a stock chart.. how do i combine both highs and lows

woven radish
crude karma
#

oh danng okay ill read that and figure it out thanks buddy

lapis sequoia
#

hey, this is my situation: i have a dataframe and need to update each row. for each row i need to make a request to retrieve the new data and replace the old data. The thing is, that if I do this sequentially, it will probably take 15-20 days. That is why I want to use multithreading so that it will only take a few hours if I parallelize the requests. I know this is probably some basic stuff for you, but what is the best way to pass the data from a pandas dataframe to each thread in python?

#

it is not good practice to create variables in a for loop for each file and row, right?

#

that's why i was thinking to create a variable for each row manually instead of doing it dynamically

#

then i would pass each variable with the datarow to a thread, make the request, update and then replace the row in the dataframe with the updated row

#

but that would mean I would need to create 200 variables by hand... so i am sure there must be some better way to do this if creating them dynamically is bad practice

#

how would you go about this?

velvet thorn
#

Is there a way in pandas to get the index of a value, column, or row?
@lapis sequoia which do you want?

lapis sequoia
#

I guess either. or are the methods really different from each other?

velvet thorn
#

hey, this is my situation: i have a dataframe and need to update each row. for each row i need to make a request to retrieve the new data and replace the old data. The thing is, that if I do this sequentially, it will probably take 15-20 days. That is why I want to use multithreading so that it will only take a few hours if I parallelize the requests. I know this is probably some basic stuff for you, but what is the best way to pass the data from a pandas dataframe to each thread in python?
@lapis sequoia does any row depend on any other row?
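
If the rows are independent of each other, one possible approach is a thread pool that maps a request function over the rows, so no per-row variables are needed. This is a sketch only: `fetch_update`, the column names, and `max_workers=8` are all hypothetical, and the function body stands in for the real network call.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical data and request function; swap in the real API call.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["old", "old", "old"]})

def fetch_update(row: dict) -> dict:
    # Placeholder for the network request that returns the refreshed row.
    return {"id": row["id"], "value": f"new-{row['id']}"}

# One dict per row; no per-row variables are needed.
rows = df.to_dict("records")

# The pool runs the requests concurrently and returns results in input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    updated_rows = list(pool.map(fetch_update, rows))

# Rebuild the frame from the results, keeping the original index.
df = pd.DataFrame(updated_rows, index=df.index)
print(df["value"].tolist())  # ['new-1', 'new-2', 'new-3']
```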

#

I guess either. or are the methods really different from each other?
@lapis sequoia hm...let's take a step back

#

why do you want to do that?

lapis sequoia
#

I just think it could be useful sometimes, like if you want to sort something

velvet thorn
#

.sort_values()?

lapis sequoia
#

Oh, I guess there's a method for that but

#

Is there really never a good time to return the index of something? just feels like something that could come in handy

velvet thorn
#

I don't think I have ever needed to do that, but you could filter and then access .index

#

like literally ever as far as I can remember

lapis sequoia
#

oh, I didn't realize the .index method returned a value
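
A small sketch of "filter and then access .index" on hypothetical data. `.index` is an attribute rather than a method, so it is read without parentheses.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame({"price": [5, 12, 9]}, index=["a", "b", "c"])

# Filter first, then read the index labels of the matching rows.
labels = df[df["price"] > 8].index
print(labels.tolist())              # ['b', 'c']

# Position of a column label among the columns.
print(df.columns.get_loc("price"))  # 0
```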

velvet thorn
#

also native Python nan is float('nan')

marble briar
#

I have a dataset with 100 labels how do i calculate the accuracy?

crude karma
#

how come when i specify a figsize, it says 'list' object has no attribute 'loc'

#

when i do df= plt.plot(df.loc[:,'Time'],df.loc[:,'VO2']) it works but when i add a figsize, that error shows up

still verge
#

where are you adding figsize?

crude karma
#

oh i figured it out

#

i added at the end

#

but i'm supposed to add at the beginning

still verge
#

😄
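
A likely explanation for the error above (an assumption, since the full code was not posted): `df = plt.plot(...)` rebinds `df` to the list of line objects that `plt.plot` returns, so any later `df.loc` fails with "'list' object has no attribute 'loc'". A sketch that sets the size when the figure is created and keeps `df` intact, using made-up measurements:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements standing in for the real data.
df = pd.DataFrame({"Time": [0, 1, 2, 3], "VO2": [0.5, 1.1, 1.8, 2.0]})

# Set the size when the figure is created, before plotting.
fig, ax = plt.subplots(figsize=(10, 4))

# Keep the return value in its own name so df stays a DataFrame.
lines = ax.plot(df.loc[:, "Time"], df.loc[:, "VO2"])

print(type(lines).__name__)  # list
```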

austere swift
#

so I'm trying to implement keras tuner as an automatic hyperparameter tuner in my model and for the weight regularization I was wondering what would be a good minimum and maximum value to have?

#

and a good value to step too

#

Ping me if you have an answer and thank you

solid lagoon
#

hello, i have a dataframe with a column which takes only two values, say A and B, and want to create a column A_1,A_2,A_3....A_countA,B_1,B_2,....B_countB

#

how do I achieve this?

#

t = pd.Series(["a", "b", "b", "b", "b", "a"])
t
0    a
1    b
2    b
3    b
4    b
5    a
dtype: object
func(t)
0    a    a_1
1    b    b_1
2    b    b_2
3    b    b_3
4    b    b_4
5    a    a_2
dtype: object

#

can someone tell me how i can achieve func?

still verge
#

try determining the indices of all the letters and store that into another list

#

so that you can use those indices to append to the letters

solid lagoon
#

i have trouble getting the indices

still verge
#

prob best if you had a function that went through the list and kept individual counters

#

and appending them to a larger list

solid lagoon
#

you mean have a global counter

#

did it thanks

#

for reference
t = pd.DataFrame({'A': ["a", "b", "b", "b", "b", "a"]})
counter = {}

def func(x):
    # running count per value, starting at 1 so the first "a" becomes a_1
    ix = counter.get(x, 0) + 1
    counter[x] = ix
    return '{0}_{1}'.format(x, ix)

t.A.apply(func)

velvet thorn
#

uh...

#

@solid lagoon a bit late but well

#

you should actually use cumcount

#
>>> t + (t.groupby(t).cumcount() + 1).map(lambda v: f'_{v}')
0    a_1
1    b_1
2    b_2
3    b_3
4    b_4
5    a_2
dtype: object
solid lagoon
#

thanks man, I knew I had seen this somewhere way before

velvet thorn
#

yeah I know because I myself spent time coding exactly that

#

and then a while later I found there was something for this

lapis sequoia
#

Hey guys is SQLite common for data analysis? I’ve just learned yesterday that Python has a sqlite library built in. Really only need a database to store data in and query what I need. I don’t have admin access on my work laptop so can’t try others without requesting, but is it at least common use?

velvet thorn
#

Hey guys is SQLite common for data analysis? I’ve just learned yesterday that Python has a sqlite library built in. Really only need a database to store data in and query what I need. I don’t have admin access on my work laptop so can’t try others without requesting, but is it at least common use?
@lapis sequoia pandas?

lapis sequoia
#

@velvet thorn I’m trying to avoid reading in the data every time and then selecting what I need. So came across this SQLite database I could potentially use to store the data and then query what I need. Was just wondering if SQLite is commonly used?

velvet thorn
#

for small datasets

lapis sequoia
#

What’s generally considered small?

velvet thorn
#

well

#

anything under a gigabyte

#

but honestly

#

I don't really see the problem with reading data into memory every time...?

#

although if you don't have to do interactive analysis

#

SQLite might be just what you need

#

I presume you're good with SQL so why not
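
A sketch of the store-once, query-later workflow with the built-in `sqlite3` module and pandas. The `sales` table and its columns are hypothetical, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

import pandas as pd

# Hypothetical table; an in-memory database keeps the sketch self-contained.
sales = pd.DataFrame({"region": ["north", "south", "north"], "amount": [10, 20, 5]})

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs

# Load the data once...
sales.to_sql("sales", conn, index=False)

# ...then pull back only what each analysis needs.
north = pd.read_sql_query(
    "SELECT region, amount FROM sales WHERE region = ?", conn, params=("north",)
)
conn.close()

print(north["amount"].sum())  # 15
```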

lapis sequoia
#

I’ve just been finding it super slow and there is certain repetitive analysis I do, that I know exactly what I need.

Well, I know the basics, but it can’t be that hard to pick up!

velvet thorn
#

if you find pandas slow

#

generally one of two things is true

  1. you're using it wrongly
  2. your data is too big
#

anything above a gigabyte (on disk) starts to poke into "bad for pandas" territory (you can consider something like dask I suppose)
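
Before reaching for another tool, one option that stays within pandas is to read the file in chunks and combine per-chunk results. This is a sketch under assumptions: the CSV content, the column names, and the tiny `chunksize` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV; in practice this would be a large file on disk.
csv_text = "category,amount\n" + "\n".join(f"c{i % 3},{i}" for i in range(10))

# Aggregate a few rows at a time so the whole file
# never has to sit in memory at once.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partials.append(chunk.groupby("category")["amount"].sum())

# Combine the per-chunk sums into the final totals.
totals = pd.concat(partials).groupby(level=0).sum()

print(totals.to_dict())  # {'c0': 18, 'c1': 12, 'c2': 15}
```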

lapis sequoia
#

Right I see. Yes I’m going over a gigabyte. I’m super new to this kind of stuff so more than likely not being optimal! Literally just ordered Python for Data Analysis by Wesley McKinney!

velvet thorn
#

okay so

#

very simple rule of thumb

#

if you have a for loop in your pandas code, you're probably doing something wrong

desert oar
#

+1 although i do tend to loop over .columns occasionally

lapis sequoia
#

if you have a for loop in your pandas code, you're probably doing something wrong
@velvet thorn
Definitely no for loops!

desert oar
#

@lapis sequoia can you give an example of something that's slow

#

And can you give an example of a different tool where the same operation is not slow

velvet thorn
#

+1 although i do tend to loop over .columns occasionally
@desert oar oh yeah that's perfectly fine
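
To illustrate the rule of thumb, here is a row-by-row loop next to its vectorised equivalent, plus the harmless kind of loop over `.columns`. The frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "qty": [3, 2, 5]})

# Row-by-row loop: works, but slow on large frames.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorised equivalent: one expression over whole columns.
df["total"] = df["price"] * df["qty"]

# Looping over column names is cheap, since there are few of them.
for name in df.columns:
    print(name, df[name].dtype)

print(df["total"].tolist() == totals_loop)  # True
```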

bitter fiber
#

any1 have a good pyspark resource for me to learn? Im thinking w3 schools for hiveql first.

desert oar
#

@bitter fiber hiveql is basically just sql. i wouldnt start there

#

i dont know of specific resources, but it helps if you think of pyspark as a declarative interface to a query engine

bitter fiber
#

I know sql just wanted to learn how to setup the environment and special quirks

desert oar
#

ah, i cant say i know much about setting up the env

bitter fiber
#

Right.. I have 6 raspberry pi's and 1 main computer that i wanted to interface together for a hobby of mine

desert oar
#

but yeah, spark is weird because you have to think of it more like constructing a query or constructing a program that is to be compiled and executed, rather than executing code line by line as in python

bitter fiber
#

I was thinking maybe it would be useful to create a datamine

desert oar
#

i think typically you deploy on yarn or mesos, although it does support "standalone" cluster mode

#

and i guess it supports k8s too

#

the docs are decent albeit sometimes disorganized

#

i would start by practicing w/ pyspark itself on a local cluster before you try to actually deploy on your rpi farm

bitter fiber
#

What does standalone cluster mode mean?

#

ok so on my own computer in a local cluster meaning running just on my workstation?

#

my workstation that im working on first has 16 physical and 32 total with virtual cpus I want to learn how to utilize everything.

desert oar
#

"standalone cluster" would be spark running directly on the machines without an engine like yarn/mesos/kubernetes underneath it

bitter fiber
#

ah..

desert oar
#

"local cluster" is 1 machine

#

i think for making use of a single high-core workstation spark probably isn't the best unless you have tons of RAM

bitter fiber
#

256 GBS of ram

desert oar
#

oh yeah

#

go for it, see how it works

bitter fiber
#

I bought a 1500 dollar refurbished machine

desert oar
#

i have a similar machine at work, its nice but we never use spark on it

#

for big stuff there i just use dask or i just yolo 30 GBs of data into memory with pandas or data.table

bitter fiber
#

Thats what I do for work. pandas

#

i would like to start a data mine in my house that consumes many public apis

desert oar
#

thats a fun project

bitter fiber
#

I was originally thinking of running with a LAN mongodb to not worry about schemas

#

and just ingest everything into my big computer

desert oar
#

spark/hdfs is probably better for that

#

dump it all to a NAS

bitter fiber
#

Right..

#

My brother has a NAS

floral mantle
#

Any tips or starting points on downloading main posts and comments from a Facebook group I’m a member of? Doing some text analysis and word cloud type stuff - want to know if it’s doable, link examples, and see if anyone’s aware if it’s against any sort of TOS

desert oar
#

read in with rpis and save on hdfs running on a NAS or something? idk

#

@floral mantle that's probably against facebook's TOS and you should check to see if they have any provisions about "automation" or "crawling"

floral mantle
#

I think you’d have to use their Graph API to do it and I see references for it

bitter fiber
#

only 2 TB of harddisk on my workstation though..

floral mantle
#

So need a Dev key etc.

bitter fiber
#

facebook is tough. you need to get verified app permission

#

its not like twitter which is more open

desert oar
#

yeah you can use the graph API

#

(if you can get access)

bitter fiber
#

yeah i would say learning the Graph api is very valuable in the marketplace though

desert oar
#

i dont know what goes into getting that kind of permission

#

@bitter fiber or just work for Big Corp where they contract out all that stuff 😛

#

(but then you end up doing half the work for the contractor anyway because they dont know wtf theyre doing)

bitter fiber
#

Lol they hired some guy to just maintain the facebook api and he barely works now

#

at my job and they cant fire him because everyone else is too lazy to work on that.

#

its more legal stuff than anything lmao

#

@desert oar i had another question; a claim that people say about hadoop is that you use 1/10th the server cost; is that because you compress the data more or something across a cluster?

#

built in backups?

desert oar
#

i dont know what that even means

#

like if you pay for 300 GB of storage hadoop only lets you use 30 GB?

#

i dont work with hadoop directly ever so i honestly have no idea, but that's a questionable claim

bitter fiber
#

Gotcha..

polar acorn
#

Might be true if you have a lot of replication going on. Storing 30 GB of data might end up taking up a lot more than just that.

jovial lotus
#

Hi all, I have a machine learning algorithm that I am trying to code. I have had very little experience with it so I am getting stuck on what type of algorithm I should use. I am trying to make a program that if a song is playing (SONG X), it recommends the next song (song Y). In order to do so, I have a set of variables that song Y should fit or be closest to (variables a,b,c,d,e,f....). All of the variables are percentages. Given a list of songs that song Y could be in, I want to find the best match for song Y in the list. If it was only one variable, all I would do is find the song in the list of songs that has the closest variable value. But what do I do once I start comparing multiple variables?

tidal bough
#

So, find the closest point to a given one in a high-dimensional space, based on a predefined metric?

jovial lotus
#

I believe so yes ?

tidal bough
#

I mean, that's it. It's no different from the one-variable case. The only thing you need is to design the metric function. For that, you could use just sum of squared differences (euclidean metric), but note that you'd want to normalize all of your parameters then (so that they're all 0 mean and 1 variance), otherwise certain parameters will affect the distance more than others.

jovial lotus
#

So sum of the squared differences and then find the song that has the smallest sum. Which should be the song that is least different?

tidal bough
#

Pretty much. I mean, that's just finding the closest point in space to this one.

#

EDIT: fixed link, it's cdist you want.
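
A sketch of the closest-point lookup with `scipy.spatial.distance.cdist`. The feature values are invented and assumed to already be on a common scale.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical feature rows (one per candidate song), already on a common scale.
candidates = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.4],
    [0.6, 0.3, 0.7],
])
target = np.array([[0.55, 0.35, 0.65]])  # the values song Y should be close to

# Squared Euclidean distance from the target to every candidate.
distances = cdist(target, candidates, metric="sqeuclidean")[0]
best = int(np.argmin(distances))

print(best)  # 2
```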

jovial lotus
#

Wow that is literally so much help, thank you

#

I have been trying a bunch of different complicated algorithms for the past few days

#

So there is one caveat, one of the variables isnt a percentage like the others are. That variable being bpm(beats per minute) which doesnt really have a max or a min so there isnt a way for me to represent it in a percentile manner.

tidal bough
#

Sure there is 🙂

jovial lotus
#

How so?

tidal bough
#

For your entire dataset, for each variable, calculate the mean and standard deviation for that variable

#

Then subtract the mean and divide by std.

#

Every variable will then end up with 0 mean and 1 std.

#

It means they'll then lose obvious meaning (having 1 on the bpm score would mean "it's around 1 standard deviation more than the mean among all songs"), but that'd put them all into a similar range.
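
A minimal sketch of the standardize-then-find-the-closest idea described above, assuming the songs are rows of a NumPy array. The feature values and the `closest_song` helper are made up for illustration; the last column plays the role of bpm.

```python
import numpy as np

# Made-up sample data: one row per song, columns are two
# percentage-style features followed by bpm.
features = np.array([
    [0.80, 0.30, 120.0],
    [0.75, 0.35, 124.0],
    [0.20, 0.90, 70.0],
    [0.60, 0.50, 128.0],
])

# Standardize every column: subtract its mean, divide by its std.
z = (features - features.mean(axis=0)) / features.std(axis=0)


def closest_song(z_scores, current):
    """Index of the song nearest to `current` in Euclidean distance."""
    distances = np.sqrt(((z_scores - z_scores[current]) ** 2).sum(axis=1))
    distances[current] = np.inf  # never recommend the song that is playing
    return int(np.argmin(distances))


next_song = closest_song(z, 0)
```

With many songs, a pairwise-distance routine such as the `cdist` mentioned earlier does the same job without the explicit helper.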

#

wait

#

wrong one, hold on

jovial lotus
#

I'm writing all of this down. I think that is all I need so really really thank you

tidal bough
#

and scikit-learn is nice in that it has detailed User Guides

jovial lotus
#

So, I am trying to have this recommender program iterate through the entire list so that each song is "perfectly" played after another and that it creates a playlist/mix. The best move would be to add all of the summed differences through the entire playlist and then compare that with other versions of the playlist, possibly every single version of the playlist. I feel like that would be too brute force. Do you have any advice on how I should do that?

tidal bough
#

Why not just start from a random (or user-defined) point and then traverse the graph of songs, always choosing the closest non-explored point?

jovial lotus
#

Holy crap, okay, better to look back into my algorithm textbooks haha

tidal bough
#

The best move would be to add all of the summed differences through the entire playlist and then compare that with other versions of the playlist, possibly every single version of the playlist.
In geometry terms, this can be rephrased as "I want to find the shortest-length path that visits all of my points exactly once". Do you happen to know how that task is called, perhaps? 🙂

jovial lotus
#

eularian or something right?

#

eu- something haha

#

eulerian path

tidal bough
#

it's, uhm, a very very hard problem. NP-complete, even.

#

So you probably shouldn't bother. Just always go to the closest unexplored neighbour or something.
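
A rough sketch of the "always go to the closest unexplored neighbour" heuristic, assuming the songs have already been turned into standardized feature vectors. `greedy_playlist` is a hypothetical helper name, and the result is a reasonable ordering, not the shortest possible one.

```python
import numpy as np


def greedy_playlist(points, start=0):
    """Order songs by repeatedly jumping to the nearest unvisited one."""
    points = np.asarray(points, dtype=float)
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        current = points[order[-1]]
        # Pick the unvisited song with the smallest Euclidean distance.
        nearest = min(unvisited, key=lambda i: np.linalg.norm(points[i] - current))
        order.append(nearest)
        unvisited.remove(nearest)
    return order


# Toy example with a single feature per song.
playlist = greedy_playlist([[0.0], [10.0], [1.0], [5.0]], start=0)
```

This does roughly n²/2 distance evaluations for n songs, which is far cheaper than comparing every possible ordering.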

jovial lotus
#

yeah, just that every single other song is technically a neighbor

#

unless I could categorize the bpms as neighbors since I want the songs to flow into eachother...

tidal bough
#

yup, it's like euclideanTSP on a plane (where you can go to any city you want and the distance is just the euclidean distance between them), but it's in n-dimensional space instead 😅

#

Nevertheless, whenever your problem turns out to be a subclass of TSP, that's generally a sign that you might want to simplify it.

#

(TSP isn't easily solved even for points on a plane)

jovial lotus
#

Okay yeah, I think this is a great starting point, thank you

#

Mind if I add you in case I have any other questions?

tidal bough
#

sure

modest rune
#

I am having a long back and forth on the coursera forums for Andrew Ng's machine learning course. Either I am just dense and need someone else to explain things to me (most likely), or the other person is wrong. Anyone on here willing to help me out. Here is the discussion (I had to save it to a PDF since the forum post is behind a user/password wall on coursera).

https://gofile.io/d/5ElxoJ

#

Here is the coursera link, in case you have credentials and can view the forum (Just in case you are wary about opening some random dude's PDF from a file sharing site you may not be familiar with)
https://www.coursera.org/learn/machine-learning/discussions/weeks/1/threads/5WdAbuk8EeqXNhLj2fFeZQ

tidal bough
#

I don't think either of you are really wrong. You're basically asking why use the least squared error function of all things. The answer is something like "it's the provably best way under certain assumptions to minimize the mean error".

#

The data you made doesn't really fit these assumptions, so it unsurprisingly is a very bad fit under LSE. You could potentially achieve that orange line by detecting outliers - for example, if one searched for a subset of points of size around 70% of the total that had the least average squared error when fitting a line to it, then one would obtain the red line:

#

So, I guess, I could also say that your concerns are valid, but they pretty much never occur in practice. You don't usually have to fit a line to a dataset that's obviously non-linear.

lapis sequoia
#

Any suggestions on how to use NumPy to get rid of the for-loops in the function shown below?

def mu_davidson(mus, mws, xs):
    mus = np.asarray(mus)
    mws = np.asarray(mws)
    xs = np.asarray(xs)

    a = 0.375
    e = (2 * np.sqrt(mws * np.array([mws]).T)) / (mws + np.array([mws]).T)

    f = 0.0
    n = len(mus)
    for i in range(n):
        for j in range(n):
            f = f + xs[i] * xs[j] * e[i, j]**a / np.sqrt(mus[i] * mus[j])

    mu_mix = 1 / f
    return mu_mix

Here's an example of using the function:

mus = [179.75, 363.87]
mws = [2.016, 28.014]
xs = [0.85, 0.15]
mu_mix = mu_davidson(mus, mws, xs)
print(f'mu_mix = {mu_mix}')
velvet thorn
#

@lapis sequoia what's that supposed to do?

modest rune
#

@tidal bough thanks! FYI, I am taking Professor Ng's course so I can make sense of what you told me a few days ago. I almost have it all sorted out.

Regarding the recent discussion, this is the latest update, which I think clears up my confusion and explains the other dude's opinion:

Found this, and I think it answers my question:

https://www.mathworks.com/help/stats/examples/fitting-an-orthogonal-regression-using-principal-components-analysis.html

"PCA minimizes the perpendicular distances from the data to the fitted model. This is the linear case of what is known as Orthogonal Regression or Total Least Squares, and is appropriate when there is no natural distinction between predictor and response variables, or when all variables are measured with error. This is in contrast to the usual regression assumption that predictor variables are measured exactly, and only the response variable has an error component."

My interpretation of this statement is... Normally, we assume zero error in the X axis (the input), only error in the Y axis (the output). But, in the case that the X value is also susceptible to error, then PCA is a better fit.

So, for the example of the square feet of a home vs predicted home price. There is a negligible error in the square feet measurement that can assumed to be zero, while there is much error in the price values. In that case, do not use PCA.

However, if the city required the use of a specific contractor to make square feet measurements on homes and that contractor was known to intentionally add error into their measurements just to throw everyone off, then PCA would be the better method to use.

IF I understood everything correctly, that explanation clears up my confusion. Please let me know if I am understanding this correctly.

lapis sequoia
#

@velvet thorn See my edit. I added an example of using the function.

modest rune
#

As far as I can tell PCA matches what I was trying to do with my intuitive fitting using the shortest perpendicular distance.

velvet thorn
#

@velvet thorn See my edit. I added an example of using the function.
@lapis sequoia I mean, what are the for loops intended to achieve?

#

understanding how to optimise your algorithm from a high-level description would be simpler than trying to figure it out from your code

lapis sequoia
velvet thorn
#

okay, it is too early for me to read math or iterative numpy code, so I will leave this to someone else...

#

hopefully someone else will come along

#

never mind I got bored and did it

#

@lapis sequoia ((xs * xs.T) * (e ** a) / ((mus * mus.T) ** 0.5)).sum(axis=None)

#

you need to make xs and mus 2D

#

[:, np.newaxis]

tidal bough
#

As far as I can tell PCA matches what I was trying to do with my intuitive fitting using the shortest perpendicular distance.
@modest rune Pretty much. PCA is used for dimensionality reduction - basically, take a lot of points in n-dimensional space, and find an m-dimensional (m<n) subspace to project the points onto such that the distances from the points to their projections are minimized. For n=3,m=2, it's finding a plane in 3d space that the data most closely matches. For n=2,m=1, it's your example. Unlike LSE, PCA indeed doesn't have any direction bias - in fact, PCA is perfectly fine with fitting a vertical line, something LSE can't do at all, because, well, the latter assumes that y is a function of x.
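
To make the contrast concrete, here is a small sketch that fits the same synthetic 2D points both ways: ordinary least squares via `np.polyfit` (vertical distances) and a PCA-style fit via the SVD of the centered data (perpendicular distances). The data and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # noise only in y

# Ordinary least squares: minimizes squared vertical distances.
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# PCA / total least squares: the first right-singular vector of the
# centered data is the direction minimizing perpendicular distances.
points = np.column_stack([x, y])
centered = points - points.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]
tls_slope = direction[1] / direction[0]
```

The perpendicular fit is never flatter than the least-squares fit of y on x, so the two slopes differ slightly even on clean-looking data.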

#

@lapis sequoia

Any suggestions on how to use NumPy to get rid of the for-loops in the function shown below?
I'd say like this:

prod = np.multiply.outer(xs,xs) # prod[i,j] = xs[i]*xs[j]
prod *= e**a
prod /= np.sqrt(np.multiply.outer(mus,mus))
f = np.sum(prod)
#

outer is one of my favorite numpy features; I've done so much stuff to mimic its behavior before I found it 🙂

velvet thorn
#

yeah, I should have used the outer product

#

...too early.

#

perhaps I should go back to sleep

lapis sequoia
#

@tidal bough and @velvet thorn thanks, I had no idea outer existed

tidal bough
#

it can be applied to any numpy ufunc

lapis sequoia
#

What's the difference between np.outer and np.multiply.outer? They appear to do the same thing.

velvet thorn
#

What's the difference between np.outer and np.multiply.outer? They appear to do the same thing.
@lapis sequoia effectively, nothing

#

outer is a method on all numpy ufuncs

#

so, for example, you could have np.add.outer

#

however, because np.multiply.outer is a special operation known as the outer product, it is given a top-level alias np.outer

#

you can tell this if you look at their signatures.

#

np.multiply.outer takes the generic ufunc.outer signature

#

whereas np.outer is different
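
A quick check of where the two actually differ: for 1-D inputs they agree, but `np.outer` flattens its inputs first, while `np.multiply.outer` keeps the input shapes. The arrays below are arbitrary.

```python
import numpy as np

v = np.array([1, 2, 3])
w = np.array([10, 20])

# Same result for 1-D inputs.
same_for_1d = np.array_equal(np.outer(v, w), np.multiply.outer(v, w))

# Different shapes once an input has more than one dimension.
m = np.arange(4).reshape(2, 2)
flattened_shape = np.outer(m, v).shape          # inputs are flattened first
kept_shape = np.multiply.outer(m, v).shape      # shapes are concatenated

# The same .outer method exists on other ufuncs, e.g. addition.
sums = np.add.outer(np.array([1, 2]), np.array([10, 20]))
```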

lapis sequoia
#

Ah, I see. Thanks again.

velvet thorn
#

np

lapis sequoia
#

I revised my previous function based on help from @tidal bough and @velvet thorn. This looks much cleaner.

def mu_davidson(mus, mws, xs):
    a = 0.375
    e = 2 * np.outer(mws, mws)**0.5 / np.add.outer(mws, mws)
    f = np.sum(np.outer(xs, xs) * e**a / np.outer(mus, mus)**0.5)
    mu_mix = 1 / f
    return mu_mix
graceful glacier
#

if anyone whos familiar with SQL help me with why the last statement is printing 1

#

the first table is derived from the 'hacker_news' table. it shows the top users and their score

graceful glacier
#

update: i found my mistake, it was a miscalculation i made with the JOIN statement

tall sierra
#

Hi guys, I make youtube videos where I vulgarize Artificial Intelligence terms and news for non-experts. My goal is to demystify the AI “Black box” for everyone and sensitize people about the risks. Give it a check if you can, I am actually posting a new video in 2 hours ! 😁 I would love any feedback (especially negative, but pertinent) in order to improve my videos and vulgarizing skills! Thank you!
Here's the channel: https://www.youtube.com/c/WhatsAI

hallow briar
#

Got anything on LSTMs? Cause I'd love to know wtf those are doing

midnight goblet
#

yeah LSTM is amazing but I'm looking for YOLO v4 @hallow briar

grand pike
#

Hi All- I am fairly new to discord and have a question regarding time series analysis and handling missing data in a df

#

I have a df, indexed by date, to capture the spread history of various bonds. However, for some bonds, the spread levels become unstable as the bond reaches maturity/is close to being paid off (as shown by the sudden drops in the plot)

#

Now, I want to apply some data quality checking to stabilize such bonds

#

specifically, the DQ check that I want to apply is that across the bonds (which are columns in the df), each time there are 10 consecutive NaN i.e. missing data, cut off the rest of the data as the last 20 days of data

#

However, I am struggling to find a clean way of defining a function to perform this DQ check. Any thoughts on what an ideal approach may be?

#

*as well as the last 20 days of data
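
One possible reading of that check, sketched per column: find the first run of 10 consecutive NaN values, then blank out everything from 20 observations before that run onward. `truncate_after_gap` and its parameters are hypothetical names, and the interpretation (rows rather than calendar days, first gap only) is an assumption. Note that a bond with 10 or more leading NaN values would be wiped entirely under this rule.

```python
import numpy as np
import pandas as pd


def truncate_after_gap(series, gap=10, trim=20):
    """Blank out a series from `trim` rows before its first `gap`-long NaN run."""
    is_missing = series.isna().astype(int)
    # True at the last row of every window made up entirely of NaN.
    full_gap = (is_missing.rolling(gap).sum() == gap).to_numpy()
    if not full_gap.any():
        return series
    gap_end = int(np.argmax(full_gap))        # first window that is all NaN
    gap_start = gap_end - gap + 1
    cut = max(gap_start - trim, 0)
    cleaned = series.copy()
    cleaned.iloc[cut:] = np.nan
    return cleaned


# Toy series: 50 good values, 10 missing, 5 unstable trailing values.
index = pd.date_range("2020-01-01", periods=65, freq="D")
values = np.ones(65)
values[50:60] = np.nan
demo = truncate_after_gap(pd.Series(values, index=index))
```

For a whole frame of bonds, `df.apply(truncate_after_gap)` would run the same check on each column.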

lapis sequoia
#

Hi. The binned distribution of one of the columns in a dataframe is shown below in blue. I've tried removing outliers using IQR and variations of IQR (tuning the quantiles) and in red you see the binned distribution of the subset of elements which lie in the quantiles [0.05, 0.95]
My question is why the red distribution is so much smaller. The filtering removed only about 100 elements. Shouldnt the red be about as high as the blue distribution?

tidal bough
#

@lapis sequoia It looks like the bins of the red one are smaller and the histogram is not normalized (density=True isn't passed), so the smaller the bins, the lower they will be (because fewer elements falls into them).

#

Pass density=True when examining distributions. Here's a comparison:

#
import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0,500,10000)
plt.close()
plt.figure()
plt.hist(X,bins=50)
plt.hist(X,bins=100)
plt.show()

produces:

#
import numpy as np
import matplotlib.pyplot as plt

X = np.random.randint(0,500,10000)
plt.close()
plt.figure()
plt.hist(X,bins=50,density = True)
plt.hist(X,bins=100,density = True)
plt.show()

produces:

tight stone
#

A more specific question to DL/ML:
Are there any publications/projects/demos to a program that involves hand-gestures to control a e.g web-page or similar?

tidal bough
plucky cairn
#

hey beginner question on good algorithms to try for classifying text into one of two categories

tidal bough
plucky cairn
#

that page is exactly what i needed thanks

naive jay
#

hey guys can anyone link me some sites where i can find some data sets, specifically im looking for server logs

tight stone
#

@tidal bough Wow, thanks for those. Exactly what I am searching for

ripe mortar
#

Hello. I'm testing some Pandas and threading/multiprocessing. I find it odd that threading is a bit faster than multiprocessing. The function I passed to multiprocessing.Process and threading.Thread sums() a dataframe and threading finished first. Is this right? I thought multiprocessing would finish faster.

still otter
#

are you actually doing parallel work?

ripe mortar
#

are you actually doing parallel work?
@still otter I'm counting the number of votes in the dataframe per candidate and I pass it on to a function that filters the dataframe by candidate name and sum() them up.

still otter
#

hm. well in general Thread is faster because it has less overhead, but Thread is not capable of parallel computation in pure python (because of the GIL). So which is faster depends on how you are doing the computation and how much data you're working with

#

i don't know much about pandas but it's possible that sum() is run in native code that releases the GIL, which means it can be run in parallel with Threads, in which case the main downside of Thread is sidestepped and Thread will almost certainly be faster in this case

ripe mortar
#

Thank you!

upper vessel
#

anyone know a method to reduce mode collapse in GANs, without adding another neuronal network.

frail arch
#

does anyone know how to add a custom function into the model in Keras? as in I want to pass the output of a layer through my function and use it's output for another layer

hasty grail
#

@frail arch Can't you just use the functional API? Can you provide an example of what you want to achieve?

frail arch
#

@hasty grail for eg. say, I want to take output of a layer, add something to it, pass it to a dictionary and use the dictionary's output as input for the next layer

hasty grail
#

Does the functional API not work for that?

lapis sequoia
gaunt tusk
#

Anyone got any good resources on reinforcement learning?

lapis sequoia
#

hello, i am going in to year 12 and am looking to do CS at uni, can someone explain what a job in data-science would entail

vivid wren
#

I'm working on a pixel art editing program and wanted to know what a good method for finding similar neighbors with bucket fill? I have the pixels mapped with a dictionary in f"{x}x{y}" format. I made my own function which figures out all the valid neighbors recursively but don't know if there is a more efficient method.

#
def bucket_fill(id, layer):
    to_fill = [id]

    def find_neighbors(neighbor_id):
        x, y = neighbor_id.split("x")
        x, y = int(x), int(y)
        l = None if x == 0 else f"{x-1}x{y}"
        r = None if x == layer.width - 1 else f"{x+1}x{y}"
        t = None if y == 0 else f"{x}x{y-1}"
        b = None if y == layer.height - 1 else f"{x}x{y+1}"
        neighbors = [l, r, t, b]
        return [n for n in neighbors if n]

    def check_neighbors(neighbor_list): #Check if color matches, and not already in the to-fill list, returns new pixels to check after adding them to to-fill
        new_neighbors = []
        for n in neighbor_list:
            if not n in to_fill:
                if layer.pixeldict[n].color  == layer.pixeldict[id].color:
                    to_fill.append(n)
                    new_neighbors.append(n)
                    print(f"Added {n}")
        return new_neighbors

    def check(neighbor_id, i): #Recursively check a pixel and its neighbors
        print(f"Check recursion {i}")
        neighbors = find_neighbors(neighbor_id)
        neighbor_list = check_neighbors(neighbors)
        print(neighbor_list)
        for n in neighbor_list:
            check(n, i + 1)

    check(id, 0)
    return to_fill
hasty grail
#

Am not entirely sure why you need to store them in a dictionary. Wouldn't a 2-D array do pretty much the same thing?

#

Seems that you are performing a depth-first search (the recursion follows one neighbor all the way before backtracking), which is perfectly valid imo
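
A sketch of an iterative alternative, assuming the layer can be viewed as a 2-D grid of colors indexed as `grid[y][x]`. It uses a queue instead of recursion (so large fills cannot hit Python's recursion limit) and a set for the visited check (so membership tests stay cheap). `bucket_fill_region` is a hypothetical name.

```python
from collections import deque


def bucket_fill_region(grid, start):
    """Return the set of (x, y) cells connected to `start` with the same color."""
    height, width = len(grid), len(grid[0])
    start_x, start_y = start
    target = grid[start_y][start_x]
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        # Visit the four direct neighbors: left, right, top, bottom.
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and (nx, ny) not in seen and grid[ny][nx] == target:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return seen


demo_grid = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
]
region = bucket_fill_region(demo_grid, (0, 0))
```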

modest rune
#

This I think is a super easy question. In numpy, what is the best way to create a 2D (lets call it G) array with dimensions Mx2, each column is a feature that has a defined linspace representing values I want to predict, and G needs to be every possible combination of the the 2 features linspace values.

#

for example:

# probably would create these using linspace to create these, unless a function exists that does 
# everything at once.
ages = [45;50;55;60]
nose_pimples = [0;1;2]

# Desired Result
G = 
[ 0, 45;
  0, 50;
  0, 55;
  0, 60;
  1, 45;
  1, 50;
  1, 55;
  1, 60;
  2, 45;
  2, 50;
  2, 55;
  2, 60  ]
hasty grail
#

np.stack(list(itertools.product(nose_pimples, ages)))

#

or you can use meshgrid I guess
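
The meshgrid route could look roughly like this, using the example values from the question. `indexing="ij"` makes the first input vary slowest, which gives the row order shown in the desired result.

```python
import numpy as np

ages = np.array([45, 50, 55, 60])
nose_pimples = np.array([0, 1, 2])

# Each output grid has shape (len(nose_pimples), len(ages)).
pimple_grid, age_grid = np.meshgrid(nose_pimples, ages, indexing="ij")

# Flatten both grids and place them side by side as two columns.
G = np.column_stack([pimple_grid.ravel(), age_grid.ravel()])
```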

arctic wedgeBOT
#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

hasty grail
#

!eval

import numpy as np
print(np.mgrid[0:3:1, 45:61:5].reshape(2, -1).T)
arctic wedgeBOT
#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

hasty grail
#

^

modest rune
#

thanks

tidal bough
#

Anyone got any good resources on reinforcement learning?
@gaunt tusk https://www.coursera.org/learn/practical-rl/ I'm doing this coursera course on it.

Also:

• Sutton, Barto - Reinforcement Learning: An Introduction
• Berkeley - CS285: Deep Reinforcement Learning
(copied from the AI discord server)

frail arch
#

how to install caffe in windows? I am getting error CMake Error: CMake was unable to find a build program corresponding to "Ninja". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool.

#

I have latest CMake installed

raven mulch
#

In this video we go over the distinction between invariance and sensitivity based adversarial perturbations. The former being a much less studied attack which is able to break "robust" models!

I encourage you to create discussions here or on the youtube comment section about the paper and share related work, we can all learn from each other!

Video: https://www.youtube.com/watch?v=NhZY2tnDTZg

Paper: https://arxiv.org/abs/2002.04599

Abstract: Adversarial examples are mal...

crimson umbra
#

anyone know where i can get lecture videos and slides for the latest cs109 courses with a recent Python 3.x version
Or is the 2015 version the only one that's free for all

weak kiln
#

If you do enjoy it please consider subscribing and promoting the channel! It encourages me to put more effort into these videos I have other videos which span related topics.
@raven mulch

I think it's great that you're creating YouTube content and sharing it with our members in a channel that has a relevant topic - but "remember to subscribe" crosses the line over into straight up advertising, and violates our rules. Maybe you can use this channel to ask for feedback, instead. I wouldn't have any problem with that.

raven mulch
#

Sorry I will edit that part out

#

Done

weak kiln
#

Just try to keep that in mind for the next video, though. We technically don't allow advertising, but I think it's a shame to completely block content creators who are making things that may be relevant to the interests of our members - so you're basically walking a bit of a tightrope with these posts.

raven mulch
#

Yep definitely. I appreciate it! I’m mainly looking to gain a following of people to discuss papers I make videos on, I understand how advertising can become annoying though

velvet thorn
#

Yep definitely. I appreciate it! I’m mainly looking to gain a following of people to discuss papers I make videos on, I understand how advertising can become annoying though
@raven mulch honestly, it's a p good topic

#

but I was not sure if it was against the rules

tawny pivot
#

Hi i have dataframe with multi columns and nan values at the beginning. And I try this:

#

i need to each columns beginning value's timestamp

#

any idea?

novel remnant
#

do you want the timestamp (which is in index?) for the first non nan value per dataframe column?

modest rune
#

@tidal bough i finally figured out how to generate a surface plot for IV. I mean, I actually understand what the heck I am doing and how the math works under the hood. Thanks for your help earlier. And you were right, I was a few characters away from having working code. But, I was about 10 hours of learning away from actually understanding what was going on. Anywho, coursera, for free, has an excellent intro course on machine learning by Stanford's Dr. Andrew Ng. I'm only 2 weeks into the 8 Week course,but prof. Ng explains everything in a way I can understand and doesn't make assumptions about my math background.

tidal bough
#

yeah, I very much liked how that course gives you an understanding of how it works under the hood

#

you won't need to actually implement these algorithms, most likely - just use premade algorithms from libraries like scikit-learn or pytorch - but it's going to be useful if you would want to understand ML articles or code some advanced (and so non-standard) algorithm.

modest rune
#

Yeah. I hate blindly using a library without enough understanding of the underlying principles. I just can't be confident in my usage.

tawny pivot
#

do you want the timestamp (which is in index?) for the first non nan value per dataframe column?
@novel remnant yes i need a new dataframe that contains: column names and first non nan value's index

lapis sequoia
#

Anyone here experienced with tensorflow?

novel remnant
#

@tawny pivot

something like this then?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', '2020-01-05', freq='d'),
    'a': [np.nan, np.nan, 1, 2, 3],
    'b': [np.nan, 1, 2, 3, 4],
    'c': [1, 2, 3, 4, 5]
})
df.set_index('time', inplace=True, drop=True)

# This is the part that you want
new_dict = {}

for col in df.columns:
    new_dict[col] = df[~pd.isna(df[col])].index[0]
    
pd.DataFrame.from_dict(new_dict, orient='index').T
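
A shorter variant of the same idea, as a sketch: `Series.first_valid_index` returns the index label of the first non-NaN entry, so applying it column by column gives one timestamp per column without the explicit loop. The frame below mirrors the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [np.nan, np.nan, 1, 2, 3],
        "b": [np.nan, 1, 2, 3, 4],
        "c": [1, 2, 3, 4, 5],
    },
    index=pd.date_range("2020-01-01", "2020-01-05", freq="D"),
)

# One entry per column: the index label of its first non-NaN value.
first_valid = df.apply(pd.Series.first_valid_index)
```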
#

@lapis sequoia pure tensorflow or tensorflow keras?

lapis sequoia
#

uhh like a simple doubt

#

first i download the dataset using this

#

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

#

now i need to use this function to resize the images to 64 x 64

#

but i am getting an error 🤔

novel remnant
#

what error are you getting?
the images are grayscale do you reshape them first to shape (-1, 28, 28, 1) before resizing?

plucky cairn
#

Are there best practices for text pre-processing?

#

but applying this using a pandas transform is super slow on 2k text bodies

#

and i will need to do it on 16k on the out-of-sample texts

lapis sequoia
#

I got this error

#

output dimensions must be positive [Op:ResizeBilinear]

novel remnant
#

I'm not getting any errors on my part, can you share the part of your code that throws the error?

tawny pivot
#

@tawny pivot

something like this then?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', '2020-01-05', freq='d'),
    'a': [np.nan, np.nan, 1, 2, 3],
    'b': [np.nan, 1, 2, 3, 4],
    'c': [1, 2, 3, 4, 5]
})
df.set_index('time', inplace=True, drop=True)

# This is the part that you want
new_dict = {}

for col in df.columns:
    new_dict[col] = df[~pd.isna(df[col])].index[0]
    
pd.DataFrame.from_dict(new_dict, orient='index').T

@novel remnant this works for me thank you ^_^

novel remnant
#

cheers!

solid aurora
#

What is the best technique for finding feature importance in a dataset?

#

Let's say I have a trained SKLearn model with a good enough (~80%) accuracy

#

There seem to be several ways I can find feature importance:

#
  • sklearn's .feature_importances_, which I'm not sure how it works
#
  • Recursive Feature Elimination
#
  • Permutation feature importance
#

Which of the above will give the "best" results?

#

And when would I want to use one over the other?

#

And should I be doing RFE/PFI with a cross-validation set? or using accuracy from the training set itself?

plucky cairn
#

can someone help me understand sparse matrices and how to manipulate them. from what i understand a sparse matrix basically only gives the non-zero entries to save memory

#

can i use standard numpy functions on a sparse matrix?

#

particularly i want to do something like
np.sum(np.multiply(x!=0,(y>0)[:,None]),axis=0)

tidal bough
#

Yup, you pretty much can.

plucky cairn
#

okay, can i also ask why after using a count_vectorizer i would have columns that sum to zero?

#

that would mean the word doesn't show up in any documents right?

#

something like this

cv = CountVectorizer()
bagofwords = cv.fit_transform(text)
np.min(np.sum(bagofwords,axis=0))
#

returns zero

tidal bough
#

I think so, yeah. It's weird if that's the case

plucky cairn
#

hmm, must mean something weird is going on

lapis sequoia
#

I'm having some trouble wrapping my head around how to approach the problem I am currently having with pandas and my dataframe.

Basically I have 4 columns of daily values from different shop locations, using a datetime index. I want to resample it into monthly columns, but without losing each daily value the way resample.mean would. I have several years' worth of data, and it would be nice to have each column in the final df be labeled Month Year. I'm a little stuck. Any help would be appreciated.

sudden kernel
#

would be easier to visualise what you want if you showed us a sample of your data and how you want it to look like

lapis sequoia
#

One moment

#

raw data is formatted like this

#

I can do it manually via

a = df.loc['2011-08']
a = a.unstack().reset_index(drop=True)

But its a huge hassle to do for large datasets and I know there is some way my beginner brain isn't seeing

#

The key is to preserve the data and not just use resample.mean or some other thing that doesn't allow me to keep all data.

sudden kernel
#

so you basically want to reshape all rows from 2016-04 in the original df, to a single column in the new df

lapis sequoia
#

yes, but my data goes back to 1993, till today

#

so I need a solution that isnt using .loc 444 times

#

I have a sample csv with data from 2006 till 2020 with random int in it to try to figure this out
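
One way to avoid repeating `.loc` for every month, sketched on synthetic data: group the rows by calendar month, flatten each group the same way as the manual `unstack().reset_index(drop=True)` step, and collect the results as columns. The column names A to D and the date range are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the shop data: 4 daily columns over 3 months.
index = pd.date_range("2011-08-01", "2011-10-31", freq="D")
df = pd.DataFrame(
    np.arange(len(index) * 4).reshape(len(index), 4),
    index=index,
    columns=["A", "B", "C", "D"],
)

# One flattened Series per month, keyed by a "Month Year" label.
monthly = {
    period.strftime("%B %Y"): group.unstack().reset_index(drop=True)
    for period, group in df.groupby(df.index.to_period("M"))
}

# Months have different lengths, so shorter columns are padded with NaN.
out = pd.DataFrame(monthly)
```

Every daily value is kept; only the shape changes, with one column per month.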

arctic wedgeBOT
#

Hey @lapis sequoia!

It looks like you tried to attach file type(s) that we do not allow (.csv). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

gaunt tusk
#

@tidal bough Thank you for those resources, both look nice

still verge
#

anyone have pyspark experience and want to share what it was like for you?

velvet thorn
#

anyone have pyspark experience and want to share what it was like for you?
@still verge what was what like?

#

working with PySpark?

#

like working with pandas but much more tiring and bothersome

still verge
#

yeah

#

what makes it tiring?

velvet thorn
#

it being distributed means that stuff runs slower

#

on small datasets

#

of course, you wouldn't be able to do that kind of stuff on large datasets with native pandas (would need, like, dask or something)

#

but, yeah.

#

the abstractions are not as convenient

#

e.g. selecting specific rows and columns

still verge
#

many people told me not to use it if the dataset is small, is it that bad?

velvet thorn
#

you have to litter your code with a lot of the function operators

#

many people told me not to use it if the dataset is small, is it that bad?
@still verge without a reason, I'd say you shouldn't

still verge
#

thanks for the input!

frail arch
#

can someone help me with caffe installation?

#

is it supported for Python 3.8?

lapis sequoia
#

hello, if you are a little bit familiar with multithreading, can you help me understand what i am doing wrong here?

#
import _thread
from threading import Thread, Lock

mutex = Lock()

df = pd.read_csv(f"dftest_1.csv")
df = df.reset_index(drop=True)
df['id']='NaN'
df['new_score'] = 'NaN'

for index, row in df.iterrows():
    s = row['full_link']
    s = s[38:44]
    df.at[index, 'id'] = s

def get_new_data(index, row):
    global df
    submission = reddit.submission(row['id'])
    print(submission.score)
    mutex.acquire() 
    df.at[index, 'new_score'] = submission.score
    mutex.release()

for index, row in df.iterrows():
    _thread.start_new_thread(get_new_data, (index, row))
#

I am loading a csv, then create two new columns filled with 'NaN'. then i create the ID from the full link. so far so good

#

now, I try to update the column 'new_score'. I do this using _thread so the requests i make with reddit.submission() happen all at the same time.

#

in the get_new_data() function I make the request and print the submission.score. it works and i can see the scores one after another and almost instantly - so the multithreading seems to work

#

then i lock the dataframe, write the new value and release it again

#

but the dataframe that is returned doesnt have the new values

#

no error

#

but also no new values
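
A likely cause, though this is a guess from the snippet alone: `_thread.start_new_thread` returns immediately and nothing waits for the workers, so the dataframe may be inspected (or the script may exit) before any thread has written its value. A sketch of an alternative using `concurrent.futures`, which blocks until every result is back. `fetch_score` is a hypothetical stand-in for the `reddit.submission(...)` call and just fabricates a number.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fetch_score(submission_id):
    """Hypothetical stand-in for the network request; returns a fake score."""
    return len(submission_id) * 10


df = pd.DataFrame({"id": ["abc123", "de45", "f6"]})

# pool.map preserves input order and the `with` block waits for all workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(fetch_score, df["id"]))

# Assign once from the main thread, so no lock is needed.
df["new_score"] = scores
```

Collecting results first and writing the column in one step also avoids mutating the dataframe from several threads.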

ivory panther
#

Try to use Ray for multithreading

#

Any idea to convert this data frame into a time serie taking months' columns as index?

velvet thorn
#

Any idea to convert this data frame into a time serie taking months' columns as index?
@ivory panther which is the month column?

ivory panther
#

Enero, Febrero, Marzo ... (January, February, March, etc)

velvet thorn
#

what do the numbers represent then

#

since I see a 2

neon path
#

Looks like murder counts to me

ivory panther
#

The number of crimes

velvet thorn
#

no, I mean

#

what do you expect the result to look like

#

in general for "how do I convert this to that" questions sample output is very useful in helping people understand what you expect

#

because "time series" is rather vague

ivory panther
#

Have date instead of just year. For example 2015/January, Aguascalientes, Homicidio, 2 (crimes)
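
That kind of wide-to-long conversion can be sketched with `melt`. The column names `year`, `state` and `crime`, and all the counts, are invented here since the real frame was only shown as an image; only the month column names come from the discussion.

```python
import pandas as pd

# Made-up sample in the wide layout: one column per month.
wide = pd.DataFrame({
    "year": [2015, 2015],
    "state": ["Aguascalientes", "Aguascalientes"],
    "crime": ["Homicidio", "Robo"],
    "Enero": [2, 5],
    "Febrero": [1, 7],
    "Marzo": [3, 4],
})

month_numbers = {"Enero": 1, "Febrero": 2, "Marzo": 3}  # extend to all 12 months

# One row per (year, state, crime, month) combination.
tidy = wide.melt(id_vars=["year", "state", "crime"],
                 var_name="month", value_name="count")

# Build a real date from the year and the month number.
tidy["date"] = pd.to_datetime(pd.DataFrame({
    "year": tidy["year"],
    "month": tidy["month"].map(month_numbers),
    "day": 1,
}))

ts = tidy.set_index("date").sort_index()[["state", "crime", "count"]]
```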

bleak swift
#

i pressed anaconda navigator but it isnt working
(i cant open my anaconda navigator how to fix?)

jolly sinew
#

what are you trying to get to anaconda for?

#

I don't really use the navigator, but you can open a terminal on a mac and type jupyter notebook and it'll open notebooks

#

open anaconda prompt / miniconda prompt on windows to do the same thing

bleak swift
#

thanks

serene scaffold
#

I don't know that I like matplotlib

tidal bough
#

For that matter, what nice higher-level plotting libraries than matplotlib are there? I don't quite get how they can be easier to use than the latter.

serene scaffold
#
import typing as t
from plotter import Plot, Point, Color

class Car(Point):
    value: int
    speed: int
    type: Color

cars: t.List[Car]
car_chart = Plot(cars)
car_chart.show()
#

that's the kind of API I'd expect to see

tidal bough
#

ah, interesting

serene scaffold
#

I guess I could make one but that requires work.

atomic forge
#

hi Everyone

#

so I am a python coder and

#

I have experience with Numpy but

#

Im trying to learn Pandas

#

and eventually matplotlib

#

does anyone have any good resources for learning these libraries

#

book, yt tutorials, anything

tidal bough
#

for matplotlib and numpy, pretty much docs

#

pandas docs... aren't that nice

atomic forge
#

uhh

#

then what do i use

#

for pandas

#

well matplotlib as well if smth makes it easier to learn

#

?

wintry sapphire
#

Hello

#

is anyone familiar with pandas

#

I need some help

marsh seal
#

Hello, I want to iterate over a period of time. how can i compare a day's information with its previous day? One of the things i want to compare is the close of a stock with its previous day

wintry sapphire
#

Hey @marsh seal

#

I think I am doing something similar

#

do you know how to call on the previous row data?

marsh seal
#

hey thanks for a quick reply @wintry sapphire no i don't

novel remnant
#

use shift and create a new column with the shifted values for which you can compare with the original values

#

this way you can vectorize the operations for quick results

marsh seal
#

@novel remnant Hi potaki, could you show me an example please

novel remnant
#

sure one moment

#

for example if you want to subtract the value of the previous day from the value of the current day

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', '2020-01-10', freq='d'),
    'a': np.arange(10)
})
df.set_index('date', inplace=True, drop=True)
df['a_previous'] = df.a.shift()
df['a_minus_previous'] = df.a - df.a_previous
df
lapis sequoia
#

Series of Data Science Articles for getting started with Data Science / Machine Learning, includes step-by-step implementations:

#

Please consider reading if you are interested and subbing to my channel to help build your knowledge and skills in data science

silk saddle
#

hey, im kinda starting out on python, ik some basics n stuff and after some help from ppl i wanna get into machine learning, i think? xd, i dont rly know what it is, anyone got resources on what it is

lapis sequoia
#

hey @silk saddle

#

I release at least one episode every week

#

if you guys arent sure about anything or don't understand any concepts leave a comment on my channel or article

#

and i will get back to you as soon as possible

silk saddle
#

tysm ❤️

arctic canopy
#

Sup guys, I'm learning the math that is needed for ML (which will take 2-3 months), but in these 2-3 months I don't want to just learn math without programming (I already have about 6 months of experience with python). Can you give me any advice on what kind of projects I should work on right now? I'm kinda lost.

desert oar
#

since you are learning machine learning, you can try implementing some algorithms from scratch

#

maybe start with linear regression with OLS and/or gradient descent

#

principal components

#

things like that

#

maximum likelihood even
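As a rough idea of what "from scratch" could look like for the first suggestion, here is a minimal sketch that fits a line with both a least-squares solve and plain gradient descent. The data, learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Noise-free toy data on the line y = 2x + 1.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature

# Ordinary least squares via NumPy's least-squares solver.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the mean squared error.
w_gd = np.zeros(2)
lr = 0.5
for _ in range(2000):
    grad = 2.0 / len(y) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

print("OLS:", w_ols)
print("GD: ", w_gd)
```

Both should recover an intercept near 1 and a slope near 2; the gradient line is where the calculus being learned shows up directly.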

arctic canopy
#

but things like this don't need math?

desert oar
#

sure they do

#

you need to know and understand the equations in order to implement them

arctic canopy
#

but Im still learning the math so how I can deal with them?

velvet thorn
#

you can do the simpler things.

#

what specifically are you learning now?

#

alternatively, you can work on general projects that are not specifically related to ML

arctic canopy
#

currently learning calculus

velvet thorn
#

hm.

#

what have you done already

arctic canopy
#

calculus* btw thanks for the advice it will try to work on projects not about ML.

#

if you mean with math nothing much but If you mean python projects, I have made a website and some automation stuff

velvet thorn
#

yup, that's cool!

#

if you wanna do ML

#

it's important to also be a good programmer.

#

have you worked with visualisation tools?

#

in particular, matplotlib

arctic canopy
#

not really

#

should I have a look at it?

velvet thorn
#

if you want

#

just thought it might fit into calculus

#

it's kind of hard to think of a programming project that can focus on that

arctic canopy
#

yeah I think my question is kinda weird haha, thanks for answering. I think I will try making bots for some platforms, that is the idea that just came into my mind.

tidal bough
#

no Reinforcement Learning there, sadly, for that you need a more serious course, from the Advanced ML specialization.

nimble solar
#

hi. i am trying to install a local package on disk
pip install /directory/my_package
but when i run jupyter, and import my_package , it says it is not found

#

is there a work around to fix this?

#

i tried installing the package with the same pip as in the which jupyter directory

velvet thorn
#

yeah I think my question is kinda weird haha, thanks for answering. I think I will try making bots for some platforms, that is the idea that just came into my mind.
@arctic canopy that's fine too! as long as you're practicing programming and learning new things, don't worry

#

there are a ton of interesting concepts that I picked up along the way that became relevant months later

arctic canopy
#

thanks a lot man,thats mean a lot

velvet thorn
#

yw

#

feel free to ask if you need any other help

dark agate
#

If you had $5.2K in tuition reimbursement from your employer for accredited coursework, what course/degree/boot camp would you use it for?

#

^for someone who has Python basics down but wants to pivot careers into data science

desert oar
#

@nimble solar are you using a venv or other environment?
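One way to check for that kind of mismatch, sketched under the assumption that the notebook kernel runs a different interpreter than the `pip` on the shell's PATH: ask the kernel which Python it is, then install through that exact interpreter. The package path is the one from the question.

```python
import sys

# The interpreter the notebook kernel is actually running on.
print(sys.executable)

# Installing through that interpreter means pip and the kernel agree on
# where packages go.
install_cmd = [sys.executable, "-m", "pip", "install", "/directory/my_package"]
print(" ".join(install_cmd))

# To actually run it from Python:
#   import subprocess
#   subprocess.check_call(install_cmd)
```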

wintry sapphire
#

Hi guys, I am trying to achieve this in a Dataframe

#

but i keep getting NaN

#

does anyone know how to do it?

flat quest
#

@arctic canopy. Yeah like salt was saying try reimplementing some algorithms or papers. For the first few you might want to follow a guide.

As for the actual math, if you've covered calculus you can reimplement many of the basic algorithms without much difficulty. linear, logistic should all be doable

#

I mean ur dividing by zero @wintry sapphire. Pandas doesn't know how to deal with anything divided by 0

wintry sapphire
#

Oh

#

@flat quest so if I leave it as NaN

#

maybe

#

I should bfill this right?

flat quest
#

well yeah but depends on the problem

wintry sapphire
#

Hmm alright

#

cause I want to find the percentage change

#

@flat quest

#

How do I

#

fill my 1 Jan number with my second Jan?

wintry sapphire
#

Hey @flat quest , do you happen to know why

#

even after I fill my 1 Jan with a number

#

i still get an error?

flat quest
#

not sure i totally follow what ur trying to do
fill 1jan number with second jan?

wintry sapphire
#

@flat quest Alright so here is my output

#

Basically in option 1

#

column

#

I want 2019-01-01

#

to be my initial which is 10,000

#

for the next date, 2019-01-02, I would want it to be the value under StkB_close 2Jan - 1 Jan divide by 1 Jan

#

times the value above in option 1 in 1 jan

#

Meaning

#

In my option 1, for 2 Jan

#

the value would be

#

(101.12 - 101.12) / 101.12 * 10000

#

Assuming the final value is 30000

#

Then for 3 Jan

#

it would be

#

(97.40 - 101.12) / 101.12 * 30000

#

not sure i totally follow what ur trying to do
fill 1jan number with second jan?
@flat quest this is what I'm trying to do

#

But I keep getting an NaN

velvet thorn
#

@wintry sapphire df / df.shift(1) - 1
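A small sketch of the day-over-day percentage change, using the three closing prices quoted above. The first row has no previous day, so it comes out as NaN, which may be where the NaN is coming from:

```python
import pandas as pd

close = pd.Series(
    [101.12, 101.12, 97.40],
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    name="StkB_close",
)

# (today - yesterday) / yesterday, written with shift ...
manual = close / close.shift(1) - 1
# ... and the built-in helper that does the same thing.
builtin = close.pct_change()

print(pd.DataFrame({"manual": manual, "builtin": builtin}))
```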

wintry sapphire
#

ohhh

#

@velvet thorn do you know

#

How to print out

#

certain rows and columns only

#

like in my DF, I have 5 columns = A B C D E

#

But I only wanted rows 5, 6 from Columns C D E

velvet thorn
#

hm.

#

are you new to pandas?

#

that is a very basic operation

wintry sapphire
#

is it

velvet thorn
#

I suggest you read this

wintry sapphire
#

the .loc?

#

@velvet thorn so what I did was

#
for i, one_d in enumerate(date_check):
    print(portfolios.loc[one_d, 'Option_1'])
#

where date_check is the dates which I want to find

#

dates are the

#

index

#

but I want it to be from

#

several columns

#

not just

#

option_1
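For several columns at once, `.loc` accepts a list of row labels and a list of column names, so the loop is not needed. A minimal sketch, assuming the dates are the index; `Option_2`, `Option_3` and all the values are made up:

```python
import pandas as pd

portfolios = pd.DataFrame(
    {
        "Option_1": [10000.0, 10000.0, 9632.1],
        "Option_2": [10000.0, 10050.0, 10020.0],
        "Option_3": [10000.0, 9990.0, 9975.0],
    },
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
)
date_check = pd.to_datetime(["2019-01-02", "2019-01-03"])

# Rows by label (the dates), columns by name, in one call.
subset = portfolios.loc[date_check, ["Option_1", "Option_2"]]
print(subset)
```

The positional version of the earlier question (rows 5 and 6 of columns C, D, E in a five-column frame) would be `df.iloc[5:7, 2:5]`.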

copper hemlock
#

i can't seem to calculate the input features for my linear layer in pytorch

#

i apply the formula but i get size mismatch error

#
self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)        
self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
self.conv3 = nn.Conv2d(in_channels=12, out_channels=24, kernel_size=5)
        
self.fc1 = nn.Linear(in_features=?????, out_features=360)


#forward method for pooling
tensor = F.relu(self.conv1(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)
print(tensor.size())        

tensor = F.relu(self.conv2(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)

tensor = F.relu(self.conv3(tensor))
tensor = F.max_pool2d(tensor, kernel_size=2, stride=2)
#

can someone ELI5?

#

CxHxW = 1x40x40

#

according to my calculations its supposed to be 24\*1\*1 but i get size mismatch error

#

nvm im dumb, i was calculating correct, error was elsewhere 😄
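For reference, the flattened size for the layers above can be checked with the usual output-size formula, floor((size + 2*padding - kernel) / stride) + 1, applied once per conv and once per pool. A minimal sketch assuming the 1x40x40 input and no padding:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 40
for _ in range(3):
    size = conv_out(size, kernel=5)            # Conv2d(kernel_size=5)
    size = conv_out(size, kernel=2, stride=2)  # max_pool2d(kernel_size=2, stride=2)

out_channels = 24  # out_channels of conv3
in_features = out_channels * size * size
print(size, in_features)
```

The spatial size goes 40, 36, 18, 14, 7, 3, 1, so `in_features` for `fc1` is 24 * 1 * 1.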

hushed flax
#

Wow

supple frigate
#

Hello guys, what do i need to know to getting a start with data science? i learned a little bit of pandas and numpy

lapis sequoia
#

scikitlearn is nice to know and is fun to work with

#

@supple frigate it contains also datasets that you can work with and make predictions or stuff like that

#

I have another question though:

I have two different CSVs with time series data. One Table is continuous, starting in 01.01.2017 at 00:00. From there each row represents one hour (1. Table). The data looks kind of like this:

  1. Table aka df1:
Date,                   Volume
2017-02-03 12-PM,       9787.51
2017-02-03 01-PM,       9792.01
2017-02-03 02-PM,       9803.94
2017-02-03 03-PM,       9573.99

The other table contains events that happened and are serialized by UNIX datetime in seconds. I was able to convert it to datetime and group it by hour with this code:

df['datetime'] = pd.to_datetime(df['created_utc'], unit='s')
df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %I-%p")
df['date_by_hour'] = df['datetime'].apply(lambda x: x.strftime('%Y-%m-%d %H:00'))

This resulted in this data:

  2. Table aka df2:
created_utc,    score,      compound,   datetime,               date_by_hour
1486120391,     156,        0.125,      2017-02-03 12:13:11,    2017-02-03 12:00:00
1486125540,     1863,       0.475,      2017-02-03 13:39:00,    2017-02-03 13:00:00
1486126013,     863,        0.889,      2017-02-03 13:46:53,    2017-02-03 13:00:00
1486130203,     23,         0.295,      2017-02-03 14:56:43,    2017-02-03 14:00:00

Now I need to map the events (2.table) to the Time Series of the 1. Table. If multiple events happened in one hour, i need to make an addition of the scores and calculate the mean average of the compound. In the end i want to have a dataframe like this:

#
  3. Final Dataframe
Date,                   Volume,         score,      compound,
2017-02-03 12-PM,       9787.51,        156,        0.125,
2017-02-03 01-PM,       9792.01,        2726,       0.682,
2017-02-03 02-PM,       9803.94,        23,         0.295,
2017-02-03 03-PM,       9573.99,        0,          0, 

I know my code below does not work and is wrong, but I wanted to show what I was thinking how I could achieve this. I thought I could loop through each row of my events table df2 and compare if the datetime matches. If so, I would calculate score and compound. The issue is that I know that one should not loop through a dataframe and I don't know how to loop through another dataframe at the same time and perform the right calculations based on the previous rows...

for index, row in df2.iterrows():
    memory_score = 0
    memory_compound = 0
    if df1['Date'] == df2['date_by_hour']:
        df1['score'] = row['score'] + memory_score
        df1['compound'] = (row['compound'] + memory_compound) / 2    

How can I get to my Final Dataframe? There must be some pandas magic that I could use to make this work and map the time series data to the right hours.

velvet thorn
#

@lapis sequoia how about a join?
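A sketch of what that could look like: aggregate the events per hour first, then left-join onto the hourly table. It uses the sample rows from the question and assumes `date_by_hour` is already filled in as shown there; the helper column `hour` is introduced here just for the join.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2017-02-03 12-PM", "2017-02-03 01-PM",
             "2017-02-03 02-PM", "2017-02-03 03-PM"],
    "Volume": [9787.51, 9792.01, 9803.94, 9573.99],
})
df2 = pd.DataFrame({
    "score": [156, 1863, 863, 23],
    "compound": [0.125, 0.475, 0.889, 0.295],
    "date_by_hour": ["2017-02-03 12:00:00", "2017-02-03 13:00:00",
                     "2017-02-03 13:00:00", "2017-02-03 14:00:00"],
})

# Put both tables on the same hourly datetime key.
df1["hour"] = pd.to_datetime(df1["Date"], format="%Y-%m-%d %I-%p")
df2["hour"] = pd.to_datetime(df2["date_by_hour"])

# One row per hour: summed score, mean compound.
hourly = df2.groupby("hour").agg(score=("score", "sum"),
                                 compound=("compound", "mean"))

# Left join keeps every hour of df1; hours without events become 0.
final = (df1.merge(hourly.reset_index(), on="hour", how="left")
            .fillna({"score": 0, "compound": 0}))
print(final[["Date", "Volume", "score", "compound"]])
```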

lapis sequoia
#

someone to help me with pandas

#

should be fairly straight forward

eager root
distant moss
high urchin
#

hey every1, im using xlsxwriter to make a report in excel. On that i have a pie chart, and i'm not able to put the legend in the circular instead of being outside, like this image: Idk if its because xlsxwriter uses chart styles from Excel 2007.

raven torrent
#

Hey everyone, is there any way I can turn my deep learning model (regression) in google colab into a coreML model

desert oar
#

@distant moss what paper is that?

#

i dont know the answer btw but thats pretty bad to just not define notation like that

distant moss
#

I would think it's some kind of a known operator in matrix computations or smth

#

or maybe the ξ is the sparse Cholesky factorization they're computing....

tidal bough
#

yeah, I'd think it's the decomposition or something

high urchin
#

Do you guys know how to remove the bold and cell borders when you use pandas to_excel? I'm trying to overwrite it but the first column doesnt change

cerulean flint
#

Does anyone have experience with connecting microsoft forms and answers to python?

gusty oak
#

How do I make line charts and how do I save them as .png and send them into an embed?

tidal bough
#

How do I make line charts
With matplotlib, probably.

how do I save them as .png
plt.savefig

pale thunder
#

you can set a different matplotlib backend if all you want is images

tidal bough
#

yup, matplotlib.use("AGG") if you only want pngs
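Put together, a minimal sketch of a line chart rendered to PNG with the Agg backend. The data is made up; writing to an in-memory buffer is just one option, and `fig.savefig("chart.png")` would write a file instead. Sending the image into an embed depends on the chat library and is not shown.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 30])
ax.set_xlabel("x")
ax.set_ylabel("value")

# Render the figure as PNG bytes.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes")
```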

gusty oak
#

alright

bold bane
#

is web scraping data science?

#

because im about to ask a question

unique wolf
#
def Diff2(old_list, new_list): 
    li_dif = [[i for i in old_list if i not in new_list],[i for i in new_list if i not in old_list]]
    return li_dif

Is there a more efficient way to find differences in 2 lists than my current function? I want to know added and removed items separately^

#

I'll post it in general :/

desert oar
#

@bold bane not in and of itself. but it can be useful in data science projects

#

@unique wolf you could use a set maybe? return list(set(old_list) ^ set(new_list))

tidal bough
#

well, for the exact same output(as sets), use

def Diff2(old_list, new_list):
    s1,s2 = map(set,(old_list,new_list))
    return s1 - s2, s2 - s1
#

symmetric difference s1 ^ s2 is equivalent to (s1-s2) | (s2-s1) (elements that are in one of the sets)

rare ice
#

Any ideas on how I can visualize a JSON object in a dynamic tree in a Jupyter Notebook? Javascript (more specifically React) has visualization components like https://github.com/storybookjs/react-treebeard. Is there something similar for Jupyter?

lapis sequoia
#

hey, quick question: how do i convert AM/PM datetime to 24 hour datetime? Like, I want to convert 2020-05-19 01-PM to 2020-05-19 13:00:00

#

do i use also something like

df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%d %I-%p")
desert oar
lapis sequoia
#

Thanks @desert oar

#

It works!

amber moat
#

Any ideas on how I can visualize a JSON object in a dynamic tree in a Jupyter Notebook?
@rare ice I think it's not possible. But you can load the JSON object into a dictionary with the json module and print it. It'll look like a json tree
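A minimal sketch of the print approach; the sample JSON is made up. `json.dumps` with `indent` gives a static, indented tree:

```python
import json

raw = '{"name": "root", "children": [{"name": "a"}, {"name": "b", "children": []}]}'
obj = json.loads(raw)

# Pretty-print with two-space indentation so the nesting is visible.
pretty = json.dumps(obj, indent=2)
print(pretty)
```

JupyterLab may also render `IPython.display.JSON(obj)` as a collapsible tree, though that depends on the frontend in use.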

desert oar
#

it would be a nice notebook or lab extension though!

#

@velvet thorn i just found a use case for "auxiliary" pandas indexes, a query like this:

all_urls.groupby(['url_source'])['homepage_eval'].value_counts()
#

although like you said you can always set_index first

velvet thorn
#

although like you said you can always set_index first
@desert oar can you elaborate

desert oar
#

@velvet thorn all_urls.set_index('url_source').groupby(level=0)

#

but that has a lot of less-desirable properties e.g. if you want to use the original index inside each group

#

also i currently have a 3-level column index with names, yikes

desert oar
#

is there a jupyter notebook or lab extension that allows you to bookmark a cell?

dusk aspen
#

hi, so I want to be able to input 3 images to python, then decide one of them is the main one. then i want to find which one is the closest to the main one. does anyone know how i would do that?

velvet thorn
#

is there a jupyter notebook or lab extension that allows you to bookmark a cell?
@desert oar not sure about this but you can use HTML

#

like how you would do a table of contents but for one cell

#

hi, so I want to be able to input 3 images to python, then decide one of them is the main one. then i want to find which one is the closest to the main one. does anyone know how i would do that?
@dusk aspen you can try a Siamese network

dusk aspen
#

ok, i can try it

velvet thorn
#

@lapis sequoia you can try asking here too next time

#

anyway, I believe you have that problem because you create a new BeautifulSoup for each page (in your loop)...but you only ever extract stuff after the loop ends?

still otter
#

I have a sort of design question. I have a lot of separate files that are being generated in real time. I will be making scripts to extract some plottable data out of these files. What can I use/do to save this arbitrary plottable data in a central location which can also notify plotters that there is new data to plot?

desert oar
#

@still otter what does "central" mean in this case? where/how are the plots being generated?

chilly pasture
#

I have a 300 mb text file (glove embeddings), what is the fastest way to upload it in colab everytime? my google drive is full so that is out of option.. Does hadoop or spark help for this?

desert oar
#

@chilly pasture google cloud storage?

chilly pasture
#

This is the first time i am hearing about it. thanks

#

i mean i thought it was a general term for google drive

kindred gyro
#

Hewo, can I ask about if I can get the tweets from Tweepy package by year? Or is it just by tags

tall bronze
#

Hi everyone. Not sure if this is the place to ask - How often is Cython used in data science for computationally expensive programs?

torpid gull
#

Hello everyone!!

tall bronze
#

I never heard of it as a total beginner, despite lots of people's dislike for Python's performance. But, as soon as I started my position in computational research, Cython was immediately brought to light.

eager heath
#

Hey guys, if you were to use an algorithm to make a text based on a large dataset, in a Markov chain fashion that'd actually yield (mostly) grammatically correct results and could be tuned based on user inputs (supervised learning?), what algorithm would you choose?

desert oar
#

@tall bronze i use it at work occasionally. sometimes numba can help improve performance too

#

cython is good when you have to process a lot of data in a "production" setting and/or for writing libraries with better performance than pure python alone. it's not necessarily that useful in data science, moreso in ML engineering

lapis sequoia
#

hey guys I posted a data science related question in help-nitrogen, any help would be highly appreciated!

grave frost
#

@eager heath Why not explore Deep Learning models? They are pretty efficient and highly accurate. Personally, I don't think general algos ever provide great performance or accuracy.....

eager heath
#

Because it is way more work, I never did anything deep learning related

#

But I think it would be a good introduction, wouldn't it?

#

Would you have any great resource to get me started please?

spring flame
#

Hi, I know basic python and some external libraries for research. But I am interested in building AIs. I am not aware of how to proceed using python. Can someone suggest online sources where i can begin learning AI using Python?

desert oar
spring flame
#

Thanks a lot

viral scroll
#

Hi All,

Is there any function in pandas to calculate cumulative average/mean just like there is cumsum for cumulative sum

#

?

novel remnant
#

global average or expanding average?

#

I think you're looking for expanding average

viral scroll
#

expanding average

novel remnant
#

it's simple then call series.expanding().mean()

viral scroll
#

like for every month all the rows of previous months should be included in the mean

novel remnant
#

expanding does that
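A quick sketch of the expanding mean next to its `cumsum` equivalent, with made-up values:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Mean of everything seen so far, row by row.
expanding_mean = s.expanding().mean()

# Same thing spelled out: running sum divided by running count.
counts = pd.Series(range(1, len(s) + 1), index=s.index)
by_hand = s.cumsum() / counts

print(expanding_mean.tolist())
```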

viral scroll
#

great thanks....let me try

novel remnant
#

alright cheers

stiff stratus
#

What would you call seaborn, is it a wrapper for matplotlib? If so, please define what a wrapper is.

#

I am searching for a technical term

still otter
#

@desert oar By "central" I mean if a plotter client (backed by matplotlib/bokeh/whatever) wants to get some data to plot, regardless of what data it wants or which script is creating the data, the plotter client will always access the same location to get its data.

For example, I was thinking maybe I can use a single sqlite db to store all this extracted data. If I want to make a new script to extract some additional data, the script would add a new table to the db, which it would then populate. Plotters can then be easily updated to plot the new data, without caring about the scripts that made it. Does this design seem good for this goal? Or is there perhaps a tool/library that is better fit for this purpose?

serene cipher
#

hello! can anyone help me with excel filtering?

desert oar
#

@still otter ok. what are the plotters though? are they all separate programs? how are they accessing the data? is this all happening on a single machine? on a network?

#

why not just keep a bunch of parquet files somewhere and watch for updates?

velvet thorn
#

@still otter so basically you want a central store for data + push notifications?

still otter
#

everything is on a single machine for now, and probably will be for a while

#

@velvet thorn yeah, basically. it's probably a simple thing to solve but i'm unfamiliar with what's available to me

velvet thorn
#

hm.

#

to give good advice on design it would be important to understand (quite a bit) more about the architecture and your needs

#

like so you have another script running

#

that's constantly updating plots?

still otter
#

so, i haven't really decided on the plotter quite yet

#

i have a basic plotter right now which is just a simple python script that makes a matplotlib plot manually from the files i have

#

basically parses the relevant data from the files without storing it anywhere

#

but it's a bit slow

#

and has to re-parse everything if i close and reopen it

#

also, this is kind of poor timing for me, i need to leave for a while 😓

velvet thorn
#

why do you want push notifications then

still otter
#

thanks for the responses though, i'll have to read about parquet

velvet thorn
#

sounds like pull would be a better strategy

still otter
#

maybe, i just like the sound of plots updating immediately when a new file is created

#

anyway, thanks for now, will be back later if anyone is around

lapis sequoia
#

Any tips on how to get rid of the for-loops in the first function? My attempt in the second function gives the wrong result because I'm not accounting for j != i. But I'm not aware of anything in NumPy to account for that.

import numpy as np


def mu_brokaw(mus, mws, xs):
    n = len(mus)
    mu_mix = 0.0

    for i in range(n):
        d = 0.0

        for j in range(n):

            if j != i:
                mij = ((4 * mws[i] * mws[j]) / ((mws[i] + mws[j])**2))**0.25

                num = (mws[i] / mws[j]) - (mws[i] / mws[j])**0.45
                den = 2 * (1 + mws[i] / mws[j]) + ((1 + (mws[i] / mws[j])**0.45) / (1 + mij)) * mij
                aij = mij * ((mws[j] / mws[i])**0.5) * (1 + num / den)

                sij = 1.0
                d = d + sij * aij * xs[j] / np.sqrt(mus[j])

        mu_mix = mu_mix + (xs[i] * np.sqrt(mus[i])) / (xs[i] / np.sqrt(mus[i]) + d)

    return mu_mix


def mu_brokaw2(mus, mws, xs):
    mij = ((4 * np.outer(mws, mws)) / ((np.add.outer(mws, mws))**2))**0.25

    num = np.divide.outer(mws, mws) - np.divide.outer(mws, mws)**0.45
    den = 2 * (1 + np.divide.outer(mws, mws)) + (1 + np.divide.outer(mws, mws)**0.45) / (1 + mij) * mij
    aij = mij * (np.divide.outer(mws, mws)**0.5) * (1 + num / den)

    sij = 1.0
    d = np.sum(sij * aij * xs / np.sqrt(mus))

    mu_mix = np.sum((xs * np.sqrt(mus)) / (xs / np.sqrt(mus) + d))
    return mu_mix


if __name__ == '__main__':
    # dynamic gas viscosity in µP
    mu_h2 = 179.75
    mu_n2 = 363.87

    # molecular weight in g/mol
    mw_h2 = 2.016
    mw_n2 = 28.014

    # mole fraction
    x_h2 = 0.85
    x_n2 = 0.15

    mu_mix = mu_brokaw([mu_h2, mu_n2], [mw_h2, mw_n2], [x_h2, x_n2])
    print(f'mu_mix = {mu_mix:.4f}')

    mu_mix2 = mu_brokaw2([mu_h2, mu_n2], [mw_h2, mw_n2], [x_h2, x_n2])
    print(f'mu_mix2 = {mu_mix2:.4f}')
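One possible way to handle `j != i` without loops, offered as a sketch rather than a verified fix: build the full `aij` matrix, zero its diagonal, and sum along rows with a matrix-vector product. Note that the loop version uses `mws[j] / mws[i]` inside `aij` and sums `d` separately for each `i`, which the second function above does not. `sij = 1.0` is dropped because it multiplies by one.

```python
import numpy as np


def mu_brokaw_vec(mus, mws, xs):
    mus, mws, xs = (np.asarray(a, dtype=float) for a in (mus, mws, xs))

    ratio = np.divide.outer(mws, mws)  # ratio[i, j] = mws[i] / mws[j]
    mij = (4 * np.outer(mws, mws) / np.add.outer(mws, mws)**2)**0.25

    num = ratio - ratio**0.45
    den = 2 * (1 + ratio) + (1 + ratio**0.45) / (1 + mij) * mij
    aij = mij * ratio**-0.5 * (1 + num / den)

    # Drop the j == i terms that the inner loop skips.
    np.fill_diagonal(aij, 0.0)

    # d[i] = sum over j of aij[i, j] * xs[j] / sqrt(mus[j])
    d = aij @ (xs / np.sqrt(mus))

    return np.sum(xs * np.sqrt(mus) / (xs / np.sqrt(mus) + d))


mu_mix_vec = mu_brokaw_vec([179.75, 363.87], [2.016, 28.014], [0.85, 0.15])
print(f'mu_mix_vec = {mu_mix_vec:.4f}')
```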
desert oar
#

honestly this seems like it's really overengineered @still otter

#

whatever the plotter process is, i would just poll the directory every 30 seconds for new files

#

you can get fancier by sending notifications over a socket or something but i'd start with something really simple like polling for new files

#

i think inotify can watch for directories too

#

yep

#
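import os

import inotify.adapters

from my_library import plot_from_file

def main():
    i = inotify.adapters.Inotify()
    i.add_watch('./plot-data')

    for _, type_names, path, filename in i.event_gen(yield_nones=False):
        # only react once a file has been fully written and closed
        if 'IN_CLOSE_WRITE' in type_names:
            # filename is relative to the watched directory
            plot_from_file(os.path.join(path, filename))

if __name__ == '__main__':
    main()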
velvet thorn
#

^ agree about the overengineered part

lapis sequoia
fast bluff
#

Ok I'm big noob but I need some help. I'm working on a project to monitor market data and I'm running into a stupid problem. I already know what's causing it, just not sure how to get around it

#
ts = ForeignExchange(key='secret',output_format='pandas')
avdf = ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact')
#This returns a tuple which is the root of all my issues (I think)
avdf.drop(['open','high','low'])
#Returns "tuple object has no attribute drop"
#So I tried converting it into a list a few ways
df = list(avdf)
#This worked but I'm still having the same issue
df.drop(['open','high','low'])
#Returns "list object has no attribute drop"
#So I thought maybe it was because I had to directly link it to pandas. So I tried this
df = pd.DataFrame(ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact'), columns=avdf.columns, index=avdf.index)
#But still no luck.. It returns "tuple has no object columns"
#Getting annoyed and it's probably a super easy fix so if anyone could help me out, that would be greatly appreciated :D
#

I'm on python 3.8 using pandas 1.0.5 and alpha_vantage 2.2.0

#

Im gonna head to bed. If anyone has the time to respond please @ me in the message

#

Added " to the pd line

#

Now I get this traceback

arctic wedgeBOT
#

Hey @fast bluff!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

fast bluff
#

Would rather not use that so hopefully this works

lapis sequoia
#

Hey guys! Anyone who works in analytics? Maybe you can help me out here: https://stackoverflow.com/questions/63513587/best-way-to-get-selling-windows-for-each-product-category-in-pandas I'm stuck with this for 12 days

grizzled saffron
#

Hi everyone, I'm trying to get a count column that get total 'Unit Sold' by tags from 'tags' in this df:

#

I want to make new two columns one for tag names and second for how much unit sold
For example:

#

How can I do that? can I split tags by ","?

molten hamlet
#

split tags?

#

text in tag?

grizzled saffron
#

yes I want to split the tags.. and count the unit sold for each tag

dense wharf
#

Hi everyone.

I'm new to python and I've just started to follow a youtube channel by Corey Schafer, trying to learn Pandas. I use Pycharm Community.

I'm kind of looking for a dataframe interface that resembles the Jupyter Notebook or if it's possible, any kind of an external window like the one on matplotlib. Not really liking the one that's there on Pycharm.

Is there any way I can do that?

Thanks!

velvet thorn
#

Hi everyone.

I'm new to python and I've just started to follow a youtube channel by Corey Schafer, trying to learn Pandas. I use Pycharm Community.

I'm kind of looking for a dataframe interface that resembles the Jupyter Notebook or if it's possible, any kind of an external window like the one on matplotlib. Not really liking the one that's there on Pycharm.

Is there any way I can do that?

Thanks!
@dense wharf PyCharm has Jupyter integration, but only for the Professional version

#

if you're a student you can get it for free though

#

yes I want to split the tags.. and count the unit sold for each tag
@grizzled saffron df['tags'].str.get_dummies(','), then groupby columns

dense wharf
#

Thankyou! I'll look into it

grizzled saffron
#

@velvet thorn thanks for the reply.. I am getting an error:

velvet thorn
#

uh.

#

do you know what that does?

#

okay

#

just run df['tags'].str.get_dummies(',') by itself

#

and you should understand what you're doing wrong

grizzled saffron
#

it made every values=0

#

and..

#

it made new columns by the tag names

#

I want to make one column named 'tag' then get this dummies to the row of 'tag'

#

then count the values of each tags in all rows

velvet thorn
#

yes, that's why I said

#

groupby column

#

after that

grizzled saffron
#

@velvet thorn Im sorry Im pretty new with pandas.. can you write here an example for the code..
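A rough sketch of the idea, with made-up rows; only the `tags` and `Unit Sold` column names come from the question. `get_dummies` gives one 0/1 column per tag, multiplying by `Unit Sold` turns each 1 into that row's units, and summing the columns gives the total per tag:

```python
import pandas as pd

df = pd.DataFrame({
    "tags": ["shooter,indie", "indie,puzzle", "shooter"],
    "Unit Sold": [10, 5, 2],
})

# One indicator column per tag (1 if the row has that tag, else 0).
dummies = df["tags"].str.get_dummies(",")

# Weight each indicator by the row's units, then total per tag.
per_tag = (dummies.mul(df["Unit Sold"], axis=0)
                  .sum()
                  .rename_axis("tag")
                  .reset_index(name="Unit Sold"))
print(per_tag)
```

If the real tags have spaces after the commas, they would need stripping first, otherwise `" indie"` and `"indie"` end up as separate tags.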

wild pine
#

hey guys. I'm trying to write an implementation of the NEAT algorithm, and there's something i don't quite understand:
between speciation, killing off the weakest genomes and repopulation, what happens to the existing species? i know that none of the genomes from the previous generation survives, but do the species?
i mean, do i just wipe all the species and respeciate each new generation from scratch, or do i somehow keep a representative genome from each and let them live until they have been underperforming for too long?

flat quest
#

i haven't personally dived into genetic algorithms so can't really say that much here. But surely ur keeping the best genomes from the previous generation?

@wild pine

wild pine
#

unless you're implementing some sort of elitism, where you let a couple of the best performing genomes survive unchanged, i don't think that's usually the case.
i mean if you think about it, evolution is about finding out who gets to reproduce, not finding out who gets to live forever.

#

generally the idea is that everyone dies, but only the best performing organisms pass on their genes to the next generation

wintry sapphire
#

Hi all

#

how do I print out this

#

from my dataframe

#

currently this is my dataframe

#

I want it to print out

#

On 2019-03-29, Option A is XXX, Option B is XXX, Option C is XXX

#

@velvet thorn any suggestions? 🙂
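
A minimal sketch of one way to build that sentence. The dataframe itself isn't visible here, so the layout (one row per date, one column per option) and the values are assumptions:

```python
import pandas as pd

# Hypothetical frame: the real one was not shown, so the layout and the
# numbers below are assumptions made for illustration.
df = pd.DataFrame(
    {"Option A": [1.5, 2.0], "Option B": [3.0, 4.5], "Option C": [6.0, 7.5]},
    index=pd.to_datetime(["2019-03-29", "2019-03-30"]),
)


def describe_row(date, row):
    """Build 'On <date>, <column> is <value>, ...' for one row."""
    parts = ", ".join(f"{name} is {value}" for name, value in row.items())
    return f"On {date:%Y-%m-%d}, {parts}"


# iterrows() yields (index label, row as a Series) pairs.
lines = [describe_row(date, row) for date, row in df.iterrows()]
for line in lines:
    print(line)
```

If the dates live in an ordinary column rather than the index, `df.set_index('date')` (column name assumed) would get to the same shape first.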

supple minnow
#

does anybody have any experience with the DEAP library? I wanna know how u can set a specific chromosome in an already defined population for a genetic algorithm?

runic stream
#

Hey all! So I was thinking about making a project related to AI. Anybody want to collaborate? I'm a final-year undergrad student and a project would really help me get hands-on experience with the topics I learnt. Thoughts?

steel roost
#

where can i get data? Like is there a site or something that i can use?

grizzled saffron
#

@steel roost Kaggle, Data.world

fast bluff
#

Could someone please peep my message from last night if you have the time

#

to sum up

#
```python
avdf = ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact')
# This works fine but returns a tuple which I can't work with.
df = pd.DataFrame(ts.get_currency_exchange_intraday(from_symbol='USD',to_symbol='EUR',interval='1min',outputsize='compact'), columns="avdf.columns", index="avdf.index")
# This returns an error involving my arguments w/ index (full traceback posted w/ pastebin link above)
df = list(avdf)
# Tried this
df.drop(['open','high','low'])
# "list has no attribute drop"
```
See earlier post for further explanation
#

I tried a bunch of stuff and I think I'm on the right track with the pd.DataFrame but I think I'm having a problem passing the columns and index
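
A minimal sketch of one likely fix, assuming `ts` was created with `output_format='pandas'` so the call returns a `(DataFrame, metadata)` tuple. `fake_intraday` below is a hypothetical stand-in for the real API call so the example runs offline, and its column names are assumed; check `avdf.columns` on the real result.

```python
import pandas as pd


def fake_intraday():
    """Hypothetical stand-in for ts.get_currency_exchange_intraday(...)."""
    data = pd.DataFrame(
        {
            "1. open": [0.89, 0.90],
            "2. high": [0.91, 0.92],
            "3. low": [0.88, 0.89],
            "4. close": [0.90, 0.91],
        },
        index=pd.to_datetime(["2019-03-29 10:00", "2019-03-29 10:01"]),
    )
    meta = {"note": "metadata about the request"}
    return data, meta


# Unpack the tuple instead of wrapping it in pd.DataFrame(...) or list(...).
avdf, meta = fake_intraday()

# Drop by column label; columns= is needed because drop() targets rows by default.
close_only = avdf.drop(columns=["1. open", "2. high", "3. low"])
print(close_only)
```

Separately, `columns="avdf.columns"` in quotes passes the literal string, not the attribute, and `list(avdf)` produces a plain list, which is why `.drop` is missing.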

flat quest
#

well yes @wild pine the one that reproduces is the one that continues. But with most ML problems, u want the best producing rather than the one that happens to survive the best.

Ah so took a brief look at NEAT. So it's an evolution of the neural architecture itself. Based on what I can tell it looks like they're making mutations and then forming species based on a certain threshold difference. It looks like organisms are only eliminated based on their performance compared to individuals within their same species.

So by that logic, no species should completely die out. However, it might be beneficial to at some point remove the species completely if their performance is too terrible.

oblique vine
#

https://www.coursera.org/learn/machine-learning/
Is this course still good to learn ML? I mean, it is 9 years old, lots of things have changed

Coursera

Learn Machine Learning from Stanford University. Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, ...

fast bluff
#

Complete guess, but I assume the fundamentals are still about the same so it couldn't hurt

steel roost
#

is there a way to speed up pandas readers?

#

i have a data file that's extremely huge, but it appears to freeze when just trying to print the dataframe

#
```python
import pandas as pd
import numpy
data_file = '/home/doomedapple7565/Downloads/Parking_Violations_Issued_-_Fiscal_Year_2017.csv'

# want sheet 1 to be new york
# want sheet 2 to be new jersey
# and i want a count of the number of tickets for each license plate
# and i want the first and last ticket of each license plate

df = pd.read_csv(data_file)
print(df[0])
```
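
A minimal sketch of a few things that usually help with a very large CSV, assuming only a handful of columns are needed: load just those columns with `usecols`, read in pieces with `chunksize`, and look at `head()` instead of printing everything. Note that `df[0]` looks up a column labelled `0`, which raises a `KeyError` when the columns have names. The column names and the tiny in-memory CSV are stand-ins for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real file; the column names are assumed.
csv_text = """Plate ID,Registration State,Issue Date,Violation Code
ABC123,NY,2016-07-01,21
XYZ789,NJ,2016-07-02,38
ABC123,NY,2016-08-15,14
"""

wanted = ["Plate ID", "Registration State", "Issue Date"]
partials = []

# chunksize makes read_csv return an iterator of smaller DataFrames,
# so the whole file never has to sit in memory at once.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    parse_dates=["Issue Date"],
    chunksize=2,
)
for chunk in reader:
    partials.append(
        chunk.groupby("Plate ID").agg(
            tickets=("Issue Date", "count"),
            first_ticket=("Issue Date", "min"),
            last_ticket=("Issue Date", "max"),
        )
    )

# Combine the per-chunk partial results into one row per plate.
summary = pd.concat(partials).groupby(level=0).agg(
    tickets=("tickets", "sum"),
    first_ticket=("first_ticket", "min"),
    last_ticket=("last_ticket", "max"),
)
print(summary.head())
```

For a quick look at the real file, `pd.read_csv(data_file, nrows=1000)` reads only the first thousand rows.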
wild pine
#

@flat quest so basically each generation will consist of a mix of survivors from the previous generation, along with their offspring?
tbh that also makes more sense to me and was my first intuition, until i read this response on a related question on stackexchange:
The neural networks with the worst performance are killed off after speciation. None of the neural networks survive - the entire population is replaced with the offspring of the nets remaining after the culling stage.
i suppose there're several different approaches.

I guess i'll try to eliminate half of each species and let them reproduce until the population limit is reached, keeping the best performing genomes from each generation (possibly mutating them slightly). I can always just rewrite it if it turns out to be a disaster.
thanks a lot for your input! I've been working really hard on the rest of the code, and it's been so frustrating to be stuck on such a minor detail. at least i feel like i can get back to coding now.
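
The plan above (cull the weaker half of each species, keep each species' best genome, refill with offspring) could be sketched roughly like this. It only illustrates that plan and is not the reference NEAT procedure: genomes are plain lists of floats, `crossover` and `mutate` are simplistic hypothetical placeholders, and offspring slots are split evenly between species rather than by adjusted fitness.

```python
import random


def crossover(a, b):
    # Placeholder: take each gene from one parent or the other.
    return [random.choice(pair) for pair in zip(a, b)]


def mutate(genome, rate=0.1):
    # Placeholder: add small Gaussian noise to some of the genes.
    return [g + random.gauss(0, 0.1) if random.random() < rate else g for g in genome]


def next_generation(species, fitness, pop_size):
    """species: list of lists of genomes; fitness: genome -> float."""
    quota = pop_size // len(species)  # even split, an assumption of this sketch
    new_species = []
    for members in species:
        ranked = sorted(members, key=fitness, reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]  # cull the weaker half
        children = [survivors[0]]  # elitism: the champion is carried over unchanged
        while len(children) < quota:
            a, b = random.choice(survivors), random.choice(survivors)
            children.append(mutate(crossover(a, b)))
        new_species.append(children)
    return new_species


random.seed(0)
species = [[[random.random() for _ in range(3)] for _ in range(6)] for _ in range(2)]
new = next_generation(species, fitness=sum, pop_size=10)
print([len(s) for s in new])
```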

fast bluff
#

Can someone please help me ;-;