#data-science-and-ml | Python | Page 252

crude karma Sep 14, 2020, 11:41 PM

#

Im trying to apply for some data science programs but I only have some qualitative analysis from my work as an analyst

#

feel free to ping me if someone answers 🙂

serene anvil Sep 15, 2020, 1:14 AM

#

Anyone have a good resource for getting started with data streaming?

winter portal Sep 15, 2020, 3:20 AM

#

hey guys

#

I have a tags system

#

so basically user inputs some text from discord

#

and it gets stored in a sqlite3 database

#

and when the stored text is called with the respective keyword,

#

the input text is displayed

#

(By a discord bot of course)

#

something like this

📎 unknown.png

#

but over here, the lines arent getting recognized

#

and the text string is stored without the specified lines

#

I dont want that to happen

#

how to fix it?

lapis sequoia Sep 15, 2020, 4:04 AM

#

I'm working with a pretty large data set, 500 rows and maybe 10 columns. I am running for loops to create even more permutations, and all these permutations get saved as new CSVs. The problem is that my computer crashes after saving 2-3 CSVs. What can I do to stop this from happening?

austere swift Sep 15, 2020, 4:44 AM

#

do you delete the dataframe from memory after saving it as a csv?

#

because if you just keep saving more and more dataframes to memory it'll eventually fill up and crash
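
A minimal sketch of that advice, assuming each permutation can be built, written to disk, and dropped inside the loop. The `build_permutation` helper, the column names, and the output file names are hypothetical stand-ins, not the asker's actual code.

```python
import gc
import os
import tempfile

import numpy as np
import pandas as pd


def build_permutation(base, seed):
    """Hypothetical stand-in for the real permutation step: shuffle the rows."""
    return base.sample(frac=1.0, random_state=seed).reset_index(drop=True)


base = pd.DataFrame(np.arange(50).reshape(10, 5), columns=list("abcde"))
out_dir = tempfile.mkdtemp()
written = []

for seed in range(3):
    perm = build_permutation(base, seed)
    path = os.path.join(out_dir, f"perm_{seed}.csv")
    perm.to_csv(path, index=False)
    written.append(path)  # keep only the path, not the DataFrame itself
    del perm              # drop the reference so the memory can be reclaimed
    gc.collect()          # optional; usually unnecessary, but harmless

print(written)
```

The point is that nothing in the loop accumulates DataFrames in a list or dict; only the file paths survive each iteration.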

red rose Sep 15, 2020, 5:01 AM

#

Guys how can I make videos like this to practice python? https://www.youtube.com/watch?v=4-2nqd6-ZXg
Is it complex? (Basically videos about ranking over time of some sort of data) Not sure where to start

fast bluff Sep 15, 2020, 5:05 AM

#

def SortData():
    global df
    cols = df.columns.tolist()                          #Convert columns to list
    cols
    colsorder = [2,1,4,3]                        #Move last column to first position
    df = df[cols[i] for i in colsorder]

#

Last line returning a syntax error, any idea why?

velvet thorn Sep 15, 2020, 5:06 AM

#

add one more set of brackets

tall barn Sep 15, 2020, 5:07 AM

#

I take it this is the appropriate channel for asking questions that involve pandas?

#

I'm having issues getting rid of NaN values in one of my dataframe's columns.

#

I'm running

avg_age = clean_df['Age'].mean()
clean_df['Age'].fillna(avg_age, inplace=True)

#

I have verified that the first line correctly takes the average value of values in the Age column

#

I have also verified that for some reason, the number of null values in the column does not change after running these lines

#

I first tried using .replace for this, with the same failure

#

All help I've found googling around links to people that didn't know about using inplace=True, which definitely isn't the problem here

#

Has anyone else experienced this sort of issue? Any fixes?

#

Ok, ok, nevermind. As usual, finally asking for help revealed the issue to me.

#

...I was running my after-check of .isnull().sum() on the wrong dataframe

#

20 minutes of confusion and finally deciding to join this server only for this to happen to me

velvet thorn Sep 15, 2020, 5:41 AM

#

@tall barn that is rough.

#

one thing I like to do is never perform inplace modifications

tall barn Sep 15, 2020, 5:41 AM

#

So you just assign back to itself?

velvet thorn Sep 15, 2020, 5:41 AM

#

no

#

for example...

#

raw_df = pd.read_csv(...)
cleaned_df = raw_df.fillna(...).replace(...).map(...)

tall barn Sep 15, 2020, 5:42 AM

#

Ah new dataframe every time?

velvet thorn Sep 15, 2020, 5:42 AM

#

then let's say I want to perform some other transformation on cleaned_df:

some_other_df = cleaned_df.melt(...)

#

for example.

#

yup

tall barn Sep 15, 2020, 5:42 AM

#

I can see the value in that

velvet thorn Sep 15, 2020, 5:42 AM

#

are you using Jupyter?

tall barn Sep 15, 2020, 5:43 AM

#

Wouldn’t that fill up the memory?

#

Yes

velvet thorn Sep 15, 2020, 5:43 AM

#

worry about memory when it becomes a problem

#

premature optimisation is the root of all evil

tall barn Sep 15, 2020, 5:43 AM

#

Fair enough

velvet thorn Sep 15, 2020, 5:43 AM

#

the really nice thing about doing this is that you can go back to earlier cells

#

and run them

#

and they will just work.

#

because you never modify DataFrames
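
A small sketch of the non-mutating style described here, applied to the earlier `Age` example. The frame contents are made up for illustration, and `assign` is just one way to build the new frame.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing ages.
raw_df = pd.DataFrame(
    {"Age": [22.0, np.nan, 35.0, np.nan], "Sex": ["m", "f", "f", "m"]}
)

avg_age = raw_df["Age"].mean()

# Build a new frame instead of modifying raw_df in place.
cleaned_df = raw_df.assign(Age=raw_df["Age"].fillna(avg_age))

# Check the frame that was actually produced; raw_df is left untouched.
print(cleaned_df["Age"].isnull().sum())
print(raw_df["Age"].isnull().sum())
```

Because `raw_df` never changes, any earlier notebook cell that reads it keeps giving the same result when it is re-run.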

tall barn Sep 15, 2020, 5:43 AM

#

Oh God yes

velvet thorn Sep 15, 2020, 5:43 AM

#

whereas say you create a DataFrame in cell 8, perform a read-only aggregation in cell 9, and modify it in cell 10

#

if around cell 15 you want to change something in cell 9 just as an experiment, it might not work

#

because you've modified the source already.

#

naming can be a pain

#

but there are also ways to handle that

tall barn Sep 15, 2020, 5:44 AM

#

I’m afraid that really wouldn’t fly with the instructors at my job though. Having a dozen dataframes isn’t a nice surprise to people who expect to just look at one

velvet thorn Sep 15, 2020, 5:45 AM

#

yup, that's fine

#

I mean, you don't have to expose the intermediate DataFrames

#

but of course if you have other requirements then those should take precedence

tall barn Sep 15, 2020, 5:45 AM

#

I’ll keep in mind what you’ve said once my code is no longer being inspected for grading purposes

#

Which should be in only a week or two

velvet thorn Sep 15, 2020, 5:46 AM

#

sure!

#

I mean, what I have said is by no means the way to do things

#

and in fact it is a little foreign to a fair number of data scientists, I think?

#

because it's a bit more of a computer science theory-based approach

tall barn Sep 15, 2020, 5:46 AM

#

Sure, but I strongly identify with what you said about running old cells

velvet thorn Sep 15, 2020, 5:46 AM

#

(disclaimer: I have neither a formal DS nor a formal CS background)

#

but yeah if you're interested in this kind of thing you can look up functional programming; a lot of my DS code structure is inspired by those ideas

#

I think they generally lead to greater levels of correctness and readability once you get used to the idioms, since pandas is actually quite a functional library

tall barn Sep 15, 2020, 5:48 AM

#

I don’t have a formal background in this, but I’m getting it now as pre-assignment job training. I got hired as a software developer and shoehorned into data scientist somehow...

#

I’m starting to get an appreciation for pandas for sure, the best documentation of any code I’ve ever read

#

To the point I’m mostly reading documentation to fulfill assignments, rather than following the shitty lessons

velvet thorn Sep 15, 2020, 5:50 AM

#

yup, it does have p good documentation.

#

I mean

#

I do understand that career path

#

I've bounced all around ML engineer/DS/SWE in the span of a year or so

tall barn Sep 15, 2020, 5:51 AM

#

It’s not that I don’t understand it, it’s just that I historically have hated statistics haha

#

I like the logic puzzles of programming, not so much data crunching. I’m adapting though. Statistics is less bad with code than it was in a generic stat class in college

upbeat cradle Sep 15, 2020, 10:29 AM

#

Hey, I’m looking to find instances in pandas where a person has used the same details out of 2 fields I.e: a phrase and an email address, what’s the best way to find instances where people have changed either of those details and provide a count of the presence of either?

brittle agate Sep 15, 2020, 10:31 AM

#

How can I implement 10-fold cross-validation in this code?
(train_ds, val_ds, test_ds), metadata = tfds.load( 'tf_flowers', split=['train[:60%]', 'train[60%:90%]', 'train[90%:]'], with_info=True, as_supervised=True )

lapis sequoia Sep 15, 2020, 12:54 PM

#

this isn't python is it

pale thunder Sep 15, 2020, 12:55 PM

#

that is python

lapis sequoia Sep 15, 2020, 12:56 PM

#

oh cool.. I haven't used tf this way before

modest rune Sep 15, 2020, 1:22 PM

#

I am trying to use numba to speed up my code, but I keep running into problems because numba doesn't support functions my code uses.

#

So... I am trying to find different ways to accomplish the same tasks... ways that numba supports.

#

I was doing this:

# y: vector of size 200
# x: vector of size 4
y - x.reshape((len(day),1))

#

numba has odd problems with numpy.reshape(), so I am trying to get rid of it

#

And I can't help but think, there is a simpler way to do this type of subtraction in numpy.

#

something like y - x.T, but that doesn't work

#

Calling x.T changes the vector of size (2,) to a vector of size (2,) 🙂

brittle agate Sep 15, 2020, 1:26 PM

#

this isn't python is it
@lapis sequoia
Oof...

odd yoke Sep 15, 2020, 1:26 PM

#

it is python

brittle agate Sep 15, 2020, 1:27 PM

#

https://tenor.com/view/doom-slayer-carlton-dance-moves-grooves-gif-16257196


lapis sequoia Sep 15, 2020, 1:27 PM

#

lol I'm just used to using sklearn for cv, etc

#

honestly it seems way too complicated in TF

brittle agate Sep 15, 2020, 1:27 PM

#

No problem man.

odd yoke Sep 15, 2020, 1:27 PM

#

@modest rune do you know which part is causing the crash exactly, the reshape or the len?

raw rapids Sep 15, 2020, 1:28 PM

#

How can I implement 10-fold cross-validation in this code?
(train_ds, val_ds, test_ds), metadata = tfds.load(
    'tf_flowers',
    split=['train[:60%]', 'train[60%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)
just seems to be a tuple

odd yoke Sep 15, 2020, 1:28 PM

#

either way, can you try .reshape(-1, 1) or use expand_dims
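
For reference, a sketch of the three equivalent ways to turn the short vector into a column so that NumPy broadcasting produces a 2D result. The array values are made up; only the shapes (200 and 4) come from the question.

```python
import numpy as np

y = np.arange(200, dtype=np.float64)     # shape (200,)
x = np.array([10.0, 20.0, 30.0, 40.0])   # shape (4,)

# Each of these turns x into shape (4, 1); broadcasting then yields (4, 200).
a = y - x.reshape(-1, 1)
b = y - np.expand_dims(x, 1)
c = y - x[:, None]

print(a.shape)
```

`x.T` does nothing here because transposing a 1D array returns the same 1D shape.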

modest rune Sep 15, 2020, 1:28 PM

#

numba supports reshape(), but only if the array is contiguously stored in memory.

odd yoke Sep 15, 2020, 1:28 PM

#

which it should be given your example

modest rune Sep 15, 2020, 1:29 PM

#

It is throwing an error saying it is not.

#

I have to call x.copy() to fix it.

loud veldt Sep 15, 2020, 1:29 PM

#

Morning everyone!

I am having trouble putting these two separate mathematical formulas into one print statement. I've hacked my way to the solution, but I'd like to know how I can combine the two into one single print statement.

import math

monthly_payment = 8721.8
periods = 120
interest = 5.6 / 100
i = interest / 12 * 1

print(i * math.pow(1 + i, periods)) # 0.008159169813759389
print(math.pow(1 + i, periods) - 1) # 0.7483935315198691

print(int(monthly_payment / (0.008159169813759389/0.7483935315198691))) # 800000
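
One possible way to fold the two intermediate values into a single expression, as a sketch. The names `growth`, `annuity_factor`, and `principal` are introduced here for readability and are not from the original message.

```python
import math

monthly_payment = 8721.8
periods = 120
interest = 5.6 / 100
i = interest / 12  # monthly rate

# Compute (1 + i) ** periods once and reuse it in both places.
growth = math.pow(1 + i, periods)
annuity_factor = (i * growth) / (growth - 1)
principal = monthly_payment / annuity_factor

print(int(principal))

# The same thing as a single print statement:
print(int(monthly_payment / ((i * math.pow(1 + i, periods)) / (math.pow(1 + i, periods) - 1))))
```

Keeping the named intermediates is usually easier to read than the one-liner, and it avoids pasting printed floats back into the code.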

modest rune Sep 15, 2020, 1:30 PM

#

My guess is that x is a view to a bigger array that I sliced up earlier on in my code.

#

And hence, it is not contiguous.

#

I think copying the array will more than offset any speedups I get from using numba

#

I am going to try expand_dims, hopefully numba supports that
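
A sketch of how to check the "view of a bigger array" guess, assuming `x` came from a column slice. The array `big` is hypothetical; `flags` and `np.ascontiguousarray` are standard NumPy.

```python
import numpy as np

big = np.arange(12, dtype=np.float64).reshape(3, 4)

col = big[:, 1]   # column slice: a strided view into big
row = big[1]      # row slice: still contiguous in memory

print(col.flags["C_CONTIGUOUS"], row.flags["C_CONTIGUOUS"])

# ascontiguousarray copies only when the input is not already contiguous.
col_c = np.ascontiguousarray(col)
row_c = np.ascontiguousarray(row)

print(np.shares_memory(col_c, big), np.shares_memory(row_c, big))
```

For a non-contiguous input this costs the same as `x.copy()`, so it does not avoid the copy; it only skips it when the array is already contiguous.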

brittle agate Sep 15, 2020, 1:32 PM

#

honestly it seems way too complicated in TF
@lapis sequoia
Why? If I'm downloading from cloud collection and break dataset into train, val and test input?

lapis sequoia Sep 15, 2020, 1:33 PM

#

not that part.. I mean actually doing cross validation on this

odd yoke Sep 15, 2020, 1:33 PM

#

yeah that's probably the reason why, i just tried it on my end and it works with reshape, the views are probably causing issues

brittle agate Sep 15, 2020, 1:36 PM

#

not that part.. I mean actually doing cross validation on this
@lapis sequoia
U mean this case?

#

And yeah.

#

I resolved this problem.

#

(train_ds, test_ds), metadata = tfds.load(
    'tf_flowers',
    split=['train[:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)

val_ds = train_ds.split = [
f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
]

#

One line of code. Not bad))
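
For comparison, a sketch of how 10-fold split strings can be generated with the TFDS slicing syntax, pairing each validation slice with the complementary training slice. The helper name `kfold_split_strings` is hypothetical, it assumes the number of folds divides 100, and it uses the whole `train` split rather than holding out the last 10% as a test set.

```python
def kfold_split_strings(name="train", folds=10):
    """Return (train_split, val_split) string pairs, one pair per fold."""
    step = 100 // folds  # assumes folds divides 100 evenly
    pairs = []
    for k in range(0, 100, step):
        val = f"{name}[{k}%:{k + step}%]"
        train = f"{name}[:{k}%]+{name}[{k + step}%:]"
        pairs.append((train, val))
    return pairs


splits = kfold_split_strings()
print(splits[0])

# Assumed usage, not run here because it downloads the dataset:
# import tensorflow_datasets as tfds
# for train_split, val_split in splits:
#     train_ds, val_ds = tfds.load(
#         'tf_flowers', split=[train_split, val_split], as_supervised=True
#     )
```

Each fold then trains on 90% of the data and validates on the remaining 10%, which is what cross-validation needs; a list of validation slices alone does not provide the matching training sets.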

modest rune Sep 15, 2020, 1:44 PM

#

@odd yoke expand_dims worked. Now I need to work through the other 3 errors numba is throwing. Thanks for helping me get over that hump!

brittle agate Sep 15, 2020, 1:46 PM

#

just seems to be a tuple
@raw rapids
What?

modest rune Sep 15, 2020, 2:25 PM

#

Any insight into what this means for a non-computer science guy?
Resolution failure for non-literal arguments
It is an error I am receiving when numba tries to compile this line of code:

# x: 2D array, variable names were changed to protect their identity
temp = x.argmin(1)

Full error from numba

Compilation is falling back to object mode WITH looplifting enabled because Function "j_nearest_strike" failed type inference due to: - Resolution failure for literal arguments:
AssertionError()
- Resolution failure for non-literal arguments:
AssertionError()

During: resolving callee type: BoundFunction(array.argmin for array(float64, 2d, C))
During: typing of call at c:/.../analysis2.py (159)


File "analysis2.py", line 159:
def j_nearest_strike(strike, x):
    <source elided>
    abs_minus = np.abs(minus)
    temp = abs_minus.argmin(1)
    ^

  @jit('int64[:](float64[:], float64[:])')

odd yoke Sep 15, 2020, 2:29 PM

#

argmin doesn't support additional arguments iirc

#

it's saying it didn't type check

modest rune Sep 15, 2020, 2:29 PM

#

The actual code... sorry for the ugly variable names and unnecessary assignments, I am just trying to get it to work.

@jit('int64[:](float64[:], float64[:])')
def j_nearest_strike(strike, x):
    #strike = strike.copy()
    #length = strike.size
    strike_shaped = np.expand_dims(strike, 1)
    minus = x - strike_shaped
    abs_minus = np.abs(minus)
    temp = abs_minus.argmin(1)
    return temp

#

argmin doesn't support additional arguments iirc
@odd yoke
I guess I am dense. Can you rephrase that? I am not following.

#

Without numba, that line works perfectly.

odd yoke Sep 15, 2020, 2:30 PM

#

you can only do arr.argmin()

modest rune Sep 15, 2020, 2:30 PM

#

oh

odd yoke Sep 15, 2020, 2:30 PM

#

numba doesn't support all of numpy

modest rune Sep 15, 2020, 2:31 PM

#

well, crap

odd yoke Sep 15, 2020, 2:31 PM

#

most axis parameters, like this one, aren't available

modest rune Sep 15, 2020, 2:31 PM

#

abs_minus is a 2D array for vectorization purposes

#

I guess, if I am using numba, maybe I don't need to vectorize as much?

odd yoke Sep 15, 2020, 2:32 PM

#

use loops

#

yeah

modest rune Sep 15, 2020, 2:32 PM

#

Any idea how I can get around this problem while still taking the vectorized approach?

odd yoke Sep 15, 2020, 2:32 PM

#

numba optimizes loops

modest rune Sep 15, 2020, 2:32 PM

#

ok

#

I guess, I should be happy... but I am just annoyed because it took me a while to get the vectorization right in many parts of the code I want to use numba on. So... I have to rip all of that out.

#

A friend of mine suggested using pypy instead of numba and laughed at me. 🙂

odd yoke Sep 15, 2020, 2:35 PM

#

sounds like your friend needs to look up what numba does

modest rune Sep 15, 2020, 2:35 PM

#

All I can say is... When I am done destroying my vectorization and getting rid of my class, the numba code better be much faster!

odd yoke Sep 15, 2020, 2:35 PM

#

also that's the purpose of numba, unlike numpy, it wants you to write "normal" python code

modest rune Sep 15, 2020, 2:36 PM

#

The friend claims that pypy will get me similar performance gains, the only problem will be that installing scipy, pandas, and numpy will be a PIA.

odd yoke Sep 15, 2020, 2:36 PM

#

import numba as nb
import numpy as np


arr = np.random.randint(10, size=(100,50))

@nb.jit(nopython=True, parallel=True)
def argmin(arr):
  ret = np.empty((arr.shape[0],))
  for i in nb.prange(arr.shape[0]):
    ret[i] = np.argmin(arr[i])
  return ret

print(argmin(arr))
print(arr.argmin(1))

works

#

just gotta like, specify the dtype

#

i'm just too lazy
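
A sketch of the same row-wise argmin with the output dtype specified, so the result is integer like `arr.argmin(1)`. The `try/except` fallback is an addition for illustration only: it lets the snippet run as plain Python when numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs without numba (no speedup in that case).
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@njit(parallel=True)
def argmin_rows(arr):
    # Integer output buffer, one slot per row.
    out = np.empty(arr.shape[0], dtype=np.int64)
    for i in prange(arr.shape[0]):
        out[i] = np.argmin(arr[i])
    return out


arr = np.random.randint(10, size=(100, 50))
result = argmin_rows(arr)
print(np.array_equal(result, arr.argmin(1)))
```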

modest rune Sep 15, 2020, 2:37 PM

#

thanks! you didn't have to do that.

pale thunder Sep 15, 2020, 2:38 PM

#

Pypy is generally a lot slower than nopython numba since it uses actual python objects

odd yoke Sep 15, 2020, 2:39 PM

#

it's not even the same purpose really

#

numba is for optimizing numerical programs

frank marsh Sep 15, 2020, 2:53 PM

#

Hey all, I need some help. I just started learning Python and I am learning about DataFrames. I have a sample of 29405 rows and 1 column, and when I use this code
df= pd.Dataframe (df)
df1.insert(2, "Consumer", [1:29405])

But it's showing SyntaxError: invalid syntax
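
A possible fix, as a sketch: `[1:29405]` is slice syntax, which is only valid inside indexing brackets and not as a list literal, hence the SyntaxError. The class is also spelled `pd.DataFrame`. The column name `value` and the zero-filled data are placeholders; with a single existing column, the insert position also has to be 0 or 1.

```python
import numpy as np
import pandas as pd

# Placeholder one-column frame with 29405 rows.
df = pd.DataFrame({"value": np.zeros(29405)})

# Number the rows 1..29405 and insert them as a second column.
df.insert(1, "Consumer", np.arange(1, len(df) + 1))

print(df.head())
```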

lapis sequoia Sep 15, 2020, 2:56 PM

#

hey guys noob here, I'm trying to use pandas to read a file, but when I use read_csv it doesn't print columns, just long lines of text where the columns are separated by '\t'

I figured out I had to use a delimiter, but when I plug \t into it I get an error? (it is a TSV file, not CSV, but I've seen it works the same)

desert oar Sep 15, 2020, 3:04 PM

#

@lapis sequoia can you share your actual code?

lapis sequoia Sep 15, 2020, 3:07 PM

#

📎 unknown.png

#

only that and I get an error in this line

#

@desert oar

desert oar Sep 15, 2020, 3:08 PM

#

show the error? also please try to share code as text, and not screenshots. some people have difficulty reading screenshots

lapis sequoia Sep 15, 2020, 3:09 PM

#

here is what I get if I dont specify any delimiter

📎 unknown.png

#

sorry but idk how to share the code in fashion way

#

📎 unknown.png

#

📎 unknown.png

grave frost Sep 15, 2020, 3:13 PM

#

@lapis sequoia How big is the TSV file you want to load?

lapis sequoia Sep 15, 2020, 3:14 PM

#

500 000 ko

#

big file

grave frost Sep 15, 2020, 3:14 PM

#

ko?

lapis sequoia Sep 15, 2020, 3:14 PM

#

@grave frost

#

Ko I guess it's 500 mo

grave frost Sep 15, 2020, 3:15 PM

#

Hmm...

raw rapids Sep 15, 2020, 3:15 PM

#

@Jerrody you're just loading a tuple of values

grave frost Sep 15, 2020, 3:16 PM

#

@lapis sequoia Did you try just taking a few values, making a new tsv and reading that with pandas?

lapis sequoia Sep 15, 2020, 3:17 PM

#

hmm no, you mean like create a file with only the columns im interest in or st ?

#

idk how to do that tbh Im a real beginner

grave frost Sep 15, 2020, 3:17 PM

#

No, create a file with only a few data points (like maybe of 5Mb). Also, how much RAM have you got?

desert oar Sep 15, 2020, 3:17 PM

#

@lapis sequoia try using nrows=100 to load only 100 rows

#

!codeblock

arctic wedgeBOT Sep 15, 2020, 3:17 PM

#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')

desert oar Sep 15, 2020, 3:18 PM

#

^ that tells you how to post code

lapis sequoia Sep 15, 2020, 3:18 PM

#

8gb

#

oh ok thanks

#

but I get the error just by printing the columns alone @desert oar

#

and no error if I dont use delimiter

desert oar Sep 15, 2020, 3:19 PM

#

if you dont use delimiter= it is loading everything as text

#

pd.read_csv('data.tsv', delimiter='\t', nrows=100)

does this produce an error?

lapis sequoia Sep 15, 2020, 3:20 PM

#

no!

#

lenght= 3148

#

may be that...

#

but it's weird since I only prints the columns

desert oar Sep 15, 2020, 3:21 PM

#

the error is not from printing the columns

#

do you want to read the data file? or do you only want to print the columns?

lapis sequoia Sep 15, 2020, 3:22 PM

#

well, Im really exploring the data as a noob atm

desert oar Sep 15, 2020, 3:22 PM

#

it sounds like you do not have enough free memory on your computer to read the whole dataset

lapis sequoia Sep 15, 2020, 3:22 PM

#

seeing if I can find one specific columns and exploring only the data associated with it

desert oar Sep 15, 2020, 3:22 PM

#

you can select columns with usecols=

lapis sequoia Sep 15, 2020, 3:22 PM

#

hmmm okay

desert oar Sep 15, 2020, 3:22 PM

#

pd.read_csv('data.tsv', delimiter='\t', nrows=100, usecols=['CASEID', 'QUESTID2', 'CIGEVER'])

#

this only reads the CASEID, QUESTID2, and CIGEVER columns

#

and only 100 rows
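
A self-contained sketch of the same call, using an in-memory TSV so it runs without the real file. The generated values and the extra `WORKDAYS` column are made up; the three selected column names are the ones from the message above.

```python
import io

import pandas as pd

# Build a small fake TSV in memory: header plus 500 tab-separated rows.
tsv_text = "CASEID\tQUESTID2\tCIGEVER\tWORKDAYS\n" + "".join(
    f"{i}\t{1000 + i}\t{i % 2}\t{i % 7}\n" for i in range(500)
)

df = pd.read_csv(
    io.StringIO(tsv_text),
    delimiter="\t",                               # tab-separated values
    nrows=100,                                    # stop after 100 data rows
    usecols=["CASEID", "QUESTID2", "CIGEVER"],    # skip the other columns
)

print(df.shape)
```

Both `nrows` and `usecols` limit what is parsed into memory, which is why they help when the full file does not fit.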

grave frost Sep 15, 2020, 3:23 PM

#

Is there any way to create partial checkpoint? Like to make a chkpt if the Epoch is done only about 30% or something?

lapis sequoia Sep 15, 2020, 3:24 PM

#

oh okay thanks! so since I specify those parameters in read_csv before doing anything else, it means that if I don't do that, it goes through all the data AND ROWS even if I only print the columns, and that causes the error? do I understand correctly?

fast bluff Sep 15, 2020, 3:25 PM

#

def SortData():
    global df
    cols = df.columns.tolist()                          
    cols
    colsorder = [2,1,4,3]                        
    df = df[cols[i] for i in colsorder]                                  
    return df
Last line returning a syntax error plox halp

grave frost Sep 15, 2020, 3:27 PM

#

Can you post the whole error? and are you using Jupyter Notebooks by any chance?

fast bluff Sep 15, 2020, 3:28 PM

#

Are you talking to him or I?

grave frost Sep 15, 2020, 3:28 PM

#

you

fast bluff Sep 15, 2020, 3:28 PM

#

I'm using ATOM with some addons to make it work like Jupyter

grave frost Sep 15, 2020, 3:28 PM

#

Hydrogen?

fast bluff Sep 15, 2020, 3:29 PM

#

Yeah

#

📎 unknown.png

grave frost Sep 15, 2020, 3:29 PM

#

Just restart your kernel and run only the lines you need

fast bluff Sep 15, 2020, 3:30 PM

#

Code looks fine, you just thinking its an error with the kernel?

marsh seal Sep 15, 2020, 3:30 PM

#

https://media.discordapp.net/attachments/587375753306570782/755439106317877341/unknown.png?width=730&height=702 Hello, how do I retrieve a specific timestamp from this data frame? Also, how would i make the timestamps a column 'date'? This is a multi index btw

paper niche Sep 15, 2020, 3:32 PM

#

@fast bluff you need 1 more pair of brackets I think, try df[[cols[i] for i in colsorder]]?

fast bluff Sep 15, 2020, 3:33 PM

#

Someone else suggested that. When I try that it says "list index out of range"

paper niche Sep 15, 2020, 3:33 PM

#

how many columns do you have? 4?

fast bluff Sep 15, 2020, 3:33 PM

#

Yeah 4 exact

#

SMA, CLOSE, BUY, SELL

paper niche Sep 15, 2020, 3:34 PM

#

python is 0-indexed, you need [1, 0, 3, 2] instead

fast bluff Sep 15, 2020, 3:34 PM

#

OHH

#

omg

#

smh thank you

paper niche Sep 15, 2020, 3:34 PM

#

though, it's probably easier to just select them directly? if the column names are fixed

#

as in df = df[['CLOSE', 'BUY', 'SELL', 'SMA']] or whatever
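
A sketch putting the options side by side, assuming the four columns named above. The one-row data is made up.

```python
import pandas as pd

df = pd.DataFrame({"SMA": [1.0], "CLOSE": [2.0], "BUY": [0], "SELL": [1]})

cols = df.columns.tolist()
colsorder = [1, 0, 3, 2]  # 0-based positions, so valid indices are 0..3

# Inner brackets build the list of names; outer brackets index the frame.
by_position = df[[cols[i] for i in colsorder]]

# Same result with positional indexing.
by_iloc = df.iloc[:, colsorder]

# Or select by name directly when the names are fixed.
by_name = df[["CLOSE", "SMA", "SELL", "BUY"]]

print(by_position.columns.tolist())
```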

fast bluff Sep 15, 2020, 3:35 PM

#

I was doing that before and ran into some errors yelling at me to use .iloc so I've just been playing it safe

#

Starting with 0 fixed the problem lol tysm that went right over my head

grave frost Sep 15, 2020, 3:39 PM

#

Just an idle and theoretical question - can we use a decoder (like in an autoencoder) with some layers to give probability-oriented outputs? Like for every latent space combination, it would produce several unique possibilities. How would we implement something like that? Maybe by using "softmax" at the end to get probabilities, but the thing would be that those outputs would all be the BEST it can produce, just tweaked in different ways. Like it wouldn't assign 94% to one, then 3% to another and so on. It would give several outputs but each of them would be the most likely. Like for 50 outputs, it would assign each one at 2%. So is a decoder deterministic in that way (not able to produce multiple outputs for the same input) or does it actually end up producing unique outputs every time?

lapis sequoia Sep 15, 2020, 3:50 PM

#

So I now have this variable workdays, where each id is associated with different numbers, and I'd simply like to plot the frequency/distribution of each number for the whole data set, but I only get one big column


plt.hist(data)
plt.show()

#

what am i doing wrong

📎 unknown.png

#

I'd like to have a frequency distribution

tidal bough Sep 15, 2020, 3:51 PM

#

what does data look like?

#

like, what's data.head()?

desert oar Sep 15, 2020, 3:52 PM

#

@lapis sequoia

    df = df[[cols[i] for i in colsorder]]

lapis sequoia Sep 15, 2020, 3:53 PM

#

99 is a specific value dat means error or st but others ID should have any numbers

📎 unknown.png

tidal bough Sep 15, 2020, 3:54 PM

#

you probably want to hist a specific column

lapis sequoia Sep 15, 2020, 3:54 PM

#

@desert oar sorry I dont understand where i'm supposed to do that

tidal bough Sep 15, 2020, 3:54 PM

#

plt.hist(data["WORKDAYS"])
plt.show()

paper niche Sep 15, 2020, 3:54 PM

#

https://media.discordapp.net/attachments/587375753306570782/755439106317877341/unknown.png?width=730&height=702 Hello, how do I retrieve a specific timestamp from this data frame? Also, how would i make the timestamps a column 'date'? This is a multi index btw
@marsh seal
if they are proper datetime formats, you can retrieve directly with new.loc['2017-01-03'] if you want the 3 Jan 2017 rows, for example.
and I'm not quite sure I get your second question, you want to reset the timestamp index into its own column called 'date'?

lapis sequoia Sep 15, 2020, 3:55 PM

#

@tidal bough it worked thanks! so even if I specify the reading to only the columns I still need to specify it in the plot, thanks 😉

modest rune Sep 15, 2020, 3:57 PM

#

@odd yoke So, it is ugly and needs to get cleaned up. But the functions I wanted to run with numba now run without errors. They went from taking 4 seconds to run to 500ms. Which is great.

But, something weird happened, the parent function that calls them:

# everything from custom_function down has been decorated with @njit
pandas.DataFrame.rolling().apply(custom_function)

pandas.DataFrame.rolling().apply(custom_function) (refer to this as A) gets called 8000 times. Because of how rolling() works, custom_function (refer to this as Child of A) probably gets called close to 1 million times (8000 * 100 = 800,000).

Before numba: A took 4 seconds to run, of which Child of A accounted for 3.8 of those seconds.
After numba: A took 4 seconds to run, of which Child of A accounted for 0.5 of those seconds.

Seems like there is a significant amount of overhead loading the numba function or something. Any insights? I feel like something trivial needs to be done to fix this problem.
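
One thing that may be worth checking here, offered as an assumption rather than a diagnosis: by default `rolling().apply` passes each window to the function as a Series, and building those objects can dominate the runtime. Passing `raw=True` hands over plain NumPy arrays instead. The `custom_function` body below is a hypothetical stand-in.

```python
import numpy as np
import pandas as pd


def custom_function(window):
    # Hypothetical stand-in: with raw=True, window is a 1D NumPy array.
    return window[-1] - window.mean()


s = pd.Series(np.arange(1000, dtype=np.float64))

# raw=True: each window arrives as an ndarray (no per-window Series).
fast = s.rolling(100).apply(custom_function, raw=True)

# raw=False: each window arrives as a Series, which is slower to construct.
slow = s.rolling(100).apply(lambda w: w.iloc[-1] - w.mean(), raw=False)

print(fast.iloc[-1])
```

Functions compiled with numba generally expect arrays rather than Series anyway, so `raw=True` tends to be the natural pairing.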

tidal bough Sep 15, 2020, 4:00 PM

#

@lapis sequoia Alternatively, you can use seaborn to visualize dataframes; it's nicer in some ways. With it, it'd be:

import seaborn as sns
sns.displot(data=data,x="WORKDAYS")

marsh seal Sep 15, 2020, 4:00 PM

#

@paper niche yes i want the dates to have a column named 'date'. this way i can filter through them

lapis sequoia Sep 15, 2020, 4:00 PM

#

thanks will look into it

#

and do you know how could I delete some values of my data like since 0 and 99 are specific values that arent really analysable ?

#

tldr: How to delete specific values of a column before plotting

tidal bough Sep 15, 2020, 4:04 PM

#

I'd suggest, instead of using 0 and 99 to mark "invalid value", using None or NaN.

#

Though also, you can just filter them out when plotting:

#

plt.hist(data["WORKDAYS"][(data["WORKDAYS"]!=99) & (data["WORKDAYS"]!=0)])
plt.show()
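
A sketch of the first suggestion (marking the sentinel codes as NaN) using pandas only. The sample values are made up, and the plotting call is left as a comment so the snippet runs without matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical sample where 0 and 99 are "not analysable" codes.
data = pd.DataFrame({"WORKDAYS": [0, 2, 5, 99, 5, 1, 99, 0, 2, 5]})

invalid = data["WORKDAYS"].isin([0, 99])
workdays = data["WORKDAYS"].mask(invalid)  # NaN wherever the code was 0 or 99

valid = workdays.dropna()
counts = valid.value_counts().sort_index()
print(counts)

# Plotting the cleaned values would then look like:
# plt.hist(valid)
# plt.show()
```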

paper niche Sep 15, 2020, 4:06 PM

#

@paper niche yes i want the dates to have a column named 'date'. this way i can filter through them
@marsh seal you can use df.rename_axis(index=['date', None]).reset_index(level=0) which will convert the first-level index (your timestamp index) into its own column

marsh seal Sep 15, 2020, 4:08 PM

#

@paper niche thank you for your help. much appreciated!! (i've been stuck on this for a few hours)

paper niche Sep 15, 2020, 4:08 PM

#

sure no problem, I edited my answer btw, you can rename the index before resetting it so it gets a meaningful column name
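
A runnable sketch of that approach on a small two-level index. The frame `new`, the `ticker` level name, the tickers, and the `market_cap` values are all hypothetical stand-ins for the data in the screenshot.

```python
import pandas as pd

# Hypothetical two-level index: timestamp first, then a ticker symbol.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-02", "2017-01-03"]), ["AAA", "BBB"]]
)
new = pd.DataFrame({"market_cap": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Name the index levels, then move the first level out into a column.
out = new.rename_axis(index=["date", "ticker"]).reset_index(level=0)

# The dates can now be filtered like any other column.
jan3 = out[out["date"] == pd.Timestamp("2017-01-03")]
print(jan3)
```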

lapis sequoia Sep 15, 2020, 4:09 PM

#

amazing thanks ! @tidal bough one last thing if you please, the x values arent really readable and scaled, how do I show the exact numbers for each :

#

📎 unknown.png

#

it's weird since the data show some numbers "2" and isnt from 0 to 10 or anything

tidal bough Sep 15, 2020, 4:10 PM

#

with seaborn it's very simple: pass discrete = True.

#

With matplotlib, it's more annoying - you'd have to manually set the number of bins to max - min of your data.

lapis sequoia Sep 15, 2020, 4:12 PM

#

ok so I have to find the max mins with describe or something and what do I put in bins= then ?

#

literally max - min ?

tidal bough Sep 15, 2020, 4:12 PM

#

lemme find the last time I did that...

lapis sequoia Sep 15, 2020, 4:14 PM

#

because the min max is in fact 0 to 100 and I already randomly put the bins=10/1 but it is what is on my screenshot

#

thanks for your help 🙂

#

oh no

#

I put 10/1 so I must do it 100/1 , right ?

#

that's it!

tidal bough Sep 15, 2020, 4:17 PM

#

should be max(data["WORKDAYS"]) - min(data["WORKDAYS"]) + 1 bins, and also pass align = "left"

lapis sequoia Sep 15, 2020, 4:17 PM

#

How do I put every number 1,2,3,4,5 in the x legend tho

#

hmmm

tidal bough Sep 15, 2020, 4:17 PM

#

How do I put every number 1,2,3,4,5 in the x legend tho
no idea, at that point you might want to use seaborn 🙂
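
One alternative to counting bins, offered as a sketch: put the bin edges at half-integers so every integer value sits in the middle of its own bar, then set the ticks explicitly. The sample values are made up, and the matplotlib calls are shown as comments so the snippet runs with NumPy alone.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 5, 5, 5])  # hypothetical integer data
lo, hi = values.min(), values.max()

# Edges at lo-0.5, lo+0.5, ..., hi+0.5 give one bin per integer value.
edges = np.arange(lo, hi + 2) - 0.5
counts, _ = np.histogram(values, bins=edges)
print(counts)

# With matplotlib the same edges can be reused, plus one tick per integer:
# plt.hist(values, bins=edges)
# plt.xticks(np.arange(lo, hi + 1))
# plt.show()
```

This differs from the `max - min + 1` bins plus `align="left"` suggestion above, but aims at the same picture: one bar per integer, labelled on the x axis.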

lapis sequoia Sep 15, 2020, 4:18 PM

#

ok , thanks for the help so far!

desert oar Sep 15, 2020, 4:20 PM

#

@lapis sequoia write df = df[[cols[i] for i in colsorder]] instead of df = df[cols[i] for i in colsorder].

lapis sequoia Sep 15, 2020, 4:21 PM

#

ah it's not me lol it's @fast bluff his username makes it seems it was me asking

#

I was wondering what you were talking about...

fast bluff Sep 15, 2020, 4:22 PM

#

LOL

#

mbmb I like being sneaky

desert oar Sep 15, 2020, 4:22 PM

#

@fast bluff mods will probably ask you to change your name, for being difficult to tag

fast bluff Sep 15, 2020, 4:23 PM

#

I'll set a nickname

desert oar Sep 15, 2020, 4:23 PM

#

👍

fast bluff Sep 15, 2020, 4:23 PM

#

😄

red rose Sep 15, 2020, 4:28 PM

#

where do you guys find interesting data to work with ?

mellow spruce Sep 15, 2020, 4:29 PM

#

Hi guys I was wondering how can I apply a value of a list to 10 rows of a data frame and then go to the next value of the list. ie.:
Input:

Numbers
1
2
3
4
5
6
7
8
9
10
11
12
13

Output:

Numbers | Group
1       | A
2       | A
3       | A
4       | A
5       | A
6       | A
7       | A
8       | A
9       | A
10      | A
11      | B
12      | B
13      | B

Thank you !

marsh seal Sep 15, 2020, 4:50 PM

#

@paper niche I get the following error from that code

📎 unknown.png

gray phoenix Sep 15, 2020, 4:51 PM

#

Does anyone know how to tweak the NLTK library?

desert oar Sep 15, 2020, 5:01 PM

#

@red rose UCI machine learning repository, kaggle, various blog posts

#

@gray phoenix tweak how?

#

@marsh seal is new a DataFrame or a Series?

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rename_axis.html
the Series version doesn't accept index=

marsh seal Sep 15, 2020, 5:05 PM

#

@desert oar I think i figured out part of it.

📎 unknown.png

desert oar Sep 15, 2020, 5:06 PM

#

so new was a Series before you did reset_index?

#

new_df = new.rename_axis(['date', 'market_cap']).reset_index(level='date') how about this?

gray phoenix Sep 15, 2020, 5:07 PM

#

@desert oar I am using NLTK to read reviews. But right now the accuracy is fairly low.

I'm looking in how i can improve the accuracy

desert oar Sep 15, 2020, 5:08 PM

#

@gray phoenix it sounds like you might need a different model

#

it's usually not a matter of tweaking the library

#

you need to understand what the existing model is doing, and why it produces the results that it produces

#

then you can understand how to make it better

#

this is why knowing how the model works is important, otherwise you are just guessing or blindly following recommendations

#

what method are you currently using? what do you mean by "read reviews"?

#

@mellow spruce https://repl.it/@maximum__/CadetblueModernDiskdrive#main.py

import numpy as np
import pandas as pd

data = pd.DataFrame({'y': np.arange(50)})

vals = ['A', 'B', 'C']
vals_rep = pd.Series(np.repeat(vals, 10))

data['z'] = vals_rep
print(data)
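A variation on the snippet above, as a rough sketch only: when the frame has more rows than `len(vals) * 10`, the extra rows end up as NaN. Integer-dividing the row position by the block size gives a label index per row instead. The column names `Numbers` and `Group` are taken from the question; the list of labels is assumed to be long enough to cover every block.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Numbers': np.arange(1, 14)})   # 13 rows, as in the question
labels = np.array(['A', 'B', 'C'])                 # assumed: enough labels for all blocks

block = np.arange(len(df)) // 10                   # 0 for rows 0-9, 1 for rows 10-19, ...
df['Group'] = labels[block]                        # fancy indexing picks one label per row
print(df)
```

With 13 rows this gives ten `A` rows followed by three `B` rows.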


marsh seal Sep 15, 2020, 5:10 PM

#

@desert oar data frame. the output for the code new_df is 'length of new names must be 1, got 2'

desert oar Sep 15, 2020, 5:13 PM

#

well you're doing all these inplace operations

#

so now it's a dataframe

#

but it looks like it was a series before

gray phoenix Sep 15, 2020, 5:18 PM

#

@desert oar I might have to do more research into it. TBH I am still fairly new to Data Science and NLP.

But to give you the use case: at work I am working with the data scientist to take the reviews we get from our patients and give a numerical value to our patients' experience. He gave me the data of the reviews, and he also gave me a version of the same output, except with his NLP engine's score. (He codes in R, I code in Python.)

I was using NLTK to score the reviews, but I am way off. Not sure if I should be utilizing NLTK. I'm just a little lost and don't know where to start

desert oar Sep 15, 2020, 5:18 PM

#

what part of NLTK are you using?

#

and how do you define patient experience? what does the numerical value represent?

gray phoenix Sep 15, 2020, 5:22 PM

#

I am using

from nltk.sentiment.vader import SentimentIntensityAnalyzer

I believe I am using vader_lexicon

in my output, I have the "PatientFeedback" then How positive, neutral, negative, the patients feedback is

desert oar Sep 15, 2020, 5:26 PM

#

you are using SentimentIntensityAnalyzer.polarity_scores?

#

(i'm not an expert in this area, but you should first make sure you're using the tool correctly before you decide if the tool is not suitable for your application)

gray phoenix Sep 15, 2020, 5:28 PM

#

Yes I am

for review in reviews['PatientFeedback']: sent = sia.polarity_scores(str(review))

Do you know where or a resource I can learn about it

desert oar Sep 15, 2020, 5:28 PM

#

https://github.com/cjhutto/vaderSentiment


gray phoenix Sep 15, 2020, 5:28 PM

#

Thank you @desert oar

desert oar Sep 15, 2020, 5:29 PM

#

https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

Medium

Simplifying Sentiment Analysis using VADER in Python (on Social Med...

An easy to use Python library built especially for sentiment analysis of social media texts.

#

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
it might be that this just is not the right tool for the job

#

but im not sure what other tools for sentiment analysis are available. you might need to do some digging, look around on forums, read blog posts, etc.

gray phoenix Sep 15, 2020, 5:30 PM

#

Thank you so much

desert oar Sep 15, 2020, 5:31 PM

#

this seems to be the original paper citation:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

#

so you might also want to look at what papers cited this paper

dense copper Sep 15, 2020, 6:33 PM

#

hey guys, if I have a pandas dataframe like this, any ideas how I can get only the first and last item for each unique instance (ticker) in the first column?

ticker dimension calendardate
AACH       MRY   2018-12-31 2018-12-31
AACH       MRY   2017-12-31 2017-12-31
AACH       MRY   2016-12-31 2016-12-31
AACH       MRY   2015-12-31 2015-12-31
AACG       MRY   2019-12-31 2019-12-31
AACG       MRY   2018-12-31 2018-12-31
AACG       MRY   2017-12-31 2017-03-31
AACG       MRY   2016-12-31 2016-03-31
AACG       MRY   2015-12-31 2015-03-31
AAAP       MRY   2016-12-31 2016-12-31
AAAP       MRY   2015-12-31 2015-12-31
  AA       MRY   2019-12-31 2019-12-31
  AA       MRY   2018-12-31 2018-12-31
  AA       MRY   2017-12-31 2017-12-31
  AA       MRY   2016-12-31 2016-12-31
  AA       MRY   2015-12-31 2015-12-31
   A       MRY   2019-12-31 2019-10-31
   A       MRY   2018-12-31 2018-10-31
   A       MRY   2017-12-31 2017-10-31
   A       MRY   2016-12-31 2016-10-31
   A       MRY   2015-12-31 2015-10-31

#

for example I would want only 2018 and 2015 for AACH, 2019 and 2015 for AACG, etc

#

I expect I need to groupby ticker but I'm not sure where to go from there

desert oar Sep 15, 2020, 6:37 PM

#

@dense copper
```python
data.groupby('ticker').agg(['first', 'last'])
```

maybe?

dense copper Sep 15, 2020, 6:42 PM

#

hmm that's not a bad idea, but I'm reading I should probably use nth(0) rather than first() since first() ignores NaNs... any idea if it's possible to pass a param to nth() in that context you used it w/ agg? I don't see anything in the docs about that

#

oof I need to get better at pandas lol
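One possible way around the `first()`/NaN concern, sketched under assumptions: `GroupBy.head(1)` and `GroupBy.tail(1)` select whole rows by position, so they do not skip NaN values. The small frame below is made up from a few rows of the table above, and it assumes the rows are already in the order you care about within each ticker.

```python
import pandas as pd

# Hypothetical subset of the table from the question
df = pd.DataFrame({
    'ticker': ['AACH', 'AACH', 'AACH', 'AACG', 'AACG', 'AAAP'],
    'calendardate': ['2018-12-31', '2017-12-31', '2015-12-31',
                     '2019-12-31', '2015-12-31', '2016-12-31'],
})

grouped = df.groupby('ticker', sort=False)

# First and last row of every group, selected by position
first_last = pd.concat([grouped.head(1), grouped.tail(1)])

# A ticker with a single row would appear twice, so drop repeated index labels
first_last = first_last[~first_last.index.duplicated()].sort_index()
print(first_last)
```

The result keeps the original row labels, which makes it easy to check which rows were picked.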

fast bluff Sep 15, 2020, 7:07 PM

#

Would anyone mind reviewing my code?? Everything works as intended, am just a newb and criticism is always helpful

#

And if you're down, would it be easier to send the entire file, or send it via discord code? It's only 60 lines

novel remnant Sep 15, 2020, 7:22 PM

#

a nested list

solid aurora Sep 15, 2020, 7:26 PM

#

Is there a way to use sklearn's pipelines to short-circuit out on certain datapoints?

#

i.e. if I had a trained model M

#

and input data D = [1, 2, -3, 4]

#

let's say I want my output to be M(p) for each point p in D

#

unless p < 0, in which case I want to short-circuit out of the pipeline and the output is -1

#

so my output should look like [M(1), M(2), -1, M(4)]

#

(my actual condition is a bit more complex but still vectorizable)

#

The reason I wanted to use sklearn's pipelines is to keep the shape and order of my input data

#

so maybe there's a better way to do that?
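One way to keep the shape and order without a pipeline, as a minimal sketch: build a boolean mask, pre-fill the output with the short-circuit value, and only call the model on the rows that pass. The helper name `predict_with_short_circuit` and the `toy_model` function are hypothetical stand-ins; any callable with a `predict`-like signature (for example a fitted estimator's `.predict`) could be passed instead.

```python
import numpy as np

def predict_with_short_circuit(predict, X, skip_mask, fill_value=-1.0):
    """Call predict only on rows where skip_mask is False; fill the rest."""
    X = np.asarray(X)
    skip_mask = np.asarray(skip_mask, dtype=bool)
    out = np.full(len(X), fill_value, dtype=float)   # same length and order as X
    if (~skip_mask).any():
        out[~skip_mask] = predict(X[~skip_mask])     # model only sees the kept rows
    return out

def toy_model(X):
    # Stand-in for a trained model M: doubles the first feature
    return X[:, 0] * 2.0

D = np.array([[1.0], [2.0], [-3.0], [4.0]])
result = predict_with_short_circuit(toy_model, D, skip_mask=D[:, 0] < 0)
print(result)
```

The output lines up with the input row by row, so `result[2]` is the fill value for the negative point.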

serene scaffold Sep 15, 2020, 7:33 PM

#

# df: pd.DataFrame
>>> array = df[column].to_numpy(dtype=np.float, na_value=np.nan)
ValueError: could not convert string to float: '?'

I don't think pandas was having any issues reading in the question marks as null values, however it represents those, so I'm not sure why they're not being converted to np.nan

desert oar Sep 15, 2020, 7:35 PM

#

@void anvil json?

#

@serene scaffold what is your question exactly? if df[column] has "?"s in it, that error makes sense

serene scaffold Sep 15, 2020, 7:36 PM

#

@serene scaffold what is your question exactly? if df[column] has "?"s in it, that error makes sense
@desert oar I should have mentioned that df[column] is a Series but I want them to be np.nan

desert oar Sep 15, 2020, 7:37 PM

#

does it contain floats and strings, or all strings?

#

e.g. is it pd.Series([1.3, 2.7, '?']) or pd.Series(['1.3', '2.7', '?'])

serene scaffold Sep 15, 2020, 7:37 PM

#

let's see

#

ah, they're strings

#

darn

desert oar Sep 15, 2020, 7:39 PM

#

the question mark is the problem no matter what

serene scaffold Sep 15, 2020, 7:40 PM

#

that's how the csv represents missing data

desert oar Sep 15, 2020, 7:41 PM

#

import numpy as np
import pandas as pd

y = pd.Series([1.3, 2.7, '?'])

y_arr = y.mask(y == '?').to_numpy(dtype=float)

@serene scaffold

#

you have to just replace the ? with something else

serene scaffold Sep 15, 2020, 7:42 PM

#

Thanks!

desert oar Sep 15, 2020, 7:42 PM

#

unfortunately .replace doesn't actually let you replace with nulls, i think

#

print( y.replace('?', np.nan).to_numpy(dtype=float) )

you can do it with float('nan')/math.nan/np.nan but not None
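For the all-strings case mentioned earlier (`pd.Series(['1.3', '2.7', '?'])`), another option is `pd.to_numeric` with `errors='coerce'`. This is only a sketch, and note the trade-off: it turns every unparseable string into NaN, not just `'?'`.

```python
import numpy as np
import pandas as pd

y = pd.Series(['1.3', '2.7', '?'])

# Anything that cannot be parsed as a number becomes NaN
y_arr = pd.to_numeric(y, errors='coerce').to_numpy()
print(y_arr, y_arr.dtype)
```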

#

@void anvil no, the "top level" of a json file can be any valid json type

#

null, int, float, string, array, object

#

why is that weird? it makes sense

#

!e ```python
import json
print( json.loads("null") )
print( json.loads("3.5") )
print( json.loads('["a", "b"]') )
```

arctic wedgeBOT Sep 15, 2020, 7:44 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | None
002 | 3.5
003 | ['a', 'b']

desert oar Sep 15, 2020, 7:44 PM

#

why would you need any of that

#

you mean, why would someone save data as just arrays? who knows

#

where did you get this data

#

thats not what i mean, lol

#

what's the data for? who generated it?

#

[[1, 2, 3], [4, 5, 6]]

it's like this?

serene scaffold Sep 15, 2020, 7:46 PM

#

def open_csv(path):
    df = pd.read_csv(path)
    df = df.replace('?', float('nan'))
    return df

#

this seems to fix it

desert oar Sep 15, 2020, 7:47 PM

#

@serene scaffold
```python
df = pd.read_csv(path, na_values=['?'])
```

serene scaffold Sep 15, 2020, 7:47 PM

#

oh dank
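A small self-contained check of the `na_values` approach, using an in-memory CSV instead of a real file (the column names `a` and `b` are made up for the example):

```python
import io
import pandas as pd

csv_text = "a,b\n1.5,?\n?,2.0\n"

# '?' is treated as missing while parsing, so the columns come out as floats
df = pd.read_csv(io.StringIO(csv_text), na_values=['?'])
print(df)
print(df.dtypes)
```

Because the question marks never reach the DataFrame as strings, a later `to_numpy(dtype=float)` should not hit the conversion error.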

desert oar Sep 15, 2020, 7:47 PM

#

why are you being asked to work on data that you dont know anything about

#

yikes

#

thats a rough job

#

you might want to talk to whoever currently uses this data

#

ask them what it's for, how they use it, etc

#

maybe there's an automated system that consumes it -- do you have access to its source code or at least its business logic?

#

etc

#

this is the part of data science that requires you to be more of a business person / manager than anything

modest rune Sep 15, 2020, 7:54 PM

#

I think numba is compiling every time I run the program. Isn't it supposed to cache the binaries?

desert oar Sep 15, 2020, 7:56 PM

#

http://numba.pydata.org/numba-doc/dev/developer/caching.html @modest rune they have some rules about when compiled code is actually cached

#

maybe for some reason your program doesnt meet those requirements

modest rune Sep 15, 2020, 7:57 PM

#

I see there is a cache flag you can set... I guess I assumed it cached the binaries by default.

#

HAHAHA!!! That did IT! cache=True!

#

4 seconds down to 1.366 seconds. Still a lot of room for improvement, but at least I didn't completely waste my time numbafying my code.

serene scaffold Sep 15, 2020, 8:08 PM

#

>>> x
[0.48098 0.60981 0.47886 ... 0.47272 0.56138     nan]
>>> y
[    nan 0.41685 0.37368 ... 0.35527 0.44202 0.22364]
>>> x[~np.isnan(x).any(axis=-1), ~np.isnan(y).any(axis=-1)]
[]

modest rune Sep 15, 2020, 8:08 PM

#

And the numbafied code is no longer the long pole in the benchmark tent

serene scaffold Sep 15, 2020, 8:09 PM

#

evidently this isn't the way to remove indices where the element is nan in either

modest rune Sep 15, 2020, 8:09 PM

#

Thankyou @desert oar and @odd yoke

buoyant flint Sep 15, 2020, 8:17 PM

#

Hello nice people, can some one help me with a question using the str.extract function

lapis sequoia Sep 15, 2020, 8:25 PM

#

anyone know how to use Plotly Express

#

Fig1 = px.scatter_mapbox(Random Locations)
Fig2 = px.scatter_mapbox(Random Locations 2)

Then somehow get both of them to show

#

I cannot get this to work, and reading the documentation is enraging

#

cannot find anything that solves my issue

desert oar Sep 15, 2020, 8:48 PM

#

@serene scaffold im confused what are you trying to do

#

@void anvil i dont understand, its just a file full of numbers?

#

if its 15 years old it probably isnt even "json" technically

#

if nobody is using it and nobody knows what it is

#

why does anyone care

#

just delete it

serene scaffold Sep 15, 2020, 8:54 PM

#

>>> a = np.array([np.nan, 2.0, 6.0, 4.0, np.nan])
>>> b = np.array([1.0, np.nan, 3.0, 8.0, np.nan])
# convert these to ->
>>> a = np.array([6.0, 4.0])
>>> b = np.array([3.0, 8.0])

#

@desert oar delete every element that's nan in either array

#

actually this is wrong too

#

fixed

lime loom Sep 15, 2020, 9:24 PM

#

is it possible to have a jupyter notebook use kernels from multiple servers?

serene scaffold Sep 15, 2020, 9:34 PM

#

# x and y are 1-D float NumPy arrays (np.ndarray) of the same shape
overlap = ~np.isnan(x) & ~np.isnan(y)
x, y = x[overlap], y[overlap]

#

appears to work correctly

#

very esoteric looking though
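The same idea run on the two example arrays from earlier, as a quick sketch. By De Morgan's law, "not NaN in x and not NaN in y" is the same mask as "not (NaN in x or NaN in y)", which some people find easier to read:

```python
import numpy as np

a = np.array([np.nan, 2.0, 6.0, 4.0, np.nan])
b = np.array([1.0, np.nan, 3.0, 8.0, np.nan])

# Keep positions where neither array has a NaN
keep = ~(np.isnan(a) | np.isnan(b))
same_mask = ~np.isnan(a) & ~np.isnan(b)   # the form used in the message above

a_clean, b_clean = a[keep], b[keep]
print(a_clean, b_clean)
```

Both results have the same shape because a single mask is applied to both arrays.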

modest rune Sep 15, 2020, 9:46 PM

#

I'm sorry, but I just can't make sense of these numba errors. I could use help 2 ways... (1) How do I interpret the errors? and (2) how do I fix this particular error.

Code:

@njit
def iterate_stuff(c_TV, c_x, c_y, p_TV, p_x, p_y,underlying, initial_investment,
  flat_fee, per_contract_fee, percent_fee,days, putCall_floats, multipliers,
  asks, strikes, closes, putCalls):

  for i in range(days.shape[0]):
    # lots of code after this

Error:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
During: typing of argument at c:/.../analysis2.py (222)

File "analysis2.py", line 222:
def iterate_stuff(c_TV, c_x, c_y, p_TV, p_x, p_y,underlying, initial_investment, flat_fee, per_contract_fee, percent_fee,days, putCall_floats, multipliers, asks, strikes, closes, putCalls):
    <source elided>

    for i in range(days.shape[0]):
    ^

desert oar Sep 15, 2020, 10:03 PM

#

@serene scaffold well you dont need to use the tuple expansion trick

#

it isnt esoteric to me, its just using boolean operators for boolean operations

#

also maybe you want | and not &

serene scaffold Sep 15, 2020, 10:05 PM

#

if either is False, it has to be False

desert oar Sep 15, 2020, 10:05 PM

#

then use or, not and

#

wait

#

what if one is missing but the other is not missing

#

nan, rather

serene scaffold Sep 15, 2020, 10:06 PM

#

we only want the elements that are numbers in both arrays

desert oar Sep 15, 2020, 10:06 PM

#

yeah, use |

#

oh i see

#

you have ~ in there

#

so yes youre fine

serene scaffold Sep 15, 2020, 10:06 PM

#

we need two arrays of only numbers that are the same shape.

desert oar Sep 15, 2020, 10:07 PM

#

@modest rune im surprised at that error, without knowing more about your code its hard to say why

#

it might even be a problem in numba's type inference

modest rune Sep 15, 2020, 10:08 PM

#

@desert oar I think I figured it out. One of the many parameters I had in my function declaration was an array of strings. I removed it and the error went away.

desert oar Sep 15, 2020, 10:08 PM

#

ah

modest rune Sep 15, 2020, 10:08 PM

#

Not a very helpful error.

desert oar Sep 15, 2020, 10:08 PM

#

yeah, numba probably doesnt like that

#

no its not helpful

#

especially since the arrow is pointing to the wrong thing

#

could be a good bug report if you can isolate and reproduce it
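For context, `array(pyobject, 1d, C)` in the error is how an object-dtype NumPy array shows up, which is what a pandas column of strings usually converts to. One common workaround, sketched here with a made-up `put_calls` array, is to encode the strings as integer codes before calling the jitted function and pass the codes instead. This only shows the encoding step with plain NumPy; it is not taken from the original code.

```python
import numpy as np

# Hypothetical string column, e.g. what df['putCall'].to_numpy() might return
put_calls = np.array(['put', 'call', 'call', 'put'], dtype=object)

# labels: sorted unique strings; codes: integer position of each element in labels
labels, codes = np.unique(put_calls.astype(str), return_inverse=True)

print(labels)   # lookup table to map codes back to strings outside the jitted code
print(codes)    # plain integer array, a type nopython mode can handle
```

Inside the compiled function you would then compare against the integer code rather than the string.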

modest rune Sep 15, 2020, 10:12 PM

#

yeah... I might be emotionally up for that if I can get my code to run 🙂

#

I keep biting off bigger and bigger chunks of code to numbify... its like pulling teeth.

orchid lintel Sep 15, 2020, 10:36 PM

#

This looks rad. Databricks for Dask, basically: https://coiled.io/

Coiled

Manon Michel

Coiled: Scaling Python Simply

Managing data science at scale, helping you run at maximum speed and minimum cost.

lapis sequoia Sep 15, 2020, 10:58 PM

#

nice

sinful dock Sep 15, 2020, 11:01 PM

#

Can anyone please help with this, I 've been stuck for a while:

df_diff = pd.DataFrame([1, 2 , 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 0, 8], columns = ['diff'], index=pd.date_range('1/1/2000', periods=24,freq = 'H'))

I'm trying to get the index locations for the beginning and **end** of the longest run of consecutive zeros; for the case above the answer would be (2000-01-01 09:00:00, 2000-01-01 18:00:00)
This is what I have so far:

```python
ref_val  = 0
shut_in_days = 7
for row, value in enumerate(df[col_name]):
    if value == ref_val:
        day_counter = 0
        # Check consecutive zero elements
        previous_value = abs(df[col_name][row-1]) if row > 0 else None
        current_value = abs(df[col_name][row])
        next_value = abs(df[col_name][row+1]) if row < len(df[col_name])-1 else None
        while (previous_value + current_value + next_value) ==0 and (day_counter > shut_in_days):
            row+=1
            day_counter+=1
        print(df.index[row], value) if row < len(df[col_name])-1 else None
    else:
        print(row, value)
```

velvet thorn Sep 15, 2020, 11:15 PM

#

@sinful dock this is what I did:

>>> zero_runs = (df.diff(1) != 0).cumsum()[df['diff'] == 0]
>>> zero_runs[zero_runs['diff'] == zero_runs.groupby('diff').size().idxmax()].index
DatetimeIndex(['2000-01-01 09:00:00', '2000-01-01 10:00:00',
               '2000-01-01 11:00:00', '2000-01-01 12:00:00',
               '2000-01-01 13:00:00', '2000-01-01 14:00:00',
               '2000-01-01 15:00:00', '2000-01-01 16:00:00',
               '2000-01-01 17:00:00', '2000-01-01 18:00:00'],
              dtype='datetime64[ns]', freq=None)

#

there might be a better way

#

I haven't used pandas in a long time
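A slightly more spelled-out variant of the same run-labelling idea, as a sketch only: give every stretch of equal "is zero" values its own id, then take the first and last timestamp of the longest zero stretch. The intermediate names (`is_zero`, `run_id`, `zero_runs`) are made up for the example, and the hourly index is built with a `Timedelta` so it does not depend on the frequency alias spelling.

```python
import pandas as pd

values = [1, 2, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 0, 8]
index = pd.date_range('2000-01-01', periods=len(values), freq=pd.Timedelta(hours=1))
df_diff = pd.DataFrame({'diff': values}, index=index)

is_zero = df_diff['diff'].eq(0)
# A new run starts whenever is_zero differs from the previous row
run_id = is_zero.ne(is_zero.shift()).cumsum()

# First timestamp, last timestamp and length of every run of zeros
timestamps = df_diff.index.to_series()
zero_runs = timestamps[is_zero].groupby(run_id[is_zero]).agg(['first', 'last', 'count'])

longest = zero_runs.loc[zero_runs['count'].idxmax()]
start, end = longest['first'], longest['last']
print(start, end)
```

If two runs tie for the longest, `idxmax` returns the earlier one.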

sinful dock Sep 15, 2020, 11:23 PM

#

@velvet thorn thanks, trying to digest it now

lapis sequoia Sep 15, 2020, 11:37 PM

#

Hi all. I have a question about pandas. How can you replace a series in a dataframe with another series? Essentially I took in a dataframe, then a series from that. I took in ANOTHER dataframe, and another series from that

#

I tried to combine these two series using the .append function. It seems like I can store this newly combined series in a variable, but I am now trying to replace the old series from the old DataFrame with the newly combined series. But nothing I am doing is working and pandas' documentation isn't leading me anywhere useful. Is this even possible in pandas?

#

Please let me know with a mention if you happen to have any idea. Thank you very much

#

Like, I realllllly thought that oldSeries = newSeries would do the trick but no can do

#

How do you mean?

#

For example, I'm trying the update function and this doesn't even seem to be working. When I so much as try to print out the new dataframe, the old series is still there even though I apparently updated it with the new one

sinful dock Sep 15, 2020, 11:58 PM

#

I guess, if you are trying to replace one series with another of similar len(), wouldn't you just say df['new_series'] = df2["new_series"] where df is the dataframe you want to keep and df2 is the other one? if you want to combine dataframes check these:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

lapis sequoia Sep 15, 2020, 11:59 PM

#

Thats what I meant by oldSeries = newSeries

#

Unfortunately, the newly combined series does not yet exist in a dataframe. I'm trying to insert it into the old dataframe in place of the original series, but it's stuck to the dataframe, seemingly

#

Here is one concern I had: can one series not be longer than the rest of the series in the dataframe? That's the only reason I can think of. If pandas disallows that it must be capping the series at some point, which is really frustrating

velvet thorn Sep 16, 2020, 12:23 AM

#

Here is one concern I had: can one series not be longer than the rest of the series in the dataframe? That's the only reason I can think of. If pandas disallows that it must be capping the series at some point, which is really frustrating
@lapis sequoia no, it cannot.

#

that wouldn't make sense

#

pandas is meant for tabular (i.e. rectangular) data

lapis sequoia Sep 16, 2020, 12:24 AM

#

That's unfortunate

velvet thorn Sep 16, 2020, 12:24 AM

#

if you need that, it's likely that you have to reconsider your data model

#

how about you explain what you're trying to do

#

might be an XY problem

lapis sequoia Sep 16, 2020, 12:24 AM

#

I will try

#

I have a master dataframe so to speak. It has a very specific format and I need to keep it in that format because I have a mastercode, if you will, that reads that format very specifically. So I don't want to have to change the master format at all

velvet thorn Sep 16, 2020, 12:25 AM

#

hm.

#

go on

lapis sequoia Sep 16, 2020, 12:26 AM

#

I recently got a new dataframe with some new products that I am trying to add to the master dataframe. The problem is that the new dataframe has different column names, a different number of columns - basically not in the same format as the master dataframe at all. HOWEVER, the new dataframe does have an identical "products" column that I would like to combine with the "products" column in the master dataframe

#

As well as any other columns in the new dataframe that the master dataframe also might have. But I need to fit everything into the master dataframe even if that means that some of the new dataframe columns are unnecessary and untouched

velvet thorn Sep 16, 2020, 12:30 AM

#

okay

#

so basically

#

you want to combine the two, leaving out unnecessary columns from the new DataFrame, and (possibly) mapping some column names

lapis sequoia Sep 16, 2020, 12:31 AM

#

One solution would be to tailor the entire new dataframe so that it's in the format of the old one, and use "" as placeholders

#

Yeah, basically

velvet thorn Sep 16, 2020, 12:31 AM

#

leaving null values where columns in the old DataFrame do not exist in the new one.

#

correct?

lapis sequoia Sep 16, 2020, 12:33 AM

#

That might work. So far it looks like the only column in the combined dataframe that is going to be 100% filled with no null values is the "products" one, the one I am directly adding to. Later on I will gather additional information that will help me fill the null values out, but I want to keep everything in the format of the old dataframe

#

I suppose one way to do this is to make a new dataframe where every value except for that in the "products" column is turned to null

velvet thorn Sep 16, 2020, 12:51 AM

#

hm.

#

I would suggest

#

replacing column names

#

then just using pd.concat

lapis sequoia Sep 16, 2020, 1:10 AM

#

That's what I meant basically

#

Thanks!

desert oar Sep 16, 2020, 1:20 AM

#

@left drift i think you need to give a specific example

#

its really not clear what you want

#

if you give example inputs and outputs it would help significantly

velvet thorn Sep 16, 2020, 1:21 AM

#

@desert oar did you get the wrong commander

desert oar Sep 16, 2020, 1:22 AM

#

yes

#

@lapis sequoia ^

lapis sequoia Sep 16, 2020, 1:25 AM

#

I can try. I'm sorry that I'm not being as specific.

read the new dataframe

Create variable for master series
create variable for new series.
Using .isin, I want to create another series that consists of the values in new series that are not already in master series. This is to remove overlaps, essentially.

Now I would like to essentially stack master series on top of the new series. I went about many ways to do this but now I see that it's not going to work because adding a series to dataframe breaks the rectangular shape of the dataframe, which pandas does not allow.

#

Sorry, that ended up not being code at ALL. I apologize, Im not looking at it right now

#

I tried using combinedseries = masterseries.append(new series)

#

It created a new series, but I couldn't simply type masterSeries = combinedseries

#

Or use the .replace documentation either

#

so it looks like I'll have to concatenate two dataframes

carmine iron Sep 16, 2020, 1:35 AM

#

Can anyone help with ML basics

desert oar Sep 16, 2020, 1:36 AM

#

@lapis sequoia ok, i agree that pd.concat seems correct. let me work an example, one moment

#

actually it looks like you get it
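For reference, a rough sketch of the rename, filter and concat steps discussed above. All column names other than `products` (for example `Product Name`, `price`, `supplier`, `warehouse`) are invented for the example; the real frames will differ.

```python
import pandas as pd

master = pd.DataFrame({
    'products': ['apple', 'banana'],
    'price': [1.0, 0.5],
    'supplier': ['X', 'Y'],
})
new = pd.DataFrame({
    'Product Name': ['banana', 'cherry', 'date'],
    'price': [0.6, 3.0, 2.5],
    'warehouse': ['N', 'S', 'E'],
})

# 1. Map the new frame's column names onto the master's names
new = new.rename(columns={'Product Name': 'products'})

# 2. Keep only products that are not already in the master frame
new_only = new[~new['products'].isin(master['products'])]

# 3. Force the master's column layout: extra columns are dropped,
#    missing ones are created and filled with NaN
new_only = new_only.reindex(columns=master.columns)

# 4. Stack the rows; the master format is unchanged
combined = pd.concat([master, new_only], ignore_index=True)
print(combined)
```

The new rows have NaN in `supplier`, which can be filled in later without touching the column layout.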

lapis sequoia Sep 16, 2020, 2:32 AM

#

anyone have experience with px.scatter_mapbox?

#

I cannot for the life of me get two figures with different data sets to appear on the same map

modest rune Sep 16, 2020, 2:49 AM

#

Question: I am close to maximizing my speed improvements from numba. I have a few different things I want to try. One of them is... run some of my parallel code on my gpu.

So, regarding that idea. Am I going to waste my time if I try to do that on a dell laptop running windows with whatever video card comes stock with the laptop?

deft harbor Sep 16, 2020, 4:49 AM

#

Most likely

bitter harbor Sep 16, 2020, 4:49 AM

#

How so

#

I can’t imagine dell would/could stop you from doing it

deft harbor Sep 16, 2020, 4:51 AM

#

Unless I read it incorrectly, his focus is speed

bitter harbor Sep 16, 2020, 4:52 AM

#

Yes and?

deft harbor Sep 16, 2020, 4:52 AM

#

I don't think there is going to be a massive speed improvement on a laptop that has a good chance of having no real gpu

#

That's why I said most likely to is there a chance he is wasting time

bitter harbor Sep 16, 2020, 4:54 AM

#

The amount of speed you get would depend on which gpu 100%, but an increase is still an increase, and correct me if I’m wrong, but the newer dells ship with 2060-80’s (maybe 30.0’s too idk)

deft harbor Sep 16, 2020, 4:55 AM

#

I haven't used a laptop in years, so I'm not sure what the standard dell would ship with these days 🤷‍♂️

bitter harbor Sep 16, 2020, 4:55 AM

#

Just because you don’t have a titan, doesn’t mean it’s not worthwhile

deft harbor Sep 16, 2020, 4:56 AM

#

I suppose it depends on what you are doing and how you account for the time it takes to set things up

bitter harbor Sep 16, 2020, 4:56 AM

#

Also I’ve got a like 7 yo amd laptop gpu but it still helps

#

I suppose it depends on what you are doing
Ya that too, but unless you’re trying to take some brute force approach (or something similar), it doesn’t need to be that powerful

deft harbor Sep 16, 2020, 4:58 AM

#

Transferring GPT-3 for a conditional GAN clearly

#

😅

austere swift Sep 16, 2020, 5:13 AM

#

even really slow gpus beat pretty fast cpus in a lot of applications

brittle agate Sep 16, 2020, 5:45 AM

#

Can I use Keras Tuner with transfer learning (for example, a model from TF Hub)?

austere swift Sep 16, 2020, 5:59 AM

#

keras tuner requires you to put some hp arguments within the model generator

#

so you wouldnt be able to

carmine finch Sep 16, 2020, 6:23 AM

#

does anyone know how to use github?

velvet thorn Sep 16, 2020, 6:28 AM

#

does anyone know how to use github?
@carmine finch most people know how to use GitHub, but I think you want #tools-and-devops

carmine finch Sep 16, 2020, 6:34 AM

#

ok so now u want me to go to a diff section

velvet thorn Sep 16, 2020, 6:38 AM

#

I am suggesting

#

this channel is for data science

solid mantle Sep 16, 2020, 6:40 AM

#

Does pandas only read csv?

#

I want to read a .out file using pandas. Can I do that?

velvet thorn Sep 16, 2020, 6:40 AM

#

what is an .out file?

#

that doesn't seem like an extension I know of

solid mantle Sep 16, 2020, 6:40 AM

#

Its a text file formal

velvet thorn Sep 16, 2020, 6:40 AM

#

what produced it?

solid mantle Sep 16, 2020, 6:40 AM

#

Format

#

Just some c code

#

It can be read using numpy normally, like np.loadtxt('filename.out')

#

I'm literally seeing only one command pd.read_csv in pandas

#

I believe you can treat out file like a txt file

#

Then how do you read a txt file using pandas?

brittle agate Sep 16, 2020, 6:43 AM

#

so you wouldnt be able to
@austere swift
Thank u, but I can add TF Hub model and else few layers to the model and add Tuner?

austere swift Sep 16, 2020, 6:44 AM

#

Yeah you can add tuner to the layers you add

velvet thorn Sep 16, 2020, 6:44 AM

#

Then how do you read a txt file using pandas?
@solid mantle the documentation of np.loadtxt says whitespace is used as a delimiter

brittle agate Sep 16, 2020, 6:44 AM

#

Nice, and it will be work fine?

velvet thorn Sep 16, 2020, 6:44 AM

#

my guess is you could use sep=' ' in pd.read_csv

solid mantle Sep 16, 2020, 6:45 AM

#

Yea my file is not a csv though

#

How can i read a txt file using pandas?

velvet thorn Sep 16, 2020, 6:45 AM

#

no, my point is

#

if you can use np.loadtxt, you should be able to use pd.read_csv with sep=' '

#

pd.read_csv just reads files which use a character as a separator

#

it doesn't have to be a comma (the "c" in "csv")

solid mantle Sep 16, 2020, 6:46 AM

#

You can read non CSV files with read_csv?

velvet thorn Sep 16, 2020, 6:46 AM

#

it can be, for example, a tab (that would be TSV), or something else

#

extensions are hints

solid mantle Sep 16, 2020, 6:46 AM

#

@velvet thorn dm

velvet thorn Sep 16, 2020, 6:47 AM

#

the extension of a file is one thing, its format another

solid mantle Sep 16, 2020, 6:47 AM

#

Ohh?

#

That's new info

#

I'll check

#

It worked xD
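A tiny illustration of the point about separators, using an in-memory text instead of a real `.out` file. The regex separator `\s+` (one or more whitespace characters) is an assumption that tends to be more forgiving than `sep=' '` when columns are padded with several spaces; `header=None` assumes the file has no header row.

```python
import io
import pandas as pd

# Stand-in for the contents of a whitespace-delimited .out file
text = "1.0   2.0 3.0\n4.0 5.0   6.0\n"

df = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None)
print(df)
```

With a real file you would pass the path instead of the `StringIO` object.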

craggy light Sep 16, 2020, 6:50 AM

#

the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. (from https://en.wikipedia.org/wiki/Comma-separated_values)

solid mantle Sep 16, 2020, 6:50 AM

#

Thankyou

mild topaz Sep 16, 2020, 7:07 AM

#

hello, i am doing image classification on dog and cat breeds. my folder structure is this way:
```
demo -->
  training --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3

  testing  --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3
```
is this the correct folder structure for image classification of dog and cat breeds?

brittle agate Sep 16, 2020, 7:07 AM

#

Yeah you can add tuner to the layers you add
@austere swift
Nice, and it will be work fine?

lilac ferry Sep 16, 2020, 7:08 AM

#

I keep on getting the attachment as error while trying to run sklearn.model_selection.learning_curve. The target variable is of binary class and the minority class has a frequency of ~3.7%.

📎 unknown.png

lilac ferry Sep 16, 2020, 7:28 AM

#

solved it! I needed to increase the minimum size of my training set. I was taking too few samples and that resulted in just one class.

brittle agate Sep 16, 2020, 9:31 AM

#

hello, i am doing image classification on dog and cat breeds. my folder structure is this way:
```
demo -->
  training --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3

  testing  --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3
```
is this the correct folder structure for image classification of dog and cat breeds?

@mild topaz
Folder: Demo
Dirs: Dog, Cats.

#

U need to split data with np, pt or tf.

#

It's much easier.

grave frost Sep 16, 2020, 10:24 AM

#

Anyone know how to read an array with dtype tf.int64? I can't figure it out. Tried tf.print() but that gave the same output (the type, dtype and shape of the array), which looks like this: <BatchDataset shapes: ((1, 50), (1, 50)), types: (tf.int64, tf.int64)>

mild topaz Sep 16, 2020, 10:39 AM

#

U need to split data with np, pt or tf.
@brittle agate can u explain a bit how i can do this?

odd yoke Sep 16, 2020, 10:45 AM

#

the object you have is not a tensor @grave frost it's a dataset, you can use .as_numpy_iterator() and cast to list or whatever, or consume it using .take or a for loop

hasty grail Sep 16, 2020, 11:05 AM

#

^

lapis sequoia Sep 16, 2020, 11:47 AM

#

Hey guys, I'm trying to delete rows where a column has a value > 30. I found this on the internet and it works for one column, but when I add the other one with & neither of the two works (I plot it afterwards to see if it has worked)

#

```python
data = pd.read_csv('data.tsv', delimiter='\t', usecols=['WORKDAYS','MJDAY30A'])

# Get the indexes of rows where the MJDAY30A column has a value greater than 30
indexmjday = data[(data['MJDAY30A'] > 30)].index

# Delete these row indexes from the DataFrame
data.drop(indexmjday, inplace=True)
```

#

so this works

#

but when I add

#

indexmjday= data[ (data['MJDAY30A'] > 30) & (data['WORKDAYS'] >30)].index

#

it doesnt work anymore

velvet thorn Sep 16, 2020, 11:51 AM

#

huh

#

why not just do this:

raw_data = pd.read_csv('data.tsv', delimiter='\t', usecols=['WORKDAYS','MJDAY30A'])
data = raw_data[(raw_data['WORKDAYS'] <= 30) & (raw_data['MJDAY30A'] <= 30)]
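
A note on why the `&` version appeared to do nothing: `&` selects only rows where both columns exceed 30, so rows where just one column is too large survive the drop. Dropping with `|` gives the same rows as the filter above (apart from rows with missing values, which the two approaches treat differently). A minimal sketch, with a made-up in-memory frame standing in for `data.tsv`:

```python
import pandas as pd

# Hypothetical stand-in for data.tsv; the values are made up for illustration.
data = pd.DataFrame({
    "WORKDAYS": [5, 40, 10, 45],
    "MJDAY30A": [3, 2, 93, 91],
})

# `&` matches only rows where BOTH columns exceed 30 (just the last row here).
both_over = data[(data["WORKDAYS"] > 30) & (data["MJDAY30A"] > 30)].index

# `|` matches rows where EITHER column exceeds 30 (the last three rows).
either_over = data[(data["WORKDAYS"] > 30) | (data["MJDAY30A"] > 30)].index

dropped = data.drop(either_over)

# Keeping rows where both columns are <= 30 gives the same result.
kept = data[(data["WORKDAYS"] <= 30) & (data["MJDAY30A"] <= 30)]
```

Here `dropped` and `kept` contain the same single row, while dropping `both_over` would have removed only one of the three offending rows.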

lapis sequoia Sep 16, 2020, 11:53 AM

#

well idk aha im noob

#

so this delete the data I guess

#

it worked, thanks vm I will remember it

brittle agate Sep 16, 2020, 12:56 PM

#

@brittle agate can u explain a bit how i can do this?
@mild topaz
Okay, I work with fork TensorFlow:
How I'm doing it:

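import tensorflow_datasets as tfds

# Loading the collection (1st parameter), splitting the data, and adding some standard options (3rd and 4th parameters).
# with_info=True makes tfds.load return (datasets, info), so the info object is unpacked separately.
(train_ds, test_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)

# Adding validation, but it's not a simple validation split. It's 10-fold cross-validation.
# The split strings have to be passed to tfds.load; they cannot be assigned to an existing dataset.
val_splits = [
    f'train[{k}%:{k + 10}%]' for k in range(0, 100, 10)
]
val_ds = tfds.load('tf_flowers', split=val_splits, as_supervised=True)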

#

I think that adding data from local dir it's not big deal.

#

Need just to google.

modest rune Sep 16, 2020, 12:58 PM

#

Ok, I don't really understand something. In Numba, there is the function numba.prange() that, if I understand things properly, is supposed to speed up for loops (and maybe other loops?) by parallelizing their execution. When I use prange() my loops are significantly faster. But, if I use prange() and set the @njit(parallel=True), usually things slow down.

And then, there is @njit(nogil=True)... which is related to parallel execution too.

So, it seems like numba runs code in parallel in many situations, regardless of your nogil and parallel settings? And what is the point of setting parallel=True if nogil=False?

And, are there other functions like prange() that I should be aware of, that if I use instead of core python functions or common numpy functions will improve my speeds further?

brittle agate Sep 16, 2020, 1:00 PM

#

@mild topaz
If I want to use simple validation(not k-fold and so on).

# with_info=True returns (datasets, info), so info is unpacked separately.
(train_ds, val_ds, test_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)

misty fjord Sep 16, 2020, 1:01 PM

#

I wonna be a data scientist but from where I should start 😭 😭 😭

modest rune Sep 16, 2020, 1:03 PM

#

@misty fjord that is a life level question. There are a lot of good ways to approach your problem.

brittle agate Sep 16, 2020, 1:03 PM

#

@misty fjord
U need to know the theory. Basic architecture of NNs. What bias, weights, input, output and activation functions are.

#

U can read Deep Learning with Python.

modest rune Sep 16, 2020, 1:04 PM

#

If I were to rewind my life and redo things, I would have taken a different approach to my learning, but the path I did take, worked well enough for engineering.

brittle agate Sep 16, 2020, 1:04 PM

#

Go to the site Tensorflow. U can find plan of education for beginners.

misty fjord Sep 16, 2020, 1:05 PM

#

I > @misty fjord

U need to know the theory. Basic architecture of NNs. What bias, weights, input, output and activation functions are.
@brittle agate nns !? 😭

brittle agate Sep 16, 2020, 1:05 PM

#

Neural Networks.

#

S - plural.

#

I mean.

lapis sequoia Sep 16, 2020, 1:06 PM

#

how does data science relate to neural networks? I thought data science was more statistical analysis of data and that neural networks were more like machine learning stuff

modest rune Sep 16, 2020, 1:07 PM

#

If I were you, I would find what you think would be:
a. perfect job
b. perfect type of problem to solve
c. most enjoyable thing to work on all day

Then, ask questions about what you want to do working backwards from there. It seems to me that data-science is quite a broad field and Machine Learning may or may not be the path you want to take.

raw rapids Sep 16, 2020, 1:07 PM

#

data science contains both machine learning and statistical analysis

misty fjord Sep 16, 2020, 1:07 PM

#

Ha Ok . Can you give me a path. Or should I just start and find it my self > S - plural.
@brittle agate 😭 😢 😭 😢

lapis sequoia Sep 16, 2020, 1:08 PM

#

oh okay

#

I guess it uses the data analysis to better calibrate the neural network or something like that

misty fjord Sep 16, 2020, 1:08 PM

#

If I were you, I would find what you think would be:
a. perfect job
b. perfect type of problem to solve
c. most enjoyable thing to work on all day

Then, ask questions about what you want to do working backwards from there. It seems to me that data-science is quite a broad field and Machine Learning may or may not be the path you want to take.
@modest rune I'm the c type of ppl I'm just enjoying doing python

brittle agate Sep 16, 2020, 1:08 PM

#

@lapis sequoia
Yep!

raw rapids Sep 16, 2020, 1:08 PM

#

tune is used more often btw

#

instead of calibrate

brittle agate Sep 16, 2020, 1:09 PM

#

https://tenor.com/view/michael-scott-wink-yes-आँख-मारना-gif-5795910

Tenor

lapis sequoia Sep 16, 2020, 1:10 PM

#

hmm I see thanks, Im a real beginner rn but I'd like to go into neural networks, should I dive directly into it or is it better for me to train with statistical analysis pandas matplotlib etcc 1st ??

odd yoke Sep 16, 2020, 1:10 PM

#

are you familiar with a programming language ?

lapis sequoia Sep 16, 2020, 1:10 PM

#

I already know about inferential statistics, just not about programming

modest rune Sep 16, 2020, 1:10 PM

#

@lapis sequoia If you like programming, or specifically like programming in python, and you like data... find something you can do to earn money doing those things. Specifically, find a company you like that hires people to do exactly that. Then, ask them or someone who works for them, what you should be learning.

lapis sequoia Sep 16, 2020, 1:10 PM

#

python basic that's all

modest rune Sep 16, 2020, 1:11 PM

#

In the end, it is all a journey, you will start down path A, but as things evolve, you will end up on path G.

misty fjord Sep 16, 2020, 1:12 PM

#

@lapis sequoia If you like programming, or specifically like programming in python, and you like data... find something you can do to earn money doing those things. Specifically, find a company you like that hires people to do exactly that. Then, ask them or someone who works for them, what you should be learning.
@modest rune if there such companies in where I live I'll already done that

brittle agate Sep 16, 2020, 1:12 PM

#

hmm I see thanks, Im a real beginner rn but I'd like to go into neural networks, should I dive directly into it or is it better for me to train with statistical analysis pandas matplotlib etcc 1st ??
@lapis sequoia
I think that not. Yeah, of course u need to know data structures of python. But today. Forks of ML is so easy.

#

U can choose TensorFlow or PyTorch. I recommend to start with TF, because:

TF has many guides(official site).
Ecosystem: TF Hub, TF Board and Datasets.
He is very easy fork, because above TF u're using Keras.

#

High API fork for the easy work with backend forks like PyTorch, Theano, TensorFlow and so on.

#

U can use TensorFlow for the Production.

#

He has coral, lite(mobile special) and .js versions.

odd yoke Sep 16, 2020, 1:19 PM

#

really torch is definitely not far behind, if at all, on most of these points

brittle agate Sep 16, 2020, 1:19 PM

#

Yeah I know.

#

With Caffe 2 is so hot.

#

But I prefer TensorFlow.

lapis sequoia Sep 16, 2020, 1:21 PM

#

I appreciate the help, thank u. Tho I dont understand every of your points tho lol what does "fork" means aha ?

The thing is as with everything since ive learned python Im kind of lost into what should I do like should I find a project that would interest me to do with tensor flow or should I follow broad theory courses with random exercises ?

brittle agate Sep 16, 2020, 1:21 PM

#

@odd yoke
I think, that he is more friendly with newest people.

odd yoke Sep 16, 2020, 1:21 PM

#

the edge TF has is deploying on different systems with tflite, or tf.js but other than that:

torch also has a bunch of guides
torch has a ton of premade models, datasets etc as well
it has a very similar api to keras, while being significantly faster
and it's definitely usable in production

brittle agate Sep 16, 2020, 1:21 PM

#

@lapis sequoia
I mean framework.

lapis sequoia Sep 16, 2020, 1:21 PM

#

ooh okay lol

odd yoke Sep 16, 2020, 1:22 PM

#

tf is actually much better now but not that long ago it was an absolute mess

#

if you ever have the opportunity to use tf1.XX, do it, it's an enlightening experience

brittle agate Sep 16, 2020, 1:23 PM

#

the edge TF has is deploying on different systems with tflite, or tf.js but other than that:

torch also has a bunch of guides

torch has a ton of premade models, datasets etc as well

it has a very similar api to keras, while being significantly faster

and it's definitely usable in production
@odd yoke
I recommended TensorFlow because I know him much better(and I like him :3). But he can choice another fork.

odd yoke Sep 16, 2020, 1:23 PM

#

you will see what bad design is

brittle agate Sep 16, 2020, 1:23 PM

#

tf is actually much better now but not that long ago it was an absolute mess
@odd yoke
TF 1.x was so stoopid.

odd yoke Sep 16, 2020, 1:23 PM

#

yet it's what my company uses pensivewobble

brittle agate Sep 16, 2020, 1:24 PM

#

you will see what bad design is
@odd yoke
But we talking about present.

#

And for beginner is good choice.

odd yoke Sep 16, 2020, 1:24 PM

#

really i don't think there's any major difference anymore between torch and tf ever since tf2 came out

#

both are completely fine

lapis sequoia Sep 16, 2020, 1:25 PM

#

ive found the pdf of the book "hands on machine learning", do you recommend me to start with that without any prior experience or to explore here and there by myself ?

brittle agate Sep 16, 2020, 1:25 PM

#

@odd yoke

I recommended TensorFlow because I know him much better(and I like him :3). But he can choice another fork.

#

But yeah what u said is true.

#

of course :D

#

ive found the pdf of the book "hands on machine learning", do you recommend me to start with that without any prior experience or to explore here and there by myself ?
@lapis sequoia
Deep Learning with Python.

odd yoke Sep 16, 2020, 1:26 PM

#

I personally am not a fan of starting directly with code because, in my experience, i see a tendency in ppl that do that to never go back to see the fundamentals and theory behind the algorithms

brittle agate Sep 16, 2020, 1:27 PM

#

What is ppl?

odd yoke Sep 16, 2020, 1:27 PM

#

people

brittle agate Sep 16, 2020, 1:27 PM

#

kk

lapis sequoia Sep 16, 2020, 1:27 PM

#

hmm yeah like they apply it as a method or recipe without really knowing what they are doing

#

@brittle agate is it a book ?

brittle agate Sep 16, 2020, 1:28 PM

#

Yep)

lapis sequoia Sep 16, 2020, 1:28 PM

#

ok thanks will look into it

odd yoke Sep 16, 2020, 1:29 PM

#

there's "deep learning with pytorch" which is free and actually is a nice mix of theory and programming

#

and it's free

brittle agate Sep 16, 2020, 1:29 PM

#

Yeah, it's first edition. Second edition in process.

odd yoke Sep 16, 2020, 1:29 PM

#

it's from the pytorch people too

brittle agate Sep 16, 2020, 1:29 PM

#

@lapis sequoia

Deep Learning with Python.
Yeah, it's first edition. Second edition in process.

odd yoke Sep 16, 2020, 1:29 PM

#

https://pytorch.org/deep-learning-with-pytorch you can get it here if you're comfortable giving some of your information; they don't check it, you directly get a download link to the pdf

brittle agate Sep 16, 2020, 1:30 PM

#

it's from the pytorch people too
@odd yoke
This book is actual for today?

odd yoke Sep 16, 2020, 1:30 PM

#

it came out 3 months ago

brittle agate Sep 16, 2020, 1:30 PM

#

I mean, PyTorch has changed too, like TF.

#

it came out 3 months ago
@odd yoke Really?

#

XD

#

I didn't know.

odd yoke Sep 16, 2020, 1:30 PM

#

2 months*

#

4 august

brittle agate Sep 16, 2020, 1:30 PM

#

LOL

#

https://tenor.com/view/doom-slayer-carlton-dance-moves-grooves-gif-16257196

Tenor

#

2 months*
@odd yoke
After this words I want to know too PyTorch how TF)) :3

lapis sequoia Sep 16, 2020, 1:32 PM

#

@brittle agate do you know if the author is Francois Chollet or if it's Nikhil ketkar ? I found 2 books with this name

#

thanks guys I'll do some research and make my choice

brittle agate Sep 16, 2020, 1:33 PM

#

Francois Chollet

lapis sequoia Sep 16, 2020, 1:33 PM

#

ok thanks 😉

odd yoke Sep 16, 2020, 1:33 PM

#

francois chollet is the original creator of keras, so it's not some random dude too

brittle agate Sep 16, 2020, 1:34 PM

#

But some methods are harder than in TF.
For example K-Fold. U don't know what is that right now. You will find out soon x)

#

francois chollet is the original creator of keras, so it's not some random dude too
@odd yoke
I read only original book by him.

#

And yeah, all guides for beginners, on the TF official site, Francois Chollet coded.

lapis sequoia Sep 16, 2020, 1:35 PM

#

hmm great I also found it for free online!

#

crazy internet

modest rune Sep 16, 2020, 1:36 PM

#

@odd yoke thoughts on my question above?

odd yoke Sep 16, 2020, 1:36 PM

#

the numba prange question ?

modest rune Sep 16, 2020, 1:36 PM

#

yes

#

Ok, I don't really understand something. In Numba, there is the function numba.prange() that, if I understand things properly, is supposed to speed up for loops (and maybe other loops?) by parallelizing their execution. When I use prange() my loops are significantly faster. But, if I use prange() and set the @njit(parallel=True), usually things slow down.

And then, there is @njit(nogil=True)... which is related to parallel execution too.

So, it seems like numba runs code in parallel in many situations, regardless of your nogil and parallel settings? And what is the point of setting parallel=True if nogil=False?

And, are there other functions like prange() that I should be aware of, that if I use instead of core python functions or common numpy functions will improve my speeds further?

odd yoke Sep 16, 2020, 1:36 PM

#

iirc prange if there's no parallel=True is the same as range

#

i'm not sure what could cause slow downs

modest rune Sep 16, 2020, 1:38 PM

#

I have about 6 nested functions that are all decorated with @njit. I get the best performance if only the top level function is set to parallel=True and all of my loops use prange() instead of range(). If I make all 6 functions parallel=True, massive slowdown.

#

Like 50x

brittle agate Sep 16, 2020, 1:39 PM

#

crazy internet
@lapis sequoia
PDF Drive is good storage of books for free :))))))))))))))

modest rune Sep 16, 2020, 1:39 PM

#

My guess is that numba is spinning up too many parallel OS processes when I set too much stuff to parallel, and this is not ideal.

#

just too much overhead created for the amount of cores my laptop has.

brittle agate Sep 16, 2020, 1:41 PM

#

@odd yoke
Q:
When need to use Flatten?

#

After Conv Layers is good practice to use Flatten or not?

odd yoke Sep 16, 2020, 1:42 PM

#

it depends on the architecture you're building

brittle agate Sep 16, 2020, 1:42 PM

#

For example transfer learning(ResNet50).

odd yoke Sep 16, 2020, 1:42 PM

#

FCN for example don't have any fully connected layers, so you would not need to flatten

brittle agate Sep 16, 2020, 1:43 PM

#

I know.

odd yoke Sep 16, 2020, 1:43 PM

#

For a regular resnet for something like classification, yeah, you would need to flatten

brittle agate Sep 16, 2020, 1:43 PM

#

For multi-class classification.

#

I mean.

#

For a regular resnet for something like classification, yeah, you would need to flatten
@odd yoke
One time after Conv Layers?

odd yoke Sep 16, 2020, 1:43 PM

#

yes

brittle agate Sep 16, 2020, 1:44 PM

#

Okay.

#

Thanks :3

#

https://tenor.com/view/sirenhead-dance-meme-monster-gif-17181272

Tenor

odd yoke Sep 16, 2020, 1:44 PM

#

you can have like
conv -> relu -> maxpool -> conv -> relu -> maxpool -> ... -> flatten -> dense -> relu -> dense -> softmax
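
To make the role of the flatten step concrete: convolution and pooling stages output a 4-D batch of feature maps, while a dense layer expects one vector per sample. A rough NumPy sketch of just the shape change (the sizes and random weights are made up, and this is not how Keras or PyTorch implement their layers internally):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend output of the last conv/pool stage: (batch, height, width, channels).
feature_maps = rng.normal(size=(8, 7, 7, 64))

# Flatten keeps the batch axis and merges all the others into one vector.
flat = feature_maps.reshape(feature_maps.shape[0], -1)

# A dense layer is then a matrix multiply on those vectors (10 output classes here).
weights = rng.normal(size=(flat.shape[1], 10))
logits = flat @ weights
```

That is why a single flatten between the convolutional part and the dense part is enough.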

#

obviously for a resnet it's much more complex

brittle agate Sep 16, 2020, 1:45 PM

#

Okay, thanks)

#

obviously for a resnet it's much more complex
@odd yoke
But I can use already Transfer Learning and I don't need to make residual connections))

odd yoke Sep 16, 2020, 1:46 PM

#

if the model is already made, then yeah you don't have to bother

brittle agate Sep 16, 2020, 1:46 PM

#

But, thanks anyway.

#

Have a good day.

dense copper Sep 16, 2020, 3:42 PM

#

hey all, is there a way to have pandas only calculate pct_change for the first row of a group in a groupby group?

#

or more simply maybe, given something like this:

             A   B   C   D
2000-01-02  14   5  20  14
2000-01-09   4   2  20   3
2000-01-16   5  54   7   6
2000-01-23   4   3  21   2
2000-01-30   1   2   8   6
2000-02-06  55  32   5   4

#

can I calculate pct_change for only like, 2000-01-30 to 2000-02-06?

#

this example is from here: https://www.geeksforgeeks.org/python-pandas-dataframe-pct_change/ - what I'd like to do is skip the calculation for everything except the bottom row
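
One possible way to do this is to skip `pct_change` and divide the last two rows directly. A minimal sketch that rebuilds the table above by hand (the weekly index is reconstructed from the dates shown):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [14, 4, 5, 4, 1, 55],
        "B": [5, 2, 54, 3, 2, 32],
        "C": [20, 20, 7, 21, 8, 5],
        "D": [14, 3, 6, 2, 6, 4],
    },
    index=pd.date_range("2000-01-02", periods=6, freq="7D"),
)

# Percentage change from the second-to-last row to the last row, per column.
last_change = df.iloc[-1] / df.iloc[-2] - 1

# Same numbers via pct_change, but computed on the last two rows only.
same_via_pct_change = df.iloc[-2:].pct_change().iloc[-1]
```

Both results are a Series with one value per column, covering only the 2000-01-30 to 2000-02-06 step.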

halcyon vale Sep 16, 2020, 3:50 PM

#

https://www.linkedin.com/posts/thinam-tamang-3b12831a2_66daysofdata-datascience-machinelearning-activity-6712018164508590080-_u5e

Thinam Tamang posted on LinkedIn

Day 13 of #66DaysOfData! with Ken Jee

Supervised Classification :
Classification is the process of choosing the correct class label for a given input...

solar phoenix Sep 16, 2020, 3:57 PM

#

Can someone help me with a pandas dataframe, I want to search a single column with a string, then output the value of a different column of the same row

#

or maybe just select the whole row
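
A common pattern for this is a boolean mask combined with `.loc`. A minimal sketch; the frame and the column names `name` and `city` are placeholders, not from the question:

```python
import pandas as pd

# Made-up example data; the column names are placeholders.
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "city": ["Oslo", "Lima", "Pune"],
})

# Boolean mask: True for rows whose "name" equals the search string.
mask = df["name"] == "bob"

# Value(s) of a different column in the matching row(s).
cities = df.loc[mask, "city"]

# The whole matching row(s).
rows = df.loc[mask]

# Substring search instead of an exact match.
contains_o = df.loc[df["name"].str.contains("o"), "city"]
```

`cities` is a Series because more than one row can match; `cities.iloc[0]` would pull out a single value once it is known that exactly one row matches.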

tulip briar Sep 16, 2020, 4:41 PM

#

Hello guys, i'm programming just as a hobby and Python is the only language i know. I recently came to the conclusion that i would like to create my own Neural Network, just to see how it works and how hard it is to make one. I knew that i need data, so i gathered a lot of numeric data regarding Character Auctions in some MMORPG. Now i would like to create the NN which, based on ended auctions (the data that i have), will be able to predict the price of a given character. For now i'm using parameters such as: Level, Skills, Achievement Points, Charms Points, Gold (everything measured in numbers). I also made numeric data related to Vocation and World (server) (i gave a different number to every voc and server). When it comes to items etc. i skipped that part for now. I realise that this is really hard, so for now i just leave it. I would like to add it in the future, when let's say the "alpha version" works.
The problem is - i don't really know where to start. I've seen a bunch of articles which touch on this topic but none of them was "it". Are you able to recommend me some sources from which i can learn creating a NN from scratch?

desert oar Sep 16, 2020, 4:51 PM

#

@tulip briar neural networks fundamentally require a bit of linear algebra and calculus to understand. you can get started without knowing those things, but you might get stuck quickly. maybe check out https://www.fast.ai/ for some intro material oriented towards people with programming experience and toward getting you into neural networks (and deep learning) quickly

#

i also recommend subscribing to https://towardsdatascience.com/ which has a lot of tutorial-level content

#

eventually if you want to get seriously into data science and machine learning you will need to start understanding statistics and probability in addition to linear algebra and calculus. but that's farther down your learning path and you will probably start learning those concepts as you go along anyway

grave frost Sep 16, 2020, 4:58 PM

#

the object you have is not a tensor @grave frost it's a dataset, you can use .as_numpy_iterator() and cast to list or whatever, or consume it using .take or a for loop
@odd yoke Could you elaborate a bit more?

odd yoke Sep 16, 2020, 5:01 PM

#

what you are trying to print is a tf.data.Dataset object

#

not a tf.Tensor

grave frost Sep 16, 2020, 5:01 PM

#

Ohh.. right.

#

I was treating it as a tensor

serene scaffold Sep 16, 2020, 5:11 PM

#

the docs for numpy mention axes a lot

#

if I want to find the average of all elements in a matrix, regardless of the shape of that matrix, does that mean the axis is zero or what?

#

[[1, 2], [3, 4]], I would want the average to be sum([1, 2, 3, 4]) / 4 even if it were [1, 2, 3, 4] or [[1], [2], [3], [4]]

#

for what I'm doing, of course, not necessarily in general

pale thunder Sep 16, 2020, 5:13 PM

#

I think there is a flatten kwarg, or just .reshape((-1,))

serene scaffold Sep 16, 2020, 5:15 PM

#

I guess I can use that for now but I figured axes held the key

pale thunder Sep 16, 2020, 5:18 PM

#

oh, you can use .flat
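
For the record, no flattening should be needed here: `np.mean` defaults to `axis=None`, which averages over every element whatever the shape. A small sketch using the three shapes from the question:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([1, 2, 3, 4])
c = np.array([[1], [2], [3], [4]])

# With the default axis=None, the mean is taken over every element.
means = [float(np.mean(x)) for x in (a, b, c)]

# Passing an axis reduces along that axis only.
column_means = a.mean(axis=0)  # one mean per column
row_means = a.mean(axis=1)     # one mean per row
```

So the axis argument only matters when a per-row or per-column result is wanted.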

serene scaffold Sep 16, 2020, 5:45 PM

#

also I have a data frame with a column called 'Binary Label', but I need to delete that series for one part

#

the docs, unsurprisingly, mention axes

#

that series is all booleans though so I might be able to keep it if I can represent them as 1.0 and 0.0 in a numpy matrix
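
Both options can be sketched briefly. In `DataFrame.drop`, the axis decides whether labels refer to rows or columns, and `columns=` is a shortcut that avoids thinking about it. The frame below is made up; only the column name `Binary Label` comes from the question:

```python
import pandas as pd

# Made-up frame; "Binary Label" matches the column name from the question.
df = pd.DataFrame({
    "f1": [0.2, 0.3, 0.1],
    "Binary Label": [False, True, False],
})

# columns=... (equivalent to axis=1) drops a column; axis=0 would drop rows by index label.
features = df.drop(columns=["Binary Label"])

# Booleans convert to 1.0 / 0.0, so the frame can become an all-float matrix.
as_float = df.astype(float).to_numpy()
```

`drop` returns a new frame by default, so `df` still has the label column afterwards.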

tulip briar Sep 16, 2020, 5:57 PM

#

@desert oar I don't have much trouble with math, i'm just a total newbie in coding :-D. I have been learning about tensorflow, did some udemy course about Pandas but didn't find any interesting guide / course about NN.

serene scaffold Sep 16, 2020, 6:09 PM

#

looks like representing booleans as floats is fine

#

I need to do calculations for elements that are not null and for which the corresponding boolean element for that row is correct

#

I'm sure I'm not phrasing this effectively

#

let's turn it into an xy problem

#

[[3., 2., 1.],
 [5., 4., 0.],
 [6., 1., 1.]]

[[1., 1., 1.],
 [0., 0., 0.],
 [1., 1., 1.]]

#

if I can go from the first to the second, I can implicitly solve the actual problem.
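
Taken literally, going from the first matrix to the second is a matter of repeating the last column across each row. A minimal NumPy sketch using the matrix above:

```python
import numpy as np

a = np.array([
    [3., 2., 1.],
    [5., 4., 0.],
    [6., 1., 1.],
])

# a[:, -1:] keeps the last column as shape (3, 1); repeating it along axis 1
# copies each row's last element across the whole row.
b = np.repeat(a[:, -1:], a.shape[1], axis=1)

# The same column also works directly as a boolean row mask.
selected_rows = a[a[:, -1] == 1.0]
```

The row mask in the last line is often the more direct tool when the goal is to compute something only over rows with a given flag.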

desert oar Sep 16, 2020, 6:28 PM

#

@serene scaffold i dont understand either version of the problem

serene scaffold Sep 16, 2020, 6:28 PM

#

see how the rightmost column is always 1 or 0?

desert oar Sep 16, 2020, 6:28 PM

#

yes

serene scaffold Sep 16, 2020, 6:28 PM

#

I just want to make the entirety of each row match that

desert oar Sep 16, 2020, 6:28 PM

#

what do you mean "make the entirety of each row match that"

serene scaffold Sep 16, 2020, 6:28 PM

#

if the last element of the row is 1.0, every element becomes 1.0

desert oar Sep 16, 2020, 6:29 PM

#

i see. and now what do you actually want to do?

serene scaffold Sep 16, 2020, 6:29 PM

#

the last column comes from a dataframe where the values are the strings Yes or No

#

and I'm storing them as these floats because every other column in the dataframe is a float

#

that is, every other column stores only floats

#

or nan

desert oar Sep 16, 2020, 6:30 PM

#

ok, but what do you actually want to do

serene scaffold Sep 16, 2020, 6:32 PM

#

and, for example, I need to take the average of every non-nan value in a column, for which the value in the rightmost column is 1.0, and replace the nans with the average of those values.

desert oar Sep 16, 2020, 6:32 PM

#

ok. you are doing mean imputation for missing data?

serene scaffold Sep 16, 2020, 6:32 PM

#

ye

desert oar Sep 16, 2020, 6:32 PM

#

stratified by the value of the label?

serene scaffold Sep 16, 2020, 6:32 PM

#

I guess I could have just said that

desert oar Sep 16, 2020, 6:33 PM

#

because you really can't/shouldn't stratify by the value of the label, since you won't know the label at prediction time if a record w/ missing data happens to appear

serene scaffold Sep 16, 2020, 6:33 PM

#

it's conditional mean imputation. The two classes are just "Yes" and "No"

desert oar Sep 16, 2020, 6:33 PM

#

is the yes/no category a feature or the target

serene scaffold Sep 16, 2020, 6:34 PM

#

feature

dense copper Sep 16, 2020, 6:34 PM

#

argh. pandas makes me wanna stab someone.

serene scaffold Sep 16, 2020, 6:34 PM

#

the yes/no designation is always given

desert oar Sep 16, 2020, 6:34 PM

#

can you do this in pandas? or do you have to do it in numpy?

#

its easy to do both ways but the pandas way is "more elegant" since you can use groupby

serene scaffold Sep 16, 2020, 6:34 PM

#

I'm allowed to use pandas and numpy only

#

can't even use stdlib

desert oar Sep 16, 2020, 6:34 PM

#

is this homework? a job interview?

serene scaffold Sep 16, 2020, 6:35 PM

#

homework. which is why I'm trying to make my question very pointed to learning the libraries

#

this is the only programming assignment for the course and we're not even required to do it in python

desert oar Sep 16, 2020, 6:35 PM

#

i see

#

so lets lay down some general principles then

serene scaffold Sep 16, 2020, 6:36 PM

#

but there's a part where we have to do 8000 ** 2 calculations and idk how the Java users are going to do that without numpy

desert oar Sep 16, 2020, 6:36 PM

#

principle 1: when in doubt, use boolean subsetting by row

#

principle 2: pandas groupby exists

#

java loops are presumably a lot faster than python loops

#

so let's do the pandas version because it sounds like your data is in pandas

serene scaffold Sep 16, 2020, 6:38 PM

#

the professor is only letting people use Java because it's the default language in the department and he doesn't want to force people to learn python if they don't want to. but that's beside the point.

desert oar Sep 16, 2020, 6:38 PM

#

you can do it with a for loop in python too, it will just take a long time

#

data = pd.DataFrame({
    "x": [1.0, 1.5, 3.0, -3.2, -4.7, -2.2],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"]
})

is this a fair representation of the data? of course including more columns than just x

serene scaffold Sep 16, 2020, 6:39 PM

#

```
0.19958,0.2133,0.47241,0.36529,0.57143,0.53333,0.41898,0.35424,0.39056,0.41731,No
0.26562,0.25566,0.47299,0.34705,0.5,0.33333,0.42342,0.38472,0.40689,0.41538,Yes
0.11211,0.37656,0.53267,0.37086,0.57143,0.46667,0.5287,0.47377,0.50504,0.48333,No
```

desert oar Sep 16, 2020, 6:39 PM

#

sure

#

ok, now let's pretend you're dumber than you are

serene scaffold Sep 16, 2020, 6:39 PM

#

that's setting the bar obscenely low already

desert oar Sep 16, 2020, 6:39 PM

#

give me an algorithm for imputing the mean within each group. it can be imprecise, we will make it more precise later.

#

just a rough sequence of steps

#

ignore anything related to pandas and numpy

#

imagine you are a number magician and you can just operate on tables of data directly

serene scaffold Sep 16, 2020, 6:41 PM

#

I guess you could filter out all the rows that aren't the right class, and then take the average of each column

desert oar Sep 16, 2020, 6:42 PM

#

stated another way: select only the rows that are the right class, take the average of each column in those rows, and replace every missing value with the average in its respective column

#

is that right?

serene scaffold Sep 16, 2020, 6:42 PM

#

yes

desert oar Sep 16, 2020, 6:42 PM

#

great. in pandas, how do you select rows such that column "class" equals "yes"

serene scaffold Sep 16, 2020, 6:42 PM

#

is it group_by?

desert oar Sep 16, 2020, 6:42 PM

#

no

#

you are less clever than that

#

pretend you are not clever

dense copper Sep 16, 2020, 6:43 PM

#

can someone please explain to me wtf the point of limit is in pct_change()? It seems to do basically nothing lol

>>> revs
      ticker calendardate       revenue
None                                   
29054      A   2019-12-31  5.163000e+09
29055      A   2018-12-31  4.914000e+09
29056      A   2017-12-31  4.472000e+09
29057      A   2016-12-31  4.202000e+09
29058      A   2015-12-31  4.038000e+09
>>> revs['revenue'].pct_change()
None
29054         NaN
29055   -0.048228
29056   -0.089947
29057   -0.060376
29058   -0.039029
Name: revenue, dtype: float64
>>> revs['revenue'].pct_change(limit=1)
None
29054         NaN
29055   -0.048228
29056   -0.089947
29057   -0.060376
29058   -0.039029
Name: revenue, dtype: float64
>>>

according to the docs, "The number of consecutive NAs to fill before stopping." -- doesn't this mean it should stop after hitting that first NaN?? all I want to do is calculate the pct_change for the first fu%&$( row and I've been at this for 3 damn days and gotten nowhere.
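
For context: `limit` belongs to the missing-value filling that `pct_change` applies to its input before computing anything. It caps how many consecutive NaNs in the data get forward-filled; it does not stop the calculation after the first result, which is why the two outputs above are identical. One way to get a single change per ticker is to divide the first two rows of each group. A sketch using the revenue figures above, with a second, made-up ticker `B` added for illustration:

```python
import pandas as pd

revs = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B"],
    "calendardate": ["2019-12-31", "2018-12-31", "2017-12-31",
                     "2019-12-31", "2018-12-31"],
    # Ticker B and its revenues are hypothetical.
    "revenue": [5.163e9, 4.914e9, 4.472e9, 2.0e9, 1.6e9],
})


def latest_change(revenue: pd.Series) -> float:
    """Change from the second row to the first row of one group."""
    if len(revenue) < 2:
        return float("nan")
    return revenue.iloc[0] / revenue.iloc[1] - 1


# Rows are newest-first within each ticker, as in the output above.
changes = revs.groupby("ticker")["revenue"].agg(latest_change)
```

`changes` has one value per ticker, and a ticker with fewer than two rows gets NaN instead of raising.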

serene scaffold Sep 16, 2020, 6:43 PM

#

I don't really know, I haven't used pandas before this

desert oar Sep 16, 2020, 6:43 PM

#

you have never ever subsetted rows in pandas ever?

serene scaffold Sep 16, 2020, 6:44 PM

#

I've just never used pandas

desert oar Sep 16, 2020, 6:44 PM

#

oh

#

lets do numpy then

dense copper Sep 16, 2020, 6:44 PM

#

(and yeah I realize this is reversed, I'm using that as a test to ensure what would happen if it results in NaN for the first row, I want it to just stop and move on to the next group (next ticker))

serene scaffold Sep 16, 2020, 6:44 PM

#

I've used numpy a bit but I've learned a lot of what I now know from trying to solve this assignment

desert oar Sep 16, 2020, 6:45 PM

#

i see

#

i thought youve used both libraries before

#

lets do pandas then. slicing and subsetting in numpy is actually messier in some respects than in pandas

serene scaffold Sep 16, 2020, 6:45 PM

#

I've never needed to use pandas for nlp and a lot of the math stuff is handled by existing libraries even if those libraries are using numpy internally.

desert oar Sep 16, 2020, 6:46 PM

#

https://numpy.org/doc/stable/user/basics.indexing.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
i recommend at least trying to struggle through both of these documents

#

you will get confused

#

but they do explain what you need to know

serene scaffold Sep 16, 2020, 6:46 PM

#

😨

desert oar Sep 16, 2020, 6:47 PM

#

there are 3 ways to select rows or columns in pandas: by "label", by numerical position, or by an array of True/False

#

i use the term "array" to mean anything "array-like", which includes pd.Series, np.ndarray (1 dimensional), and list

#

in this case, you want the 3rd option

#

pandas dataframes have 2 main "accessors" that you can use for selecting data: .loc and .iloc. .loc is for selecting by label or boolean array. .iloc is for selecting by numerical position.
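
The three selection styles side by side, as a small sketch on a made-up frame with string row labels so that label and position are easy to tell apart:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 1.5, 3.0], "is_valid": ["no", "yes", "yes"]},
    index=["a", "b", "c"],
)

# By label: row label "b", column label "x".
by_label = df.loc["b", "x"]

# By numerical position: second row, first column.
by_position = df.iloc[1, 0]

# By boolean array: keep the rows where the mask is True.
by_mask = df.loc[df["is_valid"] == "yes"]
```

The first two pick out the same cell; the third returns a smaller DataFrame.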

lapis sequoia Sep 16, 2020, 6:49 PM

#

Quick question about appending multiple txt files into one data frame as I cannot figure out how to do this with Pandas.

I have a folder with a bunch of text files representing a banks quarterly reports from year 2006-2015 (2006-03-31, 2006-06-30, 2006-09-30, 2006-12-31....2015-12-31)

desert oar Sep 16, 2020, 6:51 PM

#

@lapis sequoia
```python
from pathlib import Path

import pandas as pd

data = pd.concat([pd.read_csv(path) for path in Path('data/').iterdir()])
```

#

you might need to adjust the arguments to pd.read_csv and to pd.concat depending on your specific setup

lapis sequoia Sep 16, 2020, 6:51 PM

#

But it is not CSV?

#

it is .txt files

desert oar Sep 16, 2020, 6:51 PM

#

well it depends on how the data is formatted

#

.txt is just part of the filename

#

it could be anything in there

#

is it tabular data? tab-separated maybe?

lapis sequoia Sep 16, 2020, 6:52 PM

#

Oh I get it

#

pathlib??

desert oar Sep 16, 2020, 6:52 PM

#

yes, it's part of the standard python library

#

!d g pathlib

arctic wedgeBOT Sep 16, 2020, 6:53 PM

#

`pathlib`

New in version 3.4.

Source code: Lib/pathlib.py

This module offers classes representing filesystem paths with semantics appropriate for different operating systems. Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

../_images/pathlib-inheritance.png If you’ve never used this module before or just aren’t sure which class is right for your task, Path is most likely what you need. It instantiates a concrete path for the platform the code is running on.

Pure paths are useful in some special cases; for example:... read more

lapis sequoia Sep 16, 2020, 6:53 PM

#

ok

#

the txt files are in a folder called rawdata which is in my current directory

#

then I don't need to specify path?

desert oar Sep 16, 2020, 6:57 PM

#

@serene scaffold maybe i'll give you an example to help you get started w/ those docs. here it is without groupby. the numpy-only version is very similar, except without .loc:

import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

# pd.Series of bool values
is_yes = data["is_valid"] == "yes"
is_no = data["is_valid"] == "no"

# pd.DataFrame of bool values
is_null = data.drop(columns=["is_valid"]).isnull()

for c in is_null.columns:
    # Fill "yes" rows
    data.loc[is_null[c] & is_yes, c] = data.loc[~is_null[c] & is_yes, c].mean()
    # Fill "no" rows
    data.loc[is_null[c] & is_no, c] = data.loc[~is_null[c] & is_no, c].mean()

#

of course, this changes the data in-place, which you might not want

#

so maybe you can put the imputed version of the data in a new column

import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

# pd.Series of bool values
is_yes = data["is_valid"] == "yes"
is_no = data["is_valid"] == "no"

# pd.DataFrame of bool values
is_null = data.drop(columns=["is_valid"]).isnull()

for c in is_null.columns:
    # Make a new column to receive the imputed data
    c_new = f'{c}_imputed'
    data[c_new] = data[c]
    # Fill "yes" rows
    data.loc[is_null[c] & is_yes, c_new] = data.loc[~is_null[c] & is_yes, c].mean()
    # Fill "no" rows
    data.loc[is_null[c] & is_no, c_new] = data.loc[~is_null[c] & is_no, c].mean()
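
For comparison, a groupby-based sketch of the same idea. It assumes the same toy data as above and that filling each column with its per-group mean is what's wanted; it writes the result to new columns rather than modifying `x` and `y` in place.

```python
import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

value_cols = ["x", "y"]

# For each "is_valid" group, replace missing values with that group's column mean
imputed = data.groupby("is_valid")[value_cols].transform(lambda s: s.fillna(s.mean()))

# Attach the results next to the originals as x_imputed and y_imputed
data = data.join(imputed.add_suffix("_imputed"))
```

This scales to any number of groups without writing one line per group value.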

#

@lapis sequoia well you need to specify the directory where to look for files

#

the general principle is, build a list of dataframes, then pass that list to pd.concat

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
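
As a rough sketch of that principle for a folder like rawdata: this assumes the .txt files are tab-separated with a header row and are named after the report date (e.g. 2006-03-31.txt), none of which is confirmed in the conversation. The helper name `load_reports` and the `report_date` column are made up for illustration.

```python
from pathlib import Path

import pandas as pd


def load_reports(folder, sep="\t"):
    """Read every .txt file in `folder` and stack them into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob("*.txt")):
        frame = pd.read_csv(path, sep=sep)
        # Keep track of which file each row came from (file name without ".txt")
        frame["report_date"] = path.stem
        frames.append(frame)
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_reports("rawdata")` would look for the folder relative to the current working directory; a full path works too.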

lapis sequoia Sep 16, 2020, 7:00 PM

#

📎 Screenshot_2020-09-16_at_20.59.45.png

#

📎 Screenshot_2020-09-16_at_20.58.35.png

#

Oh ok so it doesnt matter if I have the folder there in the working directory then

#

Will read into this. Thanks for the help!

desert oar Sep 16, 2020, 7:00 PM

#

@dense copper im honestly not sure what that limit parameter is for... looks like bad documentation. but, just practically, how do you expect to calculate the % change in the first row? there is nothing for the percent change to be based on, in the first row.

#

@lapis sequoia you can always specify the full path to the directory

#

e.g. /home/mrbambocha/projects/lab3/data/ or something like that

lapis sequoia Sep 16, 2020, 7:01 PM

#

Yes, I understand. Thanks a lot!

serene scaffold Sep 16, 2020, 7:02 PM

#

@desert oar thanks, I'm looking at this

dense copper Sep 16, 2020, 7:02 PM

#

@desert oar assuming the data is reversed (which it will be), the % change would be row 0 / row 1

desert oar Sep 16, 2020, 7:02 PM

#

there is a lot going on here @serene scaffold so i recommend reading through the pandas guide and using my examples to help

dense copper Sep 16, 2020, 7:02 PM

#

or more specifically for my case I want row 0 / row 4, but that's easy

desert oar Sep 16, 2020, 7:02 PM

#

then you need to run the pct change in reverse... right?

dense copper Sep 16, 2020, 7:03 PM

#

yes I know that, but my point with reversing it in this example is I want it to stop when it hits the first NaN

#

I assumed that would mean it would just not calculate anything if limit=1 and the first result is NaN, but instead it just apparently does the same thing

#

and basically everything I find about it in tutorials is someone who just copy/pasted the docs from Pandas to explain it. lol

desert oar Sep 16, 2020, 7:05 PM

#

@dense copper you might have to just make up some fake data and experiment

#

what happens if limit=0?

dense copper Sep 16, 2020, 7:06 PM

#

error

#

limit must be > 0

desert oar Sep 16, 2020, 7:06 PM

#

i see

dense copper Sep 16, 2020, 7:06 PM

#

consider this ...

desert oar Sep 16, 2020, 7:06 PM

#

ive never actually used this method before so i know as much as you do

dense copper Sep 16, 2020, 7:06 PM

#

>>> revs
      ticker calendardate       revenue
None                                   
29054      A   2019-12-31  5.163000e+09
29055      A   2018-12-31  4.914000e+09
29056      A   2017-12-31  4.472000e+09
29057      A   2016-12-31  4.202000e+09
29058      A   2015-12-31  4.038000e+09
>>> revs['revenue'].pct_change(99)
None
29054   NaN
29055   NaN
29056   NaN
29057   NaN
29058   NaN
Name: revenue, dtype: float64
>>>

desert oar Sep 16, 2020, 7:06 PM

#

https://github.com/pandas-dev/pandas/blob/v1.1.2/pandas/core/generic.py#L10230-L10244 this is the source

GitHub

pandas-dev/pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more - pandas-dev/pandas

dense copper Sep 16, 2020, 7:07 PM

#

pct_change(99) will return NaN in every row because there's only 5 rows. anything more than pct_change(4) will be NaN for everything right?

desert oar Sep 16, 2020, 7:07 PM

#

the source explains it. limit= is passed to .fillna, which is called before shifting and dividing

dense copper Sep 16, 2020, 7:07 PM

#

but unfortunately it still runs the calc for every row.

desert oar Sep 16, 2020, 7:07 PM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

dense copper Sep 16, 2020, 7:07 PM

#

which is bad for me, since I have 250k+ rows and 100+ columns

desert oar Sep 16, 2020, 7:07 PM

#

the limit parameter here explains itself in more detail

#

this is an issue of unclear docs from pandas
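
To make that concrete, here is a rough hand-written equivalent of what the linked 1.1.2 source does, spelled out so it doesn't depend on the `limit`/`fill_method` arguments of `pct_change` (which, as far as I know, later pandas releases have deprecated). The sample values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 13.0])

# Step 1: forward-fill gaps, but fill at most `limit` consecutive NaNs
filled = s.ffill(limit=1)          # 10.0, 10.0, NaN, 13.0

# Step 2: shift by `periods` and divide, as pct_change(periods=1) would
manual = filled / filled.shift(1) - 1
```

So `limit` only controls how many NaNs get filled before the division; it never stops the calculation early.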

dense copper Sep 16, 2020, 7:08 PM

#

surprised pikachu lol

#

it's annoying how everything in Pandas is so needlessly complex

desert oar Sep 16, 2020, 7:08 PM

#

yeah. i'd recommend filing a bug report: "in the docs for pct_change, clarify that limit and fill_method are kwargs passed to fillna"

#

i'd argue that it is not needless

#

poorly documented in some cases, perhaps

dense copper Sep 16, 2020, 7:09 PM

#

want a cell of data? sure just do my_data.get_index[0:,:99]['some_var'].get_the_data(act_fucky=False).index(0)[1]! easy!

#

lol

desert oar Sep 16, 2020, 7:09 PM

#

huh?

#

why would you do that

dense copper Sep 16, 2020, 7:09 PM

#

I'm being facetious...

desert oar Sep 16, 2020, 7:10 PM

#

usually when pandas seems "overly complex" it means one of two things: the existing APIs are too low level and should have some wrapper functionality developed, or the existing APIs are too high level and need to expose more internals to the user

dense copper Sep 16, 2020, 7:10 PM

#

anyway, I'm looking through the source

desert oar Sep 16, 2020, 7:10 PM

#

the actual complexity is in my opinion structural and unavoidable

dense copper Sep 16, 2020, 7:10 PM

#

I think the bigger issue is I'm a dumbass and don't understand pandas

desert oar Sep 16, 2020, 7:11 PM

#

i rarely need long annoying invocations like that except when i'm using multiindex, which is definitely under-supported from an API design perspective

#

well that is a problem

dense copper Sep 16, 2020, 7:11 PM

#

just frustrated cause I've been trying for 3 days now to accomplish something that seems so simple

desert oar Sep 16, 2020, 7:11 PM

#

in order to use pandas effectively imo it's important to understand the separation between data and axes

dense copper Sep 16, 2020, 7:11 PM

#

and I'm getting nowhere

desert oar Sep 16, 2020, 7:11 PM

#

what are you trying to accomplish exactly

dense copper Sep 16, 2020, 7:14 PM

#

given data like the following:

 ticker calendardate  revenue
      A   2019-12-31  10
      A   2018-12-31  9
      A   2017-12-31  7
      A   2016-12-31  1
      A   2015-12-31  3

      B   2019-12-31  5
      B   2018-12-31  4
      B   2017-12-31  3
      B   2016-12-31  2
      B   2015-12-31  3

#

my goal is to calculate the pct change in revenue from 2015 to 2019 ONLY, from each ticker

#

e.g. I want ONLY (10/3) ** (1/5), and (5/3) ** (1/5)

#

using pct_change(4) calculates it for every row, so I will get the value I want, plus four NaNs

#

instead I want it to calculate the first result, see that the second is NaN, and skip the last 3 attempts since I know they will all be NaN

#

(for each group, if that makes sense)

#

that more complex calculation (e.g. ** 1/5) is to calculate CAGR, and I'm doing that with a lambda function using .apply(), but if I can figure out how to just do pct_change (or any function for that matter lol) then I can figure out the rest.

#

the problem is I have hundreds of thousands of rows and hundreds of columns, but I only need the result of the first row divided by the 5th, and everything else gets thrown away. The default behavior increases the time it takes to run this process by an order of magnitude.
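
One way to get just that single ratio per ticker, sketched under the assumption that each ticker's oldest and newest rows are the two years of interest: take the newest and oldest revenue per group instead of running pct_change over every row. The 1/5 exponent is copied from the message above; adjust it to whatever period count the CAGR definition calls for.

```python
import pandas as pd

data = pd.DataFrame({
    "ticker": ["A"] * 5 + ["B"] * 5,
    "calendardate": pd.to_datetime(
        ["2019-12-31", "2018-12-31", "2017-12-31", "2016-12-31", "2015-12-31"] * 2
    ),
    "revenue": [10, 9, 7, 1, 3, 5, 4, 3, 2, 3],
})

grouped = data.sort_values("calendardate").groupby("ticker")["revenue"]

# One value per ticker: newest revenue divided by oldest revenue
ratio = grouped.last() / grouped.first()

# Exponent taken from the example in the chat, not a claim about the right CAGR formula
growth = ratio ** (1 / 5)
```

Note that `first()` and `last()` skip missing values, so a ticker with a NaN in its oldest or newest year would silently use the neighbouring year.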

desert oar Sep 16, 2020, 7:23 PM

#

@dense copper have you tried using the freq parameter?

dense copper Sep 16, 2020, 7:25 PM

#

I'm not sure how, and as usual the documentation is terrible lol

#

been looking around at that though

desert oar Sep 16, 2020, 7:27 PM

#

import pandas as pd

data = ...

def pct_change_for_ticker(y):
    return y.pct_change(freq=pd.tseries.offsets.YearEnd())

pct_changes = data.set_index(['ticker', 'calendardate']) \
    .groupby(level='ticker')['revenue'] \
    .apply(pct_change_for_ticker)

dense copper Sep 16, 2020, 7:31 PM

#

cursory attempt on that gave me a NotImplementedError

#

I'll keep trying though ... thanks for the idea

ornate granite Sep 16, 2020, 8:50 PM

#

Hi, I am using Weka for text classification. For some reason when I apply the StringToWordVector filter, it does nothing to my data set and the number of attributes remains the same.

Does anyone have any idea why this filter is not working for me?🙏

stable otter Sep 17, 2020, 1:31 AM

#

can someone help me convert python script to exe file

lapis sequoia Sep 17, 2020, 1:34 AM

#

anyone here use dash?

modest rune Sep 17, 2020, 1:41 AM

#

I could use some help. This particular part of my numba code accounts for 90% of my slowdown. I am hoping for a much faster way to accomplish this simple task:

def closest(values, desired):
    AB = np.abs(desired - np.expand_dims(values, 1))
    indexes = np.empty((AB.shape[0],), np.int64)
    for i in nb.prange(AB.shape[0]):
        indexes[i] = np.argmin(AB[i])
    return indexes

This is what the function is supposed to do. I have two 1D arrays, values and desired. desired has 200 elements in it, evenly spaced apart. Depending on when I call the function, values can have anywhere from 1 element to 200 elements, each with hard to predict spacings.

The code goes through every value of values and finds the INDEX of the element in desired that is closest to value.

Here is an example:

Input:
desired = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
values = [1.12, 4.77, 3.39]

Output:
indexes = [0,8,5]
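
One possible way to avoid building the full distance matrix, assuming `desired` is sorted in ascending order and has at least two elements (it is described as evenly spaced): a binary search with NumPy's `searchsorted`. The function name `closest_sorted` is made up, and this is plain NumPy rather than numba.

```python
import numpy as np


def closest_sorted(values, desired):
    """For each value, return the index of the nearest element of sorted `desired`."""
    values = np.asarray(values, dtype=float)
    desired = np.asarray(desired, dtype=float)
    # Index of the first element >= value, kept inside 1..len-1 so left/right both exist
    right = np.searchsorted(desired, values).clip(1, len(desired) - 1)
    left = right - 1
    # Pick whichever neighbour is strictly closer; ties go to the left neighbour
    pick_right = (desired[right] - values) < (values - desired[left])
    return np.where(pick_right, right, left)


indexes = closest_sorted([1.12, 4.77, 3.39], [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5])
```

This does O(log n) work per value instead of comparing against all 200 grid points.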

#

anyone here use dash?
@lapis sequoia
I do

lapis sequoia Sep 17, 2020, 1:50 AM

#

Im having an issue moving a legend on this map i have made

#

I looked at the documentation, and it seems to have no effect on it

#

📎 Screen_Shot_2020-09-16_at_9.12.11_PM.png

#

This is my problem

modest rune Sep 17, 2020, 1:51 AM

#

You asked if I use it. I do, but I am terrible at it.

lapis sequoia Sep 17, 2020, 1:51 AM

#

lolol

modest rune Sep 17, 2020, 1:52 AM

#

I mean, dash uses html and css. And to get the position of things to work properly, you really need to understand html and stylesheets really well.

#

dash makes beautiful plots though.

lapis sequoia Sep 17, 2020, 1:53 AM

#

I was able to do lots of things in relationship to the map and even the colorbar without using html

wise garden Sep 17, 2020, 2:10 AM

#

How can I find out the index of multiple columns in pandas

lapis sequoia Sep 17, 2020, 2:19 AM

#

make a slice

modest rune Sep 17, 2020, 2:22 AM

#

I was able to do lots of things in relationship to the map and even the colorbar without using html
@lapis sequoia
Yeah, the graphical elements usually have non-html/css attributes you can mess with. But, it wouldn't surprise me if text positioning was purely html and css.

wise garden Sep 17, 2020, 2:22 AM

#

Let me rephrase

modest rune Sep 17, 2020, 2:23 AM

#

I know what you are asking... let me see if I can help

#

dataframe.columns.index(column_name1, column_name2)?

wise garden Sep 17, 2020, 2:25 AM

#

let me check

#

I'm doing something weird

modest rune Sep 17, 2020, 2:26 AM

#

index is a python core function that returns the index of the item in a list with that name.

wise garden Sep 17, 2020, 2:26 AM

#

like I could get what I need with select_dtypes but I want to include other columns too

modest rune Sep 17, 2020, 2:26 AM

#

dataframe.columns, should return a list of column names

wise garden Sep 17, 2020, 2:27 AM

#

yea I was trying things w .index but I was trying after I isolated columns based on their dtype

desert oar Sep 17, 2020, 2:28 AM

#

@dense copper https://repl.it/@maximum__/pct-change-pandas#main.py does this get you close to what you want?

#

import io
import pandas as pd

txt = """
 ticker calendardate  revenue
      A   2019-12-31  10
      A   2018-12-31  9
      A   2017-12-31  7
      A   2016-12-31  1
      A   2015-12-31  3
      B   2019-12-31  5
      B   2018-12-31  4
      B   2017-12-31  3
      B   2016-12-31  2
      B   2015-12-31  3
"""

data = pd.read_fwf(io.StringIO(txt), header=1)
data['calendardate'] = pd.to_datetime(data['calendardate'])

print(data)

def pct_change_for_ticker(y):
    return y.pct_change(freq=pd.tseries.offsets.YearEnd())

pct_changes = data.set_index('calendardate') \
    .groupby('ticker')['revenue'] \
    .apply(pct_change_for_ticker)

print(pct_changes)

#

@wise garden you want to select certain columns by dtype?

wise garden Sep 17, 2020, 2:29 AM

#

dataframe.columns, should return a list of column names
@modest rune anyway to pass a list through?

#

I've got 40+ column names to pass through

#

sorry, not the columns portion, the index portion

desert oar Sep 17, 2020, 2:31 AM

#

data.loc[["a", "b", "c"], ["X", "Y"]]

something like this? that selects the rows with index labels "a", "b", and "c", and columns "X" and "Y"

wise garden Sep 17, 2020, 2:34 AM

#

data.loc[["a", "b", "c"], ["X", "Y"]]
something like this? that selects the rows with index labels "a", "b", and "c", and columns "X" and "Y"
@desert oar I'm hoping to use df.iloc

#

oooo but might be able to use this

desert oar Sep 17, 2020, 2:35 AM

#

so you have lists of row numbers and/or column numbers?

#

i.e. by position

#

data.iloc[[3, 2, 12], [45, 46]]

same syntax

#

data.iloc[:, [45, 46]]
data.loc[:, ["X", "Y"]]

for all rows

wise garden Sep 17, 2020, 2:37 AM

#

Yea I was trying to make it harder by passing the col names that I found into indices and then passing them to .iloc.
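
For reference, a small sketch of that names-to-positions route. `df.columns` is a pandas `Index` rather than a plain list, and its `get_indexer` method looks up several labels at once. The frame and column names below are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],
    "b": ["x", "y"],
    "c": [0.5, 1.5],
    "d": ["p", "q"],
})

# Columns picked by dtype, plus one extra column chosen by name
numeric_cols = df.select_dtypes(include="number").columns.tolist()
wanted = numeric_cols + ["d"]

# Integer positions of those columns, usable with .iloc
positions = df.columns.get_indexer(wanted)
subset_by_position = df.iloc[:, positions]

# The same selection by label, without converting to positions at all
subset_by_label = df[wanted]
```

Both subsets are identical, so the detour through positions is only needed when something downstream really requires integers.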

desert oar Sep 17, 2020, 2:38 AM

#

data.loc[:, ["X", "Y"]]
data[["X", "Y"]]

pandas also accepts the 2nd line as a synonym for the 1st

#

i write data[["foo", "bar"]] all the time when using pandas code

wise garden Sep 17, 2020, 2:38 AM

#

yea solid, thx for the help

velvet thorn Sep 17, 2020, 7:06 AM

#

@modest rune did you solve your problem yet?

modest rune Sep 17, 2020, 7:11 AM

#

Yes! All is good now

velvet thorn Sep 17, 2020, 7:12 AM

#

how did you fix it?

solid mantle Sep 17, 2020, 7:50 AM

#

Hey, pandas issue

#

My dataframe here is taking the first row as a header row. Can i somehow edit it so it does NOT do that?

📎 Screenshot_from_2020-09-17_13-21-10.png

#

nevermind I solved it

lapis sequoia Sep 17, 2020, 8:59 AM

#

header=None

#

you can also supply the column names as a list of strings via names=
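
A quick sketch of both options, assuming the frame was loaded with `pd.read_csv` (the screenshot isn't available, so that is a guess) and using made-up file contents. With `read_csv`, `header=None` is the value that tells pandas there is no header row.

```python
import io

import pandas as pd

raw = "alice,30\nbob,25\n"  # made-up file contents with no header row

# header=None: treat every line as data; columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: same, but with explicit column labels
named = pd.read_csv(io.StringIO(raw), header=None, names=["name", "age"])
```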

subtle tide Sep 17, 2020, 11:25 AM

#

Hi, hope everyone is doing fine, I'm not really into data science but I need pandas to do a quick work, I have a CSV file, and I need to make a search on it on a particular column, I could get away with a for loop but I've read that Pandas is more efficient, how can I achieve this?

velvet thorn Sep 17, 2020, 11:30 AM

#

Hi, hope everyone is doing fine, I'm not really into data science but I need pandas to do a quick work, I have a CSV file, and I need to make a search on it on a particular column, I could get away with a for loop but I've read that Pandas is more efficient, how can I achieve this?
@subtle tide what do you mean a search?

#

To be clear the variables are x and z
@solemn cosmos logarithm

ripe forge Sep 17, 2020, 11:35 AM

#

Don't worry about efficiency so early. If this csv file isn't massive you can happily use the csv module in python and call it a day.

subtle tide Sep 17, 2020, 11:35 AM

#

@velvet thorn , I have a csv file with this header, [name, phone_number, email], and about a thousand row, I want to get the name and phone_number of a particular email

velvet thorn Sep 17, 2020, 11:36 AM

#

df.loc[df['email'] == the_email, ['name', 'phone_number']]
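
Spelled out as a runnable sketch, with a few made-up rows standing in for the real file (the header matches the one described above; everything else is invented):

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV file
csv_text = """name,phone_number,email
Ann,555-0100,ann@example.com
Bob,555-0101,bob@example.com
"""

df = pd.read_csv(io.StringIO(csv_text))

the_email = "bob@example.com"
# Boolean mask picks the matching rows; the list picks the columns to return
match = df.loc[df["email"] == the_email, ["name", "phone_number"]]
```

`match` is itself a DataFrame, so it will be empty (not an error) when the email isn't found.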

subtle tide Sep 17, 2020, 11:36 AM

#

@ripe forge , ok thanks, I will have that in mind

velvet thorn Sep 17, 2020, 11:36 AM

#

but yes, a thousand rows is nothing

subtle tide Sep 17, 2020, 11:36 AM

#

thanks @velvet thorn

velvet thorn Sep 17, 2020, 11:36 AM

#

yw

modest rune Sep 17, 2020, 12:54 PM

#

how did you fix it?
@velvet thorn

I took advantage of the fact that I knew how the desired array was constructed.

# a : data that I am analyzing
desired = numpy.linspace(a.min(), a.max(), 200)

By storing the 3 values used in the linspace and then later on applying them to values in the right way and converting to an integer, I was able to calculate the proper index, instead of searching for the index. Much much faster.
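
A sketch of what that might look like; the exact code wasn't posted, so the function name and the clipping step are assumptions. It relies on the grid being `np.linspace(start, stop, num)` with `num > 1` and `stop != start`.

```python
import numpy as np


def closest_on_grid(values, start, stop, num):
    """Index of the nearest point of np.linspace(start, stop, num) for each value."""
    step = (stop - start) / (num - 1)
    idx = np.rint((np.asarray(values, dtype=float) - start) / step).astype(np.int64)
    # Guard against values slightly outside [start, stop]
    return np.clip(idx, 0, num - 1)


indexes = closest_on_grid([1.12, 4.77, 3.39], start=1.0, stop=5.0, num=9)
```

No search happens at all here: each index is one subtraction, one division and a rounding step.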

#

You must be on the other side of the world because when you asked me:

How did you fix it?
It was 2am-ish and I was asleep.

terse ridge Sep 17, 2020, 3:06 PM

#

hey guys I just wanted to ask, is the first node the same as the start node in a Binary Tree?
pls feel free to ping me if you have an answer

desert oar Sep 17, 2020, 3:07 PM

#

@terse ridge this is a better question for #algos-and-data-structs

#

data science consists of statistics, machine learning, and related topics

terse ridge Sep 17, 2020, 3:24 PM

#

ok sorry

fringe dagger Sep 17, 2020, 3:53 PM

#

umm hey guys i want to ask something about a particular dataset anyone wanna help me?

mellow spruce Sep 17, 2020, 3:53 PM

#

How can I compute the average of the next n rows for every row in a data frame? I have a data frame that looks like:

```
    A|1
    B|2
    C|3
    D|4
    E|5
    F|6
    G|7
    H|8
    I|9
    J|10
    K|11
    L|12
    M|13
```
and I want to average the next 3 rows for every row so the output would be like
```
Object|Value|Average_3
    A|1|6.3
    B|2|8.6
    C|3|11
    D|4|13.3
    ... and so on
```

I was thinking of doing something like

```
df['average_3']=df['value'].apply(lambda x: x.shift(1)+x.shift(2)+x.shift(3))
```

However, the n number of rows will not always be the same so I was wondering how I can apply a for loop inside the lambda function and also how will this manage the last n rows since they won't have all the future rows to do the average on?

desert oar Sep 17, 2020, 5:01 PM

#

@fringe dagger it's better to just ask your question, don't "ask to ask"

fringe dagger Sep 17, 2020, 5:02 PM

#

um yeah sure

#

📎 Screenshot_2020-09-17_223407.png

sterile wyvern Sep 17, 2020, 5:04 PM

#

@mellow spruce

#

df['average_3'] = df.iloc[:,1].rolling(window=3).mean()

desert oar Sep 17, 2020, 5:05 PM

#

i prefer to use column names but yes use rolling

fringe dagger Sep 17, 2020, 5:05 PM

#

how does the training data work here?

#

like i'm using two different columns to predict a regression model

desert oar Sep 17, 2020, 5:05 PM

#

i actually dont know if you can roll forward though @sterile wyvern

#

@fringe dagger can you also share your code as text (either using code formatting or a paste site)? it's difficult for some people to read text in images

fringe dagger Sep 17, 2020, 5:06 PM

#

oh okay never done that and its in notebook format

desert oar Sep 17, 2020, 5:06 PM

#

!paste

arctic wedgeBOT Sep 17, 2020, 5:06 PM

#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

sterile wyvern Sep 17, 2020, 5:07 PM

#

@desert oar need to go through rolling documentation

desert oar Sep 17, 2020, 5:07 PM

#

@sterile wyvern its not in there. but i found this, which is... messy https://stackoverflow.com/a/51619099/2954547

Stack Overflow

Rolling Look Forward Sum with Datetime Index in Pandas

I have multivariate time-series/panel data in the following simplified format:

id,date,event_ind
1,2014-01-01,0
1,2014-01-02,1
1,2014-01-03,1
2,2014-01-01,1
2,2014-01-02,1
2,2014-01-03,1
3,2014-01...

sterile wyvern Sep 17, 2020, 5:11 PM

#

for i in range(0, df.shape[0] - 2):
    df.loc[df.index[i+2], 'SMA_3'] = np.round((df.iloc[i, 1] + df.iloc[i+1, 1] + df.iloc[i+2, 1]) / 3, 1)

#

alternate for rolling

desert oar Sep 17, 2020, 5:14 PM

#

https://repl.it/@maximum__/pandas-forward-roll#main.py @mellow spruce

k = 3

data['average_3'] = (
    data['Value']
        .shift(-1).fillna(0)
        .rolling(k, win_type='boxcar').sum()
        .shift(-k+1)
)
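
If the mean of the next n rows is wanted rather than a sum, one alternative sketch avoids `win_type` entirely by reversing the series, rolling backward, and reversing again. The `Object`/`Value` column names are assumed from the asker's header, and `forward_mean` is a made-up helper name.

```python
import pandas as pd


def forward_mean(series, n):
    """Mean of the next `n` values after each row (the row itself is excluded)."""
    upcoming = series.shift(-1)            # row i now holds the value of row i+1
    reversed_roll = upcoming[::-1].rolling(n, min_periods=1).mean()
    return reversed_roll[::-1]             # restore the original order


df = pd.DataFrame({"Object": list("ABCDEFGHIJKLM"), "Value": range(1, 14)})
df["average_3"] = forward_mean(df["Value"], 3)
```

With `min_periods=1`, rows near the end average whatever future rows exist, and the very last row gets NaN because nothing follows it. Since `n` is just an argument, a different window size needs no loop.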

merry ridge Sep 17, 2020, 5:39 PM

#

Is there a simple way to be able to do arithmetic operations on an extremely large number? The documentation says that it should just automatically work for newer versions of python but that doesn't seem to be the case

lapis sequoia Sep 17, 2020, 6:00 PM

#

im making a smart macro that detects if the message sent has been sent twice without a message in between and im not sure how im going to go about the AI for this

#

ive made the macro thats all done

#

if anyone has any clue on how to help make sure to ping me, thanks

desert oar Sep 17, 2020, 6:18 PM

#

@merry ridge what error are you getting? how large is extremely large?

#

@lapis sequoia do you really need AI for that? you can't just compare the two messages for equality (or similarity)?

lapis sequoia Sep 17, 2020, 6:19 PM

#

and how would i go about doing that?

desert oar Sep 17, 2020, 6:20 PM

#

what do you mean? if message1 == message2:
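
A tiny sketch of that comparison, with an optional similarity check from the standard library's `difflib`. The function name, the whitespace/case normalisation, and the threshold are all assumptions about what "the same message" should mean.

```python
from difflib import SequenceMatcher


def is_repeat(previous, current, threshold=1.0):
    """True if `current` repeats `previous`, exactly or above a similarity threshold."""
    if previous is None:
        return False
    a = previous.strip().lower()
    b = current.strip().lower()
    if a == b:
        return True
    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Keeping the last message in a variable and calling `is_repeat(last, new)` on each incoming message is enough; no model is involved.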

distant tree Sep 17, 2020, 6:21 PM

#

Function messages?

lapis sequoia Sep 17, 2020, 6:25 PM

#

ohhh ok

merry ridge Sep 17, 2020, 6:32 PM

#

The upper bound of what I need is about 1e400~
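
For context, a short sketch of where the limit comes from: Python's `int` has arbitrary precision, but a literal like `1e400` is a `float`, which overflows to infinity. `decimal.Decimal` is one option when non-integer values of that size are needed.

```python
import math
from decimal import Decimal

as_float = 1e400               # float literal beyond the double range: becomes inf
as_int = 10 ** 400             # int arithmetic stays exact at any size
as_decimal = Decimal("1e400")  # decimal keeps the magnitude with a wide exponent range

int_sum = as_int + as_int
decimal_sum = as_decimal + as_decimal
```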

desert oar Sep 17, 2020, 7:42 PM

#

@merry ridge what kind of operation are you doing?

#

!e ```python
print( 1e400 + 1e400 )