#data-science-and-ml


crude karma
#

I'm trying to apply for some data science programs, but I only have some qualitative analysis from my work as an analyst

#

feel free to ping me if someone answers 🙂

serene anvil
#

Anyone have a good resource for getting started with data streaming?

winter portal
#

hey guys

#

I have a tags system

#

so basically user inputs some text from discord

#

and it gets stored in a sqlite3 database

#

and when the stored text is called with the respective keyword,

#

the input text is displayed

#

(By a discord bot of course)

#

but over here, the lines arent getting recognized

#

and the text string is stored without the specified lines

#

I dont want that to happen

#

how to fix it?

lapis sequoia
#

I'm working with a pretty large data set, 500 rows and maybe 10 columns. I am running for loops to create even more permutations, and all these permutations get saved as new CSVs. The problem is that my computer crashes after saving 2-3 CSVs. What can I do to stop this from happening?

austere swift
#

do you delete the dataframe from memory after saving it as a csv?

#

because if you just keep saving more and more dataframes to memory itll eventually fill up and crash
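
The suggestion above can be sketched as follows. This is a minimal illustration, assuming the permutations are column orderings and that each result can be written out before the next one is built; `save_permutations`, the file names, and the `limit` parameter are hypothetical.

```python
import gc
import itertools
import tempfile
from pathlib import Path

import pandas as pd


def save_permutations(base, out_dir, limit=3):
    """Write one CSV per column permutation, keeping only one result in memory at a time."""
    written = []
    for n, order in enumerate(itertools.permutations(base.columns)):
        if n >= limit:
            break
        permuted = base[list(order)]          # build a single permutation
        path = out_dir / f"perm_{n}.csv"
        permuted.to_csv(path, index=False)    # persist it to disk
        written.append(path)
        del permuted                          # drop the reference before the next iteration
        gc.collect()                          # optional: ask Python to reclaim memory now
    return written


base = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out_dir = Path(tempfile.mkdtemp())
paths = save_permutations(base, out_dir)
print([p.name for p in paths])
```

The key point is that nothing accumulates in a list of DataFrames; only the file paths are kept.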

red rose
fast bluff
#
```python
def SortData():
    global df
    cols = df.columns.tolist()                          #Convert columns to list
    cols
    colsorder = [2,1,4,3]                        #Move last column to first position
    df = df[cols[i] for i in colsorder]
```
#

Last line returning a syntax error, any idea why?

velvet thorn
#

add one more set of brackets

tall barn
#

I take it this is the appropriate channel for asking questions that involve pandas?

#

I'm having issues getting rid of NaN values in one of my dataframe's columns.

#

I'm running `avg_age = clean_df['Age'].mean()` followed by `clean_df['Age'].fillna(avg_age, inplace=True)`

#

I have verified that the first line correctly takes the average value of values in the Age column

#

I have also verified that for some reason, the number of null values in the column does not change after running these lines

#

I first tried using .replace for this, with the same failure

#

All help I've found googling around links to people that didn't know about using inplace=True, which definitely isn't the problem here

#

Has anyone else experienced this sort of issue? Any fixes?

#

Ok, ok, nevermind. As usual, finally asking for help revealed the issue to me.

#

...I was running my after-check of .isnull().sum() on the wrong dataframe

#

20 minutes of confusion and finally deciding to join this server only for this to happen to me

velvet thorn
#

@tall barn that is rough.

#

one thing I like to do is never perform inplace modifications

tall barn
#

So you just assign back to itself?

velvet thorn
#

no

#

for example...

#
raw_df = pd.read_csv(...)
cleaned_df = raw_df.fillna(...).replace(...).map(...)
tall barn
#

Ah new dataframe every time?

velvet thorn
#

then let's say I want to perform some other transformation on cleaned_df:

some_other_df = cleaned_df.melt(...)
#

for example.

#

yup

tall barn
#

I can see the value in that

velvet thorn
#

are you using Jupyter?

tall barn
#

Wouldn’t that fill up the memory?

#

Yes

velvet thorn
#

worry about memory when it becomes a problem

#

premature optimisation is the root of all evil

tall barn
#

Fair enough

velvet thorn
#

the really nice thing about doing this is that you can go back to earlier cells

#

and run them

#

and they will just work.

#

because you never modify DataFrames

tall barn
#

Oh God yes

velvet thorn
#

whereas say you create a DataFrame in cell 8, perform a read-only aggregation in cell 9, and modify it in cell 10

#

if around cell 15 you want to change something in cell 9 just as an experiment, it might not work

#

because you've modified the source already.
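
A small sketch of the "never modify in place" style described here, assuming a toy frame with an `Age` column; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

raw_df = pd.DataFrame({"Age": [22.0, np.nan, 40.0], "Name": ["a", "b", "c"]})

# Each step returns a new DataFrame; raw_df is never modified.
cleaned_df = raw_df.assign(Age=raw_df["Age"].fillna(raw_df["Age"].mean()))
long_df = cleaned_df.melt(id_vars="Name", value_vars="Age")

print(raw_df["Age"].isna().sum())      # the original still has its missing value
print(cleaned_df["Age"].isna().sum())  # the cleaned copy does not
```

Because `raw_df` is untouched, any earlier notebook cell that reads from it can be re-run at any time and gives the same result.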

#

naming can be a pain

#

but there are also ways to handle that

tall barn
#

I’m afraid that really wouldn’t fly with the instructors at my job though. Having a dozen dataframes isn’t a nice surprise to people who expect to just look at one

velvet thorn
#

yup, that's fine

#

I mean, you don't have to expose the intermediate DataFrames

#

but of course if you have other requirements then those should take precedence

tall barn
#

I’ll keep in mind what you’ve said once my code is no longer being inspected for grading purposes

#

Which should be in only a week or two

velvet thorn
#

sure!

#

I mean, what I have said is by no means the way to do things

#

and in fact it is a little foreign to a fair number of data scientists, I think?

#

because it's a bit more of a computer science theory-based approach

tall barn
#

Sure, but I strongly identify with what you said about running old cells

velvet thorn
#

(disclaimer: I have neither a formal DS nor a formal CS background)

#

but yeah if you're interested in this kind of thing you can look up functional programming; a lot of my DS code structure is inspired by those ideas

#

I think they generally lead to greater levels of correctness and readability once you get used to the idioms, since pandas is actually quite a functional library

tall barn
#

I don’t have a formal background in this, but I’m getting it now as pre-assignment job training. I got hired as a software developer and shoehorned into data scientist somehow...

#

I’m starting to get an appreciation for pandas for sure, the best documentation of any code I’ve ever read

#

To the point I’m mostly reading documentation to fulfill assignments, rather than following the shitty lessons

velvet thorn
#

yup, it does have p good documentation.

#

I mean

#

I do understand that career path

#

I've bounced all around ML engineer/DS/SWE in the span of a year or so

tall barn
#

It’s not that I don’t understand it, it’s just that I historically have hated statistics haha

#

I like the logic puzzles of programming, not so much data crunching. I’m adapting though. Statistics is less bad with code than it was in a generic stat class in college

upbeat cradle
#

Hey, I’m looking to find instances in pandas where a person has used the same details out of 2 fields I.e: a phrase and an email address, what’s the best way to find instances where people have changed either of those details and provide a count of the presence of either?

brittle agate
#

How can I implement 10-fold cross-validation in this code?
(train_ds, val_ds, test_ds), metadata = tfds.load( 'tf_flowers', split=['train[:60%]', 'train[60%:90%]', 'train[90%:]'], with_info=True, as_supervised=True )

lapis sequoia
#

this isn't python is it

pale thunder
#

that is python

lapis sequoia
#

oh cool.. I haven't used tf this way before

modest rune
#

I am trying to use numba to speed up my code, but I keep running into problems because numba doesn't support functions my code uses.

#

So... I am trying to find different ways to accomplish the same tasks... ways that numba supports.

#

I was doing this:

# y: vector of size 200
# x: vector of size 4
y - x.reshape((len(day),1))
#

numba has odd problems with numpy.reshape(), so I am trying to get rid of it

#

And I can't help but think, there is a simpler way to do this type of subtraction in numpy.

#

something like y - x.T, but that doesn't work

#

Calling x.T changes the vector of size (2,) to a vector of size (2,) 🙂
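
For reference, the subtraction being asked about can be written with plain NumPy broadcasting. This sketch only shows the NumPy side; whether each spelling compiles under numba is not checked here.

```python
import numpy as np

y = np.arange(200, dtype=np.float64)   # vector of size 200
x = np.array([1.0, 2.0, 3.0, 4.0])     # vector of size 4

# x[:, None] has shape (4, 1); broadcasting against (200,) gives (4, 200).
diff = y - x[:, None]

# Equivalent spellings of the same column-vector trick.
same_a = y - x.reshape(-1, 1)
same_b = y - np.expand_dims(x, 1)

print(diff.shape)
```

`x.T` has no effect on a 1-D array because there is only one axis to transpose, which is why a second axis has to be added explicitly.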

brittle agate
#

this isn't python is it
@lapis sequoia
Oof...

odd yoke
#

it is python

lapis sequoia
#

lol I'm just used to using sklearn for cv, etc

#

honestly it seems way too complicated in TF

brittle agate
#

No problem man.

odd yoke
#

@modest rune do you know which part is causing the crash exactly, the reshape of the len ?

raw rapids
#

How can I implement 10-fold cross-validation in this code?
`(train_ds, val_ds, test_ds), metadata = tfds.load(
'tf_flowers',
split=['train[:60%]', 'train[60%:90%]', 'train[90%:]'],
with_info=True,
as_supervised=True
)`
just seems to be a tuple

odd yoke
#

either way, can you try .reshape(-1, 1) or use expand_dims

modest rune
#

numba supports reshape(), but only if the array is contiguously stored in memory.

odd yoke
#

which it should be given your example

modest rune
#

It is throwing an error saying it is not.

#

I have to call x.copy() to fix it.

loud veldt
#

Morning everyone!

I am having trouble putting these two separate mathematical formulas into one print statement. I've hacked my way to the solution, but I'd like to know how I can combine the two into one single print statement.

`import math

monthly_payment = 8721.8
periods = 120
interest = 5.6 / 100
i = interest / 12 * 1

print(i * math.pow(1 + i, periods)) # 0.008159169813759389
print(math.pow(1 + i, periods) - 1) # 0.7483935315198691

print(int(monthly_payment / (0.008159169813759389/0.7483935315198691))) # 800000`
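
One way to fold the two intermediate values into a single expression, reusing the variable names from the question; `growth` and `principal` are names introduced here for illustration.

```python
import math

monthly_payment = 8721.8
periods = 120
i = (5.6 / 100) / 12                 # monthly interest rate

growth = math.pow(1 + i, periods)    # computed once, used in numerator and denominator
principal = monthly_payment / (i * growth / (growth - 1))

# The whole calculation in a single print statement.
print(int(monthly_payment / (i * math.pow(1 + i, periods) / (math.pow(1 + i, periods) - 1))))
```

Computing from the variables rather than pasted decimals also avoids rounding drift if the inputs change.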

modest rune
#

My guess is that x is a view to a bigger array that I sliced up earlier on in my code.

#

And hence, it is not contiguous.
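
A quick way to check the guess about views, using only NumPy. `big` is a made-up stand-in for the larger array that was sliced earlier.

```python
import numpy as np

big = np.arange(12, dtype=np.float64).reshape(4, 3)

column_view = big[:, 0]   # strided view: consecutive elements are 3 items apart in memory
row_view = big[0]         # a full row of a C-ordered array is contiguous

print(column_view.flags["C_CONTIGUOUS"])
print(row_view.flags["C_CONTIGUOUS"])

# Copies only when the input is not already contiguous.
fixed = np.ascontiguousarray(column_view)
print(fixed.flags["C_CONTIGUOUS"])
```

Inspecting `.flags` on the array that triggers the error shows whether slicing is the cause before paying for a copy.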

#

I think copying the array will more than offset any speedups I get from using numba

#

I am going to try expand_dims, hopefully numba supports that

brittle agate
#

honestly it seems way too complicated in TF
@lapis sequoia
Why? If I'm downloading from cloud collection and break dataset into train, val and test input?

lapis sequoia
#

not that part.. I mean actually doing cross validation on this

odd yoke
#

yeah that's probably the reason why, i just tried it on my end and it works with reshape, the views are probably causing issues

brittle agate
#

not that part.. I mean actually doing cross validation on this
@lapis sequoia
U mean this case?

#

And yeah.

#

I resolved this problem.

#
```python
(train_ds, test_ds), metadata = tfds.load(
    'tf_flowers',
    split=['train[:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)

val_ds = train_ds.split = [
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
]
```
#

One line of code. Not bad))
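
For comparison, a sketch of how the fold definitions could be built as split strings, using the same `'train[a%:b%]'` slicing syntax as the snippet above. Only the strings are constructed here; passing them to `tfds.load` is shown as a comment and is not run. Note this version folds over the whole `train` split and does not reserve a separate test set.

```python
# Ten folds: each validation split is one 10% slice,
# and the matching training split is everything else.
val_splits = [f"train[{k}%:{k + 10}%]" for k in range(0, 100, 10)]
train_splits = [f"train[:{k}%]+train[{k + 10}%:]" for k in range(0, 100, 10)]

for fold, (tr, va) in enumerate(zip(train_splits, val_splits)):
    print(fold, tr, va)

# Assumed usage (not run here): pass each list as split= to tfds.load, e.g.
#   vals_ds = tfds.load('tf_flowers', split=val_splits, as_supervised=True)
#   trains_ds = tfds.load('tf_flowers', split=train_splits, as_supervised=True)
```

Each of the ten (train, validation) pairs would then be used for one round of training and evaluation.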

modest rune
#

@odd yoke expand_dims worked. Now I need to work through the other 3 errors numba is throwing. Thanks for helping me get over that hump!

brittle agate
#

just seems to be a tuple
@raw rapids
What?

modest rune
#

Any insight into what this means for a non-computer science guy?
Resolution failure for non-literal arguments
It is an error I am receiving when numba tries to compile this line of code:

# x: 2D array, variable names were changed to protect their identity
temp = x.argmin(1)

Full error from numba

Compilation is falling back to object mode WITH looplifting enabled because Function "j_nearest_strike" failed type inference due to:
- Resolution failure for literal arguments:
AssertionError()
- Resolution failure for non-literal arguments:
AssertionError()

During: resolving callee type: BoundFunction(array.argmin for array(float64, 2d, C))
During: typing of call at c:/.../analysis2.py (159)


File "analysis2.py", line 159:
def j_nearest_strike(strike, x):
    <source elided>
    abs_minus = np.abs(minus)
    temp = abs_minus.argmin(1)
    ^

  @jit('int64[:](float64[:], float64[:])')
odd yoke
#

argmin doesn't support additional arguments iirc

#

it's saying it didn't type check

modest rune
#

The actual code... sorry for the ugly variable names and unnecessary assignments, I am just trying to get it to work.

@jit('int64[:](float64[:], float64[:])')
def j_nearest_strike(strike, x):
    #strike = strike.copy()
    #length = strike.size
    strike_shaped = np.expand_dims(strike, 1)
    minus = x - strike_shaped
    abs_minus = np.abs(minus)
    temp = abs_minus.argmin(1)
    return temp
#

argmin doesn't support additional arguments iirc
@odd yoke
I guess I am dense. Can you rephrase that? I am not following.

#

Without numba, that line works perfectly.

odd yoke
#

you can only do arr.argmin()

modest rune
#

oh

odd yoke
#

numba doesn't support all of numpy

modest rune
#

well, crap

odd yoke
#

most axis parameters, like this one, aren't available

modest rune
#

abs_minus is a 2D array for vectorization purposes

#

I guess, if I am using numba, maybe I don't need to vectorize as much?

odd yoke
#

use loops

#

yeah

modest rune
#

Any idea how I can get around this problem while still taking the vectorized approach?

odd yoke
#

numba optimizes loops

modest rune
#

ok

#

I guess, I should be happy... but I am just annoyed because it took me a while to get the vectorization right in many parts of the code I want to use numba on. So... I have to rip all of that out.

#

A friend of mine suggested using pypy instead of numba and laughed at me. 🙂

odd yoke
#

sounds like your friend needs to look up what numba does

modest rune
#

All I can say is... When I am done destroying my vectorization and getting rid of my class, the numba code better be much faster!

odd yoke
#

also that's the purpose of numba, unlike numpy, it wants you to write "normal" python code

modest rune
#

The friend claims that pypy will get me similar performance gains, the only problem will be that installing scipy, pandas, and numpy will be a PIA.

odd yoke
#
```python
import numba as nb
import numpy as np


arr = np.random.randint(10, size=(100,50))

@nb.jit(nopython=True, parallel=True)
def argmin(arr):
  ret = np.empty((arr.shape[0],))
  for i in nb.prange(arr.shape[0]):
    ret[i] = np.argmin(arr[i])
  return ret

print(argmin(arr))
print(arr.argmin(1))
```
works
#

just gotta like, specify the dtype

#

i'm just too lazy
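
A variant of the loop above with the output dtype specified, so the result holds integer indices like `arr.argmin(1)` does. This is a sketch: the `try`/`except` fallback is an addition so the example still runs as plain Python when numba is not installed, and `row_argmin` is a name chosen here.

```python
import numpy as np

try:
    from numba import njit          # compile the loop if numba is available
except ImportError:                 # fallback: run the same loop as plain Python
    def njit(func):
        return func


@njit
def row_argmin(arr):
    """Row-wise argmin, matching arr.argmin(1) for a 2D array."""
    out = np.empty(arr.shape[0], dtype=np.int64)   # integer indices, not floats
    for i in range(arr.shape[0]):
        out[i] = np.argmin(arr[i])
    return out


arr = np.random.randint(10, size=(100, 50))
print(np.array_equal(row_argmin(arr), arr.argmin(1)))
```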

modest rune
#

thanks! you didn't have to do that.

pale thunder
#

Pypy is generally a lot slower than nopython numba since it uses actual python objects

odd yoke
#

it's not even the same purpose really

#

numba is for optimizing numerical programs

frank marsh
#

Hey all, I need some help. I just started learning Python and I am learning about DataFrames. I have a sample of 29405 rows and 1 column, and when I use this code:
df= pd.Dataframe (df)
df1.insert(2, "Consumer", [1:29405])

But it's showing SyntaxError: invalid syntax
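
A likely cause, sketched on a small stand-in frame: `[1:29405]` is slice syntax, which Python only accepts inside an index expression, so it cannot be used to build a list. The column name `value` and the five-row frame are made up; with a single existing column, position 1 is the last valid insert position.

```python
import pandas as pd

# Stand-in for the 29405-row, one-column sample (5 rows here for brevity).
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# range(1, n + 1) produces the intended 1..n sequence.
df.insert(1, "Consumer", list(range(1, len(df) + 1)))

print(df)
```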

lapis sequoia
#

hey guys, noob here. I'm trying to use pandas to read a file, but when I use read_csv it doesn't print columns, just long lines of text where the columns are separated by '\t'

I figured out I had to use a delimiter, but when I plug \t into it I get an error? (It is a TSV, not a CSV file, but I've seen it works the same)

desert oar
#

@lapis sequoia can you share your actual code?

lapis sequoia
#

only that and I get an error in this line

#

@desert oar

desert oar
#

show the error? also please try to share code as text, and not screenshots. some people have difficulty reading screenshots

lapis sequoia
#

sorry but idk how to share the code in fashion way

grave frost
#

@lapis sequoia How big is the TSV file you want to load?

lapis sequoia
#

500 000 ko

#

big file

grave frost
#

ko?

lapis sequoia
#

@grave frost

#

Ko I guess it's 500 mo

grave frost
#

Hmm...

raw rapids
#

@Jerrody you're just loading a tuple of values

grave frost
#

@lapis sequoia Did you try just taking a few values, making a new tsv and reading that with pandas?

lapis sequoia
#

hmm no, you mean like create a file with only the columns im interest in or st ?

#

idk how to do that tbh Im a real beginner

grave frost
#

No, create a file with only a few data points (like maybe of 5Mb). Also, how much RAM have you got?

desert oar
#

@lapis sequoia try using nrows=100 to load only 100 rows

#

!codeblock

arctic wedgeBOT
#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')
desert oar
#

^ that tells you how to post code

lapis sequoia
#

8gb

#

oh ok thanks

#

but I get the error just by printing the columns alone @desert oar

#

and no error if I dont use delimiter

desert oar
#

if you dont use delimiter= it is loading everything as text

#
pd.read_csv('data.tsv', delimiter='\t', nrows=100)

does this produce an error?

lapis sequoia
#

no!

#

length = 3148

#

may be that...

#

but it's weird since I only prints the columns

desert oar
#

the error is not from printing the columns

#

do you want to read the data file? or do you only want to print the columns?

lapis sequoia
#

well, Im really exploring the data as a noob atm

desert oar
#

it sounds like you do not have enough free memory on your computer to read the whole dataset

lapis sequoia
#

seeing if I can find one specific columns and exploring only the data associated with it

desert oar
#

you can select columns with usecols=

lapis sequoia
#

hmmm okay

desert oar
#
pd.read_csv('data.tsv', delimiter='\t', nrows=100, usecols=['CASEID', 'QUESTID2', 'CIGEVER'])
#

this only reads the CASEID, QUESTID2, and CIGEVER columns

#

and only 100 rows
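
The same options can be tried on a tiny in-memory file. The column names come from the messages above; the values and the extra `OTHER` column are made up. `nrows=0` is an additional trick for reading only the header.

```python
import io

import pandas as pd

# Small in-memory stand-in for data.tsv.
tsv = "CASEID\tQUESTID2\tCIGEVER\tOTHER\n1\t100\t1\tx\n2\t101\t2\ty\n3\t102\t1\tz\n"

# Header only: nrows=0 parses the column names without loading any data rows.
columns = pd.read_csv(io.StringIO(tsv), sep="\t", nrows=0).columns.tolist()

# A slice of the data: selected columns, first two rows.
sample = pd.read_csv(
    io.StringIO(tsv), sep="\t", nrows=2, usecols=["CASEID", "QUESTID2", "CIGEVER"]
)

print(columns)
print(sample.shape)
```

Listing the columns this way is cheap even for a very large file, since no data rows are parsed.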

grave frost
#

Is there any way to create partial checkpoint? Like to make a chkpt if the Epoch is done only about 30% or something?

lapis sequoia
#

oh okay thanks! so since I specify those parameters in read_csv before doing anything else, it means that if I don't, it goes through all the data AND ROWS even if I only print the columns, and that creates the error? do I understand correctly?

fast bluff
#
```python
def SortData():
    global df
    cols = df.columns.tolist()
    cols
    colsorder = [2,1,4,3]
    df = df[cols[i] for i in colsorder]
    return df
```
Last line returning a syntax error plox halp
grave frost
#

Can you post the whole error? and are you using Jupyter Notebooks by any chance?

fast bluff
#

Are you talking to him or I?

grave frost
#

you

fast bluff
#

I'm using ATOM with some addons to make it work like Jupyter

grave frost
#

Hydrogen?

fast bluff
#

Yeah

grave frost
#

Just restart your kernel and run only the lines you need

fast bluff
#

Code looks fine, you just thinking its an error with the kernel?

marsh seal
paper niche
#

@fast bluff you need 1 more pair of brackets I think, try df[[cols[i] for i in colsorder]]?

fast bluff
#

Someone else suggested that. When I try that it says "list index out of range"

paper niche
#

how many columns do you have? 4?

fast bluff
#

Yeah 4 exact

#

SMA, CLOSE, BUY, SELL

paper niche
#

python is 0-indexed, you need [1, 0, 3, 2] instead

fast bluff
#

OHH

#

omg

#

smh thank you

paper niche
#

though, it's probably easier to just select them directly? if the column names are fixed

#

as in df = df[['CLOSE', 'BUY', 'SELL', 'SMA']] or whatever
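
Putting the two fixes together (the extra pair of brackets and 0-based positions), with a one-row stand-in for the four columns named above:

```python
import pandas as pd

df = pd.DataFrame({"SMA": [1.0], "CLOSE": [2.0], "BUY": [3.0], "SELL": [4.0]})

cols = df.columns.tolist()
colsorder = [1, 0, 3, 2]                        # 0-based positions

by_position = df[[cols[i] for i in colsorder]]  # inner list comprehension, outer indexing brackets
by_iloc = df.iloc[:, colsorder]                 # same result using positional indexing
by_name = df[["CLOSE", "SMA", "SELL", "BUY"]]   # same result using explicit names

print(by_position.columns.tolist())
```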

fast bluff
#

I was doing that before and ran into some errors yelling at me to use .iloc so I've just been playing it safe

#

Starting with 0 fixed the problem lol tysm that went right over my head

grave frost
#

Just an idle and theoretical question - can we use a decoder (like in an autoencoder) with some layers to give probablity oriented outputs? Like for every Latent Space combination, it would produce several unique possiblities. How would we implement something like that? Maybe by using "softmax" at the end to get probablities, but the thing would be that those outputs would all be the BEST it can produce, just tweaked in different ways. Like it wouldn't assign 94% to one, then 3% to another and so on. It would give several outputs but each of them would be the most likely. Like for 50 outputs, it would assign each one at 2%. So is a decoder deterministic in that way (to be not able to produce multiple outputs for same input) or does it actually end up producing unique outputs every time?

lapis sequoia
#

So I now have this variable workdays, where each id is associated with a different number, and I'd simply like to plot the frequency distribution of each number for the whole data set, but I only get one big column


plt.hist(data)
plt.show()
#

I'd like to have a frequency distribution

tidal bough
#

what does data look like?

#

like, what's data.head()?

desert oar
#

@lapis sequoia

    df = df[[cols[i] for i in colsorder]]                                  
lapis sequoia
#

99 is a specific value that means error or something, but other IDs can have any number

tidal bough
#

you probably want to hist a specific column

lapis sequoia
#

@desert oar sorry I dont understand where i'm supposed to do that

tidal bough
#
plt.hist(data["WORKDAYS"])
plt.show()
paper niche
#

https://media.discordapp.net/attachments/587375753306570782/755439106317877341/unknown.png?width=730&height=702 Hello, how do I retrieve a specific timestamp from this data frame? Also, how would i make the timestamps a column 'date'? This is a multi index btw
@marsh seal
if they are proper datetime formats, you can retrieve directly with new.loc['2017-01-03'] if you want the 3 Jan 2017 rows, for example.
and I'm not quite sure I get your second question, you want to reset the timestamp index into its own column called 'date'?

lapis sequoia
#

@tidal bough it worked thanks! so even if I specify the reading to only the columns I still need to specify it in the plot, thanks 😉

modest rune
#

@odd yoke So, it is ugly and needs to get cleaned up. But the functions I wanted to run with numba now run without errors. They went from taking 4 seconds to run to 500ms. Which is great.

But, something weird happened, the parent function that calls them:

# everything from custom_function down has been decorated with @njit
pandas.DataFrame.rolling().apply(custom_function)

pandas.DataFrame.rolling().apply(custom_function) (refer to this as A) gets called 8000 times. Because of how rolling() works, custom_function (refer to this as Child of A) gets called probably 1 million times (8000 * 100).

Before numba: A took 4 seconds to run, of which Child of A accounted for 3.8 of those seconds.
After numba: A took 4 seconds to run, of which Child of A accounted for 0.5 of those seconds.

Seems like there is a significant amount of overhead loading the numba function or something. Any insights, I feel like something trivial needs be done to fix this problem.
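
One possible source of this overhead, offered as a guess rather than a diagnosis: with the default `raw=False`, `rolling().apply()` builds a pandas Series for every window before calling the function. Passing `raw=True` hands the function a plain NumPy array instead, which is also what a numba-compiled function expects. The `custom_function` body below is a made-up placeholder.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))


def custom_function(window):
    # With raw=True, window is a NumPy array rather than a pandas Series.
    return window.max() - window.min()


result = s.rolling(3).apply(custom_function, raw=True)
print(result.tolist())
```

Recent pandas versions also accept `engine="numba"` together with `raw=True` in `rolling().apply()`, which may be worth profiling against the manual `@njit` approach.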

tidal bough
#

@lapis sequoia Alternatively, you can use seaborn to visualize dataframes; it's nicer in some ways. With it, it'd be:

import seaborn as sns
sns.displot(data=data,x="WORKDAYS")
marsh seal
#

@paper niche yes i want the dates to have a column named 'date. this way i can filter through them

lapis sequoia
#

thanks will look into it

#

and do you know how could I delete some values of my data like since 0 and 99 are specific values that arent really analysable ?

#

tldr: How to delete specifics values of a column before plotting

tidal bough
#

I'd suggest, instead of using 0 and 99 to mark "invalid value", using None or NaN.

#

Though also, you can just filter them out when plotting:

#
plt.hist(data["WORKDAYS"][(data["WORKDAYS"]!=99) & (data["WORKDAYS"]!=0)])
plt.show()
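
Both suggestions (filtering and switching to NaN) can be checked without plotting. The sample values are made up; 0 and 99 are the codes described above as not analysable.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"WORKDAYS": [0, 2, 5, 99, 5, 30, 0]})

# Option 1: drop the special codes just for this analysis.
valid = data.loc[~data["WORKDAYS"].isin([0, 99]), "WORKDAYS"]

# Option 2: turn the special codes into NaN so later steps skip them automatically.
as_nan = data["WORKDAYS"].replace([0, 99], np.nan)

print(valid.tolist())
print(as_nan.isna().sum())
```

Either `valid` or `as_nan.dropna()` can then be passed to `plt.hist`.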
paper niche
#

@paper niche yes i want the dates to have a column named 'date. this way i can filter through them
@marsh seal you can use df.rename_axis(index=['date', None]).reset_index(level=0) which will convert the first-level index (your timestamp index) into its own column
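
A self-contained sketch of both answers on a made-up two-level index (timestamps, then a second unnamed level). The frame name `new` follows the conversation; the `market_cap` values and second-level labels are placeholders.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-03", "2017-01-04"]), ["AAA", "BBB"]]
)
new = pd.DataFrame({"market_cap": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Select all rows for one date on the first index level.
jan3 = new.loc["2017-01-03"]

# Name the first level 'date', then move it out of the index into a column.
with_date = new.rename_axis(index=["date", None]).reset_index(level=0)

print(with_date.columns.tolist())
```

Once `date` is an ordinary column it can be filtered with boolean masks like any other column.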

marsh seal
#

@paper niche thank you for your help. much appreciated!! (i've been stuck on this for a few hours)

paper niche
#

sure no problem, I edited my answer btw, you can rename the index before resetting it so it gets a meaningful column name

lapis sequoia
#

amazing thanks ! @tidal bough one last thing if you please, the x values arent really readable and scaled, how do I show the exact numbers for each :

#

it's weird since the data show some numbers "2" and isnt from 0 to 10 or anything

tidal bough
#

with seaborn it's very simple: pass discrete = True.

#

With matplotlib, it's more annoying - you'd have to manually set the number of bins to max - min of your data.

lapis sequoia
#

ok so I have to find the max mins with describe or something and what do I put in bins= then ?

#

literally max - min ?

tidal bough
#

lemme find the last time I did that...

lapis sequoia
#

because the min max is in fact 0 to 100 and I already randomly put the bins=10/1 but it is what is on my screenshot

#

thanks for your help 🙂

#

oh no

#

I put 10/1 so I must do it 100/1 , right ?

#

that's it!

tidal bough
#

should be max(data["WORKDAYS"]) - min(data["WORKDAYS"]) + 1 bins, and also pass align = "left"

lapis sequoia
#

How do I put every number 1,2,3,4,5 in the x legend tho

#

hmmm

tidal bough
#

How do I put every number 1,2,3,4,5 in the x legend tho
no idea, at that point you might want to use seaborn 🙂
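
For the matplotlib route, one common approach is to place bin edges at half-integers so each integer value gets its own bar, then set one tick per integer. The sketch below verifies the binning with NumPy only; the plotting calls are shown as comments and are not run, and the sample values are made up.

```python
import numpy as np

workdays = np.array([1, 2, 2, 3, 5, 5, 5])

lo, hi = int(workdays.min()), int(workdays.max())
# One bin per integer, with edges at half-integers so each bar is centred on its value.
edges = np.arange(lo, hi + 2) - 0.5
counts, _ = np.histogram(workdays, bins=edges)

print(dict(zip(range(lo, hi + 1), counts.tolist())))

# Assumed matplotlib usage (not run here):
#   plt.hist(workdays, bins=edges)
#   plt.xticks(range(lo, hi + 1))   # one tick label per integer value
#   plt.show()
```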

lapis sequoia
#

ok , thanks for the help so far!

desert oar
#

@lapis sequoia write df = df[[cols[i] for i in colsorder]] instead of df = df[cols[i] for i in colsorder].

lapis sequoia
#

ah it's not me lol it's @fast bluff his username makes it seems it was me asking

#

I was wondering what you were talking about...

fast bluff
#

LOL

#

mbmb I like being sneaky

desert oar
#

@fast bluff mods will probably ask you to change your name, for being difficult to tag

fast bluff
#

I'll set a nickname

desert oar
#

👍

fast bluff
#

😄

red rose
#

where do you guys find interesting data to work with ?

mellow spruce
#

Hi guys I was wondering how can I apply a value of a list to 10 rows of a data frame and then go to the next value of the list. ie.:
Input:

```
Numbers
1
2
3
4
5
6
7
8
9
10
11
12
13
```

Output:

```
Numbers | Group
1       | A
2       | A
3       | A
4       | A
5       | A
6       | A
7       | A
8       | A
9       | A
10      | A
11      | B
12      | B
13      | B
```

Thank you !
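
One way to get that output, assuming the labels live in a plain list (`labels` is a hypothetical name) and the rows are already in the desired order: integer-divide each row position by 10 to pick the label.

```python
import pandas as pd

df = pd.DataFrame({"Numbers": list(range(1, 14))})
labels = ["A", "B", "C"]   # hypothetical list of values, one per block of 10 rows

# Row position // 10 is 0 for the first ten rows, 1 for the next ten, and so on.
df["Group"] = [labels[i // 10] for i in range(len(df))]

print(df["Group"].tolist())
```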

marsh seal
gray phoenix
#

Does anyone know how to tweak the NLTK library?

desert oar
#

@red rose UCI machine learning repository, kaggle, various blog posts

#

@gray phoenix tweak how?

#

@marsh seal is new a DataFrame or a Series?

marsh seal
desert oar
#

so new was a Series before you did reset_index?

#

new_df = new.rename_axis(['date', 'market_cap']).reset_index(level='date') how about this?

gray phoenix
#

@desert oar I am using NLTK to read reviews. But right now the accuracy is fairly low.

I'm looking in how i can improve the accuracy

desert oar
#

@gray phoenix it sounds like you might need a different model

#

it's usually not a matter of tweaking the library

#

you need to understand what the existing model is doing, and why it produces the results that it produces

#

then you can understand how to make it better

#

this is why knowing how the model works is important, otherwise you are just guessing or blindly following recommendations

#

what method are you currently using? what do you mean by "read reviews"?

marsh seal
#

@desert oar data frame. the output for the code new_df is 'length of new names must be 1, got 2'

desert oar
#

well you're doing all these inplace operations

#

so now it's a dataframe

#

but it looks like it was a series before

gray phoenix
#

@desert oar I might have to do more research into it. TBH I am still fairly new to Data Science and NLP.

But to give you the use case: at work I am working with the data scientist to take the reviews we get from our patients, and give a numerical value of our patients' experience. He gave me the data of the reviews, and he also gave me a version of the same output, except with his NLP engine's score. (He codes in R, I code in Python).

I was using NLTK to score the reviews, but I am way off. Not sure if I should be utilizing NLTK. I'm just a little lost and don't know where to start

desert oar
#

what part of NLTK are you using?

#

and how do you define patient experience? what does the numerical value represent?

gray phoenix
#

I am using

from nltk.sentiment.vader import SentimentIntensityAnalyzer

I believe I am using vader_lexicon

in my output, I have the "PatientFeedback" then How positive, neutral, negative, the patients feedback is

desert oar
#

you are using SentimentIntensityAnalyzer.polarity_scores?

#

(i'm not an expert in this area, but you should first make sure you're using the tool correctly before you decide if the tool is not suitable for your application)

gray phoenix
#

Yes I am

for review in reviews['PatientFeedback']: sent = sia.polarity_scores(str(review))

Do you know where or a resource I can learn about it

desert oar
gray phoenix
#

Thank you @desert oar

desert oar
#

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
it might be that this just is not the right tool for the job

#

but im not sure what other tools for sentiment analysis are available. you might need to do some digging, look around on forums, read blog posts, etc.

gray phoenix
#

Thank you so much

desert oar
#

this seems to be the original paper citation:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
#

so you might also want to look at what papers cited this paper

dense copper
#

hey guys, if I have a pandas dataframe like this, any ideas how I can get only the first and last item for each unique instance (ticker) in the first column?

ticker dimension calendardate
AACH       MRY   2018-12-31 2018-12-31
AACH       MRY   2017-12-31 2017-12-31
AACH       MRY   2016-12-31 2016-12-31
AACH       MRY   2015-12-31 2015-12-31
AACG       MRY   2019-12-31 2019-12-31
AACG       MRY   2018-12-31 2018-12-31
AACG       MRY   2017-12-31 2017-03-31
AACG       MRY   2016-12-31 2016-03-31
AACG       MRY   2015-12-31 2015-03-31
AAAP       MRY   2016-12-31 2016-12-31
AAAP       MRY   2015-12-31 2015-12-31
  AA       MRY   2019-12-31 2019-12-31
  AA       MRY   2018-12-31 2018-12-31
  AA       MRY   2017-12-31 2017-12-31
  AA       MRY   2016-12-31 2016-12-31
  AA       MRY   2015-12-31 2015-12-31
   A       MRY   2019-12-31 2019-10-31
   A       MRY   2018-12-31 2018-10-31
   A       MRY   2017-12-31 2017-10-31
   A       MRY   2016-12-31 2016-10-31
   A       MRY   2015-12-31 2015-10-31
#

for example I would want only 2018 and 2015 for AACH, 2019 and 2015 for AACG, etc

#

I expect I need to groupby ticker but I'm not sure where to go from there

desert oar
#

@dense copper
```python
data.groupby('ticker').agg(['first', 'last'])
```

maybe?
dense copper
#

hmm that's not a bad idea, but I'm reading I should probably use nth(0) rather than first() since first() ignores NaNs... any idea if it's possible to pass a param to nth() in that context you used it w/ agg? I don't see anything in the docs about that
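
An alternative that sidesteps the `first()`/`nth()` question: `groupby().head(1)` and `groupby().tail(1)` select whole rows by position, so NaN values are not skipped. The frame below is a shortened, made-up version of the table above.

```python
import pandas as pd

data = pd.DataFrame({
    "ticker": ["AACH", "AACH", "AACH", "AAAP", "AAAP", "A"],
    "calendardate": ["2018-12-31", "2017-12-31", "2016-12-31",
                     "2016-12-31", "2015-12-31", "2019-12-31"],
})

g = data.groupby("ticker", sort=False)
first_last = (
    pd.concat([g.head(1), g.tail(1)])
    .loc[lambda d: ~d.index.duplicated()]   # a one-row group would otherwise appear twice
    .sort_index()                           # restore the original row order
)
print(first_last)
```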

#

oof I need to get better at pandas lol

fast bluff
#

Would anyone mind reviewing my code?? Everything works as intended, am just a newb and criticism is always helpful

#

And if you're down, would it be easier to send the entire file, or send it via discord code? It's only 60 lines

novel remnant
#

a nested list

solid aurora
#

Is there a way to use sklearn's pipelines to short-circuit out on certain datapoints?

#

i.e. if I had a trained model M

#

and input data D = [1, 2, -3, 4]

#

let's say I want my output to be M(p) for each point p in D

#

unless p < 0, in which case I want to short-circuit out of the pipeline and the output is -1

#

so my output should look like [M(1), M(2), -1, M(4)]

#

(my actual condition is a bit more complex but still vectorizable)

#

The reason I wanted to use sklearn's pipelines is to keep the shape and order of my input data

#

so maybe there's a better way to do that?
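
Outside of a Pipeline, the shape- and order-preserving part can be done with a boolean mask. This is only a sketch: `DoubleModel` is a hypothetical stand-in for the trained model M, and `predict_with_short_circuit` is a name invented here.

```python
import numpy as np


class DoubleModel:
    """Hypothetical stand-in for a trained model with a scikit-learn style predict()."""

    def predict(self, X):
        return X[:, 0] * 2.0


def predict_with_short_circuit(model, X, fallback=-1.0):
    X = np.asarray(X, dtype=float)
    skip = X[:, 0] < 0                        # vectorized short-circuit condition
    out = np.full(X.shape[0], fallback)       # start with the fallback everywhere
    if (~skip).any():
        out[~skip] = model.predict(X[~skip])  # run the model only on the remaining rows
    return out


D = np.array([[1], [2], [-3], [4]])
print(predict_with_short_circuit(DoubleModel(), D))
```

The same logic could be wrapped in a custom estimator if it needs to live inside a Pipeline.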

serene scaffold
#
# df: pd.DataFrame
>>> array = df[column].to_numpy(dtype=np.float, na_value=np.nan)
ValueError: could not convert string to float: '?'

I don't think pandas was having any issues reading in the question marks as null values, however it represents those, so I'm not sure why they're not being converted to np.nan

desert oar
#

@void anvil json?

#

@serene scaffold what is your question exactly? if df[column] has "?"s in it, that error makes sense

serene scaffold
#

@serene scaffold what is your question exactly? if df[column] has "?"s in it, that error makes sense
@desert oar I should have mentioned that df[column] is a Series but I want them to be np.nan

desert oar
#

does it contain floats and strings, or all strings?

#

e.g. is it pd.Series([1.3, 2.7, '?']) or pd.Series(['1.3', '2.7', '?'])

serene scaffold
#

let's see

#

ah, they're strings

#

darn

desert oar
#

the question mark is the problem no matter what

serene scaffold
#

that's how the csv represents missing data

desert oar
#
import numpy as np
import pandas as pd

y = pd.Series([1.3, 2.7, '?'])

y_arr = y.mask(y == '?').to_numpy(dtype=float)  # np.float was removed in newer NumPy

@serene scaffold

#

you have to just replace the ? with something else

serene scaffold
#

Thanks!

desert oar
#

unfortunately .replace doesn't actually let you replace with nulls, i think

#
print( y.replace('?', np.nan).to_numpy(dtype=float) )

you can do it with float('nan')/math.nan/np.nan but not None

#

@void anvil no, the "top level" of a json file can be any valid json type

#

null, int, float, string, array, object

#

why is that weird? it makes sense

#

!e ```python
import json
print( json.loads("null") )
print( json.loads("3.5") )
print( json.loads('["a", "b"]') )
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | None
002 | 3.5
003 | ['a', 'b']
desert oar
#

why would you need any of that

#

you mean, why would someone save data as just arrays? who knows

#

where did you get this data

#

thats not what i mean, lol

#

what's the data for? who generated it?

#
[[1, 2, 3], [4, 5, 6]]

it's like this?

serene scaffold
#
def open_csv(path):
    df = pd.read_csv(path)
    df = df.replace('?', float('nan'))
    return df
#

this seems to fix it

desert oar
#

@serene scaffold ```python
df = pd.read_csv(path, na_values=['?'])
```

serene scaffold
#

oh dank
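
A small self-contained check of the `na_values` approach, using an in-memory CSV. The column names and values here are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV where '?' marks missing data.
csv_text = "height,weight\n1.3,10\n?,12\n2.7,?\n"

df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Because '?' is parsed as NaN, both columns come out numeric
# and convert to float arrays without a ValueError.
heights = df["height"].to_numpy(dtype=float)
print(df.dtypes)
print(heights)
```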

desert oar
#

why are you being asked to work on data that you dont know anything about

#

yikes

#

thats a rough job

#

you might want to talk to whoever currently uses this data

#

ask them what it's for, how they use it, etc

#

maybe there's an automated system that consumes it -- do you have access to its source code or at least its business logic?

#

etc

#

this is the part of data science that requires you to be more of a business person / manager than anything

modest rune
#

I think numba is compiling every time I run the program. Isn't it supposed to cache the binaries?

desert oar
#

maybe for some reason your program doesnt meet those requirements

modest rune
#

I see there is a cache flag you can set... I guess I assumed it cached the binaries by default.

#

HAHAHA!!! That did IT! cache=True!

#

4 seconds down to 1.366 seconds. Still a lot of room for improvement, but at least I didn't completely waste my time numbafying my code.

serene scaffold
#
>>> x
[0.48098 0.60981 0.47886 ... 0.47272 0.56138     nan]
>>> y
[    nan 0.41685 0.37368 ... 0.35527 0.44202 0.22364]
>>> x[~np.isnan(x).any(axis=-1), ~np.isnan(y).any(axis=-1)]
[]
modest rune
#

And the numbafied code is no longer the long pole in the benchmark tent

serene scaffold
#

evidently this isn't the way to remove indices where the element is nan in either

modest rune
#

Thankyou @desert oar and @odd yoke

buoyant flint
#

Hello nice people, can some one help me with a question using the str.extract function

lapis sequoia
#

anyone know how to use Plotly Express

#

Fig1 = px.scatter_mapbox (Random Locations)
Fig2 = scatter_mapbox (Random Locations 2)

Then somehow get both of them to show

#

I cannot get this to work, and reading the documentation is enraging

#

cannot find anything that solves my issue

desert oar
#

@serene scaffold im confused what are you trying to do

#

@void anvil i dont understand, its just a file full of numbers?

#

if its 15 years old it probably isnt even "json" technically

#

if nobody is using it and nobody knows what it is

#

why does anyone care

#

just delete it

serene scaffold
#
>>> a = np.array([np.nan, 2.0, 6.0, 4.0, np.nan])
>>> b = np.array([1.0, np.nan, 3.0, 8.0, np.nan])
# convert these to ->
>>> a = np.array([6.0, 4.0])
>>> b = np.array([3.0, 8.0])
#

@desert oar delete every element that's nan in either array

#

actually this is wrong too

#

fixed

lime loom
#

is it possible to have a jupyter notebook use kernels from multiple servers?

serene scaffold
#
x: np.array; y: np.array
overlap = ~np.isnan(x) & ~np.isnan(y)
x, y = x[overlap], y[overlap]
#

appears to work correctly

#

very esoteric looking though

modest rune
#

I'm sorry, but I just can't make sense of these numba errors. I could use help 2 ways... (1) How do I interpret the errors? and (2) how do I fix this particular error.

Code:

@njit
def iterate_stuff(c_TV, c_x, c_y, p_TV, p_x, p_y,underlying, initial_investment,
  flat_fee, per_contract_fee, percent_fee,days, putCall_floats, multipliers,
  asks, strikes, closes, putCalls):

  for i in range(days.shape[0]):
    # lots of code afer this

Error:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
During: typing of argument at c:/.../analysis2.py (222)

File "analysis2.py", line 222:
def iterate_stuff(c_TV, c_x, c_y, p_TV, p_x, p_y,underlying, initial_investment, flat_fee, per_contract_fee, percent_fee,days, putCall_floats, multipli
ers, asks, strikes, closes, putCalls):
    <source elided>

    for i in range(days.shape[0]):
    ^
desert oar
#

@serene scaffold well you dont need to use the tuple expansion trick

#

it isnt esoteric to me, its just using boolean operators for boolean operations

#

also maybe you want | and not &

serene scaffold
#

if either is False, it has to be False

desert oar
#

then use or, not and

#

wait

#

what if one is missing but the other is not missing

#

nan, rather

serene scaffold
#

we only want the elements that are numbers in both arrays

desert oar
#

yeah, use |

#

oh i see

#

you have ~ in there

#

so yes youre fine

serene scaffold
#

we need two arrays of only numbers that are the same shape.
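
The `&` versus `|` back-and-forth above comes down to De Morgan's law: "not NaN in x and not NaN in y" is the same mask as "not (NaN in x or NaN in y)". A quick sketch with the example arrays from earlier:

```python
import numpy as np

a = np.array([np.nan, 2.0, 6.0, 4.0, np.nan])
b = np.array([1.0, np.nan, 3.0, 8.0, np.nan])

# Keep positions where both arrays hold a number.
both_valid = ~np.isnan(a) & ~np.isnan(b)

# Equivalent form: drop positions where either array is NaN.
same_mask = ~(np.isnan(a) | np.isnan(b))

a_clean, b_clean = a[both_valid], b[both_valid]
print(a_clean, b_clean)
```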

desert oar
#

@modest rune im surprised at that error, without knowing more about your code its hard to say why

#

it might even be a problem in numba's type inference

modest rune
#

@desert oar I think I figured it out. One of the many parameters I had in my function declaration was an array of strings. I removed it and the error went away.
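
The `array(pyobject, 1d, C)` in the error suggests an object-dtype array reached the jitted function, which nopython mode cannot type. A small NumPy-only sketch for spotting such arguments before calling the function; `find_object_arrays` is a hypothetical helper, and the sample arrays are made up.

```python
import numpy as np

def find_object_arrays(**named_args):
    """Return the names of arguments that are NumPy arrays with dtype=object."""
    return [
        name
        for name, value in named_args.items()
        if isinstance(value, np.ndarray) and value.dtype == object
    ]

# Made-up stand-ins for two of the function's arguments.
strikes = np.array([100.0, 105.0])
put_calls = np.array(["put", "call"], dtype=object)  # e.g. strings taken from a pandas column

offenders = find_object_arrays(strikes=strikes, putCalls=put_calls)
print(offenders)
```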

desert oar
#

ah

modest rune
#

Not a very helpful error.

desert oar
#

yeah, numba probably doesnt like that

#

no its not helpful

#

especially since the arrow is pointing to the wrong thing

#

could be a good bug report if you can isolate and reproduce it

modest rune
#

yeah... I might be emotionally up for that if I can get my code to run 🙂

#

I keep biting off bigger and bigger chunks of code to numbify... its like pulling teeth.

orchid lintel
lapis sequoia
#

nice

sinful dock
#

Can anyone please help with this, I 've been stuck for a while:

df_diff = pd.DataFrame([1, 2 , 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 0, 8], columns = ['diff'], index=pd.date_range('1/1/2000', periods=24,freq = 'H'))

I'm trying to get the index locations of the beginning and **end** of the longest run of consecutive zeros; for the case above the answer would be (2000-01-01 09:00:00, 2000-01-01 18:00:00)
This is what I have so far:

```python
ref_val = 0
shut_in_days = 7
for row, value in enumerate(df[col_name]):
    if value == ref_val:
        day_counter = 0
        # Check consecutive zero elements
        previous_value = abs(df[col_name][row-1]) if row > 0 else None
        current_value = abs(df[col_name][row])
        next_value = abs(df[col_name][row+1]) if row < len(df[col_name])-1 else None
        while (previous_value + current_value + next_value) ==0 and (day_counter > shut_in_days):
            row+=1
            day_counter+=1
        print(df.index[row], value) if row < len(df[col_name])-1 else None
    else:
        print(row, value)
```
velvet thorn
#

@sinful dock this is what I did:

>>> zero_runs = (df.diff(1) != 0).cumsum()[df['diff'] == 0]
>>> zero_runs[zero_runs['diff'] == zero_runs.groupby('diff').size().idxmax()].index
DatetimeIndex(['2000-01-01 09:00:00', '2000-01-01 10:00:00',
               '2000-01-01 11:00:00', '2000-01-01 12:00:00',
               '2000-01-01 13:00:00', '2000-01-01 14:00:00',
               '2000-01-01 15:00:00', '2000-01-01 16:00:00',
               '2000-01-01 17:00:00', '2000-01-01 18:00:00'],
              dtype='datetime64[ns]', freq=None)
#

there might be a better way

#

I haven't used pandas in a long time
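
A variation on the same run-labelling idea that returns just the start and end timestamps. This is a sketch, not the only way to do it; `longest_zero_run` is a hypothetical helper name.

```python
import pandas as pd

def longest_zero_run(series):
    """Return (first_label, last_label) of the longest run of zeros, or None."""
    is_zero = series.eq(0)
    # A new run starts wherever the zero/non-zero state changes.
    run_id = is_zero.ne(is_zero.shift(fill_value=False)).cumsum()
    zero_run_ids = run_id[is_zero]
    if zero_run_ids.empty:
        return None
    longest = zero_run_ids.value_counts().idxmax()
    labels = zero_run_ids[zero_run_ids == longest].index
    return labels[0], labels[-1]

values = [1, 2, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 0, 8]
index = pd.date_range("2000-01-01", periods=24, freq=pd.Timedelta(hours=1))
df_diff = pd.DataFrame({"diff": values}, index=index)

start, end = longest_zero_run(df_diff["diff"])
print(start, end)
```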

sinful dock
#

@velvet thorn thanks, trying to digest it now

lapis sequoia
#

Hi all. I have a question about pandas. How can you replace a series in a dataframe with another series? Essentially I took in a dataframe, then a series from that. I took in ANOTHER dataframe, and another series from that

#

I tried to combine these two series using the .append function. It seems like I can store this newly combined series in a variable, but I am now trying to replace the old series from the old DataFrame with the newly combined series. But nothing I am doing is working and pandas' documentation isn't leading me anywhere useful. Is this even possible in pandas?

#

Please let me know with a mention if you happen to have any idea. Thank you very much

#

Like, I realllllly thought that oldSeries = newSeries would do the trick but no can do

#

How do you mean?

#

For example, I'm trying the update function and this doesn't even seem to be working. When I so much as try to print out the new dataframe, the old series is still there even though I apparently updated it with the new one

sinful dock
lapis sequoia
#

Thats what I meant by oldSeries = newSeries

#

Unfortunately, the newly combined series does not yet exist in a dataframe. I'm trying to insert it into the old dataframe in place of the original series, but it's stuck to the dataframe, seemingly

#

Here is one concern I had: can one series not be longer than the rest of the series in the dataframe? That's the only reason I can think of. If pandas disallows that it must be capping the series at some point, which is really frustrating

velvet thorn
#

Here is one concern I had: can one series not be longer than the rest of the series in the dataframe? That's the only reason I can think of. If pandas disallows that it must be capping the series at some point, which is really frustrating
@lapis sequoia no, it cannot.

#

that wouldn't make sense

#

pandas is meant for tabular (i.e. rectangular) data

lapis sequoia
#

That's unfortunate

velvet thorn
#

if you need that, it's likely that you have to reconsider your data model

#

how about you explain what you're trying to do

#

might be an XY problem

lapis sequoia
#

I will try

#

I have a master dataframe so to speak. It has a very specific format and I need to keep it in that format because I have a mastercode, if you will, that reads that format very specifically. So I don't want to have to change the master format at all

velvet thorn
#

hm.

#

go on

lapis sequoia
#

I recently got a new dataframe with some new products that I am trying to add to the master dataframe. The problem is that the new dataframe has different column names, a different number of columns - basically not in the same format as the master dataframe at all. HOWEVER, the new dataframe does have an identical "products" column that I would like to combine with the "products" column in the master dataframe

#

As well as any other columns in the new dataframe that the master dataframe also might have. But I need to fit everything into the master dataframe even if that means that some of the new dataframe columns are unnecessary and untouched

velvet thorn
#

okay

#

so basically

#

you want to combine the two, leaving out unnecessary columns from the new DataFrame, and (possibly) mapping some column names

lapis sequoia
#

One solution would be to tailor the entire new dataframe so that it's in the format of the old one, and use "" as placeholders

#

Yeah, basically

velvet thorn
#

leaving null values where columns in the old DataFrame do not exist in the new one.

#

correct?

lapis sequoia
#

That might work. So far it looks like the only column in the combined dataframe that is going to be 100% filled with no null values is the "products" one, the one I am directly adding to. Later on I will gather additional information that will help me fill the null values out, but I want to keep everything in the format of the old dataframe

#

I suppose one way to do this is to make a new dataframe where every value except for that in the "products" column is turned to null

velvet thorn
#

hm.

#

I would suggest

#

replacing column names

#

then just using pd.concat

lapis sequoia
#

That's what I meant basically

#

Thanks!

desert oar
#

@left drift i think you need to give a specific example

#

its really not clear what you want

#

if you give example inputs and outputs it would help significantly

velvet thorn
#

@desert oar did you get the wrong commander

desert oar
#

yes

#

@lapis sequoia ^

lapis sequoia
#

I can try. I'm sorry that I'm not being as specific.

```
read the new dataframe

Create variable for master series
create variable for new series.
Using .isin, I want to create another series that consists of the values in new series that are not already in master series. This is to remove overlaps, essentially.

Now I would like to essentially stack master series on top of the new series. I went about many ways to do this but now I see that it's not going to work because adding a series to a dataframe breaks the rectangular shape of the dataframe, which pandas does not allow.
```
#

Sorry, that ended up not being code at ALL. I apologize, Im not looking at it right now

#

I tried using combinedseries = masterseries.append(new series)

#

It created a new series, but I couldn't simply type masterSeries = combinedseries

#

Or use the .replace documentation either

#

so it looks like I'll have to concatenate two dataframes
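
A rough sketch of that plan with `pd.concat`: drop the overlapping products with `.isin`, keep only the columns the master also has, and let the remaining master columns fill with NaN. The column names other than "products" and all the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical master and new frames; only "products" is known to match.
master = pd.DataFrame({
    "products": ["apple", "pear"],
    "price": [1.0, 2.0],
    "stock": [5, 7],
})
new = pd.DataFrame({
    "products": ["pear", "kiwi"],
    "price": [2.5, 3.0],
    "supplier": ["x", "y"],  # not in the master format, so it is left out
})

# Keep only products that are not already in the master frame.
new_only = new[~new["products"].isin(master["products"])]

# Keep only the columns the master frame also has.
shared = [col for col in new_only.columns if col in master.columns]

# Stack the rows; master columns missing from the new frame become NaN.
combined = pd.concat([master, new_only[shared]], ignore_index=True)
combined = combined.reindex(columns=master.columns)
print(combined)
```

If the new frame used different column names, a `rename(columns={...})` before the concat would line them up.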

carmine iron
#

Can anyone help with ML basics

desert oar
#

@lapis sequoia ok, i agree that pd.concat seems correct. let me work an example, one moment

#

actually it looks like you get it

lapis sequoia
#

anyone have experience with px.scatter_mapbox?

#

I cannot for the life of me get two figures with different data sets to appear on the same map

modest rune
#

Question: I am close to maximizing my speed improvements from numba. I have a few different things I want to try. One of them is... run some of my parallel code on my gpu.

So, regarding that idea. Am I going to waste my time if I try to do that on a dell laptop running windows with whatever video card comes stock with the laptop?

deft harbor
#

Most likely

bitter harbor
#

How so

#

I can’t imagine dell would/could stop you from doing it

deft harbor
#

Unless I read it incorrectly, his focus is speed

bitter harbor
#

Yes and?

deft harbor
#

I don't think there is going to be a massive speed improvement on a laptop that has a good chance of having no real gpu

#

That's why I said "most likely" to whether there is a chance he is wasting time

bitter harbor
#

The amount of speed you get would depend on which gpu 100%, but an increase is still an increase, and correct me if I’m wrong, but the newer dells ship with 2060-80’s (maybe 30.0’s too idk)

deft harbor
#

I haven't used a laptop in years, so I'm not sure what the standard dell would ship with these days 🤷‍♂️

bitter harbor
#

Just because you don’t have a titan, doesn’t mean it’s not worthwhile

deft harbor
#

I suppose it depends on what you are doing and how you account for the time it takes to set things up

bitter harbor
#

Also I’ve got a like 7 yo amd laptop gpu but it still helps

#

I suppose it depends on what you are doing
Ya that too, but unless you’re trying to take some brute force approach (or something similar), it doesn’t need to be that powerful

deft harbor
#

Transferring GPT-3 for a conditional GAN clearly

#

😅

austere swift
#

even really slow gpus beat pretty fast cpus in a lot of applications

brittle agate
#

Can I use Keras Tuner with transfer learning (for example, a model from TF Hub)?

austere swift
#

keras tuner requires you to put some hp arguments within the model generator

#

so you wouldnt be able to

carmine finch
#

does anyone know how to use github?

velvet thorn
#

does anyone know how to use github?
@carmine finch most people know how to use GitHub, but I think you want #tools-and-devops

carmine finch
#

ok so now u want me to go to a diff section

velvet thorn
#

I am suggesting

#

this channel is for data science

solid mantle
#

Does pandas only read csv?

#

I want to read a .out file using pandas. Can I do that?

velvet thorn
#

what is an .out file?

#

that doesn't seem like an extension I know of

solid mantle
#

Its a text file formal

velvet thorn
#

what produced it?

solid mantle
#

Format

#

Just some c code

#

It can be read using numpy normally, like np.loadtxt(filename.out)

#

I'm literally seeing only one command pd.read_csv in pandas

#

I believe you can treat out file like a txt file

#

Then how do you read a txt file using pandas?

brittle agate
#

so you wouldnt be able to
@austere swift
Thank u, but I can add TF Hub model and else few layers to the model and add Tuner?

austere swift
#

Yeah you can add tuner to the layers you add

velvet thorn
#

Then how do you read a txt file using pandas?
@solid mantle the documentation of np.loadtxt says whitespace is used as a delimiter

brittle agate
#

Nice, and it will be work fine?

velvet thorn
#

my guess is you could use sep=' ' in pd.read_csv

solid mantle
#

Yea my file is not a csv though

#

How can i read a txt file using pandas?

velvet thorn
#

no, my point is

#

if you can use np.loadtxt, you should be able to use pd.read_csv with sep=' '

#

pd.read_csv just reads files which use a character as a separator

#

it doesn't have to be a comma (the "c" in "csv")

solid mantle
#

You can read non CSV files with read_csv?

velvet thorn
#

it can be, for example, a tab (that would be TSV), or something else

#

extensions are hints

solid mantle
#

@velvet thorn dm

velvet thorn
#

the extension of a file is one thing, its format another

solid mantle
#

Ohh?

#

That's new info

#

I'll check

#

It worked xD
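
For reference, a small sketch of reading whitespace-delimited text with `pd.read_csv`. It uses `sep=r"\s+"` rather than `sep=' '`, on the assumption that the file may mix tabs and repeated spaces the way `np.loadtxt` tolerates; the data and column names are made up.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of a .out file: no header,
# columns separated by a mix of spaces and tabs.
text = "1.0 2.0  3.0\n4.0\t5.0 6.0\n"

df = pd.read_csv(
    io.StringIO(text),   # a file path such as "results.out" would work the same way
    sep=r"\s+",          # any run of whitespace is one delimiter
    header=None,
    names=["a", "b", "c"],
)
print(df)
```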

craggy light
#

the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. (from https://en.wikipedia.org/wiki/Comma-separated_values)

solid mantle
#

Thankyou

mild topaz
#

hello, i am doing image classification on dog and cat breeds. my folder structure is this way ```python
demo -->
  training --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3

  testing  --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3
```
is this the correct folder structure for image classification of dog and cat breeds?
brittle agate
#

Yeah you can add tuner to the layers you add
@austere swift
Nice, and it will be work fine?

lilac ferry
#

I keep on getting the attachment as error while trying to run sklearn.model_selection.learning_curve. The target variable is of binary class and the minority class has a frequency of ~3.7%.

lilac ferry
#

solved it! I needed to increase the minimum size of my training set. I was taking too few samples and that resulted in just one class.

brittle agate
#

hello, i am doing image classification on dog and cat breeds. my folder structure is this way ```python
demo -->
  training --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3

  testing  --> dog --> breed 1
                   --> breed 2
                   --> breed 3
           --> cat --> breed 1
                   --> breed 2
                   --> breed 3
```
is this the correct folder structure for image classification of dog and cat breeds?

@mild topaz
Folder: Demo
Dirs: Dog, Cats.

#

U need to split data with np, pt or tf.

#

It's much easier.

grave frost
#

Anyone know how to read an array with dtype tf.int64 ? I can't figure it out. Tried tf.print() but that gave the same output (the type, dtype and shape of array) which looks like this:- <BatchDataset shapes: ((1, 50), (1, 50)), types: (tf.int64, tf.int64)>

mild topaz
#

U need to split data with np, pt or tf.
@brittle agate can u explain a bit how i can do this?

odd yoke
#

the object you have is not a tensor @grave frost it's a dataset, you can use .as_numpy_iterator() and cast to list or whatever, or consume it using .take or a for loop

hasty grail
#

^

lapis sequoia
#

Hey guys, i'm trying to delete some data of columns that have a value > 30 I found this on the internet and it works for one columns but when I add the other with the & none of the two works ( I plot it after to see if it has worked)

#
```python
data = pd.read_csv('data.tsv', delimiter='\t', usecols=['WORKDAYS','MJDAY30A'])

# Get the index of rows where MJDAY30A is greater than 30
indexmjday = data[data['MJDAY30A'] > 30].index

# Delete these rows from the DataFrame
data.drop(indexmjday, inplace=True)
```
#

so this work

#

but when I add

#
indexmjday= data[ (data['MJDAY30A'] > 30) & (data['WORKDAYS'] >30)].index
#

it doesnt work anymore

velvet thorn
#

huh

#

why not just do this:

raw_data = pd.read_csv('data.tsv', delimiter='\t', usecols=['WORKDAYS','MJDAY30A'])
data = raw_data[(raw_data['WORKDAYS'] <= 30) & (raw_data['MJDAY30A'] <= 30)]
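
The reason the `&` version of the drop did not do what was wanted: it only selects rows where both columns are over 30. Dropping rows where either column is over 30 needs `|`, which (for data without NaNs) gives the same result as keeping rows where both are `<= 30`. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values; only the first row has both columns <= 30.
raw = pd.DataFrame({
    "WORKDAYS": [5, 40, 10, 31],
    "MJDAY30A": [2, 3, 93, 94],
})

# Drop rows where EITHER column exceeds 30 ...
too_big = (raw["WORKDAYS"] > 30) | (raw["MJDAY30A"] > 30)
dropped = raw.drop(raw[too_big].index)

# ... which matches keeping rows where BOTH columns are at most 30.
kept = raw[(raw["WORKDAYS"] <= 30) & (raw["MJDAY30A"] <= 30)]

print(dropped.equals(kept))
```

With NaNs in the columns the two forms differ, because every comparison against NaN is False.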
lapis sequoia
#

well idk aha im noob

#

so this delete the data I guess

#

it worked, thanks vm I will remember it

brittle agate
#

@brittle agate can u explain a bit how i can do this?
@mild topaz
Okay, I work with fork TensorFlow:
How I'm doing it:

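import tensorflow_datasets as tfds

# Load the collection (1st param), split the data, and set some standard options (3rd and 4th params).
# With with_info=True, tfds.load returns (datasets, info), so unpack both.
(train_ds, test_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)

# Validation here is not a single hold-out set: it is 10-fold cross-validation,
# one 10% slice per fold. Note the last slice overlaps the test split above.
val_folds = tfds.load(
    'tf_flowers',
    split=[f'train[{k}%:{k + 10}%]' for k in range(0, 100, 10)],
    as_supervised=True
)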
#

I think that adding data from local dir it's not big deal.

#

Need just to google.
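
For a local folder, one simple option is to split the list of file paths yourself before loading the images. A minimal sketch using only the standard library; `split_paths` and the file names are hypothetical, and in practice the split would be done per class folder.

```python
import random

def split_paths(paths, test_fraction=0.1, seed=0):
    """Shuffle paths reproducibly and return (train_paths, test_paths)."""
    paths = sorted(paths)                 # fixed starting order
    random.Random(seed).shuffle(paths)    # same shuffle for the same seed
    n_test = max(1, round(len(paths) * test_fraction))
    return paths[n_test:], paths[:n_test]

# Made-up file names standing in for the images of one breed.
all_paths = [f"dog/breed_1/img_{i}.jpg" for i in range(20)]
train_paths, test_paths = split_paths(all_paths, test_fraction=0.1)
print(len(train_paths), len(test_paths))
```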

modest rune
#

Ok, I don't really understand something. In Numba, there is the function numba.prange() that, if I understand things properly, is supposed to speed up for loops (and maybe other loops?) by parallelizing their execution. When I use prange() my loops are significantly faster. But if I use prange() and set @njit(parallel=True), usually things slow down.

And then there is @njit(nogil=True)... which is related to parallel execution too.

So, it seems like numba runs code in parallel in many situations, regardless of your nogil and parallel settings? And what is the point of setting parallel=True if nogil=False?

And, are there other functions like prange() that I should be aware of, that if I use instead of core python functions or common numpy functions will improve my speeds further?

brittle agate
#

@mild topaz
If I want to use simple validation(not k-fold and so on).

(train_ds, val_ds, test_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)
misty fjord
#

I wonna be a data scientist but from where I should start 😭 😭 😭

modest rune
#

@misty fjord that is a life level question. There are a lot of good ways to approach your problem.

brittle agate
#

@misty fjord
U need to know the theory. Basic architecture of NNs. What is bias, weights, input, output, activation functions.

#

U can read Deep Learning with Python.

modest rune
#

If I were to rewind my life and redo things, I would have taken a different approach to my learning, but the path I did take, worked well enough for engineering.

brittle agate
#

Go to the site Tensorflow. U can find plan of education for beginners.

misty fjord
#

I > @misty fjord

U need to know see theory. Basic architecture of NNs. What is bias, weights, input, output, activation's functions.
@brittle agate nns !? 😭

brittle agate
#

Neural Networks.

#

S - plural.

#

I mean.

lapis sequoia
#

how does data science relate to neural networks ? I thought data science was more statistical analysis of data and that neural networks were more like machine learning stuff

modest rune
#

If I were you, I would find what you think would be:
a. perfect job
b. perfect type of problem to solve
c. most enjoyable thing to work on all day

Then, ask questions about what you want to do working backwards from there. It seems to me that data-science is quite a broad field and Machine Learning may or may not be the path you want to take.

raw rapids
#

data science contains both machine learning and statistical analysis

misty fjord
#

Ha Ok . Can you give me a path. Or should I just start and find it my self > S - plural.
@brittle agate 😭 😢 😭 😢

lapis sequoia
#

oh okay

#

I guess it uses the data analysis to better calibrate the neural network or something like that

misty fjord
#

If I were you, I would find what you think would be:
a. perfect job
b. perfect type of problem to solve
c. most enjoyable thing to work on all day

Then, ask questions about what you want to do working backwards from there. It seems to me that data-science is quite a broad field and Machine Learning may or may not be the path you want to take.
@modest rune I'm the c type of ppl I'm just enjoying doing python

brittle agate
#

@lapis sequoia
Yep!

raw rapids
#

tune is used more often btw

#

instead of calibrate

lapis sequoia
#

hmm I see thanks, Im a real beginner rn but I'd like to go into neural networks, should I dive directly into it or is it better for me to train with statistical analysis pandas matplotlib etcc 1st ??

odd yoke
#

are you familiar with a programming language ?

lapis sequoia
#

I already know about inferential statistics just not about programming

modest rune
#

@lapis sequoia If you like programming, or specifically like programming in python, and you like data... find something you can do to earn money doing those things. Specifically, find a company you like that hires people to do exactly that. Then, ask them or someone who works for them, what you should be learning.

lapis sequoia
#

python basic that's all

modest rune
#

In the end, it is all a journey, you will start down path A, but as things evolve, you will end up on path G.

misty fjord
#

@lapis sequoia If you like programming, or specifically like programming in python, and you like data... find something you can do to earn money doing those things. Specifically, find a company you like that hires people to do exactly that. Then, ask them or someone who works for them, what you should be learning.
@modest rune if there were such companies where I live I'd have already done that

brittle agate
#

hmm I see thanks, Im a real beginner rn but I'd like to go into neural networks, should I dive directly into it or is it better for me to train with statistical analysis pandas matplotlib etcc 1st ??
@lapis sequoia
I think that not. Yeah, of course u need to know data structures of python. But today. Forks of ML is so easy.

#

U can choose TensorFlow or PyTorch. I recommend to start with TF, because:

  1. TF has many guides(official site).
  2. Ecosystem: TF Hub, TF Board and Datasets.
  3. He is very easy fork, because above TF u're using Keras.
#

High API fork for the easy work with backend forks like PyTorch, Theano, TensorFlow and so on.

#
  4. U can use TensorFlow for the Production.
#

He has coral, lite(mobile special) and .js versions.

odd yoke
#

really torch is definitely not far behind, if at all, on most of these points

brittle agate
#

Yeah I know.

#

With Caffe 2 is so hot.

#

But I prefer TensorFlow.

lapis sequoia
#

I appreciate the help, thank u. Tho I dont understand every of your points tho lol what does "fork" means aha ?

The thing is as with everything since ive learned python Im kind of lost into what should I do like should I find a project that would interest me to do with tensor flow or should I follow broad theory courses with random exercises ?

brittle agate
#

@odd yoke
I think, that he is more friendly with newest people.

odd yoke
#

the edge TF has is deploying on different systems with tflite, or tf.js but other than that:

  • torch also has a bunch of guides
  • torch has a ton of premade models, datasets etc as well
  • it has a very similar api to keras, while being significantly faster
  • and it's definitely usable in production
brittle agate
#

@lapis sequoia
I mean framework.

lapis sequoia
#

ooh okay lol

odd yoke
#

tf is actually much better now but not that long ago it was an absolute mess

#

if you ever have the opportunity to use tf1.XX, do it, it's an enlightening experience

brittle agate
#

the edge TF has is deploying on different systems with tflite, or tf.js but other than that:

  • torch also has a bunch of guides
  • torch has a ton of premade models, datasets etc as well
  • it has a very similar api to keras, while being significantly faster
  • and it's definitely usable in production
    @odd yoke
    I recommended TensorFlow because I know him much better(and I like him :3). But he can choice another fork.
odd yoke
#

you will see what bad design is

brittle agate
#

tf is actually much better now but not that long ago it was an absolute mess
@odd yoke
TF 1.x was so stoopid.

odd yoke
#

yet it's what my company uses pensivewobble

brittle agate
#

you will see what bad design is
@odd yoke
But we talking about present.

#

And for beginner is good choice.

odd yoke
#

really i don't think there's any major difference anymore between torch and tf ever since tf2 came out

#

both are completely fine

lapis sequoia
#

ive found the pdf of book "hands on machine learning" do you recommend me to start with that without any prior experiences or to explore here and there by myself ?

brittle agate
#

@odd yoke

I recommended TensorFlow because I know him much better(and I like him :3). But he can choice another fork.

#

But yeah what u said is true.

#

of course :D

#

ive found the pdf of book "hands on machine learning" do you recommend me to start with taht without any prior experiences or to explore here and there by myself ?
@lapis sequoia
Deep Learning with Python.

odd yoke
#

I personally am not a fan of starting directly with code because, in my experience, i see a tendency in ppl that do that to never go back to see the fundamentals and theory behind the algorithms

brittle agate
#

What is ppl?

odd yoke
#

people

brittle agate
#

kk

lapis sequoia
#

hmm yeah like they apply it as a method or recipe without really knowing what they are doing

#

@brittle agate is it a book ?

brittle agate
#

Yep)

lapis sequoia
#

ok thanks will look into it

odd yoke
#

there's "deep learning with pytorch" which is free and actually is a nice mix of theory and programming

#

and it's free

brittle agate
#

Yeah, it's first edition. Second edition in process.

odd yoke
#

it's from the pytorch people too

odd yoke
#

https://pytorch.org/deep-learning-with-pytorch you can get it here if you're comfortable giving some of your information; they don't check it, you directly get a download link to the PDF

brittle agate
#

it's from the pytorch people too
@odd yoke
Is this book still up to date?

odd yoke
#

it came out 3 months ago

brittle agate
#

I mean, PyTorch has changed too, like TF.

#

it came out 3 months ago
@odd yoke Really?

#

XD

#

I didn't know.

odd yoke
#

2 months*

#

4 august

brittle agate
#

LOL

#

2 months*
@odd yoke
After these words I want to know PyTorch as well as TF)) :3

lapis sequoia
#

@brittle agate do you know if the author is Francois Chollet or if it's Nikhil ketkar ? I found 2 books with this name

#

thanks guys I'll do some research and make my choice

brittle agate
#

Francois Chollet

lapis sequoia
#

ok thanks 😉

odd yoke
#

francois chollet is the original creator of keras, so it's not some random dude too

brittle agate
#

But some methods are much harder than in TF.
For example K-Fold. You don't know what that is right now. You will find out soon x)

#

francois chollet is the original creator of keras, so it's not some random dude too
@odd yoke
I read only original book by him.

#

And yeah, all guides for beginners, on the TF official site, Francois Chollet coded.

lapis sequoia
#

hmm great I also found it for free online!

#

crazy internet

modest rune
#

@odd yoke thoughts on my question above?

odd yoke
#

the numba prange question ?

modest rune
#

yes

#

Ok, I don't really understand something. In Numba, there is the function numba.prange() that, if I understand things properly, is supposed to speed up for loops (and maybe other loops?) by parallelizing their execution. When I use prange() my loops are significantly faster. But if I use prange() and set @njit(parallel=True), things usually slow down.

And then there is @njit(nogil=True)... which is related to parallel execution too.

So, it seems like numba runs code in parallel in many situations, regardless of your nogil and parallel settings? And what is the point of setting parallel=True if nogil=False?

And, are there other functions like prange() that I should be aware of, that if I use instead of core python functions or common numpy functions will improve my speeds further?

odd yoke
#

iirc prange if there's no parallel=True is the same as range

#

i'm not sure what could cause slow downs

modest rune
#

I have about 6 nested functions that are all decorated with @njit. I get the best performance if only the top level function is set to parallel=True and all of my loops use prange() instead of range(). If I make all 6 functions parallel=True, massive slowdown.

#

Like 50x

brittle agate
#

crazy internet
@lapis sequoia
PDF Drive is good storage of books for free :))))))))))))))

modest rune
#

My guess is that numba is spinning up too many parallel OS processes when I set too much stuff to parallel, and this is not ideal.

#

just too much overhead created for the amount of cores my laptop has.

brittle agate
#

@odd yoke
Q:
When do I need to use Flatten?

#

Is it good practice to use Flatten after conv layers or not?

odd yoke
#

it depends on the architecture you're building

brittle agate
#

For example transfer learning(ResNet50).

odd yoke
#

FCN for example don't have any fully connected layers, so you would not need to flatten

brittle agate
#

I know.

odd yoke
#

For a regular resnet for something like classification, yeah, you would need to flatten

brittle agate
#

For multi-class classification.

#

I mean.

#

For a regular resnet for something like classification, yeah, you would need to flatten
@odd yoke
One time after Conv Layers?

odd yoke
#

yes

brittle agate
#

Okay.

#

Thanks :3

odd yoke
#

you can have like
conv -> relu -> maxpool -> conv -> relu -> maxpool -> ... -> flatten -> dense -> relu -> dense -> softmax
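
To make the flatten step in that pipeline concrete, here is a minimal NumPy sketch of the shapes only. The batch size, feature-map size, and the 10-class dense layer are made-up values for illustration, not taken from any real model.

```python
import numpy as np

# Hypothetical output of the last conv/pool stage:
# a batch of 8 samples, a 7x7 spatial grid, 64 channels.
feature_maps = np.zeros((8, 7, 7, 64), dtype=np.float32)

# Flatten keeps the batch axis and collapses everything else
# into one long feature vector per sample.
flat = feature_maps.reshape(feature_maps.shape[0], -1)

# Hypothetical dense layer mapping 7 * 7 * 64 features to 10 class scores.
weights = np.zeros((flat.shape[1], 10), dtype=np.float32)
logits = flat @ weights

print(flat.shape)    # (8, 3136)
print(logits.shape)  # (8, 10)
```

The dense layer needs a 2D input of shape (batch, features), which is why the flatten happens once, between the last conv block and the first dense layer.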

#

obviously for a resnet it's much more complex

brittle agate
#

Okay, thanks)

#

obviously for a resnet it's much more complex
@odd yoke
But I can use already Transfer Learning and I don't need to make residual connections))

odd yoke
#

if the model is already made, then yeah you don't have to bother

brittle agate
#

But, thanks anyway.

#

Have a good day.

dense copper
#

hey all, is there a way to have pandas only calculate pct_change for the first row of a group in a groupby group?

#

or more simply maybe, given something like this:

             A   B   C   D
2000-01-02  14   5  20  14
2000-01-09   4   2  20   3
2000-01-16   5  54   7   6
2000-01-23   4   3  21   2
2000-01-30   1   2   8   6
2000-02-06  55  32   5   4
#

can I calculate pct_change for only like, 2000-01-30 to 2000-02-06?
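
One possible way to do this, sketched under the assumption that the frame has a unique DatetimeIndex, is to skip pct_change and compute the change between the two rows directly:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [14, 4, 5, 4, 1, 55],
        "B": [5, 2, 54, 3, 2, 32],
        "C": [20, 20, 7, 21, 8, 5],
        "D": [14, 3, 6, 2, 6, 4],
    },
    index=pd.to_datetime(
        ["2000-01-02", "2000-01-09", "2000-01-16",
         "2000-01-23", "2000-01-30", "2000-02-06"]
    ),
)

# Pick the two rows of interest by label; each is a Series indexed by column.
start = df.loc[pd.Timestamp("2000-01-30")]
end = df.loc[pd.Timestamp("2000-02-06")]

# Same formula pct_change uses, applied to just this pair of rows.
change = (end - start) / start
print(change)
```

Only one row of arithmetic is done per column, so nothing is computed for the rows in between.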

halcyon vale
solar phoenix
#

Can someone help me with a pandas dataframe, I want to search a single column with a string, then output the value of a different column of the same row

#

or maybe just select the whole row
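
A minimal sketch of both variants, using a made-up frame with hypothetical column names "name" and "city": build a boolean mask on the searched column, then pass it to .loc.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["alice", "bob", "carol"], "city": ["Oslo", "Lima", "Pune"]}
)

# Exact match on one column, value taken from a different column.
mask = df["name"] == "bob"
cities = df.loc[mask, "city"]

# Same mask, whole row(s).
rows = df.loc[mask]

# Substring match instead of an exact match.
contains = df.loc[df["name"].str.contains("ar", regex=False), "city"]

print(cities.tolist())    # ['Lima']
print(contains.tolist())  # ['Pune']
```

The result is a Series or DataFrame even when there is one match, because the mask could select several rows.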

tulip briar
#

Hello guys, I program just as a hobby and Python is the only language I know. I recently came to the conclusion that I would like to create my own neural network, just to see how it works and how hard it is to build one. I knew that I needed data, so I gathered a lot of numeric data about character auctions in an MMORPG. Now I would like to create an NN which, based on ended auctions (the data that I have), will be able to predict the price of a given character. For now I'm using parameters such as: Level, Skills, Achievement Points, Charms Points, Gold (everything measured in numbers). I also made numeric data for Vocation and World (server) (I gave a different number to every vocation and server). When it comes to items etc., I skipped that part for now. I realise that this is really hard, so for now I'm leaving it out. I would like to add it in the future, once, let's say, an "alpha version" works.
The problem is, I don't really know where to start. I've seen a bunch of articles that touch on this topic, but none of them was "it". Can you recommend some sources from which I can learn to create an NN from scratch?

desert oar
#

@tulip briar neural networks fundamentally require a bit of linear algebra and calculus to understand. you can get started without knowing those things, but you might get stuck quickly. maybe check out https://www.fast.ai/ for some intro material oriented towards people with programming experience and toward getting you into neural networks (and deep learning) quickly

#

eventually if you want to get seriously into data science and machine learning you will need to start understanding statistics and probability in addition to linear algebra and calculus. but that's farther down your learning path and you will probably start learning those concepts as you go along anyway

grave frost
#

the object you have is not a tensor @grave frost it's a dataset, you can use .as_numpy_iterator() and cast to list or whatever, or consume it using .take or a for loop
@odd yoke Could you elaborate a bit more?

odd yoke
#

what you are trying to print is a tf.data.Dataset object

#

not a tf.Tensor

grave frost
#

Ohh.. right.

#

I was treating it as a tensor

serene scaffold
#

the docs for numpy mention axes a lot

#

if I want to find the average of all elements in a matrix, regardless of the shape of that matrix, does that mean the axis is zero or what?

#

[[1, 2], [3, 4]], I would want the average to be sum([1, 2, 3, 4]) / 4 even if it were [1, 2, 3, 4] or [[1], [2], [3], [4]]

#

for what I'm doing, of course, not necessarily in general

pale thunder
#

I think there is a flatten kwarg, or just .reshape((-1,))

serene scaffold
#

I guess I can use that for now but I figured axes held the key

pale thunder
#

oh, you can use .flat
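
For the record, a small sketch of the axis behaviour: np.mean with its default axis=None averages over every element regardless of shape, so no flattening is needed for this case.

```python
import numpy as np

shapes = [
    np.array([[1, 2], [3, 4]]),
    np.array([1, 2, 3, 4]),
    np.array([[1], [2], [3], [4]]),
]

# Default axis=None: one mean over all elements, whatever the shape.
means = [float(np.mean(a)) for a in shapes]
print(means)  # [2.5, 2.5, 2.5]

# For contrast, axis=0 averages down the rows, giving one mean per column.
column_means = np.mean(shapes[0], axis=0)
print(column_means)  # [2. 3.]
```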

serene scaffold
#

also I have a data frame with a column called 'Binary Label', but I need to delete that series for one part

#

the docs, unsurprisingly, mention axes

#

that series is all booleans though so I might be able to keep it if I can represent them as 1.0 and 0.0 in a numpy matrix

tulip briar
#

@desert oar I don't have much trouble with math, I'm just a total newbie in coding :-D. I have been learning about tensorflow and did a udemy course on Pandas, but didn't find any interesting guide / course about NNs.

serene scaffold
#

looks like representing booleans as floats is fine

#

I need to do calculations for elements that are not null and for which the corresponding boolean element for that row is correct

#

I'm sure I'm not phrasing this effectively

#

let's turn it into an xy problem

#
[[3., 2., 1.],
 [5., 4., 0.],
 [6., 1., 1.]]

[[1., 1., 1.],
 [0., 0., 0.],
 [1., 1., 1.]]
#

if I can go from the first to the second, I can implicitly solve the actual problem.
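
Taken literally, going from the first matrix to the second can be sketched with a column slice and a repeat. This only illustrates the transformation as stated, not the underlying imputation problem.

```python
import numpy as np

m = np.array([
    [3.0, 2.0, 1.0],
    [5.0, 4.0, 0.0],
    [6.0, 1.0, 1.0],
])

# m[:, -1:] keeps the last column as shape (3, 1) instead of (3,),
# so it can be repeated across the column axis.
last_col = m[:, -1:]
result = np.repeat(last_col, m.shape[1], axis=1)

print(result)
```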

desert oar
#

@serene scaffold i dont understand either version of the problem

serene scaffold
#

see how the rightmost column is always 1 or 0?

desert oar
#

yes

serene scaffold
#

I just want to make the entirety of each row match that

desert oar
#

what do you mean "make the entirety of each row match that"

serene scaffold
#

if the last element of the row is 1.0, every element becomes 1.0

desert oar
#

i see. and now what do you actually want to do?

serene scaffold
#

the last column comes from a dataframe where the values are the strings Yes or No

#

and I'm storing them as these floats because every other column in the dataframe is a float

#

that is, every other column stores only floats

#

or nan

desert oar
#

ok, but what do you actually want to do

serene scaffold
#

and, for example, I need to take the average of every non-nan value in a column, for which the value in the rightmost column is 1.0, and replace the nans with the average of those values.

desert oar
#

ok. you are doing mean imputation for missing data?

serene scaffold
#

ye

desert oar
#

stratified by the value of the label?

serene scaffold
#

I guess I could have just said that

desert oar
#

because you really can't/shouldn't stratify by the value of the label, since you won't know the label at prediction time if a record w/ missing data happens to appear

serene scaffold
#

it's conditional mean imputation. The two classes are just "Yes" and "No"

desert oar
#

is the yes/no category a feature or the target

serene scaffold
#

feature

dense copper
#

argh. pandas makes me wanna stab someone.

serene scaffold
#

the yes/no designation is always given

desert oar
#

can you do this in pandas? or do you have to do it in numpy?

#

its easy to do both ways but the pandas way is "more elegant" since you can use groupby

serene scaffold
#

I'm allowed to use pandas and numpy only

#

can't even use stdlib

desert oar
#

is this homework? a job interview?

serene scaffold
#

homework. which is why I'm trying to make my question very pointed to learning the libraries

#

this is the only programming assignment for the course and we're not even required to do it in python

desert oar
#

i see

#

so lets lay down some general principles then

serene scaffold
#

but there's a part where we have to do 8000 ** 2 calculations and idk how the Java users are going to do that without numpy

desert oar
#

principle 1: when in doubt, use boolean subsetting by row

#

principle 2: pandas groupby exists

#

java loops are presumably a lot faster than python loops

#

so let's do the pandas version because it sounds like your data is in pandas

serene scaffold
#

the professor is only letting people use Java because it's the default language in the department and he doesn't want to force people to learn python if they don't want to. but that's beside the point.

desert oar
#

you can do it with a for loop in python too, it will just take a long time

#
data = pd.DataFrame({
    "x": [1.0, 1.5, 3.0, -3.2, -4.7, -2.2],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"]
})

is this a fair representation of the data? of course including more columns than just x

serene scaffold
#
```
0.19958,0.2133,0.47241,0.36529,0.57143,0.53333,0.41898,0.35424,0.39056,0.41731,No
0.26562,0.25566,0.47299,0.34705,0.5,0.33333,0.42342,0.38472,0.40689,0.41538,Yes
0.11211,0.37656,0.53267,0.37086,0.57143,0.46667,0.5287,0.47377,0.50504,0.48333,No
```
desert oar
#

sure

#

ok, now let's pretend you're dumber than you are

serene scaffold
#

that's setting the bar obscenely low already

desert oar
#

give me an algorithm for imputing the mean within each group. it can be imprecise, we will make it more precise later.

#

just a rough sequence of steps

#

ignore anything related to pandas and numpy

#

imagine you are a number magician and you can just operate on tables of data directly

serene scaffold
#

I guess you could filter out all the rows that aren't the right class, and then take the average of each column

desert oar
#

stated another way: select only the rows that are the right class, take the average of each column in those rows, and replace every missing value with the average in its respective column

#

is that right?

serene scaffold
#

yes

desert oar
#

great. in pandas, how do you select rows such that column "class" equals "yes"

serene scaffold
#

is it group_by?

desert oar
#

no

#

you are less clever than that

#

pretend you are not clever

dense copper
#

can someone please explain to me wtf the point of limit is in pct_change()? It seems to do basically nothing lol

>>> revs
      ticker calendardate       revenue
None                                   
29054      A   2019-12-31  5.163000e+09
29055      A   2018-12-31  4.914000e+09
29056      A   2017-12-31  4.472000e+09
29057      A   2016-12-31  4.202000e+09
29058      A   2015-12-31  4.038000e+09
>>> revs['revenue'].pct_change()
None
29054         NaN
29055   -0.048228
29056   -0.089947
29057   -0.060376
29058   -0.039029
Name: revenue, dtype: float64
>>> revs['revenue'].pct_change(limit=1)
None
29054         NaN
29055   -0.048228
29056   -0.089947
29057   -0.060376
29058   -0.039029
Name: revenue, dtype: float64
>>> 

according to the docs, "The number of consecutive NAs to fill before stopping." -- doesn't this mean it should stop after hitting that first NaN?? all I want to do is calculate the pct_change the first fu%&$( row and I've been at this for 3 damn days and gotten nowhere.

serene scaffold
#

I don't really know, I haven't used pandas before this

desert oar
#

you have never ever subsetted rows in pandas ever?

serene scaffold
#

I've just never used pandas

desert oar
#

oh

#

lets do numpy then

dense copper
#

(and yeah I realize this is reversed, I'm using that as a test to ensure what would happen if it results in NaN for the first row, I want it to just stop and move on to the next group (next ticker))

serene scaffold
#

I've used numpy a bit but I've learned a lot of what I now know from trying to solve this assignment

desert oar
#

i see

#

i thought youve used both libraries before

#

lets do pandas then. slicing and subsetting in numpy is actually messier in some respects than in pandas

serene scaffold
#

I've never needed to use pandas for nlp and a lot of the math stuff is handled by existing libraries even if those libraries are using numpy internally.

desert oar
#

you will get confused

#

but they do explain what you need to know

serene scaffold
#

😨

desert oar
#

there are 3 ways to select rows or columns in pandas: by "label", by numerical position, or by an array of True/False

#

i use the term "array" to mean anything "array-like", which includes pd.Series, np.ndarray (1 dimensional), and list

#

in this case, you want the 3rd option

#

pandas dataframes have 2 main "accessors" that you can use for selecting data: .loc and .iloc. .loc is for selecting by label or boolean array. .iloc is for selecting by numerical position.
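
A tiny sketch of the three selection styles on a made-up frame (the column names and index labels here are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "label": ["no", "yes", "yes"]},
    index=["a", "b", "c"],
)

# 1. By label: row "b", column "x".
by_label = df.loc["b", "x"]

# 2. By numerical position: second row, first column.
by_position = df.iloc[1, 0]

# 3. By boolean array: every row where label == "yes", column "x".
by_mask = df.loc[df["label"] == "yes", "x"]

print(by_label, by_position)  # 2.0 2.0
print(by_mask.tolist())       # [2.0, 3.0]
```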

lapis sequoia
#

Quick question about appending multiple txt files into one data frame as I cannot figure out how to do this with Pandas.

I have a folder with a bunch of text files representing a banks quarterly reports from year 2006-2015 (2006-03-31, 2006-06-30, 2006-09-30, 2006-12-31....2015-12-31)

desert oar
#

@lapis sequoia
```python
from pathlib import Path

import pandas as pd

data = pd.concat([pd.read_csv(path) for path in Path('data/').iterdir()])
```

#

you might need to adjust the arguments to pd.read_csv and to pd.concat depending on your specific setup

lapis sequoia
#

But it is not CSV?

#

it is .txt files

desert oar
#

well it depends on how the data is formatted

#

.txt is just part of the filename

#

it could be anything in there

#

is it tabular data? tab-separated maybe?

lapis sequoia
#

Oh I get it

#

pathlib??

desert oar
#

yes, it's part of the standard python library

#

!d g pathlib

arctic wedgeBOT
#

New in version 3.4.

Source code: Lib/pathlib.py

This module offers classes representing filesystem paths with semantics appropriate for different operating systems. Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

../_images/pathlib-inheritance.png If you’ve never used this module before or just aren’t sure which class is right for your task, Path is most likely what you need. It instantiates a concrete path for the platform the code is running on.

Pure paths are useful in some special cases; for example:... read more

lapis sequoia
#

ok

#

the txt files are in a folder called rawdata which is in my current directory

#

then I don't need to specify path?

desert oar
#

@serene scaffold maybe i'll give you an example to help you get started w/ those docs. here it is without groupby. the numpy-only version is very similar, except without .loc:

import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

# pd.Series of bool values
is_yes = data["is_valid"] == "yes"
is_no = data["is_valid"] == "no"

# pd.DataFrame of bool values
is_null = data.drop(columns=["is_valid"]).isnull()

for c in is_null.columns:
    # Fill "yes" rows
    data.loc[is_null[c] & is_yes, c] = data.loc[~is_null[c] & is_yes, c].mean()
    # Fill "no" rows
    data.loc[is_null[c] & is_no, c] = data.loc[~is_null[c] & is_no, c].mean()
#

of course, this changes the data in-place, which you might not want

#

so maybe you can put the imputed version of the data in a new column

import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

# pd.Series of bool values
is_yes = data["is_valid"] == "yes"
is_no = data["is_valid"] == "no"

# pd.DataFrame of bool values
is_null = data.drop(columns=["is_valid"]).isnull()

for c in is_null.columns:
    # Make a new column to receive the imputed data
    c_new = f'{c}_imputed'
    data[c_new] = data[c]
    # Fill "yes" rows
    data.loc[is_null[c] & is_yes, c_new] = data.loc[~is_null[c] & is_yes, c].mean()
    # Fill "no" rows
    data.loc[is_null[c] & is_no, c_new] = data.loc[~is_null[c] & is_no, c].mean()
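
For comparison, a sketch of the groupby route mentioned earlier, on the same toy data. It assumes every group has at least one non-missing value per column; the names value_cols, group_means, and imputed are just for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

value_cols = ["x", "y"]

# transform("mean") returns a frame shaped like data[value_cols], where each
# cell holds the mean of its own group (NaN values are skipped by the mean).
group_means = data.groupby("is_valid")[value_cols].transform("mean")

# fillna with a DataFrame fills each missing cell from the matching cell.
imputed = data.copy()
imputed[value_cols] = data[value_cols].fillna(group_means)

print(imputed)
```

This leaves the original frame untouched and handles any number of classes without writing one line per class.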
#

@lapis sequoia well you need to specify the directory where to look for files

#

the general principle is, build a list of dataframes, then pass that list to pd.concat
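
A self-contained sketch of that principle. It assumes the .txt files are tab-separated with a header row and are named after their report date; the temporary folder, file contents, and column names are invented so the example can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real "rawdata" folder, with two fake quarterly reports.
    folder = Path(tmp) / "rawdata"
    folder.mkdir()
    (folder / "2006-03-31.txt").write_text("bank\tassets\nX\t100\n")
    (folder / "2006-06-30.txt").write_text("bank\tassets\nX\t110\n")

    frames = []
    for path in sorted(folder.glob("*.txt")):
        frame = pd.read_csv(path, sep="\t")
        # Keep track of which file (quarter) each row came from.
        frame["report_date"] = path.stem
        frames.append(frame)

    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

With real data, Path("rawdata") relative to the working directory would replace the temporary folder, and sep would need to match the actual file layout.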

lapis sequoia
#

Oh ok so it doesnt matter if I have the folder there in the working directory then

#

Will read into this. Thanks for the help!

desert oar
#

@dense copper im honestly not sure what that limit parameter is for... looks like bad documentation. but, just practically, how do you expect to calculate the % change in the first row? there is nothing for the percent change to be based on, in the first row.

#

@lapis sequoia you can always specify the full path to the directory

#

e.g. /home/mrbambocha/projects/lab3/data/ or something like that

lapis sequoia
#

Yes, I understand. Thanks a lot!

serene scaffold
#

@desert oar thanks, I'm looking at this

dense copper
#

@desert oar assuming the data is reversed (which it will be), the % change would be row 0 / row 1

desert oar
#

there is a lot going on here @serene scaffold so i recommend reading through the pandas guide and using my examples to help

dense copper
#

or more specifically for my case I want row 0 / row 4, but that's easy

desert oar
#

then you need to run the pct change in reverse... right?

dense copper
#

yes I know that, but my point with reversing it in this example is I want it to stop when it hits the first NaN

#

I assumed that would mean it would just not calculate anything if limit=1 and the first result is NaN, but instead it just apparently does the same thing

#

and basically everything I find about it in tutorials is someone who just copy/pasted the docs from Pandas to explain it. lol

desert oar
#

@dense copper you might have to just make up some fake data and experiment

#

what happens if limit=0?

dense copper
#

error

#

limit must be > 0

desert oar
#

i see

dense copper
#

consider this ...

desert oar
#

ive never actually used this method before so i know as much as you do

dense copper
#
>>> revs
      ticker calendardate       revenue
None                                   
29054      A   2019-12-31  5.163000e+09
29055      A   2018-12-31  4.914000e+09
29056      A   2017-12-31  4.472000e+09
29057      A   2016-12-31  4.202000e+09
29058      A   2015-12-31  4.038000e+09
>>> revs['revenue'].pct_change(99)
None
29054   NaN
29055   NaN
29056   NaN
29057   NaN
29058   NaN
Name: revenue, dtype: float64
>>> 
desert oar
dense copper
#

pct_change(99) will return NaN in every row because there's only 5 rows. anything more than pct_change(4) will be NaN for everything right?

desert oar
#

the source explains it. limit= is passed to .fillna, which is called before shifting and dividing

dense copper
#

but unfortunately it still runs the calc for every row.

dense copper
#

which is bad for me, since I have 250k+ rows and 100+ columns

desert oar
#

the limit parameter here explains itself in more detail

#

this is an issue of unclear docs from pandas

dense copper
#

surprised pikachu lol

#

it's annoying how everything in Pandas is so needlessly complex

desert oar
#

yeah. i'd recommend filing a bug report: "in the docs for pct_change, clarify that limit and fill_method are kwargs passed to fillna"

#

i'd argue that it is not needless

#

poorly documented in some cases, perhaps

dense copper
#

want a cell of data? sure just do my_data.get_index[0:,:99]['some_var'].get_the_data(act_fucky=False).index(0)[1]! easy!

#

lol

desert oar
#

huh?

#

why would you do that

dense copper
#

I'm being facetious...

desert oar
#

usually when pandas seems "overly complex" it means one of two things: the existing APIs are too low level and should have some wrapper functionality developed, or the existing APIs are too high level and need to expose more internals to the user

dense copper
#

anyway, I'm looking through the source

desert oar
#

the actual complexity is in my opinion structural and unavoidable

dense copper
#

I think the bigger issue is I'm a dumbass and don't understand pandas

desert oar
#

i rarely need long annoying invocations like that except when i'm using multiindex, which is definitely under-supported from an API design perspective

#

well that is a problem

dense copper
#

just frustrated cause I've been trying for 3 days now to accomplish something that seems so simple

desert oar
#

in order to use pandas effectively imo it's important to understand the separation between data and axes

dense copper
#

and I'm getting nowhere

desert oar
#

what are you trying to accomplish exactly

dense copper
#

given data like the following:

 ticker calendardate  revenue
      A   2019-12-31  10
      A   2018-12-31  9
      A   2017-12-31  7
      A   2016-12-31  1
      A   2015-12-31  3

      B   2019-12-31  5
      B   2018-12-31  4
      B   2017-12-31  3
      B   2016-12-31  2
      B   2015-12-31  3
#

my goal is to calculate the pct change in revenue from 2015 to 2019 ONLY, from each ticker

#

e.g. I want ONLY (10/3) ** (1/5), and (5/3) ** (1/5)

#

using pct_change(4) calculates it for every row, so I will get the value I want, plus four NaNs

#

instead I want it to calculate the first result, see that the second is NaN, and skip the last 3 attempts since I know they will all be NaN

#

(for each group, if that makes sense)

#

that more complex calculation (e.g. ** 1/5) is to calculate CAGR, and I'm doing that with a lambda function using .apply(), but if I can figure out how to just do pct_change (or any function for that matter lol) then I can figure out the rest.

#

the problem is I have hundreds of thousands of rows and hundreds of columns, but I only need the result of the first row divided by the 5th, and everything else gets thrown away. The default behavior increases the time it takes to run this process by an order of magnitude.
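
One way this could be sketched without pct_change at all: sort by date, then take only the first and last revenue per ticker, so a single division is done per group. The 1/5 exponent simply mirrors the formula in the message above; the rest of the names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "ticker": ["A"] * 5 + ["B"] * 5,
    "calendardate": pd.to_datetime(
        ["2019-12-31", "2018-12-31", "2017-12-31", "2016-12-31", "2015-12-31"] * 2
    ),
    "revenue": [10, 9, 7, 1, 3, 5, 4, 3, 2, 3],
})

# Oldest row first within each ticker, so first() is 2015 and last() is 2019.
ordered = data.sort_values(["ticker", "calendardate"])
grouped = ordered.groupby("ticker")["revenue"]

# One ratio per ticker: latest revenue divided by earliest revenue.
ratio = grouped.last() / grouped.first()

# Exponent taken from the message above.
growth = ratio ** (1 / 5)

print(ratio)
print(growth)
```

The result has one row per ticker rather than one per input row, so there are no NaN rows to throw away afterwards.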

desert oar
#

@dense copper have you tried using the freq parameter?

dense copper
#

I'm not sure how, and as usual the documentation is terrible lol

#

been looking around at that though

desert oar
#
import pandas as pd

data = ...

def pct_change_for_ticker(y):
    return y.pct_change(freq=pd.tseries.offsets.YearEnd())

pct_changes = data.set_index(['ticker', 'calendardate']) \
    .groupby(level='ticker')['revenue'] \
    .apply(pct_change_for_ticker)
dense copper
#

cursory attempt on that gave me a NotImplementedError

#

I'll keep trying though ... thanks for the idea

ornate granite
#

Hi, I am using Weka for text classification. For some reason when I apply the StringToWordVector filter, it does nothing to my data set and the number of attributes remains the same.

Does anyone have any idea why this filter is not working for me?🙏

stable otter
#

can someone help me convert python script to exe file

lapis sequoia
#

anyone here use dash?

modest rune
#

I could use some help. This particular part of my numba code accounts for 90% of my slowdown. I am hoping for a much faster way to accomplish this simple task:

def closest(values, desired):
    AB = np.abs(desired - np.expand_dims(values, 1))
    indexes = np.empty((AB.shape[0],), np.int64)
    for i in nb.prange(AB.shape[0]):
        indexes[i] = np.argmin(AB[i])
    return indexes

This is what the function is supposed to do. I have two 1D arrays, values and desired. desired has 200 elements in it, evenly spaced apart. Depending on when I call the function, values can have anywhere from 1 element to 200 elements, each with hard to predict spacings.

The code goes through every value of values and finds the INDEX of the element in desired that is closest to value.

Here is an example:

Input:
desired = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
values = [1.12, 4.77, 3.39]

Output:
indexes = [0,8,5]
#

anyone here use dash?
@lapis sequoia
I do

lapis sequoia
#

Im having an issue moving a legend on this map i have made

#

I looked at the documentation, and it seems to have no effect on it

#

This is my problem

modest rune
#

You asked if I use it. I do, but I am terrible at it.

lapis sequoia
#

lolol

modest rune
#

I mean, dash uses html and css. And to get the position of things to work properly, you really need to understand html and stylesheets really well.

#

dash makes beautiful plots though.

lapis sequoia
#

I was able to do lots of things in relationship to the map and even the colorbar without using html

wise garden
#

How can I find out the index of multiple columns in pandas

lapis sequoia
#

make a slice

modest rune
#

I was able to do lots of things in relationship to the map and even the colorbar without using html
@lapis sequoia
Yeah, the graphical elements usually have non-html/css attributes you can mess with. But, it wouldn't surprise me if text positioning was purely html and css.

wise garden
#

Let me rephrase

modest rune
#

I know what you are asking... let me see if I can help

#

dataframe.columns.index(column_name1, column_name2)?

wise garden
#

let me check

#

I'm doing something weird

modest rune
#

index is a python core function that returns the index of the item in a list with that name.

wise garden
#

like I could get what I need with select_dtypes but I want to include other columns too

modest rune
#

dataframe.columns, should return a list of column names
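
A note with a small sketch: dataframe.columns is a pandas Index rather than a plain list, so one option for looking up the positions of several column names at once is Index.get_indexer. The frame and the names in wanted below are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2.0], "c": ["x"], "d": [3]})

wanted = ["d", "b"]

# Integer position of each requested column (-1 would mean "not found").
positions = df.columns.get_indexer(wanted)
print(positions.tolist())  # [3, 1]

# Those positions can then be passed to .iloc.
subset = df.iloc[:, positions]
print(subset.columns.tolist())  # ['d', 'b']
```

If positions are not actually needed, df[wanted] selects the same columns by name in one step.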

wise garden
#

yea I was trying things w .index but I was trying after I isolated columns based on their dtype

desert oar
#
import io
import pandas as pd

txt = """
 ticker calendardate  revenue
      A   2019-12-31  10
      A   2018-12-31  9
      A   2017-12-31  7
      A   2016-12-31  1
      A   2015-12-31  3
      B   2019-12-31  5
      B   2018-12-31  4
      B   2017-12-31  3
      B   2016-12-31  2
      B   2015-12-31  3
"""

data = pd.read_fwf(io.StringIO(txt), header=1)
data['calendardate'] = pd.to_datetime(data['calendardate'])

print(data)

def pct_change_for_ticker(y):
    return y.pct_change(freq=pd.tseries.offsets.YearEnd())

pct_changes = data.set_index('calendardate') \
    .groupby('ticker')['revenue'] \
    .apply(pct_change_for_ticker)

print(pct_changes)
#

@wise garden you want to select certain columns by dtype?

wise garden
#

dataframe.columns, should return a list of column names
@modest rune anyway to pass a list through?

#

I've got 40+ column names to pass through

#

sorry, not the columns portion, the index portion

desert oar
#
data.loc[["a", "b", "c"], ["X", "Y"]]

something like this? that selects the rows with index labels "a", "b", and "c", and columns "X" and "Y"

wise garden
#
data.loc[["a", "b", "c"], ["X", "Y"]]

something like this? that selects the rows with index labels "a", "b", and "c", and columns "X" and "Y"
@desert oar I'm hoping to use df.iloc

#

oooo but might be able to use this

desert oar
#

so you have lists of row numbers and/or column numbers?

#

i.e. by position

#
data.iloc[[3, 2, 12], [45, 46]]

same syntax

#
data.iloc[:, [45, 46]]
data.loc[:, ["X", "Y"]]

for all rows

wise garden
#

Yea I was trying to make it harder by passing the col names that I found into indices and then passing them to .iloc.

desert oar
#
data.loc[:, ["X", "Y"]]
data[["X", "Y"]]

pandas also accepts the 2nd line as a synonym for the 1st

#

i write data[["foo", "bar"]] all the time when using pandas code

wise garden
#

yea solid, thx for the help

velvet thorn
#

@modest rune did you solve your problem yet?

modest rune
#

Yes! All is good now

velvet thorn
#

how did you fix it?

solid mantle
#

Hey, pandas issue

#

nevermind I solved it

lapis sequoia
#

header = False

#

you can also supply header as list of strings

subtle tide
#

Hi, hope everyone is doing fine, I'm not really into data science but I need pandas to do a quick work, I have a CSV file, and I need to make a search on it on a particular column, I could get away with a for loop but I've read that Pandas is more efficient, how can I achieve this?

velvet thorn
#

Hi, hope everyone is doing fine, I'm not really into data science but I need pandas to do a quick work, I have a CSV file, and I need to make a search on it on a particular column, I could get away with a for loop but I've read that Pandas is more efficient, how can I achieve this?
@subtle tide what do you mean a search?

#

To be clear the variables are x and z
@solemn cosmos logarithm

ripe forge
#

Don't worry about efficiency so early. If this csv file isn't massive you can happily use the csv module in python and call it a day.
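
A rough sketch of the csv-module route, assuming the header is name, phone_number, email. The sample rows and the helper name find_contact are invented; with a real file, open(path, newline="") would replace the StringIO object.

```python
import csv
import io

csv_text = (
    "name,phone_number,email\n"
    "Ada,555-0100,ada@example.com\n"
    "Bob,555-0101,bob@example.com\n"
)


def find_contact(lines, email):
    """Return (name, phone_number) for the first row with this email, else None."""
    for row in csv.DictReader(lines):
        if row["email"] == email:
            return row["name"], row["phone_number"]
    return None


result = find_contact(io.StringIO(csv_text), "bob@example.com")
missing = find_contact(io.StringIO(csv_text), "nobody@example.com")
print(result)   # ('Bob', '555-0101')
print(missing)  # None
```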

subtle tide
#

@velvet thorn , I have a csv file with this header, [name, phone_number, email], and about a thousand rows, I want to get the name and phone_number of a particular email

velvet thorn
#

df.loc[df['email'] == the_email, ['name', 'phone_number']]
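
As a runnable sketch of that lookup (the sample rows are made up, and `io.StringIO` stands in for the real file so the snippet is self-contained):

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the rows are invented.
csv_text = (
    "name,phone_number,email\n"
    "Ann,555-0100,ann@example.com\n"
    "Bob,555-0101,bob@example.com\n"
)
df = pd.read_csv(io.StringIO(csv_text))

the_email = "bob@example.com"

# Boolean mask picks the rows, the list picks the columns.
match = df.loc[df["email"] == the_email, ["name", "phone_number"]]
```

`match` is a DataFrame, so it is empty when the email is not found and has several rows when the email appears more than once.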

subtle tide
#

@ripe forge , ok thanks, I will have that in mind

velvet thorn
#

but yes, a thousand rows is nothing
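
So the plain `csv` module route works fine too. A minimal sketch, where the helper name `find_by_email` and the sample rows are made up:

```python
import csv
import io

# Stand-in for the real CSV file; the rows are invented.
csv_text = (
    "name,phone_number,email\n"
    "Ann,555-0100,ann@example.com\n"
    "Bob,555-0101,bob@example.com\n"
)


def find_by_email(csv_file, email):
    """Return (name, phone_number) for the first matching row, or None."""
    for row in csv.DictReader(csv_file):
        if row["email"] == email:
            return row["name"], row["phone_number"]
    return None


found = find_by_email(io.StringIO(csv_text), "bob@example.com")
missing = find_by_email(io.StringIO(csv_text), "nobody@example.com")
```

With a real file, `open(path, newline="")` would take the place of `io.StringIO(csv_text)`.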

subtle tide
#

thanks @velvet thorn

velvet thorn
#

yw

modest rune
#

how did you fix it?
@velvet thorn

I took advantage of the fact that I knew how the desired array was constructed.

# a : data that I am analyzing
desired = numpy.linspace(a.min(), a.max(), 200)

By storing the 3 values used in the linspace and then later on applying them to values in the right way and converting to an integer, I was able to calculate the proper index, instead of searching for the index. Much much faster.
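
Roughly like this (a sketch only: the stand-in data and the helper name `grid_index` are made up, and the rounding rule is my reading of "converting to an integer"):

```python
import numpy as np

# a : stand-in for the data being analyzed
a = np.array([2.0, 3.5, 7.25, 10.0])

# The three values used to build the grid.
lo, hi, n = a.min(), a.max(), 200
desired = np.linspace(lo, hi, n)


def grid_index(value):
    """Index of the grid point nearest to value, computed rather than searched.

    linspace is evenly spaced with step (hi - lo) / (n - 1), so the
    position follows directly from the offset of value from lo.
    """
    return int(round((value - lo) / (hi - lo) * (n - 1)))


# The search-based equivalent, for comparison.
searched = int(np.abs(desired - 7.25).argmin())
computed = grid_index(7.25)
```

The computed version is constant time per value, while the search scans all 200 grid points each time.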

#

You must be on the other side of the world because when you asked me:

How did you fix it?
It was 2am-ish and I was asleep.

terse ridge
#

hey guys I just wanted to ask: is the first node the same as the start node in a Binary Tree?
pls feel free to ping me if you have an answer

desert oar
#

data science consists of statistics, machine learning, and related topics

terse ridge
#

ok sorry

fringe dagger
#

umm hey guys i want to ask something about a particular dataset anyone wanna help me?

mellow spruce
#

How can I do the average of the next n rows for every row in a data frame? I have a data frame that looks like:

```
    A|1
    B|2
    C|3
    D|4
    E|5
    F|6
    G|7
    H|8
    I|9
    J|10
    K|11
    L|12
    M|13
```
and I want to average the next 3 rows for every row, so the output would be like
```
Object|Value|Average_3
    A|1|6.3
    B|2|8.6
    C|3|11
    D|4|13.3
    ... and so on
```

I was thinking of doing something like

```
df['average_3']=df['value'].apply(lambda x: x.shift(1)+x.shift(2)+x.shift(3))
```

However, the number of rows n will not always be the same, so I was wondering how I can apply a for loop inside the lambda function, and also how this will handle the last n rows, since they won't have all the future rows to average over?

desert oar
#

@fringe dagger it's better to just ask your question, don't "ask to ask"

fringe dagger
#

um yeah sure

sterile wyvern
#

@mellow spruce

#

df['average_3'] = df.iloc[:,1].rolling(window=3).mean()

desert oar
#

i prefer to use column names but yes use rolling

fringe dagger
#

how does the training data work here?

#

like i'm using two different columns to predict a regression model

desert oar
#

i actually dont know if you can roll forward though @sterile wyvern
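
One way that seems to work is to reverse the column, roll, and reverse back. A sketch, taking "average of the next n rows" literally (so the current row is excluded); the helper name `mean_of_next_n` is made up and the columns follow the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Object": list("ABCDEFGHIJKLM"),
    "Value": range(1, 14),
})


def mean_of_next_n(values: pd.Series, n: int) -> pd.Series:
    """Mean of the n rows that follow each row.

    Rolling over the reversed series makes each window look forward
    (current row plus the n - 1 after it); shift(-1) then drops the
    current row from the window. min_periods=1 averages whatever rows
    are left near the end, and the last row has nothing after it, so
    it becomes NaN.
    """
    forward = values[::-1].rolling(window=n, min_periods=1).mean()[::-1]
    return forward.shift(-1)


df["Average_3"] = mean_of_next_n(df["Value"], 3)
```

Because `n` is just an argument, no loop inside a lambda is needed. Dropping `min_periods=1` would instead leave NaN on every row that has fewer than n rows after it.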

#

@fringe dagger can you also share your code as text (either using code formatting or a paste site)? it's difficult for some people to read text in images

fringe dagger
#

oh okay never done that and its in notebook format

desert oar
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

sterile wyvern
#

@desert oar need to go through rolling documentation

sterile wyvern
#

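import numpy as np
# trailing 3-row average: the mean of rows i..i+2 is stored on row i+2
for i in range(0, df.shape[0] - 2):
    df.loc[df.index[i + 2], 'SMA_3'] = np.round((df.iloc[i, 1] + df.iloc[i + 1, 1] + df.iloc[i + 2, 1]) / 3, 1)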

#

alternate for rolling

merry ridge
#

Is there a simple way to be able to do arithmetic operations on an extremely large number? The documentation says that it should just automatically work for newer versions of python but that doesn't seem to be the case

lapis sequoia
#

im making a smart macro that detects if the message sent has been sent twice without a message in between and im not sure how im going to go about the AI for this

#

ive made the macro thats all done

#

if anyone has any clue on how to help make sure to ping me, thanks

desert oar
#

@merry ridge what error are you getting? how large is extremely large?

#

@lapis sequoia do you really need AI for that? you can't just compare the two messages for equality (or similarity)?

lapis sequoia
#

and how would i go about doing that?

desert oar
#

what do you mean? if message1 == message2:
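
Something along these lines, as a rough sketch: the function names and the 0.9 threshold are made up, and `difflib` is only one option for the "similarity" part:

```python
import difflib


def is_repeat(previous_message, new_message):
    """True if the new message equals the previous one, ignoring case and outer spaces."""
    if previous_message is None:
        return False
    return previous_message.strip().lower() == new_message.strip().lower()


def is_similar(previous_message, new_message, threshold=0.9):
    """True if the two messages are nearly the same text.

    The threshold is an arbitrary illustration, not a recommended value.
    """
    if previous_message is None:
        return False
    ratio = difflib.SequenceMatcher(None, previous_message, new_message).ratio()
    return ratio >= threshold


exact = is_repeat("hello", "Hello ")
near = is_similar("hello there", "hello ther")
first_message = is_repeat(None, "hello")
```

The macro would only need to remember the last message it saw and call one of these on each new one.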

distant tree
#

Function messages?

lapis sequoia
#

ohhh ok

merry ridge
#

The upper bound of what I need is about 1e400~

desert oar
#

@merry ridge what kind of operation are you doing?
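
For context, a short sketch of how numbers of that size behave in plain Python, assuming the values can be written as integers or decimals (the variable names are only for illustration):

```python
import math
from decimal import Decimal

# A float literal beyond the float range silently becomes infinity.
as_float = 1e400

# Python ints have arbitrary precision, so this is exact.
big_int = 10 ** 400
int_total = big_int + big_int

# Converting such an int to float raises OverflowError.
try:
    float(big_int)
    overflowed = False
except OverflowError:
    overflowed = True

# Decimal handles the magnitude; precision defaults to 28 significant digits.
big_dec = Decimal("1e400")
dec_total = big_dec + big_dec
```

So the "automatically works" part of the documentation applies to `int`; anything that passes through `float` is limited to roughly 1.8e308.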

#

!e ```python
print( 1e400 + 1e400 )