#data-science-and-ml
Anyone have a good resource for getting started with data streaming?
hey guys
I have a tags system
so basically user inputs some text from discord
and it gets stored in a sqlite3 database
and when the stored text is called with the respective keyword,
the input text is displayed
(By a discord bot of course)
something like this
but over here, the lines arent getting recognized
and the text string is stored without the specified lines
I dont want that to happen
how to fix it?
I'm working with a pretty large data set, 500 rows and maybe 10 columns. I am running for loops to create even more permutations, and all these permutations get saved as new CSVs. The problem is that my computer crashes after saving 2-3 CSVs. What can I do to stop this from happening?
do you delete the dataframe from memory after saving it as a csv?
because if you just keep saving more and more dataframes to memory itll eventually fill up and crash
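A minimal sketch of that idea, assuming each permutation can be built on its own inside the loop. `build_permutation` and the file names are made up for illustration, not taken from the original code:

```python
import os
import tempfile

import numpy as np
import pandas as pd

def build_permutation(base, seed):
    # Hypothetical stand-in for the real permutation logic: shuffle the rows.
    return base.sample(frac=1.0, random_state=seed).reset_index(drop=True)

base = pd.DataFrame(np.arange(50).reshape(10, 5), columns=list("abcde"))
out_dir = tempfile.mkdtemp()

written = []
for seed in range(3):
    perm = build_permutation(base, seed)
    path = os.path.join(out_dir, f"perm_{seed}.csv")
    perm.to_csv(path, index=False)
    written.append(path)  # keep only the path, not the DataFrame itself
    # Drop the reference so the frame can be freed before the next iteration.
    del perm
```

The point is that only file paths accumulate across iterations; no list of DataFrames is kept alive.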
Guys how can I make videos like this to practice python? https://www.youtube.com/watch?v=4-2nqd6-ZXg
Is it complex? (Basically videos about ranking over time of some sort of data) Not sure where to start
```python
def SortData():
    global df
    cols = df.columns.tolist() #Convert columns to list
    cols
    colsorder = [2,1,4,3] #Move last column to first position
    df = df[cols[i] for i in colsorder]
```
Last line returning a syntax error, any idea why?
add one more set of brackets
I take it this is the appropriate channel for asking questions that involve pandas?
I'm having issues getting rid of NaN values in one of my dataframe's columns.
I'm running
avg_age = clean_df['Age'].mean()
clean_df['Age'].fillna(avg_age, inplace=True)
I have verified that the first line correctly takes the average value of values in the Age column
I have also verified that for some reason, the number of null values in the column does not change after running these lines
I first tried using .replace for this, with the same failure
All help I've found googling around links to people that didn't know about using inplace=True, which definitely isn't the problem here
Has anyone else experienced this sort of issue? Any fixes?
Ok, ok, nevermind. As usual, finally asking for help revealed the issue to me.
...I was running my after-check of .isnull().sum() on the wrong dataframe
20 minutes of confusion and finally deciding to join this server only for this to happen to me
@tall barn that is rough.
one thing I like to do is never perform inplace modifications
So you just assign back to itself?
no
for example...
raw_df = pd.read_csv(...)
cleaned_df = raw_df.fillna(...).replace(...).map(...)
Ah new dataframe every time?
then let's say I want to perform some other transformation on cleaned_df:
some_other_df = cleaned_df.melt(...)
for example.
yup
I can see the value in that
are you using Jupyter?
worry about memory when it becomes a problem
premature optimisation is the root of all evil
Fair enough
the really nice thing about doing this is that you can go back to earlier cells
and run them
and they will just work.
because you never modify DataFrames
Oh God yes
whereas say you create a DataFrame in cell 8, perform a read-only aggregation in cell 9, and modify it in cell 10
if around cell 15 you want to change something in cell 9 just as an experiment, it might not work
because you've modified the source already.
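A small sketch of the no-inplace style described here, using a made-up three-row frame. The column names are only for illustration:

```python
import numpy as np
import pandas as pd

raw_df = pd.DataFrame({"Age": [22.0, np.nan, 40.0], "Name": ["a", "b", "c"]})

# Each step returns a new DataFrame; raw_df itself is never modified,
# so a cell that reads raw_df can be re-run at any time.
cleaned_df = raw_df.assign(Age=raw_df["Age"].fillna(raw_df["Age"].mean()))

print(raw_df["Age"].isnull().sum())      # null count in the untouched source
print(cleaned_df["Age"].isnull().sum())  # null count after filling
```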
naming can be a pain
but there are also ways to handle that
I’m afraid that really wouldn’t fly with the instructors at my job though. Having a dozen dataframes isn’t a nice surprise to people who expect to just look at one
yup, that's fine
I mean, you don't have to expose the intermediate DataFrames
but of course if you have other requirements then those should take precedence
I’ll keep in mind what you’ve said once my code is no longer being inspected for grading purposes
Which should be in only a week or two
sure!
I mean, what I have said is by no means the way to do things
and in fact it is a little foreign to a fair number of data scientists, I think?
because it's a bit more of a computer science theory-based approach
Sure, but I strongly identify with what you said about running old cells
(disclaimer: I have neither a formal DS nor a formal CS background)
but yeah if you're interested in this kind of thing you can look up functional programming; a lot of my DS code structure is inspired by those ideas
I think they generally lead to greater levels of correctness and readability once you get used to the idioms, since pandas is actually quite a functional library
I don’t have a formal background in this, but I’m getting it now as pre-assignment job training. I got hired as a software developer and shoehorned into data scientist somehow...
I’m starting to get an appreciation for pandas for sure, the best documentation of any code I’ve ever read
To the point I’m mostly reading documentation to fulfill assignments, rather than following the shitty lessons
yup, it does have p good documentation.
I mean
I do understand that career path
I've bounced all around ML engineer/DS/SWE in the span of a year or so
It’s not that I don’t understand it, it’s just that I historically have hated statistics haha
I like the logic puzzles of programming, not so much data crunching. I’m adapting though. Statistics is less bad with code than it was in a generic stat class in college
Hey, I’m looking to find instances in pandas where a person has used the same details out of 2 fields I.e: a phrase and an email address, what’s the best way to find instances where people have changed either of those details and provide a count of the presence of either?
How can I implement 10-fold cross-validation in this code?
(train_ds, val_ds, test_ds), metadata = tfds.load( 'tf_flowers', split=['train[:60%]', 'train[60%:90%]', 'train[90%:]'], with_info=True, as_supervised=True )
this isn't python is it
that is python
oh cool.. I haven't used tf this way before
I am trying to use numba to speed up my code, but I keep running into problems because numba doesn't support functions my code uses.
So... I am trying to find different ways to accomplish the same tasks... ways that numba supports.
I was doing this:
# y: vector of size 200
# x: vector of size 4
y - x.reshape((len(day),1))
numba has odd problems with numpy.reshape(), so I am trying to get rid of it
And I can't help but think, there is a simpler way to do this type of subtraction in numpy.
something like y - x.T, but that doesn't work
Calling x.T changes the vector of size (2,) to a vector of size (2,) 🙂
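For reference, a sketch of the plain NumPy ways to get that column-vector subtraction without `.T`. The sizes follow the comments above; whether numba accepts each spelling in a given version is something to check rather than assume:

```python
import numpy as np

y = np.arange(200, dtype=np.float64)   # vector of size 200
x = np.array([1.0, 2.0, 3.0, 4.0])     # vector of size 4

# x[:, None] has shape (4, 1), so the subtraction broadcasts to (4, 200).
diff = y - x[:, None]

# Equivalent spellings of the same reshape.
same_expand = y - np.expand_dims(x, 1)
same_reshape = y - x.reshape(-1, 1)
```

`.T` does nothing on a 1-D array because there is only one axis to reverse, which is why the shape stays the same.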
@lapis sequoia
Oof...
it is python
lol I'm just used to using sklearn for cv, etc
honestly it seems way too complicated in TF
No problem man.
@modest rune do you know which part is causing the crash exactly, the reshape of the len ?
How can I implement 10-fold cross-validation in this code?
(train_ds, val_ds, test_ds), metadata = tfds.load( 'tf_flowers', split=['train[:60%]', 'train[60%:90%]', 'train[90%:]'], with_info=True, as_supervised=True )
just seems to be a tuple
either way, can you try .reshape(-1, 1) or use expand_dims
numba supports reshape(), but only if the array is contiguously stored in memory.
which it should be given your example
Morning everyone!
I am having trouble putting these two separate mathematical formulas into one print statement. I've hacked my way to the solution, but I'd like to know how I can combine the two into one single print statement.
```python
import math
monthly_payment = 8721.8
periods = 120
interest = 5.6 / 100
i = interest / 12 * 1
print(i * math.pow(1 + i, periods)) # 0.008159169813759389
print(math.pow(1 + i, periods) - 1) # 0.7483935315198691
print(int(monthly_payment / (0.008159169813759389/0.7483935315198691))) # 800000
```
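One possible way to fold the two intermediate values into a single expression, sketched with the same numbers. The names `growth` and `principal` are introduced here for readability and are not from the original:

```python
import math

monthly_payment = 8721.8
periods = 120
interest = 5.6 / 100
i = interest / 12  # monthly rate

# Compute (1 + i)^periods once and reuse it in numerator and denominator.
growth = math.pow(1 + i, periods)
principal = monthly_payment / (i * growth / (growth - 1))

print(int(principal))
```

Storing `growth` avoids pasting the printed intermediate values back into the code by hand.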
My guess is that x is a view to a bigger array that I sliced up earlier on in my code.
And hence, it is not contiguous.
I think copying the array will more than offset any speedups I get from using numba
I am going to try expand_dims, hopefully numba supports that
@lapis sequoia
Why? If I'm downloading from cloud collection and break dataset into train, val and test input?
not that part.. I mean actually doing cross validation on this
yeah that's probably the reason why, i just tried it on my end and it works with reshape, the views are probably causing issues
@lapis sequoia
U mean this case?
And yeah.
I resolved this problem.
```python
(train_ds, test_ds), metadata = tfds.load(
    'tf_flowers',
    split=['train[:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)
val_ds = train_ds.split = [
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
]
```
One line of code. Not bad))
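For comparison, a hedged sketch of how ten folds can be expressed with TFDS split strings, pairing each 10% validation slice with the remaining 90% for training. It folds over the whole `train` split and does not set aside the separate test slice used above. `kfold_splits` and `load_folds` are hypothetical helper names:

```python
def kfold_splits(name="train"):
    """Build TFDS split strings for ten equal folds of the `name` split."""
    val = [f"{name}[{k}%:{k + 10}%]" for k in range(0, 100, 10)]
    train = [f"{name}[:{k}%]+{name}[{k + 10}%:]" for k in range(0, 100, 10)]
    return train, val

train_splits, val_splits = kfold_splits()

def load_folds():
    # Imported lazily so the split strings can be inspected without TFDS installed.
    import tensorflow_datasets as tfds
    trains = tfds.load("tf_flowers", split=train_splits, as_supervised=True)
    vals = tfds.load("tf_flowers", split=val_splits, as_supervised=True)
    # One (train_ds, val_ds) pair per fold.
    return list(zip(trains, vals))
```

Each fold would then be trained and evaluated separately, for example by looping over `load_folds()` and building a fresh model per pair.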
@odd yoke expand_dims worked. Now I need to work through the other 3 errors numba is throwing. Thanks for helping me get over that hump!
@raw rapids
What?
Any insight into what this means for a non-computer science guy?
Resolution failure for non-literal arguments
It is an error I am receiving when numba tries to compile this line of code:
# x: 2D array, variable names where changed to protect their identity
temp = x.argmin(1)
Full error from numba
Compilation is falling back to object mode WITH looplifting enabled because Function "j_nearest_strike" failed type inference due to: - Resolution failure for literal arguments:
AssertionError()
- Resolution failure for non-literal arguments:
AssertionError()
During: resolving callee type: BoundFunction(array.argmin for array(float64, 2d, C))
During: typing of call at c:/.../analysis2.py (159)
File "analysis2.py", line 159:
def j_nearest_strike(strike, x):
<source elided>
abs_minus = np.abs(minus)
temp = abs_minus.argmin(1)
^
@jit('int64[:](float64[:], float64[:])')
The actual code... sorry for the ugly variable names and unnecessary assignments, I am just trying to get it to work.
@jit('int64[:](float64[:], float64[:])')
def j_nearest_strike(strike, x):
    #strike = strike.copy()
    #length = strike.size
    strike_shaped = np.expand_dims(strike, 1)
    minus = x - strike_shaped
    abs_minus = np.abs(minus)
    temp = abs_minus.argmin(1)
    return temp
argmin doesn't support additional arguments iirc
@odd yoke
I guess I am dense. Can you rephrase that? I am not following.
Without numba, that line works perfectly.
you can only do arr.argmin()
oh
numba doesn't support all of numpy
well, crap
most axis parameters, like this one, aren't available
abs_minus is a 2D array for vectorization purposes
I guess, if I am using numba, maybe I don't need to vectorize as much?
Any idea how I can get around this problem while still taking the vectorized approach?
numba optimizes loops
ok
I guess, I should be happy... but I am just annoyed because it took me a while to get the vectorization right in many parts of the code I want to use numba on. So... I have to rip all of that out.
A friend of mine suggested using pypy instead of numba and laughed at me. 🙂
sounds like your friend needs to look up what numba does
All I can say is... When I am done destroying my vectorization and getting rid of my class, the numba code better be much faster!
also that's the purpose of numba, unlike numpy, it wants you to write "normal" python code
The friend claims that pypy will get me similar performance gains, the only problem will be that installing scipy, pandas, and numpy will be a PIA.
```python
import numba as nb
import numpy as np
arr = np.random.randint(10, size=(100,50))
@nb.jit(nopython=True, parallel=True)
def argmin(arr):
    ret = np.empty((arr.shape[0],))
    for i in nb.prange(arr.shape[0]):
        ret[i] = np.argmin(arr[i])
    return ret
print(argmin(arr))
print(arr.argmin(1))
```
works
just gotta like, specify the dtype
i'm just too lazy
thanks! you didn't have to do that.
Pypy is generally a lot slower than nopython numba since it uses actual python objects
Hey all, I need some help. I just started learning Python and I am learning about dataframes. I have a sample of 29405 rows and 1 column, and I am using this code:
df= pd.Dataframe (df)
df1.insert(2, "Consumer", [1:29405])
But it's showing SyntaxError: invalid syntax
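A guess at what was intended, sketched on a made-up three-row frame: `[1:29405]` is slice syntax and cannot stand alone as a list, and the class is spelled `pd.DataFrame`. The `value` column is a placeholder:

```python
import pandas as pd

df1 = pd.DataFrame({"value": [10.5, 11.2, 9.8]})

# Pass an actual sequence with one entry per row instead of a bare slice.
# The position must not exceed the current number of columns, so with a
# single existing column the new one can go at position 0 or 1.
df1.insert(1, "Consumer", range(1, len(df1) + 1))
```

With the real data, `range(1, len(df1) + 1)` would number the rows from 1 up to the row count.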
hey guys noob here, Im trying to use pandas and read a file, but when I use read_csv it doesnt print columns but long lines of text where columns are separated by '\t'
I figured out I had to use a delimiter but when I plug \t into it I get an error ? ( it is a TSV not CSV files but I seen it works the same)
@lapis sequoia can you share your actual code?
show the error? also please try to share code as text, and not screenshots. some people have difficulty reading screenshots
here is what I get if I dont specify any delimiter
sorry but idk how to share the code in fashion way
@lapis sequoia How big is the TSV file you want to load?
ko?
Hmm...
@Jerrody you're just loading a tuple of values
@lapis sequoia Did you try just taking a few values, making a new tsv and reading that with pandas?
hmm no, you mean like create a file with only the columns im interest in or st ?
idk how to do that tbh Im a real beginner
No, create a file with only a few data points (like maybe of 5Mb). Also, how much RAM have you got?
Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.
To do this, use the following method:
```python
print('Hello world!')
```
Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them
This will result in the following:
print('Hello world!')
^ that tells you how to post code
8gb
oh ok thanks
but I get the error just by printing the columns alone @desert oar
and no error if I dont use delimiter
if you dont use delimiter= it is loading everything as text
pd.read_csv('data.tsv', delimiter='\t', nrows=100)
does this produce an error?
no!
lenght= 3148
may be that...
but it's weird since I only prints the columns
the error is not from printing the columns
do you want to read the data file? or do you only want to print the columns?
well, Im really exploring the data as a noob atm
it sounds like you do not have enough free memory on your computer to read the whole dataset
seeing if I can find one specific columns and exploring only the data associated with it
you can select columns with usecols=
hmmm okay
pd.read_csv('data.tsv', delimiter='\t', nrows=100, usecols=['CASEID', 'QUESTID2', 'CIGEVER'])
this only reads the CASEID, QUESTID2, and CIGEVER columns
and only 100 rows
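The same call can be tried on a tiny in-memory TSV to see what `delimiter`, `nrows` and `usecols` each do. The rows below are invented stand-ins for `data.tsv`:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for data.tsv; the values are made up.
tsv = io.StringIO(
    "CASEID\tQUESTID2\tCIGEVER\tWORKDAYS\n"
    "1\t1001\t1\t5\n"
    "2\t1002\t2\t99\n"
    "3\t1003\t1\t0\n"
)

# Only two rows and three of the four columns are parsed into the frame.
subset = pd.read_csv(
    tsv,
    delimiter="\t",
    nrows=2,
    usecols=["CASEID", "QUESTID2", "CIGEVER"],
)
```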
Is there any way to create partial checkpoint? Like to make a chkpt if the Epoch is done only about 30% or something?
oh okay thanks! so since I specify those parameters in the read_csv before doing anything else, it means if I dont do that it goes trough all the data AND ROWS even if I print the columns only and create an error ? do i understand correctly ?
```python
def SortData():
    global df
    cols = df.columns.tolist()
    cols
    colsorder = [2,1,4,3]
    df = df[cols[i] for i in colsorder]
    return df
```
Last line returning a syntax error plox halp
Can you post the whole error? and are you using Jupyter Notebooks by any chance?
Are you talking to him or I?
you
I'm using ATOM with some addons to make it work like Jupyter
Hydrogen?
Just restart your kernel and run only the lines you need
Code looks fine, you just thinking its an error with the kernel?
https://media.discordapp.net/attachments/587375753306570782/755439106317877341/unknown.png?width=730&height=702 Hello, how do I retrieve a specific timestamp from this data frame? Also, how would i make the timestamps a column 'date'? This is a multi index btw
@fast bluff you need 1 more pair of brackets I think, try df[[cols[i] for i in colsorder]]?
Someone else suggested that. When I try that it says "list index out of range"
how many columns do you have? 4?
python is 0-indexed, you need [1, 0, 3, 2] instead
though, it's probably easier to just select them directly? if the column names are fixed
as in df = df[['CLOSE', 'BUY', 'SELL', 'SMA']] or whatever
I was doing that before and ran into some errors yelling at me to use .iloc so I've just been playing it safe
Starting with 0 fixed the problem lol tysm that went right over my head
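To recap the fix in runnable form, here is a sketch on a one-row frame. The column names are borrowed from the example above and their starting order is assumed:

```python
import pandas as pd

df = pd.DataFrame({"BUY": [1], "CLOSE": [2], "SMA": [3], "SELL": [4]})

colsorder = [1, 0, 3, 2]  # 0-based positions

# Reorder by position with iloc ...
by_position = df.iloc[:, colsorder]

# ... or build the list of names first; note the double brackets.
by_name = df[[df.columns[i] for i in colsorder]]
```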
Just an idle and theoretical question - can we use a decoder (like in an autoencoder) with some layers to give probablity oriented outputs? Like for every Latent Space combination, it would produce several unique possiblities. How would we implement something like that? Maybe by using "softmax" at the end to get probablities, but the thing would be that those outputs would all be the BEST it can produce, just tweaked in different ways. Like it wouldn't assign 94% to one, then 3% to another and so on. It would give several outputs but each of them would be the most likely. Like for 50 outputs, it would assign each one at 2%. So is a decoder deterministic in that way (to be not able to produce multiple outputs for same input) or does it actually end up producing unique outputs every time?
So I now have this variable workdays which each id is associated with differents numbers, and I'd simply like to plot the frequency/repartition of each number for the whole data set but I only gets one big columns
plt.hist(data)
plt.show()
what am i doing wrong
I'd like to have a frequency distribution
@lapis sequoia
df = df[[cols[i] for i in colsorder]]
99 is a specific value dat means error or st but others ID should have any numbers
you probably want to hist a specific column
@desert oar sorry I dont understand where i'm supposed to do that
plt.hist(data["WORKDAYS"])
plt.show()
@marsh seal
if they are proper datetime formats, you can retrieve directly with new.loc['2017-01-03'] if you want the 3 Jan 2017 rows, for example.
and I'm not quite sure I get your second question, you want to reset the timestamp index into its own column called 'date'?
@tidal bough it worked thanks! so even if I specify the reading to only the columns I still need to specify it in the plot, thanks 😉
@odd yoke So, it is ugly and needs to get cleaned up. But the functions I wanted to run with numba now run without errors. They went from taking 4 seconds to run to 500ms. Which is great.
But, something weird happened, the parent function that calls them:
# everything from custom_function down has been decorated with @njit
pandas.DataFrame.rolling().apply(custom_function)
pandas.DataFrame.rolling().apply(custom_function) (refer to this as A) gets called 8000 times. Because of how rolling() works, custom_function (refer to this as Child of A) gets called probably 1 million times (8000 * 100).
Before numba: A took 4 seconds to run, of which Child of A accounted for 3.8 of those seconds.
After numba: A took 4 seconds to run, of which Child of A accounted for 0.5 of those seconds.
Seems like there is a significant amount of overhead loading the numba function or something. Any insights, I feel like something trivial needs be done to fix this problem.
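One thing that may be worth checking, offered as a possibility rather than a diagnosis: by default (`raw=False`) pandas hands each window to the function as a Series, while `raw=True` passes a plain NumPy array. The sketch below uses a plain-Python stand-in for `custom_function` and a window of 100:

```python
import numpy as np
import pandas as pd

def custom_function(window):
    # Stand-in for the real (possibly @njit-decorated) function.
    return window.max() - window.min()

s = pd.Series(np.random.default_rng(0).normal(size=1_000))

# raw=True hands each window over as a NumPy array rather than a Series.
result = s.rolling(100).apply(custom_function, raw=True)
```

Even with `raw=True` there is still one Python-level call per window, so timing both variants on the real data is the only way to know how much it helps.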
@lapis sequoia Alternatively, you can use seaborn to visualize dataframes; it's nicer in some ways. With it, it'd be:
import seaborn as sns
sns.displot(data=data,x="WORKDAYS")
@paper niche yes i want the dates to have a column named 'date. this way i can filter through them
thanks will look into it
and do you know how could I delete some values of my data like since 0 and 99 are specific values that arent really analysable ?
tldr: How to delete specifics values of a column before plotting
I'd suggest, instead of using 0 and 99 to mark "invalid value", using None or NaN.
Though also, you can just filter them out when plotting:
plt.hist(data["WORKDAYS"][(data["WORKDAYS"]!=99) & (data["WORKDAYS"]!=0)])
plt.show()
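Both suggestions can be sketched on a small made-up column. Treating `0` and `99` as the sentinel values follows the question above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"WORKDAYS": [0, 3, 5, 99, 5, 12, 99]})

# Option 1: keep only rows whose value is not a sentinel.
valid = data.loc[~data["WORKDAYS"].isin([0, 99]), "WORKDAYS"]

# Option 2: turn the sentinels into NaN so they read as missing values.
as_nan = data["WORKDAYS"].replace([0, 99], np.nan)
```

Either `valid` or `as_nan.dropna()` could then be passed to `plt.hist`.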
@marsh seal you can use df.rename_axis(index=['date', None]).reset_index(level=0) which will convert the first-level index (your timestamp index) into its own column
@paper niche thank you for your help. much appreciated!! (i've been stuck on this for a few hours)
sure no problem, I edited my answer btw, you can rename the index before resetting it so it gets a meaningful column name
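A sketch of both steps on an invented two-level frame. The `coin` level name and the values are assumptions, since the real data is only visible in the screenshot:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-03", "2017-01-04"]), ["BTC", "ETH"]]
)
new = pd.DataFrame({"market_cap": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Rows for one timestamp (level 0 of the index).
jan3 = new.xs(pd.Timestamp("2017-01-03"), level=0)

# Name the index levels, then move the timestamp level into a 'date' column.
with_date = new.rename_axis(index=["date", "coin"]).reset_index(level="date")
```

Once `date` is a regular column it can be filtered like any other, for example `with_date[with_date["date"] >= "2017-01-04"]`.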
amazing thanks ! @tidal bough one last thing if you please, the x values arent really readable and scaled, how do I show the exact numbers for each :
it's weird since the data show some numbers "2" and isnt from 0 to 10 or anything
with seaborn it's very simple: pass discrete = True.
With matplotlib, it's more annoying - you'd have to manually set the number of bins to max - min of your data.
ok so I have to find the max mins with describe or something and what do I put in bins= then ?
literally max - min ?
lemme find the last time I did that...
because the min max is in fact 0 to 100 and I already randomly put the bins=10/1 but it is what is on my screenshot
thanks for your help 🙂
oh no
I put 10/1 so I must do it 100/1 , right ?
that's it!
should be max(data["WORKDAYS"]) - min(data["WORKDAYS"]) + 1 bins, and also pass align = "left"
How do I put every number 1,2,3,4,5 in the x legend tho
no idea, at that point you might want to use seaborn 🙂
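For the tick question, one approach that may work in matplotlib is to put bin edges on the half-integers and set the ticks explicitly. This is a sketch with made-up data, and the helper names are hypothetical:

```python
import numpy as np

def integer_bin_edges(values):
    # Edges at lo-0.5, lo+0.5, ... so every integer sits in the middle of its own bin.
    lo, hi = int(np.min(values)), int(np.max(values))
    return np.arange(lo, hi + 2) - 0.5

def plot_integer_hist(values):
    import matplotlib.pyplot as plt  # only needed when actually plotting

    fig, ax = plt.subplots()
    ax.hist(values, bins=integer_bin_edges(values))
    # One tick label under every integer bar.
    ax.set_xticks(np.arange(int(np.min(values)), int(np.max(values)) + 1))
    return fig, ax

workdays = np.array([1, 2, 2, 3, 5, 5, 5])
edges = integer_bin_edges(workdays)
counts, _ = np.histogram(workdays, bins=edges)
```

Calling `plot_integer_hist(workdays)` followed by `plt.show()` would draw the chart; with a 0 to 100 range the tick labels get crowded, so a step such as every 5th integer may read better.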
ok , thanks for the help so far!
@lapis sequoia write df = df[[cols[i] for i in colsorder]] instead of df = df[cols[i] for i in colsorder].
ah it's not me lol it's @fast bluff his username makes it seems it was me asking
I was wondering what you were talking about...
@fast bluff mods will probably ask you to change your name, for being difficult to tag
I'll set a nickname
👍
😄
where do you guys find interesting data to work with ?
Hi guys I was wondering how can I apply a value of a list to 10 rows of a data frame and then go to the next value of the list. ie.:
Input:
```
Numbers
1
2
3
4
5
6
7
8
9
10
11
12
13
```
Output:
```
Numbers|Group
1 A
2 A
3 A
4 A
5 A
6 A
7 A
8 A
9 A
10 A
11 B
12 B
13 B
```
Thank you !
@paper niche I get the following error from that code
Does anyone know how to tweak the NLTK library?
@red rose UCI machine learning repository, kaggle, various blog posts
@gray phoenix tweak how?
@marsh seal is new a DataFrame or a Series?
@desert oar I think i figured out part of it.
so new was a Series before you did reset_index?
new_df = new.rename_axis(['date', 'market_cap']).reset_index(level='date') how about this?
@desert oar I am using NLTK to read reviews. But right now the accuracy is fairly low.
I'm looking in how i can improve the accuracy
@gray phoenix it sounds like you might need a different model
it's usually not a matter of tweaking the library
you need to understand what the existing model is doing, and why it produces the results that it produces
then you can understand how to make it better
this is why knowing how the model works is important, otherwise you are just guessing or blindly following recommendations
what method are you currently using? what do you mean by "read reviews"?
@mellow spruce https://repl.it/@maximum__/CadetblueModernDiskdrive#main.py
import numpy as np
import pandas as pd
data = pd.DataFrame({'y': np.arange(50)})
vals = ['A', 'B', 'C']
vals_rep = pd.Series(np.repeat(vals, 10))
data['z'] = vals_rep
print(data)
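Note that the snippet above assigns a 30-element Series to a 50-row frame, so rows past 30 end up as NaN. An alternative sketch that works for any length uses integer division of the row position. It assumes there are enough labels for the number of blocks:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Numbers": np.arange(1, 14)})
labels = np.array(["A", "B", "C"])

# Row position // 10 gives 0 for the first ten rows, 1 for the next ten, ...
block = np.arange(len(df)) // 10
df["Group"] = labels[block]
```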
@desert oar data frame. the output for the code new_df is 'length of new names must be 1, got 2'
well you're doing all these inplace operations
so now it's a dataframe
but it looks like it was a series before
@desert oar I might have to do more research into it. TBH I am still fairly new to Data Science and NLP.
But to give you the use case: at work I am working with the data scientist to take the reviews we get from our patients and give a numerical value to our patients' experience. He gave me the data of the reviews, and he also gave me a version of the same output, except with his NLP engine's score. (He codes in R, I code in Python.)
I was using NLTK to score the reviews, but I am way off. Not sure if I should be utilizing NLTK. I'm just a little lost and don't know where to start
what part of NLTK are you using?
and how do you define patient experience? what does the numerical value represent?
I am using
from nltk.sentiment.vader import SentimentIntensityAnalyzer
I believe I am using vader_lexicon
in my output, I have the "PatientFeedback" then How positive, neutral, negative, the patients feedback is
you are using SentimentIntensityAnalyzer.polarity_scores?
(i'm not an expert in this area, but you should first make sure you're using the tool correctly before you decide if the tool is not suitable for your application)
Yes I am
for review in reviews['PatientFeedback']: sent = sia.polarity_scores(str(review))
Do you know where or a resource I can learn about it
Thank you @desert oar
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
it might be that this just is not the right tool for the job
but im not sure what other tools for sentiment analysis are available. you might need to do some digging, look around on forums, read blog posts, etc.
Thank you so much
this seems to be the original paper citation:
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
so you might also want to look at what papers cited this paper
hey guys, if I have a pandas dataframe like this, any ideas how I can get only the first and last item for each unique instance (ticker) in the first column?
ticker dimension calendardate
AACH MRY 2018-12-31 2018-12-31
AACH MRY 2017-12-31 2017-12-31
AACH MRY 2016-12-31 2016-12-31
AACH MRY 2015-12-31 2015-12-31
AACG MRY 2019-12-31 2019-12-31
AACG MRY 2018-12-31 2018-12-31
AACG MRY 2017-12-31 2017-03-31
AACG MRY 2016-12-31 2016-03-31
AACG MRY 2015-12-31 2015-03-31
AAAP MRY 2016-12-31 2016-12-31
AAAP MRY 2015-12-31 2015-12-31
AA MRY 2019-12-31 2019-12-31
AA MRY 2018-12-31 2018-12-31
AA MRY 2017-12-31 2017-12-31
AA MRY 2016-12-31 2016-12-31
AA MRY 2015-12-31 2015-12-31
A MRY 2019-12-31 2019-10-31
A MRY 2018-12-31 2018-10-31
A MRY 2017-12-31 2017-10-31
A MRY 2016-12-31 2016-10-31
A MRY 2015-12-31 2015-10-31
for example I would want only 2018 and 2015 for AACH, 2019 and 2015 for AACG, etc
I expect I need to groupby ticker but I'm not sure where to go from there
@dense copper
```python
data.groupby('ticker').agg(['first', 'last'])
```
maybe?
hmm that's not a bad idea, but I'm reading I should probably use nth(0) rather than first() since first() ignores NaNs... any idea if it's possible to pass a param to nth() in that context you used it w/ agg? I don't see anything in the docs about that
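A position-based alternative, sketched on a cut-down copy of the table: `head(1)` and `tail(1)` pick rows by position within each group, so NaNs are not skipped. `groupby(...).nth([0, -1])` may do the same in one call, but its return shape has differed between pandas versions, so that is worth checking against the installed one:

```python
import pandas as pd

data = pd.DataFrame({
    "ticker": ["AACH", "AACH", "AACH", "AAAP", "AAAP"],
    "calendardate": ["2018-12-31", "2017-12-31", "2016-12-31",
                     "2016-12-31", "2015-12-31"],
})

grouped = data.groupby("ticker", sort=False)

# First and last row of every ticker, restored to the original row order.
first_last = pd.concat([grouped.head(1), grouped.tail(1)]).sort_index()
```

A ticker with a single row would appear twice in this result; `first_last[~first_last.index.duplicated()]` would remove the repeat.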
oof I need to get better at pandas lol
Would anyone mind reviewing my code?? Everything works as intended, am just a newb and criticism is always helpful
And if you're down, would it be easier to send the entire file, or send it via discord code? It's only 60 lines
a nested list
Is there a way to use sklearn's pipelines to short-circuit out on certain datapoints?
i.e. if I had a trained model M
and input data D = [1, 2, -3, 4]
let's say I want my output to be M(p) for each point p in D
unless p < 0, in which case I want to short-circuit out of the pipeline and the output is -1
so my output should look like [M(1), M(2), -1, M(4)]
(my actual condition is a bit more complex but still vectorizable)
The reason I wanted to use sklearn's pipelines is to keep the shape and order of my input data
so maybe there's a better way to do that?
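Outside of a pipeline, one way to keep shape and order is a boolean mask: prefill the output with the fallback value and only call the model on the rows that pass. This is a sketch, and `DoublingModel` is a hypothetical stand-in for any fitted estimator with a `predict` method:

```python
import numpy as np

class DoublingModel:
    """Hypothetical stand-in for a fitted estimator with a predict() method."""

    def predict(self, X):
        return X[:, 0] * 2.0

def predict_with_shortcut(model, X, fallback=-1.0):
    X = np.asarray(X, dtype=float)
    skip = X[:, 0] < 0                 # vectorized short-circuit condition
    out = np.full(len(X), fallback)    # same length and order as the input
    if (~skip).any():
        out[~skip] = model.predict(X[~skip])
    return out

D = np.array([[1], [2], [-3], [4]])
result = predict_with_shortcut(DoublingModel(), D)
print(result)
```

The same function could be wrapped in a small custom estimator if it needs to sit at the end of an sklearn pipeline.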
# df: pd.DataFrame
>>> array = df[column].to_numpy(dtype=float, na_value=np.nan)
ValueError: could not convert string to float: '?'
I don't think pandas was having any issues reading in the question marks as null values, however it represents those, so I'm not sure why they're not being converted to np.nan
@void anvil json?
@serene scaffold what is your question exactly? if df[column] has "?"s in it, that error makes sense
@desert oar I should have mentioned that df[column] is a Series but I want them to be np.nan
does it contain floats and strings, or all strings?
e.g. is it pd.Series([1.3, 2.7, '?']) or pd.Series(['1.3', '2.7', '?'])
the question mark is the problem no matter what
that's how the csv represents missing data
import numpy as np
import pandas as pd
y = pd.Series([1.3, 2.7, '?'])
y_arr = y.mask(y == '?').to_numpy(dtype=float)
@serene scaffold
you have to just replace the ? with something else
Thanks!
unfortunately .replace doesn't actually let you replace with nulls, i think
print( y.replace('?', np.nan).to_numpy(dtype=float) )
you can do it with float('nan')/math.nan/np.nan but not None
@void anvil no, the "top level" of a json file can be any valid json type
null, int, float, string, array, object
why is that weird? it makes sense
!e ```python
import json
print( json.loads("null") )
print( json.loads("3.5") )
print( json.loads('["a", "b"]') )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | None
002 | 3.5
003 | ['a', 'b']
why would you need any of that
you mean, why would someone save data as just arrays? who knows
where did you get this data
thats not what i mean, lol
what's the data for? who generated it?
[[1, 2, 3], [4, 5, 6]]
it's like this?
def open_csv(path):
    df = pd.read_csv(path)
    df = df.replace('?', float('nan'))
    return df
this seems to fix it
@serene scaffold
```python
df = pd.read_csv(path, na_values=['?'])
```
oh dank
why are you being asked to work on data that you dont know anything about
yikes
thats a rough job
you might want to talk to whoever currently uses this data
ask them what it's for, how they use it, etc
maybe there's an automated system that consumes it -- do you have access to its source code or at least its business logic?
etc
this is the part of data science that requires you to be more of a business person / manager than anything
I think numba is compiling every time I run the program. Isn't it supposed to cache the binaries?
http://numba.pydata.org/numba-doc/dev/developer/caching.html @modest rune they have some rules about when compiled code is actually cached
maybe for some reason your program doesnt meet those requirements
I see there is a cache flag you can set... I guess I assumed it cached the binaries by default.
HAHAHA!!! That did IT! cache=True!
4 seconds down to 1.366 seconds. Still a lot of room for improvement, but at least I didn't completely waste my time numbafying my code.
>>> x
[0.48098 0.60981 0.47886 ... 0.47272 0.56138 nan]
>>> y
[ nan 0.41685 0.37368 ... 0.35527 0.44202 0.22364]
>>> x[~np.isnan(x).any(axis=-1), ~np.isnan(y).any(axis=-1)]
[]
And the numbafied code is no longer the long pole in the benchmark tent
evidently this isn't the way to remove indices where the element is nan in either
Thankyou @desert oar and @odd yoke
Hello nice people, can some one help me with a question using the str.extract function
anyone know how to use Plotly Express
Fig1 = px.scatter_mapbox (Random Locations)
Fig2 = scatter_mapbox (Random Locations 2)
Then somehow get both of them to show
I cannot get this to work, and reading the documentation is enraging
cannot find anything that solves my issue
@serene scaffold im confused what are you trying to do
@void anvil i dont understand, its just a file full of numbers?
if its 15 years old it probably isnt even "json" technically
if nobody is using it and nobody knows what it is
why does anyone care
just delete it
>>> a = np.array([np.nan, 2.0, 6.0, 4.0, np.nan])
>>> b = np.array([1.0, np.nan, 3.0, 8.0, np.nan])
# convert these to ->
>>> a = np.array([6.0, 4.0])
>>> b = np.array([3.0, 8.0])
@desert oar delete every element that's nan in either array
actually this is wrong too
fixed
is it possible to have a jupyter notebook use kernels from multiple servers?
x: np.array; y: np.array
overlap = ~np.isnan(x) & ~np.isnan(y)
x, y = x[overlap], y[overlap]
appears to work correctly
very esoteric looking though
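For anyone reading along, here is a minimal sketch of the mask approach above, run on the small example arrays from a few messages back (built with `np.array`). The names `both_valid`, `a_clean` and `b_clean` are made up for illustration.

```python
import numpy as np

a = np.array([np.nan, 2.0, 6.0, 4.0, np.nan])
b = np.array([1.0, np.nan, 3.0, 8.0, np.nan])

# True only at positions where BOTH arrays hold a real number
both_valid = ~np.isnan(a) & ~np.isnan(b)

# Indexing both arrays with the same mask keeps them aligned and equal in length
a_clean, b_clean = a[both_valid], b[both_valid]
print(a_clean, b_clean)
```

Using `&` keeps a position only if neither array is nan there; `|` would keep positions where at least one of the two is a number.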
I'm sorry, but I just can't make sense of these numba errors. I could use help 2 ways... (1) How do I interpret the errors? and (2) how do I fix this particular error.
Code:
@njit
def iterate_stuff(c_TV, c_x, c_y, p_TV, p_x, p_y, underlying, initial_investment,
                  flat_fee, per_contract_fee, percent_fee, days, putCall_floats, multipliers,
                  asks, strikes, closes, putCalls):
    for i in range(days.shape[0]):
        # lots of code after this
Error:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
During: typing of argument at c:/.../analysis2.py (222)
File "analysis2.py", line 222:
def iterate_stuff(c_TV, c_x, c_y, p_TV, p_x, p_y,underlying, initial_investment, flat_fee, per_contract_fee, percent_fee,days, putCall_floats, multipliers, asks, strikes, closes, putCalls):
<source elided>
for i in range(days.shape[0]):
^
@serene scaffold well you dont need to use the tuple expansion trick
it isnt esoteric to me, its just using boolean operators for boolean operations
also maybe you want | and not &
if either is False, it has to be False
then use or, not and
wait
what if one is missing but the other is not missing
nan, rather
we only want the elements that are numbers in both arrays
we need two arrays of only numbers that are the same shape.
@modest rune im surprised at that error, without knowing more about your code its hard to say why
it might even be a problem in numba's type inference
@desert oar I think I figured it out. One of the many parameters I had in my function declaration was an array of strings. I removed it and the error went away.
ah
Not a very helpful error.
yeah, numba probably doesnt like that
no its not helpful
especially since the arrow is pointing to the wrong thing
could be a good bug report if you can isolate and reproduce it
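A note on reading that error: `array(pyobject, 1d, C)` is how numba describes a 1-d NumPy array with `dtype=object`, which nopython mode cannot type. A rough way to find the offending argument before calling the jitted function is to check dtypes on the NumPy side. This is only a sketch: `find_object_arrays` is a hypothetical helper, and guessing that `putCalls` was the string array is an assumption made for the example.

```python
import numpy as np

def find_object_arrays(**arrays):
    """Return the names of arguments that are object-dtype NumPy arrays."""
    return [
        name
        for name, arr in arrays.items()
        if isinstance(arr, np.ndarray) and arr.dtype == object
    ]

days = np.arange(3)
strikes = np.array([100.0, 105.0, 110.0])
# Assumption for illustration: strings stored with dtype=object,
# which is what a pandas string column typically becomes via .to_numpy()
putCalls = np.array(["put", "call", "put"], dtype=object)

print(find_object_arrays(days=days, strikes=strikes, putCalls=putCalls))
```

Anything this reports has to be converted to a numeric array (or left out of the jitted function), which matches the `putCall_floats` idea already in the signature.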
yeah... I might be emotionally up for that if I can get my code to run 🙂
I keep biting off bigger and bigger chunks of code to numbify... its like pulling teeth.
This looks rad. Databricks for Dask, basically: https://coiled.io/
nice
Can anyone please help with this? I've been stuck for a while:
df_diff = pd.DataFrame([1, 2 , 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 0, 8], columns = ['diff'], index=pd.date_range('1/1/2000', periods=24,freq = 'H'))
I'm trying to get the index locations of the beginning and **end** of the largest run of consecutive zeros; for the case above the answer would be (2000-01-01 09:00:00, 2000-01-01 18:00:00)
This is what I have so far:
```python
ref_val = 0
shut_in_days = 7
for row, value in enumerate(df[col_name]):
    if value == ref_val:
        day_counter = 0
        # Check consecutive zero elements
        previous_value = abs(df[col_name][row-1]) if row > 0 else None
        current_value = abs(df[col_name][row])
        next_value = abs(df[col_name][row+1]) if row < len(df[col_name])-1 else None
        while (previous_value + current_value + next_value) ==0 and (day_counter > shut_in_days):
            row+=1
            day_counter+=1
        print(df.index[row], value) if row < len(df[col_name])-1 else None
    else:
        print(row, value)
```
@sinful dock this is what I did:
>>> zero_runs = (df.diff(1) != 0).cumsum()[df['diff'] == 0]
>>> zero_runs[zero_runs['diff'] == zero_runs.groupby('diff').size().idxmax()].index
DatetimeIndex(['2000-01-01 09:00:00', '2000-01-01 10:00:00',
'2000-01-01 11:00:00', '2000-01-01 12:00:00',
'2000-01-01 13:00:00', '2000-01-01 14:00:00',
'2000-01-01 15:00:00', '2000-01-01 16:00:00',
'2000-01-01 17:00:00', '2000-01-01 18:00:00'],
dtype='datetime64[ns]', freq=None)
there might be a better way
I haven't used pandas in a long time
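A slightly more spelled-out variant of the same run-labelling idea, returning just the first and last timestamp of the longest zero run. This is a sketch, not the only way to do it; `is_zero`, `run_id`, `zero_runs` and `longest` are illustrative names, and the hourly index is built with a `Timedelta` purely to avoid depending on a frequency alias.

```python
import pandas as pd

df_diff = pd.DataFrame(
    [1, 2, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 0, 8],
    columns=["diff"],
    index=pd.date_range("1/1/2000", periods=24, freq=pd.Timedelta(hours=1)),
)

is_zero = df_diff["diff"].eq(0)

# A new run starts whenever the zero / non-zero state differs from the previous row
run_id = is_zero.ne(is_zero.shift()).cumsum()

# Keep only the rows that are zero, labelled by the run they belong to
zero_runs = run_id[is_zero]

# The run label that occurs most often is the longest run of zeros
longest = zero_runs.value_counts().idxmax()
idx = zero_runs.index[zero_runs == longest]

start, end = idx[0], idx[-1]
print(start, end)
```

If two runs tie for the longest, `idxmax` just picks one of them, so ties would need explicit handling.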
@velvet thorn thanks, trying to digest it now
Hi all. I have a question about pandas. How can you replace a series in a dataframe with another series? Essentially I took in a dataframe, then a series from that. I took in ANOTHER dataframe, and another series from that
I tried to combine these two series using the .append function. It seems like I can store this newly combined series in a variable, but I am now trying to replace the old series from the old DataFrame with the newly combined series. But nothing I am doing is working and pandas' documentation isn't leading me anywhere useful. Is this even possible in pandas?
Please let me know with a mention if you happen to have any idea. Thank you very much
Like, I realllllly thought that oldSeries = newSeries would do the trick but no can do
How do you mean?
For example, I'm trying the update function and this doesn't even seem to be working. When I so much as try to print out the new dataframe, the old series is still there even though I apparently updated it with the new one
I guess, if you are trying to replace one series with another of similar len(), wouldn't you just say df['new_series'] = df2['new_series'], where df is the dataframe you want to keep and df2 is the other one? If you want to combine dataframes check these:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Thats what I meant by oldSeries = newSeries
Unfortunately, the newly combined series does not yet exist in a dataframe. I'm trying to insert it into the old dataframe in place of the original series, but it's stuck to the dataframe, seemingly
Here is one concern I had: can one series not be longer than the rest of the series in the dataframe? That's the only reason I can think of. If pandas disallows that it must be capping the series at some point, which is really frustrating
@lapis sequoia no, it cannot.
that wouldn't make sense
pandas is meant for tabular (i.e. rectangular) data
That's unfortunate
if you need that, it's likely that you have to reconsider your data model
how about you explain what you're trying to do
might be an XY problem
I will try
I have a master dataframe so to speak. It has a very specific format and I need to keep it in that format because I have a mastercode, if you will, that reads that format very specifically. So I don't want to have to change the master format at all
I recently got a new dataframe with some new products that I am trying to add to the master dataframe. The problem is that the new dataframe has different column names, a different number of columns - basically not in the same format as the master dataframe at all. HOWEVER, the new dataframe does have an identical "products" column that I would like to combine with the "products" column in the master dataframe
As well as any other columns in the new dataframe that the master dataframe also might have. But I need to fit everything into the master dataframe even if that means that some of the new dataframe columns are unnecessary and untouched
okay
so basically
you want to combine the two, leaving out unnecessary columns from the new DataFrame, and (possibly) mapping some column names
One solution would be to tailor the entire new dataframe so that it's in the format of the old one, and use "" as placeholders
Yeah, basically
leaving null values where columns in the old DataFrame do not exist in the new one.
correct?
That might work. So far it looks like the only column in the combined dataframe that is going to be 100% filled with no null values is the "products" one, the one I am directly adding to. Later on I will gather additional information that will help me fill the null values out, but I want to keep everything in the format of the old dataframe
I suppose one way to do this is to make a new dataframe where every value except for that in the "products" column is turned to null
@left drift i think you need to give a specific example
its really not clear what you want
if you give example inputs and outputs it would help significantly
@desert oar did you get the wrong commander
I can try. I'm sorry that I'm not being as specific.
read the new dataframe
Create variable for master series
create variable for new series.
Using .isin, I want to create another series that consists of the values in new series that are not already in master series. This is to remove overlaps, essentially.
Now I would like to essentially stack master series on top of the new series. I went about many ways to do this but now I see that it's not going to work because adding a series to dataframe breaks the rectangular shape of the dataframe, which pandas does not allow.```
Sorry, that ended up not being code at ALL. I apologize, Im not looking at it right now
I tried using combinedseries = masterseries.append(new series)
It created a new series, but I couldn't simply type masterSeries = combinedseries
Or use the .replace documentation either
so it looks like I'll have to concatenate two dataframes
Can anyone help with ML basics
@lapis sequoia ok, i agree that pd.concat seems correct. let me work an example, one moment
actually it looks like you get it
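For reference, a small sketch of the `.isin` + `pd.concat` route discussed above. The frames and column names other than "products" (`price`, `stock`, `colour`) are invented for the example; the only thing taken from the conversation is that both frames share a "products" column and that the master format must be preserved.

```python
import pandas as pd

# Hypothetical master frame in the format that must be preserved
master = pd.DataFrame(
    {"products": ["apple", "pear"], "price": [1.0, 2.0], "stock": [10, 5]}
)
# Hypothetical new frame with a different layout but the same "products" column
new = pd.DataFrame(
    {"products": ["pear", "kiwi", "plum"], "colour": ["green", "brown", "purple"]}
)

# Drop products that already exist in the master frame (removes the overlap)
fresh = new[~new["products"].isin(master["products"])]

# Keep only the columns that the master frame also has
shared = [col for col in new.columns if col in master.columns]

# Stack the rows; master columns missing from `fresh` are filled with NaN
combined = pd.concat([master, fresh[shared]], ignore_index=True)
print(combined)
```

The result stays rectangular: every row has every master column, and the cells that are not known yet are simply null until they can be filled in.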
anyone have expeirence with px.scatter_mapbox?
I cannot for the life of me get two figures with different data sets to appear on the same map
Question: I am close to maximizing my speed improvements from numba. I have a few different things I want to try. One of them is... run some of my parallel code on my gpu.
So, regarding that idea. Am I going to waste my time if I try to do that on a dell laptop running windows with whatever video card comes stock with the laptop?
Most likely
Unless I read it incorrectly, his focus is speed
Yes and?
I don't think there is going to be a massive speed improvement on a laptop that has a good chance of having no real gpu
That's why I said "most likely" when asked whether there's a chance he is wasting his time
The amount of speed you get would depend on which gpu 100%, but an increase is still an increase, and correct me if I’m wrong, but the newer dells ship with 2060-80’s (maybe 30.0’s too idk)
I haven't used a laptop in years, so I'm not sure what the standard dell would ship with these days 🤷♂️
Just because you don’t have a titan, doesn’t mean it’s not worthwhile
I suppose it depends on what you are doing and how you account for the time it takes to set things up
Also I’ve got a like 7 yo amd laptop gpu but it still helps
I suppose it depends on what you are doing
Ya that too, but unless you’re trying to take some brute force approach (or something similar), it doesn’t need to be that powerful
even really slow gpus beat pretty fast cpus in a lot of applications
I can use Keras Tuner with Transfer Learning(for example, model from TF Hub)?
keras tuner requires you to put some hp arguments within the model generator
so you wouldnt be able to
does anyone know how to use github?
@carmine finch most people know how to use GitHub, but I think you want #tools-and-devops
ok so now u want me to go to a diff section
Its a text file formal
what produced it?
Format
Just some c code
It can be read using numpy normally, like np.loadtxt(filename.out)
I'm literally seeing only one command pd.read_csv in pandas
I believe you can treat out file like a txt file
Then how do you read a txt file using pandas?
@austere swift
Thank u, but I can add TF Hub model and else few layers to the model and add Tuner?
Yeah you can add tuner to the layers you add
@solid mantle the documentation of np.loadtxt says whitespace is used as a delimiter
Nice, and it will be work fine?
my guess is you could use sep=' ' in pd.read_csv
no, my point is
if you can use np.loadtxt, you should be able to use pd.read_csv with sep=' '
pd.read_csv just reads files which use a character as a separator
it doesn't have to be a comma (the "c" in "csv")
You can read non CSV files with read_csv?
it can be, for example, a tab (that would be TSV), or something else
extensions are hints
@velvet thorn dm
the extension of a file is one thing, its format another
the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. (from https://en.wikipedia.org/wiki/Comma-separated_values)
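To make the `np.loadtxt` / `pd.read_csv` comparison concrete, here is a small sketch that reads the same whitespace-delimited text both ways. The sample data is invented, and it uses `sep=r"\s+"` rather than `sep=' '`: a regex separator matches any run of whitespace (closer to what `np.loadtxt` does by default), whereas a single-space separator treats every individual space as a field boundary.

```python
import io

import numpy as np
import pandas as pd

# Made-up contents of a whitespace-delimited .out file (note the uneven spacing)
text = "1.0  2.0 3.0\n4.0 5.0   6.0\n"

# NumPy: any run of whitespace separates values by default
arr = np.loadtxt(io.StringIO(text))

# pandas: a regex separator gives the same behaviour; header=None because
# the file has no header row
frame = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

print(arr)
print(frame)
```

With a real file, the `io.StringIO(text)` argument would just be the path, whatever its extension is.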
Thankyou
hello, i am doing image classification on dogs and cats breed. my folder structure this way ```python
demo -- >
training --> dog --> breed 1
--> breed 2
--> breed 3
--> cat --> breed 1
--> breed 2
--> breed 3
testing --> dog --> breed 1
--> breed 2
--> breed 3
--> cat --> breed 1
--> breed 2
--> breed 3``` is this the correct folder structure for image classification of dog and cat breed ?
I keep on getting the attachment as error while trying to run sklearn.model_selection.learning_curve. The target variable is of binary class and the minority class has a frequency of ~3.7%.
solved it! I needed to increase the minimum size of my training set. I was taking too few samples and that resulted in just one class.
@mild topaz
Folder: Demo
Dirs: Dog, Cats.
U need to split data with np, pt or tf.
It's much easier.
Anyone know how to read an array with dtype tf.int64? I can't figure it out. Tried tf.print() but that gave the same output (the type, dtype and shape of array) which looks like this:- <BatchDataset shapes: ((1, 50), (1, 50)), types: (tf.int64, tf.int64)>
@brittle agate can u explain a bit how i can do this?
the object you have is not a tensor @grave frost it's a dataset, you can use .as_numpy_iterator() and cast to list or whatever, or consume it using .take or a for loop
^
Hey guys, i'm trying to delete some data of columns that have a value > 30 I found this on the internet and it works for one columns but when I add the other with the & none of the two works ( I plot it after to see if it has worked)
data = pd.read_csv('data.tsv', delimiter='\t', usecols=['WORKDAYS','MJDAY30A'])
# Get indexes where name column has value greater than 30
indexmjday = data[ (data['MJDAY30A'] > 30)].index
# Delete these row indexes from dataFrame
data.drop(indexmjday, inplace=True)```
so this work
but when I add
indexmjday= data[ (data['MJDAY30A'] > 30) & (data['WORKDAYS'] >30)].index
it doesnt work anymore
huh
why not just do this:
raw_data = pd.read_csv('data.tsv', delimiter='\t', usecols=['WORKDAYS','MJDAY30A'])
data = raw_data[(raw_data['WORKDAYS'] <= 30) & (raw_data['MJDAY30A'] <= 30)]
well idk aha im noob
so this delete the data I guess
it worked, thanks vm I will remember it
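Why the `&` version seemed not to work: it only drops rows where both columns are above 30, so rows with a single out-of-range value survive. Keeping rows where both are `<= 30` is the same as dropping rows where either is `> 30`. A toy sketch (the numbers are invented, the column names are the ones from the question):

```python
import pandas as pd

data = pd.DataFrame(
    {"WORKDAYS": [5, 40, 10, 35], "MJDAY30A": [2, 3, 45, 50]}
)

# Drops a row only when BOTH columns exceed 30 (the version that seemed broken)
drop_both_only = data.drop(
    data[(data["MJDAY30A"] > 30) & (data["WORKDAYS"] > 30)].index
)

# Drops a row when EITHER column exceeds 30
drop_either = data[~((data["MJDAY30A"] > 30) | (data["WORKDAYS"] > 30))]

# Equivalent here: keep rows where both columns are within range
keep_both = data[(data["WORKDAYS"] <= 30) & (data["MJDAY30A"] <= 30)]

print(len(drop_both_only), len(drop_either), len(keep_both))
```

One caveat: the two "either" forms only agree when there are no missing values, since a NaN fails both `> 30` and `<= 30`.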
@mild topaz
Okay, I work with fork TensorFlow:
How I'm doing it:
# Load the dataset (first argument), split it, and pass the usual options.
# With with_info=True, tfds.load returns (datasets, info), so unpack both.
(train_ds, test_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)
# Adding validation, but not a simple one: ten 10% slices for 10-fold cross-validation.
val_ds = tfds.load(
    'tf_flowers',
    split=[f'train[{k}%:{k + 10}%]' for k in range(0, 100, 10)],
    as_supervised=True
)
I think that adding data from local dir it's not big deal.
Need just to google.
Ok, I don't really understand something. In Numba, there is the function numba.prange() that, if I understand things properly, is supposed to speed up for loops (and maybe other loops?) by parallelizing their execution. When I use prange() my loops are significantly faster. But, if I use prange() and set the @njit(parallel=True), usually things slow down.
And then, there is @njit(nogil=True)... which is related to parallel execution too.
So, it seems like numba runs code in parallel in many situations, regardless of your nogil and parallel settings? And what is the point of setting Parallel=True if nogil=False?
And, are there other functions like prange() that I should be aware of, that if I use instead of core python functions or common numpy functions will improve my speeds further?
@mild topaz
If I want to use simple validation (not k-fold and so on):
(train_ds, val_ds, test_ds), info = tfds.load(
    'tf_flowers',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True
)
I wanna be a data scientist but where should I start 😭 😭 😭
@misty fjord that is a life level question. There are a lot of good ways to approach your problem.
@misty fjord
U need to know the theory. Basic architecture of NNs: what bias, weights, input, output and activation functions are.
U can read Deep Learning with Python.
If I were to rewind my life and redo things, I would have taken a different approach to my learning, but the path I did take, worked well enough for engineering.
Go to the TensorFlow site. U can find a learning plan for beginners there.
@brittle agate nns !? 😭
how does data science relate to neural networks ? I thought data science was more statistical analysis of data and that neural networks were more like machine learning stuff
If I were you, I would find what you think would be:
a. perfect job
b. perfect type of problem to solve
c. most enjoyable thing to work on all day
Then, ask questions about what you want to do working backwards from there. It seems to me that data-science is quite a broad field and Machine Learning may or may not be the path you want to take.
data science contains both machine learning and statistical analysis
Ha Ok . Can you give me a path. Or should I just start and find it my self > S - plural.
@brittle agate 😭 😢 😭 😢
oh okay
I guess it uses the data analysis to better calibrate the neural network or something like that
@modest rune I'm the c type of ppl I'm just enjoying doing python
@lapis sequoia
Yep!
hmm I see thanks, Im a real beginner rn but I'd like to go into neural networks, should I dive directly into it or is it better for me to train with statistical analysis pandas matplotlib etcc 1st ??
are you familiar with a programming language ?
I already know about inferiential statistic just not about programming
@lapis sequoia If you like programming, or specifically like programming in python, and you like data... find something you can do to earn money doing those things. Specifically, find a company you like that hires people to do exactly that. Then, ask them or someone who works for them, what you should be learning.
python basic that's all
In the end, it is all a journey, you will start down path A, but as things evolve, you will end up on path G.
@modest rune if there such companies in where I live I'll already done that
@lapis sequoia
I think not. Yeah, of course u need to know Python's data structures. But today, ML forks are so easy.
U can choose TensorFlow or PyTorch. I recommend starting with TF, because:
- TF has many guides (official site).
- Ecosystem: TF Hub, TensorBoard and Datasets.
- It's a very easy fork, because on top of TF u're using Keras,
a high-level API fork for easy work with backend forks like PyTorch, Theano, TensorFlow and so on.
- U can use TensorFlow in production.
It has Coral, Lite (for mobile) and .js versions.
really torch is definitely not far behind, if at all, on most of these points
I appreciate the help, thank u. Tho I dont understand every of your points tho lol what does "fork" means aha ?
The thing is as with everything since ive learned python Im kind of lost into what should I do like should I find a project that would interest me to do with tensor flow or should I follow broad theory courses with random exercises ?
@odd yoke
I think it's friendlier to new people.
the edge TF has is deploying on different systems with tflite, or tf.js but other than that:
- torch also has a bunch of guides
- torch has a ton of premade models, datasets etc as well
- it has a very similar api to keras, while being significantly faster
- and it's definitely usable in production
@lapis sequoia
I mean framework.
ooh okay lol
tf is actually much better now but not that long ago it was an absolute mess
if you ever have the opportunity to use tf1.XX, do it, it's an enlightening experience
@odd yoke
I recommended TensorFlow because I know him much better (and I like him :3). But he can choose another fork.
you will see what bad design is
@odd yoke
TF 1.x was so stoopid.
yet it's what my company uses 
@odd yoke
But we talking about present.
And for beginner is good choice.
really i don't think there's any major difference anymore between torch and tf ever since tf2 came out
both are completely fine
ive found the pdf of book "hands on machine learning" do you recommend me to start with taht without any prior experiences or to explore here and there by myself ?
@odd yoke
I recommended TensorFlow because I know him much better(and I like him :3). But he can choice another fork.
But yeah what u said is true.
of course :D
@lapis sequoia
Deep Learning with Python.
I personally am not a fan of starting directly with code because, in my experience, i see a tendency in ppl that do that to never go back to see the fundamentals and theory behind the algorithms
What is ppl?
people
kk
hmm yeah like they apply it as a method or recipe without really knowing what they are doing
@brittle agate is it a book ?
Yep)
ok thanks will look into it
there's "deep learning with pytorch" which is free and actually is a nice mix of theory and programming
and it's free
Yeah, it's first edition. Second edition in process.
it's from the pytorch people too
https://pytorch.org/deep-learning-with-pytorch you can get it here if you're comfortable giving some of your informations, they don't check you directly get a download link to the pdf
@odd yoke
Is this book still up to date?
it came out 3 months ago
I mean, PyTorch has changed too, like TF.
@odd yoke Really?
XD
I didn't know.
@brittle agate do you know if the author is Francois Chollet or if it's Nikhil ketkar ? I found 2 books with this name
thanks guys I'll do some research and make my choice
Francois Chollet
ok thanks 😉
francois chollet is the original creator of keras, so it's not some random dude too
But some methods are much harder than in TF.
For example K-Fold. U don't know what is that right now. You will find out soon x)
@odd yoke
I read only original book by him.
And yeah, all guides for beginners, on the TF official site, Francois Chollet coded.
@odd yoke thoughts on my question above?
the numba prange question ?
yes
iirc prange if there's no parallel=True is the same as range
i'm not sure what could cause slow downs
I have about 6 nested functions that all are decorated with @njit. I get the best performance if only the top level function is set to parallel=True and all of my loops use prange() instead of range(). If I make all 6 functions parallel=True, massive slowdown.
Like 50x
crazy internet
@lapis sequoia
PDF Drive is good storage of books for free :))))))))))))))
My guess is that numba is spinning up too many parallel OS processes when I set too much stuff to parallel, and this is not ideal.
just too much overhead created for the amount of cores my laptop has.
@odd yoke
Q:
When need to use Flatten?
After Conv Layers is good practice to use Flatten or not?
it depends on the architecture you're building
For example transfer learning(ResNet50).
FCN for example don't have any fully connected layers, so you would not need to flatten
I know.
For a regular resnet for something like classification, yeah, you would need to flatten
For multi-class classification.
I mean.
@odd yoke
One time after Conv Layers?
yes
you can have like
conv -> relu -> maxpool -> conv -> relu -> maxpool -> ... -> flatten -> dense -> relu -> dense -> softmax
obviously for a resnet it's much more complex
Okay, thanks)
@odd yoke
But I can use already Transfer Learning and I don't need to make residual connections))
if the model is already made, then yeah you don't have to bother
hey all, is there a way to have pandas only calculate pct_change for the first row of a group in a groupby group?
or more simply maybe, given something like this:
A B C D
2000-01-02 14 5 20 14
2000-01-09 4 2 20 3
2000-01-16 5 54 7 6
2000-01-23 4 3 21 2
2000-01-30 1 2 8 6
2000-02-06 55 32 5 4
can I calculate pct_change for only like, 2000-01-30 to 2000-02-06?
this example is from here: https://www.geeksforgeeks.org/python-pandas-dataframe-pct_change/ - what I'd like to do is skip the calculation for everything except the bottom row
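For the simpler version of the question (only the change from 2000-01-30 to 2000-02-06), one option is to slice the last two rows before calling `pct_change`, so nothing else is computed. A sketch using the table above; `last_change` is an illustrative name, and the groupby case is not covered here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [14, 4, 5, 4, 1, 55],
        "B": [5, 2, 54, 3, 2, 32],
        "C": [20, 20, 7, 21, 8, 5],
        "D": [14, 3, 6, 2, 6, 4],
    },
    index=pd.date_range("2000-01-02", periods=6, freq="7D"),
)

# Only the last two rows take part, so only one row of changes is calculated
last_change = df.iloc[-2:].pct_change().iloc[-1]
print(last_change)
```

The same numbers come out of `df.pct_change().iloc[-1]`; slicing first just avoids computing the rows that would be thrown away.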
Can someone help me with a pandas dataframe, I want to search a single column with a string, then output the value of a different column of the same row
or maybe just select the whole row
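A minimal sketch of both variants with a boolean mask and `.loc`. The frame and its `name` / `city` columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["alice", "bob", "carol"], "city": ["Oslo", "Lima", "Kyiv"]}
)

# Exact match on one column, returning the value(s) of another column
matches = df.loc[df["name"] == "bob", "city"]

# Substring match, returning the whole matching row(s)
rows = df[df["name"].str.contains("ar")]

print(matches.tolist())
print(rows)
```

Both results can hold several rows if the search string matches more than once, so `matches.iloc[0]` is only safe once you know there is exactly one hit.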
Hello guys, i'm programming just as a hobby and Python is the only language i know. I recently came to the conclusion that i would like to create my own Neural Network, just to see how it works and how hard it is to make one. I knew that i need data so i gathered a lot of numeric data regarding Character Auctions in some MMORPG. Now i would like to create the NN which, based on ended auctions (the data that i have), will be able to predict the price of a given character. For now i'm using parameters such as: Level, Skills, Achievement Points, Charms Points, Gold (everything measured in numbers). I also made numeric data related to Vocation and World (server) (i gave a different number to every voc and server). When it comes to items etc. i skipped that part for now. I realise that this is really hard, so for now i just leave it. I would like to add it in the future, when let's say the "alpha version" works.
The problem is - i don't really know where to start. I see bunch of articles which touch this topic but it wasn't "it". Are you able to recommend me some sources from which i can learn creating NN from scratch?
@tulip briar neural networks fundamentally require a bit of linear algebra and calculus to understand. you can get started without knowing those things, but you might get stuck quickly. maybe check out https://www.fast.ai/ for some intro material oriented towards people with programming experience and toward getting you into neural networks (and deep learning) quickly
i also recommend subscribing to https://towardsdatascience.com/ which has a lot of tutorial-level content
eventually if you want to get seriously into data science and machine learning you will need to start understanding statistics and probability in addition to linear algebra and calculus. but that's farther down your learning path and you will probably start learning those concepts as you go along anyway
@odd yoke Could you elaborate a bit more?
the docs for numpy mention axes a lot
if I want to find the average of all elements in a matrix, regardless of the shape of that matrix, does that mean the axis is zero or what?
[[1, 2], [3, 4]], I would want the average to be sum([1, 2, 3, 4]) / 4 even if it were [1, 2, 3, 4] or [[1], [2], [3], [4]]
for what I'm doing, of course, not necessarily in general
I think there is a flatten kwarg, or just .reshape((-1,))
I guess I can use that for now but I figured axes held the key
oh, you can use .flat
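On the axis question: leaving `axis` out (its default is `None`) makes `np.mean` average over every element, whatever the shape, so no flattening is needed. As far as I know there is no `flatten` keyword on `np.mean`. A quick sketch with the three shapes from the question:

```python
import numpy as np

shapes = [
    [[1, 2], [3, 4]],        # 2 x 2
    [1, 2, 3, 4],            # flat
    [[1], [2], [3], [4]],    # 4 x 1
]

# axis=None (the default) averages over all elements regardless of shape
means = [float(np.mean(m)) for m in shapes]
print(means)

# With an axis, the mean is taken along that dimension only
print(np.mean(np.array(shapes[0]), axis=0))
```

`np.nanmean` works the same way but skips nan entries, which may matter for the dataframe columns mentioned next.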
also I have a data frame with a column called 'Binary Label', but I need to delete that series for one part
the docs, unsurprisingly, mentions axes
that series is all booleans though so I might be able to keep it if I can represent them as 1.0 and 0.0 in a numpy matrix
@desert oar I don't have much trouble with math, i'm just a total newbie in coding :-D. I have been learning about tensorflow, did some udemy course about Pandas but didn't find any interesting guide / course about NN.
looks like representing booleans as floats is fine
I need to do calculations for elements that are not null and for which the corresponding boolean element for that row is correct
I'm sure I'm not phrasing this effectively
let's turn it into an xy problem
[[3., 2., 1.],
[5., 4., 0.],
[6., 1., 1.]]
[[1., 1., 1.],
[0., 0., 0.],
[1., 1., 1.]]
if I can go from the first to the second, I can implicitly solve the actual problem.
@serene scaffold i dont understand either version of the problem
see how the rightmost column is always 1 or 0?
yes
I just want to make the entirety of each row match that
what do you mean "make the entirety of each row match that"
if the last element of the row is 1.0, every element becomes 1.0
i see. and now what do you actually want to do?
the last column comes from a dataframe where the values are the strings Yes or No
and I'm storing them as these floats because every other column in the dataframe is a float
that is, every other column stores only floats
or nan
ok, but what do you actually want to do
and, for example, I need to take the average of every non-nan value in a column, for which the value in the rightmost column is 1.0, and replace the nans with the average of those values.
ok. you are doing mean imputation for missing data?
ye
stratified by the value of the label?
I guess I could have just said that
because you really can't/shouldn't stratify by the value of the label, since you won't know the label at prediction time if a record w/ missing data happens to appear
it's conditional mean imputation. The two classes are just "Yes" and "No"
is the yes/no category a feature or the target
feature
argh. pandas makes me wanna stab someone.
the yes/no designation is always given
can you do this in pandas? or do you have to do it in numpy?
its easy to do both ways but the pandas way is "more elegant" since you can use groupby
is this homework? a job interview?
homework. which is why I'm trying to make my question very pointed to learning the libraries
this is the only programming assignment for the course and we're not even required to do it in python
but there's a part where we have to do 8000 ** 2 calculations and idk how the Java users are going to do that without numpy
principle 1: when in doubt, use boolean subsetting by row
principle 2: pandas groupby exists
java loops are presumably a lot faster than python loops
so let's do the pandas version because it sounds like your data is in pandas
the professor is only letting people use Java because it's the default language in the department and he doesn't want to force people to learn python if they don't want to. but that's beside the point.
you can do it with a for loop in python too, it will just take a long time
data = pd.DataFrame({
"x": [1.0, 1.5, 3.0, -3.2, -4.7, -2.2],
"is_valid": ["no", "no", "no", "yes", "yes", "yes"]
})
is this a fair representation of the data? of course including more columns than just x
```
0.19958,0.2133,0.47241,0.36529,0.57143,0.53333,0.41898,0.35424,0.39056,0.41731,No
0.26562,0.25566,0.47299,0.34705,0.5,0.33333,0.42342,0.38472,0.40689,0.41538,Yes
0.11211,0.37656,0.53267,0.37086,0.57143,0.46667,0.5287,0.47377,0.50504,0.48333,No
```
that's setting the bar obscenely low already
give me an algorithm for imputing the mean within each group. it can be imprecise, we will make it more precise later.
just a rough sequence of steps
ignore anything related to pandas and numpy
imagine you are a number magician and you can just operate on tables of data directly
I guess you could filter out all the rows that aren't the right class, and then take the average of each column
stated another way: select only the rows that are the right class, take the average of each column in those rows, and replace every missing value with the average in its respective column
is that right?
yes
great. in pandas, how do you select rows such that column "class" equals "yes"
is it group_by?
can someone please explain to me wtf the point of limit is in pct_change()? It seems to do basically nothing lol
>>> revs
ticker calendardate revenue
None
29054 A 2019-12-31 5.163000e+09
29055 A 2018-12-31 4.914000e+09
29056 A 2017-12-31 4.472000e+09
29057 A 2016-12-31 4.202000e+09
29058 A 2015-12-31 4.038000e+09
>>> revs['revenue'].pct_change()
None
29054 NaN
29055 -0.048228
29056 -0.089947
29057 -0.060376
29058 -0.039029
Name: revenue, dtype: float64
>>> revs['revenue'].pct_change(limit=1)
None
29054 NaN
29055 -0.048228
29056 -0.089947
29057 -0.060376
29058 -0.039029
Name: revenue, dtype: float64
>>>
according to the docs, "The number of consecutive NAs to fill before stopping." -- doesn't this mean it should stop after hitting that first NaN?? all I want to do is calculate the pct_change the first fu%&$( row and I've been at this for 3 damn days and gotten nowhere.
I don't really know, I haven't used pandas before this
you have never ever subsetted rows in pandas ever?
I've just never used pandas
(and yeah I realize this is reversed, I'm using that as a test to ensure what would happen if it results in NaN for the first row, I want it to just stop and move on to the next group (next ticker))
I've used numpy a bit but I've learned a lot of what I now know from trying to solve this assignment
i see
i thought youve used both libraries before
lets do pandas then. slicing and subsetting in numpy is actually messier in some respects than in pandas
I've never needed to use pandas for nlp and a lot of the math stuff is handled by existing libraries even if those libraries are using numpy internally.
https://numpy.org/doc/stable/user/basics.indexing.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
i recommend at least trying to struggle through both of these documents
you will get confused
but they do explain what you need to know
😨
there are 3 ways to select rows or columns in pandas: by "label", by numerical position, or by an array of True/False
i use the term "array" to mean anything "array-like", which includes pd.Series, np.ndarray (1 dimensional), and list
in this case, you want the 3rd option
pandas dataframes have 2 main "accessors" that you can use for selecting data: .loc and .iloc. .loc is for selecting by label or boolean array. .iloc is for selecting by numerical position.
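a minimal sketch of the three selection styles described above, on a made-up frame (the column names and index labels here are just for illustration):

```python
import pandas as pd

data = pd.DataFrame(
    {"x": [1.0, 1.5, 3.0, -3.2], "class": ["no", "yes", "no", "yes"]},
    index=["a", "b", "c", "d"],
)

# 1. By label: rows "a" and "c", column "x".
by_label = data.loc[["a", "c"], "x"]

# 2. By numerical position: rows 0 and 2, column 0.
by_position = data.iloc[[0, 2], 0]

# 3. By boolean array: rows where column "class" equals "yes".
mask = data["class"] == "yes"
by_mask = data.loc[mask]
```

the mask is an ordinary pd.Series of True/False, so masks can be combined with & and | and negated with ~.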
Quick question about appending multiple txt files into one data frame as I cannot figure out how to do this with Pandas.
I have a folder with a bunch of text files representing a banks quarterly reports from year 2006-2015 (2006-03-31, 2006-06-30, 2006-09-30, 2006-12-31....2015-12-31)
@lapis sequoia ```python
from pathlib import Path
import pandas as pd
data = pd.concat([pd.read_csv(path) for path in Path('data/').iterdir()])
```
you might need to adjust the arguments to pd.read_csv and to pd.concat depending on your specific setup
well it depends on how the data is formatted
.txt is just part of the filename
it could be anything in there
is it tabular data? tab-separated maybe?
New in version 3.4.
Source code: Lib/pathlib.py
This module offers classes representing filesystem paths with semantics appropriate for different operating systems. Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.
If you’ve never used this module before or just aren’t sure which class is right for your task, Path is most likely what you need. It instantiates a concrete path for the platform the code is running on.
Pure paths are useful in some special cases; for example:... read more
ok
the txt files are in a folder called rawdata which is in my current directory
then I don't need to specify path?
@serene scaffold maybe i'll give you an example to help you get started w/ those docs. here it is without groupby. the numpy-only version is very similar, except without .loc:
import pandas as pd
data = pd.DataFrame({
"x": [1.0, 1.5, None, -3.2, -4.7, None],
"y": [11, None, 12, None, 105, 101],
"is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})
# pd.Series of bool values
is_yes = data["is_valid"] == "yes"
is_no = data["is_valid"] == "no"
# pd.DataFrame of bool values
is_null = data.drop(columns=["is_valid"]).isnull()
for c in is_null.columns:
    # Fill "yes" rows
    data.loc[is_null[c] & is_yes, c] = data.loc[~is_null[c] & is_yes, c].mean()
    # Fill "no" rows
    data.loc[is_null[c] & is_no, c] = data.loc[~is_null[c] & is_no, c].mean()
of course, this changes the data in-place, which you might not want
so maybe you can put the imputed version of the data in a new column
import pandas as pd
data = pd.DataFrame({
"x": [1.0, 1.5, None, -3.2, -4.7, None],
"y": [11, None, 12, None, 105, 101],
"is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})
# pd.Series of bool values
is_yes = data["is_valid"] == "yes"
is_no = data["is_valid"] == "no"
# pd.DataFrame of bool values
is_null = data.drop(columns=["is_valid"]).isnull()
for c in is_null.columns:
    # Make a new column to receive the imputed data
    c_new = f'{c}_imputed'
    data[c_new] = data[c]
    # Fill "yes" rows
    data.loc[is_null[c] & is_yes, c_new] = data.loc[~is_null[c] & is_yes, c].mean()
    # Fill "no" rows
    data.loc[is_null[c] & is_no, c_new] = data.loc[~is_null[c] & is_no, c].mean()
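for comparison, here is a rough sketch of the groupby route mentioned earlier, using the same toy data. the names value_cols, group_means and imputed are made up for this example, and it writes to a copy instead of changing data in place:

```python
import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 1.5, None, -3.2, -4.7, None],
    "y": [11, None, 12, None, 105, 101],
    "is_valid": ["no", "no", "no", "yes", "yes", "yes"],
})

value_cols = ["x", "y"]

# transform("mean") returns a frame shaped like the input where every row
# holds the mean of its own group (missing values are skipped by default).
group_means = data.groupby("is_valid")[value_cols].transform("mean")

# Fill each missing cell with the mean of its group; the original is untouched.
imputed = data.copy()
imputed[value_cols] = data[value_cols].fillna(group_means)
```

this should scale to any number of classes without writing one line per class.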
@lapis sequoia well you need to specify the directory where to look for files
the general principle is, build a list of dataframes, then pass that list to pd.concat
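a sketch of that principle as a small helper. load_reports is a made-up name, and it assumes every file in the folder is comma-separated with the same columns; pass sep="\t" or similar through read_kwargs if that is not the case:

```python
from pathlib import Path
import pandas as pd

def load_reports(folder, pattern="*.txt", **read_kwargs):
    """Read every matching file in `folder` and stack them into one frame."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = pd.read_csv(path, **read_kwargs)
        # Keep the file name (without extension) so each row can be traced back.
        frame["source_file"] = path.stem
        frames.append(frame)
    # pd.concat raises ValueError if no files matched.
    return pd.concat(frames, ignore_index=True)
```

usage would be something like load_reports("rawdata") when the folder sits in the current working directory.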
Oh ok so it doesnt matter if I have the folder there in the working directory then
Will read into this. Thanks for the help!
@dense copper im honestly not sure what that limit parameter is for... looks like bad documentation. but, just practically, how do you expect to calculate the % change in the first row? there is nothing for the percent change to be based on, in the first row.
@lapis sequoia you can always specify the full path to the directory
e.g. /home/mrbambocha/projects/lab3/data/ or something like that
Yes, I understand. Thanks a lot!
@desert oar thanks, I'm looking at this
@desert oar assuming the data is reversed (which it will be), the % change would be row 0 / row 1
there is a lot going on here @serene scaffold so i recommend reading through the pandas guide and using my examples to help
or more specifically for my case I want row 0 / row 4, but that's easy
then you need to run the pct change in reverse... right?
yes I know that, but my point with reversing it in this example is I want it to stop when it hits the first NaN
I assumed that would mean it would just not calculate anything if limit=1 and the first result is NaN, but instead it just apparently does the same thing
and basically everything I find about it in tutorials is someone who just copy/pasted the docs from Pandas to explain it. lol
@dense copper you might have to just make up some fake data and experiment
what happens if limit=0?
i see
consider this ...
ive never actually used this method before so i know as much as you do
>>> revs
ticker calendardate revenue
None
29054 A 2019-12-31 5.163000e+09
29055 A 2018-12-31 4.914000e+09
29056 A 2017-12-31 4.472000e+09
29057 A 2016-12-31 4.202000e+09
29058 A 2015-12-31 4.038000e+09
>>> revs['revenue'].pct_change(99)
None
29054 NaN
29055 NaN
29056 NaN
29057 NaN
29058 NaN
Name: revenue, dtype: float64
>>>
https://github.com/pandas-dev/pandas/blob/v1.1.2/pandas/core/generic.py#L10230-L10244 this is the source
pct_change(99) will return NaN in every row because there's only 5 rows. anything more than pct_change(4) will be NaN for everything right?
the source explains it. limit= is passed to .fillna, which is called before shifting and dividing
but unfortunately it still runs the calc for every row.
which is bad for me, since I have 250k+ rows and 100+ columns
the limit parameter here explains itself in more detail
this is an issue of unclear docs from pandas
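going by the linked v1.1.2 source, the method roughly fills missing values first and then shifts and divides. this is a simplified sketch of that order of operations, not the real implementation (pct_change_sketch is a made-up name and only forward fill is handled):

```python
import pandas as pd

def pct_change_sketch(series, periods=1, limit=None):
    # Step 1: forward-fill gaps, at most `limit` consecutive NaNs per gap.
    filled = series.ffill(limit=limit)
    # Step 2: divide by the shifted series; this runs for every row.
    return filled / filled.shift(periods) - 1

s = pd.Series([1.0, None, None, 4.0])
unlimited = pct_change_sketch(s)         # both gaps filled before dividing
limited = pct_change_sketch(s, limit=1)  # only the first gap filled
```

so limit only matters when the input itself contains NaNs. it does not stop the calculation after the first NaN in the output, which matches the identical results shown earlier.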
surprised pikachu lol
it's annoying how everything in Pandas is so needlessly complex
yeah. i'd recommend filing a bug report: "in the docs for pct_change, clarify that limit and fill_method are kwargs passed to fillna"
i'd argue that it is not needless
poorly documented in some cases, perhaps
want a cell of data? sure just do my_data.get_index[0:,:99]['some_var'].get_the_data(act_fucky=False).index(0)[1]! easy!
lol
I'm being facetious...
usually when pandas seems "overly complex" it means one of two things: the existing APIs are too low level and should have some wrapper functionality developed, or the existing APIs are too high level and need to expose more internals to the user
anyway, I'm looking through the source
the actual complexity is in my opinion structural and unavoidable
I think the bigger issue is I'm a dumbass and don't understand pandas
i rarely need long annoying invocations like that except when i'm using multiindex, which is definitely under-supported from an API design perspective
well that is a problem
just frustrated cause I've been trying for 3 days now to accomplish something that seems so simple
in order to use pandas effectively imo it's important to understand the separation between data and axes
and I'm getting nowhere
what are you trying to accomplish exactly
given data like the following:
ticker calendardate revenue
A 2019-12-31 10
A 2018-12-31 9
A 2017-12-31 7
A 2016-12-31 1
A 2015-12-31 3
B 2019-12-31 5
B 2018-12-31 4
B 2017-12-31 3
B 2016-12-31 2
B 2015-12-31 3
my goal is to calculate the pct change in revenue from 2015 to 2019 ONLY, from each ticker
e.g. I want ONLY (10/3) ** (1/5), and (5/3) ** (1/5)
using pct_change(4) calculates it for every row, so I will get the value I want, plus four NaNs
instead I want it to calculate the first result, see that the second is NaN, and skip the last 3 attempts since I know they will all be NaN
(for each group, if that makes sense)
that more complex calculation (e.g. ** 1/5) is to calculate CAGR, and I'm doing that with a lambda function using .apply(), but if I can figure out how to just do pct_change (or any function for that matter lol) then I can figure out the rest.
the problem is I have hundreds of thousands of rows and hundreds of columns, but I only need the result of the first row divided by the 5th, and everything else gets thrown away. The default behavior increases the time it takes to run this process by an order of magnitude.
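one possible way to get only the endpoint ratio per ticker is to skip pct_change entirely and take the first and last row of each group. a sketch on the toy data above; n_periods is an assumption that just mirrors the 1/5 exponent written in the message:

```python
import pandas as pd

data = pd.DataFrame({
    "ticker": ["A"] * 5 + ["B"] * 5,
    "calendardate": pd.to_datetime(
        ["2019-12-31", "2018-12-31", "2017-12-31", "2016-12-31", "2015-12-31"] * 2
    ),
    "revenue": [10, 9, 7, 1, 3, 5, 4, 3, 2, 3],
})

# Sort oldest-to-newest so first() is the earliest year and last() the latest.
# Note: first()/last() return the first/last non-null value in each group.
grouped = data.sort_values("calendardate").groupby("ticker")["revenue"]
ratio = grouped.last() / grouped.first()

n_periods = 5  # assumption: same exponent as in the message above
growth = ratio ** (1 / n_periods)
```

this does one division per ticker instead of one per row, and selecting a list of columns instead of just "revenue" should handle many columns at once.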
@dense copper have you tried using the freq parameter?
I'm not sure how, and as usual the documentation is terrible lol
been looking around at that though
import pandas as pd
data = ...
def pct_change_for_ticker(y):
    return y.pct_change(freq=pd.tseries.offsets.YearEnd())
pct_changes = data.set_index(['ticker', 'calendardate']) \
    .groupby(level='ticker')['revenue'] \
    .apply(pct_change_for_ticker)
cursory attempt on that gave me a NotImplementedError
I'll keep trying though ... thanks for the idea
Hi, I am using Weka for text classification. For some reason when I apply the StringToWordVector filter, it does nothing to my data set and the number of attributes remains the same.
Does anyone have any idea why this filter is not working for me?🙏
can someone help me convert python script to exe file
anyone here use dash?
I could use some help. This particular part of my numba code accounts for 90% of my slowdown. I am hoping for a much faster way to accomplish this simple task:
def closest(values, desired):
    AB = np.abs(desired - np.expand_dims(values, 1))
    indexes = np.empty((AB.shape[0],), np.int64)
    for i in nb.prange(AB.shape[0]):
        indexes[i] = np.argmin(AB[i])
    return indexes
This is what the function is supposed to do. I have two 1D arrays, values and desired. desired has 200 elements in it, evenly spaced apart. Depending on when I call the function, values can have anywhere from 1 element to 200 elements, each with hard to predict spacings.
The code goes through every value of values and finds the INDEX of the element in desired that is closest to value.
Here is an example:
Input:
desired = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
values = [1.12, 4.77, 3.39]
Output:
indexes = [0,8,5]
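one alternative that avoids building the full len(values) x len(desired) matrix is a binary search, which works whenever desired is sorted ascending. this is a plain numpy sketch (closest_sorted is a made-up name, not the original numba code) and it assumes desired has at least 2 elements:

```python
import numpy as np

def closest_sorted(values, desired):
    values = np.asarray(values, dtype=float)
    desired = np.asarray(desired, dtype=float)
    # Index of the first element of `desired` that is >= each value,
    # clipped so that both neighbours (right - 1 and right) exist.
    right = np.clip(np.searchsorted(desired, values), 1, len(desired) - 1)
    left = right - 1
    # Pick whichever neighbour is nearer; ties go to the left neighbour.
    pick_left = (values - desired[left]) <= (desired[right] - values)
    return np.where(pick_left, left, right)

desired = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
values = [1.12, 4.77, 3.39]
indexes = closest_sorted(values, desired)
```

on the example above this gives the same indexes as the argmin version.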
anyone here use dash?
@lapis sequoia
I do
Im having an issue moving a legend on this map i have made
I looked at the documentation, and it seems to have no effect on it
This is my problem
You asked if I use it. I do, but I am terrible at it.
lolol
I mean, dash uses html and css. And to get the position of things to work properly, you really need to understand html and stylesheets really well.
dash makes beautiful plots though.
I was able to do lots of things in relationship to the map and even the colorbar without using html
How can I find out the index of multiple columns in pandas
make a slice
I was able to do lots of things in relationship to the map and even the colorbar without using html
@lapis sequoia
Yeah, the graphical elements usually have non-html/css attributes you can mess with. But, it wouldn't suprise me if text positioning was purely html and css.
Let me rephrase
I know what you are asking... let me see if I can help
dataframe.columns.index(column_name1, column_name2)?
index is a python core function that returns the index of the item in a list with that name.
like I could get what I need with select_dtypes but I want to include other columns too
dataframe.columns, should return a list of column names
yea I was trying things w .index but I was trying after I isolated columns based on their dtype
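a note on the .index idea: dataframe.columns is a pandas Index rather than a plain list, and as far as I know the lookup methods on it are get_loc (one name) and get_indexer (a list of names). a sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a"], "age": [1.0], "score": [2.0], "city": ["x"]})

# Positions of several columns at once.
wanted = ["age", "city"]
positions = df.columns.get_indexer(wanted)

# Position of a single column.
one_position = df.columns.get_loc("score")

# The positions can be passed straight to .iloc.
subset = df.iloc[:, positions]

# Combining a dtype-based selection with extra columns picked by name.
numeric_plus = df.select_dtypes("number").columns.tolist() + ["city"]
```

get_indexer returns -1 for any name that is not a column, so it is worth checking for that before using the result with .iloc.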
@dense copper https://repl.it/@maximum__/pct-change-pandas#main.py does this get you close to what you want?
import io
import pandas as pd
txt = """
ticker calendardate revenue
A 2019-12-31 10
A 2018-12-31 9
A 2017-12-31 7
A 2016-12-31 1
A 2015-12-31 3
B 2019-12-31 5
B 2018-12-31 4
B 2017-12-31 3
B 2016-12-31 2
B 2015-12-31 3
"""
data = pd.read_fwf(io.StringIO(txt), header=1)
data['calendardate'] = pd.to_datetime(data['calendardate'])
print(data)
def pct_change_for_ticker(y):
    return y.pct_change(freq=pd.tseries.offsets.YearEnd())
pct_changes = data.set_index('calendardate') \
    .groupby('ticker')['revenue'] \
    .apply(pct_change_for_ticker)
print(pct_changes)
@wise garden you want to select certain columns by dtype?
dataframe.columns, should return a list of column names
@modest rune anyway to pass a list through?
I've got 40+ column names to pass through
sorry, not the columns portion, the index portion
data.loc[["a", "b", "c"], ["X", "Y"]]
something like this? that selects the rows with index labels "a", "b", and "c", and columns "X" and "Y"
@desert oar I'm hoping to use df.iloc
oooo but might be able to use this
so you have lists of row numbers and/or column numbers?
i.e. by position
data.iloc[[3, 2, 12], [45, 46]]
same syntax
data.iloc[:, [45, 46]]
data.loc[:, ["X", "Y"]]
for all rows
Yea I was trying to make it harder by passing the col names that I found into indices and then passing them to .iloc.
data.loc[:, ["X", "Y"]]
data[["X", "Y"]]
pandas also accepts the 2nd line as a synonym for the 1st
i write data[["foo", "bar"]] all the time when using pandas code
yea solid, thx for the help
@modest rune did you solve your problem yet?
Yes! All is good now
how did you fix it?
Hey, pandas issue
My dataframe here is taking the first row as a header row. Can i somehow edit it so it does NOT do that?
nevermind I solved it
Hi, hope everyone is doing fine, I'm not really into data science but I need pandas to do a quick work, I have a CSV file, and I need to make a search on it on a particular column, I could get away with a for loop but I've read that Pandas is more efficient, how can I achieve this?
@subtle tide what do you mean a search?
To be clear the variables are x and z
@solemn cosmos logarithm
Don't worry about efficiency so early. If this csv file isn't massive you can happily use the csv module in python and call it a day.
@velvet thorn , I have a csv file with this header, [name, phone_number, email], and about a thousand row, I want to get the name and phone_number of a particular email
df.loc[df['email'] == the_email, ['name', 'phone_number']]
@ripe forge , ok thanks, I will have that in mind
but yes, a thousand rows is nothing
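a self-contained sketch of that lookup. the rows are made up, and an in-memory string stands in for the real file; with an actual file you would pass its path to pd.read_csv instead:

```python
import io
import pandas as pd

csv_text = """name,phone_number,email
Ann,555-0100,ann@example.com
Bob,555-0101,bob@example.com
"""

# dtype=str keeps phone numbers as text instead of guessing a numeric type.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

the_email = "bob@example.com"
# Boolean mask on the email column, then keep only the two wanted columns.
match = df.loc[df["email"] == the_email, ["name", "phone_number"]]
```

match is a DataFrame with zero rows if the email is not found, so check match.empty before reading values out of it.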
thanks @velvet thorn
yw
how did you fix it?
@velvet thorn
I took advantage of the fact that I knew how the desired array was constructed.
# a : data that I am analyzing
desired = numpy.linspace(a.min(), a.max(), 200)
By storing the 3 values used in the linspace and then later on applying them to values in the right way and converting to an integer, I was able to calculate the proper index, instead of searching for the index. Much much faster.
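my reading of that trick, as a sketch: since the grid comes from linspace(lo, hi, n), the nearest index can be computed by rounding instead of searching. closest_on_grid is a made-up name, and it assumes hi > lo and n >= 2:

```python
import numpy as np

def closest_on_grid(values, lo, hi, n):
    # Spacing between neighbouring points of np.linspace(lo, hi, n).
    step = (hi - lo) / (n - 1)
    # Fractional grid position of each value, rounded to the nearest point.
    idx = np.rint((np.asarray(values, dtype=float) - lo) / step).astype(np.int64)
    # Values outside [lo, hi] are mapped to the first or last point.
    return np.clip(idx, 0, n - 1)

# Same example as earlier: the grid 1, 1.5, ..., 5 is linspace(1, 5, 9).
indexes = closest_on_grid([1.12, 4.77, 3.39], 1.0, 5.0, 9)
```

exact ties halfway between two grid points may resolve differently than argmin does, since np.rint rounds halves to the nearest even integer.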
You must be on the other side of the world because when you asked me:
How did you fix it?
It was 2am-ish and I was asleep.
hey guys I just wanted to ask, is the first node the same as the start node in a Binary Tree?
pls feel free to ping me if you have an answer
@terse ridge this is a better question for #algos-and-data-structs
data science consists of statistics, machine learning, and related topics
ok sorry
umm hey guys i want to ask something about a particular dataset anyone wanna help me?
How can I do the average of the next n rows for every row in a data frame? I have a data frame that looks like:
```Object|Value
A|1
B|2
C|3
D|4
E|5
F|6
G|7
H|8
I|9
J|10
K|11
L|12
M|13```
and I want to average the next 3 rows for every row so the output would be like
```Object|Value|Average_3
A|1|6.3
B|2|8.6
C|3|11
D|4|13.3
... and so on```
I was thinking on doing something like
```df['average_3']=df['value'].apply(lambda x: x.shift(1)+x.shift(2)+x.shift(3))```
However, the n number of rows will not always be the same so I was wondering how I can apply a for loop inside the lambda function and also how will this manage the last n rows since they won't have all the future rows to do the average on?
@fringe dagger it's better to just ask your question, don't "ask to ask"
i prefer to use column names but yes use rolling
how does the training data work here?
like i'm using two different columns to predict a regression model
i actually dont know if you can roll forward though @sterile wyvern
@fringe dagger can you also share your code as text (either using code formatting or a paste site)? it's difficult for some people to read text in images
oh okay never done that and its in notebook format
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
@desert oar need to go through rolling documentation
@sterile wyvern its not in there. but i found this, which is... messy https://stackoverflow.com/a/51619099/2954547
for i in range(0,df.shape[0]-2):
    df.loc[df.index[i+2],'SMA_3'] = np.round(((df.iloc[i,1]+ df.iloc[i+1,1]+df.iloc[i+2,1])/3),1)
alternate for rolling
https://repl.it/@maximum__/pandas-forward-roll#main.py @mellow spruce
k = 3
data['average_3'] = (
data['Value']
.shift(-1).fillna(0)
.rolling(k, win_type='boxcar').sum()
.shift(-k+1)
)
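another option, assuming pandas 1.1 or newer, is the forward-looking window indexer in pandas.api.indexers. a sketch where mean_of_next is a made-up helper and n can be any window length:

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

data = pd.DataFrame({"Object": list("ABCDEFGHIJKLM"), "Value": range(1, 14)})

def mean_of_next(series, n):
    # Window covers the current row and the n - 1 rows after it.
    indexer = FixedForwardWindowIndexer(window_size=n)
    # shift(-1) drops the current row from the window, so it holds the NEXT n rows.
    # min_periods=1 averages whatever rows are left near the end of the frame.
    return series.shift(-1).rolling(window=indexer, min_periods=1).mean()

data["average_3"] = mean_of_next(data["Value"], 3)
```

with these values the first row gets (2 + 3 + 4) / 3 = 3.0, the rows near the end average over the rows that remain, and the very last row is NaN because nothing follows it.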
Is there a simple way to be able to do arithmetic operations on an extremely large number? The documentation says that it should just automatically work for newer versions of python but that doesn't seem to be the case
im making a smart macro that detects if the message sent has been sent twice without a message in between and im not sure how im going to go about the AI for this
ive made the macro thats all done
if anyone has any clue on how to help make sure to ping me, thanks
@merry ridge what error are you getting? how large is extremely large?
@lapis sequoia do you really need AI for that? you can't just compare the two messages for equality (or similarity)?
and how would i go about doing that?
what do you mean? if message1 == message2:
Function messages?
ohhh ok
The upper bound of what I need is about 1e400~
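a guess at what is going on, since the error wasn't posted: python ints have arbitrary precision, but floats top out around 1.8e308, so 1e400 only works if the number stays an int or a decimal.Decimal. a small sketch:

```python
import math
from decimal import Decimal

# Integers: no upper bound, arithmetic stays exact.
big_int = 10 ** 400
int_result = big_int * 3 // 10 ** 399

# Floats: anything past roughly 1.8e308 becomes infinity...
as_float = float("1e400")

# ...and converting a too-large int to float raises OverflowError.
try:
    float(big_int)
    overflowed = False
except OverflowError:
    overflowed = True

# Decimal handles non-integer values at this magnitude.
big_decimal = Decimal("1e400") * Decimal("2.5")
```

note that / on two ints produces a float, so use // or Decimal if the result needs to stay that large.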