#data-science-and-ml
"there is no chess board"
on a serious note
write_json is a blessing to isolate test cases with polars
I wouldn't use polars, spark is the way
sssshhh
Polars may promise multi processing, but spark actually delivers
for my use case polars is absolutely great
i rewrote all my validation and coalescing into expr
and it didnt suck
Yeah I remember some parts being nice, but when it came time to scale it didn't deliver
There were also issues with some of the interfaces
And the IO didn't work as promised
That one killed me >.<
i didnt see any great features in spark for data transformation in the context of what im doing/need to do
It will use all your cores while maintaining ram below 4gb
Spark is not without fault tho, there's some serious issues with memory leaking
I just re create the session when I use it, but it's not ideal cuz of the random split
It's probly something I'm doing wrong tho, somehow
Clearly states java heap
Execution is everything
You can benchmark C to be slower than python
If you don't got the skills for C
not that java is impossible to run anywhere constrained... you could argue that smartcard applets are java bytecode and indeed they are.... but thats besides the point.
writing shit C that performs worse than python sounds like a challenge
whereas writing shit py is arguably trivial
just like js
the entry bar is low
All I'm saying is that the language is not everything if the person writing doesn't do a good job
there are intrinsic/inherent requirements for certain languages that make the mistakes also comparatively more complex. ex. in C memory safety.
I don't even know if java allows interoperability with low level anyway
I don't think that will protect you enough
android is java and you have tons of JNI/native code
most android software protections are actually using native code to interop with java for obfuscation etc
The type system can't predict if you're reading your CSV right
Or if you're making good use of your cores
sure, but thats a different issue
It's not, that's what drove me away from polars
rust forces constraints for a reasonable baseline of safety
Spark is written in Scala apparently
btw ontopic: how can I "pretty print" a df in polars?
the console output scales horizontally
I assume print(df) makes an ugly output with that size
@buoyant vine also where do i send you a case of beer/drinks/whatever?
I wouldn't worry about it
lol
Unless you have a miracle cure for whooping cough lol
i believe in the honorable practice of displaying gratefulness to those who help
i mean, i wouldnt rob a hospital of cough suppressants, but if that pays the moral debt....
There's a context manager that lets you alter the width of the rows
Scala seems appropriate for this stuff
Rust may be the wrong choice as it's more of a systems language

Ask gpt
lol
But it was something Config
well that is an unhelpful answer lol
What am I gonna do, lie ?
humiliate yourself saying 'i dunno'
That's exactly what I said tho
While pointing you to where I got it in the first place
gpt just threw a pandas answer to me
derp
ever polite.
lets hope those fearsome hostile AIs never become a thing. i bet she will remember.
gpt-4 is pretty awful for polars
out of date and also gets quite some things awfully wrong
That's why you cross check with the docs
Or alternatively you just go straight to the docs. Makes stuff much quicker and simpler
This model has like 100M params, gonna take forever to train
shouldn't take too long
Has been working out for me, bad stuff happens when I don't cross check tho
@buoyant vine im missing more code samples
one is 500dim, the other is 1000dim (they are transformers )
@buoyant vine quick q: im using a public csv for testing, and have been teasing the idea of doing the coalescing and grouping entirely via expr:
mapping = {
"CUSTOMER_FORENAME": "first_name",
"CUSTOMER_SURNAME": "last_name",
"CUSTOMER_GENDER": "gender"
}
structured_column = pl.struct({
new_key: pl.col(old_key) for new_key, old_key in mapping.items()
}).alias("person")
How can I retrieve this to create a dict properly containing the keys-values? ex. person : { first_name: ...., }
df.select(structured_column)?
hmm
I think that should work
Or at least I can't see anything wrong with the idea
or it might be a with_columns otherwise
pprint.pprint(df_with_struct.rows(named=True)[0]) still shows the column names though
{'person': {'CUSTOMER_FORENAME': 'JOHN',
'CUSTOMER_GENDER': 'male',
'CUSTOMER_SURNAME': 'DOE'}}
Isn't that what you have defined as your new_key?
if i swap them they obviously dont exist: polars.exceptions.ColumnNotFoundError: first_name
Hello everyone, I am an AI/ML engineer working in the US. We have recently started a discord channel aimed at
- Sharing AI ideas
- Finding project mates for AI projects
- Study groups to learn AI
- Resource sharing
- Networking
This is in its early stages, but if you are one of those who are interested in leading conversations and building a beautiful AI community, join this channel.
Reach out to me if you are interested
What if you rename the columns before hand?
i could but that seems against my ocd tendency towards not touching the original columns
could do a .rename(mapping) before so the columns are correct before they go into the struct
lemme try the rename
@buoyant vine where do i place the rename_fields? can i concat/chain it directly to the pl.struct?
I know for sure its gonna do the thing, but it sure is taking a while
I think you can just do .alias("person").rename_fields(["field_1", "field_2"])
it expects the new fields to be in the order you defined the struct in
.struct.rename... seems to work
yep
structured_column = pl.struct({
new_key: pl.col(original_column) for new_key, original_column in mapping.items()
}).struct.rename_fields(list(mapping.values())).alias("person")
you know you can also do stuff using sql, probly easy for those kinds of operations
easier *
@final kiln i havent got started with the sql side yet
does it go thru the same engine?
exprs seem blazing fast
i mean if all you're doing is renaming stuff
i think its about to do the thing, or am i losing my mind already
@buoyant vine now im rewriting the dynamic expression stuff. the basic things like "if this bool column is set to True, then the field value is foo" does not seem too complicated
i wrote my own sandboxed asteval-like expression engine, but it was horribly slow
I'm just gonna leave it and go enjoy my Saturday ._.
you should
Ah I see the issue tho, it's just super slow
It's still on the third slice after an hour
it was hailing here so ill be cranking out shitcode
is there a way to limit/condition an expression to the presence of a non null value in a specific column?
can combine the expression with a col.is_not_null() expr
i.e. (pl.col(col_name).is_not_null() & other_expr)
ex. if column X is not null and set to boolean true, set a new column FOO to value XYZ
on it
@buoyant vine is it possible to add a new field to a struct without recomposing it?
I dont think so
i suppose then the way to do it is to create an intermediate column
and add it
can map_* be used to do something like what i asked earlier re conditional field values?
it basically is a pandas.apply method
i.e. it gives you the column value, and expects a value returned
what you do inbetween those points it doesn't really care about
the problem is it limits your performance significantly
expr?
@buoyant vine http://pastie.org/p/4yTznKtTuJeKQQLQFm2m2z not getting the phones part to work
dont you want pl.concat_list(list(fields)).unique().drop_nulls().alias(name) rather than doing it after the explode?
lemme check
polars.exceptions.InvalidOperationError: unique operation not supported for dtype list[str]
what if you do .arr.unique?
I changed the LR schedule, increased the warmup period.
I really need to look up the rationale behind the 2017 LR scheduler
It had the opposite effect on the scheduler, but it looks like it improved the situation
Which would be awesome, except that it totally means idk what I'm doing >.>
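If "the 2017 LR scheduler" means the schedule from "Attention Is All You Need", the peak learning rate is tied to the warmup length, so a longer warmup also lowers the peak. A small sketch under that assumption; `d_model = 512` and the warmup values are illustrative, not the settings used here:

```python
def transformer_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


d_model = 512  # illustrative value
for warmup in (4000, 8000):
    # The schedule peaks exactly at step == warmup_steps.
    peak = transformer_lr(warmup, d_model, warmup)
    print(f"warmup={warmup}: peak lr = {peak:.2e}")
```

The peak equals `(d_model * warmup_steps) ** -0.5`, so doubling the warmup divides the maximum LR by about 1.41.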
Ok so.
With max LR of 500e-6 it stayed up there for a long time but I could notice a slight slope downwards + the mini batch loss was becoming more stable.
A smaller max LR (which implies smaller LR throughout), has not changed it very much except that the slope downwards has increased, but still nothing major
hi, need a quick help, i've got an xlsx file that contains these numeric columns, they are float64, now as you can see, they are not very pretty id say, i mean the way they are represented are too long, i tried to change using with as type by doing
df['FF','Rs','Rsh','VOC(mV)','jsc(µA/cm2)']= df['FF','Rs','Rsh','VOC(mV)','jsc(µA/cm2)'].astype(double)
which haven't worked cause of the name of each column, would love to hear some tips and tricks
My intention was to increase max LR to speed up the process. But since the opposite occurred, it means that the model is overshooting the minima.
You want to change the output format, the way the dataframe displays the floats?
id like to change the way each column has its information displayed
Start here, see precision/etc: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
There's more advanced things you can do to render it differently, but that might be enough
@buoyant vine polars.exceptions.ComputeError: expected array dtype
thats good it helps but thing is some of the data is not decimal but more in the millions, and thats why it shows this way
Since I cannot increase LR to speed it up, my only option is to decrease the size of the batch so that the number of weight updates is higher
what is your code rn?
See float_format
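A short sketch of the `float_format` display option, which changes how floats are printed without touching the stored float64 values. The numbers are made up; note also that selecting several columns needs a list, e.g. `df[["FF", "Rsh"]]`, not `df['FF','Rsh']`:

```python
import pandas as pd

# Made-up values: some small decimals, some in the millions.
df = pd.DataFrame({"FF": [0.123456789, 71.5], "Rsh": [1234567.891, 2.5e6]})

# Display only: two decimals with a thousands separator.
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = df.to_string()

print(formatted)
```

`pd.set_option("display.float_format", ...)` applies the same setting for the whole session instead of one block.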
Hello, I need to develop an AI to play the 2048 Game, I have a large experience with Python, I know i have to use Tensorflow but I just need a roadmap for my learning, thanks in advance !
if i saw you irl id kiss you , thank you very much
I'm gonna do the opposite, increase batch size and decrease LR/keep it as it is. Higher batch size = more accurate gradient calculation
taking break, stuck on this one
Batch size of 128 now
It's also possible that it's just not a good idea to do d=1000 and N=12, since that's equivalent to what they used in 2017. And they used a lot more GPU than what I'm using.
I'm gonna let this one roll and start doing from 800
anything out there to scrape or query github?
well, it has an api
and googling pypi github gets me at least one popular library implementing it, even.
hm right, i hope the rate limit doesn't affect me
maybe there's a github dataset of repositories somewhere?
a dataset of what info from the repos?
googling dataset of github repos gets me https://www.kaggle.com/datasets/github/github-repos
one is batch size of 32. the other 128
32 has higher learning rate, 128 lower
one is betting that the gradient calculation is accurate enough, so I just need to double down on them (double the updates and larger steps), the other is betting that the model was overshooting the minima so it needs to calculate more accurate gradients and take smaller steps
wow i suck i should google
thanks a lot!
New approach, I'm starting from 500, which I already saw that converges, then increase it til 1000 in chunks of 100
hey im trying to normalize my data, in the next way (pic) , but the 'B.C' column is string and id like to keep it without applying anything to it, any suggestions?
well, simplest way would be normalized_df["B.C."] = df["B.C."]. :p
the nice way would be to exclude that column from df before doing this stuff
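A sketch of the "exclude that column" route: normalize only the numeric columns and leave `B.C.` alone. Min-max scaling is an assumption here, since the actual method was only shown in the picture, and the data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "B.C.": ["a", "b", "c"],
    "FF": [1.0, 2.0, 3.0],
    "Rs": [10.0, 20.0, 40.0],
})

# Pick the numeric columns; the string column is skipped automatically.
numeric_cols = df.select_dtypes(include="number").columns

normalized_df = df.copy()
# Assumed min-max scaling to the range [0, 1].
normalized_df[numeric_cols] = (
    (df[numeric_cols] - df[numeric_cols].min())
    / (df[numeric_cols].max() - df[numeric_cols].min())
)
print(normalized_df)
```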
im omega pepega lol ty
This helped a lot.
There's something funky going on with my gradient accumulation code. It doesn't look wrong at all tho. I suspect that it is something related to the order of magnitude of the values when I use small mini batches
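One common source of a scale mismatch in gradient accumulation (not necessarily the issue here) is summing micro-batch gradients without dividing by the number of accumulation steps. A NumPy sketch with a toy linear model and mean squared error, all values made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
y = rng.normal(size=128)
w = rng.normal(size=4)


def mse_grad(Xb, yb, w):
    """Gradient of mean((Xb @ w - yb) ** 2) with respect to w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)


full = mse_grad(X, y, w)             # one batch of 128

accum = np.zeros_like(w)
micro_batches = 4                    # 4 micro-batches of 32
for Xb, yb in zip(np.split(X, micro_batches), np.split(y, micro_batches)):
    accum += mse_grad(Xb, yb, w)

summed = accum                       # micro_batches times too large
averaged = accum / micro_batches     # matches the full-batch gradient
```

With equal-sized micro-batches, only `averaged` matches the gradient of the large batch, so the effective step size changes if the division is missing.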
will two prompts with the same tokens in different positions have different vector embeddings?
I believe so, yes: on top of the token embedding, a positional encoding is then added.
ah right that makes sense
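To make that concrete, here is a toy sketch of the per-token inputs to a transformer, assuming the sinusoidal encoding from the original Transformer paper; the vocabulary size, `d_model` and the prompts are made up:

```python
import numpy as np


def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding, shape (seq_len, d_model); d_model even."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe


rng = np.random.default_rng(0)
d_model = 8
embedding = rng.normal(size=(5, d_model))   # toy vocabulary of 5 tokens

prompt_a = [1, 2, 3]
prompt_b = [3, 2, 1]                        # same tokens, different order
pe = sinusoidal_positions(3, d_model)
input_a = embedding[prompt_a] + pe
input_b = embedding[prompt_b] + pe
```

Token 1 sits at position 0 in `prompt_a` and position 2 in `prompt_b`, so its input vector differs between the two, while token 2 (same position in both) is identical.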
Can anybody help with this error
Hey guys I am trying to run GMMHMM model for regime detection on a time series. But I am not sure about the amount of clusters distribution I need. I remember from normal ML for K means I can use metrics like WSS and other methods based on the centroids. But now I am working with Gaussian distributions here what metrics can I use? I am thinking maybe KL or JS divergence but are these really a good metrics to use?
can anyone help mentor me for my ml journey
I keep giving up easily relying only on myself
pip install pandas ?
Weird fix, had two instances of Python, but thank you
Jep that sounds about right
I completely understand the struggle with giving up easily. While I'm not at the point of being a mentor to someone, I'm also on this ML journey and I've found this roadmap helpful: [https://i.am.ai/roadmap/#note] It includes specific steps and resources that helped me stay motivated when I felt stuck. Perhaps some of them could be useful for you too!
Thanks!
100% bookmarked that
I've checked the commit hashes, compared them using github, there's no code differences
800 and 900 are missing because celery failed, I need to implement a circuit breaker and a timeout thing
This is turning out to be random. I need to step back and reflect on why this is random and how to make it not be random. Otherwise I won't get anywhere with this.
I'm reshuffling the batches to prevent the network from capturing any patterns coming from the order in which it sees the sequences
My hypothesis is that the small batch size is at fault.
The way it is, it might be rolling the dice until it finds a sequence of batches that happens to accurately represent the gradient landscape. When it finds enough of them in a row, it finds the direction towards the local minimum; from there I suppose the slope is large and all directions funnel towards the minimum, so the loss starts decreasing sharply. Once that is no longer the case, the situation regresses to where it was but at a lower level, which would explain why they all converge to more or less the same value.
Guesswork is no good here. Since this setup is highly efficient memory wise, I can fit the transformer from 2017 and possibly their batch configuration. There's not gonna be a one to one correspondence but I can calculate the information content on each batch from 2017 and try to match it in my batches. The model hyper parameters are more or less the same too, especially if the MetaFormer stuff translates to NLP.
Hello everyone; let's say you want to train a dataset, where can I find the data instead of creating it from scratch?
Hey, I came across this today, does it help? https://insar.dev/
It's focused on interferometry, you didn't say what exactly you are doing with SAR
Hi. Question about variational autoencoders. Is the main idea that during encoding, latent space is sampled from multidimensional distribution which is shaped by parameters obtained from input data?
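Roughly yes, in the standard formulation: the encoder outputs the parameters of a distribution (usually a mean and a log-variance per latent dimension), and the latent vector is sampled from it via the reparameterisation trick. A NumPy sketch with made-up encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


# Pretend these came out of the encoder for one input (hypothetical values).
mu = np.array([0.5, -1.0])
log_var = np.array([-2.0, 0.0])

# Many samples for the same input scatter around mu with spread sigma.
samples = np.stack([sample_latent(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```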
Hey! I tried to fit the flux of cosmic rays versus energy for AllParticles&H. I used the CRDB package to extract the data. Then, i used the power law to fit. However, the shape of x is (1236,) and the error in y is (1236,2). how do i fix the shape issue?
Here is my code:
x = t_combined.e
y = t_combined.value
err = t_combined.err_sta
lsq = LeastSquares(x, y, err, power_law)
m = Minuit(lsq, a=1, gamma=-2.0)
plt.errorbar(x, y, err, fmt="o", label="data")
plt.plot(x, (x, *m.values), label="fit") # what does this line do?
ax.scatter(x, y, label="Combined original data", marker="x")
a_fit = minuit.values.a
gamma_fit = minuit.values.gamma
x_fit = np.logspace(np.log10(t_combined['e'].min()), np.log10(t_combined['e'].max()), 100)
y_fit = power_law(x_fit, a_fit, gamma_fit)
ax.plot(x_fit, y_fit, label="Fitted power law", linestyle='--', color='red')
plt.xlabel(r"𝐸𝑘 [GeV]")
plt.ylabel(r"𝐸𝑘 d𝐽/d𝐸𝑘 [1/(m2 s sr)]")
plt.title('Power Law Fit')
plt.legend()
plt.xscale('log')
plt.yscale('log')
plt.show()
print("Fitted parameters (a, gamma):", m)
plt.show()
As mentioned in the help post, please format your code with markdown to make it easier to read
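On the shape question itself: a least-squares cost expects one error per point, so a `(N, 2)` error array has to be collapsed to `(N,)` first. A sketch assuming the two columns are lower and upper errors (worth checking against the CRDB docs); the numbers are stand-ins for `t_combined.err_sta`:

```python
import numpy as np

# Stand-in for t_combined.err_sta: one (lower, upper) pair per point.
err = np.array([[0.1, 0.3], [0.2, 0.2], [0.05, 0.15]])

# Simplest symmetric approximation: average the two sides per point.
err_sym = err.mean(axis=1)

print(err.shape, "->", err_sym.shape)
```

`err_sym` then has the same shape as `x` and `y` and can be passed where `err` was used before.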
behold, mah pipline
it now uses pull requests to centralize note taking
so I open a PR, it automatically detects if it's an experiment, creates it in mlflow, when I merge it starts running it
Check Kaggle.com
the AI bubble blew up ? looks like there's nothing new in huggingface.co for a long time
maybe there's just too many piglets for the teats
so is polars just generally faster than pandas
or is it situational
may I try to convert you to the religion of Spark
What's the best low pc computing cost object detector? I want create my own security cam (only a detector of people)
I think mediapipe has something of the sort
Now I can use the PRs as logbooks for each experiment. And they can refer to each other and all that stuff so everything is gonna be neatly organized
Hah neat, It has come up during the week! I'm not sure yet if it suits my application, but it does InSAR. Thanks!
any polars guy around?
my data isn't bigger than memory so what would spark do for me
Lets you use all your cores, afaik polars doesnt do that
Polars is written from the ground up with performance in mind. Its multi-threaded query engine is written in Rust and designed for effective parallelism. I
pandas is also that way afaik
The docs can say what they wanna say, but it don't do it
I mean ig the lib wasn't even able to read my data so idk
All I know is spark took my kaggle and got CPU up to 300% with no effort from my part
While polars wasn't able to read a basic CSV in lazy mode thing
Generally, yes.
I will risk and say always cuz I've never seen pandas be fast in my life
pandas is faster than pyspark lol
if it's a small dataset
the cost of spinning up the workers is not worth it oftentimes
Ig if the dataset is small I'll be using python constructs
It's way more performant than pandas
Maybe in some specific single-threaded cases the Numpy operations used are faster in their C/Fortran implementations than those written in Rust (unlikely, even a simple loop with optimizations (auto-vectorization) enabled in LLVM will be fast).
If you're using numpy
by small dataset I mean smaller than RAM
or your RAM budget
I still wouldn't use a dict for nested indexes or anything like that
Why do you find the spark init so expensive tho, the memory management alone makes it worth it
because you don't need memory management if the data is smaller than your RAM budget?
It's nice if you can cap it at 4gb at will
But yes, you still need management
for what purpose
For the purpose of having memory for your other stuff
my 2 cents is that polars is generally faster than pandas (pandas<2 for sure, less so for pandas>=2)
but i don't think time series support in polars is really fully there yet (if you care about those stuff, iirc you deal with finance so i thought this would be relevant)
Memory leaking and etcs too ofc
If you want performance
You think memory management
that is not what I think at all
Then you're thinking wrong
when I think performance I think using all the memory I have
because that's the axiom of computer science
Idk none of that, all I know is that performance is about where you put your memory and how you lay it out
And pandas and polars are not the tool for that
the tradeoff between using memory and not using memory is that you use the memory to get things done faster
im not worried about constraining memory usage
I think you're arguing against a straw man
you said i need memory management but in fact that stuff has an overhead
and i was asking for performance
Memory constraint is not the only thing in memory management
If you want true performance I suggest using Cython to do interoperability with C
not willing to go that far
As the code will be specifically made for your use case
Then numpy and polars are the next best thing afaik
Ah and spark ofc
whats the problem with time series in polars?
Spark being better since it makes better use of resources
I'm not "against" spark but adding dependencies especially ones that need separate runtimes has a cost
Execution is everything
If it's a well done thing, you don't care
That's my take at least, I was very impressed with it
Why is there any issue here? Just download and try all of them on your data. Measure it.
I second this, best thing is always to measure
I would have to rewrite a lot of code and I'm not sure if it's the best use of my time rn vs other things I need to write for this project
Otherwise we just arguing about the size of the angels wings
so I was trying to figure out as much as I could from ppl who have used both
If it may end up making the difference between something like 1 hour or 10 hours of training (or whatever you are doing), yeah, probably worth.
that's what I'm trying to figure out.. lol
if it's like 5% faster
then there's no point right now
hey guys
To know that you need to know if you are compute bound, memory bound, IO bound.
I have a question
can someone help me to debug my implementation of DNN and backprop from scratch? i cannot provide any more information about the problem im having in the code because idk where the problem is
my goal is to write DNN using only numpy, but after i finished the implementation, my implementation of DNN just doesnt learn
i have been debugging for a few days and i couldnt find the problem
please dm me and ill send you the code
it's bound by different resources at different stages. the ultimate bound is the CPU-GPU mem interface
at some stages it's bound by my shitty python code
at others by pandas implementations
but CPU bound in general
CPU bound is not a thing, compute bound or memory bound is.
# here we initialize a random data matrix X and random numerical labels y
import numpy as np
X = np.random.randn(10,3)
y = np.random.randn(10,1)
# we also initialize a hypothetical hyperplane defined by w and b
w = np.random.randn(1,3)
b = -1
# (i) find the numerical labels predicted by the model (w,b) for the points in X
# your code should be a single numpy line
# hint: we wrote this equation for a single point x in class
# try to generalize it by expressing everything in terms of matrices
# your code goes here
y_predicted = np.dot(X, w.T) + b
print(y_predicted)
# (ii) find the updated weights after one application of gradient descent with lr = 0.1
# your code should be a single numpy line
y_ = np.random.randn(10,1)
w_updated = w - 0.1 * np.dot((y_predicted - y_).T, X)
print(w_updated)
What is your question
i think my old gripe was just there is no groupby rolling and/or the interface was fairly clunky for my specific usecase - this is probably fixed, i can't recall my exact issue
my latest gripe is that ewm_mean in polars doesn't take a times like in pandas
https://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean was using it like this
so it's a reinforcement learning environment
I know, it's not super useful in optimization.
the environment itself is computationally complex and run on the CPU
I used y_ nevertheless
the GPU is doing the actual network
and it's bound by the CPU-GPU memory interface
What is y_ ?
but I'm also reading/writing with DB
I think you gotta use y right ? Since that's your data
Compare y_predict with y and apply grad desc
It may be to simulate the validation step idk
So you load your data in your RAM, since you can fit it all. What are you doing with it? Are you even running any Pandas operations on it?
My love for numpy is undying
Ok, what kind of operations are you doing on the CPU on that data?
I could have the data in an in-memory database instead of postgres
I assume you just fetch all of it, so it's all in main memory (RAM).
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
here's an example: https://paste.pythondiscord.com/5JNA
don't mind the asserts I'm still cleaning it up
I just wanted to quiet the typechecker without comments all over the place
this doesn't really make sense, y is not used anywhere here at all. which parts did you write and which were already given?
here what? what am i looking at
the question
Ig they want to apply gradient descent with y_
Stuff like concatenate is a performance red flag.
why
Memory allocations.
Memory management
so it's copying everything?
doesn't look like y is used anywhere, then
Yes, and allocating new memory / a chunk of it from heap.
yeah
isn't that what Spark does too?
Can't you ask the professor
If you allocate in a loop on the heap, all performance goes out the window.
for y_predicted I used this line of code
y_predicted = np.dot(X, w.T) + b
@wooden sail
Like it's becoming a matter of interpretation
so are you saying I should make a big numpy array first
and then assign the individual elements
Not sure what it does, but normally you would either just avoid allocation entirely (probably not needed), or if you really need to, a fast memory arena (arenas are often used by big fast data projects (probably Spark does)).
A memory arena basically just being a pre-allocated chunk that is O(1) to allocate on.
```
y_ = np.random.randn(10,1)
w_updated = w - 0.1 * np.dot((y_predicted - y_).T, X)
print(w_updated)
```
for the second part of the question
so should I just make a big numpy array with the shape of the output
and then assign the elements
rather than concatenating
Yes, that helps.
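A small sketch of the two approaches being compared: growing an array with `np.concatenate` in a loop versus allocating the output once and filling it. The function names and chunk shapes are made up for illustration:

```python
import numpy as np


def build_by_concatenating(chunks):
    out = np.empty((0, chunks[0].shape[1]))
    for chunk in chunks:
        # Each call allocates a new array and copies everything so far.
        out = np.concatenate([out, chunk])
    return out


def build_preallocated(chunks):
    rows = sum(len(c) for c in chunks)
    out = np.empty((rows, chunks[0].shape[1]))   # single allocation
    start = 0
    for chunk in chunks:
        out[start:start + len(chunk)] = chunk    # copy into place
        start += len(chunk)
    return out


chunks = [np.full((3, 2), float(i)) for i in range(4)]
a = build_by_concatenating(chunks)
b = build_preallocated(chunks)
```

Both return the same array; the second one needs the total size up front but does one allocation instead of one per iteration.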
what do you think? @wooden sail sounds good to me but Im not sure why we dont use y at all
because the code you were given is poorly written
lol
Any way you can avoid memory allocation. Without that, you have no chance at fast speeds in a tight loop.
That's why asking the person who wrote it is the best option
Be careful with those things when using for learning.
Note that with something like polars, since it chains operations together and does them all together, it can avoid many allocations that you would have to do if you had to do it step by step in something like Pandas.
i can't comment on the gradient since the cost function isn't written there
Consider something like np.sum(a + b) in Numpy. Numpy has to run its elementwise addition, and then sum on that. That is looping over all the elements twice. But something that chains operations together like polars can just do that in a single loop.
```c
int sum = 0;
for (int i = 0; i < N; ++i) {
    sum += a[i] + b[i];
}
```
So even though it may be nicely vectorized and whatever in numpy, it's still just doing more work.
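NumPy itself cannot fuse the two loops, but the temporary allocation can at least be avoided. A sketch with made-up arrays, showing the plain version, a reused scratch buffer via the `out=` argument, and an algebraic rewrite:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1_000)
b = rng.normal(size=1_000)

two_pass = np.sum(a + b)       # allocates a temporary the same size as a

scratch = np.empty_like(a)     # allocated once, reusable across calls
np.add(a, b, out=scratch)      # writes into scratch, no new temporary
reused = scratch.sum()

no_temp = a.sum() + b.sum()    # same value up to rounding, no temporary
```

All three still make two passes over the data; only an engine that fuses the operations gets it down to one.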
is there a place where i can upload python notebook to share it?
not using least squares cost?
I think it's least squares, I recall him showing a slide from a lecture
You are right and I may very well be doing a type of InSAR as well. It seems I'll be doing Pol-SAR as well
here's my python code
https://paste.pythondiscord.com/RCBA
(This is why libraries that build compute graphs can optimize better, they don't do unneeded work)
Tensorflow ftw
(This is especially important for reducing allocations)
1. y_predicted
If you meant y_predicted, it represents the predicted output of the model for the given data matrix X. After calculating y_predicted = np.dot(X, w.T) + b, this variable holds the model's predicted values for each corresponding sample in X.
2. y_
If you actually did mean y_, it seems to be an unused and undefined variable in the provided context. Perhaps it was meant to serve a purpose but got left in the code without assignment.
Explanation
Here's a breakdown of what typically happens in a simple linear regression model, as seems to be intended in your code:
y: This variable holds the true labels or target values for each data point in your dataset.
y_predicted: This would store the model's predicted labels after being calculated using the weights w, bias b, and the data X.
Model Update: y_predicted and y are then used to compute the errors in the model's predictions and to update the weights w through gradient descent.
Let me know if you have a specific portion of the code where you're seeing y_ and I can provide a more tailored explanation!
# here we initialize a random data matrix X and random numerical labels y
import numpy as np
X = np.random.randn(10,3)
y = np.random.randn(10,1)
# we also initialize a hypothetical hyperplane defined by w and b
w = np.random.randn(1,3)
b = -1
# (i) find the numerical labels predicted by the model (w,b) for the points in X
# your code should be a single numpy line
# hint: we wrote this equation for a single point x in class
# try to generalize it by expressing everything in terms of matrices
# your code goes here
y_predicted = np.dot(X, w.T) + b
print(y_predicted)
# (ii) find the updated weights after one application of gradient descent with lr = 0.1
# your code should be a single numpy line
y_ = np.random.randn(10,1)
w_updated = w - 0.1 * np.dot((y_predicted - y).T, X)
print(w_updated)
what does y_ do here
In this code, y_ serves as a source of randomness or noise, and it does not directly contribute to the linear regression process. Here's why:
Gradient Descent Calculation:
Let's focus on the line where you update the weights using gradient descent:
Python
w_updated = w - 0.1 * np.dot((y_predicted - y).T, X)
Notice that in this expression, you are using the following:
y_predicted: The values predicted by the current model.
y: The true labels for your data.
X: The data matrix itself.
The calculation doesn't involve y_. The goal of gradient descent is to update the weights w to improve the model's accuracy. This improvement is based on the difference between the true labels (y) and the model's predictions (y_predicted).
Possible Scenarios for y_:
Typo: It's very possible that y_ is the result of a typo, and the intention was to reuse y (the true labels) instead.
Experimentation: The code might be intended for experimentation. Perhaps the developer had the idea of introducing random noise during the gradient descent step, although this is not a standard practice in a typical linear regression context.
Key Point: When performing linear regression with gradient descent, the key elements for weight updates are the true labels (y), the predicted labels (y_predicted), and the data matrix (X).
Let me know if you'd like me to analyze a different section of code or explore more advanced variations on gradient descent!
I asked Gemini
Im gonna use y instead of y_
you need to ask your lecturer because the way it's written, y is not used anywhere
R u in the US, idk if Gemini ultra is already the Gemini latest in the EU
what cost function are you using?
yes you are
he just wants one line of code there
you are using one, otherwise there is no gradient to speak of
It's the square function from one of your lectures
lr 0.1?
0.1*dy/dw, 0.1 is the dw
In the numerator, not the other one
calculus of variations anyone
.latex the way you have written it, in column vector form, would be
\[
\bm{y} = \bm{Xw} + \bm{b}
\]
for which the gradient, assuming a least squares cost of the form
\[
\Vert \bm{y} - \bm{Xw} - \bm{b} \Vert_2^2
\]
is
\[
g(\bm{w}) = 2(\bm{X}^T\bm{Xw} - \bm{X}^T(\bm{y} - \bm{b}))
\]
which you'd then scale by 0.1
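A quick numerical check of that gradient, assuming the least squares cost as written (the exercise itself never states the cost). It uses a column-vector `w`, random data in the same shapes as the exercise, and central finite differences as an independent reference:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=(10, 1))
w = rng.normal(size=(3, 1))    # column vector, matching the formula
b = -1.0


def cost(w):
    r = y - X @ w - b
    return float(np.sum(r ** 2))


def grad(w):
    return 2.0 * (X.T @ X @ w - X.T @ (y - b))


# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (cost(w + step) - cost(w - step)) / (2 * eps)

w_updated = w - 0.1 * grad(w)  # one gradient descent step with lr = 0.1

# The exercise's one-liner (written for a row-vector w), transposed here.
one_liner = ((X @ w + b) - y).T @ X
```

Under this assumption the one-liner `np.dot((y_predicted - y).T, X)` is exactly half of `grad(w)`, i.e. it is the gradient of the same cost scaled by 1/2, so which one is "right" depends on the cost used in class.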
gradient means you took the derivative of something. what did you take the derivative of?
this is an exercise
because what you got does not match a least squares function. what was it instead, then?
then it doesn't make sense
i can't say "find the derivative" and not tell you what to take the derivative of
it says there you wrote the equation in class, so the answer is in your notes from class
i'd wager the mistake is defining a random y_ instead of using y.
there's more issues than that. since y is also random, there is additionally no ground truth for w and b
Invalid index type "tuple[slice, ndarray[Any, dtype[bool_]]]" for "_LocIndexerFrame"; expected type "slice | ndarray[Any, dtype[integer[Any]]] | Index[Any] | list[int] | Series[int] | Series[bool] | ndarray[Any, dtype[bool_]] | list[bool] | Callable[[DataFrame], slice | ndarray[Any, dtype[integer[Any]]] | Index[Any] | list[int] | Series[int] | Series[bool] | ndarray[Any, dtype[bool_]] | list[bool] | list[<nothing>]] | list[<nothing>] | tuple[slice | ndarray[Any, dtype[integer[Any]]] | Index[Any] | list[int] | Series[int] | Series[bool] | ndarray[Any, dtype[bool_]] | list[bool] | list[<nothing>] | tuple[Index[Any] | Series[bool] | ndarray[Any, dtype[bool_]] | list[bool] | str | bytes | date | datetime | timedelta | datetime64 | timedelta64 | bool | int | float | Timestamp | Timedelta | complex | list[Any] | slice | tuple[str | bytes | date | datetime | timedelta | datetime64 | timedelta64 | bool | int | float | Timestamp | Timedelta | complex, ...], ...] | Callable[..., Any], list[<nothing>] | slice | Series[bool] | Callable[..., Any]]"Mypyindex
thanks mypy!
this is pretty contrived
the question is called Linear Regression with numpy 1liners
standard linear regression is based on least squares
go to your notes and find the cost function that was used
isnt this a simpler expression, 2(y - ...) X or something
that's just generating some example data, I believe. as in, that part of the code comes with the problem statement.
which is then not used, and instead a new y_ is drawn
yup, y_ shouldn't exist
which is fine, X has full rank with high probability
I think its probably a typo
Im using y
y_predicted - y
y_ doesnt make any sense there
the same expression with X^T factored out
^typechecking errors in any program I write be like
and i like it this way
ah didnt see that
neither makes sense, you can use whichever and it will work because of how the problem is written
i'd use y_ because it's what they put in the code block, but you have to ask them
and go verify in your notes whether it's least squares, cuz your gradient looks wrong
I'm starting to think that typecheckers just can't handle get item
at least not with pandas
Reminds me of earlier C++ errors with templated types.
ok, so it's least squares and you're absorbing the scaling factors into the 0.1
yup
what I dont understand is
this is not a linear regression
I dont see any fitting or nothing
feels stupid
yes there is
where is the fit function?
L
oh the learning rate?
it failed
r u writing latex manually
wdym manually
so should I use y_ or y?
this has nothing to do with your previous question
without at least real time feedback of the result
there's a nice web app thing for it
what? lol
.latex "regression" is another word for "fitting" or "finding parameters". you're doing gradient descent on the function
\[
L = \frac{1}{n} \Vert \bm{y} - (\bm{Xw} + \bm{b}) \Vert_2^2
\]
there we go
edd, crunching his bones: back in my day we had to write latex on a piece of paper, and get it right the first time
this function L is what you're minimizing. and you're doing so by tuning w via gradient descent
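As a rough sketch of what "minimizing L by tuning w" looks like in numpy: the data, the ground truth `w_true`, the fixed bias and the number of steps below are all made up for illustration, only the 0.1 learning rate comes from the chat. The loop body is the one line the exercise is after.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])      # hypothetical ground truth
b = 0.3                                  # bias kept fixed for simplicity
y = X @ w_true + b + 0.01 * rng.normal(size=n)


def loss(w):
    """L = (1/n) * ||y - (Xw + b)||_2^2"""
    r = y - (X @ w + b)
    return (r @ r) / n


w = np.zeros(d)
lr = 0.1
losses = []
for _ in range(500):
    y_predicted = X @ w + b
    grad = (2 / n) * X.T @ (y_predicted - y)   # dL/dw
    w = w - lr * grad                          # the gradient descent update
    losses.append(loss(w))
```

Because `y` here is generated from a known `w_true`, there is a ground truth to recover, which is exactly what is missing when `y` is drawn at random.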
HTML LaTeX equation editor that creates graphical equations (gif, png, swf, pdf, emf). Produces code for directly embedding equations into HTML websites, forums or blogs. Images may also be dragged into other applications like Word. Open source and XHTML compliant.
no paper writer worth their salt uses that
use y_, and more importantly, go review your course material
thats all the course material
cuz it sounds like you aren't grasping the key ideas
what's wrong with it, i'd say it's quite handy
I dont see any mention of y_ in my notes
it's just slower. i'll just write a large chunk of raw tex and compile it later
because it has nothing to do with the rest of the problem
i keep telling you, the code you were given is not consistent
you could use either y or y_ and it will work
this is just a parameter you evaluate into the loss function L
i assure you the person grading won't care either, but since they went through the trouble of making y_ in the new cell, just use that
yes
that's the same as i wrote above
the sum of squared errors, which you then minimize to achieve the "least" value
hence "least squares"
a is the same as w in your task
yup
what is the question
the real question is, why do they insist on using row vectors. yuck
#data-science-and-ml message here @final kiln
no, pytorch bad cuz row
your prof is asking you to apply gradient descent on y_
I'd guess it's a lot more intuitive for most people to think of tables as a set of rows stitched together than to think of it as columns where each index represents a different individual
okay then y_predicted - y_
yeah once it gets to several dimensions row is easier to think about
like (x, y, z, d, v, c) shapes
that completely throws away all of the power of linalg, hopefully they at least think of the rows as spanning a vector space still
linalg is agnostic to this tho
it honestly makes no difference as long as you're consistent and keep in mind your fundamental vector spaces
but math books canonically use column vectors, so
either one will work, you can leave a note saying you were unsure which one to use, but both look equivalent
If our given data set is linearly separable, does the same hold true for the transformed set? In the following cells you can plot a transformed version of the Iris dataset, so that you see how it behaves (for your choice of the transformation parameters.) But you should also try and justify your answer in a theoretical way: if there exists a 'good' perceptron for the original data set, what would be the weights for the perceptron that works on the transformed set? Are there any issues that might arise?
I answered this question using linear algebra
linalg would be the way
rank-reducing transformations will give you a nontrivial kernel
only for full rank transformations
well you see, an (n,m) matrix consists of n column vectors, the ith one obtained by matrix[i, :].reshape(-1,1),
he wanted us to use chatGPT for this question as well
guys, i need help with finding out why my implementation of dnn is not learning. i have been trying to debug for 3 days and im slowly going insane
Here's my code https://paste.pythondiscord.com/RCBA
notebook version: https://colab.research.google.com/drive/1R4tpsRi4gHXrAcUU9zGRj76FQBt9LJtX?usp=sharing
But I assume linear transformations dont cause loss of linear separability? @wooden sail
eh, i thought academia was afraid of new tech
they do, if they're rank deficient. that's what i'm telling you
go ahead and try to separate your data if T is the zero matrix
oh hsit yeah
the same will be true for any T that is not full rank
well, there's a discussion to be had about domains, kernels, and pre-images
this is exactly what i meant about your fundamental subspaces
you can play with the rank-nullity theorem or the fundamental theorem of linear algebra here
things get a little bit more tricky for nonzero b because it becomes an affine transformation, but the spirit of the discussion is the same
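A small numpy sketch of the argument, with made-up data and a made-up transformation (`T`, `t`, `w`, `c` are all assumptions for illustration). For an invertible T and x' = Tx + t, the perceptron with weights w' = T^{-T} w and bias c' = c - w'·t produces the same scores on the transformed points. For the zero matrix every point lands on t, so nothing can separate them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Points labelled by a known perceptron, so they are separable by construction
X = rng.normal(size=(100, 2))
w = np.array([1.0, -2.0])
c = 0.5
scores = X @ w + c
labels = np.sign(scores)

# Full rank (invertible) affine transformation x' = T x + t
T = np.array([[2.0, 1.0],
              [0.0, 3.0]])
t = np.array([1.0, -1.0])
X_new = X @ T.T + t

# Weights for the transformed set: solve T^T w' = w, then shift the bias
w_new = np.linalg.solve(T.T, w)
c_new = c - w_new @ t
scores_new = X_new @ w_new + c_new

# Rank deficient case: the zero matrix collapses every point onto t
X_collapsed = X @ np.zeros((2, 2)).T + t
n_distinct = len(np.unique(X_collapsed, axis=0))
```

Any T that is not full rank has a nontrivial kernel, so points that differ only along the kernel become indistinguishable in the same way, just less dramatically than with the zero matrix.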
Arent you a math wiz
you wanna see how stupid my DSA assignment is?
@wooden sail
linear algebra has always hurt my head tbh
i want to sleep
Option 1: all arriving passengers are placed in a single queue, and service stations take passengers from the front of that queue.
Option 2: each service station has its own queue, and arriving passengers are dispatched to a queue according to one of many policies:
2.A: round robin (1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, ...).
2.B: arriving passenger is placed in a shortest queue.
2.C: arriving passenger is placed in a random queue.
Inputs to the simulation:
The duration of the simulation measured in minutes (D: make it arbitrarily long, do not worry about it being or not being realistic).
The average arrival rate measured in minutes (A: arrivals are random, but on average there is one new passenger every A minutes),
The average service rate measured in minutes (S: service rates are random, but on average they need about S minutes of service).
For the sake of this study, make sure to crowd the system, by choosing S >> 5*A, without causing an overflow of your queues. Also choose D to be long enough to get rid of any transitory effects.
Outputs of the Simulation for each queuing policy:
The duration of the simulation (which may be longer than the input parameter, as when check-in closes, there may be passengers in the waiting queues and service stations).
The maximum length of the queue for each queue.
The average and maximum waiting time for each queue.
The rate of occupancy of each service station (percentage of time each station was busy).
If you want: show the real-time evolution of the queues during the run-time simulation.
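For what it's worth, Option 1 fits in a few lines of standard library Python. This is only a sketch under assumptions: exponential inter-arrival and service times (the M/M/c flavour), five stations, and S chosen a bit below 5*A so the queue stays stable. The function name and the returned keys are made up, and the maximum queue length is left out to keep it short.

```python
import heapq
import random


def simulate_single_queue(duration, mean_interarrival, mean_service,
                          stations=5, seed=0):
    """Option 1: one FIFO queue feeding `stations` service stations."""
    rng = random.Random(seed)

    # Generate arrivals until check-in closes at `duration`.
    arrivals = []
    t = rng.expovariate(1.0 / mean_interarrival)
    while t < duration:
        arrivals.append(t)
        t += rng.expovariate(1.0 / mean_interarrival)

    # Heap of (time the station becomes free, station id).
    free_at = [(0.0, s) for s in range(stations)]
    heapq.heapify(free_at)
    busy = [0.0] * stations
    waits = []
    end_time = duration

    for arrival in arrivals:
        # The passenger at the front of the queue takes the earliest free station.
        station_free, s = heapq.heappop(free_at)
        start = max(arrival, station_free)
        service = rng.expovariate(1.0 / mean_service)
        finish = start + service
        waits.append(start - arrival)
        busy[s] += service
        end_time = max(end_time, finish)
        heapq.heappush(free_at, (finish, s))

    return {
        "end_time": end_time,
        "passengers": len(waits),
        "avg_wait": sum(waits) / len(waits) if waits else 0.0,
        "max_wait": max(waits, default=0.0),
        "occupancy": [b / end_time for b in busy],
    }


result = simulate_single_queue(duration=10_000, mean_interarrival=1.0,
                               mean_service=4.5)
```

The Option 2 policies would need one queue per station and a dispatch rule at arrival time, so they do not reduce to a single heap like this.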
try linear algebra with complex numbers
I asked teacher if I could use Markov Queues from Queue Theory
M/M/c
He didnt even know what that was
where does that say markov queue though
Its a markovian model
everything circles back to markov
glad I don't need to know that
where are markov chains used in ai ?
its an implementation of markov chains on queues
theyre used in reinforcement learning
oh, havent gotten to that stuff
yup
state-action-reward-state-action-reward...
makes sense
the learning curves in reinforcement learning go up because we are trying to maximize rewards instead of minimizing losses
I like to joke that it's the optimistic branch of ML
yeah it's a sort of random walker right
yeah
Andrey Andreyevich Markov (14 June 1856 – 20 July 1922) was a Russian mathematician best known for his work on stochastic processes. A primary subject of his research later became known as the Markov chain. He was also a strong, close to master-level chess player.
Markov and his younger brother Vladimir Andreevich Markov (1871–1897) proved the M...
the way it fits is random exploration of states
but it has what is called a policy gradient
so it's partially random in that the states it explores are randomly generated, but it follows the policy gradient to improve the policy, which is the function that maps states to actions so as to maximize reward
a | s -> R
interesting
so you have a state, and you throw a displacement at random and see if that improves the reward
so it's proposal based right
like, if a certain displacement doesnt work you throw it away
yeah it learns to make the right action given the state to optimize the rewards
not just for one step but for an entire episode of steps
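To make "policy gradient" a bit more concrete, here is a minimal REINFORCE sketch on the simplest possible problem: no states, two actions, fixed rewards. Everything here (the reward values, learning rate, step count and function names) is an assumption for illustration. A real setup has states, episodes and discounted returns, which this leaves out.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(rewards=(0.0, 1.0), lr=0.1, steps=2000, seed=0):
    """REINFORCE on a stateless two-action problem."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)      # action preferences (policy parameters)
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current stochastic policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # d/d theta_k of log pi(action) is 1[k == action] - probs[k]
        for k in range(len(theta)):
            indicator = 1.0 if k == action else 0.0
            theta[k] += lr * reward * (indicator - probs[k])
    return softmax(theta)


final_probs = train_bandit()
```

The exploration is random because actions are sampled, but each update pushes probability toward actions that were followed by reward, so the policy ends up strongly preferring the second action.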
are you studying Data Science? @agile owl
yea sure
nice that was my undergrad are you enjoying it
sure
I already got a degree in it they just didn't cover reinforcement learning very much so I'm self-studying it now
i wonder what other mundane things mathematicians have turned into fields
I think reinforcement learning is really a lot more exciting than static learning problems
that's how they get robots to walk etc.
the reals
the thing that drew me to ML was llms, really fascinated by them
I would be surprised if chatgpt didn't use some kind of reinforcement learning
no like, is there a sock theory
it does
I think every time you tell it it did something wrong it learns from that
google "RLHF".
plus side is you get to use your CPU and GPU at the same time
yeah I remember this
reinforcement learning to the moon
I need to learn more about the implementation of these models instead of using sb3 though
I want to try to reimplement SAC in rust from scratch using their version of torch
rust has torch ?
python is awesome but when the project starts growing I feel the need for type safety
mypy dont cut it cuz a lot of libs dont have stubs
I also like NLP @agile owl
I like NLP too but more for quantifying things like sentiment than generation
cool!
I wrote a data and training pipeline in github actions
I did a CNN implementation once during my undergrad. My computer shut down @agile owl RTX 2080
models are trained in spot
not Ti tho
they all train during the night
I think the Ti has 3 extra gb of VRAM
VRAM is super expensive did you guys know that
no o.O
the good stuff is
is that why Graphics Cards are expensive?
part of the reason yes
everything is automated
I'm honestly somewhat surprised it took people so long to realize the potential of GPUs for machine learning
they had GPUs for a long time
but they were not as capable
I don't think they have changed the basic premise of it that much
but in relation to the CPU they always had more but worse cores didn't they?
right but think about it
using GPUs for compute
that only really became a thing when NVIDIA wrote CUDA
even though in principle it could have been done earlier
is it because the GPU instruction sets are proprietary?
I don't understand why no one did something like CUDA independently
I asked chatGPT and it gave me a reasonable sounding answer
what did it say?
lack of standardization
ohhh
given how hard it is to use the AMD environment, may make sense
ive spent a whole weekend trying to get an amd gpu in the cloud to do stuff
took me an hour to do the same in nvidia
ig my issue is the lack of docker support
the experience of getting it set up sucked tho
it's not zero support, but everything is so badly done
it's like one dude
10gb docker images for ex
fr ?
lol
like one person supporting it ?
I remember looking for resources about it and it was ONE GUY answering everyone's questions on github
yeah that explains it
i'd be surprised if they haven't grown the team
I mean he probably had a few coworkers but still
behind every bad code is one overworked developer
I still think AMD is underinvesting in AI
I don't understand it
if they could undercut Nvidia in AI it would be a massive coup
massive corps are hard to change
people don't get compensated for innovating
so they kinda dont
I think AI was nvidias end game from the start
I mean they don't need to be the first mover
being the second mover is also good if there's only one other company
they just need to invest in having something as good or better than cuda
there seems to be a market vacuum of sorts
or even slightly worse
no competitor to nvidia
exactly
even if you're worse than nvidia if you're a viable option and can compete on price
their problem is rocm is barely viable
it just has to be good enough to work with common libraries
and they would get a massive sales boost from people going for the value alternative
idk if it's easy to do that, you're saying like make cheaper gpu rite
they already make cheaper GPU
I'm saying they need the SOFTWARE
so people buy them for compute
ah, yeah that's for sure
their gpus are usually slightly worse than nvidias
from a hardware perspective
but the software can be worse too
it just has to like, work
I personally dont care or would notice the hardware
I do notice the complete lack of support on the software side
I mean rn I just schedule the thing and let it do it during the night
so I wouldn't notice it
I'm making a webserver to provide a UI for model creation
yeah I get the feeling a lot of modelling can be done with UI or some DSL
this server is currently just for training an already curated dataset and plotting results
the next step I want to do is add the ability to do ETL from different APIs into a joined table before standardizing and slicing for CV etc.
have you tried mlflow
is that some paid service
a webserver that does graphs sounds an awful lot like it
no its open source
does it work for reinforcement learning
its what Ive been using to log my stuff
you kinda just do .log_metric("metric_name", some_val)
and it saves it and you can see it real time on the UI
theres also an auto log feature, but ive never used it
like it does some magic that you dont even have to explicitly log stuff
like in a callback?
there's a lot of automation done by me, each experiment is a PR, when I merge it automatically runs the training loops and they appear as runs
idk how it does it, but doesnt seem to use callbacks, there seems to be a ton of py magic to it
I meant the auto log feature
what im doing you just do .log_param, .log_metric and .log_artifact
so it backs up your models and everything
surely you'd prefer not to reimplement all this
there are others similar to this
I'm honestly not sure if that API can work with sb3
without doing deep surgery
they already have their own logging functions
I just need to visualize them
I already have the points
I mean, getting access to that state at the right level of granularity to plot it
someone asked me what ML algorithms don't require regularization and which ones do
I said this
There are some ML algorithms where overfitting is not a problem at all. For example Naive Bayes is known for its conditional independence which makes it resistant to overfitting. KNN is another algorithm that is resistant to overfitting as it works by memorizing the training data. Random forest is also resilient when it comes to overfitting thanks to the way it combines many independently trained decision trees.
am I right?
knn doesnt overfit ?
knn has a very good calibration
i thought every model can overfit
it depends on what your assumptions are
some people say RL can't overfit but I'm pretty sure it can as soon as you introduce different data to the same environment
it's way too late for me to use my brain to learn new stuff
I found this online:
Non-parametric: KNN doesn't learn a fixed set of parameters. It essentially relies on memorizing the training data.
Focus on local regions: KNN makes predictions based on localized neighborhoods in the data space, reducing its susceptibility to extreme patterns that might mislead parametric models.
i mean yeah, does it even make sense to say "knn overfits"
it's just a database query almost
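One way to poke at the "knn doesn't overfit" claim is to try it on noisy labels. The sketch below uses synthetic data with 20% of the labels flipped (the data, the noise level and the helper names are all assumptions). With k=1 every training point is its own nearest neighbour, so training accuracy is perfect while the flipped labels get memorized along with everything else.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n, noise=0.2):
    """2D points labelled by the sign of x0, with a fraction of labels flipped."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] > 0).astype(int)
    flip = rng.random(n) < noise
    return X, np.where(flip, 1 - y, y)


def knn_predict(X_train, y_train, X_query, k):
    """Plain k-nearest-neighbours majority vote with Euclidean distance."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)


X_train, y_train = make_data(300)
X_test, y_test = make_data(300)

for k in (1, 15):
    train_acc = (knn_predict(X_train, y_train, X_train, k) == y_train).mean()
    test_acc = (knn_predict(X_train, y_train, X_test, k) == y_test).mean()
    print(f"k={k:2d}  train accuracy={train_acc:.2f}  test accuracy={test_acc:.2f}")
```

The gap between training and test accuracy at k=1 is what most people would call overfitting, and a larger k acts as the regularizer: it typically narrows that gap by averaging over more neighbours.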
it's an interesting property though
I think you could make a more sophisticated model based on the same principle
its called a transformer
is there any other algorithm that uses memorization?
Locally Weighted Regression: A non-parametric regression method that fits simple models to localized subsets of the training data. The focus is on predictions made close to a query point, relying more heavily on training examples in that local region.
found this
Locally Weighted Regression
if you're localizing to time then that's just a rolling window
interesting read
you could like weight by time too
exponentially weight the cost with some halflife with respect to time
When to use Locally Weighted Linear Regression?
When n (number of features) is small.
If you don't want to think about what features to use.
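A minimal numpy sketch of locally weighted linear regression for one feature, assuming a Gaussian kernel with bandwidth `tau` (the function name, the bandwidth and the sine example are made up for illustration). Each query point gets its own weighted least squares fit.

```python
import numpy as np


def lwr_predict(x_query, x_train, y_train, tau=0.2):
    """Locally weighted linear regression prediction at a single query point."""
    # Gaussian kernel: training points near x_query get weight close to 1.
    weights = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))
    A = np.column_stack([np.ones_like(x_train), x_train])   # [1, x] design matrix
    AtW = A.T * weights                                      # A^T W without forming W
    theta = np.linalg.solve(AtW @ A, AtW @ y_train)          # weighted normal equations
    return theta[0] + theta[1] * x_query


x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)
prediction = lwr_predict(1.0, x, y)   # local straight-line fit around x = 1
```

The time-weighted idea from the chat is the same machinery with a different kernel: replace the Gaussian weights with something like `0.5 ** (age / halflife)`, where `age` is how old each observation is.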
Hi, does anyone know how to get simba work on top of spark? is it enough to just install the driver?
so I got my code working with polars and it's actually slower
by quite a bit
gonna profile it and see what the problem seems to be
same stuff as where it was taking a lot of time with pandas except the flamechart is shallower
and it's taking longer
I feel like I got memed
I'm not sure my results were actually correct because I got a different end result but I'll save learning polars for a new project
I think it kind of stands to reason that a dataframe with an index built in is going to be faster than doing a filter on a column...
can you think about a case where it might behave differently?
Also that's not really related to #data-science-and-ml
#python-discussion or #❓｜how-to-get-help are a good start
think about the different boundaries and examples of values that fit in each one
that's also why I am asking you
so you can think about it
morning!
An interesting side effect in polars:
filtered_df = df.with_columns(
    pl.when(
        pl.col(column).is_not_null() & pl.col(column).str.contains(phone_regexp)
    ).then(
        pl.col(column).str.extract_groups(phone_regexp)
    ).otherwise(
        pl.lit(None)
    ).alias(column)
)
This will create a structured column with as many None values as capture groups
I tried to fix that behavior to no avail
[.when([(col("PHONE3").is_not_null()) & (col("PHONE3").str.contains([String(^(?:(?P<country_code>+\d+)[\s-]+)?(?P<number>(?:\d[\s-]*)+)$)]))]).then(col("PHONE3").str.extract_groups()).otherwise(null.cast(Struct([Field { name: "country_code", dtype: String }, Field { name: "number", dtype: String }]))).alias("PHONE3")]
the dataframe does not contain any nulls in the columns parsed
'phones': [{'country_code': None, 'number': None},
{'country_code': None, 'number': '5551234'},
{'country_code': None, 'number': '5551234'},
{'country_code': None, 'number': '5551234'}]}
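For comparison, this is roughly how the same pattern behaves with the standard library `re` module. It is only a sketch: the pattern is copied from the query plan above with the leading `+` escaped as `\+`, which is an assumption about what the original regex looked like, and `parse_phone` is a made-up helper. It separates the two cases that look alike in the output: a string that does not match at all, and a match where the optional country code group did not participate.

```python
import re

# Same shape as the pattern in the query plan; the "\+" escape is an assumption.
PHONE_RE = re.compile(
    r"^(?:(?P<country_code>\+\d+)[\s-]+)?(?P<number>(?:\d[\s-]*)+)$"
)


def parse_phone(text):
    """Return the named groups as a dict, or None when the pattern does not match."""
    match = PHONE_RE.match(text)
    return match.groupdict() if match else None


print(parse_phone("+1 555-1234"))   # both groups captured
print(parse_phone("5551234"))       # optional group absent, so country_code is None
print(parse_phone("not a phone"))   # no match at all, a single None
```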
Guys can someone recommend a good beginner course for python for datascience
[2024-02-12 11:45:17,350] [MainProcess:MainThread] INFO: CSV: Processed 499999 lines in 1.66 seconds, 300776.14 lines/second
rewrote all the coalescing/transforms into expr engine query plans
Check the pinned messages
Does pytorch have yolo model? My lab asked to train object detecting using it and specifically asked to switch to pytorch for this assignment.
Hi, I have a pandas dataframe which is grouped by a column named 'run'. Each group should have more or less the same amount of rows. In this dataframe, there is another column called 'total_data' and I would like to merge these groups into a single group, effectively eliminating the need for a 'run' column. While merging it would be nice if it took the mean of the row value of 'total_data' horizontally across each 'run' group, rather than the mean on the column itself. The end result should be a Series with the same amount of rows as a 'run' group. Could someone please assist me with this? I've been trying to solve this with AI but I can't seem to figure out the right combination of functions to call. Any help will be appreciated
Hopefully I explained that correctly. I'm new to pandas so I'm not sure if I'm describing the problem correctly
Maybe let's tackle the first question first: you have two dataframes, and you want to "combine" them.
Do they have a common index or something to "join" them on?
They have all the same columns, might differ in the number of rows by less than 5%
Yes, so it sounds like you want to do an outer join then.
The place to start is: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html:
df1.merge(df2, how="outer", ....)
There's a few different ways to merge. You can use "on=[list of columns to join on]", or left_index/right_index if the indexes is what you want to use.
Ok, great, this gives me a path to follow. This will then be possible to merge values by taking the mean?
Across the dataframes
I'll experiment, thank you for the help
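If the goal is the mean of 'total_data' at the same row position across runs, a merge may not be needed at all. Here is a sketch, assuming the rows inside each run are already in the order you care about; the tiny dataframe is made up, only the column names 'run' and 'total_data' come from the question.

```python
import pandas as pd

# Hypothetical data: three runs of slightly different length
df = pd.DataFrame({
    "run": [1, 1, 1, 2, 2, 2, 3, 3],
    "total_data": [10.0, 20.0, 30.0, 12.0, 22.0, 32.0, 14.0, 24.0],
})

# Position of each row inside its own run: 0, 1, 2, ...
position = df.groupby("run").cumcount()

# Mean of total_data across runs at the same position
mean_across_runs = df.groupby(position)["total_data"].mean()
```

The result is a Series as long as the longest run. Positions that only exist in some runs are averaged over the runs that have them, which may or may not be what you want for the last few rows.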
Not using seeds was a mistake
But also, I may be coming to the realization that 16gb of GPU is not gonna be enough to train this thing. What else could make this thing always converge to the same value other than it's just too small. Bert uncased is like 100M parameters
I have a question to those experienced in Dash. Can you call another function that you defined earlier in the Python script to generate the dataframe inside a Callback function to replace the old dataframe with a new one?
Like I have a program that generates a Pandas dataframe based on a given date and I want to implement a date picker that will replace the data with different data from the other dates entered by the user in the text box.
Inside the Callback function, I called the function that originally generated the dataframe so that it can generate a new dataframe with the new date.
spot = self.data.loc[curr_date, "spot"]
spot_window = self.data.loc[prev_dates, "spot"].to_frame()
log_spot_window = spot_window.apply(np.log)
if self.current_step > 0:
spot_returns = log_spot_window.unstack().diff().dropna().iloc[-1]
spot_returns = spot_returns.values
else:
spot_returns = np.zeros(self.no_symbols)
spot_window_vals = spot_window.values
spot_values = spot.values
spot_rank = get_percentile(spot_values, spot_window_vals, axis=0)
@left tartan pd.loc and pd.unstack are faster than pl.filter and pl.pivot in this code. particularly in the definition of spot_window and spot_returns
(I'm leaving this for the resident polars experts)
But could you describe the problem you're trying to solve?
my difficulty is always figuring out the right place to start and stop
for the appropriate amount of context
so basically
Basically, it's: You're starting with Dataframe X, and you want Dataframe Y... so maybe describe the starting state and ending state?
the starting state is I have the dataframe that represents the entire chunk deserialized as self.data
self.dates is the list of unique dates in data
I have an index counting which date I'm on
I need to update the current state given the date and self.data
!pastebin
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
https://paste.pythondiscord.com/Y2RQ here's the whole method
the class is 550 lines so don't wanna post it and be rude unless you're really interested
Looks like you want the percentile of spot price over a window from current date -10 to +1?
that's part of it
I should probably split this up into multiple methods
at the end you see everything that goes into the return
lines 31-38
the docstring is also wrong
I forgot to update it
I was originally clustering for a single stock and haven't updated it
there's obviously some repetition and dumbness going on here but the critical part is I think the part I posted originally
polars profile:
pandas profile
the two most expensive high-level polars operations are filter and pivot
this corresponds to the pandas indexing and pandas unstack
so that kind of makes sense
but the polars version is just 2x slower
unfortunately I lost the polars code I had because I never committed it out of disgust at the results
surely not having an index comes at some cost though
if pandas can't beat polars at a loc index vs a generic filter then I'd be surprised because the whole reason pandas has the badness it does have is so it can optimize around indexing
It always comes down to the data
I mean, it also comes down to how the data is structured for access
I don't understand how polars gets rid of the index and doesn't pay a price
you either have an index that's set as a discrete action, set it every time you do an operation which seems extremely expensive, or don't have it
Ah I was talking about my thing
oo mb
Need to step back and re process the data using all the lessons I learned til now
the polars code was something like this before I tossed it
spot = self.data.filter(pl.col("date") == curr_date)
spot_window = self.data.filter(pl.col("date").is_in(prev_dates))
...
spot_returns = log_spot_window.pivot(index="date", columns="ticker", values="spot").drop("date").diff()
A 40M parameter model having the same loss graph as a 1M parameter one, like I can't even
if you encode your text I bet you could use GPU for that
Probly not advisable
I think
It's also possible that these things converge slowly, that graph looks an awful lot like mine
well more or less, x axis is number of steps w max range being the end of the dataset
I have a question. I'm trying to load into my Dash App with Debug Mode enabled. Does it usually take long?
Can you show me the polars
I didn't commit it unfortunately but it was something like this:
spot = self.data.filter(pl.col("date") == curr_date)
spot_window = self.data.filter(pl.col("date").is_in(prev_dates))
...
spot_returns = log_spot_window.pivot(index="date", columns="ticker", values="spot").drop("date").diff()
I don't think polars has diff actually
but if my polars code was correct to begin with then I probably wouldn't have had this issue in the first place hah
It has diff
I also don't understand why polars should be faster than pandas if it never indexes the data?
Is there a docs for training transformers, I'd be really happy if there was one
I'm just trying all this stuff until something sticks, not very efficient
It does index the data
There's just no Pandas index weirdness
that's the nuance
Giving an index based on integer position in Pandas is also just a bit pointless
It's the default, of course there's smarter ways to do it
But I don't see people doing that, like picking an index in Pandas that aligns with their data access patterns
You also need to check what type of index Pandas uses, as you know in DBs there's many different kinds
Hash based indexes don't give you a lot if you're filtering like <
it's a multiindex on date as datetime64 and ticker as a string
I'm not using lt or gt just equals and isin
hmm then a hash index is good
I refer to: https://www.postgresql.org/docs/current/indexes-types.html for a concise overview
the issue isn't really with the DB
ah gotcha
Because they aren't magic ✨
Do you use the lazy api
or eager only?
I believe the read_database returned a lazy frame
No
it returns an eager frame?
yes
then i was using eager
can you read from db using lazy
No, query the db and call .lazy() immediately
thanks you're a lot more helpful than the polars discord
they have a beginner questions channel where no one answers beginner questions lol
🤣
typical of the rust community memes I have to admit
polars simp # 1
they have a reputation for being elitists and thinking people haven't done enough work if they need help
I used to think like that then I realized being eager to ask questions has almost no downside on the internet
as long as they are somewhat reasonable..
Answering questions makes you think
- they're without obligation
Nobody loses, unless someone is spamming or so
Oh, good addition about the lazy API is that it removes all footguns
You can't iterrows, maprows or whatever I see people doing
well time to rewrite everything I wrote yesterday but with lazy this time
I thought I must have really screwed something up bad so I tossed it all
lesson learned
last recommendation
I also didn't feel like branching because I was lazy
It will sound crazy
but 1) read all the docs. do it while you're eating cornflakes or whatever 2) browse through the method names in the API ref
Polars has the GOD tier method for time series
The danger is, if you focus on translating Pandas to Polars you'd never find it
reading the whole API sounds a bit ambitious
here's a funny thing
I asked copilot to retrieve what I wrote yesterday but it said it's not allowed to do it
but when I start retyping it it suggests what I had written
that would be an interesting use for AI
"help I accidentally deleted this code I forgot to commit, replay your telemetry buffer"
XD
While you're at it replay Jeff Bezos' credit card details thanks
i had the exact opposite take O.o - i only just started using polars again after a long hiatus so a lot probably changed.
#data-science-and-ml message
When was this?
probably 1 year ago or even more
This time last year is when I turned the >1h Pandas data pipeline to ~15s polars
And I definitely used group by dynamic
Maybe it was longer ago then yeah
this is in the environment step function in a reinforcement learning setup