#data-science-and-ml
One file being created per process
Or I guess I should say that something is causing each process to hang
I don't know exactly what it is
It must be that a race condition is being created in each process due to the multi-threading though I did attempt to implement threadsafe write
are you multithreading within a spawned process
It's not too late to read the manual and rewrite
I never had issues doing the pattern you're describing
pooling out, writing and concatenating
Don't treat it as if it were Pandas, it's one of the anti-patterns described in the documentation
but I want it to be pandas
but arctic
🐼 ❄️
polars is a lot more like spark or sql than pandas
the python api is obviously heavily inspired by pyspark
No secret I much prefer the syntax over Pandas. The indexing etc. of pandas makes it the most awkward DF library I've used across various programming languages 😅
It remains a bit hard to use because imo Polars only makes sense if there's no data viz or ML angle to the part of the project you're using it in. It shouldn't be used in all (sub)packages of your project. I've made the mistake of writing some of my feature engineering in Polars, which automatically means I need to kind of manage both in my ML code.
does anyone know of a good strategy to piece together several simulations composed of difference equations in python? Is there a library?
oh i may try this https://pysd.readthedocs.io/en/master/getting_started.html
Hey guys, I have been following a video on fine-tuning Mixtral 8x7B. It works, but the training process is very slow: it's taking about 1.5 hours for 320 steps of training. I am running it on 2x A30.
Code: ```import transformers
from datetime import datetime
project = "final-finetune-2"
base_model_name = "mixtral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name
tokenizer.pad_token = tokenizer.eos_token
trainer = transformers.Trainer(
model=model,
train_dataset=tokenized_train_dataset,
eval_dataset=tokenized_val_dataset,
args=transformers.TrainingArguments(
output_dir=output_dir,
warmup_steps=5,
per_device_train_batch_size=2,
gradient_checkpointing=True,
gradient_accumulation_steps=4,
max_steps=321,
learning_rate=2.5e-5,
logging_steps=25,
fp16=True,
optim="paged_adamw_8bit",
logging_dir="./logs", # Directory for storing logs
save_strategy="steps", # Save a checkpoint every save_steps steps
save_steps=10, # Save checkpoints every 10 steps
evaluation_strategy="steps", # Evaluate every eval_steps steps
eval_steps=10, # Evaluate every 10 steps
do_eval=True, # Run evaluation on eval_dataset during training
report_to="wandb", # Comment this out if you don't want to use Weights & Biases
run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}" # Name of the W&B run (optional)
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()```
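A quick back-of-the-envelope check on the config above, as a sketch only: with `eval_steps=10` and `save_steps=10` over 321 steps, training stops to evaluate and write a checkpoint roughly every 10 optimizer steps. Whether that is the actual bottleneck is an assumption; it depends on how big the validation set is and how long a checkpoint takes to write.

```python
# Values copied from the TrainingArguments above.
max_steps = 321
eval_steps = 10
save_steps = 10
per_device_train_batch_size = 2
gradient_accumulation_steps = 4

# How often training is interrupted for evaluation and checkpointing.
num_evals = max_steps // eval_steps
num_saves = max_steps // save_steps

# Sequences consumed per optimizer step on each device.
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps

print(f"evaluation passes: {num_evals}")
print(f"checkpoints written: {num_saves}")
print(f"sequences per optimizer step (per device): {sequences_per_step}")
```

If those pauses do dominate, raising `eval_steps` and `save_steps` (for example to 50, as the original comments suggested) is a cheap thing to try first.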
i have this
those are all the x values for the maximums and minimums of these functions
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot()
xx = np.linspace(-5., 10., 1000)
# ------- Vul verder aan -------
functiewaardes_extrema1 = []
vgl = opl.subs(a, 1/2)
vgl1 = vgl.subs(b, 2)
vgl2 = vgl.subs(b,4)
display(vgl1)
display(vgl2)
t = sp.lambdify(x,vgl1.rhs)(xx)
q = sp.lambdify(x,vgl2.rhs)(xx)
ax.plot(xx,t)
ax.plot(xx,q)
ax.scatter(extrema, vgl1.rhs.subs(x, extrema))
plt.xlim([0, 10])
plt.ylim([-60, 30])
plt.show()
i have this code and im trying to get the minimums to show up as dots on the functions using the scatter function but idk how to do that
this is the variable called extrema
vgl1 and vgl2 are the functions
vgl is a function with still a, b in it
'Union' object has no attribute 'as_base_exp' this is the error i get
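One possible reading of that error, as a sketch: `extrema` is a sympy set (a `Union` / `FiniteSet` as returned by something like `solveset`), and `subs(x, extrema)` tries to plug the whole set into the expression. Assuming the set is finite, the points can be pulled out one at a time and evaluated individually. The function `f` below is a made-up stand-in for `vgl1.rhs`, since the real one isn't shown.

```python
import sympy as sp

x = sp.symbols("x")

# Hypothetical stand-in for vgl1.rhs; the real function isn't shown in the chat.
f = x**3 - 6 * x**2 + 9 * x

# solveset returns a sympy *set*, not a number, so it can't go into .subs() directly.
critical = sp.solveset(sp.diff(f, x), x, domain=sp.Interval(0, 10))

# Pull the individual points out of the (finite) set, then evaluate f at each one.
xs = sorted(float(v) for v in critical)
ys = [float(f.subs(x, v)) for v in xs]
points = list(zip(xs, ys))
print(points)

# With matplotlib the dots would then be drawn with: ax.scatter(xs, ys)
```

This only works when the set can be iterated (a finite set of points); a set containing intervals would need to be handled differently.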
i give up
does anyone have experience with web scraping, especially using Goose3
Hi, can anyone look at the raw tensor values of a 256*256 image and tell the features and the objects of the image without looking the image or using any tools? And look at the image and convert it into tensors or sinusoids with just using his own mind. It would probably take insane amount of practice and training to do that?
Yeah that will solve the problem i figured... but what if i went into micro scaling to make my model better... i am talking about precision... i am going into more and more micro so to make my model most accurate.. other than scaling is there any other option? I tried doing moving averages on a linear regression model that had noise and it worked but not for a good percentage amount...
Can anyone help me out on this?
from sklearn import datasets
data = datasets.load_breast_cancer()
ulaz = data.data
izlaz = data.target
ep = 100
bs = 32
from sklearn.model_selection import train_test_split
ulaz_trening, ulaz_test, izlaz_trening, izlaz_test = train_test_split(ulaz, izlaz, shuffle=True, test_size=0.2,
random_state=42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(ulaz_trening)
ulaz_trening_norm = scaler.transform(ulaz_trening)
ulaz_test_norm = scaler.transform(ulaz_test)
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers.legacy import Adam
n_in = ulaz_trening_norm.shape[1]
n_out = 1
def make_model(hp):
    model = Sequential()
    no_units = hp.Int('units', min_value=3, max_value=15, step=2)
    act = hp.Choice('activation', values=['sigmoid', 'relu', 'tanh'])
    model.add(Dense(units=no_units, input_dim=n_in, activation=act))
    model.add(Dense(n_out, activation='sigmoid'))
    lr = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    opt = Adam(learning_rate=lr)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
import keras_tuner as kt
from keras.callbacks import EarlyStopping
stop_early = EarlyStopping(monitor='val_accuracy', patience=5)
tuner = kt.RandomSearch(make_model, objective='val_accuracy', overwrite=True, max_trials=10)
tuner.search(ulaz_trening_norm, izlaz_trening, epochs=ep, batch_size=bs, validation_data=(ulaz_test_norm, izlaz_test),
callbacks=[stop_early], verbose=1)
Why do I get this error? I've seen this SO post, but I dont call it from terminal, rather from my PyCharm IDE ( https://stackoverflow.com/questions/55675199/tensorflow-python-framework-errors-impl-failedpreconditionerror-is-a-direct)
2 lineplots are looking kinda similar but they are showing moderate negative correlation of -0.6 why?
put the numbers on a scatterplot
even if the overall trends are both positive, it's possible that there are enough opposite movements to cause negative correlation. that, or you made a mistake in the code
if you did something like a moving average you would hopefully expect a positive correlation
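One way to get a negative number out of two series that look alike on a plot, as a sketch with made-up data: the rows are paired up in the wrong order. The example assumes one series is stored newest-first and the values are matched by position rather than by date; whether that is what happened in the NASDAQ / S&P case is not known.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

# Two made-up upward-trending "closing price" series, oldest first.
index_a = [100 + 2 * day + (day % 5) for day in range(250)]
index_b = [50 + day + (day % 3) for day in range(250)]

# Rows paired correctly by date.
aligned = pearson(index_a, index_b)

# Same data, but one series is newest-first and rows are paired by position.
misaligned = pearson(index_a, index_b[::-1])

print(f"aligned by date:    {aligned:.2f}")
print(f"paired by position: {misaligned:.2f}")
```

Checking that both frames are sorted the same way and share the same date index before calling `.corr` rules this out.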
ax.scatter is correct, so your calculations might be wrong. check the actual computed values and confirm that they are actually in the plot bounds and are the values you expect
corr_USD = NAS_U['Close'].corr(S_AND_P['Close'])
print(corr_USD, "<--- :. Moderate negative linear relationship b/w NASDAQ_500 and S&P500\n=> As Close price of one increases, other decreases")
I did this code for the correlation and
# Plot NAS_U
sns.lineplot(x='Date', y='Close', data=NAS_U, ax=axes[0, 0])
axes[0, 0].set_title('NASDAQ')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Closing Stock Price')
# Plot S&P500
sns.lineplot(x='Date', y='Close', data=S_AND_P, ax=axes[0, 1])
axes[0, 1].set_title('S&P500')
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Closing Stock Price')
for the plot
Try the scatterplot
Hi,
I am doing a Kaggle project. I was able to run and submit the file. Even though I didn't analyze results myself. The issue I am having is I want to show a viewer the accuracy of my model through visualizations. or metrics (Maybe MAE?). How do I do this? How would you recommend visualizing your models for business users?
thanks
I have 2 encoder in model.
I give 1 inputs each.
Output from each are A and B.
both are embedding.
If i do B.detach then calculate loss, what will happen?
Hi everyone,
can someone pls help recommend a well curated and properly outlined roadmap for Data Analytics? Since last year I'm 90% through the basics and fundamentals of Python programming, and I'd love to push forward into the data analysis part. I understand roadmaps are subjective, but a well outlined one with touches of mathematical and statistical content would be very useful...
just like this one for Datascience and AI on roadmap.sh which points to the recommended materials...
I would also find it very useful if most of the materials are pointed to Coursera where i can easily apply for a F.A...
Thank you 
see the pinned messages on this channel, I also recommend picking a project and learning from the experience. super effective
start with simple stuff, ask people and/or gpt when you don't know something, as you immerse yourself into the subject it will become clear to you what to learn next
thanks.. any recommendation on where to get these projects..?
yes, just come up with one, the rule is that it must be something you find to be cool, the fun of doing something you like gets you through the painful process of forcing your brain to make new synapses
I usually choose stuff that's challenging but technically possible, so like, I wouldn't choose training GPT4 because that's not within my budget, but I'd choose something I don't know how to do and then come up with a plan on how I'll build up the knowledge to get there
Awesome... thanks for the recommendation...
hi
im trying to learn to use MobileBERT and i found this code
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
"https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/multi-cased-preprocess/versions/3")
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
"https://www.kaggle.com/models/tensorflow/mobilebert/frameworks/TensorFlow2/variations/multi-cased-l-24-h-128-b-512-a-4-f-4-opt/versions/1",
trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"] # [batch_size, 512].
sequence_output = outputs["sequence_output"] # [batch_size, seq_length, 512].
embedding_model = tf.keras.Model(text_input, pooled_output)
sentences = tf.constant(["(your text here)"])
print(embedding_model(sentences))
i already downloaded the necessary modules but the error is this
ValueError: Exception encountered when calling layer "keras_layer" (type KerasLayer).
any fix on this?
should I trust a model whose learning curve looks like this
Ok so after some testing today I found that I can't write out from a process directly when using multiprocessing before the process has finished. I'm not sure why, but the file will be created, the empty dataframe that I initialize as a variable at the start of each process will be written, but none of the data gathered during the process is written.
I removed threading from the program (though now it is too slow so I will have to add it back in) and by returning the dataframe from each process as the result in the process pool I am able to aggregate the data at the end and write that out to a csv successfully.
I've been looking into process Queues and I think I have to implement something like that to get the data out before the process has finished
Are you using lazy frames?
Are you writing Polars as if it were pandas? (Not using expressions)?
If you could show me some of your code (the one where you removed threading) I could instantly tell you what's up
Are you writing Pandas as if it were regular Python? (Using for loops modifying data structures in-place)?
Another typical question 😄
hmm ok gimme one second I'll show a minimal example
import concurrent.futures
import math
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

import polars as pl

def extract(chunk_ids):
    main_df = pl.DataFrame()
    num = chunk_ids[0]
    csv_name = f'data_{num}.csv'
    for id in chunk_ids:
        # Process data here and add to main_df
        # if I try to write to csv from within here
        # or even after this for loop the process hangs
        pass
    return main_df

def main():
    executor = ProcessPoolExecutor(max_workers=cores)
    tasks = [executor.submit(extract, chunk['column_name'])
             for chunk in id_chunks]
    doneTasks, _ = concurrent.futures.wait(tasks)
    results = [item.result() for item in doneTasks]
    final = pl.concat(results)
    final.write_csv('final.csv')

if __name__ == "__main__":
    cores = cpu_count()
    df = pl.read_csv("test.csv")
    total = df.height
    id_chunks = df.iter_slices(n_rows=math.floor(total/cores))
    main()
This allows me to write final.csv successfully, but I want to intermittently write out to csv from within each process in case any errors in the process cause it to hang then I lose all the data
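One way to get intermittent writes without touching files from inside the workers, as a sketch: let each worker return its rows, and have the parent append them to the CSV as soon as each chunk finishes. The `extract` body and the `id`/`value` columns are hypothetical placeholders, and the rows are plain dicts here rather than Polars frames to keep the example dependency-free.

```python
import csv
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract(chunk_ids):
    """Hypothetical worker: returns plain rows instead of writing files itself."""
    return [{"id": i, "value": i * i} for i in chunk_ids]

def run(chunks, out_path, executor_cls=ProcessPoolExecutor, max_workers=4):
    """Append each chunk's rows to out_path as soon as that chunk finishes."""
    written = 0
    with open(out_path, "w", newline="") as f, executor_cls(max_workers=max_workers) as pool:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        futures = [pool.submit(extract, chunk) for chunk in chunks]
        for future in as_completed(futures):
            rows = future.result()  # re-raises here if the worker failed
            writer.writerows(rows)
            f.flush()  # finished chunks are on disk even if a later one hangs
            written += len(rows)
    return written
```

Called as `run(chunks, "partial.csv")` under the `if __name__ == "__main__":` guard, only the parent process ever holds the file open, so a hung worker costs one chunk rather than everything.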
Can you show me some actual Polars code
I'm just about 90 % sure you really really should remove the multiprocessing and you'd be in a wayyyyy better place
I can't remove it
Polars will utilize all my cpu cores?
Yes 😦
Yeah, I had the same thing but since Polars uses dramatically less memory than Python I just loaded all of it in at once
And then it was just 100 % CPU bound
Can't you make all those requests before doing anything else
Any reason why you can't have 1 job that does the requests, writes them to say 1 big parquet without doing any processing and then a second that then reads the data using Polars
wouldn't that run into the same issue?
My data pipeline also does millions of requests and that's how I structured it
Hmmm
No it wouldn't because then you wouldn't have to "infect" the polars stuff with multiprocessing
lol
For the requests if you're using non-blocking client you could also just use async
The way it was set up was that each of the 16 processes had 10 worker threads that were making the requests and as the requests finished it would take them one at a time and process the data and append to the "main_df" for that process
Or I guess that's not exactly right
There were a max of 10 tasks allowed
per process at any given time
Splitting is easier, the data pipeline I wrote basically does this, the first stage polls an API and just dumps the raw data into a data lake (JSON), the second stage runs every couple of hours and transforms JSON into SQL and writes it into a DB, the third stage is where I'm using polars to read hundreds of millions of rows from the DB and do a bunch of aggregations and then write to a diff DB
I think I was getting timed-out for requesting too much too fast
But that was right when I started, so maybe I was doing something else wrong and thought I was getting timed out
Yeah, you can make your life so much easier by doing it in steps and focusing on one problem at a time 😄 Maybe you are getting rate-limited or so. In that case a basic cron job can save you by doing it in steps and writing somewhere else.
Yeah I see what you're saying
Aside from that, having the raw data at hand is very important. I'd almost never recommend going straight from source => processed
A common task I do is verifying my processed data by comparing it to the raw data that came from the APIs etc.
Basically, if you set it up like this you only have the processed data and not the raw ones
Right. This is a pretty specific use case. I'm writing a program for someone else and they just want to run it and get a CSV from it so they won't be doing anything with the raw data really, but I do see what you mean
Well, you can still use it to check if your program is correct
Oh, it's correct 😉 lol
I will see if I can do it that way though. Make all the requests, add all the raw data to a pool, then process it all afterward
That decouples the CPU bound part from the I/O bound part, so really the requests will likely finish even faster
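The split described above could look roughly like this, as a sketch: stage one only makes requests and dumps the raw responses, stage two only reads the dump and processes it. `fetch`, the payload shape and the file path are all hypothetical stand-ins for the real API.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def fetch(item_id):
    """Hypothetical stand-in for one HTTP request; returns the raw payload."""
    return {"id": item_id, "payload": {"price": item_id * 1.5}}

def stage_one_fetch(ids, raw_path, max_workers=10):
    """I/O-bound stage: make all requests with threads, dump raw JSON lines."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool, open(raw_path, "w") as f:
        for record in pool.map(fetch, ids):
            f.write(json.dumps(record) + "\n")

def stage_two_process(raw_path):
    """CPU-bound stage: read the raw dump back and process it in one go."""
    with open(raw_path) as f:
        records = [json.loads(line) for line in f]
    # A DataFrame library could take over from here (Polars can read
    # newline-delimited JSON with pl.read_ndjson).
    return sum(r["payload"]["price"] for r in records) / len(records)
```

Because the raw file survives, stage two can be re-run or debugged without making a single request again.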
Has anyone worked with this kinda library or know whiçh library can do this visualization with a good documentation
Made this "AI agent" or whatever you wanna call it myself from scratch in python
And running the model locally
Rn it just has access to powershell commands, but it is able to chain them together if it wants to
Smort
Bruh
This thing is actually like smart
Like really smart
Smarter than me probably, i don't think i'd be able to pull that off if someone asked me to do it
I'm not really a powershell wizard
nice
I do not understand all the use cases of detach(), no_grad(), requires_grad, etc.
On the surface I know detach() prevents gradient computation by excluding that tensor, but I don't understand their behaviour when, for example:
there are two backbones and we only want to update one
How do gradients accumulate with requires_grad = True?
Why should I not do requires_grad = True in the last loss.backward()?
detach() and no_grad() both come from PyTorch (detach() is a tensor method, torch.no_grad() is a context manager), while requires_grad=True is an argument you pass when creating a tensor, e.g. with torch.tensor().
Any tensor that has requires_grad=True can be differentiated with respect to. We can perform backpropagation on it because for any computation that involves that tensor, PyTorch builds a dynamic computational graph of that operation in the background (you can roughly think of this computational graph as a footprint. Just like how humans can trace back their family origin with DNA, we can use the computational graph to trace a result back to the tensors it came from)
Now, in some situations this graph building becomes a burden: every operation involving a tensor that has requires_grad=True gets recorded, so the longer the chain of operations, the deeper and more complex the computational graph becomes (and the more memory it holds on to)
There are some situations where we legit wouldn't want a computational graph to be created because it's redundant; one such case is inference time. So to speed things up and save memory, we can tell PyTorch not to bother building any computational graph 'cos we don't need it.
Now, you could instruct PyTorch to do that using either detach() or no_grad()
tensor.detach() detaches the output from the computational graph. So no gradient will be backpropagated along this variable.
On the other hand, torch.no_grad() temporarily disables gradient tracking: anything computed inside the block comes out with requires_grad=False, while existing tensors keep their own flag. torch.no_grad() means that no operation should build the graph.
The difference is that one refers to only a given variable on which it is called. The other affects all operations taking place within the context manager.
Also, torch.no_grad() uses less memory because it knows from onset that no gradients are needed so it doesn’t need to keep intermediary results.
https://discuss.pytorch.org/t/detach-no-grad-and-requires-grad/16915/6
torch.no_grad yes you can use in eval phase in general. detach() on the other hand should not be used if you’re doing classic cnn like architectures. It is usually used for more tricky operations. detach() is useful when you want to compute something that you can’t / don’t want to differentiate. Like for example if you’re computing some indice...
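A tiny sketch of the difference described above, using a made-up two-element tensor: `detach()` cuts one result out of the graph, `torch.no_grad()` stops recording for everything inside the block, and neither changes the `requires_grad` flag on the original tensor.

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)

# Normal forward pass: the result is part of the graph.
y = (w * w).sum()
print(y.requires_grad)  # True

# detach(): same value, but this one tensor is cut out of the graph.
y_detached = y.detach()
print(y_detached.requires_grad)  # False

# no_grad(): nothing computed inside the block is recorded at all.
with torch.no_grad():
    z = (w * w).sum()
print(z.requires_grad)  # False

# w itself is untouched by either; it still asks for gradients.
print(w.requires_grad)  # True

# Backprop through the tracked result still works: d/dw sum(w^2) = 2w.
y.backward()
print(w.grad)  # tensor([4., 6.])
```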
Calling loss.backward() performs backpropagation in PyTorch.
You can't backpropagate to a tensor that isn't tracked (i.e. if a tensor has requires_grad=False), so it never gets a gradient.
And without gradients, the optimizer step that follows the backward pass has nothing to update your model parameters with.
So, you see why the parameters we want to train need requires_grad=True in PyTorch. The situation where graph building becomes a pain in the ass is when we've finished training our model and we now want to make predictions.
In this situation we have to inform PyTorch that it shouldn't build the computational graph, since we simply want to make predictions.
The deeper or more complex the branches of a computational graph get, the more time and memory it takes to compute, so that's why we straight up turn it off during inference / prediction time.
Thanks for the detailed response.
So, for the case where:
- I have two modalities: text and vision
- I only want to backpropagate through the vision backbone and keep the text backbone fixed.
Can I put detach on the output of the text backbone? I actually want to conduct a projected gradient descent attack on images, so I only want to modify the images.
One more question: when we do newtensor = oldtensor.clone() or newtensor = Variable(oldtensor.data), and use detach() on newtensor, does the gradient of oldtensor still update if oldtensor is being used elsewhere too?
I mean if oldtensor was used to get embedding A, and newtensor was cloned from oldtensor and detached, and then newtensor was used to get embedding B, and we do loss.backward(), oldtensor will still be updated, right? And oldtensor will be updated independently of newtensor.
And even though newtensor is detached it will change because oldtensor is changing due to its own pipeline.
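Both questions can be checked on toy tensors. A sketch, with two small `nn.Linear` layers standing in for the real backbones (shapes and the cosine loss are made up): detaching the text embedding keeps gradients out of the text encoder while the image and the vision encoder still receive them, and a `clone().detach()` copy does not follow later in-place changes to the original, whereas a plain `detach()` shares memory and does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the two backbones (hypothetical shapes).
vision_encoder = nn.Linear(8, 4)
text_encoder = nn.Linear(6, 4)

# For an attack on the image, the image itself needs a gradient.
image = torch.randn(1, 8, requires_grad=True)
text = torch.randn(1, 6)

a = vision_encoder(image)
b = text_encoder(text).detach()  # treated as a constant from here on

loss = -torch.cosine_similarity(a, b).mean()
loss.backward()

print(image.grad is not None)                  # gradient reached the image
print(vision_encoder.weight.grad is not None)  # and the vision weights
print(text_encoder.weight.grad is None)        # nothing flowed into the text encoder

# detach() alone shares memory with the original; clone().detach() is a real copy.
old = torch.ones(3, requires_grad=True)
shared = old.detach()
copy = old.clone().detach()
with torch.no_grad():
    old += 1.0  # in-place update, like an optimizer step would do
print(shared)  # follows old
print(copy)    # unchanged
```

For a pure PGD attack the vision weights would normally not be stepped either; only `image.grad` is used to perturb the input.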
Good Evening guys. So I tried to install ydata-profiling into the colab but it resulted in an error. can anyone tell me where i did wrong?
Hey guys, does anyone have experience with Triangulation of 3d computer vision and linear algebra ? Please please help me..
Guys I got done with ML specialization by Andrew NG and Mathematics for Machine Learning Specialization by deeplearning.ai on coursera as well. What should I do next?
I went on reddit to search and I was confused by the various suggestions, ranging from CS229 on youtube for a deep dive into the course, the fast.ai DL course, ISL and ESL, cs231n by stanford, or the Deep Learning specialization by deeplearning.ai. I would like to stick to a learning path which makes sure that I have no learning gaps and makes me job ready
Kinda typical of a big community . More people w questions than ppl who can answer
a lot of questions are redundant with the resources in the pinned messages, or simply aren't answerable without a huge amount of effort
what's your goal, and what's your background other than those ML courses?
Mechanical Engineering bachelors and MBA (Marketing major and Business Analytics Minor)
in that case it might be a good idea to pause here and start a personal project to reinforce the things you already learned. doing projects is a great way to learn and get hands-on experience that will be essential when you eventually want to find a job in this field.
it's likely that in the course of any project you'll end up having to learn various things anyway along the way
but yeah, once you've done a project, i think any of those resources you named are good options. there isn't a single learning track to follow. DS/ML is probably one of the most open-ended fields in that respect.
personally i think it will serve you very well to go back and learn some statistics fundamentals, but that material tends to be a little less exciting than deep learning.
so I'd suggest pausing to work on a project, then just pick any of those deep learning courses and do one of them
Ok I think what you are saying makes sense. So I will just strengthen my maths (stats primarily) as of now and will do a project. But if I have to pick up one course, which one would you suggest?
cs229? DL spec? or anything else
I never took any of them so I'm not a good person to answer that question, but i do know that d2l.ai and fastai look decent and should cover state of the art material
isn't cs229 just more of Ng's stuff?
ok buddy. I will do this then
i would go for something a little more advanced given that you already covered the intro stuff
People say the Coursera version has much less of the original content of the ML course, and that cs229 is the real ML class and goes deep into the maths
i see. that could be interesting, but if it's just a more rigorous redo of the course you already took, I'd say it's better to go elsewhere. Plus I think it's valuable to learn from different instructors with different styles
Meanwhile what would you suggest for projects? I have heard good things about kaggle and I have stratascratch and dataquest with me as well
oh ok
Kaggle Titanic is never a bad place to start, but if you have any specific interests or actual work problems, I think that will serve you better if you can find some data and do your own thing
oh ok
But yes there are plenty of project ideas and datasets on Kaggle and elsewhere
Also do you have any experience with either ISL or ESL?
the books?
yes
i have a physical copy of ESL that i browsed through in grad school, it's probably less valuable than it used to be, but it's still kind of interesting as a menagerie of various things people have used for model fitting over the years. i'm sure newer editions are somewhat more relevant. probably the only really valuable chapter is the one that describes gradient boosting, since people still use that.
oh ok
i haven't spent much time with ISL, but it's very popular and might be worth your time. the "statistical" part is valuable for your learning and might be a good entry point into statistics more generally.
ok, actually after doing Andrew NG ML spec on coursera. I dont feel much confident, because the labs were easy and quizzes were quite easy too. A much more handson approach which would improve my knowledge and skills is something which I am looking for, that is why I am skeptical to go for Deep Learning spec
when people say that coursera spec is watered down version of cs229, I really felt that it is true
In addition to that so many resources did leave me confused as hell
that's valid. i'm just concerned about you spending weeks or months re-learning material you've already covered
it can't hurt, it's just a question of whether it will be hard to force yourself through it
frankly i don't love Ng's teaching style
hmmm, i understand your point
I mean people also say Aurélien Géron's book, and see? so many resources 😅 but there is always so little time
I am not too inclined towards cs229, I just want to pick one sequence of learning and then stick entirely to it
I hope I made sense about my concerns
Picking any resource is good and doing multiple on the same topic is also fine
All of the ones that have been listed here are nice so any of them will do
oh ok
any resource which is the most hands-on out of all?
I mean which would get me ready to be able to take on any of the projects from kaggle and build portfolio?
Kaggle itself has resources you can use to learn
Personally I'm a fan of contrast, take a book or resource like the ones listed and then take that to apply it on Kaggle. On their platform you can take their courses afterwards
ISL is my favourite as well
thats nice man
How do I deal with pandas-on-spark giving an error on DirectByteBuffer.<init>? I tried setting the properties mentioned on JIRA but it didn't seem to do anything
how do i declare a variable?
@drowsy jacinth depends on the language
prolly python idk
Def (variable name) = any reason for the variable
do you know?
oh right i forgot about that
python does not have variable declarations. variables are created when you assign something to them
this isn't valid python syntax, nor does it resemble how variables are created
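For reference, a minimal sketch of what that looks like in practice (the names are made up):

```python
# No declaration step: a name exists once something is assigned to it.
count = 3
name = "example"

# The same name can later be bound to a value of a different type.
count = "three"

# Type hints are optional annotations; they don't restrict anything at runtime.
ratio: float = 0.5

print(count, name, ratio)
```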
boo hoo
hey guys
does anyone know how i can properly implement confusion matrix in this code?
https://pastebin.com/z3exrcrr
Not sure what you mean. Is something still not working? Are you all set?
I coded up a k clustering algorithm
After the user draws an image on a grid, the algorithm tries to identify clusters/individual shapes
I want to show the algorithm "learning" by drawing a matplotlib 2D grid and updating it
how would i update the grid without having to close the current window to open the next
i have it in a display function
```
GRID_SIZE = 1
COLOR_MAP = {0: 'white', 1: 'red', 2: 'green', 3: 'blue'}
# Create the figure and axis with equal aspect ratio
if fig is None or ax is None:
    fig, ax = plt.subplots()
    ax.set_aspect('equal', adjustable='box')
# Plot the grid with colors
for x in range(len(grid[0])):
    for y in range(len(grid)):
        color = COLOR_MAP[grid[y][x]]
        ax.add_patch(plt.Rectangle((x, y), 1, 1, fill=True, color=color))
        # Draw crosshairs inside the squares
        if crosshair_coordinates and (x, y) in crosshair_coordinates:
            ax.plot(x + 0.5, y + 0.5, marker='x', color='yellow', markersize=10)
# Set axis limits and labels
ax.set_xlim(0, len(grid[0]))
ax.set_ylim(0, len(grid))
ax.set_xticks(range(len(grid[0])))
ax.set_yticks(range(len(grid)))
ax.set_xticklabels([])
ax.set_yticklabels([])
# Show the plot
plt.show()
```
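A common way to redraw in place, as a sketch rather than a drop-in replacement for the function above: create the figure once, keep a handle to the drawn image, and swap its data on each step with `plt.pause()` in between instead of calling a blocking `plt.show()`. The `show_steps` name and the use of `imshow` instead of one rectangle per cell are choices made for this example.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Same colours as the COLOR_MAP dict, indexed by cell value 0..3.
COLOR_MAP = ListedColormap(["white", "red", "green", "blue"])

def show_steps(grids, pause=0.3):
    """Redraw one figure for each grid in `grids` instead of opening a new window."""
    plt.ion()  # interactive mode: drawing calls no longer block
    fig, ax = plt.subplots()
    ax.set_aspect("equal", adjustable="box")
    ax.set_xticks([])
    ax.set_yticks([])
    # vmin/vmax pin the value-to-colour mapping so it can't shift between frames.
    image = ax.imshow(grids[0], cmap=COLOR_MAP, vmin=0, vmax=3, origin="lower")
    for step, grid in enumerate(grids):
        image.set_data(grid)  # swap the data in the existing artist
        ax.set_title(f"step {step}")
        plt.pause(pause)  # lets the GUI redraw, then waits `pause` seconds
    plt.ioff()
    return fig, image
```

Calling `show_steps(list_of_grids)` with one grid per iteration of the clustering loop animates the labels changing in a single window; a final `plt.show()` afterwards keeps the last frame open.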
One of the problems I'm running into using model outputs as features for my RL algorithms is that I think I should be in theory retraining all the algorithms at each step but that's way too costly so I just do the best I can and fit the model for the training periods and leave test out-of-sample but I'm not sure how else to do it besides just not using model-based features
wonder if anyone else has run into a similar issue and what they did about it
any reccs for an online clustering algorithm? Not sure you can do anything with HMMLearn
gonna check this out https://riverml.xyz/latest/api/cluster/CluStream/
Examples using sklearn.cluster.MiniBatchKMeans: Biclustering documents with the Spectral Co-clustering algorithm Compare BIRCH and MiniBatchKMeans Comparing different clustering algorithms on toy d...
let me let you in on a secret: if a sklearn estimator has the partial_fit method it can be used in an online fashion (edit: corrected)
tip: be sure to update your preprocessing as well
You can call partial_fit on Standardscaler and so on as well and ime it makes a massive difference if your signal is drifting
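A small sketch of that pattern with scikit-learn, on a made-up stream of two blobs: each incoming batch first updates the scaler's running statistics, then updates the clustering model on the freshly scaled data. Batch size, blob positions and `n_clusters=2` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()
model = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

# Made-up stream: two blobs arriving in batches of 50 rows.
for _ in range(20):
    batch = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(25, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(25, 2)),
    ])
    scaler.partial_fit(batch)                   # update running mean/variance first
    model.partial_fit(scaler.transform(batch))  # then update the centroids

# Assign two new points, one near each blob.
labels = model.predict(scaler.transform(np.array([[0.0, 0.0], [5.0, 5.0]])))
print(labels)
```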
I guess this is the problem in going from hmm to kmeans though
CluStream is adapted for online usage
the first figure is what kmeans does on time series data, the second is hmm
if you believe things happen in markov sequences you probably want to use an hmm
and I've been struggling to find an online implementation in a library, I see a lot of people's projects but I don't really want to be spending my time on that
HMM as in hidden markov model?
yeah
yeah but k-means has so many problems
right now I'm using hmm learn and cheating on the training data with the clustering part
i wouldn't use k-means for just about anything nowadays except very quick and dirty EDA
My go-to cluster method, if I'd cluster, is DBscan
i don't think that works online either
It doesn't
But I think clustering is something you learn in uni together with association rules and has way less use cases than touted
i actually don't know any online clustering algorithms. seems like an odd gap in my knowledge now that i think about it.
yeah, it's mostly an EDA and reporting technique
You can make many of them online quite easily
hi guys, my pytorch cant load cublas64_11.dll, could you help me?
can you? i suppose you can with hierarchical clustering, incrementally building up the distance matrix. i suppose you can online-ify dbscan that way too. and maybe hdbscan depending on the details of the graph pruning algo it uses (i don't know them)
I'm using it for dimensionality reduction
I don't want my observation size for the reinforcement learner to be too big
so I'm summarizing some data before I put it into the feature extractor using clustering
The ones based on the EM algorithm (k-means, GMMs, ...) can trivially be made online
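For k-means, "trivially online" can be read as the sequential variant sketched below: each new point is hard-assigned to its nearest centroid, and that centroid is moved to the running mean of everything assigned to it so far. This is a from-scratch illustration (the class and method names are made up), and it still needs the number of clusters and starting centroids chosen up front.

```python
import math

class OnlineKMeans:
    """Sequential k-means: one centroid update per incoming point."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def nearest(self, point):
        """Index of the centroid closest to `point` (Euclidean distance)."""
        distances = [math.dist(point, c) for c in self.centroids]
        return distances.index(min(distances))

    def learn_one(self, point):
        k = self.nearest(point)      # "E-step": hard-assign the point
        self.counts[k] += 1
        rate = 1.0 / self.counts[k]  # "M-step": running mean of assigned points
        self.centroids[k] = [c + rate * (p - c) for c, p in zip(self.centroids[k], point)]
        return k
```

Feeding points one at a time with `learn_one` never requires holding the full dataset in memory.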
anything that requires you to choose a # of clusters in advance is a non-starter for me unless it's for EDA or i have strong domain knowledge
as we see in the output above, k-means isn't smart enough to handle that in online use
For other ones, it depends. We covered this in uni I'd have to check my notes I made for hierarchical clustering etc.
Maybe not explicitly but we had to cluster by hand with pen and paper, doing that shows you which is and isn't online as a side-effect
hah that's a great assignment
also there was some paper from NYU a few years ago about combining RL with HMM for financial time series
So what you really want is an online dimensionality reduction method that works on time series. Clustering is just a possible implementation?
that's why I chose to use an HMM
yes
i was inspired by remembering that NYU paper though about people applying RL with HMM as input
A bespoke dimensionality reduction method that always works is some sort of autoencoder
would you fine tune it as data comes in?
Yesn't
Maybe that's not the smartest thing to do but it's what I came up with off the top of my head. I'd have to read papers.
The basic idea is that you just have an autoregressive encoder and decoder. At inference time you just take the encoder and roll with that.
Maybe I'd store the last N data points and finetune it in some batch job so the thing doing inferencing doesn't need to hold both the encoder and decoder but that's an implementation detail 😄
by "autoregressive encoder" do you mean like an RNN? or do you mean that the input is something like "the previous output" and "the current input" ?
i'm actually curious @agile owl how you were planning on using the clustering for dimension reduction
online PCA is a thing for example
In the current context of data explosion, online techniques that do not require storing all data in memory are indispensable to routinely perform tasks like principal component analysis (PCA). Recursive algorithms that update the PCA with each new observation have been studied in various fields of research and found wide applications in industri...
yup the former
Maybe taking the lags and using a standard dimensionality reduction method could also work. It depends on the downstream task.
I'd actually start doing that and using PCA or similar. Occam's razor and all.
all that said, this river library looks interesting and i'm going to bookmark it. pretty much everything i do now at work is "online" to some extent and i have very little to go on, i'm making a lot of it up as i go
Why doesn't river do any input validation?
wat
meanwhile here they are doing input validation https://github.com/online-ml/river/blob/e16ce2b562c51ca27e6cbfab4e6b36fa3068de48/river/drift/kswin.py#L88-L96
river/drift/kswin.py lines 88 to 96
```python
super().__init__()
if alpha < 0 or alpha > 1:
    raise ValueError("Alpha must be between 0 and 1.")
if window_size < 0:
    raise ValueError("window_size must be greater than 0.")
if window_size < stat_size:
    raise ValueError("stat_size must be smaller than window_size.")
```
the input validation in sklearn isn't exactly intrusive or excessive... it's pretty reasonable to expect a library to check array shapes and dtypes for you
otherwise you end up with impossible to debug 50-line tracebacks
sincerely, someone who survived using pandas before v1.0
sklearn has a lot to offer there as well 😄
My thesis was on online-ML, specifically concept drift, and it was done fully with sklearn. partial_fit is definitely your friend here.
You can still use the common Pipeline interface and pass in fit_params to specify which parts need to use partial_fit, basically StandardScaler and the model
thanks for the input I'm eating right now gonna decide what to do after this meatball parm
true. i've used scikit-learn a lot but never the partial fitting stuff
yum
same
so the thesis of that paper was that a two-state hmm learns the difference between high variance and low variance regimes in the time series and that is meaningful information for the reward structure in question
so i was using HMM states, which I consider to be a form of clustering
so each observation frame consists of a vector of information at that time step, including the hmm latent state prediction (i.e., the cluster) where that state prediction gives conditional information about the variability in the time series
e.g., an hmm conditions on a length 5 window of rolling differences of the time series value and conditional variance estimates across the time axis
it summarizes that information that would be a 5x(2n) matrix into a scalar
so my observation for the RL learner just takes that scalar instead of having another 5x(2n) elements in it
also I'm not sure the Mlp policy would actually do a good job of being able to learn that type of information itself
basically looking to implement some version of this: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4556048
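To make the "two-state HMM as a variance regime detector" idea concrete, here is a rough sketch of the online filtering step only. It assumes zero-mean Gaussian emissions and hand-picked parameters (`sigmas`, `stay_prob`), and it does not learn anything: parameter estimation is a separate problem. `TwoStateVolFilter` is a hypothetical name, not the paper's method or a library class.

```python
import math


def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


class TwoStateVolFilter:
    """Forward (filtering) recursion for a 2-state HMM with fixed parameters.

    State 0 = low variance, state 1 = high variance. All parameters are
    assumptions chosen for illustration.
    """

    def __init__(self, sigmas=(1.0, 5.0), stay_prob=0.95):
        self.sigmas = sigmas
        self.trans = [[stay_prob, 1.0 - stay_prob], [1.0 - stay_prob, stay_prob]]
        self.belief = [0.5, 0.5]

    def update(self, x):
        # Predict: push the current belief through the transition matrix.
        predicted = [
            sum(self.belief[i] * self.trans[i][j] for i in range(2)) for j in range(2)
        ]
        # Correct: weight each state by how well it explains the observation.
        weighted = [predicted[j] * gaussian_pdf(x, self.sigmas[j]) for j in range(2)]
        total = sum(weighted)
        # If both likelihoods underflow to zero, fall back to the prediction.
        self.belief = [w / total for w in weighted] if total > 0 else predicted
        return self.belief[1]  # probability of the high-variance regime


vol_filter = TwoStateVolFilter()
calm = [vol_filter.update(x) for x in (0.1, -0.2, 0.3)]
stressed = [vol_filter.update(x) for x in (6.0, -7.5, 5.0)]
```

The returned probability is the scalar summary that would go into the observation vector in place of the raw window of differences.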
Just wondering why would someone use an algorithm, such as NEAT, over other machine learning algorithms? What situation would only apply to an algorithm such as NEAT, but no other machine learning algorithm? (I wanna use NEAT for something but want to know its advantages).
????
neat requires the use of machine learning, as it is a technique for automatically creating topologies for machine learning models.
yes
sorry I think i misread your question
```
1. Implement online learning for temporal clustering regime state algorithm
2. Implement online learning for variance model
3. Embed 1 + 2 as components of Reinforcement Learning environment to be updated on each timestep
4. Create type representing the model configuration to reduce overall args being passed around
5. Create declarative config for main script run that will iterate over configurations for experiments and save experiment results
6. Create (co)variance-weighting mechanism to create composite strategy and analytical tools for model combinations
```
laziness is powerful
what do you mean by no other machine learning algorithm?
like avoiding another hyperparameter optimization algorithm, or like avoiding fancy optimizers, etc?
NEAT just tunes your model's topology, there's no reason to not use it along with other optimization methods unless you have a competing topological optimization algorithm you want to try.
I haven't used it but my understanding is it's not just the topology it's solving for but also the weights
EA is an optimization strategy
it's as opposed to something like Adam
makes sense to me. that's not usually what people think of as "dimension reduction" but it sounds reasonable, and HMM has particular intuitive appeal for this task
Hey guys, I am making a face emotion detection system for my final year project. Does anyone have any ideas about similar systems?
well it's dimension reduction in the sense that the alternative of the latent state model is to feed the would-be inputs of that model directly into an MLP
I agree wasn't the best terminology
I need to figure out how to use light gbm online in a reasonable way too
I kind of want to use an online version of HMM for the reasons discussed but I'm having a hard time finding one and I don't trust myself to implement it from scratch
Hi everyone, I'm learning AI as my specialization. I want to ask if I have to learn competitive programming too to be good in my field?
So, I took the plunge and tried Polars instead of Dask to do a Big Giant Sparse GroupBy (basically doing a GroupBy on 10k or so sparse columns with like 1m rows, and maxing them), and it seemed to do it in 2 minutes? (vs Dask taking like 90 minutes for half the rows, and the original Pandas code taking like 4 hours)
Which is awesome but...why would that happen, I wonder? Like why would it be so much faster than Dask?
I also realized that by converting my feature models into online versions and updating them along with the reinforcement learner policy that they are actually embedded in the learning environment rather than the model itself when it gets saved so what I need to do is implement saving them separately from the reinforcement model on the filesystem and make my gym env accept loaded model objects from kwargs and branch the logic on whether it instantiates a new one based on those kwargs
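Roughly what that could look like, as a sketch under assumptions: `RunningVariance` is a stand-in for the real online feature model, `ToyEnv` is a hypothetical environment, and state is stored as JSON only to keep the example self-contained (for real model objects something like pickle is a common substitute).

```python
import json
import tempfile
from pathlib import Path


class RunningVariance:
    """Stand-in for an online feature model (hypothetical, illustration only)."""

    def __init__(self, n=0, mean=0.0, m2=0.0):
        self.n, self.mean, self.m2 = n, mean, m2

    def learn_one(self, x):
        # Welford's incremental update of the mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def predict_one(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def save(self, path):
        state = {"n": self.n, "mean": self.mean, "m2": self.m2}
        Path(path).write_text(json.dumps(state))

    @classmethod
    def load(cls, path):
        return cls(**json.loads(Path(path).read_text()))


class ToyEnv:
    """Hypothetical env: reuses a loaded model from kwargs or builds a new one."""

    def __init__(self, variance_model=None):
        if variance_model is None:
            variance_model = RunningVariance()
        self.variance_model = variance_model

    def step(self, x):
        self.variance_model.learn_one(x)
        return self.variance_model.predict_one()


env = ToyEnv()
for x in (1.0, 2.0, 4.0):
    env.step(x)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "variance_model.json"
    env.variance_model.save(model_path)  # saved separately from the RL policy
    resumed_env = ToyEnv(variance_model=RunningVariance.load(model_path))
```

The branch in `__init__` is the only place that decides between a fresh and a restored model, so the rest of the env does not need to know which one it got.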
everything gets so much... messier when you start to do online learning
I'm really enjoying the river API so far
very natural to use it while iterating over my sequence for reinforcement learning
I love river already it's great
What RL algorithm are you using?
I think you could further simplify stuff by using an RNN instead of a vanilla MLP
the SB3 implementation of SAC doesn't support a recurrent layer
you can get it for PPO though
SAC being soft actor critic?
yeah
I am already stacking the observations with a buffer
I read that in most applications that gives the maximum benefit you'd see from using a recurrent network anyway
for RL applications I mean
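A minimal sketch of that kind of observation stacking, assuming each state frame is a flat list of floats. `FrameStack` is a made-up helper, not the SB3 wrapper:

```python
from collections import deque


class FrameStack:
    """Keep the last n_frames state frames and expose them as one flat observation."""

    def __init__(self, n_frames, frame_size):
        # Start with zero-filled frames so the observation shape is constant.
        zeros = [[0.0] * frame_size for _ in range(n_frames)]
        self.frames = deque(zeros, maxlen=n_frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(list(frame))
        # Flatten oldest-to-newest.
        return [value for f in self.frames for value in f]


stack = FrameStack(n_frames=3, frame_size=2)
stack.push([1.0, 0.1])
stack.push([2.0, 0.2])
obs = stack.push([3.0, 0.3])
obs_next = stack.push([4.0, 0.4])
```

The policy then sees a fixed-length window of recent history on every step, which is the part a recurrent layer would otherwise have to carry in its hidden state.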
I'm too rusty on RL but representation learning was always a big part of it
Which you'd get by using an RNN or similar
the LSTM states you mean
that would be a possible way to go with PPO
if I felt like implementing recurrent SAC I could do that too
I already have a need for a specially crafted online model as an input anyway
it's a regressor for the time series variance
that can't be left to the policy network, it's "intelligently designed"
I'm using CluStream for the latent state model and the AMF regressor for the variance regressor
both of them the river implementation
still doing testing, the training time has basically doubled because I did the variance model and latent state using batch methods and leaked information across the training dataset before
that's the main impetus to switch to online learning
because information was leaked over the training period although it didn't affect the test results it presumably led to suboptimal out-of-sample performance
Let me abstract away from the details for a sec. All RL has a similar problem. You move from tabular RL to using function approximation which allows you to "group" related states. You're no longer updating a single Q[S,A] or V[S] entry.
Defining relatedness is the key question now. You can do it manually with feature construction, which is what you're arguably doing, and then pass it on to a function approximator. The second thing you could do is pick an MLP with the right inductive biases and it'll sort this out on its own.
designing MLP networks to achieve certain goals like that is a skill issue for me
For time series it's exactly why you'd pick a recurrent neural net
Or a CNN with dilated convolutions
I'd have to break free of sb3 to do that which is the ultimate goal but not the time yet
I guess I could do the recurrent PPO
but the PPO results just aren't compelling
compared to what I was getting with SAC
maybe with the changes it will be different
things are coalescing quite nicely though the ultimate goal is to roll my own policy network
I never got to covering PPO/SAC when I was doing RL so I wouldn't have any pointers there in all honesty
PPO is on-policy and SAC is off-policy and for financial data I think it's clear that off-policy is better because you can't just turn on a data spigot
so maximum exploitation is important
although the PPO results weren't terribly bad in most cases
I might as well give it another shot I have it modularly set up to just switch out the models and the policy networks
need to see where the SAC model ends up on the test data though
Both of them are policy-gradient methods I presume
yes
Off-policy stuff can have unintended consequences though as you probably know already
More risky moves until it has converged to the optimal policy
solid
if the expected variance of the reward goes above a certain value
it clips the reward on a linear gradient down to zero after a certain value
or if the reward is negative it is unclipped
Hey guys I need some help
Is there anyway to take linux system calls and turn them into stack traces
reward = ret / prev_var if prev_var else 0
if np.isnan(reward):
    reward = 0
if prev_var * self.var_sigma > self.soft_var_limit:
    reward = min(
        reward,
        max(0, (1 - prev_var * 1 / (self.hard_var_limit - self.soft_var_limit)))
        * reward,
    )
Something like this where prev_var is directly related to the variance of the time series and the weight the agent is putting on it (absolute value of positive or negative weight * the conditional stdev * some multiplier)
this is why that engineered variance model is so important
it controls a lot of logic in the environment
so with something like this the environment isn't strictly defined, I can add my own rules as long as they are compatible with reality
I guess that could be applied to robotics too but it's a lot more obvious here that I don't have a strictly defined environment and I can add my own rules to it
and a lot of them depend on a variance estimate
it's like PPO is the guy who is doing the least possible effort to get a good grade and SAC is the gunner in the class
PPO is like "Hey, I passed right?" SAC is like "I want to have the highest average in the class ever"
so I figured out I could get better test results by constraining SAC than using PPO
I'm not sure how to make PPO more ambitious but I know how to rein in SAC somewhat
which YOLOV model should i use for my first cv detection of common everyday objects?
Hi, I'm trying to use Python to create a program that classifies input text (expenditure items) into a list of categories. I'm looking for a library that can help me with this, any recommendations?
I've heard you talk about Naive Bayes but I'm a bit lost here.
scikit-learn
I will take a look at it, thank you
Can you show the code? Dask might have a lot of overhead from passing data around between processes. As far as I know, the actual code execution engine in Dask is still just Pandas
Whereas if you're using Polars with lazy and maybe also streaming, you'll be running an optimized query, using a faster execution engine, distributing the work among threads rather than processes, which has much less overhead
This is spot on. Polars' execution model is closer to that of a database when you're using the lazy API. It pools a number of operations and then has a query optimizer pick an optimal way to arrange them.
how can I send the code like this?
you need to surround your code with 3 backticks (`) if you want to have a code block or between 1 backtick if you want to have it inline.
Example: you do inline `like this`
A code block like this
```python
print("hello world")
```
(note: it's not showing because I escaped all the backticks)
''' print("Hola Mundo") '''
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Download the stopwords and punkt if you don't have them yet
# nltk.download('stopwords')
# nltk.download('punkt')
rutaDataSet = 'datos_gastos.csv'
# Load the data from a CSV file
data = pd.read_csv(rutaDataSet)
# Function to clean and preprocess text
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords.words('spanish')]
    text = ' '.join(tokens)
    return text
# Apply the preprocessing function to the dataset
data['Concepto_de_gasto'] = data['Concepto_de_gasto'].apply(preprocess_text)
# Initialize the vectorizer
vectorizer = CountVectorizer()
# Vectorize the 'Concepto_de_gasto' column
X = vectorizer.fit_transform(data['Concepto_de_gasto'])
y = data['Categoria']
# Initialize and train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X, y)
# Function to predict the category from a text string
def predecir_categoria(texto):
    texto_preprocesado = preprocess_text(texto)
    texto_vectorizado = vectorizer.transform([texto_preprocesado])
    probabilidad_prediccion = classifier.predict_proba(texto_vectorizado)
    max_probabilidad = max(probabilidad_prediccion[0])
    if max_probabilidad < 0.5:  # If the probability of being correct is below 50%, return the unknown category
        return 11
    else:
        prediccion = classifier.predict(texto_vectorizado)
        return prediccion[0]
thats the code i get from now, i have another file to call the function "predecir_categoria"
some parts are in spanish , sorry
import NaiveBayesModel
# Now you can use the predecir_categoria function
texto_gasto = "Estanco"
categoria_predicha = NaiveBayesModel.predecir_categoria(texto_gasto)
print("La categoría predicha para el texto de gasto es:", categoria_predicha)
Thats my main
im really confused about how to like download and install YOLOv5
wth am i suppose to download
!code in case you forget
I tried to run yolov8 on ultralytics on ubuntu 22.04 but it showed some errors here like wayland and i don’t know how to fix them
I actually don't have Discord on my work laptop, but it's something like:
*make DaskML OneHot Encoder, fit with a different DF*
*encode a column and get like 10k sparse columns, GroupBy another column & max*
vs
polars get_dummies
GroupBy & max
I tried the threaded version of Dask too, and I was using the Eager version of Polars. I think Polars might just be a little better about parallelizing high-cardinality GroupBys for some reason?
were you able to get a sense of the bottleneck? it's just the groupby?
Green button “code”
And download then extract it
managed to download the folder
do you know how can i make it work?
Open terminal and type “pip install -r requirements.txt” to download libraries
what gpt suggested doesn't seem to work
yeah i have done this
yes
didn't work
so where can i learn this?
You Search for videos that have been posted recently
Welcome to my comprehensive guide on building a custom YOLO (You Only Look Once) object detection model tailored to your specific dataset! Whether you're in computer vision research or developing practical applications, this tutorial will walk you through the entire process.
In this video, I'll cover:
Data Collection and Annotation: Learn to g...
Maybe this vid
I regret buying a laptop with RX5500M
It’s uncomfortable
I think so - cuz it's the GroupBy where it would hang. Everything else completed beforehand. And I've seen stuff on Dask having trouble with high-cardinality GroupBys. Even with p2p shuffling, it's still not ideal.
One other possibility is that Dask doesn't fully play nicely with Pandas' Sparse Columns atm? Polars doesn't support them either, but I can up my memory (and at least it's fast). Maybe just for thoroughness I'll try Dask on a dense representation and give the old girl one last shot lol
interesting results. yeah i wonder if dask is moving the data around too much, or otherwise not maximizing efficiency
my intuition in spark is always that it's designed to make operations on big data possible, rather than making operations fast
i wouldn't be surprised if dask followed a similar philosophy: make bigger data processing possible via distributed computing, rather than maximizing throughput
whereas polars is designed to work all very tightly in a single process with a thread pool on local arrow arrays. it's a very different execution model with much lower overhead and many more opportunities to optimize
and the core operations are written in a very fast runtime, whereas afaik dask is largely python aside from whatever numpy and pandas do internally in compiled extensions
Yeah tbh I think that has maybe even gotten more true as time has gone on, I think they often tell people "if you just want multicore Pandas, use Modin" and are specializing more in For Real Big Data lol.
But I've got my ingrained habits, dasknabbit
makes sense. i've only used dask for parallelizing parameter search when i was hitting weird pickling problems with multiprocessing.Pool and joblib
i liked the nice web UI so i could see progress
But Dask also has their nice sklearn analogs and other stuff that Modin doesn't.
interesting, i think dask-ml was very immature when i last used it
i've never actually done anything with dask for "big data" processing, at that time i had access to databricks so i used spark for big stuff
and i like to brag about how in R data.table i was able to handily work on a 1 billion row time series dataset just on my 2015 MBP with 16 GB RAM, with several browser windows etc. all open while doing so. just one core and highly optimized code.
That's still the main thing we use Dask for lol. Especially that exact thing, Joblib giving me lip lol
lol
and imagine, joblib is still better than the default...
i think at some point i gave up on joblib's caching and just loaded the data from disk in each worker process
i had a whole lot of cool ML framework helper code i lost when i left that job, i was heavily burnt out at the time and didn't do a good job of archiving my work
I fear this now that I'm on a w2 and not a contractor anymore!
on the flip side, stay at your job for a while and you won't need to worry about it 😛
Can someone please help me with pytorch, it throws an error when i import it
the rightmost column is time per episode for my reinforcement learner with incremental feature learners. Will these incremental learning algos eventually reach a stable performance or will the performance just keep decaying forever
good news is it seems to have found a deep and steady gradient at least
You still going I see, I'm gonna get back to Shakespeare tomorrow, had to spend a couple days with resume writing.
Jesse, we need to cook AI
what do the kids mean when they say "let him cook"
don't criticize something until you see more of what's being made
Is cooking good? Like a musician is really cooking if they're playing well. Or is cooking bad? Like they're burning up because they fucked up and they're suffering the consequences
Ahhh i see
A neutral third path
Honestly that's kind of an interesting concept to package up in a saying
Ty for the education
I only know that from watching meme code review videos and people in twitch chat going wild and the host is like let him cook let him cook as he's reading through it
Yeah, no one likes to eat raw food. Gotta let it cook first
...meme code review?
Re Dask: I used it a bit and I can't say I enjoyed the developer experience. When a thread failed the entire thing just failed without showing me an error / stack trace
anyone know of library that has an incremental booster that can be trained starting from one obs?
I don't know what happens if you try to start lightgbm with one observation, but the way they've done it really makes it seem like it's intended to be instantiated with the majority of the data, and if you update it then you do that after the model is already trained
kind of annoying how they designed it
you can't just instantiate the model directly and iterate over a dataset calling update which is how I was hoping it would work
the mondrian forest is okay but I think a boosting algorithm would be better than a forest for this data
i don't think boosting can really work incrementally on data like that
unless it's specifically an online version of boosting, which does exist
that's what I suspected but I saw people do some stuff
MEME CODE REVIEWS ?!?!?!?!?!?
so you'd need something that specifically supports online training
but you mean it has to be a totally online model
not just training the existing versions in an online fashion
although lightgbm does KIND of support updates it's a really awkward api
as far as i know you need a different algorithm, but i could be wrong, or maybe lightgbm implements it
I realized I forgot something that was making the AMF perform a lot worse than lightgbm did in the batch version though and I fixed it
I'm actually surprised how easy it was to just replace that stuff
I guess I should stop getting surprised it's python after all
river is clutch
I know it's a long shot, but if anyone knows the answer to this, I'd appreciate it: https://stackoverflow.com/questions/77789702/glitchy-mp4-file-saved-with-matplotlib-animation-via-ffmpeg
Please ping me if you reply
Strangely enough while I had a smooth train using SAC on one time series, switching to another it wasn't able to actually find any gradient at all it seemed, but when I switched to DDPG it did, which I did not expect because I thought SAC was overall the more robust at finding the gradient
Currently, I would like to visualize the predictions generated by the model I created. However, the output provided for each row and column appears as the same image.
I made the code as follows:
import pandas as pd
pred_df = pd.DataFrame({'y_true': y_labels,
                        'y_pred': pred_classes,
                        'pred_conf': pred_probs.max(axis=1),
                        'y_true_classname': [class_names[i] for i in y_labels],
                        'y_pred_classname': [class_names[i] for i in pred_classes]})
pred_df['pred_correct'] = pred_df['y_true'] == pred_df['y_pred']
top_100_wrong = pred_df[pred_df['pred_correct'] == False].sort_values('pred_conf', ascending=False)[:100]
images_to_view = 9
start_index = 0
plt.figure(figsize=(15, 10))
for i, row in enumerate(top_100_wrong[start_index : images_to_view].itertuples()):
    ax = plt.subplot(3, 3, i+1)
    _, _, _, pred_prob, y_true_classname, y_pred_classname, _ = row
    ax.imshow(images[i].numpy().astype("uint8"))
    plt.imshow(images/255)
    plt.title(f'actual: {y_true_classname}, pred: {y_pred_classname} \nprob: {pred_prob:.2f}')
    plt.axis(False)
Can you tell me what's wrong with the code?
it should be 101 pictures of food
I want to visualize the most wrong predictions
right but it's not defined in what you posted
and the error could be there for all we know
yeah, but do you know how to fix this?
I think you want a confusion matrix?
No. I want to make a visualize for the most wrong predictions
I don't understand what you mean by that
You can define "most wrong" in several ways, most true positives, false negatives, in terms of proportion and so on
here's what I mean
the image above is taken from another source
I still don't know what you want but that's okay, maybe someone else will get it 👍
log all predictions and sort them by loss
you will need some way of marking images in each batch
fastai can find the worst predictions, idk how though
How to do this?
might as well just save the images if loss is big
maybe limit list size if you get too many of them
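One possible reading of "sort by loss and cap the list", as a small sketch. It assumes `pred_probs` is a list of per-class probabilities and `y_true` holds integer labels (names borrowed from the code posted earlier), and `most_wrong` is a made-up helper. The returned index is what you would use to look up the matching image.

```python
import heapq
import math


def most_wrong(pred_probs, y_true, k):
    """Return (loss, sample_index) pairs for the k highest cross-entropy losses."""
    eps = 1e-12  # avoid log(0) when the true class got zero probability
    losses = (
        (-math.log(max(probs[label], eps)), index)
        for index, (probs, label) in enumerate(zip(pred_probs, y_true))
    )
    # nlargest only ever keeps k items, so the list size stays bounded.
    return heapq.nlargest(k, losses)


pred_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.05, 0.05, 0.90],
]
y_true = [0, 0, 1, 2]

worst = most_wrong(pred_probs, y_true, k=2)
worst_indices = [index for _, index in worst]
```

Keeping the original sample index next to the loss matters: indexing the images by loop position instead of by that index would show the wrong pictures.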
hi!
i just want to put the x in a step of 1 instead of 5, is there any way to do that?
the key word is "ticks". each marker on the axis is a "tick"
thank you!
btw, i have another graph with this kind of individual filled upward lines, i want to just make it continuous like this in green, how can i do that?
show your code
!paste
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
how do i train YOLOv5 with this dataset i found?
Hello
How can I switch a live face recognition system to recognizing faces in photos?
Someone can help me?
Please
hello
i want to know the how the student performance is affected on the dataset https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
now the question is what approach should i consider while performing statistical analysis
like anova ftest or t-test like i know how to do them but don't know to apply on dataset
what is the best way to train a gpt model off of api docs? I created a json file from redocs for an api and I wanted to know the best way to train a gpt model or something else in langchain. I tried using gpt assistants but tbh it hallucinates a lot. Anyone have any suggestions?
i got a dataset here, these folders are full of images of each category by the name of the folders, all i want to know is how do i train YOLOv5 on it 😬
it is subtle
in the metric attention the number of parameters goes down as you increase the number of heads
I forgot my calculations, but in the limit it goes down to 0.5 or 0.75, something like that
n_param_metric/n_param_transf = 3/4 + 1/(4n)
where n is the number of heads
they both scale with O of c**2 where c is the dimension size of the embeddings
transformer scales with 4c**2
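Taking the ratio formula and the 4c**2 figure above at face value, a quick sketch of what the numbers look like as the head count grows. The function names are made up for illustration:

```python
def param_ratio(n_heads):
    """Ratio n_param_metric / n_param_transf = 3/4 + 1/(4n) from the formula above."""
    return 3 / 4 + 1 / (4 * n_heads)


def metric_attention_params(c, n_heads):
    """Approximate parameter count, assuming the transformer block has 4 * c**2."""
    return 4 * c**2 * param_ratio(n_heads)


ratios = {n: param_ratio(n) for n in (1, 2, 4, 8, 64)}
params_512_8 = metric_attention_params(c=512, n_heads=8)
```

With one head the two counts are equal, and as the number of heads grows the ratio approaches 3/4, which matches the 0.75 limit mentioned above.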
Hello world,
I am using PyTorch for generating pixel art by using GAN, my train model works however I don't know how to increase the quality. Maybe you have some ideas how to do or what methods/algorithms I need to use?
I am attaching two images (left one of the image from my dataset where the quality is pretty good and on the right is mine generated by my model)
how big is your dataset?
3706 files
def step(self, action) -> tuple:
    assert self.action_space.contains(action)
    self.prev_portfolio_value = self.current_portfolio_value
    self._update_px()
    self._update_regime_model()
    self._update_regime()
    self._calc_rolling()
    self._accrete_carry()
    self._update_variance_model()
    self._update_variance()
    self._update_vm()
    self._update_im()
    if self.balance < 0:
        self._manage_liq_deficit()
    amount = round(action[1], 0)
    if action[0] > 0:  # Buy
        self._buy(amount)
    elif action[0] < 0:  # Sell
        self._sell(amount)
    self._update_return()
    self._update_risk()
    self._update_sharpe()
    reward = self._get_reward()
    self.render(action, amount)
    state_frame = self._get_state_frame()
    self.state_buffer.append(state_frame)
    obs = self._get_observation()
    self.current_step += 1
    if self.current_portfolio_value < 0.6 * self.max_portfolio_value:
        pass
        done = True
        self.reset()
    elif self.current_step == len(self.periods[self.period]):
        done = True
        self.reset()
    else:
        done = False
    self.done = done
    info = {}
    return obs, reward, done, False, info
How I learned to stop worrying and love the state
`import pandas as pd
import os
!pip install ydata-profiling
from ydata_profiling import ProfileReport
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_score, silhouette_samples
import sklearn.metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, log_loss, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import scipy
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"`
i wanted to install ydata profiling but it ended in error
`/usr/local/lib/python3.10/dist-packages/typeguard/_importhook.py in <module>
20 from collections.abc import Buffer
21 else:
---> 22 from typing_extensions import Buffer
23
24 if sys.version_info >= (3, 11):
ImportError: cannot import name 'Buffer' from 'typing_extensions' (/usr/local/lib/python3.10/dist-packages/typing_extensions.py)`
where i do wrong?
can you guys say something? because tbh i just wanted to finish my portfolio so i can look for a job
you might consider some data augmentation
if you haven't already
rotate, color, add noise, etc to the images to increase the size of your dataset
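A tiny sketch of what that could look like, assuming a grayscale image stored as a nested list of 0-255 integers (RGB would need the same treatment per channel). All function names are made up, and whether noise actually helps for crisp pixel art is something to check on your own data:

```python
import random


def flip_horizontal(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def add_noise(img, rng, scale=8):
    """Add small integer noise to every pixel, clamped to the 0-255 range."""
    return [
        [min(255, max(0, px + rng.randint(-scale, scale))) for px in row]
        for row in img
    ]


def augment(img, rng):
    """Return noisy copies of the original, its mirror, and both rotated."""
    variants = [img, flip_horizontal(img)]
    variants += [rotate_90(v) for v in variants]
    return [add_noise(v, rng) for v in variants]


rng = random.Random(0)
image = [[0, 50], [100, 255]]  # toy 2x2 grayscale image
augmented = augment(image, rng)
```

Each source image turns into four training samples here, and more transforms can be chained the same way.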
@past meteor heey i have a question when and why we do differencing before finding the correlation i found the answer to why but not when so i wanna confirm from you
I suppose you're specifically talking about auto-correlation (ACF/PACF) here right?
You do differencing to remove the trend
o okay will try, thanks
not specifically for acf and pacf like i have seen in some videos that when we find correlation b/w 2 different time series we sometimes do differencing not sure when those sometimes are
and in acf ik the the time series should be stationary so seasonality and trend must go away and that why differencing is needed (correct me if i am wrong)
anyone ?
@wooden sail
computing the ACF does not require stationarity
the general ACF depends on the two time lags instead of their difference, though
the differences and the levels are two different things
people look at correlation of levels and correlation of differences
it depends on what you're trying to do
if i just want to find correlation b/w 2 different time series then in what scenarios should i 1st do differencing and then find the correlation coefficient
can you please elaborate further if you dont mind ?
yea my bad thnx for correcting me
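One common scenario where differencing first makes sense is when both series trend. A sketch with synthetic data, assuming two series that are unrelated apart from each having its own linear trend. The slopes, noise level and seed are arbitrary choices for illustration:

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def diff(xs):
    """First difference: x[t] - x[t-1]."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


rng = random.Random(42)
n = 200
# Two series with independent noise; the only thing they share is "time goes up".
a = [0.5 * t + rng.gauss(0, 1) for t in range(n)]
b = [0.3 * t + rng.gauss(0, 1) for t in range(n)]

corr_levels = pearson(a, b)              # dominated by the shared trend
corr_diffs = pearson(diff(a), diff(b))   # trend removed, only the noise is left
```

The levels look almost perfectly correlated only because both drift upward, while the differences show there is little actual co-movement. If the question is about the long-run relationship of the levels themselves, correlating levels can still be the right thing to look at.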
what should i learn first for ai and data science - numpy, pandas or matplotlib
that order that you just said.
but you also need to learn DS/AI theory itself, as an entirely separate thing from "learning libraries". DS/AI is a scientific field that uses programming. it's not a programming field.
and the libraries themselves don't make a lot of sense if you don't know conceptually what you're doing
I started my first job working with an actual dev team. Transitioned from a data analyst, but python was always my go-to analytics tool. Although I was never a python developer in the traditional sense, I was taught the proper conventions that [attempts] to make code readable to others (of course, no one gets them all). I now work out of databricks notebooks with pyspark. I took a peek at the prod utils repo, and there were some things that didnt sit well. The biggest being one line importing a few functions from a module, and in the same cell, importing the whole module as a qualifier. EG:
'''fake_example.py'''
from pandas import DataFrame, read_csv, Series
import numpy as np
import pandas as pd
---some custom functions here---
and in another notebook they create even more redundancy, best way to show is with a fake example:
%run fake_example.py
from pandas import read_json
%run some_other_helper.py
something = custom_func(var)
---some other logic/flow control---
hard_to_follow = some_other_custom_func(read_json(sloppy.json))
so its not exactly clear which function comes from where, although I think databricks can tell you the source somehow. I know notebooks are a completely different animal, but I didn't expect this.
The funniest thing is that they tell us not to use the pyspark-pandas api because it is too slow. And I believe it. But then it seems to run incredibly slow regardless.
is it possible that these redundant imports could be slowing things down?
@covert finch "is it possible that these redundant imports could be slowing things down?"
the work is only done the first time you import something. for all subsequent imports, python just sees that you already imported it and does nothing.
that's globally. not per module
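A small sketch of the caching described above, using the standard-library `json` module as a stand-in for pandas. Repeated imports are lookups in `sys.modules`, and `from ... import` only binds another name to the same object:

```python
import sys

import json                      # first import: the module is loaded and cached
first = sys.modules["json"]

import json as json_again        # second import: a cache lookup, nothing is re-executed
from json import dumps           # binds one more name to an object inside the same module

same_module = json_again is first
same_function = dumps is json.dumps
print(same_module, same_function)
```

So importing names individually and also importing the whole module costs nothing extra at runtime; it is purely a readability question.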
I see, that makes sense
Either way, it is so out of my lane to question that codebase.
I will keep my head down and do my job
Theres also a lot about spark and databricks that I havent fully grasped may explain different conventions
from pandas import DataFrame, read_csv, Series
import numpy as np
import pandas as pd
This allows you to use DataFrame without having to write pd.DataFrame f.e. So it is not redundant.
Although normally people just write pd.DataFrame
@covert finch
redundant is the wrong word. I understand now its not like its using more resources. Its just... qualifiers exist for a reason, ya know?
Although I definitely see them a lot less in pyspark
I actually think its because im so used to pandas and numpy
A lot of people only import a few functions of course, I just dont understand why you would do both.
Its easy to judge someone elses code. I shouldnt have that mindset.
Is duckdb faster than polars
It is possible for both things to be true simultaneously: you are a noob and don't understand why things are the way they are, and the person who wrote the code is either inept or refuses to do things the conventional way or both
If anything, sometimes the perspective of a newbie is valuable
But if somebody is stubborn enough to avoid writing code the way everybody else writes code, they might not be interested in other peoples perspectives
That's a good question that can only be answered by benchmarking on your particular use case of interest. Duckdb does have a benchmark suite in which duckdb outperforms polars by a tiny bit: https://duckdblabs.github.io/db-benchmark/
I would expect them to be similar on most workloads, since they use similar design
I wouldn't be surprised if duckdb was slightly faster overall, it doesn't need to worry about interoperation with python as much
Its just pretty much appending and resampling on many large dataframes
I really appreciate this response. Knowledge and experience are a world apart. I may be able to write decent python code, follow conventions, use git blah blah. But I'm a total noob as a programmer. That blog post "Learn to Code in 10 Years" really changed my perspective
Polars has been the fastest library I've tested so far in reading in large files (flat files). It was wayyyyyyyyyyyy faster than pandas thats for sure. But interestingly, once the files were loaded, pandas and polars performed similarly on simple arithmetic
appending rows is going to be slow and inefficient in just about any column-oriented tabular data structure, which includes pandas, polars, and duckdb
if you need to insert rows incrementally, use a traditional row-oriented storage system like sqlite
or a list, or plain files on disk, etc.
(at least, if you care about throughput)
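A minimal sketch of the "use a list" option, assuming the rows arrive one at a time as dicts (the column names are made up). Rows are collected cheaply in a Python list and the DataFrame is built once at the end, rather than growing a DataFrame inside the loop:

```python
import pandas as pd

rows = []
for i in range(1000):
    rows.append({"id": i, "value": i * 0.5})   # cheap: appending to a Python list

df = pd.DataFrame.from_records(rows)           # one columnar allocation at the end
```

Growing the frame with `pd.concat` inside the loop would copy all existing data on every iteration instead.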
No it’s just creating a dataframe from metadata of the first dataframe and then merging the two dataframes
the world of software engineering is vast. lots to learn. much like data analysis etc, it's more than anyone can expect to master, let alone learn, in a career or lifetime
Idk why I said appending
do you have a specific use case in mind, and are wondering about optimizing performance?
maybe you meant "appending columns", which is reasonable
It’s just going to take a while because it’s like 70gbs
It’s just counting mentions of a specific word in each row
i suggest stating your actual question in detail if you want a useful response. so far every post you've made has completely changed the answer you will get
don't force people to interview you to help you
hello I have a question! I have an np array with the following variables, X = [x_1, x_2, x_3, x_4, x_5 ... x_(2n-1)] with length (2n-1), I would like to transform this into an (n, n) matrix Y, such that the variables within X are the diagonals of matrix and that they are present within every element on the same the cross-diagonal axis. Like so,
How may I accomplish this? I am not deft enough in numpy yet so I really can't think of a better way than a for loop. but I only want to compute the values x_1 x_2 and so on once.
x_flat = np.array([ ... ]) # your input data
n = len(x_flat) // 2 + 1
x = np.empty(shape=(n, n))
for i in range(n):
    x[i, :] = x_flat[i : i + n]
try that
!e ```python
import numpy as np
x_flat = np.array([11, 12, 13, 14, 15, 16, 17, 18, 19])
n = len(x_flat) // 2 + 1
x = np.empty(shape=(n, n))
for i in range(n):
    x[i, :] = x_flat[i : i + n]
print(x)
```
@desert oar :white_check_mark: Your 3.12 eval job has completed with return code 0.
001 | [[11. 12. 13. 14. 15.]
002 | [12. 13. 14. 15. 16.]
003 | [13. 14. 15. 16. 17.]
004 | [14. 15. 16. 17. 18.]
005 | [15. 16. 17. 18. 19.]]
thanks a lot for confirming and coming up with this glorious snippet
sometimes the best approach is to go full caveman and write a loop over indexes
i'm sure APL has a fancy combinator for this, but i'm too stupid for that
the important thing is to pre-allocate the output and operate with slices, instead of doing something like building up a list of lists and converting it to an array at the end
I see, especially that I am working with large lists, it is important to allocate enough memory at first to avoid large computational and allocational costs
I was quite reluctant of a for loop because of the costs it might've incurred but this algorithm should be as good as it gets
for Python methinks
I am sure some crazy man could use Unions in C to make this work but we do not talk about C
numpy is all C internally 😉
you could do it this way too, but i suspect it will be slower:
x = np.vstack([x_flat[i : i + n] for i in range(n)])
actually that would be interesting... let me benchmark
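For comparison, a sketch of two loop-free ways to build the same matrix (element `(i, j)` is `x_flat[i + j]`, i.e. a Hankel matrix). The variable names are illustrative; which variant is fastest depends on `n` and is worth benchmarking rather than assuming:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x_flat = np.arange(11, 20)            # 2n - 1 = 9 values, so n = 5
n = len(x_flat) // 2 + 1

# Loop version from the discussion above.
x_loop = np.empty((n, n))
for i in range(n):
    x_loop[i, :] = x_flat[i : i + n]

# Index arithmetic: element (i, j) is x_flat[i + j].
idx = np.arange(n)[:, None] + np.arange(n)[None, :]
x_index = x_flat[idx]

# Strided view: no copy of the data, but read-only by default.
x_view = sliding_window_view(x_flat, n)
```

The strided view shares memory with `x_flat`, so it avoids allocating the `(n, n)` array at all; copy it if it needs to be modified.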
does it use CUDA automatically or is it multithreaded for CPU only
numpy can't do cuda
I shall hope for when the CUDA library for Python comes out
other libraries can do it, e.g. torch and jax. just not numpy
and C bindings will not be too necessary for people like me
cupy is supposed to be a drop-in replacement for numpy that can do cuda, but most of the cases where you'd want cuda are covered by torch et al.
numpy is raw C + blas/lapack bindings
you might be interested in Julia, which has one unified native array type that can do both cpu and gpu more or less transparently. compared to python which now has like 5 different array libraries, not including any custom bindings to eigen, armadillo, etc. that someone might be tempted to write
I have been acquainted with Julia through Grant Sanderson's tutorials before
the 3b1b guy?
aye
though I lost interest quickly
I guess I couldn't learn on my own back then, I needed a structured program
he has julia tutorials? i know he did all his animations in python
the sterilization of self-ventures you encounter in high school education is sad
Oh he has, just not in his channel
they are within Julia programming language's own channel I think? He talks about kernels, convolutions and some cool image processing algorithms with his usual intuitive explanations.
nice, i'll watch those. i'm a perpetual julia newbie
I am a perpetual everything newbie, I still can't get over the fact that I have to watch 'beginner tutorial' videos every now and then
The immediate settling of imposter syndrome is sooooo real
that's fair. it might help to try to practice learning from other kinds of resources. videos tend to encourage passive information consumption, which discourages retention and understanding.
i've seen some things written about how students tend to rate videos highly for feeling like they're learning, but actually they perform the worst when you assess how much they learned
YES, that is wonderfully descriptive of the problems I had since I arrived at higher education
had I actually gotten textbooks and read them, done practice instead of watching lectures and learning through animated neat videos, I would've had a much keener comprehension and skill with most of the subjects I've delved into so far.
It is to do that matters, so shall I, have a nice day.
Actually I was wrong, polars is still significantly faster at arithmetic than pandas
if it makes you feel better, i was a college student before the days of high quality videos on youtube, and i still didn't learn the value of studying out of a book and grinding through exercises until i was halfway through and had already made regrettable irreversible decisions, including a C in a core math class
i was the kind of student in public school who could learn everything by just listening attentively in lecture. basically the equivalent of watching a video.
Hello. I'm having trouble with some simple functionality of Pandas. I read a df from a json file and when I do df.iloc[0]['someprop']['someotherprop'] it all works. When I do df.iloc[0:1]['someprop']['someotherprop'] I expect it to work same as before but instead I get KeyError: 'someotherprop'
can someone please tell me why this works in this strange way?
you should pretty much never have expressions that look like df.iloc[ ][ ] with pandas
can you show a representative example of the json file?
and the code you're using to load it?
Have you checked what gets returned, if you leave away some of the [someprop] part?
if you try:
df.iloc[0]
df.iloc[0]['someprop']
df.iloc[0:1]
df.iloc[0:1]['someprop']
df.iloc[0:1]['someprop']['someotherprop']
it should be clear, why you get the error
From the Docs: df.iloc with a slice returns a DataFrame, df.iloc with a single integer returns a Series, i.e. a single row.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
But like @serene scaffold said, without knowing the filestructure its hard to say, what exactly went wrong
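A sketch that reproduces the difference on a made-up frame, assuming the JSON produced a column of nested dicts (the real file structure was never shown, so this is a guess):

```python
import pandas as pd

df = pd.DataFrame({"someprop": [{"someotherprop": 1}, {"someotherprop": 2}]})

row = df.iloc[0]                 # Series: one row, indexed by column name
cell = row["someprop"]           # the dict stored in that cell
value = cell["someotherprop"]    # plain dict lookup

sub = df.iloc[0:1]               # DataFrame with one row
col = sub["someprop"]            # Series of dicts, indexed by row label 0
try:
    col["someotherprop"]         # label lookup in the row index, not inside the dicts
except KeyError as err:
    print("KeyError:", err)

# Flatten the nested dicts into real columns instead of chaining [] lookups.
flat = pd.json_normalize(df["someprop"].tolist())
```

With a scalar index the second `[...]` is applied to a dict; with a slice it is applied to a Series, which looks the key up among row labels.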
Hey @trim saddle I simply did pd.json_normalize(df['someprop'])['someotherprop'] The important part is the json_normalize 😉
according to the docs, json_normalize returns a dataframe object
so use df.loc , not iloc
oh I see, you probably tried read_json first but normalized fixed it
i suggest working through what zaloog said, understanding why something happened is often just as valuable as finding a workaround
idk if this reasoning helps you out any, but you can think of integration as a "low pass filter" and differentiation as a "high pass filter". that means taking the finite differences of a time series is helpful whenever you want to high pass. practically: when you want to remove any constant offset that shifts the whole series up or down, when there's a slow upward or downward trend, or any slow variations you don't care about (including seasonality over a large time scale). these things are not wrong to include in the ACF, but you may need to compute it differently, since they can indeed affect stationarity. as discussed before, stationarity is not required, but the ACF of a non stationary signal has 2 parameters (it's a 2d or matrix quantity derived from a 1d or vector quantity)
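A small synthetic illustration of the trend point, assuming two series that are unrelated except for a shared upward drift (slopes and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000)

# Two series that share nothing except an upward trend.
x = 0.1 * t + rng.normal(size=t.size)
y = 0.2 * t + rng.normal(size=t.size)

corr_levels = np.corrcoef(x, y)[0, 1]                    # dominated by the trend
corr_diffs = np.corrcoef(np.diff(x), np.diff(y))[0, 1]   # trend removed by differencing

print(f"levels: {corr_levels:.3f}, differences: {corr_diffs:.3f}")
```

The levels look almost perfectly correlated purely because both drift upward; the differences show that the fluctuations themselves are unrelated. Which of the two is the right quantity depends on the question being asked.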
it appears so
Possibly relevant: https://ieeexplore.ieee.org/document/9581809
Im starting my masters on Tuesday
MS CS with AI/ML specialization
congrats!
thanks
i thought this was a classic result 😛
could've sworn it was in old books
*edit: oops, had posted a meme in the wrong place
Benchmarked what I wanted to do, from fastest to slowest it was polars, pandas, dask. And I dont think it was even possible to do with duckdb lol. Polars won by a mile though
heey that was helpful ! thnx edd keep up the good work man
Hmm interesting. I haven't used Polars before. I now use Pandas cuDF for large files.
Have you any idea how Pandas cuDF match up to Polars in terms of speed?
Congratulations 🎉🎉🎉🎉 in the U.S?
I am always skeptical when everyone says something is "the next big thing" or whatever in data science. So I poo pooed polars for a long time. You can do a lot of other things to speed pandas, but polars really surprised me. Its super fast right out of the box
havent tried cuDF pandas
keep in mind I only calculated the mean of a row reading large files into memory
I'll test it out and share the results!
oh it uses CUDA?
I think my gpu has 16g of vram so keep that in mind
Hey, I have a stats question. Given a set of data points, is there a straightforward/standard way to estimate the mutual information of a pair of variables?
congratulations!
what kind of mutual information?
Are there different kinds? 😄
I mean like H(X) - H(X|Y), where H is entropy.
isnt there a method in sklearn called mutual_info_regression?
Examples using sklearn.feature_selection.mutual_info_regression: Comparison of F-test and mutual information
I can't explain the maths behind it, but this may be what you are looking for in terms of python implementation
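For intuition about what is being estimated, here is a rough histogram-based (plug-in) estimator of I(X; Y) = H(X) - H(X|Y). This is a different and cruder technique than the nearest-neighbor estimator scikit-learn uses; the `mutual_information` helper and the bin count are assumptions for illustration:

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Plug-in estimate of I(X; Y) in nats from a 2D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nonzero = p_xy > 0                      # skip empty cells: 0 * log(0) counts as 0
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero])))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_dependent = x**2 + 0.1 * rng.normal(size=5000)   # nonlinear, nearly uncorrelated with x
y_independent = rng.normal(size=5000)

mi_dependent = mutual_information(x, y_dependent)
mi_independent = mutual_information(x, y_independent)
pearson = np.corrcoef(x, y_dependent)[0, 1]
```

The quadratic relationship has a Pearson correlation near zero but clearly positive mutual information. The estimate is sensitive to the bin count and is biased upward on small samples, so treat the numbers as relative rather than absolute.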
cuDF looks really interesting. But its a pain to install. Ill look at it more tomorrow. My guess is regular cuDF will run faster than cuDF pandas tho
I would be interested in knowing this as well.
Personally, I'd still default to Polars because the biggest advantage imo is that the syntax is a lot better than Pandas.
I think cuda stuff is more important for super heavy computations, like deep learning. I don't think there will be a huge difference in data transformation, reading large files, etc
although their documentation claims otherwise haha
Haven't really tried Polars but I've heard great things about it's speed from a work colleague. He just can’t keep calm about it 😀
The juice must be that nourishing perhaps.
It was the same for me, I used it to make a pipeline run in less than a minute that took an hour in Pandas. This is for various reasons though, Pandas consumed a ton more memory so I had to split up the pipeline in parts
do you use a cloud system or do you run it on your local machine?
But honestly, Polars reminds me a bit of writing dplyr and spark in terms of ergonomics
the syntax is really similar to pyspark
I had some bottlenecks too while trying to install cuDF. But I'm glad I didn't relent in getting it in my machine.
This video might be useful https://youtu.be/9KsJRyZJ0vo?si=lXZWX7QSCRC18tHs
What if I told you that all this time we've been using Pandas wrong? 🐼 🐼 🐼
We keep running it on our CPU and wondering why it's slow - but what happens when we switch to GPU processing? 🤔
In this tutorial we will explore the brand new technology behind cuDF Pandas Accelerator Mode that allows us to use our graphic cards to make Pandas MUCH fast...
Last but not least, if you're in to that, the type hints are also a lot better so the LSP/IDE experience is fantastic as well
im curious how it will perform on my local gpu haha
Mostly on my Local machine.
oh i read your message wrong
yeah, im not worried about bottlenecks per se, I just dont like using conda
but Ill do it on WSL I guess, but why only ubuntu?
Thanks! I'll have a look at that.
I'm actually curious now, because its been a while since I was heavy on the stats side. I do pipes now lol. but isnt that kind of the whole question of binary classification models?
or prop density functions
now you have me reading the wikipedia on entropy
hahaha
Erm, not sure. I'm just interested in doing exploratory data analysis on data where there may be non-linear relationships between the variables.
I'm trying to get into data science properly, finally.
oh man I thought you were like a statistician or something
Nah, but I did take a few stats courses at university. But I wouldn't say I really have a good/comprehensive working knowledge of statistics.
I do know a fair amount about probability.
But more on the theoretical side.
So... is this actually a reasonable way to evaluate the pairwise relationships between variables, or is this just... stupid lol? 😄 ```py
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression

data = pd.read_csv('data.csv')

def mutual_info(x, y):
    return mutual_info_regression(x.reshape(-1, 1), y)[0]

sns.heatmap(data.corr(method=mutual_info), cmap='coolwarm')
```
It did make a pretty picture ¯_(ツ)_/¯
lol
Although I would have thought the diagonals should be red 
It is the Boston data set from the ISL book: https://intro-stat-learning.github.io/ISLP/datasets/Boston.html
It's not actually a csv file, I just added that for illustrative purposes.
okay, let me look
After installing the ISLP pypi package, you load the data by doing: ```py
from ISLP import load_data
data = load_data('Boston')
sns.heatmap(data.corr(method=mutual_info), cmap='coolwarm')
```
The red squares seem reasonable. I mean, you would expect there to be a strong relationship between the amount of industry in an area and air polution.
this is pearsons r correlations, which is not super comprehensive
you really can't say for sure, there could be multicollinearity, confounding variables, etc
It should be using the mutual_info function. I used corr because I wasn't sure how else to apply a function to all pairs of columns.
oops ur right
Maybe there's a special pandas way to do that?
no mutual_info i think is a better way to do that.
but this still is true
okay
this is basically k-nearest neighbors
Right yeah. I was thinking, if I were to train a predictive model, I could exclude predictors where there is another predictor that has a high degree of mutual information?
agreed
I haven't learned about multicollinearity, but that seems to be a similar concept, but specifically regarding linear relationships.
you can do that with pearsons r too actually. I used ordinary least squares regression models a lot (majored in econometrics) and its always good to check relationships
yeah im fairly confident we are talking about the same thing
waiting for a phd to enter the chat haha
Alright. I distrust linear models because of that Anscombe's quartet thing from school 😄
Distrust all models
but yes, you are correct linear models are meh. but its all just applied probability theory. I think for your use case clustering or logistic regression would be the way to go tho yeah
hold on
nvm I was right, phew
you can also plot a joint probability distribution
did you look at the github lol
"""
Helper functions for clustering

This module contains functions used for clustering in the unsupervised
lab of ISLP. Currently it contains just a simple function to construct
a linkage matrix to assist plotting a dendrogram of a hierarchical
clustering.
"""
import numpy as np

def compute_linkage(hclust):
    """
    Create linkage matrix used to plot a dendrogram

    Follows [sklearn example](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html)

    Parameters
    ----------
    hclust : `sklearn.cluster.AgglomerativeClustering`
        Fitted hierarchical clustering object.

    Returns
    -------
    linkage_matrix : np.ndarray
        Array to be passed to `dendrogram` from `scipy.cluster.hierarchy`.
    """
    counts = np.zeros(hclust.children_.shape[0])
    n_samples = len(hclust.labels_)
    for i, merge in enumerate(hclust.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1 # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([hclust.children_, hclust.distances_,
                                      counts]).astype(float)
    return linkage_matrix
its even on their website
Hello! I have a question, which may be quite complex but I'm stuck with it.
I have two datasets containing captured messages from two different antennas. Most of the received messages are the same, as the antennas are close to each other, but some messages are caught by only one of the antennas. Their timestamps do not always match, as there is a variable delay which is not constant throughout the dataset.
I have to remove the duplicates and merge the datasets into a single one. I tried correlation in small time windows to find the delay at a given time and then dynamic time warping to match the replies and remove the ones closer than a threshold. However, the results feel very random.
Any insights?
I would have to see the data set
It has the timestamp of the received message (nanoseconds precision) and the message type, among other technical characteristics but I figured that these two are the most telling ones
and you want to merge the datasets on the timestamps, but some of them are slightly off, correct?
Yes!
Something like this by plotting the received messages from the two antennas
The dotted lines means that I have removed a common point in the two datasets (received same message at same or almost same time)
its hard to give good insight without the raw data, but one possible solution would be rounding the timestamps to some precision such that they match up, unless you are talking about signals being like 1/2 a nanosecond off, in which case does it really matter? haha
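An alternative to rounding is a nearest-timestamp join with a tolerance, which pandas offers as `merge_asof`. A sketch on made-up data; the column names, the tiny sample frames, and the 10 ns tolerance are all assumptions, and a delay that drifts beyond the tolerance would need to be estimated and subtracted first:

```python
import pandas as pd

# Hypothetical samples: integer nanosecond timestamps and a message type per antenna.
a = pd.DataFrame({"ts": [100, 250, 400, 900], "msg_type": ["A", "B", "A", "C"]})
b = pd.DataFrame({"ts": [103, 251, 600], "msg_type": ["A", "B", "A"]})

# merge_asof needs both frames sorted by the time key.
a = a.sort_values("ts")
b_keyed = b.sort_values("ts").rename(columns={"ts": "ts_b"})

# For each message in `a`, find the nearest message of the same type in `b`
# that is at most 10 ns away (the tolerance is an assumption to tune).
paired = pd.merge_asof(
    a, b_keyed,
    left_on="ts", right_on="ts_b",
    by="msg_type", direction="nearest", tolerance=10,
)

matched_b = paired["ts_b"].dropna().astype("int64")
only_in_b = b[~b["ts"].isin(matched_b)]

# Deduplicated union: everything from `a`, plus what only `b` saw.
merged = pd.concat([a, only_in_b], ignore_index=True).sort_values("ts", ignore_index=True)
```

One caveat: `merge_asof` can pair the same row of `b` with more than one row of `a`, so dense bursts of identical message types may need an extra uniqueness check.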
sorry, your graph means nothing to me
I just plotted the received messages vs time
i dont see time
just lines
theres no other unique identifier? like a ping id or something?
ah but you are most concerned about the time
Red line and blue lines are the data from the two antennas. I just put a dot on the x axis being time when a message is received
Yes, there is
But yes, the problem is this unpredictable delay
right got it
Just a quick sample of the data
If you are comfortable sending me the data, I will help you
But there is another message identifier which is more unique for each message and other properties, like the strength, format...
I can send a part of it yeah, but not all as it is confidential
totally understand
Yeah. Looking at the pair plots there are some weird things, like variables that are clearly very correlated but apparently have low mutual information. I think I need to just read the book before getting ahead of myself and trying to come up with my own ideas tbh 😄
dont send it here
okay give me a second
ohhh the man who made me read about entropy
ahhahaa
did you use the pytorch script
You might like this book if you want to learn more. I haven't got around to reading it myself.
where can I send it to you
but I think all the models you need are on their github man
do you have git on your machine?
What alternatives do you guys use to Google Colab (pro +)? GPUs at work have an issue and I need some soonish. I could just use Azure but I'm interested in things with a transparent pricing regimen which traditional cloud providers aren't imo.
say I have non-integer coordinates (50.345, 60.789). How would I get the value of the pixel in an image array at this coordinate?
Well depends on your coordinate notation, if you want the top-left corner of your screen to have float coordinates (0.0, 0.0) then you could just do int() on each of these float values to get their floor integers which should be indices to the corresponding pixel.
x = (50.345, 60.789)
x = (int(x[0]), int(x[1]))
this should return the required indices, (50, 60)
wouldnt this cause some issues regarding pixel value accuracy since we're defaulting to a rounded number?
ideally you would use an interpolation function that is well motivated. otherwise, you have to pick one by eye and see what gives you good results
Well not really, say, the pixel whose top-left corner is at (0.0, 0.0) has length 1.0
So anything between 0.0 and 1.0 is literally in the area spanned by that pixel
hence if you transform the coordinates into indices, you will get the pixel that includes that specific point no matter what.
The accuracy in regards to picking the right pixel that encompasses that point is therefore absolute. You will have digitization however, as multiple float inputs will map into the same integer tuple output, which is just how images work in the first place so there is no error.
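To make the two options concrete, here is a sketch contrasting truncation with bilinear interpolation, one common interpolation choice. It assumes `x` is the column and `y` is the row, and that pixel values sit at integer coordinates; `bilinear_sample` is an illustrative helper, not a library function:

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Sample a 2D image at float coordinates, with x as column and y as row."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, image.shape[1] - 1)   # clamp at the right edge
    y1 = min(y0 + 1, image.shape[0] - 1)   # clamp at the bottom edge
    fx, fy = x - x0, y - y0                # fractional offsets in [0, 1)

    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

image = np.arange(100 * 100, dtype=float).reshape(100, 100)   # stand-in image
nearest = image[int(60.789), int(50.345)]    # truncation: the pixel containing the point
smooth = bilinear_sample(image, 50.345, 60.789)
```

Truncation returns one stored pixel value; bilinear interpolation blends the four surrounding pixels, which matters when the sampled values feed into further computation such as resampling or warping.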
rip no gpu capacity on aws spot
ty
be sure to read what salt rock lamp said as well.
Most people I know use
- AWS
- Lightning AI
- Saturn Cloud
For free TPUs for your personal experiments, https://sites.research.google/trc/about/
TPU Research Cloud by Google
Thanks, I'll check out lightning and Saturn. We use Azure in house so Azure ML is also an option if need be 😄
I'm gonna use this stuff
https://huggingface.co/docs/transformers/perplexity
to evaluate the models on the shakespeare dataset
this is the LR used in the attention is all you need paper
model_factory = ModelFactory(
    coordinates = 300,
    tokens = gpt2_encoder.max_token_value,
    words = 250,
    number_of_blocks = 10,
    number_of_heads = 20,
    bias = False,
    attention = "metric",  # or "scaled_dot_product"
)
training_loop_factory = TrainingLoopFactory(
    number_of_epochs = 1000,
    number_of_batches = 100,
    warmup_steps = 100,
    loss_function = "CrossEntropyLoss",
    batch_size = 32,
    input_text_file = "train/static/raw_data.txt",
    split_ratio = 0.9,
)
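For reference, the learning-rate schedule from "Attention Is All You Need" mentioned above is lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). A sketch of that formula; the defaults are the paper's base-model values, not the ones in the config above, and `transformer_lr` is just an illustrative name:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from 'Attention Is All You Need', section 5.3."""
    step = max(step, 1)   # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
schedule = [transformer_lr(s) for s in (1, 1000, 4000, 16000)]
```

The rate rises linearly during warmup, peaks at `step == warmup_steps`, and then decays with the inverse square root of the step.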
gonna let it be until colab takes away the machine, need to do something about the spot instances
I feel like this book is a bit too advanced for both of us
I would start here: teaches you concepts but not overly complex, focusses more on the code, etc
Can anyone help out with a NEAT-Python game that i just, for the life of me, cant get to come to a solution?
Data Science Notes:
https://github.com/sunnyallana/dataScienceResources
Don't ask to ask
If you have a specific question about your program, an error you are getting, and everything you have tried its a different story. Next time, try explaining exactly what you don't understand. Its not that people dont want to help you out, but if everyone asked "can you do this for me" it would be like a full time job for us
What value do complex numbers hold in the field of data analysis? Like if I have a logarithmic equation (ln(y) = c + ln(x)) and my input data also contains negative entries, does that mean those entries are invalid and should be discarded, or can I carry out the analysis by applying the absolute value, which technically means considering just the real part of the ln(x) solution?
Need to find out how many complex words are present in a set/list of tokenized words
Any clue how I can achieve this
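One possible reading of "complex words" is the readability-metric definition: words with three or more syllables. A rough sketch under that assumption, with a heuristic syllable counter (vowel groups, minus a silent trailing "e") that will miscount some English words:

```python
import re

def count_syllables(word):
    """Rough syllable count: number of vowel groups, minus a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def count_complex_words(tokens, min_syllables=3):
    """Count alphabetic tokens with at least `min_syllables` estimated syllables."""
    return sum(1 for tok in tokens if tok.isalpha() and count_syllables(tok) >= min_syllables)

tokens = ["the", "statistical", "model", "is", "complicated", "."]
n_complex = count_complex_words(tokens)
```

If a different definition is intended (word length, rarity, a fixed vocabulary list), only `count_complex_words` needs to change.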
Yelp
Usually we avoid it. You can of course do just about all of the standard linear algebra with complex numbers that you can do with real numbers, but I never actually see that done in practice
Partly it's a problem of interpretation. In a lot of cases, complex numbers act more like a pair of numbers than a single number. How do you actually interpret linear regression with a complex output?
Complex logarithms in particular, as far as I know, are not straightforward
That is, I don't think it's like a square root where you put a negative number in and a complex number comes out
But I could be wrong, I never studied complex analysis
I think the only time it's likely to come up in applied data science or machine learning is in some kind of signal processing context with Fourier decomposition, which I believe people use sometimes for feature engineering in deep learning, but I have no experience with that
One more thing to note is that in the case of the logarithm, if you're just trying to transform highly skewed data that might contain zeros or negative numbers, you can use the inverse hyperbolic sine function instead, although it's not as easy to interpret as a logarithm, and you don't have a library worth of tidy well understood results for the distribution of random variables transformed with the inverse hyperbolic sine function
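A short numeric sketch of both points, using NumPy only. On the principal branch, the complex logarithm of a negative real number is ln|x| + iπ, so keeping only the real part is the same as taking ln|x| and throws away the sign; the inverse hyperbolic sine keeps the sign and is defined at zero:

```python
import numpy as np

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])

asinh = np.arcsinh(x)                 # defined for every real number, odd-symmetric
log_approx = np.log(2 * x[-1])        # for large positive x, arcsinh(x) is close to ln(2x)

# Principal branch of the complex logarithm: ln(-a) = ln(a) + i*pi for a > 0.
z = np.log(np.complex128(-2.0))
```

Whether discarding the sign is acceptable is a modeling decision for the specific dataset, not something the math settles.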
So I would say it's only relevant in data analysis if it's relevant to your problem domain. Otherwise, you're not likely to encounter it at all
im a bit confused, am I supposed to feed the encoder output to every decoder block ?
Only for an encoder-decoder unit
so I have to code a series of encoder blocks, and the final one connects to the decoder ?
how do I choose where to connect it ? in the middle ?
and it only shows 3 attention heads, two coming from the encoder, how do I decide on that split ?
and are the positional encoding modules the same ? or should I create one for each branch ?
it shows 3 multi-head attention units, not 3 attention heads. each multi-head attention unit has its own heads
i think you just do positional encoding once at the beginning but don't quote me on that. maybe look at an implementation for these details
yeah I think I need to do that, they have the code in a repo I think
I think I narrowed it down
with tf.variable_scope(name):
    for layer_idx in range(hparams.num_decoder_layers or
                           hparams.num_hidden_layers):
        x = transformer_decoder_layer(
            x,
            decoder_self_attention_bias,
            layer_idx,
            hparams,
            encoder_decoder_attention_bias=encoder_decoder_attention_bias,
            encoder_output=encoder_output,
            cache=cache,
            decode_loop_step=decode_loop_step,
            nonpadding=nonpadding,
            save_weights_to=save_weights_to,
            make_image_summary=make_image_summary,
            losses=losses,
            layer_collection=layer_collection,
            recurrent_memory_by_layer=recurrent_memory_by_layer,
            chunk_number=chunk_number
        )
seems like it's feeding the encoder output each time, think I need to dig deeper
does not seem to be conditioned on the layer index, the index is only used once and for logging purposes
I think the arrows might be the Q, K, V vectors
tensor2tensor/layers/vqa_layers.py lines 232 to 240
if memory_antecedent is not None:
    # Encoder-Decoder Attention Cache
    q = common_attention.compute_attention_component(
        query_antecedent, total_key_depth,
        q_filter_width, q_padding, "q",
        vars_3d_num_heads=vars_3d_num_heads)
    k = cache["k_encdec"]
    v = cache["v_encdec"]
else:
```
two come from cache, and the other is being computed somehow from the compute_attention_component
aaah, I think im gonna need to rework my layers
this is quite interesting, I don't think I have a parallel for this setup in the metric tensor thing
I mean, ofc I have, I just dot the output from the encoder with the output from the prev decoder
in both cases it is a simple mod, on the forward method I add a new argument, forward(sequence1_bwc, sequence2_bcw), first one is used to compute queries, the second to compute keys and values. When I need to use decoder only I just feed the same sequence twice. My attention mechanism is even simpler since I'm just doing xMx.T, I can do xMy.T
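The two-argument idea can be sketched with plain scaled dot-product attention in NumPy. This is single-head, with no masking and random weights, so it only illustrates the data flow (queries from one sequence, keys and values from the other); the names and shapes are hypothetical:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(query_seq, memory_seq, w_q, w_k, w_v):
    """Scaled dot-product attention: queries from one sequence, keys/values from another."""
    q = query_seq @ w_q                                  # (T_q, d)
    k = memory_seq @ w_k                                 # (T_m, d)
    v = memory_seq @ w_v                                 # (T_m, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T_q, T_m)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
decoder_states = rng.normal(size=(5, d))   # hypothetical decoder activations
encoder_output = rng.normal(size=(7, d))   # hypothetical encoder output

cross_out, cross_w = attention(decoder_states, encoder_output, w_q, w_k, w_v)
self_out, self_w = attention(decoder_states, decoder_states, w_q, w_k, w_v)
```

Passing the same sequence twice gives self-attention; passing the encoder output as the second argument gives encoder-decoder attention, and the output length always follows the query sequence.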
how do i prepare the data and LSTM model in keras for suporting multiple features as input?
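On the data side, Keras recurrent layers expect a 3D input shaped `(samples, timesteps, features)`, so multiple features simply become the last axis. A NumPy-only sketch of the windowing step; `make_windows`, the window length, and predicting the next value of column 0 are assumptions for illustration:

```python
import numpy as np

def make_windows(series, timesteps, target_col=0):
    """Turn a (T, n_features) array into (samples, timesteps, n_features) windows."""
    n_samples = len(series) - timesteps
    X = np.stack([series[i : i + timesteps] for i in range(n_samples)])
    y = series[timesteps:, target_col]      # value right after each window
    return X, y

series = np.arange(40, dtype=float).reshape(10, 4)   # 10 time steps, 4 features
X, y = make_windows(series, timesteps=3)
```

The resulting `X` can be fed to an LSTM whose input shape is `(timesteps, n_features)`, here `(3, 4)`.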
Yes that's definitely what they are
im gonna train the networks to do summarization, it's easier for me to do a qualitative assessment of the output and it is a useful application that I can easily deploy for portfolio value
so far I've only seen one metric for evaluating the output, it's called perplexity
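For reference, perplexity is the exponential of the average negative log-probability the model assigns to the reference tokens. A small sketch with made-up probabilities (the numbers are illustrative, not model output):

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the reference tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that spreads probability uniformly over 4 options at every step.
uniform_over_4 = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is fairly confident about each reference token.
confident = perplexity([0.9, 0.8, 0.95])
```

Lower is better: the uniform model comes out at roughly 4, and a model that always assigns probability 1 to the reference token would score 1.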
for training data, I'll likely query wikipedia articles on-demand and feed them through llama to get an output
@past meteor @proud wing @desert oar The answer to the problem in the message I'm replying to here (you all helped me with it a while back) is found in this article https://pythonspeed.com/articles/python-multiprocessing/
Thought I'd share. Interesting little niche problem for Linux users using multiprocessing
Short answer: Explicitly setting the context for creating new processes as 'spawn' rather than the default 'fork' fixes the issue and allows me to write out to disk from inside a child process
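A minimal sketch of that fix, assuming each worker writes its own file (`write_part`, `write_all` and the file naming are hypothetical, not the original code): the pool is created from an explicit "spawn" context instead of the platform default.

```python
import multiprocessing as mp
import os
import tempfile


def write_part(args):
    """Worker: write one chunk to its own file and return the path."""
    out_dir, index, rows = args
    path = os.path.join(out_dir, f"part_{index}.txt")
    with open(path, "w") as f:
        f.write("\n".join(rows))
    return path


def write_all(chunks, out_dir, processes=2):
    # Explicitly request "spawn" rather than relying on the platform default.
    ctx = mp.get_context("spawn")
    jobs = [(out_dir, i, rows) for i, rows in enumerate(chunks)]
    with ctx.Pool(processes=processes) as pool:
        return pool.map(write_part, jobs)
```

With "spawn" each child re-imports the main module, so `write_all(...)` should be called from under an `if __name__ == "__main__":` guard, and the worker has to be a top-level function so it can be pickled.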
interesting, I don't remember the original issue but honestly I wouldn't have expected that to work
I would've expected that you need to create a new thread in each worker process, or use a cross-process-capable queue for logging
I'm not really sharing data across processes. I'm just writing out from each process separately into separate files
But each write would cause the process that called it to hang
if you were writing from within each process that should work and shouldn't cause the problem here
unless you're trying to share an open file object across processes or something weird
Anyone want to practice pandas or matplotlib with me?
i am using torch.nn.functional.interpolate() to convert 3,16,16 to 3,224,224
but i am getting the following error during backpropagation:
RuntimeError: The size of tensor a (224) must match the size of tensor b (16) at non-singleton dimension 3
what to do?
Boys of data analysis do I need to take all maven analytics courses for python or till visualization into seaborn and metabolite will be enough??
Are there any good sources for learning machine learning and AI available on the net?
I'll refer you to this pinned message: #data-science-and-ml message
!rule 6
Seems like you made this, which would mean it's unapproved advertising, which means it sadly has to go.
A massive tip I can give you for learning matplotlib is reading the quick start guide https://matplotlib.org/stable/users/index.html. It's really important you understand the anatomy of a figure intuitively (in the link). it explains the "philosophy" quite well.
For Pandas reading this would also be a help https://pandas.pydata.org/docs/user_guide/10min.html#min
Both Pandas and matplotlib have cheatsheets I'd recommend you reference when using them in the beginning (after reading both links on top):
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://matplotlib.org/cheatsheets/
I have deleted the post. To clarify I had not made that post. I only discovered it, I enjoyed reading so I thought to share it in most relevant community. @past meteor Definition of advertisement : "a piece of information that persuades people to buy something". That article was not persuading anyone to buy anything. No membership was required to even read the article. For anyone still wanting link to article just tell me.
why is the training of YOLOv5 stopping for me after it has done 1 epoch 😭
im training it on a dataset of about 8k images from roboflow
i used these parameters
and yes i have re-run the code 3-4 times already but same issue
I have dataframe named df and I need to add a dict to it. How?
Can you give a brief example of what you mean?
nvm
Hi everyone,
I want to do some regression DS projects. Do you have any suggestions for choosing projects?
I would like to teach a NN to recognize if wires are inserted into the connector in the right order. I have a hundred images of connector side of a cable with wires attached. I've annotated them with polygons denoting where in the image the connector housing is and where each of the wires is. I have also annotated if the whole thing is assembled correctly. All tutorials are focusing mostly on recognizing different classes between images but I have all the same images with the same n classes on there (connector, x amount of wires and if it's "CORRECT"). Only some training images have some wires removed and made incorrectly. I have trouble extrapolating from the tutorials of how to go about this.
metabolite? maven analytics? can you clarify
classes don't have to be objects within the image. In this case, you have only two classes: correctly connected, and incorrectly connected or disconnected
however if you run object detection on the image first, you might be able to pre-process the images in some way that might make the classifier more effective
otherwise, if you just take the images and put them directly into a classifier, the model basically has to learn how to detect objects, before it can learn to distinguish connected and disconnected
Matplotlib
All these courses I mean do I need all of them??
i suggest spending a small amount of time every week reading the matplotlib documentation and practicing
I have no idea what these courses are, so I can't comment, but I rarely think it's worth taking a course just to learn a library
I guess those prices aren't that bad? They might get you started more quickly
I don't think any of that is necessary, it won't help you on a job application for example
But it might help get you started much more quickly than you might be able to learn it on your own, in which case it's money well spent depending on how you value your time
But it's also hard to say what the quality of the course material is
There are a lot of junk courses out there taught by people who are essentially beginners themselves and should have no business teaching
I certainly do not think you should take all of them
Particularly when it comes to learning how to actually build models and reason about data at a professional level, lots of small bite-size courses will not help you make progress in my experience. You will need to spend some time and effort on larger hands-on projects, and at least some deeper study of the underlying math and statistics, if only so that you can share a common language with other practitioners and learn from them, read books, etc.
So what do you recommend what I should do with python, SQL, Visualization and Excel??
I need some help with identifying how to access the GRIT dataset that Ferret model uses which was released by apple. Here is the repo link "https://github.com/apple/ml-ferret/"
But both ways should yield the same result? Even if I don't teach the model what a connector or what a black wire is? I didn't yet label NOK ones as "incorrect" so I think I'll do that before proceeding.
I don't understand what a classifier is then. Is that just the dense layers after the convolution and pooling?
Why python is outdated and JavaScript is updated?
wat
It isn't
A classifier is any model that distinguishes between classes. It is a more general concept than neural networks and images.
In theory yes, the model can learn what black wires are, what connectors are, etc. but you might need a lot more data for that. The more information the model needs to learn from the data, the more data you need
class EncoderDecoder(nn.Module):
    def __init__(self, params: "ModelFactory"):
        super().__init__()
        # NOTE: self.number_of_blocks is used below but is never set in this snippet.
        self.sequence_encoder = SequenceEncoder(params)
        self.encoder = nn.Sequential()
        for i in range(self.number_of_blocks):
            block = TransformerBlock(params, is_decoder=False)
            self.encoder.add_module(f"encoder_block_{i}", block)
        # Store both lists on self so forward() can reach them;
        # append() keeps them indexable by integer.
        self.decoder_blocks = nn.ModuleList()
        self.junction_blocks = nn.ModuleList()
        for i in range(self.number_of_blocks):
            self.decoder_blocks.append(TransformerBlock(params, is_decoder=True))
            self.junction_blocks.append(TransformerJunction(params))
        self.output_layer = nn.Sequential(
            nn.LayerNorm(params.coordinates),
            nn.Linear(params.coordinates, params.tokens, bias=params.bias)
        )

    def forward(self, sequence_bw: TensorInt) -> TensorFloat:
        sequence_bwc = self.sequence_encoder(sequence_bw)
        encoder_output_bwc = self.encoder(sequence_bwc)
        for i in range(self.number_of_blocks):
            sequence_bwc = self.decoder_blocks[i](sequence_bwc)
            sequence_bwc = self.junction_blocks[i](sequence_bwc, encoder_output_bwc)
        return self.output_layer(sequence_bwc)
Uhm, I think this is it
So for the 100 images I currently have labelled I should only be training for the "correct" and "incorrect".
Not sure how I'm supposed to train it tho. Probably the same way as before ? Have the output shifted so that it does next token pred
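One common way to set up that shift, sketched with hypothetical BOS/EOS token ids (the ids and the helper name are assumptions for illustration): the decoder input is the summary prefixed with BOS, and the target is the same summary moved one position left and terminated with EOS.

```python
BOS, EOS = 1, 2  # hypothetical special-token ids


def make_decoder_pair(summary_ids):
    """Return (decoder_input, target) for next-token prediction."""
    decoder_input = [BOS] + summary_ids
    target = summary_ids + [EOS]
    return decoder_input, target


decoder_input, target = make_decoder_pair([17, 42, 8])
# decoder_input -> [1, 17, 42, 8]
# target        -> [17, 42, 8, 2]
```

At position t the decoder sees tokens up to t in `decoder_input` and is trained to predict `target[t]`, which is the next token of the summary.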
i'd be concerned that 100 isn't enough to get good results without preprocessing
but you can definitely try it
What amount would you say should be enough? Just the order of magnitude (1k, 10k)?
i'm just thinking about the conceptual application. I assume they're interested in doing something like taking a photo of an electrical box and determining if it's been wired correctly. There are going to be so many variations of configurations, lighting, cameras, etc. that the model is going to have to be robust to
hard to say. it's more about covering possible variations of the data than hitting some required number
are you doing sequence translation now
Still, I'd try it. Last time I used one I was able to get it to remove backgrounds with v good generalization from just 50-75 images
But, it worked because the background was always the same
yeah that's what I was saying, just try it and see. but don't be disappointed if it doesn't work well either
So it was memorizing the background
if it's something like one camera mounted in a fixed location taking a photo of a specific jig in a production facility, 100 might be fine
That's exactly what I have
ok, then go for it. you might be able to improve your results by pre-processing with object detection etc. but at least this way you'll have a baseline
It was, but I didn't predict it at the time, we had over 1k images and I wanted to determine how many images we actually needed to fit one network, the images were expensive to get the bg removed, was pleasantly surprised by the unet
there should also be ways to combine a small number of labeled instances with a large number of unlabeled instances to improve results, but i'd have to do some digging to see how that works with images and NNs specifically. that's called "semi supervised" learning
this, right? https://en.m.wikipedia.org/wiki/U-Net
U-Net is a convolutional neural network that was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segment...
I'm doing summarization, the encoder is for the document and the decoder auto regressive summarization thing
Yes, all SOTA for bg removal and matting are a variation of the UNET, they usually further introduce MoE to the mix
what are you using on the decoder side? are you getting summaries from another model?
It's a normal transformer decoder, I based the setup off of the tensorflow repo they link on the paper
Still haven't checked if the positional encoding is shared by the two branches
But I reckon they do
which paper? attention is all you need?
yes
This repo
i'll take a look
I'm working with Keras and I have issues with logits and label shapes during model.fit(). My images are (N, 240, 320, 3) and my labels are a (N, 1) array of 0 or 1 (for OK and NOK). Do I just say my labels are (N, 240, 320, 1) and fill them with either 0 or 1 or do I have to segment my images to make masks of where the desired wires and connectors are?
if you're just classifying "connected" and "not connected" then your labels are just (N, 1) of 0 and 1, assuming N is the number of images
if you segment the images i'm not sure what the "correct" approach is, but you can maybe include each mask as another "channel" along with r g and b
so instead of (N, 240, 320, 3) you have (N, 240, 320, 3 + M) where M is the number of different mask layers
there might be a better way to do it, that's just what i came up with off the top of my head
these are 240x320 images right? if your labels were N,240,320,1 then you'd be labeling each pixel, rather than each image
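The shapes being discussed, sketched in NumPy with zero-filled placeholder arrays (N = 4 and M = 2 mask layers are arbitrary choices for illustration): the masks are stacked onto the RGB channels, while the labels stay one value per image.

```python
import numpy as np

N, H, W, M = 4, 240, 320, 2  # M mask layers is a hypothetical choice

images = np.zeros((N, H, W, 3), dtype=np.float32)  # RGB images
masks = np.zeros((N, H, W, M), dtype=np.float32)   # e.g. connector mask, wire mask

# One label per image (0 = OK, 1 = NOK), not one per pixel.
labels = np.array([[0], [1], [1], [0]], dtype=np.float32)

# Stack the masks as extra channels: (N, 240, 320, 3 + M).
inputs = np.concatenate([images, masks], axis=-1)
```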
I did that and then I tried to put in just the 1D array of labels but I'm getting stuck with keras complaining of mismatched shapes in model.fit https://i.imgur.com/CYZbGZq.png
show your code. based on your messages before you might be missing a softmax layer
!paste
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
I am indeed without a softmax layer.
For lack of a solution I'm trying by throwing stuff at the wall and seeing what sticks. I put a flatten at the end of the decoder and it now goes through the epochs but sadly with 0.0 accuracy
Well you need one
softmax is how you get from raw "scores" to a probability distribution over classes
in the case of a single output you actually just need a sigmoid activation
but you need to get from your inner layers to that one output
I don't completely understand what softmax does with only two unique classes. I guess it's the same as just normalizing between 0 and 1. The main issue right now is that I can't visualize how all of this operates. What would be some useful debug info to display to make any sense of it?
In 1d softmax is just the inverse logistic function
With two classes, isn't it the same as with N classes ?
it simplifies further, but yes it's the same
sorry it is the logistic function
the inverse logistic function is the other way, going from 0,1 to the real line
So if I have sigmoid activation on the last conv2d, do I still need a softmax layer then?
nope that's identical in 1d
but you need sigmoid activation on one output, if you have multiple outputs you need to condense that down to a single output somehow
i'm not really sure what people do for that typically, I suppose you could do a fully connected layer
So it's the same interpretation, a probability assigned to each of the two classes
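A quick numeric check of that equivalence, using only the standard library (the logit values are made up): a two-class softmax gives the same probability as a sigmoid applied to the difference of the two logits, and the inverse (the logit function) maps a probability back to the real line.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of sigmoid: maps (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))


z_ok, z_nok = 0.3, 1.7                      # made-up raw scores for the two classes
p_nok_softmax = softmax([z_ok, z_nok])[1]   # two outputs + softmax
p_nok_sigmoid = sigmoid(z_nok - z_ok)       # one output + sigmoid
```

Both values agree up to floating-point error, which is why a single sigmoid output is enough for OK vs NOK.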
Ah! So I can do layers.Dense(1) to connect all the previous ones down to one output
No errors but the results are a bit weird. For all 10 epochs I get
loss: 9.2921 - accuracy: 0.3976 - val_loss: 8.0797 - val_accuracy: 0.4762
Could this be because of a too small dataset?
possibly. what % do you have of OK and NOK?
60-40
I'll label more images tomorrow. Try to get at least 500. Sadly I don't have that many NOK pics so I need to get more when I go back to work.
yeah, if you have 60% OK then you should be beating 60% accuracy as minimum baseline
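The baseline here is what you get by always predicting the most common class. A small sketch with a synthetic 60/40 label list (the helper name is hypothetical):

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


labels = ["OK"] * 60 + ["NOK"] * 40
baseline = majority_baseline_accuracy(labels)  # 0.6 for a 60/40 split
```

Any accuracy below this number means the model is doing worse than a constant guess.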
more labeled data is good
That's good to know. I lowered the batch size down to 8 and got up to 52% accuracy
Thanks for all the help. I'm off to bed now. Will probably be back with more questions in the coming days.
If you can visualize the learning curve that could help diagnose if you're underfitting or overfitting
you probably should also do a train/test split. more data will help for that
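A sketch of one way to do that split while keeping the OK/NOK ratio the same in both parts, using only the standard library (`stratified_split` and the 20% test fraction are illustrative choices, not something from the thread):

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the class ratio preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 60 + [1] * 40  # synthetic 60/40 OK/NOK labels
train_idx, test_idx = stratified_split(labels)
```

The returned indices can be used to slice both the image array and the label array, so the held-out images never reach `model.fit()`.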
Hey does anyone have a good video or place where I can start learning pytorch and Machine Learning?
This is the best intro I've seen: https://youtu.be/0QczhVg5HaI?si=RgPV8UhO64tJXYu4
A video about neural networks, how they work, and why they're useful.
A more elaborate playlist https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&si=d08-29J4j74zjDte
And then the resources on the pinned messages + just getting your hands dirty with it
This looks awesome
anyone want to practice data analysis in python (pandas, numpy, matplotlib, seaborn)?
Didnt know google had stuff like this. thanks
and thanks for both links ill try both of those and the google one
@agile owl my first success with implementing an architecture described in a paper was this one about continual learning https://ojs.aaai.org/index.php/AAAI/article/view/17600
awesome let me check it out