#data-science-and-ml
what GPU and cuda version do you have?
were you wanting to make a voice assistant? because that would not be a good first project. voice assistants are several separate AI components rolled into one interface.
most AIs that use language deal with text, and if they have to deal with audio, they work with transcribed audio.
been a long time since i installed it how do i check
nvidia-smi tells you your cuda information
it started and immediately closed
oh wrong thing
so what am i looking for
do i just paste it in pastebin ?
sure
CUDA Version: 12.2
NVIDIA GeForce RTX 3060
alr
i know there is a version list which shows you the compatibilities between python, cuda and tf
but cant find it
I would look to see if you can find a wheel that you can download and install
those are pre-compiled for specific combinations of (OS, python version, tensorflow version, cuda version)
you have to find the right one for all four parameters, or it won't work.
god how do i do that never done anything with wheels
for pytorch, there's just a web page that has all the wheels
using tf
what OS are you on
looks like this hasn't been updated for a while, but you might find the one you want here https://github.com/fo40225/tensorflow-windows-wheel
I've got to get into something else. Good luck!
you can just pip install them
like pip install https://github.com/fo40225/tensorflow-windows-wheel/blob/master/1.1.0/py36/GPU/cuda8cudnn6avx2/tensorflow_gpu-1.1.0-cp36-cp36m-win_amd64.whl would potentially work
ah
thanks a lot!
did not work but i wont annoy you anymore
yeah i have no clue how to work with this.
also i have python 2.9.0 my bad
found one that works from tensorflow docs
there is no python 2.9
tensorflow*
oh
cant think when sleepy
mb haha
oh
well this is interesting it might actually work
oh yeah it doesnt just didnt show the errors.
im so confused with these wheels, ill need to check this out
but yeah ill go to sleep, but if anyone can help you can dm me because i have no idea what wheels are, what they do, how to work with them or where to get the links. never done something like this so help will be highly appreciated
anyone still wake lol
its noon in my country
what? im only 14 year old if i say idk dont blame me
you have done CNN before?
CNN?
hmm sorry im only a beginner
i have reach the if statement
good job
always starts with little things, you will get there bro
thanks man
Sure what about them?
lol I want to make sure my model is not in a forever loop
if you can help me double check?
You shouldn't call evaluate on the test set while doing your kfold
yeah it actually run super slow
I'm sure there is a way to optimize it
here what it print out so far:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
16/16 [==============================] - 0s 6ms/step - loss: 0.2187 - accuracy: 0.9612
16/16 [==============================] - 0s 6ms/step
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
16/16 [==============================] - 0s 7ms/step - loss: 0.1716 - accuracy: 0.9265
16/16 [==============================] - 0s 6ms/step
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
16/16 [==============================] - 0s 7ms/step - loss: 0.1190 - accuracy: 0.9571
16/16 [==============================] - 0s 7ms/step
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
16/16 [==============================] - 0s 6ms/step - loss: 0.0944 - accuracy: 0.9673
16/16 [==============================] - 0s 7ms/step
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
Are you training on CPU?
It shouldn't be slow, your model is really small (too small)
Do you have a GPU? If not, consider using colab
colab?
also can you elaborate on this?
is this why my model is so slow
https://colab.research.google.com/?utm_source=scs-index free compute from google, incl. GPU hours
No it's slow because you're training on CPU
ahhhh
You're supposed to train your model using kfold cross validation to check performance and then train on the entire training set and then evaluate on the test set
yup this is what I want to do
How should I fix it
so move # Get the predicted probabilities for ROC curve y_pred = model.predict(X_test) out of the loop?
and then do this again history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test), verbose=0)
outside the loop
and keep this inside the loop?
just want to make sure I understand
how should I separate them? have them completely outside the model?
You just use X_train as validation_data again
Hi guys, I am doing EDA on the Olympics dataset from Kaggle and I am having a little problem. I want to extract some rows from the dataset based on a condition, but I am having difficulty. I have a list **sports** and a dataset named country_teams which has 4 columns (Team, Sport, Gold, Times_participated), and I want to extract the countries (or teams) which have earned the most medals in a particular sport. E.g., if I do country_teams[country_teams['Sport'] == 'Basketball'] then I get the US on top, and I want to keep only that row and discard the rest. Same thing for every sport. The **sports** list has the name of every sport. Can anyone help?
so keep using X_train as validation_data or should I change it?
Currently you have X_test there, no? so you should change it to X_train
I see lol, I will look into it, Thanks
Btw if this isn't making sense or isn't intuitive, that's pretty normal I think
Does group by followed by max and then taking the top row solve your problem?
The left one is the country_teams dataframe (the one I originally have) and the right one is the result of what you suggested. I am getting back the same dataframe that I had originally.
Oh that's how your dataset looks like. Just to be sure, you want the max per sport type?
Then a group by works. Group by sport and keep the rows where Gold equals the group maximum: country_teams[country_teams["Gold"] == country_teams.groupby("Sport")["Gold"].transform("max")]
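A minimal sketch of the "best team per sport" selection discussed above. The frame below is a made-up stand-in for country_teams (team names and counts are placeholders, not real Olympic data); only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for country_teams; all values are placeholders.
country_teams = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D", "Team E"],
    "Sport": ["Basketball", "Basketball", "Football", "Football", "Athletics"],
    "Gold": [15, 0, 2, 1, 6],
    "Times_participated": [19, 12, 13, 15, 14],
})

# Keep every row whose Gold count equals the maximum within its sport
# (if two teams tie for a sport, both rows are kept).
is_top = country_teams["Gold"] == country_teams.groupby("Sport")["Gold"].transform("max")
top_per_sport = country_teams[is_top].reset_index(drop=True)

# Alternative: exactly one row per sport (the first one when there is a tie).
one_per_sport = country_teams.loc[country_teams.groupby("Sport")["Gold"].idxmax()]
```

The transform version is probably the safer default when ties matter; idxmax is shorter when any single winner will do.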
hello guys
i would like to know how can I plot the red curve that represents the fifth and ninety-fifth percentile of the simulation from a brownian motion
is it possible to do that in Python ?
does anyone use ArrayFire?
just woke up, and im still in need of help tried to find wheels that work but nothing works. can anyone help me setup the gpu for tensorflow ? talked about it earlier here.
Sure, I expect you have an NxT array with N being the number of simulations and T being the number of timesteps? It's trivial to find the 50th percentile for each T
atm no but i can build that @past meteor
i thought of doing like that :
for each time step compute an array containing the output of each simulation
and use the np.percentile() on it to get a scalar and store it in another array
i get one representing the percentile at each timestep then plot
right ?
Yes, though thanks to numpy's vectorized functions you can do it for all timesteps at once:
# data is an (n,t) array
perc5 = np.percentile(data, 5, axis=0) # (t,) array
perc95 = np.percentile(data, 95, axis=0) # (t,) array
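For anyone wanting to try this end to end, here is a rough sketch that first builds the (n, t) array from simulated Brownian paths and then takes the percentiles. The number of paths, the number of steps and the seed are assumptions picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only so the sketch is reproducible
n, t = 1000, 250                 # assumed: 1000 simulations, 250 time steps
dt = 1.0 / t

# Each row is one simulated path: cumulative sum of N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=(n, t))
data = np.cumsum(increments, axis=1)        # shape (n, t)

perc5 = np.percentile(data, 5, axis=0)      # shape (t,)
perc95 = np.percentile(data, 95, axis=0)    # shape (t,)

# To draw the band, perc5 and perc95 can be plotted against np.arange(t)
# on top of the individual paths.
```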
wonderful
so i just have to compute an array for each time step containing all the outputs of my simulations
then i store this in an array of arrays
aka the "data" in your code
and that's done
Are you using numpy so far?
The matrix "array of arrays" would just be an np.ndarray
I have a quick question
For K-fold validation
Do folds train on top of each other?
Like fold 2 trains on top of fold 1, then fold 3 trains on top of fold 2 and so on
@sick ember - i don't quite understand what you meant, but this image i yoinked from sklearn's docs should be quite clear what's going on in k-fold CV, does it help? (https://scikit-learn.org/stable/modules/cross_validation.html)
If you're unsure of Kfold and how it works you can code it up (it's short, less than 10 loc using numpy) and I can look at your implementation.
Writing it from scratch and having other people (or chatGPT) critique it is the best way to check if you unambiguously understand the concept
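As a reference point for the "code it up yourself" suggestion, one possible from-scratch splitter in numpy could look like the sketch below. The function name and the seed argument are made up here; it assumes k is at least 2.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation (k >= 2)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)      # shuffle once, up front
    folds = np.array_split(indices, k)        # k nearly equal chunks
    for i in range(k):
        val_idx = folds[i]                    # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(10, 5))
```

Every sample ends up in the validation fold exactly once, and each fold's training indices are simply the other k-1 chunks. Nothing in the split itself carries over from one fold to the next.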
hey I need some help
I'm trying to write a library like micrograd, but I want to implement tensors in it. I studied tinygrad's code; it has a lazy buffer implemented, which is going above my noob brain. Can someone guide me on how to get more knowledge in this field, or point me to a guide on how to write the Tensor class with a buffer?
Thank you I actually have my model: https://paste.pythondiscord.com/aduxefuwut
You can see the clear difference: fold one vs fold 2
Here is val loss for all folds
Does not look normal at all, right?
I understand your question
Looks like they train on top of each other, is that supposed to happen
Iirc Keras' .fit method continues from the weights left by the previous call
But are they supposed to train on top of each other?
No, you should start from a clean slate each time
What do you mean by that?
How do I start from a clean slate each time?
Easy way to solve this would be to instantiate your model inside of your kfold loop
I'm a little confused
So move all the model layer inside the for loop?
You can put it in a function that returns a model and call it there yes
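A rough sketch of the pattern described above: a function that returns a brand-new model, called once per fold, followed by a final fit on the whole training set and a single evaluation on the test set. To keep it runnable with numpy alone, TinyLogReg is a made-up stand-in for the Keras model (its fit() also resumes from the current weights); the data, sizes and hyperparameters are assumptions.

```python
import numpy as np

class TinyLogReg:
    """Hypothetical stand-in for a Keras model: fit() resumes from current weights."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)
        self.b = 0.0

    def predict(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def fit(self, X, y, epochs=200, lr=0.5):
        for _ in range(epochs):
            err = self.predict(X) - y          # gradient of the log loss w.r.t. the logits
            self.w -= lr * X.T @ err / len(y)
            self.b -= lr * err.mean()
        return self

    def accuracy(self, X, y):
        return float(((self.predict(X) > 0.5) == y).mean())

def build_model(n_features):
    # New object, new weights: nothing learned on earlier folds leaks in.
    return TinyLogReg(n_features)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                  # made-up data
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
X_train, y_train, X_test, y_test = X[:240], y[:240], X[240:], y[240:]

folds = np.array_split(rng.permutation(len(X_train)), 5)
cv_scores = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = build_model(X.shape[1])            # fresh model every fold
    model.fit(X_train[train_idx], y_train[train_idx])
    cv_scores.append(model.accuracy(X_train[val_idx], y_train[val_idx]))

# After cross-validation: retrain on the whole training set, then use the test set once.
final_model = build_model(X.shape[1]).fit(X_train, y_train)
test_accuracy = final_model.accuracy(X_test, y_test)
```

With Keras the same shape should apply: build_model() would create and compile the layers, so each fold starts from newly initialized weights.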
I see, thank you
so does anyone else have problems with gpu support with tensorflow ? still cant figure out how to use gpu with tensorflow
Hey all, I have a question: Lets say I have an arrow image like that, how do I go about developing a model that segments the arrow object shape, even if the arrow is faded/damaged
Anyone here know about matplotlib's surface plots? Is it possible to make surfaces like a 3D X without the space between the lines that make up the X being filled in with a surface?
Have a folder with decaying arrows and you label them
The more samples you have the better
And then another folder with other shapes you want to differentiate from the decaying arrows
14 y.o on control flow statos?... go get it lad
What do you mean by this, is this the mask of the arrow shape?
For any learning model, you need training data for it to learn
It's like a student studying for an exam, you need to give it examples
You have to create training data first(ie: 2000 samples of decay arrows, 2000 samples of other stuff that is not decay arrows), label your training data correctly and feed it into your model
You might want to change up the ratio a bit, if recognizing decay arrow is important for you, you might want the ratio of sample to be 2:1 decay arrows to not decay arrows
Make sure you normalize the picture, greyscale it if color is not important to you
So like mix of healthy arrows and damaged arrows?
isnt that how captchas work
take a shit ton of samples
of diff things
Yea
Sure thanks, its semantic segmentation right?
A little bit different, but similar ideas I think
Thanks with regards to the masks of the image
Do i put all as healthy arrows? (even for damaged arrows)
well thats my mission
What kind of classification are you doing?
Do you want to categorically classify healthy arrows and damage arrows too?
Or simple binary classification between arrows or not
Im trying to make a model to create mask for both healthy and damaged arrows, the mask should be healthy for any arrows under any conditions
So you don't care if they are healthy or damaged, but just that they are arrows
Yes but im having trouble getting the mask of non healthy arrows currently
Bingo, use binary classification
Put all healthy and non healthy arrows in one folder/label
You need more non healthy samples
The more the better
Also look into your layer settings
Wait, but I want the mask of non healthy samples to appear healthy
And optimizers
You don't care if they are healthy or non healthy right? And you want to mask the non healthy ones as if they are healthy?
Correct
Okay put all arrows, both healthy and non healthy, in the same folder and with the same label
For your training data
Thank you appreciate it
Np bro any time
Any data engineers here? I want to make a project for my resume. What should I make? Also are there any guides and stuff for the field? I've learned some of the basics and used airflow at work as a data collection pipeline
can anyone comment whether this is too simple a model for FaceGen? ```py
import torch
from torch import nn


class Generator(nn.Module):
    def __init__(self, latent_dim: int):
        super(Generator, self).__init__()
        self.latent_dim = latent_dim
        # Project the latent vector to a 128-channel 4x4 feature map
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4)  # Adjust the output size for your latent vector
        # Each transposed conv doubles the spatial size: 4 -> 8 -> 16 -> 32
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc(x)
        x = x.view(x.size(0), 128, 4, 4)
        x = self.deconv(x)
        # Upsample 32x32 -> 64x64
        x = nn.functional.interpolate(x, scale_factor=2)
        return x
```
Hello can anyone take a look at my tsne graph to see if it make any sense
it seems to me that my code is too long to post on python help what do i do
!paste
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
thank you
hello guys
I am trying to fetch the time (hour minute seconds) from a datetime row from an excel dataset i am working with, the dataset has been imported as a pandas dataframe
for k in range(len(dataset)):
    print(dataset['TIMESTAMP'][k])
    print(dt.datetime.strptime(str(dataset['TIMESTAMP'][k]), '%H:%M:%S'))
I keep getting this erro
error
ValueError: time data '0 2023-03-14 09:43:40.547\n0 2023-03-13 09:43:36.160\n0 2023-03-14 08:36:27.753\n0 2023-03-16 01:02:01.147\n0 2023-03-16 10:46:21.233\n0 2023-03-17 17:20:14.313\n0 2023-03-20 10:42:03.623\n0 2023-03-20 02:30:00.000\n0 2023-03-21 02:30:00.277\n0 2023-03-22 02:30:00.000\n0 2023-03-23 02:30:00.000\n0 2023-03-24 02:30:00.000\n0 2023-03-24 13:00:07.667\n0 2023-03-25 20:14:17.683\n0 2023-03-27 02:30:00.380\n0 2023-03-29 10:33:33.397\n0 2023-03-29 02:30:00.503\n0 2023-03-30 02:30:00.237\n0 2023-03-31 02:30:00.533\n0 2023-04-01 02:43:31.530\n0 2023-04-03 16:12:37.930\nName: TIMESTAMP, dtype: datetime64[ns]' does not match format '%H:%M:%S'
i do not understand this
can somebody point to me what i am missing
you want to see the original dataset ?
i just want to keep the hour minute second part
print(dataset) ?
Yeah sure
0 2023-03-29 10:33:33.397
0 2023-03-29 02:30:00.503
0 2023-03-30 02:30:00.237
0 2023-03-31 02:30:00.533
0 2023-04-01 02:43:31.530
0 2023-04-03 16:12:37.930
that is the type of objects im working with
it's always useful if you can post some example data, maybe something for next time
import pandas as pd
data = '''0 2023-03-14 09:43:40.547\n0 2023-03-13 09:43:36.160\n0 2023-03-14 08:36:27.753\n0 2023-03-16 01:02:01.147\n0 2023-03-16 10:46:21.233\n0 2023-03-17 17:20:14.313\n0 2023-03-20 10:42:03.623\n0 2023-03-20 02:30:00.000\n0 2023-03-21 02:30:00.277\n0 2023-03-22 02:30:00.000\n0 2023-03-23 02:30:00.000\n0 2023-03-24 02:30:00.000\n0 2023-03-24 13:00:07.667\n0 2023-03-25 20:14:17.683\n0 2023-03-27 02:30:00.380\n0 2023-03-29 10:33:33.397\n0 2023-03-29 02:30:00.503\n0 2023-03-30 02:30:00.237\n0 2023-03-31 02:30:00.533\n0 2023-04-01 02:43:31.530\n0 2023-04-03 16:12:37.930\n'''
data = data.strip().split('\n')
df = pd.DataFrame([row.split(' ') for row in data], columns=['Index', 'TIMESTAMP'])
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
# chatgpt gave me the above from the snippet you gave, i cba to parse it myself.
print(df['TIMESTAMP'].dt.time)
# this is what you want
docs on datetime accessor: https://pandas.pydata.org/docs/user_guide/basics.html#dt-accessor
i guess no because there is the pb with the microseconds
i thought passing the format as i did would convert it
You can always do print() to see what you get and debug
it's important to note pandas has already parsed the datetime for you here, note the dtype: datetime64[ns] in your error.
you can double check with print(dataset.dtypes)
and once you confirm that this is true, you are almost always better served using the datetime accessor like i demonstrated.
also in general you don't want to be iterating over a dataframe manually
@boreal gale are you good at Tsne
not really no.
d a t e t i m e
i recommend reading https://pandas.pydata.org/docs/user_guide/basics.html#dt-accessor which i linked above - in essence it's the datetime accessor.
and also https://pandas.pydata.org/docs/user_guide/timeseries.html if anything is confusing to you there
they are too detailed and nuanced for me to just explain here from memory, i am bound to forget some crucial bit of info, so i really recommend reading it yourself!
ok noted
and last question
to select only wanted timestamps i can compare them to another one using basic operators ?
@sick ember - just go ahead and post your question, i am sure someone will have some thoughts, and don't worry about reposting (at another time when people are more active) once your post is buried in other activities
yes! i assume you meant the usual > >= <= < etc etc? if so, yes!
dtime = '02:00:00'
dtim = '03:00:00'
for k in range(len(dataset)):
    if dataset['TIMESTAMP'][k].dt.time > datetime.datetime.strptime(dtime,'%H:%M:%S') and dataset['TIMESTAMP'][k].dt.time < datetime.datetime.strptime(dtim,'&H:&M:&S'):
        print('ok')
this fails badly ahah
!e here is how i would use datatime accessor to get what you want (or halfway there)
import pandas as pd
data = '''0 2023-03-14 09:43:40.547\n0 2023-03-13 09:43:36.160\n0 2023-03-14 08:36:27.753\n0 2023-03-16 01:02:01.147\n0 2023-03-16 10:46:21.233\n0 2023-03-17 17:20:14.313\n0 2023-03-20 10:42:03.623\n0 2023-03-20 02:30:00.000\n0 2023-03-21 02:30:00.277\n0 2023-03-22 02:30:00.000\n0 2023-03-23 02:30:00.000\n0 2023-03-24 02:30:00.000\n0 2023-03-24 13:00:07.667\n0 2023-03-25 20:14:17.683\n0 2023-03-27 02:30:00.380\n0 2023-03-29 10:33:33.397\n0 2023-03-29 02:30:00.503\n0 2023-03-30 02:30:00.237\n0 2023-03-31 02:30:00.533\n0 2023-04-01 02:43:31.530\n0 2023-04-03 16:12:37.930\n'''
data = data.strip().split('\n')
df = pd.DataFrame([row.split(' ') for row in data], columns=['Index', 'TIMESTAMP'])
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
# ^^^^^^^^
# chatgpt generated - substitute with your own dataset.
# below is what will be useful to you
import datetime
print(df['TIMESTAMP'].dt.time < datetime.time(3))
@boreal gale :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | 0 False
002 | 1 False
003 | 2 False
004 | 3 True
005 | 4 False
006 | 5 False
007 | 6 False
008 | 7 True
009 | 8 True
010 | 9 True
011 | 10 True
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/iqeyovazid.txt?noredirect
I posted it and still waiting, thanks though π
on this channel?
No in the python help channel
ah found it yes, unfortunate :
maybe repost here in another few hours or so
oh nice, can i get the datetimes between 2 and 3 am on one shot ?
i think & operator is useful to do that
sure you can! indeed with the &!
print(dataset['TIMESTAMP'].dt.time > datetime.time(2) & dataset['TIMESTAMP'].dt.time < datetime.time(3))
i would always wrap with ( ) if i am not 100% sure on the operator precedence
(edit: sorry that's a bit cryptic, i am working out atm and typed that in a rush - i meant you probably want to wrap dataset['TIMESTAMP'].dt.time > datetime.time(2) with () and the same with dataset['TIMESTAMP'].dt.time < datetime.time(3))
& has a pretty high precedence, so you need to do it very often
otherwise this is dataset['TIMESTAMP'].dt.time > (datetime.time(2) & dataset['TIMESTAMP'].dt.time) < datetime.time(3) which you don't want
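Putting the parentheses advice together, a small sketch of the "between 2 and 3 am" filter might look like this. The five timestamps are copied from the data posted earlier; the names times, mask and between_2_and_3 are just illustrative.

```python
import datetime
import pandas as pd

# Small stand-in for the TIMESTAMP column, using values from the log above.
dataset = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime([
        "2023-03-14 09:43:40.547",
        "2023-03-16 01:02:01.147",
        "2023-03-20 02:30:00.000",
        "2023-03-21 02:30:00.277",
        "2023-04-03 16:12:37.930",
    ])
})

times = dataset["TIMESTAMP"].dt.time
# Parentheses around each comparison, because & binds tighter than > and <.
mask = (times > datetime.time(2)) & (times < datetime.time(3))
between_2_and_3 = dataset[mask]
```

Only the two 02:30 rows survive the filter, with no explicit loop over the dataframe.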
from 1 to 10 how easy is it to write an ai in python for a python beginner
At the moment, I focus only at audio transcribing.
I have learned a lot in these 2 days (yesterday and today) about convolutional neural networks (CNNs) and how to make the program run in real time. Further, I have a lot of free time, so if it costs me hundreds of hours to make this program (train models), I don't mind doing it.
Tomorrow I am going to record a lot of samples of my voice along with their transcriptions.
The personal assistant program is going to be used by me, so I don't have to worry about recording other people's voices. Where I do have to pay close attention is the language rules for how I am going to be consistent with my samples.
And I know this is not a basic easy project, this requires a lot of patience/skills.
Good luck @lapis sequoia!
You got me curious about your project
What kind of libraries are you using for your CNN?
I am not at that phase of my project yet
I am now going to look into which library I am going to use for the CNN.
I am probably going to use matplotlib and tensorflow.
matplotlib, because I am going to use spectrogram figures.
For tensorflow, I don't know yet, but I see it often in AI projects. -> source: https://www.tensorflow.org/tutorials/images/cnn
Does anyone know of good ai short courses at the intermediate to advanced+ level? Either free or paid
Andrew Ng courses tend to be highly recommended
i think usually pytorch is recommended as being easier to work with. if you want to use tensorflow, at least use the keras interface
<@&831776746206265384> scam
The tutorial he pointed to uses the keras interface even though it isn't mentioned. So it should be fine.
Having used both,Keras is a very high-level library that makes basic things easier than doing them in Pytorch. OTOH, "advanced" things in Torch are generally easier.
For beginners that are "serious" about deep learning I'd actually recommend Torch because NN's are a leaky abstraction and Pytorch exposes you to more of it. If you're a software engineer that needs a CNN from time to time you can stick to Keras.
I really like tf but I've been shifting more and more to pytorch recently
Especially after they changed the keras-tf relationship and made it a separate python package? idk what's going on with that but it makes things annoying
didn't they do the opposite?
i think originally keras was its own thing intended to be a higher-level abstraction layer over tensorflow and the now-defunct lasagne? then it got eaten by tensorflow and became part of tensorflow? idk i barely use this stuff and it made me just stick to pytorch whenever i need it.
i think there's still a separate keras package but tensorflow also includes a keras "layer" that resembles the keras api
i assume the former is a thin wrapper over the latter when using the tf backend
Hello guys, I am working on a project on context-based MCQ and subjective question generation in English. Can anyone suggest a high-level design for solving this, and which Transformer to use? Also, do I need to fine-tune the model? I don't have the data right now, so I am looking for approaches that don't require any training.
Long ago keras was integrated into tensorflow ig?
The recent change I'm talking about was that keras was a part of the tensorflow pip package till tf v2.7, but a standalone package since
yeah and now they're separating it again
it's a mess
there's IDE autocomplete issues open for over a year where you don't get autocomplete for a simple
tensorflow.keras.xx
huh
finally made me shift to pytorch as my default choice
https://github.com/tensorflow/tensorflow/issues/56231
this has a lot of related discussion
embrace jax :x
jax is fantastic
but not a lot of companies use it rn ;-;
atleast from my experience
I was even trying Sonnet for sometime when jax wasn't very mature
yeah i don't see it in the wild much
that's neat too
or I wasn't mature enough to use jax xd
hey
Need help implementing a Large Language Model. My dataset is a text log of system requests. Each request's syntax is composed of multiple key-value pairs. A request may contain up to 108 key-value pairs. The log consists of millions of requests. The objective is to implement a ML algo that can predict the next log request. Need your help to understand which Large Language model should be used and how to implement. TIA
Are they separating it again? Source?
if you are using an LLM, you are not implementing it. "to implement a LLM" means to code up the architecture and train it.
If the logs contain structured text, using a LLM is probably a bad idea.
TF and Keras can be really unergonomic because of the sheer amount of breaking changes they make to their API. Makes me a bit wary of using Jax for anything serious (I use it in pet projects) edit: jax.numpy should be stable, I guess what's likely gonna be unstable is people building higher level tools on top of it.
LLMs are for natural language. If the logs are structured, you would get better results parsing them into some representation that can be used in an LSTM.
hi, is anyone familiar with power bi here?
ask your actual question
We're at tf 2.13, I'm pretty sure I've installed TF past 2021 and it came with Keras
hello guys,
I hope you are all well.
I have this algorithm to apply for images.
I need help in implementing it using numpy. Thank you
we can still use it with the tf.keras API
tf.python.keras is deprecated I think
and a whole bunch of internal mess to make this happen
plus the python.keras was actually removed in 2.7.0 ig
in 2.6.0 only the live code was moved to keras-team/keras
but the older code still remain in tensorflow
i am trying to model the relationship between these 3 tables. however, despite everything i have tried, i cant really seem to connect all 3 of them together. the stores table has StoreKey and Channel Key, the sale table has Channel AND Store Key, and i have a channel table with Channel Key. How can i relate these 3 tables together?
TF requires Keras and just installs it I guess? https://github.com/tensorflow/tensorflow/blob/master/requirements_lock_3_11.txt
i tried putting a cross directional filtering between channel and stores, and a single directional from channel to sale, it doesnt seem to work...
hey zestar long time no see
i was the person who constantly asked abt the train test split stuff last time xD
zebra
I think so
But the import mechanism is/was a mess
avoid bidirectional filters
Join them like this Table A -> Table B -> Table C
Don't join A and C directly
disclaimer, I'm far from a PBI expert. Did some of it while studying because it paid so so well. The microsoft PBI forums are generally receptive to questions
i tried to find some PBI discord server, nothing found
Use the actual msoft forums
The faster you post, the faster you get an answer
In the meantime what I said might solve the problem
fairly straight forward. we can do this:
```py
p = (z - np.min(z))/(np.max(z) - np.min(z))
```
where some indexing might be necessary if you have several images in the same numpy array
thank you for response. I have made it the same but the results are not like the formula mentions
do you happen to have several images in your array z?
no I have only 1 image
and what issue are you having with the output?
you're missing some parentheses
your code is equivalent to ((x - np.min(x)) / np.max(x)) - np.min(x)
python uses PEMDAS order of operations
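To make the precedence point concrete, here is a tiny sketch comparing the two versions on a made-up 2x2 "image". The array values are arbitrary.

```python
import numpy as np

z = np.array([[10.0, 20.0], [30.0, 50.0]])   # toy "image"

# Intended formula: both subtractions are grouped explicitly.
p = (z - np.min(z)) / (np.max(z) - np.min(z))

# Without parentheses around the denominator, the division happens first:
# ((z - min) / max) - min
p_wrong = (z - np.min(z)) / np.max(z) - np.min(z)
```

p lands in [0, 1] as the formula intends, while p_wrong is shifted down by the minimum and ends up negative for this input.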
yes thank you @desert oar !
in the future it would be better if you could post your output as text, rather than a screenshot
!code
read above for instructions
also @wooden sail thank you very much
the "py" part changes the syntax highlighting, so you'd write py for python, sql for sql, sh for shell/bash, etc.
any recommendation on ML papers for beginners?
How much of a beginner are you? If you're starting from scratch a book might be better.
sure, do you got one?
I saw grokking series on deep learning, I was interested in that
I always link this book https://www.statlearning.com/
thanks man, If you got any other too I would happily read it
I'd start with something like this that is pretty high-level and then follow it up with a harder text like https://hastie.su.domains/Papers/ESLII.pdf if you want to go "deeper" into ML theory
For Deep learning the "dive into deep learning" book is good
Murphy's books are dense and long: https://probml.github.io/pml-book/book1.html https://probml.github.io/pml-book/book2.html the second one is particularly "hard" because each ~30 pages of a 1100 page book could be a book in and of itself, e.g., monte carlo methods is just 20 pages
Anyone good with T-SNE ;-;
thanks @past meteor let me know whenever I could help you with something
Final tip, don't read all of them at once. Start with one, do projects while you're reading it. Read it again (but fast), do more projects. Wait 6 months and then do the next one
I'm interested into logic behind that, can you explain why take longer periods between? I get it in a sense that you will memorize more, but should you consume as much as possible as fast as possible?
Things take time to learn and you should be OK with the fact that learning anything (be it code, statistics, math, playing the guitar) is a multi year journey in which you'll take breaks
Ok. good point. The request syntax has key value pairs and looks something like this $key1 = value1; $key2 = value2 and so on. So there is definitely a structure. Which ML do you recommend?
@serene scaffold The objective is to create syntax of the request. Can LSTM do that? Appreciate some references
Thanks
Personally I always need a multi-month break after binging one of those books and a bunch of papers. I do it in my own time mostly. At work it's also code, books, papers and I think it's also good to touch grass or idk play video games as well in your spare time.
what does it mean to "create syntax"?
this is very wise and deep what I just read
I will think about that and try not to beat myself again if I won't binge read 2 books per week month
Any recommendations on reading material for PAC learning?
Can anyone please help me with t-sne graph
Please stop asking "can anyone help with x"-like questions without saying what it is that you actually need help with. If you've already done that, and no one volunteered to help, you might be on your own.
You're right lmao
Answerers don't want to do a bunch of back-and-forth with the asker to figure out what the actual question is, so you must ask a complete question all at once.
From Mitchell's original paper on it maybe?
I learnt about it in uni but idt it had value for me personally. I can send you the slides.
I only vaguely remember what it is because it's used in VC-dimensions but then again, I'm not sure if those are still relevant with neural nets and double descent. I guess people are trying: https://arxiv.org/abs/2205.15549
There has been growing interest in generalization performance of large
multilayer neural networks that can be trained to achieve zero training error,
while generalizing well on test data. This regime is known as 'second descent'
and it appears to contradict the conventional view that optimal model
complexity should reflect an optimal balance bet...
basically creating key-value pairs ... generating upto 108 such pairs. $key1 = value1; $key2=value2; $key3= ... and so on. Can LSTM do it?
are you asking if an LLM can emit code that parses data according to a structured format that you provide?
I don't think so.
yes
You said
The objective is to implement a ML algo that can predict the next log request. Need your help to understand which Large Language model should be used and how to implement. TIA
i see...
@weak lagoon salt rock lamp is asking you if you're trying to use ChatGPT (which is an LLM) to write a log parser program.
i didnt use chatgpt. i was able to parse the files and create a df with all keys stored in columns and value in rows; so millions of such rows were created. Thing is there is no ' 1 ' target feature to predict, else i could have used a LR or NB ML algo. I actually have to predict 108 keys and values
@weak lagoon LSTMs are for predicting on sequences of data. So if you have ["x=a", "y=b", "c=d"], an LSTM that has been trained on sequences like that could predict what goes next, and then next, and then next, etc.
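A rough sketch of how a sequence like that could be framed for next-item prediction before it reaches an LSTM: each token gets an integer id, and a sliding window yields (context, next token) pairs. The helper name `make_next_token_pairs`, the window size and the sample tokens are made up for illustration; the LSTM itself is not shown.

```python
def make_next_token_pairs(tokens, window=2):
    """Turn a token sequence into (context ids, next id) training pairs."""
    # dict.fromkeys keeps first-seen order, so ids are stable and readable
    vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
    ids = [vocab[tok] for tok in tokens]
    pairs = [(ids[i:i + window], ids[i + window]) for i in range(len(ids) - window)]
    return vocab, pairs

# Hypothetical key=value tokens parsed from a log
tokens = ["x=a", "y=b", "c=d", "x=a", "y=b"]
vocab, pairs = make_next_token_pairs(tokens, window=2)
for context, target in pairs:
    print(context, "->", target)
```

A sequence model would then be trained to map each context to its target id.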
even though natural language is a sequence of words, and LLMs work in terms of sequences (of words), you shouldn't use LLMs for all problems that involve sequences
perfect... that is exactly what I want
I'll google but any recommended resources?
Not off the top of my head
I read Sutton and Barto RL book, what is that DP ?
@sick ember - i read your code re. your tsne question
i don't think averaging the embedding makes a lot of sense, where did you get that idea from?
see: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
specifically :
t-SNE [1] is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
even though you pinned the random state, the fact that you are training with different training data would make the embedding in different folds not comparable, hence average would not make sense.
that's my 2 cents anyway, someone with more theory background into how tsne work can comment more.
Hello! I hope this is ok to post. I'm working on a Python package to generate SQL using AI. It's in an early state and I was wondering if anyone would be willing to give me feedback on the documentation? https://vanna.ai/
Thank you! And thanks for the link! I will look into it more in the meantime
dynamic programming
If you look at the Bellman equation, the DP solution would be the one that requires all information. You need the reward model, the transition model, and you need to visit all states multiple times (which isn't really possible)
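To make the "requires all information" point concrete, here is a minimal value iteration sketch on a made-up 3-state, 2-action MDP. The transition tensor `P`, reward table `R` and discount `gamma` are all assumptions invented for the example; the point is only that the DP backup needs both models in full and sweeps every state.

```python
import numpy as np

# Hypothetical MDP: state 2 is absorbing.
# P[a, s, s2] = probability of moving s -> s2 under action a.
P = np.array([
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # action 0: "left"
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1: "right"
])
# R[a, s] = expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])
gamma = 0.9

V = np.zeros(3)
for _ in range(100):
    Q = R + gamma * P @ V          # Bellman backup for every (action, state)
    V_new = Q.max(axis=0)          # act greedily in every state
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
policy = Q.argmax(axis=0)
print(V, policy)
```

RL methods like Q-learning exist precisely because `P` and `R` are usually not available like this.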
is it ok to adjust all rewards for steps after full "game cycle"? Like for example there are mini rewards for every action. But we also change rewards using endscore
Don't fully understand the question, do you mean for DP? Usually you'd do backwards
I mean RL
Deep Q to be specific
Oooh like that. There are two classes of algorithms: Monte Carlo algorithms, which need the episode (= the game) to end, and temporal difference algorithms, which can update after each step
I'm lying a bit because there's a whole class of algos that combine MC and TD, it's more of a spectrum
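A small sketch of the difference, with made-up rewards and a made-up value estimate. The Monte Carlo return can only be computed once the episode is over (which is also how an end-of-game score flows back to earlier steps through discounting), while the one-step TD target is available right after a single transition. Function names are hypothetical.

```python
gamma = 0.9

def monte_carlo_returns(rewards, gamma):
    """Discounted return G_t for every step, computed backwards after the episode ends."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def td_target(reward, next_value, gamma, done):
    """One-step TD target: bootstraps from the current estimate of the next state."""
    return reward + (0.0 if done else gamma * next_value)

rewards = [0.0, 0.0, 1.0]                      # reward only at the end of the game
mc = monte_carlo_returns(rewards, gamma)
td = td_target(reward=0.0, next_value=0.5, gamma=gamma, done=False)
print(mc, td)
```

Deep Q-learning uses the TD form, with `next_value` being the max over the network's Q-values for the next state.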
Hello, im having an issue with tensorflow and keras, I have installed tensorflow and imported it, however I am getting the error 'Import tensorflow.keras could not be resolved.'
just a methodology question but, when using a gradient boosting model like xgboost or lightgbm what percentage of missing in a variable is too high
like if 70% is missing should I just throw it away
I need a guide to mathematical notations is there a site where you type a notation and you get the numpy python version?
@civic elm hi
there is an entire field of study on missing data. throwing an entire feature away is usually not a good idea, unless you already don't think it would be a useful feature. it's usually important to think about why the data is missing, as a starting point.
tree-based models are interesting because they can handle missing values by treating "missing" as a distinct value. but that's not always what you want.
Assuming I want to predict the stock market's closing price the next day using 30 previous days of closes, opens, highs, lows, etc using an LSTM, how would I feed the data into the model? I'm not sure how you would input multiple features into an RNN, and the above question was an example if it helps.
Hi everyone, does anyone have experience with the SciPy signal (scipy.signal) processing libraries?
I wanted to get some guidance with a few functions in the library.
Hello, I have a website, and it has a couple hundred pages. I want to build a chat bot using its data sources, FROM ONLY THE WEBSITE, no documents. How can I build this?
I have a list of times with coordinates in matplotlib, how can I make it so that the x-axis is not just a black blur (too many ticks and all of them are shown)
I was looking for something additional
Sure that'd be great
I'm only looking into it as a passing interest
any recommendation on podcasts for ML or AI in general? could use some during transits.
Hello everyone kindly look at this question, https://stackoverflow.com/questions/76668053/time-series-forecasting-with-5-min-intervals
The early episodes of the Lex Fridman podcast were somewhat focused on ai
These days they're more general I think
i have a question
i have this program that is supposed to scrape a website and put it into a spreadsheet format
but i keep getting an error
```
import requests
from bs4 import BeautifulSoup
import openpyxl
from openpyxl.styles import Font

# The workbook must already exist, otherwise load_workbook raises FileNotFoundError
wb = openpyxl.load_workbook('NewExcel6.xlsx')
sheet = wb['Sheet']

# Header row: column widths are set through .width, not by assigning a number
font_name = Font(size=18, bold=True)
sheet.column_dimensions['A'].width = 32
sheet['A1'] = 'Company Name'
sheet['A1'].font = font_name
sheet.column_dimensions['B'].width = 60
sheet['B1'] = 'Required Skills'
sheet['B1'].font = font_name
sheet.column_dimensions['C'].width = 153
sheet['C1'] = 'Link'
sheet['C1'].font = font_name
sheet.column_dimensions['D'].width = 30
sheet['D1'] = 'years of experience'
sheet['D1'].font = font_name
sheet.column_dimensions['E'].width = 13
sheet['E1'] = 'location'
sheet['E1'].font = font_name

for page in range(1, 11):
    url = requests.get(f'https://www.timesjobs.com/candidate/job-search.html?from=submit&actualTxtKeywords=python&searchBy=0&rdoOperator=OR&searchType=personalizedSearch&luceneResultSize=25&postWeek=60&txtKeywords=python&pDate=I&sequence={page}&startPage=1').text
    soup = BeautifulSoup(url, 'lxml')
    jobs = soup.find_all('li', class_='clearfix job-bx wht-shd-bx')
    for job in jobs:
        date = job.find('span', class_='sim-posted').span.text
        if 'few' in date:
            skills = job.find('span', class_='srp-skills').text.strip()
            company_name = job.find('h3', class_='joblist-comp-name').text.strip().replace('(MoreJobs)', '')
            more_info = job.header.h2.a['href']
            years_of_experience = job.find('ul', class_='top-jd-dtl clearfix').find('li').text.strip(' card_travel')
            location = job.find('ul', class_='top-jd-dtl clearfix').find('span').text
            # Append one row per job, in the same order as the headers above
            sheet.append([company_name, skills, more_info, years_of_experience, location])

wb.save('NewExcel6.xlsx')
```
@jade raptor Just share the error and the line of code which gives you the error.
the error is at line 6 wb = openpyxl.load_workbook('NewExcel5.xlsx')
heyyo ! btw i wanted to thank you for the otext website link you sent me it was really helpful you have any more resources like for stats i learned most of the stats from statquest on youtube but i dont think its enough and anything which help me better do data analysis ? do you have any such resources ?
Okay so you guys know how there are multiple models for stable diffusion, I want to create a model that takes the user's prompt and basically analyses it to choose the model best suited to produce the requested results. I need the mathematical approach since I am trying to write a graduation thesis on it. A few ideas I got are k-nearest neighbors, one-hot encoding and word2vec, but I need some insight from someone who better understands how to implement a mathematical and practical solution to this issue.
well you can make your custom dataset. it's quite similar to detecting the meaning of texts and classifying them into one of several categories. one of the approaches you can use, similar to what you mentioned, is using word2vec to convert words to their respective vector representations and then classifying them using a classification algo like knn or svm
i would suggest a better way
use transformer networks for it
for example you feed your tokenized sentence to your transformer network
so I'd tokenize my input prompt and then input it to a transformer network and let the attention mechanism do its magic?
right, you can then add a final dense layer with n units (n = number of models you have, i.e. the number of classes you want to classify your prompt into). once you have the classified output, let's say i input the text hello world and the output is a vector [1,0,0,1], it means that this prompt should go to the model which is linked with the output label [1,0,0,1]
mostly it should just be able to classify the type of model checkpoint that's appropriate to use with the given input, gonna have like 30 checkpoints, gonna have to extract descriptions for each of those checkpoints and then use them as classes or something
like anime man standing on a bridge --> the model should detect that the best model to use here is anime checkpoint
could have a landscape model too
so in such cases it should just pick the most appropriate, for example realistic anime man on bridge
there is a conflict here but it should still work.
somehow i need to score the sentences on term of compatibility with the given model.
true
correct me if am interpretting it in a wrong way but
your model will learn
also 2 more questions sir, first does this count as multi-modality and 2nd is there a way i can represent all the different weights of the same model through a single equation so that a single forward pass may be able to produce all the different outputs by different checkpoints?
will learn how to represent the different checkpoints through extracting or generating feature labels for the different possible checkpoints
yup
it will then analyse the input prompt against the different checkpoints to classify which the appropriate model to use.
yess that will be our basic job i guess it should all work because i worked ona similar project used to classify comments on the basis of their toxicity
wow! YAY 
So you think the best architecture for this task to use would be transformers?
i am not sure about the equation part tho but yess it will be multi modality since we are using different classes for different objects
also you got any resources I could look into regarding this problem, I've mostly worked with CV before so this project is sorta new to me.
yess in my case i used just some bidirectional rnn layers and some dense layers but since you are doing it for thesis and want some nice results i would suggest trying transformers
I can figure the equation part out but I am not sure about the training part.
can i dm ?
so that i can send you the link of the project i worked on
your input is pure text and output is class labels
You're working with only a single modality of data (text)
So it will not be multi-modal
we are dealing with different classes so i guess we are dealing with multi modaility humans can be divided too in different classes and those cases are counted in multi modality iirc ?
yes for sure!
our processing is all of the same type of input data, correct?
Modality in deep learning refers to the mode of the data, whether it is Image, audio, text, etc.
As long as your model's inputs are all of the same modality, it is not a multi-modal model
the text classess will be used to select between different image generators.
This task could be solved in a multi-modal manner, but I do not see any advantages of that over a robust text classifier
our inputs are both images and text but the images will be manually labelled.
we could create a classifier to classify the images automatically
The model that decides which image-generator-checkpoint to use - this model takes what as input? Text from the user? Anything else?
text
and the checkpoint
i guess he is trying to use both texts and images thats why i suggested mutli modailty
why would it need checkpoints as an input? those could just be the output classes?
yess checkpoints are supposed to be output classes
assume someone used a checkpoint file not in the original training dataset
somehow the model should be able to extract the appropriate labels for the file
The user only enters text right?
either the user should label it manually, or it should come prelabelled
for example assume I had 3 model checkpoints (waifu-diffusion, photorealistic, stable-diffusion-base)
the user should input a prompt
and the model should be able to infer which of those checkpoints to use
rather than the user always having to select manually.
yup so you are basically gonna say heey i want a nice waifu image so the model should use waifu diffusion
tokenising the input and manually extracting the features of the checkpoints and then running them through a transformer?
a robust text classifier with user's prompt as input text, and your 3 model checkpoints as the list of output classes
for each prompt the user inputs, the model will output a signal indicating one of these checkpoints to be used
not necessarily that clear, could be "portrait of anime girl standing by a lake"
as long as it's text
if its only text as input its unimodal
you can use a text-classifier with n number of output classes, that will take the prompt as input and classify which checkpoint to use
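A minimal sketch of that setup, using TF-IDF features plus logistic regression from scikit-learn as a stand-in baseline rather than a transformer. The prompts and their labels are invented for the example; a real dataset would be hand-labelled and far larger, and the checkpoint names are the three mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up (prompt, checkpoint) dataset
prompts = [
    "anime girl with blue hair",
    "anime man standing on a bridge",
    "waifu portrait, anime style",
    "photo of a mountain lake at sunrise",
    "realistic photo of an old man",
    "photorealistic city street at night",
]
labels = ["waifu-diffusion"] * 3 + ["photorealistic"] * 3

# Prompt text in, checkpoint name out
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(prompts, labels)

choice = clf.predict(["portrait of anime girl standing by a lake"])[0]
probabilities = dict(zip(clf.classes_, clf.predict_proba(["anime"])[0]))
print(choice, probabilities)
```

Swapping the pipeline for a fine-tuned transformer keeps the same interface: the checkpoints stay the output classes.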
what stargazer suggested is fine
what text classifier would you recommend?
yess the output classes can be the checkpoints label
but I'll have to manually label the checkpoints right? unless i run them through a CNN somehow and extract the labels automatically.
all this is assuming you're providing a service to the user
That the user will come on to your platform and use one of the checkpoints you've made available
If you want the user to train their own checkpoint and plug that into this, then that's a very different task and you would need a very different method
you would ideally need a dataset where the X feature is an input prompt by the user and the target feature is what model checkpoint would be most appropriate for it
You would probably have to create this manually
great, could you further elaborate! I am thinking between proposing it as a service and as a program.
the same thing i suggested up there
create a manual dataset
i was thinking if a user could train it on their own checkpoint then i'd have to first make their checkpoint generate a base generation.
and use a transformer for classification
I'm a little confused if you want it to be something available to the user where they input a text and get the best output image based on that
or do you want just a model selecting system where any developer would be able to plug it in to match any user text to a set of custom checkpoints that developer has trained themselves
both, but I can drop the 2nd if it's too complex
i would suggest just go with the 1st one since you will be able to build it very easily and efficiently
for the first one you just need a simple text classifier, and make a dataset with the checkpoints you have matched to the prompts that would be appropriate for them
the second is a very different problem
great discussion so far.
for that you would probably want to use a T2I encoder to get the representation of the input text in a joint image + text vector space
then you would compare this with encoder outputs from all the different checkpoints by using an explicit conditioning token for each checkpoint
And then calculate the minimum distance between these vectors to get the best model
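The selection step on its own could look like the sketch below. The 3-d vectors are made up and merely stand in for encoder outputs; how to obtain a good representation per checkpoint is the hard part and is not shown. Cosine distance is one possible choice of distance, not something fixed by the approach.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means the vectors point in a more similar direction."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for per-checkpoint representations
checkpoint_vecs = {
    "waifu-diffusion": np.array([0.9, 0.1, 0.0]),
    "photorealistic": np.array([0.1, 0.9, 0.1]),
    "stable-diffusion-base": np.array([0.4, 0.4, 0.4]),
}
prompt_vec = np.array([0.8, 0.2, 0.1])   # made-up embedding of the user's prompt

best = min(checkpoint_vecs, key=lambda name: cosine_distance(prompt_vec, checkpoint_vecs[name]))
print(best)
```

Only the selected checkpoint then needs a full forward pass.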
keep in mind that I'm hoping to include some of the math or apply it on pen and paper before applying it practically.
this is a very fragile solution, it'll take time to come up with a good solution for this. I just put it out there from what I could understand in 10 seconds
can't I use an I2T and apply a minimum distance?
that was my idea.
for that you need an image first
for which you will need to run all 30 checkpoints
which is very expensive
don't i need to run all 30 checkpoints either way
no, I thought that's what we were trying to avoid lol ;-;
also it's initially expensive but on the long run it should be cheap since i only gotta run the new checkpoints after mapping out the initial 30
otherwise it should be relatively simple
this is the reason i suggested the 1st model since it would be easier to implement and write the math for
lol same
for each new input you will have to calculate output images and vector embeddings to calculate distance wrt that input text right
so this one for now.
do you think it would be good enough for a bachelors graduation proposal, I mean I'm studying computer science but i really wanna specialise in AI.
considering I do the math, theory and practical demo
I think it would be a little complex to figure out a one-shot solution for this
If you want to "map-out" the checkpoints i.e. obtain a generic representation for them via an initial process and then just use these later
That would probably require separate training itself
well I was hoping to create a joint representation for the 30 somehow where if i run the joint representation i should get the 30 outputs with one forward pass but that seems too complex.
you can't get 30 outputs for the cost of one
what we can hope to do is to get 30 representations of the checkpoints themselves and a representation of the input text
and then based on the distance select one checkpoint to run the full forward pass on
but that is an elaborate problem
it would basically be a simple text classifier + dataset
Depends on the standards of your uni
if you somehow figure this out, I think that would be much better suited to a graduation thesis, but that's a much more elaborate problem
hm... isn't this somehow like 2x + 2y = 6 , 3x - 2y = 7 kind of problem if you get what i mean?
wouldn't the 30 somehow be linear?
since they are both constricted by the same params.
which both? I don't follow
i mean all 30 checkpoints
they are all of the same neural net
so that's one thing they all share, the number of params.
can't there be a unified equation that's solvable through one forward pass for all the 30 checkpoints>?
hm.. nvm this is way over my head
what will you forward pass on? that output of the forward pass is entirely dependant on the weights
And that's what we're trying to identify
you don't have 2x + 3y = 7
i already have the weights, i have 30 sets of weights
yes but you don't know which to use
all those 30 weights will be different sets of {A, B} weights
can't i use all of the 30 through one pass.
that was my question, without significantly adding to computational intensity.
one pass through what?
your forward pass will be calculating f(Ax1 + Bx2 + ....) correct?
how will you choose what set of {A, B, ...} to forward pass on
if you have to calculate 30 results, you will have to perform 30 computations
I don't see how you can avoid that
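A toy illustration of that point with a single layer and random made-up weights: stacking the 30 weight sets lets you write one call, but the arithmetic is still 30 separate matrix-vector products. The sizes and the `tanh` nonlinearity are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_checkpoints, d_in, d_out = 30, 8, 4

# 30 made-up weight matrices standing in for 30 checkpoints of one layer
W = rng.normal(size=(n_checkpoints, d_out, d_in))
x = rng.normal(size=d_in)

# One call, but it computes every checkpoint's output separately under the hood
batched = np.tanh(np.einsum("kod,d->ko", W, x))

# The explicit loop gives exactly the same numbers
looped = np.stack([np.tanh(W[k] @ x) for k in range(n_checkpoints)])

mults_single = d_out * d_in                  # multiplications for one checkpoint
mults_all = n_checkpoints * mults_single     # 30x as many for all of them
print(np.allclose(batched, looped), mults_single, mults_all)
```

Batching can improve hardware utilisation, but it does not reduce the amount of computation.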
{A1, B1} , {A2, B2} cant i create an {Ax, Bx}
and that will be what? Mean of all sets of weights?
as to be able to extract the individual components at the end step
not mean but maybe equation that represents A1, B1 and A2, B2 that can be applied to A12, B12, to get back A1,B1 and A2,B2
you know how in cryptography you get a key that's both the public key and private key
by what you are proposing one could simply hypothesize a set of 100,000 model checkpoints that would just represent the full set space of plausible weights for that task
in such a case a single forward pass would give you the results with the most appropriate model checkpoint (set of weights)
then we would never need to train a neural network at all!
and u can somehow extract both the public key and private key from that one key
so you got one piece key that represents 2 distinct keys, if you get my comparison here..
the point of the private key is that it's private and cannot be obtained directly, that's not how cryptography works
yes i now think this is way over my head
yes but i am talking about the concept where one key can be used to represent 2 keys, can't somehow one checkpoint represent 2
you can derive the public key from the private key from my knowledge, not the other way round
yea, mb xD
or you know how there the concept of equations where x could be -a or +b
even that is dependant on a lot of special cases I think
the only thing you might be better served by is Quantum Computers I guess xxd
maintain your model checkpoints as a superposition of different states
on completing the forward pass (observation) it decomposes to a particular state
this is a joke
for now i think I should just keep it as simple as possible, nothing generic but nothing newtonian
oh lol quantum AI
no thanks...
I don't think that's how quantum ai works either
what you're asking for is encoding the complete information, reversibly extractible, of an abstract number of components into just a single component
if such a thing was possible, it would be the most advanced and magical compression system in the world
exactly!
if i ever wrote it, I would use that as the name of the paper..
Imagine, you would never need to store all these different model checkpoints
You could devise your whole network as a collection of different checkpoints of just a linear layer, and a conv layer, etc.
such a system obviously does not exist
and I doubt it can
nothing is impossible it might not be the way i described
but maybe you could get a superposition of the matrix.
idk
I'll just stick to something that's tough but not that advanced..
good luck with your thesis!
@wooden sail any insight?
so for my problem I am better suited looking into tokenizing my input, either manually labelling my checkpoints or extracting labels from my checkpoints and finally comparing them with a transformer
what should be looking into for the above 3 steps here
if you just want to provide a service where the user gets the best style of output image for their input text, on your platform, with your checkpoints
Then a text classifier should be sufficient
for this you will need to create a dataset with (text, checkpoint) pairs where each input text is labelled to the most suitable checkpoint
and pass them through a network with an attention mechanism?
so once it sees a key in the input text it should infer the checkpoint directly?
or just use a simple neural net?
yeah you can treat compression as an encoding task for which there is a hard limit based on the entropy of the data
there are many ways to create a text classifier, a transformer is one of the more powerful ones
maybe shannon's source coding theorem
mm yes perhaps
interesting, thanks!
https://en.wikipedia.org/wiki/Shannon's_source_coding_theorem yeah that seems to be the one
In information theory, Shannon's source coding theorem (or noiseless coding theorem) establishes the limits to possible data compression, and the operational meaning of the Shannon entropy.
Named after Claude Shannon, the source coding theorem shows that (in the limit, as the length of a stream of independent and identically-distributed random v...
but it makes some assumptions from what I remember?
i think the usual statement of the theorem requires the symbols to be statistically independent, which in general is not true
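For a feel of the bound, here is a small sketch that computes the Shannon entropy of a message's empirical symbol distribution. Under the independent-symbols assumption mentioned above, entropy times message length is the lower limit on the average number of bits a lossless code needs. The message is made up.

```python
import math
from collections import Counter

def entropy_bits(message):
    """Shannon entropy of the empirical symbol distribution, in bits per symbol."""
    counts = Counter(message)
    n = len(message)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

message = "aaaabbcd"
h = entropy_bits(message)               # bits per symbol
lower_bound_bits = h * len(message)     # bound for the whole message under the i.i.d. model
print(h, lower_bound_bits)
```

A fixed 2-bit code for the four symbols would use 16 bits here, so the gap to the bound is what a good code can recover.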
tell me if you find anything interesting.
hmmm
you can find a similar bound but tighter for a particular set of data if you know its statistics
regarding the concept.
what are you trying to do?
lol edd the expert is here
hi! im really new to huggingface and getting crazy with the documentation :)))
anyone can help with this?:
I have local image data in which each subfolder contains images of a class, loaded the data with load_dataset .
then i noticed it is very slow in the feature extracting and training process,
so I want to divide the data into 10 parts, each containing N images of every class, and then feed these 10 parts separately to the extractor and trainer.
any suggestions?
oh the question about limit on compression was just an interesting tangent
MomentoAmori was trying to build a software that:
Given a text input and a set of n T2I generation model checkpoints each with a different "style" (anime/photorealistic/pencil), could decide the best model checkpoint to use and run the forward pass on just that
The model checkpoints are not known a priori
The user may run an initial training/setup process but it should be plug and play
Hello, I have a bot made from botpress and its clone in langchain python, it uses a website as its knowledge base,
when I ask for "who performs the keloid surgery", it responds with "Dr. XXXX"
BUT when I ask for "does Dr. XXXX perform the keloid surgery?", it responds that it does not know it.
How should I fix this?
Guys is there a way for me to display sns.heatmap using for loop?
seaborn uses matplotlib under the hood. i am just mentioning this in case missing that keyword is what's keeping you from googling the answer to your question.
personally i would just use matplotlib.pyplot.subplots to create a new figure and axes and pass that axes to seaborn via the ax argument.
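A sketch of that suggestion: create the figure and axes once with `plt.subplots`, then loop and hand each axes to `sns.heatmap` through `ax`. The random matrices are placeholder data, and the `Agg` backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
matrices = {f"matrix {i}": rng.random((4, 4)) for i in range(3)}  # made-up data

fig, axes = plt.subplots(1, len(matrices), figsize=(12, 4))
for ax, (title, data) in zip(axes, matrices.items()):
    sns.heatmap(data, ax=ax, annot=True, fmt=".2f", cbar=False)
    ax.set_title(title)
fig.tight_layout()
```

In a notebook, finish with `plt.show()`. To get one figure per heatmap instead, call `plt.subplots()` inside the loop.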
Hmmm
See this @boreal gale
This code gives me the output as:
from sklearn.metrics import plot_confusion_matrix
models = {
    "Logistic Regression": LogisticRegression(),
    "AdaBoost Classifier": AdaBoostClassifier(),
    "KNeighbors Classifier": KNeighborsClassifier()
}
for model_name, model in models.items():
    # print(model)
    model.fit(X_train, y_train)
    model_prediction = model.predict(X_test)
    # model_confusion_matrix = plot_confusion_matrix(y_test, model_prediction)
    print("Evaluation for: {model_name}".format(model_name=model_name).center(77, '_') + "\n" +
          "Model Type: {model_type}".format(model_type=model) + "\n\n" +
          "{model_name} - Confusion Matrix:".format(model_name=model_name),
          # plt.figure(i)
          plot_confusion_matrix(model, X_test, y_test),
          "\n" + "THE END".center(77, '-'))
    plt.show()
I want to fit plot before the line THE END.
figured it out
I got so confused by the syntaxes
Damn!
them matrix shapes so confusing
yeah i just watched the yann lecun episode and it was rly fascinating
Here I go again: introduction to statistical learning for a high level but rigorous introduction
will this book be enough i looked at the topics most of the stuff ik its just some nitty gritty stuff which i dont really know
hello guys
i'm learning to train Mask-RCNN model in Pixellib but i can't load the folder named "Nature"
i need your helps
what's the purpose of margins in SVM's if the decision boundary already separates the classes of data
idk
the margins are used to choose the separating hyperplane: you first find the margins, and the hyperplane is drawn halfway between them
hope it helps, feel free to ask if you don't understand
stats is really a pain but you gotta go through it ;-;
got it! thanks!
I have a course for economics. So its a bit easier.
:) no issues please feel free to ask anymore
my course is electronics ;-; they dont teach a thing about stats in it ;-;
except for communication theory and singal and system theory
Mine is pretty easy. I have just to use formula and need a bit knowledge of different types of data. But its stat1, next semester I have stat2
I want to build an AI chatbot for a person. It will be building a knowledge base from his website and answering questions based on the content of the website. What tools and apps should I use for this: Dialogflow, Watson, Lex, or what?
and how should I approach it, how should I build it? any help would be appreciated, thanks!
any book you wanna suggest for stats i wanna imrpove my knowledge of stats
I have no Idea. but I can say what my prof recommends ?
well you can take inpirations from LLMs
sure
nicholas renotte you can check his youtube for LLM's
So wait a second, I need to look first, if they have a translation of them in english
ook !
Academia.edu is a platform for academics to share research papers.
you can use a framework like langchain with a prebuilt LLM to make a bot and then have that bot answer from those websites. it's very much possible, but i haven't made such a bot yet so i can't really guide you to the code. there are a bunch of tutorials available on youtube
I need a proper chatbot real fast, the best working solution till now is botpress for me, but its kinda buggy and has limited control
thnnxx
well if you need the code directly you can probably find such codes on github
no I do not need the code directly...
The others are in german only. But my prof liked on twitter these one: https://twitter.com/kareem_carr/status/1679141068275847168?s=20 Idk if there is much about statistics. Maybe read some comments...
thank you for the book suggestions ! i appreciate it
then ?
like what's the best solution, or nvm do you know about any open source thing built specifically for scraping the texts from a website.
I download the book. I can send it if you dont wanna sign up ?
naah sign up aint the issues i can always find the pdfs of famous books for free
but i really appreciate the effort
no problem. I help how I can.
I making a chatbot with python and react , I can't use a chrome extension within
you can check out this documentation if it works
bro i know that, nvm.
@wooden sail can you help him
well edd know a lot maybe he can help a bit
okay :)
nah i don't know anything about that
alright sorry about the ping
It's still worth reading imo
If you want a more detailed book elements of statistical learning is good
It doesn't have to. SVMs also have a slack variable that allows you to deal with data that isn't perfectly linearly separable
The most helpful way to think about SVMs is that they're logistic regression with a different loss function (hinge loss instead of binary cross entropy) and they're typically, but not always, solved in the dual formulation, whereas logistic regression is solved in the primal. Reduces the "magic" by 90 %. If you solve log reg in the dual you can do the kernel trick as well etc etc
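The two losses side by side, as a quick sketch with labels in {-1, +1} and a few made-up classifier scores. Hinge loss is exactly zero once a point is beyond the margin, while the logistic loss keeps shrinking but never reaches zero.

```python
import numpy as np

def hinge_loss(y, score):
    """SVM loss for labels y in {-1, +1}: zero once the margin y*score reaches 1."""
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    """Binary cross-entropy written in terms of the margin y*score."""
    return np.log1p(np.exp(-y * score))

scores = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])   # made-up raw scores w.x + b
y = np.ones_like(scores)                        # all points truly belong to class +1
hinge = hinge_loss(y, scores)
logistic = logistic_loss(y, scores)
for s, h, l in zip(scores, hinge, logistic):
    print(f"score={s:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```

That zero region is why only the points on or inside the margin (the support vectors) end up determining the SVM solution.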
alright thnx !
Just completed week 2 of Andrew Ng's Coursera course. Would the binary classifier from its programming assignment work in production?
It's a logistic regression nn. I understand the math completely but I am not sure how I am going to make use of it.. is it just academic and non-practical?
logistic regression and binary classifiers can be useful in production. The example he wrote is probably not too efficient, and just for educational purpose. But the idea is practically useful yes.
Because often they make the loss 0.5 * (y_true - y_predicted) ** 2 @zenith epoch
Just so the derivative w.r.t. the prediction is simply y_predicted - y_true, with no factor of 2
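A quick numeric check of why the 0.5 is there, with arbitrary example values: the factor cancels the 2 from the power rule, so the gradient with respect to the prediction is just the residual (y_predicted - y_true, i.e. the difference up to sign).

```python
def half_squared_error(y_true, y_pred):
    return 0.5 * (y_true - y_pred) ** 2

def grad_wrt_pred(y_true, y_pred):
    # d/dy_pred of 0.5 * (y_true - y_pred)**2; the 0.5 cancels the 2 from the power rule
    return y_pred - y_true

y_true, y_pred, eps = 3.0, 2.5, 1e-6

# Central finite difference as an independent check of the analytic gradient
numeric = (half_squared_error(y_true, y_pred + eps)
           - half_squared_error(y_true, y_pred - eps)) / (2 * eps)
analytic = grad_wrt_pred(y_true, y_pred)
print(numeric, analytic)
```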
Yeah, and in the end it's just a scalar
Why does this not work?
What are you trying to do?
why would you expect it to work?
I'm trying to guess the intent here... was it df.columns, or just df?
hey guys, do local runtimes on google colab have a limit?
is it same as using other gpu?
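```
import time

# Wait for the first Enter, then record the start time
print('Press ENTER to begin. Afterward, press ENTER to "click" the stopwatch. Press Ctrl-C to quit.')
input()
print('Started.')
startTime = time.time()  # when the stopwatch started
LastTime = startTime     # when the previous lap ended
LapNum = 1

try:
    while True:
        input()  # block until Enter is pressed again
        laptime = round(time.time() - LastTime, 1)     # seconds since the previous lap
        totaltime = round(time.time() - startTime, 1)  # seconds since the start
        # end='' because the Enter keypress has already moved to a new line
        print(f'Lap {LapNum}: totaltime:{totaltime}s laptime:{laptime}s', end='')
        LapNum += 1
        LastTime = time.time()  # reset the lap timer
except KeyboardInterrupt:
    # Ctrl-C raises KeyboardInterrupt; catching it lets the program exit cleanly
    print('\ndone')
```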
can someone explain this code to me
i saw it in a book but i don't fully understand it
I'm trying to get a list of column names so that I could make my own subset of the dataframe. I'm following this yt video and it worked for him, so I'm confused why it didnt work for me. The below image is the output he got when he ran it
hello guys
is there a way to select rows from a pandas dataframe matching a condition and reindex the returned dataframe starting from 0 on ?
I have already selected the rows matching a certain condition but their index values are not starting from 0 onward
and I would like to achieve this in order to run: print(myselectedrows[0]) successfully
assign it to another variable then reset_index()?
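Something like this, with a made-up dataframe. `reset_index(drop=True)` renumbers the rows from 0 and discards the old index. Note that `selected[0]` would look for a column named `0`, so the first row is fetched with `.iloc[0]` (or `.loc[0]` after the reset).

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [10, 55, 30, 80]})  # made-up data

# Filter, then renumber the surviving rows from 0
selected = df[df["score"] > 25].reset_index(drop=True)
first_row = selected.iloc[0]   # first row by position
print(selected)
print(first_row["name"])
```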
it sounds like you're trying to do df[df.columns], but that's just a shallow copy of the entire dataframe.
Anyone know of a good splines library? I'm 100% sure someone dropped a GitHub link some time ago, and I saved it, but.. I've saved a billion things after that and can't find it now.. :/
Scipy has at least some spline stuff: https://docs.scipy.org/doc/scipy/reference/interpolate.html#d-splines
scikit-learn also has a spline transformer (SplineTransformer)
Thx.
Btw, are splines of any use for tree based models?
On a high level, it's a non-linear transformation that trees can do internally. That being said, for some problems it'll help, for others not
I think I've used periodic splines with gradient boosting for time series data. There's also generalized additive models but they're not popular in Python
Would having relatively noisy data that differs from split to split be an acceptable argument for using splines? To smooth out the noise a bit?
I think so yes
For me personally it's mostly a case of having a variable that has a non-linear relationship w.r.t. the target, and I want to encode this reasonably efficiently (specifically, without introducing too many columns, having smoothness as well, ...)
Exactly!!! I'm just not able to get a list of the columns like the tutorial mentioned
Unless you purely treat them as a hyperparameter (like GAMs do), the drawback is that it's still hard to know how many knots you need and where to place them.
That'd be df.columns.
But that guy is also able to comment out the columns that he doesnt want so that he could get a subset of the df
Which is what I want to do, but I'm unable to do. I just dont want to manually enter the specific column names since there are over 50 columns
That indeed seems like a strange thing to do, don't do that.
I don't understand what you want to do, though.
If you want the list of the dataframe's columns, print df.columns and copy it, I guess.
set it to a variable and use it?
You could slice the columns, like
```
cols=list(df.columns)
df[cols[1:10]]
```
you don't need to listify and also df.iloc[:, 1:10] is preferred over that
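To make the options in this thread concrete, here is a small sketch on a made-up dataframe (all column names are assumptions): print the column names once, then either drop the ones you don't want by name or slice by position.

```python
import pandas as pd

# Hypothetical dataframe standing in for the 50+ column one.
df = pd.DataFrame(
    {"id": [1, 2], "age": [30, 41], "city": ["Oslo", "Lima"], "score": [0.5, 0.9]}
)

# List the column names so they can be copied or filtered.
print(df.columns.tolist())

# Option 1: name only the columns to leave out.
unwanted = ["id", "city"]
subset_by_name = df.drop(columns=unwanted)

# Option 2: slice by position; 1:3 selects the second and third columns.
subset_by_position = df.iloc[:, 1:3]
```

Dropping by name tends to be less fragile than positional slicing if the column order ever changes.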
Anyone here ever worked with point cloud data and knows of a library for visualizing it?
Used pptk and open3d, but they both have the problem of just not visualizing some points (like it culls them)
This is the same tree point cloud but just rotated a bit, and a lot of points just randomly disappear
u can get the columns as list and use that to filter
maybe plotly?
Needs to be specific for point cloud, they have 70 mil ish points, so plotly or matplotlib doesn't cut it
ah ok i see
matplotlib already almost crashes at 10k
maybe this works with 3d as well:
https://plotly.com/python/datashader/?_gl=1*zw2ahj*_ga*MTQ3MjM5OTgzNy4xNjg5MjU0MjIw*_ga_6G7EE0JNSC*MTY4OTI1NDIxOS4xLjAuMTY4OTI1NDIyMC4wLjAuMA..
pls stop spamming and if u want to acquire clients read the #rules
I'm not spamming, I want to help for free.
@mild dirge maybe even check scattergl i dont know how well it scales but for a few 100k-1M it worked for me
Thanks for the recommendations, I'll take a look
!rule 6
And definitely no cross-channel spamming.
let me know if that helped, might be interesting to know its limits for the future
Yeah sure. Currently the only one that works well is cloudcompare, but it's standalone, so can't directly call a python function to visualize.
To use certain in built functions like transformations and such yeah
But not to open a window and visualize it
Only used open3d which worked well, but my point clouds were smaller than yours. Lmk if/when you solve the problem
looks a bit like a column hahaha
Yeah I just don't know, I also manually cropped it to only 100k points or so
But the same problem happened (it even looked worse)
Actually the image I showed is the cropped one
Can you sample fewer points and plot it that way, or does that not work for your application?
It wouldn't fix it, and it is also a bit worse for my application
@mild dirge jupyter-scatter (https://github.com/flekschas/jupyter-scatter) can show millions of points but it is 2D
Does anyone here know how to implement pct change with forward fill in pyspark??
@mild dirge for 3d you could try ipyvolume
Also checked that out, think it doesn't do too well on very large point clouds
then 2d and to think how to slice 3d into 2d projections
can someone provide me with a cheatsheet for Librosa please? It'll help me out a lot
hi I want to get in to ML but I do not know where to start can someone help me?
https://github.com/RAHUL13-13
Can i get reviews on how i can improve my Github
my tip is to enroll in the linear algebra course on Coursera
that's the bottom-up approach but if you want the top-down approach then try some books on amazon
I bought the A. Geron book but reading it is like someone shoots water coming from a firehose on my face
I was just skimming, I thought ML is just try this framework and work on the documentation api lol
for test_num in range(1, 101):
    X_train, X_test, y_train, y_test = train_test_split(X, df.left, test_size=test_num / 100.000)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_score = score
I have this code but I am getting this error: ValueError: test_size=1.0 should be either positive and smaller than the number of samples 14999 or a float in the (0, 1) range
I tried to make a conditional to find the test_size that gets the highest model.score for a logistic regression, not sure what is going wrong
Figured it out actually, test_size has to be in between 0 and 1, not including either
Hello, my script works, but the dictated voice has an English accent, whereas the text is in French. How can I remedy this? Thank you in advance for your feedback!
from elevenlabs import generate, play, set_api_key, Voices, stream, voices
set_api_key("xxx")
audio = generate(
text="Bonjour je suis Bernard comment allez-vous ?",
voice="Bella",
model="eleven_monolingual_v1"
)
print("Audio en cours...")
play(audio)
maybe use a different model?
i assume eleven_monolingual_v1 only support one language, and by default it probably is english?
how is eleven_multilingual_v1? (see https://github.com/elevenlabs/elevenlabs-python#-multilingual)
otherwise have a look at https://docs.elevenlabs.io/api-reference/models-get and see what other models can you use
hello guys
simple question
i have a pandas dataframe
containing a list of names and how much money they spent each time they came to the store
how can I sum the total money spent per user in the store over the entire db ?
What's strange is that it says in English: "The eleven_multilingual_v1 model supports multiple languages, including English, German, Polish, Spanish, Italian, French, Portuguese, and Hindi."
The problem is solved, I made a mistake. Here is the correction:
```py
model="eleven_multilingual_v1"
```
You'd groupby/sum; df2 = df.groupby('name')['spent'].sum()
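A small worked version of that groupby/sum, on made-up data. The column names 'name' and 'spent' follow the one-liner above and are assumptions about the real dataframe.

```python
import pandas as pd

# Hypothetical purchase log: one row per store visit.
df = pd.DataFrame(
    {
        "name": ["ann", "bob", "ann", "cy", "bob"],
        "spent": [10.0, 5.0, 2.5, 8.0, 1.0],
    }
)

# Total spent per customer; the result is a Series indexed by name.
totals = df.groupby("name")["spent"].sum()

# Turn it back into a regular two-column dataframe if that is easier to work with.
totals_df = totals.reset_index()
print(totals_df)
```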
@desert oar @serene scaffold Update: RNN with LSTM variant worked. I was able to generate the request log. The output is not exactly what I wanted but I have a good starting point. Thank you!
Why linear algebra?
Tensor math operations are nice foundations
I'd start with some friendly youtube... 3blue1brown is amazing: https://www.3blue1brown.com/topics/neural-networks & https://www.3blue1brown.com/topics/linear-algebra
Mathematics with a distinct visual perspective. Linear algebra, calculus, neural networks, topology, and more.
Ok thank you
Suggest me some sources to get datasets from . I need a dataset of ingredients used in Dairymilk and its harmful effects when overconsumed.
what are you trying to model?
like i want to scan a product's ingredient and to list the harmful effects of it when overconsumed
Anyone here also use matplotlib and C++? I am trying to use this matplotlib for c++ library and would love some input since I am stuck with something and I may need to mod it
what's the best way to store numpy arrays of image datasets for tensorflow? i was thinking of resizing the images, converting it to numpy arrays, and writing it to a .json file, but there are ~8000 images, and it seems like it would create an extremely large file
i want to save it because i don't want to have to resize & preprocess the images every time i want to run the program so is there any other alternative?
I'm pretty sure numpy.savez could help. It saves multiple np arrays efficiently in a .npz file format
ill check it out. thanks!
Ill be real I tried replicating your scenario with 8000 (512, 512, 3) np arrays and i ran out of memory in 5 secs
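A hedged sketch of the numpy.savez route on a deliberately tiny, randomly generated stand-in dataset (the array shapes and the file name are assumptions). Keeping the images as uint8 matters for the memory problem mentioned above: 8000 arrays of shape (512, 512, 3) are roughly 6 GB as uint8 and eight times that as float64.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for the resized image dataset: 16 RGB images of 32x32 pixels.
images = np.random.randint(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)
labels = np.arange(16) % 2

# savez_compressed stores several named arrays in one .npz archive.
path = os.path.join(tempfile.mkdtemp(), "dataset.npz")
np.savez_compressed(path, images=images, labels=labels)

# Arrays are read back by the names they were saved under.
with np.load(path) as data:
    loaded_images = data["images"]
    loaded_labels = data["labels"]

print(loaded_images.shape, loaded_images.dtype)
```

Conversion to float and normalization can then happen per batch at training time rather than on the whole array at once.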
Could also just store them as images
Tensorflow and Pytorch both have methods to load datasets that are laid out as one directory per class, with all images of that class inside that directory.
@coral field
so it automatically converts images to numpy arrays to fit on?
and what if i need to resize them?
You can have transforms to change the images
Even allows for random flipping and stuff to augment the data
Then you use a dataloader to load random batches of those images
does it work on datasets not found on tfds?
Not sure what you mean with that?
since i have a custom dataset downloaded from kaggle
i want to prepare it so it works with tensorflow models
If you order the data as follows:
/Data
    /apple
        img1.png
        img2.png
    /pear
        img1.png
        img2.png
    ...
dataset.py
Then tf and pytorch both have a dataset class that recognizes this format and can load in this data
And then a dataloader can be used with this class to constantly return random batches of data (including the label)
The labels are the directory names (apple and pear in my case)
Which automatically get converted to an integer (0 for apple, 1 for pear, ...)
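To show what "the labels are the directory names" amounts to, here is a standard-library-only sketch that indexes such a tree by hand. The helper name `index_image_folder` is made up; in practice `tf.keras.utils.image_dataset_from_directory` (TensorFlow) or `torchvision.datasets.ImageFolder` (PyTorch) do this for you, plus decoding, resizing or transforms, and batching.

```python
import tempfile
from pathlib import Path


def index_image_folder(root):
    """Return (class_to_idx, samples) for a one-directory-per-class layout."""
    root = Path(root)
    # Class names are the subdirectory names, sorted so the mapping is stable.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(class_names)}
    # Each sample is (image path, integer label).
    samples = [
        (path, class_to_idx[name])
        for name in class_names
        for path in sorted((root / name).glob("*.png"))
    ]
    return class_to_idx, samples


# Build a throwaway tree with empty placeholder files, mirroring the layout above.
root = Path(tempfile.mkdtemp()) / "Data"
for cls in ("apple", "pear"):
    (root / cls).mkdir(parents=True)
    for i in (1, 2):
        (root / cls / f"img{i}.png").touch()

class_to_idx, samples = index_image_folder(root)
print(class_to_idx)
```

This works the same way for a custom dataset downloaded from Kaggle, as long as the files are arranged into per-class folders first.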
ok
has anyone dealt with Stanford's Car Dataset before? Since I am currently unsure on how to find the label of each car, as well as the corresponding image's number.
In the cars_annos.mat file it seems @coral field
it contains an "annotations" key in the dict, but its information is:
```
(array(['car_ims/000002.jpg'], dtype='<U18'), array([[48]], dtype=uint8), array([[24]], dtype=uint8), array([[441]], dtype=uint16), array([[202]], dtype=uint8), array([[1]], dtype=uint8), array([[0]], dtype=uint8))
(array(['car_ims/000003.jpg'], dtype='<U18'), array([[7]], dtype=uint8), array([[4]], dtype=uint8), array([[277]], dtype=uint16), array([[180]], dtype=uint8), array([[1]], dtype=uint8), array([[0]], dtype=uint8))
...
(array(['car_ims/016183.jpg'], dtype='<U18'), array([[25]], dtype=uint8), array([[32]], dtype=uint8), array([[587]], dtype=uint16), array([[359]], dtype=uint16), array([[196]], dtype=uint8), array([[1]], dtype=uint8))
(array(['car_ims/016184.jpg'], dtype='<U18'), array([[56]], dtype=uint8), array([[60]], dtype=uint8), array([[208]], dtype=uint8), array([[186]], dtype=uint8), array([[196]], dtype=uint8), array([[1]], dtype=uint8))
(array(['car_ims/016185.jpg'], dtype='<U18'), array([[1]], dtype=uint8), array([[1]], dtype=uint8), array([[200]], dtype=uint8), array([[131]], dtype=uint8), array([[196]], dtype=uint8), array([[1]], dtype=uint8))]
```
which makes it kinda unclear where the label is
How many classes?
196
Those are the values in each array
Seems to be reversed somehow
fname, class, bbox_y2, bbox_x2, bbox_y1, bbox_x1
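Going only by the values pasted above, where the second-to-last number runs from 1 to 196 and the last one is 0 or 1, one plausible reading of each tuple is (path, bbox_x1, bbox_y1, bbox_x2, bbox_y2, class, test flag). That field order is an assumption, and the names below are made up for illustration. The rows are written as plain values; with real loadmat output each field is a nested array that would need unwrapping first (for example with `.item()`).

```python
from typing import NamedTuple


class CarAnnotation(NamedTuple):
    """One annotation row, under the ASSUMED field order described above."""
    path: str
    bbox_x1: int
    bbox_y1: int
    bbox_x2: int
    bbox_y2: int
    class_id: int   # 1-based, 1..196
    is_test: bool


def parse_row(row):
    """Unpack a 7-field row into a named record."""
    path, x1, y1, x2, y2, class_id, test = row
    return CarAnnotation(
        str(path), int(x1), int(y1), int(x2), int(y2), int(class_id), bool(test)
    )


# Two rows copied from the chat output, with the array wrappers removed.
rows = [
    ("car_ims/000002.jpg", 48, 24, 441, 202, 1, 0),
    ("car_ims/016185.jpg", 1, 1, 200, 131, 196, 1),
]
annotations = [parse_row(r) for r in rows]

# Most training code expects 0-based integer labels.
labels = [a.class_id - 1 for a in annotations]
print(labels)
```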
@mild dirge did u find a workaround?
around what?
3d plot
No not yet. Just using the non-python viewer might just be simplest.
Did buy 64 GB Ram though, those point clouds take up space..
Gonna do a master project at a forest analysis company. The project will be tree species classification with drone lidar data.
But it's in 1.5 months, so I'm just looking into some of this out of interest
so millions of trees by their class?
and then simulations on fire or some stuff?
oh nvm
i should read
There are certain beetles that are harmful that can also be detected on trees.
with drones nice
such high resolutions are achieved?
I guess, they gave that task to a 3rd party company
crazy
would love insight into satellite images
this looks like a pretty nice master topic
Speaking of tiny animals. An idea we've been playing with is to detect them through vibrations
At work. Say you do an intervention to help wildlife. The issue is that measuring the amount of bees, insects etc. before and after requires counting
oh i saw a similar thing on kaggle
so u would use microphones and apply FFT and filter?
did u guys do a POC?
I guess our novelty is that we wouldn't do it with images but with sound. Not sure how viable it is, we'll think it over and then decide.
the problem with sound is the counting
It's not my idea so I might be conveying it slightly incorrectly. The idea came from the physics people
interesting idea for sure
I had questions for them such as how to solve the same animal going in, out, in, out, ... We'll see how they solve it
do squares
smaller passing areas, but it all comes back to the counting and identification problem
there's mimo radar for insect tracking, the principle with sound should be somewhat similar
if it's purely passive it gets a lot more challenging though
Thanks, that goes straight to the top of my reading list
why to do repeatedly cast to float
the notes are giving me anxiety
"why to do" -- is that how you say "why do you" in your language?
also (float) (1 / 4) is syntactically correct python, funnily enough
If you're still around, it might be interesting that we help you make this more readable
since when do u click on images with code
Because as it is, it's giving me a headache, but we can solve that with a few low hanging fruit
I wanted to see what Edd (PBUH) was referring to. I still wouldn't help the asker until they gave text.
lmao no
just wanted to mess with u
naaaaa
Well, when you're not lazy or tired and I'm around I could help
i just wanna know how to do dem derivatives
what is the function you want to derive?
newton raphson method literally has a "f'(x)" in it so it means any function
If you're truly lazy or tired you'd use Autograd
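Since Newton-Raphson only needs f'(x) evaluated at a point, one low-effort option is a central finite difference, sketched below. This is an illustration under assumptions (step size, tolerance, and function names are all made up), and an automatic differentiation library would give exact derivatives instead.

```python
def derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def newton_raphson(f, x0, tol=1e-10, max_iter=50):
    """Find a root of f starting from x0 using x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        dfx = derivative(f, x)
        if dfx == 0:
            raise ZeroDivisionError("derivative vanished; try another starting point")
        x -= fx / dfx
    raise RuntimeError("did not converge within max_iter iterations")


# Example: the positive root of x^2 - 2 is the square root of 2.
root = newton_raphson(lambda x: x**2 - 2, x0=1.0)
print(root)
```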
ah no that's just me writing nonsense. i meant to type "why do you..."
Hi! I'm developing a battery management system using Machine Learning to predict the state of charge of the battery.
Has anyone worked on, or does anyone have experience with, Kalman filters or any other approach to BMS in general?
I'm not an electrical engineer and have no knowledge in this domain.
IDK if this is the right channel for this.
is there an alternative to cuda for intel integrated graphics
Really...how can people manage to make Variational AutoEncoders that actually work with MSE as Decoding Loss?
My VAEs never work with MSE, only with Log Likelihood. And I wanted to make a paper where I also compare a VAE with Gaussian Likelihood and a VAE with MSE, but simply saying "The model with MSE as Decoding Loss didn't manage to converge nor produce any meaningful output at any one of the 10 attempts" is a bit meh...
What exactly can be referred to as a model that uses extra training data?
The literal meaning doesn't seem to be applicable.
It seems models with a fixed/frozen pre-trained backbone are called that.
Or one with no pre-trained backbone at all.
opencl works on basically any hardware i think
but if you have integrated graphics the cpu will definitely be faster for machine learning
can i use opencl with pytorch?
I use IDE
Hm...my VAE is generating images correctly within range [-1, 1] just like the original images from my dataset... but rescaling them into [0, 1] for matplotlib is making them too...I don't know how to say it...but the values are too close to 0 and the image is almost white, though there are colorful figures.
This doesn't happen to the original images when rescaling from [-1, 1] to [0, 1].
I hope I don't have to use a dataset within [0, 1] and tune my VAE...that would be a bit sad 
"maximum likelihood" depends on your probability distribution. MSE is a particular case of maximum likelihood
Hi everyone!
I'm learning TensorFlow Extended on my own and I'm running into an issue where CsvExampleGen generates a venv inside a venv
Could anyone give me some context on what might be going on please?
Curious. I wasn't expecting that...
In fact, I've seen that MSE works with grayscaled images because it would be trying to look for the likelihood of a Bernoulli distribution
Maybe the condition for MSE Loss working is to use grayscaled images, then? And since I'm using RGB images... 
Ah, what a silly mistake, haha. my IDE is creating the directories because I didn't specify the path correctly LOL
When I used conv autoencoders on RGB images I tested MSE and BCE and the latter worked better
Oh yes... I was thinking about using BCE also... Maybe I should go for it after the 5th attempt on MSE, then
MSE pops out of MLE when working with gaussian distributions, particularly ones whose covariance is a scaled identity
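A small numeric illustration of that statement, as a sketch with made-up numbers: for a Gaussian with a fixed scale sigma (covariance sigma^2 times the identity), the negative log-likelihood equals the summed squared error divided by 2 * sigma^2, plus a constant that does not depend on the predictions.

```python
import math


def gaussian_nll(targets, preds, sigma):
    """Negative log-likelihood of targets under N(pred, sigma^2), summed over elements."""
    n = len(targets)
    squared_error = sum((t - p) ** 2 for t, p in zip(targets, preds))
    return squared_error / (2 * sigma**2) + n * math.log(sigma * math.sqrt(2 * math.pi))


def mse(targets, preds):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


# Made-up values standing in for pixels and their reconstructions.
targets = [0.2, -0.5, 0.9, 0.1]
preds = [0.0, -0.4, 1.0, 0.3]
sigma = 0.7
n = len(targets)

nll = gaussian_nll(targets, preds, sigma)
scaled_mse = n * mse(targets, preds) / (2 * sigma**2)
constant = n * math.log(sigma * math.sqrt(2 * math.pi))
print(nll, scaled_mse + constant)  # the two numbers should match
```

In a VAE the choice of sigma therefore sets how strongly the reconstruction term is weighted against the KL term, which may be part of why a plain MSE loss and an explicit Gaussian likelihood behave differently in practice.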
Soo... rarely works with colored images?
I'm not sure why a greyscale image would use a Bernoulli distribution unless each pixel would be strictly black or white or at least very close to it. If you can get things to work for a large variety of greyscaled images, the immediate thing I would probably try is to separate each RGB image into its three color channels, generate the images separately and recombine them afterwards.
I think the 0 or 1 thing is exactly the context for some MNIST datasets...which is a must for every VAE tutorial I see around there
But I also find it strange because some VAE papers also use MSE. I think VQ-VAE uses it...or VAE-GAN...
Say I have a numpy array of shape (N, 3) (N 3d points) , called point_cloud
I also have a list of 4 empty lists
and another array of shape (N,) called indices which contains for each point an integer between 0-3 which says which list each point is assigned to
How do I (efficiently) add each point to the correct list, is there an efficient numpy way to do it (and not use a for loop)?
I think for my regular conv AE I just clipped the numbers to be [0, 1]
So...Sigmoid function?
I lied, I used BCE and a sigmoid in the last layer after all. It's been a while.

I'm now even more concerned about rescaling my Decoder outputs... The outputs are within range -1 and 1, just like my dataset. So, to rescale them, I simply apply (x + 1.0) * 0.5.
But...though it works fine for the dataset, the VAE outputs tend to get a bit... bright?
And not rescaling them makes the images get too dark
Oh yes...I forgot to monitor the outputs' mean and standard deviation... The dataset STD tends to be higher than the outputs', while the mean tends to be lower than the outputs'
What clustering algorithm would be good to remove the small clusters and outlier points from this
what about dbscan?
true, could i maybe use OPTICS would that also be decent?
worth a try
is ml model code review allowed in this channel?
i have a simple CNN to classify Stanford's Car Dataset, yet it keeps on overfitting on the training data, even though I am trying to shuffle it after every iteration with dataset.shuffle(), and any help would be appreciated
Now I get it...the model is already generating outputs within range [0, 1]...and I'm just applying a modification that makes the values closer to 1(white color in matplotlib)
At least, that's my guess... I was expecting my model to generate outputs within [-1, 1], since those are the values for my dataset...
Maybe the fact that the Decoder can't generate negative parameters for the per-pixel Gaussian Distribution due to the sigmoid function assures that, in the end, my output will be [0, 1] already...I guess...
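The guess above can be checked with a few numbers. This is only a sketch with made-up values: applying the [-1, 1] to [0, 1] rescaling to outputs that already live in [0, 1] (as a sigmoid would produce) squeezes everything into [0.5, 1], which would show up as a washed-out, bright image.

```python
import numpy as np


def to_unit_range(x):
    """Map values from [-1, 1] to [0, 1]."""
    return (x + 1.0) * 0.5


# Values spanning the dataset's range versus values spanning a sigmoid's range.
dataset_like = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
sigmoid_like = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

rescaled_dataset = to_unit_range(dataset_like)  # covers the full [0, 1] range
rescaled_twice = to_unit_range(sigmoid_like)    # squeezed into [0.5, 1]

print(rescaled_dataset.min(), rescaled_dataset.max())
print(rescaled_twice.min(), rescaled_twice.max())
```

Printing the min and max of the raw decoder output before any rescaling is a cheap way to confirm which range the model actually produces.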
Show code.
are there any suggestions on how to reduce the overfitting?
Ok, now the only thing that bothers me... is the loss decaying +- 0.5 points per epoch...and the epoch loss is currently 920 and the generated images can be better
Damn...Guess I just reached the most sad part of deep learning: let the model run, take a vacation and forget it exists until you come back
Problem is...I can't go on a vacation 
What if... Genetic Algorithms? Together with Gradient Descent? 
Oh yes...and there's VAE-GAN...
what comes to mind is to not use a list of lists, but a list of numpy arrays. then what you can do is something like mylist = [point_cloud[indices == 0], point_cloud[indices == 1], ....]
if you really need it to be a list of lists, i think just composing that with this approach should work
What came to my mind was doing this with something like ufunc.at, (seems like the right approach but not 100 sure) https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html
ah maybe i misunderstood what pccamel meant with "add", i was thinking of appending
I think they meant append, but what if the lists were just initialized to zero, then it was just an add?
icic
i really wanted to avoid explicitly creating the indices though, for no special reason since the bool indexing is virtually the same
Yah, what you wrote is probably what I'd do in reality, but I took the "efficiently" question as: could this be done in a single numpy operation?
slight note: dave beazley talks are great for intermediate python topics. can I ask you for book recommendations on data engineering in python?
Either way, this seems like something you can easily do with Numba with loops
The fact that lists are involved makes me concerned about whether numba will be fast
embrace jax
yeah it super doesn't work for me in numba
here is my attempt, though not really a single numpy operation
import numpy as np
# generate some dummy data
N = 10
points = np.random.random((N, 3))
groups = np.random.randint(0, 4, N)
group_sorter = np.argsort(groups)
sorted_groups = groups[group_sorter]
sorted_groups_diff = sorted_groups[:-1] != sorted_groups[1:]
transition_indices = np.flatnonzero(sorted_groups_diff ) + 1
np.split(points[group_sorter], transition_indices)
This is a very cool solution
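One caveat with the argsort/split attempt above: it only produces arrays for groups that actually occur, so an empty group shifts everything after it. A possible variant, sketched here with a made-up function name, finds the boundaries with np.searchsorted so that exactly k arrays come back, in group order.

```python
import numpy as np


def split_by_group(points: np.ndarray, groups: np.ndarray, k: int):
    """Split points into k arrays by group id (0..k-1), keeping empty groups."""
    # A stable sort keeps the original point order within each group.
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]
    # Position of the first element of group 1, 2, ..., k-1 in the sorted ids.
    boundaries = np.searchsorted(sorted_groups, np.arange(1, k))
    return np.split(points[order], boundaries)


points = np.arange(15, dtype=float).reshape(5, 3)
groups = np.array([0, 2, 2, 0, 3])  # group 1 is deliberately empty
parts = split_by_group(points, groups, k=4)
print([len(p) for p in parts])  # [2, 0, 2, 1]
```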
for 4 groups it seems to be a bit slower than the naive numpy one:
def separate_elems_numpy(cloud: np.ndarray, k: int, indices: np.ndarray):
    return [cloud[indices==i] for i in range(k)]
!e
import numpy as np
N=10
points = np.random.random((N, 3))
groups = np.random.randint(0, 4, N)
split = [points[groups == group, :] for group in (0,1,2)]
print(split)
@wooden sail :white_check_mark: Your 3.11 eval job has completed with return code 0.
[array([[0.09083567, 0.60900562, 0.28339988],
       [0.72970575, 0.29073936, 0.47246215]]), array([[0.34886624, 0.26685568, 0.41064815],
       [0.05640709, 0.84097418, 0.9440061 ]]), array([[0.1018377 , 0.67275258, 0.99977404],
       [0.19277101, 0.66099668, 0.27566082],
       [0.13309009, 0.1609066 , 0.49823524]])]
yeah, same thing
idk where you draw the line of "one operation" vs "one line"
hmm, I have an idea
but yeah ry's has the extra overhead of sorting. there might be a way to circumvent the == so that the indexing isn't so memory intensive
nested numpy where 
my cursed pandas groupby solution isn't even that horribly slow, lol
are you concatenating the arrays and then groupby?
just a 2-column dataframe - point and group
the cursed part is that the point column has to be object-type
i feel like that's a crime against humanity
splitting it into points.shape[1] columns would need no objects, but that'd need reassembly..
holy shit
rewriting the cursed pandas solution in polars made it about as fast as ry's 
data_pl = pl.DataFrame({"point":cloud,"group":indices}) # this takes like 3 seconds though, haha
%timeit data_pl.groupby(pl.col("group")).agg(pl.col("point")) # 6.25 ms ± 882 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and how does that compare to the naive list + numpy
k = 4
d = 3
N = 10**5
cloud = np.random.random((N, d))
indices = np.random.randint(0, k, N)
data = pd.DataFrame({"point": list(cloud), "group": indices}) # cursed
data_pl = pl.DataFrame({"point": cloud, "group": indices}) # even more so
%timeit separate_elems(cloud, k, indices)
%timeit separate_elems_numpy(cloud, k, indices)
%timeit separate_elems_ry(cloud, k, indices)
%timeit separate_elems_pd(data, k)
%timeit data_pl.groupby(pl.col("group")).agg(pl.col("point"))
61.8 ms ± 5.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
4.64 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.64 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.4 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.54 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
it being the fastest in this run is probably a fluke
whats the one that just says elems
purepython
def separate_elems(cloud: np.ndarray, k: int, indices: np.ndarray):
    N, d = cloud.shape
    assert indices.shape == (N,), indices.shape
    lists = [[] for _ in range(k)]
    for el, i in zip(cloud, indices):
        lists[i].append(el)
    return lists

def separate_elems_numpy(cloud: np.ndarray, k: int, indices: np.ndarray):
    return [cloud[indices==i] for i in range(k)]

def separate_elems_ry(cloud: np.ndarray, _: int, indices: np.ndarray):
    group_sorter = np.argsort(indices)
    sorted_groups = indices[group_sorter]
    sorted_groups_diff = sorted_groups[:-1] != sorted_groups[1:]
    transition_indices = np.flatnonzero(sorted_groups_diff) + 1
    return np.split(cloud[group_sorter], transition_indices)

def separate_elems_pd(data: pd.DataFrame, k: int):
    grouped = data.groupby("group")["point"]
    return [grouped.get_group(i).values for i in range(k)]
lemme also try some lazy cython
%%cython
cimport numpy as np
cpdef separate_elems_cython(cloud: np.ndarray, k, indices: np.ndarray):
    # N, d = cloud.shape
    # assert indices.shape == (N,), indices.shape
    lists = [[] for _ in range(k)]
    for el, i in zip(cloud, indices):
        lists[i].append(el)
    return lists
~10% faster than purepython
no, this is on windows
time to rewrite it in rust
(not actually going to do it now, it'd take a bit of time)
unless...
huh
anyone here studying computer science willing to collaborate on a project with me, it's in the form of a research paper.
no windows support
Is it possible to share open source project here without being banned?
it's impossible. in fact, reptile is about to get banned
Bot will say I do advertising
I wanna implement this in a single program, I'm doing this for my graduation project, anyone wanna tag along?
you can always use modmail to ask for permission
basically FOM, Face Detection, Neural Pose Transfer, SD, StyleGan 2, InfiniteNatureZero, Background Matting V2, DiscoDiffusion.
you don't use wsl?
no
why not
i have a dual-boot linux
ah