#data-science-and-ml
It's the most sensible way to track experiments (for me)
It all depends how deep you want to go, I think I've mentioned it before. If you're in it for the long game then starting with math and then going towards probability theory and then going towards statistics and then ML is the most sensible route.
This route is very long and a bit boring tho so you can start at any point you want, especially if you're more interested in applications of ML than building new methods.
I mean I just wanna understand the concepts, so that's why I thought that starting with a statistics book, understanding the basics, implementing them in Python, and then moving on to statistical learning would be more beneficial
and yeah im in it for the long term and I wanna build new methods rather than just implement stuff
you're probably going to need an advanced degree if you want to build new methods
is that what you're working towards?
yeah I'm only 16 years old
I'm planning to go with computer science and mathematics and self-learn a bunch of things during the way
Thanks 🙂
solid plan
My main advantage is time, so taking my time with everything rather than just rushing would be much nicer
i think others in the channel would have more insight than me, but an advanced degree like an MS or a PhD will help you with building new models
it's a good goal to have as a 16 year old
yeah im planning to do something similar
Yeah, I'm very motivated and passionate
good stuff
Then I recommend you do the long route with math, stats etc but do some real life projects it could even be kaggle when you're bored 🙂
yeah i can second kaggle
i'm going to have to learn a shit ton of sql if i get the l3harris offer lol
SQL is truly something you can learn on the job though
I learnt SQL at internships before I took a database course in uni
the puzzles on leetcode would be helpful too
Yeah, I was planning to write a small Python library that implements these statistical concepts, so I'd have them all in one place and could train my models with it as well
just food for thought
I actually have a big experience with web-development so I'm a bit familiar with databases but not in a deep level though
I just looked at w3schools SQL whenever I was stuck 💀
Just for learning purposes and understanding mathematical concepts and modelling them in python
Thats good to know
Then I would move on to statistical learning, and I will learn differential calculus this year in school, which is kinda cool as well
Also @past meteor Do you know any good statistics introduction books that you've read?
I don't sadly, I just learnt that in uni. We used a standard, boring, textbook that isn't available in English 😦
what is token classification?
i need a model that can do this simple task but so far i have been unable to make llama2 reply in a single word, and it also can't differentiate between spam and useful messages
should i be looking at some other model
is data science same as /related to ML/AI?
well its everything to do with data, im far from professional but data science is everything to do with data(managing, processing, cleaning) and machine learning models are using that data to train and make further predictions, so yeah its very related to ml because we need data to train models so without data science we cant do that.
Statistics is highly related to ml and data science and is very significant
"openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable" I have "export OPENAI_API_KEY="xxxx" in my .env
how do I solve that?
maybe try setting the env variable in the terminal before running the app
thanks
if it is in .env u will have to load it unless u are already loading it
use dotenv to load the env variables from .env first
os.getenv(.env) ?
from dotenv import load_dotenv
load_dotenv()
kk
u will need to install it first (pip install python-dotenv)
im using venv
virtualenv
then load_dotenv() will load em
ok
exporting them at runtime is a good idea too
k, thanks, coffee is ready. bbl
hi
They're explained alongside a lab session that shows you the code implementation of each chapter/topic in Python.
so I dont need to read the stats book beforehand? or would it be any different?
I use W&B and MLFlow mostly for my experiment tracking. I use the free tier version though and it works just fine for me.
I'll recommend you check out
- W&B
- MLFlow
- DVC
- ML Comet
- 🤗 Hub
- Evidently
Yeah, some sort of dashboard is the final piece in the puzzle with our current setup
I might see if we can just buy a neptune account though, since it's less hassle for me to setup
would you mind answering to my question
ISL is a nice statistical textbook which also happens to have a code lab session for each chapter.
So after reading each chapter, you'll see the code implementation of what you've just read therein.
Just skim through a couple of chapters in the book, you'll see it has the section where the code session is shown (inside same book)
yeah I know that, but my question meant that I have no experience with statistics, so I asked whether i need to go thru a very basic statistics book to understand the mathematics and then switch onto this book which implements it in code.
i am looking for someone who could help me install opencv for cuda when i follow the guide on YouTube it says it is unavailable for python 2 and 3 and the solution given for the problem given by the same video does not work
I wrote a script in python which I want to compile into single file using 'pyinstaller' or 'auto-py-to-exe' for easier distribution however I'm encountering issues and none of the answers online helped.
So I wrote the script in Pycharm and made venv inside which are required python packages installed.
When I run the following command:
pyinstaller --onefile --name BoulderDimensionsCalculator --icon=Polygon.ico main.py
Both pyinstaller and auto-py-to-exe compile it just fine. However, when I try to run the .exe, the console window just hangs there for a bit, then throws a module-not-found error and exits.
I tried including --paths="C:...\venv\Lib\site-packages" in the pyinstaller command as well but it didn't help
Missing module is rasterio.sample but sometimes it also throws error on some other like geopandas or whatever. So my assumption is pyinstaller doesn't compile python packages which are installed in venv, how to fix that?
Honestly, I don't know. You might have to check it out and see for yourself first. If you feel it's not beginner-friendly enough, then you might wanna start with this instead
Okay, thanks very much
Anyone got any idea what could be causing PyTorch Lightning to never make progress after the first epoch?
we never actually complete the Epoch validation
but for some reason, 3 GPUs drop to 0% usage, and then just 1 GPU pins to 100% like it is still doing work
but it just doesn't progress 
My current train step:
early_stop = pl.callbacks.early_stopping.EarlyStopping(
    monitor="val_loss",
    min_delta=0.00001,
    patience=5,
    verbose=False,
    mode="min",  # val_loss is a loss, so lower is better
)
self.trainer = pl.Trainer(
    logger=neptune_logger,
    callbacks=[early_stop],
    max_epochs=self.model_config.n_epochs,
    num_nodes=1,
    log_every_n_steps=32,
    accelerator="auto",
    strategy="ddp_find_unused_parameters_true",
)
logger.info("Trainer has been created")
self.trainer.fit(self.model, self.data_module)
logger.info("Model fitting has been completed")
def forward(self, token_ids: torch.Tensor, attention_mask: torch.Tensor, labels=None):
    output = self.pretrained_model(input_ids=token_ids, attention_mask=attention_mask)
    pooled_output = torch.mean(output.last_hidden_state, 1)
    pooled_output = self.hidden(pooled_output)
    pooled_output = self.dropout(pooled_output)
    activated_output = F.relu(pooled_output)
    logits = self.classification(activated_output)
    loss = 0
    if labels is not None:
        loss = self.loss_fn(logits, labels)
    return loss, logits
Forward function
Only this is called bar log in the validation and train steps
As much as I'd like to let this run without being able to see what it is doing, the machines cost $7-8/hr 😅 So I'm not big on leaving this running idle overnight if it is just plain stuck
absolutely nothing from the logger
also 0% cpu usage / idle
- 698 classes
- 738692 train data points
- 78570 validation data points
n_classes=698 total_steps=36 warmup_ratio=0.2 embedding_length=256 learning_rate=1e-05 focal_alpha=1.0 focal_gamma=2.0 k_items=5 adam_beta1=0.9 adam_beta2=0.99 adam_eps=1e-07 adam_weight_decay=0.001 n_epochs=50
Hmm
I am getting this when force aborting the container
Sat, 25 Nov 2023 22:37:41 GMT thread 'thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-core-1.12.0/src/registry.rs:167:10:
Sat, 25 Nov 2023 22:37:41 GMT The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
Sat, 25 Nov 2023 22:37:41 GMT note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

which I guess means that it is the huggingface tokenizer that is getting stuck
ah nope it looks similar to this error https://github.com/Lightning-AI/lightning/issues/11242
we do love deadlocks
does anyone have any advice for being able to process the data from the game Dwarf Fortress, I am having trouble finding tools that are able to process the data made by the game, I can find tools for processing unreal or unity game code, but i am having issues since Dwarf Fortress is almost entirely written in openGL without a stock game engine, so it runs off a fully homebrewed engine
would i have to make that tool myself?
I am looking at creating a neural network that can effectively play Dwarf Fortress
Guys, I've looked into the idea of least squares in regards of fitting a line to a given data set, and I do understand it and I know how to use it but I don't know why does it work? How can I possible understand the idea behind it and understand why does it make sense? How can someone have potentially come up with this idea? Is it a stupid question to ask?
Guys I have invented a way to not bother websites as much:
```py
from functools import lru_cache
import requests

@lru_cache(maxsize=None)
def requests_saver(url):
    return requests.get(url, headers=headers)
```
Does anyone know how much data like this costs? https://www.mastercardservices.com/en/spendingpulse
I imagine it's super super expensive
How do you guys use Jupyter notebooks? I seem to only use them to look at the data, and then write an actual python module to train the model and stuff
it's just as well that you don't use them for training.
I mostly use them for stuff that I need to present to other people. For exploratory stuff, I usually use an IPython repl, so there's even less of a pretense that whatever I write is non-disposable.
however you write jupyter notebooks, if the result of the notebook needs to be reproducible (like, it needs to train a model and report on its performance), make sure that it works correctly by executing each cell exactly once, in order, with a fresh kernel session
if the expected behavior depends on a non-linear sequence of cell executions, it's crap.
Can someone answer this please?
Anybody know if it is possible for GPT4ALL to combine models and/or databases?
i saw around 100GB of Datasets and some Models. But i dont know what is what.
looks it is the same...
YT shows me i need to choose...
Hey guys, I'm not understanding something here:
- I need to fine tune a BERT model for binary classification of sequences.
- Right now I have this architecture: BERT + torch linear classification layer.
- In order to fine tune this, I am simply passing the input through the architecture, calculating the loss and backpropagating it.
- Do I need to do any extra steps? Am I missing something?
Hope this helps:
https://github.com/v3xlrm1nOwo1/fungi-diagnostic-chars-comparison-japanese-classification-with-BERT
Hi I am new to this discord so not 100% sure this is the right place for this question but I made a chart analyzing stock performance over the past month and the way the data is presented seems wrong to me. Any thoughts?
sounds like you basically have it right.
do you guys mind taking a look at my end2end mlops project diagram?
the basic idea is to pull weather data daily automatically and store it in a db, then train a model based on this data to predict various metrics as an endpoint, and also monitor the performance, and say it makes 3 wrong predictions in a row, then an automatic retrain and deploy triggers. The deployment and monitoring part is missing, im not yet sure what to put there.
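The "three wrong predictions in a row" trigger could be sketched like this. RetrainMonitor and its method names are made up for illustration, and what counts as a "wrong" prediction is left to the caller.

```python
class RetrainMonitor:
    """Counts consecutive wrong predictions and signals when to retrain."""

    def __init__(self, max_consecutive_errors=3):
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors = 0

    def record(self, was_correct):
        """Record one prediction outcome; return True if a retrain should start."""
        if was_correct:
            self.consecutive_errors = 0
            return False
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.max_consecutive_errors:
            self.consecutive_errors = 0  # start counting afresh after triggering
            return True
        return False


monitor = RetrainMonitor()
outcomes = [True, False, False, True, False, False, False]
triggers = [monitor.record(ok) for ok in outcomes]
```

Only the last outcome triggers here, because the earlier run of two misses is reset by a correct prediction.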
jesus, i have a lot to learn regarding data science in python 😭
Wait, in a keras pipeline, .transform() in all the steps has to take in an ndarray, right?
I've been trying to use the opencv with cuda and keep getting errors, i was told to build it myself and i'm following this tutorial
https://www.youtube.com/watch?v=YsmhKar8oOc
I got to 4:55 and upgraded numpy, but for some reason when I check the "to be built" part of the OpenCV modules it still doesn't list python and says unavailable for python 2 and 3. I am not sure what info you need, so feel free to ask for more.
Build OpenCV 4.5.1 with CUDA GPU acceleration on Windows 10. In this tutorial, we will build OpenCV from source with CUDA support in Anaconda base environment as well as in a virtual environment. Building OpenCV with CUDA from source allows OpenCV to be used in any programming language. We will focus on Python 3.8 for this tutorial.
import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt
def rhs(N, t, lambda_, lambda_1):
    dNdt = lambda_ * -N
    dN2dt = dNdt - lambda_1 * -N
    return [dNdt, dN2dt]
N1 = 40
N2 = 0
t = np.linspace(0, 10, 100)
y = scipy.integrate.odeint(rhs, [N1, N2], t, args=(1/20, 1/15))
RuntimeError: The array return by func must be one-dimensional, but got ndim=2.
How do i stop this
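The error comes from N being the whole state vector [N1, N2]: lambda_ * -N is already a length-2 array, so the returned list is 2x2 instead of 1-D. A sketch of one way to fix it, assuming the intent is a two-step decay chain (N1 decays into N2, which decays in turn). The equations and the renamed parameters are a guess at the intent, not taken from the code above.

```python
import numpy as np
from scipy.integrate import odeint


def rhs(N, t, lambda_1, lambda_2):
    # odeint passes the full state vector, so unpack it into scalars first.
    N1, N2 = N
    dN1dt = -lambda_1 * N1                 # parent decays
    dN2dt = lambda_1 * N1 - lambda_2 * N2  # daughter is fed by the parent, then decays
    return [dN1dt, dN2dt]                  # flat list of two scalars, i.e. 1-D


t = np.linspace(0, 10, 100)
y = odeint(rhs, [40, 0], t, args=(1 / 20, 1 / 15))
```

y has one row per time point and one column per species, so y[:, 0] and y[:, 1] can be plotted against t.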
i want a dataset to practice convolutional neural networks and i have already used MNIST, are there any other datasets you would recommend? I would prefer something larger (in terms of size) than MNIST
Anybody know if it is possible for GPT4ALL to combine models and/or databases? I would like the AI to choose the set itself, so I don't need to select it manually. Thank you.
What is the canonical grammar of graphics package for python?
i dunno, pillow or cv2
why does that seem wrong to you
What bad advice
What would be the best way to go about finding correlation between numerical and categorical values. Suppose I have 2 columns one with scores ranging from 0-12 and the other column with the names of universities. I was thinking of assigning numerical values to each university and go about that approach but I wanted to know if there's a better way of doing it
Cifar10 or Cifar100, or any other image dataset from Kaggle
cifars have very low resolution compared to what i want to do, that's the main issue
You might wanna use an image data from Kaggle or create your own image dataset (more fun)
I am already creating a dataset for the final project so no thanks on creating one, but i will check kaggle out, thanks!
It's statistically wrong to compute the correlation of a categorical and a continuous variable the way you proposed (even when the categorical variable has been encoded to numeric values for an ML task).
So, you can't compute Pearson, Spearman Rank, or Kendall Tau correlation on such features.
Remember, if you call .corr() on those two features it will surely return a correlation coefficient, but is it capturing what you intended it to capture? No.
In such a situation, use a non-parametric test statistic like the Kruskal-Wallis test. If you want a parametric one, the point-biserial correlation works when the categorical variable has only two levels; with more categories (like a list of universities) the parametric counterpart is one-way ANOVA.
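A minimal sketch of the Kruskal-Wallis suggestion using scipy.stats.kruskal. The universities and scores below are made up purely for illustration.

```python
from collections import defaultdict

from scipy.stats import kruskal

# Hypothetical (university, score) rows; scores range from 0 to 12.
rows = [
    ("A", 3), ("A", 4), ("A", 5),
    ("B", 8), ("B", 9), ("B", 10),
    ("C", 6), ("C", 7), ("C", 12),
]

# Group the scores by category: the test compares the score distributions
# across groups instead of assigning arbitrary numbers to the categories.
scores_by_university = defaultdict(list)
for university, score in rows:
    scores_by_university[university].append(score)

statistic, p_value = kruskal(*scores_by_university.values())
```

A small p_value suggests that at least one university's scores tend to differ from the others; it does not say which one.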
The thing is, there's always a lot to learn in this field. New things keep popping up every day. It rains every day here; research doesn't really care if you need to catch your breath or take a short break.
I'm still tryna get to LLM now Q-learning is buzzing 😂😂
Once you're done with the basics and ready to jump into Deep Learning, I'd suggest you just focus on what you're interested in and double down on it. You might wanna subscribe to a couple of ML newsletters to stay abreast of new stuff.
You'll be fine that I know for sure. More so, If you start now, you'll be better off than you were yesterday.
You got this 💪💪💪
hi, requesting some help with running a jupyter notebook cell. getting this error: ModuleNotFoundError: No module named 'pandas'
i ran pip install pandas in the notebook dir and then re-ran jupyter notebook, but when i try and run the cell, i get the same error. what am i doing wrong?
Whats the best udemy course for machine learning using scikit learn library? Also if there is a playlist on YouTube then do let me know
You have duplicate line colors, making it impossible to distinguish curves. I typically do something like plt.gca().set_prop_cycle(cycler(linestyle=["-", "--", ":", "-."]) * plt.rcParams["axes.prop_cycle"]) (right after plt.figure()) if I need this many curves on one plot.
(cycler here is from cycler import cycler, see also https://matplotlib.org/stable/users/explain/artists/color_cycle.html)
Thanks! I ended up do that first part of code as well as realizing I had forgotten to produce cumulative returns. Here is the updated work, I will try the other curve suggestions.
it should look like this, I forgot to cumulate returns
Anyone got any idea what might be causing this sharp increase in F1 score (I typo'd the label, it isn't accuracy, it is F1) after the first epoch
you can see it is fairly stable until the 2nd epoch where it begins rapidly increasing the score
I have a LR scheduler but I dont think that is it
the only thing I can think of is it is suddenly overfitting
but idk, I haven't seen a model overfit that aggressively
Just some details:
- BERT base model with fine tuning layer
- AdamW optimizer with get_cosine_schedule_with_warmup scheduler
What does the validation f1 look like? @buoyant vine
basically the same as the train
Can't say much about overfitting without a measure on a separate dataset
Well if it is the same, it wouldn't be overfitting, because it apparently generalizes well to other data as well
Maybe it got stuck on a plateau for a bit
I suppose, but it is supposedly getting 99% 😅 So I am skeptical
the data should be well defined though, so idrk
Maybe it got stuck in a local maximum, and because of the sudden increase in LR it managed to get out
It is classifying website metadata (Title, Snippet/og_description)
so to some extent, the datasets are very similar no matter what
I suppose it might just be for the val side of things, it got a large subset of the data which had the categories it does really well in
ig I can remove the scheduler and see if it is getting stuck initially
what does the number in the middle signify? how can I choose it? when trying to remove outliers?
The standard deviation is a measure of spread of your data. You want to remove data that is very far removed from the mean relative to other data. The average "distance" to the mean is signified by the standard deviation, so we remove anything that is a few standard deviations away from the mean. @olive pecan
In your case any data beyond 2 standard deviations from the mean is removed
If you increase this value, then you accept samples that are further away, thus being less strict in filtering. Decreasing means you only accept values that are quite close to the mean, thus filtering away more samples.
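A small sketch of that filter with only the standard library. The function name, the n_std parameter, and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def remove_outliers(values, n_std=2.0):
    """Keep only values within n_std standard deviations of the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) <= n_std * sigma]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
strict = remove_outliers(data, n_std=2.0)   # drops the 50
lenient = remove_outliers(data, n_std=3.0)  # keeps the 50
```

Note that the extreme value itself inflates the mean and the standard deviation, which is why a threshold of 3 no longer catches it on such a small sample.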
Thank you WcCamel
Well, you’re not losing money, so you’re already winning 😉
ahah the prints aren't correct anymore
it's only the last train period now
so that just means it didn't trade in the last period
I think I'm going to try to figure out a way to do early stopping with EA next because looking at these charts doesn't really mean anything to me without out-of-sample data to compare to
Are you using elitism?
no
Interesting that your best solution never dropped over iterations
tournament of 4 blendcrossover and polynomial bounded mutation
sometimes it did in the past
What did you change?
I implemented the fitness as the median of the fitnesses over different iterations of the training data after splitting it up
so there's an inner loop
that goes over each individual training segment
and produces a fitness, then the actual fitness is the median of those
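Roughly, the inner loop described above could look like this. segment_fitness is a hypothetical stand-in for whatever the real backtest computes, and the equal-sized segments are an assumption.

```python
from statistics import median


def split_into_segments(series, n_segments):
    """Cut the series into consecutive, equally sized training segments."""
    size = len(series) // n_segments
    return [series[i * size:(i + 1) * size] for i in range(n_segments)]


def segment_fitness(params, segment):
    # Hypothetical stand-in for a backtest: how well a constant level fits the segment.
    segment_mean = sum(segment) / len(segment)
    return -(segment_mean - params["level"]) ** 2


def robust_fitness(params, series, n_segments=4):
    """Inner loop: one fitness per segment, then the median as the actual fitness."""
    scores = [segment_fitness(params, seg) for seg in split_into_segments(series, n_segments)]
    return median(scores)


score = robust_fitness({"level": 3.5}, list(range(8)), n_segments=4)
```

Taking the median means a single segment with an unusually good or bad result has little effect on the fitness.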
What are you searching?
what are the parameters you mean?
Yeah, what values are you searching over
I have a hack that lets the ints be blended because it rounds them when the class actually gets instantiated from the params
I don't know what that sentence means
so normally you wouldn't be able to use blend crossover on ints
because it will produce floats
so to take care of that I just round the floats that come out of the algorithm when it's actually creating the individual to have its fitness calculated
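A sketch of that rounding hack in plain Python. The crossover below follows the usual blend idea but is written by hand rather than taken from DEAP, and INT_GENES and decode are made-up names.

```python
import random

INT_GENES = {0}  # hypothetical: gene 0 is an integer parameter (e.g. a window length)


def blend_crossover(parent_a, parent_b, alpha=0.5, rng=random):
    """Blend two parents gene by gene; the result is always made of floats."""
    child = []
    for a, b in zip(parent_a, parent_b):
        gamma = (1 + 2 * alpha) * rng.random() - alpha
        child.append((1 - gamma) * a + gamma * b)
    return child


def decode(genome):
    """Round the integer genes only when the individual is actually instantiated."""
    return [int(round(g)) if i in INT_GENES else g for i, g in enumerate(genome)]


child = blend_crossover([10, 0.5], [20, 1.5], rng=random.Random(0))
params = decode(child)
```

The genome itself stays float-valued, so the crossover and mutation operators never need to know which genes are integers.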
There's cross-overs and mutations that can deal with a mix of floats and ints
But I actually don't know blend crossover to be honest. My problems were fully int based or fully float based (and then I just use numpy's differential evolution)
I was playing around with different tools from the DEAP toolkit
It's basically just like averaging the two
but if I had to guess why the fitness is monotonically increasing it's probably because of how I'm calculating it as a median across iterations
It's a weird way to express fitness
it's the median of the fitnesses across different time periods
Express it normally because that'll tell you if you need elitism or not
well I did it to avoid overfitting
and it seemed to work for that
at least in this one case
it produced a better test result
what is elitism?
can you elaborate on that a bit
you mean as part of the cross validation
the way I'm doing it now or something else
At each iteration keep the top N or top N % of the start of the run
Lets you increase the randomness of your algorithm without destroying the population
you mean from a previous run?
I was doing that manually, saving the params and seeding them as an individual into the starting population of a new run
idk if it's part of the deap framework
So you evaluate fitness at T1, you do crossover, mutation and selection. After selection you inject the best N / N% at T1 before heading to T2
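A toy sketch of elitism, where the objective and the vary step are stand-ins invented for the demo (the real crossover, mutation, and selection would go there).

```python
import random


def fitness(x):
    # Toy objective (an assumption for this demo): the best value is x == 3.
    return -abs(x - 3.0)


def vary(population, rng):
    # Stand-in for crossover + mutation + selection: just add strong noise.
    return [x + rng.gauss(0.0, 5.0) for x in population]


def step_with_elitism(population, n_elite, rng):
    """One generation: vary the population, then inject the previous top n_elite."""
    elite = sorted(population, key=fitness, reverse=True)[:n_elite]
    offspring = vary(population, rng)
    # The elite take the place of the worst offspring so the size stays fixed.
    survivors = sorted(offspring, key=fitness, reverse=True)[: len(population) - n_elite]
    return elite + survivors


rng = random.Random(0)
population = [rng.uniform(-10.0, 10.0) for _ in range(20)]
best_per_generation = []
for _ in range(30):
    population = step_with_elitism(population, n_elite=2, rng=rng)
    best_per_generation.append(max(fitness(x) for x in population))
```

Because the best individuals are carried over unchanged, the best fitness can never drop from one generation to the next, even with very noisy variation.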
I see
For instance for my research I do this:
Remove 15 % of the subjects into a test set.
On the remaining 85 % I do a test-train split.
30 % of the data is put in a "freezer" and not touched until way later.
70 % is left and with this I test train split again, cross validate on train and evaluate on test.
When I'm happy of the models on the 70 % I select N models and evaluate on the 30 %. When I'm happy with M < N models on the 30 % I evaluate these on the 15 % totally held out subjects, no tweaking whatsoever and those are the numbers.
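The outer two levels of that scheme could be sketched as below. It assumes both splits are done per subject and at random, which may not match the domain-specific split actually used; the function name is made up.

```python
import random


def three_level_split(subjects, seed=0):
    """Split subjects into working (70 % of the rest), freezer (30 %), held-out (15 %)."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)

    n_held_out = round(0.15 * len(shuffled))
    held_out = shuffled[:n_held_out]        # evaluated once, at the very end
    rest = shuffled[n_held_out:]

    n_freezer = round(0.30 * len(rest))
    freezer = rest[:n_freezer]              # untouched until models are shortlisted
    working = rest[n_freezer:]              # train/test split and cross-validation happen here
    return working, freezer, held_out


working, freezer, held_out = three_level_split(range(100))
```

The point of the three levels is that each later set is looked at far less often than the one before it, which limits how much the repeated tweaking can leak into the final numbers.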
thanks I'll see if I can use that methodology
part of the issue though is they are also time series
Mine too
so they are one-way information
I sat down, used my 🧠 and wrote a train-test split that is not completely random but makes sense for my domain
gotcha
It doesn't need to be "this" specifically but you should really be worried about "adaptive overfitting"
yea
The more you test and tweak your algorithm the more you'll tweak your hyperhyperparameters (?) to produce hyperparams that produce good models
hyperhyperparameters being the params of your EA lol
i mean what's the difference between that and actually understanding the problem, just that it doesn't generalize forward?
and if so, how can you ever know for sure?
This is the worst part: you can never know for sure
If you split too much the numbers become overly biased on a small sample, if you don't split enough and you reuse too much you adaptively overfit
This part is a bit art not science lol
definitely worth it to get this part right early
before I get too far with this
the testing methodology is probably the most important part of anything
Yeah, I think @left tartan can help here if they're up for it because they know a ton about financial data etc
I haven't seen many very compelling testing methodologies on time series fitting
i was under the impression that most people just leave out the last x%
for now just one
oh no this is a special thing
this is futures trading
it's actually the ten year treasury rate
I could do it on other ten year government bond rates
but besides that it's really in a class of its own
If you create lagged features you can split randomly afterwards
But doing so may or may not make sense for your application
mmm I am doing moving averages of long range
and other time series attributes of long range
like the hurst exponent
so wouldn't be practical
if it were just of order one, then it would make sense
Nah this works for > order one
if it's just lagged variables sure
but I'm calculating moving averages over 100s of observations
It's what I do, I have code to create arbitrary lag + horizon combos
And then I drop all the ones where the lag columns have nulls
And then I express my moving averages in function of the lagged features
Afterwards a random split is 100 % legal
not sure I understand that last bit entirely
so are you just creating a very wide dataframe
because you have hundreds of lags
in my example
And then making a narrow df from the wide one
how is that legal
totally legal 😄
Say you have a sample every 5 minutes (that's the case for my application)
you're telling me if you include all the lags and melt it then you can treat that like you can shuffle it?
like what if something is inherently cyclical
the point in front and the point in back always have to be connected to the current point to model the cyclicality accurately no?
You say this:
6 lags + a horizon of 6, which means you're predicting 30 mins ahead of time. Basically you're predicting at t_30 using t_[0;-5:-10:-15:-25;-30] right?
As soon as you make this dataframe you will encounter some null values. As soon as you drop these and compute your lags using the latter dataframe, you'll never use the future to predict the past. This assumes your data is trend stationary though.
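A plain-Python sketch of the lag + horizon idea. It assumes 6 consecutive lags and a horizon of 6 steps, and make_lagged_rows is a made-up helper; only rows with a full set of lags and a known target are kept, which has the same effect as dropping the rows with nulls.

```python
import random


def make_lagged_rows(series, n_lags, horizon):
    """Build (lags, target) rows where the target lies `horizon` steps after the last lag."""
    rows = []
    for t in range(n_lags - 1, len(series) - horizon):
        lags = series[t - n_lags + 1 : t + 1]  # the n_lags most recent values up to time t
        target = series[t + horizon]           # the value to predict
        rows.append((lags, target))
    return rows


series = list(range(20))  # stand-in for one sample every 5 minutes
rows = make_lagged_rows(series, n_lags=6, horizon=6)

# Each row carries its own history, so the rows can be shuffled before splitting.
shuffled_rows = rows[:]
random.Random(0).shuffle(shuffled_rows)
```

Whether the shuffled split is actually safe still depends on the assumptions discussed here, in particular that the relationship between lags and target stays consistent over time.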
strong assumption with real life data
in this realm
still not sure I completely understand though I'm reading what you said very carefully
I think I'm going to definitely use this though
The point is to avoid leakage, as per usual
If splitting randomly doesn't leak, you can
The major assumption is a consistent relationship between the dependent and independent variables ofc
If that is not the case you'll leak. I think this is broken in your case indeed 🤔
yeah the whole reason I'm splitting multiple times to train is there isn't a consistent relationship
the more training periods the better but they need to be sufficiently long to learn from
I'd like to have them be of random sizes too but not sure how to do that in practice because of how I need to read my data into one of the libraries
I guess what I'm trying to say is, if you know what you're doing idt you should be averse to rolling your own evaluation strategies (if you're sure they're water tight)
ML is, imo, the business of evaluating models and not the business of making models 😄
my problem is always that I want to train on recent data in case the relationship has changed or is changing
not really a problem but my frustration with the process
You can make an intricate ensemble using a model that uses exclusively lags and a model incorporating more information
one of the things I've looked into is segmenting time series with state models to try to capture things like cyclicality then using all of the same points from the same state as if you can shuffle them
hmm state models
Say model A (pure lagged model, can be just exponential smoothing) has an error of N that is more or less consistent over time.
Model B (incorporates exogenous information) has an error of M that varies over time.
if N < M you have an indication of a changed pattern over a sufficiently long time period t
This is one basic way you can react to changes in the "real world" when the model is deployed or so
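A bare-bones sketch of that comparison. The rolling window, the function name, and the error values are all made up; errors_a plays the role of N and errors_b the role of M.

```python
def drift_flags(errors_a, errors_b, window):
    """Flag each window where model A's mean error is below model B's."""
    flags = []
    for end in range(window, len(errors_a) + 1):
        mean_a = sum(errors_a[end - window:end]) / window
        mean_b = sum(errors_b[end - window:end]) / window
        flags.append(mean_a < mean_b)
    return flags


errors_a = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # pure lagged model: roughly constant error
errors_b = [0.5, 0.5, 0.5, 2.0, 2.0, 2.0]  # richer model: error jumps halfway through
flags = drift_flags(errors_a, errors_b, window=3)
```

A longer window makes the flag less jumpy but slower to react, which is the "sufficiently long time period" trade-off.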
makes sense
as with anything though it can only be clear in retrospect
and what if it's actually just a cyclical thing and the relationship is about to mean revert
so you switch out your model
but it was just a temporary disruption
and the beta is actually stationary over time
Yeah, this was what my thesis was about
https://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf this is a good survey on this large topic
Fundamentally unsolvable because you only know in retrospect
what about incrementally recurring concepts
that are one directional graphs
like up and back down
but once it starts going down it doesnt go back up until it bottoms out
that's kind of what I'm thinking of
yeah I will thanks
FWIW, the problem with -all- models is ‘what if’. And, in particular, the dreaded black swan. That’s what risk management is for: to limit loss when your model falls apart
I mean doing the whole fitting a new model because that happens thing
Aha, I think what I mentioned with the ensemble is a form of risk management 😄 I don't know how finance deals with this typically though.
I have a fundamental model and a trading model
where the fundamental model for the rate is an input to the trading model
along with a volatility model for the variance
the fundamental model uses variational inference HMM states
and is order 1
it's slower moving
What open source embedding models are best for RAG systems? Do I need large context like 8k?
in finance risk management is scaling the size of your action to the variance of the outcome
or paying for convexity
(that's what options are)
also things like stopouts, etc.
Finance is not my cup of tea in all honesty 🙂 loads of domain specific rules and knowledge I don't know
i mean time series is time series
the actions you take as a result of some signal can be abstracted away it's all just producing signals from time series data at the end of the day
is there a way to display stuff like dataframes, plots.. side by side in a jupyter notebook? And I don't mean subplots or any of that stuff, I just mean like put a plot next to a dataframe. Would be cool if the display() function accepted a list and tried to put things in a row, if there's space, but no, doesn't work. All I get when I google are ways to mess with the underlying html of a notebook, but.. kinda hacky. Is there a "normal" way of doing this?
Yes, use ipywidgets to arrange things (side by side). You’d probably want a hbox with two children.
i have a model which can generate 1s of audio. How can I generate longer samples, like 10-30s? It was trained as a GAN on .wav files; it takes a noise vector of size 100 and generates a 1s audio sample, because it was trained on audio of that length. Can I increase the sample length with the trained model, or do I have to train it again on longer samples?
Hello all, I'm currently building a regression model using ml techniques but can't seem to get the MSE down. I need to predict 2 variables from another 3 input variables. Does anyone have any ideas? Here is my code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
data = pd.read_csv("data/TA_LGs_combined.csv")
X = data[["dist[kpc]","vrad[km/s]","vtan[km/s]"]]
y = data[["mass1[Msun]","mass2[Msun]"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)
model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train, y_train)
y_pred, _ = model.predict(X_test, return_std=True)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')
plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()
plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()
mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])
print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
P.S. my dataset is quite small (about 600 rows)
current MSE is about 1x10^24
standardize everything and what is the MSE?
no I was saying if he standardized it
demean and divide by stdev
subtract average and std before fitting model
divide by std*
just giving 1x10^24 doesn't mean much if you are working with large masses
if you standardize it first then it's clearer
it also might help the model
could you elaborate?
depending on the model the scales of different variables being different might skew the results
whereas if you make everything have unit variance first
So should I normalise the inputs?
Btw current MSE is at 1.0013291012933846e+24
not saying it will fix everything
I'll try it now
It seems to have increased the MSE
did you scale the test data too
I find it hard to believe that you could get to an e24 mse on normalized data
This is what I did:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
data = pd.read_csv("data/TA_LGs_combined.csv")
X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler(with_mean=True, with_std=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)
model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred, _ = model.predict(X_test_scaled, return_std=True)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')
plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()
plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()
mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])
print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
just scale the y variable too, also not sure why with_std is set to False
I'll set it to true
if it's just those 5 variables I would do data = (data-data.mean())/data.std() at the start
and save the means and stds for later use
so you can rescale the output
yea
then rescale the outputs and calculate the MSE again and compare it to the original
the 1e24 one
that's how you know if it helped or not
but if it's saying the mse is 0.7 isn't it already better?
no because you changed everything to be normalized
you aren't comparing apples to apples
that's why you need to save the means and stds
so you can invert the normalisation on the model outputs
to get back to the original problem space
in principle what you did should give you 1e24 or better when you rescale it to the original problem
it really just depends on the stdev of the y variables you are predicting
whether that 0.7 is better than the 1e24 or not
i'd be surprised if it isn't tbh
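A minimal sketch of the comparison being described, using made-up numbers rather than the real dataset (the arrays, means and stds here are all hypothetical). It shows that a per-target MSE in standardized units and the same MSE in original units differ by exactly the variance of that target, which is why 0.7 and 1e24 cannot be compared directly. In a real workflow the statistics would normally be computed on the training split only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for two target columns with very large values.
y_true = rng.normal(loc=1e12, scale=5e11, size=(200, 2))
# Pretend predictions: the truth plus some error, in original units.
y_pred = y_true + rng.normal(scale=2e11, size=y_true.shape)

# Save the statistics so the scaling can be inverted later.
y_mean = y_true.mean(axis=0)
y_std = y_true.std(axis=0)

def scale(y):
    """Standardize with the saved statistics."""
    return (y - y_mean) / y_std

def unscale(y_scaled):
    """Invert the standardization to get back to original units."""
    return y_scaled * y_std + y_mean

# Per-column MSE in standardized units and in original units.
mse_scaled = ((scale(y_true) - scale(y_pred)) ** 2).mean(axis=0)
mse_original = ((y_true - y_pred) ** 2).mean(axis=0)

# The two differ by the variance of each target column.
print(mse_scaled)
print(mse_original / y_std**2)
```

So multiplying a standardized MSE by the squared stdev of that target gives the number to hold against the original 1e24.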
Has anyone faced issues installing scikit-bio with Python before?
i'll try these
!build
When you install a library through pip on Windows, sometimes you may encounter this error:
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
This means the library you're installing has code written in other languages and needs additional tools to install. To install these tools, follow the following steps: (Requires 6GB+ disk space)
1. Open https://visualstudio.microsoft.com/visual-cpp-build-tools/.
2. Click Download Build Tools >. A file named vs_BuildTools or vs_BuildTools.exe should start downloading. If no downloads start after a few seconds, click click here to retry.
3. Run the downloaded file. Click Continue to proceed.
4. Choose C++ build tools and press Install. You may need a reboot after the installation.
5. Try installing the library via pip again.
I have already installed Visual Studio C++ 14 or greater. But people have had the same issue with pillow and AutoGPTQ, and they downgraded the package. Maybe I could try to downgrade the hdmedians package
so if i'm understanding correctly, I normalise the data and fit the model with it. Then I denormalise the data once it is predicted
yes
then compare results of that method vs not normalizing at all
vs using the normalized MSE
you can probably make some shortcuts in doing this comparison if you have the stdevs bound to a name
Unless the original data was in the 1e10+ I think it probably did help though
I have to admit my ignorance not sure how MSE is calculated in this case where there's two outputs
and what it means vs the univariate case
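On the two-output question: scikit-learn's `mean_squared_error` has a `multioutput` parameter whose default, `'uniform_average'`, averages the per-output MSEs with equal weight (`'raw_values'` returns them separately). A small sketch with made-up numbers that reproduces that calculation in plain NumPy:

```python
import numpy as np

# Made-up targets and predictions with two outputs per row.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.5, 12.0], [2.0, 18.0], [2.5, 33.0]])

# One MSE per output column.
per_output = ((y_true - y_pred) ** 2).mean(axis=0)
# Uniform average of the per-output MSEs.
overall = per_output.mean()

print(per_output)
print(overall)
```

Because every column has the same number of rows, the uniform average is the same as the mean over all squared errors. If the two targets live on very different scales, the larger one dominates that single number, which is another reason to look at the per-output values.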
I did this but it's not working:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
data = pd.read_csv("data/TA_LGs_combined.csv")
data = (data-data.mean())/data.std()
X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler(with_mean=True, with_std=True)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
means = scaler.mean_
stds = scaler.scale_
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)
model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred_scaled, _ = model.predict(X_test_scaled, return_std=True)
y_pred = y_pred_scaled * stds + means
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')
plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()
plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()
mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])
print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
It gives:
Traceback (most recent call last):
File "/Volumes/Isaac's External Drive/7d-emulator/model_ml.py", line 31, in <module>
y_pred = y_pred_scaled * stds + means
~~~~~~~~~~~^
ValueError: operands could not be broadcast together with shapes (116,2) (3,)
@agile owl
data = pd.read_csv("data/TA_LGs_combined.csv")
data = (data-data.mean())/data.std()
means = data.mean()
stds = data.std()
y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]
you can remove the StandardScaler
you don't need it anymore
think that's just confusing things
one second I'll write it out
File "/Volumes/Isaac's External Drive/7d-emulator/model_ml.py", line 27, in <module>
y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]
~~~~~~~~~~~~~~~~~~~~~~~^
...
File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/common.py", line 561, in require_length_match
raise ValueError(
ValueError: Length of values (116) does not match length of index (2)
yep
means = data.mean()
stds = data.std()
data = (data-means)/stds
is what I meant
what is the shape of y_pred_scaled
nx2 right?
nx2 you mean
well no there will only be one for the means and stds
nothing i got confused
you want them to be 1x2 so you can broadcast them onto the 116x2 array
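A quick sketch of the shape rule at play, using placeholder arrays (the numbers are made up; only the shapes mirror the traceback). A length-2 array lines up with the last axis of a 116x2 array, so no transpose is needed, while a length-3 array cannot be broadcast. If the statistics are pandas Series, converting them with `.to_numpy()` first gives plain arrays like these.

```python
import numpy as np

y_pred_scaled = np.zeros((116, 2))      # same shape as in the traceback
target_stds = np.array([2.0, 3.0])      # hypothetical per-target stds, shape (2,)
target_means = np.array([10.0, 20.0])   # hypothetical per-target means, shape (2,)

# (116, 2) * (2,) + (2,) broadcasts row by row.
y_pred = y_pred_scaled * target_stds + target_means
print(y_pred.shape)

# Three values (e.g. one per input column) do not fit two output columns.
all_means = np.array([1.0, 2.0, 3.0])
try:
    y_pred_scaled * all_means
except ValueError as err:
    print("broadcast failed:", err)
```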
i'll transpose them
Traceback (most recent call last):
File "/Volumes/Isaac's External Drive/7d-emulator/model_ml.py", line 33, in <module>
y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]
~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'method' object is not subscriptable
This is what I have now:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
data = pd.read_csv("data/TA_LGs_combined.csv")
means = data.mean()
stds = data.std()
data = (data-means)/stds
X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)
model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train, y_train)
y_pred_scaled, _ = model.predict(X_test, return_std=True)
print(y_pred_scaled.shape)
means = means.transpose
stds = stds.transpose
y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')
plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()
plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()
mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])
print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
Hi I have a simple problem but can't seem to find the numpy function, I want to have a condition where the numpy array contains any values in another fixed numpy array. i.e. if array 1 is [1, 2, 256] are there any values that in it that are the same as array 2 [32, 256, 256] which is the fixed array?
Any help would be appreciated
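For the overlap check, `np.isin` is probably the function being looked for. A short sketch with the arrays from the question:

```python
import numpy as np

candidate = np.array([1, 2, 256])
fixed = np.array([32, 256, 256])

# Element-wise membership test: which entries of candidate appear in fixed?
mask = np.isin(candidate, fixed)
has_overlap = bool(mask.any())

# The shared values themselves, sorted and de-duplicated.
shared = np.intersect1d(candidate, fixed)

print(mask)         # [False False  True]
print(has_overlap)  # True
print(shared)       # [256]
```

`mask.any()` is the condition to put in an `if`.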
you need to call transpose
I tried this and it still doesn't work
Okay I tried this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv("data/TA_LGs_combined.csv")
means = data.mean()
stds = data.std()
X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]
Ymeans = y.mean()
Ystds = y.std()
data = (data-means)/stds
X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)
model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train, y_train)
y_pred_scaled, _ = model.predict(X_test, return_std=True)
print(y_pred_scaled.shape)
print(np.shape(Ymeans))
print(np.shape(Ystds))
y_pred = y_pred_scaled * Ystds.values + Ymeans.values
print(np.shape(y_pred))
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')
plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()
plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()
mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])
print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
The mse is now 3x10^24
please tell me i did something wrong 😅
Ymeans is already 0 and ystds is already 1
what do you mean?
i'll do the same for x
actually you don't need to unscale X and it looks like you did already
I'm a little surprised the MSE went up tho
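A possible explanation for the jump, going only by the script posted above: `y_test` is sliced from the normalized frame, while `y_pred` has been converted back to original units, so the final `mean_squared_error(y_test, y_pred)` mixes the two scales. The sketch below uses made-up data (not the real dataset) to show how large that mismatch gets compared with a like-for-like comparison.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up targets on a very large scale.
y_orig = rng.normal(loc=1e12, scale=3e11, size=(100, 2))
y_mean, y_std = y_orig.mean(axis=0), y_orig.std(axis=0)

y_test_scaled = (y_orig - y_mean) / y_std
# Pretend the model is reasonable in scaled space.
y_pred_scaled = y_test_scaled + rng.normal(scale=0.5, size=y_orig.shape)
y_pred_orig = y_pred_scaled * y_std + y_mean

def mse(a, b):
    return float(((a - b) ** 2).mean())

mixed = mse(y_test_scaled, y_pred_orig)   # scaled truth vs unscaled prediction
consistent = mse(y_orig, y_pred_orig)     # both in original units

print(f"mixed units:      {mixed:.3e}")
print(f"consistent units: {consistent:.3e}")
```

If that is what happened, unscaling `y_test` the same way as the predictions (or keeping an unscaled copy of the targets) before computing the MSE would give the number that is comparable to the original 1e24.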
also why are you using a GaussianProcessRegressor
thought that process regression was for things where the x variable is time
im seeing if i can use svms
but might be wrong on that
you're right
i have another project with gps
they're best for time series data
maybe try RF with the normalization?
I didn't know normalization could make models substantially worse but I guess it can
in my experience it usually improves things a lot, or it's within a very small error of the original one way or the other
could it be my code is flawed?
I can't look at it anymore without running it myself
if you give me the data i would take a look at some things
i can't do that unfortunately
it just gets hard to spot things at a certain point without actually running the code
maybe someone else can see something I missed
thanks for the help anyways
yeah I have ADHD I can't keep reading code without running it xD
I would have been awful at programming with punchcards
yeah they are good with low sample size
mse for mass1 0.
for mass2 1.
MSE for mass1: 0.49354012715694157
MSE for mass2: 1.2389531306923065
are those good values?
they should be
100% that’s something we constantly have to explain to new developers
What software do you guys use for python when running data? I am tired of Jupyter notebook…..
personally I use VSCode, though Jupyter is not that bad, specially when you want to make graphs often. Definitely not fit for all things though.
PyCharm is also somewhat popular
vscode and vim
Thanks guys. I am facing some issues downloading certain plugins using jupyter now for a long time without figuring out the solution! I guess it’s time to try something else. I wanted to try PyCharm. I will have a look at it tomorrow
What do you think?
The model seems to perform slightly worse on new data than on the data that is used for training
So is it overfitted/underfitted on the data?
i think underfitted?
my lecturers taught me that when the validation loss starts to go up
thats when its overfitting
currently i think my model is underfitting based on this diagram
I guess from the looks of it, it isn't going up yet, so it seems to be a decent fit right now
Though on the right it does seem to go up slightly, but it's too inconsistent to tell
should i increase the learning rate and the number of epochs to see how it goes in this case?
Testing different learning rates can always be a good idea
In this case it's not overfitting yet, but it might be underfitting because of model complexity
Because even the training accuracy stops at 80%
But not underfitting because it isn't trained for long enough
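The "validation loss starts to go up" rule of thumb can be sketched as a small helper. The function name, the `patience` parameter and the loss values are all made up for illustration; this is a simplified version of what early-stopping callbacks do, not any particular library's implementation.

```python
def best_epoch(val_losses, patience=3):
    """Return the index of the lowest validation loss seen before the loss
    has failed to improve for `patience` consecutive epochs."""
    best_idx = 0
    for idx, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = idx          # new best: keep training
        elif idx - best_idx >= patience:
            break                   # no improvement for a while: likely overfitting
    return best_idx

# Hypothetical validation losses: improving, then creeping back up.
val = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
print(best_epoch(val))
```

A noisy curve like the one described would need a larger `patience`, since a single uptick does not say much.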
If I have a high number of classes/labels for multi label classification, should I reduce my dropout ratio? ATM it is at 0.5 or 0.2 but I have nearly 700 labels so 0.5 seems quite high
looks like it's been spinning its wheels since 50 epochs tbh
Hey, I'm currently doing numerical integration of differential equations using scipy.integrate.odeint
import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt
def model(y, t, b, c):
    S, I, R = y  # unpack the state vector into S, I and R
    dSdt = -b * I * S  # rhs of dS/dt
    dIdt = b * I * S - c * I  # rhs of dI/dt
    dRdt = c * I  # rhs of dR/dt
    print(S, I, R)
    return [dSdt, dIdt, dRdt]  # important that they are this way round!
# Parameters
b = 0.002
c = 0.5
# Initial conditions
y0 = [999, 1, 0]
# Time points
t = np.linspace(0, 20, 1000)
# Solve ODE
y = scipy.integrate.odeint(model, y0, t, args=(b, c))
plt.plot(t, y)
plt.legend(['S(t)', 'I(t)', 'R(t)'])
plt.xlabel('t')
plt.title('SIR Model')
in my model function, does it matter what way round i do S, I, R = y?
Guys, this might be a stupid question but does it make sense to write models in C instead of python? Because of its insane speed?
Sure but there's python libraries that do most of their work in C or Cython like Numpy and scipy
Yeah I was gonna say that the array structure or the underlying logic is in c which is why i think its very good to learn it
Yes, if it's a novel algorithm that can't be composed out of existing parts of libraries (e.g. a series of NumPy functions). I recommend still writing Python bindings for that C implementation, so you can access it from Python where you have access to all the other Python libraries.
This sums it up well
even just implementing basic algorithms in c can be a good idea, because you actually get to understand them in a deeper way without any external stuff
I agree but you seem to have the same issue as I do so I'll give both of us some advice: don't forget to do stuff instead of going deeper down the rabbit hole 🐰
For learning purposes it can be a very good idea. C has a DIY culture surrounding it, and so you will find more resources on how to implement things yourself than with Python, which is more focused on using existing libraries (because it's focused on productivity, which is dominated by having the thing already written for you, and easy access to it). But depending on what you are trying to do, there may not be any need to have a deeper understanding of the implementation, in which case it's kind of a waste of time.
yeah, thats why sometimes i struggle to use things that i dont understand and it is smth that i have to get comfortable with.
This is how you end up writing a GPU implementation of the FFT for a month.
Yeah i know what you mean, i would do it for learning purposes to understand how algorithms and other important functions has been made and truly work
Exactly, otherwise you'll be figuring out how electrons pass through silicon
There's nothing inherently wrong with this, but you need to be aware of this being a thing
yeah i mean, trying to always go deeper is good but there should be a limit and we gotta be aware of whether what we do actually makes any difference or sense.
Knowing (some) C is generally a good idea though I agree
It's also a strange language in the sense that it's small so in theory it's easy to pick up, but you need to do most of the things yourself ... because it's small
Even if only so you can make use of some C library that you find in the wild that does not have Python bindings.
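As a rough illustration of calling into a C library that ships no Python bindings, here is a sketch using the standard-library `ctypes` module against the C standard library's `strlen`. It assumes a POSIX-like system (Linux or macOS); the lookup works differently on Windows.

```python
import ctypes
import ctypes.util

# Locate the C standard library. find_library may return None, in which case
# CDLL(None) falls back to symbols already loaded in the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))
```

Declaring `argtypes` and `restype` is the part that needs some C knowledge: get them wrong and the call can crash or silently return garbage.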
yeah exactly, it's very "primitive", like you don't even have any OOP concept, but it makes you think in a certain way and understand things at a deeper level. like pointers are very interesting, and using them gives you an idea of how programming languages might have implemented things
C does have abstract data types / opaque types so you can still encapsulate etc. without OOP 🙂
Or with OOP, it just does not have the same syntax sugar and stuff as C++.
The real thing that makes C "primitive," that you will immediately run into, is that its standard library is bare, it does not come with any of the common data structures you will find in other language's standard libraries (or built into the language itself).
Hi, in the case of using Matrix Factorization, how do we determine the dimension of embedding matrices? is it a hyperparameter or something like a weight parameter?
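For what it's worth, the embedding dimension in matrix factorization is a hyperparameter: it fixes the shapes of the factor matrices, and the entries of those matrices are the learned weights. A toy sketch with random made-up ratings and plain gradient descent (the sizes, learning rate and step count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 6, 5
k = 2  # embedding dimension: picked by us (e.g. via validation), not learned

# Made-up rating matrix with values 1..5.
ratings = rng.integers(1, 6, size=(n_users, n_items)).astype(float)

# Factor matrices: their entries are the learned parameters.
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))

def reconstruction_error(U, V):
    return float(((ratings - U @ V.T) ** 2).sum())

initial_err = reconstruction_error(U, V)

lr = 0.01
for _ in range(500):
    err = ratings - U @ V.T
    # Simultaneous gradient step on both factors (right side uses the old U and V).
    U, V = U + lr * err @ V, V + lr * err.T @ U

final_err = reconstruction_error(U, V)
print(U.shape, V.shape, initial_err, final_err)
```

Trying a few values of `k` and comparing error on held-out ratings is the usual way to choose it.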
anyone please lol
i just need to know why you cant swap them around
they're read in exactly the order you tell them to
imagine you have x, y = [0, 3], and compare that to y, x = [0, 3]
the items in the list are assigned to x and y one to one, and in the order you specify
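To make the ordering point concrete for the SIR function above: `odeint` passes the state in the same order as `y0` and expects the derivatives back in that order too. The sketch below evaluates the right-hand side once at the initial state, with the names unpacked correctly and then swapped (`sir_rhs` and `sir_rhs_swapped` are hypothetical names for this illustration).

```python
b, c = 0.002, 0.5
y0 = [999, 1, 0]  # order: S, I, R

def sir_rhs(y, t, b, c):
    S, I, R = y  # matches the order of y0
    return [-b * I * S, b * I * S - c * I, c * I]

def sir_rhs_swapped(y, t, b, c):
    I, S, R = y  # names swapped: I now holds 999 and S holds 1
    return [-b * I * S, b * I * S - c * I, c * I]

print(sir_rhs(y0, 0.0, b, c))
print(sir_rhs_swapped(y0, 0.0, b, c))
```

Python does not complain about the swapped version; the code just treats 999 as the infected count, so the dI/dt and dR/dt values come out completely different.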
So when scipy calls the function it inputs a list for y?
hi
Welcome to our wonderful data science chat
:) thank you
What kind of graph would best represent this kind of data? i.e. each set contains the % of students that passed a test for each course. This is just an example, in my particular case I may have 10 courses or more
I would probably just use a bar chart for showing each individual % and not even try to show the intersections (or lack thereof)
if you really, really want to show the intersections, then maybe something like https://seaborn.pydata.org/generated/seaborn.pairplot.html
Is Visual Studio (VSCode) free? Is asking me for a product key
Visual Studio != Visual Studio Code
I just figured out that I could use the free version.
has anyone done multi class classification for multi layer perceptron before?
I think I'm doing it wrong
my codes:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# Load the dataset
file_path = "modified_result.csv"
df = read_csv(file_path, header=None)
# Split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# Ensure all data are floating point values
X = X.astype('float32')
# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the input features
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
# Determine the number of input features
n_features = X.shape[1]
print(n_features)
print(X)
print(y)
n_classes = 5
# Define model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))
model.add(Dense(n_classes, activation='softmax'))
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# fit the model
history = model.fit(scaled_X_train, y_train, epochs=300, batch_size=32, verbose=1, validation_data=(scaled_X_test, y_test))
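One thing that may be going wrong here: `sparse_categorical_crossentropy` expects the labels to be integer class indices from 0 to `n_classes - 1`, and the script imports `LabelEncoder` without ever using it. If the last CSV column holds strings or arbitrary codes, they need encoding first. A NumPy-only sketch of that encoding with made-up labels (`np.unique` with `return_inverse=True` produces the same kind of sorted mapping):

```python
import numpy as np

# Made-up raw labels, as they might come out of df.values[:, -1].
y_raw = np.array(["cat", "dog", "bird", "dog", "cat"], dtype=object)

# classes: sorted unique labels; y_encoded: index of each label in classes.
classes, y_encoded = np.unique(y_raw.astype(str), return_inverse=True)
n_classes = len(classes)

print(classes)     # ['bird' 'cat' 'dog']
print(y_encoded)   # [1 2 0 2 1]
print(n_classes)   # 3
```

Deriving `n_classes` from the data like this also avoids a hard-coded 5 that may not match the file.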
Is there anyone here who have used structural equation model in jupyter notebook?
Hello Hello ^^
No, I am not. Do you have any problems?
Are you after suggestion on stuff to do?
I think those graphs are a lil hard to digest. I decided to use a heatmap. Basically, the more students passed a certain course, the greener the cell is. Idk how well this will work out when I have 600k students but we'll see heh.
.
Has anyone ever used Google OR-Tools or its CP-SAT solver before?
are these two questions the same or different,
General question regarding image-processing:
I have to code an edge-detection program, optimally in the end with ML algorithms to detect the edges of drill-holes (scanning electron microscope, SEM)
Would I use:
OpenCV
or
scikit-image?
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 534, in _run_script
exec(code, module.__dict__)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\main.py", line 58, in <module>
ax.imshow(df_wc)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\__init__.py", line 1478, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\axes\_axes.py", line 5756, in imshow
im.set_data(X)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\image.py", line 723, in set_data
self._A = self._normalize_image_array(A)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\image.py", line 688, in _normalize_image_array
raise TypeError(f"Image data of dtype {A.dtype} cannot be "
helper code
def create_wordcloud(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]
    wc = WordCloud(width=500, height=500, min_font_size=10,background_color='white')
    df_wc = wc.generate(df['message'].str.cat(sep=" "))
    return df_wc
main code
df_wc = helper.create_wordcloud(selected_user,df)
fig, ax = plt.subplots()
ax.imshow(df_wc)
st.pyplot(fig)
using streamlit
it says
Image data of dtype object cannot be converted to float
@past meteor do you use deap?
I'm trying to figure out how to evolve using a training set but keep results based on the validation set
their hall of fame class is how I've been doing it but now I want to use a different fitness than the one it's training on and that just doesn't seem to be supported at all
do you just manually transplant the models between the stages of your testing plan?
No, when I used GA's I wrote everything myself using Numpy, Numba and Dask
the way I have it split up it's train-val-train-val-train-val etc. and then test data at the end
I didn't want to keep reusing the earliest data because I think the earliest data is actually the least likely to be relevant so I just had it keep going through time with overlapping periods
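The train-val-train-val scheme described here is usually called walk-forward (or rolling-window) validation. A small sketch of how such index splits could be generated; the function name and the window sizes are made up, and the final test period is assumed to be held back by the caller. scikit-learn's `TimeSeriesSplit` covers a related expanding-window variant.

```python
def walk_forward_splits(n_samples, train_size, val_size, step):
    """Yield (train_indices, val_indices) ranges that move forward in time.
    With step < train_size + val_size, consecutive windows overlap."""
    start = 0
    while start + train_size + val_size <= n_samples:
        train = range(start, start + train_size)
        val = range(start + train_size, start + train_size + val_size)
        yield train, val
        start += step

# Toy example: 10 time steps, 4 for training, 2 for validation, advance by 2.
for train, val in walk_forward_splits(n_samples=10, train_size=4, val_size=2, step=2):
    print(list(train), list(val))
```

Each validation window always lies strictly after its training window, so nothing from the future leaks into training.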
The results are uninspiring so far lol
Can I be honest and say I'm not comfortable giving advice here? 🙂
sure
I don't know if this is actually evidence of overfitting or me doing something that doesn't make sense
I don't fully understand what you're trying to do. It also seems to be related to finance, which is a domain with extensive research
wasn't expecting it to be so bad
Before I could give real advice I'd have to read tons of finance literature but that's not particularly my interest
mmm it's not really finance specific just need to figure out the right way to do out-of-sample time series validation on these
I guess the fact it's an agentic problem is relevant but not really anything finance specific
it seems like df_wc is returning None value but idk why 😦
I'm not sure ignoring finance specific methods is smart. That's kind of my point
I think what I did does disprove that what I'm trying to do works without having to look at the test data
mmmm... there's no agreed upon way to forecast time series or build trading bots in finance if that's what you mean
I think most people just use time series/signals processing stuff like anything else
there's crazy quasi-religious people who use fibonacci sequences etc.
I did my MSc in business engineering, but majored in data science. Tons of my friends did the finance or actuarial science tracks. Many of those that did, went into stock prediction etc
When I asked them about their methods it was totally different things than I did
Did they all give you the same answers? I would be somewhat surprised if they did.
unless they all worked for the same place
Nah this was for their thesis
most of finance is a form of regression problem with noise and probabilistic uncertainty
solved :()
signals engineering people tend to do the best i think
when it comes to the quantitative stuff
it sounded like you worked with signals of some kind
I want to learn Reinforcement Learning AND Symbolic Regression AND finish learning EA
this is so cursed
too much to learn
too little time
u preparing for what
the future
oh
just in general
no i'm out of school
o
well i think i pretty much started with ml stuff
and im in 3rd year of eng (just started)
currently im doing some projects to get a good hand
then ill start unsupervised
reinforcement learning ill try it for fun tbh
yeah I don't think that you will actually master everything just in school I also took a masters in data science but we didn't touch on all those topics
i see yea
so I know the supervised and unsupervised stuff pretty well
reinforcement learning they didn't cover
we covered deep nets, architectures and LLM
but missing some stuff
it's such a broad field to cover everything
ooh yea deep learning omy i have no idea about it i need to learn that too
ugh too much to learn hahaha
I think that deep learning is actually not as useful for a lot of practical things as reinforcement learning
reinforcement learning takes on a classic problem solving form for anything happening over time that requires interactions
u think so? i was told that its not that useful when compared to supervised, unsupervised, and deep learning
I guess it depends on what you are doing
i see
well as u are a very experienced person can u provide me a guide map
currently im trying out different projects on supervised
well for example it's the type of algorithm that will learn how to beat a game
i did some maths and intuition of basic algorithms of supervised
like chess
yes i know something about rl
we need those punishments and reward system
its pretty interesting and fun
yeah you need to map different game outcomes to different rewards/penalties
I honestly don't know all the details otherwise I wouldn't be saying I need to learn it still
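A toy sketch of the "map outcomes to rewards" idea, on a made-up five-position game. It applies the standard Q-learning update rule, but sweeps over every state and action deterministically instead of exploring at random, so it is a simplification rather than a full RL agent. All names and constants are illustrative.

```python
# Tiny made-up "game": positions 0..4 on a line, start in the middle.
# Reaching position 4 is a win, reaching position 0 is a loss.
REWARDS = {0: -1.0, 4: 1.0}      # outcome -> reward/penalty mapping
TERMINAL = set(REWARDS)
ACTIONS = (-1, +1)               # step left or step right
ALPHA, GAMMA = 0.5, 0.9          # learning rate and discount factor

# Q[state][action] is the estimated long-run value of taking that action.
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(5)}

for _ in range(50):
    for state in (1, 2, 3):                  # non-terminal positions
        for action in ACTIONS:
            nxt = state + action
            reward = REWARDS.get(nxt, 0.0)
            future = 0.0 if nxt in TERMINAL else max(Q[nxt].values())
            target = reward + GAMMA * future
            # Q-learning update: move the estimate toward the target.
            Q[state][action] += ALPHA * (target - Q[state][action])

# Greedy policy: in each position, pick the action with the highest value.
policy = {s: max(ACTIONS, key=lambda a: Q[s][a]) for s in (1, 2, 3)}
print(policy)
```

The only thing the learner is told is which end states are good or bad; the preference for stepping right in every position falls out of the updates.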
specially i saw that person who created a RL agent which plays pokemon red, i was so inspired by that
and got interest in RL
I think anyone who says RL isn't useful compared to supervised and unsupervised is mistaken
I think supervised and unsupervised learning answer lower conceptual level problems
I think RL's problem space is on a higher conceptual level
which makes it more useful not less
its like brute force
eh, I'm not sure about how the calculus works but I'm pretty sure there's some method to it beyond brute force
hm u saw that pokemon red video where he trains those agents to play the game?
supervised learning: What should this house cost?
reinforcement learning: At what discount to value should I buy the house and when should I sell it?
that's what I mean by different conceptual level
oh it could be used in that way too ?
thats awesome
i see haha i have very little knowledge about it then
i thought it would be better only in games and stuff
i never understood it in business scale
if something can play a game it can probably do something useful
hm
if you think about it
yes
and figure out what it's doing
games are fun because they require intelligence of some sort
yea i can think about it hmm
so like what are the problem solving techniques that a programmatic agent to play a game would use
and how can they be applied to other domains outside of games
yes i see how you are thinking of applying the game ideology in real life
thats pretty interesting
but for me it will prolly take too long to get there
im still stuck at supervised
oki gtg ill complete this project and think of something else seeya
Hi bicubic interpolation is not implemented on MPS, how do i get around this for the diffusion model for a Mac
Hi, can someone help me, I’m trying to plot some images and see their corresponding arrays but nothing shows up, this is what I have:
`#LOADING AND SPLITTING DATA
x_train = tfds.load('celeb_a', split='train[:10000]', shuffle_files=True)
x_test = tfds.load('celeb_a', split='test[:2000]', shuffle_files=True)
#PREPROCESSING IMAGES
x_train_arrays = []
x_test_arrays = []
for dataset in (tfds.as_numpy(x_train), tfds.as_numpy(x_test)):
    for img in dataset:
        image = img['image']
        if dataset is tfds.as_numpy(x_train):
            x_train_arrays.append(image)
        else:
            x_test_arrays.append(image)
#PLOTTING
train_examples = x_train_arrays[:2]
for example in train_examples:
    image = example['image']
    print(image)
    plt.imshow(image)
    plt.show()`
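A possible reason nothing shows up here (a guess, not something confirmed in the channel): `dataset is tfds.as_numpy(x_train)` compares against a freshly created object, so it is never true, every image goes into `x_test_arrays`, and the plotting loop iterates over an empty list. The sketch below reproduces that with a stand-in `as_numpy` function (hypothetical, it only mimics "returns a new object on every call") and shows one way to avoid the identity check.

```python
# Stand-in for tfds.as_numpy (assumption: like the real one, it returns a NEW object per call).
def as_numpy(dataset):
    return iter(dataset)

x_train = [{"image": "train_0"}, {"image": "train_1"}]
x_test = [{"image": "test_0"}]

# Comparing against a fresh call is always False, so the "train" branch never runs.
probe = as_numpy(x_train)
print(probe is as_numpy(x_train))  # False

# Pairing each target list with its dataset removes the need for any identity check.
x_train_arrays, x_test_arrays = [], []
for target, dataset in ((x_train_arrays, x_train), (x_test_arrays, x_test)):
    for example in as_numpy(dataset):
        # Store the image itself; later code should not index ['image'] a second time.
        target.append(example["image"])

print(len(x_train_arrays), len(x_test_arrays))  # 2 1
```

With the lists filled this way, the plotting loop would use each `example` directly as the image array.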
Hi, I need help. My Google Colab can't install chatterbot with !pip install chatterbot. Does anyone have the same problem?
what error message did you get when you tried it? be sure to always show error messages when asking for help, if you have one
What is the best way to train a text-classification model using BERT to classify labels which are part of a hierarchy?
I.e. if i have categories like:
Sports/College Sports/College Basketball
Ideally I'd like the model to be able to know that College Basketball is a child of College Sports and Sports
I have been playing with a basic multi-label model currently, but it seems to struggle quite heavily when looking at the overall 700 categories spread across 4 tiers of Hierarchy.
And realistically if the model is predicting College basketball strongly, it should have Sports and College Sports as high if not higher.
I haven't done this (yet) but this is called hierarchical classification, just so you can look at the literature if you care for it.
Something you can definitely do is train classifiers per tier but you'll end up with many of them. Doing it should not be prohibitively "expensive" if you train BERT to classify the first level, take the final layer embeddings and do the "cascade" with simpler models (say gradient boosting, logistic regression and whatnot). Drawback is that this isn't some end-to-end thing you can train in one go
😅 Do you have any links which are fairly basic to read? My mathematics is not great so a lot of the theory goes over my head.
One thing that seems to be weird is roberta-base struggles less than xlm-roberta-base so I guess the amount of data also plays a role, but 800k points for 700 categories seems fairly solid
Ty ty
What I described are "local classifiers"
And it's their fav, seems my intuition is decent then 😄
- Train bert to predict the top most category
- Train 1 model per tier 1 to predict each tier 2
- Train 1 model per tier 2
- Train 1 model per tier 3
Just use (the same) embeddings for all these
hmm so in the end we have roughly 700 smaller trained models
for their specific niche
Yeah, you can have less than that if you train 1 model for each tier for instance
Then you have just 3 or 4
But how would you pre-feed to the model on the next tier that what the parent tier was?
You wouldn't
But you have 600 less models 😄
You can also zero out things that are not possible with post-processing ofc
I think it's a matter of trying out all options and seeing which has the best metrics
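The two post-processing ideas from this thread (parents should score at least as high as their children, and impossible labels can be zeroed out) can be sketched in plain Python. The `PARENT` mapping, the label names other than the Sports ones, and the scores are made up for illustration; this is not the setup anyone in the channel actually ran.

```python
# Hypothetical slice of the taxonomy: child label -> parent label.
PARENT = {
    "College Sports": "Sports",
    "College Basketball": "College Sports",
    "Elections": "Politics",
}

def ancestors(label):
    """Yield every ancestor of `label`, nearest first."""
    while label in PARENT:
        label = PARENT[label]
        yield label

def make_consistent(scores):
    """Raise each ancestor's score to at least the score of its descendants."""
    fixed = dict(scores)
    for label, score in scores.items():
        for parent in ancestors(label):
            fixed[parent] = max(fixed.get(parent, 0.0), score)
    return fixed

def zero_out_impossible(scores, top_level):
    """Zero every label whose ancestor chain does not end at `top_level`."""
    masked = {}
    for label, score in scores.items():
        root = [label, *ancestors(label)][-1]
        masked[label] = score if root == top_level else 0.0
    return masked

raw = {"Sports": 0.40, "College Sports": 0.55, "College Basketball": 0.90,
       "Politics": 0.10, "Elections": 0.30}
consistent = make_consistent(raw)
masked = zero_out_impossible(consistent, top_level="Sports")
print(consistent["Sports"], masked["Elections"])  # 0.9 0.0
```

Whether to trust the predicted tier 1 label for the masking step, or the true label during evaluation, is the same error-propagation question raised later in the thread.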
Hmm, yeah the 700 models approach might be best, or at least i need to partition the data somewhat because it currently gets stuck and can't learn with the 700 categories and multi-label
or at least... It does not work out how to learn within a sane time frame, it can just about do it on Tier 1 and Tier 2 where there are ~100 - 150 categories, but past that it seems to largely just go "aha yes I think this text is all categories"
I guess you'll have to try it out
Something about hundreds of models is unsettling to me lol
😅 I think it may raise some questions.
What I will probs do first is partition tier by tier, a step at a time.
So to start with we do 29 (Tier 1) classifiers and see if that aids it
that way we only have 30 classifiers
and then all we have to do is workout how to stick this into Pytorch Lightning without it breaking things
Idk how you'll do inference but you can train as usual and have a method that does all the multiplications till you reach the last layer, this is the one you then multiply with your entire dataset to have n_docs x emb_size and then the scuffed part begins
It's also really a case of how you want to evaluate this thing, if it gets it wrong at tier 0 will you propagate errors?
T1 wrong -> T2 wrong -> T3 -> Wrong or do you do this:
T1 wrong -> use actual T1 to predict T2 -> ...
The only issue I may have... Is GPU memory 😅 These machines only run A10g's so they have at most 24GB but the BERT models in memory are already pretty big (XLM at least :P)
What i'm curious about is how I manage loss / optimizers
Yeah so the models you use for this stuff can be anything, it can be xgboost running on say CPU
400-500 category specific fully connected layers aint gonna work
In scuff we trust
At prediction time for real for real it can be a big dict/look-up table lol
oh trust me that is what I am planning to do lol
that is the simple part 😅 It is getting Pytorch lightning and the rest of the training framework to not lose its mind
But you can do these separately?
Well ideally it would be, but the way lightning sets it up with our loggers etc... Means if we did that, each CI run to train would create 30 runs in our Neptune system 😅 although fuck it maybe that is useful tbh
We'll worry about it not clogging up the graphs later 😅
Another question, is there anyway to add a manual layer of data points for the model to consider
Like a guidance as such?
I.e I already have a vague idea of the categories so it would be "it is likely to do with X, Y and Z"
Is there a way i can connect ollama/llama2 with the internet (so it can give updated info)? mostly i just want AI so i don't have to read docs, or maybe some way i can train a smol AI on some docs
does anyone know a resource that I could use to explain spectral clustering in a simplified manner?
I am making a recipe recommendation app and using flask as backend, I have a csv containing recipes data with a column with list of ingredients, but they are in the form like :
"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher salt, divided, plus more', '2 small acorn squash (about 3 lb. total)', '2 Tbsp. finely chopped sage']"
But I want pure ingredient names without any quantity or state i.e., chicken, salt, acorn etc
Is there any way to filter my ingredients like this in python, I've heard about libraries such as nltk but I am not sure about them
Have you looked into spaCy? It's not my expertise, but when it comes to NLP, this may be what you need. Found a random reddit link talking about recipes & spacy: https://www.reddit.com/r/LanguageTechnology/comments/erw687/using_nlp_to_parse_recipe_lines_with_spacy/
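Before reaching for an NLP library, a rough first pass with the standard library might already remove quantities, units, and trailing notes. The unit list and the regexes below are illustrative assumptions and only cover the sample row from the question. It does not get all the way to "chicken" or "salt"; dropping descriptors such as "whole" or "kosher" would need a word list or part-of-speech tagging, which is where something like spaCy could come in.

```python
import ast
import re

# The sample cell from the CSV: a Python-style list stored as a string.
raw = "['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher salt, divided, plus more', '2 small acorn squash (about 3 lb. total)', '2 Tbsp. finely chopped sage']"

# Illustrative unit list (assumption), each optionally followed by a period.
UNITS = r"(?:tsp|tbsp|cups?|lbs?|oz|g|kg|ml)\.?"

def rough_name(line):
    line = re.sub(r"\([^)]*\)", " ", line)                # drop parenthetical notes
    line = line.split(",")[0]                              # drop ", divided, plus more"
    line = re.sub(r"^[\s\d¼½¾⅓⅔/.–-]+", "", line)          # drop the leading quantity
    line = re.sub(rf"^{UNITS}\s+", "", line, flags=re.IGNORECASE)  # drop a leading unit
    return " ".join(line.split())                          # normalize whitespace

# ast.literal_eval turns the string back into a real list without using eval().
ingredients = [rough_name(item) for item in ast.literal_eval(raw)]
print(ingredients)
# ['whole chicken', 'kosher salt', 'small acorn squash', 'finely chopped sage']
```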
I'm getting error, how does it not find the file from the directory- `from torch.utils.data import DataLoader, Dataset
import torchvision.transforms as T
import torch
import torch.nn as nn
from torchvision.utils import make_grid
from torchvision.utils import save_image
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import random
%matplotlib inline
image_size = 64
DATA_DIR = '/Users/Name/Documents/ai\ info/test.out.npy'
X_train = np.load(DATA_DIR)
print(f"Shape of training data: {X_train.shape}")
print(f"Data type: {type(X_train)}")
1 image_size = 64
2 DATA_DIR = '/Users/Name/Documents/ai\ info/test.out.npy'
----> 3 X_train = np.load(DATA_DIR)
4 print(f"Shape of training data: {X_train.shape}")
5 print(f"Data type: {type(X_train)}")
File /opt/homebrew/Cellar/jupyterlab/4.0.7_1/libexec/lib/python3.11/site-packages/numpy/lib/npyio.py:427, in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
425 own_fid = False
426 else:
--> 427 fid = stack.enter_context(open(os_fspath(file), "rb"))
428 own_fid = True
430 # Code to distinguish from NumPy binary files and pickles.
FileNotFoundError: [Errno 2] No such file or directory: '/Users/Name/Documents/ai\ info/test.out.npy' `
Now i added `import os` and `abs_data_path = os.path.abspath(DATA_DIR)`
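One likely culprit in the traceback above is the backslash in `ai\ info`. In a shell that backslash escapes the space, but inside a Python string it stays as a literal character, so the path points at a folder whose name contains a backslash. `os.path.abspath` would not change that, since the path is already absolute. A small sketch (the folder name is taken from the traceback; whether the file really lives at the corrected path is an assumption):

```python
from pathlib import Path

# Same characters as the path in the traceback: the backslash is part of the string.
shell_style = r"/Users/Name/Documents/ai\ info/test.out.npy"
# Plain space, no escaping needed inside a Python string.
plain = "/Users/Name/Documents/ai info/test.out.npy"

print("\\" in shell_style)  # True
print("\\" in plain)        # False

data_path = Path(plain)
print(data_path.parent.name)  # ai info

# np.load accepts path-like objects, so this would be the call to try:
# X_train = np.load(data_path)
```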
anyone able to help with power bi it is so annoying
cuz seems like i am logged it but have a pop up to connect acc and then i get this ^
which 3b model would be better to run on low ram devices like a phone or a raspberry pi, btlm-3b or orca mini 3b
i have room for orca 2 7b but it needs to be quantized to 2 or 3 bits, which i heard reduces accuracy by a lot
OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacty of 5.79 GiB of which 155.31 MiB is free. Including non-PyTorch memory, this process has 4.77 GiB memory in use. Of the allocated memory 4.53 GiB is allocated by PyTorch, and 118.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
how can i make it so that it can take longer time to generate img but doesn't run outa memory or is that not possible
When I run this code in google colab: '!pip install Chatterbot' this comes out: Collecting ChatterBot
Downloading ChatterBot-1.0.5-py2.py3-none-any.whl (67 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.8/67.8 kB 3.1 MB/s eta 0:00:00
Collecting mathparse<0.2,>=0.1 (from ChatterBot)
Downloading mathparse-0.1.2-py3-none-any.whl (7.2 kB)
Requirement already satisfied: nltk<4.0,>=3.2 in /usr/local/lib/python3.10/dist-packages (from ChatterBot) (3.8.1)
Collecting pint>=0.8.1 (from ChatterBot)
Downloading Pint-0.22-py3-none-any.whl (294 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.0/294.0 kB 9.4 MB/s eta 0:00:00
Collecting pymongo<4.0,>=3.3 (from ChatterBot)
Downloading pymongo-3.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (516 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 516.2/516.2 kB 11.9 MB/s eta 0:00:00
Collecting python-dateutil<2.8,>=2.7 (from ChatterBot)
Downloading python_dateutil-2.7.5-py2.py3-none-any.whl (225 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 225.7/225.7 kB 12.1 MB/s eta 0:00:00
Collecting pyyaml<5.2,>=5.1 (from ChatterBot)
Downloading PyYAML-5.1.2.tar.gz (265 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265.0/265.0 kB 15.0 MB/s eta 0:00:00
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (setup.py) ... error
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Hey! Was wondering if anyone has any experience with Kaggle? I'm a math major but wanted to get better at coding and kaggle seems fun and interesting, but I am a beginner at python at best. I was wondering if there are any pre-reqs i should know
Kaggle has both courses and competitions, if you're a beginner you can just start with the courses
Probably by reducing the batch size and number of workers in your DataLoader.
Anybody here
why do you want to know why someone is here? people who might glance at the channel aren't going to feel enticed just to announce that they're looking at the channel.
Is it possible to create a python program where you give it multiple documents at once and once you do a keyword search that it can go through those documents and pull out that information
I have no idea if it is a short or long program or what python modules can help out in doing this
what are the format of the documents?
(and yes, it is possible.)
Sorry as PDFs and word documents
this notebook I wrote a few months ago has code for extracting text from both of PDF and word documents. It's based on code that I stole from someone else. https://github.com/center-for-threat-informed-defense/tram/blob/main/user_notebooks/predict_multi_label.ipynb
as for "keyword search", if the keyword is "walk", do you want "walked" to count as a match?
and what do you want the output to be? all the sentences that contained a match?
No as like cut that part off
Yes as mostly all the sentences that do match the word
what about "swim" and "swam"? do you want those to count as matches for each other?
No as I want them to be separate
okay. if you're not a native English speaker, keep in mind that "swam" is just the past tense of "to swim", so they're the "same word"
anyway, it sounds like you're just trying to match substrings.
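Once the text has been extracted from the PDFs and Word documents, the substring matching described here needs nothing beyond the standard library. The sketch below assumes the extracted text is already in a dict keyed by file name (the names and contents are made up), and the sentence splitter is deliberately naive. A `whole_word` switch is included because it was not fully clear from the conversation whether "walked" should count for "walk".

```python
import re

def matching_sentences(text, keyword, whole_word=False):
    """Return the sentences in `text` that contain `keyword` (case-insensitive)."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if whole_word:
        pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
        return [s for s in sentences if pattern.search(s)]
    needle = keyword.lower()
    return [s for s in sentences if needle in s.lower()]

# Hypothetical extracted text, one entry per document.
documents = {
    "report.pdf": "We walk to work. She walked home! Nobody swam today.",
    "notes.docx": "They swim daily. A walk helps.",
}

hits = {name: matching_sentences(text, "walk") for name, text in documents.items()}
strict = matching_sentences(documents["report.pdf"], "walk", whole_word=True)
print(hits)
print(strict)  # ['We walk to work.']
```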
I am a native English speaker and so I do get the idea
for the record, I had no reason to think that you weren't
Okay as I do get it
So your program can work that same way
@serene scaffold thank you for showing me this
Do you think also that this is an AI model or can it be as I think of it just being data science
you don't need AI for this
I agreed to that idea as for me I do not think it is to be but my boss said to me he thinks it can be
Maybe throw some blockchain in there too for your boss 😉
anyone here work with MS/MS data like itraq ratios?
Will do
anyone can help with this? it's Anaconda Navigator
there is error, but the error didn't specify what is the error
No idea but proly just remake the environment
I've been trying to figure this out for a minute now, any tips appreciated:
I want to send in a piece of text to the ChatGPT chat API and sequentially ask it questions and get answers to those questions one by one. I don't want to create a new input prompt each time to ask a new question on the same text which would use context tokens for the same text each time.
I'm currently submitting all questions in the prompt and asking it to end each answer with two line breaks so I can break up the responses to each question. But I'm sure there's a better way to do it.
for question in questions:
    messages.append({"role": "user", "content": question})
chat_response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=messages,
)
full_response = chat_response.choices[0].message.content
# Split the response into parts based on '\n\n'
response_parts = full_response.split('\n\n')
summary = response_parts[0]
category = response_parts[1]
background = response_parts[2]
claims = response_parts[3]
stakeholders = response_parts[4]
impact = response_parts[5]
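An alternative to splitting one big reply on blank lines is to ask the questions one at a time and keep the running message list, so each answer arrives separately. The chat completions endpoint is stateless, so the document text is still sent with every request and this does not save input tokens. The `send` callable and `fake_send` below are stand-ins introduced for this sketch so it runs offline.

```python
def ask_sequentially(send, document, questions):
    """Ask each question in turn, keeping earlier turns in the message history.

    `send` is any callable that takes the message list and returns the reply text.
    """
    messages = [{"role": "user", "content": f"Here is the text:\n{document}"}]
    answers = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        reply = send(messages)
        # Record the reply so the next question sees the full conversation.
        messages.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers

# Offline stand-in for the real API call (hypothetical).
def fake_send(messages):
    return f"answer to: {messages[-1]['content']}"

answers = ask_sequentially(fake_send, "some article text", ["Summary?", "Category?"])
print(answers)  # ['answer to: Summary?', 'answer to: Category?']
```

With the client from the snippet above, `send` could be a small function that calls `client.chat.completions.create(model="gpt-3.5-turbo-1106", messages=messages)` and returns `.choices[0].message.content`.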
What are some feature selection techniques for k-means clustering. My variables are mostly categorical.
You'll have to pick the features that you think matter and aren't redundant, and one hot encode them
Check out Prince they have a good chart on their GitHub of what methods you can use, you could for instance do a factor analysis and then run k-means on the result
could you send a link?
https://github.com/MaxHalford/prince
https://maxhalford.github.io/prince/famd/
Thanks a lot.
Could you guys suggest me a good resource to learn Numpy, Pandas, Matplotlib and Seaborn.
Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
Can you be specific in what you don't understand? Have you studied the course so far? I don't think people will explain the entire coursework over discord 😄
Alright thanks for sharing
i wanna ask if this roadmap for machine learning is good enough ??
https://ml2022.my.canva.site/
ping if someone responds
looks decent if you are more or less starting from scratch, yes
i know python but have no idea about ML and i want to learn it
do you already know numpy and how to use it to perform matrix computations?
obviously imma skip the part that i already know
no i have basic coding knowledge and bunch of algorithms and data struts but no libs of python
actually i used to do a lot of competitive prog
I am just completing with a high score a really challenging online course offered by the MIT that takes place every 6 months or so
so should i do this one??
or stick with that roadmap??
I think it is better if you do some easier courses first, MIT on edX.org (aka. MITx) is pretty hardcore
one of my cousins approved it (he is a data scientist and usually works with ML at microsoft)
I can recommend you an easy one just to get a first certificate without much fuss, but with some nice examples, which I took, it happens to be https://www.edx.org/learn/data-visualization/ibm-visualizing-data-with-python and it is about visualizing data, not ML, but it is a nice little aperitif
so this roadmap then MIT course ??
yes. lots of TODO
seems good enough
its okay i really gotta land a job with end of my 3rd year
note that the Google Crash Course in Machine Learning is decent, but somewhat high level... but yeah try it out
thanks for the information buddy
no problem, let's put the cards on the table... these playlists are not too long and they are super helpful in preparing/refreshing relevant math skills
https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PL0-GT3co4r2wlh6UHTUeQsrf3mlS2lk6x
what do you prefer the course of coursera or this??
cuz i already completed a module of it
these playlists are basically must watch nowadays, but I consider them to add/support learning, not replace it
two more gems in the same sense
https://www.youtube.com/watch?v=cy8r7WSuT1I
A visual trick to compute the sum of two normally distributed variables.
thankyou so much bud
and then just a bit later, when you consider/learn about artificial Neural Networks
What's actually happening to a neural network as it learns?
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks
have fun learning!
thankyou very much
Can anyone tell me about the steps i need to follow in order to fine tune a yolo v7 on a customized dataset. I have done other models in keras but i can't understand where to start from while working with yolo.
I'm working with tensorflow, and I'm trying to fit a dataset to a model, but I am not sure what its wanting from me. I got a dataset where each dataset.take(1) provides me with a dictionary of 16 features with each a tensor with size (2048, 512) for each feature. 2048 is the batch size and 512 is the window size. This configuration is likely totally wrong, but what should I be looking to give to the model, what does it want? The reason I'm not using numpy ndarrays or pandas df is because all this can't fit in my memory.
I'm getting errors like this: Input to reshape is a tensor with 1048576 values, but the requested shape has 2048 [Op:Reshape]
The 1048576 is 512*2048.
So your data is shape (batch_size, window_size, num_features) or (2048, 512, 16) @hallow cargo
I got some progress and in the generator I attempted to convert the dictionaries to just a tensor, and I got (16, 2048, 512) which might just be where the issue lies
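Going from a dictionary of per-feature tensors to a features-last batch is mostly a question of which axis the features are stacked on. The sketch below uses NumPy with tiny stand-in sizes (4 and 8 instead of 2048 and 512) and made-up feature names. `tf.stack` takes the same `axis` argument, though this has not been run against the dataset in question.

```python
import numpy as np

# 16 hypothetical features, each shaped (batch, window); feature i is filled with i.
features = {f"f{i}": np.full((4, 8), i, dtype=np.float32) for i in range(16)}

# Default stacking puts the new axis first: (features, batch, window).
stacked_first = np.stack(list(features.values()))
print(stacked_first.shape)  # (16, 4, 8)

# Stacking on the last axis gives (batch, window, features) directly.
stacked_last = np.stack(list(features.values()), axis=-1)
print(stacked_last.shape)  # (4, 8, 16)

# An existing features-first array can be rearranged with a transpose instead.
rearranged = np.transpose(stacked_first, (1, 2, 0))
print(np.array_equal(rearranged, stacked_last))  # True
```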
I'm using CsvDataset() and am sorta new to tf.data.
Yes
So you basically have a batch size of 2048, and you use 512 time steps, each with 16 features?
Yes
So if you can't load it all at once, you can just lower the batch size
oh wait
That wasn't the issue, but something just dawned on me
I could have used numpy arrays and batched it using a generator and still fit all the data to my memory?
Since my problem was loading the entire data to my memory.
It should be 16777216 (2048 * 512 * 16) floats of 32 bit probably, so about 67 MB-ish
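The estimate above can be checked with a few lines of arithmetic, assuming 4 bytes per float32 value:

```python
batch_size, window_size, num_features = 2048, 512, 16
bytes_per_float32 = 4

values = batch_size * window_size * num_features
size_bytes = values * bytes_per_float32

print(values)              # 16777216
print(size_bytes / 1e6)    # 67.108864  (megabytes)
print(size_bytes / 2**20)  # 64.0       (mebibytes)
```

This is the size of a single batch only. The full set of windows is much larger, which matches the later remark that loading everything into one array filled the RAM.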
Don't see how that wouldn't fit at once
tf probably has some type of class for a dataloader, you wouldn't load it all at once
pytorch f.e. has a class for a dataset, which can retrieve a sample given an index, and dataloader would load in batches of samples in random order given the dataset class
It definitely does, but I am new to the tf.data framework and have no clue how datasets in tensorflow particularly work
Since loading it all up into one giant ndarray filled my ram up, I attempted to use tf.data, but I've been very confused with what I even want as an input for the model in terms of tf.data
I'm still not completely sure about the type of data, you say the data has a batch size, window size, and number of features. I'd expect a single sample of the data to just be a window size, and number of features correct? (num_timesteps, num_features)
Yes, correct
So you made a custom dataset class or?
I don't often use tf, so I'm not super sure how this is normally done with that library
No, I initially loaded a csv file using tf.data.experimental.CsvDataset(), then windowed and batched the data.
Fair enough, it has little guidelines
This is how I windowed it if it helps:
def window_data(self, data_ds):
    data_ds = data_ds.window(self.window_size, shift=self.shift, stride=1, drop_remainder=True)
    for sub_ds in data_ds.take(1):
        print(sub_ds)
    data_ds = data_ds.flat_map(lambda x, y: tf.data.Dataset.zip(({key: ds.batch(self.window_size) for key, ds in x.items()}, y)))
    return data_ds
I guess you could just make a class that selects a random sample each time, and selects a random window from this sample
Depending on what kind of data you want to give to your model
Sorry, I might have not been clear, this is really my issue. I don't really know, I do know its supposed to have a shape resembling (2048, 512, 16), but the whole data structure is very confusing.
What is your model supposed to do?
Receive 16 sequences of time series data of length 512 each into an LSTM. I successfully did this using numpy arrays of shape (batch_size, 512, 16) with a smaller dataset, but as I increased the number of samples, it got too big to fit my memory, and thus I attempted to use tf.data. Would there be any alternative ways of doing this?
How did you get batch_size samples before then?
I loaded up a numpy array of (total samples, 512, 16) and designated batch_size to be 2048 (or lower then) in the model.fit function.
and another one with the labels
And the data is a single csv or?
Yes
I would probably find out what tf has to offer. Intuitively you'd want to have a function that can load a single random sample of shape (512, 16), and then another function which combines batch_size samples into a single batch of data to feed to your model.
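The "one sample at a time, then combine into batches" idea can be sketched as a plain generator over a 2D array of shape (timesteps, features). It slices windows lazily instead of materializing every window up front. The function name, the choice of labelling each window with its last step, and the small sizes are assumptions for illustration. Wrapping such a generator with `tf.data.Dataset.from_generator` would be one way to feed it to Keras, but that part is not shown or tested here.

```python
import numpy as np

def window_batches(series, labels, window_size, batch_size, shift=1):
    """Yield (x, y) batches with x shaped (batch, window_size, num_features)."""
    batch_x, batch_y = [], []
    for start in range(0, len(series) - window_size + 1, shift):
        batch_x.append(series[start:start + window_size])  # a view, not a copy
        batch_y.append(labels[start + window_size - 1])    # assumption: label of the last step
        if len(batch_x) == batch_size:
            yield np.stack(batch_x), np.array(batch_y)
            batch_x, batch_y = [], []
    if batch_x:  # final, possibly smaller batch
        yield np.stack(batch_x), np.array(batch_y)

# Tiny stand-in data: 100 timesteps, 16 features.
series = np.random.rand(100, 16).astype(np.float32)
labels = np.arange(100)

first_x, first_y = next(window_batches(series, labels, window_size=8, batch_size=4))
print(first_x.shape)     # (4, 8, 16)
print(first_y.tolist())  # [7, 8, 9, 10]
```

For data that does not fit in RAM at all, `series` could be a memory-mapped array (for example from `np.load(..., mmap_mode="r")`) so that only the slices being batched are read from disk.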
I see, thank you so much, you gave me some ideas, I think I might be able to fix this. Would it be alright if I tagged you or messaged you with some questions incase I reach a dead end?
Is there a way to filter data within a sheet and display the entire row that display the filtered word as this is trying to work with xlsx
For me I have no idea
I am wanting to get the user to type in their keyword and than filter that row from that sheet name to displayed
For some reason whenever I re-run my program multiple times, I get different results. I am not using any modules other than __future__ annotations Any ideas? Link: https://paste.pythondiscord.com/ZMFEUMVNDAJFLPTI3UCO6MESQ4
sets are unordered - the order of the elements you get when you iterate over them depends on the string hashing 'salt', which varies each time in order to prevent some cyber attacks
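If the program needs the same output on every run, one common fix is to impose an order explicitly whenever a set is iterated. A minimal sketch with made-up values:

```python
words = {"pear", "apple", "fig"}

# The iteration order of a set of strings can differ between interpreter runs,
# because string hashes are randomized per process unless PYTHONHASHSEED is fixed.
print(list(words))  # order not guaranteed

# Sorting gives the same order every time.
ordered = sorted(words)
print(ordered)  # ['apple', 'fig', 'pear']
```

Setting the `PYTHONHASHSEED` environment variable to a fixed value also makes runs repeatable, but sorting (or using a list or dict, which keep insertion order) does not depend on how the script is launched.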
Yeah sure, but like I said, I don't generally use tf, so I can only really help with a perspective of using pytorch
Any of you guys familiar with Pytorch Lightning, that can help me understand whether I can load a "super-model" which contains the VQGAN model I am interested in + a transformer, and then just deep copy the VQGAN model?
I plan on finetuning the VQGAN model, but I don't know if there is any sort of connection to the original model?
Guys, what algebra and trigonometry do i need specifically to be able to understand calculus?
if you've just taken high school math, you should probably brush up on the rules for exponents and logarithms. (those are part of algebra.) for trig, you should know the various trig identities (like what compositions of trig functions are identical).
.latex For example, $\sqrt{x} = x^\frac{1}{2}$, $x^{-1} = \frac{1}{x}$
.latex More generally, $\sqrt[n]{x} = x^\frac{1}{n}$, $x^{-n} = \frac{1}{x^n}$
Do i need a strong understanding of polynomial and rational functions?
I don't know what counts as a "strong understanding", but calculus involves differentiating and integrating those kinds of functions.
Okay
It depends on how deep you go as usual
Like, I remember that in freshman math they'd always show us the geometric interpretation of both Lin alg and calculus, I always zoned out and I turned out just fine
there is a good mit ocw course on single variable calculus. there's also khan academy. and iirc some other stuff.
i'm actually doing said mit ocw course on single variable calc rn
well yeah, im just trying to make sure that im learning maths while doing machine learning and ai, it has been very challenging cause self-learning multivar calculus as a 16 year old is not easy which is completely fine because i like struggling, but i keep asking myself that "what am i gonna use all of this knowledge for" since im beginner at machine learning and i dont know maths(at a higher level) its very challenging to keep going with both of them and not knowing whether all of the time spent on it was worth it. At this point all i can do is trust that i will use all of this knowledge, again for me the biggest problem is the balance between learning mathematics and machine learning. i think i should probably just focus on maths and then do what you recommended and then go with statistics and hope that everything will make sense.
To give you a concrete idea, there's the possibility to create new methods relying on determinantal point processes. The determinant of the kernel matrix is in a way a measure of dissimilarity. This is clear if you know the geometric interpretation of a determinant. Does this matter? It depends, if you're doing foundational research where you're inventing new methods yes. If you're doing any kind of other work, not so much.
I think you should really start applying the things you're learning
well, because of the lack of mathematics i have no idea what is this, but i get the idea.
A lot of math I learnt was "on-demand" because I needed it for ML stuff
but how?
Kaggle
Build projects etc.
When you're doing that, go to sci kit learn's user guide, read what they have to say about the algorithms as you're doing the projects
okay i was trying to implement a basic linear regression model which i successfully did and understood the idea behind it but when it came to find the best fitting line, since i do not know partial derivatives i was stuck on understand gradient descent.
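For the gradient descent step, the only calculus needed is two partial derivatives of the mean squared error. With predictions w*x + b and error e = (w*x + b) - y, the slope of the loss with respect to w is (2/n) * sum(e * x) and with respect to b is (2/n) * sum(e). The sketch below uses made-up data on the line y = 2x + 1 and a hand-picked learning rate.

```python
# Fit y = w*x + b by gradient descent on mean squared error.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 2x + 1

w, b = 0.0, 0.0
learning_rate = 0.05  # hand-picked; too large a value would make the updates diverge
n = len(xs)

for _ in range(5000):
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Partial derivatives of MSE = (1/n) * sum(error^2) with respect to w and b.
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2 / n) * sum(errors)
    # Step downhill: move each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Each partial derivative just asks "if I nudge only this one parameter, how does the loss change?", which is why the two updates can be computed separately.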
Sooner or later you'll ask yourself questions that will lead you to picking up math textbooks
but again i have to remind myself that i have to get comfortable with not understanding everything
Okay then try and learn about that and only that, while you are you'll encounter new things to learn 🙂
ive never used kaggle tbh, should i just go thru the tutorials or how should i do it
fair enough
it seems that no matter where do i go and what i do at some point i will get stuck with not having the mathematical background for the machine learning topic that im learning
The first tutorials explain how the platform works. Afterwards I recommend you do the tabular playground series. If you forget, just ping me
cool
thanks
That's why you need to do kaggle and see at what level of abstraction ML is done you'll feel way more comfortable then
is the tabular playground series a competition? or how would i "do" it?
and also what projects would i build?
ahh man im so frustrated by the amount of information that i need to know and i just cant "balance it"
Guys, I intend to create a chatbot that can answer any science-related question. I'm considering using chatterbot library and some pretrained transformer models (via the transformers package). I was wondering if this is a good approach? Also, where can I get data that the model will train on?
I'm making use of pytorch as well
and how would you balance it with learning mathematics? I assume that the root of my problem is i wanna understand everything very deeply and without high level mathematics i wont be able to do that, so i would need to just try building projects and learn mathematics and then as i learn more maths i understand more i assume
Maybe start with the courses because they explain the platform
Well, it really depends on your taste. I never really got joy from doing math for the sake of it so I typically start at the algorithm level and learn what math is required etc.
Aside from that I just took math courses in uni 🤷
You'll still get all of those
well yeah my problem is similar i just feel bad when im learning smth that im not sure what i will use for
like when im learning polynomial function, im thinking about that it is needed for calculus and calculus is needed for machine learning
an approach in which i try to build something would be much nicer because i actually will get to see that what do i actually need in order to understand it
Guys, please
Hi, do you guys know if there is a specific rule to follow for reshaping data? I’m looking at an example where the initial images are 28 x 28 x 3 and in the following lines of code the data is reshaped as so:
tf.keras.layers.Dense(units=7*7*32, activation=tf.nn.relu), tf.keras.layers.Reshape(target_shape=(7, 7, 32)),
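The one hard rule for a reshape is that the number of elements must stay the same, which is why the Dense layer has exactly 7*7*32 units before being reshaped to (7, 7, 32). Why 7 in particular is a guess, since the rest of the model is not shown: if two later layers each double the spatial size (for example transposed convolutions with stride 2), 7 grows to 14 and then to 28, matching the 28 x 28 images.

```python
import math

dense_units = 7 * 7 * 32
target_shape = (7, 7, 32)

# A reshape is only valid when the element counts match.
print(dense_units, math.prod(target_shape))  # 1568 1568

# Assumption: two later upsampling layers, each doubling height and width.
side = 7
for _ in range(2):
    side *= 2
print(side)  # 28
```

The 32 is a free choice (the number of channels at that stage); only the product has to agree with the Dense layer's unit count.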
I’m trying to make a chatbot too but I’m making it from scratch. There are a lot of videos I’ve seen on YT where they make a bot in like 20mins. Never have actually watched the vids. but they are prolly using some pretrained models. So yeah, it could be a good approach.
In terms of the data, just go to Kaggle and looks for datasets for chatbots. They have a lot.
Thanks for this. I'll check out Kaggle.
Does anybody know how I can use Python to determine how a protein will fold? I already have the start and stop codon working properly, now I need to do any entire polypeptide chain. I already have the tRNA conversion table that takes 3-letter blocks (A, C, U, G) to convert into the right amino acid. I don't know how to get the directions down, especially in 3D. What modules should I use, and how should I structure this?
You mean the thing Google/DeepMind trained AlphaFold for?
That's an ultra hard and specialised problem to say the least, probably borderline impossible if not literally impossible to code it by hand - if nothing else, it's hard enough that Google had a team of multiple scientists working on building an AI for it
anyone know how to change the delimiter in a csv file to a semicolon? At me if you know.
Do you mean change the text file and replace ,’s with ;’s? Or do you mean how to read a semicolon delimited file?
I mean is there a way to go into the .csv with notepad or something of that nature, and say "make the separator a semicolon in this .csv file."
Ctrl H
technically you could use Find and Replace in VSCode or similar tools, but that would mess up any commas inside of quotes. For example, a file with the header `title,quote` and the row `discord,"Welcome to our server, we hope you have fun"` would become
```
title;quote
discord;"Welcome to our server; we hope you have fun"
```
you're better off just specifying the separator when reading in python; both the built-in csv library and most if not all popular libraries like pandas support it
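If the file really does need to be rewritten with semicolons, here is a minimal sketch using the built-in csv module. The sample text is made up, and io.StringIO stands in for real files (which you would open with newline=""):

```python
import csv
import io

# Hypothetical sample data; in practice these would be real file objects.
source = io.StringIO('title,quote\ndiscord,"Welcome to our server, we hope you have fun"\n')
target = io.StringIO()

reader = csv.reader(source)                 # parses the quoted comma correctly
writer = csv.writer(target, delimiter=";")  # writes semicolon-separated rows
writer.writerows(reader)

converted = target.getvalue()
print(converted)
```

The comma inside the quote survives because the parser, not a text replace, decides where fields end. For reading only, pandas accepts the separator directly, e.g. pd.read_csv(path, sep=";").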
Basically, I have a dataset of norms of vectors.
I want to, at increments of distance Y from the center, plot the probability of finding a norm of equal or smaller size. I want to plot it in 2D. How would I go about doing this?
So, like, an empirical CDF of the norms? You could use matplotlib's hist with cumulative=True, say.
Not sure what you mean about plotting it in 2d, though - if you plot ecdf(√(x^2+y^2)) by x,y it'd be rotationally symmetric, which isn't very interesting. You could do it, though (my approach would probably be calculating the ECDF as a cumulative histogram, then making a mesh, calculating the distance from the center of each point in the mesh, then using linear interpolation between the points in the ECDF to get the values for these distances (np.interp should be powerful enough for that, IIRC)).
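A rough sketch of that approach, assuming the norms come from 2D vectors (the random data and the mesh range are made up). It computes the ECDF from the sorted norms directly instead of a cumulative histogram, which is a small swap on my part:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 2))      # hypothetical data
norms = np.linalg.norm(vectors, axis=1)

# Empirical CDF: fraction of norms <= each sorted value.
sorted_norms = np.sort(norms)
ecdf_values = np.arange(1, sorted_norms.size + 1) / sorted_norms.size

# Evaluate on a 2D mesh by interpolating at each point's distance from the center.
axis = np.linspace(-3, 3, 61)
xx, yy = np.meshgrid(axis, axis)
radius = np.sqrt(xx**2 + yy**2)
ecdf_2d = np.interp(radius, sorted_norms, ecdf_values, left=0.0)
```

ecdf_2d can then go into something like matplotlib's contourf(xx, yy, ecdf_2d). As noted, the result is rotationally symmetric by construction.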
I meant something like this:
However, for some reason it doesn't plot the lower levels...
Also, if I wanted to confirm whether or not latent representations that are geometrically close are also semantically close, and the same for their codebooks, how would I go about doing this...?
not sure about the context here, but usually we interpret geometric closeness (e.g. cosine similarity) as semantic closeness
show your code? https://paste.pythondiscord.com/
It is for VQGAN.
The relationship isn't that simple anymore, because we discretize the latent representations into their codebooks
this is not immediately true though
you need to include such a criterion as part of the cost when learning the latent representation, otherwise this is not in general the case
as a small example, things like LDA explicitly include the minimization of intra class variance and maximization of inter class variance in the latent representation, because simply using an SVD does not do this
even if both were to find a latent representation of the same dimensionality
i almost mentioned this but didn't. unless i'm misunderstanding the context, don't people just do it anyway?
i see, i didn't know the context. i don't know enough about how GANs work and don't have practical experience with them, so i'll butt out 🙂
sure they do, but it's wrong 😛
or rather, not as helpful as they'd like
the correct choice of basis or transformation into the correct space or manifold is the secret sauce
in classification tasks, which GANs usually address through their discriminator, you formulate the cost explicitly in terms of correctly classifying the input based on a latent representation, which means that representation needs to preserve or enhance the class similarity. it promotes learning a "good" representation
Hi, I have this data. The Variant IDs column has ids separated with " | ". I need each variant id in front of each order id, how do I do it? Thanks
right, although wasn't that kind of the magic of word2vec? embeddings for the purpose of predicting the context from the current token (or vice versa) turned out to also give good results when used for semantic similarity, at least empirically
right, that should be the case. i also didn't read the whole convo, so my comment was very generally about the task of finding a latent representation. if you're already choosing a representation properly, it should be the case that you get semantic similarity related to geometry
i jumped in at the same point you did
but i think i see what you mean - word2vec was designed to produce embeddings like that, even if the actual learning task doesn't look like it
in pandas specifically? split the variant ids on | (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html), then "explode" each variant id to its own row (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html), then concatenate the two strings
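A minimal sketch of those steps. The column names, the sample values and the "-" joiner are all made up, since the actual data wasn't shown:

```python
import pandas as pd

# Hypothetical column names and values; the real data was not shown.
orders = pd.DataFrame({
    "order_id": ["A1", "A2"],
    "variant_ids": ["10 | 11 | 12", "20"],
})

# Split on the literal "|" (a single-character pattern is not treated as a regex),
# give each variant id its own row, then strip the surrounding spaces.
orders["variant_id"] = orders["variant_ids"].str.split("|")
exploded = orders.explode("variant_id", ignore_index=True)
exploded["variant_id"] = exploded["variant_id"].str.strip()

# Optionally combine order id and variant id into one string.
exploded["order_variant"] = exploded["order_id"] + "-" + exploded["variant_id"]
print(exploded[["order_id", "variant_id", "order_variant"]])
```

Splitting on the three-character " | " directly would need care, because a multi-character pattern can be interpreted as a regular expression where | means "or".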
if you train it so that it can detect sentiment, for example, then it'll make embeddings so that words that convey similar sentiment are easy to discern
if the cost does not explicitly include distance in the embedding space, it might still not be the case that geometric similarity is the same as semantic closeness though. that depends on what is done after the embedding too. the remaining network layers can simply learn to decode the embedding, which then only needs to be "good enough" for the remaining layers
by that what i mean is that the choice of metric is as important as the choice of embedding, networks usually learn both in a black box fashion
e.g. two classes might actually be "parallel" to each other after embedding, but you can distinguish them by their norm instead. but if you don't know which metric to apply after embedding, then you wouldn't know this. a cosine similarity here would tell you they're the same class, but something like an SVM would still work
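A toy illustration of the "parallel" case, with two made-up 2D embeddings that point the same way but differ in length:

```python
import numpy as np

direction = np.array([1.0, 2.0])
class_a = 1.0 * direction  # hypothetical embedding from class A
class_b = 5.0 * direction  # same direction, larger norm: class B

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine similarity only sees the angle, so the two look identical.
cos_sim = cosine_similarity(class_a, class_b)

# The norms differ clearly, so a metric that looks at length separates them.
norm_gap = abs(np.linalg.norm(class_a) - np.linalg.norm(class_b))
print(cos_sim, norm_gap)
```

Cosine similarity comes out at essentially 1 while the norm gap is large, which is the situation where a length-aware classifier still works.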
broadly the embedding does encode similarity both in geometric and semantic sense, provided that the embedding is learned based on some sort of classification task. but usually you also don't know HOW it encodes it
yeah, that makes sense. and good point about them being "parallel", i don't think i've encountered that before
i don't think it needs to be explicit in the cost function though, does it? again consider word2vec, my understanding of how it works is that words with similar meanings should appear in similar contexts. we'd expect a classifier to learn an embedding space in which the classes are well-separated. so a model being trained to predict the context from the word or the word from the context should tend to learn embeddings that create good separation between plausible words and implausible words for the context.
again i know word2vec isn't a great example because very smart people designed it for this particular task
it doesn't if you also learn the decoding along with the embedding, which you usually do
all you train for is the final "task": like "please find the sentiment behind these sentences"
and you establish an architecture that does this passing through a low dimensional representation
this forces the network to both learn a good representation, and then use it correctly
but done black box like this, you don't know which embedding, nor which decoding is done
if you swap the decoder, it will probably fail for the same embedding
right. i never considered that you could just train embeddings without some kind of "task" at the other end
how would that even work?
text is probably a bad example since i think there is no classical approach for it that works remotely well
but for example, we can think of fourier and SVD/PCA
yeah
you can get philosophical regarding the interpretation for those, but the task is in any case very simple: represent the original data with as little error as possible based on a low dimensional representation
this inherently carries no semantic meaning. through linalg you can give it geometric meaning, but not anything with like "real world connection" inherently
but does it? surely there's some connection between distance in reduced PCA space and distance in the original space
through orthogonal projection, yes
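A small sketch of that connection, on made-up random data: PCA coordinates come from an orthogonal projection, so no pairwise distance can get longer after the reduction, only shorter or equal.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))   # hypothetical data
centered = data - data.mean(axis=0)

# PCA via SVD: rows of vt are the principal directions.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T      # coordinates in the top-2 PCA subspace

def pairwise_distances(points):
    diffs = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

original_d = pairwise_distances(centered)
reduced_d = pairwise_distances(reduced)

# Orthogonal projection never increases a distance (tolerance for float error).
never_longer = bool(np.all(reduced_d <= original_d + 1e-9))
print(never_longer)
```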
but usually the original space was not useful for you in a classification task, which was the problem in the first place if you wanted to find another representation
what you care about is a different kind of distance not explicit in the original space
in the classification context all you care about is creating separation between classes
but then the distance we care about is the distance between the classes the data belongs to, not between the original data
that's what i mean
right, but that's not the same as semantic similarity between data points either
which is why PCA does not carry semantic info
nor class info
only geometric info from the original space the data is collected in
are you assuming that distance is semantically meaningless in the original space?
yes, that's definitely the case
this is why there is more than one definition of distance
you endow the data with additional structure to be able to distinguish it, because looking at it naively does not work
well sure. but that's not because there cannot be semantic meaning in the original data
as an example, in some tasks, the measurement data [0,0,0,0,0,0,0,1] and [0,0,0,0,0,0,1,0] are very similar, and in other cases they're completely different
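To make that concrete, a sketch with the two vectors above plus a third one added for comparison. The two "readings" are my own illustrative choices of metric:

```python
import numpy as np

a = np.array([0, 0, 0, 0, 0, 0, 0, 1])
b = np.array([0, 0, 0, 0, 0, 0, 1, 0])
c = np.array([1, 0, 0, 0, 0, 0, 0, 0])  # extra vector added for comparison

# Reading 1: unordered categories. Euclidean distance treats every pair
# of distinct one-hot vectors as equally far apart.
euclid_ab = np.linalg.norm(a - b)
euclid_ac = np.linalg.norm(a - c)

# Reading 2: the position of the 1 is a location (e.g. a sensor index),
# so neighbouring positions are close.
def position_distance(u, v):
    return abs(int(np.argmax(u)) - int(np.argmax(v)))

pos_ab = position_distance(a, b)  # adjacent positions
pos_ac = position_distance(a, c)  # opposite ends
print(euclid_ab, euclid_ac, pos_ab, pos_ac)
```

Same data, two metrics, two different answers about which vectors are "similar".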
Yeah, this is one of the first things mentioned in a neural network class 😄 you're learning representations and a classifier at the same time
but you never hedge your bets on this. the whole point of latent representations based on task is to guarantee that this is the case, as much as possible
right, i think we agree on that
From an applied point of view I'd like to add that it's not fair to compare LDA and PCA. One is supervised and the other is not
While in statistics they're similar, in ML they fall in quite different places ime
this depends entirely on the task and your choice of metric. but notice: if we stop using euclidean distance, it's also no longer euclidean space. you have given the space where the data lives new structure altogether
meaning the data carries no class or semantic info if you don't first distinguish what kind of geometric object it is. which vector space and in which basis, or which manifold
that's part of what you learn when you look for a latent representation that is useful for classification and semantic seg.
it's not that the data doesn't carry the info in the original domain, it's that neither you nor the network know HOW it carries it
and euclidean space is usually not it 😛
i think we're converging on the point where we agree
i was just about to say, there are a lot of scenarios when you can choose an appropriate semantically meaningful distance function without supervised learning of similarities
Or rather, it's not that you can it's that you have to
sure, and when you can do that, you can skip ML altogether, or use a model-based approach that requires less data, smaller networks, etc
right
you should at least pick a proper metric to massage the network in the right direction
it can also be that choosing the wrong representation destroys the class/semantic info, too
It depends what your downstream task is
assuming the structure of the data is low dimensional in the linear sense can lead you to do PCA and reduce the rank, which surely minimizes distance in the frobenius norm sense w.r.t. the original data, but can very well throw away all the semantic info
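A toy sketch of that failure mode, on synthetic data I made up so that the class signal sits in the low-variance direction. Rank-1 PCA keeps the high-variance axis and drops the one that separates the classes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)

# Hypothetical data: large class-independent variance along x,
# small class-dependent offset along y.
x = rng.normal(scale=10.0, size=n)
y = np.where(labels == 1, 1.0, -1.0) + rng.normal(scale=0.1, size=n)
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)

_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]  # kept by rank-1 PCA (best fit in the Frobenius sense)
pc2 = centered @ vt[1]  # the discarded direction

def separation(values):
    # Gap between class means in units of the overall standard deviation.
    m0 = values[labels == 0].mean()
    m1 = values[labels == 1].mean()
    return abs(m0 - m1) / values.std()

sep_kept = separation(pc1)
sep_discarded = separation(pc2)
print(sep_kept, sep_discarded)
```

The kept component shows almost no class separation, while the discarded one separates the classes cleanly.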
the other piece is that, as in the example of word2vec, it's possible to construct learning tasks that implicitly result in semantically-meaningful similarity in the learned embedding space, without specifically doing something like triplet loss (although skip-gram negative sampling is very close to doing it explicitly)