#data-science-and-ml


past meteor
#

I use mlflow on premise

#

It's the most sensible way to track experiments (for me)

#

It all depends how deep you want to go, I think I've mentioned it before. If you're in it for the long game then starting with math and then going towards probability theory and then going towards statistics and then ML is the most sensible route.

This route is very long and a bit boring tho so you can start at any point you want, especially if you're more interested in applications of ML than building new methods.

echo mesa
#

and yeah im in it for the long term and I wanna build new methods rather than just implement stuff

hollow sentinel
#

you're probably going to need an advanced degree if you want to build new methods

#

is that what you're working towards?

echo mesa
hollow sentinel
#

oh i had no idea

#

cool stuff

echo mesa
#

I'm planning to go with computer science and mathematics and self-learn a bunch of things during the way

echo mesa
hollow sentinel
#

solid plan

echo mesa
#

My main advantage is time, so taking my time with everything rather than rushing would be much nicer

hollow sentinel
#

i think others in the channel would have more insight than me, but an advanced degree like an MS or a PhD will help you with building new models

#

it's a good goal to have as a 16 year old

echo mesa
echo mesa
hollow sentinel
#

good stuff

past meteor
hollow sentinel
#

yeah i can second kaggle

#

i'm going to have to learn a shit ton of sql if i get the l3harris offer lol

past meteor
#

SQL is truly something you can learn on the job though

hollow sentinel
#

they're using oracle ERP systems for military defense radios

#

agreed

past meteor
#

I learnt SQL at internships before I took a database course in uni

hollow sentinel
#

the puzzles on leetcode would be helpful too

echo mesa
hollow sentinel
#

just food for thought

echo mesa
past meteor
#

I just looked at w3schools SQL whenever I was stuck 💀

echo mesa
echo mesa
echo mesa
#

Also @past meteor Do you know any good statistics introduction books that you've read?

past meteor
narrow tiger
#

what is token classification?
i need a model that can do this simple task but so far i have been unable to make llama2 reply in a single word, and it also can't differentiate between spam and useful messages

#

should i be looking at some other model

true solar
#

hello

#

does anyone have courses in data science

narrow tiger
#

is data science same as /related to ML/AI?

echo mesa
# narrow tiger is data science same as /related to ML/AI?

well it's everything to do with data. i'm far from a professional, but data science is everything to do with data (managing, processing, cleaning), and machine learning models use that data to train and make further predictions. so yeah, it's very related to ml: we need data to train models, so without data science we can't do that.

#

Statistics is highly related to ml and data science and is very significant

nova widget
#

"openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable" I have "export OPENAI_API_KEY="xxxx" in my .env

#

how do I solve that?

narrow tiger
#

maybe try setting the env variable in the terminal before running the app

narrow tiger
nova widget
#

import os
os.getenv("OPENAI_API_KEY")

#

seems right

narrow tiger
#

use dotenv to load the env variables from .env first

nova widget
#

os.getenv(.env) ?

narrow tiger
#

from dotenv import load_dotenv
load_dotenv()

nova widget
#

kk

narrow tiger
#

u will need to install python-dotenv (pip install python-dotenv)

nova widget
#

im using venv

narrow tiger
#

what you mean u are using venv?

#

what is ur venv

nova widget
#

virtualenv

narrow tiger
#

that's ok

#

ur api keys are in .env right

nova widget
#

yes

#

also it was a bit contradictory whether to use export or set

narrow tiger
#

then load_dotenv() will load em

nova widget
#

ok

narrow tiger
nova widget
#

k, thanks, coffee is ready. bbl

dark hound
#

hi

odd meteor
#

They're explained alongside a lab session that shows you the code implementation of each chapter/topic in Python.

echo mesa
odd meteor
#

I use W&B and MLFlow mostly for my experiment tracking. I use the free tier version though and it works just fine for me.

I'll recommend you check out

  1. W&B
  2. MLFlow
  3. DVC
  4. ML Comet
  5. 🤗 Hub
  6. Evidently
buoyant vine
#

Yeah, some sort of dashboard is the final piece in the puzzle with our current setup

#

I might see if we can just buy a neptune account though, since it's less hassle for me to setup

echo mesa
odd meteor
echo mesa
snow wyvern
#

i am looking for someone who could help me install opencv for cuda when i follow the guide on YouTube it says it is unavailable for python 2 and 3 and the solution given for the problem given by the same video does not work

faint sigil
#

I wrote a script in python which I want to compile into single file using 'pyinstaller' or 'auto-py-to-exe' for easier distribution however I'm encountering issues and none of the answers online helped.
So I wrote the script in Pycharm and made venv inside which are required python packages installed.
When I run the following command:
pyinstaller --onefile --name BoulderDimensionsCalculator --icon=Polygon.ico main.py
Both pyinstaller and auto-py-to-exe compile it just fine. However, when I try to run the .exe, the console window just hangs there for a bit, then throws a module-not-found error and exits.
I tried including --paths="C:...\venv\Lib\site-packages" in the pyinstaller command as well but it didn't help

#

Missing module is rasterio.sample but sometimes it also throws error on some other like geopandas or whatever. So my assumption is pyinstaller doesn't compile python packages which are installed in venv, how to fix that?

odd meteor
buoyant vine
#

Anyone got any idea what could be causing PyTorch Lightning to never make progress after the first epoch?

#

we never actually complete the Epoch validation

#

but for some reason, 3 GPUs drop to 0% usage, and then just 1 GPU pins to 100% like it is still doing work

#

but it just doesn't progress, I think

#

My current train step:

        early_stop = pl.callbacks.early_stopping.EarlyStopping(
            monitor="val_loss",
            min_delta=0.00001,
            patience=5,
            verbose=False,
            mode="min",  # val_loss should be minimised; "max" would treat a rising loss as improvement
        )

        self.trainer = pl.Trainer(
            logger=neptune_logger,
            callbacks=[early_stop],
            max_epochs=self.model_config.n_epochs,
            num_nodes=1,
            log_every_n_steps=32,
            accelerator="auto",
            strategy="ddp_find_unused_parameters_true",
        )
        logger.info("Trainer has been created")

        self.trainer.fit(self.model, self.data_module)
        logger.info("Model fitting has been completed")
#
    def forward(self, token_ids: torch.Tensor, attention_mask: torch.Tensor, labels=None):
        output = self.pretrained_model(input_ids=token_ids, attention_mask=attention_mask)
        pooled_output = torch.mean(output.last_hidden_state, 1)

        pooled_output = self.hidden(pooled_output)
        pooled_output = self.dropout(pooled_output)
        activated_output = F.relu(pooled_output)

        logits = self.classification(activated_output)

        loss = 0
        if labels is not None:
            loss = self.loss_fn(logits, labels)

        return loss, logits

Forward function

Only this is called (bar logging) in the validation and train steps

#

As much as I'd like to let this run without being able to see what it is doing, the machines cost $7-8/hr 😅 So I'm not big on leaving this running idle overnight if it is just plain stuck

#

absolutely nothing from the logger

#

also 0% cpu usage / idle

#
  • 698 classes
  • 738692 train data points
  • 78570 validation data points
    n_classes=698 total_steps=36 warmup_ratio=0.2 embedding_length=256 learning_rate=1e-05 focal_alpha=1.0 focal_gamma=2.0 k_items=5 adam_beta1=0.9 adam_beta2=0.99 adam_eps=1e-07 adam_weight_decay=0.001 n_epochs=50
#

Hmm

#

I am getting this when force aborting the container

Sat, 25 Nov 2023 22:37:41 GMT thread 'thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-core-1.12.0/src/registry.rs:167:10:
Sat, 25 Nov 2023 22:37:41 GMT The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
Sat, 25 Nov 2023 22:37:41 GMT note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

sadge

#

which I guess means that it is the huggingface tokenizer that is getting stuck

#

we do love deadlocks
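
One commonly suggested workaround for Hugging Face tokenizer hangs in multi-process training is to switch off the tokenizer's internal thread pool before any worker processes start. The sketch below only sets the `TOKENIZERS_PARALLELISM` environment variable, which the `tokenizers` library reads. Whether it resolves this particular hang is an assumption, not something confirmed in the thread.

```python
import os

# Must run before the tokenizer is first used (ideally at the very top of the
# training script), so forked DataLoader/DDP workers never inherit a live
# Rust thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting it in the container's environment instead of in code has the same effect and avoids import-order surprises.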

solar ferry
#

does anyone have any advice for being able to process the data from the game Dwarf Fortress, I am having trouble finding tools that are able to process the data made by the game, I can find tools for processing unreal or unity game code, but i am having issues since Dwarf Fortress is almost entirely written in openGL without a stock game engine, so it runs off a fully homebrewed engine

would i have to make that tool myself?

#

I am looking at creating a neural network that can effectively play Dwarf Fortress

echo mesa
#

Guys, I've looked into the idea of least squares for fitting a line to a given data set. I understand it and know how to use it, but I don't know why it works. How can I understand the idea behind it and why it makes sense? How could someone have come up with this idea? Is it a stupid question to ask?
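
One way to build intuition, sketched here with made-up data: treat "how wrong is this line" as a single number, the sum of squared vertical misses. That number is a smooth bowl-shaped function of the slope and intercept, so it has exactly one lowest point, and calculus gives that point in closed form. Squaring also punishes big misses more than small ones, and if the noise is Gaussian the least-squares line is also the maximum-likelihood line.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # noisy line (made-up data)

def sse(slope, intercept):
    """Sum of squared vertical distances between the line and the data."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# Setting the derivatives of the SSE to zero gives these closed-form solutions.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

best = sse(slope, intercept)
# Nudging the fitted line in any direction can only make the squared error larger.
for d_slope, d_intercept in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    assert sse(slope + d_slope, intercept + d_intercept) > best
```

The closed-form values agree with what `np.polyfit(x, y, 1)` returns, which is a quick way to check the derivation.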

scenic parcel
#

Guys I have invented a way to not bother websites as much:

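from functools import lru_cache
import requests
@lru_cache(maxsize=None)
def requests_saver(url):
    # Cache each response in memory; `headers` is assumed to be defined elsewhere.
    return requests.get(url, headers=headers)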
agile owl
#

I imagine it's super super expensive

torpid quartz
#

How do you guys use Jupyter notebooks? I seem to only use them to look at the data, and then write an actual python module to train the model and stuff

serene scaffold
#

however you write jupyter notebooks, if the result of the notebook needs to be reproducible (like, it needs to train a model and report on its performance), make sure that it works correctly by executing each cell exactly once, in order, with a fresh kernel session

#

if the expected behavior depends on a non-linear sequence of cell executions, it's crap.

echo mesa
left pumice
#

Anybody know if it is possible for GPT4ALL to combine models and/or databases?

#

i saw around 100GB of Datasets and some Models. But i dont know what is what.

#

looks it is the same...

#

YT shows me i need to choose...

vocal dune
#

Hey guys, I'm not understanding something here:

  • I need to fine tune a BERT model for binary classification of sequences.
  • Right now I have this architecture: BERT + torch linear classification layer.
  • In order to fine tune this, I am simply passing the input through the architecture, calculating the loss and backpropagating it.
  • Do I need to do any extra steps? Am I missing something?
ashen latch
quaint crescent
#

Hi I am new to this discord so not 100% sure this is the right place for this question but I made a chart analyzing stock performance over the past month and the way the data is presented seems wrong to me. Any thoughts?

serene scaffold
raven ivy
#

do you guys mind taking a look at my end2end mlops project diagram?
the basic idea is to pull weather data daily automatically and store it in a db, then train a model based on this data to predict various metrics as an endpoint, and also monitor the performance, and say it makes 3 wrong predictions in a row, then an automatic retrain and deploy triggers. The deployment and monitoring part is missing, im not yet sure what to put there.

cold dawn
#

jesus, i have a lot to learn regarding data science in python 😭

torpid quartz
#

Wait, in a keras pipeline, .transform() in all the steps has to take in an ndarray, right?

snow wyvern
#

I've been trying to use the opencv with cuda and keep getting errors, i was told to build it myself and i'm following this tutorial
https://www.youtube.com/watch?v=YsmhKar8oOc

I got to 4:55 and upgraded numpy, but for some reason when I check the "to be built" part of the OpenCV modules it still doesn't list Python and says unavailable for python 2 and 3. I am not sure what info you need, so feel free to ask for more.

Build OpenCV 4.5.1 with CUDA GPU acceleration on Windows 10. In this tutorial, we will build OpenCV from source with CUDA support in Anaconda base environment as well as in a virtual environment. Building OpenCV with CUDA from source allows OpenCV to be used in any programming language. We will focus on Python 3.8 for this tutorial.
umbral charm
#
import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt
def rhs(N, t, lambda_, lambda_1):
    dNdt = lambda_ * -N
    dN2dt = dNdt - lambda_1 * -N
    return [dNdt, dN2dt]
N1 = 40
N2 = 0
t = np.linspace(0, 10, 100)
y = scipy.integrate.odeint(rhs, [N1, N2], t, args=(1/20, 1/15))
#

RuntimeError: The array return by func must be one-dimensional, but got ndim=2.

#

How do i stop this
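
The error comes from how `odeint` calls `rhs`: it passes the whole state vector (both populations) as `N`, so `lambda_ * -N` is already a length-2 array and the returned list ends up two-dimensional. Unpacking the state first avoids that. The sketch below assumes the intended model is a two-step decay chain (species 1 decays into species 2 at rate `lambda_`, species 2 decays at rate `lambda_1`); that reading is a guess and the equations should be adjusted if the physics differs.

```python
import numpy as np
from scipy.integrate import odeint

def rhs(N, t, lambda_, lambda_1):
    N1, N2 = N                             # unpack the state vector first
    dN1dt = -lambda_ * N1                  # species 1 decays away
    dN2dt = lambda_ * N1 - lambda_1 * N2   # fed by species 1, decays itself
    return [dN1dt, dN2dt]                  # one scalar per state variable

t = np.linspace(0, 10, 100)
y = odeint(rhs, [40, 0], t, args=(1 / 20, 1 / 15))

print(y.shape)  # one row per time point, one column per species
```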

lapis sequoia
#

i want a dataset to practice convolutional neural networks and i have already used MNIST. any other datasets you would recommend? I would prefer something larger (in terms of size) than MNIST

left pumice
#

Anybody know if it is possible for GPT4ALL to combine models and/or databases? I would like the AI to choose the set itself, so I don't need to select manually. Thank you.

vale jungle
#

What is the canonical grammar of graphics package for python?

pine wolf
#

i dunno, pillow or cv2

vale jungle
pastel valve
#

What would be the best way to go about finding correlation between numerical and categorical values. Suppose I have 2 columns one with scores ranging from 0-12 and the other column with the names of universities. I was thinking of assigning numerical values to each university and go about that approach but I wanted to know if there's a better way of doing it

odd meteor
lapis sequoia
odd meteor
lapis sequoia
odd meteor
# pastel valve What would be the best way to go about finding correlation between numerical and...

It's statistically wrong to compute the correlation of a categorical and a continuous variable the way you proposed (even when the categorical variable has been encoded to numeric values for an ML task).

So, you can't compute Pearson, Spearman Rank, or Kendall Tau correlation on such features.

Remember, if you call .corr() on those two features it will surely return a correlation coefficient but is it capturing what you intended it to capture effectively? No.

In such a situation, you should use a non-parametric test statistic like the Kruskal-Wallis test. If you want a parametric statistic, the point-biserial correlation is an option, but only when the categorical variable has exactly two categories.
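
A short sketch of the Kruskal-Wallis suggestion using `scipy.stats.kruskal`. The university names and scores are made up for illustration; with real data you would build one list of scores per university, for example from a pandas groupby.

```python
from scipy.stats import kruskal

# Made-up scores (0-12) grouped by a hypothetical university name.
scores_by_university = {
    "University A": [3, 5, 4, 6, 2, 5],
    "University B": [9, 11, 10, 8, 12, 10],
    "University C": [6, 7, 5, 8, 6, 7],
}

# Kruskal-Wallis compares the groups' rank distributions, so the university
# names never need to be turned into numbers.
statistic, p_value = kruskal(*scores_by_university.values())
print(f"H = {statistic:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the score distribution differs between at least two universities. It is a test of association, not a correlation coefficient, so it says nothing about direction.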

odd meteor
# cold dawn jesus, i have a lot to learn regarding data science in python 😭

The thing is, there's always a lot to learn in this field. New things keep popping up every day. It rains every day here; research doesn't really care if you need to catch your breath or take a short break.

I'm still tryna get to LLM now Q-learning is buzzing 😂😂

Once you're done with the basics and ready to jump into Deep Learning, I'd suggest you just focus on what you're interested in and double down on it. You might wanna subscribe to a couple of ML newsletters to stay abreast of new stuff.

You'll be fine that I know for sure. More so, If you start now, you'll be better off than you were yesterday.

You got this 💪💪💪

junior trail
#

hi, requesting some help with running a jupyter notebook cell. getting this error: ModuleNotFoundError: No module named 'pandas'
i ran pip install pandas in the notebook dir and then re-ran jupyter notebook, but when i try and run the cell, i get the same error. what am i doing wrong?
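
A likely cause (an assumption, since the thread never confirms it) is that `pip` installed pandas into a different Python than the one the notebook kernel runs on. The sketch below, run inside a notebook cell, shows which interpreter the kernel uses and builds a pip command bound to that exact interpreter. The helper names are made up for illustration.

```python
import importlib.util
import sys

# The interpreter the notebook kernel is actually running on.
print(sys.executable)

def is_importable(module_name):
    """Return True if this kernel's interpreter can find the module."""
    return importlib.util.find_spec(module_name) is not None

def pip_install_command(package):
    """Build a pip command bound to this exact interpreter."""
    return [sys.executable, "-m", "pip", "install", package]

print(is_importable("pandas"))
print(" ".join(pip_install_command("pandas")))
```

Inside Jupyter, the `%pip install pandas` magic does the same thing in one line: it installs into the environment of the running kernel. Restart the kernel afterwards.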

trail yacht
#

Whats the best udemy course for machine learning using scikit learn library? Also if there is a playlist on YouTube then do let me know

tidal bough
quaint crescent
quaint crescent
buoyant vine
#

Anyone got any idea what might be causing this sharp increase in F1 score (I typo'd the label, it isn't accuracy, it is F1) after the first epoch?

#

you can see it is fairly stable until the 2nd epoch where it begins rapidly increasing the score

#

I have a LR scheduler but I dont think that is it

#

the only thing I can think of is it is suddenly overfitting

#

but idk, I haven't seen a model overfit that aggressively

#

Just some details:

  • BERT base model with fine tuning layer
  • AdamW optimizer with get_cosine_schedule_with_warmup scheduler
mild dirge
#

What does the validation f1 look like? @buoyant vine

buoyant vine
#

basically the same as the train

mild dirge
#

Can't say much about overfitting without a measure on a separate dataset

#

Well if it is the same, it wouldn't be overfitting, because it apparently generalizes well to other data as well

#

Maybe it got stuck on a plateau for a bit

buoyant vine
#

I suppose, but it is supposedly getting 99% 😅 So I am skeptical

#

the data should be well defined though, so idrk

mild dirge
#

Maybe it got stuck in a local maximum, and because of the sudden increase in LR it managed to get out

buoyant vine
#

It is classifying website metadata (Title, Snippet/og_description)

#

so to some extent, the datasets are very similar no matter what

#

I suppose it might just be for the val side of things, it got a large subset of the data which had the categories it does really well in

#

ig I can remove the scheduler and see if it is getting stuck initially

olive pecan
#

what does the number in the middle signify? how can I choose it? when trying to remove outliers?

mild dirge
#

The standard deviation is a measure of spread of your data. You want to remove data that is very far removed from the mean relative to other data. The average "distance" to the mean is signified by the standard deviation, so we remove anything that is a few standard deviations away from the mean. @olive pecan

#

In your case any data beyond 2 standard deviations from the mean is removed

#

If you increase this value, then you accept samples that are further away, thus being less strict in filtering. Decreasing means you only accept values that are quite close to the mean, thus filtering away more samples.
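
The rule described above can be sketched in a few lines of NumPy. The function name `remove_outliers` and the sample numbers are made up for illustration; `n_std` plays the role of "the number in the middle".

```python
import numpy as np

def remove_outliers(values, n_std=2.0):
    """Keep only values within n_std standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    keep = np.abs(values - mean) <= n_std * std
    return values[keep]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

strict = remove_outliers(data, n_std=2.0)  # smaller threshold, more filtering
loose = remove_outliers(data, n_std=3.0)   # larger threshold, less filtering
print(len(strict), len(loose))
```

On this sample the value 50 is dropped at 2 standard deviations but survives at 3, because the outlier itself inflates the standard deviation. That side effect is worth keeping in mind when choosing the threshold.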

agile owl
#

now we are pod racing

#

64 cores of EA

#

chef's kiss

olive pecan
#

Thank you WcCamel

left tartan
agile owl
#

ahah the prints aren't correct anymore

#

it's only the last train period now

#

so that just means it didn't trade in the last period

#

I think I'm going to try to figure out a way to do early stopping with EA next because looking at these charts doesn't really mean anything to me without out-of-sample data to compare to

agile owl
#

no

past meteor
#

Interesting that your best solution never dropped over iterations

agile owl
#

tournament of 4 blendcrossover and polynomial bounded mutation

#

sometimes it did in the past

past meteor
#

What did you change?

agile owl
#

I implemented the fitness as the median of the fitnesses over different iterations of the training data after splitting it up

#

so there's an inner loop

#

that goes over each individual training segment

#

and produces a fitness, then the actual fitness is the median of those
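
A rough sketch of that inner loop, assuming the training data can be cut into contiguous segments. `toy_segment_fitness` is a made-up stand-in for whatever backtest scores one parameter set on one segment.

```python
from statistics import median

def split_into_segments(series, n_segments):
    """Cut a sequence into n_segments contiguous, roughly equal chunks."""
    size, remainder = divmod(len(series), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < remainder else 0)
        segments.append(series[start:end])
        start = end
    return segments

def median_fitness(params, series, n_segments, segment_fitness):
    """Score params on every training segment and return the median score."""
    segments = split_into_segments(series, n_segments)
    return median(segment_fitness(params, segment) for segment in segments)

# Toy stand-in for a real backtest: reward params close to the segment mean.
def toy_segment_fitness(params, segment):
    return -abs(params[0] - sum(segment) / len(segment))

series = [1, 2, 3, 4, 5, 6, 7, 8, 100]
score = median_fitness([5.0], series, 3, toy_segment_fitness)
print(score)
```

Using the median means one unusually good or bad segment (here the one containing 100) cannot dominate the score, which fits the stated goal of avoiding overfitting to a single period.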

past meteor
#

What are you searching?

agile owl
#

what are the parameters you mean?

past meteor
#

Yeah, what values are you searching over

agile owl
#

I have a hack that lets the ints be blended because it rounds them when the class actually gets instantiated from the params

past meteor
#

I don't know what that sentence means

agile owl
#

so normally you wouldn't be able to use blend crossover on ints

#

because it will produce floats

#

so to take care of that I just round the floats that come out of the algorithm when it's actually creating the individual to have its fitness calculated
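
The rounding trick can be sketched as follows. This is a hand-written blend crossover in the BLX style, not DEAP's own implementation, and the genome layout is hypothetical.

```python
import random

def blend_crossover(parent_a, parent_b, alpha, rng):
    """Each child gene is a random mix of the two parents' genes."""
    child_a, child_b = [], []
    for a, b in zip(parent_a, parent_b):
        # gamma ranges over [-alpha, 1 + alpha], so children can land
        # slightly outside the interval spanned by the parents.
        gamma = (1 + 2 * alpha) * rng.random() - alpha
        child_a.append((1 - gamma) * a + gamma * b)
        child_b.append(gamma * a + (1 - gamma) * b)
    return child_a, child_b

def decode(genes, int_positions):
    """Round the genes that must be integers just before building the individual."""
    return [int(round(g)) if i in int_positions else g for i, g in enumerate(genes)]

rng = random.Random(0)
# Hypothetical genome: [window_length (int), threshold (float)]
child_a, child_b = blend_crossover([20, 0.5], [100, 1.5], alpha=0.5, rng=rng)
params = decode(child_a, int_positions={0})
print(params)
```

The genome stays fully float inside the evolutionary loop and rounding happens only at evaluation time. Because blended genes can overshoot the parents' range, clipping to valid bounds in `decode` may also be needed.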

past meteor
#

There's cross-overs and mutations that can deal with a mix of floats and ints

#

But I actually don't know blend crossover to be honest. My problems were fully int based or fully float based (and then I just use scipy's differential evolution)

agile owl
#

I was playing around with different tools from the DEAP toolkit

#

It's basically just like averaging the two

#

but if I had to guess why the fitness is monotonically increasing it's probably because of how I'm calculating it as a median across iterations

past meteor
#

It's a weird way to express fitness

agile owl
#

it's the median of the fitnesses across different time periods

past meteor
#

Express it normally because that'll tell you if you need elitism or not

agile owl
#

well I did it to avoid overfitting

#

and it seemed to work for that

#

at least in this one case

#

it produced a better test result

#

what is elitism?

past meteor
#

If I were you I'd do several train-test splits

#

3 or more

agile owl
#

can you elaborate on that a bit

#

you mean as part of the cross validation

#

the way I'm doing it now or something else

past meteor
#

Lets you increase the randomness of your algorithm without destroying the population

agile owl
#

you mean from a previous run?

#

I was doing that manually, saving the params and seeding them as an individual into the starting population of a new run

#

idk if it's part of the deap framework

past meteor
#

So you evaluate fitness at T1, you do crossover, mutation and selection. After selection you inject the best N / N% at T1 before heading to T2
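
A minimal, framework-free sketch of that loop. The toy fitness function and the mutation operator are assumptions for illustration; in practice `vary` would be the crossover plus mutation step of the EA.

```python
import random

def next_generation(population, fitness, n_elite, vary, rng):
    """One generation: vary, select, then re-inject the previous best n_elite."""
    elites = sorted(population, key=fitness, reverse=True)[:n_elite]
    offspring = [vary(individual, rng) for individual in population]
    n_survivors = len(population) - n_elite
    survivors = sorted(offspring, key=fitness, reverse=True)[:n_survivors]
    return elites + survivors

# Toy problem (an assumption for illustration): maximise -(x - 3)^2.
def fitness(x):
    return -(x - 3.0) ** 2

def mutate(x, rng):
    return x + rng.gauss(0.0, 2.0)  # deliberately noisy variation

rng = random.Random(42)
population = [rng.uniform(-10.0, 10.0) for _ in range(20)]
best_per_generation = []
for _ in range(30):
    population = next_generation(population, fitness, n_elite=2, vary=mutate, rng=rng)
    best_per_generation.append(max(fitness(x) for x in population))
```

Because the elites are copied over unchanged, the best fitness can never drop from one generation to the next, even with a very aggressive mutation rate.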

agile owl
#

I see

past meteor
# agile owl can you elaborate on that a bit

For instance for my research I do this:

Remove 15 % of the subjects into a test set.

On the remaining 85 % I do a test-train split.

30 % of the data is put in a "freezer" and not touched until way later.

70 % is left and with this I test train split again, cross validate on train and evaluate on test.

When I'm happy of the models on the 70 % I select N models and evaluate on the 30 %. When I'm happy with M < N models on the 30 % I evaluate these on the 15 % totally held out subjects, no tweaking whatsoever and those are the numbers.
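
The three tiers can be sketched like this. The subject IDs are hypothetical, and splitting the freezer by subject is an assumption, since the description only says "30 % of the data".

```python
import random

def split(items, fraction, rng):
    """Shuffle a copy of items and cut off `fraction` of them."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)
subjects = [f"subject_{i:03d}" for i in range(200)]  # hypothetical subject IDs

# Tier 3: fully held-out subjects, evaluated once at the very end.
test_subjects, remaining = split(subjects, 0.15, rng)
# Tier 2: the "freezer", opened only to choose among N candidate models.
freezer, development = split(remaining, 0.30, rng)
# Tier 1: `development` is where repeated train/test splits and
# cross-validation happen while tweaking.

print(len(test_subjects), len(freezer), len(development))
```

For time series, the random shuffle inside `split` would be replaced by whatever domain-specific split avoids leakage.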

agile owl
#

thanks I'll see if I can use that methodology

#

part of the issue though is they are also time series

past meteor
#

Mine too

agile owl
#

so they are one-way information

past meteor
#

I sat down, used my 🧠 and wrote a train-test split that is not completely random but makes sense for my domain

agile owl
#

gotcha

past meteor
#

It doesn't need to be "this" specifically but you should really be worried about "adaptive overfitting"

agile owl
#

yea

past meteor
#

The more you test and tweak your algorithm the more you'll tweak your hyperhyperparameters (?) to produce hyperparams that produce good models

#

hyperhyperparameters being the params of your EA lol

agile owl
#

i mean what's the difference between that and actually understanding the problem, just that it doesn't generalize forward?

#

and if so, how can you ever know for sure?

past meteor
#

If you split too much the numbers become overly biased on a small sample, if you don't split enough and you reuse too much you adaptively overfit

#

This part is a bit art not science lol

agile owl
#

definitely worth it to get this part right early

#

before I get too far with this

#

the testing methodology is probably the most important part of anything

past meteor
#

Yeah, I think @left tartan can help here if they're up for it because they know a ton about financial data etc

agile owl
#

I haven't seen many very compelling testing methodologies on time series fitting

#

i was under the impression that most people just leave out the last x%

past meteor
#

idk if you have multiple stocks or so?

#

Keep a few stocks completely out

agile owl
#

for now just one

#

oh no this is a special thing

#

this is futures trading

#

it's actually the ten year treasury rate

#

I could do it on other ten year government bond rates

#

but besides that it's really in a class of its own

past meteor
#

If you create lagged features you can split randomly afterwards

#

But doing so may or may not make sense for your application

agile owl
#

mmm I am doing moving averages of long range

#

and other time series attributes of long range

#

like the hurst exponent

#

so wouldn't be practical

#

if it were just of order one, then it would make sense

past meteor
#

Nah this works for > order one

agile owl
#

if it's just lagged variables sure

#

but I'm calculating moving averages over 100s of observations

past meteor
#

It's what I do, I have code to create arbitrary lag + horizon combos

#

And then I drop all the ones where the lag columns have nulls

#

And then I express my moving averages in function of the lagged features

#

Afterwards a random split is 100 % legal

agile owl
#

not sure I understand that last bit entirely

#

so are you just creating a very wide dataframe

#

because you have hundreds of lags

#

in my example

past meteor
#

And then making a narrow df from the wide one

agile owl
#

how is that legal

past meteor
#

totally legal 😄

#

Say you have a sample every 5 minutes (that's the case for my application)

agile owl
#

you're telling me if you include all the lags and melt it then you can treat that like you can shuffle it?

#

like what if something is inherently cyclical

#

the point in front and the point in back always have to be connected to the current point to model the cyclicality accurately no?

past meteor
#

You say this:

6 lags + a horizon of 6, which means you're predicting 30 mins ahead of time. Basically you're predicting at t_30 using t_[0; -5; -10; -15; -20; -25; -30], right?

As soon as you make this dataframe you will encounter some null values. As soon as you drop these and compute your lag-based features using the resulting dataframe, you'll never use the future to predict the past. This assumes your data is trend stationary though.
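
A small pandas sketch of the lag-plus-horizon frame described above. The function name and the synthetic 5-minute series are made up for illustration.

```python
import numpy as np
import pandas as pd

def make_supervised_frame(series, n_lags, horizon):
    """Turn a series into rows of (current value, lags, future target)."""
    frame = pd.DataFrame({"t0": series})
    for lag in range(1, n_lags + 1):
        frame[f"lag_{lag}"] = series.shift(lag)   # value `lag` steps in the past
    frame["target"] = series.shift(-horizon)      # value `horizon` steps ahead
    # Rows at the start (missing lags) and at the end (missing target) hold NaN.
    return frame.dropna()

# Made-up 5-minute series standing in for real data.
index = pd.date_range("2023-01-01", periods=100, freq="5min")
series = pd.Series(np.arange(100, dtype=float), index=index)

frame = make_supervised_frame(series, n_lags=6, horizon=6)
print(frame.shape)
```

Each surviving row carries its own past and its own target, so a moving average computed from the lag columns of a row uses only that row's history. Whether shuffling those rows is safe still depends on the stationarity assumption stated above.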

agile owl
#

strong assumption with real life data

#

in this realm

#

still not sure I completely understand though I'm reading what you said very carefully

agile owl
past meteor
#

The point is to avoid leakage, as per usual

#

If splitting randomly doesn't leak, you can

#

The major assumption is a consistent relationship between the dependent and independent variables ofc

#

If that is not the case you'll leak. I think this is broken in your case indeed 🤔

agile owl
#

yeah the whole reason I'm splitting multiple times to train is there isn't a consistent relationship

#

the more training periods the better but they need to be sufficiently long to learn from

#

I'd like to have them be of random sizes too but not sure how to do that in practice because of how I need to read my data into one of the libraries

past meteor
#

I guess what I'm trying to say is, if you know what you're doing idt you should be averse to rolling your own evaluation strategies (if you're sure they're water tight)

#

ML is, imo, the business of evaluating models and not the business of making models 😄

agile owl
#

my problem is always that I want to train on recent data in case the relationship has changed or is changing

#

not really a problem but my frustration with the process

past meteor
#

You can make an intricate ensemble using a model that uses exclusively lags and a model incorporating more information

agile owl
#

one of the things I've looked into is segmenting time series with state models to try to capture things like cyclicality then using all of the same points from the same state as if you can shuffle them

#

hmm state models

past meteor
#

Say model A (pure lagged model, can be just exponential smoothing) has an error of N that is more or less consistent over time.
Model B (incorporates exogenous information) has an error of M that varies over time.

if N < M you have an indication of a changed pattern over a sufficiently long time period t

#

This is one basic way you can react to changes in the "real world" when the model is deployed or so
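
A toy sketch of that comparison in plain Python. Model A is simple exponential smoothing, model B is represented by a fixed list of predictions that were learned on the old regime, and the window length is an arbitrary choice. All data here is made up.

```python
def exponential_smoothing_forecasts(values, alpha):
    """One-step-ahead forecasts: forecasts[i] predicts values[i] from earlier points."""
    forecasts = [values[0]]
    for value in values[:-1]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts

def rolling_mae(actual, predicted, window):
    """Mean absolute error over a sliding window of the most recent points."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return [sum(errors[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(errors))]

def drift_flags(baseline_mae, model_mae):
    """True wherever the lag-only baseline A beats the richer model B (N < M)."""
    return [n < m for n, m in zip(baseline_mae, model_mae)]

actual = [10.0] * 20 + [20.0] * 20   # the level shifts halfway through
model_b_predictions = [10.0] * 40    # model B still predicts the old level

model_a_predictions = exponential_smoothing_forecasts(actual, alpha=0.5)
flags = drift_flags(
    rolling_mae(actual, model_a_predictions, window=5),
    rolling_mae(actual, model_b_predictions, window=5),
)
print(flags.index(True))  # first window where the baseline wins
```

The flag only turns on after the shift has already happened, which illustrates the limitation raised in the replies: the signal is only clear in retrospect.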

agile owl
#

makes sense

#

as with anything though it can only be clear in retrospect

#

and what if it's actually just a cyclical thing and the relationship is about to mean revert

#

so you switch out your model

#

but it was just a temporary disruption

#

and the beta is actually stationary over time

past meteor
#

Yeah, this was what my thesis was about

#

Fundamentally unsolveable because you only know in retrospect

agile owl
#

what about incrementally recurring concepts

#

that are one directional graphs

#

like up and back down

#

but once it starts going down it doesnt go back up until it bottoms out

#

that's kind of what I'm thinking of

past meteor
#

idk I'd just read the paper 🙂

#

It'll give you good inspiration

agile owl
#

yeah I will thanks

left tartan
agile owl
#

I mean doing the whole fitting a new model because that happens thing

past meteor
#

Aha, I think what I mentioned with the ensemble is a form of risk management 😄 I don't know how finance deals with this typically though.

agile owl
#

I have a fundamental model and a trading model

#

where the fundamental model for the rate is an input to the trading model

#

along with a volatility model for the variance

#

the fundamental model uses variational inference hmm states

#

and is order 1

#

it's slower moving

tacit basin
#

What open source embedding models are best for RAG systems? Do I need large context like 8k?

agile owl
#

or paying for convexity

#

(that's what options are)

#

also things like stopouts, etc.

past meteor
#

Finance is not my cup of tea in all honesty 🙂 loads of domain specific rules and knowledge I don't know

agile owl
#

i mean time series is time series

#

the actions you take as a result of some signal can be abstracted away it's all just producing signals from time series data at the end of the day

sleek harbor
#

is there a way to display stuff like dataframes, plots.. side by side in a jupyter notebook? And I don't mean subplots or any of that stuff, I just mean like put a plot next to a dataframe. Would be cool if the display() function accepted a list and tried to put things in a row, if there's space, but no, doesn't work. All I get when I google are ways to mess with the underlying html of a notebook, but.. kinda hacky. Is there a "normal" way of doing this?

left tartan
dense crane
#

i have a model which can generate 1s of audio. how do i generate longer samples, like 10-30s? it was trained as a GAN on .wav files and takes a noise vector of size 100 to generate the 1s audio sample, because it was trained on audio of that length. can i increase the sample length with that trained model, or do i have to train it again on longer samples?

limpid parrot
#

Hello all, I'm currently building a regression model using ML techniques but can't seem to get the MSE down. I need to predict 2 variables from another 3 input variables. Does anyone have any ideas? Here is my code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

data = pd.read_csv("data/TA_LGs_combined.csv")

X = data[["dist[kpc]","vrad[km/s]","vtan[km/s]"]]
y = data[["mass1[Msun]","mass2[Msun]"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)

model = GaussianProcessRegressor(kernel=kernel, random_state=42)

model.fit(X_train, y_train)

y_pred, _ = model.predict(X_test, return_std=True)

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')

plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()

plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()

mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])

print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')

P.S. my dataset is quite small (about 600 rows)

#

current MSE is about 1x10^24

agile owl
#

standardize everything and what is the MSE?

limpid parrot
#

mean squared error

#

What do you mean by standardize?

agile owl
#

no I was saying if he standardized it

#

demean and divide by stdev

#

subtract average and std before fitting model

#

divide by std*

#

just giving 1x10^24 doesn't mean much if you are working with large masses

#

if you standardize it first then it's clearer

#

it also might help the model
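
A minimal sketch of what "standardize" means here, using made-up numbers rather than the actual dataset: subtract each column's mean, divide by its standard deviation, and keep both so the transform can be undone later. The helper names `standardize` and `unstandardize` are hypothetical, not from any library.

```python
import numpy as np

def standardize(values):
    """Return (scaled, mean, std) with per-column zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0, ddof=1)  # sample std, matching pandas' default
    return (values - mean) / std, mean, std

def unstandardize(scaled, mean, std):
    """Invert standardize(): map scaled values back to original units."""
    return np.asarray(scaled) * std + mean

# Made-up masses on very different scales (illustrative only)
masses = np.array([[1.0e12, 2.0e11],
                   [3.0e12, 5.0e11],
                   [2.0e12, 8.0e11]])
scaled, mean, std = standardize(masses)
restored = unstandardize(scaled, mean, std)
```

Keeping `mean` and `std` around is what later makes it possible to report errors in the original units.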

limpid parrot
#

could you elaborate?

agile owl
#

depending on the model the scales of different variables being different might skew the results

#

whereas if you make everything have unit variance first

limpid parrot
#

So should I normalise the inputs?

agile owl
#

they are not skewed in the same way

#

try that

limpid parrot
#

Btw current MSE is at 1.0013291012933846e+24

agile owl
#

not saying it will fix everything

limpid parrot
#

I'll try it now

limpid parrot
#

It seems to have increased the MSE

agile owl
#

did you scale the test data too

#

I find it hard to believe that you could get to an e24 mse on normalized data

limpid parrot
#

This is what I did:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("data/TA_LGs_combined.csv")

X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler(with_mean=True, with_std=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)

model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train_scaled, y_train)

y_pred, _ = model.predict(X_test_scaled, return_std=True)

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')

plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()

plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()

mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])

print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
agile owl
#

just scale the y variable too, also not sure why with_std is set to False

limpid parrot
#

I'll set it to true

agile owl
#

if it's just those 5 variables I would do data = (data-data.mean())/data.std() at the start

#

and save the means and stds for later use

#

so you can rescale the output

limpid parrot
#

wow

#

i just did the data thing and the mse shot down to 0.7

agile owl
#

yea

#

then rescale the outputs and calculate the MSE again and compare it to the original

#

the 1e24 one

#

that's how you know if it helped or not

limpid parrot
#

but if it's saying the mse is 0.7 isn't it already better?

agile owl
#

no because you changed everything to be normalized

#

you aren't comparing apples to apples

#

that's why you need to save the means and stds

#

so you can invert the normalisation on the model outputs

#

to get back to the original problem space

#

in principle what you did should give you 1e24 or better when you rescale it to the original problem

#

it really just depends on the stdev of the y variables you are predicting

#

whether that 0.7 is better than the 1e24 or not

#

i'd be surprised if it isn't tbh
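
The rescaling argument can be written out as a worked equation. This is a sketch for a single output column with saved mean $\mu$ and standard deviation $\sigma$; with two outputs, each column has its own $\sigma$.

```latex
\[
y' = \frac{y - \mu}{\sigma}, \qquad \hat{y} = \sigma\,\hat{y}' + \mu
\]
\[
\mathrm{MSE}_{\text{orig}}
  = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
  = \frac{\sigma^2}{n}\sum_{i=1}^{n} \left(y'_i - \hat{y}'_i\right)^2
  = \sigma^2 \,\mathrm{MSE}_{\text{scaled}}
\]
```

So a standardized MSE of 0.7 corresponds to $0.7\,\sigma^2$ in original units, and whether that beats 1e24 depends on $\sigma$. As a rough yardstick, always predicting the mean would give a standardized MSE of about 1.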

quaint loom
limpid parrot
arctic wedgeBOT
#
Microsoft Visual C++ Build Tools

When you install a library through pip on Windows, sometimes you may encounter this error:

error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

This means the library you're installing has code written in other languages and needs additional tools to install. To install these tools, follow the following steps: (Requires 6GB+ disk space)

1. Open https://visualstudio.microsoft.com/visual-cpp-build-tools/.
2. Click Download Build Tools >. A file named vs_BuildTools or vs_BuildTools.exe should start downloading. If no downloads start after a few seconds, click click here to retry.
3. Run the downloaded file. Click Continue to proceed.
4. Choose C++ build tools and press Install. You may need a reboot after the installation.
5. Try installing the library via pip again.

quaint loom
# serene scaffold !build

I have already downloaded Visual Studio C++ 14 or greater. But people have had the same issue with pillow and AutoGPTQ and they have downgraded the package. Maybe I could try to downgrade the hdmedians package

limpid parrot
agile owl
#

yes

#

then compare results of that method vs not normalizing at all

#

vs using the normalized MSE

#

you can probably make some shortcuts in doing this comparison if you have the stdevs bound to a name

mild dirge
#

Unless the original data was in the 1e10+ I think it probably did help though

agile owl
#

I have to admit my ignorance not sure how MSE is calculated in this case where there's two outputs

#

and what it means vs the univariate case
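
On the two-output question: scikit-learn's `mean_squared_error` defaults to `multioutput='uniform_average'`, which computes the MSE for each output column and then averages those. A small sketch of the same calculation in plain NumPy, with made-up numbers:

```python
import numpy as np

y_true = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
y_pred = np.array([[1.0, 12.0],
                   [2.0, 18.0],
                   [5.0, 30.0]])

# One MSE per output column, then a plain average across columns
per_output = ((y_true - y_pred) ** 2).mean(axis=0)
overall = per_output.mean()
```

If the two targets sit on very different scales, the larger one dominates `overall`, which is another reason to look at the per-column values.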

limpid parrot
#

I did this but it's not working:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("data/TA_LGs_combined.csv")
data = (data-data.mean())/data.std()

X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler(with_mean=True, with_std=True)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

means = scaler.mean_
stds = scaler.scale_

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)

model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train_scaled, y_train)

y_pred_scaled, _ = model.predict(X_test_scaled, return_std=True)

y_pred = y_pred_scaled * stds + means

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')

plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()

plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()

mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])

print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')

#

It gives:
Traceback (most recent call last):
File "/Volumes/Isaac's External Drive/7d-emulator/model_ml.py", line 31, in <module>
y_pred = y_pred_scaled * stds + means
~~~~~~~~~~^~
ValueError: operands could not be broadcast together with shapes (116,2) (3,)

#

@agile owl

agile owl
#

data = pd.read_csv("data/TA_LGs_combined.csv")
data = (data-data.mean())/data.std()

#

means = data.mean()
stds = data.std()

#

y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]

#

you can remove the StandardScaler

#

you don't need it anymore

#

think that's just confusing things

#

one second I'll write it out

limpid parrot
#

File "/Volumes/Isaac's External Drive/7d-emulator/model_ml.py", line 27, in <module>
y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]
^~~~~~~~~~~~~~~~~~~~~~~~
...
File "/opt/homebrew/lib/python3.11/site-packages/pandas/core/common.py", line 561, in require_length_match
raise ValueError(
ValueError: Length of values (116) does not match length of index (2)

limpid parrot
agile owl
#

means = data.mean()
stds = data.std()
data = (data-means)/stds

is what I meant

#

what is the shape of y_pred_scaled

#

nx2 right?

limpid parrot
#

right now: (116, 2)

#

I'll add that alteration to the code

agile owl
#

you might have to transpose the stds

#

and the means

#

so they are 1x2

#

instead of 2,

limpid parrot
#

nx2 you mean

agile owl
#

well no there will only be one for the means and stds

limpid parrot
#

nothing i got confused

agile owl
#

you want them to be 1x2 so you can broadcast them onto the 116x2 array

limpid parrot
#

i'll transpose them

agile owl
#

instead of 2x1

#

or 2xnone

#

which is what I suspect they are , one dimensional
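
A small sketch of the shapes involved, using stand-in arrays rather than the real data. In NumPy a `(2,)` array already broadcasts against `(116, 2)`, and transposing a 1-D array does nothing, so the transpose may not be what is needed. The error in the traceback above comes from mixing a NumPy array with a pandas Series, so converting the Series first (for example with `.to_numpy()`) is one likely way around it.

```python
import numpy as np

y_pred_scaled = np.zeros((116, 2))   # stand-in for the model output
stds = np.array([2.0, 3.0])          # shape (2,), one value per output column
means = np.array([10.0, 20.0])       # shape (2,)

# (116, 2) * (2,) broadcasts across rows; no transpose required
y_pred = y_pred_scaled * stds + means

# Transposing a 1-D array leaves its shape unchanged
transposed_shape = stds.T.shape
```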

limpid parrot
#

Traceback (most recent call last):
File "/Volumes/Isaac's External Drive/7d-emulator/model_ml.py", line 33, in <module>
y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]
~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'method' object is not subscriptable

#

This is what I have now:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("data/TA_LGs_combined.csv")
means = data.mean()
stds = data.std()
data = (data-means)/stds

X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)

model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train, y_train)

y_pred_scaled, _ = model.predict(X_test, return_std=True)
print(y_pred_scaled.shape)

means = means.transpose
stds = stds.transpose


y_pred = y_pred_scaled * stds[["mass1[Msun]", "mass2[Msun]"]] + means[["mass1[Msun]", "mass2[Msun]"]]

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')

plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()

plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()

mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])

print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')
versed gulch
#

Hi I have a simple problem but can't seem to find the numpy function, I want to have a condition where the numpy array contains any values in another fixed numpy array. i.e. if array 1 is [1, 2, 256] are there any values that in it that are the same as array 2 [32, 256, 256] which is the fixed array?

Any help would be appreciated
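
One way this could be done, sketched with the arrays from the question: `np.isin` marks which elements of the first array appear in the second, and `.any()` collapses that to a single condition. `np.intersect1d` returns the shared values themselves. The variable names are just for illustration.

```python
import numpy as np

candidate = np.array([1, 2, 256])
fixed = np.array([32, 256, 256])

# True if at least one element of candidate also appears in fixed
has_overlap = bool(np.isin(candidate, fixed).any())

# The sorted, unique values common to both arrays
shared = np.intersect1d(candidate, fixed)
```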

limpid parrot
#

I tried this and it still doesn't work

limpid parrot
# agile owl you need to call transpose

Okay I tried this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv("data/TA_LGs_combined.csv")
means = data.mean()
stds = data.std()


X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]
Ymeans = y.mean()
Ystds = y.std()
data = (data-means)/stds
X = data[["dist[kpc]", "vrad[km/s]", "vtan[km/s]"]]
y = data[["mass1[Msun]", "mass2[Msun]"]]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e6)

model = GaussianProcessRegressor(kernel=kernel, random_state=42)
model.fit(X_train, y_train)

y_pred_scaled, _ = model.predict(X_test, return_std=True)


print(y_pred_scaled.shape)
print(np.shape(Ymeans))
print(np.shape(Ystds))

y_pred = y_pred_scaled * Ystds.values + Ymeans.values
print(np.shape(y_pred))

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error on Test Data: {mse}')

plt.scatter(X_test['dist[kpc]'], y_test['mass1[Msun]'], color='black', label='Actual mass1')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 0], color='red', label='Predicted mass1')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass1 (Msun)')
plt.legend()
plt.show()

plt.scatter(X_test['dist[kpc]'], y_test['mass2[Msun]'], color='black', label='Actual mass2')
plt.scatter(X_test['dist[kpc]'], y_pred[:, 1], color='red', label='Predicted mass2')
plt.xlabel('Distance (kpc)')
plt.ylabel('Mass2 (Msun)')
plt.legend()
plt.show()

mse_mass1 = mean_squared_error(y_test['mass1[Msun]'], y_pred[:, 0])
mse_mass2 = mean_squared_error(y_test['mass2[Msun]'], y_pred[:, 1])

print(f'MSE for mass1: {mse_mass1}')
print(f'MSE for mass2: {mse_mass2}')

The mse is now 3x10^24

#

please tell me i did something wrong 😅
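
One possible cause, offered as a guess since the data isn't available: in the script above `y_test` comes from the standardized `data`, while `y_pred` has been mapped back to original units, so `mean_squared_error(y_test, y_pred)` would be comparing standardized targets against predictions in solar masses. A sketch with made-up numbers showing how much that mismatch can inflate the error:

```python
import numpy as np

rng = np.random.default_rng(0)
y_raw = rng.normal(loc=1e12, scale=5e11, size=(100, 2))   # made-up masses
y_mean, y_std = y_raw.mean(axis=0), y_raw.std(axis=0, ddof=1)
y_scaled = (y_raw - y_mean) / y_std

# Pretend model output in standardized space: truth plus small noise
pred_scaled = y_scaled + rng.normal(scale=0.1, size=y_scaled.shape)
pred_raw = pred_scaled * y_std + y_mean

def mse(a, b):
    """Mean squared error over all elements."""
    return float(((a - b) ** 2).mean())

mse_scaled = mse(y_scaled, pred_scaled)   # both standardized
mse_raw = mse(y_raw, pred_raw)            # both in original units
mse_mixed = mse(y_scaled, pred_raw)       # mismatched: scaled truth vs raw prediction
```

If that is what is happening, un-scaling `y_test` the same way as `y_pred` before computing the MSE would make the comparison apples to apples.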

agile owl
#

Ymeans is already 0 and ystds is already 1

limpid parrot
#

what do you mean?

agile owl
#

ah nvm

#

you also need to scale X again

#

I didnt see you took it out at the beginning

limpid parrot
#

i'll do the same for x

agile owl
#

actually you don't need to unscale X and it looks like you did already

#

I'm a little surprised the MSE went up tho

#

also why are you using a GaussianProcessRegressor

limpid parrot
#

What are my other options?

#

I had tried the randomforestregressor

agile owl
#

thought that process regression was for things where the x variable is time

limpid parrot
#

im seeing if i can use svms

agile owl
#

but might be wrong on that

limpid parrot
#

i have another project with gps

#

they're best for time series data

agile owl
#

maybe try RF with the normalization?

limpid parrot
#

i tried it

#

it's worse

agile owl
#

I didn't know normalization could make models substantially worse but I guess it can

#

in my experience it usually improves things a lot, or it's within a very small error of the original one way or the other

limpid parrot
#

could it be my code is flawed?

agile owl
#

I can't look at it anymore without running it myself

#

if you give me the data i would take a look at some things

limpid parrot
#

i can't do that unfortunately

agile owl
#

it just gets hard to spot things at a certain point without actually running the code

#

maybe someone else can see something I missed

limpid parrot
#

thanks for the help anyways

agile owl
#

yeah I have ADHD I can't keep reading code without running it xD

#

I would have been awful at programming with punchcards

limpid parrot
#

keep it up bro

#

bro

#

i tried svms

#

they're awesome

agile owl
#

yeah they are good with low sample size

limpid parrot
#

mse for mass1 0.

#

for mass2 1.

#

MSE for mass1: 0.49354012715694157
MSE for mass2: 1.2389531306923065

#

are those good values?

#

they should be

agile owl
#

it depends on the problem

#

i don't know enough about this to know

limpid parrot
#

thanks anyway my friend

#

wish you all the best in life and your career

left tartan
quaint loom
#

What software do you guys use for python when running data? I am tired of Jupyter notebook…..

agile cobalt
#

personally I use VSCode, though Jupyter is not that bad, specially when you want to make graphs often. Definitely not fit for all things though.
PyCharm is also somewhat popular

limpid parrot
#

vscode and vim

quaint loom
#

Thanks guys. I am facing some issues downloading certain plugins using jupyter now for a long time without figuring out the solution! I guess it’s time to try something else. I wanted to try PyCharm. I will have a look at it tomorrow

placid cedar
#

is this overfitting or underfitting?

mild dirge
#

What do you think?

#

The model seems to perform slightly worse on new data than on the data that is used for training

#

So is it overfitted/underfitted on the data?

placid cedar
#

i think underfitted?

#

my lecturers taught me that when the validation loss starts to go up

#

thats when its overfitting

#

currently i think my model is underfitting based on this diagram

mild dirge
#

I guess from the looks of it, it isn't going up yet, so it seems to be a decent fit right now

#

Though on the right it does seem to go up slightly, but it's too inconsistent to tell

placid cedar
#

should i increase the learning rate and the epochs to see how it goes in this case?

mild dirge
#

Testing different learning rates can always be a good idea

#

In this case it's not overfitting yet, but it might be underfitting because of model complexity

#

Because even the training accuracy stops at 80%

#

But not underfitting because it isn't trained for long enough

placid cedar
#

oh actually i ran it on 175 epochs before

#

125*

buoyant vine
#

If I have a high number of classes/labels for multi label classification, should I reduce my dropout ratio? ATM it is at 0.5 or 0.2 but I have nearly 700 labels so 0.5 seems quite high

agile owl
# placid cedar

looks like it's been spinning its wheels since 50 epochs tbh

umbral charm
#

\jru

#

Hey, I'm currently doing numerical integrations and differential equations using scipy.integrate.odeint

#
import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt

def model(y, t, b, c):
    S, I, R = y  # unpack the state vector in the same order as y0
    dSdt = -b * I * S  # rhs of dS/dt
    dIdt = b * I * S - c * I  # rhs of dI/dt
    dRdt = c * I  # rhs of dR/dt
    print(S, I, R)
    return [dSdt, dIdt, dRdt]  # important that they are this way round!
# Parameters
b = 0.002
c = 0.5
# Initial conditions
y0 = [999, 1, 0]
# Time points
t = np.linspace(0, 20, 1000)
# Solve ODE
y = scipy.integrate.odeint(model, y0, t, args=(b, c))
plt.plot(t, y)
plt.legend(['S(t)', 'I(t)', 'R(t)'])
plt.xlabel('t')
plt.title('SIR Model')
#

in my model function, does it matter what way round i do S, I, R = y?

echo mesa
#

Guys, this might be a stupid question but does it make sense to write models in C instead of python? Because of its insane speed?

past meteor
echo mesa
iron basalt
echo mesa
past meteor
iron basalt
# echo mesa even just implementing basic algorithms in c can be a good idea, because you act...

For learning purposes it can be a very good idea. C has a DIY culture surrounding it, and so you will find more resources on how to implement things yourself than with Python, which is more focused on using existing libraries (because it's focused on productivity, which is dominated by having the thing already written for you, and easy access to it). But depending on what you are trying to do, there may not be any need to have a deeper understanding of the implementation, in which case it's kind of a waste of time.

echo mesa
iron basalt
echo mesa
past meteor
#

There's nothing inherently wrong with this, but you need to be aware of this being a thing

echo mesa
#

yeah i mean, trying to always go deeper is good but there should be a limit and we gotta be aware of whether what we do actually makes any difference or sense.

past meteor
#

Knowing (some) C is generally a good idea though I agree

#

It's also a strange language in the sense that it's small so in theory it's easy to pick up, but you need to do most of the things yourself ... because it's small

iron basalt
echo mesa
past meteor
#

C does have abstract data types / opaque types so you can still encapsulate etc. without OOP 🙂

iron basalt
#

The real thing that makes C "primitive," that you will immediately run into, is that its standard library is bare, it does not come with any of the common data structures you will find in other language's standard libraries (or built into the language itself).

quartz karma
#

Hi, in the case of using Matrix Factorization, how do we determine the dimension of embedding matrices? is it a hyperparameter or something like a weight parameter?

umbral charm
#

i just need to know why you cant swap them around

wooden sail
#

imagine you have x, y = [0, 3], and compare that to y, x = [0, 3]

#

the items in the list are assigned to x and y one to one, and in the order you specify
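
A tiny sketch of that point, plus how it applies to the SIR state vector. The values are the ones from the messages above, and the `_swapped` names are only for illustration.

```python
values = [0, 3]

x, y = values                   # x = 0, y = 3
y_swapped, x_swapped = values   # same list, but now y_swapped = 0 and x_swapped = 3

# Same idea for the SIR model: the state is ordered [S, I, R] by y0,
# so unpacking in another order silently mislabels the compartments.
state = [999, 1, 0]
S, I, R = state                 # matches y0 = [999, 1, 0]
I_wrong, S_wrong, R_wrong = state
```

Nothing raises an error in the swapped case; the equations just run with susceptible and infected counts exchanged, which is why the order has to match `y0` and the returned derivatives.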

umbral charm
thorn bobcat
#

hi

serene scaffold
thorn bobcat
#

:) thank you

gaunt axle
#

What kind of graph would best represent this kind of data? i.e. each set contains the % of students that passed a test for each course. This is just an example, in my particular case I may have 10 courses or more

agile cobalt
#

I would probably just use a bar chart for showing each individual % and not even try to show the intersections (or lack thereof)

quaint loom
agile cobalt
quaint loom
hard osprey
#

has anyone done multi class classification for multi layer perceptron before?
I think I'm doing it wrong

#

my codes:

#
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot

# Load the dataset
file_path = "modified_result.csv"
df = read_csv(file_path, header=None)

# Split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]

# Ensure all data are floating point values
X = X.astype('float32')

# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the input features
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# Determine the number of input features
n_features = X.shape[1]
print(n_features)
print(X)
print(y)
#
n_classes = 5 

# Define model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))
model.add(Dense(n_classes, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit the model
history = model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=1, validation_data=(X_test, y_test))
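
Two things that may be worth checking in the code above, as guesses only since the data isn't shown. First, the model is fit on `X_train` and validated on `X_test`, although `scaled_X_train` and `scaled_X_test` were computed. Second, `sparse_categorical_crossentropy` expects integer class labels from 0 to `n_classes - 1`, and `LabelEncoder` is imported but never used. A NumPy-only sketch of that label encoding, with made-up labels:

```python
import numpy as np

# Made-up string labels standing in for the last CSV column
y = np.array(['cat', 'dog', 'cat', 'bird'])

# classes holds the sorted unique labels; y_encoded holds each label's index
classes, y_encoded = np.unique(y, return_inverse=True)
n_classes = len(classes)
```

`y_encoded` is the kind of array the sparse loss expects, and `n_classes` can be checked against the hard-coded 5.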
quaint loom
#

Is there anyone here who have used structural equation model in jupyter notebook?

dark compass
#

Hi

#

Skrileze

quaint loom
dark compass
#

Hello

#

Are you developer?

quaint loom
dark compass
#

Yes

#

I am seeking new project

quaint loom
gaunt axle
next yoke
#

Has anyone ever used Google OR-Tools or its CP-SAT solver before?

onyx phoenix
#

are these two questions the same or different,

frail dune
#

General question regarding image-processing:
I have to code an edge-detection program, optimally in the end with ML algorithms to detect the edges of drill-holes (scanning electron microscope, SEM)
Would I use:
OpenCV
or
scikit-image?

dull flare
#
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\main.py", line 58, in <module>
    ax.imshow(df_wc)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\__init__.py", line 1478, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\axes\_axes.py", line 5756, in imshow
    im.set_data(X)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\image.py", line 723, in set_data
    self._A = self._normalize_image_array(A)
File "C:\Users\name\PycharmProjects\whatsappAnalyzer\venv\lib\site-packages\matplotlib\image.py", line 688, in _normalize_image_array
    raise TypeError(f"Image data of dtype {A.dtype} cannot be "

helper code 

def create_wordcloud(selected_user, df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')
    df_wc = wc.generate(df['message'].str.cat(sep=" "))
    return df_wc

main code 

df_wc = helper.create_wordcloud(selected_user,df)
fig, ax = plt.subplots()
ax.imshow(df_wc)
st.pyplot(fig)

using streamlit
#

it says
Image data of dtype object cannot be converted to float

agile owl
#

@past meteor do you use deap?

#

I'm trying to figure out how to evolve using a training set but keep results based on the validation set

#

their hall of fame class is how I've been doing it but now I want to use a different fitness than the one it's training on and that just doesn't seem to be supported at all

#

do you just manually transplant the models between the stages of your testing plan?

past meteor
agile owl
#

the way I have it split up it's train-val-train-val-train-val etc. and then test data at the end

#

I didn't want to keep reusing the earliest data because I think the earliest data is actually the least likely to be relevant so I just had it keep going through time with overlapping periods

#

The results are uninspiring so far lol
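
A rough sketch of the train-val-train-val layout described above, as plain index bookkeeping. This is not deap's API; `walk_forward_splits` and its parameters are hypothetical names, and the sizes are made up. Each validation window comes strictly after its training window, and the windows slide forward with overlap.

```python
def walk_forward_splits(n_samples, train_size, val_size, step):
    """Yield (train_range, val_range) index pairs that move forward in time."""
    start = 0
    while start + train_size + val_size <= n_samples:
        train = range(start, start + train_size)
        val = range(start + train_size, start + train_size + val_size)
        yield train, val
        start += step

# 10 time steps, train on 4, validate on the next 2, slide forward by 2
splits = list(walk_forward_splits(n_samples=10, train_size=4, val_size=2, step=2))
```

A final test block would then be held out after the last validation window and only touched once.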

past meteor
#

Can I be honest and say I'm not comfortable giving advice here? 🙂

agile owl
#

sure

#

I don't know if this is actually evidence of overfitting or me doing something that doesn't make sense

past meteor
#

I don't fully understand what you're trying to do. It also seems to be related to finance, which is a domain with extensive research

agile owl
#

wasn't expecting it to be so bad

past meteor
#

Before I could give real advice I'd have to read tons of finance literature but that's not particularly my interest

agile owl
#

mmm it's not really finance specific just need to figure out the right way to do out-of-sample time series validation on these

#

I guess the fact it's an agentic problem is relevant but not really anything finance specific

dull flare
past meteor
#

I'm not sure ignoring finance specific methods is smart. That's kind of my point

agile owl
#

I think what I did does disprove that what I'm trying to do works without having to look at the test data

#

mmmm... there's no agreed upon way to forecast time series or build trading bots in finance if that's what you mean

#

I think most people just use time series/signals processing stuff like anything else

#

there's crazy quasi-religious people who use fibonacci sequences etc.

past meteor
#

I did my MSc in business engineering, but majored in data science. Tons of my friends did the finance or actuarial science tracks. Many of those that did, went into stock prediction etc

#

When I asked them about their methods it was totally different things than I did

agile owl
#

Did they all give you the same answers? I would be somewhat surprised if they did.

#

unless they all worked for the same place

past meteor
#

Nah this was for their thesis

agile owl
#

most of finance is a form of regression problem with noise and probabilistic uncertainty

dull flare
#

solved :()

agile owl
#

signals engineering people tend to do the best i think

#

when it comes to the quantitative stuff

#

it sounded like you worked with signals of some kind

#

I want to learn Reinforcement Learning AND Symbolic Regression AND finish learning EA

#

this is so cursed

#

too much to learn

#

too little time

dull flare
agile owl
#

the future

dull flare
#

oh

agile owl
#

just in general

dull flare
#

nice keep it up

#

are u enginnering student?

agile owl
#

no i'm out of school

dull flare
#

o

#

well i think i pretty much started with ml stuff

#

and im in 3rd year of eng (just started)

#

currenly im doing some projects to get a good hand

#

then ill start unsupervised

#

reinforment learning ill try it for fun tbh

agile owl
#

yeah I don't think that you will actually master everything just in school I also took a masters in data science but we didn't touch on all those topics

dull flare
#

i see yea

agile owl
#

so I know the supervised and unsupervised stuff pretty well

#

reinforcement learning they didn't cover

#

we covered deep nets, architectures and LLM

#

but missing some stuff

#

it's such a broad field to cover everything

dull flare
#

ooh yea deep learning omy i have no idea about it i need to learn that too

#

ugh to much to learn hahaha

agile owl
#

I think that deep learning is actually not as useful for a lot of practical things as reinforcement learning

#

reinforcement learning takes on a classic problem solving form for anything happening over time that requires interactions

dull flare
#

u think so ? i was said that its not that useful when compared to supervised , un, and deep learning

agile owl
#

I guess it depends on what you are doing

dull flare
#

i see

agile owl
#

if you are trying to optimize something over time

#

then it's what you want

dull flare
#

well as u are a very experienced person can u provide me a guide map

#

currenly im trying out differnt projects on supervised

agile owl
#

well for example it's the type of algorithm that will learn how to beat a game

dull flare
#

i did some maths and intuition of basic algorithms of supervised

agile owl
#

like chess

dull flare
#

yes i know something about rl

#

we need those punishments and reward system

#

its pretty interesting and fun

agile owl
#

yeah you need to map different game outcomes to different rewards/penalties

#

I honestly don't know all the details otherwise I wouldn't be saying I need to learn it still

dull flare
#

specially i saw that person who created a RL agent which plays pokemon red, i was so inspired by that

#

and got interest in RL

agile owl
#

I think anyone who says RL isn't useful compared to supervised and unsupervised is mistaken

#

I think supervised and unsupervised learning answer lower conceptual level problems

#

I think RL's problem space is on a higher conceptual level

#

which makes it more useful not less

dull flare
#

its like brute force

agile owl
#

eh, I'm not sure about how the calculus works but I'm pretty sure there's some method to it beyond brute force

dull flare
#

hm u saw that pokemon red video where he trains those agents to play the game?

agile owl
#

supervised learning: What should this house cost?
reinforcement learning: At what discount to value should I buy the house and when should I sell it?

#

that's what I mean by different conceptual level

dull flare
#

oh it could be used in that way too ?

#

thats awesome

#

i see haha i have very little knowledge about it then

#

i thought it would be better only in games and stuff

agile owl
#

games imitate life

#

in some ways

dull flare
#

i never understood it in business scale

agile owl
#

if something can play a game it can probably do something useful

dull flare
#

hm

agile owl
#

if you think about it

dull flare
#

yes

agile owl
#

and figure out what it's doing

#

games are fun because they require intelligence of some sort

dull flare
#

yea i can think about it hmm

agile owl
#

so like what are the problem solving techniques that a programmatic agent playing a game would use

#

and how can they be applied to other domains outside of games

dull flare
#

yes i see how you are thinking of applying the game ideology in real life

#

thats pretty interesting

#

but for me it will prolly take too long to get there

#

im still stuck at supervised

#

oki gtg ill complete this project and think of something else seeya

feral kernel
#

Hi bicubic interpolation is not implemented on MPS, how do i get around this for the diffusion model for a Mac

vale swallow
#

Hi, can someone help me, I’m trying to plot some images and see it’s corresponding arrays but nothing shows up, this is what I have:
`#LOADING AND SPLITTING DATA
x_train = tfds.load('celeb_a', split='train[:10000]', shuffle_files=True)
x_test = tfds.load('celeb_a', split='test[:2000]', shuffle_files=True)

#PREPROCESSING IMAGES
x_train_arrays = []
x_test_arrays = []

for dataset in (tfds.as_numpy(x_train), tfds.as_numpy(x_test)):
    for img in dataset:
        image = img['image']
        if dataset is tfds.as_numpy(x_train):
            x_train_arrays.append(image)
        else:
            x_test_arrays.append(image)

#PLOTTING
train_examples = x_train_arrays[:2]

for example in train_examples:
    image = example['image']
    print(image)
    plt.imshow(image)
    plt.show()`
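
A likely cause, sketched under assumptions: each call to `tfds.as_numpy(x_train)` most likely builds a new object, so the `dataset is tfds.as_numpy(x_train)` check is never true, every image lands in `x_test_arrays`, and `x_train_arrays` stays empty, which would explain why nothing is plotted. Keying the datasets by name avoids the identity check. The helper name `collect_images` is made up for this sketch, and it is shown with plain lists so it runs without TensorFlow.

```python
def collect_images(named_datasets):
    """Gather the 'image' entry of every example, keyed by split name.

    named_datasets maps a split name to any iterable of dicts with an 'image' key,
    for example {"train": tfds.as_numpy(x_train), "test": tfds.as_numpy(x_test)}.
    """
    return {name: [ex["image"] for ex in ds] for name, ds in named_datasets.items()}

# Two separate calls give two separate objects, so an `is` check between them fails.
make_iter = lambda: iter([1, 2, 3])
print(make_iter() is make_iter())

# Toy stand-in data (assumption): strings instead of real image arrays.
arrays = collect_images({
    "train": [{"image": "img0"}, {"image": "img1"}],
    "test": [{"image": "img2"}],
})
print(arrays["train"][:2])
```

Since the collected lists already hold the image arrays, the plotting loop would pass each element straight to `plt.imshow` rather than indexing it with `['image']` again.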

lapis sequoia
#

Hi, I need help. My Google Colab can't install chatterbot with `!pip install chatterbot`. Does anyone have the same problem?

serene scaffold
buoyant vine
#

What is the best way to train a text-classification model using BERT to classify labels which are part of a hierarchy?

I.e. if I have categories like:

Sports/College Sports/College Basketball
Ideally I'd like the model to be able to know that College Basketball is a child of College Sports and Sports

I have been playing with a basic multi-label model currently, but it seems to struggle quite heavily when looking at the overall 700 categories spread across 4 tiers of Hierarchy.
And realistically if the model is predicting College basketball strongly, it should have Sports and College Sports as high if not higher.

past meteor
# buoyant vine What is the best way to train a text-classification model using BERT to classify...

I haven't done this (yet) but this is called hierarchical classification, just so you can look at the literature if you care to.

Something you can definitely do is train classifiers per tier, but you'll end up with many of them. Doing it should not be prohibitively "expensive" if you train BERT to classify the first level, take the final layer embeddings and do the "cascade" with simpler models (say gradient boosting, logistic regression and whatnot). Drawback is that this isn't some end-to-end thing you can train in one go

buoyant vine
#

😅 Do you have any links which are fairly basic to read? My mathematics is not great so a lot of the theory goes over my head.

#

One thing that seems to be weird is roberta-base struggles less than xlm-roberta-base so I guess the amount of data also plays a role, but 800k points for 700 categories seems fairly solid

buoyant vine
#

miku_pray Ty ty

past meteor
#

What I described are "local classifiers"

#

And it's their fav, seems my intuition is decent then 😄

#
  1. Train BERT to predict the topmost category
  2. Train 1 model per tier 1 category to predict its tier 2 children
  3. Train 1 model per tier 2 category
  4. Train 1 model per tier 3 category

Just use (the same) embeddings for all these
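
The steps above can be sketched as a routing function at prediction time. This is a rough illustration, not a tested recipe: `root_model` and the entries of `child_models` are hypothetical callables standing in for whatever fitted classifiers you end up with, and the lambdas below are toy stand-ins.

```python
def predict_path(embedding, root_model, child_models):
    """Walk down the hierarchy: each prediction selects the classifier for the next tier.

    root_model and the values of child_models are hypothetical callables that map
    an embedding to a label; in practice they could wrap any fitted classifier.
    """
    path = [root_model(embedding)]
    # Keep descending while the last predicted label has its own child classifier.
    while path[-1] in child_models:
        path.append(child_models[path[-1]](embedding))
    return path

# Toy stand-ins for trained models (assumptions, for illustration only).
root = lambda emb: "Sports"
children = {
    "Sports": lambda emb: "College Sports",
    "College Sports": lambda emb: "College Basketball",
}
print(predict_path([0.1, 0.2], root, children))
```

The same embedding is passed to every model, so BERT only has to run once per document. If label names can repeat across branches, the dictionary would need to be keyed by the full path instead of the bare label.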

buoyant vine
#

hmm so in the end we have roughly 700 smaller trained models

#

for their specific niche

past meteor
#

Yeah, you can have less than that if you train 1 model for each tier for instance

#

Then you have just 3 or 4

buoyant vine
#

But how would you feed the model on the next tier what the parent tier was?

past meteor
#

You wouldn't

#

But you have 600 less models 😄

#

You can also zero out things that are not possible with post-processing ofc
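
One way the zeroing-out could look, as a sketch under assumptions: scores come as a label-to-probability dict, the hierarchy is a child-to-parent dict, and a label counts as "predicted" above a threshold of 0.5. The function names and the threshold are made up for the example.

```python
def depth(label, parent_of):
    """Number of ancestors a label has (0 for a top-tier label)."""
    d = 0
    while label in parent_of:
        label = parent_of[label]
        d += 1
    return d

def mask_impossible(scores, parent_of, threshold=0.5):
    """Zero out any label whose parent was not predicted above the threshold."""
    cleaned = dict(scores)
    # Visit parents before children so a zeroed parent also blocks its descendants.
    for label in sorted(scores, key=lambda l: depth(l, parent_of)):
        parent = parent_of.get(label)
        if parent is not None and cleaned.get(parent, 0.0) < threshold:
            cleaned[label] = 0.0
    return cleaned

parent_of = {"College Sports": "Sports", "College Basketball": "College Sports", "Jazz": "Music"}
scores = {"Sports": 0.9, "College Sports": 0.2, "College Basketball": 0.8, "Music": 0.1, "Jazz": 0.7}
print(mask_impossible(scores, parent_of))
```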

#

I think it's a matter of trying out all options and seeing which has the best metrics

buoyant vine
#

Hmm, yeah the 700 models approach might be best, or at least i need to partition the data somewhat because it currently gets stuck and can't learn with the 700 categories and multi-label

#

or at least... It does not work out how to learn within a sane time frame, it can just about do it on Tier 1 and Tier 2 where there are ~100 - 150 categories, but past that it seems to largely just go "aha yes I think this text is all categories"

past meteor
#

I guess you'll have to try it out

#

Something about hundreds of models is unsettling to me lol

buoyant vine
#

😅 I think it may raise some questions.

What I will probs do first is partition tier by tier, a step at a time.

So to start with we do 29 (Tier 1) classifiers and see if that aids it

#

that way we only have 30 classifiers

#

and then all we have to do is work out how to stick this into PyTorch Lightning without it breaking things

past meteor
#

Idk how you'll do inference but you can train as usual and have a method that does all the multiplications till you reach the last layer, this is the one you then multiply with your entire dataset to have n_docs x emb_size and then the scuffed part begins

#

It's also really a case of how you want to evaluate this thing, if it gets it wrong at tier 0 will you propagate errors?

T1 wrong -> T2 wrong -> T3 -> Wrong or do you do this:
T1 wrong -> use actual T1 to predict T2 -> ...

buoyant vine
#

The only issue I may have... Is GPU memory 😅 These machines only run A10g's so they have at most 24GB but the BERT models in memory are already pretty big (XLM at least :P)

#

What i'm curious about is how I manage loss / optimizers

past meteor
#

Yeah so the models you use for this stuff can be anything, it can be xgboost running on say CPU

#

400-500 category specific fully connected layers aint gonna work

buoyant vine
#

002_salute In scuff we trust

past meteor
#

At prediction time for real for real it can be a big dict/look-up table lol

buoyant vine
#

oh trust me that is what I am planning to do lol

#

that is the simple part 😅 It is getting PyTorch Lightning and the rest of the training framework to not lose its mind

past meteor
#

But you can do these separately?

buoyant vine
#

Well ideally it would be, but the way lightning sets it up with our loggers etc... Means if we did that, each CI run to train would create 30 runs in our Neptune system 😅 although fuck it maybe that is useful tbh

#

We'll worry about it not clogging up the graphs later 😅

past meteor
#

sometimes a little bit too much > too little

#

But this is definitely copium

buoyant vine
#

Another question, is there any way to add a manual layer of data points for the model to consider

#

Like a guidance as such?

#

I.e I already have a vague idea of the categories so it would be "it is likely to do with X, Y and Z"

narrow tiger
#

Is there a way i can connect ollama/llama2 to the internet so it can give updated info? mostly i just want an AI so i don't have to read docs, or maybe some way i can train a smol AI on some docs

vestal spruce
#

does anyone know a resource that I could use to explain spectral clustering in a simplified manner?

pseudo yew
#

I am making a recipe recommendation app and using flask as backend, I have a csv containing recipes data with a column with list of ingredients, but they are in the form like :
"['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher salt, divided, plus more', '2 small acorn squash (about 3 lb. total)', '2 Tbsp. finely chopped sage']"
But I want pure ingredient names without any quantity or state i.e., chicken, salt, acorn etc
Is there any way to filter my ingredients like this in python, I've heard about libraries such as nltk but I am not sure about them
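
A rough, rule-based sketch of one way to do this with only the standard library. The unit and descriptor word lists are small hand-written assumptions and would need to grow for a real recipe dataset; a trained ingredient parser would be more robust.

```python
import ast
import re

# Hand-written word lists (assumptions); extend them for your own data.
UNITS = {"tsp", "tbsp", "cup", "cups", "lb", "oz", "g", "kg", "ml", "l"}
DESCRIPTORS = {"whole", "small", "large", "medium", "finely", "chopped", "kosher", "fresh"}

def clean_ingredient(text):
    text = re.sub(r"\([^)]*\)", " ", text)   # drop parenthetical notes like "(about 3 lb. total)"
    text = text.split(",")[0]                # drop trailing notes like ", divided, plus more"
    words = []
    for word in text.lower().split():
        word = word.strip(".")
        if not word.isalpha():               # skips quantities such as "1" or "2¾"
            continue
        if word in UNITS or word in DESCRIPTORS:
            continue
        words.append(word)
    return " ".join(words)

raw = ("['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher salt, divided, plus more', "
       "'2 small acorn squash (about 3 lb. total)', '2 Tbsp. finely chopped sage']")
ingredients = [clean_ingredient(item) for item in ast.literal_eval(raw)]
print(ingredients)
```

`ast.literal_eval` turns the stringified list from the CSV column back into a real Python list. Note that hyphenated words (for example "extra-virgin") are dropped by the `isalpha` check in this sketch.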

left tartan
feral kernel
#

I'm getting error, how does it not find the file from the directory- `from torch.utils.data import DataLoader, Dataset

import torchvision.transforms as T
import torch
import torch.nn as nn
from torchvision.utils import make_grid
from torchvision.utils import save_image
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import random
%matplotlib inline
image_size = 64
DATA_DIR = '/Users/Name/Documents/ai\ info/test.out.npy'
X_train = np.load(DATA_DIR)
print(f"Shape of training data: {X_train.shape}")
print(f"Data type: {type(X_train)}")

1 image_size = 64
2 DATA_DIR = '/Users/Name/Documents/ai\ info/test.out.npy'
----> 3 X_train = np.load(DATA_DIR)
4 print(f"Shape of training data: {X_train.shape}")
5 print(f"Data type: {type(X_train)}")

File /opt/homebrew/Cellar/jupyterlab/4.0.7_1/libexec/lib/python3.11/site-packages/numpy/lib/npyio.py:427, in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
425 own_fid = False
426 else:
--> 427 fid = stack.enter_context(open(os_fspath(file), "rb"))
428 own_fid = True
430 # Code to distinguish from NumPy binary files and pickles.

FileNotFoundError: [Errno 2] No such file or directory: '/Users/Name/Documents/ai\ info/test.out.npy' `
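
A probable cause, assuming the folder on disk is really called `ai info` with a plain space: `\ ` is how a shell escapes a space, but Python does not treat it as an escape, so the backslash stays in the string and the path no longer matches. A small sketch:

```python
from pathlib import Path

# '\\ ' below is what Python stores for the original 'ai\ info' literal:
# the backslash is kept, because "\ " is a shell escape, not a Python one.
shell_style = '/Users/Name/Documents/ai\\ info/test.out.npy'

# Assumption: the real folder is named "ai info" (with a plain space).
plain = Path('/Users/Name/Documents/ai info/test.out.npy')

print('\\' in shell_style)   # the stray backslash is part of the path string
print(plain.exists())        # worth checking before calling np.load(plain)
```

If `plain.exists()` is still false on your machine, the file name or folder is different from what the code assumes.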

feral kernel
signal whale
#

anyone able to help with power bi it is so annoying
cuz seems like i am logged in but have a pop up to connect acc and then i get this ^

lapis sequoia
#

which 3b model would be better to run on low ram devices like a phone or a raspberry pi, btlm-3b or orca mini 3b

#

i have room for orca 2 7b but it needs to be quantized to 2 or 3 bits, which i heard reduces accuracy by a lot

narrow tiger
#

OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacty of 5.79 GiB of which 155.31 MiB is free. Including non-PyTorch memory, this process has 4.77 GiB memory in use. Of the allocated memory 4.53 GiB is allocated by PyTorch, and 118.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
how can i make it so that it can take longer time to generate img but doesn't run outa memory or is that not possible

lapis sequoia
# serene scaffold what error message did you get when you tried it? be sure to always show error m...

When I run this code in google colab: '!pip install Chatterbot' this comes out: Collecting ChatterBot
Downloading ChatterBot-1.0.5-py2.py3-none-any.whl (67 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.8/67.8 kB 3.1 MB/s eta 0:00:00
Collecting mathparse<0.2,>=0.1 (from ChatterBot)
Downloading mathparse-0.1.2-py3-none-any.whl (7.2 kB)
Requirement already satisfied: nltk<4.0,>=3.2 in /usr/local/lib/python3.10/dist-packages (from ChatterBot) (3.8.1)
Collecting pint>=0.8.1 (from ChatterBot)
Downloading Pint-0.22-py3-none-any.whl (294 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.0/294.0 kB 9.4 MB/s eta 0:00:00
Collecting pymongo<4.0,>=3.3 (from ChatterBot)
Downloading pymongo-3.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (516 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 516.2/516.2 kB 11.9 MB/s eta 0:00:00
Collecting python-dateutil<2.8,>=2.7 (from ChatterBot)
Downloading python_dateutil-2.7.5-py2.py3-none-any.whl (225 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 225.7/225.7 kB 12.1 MB/s eta 0:00:00
Collecting pyyaml<5.2,>=5.1 (from ChatterBot)
Downloading PyYAML-5.1.2.tar.gz (265 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 265.0/265.0 kB 15.0 MB/s eta 0:00:00
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

trim oxide
#

Hey! Was wondering if anyone has any experience with Kaggle? I'm a math major but wanted to get better at coding and kaggle seems fun and interesting, but I am a beginner at python at best. I was wondering if there are any pre-reqs i should know

past meteor
odd meteor
orchid sky
#

Anybody here

serene scaffold
# orchid sky Anybody here

why do you want to know why someone is here? people who might glance at the channel aren't going to feel enticed just to announce that they're looking at the channel.

orchid sky
#

Is it possible to create a python program where you give it multiple documents at once and once you do a keyword search that it can go through those documents and pull out that information
I have no idea if it is a short or long program or what python modules can help out in doing this

serene scaffold
#

(and yes, it is possible.)

orchid sky
#

Sorry as PDFs and word documents

serene scaffold
# orchid sky Sorry as PDFs and word documents

this notebook I wrote a few months ago has code for extracting text from both PDF and Word documents. It's based on code that I stole from someone else. https://github.com/center-for-threat-informed-defense/tram/blob/main/user_notebooks/predict_multi_label.ipynb

GitHub

TRAM is an open-source platform designed to advance research into automating the mapping of cyber threat intelligence reports to MITRE ATT&CK®. - center-for-threat-informed-defense/tram

serene scaffold
#

and what do you want the output to be? all the sentences that contained a match?

orchid sky
#

No as like cut that part off

orchid sky
serene scaffold
#

what about "swim" and "swam"? do you want those to count as matches for each other?

orchid sky
#

No as I want them to be seperate

serene scaffold
#

okay. if you're not a native English speaker, keep in mind that "swam" is just the past tense of "to swim", so they're the "same word"

#

anyway, it sounds like you're just trying to match substrings.
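
Once the text has been extracted from the PDFs and Word files, plain substring matching could look like the sketch below. It assumes the documents are already available as a dict of name to plain text, and the function name `find_matches` is made up for illustration.

```python
import re

def find_matches(documents, keyword):
    """Return (document name, sentence) pairs whose sentence contains the keyword.

    documents: dict mapping a name to already-extracted plain text.
    """
    needle = keyword.lower()
    hits = []
    for name, text in documents.items():
        # Naive sentence split on '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if needle in sentence.lower():
                hits.append((name, sentence.strip()))
    return hits

docs = {
    "report.pdf": "I swim daily. She swam yesterday.",
    "notes.docx": "Nothing relevant here.",
}
print(find_matches(docs, "swam"))
```

Because this is a substring match, "swim" and "swam" stay separate, as requested. The flip side is that "swim" would also match "swimming".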

orchid sky
#

I am a native English speaker and so I do get the idea

serene scaffold
orchid sky
#

Okay as I do get it

#

So your program can work that same way

#

@serene scaffold thank you for showing me this

#

Do you also think that this is an AI model, or is it, as I think of it, just data science?

serene scaffold
orchid sky
#

I agreed to that idea as for me I do not think it is to be but my boss said to me he thinks it can be

left tartan
lapis sequoia
#

anyone here work with MS/MS data like itraq ratios?

haughty cradle
#

anyone can help with this? it's Anaconda Navigator

#

there is error, but the error didn't specify what is the error

cold osprey
#

No idea but proly just remake the environment

fallen dagger
#

I've been trying to figure this out for a minute now, any tips appreciated:

I want to send in a piece of text to the ChatGPT chat API and sequentially ask it questions and get answers to those questions one by one. I don't want to create a new input prompt each time to ask a new question on the same text which would use context tokens for the same text each time.

I'm currently submitting all questions in the prompt and asking it to end each answer with two line breaks so I can break up the responses to each question. But I'm sure there's a better way to do it.

     for question in questions:
         messages.append({"role": "user", "content": question})

     chat_response = client.chat.completions.create(
         model="gpt-3.5-turbo-1106",
         messages=messages,
     )
     full_response = chat_response.choices[0].message.content

     # Split the response into parts based on '\n\n'
     response_parts = full_response.split('\n\n')


     summary = response_parts[0]
     category = response_parts[1]
     background = response_parts[2]
     claims = response_parts[3]
     stakeholders = response_parts[4]
     impact = response_parts[5]
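
Splitting on blank lines breaks as soon as one answer contains a paragraph break. One alternative, sketched here under assumptions, is to ask for a single JSON object and read the answers by key. The helper names are made up; the 1106 models document a JSON mode (`response_format={"type": "json_object"}`), but check the current API reference before relying on it. Note also that the chat API keeps no state between calls, so asking the questions one request at a time would re-send the text each time anyway.

```python
import json

FIELDS = ["summary", "category", "background", "claims", "stakeholders", "impact"]

def build_instruction(fields):
    """Ask for one JSON object so the answers can be told apart by key."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return f"Answer every question and reply with a single JSON object with the keys: {keys}."

def parse_answers(raw_reply, fields):
    """Parse the reply and fail loudly if any expected key is missing."""
    data = json.loads(raw_reply)
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return {name: data[name] for name in fields}

# Hypothetical reply text, standing in for chat_response.choices[0].message.content.
fake_reply = json.dumps({name: f"text for {name}" for name in FIELDS})
answers = parse_answers(fake_reply, FIELDS)
print(answers["summary"])
```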
rare ferry
#

What are some feature selection techniques for k-means clustering. My variables are mostly categorical.

serene scaffold
past meteor
past meteor
glad pasture
#

Could you guys suggest me a good resource to learn Numpy, Pandas, Matplotlib and Seaborn.

odd meteor
past meteor
#

Can you be specific in what you don't understand? Have you studied the course so far? I don't think people will explain the entire coursework over discord 😄

glad pasture
neat torrent
neat torrent
#

ping if someone responds

dusk knot
neat torrent
dusk knot
neat torrent
#

obviously imma skip the part that i already know

neat torrent
#

actually i used to do a lot of competitive prog

dusk knot
neat torrent
#

or stick with that roadmap??

dusk knot
neat torrent
#

one of my cousins approved it (he is a data scientist and usually works with ML at microsoft)

dusk knot
# neat torrent so should i do this one??

I can recommend you an easy one just to get a first certificate without much fuss, but with some nice examples, which I took, it happens to be https://www.edx.org/learn/data-visualization/ibm-visualizing-data-with-python and it is about visualizing data, not ML, but it is a nice little aperitif

neat torrent
dusk knot
neat torrent
#

seems good enough

neat torrent
dusk knot
#

note that the Google Crash Course in Machine Learning is decent, but somewhat high level... but yeah try it out

neat torrent
dusk knot
# neat torrent thanks for the information buddy

no problem, let's put the cards on the table... these playlists are not too long and they are super helpful in preparing/refreshing relevant math skills

https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PL0-GT3co4r2wlh6UHTUeQsrf3mlS2lk6x

neat torrent
#

cuz i already completed a module of it

dusk knot
neat torrent
#

okay i will complete this one

#

mind if i add you as a friend and we stay in touch??

dusk knot
#

Where's the circle? And how does it relate to where e^(-x^2) comes from?

A visual trick to compute the sum of two normally distributed variables.
dusk knot
# neat torrent mind if i add you as a friend and we stay in touch??

and then just a bit later, when you consider/learn about artificial Neural Networks

https://www.youtube.com/watch?v=Ilg3gGewQ5U

https://www.youtube.com/watch?v=tIeHLnjs5U8

#

have fun learning!

neat torrent
#

thankyou very much

daring pier
#

Can anyone tell me about the steps i need to follow in order to fine tune a yolo v7 on a customized dataset. I have done other models in keras but i can't understand where to start from while working with yolo.

hallow cargo
#

I'm working with tensorflow, and I'm trying to fit a dataset to a model, but I am not sure what its wanting from me. I got a dataset where each dataset.take(1) provides me with a dictionary of 16 features with each a tensor with size (2048, 512) for each feature. 2048 is the batch size and 512 is the window size. This configuration is likely totally wrong, but what should I be looking to give to the model, what does it want? The reason I'm not using numpy ndarrays or pandas df is because all this can't fit in my memory.

#

I'm getting errors like this: Input to reshape is a tensor with 1048576 values, but the requested shape has 2048 [Op:Reshape]

#

The 1048576 is 512*2048.

mild dirge
hallow cargo
#

I got some progress and in the generator I attempted to convert the dictionaries to just a tensor, and I got (16, 2048, 512) which might just be where the issue lies

#

I'm using CsvDataset() and am sorta new to tf.data.

mild dirge
#

What kind of data is it?

#

Time series data?

hallow cargo
#

Yes

mild dirge
#

So you basically have a batch size of 2048, and you use 512 time steps, each with 16 features?

hallow cargo
#

Yes

mild dirge
#

So if you can't load it all at once, you can just lower the batch size

hallow cargo
#

oh wait

#

That wasn't the issue, but something just dawned on me

#

I could have used numpy arrays and batched it using a generator and still fit all the data to my memory?

#

Since my problem was loading the entire data to my memory.

mild dirge
#

It should be 16777216 (2048 * 512 * 16) floats of 32 bit probably, so about 67 MB-ish

#

Don't see how that wouldn't fit at once

hallow cargo
#

My whole data is several hundred thousand samples

#

Thats just one batch

mild dirge
#

tf probably has some type of class for a dataloader, you wouldn't load it all at once

#

pytorch f.e. has a class for a dataset, which can retrieve a sample given an index, and dataloader would load in batches of samples in random order given the dataset class

hallow cargo
#

It definitely does, but I am new to the tf.data framework and have no clue how datasets in tensorflow particularly work

#

Since loading it all up into one giant ndarray filled my ram up, I attempted to use tf.data, but I've been very confused with what I even want as an input for the model in terms of tf.data

mild dirge
#

I'm still not completely sure about the type of data, you say the data has a batch size, window size, and number of features. I'd expect a single sample of the data to just be a window size, and number of features correct? (num_timesteps, num_features)

hallow cargo
#

Yes, correct

mild dirge
#

So you made a custom dataset class or?

#

I don't often use tf, so I'm not super sure how this is normally done with that library

hallow cargo
#

No, I initially loaded a csv file using tf.data.experimental.CsvDataset(), then windowed and batched the data.

hallow cargo
#

This is how I windowed it if it helps:

    def window_data(self, data_ds):
        data_ds = data_ds.window(self.window_size, shift=self.shift, stride=1, drop_remainder=True)
        for sub_ds in data_ds.take(1):
            print(sub_ds)
        data_ds = data_ds.flat_map(lambda x, y: tf.data.Dataset.zip(({key: ds.batch(self.window_size) for key, ds in x.items()}, y)))

        return data_ds
mild dirge
#

I guess you could just make a class that selects a random sample each time, and selects a random window from this sample

#

Depending on what kind of data you want to give to your model

hallow cargo
mild dirge
#

What is your model supposed to do?

hallow cargo
#

Receive 16 sequences of time series data of length 512 each into an LSTM. I successfully did this using numpy arrays of shape (batch_size, 512, 16) with a smaller dataset, but as I increased the number of samples, it got too big to fit my memory, and thus I attempted to use tf.data. Would there be any alternative ways of doing this?

mild dirge
#

How did you get batch_size samples before then?

hallow cargo
#

I loaded up a numpy array of (total samples, 512, 16) and designated batch_size to be 2048 (or lower then) in the model.fit function.

#

and another one with the labels

mild dirge
#

And the data is a single csv or?

hallow cargo
#

Yes

mild dirge
#

I would probably find out what tf has to offer. Intuitively you'd want to have a function that can load a single random sample of shape (512, 16), and then another function which combines batch_size samples into a single batch of data to feed to your model.
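
The two-function idea can be sketched as a single generator that builds windows lazily, one batch at a time, so only the raw series has to sit in memory. This is an illustration with plain lists so it runs without dependencies. The labelling rule (the label at each window's last time step) is an assumption, since the thread does not say how labels line up with windows.

```python
import random

def window_batches(series, labels, window_size, batch_size, seed=0):
    """Yield (windows, targets) batches lazily instead of materialising every window.

    series: sequence of rows, each row holding the features of one time step.
    labels: one label per time step; the label at a window's last step is used (assumption).
    """
    starts = list(range(len(series) - window_size + 1))
    random.Random(seed).shuffle(starts)          # random order, fixed seed for repeatability
    for i in range(0, len(starts), batch_size):
        chunk = starts[i:i + batch_size]
        windows = [series[s:s + window_size] for s in chunk]
        targets = [labels[s + window_size - 1] for s in chunk]
        yield windows, targets

# Toy data: 10 time steps with 2 features each.
series = [[t, t * 2] for t in range(10)]
labels = list(range(10))
batches = list(window_batches(series, labels, window_size=4, batch_size=3))
print(len(batches), [len(w) for w, _ in batches])
```

With a NumPy array as `series`, stacking each batch with `np.stack(windows)` would give the `(batch_size, 512, 16)` shape the LSTM expects. TensorFlow can also wrap a generator like this with `tf.data.Dataset.from_generator`, and `tf.keras.utils.timeseries_dataset_from_array` may cover the same need directly; check the docs for the exact arguments.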

hallow cargo
orchid sky
#

Is there a way to filter data within a sheet and display the entire row that contains the filtered word? This is for working with xlsx
I have no idea how
I want the user to type in their keyword and then have the matching rows from that sheet displayed
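
A minimal sketch of the filtering step, assuming the sheet has already been read into a list of rows. The file and sheet names in the comment are hypothetical; `openpyxl` is one library that can produce such rows, and pandas is another option.

```python
def filter_rows(rows, keyword):
    """Keep rows where any cell contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(cell is not None and needle in str(cell).lower() for cell in row)
    ]

# With openpyxl installed, the rows could come from something like:
#   ws = openpyxl.load_workbook("file.xlsx")["Sheet1"]      # hypothetical names
#   rows = list(ws.iter_rows(values_only=True))
rows = [
    ("id", "item", "note"),
    (1, "Radio", "shipped"),
    (2, "Antenna", None),
    (3, "radio mount", "backorder"),
]
for row in filter_rows(rows, "radio"):
    print(row)
```

In a real script the keyword would come from `input("Keyword: ")` instead of being hard-coded.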

oblique marsh
agile cobalt
mild dirge
spice mountain
#

Any of you guys familiar with PyTorch Lightning who can help me understand whether I can load a "super-model" which contains the VQGAN model I am interested in + a transformer, and then just deep copy the VQGAN model?

#

I plan on finetuning the VQGAN model, but I don't know if there is any sort of connection to the original model?

echo mesa
#

Guys, what algebra and trigonometry do i need specifically to be able to understand calculus?

serene scaffold
#

.latex
For example, $\sqrt{x} = x^{$

#

fuck

#

.latex For example, $\sqrt{x} = x^\frac{1}{2}$, $x^{-1} = \frac{1}{x}$

strange elbowBOT
serene scaffold
#

.latex More generally, $\sqrt[n]{x} = x^\frac{1}{n}$, $x^{-n} = \frac{1}{x^n}$

strange elbowBOT
echo mesa
serene scaffold
past meteor
#

Like, I remember that in freshman math they'd always show us the geometric interpretation of both Lin alg and calculus, I always zoned out and I turned out just fine

hollow sentinel
#

there is a good mit ocw course on single variable calculus. there's also khan academy. and iirc some other stuff.

#

i'm actually doing said mit ocw course on single variable calc rn

echo mesa
# past meteor It depends on how deep you go as usual

well yeah, im just trying to make sure that im learning maths while doing machine learning and ai, it has been very challenging cause self-learning multivar calculus as a 16 year old is not easy which is completely fine because i like struggling, but i keep asking myself that "what am i gonna use all of this knowledge for" since im beginner at machine learning and i dont know maths(at a higher level) its very challenging to keep going with both of them and not knowing whether all of the time spent on it was worth it. At this point all i can do is trust that i will use all of this knowledge, again for me the biggest problem is the balance between learning mathematics and machine learning. i think i should probably just focus on maths and then do what you recommended and then go with statistics and hope that everything will make sense.

past meteor
#

To give you a concrete idea, there's the possibility to create new methods relying on determinantal point processes. The determinant of the kernel matrix is in a way a measure of dissimilarity. This is clear if you know the geometric interpretation of a determinant. Does this matter? It depends, if you're doing foundational research where you're inventing new methods yes. If you're doing any kind of other work, not so much.
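
A tiny numeric illustration of that geometric reading, using a plain linear kernel on two 2D vectors (an assumption made to keep it small). The determinant of the 2x2 kernel matrix equals the squared area of the parallelogram the vectors span, so similar vectors give a value near zero.

```python
def kernel_det(u, v):
    """Determinant of the 2x2 linear-kernel (Gram) matrix of two vectors."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # | u.u  u.v |
    # | v.u  v.v |
    return dot(u, u) * dot(v, v) - dot(u, v) ** 2

print(kernel_det([1, 0], [0, 1]))     # orthogonal vectors: large determinant
print(kernel_det([1, 0], [1, 0.1]))   # nearly parallel vectors: determinant close to 0
```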

past meteor
echo mesa
past meteor
#

A lot of math I learnt was "on-demand" because I needed it for ML stuff

past meteor
#

Kaggle

#

Build projects etc.

#

When you're doing that, go to sci kit learn's user guide, read what they have to say about the algorithms as you're doing the projects

echo mesa
#

okay i was trying to implement a basic linear regression model which i successfully did and understood the idea behind it but when it came to finding the best fitting line, since i do not know partial derivatives i was stuck on understanding gradient descent.
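
For reference, here is a minimal sketch of gradient descent for a line `y = w*x + b` in plain Python. The learning rate, step count, and toy data are arbitrary choices for the example. The two partial derivatives are just "how much does the error change if I nudge `w`" and "how much if I nudge `b`".

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step against the gradient, i.e. downhill on the error surface.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the fit should land near w=2, b=1.
w, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(round(w, 3), round(b, 3))
```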

past meteor
#

Sooner or later you'll ask yourself questions that will lead you to picking up math textbooks

echo mesa
past meteor
echo mesa
# past meteor Kaggle

ive never used kaggle tbh, should i just go thru the tutorials or how should i do it

echo mesa
#

it seems that no matter where do i go and what i do at some point i will get stuck with not having the mathematical background for the machine learning topic that im learning

past meteor
past meteor
echo mesa
#

and also what projects would i build?

#

ahh man im so frustrated by the amount of information that i need to know and i just cant "balance it"

past meteor
#

It's multiple competitions

#

I'd start doing one of those

thorn flame
#

Guys, I intend to create a chatbot that can answer any science-related question. I'm considering using chatterbot library and some pretrained transformer models (via the transformers package). I was wondering if this is a good approach? Also, where can I get data that the model will train on?

#

I'm making use of pytorch as well

echo mesa
# past meteor I'd start doing one of those

and how would you balance it with learning mathematics? I assume that the root of my problem is i wanna understand everything very deeply and without high level mathematics i wont be able to do that, so i would need to just try building projects and learn mathematics and then as i learn more maths i understand more i assume

past meteor
past meteor
#

Aside from that I just took math courses in uni 🤷

#

You'll still get all of those

echo mesa
#

like when im learning polynomial function, im thinking about that it is needed for calculus and calculus is needed for machine learning

#

an approach in which i try to build something would be much nicer because i actually will get to see that what do i actually need in order to understand it

past meteor
#

Then you should really just go hard on Kaggle etc

#

You'll learn the math on the go

abstract wasp
#

Hi, do you guys know if there is a specific rule to follow for reshaping data? I’m looking at an example where the initial images are 28 x 28 x 3 and in the following lines of code the data is reshaped as so:
tf.keras.layers.Dense(units=7*7*32, activation=tf.nn.relu), tf.keras.layers.Reshape(target_shape=(7, 7, 32)),
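
The one hard rule is that a reshape only rearranges values, so the number of elements per sample must stay the same. That is why the `Dense` layer has exactly `7*7*32` units before the `Reshape` to `(7, 7, 32)`; the particular sizes are a design choice of that example. A quick check in plain Python:

```python
import math

def can_reshape(n_values, target_shape):
    """A reshape only rearranges values, so the element counts must match."""
    return n_values == math.prod(target_shape)

print(can_reshape(7 * 7 * 32, (7, 7, 32)))   # Dense output with 1568 units
print(can_reshape(7 * 7 * 32, (28, 28, 3)))  # 1568 values cannot fill 2352 slots
print(can_reshape(1048576, (2048,)))         # the mismatch from the Reshape error earlier in this channel
```

In Keras, `target_shape` leaves out the batch dimension, so the rule applies to each sample on its own.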

abstract wasp
thorn flame
onyx belfry
#

Does anybody know how I can use Python to determine how a protein will fold? I already have the start and stop codon working properly, now I need to do an entire polypeptide chain. I already have the tRNA conversion table that takes 3-letter blocks (A, C, U, G) to convert into the right amino acid. I don't know how to get the directions down, especially in 3D. What modules should I use, and how should I structure this?

agile cobalt
cerulean kayak
#

anyone know how to change the delimiter in a csv file to a semicolon? @ me if you know.

left tartan
cerulean kayak
cold osprey
#

Ctrl H

agile cobalt
#

technically you could use Find and Replace in VSCode or similar tools, but that would mess up any commas inside of quotes, like
```
title,quote
discord,"Welcome to our server, we hope you have fun"
```
would become
```
title;quote
discord;"Welcome to our server; we hope you have fun"
```

#

you're better off just specifying the separator when reading in python, both the built-in csv library as well as most if not all popular libraries like pandas support it
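
A minimal sketch of that with the built-in csv module, assuming a made-up two-line file held in a string (swap the `io.StringIO` objects for real file handles opened with `newline=""`):

```python
import csv
import io

# Hypothetical comma-separated input, standing in for a real file.
comma_text = 'title,quote\ndiscord,"Welcome to our server, we hope you have fun"\n'

# Parse with the csv module so quoted commas stay inside their field.
rows = list(csv.reader(io.StringIO(comma_text), delimiter=","))

# Write the same rows back out with a semicolon as the delimiter.
out = io.StringIO()
csv.writer(out, delimiter=";", lineterminator="\n").writerows(rows)
semicolon_text = out.getvalue()
print(semicolon_text)
```

With pandas the equivalent is the `sep` argument: read with `pd.read_csv(path)` and write back with `df.to_csv(path, sep=";", index=False)`.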

cerulean kayak
cold osprey
#

Yes

#

Lol

spice mountain
#

Basically, I have a dataset of norms of vectors.

At increments of Y distance from the center, I want to plot the probability of finding a norm of equal or smaller size. I want to plot it in 2D. How would I go about doing this?

tidal bough
# spice mountain Basically, I have a dataset of norms of vectors. I want to at increments of Y d...

So, like, an empirical CDF (ECDF) of the norms? You could use matplotlib's hist with cumulative=True, say.
Not sure what you mean about plotting it in 2d, though - if you plot ecdf(√(x^2+y^2)) by x,y it'd be rotationally symmetric, which isn't very interesting. You could do it, though (my approach would probably be calculating the ECDF as a cumulative histogram, then making a mesh, calculating the distance from the center of each point in the mesh, then using linear interpolation between the points in the ECDF to get the values for these distances (np.interp should be powerful enough for that, IIRC)).
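
A rough sketch of that recipe, assuming made-up data (norms of 1000 random 2D vectors) and using a sorted-array ECDF instead of a cumulative histogram. Linear interpolation smooths the ECDF's steps, so treat the values as approximate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: norms of 1000 random 2D vectors.
norms = np.linalg.norm(rng.normal(size=(1000, 2)), axis=1)

# Empirical CDF: after sorting, the i-th smallest norm gets probability (i + 1) / n.
sorted_norms = np.sort(norms)
ecdf = np.arange(1, len(sorted_norms) + 1) / len(sorted_norms)

# 1D view: probability of a norm <= r at evenly spaced radii.
radii = np.linspace(0.0, sorted_norms[-1], 50)
prob_at_radii = np.interp(radii, sorted_norms, ecdf, left=0.0)

# 2D view: evaluate the same curve at each mesh point's distance from the center.
axis = np.linspace(-sorted_norms[-1], sorted_norms[-1], 101)
xx, yy = np.meshgrid(axis, axis)
dist = np.sqrt(xx**2 + yy**2)
prob_grid = np.interp(dist.ravel(), sorted_norms, ecdf, left=0.0).reshape(dist.shape)
```

From there `plt.plot(radii, prob_at_radii)` gives the 1D curve and `plt.pcolormesh(xx, yy, prob_grid)` gives the rotationally symmetric 2D picture.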

spice mountain
#

Also, if I wanted to confirm whether or not latent representations that are geometrically close are also semantically close, and the same for their codebooks, how would I go about doing this...?

desert oar
spice mountain
#

The relationship isn't that simple anymore, because we discretize the latent representations into their codebooks

wooden sail
#

you need to include such a criterion as part of the cost when learning the latent representation, otherwise this is not in general the case

#

as a small example, things like LDA explicitly include the minimization of intra class variance and maximization of inter class variance in the latent representation, because simply using an SVD does not do this

#

even if both were to find a latent representation of the same dimensionality

desert oar
desert oar
# spice mountain It is for VQGAN.

i see, i didn't know the context. i don't know enough about how GANs work and don't have practical experience with them, so i'll butt out 🙂

wooden sail
#

or rather, not as helpful as they'd like

#

the correct choice of basis or transformation into the correct space or manifold is the secret sauce

#

in classification tasks, which GANs usually address through their discriminator, you formulate the cost explicitly in terms of correctly classifying the input based on a latent representation, which means that representation needs to preserve or enhance the class similarity. it promotes learning a "good" representation

elfin swan
#

Hi, I have this data. Variant IDs have ids separated with " | ". I need each variant id in front of each order id, how do I do it? Thanks

desert oar
wooden sail
#

right, that should be the case. i also didn't read the whole convo, so my comment was very generally about the task of finding a latent representation. if you're already choosing a representation properly, it should be the case that you get semantic similarity related to geometry

desert oar
#

i jumped in at the same point you did

#

but i think i see what you mean - word2vec was designed to produce embeddings like that, even if the actual learning task doesn't look like it

wooden sail
#

yeah

#

the embedding step is where this happens, based on a well chosen cost function

wooden sail
#

if you train it so that it can detect sentiment, for example, then it'll make embeddings so that words that convey similar sentiment are easy to discern

#

if the cost does not explicitly include distance in the embedding space, it might still not be the case that geometric similarity is the same as semantic closeness though. that depends on what is done after the embedding too. the remaining network layers can simply learn to decode the embedding, which then only needs to be "good enough" for the remaining layers

#

by that what i mean is that the choice of metric is as important as the choice of embedding, networks usually learn both in a black box fashion

#

e.g. two classes might actually be "parallel" to each other after embedding, but you can distinguish them by their norm instead. but if you don't know which metric to apply after embedding, then you wouldn't know this. a cosine similarity here would tell you they're the same class, but something like an SVM would still work
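
a toy sketch of the "parallel" case, assuming made-up 2D embeddings that share one direction and differ only in length. the norm threshold is a stand-in for any classifier that can use magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
direction = np.array([3.0, 4.0]) / 5.0  # shared unit direction for both classes

# Hypothetical embeddings: same direction, different lengths.
class_a = direction * rng.uniform(0.5, 1.0, size=(50, 1))
class_b = direction * rng.uniform(2.0, 3.0, size=(50, 1))

def cosine(u, v):
    """Cosine similarity between two 1D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine similarity across classes is 1 up to rounding: it cannot tell them apart.
cross_cos = cosine(class_a[0], class_b[0])

# A threshold on the norm separates the two classes.
norm_a = np.linalg.norm(class_a, axis=1)
norm_b = np.linalg.norm(class_b, axis=1)
threshold = 1.5  # hypothetical cut between the two length ranges
separable = bool(norm_a.max() < threshold < norm_b.min())
print(cross_cos, separable)
```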

#

broadly the embedding does encode similarity both in geometric and semantic sense, provided that the embedding is learned based on some sort of classification task. but usually you also don't know HOW it encodes it

desert oar
#

yeah, that makes sense. and good point about them being "parallel", i don't think i've encountered that before

#

i don't think it needs to be explicit in the cost function though, does it? again consider word2vec: my understanding of how it works is that words with similar meanings should appear in similar contexts. we'd expect a classifier to learn an embedding space in which the classes are well-separated. so a model being trained to predict the context from the word, or the word from the context, should tend to learn embeddings that create good separation between plausible words and implausible words for the context.

#

again i know word2vec isn't a great example because very smart people designed it for this particular task

wooden sail
#

it doesn't if you also learn the decoding along with the embedding, which you usually do

#

all you train for is the final "task": like "please find the sentiment behind these sentences"

#

and you establish an architecture that does this passing through a low dimensional representation

#

this forces the network to both learn a good representation, and then use it correctly

#

but done black box like this, you don't know which embedding, nor which decoding is done

#

if you swap the decoder, it will probably fail for the same embedding

desert oar
#

right. i never considered that you could just train embeddings without some kind of "task" at the other end

#

how would that even work?

wooden sail
#

text is probably a bad example since i think there is no classical approach for it that works remotely well

desert oar
#

true

#

maybe dimension reduction with PCA is the simplest example to consider?

wooden sail
#

but for example, we can think of fourier and SVD/PCA

#

yeah

#

you can get philosophical regarding the interpretation for those, but the task is in any case very simple: represent the original data with as little error as possible based on a low dimensional representation

#

this inherently carries no semantic meaning. through linalg you can give it geometric meaning, but not anything with like "real world connection" inherently

desert oar
#

but does it? surely there's some connection between distance in reduced PCA space and distance in the original space

wooden sail
#

through orthogonal projection, yes

#

but usually the original space was not useful for you in a classification task, which was the problem in the first place if you wanted to find another representation

#

what you care about is a different kind of distance not explicit in the original space

desert oar
#

in the classification context all you care about is creating separation between classes

wooden sail
#

but then the distance we care about is the distance between the classes the data belongs to, not between the original data

#

that's what i mean

desert oar
#

right, but that's not the same as semantic similarity between data points either

wooden sail
#

which is why PCA does not carry semantic info

#

nor class info

#

only geometric info from the original space the data is collected in

desert oar
#

are you assuming that distance is semantically meaningless in the original space?

wooden sail
#

yes, that's definitely the case

#

this is why there is more than one definition of distance

#

you endow the data with additional structure to be able to distinguish it, because looking at it naively does not work

desert oar
#

well sure. but that's not because there cannot be semantic meaning in the original data

wooden sail
#

as an example, in some tasks, the measurement data [0,0,0,0,0,0,0,1] and [0,0,0,0,0,0,1,0] are very similar, and in other cases they're completely different
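
a small sketch of how the answer changes with the chosen notion of distance. the "sensor bin" and "class label" readings are made-up interpretations for illustration:

```python
import numpy as np

a = np.array([0, 0, 0, 0, 0, 0, 0, 1])
b = np.array([0, 0, 0, 0, 0, 0, 1, 0])

# Treated as plain vectors, the two are orthogonal: nothing in common.
euclidean = float(np.linalg.norm(a - b))
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Treated as one-hot positions (e.g. neighboring sensor bins), they are adjacent.
position_gap = abs(int(np.argmax(a)) - int(np.argmax(b)))

# Treated as one-hot class labels, they are simply two different classes.
same_class = bool(np.array_equal(a, b))

print(euclidean, cosine, position_gap, same_class)
```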

past meteor
#

Yeah, this is one of the first things mentioned in a neural network class 😄 you're learning representations and a classifier at the same time

wooden sail
desert oar
#

right, i think we agree on that

past meteor
#

From an applied point of view I'd like to add that it's not fair to compare LDA and PCA. One is supervised and the other is not

#

While in statistics they're similar, in ML they fall in quite different places ime

wooden sail
#

meaning the data carries no class or semantic info if you don't first distinguish what kind of geometric object it is. which vector space and in which basis, or which manifold

#

that's part of what you learn when you look for a latent representation that is useful for classification and semantic seg.

#

it's not that the data doesn't carry the info in the original domain, it's that neither you nor the network know HOW it carries it

#

and euclidean space is usually not it 😛

desert oar
#

i think we're converging on the point where we agree

#

i was just about to say, there are a lot of scenarios when you can choose an appropriate semantically meaningful distance function without supervised learning of similarities

past meteor
#

Or rather, it's not that you can, it's that you have to

wooden sail
#

sure, and when you can do that, you can skip ML altogether, or use a model-based approach that requires less data, smaller networks, etc

desert oar
#

right

wooden sail
#

you should at least pick a proper metric to massage the network in the right direction

#

it can also be that choosing the wrong representation destroys the class/semantic info, too

past meteor
#

It depends what your downstream task is

wooden sail
#

assuming the structure of the data is low dimensional in the linear sense can lead you to do PCA and reduce the rank, which surely minimizes distance in the frobenius norm sense w.r.t. the original data, but can very well throw away all the semantic info
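
a toy sketch of that failure mode, assuming made-up 2D data where the class label lives on a low-variance axis and a high-variance nuisance axis dominates. rank-1 PCA (done here via SVD) reconstructs the data with small Frobenius error but keeps the wrong axis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n // 2)

# Hypothetical data: a high-variance nuisance axis and a low-variance class axis.
nuisance = rng.normal(scale=10.0, size=n)
signal = np.where(labels == 0, -0.5, 0.5) + rng.normal(scale=0.05, size=n)
X = np.column_stack([nuisance, signal])
Xc = X - X.mean(axis=0)

# Rank-1 PCA via SVD: the best rank-1 approximation in the Frobenius norm.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_rank1 = S[0] * np.outer(U[:, 0], Vt[0])
rel_error = np.linalg.norm(Xc - X_rank1) / np.linalg.norm(Xc)

# 1D scores along the kept component and along the discarded one.
kept = Xc @ Vt[0]
dropped = Xc @ Vt[1]

def gap(scores):
    """Distance between class means, in units of the overall standard deviation."""
    return abs(scores[labels == 0].mean() - scores[labels == 1].mean()) / scores.std()

print(rel_error, gap(kept), gap(dropped))
```

the reconstruction error is small, yet the class gap along the kept component is close to noise while the discarded component separates the classes cleanly.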

desert oar
#

the other piece is that, as in the example of word2vec, it's possible to construct learning tasks that implicitly result in semantically-meaningful similarity in the learned embedding space, without specifically doing something like triplet loss (although skip-gram negative sampling is very close to doing it explicitly)