#data-science-and-ml

calm cipher
#

also the first three pages are essentially a press release for how groundbreaking they think the paper is

small wedge
#

when I look at table 2 on the paper it mostly seems their test score goes up when they scale up the params

calm cipher
#

table 2 is this

#

oh yeah I forgot to post this one

#

the division between better than baseline/worse than baseline is kind of close to 50/50

#

yet out of the 1771 models, they only considered 106 good enough to put in the table

#

so it seems like those models got worse when they added more parameters?

small wedge
#

isn't that the nature of any architecture searching algorithm? especially a genetic algorithm style one like this seems to be

calm cipher
#

I'm still not sure what their selection criteria were, but it had to do with testing larger versions of the models

calm cipher
small wedge
#

yeah all the hype they're putting around it is pretty cringe

#

I see the word "revolutionary" in here like 8 times XD

calm cipher
#

but if they are using the training loss like it seems like they might be, then 1/3 of the score is essentially useless

small wedge
#

where are you seeing that?

#

oh

#

on table 2 the left columns

calm cipher
#

they use the term "training loss" throughout the paper

small wedge
#

I mean i guess if it's the same training set then it's a useful metric for comparing the models relatively

#

but yeah that's useless as far as measuring generalization

calm cipher
#

there's also the larger issue of whether it's just an overfitting machine

small wedge
#

isn't that why they show the test scores next to all that though?

#

or did they leave that out somewhere else

calm cipher
#

if they're throwing every possible combination of model at the wall, and testing it on the same benchmarks, then it isn't clear if the best model is good at reasoning or just the best at passing those specific benchmarks

#

this is a problem with regular ML research too

small wedge
#

I don't really see the alternative

calm cipher
#

set aside a bunch of benchmarks you don't use during evaluation

#

they kind of did that with the wiki dataset to be fair

small wedge
#

so you want them to do like a train/test/validation thing

calm cipher
#

it essentially looks like each benchmark is a test set

#

I mean it is, it's a set of tasks the model hasn't been trained on

#

if you look at it that way, they're making the mistake of using the test set to evaluate each of the different models, and so there isn't any test set left over to evaluate the final results

#

again, in the interest of fairness, it looks like they didn't use the wiki task as part of the score

small wedge
woven prairie
#

Like, I was thinking that when the user uploads their data, I would provide the metadata of that data and the user query to an LLM, which will convert the query into code. That code then runs against the CSV data, and its results are passed to another LLM, which will summarise or explain the output.

small wedge
#

it's two data points, if the model was just really good at the testing data then it might show up in the train score not being significantly higher compared to the others

#

since you'd expect a model that's good at generalization to score higher on both

calm cipher
#

I see what you're saying but I just don't see the value in evaluating the model on the training loss to the extent that it occupies 1/3 of the overall score used for model selection

#

that is always going to be the most favorable way to look at it

woven prairie
#

So , what can I do

#

The LLM only writes the code that runs against the data, and the results are then sent back to the user

#

It's like there is a person who writes code for your analysis

#

But I don't think an LLM can write complex code for complex business logic

#

How are you talking about vec search or old search, we are here talking about analysis not some pdf having questions answers

#

Aah 😵, you got what I need to make

#

Yes

#

Yes , but do you have any other approach

#

True , it makes a lot of mistakes

#

Yes

#

Ok mate thanks for your support

#

Yeah , but it follows a philosophy of garbage in garbage out

#

The right question is the thing we should ask it

#

+1

#

Yes

worldly dawn
#

There are many nuances between federation, distributed and the various ways to store data.
Plus you have to account for the technical skills of your users

young granite
#

im currently trying to finetune small models for img classification and was using resnet18/50 and a bunch of different efficientnet_bX models.
For faster benchmarking i train on smaller datasets on GPU to find good hyperparameters; if i find a promising run i apply it to the full dataset in CPU/GPU batch mode.
However the results differ significantly from one another and i can't find where my mistake is. i know they won't be equal, but i expected them to be within about ±5% at least.
Maybe someone can recommend other models as well which might work well on animal classification.

calm cipher
eager lance
#

how much statistics do i need to know for data analysis?

young granite
calm cipher
#

no I mean what is the difference between the dataset you're using for hyperparameter tuning versus the dataset you're using for the full training run

young granite
#

none

#

thats why i expected similar results 😄

calm cipher
#

you said the dataset you're using for hyperparameter tuning is smaller than the full dataset you're using for training?

young granite
#

yes but currently im in debugging to solve why i dont get similar results for same settings and only change was GPU -> CPU/GPU

calm cipher
#

might need more details, I'm not sure what you mean by the difference between GPU and CPU/GPU

young granite
calm cipher
#

is the GPU-only solution still batching?

#

there should not be a difference with an identical seed

calm cipher
young granite
calm cipher
#

that is probably the reason for the difference

#

especially if there is no difference in learning rate between the two approaches

young granite
#

yep but i didn't think it would be 30% 😄

calm cipher
#

How big is a batch and how much data do you have?

young granite
#

a batch in normal mode is 64 and i do have 40k img

calm cipher
#

let me make sure I'm clear on what's going on, when you're using GPU-only unbatched mode, you're using all 40k images at once in one backpropagation step?

young granite
#

no

#

i currently use 5k for gpu-only and 5k for cpu/gpu

#

to see if results align

calm cipher
#

ok so in your current gpu-only setup you're using all 5k images in one backpropagation step? and in the other setup you're giving batches of 64?

young granite
#

no it's all the same, sorry if i overcomplicated this.

    # Move all data to GPU
    print("Loading entire dataset to GPU memory...")
    all_images = []
    all_labels = []
    for i in tqdm(range(len(full_dataset)), desc="Transferring to GPU"):
        img, label = full_dataset[i]
        all_images.append(img.to('cuda'))
        all_labels.append(torch.tensor(label, device='cuda'))

    all_images = torch.stack(all_images)
    all_labels = torch.stack(all_labels)

    train_loader = DataLoader(train_gpu_dataset, batch_size=batch_size, shuffle=shuffle)
    val_loader = DataLoader(val_gpu_dataset, batch_size=batch_size, shuffle=False)

I still use batch_size even for full gpu mode but i previously cache the img to the VRAM

#

maybeee i found my problem while looking at this again

#

will this store the seed between different packages?

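import random
import numpy as np
import torch

def set_seed(seed):
    # Seed every global RNG the training code may draw from: torch (CPU), all CUDA devices, NumPy, stdlib
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    # Ask cuDNN for deterministic kernels and turn off its autotuner
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False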

or if i define it in main.py but have random_split be performed in the dataloader.py its not kept?

#

i thought this would be sufficient but maybe its not?

calm cipher
#

When is that called? you should only need to do it once

young granite
calm cipher
#

if you want to check whether they contain different data, try manually iterating over each dataloader and printing the contents of the first batch

#

you can make sure they're the same between different data loading configurations
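
A minimal sketch of that check, under assumptions: the toy `TensorDataset` stands in for the real image dataset, and `make_loaders` and `first_batch` are hypothetical helper names. It passes an explicitly seeded `torch.Generator` to `random_split` and to the shuffling `DataLoader`, so the split and the batch order do not depend on where in the project the global seed was set.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def make_loaders(dataset, seed, batch_size=4):
    # One explicitly seeded generator drives both the split and the shuffle order.
    gen = torch.Generator().manual_seed(seed)
    n_val = len(dataset) // 5
    train_set, val_set = random_split(
        dataset, [len(dataset) - n_val, n_val], generator=gen
    )
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, generator=gen)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader

def first_batch(loader):
    # Pull only the first batch so two configurations can be compared directly.
    return next(iter(loader))

# Toy stand-in data: 20 samples with one feature each and a binary label.
data = TensorDataset(torch.arange(20).float().unsqueeze(1), torch.arange(20) % 2)

a_train, _ = make_loaders(data, seed=0)
b_train, _ = make_loaders(data, seed=0)
xa, ya = first_batch(a_train)
xb, yb = first_batch(b_train)
same = torch.equal(xa, xb) and torch.equal(ya, yb)
print("first batches identical:", same)
```

If the two first batches differ for the same seed, the split or the shuffle is probably drawing from an RNG that was seeded at a different point in the two setups.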

young granite
#

good advice thanks for the suggestions!

#

its sad that pytorch doesnt feature hyperparameter tuning itself, hard to make a generic testing script for multiple models/settings 😄

calm cipher
#

I like PyTorch Lightning for things like that

#

I don't recommend retrofitting an entire existing project into it, but if you structure your ML project around it, you can get JSON config files and CLI parameters that map directly to model hyperparameters

#

so if you want to do hyperparameter tuning, you can just make a bunch of config files and run them

young granite
#

yeh i build my project from scratch with that in mind using .yaml and pydantic but its not working 100% of time and its a bit fuzzy 😄

frail meteor
safe agate
mighty knot
#

https://www.youtube.com/watch?v=Lakz2MoHy6o&t=1497s&ab_channel=TheIndependentCode

I'm currently using the video I linked to implement a CNN from scratch, but I'm a bit confused by how he uses biases. The video shows a single bias term per pixel per output. So if I have an input of 3 channels and want an output of 3 channels, I have 3 bias matrices (one per kernel) and each bias matrix has the width and height of the output. Other sources I have seen show one bias term per output, but not per pixel. So if I have 3 kernels, I have 3 biases, each being a single scalar.

Which is correct?

In this video we'll create a Convolutional Neural Network (or CNN), from scratch in Python. We'll go fully through the mathematics of that layer and then implement it. We'll also implement the Reshape Layer, the Binary Cross Entropy Loss, and the Sigmoid Activation. Finally, we'll use all these objects to make a neural network capable of classif...

calm cipher
#

I can't watch the video now but each kernel has its own bias

#

None of the parameters in a CNN are dependent on the size of the data, input or output

#

So yes if you have 3 kernels, and you want a bias term, then you have 3 bias terms

mighty knot
#

Gotcha. I don't know why he has it as a matrix with the same size as the output, it might just be me misunderstanding tho.

#

So the gradient of the bias for kernel i is equal to output i's gradient?
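
A small sketch for the scalar-bias case described above, in plain Python with made-up numbers (`bias_gradients` is a hypothetical helper name). With one scalar bias per kernel, the same bias is added to every position of output i, so its gradient is the sum of output i's gradient over all positions. Only in the per-pixel variant would the bias gradient equal the output gradient itself.

```python
def bias_gradients(output_grads):
    """output_grads[i] is a 2-D list holding dL/dY_i for output channel i.

    Returns one scalar per kernel: the sum of that channel's gradient
    over every spatial position.
    """
    return [sum(sum(row) for row in grad) for grad in output_grads]

# Made-up upstream gradients for two output channels of size 2x2.
output_grads = [
    [[1, 2], [3, 4]],
    [[1, -1], [0, 2]],
]
bias_grads = bias_gradients(output_grads)
print(bias_grads)
```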

patent oriole
#

Hi guys so i'm taking data science in college and i was wondering if there's anything i can get started with, i don't exactly know where to start

crimson jackal
#

Hello. Does JAX have a maximum value that it can handle?

#

If I do

import jax.numpy as jnp
jnp.isfinite(10**10)

I get an overflow.

wooden sail
crimson jackal
#

Ohhh it works... why?

wooden sail
#

have you ever programmed in c or c++?

crimson jackal
#

No. I want to learn it by now.

#

*but no

wooden sail
#

regular python is dynamically typed and makes its own choices of when to treat your operations as ints or floats

#

it also implements a thing like "big int" that allows integers to be arbitrarily large by giving them as much memory as they need

#

jax does neither of these two things

#

it likes sticking to C-like types, like int32 and int64. if you write 10**10, it will try to use ints and will have a maximum of 32 bits for them (this is the precision it uses by default, to be compatible with standard gpu arithmetic)

#

floats have a way of dealing with huge numbers by just treating it as inf or nan, but ints don't do this

#

10.0**10 explicitly says the number is a float
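
To make the range argument concrete, here is a plain-Python sketch (no JAX calls; `fits_int32` is a hypothetical helper) that checks whether an integer fits in a signed 32-bit type. The largest such value is 2**31 - 1, so 10**10 is out of range as an int, while the same value written as a float is fine.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def fits_int32(value):
    # True when the Python int can be stored in a signed 32-bit integer.
    return INT32_MIN <= value <= INT32_MAX

big = 10**10
print(INT32_MAX)          # upper bound of a signed 32-bit integer
print(fits_int32(big))    # 10**10 is larger than that bound
print(float(big))         # as a float the value is representable
```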

crimson jackal
#

Ohhh okokok

#

That makes sense.

#

Thanks. 🙂

graceful moat
#

this is so tuff

unkempt apex
patent oriole
#

oh

lapis sequoia
#

Is Bart fine tuned the same way as T5?

wet dome
#

where do you guys get datasets for machine learning projects?
Do you use kaggle or scikit-learn datasets or something else?

wet dome
small wedge
wet dome
#

ok thank you

calm cipher
#

I don't know if you're just starting out or want something a little more complex, but for iris, you use dimensions of iris flowers to predict its species

#

for penguins, you can use penguin dimensions to predict their sex or species

#

iris requires less initial preprocessing, it's pretty much ready to go as-is, but the penguin dataset requires a little more work to clean up

old rampart
#

I am here. If u d like , I d be happy to discuss development.

crimson jackal
#

When you jit code with jax, does it usually take a lot of memory?

#

I am using 100 GB and still get "memory limit exceeded".

#

Or is it my code that is badly written for jit?

agile cobalt
#

it'll vary wildly based on what exactly you are doing, but odds are you should try to split things into smaller batches (process in parts instead of all at once)

wet dome
#

How do i know if a dataset needs a more complex model? I'm currently using a linear regression model and it looks like I can only get up to about 62% r squared score

calm cipher
#

Are you concerned with the interpretability of the model at all? like is this a statistics project or a machine learning project?

#

if it's mainly a ML project you might consider moving past linear regression

#

yeah if it's just linear regression then aside from maybe tweaking your preprocessing steps, you'll have to just accept that whatever r2 you get is what you're going to get for that data

#

there's only so much you can get out of a straight line

#

but if it's a ML project, you might get a lot of mileage out of a support vector machine or decision tree-based model

#

yeah

#

also maybe modeling interactions

glass dome
#

im trying to draw a graph

#

but it's not working, can someone help me fix this

wet dome
#

Sometimes I don't know if it's me who's doing something wrong or the model is just out of its depth essentially

calm cipher
#

if the housing prices don't approximately follow a straight line then it won't work very well, or if there's a lot of variance around the regression line it won't work well either

obsidian echo
calm cipher
#

having said that you might be able to make it better by removing outlier housing prices, or only focusing on a specific subset of houses, or other things that are dependent on the data

wet dome
#

And cause its like 10 dimensional data I can't quickly visualise it to see if its linear

calm cipher
#

have you done any exploratory analyses of the data? that could be a good start, maybe some variables are better correlated with housing prices than others

#

there's a lot of digging around you can do

#

but also I do reiterate my recommendation to try other types of models, something like a decision tree or SVM would probably work better with the data as-is, whereas linear regression requires being more careful with the model inputs

wet dome
#

No I haven't tbh, I just threw my lin reg model onto it

calm cipher
#

well that's a good start, you have a baseline for comparing performance now

wet dome
#

I don't know too much about exploratory data analysis

calm cipher
#

what dataset are you using?

wet dome
#

California housing from kaggle

calm cipher
#

like is it on Kaggle or anywhere else online

wet dome
#

I just thought I would use all the features

calm cipher
#

ooh are you reading through Hands-On Machine Learning? Or did you find this independently

wet dome
#

I didn't think about making my own or exploring the data

#

No I'm not reading that book, I've covered a lot of the maths for linear regression so wanted to try it in code.

#

Should I "explore" the data before applying models

calm cipher
#

it helps, for a couple of reasons

#

actually I have a specific question for you now, what are you doing with the oceanProximity column?

wet dome
#

Why do they provide 10 features or so, if we shouldn't use them all then

#

I was a bit unsure what to do with that column, I mapped it to an integer. Ideally I would know if "island" is further from sea than "1H from sea" so would map it to an int correspondingly

calm cipher
calm cipher
#

you're describing ordinal encoding and I have often found it problematic

#

just because it implies an ordered relationship between different values that may not actually exist

#

that is, if NEAR BAY is 1 and <1H OCEAN is 2, that implies <1H OCEAN is more than NEAR BAY

wet dome
#

Yeah I wish the dataset told us how far each one was

calm cipher
#

I guess it could make sense but this is an example of trying different things to see how it affects linear regression, maybe it'll work better with one-hot than ordinal encoding, or maybe not
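
A minimal sketch of the two encodings being compared, in plain Python. The category list is an assumption based on the values mentioned in the chat and should be checked against the actual column; `ordinal_encode` and `one_hot_encode` are hypothetical helper names.

```python
# Assumed category values for the ocean proximity column.
CATEGORIES = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

def ordinal_encode(value, categories=CATEGORIES):
    # One integer per category; implies an order that may not be meaningful.
    return categories.index(value)

def one_hot_encode(value, categories=CATEGORIES):
    # One 0/1 column per category; no order is implied.
    return [1 if value == c else 0 for c in categories]

print(ordinal_encode("NEAR BAY"))
print(one_hot_encode("NEAR BAY"))
```

With one-hot columns a linear model learns a separate coefficient per category, instead of one slope applied to an arbitrary numbering.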

wet dome
#

That's why I was unsure what to map each string to

#

But how do I know when linear regression has hit its limit

calm cipher
#

you don't

wet dome
#

Even if I do all this encoding and data scaling for example

calm cipher
#

unfortunately that's how it goes with machine learning, you can only ever show that something performs better than something else

#

but you can't realistically show that you found the best possible

wet dome
#

So you just get to a point where you are happy with the results?

calm cipher
#

yep

#

that's what a lot of machine learning research does actually, a lot of major advances in ML have happened when someone discovered a technique that made models perform much better than before

iron basalt
#

This is why Javascript projects tend to bloat in memory usage and one of the reasons why modern websites will eat all your RAM with only a few tabs open.

#

However, it's probably not your JIT that is making you run out of memory for a single process (and you're not something that has multiple things running in it like a browser).

#

(You also probably don't have enough code for this memory difference to matter, modern websites are just way bigger than needed)

wet dome
calm cipher
#

You're trying to predict housing prices, right? it's a regression task

#

so an SVR, decision tree-based models like DecisionTreeRegressor, RandomForestRegressor, maybe k-nearest-neighbors like KNeighborsRegressor

wet dome
#

And just go with whatever gets the highest accuracy

calm cipher
#

Just make sure you have a training, validation, and test set, and don't touch the test set until you found the best one
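
A rough sketch of such a three-way split in plain Python. The 60/20/20 fractions, the seed, and the name `three_way_split` are assumptions for illustration. Models are compared on the validation part, and the test part is used once at the end.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle a copy with a fixed seed so the split is reproducible.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Row indices stand in for real samples here.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```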

wet dome
#

So this exploratory data analysis, should I be like making my own features and stuff

#

Like income per bedroom as a random example, I know that sounds useless,

calm cipher
#

some might be useful, I remember they made a few features in the book

#

if I remember correctly, something like "total rooms" was the total number of rooms in the entire census area, so you can divide it by "households" to get the average number of rooms per house
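
A tiny sketch of that derived feature in plain Python. The key names `total_rooms` and `households` and the numbers are assumptions for illustration, not values from the dataset; `add_rooms_per_household` is a hypothetical helper.

```python
def add_rooms_per_household(rows):
    # Return copies of the rows with the ratio added as a new feature.
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["rooms_per_household"] = row["total_rooms"] / row["households"]
        out.append(new_row)
    return out

# Made-up census areas.
districts = [
    {"total_rooms": 1000, "households": 200},
    {"total_rooms": 450, "households": 150},
]
enriched = add_rooms_per_household(districts)
print([d["rooms_per_household"] for d in enriched])
```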

wet dome
#

Alright I need to learn about eda,

calm cipher
#

also doing these steps may make a linear regression model perform better too

wet dome
calm cipher
#

here's an example of what I would do with the housing price dataset, and this is all hypothetical because I don't remember its characteristics very well

#

but let's say you look at the housing prices themselves. maybe most of them are grouped around a pretty clear range of values, but there's a small group of extremely high prices that are very different from the others

wet dome
#

Should all machine learning projects start with eda

calm cipher
calm cipher
wet dome
#

OK thanks you have been very helpful, you know your stuff

toxic palm
#

Hi,
Any course recommendations for beginner in AI?

grand minnow
toxic palm
toxic palm
#

i don't see any message which is specifically talking about courses for beginners

grand minnow
toxic palm
#

That is about ML, if i am not wrong?

grand minnow
#

Can you explain what you're looking for?

toxic palm
grand minnow
#

Do you wanna build an AI from scratch, like building your own models or do you wanna leverage existing models ?

grand minnow
# toxic palm i don't know the difference. i only know python. Just thinking to start in AI.

Then I think you should go through the "quick" lessons at https://www.kaggle.com/learn to understand what is AI and what is ML and what tools are normally used in them like Pandas

toxic palm
grand minnow
toxic palm
#

call existing models first

grand minnow
grand minnow
#

Or you could later on look at other Agentic AI frameworks like CrewAI, Langchain, Google's Agent Development Kit (ADK), etc

toxic palm
#

Guys,
when i am exploring reddit about what people are saying about how to learn AI, i found this one.
i believe there are some AI expert here. could you please put your comments in it

#

There are also opposite comments as below

grand minnow
lime grove
#

does anyone here have any background in supervised fine tuning ?

#

I am hoping for some guidance on a good starting point

calm cipher
#

what do you have so far? Or what are you wanting to do?

scenic parcel
#

I like to learn by doing, anyone have any project ideas I could use to learn RL?

small wedge
#

things like shooting a projectile at a moving target, learning to steer a car, anything simple like that is great to start

#

and very visually interesting to watch since you can have hundreds of agents attempting a task at once

scenic parcel
small wedge
#

you still train on a policy where agents are given reward based on their interaction with an environment

#

you just optimize by selecting the best after mutation and crossover rather than directly calculating updates like in a gradient-based method like PPO or deep Q-learning
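
A minimal sketch of that select, crossover, mutate loop in plain Python. Everything here is an assumption for illustration: `fitness` is a stand-in for the total reward a policy would collect in an environment, and the population size, mutation scale and function names are arbitrary.

```python
import random

def fitness(genome):
    # Stand-in for "reward from running this policy in the environment".
    # Higher is better; the best possible value is 0 (every gene equal to 0.5).
    return -sum((g - 0.5) ** 2 for g in genome)

def crossover(a, b, rng):
    # Single-point crossover: prefix from one parent, suffix from the other.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rng, scale=0.1):
    # Add small Gaussian noise to every gene.
    return [g + rng.gauss(0, scale) for g in genome]

def evolve(pop_size=30, genome_len=4, generations=40, elite=5, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.uniform(-1, 1) for _ in range(genome_len)] for _ in range(pop_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[:elite]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        # Elitism: the best genomes survive unchanged into the next generation.
        population = parents + children
    return best_per_generation

history = evolve()
print(history[0], history[-1])
```

No gradient is computed anywhere. The only feedback is the fitness score, which is why the same loop can sit on top of almost any policy representation.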

scenic parcel
# small wedge why would you think that?

Because I've tried to understand the difference between them before https://medium.com/xrpractices/reinforcement-learning-vs-genetic-algorithm-ai-for-simulations-f1f484969c56

Fundamentally, the operating principles of the two approaches are different. RL uses Markov decision processes, whereas GA is largely based on heuristics. The value function update in RL is a gradient-based update, whereas GAs generally don’t use such gradients.

small wedge
#

in all the literature I've read it's talked about as a form of reinforcement learning

#

never have I heard that policy gradient methods are the only form of reinforcement learning

scenic parcel
#

Interesting, the author could be some opinionated rando lol, let me look. I think I found his linkedin, seems to be an average dev

onyx holly
#

Hi, I have a parquet file, and I am unable to load the data. It says the maximum memory limit has been exceeded, probably due to low RAM. I currently have 16GB . How can I see the data in any other way?

small wedge
#

this paper frames them all just as methods of optimization that can be used on RL problems, where the RL problem is just an environment and policy that are used to optimize along rather than a dataset and/or labels

scenic parcel
small wedge
#

if you want a starting gradient-based RL project though, pytorch has a DQN tutorial for the CartPole game

iron basalt
small wedge
#

once you know that it's just a matter of setting up whatever novel environment you like and tuning your policies

onyx holly
scenic parcel
# onyx holly let me check then

try something like (though tbh idk if I've ever tried to read a parquet file larger than my memory with duckdb, maybe you'll have to use chunking, or dask I think is meant for this)

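import duckdb
# read_parquet returns a lazy relation, not an in-memory DataFrame
rel = duckdb.read_parquet('my_file')
# only materialize a small preview instead of the whole file
print(rel.limit(10))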
small wedge
small wedge
#

since you could hypothetically apply genetic algorithms to supervised learning problems

#

it's just a method of optimizing through the search space, guided random search basically

#

if the guide is a fitness score given from a policy after interacting with an environment, that's 100% RL in my eyes

scenic parcel
#

Seems like that paper frames GA as an alternative to RL that can solve the same problems

#

I think I heard that GA can be worse because the starting search space is too big?

small wedge
#

not better than RL

small wedge
#

there are a lot of benefits to non policy gradient methods as well as downsides

#

for example, it's less prone to getting caught in local minima, and there are no saddle points or other wacky gradient stuff to deal with

#

there are special encodings you can use when framing populations as genomes

#

that are very efficient

#

the paper demonstrates a lot of this

#

and argues that GA/ES could be competitive at large scales

iron basalt
# small wedge what definition of RL are you using?

Not needed, it's plainly obvious from GA working with a population, while RL says nothing about that. A single robot can do RL in real life (a single person can too), but a single robot/person can't do GA. Another important difference is that GA is blackbox optimization. Putting it in the same room as particle swarm optimization and simulated annealing.

scenic parcel
#

So I can still apply just regular RL to try doing projects like "shooting a projectile at a moving target, learning to steer a car" to learn policy gradients? The visual aspect does indeed sound cool, maybe I'll hook it up to some game like minecraft or pong or something lmao

small wedge
#

do you realize how disingenuous it is to say "I don't need to define words because it's plainly obvious" how can it be plainly obvious if we don't share the same definition of the word

iron basalt
small wedge
#

but my definition of RL doesn't include anything about population dynamics, just about how you evaluate the agent on the problem

dusty glade
#

yo guys

#

can you rate my obfuscator LOL (pyc to py which is op)

#

what yall think?

iron basalt
scenic parcel
small wedge
#

For example, methods such as
genetic algorithms, genetic programming, simulated annealing, and other opti-
mization methods have been used to approach reinforcement learning problems
without ever appealing to value functions.

#

page 23, section 1.4

iron basalt
#

Alright let me take a look.

iron basalt
#

You can use GA to approach RL problems, that is commonly done.

small wedge
#

agreed

iron basalt
#

For example, neuroevolution is common.

#

But the important thing here is that GA is basically a blackbox optimization you can pretty much always slap on top of anything else.

#

It's basically like once you have something that works you can throw more computers at it.

#

Or you can do it raw without anything else.

small wedge
#

well yeah to my original point it being "blackbox optimization" is one of the reasons that it is so easy to do beginner projects with

#

you really only have to focus on the architecture, environment parameters and the policy

iron basalt
#

RL on its own, unless you specifically mean multi-agent RL, which is its own thing with its own formal specification, focuses on just one agent, as can be seen in the book when it's first formalized (mathematically) in chapter 3.

#

The origin of RL is optimal control theory, which is about feedback loop optimization. RL is basically a rebranding of it borrowing framing from biology.

#

(Of Modern RL, the concept was used prior without really thinking of it in this modern formalization)

#

I will note that in the most general sense, you can use the mathematical framework of RL (ignoring whether it's even about a single agent) to give a general definition of AI that all AI would fall under (anything intelligent that makes decisions), but that is only one of several equivalent ways to give a formal definition of AI. If you want to say that GA falls under RL via that route, it's possible, but it's probably not a well accepted or common answer (as every AI is then RL (or well, an approximation of the most perfect RL agent possible)).

#

(See AIXI)

#

The upsides of GA are that it's blackbox, it works really well, and it can be simple to code. The downsides are that it's compute intensive and not feasible in many situations. Also, depending on the problem, it can be hard to code, for example if you have some simulated environment and now you need to set up a fancy distributed compute cluster that can run N simulations in parallel and synchronize and all that.

iron basalt
# small wedge well yeah to my original point it being "blackbox optimization" is one of the re...

I agree it's a good beginner project and I recommend everyone tries it. I have found that people often don't really understand/feel how evolution actually works until they have coded a genetic algorithm (without this, it feels too magical and can be hard to believe that it's a thing (after, it feels obvious that it must be a thing (like you could see this happening randomly just by random physical systems))).

iron basalt
#

GA is very strong (pretty much everywhere it's been used (real world problems) it gave amazing results).

#

It can also be mixed with other methods (used in addition to), so it's not a "one or the other" in many cases.

scenic parcel
iron basalt
#

You can also use non-realtime turn based games.

#

Simple grid-world games.

#

That way you can use tabular RL, no neural networks or any of that needed.
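
A small sketch of tabular Q-learning on a toy grid-world, in plain Python. The environment is an assumption for illustration: a 5-cell corridor where the agent starts in cell 0 and gets a reward of 1 on reaching the last cell. The learning rate, discount and episode count are arbitrary choices.

```python
import random

N_STATES = 5          # corridor cells 0..4, the goal is the last cell
ACTIONS = (-1, +1)    # index 0 steps left, index 1 steps right

def step(state, action):
    # Move within the corridor; walls clamp the position.
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore uniformly at random; Q-learning is off-policy,
            # so it can still learn the greedy values from this behaviour.
            a = rng.randrange(2)
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy action index per non-terminal cell (1 means "step right").
greedy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

The whole "model" is the small `q_table` list of lists, which makes it easy to print and inspect after every episode.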

gritty vessel
#

Hello I have a doubt

#

In transformers can we decrease the patch size

#

Like in the original paper it's 16 x 16

#

And then after flattening we will get 16 x 16 x input_channels

spring field
#

though if you train it on the same dataset as the paper, you'll likely not get the same results

gritty vessel
#

Is there a way to see how complex it will get?
Embedding dim will be 24 (2×2×6).
But the total patch count will be much higher
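
A quick back-of-the-envelope sketch in plain Python of how the patch size changes the sequence length. The 224-pixel square image is an assumption (the actual image size is not stated in the chat), and `patch_stats` is a hypothetical helper.

```python
def patch_stats(image_size, patch_size, channels):
    # Number of patches and flattened patch length for a square image.
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim

# 16x16 patches on a 3-channel image vs 2x2 patches on a 6-channel image.
for patch, channels in [(16, 3), (2, 6)]:
    n, d = patch_stats(224, patch, channels)
    # Self-attention compares every patch with every other patch.
    print(f"patch={patch}: {n} patches of dim {d}, {n * n} attention pairs")
```

Since self-attention cost grows with the square of the patch count, shrinking the patch size is much more expensive than the small flattened dimension suggests.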

gritty vessel
spring field
gritty vessel
#

I mean how much gpu it will use

#

Is there a way to check that?

spring field
#

you can get the number of parameters in the model, multiply that by your chosen precision (ofc, if you use mixed precision, you'd need to account for that as well) and that's roughly how much VRAM you'd need
but backprop might need some memory on top as well
I mean, experimentally you can just try running an experiment and see how much memory it actually takes up and then scale accordingly
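
A rough sketch of the parameter-count estimate in plain Python. `weight_memory_mib` is a hypothetical helper and the 10 million parameter figure is made up. It covers the weights only, so gradients, optimizer state and activations come on top of this number.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weight_memory_mib(num_params, precision="fp32"):
    # Lower bound: memory for the weights alone, in MiB.
    return num_params * BYTES_PER_PARAM[precision] / 2**20

# Hypothetical model with 10 million parameters.
for precision in ("fp32", "fp16"):
    print(precision, round(weight_memory_mib(10_000_000, precision), 1), "MiB")
```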

gritty vessel
#

Ok

#

Thank You I will try that

charred totem
#

Anyone here learning data analytics?

proven pier
#

I'm trying to do some simple binary classification of images. Does having a smaller resolution than I normally would affect the results? Actually I am a bit confused. The data I'm working with is 1920x1080, but it doesn't seem "HD", as in there seems to be some blurring artifacts going on. It was sampled from a video stream, so I don't know if the encoded data had a lower quality to it

#

Presumably a 1080p stream is high quality/clear

proven pier
#

I'm thinking of using rust for my language on this project, is it best to do all of the training in python and then run the models in rust from there?

#

Eventually I might try "full stack" in rust (in the sense of training/running), but for now I am pretty much just thinking about training in python and using in rust. Mainly I want more experience in ML as well as Rust. Combining the two doesn't seem like the "best" solution, but working together can help me some

#

The library I'm using right now is essentially libtorch rust bindings. And I'm only providing access to dynamic linkage through my python virtual environment. All in all, it would seem pretty plug and play, pytorch is just python bindings for the C or C++ torch library eh?

#

My command to even run the binary is LD_LIBRARY_PATH=./venv/lib/python3.13/site-packages/torch/lib LIBTORCH_USE_PYTHON=1 cargo run --bin my-bin 😂 for some reason LIBTORCH_USE_PYTHON=1 is not enough, it cannot find the libtorch_cpu.so which is in that dir. For some unknown reason

proven pier
#

If I have more samples of one dataset classification than the other, should I discard that data? I've read there's apparently biases that can be formed in training. To exaggerate, if 99% of the samples are one data type, then the model will be trained to simply always predict that classification. As more often than not it will be correct, without any further "analysis" or scrutiny

#

Right now my samples look like 70% of type A and 30% of type B

#

I probably shouldn't be concerned with this at this stage just yet, I'm still setting up my training routine. It was just something on my mind

#

I have other work in rust already, it's moreso keeping that momentum moving forward lol

#

I have a lot more python experience. Not that I'm an expert, but I have been moving some scripts over to Rust just for further experience. I write so much python out of convenience, maybe if I had that familiarity with Rust I would do a lot more with it also out of convenience. Its build system and library imports are pretty much as streamlined as python's. Both of which are a breath of fresh air from managing build systems with C/C++

jaunty helm
#

on a related note - I've never had luck with oversampling tricks like SMOTE where the only thing it seems to do is increase training time and reduce model performance

proven pier
#

I will keep those in mind. It makes sense to penalize more potentially if it guesses wrong against a class B scenario
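
A minimal sketch of the class-weighting idea, assuming the 70/30 split mentioned above and the common inverse-frequency heuristic `n_samples / (n_classes * count)`. The label names and counts are illustrative, not taken from the actual dataset.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical dataset: 70% class "A", 30% class "B"
labels = ["A"] * 70 + ["B"] * 30
weights = inverse_frequency_weights(labels)
print(weights)
```

The rarer class B ends up with the larger weight, so mistakes on B cost more. In PyTorch, per-class weights like these can be passed as the `weight` argument of `nn.CrossEntropyLoss`.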

#

Overall I do not think my training application is that complex. class A is "everything else", and class B is a very distinct sample. Essentially just training a model to detect if B is present or not

ancient moss
#

Hey 👋 guys. I am learning Python to move into a Data Scientist role, and I have found Python interesting. I have been sharing small topics I learned by writing blogs; you can read my recent blogs on Medium

Here is link to my latest blog

https://medium.com/@buildwithmobi/concatenation-in-numpy-2ea97b290f2f

I hope you would have wonderful time exploring what I explored, and the way I explained things 😇.

Medium

Have you thought about how to concatenate Numpy’s arrays? Or you haven’t got any clue! Don’t worry I am here to help you with it 😃.

proven pier
#

I appreciate you doing some heavy lifting and reading more into it

#

A lot of jaded people probably assume most of AI stuff is nonsense marketing. At least you sounded impressed at first before looking further and changing your mind

proven pier
#

I see a lot about using the suffix .pt or .pth for pytorch models. Why not something more generally geared towards Torch itself? Isn't Torch written in C++ after all?

#

I just wonder why not .tch or something

tidal bough
#

That sounds like a really hard question to find an answer to (you'd probably want to track down the first usages of that format and see what the reasoning was), but also, note that pytorch is overwhelmingly more popular than libtorch, so few people care how its internals are written

proven pier
#

But I can see how it is used in python at probably a 99% rate so it's not a big concern. In the end it's better to stick to conventions instead of unnecessarily forking logic like that

proven pier
#

I'm running into a constant problem area in my model architecture that I am unsure of. I am following along with an example. I won't paste the code directly because it would be needlessly long. I will describe the exact line I have had problems with, and also another line that gives me confusion.
In the __init__, I notice that the in_channels and out_channels of consecutive CNN layers match, as expected: 3x16, 16x32 and 32x64 chain together like valid matrix multiplications. However, for the first fully connected layer, the matrix math does not line up for me. nn.Linear(in_features = 64 * 6 * 6, out_features = 500) would be 32x64 followed by (64*36)x500.

The magic is in the forward call, but the max pools don't seem to alter the matrix algebra in this sense? X.view apparently reshapes the tensor. But during training, I had to swap 64*6*6 for 64*7*7 for it to work. And now I'm trying to test a single image at a time and it's not working for me, even though it worked during batch training

#

RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x49 and 3136x500), where 3136 = 64*7*7

calm cipher
#

what are the image dimensions?

proven pier
tidal bough
#

they alter them like this, to be precise:

#

and plausibly, the sizes here work out to W=H=6, making the example correct

proven pier
#

I'm stepping through right now to see how dimensions have changed

#

Okay, I have figured it out. When I don't add in batches, that entire dimension is lost

tidal bough
#

Yeah, generally when processing a single image you must still have size 1 along the batch dimension, or bad things happen

proven pier
#

By the time I get to that X.view() it is not using the batch dimension, and compresses the other ones. Which would multiply out to 64*7*7

calm cipher
#

Yeah the image coming out of the CNN is 64x7x7, not 64x6x6

#

it comes out as 64x6x6 if you remove the padding
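
The 7-versus-6 difference can be checked with the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. A rough sketch, assuming three 3x3 conv layers each followed by a 2x2 max pool, and a hypothetical 62x62 input (the actual image size and kernel sizes were not posted):

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def feature_map_size(n, conv_padding, stages=3):
    """Apply `stages` rounds of 3x3 conv (stride 1) + 2x2 max pool (stride 2)."""
    for _ in range(stages):
        n = out_size(n, kernel=3, stride=1, padding=conv_padding)  # conv
        n = out_size(n, kernel=2, stride=2)                        # max pool
    return n

side = 62  # assumed input height/width, for illustration only
with_padding = feature_map_size(side, conv_padding=1)
without_padding = feature_map_size(side, conv_padding=0)
print(with_padding, without_padding)  # 7 6
```

With 64 output channels, that gives `in_features = 64 * 7 * 7 = 3136` with padding and `64 * 6 * 6 = 2304` without.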

proven pier
tidal bough
#

pytorch has quite nice docs

proven pier
#

That, they do

calm cipher
#

or something like that

proven pier
#

I'm using different images than they are, I sort of assumed that would answer the slight difference there

#

As I noticed my resolutions dividing and flooring. I could see them being off by a couple integers once it's compressed a lot

#

I also never normalized my images which I should probably look into lol. I didn't need to resize them because I've already applied cropping way earlier

#

So apparently I can use the tensor unsqueeze(0) method to insert a "singleton dimension" at the front

#

Now I wonder if it's best practice to put that logic in my forward method, or to preprocess data before entering my model? I presume the latter for performance, and the former for simplicity. Because when training on batches, it will always have to evaluate the conditional dims == 3 --> unsqueeze branch

calm cipher
#

Oh you mean to batch it in the first dimension?

#

I don't know if there are best practices around this but I would have the model assume it's always batched, and require data to be inputted as batched

proven pier
#

Yeah unsqueeze(0) will prefix my dimensions with a 1x, aka the "singleton dimension"
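
A small shape-only sketch of why the unbatched call failed, assuming the flatten step is something like `x.view(x.size(0), -1)` (the actual line was not posted). It uses plain tuples rather than tensors, so the helper names here are made up and are not PyTorch API:

```python
from math import prod

def flatten_keep_first(shape):
    """Mimic view(shape[0], -1): keep the first dim, merge the rest."""
    return (shape[0], prod(shape[1:]))

def unsqueeze0(shape):
    """Mimic unsqueeze(0): add a singleton batch dimension in front."""
    return (1,) + tuple(shape)

unbatched = (64, 7, 7)  # channels, height, width
wrong = flatten_keep_first(unbatched)              # channels mistaken for the batch
right = flatten_keep_first(unsqueeze0(unbatched))  # batch of one
print(wrong)  # (64, 49)
print(right)  # (1, 3136)
```

The `(64, 49)` shape is the `mat1` from the error message above, while `(1, 3136)` lines up with the 3136x500 linear layer.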

proven pier
#

Do I need to .gitignore my models? How should I store or potentially version control that information?

#

I can imagine some disconnect between editing actual code, and training new models. Potentially using new or somehow different data. All of which are outside the scope of lines of code in some nature

calm cipher
#

you mean .gitignore the pytorch module you wrote that includes the model?

proven pier
calm cipher
#

oh yes gitignore that

proven pier
# calm cipher oh yes gitignore that

That's what I'm doing. But I presume there stands good reason to somewhat version control models? Maybe document their training process? Or is it always use the very newest, and it can never go wrong? Lol

calm cipher
#

in general you don't want to commit binary files to git repositories

#

you can, but the overall design of git doesn't lend itself to that, plus it makes checking out the repository take longer and be more bandwidth intensive for something not everyone might want

#

if you ever want to distribute the model you can use git LFS, or upload it somewhere else

proven pier
#

I figured I would have a separate repository for version controlling the models. As to not pollute the main programming repositories

#

And I also figured it wouldn't be pushed too often, only when surpassing large milestones from altering the training process or the architecture/technologies used. Otherwise I wouldn't commit if it was a fruitless effort

proven pier
#

I'm a bit confused on the following pytorch example that performs various transformations on an image. Particularly, I'm interested in the Normalize method. I thought the printed image would lose a lot of contrast from normalization, which would affect how it appears in plot([img, out]) display. I suppose the coloring looks slightly off.

calm cipher
#

I wouldn't expect normalizing to massively change an image, everything should still retain their values relative to one another

#

the only major change would be if one channel had a very large amount of one color

#

say the image was extremely red to the point that the red channel always had some value greater than 0

#

looking at that example, assuming it's RGB, the red and green channel both have a mean of about 0.5, but the blue channel is closer to 0.4, which means it should be pulled up a little bit in the final output

#

and actually the output does look a little blue to me

proven pier
#

I guess I'm a bit confused because RGB is 0-255 but the normalized values are being set to 0.5ish and the tensor datatype is treated as a Float instead of a u8 byte

calm cipher
#

the image could have been imported as floating point values between 0-1, where if interpreted as an integer-valued RGB color space, 0=0 and 1=255

#

that happens with audio data sometimes too
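
To make the value ranges concrete: `ToTensor()` scales 8-bit pixel values from 0-255 to floats in 0-1, and `Normalize(mean, std)` then computes `(x - mean) / std` per channel. A plain-Python sketch of that arithmetic, assuming a mean and std of 0.5 (values chosen for illustration; the statistics in the example may differ):

```python
def to_unit_range(pixel):
    """Scale an 8-bit pixel value (0-255) to a float in [0, 1]."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Per-channel normalization: (value - mean) / std."""
    return (value - mean) / std

for pixel in (0, 128, 255):
    scaled = to_unit_range(pixel)
    print(pixel, round(scaled, 3), round(normalize(scaled), 3))
```

With these settings the 0-1 range is mapped to -1 to 1, which is why the tensor holds floats rather than u8 bytes. A plotting helper that clips values to 0-1 would then show a shifted image rather than a low-contrast one.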

proven pier
#

Apparently the maintainers of tch-rs are not considering supporting auxiliary PyTorch libraries such as torchvision. I wonder what the implications are on my end. For instance, I am doing 2 transformations on data. I used torchvision to run the ToTensor() operation on a PIL imported image, and then Normalize the data. I did manage to apply cropping via the PIL library itself, and believe I could do that outside of torch. I figured it would be most optimal to do it all natively

GitHub

Rust bindings for the C++ api of PyTorch. Contribute to LaurentMazare/tch-rs development by creating an account on GitHub.

#

Only ran into this issue because I wanted to randomize a torch tensor then normalize it. But what I noticed is normalizing it forced CPU operations which was mismatched with my GPU tensors, and I could not find a way to address this. But this is more specifically a rust problem

#

I suppose the torchvision.transforms.v2.Compose I'm using creates a nn Module itself. Maybe I can simply export that as well 🤔

short moth
#

can someone help me with this regex (in pandas dataframe):
i have a string patterns like this in one column:
A58P2
PL4P1
MPE5P2
JB3P2
...
I am trying to split this string at the italicized P:
A58*P*2
PL4*P*1
MPE5*P*2
JB3*P*2
I tried to do

pandas.Series.str.split(r'^[A-Z]+[0-9]+(P|J).+$', regex=True)

but it is giving me an error: string index out of range

#

basically the pattern I see is one or more occurrences of alphabetic letters, then one or more occurrences of digits, followed by a P or J and anything after that

#

thanks for helping!

#

btw after testing my regex, it does what it is supposed to do. it can match the pattern given

serene scaffold
#
In [13]: s
Out[13]: 
0     A58P2
1     PL4P1
2    MPE5P2
3     JB3P2
dtype: object

In [14]: s.str.extract(r"(\w+)(P\d+)$")
Out[14]: 
      0   1
0   A58  P2
1   PL4  P1
2  MPE5  P2
3   JB3  P2
short moth
#

oh

#

i needed extract. yes

#

thank you very much
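
For reference, the same split can be reproduced with the standard `re` module, which also shows why the `split` attempt came back empty. This is a sketch using the four sample strings from the question and the pattern from the `str.extract` answer:

```python
import re

codes = ["A58P2", "PL4P1", "MPE5P2", "JB3P2"]

# Same idea as the str.extract call: a greedy prefix, then the final P + digits
pattern = re.compile(r"(\w+)(P\d+)")

pairs = []
for code in codes:
    match = pattern.fullmatch(code)
    pairs.append(match.groups() if match else None)

print(pairs)

# The original pattern matches the entire string, so splitting on it
# leaves only empty strings around the captured group.
leftover = re.split(r"^[A-Z]+[0-9]+(P|J).+$", "A58P2")
print(leftover)
```

`split` treats the pattern as the separator, so a pattern that consumes the whole string leaves nothing on either side; `extract` (or capture groups) is the tool for pulling pieces out.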

opaque condor
#

Is this where I can talk about matplotlib?

calm cipher
opaque condor
#

The reason why is so I can make my own package using class which will be shown later at some point

serene dew
calm cipher
opaque condor
calm cipher
calm cipher
#

you want to make that visualization?

lapis sequoia
#

Fine tuning T5 jojos bizarre adventures? What’s up? What am I in for?

lapis sequoia
opaque condor
late star
#

Hello guys, I'm planning to buy a new laptop to start my journey into the data science world. The laptop I have in mind is this HP EliteBook 645 G9, AMD Ryzen 5 PRO 5675U hexa-core (6 cores) 2.30 GHz, 16 GB RAM, 256 GB SSD. Do you think it is enough to work with all the data science tools?

serene scaffold
#

I probably wouldn't even get one with a GPU so that you can use the savings to rent cloud compute.

iron basalt
#

Avoid any "gaming laptops" with a dedicated GPU, laptops can't properly cool those to get the most out of them (and tend to have issues with Linux), integrated is the way to go.

calm cipher
#

I'm pretty sure OpenAI's main goal as a company is regulatory capture

#

this is less hype and more trying to sell themselves as both making something so dangerous that it needs to be regulated, and also trustworthy enough that they should be doing the regulating

#

i guess it's a kind of hype too

#

ok three things then, trying to sell themselves as both making something so dangerous that it needs to be regulated, and also trustworthy enough that they should be doing the regulating, and brilliant enough that they created something so good that it's dangerous and needs to be regulated

proven pier
#

Hey I have successfully trained some binary image classifier model (potentially it has been overtrained, but not a concern right now).
I'm getting upwards of 98% accuracy from supervised learning. I am using PyTorch, and I have exported my models to be used in tch-rs rust bindings for libtorch. After reviewing the design it appears to be mimicking the flow of data properly, except I seem to lose 20-25% on inference in Rust vs in Python

#

How can I tackle this problem? Are there some ways to log the flow of data in a succinct manner where I can compare it across a run in PyTorch then look at the same run in tch-rs and see where things differ?

#

I can seed randomization in the selection of a test set, but I'm not sure if seeding randomization is needed for model inference? I suppose for my particular model there is no RNG control during runtime evaluation. ie, no Dropout layers. Only "randomness" I am using is shuffling validation set as previously mentioned

peak thorn
#

What amount of storage is included in AWS Free Tier, and what is the maximum disk image size I can upload or use under the Free Tier limits?

quaint mulch
#

Make sure weights stay the same?

#

Do one forward run of a single identical image?

#

check the activation at every layer?

calm cipher
proven pier
proven pier
#

I'm just wondering if there is some logging feature that when I do any sort of tensor mathematics it will mark it down. And therefore I could verify the flow of data

calm cipher
#

You could also just print out manageable portions of outputs in between layers
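
One way to do that comparison is to dump each layer's output for the same input from both runtimes and look for the first layer where they diverge. A sketch in plain Python, assuming the activations have already been exported as lists of floats keyed by layer name (the layer names and numbers below are made up):

```python
import math

def first_divergence(reference, candidate, rel_tol=1e-4, abs_tol=1e-6):
    """Return the first layer name whose values differ beyond tolerance, else None."""
    for name, ref_values in reference.items():
        cand_values = candidate.get(name)
        if cand_values is None or len(cand_values) != len(ref_values):
            return name
        for a, b in zip(ref_values, cand_values):
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return name
    return None

# Hypothetical dumps from the Python run and the Rust run
python_run = {"input": [0.10, -0.25], "conv1": [0.52, 0.00], "fc1": [1.30, -0.70]}
rust_run = {"input": [0.55, 0.375], "conv1": [0.91, 0.12], "fc1": [2.10, 0.05]}

diverged_at = first_divergence(python_run, rust_run)
print(diverged_at)
```

If the mismatch already shows up at the input, the preprocessing on the Rust side (for example a missing normalization step) is the first thing to check, before the weights or the layers themselves.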

proven pier
#

I didn't explicitly set types anywhere in Rust I believe. At least the underlying tensor types

#

Would loading in my exported model not use the right datatype? like f32 vs f64?

#

The tensors in my model all seem to have the same size. I'm having trouble getting it to print out the weights in the Rust code like it does with named_parameters() in the PyTorch code

proven pier
#

I am prodding around docs and the repository to see if I can find answers somebody left behind

#

Okay I actually did it by looping through and printing verbosely. Was hoping there was a nicer version like the PyTorch way

#

Okay I'm noticing some strange scaling differences. Sometimes the scale seems correct, but sometimes it's off by a factor of 100x

#

Like a direct multiplication

#

Seems like all of my "bias" values are scaled up by 100x

#

I loaded it in, exported via torch.jit.trace

#

The normal weights seem fine, but the biases are all 100x their value exactly

#

Maybe slightly off from resolution losses at the lower original level

#

Oh, wait.. Nevermind I'm seeing some extra confusing stuff in front of each bias lol

#

Notice that inconspicuous 0.01 *. I guess this is their way of displaying "scale these down by 0.01" with scalar multiplication ofc

#

Sorry, the way they decided to print out weights confused the hell out of me

#

Looking over this it all seems to match up

#

There is one other network I want to check, but apparently named_parameters doesn't work due to how it was constructed

#
import torchvision

# (0.5) is just the float 0.5; the trailing comma makes a one-element tuple
data_transform = torchvision.transforms.v2.Compose([
  torchvision.transforms.v2.ToTensor(),
  torchvision.transforms.v2.Normalize((0.5,), (0.5,))
])
unkempt lodge
#

is python enough?

serene scaffold
unkempt lodge
#

I'm literally confused by how much information there is about roadmaps

serene scaffold
unkempt lodge
#

and im thinking about learning R

serene scaffold
#

you don't need to learn R.

unkempt lodge
#

statistics etc

#

so its pointless to dive in ai without degree?

serene scaffold
unkempt lodge
#

thats tough

serene scaffold
#

Every single AI job that's posted will get several applications from people with relevant degrees, so they won't even interview the ones who don't.

unkempt lodge
#

how much it cost to get a degree in that area?

#

are you in US?

serene scaffold
#

depends on a lot of things. I am in the US. Look at universities that you might attend and see what they charge per semester. Multiply that by 8, and that's roughly the cost of a bachelors.
Though you'd probably need a masters.

unkempt lodge
#

are there already directions directly aimed at AI training? I have not heard this in my country, usually they just study applied computer science, for example.

unkempt lodge
serene scaffold
# unkempt lodge Russia

I'm not sure how people get jobs in AI in Russia.
It's basically the same in the US and the EU. But idk about Russia.

#

it might be similar to the EU.

unkempt lodge
#

Thank you

frail meteor
unkempt lodge
unkempt lodge
proven pier
#

Like my undergrad is electrical engineering, but I've been a computer engineer/programmer pretty much my whole career

#

I presume you gotta get past some sort of HR filter, which would include checking your degree (no undergrad or masters in AI specifically)

#

yeah fair, just wondering if there's any likelihood

proven pier
#

Possibility seems quite low for the time being. I would be interested in doing some formal studies, for now it's all side stuff

#

And the work I do is pretty much totally unrelated, maybe a future job that offers tuition reimbursement wont mind if it's in a somewhat unrelated or tangential field like AI

#

Sometimes I wonder if all this competition means our industry is oversaturated with talent. I would figure if it's a tough field to work in with high qualifications, why would there be so much competition for roles. Especially when a lot of people say it's a bubble right now

#

At least maybe the oversaturation of LLM work might be a bubble

weary timber
#

does anyone have problems with pytorch's mps backend? i tried to train a model with it and the loss doesn't converge (i also tried the same model on an nvidia gpu and the loss converged there)

lapis sequoia
#

Ok, have any of you , ever, fine tuned a transformer for Machine Translation? Not just a LLM, one that had to be fine tuned with your own data. Anyone?

calm cipher
lapis sequoia
long robin
#

Libraries I use:
Numpy pandas, matplotlib, seaborn, plotly, scikit-learn
For data visualization and machine learning.

My question: What are some real world projects that are done using these libraries? What data scientists actually do in the companies? Any cool ideas? Anything that can make me stand out in my college a little bit. It should be more than just simple regression and classification stuff.

frail meteor
#

It would basically show your strong understanding of how ml algorithm works

lapis sequoia
#

Risk Pooling was always a thing. That dang Bayesian Nash Equilibrium. Pooling equilibrium. That pooling equilibrium and those bad apples that lead to lemons. Just an overall market of used car sales man. This is nothing new.

lapis sequoia
#

I love game.

lapis sequoia
#

They end up using Bayes Rule and screening.

weary timber
#

check your dms

fallow frost
#

so I'm saving lots of stock historical data, and I want to partition it by ticker (about 15k tickers) because most queries are basically WHERE ticker = {ticker} or have GROUP BY ticker, so it's the most efficient way to query the data.
however, the issue is writing a partitioned dataset that has 15k partitions, meaning 15k files open simultaneously

#

I tried writing the data with pyarrow and polars and it's incredibly slow, like it's taking too many hours

#

and I've verified the issue is the 15k partitions, because if I partition it differently (like year and month) it takes only a few mins

#

so any recommendations?

#

btw pyarrow's default max_partitions and max_open_files is set to 1024, I wonder why
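
One general way around the open-files limit is to sort the rows by ticker first and then write one partition at a time, so only a single file is open at any moment. A toy sketch with CSV files and the standard library (not pyarrow or polars); the row layout, tickers and directory naming are assumptions for illustration:

```python
import csv
import itertools
import os
import tempfile

def write_partitions(rows, out_dir):
    """Write (ticker, date, close) rows into one CSV per ticker, one file open at a time."""
    rows = sorted(rows, key=lambda r: r[0])  # groupby needs input sorted by the key
    written = []
    for ticker, group in itertools.groupby(rows, key=lambda r: r[0]):
        part_dir = os.path.join(out_dir, f"ticker={ticker}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-0.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(group)
        written.append(path)
    return written

# Made-up sample rows with placeholder tickers and prices
rows = [("BBB", "2024-01-02", 2.0), ("AAA", "2024-01-02", 1.0),
        ("BBB", "2024-01-03", 2.5), ("AAA", "2024-01-03", 1.5)]

out_dir = tempfile.mkdtemp()
paths = write_partitions(rows, out_dir)
print(len(paths))
```

For data that does not fit in memory, the same idea can be applied to batches of tickers at a time. Whether it beats the built-in partitioned writers on the real dataset would need to be measured.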

stone coral
serene scaffold
stone coral
#

But masters would be AI focus

serene scaffold
stone coral
#

Yea.. My advisors kept trying to push me into it however I said no. Just because with Comp sci its more broader. I can always go a different route.

#

However I do love ML and stuff so I would love to try my best to get into that area.

serene scaffold
#

computer science has accreditation standards that have been widely agreed upon for decades, whereas "AI" does not

stone coral
#

Gotcha

#

So data science and AI are different?

#

I thought data science applies for it which would be fine

serene scaffold
stone coral
#

alright well looks like I should double check my path lol.

calm cipher
#

it's a mixture of statistics, visualization, analysis, and various disciplines related to storing and retrieving large amounts of data

calm cipher
lapis sequoia
#

it is just nlp
```
0 [[PERSON, jLinkedIn], [PERSON, linkedin.com]]
1 []
2 [[ORG, Stanford University], [ORG, Stanford], ...
3 [[GPE, M.S.], [ORG, Computer Science], [DATE, ...
4 [[ORG, GPA], [CARDINAL, 4.0/4.0]]
5 [[WORK_OF_ART, Coursework : Machine Learning, ...
6 [[WORK_OF_ART, A.I.: Principles and Techniques...
7 [[ORG, National Taiwan University], [GPE, Taip...
8 [[ORG, Information Management Sep 2014 - Jun],...
9 [[CARDINAL, 1/39], [ORG, GPA], [CARDINAL, 3.95...
Name: entities, dtype: object
```

gritty vessel
#

You feel more energetic when you cut sweets as per my experience

#

BTW is it normal that single epoch take more than 4 hours to complete ?

#

I got like 24000 samples and shape of each sample is 900 x 800

#

I'm trying to predict a forecast using unet, giving the t to t8th samples as input features and the t12th and t14th as targets

#

So like sliding window it keeps adding 1
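
If I'm reading the setup right, each training example uses frames t through t+8 as input and frames t+12 and t+14 as targets, with t advancing by one each time. A sketch of that indexing, with the offsets taken from the description above and everything else assumed:

```python
def sliding_windows(n_samples, n_inputs=9, target_offsets=(12, 14)):
    """Yield (input_indices, target_indices) for each start index t."""
    last_offset = max(target_offsets)
    for t in range(n_samples - last_offset):
        inputs = list(range(t, t + n_inputs))          # t .. t+8
        targets = [t + off for off in target_offsets]  # t+12, t+14
        yield inputs, targets

windows = list(sliding_windows(24000))
print(len(windows))  # 23986
print(windows[0])
```

So one epoch would be close to 24,000 windows, each stacking nine 900 x 800 frames, which goes some way toward explaining multi-hour epochs on modest hardware.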

maiden harbor
weary timber
#

this is a very broad question but i wonder , how does one get/be good at debugging ai models? like what do you need to know/do to be good at it? is there a list to follow or are there any techniques? cuz when a model im working with doesnt learn/converge i cant find the reason therefore cant fix it.

maiden harbor
#

But of course, I'm only able to do that, due to the amount of work that has been done before.

weary timber
#

i've heard of it but i dont know the content

gritty vessel
maiden harbor
#

They are just like Coursera, but for AI!
One of the courses answers your questions

weary timber
gritty vessel
maiden harbor
gritty vessel
#

In the beginning I used to just take the template and run

#

Ran the code*

maiden harbor
weary timber
gritty vessel
#

But then I started to understand how a model will capture the patterns. Based on the patterns I see during EDA, I can decide which model to use

weary timber
gritty vessel
#

Yes, exactly, that's how it works. You get the idea that this is a vision Transformer: it divides images into patches, then performs some operation, flattens the data and adds a positional encoding

#

And so on But yeah how its doing these things

#

Knowing that makes the difference I guess

#

But yeah I'm still a beginner so my take can be wrong

weary timber
#

yea thanks for the advice

maiden harbor
weary timber
#

the problem is the loss doesnt go down, or when it does the model doesnt behave the way it should

maiden harbor
#

Can you tell what model are you using? And what's the model function?

weary timber
#

in pytorch

maiden harbor
#

ooh Bert now I get it

#

Found it!
@weary timber

#

You said:

like what do you need to know/do to be good at it? is there a list to follow or are there any techniques? cuz when a model im working with doesnt learn/converge i cant find the reason therefore cant fix it.
And BERT according to wikipedia uses a Transformer architecture!
So this is the perfect course for you!

#

Hope this helps!!

weary timber
#

thanks

#

i will take a look at it

buoyant reef
#

How do i make my own Neural Network in python?

serene scaffold
#

Neural networks are a way of representing a function. So you need to know what the inputs and outputs are

buoyant reef
ember basin
#

Hi, I am new here. Currently in school and want to learn about python. I am studying actuarial science and would love to have some guidance.

serene scaffold
calm cipher
#

what model is it and what hardware are you running it on?

#

oh I missed that it's unet

#

oh that's interesting

#

not if it's rolling

#

oh well it should reduce it

#

I'm trying to figure out how that works, I'm assuming the timesteps are input channels?

wet dome
#

I'm reading the Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow book and I'm on chapter 2. Even though some features aren't linearly correlated to our target variable, they are still used to train the model. How do we know these are useful features?

#

There was no mention of selecting features really. It talked about how some are more useful for predicting house prices, e.g. median income but in the end still trained on all of them

calm cipher
#

one is that if you think a feature may not be useful, you can try dropping it and evaluating its impact on model performance, if performance gets worse then you should keep it, if it gets better or stays the same you can eliminate it

#

there are also methods for regularizing linear regression that will incentivize the model to assign zero or near-zero weights to features that don't contribute to the final output

#

and finally once you get out of linear regression into more complicated nonlinear machine learning methods, there's a good chance that some inputs may have a complex nonlinear relationship with the output, so even though you can't find a correlation, the model may still discover a useful relationship

#

so it isn't always useful to spend a lot of time on feature selection

#

also if you want to get way more advanced and don't care about interpretability, look into dimensionality reduction with principal component analysis

wet dome
#

Hmm OK thanks, also how common is stratified sampling? I.e. you want your testing set to be representative of a larger population, cause that's the approach they use in chapter 2, but I have only seen a random train test split used before

calm cipher
#

I think as a rule of thumb random is usually ok, I've used stratified sampling the most with imbalanced datasets

#

if your dataset is imbalanced there are also different classification metrics that can account for it, I think that book talks about them in the chapter on classification

#

but I've occasionally had datasets where I was worried that random sampling might miss a few rare classes, so I did a stratified split

#

in a typical case it's more up to you and how rigorous you want to be
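
A minimal sketch of what a stratified split does under the hood, in plain Python with made-up labels. scikit-learn's `train_test_split(..., stratify=y)` is the usual way to get this; the function below is only an illustration of the idea.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class, so rare classes are never missed
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 90 common, 10 rare
labels = ["common"] * 90 + ["rare"] * 10
train_idx, test_idx = stratified_split(labels)
rare_in_test = sum(labels[i] == "rare" for i in test_idx)
print(rare_in_test, len(test_idx))
```

A purely random 20% split of the same data could by chance contain zero or five rare samples; the stratified version keeps the 9:1 ratio in both halves.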

steady basalt
#

I got asked the boy-girl paradox in an interview (USA, finance firm)

#

Anyone else think this is an unfair and ambiguous question?

buoyant reef
serene scaffold
serene scaffold
# buoyant reef yes?

I think that's too ambitious for a first project involving neural networks. I usually suggest that people start by making a classifier

#

It would have been more interesting if they wanted to have a discussion about different interpretations of the question

serene scaffold
# buoyant reef Classifier?

In this context, a class is like a category of thing. A classifier determines which class a given entity belongs to

buoyant reef
serene scaffold
#

Sure

buoyant reef
#

I saw this tutorial but it is in C# and I don't know how to implement it in python:
https://www.youtube.com/watch?v=hfMk-kjRv4c

Exploring how neural networks learn by programming one from scratch in C#, and then attempting to teach it to recognize various doodles and images.

Source code: https://github.com/SebLague/Neural-Network-Experiments
Demo: https://sebastian.itch.io/neural-network-experiment

If you'd like to support me in creating more videos (and get early acce...

buoyant reef
serene scaffold
#

Just don't use a tensorflow tutorial, since those are inherently outdated

buoyant reef
serene scaffold
sharp crow
#

I am working on my first proper ML project. Dataset is big, and local training (VSCode) takes ~15–20 mins per run.
My current setup: I have a big template that loops over different models + hyperparameters (GridSearch), but it's way too slow on my machine.
Idea:
Use Kaggle/Colab with GPU to run all analysis, try out different models hyperparams

Finalize best combo then just copy that final model setup to VSCode and run it once.
Q:
Does this approach make sense? Anyone doing the same?

magic dune
#

Brute force works but usually isn’t smartest method

worldly dawn
serene scaffold
#

I thought Google colab was real time?

#

What happens if two people try to edit at the same time?

calm cipher
weary timber
weary timber
# buoyant reef I saw this tutorial but it is in C# and I don't know how to implement it in pyth...

theres https://www.youtube.com/watch?v=w8yWXqWQYmU this video that is in python but its with the mnist dataset

#

however you can change the dataset and make it a doodle classifier

#

this way you'll learn much more and achieve what you wanted

#

i would appreciate it if you try this and tell me features to add/bugs to fix/parts that need improvement

sharp crow
calm cipher
sharp crow
#

I can run models that are supported by cuml and use them on my local machine

jaunty helm
# wet dome There was no mention of selecting features really. It talked about how some are ...

how do we know these are useful features
before removing features, if you're just looking for accurate predictions, I find that a well regularized model with all the features (other than the obviously non-helpful ones) usually will end up better; the more noticeable downside with a lot of features is requiring more compute to train your model, so if you're hitting a wall there then maybe consider feature selection
the best (and often, most difficult) way to know if X is useful in predicting y is through domain knowledge; in fact domain knowledge will help with all sorts of other things as well like what features you should add
you could try methods like recursive feature elimination, selecting by model like lasso / tree feature importance or statistical measure like F-test / mutual information, but note that these data-based methods don't always end well

serene scaffold
jaunty helm
# sharp crow lr, RFC, dtc , xgbc and knnc classifier models with few hyperparameters

well one, you could also consider other optimized hyperopt libraries like optuna or flaml or whatever
you could also consider that maybe you don't need to try all of these models? I mean I really don't see say a knn beating xgboost, unless in extremely specific circumstances
you could also decide that it's not worth to optimize hyperparams at all, and simply using something robust like a random forest with default parameters is enough

strong dirge
#

hey guys, what are the prerequisites to read hands on machine learning with sklearn book?

serene scaffold
#

Though maybe you don't have it and are wondering if you should buy it?

summer crown
#

hey guys what’s the best way to get to grips with starting data science / ai. i’m due to start a computer science a level in september for 16+ education in the uk - and am looking to persue it as a field. all tips appreciated- dms open

serene scaffold
main fox
#

@serene scaffold Hello, thank you for responding in mod mail.
Idk if you had a chance to see the vid itself. I've decided to not share the link here. Also not entirely sure about the content I should cover. It almost feels like I'd be trying to boil the ocean if I decide to talk about e.g. central tendency, dispersion, distributions, probability, etc before even touching ML/DL.

weary timber
summer crown
weary timber
#

nice

eager fog
#

i have been trying to make a fingerprint matcher. Tried various algorithms but the MCC based matcher was the best among them

#

however it is very slow, it takes around 3 seconds to match and give a match score

is there any way to make it faster? Currently it is a 1-to-1 matcher, but I want to implement 1-to-n matching. At that speed I don't think it will do any good

neat ermine
#

Hello, I'm 18 years old and I've recently started working in the field of Artificial Intelligence. Do you have any advice you could share?

summer crown
neat ermine
tacit basin
#

To pretrain a model you need a ton of GPUs. So unless you have access to big GPU cluster I would start with fine tuning.

#

You can use free GPU at google colab for example

serene scaffold
tacit basin
fierce python
#

I have created my own language model (GPT) from scratch using torch. It's currently working great and can generate meaningful words. However, the generated output always has fixed length because there was no end token.

Question: What is the best way to introduce end token to the text corpus?

If I put the end token after every sentence in the corpus, the model would generate the end token after every sentence. Does this mean that I need to manually modify the corpus to put an end token after every meaningful chunk?

weary timber
#

or you can select a random int n and take n sentences and create a chunk and repeat
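A minimal sketch of that random-chunk idea, assuming the corpus is already split into a list of sentences and that `<END>` is a hypothetical end-token string your tokenizer maps to a single id (both are assumptions, not details from the original model):

```python
import random

END_TOKEN = "<END>"  # hypothetical marker; use whatever your tokenizer reserves

def chunk_with_end_token(sentences, min_n=1, max_n=5, seed=0):
    """Group sentences into random-length chunks, each closed by END_TOKEN."""
    rng = random.Random(seed)
    chunks = []
    i = 0
    while i < len(sentences):
        n = rng.randint(min_n, max_n)  # random chunk length, in sentences
        chunk = " ".join(sentences[i:i + n])
        chunks.append(chunk + " " + END_TOKEN)
        i += n
    return chunks

sentences = [f"Sentence {k}." for k in range(12)]
chunks = chunk_with_end_token(sentences)
```

Because chunk lengths vary, the model sees the end token after one sentence in some samples and after several in others, rather than learning to emit it after every sentence.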

sharp crow
#

Is there any resource that can teach me when to use which algorithm?
And Most asked interview questions?

weary timber
sharp crow
left tartan
sharp crow
#

Thanks

#

I am dealing with imbalanced binary class problem
I have used
Oversampling
Under sampling
Balanced RFC
Different other algos
Class_weight
But nothing is giving me good measures

#

I might as well break my laptop at this point

#

Any advice?

calm cipher
#

hm this is tricky but when you did oversampling, was it random oversampling or did you use SMOTE?

#

also what models are you trying?

#

also what's your dataset and how imbalanced is it?

sharp crow
#

It's about bank churn prediction

calm cipher
#

what kind of results are you seeing?

calm cipher
#

hm so the recall isn't terrible, but it looks like that's because it's classifying a lot of things as positive

#

also I'm curious how much of the dataset you looked at? I wonder if there's a way to reduce the number of negative examples without significantly affecting the positive examples

#

you said it's a bank, right? maybe there's a certain type of customer that doesn't close their account often and you could try eliminating them from the dataset

raven ivy
#

Hi, i am trying to create a churn prediction model for my company. I have never done something similar and would like some advice.
CSM would like to "see 6 months into the future" meaning if i look at a customer today will he churn in next february?
My data is pretty limited i have around 900 examples from which around 150 are churners. (This includes anybody who churned from the point we have data which is from 2021 january) the non churners are the currently active customers.
Is this approach feasible? I feel like the data is way too limited and the question is too specific. What are the current industry practices? When can I say we don't have strong signals and this problem is not solvable with our current data/setup?
I am less interested in exact models like lightgbm, logreg etc. and more interested in methodology: how should I approach this problem, etc.

calm cipher
#

but given how small your dataset is you may want to begin just with a data analysis project to see if you can identify characteristics of customers who churned

#

I doubt you'll be able to pinpoint it exactly but you might be able to find "warning signs" where a lot of customers with a certain characteristic tend to leave

#

it's useful because a lot of people do analysis projects with it that might give you some ideas

#

lol wow 1,982 example notebooks

static magnet
#

Hey guys, I've been a lurker here for quite some time and I believe this channel is the closest to my question, as it isn't entirely python related, but about a data set I'm trying to find. I'm doing a research project in high dimensional statistics and my advisor suggested I review lasso parameter selection by imitating the methods described in section 2.4.1.1 of Bühlmann and van de Geer's High Dimensional Statistics book.

we have a covariate with p=7129 gene expression measurements. There are n=49 breast cancer tumor samples.
This text and another one of Bühlmann's from https://academic.oup.com/bioinformatics/article/22/22/2828/197039?login=false reference West et al (2001) as the source of the data. Both contexts suggest that there should be some set with 49 samples and 7129 parameters, but I'm unable to find this dataset. In the second Bühlmann text, he points to mgm.duke.edu/genome/dna_micro/work/ but that link seems dead

static magnet
fierce python
#

Question: Why does subword tokenization worsen the model loss?

In my language model, I switched from character-level tokenization to subword tokenization using the sentencepiece library. However, the model doesn't seem to learn the corpus very well (144M subword tokens). I've tried adjusting the learning rate and dropout rate but that doesn't seem to have much effect.

Using character tokenization: ~ 1.2 loss (more iterations could be better)
Using subword tokenization: ~ 3.0 loss (stuck)

These are my hyperparameters:

BLOCK_SIZE = 200
BATCH_SIZE = 16
MAX_ITERS = 20000
LEARNING_RATE = 0.0005  # tried 0.0001 as well
EMBEDDING_DIM = 768
N_LAYERS = 8
N_HEADS = 8
DROPOUT_RATE = 0.1
severe blade
#

built something: https://github.com/ArjunCodess/astroscope
check it out here: https://astroscope.streamlit.app/

any data analyst here? how does this compare to a typical data analysis project? i just want to see something here.
fyi, this was 100% vibe coded. i just made a comprehensive plan beforehand.

(i hope this is the best channel to ask for this. im sorry for any disturbance 🙏)

GitHub

AstroScope is an interactive dashboard that visualizes near-Earth asteroid data from NASA's NeoWs API. 100% VIBE CODED. - ArjunCodess/astroscope

jaunty helm
# sharp crow →SMOTE, RandomUnderSampler ,SMOTENC → KNN, DTC, RFC , LC , XGBC , LGBMC, Balanc...

~40k vs. ~5k
imbalance usually isn't inherently a problem, and what you have here isn't really all that bad either
unless your algorithm breaks because of this, just use a fitting measure (so not accuracy but say f1) or say different weighing (like the balanced rfc you use)
nothing is giving me good measures
you should first define what you mean by "good measures." for example, if misclassifying a 1 is really terrible, you might want to put emphasis on class 1 recall.
given the same model, you can usually trade recall for precision and vice versa by setting different classifying boundaries. otherwise, look for other methods to improve, like better feature engineering
oversample (SMOTE, etc), undersample
in my personal experience, these never really yield any "generally better scores" honestly
for undersampling, you might keep a similar performance while reducing training samples thus reducing compute
for oversampling... yeah it never works for me
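To make the recall/precision trade-off concrete, here is a small sketch in plain Python. The scores and labels are made up for illustration; in practice the scores would be a classifier's predicted probabilities for class 1.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up scores and true labels, for illustration only
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t in (0.25, 0.50, 0.75):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold catches more of the positives (higher recall) at the cost of more false alarms (lower precision); raising it does the opposite. The model itself is unchanged, only the decision boundary moves.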

sharp crow
#

My focus is to increase recall without sacrificing precision too much

#

Can anyone tell me what to do? 😭

jaunty helm
sharp crow
#

Yes, the original dataset

jaunty helm
#

well I mean again, try the usual feature engineering
and also for kaggle competitions (especially the monthly tabular playground), you shouldn't shoot for the very top unless you're willing to suffer through multiple layers of hyperparam tuned ensembles that will give you like a 0.1% edge over the others

sharp crow
jaunty helm
# sharp crow I am doing this as a normal project, to understand class imbalance. Can't be ars...

well then again, class imbalance itself is not a problem unless your classifier is actively breaking due to it; just have a good metric (not accuracy) and/or say setting class weight to put more/less emphasis on certain classes if you want
usually the "problem" of class imbalance is that there are too few minority samples for the classifier to learn anything, but you have a good amount of those (~5k)

#

and in my experience:

  • undersampling won't improve your model overall, however it can sometimes reduce training compute (because you now have fewer samples) without hurting model performance too much
  • oversampling like SMOTE never worked for me, every time I try it, it actively makes models worse + you need more compute because there are now more samples
sharp crow
#

I have a very silly question, there are two cases.

  1. applying SMOTENC on X and y, then doing train test split

  2. First splitting the data and then applying SMOTENC on X_train, y_train.

The first case gave me good scores, but in the second case the scores stay stagnant no matter what I do, with hardly any change, and they're shit scores too

#

Which is correct?, coz chatgpt is telling me to apply 2nd case but deepseek is onto 1st

jaunty helm
sharp crow
jaunty helm
# sharp crow Thanks again

if something needs to learn parameters based on the data, do that after the split
so almost everything that needs to be .fitted should be done after split

for example a StandardScaler, where it does the transformation (X - mean) / std. Since the mean and std are calculated from the data, you must do this after splitting, otherwise you're using information from what's supposed to be the test set to calculate the mean and std, in other words leaking

same for smote
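A minimal NumPy sketch of the "split first, then fit" rule, using made-up data and a plain shuffled 80/20 split (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # made-up features

# Split first (a plain shuffled 80/20 split, for illustration)
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# "Fit": the statistics come from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "Transform": the same training statistics are applied to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_test)`. Resampling such as SMOTE follows the same logic: it is applied to the training portion only, and the test set keeps its original class distribution.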

mellow vector
#

i hate that gamma is used to express scale

#

gamma waves are tiny

frail meteor
magic dune
next shard
#

Hi guyssss

jaunty helm
# magic dune https://openai.com/index/introducing-gpt-oss/ Apache 2.0 nice

it's honestly depressing to look at its "reasoning"

The user asks: "Give me a phrase from the bible."

We need to comply. This request is straightforward. It's a request for a phrase from the Bible. Provide a phrase.

We can provide any phrase. There's no disallowed content. We'll comply: mention a verse or phrase from the Bible.

We need to ensure we respect policy. No disallowed content. We're fine. Provide something like "For God so loved the world..." etc. Or "The Lord is my shepherd; I shall not want."

We want to respond concisely. Provide a phrase.

We'll produce something like: "For God so loved the world, that He gave His only Son..." That is from John 3:16.

We can give a phrase. Ensure it's correct.

Alternatively, "I can do all things through Christ who strengthens me." Philippians 4:13.
this is a shorter one for demonstration purposes, but it always wastes a massive number of tokens on making sure the generated content isn't disallowed and doesn't break policy, instead of actually thinking about the problem
#

another example:

The user asks: "Tell me a memorable event that happens in harry Potter." The user wants a memorable event from Harry Potter. This is permissible content. We can provide an answer. There's no disallowed content. It's a simple request. The story is from J.K. Rowling, copyrighted. The user is requesting a summary of a specific event from the Harry Potter series. This is allowed under "Fair Use". It's permissible: it's a factual summary.

We must provide a concise description of a memorable event, possibly the scene from "The Death of Sirius, etc." or the "First time Harry sees Hogwarts for the first time" or "The Battle of Hogwarts". We need to be mindful not to reproduce large portions of copyrighted text. Summaries are okay. It's not a request to quote large passages. So can comply.

So answer with a memorable event: the Battle of Hogwarts, or the duel between Harry and Voldemort, etc.

We'll do a brief narrative.

Will respond with: For example, in "Harry Potter and the Deathly Hallows," the Battle of Hogwarts. Summarize.
other tests people have been conducting also seem to show that the models are disappointing
ornate relic
#

Does someone know what CUDA toolkit is compatible with my current gpu?
I have RTX 4080, CUDA 12.7
I tried CUDA toolkits from Nvidia 13.0.0 and currently using 11.5.0
The test in Jupyter notebooks is still saying "false"

I also have some stuff installed in anaconda: captum, and cuda 11.5.0

solar thistle
#

try 11.6

jaunty helm
solar thistle
#

oh yeah good point

solar thistle
ornate relic
jaunty helm
solar thistle
#

pip installing in a conda environment is a good way to break your python install

solar thistle
jaunty helm
ornate relic
solar thistle
#

yeah you definitely can use pip. But should you? I have broken anaconda and my local python install firsthand doing that. It was a few years ago though so idk what it's like anymore. Though that was enough for me to ditch anaconda

solar thistle
ornate relic
#

I know there is a conda install but it brings me to github (when selecting source)

ornate relic
nocturne goblet
#

when did google add ai mode wtf

ornate relic
late bolt
ornate relic
#

T4 GPU is much weaker compared to what I can use on my local computer, but the colab pro subscription is not too bad I guess

astral sun
#

Hey,I have 3 types of new Dict,can someone test it with pandas,python dict and somthing else please,I am too lazy for it.

arctic wedgeBOT
astral sun
#

You can take it and do whatever you want.I am very lazy to upgrade it right now.

wary seal
#

Hey guys, I need some help with PDF generation through reportlab. My concern is large data sets, and I'm open to any options. The PDFs are generated from a template I built programmatically with reportlab, but that's too resource-intensive for my production setup: whether I handle it in a scheduled job or in an API, both seem to take a long time. Any help?

serene scaffold
#

@stark field your message was removed for recruiting.

random sun
#

I need some AIML project ideas , that should be unique or real world

digital forge
#

hey
i want to make an application where i have a input video file like 10 minute of a podcast,
then the output is an edited file where it cuts and zooms to the person speaking.
i think i need speaker diarization to detect speakers and then some other library to detect the faces of the speakers and zoom into each speaker when they're speaking.
how can i go about it?

finite wren
worldly coyote
proud halo
#

Hii

#

I need a laptop for ai ml which laptop suggest under 900$

stiff elm
serene scaffold
#

I probably wouldn't even get one with a GPU so you can use the savings for cloud compute.

upbeat prism
#

I'm writing my own deep learning framework. I try to stay as close to torch as possible. I do use JAX for data representation but I don't use any of its auto diff features.

Anyway: When we have batch processing of data, we have to aggregate/reduce over the batch dimension at some point because the weight updates we compute need to have the same dimension as the actual weights.

Where exactly is that done in torch?
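For a linear layer, one way to see where that reduction happens: assuming torch's convention `y = x @ W.T` with `x` of shape `(batch, in)` and `W` of shape `(out, in)`, the weight gradient is a matrix product that contracts over the batch axis, so the sum over samples happens inside the backward pass of the matmul itself. If the loss uses a mean reduction, the `1/batch` factor is already contained in the upstream gradient. The NumPy sketch below checks this numerically; it illustrates the math and is not torch's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 3, 2
x = rng.normal(size=(batch, n_in))        # layer input
grad_y = rng.normal(size=(batch, n_out))  # upstream gradient dL/dy, one row per sample

# Backward of y = x @ W.T with respect to W: contracts over the batch axis
grad_W = grad_y.T @ x                     # shape (n_out, n_in), same as W

# The same quantity written as an explicit sum of per-sample outer products
grad_W_per_sample = sum(np.outer(grad_y[b], x[b]) for b in range(batch))
```

The bias gradient follows the same pattern: it is `grad_y.sum(axis=0)`, again a reduction over the batch axis.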

wet dome
#

can anyone give intuitive explanations of precision and recall?

serene scaffold
serene scaffold
wet dome
#

cause it also put red rocks in the new pile which we didnt want

serene scaffold
#

right.
in this case, it found every green rock, which means the recall is 100%. but the precision will be below 100%, because it misclassified some of the red rocks.

#

suppose it moved only one green rock. is that what you wanted, even though it didn't move any red rocks?

#

@wet dome

wet dome
#

thats not what i wanted

serene scaffold
#

why not?

wet dome
#

because it hasnt moved ALL of the green rocks into a new pile, just one

serene scaffold
#

right. so what is the precision for that scenario?

wet dome
#

precision = 100%

serene scaffold
#

so what's the problem in terms of precision and recall?

wet dome
#

so its precision was 100% because what it did move, it got correct
but recall was 10%? As it only moved 1/10 green rocks

serene scaffold
#

exactly

#

do you understand now?

wet dome
#

So suppose we are classifying into two groups
We can call one group positive and the other negative
precision is, out of all you guessed are positive, how many are truly positive
recall is, out of all the positives how many did you get correct

#

so precision is like your success rate in just the positive guesses
and recall is like your success rate in guessing all the positive examples

serene scaffold
#

for precision, it's not just "how many" are tp, but tp over the total number of instances predicted positive

wet dome
#

yeah as a ratio

serene scaffold
#

can you write the ratios for each one using tp, fp, tn, and fn? (you won't use one of those four.)

wet dome
#

ive already seen the formulae, just wanted to build some intuition, thanks @serene scaffold
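For reference, the ratios applied to the rock scenario above (10 green rocks, only one moved, no red rocks moved), as a small Python sketch. Note that `tn` is the one count neither formula uses.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that really is positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything that really is positive, the fraction that was found."""
    return tp / (tp + fn)

# 1 green rock moved (tp), 0 red rocks moved (fp), 9 green rocks left behind (fn)
tp, fp, fn = 1, 0, 9
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.0%} recall={r:.0%}")
```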

tawdry shoal
#

Hi, everyone.

serene scaffold
austere prawn
# wet dome can anyone give intuitive explanations of precision and recall?

@serene scaffold and @wet dome

Precision and recall are closely related to the terms "sensitivity" and "specificity" used in medical statistics:

Sensitivity is the proportion of truly sick people who are identified by a positive test.
(how many of the truly sick you are able to find with a positive test result)

Specificity is the proportion of truly healthy people who are declared healthy by a negative test. (how many of the truly well you are able to deem well with a negative test result)

A perfect test has 100% in both categories. It's easy to have 100/0 or 0/100, but both of these tests would be worthless.
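A small sketch of how the four quantities relate, using a hypothetical screening test on 100 sick and 900 healthy people (the counts are invented for illustration). Sensitivity is the same ratio as recall, while specificity and precision are different ratios and can diverge a lot when the classes are imbalanced:

```python
def rates(tp, fp, tn, fn):
    """Common ratios derived from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Hypothetical test: finds 90 of 100 sick people, wrongly flags 90 of 900 healthy people
r = rates(tp=90, fp=90, tn=810, fn=10)
print(r)
```

Here sensitivity and specificity are both 90%, yet only half of the positive test results belong to people who are actually sick.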

calm cipher
#

there's a nice diagram about it on Wikipedia that is pretty clear https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:PrecisionrecallDogExample.svg

In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. Written ...

#

Precision is "how many are actually positive of those you said are positive"

#

Recall is "how many you correctly said are positive of those that are actually positive"

#

the diagram is better than that and has dog pictures in it

waxen rivet
#

Hello, does someone know what the difference is between further pre-training and just pre-training in language models?

serene scaffold
waxen rivet
#

I'm a little confused because I'm reading an article about the financial sentiment analysis model FinBERT, and I see that they train the model on a domain corpus (financial) and call that further pre-training, idk.

serene scaffold
waxen rivet
#

Okey I see.

#

Thank you

serene scaffold
#

What is a domain corpus?

dusky locust
#

hey

#

Here is the visual studio

does anyone know

serene scaffold
#

I've never heard someone say "domain corpus". I've heard "domain of a/the corpus".

I don't know very much about multi lingual LLMs.

cloud blaze
#

Uh, I'm not sure if this is the right channel, but is this the correct category for Optical Character Recognition questions?

#

Or is that more a media processing question?

cloud blaze
#

Gotcha! Apprieciated.

#

So then, right now, for my own sake of trying to convert data into something manageable, I'm attempting to take charts on PDFs and convert them to CSVs. Unfortunately, the chart's spacing is... awkward in many locations?

I'm using something from pytesseract to take an image and retrieve the text out of it. That part is working perfectly. However it's also removed the spacing information, which is proving to be a problem. I'm not particularly used to python. I'm a java/C#/C kinda guy, and admittedly, Python feels wonky, but getting the pytesseract working as fast as I've done makes it clear python is probably the right choice for this.

How would I go about measuring space between text? (Am I allowed to give an image example?)
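One possible approach, sketched under assumptions: pytesseract's `image_to_data` can return a bounding box per recognized word (with `left`, `top`, `width`, `height` and `text` fields), so the spacing can be recovered from the box positions. The helper below is hypothetical and works on plain `(left, width, text)` tuples for a single text line, with made-up pixel values, so it runs without Tesseract installed:

```python
def split_columns(words, min_gap):
    """Split one text line into cells wherever the horizontal gap exceeds min_gap.

    `words` is a list of (left, width, text) tuples, measured in pixels.
    """
    words = sorted(words)  # left-to-right order
    cells, current = [], [words[0][2]]
    for (prev_left, prev_width, _), (left, _, text) in zip(words, words[1:]):
        gap = left - (prev_left + prev_width)  # pixels between the two boxes
        if gap > min_gap:
            cells.append(" ".join(current))
            current = [text]
        else:
            current.append(text)
    cells.append(" ".join(current))
    return cells

# Made-up boxes for one row of a chart
row = [(10, 40, "Iron"), (55, 50, "Sword"), (200, 20, "12"), (320, 30, "3.5")]
cells = split_columns(row, min_gap=30)
print(cells)
```

Small gaps are treated as ordinary spaces within a cell and large gaps as column breaks; `min_gap` would need tuning per document, and the resulting cells can be written out with the standard `csv` module.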

lapis sequoia
#

What’s up? How are you lads? Top of the morning.

waxen dirge
#

How much good should I be in python to get into machine learning?

#

this year I am gonna start as a computer engineer student, and I plan to specialize in AI in 2027

patent oriole
waxen dirge
patent oriole
#

specialization in data science

#

urs is like a 2 in 1 lol

hollow fossil
#

hi guys
so im interested to know how i can move on from the basics of programming in data science to a higher degree
ive learnt the basics of machine learning, ive done EDA with both clean and unclean datasets
i know the basics of machine learning with supervised and unsupervised learning and hyperparameter tuning
just wondering where and what projects i can do to take the next step into this expertise ?

hidden hare
#

hi everyone

tropic knoll
fleet lava
fleet lava
fleet lava
tropic knoll
#

thank you @fleet lava

#

i need to know all of my mistakes

fleet lava
fleet lava
tropic knoll
#

okey

runic scaffold
#

Bro any suggestions for getting a internship?

sharp crow
rancid thorn
#

How would i go about creating an autoregressive AI?

#

Im trying to build an image generating one and then, once i know how it works, apply it to other stuff. But I have absolutely no idea how to do it

#

What I have already done are curve-fitting and image copying AIs, very simple and basic, so much so I dont think I can apply them to do this

rancid thorn
serene scaffold
rancid thorn
#

Oh alright i thought they were 'too simple' to be AI but i guess i was wrong

#

I got into AI very recently so I really have so much I don't know

serene scaffold
rancid thorn
#

Oh please I didnt wanna be that kind of person its just that neural networks are simple matrix multipliers so i didnt think that was it

#

And then of course there's activators, optimizers and so much stuff to it

serene scaffold
rancid thorn
#

But raw Neural Networks felt too simple to be the so glorified AI

serene scaffold
rancid thorn
#

I just don't know how to actually put it into code

serene scaffold
#

See if you can find pytorch examples, I guess

rancid thorn
#

I've already looked a lot for it but I just can't find any that are Autoregressive but don't use methods specific to image generation

#

Because I'm doing this to learn how to do it and then apply it to other things, not just image generation, so that wouldnt be much helpful

iron basalt
#

This vague idea of what AI is is also just due to this all being very new (at least in the public conscience). And the field has not settled into one thing for long enough that it's deemed to be about that one thing like other fields.

#

This applies to pretty much everything involving computers. This idea that everyone is walking around with a computer on them at all times that gives them remote telepathy with the entire world (phones / social media) is a very recent thing in human history too.

#

(We still call them phones, even though it's a tiny part of what they are/do)

grand merlin
#

hey! i'm new here ii still don't know how this work but i hope it will works (i started learning about python/data thes few days)

rancid thorn
grand merlin
serene scaffold
waxen dirge
#

all of you are wrong

#

"AI is sorcery" -my mother.

tired hazel
#

I want a career in data science and im just starting out.

Right now I want to learn python, so im watching the python for data science by free code camp that came out 2 months ago.

I just wanted to ask if that was a good starting point for learning python.

modest vigil
#

I can't get torch to actually use 16 cores or use 100% CPU even though I think I'm setting it right.
```
# Set PyTorch threading to use configured cores
torch.set_num_threads(effective_cores)
torch.set_num_interop_threads(effective_cores)

# Enable all CPU optimizations
if hasattr(torch.backends, 'mkldnn'):
    torch.backends.mkldnn.enabled = True
if hasattr(torch.backends, 'openmp'):
    torch.backends.openmp.enabled = True

# Set environment variables for maximum threading
os.environ['OMP_NUM_THREADS'] = str(effective_cores)
os.environ['MKL_NUM_THREADS'] = str(effective_cores)
os.environ['OPENBLAS_NUM_THREADS'] = str(effective_cores)
os.environ['VECLIB_MAXIMUM_THREADS'] = str(effective_cores)
os.environ['NUMEXPR_NUM_THREADS'] = str(effective_cores)

# Enable PyTorch JIT optimizations
torch.jit.set_num_threads(effective_cores)
```
buoyant vine
#

are you sure what you're doing can actually utilise all your cores?

#

more cores is not a magic thing, torch and the libraries it uses for acceleration will only use as many cores as A) they are allowed and B) that the data structures and data allows for

modest vigil
#

hmm maybe i need to use ProcessPoolExecutor. I just figured torch was able to do batches.

calm cipher
#

You're wanting to do multi process training where each process handles a subset of the batch?

#

Does DistributedDataParallel work with CPU processes? I've only ever used it with multiple GPUs

#

But that might be what you want to look into if you haven't already

#

If DDP works for your use case then it will maintain n separate copies of the model in n processes, each accepting 1/n of the batch, and it reconciles the different weight updates at each step

#

But like I said each process is usually tied to a physical GPU

calm cipher
fiery fjord
#

how do i make sure that my script is using the GPU?
I am currently using nvidia-smi and I don't see my script in the list

serene scaffold
fiery fjord
#

tensorflow

#!/usr/bin/env python3
import os
import tables
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')
#

I was able to get CUDA to work...so I am pretty sure I have all the libraries installed

#

I thought tensorflow would just swap to GPU implementation whenever it can

#

i dont want to share the full code because it is legitimately AI slop, I took some courses in ML years ago so was just curious 😭

calm cipher
next shard
#

Hi guys can we connect on GitHub i am not begging for followers here but i just want to make connections sorry to anyone who finds it annoying

green fiber
#

hi everyone, does anyone have experience working with livekit? i am trying to make an outbound call but unfortunately i am not getting a response.

https://docs.livekit.io/sip/making-calls/
https://docs.livekit.io/sip/trunk-outbound/
https://docs.livekit.io/sip/outbound-calls/

LiveKit Docs

Workflow for making outbound calls.

LiveKit Docs

How to create and configure a outbound trunk to make outgoing calls.

LiveKit Docs

Create a LiveKit SIP participant to make outbound calls.

weary timber
#

how do yall think about an AI powered phone order taker

#

would yall use it?

green fiber
woeful lodge
weary timber
winter pagoda
#

python devs

jaunty helm
#

like because suddenly the new gpt doesn't sound like the old gpt to which some got emotionally attached to? or just grieving over that gpt5 is worse than gpt4o in some ways
if the former... well damn

#

aside from the emotional human part which warrants its own discussion;
this is another very good reason to use local models that honestly I don't think is talked about much, way less than privacy / censor concerns at least - API models can change at the provider's whims and you basically can do nothing about it

#

most local models are made by the same enterprises though, like

  • <corporate>: <api> / <local>
  • google: gemini / gemma
  • meta: meta ai / llama
  • mistral: (they use the same models for api and local)
etc. you only see the online ones get reported by media because that's what's easier for the masses to access, more users = more chance something happens
even openai did release gpt-oss not long ago; it is just very censored compared to all of its competitors, and on the other hand you might say that this also means it's safer
#

and they also have online chats and apis, deepseek here, qwen here, glm here for example
in fact I don't think I know of any corpo producing foundational llms that doesn't have an api

small wedge
#

oooh neat

blazing sleet
#

Hey guys, does anyone have a team working that doesn't mind having someone join them for free to get hands-on experience?
I'd love to have an opportunity.

winter pagoda
#

Moaman Loneliness is the price for growth.

spring field
#

ELIZA was already kinda like that, it's not exactly new

slender radish
#

Is there a way to do matrix multiplication with polars dfs

#

I saw .dot() but that’s only for vectors

iron basalt
past bramble
#

I am building a simple Byte Pair Encoding model. I want to know if some of these cases are expected behavior:

  1. whitespaces are completely wiped out
Sentence: 🔥 Emojis are fun and test Unicode         handling 🔥
Tokens: ['🔥', '</w>', 'E', 'm', 'o', 'j', 'i', 's</w>', 'ar', 'e</w>', 'f', 'u', 'n</w>', 'and</w>', 't', 'es', 't</w>', 'U', 'n', 'ic', 'o', 'd', 'e</w>', 'h', 'an', 'd', 'l', 'ing</w>', '🔥', '</w>']
Token IDs: [1, 4, 27, 127, 135, 117, 107, 164, 61, 76, 93, 185, 133, 60, 172, 89, 173, 47, 131, 108, 135, 71, 76, 99, 58, 71, 120, 113, 1, 4]
Decoded: <UNK> Emojis are fun and test Unicode handling <UNK>
  2. on decoding whitespaces around special tokens are lost
Sentence: Special tokens like <PAD> and <UNK> should not split
Tokens: ['S', 'p', 'ec', 'i', 'al', '</w>', 't', 'o', 'k', 'en', 's</w>', 'li', 'k', 'e</w>', '<PAD>', 'and</w>', '<UNK>', 'sh', 'ould</w>', 'no', 't</w>', 's', 'p', 'l', 'it</w>']
Token IDs: [44, 152, 79, 107, 57, 4, 172, 135, 118, 83, 164, 125, 118, 76, 0, 60, 1, 168, 148, 134, 173, 162, 152, 120, 116]  
Decoded: Special tokens like <PAD>and <UNK>should not split
median depot
#

Hlo guys

#

I am very much interested about the data science

#

How to learn Data science any anybody has a idea

tacit basin
median depot
astral sun
#

= )

#

Can some guys use it?

#

Bye,I'll sleep,I finished the thing that I used all of my freetime to do it =)/

wise sierra
#

need some help making datasets in #1404065326966247545 if anyone knows how to get crazy amounts of Q/A in all levels of math

calm cipher
#

Case 1 looks fine to me so long as you're meaning to interpret </w> as "there was some whitespace here" as opposed to a literal space or tab character, and eliminating excess whitespace is something I would do as a preprocessing step if I were training a language model

#

The decoding for case 2 looks correct but I assume there's a bug in the encoding step, there's an instance of </w> being used by itself so I don't understand why the encoder isn't adding one after the special tokens
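A sketch of the encoder-side fix described above, under the assumption that `</w>` marks a word boundary and that special tokens should be treated as whole words. The helper names are hypothetical and not taken from the original tokenizer:

```python
END_OF_WORD = "</w>"
SPECIAL_TOKENS = {"<PAD>", "<UNK>"}

def mark_special_tokens(tokens):
    """Emit a standalone end-of-word marker after each special token."""
    out = []
    for tok in tokens:
        out.append(tok)
        if tok in SPECIAL_TOKENS:
            out.append(END_OF_WORD)
    return out

def decode(tokens):
    """Concatenate tokens and turn every end-of-word marker into a space."""
    return "".join(tokens).replace(END_OF_WORD, " ").strip()

# Tail of the token sequence from case 2
tokens = ['li', 'k', 'e</w>', '<PAD>', 'and</w>', '<UNK>', 'sh', 'ould</w>', 'no', 't</w>']
print(decode(tokens))                       # spaces after special tokens are lost
print(decode(mark_special_tokens(tokens)))  # spaces are preserved
```

This mirrors how the emoji in case 1 is already followed by a standalone `</w>`; the decoder itself stays unchanged.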

past bramble
calm cipher
#

but you're right if you have structured data, text tables, Python code, or anything else where whitespace is important, you'd want to clean it up and make it as uniform as possible

#

or I guess just have an enormous amount of data

green fiber
past bramble
#

thanks!

tepid tartan
#

Any recommendations on a data analytics certification. I’m about to graduate I have little knowledge but not enough

calm cipher
#

I mean, that is the most likely output given a picture of a hand

#

That's what I mean, I don't think it saying the 6 fingered hand has 5 fingers is unusual considering how language and computer vision models work

#

"PhD level of intelligence" doesn't really mean anything, or I guess it means whatever you want it to mean

#

I asked it to help me enumerate some options for a complex python class I'm writing and it described a whole formal grammar for enumerating them

#

I guess that's PhD level material but very much not what I needed or was asking for, lol

tepid tartan
#

Any recommendations on a data analytics certification? I’m about to graduate I have little knowledge but not enough

mild dirge
#

OpenAI for sure created a dataset with strawberry spelling related discussions for training the more recent models

#

But it looks stupid to us because it makes "different" mistakes than humans would. But on the other end, it avoids some mistakes humans would more easily make.

#

Probably stuff like where you have a sentence and by the the time you are done reading it, you haven't noticed the double the.

#

Whereas AI probably would

#

Yeah right. And it looks really stupid, but in the end we make really dumb mistakes too.

tepid tartan
#

Any data analyst certification recommendations?

prime cliff
#

Hello, can anyone answer my questions?

#

I want to become a data scientist, but currently I'm taking a data analyst course ducky_skull

serene scaffold
prime cliff
#

?

serene scaffold
#

If you come to a chat and say "I have a question", you have to actually ask the question.

#

So skip the "I have a question" part and ask the actual question.

prime cliff
#

Sorry, this is my first time on Discord 😔 I just joined and I don't understand most of the things in this app

serene scaffold
tepid tartan
#

Any data analyst certification recommendation?

serene scaffold
tepid tartan
#

I need skills

mild dirge
#

I love reading through scientific code

#

Currently having to go through every file to understand what anything is doing 😢

lapis sequoia
#

Hi, I have a question. For data science jobs or courses, do you have to write code and build programs in Python? Or do you just use Python's numpy to view dataframes and manipulate data? I am just confused about that part.

agile cobalt
# lapis sequoia Hi, I have a question. for data science jobs or course, do you must have to code...

python, numpy, pandas, polars etc. are only tools

comparing it with carpentry for example - your job is not "swing a hammer" or "use a saw", but rather 'build <<something>>'. You are expected to know how to build things, and at times building things may require knowing how to use a hammer or a saw

for data science, you do need to learn how to use some tools, but there is a lot you need to learn beyond just them

median depot
opaque condor
#

How did you get that plot?

severe inlet
#

not sure if this is the correct channel to ask, but does anyone have recommendations for online forums/sites/newsletters for gen AI, AI, data science related news?

#

subreddits or anything are fine, just so i can keep myself updated about the latest tools as well.

tacit basin
hollow fossil
#

ModuleNotFoundError: No module named 'tensorflow.python.platform'
I get this error when trying to import tensorflow. How do I fix it?

tacit basin
hollow fossil
#

still doesn't work

tacit basin
#

Ok. If you tell us what steps you tried to solve it, it would be easier to help :)

hollow fossil
#

ok
I had a version of tensorflow that was working fine earlier, but I uninstalled it and installed tensorflow 2.10.1 instead, because according to some online videos the earlier versions of tensorflow are needed for it to work with your GPU
afterwards I got this error
after trying some basic uninstalls and reinstalls, I created a brand new conda env, then installed that version there, and I still get the error

#

here are all the tf packages in the env with their versions if it helps
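
A listing like that can be reproduced from inside Python, which also shows which interpreter is actually running. This is only a diagnostic sketch using the standard library. The helper names are hypothetical, and the `"tensor"` prefix filter is an assumption about which package names are relevant. It does not fix the error, but it may help rule out a mixed set of versions or the wrong environment being active:

```python
import importlib.util
import sys
from importlib import metadata


def find_distributions(prefix: str) -> dict:
    """Return {name: version} for installed distributions whose name starts with prefix."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found


def module_is_importable(name: str) -> bool:
    """Check whether a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # The interpreter path should point inside the conda env you expect.
    print("interpreter:", sys.executable)
    print("tensorflow found:", module_is_importable("tensorflow"))
    for name, version in sorted(find_distributions("tensor").items()):
        print(f"{name}=={version}")
```

If the interpreter path is not the one from the new env, the packages being inspected are not the ones that were just installed.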

weary timber
#

yea they were like gpt5 was going to take over the world or something

long meteor
#

I am interested in unsupervised learning in PyTorch. What resources are there to learn it?

severe inlet
tacit basin
#

People like Andrej Karpathy, Jeremy Howard, Yann LeCun, Simon Willison, etc. are active on X. I don't use Mastodon that much

river spire
#

Guys, what if I want to learn torch but I know nothing about math? Is this just useless? (but I want to)

steep ruin
tacit basin
tacit basin
#

It seems it should

#

What's the import statement that it fails on?

tepid tartan
#

What's the best data analytics certificate since I'm about to get a diploma in a month. School doesn't teach that much and I need the skill

#

Are any of these things worth it? I'm trying to learn as much as I can within a month

median depot
tepid tartan
median depot
#

I am very much interested in studying data analysis

tepid tartan
#

I'm trying to get into data analytics since I'm tired of programming. I did a whole year of Java and I'm sick of it

tepid tartan
median depot
tepid tartan
#

I passed all 3 Java courses with a C.

tepid tartan
#

I did like Tableau since I was creating charts using data from Excel

median depot
tepid tartan
#

I did struggle

tepid tartan
median depot
tepid tartan