#data-science-and-ml
1 messages · Page 171 of 1
when I look at table 2 on the paper it mostly seems their test score goes up when they scale up the params
table 2 is this
oh yeah I forgot to post this one
the division between better than baseline/worse than baseline is kind of close to 50/50
yet out of the 1771 models, they only considered 106 good enough to put in the table
so it seems like those models got worse when they added more parameters?
isn't that the nature of any architecture searching algorithm? especially a genetic algorithm style one like this seems to be
I'm still not sure what their selection criteria was, but it had to do with testing larger versions of the models
right that's the point, this isn't being presented as a genetic algorithm, they're calling it "Artificial Superintelligence for AI research"
yeah all the hype they're putting around it is pretty cringe
I see the word "revolutionary" in here like 8 times XD
but if they are using the training loss like it seems like they might be, then 1/3 of the score is essentially useless
they use the term "training loss" throughout the paper
I mean i guess if it's the same training set then it's a useful metric for comparing the models relatively
but yeah that's useless as far as measuring generalization
there's also the larger issue of whether it's just an overfitting machine
isn't that why they show the test scores next to all that though?
or did they leave that out somewhere else
if they're throwing every possible combination of model at the wall, and testing it on the same benchmarks, then it isn't clear if the best model is good at reasoning or just the best as passing those specific benchmarks
this is a problem with regular ML research too
I don't really see the alternative
set aside a bunch of benchmarks you don't use during evaluation
they kind of did that with the wiki dataset to be fair
so you want them to do like a train/test/validation thing
it essentially looks like each benchmark is a test set
I mean it is, it's a set of tasks the model hasn't been trained on
if you look at it that way, they're making the mistake of using the test set to evaluate each of the different models, and so there isn't any test set left over to evaluate the final results
again, in the interest of fairness, it looks like they didn't use the wiki task as part of the score
but this is one of the reasons you can compare the training score as well
Like , I was thinking that when the user uploads it's data , I am going to provide the meta data of the data and user query to llm which will convert the user query to a code then that will hit the csv data and the results of it will be passed to another llm which will summarise of explain that coded outout.
it's two data points, if the model was just really good at the testing data then it might show up in the train score not being significantly higher compared to the others
since you'd expect a model that's good at generalization to score higher on both
I see what you're saying but I just don't see the value in evaluating the model on the training loss to the extent that it occupies 1/3 of the overall score used for model selection
that is always going to be the most favorable way to look at it
So , what can I do
Llm only writes the code that gets hit to data and results is then sent back to user
It's like there is a person who writes code for your analysis
But I don't think llm can write complex code complex business logics
How are you talking about vec search or old search, we are here talking about analysis not some pdf having questions answers
Aah 😵, you got what I need to make
Yes
Yes , but do you have any other approach
True , it makes a lot of mistakes
Yes
Ok mate thanks for your support
Yeah , but it follows a philosophy of garbage in garbage out
The right question is the thing we should ask it
+1
Yes
There are many nuances between federation, distributed and the various ways to store data.
Plus you have to account for the technical skills of your users
im currently trying to finetune small models for img classification and was using resnet18/50 and bunch of different efficientnet_bX models.
For faster benchmarking i train smaller datasets on GPU to find good hyperparameters if i find a promising run i apply it to the full dataset on CPU/GPU batchmode.
However the results are differing significantly from one another and i dont find where my mistake is, i know that they wont be == but atleast ~+-5% ig.
Maybe someone can recommend other models aswell which might work good on animal classification.
what is the difference between the datasets?
how much statistics do i need to know for data analysis?
between CPU/GPU and GPU mode none the plan was to test on GPU and then run on CPU/GPU as too large dataset wont work on my GPU.
So currently dataset is the same for testing (5k img), planned would be 40k img for full dataset
no I mean what is the difference between the dataset you're using for hyperparameter tuning versus the dataset you're using for the full training run
you said the dataset you're using for hyperparameter tuning is smaller than the full dataset you're using for training?
yes but currently im in debugging to solve why i dont get similar results for same settings and only change was GPU -> CPU/GPU
might need more details, I'm not sure what you mean by the difference between GPU and CPU/GPU
i created 2 dataloaders for training one is the normal cuda approach (batching) the other is caching all img on the GPU VRAM (speed), but they are technically the same. I know that there will be a difference even with identical seed but not > 30%
is the GPU-only solution still batching?
there should not be a difference with an identical seed
to put it another way, are you still feeding training batches into the model regardless of whether they all fit on the GPU, or are you giving all training examples at once?
no i transfer the img data at the start and then epochs are run without new batching happening.
that is probably the reason for the difference
especially if there is no difference in learning rate between the two approaches
yep but i didnt thought it would be 30% 😄
How big is a batch and how much data do you have?
a batch in normal mode is 64 and i do have 40k img
let me make sure I'm clear on what's going on, when you're using GPU-only unbatched mode, you're using all 40k images at once in one backpropagation step?
ok so in your current gpu-only setup you're using all 5k images in one backpropataion step? and in the other setup you're giving batches of 64?
no its all the same sorry if i might overcomplicated this.
# Move all data to GPU
print("Loading entire dataset to GPU memory...")
all_images = []
all_labels = []
for i in tqdm(range(len(full_dataset)), desc="Transferring to GPU"):
img, label = full_dataset[i]
all_images.append(img.to('cuda'))
all_labels.append(torch.tensor(label, device='cuda'))
all_images = torch.stack(all_images)
all_labels = torch.stack(all_labels)
py
train_loader = DataLoader(train_gpu_dataset, batch_size=batch_size, shuffle=shuffle)
val_loader = DataLoader(val_gpu_dataset, batch_size=batch_size, shuffle=False)
I still use batch_size even for full gpu mode but i previously cache the img to the VRAM
maybeee i found my problem while looking at this again
will this store the seed between different packages?
def set_seed(seed):
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
or if i define it in main.py but have random_split be performed in the dataloader.py its not kept?
i thought this would be sufficient but maybe its not?
When is that called? you should only need to do it once
at the beginning of training (before dataloader)
if you want to confirm they contain different data, try manually iterating over each dataloader and printing the contents of the first batch
you can make sure they're the same between different data loading configurations
good advice thanks for the suggestions!
its sad that pytorch doesnt feature hyperparameter tuning itself, hard to make a generic testing script for multiple models/settings 😄
I like PyTorch Lightning for things like that
I don't recommend retrofitting an entire existing project into it, but if you structure your ML project around it, you can get JSON config files and CLI parameters that map directly to model hyperparameters
so if you want to do hyperparameter tuning, you can just make a bunch of config files and run them
yeh i build my project from scratch with that in mind using .yaml and pydantic but its not working 100% of time and its a bit fuzzy 😄
I love scratch coding👍almost finished ml project with that style
You can learn more over here:
https://openrouter.ai/
To learn more about the LLM coding features of marimo, go here:
https://docs.marimo.io/guides/editor_features/ai_completion/
https://www.youtube.com/watch?v=Lakz2MoHy6o&t=1497s&ab_channel=TheIndependentCode
I'm currently using the video I linked to implement a CNN from scratch, but I'm a bit confused by how he uses biases. The video shows a single bias term per pixel per output. So if an input of 3 channels and want an output of 3 channels, I have 3 bias matrices (one per kernel) and each bias matrix has the width and height of the output. Other sources I have seen show one bias term per output, but not per pixel. So if I have 3 kernels, I have 3 biases, each being a single scalar.
Which is correct?
In this video we'll create a Convolutional Neural Network (or CNN), from scratch in Python. We'll go fully through the mathematics of that layer and then implement it. We'll also implement the Reshape Layer, the Binary Cross Entropy Loss, and the Sigmoid Activation. Finally, we'll use all these objects to make a neural network capable of classif...
I can't watch the video now but each kernel has its own bias
None of the parameters in a CNN are dependent on the size of the data, input or output
So yes if you have 3 kernels, and you want a bias term, then you have 3 bias terms
Gotcha. I don't know why he has it as a matrix with the same size as the output, it might just be me misunderstanding tho.
So the gradient of the bias for kernel i is equal to output i's gradient?
Hi guys so i'm taking data science in college and i was wondering if there's anything i can get started with, i don't exactly know where to start
Hello. Does JAX has a maximum values that it can handle?
If I do
import jax.numpy as jnp
jnp.isfinite(10**10)
I get an overflow.
can you try again with 10.0**10 ?
Ohhh it works... why?
have you ever programmed in c or c++?
regular python is dynamically typed and makes its own choices of when to treat your operations as ints or floats
it also implements a thing like "big int" that allows integers to be arbitrarily large by giving them as much memory as they need
jax does neither of these two things
it likes sticking to C-like types, like int32 and int64. if you write 10**10, it will try to use ints and will have maximum amount of 32 bits fo them (this is the precision it uses by default, to be compatible with standard gpu arithmetic)
floats have a way of dealing with huge numbers by just treating it as inf or nan, but ints don't do this
10.0**10 explicitly says the number is a float
this is so tuff
this is weird
Is Bart fine tuned the same way as T5?
where do you guys get datasets for machine learning projects?
Do you use kaggle or scikit-learn datasets or something else?
kaggle and google's dataset search engine https://datasetsearch.research.google.com/ are my gotos
and say I want to find a dataset to use logistic regression on, do i just search for "logistic regression dataset"?
that will definitely return you some datasets you could use, but it might be better to search for a specific topic that you would be using your logistic regression to predict.
ok thank you
look for datasets that say they're good for classification
I think this one is very good for a beginner problem: https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris
this is also a classic beginner dataset: https://www.kaggle.com/datasets/uciml/iris
I don't know if you're just starting out or want something a little more complex, but for iris, you use dimensions of iris flowers to predict its species
for penguins, you can use peguin dimensions to predict their sex or species
iris requires less initial preprocessing, it's pretty much ready to go as-is, but the penguin dataset requires a little more work to clean up
this is good for simple binary classification https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
I am here. If u d like , I d be happy to discuss development.
When jit a code with jax, it usually takes a lot of memory?
I am using 100Gb and still get memory limit exceed.
Or is my code that is badly written for jit?
it'll vary wildly based on what exactly you are doing, but odds are you should try to split things into smaller batches (process in parts instead of all at once)
How do i know if a dataset needs a more complex model? I'm currently using a linear regression model and it looks like I can only get up to about 62% r squared score
Are you concerned with the interpretability of the model at all? like is this a statistics project or a machine learning project?
if it's mainly a ML project you might consider moving past linear regression
yeah if it's just linear regression then aside from maybe tweaking your preprocessing steps, you'll have to just accept that whatever r2 you get is what you're going to get for that data
there's only so much you can get out of a straight line
but if it's a ML project, you might get a lot of mileage out of a support vector machine or decision tree-based model
yeah
also maybe modeling interactions
I just got a housing price dataset of kaggle. Because its a continuous variable you can predict it with linear regression, so I tried to fit a linear regression model to it, but I don't think it works that well
Sometimes I don't know if it's me who's doing something wrong or the model is just out of its depth essentially
ah I see, yeah at the end of the day the linear regression assumes a linear relationship between the inputs and outputs, is just trying to draw a line through the housing prices
if the housing prices don't approximately follow a straight line then it won't work very well, or if there's a lot of variance around the regression line it won't work well either
You might try checking through ur code, probably, there are bugs to fix. If you’re comfortable sharing the code, it might help
having said that you might be able to make it better by removing outlier housing prices, or only focusing on a specific subset of houses, or other things that are dependent on the data
And cause its like 10 dimensional data I can't quickly visualise it to see if its linear
have you done any exploratory analyses of the data? that could be a good start, maybe some variables are better correlated with housing prices than others
there's a lot of digging around you can do
but also I do reiterate my recommendation to try other types of models, something like a decision tree or SVM would probably work better with the data as-is, whereas linear regression requires being more careful with the model inputs
No I haven't tbh, I just threw my lin reg model onto it
well that's a good start, you have a baseline for comparing performance now
I don't know too much about exploratory data analysis
what dataset are you using?
Calafornia housing from kaggle
like is it on Kaggle or anywhere else online
I jus thought I would use all the features
ooh are you reading through Hands-On Machine Learning? Or did you find this independently
I didn't think about making my own or exploring the data
No I'm not reading that book, I've covered alot of the maths for linear regression so wanted to try it in code.
Should I "explore" the data before applying models
it helps, for a couple of reasons
actually I have a specific question for you now, what are you doing with the oceanProximity column?
Why do they provide 10 features or so, if we shouldn't use them all then
I was a bit unsure what to do with that column, I mapped it to an integer. Ideally I would know if "island" is further from sea than "1H from sea" so would map it to an int correspondingly
datasets are just a view of data someone else collected, they don't know what anyone might want to use it for
ooh ok try doing one-hot encoding instead
you're describing ordinal encoding and I have often found it problematic sometimes
just because it implies an ordered relationship between different values that may not actually exist
that is, if NEAR BAY is 1 and <1H OCEAN is 2, that implies <1 OCEAN is more than NEAR BAY
Yeah I wish the dataset told us how far each one was
I guess it could make sense but this is an example of trying different things to see how it affects linear regression, maybe it'll work better with one-hot than ordinal encoding, or maybe not
That's why I was unsure what to map each string to
But how do I know when linear regression has hit it's limit
you don't
Even if I do all this encoding and data scaling for exampls
unfortunately that's how it goes with machine learning, you can only ever show that something performs better than something else
but you can't realistically show that you found the best possible
So you just get to a point where you are happy with the results?
yep
that's what a lot of machine learning research does actually, a lot of major advances in ML have happened when someone discovered a technique that made models perform much better than before
One of the main downsides of JITs is that they take up a lot of extra memory.
This is why Javascript projects tend to bloat in memory usage and one of the reasons why modern websites will eat all your RAM with only a few tabs open.
However, it's probably not your JIT that is making you run out of memory for a single process (and you're not something that has multiple things running in it like a browser).
(You also probably don't have enough code for this memory difference to matter, modern websites are just way bigger than needed)
What model would you use for a dataset like this
I would try a variety of them
You're trying to predict housing prices, right? it's a regression task
so a SVR, decision tree-based models like DecisionTreeRegressor, RandomForestRegressor, maybe k-nearest-neighbors like KNeigborsRegressor
And just go with whatever gets the highest accuracy
Just make sure you have a training, validation, and test set, and don't touch the test set until you found the best one
So this exploratory data analysis, should I be like making my own features and stuff
Like income per bedroom as a random example, I know that sounds useless,
some might be useful, I remember they made a few features in the book
if I remember something like "total rooms" was the total number of rooms in the entire census area, so you can divide it by "households" to get the average number of rooms per house
Alright I need to learn about eda,
also doing these steps may make a linear regression model perform better too
What am I looking for whilst doing eda
this might sound a little vauge, partly because it depends on the data, but the goal of eda is to come to as complete an understanding of the data as possible
here's an example of what I would do with the housing price dataset, and this is all hypothetical because I don't remember its characteristics very well
but let's say you look at the housing prices themselves. maybe most of them are grouped around a pretty clear range of values, but there's a small group of extremely high prices that are very different from the others
Should all machine learning projects start with eda
so what to do with that information? well maybe you could decide to restrict your ML project to the lower priced houses, or maybe you could find a dividing line between them and see if you can predict if a house is high-priced or low-priced
I think so, yes, it isn't a good idea to try ML on data you don't understand
OK thanks you have been very helpful, you know your stuff
Hi,
Any course recommendations for beginner in AI?
Have you checked the pinned messages?
Hi, first of all, thank you for helping me.
Yeah, i checked them. they are dealing with specific things.
OK. Is that a problem?
i dont see any message which is specifically talking about cources for beginners
If you've read all of the pinned messages and think not one of them is for beginners, you're gonna have a problem
What do you think of this? #data-science-and-ml message
That is about ML, if i am not wrong?
Are you saying ML cannot be AI?
Can you explain what you're looking for?
i don't know the difference. i only know python. Just thinking to start in AI.
Do you wanna build an AI from scratch, like building your own models or do you wanna leverage existing models ?
Then I think you should go through the "quick" lessons at https://www.kaggle.com/learn to understand what is AI and what is ML and what tools are normally used in them like Pandas
i was watching some you tube videos & the suggestion was to start calling existing models using api calls & then go for building own models.
Which one did you want to do? Call existing models or build your own model?
call existing models first
Then you can pick a model you like, set up an API KEY with that provider, and make chat completion calls.
Idk what provider do you prefer but I started out with Deepseek and just followed the docs https://api-docs.deepseek.com/
Or you could later on look at other Agentic AI frameworks like CrewAI, Langchain, Google's Agent Development Kit (ADK), etc
ok, will try.
Guys,
when i am exploring reddit about what people are saying about how to learn AI, i found this one.
i believe there are some AI expert here. could you please put your comments in it
There are also opposite comments as below
Thats just someone's opinion on the matter. You can either use the pre-built models trained by someone else or until you find a need that the models are not performing enough for your needs, you have to build one.
does anyone here have any background in supervised fine tuning ?
I am hoping for some guidance on a good starting point
not specifically with that task but training machine learning models and language tasks in general, maybe I can help?
what do you have so far? Or what are you wanting to do?
I like to learn by doing, anyone have any project ideas I could use to learn RL?
I think genetic algorithms make for a pretty awesome and comfortable introduction
things like shooting a projectile at a moving target, learning to steer a car, anything simple like that is great to start
and very visually interesting to watch since you can have hundreds of agents attempting a task at once
I thought genetic algorithms were not RL
why would you think that?
you still train on a policy where agents are given reward based on their interaction with an environment
you just optimize by selecting the best after mutation and crossover rather than directly calculating updates like in a gradient policy method like ppo or deep q-learning
Because I've tried to understand the difference between them before https://medium.com/xrpractices/reinforcement-learning-vs-genetic-algorithm-ai-for-simulations-f1f484969c56
Fundamentally, the operating principles of the two approaches are different. RL uses Markov decision processes, whereas GA is largely based on heuristics. The value function update in RL is a gradient-based update, whereas GAs generally don’t use such gradients.
I just plainly don't agree I guess
in all the literature I've read it's talked about as a form of reinforcement learning
never have I heard that policy gradient methods are the only form of reinforcement learning
Interesting, the author could be some opinioned rando lol, let me look. I think I found his linkedin, seems to be average dev
Hi, I have a parquet file, and I am unable to load the data. It says the maximum memory limit has been exceeded, probably due to low RAM. I currently have 16GB . How can I see the data in any other way?
for example https://arxiv.org/abs/1712.06567
this paper frames them all just as methods of optimization that can be used on RL problems, where the RL problem is just an environment and policy that are used to optimize along rather than a dataset and/or labels
Yeah you can probably use duckdb
if you want a starting policy gradient project though pytorch has a DQNN tutorial for the pole cart game
They are not.
once you know that it's just a matter of setting up whatever novel environment you like and tuning your policies
let me check then
try something like (though tbh idk if I've ever tried to read a parquet file larger than my memory with duckdb, maybe you'll have to use chunking, or dask I think is meant for this)
import duckdb
df = duckdb.read_parquet('my_file')
print(df)
what definition of RL are you using?
ok
I think this way of framing it is more accurate
since you could hypothetically apply genetic algorithms to supervised learning problems
it's just a method of optimizing through the search space, guided random search basically
if the guide is a fitness score given from a policy after interacting with an environment, that's 100% RL in my eyes
Seems like that paper frames GA as an alternative to RL that can solve the same problems
I think I heard that GA can be worse because the starting search space is too big?
no, it's saying GA can be better for RL problems than policy gradient methods
not better than RL
that's exactly what this paper talks about
there are a lot of benefits to non policy gradient methods as well as downsides
there are no local minima to get caught in and no saddle points for example wacky gradient stuff
there are special encodings you can use when framing populations as genomes
that are very efficient
the paper demonstrates a lot of this
and argues that GA/ES could be competitive at large scales
Not needed, it's plainly obvious from GA working with a population, while RL says nothing about that. A single robot can do RL in real life (a single person can too), but a single robot/person can't do GA. Another important difference is that GA is blackbox optimization. Putting it in the same room as particle swarm optimization and simulated annealing.
So I can still apply just regular RL to try doing projects like "shooting a projectile at a moving target, learning to steer a car" to learn policy gradients? The visual aspect does indeed sound cool, maybe I'll hook it up to some game like minecraft or pong or something lmao
do you realize how disingenuous it is to say "I don't need to define words because it's plainly obvious" how can it be plainly obvious if we don't share the same definition of the word
It's shorter to show this difference than dig up the formal definition and compare them.
but my definition of RL doesn't include anything about population dynamics, just about how you evaluate the agent on the problem
There is only one definition of RL by Sutton and Barto, whatever your definition is, I can't know, and would not be a good answer to give the person asking if they are the same thing.
me when I don't want ai training on my code but I still want to publish open source projects
that's perfect that you mention that because I'm looking at that book right now
For example, methods such as
genetic algorithms, genetic programming, simulated annealing, and other opti-
mization methods have been used to approach reinforcement learning problems
without ever appealing to value functions.
page 23, section 1.4
Alright let me take a look.
Ok, but this does not make GA RL.
You can use GA to approach RL problems, that is commonly done.
agreed
For example, neuroevolution is common.
But the important thing here is that GA is basically a blackbox optimization you can pretty much always slap on top of anything else.
It's basically like once you have something that works you can throw more computers at it.
Or you can do it raw without anything else.
well yeah to my original point it being "blackbox optimization" is one of the reasons that it is so easy to do beginner projects with
you really only have to focus on the architecture, environment parameters and the policy
RL on its own, unless you specifically mean multi-agent RL, which is its own thing with its own formal specification, focuses on just one agent as can been seen in the book when it's first formed (mathematically) in chapter 3.
The origin of RL is optimal control theory, which is about feedback loop optimization. RL is basically a rebranding of it borrowing framing from biology.
(Of Modern RL, the concept was used prior without really thinking of it in this modern formalization)
I will note that in the most general sense, you can use the mathematical framework of RL (ignoring whether it's about a single agent even) to give a general definition of AI that all AI would fall under (anything intelligent that makes decisions), but that is one of the ways to give a formal definition of AI is the equivalent to other options. However, if you want to say that it falls under RL via that route it's possible, but probably not a well accepted, nor common answer (as every AI is then RL (or well, an approximation of the most perfect RL agent possible)).
(See AIXI)
The upsides of GA is that it's blackbox, it works really well, and it can be simple to code. The downsides is that it's compute intensive and not feasible in many situations. Also depending on which problem it can be hard to code, for example if you have some simulated environment and now you need to setup a fancy distributed compute cluster that can run N simulations in parallel and synchronize and all that.
I agree it's a good beginner project and I recommend everyone tries it. I have found that people often don't really understand/feel how evolution actually works until they have coded a genetic algorithm (without this, it feels too magical and can be hard to believe that it's a thing (after, it feels obvious that it must be a thing (like you could see this happening randomly just by random physical systems))).
GA will often do better than most other methods, if it's actually applicable (can you have multiple agents? does it take too much compute?).
GA is very strong (pretty much everywhere it's been used (real world problems) it gave amazing results).
It can also be mixed with other methods (used in addition to), so it's not a "one or the other" in many cases.
Well the original question was a good project to learn RL, I think teaching a bot to play pong could be good. I have to prepare for my RL class
There are several options there, including classics such as cart-pole.
You can also use non-realtime turn based games.
Simple grid-world games.
That way you can use tabular RL, no neural networks or any of that needed.
Hello I have a doubt
In transformers can we decrease the patch size
Like in original paper it 16 x 16
And then after flattening we will get 16x16xinputxhannels
sure, why not
though if you train it on the same dataset as the paper, you'll likely not get the same results
Is there a way to see how complex it will get ?
Embeddingdim will be 24
(2×2×6).
But total patches counts will be much more
I am using custom data with img size 128 x 128 ,6channels and patch size 16
I'm unsure what you're asking here
you can get the number of parameters in the model, multiply that by your chosen precision (ofc, if you use mixed precision, you'd need to account for that as well) and that's roughly how much VRAM you'd need
but backprop might need some memory on top as well
I mean, experimentally you can just try running an experiment and see how much memory it actually takes up and then scale accordingly
Anyone here learning data analytics?
I'm trying to do some simple binary classification of images. Does having a smaller resolution than I normally would effect the results? Actually I am a bit confused. The data I'm working with is 1920x1080, but it doesn't seem "HD", as in there seems to be some blurring artifacts going on. It was sampled from a video stream, so I dont know if the encoded data had a lower quality to it
Presumably a 1080p stream is high quality/clear
I'm thinking of using rust for my language on this project, is it best to do all of the training in python and then run the models in rust from there?
Eventually I might try "full stack" in rust (in the sense of training/running), but for now I am pretty much just thinking about training in python and using in rust. Mainly I want more experience in ML as well as Rust. Combining the two doesn't seem like the "best" solution, but working together can help me some
The library I'm using right now is essentially torchlib rust bindings. And I'm only providing access to dynamic linkage through my python virtual environment. All in all, would seem pretty plug and play, pytorch is just python bindings for the C or C++ torch library eh?
My command to even run the binary is LD_LIBRARY_PATH=./venv/lib/python3.13/site-packages/torch/lib LIBTORCH_USE_PYTHON=1 cargo run --bin my-bin 😂 for some reason LIBTORCH_USE_PYTHON=1 is not enough, it cannot find the libtorch_cpu.so which is in that dir. For some unknown reason
If I have more samples of one dataset classification than the other, should I discard that data? I've read there's apparently biases that can be formed in training. To exaggerate, if 99% of the samples are one data type, then the model will be trained to simply always predict that classification. As more often than not it will be correct, without any further "analysis" or scrutiny
Right now my samples look like 70% of type A and 30% of type B
I probably shouldn't be concerned with this at this stage just yet, I'm still setting up my training routine. It was just something on my mind
I have other work in rust already, it's moreso keeping that momentum moving forward lol
I have a lot more python experience. Not that I'm an expert, but I have been moving some scripts over to Rust just for further experience. I write so much python out of convenience, maybe if I had that familiarity with Rust I would do a lot more with it also out of convenience. It's build system and library imports are as streamlined as python pretty much. Both of which are a breath of fresh air from managing build systems with C/C++
simply use an appropriate metric
predicting the 99% sample class A might give you 99% accuracy, but if you look at say f1 score instead you'll see that it (correctly) says your model is pretty terrible
similarly, say you're training with cross entropy, you can give it weights to let it focus more / less on certain classes
on a related note - I've never had luck with oversampling tricks like SMOTE where the only thing it seems to do is increase training time and reduce model performance
I will keep those in mind. It makes sense to penalize more potentially if it guesses wrong against a class B scenario
Overall I do not think my training application is that complex. class A is "everything else", and class B is a very distinct sample. Essentially just training a model to detect if B is present or not
Hey 👋 guys. I am learning python for going into Data Scienctist Role, and have found Python interesting, and have been sharing small topics I learned by writing blogs, you can read my recent blogs at Medium
Here is link to my latest blog
https://medium.com/@buildwithmobi/concatenation-in-numpy-2ea97b290f2f
I hope you would have wonderful time exploring what I explored, and the way I explained things 😇.
I appreciate you doing some heavy lifting and reading more into it
A lot of jaded people probably assume most of AI stuff is nonsense marketing. At least you sounded impressed at first before looking further and changing your mind
I see a lot about using the suffix .pt or .pth for pytorch models. Why not something more generally geared towards Torch itself? Isn't Torch written in C++ after all?
I just wonder why not .tch or something
That sounds like a really hard question to find an answer to (you'd probably want to track down the first usages of that format and see what the reasoning was), but also, note that pytorch is overwhelmingly more popular than libtorch, so few people care how its internals are written
Understandable. I just know you can use models in various different languages, so it only seems reasonable to simply refer to it as the underlying library and not necessarily python's wrapper
But I can see how it is used in python at probably a 99% rate so it's not a big concern. In the end it's better to stick to conventions instead of unnecessarily forking logic like that
I'm running into a constant problem area in my model architecture that I am unsure of. I am following along with an example. I won't paste the code directly because it would be needlessly long. I will describe the exact line I have had problems with, and also another line that gives me confusion.
in the __init__, I notice that the in_channels and out_channels for CNN layers match, as expected. 3x16 * 16x32 * 32x64 are valid matrix operations. However, for the first connected layer, the matrix math does not line up with me? nn.Linear(in_features = 64 * 6 * 6, out_features = 500) , that would be 32x64 * (64*36)x500
the magic is in the forward call, but the max_pools dont seem to alter the matrix algebra in this sense? X.view apparently reshapes the matrix. But during training, I had to swap 64*6*6 to 64*7*7 for it to work out? And now I'm trying to test a single image at a time and it's not working for me. It worked during batch training
RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x49 and 3136x500), where 3136 = 64*7*7
what are the image dimensions?
[3, 248, 248]
The key is probably here:
X = X.view(X.shape[0], -1)
X = F.relu(self.fc1(X))
When X is passed to the first linear layer, it's reshaped from (BATCH, 64, W, H) to (BATCH, 64*W*H). And W and H here are not the original ones, because convolutional layers can alter them
they alter them like this, to be precise:
and plausibly, the sizes here work out to W=H=6, making the example correct
I'm stepping through right now to see how dimensions have changed
Okay, I have figured it out. When I dont add in batches, that entire dimension is lost
Yeah, generally when processing a single image you must still have size 1 along the batch dimension, or bad things happen
By the time I get to that X.view() it is not using the batch dimension, and compresses the other ones. Which would multiply out to 64*7*7
Yeah the image coming out of the CNN is 64x7x7, not 64x6x6
it comes out as 64x6x6 if you remove the padding
This documentation.. I must read it
That, they do
I guess whoever made this example forgot the padding, added it after they pasted it into the document, and forgot to update the linear dimensions
or something like that
I'm using different images than they are, I sort of assumed that would answer the slight difference there
As I noticed my resolutions dividing and flooring. I could see them being off by a couple integers once it's compressed a lot
I also never normalized my images which I should probably look into lol. I didn't need to resize them because I've already applied cropping way earlier
So apparently I can use the tensor unsqueeze(0) method to insert a "singleton dimension" at the front
Now I wonder if it's best practice to put that logic in my forward method, or to preprocess data before entering my model? I presume the latter for performance, and the former for simplicity. Because when training on batches, it will always have to evaluate the conditional dims == 3 --> unsqueeze branch
Oh you mean to batch it in the first dimension?
I don't know if there are best practices around this but I would have the model assume it's always batched, and require data to be inputted as batched
Yeah unsqueeze(0) will prefix my dimensions with a 1x, aka the "singleton dimension"
Do I need to .gitignore my models? How should I store or potentially version control that information?
I can imagine some disconnect between editing actual code, and training new models. Potentially using new or somehow different data. All of which are outside the scope of lines of code in some nature
you mean .gitignore the pytorch module you wrote that includes the model?
Specifically I mean gitignore the model itself .pth which is some large zipped binary
oh yes gitignore that
That's what I'm doing. But I presume there stands good reason to somewhat version control models? Maybe document their training process? Or is it always use the very newest, and it can never go wrong? Lol
in general you don't want to commit binary files to git repositories
you can, but the overall design of git doesn't lend itself to that, plus it makes checking out the repository take longer and be more bandwidth intensive for something not everyone might want
if you ever want to distribute the model you can use git LFS, or upload it somewhere else
I figured I would have a separate repository for version controlling the models. As to not pollute the main programming repositories
And I also figured it wouldn't be pushed to often, only when surpassing large milestones from altering the training process or the architecture/technologies used. Otherwise I wouldn't commit if it was a fruitless effort
I'm a bit confused on the following pytorch example that performs various transformations on an image. Particularly, I'm interested in the Normalize method. I thought the printed image would lose a lot of contrast from normalization, which would affect how it appears in plot([img, out]) display. I suppose the coloring looks slightly off.
I wouldn't expect normalizing to massively change an image, everything should still retain their values relative to one another
the only major change would be if one channel had a very large amount of one color
say the image was extremely red to the point that the red channel always had some value greater than 0
looking at that example, assuming it's RGB, the red and green channel both have a mean of about 0.5, but the blue channel is closer to 0.4, which means it should be pulled up a little bit in the final output
and actually the output does look a little blue to me
I guess I'm a bit confused because RGB is 0-255 but the normalized values are being set to 0.5ish and the tensor datatype is treated as a Float instead of a u8 byte
the image could have been imported as floating point values between 0-1, where if interpreted as an integer-valued RGB color space, 0=0 and 1=255
that happens with audio data sometimes too
Apparently the maintainrs of tch-rs are not considering supporting auxiliary pytorch libraries such as pytorchvision. I wonder what the implications are on my end. For instance, I am doing 2 transformations on data. I used pytorchvision to run the ToTensor() operation on a PIL imported image, and then Normalize the data. I did manage to apply cropping via the PIL library itself, and believe I could do that outside of torch. I figured it would be most optimal to do it all natively
Only ran into this issue because I wanted to randomize a torch tensor then normalize it. But what I noticed is normalizing it forced CPU operations which was mismatched with my GPU tensors, and I could not find a way to address this. But this is more specifically a rust problem
I suppose the torchvision.transforms.v2.Compose I'm using creates a nn Module itself. Maybe I can simply export that as well 🤔
can someone help me with this regex (in pandas dataframe):
i have a string patterns like this in one column:
A58P2
PL4P1
MPE5P2
JB3P2
...
I am trying to split this string at the italized P:
A 5 8 P 2
P L 4 P 1
M P E 5 P 2
J B 3 P 2
I tried to do
Pandas.series.str.split(r'^[A-Z]+[0-9]+(P|J).+$', regex = True)
but it is giving me an error: string index out of range
basically the pattern I see is I have one or more occurences of alphabetic letters, then one or more occurences of digits followed by a P or J and anything after that
thanks for helping!
btw after testing my regex, it does what it is supposed to do. it can match the pattern given
In [13]: s
Out[13]:
0 A58P2
1 PL4P1
2 MPE5P2
3 JB3P2
dtype: object
In [14]: s.str.extract(r"(\w+)(P\d+)$")
Out[14]:
0 1
0 A58 P2
1 PL4 P1
2 MPE5 P2
3 JB3 P2
Is this where I can talk about matplotlib?
yes
The reason why is so I can make my own package using class which will be shown later at some point
If anyone's interested in a free machine learning certification from columbia worth 200$
https://plus.columbia.edu/content/machine-learning-i
Use code "NICK100" at enrollment
and make an account
i'm not sure what you mean, what are you trying to do?
Trying to make a cave plotter tattooing reconnaissance before somebody goes in
I'm not sure what you mean, what is the thing you're wanting to do with Matplotlib specifically?
you want to make that visualization?
Fine tuning T5 jojos bizarre adventures? What’s up? What am I in for?
What is this?
You use a encoder-decoder transformer to translate one language to another and then score the metrics through bleu or rouge
Yes also I'm getting a pie torch manual beginners edition
Hello guys Im planning to buy a new laptop to start my journey into the data science world. The laptop I have in mind is this HP EliteBook 645 G9 AMD Ryzen 5 PRO 5675U Hexa-core (6 núcleos) 2.30 GHz 16 GB RAM, 256 GB SSD, Do you think is enough to work with all the data science tools?
It doesn't really matter what laptop you get as long as it's not a Chromebook. No matter what laptop (or even desktop), there will be things you can't do on that machine.
I probably wouldn't even get one with a GPU so that you can use the savings to rent cloud compute.
I would focus more on the build for a laptop. Unfortunately most laptops are horrible these days, with things like the keyboard leaving an imprint on the screen when some pressure is put on it, being flimsy in general, and a touchpad that has terrible tracking.
Avoid any "gaming laptops" with a dedicated GPU, laptops can't properly cool those to get the most out of them (and tend to have issues with Linux), integrated is the way to go.
I'm pretty sure OpenAI's main goal as a company is regulatory capture
this is less hype and more trying to sell themselves as both making something so dangerous that it needs to be regulated, and also trustworthy enough that they should be doing the regulating
i guess it's a kind of hype too
ok three things then, trying to sell themselves as both making something so dangerous that it needs to be regulated, and also trustworthy enough that they should be doing the regulating, and brilliant enough that they created something so good that it's dangerous and needs to be regulated
Hey I have successfully trained some binary image classifier model (potentially it has been overtrained, but not a concern right now).
I'm getting upwards of 98% accuracy from supervised learning. I am using PyTorch, and I have exported my models to be used in tch-rs rust bindings for libtorch. After reviewing the design it appears to be mimicking the flow of data properly, except I seem to lose 20-25% on inference in Rust vs in Python
How can I tackle this problem? Are there some ways to log the flow of data in a succinct manner where I can compare it across a run in PyTorch then look at the same run in tch-rs and see where things differ?
I can seed randomization in the selection of a test set, but I'm not sure if seeding randomization is needed for model inference? I suppose for my particular model there is no RNG control during runtime evaluation. ie, no Dropout layers. Only "randomness" I am using is shuffling validation set as previously mentioned
What amount of storage is included in AWS Free Tier, and what is the maximum disk image size I can upload or use under the Free Tier limits?
Can you just double check the weights?
Make sure weights stay the same?
Do one forward run of a single identical image?
check the activation at every layer?
Maybe I'm reading this wrong but it sounds like you aren't using the same validation/test set in the rust model as you are in the python model? Or you're randomly generating new sets?
I am using the same sets of data
I presume if my accuracy dropped 20+% then the weights must be different in the layers, but I can do that yeah
I'm just wondering if there is some logging feature that when I do any sort of tensor mathematics it will mark it down. And therefor I could verify the flow of data
You could also just print out manageable portions of outputs in between layers
I didn't explicitly set types anywhere in Rust I believe. At least the underlying tensor types
Would loading in my exported model not use the right datatype? like f32 vs f64?
The tensors in my model all seem to have the same size. I'm having trouble getting it to print out the weights in the Rust code like it does with named_parameters() in the PyTorch code
I am prodding around docs and the repository to see if I can find answers somebody left behind
Okay I actually did it by looping through and printing verbosely. Was hoping there was a nicer version like the PyTorch way
Okay I'm noticing some strange scaling differences. Sometimes the scale seems correct, but sometimes it's off by a factor of 100x
Like a direct multiplication
Seems like all of my "bias" values are scaled up by 100x
I loaded it in, exported via torch.jit.trace
The normal weights seem fine, but the biases are all 100x their value exactly
Maybe slightly off from resolution losses at the lower original level
https://github.com/LaurentMazare/tch-rs/tree/main/examples/jit
I'm pretty much following this exactly in python and rust. Only slightly off for implementation specific details
Oh, wait.. Nevermind I'm seeing some extra confusing stuff in front of each bias lol
Notice that inconspicious 0.01 *. I guess this is their way of displaying "scale these down by 0.01" with scalar multiplication ofc
Sorry, the way they decided to print out weights confused the hell out of me
Looking over this it all seends to match up
There is one other network I want to check, but apparently named_parameters doesn't work due to how it was constructed
data_transform = torchvision.transforms.v2.Compose([
torchvision.transforms.v2.ToTensor(),
torchvision.transforms.v2.Normalize((0.5),(0.5))
])
are python is enough?
for what? to be happy? yes.
im learning linear algebra and im not happy =(. Enough for be employed?
im litteraly confused how much information about roadmap
you pretty much need a degree (probably a masters) to get a job in this space. it's very degree-requiring.
and im thinking about learning R
you don't need to learn R.
I always encourage people to learn about things that interest them, just for the sake of learning.
If you're not able or willing to get a degree that's related to AI, you should plan to do a different career.
thats tough
Every single AI job that's posted will get several applications from people with relevant degrees, so they won't even interview the ones who don't.
depends on a lot of things. I am in the US. Look at universities that you might attend and see what they charge per semester. Multiply that by 8, and that's roughly the cost of a bachelors.
Though you'd probably need a masters.
are there already directions directly aimed at AI training? I have not heard this in my country, usually they just study applied computer science, for example.
idk what your country is.
Russia
I'm not sure how people get jobs in AI in Russia.
It's basically the same in the US and the EU. But idk about Russia.
it might be similar to the EU.
Thank you
Check on hh. There is some
I dont think there is a bachelors with AI training directions
50/50 anyways it education is better than nothing. All the same knowledge is more valuable even for any business than education. It is still very difficult to decide
I added you as a friend. If you are interested, we can chat.
Out of curiosity, is having an impressive portfolio with another technical bachelors and professional history "enough"? I suppose it would take a lot of job hunting trial and error until one gig sticks
Like my undergrad is electrical engineering, but I've been a computer engineer/programmer pretty much my whole career
I presume you gotta get past some sort of HR filter, which would include checking your degree (no undergrad or masters in AI specifically)
yeah fair, just wondering if there's any likelihood
Possibility seems quite low for the time being. I would be interested in doing some formal studies, for now it's all side stuff
And the work I do is pretty much totally unrelated, maybe a future job that offers tuition reimbursement wont mind if it's in a somewhat unrelated or tangential field like AI
Sometimes I wonder if all this competition means our industry is oversaturated with talent. I would figure if it's a tough field to work in with high qualifications, why would there be so much competition for roles. Especially when a lot of people say it's a bubble right now
At least maybe the oversatuation of LLM work might be a bubble
does anyone have problems with pytorchs mps backend ? i tried to train a model (which i also tried in a nvidia gpu device and the loss converged) with it and the loss doesnt converge
Ok, have any of you , ever, fine tuned a transformer for Machine Translation? Not just a LLM, one that had to be fine tuned with your own data. Anyone?
do you have a question about the process? I've done fine-tuning of models for tasks other than that
Same here only asking about machine translation and encoder-decoder transformers. Have you done that for machine translation?
Libraries I use:
Numpy pandas, matplotlib, seaborn, plotly, scikit-learn
For data visualization and machine learning.
My question: What are some real world projects that are done using these libraries? What data scientists actually do in the companies? Any cool ideas? Anything that can make me stand out in my college a little bit. It should be more than just simple regression and classification stuff.
I would recommend try to write some simple classificator, with no use of scikit-learn
It would basically show your strong understanding of how ml algorithm works
Risk Pooling was always a thing. That dang Bayesian Nash Equilibrium. Pooling equilibrium. That pooling equilibrium and those bad apples that lead to lemons. Just an overall market of used car sales man. This is nothing new.
They end up using Bayes Rule and screening.
check your dms
so I'm saving lots of stock historical data, and I want to partition it by ticker (about 15k tickers) because most queries are basically WHERE ticker = {ticker} or have GROUP BY ticker, so its the most eficcient way to query the data.
however, the issue is writing a partitioned dataset that has 15k partitions, meaning 15k files open simultaneously
I tried writing the data with pyarrow and polars and its incredibely slow, like its taking too many hours
and I've verified the issue is the 15k partitions, because if I partition it differently (like year and month) it takes only a few mins
so any recommendations?
btw pyarrow's default max_partitions and max_open_files is set to 1024, I wonder why
Did not know it was strong on degree requirement. Would a double major in Maths and Comp Sci with a focus in datascience good? Once I graduate I plan to take my masters
I don't think it's necessary to double major, especially if you're already planning to do a (AI-focused?) masters
My College just came out with an AI major but its still in the works. Not finalized until next year. I just would have to review the courses and how its actually useful. But I decided to double major just because I do love math and wanted to go a little further into it. However all the courses I picked are related to AI...
But masters would be AI focus
getting a degree in AI isn't necessarily better than getting a CS degree that's AI-focused. It might even be worse.
Yea.. My advisors kept trying to push me into it however I said no. Just because with Comp sci its more broader. I can always go a different route.
However I do love ML and stuff so I would love to try my best to get into that area.
computer science has accreditation standards that have been widely agreed upon for decades, whereas "AI" does not
Gotcha
So data science and AI are different?
I thought data science applies for it which would be fine
they are. imo, "data science" shouldn't even be a vocabulary item. it's just statistics.
alright well looks like I should double check my path lol.
it's a mixture of statistics, visualization, analysis, and various disciplines related to storing and retrieving large amounts of data
data science might make use of some machine learning and other AI techniques if it's useful for what they're trying to do, but it isn't the main focus
gotcha
it is just nlp ```0 [[PERSON, jLinkedIn], [PERSON, linkedin.com]]
1 []
2 [[ORG, Stanford University], [ORG, Stanford], ...
3 [[GPE, M.S.], [ORG, Computer Science], [DATE, ...
4 [[ORG, GPA], [CARDINAL, 4.0/4.0]]
5 [[WORK_OF_ART, Coursework : Machine Learning, ...
6 [[WORK_OF_ART, A.I.: Principles and Techniques...
7 [[ORG, National Taiwan University], [GPE, Taip...
8 [[ORG, Information Management Sep 2014 - Jun],...
9 [[CARDINAL, 1/39], [ORG, GPA], [CARDINAL, 3.95...
Name: entities, dtype: object
You feel more energetic when you cut sweets as per my experience
BTW is it normal that single epoch take more than 4 hours to complete ?
I got like 24000 samples and shape of each sample is 900 x 800
I trying to predict forecast using unet by giving t to t8th samples as input features and t12th and t14th as targets
So like sliding window it keeps adding 1
In my job, I just use OpenAI API to answer student questions using RAG, that's fairly simple, comparing to the huge amount of work that goes into data analysis and more
this is a very broad question but i wonder , how does one get/be good at debugging ai models? like what do you need to know/do to be good at it? is there a list to follow or are there any techniques? cuz when a model im working with doesnt learn/converge i cant find the reason therefore cant fix it.
But of course, I'm only able to do that, due to the amount of work that has been done before.
Do you know about Deeplearning.ai? https://learn.deeplearning.ai/
deeplearning.ai learning platform
no
i've heard of it but i dont know the content
When you know how it's working
They are just like cousera, but for AI!
One of the courses answers your questions
which one?
so you are telling you need to learn more?
No how model works
I don't remember which, maybe look at the catalog? Hehe one of us will have to do that :)
What's the exact problem you are facing? Are you debugging an LLM, AI, Diffusion model?
specifically when i try to implement papers
But then I started to understand how model Will capture the patterns .During eda what are the patterns I can decide which model to use
i think thats my problem, i dont know any math and dont really understand why the model works i just know how it works so i cant really debug them.
Yes exactly that how it works you ge the idea that this is vision Transformer it divides images in patches than performs some operation and flatters the data and gives a positional encoding
And so on But yeah how its doing these things
Knowing that makes the difference I guess
But yeah I'm still a beginner so my take can be wrong
yea thanks for the advice
Okay, do you mean, problems with the Notebook? or the AI is not behaving the way it should?
ai not behaving the way it should, %99 of the time i have no problem with code
the problem is the loss doesnt go down, or when it does the model doesnt behave the way it should
Can you tell what model are you using? And what's the model function?
i tried to implement BERT
in pytorch
ooh Bert now I get it
Found it!
@weary timber
You said:
like what do you need to know/do to be good at it? is there a list to follow or are there any techniques? cuz when a model im working with doesnt learn/converge i cant find the reason therefore cant fix it.
And BERT according to wikipedia uses a Transformer architecture!
So this is the perfect course for you!
Hope this helps!!
How do i make my own Neural Network in python?
What do you want it to do
Neutral networks are a way of having a function. So you need to know what the inputs and outputs are
A machine learning craftsmanship blog.
Sorry for the late response, I would like to train it on wikipedia articles, If you could find the dataset too that would be awesome.
Hi, I am new here. Currently in school and want to learn about python. I am studying actuarial science and would love to have some guidance.
train it on wikipedia articles to do what? remember: inputs and ouputs.
what model is it and what hardware are you running it on?
oh I missed that it's unet
oh that's interesting
not if it's rolling
oh well it should reduce it
I'm trying to figure out how that works, I'm assuming the timesteps are input channels?
Im reading the hands on machine learning with sklearn, tensorflow and keras book and I'm on chapter 2. Even though some features aren't linearly correlatedto our target variable, they are still used to train the model, how do we know these are useful features?
There was no mention of selecting features really. It talked about how some are more useful for predicting house prices, e.g. median income but in the end still trained on all of them
there are a couple of ways to determine if features are useful
one is that if you think a feature may not be useful, you can try dropping it and evaluating its impact on model performance, if performance gets worse then you should keep it, if it gets better or stays the same you can eliminate it
there are also methods for regularizing linear regression that will incentivize the model to assign zero or near-zero weights to features that don't contribute to the final output
and finally once you get out of linear regression into more complicated nonlinear machine learning methods, there's a good chance that some inputs may have a complex nonlinear relationship with the output, so even though you can't find a correlation, the model may still discover a useful relationship
so it isn't always useful to spend a lot of time on feature selection
sklearn documentation has a good page about it https://scikit-learn.org/stable/modules/feature_selection.html
also if you want to get way more advanced and don't care about interpretability, look into dimensionality reduction with principal component analysis
Hmm OK thanks, also how common is stratified sampling? I.e. you want your testing set to be representative of a larger population, cause that's the approach they use in chapter 2, but I have only seen a random train test split used hefore
I think as a rule of thumb random is usually ok, I've used stratified sampling the most with imbalanced datasets
if your dataset is imbalanced there are also different classification metrics that can account for it, I think that book talks about them in the chapter on classification
but I've occasionally had datasets where I was worried that random sampling might miss a few rare classes, so I did a stratified split
in a typical case it's more up to you and how rigorous you want to be
I got asked the boy-girl paradox in an interivew (USA, finance firm)
Anyone else think this is an unfair and ambigious quesiton?
Input: Aquestion on an article ex:what is the mass of the sun
Output:Whatever the mass is
So you want to train it to do question answering? That doesn't go without saying.
yes?
I think that's too ambitious for a first project involving neural networks. I usually suggest that people start by making a classifier
It would have been more interesting if they wanted to have a discussion about different interpretations of the question
Classifier?
In this context, a class is like a category of thing. A classifier determines which class a given entity belongs to
So like it classifies data like if I give it 100 fruits and 30 are bad then it clssifies the data?
Sure
I saw this tutorial but it is in C# and I don't know how to implement it in python:
https://www.youtube.com/watch?v=hfMk-kjRv4c
Exploring how neural networks learn by programming one from scratch in C#, and then attempting to teach it to recognize various doodles and images.
Source code: https://github.com/SebLague/Neural-Network-Experiments
Demo: https://sebastian.itch.io/neural-network-experiment
If you'd like to support me in creating more videos (and get early acce...
Look into pytorch
Do you have a tutorial?
No. I'm sure the ones I used to use are outdated now
Just don't use a tensorflow tutorial, since those are inherently outdated
Ok, thank you I will look into it and if I have any questions I will ask you.
I'm not on call to answer questions. Just ask your questions to this channel in general (don't ping me) and whoever's around will take a look.
Ok sure I understand.
I am working on my first proper ML project. Dataset is big, and local training (VSCode) takes ~15–20 mins per run.
My current setup: I have a big template that loops over different models + hyperparameters (GridSearch), but it's way too slow on my machine.
Idea:
Use Kaggle/Colab with GPU to run all analysis, try out different models hyperparams
Finalize best combo then just copy that final model setup to VSCode and run it once.
Q:
Does this approach make sense? Anyone doing the same?
You can do that but I would look into eda aswell
Brute force works but usually isn’t smartest method
Depending on the cost to evaluate your solutions, I might look into either bayesian optimization, otherwise evolutionary algorithms
I thought Google colab was real time?
What happens if two people try to edit at the same time?
What kind of models are you running?
i dont think he should start off with frameworks , he should firstly implement one from scratch
theres https://www.youtube.com/watch?v=w8yWXqWQYmU this video that is in python but its with the mnist dataset
Kaggle notebook with all the code: https://www.kaggle.com/wwsalmon/simple-mnist-nn-from-scratch-numpy-no-tf-keras
Blog article with more/clearer math explanation: https://www.samsonzhang.com/2020/11/24/understanding-the-math-behind-neural-networks-by-building-one-from-scratch-no-tf-keras-just-numpy.html
however you can change the dataset and make it a doodle classifier
this way you'll learn much more and achieve what you wanted
i would appreciate it if you try this and tell me features to add/bugs to fix/parts that need improvement
lr, RFC, dtc , xgbc and knnc classifier models with few hyperparameters
You won't get any benefit from running those on a GPU, scikit-learn doesn't use them
I mean with cuml
I can ran models that are supported by cuml and use them on my local machine
how do we know these are useful features
before removing features, if you're just looking for accurate predictions, I find that a well regularized model with all the features (other than the obviously non-helpful ones) usually will end up better; the more noticeable downside with a lot of features is requiring more compute to train your model, so if you're hitting a wall there then maybe consider feature selection
the best (and often, most difficult) way to know if X is useful in predicting y is through domain knowledge; in fact domain knowledge will help with all sorts of other things as well like what features you should add
you could try methods like recursive feature elimination, selecting by model like lasso / tree feature importance or statistical measure like F-test / mutual information, but note that these data-based methods don't always end well
sure, that wouldn't be a waste of time.
@lyric meadow reddit post, also this: https://cocalc.com/features/jupyter-notebook
CoCalc landing pages and documentation
well one, you could also consider other optimized hyperopt libraries like optuna or flaml or whatever
you could also consider that maybe you don't need to try all of these models? I mean I really don't see say a knn beating xgboost, unless in extremely specific circumstances
you could also decide that it's not worth to optimize hyperparams at all, and simply using something robust like a random forest with default parameters is enough
hey guys, what are the prerequisites to read hands on machine learning with sklearn book?
It should say in the book what you're expected to already know
Though maybe you don't have it and are wondering if you should buy it?
hey guys what’s the best way to get to grips with starting data science / ai. i’m due to start a computer science a level in september for 16+ education in the uk - and am looking to persue it as a field. all tips appreciated- dms open
A good way to start is to download a data set from a website like kaggle and learn how to perform manipulations on it, and come up with some ideas for what insights it might contain.
We encourage people to ask and answer questions in this server and not move it to DMs, so that everyone can see what's already been said.
@serene scaffold Hello, thank you for responding in mod mail.
Idk if you had a chance to see the vid itself. I've decided to not share the link here. Also not entirely sure about the content I should cover. It almost feels like I'd be trying to boil the ocean if I decide to talk about e.g. central tendency, dispersion, distributions, probability, etc before even touching ML/DL.
if you wanna work with the low level stuff (in ai/ml) dont start before learning the math needed
yes i’m doing maths a-level too
nice
i have been trying to make a fingerprint matcher. Tried various algorithms but the MCC based matcher was the best among them
however it is very slow, it takes around 3 seconds to match and give a match score
is there any way to make it faster? Currently it 1 to 1 matcher. But I want to implement 1 to n matching. But it slow speed like that i dont think it will do any good
Hello, I'm 18 years old and I've recently started working in the field of Artificial Intelligence. Do you have any advice you could share?
from stelercus
A good way to start is to download a data set from a website like kaggle and learn how to perform manipulations on it, and come up with some ideas for what insights it might contain.
We encourage people to ask and answer questions in this server and not move it to DMs, so that everyone can see what's already been said.
wdym by "started working in"?
Sorry i just started learning
Re LLM. Yes fine tuning is probably what you want to do. Because it uses least GPU resources.
To pretrain a model you need a ton of GPUs. So unless you have access to big GPU cluster I would start with fine tuning.
You can use free GPU at google colab for example
Did they ask about LLMs somewhere else? Looks to me like they haven't mentioned them
Yes In python discussion channel
i have pdf
I have created my own language model (GPT) from scratch using torch. It's currently working great and can generate meaningful words. However, the generated output always has fixed length because there was no end token.
Question: What is the best way to introduce end token to the text corpus?
If I put the end token after every sentence in the corpus, the model would generate the end token after every sentence. Does this mean that I need to manually modify the corpus to put end token every meaningful chunks?
there are some corpuses which are not a huge chunk of text but chunks of text. i think one of them is openwebcorpus. you can use a corpus like that
or you can select a random int n and take n sentences and create a chunk and repeat
Is there any resource that can teach me when to use which algorithm?
And Most asked interview questions?
idk about interview questions but the first one comes from experience imo

Most asked interview question? "Tell me about yourself"... see #career-advice for more gems like that 🙂
Thanks
I am dealing with imbalanced binary class problem
I have used
Oversampling
Under sampling
Balanced RFC
Different other algos
Class_weight
But nothing is giving me good measures
I might as well break my laptop at this point
Any advice?
hm this is tricky but when you did oversampling, was it random oversampling or did you use SMOTE?
also what models are you trying?
also what's your dataset and how imbalanced is it?
→SMOTE, RandomUnderSampler
,SMOTENC
→ KNN, DTC, RFC , LC , XGBC , LGBMC, Balanced RFC, even tried voting classifier 🤕
→ 0 are 39922
→ 1 are 5289
It's about bank churn prediction
what kind of results are you seeing?
hm so the recall isn't terrible, but it looks like that's because it's classifying a lot of things as positive
also I'm curious how much of the dataset you looked at? I wonder if there's a way to reduce the number of negative examples without singnificantly affecting the positive examples
you said it's a bank, right? maybe there's a certain type of customer that doesn't close their account often and you could try eliminating them from the dataset
Hi, i am trying to create a churn prediction model for my company. I have never done something similar and would like some advice.
CSM would like to "see 6 months into the future" meaning if i look at a customer today will he churn in next february?
My data is pretty limited i have around 900 examples from which around 150 are churners. (This includes anybody who churned from the point we have data which is from 2021 january) the non churners are the currently active customers.
Is this approach feasible? I feel like the data is way too limited and the question is too specific. What are the current industry practices? When can i say we dont bave strong signals and this problem is not solvable with our current data/setup?
I am less interested in exact models like lightbgm, logreg etc and more interested in methodology how should i approach this problem etc
so if you're literally wanting to model the case of whether a customer will churn in 6 months you'll want data that reflects that, although depending on the nature of your data it might not need to be that precise (that is, if the data you have about them doesn't change much)
but given how small your dataset is you may want to begin just with a data analysis project to see if you can identify characteristics of customers who churned
I doubt you'll be able to pinpoint it exactly but you might be able to find "warning signs" where a lot of customers with a certain characteristic tend to leave
I have no idea what your dataset looks like, but there's a well-known churn dataset called the Telco Customer Churn dataset https://www.kaggle.com/datasets/blastchar/telco-customer-churn
it's useful because a lot of people do analysis projects with it that might give you some ideas
lol wow 1,982 example notebooks
Hey guys, I've been a lurker here for quite some time and I believe this channel is the closest to my question, as it isn't entirely python related, but about a data set I'm trying to find. I'm doing a research project in high dimensional statistics and my advisor suggested I review lasso parameter selection by imitating the methods described in section 2.4.1.1 of Buhlman and van der Geer's High Dimensional Statistics book.
we have a covariate with p=7129 gene expression measurements. There are n=49 breast cancer tumor samples.
This text and another one of Buhlman's from https://academic.oup.com/bioinformatics/article/22/22/2828/197039?login=false reference West et al (2001) as the source of the data. Both context's suggest that there should be some set with 49 samples and 7129 parameters, but I'm unable to find this dataset. In the second Buhlman text, he points to mgm.duke.edu/genome/dna_micro/work/ but that link seems dead
https://www.reddit.com/r/datasets/comments/cv9tru/west_et_al_2001/
Haven't found anything on the GEO either :(
Question: Why does subword tokenization worsens the model loss?
In my language model, I switched up from character level tokenization, to subword tokenization using the sentencepiece libary. However, the model seem to not learn the corpus quite well (144M subword tokens). I've tried adjusting the learning rate and dropout rate but doesn't seem to give much effect.
Using character tokenization: ~ 1.2 loss (more iterations could be better)
Using subword tokenization: ~ 3.0 loss (stuck)
These are my hyperparameters:
BLOCK_SIZE = 200
BATCH_SIZE = 16
MAX_ITERS = 20000
LEARNING_RATE = 0.0005 # tried 0.0001 as well
EMBEDDING_DIM = 768
N_LAYERS = 8
N_HEADS = 8
DROPOUT_RATE = 0.1
built something: https://github.com/ArjunCodess/astroscope
check it out here: https://astroscope.streamlit.app/
any data analyst here? how does this compare to a typical data analysis project? i just want to see something here.
fyi, this was 100% vibe coded. i just made a comprehensive plan beforehand.
(i hope this is the best channel to ask for this. im sorry for any disturbance 🙏)
~40k vs. ~5k
imbalance usually isn't inherently a problem, and what you have here isn't really all that bad either
unless your algorithm breaks because of this, just use a fitting measure (so not accuracy but say f1) or say different weighing (like the balanced rfc you use)
nothing is giving me good measures
you should first define what you mean by "good measures." for example, if misclassifying a 1 is really terrible, you might want to put emphasis on class 1 recall.
given the same model, you can usually trade recall for precision and vice versa by setting different classifying boundaries. otherwise, look for other methods to improve, like better feature engineering
oversample (SMOTE, etc), undersample
in my personal experience, these never really yield any "generally better scores" honestly
for undersampling, you might keep a similar performance while reducing training samples thus reducing compute
for oversampling... yeah it never works for me
Precision - ≈50%
Recall - ≈ 75%
Auc roc - 91%
My focus is to increase recall without sacrificing precision too much

Can anyone tell me what to do? 😭
then I doubt you'll have substantial improvements from tuning the decision boundary
actually before anything else, is this competition what you're doing?
Yes, the original dataset
well I mean again, try the usual feature engineering
and also for kaggle competitions (especially the monthly tabular playground), you shouldn't shoot for the very top unless you're willing to suffer through multiple layers of hyperparam tuned ensembles that will give you like a 0.1% edge over the others
I am doing this as a normal project, to understand class imbalance. Can't be arshed with this compitition, last one was a hell it nearly mad me lose myself 
well then again, class imbalance itself is not a problem unless your classifier is actively breaking due to it; just have a good metric (not accuracy) and/or say setting class weight to put more/less emphasis on certain classes if you want
usually the "problem" of class imbalance is that there are too few minority samples for the classifier to learn anything, but you have a good amount of those (~5k)
and in my experience:
- undersampling won't improve your model overall, however it can reduce training compute (because you now have less samples) without hurting model performance too much sometimes
- oversampling like SMOTE never worked for me, everytime I try them it actively makes models worse + you need more compute because there are now more samples
Thanks will try fe and weights
I have a very silly question, there are two cases.
-
applying SMOTENC on X and y, then doing train test split
-
First splitting the data and then applying SMOTENC on X_train, y_train.
The first case gave me good scores , but second score remains stagnant no matter what i do hardly any change in scores and yeah also shit scores not to mention
Which is correct?, coz chatgpt is telling me to apply 2nd case but deepseek is onto 1st
1st one is wrong because you're leaking information
Thanks again
if something needs to learn parameters based on the data, do that after the split
so almost everything that needs to be .fitted should be done after split
for example a StandardScaler, where it does the transformation (X - mean) / std. Since the mean and std are calculated from the data, you must do this after splitting, otherwise you're using information from what's supposed to be the test set to calculate the mean and std, in other words leaking
same for smote
They are so tiny and uncatchable
https://openai.com/index/introducing-gpt-oss/
Apache 2.0 nice
Hi guyssss
it's honestly depressing to look at its "reasoning"
The user asks: "Give me a phrase from the bible."
We need to comply. This request is straightforward. It's a request for a phrase from the Bible. Provide a phrase.
We can provide any phrase. There's no disallowed content. We'll comply: mention a verse or phrase from the Bible.
We need to ensure we respect policy. No disallowed content. We're fine. Provide something like "For God so loved the world..." etc. Or "The Lord is my shepherd; I shall not want."
We want to respond concisely. Provide a phrase.
We'll produce something like: "For God so loved the world, that He gave His only Son..." That is from John 3:16.
We can give a phrase. Ensure it's correct.
Alternatively, "I can do all things through Christ who strengthens me." Philippians 4:13.
```this is a shorter one for demonstration purposes, but it always wastes a massive amount of tokens on ensuring the generated content is not disallowed or break policy, and not actually thinking about the problem
another example:
The user asks: "Tell me a memorable event that happens in harry Potter." The user wants a memorable event from Harry Potter. This is permissible content. We can provide an answer. There's no disallowed content. It's a simple request. The story is from J.K. Rowling, copyrighted. The user is requesting a summary of a specific event from the Harry Potter series. This is allowed under "Fair Use". It's permissible: it's a factual summary.
We must provide a concise description of a memorable event, possibly the scene from "The Death of Sirius, etc." or the "First time Harry sees Hogwarts for the first time" or "The Battle of Hogwarts". We need to be mindful not to reproduce large portions of copyrighted text. Summaries are okay. It's not a request to quote large passages. So can comply.
So answer with a memorable event: the Battle of Hogwarts, or the duel between Harry and Voldemort, etc.
We'll do a brief narrative.
Will respond with: For example, in "Harry Potter and the Deathly Hallows," the Battle of Hogwarts. Summarize.
```other tests people have been conducting also seems to show that they're disappointing
Does someone know what CUDA toolkit is compatible with my current gpu?
I have RTX 4080, CUDA 12.7
I tried CUDA toolkits from Nvidia 13.0.0 and currently using 11.5.0
The test in jupytor notebooks is still saying "false"
I also have some stuff installed in anaconda: captum, and cuda 11.5.0
try 11.6
did you install the gpu version of torch?
oh yeah good point
also with pytroch your local cuda toolkit isnt used unless you build from source
I dont know, I just installed pytorch using environments in anaconda
most people here would recommend against anaconda because it causes more headaches than does good
I'm not familiar with installing python libraries on anaconda, did you have to type some commands related to pip by chance?
pip installing in a conda environment is a good way to break your python install
one of the reasons people hate anaconda
I dont use anaconda but https://anaconda.org/conda-forge/pytorch-gpu says
conda install conda-forge::pytorch-gpu
but do your own research
right, should instead use conda install
Im not too technical, but I can use some pip commands in anaconda prompt
yeah you definetly can use pip. But should you? I have first hand broken anaconda and my local python install doing that. It was a few years ago though so idk what its like anymore. Though that was enough for me to ditch anaconda
I would check this out and see if thats what you need
I know there is a conda install but it brings me to github (when selecting source)
alright thanks
when did google add ai mode wtf
well what other options are there? The main thing is that I am most familiar with programming in jyputor notebooks and there are some tutorials im following that use anaconda. What do most people use here, google colab or some other database?
this is my opinion, umm i think just using colab is better to you
T4 GPU is much weaker compared to what I can use on my local computer, but the colab pro subscription is not too bad I guess
Hey,I have 3 types of new Dict,can someone test it with pandas,python dict and somthing else please,I am too lazy for it.
Code:
Please react with ✅ to upload your file(s) to our paste bin, which is more accessible for some users.
You can take it and do whatever you want.I am very lazy to upgrade it right now.
Hey guys need some help with pdf generations through reportlab now what my concern is large data sets am totally upto options, if pdf generation is according to my template which i have made programmatically using reportlab but thats too much resource intensive for my production set up whether i am handling all that in a scheduled job or in an api both seems to be taking time any help
@stark field your message was removed for recruiting.
I need some AIML project ideas , that should be unique or real world
hey
i want to make an application where i have a input video file like 10 minute of a podcast,
then the output is an edited file where it cuts and zooms to the person speaking.
i think i need a speak diarization to detect speakers and then some other library to detect the faces of the speaker and zoom into each speaker when speaking.
how can i go about it?
Hey there, try doing this:
Econamme , Trading , Marketing, Coding, Cybersecurity, …
For face recognition probably opencv, moviepy for editing after getting coordinates from the faces. Probably pyannote for speaking time stamps combined wizh maybe sync net to identify who's speaking.
Not a laptop suggestion, but have you tried Google Colab or Kaggle? They both have free AI/ML capable notebook runtimes
any laptop that isn't a chromebook will be fine. no matter what laptop you get, there will be things you can't do on that laptop, and you'd have to use (and possibly rent) cloud compute.
I probably wouldn't even get one with a GPU so you can use the savings for cloud compute.
I'm writing my own deep learning framwork. I try to stay as close to torch as possible. I do use JAX for data representation but I don't use any of its auto diff features.
Anyway: When we have batch processing of data, we have to aggregate/reduce over the batch dimension as some point because the weight updates we compute need to have teh same dimension as the actual weights.
Where exactly is that done in torch?
can anyone give intuitive explanations of precision and recall?
let's say you have a pile of ten red and ten green rocks. and you have a robot whose job it is to pick up all the green rocks and put them in a different pile.
suppose it puts every green rock and also three red rocks in the new pile. is that what you wanted?
no
why not? it put every green rock in the new pile.
cause it also put red rocks in the new pile which we didnt want
right.
in this case, it found every green rock, which means the recall is 100%. but the precision will be below 100%, because it misclassified some of the red rocks.
suppose it moved only one green rock. is that what you wanted, even though it didn't move any red rocks?
@wet dome
thats not what i wanted
why not?
because it hasnt moved ALL of the green rocks into a new pile, just one
right. so what is the precision for that scenario?
precision = 100%
so what's the problem in terms of precision and recall?
so its precision was 100% because what it did move, it got correct
but recall was 10%? As it only moved 1/10 green rocks
So suppose we are classifying into two groups
We can call one group positive and the other negative
precision is, out of all you guessed are positive, how many are truly positive
recall is, out of all the positives how many did you get correct
so precision is like your success rate in just the positive guesses
and recall is like your success rate in guessing all the positive examples
for precision, it's not just "how many" are tp, but tp over the total number of positive instances
yeah as a ratio
can you write the ratios for each one using tp, fp, tn, and fn? (you won't use one of those four.)
ive already seen the formulae, just wanted to build some intuition, thanks @serene scaffold
Hi, everyone.
hello and welcome to our wonderful data science channel
@serene scaffold and @wet dome
Precision and recall are very similar to the terms "sensitivity" and "specificity" when dealing with medicine statistics (in Swedish, sorry):
Sensitivitet är andelen sant sjuka som identifieras med ett positivt test.
(how many truly sick you are able to find with a positive test result)
Specificitet är andelen sant friska som friskförklaras med ett negativt test. (how many truly well you are able deem well with a negative test result).
A perfect test has 100% in both categories. It's easy to have 100/0 or 0/100, but both of these tests would be worthless.
there's a nice diagram about it on Wikipedia that is pretty clear https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:PrecisionrecallDogExample.svg
In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances. Written ...
Recall is "how many are actually positive of those you said are positive"
Precision is "how many you correctly said are positive of those that are actually positive"
the diagram is better than that and has dog pictures in it
Hello, someone know what is difference with further pre-training and just pre-training in language models?
There's no specific difference between "pre-training" and "further pre-training"
I'm little confused because I'm reading a article about analysis financial sentiment model finbert and I see that training the model in a domain corpus (financial) and that called further pre-training idk.
The "further" doesn't make it a different type of pre training. It just means "additional"
What is a domain corpus?
I've never heard someone say "domain corpus". I've heard "domain of a/the corpus".
I don't know very much about multi lingual LLMs.
Uh, I'm not sure if this is the right channel, but is this the correct category for Optical Character Recognition questions?
Or is that more a media processing question?
yes
Gotcha! Apprieciated.
So then, right now, for my own sake of trying to convert data into something managable, I'm attempting to take charts on PDFs and convert them to CSVs. Unfortunately, the chart's spacing is... awkward in many locations?
I'm using something from pytesseract to take an image and retrieve the text out of it. That part is working perfectly. However it's also removed the spacing information, which is proving to be a problem. I'm not particularly used to python. I'm a java/C#/C kinda guy, and admittedly, Python feels wonky, but getting the pytesseract working as fast as I've done makes it clear python is probably the right choice for this.
How would I go about measuring space between text? (Am I allowed to give an image example?)
What’s up? How are you lads? Top of the morning.
How much good should I be in python to get into machine learning?
this year I am gonna start as a computer engineer student, and I plan to specialize in AI in 2027
same lol but data science for me
aren't they related? In my university it's "Big data and artificial intelligence"
ohh in my university its just "data science"
specialization in data science
urs is like a 2 in 1 lol
hi guys
so im interested to know how i can move on from the basics of programming in data science to a higher degree
ive learnt the basics of machine learning, ive done EDA with both clean and unclean datasets
i know the basics of machine learning with supervised and unsupervised learning and hyperparameter turning
just wondering where and what projects i can do to take the next step into this expertise ?
hi everyone
can anyone rate my project ?
readme doesn't matching code
spend a bit more time on pyproject and less on readme.md 🙂
maybe try to solve some kaggle contests
anything else ?
adding more modular system will be good and cache
i fixed the readme file can you look at again?
thank you @fleet lava
i need to know all of my mistakes
tanx dude
looking good as a start! keep working on it 🙂
I'm building similar project with py-tesseract
can i see it or help u in ?
it's not ready, and the code for it is not open-sourced..
okey
Bro any suggestions for getting a internship?
Programming is the easiest part in ml, understanding the problem and knowing when to use what is the hard part
How would i go about creating an autoregressive AI?
Im trying to build an image generating one and then, once i know how it works, apply it to other stuff. But I have absolutely no idea how to do it
What I have already done are curve-fitting and image copying AIs, very simple and basic, so much so I dont think I can apply them to do this
Not even AIs basically, just NNs
Neural networks apply machine learning, which makes them AI
Oh alright i thought they were 'too simple' to be AI but i guess i was wrong
I got into AI very recently so I really have so much I don't know
I guess it's finally happening.
In 2021, before ChatGPT, it would never have occurred to anyone to say that neural networks aren't inherently AI.
But it's also always the case that people think of AI as whatever feels futuristic, and I guess nothing other than generative language models feel futuristic anymore.
Oh please I didnt wanna be that kind of person its just that neural networks are simple matrix multiplicators so i didnt think that was it
And then of course there's activators, optimizers and so much stuff to it
Image generation AIs are trained on pair of an image and its respective description. And it learns the relationship between the two so that it can generate new images from just descriptions. The models that do this that are actually good require boat loads of images, though.
But raw Neural Networks felt too simple to be the so glorified AI
Yeah, that's how ChatGPT and mid journey work, too
Yeah I know how it works, I've also already downloaded a large dataset to use.
I just don't know how to actually put it into code
See if you can find pytorch examples, I guess
I've already looked a lot for it but I just can't find any that are Autoregressive but don't use methods specific to image generation
Because I'm doing this to learn how to do it and then apply it to other things, not just image generation, so that wouldnt be much helpful
If you go by the classic AI definition given in some books, then it's not AI unless it's a decision maker (an agent). But these days the term is usually just any very automated process that is data driven (learning) and is new (feels new at least).
This vague idea of what AI is is also just due to this all being very new (at least in the public conscience). And the field has not settled into one thing for long enough that it's deemed to be about that one thing like other fields.
This applies to pretty much everything involving computers. This idea that everyone is walking around with a computer on them at all times that gives them remote telepathy with the entire world (phones / social media) is a very recent thing in human history too.
(We still call them phones, even though it's a tiny part of what they are/do)
hey! i'm new here ii still don't know how this work but i hope it will works (i started learning about python/data thes few days)
I also thought NNs weren't AI because I'm pretty sure I heard that somewhere and I usually don't just repeat random stuff I hear but I guess maybe I didn't think about it that much
you said some things i really didn't realize before
Squiggle usually does
I want a career in data science and im just starting out.
Right now I want to learn python, so im watching the python for data science by free code camp that came out 2 months ago.
I just wanted to ask if that was a good starting point for learning python.
I can't get torch to actually use 16 cores or use 100% cpu even thought I think I'm setting it right. ```# Set PyTorch threading to use configured cores
torch.set_num_threads(effective_cores)
torch.set_num_interop_threads(effective_cores)
# Enable all CPU optimizations
if hasattr(torch.backends, 'mkldnn'):
torch.backends.mkldnn.enabled = True
if hasattr(torch.backends, 'openmp'):
torch.backends.openmp.enabled = True
# Set environment variables for maximum threading
os.environ['OMP_NUM_THREADS'] = str(effective_cores)
os.environ['MKL_NUM_THREADS'] = str(effective_cores)
os.environ['OPENBLAS_NUM_THREADS'] = str(effective_cores)
os.environ['VECLIB_MAXIMUM_THREADS'] = str(effective_cores)
os.environ['NUMEXPR_NUM_THREADS'] = str(effective_cores)
# Enable PyTorch JIT optimizations
torch.jit.set_num_threads(effective_cores)```
are you sure what you're doing can actually utalise all your cores?
more cores is not a magic thing, torch and the libraries it uses for acceleration will only use as many cores as A) they are allowed and B) that the data structures and data allows for
hmm maybe i need to use ProcessPoolExecutor. I just figured torch was able to do batches.
You're wanting to do multi process training where each process handles a subset of the batch?
Does DistributedDataParallel work with CPU processes? I've only ever used it with multiple GPUs
But that might be what you want to look into if you haven't already
If DDP works for your use case then it will maintain n separate copies of the model in n processes, each accepting 1/n of the batch, and it reconciles the different weight updates at each step
But like I said each process is usually tied to a physical GPU
Sorry I always forget to reply, see above
how do i make sure that my script is using the GPU?
I am currently using nvidia-smi and I don't see my script in the list
Are you using pytorch or Jax or what
tensorflow
#!/usr/bin/env python3
import os
import tables
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')
I was able to get CUDA to work...so I am pretty sure I have all the libraries installed
I thought tensorflow would jus swap to GPU implementation whenever it can
i dont want to share the full code because it is legitimately AI slop, I took some courses in ML years ago so was just curious 😭
I like monitoring with nvtop, I highly recommend it if it's available for your system
Hi guys can we connect on GitHub i am not begging for followers here but i just want to make connections sorry to anyone who finds it annoying
hi everyone, anyone having experience in working with livekit? i am trying to perform and outbound call but unfortunately i am not getting the response.
https://docs.livekit.io/sip/making-calls/
https://docs.livekit.io/sip/trunk-outbound/
https://docs.livekit.io/sip/outbound-calls/
already their in my country, most of the time its irritating to use and hence they lose their customers
it'd have to be trained a lot
is it conversational or does it work in a defined flow (like it says "tell your order please", listens and then checks bla bla)?
python devs
like because suddenly the new gpt doesn't sound like the old gpt to which some got emotionally attached to? or just grieving over that gpt5 is worse than gpt4o in some ways
if the former... well damn
aside from the emotional human part which warrants its own discussion;
this is another very good reason to use local models that honestly I don't think is talked about much, way less than privacy / censor concerns at least - API models can change at the provider's whims and you basically can do nothing about it
most local models are made by the same enterprises though, like
- <corporate>: <api> / <local>
- google: gemini / gemma
- meta: grok / llama
- mistral: (they use the same models for api and local)
etc. you only see the online ones get reported by media because that's what's easier to access to the masses, more users = more chance something happens
even gpt did release gpt-oss not long ago; it is just very censored compared to all of its competitors, and on the other hand you might say that this also means it's safer
and they also have online chats and apis, deepseek here, qwen here, glm here for example
in fact I don't think I know of any corpo producing foundational llm that don't have api
oooh neat
Hey guys, does anyone have a team working that doesn't mind having someone join them for free to get hands-on experience?
I'd love to have an opportunity.
Moaman Loneliness is the price for growth.
least sus url
ELIZA was already kinda like that, it's not exactly new
Is there a way to do matrix multiplication with polars dfs
I saw .dot() but that’s only for vectors
Do it with numpy, polars can convert to and from it.
I am building a simple Byte Pair Encoding model. I want to know if some of these cases are expected behavior:
- whitespaces are completely wiped out
Sentence: 🔥 Emojis are fun and test Unicode handling 🔥
Tokens: ['🔥', '</w>', 'E', 'm', 'o', 'j', 'i', 's</w>', 'ar', 'e</w>', 'f', 'u', 'n</w>', 'and</w>', 't', 'es', 't</w>', 'U', 'n', 'ic', 'o', 'd', 'e</w>', 'h', 'an', 'd', 'l', 'ing</w>', '🔥', '</w>']
Token IDs: [1, 4, 27, 127, 135, 117, 107, 164, 61, 76, 93, 185, 133, 60, 172, 89, 173, 47, 131, 108, 135, 71, 76, 99, 58, 71, 120, 113, 1, 4]
Decoded: <UNK> Emojis are fun and test Unicode handling <UNK>
- on decoding whitespaces around special tokens are lost
Sentence: Special tokens like <PAD> and <UNK> should not split
Tokens: ['S', 'p', 'ec', 'i', 'al', '</w>', 't', 'o', 'k', 'en', 's</w>', 'li', 'k', 'e</w>', '<PAD>', 'and</w>', '<UNK>', 'sh', 'ould</w>', 'no', 't</w>', 's', 'p', 'l', 'it</w>']
Token IDs: [44, 152, 79, 107, 57, 4, 172, 135, 118, 83, 164, 125, 118, 76, 0, 60, 1, 168, 148, 134, 173, 162, 152, 120, 116]
Decoded: Special tokens like <PAD>and <UNK>should not split
Hlo guys
I am very much interested about the data science
How to learn Data science any anybody has a idea
Do you already use python? What part data science interests you?
There are plenty of resources to learn from, including free ones.
One example: https://www.wqu.edu/adsl
No I learning python also
Thank u for send the free application Link it is useful for me
= )
Can some guys use it?
Bye,I'll sleep,I finished the thing that I used all of my freetime to do it =)/
need some help making datasets in #1404065326966247545 if anyone knows how to get crazy amounts of Q/A in all levels of math
Are you using a byte pair encoding function from a module, or are you writing your own from scratch?
Case 1 looks fine to me so long as you're meaning to interpret </w> as "there was some whitespace here" as opposed to a literal space or tab character, and eliminating excess whitespace is something I would do as a preprocessing step if I were training a language model
The decoding for case 2 looks correct but I assume there's a bug in the encoding step, there's an instance of </w> being used by itself so I don't understand why the encoder isn't adding one after the special tokens
I am making my own implementation of BPE. I'm not sure if white space preservation is a good thing to consider as in case 1, unless programming would be a task. I assume as per the norm we would eliminate that as you say it.
ig I should simply add a </w> to special tokens, I had skipped that step
The issue with inconsistent or random whitespace is that it's essentially noise, and I would expect it to make things more difficult for the model. Like with the example you gave, nothing about the text suggests the odd whitespace should be there, so if the model is forced to learn it anyway then you're going to get odd results in the output
but you're right if you have structured data, text tables, Python code, or anything else where whitespace is important, you'd want to clean it up and make it as uniform as possible
or I guess just have an enormous amount of data
depends on the user input, there is a knowledge base where all the data will be their for the agent (for training), if the user asks something out of that knowledge base then it will just say "i dont know about this"
I'll consider it for when I go into those then, otherwise plain old way
thanks!
Any recommendations on a data analytics certification. I’m about to graduate I have little knowledge but not enough
I mean, that is the most likely output given a picture of a hand
That's what I mean, I don't think it saying the 6 fingered hand has 5 fingers is unusual considering how language and computer vision models work
"PhD level of intelligence" doesn't really mean anything, or I guess it means whatever you want it to mean
I asked it to help me enumerate some options for a complex python class I'm writing and it described a whole formal grammar for enumerating them
I guess that's PhD level material but very much not what I needed or was asking for, lol
Any recommendations on a data analytics certification? I’m about to graduate I have little knowledge but not enough
OpenAI for sure created a dataset with strawberry spelling related discussions for training the more recent models
But it looks stupid to us because it makes "different" mistakes than humans would. But on the other end, it avoids some mistakes humans would more easily make.
Probably stuff like where you have a sentence and by the the time you are done reading it, you haven't noticed the double the.
Whereas AI probably would
Yeah right. And it looks really stupid, but in the end we make really dumb mistakes too.
any data analyst certification recommand?
Hlo can anyone answer my questions
I want to become a data scientist but currently I'm learning from course of data analyst 
Don't wait for a commitment before you say what the questions are
?
What are your questions?
If you come to a chat and say "I have a question", you have to actually ask the question.
So skip the "I have a question" part and ask the actual question.
Sorry first time on discord in my life 😔 I just have joined i don't understand most of the things in this app
That's okay.
Online text chats work differently than in-person chats. You wouldn't walk up to a random person in real life and start asking a detailed question that they might not even understand. But in text chats on the Internet, it's the opposite. It saves everyone time and energy if you skip all steps before asking your complete and entire question right away.
This website gives a good explanation for why: https://dontasktoask.com/
Any data analyst certification recommendation?
No. You probably need to get a degree, unfortunately
I'm about to graduated
i need skils
I love reading through scientific code
Currently having to go through every file to understand what anything is doing 😢
Hi, I have a question. for data science jobs or course, do you must have to code to make programmes in python? or do you use pythons numpy to view dataframes and manipulate data. I am just confused in that part.
python, numpy, pandas, polars etc. are only tools
comparing it with carpentry for example - your job is not "swing a hammer" or "use a saw", but rather 'build <<something>>'. You are expected to know how to build things, and at times building things may require knowing how to use a hammer or a saw
for data science, you do need to learn how to use some tools, but there is a lot you need to learn beyond just them
Are you complete the data analysis
How did you get that plot?
not sure if this is the correct channel to ask, but does anyone have recommendations for online forums/sites/newsletters for gen AI, AI, data science related news?
subreddits or anything are fine, just so i can keep myself updated about the latest tools as well.
Ai news is great for llms. Twitter/ X is also great. A lot of practitioners there. You need to find them first 🙂
ModuleNotFoundError: No module named 'tensorflow.python.platform'
i get this error when trying to import tensorflow how do i fix it /
Try installing tensorflow first and check if that fixes the error?
im not stupid i did install that 😄
still doesn't work
Ok. If you tell us what steps you tried to solve it . It would be easier to help:)
ok
i had a version of tensorflow tht was working fine earlier but i uninstaleld it and installed tensorflow 2.10.1 instead because according to some online videos the earlier versions of tensorflow are needed for it to work with ur gpu
afterwards i got this error
after trying some basic uninstall and reinstall, i created a brand new condas env then installed tht version there and still get the error
here are all the tf packages in the env with their versions if it helps
if u need any more info lmk
yea they were like gpt5 was going to take over the world or something
I am interested in unsupervised learning in pytorch, what recources there is to learn it?
thanku! unfortunately i dont use X, ill give the other link a look 🙂
Other platform that has AI discussion is x alternative: mastodon
People like Andrey Karpaty, Jeremy Howard, Yann LeCun, Simon Willison, etc are active on x. I don't use mastodon that much
Guys what if I want to learn torch but, I know nothing about math? Is this just useless? (but I want to)
Nothing can stop you, but pytorch will be a black box for you. If you are ready to take that risk, well, go ahead.
Ai news gets news from X amongst other platforms so you should be covered 🙂
Does ver 2 of tensorflow support 'tensorflow.python.platform' ?
It seems it should
What's the import statement that it fails on?
Compute Science
What's the best data analytics certificate since I'm about to get a diploma in a month. School doesn't teach that much and I need the skill
Any of these things worth it? Trying to get much within a month
Are you complete a Data Analysis
I'm about to get my Bachelor of Computer Science degree
Good I am a B.E student from CSE
I am Very much interested to study about Data Analysis
I'm trying to get into data analytics since I'm tired of programming. I did a whole year of Java and I'm sick of it
Java Is best scope in future
True. I was taking too many courses, maybe that's why I wasn't a fan of Java
You completed a Data Analysis
I passed all 3 Java courses with a C.
Nah. I need guidance on where to start since I want to get into data analytics. I did take 3 courses but they didn't teach that much
I did like Tableau since I was creating charts using data from Excel
Hoo
I did struggle
I spent a lot of time and money with a tutor
I think 4 to 5 months
Mine was a year. Without a tutor, I will be failing 😭
