#data-science-and-ml


untold bloom
#

wow

#

thanks for the enlightenment

#

that last s has been bothering me

verbal oar
#

for example for drug discovery

#

if I said it correctly

#

I mean I saw one offer related to graph nn

fallow coyote
#

I'm struggling to understand the likelihood function. I look on one website, it gives one definition, I look at another, there's a completely different definition. how do I define what it is?

calm thicket
#

there are many different likelihood functions. probably in machine learning you are using maximum likelihood estimation

wooden sail
fallow coyote
#

I've somewhat figured it out. Still seems confusing but I read that likelihood is a topic that still confuses even the best mathematicians so I'm not too bothered if I don't understand it fully

wooden sail
#

the idea is kinda simple though

fallow coyote
#

thats what I thought and then I looked up the definitions and it started confusing me

wooden sail
#

the easiest way to see it, imo, is that you have a pdf for some set of observations, and the pdf depends on some parameter

#

for example you can say you have data X that is gaussian distributed. the gaussian distribution has 2 parameters, the mean and the variance (or standard dev, as you like)

#

we can then write the pdf as a function p(x, mean, std dev)

#

if you keep the mean and std dev constant, then p(x, mean, std dev), seen as a function of x, is a pdf: it gives the probability density of each possible x and integrates to 1 over x

#

if on the other hand you observe one specific x and keep it constant, but allow the mean and std dev to vary

#

then p(x, mean, std dev) is no longer a pdf. it doesn't integrate to 1 if you integrate over the mean and std dev

#

this second case is what one calls the "likelihood function", and you can do this for any pdf

#

the difference is what is kept constant and what is allowed to vary
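
A small numeric sketch of the distinction above, using only the standard library. The observed value 1.0, the fixed mean 0 and std dev 1, and the integration limits are arbitrary choices for illustration, not values from the discussion. To keep it short, only the std dev is varied in the second integral.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian density p(x, mean, std)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def trapezoid(f, lo, hi, n=20000):
    """Plain trapezoid rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Case 1: parameters fixed, x varies -> a pdf, so the area is (numerically) 1.
area_over_x = trapezoid(lambda x: gaussian_pdf(x, 0.0, 1.0), -10.0, 10.0)

# Case 2: x fixed at an observed value, std dev varies -> a likelihood.
# Nothing forces this area to be 1; here it even keeps growing with the upper limit.
x_obs = 1.0
area_over_std = trapezoid(lambda s: gaussian_pdf(x_obs, 0.0, s), 1e-3, 50.0)

print(f"integral over x:       {area_over_x:.4f}")
print(f"integral over std dev: {area_over_std:.4f}")
```

The same function is evaluated in both cases; only the argument being swept changes.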

fallow coyote
#

I think I get it now

verbal oar
#

i just relate it to probability or odds, likelihood is related to bayes generally

#

when you have a priori, a posteriori and likelihood

#

but its likelihood not likelihood function

#

likelihood is B given A, B|A

#

I think this is origin of it

#

also, lower down, "marginal" is the origin of marginal distribution or marginal probability, not sure

#

yes likelihood is probability because of P

#

what do you think

#

looks like bayes is base

#

like its base for vae for example when you have priors

wooden sail
#

that's something separate

#

the posterior you compute here via bayes' rule can itself be either a pdf or a likelihood

verbal oar
#

ok

wooden sail
#

the distinction is again made by what is kept constant and what is allowed to vary
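
For reference, Bayes' rule written out, with θ standing for the parameters (for example the mean and std dev) and x for the observed data. The symbol names are a choice made here, not taken from the conversation.

```latex
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
```

With x held fixed and θ varying, p(x | θ) is the likelihood function, p(θ) is the prior, p(x) is the marginal, and p(θ | x) is the posterior.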

verbal oar
#

lol so I was confused too because the word likelihood has two meanings

#

so depends on context

river cape
#

Hey guys so I have spent my time learning ML , DL , transformers and right now i am learning langchain , but I dont have much knowledge on DSA . So im confused at this point , as to what to do

left tartan
verbal oar
#

where I can find hypothesis testing inside deep learning?
with machine learning you see it for example in R statistical summary

left tartan
verbal oar
#

looks like its hidden?

#

I don't know about the relevance of it

river cape
verbal oar
#

ah ok right

river cape
left tartan
verbal oar
#

so its doing it automatically

left tartan
#

The real question is "how much is needed", which is difficult to answer because it's difficult to measure.

verbal oar
#

yes as I recall inside linear regression is hypothesis testing

limber spear
#

Probably wouldn’t be able to 😅 there’s too much to learn

#

Hey chat

proper meteor
feral meteor
#

asked Claude to show me the heat map

verbal oar
#

hey robert 🙂

hallow wagon
manic lion
#

is machine learning difficult to work in?

#

for example, a guy training an AI to answer questions.

serene scaffold
serene scaffold
#

(and if you start with a foundation model, all the actual AI is abstracted away.)

rich moth
#

Whoa! Check out these polar plots! The time series one (Arrowhead) is crazy because all the points line up almost perfectly on the 0 degree, 180 degree line. It makes sense though, if you think about time series just having a 1D nature. But what's interesting about the image one is that the complexity can be represented in a 2D nature. The points are distributed across multiple angles in the complex plane, forming patterns that extend in various directions rather than being confined to a single axis. What's cool though is all these patterns are emerging naturally from the math formula I made.

#

It's like this hidden "dna" of data

#

There's a unique "shape" to complexity across different data types, almost a kind of hidden signature or "calling card" that reveals the fundamental nature of information itself. It doesn't just measure how difficult a sample is to learn (magnitude), but also characterizes what kind of difficulty it represents (phase).

Images below are from IRIS dataset.

rich moth
limber spear
#

Test your research against the big dawgs. Llama, DeepSeek, ChatGPT etc.

#

Claude. What is the deal with Langchain. Why are folks excited about Langchain

rich moth
# limber spear Welp Plunder. I think it is time for you to publish a paper on this atp. Is this...

Honestly, I don't even know where to begin. Feels so overwhelming. I have started writing some, I'm slow at it and it takes me forever, dreading it already lol. As far as github, it's a great idea; I'm slow at that, too. I'm not the greatest with it. Do you know of any exceptional resources to better my GitHub skills? I once (years ago) nuked my entire hard drive because I was, well, an idiot and didn't know what I was doing. But I appreciate your input, broski.

limber spear
rich moth
limber spear
rich moth
#

That's a great attitude

#

This one is on the Breast Cancer dataset. What's interesting is the Phase Ascending strategy took the cake on this one

#

oops, lol its obvious hard to easy. Im looking at two different things

#

It was the WINE tabular dataset I was looking at, where Phase Ascending beat random by +5.41%

rich moth
#

It's almost like the opposite rings true for this complexity tool. Feeding it lots of information, rather than little, improves overall results. Which makes sense: more complexity, better results. But it's not always the case; sometimes another method seems to be working, but always beating random every time. But complexity isn't defined by domains, it's something else, that's what I think I found though. Well, some of it, even though it works and works well. I feel like there's something else missing from the pie. It's like dark matter: I can't see it, but trial and error in our measurements show "something is there"

rich moth
#

Im really curious how fractals might play a role here. like thinking about patterns like the Mandelbrot set on the complex plane .. especially the boundaries between points that stay bounded and those that escape to infinity...

#

damn! i never thought about that... they both fundamentally operate in the complex plane, there might be a connection here

proper meteor
river cape
proper meteor
#

It allows the recruiters to check if you have problem solving abilities or not

abstract loom
#

Hey

charred estuary
#

Anyone else getting 503 with the Gemini API?

charred estuary
verbal oar
#

being a little off-topic: github is not a backup site

#

its only version control, still need backup somewhere

limber spear
#

I use GitHub for backup. Unless GitHub says no lol

verbal oar
#

I say what I read in some git book

#

not my opinion

charred estuary
gritty notch
#

Hey everyone pls help me with this

#
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size)
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # x is expected to already be a sigmoid output s, so the derivative is s * (1 - s)
        return x * (1 - x)

    def feedforward(self, X):
        self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)
        self.output_activation = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)
        return self.predicted_output

    def backward(self, X, y, learning_rate):
        output_error = y - self.predicted_output
        output_delta = output_error * self.sigmoid_derivative(self.predicted_output)
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)
        self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate
#
    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backward(X, y, learning_rate)
            if epoch % 4000 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss: {loss}")

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)
output = nn.feedforward(X)
print("Predictions after training:")
print(output)
viscid urchin
#

So, my buddy has this awful 820-page PDF slide deck (ugh).. and he's trying to figure out which pages of it refer to a particular concept. The problem is that any individual slide may or may not mention any interesting keywords about that, so it's really a full-LLM kind of context problem, it seems to me.

I took a swing at using LangChain to feed it into OpenAI for him, and it didn't go that well; the problem is that I'm having to chunk it into 50-page slices, and each of those isn't maybe enough context about the project to make good input.

What general approach is best-suited for this kind of problem? Do I need to "fine-tune" a model on this slide deck?

#

As an example, here's a slide that needs to be a 'hit', because despite not mentioning anything by name, it is ABOUT the project he's looking for hits on:

#

and if I don't luck into having that context in the same 'chunk' as this slide, it's not gonna realize this is a hit.

verbal oar
#

is ai agents related to reinforcement learning due to concept of agent in rl, can I think this way or its different meaning?

agile cobalt
# verbal oar is ai agents related to reinforcement learning due to concept of agent in rl, ca...

not related at all

an Agent (or more broadly, an agentic system) built on LLMs is when you take a language model, give it some tools (e.g. python functions it can call), then ask it to do a given task with some degree of autonomy
It does not necessarily involve reinforcement learning

RL is sometimes used for training the language models, especially to let them 'reason' on their own before writing a final answer or invoking a tool
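
A minimal sketch of the tool-calling loop described above. Everything in it is hypothetical: `fake_model` is a hard-coded stand-in for a language model so the loop runs without any external service, and the tool names and the dictionary format for a "step" are invented for illustration rather than taken from any real agent framework.

```python
# Hypothetical tools the "agent" is allowed to call.
def add(a, b):
    return a + b

def word_count(text):
    return len(text.split())

TOOLS = {"add": add, "word_count": word_count}

def fake_model(task, history):
    """Stand-in for a language model: decides the next step.

    A real LLM would produce this decision from a prompt; here it is
    hard-coded so the control flow can be followed end to end.
    """
    if not history:
        return {"tool": "word_count", "args": {"text": task}}
    return {"final": f"The task has {history[-1]} words."}

def run_agent(task, max_steps=5):
    """Ask the model for a step, run the requested tool, feed the result back."""
    history = []
    for _ in range(max_steps):
        step = fake_model(task, history)
        if "final" in step:
            return step["final"]
        result = TOOLS[step["tool"]](**step["args"])
        history.append(result)
    return "gave up"

answer = run_agent("count the words in this sentence")
print(answer)
```

No reinforcement learning appears anywhere in the loop; the autonomy comes from letting the model choose which tool to call next.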

verbal oar
#

yes because I was confused about rlhf

#

thanks for explanation

agile cobalt
viscid urchin
#

Yeah that's my best advice for him so far; hoping someone has a cooler plan.

verbal oar
#

split them maybe?

viscid urchin
#

The problem is that the first bunch of slides introduce the topic, and if you just feed it a slice of the later part of them it isn't doing a great job.

verbal oar
#

ah like needs context

viscid urchin
#

I tried chunking them with LangChain etc before posting, didn't really find what my buddy is looking for.

agile cobalt
viscid urchin
#

My suggestion to him was to try to use some 'tool memory' on the problem, but I don't have a specific MCP in mind for him to lean on.

#

Aah yeah I'll try that too.

agile cobalt
glacial yoke
#

What's a good way to prep for data science & ai uni course?

agile cobalt
#

it's relatively easy to identify that a slide is making a direct reference to X thing, but finding all other slides that indirectly reference that thing is hard

as far as I know usually you'd just take fragments of things around the direct reference
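
A rough sketch of the "fragments around the direct reference" idea, assuming the slides have already been extracted to a list of strings. The function name, the `radius` parameter, and the sample slide texts are all made up for illustration.

```python
def windows_around_hits(slides, keyword, radius=2):
    """Return sorted indices of slides within `radius` of any slide mentioning `keyword`."""
    hits = [i for i, text in enumerate(slides) if keyword.lower() in text.lower()]
    selected = set()
    for i in hits:
        lo = max(0, i - radius)
        hi = min(len(slides), i + radius + 1)
        selected.update(range(lo, hi))
    return sorted(selected)

slides = [
    "Intro to Project Falcon",
    "Budget overview",
    "Timeline for the rollout",
    "Unrelated HR update",
    "Cafeteria menu",
    "Falcon risks",
]
picked = windows_around_hits(slides, "falcon", radius=1)
print(picked)
```

This only catches slides that sit near a direct mention; a slide that is about the project but far from any mention would still be missed, which is the hard part described above.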

agile cobalt
#

see if any of the courses list their pre-requisites / things you are expected to know

glacial yoke
agile cobalt
#

not sure, maybe something like Datacamp but I never used it myself

make sure you take a look at the documentation of any libraries you use though, especially their User Guides if they have one

serene grail
glacial yoke
viscid urchin
verbal oar
#

so just syntax?

#

not more?

#

like generators,iterators

#

for example big dataset and getting stream of data

#

datacamp looks ok imo

#

I thought python is prerequisite so looks like I'm overthinking it

#

python in the sense not just syntax but more, but seems its like you said then

#

statistics could be also khan academy but not sure

agile cobalt
# verbal oar like generators,iterators

usually you'll be working with dataframes or other abstract representations of datasets that manage the looping for you, it's relatively rare for you to have to write a generator yourself when working with data

when you're using python with libraries like pandas or polars, the way you write code is extremely different from python on its own

(of course, it's still nice to understand how more advanced python features work, but you shouldn't go out of your way to use them when the library offers a different, more efficient way to do the same thing)

glacial root
#

hi, typically what does the interview process look like for machine learning and/or computer vision roles, for both internships and full time

glacial root
#

kind of like with the leetcode round with swe but with ml algorithms

serene scaffold
glacial root
#

interesting

#

and this is for full time?

serene scaffold
glacial root
#

are there online assessments like with swe?

charred estuary
rich moth
#

This is based on the Fashion-MNIST dataset.

#

Time data series Arrowhead. See how it still remains in 1D

rich moth
#

Alright, I think the next experiment is to really push this thing: gonna calculate the complexity scores for all the datasets (time series, images, tabular, text) using the right domain logic for each first. Then, I'll combine all those complexity values into one giant list and sort it using my framework's magnitude score. Really want to see if the tool can create a meaningful difficulty ranking that truly works across all domains when the data types are mixed together. Basically, putting the 'Unified' in UCF to the ultimate test!

#

I've noticed patterns in this thing: the more complex the data, the better the results. Which makes sense, that's the whole purpose

rich moth
#

Here are the cross domain results

verbal oar
#

can I fine tune some model with dataset containing python code?

#

both are on hugging face

#

of course I select only python code

#

50gb of data will be slow for finetuning, cut it to some smaller size?

fierce python
#

When dealing with binary classification, is it better to use torch.nn.BCELoss or torch.nn.CrossEntropyLoss and why?

From what I know, we use BCELoss to get the value close to either 0 or 1 and round it to the nearest one, but with CrossEntropyLoss we get the exact 0 or 1 value, so which one is better actually?

glacial root
#

would 400 images be enough to train a license plate detector

#

or is that too little

agile cobalt
#

might be enough to fine tune

you can also try using data augmentation

glacial root
#

nevermind it doesn't have labels

#

i've been trying to find a dataset with proper labels, all of the ones i find don't state exactly what the label format is and it just says it's meant to work with yolo

#

but i need to train this model from scratch

#

are any of you aware of any datasets out there that are clear with the labeling format

#

like if it's the top left and bottom right coordinates or if it's one coordinate and the dimensions of the box
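
The two conventions mentioned above differ only by a little arithmetic, so either can be converted to the other once you know which one a dataset uses. The sketch below converts between pixel corners and a normalized center/size form. YOLO-style label files usually store each box as a class id followed by normalized center x, center y, width and height, but treat that as an assumption to verify against a few known images of any particular dataset. The function names and the sample box are illustrative.

```python
def corners_to_center(x1, y1, x2, y2, img_w, img_h):
    """Pixel corners (top-left, bottom-right) -> normalized (cx, cy, w, h)."""
    return (
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )

def center_to_corners(cx, cy, w, h, img_w, img_h):
    """Normalized (cx, cy, w, h) -> pixel corners (x1, y1, x2, y2)."""
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

box = (100, 50, 300, 150)  # x1, y1, x2, y2 in a 400x200 image
normalized = corners_to_center(*box, img_w=400, img_h=200)
restored = center_to_corners(*normalized, img_w=400, img_h=200)
print(normalized, restored)
```

A quick sanity check for an unlabeled format is to draw the decoded boxes on a handful of images and see whether they land on the plates.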

odd meteor
# fierce python When dealing with binary classification, is it better to use `torch.nn.BCELoss` ...

TLDR: It doesn't really matter which one you use since both loss functions can get the work done.

BCELoss is strictly for binary classification and expects probabilities (0 to 1), so you must manually apply torch.sigmoid() to your logits first because BCELoss() does not do this for you. Your labels also need to be floats. It's the application of sigmoid that actually squashes the values into the range between 0 and 1, not BCELoss itself.

BCEWithLogitsLoss on the other hand simplifies this by handling the sigmoid internally for you, so you just need to feed it the raw logits and float labels, no extra steps needed.

CrossEntropyLoss, designed for multi-class, also works for binary so long as your model outputs 2 logits (for class 0 and class 1)... It auto-applies Softmax to the logits and uses integer labels, unlike BCELoss().

So there's no rule I've seen that states that using one is better than the other; just ensure your predicted output and target are in the right format (depending on whichever loss function you decide to use).

Personally, for classification tasks, whether binary or multiclass, I still prefer CrossEntropyLoss because of its design consistency (no sigmoid, works with int labels)... Meanwhile, another person might like BCELoss() or BCEWithLogitsLoss()
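
To make the "both get the work done" point concrete, here is a plain-Python sketch of the arithmetic behind the three options. It does not call PyTorch; the function names are invented here and only mirror the ideas behind BCELoss, BCEWithLogitsLoss and CrossEntropyLoss for a single example. The two-logit case assumes the class-0 logit is fixed at 0, which is what makes softmax reduce to sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_prob(p, y):
    """Binary cross-entropy on a probability p (the BCELoss idea)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z, y):
    """Same loss computed straight from the logit (the BCEWithLogitsLoss idea)."""
    # Numerically stable form of log(1 + exp(-z)) combined with the label term.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def ce_from_two_logits(logits, label):
    """Softmax cross-entropy over two logits with an integer label (the CrossEntropyLoss idea)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

z = 1.7  # an arbitrary logit for the positive class
for y in (0, 1):
    a = bce_from_prob(sigmoid(z), float(y))
    b = bce_from_logit(z, float(y))
    c = ce_from_two_logits([0.0, z], y)
    print(y, a, b, c)
```

All three give the same loss value up to floating-point error, so the choice mostly comes down to which input and label format is more convenient.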

fierce python
glacial root
#

what things should i do to augment the data and get accurate coordinates for the augmented data

agile cobalt
#

crop and pan, rotate, maybe warp

#

to clarify: by augmenting I don't mean improving, but rather creating more samples

tensorflow and pytorch both have some docs on it, and overall it shouldn't take too much work to re-implement yourself in another framework if you needed to

serene grail
# agile cobalt crop and pan, rotate, maybe warp

Is this a common thing to do in machine learning? I'm guessing this is supposed to help prevent overfitting on some particular criteria, for example license plate always being horizontal/always being in the center of the screen/etc.

glacial root
agile cobalt
# glacial root would i be able to get the coordinates for the new image though

not sure if any of the libraries do it automatically for you or if you must do it yourself, but even in the worst-case scenario it should be a relatively simple calculation - just apply the same formula that's applied to the image pixels to the bounding box

(imagine the label as a 2D greyscale image with the same dimensions as the original image - as long as you apply the same transformations as you applied to the original image it remains in 'sync')
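
As a sketch of "apply the same formula to the bounding box", here are two common augmentations worked out by hand for a box in (x1, y1, x2, y2) pixel form. The function names and sample numbers are illustrative, and coordinates are treated as continuous edges (so a flip maps x to img_w - x rather than img_w - 1 - x).

```python
def hflip_box(box, img_w):
    """Mirror an (x1, y1, x2, y2) box left-to-right in an image of width img_w."""
    x1, y1, x2, y2 = box
    # The left edge becomes the right edge, so the two x values swap roles.
    return (img_w - x2, y1, img_w - x1, y2)

def crop_box(box, crop_x, crop_y, crop_w, crop_h):
    """Shift a box into the coordinates of a crop and clip it; None if it falls outside."""
    x1, y1, x2, y2 = box
    nx1 = max(0, x1 - crop_x)
    ny1 = max(0, y1 - crop_y)
    nx2 = min(crop_w, x2 - crop_x)
    ny2 = min(crop_h, y2 - crop_y)
    if nx1 >= nx2 or ny1 >= ny2:
        return None  # the crop no longer contains any part of the box
    return (nx1, ny1, nx2, ny2)

box = (20, 50, 120, 150)
flipped = hflip_box(box, img_w=400)
cropped = crop_box(box, crop_x=10, crop_y=40, crop_w=100, crop_h=100)
print(flipped, cropped)
```

Rotations and warps follow the same pattern: transform the four corner points with the image's transform, then take the min and max of the results as the new box.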

glacial root
#

man something is wrong with me, i cannot think at all lol

lapis sequoia
#

Going hard in the paint. Yo!

dreamy sky
#

Anyone doing the "Drawing with LLMs" Kaggle competition?

sudden wyvern
#

Hi everyone, I am new to the AI space and working on a chatbot kind of functionality for an Odoo system. I have used Ollama to run the DeepSeek-R1 model locally. Now I want to train this model to answer Odoo-related queries and use our custom Postgres database to give the best possible answer to a customer query. Can anyone guide me on how I can train the DeepSeek-R1 model with my custom data ❓

glacial root
#

how much of a difference would there be in computational efficiency if i went with the region based method for classification versus training the model to both localize and classify

deep creek
#
GitHub: mglowinski93/EtlTemplate, a template for Extract-Transform-Load (ETL).

Reddit: a post from the Python community

lapis sequoia
#

My main confusion is, these “LLMs” how big are they? Are they owned by a company? Are people just making LLMs like it’s nothing? Are they using langchain? I don’t know, I’ve seen this come up a lot.

viscid urchin
#

If you were learning it all again, which Calculus book would you send back to yourself?

lapis sequoia
#

Heavy optimization with partials and unconstrained optimization.

#

Best place to start

viscid urchin
#

Sure, but pretend you have a time portal that only one book fits through.

lapis sequoia
#

Just learn derivatives and integrals. Know the unit circle.

#

understand limits. I took calc1 so long ago, like 2016. Dang. But yeah, I would suggest going to school for math

viscid urchin
#

I went to school for math, yeah, that's not what I'm asking but I appreciate the feedback.

lapis sequoia
viscid urchin
#

To expand on my question.. I've got, for example, a 'programming paradigms' textbook in mind that would have greatly accelerated my learning if it had existed back then.. and I was just wondering today if there were a math equivalent text for that thought experiment.

lapis sequoia
#

I just felt like the structure of trig calc1-3 was good how it was when I took it. Maybe, more of emphasis on differential equations. I don’t remember that class at all. Linear algebra should always be a requirement, optimization is underrated in calc1-3, it’s very important. Yeah, I think calc should focus more on optimization. I don’t remember it was so long ago.

viscid urchin
#

Interesting; when I was taught, trig and calc were totally separate; did they overlap for you?

lapis sequoia
#

You need trig for calc. No, I took trig separately. This was so long ago.

viscid urchin
#

(I hated my trig class at the time, I hope they teach it in a different way now)

lapis sequoia
#

I remember when I was 18 I grinded trig so hard and got literally a 100, didn’t miss a point. shout out to my 18 year old self

viscid urchin
#

Nice.

lapis sequoia
#

I didn’t think calc was bad honestly. This was so long ago, but honestly I remember being introduced to a limit and it made more sense than a bond. This was so long ago. Honestly. I didn’t think calc1-3 was bad at all. I am serious, it made direct sense.

viscid urchin
#

I do remember at one point I was given the example of a speedometer, and its relationship to distance traveled etc, and that was a WAY better guide to my intuition than what I'd been exposed to before

#

I also sorta ended up finding the 'fluxions' explanation of things more helpful than the modern one, oddly

lapis sequoia
#

I remember how much I hated physics. I did well, I just didn’t find it interesting. It felt forced.

viscid urchin
#

I had a really good high school physics teacher, I lucked out there.

lapis sequoia
#

I just remember the labs were so boring. All of this was so long ago. I am trying to remember how I felt.

viscid urchin
#

even more than some other classics, this changed how I thought about programming

pine arch
#

quick question, I'm attempting PCA, and from my understanding, if your data looks clustered and not varied, this means your PCA isn't usable. is that right?

serene grail
toxic pilot
viscid urchin
#

I just wasn't smart enough to get it the first time I guess

toxic pilot
#

i kind of jumped around in the book and read the parts i really cared about

viscid urchin
#

I guess what it at least did was teach me the terminology, so I could go separately investigate the parts.

toxic pilot
#

idk i feel like it’s a good textbook for college students to read if they’re taking a PL/compiler class

viscid urchin
#

I mean, as long as you go also learn about PEGs and GLL parsing and other things it doesn't cover

#

One probably shouldn't actually build YACC again circa 2025

#

I think the "front-end" vs. "back-end" hard distinction is out of favor too, in comparison to a long pipeline of simple transformations

#

I see a lot of people suggest this instead now https://www.amazon.com/Modern-Compiler-Design-Dick-Grune/dp/1461446988 but I haven't had the pleasure of reading it

opaque condor
#

within the network

toxic pilot
viscid urchin
#

Yeah, it’s still probably a useful idea, I mostly was just saying it has turned out not to be sacred.

limber spear
#

Compilers? Nothing moves without them. Especially code.

#

This is why I study systems level programming.

limber spear
#

My head hurts. Have a good day/night chat 🫡

severe blade
#

this worked btw. thank you so much man. i hope you get all the success you wish for.

rich river
#
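from typing import Tuple

import torch
from cv_bridge import CvBridge
from ultralytics import YOLO
from ultralytics.utils.ops import clip_boxes, scale_masks

# BaseModel and Segmentation are defined elsewhere in the poster's project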

class YoloModel(BaseModel):
    """
    This YoloModel class is for object detection and instance segmentation task
    """

    def __init__(self, model_path: str, confidence: float):
        self._model = YOLO(model_path)
        self._confidence = confidence
        self._cv_bridge = CvBridge()

    def segmentation(self) -> Tuple[list[str], list[Segmentation]]:
        model_output = self._model.predict(
            self._color_img, conf=self._confidence, iou=self._confidence
        )[0]
        self._model_output = model_output
        if model_output.masks is None or model_output.boxes is None:
            return None, None

        names = [value for _, value in sorted(model_output.names.items())]

        # the box coordinates are given in float32 but we want int32,
        # clip them again to avoid rounding issues causing the boxes
        # to be out of the image
        boxxywhs = clip_boxes(model_output.boxes.xywh.int(), model_output.orig_shape)
        scale_up_masks = (
            scale_masks(model_output.masks.data[None], model_output.orig_shape)
            .squeeze(0)
            .to(torch.uint8)
            .cpu()
        )

        segmentations = []
        for i in range(model_output.boxes.data.shape[0]):
            item = self.yolo_result_to_segmentation(
                model_output.boxes.cls[i].int().item(),
                model_output.boxes.conf[i],
                boxxywhs[i],
                scale_up_masks[i],
            )
            segmentations.append(item)

        return names, segmentations

does anyone have an idea how to rewrite code that uses ultralytics in C++? I need to deploy it using C++
my current thought is to rewrite clip_boxes and the .predict function in C++, but it seems like a lot of work

wild zenith
#

loc or iloc which is better

serene scaffold
toxic pilot
gaunt zephyr
#

anybody tried Google-adp?

#

It’s for making chatbots with Gemini

limber spear
leaden narwhal
#

Anyone ok to go over with me my notebook so i can organize it properly? Im having a hard time doing so since this is so unorganized

#

Doing a solo project

limber spear
#

Just post it here meowthumbsup

#

We should have a workspace channel for this channel. Like the movie Inception pithink

serene scaffold
#

notebooks aren't amenable to sharing over Discord, so you'll want to do something like python -m jupyter nbconvert --to script --stdout your_notebook.ipynb

leaden narwhal
#

on discord ? ahahha

serene scaffold
#

Yes, that's where we are

leaden narwhal
#

No no i mean the code line

limber spear
#

SP is recommending converting any Jupyter notebooks to Python scripts

serene scaffold
# leaden narwhal

someone would have to start a notebook server to read this, so it would be easier for them if you do the command to convert it to flat text.

limber spear
#

I see 👍

#

Does nbviewer allow shareable notebooks

leaden narwhal
#

cmd?

limber spear
serene scaffold
leaden narwhal
leaden narwhal
#

I did a colab link

#

Easier i guess

#

Correct one

#

Also sorry for asking this guys, my head just hurts from looking at jupyter the whole day

rich moth
#

whats up people.

#

so with this curriculum learning tool i made and have been playing around with, I think i stumbled on something fundamental about how data/knowledge is structured.

basically, I found a way to measure the "learning complexity" of individual samples in ANY dataset (images, time series, tabular, text) using a single unified framework. but the crazy part is that when I sort training data by this complexity measure, I'm seeing performance gains from 3% to 150% (!!) depending on the dataset

#

what's even more wild though is the framework correctly identifies when data doesn't have inherent structure (like the Madelon dataset), where random is the winner

#

I tested it on 62 datasets across 4 domains. the biggest increase was +149% on the WAFER dataset, blood transfusion +84%, ECG data +53%, but on truly random data it's 0%, as it should be

#

but what i think i'm finding here is that there's something like a "conceptual dependency graph" hidden in data. some knowledge has prerequisites (like learning addition before multiplication), some doesn't (like learning colors)

But this framework i made can detect which is which automatically

i feel like there's something deeper here about how information itself is structured

limber spear
#

@rich moth test your stack on Manny’s dataset here 👀

rich moth
#

Yum yums! Lets do !

#

idk maybe im overthinking this but it feels like there's some universal pattern here about how knowledge organizes itself? like why does it work across images AND time series AND text AND tabular? seems weird right?

limber spear
#

Well we have data here. Let’s put your stack up to the test 🤔

#

Data can easily lie. A lot don’t understand that

rich moth
#

this sounds awesome let me see

#

Sao Paulo Geospatial datasets ?

leaden narwhal
#

and now that i did it on jupyter my r2 scores went to shit and now linear regression is better than rfr?

leaden narwhal
limber spear
#

@leaden narwhal put your accuracy metrics toward the bottom maybe and group your map plots together. From what I see your Random Forest metrics have higher accuracy vs LR numbers.

A confusion matrix and F1, precision, recall metrics should give you a more reliable accuracy metric.

#

This should tell you if your models are confused or not. Basically lying to you 😂

#

I lecture my models all the time

leaden narwhal
#

i think this shows the actual values

#

cause i was using log income values to do a prediction on income, which is stupid

#

so now i changed to linear regression

#

Im going to try xgboost

limber spear
#

Your project is really cool. Dedo no cu e gritaria

#

Did I say that right

leaden narwhal
#

XD

#

I'm not brazilian, I'm portuguese but it also applies lol

limber spear
#

Someone taught me bad words mkay

leaden narwhal
#

Is this an xgboost moment

#

Ok i dunno why but xgboost actually was goated and made some crazy predictions

#

almost perfect, some districts still havent predicted properly but most yeah. Im happy with this!

#

@limber spear

limber spear
#

Xgboost is pretty nice RF is goated as well imo

leaden narwhal
#

yeah but the r2 score is 10 times better

limber spear
#

This is where your fine-tuning skills can come in. Some are fine-tuning goats. This is where OpenAI and Grok devs make their living

#

Billion parameter models

#

But big tech won’t tell you this. Big tech will say cutting edge AI or proprietary

versed axle
glacial root
viscid urchin
subtle mirage
#

messing around with matplotlib and remembered you can add text to plots :D

lapis sequoia
#

ok, today, this sounds dumb, I never cloned a repo that was not mine ever. I thought that was cheating or something. I would either read about or look at code for reference if I did not understand it. Cloning, makes this so much faster. I never knew this. I only cloned for my own repos to edit or someone else's I had permission to. Everyone clones?

exotic star
#

i started getting into ai and i didnt know how/what's the best way to do it

#

i started learning pandas now numpy and then matplot

#

and seaborn

#

after that pytorch scikit-learn

#

and then ML methods like classification regression decision trees and so on

#

after that essential method or before that? and then deep learning

#

is this strategy decent? i kinda separated them into different parts and i do separate projects with all of them after learning a bit then a combined 1

#

right now learning about numpy, almost done and then starting matplot

craggy patio
rich moth
#

@Manny @limber spear São Paulo geospatial dataset results

limber spear
obsidian bronze
#

Hello

#

some of you guys are beginners

opaque flower
#

Is data science less saturated than other IT fields? Also, is it a good career choice for the future? I’ve heard some people say it’s a dying field.

limber spear
jaunty helm
subtle mirage
quaint mulch
quaint mulch
agile cobalt
exotic star
lapis sequoia
# quaint mulch Yes. People WANT to have their repo cloned. There is a counter that says how man...

All of this time, I would literally bookmark the actual repo if it was good and learn from it. The amount of time that would’ve been saved by simply cloning it and putting myself in their shoes…. It’s ok, the grit is there. Oh my god. I have only cloned my repo to change it or others that I had access to. Never ever cloned a repo as a guide when it’s like “oh I need a good example from their perspective”. I will remember this day. Forward onto Dawn. Let’s go. I don’t care I am glowing.

quaint mulch
proven current
#

What is data

gritty vessel
#

Hey I am working on image segmentation and my targets have nan values so masked loss function is the only way to go?

#

Like it ignores the nan and only get trained on valid data
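
A minimal sketch of that idea, assuming a plain MSE loss and targets where missing pixels are stored as NaN. It is one common way to handle this, not the only one, and the function name is made up. In a real framework the same thing is usually done with a boolean mask tensor.

```python
import math


def masked_mse(preds, targets):
    """Mean squared error over positions where the target is not NaN."""
    total, count = 0.0, 0
    for p, t in zip(preds, targets):
        if math.isnan(t):
            continue  # skip invalid targets so they add nothing to the loss
        total += (p - t) ** 2
        count += 1
    # If every target is NaN there is nothing to learn from this sample.
    return total / count if count else 0.0


preds = [1.0, 2.0, 3.0, 4.0]
targets = [1.0, float("nan"), 5.0, float("nan")]
loss = masked_mse(preds, targets)
print(loss)
```

Dividing by the number of valid pixels (not the total) keeps the loss scale comparable between images with few and many NaNs.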

verbal oar
#

heh I recalled "learn from it" from siraj raval data lit video 🙂

#

dont know if he uploads sth still must check

#

but he has ml on tensorflow not in pytorch if I remember correctly

gritty vessel
#

that day you helped me

#

So i noticed in scatter plot when there is rain all variables get low by 10-20 kelvin and some even get low by around 40 kelvin

#

so its a good idea to include all 4 vars

#

I had one more question guys

#

In meteorological data

#

weighted loss function is a good choice or not?

#

as we give more weightage lets say rain events

#

but in real scenario no rain events will be more

#

and rain events will be less

agile cobalt
gritty vessel
#

but training examples for rain case are definitely far fewer than no rain cases

#

95% data is of no rain cases

#

and 4% of rain

#

1%others
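
For reference, one common way to pick the weights for a split like that is inverse class frequency (the same formula scikit-learn calls "balanced"). This is a sketch with the 95/4/1 split above turned into made-up counts per 100 samples. Whether weighting actually helps depends on what errors matter for the forecast.

```python
def balanced_class_weights(class_counts):
    """weight = n_samples / (n_classes * count), so rare classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Illustrative counts matching the 95% / 4% / 1% split.
counts = {"no_rain": 95, "rain": 4, "other": 1}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

With these weights each class contributes the same total weight to the loss, so the model can't get a low loss by predicting "no rain" everywhere.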

serene scaffold
#

!mute 1270417623296905301 "1 hour" This is your final warning to stop advertising.

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied timeout to @last oriole until <t:1746474360:f> (1 hour).

verbal oar
#

can putting 1d data into 2d be called embedding?

#

or unproject
I'm viewing ml teach by doing and about feature representation

limber spear
verbal oar
#

ah ok Im thinking about embedding from math

#

as is word embedding

#

2d to 1d is to project

#

but vice versa? looks like its embedding not sure just

limber spear
#

I think vector embeddings have applications in cybersecurity though I am not sure.

verbal oar
#

hmm there is t-sne

#

t-sn e (embedding)

#

t-distributed stochastic neighbor embedding for sure

limber spear
#

I find it interesting what chunks of data can do. They dance around in our little machines 😅

#

Paint pictures. It’s fascinating

lapis sequoia
#

someone said this learn principles, algorithm, architecture (as in - design your own architecture not copy someone else architecture without understanding why you are doing things that way)
is this also related to ai ml stuff? or ml is a completely different thing and architecture and stuff dont apply to it?

quartz wren
#

hey anyone online
any tips for profile face detection
cant find anything anywhere
currently using cv2 and mediapipe
detection is really good for frontal
but not good for profile side
ping me if answer

verbal oar
#

architecture is about big picture of system birds eye view without going into details, as I think about it

#

I assume you mean model architecture

serene scaffold
limber spear
#

Let’s put them together and call it smodel or modware architecture

#

Tbh this is why I love this field. Innovation is endless. Make up words. Build a new model. Been having a blast. It’s like building with lego blocks

quaint mulch
quaint mulch
quaint mulch
dusty forge
#

I made a tool called ParquetToHuggingFace to help you upload your audio data to Hugging Face easily in Python. It takes your raw .wav files, turns them into Parquet format, and then uploads them to the Hub. The repo has clear steps on how to set everything up, where to put your files, and how to run the script. If you're working with speech data and want a quick way to share it on Hugging Face, give it a try!
GitHub Repo: https://github.com/pr0mila/ParquetToHuggingFace

#

🎉 Introducing GroqStreamChain! 🎉
A real-time AI chat application built with Python , FastAPI, WebSocket, LangChain and Groq. 💬 Seamlessly stream AI responses and interact with smarter chatbots powered by cutting-edge technology. 🤖
🚀 Features:

  • Real-time WebSocket communication
  • Streaming AI responses
  • Smooth and responsive UI
    🔗 Check out the project on GitHub: https://github.com/pr0mila/GroqStreamChain
    Join the conversation and start building your own AI-powered chat apps today! 💬

rich river
#

https://github.com/ultralytics/ultralytics/blob/main/examples/YOLOv8-ONNXRuntime-CPP/inference.cpp

char* YOLO_V8::WarmUpSession() {
    clock_t starttime_1 = clock();
    cv::Mat iImg = cv::Mat(cv::Size(imgSize.at(0), imgSize.at(1)), CV_8UC3);
    cv::Mat processedImg;
    PreProcess(iImg, imgSize, processedImg);
    if (modelType < 4)
    {
        float* blob = new float[iImg.total() * 3];
        BlobFromImage(processedImg, blob);
        std::vector<int64_t> YOLO_input_node_dims = { 1, 3, imgSize.at(0), imgSize.at(1) };
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
            Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU), blob, 3 * imgSize.at(0) * imgSize.at(1),
            YOLO_input_node_dims.data(), YOLO_input_node_dims.size());
        auto output_tensors = session->Run(options, inputNodeNames.data(), &input_tensor, 1, outputNodeNames.data(),
            outputNodeNames.size());
        delete[] blob;
        clock_t starttime_4 = clock();
        double post_process_time = (double)(starttime_4 - starttime_1) / CLOCKS_PER_SEC * 1000;
        if (cudaEnable)
        {
            std::cout << "[YOLO_V8(CUDA)]: " << "Cuda warm-up cost " << post_process_time << " ms. " << std::endl;
        }
    }
...

what does WarmUpSession do here?

quartz wren
#

theres some frames where it doesnt work

#

i was thinking about using retinaface but in terms of costs i think i will go with opencv and mediapipe

rare bane
#

Hello is there anyone who works with Tf-idf vectorization?

rich moth
serene scaffold
rare bane
#

That said does a Tf-idf vectorization always have to transform-fit with description of an unsupervised data?

lapis sequoia
#

stealcrus

#

thanks bro

serene scaffold
lapis sequoia
#

stela are you into ml? whats the hardest thing for you

serene scaffold
lapis sequoia
#

yeah

#

for you

#

? not for everyone else, for you

rare bane
serene scaffold
#

nothing is hard for me, because I'm awesome.

lapis sequoia
#

NICE I LOVE THAT CONFIDENCE BRO

#

THAT WAS COOL BRO THATS WHAT IM TALKING ABOUT BE CONFIDENT NOTHING IS HARD

#

ITS TOOO EASY

#

ITS TOO EASY

serene scaffold
#

I'm not actually being serious. I have problems that I don't immediately know how to solve every day.

#

when I was a beginner, the hardest part was dealing with how everyone explains things differently. not like in this server, but in books and online guides

serene scaffold
rare bane
serene scaffold
rare bane
#

Sorry if that doesn't explain too much, but I understand why it is done now

#

I just needed to know why it was fit and now for the transform part?
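
A toy sketch of the split, in case it helps: `fit` learns the vocabulary and IDF weights from the training text, `transform` only reuses them. `TinyTfidf` is a made-up class for illustration, not scikit-learn's `TfidfVectorizer`. It uses the same smoothed IDF formula as scikit-learn's default but skips the normalisation step.

```python
import math
from collections import Counter


class TinyTfidf:
    """Toy TF-IDF that only shows what fit and transform each do."""

    def fit(self, docs):
        # fit: learn the vocabulary and one IDF weight per word.
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc.lower().split()))
        self.vocab = sorted(doc_freq)
        self.idf = {
            word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            for word in self.vocab
        }
        return self

    def transform(self, docs):
        # transform: reuse the learned vocabulary; unseen words are dropped.
        rows = []
        for doc in docs:
            term_freq = Counter(doc.lower().split())
            rows.append([term_freq[w] * self.idf[w] for w in self.vocab])
        return rows


train = ["the cat sat", "the dog barked"]
model = TinyTfidf().fit(train)
train_vecs = model.transform(train)
new_vecs = model.transform(["the cat meowed"])  # "meowed" was never seen
print(model.vocab)
print(new_vecs[0])
```

New text gets vectors with the same columns as the training text, which is why you fit once and then only transform anything that comes later.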

rare bane
serene scaffold
verbal oar
#

I think ml is not hard, hard thing is to understand it then its easy

#

maybe rather its at start not intuitive and one must build some intuition

jaunty helm
#

TIL how to annotate a heatmap in hvplot/holoviews with dynamic text color based on the value by hitting my head against the wall a lot of times
(why is it so hard and why can't I find docs explaining it... probably skill issue)
assuming hm is your heatmap:
hm * hv.Labels(hm).opts(text_color=-hv.dim('value'), cmap='binary')
where 'value' is literal and not something you change, and the - is there so low values are dark instead, so you can actually see on the heatmap

verbal oar
#

also depends if someone thinks about classic ml as ml
or ml and deep learning as machine learning

#

for me its sometimes confusing

quaint mulch
sudden delta
#

i think the hardest thing about ML is people who know ML are bad at explaining ML

verbal oar
#

agree

#

you have sth in your eyes, see patterns and can't transfer it to someone

#

but this thing could be learned right, thats where experience comes in

lapis sequoia
#

can a data analyst please reach out to me? i'm facing trouble with cleaning shopify data

verbal oar
#

not sure, but maybe think about cleaning data in general rather than shopify specifically, try to generalize it

jaunty helm
limber spear
#

Someone in chat build a data cleaning stack or something 😂 it’s so boring 💀

serene grail
limber spear
#

Especially with medical data for example

#

I love my granny bro

serene grail
#

Yeah that's probably above my head
I'm just learning basic EDA for now

limber spear
#

No worries keep on cooking 🔥

verbal oar
#

etl is responsible for data cleaning?

serene scaffold
verbal oar
#

extract transform load

serene scaffold
#

what's the context for this?

verbal oar
#

more specifically transform for data cleaning?

serene scaffold
#

I've never heard of "extract transform load"

viscid urchin
#

The “T” is where you typically do some cleaning yeah

#

“ETL pipeline” is a fairly common phrase

serene scaffold
#

people are always inventing new terms 😠

viscid urchin
#

More for regular systems than for ML though

verbal oar
#

in short sth related to normalization, normal forms, there is 5nf at most

#

etl is in data warehouses course

#

no one cleans data manually?

#

I mean instead one can just use some etl tool

#

load means feed data to data warehouse

dense needle
#

Seems like it is a term from data engineering?

#

I have seen it used a bunch in the small amount of time I have spent in r/dataengineering

viscid urchin
#

ETL is a pretty old term. From the 80s. There used to be arguments about whether you should transform before loading or vice versa.

past meteor
#

For example, SAP ERP can have 100k tables (not kidding)

#

If you want to do anything with this downstream you're definitely going to want to pull this data down to some other place, consolidate, reshape etc.

viscid urchin
#

SAP is wild. Some of the servers people are running it on just have comical amounts of RAM.

verbal oar
#

yes normalize

past meteor
#

The inverse actually denormalize

verbal oar
#

yes ok

past meteor
#

Normalize ==> adding more tables, denormalize ==> making it flat(ter)
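
Tiny illustration with made-up tables: the normalized form keeps customers and orders separate, the denormalized form joins them into one flat list of rows.

```python
# Normalized: each fact lives in exactly one table.
customers = {
    1: {"name": "Ana"},
    2: {"name": "Bo"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
    {"order_id": 12, "customer_id": 2, "total": 15.5},
]

# Denormalized: join the tables so every row carries what it needs.
flat = [
    {**order, "customer_name": customers[order["customer_id"]]["name"]}
    for order in orders
]

for row in flat:
    print(row)
```

The flat version repeats the customer name on every order, which is wasteful for storage but convenient for analysis or as model input.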

verbal oar
#

I meant joining tables

#

merging

#

my bad

past meteor
#

No worries, it's clear you know what it means 🙂

lapis sequoia
#

gesturing whilst learning can help you immensely

verbal oar
#

I think must have muscle memory

#

so for example when understanding perceptron rotate hands?

fallow coyote
#

should I install the dotenv module for interacting with .env variables? or is there another module you lot recommend? For a little bit more information, I'll be making my first program that uses the AlphaVantage API to extract stock market information (for this case, extracting data concerning the US Dollar index). I want to start using webscraping and learning how to use APIs, particularly for data analysis

verbal oar
#

to improve retention of rotating line

lapis sequoia
past meteor
# fallow coyote should I install the dotenv module for interacting with .env variables? or is th...

Yeah!

A common pattern is that configuration is stored as environment variables. Lots of us deploy with Docker or Kubernetes, which means we can "inject" these env vars right into a specialised place where we run our app.

During local development we still need to provide some config. This is typically done in the form of a .env file that contains all the secrets (API keys and whatnot). This file is read with stuff like dotenv.
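
A minimal sketch of that pattern. The variable name `ALPHAVANTAGE_API_KEY` is just an assumed name for illustration, and `python-dotenv` is a third-party package, so the import is guarded and the code still works when the variable is set some other way.

```python
import os

try:
    # python-dotenv is third-party: pip install python-dotenv
    from dotenv import load_dotenv

    # Copies KEY=value pairs from .env into os.environ if the file exists.
    # Variables that are already set are left untouched.
    load_dotenv(dotenv_path=".env")
except ImportError:
    pass  # without python-dotenv, fall back to real environment variables


def get_api_key(name="ALPHAVANTAGE_API_KEY"):
    """Return the key from the environment or fail with a clear message."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Keep `.env` out of version control (add it to `.gitignore`) so the key never ends up in the repo.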

lapis sequoia
#

no its true, ever think about how when you're thinking about something you look up at the sky? instead of doing that use your hands as well. it might look weird but it also might work

lapis sequoia
trim dock
spring field
spring field
past meteor
spring field
#

idk, I love 'em
don't have to worry about any env setup pretty much, just start the container and begin developing stuff

lapis sequoia
#

s

glacial root
#

what does it mean by manually describe

#

i don't understand how else it would do it without ocr

#

this is chatgpt by the way

agile cobalt
# glacial root this is chatgpt by the way

if ChatGPT says something that makes no sense, odds are it makes no sense and it's just the model hallucinating

Technically you could have

  • native multi-modal image inputs (tokenize the image and feed it directly to the model as part of the prompt)
  • a separate OCR tool the model can use via function calling
    and it would make sense if the model tried to use the OCR tool first, then used native image inputs after it failed, but that's extremely unlikely
glacial root
lapis sequoia
#

Is it rad that I avoided langchain forever because I just thought it was trendy garbage and fine tuning T5 is more lit? Am I like a hipster now? RLHF is cool, but I don’t want to abandon my roots.

grand minnow
#

The alternative to Langchain that I found is LiteLLM. Its nice

serene scaffold
#

@proven current I removed your message because the content is disturbing for some users

quaint mulch
rich moth
rich moth
#

Ok It's my rough draft on my research paper .

#

This could change the game of how we design datasets. Imagine datasets with built-in complexity metadata that map optimal learning pathways and make curriculum learning effortless, eliminating the need to calculate sample complexity during training. UCF can enhance these datasets, transforming machine learning data from simple collections into structured knowledge maps with clear learning trajectories, dramatically improving training efficiency and transfer learning capabilities. This could establish a new gold standard for ML datasets where curriculum-readiness becomes a core feature rather than an afterthought, reimagine how we approach data design across all domains

#

The proof is in the pudding. It's all on the wall.

sudden delta
#

a way of sorting data by complexity so you can feed hard or easy stuff first?

limber spear
quaint mulch
quaint mulch
#

achieving performance improvements of up to 149% over random sampling baselines.
You need to use a better baseline rather than just random sampling. You need to use the latest SOTA in curriculum learning.

#

It seems that you purposely did not reveal the methods?

quaint mulch
verbal oar
#

anyone can write research paper?

quaint mulch
#

Yes, anyone can.

verbal oar
#

but I see its hard to start different style of language, similar to writing thesis

quaint mulch
#

I mean, anyone can, not saying that it is easy.
With enough effort and resources, almost everyone can,
the question is if it is worth it.

coral apex
#

so im working on benchmarking various activation functions (AF) to improve a NN model that is trying to learn the patterns of the sin function.

the thing is that i dont want to use any fancy methods like normalization or special optimizers just yet.

and am trying to improve the functionality just by changing the following configurations: number of hidden layers, number of hidden neurons, learning rate, gradient clipping threshold (ik this is a fancy method but its unavoidable for now).

so the problem im encountering is that my current model is adapting and predicting well when the training values only range from -pi to +pi (with 500 samples), with a loss as low as 10^-5, but the moment i increase the range to lets say -100 to +100 (5000 samples) the predictions for all the activation functions stagnate at a loss of 0.5, which is nowhere near enough obviously

any idea on how to fix this or improve this ?
should i send my code here or is that not allowed ?

#

using basic GD btw

limber spear
coral apex
#

training samples i mean

#

should i just send the entire code ?

#
print("hello world test for syntax highlighting")
limber spear
# coral apex im using x = np.linspace(-np.pi,np.pi,500)

I went to the machines. Total guess here: your mapping of inputs [-pi to +pi] to target labels of [-1 to 1] works, but the issue may be that your build isn't designed to take in inputs of [-100 to 100]. You probably just have to refactor portions of your build, like the functions
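
One possible reading of this, offered as a guess rather than a diagnosis: on [-100, 100] the net has to fit about 32 full periods, and a loss stuck near 0.5 is what MSE gives when the model just predicts 0 everywhere (the mean of sin² is 0.5). Since sin is periodic, one option is to wrap the input back into [-pi, pi) before it reaches the network. That does count as preprocessing, so it may be outside what the benchmark is supposed to allow.

```python
import math


def wrap_to_pi(x):
    """Map any real x to the equivalent angle in [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


xs = [-100.0, -3.0, 0.5, 42.0, 100.0]
wrapped = [wrap_to_pi(x) for x in xs]

# sin is unchanged by the wrap, so the targets stay exactly the same.
max_err = max(abs(math.sin(x) - math.sin(w)) for x, w in zip(xs, wrapped))
print(wrapped)
print(max_err)
```

With wrapped inputs the -100 to 100 problem becomes the same problem the model already solves on -pi to pi.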

coral apex
#

what do you mean by my model is not designed to take in those inputs?
im pretty new to ml too btw
like should it just take the inputs and try to learn it according to its map which will be [-1,1]??

arctic wedgeBOT
limber spear
coral apex
limber spear
#

Oh ok you would probably just have to do a light refactoring in your build to map everything correctly

coral apex
#

how would that look like and what was my mistake?

limber spear
coral apex
#

but what in that though ?

limber spear
#

Maybe what x does in your build. If that makes sense

#

The -pi to +pi you mentioned

coral apex
#

i am so lost here, what are you trying to tell me?

#

i commented out the -pi,pi cuz that one works fine but the -100,100 does not

limber spear
#

I’m a noob tutor mkay chat who’s a better explainer

coral apex
#

T_T
alright

#

thanks though!

limber spear
#

You can test ranges out. But I like breaking stuff to learn

#

👍

#

Lock in you’re on the right track bKC

#

I just suck at explaining. That means I don’t have the science down. Need to lock in as well lol

coral apex
limber spear
#

i was in the middle of cooking this up for a class

weak oxide
#

Have any of you guys used the SEC API

#

Not the one from Python itself which requires the API key but from the SEC which is the one where you request the headers

coral apex
stuck brook
#

Hello everyone, I have a small question, where can I find well-documented datasets that can support academic research or thesis development ?? Maybe some open-access platforms or even government data portals. Bonus points for anything that supports ML, predictive analytics. Thank you !! PLEASE @ ME HERE

agile cobalt
rich moth
limber spear
quaint mulch
# rich moth Honestly, I don't know. I'm lost what I do.. I would love to monetize it. Get...

I guess, 1st of all, congrats for getting this far, it seems that you did some real studying and real work.

2nd, sorry to burst your bubble, but there is still a gap between your draft and something publishable. The gap is not insurmountable, but I suppose it is somewhere between a few weeks and a few months. The biggest issue I can spot is that you need SOTA curriculum learning as your baseline. I also cannot judge your method, as it is not revealed yet. So keep up the good work and you'll get there.

3rd, the sad news is, even if this is published in a top journal, there is still a huge gap between that and monetizing it. And I don't even know what. A lot of fresh PhD grads want to convert their thesis into a startup, but very few succeed. If I knew how, I would have done it myself, I want to get rich quick too.

Finally, if you put the full version (with your methods and code) on a github, you can start emailing professors and ask for collaboration. They get to be co-authors, and you get valuable feedback and even funding for conference submission, and from there, I hope it is one step easier to get into some ML jobs.

sudden delta
limber spear
#

I have a bit of food for thought. So the father of modern genetics Gregor Mendel his work went unrecognized in the scientific community until about 16 years after his death. Sometimes no one even cares 😂 5+ centuries from now who will remember these billionaires

#

Sorry about laughing. Idk I think about these things.

rich river
#

I want to rewrite this without using libtorch

  torch::Tensor pred_masks = torch::nn::functional::interpolate(
      masks.index({scores_mask, torch::indexing::Ellipsis}),
      torch::nn::functional::InterpolateFuncOptions().size(
          std::vector<int64_t>({input_height_, input_width_})));

in which masks and scores_mask are both tensors.
I don't want to use libtorch because the library introduces great space cost to my project, but I dont know how to rewrite the functions such as the torch::nn::functional::InterpolateFuncOptions() and .index by using pointers like float *
anyone has ideas about it?
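
Not a drop-in answer, but here is the logic of those two calls spelled out on flat buffers, as a sketch under some assumptions: the default `nearest` interpolation mode, masks laid out row-major as (n, h, w), and `keep` playing the role of `scores_mask`. It is written in Python for readability and the loops port directly to `float*` indexing. It uses integer arithmetic for the source index, while libtorch computes the scale in floating point, so edge pixels could differ in rare cases.

```python
def select_and_resize_nearest(masks, keep, n, in_h, in_w, out_h, out_w):
    """Keep flagged masks, then resize each to (out_h, out_w) by nearest pixel.

    masks: flat row-major list of n * in_h * in_w floats
    keep:  list of n booleans
    """
    out = []
    for i in range(n):
        if not keep[i]:
            continue  # boolean-mask indexing: drop masks whose flag is False
        base = i * in_h * in_w
        for oy in range(out_h):
            # nearest source row: floor(oy * in_h / out_h), clamped
            sy = min(oy * in_h // out_h, in_h - 1)
            for ox in range(out_w):
                sx = min(ox * in_w // out_w, in_w - 1)
                out.append(masks[base + sy * in_w + sx])
    return out


# Two 2x2 masks; only the second is kept and upsampled to 4x4.
masks = [1.0, 2.0, 3.0, 4.0,
         5.0, 6.0, 7.0, 8.0]
resized = select_and_resize_nearest(masks, [False, True], 2, 2, 2, 4, 4)
print(resized)
```

If the original code relies on a different mode (for example bilinear), the index computation has to be replaced with the matching weighted average.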

lapis sequoia
#

does anybody have experience analysing META data?

ashen venture
#

Guys which packages and from where I should learn in python to master data science

fallow coyote
verbal oar
#

I'm impressed you are self taught, I thought you have some background, nice to hear and good luck

fallow coyote
#

I swear learning the statistics for ML is fucking annoying. Half the time I'm trying to interpret the context of the notation and symbols. Like I'm learning in a section about multivariate gaussian distribution; why tf is sigma being used as a variable?! now I have to distinguish between sigma meaning 'sum of' and sigma as a variable. Apologies for the rant. Hopefully learning the linear algebra side will be a bit easier to interpret

verbal oar
#

learn at first about gaussian distribution not multivariate would be less confusing

#

sigma is variance

#

oh sorry std - standard deviation, sigma squared is variance

#

and maybe read "statistics for machine learning" not about statistics without context its more annoying

fallow coyote
#

I mean the big sigma not the small sigma that denotes the variance. i swear what were these statisticians smoking when they came up with these formulas and decided to not use distinguishable notation?
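
For what it's worth, this is how the two big sigmas usually look side by side. It is a sketch in standard textbook notation, where $d$, $n$, $\mu$ and $\bar{x}$ are the usual dimension, sample count, mean vector and sample mean, not symbols copied from the ISLP text.

```latex
% \Sigma (a matrix) is the covariance of the multivariate Gaussian:
p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
       \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)

% \sum (an operator with limits) is the sum used to estimate it from data:
\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{\top}
```

The quick tell: a sum sign has an index and limits attached, while the covariance matrix stands alone, often in bold or with an inverse or determinant applied to it.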

quaint mulch
verbal oar
#

maybe machine learning mastery have sth like this dont sure

wooden sail
verbal oar
#

for example I learned sth about student t distribution "where it is used?", my question was then
I saw data science full archive and there ah right its in t-sne (t-distributed),
poisson distribution ah in poisson regression etc

#

I mean statistics course without context, also sth like confidence interval, dont sure if they should teach this way

fallow coyote
verbal oar
#

but it was not about calculating distribution but reading some stats lookup table

wooden sail
verbal oar
#

instead of substituting in formula

wooden sail
#

it isn't 😛 at all

fallow coyote
# wooden sail it isn't 😛 at all

I dont think the ISLP book has some form of notation guide. Tbf, the more I learn about statistics and the more I go through the book, the easier it gets to understand the overall concepts. Im only learning the surface level understanding so I can use the ML modules effectively enough. I can always go in depth into the proper theory behind the concepts at a later date

verbal oar
#

question why they reduce from 768 dims to 2 dims with umap, can't it be t-sne or another dim reduction method? inside the nlp with transformers book by O'Reilly

#

hmm assuming you didnt read it its hard to explain

fallow coyote
#

I couldnt tell you mate XD. Still attempting to learn the maths

tawdry finch
#

hey i am an intermediate in python i am looking for communities to join to work with anyone interested like small projects etc

limber spear
#

I asked the bots about the foundations of data science. Does this tree diagram look complete

#

Only linear algebra and calculus pikawow that is a ton of innovation baked into just 1 bubble node 👀 imagine what the mathematicians would say

elfin shadow
#

This looks nice. I cant be bothered to read most of it.

limber spear
#

Fair enough. Save for research purposes meowthumbsup

ashen venture
#

Never thought data science was so tough 😮

#

This probably explains the salary paid to them

serene scaffold
ashen venture
#

def calsal ():

frozen arch
lapis sequoia
#

because for example I know that MLOps is a whole field like ML

#

or you just need to have a shallow info about it like you do with ML?

#

plus you don't need all of these programming languages

#

just one would be enough

#

like python

limber spear
#

As shallow as an activation function blobhuh

#

Or a Wilcoxon rank sum test

lapis sequoia
#

I am not sure if it's the best roadmap but this website is kinda popular

limber spear
#

Ah yeh those guys. I know about them

limber spear
lapis sequoia
maiden harbor
lapis sequoia
#

but maybe they share these

maiden harbor
#

Maybe add, reinforcement learning, and or that thing where AI train themselves but I forgot the name

limber spear
#

I am looking to perfect my craft. Nothing more nothing less tbh

#

And contribute to society which I probably suck at

#

I like the roadmap peeps. They are building something to help others learn

#

And build 👍

lapis sequoia
#

try to get in touch with someone in the field you want to break in

#

tell them your background and let them guide you from there how to achieve your goal

limber spear
jaunty helm
limber spear
#

Total guess here. The bot probably ran a decision tree or algorithm of some sort

jaunty helm
#

also ig linear models, trees, clustering etc just don't exist anymore

limber spear
#

How so

jaunty helm
#

also I just realized it put automl under deep learning

limber spear
#

The diagram I posted? That would probably just be in 1 node of that diagram

jaunty helm
limber spear
#

I disagree

#

But. I understand your conjecture

#

If you frequent the Linux community. A diagram means squat

lapis sequoia
#

but it's up to you

limber spear
#

I don’t think you know who I know

sudden delta
#

would be interesting to see a timeline of when these different things on the map were invented

#

and how close that is to how they are chained

#

and how much is squished into the past few decades, while the math fundamentals go back centuries 😂

limber spear
#

Honestly that is what I took from when I first pulled that visual. The statistics node has a timeline on its own merit.

#

Statisticians cook.

verbal oar
#

i'll add graph theory

#

maybe fuzzy logic

#

maybe etl
metaheuristics, data mining

#

fuzzy logic due to fuzzy clustering c-means, neuro-fuzzy networks

#

maybe its overboard

verbal oar
#

Knowledge representation and reasoning

drifting loom
#

Anyone up for simple DS project? For skill up?

hollow cobalt
#

I’m trying to clean and format a large raw text file. Does anyone know any methods that are best for cleaning large amounts of text?

hollow cobalt
#

.txt files

drifting loom
drifting loom
untold dove
#

the problem with RSI if you have read the STOP algo paper is it has a significant bottleneck sadly I think if you were really creative you could build off the implementation's within that paper with multiple algos maybe say beam search top p top k alteration. Or some other sort of dynamic deducing algo

limber spear
#

I can’t believe I started a debate over that diagram. What I find interesting is that what if the diagram read left to right. Or right to left. Perspectives can differ thonk

short escarp
#

Hi guys, anyone here is expert in machine learning. Mainly in sklearn.svm SVC (support vector classification). I want to ask some questions

fickle shale
#

Don't ask to ask, just ask!!

quaint mulch
limber spear
#

You could literally draw a timeline in just the machine learning and deep learning nodes. 2 nodes.

#

But then if you turn it into a decision tree, everything changes

#

or does it

limber spear
#

Perplexing. You could legit earn a doctorates degree with this research

fluid cave
#

hello gys,
what are the best techniques to improve the accuracy of a classification model (tabular data with a lot of categorical variables)

dusty forge
verbal oar
#

ensemble methods also

pine arch
#

After PCA I had more than 2 PC that I can use for clustering, and its impossible to work with more than 3, what should I do in such cases?
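
One thing that might help here: the 3-component limit only applies to plotting, not to the clustering itself. Distance-based methods run on as many components as you keep. A bare-bones k-means sketch on made-up 4-component points (in practice scikit-learn's `KMeans` would do this):

```python
import math


def kmeans(points, init_centroids, n_iter=10):
    """Plain k-means; math.dist works for any number of dimensions."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up data: six points described by four principal components each.
points = [
    [0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.2, 0.0], [0.0, 0.1, 0.0, 0.1],
    [5.0, 5.0, 5.0, 5.0], [5.1, 4.9, 5.0, 5.2], [4.8, 5.0, 5.1, 5.0],
]
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
```

You can cluster on all the components and then plot only the first two or three, coloured by cluster label, to look at the result.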

limpid dew
#

the tab:blue is goated

obtuse acorn
#

so im using seaborn to plot data, and im using a pairplot and i can set corner=True so it doesnt have duplicate plots on one side of it

#

but is there a way to have it plot a different type of plot on one side

#

like if i set corner=True it doesnt plot the ones on the top right, but is there a way to have it plot a kde plot on the top right?
obv i could just manually edit the images so its got the other type on the top right but it would be easier if there was a way to do it programmatically
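
One way that should work is seaborn's `PairGrid`, which lets each triangle of the grid use its own plotting function. A sketch follows; the helper name `mixed_pairplot` is made up, and `df` is assumed to be a DataFrame of numeric columns.

```python
def mixed_pairplot(df, hue=None):
    """Scatter plots below the diagonal, KDE contours above it."""
    import seaborn as sns  # imported here; needs seaborn and matplotlib

    grid = sns.PairGrid(df, hue=hue)
    grid.map_lower(sns.scatterplot)  # bottom-left triangle
    grid.map_upper(sns.kdeplot)      # top-right triangle
    grid.map_diag(sns.histplot)      # diagonal
    if hue is not None:
        grid.add_legend()
    return grid
```

Call it as `mixed_pairplot(df)` or with a categorical column, e.g. `mixed_pairplot(df, hue="sex")` if the data has one, then `plt.show()` as usual.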

#

also, any ideas for better ways to plot stuff?

jaunty helm
obtuse acorn
#

thanks

jaunty helm
#

if you use something like plotly or hvplot, then you click an item in the legend and that will be hidden
e.g. if I made this in hvplot, I can click Male then all Male data will be set to alpha 0.3 (configurable)
(hvplot is more like a higher api that can wrap matplotlib, bokeh or plotly; the latter two can do what I described)

obtuse acorn
fresh mulch
#

Good day everyone, I'm new here and I just started learning Python. It's only been a month of daily solving fundamental problems for each basic topic in Python (not including the OOP topics), and I'm almost done with the basics. I want to dive into the world of Data Science and AI, but I'm not sure what to do after I'm done studying Python basics: should I study math and statistics for data science, continue learning Python OOP, study data structures and algorithms, or just go straight into Python data science libraries like NumPy and pandas?

peak thorn
#

i have to build a project for number/license plate detection and extract the content from the plates. please guide me in this because it's my first computer vision project. Thank you

jaunty helm
elfin shadow
#

When was this.

#

@sterile heath this is really interesting so does that mean that your offspring could have the same problem.

sterile heath
#

I don't know about the genetics of it. It can sometimes take two to tango.

#

But also, offspring are not a likelihood in my future.

#

Nor have I any.

#

Nor have I ever.

elfin shadow
sterile heath
#

Sorry guys.

fickle shale
verbal elbow
#

I'm a data scientist who's looking for a paid internship opportunity, please how can you be of help to me? Thanks

rich moth
#

man I've been working on this UCF paper for days, i think I'm almost done but i don't know exactly how to incorporate the visuals. I think I will create some type of "gallery" with the results? Also , Do I need an endorsement for a arXiv submission?

limpid zenith
lapis sequoia
#

what is job
in coroutines
/threads

rare bane
#

been working on unsupervised data and i have a problem getting two annotated labels for the visualization to be visible, i've modified the xytext for about 2 hours and i've gotten zilch results

here is the syntax for the code, if anyone can help:
colormap = plt.get_cmap('tab20', num_clusters)
colors = [colormap(i) for i in range(num_clusters)]
plt.figure(figsize=(17, 12))

for cluster_num in range(num_clusters):
    cluster_points = tfidf_matrix_reduced[df['cluster'] == cluster_num]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], color=colors[cluster_num], label=cluster_to_genre[cluster_num], alpha=0.9)

plt.title('Book Genre Clusters with 2D PCA')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')

plt.annotate('Popular / Romantic →', xy=(0.42, 0), xytext=(0.63, 0.18), fontsize=11, color='blue')
plt.annotate('← Serious / Thought-Provoking', xy=(-0.5, 0), xytext=(-1.0, 0.2), fontsize=11, color='red')
plt.annotate('↑ Literary / Award-Winning', xy=(0, 0.5), xytext=(0.05, 0.75), fontsize=11, color='purple')
plt.annotate('↓ Genre Fiction / Mass Appeal', xy=(0, -0.5), xytext=(0, -1.0), fontsize=11, color='green')

plt.legend(title='Predicted Genre', bbox_to_anchor=(0.5, -0.08), loc='upper center', borderaxespad=0., ncol=num_clusters // 2 if num_clusters > 2 else 2)

plt.tight_layout(rect=[0.08, 0.12, 0.92, 0.95])
plt.subplots_adjust(bottom=0.22)
plt.grid(True)
plt.show()
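
One possible cause, offered as a guess rather than a diagnosis: with the default `xycoords='data'`, matplotlib does not draw an annotation whose `xy` point lies outside the axes limits, so labels anchored at points like `(0, -0.5)` can vanish if the PCA data does not reach that far. A minimal sketch that pins the labels to axes-fraction coordinates instead; the random points and the label positions are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the 2D PCA output (hypothetical data).
rng = np.random.default_rng(0)
points = rng.normal(scale=0.2, size=(50, 2))

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(points[:, 0], points[:, 1], alpha=0.9)

# (text, position in axes fraction, horizontal alignment, color)
axis_labels = [
    ("Popular / Romantic →", (0.98, 0.52), "right", "blue"),
    ("← Serious / Thought-Provoking", (0.02, 0.52), "left", "red"),
    ("↑ Literary / Award-Winning", (0.5, 0.95), "center", "purple"),
    ("↓ Genre Fiction / Mass Appeal", (0.5, 0.05), "center", "green"),
]

annotations = []
for text, position, align, color in axis_labels:
    # Axes-fraction coordinates do not depend on the data limits,
    # so the label stays inside the plot whatever the data range is.
    annotations.append(
        ax.annotate(text, xy=position, xycoords="axes fraction",
                    ha=align, va="center", fontsize=11, color=color)
    )

fig.canvas.draw()
```

Passing `annotation_clip=False` to `annotate` while keeping data coordinates is another option if the labels should stay tied to data positions.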

zealous hollow
#

Hii!!

so i have a question regarding pyomo.

my problem has an objective and a stationary value. for an ideal convex problem the stationary value is 0, but in any real case it can never be exactly zero; i just want it to move towards zero, the closer the better.

the objective function is an entirely different value. PS: the stationary value and the objective can't both go in pyomo's objective object, since that makes them get compared against each other, which shouldn't happen in my setup.

tropic sphinx
#

Hi everyone,
I'm looking for some guidance and would really appreciate your help.

I'm not a data scientist by profession, but I've been learning machine learning and working with Python. I'm currently building a tool that tracks data transformations before the data is fed into a model for training.

Right now, I'm trying to find example projects—either ones you've worked on yourself or available online—that I can use to test my tool. I'm primarily focusing on transformations using pandas, NumPy, and scikit-learn.

I can build basic pipelines myself (e.g., using fillna, one-hot encoding, PCA, etc.), but since I don’t have experience with real-world projects, I’m not confident I’m covering all the important cases. Any pointers to existing pipelines or datasets with preprocessing steps would be incredibly helpful.

Thanks in advance for your guidance!

rich moth
rich moth
#

I find the feature space fascinating. It's like the DNA of each dataset.

digital basin
#

hi guys, im trying to getting into the AI world, but idk where to start and where to learn, can you please help me?

rich moth
#

That all depends what the most effective learning strategy is for you. 😄

#

Out of all the feature spaces i looked at guys, theres hundreds. Look at this! I dont know what it is , but this one is hypnotic.

lapis sequoia
#

are people hiring a lot for ml engineers?

rich moth
warped notch
#

Hello what should I as a newbie to ML use ? plotly or matplotlib for visualization

#

what do advanced users use ?

#

is there anything better than these two ?

agile cobalt
#

for simple things, plotly express is very straightforward and can make interactive plots
(plotly express is a sub-module of plotly)

for more custom things you might want to look into matplotlib or seaborn

serene scaffold
agile cobalt
#

iirc Vega Altair is also somewhat popular, but personally I also prefer plotly

serene scaffold
#

When I do use matplotlib, it's only via pandas. Imo matplotlib has the worst API of any data science library

#

And it isn't close

gusty silo
#

Wassup guys

rich moth
#

This image shows what happens when you rank all samples from different data types (images, text, time series, and tabular data) by a single universal complexity measure and divide them into 10 difficulty bins.

#

Its 141k+ samples from 50+ datasets.

#

I think the fact the different data types naturally separate into ten distinct difficulty regions helps bring this all home.

fresh mulch
#

so i was wondering which SQL i should master (or start learning), can you help me decide?

what type of SQL is best for Python Data Science?

  1. MySQL
  2. PostgreSQL
  3. Oracle
  4. others(mention it 🙂 )
rich moth
#

I like option 2

#

In the long run I think it would do you the most justice because of its analytical capabilities.

limber spear
#

I was taught to develop in vanilla SQL. It codes to all 4 on that list

rich moth
#

I got the idea to create a unified dataset across a smaller subset of 14k samples. results confirm the exact same stratification pattern we saw in the larger test

storm nexus
#

Hello

#

Is there a book which discusses multi class classification techniques like OVR AND OVO

#

And explicitly states those names, and not just the math

obtuse acorn
#

so ive got a dataset thats got a date column and a time column and im reading the csv into a pandas dataframe, and i can set parse_dates=['date'], date_format='%d-%b-%y' and read the date just fine

#

but i dont see how i set it to read the date and the time if they are in separate columns

#

ah data.time = pd.to_timedelta(data.time) worked
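
A minimal sketch of how the two columns could then be combined into one timestamp, assuming the time column holds `HH:MM:SS` strings; the inline CSV and the `timestamp` column name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV with separate date and time columns.
csv_text = (
    "date,time,value\n"
    "05-Mar-24,13:45:00,1.5\n"
    "06-Mar-24,08:00:30,2.0\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Parse each column on its own, then add them together.
data["date"] = pd.to_datetime(data["date"], format="%d-%b-%y")
data["time"] = pd.to_timedelta(data["time"])
data["timestamp"] = data["date"] + data["time"]

print(data[["timestamp", "value"]])
```

A datetime plus a timedelta gives a datetime, so the combined column can be used directly as an index or for resampling.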

prisma patrol
#

I'm looking for someone interested in networking for Data Science. I'm very motivated and would like to meet people who are in tune with me

plush kettle
#

Guys, is it possible to train a resnet50 model with 640 x 640 images?

#

Here is my collate_fn:
```py
def collate_fn(batch):
    images = list(image.to(DEVICE) for image, _ in batch)
    targets = []
    for _, target in batch:
        boxes = []
        labels = []
        for annotation in target:
            bbox = annotation['bbox']
            # Convert from [x, y, width, height] to [xmin, ymin, xmax, ymax]
            xmin = bbox[0]
            ymin = bbox[1]
            xmax = bbox[0] + bbox[2]
            ymax = bbox[1] + bbox[3]
            boxes.append([xmin, ymin, xmax, ymax])
            labels.append(annotation['category_id'])  # Use 'category_id' from COCO

        targets.append({
            'boxes': torch.as_tensor(boxes, dtype=torch.float32).to(DEVICE),
            'labels': torch.as_tensor(labels, dtype=torch.int64).to(DEVICE)
        })

    return images, targets
```

#

I use my data already processed with roboflow, it's pretty much like this:
DATA_DIR = '/content/oreo-1'  # Replace with the actual path
TRAIN_ANNOTATION_FILE = os.path.join(DATA_DIR, 'train_annotation/_annotations.coco.json')
TRAIN_IMAGE_DIR = os.path.join(DATA_DIR, 'train')  # Adjust as needed
VAL_ANNOTATION_FILE = os.path.join(DATA_DIR, 'val_annotation/_annotations.coco.json')
VAL_IMAGE_DIR = os.path.join(DATA_DIR, 'valid')  # Adjust as needed
NUM_CLASSES = 122  # Replace with the number of classes in your dataset (e.g., 80 for COCO)
BATCH_SIZE = 4
LEARNING_RATE = 0.001
NUM_EPOCHS = 175
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
CONFIDENCE_THRESHOLD = 0.5
IOU_THRESHOLD = 0.5

verbal oar
#

hmm I thought collate is for text but its for images too why?

#

I see maybe too much of epochs

#

hmm I think not, resnet is for small images

#

must resize them

#

also it is longer to process bigger images

#

dont remember what size was sth like 32 or 64, check it

verbal oar
#

is temperature of llm related to simulated annealing?

serene scaffold
#

A higher temperature is a higher likelihood that the model will do that.

If you've taken chemistry, you'll remember that higher temperature systems are more chaotic, and stuff.
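
A minimal sketch of how temperature usually enters sampling, assuming the common formulation where logits are divided by the temperature before the softmax; the logits are made up. The `exp(x / T)` shape is the same Boltzmann-style form that simulated annealing uses for its acceptance probability, which is roughly where the shared name comes from.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, flattened or sharpened by temperature."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more randomness

print("T=0.5:", [round(p, 3) for p in cold])
print("T=2.0:", [round(p, 3) for p in hot])
```

Lower temperature concentrates probability on the highest-scoring token; higher temperature spreads it out, so sampling becomes more varied.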

glossy canyon
#

Hey guys, is there anyone who has an idea on how to create custom tokenization for OCR or how to train Tesseract using custom datasets?

rich moth
#

I mean the answer is right there. Dont ask us I feel, you should have been showing us results by now!

#

But what insight do you seek that you probably can find yourself?

#

What I've found in my research is we all learn differently. If we treat humans as individual datasets, we all have our unique learning methods, some even hidden.

#

😄

rich moth
#

I finally got this visual working!

#

It's the culmination of processing 62 datasets from 4 domains, 141k+ samples in total

lapis sequoia
#

can a data analyst please reach out to me? i'm struggling in my internship with no mentor.

outer cloak
#

yo @lapis sequoia

#

join the voice chat

#

"Voice Chat 0" ok

odd meteor
odd meteor
#

Meanwhile, who else is submitting their work to NeurIPS? 😛

glossy canyon
# rich moth Which version of tesseract are you planning on using. Theres a lot going on her...

Thanks for the honest feedback! I agree that most of this info is out there, and I have been researching it—I’m actually experimenting with both a custom CRNN model and now considering Tesseract 5.4.1 for comparison.
My main challenge is fine-tuning Tesseract on a custom script like Balochi, especially around generating accurate training data and understanding the lstmtraining process. I’m not looking for spoon-fed answers—just hoping someone might have hands-on experience or tips that could help speed things up a bit.
And you're totally right—we all learn differently. For me, discussing things out loud (or in chat) helps uncover blind spots and validate whether I'm on the right track. 😄

verbal oar
#

yes youre right 224x224, I checked in search

#

it was 224 not 24, as I guessed 32 or 64 size, my bad
but still relatively small images

verbal oar
#

I meant I thought it is 24x24 nvm

#

ah maybe you want to fine tune it, then its different thing
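
As a rough sanity check on the image-size question, assuming the standard ResNet-50 layout (five stride-2 stages, overall stride 32): 224x224 is the ImageNet training convention rather than a hard limit, and the torchvision implementation ends in an adaptive average pool, so larger inputs just produce a larger final feature map. A small sketch of that arithmetic; the function name is made up.

```python
def resnet50_feature_map_size(height, width, num_halvings=5):
    """Spatial size of the last conv feature map for a given input size.

    Each stride-2 stage (stem conv, max pool, and the first block of
    layers 2 to 4) maps n to floor((n - 1) / 2) + 1.
    """
    for _ in range(num_halvings):
        height = (height - 1) // 2 + 1
        width = (width - 1) // 2 + 1
    return height, width


print(resnet50_feature_map_size(224, 224))
print(resnet50_feature_map_size(640, 640))
```

The larger input costs more memory and compute per image, which is usually the practical constraint rather than the architecture itself.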

lapis sequoia
#

Hello everyone, to study ai/ml and robotics do you need to learn about electricity and how does it work (asking as a self-taught programmer)

agile cobalt
#

for ai/ml not really
for robotics you'll likely want to have a decent notion of physics though

lapis sequoia
#

Aha so it's like i don't need to study it deeply right?

woven stream
#

Nah I mean my CS course has robotics in it and it covers kinematics briefly and control (PID / MPC), Reinforcement learning, Markov decision processes, some sensor stuff and just general deep learning

heavy crow
#

If I have large amounts of data ~5million training points for a relatively small CNN with 0.4M params, should I be running the full dataset per epoch or only a subset? How would I estimate the number of batches per epoch to try?

#

Obviously if I run the full dataset per epoch my LR scheduler will kick in way later, but are there other benefits?

serene scaffold
heavy crow
#

okay, thank you! Is it at all common practice to have a LR scheduler act within an epoch?

serene scaffold
heavy crow
#

Okay! thanks for your input 🙂 Looking at some literature on similar networks I see they halve the LR every 2*10^5 minibatches, which seems like a good starting point
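
A minimal sketch of that step-based schedule, counted in minibatches rather than epochs. The batch size of 64 and base LR of 1e-3 are assumptions for illustration; in PyTorch the same effect should be achievable with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)` stepped once per minibatch.

```python
def batches_per_epoch(num_samples, batch_size):
    """Number of minibatches needed to see every sample once (ceiling division)."""
    return -(-num_samples // batch_size)


def learning_rate_at(step, base_lr=1e-3, halve_every=200_000):
    """Learning rate after `step` minibatches, halved every `halve_every` steps."""
    return base_lr * 0.5 ** (step // halve_every)


steps_per_epoch = batches_per_epoch(5_000_000, 64)
print("minibatches per full epoch:", steps_per_epoch)
print("epochs between halvings:", 200_000 / steps_per_epoch)
for step in (0, 200_000, 400_000):
    print(step, learning_rate_at(step))
```

Counting in minibatches makes the schedule independent of how an "epoch" is defined, which helps when the dataset is large enough that epochs are subsampled.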

#

When training residual super resolution networks, is mode collapse a problem? I.e. mean and var collapsing to zero because the low and high res images are already close to each other?

serene scaffold
#

idk what that even is

heavy crow
#

For image super resolution you can increase efficiency by upscaling the image with normal upscalers such as bicubic or lanczos and then learn a delta that gets added to this.

#

instead of using transposed convolutions to reach a higher resolution output from the low res input
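
A toy sketch of the residual idea described above, assuming NumPy. Nearest-neighbour repetition stands in for bicubic or Lanczos, and `predict_delta` is a hypothetical placeholder for the trained network.

```python
import numpy as np


def upscale_nearest(image, factor):
    """Cheap classical upscaler (stand-in for bicubic / Lanczos)."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


def residual_super_resolve(low_res, predict_delta, factor=2):
    """Upscale classically, then add a learned correction on top."""
    base = upscale_nearest(low_res, factor)
    return base + predict_delta(base)


low_res = np.array([[0.0, 1.0],
                    [2.0, 3.0]])


def zero_delta(base):
    # A network that has learned nothing yet predicts a zero correction.
    return np.zeros_like(base)


output = residual_super_resolve(low_res, zero_delta)
print(output.shape)
```

With a zero correction the output is just the classical upscale, so the network only has to model what the upscaler gets wrong.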

#

Okay they were not the original, seems like SRResNet came before them.

final jolt
#

thought id try asking here since stuff like pandas is related but anyone got some experience/recommendation on a python library to convert pdf to possibly csv? Im fine managing the data cleanup itself but looking for other options. I have tried pdfplumber and its not working well. tabula works quite well but it relies on java which im not a fan of needing to have that as part of my app

limpid zenith
#

like OCR libraries?

final jolt
limpid zenith
final jolt
#

I did see that one as well as one called ThePipe and not really a fan of feeding private financial data to an LLM, especially if I ever want to release this application for others

final jolt
#

huh tabula creates massive lists of what it parses. and each entry in the list is a table. Well thats notable but kinda annoying

rich moth
#

So I took the UCF stuff and decided to make a trading bot. The idea here is to represent market structure as a complex number, mapping the market into this phase space where I can visualize different regimes way more clearly.

What I did was build these layers that all talk to each other. Like one part figures out what market "regime" we're in (trending, choppy, whatever), another part picks the right strategy for that regime, and another part handles risk. The cool thing is they all continuously adapt, no retraining needed.

sudden delta
rich moth
#

I made a bunch of visuals, hoping to have some stuff to share.

rich moth
final cobalt
#

Thought y'all might like this

obtuse acorn
#

any idea if its possible to resize the squares in a seaborn or plotly heatmap?

#

i got it working but it doesnt look right

radiant cipher
#

Anyone aware of something one can let loose on a set of git repos with a task and get plans/code changes out of it?

obtuse acorn
final jolt
radiant cipher
#

So an agent system to apply global changes to about 100 git repos

final jolt
#

So you want to create like a report against a bunch of repos based on a code pattern you provide, like you are looking for vulnerabilities or improper code blocks and then change them?

final jolt
# radiant cipher <@308963648636715011> exactly

Well thats a lot of prep to do. The only things I have encountered like that are custom things in a corporate environment. What I would suggest starting with would be simple scripts that can do pattern matching for some example code you are trying to look for. Once you get that working checking a bunch of repos will be the easier part (automatically editing them is another matter though)

agile cobalt
#

even for a single git repo you'd get mixed results, let alone 100 at once

final jolt
#

imo

austere shore
#

Can anyone teach me AI/ML

final jolt
austere shore
#

Oh

#

I'm sorry

#

But can you teach me

final jolt
#

I cannot

austere shore
#

Oh

plush kettle
#
```
Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
Using 'backbone_name' and 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-5af1204afd6e> in <cell line: 0>()
----> 1 model = create_retinanet_model()

3 frames
/usr/local/lib/python3.11/dist-packages/torchvision/models/detection/backbone_utils.py in <lambda>(kwargs)
     63     weights=(
     64         "pretrained",
---> 65         lambda kwargs: _get_enum_from_fn(resnet.__dict__[kwargs["backbone_name"]])["IMAGENET1K_V1"],
     66     ),
     67 )

TypeError: unhashable type: 'list'
```
I got this error when executing this method
```py
def create_retinanet_model():
    # Load a pre-trained ResNet50 backbone
    backbone = torchvision.models.resnet50(pretrained=True)
    test = list(backbone.children())[:-2]
    backbone = torch.nn.Sequential(*test)  # Remove the last two layers

    # Input channels for the feature pyramid network.  Resnet50 outputs 2048
    in_channels_list = [2048, 1024, 512]  # Channels for P3, P4, P5

    # Output channels for FPN
    out_channels = 256

    # Create the feature pyramid network.
    fpn = resnet_fpn_backbone(in_channels_list, out_channels)

    # 91 because of the background class
    num_classes = NUM_CLASSES
    # Anchor generator
    anchor_generator = AnchorGenerator(
        sizes=((32,), (64,), (128,), (256,), (512,)),
        aspect_ratios=((0.5, 1.0, 2.0),) * 5
    )
    # Put anchor generator inside the model
    model = RetinaNet(backbone,
                      num_classes=num_classes,
                      fpn=fpn,
                      anchor_generator=anchor_generator)
    return model
```
why is this?
#

I just want to remove the two last layers of torchvision Resnet50 backbone
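
Reading the traceback, the failure is at `resnet.__dict__[kwargs["backbone_name"]]`: the first argument of `resnet_fpn_backbone` is treated as `backbone_name` (a string such as `'resnet50'`), so passing `in_channels_list` there ends up using a list as a dictionary key. The `list(backbone.children())[:-2]` slicing is not what raises. A torchvision-free sketch of the same mechanism; `MODEL_REGISTRY` and `lookup_backbone` are made-up stand-ins, not torchvision names.

```python
# Hypothetical stand-in for a name -> constructor lookup table.
MODEL_REGISTRY = {"resnet50": "<resnet50 constructor>"}


def lookup_backbone(backbone_name):
    """Look the backbone up by name, the way a dict-based registry would."""
    return MODEL_REGISTRY[backbone_name]


# A string key works.
found = lookup_backbone("resnet50")

# A list cannot be hashed, so it cannot be a dict key at all.
try:
    lookup_backbone([2048, 1024, 512])
except TypeError as exc:
    error_message = str(exc)

print(found)
print(error_message)
```

Assuming the goal is a ResNet-50 FPN backbone, calling it with keyword arguments along the lines of `resnet_fpn_backbone(backbone_name='resnet50', weights=...)` should avoid this particular error; the exact arguments are worth checking against the installed torchvision version.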

radiant cipher
final jolt
radiant cipher
final jolt
#

anyone familiar with pymupdf?

serene scaffold
# final jolt anyone familiar with `pymupdf`?

please remember to always--every time--ask your actual question. please never ask "does anyone know about x". just ask your actual question about x, and people will know it's about x from reading it.

final jolt
#

yea sorry. got sidetracked lol

#

basically getting this error

```
  File "D:\scripts\pybudget\pdf_convert.py", line 57, in <module>
    pymu_pdf(pdf_path, csv_path)
  File "D:\scripts\pybudget\pdf_convert.py", line 35, in pymu_pdf
    pprint(tabs[0].extract())
TypeError: 'module' object is not callable
```
when trying to just extract tables from a pdf. One-off table parsing works, but trying to iterate over it is failing
```py
def pymu_pdf(pdf_path, csv_path):
    pdf = pymupdf.open(pdf_path)
    print(f"Total pages: {len(pdf)}")
    for pages in pdf:
        if pages.number == 2:
            tabs = pages.find_tables(strategy="text")
            if tabs.tables:
                pprint(tabs[0].extract())
```
serene scaffold
final jolt
#

yup and you are right the example is wrong on their docs

serene scaffold
#

because pprint is a module that contains a function that's also named pprint. so if you do import pprint, then pprint is a module. if you do from pprint import pprint, then it's a function

#

I recommend doing import pprint as pp and then pp.pprint. that way it's never a mystery which one pprint is.
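
A small standard-library sketch of the difference; the alias `pprint_function` is only there so both forms can sit side by side in one file.

```python
import pprint                                     # `pprint` is the module here
import pprint as pp                               # the alias suggested above
from pprint import pprint as pprint_function      # this one is the function

module_is_callable = callable(pprint)             # modules cannot be called
function_is_callable = callable(pprint_function)  # functions can

# Calling the module reproduces the error from the traceback.
try:
    pprint({"a": 1})
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# With the alias it is always clear which one is meant.
pp.pprint({"table": [[1, 2], [3, 4]]})
```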

final jolt
#

yup that was the issue, I was originally doing print and had errors so went back to the example to test and missed the pprint from pprint

serene scaffold
#

there was actually a PEP that could have fixed this, but it was rejected

final jolt
#

heh, bummer, and yea I have been good about just asking until this time heh. thanks for the info

#

now I can try this again with pprint to see if this works. However either way this is major progress as I was trying with tabula before and it was very cumbersome

final jolt
#

oh that is soooo much better

obtuse acorn
#

i basically drew white rectangles above the heatmap then drew colored scaled rectangles above those

scenic parcel
scenic parcel
dull mortar
#

im not sure if this comes under this channel but

how are numpy arrays structured? like how does it compare to matrices and their notation? (is a 3x4 matrix the same as a numpy array with shape (3, 4)? will using functions like np.dot() on such an array yield the same results as the same operation on a 3x4 matrix (and a 4x1 vector)?)

im asking specifically for like visualisation. in math usually the first number corresponds to the number of rows and im basically just wondering if thats the same for numpy. (and if itll work the same for matrix operations)

sorry if my phrasing is slightly off. kind of new to both linear algebra and numpy
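
A minimal NumPy sketch of the convention being asked about: for a 2D array, the first entry of `shape` is the number of rows and the second the number of columns, and `np.dot` follows the usual matrix product rules. The values are arbitrary.

```python
import numpy as np

# 3 rows, 4 columns: the same layout as a 3x4 matrix.
matrix = np.arange(12).reshape(3, 4)

# A 4x1 column vector of ones.
vector = np.ones((4, 1), dtype=int)

# (3, 4) times (4, 1) gives (3, 1), as on paper.
product = np.dot(matrix, vector)

print(matrix.shape)
print(matrix[1, 2])   # row index 1, column index 2 (both zero-based)
print(product.shape)
print(product.ravel())
```

One difference from textbook notation is that indices start at 0, and a 1D array of shape `(4,)` is neither a row nor a column vector until it is reshaped.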

wooden sail
dull mortar
wooden sail
rare bane
rare bane
# rare bane

So you can view it if you like...I accept constructive criticism

final jolt
#

Now if I could get pymupdf to be more consistent that would be cool. parsing pdfs suck for sure.

spring field
final jolt
#

ah no, I also never really use pprint and only did here because I was following some docs. I dont functionally need pprint for anything at the end of the day

long locust
#

!timeout 1079012483290890321 spam

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied timeout to @slow sleet until <t:1747234634:f> (1 hour).

obtuse acorn
#

any idea when to use standard scaler vs minmax scaler in scikit learn?

final jolt
#

So short version, I am trying to use matplotlib to display gridlines on a page rendered from a pdf. I got all that working, however I am trying to adjust the grid line spacing with no success. I thought the correct parameter was markevery but that seems to just be for an actual graph

```py
DPI = 150
pix = chosen_page.get_pixmap(dpi=DPI)
img = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)
plt.figure(dpi=DPI)  # set the figure's DPI
plt.title("title")  # set title of image
plt.grid(True)
plt.grid(markevery=10)
plt.grid(color='gray', linestyle='--', linewidth=0.5)
_ = plt.imshow(img, extent=(0, pix.w * 72 / DPI, pix.h * 72 / DPI, 0))
plt.show()
```
code snippet here. the gridlines seem to default to every 100
*edit* I never got this to work but bruteforced what I needed, but I am curious how to make this work if anyone wants to weigh in
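
For reference, gridlines in matplotlib are drawn at the tick positions, so one way to change their spacing is to change the tick locator. A minimal sketch using `MultipleLocator`, with a blank array standing in for the rendered PDF page; the 200x300 size and the spacing of 50 are arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator

# Stand-in for the page image (hypothetical size).
img = np.zeros((300, 200, 3), dtype=np.uint8)

fig, ax = plt.subplots(dpi=150)
ax.imshow(img, extent=(0, 200, 300, 0))

# One major tick, and therefore one gridline, every 50 units.
ax.xaxis.set_major_locator(MultipleLocator(50))
ax.yaxis.set_major_locator(MultipleLocator(50))
ax.grid(True, color="gray", linestyle="--", linewidth=0.5)

fig.canvas.draw()
xticks = ax.get_xticks()
yticks = ax.get_yticks()
```

If the tick labels get too crowded, a finer grid can go on the minor ticks instead via `set_minor_locator` and `ax.grid(which="minor")`.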
charred ferry
#

Hi guys, basically if i wanted to do a final year project that combined data analytics and machine learning, do u guys have any good resources i can use to study to get a basic understanding of both? Idk which channel to ask this. I've been looking for video tutorials and resources myself but additional resources from other people would be useful.

final jolt
#

!res

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

final jolt
#

have you checked in that page? I think there was a section for ML and such but I could be wrong

charred ferry
#

oh thanks

final jolt
#

hmm looks like more generalized stuff there but could be a good starting point. Certainly some YT channels have other video series for more specific topics as well

charred ferry
#

thanks for the help tho

stiff crown
#

interesting

final jolt
#

I swear this join_tolerance doesnt do jack in PyMuPDF, heh

final jolt
#

ok pages.search_for() is goat

verbal oar
#

if there is crewai ok but what else to understand how it works?

#

crewai is very high-level is there sth more verbose?

agile cobalt
#

what exactly are you trying to understand

#

if you want to understand how the models themselves work, look up Andrej Karpathy's tutorials

if you want to understand how to build a simple agent using an LLM, use the HuggingFace Transformers library directly

if you want to orchestrate multiple agents more manually, either still just Transformers or PydanticAI / LangChain

verbal oar
#

in crewai you just prompting what agent must do and task description

#

but how this works under the hood, I assume some tokenization and nlp related things

#

maybe more manually ok so langchain then

agile cobalt
#

langchain still abstracts away all "nlp related things"

if you want to see how tokenization works then start from Andrej Karpathy's videos

verbal oar
#

I know tokenization

#

ok so it's just llm agents, is this how I should think about it?

agile cobalt
#

most "Agents" just replace all NLP techniques with one giant black box (the LLM): instructions go in, then text, a tool call or a well-formatted object comes out

verbal oar
#

I see thanks, now more clear

#

so this is like learn llm
then learn llm agents
order

agile cobalt
#

depending on where you look you may find normal code mixed in the orchestration though, some of which may or may not be using classical NLP techniques

For example, before sending to an agent you can use a simple metric to determine which agent to send to (or not send to one at all)
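
A minimal sketch of that kind of pre-routing, using plain keyword overlap as the "simple metric". The agent names and keyword sets are made up, and a real setup might use embeddings or a classifier instead.

```python
# Hypothetical agents and the keywords that suggest each one.
AGENT_KEYWORDS = {
    "billing_agent": {"invoice", "charge", "refund", "payment"},
    "support_agent": {"error", "crash", "bug", "login"},
}


def route(message, min_overlap=1):
    """Pick the agent whose keywords overlap most with the message, or None."""
    words = set(message.lower().split())
    best_agent, best_score = None, 0
    for agent, keywords in AGENT_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best_agent, best_score = agent, score
    # Below the threshold, skip the LLM call entirely.
    return best_agent if best_score >= min_overlap else None


print(route("my invoice shows a wrong charge"))
print(route("the app shows an error after login"))
print(route("hello there"))
```

Only messages that clear the threshold would be forwarded to an LLM-backed agent; the rest can be answered with a canned reply or dropped.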

agile cobalt
# verbal oar so this is like learn llm then learn llm agents order

yeah, if you want to mess around with agents or crews I'd strongly recommend understanding what the inputs and outputs of the llms look like

ideally also how the llm (neural network) itself works, but honestly that isn't vital unless you plan to customize the underlying model

final jolt
#

Well finally got this pdf parsing nearly complete and I can get back to the pandas side of thing. Still odd that it splits one of the columns randomly but what are ya gonna do I guess. other than join them I mean

rich moth
#

I gotta fix the text above, I got an overlapping problem, but man, what it captured is awesome. The different unique signatures ("fingerprints") of a few crypto coins.

final jolt
rich moth
final jolt
#

stuff like number of trades, price changes, etc?

rich moth
final jolt
rich moth
#

thank you 😄

rich moth
#

i was thinking why not make it a continuous feedback loop also.

#

Its already learning..just in a new type of way

heavy knot
#

hey, I use homebrew as my installer and I'm trying to install flake8 in jupyterbooks. Anyone have any experience in doing so? cause I can't get brew to recognize jupyterlab-flake8

vague trout
#

I've done foundational courses (andrew ng) and more deeplearning.ai courses but i don't understand how to start a project or what do I do, I have 0 practical knowledge. where do I start practicing ml?

rich moth
#

OK so I made it automatically tune strategy parameters every 4 hours. It analyzes win rates and profit factors for each strategy: underperforming strategies get parameter adjustments (tighter stops, adjusted take profits), outperforming strategies get optimized for even better results, and cooling periods prevent over-adjustment.

Every 24 hours it builds and updates the "fingerprints".
It clusters UCF states and analyzes performance by cluster, and creates an asset-specific "memory" of which complexity states are profitable, which in turn influences future trading via confidence adjustments and position sizing.

I added realtime feedback stuff to boost or reduce confidence based on historical performance in similar states; it adjusts confidence when phase alignment is strong and modifies position sizing based on historically profitable clusters. Most importantly it saves and loads all these learned adjustments in a pickle, which includes strategy parameters and the state checkpoints.

iron basalt
rich moth
rich moth
vague trout
rich moth
#

I want to build something to test this theory

rich moth
#

I would need to first Transform RNA sequence data into a format suitable for the UCF

#

I'm thinking of just adding the logic to the data preprocessing for an "rna_sequence" domain, potentially with a tailored θ calculation, while reusing the N, A, ϵ logic where possible. it actually sounds fun. anyone got any ideas how to visualize this?

rich moth
#

Ok I built the pipeline for the RNA 3D structure prediction that applies the UCF to biological sequences. I'm using that kaggle dataset from the comp. It's basically applying mathematical complexity theory to biological structure prediction. Might be a bit for visuals but I'm excited 😄

#

heres a couple that came in

#

visuals need work though 😛

serene grail
rich moth
#

Predicted RNA folding pattern visualization

naive matrix
#

I’m trying to learn numpy working with opencv but there’s no good vid in YouTube that teaches about it, please help or give me advice if y’all can

crystal pier
naive matrix
#

And server

crystal pier
#

Shouldn't be a numpy problem. Do you by any chance mean collecting frames from a stream, say a webcam or rtsp network?

naive matrix
#

Yes webcam

crystal pier
# naive matrix Yes webcam

Oh you're going to have to expose the RTSP link for your webcam to the opencv cv::VideoCapture API, shouldn't be a herculean task provided I've given some leads already.

It wouldn't be fun if you were just told what to do as well, so go ahead and break things😁

#

Also if you're doing any heavy inference of some sorts on the frames you'd also need to either:

  1. Learn threading, python has the threading module, mutexes, GIL, so on
  2. Or not learn anything and use the Inference library which is a pain to set up dependencies if you don't use a separate venv, you'll probably need some docker experience as well for this one

so I'd just recommend the former, cuz you'll learn things as well from the process
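
A minimal sketch of the threading route using only the standard library: one thread reads frames into a bounded queue while the main thread does the slow work. `fake_frames` is a stand-in for a real capture loop (for a local webcam that would typically be `cv2.VideoCapture(0)` and repeated `read()` calls), and the upper-casing is a placeholder for inference.

```python
import queue
import threading


def fake_frames(count):
    """Stand-in frame source; a real one would yield images from the camera."""
    for index in range(count):
        yield f"frame-{index}"


def reader(source, frame_queue):
    """Producer: push frames into the queue, then a None sentinel when done."""
    for frame in source:
        frame_queue.put(frame)
    frame_queue.put(None)


def run_pipeline(count):
    """Consumer: process frames while the reader thread keeps fetching."""
    frame_queue = queue.Queue(maxsize=4)  # bounded so the reader cannot run far ahead
    worker = threading.Thread(
        target=reader, args=(fake_frames(count), frame_queue), daemon=True
    )
    worker.start()

    processed = []
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        processed.append(frame.upper())  # placeholder for heavy inference
    worker.join()
    return processed


print(run_pipeline(3))
```

The queue handles the locking, so no explicit mutex is needed for this producer/consumer shape.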

naive matrix
#

Thank you, ima chatgpt this to understand cause I don’t fully understand what an rtsp link is and some of the things u say

#

I appreciate it

torn hill
#

Topic - GloVe Paper

Hi, so i recently started to read the GloVe paper and there is this line in it which is confusing me: "Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences."

I don't get how the authors reach the conclusion that vector differences are a natural way. is there some logic behind it? or is it pure heuristics?

Please tell me if its way less of a context I'll try to explain more

crystal pier
#

I'll make out time to read the paper, but a few lines could help

#

But in this context I'd say that it means two things

#
  1. Vector addition: vectors in a vector space can be added together to form another vector in that same vector space

I.e. vector spaces are closed under addition; this holds regardless of the underlying field

#
  2. Scalar multiplication: vectors in a vector space can be scaled to get other vectors within that same vector space,

These two properties bring about other linear properties while being linear themselves

E.g distributive properties, additive and multiplicative inverses etc

#

But operations like multiplying vectors by other vectors, i.e. vector squaring, would not be linear

#

tldr basically; all structures and operations in a vector space just respect linearity,

#

Sorry for the wall of text got slightly too into it 😅

crystal pier
# torn hill Topic - GLoVE Paper Hi , so i recently started to read the GLoVE paper and ther...

Now when I read the second part it seems more like Euclidean geometry, but I don't know what "the most natural thing" that the paper is doing is

But the vector differences just means that in a vector space all positions are relative, there is no absolute position. So if I move my origin to some other point (0_1, 0_2, ..., 0_n), all vectors are moved in the same way, and a difference like x_1 - x_2 stays the same, so all vector differences stay the same

Make any sense?

iron basalt
#

(New vector tip is on B, and tail on A)

torn hill
#

ok i understand the premise of vectors, the thing is the authors are suggesting that it makes sense for them to take the difference of two vectors but not the sum, and i'm not sure why. maybe if you took a quick read of page 3 of the paper it might make more sense? @crystal pier @iron basalt

iron basalt
#

Example: you have a video game explosive barrel object with 3D position vector, and the player with another 3D position vector. Now to do the game logic, you want the player's position relative to the barrel, so you do player.pos - barrel.pos. Then you can check its magnitude for distance checks like if the player is in explosion hurt radius.
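
A minimal sketch of that example in plain Python, with made-up positions and radii. It also shows the point made earlier in the thread: shifting both positions by the same offset leaves the difference unchanged.

```python
import math


def relative_position(a, b):
    """Vector from b to a, i.e. a - b, component by component."""
    return tuple(x - y for x, y in zip(a, b))


def in_blast_radius(player_pos, barrel_pos, radius):
    """True if the player is within `radius` of the barrel."""
    offset = relative_position(player_pos, barrel_pos)
    return math.hypot(*offset) <= radius


player = (3, 4, 0)
barrel = (0, 0, 0)

print(relative_position(player, barrel))
print(in_blast_radius(player, barrel, radius=6.0))
print(in_blast_radius(player, barrel, radius=4.0))

# Move the whole scene: the absolute positions change, the difference does not.
shift = (100, -50, 7)
moved_player = tuple(p + s for p, s in zip(player, shift))
moved_barrel = tuple(b + s for b, s in zip(barrel, shift))
print(relative_position(moved_player, moved_barrel))
```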

#

They wrote in the paper they want to encode the relative information of the probabilities.

#

First, we would like F to encode the information present the ratio P_ik / P_jk in the word vector space.

#

If you use log probabilities, you get a difference instead of division...

#

(It's a morphism)
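
A tiny numeric check of that identity, log(a / b) = log(a) - log(b), with made-up co-occurrence probabilities standing in for P_ik and P_jk.

```python
import math

# Hypothetical probabilities of word k appearing near words i and j.
p_ik = 0.006
p_jk = 0.002

ratio = p_ik / p_jk
log_difference = math.log(p_ik) - math.log(p_jk)

# Taking logs turns the ratio into a difference.
print(math.log(ratio))
print(log_difference)
```

So comparing two words by the ratio of their probabilities is the same as comparing them by a difference in log space, which is the kind of quantity a difference of word vectors can represent.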

torn hill
iron basalt
#

Any time probabilities are involved, consider log probabilities, they make things way more clear.

torn hill
#
```
Ratio between probabilities is getting the relative info, and you want to also encode relative info in a vector space, which leaves the natural choice of difference
```

Yes, this: that it's a natural choice to use the difference. maybe i don't have the intuition yet to understand this abstractly

Like i understand we want to encode the info of a scalar value in a vector space, but how does that leave us with a "natural" choice of difference? is my math too weak to understand this?
iron basalt
iron basalt
#

It's physical, much less abstract.

torn hill
iron basalt
#

The difference vector on an abstract level encodes the relative information between the entities.

torn hill
#

ok i think i understand this, but while we are on the topic can you help me with one more aspect?
So the paper further said: "While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture"

#

I tried to make sense of why a neural network wasn't the first choice here, and this is what I ended up with: The GloVe paper emphasizes that while a neural network could have been used to learn word embeddings and might produce good results, such models often act as black boxes. This means they provide embeddings without a clear understanding of why certain relationships emerge. In contrast, GloVe is built on explicit statistical information derived from word co-occurrence counts, making its embeddings more interpretable. This aligns with the goal of enabling meaningful vector arithmetic (e.g., king - man + woman ≈ queen) and revealing transparent relationships between word vectors based on how frequently words appear together in a corpus.

#

is this making sense?

#

the problem was that the LHS and RHS didn't match: the LHS took vectors and the RHS was a scalar
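
As I read the paper, the way out of that mismatch is to take the dot product of the difference vector with the context vector, which gives a scalar, and then pick an F that turns differences into ratios. F = exp does that. A small sketch with made-up 3-D vectors (`w_i`, `w_j`, `w_k` are illustrative values, not trained embeddings):

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors; the result is a scalar."""
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

w_i = [0.5, -1.0, 2.0]    # made-up word vector for word i
w_j = [1.5, 0.25, -0.5]   # made-up word vector for word j
w_k = [0.2, 0.4, 0.1]     # made-up context vector for word k

# F((w_i - w_j) . w_k) with F = exp ...
lhs = math.exp(dot(sub(w_i, w_j), w_k))
# ... equals F(w_i . w_k) / F(w_j . w_k): a difference in, a ratio out.
rhs = math.exp(dot(w_i, w_k)) / math.exp(dot(w_j, w_k))

assert math.isclose(lhs, rhs)
```

Both sides are now scalars, so they can be compared with the scalar ratio P_ik / P_jk.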

iron basalt
#

Simply put, it would mess up your ability to do the simple vector operations that you want to be able to do with words.

iron basalt
#

Because networks scramble things.
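
To make "simple vector operations" concrete, here is a toy version of the king - man + woman example. The 2-D vectors are hand-made so the arithmetic works out exactly. They are not GloVe embeddings, and `analogy` is a hypothetical helper:

```python
import math

# Hand-made 2-D "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
# Purely illustrative values, not trained vectors.
vectors = {
    "king":  (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.0),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
```

This only works because the relationship is stored as a straight-line offset between vectors. If a nonlinear function sat between the vectors and the co-occurrence statistics, there would be no reason for plain addition and subtraction to line up like this.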

crystal pier
#

oh hey I'm back

torn hill
#

ok so i guess i am on the right path

torn hill
crystal pier
#

I'm guessing sure squiggle has this on lock

torn hill
#

@iron basalt Thanks man

iron basalt
#

As it's often not the best choice.

torn hill
iron basalt
#

It's also just part of the academic process.

torn hill
#

hmm soooooo where should i look at, haha am kinda new to this

iron basalt
#

You are already in Yannic's discord, they cover papers there every week in a call, and people there cover papers all the time in chat.

torn hill
plush kettle
#

Guys I need help

#

So I am trying object detection with keras resnet50, here is how I prepared my data:
```
import tensorflow as tf

def parse_tfrecord(example_proto):
    feature_description = {
        'image/encoded': tf.io.FixedLenFeature([], tf.string),
        'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/xmax': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymax': tf.io.VarLenFeature(tf.float32),
        'image/object/class/label': tf.io.VarLenFeature(tf.int64),
    }

    parsed_features = tf.io.parse_single_example(example_proto, feature_description)

    # Decode and preprocess the image
    image = tf.image.decode_jpeg(parsed_features['image/encoded'], channels=3)
    #image = tf.image.resize(image, [HEIGHT, WIDTH])
    image = tf.cast(image, tf.float32) / 255.0
    labels = tf.sparse.to_dense(parsed_features['image/object/class/label'])

    return image, labels

def get_object_detection_dataset(tfrecords_dir, batch_size):
    files = tf.io.gfile.glob(tfrecords_dir)
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(parse_tfrecord)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
```

#

and I built my model like this:
```
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

def build_resnet50_fpn_backbone(input_shape=(640, 640, 3), weights='imagenet', include_top=False):
    """
    Builds a ResNet50 backbone with a Feature Pyramid Network (FPN) for object detection.

    Args:
        input_shape (tuple): The shape of the input images (height, width, channels).
        weights (str): The weights to load for the ResNet50 model.
            'imagenet' for pre-trained weights on ImageNet, or None for random initialization.
        include_top (bool): Whether to include the top (fully connected) layers of ResNet50.
            For feature extraction, this should be False.

    Returns:
        tf.keras.Model: A Keras model representing the ResNet50 FPN backbone.  The model
                        has multiple outputs, which are the FPN feature maps (P3, P4, P5, P6, P7).
    """
    # Ensure valid input shape
    if input_shape is None or len(input_shape) != 3 or input_shape[2] != 3:
        raise ValueError("Input shape must be a tuple of (height, width, 3).")

    # Ensure channels_last data format
    tf.keras.backend.set_image_data_format("channels_last")

    # Load ResNet50, excluding the top (fully connected) layers
    resnet50 = ResNet50(
        include_top=include_top,
        weights=weights,
        input_shape=input_shape
    )

    # Get the outputs of the intermediate layers we need for FPN.  These are
    # the outputs of the last block in ResNet stages 3, 4 and 5.
    c3_output = resnet50.get_layer('conv3_block4_out').output  # Shape: (None, 80, 80, 512) for 640x640 input
    c4_output = resnet50.get_layer('conv4_block6_out').output  # Shape: (None, 40, 40, 1024) for 640x640 input
    c5_output = resnet50.get_layer('conv5_block3_out').output  # Shape: (None, 20, 20, 2048) for 640x640 input

    # FPN layers.  These layers take the output of the ResNet stages and combine them
    # to create feature maps at multiple scales.  This helps with detecting objects
    # of different sizes.
    # P5 is initialized directly from C5
    p5 = layers.Conv2D(256, (1, 1), name='P5')(c5_output) # (None, 20, 20, 256)
    # Upsample P5 and add it to C4
    p4 = layers.Add(name='P4_add')([
        layers.Conv2D(256, (1, 1), name='P4_conv1')(c4_output), # (None, 40, 40, 256)
        layers.UpSampling2D(size=(2, 2), name='P4_upsample')(p5), # (None, 40, 40, 256)
    ])
    p4 = layers.Conv2D(256, (3, 3), padding='same', name='P4_conv2')(p4) # (None, 40, 40, 256)

    # Upsample P4 and add it to C3
    p3 = layers.Add(name='P3_add')([
        layers.Conv2D(256, (1, 1), name='P3_conv1')(c3_output), # (None, 80, 80, 256)
        layers.UpSampling2D(size=(2, 2), name='P3_upsample')(p4), # (None, 80, 80, 256)
    ])
    p3 = layers.Conv2D(256, (3, 3), padding='same', name='P3_conv2')(p3) # (None, 80, 80, 256)

    # P6 and P7 are created by downsampling P5
    p6 = layers.Conv2D(256, (3, 3), strides=2, padding='same', name='P6')(p5) # (None, 10, 10, 256)
    p7 = layers.Conv2D(256, (3, 3), strides=2, padding='same', name='P7')(p6) # (None, 5, 5, 256)

    # Define the model with multiple outputs
    model = Model(inputs=resnet50.input, outputs=[p3, p4, p5, p6, p7])

    #model = Model(inputs=resnet50.input, outputs=feature_map)
    return model
```
#
```
model = build_resnet50_fpn_backbone(input_shape=input_shape)
losses = {'classification_output': 'sparse_categorical_crossentropy',
          'bbox_output': 'mse'}
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss=losses)  # Two losses: bbox and class
```
#

When I ran `model.fit(training_data, epochs=NUM_EPOCHS, validation_data=validation_data)`, this came up: y_true and y_pred have different structures. y_true: * y_pred: ['*', '*', '*', '*', '*']

#

Does anyone have any idea how this could happen?

verbal oar
#

debug it

main nymph
#

done

#

i debugged it

verbal oar
#

outputs=[p3, p4, p5, p6, p7]
['*', '*', '*', '*', '*']
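
Spelling that hint out: the dataset yields one labels tensor per batch, while the model returns five feature maps, so targets and predictions cannot be paired up. The sketch below only mimics how the error message abbreviates each side. It is not Keras code, strings stand in for tensors, and `structure_of` is a made-up helper:

```python
def structure_of(value):
    """Replace every leaf with '*' but keep the list/tuple/dict nesting."""
    if isinstance(value, dict):
        return {key: structure_of(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [structure_of(item) for item in value]
    return "*"

# What the posted code produces (placeholders instead of real tensors).
y_true = "labels"                        # parse_tfrecord returns (image, labels)
y_pred = ["p3", "p4", "p5", "p6", "p7"]  # the model outputs five FPN maps

print(structure_of(y_true))   # *
print(structure_of(y_pred))   # ['*', '*', '*', '*', '*']

# What the losses dict seems to expect: two named outputs and two named targets.
matching_targets = {"classification_output": "labels", "bbox_output": "boxes"}
matching_preds = {"classification_output": "class_scores", "bbox_output": "box_preds"}
assert structure_of(matching_targets) == structure_of(matching_preds)
```

Assuming the loss dict is meant to be keyed by output names, the model would need classification and box heads named `classification_output` and `bbox_output` on top of the FPN maps, and `parse_tfrecord` would need to return the boxes as well as the labels. Neither exists in the posted code.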

main nymph
#

by print("debug code")

#

(this is all i know in python)

verbal oar
#

but "easier" to put some breakpoints and watch variables

#

with debugger

#

oh as I thought you can also add verbose param to fit

#

verbose: "auto", 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. "auto" becomes 1 for most cases. Note that the progress bar is not particularly useful when logged to a file, so verbose=2 is recommended when not running interactively (e.g., in a production environment). Defaults to "auto".

charred estuary
limpid dew
#

what are you trying to do with that

toxic pilot
charred estuary
limpid dew
#

like pandas?

charred estuary
#

read the README

limpid dew
#

the readme isnt very descriptive, it's some llm training tool?

charred estuary
#

I think it includes what it needs to. You build your own dataset to train or fine-tune a model. It's automated and generates data from multiple AI models to avoid inbreeding data. || @limpid dew ||

limpid dew
#

have you built an LLM with it?

charred estuary
# limpid dew have you built an LLM with it?

Running it doesn't build you an LM; you use the dataset it generates, paired with your train.py file, to train your model. You can modify the script to ask the cluster to only generate data that will help train an AI on Python debugging or math or whatever you want.

limpid dew
#

I think you misunderstood my question. That's okay. You say you can use it to train an AI on math? What kind of math? How do you do this?