#data-science-and-ml
for example for drug discovery
if I said it correctly
I mean I saw one offer related to graph nn
I'm struggling to understand the likelihood function. I look on one website, it says one definition, I look at another, there's a completely different definition. how do I define what it is
there are many different likelihood functions. probably in machine learning you are using maximum likelihood estimation
are you struggling with the general concept?
I've somewhat figured it out. Still seems confusing but I read up that likelihood is a topic that still confuses even the best mathematicians so I'm not too bothered if I don't understand it fully
the idea is kinda simple though
thats what I thought and then I looked up the definitions and it started confusing me
the easiest way to see it, imo, is that you have a pdf for some set of observations, and the pdf depends on some parameter
for example you can say you have data X that is gaussian distributed. the gaussian distribution has 2 parameters, the mean and the variance (or standard dev, as you like)
we can then write the pdf as a function p(x, mean, std dev)
if you keep the mean and std dev constant, then p(x, mean, std dev) is a pdf describing the probability density of observing any x that follows this pdf
if on the other hand you observe one specific x and keep it constant, but allow the mean and std dev to vary
then p(x, mean, std dev) is no longer a pdf. it doesn't integrate to 1 if you integrate over the mean and std dev
this second case is what one calls the "likelihood function", and you can do this for any pdf
the difference is what is kept constant and what is allowed to vary
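A minimal sketch of the distinction above, using only the standard library. The same Gaussian formula is integrated twice: once over x with the parameters fixed (a pdf, area 1), and once over the std dev with x and the mean fixed (a likelihood, whose area depends on the range and is not 1). The specific numbers (x = 1, mean = 0, the integration ranges) are arbitrary choices for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Gaussian density p(x, mean, std)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule numerical integration of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# Case 1: parameters fixed, x varies -> a pdf, so the area is 1.
area_over_x = integrate(lambda x: gaussian_pdf(x, 0.0, 1.0), -10.0, 10.0)

# Case 2: x and mean fixed, std dev varies -> a likelihood function.
# Same formula, but the area is not 1 and keeps growing with the range.
area_over_std_5 = integrate(lambda s: gaussian_pdf(1.0, 0.0, s), 0.0, 5.0)
area_over_std_50 = integrate(lambda s: gaussian_pdf(1.0, 0.0, s), 0.0, 50.0)

print(f"integral over x:          {area_over_x:.4f}")
print(f"integral over std (0-5):  {area_over_std_5:.4f}")
print(f"integral over std (0-50): {area_over_std_50:.4f}")
```

Maximum likelihood estimation then just asks which parameter values make the second kind of function largest for the data you actually observed.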
I think I get it now
i just relate it to probability or odds, likelihood is related to bayes generally
when you have a priori, a posteriori and likelihood
but its likelihood not likelihood function
likelihood is B given A, B|A
I think this is origin of it
also the lower part, the marginal, is the origin of "marginal distribution" or "marginal probability", not sure
yes likelihood is probability because of P
what do you think
looks like bayes is base
like its base for vae for example when you have priors
that's something separate
the posterior you compute here via bayes' rule can itself be either a pdf or a likelihood
ok
the distinction is again made by what is kept constant and what is allowed to vary
lol so I was confused too, because the word likelihood has two meanings
so depends on context
Hey guys, so I have spent my time learning ML, DL, transformers and right now I am learning langchain, but I don't have much knowledge of DSA. So I'm confused at this point as to what to do
Which DSA do you mean? Data Structures and Algorithms?
where I can find hypothesis testing inside deep learning?
with machine learning you see it for example in R statistical summary
Oh, see #algos-and-data-structs . MIT 06.001 is a good start.
The thing about deep learning is that it works like a black box , so I doubt whether you can actually figure it out
ah ok right
I see , but DSA is it actually required?
Required for what? For a CS degree? Yah, I don't know of any that don't require it.
so its doing it automatically
Required for a job?
That's a better question for another channel, like #career-advice . But, DSA basics (the content of an undergrad DSA class) are pretty fundamental and kinda assumed knowledge for software engineers.
The real question is "how much is needed", which is difficult to answer because it's difficult to measure.
yes as I recall inside linear regression is hypothesis testing
Learn it all lol
Probably wouldn’t be able to 😅 there’s too much to learn
Hey chat
DSA is a must if you want a job at a good company. You'll have to solve DSA problems in your technical interview regardless of the position you're applying for
hey robert 🙂
Has anyone here tried out Google's new Agent Development Kit (ADK) yet? https://github.com/google/adk-python
Curious about giving it a shot but wondered if anyone recommends it or prefers other libraries for building agents?
is machine learning difficult to work with?
for example, someone training an AI to answer questions.
Yes
That's relatively easy if you start with a foundation model. Otherwise it's very difficult.
(and if you start with a foundation model, all the actual AI is abstracted away.)
Whoa! Check out these polar plots! The time series one (Arrowhead) is crazy because all the points line up almost perfectly on the 0 degree / 180 degree line. It makes sense though, if you think about time series just having a 1D nature. But what's interesting about the image one is that the complexity can be represented in a 2D nature. The points are distributed across multiple angles in the complex plane, forming patterns that extend in various directions rather than being confined to a single axis. What's cool though is that all these patterns are emerging naturally from the math formula I made.
It's like this hidden "dna" of data
There's a unique "shape" to complexity across different data types, almost a kind of hidden signature or "calling card" that reveals the fundamental nature of information itself. It doesn't just measure how difficult a sample is to learn (magnitude), but also characterizes what kind of difficulty it represents (phase).
Images below are from IRIS dataset.
Welp Plunder. I think it is time for you to publish a paper on this atp. Is this going up on GitHub
Test your research against the big dawgs. Llama, DeepSeek, ChatGPT etc.
Claude. What is the deal with Langchain. Why are folks excited about Langchain
Honestly, I don't even know where to begin. Feels so overwhelming. I have started writing some, I'm slow at it and it takes me forever, dreading it already lol. As far as GitHub, it's a great idea, I'm also slow at that, too. I'm not the greatest with it. Do you know of any exceptional resources to better my GitHub skills? I once (years ago) nuked my entire hard drive because I was, well, an idiot and didn't know what I was doing. But I appreciate your input, broski.
I just started making repositories and building.
How's that been working out?
Ehh. I contribute to open source. It is not so much for personal gain than it is to contribute to society
That's a great attitude
This one is on the Breast Cancer Dataset. Whats interesting is Phase ascending strategy took the cake on this one
oops, lol it's obviously hard-to-easy. I'm looking at two different things
It was the WINE tabular dataset I was looking at, where Phase Ascending beat random by +5.41%
It's almost like the opposite rings true for this complexity tool. Feeding it lots of information, rather than little, improves overall results. Which makes sense: more complexity, better results. But it's not always the case; sometimes another method seems to be working, but it's always beating random every time. But complexity isn't defined by domains, it's something else, that's what I think I found though. Well, some of it, even though it works and works well. I feel like there's something else missing from the pie. It's like dark matter: I can't see it, but trial and error in our measurements shows "something is there"
Im really curious how fractals might play a role here. like thinking about patterns like the Mandelbrot set on the complex plane .. especially the boundaries between points that stay bounded and those that escape to infinity...
damn! i never thought about that... they both fundamentally operate in the complex plane, there might be a connection here
I was aiming for a start up
Then you have to hire employees with DSA and take-home assignments. I'd do that if I were you kek
I meant i was aiming to work as an employee for startup
Pretty sure they'll ask some DSA-related questions too
It allows the recruiters to check if you have problem solving abilities or not
Hey
Anyone else getting 503 with the Gemini API?
howdy
being a little off-topic: github is not a backup site
it's only version control, you still need a backup somewhere
I use GitHub for backup. Unless GitHub says no lol
I agree. Especially for Python since files are small.
Hey everyone pls help me with this
import numpy as np


class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size)
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        return x * (1 - x)

    def feedforward(self, X):
        self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)
        self.output_activation = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)
        return self.predicted_output

    def backward(self, X, y, learning_rate):
        output_error = y - self.predicted_output
        output_delta = output_error * self.sigmoid_derivative(self.predicted_output)
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)
        self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backward(X, y, learning_rate)
            if epoch % 4000 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss: {loss}")


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)
output = nn.feedforward(X)
print("Predictions after training:")
print(output)
So, my buddy has this awful 820-page PDF slide deck (ugh).. and he's trying to figure out which pages of it refer to a particular concept. The problem is that any individual slide may or may not mention any interesting keywords about that, so it's really a full-LLM kind of context problem, it seems to me.
I took a swing at using LangChain to feed it into OpenAI for him, and it didn't go that well; the problem is that I'm having to chunk it into 50-page slices, and each of those isn't maybe enough context about the project to make good input.
What general approach is best-suited for this kind of problem? Do I need to "fine-tune" a model on this slide deck?
As an example, here's a slide that needs to be a 'hit', because despite not mentioning anything by name, it is ABOUT the project he's looking for hits on:
and if I don't luck into having that context in the same 'chunk' as this slide, it's not gonna realize this is a hit.
are AI agents related to reinforcement learning because of the concept of an agent in RL? can I think of it this way, or does it have a different meaning?
not related at all
an Agent (or more broadly, an Agentic system) using LLMs is when you take a language model, give it some tools (e.g. python functions it can call), then ask it to do a given task with some degree of autonomy
It does not necessarily involve reinforcement learning
RL is sometimes used for training the language models, especially to let them 'reason' on their own before writing a final answer or invoking a tool
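A toy sketch of the loop described above: something decides which tool to call, the tool runs, and the result is fed back until the task is finished. There is no real LLM here. `scripted_model` is a hypothetical stand-in that follows a fixed script, and the tool names and task are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple


def add(a: float, b: float) -> float:
    return a + b


def multiply(a: float, b: float) -> float:
    return a * b


# The "tools" the agent is allowed to call.
TOOLS: Dict[str, Callable[[float, float], float]] = {"add": add, "multiply": multiply}


def scripted_model(task: str, observations: List[float]) -> Tuple[str, Tuple[float, ...]]:
    """Hypothetical stand-in for a language model choosing the next action."""
    if len(observations) == 0:
        return "add", (2.0, 3.0)
    if len(observations) == 1:
        return "multiply", (observations[-1], 4.0)
    return "finish", (observations[-1],)


def run_agent(task: str, max_steps: int = 5) -> float:
    """Ask for an action, run the chosen tool, feed the result back, repeat."""
    observations: List[float] = []
    for _ in range(max_steps):
        action, args = scripted_model(task, observations)
        if action == "finish":
            return args[0]
        observations.append(TOOLS[action](*args))
    raise RuntimeError("agent did not finish within max_steps")


result = run_agent("compute (2 + 3) * 4")
print(result)
```

Nothing in this loop is trained with rewards, which is why the word "agent" here doesn't imply reinforcement learning.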
I'd just try yeeting it into Gemini
if it's a one-off thing you don't plan to reuse, skimming through all pages manually may be faster than doing something ultra fancy with AI
Gemini won't try, sadly; too many slides.
Yeah that's my best advice for him so far; hoping someone has a cooler plan.
split them maybe?
The problem is that the first bunch of slides introduce the topic, and if you just feed it a slice of the later part of them it isn't doing a great job.
ah like needs context
I tried chunking them with LangChain etc before posting, didn't really find what my buddy is looking for.
maybe run it through https://github.com/microsoft/markitdown or Mistral's OCR model first, should be small enough for most commercial LLMs afterwards
My suggestion to him was to try to use some 'tool memory' on the problem, but I don't have a specific MCP in mind for him to lean on.
Aah yeah I'll try that too.
it would take ages to get a memory-ish solution working well if your context is that sparse (as in, must catch loose references to other things referring to the same concept within the document)
What's a good way to prep for data science & ai uni course?
it's relatively easy to identify that a slide is making a direct reference to X thing, but finding all other slides that indirectly reference that thing is hard
as far as I know usually you'd just take fragments of things around the direct reference
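A rough sketch of that "fragments around the direct reference" idea, assuming the slide text has already been extracted into one string per page. The function name, the keyword, the window size, and the sample pages are all made up for illustration. It only finds neighbours of direct mentions, so indirect references far from any mention would still be missed.

```python
from typing import List


def pages_near_mentions(pages: List[str], keywords: List[str], window: int = 2) -> List[int]:
    """Return 0-based indices of pages within `window` pages of a keyword hit."""
    lowered = [k.lower() for k in keywords]
    hits = [
        i for i, text in enumerate(pages)
        if any(k in text.lower() for k in lowered)
    ]
    selected = set()
    for i in hits:
        start = max(0, i - window)
        stop = min(len(pages), i + window + 1)
        selected.update(range(start, stop))
    return sorted(selected)


# Hypothetical extracted slide text, one string per page.
pages = [
    "intro", "agenda", "budget overview", "Project Falcon kickoff",
    "timeline for the launch", "team photo", "unrelated topic",
    "risk register", "Falcon status update", "appendix",
]

candidates = pages_near_mentions(pages, ["falcon"], window=1)
print(candidates)
```

The selected pages (plus the introductory slides) could then be sent to an LLM together, which keeps the context much smaller than the full deck.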
basic python syntax, basic statistics, maybe scikit-learn's MOOC, perhaps play around with numpy/pandas a bit (e.g. look at Kaggle competitions)
see if any of the courses list their pre-requisites / things you are expected to know
Any recommendations for tools to study these (except Python, got that already covered)?
not sure, maybe something like Datacamp but I never used it myself
make sure you take a look at the documentation of any libraries you use though, especially their User Guides if they have one
I like the channel statquest on YouTube for basic statistics, if you search "statquest playlist" on YouTube you'll get a playlist called "statistics fundamentals", it covers a lot of concepts and it's pretty good IMO although I'm a beginner myself
Kaggle has some small (free?) courses that teach you how to use notebooks, pandas, etc.
I've seen people recommend Kaggle for that
hmm. will check those out after my a levels
I started investigating why my markitdown output was so bad-looking, and it turns out the source presentation is just trash haha whoops. I'm just going to tell my friend to come back with better data.
so just syntax?
not more?
like generators,iterators
for example big dataset and getting stream of data
datacamp looks ok imo
I thought python is prerequisite so looks like I'm overthinking it
python in the sense not just syntax but more, but seems its like you said then
statistics could be also khan academy but not sure
usually you'll be working with dataframes or other abstract representations of datasets that manage the looping for you, it's relatively rare for you to have to write a generator yourself when working with data
when you're using python with libraries like pandas or polars, the way you write code is extremely different from python on its own
(of course, it's still nice to understand how more advanced python features work, but you shouldn't go out of your way to use them when the library offers a different, more efficient way to do the same thing)
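A small illustration of that point, assuming pandas is installed. The column names and values are invented. Both halves compute the same per-group average, but in the second the library does the looping.

```python
import pandas as pd

# Made-up example data.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "price": [10.0, 14.0, 3.0, 5.0, 7.0],
})

# Plain-Python style: loop over rows and keep running totals by hand.
totals, counts = {}, {}
for city, price in zip(df["city"], df["price"]):
    totals[city] = totals.get(city, 0.0) + price
    counts[city] = counts.get(city, 0) + 1
loop_means = {city: totals[city] / counts[city] for city in totals}

# DataFrame style: describe the operation, let pandas do the iteration.
grouped_means = df.groupby("city")["price"].mean().to_dict()

print(loop_means)
print(grouped_means)
```

On real datasets the second form is usually both shorter and faster, since the iteration happens inside the library rather than in the Python interpreter.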
hi, typically what does the interview process look like for machine learning and/or computer vision roles, for both internships and full time
Here are some example questions: #career-advice message
thank you, also are there any math questions or any programming questions they ask?
kind of like with the leetcode round with swe but with ml algorithms
I ultimately don't know what they'll ask you. But I've only ever applied for data science and AI positions, and I was never asked a leetcode-style question, nor have I asked one to a candidate.
not even math or algorithm questions?
interesting
and this is for full time?
they'll probably ask you theory questions. I'd be surprised if they asked you to do any live coding.
are there online assessments like with swe?
If I were to train a model from scratch and give it lines of data structured like this (shown in gist), how many of these data points should I have?
https://gist.github.com/Tyguy047/b492cdbc68797f518a952ea41f994f96
This is based on the Fashion-MNIST dataset.
Time data series Arrowhead. See how it still remains in 1D
Alright, I think the next experiment is to really push this thing: gonna calculate the complexity scores for all the datasets (time series, images, tabular, text) using the right domain logic for each first. Then, I'll combine all those complexity values into one giant list and sort it using my framework's magnitude score. Really want to see if the tool can create a meaningful difficulty ranking that truly works across all domains when the data types are mixed together. Basically, putting the 'Unified' in UCF to the ultimate test!
I've noticed patterns in this thing: the more complex the data, the better the results. Which makes sense, that's the whole purpose
Here are the cross domain results
can I fine tune some model with dataset containing python code?
both are on hugging face
of course I select only python code
50gb of data will be slow for finetuning, cut it to some smaller size?
When dealing with binary classification, is it better to use torch.nn.BCELoss or torch.nn.CrossEntropyLoss and why?
From what I know, we use BCELoss to get the value close to either 0 or 1 and round it to the nearest one, but with CrossEntropyLoss we get the exact 0 or 1 value, so which one is better actually?
would 400 images be enough to train a license plate detector
or is that too little
might be enough to fine tune
you can also try using data augmentation
nevermind it doesn't have labels
i've been trying to find a dataset with proper labels, all of the ones i find don't state exactly what the label format is and it just says it's meant to work with yolo
but i need to train this model from scratch
are any of you aware of any datasets out there that are clear with the labeling format
like if it's the top left and bottom right coordinates or if it's one coordinate and the dimensions of the box
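For reference, a minimal sketch of the common box conventions and how to convert between them. The YOLO text-label format stores `class x_center y_center width height`, with the four box values normalized by image width and height. Corner format is (x_min, y_min, x_max, y_max) in pixels, and top-left-plus-size is (x_min, y_min, width, height) in pixels. The function names and example numbers are illustrative only, and a given dataset's own documentation should still be checked.

```python
def corners_to_xywh(x1, y1, x2, y2):
    """(x_min, y_min, x_max, y_max) -> (x_min, y_min, width, height), in pixels."""
    return (x1, y1, x2 - x1, y2 - y1)


def corners_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Pixel corners -> normalized (x_center, y_center, width, height)."""
    return (
        (x1 + x2) / 2 / img_w,
        (y1 + y2) / 2 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )


def yolo_to_corners(xc, yc, w, h, img_w, img_h):
    """Normalized (x_center, y_center, width, height) -> pixel corners."""
    half_w = w * img_w / 2
    half_h = h * img_h / 2
    return (xc * img_w - half_w, yc * img_h - half_h,
            xc * img_w + half_w, yc * img_h + half_h)


# Example: a box in a 400x200 image.
box = (100, 50, 300, 150)
as_xywh = corners_to_xywh(*box)
as_yolo = corners_to_yolo(*box, img_w=400, img_h=200)
round_trip = yolo_to_corners(*as_yolo, img_w=400, img_h=200)

print(as_xywh, as_yolo, round_trip)
```

A quick sanity check on an unfamiliar dataset: if all four numbers are between 0 and 1, the labels are probably normalized. If the third and fourth numbers are sometimes smaller than the first two, they are probably a width and height rather than a second corner.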
TLDR: It doesn't really matter which one you use since both loss functions can get the work done.
BCELoss is strictly for binary classification and expects probabilities (0 to 1), so you must manually apply torch.sigmoid() to your logits first because BCELoss() does not do this for you. Your labels also need to be floats. It's the application of sigmoid that actually squashes the values into the range between 0 and 1, not BCELoss itself.
BCEWithLogitsLoss on the other hand simplifies this by handling the sigmoid internally for you, so you just need to feed it the raw logits and float labels, no extra steps needed.
CrossEntropyLoss, designed for multi-class, also works for binary so long as your model outputs 2 logits (for class 0 and class 1)... It auto-applies Softmax to the logits and uses integer labels, unlike BCELoss().
So there's no rule I've seen that states that using one is better than the other, however, just ensure your predicted output and target are in the right format ( depending on whichever loss function you decide to use.)
Personally, for classification tasks, whether binary or multiclass, I still prefer CrossEntropyLoss because of its design consistency (no manual sigmoid, works with int labels)... Meanwhile, another person might like BCELoss() or BCEWithLogitsLoss()
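To see why it "doesn't really matter", the three routes can be checked by hand. This is a sketch in plain Python that mirrors the textbook formulas, not PyTorch's actual implementation (which uses numerically safer forms). The logits and labels are made up. The two-logit cross-entropy version uses logits [0, z], which is one way to express the same binary model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(prob, target):
    """Binary cross-entropy on a probability (what BCELoss averages)."""
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def bce_with_logits(logit, target):
    """Same loss, but the sigmoid is applied inside (BCEWithLogitsLoss style)."""
    return bce(sigmoid(logit), target)


def cross_entropy(logits, class_index):
    """Softmax cross-entropy on raw logits with an integer class label."""
    log_sum_exp = math.log(sum(math.exp(v) for v in logits))
    return log_sum_exp - logits[class_index]


# Made-up (logit, label) pairs for a binary problem.
samples = [(0.8, 1), (-1.2, 0), (2.0, 1), (0.3, 0)]
n = len(samples)

loss_bce = sum(bce(sigmoid(z), float(t)) for z, t in samples) / n
loss_bce_logits = sum(bce_with_logits(z, float(t)) for z, t in samples) / n
loss_ce = sum(cross_entropy([0.0, z], t) for z, t in samples) / n

print(loss_bce, loss_bce_logits, loss_ce)
```

All three print the same number. Note that each loss returns a single loss value, not a class. The predicted class comes from thresholding the sigmoid output or taking the argmax over the logits.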
Thank you for the detailed explanation! That was beyond what I was expecting.
looks like i'm gonna have to go with this one, turns out it does have labels and is the only one i could find with clear labels for coordinates
what things should i do to augment the data and get accurate coordinates for the augmented data
crop and pan, rotate, maybe warp
to clarify: by augmenting I don't mean improving, but rather creating more samples
tensorflow and pytorch both have some docs on it, and overall it shouldn't take too much work to re-implement yourself in another framework if you needed to
Is this a common thing to do in machine learning? I'm guessing this is supposed to help prevent overfitting on some particular criteria, for example license plate always being horizontal/always being in the center of the screen/etc.
would i be able to get the coordinates for the new image though
not sure if any of the libraries do it automatically for you or if you must do it yourself, but even in the worst-case scenario it should be a relatively simple math calculation - just apply the same formula that's applied to the image pixels onto the bounding box
(imagine the label as a 2D greyscale image with the same dimensions as the original image - as long as you apply the same transformations as you applied to the original image it remains in 'sync')
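A minimal sketch of "apply the same formula to the box", for two simple augmentations. It assumes boxes in pixel corner format (x_min, y_min, x_max, y_max) and treats coordinates as continuous, so a horizontal flip maps x to img_w - x. The function names and numbers are illustrative. Rotation or warping would need all four corners transformed and a new enclosing box taken.

```python
def hflip_box(box, img_w):
    """Mirror a box left-right inside an image of width img_w."""
    x1, y1, x2, y2 = box
    # The old right edge becomes the new left edge, and vice versa.
    return (img_w - x2, y1, img_w - x1, y2)


def crop_box(box, crop_x, crop_y, crop_w, crop_h):
    """Shift a box into the coordinates of a crop window and clip it.

    Returns None if the box falls entirely outside the crop.
    """
    x1, y1, x2, y2 = box
    nx1 = min(max(x1 - crop_x, 0), crop_w)
    ny1 = min(max(y1 - crop_y, 0), crop_h)
    nx2 = min(max(x2 - crop_x, 0), crop_w)
    ny2 = min(max(y2 - crop_y, 0), crop_h)
    if nx2 <= nx1 or ny2 <= ny1:
        return None
    return (nx1, ny1, nx2, ny2)


# Example: a plate box in a 400-pixel-wide image.
plate = (40, 50, 140, 100)

flipped = hflip_box(plate, img_w=400)
cropped = crop_box(plate, crop_x=20, crop_y=30, crop_w=100, crop_h=100)
outside = crop_box(plate, crop_x=200, crop_y=0, crop_w=100, crop_h=100)

print(flipped, cropped, outside)
```

When a crop cuts off most of a box, it is often reasonable to drop that label rather than keep a sliver, though the cutoff is a judgment call.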
oh yeah true
man something is wrong with me, i cannot think at all lol
You go hard! UCF! UCF!
Going hard in the paint. Yo!
Anyone doing the "Drawing with LLMs" Kaggle competition?
Hi everyone, I am new to the AI space and am working on chatbot-like functionality for an Odoo system. I have used Ollama to run the DeepSeek-R1 model locally. Now I want to train this model to answer Odoo-related queries and use our custom Postgres database to give the best possible answer to a customer query. Can anyone guide me on how I can train the DeepSeek-R1 model with my custom data ❓
how much of a difference would there be in computational efficiency if i went with the region based method for classification versus training the model to both localize and classify
Hi,
I've prepared an ETL template.
Link: https://github.com/mglowinski93/EtlTemplate
More details: https://www.reddit.com/r/Python/comments/1kd4aib/etl_template_with_clean_architecture/
I hope you guys like it 🙂
What are these people doing with “LLMs”? Is this all just Q and A and RAG stuff? I’ve been seeing this all of the time.
My main confusion is, these “LLMs” how big are they? Are they owned by a company? Are people just making LLMs like it’s nothing? Are they using langchain? I don’t know, I’ve seen this come up a lot.
If you were learning it all again, which Calculus book would you send back to yourself?
Heavy optimization with partials and unconstrained optimization.
Best place to start
Sure, but pretend you have a time portal that only one book fits through.
Just learn derivatives and integrals. Know the unit circle.
understand limits. I took calc1 so long ago, like 2016. Dang. But yeah, I would suggest going to school for math
I went to school for math, yeah, that's not what I'm asking but I appreciate the feedback.
I didn’t mean to be rude. I misread your message. I apologize
To expand on my question.. I've got, for example, a 'programming paradigms' textbook in mind that would have greatly accelerated my learning if it had existed back then.. and I was just wondering today if there were a math equivalent text for that thought experiment.
I just felt like the structure of trig calc1-3 was good how it was when I took it. Maybe, more of emphasis on differential equations. I don’t remember that class at all. Linear algebra should always be a requirement, optimization is underrated in calc1-3, it’s very important. Yeah, I think calc should focus more on optimization. I don’t remember it was so long ago.
Interesting; when I was taught, trig and calc were totally separate; did they overlap for you?
You need trig for calc. No, I took trig separately. This was so long ago.
(I hated my trig class at the time, I hope they teach it in a different way now)
I remember when I was 18 I grinded trig so hard and got literally a 100, didn't miss a point. shout out to my 18 year old self
Nice.
I didn’t think calc was bad honestly. This was so long ago, but honestly I remember being introduced to a limit and it made more sense than a bond. This was so long ago. Honestly. I didn’t think calc1-3 was bad at all. I am serious, it made direct sense.
I do remember at one point I was given the example of a speedometer, and its relationship to distance traveled etc, and that was a WAY better guide to my intuition than what I'd been exposed to before
I also sorta ended up finding the 'fluxions' explanation of things more helpful than the modern one, oddly
I remember how much I hated physics. I did well, I just didn’t find it interesting. It felt forced.
I had a really good high school physics teacher, I lucked out there.
I just remember the labs were so boring. All of this was so long ago. I am trying to remember how I felt.
who's the author
A comprehensive programming textbook that
covers all important programming paradigms in a unified framework
that is both practical and theoretically sound.
Special attention is given to concurrent programming and data abstraction.
The textbook uses the Oz multiparadigm programming language for its examples.
even more than some other classics, this changed how I thought about programming
quick question, I'm attempting at PCA and from my understanding if your data looks clustered and not varied this means your PCA isn't usable is that right?
I can't recommend a book but I REALLY recommend 3blue1brown's YouTube video series on Calculus, it really helped me grasp the basics
i saw the word “techniques” and thought u were talking about the dragon book for a sec
I'm pleased to own a (red, I guess) copy of the Dragon book, but I can't say in retrospect it helped me learn much; I mostly just found it too dense, and when I finally understood it all from other sources enough for it to make sense, it was out of date.. seminal text though for sure
I just wasn't smart enough to get it the first time I guess
it’s very theory oriented i will admit
i kind of jumped around in the book and read the parts i really cared about
I guess what it at least did was teach me the terminology, so I could go separately investigate the parts.
a lot of it is still pretty relevant tho. like a lot of the compiler optimization techniques
idk i feel like it’s a good textbook for college students to read if they’re taking a PL/compiler class
I mean, as long as you go also learn about PEGs and GLL parsing and other things it doesn't cover
One probably shouldn't actually build YACC again circa 2025
I think the "front-end" vs. "back-end" hard distinction is out of favor too, in comparison to a long pipeline of simple transformations
I see a lot of people suggest this instead now https://www.amazon.com/Modern-Compiler-Design-Dick-Grune/dp/1461446988 but I haven't had the pleasure of reading it
"Modern Compiler Design" makes the topic of compiler design more accessible by focusing on principles and techniques of wide application. By carefully distinguishing between the essential (material that has a high chance of being useful) and the incidental (material that will be of benefit only i...
https://paste.pythondiscord.com/OUSA is this code okay? it graphs the accuracy over time
within the network
well, perhaps. I think there'll still be a discrete line between frontend and backend compilers as long as LLVM remains a key technology in compiler design
Yeah, it’s still probably a useful idea, I mostly was just saying it has turned out not to be sacred.
Compilers? Nothing moves without them. Especially code.
This is why I study systems level programming.
Errr isn’t PCA breaking down the features to find the best ones 
My head hurts. Have a good day/night chat 🫡
this worked btw. thank you so much man. i hope you get all the success you wish for.
from ultralytics.utils.ops import clip_boxes, scale_masks


class YoloModel(BaseModel):
    """
    This YoloModel class is for object detection and instance segmentation task
    """

    def __init__(self, model_path: str, confidence: float):
        self._model = YOLO(model_path)
        self._confidence = confidence
        self._cv_bridge = CvBridge()

    def segmentation(self) -> Tuple[list[str], list[Segmentation]]:
        model_output = self._model.predict(
            self._color_img, conf=self._confidence, iou=self._confidence
        )[0]
        self._model_output = model_output
        if model_output.masks is None or model_output.boxes is None:
            return None, None
        names = [value for _, value in sorted(model_output.names.items())]
        # the box coordinates are given in float32 but we want int32,
        # clip them again to avoid rounding issues causing the boxes
        # to be out of the image
        boxxywhs = clip_boxes(model_output.boxes.xywh.int(), model_output.orig_shape)
        scale_up_masks = (
            scale_masks(model_output.masks.data[None], model_output.orig_shape)
            .squeeze(0)
            .to(torch.uint8)
            .cpu()
        )
        segmentations = []
        for i in range(model_output.boxes.data.shape[0]):
            item = self.yolo_result_to_segmentation(
                model_output.boxes.cls[i].int().item(),
                model_output.boxes.conf[i],
                boxxywhs[i],
                scale_up_masks[i],
            )
            segmentations.append(item)
        return names, segmentations
anyone has an idea how to rewrite codes using ultralytics in C++? since I need to deploy it using C++
my current thoughts are rewriting clip_boxes and .predict function in C++, but it seems a lot of work
loc or iloc which is better
Why do you think one is better than the other?
that is true, especially with more complex languages like Rust, the line between frontend & backend (and middle end perhaps?) isn't as clear. I'm pretty sure I read somewhere that Rust is actually planning to move off of LLVM but somebody should definitely fact check that
What’s Google-adp
is it like yt-dlp or something
Anyone ok to go over with me my notebook so i can organize it properly? Im having a hard time doing so since this is so unorganized
Doing a solo project
Just post it here 
We should have a workspace channel for this channel. Like the movie Inception 
notebooks aren't amenable to sharing over Discord, so you'll want to do something like python -m jupyter nbconvert --to script --stdout your_notebook.ipynb
on discord ? ahahha
Yes, that's where we are
SP is recommending converting any Python scripts to Jupyter notebooks
the opposite
someone would have to start a notebook server to read this, so it would be easier for them if you do the command to convert it to flat text.
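If nbconvert isn't available, the code cells can also be pulled out by hand, since an .ipynb file is just JSON. This is a simplified sketch rather than a replacement for nbconvert: it only concatenates code cells and does nothing about notebook magics or markdown. The function name and the tiny inline notebook are made up for illustration.

```python
import json


def notebook_to_script(notebook: dict) -> str:
    """Join the source of every code cell in a parsed .ipynb document."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks)


# A tiny made-up notebook; a real one would be read with json.load(open(path)).
raw = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["import math\n", "print(math.pi)"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": "x = 1"},
    ],
})

script = notebook_to_script(json.loads(raw))
print(script)
```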
Im sorry if i sound like an idiot but where would i put this
cmd?
Nbviewer here: https://nbviewer.org/
you do not, and yes.
cmd is saying i dont have python installed, ill just go ahead and use nbviewer
This also didnt work
I did a colab link
Easier i guess
Correct one
Also sorry for asking this guys, my head just hurts from looking at jupyter the whole day
whats up people.
so I made this curriculum learning tool and have been playing around with it, and I think I stumbled on something fundamental about how data/knowledge is structured.
basically, I found a way to measure the "learning complexity" of individual samples in ANY dataset (images, time series, tabular, text) using a single unified framework, but the crazy part is that when I sort training data by this complexity measure, I'm seeing performance gains from 3% to 150% (!!) depending on the dataset
what's even more wild though is the framework correctly identifies when data doesn't have inherent structure (like the Madelon dataset), where random is the winner
I tested it on 62 datasets across 4 domains. the biggest increase was +149% on the WAFER dataset, blood transfusion +84%, ECG data +53%, but on truly random data it's 0%, as it should be
but what I think I'm finding here is that there's something like a "conceptual dependency graph" hidden in data. some knowledge has prerequisites (like learning addition before multiplication), some doesn't (like learning colors)
But this framework I made can detect which is which automatically
I feel like there's something deeper here about how information itself is structured
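The ordering strategies mentioned in this thread (easy-to-hard, hard-to-easy, phase ascending) can be sketched generically. The actual complexity formula is not given anywhere in the chat, so the complex-valued scores below are invented placeholders. Only the sorting step is illustrated, with magnitude standing for "how difficult" and phase for "what kind of difficulty".

```python
import cmath

# Placeholder per-sample complexity scores (NOT from the real framework).
scores = {
    "a": complex(0.2, 0.1),
    "b": complex(-1.5, 0.0),
    "c": complex(0.0, 0.8),
    "d": complex(1.0, -1.0),
}

# Curriculum by magnitude: smallest |score| first.
easy_to_hard = sorted(scores, key=lambda name: abs(scores[name]))

# Reverse curriculum: largest |score| first.
hard_to_easy = list(reversed(easy_to_hard))

# Phase ascending: order by the angle of the score, from -pi to pi.
phase_ascending = sorted(scores, key=lambda name: cmath.phase(scores[name]))

print(easy_to_hard)
print(hard_to_easy)
print(phase_ascending)
```

Training would then feed samples to the model in one of these orders and compare against a randomly shuffled baseline.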
Your notebook looks fairly organized Manny. Random Forest are 1 of my favorite stacks. What is your question exactly
@rich moth test your stack on Manny’s dataset here 👀
Yum yums! Lets do !
idk maybe im overthinking this but it feels like there's some universal pattern here about how knowledge organizes itself? like why does it work across images AND time series AND text AND tabular? seems weird right?
Well we have data here. Let’s put your stack up to the test 🤔
Data can easily lie. A lot don’t understand that
I thought things looked disorganized and wanted to put them more organized
and now that i did it on jupyter my r2 scores went to shit and now linear regression is better than rfr?
Ye
@leaden narwhal put your accuracy metrics toward the bottom maybe and group your map plots together. From what I see your Random Forest metrics have higher accuracy vs LR numbers.
A confusion matrix and F1, precision, recall metrics should give you a more reliable accuracy metric.
This should tell you if your models are confused or not. Basically lying to you 😂
I lecture my models all the time
Let me show you my organized notebook
i think this shows the actual values
because i was using log income values to predict income, which is stupid
so now i changed to linear regression
Im going to try xgboost
Someone taught me bad words 
Is this an xgboost moment
Ok i dunno why but xgboost actually was goated and made some crazy predictions
almost perfect, some districts still aren't predicted properly but most are. Im happy with this!
@limber spear
Check the folium at the end and tell me what you think
https://colab.research.google.com/drive/1rG_63k7PU3B-4q2gvQhThikKy_Htc4d-?usp=sharing
Xgboost is pretty nice RF is goated as well imo
yeah but the r2 score is 10 times better
This is where your fine-tuning skills can come in. Some are fine-tuning goats. This is where OpenAI and Grok devs make their living
Billion parameter models
But big tech won’t tell you this. Big tech will say cutting edge AI or proprietary
say his name moment?
why did they do this 💀
https://www.kaggle.com/datasets/naim99/lion-image?select=lion.jpg
messing around with matplotlib and remembered you can add text to plots :D
ok, this sounds dumb, but until today I had never cloned a repo that was not mine. I thought that was cheating or something. I would either read about the code or look at it for reference if I did not understand it. Cloning makes this so much faster. I never knew this. I only cloned my own repos to edit them, or someone else's that I had permission for. Does everyone clone?
i started getting into ai and i didnt know how/what's the best way to do it
i started by learning pandas, now numpy, and then matplotlib
and seaborn
after that pytorch scikit-learn
and then ML methods like classification regression decision trees and so on
after that essential method or before that? and then deep learning
is this strategy decent? i kinda separated them into different parts and i do separate projects with each of them after learning a bit, then a combined one
right now learning about numpy, almost done and then starting matplotlib
He's roaring because of the extreme underfitting going on
Looks pretty well mapped. Is this going on a dashboard?
Is data science less saturated than other IT fields? Also, is it a good career choice for the future? I’ve heard some people say it’s a dying field.
science is an infinite circle, and data can scale to infinity. Forever. What do you think of this conjecture
xgboost falls into the category of gradient boosted trees; others include lightgbm and catboost
these are usually very competitive when it comes to tabular data
dying field? who the heck told you that? there will never not be data. and there will never, ever, not be the need to process it and analyze results and make something of them
beautiful
Yes.
People WANT to have their repo cloned. There is a counter that says how many times their repo got cloned. A repo that is cloned a lot is a good repo, because it means many people find it useful.
If, for any reason, people don't want their repo cloned, they will not make it public in the first place.
Many people have already made such lists (including me)
https://www.pythondiscord.com/resources/?topics=data-science
http://introtodeeplearning.com/
https://deep-learning-drizzle.github.io/
https://kidger.site/thoughts/just-know-stuff/
https://github.com/aprbw/ArianDLPrimer (I made the last list myself)
The list that you give looks good to me.
Sounds like you are making steady progress.
Keep it up!
Just check the License before you clone/install something
some will limit what you can do with it
others will force you to share under the same license
and if there is no license, then strictly speaking you have no permission to do anything with it
thanks a lot! It's fun and i'll definitely keep it up
All of this time, I would literally bookmark the actual repo if it was good and learn from it. The amount of time that would’ve been saved by simply cloning it and putting myself in their shoes…. It’s ok, the grit is there. Oh my god. I have only cloned my own repos to change them, or others that I had access to. Never ever cloned a repo as a guide when it’s like “oh I need a good example from their perspective”. I will remember this day. Forward onto Dawn. Let’s go. I don’t care, I am glowing.
@lapis sequoia make sure you read the caveat by etrotta.
good luck
What is data
Hey, I am working on image segmentation and my targets have NaN values, so is a masked loss function the only way to go?
Like it ignores the NaNs and only gets trained on valid data
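For reference, a masked loss can be as small as the sketch below. It is only a NumPy illustration of the idea in the question (skip pixels whose target is NaN); `masked_mse` is a made-up name, and in a real training loop the same masking would be done on your framework's tensors.

```python
import numpy as np

def masked_mse(pred, target):
    """Mean squared error over pixels whose target is not NaN."""
    valid = ~np.isnan(target)            # True where a label exists
    if not valid.any():                  # no labelled pixels at all
        return 0.0
    diff = pred[valid] - target[valid]   # NaN pixels never enter the sum
    return float(np.mean(diff ** 2))

pred = np.array([[0.5, 0.2], [0.9, 0.1]])
target = np.array([[1.0, np.nan], [1.0, 0.0]])   # one unlabelled pixel
print(masked_mse(pred, target))
```

Only the three labelled pixels contribute here, so the unlabelled one cannot pull the loss (or the gradients) toward NaN.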
heh I recalled "learn from it" from siraj raval data lit video 🙂
don't know if he still uploads anything, I must check
but his ML content is in TensorFlow, not PyTorch, if I remember correctly
subscribe to my youtube channel!
www.youtube.com/c/sirajraval
- llSourcell
yo I performed the analysis
that day you helped me
So I noticed in the scatter plot that when there is rain, all variables drop by 10-20 kelvin and some even drop by around 40 kelvin
so its a good idea to include all 4 vars
I had one more question guys
In meteorological data
is a weighted loss function a good choice or not?
since we give more weight to, let's say, rain events
but in a real scenario there will be more no-rain events
and fewer rain events
depends on which metric matters the most for you
for example, if you were trying to predict extreme weather, you might be willing to sacrifice accuracy in exchange for a higher recall knowing you'll get some more false alerts
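To make that trade-off concrete, here is a small plain-Python sketch. The labels and the two "models" are invented toy data, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true   = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]   # 1 = extreme event
cautious = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # rarely raises an alert
eager    = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]   # alerts more often

print(precision_recall(y_true, cautious))   # misses one event, no false alerts
print(precision_recall(y_true, eager))      # catches both events, two false alerts
```

The eager predictor has the higher recall and the lower precision, which is the "more false alerts" cost described above.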
I'm considering general weather, currently excluding storms, cyclones and all
but training examples for the rain case are definitely far fewer than for the no-rain case
95% of the data is no-rain cases
and 4% is rain
1% others
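With a split like that, one common starting point for a weighted loss is inverse-frequency class weights. The sketch below is only an illustration: the class names are made up, and whether weighting helps at all depends on which metric you care about.

```python
def inverse_frequency_weights(class_shares):
    """Weight each class by 1 / (n_classes * share), so rare classes count more."""
    n_classes = len(class_shares)
    return {name: 1.0 / (n_classes * share) for name, share in class_shares.items()}

# Shares taken from the messages above; the keys are hypothetical labels.
shares = {"no_rain": 0.95, "rain": 0.04, "other": 0.01}
weights = inverse_frequency_weights(shares)

for name, weight in weights.items():
    print(f"{name}: {weight:.2f}")
```

With these weights each class contributes the same total amount to the loss, so the 95% no-rain class no longer dominates. Weights this extreme can also make training noisy, so people often clip or soften them.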
can putting 1d data into 2d be called embedding?
or unprojecting
I'm watching ml teach by doing, the part about feature representation
No clue mq. That is a great question. Embedded in hardware or are you wondering about embedding in software. What do you think chat
ah ok, I'm thinking about embedding in the math sense
as in word embedding
2d to 1d is to project
but vice versa? looks like it's embedding, just not sure
I think vector embeddings have applications in cybersecurity though I am not sure.
hmm there is t-sne
t-sne (the e is for embedding)
t-distributed stochastic neighbor embedding for sure
I find it interesting what chunks of data can do. They dance around in our little machines 😅
Paint pictures. It’s fascinating
someone said this: learn principles, algorithms, architecture (as in: design your own architecture, don't copy someone else's architecture without understanding why you are doing things that way)
is this also related to ai ml stuff? or ml is a completely different thing and architecture and stuff dont apply to it?
hey anyone online
any tips for profile face detection
cant find anything anywhere
currently using cv2 and mediapipe
detection is really good for frontal
but not good for profile side
ping me if answer
architecture is about big picture of system birds eye view without going into details, as I think about it
I assume you mean model architecture
Model architecture and software architecture are two different, largely unrelated things
Let’s put them together and call it smodel or modware architecture
Tbh this is why I love this field. Innovation is endless. Make up words. Build a new model. Been having a blast. It’s like building with lego blocks
the way I see people use it, both are projections, 1d to 2d and vice versa
It is completely different thing.
But strangely, still applies.
No, architecture is more about the "family" of models. For example, I think of VGG as an architecture and you can get many versions of it, but they all behave in a similar way.
why is it not good?
I made a tool called ParquetToHuggingFace to help you upload your audio data to Hugging Face easily in Python. It takes your raw .wav files, turns them into Parquet format, and then uploads them to the Hub. The repo has clear steps on how to set everything up, where to put your files, and how to run the script. If you're working with speech data and want a quick way to share it on Hugging Face, give it a try!
GitHub Repo: https://github.com/pr0mila/ParquetToHuggingFace
🎉 Introducing GroqStreamChain! 🎉
A real-time AI chat application built with Python , FastAPI, WebSocket, LangChain and Groq. 💬 Seamlessly stream AI responses and interact with smarter chatbots powered by cutting-edge technology. 🤖
🚀 Features:
- Real-time WebSocket communication
- Streaming AI responses
- Smooth and responsive UI
🔗 Check out the project on GitHub: https://github.com/pr0mila/GroqStreamChain
Join the conversation and start building your own AI-powered chat apps today! 💬
https://github.com/ultralytics/ultralytics/blob/main/examples/YOLOv8-ONNXRuntime-CPP/inference.cpp
char* YOLO_V8::WarmUpSession() {
    clock_t starttime_1 = clock();
    cv::Mat iImg = cv::Mat(cv::Size(imgSize.at(0), imgSize.at(1)), CV_8UC3);
    cv::Mat processedImg;
    PreProcess(iImg, imgSize, processedImg);
    if (modelType < 4)
    {
        float* blob = new float[iImg.total() * 3];
        BlobFromImage(processedImg, blob);
        std::vector<int64_t> YOLO_input_node_dims = { 1, 3, imgSize.at(0), imgSize.at(1) };
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
            Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU), blob, 3 * imgSize.at(0) * imgSize.at(1),
            YOLO_input_node_dims.data(), YOLO_input_node_dims.size());
        auto output_tensors = session->Run(options, inputNodeNames.data(), &input_tensor, 1, outputNodeNames.data(),
            outputNodeNames.size());
        delete[] blob;
        clock_t starttime_4 = clock();
        double post_process_time = (double)(starttime_4 - starttime_1) / CLOCKS_PER_SEC * 1000;
        if (cudaEnable)
        {
            std::cout << "[YOLO_V8(CUDA)]: " << "Cuda warm-up cost " << post_process_time << " ms. " << std::endl;
        }
    }
    ...
what does WarmUpSession do here?
its designed for frontal view
theres some frames where it doesnt work
i was thinking about using retinaface but in terms of cost i think i will go with opencv and mediapipe
Hello is there anyone who works with Tf-idf vectorization?
Unified Complexity Framework (UCF): Deep Synthesis of Findings - Unveiling the Structure of Learnable Knowledge 1. Introduction: Beyond Optimization - A Framework for Understanding The extensive evaluation of the Unified Complexity Framework (UCF), defined by Φ(x) = N + A·e^(iθ) + ε, across ~62 d...
always always ask your actual question. never ask if someone knows about your question without asking your actual question.
Ok noted
That said, does TF-IDF vectorization always have to use fit_transform on the description text of an unsupervised dataset?
sounds like you're using sklearn. which is a specific way of doing tfidf vectorization.
do you know the difference between fit and transform, in sklearn?
stela are you into ml? what's the hardest thing for you
the hardest thing in ML?
Yes I do, however I've never worked with unsupervised datasets before, so I'm just learning
nothing is hard for me, because I'm awesome.
NICE I LOVE THAT CONFIDENCE BRO
THAT WAS COOL BRO THATS WHAT IM TALKING ABOUT BE CONFIDENT NOTHING IS HARD
ITS TOO EASY
ITS TOO EASY
I'm not actually being serious. I have problems that I don't immediately know how to solve every day.
when I was a beginner, the hardest part was dealing with how everyone explains things differently. not like in this server, but in books and online guides
a dataset isn't inherently supervised or unsupervised. training techniques are.
Hmm I see. I have used the .fit function in model.fit when I started out with linear regression models
a tfidf vectorizer encodes text, so you use the fit method to tell it what kind of text it's going to be encoding.
the tfidf vectorizer's knowledge of which words are frequent and which are not--and which words exist at all--is determined when you fit it.
Ohhhh that explains the clustering I was getting
Sorry if that doesn't explain too much, but I understand why it is done now
I just needed to know why it was fit and now for the transform part?
I think i may need to provide better context
you can use fit_transform if you want to encode the same text that you used to fit the vectorizer.
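To see the fit/transform split without any library, here is a toy sketch. `ToyTfidf` is a hypothetical stand-in written for this example, NOT sklearn's implementation (the real vectorizer tokenizes, smooths and normalizes differently), but the division of labour is the same: `fit` learns the vocabulary and word weights, `transform` only encodes.

```python
import math
from collections import Counter

class ToyTfidf:
    """Toy stand-in for a TF-IDF vectorizer; NOT sklearn's implementation."""

    def fit(self, docs):
        tokenized = [doc.lower().split() for doc in docs]
        self.vocab_ = sorted({w for doc in tokenized for w in doc})
        n_docs = len(tokenized)
        doc_freq = Counter(w for doc in tokenized for w in set(doc))
        # Rare words get a larger weight than words found in every document.
        self.idf_ = {w: math.log(n_docs / doc_freq[w]) + 1.0 for w in self.vocab_}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            # Words never seen during fit are simply ignored.
            rows.append([counts[w] * self.idf_[w] for w in self.vocab_])
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

train = ["red shoes", "blue shoes", "red hat"]
vec = ToyTfidf().fit(train)                  # learns the vocabulary and weights
new_rows = vec.transform(["green shoes"])    # encodes with what was learned
print(vec.vocab_)
print(new_rows)
```

"green" was not in the text used for fitting, so it has no column. That is why you fit on the training text once and then reuse the same fitted vectorizer for anything you encode later.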
I think ml is not hard; the hard thing is to understand it, then it's easy
or rather, it's not intuitive at the start and one must build some intuition
TIL how to annotate a heatmap in hvplot/holoviews with dynamic text color based on the value by hitting my head against the wall a lot of times
(why is it so hard and why can't I find docs explaining it... probably skill issue)
assuming hm is your heatmap:
hm * hv.Labels(hm).opts(text_color=-hv.dim('value'), cmap='binary')
where 'value' is literal and not something you change, and the - is there so low values are dark instead, so you can actually see on the heatmap
also depends if someone thinks about classic ml as ml
or ml and deep learning as machine learning
for me its sometimes confusing
Setting up environment.
Nothing is harder than this.
i think the hardest thing about ML is people who know ML are bad at explaining ML
agree
you have sth in your eyes, see patterns and can't transfer it to someone
but this thing could be learned right, thats where experience comes in
can a data analyst please reach out to me? i'm facing trouble with cleaning shopify data
not sure, but maybe think about it as cleaning data in general rather than something shopify specific, try to generalize it
oh, and to change the text formatting (e.g. only to 3 decimal places):
dim = hv.Dimension('value', value_format=lambda x: f'{x:.3f}')
hm * hv.Labels(hm, vdims=dim).opts(text_color=-hv.dim('value'), cmap='binary')
Someone in chat build a data cleaning stack or something 😂 it’s so boring 💀
See
tell em playboicatering
Which parts do you find boring? I'm only a beginner in pandas but automating some of the data cleaning sounds like an interesting project, I might try my hand at this
Omgersh where do I start. Feature engineering should be streamlined. But responsibly
Especially with medical data for example
I love my granny bro
Yeah that's probably above my head
I'm just learning basic EDA for now
No worries keep on cooking 🔥
etl is responsible for data cleaning?
what does etl stand for?
extract transform load
what's the context for this?
more specifically transform for data cleaning?
I've never heard of "extract transform load"
The “T” is where you typically do some cleaning yeah
“ETL pipeline” is a fairly common phrase
people are always inventing new terms 😠
More for regular systems than for ML though
in short sth related to normalization, normal forms, there is 5nf at most
etl is in data warehouses course
no one cleans data manually?
I mean instead one can just use some etl tool
load means feed data to data warehouse
It’s all over the place in job postings I have seen
Seems like it is a term from data engineering?
I have seen it used a bunch in the small amount of time I have spent in r/dataengineering
ETL is a pretty old term. From the 80s. There used to be arguments about whether you should transform before loading or vice versa.
It's broader than that, it's reshaping the data and putting it into a format that is fit for purpose for downstream tasks
For example, SAP ERP can have 100k tables (not kidding)
If you want to do anything with this downstream you're definitely going to want to pull this data down to some other place, consolidate, reshape etc.
SAP is wild. Some of the servers people are running it on just have comical amounts of RAM.
yes normalize
The inverse, actually: denormalize
yes ok
Normalize ==> adding more tables, denormalize ==> making it flat(ter)
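As a tiny illustration of the "making it flat(ter)" direction, here is a plain-Python sketch with two made-up tables. The table and column names are invented for the example.

```python
# Normalized: customers and orders live in separate tables linked by an id.
customers = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 2, "total": 40.0},
    {"order_id": 12, "customer_id": 1, "total": 15.0},
]

def denormalize(orders, customers):
    """Join each order with its customer to get one flat row per order."""
    return [
        {**order, "customer_name": customers[order["customer_id"]]["name"]}
        for order in orders
    ]

flat = denormalize(orders, customers)
for row in flat:
    print(row)
```

The flat rows repeat the customer name on every order, which is the redundancy that normalization removes and that downstream analysis often wants back.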
No worries, it's clear you know what it means 🙂
I think must have muscle memory
so for example when understanding perceptron rotate hands?
should I install the dotenv module for interacting with .env variables? or is there another module you lot recommend? For a little more information, I'll be making my first program that uses the AlphaVantage API to extract stock market information (in this case, data concerning the US Dollar index). I want to start web scraping and learning how to use APIs, particularly for data analysis
to improve retention of rotating line
Yeah!
A common pattern is that configuration is stored as environment variables. Lots of us deploy with Docker or Kubernetes, which means we can "inject" these env vars right into a specialised place where we run our app.
During local development we still need to provide some config. This is typically done in the form of a .env file that contains all the secrets (API keys and whatnot). This file is read with stuff like dotenv.
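Roughly, the pattern looks like the sketch below. It uses only the standard library so it runs anywhere: `load_env_file` is a hypothetical, minimal stand-in for what python-dotenv's `load_dotenv()` does, and the variable name `ALPHAVANTAGE_API_KEY` is an assumption, not something the API requires.

```python
import os
import tempfile

def load_env_file(path):
    """Hypothetical minimal .env reader; python-dotenv handles far more cases."""
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Keep variables that are already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip().strip("\"'"))

def get_api_key(name="ALPHAVANTAGE_API_KEY"):
    """Read the key from the environment and fail loudly if it is missing."""
    key = os.environ.get(name)
    if key is None:
        raise RuntimeError(f"{name} is not set")
    return key

# Demo: write a throwaway .env file, load it, then read the key back.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write("# local secrets, never commit this file\n")
        fh.write("ALPHAVANTAGE_API_KEY=demo-key\n")
    load_env_file(env_path)

api_key = get_api_key()
```

The application code only ever calls `get_api_key()`, so it behaves the same whether the value came from a local .env file or was injected by Docker or Kubernetes.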
no, it's true. ever think about how, when you're thinking about something, you look up at the sky? instead of doing that, use your hands as well. it might look weird but it also might work
kinda like that, yeah. think of a box, but now think of a box while gesturing with your hands. notice how when you do that the box becomes much clearer in your head
can you tell me if this looks good?
did you "verify critical facts"?
you could do local development in the container as well though
dev containers always felt super janky to me
idk, I love 'em
don't have to worry about any env setup pretty much, just start the container and begin developing stuff
what does it mean by manually describe
i don't understand how else it would do it without ocr
this is chatgpt by the way
if ChatGPT says something that makes no sense, odds are it makes no sense and it's just the model hallucinating
Technically you could have
- native multi-modal image inputs (tokenize the image and feed it directly to the model as part of the prompt)
- a separate OCR tool the model can use via function calling
and it would make sense if the model tried to use the OCR tool first, then used native image inputs after it failed, but that's extremely unlikely
by tokenize the image, do you mean it would go through the image and extract haar features and then use that to detect each character?
Is it rad that I avoided langchain forever because I just thought it was trendy garbage and fine tuning T5 is more lit? Am I like a hipster now? RLHF is cool, but I don’t want to abandon my roots.
The alternative to Langchain that I found is LiteLLM. Its nice
@proven current I removed your message because the content is disturbing for some users
there is no inherent meaning behind anything ChatGPT says.
The Unified Complexity Framework: Revolutionizing Multi-Domain Data Complexity Analysis Abstract This paper introduces the Unified Complexity Framework (UCF), a revolutionary approach to quantifying data complexity across diverse domains. Unlike conventional methods that treat complexity as domai...
OK, it's the rough draft of my research paper.
This could change the game of how we design datasets. Imagine datasets with built-in complexity metadata that map optimal learning pathways and make curriculum learning effortless, eliminating the need to calculate sample complexity during training. UCF can enhance these datasets, transforming machine learning data from simple collections into structured knowledge maps with clear learning trajectories, dramatically improving training efficiency and transfer learning capabilities. This could establish a new gold standard for ML datasets where curriculum-readiness becomes a core feature rather than an afterthought, and reimagine how we approach data design across all domains
The proof is in the pudding. It's all on the wall.
a way of sorting data by complexity so you can feed hard or easy stuff first?
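In its most generic form, that idea looks like the sketch below. To be clear, this is NOT the UCF method (which has not been shared); the difficulty score here is a made-up placeholder, and `curriculum_order` is a hypothetical helper.

```python
def curriculum_order(samples, difficulty, easy_first=True):
    """Return samples sorted by a per-sample difficulty score."""
    return sorted(samples, key=difficulty, reverse=not easy_first)

# Hypothetical difficulty score: longer sentences count as harder.
sentences = ["a cat", "the cat sat on the mat", "cats sleep a lot"]
ordered = curriculum_order(sentences, difficulty=lambda s: len(s.split()))
print(ordered)
```

The interesting part of any curriculum method is the scoring function. The sorting itself is one line.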
I didn’t think of it this way before 🤯 that is a game changer. In the DE community a lot of the conversations move toward code but rarely fundamentals are discussed
I suggest you put it somewhere more "official"
The idea being that you can show, with a timestamp, that you wrote this on a certain date.
Ideally ArXiv, if you cannot, then github will do.
achieving performance improvements of up to 149% over random sampling baselines.
You need to use a better baseline than just random sampling. You need to use the latest SOTA in curriculum learning.
It seems that you are purposely not revealing the methods?
are you looking for feedback? or something else?
anyone can write research paper?
Yes, anyone can.
but I see it's hard to get started with a different style of language, similar to writing a thesis
I mean, anyone can, not saying that it is easy.
With enough effort and resources, almost everyone can,
the question is if it is worth it.
so I'm working on benchmarking various activation functions (AF) to improve an NN model that is trying to learn the pattern of the sin function.
the thing is that I don't want to use any fancy methods like normalization or special optimizers just yet.
and I'm trying to improve it by changing only the following configurations: number of hidden layers, number of hidden neurons, learning rate, gradient clipping threshold (I know this is a fancy method but it's unavoidable for now).
so the problem I'm encountering is that my current model adapts and predicts well when the training values only range from -pi to +pi (with 500 samples), with a loss as low as 10^-5, but the moment I increase the range to, let's say, -100 to +100 (5000 samples), the predictions for all the activation functions stagnate at a loss of 0.5, which is nowhere near enough obviously
any idea on how to fix or improve this?
should I send my code here or is that not allowed?
using basic GD btw
What is meant by value here when you state -pi to +pi I’m not catching on
im using
x = np.linspace(-np.pi,np.pi,500)
training samples i mean
should i just send the entire code ?
print("hello world test for syntax highlighting")
I went to the machines. Total guess here: you’re mapping your inputs [-pi to +pi] to target labels of [-1 to 1], but the issue may be that your build is now taking in inputs of [-100 to 100] and your model isn’t designed for that. You probably just have to refactor portions of your build, like functions
what do you mean by my model is not designed to take in those inputs?
im pretty new to ml too btw
like should it just take the inputs and try to learn them according to its map, which will be [-1,1]??
print("test")
It was a guess. It depends what you’re targeting for your outputs. It could be binary like 0 or 1. Or a range [0 to 1], [-1 to 1]. If that makes sense
you are completely right here i am trying to predict the mapped range of [-1,1] values
Oh ok you would probably just have to do a light refactoring in your build to map everything correctly
what would that look like and what was my mistake?
Total guess here. It could be as basic as the line of code you shared here: x = np.linspace(-np.pi,np.pi,500)
but what in that though?
i am so lost here, what are you trying to tell me?
i commented out the -pi,pi cuz that one works fine but the -100,100 does not
I’m a noob tutor
chat who’s a better explainer
You can test ranges out. But I like breaking stuff to learn
👍
Lock in you’re on the right track bKC
I just suck at explaining. That means I don’t have the science down. Need to lock in as well lol
yess will do
but i think -100,100 is just too much to expect from a basic NN
ig ill just have to accept that this is the best it can do
and add upgrades like normalization and optimization techniques
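For what it's worth, here are two input-side things one could try before giving up on the wider range, as a NumPy sketch. Neither is guaranteed to fix the plateau: on [-100, 100] the network has to fit roughly 32 periods instead of one, which is a much harder target. The sketch also assumes the loss is MSE, in which case 0.5 is about what you get by predicting 0 everywhere.

```python
import numpy as np

x = np.linspace(-100, 100, 5000)
y = np.sin(x)

# Assuming an MSE loss: predicting 0 for every input already scores about 0.5,
# so a plateau at 0.5 suggests the model has collapsed to a constant output.
baseline_loss = float(np.mean((y - 0.0) ** 2))

# Option 1: rescale inputs so they sit in [-1, 1] before the first layer.
x_scaled = x / np.abs(x).max()

# Option 2 (uses prior knowledge that the target has period 2*pi):
# wrap every input back into [-pi, pi), the range the model already handles.
x_wrapped = (x + np.pi) % (2 * np.pi) - np.pi

print(round(baseline_loss, 3))
print(x_scaled.min(), x_scaled.max())
print(np.allclose(np.sin(x_wrapped), y))
```

Option 2 is arguably cheating, since it hands the network the periodicity it is supposed to discover, but it shows why the [-pi, pi] version is so much easier.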
real real
i was in the middle of cooking this up for a class
Why
Isnt it -1:1
Have any of you guys used the SEC API
Not the one from Python itself which requires the API key but from the SEC which is the one where you request the headers
what are you asking exactly?
Hello everyone, I have a small question: where can I find well-documented datasets that can support academic research or thesis development? Maybe some open-access platforms or even government data portals. Bonus points for anything that supports ML or predictive analytics. Thank you!! PLEASE @ ME HERE
Kaggle and Hugging Face are good places to start
you can also use https://datasetsearch.research.google.com/
Honestly, I don't know. I'm lost on what to do. I would love to monetize it. Get me out of my UPS driving career. After 17 years I'd love to switch into a more technical domain.
Have you pinged the Hacker News community. They can Simon Cowell your stack if it’s good or not
I guess, 1st of all, congrats for getting this far, it seems that you did some real studying and real work.
2nd, sorry to burst your bubble, but there is still some gap between your draft and something publishable. The gap is not unsurmountable, but I suppose it is somewhere between a few weeks and a few months. The biggest issue I can spot is that you need SOTA curriculum learning as your baseline. I also cannot judge your method, as it is not revealed yet. So keep up the good work and you'll get there.
3rd, the sad news is, even if this is published in a top journal, there is still a huge gap between that and monetizing it. And I don't even know what it takes. A lot of fresh PhD grads want to convert their thesis into a startup, but very few succeed. If I knew how, I would have done it myself, I want to get rich quick too.
Finally, if you put the full version (with your methods and code) on a github, you can start emailing professors and ask for collaboration. They get to be co-authors, and you get valuable feedback and even funding for conference submission, and from there, I hope it is one step easier to get into some ML jobs.
just one recommendation, when you show it to people, instead of a paragraph about how it will revolutionize the field, just say what it does
I have a bit of food for thought. So the father of modern genetics Gregor Mendel his work went unrecognized in the scientific community until about 16 years after his death. Sometimes no one even cares 😂 5+ centuries from now who will remember these billionaires
Sorry about laughing. Idk I think about these things.
I want to rewrite this without using libtorch
torch::Tensor pred_masks = torch::nn::functional::interpolate(
    masks.index({scores_mask, torch::indexing::Ellipsis}),
    torch::nn::functional::InterpolateFuncOptions().size(
        std::vector<int64_t>({input_height_, input_width_})));
in which masks and scores_mask are both tensors.
I don't want to use libtorch because the library adds a lot of size to my project, but I don't know how to rewrite functions such as torch::nn::functional::InterpolateFuncOptions() and .index using raw pointers like float *
does anyone have ideas about it?
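One way to think about it, sketched in Python over flat row-major lists so each loop maps one-to-one onto `float *` indexing in C++. Assumptions: the interpolation mode is the default (nearest), each mask is one contiguous H x W plane, and `scores_mask` is a boolean keep-flag per mask. The function names are made up for this sketch.

```python
def resize_nearest(src, in_h, in_w, out_h, out_w):
    """Nearest-neighbour resize of one mask stored as a flat row-major list."""
    dst = [0.0] * (out_h * out_w)
    for oy in range(out_h):
        iy = min(oy * in_h // out_h, in_h - 1)      # source row
        for ox in range(out_w):
            ix = min(ox * in_w // out_w, in_w - 1)  # source column
            dst[oy * out_w + ox] = src[iy * in_w + ix]
    return dst

def select_masks(masks, keep, mask_size):
    """Boolean indexing on the first axis: keep only the flagged masks."""
    return [
        masks[i * mask_size:(i + 1) * mask_size]
        for i, flag in enumerate(keep)
        if flag
    ]

# Two 2x2 masks stored back to back; keep only the second, then upsample to 4x4.
masks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kept = select_masks(masks, keep=[False, True], mask_size=4)
resized = [resize_nearest(m, 2, 2, 4, 4) for m in kept]
print(resized[0])
```

In C++ the list slices become pointer offsets (`masks + i * mask_size`) and `dst` becomes a preallocated output buffer. If the original call used bilinear mode instead, the inner loop would need the four-neighbour weighted average rather than a single lookup.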
does anybody have experience analysing META data?
Guys, which Python packages should I learn, and from where, to master data science?
Python for Data Analysis by Wes McKinney is a good book to get you started. That's what I used. I will say make sure you spend a lot of your time learning the maths or you won't be able to use the modules to their full effectiveness
I'm impressed you are self taught, I thought you have some background, nice to hear and good luck
I swear learning the statistics for ML is fucking annoying. Half the time I'm trying to interpret the context of the notation and symbols. Like I'm learning in a section about multivariate gaussian distribution; why tf is sigma being used as a variable?! now I have to distinguish between sigma meaning 'sum of' and sigma as a variable. Apologies for the rant. Hopefully learning the linear algebra side will be a bit easier to interpret
learning about the plain gaussian distribution first, not the multivariate one, would be less confusing
sigma is variance
oh sorry std - standard deviation, sigma squared is variance
and maybe read "statistics for machine learning" rather than statistics without context, which is more annoying
I mean the big sigma, not the small sigma that denotes the variance. I swear, what were these statisticians smoking when they came up with these formulas and decided not to use distinguishable notation?
because this is a Frankenstein?
So many people got into this field from electrical engineering and brought a lot of signal processing ideas; some ideas come from stats, some from pure maths, some from physics, etc.
maybe machine learning mastery have sth like this dont sure
in fairness, the sigma used for covariance matrices is usually either bold or calligraphic, and the summation one has a sub- and superindex
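Side by side, the two uses of capital sigma look like this (standard textbook forms for the sample mean and the d-dimensional Gaussian density; the bold-face convention is the common one, though individual books vary):

```latex
% Sigma as a summation operator: it carries an index and limits.
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

% Sigma as the covariance matrix of a multivariate Gaussian: bold, no limits.
p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \, |\boldsymbol{\Sigma}|}}
  \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top
  \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```

If a capital sigma has limits attached, it is a sum. If it sits inside a determinant or has an inverse, it is the covariance matrix.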
for example I learned sth about student t distribution "where it is used?", my question was then
I saw data science full archive and there ah right its in t-sne (t-distributed),
poisson distribution ah in poisson regression etc
I mean statistics course without context, also sth like confidence interval, dont sure if they should teach this way
I've noticed that but even still, I've always seen the big sigma as 'sum of'. This is my first time actually seeing big sigma in another context. Im giving it another two weeks and then going to focus for a few weeks on learning linear algebra whilst going through the relational databases course on freecodecamp to learn some new skills
but it was not about calculating distribution but reading some stats lookup table
it's probably a good habit to always read the notation table at the beginning of a book. one lie that people learn in school is that math notation is somehow fixed and standardized
instead of substituting in formula
it isn't 😛 at all
I don't think the ISLP book has any form of notation guide. Tbf, the more I learn about statistics and the more I go through the book, the easier it gets to understand the overall concepts. I'm only after a surface-level understanding so I can use the ML modules effectively enough. I can always go in depth on the proper theory behind the concepts at a later date
question: why do they reduce from 768 dims to 2 dims with UMAP? couldn't it be t-SNE or another dim reduction method? it's in the NLP and transformers book by O'Reilly
hmm assuming you didnt read it its hard to explain
I couldn't tell you mate XD. Still attempting to learn the maths
hey, I'm at an intermediate level in Python and I'm looking for communities to join, to work with anyone interested on small projects etc.
I asked the bots about the foundations of data science. Does this tree diagram look complete
Only linear algebra and calculus
that is a ton of innovation baked into just 1 bubble node 👀 imagine what the mathematicians would say
This looks nice. I can't be bothered to read most of it.
Fair enough. Save for research purposes 
Never thought data science was so tough 😮
This probably explains the salary paid to them
the amount that a job pays is often a function of how much training/experience is required to do it
def calsal ():
Are you sure you need all of that?
because for example I know that MLOps is a whole field like ML
or you just need to have a shallow info about it like you do with ML?
plus you don't need all of these programming languages
just one would be enough
like python
where would AGI be?
I am not sure if it's the best roadmap but this website is kinda popular
Ah yeh those guys. I know about them
No clue 💀
but this roadmap even looks overwhelming and kinda confusing as they are speaking about two fields at the same time
lol, it's okay. Do you intend to expand this diagram or not?
but maybe they share these
Maybe add reinforcement learning, and/or that thing where AIs train themselves, but I forgot the name
I am looking to perfect my craft. Nothing more nothing less tbh
And contribute to society which I probably suck at
I like the roadmap peeps. They are building something to help others learn
And build 👍
try to get in touch with someone in the field you want to break in
tell them your background and let them guide you from there how to achieve your goal
I agree. It can be considered overwhelming. But the freedom of the field imho lies in the data + the science. The possibilities are endless
not sure what it's even trying to say tbh
like, does A -> B mean A contains B like machine learning -> deep learning?
does A -> B mean A is a prerequisite like LinAlg/Calc -> Stats?
does A -> B mean you should do A before B like EDA -> Feature Engineering?
Total guess here. The bot probably ran a decision tree or algorithm of some sort
also ig linear models, trees, clustering etc just don't exist anymore
How so
not on the map anywhere
whereas deep learning gets like a quarter of the entire graph
also I just realized it put automl under deep learning
The diagram I posted? That would probably just be in 1 node of that diagram
yeah
what I'm saying is the graph isn't great imo
I disagree
But. I understand your conjecture
If you frequent the Linux community. A diagram means squat
it's always better to ask someone already in the field and not just a beginner
but it's up to you
I don’t think you know who I know
would be interesting to see a timeline of when these different things on the map were invented
and how close that is to how they are chained
and how much is squished into the past few decades, while the math fundamentals go back centuries 😂
Honestly that is what I took from when I first pulled that visual. The statistics node has a timeline on its own merit.
Statisticians cook.
i'll add graph theory
maybe fuzzy logic
maybe etl
metaheuristics, data mining
fuzzy logic due to fuzzy clustering c-means, neuro-fuzzy networks
maybe its overboard
Knowledge representation and reasoning
Anyone up for simple DS project? For skill up?
I’m trying to clean and format a large raw text file. Does anyone know any methods that are best for cleaning large amounts of text?
In excel, right?
.txt files
I don't know sorry
I suggest go for prompt in Google collab and it'll clean the data
recursive self improvement? or unsupervised learning
the problem with RSI, if you have read the STOP algo paper, is that it has a significant bottleneck sadly. I think if you were really creative you could build off the implementations within that paper with multiple algos, maybe say beam search with top-p / top-k alteration, or some other sort of dynamic deducing algo
I can’t believe I started a debate over that diagram. What I find interesting is that what if the diagram read left to right. Or right to left. Perspectives can differ 
Hi guys, anyone here is expert in machine learning. Mainly in sklearn.svm SVC (support vector classification). I want to ask some questions
Don't ask to ask, just ask!!
I think it is very far from complete. Where would graph neural networks be? How about Neural ODEs? Or like contrastive learning? JEPA, energy-based models, geometric deep learning?
Idk the deep learning node
You could literally draw a timeline in just the machine learning and deep learning nodes. 2 nodes.
But then if you turn it into a decision tree, everything changes
or does it
Perplexing. You could legit earn a doctorates degree with this research
hello guys,
what are the best techniques to improve the accuracy of a classification model (tabular data with a lot of categorical variables)
- Feature Engineering: create new features by combining or grouping existing ones.
- Hyperparameter Tuning: test different model settings to boost performance.
ensemble methods also
After PCA I had more than 2 PC that I can use for clustering, and its impossible to work with more than 3, what should I do in such cases?
the tab:blue is goated
so im using seaborn to plot data, and im using a pairplot and i can set corner=True so it doesnt have duplicate plots on one side of it
but is there a way to have it plot a different type of plot on one side
like if i set corner=true it doesnt plot the ones on the top right, but is there a way to have it plot a kde plot on the top right?
obv i could just manually edit the images so its got the other type on the top right but it would be easier if there was a way to do it programmatically
also, any ideas for better ways to plot stuff?
docs at the very bottom may be helpful
thanks
nothing off the top of my head; it simply looks like there's no difference between male/female when it comes to these features
if you use something like plotly or hvplot, then you click an item in the legend and that will be hidden
e.g. if I made this in hvplot, I can click Male then all Male data will be set to alpha 0.3 (configurable)
(hvplot is more like a higher api that can wrap matplotlib, bokeh or plotly; the latter two can do what I described)
hmmm, it just overlays it, is there a way to hide the scatter plot on one side, setting corner=True makes it not draw the kde plot
Good day everyone, im new here and i just started learning Python. its only been a month of daily solving fundamental problems for each basic topic in python (not including the OOP topics), and im almost done with the basics. i want to dive into the world of Data Science and AI but im not sure what to do after im done studying Python basics: should i study Math and Statistics for Data Science, continue learning Python OOP, study Data Structures and Algorithms, or just go straight into Python Data Science libraries like NumPy and pandas?
i have to build a project for number/license plate detection and extracting the content from the plates. please guide me in this because it's my first computer vision project. Thank you
actually, if you just follow the link to PairGrid, doesnt it show what you're looking for?
g = sns.PairGrid(penguins, diag_sharey=False)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot)
When was this.
@sterile heath this is really interesting so does that mean that your offspring could have the same problem.
I don't know about the genetics of it. It can sometimes take two to tango.
But also, offspring are not a likelihood in my future.
Nor have I any.
Nor have I ever.
Yea it’s something to do with the chromosomes I think because it’s split into pairs. Half and half from the male and the other for the female.
forgive!! Enjoy ur life!!
I'm a data scientist who's looking for a paid internship opportunity, please how can you be of help to me? Thanks
man I've been working on this UCF paper for days, i think I'm almost done but i don't know exactly how to incorporate the visuals. I think I will create some type of "gallery" with the results? Also, do I need an endorsement for an arXiv submission?
You might wanna check LinkedIn
You probably won't if you have a university email.
what is job
in coroutines
/threads
been working on unsupervised data and i have a problem getting two annotated labels for the visualization to be visible, i've modified the xytext for about 2 hours and i've gotten zilch results
here is the syntax for the code, if anyone can help:
colormap = plt.get_cmap('tab20', num_clusters)
colors = [colormap(i) for i in range(num_clusters)]
plt.figure(figsize=(17, 12))
for cluster_num in range(num_clusters):
    cluster_points = tfidf_matrix_reduced[df['cluster'] == cluster_num]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], color=colors[cluster_num], label=cluster_to_genre[cluster_num], alpha=0.9)
plt.title('Book Genre Clusters with 2D PCA')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.annotate('Popular / Romantic →', xy=(0.42, 0), xytext=(0.63, 0.18), fontsize=11, color='blue')
plt.annotate('← Serious / Thought-Provoking', xy=(-0.5, 0), xytext=(-1.0, 0.2), fontsize=11, color='red')
plt.annotate('↑ Literary / Award-Winning', xy=(0, 0.5), xytext=(0.05, 0.75), fontsize=11, color='purple')
plt.annotate('↓ Genre Fiction / Mass Appeal', xy=(0, -0.5), xytext=(0, -1.0), fontsize=11, color='green')
plt.legend(title='Predicted Genre', bbox_to_anchor=(0.5, -0.08), loc='upper center', borderaxespad=0., ncol=num_clusters // 2 if num_clusters > 2 else 2)
plt.tight_layout(rect=[0.08, 0.12, 0.92, 0.95])
plt.subplots_adjust(bottom=0.22)
plt.grid(True)
plt.show()
Hii!!
so i have a question regarding pyomo.
my problem has an objective, and a stationary, for ideal convex, stationary value is 0. but for any real case, stationary value can never be zero, but i want it to move towards zero. like closer it is to zero, better,
objective function, is entirely different value, but PS. stationary and objective, cant be in objective object of pyomo, as that makes them compare, which shouldnt happen in my setup.
Hi everyone,
I'm looking for some guidance and would really appreciate your help.
I'm not a data scientist by profession, but I've been learning machine learning and working with Python. I'm currently building a tool that tracks data transformations before the data is fed into a model for training.
Right now, I'm trying to find example projects—either ones you've worked on yourself or available online—that I can use to test my tool. I'm primarily focusing on transformations using pandas, NumPy, and scikit-learn.
I can build basic pipelines myself (e.g., using fillna, one-hot encoding, PCA, etc.), but since I don’t have experience with real-world projects, I’m not confident I’m covering all the important cases. Any pointers to existing pipelines or datasets with preprocessing steps would be incredibly helpful.
Thanks in advance for your guidance!
https://claude.ai/public/artifacts/0baf9898-9a1e-4da4-bed8-8a6f52c23d3b
This is what I got so far.
I find the feature space fascinating. It's like the DNA of each dataset.
hi guys, im trying to getting into the AI world, but idk where to start and where to learn, can you please help me?
That all depends what the most effective learning strategy is for you. 😄
Out of all the feature spaces i looked at guys, theres hundreds. Look at this! I dont know what it is , but this one is hypnotic.
people hiring alot for ml engineers?
I never saw this, thanks for the feedback this is great. I've updated my paper, please feel free to critique it. I will look into what you told me todo though. Seems like github is the way to go.
https://aima.cs.berkeley.edu/ The standard introduction to AI book.
Hello what should I as a newbie to ML use ? plotly or matplotlib for visualization
what do advanced users use ?
is there anything better than these two ?
for simple things, plotly express is very straightforward and can make interactive plots
(plotly express is a sub-module on plotly)
for more custom things you might want to look into matplotlib or seaborn
Matplotlib is older and its API is based on a different language (namely Matlab).
People use what they prefer. I've gravitated to plotly.
iirc Vega Altair is also somewhat popular, but personally I also prefer plotly
When I do use matplotlib, it's only via pandas. Imo matplotlib has the worst API of any data science library
And it isn't close
OK,thank you
Wassup guys
beautiful!!
This image shows what happens when you rank all samples from different data types (images, text, time series, and tabular data) by a single universal complexity measure and divide them into 10 difficulty bins.
Its 141k+ samples from 50+ datasets.
I think the fact the different data types naturally separate into ten distinct difficulty regions helps bring this all home.
so i was wondering what SQL will i master(or start learning) can you help me decide?
what type of SQL is best for Python Data Science?
- MySQL
- PostgreSQL
- Oracle
- others(mention it 🙂 )
I like option 2
In the long run I think it would do you the most justice because of its analytical capabilities.
I was taught to develop in vanilla SQL. It codes to all 4 on that list
I got the idea to create a unified dataset across a smaller subset of 14k samples. results confirm the exact same stratification pattern we saw in the larger test
Hello
Is there a book which discusses multi class classification techniques like OVR AND OVO
And explicitly states those names, and not just the math
so ive got a dataset thats got a date column and a time column and im reading the csv into a pandas dataframe, and i can set parse_dates=['date'], date_format='%d-%b-%y' and read the date just fine
but i dont see how i set it to read the date and the time if they are in separate columns
ah data.time = pd.to_timedelta(data.time) worked
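for anyone hitting the same thing, a minimal sketch of the date + time approach described above. The column names (`date`, `time`, `value`) and the sample rows are made up for illustration; only `pd.to_datetime`, `pd.to_timedelta` and the `%d-%b-%y` format come from the messages above.

```python
import io

import pandas as pd

# Hypothetical CSV with separate date and time columns (names and rows are assumptions).
csv_text = "date,time,value\n05-Mar-24,09:30:00,1.5\n06-Mar-24,14:15:30,2.5\n"

data = pd.read_csv(io.StringIO(csv_text))

# Parse the date with an explicit format, and the time as an offset from midnight.
data["date"] = pd.to_datetime(data["date"], format="%d-%b-%y")
data["time"] = pd.to_timedelta(data["time"])

# Adding a timedelta to a datetime gives one combined timestamp column.
data["timestamp"] = data["date"] + data["time"]
print(data[["timestamp", "value"]])
```

keeping `time` as a timedelta means the combined column is just an addition, so there's no string concatenation before parsing.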
I'm looking for someone interested in networking for Data Science. I'm very motivated and would like to meet people who are in tune with me
Guys, is it possible to train a resnet50 model with 640 x 640 images?
Here is my collate_fn:
```py
def collate_fn(batch):
    images = list(image.to(DEVICE) for image, _ in batch)
    targets = []
    for _, target in batch:
        boxes = []
        labels = []
        for annotation in target:
            bbox = annotation['bbox']
            # Convert from [x, y, width, height] to [xmin, ymin, xmax, ymax]
            xmin = bbox[0]
            ymin = bbox[1]
            xmax = bbox[0] + bbox[2]
            ymax = bbox[1] + bbox[3]
            boxes.append([xmin, ymin, xmax, ymax])
            labels.append(annotation['category_id'])  # Use 'category_id' from COCO
        targets.append({
            'boxes': torch.as_tensor(boxes, dtype=torch.float32).to(DEVICE),
            'labels': torch.as_tensor(labels, dtype=torch.int64).to(DEVICE)
        })
    return images, targets
```
I use my data already processed with roboflow, its pretty much like this:
DATA_DIR = '/content/oreo-1'  # Replace with the actual path
TRAIN_ANNOTATION_FILE = os.path.join(DATA_DIR, 'train_annotation/_annotations.coco.json')
TRAIN_IMAGE_DIR = os.path.join(DATA_DIR, 'train')  # Adjust as needed
VAL_ANNOTATION_FILE = os.path.join(DATA_DIR, 'val_annotation/_annotations.coco.json')
VAL_IMAGE_DIR = os.path.join(DATA_DIR, 'valid')  # Adjust as needed
NUM_CLASSES = 122  # Replace with the number of classes in your dataset (e.g., 80 for COCO)
BATCH_SIZE = 4
LEARNING_RATE = 0.001
NUM_EPOCHS = 175
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
CONFIDENCE_THRESHOLD = 0.5
IOU_THRESHOLD = 0.5
hmm I thought collate is for text but its for images too why?
I see maybe too much of epochs
hmm I think not, resnet is for small images
must resize them
also it is longer to process bigger images
dont remember what size was sth like 32 or 64, check it
is temperature of llm related to simulated annealing?
In both contexts, the "temperature" is a parameter that controls how likely the model is to intentionally make a suboptimal decision
A higher temperature is a higher likelihood that the model will do that.
If you've taken chemistry, you'll remember that higher temperature systems are more chaotic, and stuff.
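a rough sketch of that idea in both settings. The function names and the example scores are made up for illustration; the formulas are the usual softmax-with-temperature and the Metropolis acceptance rule commonly used in simulated annealing, not taken from any particular library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; the scores are divided by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def annealing_accept_probability(delta_cost, temperature):
    """Metropolis rule: chance of accepting a move that worsens the cost by delta_cost."""
    if delta_cost <= 0:
        return 1.0  # improvements are always accepted
    return math.exp(-delta_cost / temperature)


logits = [2.0, 1.0, 0.0]  # made-up token scores
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(annealing_accept_probability(1.0, t), 3))
```

in both cases a higher temperature flattens things out: the best-scoring option gets less probability mass, and a worse move is accepted more often.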
Hey guys, is there anyone who has an idea on how to create custom tokenization for OCR or how to train Tesseract using custom datasets?
Which version of tesseract are you planning on using? There's a lot going on here. You might want to really research this. Maybe google it and have a chat with AI about brainstorming methods.
I mean the answer is right there. Dont ask us I feel, you should have been showing us results by now!
But what insight do you seek that you probably can find yourself?
What I've found in my research is we all learn differently. If we treat humans as individual datasets, we all have our unique learning methods, some even hidden.
😄
I finally got this visual working!
It's the culmination of processing 62 datasets from 4 domains, the 141k+ samples
can a data analyst please reach out to me? i'm struggling in my internship with no mentor.
I think ResNet requires resizing your image to 224x224
If you had mentioned what you're struggling with in your internship, someone would most likely have responded by now.
You can also use the #career-advice channel to get advice on how to surmount the challenge you're facing.
Meanwhile, who else is submitting their work to NeurIPS? 😛
Thanks for the honest feedback! I agree that most of this info is out there, and I have been researching it—I’m actually experimenting with both a custom CRNN model and now considering Tesseract 5.4.1 for comparison.
My main challenge is fine-tuning Tesseract on a custom script like Balochi, especially around generating accurate training data and understanding the lstmtraining process. I’m not looking for spoon-fed answers—just hoping someone might have hands-on experience or tips that could help speed things up a bit.
And you're totally right—we all learn differently. For me, discussing things out loud (or in chat) helps uncover blind spots and validate whether I'm on the right track. 😄
yes youre right 224x224, I checked in search
it was 224 not 24, as I guessed 32 or 64 size, my bad
but still relatively small images
I meant I thought it is 24x24 nvm
ah maybe you want to fine tune it, then its different thing
Hello everyone, to study ai/ml and robotics do you need to learn about electricity and how does it work (asking as a self-taught programmer)
for ai/ml not really
for robotics you'll likely want to have a decent notion of physics though
Aha so it's like i don't need to study it deeply right?
Nah I mean my CS course has robotics in it and it covers kinematics briefly and control (PID / MPC), Reinforcement learning, Markov decision processes, some sensor stuff and just general deep learning
If I have large amounts of data ~5million training points for a relatively small CNN with 0.4M params, should I be running the full dataset per epoch or only a subset? How would I estimate the number of batches per epoch to try?
Obviously if I run the full dataset per epoch my LR scheduler will kick in way later, but are there other benefits?
an epoch is defined as a full pass over the dataset.
okay, thank you! Is it at all common practice to have a LR scheduler act within an epoch?
I'm not sure how common it is, but you get to decide if you want to call it after each batch, or after every n batches, or once at the end of each epoch, or what
Okay! thanks for your input 🙂 Looking at some literature on similar networks i see they halve the LR every 2*10^5 minibatches which seems like a good starting point
When training residual super resolution networks, is mode collapse a problem? I.e. mean and var collapsing to zero because the low and high res images are already close to each other?
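a small sketch of what "scheduler acting within an epoch" can look like if the schedule is keyed on the global minibatch count instead of the epoch. The 2*10^5 halving interval and the ~5 million sample count are from the messages above; the batch size, base LR and function names are made-up placeholders.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of minibatches in one full pass over the dataset."""
    return math.ceil(num_samples / batch_size)


def step_halving_lr(base_lr, global_step, halve_every=200_000):
    """Learning rate after `global_step` minibatches, halved every `halve_every` steps."""
    return base_lr * 0.5 ** (global_step // halve_every)


num_samples = 5_000_000  # roughly the dataset size mentioned above
batch_size = 256         # hypothetical
steps = batches_per_epoch(num_samples, batch_size)
print("minibatches per epoch:", steps)

# The schedule depends only on how many minibatches have been seen,
# so it can change the LR in the middle of an epoch.
for global_step in (0, 199_999, 200_000, 400_000):
    print(global_step, step_halving_lr(1e-3, global_step))
```

in a framework this usually just means calling the scheduler's step once per batch instead of once per epoch.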
idk what that even is
For image super resolution you can increase efficiency by upscaling the image with normal upscalers such as bicubic or lanczos and then learn a delta that gets added to this.
instead of using transposed convolutions to reach a higher resolution output from the low res input
I believe this is the paper that introduced the approach initially.
Okay they were not the original, seems like SRResNet came before them.
thought id try asking here since stuff like pandas is related but anyone got some experience/recommendation on a python library to convert pdf to possibly csv? Im fine managing the data cleanup itself but looking for other options. I have tried pdfplumber and its not working well. tabula works quite well but it relies on java which im not a fan of needing to have that as part of my app
like OCR libraries?
not that complex. basically trying to convert pdf bank statements into CSV. whole bunch of junk that isnt needed in there
if you don't mind using LLMs then this seems interesting https://pypi.org/project/llama-parse/
I did see that one as well as one called ThePipe and not really a fan of feeding private financial data to an LLM, especially if I ever want to release this application for others
the two ones I remember out of the top of my head are https://github.com/microsoft/OmniParser and https://pymupdf.readthedocs.io/en/latest/
hmm ill take a look at those as well thanks. what input formats I support will be of course heavily limited to what formats I need so that will make it more forgiving to try
huh tabula creates massive lists of what it parses. and each entry in the list is a table. Well thats notable but kinda annoying
So I took the UCF stuff and decided to make a trading bot. The idea here is to represent market structure as a complex number, mapping the market into this phase space where I can visualize different regimes way more clearly.
What I did was build these layers that all talk to each other. Like one part figures out what market "regime" we're in (trending, choppy, whatever), another part picks the right strategy for that regime, and another part handles risk. The cool thing is they all continuously adapt, no retraining needed.
if you've got a trading bot, sounds like your career change is all set 😂
I just started it, I imagine its gathering enough data to figure out the current market regimes. I'll know by this morning. You know.. I hate messing with the trading logic in these things lol
I made a bunch of visuals, hoping to have some stuff to share.
How's it going? Any luck on this stuff?
Thought y'all might like this
any idea if its possible to resize the squares in a seaborn or plotly heatmap?
basically i saw this and thought it could be handy
the code for it is at https://github.com/ChawlaAvi/Daily-Dose-of-Data-Science/blob/main/Plotting/Size-encoded-heatmaps.ipynb
i got it working but it doesnt look right
Anyone aware of something one can let loose on a set of git repos with a task and get plans/code changes out of it
like the squares dont line up properly
what do you mean by 'plans' like are you just wanting to see recent changes to the repo? Git basically shows you all of that in history and such
I want to submit tasks with example code patterns and fixes and get reports on occurrence and or create prs with fixes
So an agent system to apply global changes to about 100 git repos
So you want to create like a report against a bunch of repos based on a code pattern you provide, like you are looking for vulnerabilities or improper code blocks and then change them?
@final jolt exactly
Well thats a lot of prep to do. The only things I have encountered like that are custom things in a corporate environment. What I would suggest starting with would be simple scripts that can do pattern matching for some example code you are trying to look for. Once you get that working checking a bunch of repos will be the easier part (automatically editing them is another matter though)
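as a starting point, the "simple pattern-matching script" could be sketched like this. Everything here is a hypothetical illustration: the `scan_repos` name, the `eval(` pattern, and the throwaway demo directory standing in for real checkouts. It only reports occurrences and does not edit anything.

```python
import re
import tempfile
from pathlib import Path


def scan_repos(repo_dirs, pattern, glob="*.py"):
    """Return (repo, file, line_number, line) for every line matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for repo in map(Path, repo_dirs):
        for path in sorted(repo.rglob(glob)):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((repo.name, str(path.relative_to(repo)), lineno, line.strip()))
    return hits


# Tiny self-contained demo using a temporary directory instead of real repos.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "demo_repo"
    repo.mkdir()
    (repo / "app.py").write_text("import os\nresult = eval(user_input)\n", encoding="utf-8")
    demo_hits = scan_repos([repo], r"\beval\(")
    for hit in demo_hits:
        print(hit)
```

once the report looks right for one repo, pointing it at a list of local checkouts is the easy part; automated fixes would need a separate, much more careful step.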
The least worst is probably something like https://github.com/All-Hands-AI/OpenHands or just a custom RAG system, but in general agents are nowhere near reliable enough to do that well yet
even for a single git repo you'd get mixed results, let alone 100 at once
yea very much this which is why Ive only seen this type of thing done custom at any level of scale. And even then its not something that goes out and actively change repos but more just data integrity, syntax and format checking and validating. And most of the time as part of the commit workflows because that is the best place to fix issues like that
imo
Can anyone teach me AI/ML
dont post the same message in multiple channels. Also someone responded to you in #python-discussion already earlier
I cannot
Oh
```
Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
Using 'backbone_name' and 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-5af1204afd6e> in <cell line: 0>()
----> 1 model = create_retinanet_model()
3 frames
/usr/local/lib/python3.11/dist-packages/torchvision/models/detection/backbone_utils.py in <lambda>(kwargs)
63 weights=(
64 "pretrained",
---> 65 lambda kwargs: _get_enum_from_fn(resnet.__dict__[kwargs["backbone_name"]])["IMAGENET1K_V1"],
66 ),
67 )
TypeError: unhashable type: 'list'``` I got this error when executing this method ```def create_retinanet_model():
    # Load a pre-trained ResNet50 backbone
    backbone = torchvision.models.resnet50(pretrained=True)
    test = list(backbone.children())[:-2]
    backbone = torch.nn.Sequential(*test)  # Remove the last two layers
    # Input channels for the feature pyramid network. Resnet50 outputs 2048
    in_channels_list = [2048, 1024, 512]  # Channels for P3, P4, P5
    # Output channels for FPN
    out_channels = 256
    # Create the feature pyramid network.
    fpn = resnet_fpn_backbone(in_channels_list,out_channels)
    # 91 because of the background class
    num_classes = NUM_CLASSES
    # Anchor generator
    anchor_generator = AnchorGenerator(
        sizes=((32,), (64,), (128,), (256,), (512,)),
        aspect_ratios=((0.5, 1.0, 2.0),) * 5
    )
    # Put anchor generator inside the model
    model = RetinaNet(backbone,
                      num_classes=num_classes,
                      fpn=fpn,
                      anchor_generator=anchor_generator)
    return model
```
why is this?
I just want to remove the two last layers of torchvision Resnet50 backbone
Its a bit of a stretch
Im looking for sensible building blocks so i can set it up as something iterative with feedback
Well functionally you are asking for something very large and complex that doesnt really exist and is very difficult to do is what we are saying. So the better approach would be to simplify your goal to what is more feasible and start there and then try to expand on it as you improve its functionality
Starting point would of course be one repo at a time
There'd be research/locate, possibly some indexing, and eventually something that fires that chain at all the repos like a madman
I lack familiarity with concrete building blocks wrt Ai running knowledge store and state management
anyone familiar with pymupdf?
please remember to always--every time--ask your actual question. please never ask "does anyone know about x". just ask your actual question about x, and people will know it's about x from reading it.
yea sorry. got sidetracked lol
basically getting this error
```
  File "D:\scripts\pybudget\pdf_convert.py", line 57, in <module>
    pymu_pdf(pdf_path, csv_path)
  File "D:\scripts\pybudget\pdf_convert.py", line 35, in pymu_pdf
    pprint(tabs[0].extract())
TypeError: 'module' object is not callable
```
when trying to just extract tables from a pdf. One off the table parsing works but trying to iterate it is failing
```py
def pymu_pdf(pdf_path, csv_path):
    pdf = pymupdf.open(pdf_path)
    print(f"Total pages: {len(pdf)}")
    for pages in pdf:
        if pages.number == 2:
            tabs = pages.find_tables(strategy="text")
            if tabs.tables:
                pprint(tabs[0].extract())
```
this actually isn't a pymupdf problem. it's a naming problem. so remember to never ask "does anyone know about x", because your problem might actually have nothing to do with x.
did you just do import pprint?
yup and you are right the example is wrong on their docs
because pprint is a module that contains a function that's also named pprint. so if you do import pprint, then pprint is a module. if you do from pprint import pprint, then it's a function
I recommend doing import pprint as pp and then pp.pprint. that way it's never a mystery which one pprint is.
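a quick sketch of the difference, using the `import pprint as pp` style suggested above. The `data` dict is just a made-up example.

```python
import pprint as pp  # `pp` is the module, not the function

data = {"b": [1, 2, 3], "a": {"nested": True}}

print(callable(pp))         # the module itself cannot be called
print(callable(pp.pprint))  # the function inside it can

pp.pprint(data)  # module.function form, no ambiguity about which name is which

# Calling the module directly reproduces the error from the traceback above.
try:
    pp(data)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

with `from pprint import pprint` the bare name `pprint` is the function instead, which is why the same call works with one import and fails with the other.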
yup that was the issue, I was originally doing print and had errors so went back to the example to test and missed the pprint from pprint
there was actually a PEP that could have fixed this, but it was rejected
heh, bummer, and yea I have been good about just asking until this time heh. thanks for the info
now I can try this again with pprint to see if this works. However either way this is major progress as I was trying with tabula before and it was very cumbersome
oh that is soooo much better
got it
i basically drew white rectangles above the heatmap then drew colored scaled rectangles above those
It would have prevented a module and a function from having the same name?
Why would you ever import pprint instead of from pprint import pprint though? Similar to datetime.datetime
im not sure if this comes under this channel but
how are numpy arrays structured? like how does it compare to matrices and their notation? (is a 3x4 matrix the same as a numpy array with shape (3, 4)? will using functions like np.dot() on such an array yield the same results as the same operation on a 3x4 matrix (and a 4x1 vector)?)
im asking specifically for like visualisation. in math usually the first number corresponds to the number of rows and im basically just wondering if thats the same for numpy. (and if itll work the same for matrix operations)
sorry if my phrasing is slightly off. kind of new to both linear algebra and numpy
numpy ndarrays with 2 dimensions work just like matrices, yes
thank you! for arrays larger than that, does the order go from the highest order array to the lowest (im not sure if im saying that right)? like would a size of (2, 3, 4) mean 2 slices, 3 rows, and 4 columns?
yeah that should be the case considering numpy is row-major by default. if you rely on functions like np.matmul (the @ operator) for multidimensional operations, the default behavior will work by treating the last 2 dimensions as rows and columns (np.dot follows a different rule once arrays have more than 2 dimensions)
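a small sketch to check the shape conventions discussed above. The array contents are arbitrary; it uses `@` (`np.matmul`) for the 3-D case, since that is the function that treats the last two axes as rows and columns.

```python
import numpy as np

# A 3x4 matrix: shape is (rows, columns), same as the usual math notation.
A = np.arange(12).reshape(3, 4)
v = np.ones((4, 1))  # a 4x1 column vector

product = A @ v  # matrix-vector product, same result as np.dot(A, v) for 2-D inputs
print(A.shape, v.shape, product.shape)

# A (2, 3, 4) array: 2 slices, each one a 3x4 matrix.
T = np.arange(24).reshape(2, 3, 4)
print(T[0].shape)  # one slice

# matmul applies the (3, 4) @ (4, 1) product to every slice.
batched = T @ v
print(batched.shape)
```

so the first number is the row count for 2-D arrays, and extra leading axes just act as a stack of matrices.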
why not pp.pp 😁
4-p
yea I just did it as from pprint import pprint as pp since I didnt see any reason to just import pprint and then have to do pprint.pprint()
Now if I could get pymupdf to be more consistent that would be cool. parsing pdfs suck for sure.
are you aware of pprint.pp ?
ah no, I also never really use pprint and only did here because I was following some docs. I dont functionally need pprint for anything at the end of the day
any idea when to use standard scaler vs minmax scaler in scikit learn?
So short version I am trying to use matplotlib to display gridlines on a page rendered from a pdf. I got all that working however I am trying to adjust the grid line spacing with no success. I thought the correct parameter was markevery but that seems to just be for an actual graph
```py
DPI = 150
pix = chosen_page.get_pixmap(dpi=DPI)
img = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)
plt.figure(dpi=DPI)  # set the figure's DPI
plt.title("title")  # set title of image
plt.grid(True)
plt.grid(markevery=10)
plt.grid(color='gray', linestyle='--', linewidth=0.5)
_ = plt.imshow(img, extent=(0, pix.w * 72 / DPI, pix.h * 72 / DPI, 0))
plt.show()
```
code snippet here. the gridlines seem to default to every 100
*edit* I never got this to work but bruteforced what I needed but I am curious how to make this work if anyone wants to weigh in
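one possible way to do it, as a sketch: matplotlib draws gridlines at tick positions, so the spacing is controlled by the tick locator rather than by `plt.grid` itself (`markevery` is a marker property of plotted lines). The blank 300x400 image below is a stand-in for the rendered PDF page, and the spacing of 10 is just the value from the snippet above.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator

# Stand-in for the rendered page (the real code builds `img` from a pixmap).
img = np.zeros((300, 400, 3), dtype=np.uint8)

fig, ax = plt.subplots(dpi=150)
ax.imshow(img, extent=(0, 400, 300, 0))

# Keep the default labelled major ticks, and add unlabelled minor ticks every 10 units.
ax.xaxis.set_minor_locator(MultipleLocator(10))
ax.yaxis.set_minor_locator(MultipleLocator(10))

# Draw gridlines at both major and minor tick positions.
ax.grid(True, which="both", color="gray", linestyle="--", linewidth=0.5)
```

using minor ticks for the fine grid avoids cramming a tick label every 10 units; `set_major_locator` would work the same way if labels at every line are wanted.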
Hi guys, basically if i wanted to do a final year project that combined data analytics and machine learning, do u guys have any good resources i can use to study to get a basic understanding of both? Idk which channel to ask this. I've been looking for video tutorials and resources myself but additional resources from other people would be useful.
!res
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
have you checked in that page? I think there was a section for ML and such but I could be wrong
oh thanks
hmm looks like more generalized stuff there but could be a good starting point. Certainly some YT channels have other video series for more specific topics as well
yeah
thanks for the help tho
interesting
I swear this join_tolerance doesnt do jack in PyMuPDF, heh
ok pages.search_for() is goat
if there is crewai ok but what else to understand how it works?
crewai is very high-level is there sth more verbose?
I don't get what you are asking whatsoever?
what exactly are you trying to understand
if you want to understand how the models themselves work, look up Andrej Karpathy's tutorials
if you want to understand how to build a simple agent using an LLM, use the HuggingFace Transformers library directly
if you want to orchestrate multiple agents more manually, either still just Transformers or PydanticAI / LangChain
in crewai you just prompting what agent must do and task description
but this works under the hood I assume some tokenization and nlp related things
maybe more manually ok so langchain then
langchain still abstracts away all "nlp related things"
if you want to see how tokenization works then start from Andrej Karpathy's videos
most "Agents" just replace all NLP techniques with one giant black box (the LLM): instructions go in, then text, a tool call, or a well-formatted object comes out
depending on where you look you may find normal code mixed in the orchestration though, some of which may or may not be using classical NLP techniques
For example, before sending to an agent you can use a simple metric to determine which agent to send to (or not send to one at all)
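A rough sketch of that routing idea, assuming a plain keyword-overlap score as the "simple metric". The agent names, keyword sets, and `min_score` threshold are all made up for illustration; no agent framework is involved.

```python
import re

# Hypothetical agents and the keywords that suggest each one.
AGENT_KEYWORDS = {
    "billing_agent": {"invoice", "refund", "payment"},
    "tech_agent": {"error", "crash", "install"},
}


def route(message, min_score=1):
    """Return the agent whose keywords overlap the message most, or None."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    scores = {agent: len(words & keywords) for agent, keywords in AGENT_KEYWORDS.items()}
    best_agent = max(scores, key=scores.get)
    if scores[best_agent] < min_score:
        return None  # nothing matched well enough, so skip the LLM call entirely
    return best_agent


print(route("I need a refund for my invoice."))
print(route("Install fails with an error"))
print(route("hello there"))
```

Anything cheap works in place of the keyword count (a regex, a small classifier, embedding similarity); the point is that ordinary code decides before any LLM is called.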
yeah, if you want to mess around with agents or crews I'd strongly recommend understanding what the inputs and outputs of the llms look like
ideally also how the llm (neural network) itself works, but honestly that isn't vital unless you plan to customize the underlying model
Well finally got this pdf parsing nearly complete and I can get back to the pandas side of thing. Still odd that it splits one of the columns randomly but what are ya gonna do I guess. other than join them I mean
I gotta fix the text above, I've got an overlapping problem, but man, what it captured is awesome. The different unique signatures ("fingerprints") of a few crypto coins.
nice looking graphics, though what exactly are the data points being graphed here? complexity of a crypto coin in what sense?
hey dude thank you! each dot represents a specific market state at a moment in time.
ah so you are interpreting like certain aspects of market change as complexity? Thats neat.
stuff like number of trades, price changes, etc?
exactly! instead of just, is the market going up or is it volatile? i'm measuring how structured vs random the behavior is (through phase θ), how typical vs unusual the current conditions are (component A), and the overall energy/difficulty level (magnitude |Φ|)
Very clever. So is there a way to tell the timing of each data point, or is this more to get a better feeling for the overall "volatility" of a coin over a given day? A fingerprint, as you called it.
each dot represents 1 min of market state over 30 minutes, but rather than tracking volatility over time, I'm mapping their distribution in "complexity space" to hopefully reveal their structural patterns.
thank you 😄
i was thinking why not make it a continuous feedback loop also.
Its already learning..just in a new type of way
hey, I use homebrew as my installer and I'm trying to install flake8 in jupyterbooks. Anyone have any experience doing so? I can't get brew to recognize jupyterlab-flake8
I've done foundational courses (andrew ng) and more deeplearning.ai courses but i don't understand how to start a project or what do I do, I have 0 practical knowledge. where do I start practicing ml?
OK so I made it automatically tune strategy parameters every 4 hours. It analyzes win rates and profit factors for each strategy: underperforming strategies get parameter adjustments (tighter stops, adjusted take profits), outperforming strategies get optimized for even better results, and cooling periods prevent over-adjustment.
Every 24 hours it builds and updates the "fingerprints".
It clusters UCF states, analyzes performance by cluster, and creates an asset-specific "memory" of which complexity states are profitable, which in turn influences future trading via confidence adjustments and position sizing.
I added realtime feedback to boost or reduce confidence based on historical performance in similar states. It adjusts confidence when phase alignment is strong and modifies position sizing based on historically profitable clusters. Most importantly, it saves and loads all these learned adjustments in a pickle, which includes strategy parameters and the state checkpoints.
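For anyone curious what the save/load part can look like, here is a minimal sketch with the standard `pickle` module. The dictionary layout, key names, and file name are hypothetical, not the actual structure used in the bot above.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical learned state: strategy parameters plus per-cluster adjustments.
learned_state = {
    "strategy_params": {"momentum": {"stop_loss": 0.02, "take_profit": 0.05}},
    "cluster_confidence": {0: 1.1, 1: 0.8},
    "checkpoint_minute": 240,
}


def save_state(state, path):
    """Write the state to disk so it survives a restart."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load_state(path, default):
    """Load a saved state, or fall back to a default on the first run."""
    path = Path(path)
    if not path.exists():
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = Path(tmp_dir) / "learned_state.pkl"
    first_run = load_state(state_path, default={})  # nothing saved yet
    save_state(learned_state, state_path)
    restored = load_state(state_path, default={})

print(first_run, restored == learned_state)
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.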
If you were to loop over the elements in order as they are in memory (contiguous access) and then compute the N-dimensional index, the last element of that index would be the fastest changing, and the first the slowest changing.
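A small pure-Python sketch of that, assuming row-major (C-style) layout, which is NumPy's default. The `unravel` helper and the (2, 3, 4) shape are just for illustration; `np.unravel_index` does the same job for real arrays.

```python
import math


def unravel(flat_index, shape):
    """Convert a flat memory offset into an N-dimensional index (row-major)."""
    index = []
    for dim in reversed(shape):
        # Peel off the last axis first: it is the one that changes fastest.
        flat_index, position = divmod(flat_index, dim)
        index.append(position)
    return tuple(reversed(index))


shape = (2, 3, 4)
indices = [unravel(offset, shape) for offset in range(math.prod(shape))]

for offset in range(6):
    print(offset, indices[offset])
```

Walking the offsets in order, the last component ticks up on every step, the middle one every 4 steps, and the first one every 12 steps.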
https://drive.google.com/drive/folders/1UVWdutaFWTrw9DTeRr6v3uL5NvvTgDLX?usp=sharing
Heres the complexity space images of 12 different crypto coins.
People like to start with kaggle stuff. Try some reinforcement learning games.
thanks, can you be more specific, I don't know how I will start taking part in competition without practical knowledge
You can help me tackle this https://www.kaggle.com/competitions/stanford-rna-3d-folding
I made the tool to do it, I just need to write the code.
I want to build something to test this theory
I would need to first Transform RNA sequence data into a format suitable for the UCF
I'm thinking of just adding the logic to the data preprocessing for an "rna_sequence" domain, potentially tailored to the θ calculation, while reusing the N, A, ϵ logic where possible. It actually sounds fun. Anyone got any ideas how to visualize this?
Ok, I built the pipeline for the RNA 3D structure prediction that applies the UCF to biological sequences. I'm using that Kaggle dataset from the comp. It's basically applying mathematical complexity theory to biological structure prediction. Might be a bit before there are visuals but I'm excited 😄
heres a couple that came in
visuals need work though 😛
Kaggle has mini-courses that walk you through using notebooks on their website and how to do competitions (the Titanic dataset is used as a tutorial IIRC)
Just Google kaggle learn for the mini-courses
This is the Titanic competition just in case https://www.kaggle.com/competitions/titanic
Predicted RNA folding pattern visualization
I’m trying to learn numpy working with opencv but there’s no good vid in YouTube that teaches about it, please help or give me advice if y’all can
Read docs?
And what're you actually trying to work on? Learning library internals won't help if you're not working on a project, you'll end up forgetting the API
I’m trying to learn how to make a server and client, so I want a video stream of the client
And server
Shouldn't be a numpy problem. Do you by any chance mean collecting frames from a stream, say a webcam or rtsp network?
Yes webcam
Oh you're going to have to expose the RTSP link for your webcam to the opencv cv::VideoCapture API, shouldn't be a herculean task provided I've given some leads already.
It wouldn't be fun if you were just told what to do as well, so go ahead and break things😁
Also if you're doing any heavy inference of some sorts on the frames you'd also need to either:
- Learn threading, python has the threading module, mutexes, GIL, so on
- Or not learn anything and use the Inference library which is a pain to set up dependencies if you don't use a separate venv, you'll probably need some docker experience as well for this one
so I'd just recommend the former, cuz you'll learn things as well from the process
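A minimal sketch of the threading route, assuming the frame source is any callable that returns `(ok, frame)` the way OpenCV's `VideoCapture.read` does. The fake camera below is a stand-in so the sketch runs without a webcam; `capture_loop` and `run` are made-up helper names.

```python
import itertools
import queue
import threading
import time


def capture_loop(read_frame, frames, stop_event):
    """Grab frames in the background, keeping only the newest one."""
    while not stop_event.is_set():
        ok, frame = read_frame()
        if not ok:
            break
        if frames.full():
            try:
                frames.get_nowait()  # drop the stale frame
            except queue.Empty:
                pass
        frames.put(frame)


def run(read_frame, process, max_frames):
    """Process frames on the main thread while a worker thread captures."""
    frames = queue.Queue(maxsize=1)
    stop_event = threading.Event()
    worker = threading.Thread(
        target=capture_loop, args=(read_frame, frames, stop_event), daemon=True
    )
    worker.start()
    results = []
    try:
        while len(results) < max_frames:
            results.append(process(frames.get(timeout=2.0)))
    finally:
        stop_event.set()
        worker.join(timeout=2.0)
    return results


# Fake camera: returns an increasing frame number roughly every millisecond.
_counter = itertools.count()


def fake_read():
    time.sleep(0.001)
    return True, next(_counter)


demo_results = run(fake_read, process=lambda frame: frame, max_frames=5)
print(demo_results)
```

With a real webcam you would pass `cv2.VideoCapture(0).read` (or a capture opened on an RTSP URL) as `read_frame` and put the heavy inference inside `process`.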
Thank you, ima chatgpt this to understand cause I don’t fully understand what an rtsp link is and some of the things u said
I appreciate it
Topic - GloVe Paper
Hi, so I recently started to read the GloVe paper and there is this line in it which is confusing me: "Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences."
I don't get how the authors reach the conclusion that vector differences are a natural way. Is there some logic behind it, or is it pure heuristics?
Please tell me if that's too little context, I'll try to explain more
I'll make out time to read the paper, but a few lines could help
But in this context I'd say that it means two things
- Vector addition: vectors in a vector space can be added together to form another vector in that same vector space
I.e. vector spaces are closed under addition, and this is independent of the field
- Scalar multiplication: vectors in a vector space can be scaled to get other vectors within that same vector space
These two properties bring about other linear properties while being linear themselves
E.g. distributive properties, additive inverses etc
But operations like multiplying vectors by other vectors (e.g. squaring a vector) are not linear
tldr basically: all structures and operations in a vector space respect linearity
Sorry for the wall of text got slightly too into it 😅
Now when I read the second part it seems more like Euclidean geometry, but I don't know what "the most natural thing" that the paper is doing is
But the vector differences just mean that in a vector space all positions are relative, nothing is absolute. If I move the origin by some offset (o_1, o_2, ..., o_n), every vector shifts by the same amount, so a difference like x_1 - x_2 stays the same. All vector differences stay the same.
Make any sense?
It's how you do things relative to each other. If you have some position A as a vector and some position B as a vector, and you want a new vector at B (still same spot), but relative to A (it is the new origin), you can just do B - A.
(New vector tip is on B, and tail on A)
ok I understand the premise of vectors. The thing is, the authors are suggesting that it makes sense for them to take the difference of two vectors but not the sum, and I'm not sure why. Maybe if you took a quick read of page 3 of the paper it might make more sense? @crystal pier @iron basalt
Example: you have a video game explosive barrel object with 3D position vector, and the player with another 3D position vector. Now to do the game logic, you want the player's position relative to the barrel, so you do player.pos - barrel.pos. Then you can check its magnitude for distance checks like if the player is in explosion hurt radius.
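The barrel example as a quick sketch, with made-up positions and an assumed `HURT_RADIUS`. It also shows that shifting both positions by the same amount (moving the origin) leaves the difference vector unchanged.

```python
import math


def subtract(a, b):
    """Vector from b to a, i.e. a expressed relative to b."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


player_pos = (4.0, 0.0, 3.0)
barrel_pos = (1.0, 0.0, -1.0)
HURT_RADIUS = 6.0  # assumed explosion radius

offset = subtract(player_pos, barrel_pos)  # player relative to the barrel
distance = math.hypot(*offset)
in_blast = distance <= HURT_RADIUS

# Move the whole world: both absolute positions change, the offset does not.
shift = (100.0, -50.0, 7.0)
shifted_offset = subtract(add(player_pos, shift), add(barrel_pos, shift))

print(offset, distance, in_blast, shifted_offset == offset)
```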
They wrote in the paper they want to encode the relative information of the probabilities.
"First, we would like F to encode the information present in the ratio P_ik / P_jk in the word vector space."
If you use log probabilities, you get a difference instead of division...
(It's a morphism)
so basically you are saying that log(P_ik / P_jk) = log(P_ik) - log(P_jk)
and that's why vector difference makes sense
Ratio between probabilities is getting the relative info, and you want to also encode relative info in a vector space, which leaves the natural choice of difference, but yes, you can also get more into it with log probabilities, it makes it a bit more obvious.
Any time probabilities are involved, consider log probabilities, they make things way more clear.
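A tiny numeric check of that point, assuming F = exp (the choice the paper ends up with) and toy vectors invented here. In the real model the probabilities come from co-occurrence counts; here they are defined from the vectors just to show the algebra.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy word vectors for words i and j, and a context vector for word k.
w_i = [0.2, -0.5, 1.0]
w_j = [0.7, 0.1, -0.3]
w_k = [0.4, 0.9, 0.6]

# If log P_ik is modeled as w_i . w_k, then P_ik = exp(w_i . w_k).
p_ik = math.exp(dot(w_i, w_k))
p_jk = math.exp(dot(w_j, w_k))

ratio = p_ik / p_jk

# The same ratio computed from the difference vector w_i - w_j.
difference = [a - b for a, b in zip(w_i, w_j)]
from_difference = math.exp(dot(difference, w_k))

print(ratio, from_difference)
```

A ratio of probabilities turns into a difference of logs, and a difference of dot products is the dot product of a difference vector, which is why the difference shows up so naturally.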
> Ratio between probabilities is getting the relative info, and you want to also encode relative info in a vector space, which leaves the natural choice of difference
Yes, this: that it's a natural choice to use the difference. Maybe I don't have the intuition yet to understand this abstractly
Like, I understand we want to encode the info of a scalar value in vector space, but how does that leave us with a "natural" choice of difference? Is my math too weak to understand this?
What other choice would you use in a vector space for relative info?
Consider this example.
It's physical, much less abstract.
yes this actually helped in understanding the relative premise
The difference vector on an abstract level encodes the relative information between the entities.
ok i think i understand this but while we are on the topic can you help me one more aspect?
So the paper further said that - "While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture"
I tried to make sense that why neural networks wasnt the first choice here and this is what i ended up with - The GloVe paper emphasizes that while a neural network could have been used to learn word embeddings and might produce good results, such models often act as black boxes. This means they provide embeddings without a clear understanding of why certain relationships emerge. In contrast, GloVe is built on explicit statistical information derived from word co-occurrence counts, making its embeddings more interpretable. This aligns with the goal of enabling meaningful vector arithmetic (e.g., king - man + woman ≈ queen) and revealing transparent relationships between word vectors based on how frequently words appear together in a corpus.
is this making sense?
the problem was LHS and RHS were not equal , LHS was vector and RHS was scalar
Simply it would mess up your ability to do simple vector operations that you want to be able to do with words.
Yeah the crux
Because networks scramble things.
oh hey I'm back
ok so i guess i am on the right path
well squiggle did solve my doubt, although it turns out it was pretty dumb
I'm guessing sure squiggle has this on lock
@iron basalt Thanks man
It's not dumb, when papers just say stuff like this is "natural," it's begging to be questioned.
As it's often not the best choice.
so are there forums out there where people question or tear down a paper?
Yes, a lot.
It's also just part of the academic process.
hmm soooooo where should i look at, haha am kinda new to this
You are already in Yannic's discord, they cover papers there every week in a call, and people there cover papers all the time in chat.
ahh yes, I do join them occasionally
Guys I need help
So I am trying object detection with Keras ResNet50. Here is how I prepared my data:
```
import tensorflow as tf


def parse_tfrecord(example_proto):
    feature_description = {
        'image/encoded': tf.io.FixedLenFeature([], tf.string),
        'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/xmax': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymin': tf.io.VarLenFeature(tf.float32),
        'image/object/bbox/ymax': tf.io.VarLenFeature(tf.float32),
        'image/object/class/label': tf.io.VarLenFeature(tf.int64),
    }
    parsed_features = tf.io.parse_single_example(example_proto, feature_description)
    # Decode and preprocess the image
    image = tf.image.decode_jpeg(parsed_features['image/encoded'], channels=3)
    #image = tf.image.resize(image, [HEIGHT, WIDTH])
    image = tf.cast(image, tf.float32) / 255.0
    # Only the class labels are returned; the bbox features are parsed but unused
    labels = tf.sparse.to_dense(parsed_features['image/object/class/label'])
    return image, labels


def get_object_detection_dataset(tfrecords_dir, batch_size):
    files = tf.io.gfile.glob(tfrecords_dir)
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(parse_tfrecord)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
```
and I built my model like this:
```
import tensorflow as tf
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import ResNet50


def build_resnet50_fpn_backbone(input_shape=(640, 640, 3), weights='imagenet', include_top=False):
    """
    Builds a ResNet50 backbone with a Feature Pyramid Network (FPN) for object detection.

    Args:
        input_shape (tuple): The shape of the input images (height, width, channels).
        weights (str): The weights to load for the ResNet50 model.
            'imagenet' for pre-trained weights on ImageNet, or None for random initialization.
        include_top (bool): Whether to include the top (fully connected) layers of ResNet50.
            For feature extraction, this should be False.

    Returns:
        tf.keras.Model: A Keras model representing the ResNet50 FPN backbone. The model
            has multiple outputs, which are the feature maps from the FPN levels (P3 to P7).
    """
    # Ensure valid input shape
    if input_shape is None or len(input_shape) != 3 or input_shape[2] != 3:
        raise ValueError("Input shape must be a tuple of (height, width, 3).")

    # Ensure channels_last data format
    tf.keras.backend.set_image_data_format("channels_last")

    # Load ResNet50, excluding the top (fully connected) layers
    resnet50 = ResNet50(
        include_top=include_top,
        weights=weights,
        input_shape=input_shape
    )

    # Get the outputs of the intermediate layers we need for FPN. These are
    # the activations before the pooling layers.
    c3_output = resnet50.get_layer('conv3_block4_out').output  # Shape: (None, 80, 80, 512) for 640x640 input
    c4_output = resnet50.get_layer('conv4_block6_out').output  # Shape: (None, 40, 40, 1024) for 640x640 input
    c5_output = resnet50.get_layer('conv5_block3_out').output  # Shape: (None, 20, 20, 2048) for 640x640 input

    # FPN layers. These layers take the output of the ResNet stages and combine them
    # to create feature maps at multiple scales. This helps with detecting objects
    # of different sizes.

    # P5 is initialized directly from C5
    p5 = layers.Conv2D(256, (1, 1), name='P5')(c5_output)  # (None, 20, 20, 256)

    # Upsample P5 and add it to C4
    p4 = layers.Add(name='P4_add')([
        layers.Conv2D(256, (1, 1), name='P4_conv1')(c4_output),  # (None, 40, 40, 256)
        layers.UpSampling2D(size=(2, 2), name='P4_upsample')(p5),  # (None, 40, 40, 256)
    ])
    p4 = layers.Conv2D(256, (3, 3), padding='same', name='P4_conv2')(p4)  # (None, 40, 40, 256)

    # Upsample P4 and add it to C3
    p3 = layers.Add(name='P3_add')([
        layers.Conv2D(256, (1, 1), name='P3_conv1')(c3_output),  # (None, 80, 80, 256)
        layers.UpSampling2D(size=(2, 2), name='P3_upsample')(p4),  # (None, 80, 80, 256)
    ])
    p3 = layers.Conv2D(256, (3, 3), padding='same', name='P3_conv2')(p3)  # (None, 80, 80, 256)

    # P6 and P7 are created by downsampling P5
    p6 = layers.Conv2D(256, (3, 3), strides=2, padding='same', name='P6')(p5)  # (None, 10, 10, 256)
    p7 = layers.Conv2D(256, (3, 3), strides=2, padding='same', name='P7')(p6)  # (None, 5, 5, 256)

    # Define the model with multiple outputs
    model = Model(inputs=resnet50.input, outputs=[p3, p4, p5, p6, p7])
    #model = Model(inputs=resnet50.input, outputs=feature_map)
    return model
```
```
from tensorflow.keras.optimizers import Adam

model = build_resnet50_fpn_backbone(input_shape=input_shape)
losses = {
    'classification_output': 'sparse_categorical_crossentropy',
    'bbox_output': 'mse',
}
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss=losses)  # Two losses: bbox and class
model.fit(training_data, epochs=NUM_EPOCHS, validation_data=validation_data)
```
When I tried that fit, this came up: `y_true and y_pred have different structures. y_true: * y_pred: ['*', '*', '*', '*', '*']`
Does anyone have any idea how this could happen?
debug it
outputs=[p3, p4, p5, p6, p7]
['*', '*', '*', '*', '*']
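To spell out what those two lines are pointing at: the dataset yields a single labels tensor per batch, while the model returns a list of five feature maps, so the nesting of targets and predictions doesn't line up. Below is a pure-Python sketch that mimics how the error message prints the two structures. The `structure` helper and the string stand-ins for tensors are made up for illustration and are not Keras API.

```python
def structure(obj):
    """Replace every leaf with '*' while keeping the list/dict nesting."""
    if isinstance(obj, dict):
        return {key: structure(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [structure(item) for item in obj]
    return "*"


# What the posted pipeline yields as the target: one labels tensor.
y_true_posted = "labels"
# What the posted model returns: five FPN feature maps.
y_pred_posted = ["p3", "p4", "p5", "p6", "p7"]

print(structure(y_true_posted))
print(structure(y_pred_posted))

# A matching setup: two named heads and a target dict with the same keys.
y_true_fixed = {"classification_output": "labels", "bbox_output": "boxes"}
y_pred_fixed = {"classification_output": "class_scores", "bbox_output": "box_coords"}
structures_match = structure(y_true_fixed) == structure(y_pred_fixed)
print(structures_match)
```

So one likely fix is to put classification and box-regression heads on top of the FPN outputs, name them `classification_output` and `bbox_output` so the loss dict keys match, and have `parse_tfrecord` return both the labels and the boxes in a matching structure.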
but "easier" to put some breakpoints and watch variables
with debugger
oh as I thought you can also add verbose param to fit
verbose: "auto", 0, 1, or 2. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. "auto" becomes 1 for most cases. Note that the progress bar is not particularly useful when logged to a file, so verbose=2 is recommended when not running interactively (e.g., in a production environment). Defaults to "auto".
For anyone looking to make their own dataset for their AI's lmk what you think of my project and I would love to see if you build off of it:
https://github.com/Tyguy047/Cluster-Dataset-Builder
what are you trying to do with that
check their dimensions; looks like one is a vector and the other is a scalar
make your own dataset program. tweak it, refine it idc
like pandas?
the readme isnt very descriptive, it's some llm training tool?
I think it includes what it needs to. You build your own dataset to train or fine-tune a model. It's automated and generates data from multiple AI models to avoid inbreeding data. || @limpid dew ||
have you built an LLM with it?
Running it doesn't build you an LM. You use the dataset it generates, paired with your train.py file, to train your model. You can modify the script to ask the cluster to only generate data that will help train an AI on python debugging or math or whatever you want.
I think you misunderstood my question. That's okay. You say you can use it to train an AI on math? What kind of math? How do you do this?
