#data-science-and-ml
that's exactly my problem. my original plan was to handle everything, the adaptations, in the c# code, but my professor insisted on machine learning, i guess because it is more interesting in the field nowadays, so i had to start from scratch trying to learn the basics of machine learning, and the more i learn the less sense it makes... there's only so much data i can get from one playthrough
C lends itself very well to machine learning code, but you have to know the underlying concepts to write it correctly
For the basic idea: https://www.youtube.com/watch?v=sw7UAZNgGg8
By the 1950s, science fiction was beginning to become reality: machines didn’t just calculate; they began to learn. Machine calculating was out. Machine learning was in. But we had to start small.
Donald Michie’s “Machine Educable Noughts And Crosses Engine” -- MENACE -- was composed of 304 separate matchboxes that each depicted a possible stat...
I've implemented a few SVMs already in C++ and they were incredibly fast
You can use this for any game. You need some set of moves and conditions that makes them valid to choose from, you can then do what is done in that video (directly).
The learning part here is that it starts out really bad, basically playing random moves.
so... this is basically without any python based learning algorithms then?
You can do this in any programming language, or as shown in the video, mechanically/IRL/by-hand.
If you have a computer (machine, not person, "computer" used to be a job title) do it, then it's machine learning.
my prof basically said if i use scikit learn i would avoid writing everything myself, since algorithms like that "already exist" but i'm kinda starting to doubt it... at least until now its been a hell of a lot more work than just writing it myself
It's very simple machine learning, but it does pretty much exactly what you are asking for. In terms of gameplay experience.
That does not apply here directly. You can, for example, take the game state and run some clustering on it. Each cluster roughly represents the current "situation" in the game (if a unique situation happens, a new cluster can be formed with its own set of associated learned moves). Based on which cluster the current state is part of, the boss decides from a certain set of appropriate moves (randomly at first, then learned over time).
In the video I gave you don't need this because you basically just take the board state directly and map it directly (like a hashmap lookup table).
The game state is simple.
And not too many of them.
One way to extend this idea directly such that you can have similar game states map to the same "bucket" (as in a hashmap) of moves is called locality-sensitive hashing, which is an option (https://en.wikipedia.org/wiki/Locality-sensitive_hashing ).
In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for data clustering and nea...
I'm rly trying to write all of this down but I already know this is probably gonna push my thesis back at least another month until I get all of that running 
I appreciate all the effort tho! Since my previous approach obviously ran me into a wall haha
If you do what that video does, it will work, but you may notice that in that video there are not that many game states. In a realtime game you have an absurd number of possible unique game states happening quickly, so you need to group similar ones together to try to wrangle this complexity. This is where other parts of machine learning (clustering) come into play.
If you imagine you had a lookup table that maps from game state to the set of moves the boss should play in that state (it can pick any of them as needed, learning which ones are best in that case), then this is fine for something like tic-tac-toe: there are not many entries in the table since there are not that many possibilities. But if you now have, say, chess, big problem: your table is massive. So we have to try to reduce this. You can do this by treating different but "similar" states as a single entry in the table, so they all map to the same set of moves. This is where machine learning started, and it's still the same problem.
To have this be fun for the player, where they can see the boss getting better, the boss just picks random moves at first from the set of moves its current state maps to, but over time it can remove moves that made it lose (with some random chance of that happening).
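A minimal sketch of the matchbox idea described above, assuming game states are hashable values and moves are plain strings. The class name `MatchboxLearner`, the bead counts, and the example state are made up for illustration and are not taken from the video or from any library.

```python
import random


class MatchboxLearner:
    """Lookup table from a hashable game state to a weighted bag of moves."""

    def __init__(self, initial_beads=3, seed=None):
        self.boxes = {}  # state -> {move: bead count}
        self.initial_beads = initial_beads
        self.rng = random.Random(seed)

    def choose(self, state, valid_moves):
        # First visit to a state: every valid move starts with the same bead count.
        box = self.boxes.setdefault(state, {})
        for move in valid_moves:
            box.setdefault(move, self.initial_beads)
        moves = [m for m in valid_moves if box[m] > 0]
        if not moves:
            # Every move here has been punished away: fall back to uniform random.
            return self.rng.choice(list(valid_moves))
        weights = [box[m] for m in moves]
        return self.rng.choices(moves, weights=weights, k=1)[0]

    def learn(self, history, won):
        # history: list of (state, move) pairs the boss played this round.
        delta = 1 if won else -1
        for state, move in history:
            box = self.boxes[state]
            box[move] = max(0, box[move] + delta)


boss = MatchboxLearner(seed=0)
state = ("player_near", "boss_low_hp")  # hypothetical discretized game state
move = boss.choose(state, ["swing", "jump_back", "block"])
boss.learn([(state, move)], won=False)  # the boss lost, so that move gets less likely
```

Early on every move is equally likely, so the boss plays randomly. Moves that keep losing run out of beads and stop being picked, which is the visible "getting better" effect.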
well, i kinda already gave up on realtime adaptation, hence why i'm working with rounds now. my boss already has states, since he already decides when to locate, approach, attack or defend. however, just adding more states and conditions to this is not enough machine learning according to my prof?
if thats what you mean
I mean for example, if the player is at position (10, 10) -> {swing, jump back, etc}. Problem is, the player can also be at say (10.0000001, 10) -> {other set of things}.
When I say game state, I mean the state of the whole game.
maybe I dont understand 
This is a chess game state.
The boss takes actions based on the current (and past) game states.
If you take this state, and run it through a hashing function, you get a single number representing this unique game state.
You can then use that number to lookup a set of moves (lookup table).
Imagine I gave you a book with every possible chess game state in it, followed by the perfect move to play in that state (on the same page). I now gave you that book and asked you to play the perfect move. You can jump to the page that has the matching game state, and play the associated move.
But, try to imagine how big that book would be, how many possible game states chess has.
(It does not fit in this universe levels of big)
so a game state is all of the boss's stats and unlocked abilities in the round the player is currently playing? since it really only changes once the next round starts

It's more than that, it's the boss's position, the player position, the map geometry, their health, etc, literally every variable in the game.
In chess I can show all of that with a single screenshot, because chess does not have any hidden state (both players know the full exact game state at all times).
This is in contrast to poker, where you have hidden state.
Chess also does not have to handle stuff like a position of 1.00001, it's all integers.
So anyhow, instead of doing this, we instead make a much smaller book, with only a few states listed in it and each has a list of "decent/good moves." And now when I tell you to play the perfect (or just good) move, I give you the book and tell you to find the state that is most similar to the one given, then see if any of the moves listed there are valid, and if they are, there is a good chance they are decent moves to play (you still need to manually check this, but you have narrowed down your choices a lot, spending much less time to find a good move).
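The "much smaller book" can be sketched as a nearest-neighbor lookup. This assumes a game state can be reduced to a short tuple of numbers; the two features used here (distance to the player, boss health fraction), the entries in `BOOK`, and the move names are all hypothetical.

```python
import math

# Hypothetical small book: a few representative states, each with decent moves.
# A state is (distance_to_player, boss_health_fraction).
BOOK = [
    ((1.0, 0.9), ["swing", "grab"]),
    ((8.0, 0.9), ["approach", "throw"]),
    ((1.0, 0.2), ["jump_back", "block"]),
]


def nearest_entry(state, book=BOOK):
    """Return the (reference_state, moves) entry closest to `state`."""
    return min(book, key=lambda entry: math.dist(entry[0], state))


def candidate_moves(state, is_valid, book=BOOK):
    """Moves suggested by the most similar known state, filtered for validity."""
    _, moves = nearest_entry(state, book)
    return [m for m in moves if is_valid(m)]


# A distance of 1.0000001 lands on the same page as a distance of 1.0.
moves = candidate_moves((1.0000001, 0.88), is_valid=lambda m: True)
```

In a real game the features would probably need rescaling first, since a raw distance in world units would otherwise dominate a health fraction between 0 and 1.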
what would you ask them
You're probably looking for #databases . Remember to always ask your actual question and not if someone knows about the topic of a secret question
Bf8.
what do you think? Is F1 better in most cases when you care about the trade-off?
I've seen cases where FP and FN can both be problematic
especially for spam and fraud
Its not secret I just need some paragraphs to explain
your emails get labeled as spam when they are not supposed to, or they don't get labeled as spam when they are supposed to
you're asking if anyone is a postgres expert. but postgres experts can't answer your question if they don't know what it is. so it saves everyone a step if you just ask the postgres question, and then postgres experts can answer it if they see it.
minimizing false positives vs minimizing false negatives xD
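For reference, a small sketch of how the trade-off shows up in the numbers. F1 weights precision (hurt by false positives) and recall (hurt by false negatives) equally; the more general F-beta score lets you lean one way. The counts in the example are invented.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f_beta(precision, recall, beta):
    """beta > 1 favors recall (missed spam hurts more), beta < 1 favors precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Invented spam-filter counts: 8 spam caught, 2 good emails flagged, 8 spam missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts precision is higher than recall, so a recall-leaning score (beta = 2) comes out lower than F1 and a precision-leaning one (beta = 0.5) comes out higher.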
I did work with Postgres last semester for my DBMS course but Im not an expert
whats the question?
also isnt that more of a database question 🥲
Im going to ask at databases
do you know RL?
Reinforcement Learning
are you asking to ask?
Just wondering how hard it is
Im taking RL next semester
The syllabus looked horrifying
I don't know. there isn't really an application for it in NLP.
For me too
To mitigate hallucinations in LLMs we can use RLHF
we get human feedback during training and use a reward model
which is basically RL
also they ask examples of supervised and unsupervised learning?
Databases is dead
I would say Logistic Regression is a supervised learning
and unsupervised learning would be like clustering techniques
like K-Means
I used Isolation Forest for my spam detection which is unsupervised too xD
I have recently found out that the proper term for "unsupervised" is "self-supervised"
Loooool
Yeah but not that usual
Basically you can use CNNs and RNNs, which are usually supervised learning, and apply anomaly methods on them to make them self-supervised
Why
terms are only "proper" insofar as there's a consensus about what they mean. and there often isn't.
I don't recognize those as synonyms.
though a "proper term" in this field is a bit ironic
you can use CNNs in GANs
if someone asks me if CNN is supervised or unsupervised I would say it depends
how CNN is used
true true, the "proper" was a bit rushed I suppose
though, well, that doesn't make life any easier that new terms are introduced, meaning different things for different people and institutions 
as a linguist, that must drive you crazy (at least you won't have to worry about a job in that sense, lol)
Why aren't they synonyms?
I've started to forget a lot of things from my linguistics degree tbh
I haven't practiced it for a long time
I mean self supervised is like you supervise yourself
unsupervised means no supervision at all
Ok got it
interesting, I guess there is a difference after all, though they both have in common that you don't have labeled/known targets
https://ai.stackexchange.com/questions/40341/what-is-the-difference-between-self-supervised-and-unsupervised-learning
Guys do you think all ML after all is statistics written in code
would you say a kid aged 7 at a pool is self supervised or unsupervised?
can they supervise themselves?
that is certainly one of the analogies of all time
thats why they cant be synonyms
Pretty on point
this is what you need to know:
Supervised learning is learning from labeled data
Unsupervised learning is learning from unlabeled data
Self-supervised learning is learning from unlabeled data with labels generated from the data itself
in self-supervised learning those labels are synthetic
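One common way those synthetic labels come about, sketched under the assumption of a next-word prediction task: the raw text is unlabeled, and the targets are carved out of the text itself. The function name and the example sentence are made up.

```python
def next_word_pairs(text, context_size=2):
    """Build (context, target) training pairs from raw text alone."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))  # the "label" is just the next word
    return pairs


pairs = next_word_pairs("the boss learns from the player")
# Nobody annotated anything, yet every pair has an input and a target.
```

From here on, training looks like ordinary supervised learning, which is why the terms get mixed up.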
Theres also Zero-Shot, One-Shot and Few-Shot Learning @spring field
which are used in NLP
you will see that one-shot and few-shot learning are types of supervised learning
and zero-shot is transfer learning
I was going crazy with all those last semester
ugh
I am yet to, lol
are you doing DS?
In a sense I guess, I'm employed currently and the position does involve ML and DS
tensor[..., 0, True, 1::2, torch.tensor([1, 2])]
tensor.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})})
can anyone explain what this is?
https://pytorch.org/cppdocs/notes/tensor_indexing.html
what do you want to know about it? this is a particularly contrived example of advanced indexing functionality that is possible with pytorch tensors
admittedly i can't find any official docs on the "Python API" that they refer to -- but as far as i know (and as far as i've used) it's similar to the Numpy API, which is described here https://numpy.org/doc/stable/user/basics.indexing.html
aha:
When accessing the contents of a tensor via indexing, PyTorch follows Numpy behaviors that basic indexing returns views, while advanced indexing returns a copy. Assignment via either basic or advanced indexing is in-place. See more examples in Numpy indexing documentation.
https://pytorch.org/docs/stable/tensor_view.html
that's at least a clue
so this is the (contrived) Pytorch Python code: tensor[..., 0, True, 1::2, torch.tensor([1, 2])]
and this is the C++ equivalent: tensor.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})})
yeah I can understand previous codes
I have a datastream of 100 crypto coins. The top 100 for the hour. It charts their battle for the top. I get these vertical lines in the graph and I feel they are the result of some synch problem. Thoughts?
The system is all supposed to run on a 60 second cycle.
so 1 small gap is 60 seconds.
maybe a synch issue with the incoming datastream's update cycle?
Q.Q
I'm just getting black images from the generate() function. The thing seems to be learning pretty good, but I seem to be getting nans/infs in my output. Should I be clamping something?
SUCCESS!
You won crypto!
Boom!
hey guys can you help me build a model for my eeg analysis you can find the notebook here - https://www.kaggle.com/code/pramitroy/data-processing dm me if you guys have some suggestions or better model
Plz send me a code that creates its own answers for any questions asked
just API call a LLM or use Ollama
This GPU cost thing is a problem, programming with anxiety about spending $ sucks
The unfortunate reality is that cutting edge ML depends on the very best hardware, and the title for best hardware keeps getting won by bigger and more expensive devices
If you're a student, you can see if your university has a compute environment that you can use
Nope, no university. Think I'm going to try a huggingface subscription
What are you trying to make?
RAG chatbots, just for study
Success?
Something is still wrong of course, but this is progress
Here's the issue: what you'd normally expect from a half- or poorly-trained diffusion model is blobs of noise somewhat resembling structure. This looks more like a perfectly clean image with noise laid overtop of it. This image above was generated from pure noise
Anyone have any idea what might cause this?
is there a point in pre-normalizing your data if your model already contains batch normalization?
It depends on what you're doing. That raises the question, what are you doing?
this isn't a question on a project, I'm just trying to learn more about batch norm
So, I've got a question for y'all
Something I think one can't learn from a book
How does one debug and tune a neural network? I mean, when you've got a network that is theoretically sound but isn't working (or could work better), what's the process for figuring it out?
Aside from virgin sacrifice, that is
You need to figure out in which specific situations it's not working and what those situations have in common
I'm not fully sure what I'm looking for, but I'm attempting to train and fine-tune a model. I have a high-end gaming pc that can process the datasets, however, this would take me very long. I'm going to be processing multiple terabytes of data. Is there a cheap cloud server or remote server I can run this all from and process data faster?
It will be very expensive to do this no matter what.
If you're trying to do this as a private person (and not on behalf of a company or institution that can pay for it), I would scale this down by orders of magnitude
Is there an alternative? So I just have to wait it out? Also, if my computer restarts or goes into sleep for some reason, how can I save the data?
What kind of data are the terabytes of it that you have?
Hm. Well, I really don't want to do that.
HuggingFace datasets such as FineWeb or Common Crawl. I already trained it on smaller datasets, however.
and what do you want to train the model to do?
Text-2-text generation/multi-turn dialogue.
I do not think you should try to do this on your own computer with terabytes of data, and I do not think there is a cloud compute platform where you can do this cheaply.
Salad comes to mind
But you'd have to be buying in bulk
Okay, if I can't do it cheaply, then what would it be?
Hello.
What are some advanced projects i can add in my portfolio?
i am making an agent with lots of functions to use for function calling. i assume adding hundreds of functions to an llm request would be quite expensive.
how could i make it cheaper? i was thinking of implementing rag but not so sure about how that will work
currently the functions are split into files where each file has functions that relate to each other, all these files are stored in scripts folder
i am sure this is a common challenge when making agents
would appreciate any suggestions on how to deal with a large number of functions to add to an llm request
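One way this is often approached is to retrieve only the few functions relevant to the current request and send just those definitions. A rough sketch, using plain word overlap (cosine similarity over word counts) as a stand-in for embedding search; the tool names and descriptions in `TOOLS` are hypothetical, and a real setup would likely use a vector store instead.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_tools(query, tool_docs, k=2):
    """Rank tools by word overlap between the query and each docstring."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(doc))), name) for name, doc in tool_docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical tool registry: function name -> docstring.
TOOLS = {
    "send_email": "Send an email message to a recipient address",
    "read_file": "Read the contents of a text file from disk",
    "list_files": "List the files inside a folder on disk",
    "get_weather": "Get the current weather forecast for a city",
}

selected = top_k_tools("read a file from disk", TOOLS, k=2)
```

Only the definitions named in `selected` would then go into the request, so the prompt size stays roughly constant as the number of functions grows.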
uhm.... training llm using my discord data is legal???
What does game have to do with RL?
Training it using your personal data export, including only your own messages, is probably fine - but may not deliver good results as it'll be very out of context
Training it on data scraped from discord including other people's messages is not cool
I heard about this guy when I did my undergrad at NYU
Nobody answered that ^
Are exponential based reward mechanisms good for reinforcement learning? Should provide globally differentiable training feedback?
I would suggest runpod. Try applying for aws credits if it's research based. But, runpod is as cheap as it gets.
lol yep it took me quite a bit to understand the entire working of that and write the code
is it domain based?
So, you can easily run a highly quantized model on cpus without even using a gpu and they perform quite well
you just need to know where to look tbh lol
https://github.com/SanshruthR/CPU_BlazeChat This might be useful for you
RL is genuinely trial and error, there's no definite approach or guide to what would give the best output. You'd have to monitor the model quite closely, as training can lead to exploding gradients, but you can always implement gradient clipping etc. TL;DR: how well exponential rewards work depends on the problem being solved.
TL;DR RL is hard
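For anyone unsure what gradient clipping does, here is a dependency-free sketch of clipping by global norm. It assumes the gradients are a flat list of floats; in PyTorch the usual route is `torch.nn.utils.clip_grad_norm_`, which works on the model's parameters rather than on a list like this.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient values so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 gets scaled down to 1.0
```

The direction of the update is preserved and only its length is capped, which is why one huge reward does not blow up the weights.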
Are there any message board or social media site scripts? I don't know if it would be easier to start from scratch these days or to port my 20 year old Perl scripts. My searches keep pulling up spambots for various platforms instead of software for platforms.
Let's build a modern one using a MERN stack.
├── backend/
│   ├── models/
│   │   ├── User.js
│   │   └── Post.js
│   ├── routes/
│   │   ├── auth.js
│   │   └── posts.js
│   ├── middleware/
│   │   └── authMiddleware.js
│   ├── server.js
│   └── config/
│       └── db.js
├── frontend/
│   ├── public/
│   └── src/
│       ├── components/
│       │   ├── Auth/
│       │   │   ├── Login.js
│       │   │   └── Register.js
│       │   ├── Posts/
│       │   │   ├── CreatePost.js
│       │   │   └── PostList.js
│       │   └── Layout/
│       │       └── Navbar.js
│       ├── context/
│       │   └── AuthContext.js
│       ├── App.js
│       ├── index.js
│       └── api.js
├── .env
└── package.json
is it really just that easy these days? LOL!
So, how do I implement that at Blahblah.com?
(that's really the name, it's not a placeholder)
f u d g e... (only he didn't say fudge)
I rewrote the entire thing with security, UI, WebSockets and everything cool. It works a bit like twitter but with some unique differences. It uses Go for the backend and Angular for the front. I'll paste it in the other channel
How do I implement it for testing? I haven't had a good system at blahblah in a long time and with everyone bailing on FB and X it really would be the PERFECT time!
I had the popular message boards before FB took over.
lol
I can set up a cloud account with a subdomain like blah.blahblah.com (I think, I've never actually done that yet lol)
i bought a domain, pyposh.org, a while back, we can test it on that. i bought it via the google cloud platform
ok, we could also use blahblah.net or blahblah.org, I don't have anything there yet.
zencoder just totally screwed my code.. Been trying to get it back this whole time... UGH...
what is this?
SEaaS: Social Experiment as a Service? 
Its main goal is to facilitate user engagement and interactions through seamless content sharing. For now it lets you register, log in, create and share posts in real time. It's pretty simple right now, I'm gonna add more features for content sharing.
could i have some help with pytorch using Visual Studio Code, because i don't understand the documentation that I've read through.
Can you provide more context?
i dont understand why newTen is not getting updated after i call append on sum/4 over newTen
when i try to print out the weights of my net after training, nothing seems to have changed even when the net is trained well. like it has an accuracy of 94% but the weights are all printed out the same. can someone help me with this?
forgot to tell , pytorch
i am working on the project to learn RAG and llm techniques, not trying to cut down costs by using cheaper models
check if requires_grad is set to True. if you are using pretrained weights they are already good, so they wouldn't update that much. You can also try raising the lr to something greater
like use 1e-2 or something idk
which vector database are you using to learn RAG, are you following a yt tutorial or a course?
and do you wanna run a llm on your own server/ machine or do you prefer an api response?
it is a siamese net i coded myself
and i looked up a tutorial online. in the video the code works for the guy, so i copied the exact code from the video to check if something's wrong on my end, and yeah, the code from the video doesn't work on my pc
when i run it
just create a docker file and spin up the mern stack
https://hub.docker.com/r/03192859189254/node-mern-stack/
just drop your code into it using
COPY /filepathonhost/ /containerdestination
that's assuming you don't want persistence; if you do, you would create/use a volume
i don't understand the documentation. I lose my place when I read it and it's just confusing. I understand what a tensor is: it's just an array of numbers that could be an image broken into numerical sequences. i understand tokenizing text etc.
tensors are similar, but you break the image down into vector graphics
but it's a tensor vs a vector because it usually contains its start coordinates
I think the main problem could be the sharing of weights, make sure you are sharing the weights and not instantiating two separate models.
and try to run that in a cloud environment
like kaggle or colab
also check if optimizer.step() is being called after loss.backward()
ohhhhhhhhhhhhh
i did it but didnt work
you want me to send the code?
please do that
would you like the link to my live share
it's a bit late where i am to do much of stuff like that 😉
I'm sorry i had a few places to go and time got away from me
its cool, sadly i am on hol so not around much till thursday really, i just pop in here for a quick read this week
should I put the link in any way?
sure others will love to review
I made a cool thing from a paper yesterday, It is a CNN that learns the group of transformations on an image by encoding within an embedding for a CNN
i haven't made anything yet, i need help learning it from the beginning
Build with Visual Studio Code, anywhere, anytime, entirely in your browser.
does anyone have issues with pos_tag and the lemmatizer of nltk
use spacy for this and do not use nltk.
😭 will do
In [1]: import spacy
In [2]: nlp = spacy.load('en_core_web_sm')
In [3]: doc = nlp("the boy walked to the store")
In [4]: doc
Out[4]: the boy walked to the store
In [5]: list(doc)
Out[5]: [the, boy, walked, to, the, store]
In [6]: doc[2]
Out[6]: walked
In [7]: doc[2].has_morph()
Out[7]: True
In [8]: doc[2].suffix
Out[8]: 13,622,047,838,477,328,034
In [9]: doc[2].suffix_
Out[9]: 'ked'
🤔 why was ked returned and not ed?
yeah, idk
kinda sus
In [10]: doc[2].morph
Out[10]: Tense=Past|VerbForm=Fin
ahh i'll go check it out myself
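On the 'ked' question: as far as I know, spaCy's `suffix_` is not a morphological suffix at all but a fixed-length lexical feature, by default the last three characters of the token text. A plain-Python sketch of that behavior (the helper name is made up, and this is not spaCy's own code):

```python
def suffix_feature(token_text, n=3):
    """Last n characters of the token, regardless of where the stem ends."""
    return token_text[-n:]


walked = suffix_feature("walked")  # same 'ked' as in the session above
short = suffix_feature("to")       # tokens shorter than n come back whole
```

For the linguistic view of the word, the attributes to look at are `morph` (shown in the session) and `lemma_`.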
I started developing this method called horizon mapping that's kind of a higher level partner to MCTS. it's supposed to analyze upper decision boundaries, compute entropy of decision trees, help identify horizon points or points of uncertainty to aid in triggering surprise minimization, and generate adversarial interactions. Just overall find and visualize areas where the model can train and adapt. and the damn thing just won't work.
Even though everything looks right, imports then logging for global mapping, it just seems like one of those weird things with programming where a file just wont initialize properly. So I'm taking a break.
@left tartan but to answer your question, I'm not training it yet with the new method I'm just trying to get through the errors it's causing with the system.
What kind of errors?
The logging is saying it's not defined but the errors are popping up on 90 and 113. Meanwhile the reference for global mapping is on 80. So maybe the problem is in the file itself for the resilient error guard.
Could you open a help thread and paste code and exact error?
Actually thanks for being a rubber duck! Just figured it out!
VectorStore and i use VectorStoreIndex for retrieving.
currently i am scraping information from yt, docs to learn how to efficiently design the workflow
the main llm (and the only one currently) is gpt 4 mini
so my current plan is to use llamaindex dataloader docstringwalker to get all the python functions (the functions are split into files based on relevancy and dependency), store it in VectorStore (vector db), then retrieve relevant file(s) with VectorStoreIndex
then query the llm
i am currently not sure which RAG and retrieval process i will use, there are a lot of options. I am using top_k 2 at the moment
There is a question in my head
Currently, with the developments in Deep Learning, do traditional ML algorithms such as SVM, Decision Trees, K-Means, etc. still need to be known, or can someone who wants to specialize in ML research skip them and focus only on Deep Learning?
the dataset im using is at&t
the loss seems to go down but the accuracy and weights dont change
Just don't do it in danish or some non English languages
As the postfix is incredibly important there
I would personally say yes, as some problems might be better off with a more classic ML model than a deep learning model.
As an extra, this also builds up a base of understanding of how you would tackle problems instead of always using deep learning. a broader arsenal is never bad.
I am interested in ML Optimization research.
What are some recommended ways to set up a version control system for a small data science project utilizing Jupyter Notebook? I am considering using GitHub as I will be working with a group of friends, but it seems like the notebook metadata differs between our devices.
Sorry if this is the wrong channel to ask this question. Redirect me to the correct one if necessary, thanks!
I would still say yes as it builds an understanding of the fundamental concepts as these are often the foundations for more advanced ML concepts.
I would suggest starting with the basics like linear regression, logistic regression, ... and focusing on understanding the optimization methods like gradient descent and quadratic programming.
Once you get this, you can start by implementing easy deep learning models with optimizations.
If you want you can quickly go over the "classic" ML but i wouldn't skip out on it entirely.
You can try to use something like nbstripout. This removes cell outputs and metadata when you commit.
I've never used it myself but you can surely try it out.
You can also use google Colab.
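To show what stripping a notebook means in practice, here is a rough sketch that clears outputs, execution counts and per-cell metadata from a notebook loaded as JSON. It is not nbstripout's implementation, and wiping all cell metadata is an assumption that may be too aggressive for some projects. Notebook-level metadata such as the kernelspec is left alone here.

```python
import json


def strip_notebook(nb):
    """Return a copy of a notebook dict with outputs and execution counts removed."""
    cleaned = json.loads(json.dumps(nb))  # cheap deep copy for JSON-like data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
        cell["metadata"] = {}  # assumption: per-cell metadata is not needed
    return cleaned


# A tiny in-memory notebook standing in for a real .ipynb file.
example = {
    "cells": [
        {"cell_type": "code", "source": ["1 + 1"], "outputs": [{"text": "2"}],
         "execution_count": 7, "metadata": {"collapsed": False}},
        {"cell_type": "markdown", "source": ["# notes"], "metadata": {}},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
clean = strip_notebook(example)
```

After this, two people running the same notebook on different machines produce the same cells to commit, so the diffs only show real code changes.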
Okay thank you ❤️
can someone help me with this
the loss goes down but the accuracy doesnt go up
im stuck atp
How to make real time object detection with a yolo11 model that i trained? The code part
Do you have an example of the output?
Git
Does anyone know optimization well?
Hyper parameter tuning and pipelines
You're also gonna need this
https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback
And tensorboard
Maybe try this: `# Before training: keep a detached copy of every parameter
initial_weights = {name: param.detach().clone() for name, param in siamese_net.named_parameters()}
# After training: report how much each parameter moved
for name, param in siamese_net.named_parameters():
    diff = torch.sum(torch.abs(initial_weights[name] - param.detach()))
    print(f"Parameter {name} changed by: {diff.item()}")`
remember to always always ask your actual question. don't ask if someone knows about the topic of a secret question.
they have a really good mongodb course so use that and maybe try using faiss library
Are you trying to optimise by speed or accuracy is a really good thing to know too
try using groq and reduce the tokenisation length and use faiss
How would you guys handle extracting specific data from insurances?
Right now i can extract all the text using pdfplumber and OCR, but i still need to extract the specific data like names, conditions, dates,....
The data should be put in to a csv
Note: I cant share the data here because its sensitive data that falls under an NDA
Use ms vision
Very good very cheap
On azure
Accelerate computer vision development with Microsoft Azure. Get insights from image and video content using OCR, object detection, and image analysis.
fyi there is a specific document OCR and its even cheaper than the main AI
I imported dotenv but it still says 'module not found'
this is how i did it:
from dotenv import load_dotenv
any idea why it's throwing this err?
pip install python-dotenv
this is what i installed
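A frequent cause of this is that pip installed the package into a different Python than the one running the script. A quick diagnostic sketch (the helper name is made up) that prints which interpreter is running and whether it can see the module:

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether it can see `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(describe_module("dotenv"))
```

If `found` is False, running `python -m pip install python-dotenv` with the same interpreter that is printed makes sure the package lands in the right environment.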
it simply goes down like to 0.003 from 1.1
and when i test it with the test data
its very bad
ok wait
strotmic this is for you
example output
and the accuracy you're talking about, is this accuracy on the test or train set?
oh the accuracy wait lemme send that too
wait wtf
i changed a tiny thing and now the accuracy comes pretty high
the reason i have test epochs is the test count is only 30 and it selects random photos for test elements so to get a clearer result
i added test epoch
is this time series data?
is
1 your data time shifted
2 is shuffle off
are you not splitting 80:20 randomly using sklearn, but instead slicing the data directly (df[:80] style)?
```
split_index = int(len(df) * 0.8)
train_df = df[:split_index]
test_df = df[split_index:]
```
(make sure that you shuffle the data before doing that, and use iloc)
@rancid sorrel ^
I see
cause it gives you 100% accuracy
oh
How does shuffling time series give you 100% accuracy? Like what's the intuition behind it?
because for time series data it predicts the missing data,
if you're trying to predict x+1, you include the gradient of x+1 in your training data by shuffling and then splitting
essentially you're near enough including testing data in your training data to bork your model
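The leak can be shown with a toy experiment. This sketch assumes a smooth synthetic series and a deliberately dumb "model" that predicts each test point from the training point closest in time; both are invented for illustration.

```python
import math
import random


def make_series(n=200):
    """Smooth synthetic series: (time step, value)."""
    return [(t, math.sin(t / 10) + t / 50) for t in range(n)]


def nearest_neighbor_error(train, test):
    """Mean absolute error when predicting each test value from the closest training time."""
    total = 0.0
    for t, y in test:
        _, y_hat = min(train, key=lambda p: abs(p[0] - t))
        total += abs(y - y_hat)
    return total / len(test)


series = make_series()
split = int(len(series) * 0.8)

# Chronological split: the model never sees anything after the cut.
chrono_err = nearest_neighbor_error(series[:split], series[split:])

# Shuffle-then-split: test points sit between training points in time.
shuffled = series[:]
random.Random(0).shuffle(shuffled)
shuffled_err = nearest_neighbor_error(shuffled[:split], shuffled[split:])

print(f"chronological: {chrono_err:.3f}  shuffled: {shuffled_err:.3f}")
```

The shuffled split scores much better without the model knowing anything about the future: each test point has neighbors from just before and just after it in the training set, so the score mostly measures interpolation.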
This project demonstrates real-time object detection using the YOLOv5n6 model with low-resolution inference for high-speed processing, while drawing the results on high-resolution frames.
https://github.com/SanshruthR/CCTV_YOLO
Fast Real-time Object Detection with High-Res Output https://x.com/_akhaliq/status/1840213012818329826 - SanshruthR/CCTV_YOLO
I would like to do the following. How much math do I need to learn?
- Finding the winning strategy in a card game
- Assessing online ad clicks for significance
- Tracking disease outbreaks using news headlines
- Using online job postings to improve your data science resume
- Predicting future friendships from social network data
what do i do and for how long to get good at ml like you?
I can't get this least squares fit to work in excel and I don't know why. Would python work well for this, and how would I do it? or Mathematica?
dm'ed you
cuz i don't wanna send a block of text here lol
use =PY in excel
and write that in python
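For a straight-line fit, the least squares solution is short enough to write by hand. A sketch assuming the data is two equal-length lists of numbers (the sample points are invented); with NumPy available, `numpy.polyfit(xs, ys, 1)` should give the same line.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the line minimizing the squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented sample points lying exactly on y = 2x + 1.
slope, intercept = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```

The function divides by `sxx`, so it needs at least two distinct x values.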
lol that's a really vague question there is no such thing as bare minimum in maths every problem can be solved with different methods and scenarios. Please pardon my crude analogy here, but it's like saying hey, I want to kill a person what should I use? You can use a bat, a gun, a rocket launcher or just your fists. So, it all depends on what you are trying to use, for example if you need to find the value of tan 37 you can use algebraic methods, geometric methods or idk taylor series. You can use high school maths and Advanced calc but it's going to give you the same thing. One would require idk 3 pages to solve and other would solve it in 2 lines. One requires little knowledge and other requires knowledge of calculus and permutations. So, like just dive into it man, and start solving it and you'd just learn that stuff as you'd progress 🙂
Deepseek R1 better V3?
Excel supports python
hello
sigh .py removed his post now i look like amuppet
it's v3 with CoT-reasoning
if you read the paper, you'll see they used v3 and then some fancy RL for the CoT training
How to make a simple personal chatbot👇🏻
https://www.wahyuikbal.web.id/blog/AI-Engineer-How-to-Integrate-a-Ai-into-Your-Personal-Website
i experience problem with text classification task with hf transformers bert library
anyone has experience with that?
be sure to never ask if someone has enough experience to answer your question. just ask your whole question. give enough information for someone to start answering right away.
this is short code can you explain in vc
sure
remember to always give text as text and not as a screenshot. if this is part of an error message, please give the whole error message, including the parts that you don't think are important.
i am not asking for help here actually, just rambling and being annoyed of pytorch
why do two incompatible types for float exist, torch.cuda.HalfTensor and torch.HalfTensor
because things on the GPU are different than those on the CPU
hm
if you decide that you want help, show the code and the error message, and I or someone else might take a look.
alright, one second.
when learning the mathematics for ML, what topics should I focus on more? Bear in mind, I will be learning univariate and multivariate calculus, as well as some introductory lessons into matrices later in the semester at my uni
https://paste.pythondiscord.com/OKMQ basically the stuff around line 71 in load_model seems to be loaded to cpu in that case
missing from what you said is probability theory
already struggled an hour here to get bitsandbytes to load it on the device I want using .to() or device_map but both are not supported yet; so I came up with torch.cuda.set_device(0) instead
@oblique comet and also the whole error message. (remember to always post both at the same time.)
Im struggling to find good resources in how to learn the statistical/probability aspect of ML. Maths has always been my strongest subject so, I dont struggle with learning the maths (even with limited knowledge), but Im struggling in trying to find good resources. Im going through ISLP and I understand the maths, but I want to understand it further so I fully know what the values are saying
@oblique comet I've never seen all these extra cuda settings (like torch.backends.cuda.enable_cudnn_sdp), but hopefully someone who's experienced in that area will come along.
disabling cudnn sdp was required for para attention; i later replaced that one with teacache instead so that part is obsolete
the error remains sadly even if removing it
just in case; https://paste.pythondiscord.com/VVSQ here is a simplified version that still produces the same error without all the extras
thanks for looking into it at least
adding device_map="balanced" to LTXImageToVideoPipeline fixed it for some reason
hello, this is a pretty much out of topic, but i am a highschooler trying to choose whether i should really focus on my data and statistics research instead of improving on my grades (its around 92 average), since I am still not sure whether universities care about which ones for scholarship
Your grades are probably more important
When you say "research" can you be very extra specific about the context and objectives? @thorny geode
@thorny geode I need to know if this "research" is a personal side project, or something you're doing in an official capacity.
0.0
I freaked out the server by talking about recall and precision in off topic channel 😦
Why would that freak people out
My bad
I am planning to win my national research competition around the end of this year, or at least on a city-level. The competition in my country for mathematics/statistics field is very scarce. For example, the city regional winner only used ANOVA as its main methodology.
For the context, I’ve been steadily learning the book Introduction to Statistical Programming with only basic statistical knowledge, such as expected value and distribution in my high school, and a bronze national olympiad winner in mathematics for general skills (on junior high school though)
ANOVAAAA
bring the t-test
and f-statistics
statistical programming usually focuses on R
have you ever used R?
But I don’t know how this can even contribute to my probability of getting a scholarship (or maybe some internship opportunities), since my teachers suggest improving my score, while edu fairs and university seminars just give a vague idea of “good academic record, extracurricular activities, leadership” stuff
I do use t-test to win my first “mathematical modelling” competition, even though its mostly just statistics
I mean f test is preferred for ANOVA
I have used R before, but I prefer Python as most of the machine learning models are based on Python, so now I have a good grasp on using pandas and matplotlib
you would get much more information with R if your aim is just statistical programming
ML and AI use Python
for example for ANOVA you need to focus on F-statistics
yes, of course, since ANOVA compares more than 2 variables, and F-statistics is made for that
thank you for the info
you can use F statistics for 2 variables too
For two variables the F-statistic in ANOVA is the square of the t-statistic
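A quick numeric check of that identity, as a sketch: with exactly two groups and the pooled-variance (equal variance) t-test, the one-way ANOVA F-statistic equals t squared. The two samples are made up and the helper names are hypothetical.

```python
from statistics import mean

def pooled_t(a, b):
    """Two-sample t-statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up measurements for two groups.
a = [5.1, 4.9, 5.6, 5.8, 6.0]
b = [6.2, 6.8, 7.1, 6.5, 7.4]
t = pooled_t(a, b)
f = anova_f([a, b])
print(t ** 2, f)  # the two numbers agree up to floating-point rounding
```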
but im planning to use more advanced models for winning my championship, and it looks like lasso regression seems very nice… I mentioned ANOVA as even simple hypothesis testing already wins city championship, so improving on my statistical knowledge and skills will bring me up to the national competition with no hard difficulties (hopefully)
Yes, and I believe that is to check the partial effect of adding that specific variables into the multiple regression model
lasso regression is used as a feature selection method
if your aim is to find the most important variables that can work
yes, hopefully I can finish that chapter before my semester ends, but Chapter 3 of regression would be really sufficient in my research
we can use forward selection, backward selection, or mixed selection
yeah stepwise regression too
and for a low number of variables, we test all the combinations of variables to check all the possibilities
check this 😄
@serene scaffold I’m sorry, i moved into another conversation
this part can be helpful for you
for feature selection tasks
ooh yeah that will be a nice cheatsheet if im confused what to do in my research later
I’m thinking about using BMKG meteorological data in predicting crop yield, as my country really focused on agriculture
so its not logistic regression
y is not categorical I assume
you can use linear regression
Im not sure if you learned tree based models but they are good as well
XGBoost can be good with nonlinear relations
I gave up that. the function is too complex I can't fit
{"job_title": "Asia Finance Controller", "tags": ["Manager", "Director"]},
{"job_title": "Assistant Audit Manager AVP", "tags": ["Manager", "Director"]},
{"job_title": "Business Controller", "tags": ["Manager", "Director"]}
]
# Preprocess data
df = pd.DataFrame(data)
mlb = MultiLabelBinarizer()
df['labels'] = list(mlb.fit_transform(df['tags']))
# Convert to Hugging Face dataset
dataset = Dataset.from_pandas(df)
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(mlb.classes_), problem_type="multi_label_classification")
# Tokenize data
def preprocess_function(examples):
    return tokenizer(examples['job_title'], truncation=True, padding=True)
tokenized_dataset = dataset.map(preprocess_function, batched=True)
# Ensure labels are of type torch.float (this is required for multi-label classification)
def cast_to_float(example):
    example['labels'] = torch.tensor(example['labels'], dtype=torch.float)  # Convert labels to torch.float
    return example
# tokenized_dataset = tokenized_dataset.map(cast_to_float)
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_dir="./logs",
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # Ideally, you should split this into train/test datasets.
)
# Train model
trainer.train()```
I get this error : RuntimeError: result type Float can't be cast to the desired output type Long
!traceback
Please provide the full traceback for your exception in order to help us identify your issue.
While the last line of the error message tells us what kind of error you got,
the full traceback will tell us which line, and other critical information to solve your problem.
Please avoid screenshots so we can copy and paste parts of the message.
A full traceback could look like:
Traceback (most recent call last):
  File "my_file.py", line 5, in <module>
    add_three("6")
  File "my_file.py", line 2, in add_three
    a = num + 3
        ~~~~^~~
TypeError: can only concatenate str (not "int") to str
If the traceback is long, use our pastebin.
what do you mean
this is the last line of the error message. please show the whole entire thing.
ok
how did you produce this code?
with ai
how much experience do you have writing python code?
@somber fractal I'm concerned that you don't know enough about what you're trying to do to benefit from any help I might give you.
i see
if you look at this code, you'll see that there's only three data instances, and that you don't divide them into training and testing. which means that the model will be useless.
this AI output is just intended to be used as an example.
i am not such dumb ) i know
i have to solve just the bug
open to any solution not to critics
@somber fractal the problem is probably that tokenized_dataset contains the wrong data type.
it looks like you commented out # tokenized_dataset = tokenized_dataset.map(cast_to_float). I wonder what error you were getting before, if any
to avoid this float data type error i commented out that function but didnt work
i wonder which parameters should i include to the tokenizer function to make the dtype long,
if it will have any positive effect ofcourse.
i have deadline thats why dont have enough time to research the documentation, thats why i am here.
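One thing that might be worth trying, as a sketch: this error message is typical when the multi-label loss (binary cross-entropy with logits) receives integer targets, and `MultiLabelBinarizer.fit_transform` returns an integer array. Assuming that is the cause here (the full traceback was not posted), the labels would need to be floats before they reach the dataset. The pure-Python helper below is hypothetical and only shows the encoding; in the posted code the equivalent change would probably be casting the binarizer output, e.g. with `.astype("float32")`.

```python
def multi_hot_float(tag_lists):
    """Encode lists of tags as multi-hot rows of floats (0.0 / 1.0)."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1.0 if c in tags else 0.0 for c in classes] for tags in tag_lists]
    return classes, rows

# Made-up examples in the same shape as the "tags" field above.
tags = [["Manager", "Director"], ["Manager"], ["Director"]]
classes, labels = multi_hot_float(tags)

print(classes)
print(labels)  # every entry is a Python float, not an int
```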
How would you guys deal with prompt condensation? i want to reduce process time for parsing and for that i need to reduce input-output tokens
Use =LINEST() then go into the graph, right click the line and you can change to polynomial in the graph window when you click properties
openai with softbank is building stargate, a new company, to invest 500 billion over four years
possibly 🙂
thank you for your tips
Hey guys, have anyone worked with ocr fine tuning. I have omani number plate datasets. But couldn't find a proper ocr model to fine tune it. Can anyone help me with it?
Take a look around "YOLO (you only look once)" models or just roboflow, e.g. https://universe.roboflow.com/roboflow-universe-projects/license-plate-recognition-rxg4e
Just use openCV lol
It works good enough
Okay cool
But I wanted to know if anyone has experience with fine tuning an ocr model
https://paste.pythondiscord.com/4NJA Could I get a triple check on my math
It's a bit tough telling whether the problem is the model or the diffusion logic
dude ocr is just a ml model
Is AI is difficult field or easy for learning
AI/ML is a very difficult field, but you can easily learn to leverage existing models and even make your own with libraries that abstract away all the complex math and understanding required to build them from scratch
I just tried to make a matplotlib figure so large that I got a warning saying I might be getting DOS'ed
lol
Would this be the right place to discuss zencoder and copilot?
What about them? How to build one? Or how to use them? If build, yes... if just how to use them, that's more a general (#python-discussion) or OT discussion, depending
They aren't working. I'll ask in gen thank you!
If zencoder isn't working, you're better off asking their slack. I'm not sure about copilot's community.
Thank you for digging that link for me! I appreciate it!
Can someone tell me where to get help with using my gpu with tensorflow on windows? I've tried multiple combinations of cuda/cudnn/drivers/python/tensorflow. I made sure they're all compatible each time. I've tried miniconda, anaconda3, and WSL2 with docker, and although I seem to have set them up correctly, tensorflow can't see my gpu in each case. I've also tried about 25 hours of consulting chatgpt. nvidia-smi does correctly show my gpu.
tensorflow development is winding down--I recommend switching to pytorch
I can help you install pytorch on windows.
😂 Yeah I just realized pytorch would be a much better solution. I'm installing an earlier version of cuda now, just for pytorch, and I'll let you know how it goes. Thanks for answering!
in my defense, I thought pytorch uses tensorflow. I just found out that it doesnt after I posted for help
Got it working already - Ty again!
nice!
I think
Though I'm having a blood hard time telling if so
https://github.com/lucaswalkeryoung/Diffusion that it's working
Could I get someone to look over my DDPM class to check my denoising logic? The loss is dropping really nicely, but it's kinda hard to tell if it's working or not
nightmare fuel
lol
also, what was that lib you were shilling now? was it seaborn or plotly?
ah, right
I had seaborn on my mind for some reason, so was a bit confused when I found out it's a matplotlib wrapper when I thought plotly was that 😅
plotly do be looking nice indeed, yeah
Hi, I made a post on #1035199133436354600 but it got closed for inactivity. I'm using Anaconda on Windows and have installed Jupyter Themes using conda install -c conda-forge jupyterthemes and tried to change the theme to Onedork by running jt -t onedork and restarting the Jupyter Notebook and refreshing my browser cache but the theme does not change, nor do I have the option in the themes menu to switch it to Onedork. Here is my log on a fresh reinstall https://paste.pythondiscord.com/MBDA
hi
can anyone go through this notebook and explain me whats there in the dataset
i cant understand anything
please ping me if you answer
RL is incredibly hard, where did you guys start?
I am really interested in shifting towards this focus. SWE is my love but AI is rapidly replacing in this field.
AI is definitely not replacing SWEs
How do you build a discord bot AI that gives answers based on specific sources like a google document? How to start?
the text is actually barely comprehensible, it is really badly written, took me several rereads to understand what is even going on and I still don't understand a couple things
the gist of it though is that each row in that table represents a recording made over several hours, but with a total recording time of 23.6 seconds, that was then split into 4097 "buckets" where each bucket represents a 23.6/4097 seconds from the recording, then those 4097 buckets were split into 23 chunks where each represents 1 second of that recording in which you have 178 of those buckets and so each row is those 178 buckets and there is the label at the end
basically, as I understand, you can think of the X1..178 as something like
X1 recorded at 00:00:00
X2 recorded at 00:00:30
X3 recorded at 00:01:00
X4 recorded at 00:01:30
...
where each record is some value from the EEG data
so, you have 178 features (X values) that summed together by how long each record is would make up a whole second, but the actual observation time might span several minutes/hours and then at the end you have the label, 1 for a seizure and the others for no seizure
so they are essentially trying to predict a seizure from say 30 minutes of observation
again, idk what is the actual interval of the recordings or what is the total observation period or if I even understood those 23.6 seconds correctly, but that is what I understand from the poorly written text
- this is the data collected from a single subject's (human's) brain (orange are the data, blue is the interpolating line, i.e., what you get with, e.g., plt.plot)
- they recorded each human 23 seconds (23.6 or something but unimportant)
- the recording device takes samples with some frequency; it turns out, it takes 178 samples per second (cool)
- then for each subject, we have 23 * 178 = 4094 datapoint (orange dots)
- we need to make a dataset out of this; how?
- they do it like this: crop each 23 second measurements into 1 second parts. Then your X values (features) will be those orange points in each 1 second interval (178 of them)
- what is y? y is what type of seizure happened in that interval (one of 1, 2, 3, 4, 5)
- ok so we have 178 features and 5 classes
- so X.shape[1] is 178; what is X.shape[0]? In other words, how many instances we have that have 178 features?
- well we have 500 subjects, and for each of them, we have 23 1-second chunks; so 23 * 500 = 11500
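The chunking described in the list above can be sketched like this, assuming each recording has 4097 samples and the leftover samples beyond 23 * 178 are dropped. The placeholder recording and the helper name are made up; real EEG values are not used.

```python
SAMPLES_PER_CHUNK = 178    # one second of signal, per the description above
CHUNKS_PER_SUBJECT = 23
N_SUBJECTS = 500

def chunk_recording(recording):
    """Cut one recording into consecutive 178-sample windows, dropping the remainder."""
    usable = CHUNKS_PER_SUBJECT * SAMPLES_PER_CHUNK
    return [recording[i:i + SAMPLES_PER_CHUNK]
            for i in range(0, usable, SAMPLES_PER_CHUNK)]

# Stand-in recording: 4097 placeholder samples instead of real EEG values.
recording = list(range(4097))
rows = chunk_recording(recording)

# Each chunk is one row with 178 features; all subjects together give X.shape[0].
n_rows_total = N_SUBJECTS * len(rows)
print(len(rows), len(rows[0]), n_rows_total)
```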
Huh
This is actually well within my experience
I worked as an R&D engineer at a neuroscience company in my last role
Although, uhhh... is there a question somewhere? 😅
On a separate note, I'm trying to wrap my head around tree ensembles/random forests. Am I correct in my understanding here?
Basically, we can make a decision tree off of a dataset. A random forest involves changing the dataset up a bit and creating decision trees off of that dataset, with the hope of having a bunch of decision trees that we can hope will agree with each other on the important bits?
As for changing the dataset, I believe with random forests it's random sampling with replacement to create each tree?
Does anyone know of a good server specifically for diffusion model training/mechanics?
I've done the reading, but I need some good old human to human learning
Haha I might try to work on that, it's my weakest area of ML and I do keep seeing those jobs
Also, from what I just learned, I think random forest also makes the choice of feature at each node more random, to differentiate from other tree ensembles
Where does a machine learning engineer go camping? ||In a random forest||
If a consumer product/service uses ML, but the ml that it uses is a random forest, that's how you know it's shit
Out of curiosity, are you speaking to tree ensembles in general or just that specific algorithm?
specifically random forests.
Yes, the two parts that introduce randomness are:
- Sampling from the dataset with replacement to create a new one (bootstrapping)
- Only considering a random subset of the features at each split, not all of them
The idea is that decision trees overfit too much and generally have high variance
introducing the randomness places you in a way better place in the bias-variance trade-off
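The two sources of randomness can be sketched in a few lines of plain Python. This is an illustration only, not how any particular library implements it; the row stand-ins, the feature count and the helper names are made up.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some rows repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
rows = list(range(10))  # stand-ins for 10 training rows

# Each tree gets its own bootstrap sample of the training rows.
sample = bootstrap_sample(rows, rng)

# Each split inside a tree only looks at a few features (here 3 of 9, about sqrt(9)).
features = random_feature_subset(n_features=9, k=3, rng=rng)

print(sample, features)
```

Averaging many trees built this way is what reduces the variance of a single deep tree.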
Oh yeah I agree that specifically random forests are not the optimal tree ensemble. Probably an improvement on bagged decision trees, but I like the boosted trees whose further iterations focus on what was misclassified in earlier trees if I’m understanding the algorithm correctly.
Sure, but the drawback there is that you need to train them in sequence ig
On paper training RF should be faster (but it isn't in any of the implementations I've tried)
Boosting is inherently sequential
But yeah, either way nothing stops you from trying both
No free lunch after all
There will be problems where RF > gbms
That's one thing I've kind of been struggling to learn/figure out. I know the ins and outs of neural networks, and could implement one with pen and paper if need be (preferably would at least want numpy please....). I've been learning the ins and outs of quite a few different machine learning algorithms.
I just struggling with the insight of when to use which for what kind of issue/problem.
Like I'd have no idea if you asked me to give an example where a RF > gbms
And the only thing pushing me towards a neural network over other stuff is only feature amount
But even then that's more of just a gut feeling
than a thing I could defend as truth
I suppose in practice you just modify an existing implementation that works on something similar?
as far as i could tell, 1 was a seizure and 2, 3, 4, 5 were not seizures at all
a binary classification between classes of label 1 and the rest (2,3,4,5)
also, what was confusing me was the mention of:
EEG signals are to ensure the accuracy of diagnosing disease that usually is taken 8-10 hours in the form of records.
The EEG data used in our study were downloaded from 24-h EEG recorded (..)
Which leads me to believe that it was not an actual continuous 23.6 seconds, but rather, that was the total recording time, but it was different than the observation time, which may have been several hours and so the measurements were taken only every couple seconds/minutes, but again, I don't know, it's really hard to read what they have written (as in it's not written very clearly).
oh i see, thank you. Actually im a beginner in data science and I've been given a project in my college so i need some help
thanks to everyone who explained it
im watching videos step by step and working on this project
is it fine with any of you guys that if i add you and ask you my doubts
oh i see, im a student and ive been given this project
the only problem is we were just told to study on our own and complete it
they just provided us with a problem statement
no resources, no dataset
and as a beginner im really confused what to do
in the first week, we are just supposed to do analysis
preprocessing, cleaning, eda, visualization
but i just couldn't understand the data
Hmm. I'm not sure like... how much depth to go into
But EEG is typically time series data that is generally artifact heavy, but artifact cleaning can sometimes clean seizure activity so you have to be careful
If you have any questions about EEG specifically I'd be happy to help, though. Not sure if it's too indepth/specialized for your problem though
i found a notebook on kaggle
im referring it
if i dont understand anything, ill ask it
this is what im working on
is there any other dataset, i tried finding but the one which i sent was the most common one
So
Brass tacks it for me guys
Can I or can I not use mixed precision on an M3 Apple Silicon macbook?
don't overthink it imo
Because a couple of things matter: one algo isn't intrinsically better than another one
In practice being able to robustly evaluate several ML algos matters wayyyyy more than knowing how any specific one works
Because you'd just try them all
Agree. Fast isn’t always the best. Something may break
Hi all, is Scrapy the best python web scraper?
It's subjective. So it depends on the nature of the task and the website involved.
Some prefer Playwright, some prefer Selenium, BeautifulSoup, etc.
I believe you can use mixed precision on M3.
as far as i could tell, 1 was a seizure and 2, 3, 4, 5 were not seizures at all
a binary classification between classes of label 1 and the rest (2,3,4,5)
opposite; 1 is non-seizure, others are some types of seizures (e.g., tonic clonic, complex partial). they are binarizing the problem
EEG signals are to ensure the accuracy of diagnosing disease that usually is taken 8-10 hours in the form of records.
The EEG data used in our study were downloaded from 24-h EEG recorded (..)
Which leads me to believe that [...] the measurements were taken only every couple seconds/minutes
first one is a generic fact, second one implies the dataset used in the notebook is a (rather small) subset of an original, big data. usually these are in the order of 10s or even 100s of GBs (what they have in the notebook is < 10MB). Also you'd lose a lot of information in between if your sampling period was in the order of seconds; temporal resolution of EEG recordings are rather high and typically in the order of milliseconds (in this dataset, it's 1/178 * 1000 = 5.6 milliseconds)
any high performance alternatives for networkx? i see snap.py but i'm struggling to compile it 🥴. i have found igraph
igraph and graph-tool are it
igraph seems good but the docs have massive ads covering everything 😩
Hi all, I have a model trained based on LayoutLM. The training is done, when I run inference on an image, I get the expected result. But I want the result in JSON, so that I can process it further. But there seems to be no way. One thing that I tried is to crop the image with the help of bounding boxes and give it to an OCR tool to recognise the text. But this doesn't work consistently, I'm not sure if it is due to cropping the image. So in short, LayoutLM gives an output with bounding boxes and labels, I use the bounding boxes to crop the image and provide the image to an OCR software to recognise the image. If someone could help me or point me to some resource, it would be really helpful. Thank you in advance.
PS: Mention me here or you can DM if you have experience working with LayoutLM or similar kinds of models.
Is my understanding of dataset prepping correct?
Annotate Data:
For single-object classification: Label each image with a category (e.g., "dog", or "cat").
For multi-object detection: Annotate images with bounding boxes. Label Studio is a solution to do this.
There are scenarios where more than one category appears in an image. Should you therefore always label images with bounding boxes?
It sounds like you're mixing up whole-image classification and detection
Yeah you're right
You need to classify some images first before you can detect if an image contains a category I would assume?
has anyone read essential math for data science? is it considered a good book for getting a good basic understanding of the maths needed for ML?
For multi object detection unless you have something more specific in mind you can just label it with each type of object that appears. There are several strategies for that sort of algorithm, the simplest just being running each individual algorithm on it lol
Otherwise you can use a soft max activation
are there any good algorithms/models that break ovo words into morphemes?
the only library I can find for this is abandoned https://polyglot.readthedocs.io/en/latest/Installation.html
yea i also found that, been reading into it
I tried to use it just now and the website that hosts the models appears to be gone.
ahhh
In [5]: downloader.supported_languages_table("morph2")
HTTPError: HTTP Error 404: Not Found
damm that's kinda bad lol
Affixes attached to the beginning of English words.
For more information, see Appendix:English prefixes.
Category:English prefix forms: English prefixes that are inflected to display grammatical relations other than the main form.
Category:English terms by prefix: English terms categorized by their prefixes.
Affixes attached to the end of English words.
For more information, see Appendix:English suffixes.
Category:English suffix forms: English suffixes that are inflected to display grammatical relations other than the main form.
Category:English derivational suffixes: English suffixes that are used to create new words.
Category:English diminutive s...
dam u need to drop research tips it took me like 4-5days to find that
but yea i've been using it
I am a computational linguist.
true, i remember, i'm new to NLP
lemmatization has been a decent fall back
why do you need to do this
incase a word isn't present in my pre-defined dataset i'm creating but is a valid word
feel like i'm butchering the explanation
you want to replace out-of-vocabulary (OOV) words with the in-vocabulary word that most closely approximates its meaning?
yea kind of i'd also like to be able to segment them into their morphemes
why?
no real reason really just think it could be good meta-data
I suspect there's no good solution for this because people don't really need to do it these days
ahh fair fair
Just out of curiosity how would you approach that
I would give up.
part of life is recognizing what you can't do and cutting your losses.
just kidding. I mostly deal with interactive LLMs these days, where that isn't an issue. but I suppose you could take the word in the vocabulary with the shortest cosine distance to the OOV word.
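A minimal sketch of that idea, assuming you already have a vector for the OOV word (for example from a subword-based embedding; where it comes from is left open here). The 3-d vectors and the helper names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_in_vocab(oov_vector, vocab_vectors):
    """Return the vocabulary word whose vector is most similar to the OOV vector."""
    return max(vocab_vectors,
               key=lambda word: cosine_similarity(oov_vector, vocab_vectors[word]))

# Made-up toy embeddings; real ones would have hundreds of dimensions.
vocab_vectors = {
    "run": [0.9, 0.1, 0.0],
    "eat": [0.0, 1.0, 0.2],
    "blue": [0.1, 0.0, 1.0],
}
oov_vector = [0.8, 0.2, 0.1]  # hypothetical vector for an unseen word such as "jogging"

best = nearest_in_vocab(oov_vector, vocab_vectors)
print(best)
```

Highest cosine similarity is the same thing as shortest cosine distance, since distance is 1 minus similarity.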
Yes, I have annotated images with labels using Label studio. I do not have an issue with training or running inference on the model. Those work perfectly fine. I have run an inference on an image. Now to run the inference I give an image, and output gives an image with its identified labels. See here the output is an image, but I want an output in a different format, let's say a JSON so that I could do some post processing on the identified data.
I really don't want to classify the image, I want to know what is in an image, group it with labels, so I can do some process with that data.
Here's a gist of what I'm trying to do, maybe this could help. Let's say I have some 100 invoices (in images). What I would need is, I would like to get the details from an invoice, such as invoice number, amount etc. So, instead of plain OCR to recognise text, LayoutLM also has been used to identify what type of text it is.
Everything is good now, I give an image, LayoutLM tells me what the invoice number is. But the problem is the output is an image, with a bounding box and labels. So I'm not really able to do anything with the data. I can visually see it, but I would need it in a JSON format or something so I can write some code on top. Hope this helps.
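One possible direction, sketched under assumptions: LayoutLM-style models take OCR words and their bounding boxes as input, so the text is normally already available next to the predicted labels and does not have to be re-read from a cropped image. The word list, label names (including `"O"` for tokens outside any field) and box format below are hypothetical; how they are obtained from the actual pipeline is not shown.

```python
import json

def predictions_to_json(words, labels, boxes):
    """Group words by predicted label and keep each word's bounding box."""
    fields = {}
    for word, label, box in zip(words, labels, boxes):
        if label == "O":  # assumed tag for tokens that belong to no field
            continue
        entry = fields.setdefault(label, {"text": [], "boxes": []})
        entry["text"].append(word)
        entry["boxes"].append(list(box))
    result = {label: {"text": " ".join(entry["text"]), "boxes": entry["boxes"]}
              for label, entry in fields.items()}
    return json.dumps(result, indent=2)

# Made-up per-word predictions for one invoice image.
words = ["Invoice", "No", "INV-001", "Total", "42.00"]
labels = ["O", "O", "INVOICE_NUMBER", "O", "AMOUNT"]
boxes = [(10, 10, 60, 20), (65, 10, 80, 20), (85, 10, 140, 20),
         (10, 200, 50, 210), (60, 200, 100, 210)]

output = predictions_to_json(words, labels, boxes)
print(output)
```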
Hi, Guys
"Can I get a 'Hi' from individuals who have successfully established their careers in data science?"
How do I undo this: ```py
self.transforms = transforms.Compose([
    transforms.Resize(64, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```
The normalization part I mean.
normalize is x' = (x - mean) / std, so to undo it should be just x = x' * std + mean
you can rephrase that into (x - (-mean / std)) / (1 / std) so you can throw it into a Normalize if you want to
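Both forms can be checked on a single pixel in plain Python. This is a sketch with hypothetical helper names; with tensors the same arithmetic applies per channel.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)

def normalize(pixel, mean, std):
    """Per-channel (x - mean) / std, the same formula Normalize applies."""
    return tuple((x - m) / s for x, m, s in zip(pixel, mean, std))

def denormalize(pixel, mean, std):
    """Inverse of normalize: x = x' * std + mean."""
    return tuple(x * s + m for x, m, s in zip(pixel, mean, std))

def inverse_normalize_params(mean, std):
    """Mean/std that make a second normalize call undo the first one."""
    inv_mean = tuple(-m / s for m, s in zip(mean, std))
    inv_std = tuple(1 / s for s in std)
    return inv_mean, inv_std

pixel = (0.2, 0.5, 0.9)                      # made-up RGB value in [0, 1]
normalized = normalize(pixel, MEAN, STD)     # now in [-1, 1]
restored = denormalize(normalized, MEAN, STD)

# Same result by feeding the rephrased parameters into the normalize formula.
inv_mean, inv_std = inverse_normalize_params(MEAN, STD)
restored_again = normalize(normalized, inv_mean, inv_std)
print(restored, restored_again)
```

With mean and std both 0.5, the inverse parameters come out as mean -1 and std 2 per channel.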
no matter what I do I can't update scikit-learn on kaggle, I also changed this option and restarted my book but still older version of sklearn get's imported
iirc you can run !pip install in cells
latest environment prob just means latest kaggle environment, which may not necessarily have the latest sklearn
It's been all over headlines recently that China has been bypassing a lot of the legwork for training actual models by shortcutting with knowledge distillation.
Surely knowledge distillation has to run into hiccups at some point? I'm not a technical expert but it makes intuitive sense that shortcuts are not sustainable
What are the disadvantages of knowledge distillation?
Anyone successful integrated chatgpt api into python? What you use it for?
to test if ChatGPT is actually good at certain tasks.
What do you mean precisely? Calling the ChatGPT api with Python?
If so, yes.. I use it for a ton of things but mostly good ol' retrieval augmented generation
File is not loading permission error
that file isn't a path to a CSV. it looks like it's a whole folder.
Hi guys, i need little help with sklearn library and decision tree classifier. I need to find out why sorted data have higher impact on classification than not sorted but don't know how to start
anyone here had issues with Letta framework?
My Letta Free model keeps showing Failed to send message
Any time you need help with an error message, always show the whole entire error message and the code that caused it, even if you don't think it would help
Got it, sorry about that
Pardon me smart people
Is anyone here familiar with diffusion model internals? Or, can anyone point me to someone who is?
My forehead is sore from beating my head against the wall. I need to talk with a human who knows this stuff.
Always give the information that people would need to start helping you.
XD I know the rule about asking to ask. Throwing a butt load of context out though can result in a big wall of text that people often don't want to contend with
The TL;DR is that I've built a diffusion model, all the parts are where they should be, and I've debugged and fine-tuned as best I can. But it still won't learn - dead on arrival kind of thing, not just poor learning
Somewhere between the data preprocessing phase, forward noise injection, the model's architecture, hyperparameter choices, the training regimen, the denoising process, and the potential for programmer error, something is going wrong
How do you know that it's wrong? What is the delta between the current behavior and the desired behavior?
If it's just outputting random noise, you should probably share the training code in a way that shows all your hyperparameters
As with most things about ML and diffusion models in specific, that's a question with a multidimensional answer. The short version is that after 25K steps I'm still getting pure noise despite what seems to me to be substantial improvements in loss.
MSE loss, scaled by a factor of 10000 (no exploding gradients) drops from the 10K range down to between 10 - 100. That's a three to four order of magnitude drop in loss. While I know loss isn't the best metric for assessing a model like this, it's still worth noting. At the start of training I see the faint imprint of what might eventually become structure - but I wouldn't exactly call it structure on its own. Loss drops precipitously before hitting a hard wall. It falls from 10Kish to 1Kish in a few batches, 1Kish to 100ish in a few dozen, and then to a lower bound usually around 20 or so in another few dozen. Then loss stops decreasing almost entirely, and after hitting that point and continuing for about 8 hours (while I slept) it dropped from 20 to, like, 18.
In short, fast learning and then hitting a wall.
This speaks to me of a few things: the model is learning the easy stuff well enough and then getting to a point where it can't learn (not struggles, but fails) anything beyond this. I've tried a few different datasets and configurations of augmentation, and so I'm pretty sure the issue isn't lack of variety.
Hitting a hard wall sounds a lot like settling in on a trivial solution to me. It's found a minimum and it won't budge.
In terms of actual output, the model quickly starts producing almost-structure as it learns then hits the wall. It keeps on with this for a while after learning stops but eventually even this disappears and all I get out is noise again. This, too, sounds like falling towards some kind of trivial solution.
Now, I'm almost definitely over-normalizing. I'm using instancenorm to normalize feature maps individually, and weight norm to complement this. I read an article which said this approach helped their model converge in a fraction of the time and it outperformed a number of modern benchmarks.
Even if over normalization were the problem, though, I've been told that if any reasonably structured and capacitied model can't overfit to a single image then the issue is probably structure and not an issue of normalization/fine tuning. This of course comes with the caveat that lack of variety means lack of interesting gradient.
We get it you use gpt :)
Whatre you referring to by distance between words?
How do you deal with gibberish OOV words, are you taking the full context into account (/other lexicons) or do they just get their own lil vector space?
ChatGPT is the LLM I deal with the least, since it's proprietary
Cosine distance between two embedded representations of the words. Which requires them to both be in the vocabulary of that embedder
I haven't had to deal with gibberish on a scale worth accounting for.
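For anyone following along, cosine distance between two embedding vectors can be sketched in a few lines of plain Python. The three toy vectors below are invented for illustration and do not come from any real embedder.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Invented 3-dimensional "embeddings"; real ones have hundreds of dimensions.
king = [0.9, 0.1, 0.3]
queen = [0.8, 0.2, 0.35]
banana = [-0.2, 0.9, -0.4]

near = cosine_distance(king, queen)   # small: the vectors point the same way
far = cosine_distance(king, banana)   # larger: the vectors point apart
```

This is also why both words have to be in the embedder's vocabulary: without a vector for a word there is nothing to plug into the formula.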
What is the embedder embedding though? Does it just depend on what kind of preprocessing is being done or is it just a handwavey throwing things against the wall
Just jumping in here, jumping back a few messages it sounds to me (with little context) like you're asking what embedding a word actually means
If so, this is super cool. I remember a small sense of awe when I learned this
Ping and I'll expand
I think ive got a decent understanding of what the embedder does, im moreso curious about what's being fed into it in the first place
Hey, I am relatively new to python (finance major). would someone mind setting me up with some resources for python basics and data science essentials?
landed a DS internship for the summer but want to make sure i know a good amount before i get there, still quite a bit behind
I recommend doing the kaggle pandas tutorial so that you'll know how to manipulate tabular data.
ill take a look at this right now
Have fun!
thank you! would you recommend any subscriptions?
Now is better than never.
!zen right now
Although never is often better than right now.
sounds good
@serene scaffold https://www.kaggle.com/learn/pandas
is this the right one
Solve short hands-on challenges to perfect your data manipulation skills.
Yes
perfect, thanks!
also how much time would you recommend i dedicate per day leading up to my internship start date (may 18th)? i know everyone has a different learning curve but just want to get an idea of how much i should do
As much as you feel like doing. Don't burn yourself out.
Sounds good. I'm here every day because I have issues
I need an arithmetic check
I'm trying to simulate the reverse diffusion process without the model so I can be sure it's working properly. The forward process seems to be working, but the reverse isn't
Guys i have a problem where when i build an app using pyinstaller , the app i have currently selected or windows explorer automatically closes
Wrong channel XD
it doesnt change the fact that i am in need of desperate help 🥲
@sudden canyon can i get help
def forward(self, xₜ: torch.Tensor) -> tuple[torch.Tensor, ...]:
    ϵ = []
    for t in range(1, self.timesteps + 1):
        ϵ.append(ϵₜ := torch.randn_like(xₜ))
        ãₜ = self.ã[t]
        b̃ₜ = self.b̃[t]
        xₜ = (xₜ * ãₜ) + (ϵₜ * b̃ₜ)
        if not t % 10:
            self.transforms_reverse(xₜ).save(f"Outputs/forward_{t}.png")
    return xₜ, ϵ

def reverse(self, xₜ: torch.Tensor, ϵ: list[torch.Tensor]) -> torch.Tensor:
    for t in reversed(range(1, self.timesteps + 1)):
        zₜ = torch.randn_like(xₜ) if t > 1 else 0
        ϵₜ = ϵ.pop()
        bₜ = self.b[t]
        b̃ₜ = self.b̃[t]
        b̃̄ₜ = self.b̃̄[t]
        ãₜ = self.ã[t]
        xₜ = ((xₜ - (ϵₜ * (bₜ / b̃̄ₜ))) / ãₜ) + (zₜ * b̃ₜ)
        if not t % 10:
            self.transforms_reverse(xₜ).save(f"Outputs/reverse_{t}.png")
    return xₜ
I've recreated the formal algorithms for forward and reverse perfectly. It should be the case that I take the original and apply noise one step at a time and save the noise - since I don't have a model to predict it for me. Then I pop the noise off in reverse order and apply the denoising algorithm. It should do the thing. But all I get is noise coming back out
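One possible explanation, assuming `ã` is sqrt(1 - β), `b̃` is sqrt(β), `b` is β and `b̃̄` is sqrt(1 - ᾱ) (the chat never defines these, so this is a guess): the DDPM sampling formula expects the model's estimate of the total noise relative to x₀, scaled by β / sqrt(1 - ᾱ), and then adds fresh noise z. The list saved in `forward` holds per-step noise instead, and that is inverted exactly by x₍ₜ₋₁₎ = (xₜ - sqrt(β)·ϵₜ) / sqrt(1 - β) with no z term. A minimal sketch in plain Python, with lists of floats standing in for tensors and a linear schedule chosen only for illustration:

```python
import math
import random


def linear_betas(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linear schedule; index 0 is padding so that t runs from 1 to timesteps."""
    step = (beta_end - beta_start) / max(timesteps - 1, 1)
    return [0.0] + [beta_start + step * i for i in range(timesteps)]


def forward(x0, betas, rng):
    """x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t, keeping every eps_t."""
    x, eps_history = list(x0), []
    for t in range(1, len(betas)):
        a, b = math.sqrt(1.0 - betas[t]), math.sqrt(betas[t])
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        x = [a * xi + b * ei for xi, ei in zip(x, eps)]
        eps_history.append(eps)
    return x, eps_history


def reverse_exact(x_t, betas, eps_history):
    """Undo forward(): x_{t-1} = (x_t - sqrt(beta_t) * eps_t) / sqrt(1 - beta_t)."""
    x, remaining = list(x_t), list(eps_history)
    for t in reversed(range(1, len(betas))):
        a, b = math.sqrt(1.0 - betas[t]), math.sqrt(betas[t])
        eps = remaining.pop()  # the last stored noise belongs to the current t
        x = [(xi - b * ei) / a for xi, ei in zip(x, eps)]  # no fresh noise added
    return x


rng = random.Random(0)
betas = linear_betas(200)
x0 = [0.5, -0.25, 1.0]  # stand-in for image pixels
x_noisy, eps_history = forward(x0, betas, rng)
x_back = reverse_exact(x_noisy, betas, eps_history)  # matches x0 up to float error
```

If this round trip works on real images but the sampling formula does not, the schedule and preprocessing are probably fine and the mismatch is in which noise the reverse step is fed.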
is 5:30 hours for inferencing with distilbert over 300k samples too slow ?
5 hours for inference is crazy. If it takes 5hrs just for inference, I can only imagine how computationaly expensive the actual model training was 😮
Are you running this on a CPU? If yes, then that explains why.
Real-time monitoring, object tracking, and line-crossing detection for CCTV camera streams. https://github.com/SanshruthR/CCTV_SENTRY_YOLO11 https://github.com/user-attachments/assets/e29ad9df-b810-4308-b6a8-4ff81019edea
yes even I was thinking this was insane
this is with distilbert
Yes. There's a much better way to do what you're trying to do. If I have time later tonight, I'll respond to this again with an example.
Meanwhile, are you just using the pretrained model for evaluation without fine-tuning on your target data?
yes, it's ok thank you, i'll try on my own i'm looking for a good tutorial, one question is my code more cpu intensive
Where could i find some real world ai problem statements which can be then useful for future
how are embeddings calculated for any model? i feel there must be human involvement cuz what other way does it have to know how to tokenize words
make it 30fps , so that it will work lag free real time
You mentioned you're using Tesla P100, so I presume the machine you're using has an active accelerator (based on line 12 of your code)
To verify, print(device) and confirm it's not showing 'cpu'
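Besides the device, a common reason inference over 300k samples takes hours is calling the model once per sample. The pasted code is not shown here, so whether that applies is unknown. The sketch below only illustrates the batching pattern: `encode_batch` is a hypothetical callable standing in for "tokenize a list of texts and run the model once", and `fake_encode_batch` is a stand-in so the example runs without a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(texts, encode_batch, batch_size=64):
    """Call encode_batch once per batch and flatten the per-text results."""
    outputs = []
    for batch in batched(texts, batch_size):
        outputs.extend(encode_batch(batch))
    return outputs


calls = []  # records how many texts each model call received


def fake_encode_batch(batch):
    """Stand-in for a real model call; returns one number per text."""
    calls.append(len(batch))
    return [len(text) for text in batch]


results = run_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_encode_batch, batch_size=2)
```

With PyTorch, the real `encode_batch` would typically also run under `torch.no_grad()` and move both the model and each batch to the GPU device.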
https://paste.pythondiscord.com/2R6A
I present to you the cleanest DDPM denoising logic ever written
The space after the dot at line 13 triggers me for some reason
XD
My code can be triggering
It's been described as "different" more than once. Honestly, I find most people's code illegible
is deepseek a chatgpt wrapper
No, it's an entirely separate LLM
but trained from results given by chatgpt and others?
that's unlikely, they probably had their own dataset
Thanks. I kept hearing from news and videos that they kinda used something from LLM predecessors but am totally not sure what that was.
tbf, I haven't looked into it too much, but using an LLM to train another seems like a pretty terrible idea 
it's likely that it does
i.e., think about how much of the internet's texts are now LLM-generated; if any of that goes into the training dataset, then technically yes it uses something from predecessors
it's also why you'll see a lot of the same LLM-isms across multiple models
cause nearly everyone uses synthetic datasets generated from larger llms
For now
example: look up ShareGPT datasets
as the name suggests, all of these originate from conversations between a human and llm; they might've undergone further processing, but still
There's going to be a tipping point where generative models can produce works good enough to feed other models
we're already doing that tbh, very commonly in fact
at the very least, stuff spit out by very large llms are often good enough for training smaller ones
Totally. I'm also going to use ChatGPT or similar to build the initial embeddings for magic cards to train a deck builder
I'm thinking of a GNN based diffusion model which can diffuse either decks from cards or cards from decks
In other news
My diffusion model is learning!!!!!
sick
This is definitely structure
Early days, but its further than I've gotten before and it seems to still be learning
cool what's it tho
Well it's just a blob right now XD
It isn't just pure noise. So... progress
nice what is it supposed to be?
i just came here
guys, so I have this task where I pull files using R, and I need to automate this,
I usually use RStudio to pull csv files for particular dates,
can someone share some links that will help me automate it on databricks?
(I'm not even sure if this is the right channel to ask this; if not, do let me know and I'll post it in the correct one)
is databricks a python thing?
databricks is an environment where we can run all kinds of languages and also we can use it to automate stuff and revert to prev versions of the code
it's like jupyter but better
Has anyone here done freelance work in AI, ML or DS? Can you please share your experience and journey? I'm a beginner in the freelance field
Is anyone still using tflite-model-maker? I cannot get it installed, even the colab notebook referenced in google's tutorial is broken. The devs seem to be aware of this and recommend mediapipe_model_maker, but that does not support audio.
Hey guys I am new here hope everything is fine 🙂
Welcome to the server :)
what is the right way to choose a model for generating contextual word embeddings ?
Normally I just test models with the model or system it is intended to be used with (let's say a classifier) and comparing the evaluation results of that.
Normally BERT based models are the gold standard, although pre-computed systems like GloVe can be useful in situations where you have a lot of data and not a lot of compute since GloVe just becomes a lookup in a table rather than sets of matrix operations.
Personally, I've found intfloat/multilingual-e5-large and intfloat/e5-{small/medium/large} to be excellent models for their size and compute cost. Worth making sure whether or not you need a model that can understand multiple languages and the association between words in different languages or if just a single language model works for you.
From what I have seen, if you want multi lingual models, you likely will have to go with larger models with bigger embedding sizes in order to maintain good accuracy, although again, depends on usecase
Cool, I haven't heard of GloVe before, I'll have to read about it
GloVe is like one of the OG ways of doing text embeddings before BERT and Transformer LLMs became all the rage
i'm performing an information retrieval (semantic search) task, i've 300k samples of english text, i thought to go with adv nlp techniques as this is an adv nlp project
i'm not really sure about GloVe and how it scales well with large data
I am working on building an app that manages my pantry, my recipes, and my shopping list. I was wondering how effectively I could integrate such paid features as price tracking (such as for bread, eggs, milk, rice, and beans), and ai-generated recipes "using what you have in stock". how "reliable" is ai for recipe creation, and what tricks could I use with my prompts?
I've used ChatGPT to produce recipes. but LLMs can't do math. I once got a ChatGPT recipe and said "scale the recipe by half and convert all the units to grams", and it generated Python code to do that conversion.
yeah, that's what I was worried about. but I guess it could implement the teachings from The Flavor Bible rather well, since not everyone has the time to read that book
so you'd be doing RAG?
Well, that's a possibility, especially if it is selfhosted, but I do want to eventually turn it into a native app on the appstores and/or fdroid, so then we'd get size constraints and copyright issues.
Hello, I'm looking for the best dependency for handwritten OCR.
Help!!!!!!!!!!!!!!!!!!!!!!!1
Help!!!!!!!!!!!!!!
@remote stream please move your question to #packaging-and-distribution
I'm having fun
I'm working on building an embedding system for magic cards
And I'ma build a diffusion based deck/card builder 😄 😄
are the token type ids needed for multilabel classification? Using Bert
if you're trying to classify each token, then the y value for each token needs to be an n dimensional vector for n classes, where each element i is 1 if that label belongs to the ith class, else 0.
is it BCEWithLogitsLoss?
and if you have a sequence of m tokens (such as a sentence), then the y value for the whole sequence is an array of shape (n, m)
for the loss function
you need a target for calculating the loss, yes.
even if there are five targets ?
as in, each token can have between zero and five labels?
ok, with multilabel classification, with Bert, is it bcelosswithlogits because each token is being ran through Bert and the probability that the token type is accurate to one of the tokens has to be x (- [0,1] with the accuracy being higher for each feature being assigned to one of the five(just, you know, some categorical target) target values? is that why it is Bceloss?
does -[0, 1] mean x: -1 <= x <= 1?
the - is confusing.
suppose you have a sequence with m tokens, and each token can belong to n classes (potentially none of them). then the output from BERT will be an array of shape (n, m), where each element (i, j) is a number between 0 and 1, representing the probability that the jth token belongs to the ith class.
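The (n, m) target layout and the loss being asked about can be sketched in plain Python. This is only an illustration with invented labels. In PyTorch the equivalent loss is `torch.nn.BCEWithLogitsLoss`, which applies the sigmoid internally, so the classification head on top of BERT would output raw logits rather than probabilities.

```python
import math


def multi_hot(labels_per_token, n_classes):
    """Return an (n_classes, m) nested list; entry [i][j] is 1.0 if token j has class i."""
    m = len(labels_per_token)
    target = [[0.0] * m for _ in range(n_classes)]
    for j, labels in enumerate(labels_per_token):
        for i in labels:
            target[i][j] = 1.0
    return target


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (class, token) pairs, from raw logits."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, y in zip(logit_row, target_row):
            # Stable form of -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))]
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# 3 tokens, 5 classes: token 0 has classes {0, 3}, token 1 has none, token 2 has {4}
targets = multi_hot([{0, 3}, set(), {4}], n_classes=5)
logits = [[0.0] * 3 for _ in range(5)]   # a model that outputs 0 everywhere
loss = bce_with_logits(logits, targets)  # every entry contributes log(2)
```

Each (class, token) entry is scored as its own yes/no question, which is why a token can have zero, one, or all five labels without the targets conflicting.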
ok, and is sigmoid the optimizer?
sigmoid is an activation function
binary features?
the activation function when trying to predict a value for a target between 0 and 1, not softmax
I get it now
you can use sigmoid in any situation where you want to squeeze an individual number to be between 0 and 1.
softmax is nice because it squeezes each element in a vector to be between 0 and 1, but proportionally to each other, such that they sum to 1.
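A small plain-Python sketch of the difference, using three made-up logits: sigmoid treats each number on its own, while softmax makes them compete for a total of 1.

```python
import math


def sigmoid(z):
    """Squeeze one number into (0, 1), independently of any other number."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Squeeze a whole vector into (0, 1) so that the entries sum to 1."""
    shift = max(zs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]
independent = [sigmoid(z) for z in logits]  # each in (0, 1); the sum is not constrained
competing = softmax(logits)                 # each in (0, 1); the sum is 1
```

For multi-label problems the independent version is the one you want, since several labels can be "on" at once.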
ok, bcewithlogits and sigmoid, no, I was thrown off because multi-label classification values are not treated as categorical
This is the craziest thing I've seen today.
does that person define "math" as "doing calculations by hand"?
absolutely not
i think he is talking about linear algebra, statistics...
Classic mistake of believing something to be universally true because it was true in your personal experience. And boldly stating it to be so without first looking into it further. A very large portion of the most engaging posts involving knowledge fall in this category in my experience online.
Any recognized or valuable certificates for analytics/ML/data science?
None. I work for a research company and participate in hiring decisions for the AI division.
none? IBM? Oracle? Google Analytics?
I implemented that
what i keep hearing is that certificates don't really matter much and they're like cherry on top of cake
that's correct.
Well seems like Ima be cancelling my gpt subscription then
because of deepseek or what
yea deepseek
why does deepseek make you want to cancel your chatgpt subscription?
Becuz is free and is better than gpt
sure, but even if you download the model weights, setting it up so that you can start asking it stuff is non-trivial and requires beefy hardware.
it's easier with ollama but hardware is the main thing
Is anyone around that wants to voice chat? I am working on a system to allow a genetic algorithm to define self organizing automata, and I need to step away from it for a bit, but I want to talk about it with someone
Is there an open-source automated content moderation system that is pre-built and robust?
The approach is to filter content on a CDN both as it comes in (in transit to the database) and at rest within the database/CDN network.
Machine learning is what I heard I need for this. Can I use Ray?
#Project scope.
This is a federation of decentralized cdn.
hey guys, do you know any resource to learn about TorchInductor's IR?
I'd also like to know because i'm curious about what's behind PyTorch 2.0.
This article might not be very helpful to you, but it's still an interesting read https://dev.to/aaronlangford31/lessons-learned-from-using-torch-inductor-for-inference-1ma7
Hello everyone
I'm looking to automate report production from datascience and ML reprocessing.
I produce my stats, reprocessings, graphs with pandas, matplotlib, seaborn ...
I don't have any particular problems with content creation, but I'm more concerned with layout and content use.
What would you recommend for clean formatting/layout to produce printable reports?
I'd like to stick to scriptable python, and avoid PowerBI or similar.
Thanks
what do you want the format of the output to be? pdf? html?
!source filter
Group for managing filters.
no preference , just easy printable
I believe HTML is not the best format to print, perhaps pdf or docx could be better
the main question is how to achieve a clean layout
Thank you
I'll let you know if I find something
What opened the passion to you all for Data science? Been plucking through the code academy career path, but about 45% through I have been losing steam on doing daily four to six hour steady sessions.
I really enjoy each part of the actual practical data analysis but man there is a wide world of things to learn. Just wondering what projects yall have undertaken that are exciting to give me a glimpse of the finish line ya know?
yea theres alot to take in but imo instead of putting it all on one day or smth just have fun and take time to digest it
I have been chewing through it for the last two weeks, trying not to burn out burn out, but dont wanna lose steam on learning ya know?
preparing for my UoT Data science program come september
ooh
what works for me is just try pacing yourself like do some projects which excite u or smth bcuz for me that always helps with burnout n stuff
like in between learning do some fun projs and then reinforce too
I agree, I run into the logical fallacy of: if I keep learning, I will increase my mental toolbelt to solve x or y problem, ya know?
I api called a lot of defunct insurance data and have been making a jupyter notebook as a portfolio project but I feel I am approaching the unknown unknowns of what I can do with it
i have cuda version 11.5 so which version of cuDNN should i install for running tensorflow?
you can refer to this list for tensorflow gpu compatibility w/ cuda versions
https://www.tensorflow.org/install/source#gpu
How fast is pytorch compared to tensorflow and keras?
there isn't a straightforward answer for this question. just use pytorch.
Are you using cpu pytorch or gpu pytorch ?
Well I've been trying to figure out if my computer can handle CPU because I can't tell if I have a GPU I've been trying to look into my computer's model but I've been getting a headache lately
There is something called ray.
https://www.ray.io/
@unkempt wigeon
Is there a way of figuring out from my terminal if my machine has a GPU?
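One way to check from Python, sketched with the standard library only: look for NVIDIA's `nvidia-smi` tool and ask it to list devices. This only detects NVIDIA GPUs with a working driver; the function name `list_nvidia_gpus` is made up for this example.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] if no NVIDIA GPU/driver is found."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # tool not installed, so no usable NVIDIA driver on PATH
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; PyTorch would run on the CPU.")
```

If PyTorch is already installed, `torch.cuda.is_available()` answers the more relevant question of whether PyTorch itself can use the GPU.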
yall whats a good platform to start freelancing on w/ python
Why are there so many problems installing tensorflow with a gpu config,
when pytorch is so simple to install in comparison?
Tensorflow development is winding down, so there might be new compatibility issues coming up that aren't getting or won't be fixed.
i want to scrape data from the nansen website to get realtime updates of the values. this is the code :-
# Extract data
trending_data = []
for row in soup.select("div.MuiBox-root.mui-style-70qvj9"):  # Update selector based on actual HTML
    try:
how should i do that? can anyone help?
No data found. Check your selectors or the website structure.
am getting this again and again.
Hi, I am new to ML. I know basic ML training, e.g. how to train a model to recommend music to users based on their age, gender, etc. Now I want to train a model to produce insights on how a website path is doing from its daily metrics like session time, total sessions, bounce rate, etc. Any advice on anything will help; I am still researching how to start and what to do. Talking to chatgpt
Is it actually better though?
Hello
i used keras v3 with torch backend and compared to pure pytorch it was much faster, both on GPU
but neither i have a reproducible example of that nor i claim i managed to do GPU adaptation properly in PyTorch
but the intriguing thing was that i didn't have to do anything (nothing) in keras for GPU adaptation
but the drawback was writing a custom loss function, passing the epoch index, using an adaptive learning rate was way harder to bake into the Keras code
so tradeoffs yet again
quantitatively: 10-15 times faster; also I had found a post on the PyTorch discourse where another person was suffering a similar loss of performance, so I'm not alone on that front, I thought/think
10-15x faster seems like a huge number. Are you sure pytorch pipeline is correctly implemented? In my experience, I have also found keras v3 torch backend little faster than pytorch (tabular/image data), but not a significant boost.
keras v3 was a significant improvement over prior versions, supporting jax/torch backend. But I still find it complex when adding custom callbacks as you mentioned. I only prefer using it for quick prototyping or sometimes tabular datasets.
Yeah 10-15x sounds like a significant change
yeah as said, I don't claim i managed to do GPU adaptation properly in PyTorch
Using those metrics or even features made to be used within the path as features trained on their retention/interaction data is a good starting stone
what tutorial should i use as a beginner to data visualisation
bc im pretty sure data visualisation is required as a start to learning ml
If you are doing any benchmarking keras v3 v/s torch, I will be very keen to help/contribute.
I personally feel EDA is something which you learn more with practice. I would recommend doing a couple of basic courses from Coursera (I did this one: "Applied Plotting, Charting & Data Representation in Python"). Also, on https://www.kaggle.com/learn select the data visualization mini course, it's fun.
Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
hmm thanks, maybe I can come up with an MRE, data was public (a Kaggle competition actually), then share what I have here
Sounds good, thanks. Which competition btw?
an old one actually, https://www.kaggle.com/competitions/seizure-prediction/
EEG signals, interesting. Did you use Recurrent networks or transformers?
CNN actually :p
on the spectrogram of the signals
it was not my original idea, of course
and my aim was to compare some models under this dataset with some specific stuff, so neither the dataset itself nor the base model was too critical
only that it ought to have been a time-series based dataset
and that the base model wasn't too "incapable" (sorry logistic regression, you are loved too)
can i somehow download an open source llm model and run it locally with tensorflow?
or keras?
use pytorch
the huggingface website has code for each model for how to run it locally with pytorch. but keep in mind that a lot of models require enterprise hardware.
which one do you want to use?
why pytorch tho? and any, but small one, not rlly planning to do super accurate things
i used to use keras. Is it outdated or something?
yes, it is
exactly
one last thing. I remember doing my custom data augmentation class with keras for my own project. Tho i needed to fork keras and make it not to convert images into RGB (RGBA images will be RGB). Can i do the same with pytorch?
so you're trying to do a task with images, but the color channels are RGBA and not RGB?
would i need to learn data visualisation for ml?
yeah like, my augmentation was giving a random background to RGBA images. Couldnt with RGB
matplotlib is the standard data visualization tool, though it's not very pythonic, unfortunately.
this has nothing to do with the LLM, is a different project, but wondering if i could, since imma move to pytorch
so i should learn it first then jump to tensorflow
learn pytorch instead of tensorflow. by the time you're ready to get a job, tensorflow will probably be completely dead.
ok
so where should i start?
bc theres pandas, numpy, and all this stuff
sci kit
I don't know enough about image processing for this. I imagine you can have an adapter to handle any RGBA <-> RGB conversions.
okey, will look for it. Thanks 🙂
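On the RGBA question: in a custom PyTorch `Dataset` you control how each image is loaded, so nothing forces a conversion to RGB before your augmentation runs. The "random background" step itself is ordinary alpha blending. Here is a minimal sketch in plain Python on 8-bit pixel tuples; the function names are invented, and in practice the same arithmetic would be done on whole arrays or tensors.

```python
import random


def composite_over(rgba_pixels, background_rgb):
    """Alpha-blend 8-bit RGBA pixels over one solid RGB colour, returning RGB pixels."""
    out = []
    for r, g, b, a in rgba_pixels:
        out.append(tuple(
            (fg * a + bg * (255 - a) + 127) // 255  # rounded integer blend
            for fg, bg in zip((r, g, b), background_rgb)
        ))
    return out


def random_background(rng):
    """Pick a random solid RGB colour."""
    return tuple(rng.randrange(256) for _ in range(3))


rng = random.Random(42)
background = random_background(rng)
pixels = [
    (255, 0, 0, 255),  # opaque red: stays red
    (0, 255, 0, 0),    # fully transparent: becomes the background colour
    (0, 0, 255, 128),  # half-transparent blue: mixes with the background
]
blended = composite_over(pixels, background)
```

Doing the blend before any RGB conversion is the point: once the alpha channel is dropped, there is no way to tell foreground from background.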
Hi. Do you have any recommended pandas tutorial for beginners? Most stuffs from youtube are too advanced, not detailed enough and too fast paced
I recommend that you first learn how to use pandas to manipulate and explore data, so that you get a sense for what "data" is in the context of data science.
ok
well this was pretty convenient
the kaggle pandas tutorial. it's interactive. (which is important, because you won't passively learn from watching youtube)
can you send me the link please?
any yt tutorials that can help if stuck?
sure, but it's the first search result for "kaggle pandas tutorial". https://www.kaggle.com/learn/pandas
Thank you
@serene scaffold why do u recommend learning pytorch rather than TF itself?
why do you say "TF itself"? is there a relationship that you think exists between the two?
yeah, but same with keras / TF. One is a higher level framework. I know pytorch is built in top of tf, but tf gives more flexibility, doesnt it?
like, is pytorch just more user friendly?
I know pytorch is built in top of tf
this is false.
i think its bc tensorflow is outdated and maybe not as fast and lacking features
can i do my inputs on vscode or any other ide? Ill just be referring to the guide right?
the first section is creating my own dataframe anyway, which i could do with another ide?
I've never seen anyone in industry use tensorflow, and it seems that development of tensorflow is winding down. as far as I can tell, the only reason anyone still uses tensorflow is because of tutorials that have been written for it.
i see... I thought TF was maintained by google, same as keras. Didnt know they abandoned both
to help navigate row numbers?
Pytorch is from community i guess?
you run the code in the kaggle pandas tutorial
but in general, you can use whatever code editor you want. VSC, pycharm, etc. have no bearing on how the code is actually executed.
google loves to abandon stuff.
oh wait u can set custom index
okey, ty
meta
that's default index, generated when you created pandas df.
torch v/s tf in competitive ML. reference: https://mlcontests.com/state-of-competitive-machine-learning-2023/
Good read.
what is the problem ****
doesnt all the values default
nvm
i had to put it into list forat
format
Yep, has to be in the format of list of dictionaries.
fruits = pd.DataFrame([{"Fruit": "Apples", "Col2": 30}, {"Fruit": "Bananas", "Col3": 21}])
more like this.
thanks for showing the code and the error message. be sure to show the whole error message (you cut off the end) and to show it as text (not a screenshot)
```py
import pandas as pd

fruits = pd.DataFrame({'Apples': [30], 'Bananas': [21]})
print(fruits)
```
easy fix as dicts should be enclosed in `{}` and each key-value should be defined properly
how far can I go with this tutorial? Im currently a data analyst but only using excel and power bi. Was hoping I could upskill with this
what do you mean "how far can you go"?
I meant can i start applying for data analyst role with python required skills, or data scientist or anything related?
I wouldn't put that you "did the kaggle pandas tutorial" on your resume. I'm not sure what the best path is from non-python analyst to higher-paid-analyst-who-knows-python.
I see
But would it be a good baseline tho? Like starting point?
Ofc im also gonna be learning numpy and matplotlib altho im not exactly sure what those DA python do in their day to day job
dataframe manipulation is a core competency.
to clarify, dataframe is from pandas only? Or does other library of python has it too
pandas is the most widely used dataframe platform, but there's also polars.
possibly others.
oh its the other way around ok
what's the other way around?
I thought dataframe is exclusive to panda
What I meant to say is that. Since you mentioned that df manipulation is a core competency, I should focus on learning pandas? More than numpy and matplotlib
pandas and numpy are pretty closely related.
it's easier to use matplotlib if you're solid with numpy and pandas.
any resources for troubleshooting pyspark here? no success with google, stack overflow, etc..
Ok last sorry. When can I tell if I am ready to take on data analyst with python roles? Like what set of knowledge should I be able to do with pandas etc?
If you have to ask yourself that question you probably aren’t ready
https://zenodo.org/records/8006177
This is a wildfire satellite image dataset on Zenodo. It has a single zip file of size 48.4 GB. The zip file contains around 13k .tif files. Can anyone pls tell me if I can download a single .tif file from this to explore its contents before proceeding to download the large zip file?
We present a multi-temporal, multi-modal remote-sensing dataset for predicting how active wildfires will spread at a resolution of 24 hours. The dataset consists of 13.607 images across 607 fire events in the United States from January 2018 to October 2021. For each fire event, the dataset contains a full time series of daily observations, conta...
Well yea but thats not my point
Im saying youll know when ur ready
Thank you if I get stuck or if you have any questions you can video call me my apologies
Hey I'm new to python and am trying to get a github project to work but don't know how to
Feel free to specify your exact question or problem in the server so someone can help
See if there's a website thatll lightly uncompress zip files or at least let u look into them before downloading
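Another option, if the host supports HTTP Range requests (I have not checked whether Zenodo does): Python's `zipfile` only needs a seekable file object, so you can hand it one that fetches byte ranges on demand. It then downloads just the zip's index plus the one member you ask for. This is a rough sketch, not a tested downloader; `RangeReader` and `http_source` are names I made up, and the URL in the usage comment is a placeholder for the real direct download link.

```python
import io
import urllib.request
import zipfile


class RangeReader(io.RawIOBase):
    """Seekable read-only file whose bytes come from fetch(start, end)."""

    def __init__(self, size, fetch):
        self._size = size
        self._fetch = fetch  # callable returning the bytes in [start, end)
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        end = self._size if n is None or n < 0 else min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)  # unbuffered: one request per read
        self._pos += len(data)
        return data


def http_source(url):
    """Return (size, fetch) for a URL; assumes the server honours Range."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])

    def fetch(start, end):
        req = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end - 1}"}
        )
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:  # server ignored the Range header
                raise OSError("server does not support partial downloads")
            return resp.read()

    return size, fetch


# Usage (placeholder URL, replace with the zip's direct download link):
#   size, fetch = http_source("https://example.org/dataset.zip")
#   archive = zipfile.ZipFile(RangeReader(size, fetch))
#   first = archive.namelist()[0]
#   archive.extract(first)   # downloads only that one member
```

The 206 check matters here: a server that ignores `Range` would otherwise start sending the whole 48 GB file.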
Speaking of, another nail in coffin for Conda: https://github.com/pytorch/pytorch/issues/138506
It’s that I don’t know how to make my code work on pyscript.com
oh nice. I wish you had sent me this on a bad day so it could have made my day
lol, glad you had a good day then.
Hey does anyone have any experience with hosting neural networks online? I made a simple ai that could solve MNIST with pytorch and wanted to try and host it online so users could draw a number and pass it as input to the ai, which could then make a prediction and return the predicted value to the user
you'd need javascript for the drawing part. that aside, that sounds like a good way to make a basic web app.
Do you have any recommendations for what to use for the backend? Only made one full stack app ever and I used firebase so it was kind of cheating
It doesn't need to be that complex of a thing. I have no idea how to make the drawing part on the front end, but whatever the drawing part sends to the back end, you just need to convert it to a valid input for the model, pass it through, and send the result back to the front end.
Yeah I see, I already made the front end part pretty much with an html canvas but even though it’s super simple I might not know enough at this point to do it. Are there any resources so I can learn about making backends and sending info to and from them?
I don't really know. that's more of a #web-development question
I think many people usually start with Streamlit, Gradio, or HuggingFace Spaces. (you can start here as well)
You can move to FastAPI or Flask or Django if you wanna create a RESTful API to serve the trained model.
Finally, if you have experience with HTML, CSS, and perhaps Python or Javascript framework (Vue.js or React) you can design the client side (Frontend) to your taste and connect it with the backend (the API you created to serve the model)
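To show the shape of the "convert the drawing to model input, run it, send the result back" step, here is a rough sketch using only the standard library so it does not commit you to a framework. The same three pieces map directly onto a FastAPI or Flask route. Everything here is an assumption for illustration: the JSON format (`{"pixels": [[...28 rows of 28 values, 0-255...]]}`), the function names, and the `model` callable, which stands in for your trained network.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SIDE = 28  # MNIST images are 28x28


def to_model_input(payload):
    """Validate the front-end payload and scale 0-255 pixels to 0.0-1.0."""
    rows = payload["pixels"]
    if len(rows) != SIDE or any(len(r) != SIDE for r in rows):
        raise ValueError(f"expected a {SIDE}x{SIDE} grid")
    return [[min(max(float(v), 0.0), 255.0) / 255.0 for v in r] for r in rows]


def handle_predict(body, model):
    """Turn a raw JSON request body into (status code, JSON response body)."""
    try:
        grid = to_model_input(json.loads(body))
    except (KeyError, TypeError, ValueError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"digit": int(model(grid))}).encode()


def make_handler(model):
    """Build a request handler class bound to one model callable."""

    class PredictHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, out = handle_predict(self.rfile.read(length), model)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(out)))
            self.end_headers()
            self.wfile.write(out)

    return PredictHandler


def serve(model, port=8000):
    """Run the server locally; call this yourself, it blocks forever."""
    HTTPServer(("127.0.0.1", port), make_handler(model)).serve_forever()
```

With PyTorch, `model` could be something like `lambda grid: net(torch.tensor(grid).view(1, 1, 28, 28)).argmax().item()`, but the exact input shape and any normalization have to match how your network was trained. If the canvas page is served from a different origin than this backend, the browser will also expect CORS headers, which this sketch leaves out.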
I'm developing an AI model for biomedical predictive analysis. I'm looking for ideas on how to integrate a LLM into it, ideally using Python. Any suggestions?
Depends on what you are trying to achieve. LLMs are preferred for generative tasks, so if you have a use case for generating text related to the biomedical domain, it's possible to integrate one.
Yeah I wanna add like text support to a GRAD-CAM
I'm creating a semantic search engine app as an Adv NLP project. I'm using Haystack's vector DB and SQLite, and I'm finding it hard to optimize inference for word embedding generation and writing the embeddings to the DB. I found that concurrent.futures is taking up a lot of RAM and CPU even for 3000 samples of data. Should I change to Dask or Ray?
I'm new to distributed computing
ray is a bit more complex so imo i'd try dask first cuz its more simple for scalin n stuff
also batch embeddings before writing to the db could help with the load
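A rough sketch of what batching could look like, using plain `sqlite3` rather than Haystack's own document store API (which I am not reproducing here). `embed_batch` is a hypothetical stand-in for whatever model call you use; it is assumed to take a list of texts and return one vector per text. The table layout is also made up.

```python
import sqlite3
from array import array
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def store_embeddings(conn, docs, embed_batch, batch_size=64):
    """Embed (doc_id, text) pairs in batches; one DB transaction per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings "
        "(doc_id TEXT PRIMARY KEY, vector BLOB)"
    )
    written = 0
    for chunk in batched(docs, batch_size):
        ids = [doc_id for doc_id, _ in chunk]
        # One model call per batch instead of one per document.
        vectors = embed_batch([text for _, text in chunk])
        # Pack each vector as float32 bytes so it fits in a BLOB column.
        rows = [(i, array("f", v).tobytes()) for i, v in zip(ids, vectors)]
        with conn:  # commits once per batch, not once per row
            conn.executemany(
                "INSERT OR REPLACE INTO embeddings VALUES (?, ?)", rows
            )
        written += len(rows)
    return written
```

Only one batch of texts and vectors is held in memory at a time, which is usually where the RAM goes when every sample gets its own worker. For 3000 samples, a single process with batches like this may already be enough before reaching for Dask or Ray.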
Hello . I need some help if anyone could help me with an issue with spotify API that would be great
I'm about to begin creating an LLM chatbot. Before starting I wanted to ask this channel if anyone would be interested in learning and building the model with me. If you would like to help send me a friend request so we can have a conversation.
It seems like we got a clear winner between Deepseek and OpenAI. Listen to this surprising narration!
https://app.reef.lat/analysis/share-presentation?presentation_id=22f936ad-02c0-4f8f-a63c-dae696bc7663&audio=Analysis/reef_audio_presentation_72d2.mp3&text=Analysis/reef_text_presentation_a1f1.md&result=Analysis/visualize_result_1602.html.gz