#data-science-and-ml
cuz like the machine was actually involved in figuring out those conditionals
^ the content may interest you
have deep knowledge and insight into fundamental techniques from Artificial Intelligence, including: basic search methods, heuristic search methods, optimal path search methods, game tree search techniques, constraint solving techniques, planning techniques and markov decision processes
I am actually in the process™ of writing a pathfinding library in C (for Python)
Nice, that could be considered an AI algorithm in the right context
https://onderwijsaanbod.kuleuven.be/syllabi/e/H02C1AE.htm#activetab=doelstellingen_idp2287408 Machine Learning and Inductive Inference
[...] the domain of machine learning, which concerns techniques to build software that can learn how to perform a certain task (or improve its performance on it) by studying examples of how it has been accomplished previously, and in a broader sense the discovery of knowledge from observations (inductive inference).
The course was overly academic and had longer more precise definitions of ML in the syllabus but you can see how conditionals don't pass the litmus test of "machine learning" but decision trees do
I think I see, yes
I recommend AAA instead of AI when you want to describe something that most people would probably imagine to be "AI." Autonomous Adaptive Agents. Autonomous: no human intervention, it can operate/survive on its own in either the real world or a virtual world (e.g. a game). Adaptive: it learns / adjusts to achieve what it needs to, constantly trying to improve. Agent: agent as in game theory agent, an "entity that always aims to perform optimal actions based on given premises and information." Note that it must take actions with consequences, it can't just classify stuff or something like that (without making use of that classification for an action). Most things being advertised as "AI" do not fall under this definition, they don't have the required design goals. They are just tools. The end goal of what is being made is pretty important.
Then you immediately get into the strong/weak AI debate
this works for strong AI but not for weak/narrow
You can go off of feeling with this, when people imagine "AI" they think of something that feels like an animal, not a tool. It operates on its own, achieving its goals.
(Often a human specifically because we lack creativity it seems (in writing and such))
The distinction is in what you are trying to make, not what we currently have. OpenAI is not trying to make AI, it wants to make a tool that can replace certain jobs (in theory, it won't, it's a scam).
That's a different discussion altogether
Personally I don't really care about making my own definitions of AI
I'm just applying the already existing conventions from important literature out there
I just want people to not use AI when they are not really making AI (it's not their goal), it just obfuscates what they are making. It's like saying "i'm selling you a thing." "Have you not heard, everybody wants thing these days!"
they are not making AI based on your specific definition
But then what are they making?
Their definition is that it's what they are making, which is also now in conflict with other companies saying "it's what we are making." Despite being very different things.
Norvig's book on AI is as good as it gets as a reference right? Especially since it predates all of the hype https://en.wikipedia.org/wiki/Artificial_Intelligence:_A_Modern_Approach
All I'm saying is, go in there. Find the definition they use of AI, apply it to whatever OpenAI is making and then the logical conclusion is "they're making AI"
If you were Peter Norvig and you wrote this book in 95 and wrote your definition of AI in this book I'd agree with you. That's how much I care about this discussion (close to zero). It's just about applying convention for me. Without convention, if everyone has their own definitions, it is impossible to have a discussion
Yeah, my definition it is then. "In this book, we adopt the view that intelligence is concerned mainly with rational action. Ideally, an intelligent agent takes the best possible action in a situation. We study the problem of building agents that are intelligent in this sense." - Norvig, page 30.
He brings up the other definitions and lands on this one.
Throughout history this has always been the goal, agents, this current use of "AI" is recent, and meaningless.
The current common "AI" may be part of actual AI, but it's not on its own, because that is not the end goal.
Probably the largest distinction being whether it's autonomous and an agent.
This is vague enough that an "agent" and a "situation" could be simply deciding if someone gets a loan or not
I appreciate the effort for going into the book
That counts. If it takes actions autonomously.
It does not need to be very good at it either.
So if you make an endpoint with your model on it and it's part of the loan application process it counts?
Yup. If it's an agent that tries to make the best decision. This is a game it exists in (a game as in game theory).
And my additional "autonomous" means that it does not require a human to approve the actions.
I also have adaptive in mine too, so it needs to keep learning.
But without those two added things, still much closer to what I consider to be an ok definition of "AI."
Norvig's is solid.
It's the same basically, game theory, agents.
Which is why the definition makes more sense in a sense, since in biology we are always taking actions and such as an agent.
There is no classifier there that just does that.
I would not consider it AI, but that does not make it useless or whatever, very much the opposite, it makes sense to make tools, not a whole rational agent.
Yeah, then there are safety issues.
Although the tools being made are destructive for other reasons.
ML is a safety hazard as is unless we massively constrain it for medium risk tasks
But not like, "wow I have this tank that goes around on its own and tries to deal as much damage as possible and even refuels itself, etc" levels of potential destruction.
Although who knows, maybe unraveling society via spam and such actually is worse...
We have our world which we condense into an optimization problem that the algorithm needs to minimize. Big alignment problem. Also worse considering it tries to learn "the easy way out" (overfitting)
Classic example is if you naively train an ML model to reduce the amount of people with disease X to 0, you as the implementer think you're prodding it to find a cure, but the algorithm will probably arrive at "eliminate all those people"
So in conclusion, I don't like the current use of "AI" and I don't think it's something they even want to make. And selling it as such on everything does not even make sense, because it's selling "thing" instead of "useful tool." (just be forward about what it's trying to be, it's fine, I prefer you did not try to make AI, but useful tools)
(not even in the loan case for example, because I don't like that either (it must be human approved))
(Similar reason to this) https://samim.io/static/upload/Screenshot-20220124165543-1108x802.png
In some ways this can be even worse, as in cases similar to the loan example you brought up, because it being worse might make it more destructive. The key is whether or not it can take actions on its own. Not SOTA can still be AI.
If we regulate AI based on it being SOTA as the definition, it does not fix the actual problem.
(Which is why the compute limit stuff coming up now is nonsense (just monopoly stuff))
(What we also really want to target is all this spam that is still allowed, things that can take actions that can ruin people's lives (autonomously), e.g. the youtube algorithm just auto demonetizing everything you have and/or banning you (there are worse agents being used already, but this is a less gruesome example so I used this one))
Yeah this makes a lot of sense. The AI act will solve this all /s
hey @rich moth
need help ,.
just review my RL code
and tell do I need some improvements in it or not!
Any good technical podcasts on machine learning and neural networks?
Like discussing new optimizers or architectures etc
Adan came out in 2022, has anyone made anything better?
I don't even know where to look for it
if you mean the Adam optimizer, it is much older than 2022? https://arxiv.org/pdf/1412.6980
Anyone know how to run ollama downloaded models on vllm?
Or does vllm only accept huggingface models
i mean adan which is adaptive nesterov momentum
it's the last one i can find that is sufficiently different from adam and works well
So I have a Keras image classification model, and I was wondering if, instead of retraining the whole thing for new classes, I could fit the new classes using transfer learning? If so, can someone refer me to some docs? Many thanks.
My excel dataframe is giving me a headache since it is changing my input so the format is not the same in all cells for time
my format is supposed to be hour : minute : seconds
but some of the cells remove the seconds automatically
as you can see the input is up in the formula field but it does not match what is displayed in the cell?
step 1: Do not use Excel
that should be configurable under Home -> Number though, just change the format
why not? how would I finish my project otherwise, if the people that will use it download the data as either csv or xlsx?
this is just a dummy dataframe
what is your project?
It is a project to help some colleagues do calculations for heat treatment of steel
not a big or important question but just came to mind if anyone knows (i'm conducting my own research on the side currently)
- are there faster ways to read a csv than pandas built in read_csv method?
you can try other libraries like polars
yes u convert it to a better format
currently trying that actually, however i'm not sure how compatible it is with dataframes
any examples?
parquet
why is pandas read_csv not fast enough?
polars is just another library for dataframes.
Hi, can somebody help me? Why pytesseract returns nothing?
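```python
import cv2
import pyautogui
import pytesseract
from PIL import Image


def get_table():
    pytesseract.pytesseract.tesseract_cmd = r'E:\Program Files\Tesseract\tesseract.exe'
    # Grab the on-screen region containing the table and save it to disk.
    image = pyautogui.screenshot(region=(1515, 190, 810, 810))
    image.save('screenshot.png')
    path = 'screenshot.png'
    image = cv2.imread(path)
    cv2.imwrite('original_screenshot.png', image)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Note: the 1603 px block is larger than the 810 px region and C=-80 is extreme;
    # inspect the saved image to check that the text survives thresholding.
    image = cv2.adaptiveThreshold(image, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY_INV, 1603, -80)
    image = cv2.bitwise_not(image)
    cv2.imwrite('screenshot.png', image)
    return pytesseract.image_to_string(Image.open('screenshot.png'))


print(get_table())
```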
yeah parquet as etrotta said
i remember it reduced loading time drastically for me
Its about a few hundred thousand rows per day multiplied by the amount of days in a quarter pandas is doing okay but i've been optimizing the processing time
You should be fine until we're talking about more than a million rows, I believe
100k * 100 is 10mil
math
and you're storing all of this in a CSV?
you could be storing it in SQL and using an SQL query to retrieve just the rows that you need into a dataframe.
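A minimal sketch of that SQL suggestion, using the standard-library `sqlite3` module with an in-memory database. The `trades` table, its columns, and the sample rows are made up for illustration and are not the project's real schema. With pandas installed, `pandas.read_sql_query` can load the same query result straight into a dataframe.

```python
import sqlite3

# Hypothetical table and rows, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (trade_date TEXT, symbol TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [
        ("2024-01-02", "AAA", 100),
        ("2024-01-02", "BBB", 250),
        ("2024-01-03", "AAA", 175),
    ],
)


def rows_for_symbol(connection, symbol):
    """Fetch only the rows for one symbol instead of loading the whole table."""
    cursor = connection.execute(
        "SELECT trade_date, volume FROM trades WHERE symbol = ? ORDER BY trade_date",
        (symbol,),
    )
    return cursor.fetchall()


aaa_rows = rows_for_symbol(conn, "AAA")
print(aaa_rows)
```

The filtering happens inside the database, so only the matching rows ever reach Python.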
sorry if i didnt explain that well
basically an RPC is made from the database that gives us xlsx files, which are converted to csv(s). i'm new-ish to the project so i'm kinda discerning everything before i push decisions for a refactor
what are you using rpc for?
rpc returns xlsx?
yea something like that, the project is detached from the database, if that makes sense (principle of least privilege)
I'm explaining certain things poorly i feel like, but i basically process stock transactions and do visualizations on it
so rpc returns some db query results in xlsx? and then u need to visualize them
this is weird design unless you don't own the db and are using 3rd party services
yea but i have all that handled lol it would be nice to get processing speed up a bit tho, got it down from abt 3mins to 3-5 seconds, however most of the current processing time lands within the read csv
i work at my country's Stock Exchange, we don't own the database, that's a horrible decision for this kind of work, well not horrible just not safe?
u can use parquet as soon as the csv file is generated. try to clean it as much as possible.
you can also try yielding the rows if you are doing backtesting or something similar
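One way to read "yielding the rows" is a generator that parses the CSV one row at a time, so the whole file never has to sit in memory. The sketch below uses only the standard library. The column names and the in-memory `sample_csv` are hypothetical stand-ins for a real export file.

```python
import csv
import io


def iter_rows(file_obj):
    """Yield one parsed row at a time so the whole file never sits in memory."""
    reader = csv.DictReader(file_obj)
    for row in reader:
        # Convert the numeric column as rows stream past.
        row["volume"] = int(row["volume"])
        yield row


# Hypothetical CSV content standing in for a real export file.
sample_csv = io.StringIO("symbol,volume\nAAA,100\nBBB,250\nAAA,175\n")

# Aggregate while streaming; no intermediate list of rows is built.
total_volume = sum(
    row["volume"] for row in iter_rows(sample_csv) if row["symbol"] == "AAA"
)
print(total_volume)
```

This trades pandas' vectorized speed for a small, constant memory footprint, so it mainly helps when the data is processed row by row anyway.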
Pandas is generally the worst answer for these problems: it's good for prototyping, but doesn't scale well. The first thing to consider is a partitioning strategy: parquet is a good choice for dailies or weeklies, along with roll-ups so you don't need to query the underlying records continuously
3-5 sec for whole quarter's data is fine i think u are using python after all, maybe some other language can lower it
There are a number of time series db's that are optimized for tick level analysis, but that's a different scale of the problem
I don't 'own' the db for my data, but I do work with it using various engines (lately? Mostly DuckDB)
yea i'm new to this project, most of my knowledge is formal education based in stats/CS, i didn't make the decision to use pandas, my predecessor did
KDB is the big boy database in this space
Half my life is replacing pandas code. Its good for rapid dev, but hits a wall quickly (both performance and complexity, imo)
well anyways Mr.carrot if you come across some cool trading strats do ping me. i love working on those, or some real insider quantitative edge
yea i think where i'm at isnt bad but if i can make small optimizations with notable impacts thats always nice, like i was able to move some loops into numpy/pandas built ins
this is a little funny but yea i could see that
Experiment with DuckDB, if you know Sql. It'll let you write sql directly against in memory dataframes. No config needed
yea i do know sql i have a formal(university/internships) background just not the deep developer knowledge that alot of people may have I'll check it out
Polars is the usual answer tho, for growing out of Pandas, but requires a bit of a commitment otherwise you'll end up with a confusing code base
lol if i do, i personally follow basic index funds strategies for personal life
makes sense and i appreciate the advice i got today
try to see what those Market Makers are doin 
you have great job btw GL
yea i agree and appreciate it, i'm from the caribbean tho so it isnt as notable as other stock exchanges such as NY or swiss
but a good way to enter the industry after uni
how do I remove outliers in data?
https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/
scroll down a bit, and you will find the code!!
i want to make my own llm in which the model will have all the data of eg: products. And the user will give a prompt like "i want to have pink shoes with laces", so from all the products the model will show the appropriate one as per the user's prompt. how can i make something like this?
first of all what you know about llm?
don't reply with full form!!
nth much, but like a model which will learn from the knowledge base data and give answers by understanding it. (like there are 2: fine-tuned and knowledge base)
what about ML?
reading the data from the datasets and making predictions as per it, like supervised, unsupervised or reinforcement
no, no , have u practiced ML?
don't give def. sry about that!
yes i have
i did it
making graphs, predicting, clustering and stuff
Quick question; What level of correlation would be considered extreme/too high? To avoid multicollinearity
What does your duckdb workflow look like? Do you use DBT?
How do you connect to external sources? Do you use something like meltano or just regular python
ok, when building a nn of some sort, if some parameter is optimized, do you never change it no matter what is added to the model?
if I have multiple linear fits that represent the weight vs price of different datasets, how can I combine them into a multivariate function that gives supply and demand?
There's something that is literally called "multiple linear regression". It's just https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html which I sent yesterday
welp, i guess that is karma then
i don't quite get this:
```python
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
```
check out the link please
and look at the examples
The answers are there, right in front of you
why does that represent that linear equ?
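A hedged sketch of what that snippet from the scikit-learn docs encodes. Each row of `X` is one sample with two features `(x_0, x_1)`, and the comment describes how the targets are generated from them. The example below builds `y` from that equation and then recovers the coefficients with NumPy's least-squares solver. Using `np.linalg.lstsq` here is a substitution for illustration, not what `LinearRegression` is documented to call.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# Each row of X is one sample (x_0, x_1); the target follows y = 1*x_0 + 2*x_1 + 3.
y = X @ np.array([1, 2]) + 3

# Append a column of ones so the intercept is learned as a third coefficient.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
slope_x0, slope_x1, intercept = coef

print(y)
print(coef)
```

Because the targets were generated without noise, the fit returns the slopes 1 and 2 and the intercept 3 up to floating-point error, which is the sense in which the data "represents" that linear equation.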
Depends on project, but increasingly dbt (but some custom). Most data sources end up being something custom: Python lambdas to transcode data to parquet, and some api hooks. One thing we've done is build a sql overlay to use vendor APIs and various inference/etc directly from sql. Allows us to keep all our data transformations external from the code.
This is very interesting, is this because SQL is the common denominator? Many analysts that don't know Python or is it truly a workflow your organisation believes is better?
I don't mean this in a snarky way, I'm really very curious
It's somewhat my personal philosophy, a bit of ease of training (I only need to educate analysts in one workflow), a bit of separation of concerns (code is infrastructure, sql is data transformations), but my favorite rationale is locality of behavior: it keeps business logic near each other "The behaviour of a unit of code should be as obvious as possible by looking only at that unit of code."
Doesn't mean we don't bookend with Python, but we can get pretty far in sql alone
I like this. I don't think I'd organize myself like this if I were leading a team but I'd never protest if I'd be in a team that does this
Well, that's the difference too: we're a service provider/integrator: we're enabling our customer's analysts, rather than being direct consumers.
I see, that's a big big distinction
As you know I'm a big polars fan. My project is basically done and if I could go back and change something I'd have used DuckDB (+ DBT). That's part of the reason for my curiosity
I would probably make bigger use of polars if I were the consumer
My main collaborator doesn't know Python. Keeping it in Python as opposed to SQL or R, both of which he knows, was a deliberate strategy initially. His code isn't the cleanest and I wanted to insulate myself from it as much as possible.
In hindsight, I think having it in SQL (and telling him not to touch it) would've been better because at least he could've reviewed the code
imo the beauty of SQL is mostly that everyone knows it yeah
I think this scenario is actually quite common in teams where you have overconfident data analysts / scientists with poor engineering standards / code hygiene
Hindsight is 20/20 but I should've dealt with it better
The sql of today, especially in the OLAP world, is soooo nice. I spent many years without cte's, windows, etc. beyond that, the pace of innovation right now is awesome: integration with parquet, Python udf's, delta lakes, etc. I -love- the dynamic column stuff: https://duckdb.org/2024/03/01/sql-gymnastics.html
(Other OLAP platforms are doing similar things... it's a new era on the data side)
dynamic columns are new for me. Is it having a json(b) column?
I'm young enough to never have been in a situation without cte's, window etc.
Has those too, but you can reference columns dynamically. Like: max(columns('.*score')) to create an aggregate of all columns that end in score
(Simple example)
oh that's cool
Plus pivot/unpivot
Those I know
The last time I did SQL heavy work was 2021 iirc
So I'm behind, but not too much
Yah, the problem is just how fast they're introducing features. This stuff is DuckDB specific, so I try to use it sparingly
I try to make sure I'm doing things that are clickhouse or snowflake compatible, generally speaking
ANSI wise or just feature set wise?
As in, it's not ANSI but all OLAPs support it so it's fair game?
Some is, some isn't
I try to stick to: all olaps support it (or similar)
as the number of estimators in a random forest increases, it reduces overfitting, right?
but at the same time you want the number to be as low as possible to increase efficiency
hey lisan help this guy first! about LLM
yeah come on !!, just help him!
look at my skyscrapers named as losses
wait lemme scroll!
yeah this!!
reading research paper!
yeah okay!
send the docs!
and I never read that !!
https://www.geeksforgeeks.org/matplotlib-pyplot-loglog-function-in-python/
what are this inputs?
what is ax thing?
can just do plt.yscale("log")
just this? woah!
the plt.xticks, uses continuous values like
[1, 2, 3,4 ]
but I have discrete value like 100
so need to convert it into 1 to 100
I used range(1, 100) but doesn't work
like this
```python
plt.xticks(range(1, TOTAL_NUM_EPISODES), labels=None)
```
but I want my x -axis as number of episodes!
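A minimal sketch of one way to get episode numbers on the x-axis with a log-scaled loss axis. The loss values are made up, `TOTAL_NUM_EPISODES = 100` is an assumption, and the `Agg` backend is selected only so the example runs without a display. Two things worth noting: `range(1, 100)` stops at 99, and passing explicit x values to `plot` labels the axis in episodes without needing one tick per episode.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

TOTAL_NUM_EPISODES = 100  # assumed value, matching the "1 to 100" above
episodes = list(range(1, TOTAL_NUM_EPISODES + 1))
# Made-up, steadily shrinking losses standing in for real training output.
losses = [1000.0 / episode for episode in episodes]

fig, ax = plt.subplots()
ax.plot(episodes, losses)        # explicit x values: the axis is in episodes
ax.set_yscale("log")             # same effect as plt.yscale("log")
tick_positions = [1] + list(range(10, TOTAL_NUM_EPISODES + 1, 10))
ax.set_xticks(tick_positions)    # a tick every 10 episodes stays readable
ax.set_xlabel("episode")
ax.set_ylabel("loss")
```

Calling `fig.savefig(...)` or `plt.show()` afterwards renders the figure.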
this was the output of "let it decide"
now what's this?
angstrom?
yeah searched that!
noise? in RL?
wdym?
it's just losses!! after training
that Pong game!
just implemented replay_buffers
hey I have a simple python logic question!
for i in batch:
    states, actions, rewards, next_states, done = zip(*i)
so in this batch we have 32 samples from whole buffer ( experience of model)
and now we have to add all the 32 samples from batch to s, a, r, ns, d
ignore why I did for loop on batch, it just because I am messing around appending wrongly!
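A small sketch of the unpacking being described, assuming `batch` is a flat list of `(state, action, reward, next_state, done)` tuples. The three sample experiences are hypothetical (modelled on the format printed in this chat). No loop is needed, because `zip(*batch)` transposes the whole batch in one step.

```python
# Hypothetical experiences in the (state, action, reward, next_state, done) layout.
batch = [
    ([10, 180, 96, 104, 4, -4], 0, 0, [10, 175, 92, 108, 4, -4], False),
    ([10, 245, 100, 100, 4, -4], 1, 0, [10, 250, 96, 104, 4, -4], False),
    ([10, 225, 276, 76, 4, 4], 0, 1, [10, 220, 272, 72, 4, 4], True),
]

# zip(*batch) transposes the list: one tuple per field instead of one per sample.
states, actions, rewards, next_states, dones = zip(*batch)

print(actions)
print(dones)
```

Each of the five names now holds one entry per sample, which is the shape usually wanted before converting to tensors.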
what is untyped?
I am converting those later into tensors
just samples from whole buffer
yeah okay!
wait I have done big mistake!!
Is there a specific amount of time it needs to take to train a neural network? Would it take a month if it was on something very specific, or have I misunderstood this? My apologies
month??
do you have a GPU?
I think so, maybe. I don't really know the maintenance of computers. I'm trying to code to learn them better, but that might take a couple of years to fully understand something basic
No, I know it's going to take a couple of years to fully understand a few bits of the subject, as Python code can become very complex. I'm writing everything down and putting it into a notebook like a little cheat sheet, so I can easily remember if I'm having a problem. I try to listen to Python, but sometimes I get into an infinite complaint loop where it complains that I didn't do something right and I listen to it. I think so, I'll have to check with my computer in a few minutes
Kind of true, it lowers the risk of overfitting
How long does it usually take for a neural network to learn? I know it depends on a multitude of factors, but if I gave it something simple, like trying to learn color, how long would it take to train it to the point where it fully understands each color, with heavy weights on all the network pieces?
oh mann,. what you are doing currently?
I'm not demotivated, I'm just curious how long it would take to teach a network. If it takes a month I don't mind, because I want it to be strong: teach it a color to the point where, if you gave it a different color, it would understand it's not that color
And would teaching it colors be complex? Just so I understand this a little better
yeah!
Like if I gave it access to a camera that was plugged into the computer, and it were pointed at a specific color like red, it would print out red or speak it out using a text-to-speech option
Ty
Like if I gave it the phone color it would print out what the image is or what color it is
green
kNN
k nearest neighbour?
k-nearest-neighbours is kind of overkill for matching to one of, what, at most a few thousand colors? :p
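To illustrate why no month-long training is needed for this, here is a sketch of the simplest version of the idea: a 1-nearest-neighbour lookup against a handful of reference colours, with nothing to train. The palette and the squared-RGB-distance measure are assumptions chosen for brevity. A real camera pipeline would also need to average the pixels in a frame first.

```python
# Small, made-up palette of reference colours as (R, G, B) values in 0..255.
PALETTE = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def nearest_colour(rgb):
    """Return the palette name with the smallest squared RGB distance to rgb."""

    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[name]))

    return min(PALETTE, key=squared_distance)


print(nearest_colour((250, 20, 30)))
print(nearest_colour((30, 200, 40)))
```

Plain RGB distance does not match human colour perception very well, so this is only a starting point.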
hi reptile
so there is a logical error
train.py -> https://www.pythonmorsels.com/p/26whg/ at line 21 and 119
buffer.py -> https://www.pythonmorsels.com/p/35wjn/
as I have told the structure of "batch"
Yah, that's always the basic problem/opportunity with data: partitioning. If you can organize/partition the data in a manner that aligns with the access pattern, things are good. But if not, you end up with random access which is performance hell (and usually ends in a full scan)
can you help me understand regression?
wait there is a best blog on that!
Learn what formulates a regression problem and how a linear regression algorithm works in Python.
The concepts behind linear regression, fitting a line to data with least squares and R-squared, are pretty darn simple, so let's get down to it! NOTE: This StatQuest comes with a companion video for how to do linear regression in R: https://youtu.be/u1cc1r_Y7M0
You can also find example code at the StatQuest github: https://github.com/StatQuest/...
so I guess I am messing up appending tuples, don't know, but in the batch variable another list is being added which I didn't want
i do not get how to implement it though for my scenario
it depends on your data then, have you plotted that?
i have
show !
i have multiple datasets and i got the line of best fit, i want to combine all of those lines into a single function/line though
what if you add all those data into one?
hi
hey @final kiln
are you reading that code ? of batch
That statquest video is pretty good, it's kinda hard to go through this topic without just going through the underlying math: the math isn't hard, it's just tedious (which is why we use nice libraries like sklearn)
can you help me implement sklearn for my case then
more focused on the application
that blog already implemented that
i have the coefficients:
```python
p = np.polyfit(arrival_kg, min_rs_per_kg, 1)
print("parameters (slope, intercept):", p)
```
can I shove this into a sklearn function and get a result?
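A hedged sketch of how the `np.polyfit` coefficients can be used directly, without scikit-learn. For a degree-1 fit the returned array is `(slope, intercept)`, and `np.polyval` evaluates that line at new inputs. The numbers below are invented stand-ins for `arrival_kg` and `min_rs_per_kg`. For a single feature, `LinearRegression` fits the same ordinary least-squares line.

```python
import numpy as np

# Made-up numbers standing in for the arrival_kg / min_rs_per_kg arrays.
arrival_kg = np.array([100.0, 200.0, 300.0, 400.0])
min_rs_per_kg = np.array([50.0, 45.0, 40.0, 35.0])

# Degree-1 fit: p[0] is the slope, p[1] is the intercept.
p = np.polyfit(arrival_kg, min_rs_per_kg, 1)

# np.polyval evaluates slope * x + intercept for new inputs.
predicted = np.polyval(p, np.array([250.0, 500.0]))

print("parameters (slope, intercept):", p)
print(predicted)
```

Combining several such lines into one multivariate model is a different problem: it needs all the features as columns of one `X`, not an average of separately fitted slopes.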

linear regression is one of the building blocks underlying neural networks (a layer without an activation is a linear model), so it is good to understand well
does the weighted average not work here?
I already sent you the sklearn link twice. I and others will be less inclined to help if we don't see effort on your side (e.g., reading the docs/links that are sent to you)
That's very confusing language. Maybe try a basic linear regression example, with, say Iris dataset.

yeah , please take a look at that, I am just confused about appending!
Iris is a good data set because everyone knows it. And if you can do a linear regression on Iris, you can do it on anything
oh i forgot to mention, i did use this:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
and got a result by combining all the datasets together which isnt correct
I have given the code, do I need to explain now in short?
Doing linear regression with sklearn is literally just calling 2 functions, .fit() and .predict()
and there's tons of examples in the documentation
it says tuple!
!d sklearn.datasets.load_iris
```python
sklearn.datasets.load_iris(*, return_X_y=False, as_frame=False)
```
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification dataset...
in train.py! in updatedqn func!
no wait it's list
train.py -> https://www.pythonmorsels.com/p/26whg/ at line 21 and 119
in this code in update_dqn method the batch is list
it's List!
Does anyone have a good link to understand what bootstrapping in random forests is?
in that list we have a list and then all tuples 32!!
I tried to understand them from several resources but I can't seem to understand the advantage of it when they explain it
It's basically you have your dataset which has 5 samples: A, B, C, D, E
You train 3 trees. While training them you don't use the original dataset, you sample with replacement (literally, you take a sample and put it back) to make a new dataset.
For tree #1 you may have A A B C D, for tree #2 A B D D E, and for tree #3 A B C E E
It's just another way to increase randomness
this is bootstrapping, visually
replace "compute statistic X" with "train a tree"
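The sampling step described above can be sketched in a few lines of standard-library Python. Each "tree" gets its own dataset drawn with replacement and the same size as the original, so some samples repeat and others are left out. The fixed seed is only there to make the sketch repeatable. The actual tree training is omitted.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement; repeats and omissions are expected."""
    return rng.choices(dataset, k=len(dataset))


dataset = ["A", "B", "C", "D", "E"]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# One resampled dataset per tree; each tree would be trained on its own sample.
samples = [bootstrap_sample(dataset, rng) for _ in range(3)]
for tree_number, sample in enumerate(samples, start=1):
    print(tree_number, sorted(sample))
```

Averaging the predictions of models trained on these different resamples is what "bagging" refers to.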
https://paste.pythondiscord.com/JAGQ
whole print(batch)
but doesn't that make it harder to actually find the relation between data?
good question
it does, but it's the point
Decision trees overfit really really easily
yeah , because then it will make individual tuples !! for like 32 samples!!
Look at it like this:
Your dataset is a noisy sample from a distribution
When you're fitting a ML model you're interested in knowing the actual relation between the independent / dependent variable
not just that of your training set (overfitting)
So the observation
> but doesn't that make it harder to actually find the relation between data?
is true, but it applies to the literal relationship in your training set. You absolutely don't want to replicate this. This is by definition overfitting
To truly understand why this is the case you actually have to study the bias variance trade-off
I will google this
Trees have very little (inductive) bias. They can fit basically everything
But, if you change 1 example the tree may be very different (high variance)
Random forest trades a tiny bit of bias for a massive reduction in variance
first of all in buffer ( deque ) we are just appending this experiences
([10, 180, 96, 104, 4, -4], 0, 0, [10, 175, 92, 108, 4, -4], False) which are these
and then we are creating samples (32) from whole buffer [ consider that buffer may have thousands of this experiences ]
so batch will have now 32 samples
now we have to add this 32 samples each into 5 variables which are s, a, r, ns, d
so that's why I am using zip
So randomness helps us remove stuff that just happens to seem related (noise) and lets us actually find the variables that are actually dependent on each other?
That's a helpful way to look at it for now
and I just want to remove that []!! which is bothering me!
I kinda understand this but will give it another read once I understand the concept of bias and variance
Tysm man!
yeah but here the use case is the opposite, we have 5 elements which will be converted into 5 separate variables
Idk what I would have done without you
okay lemme try at least then!
You preemptively ask what is covered in a typical, rigorous ML class
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 5)
I am glad you view it this way and don't find it annoying
6? how?
yeah!!!
that extra [] is bothering !!
because while appending it is appending as list of list, nested list in short!
now how can we remove that stupid [], or I am doing mistake while appending?
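One way this error can arise, assuming the sampled experiences ended up wrapped in an extra list as described: `zip(*batch)` then sees a single argument and yields one item per experience instead of one per field. The sketch below reproduces that with six experiences in the format printed in this chat, then flattens one level with `itertools.chain.from_iterable` before transposing. The name `nested_batch` is hypothetical. If the extra list comes from how the buffer's sample method builds its return value, fixing it there is the cleaner option.

```python
from itertools import chain

# Experiences in the (state, action, reward, next_state, done) layout.
experiences = [
    ([10, 330, 16, 184, 4, -4], 0, 0, [10, 325, 12, 188, 4, -4], False),
    ([10, 245, 100, 100, 4, -4], 1, 0, [10, 250, 96, 104, 4, -4], False),
    ([10, 180, 120, 80, 4, -4], 1, 0, [10, 185, 116, 84, 4, -4], False),
    ([10, 225, 276, 76, 4, 4], 0, 0, [10, 220, 272, 72, 4, 4], False),
    ([10, 260, 320, 120, 4, 4], 1, 0, [10, 265, 316, 116, 4, 4], False),
    ([10, 185, 236, 36, 4, 4], 1, 0, [10, 190, 232, 32, 4, 4], False),
]
nested_batch = [experiences]  # the sampled list wrapped in one extra list

try:
    # With one outer element, zip yields one item per experience (6), not per field (5).
    states, actions, rewards, next_states, dones = zip(*nested_batch)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Flatten one level of nesting, then transpose into the five fields.
flat_batch = list(chain.from_iterable(nested_batch))
states, actions, rewards, next_states, dones = zip(*flat_batch)

print(error_message)
print(actions)
```

Since the outer list has exactly one element here, `nested_batch[0]` would work just as well as `chain.from_iterable`.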
This is what my old slides had to say about this btw
decision trees are extremely unstable
Yeah makes sense, I've been trying to tune hyperparameters on them and it really changes the results
Ah and last but not least, random forest is just bagging + not considering all variables at each split
So if you understand that slide, you understand RF
Everyone is different right, but I'd go for regular Q learning first before diving into DQN
regular Q learning is easy enough to implement from scratch that when you upgrade it to DQN you'll at least have confidence that you know what you're doing
yeah you are trolling me!
simple Q-learning is just creating tables for q values, so it's boring!
not really
You can do Q learning with function approximation as well
The deep net is just approximating the table
It doesn't need to be a deep neural net, it can be any function approximator
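For reference, a minimal sketch of the tabular update that a DQN approximates. The function name `q_update`, the `(state, action)` dictionary keys, and the default `alpha` and `gamma` values are illustrative assumptions. In a DQN the table lookup is replaced by a network's prediction, while the target `reward + gamma * max Q(next_state, ·)` keeps the same form.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, done, n_actions,
             alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Bootstrapped target: reward plus the discounted best value of the next state.
    if done:
        best_next = 0.0
    else:
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# A terminal transition with reward 1, then a transition that leads into state 0.
first = q_update(q_table, state=0, action=1, reward=1.0,
                 next_state=1, done=True, n_actions=2)
second = q_update(q_table, state=2, action=0, reward=0.0,
                  next_state=0, done=False, n_actions=2)
print(first, second)
```

The second update shows value propagating backwards: state 2 gains value only because it leads to state 0, which was already credited with some reward.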
yeah, but with neural net it's interesting
can someone who got an internship in data science msg me and tell me how i could get one as well?
like what do i need to know/put in resume/ apply for internship
#career-advice has all the pros you need for this question
I just sent an email to a company I wanted to intern at sounding very motivated and they "hired" me for the internship
I got that working, with itertools
from itertools import chain
this approach is also good!
([10, 330, 16, 184, 4, -4], 0, 0, [10, 325, 12, 188, 4, -4], False)
([10, 245, 100, 100, 4, -4], 1, 0, [10, 250, 96, 104, 4, -4], False)
([10, 180, 120, 80, 4, -4], 1, 0, [10, 185, 116, 84, 4, -4], False)
([10, 225, 276, 76, 4, 4], 0, 0, [10, 220, 272, 72, 4, 4], False)
([10, 260, 320, 120, 4, 4], 1, 0, [10, 265, 316, 116, 4, 4], False)
([10, 185, 236, 36, 4, 4], 1, 0, [10, 190, 232, 32, 4, 4], False)
now I got this with
for x in batch:
print(x[0])
now need to append
s, a, r, ns, d
now how can I append this values into 5 diff. variables?
no I think I am too much computing here and there, I should take a look at how it is appending!
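One common way to split a batch of `(s, a, r, ns, d)` tuples into five variables is `zip(*batch)`. The sketch below assumes each transition got wrapped in an extra list while appending (the "extra []" above), which is only a guess at what the real buffer looks like.

```python
from itertools import chain

# Assumed shape of the buffer: each transition is wrapped in its own list.
nested = [
    [([10, 330, 16, 184, 4, -4], 0, 0, [10, 325, 12, 188, 4, -4], False)],
    [([10, 245, 100, 100, 4, -4], 1, 0, [10, 250, 96, 104, 4, -4], False)],
]

# Remove one level of nesting so every element is a 5-tuple.
batch = list(chain.from_iterable(nested))

# Transpose: five tuples, one per field.
s, a, r, ns, d = zip(*batch)
print(a, d)
```

If the buffer is fixed so that it appends the tuple directly instead of a list containing the tuple, the `chain` step is not needed.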
Sick. I wonder how well this compares to the big proprietary ones? https://timefold.ai/blog/new-open-source-solver-python
The amount of time that has gone into CPLEX over decades makes me think it still blows stuff like this out of the water for sufficiently large problems
Does not surprise me that it was made here. This is such a niche in Belgium
A lot of my coursework was on this, fun stuff
Yeah I understood that when I was looking at the hyperparameters
ahah like what did you write? and what company was it
I just wrote that I'd been to some of their presentations and their work really interests/inspired me and that I wanted to do an internship. It's not a company you know (and if you did I wouldn't say which because it doxes me)
https://github.com/google/or-tools this is interesting in this space
It's mostly for modelling problems, you can pick your solver "backend"
It's missing a lot of the things timefold has though, I see they offer metaheuristics
Or rather, it's exclusively metaheuristics based
ig that answers the question. Metaheuristics are ime slower than doing something like simplex/branch and bound if your search space is tractable, non-linear, non-convex, ...
I guess the real bottleneck is an open foundation to build on, like BLAS and LAPACK which are government-maintained.
Um... Hi. I'm new here. Where do I go to ask for help? (When replying to me please @ me so I know you are talking to me. This is a force of habit, I'm sorry if it's an inconvenience.)
yeah, it's not only that. It's also just that the underlying algorithms are different
It's been too long since I looked at things like CPLEX but they do mostly standard, LP, IP, MIP, QP, ...
Which can be more efficient than full blown metaheuristics, if your problem allows for it
writing a genetic algorithm or so is a fun coding exercise btw
hey hey, welcome. You can check out #how-to-get-help.
In general, you can ask questions in a relevant room (like here) and people will do their best to answer. My biggest tip is to ask the question straight away like:
"What libraries can I use to do linear regression." instead of "I need help" or "I need help with linear regression"
Okay well... here goes. I started using Python recently and I'm attempting to use Rasa to build an Ai. The only issue is, it does not install completely. I'm using a virtual environment. Pip, Python, and absl-py are all the latest version. I get a HUGE Error message somewhere in the downloading process. I can put that Error Message here if that's allowed.
Yah, definitely open a help thread with the error message (see the link above)
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
I made a post, barely fit in the block with my paragraphs, lol.
https://youtu.be/hDKCxebp88A?si=Bn6dKsNGpaNGUyzS, I finished the first seven and a half hours, which covered the basics of linear and logistic regression and decision trees + random forests with scikit-learn. Now would it be better to take a break from the course and go check projects like Kaggle notebooks that use these models before moving on?
This course is a practical and hands-on introduction to Machine Learning with Python and Scikit-Learn for beginners with basic knowledge of Python and statistics.
It is designed and taught by Aakash N S, CEO and co-founder of Jovian. Check out their YouTube channel here: https://youtube.com/@jovianhq
We'll start with the basics of machine lear...
so if i have a dataset with one or two values that have a value of 0, and i want to use MAPE to evaluate the set, what alternatives do i have towards the zero values? the zero values represent like <0.1% of the complete dataset but i want to avoid trying to remove the values themselves
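Two alternatives that are often suggested when MAPE breaks on zero actuals are WAPE (total absolute error divided by total absolute actuals) and sMAPE. Here is a small sketch; the sample numbers are invented, and whether either metric suits the dataset above is an open question.

```python
def wape(y_true, y_pred):
    """Weighted absolute percentage error: one ratio over the whole set,
    so a single zero actual does not cause a division by zero."""
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total_error / sum(abs(t) for t in y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE: each term is bounded by 2, but a zero actual with a
    nonzero prediction always contributes the maximum value."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return sum(terms) / len(terms)

y_true = [0.0, 10.0, 20.0]   # made-up data containing one zero
y_pred = [1.0, 9.0, 22.0]
print(wape(y_true, y_pred), smape(y_true, y_pred))
```

Since the zeros are a tiny share of the data, WAPE keeps them in the evaluation without letting them dominate it.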
The chat bots on company websites are usually low effort shit.
Fine-tuning an interactive LLM to a specific company would probably have worse results than a RAG system that uses a generic interactive LLM.
Any options that involve LLMs will be slower and more expensive than the shitty systems.
What is a RAG system?
Retrieval augmented generation
It's basically when you ask a chat bot something, and it looks up information related to your question, and then passes both your message and the relevant information to a generative LLM. And it uses the extra information to answer the question.
ohh so it gives the llm the necessary information to answer your question clearly instead of just pasting the documentation?
I mean if it's a chat bot that answers questions about libraries, "the necessary information to answer the question" might be the documentation.
What I mean is, for the rag system, the documentation might be the "augmented input". The output would be something generated by the LLM. It wouldn't just link you to the docs
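A toy sketch of that flow: retrieve the most relevant text, then put it into the prompt next to the question. Keyword overlap stands in for the embedding search a real system would probably use, the documents are invented, and the LLM call itself is left out.

```python
import re

def tokens(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, k=2):
    """Rank documents by how many words they share with the question."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents):
    """The 'augmented input': retrieved context plus the user's question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "To reset your password open Settings and choose Reset password.",
    "Our office is closed on public holidays.",
    "Refunds are processed within five business days.",
]
prompt = build_prompt("How do I reset my password?", docs)
print(prompt)  # this string would then be sent to a generative LLM
```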
Noted!
you mentioned it would be slower and more expensive but ig how viable it is depend on the margin of each of the previous factors right?
the low-fi chatbots that answer questions on company websites are so profoundly unhelpful that I don't know why companies even have them. But I can't imagine that they're more computationally demanding than a generative LLM that requires a GPU.
If a company wants to have a chatbot on their website at all, imo, it should be LLM based. The ones most companies have are basically just a text version of phone answering bots.
Yep
and it should actually be viable since it will reduce the amount of support tickets for customer service by a huge margin
It seems like a great idea , would like to implement it someday when I have deeper knowledge
My guess is that there are already startups offering RAG-based customer support bots, which are in turn making API calls to OpenAI (that the startup has to pay for, and then pass along that expense to the customer)
Yeah , I think I saw an ad for something similar a while ago
A question tho
Can't gpt-2 be viable for this?
or ig it depends on the context and required capabilities
you'd need to find an interaction-tuned version of GPT-2
ChatGPT is an interface for interaction-tuned versions of GPT-3, etc.
ohhh
makes sense
I think there was something done by microsoft for that posted on hugging face
lemme search for it
if ill train a model
for 30 hrs on my laptop
will it f it up
gpu temp is below 65 since half hour
hour
ayo how is it going dude
oh the drive?
what's the difference between multiple linear regression and using a fully-connected neural net? for like tabular data
it's pretty much the same. some people might try and make the distinction between linear reg, multiple lin reg, and multivariate lin reg, but it's all referred to as linear regression as well
in lin reg you look for M and B that satisfy Y = MX + B, which you'll note is the same as the weights and biases of a dense layer
this is also one of the key observations made by yann lecun in a paper from like 13 years ago, pointing out that this means many algorithms that iteratively apply affine transformations followed by selection rules can be "unfolded" into a neural network that you can now explain and whose architecture is well motivated by an optimization algorithm with convergence guarantees
the 1 layer case being linear regression
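To make the "same thing" point concrete, here is a sketch that fits Y = MX + B twice on synthetic data: once in closed form, once as a single dense layer (no activation) trained by gradient descent on MSE. The data, learning rate and step count are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_M = np.array([2.0, -1.0, 0.5])
true_B = 4.0
Y = X @ true_M + true_B                      # noiseless synthetic targets

# Linear regression, closed form (append a column of ones for the intercept).
X1 = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# The same model as one dense layer: weights W, bias b, trained on MSE.
W = np.zeros(3)
b = 0.0
lr = 0.1
for _ in range(500):
    err = X @ W + b - Y
    W -= lr * 2 * X.T @ err / len(Y)
    b -= lr * 2 * err.mean()

print(coef, W, b)
```

Both routes recover the same M and B here; the difference starts once you add hidden layers and nonlinear activations.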
https://dl.acm.org/doi/abs/10.5555/3104322.3104374 this paper eventually led into a flavor of what is now called "model-based deep learning"
right, makes sense yeah
how do I access the pdf 
thanks!
From a practical pov, you can have a non-linear relationship with the independent and dependent variables
You can have this with linear regression but you need to specify all of them, including all interactions a priori.
The issue with neural networks is then that because you don't have to specify a relationship a priori you could be fitting the signal and the noise.
Naturally this makes linear regression whitebox and a DNN on tabular data black box
As a modeller it's also harder to get these fully connected networks right (on tabular). A lot more knobs and dials to turn
I like this question because you can answer it in two totally different ways (Edd's answer and mine) and I'm not sure which one you wanted haha
I have no idea which one I wanted
I'll take both though
Try doing some Kaggle competitions, the ones I mentioned are low time investment
alrighty
is there any book/source where I can learn how to properly clean data? I have to make some models for uni but the dataset is a mess (high skew and kurtosis for some features, continuous and categorical features mixed etc.). Also there's a high imbalance 45/45/10 (which I tried to solve using SMOTENC). Still, I can't get good results on prediction
Has anyone ever tried running paddle ocr on rocketchips?
If yes I'm really curious about details on how to do it because it's hard to find anything on the internet
I'll be on huggingface discord more for this since VC told me it would be better to ask there, putting this here for posterity
Making a simple neuron for a workout Network Temple and get that understanding of it
What is a vector and tensor in the context of ml?
this is a bit of a rabbit hole question, but roughly:
- from the maths standpoint: a vector is an element of a vector space, and a tensor is a multilinear transformation
- for ML people: any multidimensional array is a tensor, and 1d arrays are vectors. this is enough to get around the basic ML code and papers, but not the more sophisticated stuff or if you want to go in depth
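In the "ML people" sense, the distinction is just the number of array dimensions. A tiny NumPy illustration, where the shapes are arbitrary examples:

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])        # 1-D array: a "vector"
matrix = np.zeros((2, 3))                 # 2-D array
image_batch = np.zeros((8, 3, 32, 32))    # 4-D array: called a "tensor" in ML code

for arr in (vector, matrix, image_batch):
    print(arr.ndim, arr.shape)
```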
why do we need minima when we are already differentiating the loss function with respect to the weights and biases in backpropagation?
you got that backwards
the only reason we take the derivatives in backprop is that it can be useful in finding local minima
How?
For a minimum, shouldn't we equate the derivative to 0?
under the condition that a function is differentiable and locally convex, it can be shown with some effort that following the negative of the gradient with a proper step size will eventually lead you to a local minimum with gradient 0
this is impossible to do directly for anything other than trivial cases
ohh
a network with 1 layer and a nonlinear activation function is already a case where doing that explicitly is impossible
there are also cases where you could technically do it, but the effort of inverting a matrix is prohibitive, so you anyway can't
you almost always have one or both of these cases together in any interesting problem
Wait so our approach is to minimize the loss , by finding the minima at which the loss function is minimum for that respective weight?
that's a redundant way of putting it, but yes
you write the loss as a function of the weights, and then tweak the weights in such a way that the loss is small
And we use the formula W(new) = W(old) - L.R * the gradient of the Loss w.r.t W(old)
yes
well, that's vanilla gradient descent, but the other methods build up on it
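The update rule quoted above, as a minimal sketch on a one-parameter toy loss. The loss `(w - 3)^2`, the learning rate and the step count are arbitrary choices for illustration.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Vanilla gradient descent: W(new) = W(old) - lr * dLoss/dW(old)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy loss: loss(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)  # ends up very close to the minimizer w = 3
```

Nothing here solves "derivative = 0" explicitly; the iterations just keep stepping against the gradient until it is nearly zero.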
i must add that there are also gradient-free methods
the philosophy is similar, but you trade in convergence guarantees for a shot at global optimality and relaxing the need for differentiability
One more thing , let's say I have 9 trainable parameters so the loss would be the function of those parameters , which in turn means that the loss function is 10-D function right?
I still have to cover that
no
the loss is scalar, and usually real-valued at that
But if you graph it , it would be in 10 dimensions right?
sure
but that 10 never shows up in any of the math you do on it
you have a function f: R^9 -> R
or possibly something more restrictive like a 9 dimensional manifold or just some subset of R^9, instead of R^9
Ummm I see thats one thing
Nevertheless, I cleared my confusion regarding the minima
Btw can we differentiate a max function?
remember the minima are the values the loss takes, the minimizers are the parameters
not in the conventional sense, no
So the perceptron loss function doesnt use GD I believe?
it does
but you'll either accept the output as probabilities of a categorical distribution, or use a smooth approximation to the max function during training
How will it differentiate as the loss function is max(0,-y*f(X))
it won't
the relu is also not differentiable btw. all modules make an arbitrary choice of subgradient for the relu, since it's subdifferentiable
for classifiers, you remove the max altogether or use something like a softmax
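To illustrate the subgradient point for the loss max(0, -y*f(X)) with a linear f: away from the kink the gradient is well defined, and at the kink the code just picks one valid value. This is a sketch with made-up helper names, and choosing 0 at the kink is one arbitrary option among several.

```python
def perceptron_loss(w, b, x, y):
    """max(0, -y * f(x)) with f(x) = w . x + b."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, -y * f)

def perceptron_subgradient(w, b, x, y):
    """A subgradient of the loss with respect to (w, b)."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * f < 0:
        # Misclassified: the loss equals -y * f, so the gradient is (-y * x, -y).
        return [-y * xi for xi in x], -y
    # Correct side, or exactly on the kink: 0 is a valid subgradient choice.
    return [0.0] * len(x), 0.0

grad_w, grad_b = perceptron_subgradient(w=[1.0, 0.0], b=0.0, x=[1.0, 2.0], y=-1)
print(grad_w, grad_b)
```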
So that it brings down the summation to a probability?
what summation?
f(x)= w1X1 + w2X2 +b2
and wdym by "bring down"
the usual approach is that, if a classifier is supposed to output a particular class, this is (roughly) the same as saying that class has a probability 1 and the others have probability 0
and now we get the network's output probabilities match that
What I meant was, we first use forward propagation to find the dot product, and then use an activation function like softmax to bring it down to a range.
sure
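A small sketch of what softmax does to the raw outputs: it maps any list of scores to positive numbers that sum to 1. The input scores are arbitrary, and subtracting the max is a common trick for numerical stability that does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```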
Question
Lets say I have a regression problem
I would use 2 nodes in the input layer , 2 nodes in hidden layer and 1 node in the output layer
My loss function would be mean squared error
and let's assume that only the bias of the output node is variable , all of the remaining parameters are constant
So technically speaking my loss is entirely dependent on the bias right
yes
And now I find the derivative of the loss wrt to the bias
and lets say it is positive
so that would mean if i increase the bias , the loss would also increase right?
that is what the gradient tells you, yes
the gradient points in the direction that a function increases the most
So we bring down the value of the bias by subtracting it with the derivative
subtraction is not commutative so your wording is very ambiguous, but yes
But if the derivative is positive , wouldnt it also mean that decreasing the bias , would decrease my loss?
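On the sign question: yes, a positive derivative means a small decrease of the bias lowers the loss, which is exactly what subtracting the gradient does. Here is a sketch with only the output bias trainable; the "fixed" outputs stand for whatever the frozen rest of the network produces, and all numbers are invented.

```python
def mse(fixed_outputs, targets, b):
    """MSE when the prediction is (frozen network output) + bias."""
    return sum((f + b - t) ** 2 for f, t in zip(fixed_outputs, targets)) / len(targets)

def dmse_db(fixed_outputs, targets, b):
    """Derivative of that MSE with respect to the bias."""
    return 2 * sum(f + b - t for f, t in zip(fixed_outputs, targets)) / len(targets)

fixed = [1.0, 2.0, 3.0]      # made-up outputs of the frozen part
targets = [0.5, 1.5, 2.5]
b = 1.0

grad = dmse_db(fixed, targets, b)   # positive here: predictions are too high
b_new = b - 0.1 * grad              # so the update lowers the bias
print(grad, mse(fixed, targets, b), mse(fixed, targets, b_new))
```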
What are the pros and cons of capping?
because can't outliers in certain scenarios show you the relation between certain variables that you otherwise can't find?
Hi, I was making this project from the TensorFlow tutorials https://www.tensorflow.org/tutorials/keras/text_classification_with_hub and I wanted to deploy it to a free hosting site like PythonAnywhere. The problem I have encountered is that PythonAnywhere doesn't support TensorFlow or Keras. So I thought of saving the model with pickle, which is supported. However, I realized that you can't save a Keras model using pickle and that you need to use an .h5 file format, which is loaded using Keras.
Is there anything that i can do to load the model without using keras or tensorflow?
Also, sorry if this is not the correct channel to ask this
what do you replace it with? duckdb? if one was concerned with performance they'd not even be using python no? (unless pytorch or something is involved)
I was reading some code on kaggle and encountered this
In Logistic Regression, we use default value of C = 1. It provides good performance with approximately 85% accuracy on both the training and the test set. But the model performance on both the training and test set are very comparable. It is likely the case of underfitting.
I will increase C and fit a more flexible model
Why would this be underfitting? isn't this perfect for the model?
it was able to capture the general trend and ignore the noise
Either DuckDB or some Polars or just consolidating (refactoring), and some just properly vectorizing operations (ie: getting rid of loops). Performance is fine with Python, driving Polars or DuckDB, although sure there's room for gain... but in analytical workflows, my biggest battle is complexity and reuse: sql (and dbt) give me a better structure @scenic parcel
Anybody have a resource/framework they'd recommend for distributed training? I tend to use AWS as a cloud provider, and have set up an internet facing multi-instance inference platform on EC2 in the past. I am reluctant to use SageMaker due to my impression it's trading ease of use for increased cost and abstracting things that I should probably learn instead. Though, if that's not the case I'm willing to change my mind.
@left tartan Sorry for disturbing you. Do you have an example of using DuckDB? I plan to use it with Django instead of Pandas but find it hard to use persistent data and create custom SQL based on query parameters.
Search this discord for 'import duckdb', I've posted a few examples in past
thank you so much
what is duckdb?
Even though you want to avoid overfitting, you still expect the performance on the training set to be a bit better, since that's the data you minimized your error for. So they're theorizing that since the training accuracy is about the same and even slightly lower than the test accuracy (meaning, the performance on the training set is comparable to data it wasn't trained on), that the model didn't eke out everything it could out of the training data, and suggest that might be because the regularization is too strict.
Ohh
Makes sense
Tysm!
is there a pandas-specific channel (or server)? I'm fairly experienced with python but pretty new to Pandas, would appreciate some help as I try and do things...
not really, but you can ask here or open a thread in #1035199133436354600
https://youtu.be/kQQaO5Cm5AI?si=boVbr88e72MzLWA8
Is this enough pandas to move forward in my journey to be an ML engineer, or should I learn more things about it?
In this video, learn Python Pandas Tutorial for Beginners [FREE] | Learn Pandas in 3 Hours.
00:00:00 What is Data Analysis
00:15:10 What is Data Structures in Pandas (Pandas Series Data Structures)
00:29:42 DataFrames Data Structures in Pandas
00:41:01 Arithmetic Operators in Pandas
00:48:40 Delete and Insert Data in Pandas
00:58:22 Write ...
I strongly recommend you to just use the official pandas tutorial/docs
Often times people making these videos/courses don't really know the tech either, they target beginners that can't tell and make money off of that
Ok thank you
BTW one question I want to ask: do I have to learn everything about pandas or?
skim the documentation to know what exists and then start using it, then use the docs as a reference. Nobody knows "everything about Pandas", but a lot of people know where to find what they need etc.
and why
I want to choose a masters and I see that you are proficient in the field of data science
What are your options?
On average I think MS CS will teach you the ideas behind ML models and will also give you the required baggage to deploy models. My sample size isn't huge but what is missing from MS CS is the "finesse" of actually doing statistical modelling, that's often missing.
Statistics is another viable option but it's the opposite. You'll get all the finesse of modelling imaginable but probably not enough of the real world concerns (deployment, MLops, ...).
There's also MS data science (or AI). There you can't go off of the name, I'd really have to see the content because all of them I've seen are very different.
Finally, you can also pick applied fields if you have a specific interest you want to apply data science in. Experimental psychology, bioinformatics, computational chemistry, actuarial science, ... are all examples and there are many more
there's also signal processing
that comes with variants like medical sigproc/imaging, communications, and more
exactly, also a fine choice. My alma mater doesn't offer it, but it offers EE
EE into ML is a very solid choice as well. I think all of the very specific and advanced vision courses were exclusively done there (due to the signal proc background)
maybe i would add that data science and ML are probably best seen as tools you apply within another field, so you'll always be better if you have the specific domain knowledge of where you plan on using them. if you already know what applications you like, you can mix the two things together. if you don't, then a more stand-alone learning of DS and ML might be better, with the understanding that you'll have to learn about the application area later
i think both zestar and i learned all our DS and ML stuff in the context of a particular application, and as a result both of us know a lot of non overlapping methods and maths simply because some are more common in some fields
Hello guys, I'm working on an NLP classification task that involves specialized terminology/lingo. Is it realistic to fine-tune existing models such as bert/some other, or would you recommend starting with a baseline model such as naive bayes/some other and then working through iterations with a custom nlp model? I'd analyze the dataset and see what type of data I'm working with and then use various models to run some preliminary tests to assess performance and later on to compare against. Any insights / docs on structuring the dev process would be appreciated. Thanks!
can anyone help me get the room direction from a 2D image? The image will always be an indoor image
It depends... if I have enough time and I'm not rushing to beat any deadline, I'll definitely start from the classics; building a baseline model.
If I'm in my "let's go family, it's show timeeeee" mood, I'll go as far as adding transfer learning to it just so I can compare and contrast different model performance. (to be honest this part is more fun for me lol)
guys when making ai, do u make a neuron class? and what is its attributes
im trying to make an adaptive neural network, and its well confusing me
hi folks, been having a lot of issues with my base code for my teaching module for my AI. Hoping that someone may be able to give me a few pointers on why this issue may be happening? I'm not super advanced in python, but everything appears correct, and it just keeps erroring. very frustrating. Hoping someone can have a peek to get a second pair of eyes on it to see what it is that I am missing please?
If anyone thinks they would be able to lend a quick hand, feel free to DM me.
just a heads up, in this channel it's typically more helpful to ask your question directly and be as specific as possible
People will rarely commit to DMs, they'll typically prefer to see a question they're able to answer directly (or not)
What courses/youtube videos/resources do you all suggest when learning about pytorch?
Docs!
i did sigproc for my masters too
What do you think about kaggle ml competitions?
Do you think looking through best approaches from leaderboards for a specific similiar use-case might help me for my problem?
Problem is that I don't know what to ask other than what I asked, because I could ask one question and the problem may be something completely different that I'm just not seeing, which is why I need a fresh set of eyes to look over it as I've probably got tunnel vision in relation to it.
At least give more information and context. Your prompt was devoid of any details, code, context, etc
You say it's 'erroring'. What errors? Etc
"UnboundLocalError: cannot access local variable 'y' where it is not associated with a value"
that's the error at the moment.
Before that, "ValueError: too many values to unpack (expected 3)"
Before that there was serialisation error
I've been going round and round on the same errors for a while now.
I have no idea what error is the actual error.
What is the actual real error though, it must be something other than that because there are multiple errors that I keep going in circles with.
I'm currently cutting the code down so that I can upload it.
Essentially the error is somewhere inside....
[code]
i = self.sigmoid(np.dot(x, self.W_i[0, :]) + np.dot(h_prev, self.U_i) + self.b_i)
f = self.sigmoid(np.dot(x, self.W_f[0, :]) + np.dot(h_prev, self.U_f) + self.b_f)
c = f * c_prev + i * self.tanh(np.dot(x, self.W_c[0, :]) + np.dot(h_prev, self.U_c) + self.b_c)
o = self.sigmoid(np.dot(x, self.W_o[0, :]) + np.dot(h_prev, self.U_o) + self.b_o)
h = o * self.tanh(c)
y = np.dot(h, self.W_hy)
[/code]
as far as I am aware, because y isn't getting a value.
and nothing inside there is causing an exception
So I've tried many things. I've sent x with an np newaxis, I've sent x as it stands without modification or alteration. But nothing has worked.
If you would like to help, then please go to my post and have a look.
Link to post?
For what it's worth, those are just code issues. They're all 'actual' errors that you need to fix first, before getting to anything related to your training.
Can't upload the traceback as it hits the 2000 character limit. sorry
First step: open a help thread so the conversation stays in one place: #how-to-get-help
!paste long blocks of code
thanks
One day when I grow up I'll know 10 % of the maths you do
I'm reading a 1000+ page refresher on operating systems whenever I have a spare hour
howdy peeps, just got a bit further ahead of where I was earlier thanks to Billy, but there are still a few issues. Things aren't being broadcast together properly.
As example setup...
import numpy as np
a = np.random.random((1, 1))
b = np.random.random((96,))
b= b.reshape(1, -1)
result = np.dot(a, b)
This works correctly and is broadcast together.
I have 2 np arrays in other code of same variance that I have applied them to and now I'm being told they can't be broadcast together in the sigmoid.
ValueError: operands could not be broadcast together with shapes (1,12288) (1,96)
I did a
xax = self.W_i.reshape(1,-1)
to reshape the array like I did in the quick tester.
x1=np.dot(x, xax)
x2=np.dot(h_prev, self.U_i)
i = self.sigmoid(x1 + x2 + self.b_i)
But sigmoid isn't working right..
File "/home/user/AI/ai4.py", line 118, in forward
i = self.sigmoid(x1 + x2 + xbi)
~^~
It points at "x1 + x2"
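A guess at what is going on, as a runnable sketch: flattening the weight matrix with `reshape(1, -1)` turns it into a (1, 12288) row, so the product with a (1, 1) input is also (1, 12288) and cannot be added to the (1, 96) recurrent term. The sizes below (128 inputs, 96 hidden units, since 128 * 96 = 12288) and the variable names are assumptions, not taken from the real code.

```python
import numpy as np

input_dim, hidden_dim = 128, 96              # assumed sizes: 128 * 96 == 12288
W_i = np.random.random((input_dim, hidden_dim))
U_i = np.random.random((hidden_dim, hidden_dim))
b_i = np.zeros(hidden_dim)
h_prev = np.zeros((1, hidden_dim))

# Flattened weights times a (1, 1) input reproduce the reported mismatch.
x_scalar = np.random.random((1, 1))
x1_bad = np.dot(x_scalar, W_i.reshape(1, -1))    # shape (1, 12288)
x2 = np.dot(h_prev, U_i)                         # shape (1, 96)
try:
    x1_bad + x2
except ValueError as err:
    print(err)

# Keeping W_i two-dimensional and feeding a (1, input_dim) row gives matching shapes.
x_row = np.random.random((1, input_dim))
x1 = np.dot(x_row, W_i)                          # shape (1, 96)
pre_activation = x1 + x2 + b_i                   # shape (1, 96)
print(pre_activation.shape)
```

Printing `.shape` of `x`, `self.W_i`, `x1` and `x2` right before the failing line would confirm or rule out this guess.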
Hi,
I am working on one-step-ahead time series forecasting. The following picture shows the forecasting result. As you can see, the time series is highly variable, and the R2 is 62%. I am using two models, CNN-LSTM-Attention and GRU attention. Please, I need your opinion: what do you think about the results? Can they be improved?
right side -> loss
left side -> average reward ( Q values )
why this is confusing: the loss values seem to be increasing (which is bad) whereas the average Q values are increasing (which is nice)
is the loss td error? it's a bit finicky because per the papers and such what you want to do is gradient ascent not descent, so the value should indeed be increasing, but usually you just invert the loss and descend it, though you could then invert it again for plotting I guess? (I can't tell what you're doing without the code anyway) either way, RL do be a bit confusing
yeah, RL is bit confusing for initial episodes, because it learns very slow!
do you need my code?
to check ?
I'm honestly not sure what I would be looking for, I'm ever so slightly out of loop with RL right now
yeah, I mean I have to train this on at least 100k then something will be out!
article about a dataframe being a bad abstraction. first argument seems to be lack of ability for type checking
Somewhat a silly debate in Python/dynamic type land
Oh, their argument is really against tables?
That everything should exist as an entity/object?
would this be the place to ask about data scraping
does it go so far to claim that NoSQL is the only way, lol?
It's more about strongly typing I guess. That we need some typing overlay? I didn't finish yet
guys can anyone take a look at my code? i really cant find the issue on my genetic algorithm, it just does not want to learn!!!!! (i trained it for an hour but the average score didnt increase, while it should learn a high score within minutes)
How to get started with ai on mobile ?
not really
You need dependent typing or more for this to work
Dataframe libs like Pandas can let you arbitrarily add new columns with whatever names at whatever time
How will you know what the type is at any point in time
Because, that's the point of data frames. Removing that removes the point
Can they solve that problem? It's one I spent too much time thinking of myself (strongly typed DFs)
I know Pandera
hmmm
Sounds convoluted
This is not the way to solve this problem (nor is statically typed dataframes)
Data versioning is also somewhat a solved problem
Have you heard of slowly changing dimensions?
Data warehouses have a data versioning problem
But, the schema is consistent
It's appropriate in some situations, but not in others
You can flip this problem on its head
tag datasets, tag runs
have a small CLI tool that can roll back time to a tagged dataset and execute a run on its commit hash
this is what I'd do for ML. I actually do this without the CLI tool
For analytics this is a terrible idea
because it's a solved problem
Couldn't you incorporate data versioning into the code encoding the version information as part of the metadata for each document?
What are we talking about though
NLP? Vision? or general data
for analytics or similar
for the NLP and vision this might be an OK idea
someone probably has? though if you plan on trading purely based on price, know that that is quite baseless of a trading strategy
typical "business" data
even the data I work with, which is not "business" data
structured data
With a relatively fixed schema
Well, I store all my datasets in Elasticsearch. When I embed the data into the server, you can create custom fields that track information related to the dataset. I imagine you can track where the information is coming from and update the embeddings of the data more easily.
I just make a strenum that tells me what the columns of a df are
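A sketch of what that could look like. The column names are hypothetical, and `class Col(str, Enum)` is used so it runs on older Python versions (`enum.StrEnum` exists from Python 3.11).

```python
from enum import Enum

class Col(str, Enum):
    """Single source of truth for the column names of one table."""
    USER_ID = "user_id"      # hypothetical columns
    AMOUNT = "amount"

def missing_columns(columns):
    """Return expected columns that are absent, e.g. from df.columns."""
    expected = {c.value for c in Col}
    return sorted(expected - set(columns))

print(missing_columns(["user_id"]))
```

It is a lightweight check, not real static typing: a typo in `Col.AMUONT` fails immediately, while a typo in a raw string only fails when the data is touched.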
Pro tip: instead of just slamming dependency installs into your terminal, read and at least understand their function
would you cache all the methods?
Yo
dms
I assume image_resized is a transformation step?
Do you cache the result after each call or recompute?
I like types as much as the next guy but ...
There's also other tools
Just test your code
When the effort of types gets too much just write a test
what do you mean?
yes, and? I don't know what you mean?
define "a lot of files"
you mean, code?
So you mean, a lot of data
Seems like you have something very specific in mind and I don't understand it. That's fine.
Stuff like Airflow solves a lot of this
and DBT models are also made exactly for this
There's also something called "data lineage" worth looking into
DVC is something I'm so skeptical about
make it and once I have it in my hands I can critique it better
I think your workflow is idiosyncratic
You've had problems that are unique to what you are/were doing
But they may very well not be the problems and scenarios that are common
Why isn't your data in a DB
And that's when it becomes idiosyncratic
why
... an object storage?
You're saying all of this because, fundamentally, you did these projects solo
How do you scale what you did to teams
A database does a lot more than just reading and writing files
Firstly, there's a thing called the medallion architecture. In this image they're showing it off with structured data
But you can do the same idea for unstructured data (images, sound, ...)
You keep the data in bronze, you can transform it to silver in multiple ways and times
If you change (the result of) your transform logic, which is a very very expensive thing to do irl, you can make a different section in silver and/or bronze
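A toy sketch of that idea, assuming made-up records and two hypothetical transform versions (`to_silver_v1`, `to_silver_v2`). In practice bronze and silver would be locations in an object store or warehouse, not in-memory lists.

```python
# Bronze holds raw records exactly as ingested and is never modified.
BRONZE = [
    {"id": 1, "temp_f": "68.0"},
    {"id": 2, "temp_f": "75.2"},
]


def to_silver_v1(record):
    """First transform: parse the string reading into a float."""
    return {"id": record["id"], "temp_f": float(record["temp_f"])}


def to_silver_v2(record):
    """Changed logic: convert the reading to Celsius instead."""
    temp_f = float(record["temp_f"])
    return {"id": record["id"], "temp_c": round((temp_f - 32) * 5 / 9, 1)}


# Each transform version writes to its own section of silver,
# so earlier results stay available for comparison.
silver = {
    "v1": [to_silver_v1(record) for record in BRONZE],
    "v2": [to_silver_v2(record) for record in BRONZE],
}
print(silver["v2"])
```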
Also
How are you enforcing role based access control?
At a basic level if you don't want all the goodies that object storages have
Having a principled way for user management is important
A big part of minio and S3 is just governance
Having the same tier of fine grained permissions with git and git LFS... idk
Okay so, then your data isn't in your repo
then it's in S3. Then it is in an object storage
Then isn't an entire project localized to your git repo if you use S3 or similar anyway?
How is that different to now where you have a DAG that reads data from an object storage, does transforms and writes it back? (the status quo)
What if you have a feature branch that has CD to a staging area
You submit a PR
It gets merged, CD to prod
You read main, you know it's occurred
Nah i'm just describing the standard workflow of companies with good engineering hygiene
If you have continuous deployment and you test stuff out in other branches isn't what you read in main exactly what happened in reality
With the ELT "pattern" you never tamper with your source data which means you can absolutely checkout to a commit, run your pipeline and have that dataset
Especially if you have the date of the commit and add created_at < commit_date
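A minimal sketch of the `created_at < commit_date` filter, assuming append-only source rows. The rows, the `rows_as_of` helper, and the commit timestamp are all made up for illustration.

```python
from datetime import datetime, timezone

# Hypothetical source rows; with ELT these are only ever appended to.
SOURCE_ROWS = [
    {"id": 1, "created_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2024, 2, 20, tzinfo=timezone.utc)},
    {"id": 3, "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]


def rows_as_of(rows, commit_date):
    """Keep only the rows that already existed at `commit_date`."""
    return [row for row in rows if row["created_at"] < commit_date]


# Assumed commit timestamp; it could come from something like
# `git show -s --format=%cI <commit>`.
commit_date = datetime.fromisoformat("2024-02-25T09:30:00+00:00")
snapshot = rows_as_of(SOURCE_ROWS, commit_date)
print([row["id"] for row in snapshot])
```

Checking out the commit and running the pipeline on `snapshot` would then rebuild the dataset as it looked at that point, as long as the source really is untouched.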
I'm "challenging" you on this not because I don't think it's a good idea
It is, it just isn't worth the paradigm shift imo
But the same can be said about really knowing what exists in this space already
Rather, to try old things
So as to not reinvent the wheel, but square
Which is an odd place to start
It's not
It's very niche
https://www.reddit.com/r/MachineLearning/comments/mrb096/comment/gun8aa0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button this is how I feel about DVC, said the same as me but differently
More eloquently
another link, says the same as I do
.
idk how your pipeline can be not ephemeral
it should always be
If it is, just run it
Just show me if/when it's done and I'll have an unbiased look
But if you invented a square wheel out of blindspots I'll tell you
gl with the takehome
I have this project where I must predict the next day's high and low temperature and was checking for normality, this is the Q-Q plot. What should I do about the not normally distributed residuals? Or how should I investigate this further?
what features do you train it with?
features = ['dew', 'humidity', 'precip', 'precipcover', 'windgust', 'cloudcover', 'visibility']
Do most of these features have a linear relationship?
If not then you can give a look to randomforests
How should I check if the features have a linear relationship?
by plotting each feature against the target
It's called data analysis
understanding the relation between your features and the target to identify the best model
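As a numeric companion to the plots, here is a small sketch that computes the Pearson correlation between one feature and the target. The numbers are made up and `pearson_r` is a hand-rolled helper, not a library call. A value near +1 or -1 suggests a linear relationship, but a value near 0 does not rule out a nonlinear one, so the plots are still worth looking at.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values purely for illustration.
humidity = [30, 45, 50, 65, 80]
high_temp = [31, 29, 27, 25, 20]
print(round(pearson_r(humidity, high_temp), 3))
```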
Yup, they have a linear relationship. I've done tests but have been cramming so much info these days that I'm not sure about anything right now lol
Thank you by the way, for your responses
hmm I am honestly new to this too xd
Ooh boy, still, how long do you have?
Well if so I think that's the best you can do and someone with more experience will probably have a better suggestion
about 2 weeks xd
All right, thank you : )
I've found 3 typos in dagster's docs so far
Sue them
my payday is coming
Anyone over here experienced with Opencv?
Was wondering how I could just identify all objects in an image
Doesnt require recognition but just detection
for example :
in a street like this, it would maybe identify all the different people and place a box enclosing them
I would also be using an edge detector on this
I'm trying to train AI using Torch & Transformers but this chatbot literally copies me.
i used 7k lines of messages to train it
using Cuda via google colab
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    save_steps=2000,
    save_total_limit=2,
    learning_rate=0.001
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    data_collator=data_collator
)
trainer.train()
model.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")
model = BertLMHeadModel.from_pretrained("./trained_model").to(device)
tokenizer = BertTokenizer.from_pretrained("./trained_model")

def chatbot_response(input_text, model, tokenizer):
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(device)
    output = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.pad_token_id)
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

if __name__ == "__main__":
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break
        response = chatbot_response(user_input, model, tokenizer)
        print(f'Bot: {response}')
its just a snippet from code
u can use windows
Use jupyter notebook
Congrat you just made a troll bot
so u guys using windows
I code on mobile so any idea how to start ai?
You can use Linux bro
u can use any os u want
Android on top
not android
Why not I'm trying ai on android
cause i watched a video that recommended linux
Don't tell me I can't:(
bro's trying to burn his phone
You run the code inside the cloud
cloud server
Technically yes
I have to buy a pc lol
Adios amigos
i prefer google colab
ok but fr someone help me
this mf copies me
For starters you are using the wrong type of llm
BERT type models absolutely do not want to generate text
They only ingest text and spit out numbers
They don't generate paragraphs
Secondly idk how you are training your model but building your own llm takes a monumental amount of data
i have 7k lines of messages isn't that good?
what should i use instead
Try millions or billions before you start seeing it produce sane text
Training llms from scratch requires an insane amount of data
You can't use that model
why?
Because it is not built to generate text
It is built to ingest text and spit out numbers for classifying data
My advice would be to use llama or the like for your chat, whatever
And models that understand non-english are going to be massive
I.e. 7+ billion params
I.e several GB at minimum
Idk how well llama even handles Turkish
Probably better with the 40B model
But it isn't something you can run easily yourself
why
Do you have a GPU with like 20GB+ vram?
You can try run it
Probably best to try via ollama
But the bigger models requires a lot of hardware
wouldn't it be better to just use an api for translation?
Maybe the smaller one will work with Turkish? But I somewhat doubt it, since Turkish has a pretty minimal presence in the datasets that they use to train these things
This is indeed also an option but I guess it depends on how many resources you want it to use, and if you need it to maintain context across it, since normally translation does lose some value of the text
Yeah makes sense
Maybe translate -> gpt2 -> translate will be functional enough for your use?
Id recommend Argos translate for the actual translate bit
Since it is free and self-hostable
And from our experience pretty solid
yeah i was thinking of that
but then i also need to translate dataset right?
No
ok so, should i use gpt2 via translation or llama?
What are you training it to do?
chatbot
So why do you need to train it for that?
If this is a LLM you just take a pre trained one and adjust the prompt
For llama, chatgpt etc...
I would just do that via rag tbh
If you fine tune like you're doing now you're probably causing more damage to the model than actually training it
Won't be enough text to likely change the weights it already has trained
how would rag work in this context?
So RAG (look it up) and llama probably best for you?
doesn't gpt2 count for that?
Use the 'train' set of messages to give the model context on how it should reply and try mimic
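A very rough sketch of that idea, assuming the 'train' messages are stored as (message, reply) pairs. Real RAG setups usually retrieve with embeddings; plain word overlap is swapped in here so the sketch runs with no dependencies. `retrieve`, `build_prompt`, and the example pairs are all made up, and the resulting prompt would be sent to whatever LLM you end up using.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def retrieve(query, examples, k=2):
    """Return the k (message, reply) pairs whose message best overlaps the query."""
    query_words = tokenize(query)

    def score(pair):
        words = tokenize(pair[0])
        union = query_words | words
        # Jaccard similarity: shared words divided by all words.
        return len(query_words & words) / len(union) if union else 0.0

    return sorted(examples, key=score, reverse=True)[:k]


def build_prompt(query, examples, k=2):
    """Put the retrieved pairs in front of the new message as style examples."""
    lines = ["Reply in the same style as these examples."]
    for message, reply in retrieve(query, examples, k):
        lines.append(f"User: {message}")
        lines.append(f"Bot: {reply}")
    lines.append(f"User: {query}")
    lines.append("Bot:")
    return "\n".join(lines)


# Hypothetical past messages standing in for the 7k-line dataset.
EXAMPLES = [
    ("how are you today", "doing great, you?"),
    ("what is your favorite food", "pizza, obviously"),
    ("are you a robot", "beep boop, maybe"),
]
prompt = build_prompt("how are you", EXAMPLES, k=1)
print(prompt)
```

No weights are changed here; the pretrained model only sees the examples as context.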
Not really
oh k
Or at least it is too small of a model
You basically need a 'big' LLM to do the actual text generation and have a conversation with
does Llama have its own pretrained words in it
i dont have billions of text so
Yes that is the idea with llama and others
I'd play around with ollama
It is a self-hostable service that lets you easily switch out models and prompts
what do u mean switching out models and prompts
i just need to specify one model and use it
Giving it past 60 days prices as % change.
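For anyone following along, a small sketch of how prices could be turned into percent changes and then into fixed-size input windows. The prices, the helper names, and the window size of 3 are made up to keep the example short; the message above uses 60 days.

```python
def pct_changes(prices):
    """Percent change between each consecutive pair of prices."""
    return [(b - a) * 100 / a for a, b in zip(prices, prices[1:])]


def windows(values, size):
    """Pair each run of `size` values with the value that follows it."""
    return [
        (values[i:i + size], values[i + size])
        for i in range(len(values) - size)
    ]


# Made-up closing prices.
prices = [100.0, 110.0, 99.0, 99.0, 108.9]
changes = pct_changes(prices)
for features, target in windows(changes, size=3):
    print(features, "->", target)
```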
Started training it.
but if you give it time and price you want it to find the patterns for you
@buoyant vine is it ok if i use allenai/longformer-base-4096 for llama
Progress so far
Who knows a free good cloud for ai
The one I'm using costs coins and its a pain in the neck to get them
Hi everyone, im new to AI and I dont use python that much.
Im trying to use this model https://github.com/vikhyat/moondream but i seem to have some issues.
When I install via pip these things in a sequence
- numpy (1.26.4 because something doesnt work well with 2+)
- torch
- moondream2 (deps specified on the github page)
I get an index out of bounds error when trying to use the model from the first example from the repo.
BUT if i do pip freeze > requirements.txt and then clean my venv and run an installation using the generated requirements I no longer get the index out of bounds error (using the same input)
What could be the issue?
What is random_state in sklearn's train_test_split? and why should i set it to 42?
google colab
42 is kinda like a seed which makes the data split the same every time you run
If u dont then before splitting it will shuffle the data differently every time you run
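A tiny sketch of the same idea using only the standard library. This is not how sklearn implements `train_test_split`; the `split` helper is made up, and it just shows why a fixed seed gives the same shuffle every run. The number 42 itself is arbitrary, any fixed integer works.

```python
import random


def split(items, test_fraction, seed):
    """Shuffle a copy of `items` with a seeded RNG, then cut it in two."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


data = list(range(10))
first = split(data, 0.2, seed=42)
second = split(data, 0.2, seed=42)
# Same seed, same shuffle, so the two splits are identical.
print(first == second)
```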
Hello guys. I want to upload a model and a dataset in streamlit and when i press run to say the accuracy. Can you please help?
Try creating a separate clean environment
Finally getting decent results and thats just the first epoch.
Hi
Howdy
No overfitting?
In backpropagation, which weights get adjusted first: the weights closer to the input layer or the weights closer to the output layer?
not yet!
Nice
data split into what, test and train?
im already setting a limit for the data that is being used to train right?
It's not splitting, but before splitting it randomises the data
I'm afraid I don't understand you- does setting it to 42 control the data in some way?
Yes
U know concept of random int?
I do not
U know function called random.randint?
and 42 is what keeps it stable?
Yes
okay i understand now- thank you
What are the most advanced parts of ML/AI in terms of skill?
Like do you want in terms of the difficulty to learn ?
What is skill anyway? It's not just knowledge, but experience applying to knowledge plus understanding when to apply which techniques.
So, the hardest part of developing skill is actually using the knowledge and learning from the experience
For Ai/ml, this means tackling a wide range of problems using a variety of techniques, and understanding which techniques are most likely to be fruitful (my point is that nothing individually is 'hard', the hard part is acquiring sufficient experience)
Yes
What academic stuff? I took calc 1-3, matrix and linear algebra, optimization, and a bunch of stuff. I don't know, data science differs so severely from one place to another and it's relatively new and wasn't a thing when I was in undergrad
Just in terms of ML/AI. Like, I don't know, what form of deep learning is the hardest? Like specifics. I just grinded NLP for a month straight. Probably reinforcement learning.
Like, in undergrad, I dealt with partials so much, to the point it is just nonsense. It just varies so much from place to place. Like, it is confusing. My friend has a masters in EE and mostly writes in PyTorch and TensorFlow, but he is engineering stuff like, let me show you https://github.com/devin1126/DevBot-1.0
This repository contains all of the code that was used in the creation of the first iteration of my custom surveillance robot coined the 'DevBot'. Please read the README.md file for...
For intense optimization, yeah.
No, I never found it confusing
It's not, it is hard when you have to see the statics once it is optimized to see how parameters change when things are optimized. That is very hard.
No, like, say f(x, y; a, b, c) = something, right? You have to maximize over x and y, not a, b and c. When it is optimized, parameters change,
You just take the partials of those to see if the whole thing was optimized correctly and if everything all together holds. I was just asking like, what is the highest level of mastery in ML/AI at the moment.
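To make that concrete, a sketch with a made-up concave function. The choice f(x, y; a, b, c) = -a*x^2 - b*y^2 + c*(x + y) is only an assumption for illustration. Setting both partials to zero gives x* = c/(2a) and y* = c/(2b), and the code checks numerically that the partials vanish there and how x* shifts when c changes.

```python
def f(x, y, a, b, c):
    """Hypothetical objective: maximized over x and y, with a, b, c fixed."""
    return -a * x**2 - b * y**2 + c * (x + y)


def argmax(a, b, c):
    """Closed-form maximizer from setting both partials to zero (a, b > 0)."""
    return c / (2 * a), c / (2 * b)


def partial(func, args, index, h=1e-6):
    """Central finite-difference estimate of d func / d args[index]."""
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


a, b, c = 2.0, 4.0, 8.0
x_star, y_star = argmax(a, b, c)
point = (x_star, y_star, a, b, c)

# First-order conditions: both partials should be roughly zero at the optimum.
print(partial(f, point, 0), partial(f, point, 1))

# How the optimal x moves when the parameter c changes: analytically 1 / (2a).
h = 1e-6
print((argmax(a, b, c + h)[0] - x_star) / h)
```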
uk a lib i can use for this
??
thats amazing
I was hoping you could suggest a way I could achieve what I linked
after 25k episodes!
left -> loss
right -> average Q
for objects in the street? Like I want something which is more general. Walking down a street recording this, it should be able to identify all objects
Im checking them out. While semantic segmentation does seem to really help, honestly the boring one suits my project more. Where do you normally find these? (Could you link if you found one already)?
Also whats your suggestion for how I should detect them - using haarcascades , lbps etc
So usually the gradients are calculated starting from the output layer and moving backward to the input layer and then the weights are updated simultaneously for all layers?
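A hand-written sketch of that order of operations on a toy two-weight network, y = w2 * tanh(w1 * x). The network, the numbers, and `train_step` are made up for illustration; frameworks do this bookkeeping automatically. In the usual setup the gradients are computed from the output side backward, and only then are all weights updated using gradients from that same pass.

```python
import math


def train_step(w1, w2, x, target, lr=0.1):
    """One gradient-descent step on y = w2 * tanh(w1 * x)."""
    # Forward pass: input -> hidden -> output.
    hidden = math.tanh(w1 * x)
    y = w2 * hidden
    loss = 0.5 * (y - target) ** 2

    # Backward pass: start at the output and apply the chain rule.
    dloss_dy = y - target
    grad_w2 = dloss_dy * hidden                      # weight nearest the output
    dloss_dhidden = dloss_dy * w2
    grad_w1 = dloss_dhidden * (1 - hidden**2) * x    # weight nearest the input

    # Update: both weights change together, using gradients from the same pass.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, loss


w1, w2 = 0.5, 0.5
losses = []
for _ in range(50):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=1.0)
    losses.append(loss)
print(losses[0], losses[-1])
```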
lemme check. thanks!
Thanks for your help Lisan Al Gayib, have narrowed it down to MS COCO and VIDVIP
also just another q b4 I go, as a beginner with CV , for image classification , do you recommend I do keras and then move to neural networks or should I directly move to neural networks
Well its very deep but I get the idea now
Thanks mate
Whats the difference between pytorch , tensorflow and keras?
does the documentation that u ve suggested contains all stuffs that i need to know about numpy fr ml (data science)
this is the doc if u dont remember it https://realpython.com/numpy-tutorial/
you should learn numpy from the official docs
will i need all methods in numpy fr ml
No but this is the kind of thing you should read to know what methods exist, then do a project with it, and then you can use it as a reference
yeah , but u have to memorize the methods
You absolutely do not
?
that's what i meant, they are pretty simple and self-explanatory so more like you will remember them once you use them
pytorch and numpy have mostly the same api for operations
In my opinion it's always a good idea to learn what methods the library has to get a sense of what it can do. Don't memorize them. When you have a project you'll forget which methods exist to solve a specific problem but you'll know where to look to find it
i believe you should remember function names, the most used ones at least
so, i gotta go directly to the official documentation
and see what s the methods that i will take
ok thx guys
is this the doc u mean?
yes
Hi all, Im using seaborn to generate a plot, however I cant get the legend to be outside of the graph...
I tried to solve it but it got cut off...
Relevant code:
sns.set_theme()
sns.set_style("whitegrid")
sns.set_context("paper")
#plt.figure(figsize=(12, 4.8))
plot = sns.barplot(data=df, x='files', y='similarity', hue='type')
sns.move_legend(plot, "upper left", bbox_to_anchor=(1, 1))
sns.despine()
plt.savefig('plt.svg')
Thank you in advance
seaborn. It's pretty good, I use it occasionally (plotly's my main). Not frequently enough to remember how to place the legend tho
Yeah, I mostly use plotly as well
seaborn is what I use for extensive EDAs because of jointplot etc
fwiw you can set your plotting backend with Pandas, in case you're using that. Meaning you can do df.plot() and have it output plotly, seaborn or matplotlib
As for moving the legend, big tip: most seaborn plots are matplotlib plots. It's often better to google "how to move the legend with matplotlib" in my experience.
Not sure if you are talking to me lol, but Im using pandas to create the dataframe already
Indeed, however the equivalent code for matplotlib is much more complex unfortunately
I think you add the box anchor parameter for that
plt.legend(loc="upper right") does that work for you?
if not, what do you want to do?
Maybe I don't get the question
I had used it in my previous code lemme see
Having the legend outside the plot, but not cut off like the image
let me see
Nope, it is inside and overlapping. I think this is the default behavior
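```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("paper")
plt.figure(figsize=(12, 4.8))
plot = sns.barplot(data=df, x='files', y='similarity', hue='type')
# Anchor the legend's upper-left corner just outside the right edge of the axes.
plot.legend(loc='upper left', bbox_to_anchor=(1, 1))
sns.despine()
# Leave the right 15% of the figure free for the legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
# bbox_inches='tight' keeps the legend from being cut off in the saved file.
plt.savefig('plt.svg', bbox_inches='tight')
plt.show()
```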
try this
Working thanks!
How did it work tho?
Also, I did not know one could use plt.show outside a notebook btw
I didn't know tight layout was a thing for non subplots
the bbox_to_anchor makes it outside and the rest makes sure it gets enough space
I honestly don't understand it that well
I found it on stackoverflow because I wanted the same thing a while ago
but I am glad it worked
Can you try this plt.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)?
Where? replacing the plt.legend of @deep sleet ?
sns.set_theme()
sns.set_style("whitegrid")
sns.set_context("paper")
#plt.figure(figsize=(12, 4.8))
fig, ax = plt.subplots()
plot = sns.barplot(data=df, x='files', y='similarity', hue='type', ax=ax)
ax.legend(bbox_to_anchor=(0, 1), loc='upper left', ncol=1)
sns.despine(fig=fig)
fig.savefig('plt.svg')
something like this
I vastly prefer making my figure and axis manually and passing it around
More explicit
Not working tho
@past meteor can I run ai code on Google Jupyter notebook or bothosting or pydroid3 on mobile ?
Or even a vps??
Vague question. What is "AI code"
You can run non neural net algos on most consumer grade computers
are most CNNs made through cv2, like, I do not know, ImageDataGenerator and stuff?
It ultimately depends on what it is. If it's LLMs you will need heavier hardware
I thought LLMs use RNNs
are RNNs kind of just irrelevant?