#data-science-and-ml
1 messages · Page 118 of 1
Wait out for 4090ti 
It's gonna cost a house and a car
And also half of your soul
Hm. I would love to help You but I have no clue what might be wrong
does it also require a sacrificial lamb and a drop of a virgin's blood
That's saved for RTX5000 series
thats more likely for the electric bill rather than the cost of the GPU itself lol
True dat^
i wonder for any of the 40 series how long the electricity bill outweighs the GPU itself lol
probably like a month of constant use?
You would have to use it to the maximum for a year probably
That's my guess
home refrigerator's power consumption is typically between 300 to 800 watts of electricity.
That's when cooling of course
they have a tdp of only like 300-400W, right? 400 W × 13 cents/(kW·h) -> $/year is $455/year (of constant usage), where I googled US electricity prices as 13 cents/kWh
Nvidia RTX 4090 has an official power draw of 450W
So not even a year, almost 3 in fact
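The back-of-the-envelope numbers above can be checked in a few lines. This is only a sketch: the 13 cents/kWh rate and the 400 W and 450 W draws come from the messages above, while the $1,599 card price (the RTX 4090 launch list price) and round-the-clock full load are assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # constant use, no idle time


def yearly_cost_usd(watts: float, usd_per_kwh: float = 0.13) -> float:
    """Electricity cost of running a constant load for one year."""
    return watts / 1000 * HOURS_PER_YEAR * usd_per_kwh


def years_to_match_price(watts: float, price_usd: float, usd_per_kwh: float = 0.13) -> float:
    """Years of constant use until the electricity bill equals the purchase price."""
    return price_usd / yearly_cost_usd(watts, usd_per_kwh)


cost_400w = yearly_cost_usd(400)
cost_450w = yearly_cost_usd(450)
breakeven_years = years_to_match_price(450, price_usd=1599)  # assumed card price

print(f"400 W: ${cost_400w:.2f}/year, 450 W: ${cost_450w:.2f}/year")
print(f"break-even at 450 W: {breakeven_years:.1f} years")
```

At a higher electricity rate or a lower card price the break-even point moves earlier, so the "about 3 years" figure only holds for these inputs.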
lol thats a very wrong forecast right there if i see one
hilarious how ARIMA(1,1,1) also gives me a flat line
like it just refuses to do anything
ppm is having a snake tournament
Btw how is ur training set looking like? Is it divided by years or months/days? Cuz training it on years perspective might give a huge false positive
its on months
or so i think
because i did dataDF.index = dataDF.index.to_period('M') to change the index to Month but i actually dont know if its what it does lol
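What `to_period('M')` does to an index can be checked on a toy series. A small sketch, assuming the index is a `DatetimeIndex` (the dates and values below are made up): the call only relabels each row with its month and does not merge rows, so aggregating to one row per month is a separate step.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"])
series = pd.Series([1.0, 2.0, 3.0], index=dates)

# Relabel each timestamp with its month; the number of rows is unchanged.
series.index = series.index.to_period("M")
labels = [str(period) for period in series.index]

# Collapsing to one row per month is a separate, explicit step.
monthly = series.groupby(level=0).sum()

print(labels)
print(monthly)
```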
grid search ftw
Now it actually matches in terms of position
im curious because this resource im following limited the grid search of the p,d,q to 11
can i go higher to get better results
or does it cause diminishing results
I am afraid I don't know what You are talking about
ARIMA takes 3 arguments (p, d, q), different models give different results, you do grid search by fitting each ARIMA model and evaluating, and determining the best with the best metric (i.e. lowest RMSE)
the grid search i did limited the value from 0 to 11
so it can go from (0,0,0) to (11,11,11)
hence this thing
actually it wasnt 11
it was only p
q and d was limited to 4, so technically the max is (11,4,4)
Hence the repetitive result? This is hella interesting but I see that I've got a TON of things to learn
its not repetitive
its taking the model arguments, example (1,2,3), create a model and fit it and evaluate the results
do it on the next
and when finished compare all models, and determine which one is the best
Oh, that's what You mean. Yeah I think I get it now
for this dataset, it got (9,2,0) since it has the lowest MSE value out of all
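The grid search described above can be sketched as a plain loop over every order. This is a hedged illustration, not the code from the resource being followed: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in rigged to be lowest at (9, 2, 0). In real use the scoring function would fit something like `statsmodels.tsa.arima.model.ARIMA(train, order=order)`, forecast the held-out period, and return the RMSE or MSE.

```python
from itertools import product


def grid_search(p_values, d_values, q_values, score):
    """Try every (p, d, q) order and return (best_order, best_score); lower is better."""
    best_order, best_score = None, float("inf")
    for order in product(p_values, d_values, q_values):
        try:
            current = score(order)
        except Exception:
            continue  # some orders fail to fit; skip them and keep going
        if current < best_score:
            best_order, best_score = order, current
    return best_order, best_score


def toy_score(order):
    """Hypothetical stand-in for 'fit the model, forecast, compute the error'."""
    p, d, q = order
    return abs(p - 9) + abs(d - 2) + q


# p from 0 to 11, d and q from 0 to 4, as in the search described above.
p_range, d_range, q_range = range(12), range(5), range(5)
n_candidates = len(p_range) * len(d_range) * len(q_range)
best_order, best_score = grid_search(p_range, d_range, q_range, toy_score)
print(n_candidates, best_order, best_score)
```

Widening the ranges is possible, but the number of fits is the product of the three range sizes, so the cost grows quickly, and a larger order that wins on one test split is not guaranteed to forecast better on new data.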
its a very interesting regressional technique used on time series stuff
this is the first time im dealing with ARIMA lol
This is the first time I am even looking at such a thing
Although it seems super cool I think I will stick to my NLP shit
its not even deep learning, its literally just regressional analysis
although im curious how i can use DL on time series data lol
No clue, but good luck
please always ask your actual question. don't ask to ask. if you have a question about langchain, assume that someone can help and ask a question that person could start answering.
Right now, what's up?
so for my product what im doing rn is m using
ChatGPT-AssistantsAPI for responses and handling context/history for a convo
everythings good working great,
so i want to pass user's data like his workspace data from my db via restapi call to gpt via RAG
so it could give biased answer considering the workspace data + it's own llm
My problem:
FileNotFoundError: [Errno 2] No such file or directory
actually i have JSON(rest_api) responses because in Langchain there's no option for SQL RAG,
im trying to use JSONLoader
but it takes a required argument called file_path
i dont rly have a filepath
FileNotFoundError: [Errno 2] No such file or directory
try giving the whole error message, from Traceback all the way to the end of the output.
!code
Please do not post screenshots of code.
```py
import requests
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import JSONLoader

API_URL = "http://127.0.0.1:8000/api/workspaces/"

def get_workspace():
    response = requests.get(API_URL, auth=("aryanjainak@gmail.com","Iamreal@123"))
    if response.status_code == 200:
        return response.json()
    else:
        print("Failed to fetch data:", response.status_code)
        return None

def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = JSONLoader(
        file_path=str(workspace_data),
        jq_schema='.messages[].content',
    )
    data = loader.load()
    embeddings = embeddings_model.embed_documents(data)
    vectorstore = Chroma.from_documents(embeddings, embedding=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    docs = retriever.get_relevant_documents("What is the name of my workspace?")
```
basically ik what the issue is
but what even can i do
it's a required argument file_path
Leaking your password to such a channel ain't the best idea
yeah, you should change your password for that API, since a bot has probably stolen it.
localhost
once you've changed your password for that API, post the whole error message that you're getting, starting from Traceback.
Yeah it's local host, but You posted your email as well. From Your reaction I supposed it's not a valid pass. It was just a friendly reminder to keep it in mind for the next time
@peak ridge in addition to the whole error message, can you also post all the import statements in that file?
FileNotFoundError: [Errno 2] No such file or directory: "PycharmProjects/Kleenestar/src/backend/[{'id': 6, 'root_user': {'id': 1, 'first_name': 'xyz', 'email': 'aryanjainak@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}, 'users': [{'id': 1, 'first_name': 'xyz', 'email': '2342@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}], 'business_name': 'Xyz', 'website_url': 'https:/www.xyz.com', 'industry': None, 'created_at': '2024-04-23T04:37:55.983893+05:30'}]"
i did ,edited there
given this data structure
[{'id': 6, 'root_user': {'id': 1, 'first_name': 'xyz', 'email': 'aryanjainak@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}, 'users': [{'id': 1, 'first_name': 'xyz', 'email': '2342@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}], 'business_name': 'Xyz', 'website_url': 'https:/www.xyz.com', 'industry': None, 'created_at': '2024-04-23T04:37:55.983893+05:30'}]
is there anything here that should be the completion of "PycharmProjects/Kleenestar/src/backend/ ?
JSON loader is requiring filepath because it's supposed to load the file from the harddrive
json.loads() is probably gonna solve your issue
And passing the actual data to it directly
nothing as such
JSON loader is the issue
ohh,
but as far as ik from the docs everybody is using this lib cuz it's returning Document(page_content= format lists
```py
Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
```
something like this
Then just save the data from API as .txt/.json
it looks to me like workspace_data is that list, and you were passing it as a string.
i cant i would have alot of users wont make sense ig
i actually passed the list but it doesnt accept list
Cuz he doesn't have a filepath
wait i'll show u, just a sec.
sure, but the solution wasn't to turn the list into a string. you can't just pass any string--it has to be a string that actually represents what you need.
```py
>>> from channels.rag import get_workspace
>>> get_workspace()
[{'id': 6, 'root_user': {'id': 1, 'first_name': 'xyz', 'email': 'aryanjainak@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http://127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}, 'users': [{'id': 1, 'first_name': 'xyz', 'email': '123@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http://127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}], 'business_name': 'Xyz', 'website_url': 'https://www.xyz.com', 'industry': None, 'created_at': '2024-04-23T04:37:55.983893+05:30'}]```
cools right?
json response
It's a dictionary already
hm, that's what i missed prolly
the outermost structure is a list
True, my point is that this is not a JSON. JSON doesn't accept single quotes
TypeError: expected str, bytes or os.PathLike object, not list
if i do this
```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = JSONLoader(
        file_path=workspace_data,
        jq_schema='.messages[].content',
    )
    data = loader.load()
    print(data)
```
sure, but file_path needs to be a pathlib.Path, or a string that is a file path. passing a string that is a valid json will still cause an error.
is my approach very terrible? @serene scaffold @neat bluff
isnt how u guys do rag
im from Django background
web dev databases bg
im learning all this for our startup
just a early startup
I know, I've pointed it out earlier to him earlier. jsonloader is clearly designed to only read files from hard drive
i can just use
json.loads() ?
My best guess would be to try. What's the worst that can happen.
smart guy
actually i did, but imma do again
```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = json.loads(workspace_data)
    print(loader,'xyz')
    data = loader.load()
    print(data)
```
TypeError: the JSON object must be str, bytes or bytearray, not list
Turn it to a string now
didnt even got 1 print
so like it even didnt work
As you did earlier
ohh
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
So it worked
But as I said already - JSON doesn't accept the ' as value surroundings (i have no clue what they are called)
[{"id": 6, "root_user": {"id": 1, "first_name": "xyz", "email": "aryanjainak@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}, "users": [{"id": 1, "first_name": "xyz", "email": "123@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}], "business_name": "Xyz", "website_url": "https://www.xyz.com", "industry": None, "created_at": "2024-04-23T04:37:55.983893+05:30"}]
It has to look like this instead
so what should i do sir
```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = json.loads(str(workspace_data))
    print(loader,'xyz')
    data = loader.load()
    print(data)
    embeddings = embeddings_model.embed_documents(data)
    vectorstore = Chroma.from_documents(embeddings, embedding=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    docs = retriever.get_relevant_documents("What is the name of my workspace?")
```
this is how it looks rn
I would save it as a raw string in your code
Lemme do it for ya
```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    workspace_data = '[{"id": 6, "root_user": {"id": 1, "first_name": "xyz", "email": "aryanjainak@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}, "users": [{"id": 1, "first_name": "xyz", "email": "123@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}], "business_name": "Xyz", "website_url": "https://www.xyz.com", "industry": None, "created_at": "2024-04-23T04:37:55.983893+05:30"}]'
    loader = json.loads(str(workspace_data))
    print(loader,'xyz')
    data = loader.load()
    print(data)
    embeddings = embeddings_model.embed_documents(data)
    vectorstore = Chroma.from_documents(embeddings, embedding=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    docs = retriever.get_relevant_documents("What is the name of my workspace?")
```
same error
wdym same error
json.decoder.JSONDecodeError: Expecting value: line 1 column 124 (char 123)
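Both decode errors in this thread have the same root cause: `response.json()` already returns parsed Python objects, and `str()` of those objects is a Python repr, not JSON. Swapping the quotes by hand is not enough, because `True` and `None` are not JSON literals either (`char 123` in the error above is where the first `True` sits in that string). A minimal sketch using a trimmed, made-up record shaped like the workspace data:

```python
import json

# Stand-in for what `response.json()` returns: already-parsed Python objects.
workspace_data = [{"id": 6, "is_active": True, "industry": None, "business_name": "Xyz"}]

as_repr = str(workspace_data)         # Python repr: single quotes, True, None
as_json = json.dumps(workspace_data)  # JSON text: double quotes, true, null


def parses_as_json(text: str) -> bool:
    """Return True if `text` is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


repr_ok = parses_as_json(as_repr)                           # fails on the single quotes
hand_fixed_ok = parses_as_json(as_repr.replace("'", '"'))   # fails on True / None
json_ok = parses_as_json(as_json)                           # valid JSON
round_trip = json.loads(as_json)

# One text chunk per record, which can be embedded without going through a file.
texts = [json.dumps(record) for record in workspace_data]
print(repr_ok, hand_fixed_ok, json_ok, texts)
```

Since the data is already a list of dicts, there is nothing for `json.loads()` to do here. One option, offered as an assumption rather than the only approach, is to build the documents directly from `texts` (for example with a vector store's `from_texts` constructor) and skip the file-based `JSONLoader`.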
yes
it's a web-application api
Mind showing me the code where You define and return said "JSON" data
```py
class WorkSpacesViewSet(viewsets.ModelViewSet):
    #permission_classes = (permissions.WorkSpaceViewSetPermissions,)
    serializer_class = WorkSpaceSerializer

    def get_queryset(self):
        # All the workspaces the request user is a member of
        return self.request.user.workspace_set.all()
```
its written in django
django-rest framework*
I can see that
Wait let me think
Mind editing the code of an API a bit?
```py
def get_queryset(self):
    # All the workspaces the request user is a member of
    userWorkspaces = self.request.user.workspace_set.all()
    return json.dumps(userWorkspaces)
```
No clue if this will not crash
dont u think the web-application will have trouble
who cares,
i can try
"No clue if this will not crash"
np
I suppose it's not production deployed yet so hence my "no worries" debugging approach
ya the web-app crashed
ya
Error? I suppose You imported json
TypeError: Object of type QuerySet is not JSON serializable
we could turn it into json
but response.json()
(already doing it)
That's what we are actually trying to do.
It's clearly not. Whatever You receive on the other side is not JSON.
we are calling the api in get_workspace
If it would be JSON there wouldn't be a problem loading it using json.loads()
actually you are right
actually im just trying with this workspace data
i wont really use this
i am calling users dynamic data,
there marketing data via marketing-channels api's
and i am storing it on my db and i wanna pass it
im just trying with workspace data
So if I understand correctly - we are trying to fix a mockup which isn't gonna be the final data managed by this code?

yes
but that data will also come from API request
or via db query (within an API)
i can change approach
like rn im calling via api
i can call via db directly if it works
it just needs to work
Anyway, the fact that json.dumps() isn't able to serialize a QuerySet (as stated in the error log) doesn't mean that it won't be able to handle it when we first treat it with some DICting...
or my company will die
Because now that I think about it... it probably didn't even try to do it because of the incompatible data type
true
but can i call it via db queries
more impossible
Funny thing is that I am doing similar thing and had similar issues, but in different area of interest
on the docs i saw these options
you understand my pain
are u guys using python for the backends?
Well I am one man army beside my frontend design guy.
im also the only backend guy
but we have 2 interns on the frontend
and my co-founder is designer
and we have a pretty decent access to investors,market product
Is that a SaaS You are trying to build?
my co-founder has 1 more product 5k users
yes sir
https://kleenestar.io
lol, i hate that too
these designers are crazy they love that
That's the one You are building rn or the one of Your friend?
Cuz it looks fucking fancy. That's for sure
yes, we are.
we have crazy funds and access too
180 pre-registered users
my co-founder is gr8 guy.
glad to have him
Alright. Now I feel dedicated to fix this crap
yes, we gotta do it sir.
Cuz maybe we will help each other in this crazy world of building SaaS
Should people speedrun their data stuff? https://youtu.be/x82Ze21aQ2E?si=PAFuaMkcwUtgqmD6
Uh, choked hard during the yfinance cluster split, nonetheless good run. You have to speed run your sets. You have to. It means absolutely nothing if you do not. Come on, break my Reekie.
Data Science in Python
Elements of Data Science
An introduction to data science designed for people with no programming experience, this book presents a small, powerful subset of Python that allows you to do real work in data science as quickly as possible. It includes Jupyter notebooks where you can read the text, run the code, and work on exercises to practice what you learn.
https://allendowney.github.io/ElementsOfDataScience/README.html
Install it with pip and run it then compare it to standard python repl.
pip install ipython
Can anyone point me to the right direction,
I'm trying to build a model that matches a book's paragraphs in one language with the matching paragraphs in a translation (for example, let's take the little prince's english version and its japanese translated version)
The idea would be to create a version of a book where its original and translation are laid out side by side for language learning
I'm not too sure yet how to approach this kind of problem (what model to use, what kind of problem it is, etc.) so i'd appreciate some guidance
as of now, my idea would be to vectorize/tokenize the words, compute something like a vector sum per paragraphs, then maybe match the resultant vector using a dot product with the vectors in the other language, the thing tho is that since these are two different languages, the way the words would be vectorized would probably result in vectors where the dimensions aren't the same, so not yet sure how to deal with that
TLDR: I'd like to create a model that automates the creation of something like this: http://bilinguis.com/book/alice/jp/en/c1/ where the model aligns the text from an original language to an official human-translated text
Any suggestions would be appreciated!
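The vector-matching idea in this question can be sketched as below. The main assumption is that both languages are embedded by the same multilingual sentence-embedding model, which would sidestep the dimension-mismatch worry because every paragraph then lands in one shared vector space. The 3-dimensional vectors here are toy stand-ins for such embeddings, and `align` is a hypothetical helper name.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(source_vecs, target_vecs):
    """For each source paragraph, return the index of the most similar target paragraph."""
    return [
        max(range(len(target_vecs)), key=lambda j: cosine(src, target_vecs[j]))
        for src in source_vecs
    ]


# Toy vectors standing in for paragraph embeddings from one multilingual model.
english = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
japanese = [[0.8, 0.2, 0.1], [0.2, 0.9, 0.1], [0.1, 0.1, 0.8]]

matches = align(english, japanese)
print(matches)
```

Independent best-match lookups ignore the fact that a translation mostly keeps paragraph order, so a real aligner would probably add an ordering constraint and handle paragraphs that were merged or split in translation.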
so what can i do now
How can i speed this loop up a lot?:
how do i speed up this loop, it needs to be very fast, so i can run it like 120 times a second:
```py
def RSI_strategy_numba(data: pd.DataFrame, rsi_values, indicators) -> tuple[list[pd.DatetimeIndex], list[pd.DatetimeIndex]]:
    buy_dates, sell_dates, state = [], [], 0
    for idx, rsi in zip(data.index.values, rsi_values):
        # If we're in the buy state, check for a buy
        if rsi > indicators[0] and state == 0:
            buy_dates.append(idx)
            state = 1
        # Otherwise check for a sell
        elif rsi < indicators[1] and state == 1:
            sell_dates.append(idx)
            state = 0
    return buy_dates, sell_dates
```
appending to dataframes (and numpy arrays) is always slow and generally not recommended. it's better if you use dicts or lists, and if you still need a dataframe at the end, convert the final result to a dataframe
I am not appending to the dataframe. The problem i believe is the size of the dataframe
its around 43000 lines
I would like to use numpy's faster vectorization, but i cant figure out how
it's a numpy array of values. the dataframe should have datetime as index, and then a column for rsi. Rsi values is coming from dataframe['rsi'].values
it does seem like you need the state from the previous result, but that's not a big problem here
what's the type of indicators[0]?
thats just a list of floats.
ok
i have a genetic algorithm which generates it for me
then you can compare the entire rsi_values against indicators[0]
yeah, using numpy.where
just rsi_values > indicators[0] yields a vector of booleans with all the results
or that
you can similarly compute the state for all indices together, though this is a bit more tricky because each state depends on the previous one. it may be that you cannot avoid doing this in a loop, but you could rewrite it as a convolution at least
that means you can do all of these operations without any explicit for loops
Idk how that'd work, im quite new to numpy
I guess i could calculate the checks using rsi_values > indicators[0]
but then from there, how do i loop that without explicitly using a for loop?
hmm if you're not familiar with convolutions then there isn't a much better way than what you're already doing
you could get a speedup by avoiding resizing the lists. you can initialize them with 0s
I'll try to press chatgpt for answers on convolutions tomorrow. But for now ill go to bed. Thx for the help!
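One loop-free way to get the state for every index is a forward fill rather than the convolution suggested above: mark each sample as a buy event, a sell event, or nothing, carry the latest event forward, and read buys and sells off the points where the carried state flips. The sketch below is an assumption-laden alternative, not the original function: `rsi_signals` is a hypothetical name, and it only matches the loop when the buy threshold is at least as large as the sell threshold, so that no sample triggers both conditions.

```python
import numpy as np


def rsi_signals(index, rsi_values, buy_above, sell_below):
    """Vectorized buy/sell detection; assumes buy_above >= sell_below."""
    index = np.asarray(index)
    rsi = np.asarray(rsi_values, dtype=float)
    positions = np.arange(rsi.size)

    # 1 = buy event, 0 = sell event, -1 = nothing happens at this sample.
    event = np.full(rsi.size, -1, dtype=np.int8)
    event[rsi > buy_above] = 1
    event[rsi < sell_below] = 0

    # Forward fill: position of the most recent event at or before each sample.
    last_event = np.maximum.accumulate(np.where(event >= 0, positions, -1))
    # The state after each sample is the type of the latest event (0 before any event).
    state = np.where(last_event >= 0, event[np.maximum(last_event, 0)], 0)

    # A buy is a 0 -> 1 flip of the state, a sell is a 1 -> 0 flip.
    previous = np.concatenate(([0], state[:-1]))
    buys = index[(state == 1) & (previous == 0)]
    sells = index[(state == 0) & (previous == 1)]
    return buys, sells


rsi = [50, 71, 80, 60, 29, 20, 75, 72, 10]
buys, sells = rsi_signals(np.arange(len(rsi)), rsi, buy_above=70, sell_below=30)
print(buys, sells)
```

With a DataFrame, `index` would be `data.index.values` and the thresholds would be `indicators[0]` and `indicators[1]`. Whether this is fast enough for 120 runs a second on 43,000 rows would need to be measured.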
"Hey, I'm about to start my journey into AI and ML! If anyone else is starting from scratch and wants to join a group for group study, let's create a study group together. Together, we can learn the basics and support each other along the way. Excited to start this journey with like-minded individuals!"
Andrej Karpathy's YT series is a great starter for intuitive Neural Network Basics. There's also a discord learning community there
if I made a machine learning algorithm and gave it info about jokes that I find funny and jokes I don't, would it be able to generate new jokes that I would find funny the majority of the time?
depends on the model's ability to learn properties of jokes that discriminate between ones that you do or do not find funny. it would probably take more training data than you would want to produce.
How long do you think it'd take just going through and collecting training data?
you'd have to find a dataset of "jokes" and go through (perhaps in excel) and label each one as funny or not funny to you. I would expect to spend at least several hours doing that.
Thanks for your information
If you have no clue of how large your dataset needs to be a priori what you can always do is start with a small sample, check the results, increase the sample etc. until you no longer really improve
hm
Thanks a bunch. It might be an overkill for my use-case, but I will check it out regardless.
@serene scaffold I probably should have tagged you. Perhaps you can shed some light on how I can analyze the data?
basically, you find 10 jokes you find funny and 10 you don't and you give that to the algorithm. You let it produce 10 funny and unfunny jokes. If all of them are spot on, that means you're done. This will likely not be the case, You'll have to find more funny and unfunny jokes (say 10 more) and repeat with 20, you keep doing this in a loop until you're satisfied
genius
AI is not able to understand human humour btw
but it can train to simulate his
this is basically philosophy because you need to define what you mean by "understand" to have this discussion (and honestly, these are my least favourite ones)
Depending on your definition of understand the answer is either yes or no
You know how in unicode different characters have different codes like
I think
A is 01000001
a is 01100001
(Just an example, probably wrong code)
My question is does any character take more storage space than another or do they all take up the same uniform space
some """characters""" do take up more space, but when talking about things at this level of detail, your notion of what a character is starts clashing with formal definitions
!e ```py
examples = ['A', 'Á', '猫']
for example in examples:
    print(example, len(example), example.encode('UTF-8'), len(example.encode('UTF-8')))
    print(example, len(example), example.encode('UTF-16'), len(example.encode('UTF-16')))
    print(example, len(example), example.encode('UTF-32'), len(example.encode('UTF-32')))
```
@agile cobalt :white_check_mark: Your 3.12 eval job has completed with return code 0.
```
A 1 b'A' 1
A 1 b'\xff\xfeA\x00' 4
A 1 b'\xff\xfe\x00\x00A\x00\x00\x00' 8
Á 1 b'\xc3\x81' 2
Á 1 b'\xff\xfe\xc1\x00' 4
Á 1 b'\xff\xfe\x00\x00\xc1\x00\x00\x00' 8
猫 1 b'\xe7\x8c\xab' 3
猫 1 b'\xff\xfe+s' 4
猫 1 b'\xff\xfe\x00\x00+s\x00\x00' 8
```
Wow, I just lied to someone basically, thanks bros
I recommend reading up on how Rust handles Unicode data and strings
it is pretty insightful even if you don't plan to ever use Rust
in utf-8 specifically the answer is "yes" -- some characters are 1 byte, some are 2, etc.
in python the answer is "maybe" because (i think) strings use a fixed width for each code point, auto-resizing their character width as needed. so it acts like utf-32 functionally, but in practice the storage size might be more like ascii if the characters are all 1 codepoint (which can be represented in 1 byte). however don't quote me on this because i don't remember where i read it.
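The in-memory behaviour described above can be observed on CPython, where a string stores every code point at the same width (1, 2, or 4 bytes) chosen from the widest code point it contains. This is a sketch of an implementation detail, not a language guarantee, and `bytes_per_char` is a hypothetical helper that measures how much a long string grows when one more copy of a character is added.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Extra bytes CPython needs for one more copy of `ch` in a long string."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = {
    "A": "ASCII letter",
    "\u00c1": "A with acute accent, still within Latin-1",
    "\u732b": "CJK character (cat)",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}
widths = {ch: bytes_per_char(ch) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}): {widths[ch]} byte(s) per character")
```

This is separate from the encoded sizes in the eval output above: `len(s.encode("utf-8"))` is what matters for files and network traffic, while `sys.getsizeof` reflects how the interpreter holds the string in memory.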
It's a "no" everytime actually. LLM's are able to generate jokes only out of existing ones - it's not able to generate new, unique or trend-based jokes. It will be able to operate only within existing context and by merging/changing the jokes it's been trained with.
you could build a system that does something like search for jokes, which then gets appended to the current context, from which the model can then generate new jokes. "LLMs are zero-shot learners" and all. but that's not the LLM itself, that's a bigger system built on top of the LLM.
ML (not just LLMs) is capable of taking 2 existing concepts and string them together to a 3rd, novel concept
this is a nice image
given 2 existing jokes it can produce a 3rd new joke
@tropic kettle note that python does not equal numpy does not equal apache arrow.
i actually don't know exactly how numpy stores its strings, but they behave like fixed-size UCS-4 fields, i.e. UTF-32, so bigger codepoints shouldn't take up more space but strings will tend to be large (and are un-ergonomic to work with due to the fixed field size)
arrow does have a native utf-8 string data type, and polars for example uses utf-8 and is backed by arrow. i assume the pandas arrow-backed dtype is also utf-8 but you'd have to dig around in their docs or source code for that info
and of course this is all irrelevant if you're interested in how databases store your text (depends on the database), or file size at rest (you choose the encoding + compression), or data size over the wire (same as file size, your choice)
It's not really new then, is it?
Running into a weird issue here.
Im training an AI to make number based predictions. It fits to the data perfectly shown by a final loss value of
0.00026427686376862626
Then on a test value it also "perfectly" predicts it.
```
Expected normalized number: [-0.9962218830221753]
```
with a loss value of `0.000004458501602317615`
except when I un-normalize the value it is completely off of the target. even though the loss is extremely low.
```
prediction: [170.93668721399916]
expected: [696.299988]
```
So everything did its job correctly. I already verified that the normalization stuff works. I believe the issue is caused by precision and the range of the upper and lower bounds of the normalization being huge.
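The effect of a wide normalization range can be shown with a small calculation. This sketch assumes min-max scaling to [-1, 1], which the thread does not confirm, and the bounds below are hypothetical. Only the target value comes from the message above. Under that scaling, an error in normalized space is multiplied by half the range when mapped back.

```python
def normalize(x: float, lo: float, hi: float) -> float:
    """Min-max scale x from [lo, hi] to [-1, 1]."""
    return 2 * (x - lo) / (hi - lo) - 1


def denormalize(y: float, lo: float, hi: float) -> float:
    """Inverse of normalize."""
    return (y + 1) / 2 * (hi - lo) + lo


lo, hi = 0.0, 500_000.0        # hypothetical bounds of the training data
target = 696.299988            # expected value from the message above

target_norm = normalize(target, lo, hi)
pred_norm = target_norm - 0.002                        # a tiny miss in normalized space
squared_norm_error = (pred_norm - target_norm) ** 2    # about 4e-6, looks excellent

pred = denormalize(pred_norm, lo, hi)
raw_error = abs(pred - target)                         # 0.002 * (hi - lo) / 2

print(f"normalized squared error: {squared_norm_error:.2e}")
print(f"raw error after un-normalizing: {raw_error:.1f}")
```

If the real bounds are of this order, a loss near 1e-6 is compatible with being hundreds of units off, so the loss is better judged in raw units or the target rescaled (for example with a log transform) before training.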
I'm not sure if you have the model train on the normalized value but then pass it the non-normalized value
since that effectively completely changes how the data looks and can change the pattern
Just want to ask, in the pinned section, someone recommended three resources for learning maths for ML/AI. Will they be enough to at least have a good maths base for ML/AI or does anyone recommend any alternate sources?
no it is given the normalized everything
What is a "good maths base" in your case?
In general you're looking for strong numeracy fundamentals, and specifically a good handle on undergrad-level calculus, linear algebra, probability, and statistics.
for "AI" specifically you probably don't need much statistics and can skimp on probability a little bit, but for generalist DS you do need both.
As in learning the required 'basic' maths for ML/AI. In my current situation, Im relearning my A level (high school) maths just to understand the general concepts before going onto learning the uni level maths
Im decent at maths. Im relearning it pretty quickly but it has been a few years. Only want to get into ML/AI cos its interesting and maybe useful for me in the future so might as well start now
anyone knows how to install tensorflow-gpu on windows ??
https://youtu.be/IHZwWFHWa-w?si=ymfibEI1iRHxjHf7&t=290
I am watching 3blue1brown's neural network video ep 2, and I am wondering how he got 13,002 weights/biases for the parameters for his neural network. When I calculate it I get 12,963.
your number appears to be correct if the output layer had 9 nodes, but it has 10 (they're numbered on the screen starting from zero)
In [10]: (784 * 16) + (16 * 16) + (16 * 10) + (16 + 16 + 10)
Out[10]: 13002
the last term here are the biases.
Thanks
yw
makes my head spin
just started getting into data science and stuff
im 17
A weight is a connection between two nodes. And each non-input node has a bias.
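The counting rule just stated (one weight per connection between adjacent layers, one bias per non-input node) generalizes to any fully connected network. A minimal sketch, with `count_parameters` as a hypothetical helper name:

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    # One weight for every pair of nodes in adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every node that is not in the input layer.
    biases = sum(layer_sizes[1:])
    return weights + biases


total = count_parameters([784, 16, 16, 10])
print(total)
```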
Why does it make your head spin? Your initial calculation was correct except that you were missing one output node.
TensorFlow >2.10 no longer supports GPU on Windows. You'll have to run it inside WSL2
If I install <2.11 version of tensorflow will it work then???
Yes but then you'd be stuck with old content, which will be detrimental in the long run (especially with how fast-moving the field is)
better to bite the bullet now and set up a WSL2 environment
Yes, the book math for machine learning will give you enough maths to understand the majority of canonical methods
That ain't me.
On the topic of TF, why do people still use TF over PyTorch? It seems for the most part Torch just dominates over TF both in speed and available tooling now.
I remember years ago it was the other way around, but since TF3 it seems PyTorch easily beats TF in almost every situation?
TF3? do you mean tf2?
the biggest reason is probably just momentum, but iirc it has a few niche advantages like easier/more mature deployment to web and edge devices
yeah sorry
for some reason I have it in my head that TF is v3 and v2 was the old version
TIL https://pytorch.org/executorch-overview is a thing though
ExecuTorch alpha release also provides early support for the recently announced Llama 3 8B along with demonstrations on how to run this model on an iPhone 15 Pro and a Samsung Galaxy S24 mobile phone.
my experience of that has been the opposite tbh
and ease of converting models to onnx, quantizing etc...
specifically
web and edge devices
or did you do it in torch? iirc it doesn't support web directly
(I mean embed, not creating an api)
we go straight to onnx models
Now, my experience with just pytorch on edge or embedded devices is bad because PyTorch feels huge and bulky
but exporting to onnx and then embedding onnxruntime or what not experience wise is better than TFLite
yeah idk then
TF/Keras is arguably easier to use and gets you results a tad faster
And it's also what some of us learnt in uni
But I've since switched to Torch, all in all they're quite similar and I'd just recommend the vast majority to use Torch (in conjunction with lightning)
This is what we do too for edge
I personally used to use Tflite earlier, and haven't used it recently. But Pytorch -> onnx is pretty straightforward
yeah, and portability wise very nice
plus who knows when Google will kill something
the import mechanism change in tf 2.6, when they changed keras to a separate python package broke a lot of things and made it generally frustrating to use tf.
imo before that things were looking good with the subclassing api and the functional API.
But that was what finally forced me to completely pivot to pytorch as my primary.
Haven't really used tf much since.
It's a shame because I was actively excited for tf, even contributed a few smol things iirc.
Yeah that was exactly the reason why I switched. The constant breaking changes
And the docs that don't follow them etc.
can anyone tell me how can i use isochrone?
o shit, u rite
I never see anyone at work use TF. I only see TF in questions asked on this server. So I suspect that many beginner tutorials were written for TF in the past.
hey all, im building this dev tool to transform scrappy python code to code that follows best practices by using LLMs and AI.
still in beta but would love to get feedback from python practitioners and people in AI
https://gitgud.autonoma.app/
any best practice im missing? is the output good quality for prod?
Copilot for real developers
like the idea, and your ui is really impressive
nvm there was a bug, fixed now
Nobody switching back to Keras?
I don't remember asking it to complicate the code beyond recognition (take it lightheartedly, lol)... all I really wanted it to do was to just go from for key in dct: value = dct[key] to for value in dct.values(): pass or for key, value in dct.items(): pass
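For reference, the rewrite being asked for looks roughly like this. The dictionary contents are invented for the example.

```python
dct = {"a": 1, "b": 2, "c": 3}

# Original pattern: iterate over keys and look each value up.
values_by_lookup = []
for key in dct:
    value = dct[key]
    values_by_lookup.append(value)

# When only the values are needed.
values_direct = []
for value in dct.values():
    values_direct.append(value)

# When both key and value are needed.
pairs = []
for key, value in dct.items():
    pairs.append((key, value))
```

All three loops visit the entries in the same insertion order, so the change is purely about readability.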
anyway, this is most certainly not ready for production, at least I can't imagine trusting it. Doing weird stuff like this might make some tests fail, and then you have to rewrite those, or it can change the AST in unexpected ways. Oh, and after all that it didn't use .items(); I even tried being a bit more explicit about the usage and even then... though it did produce less clutter that time. The logging is frankly way too much IMO, and also where are the two blank lines around function definitions?
also constants appear to get lowercased for some reason
it does seem to work somewhat better with a bit more code than with those tiny samples I provided in the screenshots
also also, it seems to quite arbitrarily get rid of some comments... that's definitely not ideal
dunno, there's certainly room for improvement I guess
Thank you very much for taking the time to test it and share observations Matiiss! This helps us a lot to iterate the solution. Sharing your comments with the team
guys im new to data science so far i understand data mining and getting unstructured data but i dont understand the part where u
use python or ai to structure the data and get key insights can anyone explain that?
"data mining" is really just a buzzword. I wouldn't put any stock into it.
If you have a bunch of reddit messages, that's semi-structured data, since you know who wrote each message, and when, and which message was in response to which. But the messages themselves are unstructured data in natural language (English, or what have you). If you were to identify all the locations that are mentioned in each message, then you'd have structured data.
What kind of structured data you might want to extract from unstructured or semi-structured data depends on who you are, and what data you have or can obtain, and what your goal is. Retail companies might want to obtain structured data about what people think about their products.
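A toy sketch of the reddit example: turning free-text messages into structured rows of mentioned locations. The `KNOWN_LOCATIONS` set and the messages are hypothetical, and a real project would more likely use a named-entity recognition model than a fixed word list.

```python
import re

# Hypothetical gazetteer, for illustration only.
KNOWN_LOCATIONS = {"Paris", "Berlin", "Tokyo"}

# Semi-structured input: the author is known, the text is free-form.
messages = [
    {"author": "alice", "text": "Just got back from Paris, loved it."},
    {"author": "bob", "text": "Berlin is cheaper than Tokyo though."},
    {"author": "carol", "text": "No travel plans this year."},
]


def extract_locations(text):
    """Return the known locations mentioned in a piece of text, sorted."""
    words = re.findall(r"[A-Za-z]+", text)
    return sorted({word for word in words if word in KNOWN_LOCATIONS})


# Structured output: one row per message with a list of locations.
rows = [
    {"author": m["author"], "locations": extract_locations(m["text"])}
    for m in messages
]
for row in rows:
    print(row)
```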
so in this case for the retail company u would write a program that looks into the database or .txt file and extracts information that only mentions what people think about the product?
Potentially.
is there a good video or channel explaining these stuff
What are the levels to NLP?
what is difference between data science, machine learning and AI
I'd say that AI is about creating algorithms capable of complex decision making. ML is a subset of AI where you make those algorithms by learning from experience (also known as data) but there is AI that isn't ML.
Finally, data science isn't a formal term with definitions. I'd say it's a toolbox of methods ranging from things related to ML to traditional statistics and potentially even optimisation/operations research. It's an applied field where you use data to solve problems. The "science" in there is to distinguish it from let's say business analytics where the goal is "insights" and bar charts. It's a narrower skillset.
"but there is AI that isn't ML."? what do you mean by this exactly
also any advice to someone coming from programming background into this field i don't wanna go to too deep into calculus but stay near the programming end
any pathways / job titles i should aim for
personally I think of AI as the goal and ML as a means
AI that isn't ML
programs that could do complex decision making existed before ML got popular
for example, you could just code a ton of conditional checks manually, and that could act like AI; in fact that's the idea of expert systems
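A deliberately tiny sketch of that idea: every rule below is written by hand and nothing is learned from data. The function name and the rules themselves are invented for illustration.

```python
def classify_animal(has_feathers, can_fly, lives_in_water):
    """Toy rule-based classifier: hand-written rules, no learning."""
    if has_feathers and can_fly:
        return "flying bird"
    if has_feathers:
        return "flightless bird"
    if lives_in_water:
        return "fish"
    return "unknown"


print(classify_animal(has_feathers=True, can_fly=False, lives_in_water=False))
```

Real expert systems had far more rules plus an inference engine, but the principle is the same: a person encodes the knowledge.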
ML is when hardware got better and people thought, "man, manually finding & coding in these rules every single time for every single new problem is a lot of work, what if we just had a generic tool which can do that for us instead?"
ML existed long before the hardware improved (1950s). The concept of AI has been around for a very long time, but ML came about around the time that lots of people started getting into AI for real (actual implementations on computers (machines, not people, back then "computer" also still meant a person / job title)).
thanks
is there any place for me?
But for example, an automatic prover / search algorithm was considered AI back then, now it may not be due to not being impressive enough anymore.
true, but I meant before ML got popular
unless that's also false thus making me more of a dum dum
ML was popular back then also. It's been around for a while now.
ah welp
learn something everyday I guess
Including some lesser known roles it played in stuff like the space race (optimization algorithms).
BFS, DFS, A* are all AI depending on the context
People don't like this but it's true
(About the space race stuff) But in that context it would only be considered a search / optimization algorithm. The term ML was around but had not blown up in usage yet; it was still ML though.
(and it has nothing to do with ML)
I guess they just "feel less AI" when compared to chatbots
Also a lot would fall just under control theory, now parts of it are considered ML, even though it's still (optimal) control theory.
Knowledge representation and reasoning (KRR, KR&R, KR²) is the field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can use to solve complex tasks such as diagnosing a medical condition or having a dialog in a natural language. Knowledge representation incorporates findings ...
this seems appropriate for the present discussion
AI is when it feels magical enough is a certain definition of it.
Hahahaha yes this is true
This also means that with enough time all AI becomes non-AI.
Whenever I do a talk I ask people if they think a Google search is AI and the vast majority says no
It's the best example of that
Also I do mean ML as in the term ML was coined and used at this time, not just things that fall under ML now.
If you include prior to the term ML, then even earlier, since lots of search and optimization happened automatically (on machines) during WWII.
About this, I don't know where we draw the line and say a method is ML or "just statistics" I think most people would say linear regression isn't ML but SVMs somehow are
a lot of people introduce linear regression as "the simplest form of ML" too
But what about linear kernel, least squares SVMs. They reduce to something similar to LDA
i don't think the term is well-defined enough to be worthwhile
That I agree about
For me the difference is the end goal, statistical inference or simply prediction
Yes if you can do inference you can do prediction
But stats was always more focused in inference and not necessarily prediction
A key part of ML is the M, it happens on a machine, linear regression and such came way before that.
But they can also be done on a machine, so idk.
And we already had people trying to make AI-like automatic proof machines and such. Although most were never completed, too far ahead of their time (pre-Turing).
Hi, I am new to python and I need to fit data with x and y errors in matplotlib. How can I do that? (I am trying something different than gnuplot, and I couldn't figure it out)
What are you asking, specifically? For plotting that, use errorbar with xerr and yerr arguments.
Yes, I need linear fit with xerr and yerr
I see, there is an argument sigma, but I don't know how to include xerr and yerr
As for fitting - what kind of linear model are you looking for? If you want to take into account having errors on the x-axis, you'd need something like "total least squares", aka https://docs.scipy.org/doc/scipy/reference/odr.html
Yes, I have to use total least squares for fitting as far as I know. But, this looks promising, thanks
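A minimal sketch of a straight-line fit with `scipy.odr` that uses errors on both axes. The data values and error sizes are made up, and `linear` is just a helper name chosen for this example.

```python
import numpy as np
from scipy import odr


def linear(beta, x):
    # beta[0] is the slope, beta[1] is the intercept.
    return beta[0] * x + beta[1]


# Made-up measurements with uncertainties on both axes.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.05, 6.95, 9.1, 10.9])
x_err = np.full_like(x, 0.1)
y_err = np.full_like(y, 0.2)

# RealData takes the standard deviations of x and y as sx and sy.
data = odr.RealData(x, y, sx=x_err, sy=y_err)
result = odr.ODR(data, odr.Model(linear), beta0=[1.0, 0.0]).run()

slope, intercept = result.beta
slope_err, intercept_err = result.sd_beta
print(f"slope = {slope:.3f} +/- {slope_err:.3f}")
print(f"intercept = {intercept:.3f} +/- {intercept_err:.3f}")
```

To plot the points with both error bars, something like `plt.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o")` should work, with the fitted line drawn on top.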
the AI director of my company said "AI is whatever you can't currently do"
@vernal thunder your message was removed for not being in English or being on-topic for this channel
I'm one of the moderators of this server.
Hahahahaah
Yes good
I'm not afraid of anyone
Keep this in your information
because Im Arabic
You don't need to be afraid. You just have to follow the rules. Posting informational content about religions is not on-topic.
I speak Arabic.
ุงุฐุง ุชููู ู ุนู ุนุฑุจู
ุงุฑูุฏ ุงู ุงุนุฑู ุงู ุฏูู ุชุชุจุน
ูุง
ูุง ุงููุง ุงูู ุดุฑู
We can't actually talk to each other in arabic in this server. But this server also is not an appropriate place for religious inquiry.
ูุจุฏู ุงูู ....
ุงู ุฑููู
ุฌูุฏ
ูู ุชุฏุนู ููุณุทูู ุงู ุงุฎุฑุงุฆูู
ุชููู
@vernal thunder I'm muting you if this off-topic discussion continues.
Hahaha, it looks like there are 14 on the PlayStation
This is a bad thing
By the way, I am the one who is silent
@vernal thunder if you send another message in this channel, make sure it's about data science or AI.
What's the best place to learn maths for ai ml
I suppose university/college would certainly be one of the better places for that
Mathematics for machine learning book
He listened, I am surprised
Any idea on why this code is throwing RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor?
Is it because of DataLoader?
The code for train_model is this:
def train_model(model: nn.Module, train_loader: DataLoader, criterion: LossFunction, optimizer: optim.Optimizer, num_epochs: int = 10) -> None:
    """
    Train a PyTorch model.

    Args:
        model (nn.Module): The model to train.
        train_loader (DataLoader): The DataLoader for the training data.
        criterion (nn.modules.loss._Loss): The loss function.
        optimizer (optim.Optimizer): The optimizer.
        num_epochs (int, optional): The number of epochs to train for. Defaults to 10.
    """
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            labels = labels.float()
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs.squeeze(), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}')
It works fine when not using CUDA
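The error message says the input is a CPU tensor (`torch.FloatTensor`) while the weights are on the GPU (`torch.cuda.FloatTensor`). A DataLoader typically yields CPU tensors, so one likely fix is to move each batch to the model's device inside the loop. A minimal sketch, using a stand-in linear model and random data rather than the original ones:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        # Move the batch to the same device as the model's weights.
        images = images.to(device)
        labels = labels.float().to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs.squeeze(1), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    return running_loss / len(loader)


# Falls back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and random data, purely for illustration.
model = nn.Linear(8, 1).to(device)
dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))
loader = DataLoader(dataset, batch_size=8)

epoch_loss = train_one_epoch(
    model, loader, nn.BCEWithLogitsLoss(),
    optim.SGD(model.parameters(), lr=0.1), device,
)
print(f"Loss: {epoch_loss:.4f}")
```

The DataLoader itself is not at fault; it just hands back tensors wherever the dataset keeps them.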
hey, does anyone have a good recommendation for a youtube or online guide for doing astronomy stuff with astropy and some machine learning? I found a youtube video online for using fits data and I was able to use ultralytics and roboflow, but i dont know if I got issues because of version mismatches with the pip packages or what, I'd rather ask if anyone is aware of a good guide
lmfaooo
{'Grid_ID': ['1001', '1001', '1001', '1001', '1001'], 'Datetime': [Timestamp('2023-03-01 00:00:00+0000', tz='UTC'), Timestamp('2023-03-01 00:15:00+0000', tz='UTC'), Timestamp('2023-03-01 00:30:00+0000', tz='UTC'), Timestamp('2023-03-01 00:45:00+0000', tz='UTC'), Timestamp('2023-03-01 01:00:00+0000', tz='UTC')], 'C1': ['4.25', '1.909999966621399', '0.0', '0.0', '0.0']}
# Convert "Nº de ..." (integer) columns to int
int_cols = df.columns[df.columns.str.startswith('C')]
df[int_cols] = df[int_cols].apply(np.int64)
df.sample(10)
ValueError: invalid literal for int() with base 10: '4.25'
Guys im having this error
any help?
You need to convert it to a float first
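A small sketch of that two-step conversion, using a cut-down copy of the sample data above. Note that casting a float to an integer truncates toward zero, so round first if that is not what you want.

```python
import numpy as np
import pandas as pd

# Cut-down version of the sample data: the numbers are stored as strings.
df = pd.DataFrame({
    "Grid_ID": ["1001", "1001", "1001"],
    "C1": ["4.25", "1.909999966621399", "0.0"],
})

int_cols = df.columns[df.columns.str.startswith("C")]

# Parse the strings as floats first, then cast to integers (truncates).
df[int_cols] = df[int_cols].astype(float).astype(np.int64)
print(df)
```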
hi all. python newbie here. need help with above
since this question is hyperspecific and depends on an xlsx file that only you have, it's not very likely that anyone will volunteer to answer it. I recommend doing the kaggle pandas tutorial.
Anyone ever had an error while training a cnn that says input ran out of data while using tensorflow? Found some advice in stack-overflow but the error is still there?
Can anyone help me?
i would point out that this is the deterministic interpretation, but you can alternatively derive the same regularizers through statistical criteria (e.g. for the L1 case, using maximum a posteriori when the parameters follow a laplace distribution centered at 0). also L2 does not restrict the values of the weights
discourages, yes, but not restricts
it won't prevent the weights from becoming infinitely large
what do you mean?
unless you explicitly introduce inequality constraints, the inputs and outputs will be unbounded
that won't stop you from quickly exceeding the computer precision and getting infs and nans
which is exactly what you see in e.g. exploding gradients
L2 reg alone does not prevent the parameters becoming arbitrarily large in any way
neither in the math nor in the implementation in the computer
yes but it won't "restrict" them. you CAN do that: you can guarantee the values never exceed a certain threshold
that's something different altogether and it's where constraints come in
the wording and semantics are important to distinguish that
i'd advise against making stuff up
what you're discussing now already exists and has names, and you'll have an easier time reading about it if you find the proper terms
if you say so. at any rate though, L2 does not prevent your parameters from becoming unbounded
you can do it via the extended lagrange form of the KKT conditions with inequality constraints
that does add some L2-looking regularization terms, but the additional slackness and positivity conditions anyway have to be enforced for the solutions to be in the feasible set. those require inequalities
the way it's usually explained is as "promoting smoothness"
maximizing the 2 norm of a vector is achieved by dumping all of the values into a single entry and setting everything else to 0
the minimum is achieved by making all entries equal
the less variation there is among vector entries, the lower the 2 norm
it's exactly what it does, though
smoothness when paired with an equality constraint does restrict the values, too
what L2 will do is make all of the parameters similar to each other
this
wdym by that?
i don't think degeneracy makes sense here either
that's what they mean by smoothness in this context, not differentiability
idk who came up with the term nor when, but it's well established
because the 2-norm is a contraction for small values, so it ignores them
you can try yourself playing with the example i gave above. take a vector, and for simplicity, work only with positive entries. say we work with the condition that the entries of the vector add up to 1
now let's maximize and minimize the 2-norm of the vector
it's pretty easy to conclude that the maximum value is 1, when one entry is 1 and the others are 0. this is the "least smooth" solution in the sense that it looks spiky
the minimum norm solution is the one where, if the vector is of length N, the entries are 1/N
then they all get contracted by the 2-norm
that solution is "smooth" in that the entries change very little w.r.t. each other
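The experiment is easy to try numerically. A quick sketch with three hand-picked vectors whose positive entries sum to 1 (the vectors are just examples):

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


n = 4
spiky = [1.0, 0.0, 0.0, 0.0]   # all of the mass in one entry
uneven = [0.7, 0.1, 0.1, 0.1]  # somewhere in between
smooth = [1.0 / n] * n         # all entries equal to 1/N

for name, v in [("spiky", spiky), ("uneven", uneven), ("smooth", smooth)]:
    print(name, round(l2_norm(v), 4))
```

The spiky vector has norm 1, and the uniform one has norm 1/sqrt(N), which is 0.5 for N = 4.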
yes, though almost never used
0 leads to combinatorial problems and is fairly common. L1 is its convex relaxation and they are actually equivalent under special conditions
everything between 0 and 1 promotes sparsity, but 0 is not a proper norm and 0 < L < 1 is non convex
L = 1 is convex and non differentiable, but it does have a nice subgradient
how so
is it possible to use different types of activation in the output layer of a neural network?
For example in my output layer 9 of the outputs have softmax, since they should be categorical, and 1 output should have linear activation since it is a prediction
why is one output different from the others?
I want to classify and predict at the same time
Something like this, where 1 is classification problem and 2 is prediction problem
did you make this?
I think there's a misunderstanding here. "classify" and "predict" aren't mutually exclusive things. a classifier predicts the classes of the inputs.
Okay. How would you approach this? If there is 9 classes and 1 parameter
I'm not sure
Do you understand what I am looking for?
I mean one obvious solution is to create two separate neural nets, one for class classification and one for the stock prediction
can I fit it in one?
Sure, you can make any type of architecture/configuration you want
What I'd worry about is how I'd compute the loss of this network
Yes exactly. I managed to implement it, but it didn't work well
I split it into classifier and linear prediction
So 1 softmax (with 9 classes) and 1 linear layer?
I wouldn't do that
Why not?
You can just have 1 network that does 2 outputs, you compute the loss of each and take the mean or so
It has been shown that combining them has desirable properties, it has a regularizing effect. If you want to get into the weeds you can read this https://en.wikipedia.org/wiki/Multi-task_learning
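A rough sketch of the one-network, two-output idea in PyTorch. The class name `TwoHeadNet`, the layer sizes and the random data are all made up for illustration; the shared trunk feeds a 9-class head and a single linear output, and the two losses are averaged.

```python
import torch
from torch import nn


class TwoHeadNet(nn.Module):
    def __init__(self, n_features, n_classes=9, hidden=32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        # Outputs raw logits; cross_entropy applies the softmax internally.
        self.class_head = nn.Linear(hidden, n_classes)
        # Linear activation for the regression output.
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.class_head(h), self.value_head(h).squeeze(1)


torch.manual_seed(0)
x = torch.randn(16, 5)                # random stand-in features
y_class = torch.randint(0, 9, (16,))  # random class labels
y_value = torch.randn(16)             # random regression targets

model = TwoHeadNet(n_features=5)
logits, value = model(x)
loss_class = nn.functional.cross_entropy(logits, y_class)
loss_value = nn.functional.mse_loss(value, y_value)
# Plain mean of the two losses; a weighted sum is a common variation.
loss = (loss_class + loss_value) / 2
loss.backward()
```

The two losses can sit on very different scales, so the weighting between them is effectively one more hyperparameter to tune.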
Obviously, the biggest issue with neural nets is that you can never easily conclude if it doesn't work or there's just a special set of hyperparameters you haven't tried yet that do work
I think if I were you I'd train them separately first and hyperparameter tune them separately as well and then benchmark against a multi-task style architecture
i don't think rust has a good BLAS/LAPACK implementation yet, does it?
- Python can use multiprocessing without any problems.
- The issue is multithreading. Only one Python thread can talk to the interpreter concurrently.
- Most major libs like numpy use multiple threads in C-land which circumvent this issue.
which means even though it could be a good idea, no one has done it yet
idk how easily rust exposes SIMD
google says support is only experimental
this arguably has a bigger impact than just parallelization which, as zestar says, is already taken care of in C for numpy
Not my area of expertise but you can definitely allocate chunks of the matrices/arrays to different threads and combine them afterwards
simply by virtue of getting more slots on the OS scheduler, sure. if your task already exceeds the cache size and the number of parallelizable operations in SIMD, you can speed it up by getting favored by the gods of RNG
it does, pretty good bindings to openblas, or if you're feeling wild it is pretty simple to bind to fortran
no but i mean written in rust directly
stable SIMD i.e. simd operations behind types that make it easier to use, but intrinsics are stable outside of avx512
aha
ah, no not really, I'm not sure it is worth ever doing that vs binding to cblas or open blas.
I have made some vector math libraries in Rust, but not to the same extent as blas. Often it becomes pretty annoying to maintain such a large number of specialized ops
for the sake of maybe beating blas by a few pct
Yeah, idk for AI/ML I probably wouldn't use CPU for heavy ops regardless, and Rust-cuda is a pretty nice experience
that's what i would've thought, yeah
my respects to you
Add mine as well
One thing I guess I would weigh in here though, I think Rust can be great for training models in situations where you need multi-gpu or multi-threaded dataset processing or pre-processing.
At work we use PyTorch Lightning and that thing single-handedly takes 20 minutes to start up on a big dataset with 32 cores due to all the multi-processing and extra overhead going on from Python, where Rust can just use threads natively. That and the static type checking can help significantly to reduce the crashes at the ends of runs due to some random error.
That being said, for quickly knocking something out Python still wins, and I think maybe if you have enough time training via onnxruntime might solve the original issue.
maybe
This is nonsense, every language with performance in mind can do parallelism, multiprocessing is also not what is desired, you don't want a process for each part. If they mean vs Python then that would make sense for CPU heavy tasks, but for matrix multiply we have numpy anyhow. Python is extremely slow. But you actually get more gains (more than the parallelization step) by switching to something like C or Rust ignoring parallelism. Python is just that much slower. And we do do that every time we call a numpy function. And also usually it all happens on the GPU anyhow where Rust does not apply (for large enough matrices / deep learning).
Also bonus points if you realize it's even better to use something like OpenCL for the CPU, for which you can also use PyOpenCL (SPMD/ISPC is the superior model for this stuff which is why the GPU also uses it).
Based you're talking about matrix multiplication in VR Chat tho
I have seen ppl teaching calculus on a whiteboard in it.
typical vr chat discussion
I think it would be more viable if you could more concretely force the compiler to unroll some loops
biggest gain fortran has IMO is the ability for it to aggressively unroll loops and split the ops into SIMD lanes automatically vs manually
Adding all this stuff to Rust does not seem a high priority, they really like their functional style without writing manual loops. But it will probably be added.
Tbh idk if it will ever truly have the ability to force unrolls since it is technically controlled by LLVM and depends on LLVM being able to work out if it should or not
Rust is more of a modern C++ alternative than C which gives it a focus on ergonomics over this kind of optimization stuff.
And as usual everyone ignores all the cool stuff Fortran did :(
Eh I disagree, at least for optimized compute, you can achieve the same thing in Rust, albeit unsafe Rust, as you would in C, but both still have the same issue that LLVM/gcc largely control the unrolling behaviour automatically
but yeah, in terms of writing fast math ops without having to get your hands dirty with manual SIMD, fortran is awesome
especially F95+ where you can expose functions via FFI more easily now
Yeah you can, but it's a question of how difficult, after all, I could also in Python by manually outputting machine code to a buffer writing that to an executable memory page and running it. This is an extreme example but unsafe Rust plus hoping LLVM does the right thing can feel like that. Anyhow I don't want to make this a Rust complaint channel so we can go to off topic.
My point was more unsafe rust gives you same control as you would C in reality, and if you really want the most number crunching performance, in both cases you are always manually writing the intrinsic regardless of if it is C or not, but yeah we're getting a bit off topic lol
Do I need a Master's degree to work as a data scientist?
im currently a sophomore undergrad computer science and engineering student
Depends on the country. Are you in the US? Europe? India?
Turkey
I don't know anything about the Turkish job market to be honest
I took linear algebra and some other math classes and I'm taking statistics, ML, AI and differential equations this semester
The only people that can answer this are people in your country
I'd like to work abroad though
Then it'll depend on the country you want to work in specifically
I think it's possible with a bachelors in the US and UK for instance
Where I'm based (Belgium) not so much
Belgium requires MSc / PhD at least?
Science/theory oriented degrees put you on a track where you get BSc + MSc, no one leaves these before getting an MSc. Practice focused tracks don't lead to an MSc and cover no math, stats, ... (anymore) but deliver better programmers at day 1
that's the summary
1 or 2 years, so 4 or 5 years total (bs + ms). Just 1/3 finishes it in that time so it's more like 5+ years for the majority
yeah, a good move here is to do 2 of 1 year each
well yes, each place has their peculiarities
hence why, and I don't mean this to be rude, it's better to ask people IRL. Online you'll get US-centric advice that most likely doesn't apply to you (or could even be detrimental)
that's the edge case
But if you're targeting idk Germany, I think r/germany or whatever is optimal
Did you look at the ones I sent?
(I can resend)
A video
the f
which one?
Hey I am looking for some hints for something I would like to do with tensorflow. I want to show one of 5 images into the camera and have the program tell me which one it is. I know I should probably use template matching, but all tutorials I can find use more than one image as training data. Does someone know a good starting point for this?
For one thing, use pytorch.
Even if you want to implement model that performs template matching, you would need more than one training instance, would you not?
try pytorch
doesn't TensorFlow have a specific library for this set of problems? i think i've seen it somewhere, KerasCV.
hey are there any resources for this? scraping data using
language model-based tools like OpenAI API, Mistral 7B, Llama2
This means it is overfitting right? Spikes around 10,15 epochs
Before I hyperparam tune it, I want to make sure I did my best regarding the model architecture
Yes, looks like overfitting. Btw, the last 3 columns in the confusion matrix are all zeros, you might want to look into that
Yes. cool . thanks. Yes I know why are they 0s. this is my class distribution
I wanted just to test it out, before I remove outliers
Imma try over and undersampling and classweights to see how it performs
If it sucks imma just chop it off
I wouldn't say they're overfitting
I look at overfitting as a disproportionate gap between the validation and training loss
Your problem is moreso that your val loss isn't smooth but that's imo pointing towards a learning rate that is too high, lack of dropout, ... things you can tune easily
I just train with "enough" early stopping
Honestly, I noticed that there's a lot of variance in training. If I run the same hyperparameters on the same data some runs it's good, some runs it's not. Setting early stopping to something "reasonably high" makes you robust to the model quitting after a few bad epochs
As in, I think it reduces the variance
Thanks guys
Hmmmm
I wouldn't tune the seed but if I had enough time and patience I'd run the same thing N times and make a boxplot or something yes
Just don't have the time to do that with neural nets, each run takes way too long
You know what I should consider? Doing hyperparameter tuning as a multi-objective optimization problem. Instead of just tuning for the loss you also consider the time it takes to do a single run.
That way I could keep my search space larger, but have the hyper param optimizer "punish" the algo for selecting a very low learning rate or very large architecture.
Yeah, we have 2 big #enterprise GPUs
I have optuna, tensorboard and mlflow set up nicely. All I need to implement is 1 function to have my pipelines run. I code an architecture, run it for a couple of days and then read papers to find/code up the next one.
Contract research, $pharma pays us to do the research and then they may or may not put the ideas in prod
pretty much
that's really nice
My issue is, I'm very wary of "tools"
I've been burnt so many times trying to adopt a shiny thing in my codebase only to notice it just doesn't do what I want it to do
If it's Python I'm also just relatively fast and churning out code so it's always a trade-off between "will I write it myself or figure out how it works from the docs"
My setup is ... nonstandard. I already had existing sklearn based preprocessing and metrics. I didn't want to port all of it to work for Torch so I wrap my Torch models in a sklearn interface
Instead of going with https://github.com/skorch-dev/skorch and figuring out the pros and cons it was simply faster (<1-2h work) to write that interface myself
I want some good resources to learn text mining in python, can anybody suggest some lec series or book?
did i need to learn machine learning to make ai?
You don't need to learn machine learning to use off-the-shelf algorithms or call APIs (like OpenAI and the stuff cloud providers offer) but if you want to train your own and go off the beaten path yes you do
Has anybody used tensorflow Profiler tool?
I want to see where my pipeline is bottlenecking. I use CPU training only and read data from SSD.
stop_early = EarlyStopping(monitor='val_accuracy', patience=50, restore_best_weights=True)  # 100
tb_callback = TensorBoard(log_dir="logs", profile_batch='1,10')
history = model.fit(
    x_train_norm, y_train,
    epochs=100,  # 400
    batch_size=64,
    validation_data=(x_test_norm, y_test),
    callbacks=[stop_early, tb_callback],
    verbose=1
)
But there isn't any profiler data
I can see other things, which means, it works
help my install isnt working
in vs code
i used the pip install and it installed perfect
then i rebooted and it still says module not recognised or something
when i try again it says its already satisfied
whenever you need help with an error message, always show the whole error message in the chat as text.
Screenshots are not text.
Import "torch" could not be resolved
Anyway, you probably pip installed pytorch to a different environment than the one vscode is using. try running the program anyway and show the whole error message, if there is one, starting from Traceback.
Please stop posting screenshots of text. I will not answer any questions you ask in the future if you keep doing this.
it looks like you tried running a pip install command. I'm asking you to run the python script that you're trying to write.
@unkempt jay I'm still available to help. what do you do to run the python program?
i have my minor project submission tomorrow so i have one problem, my project is density based traffic light management system which detects objects using yolo v3 model using coco names file which include names of 80 objects, but my aim is to detect the ambulance in traffic with all other vehicles . my project is not detecting ambulance ,please help me.
hello, you'll need to be more specific in order to get help.
hey Stele if you can help me on #1035199133436354600 it'd be great!
if you want to cross-post your question, please link to the thread itself and give a brief explanation of what it's about.
sorry but how do i link my posts?
right-clicking a message gives you the option to copy the message link
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# sklearn comes with some example data sets
from sklearn import datasets
# Import train_test_split function
from sklearn.model_selection import train_test_split
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
#Import scikit-learn MLP classifier
from sklearn.neural_network import MLPClassifier
df = pd.read_csv("dest")
x1 = np.array(df["x1"]).reshape(-1,1)
x2 = np.array(df["x2"]).reshape(-1,1)
x3 = np.array(df["x3"]).reshape(-1,1)
Y = np.array(df["Class"])
X = np.concatenate((x1,x2,x3), axis=1)
accuracy_train = np.zeros(100)
accuracy_test = np.zeros(100)
for i in range(100):
    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5)  # 50% training and 50% test
    # Create MLP classifier object
    mlp = MLPClassifier(solver='adam', hidden_layer_sizes=(20, 20), max_iter=50000)
    # Train MLP classifier
    model = mlp.fit(X_train, y_train)
    # Predict the response for training dataset
    y_pred = model.predict(X_train)
    acc_train = metrics.accuracy_score(y_train, y_pred)
    accuracy_train[i] = acc_train
    # Predict the response for test dataset
    y_pred = model.predict(X_test)
    acc_test = metrics.accuracy_score(y_test, y_pred)
    accuracy_test[i] = acc_test
print("Average accuracy for training data:", np.mean(accuracy_train))
print("Average accuracy for test data:", np.mean(accuracy_test))
Hi, dumb question but for this piece of code I'm curious why there's a variance in accuracy results
Is mlp.fit() and train_test_split() the two reasons why?
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5) # 5% training and 50% test
This is wrong. if you set the test size as .5, the training set size will be 1 - .5
yes the comment is a typo
i meant 50% test and 50% training
forgot the 0
(fixed)
y_pred = model.predict(X_train)
you also predicted on the training data, and you can't use that to evaluate the model's performance.
no that's intentional, my assignment asks me to predict on both the training and test samples
and then compare
by changing hidden layer sizes and max iteration
which i've sorted but like it also asks what causes the "variance" in the accuracy results
i'm suspecting it's the mlp.fit() and train_test_split(), am i right?
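On the variance question: both suspects are sources of randomness. train_test_split shuffles the rows before splitting, and MLPClassifier draws random initial weights (and shuffles mini-batches with adam). A minimal sketch of how to check this, assuming synthetic data from make_classification as a stand-in for the CSV and a shorter max_iter so it runs quickly; run_once is a made-up helper name:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data (assumption): 3 features, like x1, x2, x3 in the original CSV.
X, Y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)


def run_once(split_seed=None, model_seed=None):
    """Train once and return test accuracy. A seed of None means 'random each call'."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.5, random_state=split_seed
    )
    mlp = MLPClassifier(
        solver="adam", hidden_layer_sizes=(20, 20), max_iter=200, random_state=model_seed
    )
    with warnings.catch_warnings():
        # Short training may stop before convergence; fine for this illustration.
        warnings.simplefilter("ignore", ConvergenceWarning)
        mlp.fit(X_train, y_train)
    return mlp.score(X_test, y_test)


# Both seeds fixed: the run is repeatable.
fixed_a = run_once(split_seed=0, model_seed=0)
fixed_b = run_once(split_seed=0, model_seed=0)
print("fixed seeds:", fixed_a, fixed_b)

# Vary only the split, then only the model initialisation.
print("split varies:", [run_once(split_seed=s, model_seed=0) for s in range(3)])
print("init varies: ", [run_once(split_seed=0, model_seed=s) for s in range(3)])
```

With both seeds pinned the two accuracies match; letting either seed vary is what brings the spread back, so the 100-run average in the assignment is smoothing over both effects.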
My project is based on a YOLOv3 model and uses the COCO names file, which I took from GitHub and which has only 80 objects. My mentor asked me to detect an ambulance if it is in traffic, but my project detects it as a truck because ambulance is not included in the COCO names file. So the task is to train on a dataset so that the model can detect an ambulance and label it as "ambulance" in the bounding box.
how do i label these semester into first year and not first year?
I suppose it depends on the information on the data you have and the location/school it was collected from.
Is 1, 2,and 3 the only unique values in that column?
A 2 years masters program for example, usually has 4 semesters. Year 1 would correspond to semester 1 & 2, and Year 2, semester 3 & 4.
In your case it appears the program only has 3 semesters (presumably it's a program with 1.5 years completion time.)
If that's the case, semester 1 & 2 should correspond to year 1. The rest, semester 3, becomes 6 months (0.5 years).
You just have to investigate further to figure how it works over there.
Oh there's semester 4. I didn't catch that at first.
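If the rule turns out to be "semesters 1 and 2 are first year, everything after is not", the labelling itself is a one-liner. A small sketch under that assumption; the semester values are made up and label_year is a hypothetical helper name:

```python
def label_year(semester: int) -> str:
    """Semesters 1 and 2 make up the first year; anything later does not."""
    return "first year" if semester in (1, 2) else "not first year"


semesters = [1, 2, 3, 4, 2, 1]  # made-up values for illustration
labels = [label_year(s) for s in semesters]
print(labels)
```

With a pandas column the same function can be passed to Series.map, e.g. df["semester"].map(label_year), where "semester" is a guess at the column name.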
Hi, is anyone familiar with Hugging Face's Transformers library? I'm trying to fine-tune an ASR/speech-to-text model with the library but idk how to feed it my dataset. Each sample is a 30 second audio file as the "feature" and a text saved in a notepad file as the target. If I want to feed this data in, can I just use a simple list/numpy array, or do I need to turn it into a tensor first? Any form of help is appreciated, thanks in advance
is there a type of ML recommender system where in addition to the usual collaborative/content filtering, I can add/specify/enhance other dimensions (not sure if i phrased that correctly)?
For example, I'd like to look for a movie that's similar to The Matrix but isn't sci-fi, or a movie/series that's similar to Star Wars but anime (this would probably be Legend of the Galactic Heroes), or Death Note but romance (Kaguya-sama)
Or maybe even something like "Something like movie X but not Y or Z" (e.g. something like stranger things but not like Dark)
Can anyone point me to resources on how to build something like this? Thanks!
Collaborative/content filtering doesn't need to be confined to 1-dimension (a scalar). Examples may only show scalar ratings to keep things simple. You can find the k-nearest neighbors based on vectors of any dimensionality.
To identify series "not like" Dark, point their vectors in the opposite direction (multiplying by -1). Make them "far away" for your distance function, so they are dissimilar for recommendation purposes.
Another RecSys mechanism is a Graph Neural Network (GNN). Different edge labels correspond to what you're thinking of as dimensions. A graph visualization may be easier to imagine, and GNNs can learn an optimal recommendation algorithm.
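One way to read the "not like Dark" idea in code: subtract the disliked show's vector from the liked one and rank candidates by cosine similarity to the result. This is only a sketch; the four feature axes, all the scores, and the candidate names are invented for illustration.

```python
import numpy as np

# Hypothetical feature axes: [mystery, horror, time travel, 80s nostalgia]
shows = {
    "Stranger Things": np.array([0.8, 0.7, 0.1, 0.9]),
    "Dark": np.array([0.9, 0.4, 1.0, 0.3]),
    "Candidate A": np.array([0.7, 0.6, 0.0, 0.8]),
    "Candidate B": np.array([0.8, 0.3, 0.9, 0.2]),
}


def cosine(u, v):
    """Cosine similarity: 1 = same direction, -1 = opposite direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# "Like Stranger Things but not like Dark": add the liked vector,
# add the disliked one multiplied by -1.
query = shows["Stranger Things"] + (-1) * shows["Dark"]

candidates = ["Candidate A", "Candidate B"]
ranked = sorted(candidates, key=lambda name: cosine(query, shows[name]), reverse=True)
print(ranked)
```

Candidate B shares Dark's heavy time-travel component, so it ends up pointing away from the query and ranks last.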
can you elaborate on how to use KNN for this?
I'm only a bit familiar with the algorithm but how exactly would i implement something like " star wars but anime"?
(I just realized that what i'm trying to build is something very similar to the attached pic, but for movies instead of words)
I left this random search to run overnight. Does this mean that this model architecture has capped performance at 85%, and that I should change something in either the data or the architecture?
Can someone give me an example labelled dataset and unlabeled? I'm new I've searched it up but all I got was a long explanation with no example
labelled: You have Images of cats called "cat_1.png", "cat_2.png" and images of dogs called "dog_1.png", "dog_2.png"
unlabelled: You have no idea about what is in "image_1.png", "image_2.png", "image_3.png", "image_4.png"
thanks
How do I train images?
you don't train images. you train a model. you might train that model to do things with images.
how do I do that?
there are lots of ways you could do it. you first need to decide specifically what the model needs to do.
I'm trying to make a model that predict the given image if its a dog or a cat for fun
technically there are some things you could call "training an image", but these are almost definitely not what you are looking for - in particular Style Transfer
99.99999% of the time you are training models, not images/texts/prompts/etc.
okay, so first you need to procure a dataset with lots of images of dogs and cats. see if you can find one where the dimensions of every image are the same.
and there needs to be some way to know which image is which. like a text file structured like
image,animal
1.jpg,cat
2.jpg,cat
3.jpg,dog
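A label file like that can be read with the standard library. The sketch below inlines the same three rows so it runs on its own; the cat=0/dog=1 mapping is an arbitrary choice, and in practice the text would come from a file on disk.

```python
import csv
import io

# Same structure as the example above, inlined instead of read from disk.
labels_csv = """image,animal
1.jpg,cat
2.jpg,cat
3.jpg,dog
"""

class_to_index = {"cat": 0, "dog": 1}  # arbitrary numeric ids for the two classes

reader = csv.DictReader(io.StringIO(labels_csv))
samples = [(row["image"], class_to_index[row["animal"]]) for row in reader]
print(samples)
```

Each (filename, class id) pair is what a training loop would later use to load an image and its target.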
i got the dataset part but i don't know what application or the thing to train
look into convolutional neural networks for image classification with pytorch
Do you have any baselines you can compare to?
Are there any good courses, resources that I can learn on?
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
thanks
Creating a group for ml looking for friends
My partner has an interesting problem involving Jaccard index...
Comparing >300k unique subsets A (2^A), where |A| > 300 - and for each such set finding sets with which it has minimum Jaccard index...
We were thinking about representing the sets as 300 bits- that gives us fast calculation of the index itself (because bitwise operations), so only the number of calculations makes it costly -
brute force of everything-to-everything is (300k)² operations.
Does anyone have any ideas how to get it lower? We were thinking about clustering it somehow but |2^A| is so big it's hard to think of something that makes sense (there's a lot of pairs that don't intersect at all).
Or what to use to optimise the speed of calculations - I know basically nothing of numpy but there might some methods to make such repetitive calculations fast?
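For reference, the bitset idea looks like this in plain Python, using arbitrary-size ints as the 300-bit sets. It only makes each comparison cheap; the search is still brute force over all pairs, so it does not answer the clustering question. The helper names and the tiny example sets are made up.

```python
def to_bitset(elements):
    """Pack a set of small non-negative ints (element ids 0..299) into one int."""
    bits = 0
    for e in elements:
        bits |= 1 << e
    return bits


def popcount(x):
    """Number of set bits (int.bit_count() does the same on Python 3.10+)."""
    return bin(x).count("1")


def jaccard(a, b):
    """|A & B| / |A | B| on bitsets; two empty sets count as identical here."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def least_similar(i, bitsets):
    """Index of the set with the minimum Jaccard index against bitsets[i]."""
    others = (j for j in range(len(bitsets)) if j != i)
    return min(others, key=lambda j: jaccard(bitsets[i], bitsets[j]))


bitsets = [to_bitset(s) for s in ({0, 1, 2}, {1, 2, 3}, {200, 299}, {0, 1, 2, 3})]
print(jaccard(bitsets[0], bitsets[1]))
print(least_similar(0, bitsets))
```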
Hey there, I don't know if it's the right channel to ask but:
Guys I want to learn a second language after python, I've just started learning ML and want a lang to help me out in that field
Which one do you recommend and why,
Rust or C++?
And yes, I want it for ML/AI primarily, and maybe i could gain some insights into how things work, the compliers, interpreters all of this stuff
If cpp's the one then I'll shoot myself in the foot, in a way that blows my whole leg off
There isn't another language that would help you all that much for ML. I guess you could learn C++, since a lot of python libraries are implemented in it, but there are better uses of your time if you want to get better at ML.
Thanks for the suggestion, could you tell what are those better uses in case my mindset is wrong?
study the math for ML, read about the different use cases, implement different ones and analyze their performance.
ML is more about math and applying research methods than it is about systems engineering or programming.
everyone does.
related meme
also replace "computer science" with "ml"
True af xd
Maths kinda intimidating but I've heard it's like a language, the moment you get fluent you'll be obsessed with it
And yea every bit of computer related stuff were made by math
the "computer" in "computer science" deals with "computability" in mathematics: can you perform a certain action/do a computation in a finite number of well-described steps
traditionally CS is a branch of mathematics
(not anymore, now it largely depends on the university cuz it can also mean other stuff)
University just gets you into details, as you said, CS is a branch of math
And your explanation of "computer" and math was brilliant
(Need to separate the two so you can't claim a math degree)
Add geometry, trigonometry, linear algebra, calculus (yes, you need to learn it if you want your physics to not be buggy garbage (on the other hand, it gives speedrunners more to work with)), and more depending on the specific game.
(If you are responsible for the graphics, you got a whole lot more to learn)
opengl looks fun but i have 0 reason to learn it
It's technically legacy now, since Vulkan is OpenGL 5.x (it was originally supposed to be the next version of OpenGL). But for a while it will still be around, because not everything has good Vulkan drivers yet (or ever will).
it's like 30 times harder though
Apple is putting the nail in the coffin though.
Yeah, I don't recommend using it directly unless you have to.
yeah, i can't find graphics programming applicable other than in game development, maybe it could be a fun experience applying math to it and whatnot
It's crucial for all kind of things, including the main topic of this channel.
GPUs used to be just for graphics, now they are pretty general.
how can vulkan be possibly used in ml?
You can use it to run models on the GPU.
aren't there better alternatives? it seems foreign to me that you'd use a graphics library like vulkan to do that
It's pretty normal to use a graphics library like Vulkan for this. There are some alternatives, they are all very similar. Ones like CUDA are Nvidia only, Vulkan can even run on mobile.
Also CUDA can't render graphics on its own to a window, Vulkan has all the normal graphics stuff.
(Without extensions)
i'm gonna guess that something like this is easier than making a game with it
Vulkan is an open standard, like OpenGL. There is also stuff like OpenCL, which is more like CUDA, but not just Nvidia.
Then there are some others.
Yes, it actually tends to have less setup work.
OpenCL would be fun, if there were more resources with C++
IMO OpenCL has the least boilerplate, and is the overall best API.
OpenCL technically also runs on more than just GPUs, it can do CPU, FPGA, etc.
I need to talk to someone who is good at computer vision... Could you please DM me?
C, then C++. C because it's the lingua franca of the programming world, lets you make fast things (like C++/Rust), is relatively simple, and C++ is directly based on it. C++ because like C, it's everywhere. Rust could be done instead of C++ after C, but since all the existing stuff uses C++, go with C++ (so you can read all that existing code).
Beyond this, GPU shader languages matter more, which includes stuff like CUDA, OpenCL C, HLSL, GLSL. For this CUDA or OpenCL C. CUDA if you already have an Nvidia GPU.
You could add other high level languages, but Python has kind of won that battle.
- You don't have to be really good at C or C++ or Rust, you more importantly just have to be able to read it.
- If you can read it, you can read other's code in open source projects to learn how to use them well / mimic them.
Any solutions for installing torch in alpine dockers?
'Star Wars' can be represented as a vector in n-dimensions having n scores from [0,1] in features like 'genre: sci-fi', 'producer: george lucas', 'best-picture: 1977', etc.
Reduce them down to a vector i on an arbitrary i-axis representing some 'ideal' measure of "Star Wars"-ness. Allowing for movies that out-Star Wars the original Star Wars, let's assume Star Wars has a score of 0.9977 from all of its n components projected onto i.
Add 2 basis vectors j and k along the j-, k-axes representing Japanese-ness and cartoon-ness. These aren't necessarily orthogonal to the i-axis (Star Wars borrowed from the 7 Samurai, so it may already have a 0.25j component embedded within itself).
j x k is a 2-D plane where (1, 1) represents 'anime'. i x j x k is a 3-space where the movies most similar to Star Wars are nearest to (0.9977, 1, 1) when their n-dimensional vectors are projected down onto this space.
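As a toy version of that last step: once every movie has coordinates on the (i, j, k) axes, "Star Wars but anime" is a nearest-neighbour lookup around (0.9977, 1, 1). All coordinates below are invented for illustration; only the 0.9977 and the 0.25 come from the description above.

```python
import numpy as np

# Made-up (Star Wars-ness, Japanese-ness, cartoon-ness) coordinates.
movies = {
    "Star Wars": (0.9977, 0.25, 0.0),
    "Legend of the Galactic Heroes": (0.8, 1.0, 1.0),
    "Seven Samurai": (0.4, 1.0, 0.0),
    "Spaceballs": (0.7, 0.0, 0.0),
    "Titanic": (0.05, 0.0, 0.0),
}

target = np.array([0.9977, 1.0, 1.0])  # "Star Wars" + "anime"

names = list(movies)
points = np.array([movies[name] for name in names])

# Euclidean distance from every movie to the target point.
distances = np.linalg.norm(points - target, axis=1)
nearest = [names[i] for i in np.argsort(distances)[:2]]
print(nearest)
```

With real data the hard part is producing those coordinates, not the lookup; the lookup stays this simple.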
what problem did you run into?
How to install torch/ultralytics in my Docker image for production usage
I'm using Mender to deploy my code to my devices, and for application deployments I need to create a Dockerfile which will contain all the libs necessary for running the script, and one of them is ultralytics
My app runs on python 3.11.0
I would see if you can find a base image that already has python3.11 and pytorch installed, and then extend the Dockerfile from there.
Do you think thats "stable" solution?
yes? why wouldn't it be?
Idk, always when i see some other img than systems with languages which i use im wondering how to not use them and install it by commands on alpine dockers instead
https://hub.docker.com/r/ultralytics/ultralytics/tags found this but that's like 6 GB
I guess you could do docker run -it alpine /bin/sh (the stock alpine image doesn't ship bash) and figure out all the steps you'd need to do to get pytorch running, and then reconstruct those steps in the dockerfile.
For sure I will try, but I'm wondering if there is anybody here who has had the very same problem
looks like the other thing you'll want to consider is having the nvidia docker runtime installed
Is anyone here familiar with computer vision? If yes, could you please DM me, I need advice for a project. Thank you.
please don't ask to ask. instead, ask a complete question about computer vision, so that people who know about computer vision can read it and start answering it without extra steps.
Its a long question which would be more suitable in a conversation form. I just needed some advice
at least give enough information in this chat to start the discussion. if someone said "I know about computer vision", what would be your next message?
To give them details about what I want to do and ask for their opinion?
yes. people want as much information as you're willing to give them before they make a decision about helping or not.
Can anyone give me some advice on an object/target tracking project? It would consist of my drone using its front camera to detect me, start tracking me, and make appropriate decisions to keep me centred in its video feed as I move further away or out of its frame.
@hollow escarp I got curious and tried to install pytorch in an alpine container. and once I finally got it installed, I couldn't import it because of some missing OS dependency.
but just installing pytorch makes the container more than two GB, so you might as well start with a more substantive base image.
hi
hello and welcome to our wonderful data science and ai chat.
Strange question that includes other domains. I'm getting a ValueError when using pandas.apply
df[col] = df[col].apply(
    lambda x: (
        x.strftime(...)  # <- vscode raises exception here
        if ((not pd.isnull(x)) and (x != ""))
        else x
    )
)
This one in particular:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#using-if-truth-statements-with-pandas
I'm running this Python code with the VS Code debugger, and it only occurs when I have "Raised Exceptions" ticked. When it's not ticked, the program runs smoothly and it SEEMS like there are no issues - as in the apply function is doing what it was intended to do.
At the point of the raised exception, the x value is the whole column as an array, so I understand why the error can show up.
type(x) # <pandas.core.indexes.datetimes.DatetimeIndex>
I have "raised exceptions" ticked and then press Step Over, and execution continues successfully - with each step transforming each value to readable timestamp.
The deployed version of this code runs without exceptions. Am I going insane? or is there an issue with vscode or the python debugger... or python interpreter? (using python3.10 in a venv)
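One possible explanation, offered as an assumption rather than a diagnosis: some pandas versions handle datetime columns by first trying the function on the whole DatetimeIndex and, if that raises, quietly falling back to calling it element by element. A debugger with "Raised Exceptions" ticked stops on every raise, including ones that get caught later, which would match seeing x as a whole DatetimeIndex while the deployed code runs fine. The sketch below mimics that pattern in plain Python; map_with_fallback is a hypothetical stand-in, not pandas code, and the caught error here is an AttributeError instead of the ValueError from the question.

```python
from datetime import datetime


def map_with_fallback(values, func):
    """Try func on the whole container first; fall back to per-element calls."""
    try:
        return func(values)  # raises here: a list has no .strftime
    except Exception:
        # The exception is swallowed, so a normal run never shows it.
        return [func(v) for v in values]


def fmt(x):
    """Format timestamps, pass None and empty strings through unchanged."""
    return x.strftime("%Y-%m-%d") if x is not None and x != "" else x


values = [datetime(2024, 4, 30), None, ""]
result = map_with_fallback(values, fmt)
print(result)
```

If that is what is happening, unticking "Raised Exceptions" and keeping "Uncaught Exceptions" ticked would be the expected way to avoid stopping on it.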
can you show print(df.head().to_dict('list')) as text (no screenshot)?
thanks again!
what would you suggest I do to obtain the features? I'm guessing this would be some embedding of some sort done on the synopsis/summary which extracts those features?
Thank you, sorry for the late reply give me a moment.
I did only head(2), because 5 is too large. Also changed all the values
# df.head(2).to_dict('list')
{'aa': ['2024-04-30 00:05:00+1234', '2024-04-30 00:10:00+1234'], 'bb': ['123123123', '123123123'], 'cc': ['123123123', '13123123'], 'dd': ['2309rjei230', '2309rjei230'], 'ee': ['', ''], 'ff': ['', ''], 'gg': ['', ''], 'hh': ['', ''], 'ii': ['', ''], 'jj': [0.0, 0.0], 'kk': ['U', 'U'], 'll': ['filename.json', 'filename2.json'], 'll': ['123123123', '123123'],
'mm-timestamp': [Timestamp('2024-04-30 00:13:46+1234', tz='timezone/timezone'), Timestamp('2024-04-30 00:13:46+1234', tz='timezone/timezone')], 'nn': [0.0, 0.0], 'oo': ['edwed', 'edwed'], 'pp': ['wed', 'wde'], 'qq': [None, None], 'rr-timestamp': [Timestamp('2024-04-30 00:18:50.400544+1234', tz='timezone/timezone'), Timestamp('2024-04-30 00:18:50.400544+1234', tz='timezone/timezone')], 'ss': [True, True], 'tt': ['vee', 'vee'], 'uu': ['123', '123'], 'vv': [0.0, 0.0], 'ww': ['A', 'A'], 'xx': ['', ''], 'yy': [Timestamp('2024-05-01 00:01:08+1234', tz='timezone/timezone'), Timestamp('2024-05-01 00:01:08+1234', tz='timezone/timezone')], 'zz': ['qwerqwer', 'qwerqwer'], 'az': ['weqrt', 'weqrt']}
hi, i was considering learning prompt engineering what course would yall suggest?
Is there anyone here who have performed redundancy analysis (RDA)?
Do you want to share your progress
Any reason why you can't use this? https://hub.docker.com/r/pytorch/pytorch
Maybe a better question, any reason why you can't export your torch model to ONNX and then make/use an ONNX image without needing the torch dependency?
I created script which uses YOLOv8 to detect some stuff from my camera
More like i created license plate recognition
Which uses YOLOv8 to get Location of license Plate of img
Through what are you using Yolo? Darknet? Torch? Tensorflow? Opencv?
Torch, i followed this installation process: https://medium.com/@pat.x.guillen/a-step-by-step-guide-to-running-yolov8-on-windows-122cb586b567
See the fine suggestion of the sklearn.neighbors.KNeighborsClassifier in scikit-learn earlier by @data.exs
It's easy to pick out bespoke features in a specific example, harder when answering a more general query and your data set has 1000s of features like you can find on IMDb ($), TMDB, MovieLens or try a small hand-crafted pd.DataFrame with just Star Wars, Space Cruiser Yamato, 7 Samurai, Spaceballs, Family Guy's Star Wars Parody, Titanic and Rocky to start with.
Feature engineering is trying different sets of features when training your classifier to see which features produce the best recommendations.
So im looking for a way to use my model on my raspberry PI for detecting stuff from My already trained model
So, as mentioned I think you should convert it to ONNX and deploy that on your raspi
thanks! i'll check this out
And then you can use this on your raspi https://onnxruntime.ai/
Okay, I should also add that I need to build it into a Docker image and deploy that to Mender (that's an OTA updates provider), which then deploys it on the devices
Sure. You basically have 2 steps now. You need to compile the torch model to ONNX and then load that into a Docker image that has the ONNX runtime.
Personally I'd do this with a multi-stage build. In the first stage I'd basically use the Pytorch base image I linked initially and install all packages I need on top of it to build the ONNX file.
The second stage is one that has the ONNX runtime, you copy the file you made in stage 1 and you're ready.
(I simplified it, in practice I'd have at least 3 steps but that's ok. You'll at least use 2 images, that's the important point).
I'd start out testing this workflow in a notebook first to see if it works and so on. A massive advantage is that you don't need to ship 2 gigs of torch to your platform that is just doing inference (the raspi)
You can even make it simpler and just compile and version control the binary you get from compiling the torch model. You can do that if you're certain it won't change. Your Dockerfile becomes easier then.
Okay, and also i have model in .pt format ( it's like 20k img model ) which was trained by someone else
Isn't that any problem for converting it to ONNX format?
What do you mean with 20k img model?
I mean that more than 20k images were used in the training process for this model
Just fyi, the model doesn't contain the images. Example: If you have 1 single neuron and run 10000k images on it it'll still be small.
Ye ye i know that
So why did you mention it? Maybe I'm missing something.
but I'm just asking how to convert that .pt model to an ONNX supported format?
It's in the link I sent you
oh ye, thx
If you don't mind, I'll stop answering. I gave you a lot of information to digest and I think you should read the links, some other docs, let it sink in etc. and if you have more questions afterwards just tag me
Okay, im really glad for your support
So I started working on a python ML project few weeks ago and I don't know Python, ML or Pandas previously. I've got some questions regarding my dataframe structure. Is this a channel I could ask this stuff in?
So I'm doing ML regarding stock companies and their quarterly results. So I got 4 rows of quarterly results for a company X which I put into a dataframe, and then I use multiindex to store all these rows together. My reasoning was that if I flatten the dataframe then the ML model won't be able to 'identify' the 4 rows belonging to a specific company.
So '181', '356' and '59' are company IDs here
Will this work, or am I messing it all up?
Basically I'm attempting to make a pandas Panel, I think (but that's been deprecated)
yeah
yeah, but I mean more specifically that I'm using multiindex and 'grouping' them (0-4) as you see here
but if above looks fine/normal to you then I guess my above approach is fine
what missing indices?
Haven't been following the conversation but ordinal encoding is really bad
I'd say always do one hot unless you're doing a decision tree and even when you are it's still dangerous
Maybe in NLP but in tabular datasets is not good
Imagine you have small, medium and large and you do an ordinal encoding: you're saying large is 3x small
That's typically the danger
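To make the "large is 3x small" point concrete, here is a tiny comparison in plain Python. The 1/2/3 codes are one common ordinal scheme, picked for illustration.

```python
sizes = ["small", "medium", "large"]

# Ordinal encoding: one number per category, in order.
ordinal = {name: i + 1 for i, name in enumerate(sizes)}

# One-hot encoding: one indicator column per category.
one_hot = {
    name: [1 if j == i else 0 for j in range(len(sizes))]
    for i, name in enumerate(sizes)
}


def distance(u, v):
    """Euclidean distance between two encoded values."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Ordinal codes carry magnitudes the model can pick up on.
print(ordinal["large"] / ordinal["small"])

# One-hot vectors are all equally far apart, so no ordering is implied.
print(distance(one_hot["small"], one_hot["medium"]))
print(distance(one_hot["small"], one_hot["large"]))
```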
I used to do it for high cardinality data (like postal codes) but it's no longer necessary as we now have target encoding in sklearn https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
I can't use one-hot encoding here since the number of dimensions would grow beyond what's reasonable, as every 4 rows is a unique company
but since its always 4 rows for each company, I figured there wouldn't be an issue with hierarchy
since it's all 'balanced'
Target encode them then
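The core idea of target encoding, as a deliberately naive sketch: replace each category with the mean of the target over the rows in that category, so a high-cardinality column becomes a single numeric column. The postal codes and targets are made up. sklearn's TargetEncoder goes further than this (it shrinks the per-category means toward the global mean and uses internal cross fitting to limit leakage), so treat this only as the intuition.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Made-up example: a high-cardinality column and a binary target.
postal_codes = ["1000", "1000", "2000", "2000", "2000", "3000"]
target = [1, 0, 1, 1, 1, 0]

encoding = target_encode(postal_codes, target)
encoded_column = [encoding[code] for code in postal_codes]
print(encoding)
print(encoded_column)
```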
Is it possible to create a conversational AI that diagnoses a patient's mental illness? The input would be the patient's speech converted to text, plus their facial expressions recognized for emotions.
Yeah, that's the reason. Given enough data it should sort itself out but it's a last resort approach imo
I'm being pedantic though
Sorry for spamming, but I'm trying to work out if I'm messing up when I merge my different dataframes. I store like 5 companies in 'dfs' (4 rows each, so 20 rows total), enumerate over them and put them in a hash table (dict) with 'i' being the key. And then concat().
Since you were talking about 'missing indices', I dunno if my mistake here is that I need to for-loop over each of the 4 rows as well and assign the company ID ('i') to them, or if the behaviour in the screenshots is all correct
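For what it's worth, pd.concat on a dict already uses the keys as the outer index level, so there is no need to loop over the 4 rows and assign the ID by hand. A small sketch with made-up numbers; the "revenue" column and the "company"/"quarter" level names are placeholders, and 3 companies stand in for the real data.

```python
import pandas as pd

# Three made-up companies with 4 quarterly rows each.
dfs = [
    pd.DataFrame({"revenue": [10 * (c + 1) + q for q in range(4)]})
    for c in range(3)
]

# Dict keys become the outer level of a MultiIndex on concat.
frames = {i: df for i, df in enumerate(dfs)}
combined = pd.concat(frames).rename_axis(["company", "quarter"])
print(combined)

# All 4 rows of one company can be selected by its key.
print(combined.loc[1])

# Flattening keeps the grouping as ordinary columns instead of index levels.
flat = combined.reset_index()
print(flat.head())
```

After reset_index the company ID is still attached to every row, so dropping the MultiIndex by itself does not lose the grouping; whether a given model makes use of that column is a separate question.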
I am as well, but I'd say benchmark it for tabular data. It can easily be a hyperparameter.
Just empirical evidence showed me it's usually bad
The whole reason I'm doing all of above is because I have data over time (quarters) and I dunno how else to group it together for my ML code. I can't do avg or mean and just 1 row since that doesn't capture change over time. I could skip the multiindex part (just flatten the big dataframe) but then I'm worried the ML code won't be able to identify that each 4 rows are 'tied together'
I'm new to all of this so dunno best practices
whats NLP?
Yeah I feel like these are dying art
so what does being data scientist mean?
just someone who can run some data through an algo to generate meaningful charts?
what do i need to learn so i can call myself data scientist
to get good at machines learning algos i think i'll need alot of maths
https://www.youtube.com/watch?v=b7NnMZPNIXA has anyone watched this
No, tabular data is a lot more effort and the results are very variable so it seems like it's very much in the trough of disillusionment
Above is pre encoding/split. I'm doing SQL queries to get 4 rows for each company and then merging it together into a multiindex
its multiindex
pandas panel thing
I think
I avoid multi index stuff etc. as the plague
Yeah I could flatten it all, but then row 5 (currently '0 VTVT') will be '5 VTVT'
so I lose the 0-4 grouping, and I dunno the impact of that for my ML learning code
As much as I dislike Pandas, you got to learn it
I know you do
I'm mostly referring to the other user
I'll run a few tests without multiindex, see what happens
I'm a big Polars fan ofc but what bugs me are breaking changes
But pandas has a fair share of those as well tbh
Very true
Heya
when we have a matrix
M=np.arange(1,11).reshape(5,2)
what does M[2] mean here?
what would M[1][1] be like and why?
differences between attribute and future?
M[2] would return [5,6]
M[2] is the third row of the matrix
Guys do you think it is going to be useful to learn cuda for machine learning jobs?
M[1][1] means you're first selecting the second row and then from that row, you're selecting the second element. So M[1][1] would be 4.
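The same thing as something you can run and poke at:

```python
import numpy as np

M = np.arange(1, 11).reshape(5, 2)  # values 1..10 laid out as 5 rows, 2 columns
print(M)

third_row = M[2]    # indexing starts at 0, so this is the third row
element = M[1][1]   # second row first, then its second element
same = M[1, 1]      # equivalent single-step indexing

print(third_row)
print(element, same)
```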
You mean a feature? They're synonyms.
oh
there is no consistency in what a "data scientist" is or does. it varies between companies, and even within companies.
insofar as one can be "good at an algorithm", yes.
Learning how to use CUDA GPUs with your models would be more useful, learning how CUDA works, not so much.
usually "data scientist" is some combination of "machine learning" and "applied statistics". senior DS tends to be involved with long-term product strategy and work directly with senior-level business stakeholders. usually there is an expectation that you can operate with a reasonable level of independence, cleaning and sometimes even gathering your own data
the "data scientist" job title itself has come to indicate a kind of generalist jack-of-all-trades role, analogous to "full-stack developer" in software development. bigger organizations tend to have more specialized titles.
no one person can be good at all of it. so typically industry data scientists end up being really good at a few things and less good at other things, and self-select into jobs that make sense for their skills, and also tend to work on upskilling throughout their careers
understated but important job characteristics include writing/communication skills and project planning
Applied statistics part is big no-no don't think I'll enjoy that for long.
Thanks for extensive explanation it did resolve alot of queries for me.
Is there any data science related field which has more focus on the programming/coding part than others?
data science is literally applied statistics
machine learning is just statistics wearing a hood
if you want to focus on programming, then do normal programming | backend development
if anything maybe model deployment/devops/mlops might be closer to what you are thinking, but that still depends on the place
I have actually spent alot of time into this but no jobs/ scope for frontend and machine learning jobs only
What do people do machine learning engineering jobs and LLMs
Ig stuff will make sense soon enough thanks everyone for answering
Halfway through this
https://youtu.be/b7NnMZPNIXA?si=QaEgK6Mr7vzGSHW-
Doesn't seem very complicated so far, hopefully this is mid lvl
Any thoughts
Is there any data science related field which has more focus on the programming/coding part than others?
Data engineering.
ML engineering if you like math and numerical computing.
Thanks
@past meteor so I converted my .pt model to an ONNX model (using the following command: yolo export model=<my_model> format=onnx imgsz=640,640) and now I'm having trouble reading the correct values. Before, my script for getting the correct box positions was:
def detect_closest_license_plate(image, model: YOLO) -> ClosestPlate:
    prediction = model(image)[0]
    camera_center_x, camera_center_y = image.shape[1] // 2, image.shape[0] // 2
    closest_plate: ClosestPlate = None
    closest_distance = float('inf')
    for license_plate in prediction.boxes.data.tolist():  # each row is [x1, y1, x2, y2, conf, cls]
        x1, y1, x2, y2, conf, cls = license_plate  # Extract bounding box coordinates, confidence and class
        plate_center_x, plate_center_y = (x1 + x2) // 2, (y1 + y2) // 2
        distance = np.sqrt((plate_center_x - camera_center_x)**2 + (plate_center_y - camera_center_y)**2)
        if distance < closest_distance:
            closest_plate = ClosestPlate.from_dict({
                'bbox': (x1, y1, x2, y2),
                'confidence': conf,
                'class_': cls,
                'plate_center': (plate_center_x, plate_center_y),
                'distance_to_camera': distance
            })
            closest_distance = distance
    return closest_plate
Now I'm getting my predictions this way: pred = session.run(None, {"images": to_numpy(proccess_image("./test_photos/test.jpg"))})
And I can't find the corresponding values
That's the output:
[array([[[ 22.407, 27.52, 37.087, ..., 560.26, 582.73, 612.35],
[ 6.8977, 6.7349, 5.8882, ..., 628.67, 628.4, 626.59],
[ 13.573, 12.881, 11.815, ..., 275.83, 318.33, 388.62],
[ 6.5565e-06, 3.8147e-06, 4.2707e-05, ..., 9.4175e-06, 5.126e-06, 1.8477e-06]]], dtype=float32)]
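A hedged sketch of how that raw array could be decoded, assuming the usual Ultralytics YOLOv8 ONNX layout: shape (1, 4 + num_classes, N), where each column is one candidate box given as centre x, centre y, width, height followed by per-class scores, all in the 640x640 input's pixel coordinates. If the real output is laid out differently, the indexing needs adjusting. The function name, the 0.25 threshold and the fake array are made up, non-maximum suppression is skipped, and rescaling to the original image size is left out.

```python
import numpy as np


def closest_box(output, image_w, image_h, conf_threshold=0.25):
    """Pick the confident box whose centre is nearest the image centre.

    `output` is assumed to have shape (1, 4 + num_classes, N).
    """
    preds = output[0].T                      # -> (N, 4 + num_classes)
    scores = preds[:, 4:].max(axis=1)        # best class score per box
    classes = preds[:, 4:].argmax(axis=1)    # index of that class
    keep = scores >= conf_threshold
    if not keep.any():
        return None
    preds, scores, classes = preds[keep], scores[keep], classes[keep]

    cx, cy, w, h = preds[:, 0], preds[:, 1], preds[:, 2], preds[:, 3]
    distance = np.hypot(cx - image_w / 2, cy - image_h / 2)
    i = int(distance.argmin())
    return {
        # Convert centre/size to corner coordinates (x1, y1, x2, y2).
        "bbox": (
            float(cx[i] - w[i] / 2), float(cy[i] - h[i] / 2),
            float(cx[i] + w[i] / 2), float(cy[i] + h[i] / 2),
        ),
        "confidence": float(scores[i]),
        "class_": int(classes[i]),
        "plate_center": (float(cx[i]), float(cy[i])),
        "distance_to_camera": float(distance[i]),
    }


# Fake single-class output with 3 candidate boxes, just to exercise the code.
fake = np.array([[
    [320.0, 100.0, 500.0],   # centre x
    [330.0, 100.0, 500.0],   # centre y
    [80.0, 50.0, 60.0],      # width
    [40.0, 20.0, 30.0],      # height
    [0.90, 0.80, 0.01],      # class score
]], dtype=np.float32)

result = closest_box(fake, 640, 640)
print(result)
```

With the session from the message above, the call would be closest_box(pred[0], 640, 640), since session.run returns a list of arrays.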
How would you guys go about publishing ML code that needs a very large dataset for it to work? Compress it using parquet and unpack it via code? The code is on GitHub, I'm more worried about the dataset
I wouldn't immediately know what's up, it's hard to debug if I'm not running the code
Does your data need to be public too?
Can't you just train, persist the model and then use it?
Depends what you are doing.
At work we store data via safetensors https://huggingface.co/docs/safetensors/index and use DVC https://dvc.org/ for managing the large files with Git and S3.
If you're not going to be pulling often, you can use Git LFS with GitHub as well, but it gets expensive quickly on bigger datasets (i.e. objects > 100MB)
So in conclusion:
• C/C++ because they're the mother languages, and by learning them (not mastering, just well enough to be able to read and mimic the source code) I can make speed/performance-critical applications and understand code better in general
• C/C++, because it's so old that almost everywhere you look is C/C++ and not Rust; not that you can't write the same program in Rust, C/C++ just has more source code written in it
• I just kinda need to be able to read C/C++ for CUDA
Am I right?
is DVC worth looking into? Currently I do stupid things like manually versioning my data and keeping several versions of critical parts of my code I change because I want/need absolute reproducibility 🥴
I'm using MLflow + optuna + tensorboard + dagster
I could add more tooling, if it's worth it
So yes and no, I both love and hate it.
If you have a Git LFS setup, then use Git LFS; it is just so much smoother in terms of configuration and pulling changes.
Otherwise if you have big objects and need to store on S3, then DVC is great for that, but it is more manual than git LFS and has a pretty awful caching system that often requires you to delete the entire local cache in order to pull new files from the remote.
But it is the only modern tool that effectively supports big objects tracked by Git, on S3 or other storage, with minimal setup
So I guess it is a tradeoff
Would you use it for tabular data?
That I'm just storing in postgres tbf
All object related stuff is in minio DB, we could use DVC there
I would use it just for managing dataset files or big binary files with git only
The repro stuff, param tracking, etc... is completely useless IMO and you'll spend more time debugging why DVC isn't working than actually doing the runs or the code
good to know, those are all new features as well
I looked into DVC a couple of years ago and that wasn't there afaik
That being said, my experience with PyTorch Lightning + Neptune has been excellent for tracking artefacts, metrics, etc...
safetensors is new to me
btw guys i got a question,
TensorFlow or Pytorch
i'm just about to start learning ML and i cant decide which one's better
Pytorch
why?
Yup, same idea with Mlflow
Simply because it's more common nowadays. using the most popular tool has a lot of merit
i actually think DVC is great for managing "raw" data within your project. much better than a makefile
Safetensors is excellent, for us at least. Since we train models via Python but deploy via Rust, it is often very useful to have that simple-to-use API which both langs can use
and store data efficiently and load quickly
Yea
that's not our use case, but having anything standardized is better than DIYing something in json or npz or arrow
Pretty satisfied with my mlflow + dagster + lightning + optuna setup but repro on the data side is pretty bad for me
I think I'll just start logging the git commit with my experiments
also i think DVC is really useful for sharing data within a team, dvc push/pull specifically
Yeah that is fair, the biggest gain IMO is the setup is super simple in all the langs, unlike arrow/parquet or other which often has a bunch of extra work around it
Then I can freely change things but I can get repro easily by checking out at that commit
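A minimal sketch of that idea, assuming the experiment is launched from inside a Git working tree. It records the current commit hash plus a content hash of the data file, so the data side is pinned too. The helper names (`current_git_commit`, `file_sha256`, `run_metadata`) are made up for illustration; the resulting dict could be handed to whatever tracker is in use, for example as run tags.

```python
import hashlib
import subprocess

def current_git_commit():
    """Return the HEAD commit hash, or None when Git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return result.stdout.strip()

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def run_metadata(data_path):
    """Collect the identifiers needed to get back to this exact code + data state."""
    return {
        "git_commit": current_git_commit(),
        "data_sha256": file_sha256(data_path),
    }
```

Note that the commit hash only pins the code if there are no uncommitted changes at the time of the run, so it is worth committing (or at least checking `git status`) before launching.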
what's the value of neptune vs. mlflow or any of the 1e12 other options out there right now?
yeah that is what I mean by tracking via Git, but it has a bad habit of needing the local cache cleared if a file changes and you need to pull
yeah that's the whole point of DVC, it's great
ah, the cache... yeah.
Then I'll have to look closely at DVC tomorrow
I always MacGyver until it gets bad and it's getting bad right now
I like MLFlow, but equally Neptune is so simple to setup, and the free tier is actually super awesome, 200GB of storage for free is a lot for most people
UI is great, integrations are excellent, etc...
Oh yeah, I think a big caveat is I'm doing things on-prem
that's why I went with MLflow I think
Makes sense
there's also Dud, which is meant to be a kind of stripped-down DVC, but start with DVC first IMO
for completeness: https://kevin-hanselman.github.io/dud/
Dud # Website | Install | Getting Started | Source Code
Dud is a lightweight tool for versioning data alongside source code and building data pipelines. In practice, Dud extends many of the benefits of source control to large binary data.
With Dud, you can commit, checkout, fetch, and push large files and directories with a simple command line i...
that's nice! but it's proprietary saas right? just to be clear
(compared to mlflow for example)
snowflake also recently rolled out a model registry thing, we might start using that to deploy models directly in-warehouse
not sure if it has any useful tracking/versioning features though
proprietary saas
[...]
snowflake
Yes, very much just SAAS
yeah but we already pay for snowflake
buying a new product would be harder with our current finances
snowflake is the one cloud tool I'm really not familiar with
is it basically just like big query
but not big query?
we've already significantly reduced our snowflake usage, to the point where it's an issue for our contract renewal
typical separation of storage and compute, data in buckets, snowflake compute puts a view over it and you can query it with SQL and pay through your nose?
"yes" in that it's a cloud data warehouse built around SQL and column-oriented analytics workloads
Meant SQL there*
i think that's how it works internally, but the pricing is a lot more opaque than that. i think at most you can choose which of the big 3 clouds to run on, and that's it
they also now have a spark-like python interface called "snowpark"
they're the other way around, they want to be a warehouse where you can do everything in-warehouse. some "lake" features too though
you can even now deploy arbitrary code in containers. so you can run arbitrary code directly in-warehouse and pay for it with a uniform compute credit (instead of slinging data back and forth between the warehouse and e.g. ECS)
I feel like snowflake is pretty lakehouse-ish as well right? Don't you land data into snowflake and not into S3/Azure blob first?
And then you use snowflake's compute to transform it right inside the "warehouse"/lake/... whatever it is
snowflake supports both
Ok, then my intuition of what it was was correct
it has stages, which are just blob stores like S3. but you can mount an S3 bucket transparently as a stage
so it's blob storage + OLAP distributed-ish + external integrations + a spark-like interface if you want it
basically like databricks' lakehouse yeah
is "databricks lakehouse" a product? or are you just talking about the pattern of building a lakehouse around databricks and dbfs?
i haven't used databricks since 2020
Problem I have with Snowflake is the vendor lock in feels worse than AWS tbh
not a product, it's just delta + databricks + marketing
is it any worse than any other data warehouse though?
Is snowflake "serverless"?
yeah it's pure saas
not self-hostable and completely opaque compute (priced in "credits")
maybe it's because we're using airflow + dbt + containers but i really don't feel that badly locked-in. we will be much more locked-in once we are deploying compute directly in-warehouse, but at least we already have a non-locked-in solution that we won't ever fully get rid of.
I think it is about in line with BigQuery lock-in wise, but without the other GCP service support and no bandwidth costs.
Athena I think is pretty easy to replace, since it is literally just re-skinned Trino, which tbh if we re-did our data lake now, I'd probably go with Trino as our main layer, so at least changing the backend didn't change the queries
SaaS != serverless
Can you pay for a standard amount of compute that stays on during business hours with absolutely transparent pricing you then scale down during the night? Give or take backpressure
Or is the only model pay-as-you-use?
IIRC it is a price per storage GB used, price per data scanned, etc...
Which I think all the big warehouses use really
Shout out to Sneller BTW for the absolutely insane engine which unfortunately seems to have died https://github.com/SnellerInc/sneller
our warehouse is basically always-on anyway so it's easy for us to estimate pricing
Maybe I'm paranoid but I wouldn't feel comfy buying into a service that doesn't allow me to pick a non serverless model
That's the issue, I feel like pay-as-you-use is nearly always more expensive than always-on
So if an always-on version exists and your thing is ... always on, you can switch and save money
I think it is a trade-off. I think a lot of people prefer having the serverless setup where it is cheaper initially, but then get bitten later on as their scale grows
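To put rough numbers on that, here is a break-even sketch. All prices are invented placeholders, not quotes from any provider. It compares a flat always-on monthly price with a per-hour pay-as-you-use rate and finds the utilization above which the flat price is cheaper.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

def monthly_cost_on_demand(price_per_hour, busy_hours):
    """Pay-as-you-use: only the hours actually used are billed."""
    return price_per_hour * busy_hours

def break_even_utilization(flat_monthly_price, on_demand_price_per_hour):
    """Fraction of the month above which the flat always-on price is cheaper."""
    return flat_monthly_price / (on_demand_price_per_hour * HOURS_PER_MONTH)

flat = 1000.0    # hypothetical flat monthly price for an always-on instance
on_demand = 2.5  # hypothetical pay-as-you-use price per busy hour

threshold = break_even_utilization(flat, on_demand)
always_busy_cost = monthly_cost_on_demand(on_demand, HOURS_PER_MONTH)
```

With these made-up numbers the break-even sits at roughly 55% utilization, and a workload that is busy all month would pay 1825 on demand versus 1000 flat. Real pricing has more moving parts (storage, data scanned, negotiated credits), so this only illustrates the shape of the trade-off.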
But people like the convenience
yeah but then you negotiate a contract for credits + overage pricing if you know you're going to be always-on
sure, I'd 100 % start serverless
it's all very enterprise-ey
but on say Azure, many services let you switch transparently
Like, you have a serverless SQL server (absolutely ridiculous naming) with a managed counterpart
the difference is that you can't blow your cost into interstellar orbit as easily as you can on google or AWS
There's serverless databricks versus de-facto managed, where you switch your cluster on and off "manually" etc.
larger mean, lower variance -- that kind of thing
I think the point I'm trying to make is that gcp, aws and azure definitely have versions with transparent pricing in the "managed" section of their offering
And they're still easier than using EC2/Azure VM
Yes, it's going to be part of a scientific article so I need to make the datasets used to train the models public, and they're really just cherrypicks of multiple datasets
yeah, i think i get it. my response is "snowflake doesn't have that but i also am not aware of anyone having issues with it beyond it just being expensive overall"
Yes, an important detail here is that C is more than a language at this point. It also acts as the interface between languages and the operating system. So for any language to be able to do anything, it needs to pass through C's stuff at some point. This means that if you know both Python and C, you can get access to almost all libraries/utilities, and also make fast things that you can use in Python (bind some C library that does not have a Python module yet, or your own private stuff). C++ is not as necessary as C, but it does make programming things a lot easier, builds on C, and so it's used everywhere (even more access to more libraries (but even just knowing C will let you read most of it)). C is also not going anywhere any time soon, and changes too slowly for you to need to keep up with its features like with other languages.
last addition to this is that I believe cloud marketing has won in convincing us pay-as-you-use is the only viable option so people don't complain/have issues with it
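As a small illustration of C acting as the common interface, Python's standard `ctypes` module can call a function from the C standard library directly, with no wrapper module in between. This is a sketch assuming a Unix-like system where the C library can be found by `ctypes.util.find_library("c")` or is already loaded into the process; on Windows the lookup differs.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library (assumption: Unix-like system).
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function with a bytes object, which ctypes passes as a char pointer.
length = libc.strlen(b"hello from C")
```

The same pattern (load a shared library, declare argument and return types, call) is how a C library without a Python module can be bound by hand, which is why being able to read C headers pays off.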
Great, will check it out! I'm not going to be pulling frequently, just pushing
tryna do Univariate time series on a JDIA dataset, how can i determine which variable ill be using univariate on? do i just wing it and use any one
i think in the case of snowflake people use it in spite of its pricing
for example we also use aiven's managed timescaledb and that's just a flat price per month
what is JDIA?
normally when doing a data project you have an actual real-world objective in mind, so you do whatever accomplishes that goal
True, so since almost everything needs to be passed by C and knowing it will surely deepen my knowledge, its a good idea to learn this but as you said, C is the main course but C++ is more like a DLC that is optional to learn
Thanks for the insight, mate
Aside from the pricing, is it that good?
It does everything we need it to do and they keep adding more features that we find useful. So it's certainly good enough
Sometimes I find myself fighting the query planner, wishing for proper database indexes. That's my only practical complaint as an individual user (as opposed to an administrator)
dow jones ind average, its just stock prices, just like NASDAQ
honestly i dont have a real-world objective in mind, im just trying to demonstrate utilizing univariate with different models
oh, DJIA
you're asking about whether to use open, close, or something else?
I think normally people use closing prices but it probably doesn't matter much for a simple univariate analysis
oh lol sorry i mispelt the acronym, but yeah from what ive seen alot of people use Close so i went with Close anyways
I agree with @hollow escarp
How should I debug it to get those values?
?
Lmao
whats the best type of LSTM model? apparently theres 5, Vanilla, Stacked, Bidirection, CNN-LSTM, ConvLSTM
or is it all situational
Situational
That's like saying which one is the best, apple, banana or orange. when I want to keep the doctor away, it's apple, when I want something sour, it's orange and when I need a quick snack, it's banana.
weird how vanilla LSTM performs way shittier than MLPs
