#data-science-and-ml

1 messages · Page 118 of 1

sturdy kiln
#

thats probably top of the line commercial GPUs

neat bluff
#

Wait out for 4090ti ducky_devil

#

It's gonna cost a house and a car

#

And also half of your soul

#

Hm. I would love to help You but I have no clue what might be wrong

sturdy kiln
#

does it also require a sacrificial lamb and a drop of a virgin's blood

neat bluff
sturdy kiln
#

thats more likely for the electric bill rather than the cost of the GPU itself lol

neat bluff
#

True dat^

sturdy kiln
#

i wonder for any of the 40 series how long the electricity bill outweighs the GPU itself lol

#

probably like a month of constant use?

neat bluff
#

You would have to use it to the maximum for a year probably

#

That's my guess

#

home refrigerator's power consumption is typically between 300 to 800 watts of electricity.

#

That's when cooling of course

tidal bough
#

they have a tdp of only like 300-400W, right? 400W×(13cent/(kW×h)) -> $/year is 455$/year (of constant usage), where I googled us electricity prices as 13 cents/kwh

neat bluff
#

Nvidia RTX 4090 has an official power draw of 450W

#

So not even a year, almost 3 in fact
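
The arithmetic in the messages above can be checked with a few lines of Python. This is only a rough sketch: the 13 cents/kWh rate and the 400 W and 450 W figures come from the chat, while the $1,599 card price is an assumption (the RTX 4090 launch MSRP) and `yearly_cost` is a name made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours of constant use


def yearly_cost(watts: float, usd_per_kwh: float = 0.13) -> float:
    """Electricity cost in USD for running a device non-stop for one year."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


cost_400w = yearly_cost(400)  # the 400 W estimate from the chat
cost_450w = yearly_cost(450)  # the 450 W official figure quoted for the RTX 4090

ASSUMED_GPU_PRICE_USD = 1599  # assumption: launch MSRP, not stated in the chat
years_to_match_price = ASSUMED_GPU_PRICE_USD / cost_450w

print(f"400 W: ${cost_400w:.2f}/year")
print(f"450 W: ${cost_450w:.2f}/year")
print(f"years until electricity matches the card price: {years_to_match_price:.1f}")
```

Under these assumptions the electricity bill only catches up with the card after roughly three years of constant full load, which lines up with the "almost 3" estimate.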

sturdy kiln
#

lol thats a very wrong forecast right there if i ever see one

#

hilarious how ARIMA(1,1,1) also gives me a flat line

#

like it just refuses to do anything

neat bluff
#

Btw how is ur training set looking like? Is it divided by years or months/days? Cuz training it on years perspective might give a huge false positive

sturdy kiln
#

its on months

#

or so i think

#

because i did dataDF.index = dataDF.index.to_period('M') to change the index to Month but i actually dont know if its what it does lol

#

grid search ftw

neat bluff
#

Now it actually matches in terms of position

sturdy kiln
#

im curious because this resource im following limited the grid search of the p,q,d to 11

#

can i go higher to get better results

#

or does it cause diminishing results

neat bluff
#

I am afraid I don't know what You are talking about

sturdy kiln
#

ARIMA takes 3 order arguments (p, d, q), different models give different results, you do grid search by fitting each ARIMA model and evaluating, and determining the best with the best metric (e.g. lowest RMSE)

#

the grid search i did limited the value from 0 to 11

#

so it can go from (0,0,0) to (11,11,11)

#

hence this thing

#

actually it wasnt 11

#

it was only p

#

q and d was limited to 4, so technically the max is (11,4,4)

neat bluff
#

Hence the repetitive result? This is hella interesting but I see that I've got a TON of things to learn

sturdy kiln
#

its not repetitive

#

its taking the model arguments, example (1,2,3), create a model and fit it and evaluate the results

#

do it on the next

#

and when finished compare all models, and determine which one is the best
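
The loop described above might look roughly like this. It is a generic sketch, not the code from the resource being followed: `grid_search` and `toy_score` are hypothetical names, and the toy scoring function only stands in for "fit an ARIMA model with this order and return its error". In a real run that callable would fit the model (for example with statsmodels) and return the RMSE on held-out data. The ranges assume "limited to 11" and "limited to 4" mean 0 through 11 and 0 through 4.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Order = Tuple[int, int, int]


def grid_search(
    score: Callable[[Order], float],
    p_values: Iterable[int],
    d_values: Iterable[int],
    q_values: Iterable[int],
) -> Tuple[Order, float]:
    """Try every (p, d, q) combination and keep the one with the lowest score."""
    best_order, best_score = None, float("inf")
    for order in product(p_values, d_values, q_values):
        try:
            current = score(order)
        except Exception:
            # Some orders fail to fit; skip them instead of aborting the search.
            continue
        if current < best_score:
            best_order, best_score = order, current
    if best_order is None:
        raise ValueError("no order could be scored")
    return best_order, best_score


def toy_score(order: Order) -> float:
    """Made-up stand-in for 'fit the model and return its error'."""
    p, d, q = order
    return (p - 9) ** 2 + (d - 2) ** 2 + q ** 2


best_order, best_score = grid_search(toy_score, range(12), range(5), range(5))
print(best_order, best_score)
```

The number of fits grows multiplicatively with the ranges (12 × 5 × 5 = 300 here), so widening the grid gets expensive quickly.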

neat bluff
#

Oh, that's what You mean. Yeah I think I get it now

sturdy kiln
#

for this dataset, it got (9,2,0) since it has the lowest MSE value out of all

#

its a very interesting regressional technique used on time series stuff

#

this is the first time im dealing with ARIMA lol

neat bluff
#

This is the first time I am even looking at such a thing

#

Although it seems super cool I think I will stick to my NLP shit

sturdy kiln
#

its not even deep learning, its literally just regressional analysis

#

although im curious how i can use DL on time series data lol

neat bluff
#

No clue, but good luck 👌🏻

peak ridge
#

hey guys

#

ever worked with langchain?

serene scaffold
# peak ridge ever worked with langchain?

please always ask your actual question. don't ask to ask. if you have a question about langchain, assume that someone can help and ask a question that person could start answering.

neat bluff
peak ridge
#

so for my product what im doing rn is m using
ChatGPT-AssistantsAPI for responses and handling context/history for a convo
everythings good working great,

so i want to pass user's data like his workspace data from my db via restapi call to gpt via RAG
so it could give biased answer considering the workspace data + it's own llm

My problem:
FileNotFoundError: [Errno 2] No such file or directory
actually i have JSON(rest_api) responses because in Langchain there's no option for SQL RAG,

im trying to use JSONLoader
but it takes a required argument called file_path
i dont rly have a filepath

serene scaffold
#

!code

arctic wedgeBOT
#
Formatting code on Discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

serene scaffold
#

Please do not post screenshots of code.

peak ridge
#
```py
import requests
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import JSONLoader


API_URL = "http://127.0.0.1:8000/api/workspaces/"

def get_workspace():
    response = requests.get(API_URL, auth=("aryanjainak@gmail.com","Iamreal@123"))
    if response.status_code == 200:
        return response.json()
    else:
        print("Failed to fetch data:", response.status_code)
        return None

def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = JSONLoader(
    file_path=str(workspace_data),
    jq_schema='.messages[].content',
    )
    data = loader.load()
    embeddings = embeddings_model.embed_documents(data)
    vectorstore = Chroma.from_documents(embeddings, embedding=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    docs = retriever.get_relevant_documents("What is the name of my workspace?")```
#

basically ik what the issue is
but what even can i do
it's a required argument, file_path

neat bluff
#

Leaking your password to such a channel ain't the best idea

serene scaffold
#

yeah, you should change your password for that API, since a bot has probably stolen it.

serene scaffold
#

once you've changed your password for that API, post the whole error message that you're getting, starting from Traceback.

peak ridge
#

it's a localhost

#

and all passwords are of dummy db

neat bluff
#

Yeah it's local host, but You posted your email as well. From Your reaction I supposed it's not a valid pass. It was just a friendly reminder to keep it in mind for the next time

serene scaffold
#

@peak ridge in addition to the whole error message, can you also post all the import statements in that file?

peak ridge
# serene scaffold once you've changed your password for that API, post the whole error message tha...

FileNotFoundError: [Errno 2] No such file or directory: "PycharmProjects/Kleenestar/src/backend/[{'id': 6, 'root_user': {'id': 1, 'first_name': 'xyz', 'email': 'aryanjainak@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}, 'users': [{'id': 1, 'first_name': 'xyz', 'email': '2342@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}], 'business_name': 'Xyz', 'website_url': 'https:/www.xyz.com', 'industry': None, 'created_at': '2024-04-23T04:37:55.983893+05:30'}]"

serene scaffold
# peak ridge ```FileNotFoundError: [Errno 2] No such file or directory: "PycharmProjects/Klee...

given this data structure

[{'id': 6, 'root_user': {'id': 1, 'first_name': 'xyz', 'email': 'aryanjainak@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}, 'users': [{'id': 1, 'first_name': 'xyz', 'email': '2342@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http:/127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}], 'business_name': 'Xyz', 'website_url': 'https:/www.xyz.com', 'industry': None, 'created_at': '2024-04-23T04:37:55.983893+05:30'}]

is there anything here that should be the completion of "PycharmProjects/Kleenestar/src/backend/ ?

neat bluff
#

JSON loader is requiring filepath because it's supposed to load the file from the harddrive

#

json.loads() is probably gonna solve your issue

#

And passing the actual data to it directly

peak ridge
peak ridge
#
```
Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
     Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),```
something like this
neat bluff
#

Then just save the data from API as .txt/.json
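
Saving the API response to disk, as suggested above, could be done with the standard library alone. This is a sketch, assuming the loader only needs a path to a valid JSON file; `save_as_json_file` is a hypothetical helper and the sample record is a shortened, made-up stand-in for the real response.

```python
import json
import tempfile
from pathlib import Path


def save_as_json_file(data) -> Path:
    """Write already-parsed API data to a temporary .json file and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False, encoding="utf-8"
    ) as handle:
        json.dump(data, handle)  # writes valid JSON: double quotes, true/false/null
        return Path(handle.name)


# Made-up stand-in for what response.json() returns: a list of dicts.
workspace_data = [{"id": 6, "business_name": "Xyz", "industry": None}]

json_path = save_as_json_file(workspace_data)
round_tripped = json.loads(json_path.read_text(encoding="utf-8"))
print(json_path, round_tripped)
json_path.unlink()  # clean up the temporary file
```

The resulting path is something a file-based loader could take as `file_path`. Whether a jq schema such as `.messages[].content` matches the data is a separate question; the workspace records shown in the chat have no `messages` key.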

serene scaffold
peak ridge
peak ridge
peak ridge
#

wait i'll show u, just a sec.

serene scaffold
peak ridge
#
```py
>>> from channels.rag import get_workspace
>>> get_workspace()
[{'id': 6, 'root_user': {'id': 1, 'first_name': 'xyz', 'email': 'aryanjainak@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http://127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}, 'users': [{'id': 1, 'first_name': 'xyz', 'email': '123@gmail.com', 'last_name': 'Jain', 'is_active': True, 'profile': {'id': 1, 'user': 1, 'avatar': 'http://127.0.0.1:8000/media/default.jpeg', 'country': None, 'phone_number': None, 'referral_code': '865083', 'total_referrals': 0}}], 'business_name': 'Xyz', 'website_url': 'https://www.xyz.com', 'industry': None, 'created_at': '2024-04-23T04:37:55.983893+05:30'}]```
#

cools right?
json response

neat bluff
#

It's a dictionary already

serene scaffold
neat bluff
peak ridge
#

TypeError: expected str, bytes or os.PathLike object, not list
if i do this

```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = JSONLoader(
    file_path=workspace_data,
    jq_schema='.messages[].content',
    )
    data = loader.load()
    print(data)```
serene scaffold
peak ridge
#

is my approach very terrible? @serene scaffold @neat bluff
isnt how u guys do rag

#

im from Django background
web dev databases bg

#

im learning all this for our startup

#

just a early startup

neat bluff
neat bluff
#

My best guess would be to try. What's the worst that can happen.

peak ridge
#

actually i did, but imma do again

#
```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = json.loads(workspace_data)
    print(loader,'xyz')
    data = loader.load()
    print(data)```
TypeError: the JSON object must be str, bytes or bytearray, not list
neat bluff
#

Turn it to a string now

peak ridge
#

didnt even got 1 print
so like it even didnt work

neat bluff
#

As you did earlier

peak ridge
#

ohh

#

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)

neat bluff
#

So it worked

peak ridge
#

did it?

#

it didnt printed the result tho?

neat bluff
#

But as I said already - JSON doesn't accept the ' as value surroundings (i have no clue what they are called)

[{"id": 6, "root_user": {"id": 1, "first_name": "xyz", "email": "aryanjainak@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}, "users": [{"id": 1, "first_name": "xyz", "email": "123@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}], "business_name": "Xyz", "website_url": "https://www.xyz.com", "industry": None, "created_at": "2024-04-23T04:37:55.983893+05:30"}]
#

It has to look like this instead
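
The errors above come from mixing up two text forms of the same data. `str()` on a Python list gives Python's own notation (single quotes, `True`, `None`), which `json.loads()` rejects. `json.dumps()` gives real JSON (double quotes, `true`, `null`). One more detail: swapping the quotes by hand is not enough, because `True` and `None` are still not valid JSON. A small sketch with a made-up, shortened record:

```python
import json

# Made-up stand-in for what response.json() returns: already a Python list of dicts.
workspace_data = [{"id": 6, "is_active": True, "industry": None, "business_name": "Xyz"}]

as_python_repr = str(workspace_data)       # Python notation: single quotes, True, None
as_json_text = json.dumps(workspace_data)  # JSON notation: double quotes, true, null

try:
    json.loads(as_python_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

# Replacing the quotes still leaves True/None, which JSON does not know.
try:
    json.loads(as_python_repr.replace("'", '"'))
    quoted_repr_is_valid_json = True
except json.JSONDecodeError:
    quoted_repr_is_valid_json = False

print(repr_is_valid_json, quoted_repr_is_valid_json)
print(json.loads(as_json_text) == workspace_data)
```

Since `response.json()` has already parsed the response into Python objects, there is usually no need to run `json.loads()` on it again.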

peak ridge
#

so what should i do sir

#
```py
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    loader = json.loads(str(workspace_data))
    print(loader,'xyz')
    data = loader.load()
    print(data)
    embeddings = embeddings_model.embed_documents(data)
    vectorstore = Chroma.from_documents(embeddings, embedding=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    docs = retriever.get_relevant_documents("What is the name of my workspace?")```
this is how it looks rn
neat bluff
#

I would save it as a raw string in your code

#

Lemme do it for ya

#
def main():
    workspace_data = get_workspace()
    embeddings_model = OpenAIEmbeddings()
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    json_chunks = splitter.split_json(json_data=workspace_data)
    print(json_chunks,'efef')
    """
    workspace_data = '[{"id": 6, "root_user": {"id": 1, "first_name": "xyz", "email": "aryanjainak@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}, "users": [{"id": 1, "first_name": "xyz", "email": "123@gmail.com", "last_name": "Jain", "is_active": True, "profile": {"id": 1, "user": 1, "avatar": "http://127.0.0.1:8000/media/default.jpeg", "country": None, "phone_number": None, "referral_code": "865083", "total_referrals": 0}}], "business_name": "Xyz", "website_url": "https://www.xyz.com", "industry": None, "created_at": "2024-04-23T04:37:55.983893+05:30"}]'
    loader = json.loads(str(workspace_data))
    print(loader,'xyz')
    data = loader.load()
    print(data)
    embeddings = embeddings_model.embed_documents(data)
    vectorstore = Chroma.from_documents(embeddings, embedding=OpenAIEmbeddings())
    retriever = vectorstore.as_retriever()
    docs = retriever.get_relevant_documents("What is the name of my workspace?")
neat bluff
#

wdym same error

peak ridge
neat bluff
#

Alright I've forgot one thing

#

Is this API on the other side written by You?

peak ridge
#

it's a web-application api

neat bluff
#

Mind showing me the code where You define and return said "JSON" data

peak ridge
#
```py
class WorkSpacesViewSet(viewsets.ModelViewSet):
    #permission_classes = (permissions.WorkSpaceViewSetPermissions,)
    serializer_class = WorkSpaceSerializer

    def get_queryset(self):
        # All the workspaces the request user is a member of
        return self.request.user.workspace_set.all()```
#

its written in django

#

django-rest framework*

neat bluff
#

I can see that

peak ridge
#

i can show u the response too
in postman if u want

#

via get req

neat bluff
#

Wait let me think

#

Mind editing the code of an API a bit?

```py
def get_queryset(self):
        # All the workspaces the request user is a member of
        userWorkspaces = self.request.user.workspace_set.all()
        return json.dumps(userWorkspaces)```
#

No clue if this will not crash

peak ridge
#

who cares,
i can try

neat bluff
peak ridge
#

np

neat bluff
#

I suppose it's not production deployed yet so hence my "no worries" debugging approach

peak ridge
#

ya the web-app crashed

neat bluff
peak ridge
#

we could turn it into json

#

but response.json()
(already doing it)

neat bluff
#

That's what we are actually trying to do.

neat bluff
peak ridge
#

we are calling the api in get_workspace

neat bluff
#

If it would be JSON there wouldn't be a problem loading it using json.loads()

peak ridge
#

actually im just trying with this workspace data

#

i wont really use this

#

i am calling users dynamic data,
their marketing data via marketing-channels api's

#

and i am storing it on my db and i wanna pass it

#

im just trying with workspace data

neat bluff
#

So if I understand correctly - we are trying to fix a mockup which isn't gonna be the final data managed by this code?

peak ridge
#

💀

#

but that data will also come from API request

#

or via db query (within an API)

#

i can change approach
like rn im calling via api

#

i can call via db directly if it works

#

it just needs to work

neat bluff
#

Anyway, the fact that json.loads() isn't able to load QuerySet (as stated in the error log) doesn't mean that it won't be able to parse it when we first treat it with some DICting...

peak ridge
#

or my company will die

neat bluff
#

Because now that I think about it... it probably didn't even try to do it because of incompatible data type

peak ridge
#

but can i call it via db queries

#

more impossible

neat bluff
peak ridge
#

on the docs i saw these options

peak ridge
#

are u guys using python for the backends?

neat bluff
#

Well I am one man army beside my frontend design guy.

peak ridge
#

im also the lone backend guy
but we have 2 interns on the frontend

#

and my co-founder is designer

#

and we have a pretty decent access to investors,market product

neat bluff
#

Is that a SaaS You are trying to build?

peak ridge
#

my co-founder has 1 more product 5k users

peak ridge
#

lol, i hate that too

#

these designers are crazy they love that

neat bluff
#

That's the one You are building rn or the one of Your friend?

#

Cuz it looks fucking fancy. That's for sure

peak ridge
#

yes, we are.

#

we have crazy funds and access too

#

180 pre-registered users

#

my co-founder is gr8 guy.

#

glad to have him

neat bluff
#

Alright. Now I feel dedicated to fix this crap

peak ridge
#

yes, we gotta do it sir.

neat bluff
#

Cuz maybe we will help each other in this crazy world of building SaaS

peak ridge
#

ya, we talk daily about everything

#

and work

lapis sequoia
tacit basin
#

Data Science in Python

Elements of Data Science

An introduction to data science designed for people with no programming experience, this book presents a small, powerful subset of Python that allows you to do real work in data science as quickly as possible. It includes Jupyter notebooks where you can read the text, run the code, and work on exercises to practice what you learn.
https://allendowney.github.io/ElementsOfDataScience/README.html

#

Install it with pip and run it then compare it to standard python repl.
pip install ipython

dawn light
#

Can anyone point me to the right direction,

I'm trying to build a model that matches a book's paragraphs in one language with the matching paragraphs in a translation (for example, let's take the little prince's english version and its japanese translated version)
The idea would be to create a version of a book where its original and translation are laid out side by side for language learning

I'm not too sure yet how to approach this kind of problem (what model to use, what kind of problem it is, etc.) so i'd appreciate some guidance

as of now, my idea would be to vectorize/tokenize the words, compute something like a vector sum per paragraphs, then maybe match the resultant vector using a dot product with the vectors in the other language, the thing tho is that since these are two different languages, the way the words would be vectorized would probably result in vectors where the dimensions aren't the same, so not yet sure how to deal with that


TLDR: I'd like to create a model that automates the creation of something like this: http://bilinguis.com/book/alice/jp/en/c1/ where the model aligns the text from an original language to an official human-translated text

Any suggestions would be appreciated!
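
The dot-product matching idea from the message above could be sketched like this. It rests on a big assumption: that each paragraph has already been turned into a vector by some multilingual embedding model that maps both languages into the same space, which is what would make the dimensions line up. The toy 3-dimensional vectors and the names `cosine` and `best_matches` are invented for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: the dot product of the two vectors after length-normalizing."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matches(
    source_vecs: Sequence[Sequence[float]], target_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """For each source paragraph, the index of the most similar target paragraph."""
    return [
        max(range(len(target_vecs)), key=lambda j: cosine(src, target_vecs[j]))
        for src in source_vecs
    ]


# Toy paragraph vectors (made up); real ones would come from an embedding model.
english = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
japanese = [[0.0, 0.1, 1.0], [1.0, 0.0, 0.1], [0.2, 0.9, 0.0]]

matches = best_matches(english, japanese)
print(matches)
```

This greedy version ignores paragraph order. Because a translation mostly follows the order of the original, a real aligner would probably also want to constrain the matches to stay roughly in sequence.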

peak ridge
#

๐Ÿ˜

buoyant folio
#

How can i speed this loop up alot?:
how do i speed up this loop, it needs to be very fast, so i can run it like 120 times a second:

`def RSI_strategy_numba(data: pd.DataFrame, rsi_values, indicators) -> tuple[list[pd.DatetimeIndex], list[pd.DatetimeIndex]]:
    buy_dates, sell_dates, state = [], [], 0
    for idx, rsi in zip(data.index.values, rsi_values):
        # If we're in the buy state, check for a buy
        if rsi > indicators[0] and state == 0:
            buy_dates.append(idx); state = 1
        # Otherwise check for a sell
        elif rsi < indicators[1] and state == 1:
            sell_dates.append(idx); state = 0

    return buy_dates, sell_dates`
wooden sail
buoyant folio
#

I am not appending to the dataframe. The problem i believe is the size of the dataframe

#

its around 43000 lines

#

I would like to use numpy's faster vectorization, but i cant figure out how

wooden sail
#

ah true, that's what i get for not reading carefully

#

what type is rsi_values?

buoyant folio
#

it's a numpy array of values. the dataframe should have datetime as index, and then a column for rsi. Rsi values is coming from dataframe['rsi'].values

wooden sail
#

it does seem like you need the state from the previous result, but that's not a big problem here

#

what's the type of indicators[0]?

buoyant folio
#

thats just a list of floats.

wooden sail
#

ok

buoyant folio
#

i have a genetic algorithm which generates it for me

wooden sail
#

then you can compare the entire rsi_values against indicators[0]

buoyant folio
#

yeah, using numpy.where

wooden sail
#

just rsi_values > indicators[0] yields a vector of booleans with all the results

buoyant folio
#

or that

wooden sail
#

you can similarly compute the state for all indices together, though this is a bit more tricky because each state depends on the previous one. it may be that you cannot avoid doing this in a loop, but you could rewrite it as a convolution at least

#

that means you can do all of these operations without any explicit for loops

buoyant folio
#

Idk how that'd work, im quite new to numpy

#

I guess i could calculate the checks using rsi_values > indicators[0]

#

but then from there, how do i loop that without explicitly using a for loop?

wooden sail
#

hmm if you're not familiar with convolutions then there isn't a much better way than what you're already doing

#

you could get a speedup by avoiding resizing the lists. you can initialize them with 0s

buoyant folio
#

I'll try to press chatgpt for answers on convolutions tomorrow. But for now ill go to bed. Thx for the help!

wicked vessel
#

"Hey, I'm about to start my journey into AI and ML! If anyone else is starting from scratch and wants to join a group for group study, let's create a study group together. Together, we can learn the basics and support each other along the way. Excited to start this journey with like-minded individuals!"

trim saddle
analog bolt
#

if I made a machine learning algorithm and gave it info about jokes that I find funny and jokes I don't, would it be able to generate new jokes that I would find funny the majority of the time?

serene scaffold
analog bolt
serene scaffold
wicked vessel
past meteor
peak ridge
#

hm

neat bluff
timid kiln
#

@serene scaffold I probably should have tagged you. Perhaps you can shed some light on how I can analyze the data?

past meteor
# peak ridge hm

basically, you find 10 jokes you find funny and 10 you don't and you give that to the algorithm. You let it produce 10 funny and unfunny jokes. If all of them are spot on, that means you're done. This will likely not be the case, You'll have to find more funny and unfunny jokes (say 10 more) and repeat with 20, you keep doing this in a loop until you're satisfied

neat bluff
buoyant folio
#

but it can train to simulate his

past meteor
#

Depending on your definition of understand the answer is either yes or no

tropic kettle
#

You know how in unicode different characters have different codes like
I think
A is 01000001
a is 01100001
(Just an example, probably wrong code)
My question is does any character take more storage space than another or do they all take up the same uniform space

agile cobalt
#

some """characters""" do take up more space, but when talking about things at this level of detail, your notion of what a character is starts clashing with formal definitions

#

!e ```py
examples = ['A', 'Á', '猫']
for example in examples:
    print(example, len(example), example.encode('UTF-8'), len(example.encode('UTF-8')))
    print(example, len(example), example.encode('UTF-16'), len(example.encode('UTF-16')))
    print(example, len(example), example.encode('UTF-32'), len(example.encode('UTF-32')))
```

arctic wedgeBOT
#

@agile cobalt :white_check_mark: Your 3.12 eval job has completed with return code 0.

001 | A 1 b'A' 1
002 | A 1 b'\xff\xfeA\x00' 4
003 | A 1 b'\xff\xfe\x00\x00A\x00\x00\x00' 8
004 | Á 1 b'\xc3\x81' 2
005 | Á 1 b'\xff\xfe\xc1\x00' 4
006 | Á 1 b'\xff\xfe\x00\x00\xc1\x00\x00\x00' 8
007 | 猫 1 b'\xe7\x8c\xab' 3
008 | 猫 1 b'\xff\xfe+s' 4
009 | 猫 1 b'\xff\xfe\x00\x00+s\x00\x00' 8
tropic kettle
#

Wow, I just lied to someone basically, thanks bros

agile cobalt
#

I recommend reading up on how Rust handles Unicode data and strings
it is pretty insightful even if you don't plan to ever use Rust

desert oar
# tropic kettle You know how in unicode different characters have different codes like I think ...

in utf-8 specifically the answer is "yes" -- some characters are 1 byte, some are 2, etc.

in python the answer is "maybe" because (i think) strings use a fixed width for each code point, auto-resizing their character width as needed. so it acts like utf-32 functionally, but in practice the storage size might be more like ascii if the characters are all 1 codepoint (which can be represented in 1 byte). however don't quote me on this because i don't remember where i read it.
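
The "maybe" for Python can be poked at directly. The sketch below is CPython-specific and relies on `sys.getsizeof`, so treat it as an experiment, not a guarantee: it measures how much memory each additional character costs when a string contains only that character. `bytes_per_char` is a made-up helper name. On CPython the flexible string representation (PEP 393) should give 1, 1, 2 and 4 bytes per character for the four samples.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Extra memory per added character for a string made only of `ch` (CPython-specific)."""
    # Compare two freshly built strings so the fixed object header cancels out.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


samples = {
    "ASCII 'A'": "A",
    "Latin-1 U+00C1": "\u00c1",
    "CJK U+732B": "\u732b",
    "emoji U+1F600": "\U0001F600",
}

widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(f"{label}: {width} byte(s) per character")
```

So within one string every code point takes the same width, but that width is picked per string from the widest code point it contains.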

neat bluff
desert oar
past meteor
#

ML (not just LLMs) is capable of taking 2 existing concepts and string them together to a 3rd, novel concept

#

this is a nice image

#

given 2 existing jokes it can produce a 3rd new joke

desert oar
#

@tropic kettle note that python does not equal numpy does not equal apache arrow.

i actually don't know exactly how numpy stores its strings, but they behave like fixed-size UCS-4 fields, i.e. UTF-32, so bigger codepoints shouldn't take up more space but strings will tend to be large (and are un-ergonomic to work with due to the fixed field size)

i don't think arrow has a native string data type, but polars for example uses utf-8 and is backed by arrow. i assume the pandas arrow-backed dtype is also utf-8 but you'd have to dig around in their docs or source code for that info

and of course this is all irrelevant if you're interested in how databases store your text (depends on the database), or file size at rest (you choose the encoding + compression), or data size over the wire (same as file size, your choice)

neat bluff
violet gull
#

Running into a weird issue here.
Im training an AI to make number based predictions. It fits to the data perfectly shown by a final loss value of
0.00026427686376862626
Then on a test value it also "perfectly" predicts it.

```Expected normalized number: [-0.9962218830221753]```
with a loss value of `0.000004458501602317615`
except when I un-normalize the value it is completely off of the target. even though the loss is extremely low. 
```prediction: [170.93668721399916]
expected: [696.299988]``` So everything did its job correctly. I already verified that the normalization stuff works. I believe the issue is caused by precision and the range of the upper and lower bounds of the normalization being huge.
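
One way to see how a tiny normalized error can turn into a large real-world error: with min-max scaling to [-1, 1], un-normalizing multiplies any error by half the range. The sketch below assumes that kind of scaling and uses made-up bounds (0 to 500,000) purely for illustration; the actual normalization and bounds are not given in the message.

```python
def normalize(x: float, lo: float, hi: float) -> float:
    """Min-max scale x from [lo, hi] to [-1, 1]."""
    return 2 * (x - lo) / (hi - lo) - 1


def denormalize(z: float, lo: float, hi: float) -> float:
    """Inverse of normalize()."""
    return (z + 1) / 2 * (hi - lo) + lo


LO, HI = 0.0, 500_000.0  # hypothetical bounds, not from the chat

target = 696.299988  # expected value quoted in the chat
target_norm = normalize(target, LO, HI)

normalized_error = 0.002  # looks tiny in normalized space
prediction = denormalize(target_norm - normalized_error, LO, HI)
real_error = target - prediction  # equals normalized_error * (HI - LO) / 2

print(f"normalized target: {target_norm:.6f}")
print(f"real-world error:  {real_error:.1f}")
```

If the loss is a squared error, a normalized miss of 0.002 corresponds to a loss of about 4e-6, the same order as the value reported above, yet with bounds this wide it is a miss of about 500 in real units.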
buoyant vine
#

I'm not sure if you have the model train on the normalized value but then pass it the non-normalized value

#

since that effectively completely changes how the data looks and can change the pattern

fallow coyote
#

Just want to ask, in the pinned section, someone recommended three resources for learning maths for ML/AI. Will they be enough to at least have a good maths base for ML/AI or does anyone recommend any alternate sources?

violet gull
desert oar
#

In general you're looking for strong numeracy fundamentals, and specifically a good handle on undergrad-level calculus, linear algebra, probability, and statistics.

#

for "AI" specifically you probably don't need much statistics and can skimp on probability a little bit, but for generalist DS you do need both.

fallow coyote
#

Im decent at maths. Im relearning it pretty quickly but it has been a few years. Only want to get into ML/AI cos its interesting and maybe useful for me in the future so might as well start now

winged yew
#

anyone knows how to install tensorflow-gpu on windows ??

frail heart
#

https://youtu.be/IHZwWFHWa-w?si=ymfibEI1iRHxjHf7&t=290

I am watching 3blue1brown's neural network video ep 2, and I am wondering how he got 13,002 weights/biases for the parameters for his neural network. When I calculate it I get 12,963.

serene scaffold
#
In [10]: (784 * 16) + (16 * 16) + (16 * 10) + (16 + 16 + 10)
Out[10]: 13002
#

the last term here are the biases.
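
The same count can be written for any fully connected network: every pair of neighbouring layers contributes `inputs × outputs` weights, and every neuron after the input layer contributes one bias. A small sketch (`count_parameters` is a name made up here):

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights + biases


total = count_parameters([784, 16, 16, 10])
print(total)
```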

frail heart
#

Thanks

serene scaffold
#

yw

craggy coral
#

just started getting into data science and stuff

#

im 17

serene scaffold
#

Why does it make your head spin? Your initial calculation was correct except that you were missing one output node.

hasty grail
winged yew
hasty grail
#

Yes but then you'd be stuck with old content, which will be detrimental in the long run (especially with how fast-moving the field is)

#

better to bite the bullet now and set up a WSL2 environment

past meteor
buoyant vine
#

On the topic of TF, why do people still use TF over PyTorch? It seems for the most part Torch just dominates over TF both in speed and available tooling now.

I remember years ago it was the other way around, but since TF3 it seems PyTorch easily beats TF in almost every situation?

agile cobalt
#

TF3? do you mean tf2?

#

the biggest reason is probably just momentum, but iirc it has a few niche advantages like easier/more mature deployment to web and edge devices

buoyant vine
#

yeah sorry

#

for some reason I have it in my head that TF is v3 and v2 was the old version

agile cobalt
buoyant vine
#

and ease of converting models to onnx, quantizing etc...

agile cobalt
buoyant vine
#

we go straight to onnx models

#

Now, my experience with just pytorch on edge or embedded devices is bad because PyTorch feels huge and bulky

#

but exporting to onnx and then embedding onnxruntime or what not experience wise is better than TFLite

agile cobalt
#

pithink yeah idk then

past meteor
#

And it's also what some of us learnt in uni 👈

#

But I've since switched to Torch, all in all they're quite similar and I'd just recommend the vast majority to use Torch (in conjunction with lightning)

potent sky
buoyant vine
#

yeah, and portability wise very nice

potent sky
#

plus who knows when Google will kill something

#

the import mechanism change in tf 2.6, when they changed keras to a separate python package broke a lot of things and made it generally frustrating to use tf.
imo before that things were looking good with the subclassing api and the functional API.
But that was what finally forced me to completely pivot to pytorch as my primary.
Haven't really used tf much since.
It's a shame because I was actively excited for tf, even contributed a few smol things iirc.

past meteor
#

And the docs that don't follow them etc.

runic parcel
#

can anyone tell me how can i use isochrone?

serene scaffold
serene scaffold
jaunty valve
#

hey all, im building this dev tool to transform scrappy python code to code that follows best practices by using LLMs and AI.
still in beta but would love to get feedback from python practitioners and people in AI
https://gitgud.autonoma.app/
any best practice im missing? is the output good quality for prod?

tame blade
#

nvm there was a bug, fixed now

spring field
#

I don't remember asking it to complicate the code beyond recognition (take it lightheartedly, lol)... all I really wanted it to do was to just go from for key in dct: value = dct[key] to for value in dct.values(): pass or for key, value in dct.items(): pass
anyway, this is most certainly not ready for production; at least I can't imagine trusting it. doing weird stuff like this might make some tests fail, and then you have to rewrite those, or it can change the AST in unexpected ways. oh, and after all that it didn't use .items(). I even tried being a bit more explicit about the usage and even then... though it did produce less clutter that time. the logging is frankly way too much IMO, and also, where are the two blank lines around function definitions 🙂
also constants appear to get lowercased for some reason
it does seem to work somewhat better with a bit more code than with those tiny samples I provided in the screenshots
also also, it seems to quite arbitrarily get rid of some comments... that's definitely not ideal
dunno, there's certainly room for improvement I guess
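For reference, the rewrite asked for above is small. A minimal sketch, assuming a made-up dictionary `dct` (the sample data is hypothetical, not taken from the screenshots):

```python
# Hypothetical sample data, not taken from the screenshots.
dct = {"a": 1, "b": 2, "c": 3}

# Original style: iterate over keys, then look each value up again.
looked_up = []
for key in dct:
    value = dct[key]
    looked_up.append((key, value))

# Idiomatic: let .items() hand over key and value together.
pairs = [(key, value) for key, value in dct.items()]

# When only the values matter, .values() skips the keys entirely.
total = sum(dct.values())

print(pairs == looked_up, total)
```

Both loops visit the same pairs in the same order, so a tool doing this refactor should not need to touch anything else.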

jaunty valve
craggy coral
#

guys im new to data science. so far i understand data mining and getting unstructured data, but i dont understand the part where u use python or ai to structure the data and get key insights. can anyone explain that?

serene scaffold
# craggy coral guys im new to data science so far i understand data mining and getting unstruct...

"data mining" is really just a buzzword. I wouldn't put any stock into it.

If you have a bunch of reddit messages, that's semi-structured data, since you know who wrote each message, and when, and which message was in response to which. But the messages themselves are unstructured data in natural language (English, or what have you). If you were to identify all the locations that are mentioned in each message, then you'd have structured data.
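To make the location example concrete, here is a rough sketch. It assumes a tiny hand-written list of place names (`KNOWN_LOCATIONS`, hypothetical) instead of a real named-entity recognizer, and the message fields are invented for illustration:

```python
import re

# Hypothetical gazetteer; a real project would likely use a proper NER model.
KNOWN_LOCATIONS = {"Paris", "Berlin", "New York"}

def extract_locations(message: str) -> list[str]:
    """Return the known locations mentioned in a free-text message."""
    found = []
    for place in sorted(KNOWN_LOCATIONS):
        # \b keeps "Paris" from matching inside a longer word.
        if re.search(rf"\b{re.escape(place)}\b", message):
            found.append(place)
    return found

# Semi-structured input: the author is a known field,
# the text itself is unstructured natural language.
messages = [
    {"author": "user_a", "text": "Flew from Berlin to New York last week."},
    {"author": "user_b", "text": "Nothing beats pizza."},
]

# Structured output: one record per message with an explicit list of locations.
structured = [
    {"author": m["author"], "locations": extract_locations(m["text"])}
    for m in messages
]
print(structured)
```

The output is something you can count, group, or join on, which is what "structured" buys you.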

#

What kind of structured data you might want to extract from unstructured or semi-structured data depends on who you are, and what data you have or can obtain, and what your goal is. Retail companies might want to obtain structured data about what people think about their products.

craggy coral
craggy coral
lapis sequoia
#

What are the levels to NLP?

narrow tiger
#

what is difference between data science, machine learning and AI

past meteor
# narrow tiger what is difference between data science, machine learning and AI

I'd say that AI is about creating algorithms capable of complex decision making. ML is a subset of AI where you make those algorithms by learning from experience (also known as data) but there is AI that isn't ML.

Finally, data science isn't a formal term with definitions. I'd say it's a toolbox of methods ranging from things related to ML to traditional statistics and potentially even optimisation/operations research. It's an applied field where you use data to solve problems. The "science" in there is to distinguish it from let's say business analytics where the goal is "insights" and bar charts. It's a narrower skillset.

narrow tiger
#

"but there is AI that isn't ML."? what do you mean by this exactly

#

also any advice to someone coming from programming background into this field i don't wanna go to too deep into calculus but stay near the programming end
any pathways / job titles i should aim for

jaunty helm
# narrow tiger "but there is AI that isn't ML."? what do you mean by this exactly

personally I think of AI as the goal and ML as a means

AI that isn't ML
programs that could do complex decision making existed before ML got popular
for example, you could just code a ton of conditional checks manually, and that could act like AI; in fact that's the idea of expert systems
ML is when hardware got better and people thought, "man, manually finding & coding in these rules every single time for every single new problem is a lot of work, what if we just had a generic tool which can do that for us instead?"
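A toy version of the "ton of conditional checks" idea, as a sketch only. The rules and symptom names are invented for illustration and carry no medical meaning:

```python
def diagnose(symptoms: set[str]) -> str:
    """Hand-written rules, checked in order; nothing is learned from data."""
    if {"fever", "cough"} <= symptoms:
        return "flu suspected"
    if "sneezing" in symptoms and "itchy eyes" in symptoms:
        return "allergy suspected"
    if "fever" in symptoms:
        return "unspecified infection"
    return "no rule matched"

print(diagnose({"fever", "cough", "headache"}))
print(diagnose({"sneezing", "itchy eyes"}))
print(diagnose({"sore foot"}))
```

Every new case means a human adds another rule, which is the manual work ML tries to replace by fitting the rules from examples.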

iron basalt
narrow tiger
iron basalt
#

But for example, an automatic prover / search algorithm was considered AI back then, now it may not be due to not being impressive enough anymore.

jaunty helm
iron basalt
jaunty helm
iron basalt
#

Including some lesser known roles it played in stuff like the space race (optimization algorithms).

past meteor
#

BFS, DFS, A* are all AI depending on the context

#

People don't like this but it's true

iron basalt
#

(About the space race stuff) But in that context it would only be considered a search / optimization algorithm. The term ML was around but had not blown up in usage yet; it was still ML though.

past meteor
#

(and it has nothing to do with ML)

jaunty helm
#

I guess they just "feel less AI" when compared to chatbots

iron basalt
#

Also a lot would fall just under control theory, now parts of it are considered ML, even though it's still (optimal) control theory.

past meteor
wooden sail
iron basalt
#

"AI is when it feels magical enough" is one definition of it.

past meteor
iron basalt
#

This also means that with enough time all AI becomes non-AI.

past meteor
#

Whenever I do a talk I ask people if they think a Google search is AI and the vast majority says no

#

It's the best example of that

iron basalt
#

If you include prior to the term ML, then even earlier, since lots of search and optimization happened automatically (on machines) during WWII.

past meteor
wooden sail
#

a lot of people introduce linear regression as "the simplest form of ML" too

past meteor
#

But what about linear kernel, least squares SVMs. They reduce to something similar to LDA

wooden sail
#

i don't think the term is well-defined enough to be worthwhile

past meteor
#

For me the difference is the end goal, statistical inference or simply prediction

#

Yes if you can do inference you can do prediction

#

But stats was always more focused on inference and not necessarily prediction

iron basalt
#

A key part of ML is the M, it happens on a machine, linear regression and such came way before that.

#

But they can also be done on a machine, so idk.

#

And we already had people trying to make AI-like automatic proof machines and such. Although most were never completed, too far ahead of their time (pre-Turing).

iron ruin
#

configuring score ranges is pain

#

especially for a dataset of 4000

drifting depot
#

Hi, I am new to python and I need to fit data with x and y errors in matplotlib. How can I do that? (I am trying something different than gnuplot, and I couldn't figure it out)

tidal bough
#

What are you asking, specifically? For plotting that, use errorbar with xerr and yerr arguments.
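A minimal sketch of that errorbar call, assuming made-up measurements (all arrays below are hypothetical) and the off-screen Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical measurements with per-point uncertainties.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
x_err = np.full_like(x, 0.1)
y_err = np.array([0.3, 0.2, 0.4, 0.3, 0.5])

fig, ax = plt.subplots()
# xerr/yerr accept a scalar or one value per point.
container = ax.errorbar(x, y, xerr=x_err, yerr=y_err, fmt="o", capsize=3)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.canvas.draw()
```

This only draws the points. Fitting with uncertainties in both x and y is a separate step; orthogonal distance regression (for example scipy.odr) is one option worth looking at.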

drifting depot
#

I see, there is an argument sigma, but I don't know how to include xerr and yerr

tidal bough
drifting depot
serene scaffold
#

@vernal thunder your message was removed for not being in English or being on-topic for this channel

vernal thunder
#

Is this correct

#

Hmm, who asked for your opinion?

serene scaffold
vernal thunder
#

Hahahahaah

#

Yes good

#

I'm not afraid of anyone

#

Keep this in your information

#

because Im Arabic

serene scaffold
vernal thunder
#

Then speak Arabic with me

#

I want to know which religion you follow

#

Hey

#

Hey, moderator

serene scaffold
vernal thunder
#

It seems that you are ....

#

American

#

Good

#

Do you support Palestine or Israel

#

Speak

serene scaffold
#

@vernal thunder I'm muting you if this off-topic discussion continues.

vernal thunder
#

Hahaha, it looks like there are 14 on the PlayStation

#

This is a bad thing

#

By the way, I am the one who is silent

serene scaffold
#

@vernal thunder if you send another message in this channel, make sure it's about data science or AI.

fallen osprey
#

What's the best place to learn maths for ai ml

spring field
#

I suppose university/college would certainly be one of the better places for that

trail monolith
neat bluff
limber token
#

Any idea on why this code is throwing RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor?

#

Is it because of DataLoader?

#

The code for train_model is this:

def train_model(model: nn.Module, train_loader: DataLoader, criterion: LossFunction, optimizer: optim.Optimizer, num_epochs: int = 10) -> None:
    """
    Train a PyTorch model.

    Args:
        model (nn.Module): The model to train.
        train_loader (DataLoader): The DataLoader for the training data.
        criterion (nn.modules.loss._Loss): The loss function.
        optimizer (optim.Optimizer): The optimizer.
        num_epochs (int, optional): The number of epochs to train for. Defaults to 10.
    """
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0

        for images, labels in train_loader:
            labels = labels.float()
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs.squeeze(), labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}')
limber token
#

It works fine when not using CUDA
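The error message says the input batch is still a CPU tensor while the model weights are on CUDA, and the loop above never moves `images` or `labels`. A sketch of one possible fix, assuming all model parameters sit on a single device; `train_step` and the tiny stand-in model are hypothetical names for illustration:

```python
import torch
from torch import nn

def train_step(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
               criterion: nn.Module, optimizer: torch.optim.Optimizer) -> float:
    """One optimisation step with the batch moved to the model's device."""
    # Assumption: all model parameters live on a single device.
    device = next(model.parameters()).device
    images = images.to(device)
    labels = labels.float().to(device)

    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs.squeeze(), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-in model and batch (hypothetical shapes) to exercise the step.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Flatten(), nn.Linear(4, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(8, 1, 2, 2)        # created on the CPU, like DataLoader output
labels = torch.randint(0, 2, (8,))
loss_value = train_step(model, images, labels, criterion, optimizer)
print(loss_value)
```

The DataLoader itself is not at fault; it yields CPU tensors by default, so the two `.to(device)` calls inside the batch loop are what was missing.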

leaden narwhal
#

fellas can anyone give me a hand

gusty flicker
#

hey, does anyone have a good recommendation for a youtube or online guide for doing astronomy stuff with astropy and some machine learning? I found a youtube video online for using fits data and I was able to use ultralytics and roboflow, but i dont know if I got issues because of version mismatches with the pip packages or what, I'd rather ask if anyone is aware of a good guide

narrow tiger
leaden narwhal
#


{'Grid_ID': ['1001', '1001', '1001', '1001', '1001'], 'Datetime': [Timestamp('2023-03-01 00:00:00+0000', tz='UTC'), Timestamp('2023-03-01 00:15:00+0000', tz='UTC'), Timestamp('2023-03-01 00:30:00+0000', tz='UTC'), Timestamp('2023-03-01 00:45:00+0000', tz='UTC'), Timestamp('2023-03-01 01:00:00+0000', tz='UTC')], 'C1': ['4.25', '1.909999966621399', '0.0', '0.0', '0.0']}

# Convert "Nº de ..." (integer) columns to int
int_cols = df.columns[df.columns.str.startswith('C')]
df[int_cols] = df[int_cols].apply(np.int64)

df.sample(10)

ValueError: invalid literal for int() with base 10: '4.25'
#

Guys im having this error

#

any help?
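The values in C1 are strings such as '4.25', and int('4.25') is what raises the ValueError. A sketch of one way around it, using a cut-down copy of the frame from the message; whether rounding to integers is actually wanted is an assumption:

```python
import pandas as pd

# Reduced version of the frame from the message: numbers stored as strings.
df = pd.DataFrame({
    "Grid_ID": ["1001", "1001", "1001"],
    "C1": ["4.25", "1.909999966621399", "0.0"],
})

value_cols = df.columns[df.columns.str.startswith("C")]

# Parse the strings as floats first; int("4.25") is what raises ValueError.
df[value_cols] = df[value_cols].apply(pd.to_numeric)

# Only if integers are really wanted: round explicitly, then cast.
as_int = df[value_cols].round().astype("int64")
print(df.dtypes["C1"], as_int["C1"].tolist())
```

Since the sample contains 4.25 and 1.91, the column probably holds real-valued measurements, so keeping it as float may be the better choice.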

serene scaffold
fleet compass
#

hi all. python newbie here. need help with above

serene scaffold
calm pagoda
#

Pls guide me..

daring pier
#

Anyone ever had an error while training a cnn that says input ran out of data while using tensorflow? Found some advice on Stack Overflow but the error is still there.
Can anyone help me?

wooden sail
#

i would point out that this is the deterministic interpretation, but you can alternatively derive the same regularizers through statistical criteria (e.g. for the L1 case, using maximum a posteriori when the parameters follow a laplace distribution centered at 0). also L2 does not restrict the values of the weights
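The statistical reading can be written out in a few lines. A sketch, assuming independent Laplace priors with scale b on each parameter; the symbols w, D, b and λ are notation picked for this illustration:

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
  &= \arg\max_w \; p(w \mid D)
   = \arg\min_w \; \bigl[-\log p(D \mid w) - \log p(w)\bigr], \\
p(w) &= \prod_i \frac{1}{2b}\exp\!\left(-\frac{|w_i|}{b}\right)
  \;\Longrightarrow\;
  -\log p(w) = \frac{1}{b}\lVert w \rVert_1 + \text{const}, \\
\hat{w}_{\text{MAP}}
  &= \arg\min_w \; \bigl[-\log p(D \mid w) + \lambda \lVert w \rVert_1\bigr],
  \qquad \lambda = \tfrac{1}{b}.
\end{aligned}
```

So the L1 penalty weight is the inverse of the prior scale: a narrower prior around 0 means stronger regularization.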

#

discourages, yes, but not restricts

#

it won't prevent the weights from becoming infinitely large

#

what do you mean?

#

unless you explicitly introduce inequality constraints, the inputs and outputs will be unbounded

#

that won't stop you from quickly exceeding the computer precision and getting infs and nans

#

which is exactly what you see in e.g. exploding gradients

#

L2 reg alone does not prevent the parameters becoming arbitrarily large in any way

#

neither in the math nor in the implementation in the computer

#

yes but it won't "restrict" them. you CAN do that: you can guarantee the values never exceed a certain threshold

#

that's something different altogether and it's where constraints come in

#

the wording and semantics are important to distinguish that

#

i'd advise against making stuff up

#

what you're discussing now already exists and has names, and you'll have an easier time reading about it if you find the proper terms

#

if you say so. at any rate though, L2 does not prevent your parameters from becoming unbounded

#

you can do it via the extended lagrange form of the KKT conditions with inequality constraints

#

that does add some L2-looking regularization terms, but the additional slackness and positivity conditions anyway have to be enforced for the solutions to be in the feasible set. those require inequalities

#

the way it's usually explained is as "promoting smoothness"

#

for a fixed sum of non-negative entries, maximizing the 2 norm of a vector is achieved by dumping all of the values into a single entry and setting everything else to 0

#

the minimum is achieved by making all entries equal

#

the less variation there is among vector entries, the lower the 2 norm

#

it's exactly what it does, though

#

smoothness when paired with an equality constraint does restrict the values, too

#

what L2 will do is make all of the parameters similar to each other

wooden sail
#

wdym by that?

#

i don't think degeneracy makes sense here either

#

that's what they mean by smoothness in this context, not differentiability

#

idk who came up with the term nor when, but it's well established

#

because the 2-norm is a contraction for small values, so it ignores them

#

you can try yourself playing with the example i gave above. take a vector, and for simplicity, work only with positive entries. say we work with the condition that the entries of the vector add up to 1

#

now let's maximize and minimize the 2-norm of the vector

#

it's pretty easy to conclude that the maximum value is 1, when one entry is 1 and the others are 0. this is the "least smooth" solution in the sense that it looks spiky

#

the minimum norm solution is the one where, if the vector is of length N, the entries are 1/N

#

then they all get contracted by the 2-norm

#

that solution is "smooth" in that the entries change very little w.r.t. each other
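The experiment described above is easy to run. A small sketch with a length-4 vector whose non-negative entries sum to 1; the three example vectors are arbitrary choices:

```python
import math

def two_norm(v):
    """Euclidean (2-) norm of a sequence of numbers."""
    return math.sqrt(sum(x * x for x in v))

n = 4
spiky = [1.0, 0.0, 0.0, 0.0]     # all mass in one entry
uniform = [1.0 / n] * n          # mass spread evenly
mixed = [0.7, 0.1, 0.1, 0.1]     # somewhere in between

for name, v in [("spiky", spiky), ("mixed", mixed), ("uniform", uniform)]:
    print(name, round(sum(v), 6), round(two_norm(v), 4))
```

With the sum held at 1, the norm drops from 1 for the spiky vector to 1/sqrt(N) for the uniform one, which is the sense of "smooth" used here.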

#

yes, though almost never used

#

0 leads to combinatorial problems and is fairly common. L1 is its convex relaxation and they are actually equivalent under special conditions

#

everything between 0 and 1 promotes sparsity, but 0 is not a proper norm and 0 < L < 1 is non convex

#

L = 1 is convex and non differentiable, but it does have a nice subgradient

#

how so

toxic mortar
#

is it possible to use different types of activation in the output layer of a neural network?

#

For example, in my output layer 9 of the outputs should have softmax, since they are categorical, and 1 output should have linear activation since it is a prediction

serene scaffold
toxic mortar
#

Something like this, where 1 is classification problem and 2 is prediction problem

toxic mortar
#

But I cant seem to implement it

serene scaffold
toxic mortar
#

Okay. How would you approach this? If there are 9 classes and 1 parameter

serene scaffold
#

I'm not sure

toxic mortar
#

I mean one obvious solution is to create two separate neural nets, one for class classification and one for the stock prediction

#

can I fit it in one?

lapis sequoia
#

guys

#

i need a strong source to learn numpy

past meteor
#

What I'd worry about is how I'd compute the loss of this network

toxic mortar
#

I split it into a classifier and a linear prediction

past meteor
#

So 1 softmax (with 9 classes) and 1 linear layer?

toxic mortar
#

Yes

#

Two separate models with different architecture

past meteor
#

I wouldn't do that

toxic mortar
#

Why not?

past meteor
#

You can just have 1 network that does 2 outputs, you compute the loss of each and take the mean or so
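A sketch of that single-network idea in PyTorch, assuming 5 input features and random stand-in data; `TwoHeadNet`, the layer sizes, and the equal weighting of the two losses are all illustrative choices, not something settled in this conversation:

```python
import torch
from torch import nn

class TwoHeadNet(nn.Module):
    """Shared trunk with a 9-way classification head and a scalar regression head."""

    def __init__(self, n_features: int, n_classes: int = 9, hidden: int = 32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.class_head = nn.Linear(hidden, n_classes)  # raw logits; softmax lives in the loss
        self.value_head = nn.Linear(hidden, 1)          # linear activation

    def forward(self, x):
        h = self.trunk(x)
        return self.class_head(h), self.value_head(h).squeeze(-1)

torch.manual_seed(0)
model = TwoHeadNet(n_features=5)
x = torch.randn(16, 5)                  # hypothetical batch
y_class = torch.randint(0, 9, (16,))
y_value = torch.randn(16)

logits, value = model(x)
loss_cls = nn.functional.cross_entropy(logits, y_class)
loss_reg = nn.functional.mse_loss(value, y_value)
loss = 0.5 * (loss_cls + loss_reg)      # plain mean; the weighting is a tunable choice
loss.backward()
print(logits.shape, value.shape, loss.item())
```

The two losses can live on very different scales, so the weighting between them usually ends up as one more hyperparameter.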

past meteor
#

Obviously, the biggest issue with neural nets is that you can never easily conclude if it doesn't work or there's just a special set of hyperparameters you haven't tried yet that do work

#

I think if I were you I'd train them separately first and hyperparameter tune them separately as well and then benchmark against a multi-task style architecture

wooden sail
#

i don't think rust has a good BLAS/LAPACK implementation yet, does it?

past meteor
#
  1. Python can use multiprocessing without any problems.
  2. The issue is multithreading. In CPython, the GIL lets only one thread execute Python bytecode at a time.
  3. Most major libs like numpy use multiple threads in C-land, which circumvents this issue.
wooden sail
#

which means even though it could be a good idea, no one has done it yet

#

idk how easily rust exposes SIMD

#

google says support is only experimental

#

this arguably has a bigger impact than just parallelization which, as zestar says, is already taken care of in C for numpy

past meteor
#

Not my area of expertise but you can definitely allocate chunks of the matrices/arrays to different threads and combine them afterwards
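The chunk-and-combine idea could look roughly like this in Python. It is a sketch only: `chunked_matmul` is a hypothetical helper, and since NumPy's BLAS backend is often multithreaded already, this is unlikely to beat a plain `a @ b`:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def chunked_matmul(a: np.ndarray, b: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    """Multiply a @ b by giving each thread a block of rows of `a`."""
    row_blocks = np.array_split(a, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() preserves order, so the blocks can be stacked back directly.
        partial = list(pool.map(lambda block: block @ b, row_blocks))
    return np.vstack(partial)

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 50))
b = rng.standard_normal((50, 30))
result = chunked_matmul(a, b)
print(np.allclose(result, a @ b))
```

Threads work here because the matrix product runs in compiled code rather than in Python bytecode.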

wooden sail
#

simply by virtue of getting more slots on the OS scheduler, sure. if your task already exceeds the cache size and the number of parallelizable operations in SIMD, you can speed it up by getting favored by the gods of RNG

buoyant vine
wooden sail
buoyant vine
wooden sail
#

aha

buoyant vine
# wooden sail no but i mean written in rust directly

ah, no not really, I'm not sure it is worth ever doing that vs binding to cblas or open blas.

I have made some vector math libraries in Rust, but not to the same extent as blas. Often it becomes pretty annoying to maintain such a large number of specialized ops

#

for the sake of maybe beating blas by a few pct

#

Yeah, idk for AI/ML I probably wouldn't use CPU for heavy ops regardless, and Rust-cuda is a pretty nice experience

wooden sail
#

that's what i would've thought, yeah

buoyant vine
#

Me rn 😅

#

These routines alone are ~20k LOC

wooden sail
#

my respects to you

foggy obsidian
#

Add mine as well

buoyant vine
#

One thing I guess I would weigh in here though, I think Rust can be great for training models in situations where you need multi-gpu or multi-threaded dataset processing or pre-processing.

At work we use PyTorch Lightning and that thing single-handedly takes 20 minutes to start up on a big dataset with 32 cores due to all the multi-processing and extra overhead going on from Python, whereas Rust can just use threads natively. That and the static type checking can help significantly to reduce the crashes at the ends of runs due to some random error.

That being said, for quickly knocking something out Python still wins, and I think maybe if you have enough time training via onnxruntime might solve the original issue.

#

maybe

iron basalt
#

This is nonsense, every language with performance in mind can do parallelism, multiprocessing is also not what is desired, you don't want a process for each part. If they mean vs Python then that would make sense for CPU heavy tasks, but for matrix multiply we have numpy anyhow. Python is extremely slow. But you actually get more gains (more than the parallelization step) by switching to something like C or Rust ignoring parallelism. Python is just that much slower. And we do do that every time we call a numpy function. And also usually it all happens on the GPU anyhow where Rust does not apply (for large enough matrices / deep learning).

#

Also bonus points if you realize it's even better to use something like OpenCL for the CPU, for which you can also use PyOpenCL (SPMD/ISPC is the superior model for this stuff which is why the GPU also uses it).

past meteor
#

Based you're talking about matrix multiplication in VR Chat tho

iron basalt
wooden sail
#

typical vr chat discussion

buoyant vine
#

😅

#

I think it would be more viable if you could more concretely force the compiler to unroll some loops

#

the biggest gain fortran has, IMO, is its ability to aggressively unroll loops and split the ops into SIMD lanes automatically vs manually

iron basalt
buoyant vine
#

Tbh idk if it will ever truly have the ability to force unrolls since it is technically controlled by LLVM and depends on LLVM being able to work out if it should or not

iron basalt
#

Rust is more of a modern C++ alternative than C which gives it a focus on ergonomics over this kind of optimization stuff.

#

And as usual everyone ignores all the cool stuff Fortran did :(

buoyant vine
#

Eh I disagree, at least for optimized compute: you can achieve the same thing in Rust, albeit unsafe Rust, as you would in C, but both still have the same issue that LLVM/gcc largely control the unrolling behaviour automatically

#

but yeah, in terms of writing fast math ops without having to get your hands dirty with manual SIMD, fortran is awesome

#

especially F95+ where you can expose functions via FFI more easily now

iron basalt
#

Yeah you can, but it's a question of how difficult, after all, I could also in Python by manually outputting machine code to a buffer writing that to an executable memory page and running it. This is an extreme example but unsafe Rust plus hoping LLVM does the right thing can feel like that. Anyhow I don't want to make this a Rust complaint channel so we can go to off topic.

buoyant vine
#

My point was more unsafe rust gives you same control as you would C in reality, and if you really want the most number crunching performance, in both cases you are always manually writing the intrinsic regardless of if it is C or not, but yeah we're getting a bit off topic lol

simple tapir
#

Do I need a Master's degree to work as a data scientist?

#

im currently a sophomore undergrad computer science and engineering student

past meteor
past meteor
#

I don't know anything about the Turkish job market to be honest

simple tapir
#

I took linear algebra and some other math classes and I'm taking statistics, ML, AI and differential equations this semester

past meteor
#

The only people that can answer this are people in your country

simple tapir
#

I'd like to work abroad though

past meteor
#

Then it'll depend on the country you want to work in specifically 🙂

#

I think it's possible with a bachelors in the US and UK for instance

#

Where I'm based (Belgium) not so much

simple tapir
#

Belgium requires MSc / PhD at least?

past meteor
#

Science/theory oriented degrees put you on a track where you get BSc + MSc, no one leaves these before getting an MSc. Practice focused tracks don't lead to an MSc and cover no math, stats, ... (anymore) but deliver better programmers at day 1

#

that's the summary

#

1 or 2 years, so 4 or 5 years total (bs + ms). Just 1/3 finishes it in that time so it's more like 5+ years for the majority

#

yeah, a good move here is to do 2 of 1 year each

#

well yes, each place has their peculiarities

#

hence why, and I don't mean this to be rude, it's better to ask people IRL. Online you'll get US-centric advice that most likely doesn't apply to you (or could even be detrimental)

#

that's the edge case

#

But if you're targeting idk Germany, I think r/germany or whatever is optimal

#

Did you look at the ones I sent? 👀

#

(I can resend)

#

A video

#

the f

#

which one?

quaint crystal
#

Hey I am looking for some hints for something I would like to do with tensorflow. I want to show one of 5 images into the camera and have the program tell me which one it is. I know I should probably use template matching, but all tutorials I can find use more than one image as training data. Does someone know a good starting point for this?

serene scaffold
hidden ferry
gritty vessel
#

hey are there any resources for this? scraping data using
language model-based tools like OpenAI API, Mistral 7B, Llama2

toxic mortar
#

This means it is overfitting right? Spikes around 10,15 epochs

#

Before I hyperparam tune it, I want to make sure I did my best regarding the model architecture

hasty grail
toxic mortar
#

Yes. cool. thanks. Yes I know why they are 0s. this is my class distribution

#

I wanted just to test it out, before I remove outliers

#

Imma try over and undersampling and classweights to see how it performs

#

If it sucks imma just chop it off

past meteor
#

I look at overfitting as a disproportionate gap between the validation and training loss

#

Your problem is moreso that your val loss isn't smooth but that's imo pointing towards a learning rate that is too high, lack of dropout, ... things you can tune easily

#

I just train with "enough" early stopping

#

Honestly, I noticed that there's a lot of variance in training. If I run the same hyperparameters on the same data some runs it's good, some runs it's not. Setting early stopping to something "reasonably high" makes you robust to the model quitting after a few bad epochs
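To see why patience matters on a noisy curve, here is a framework-free sketch. `best_epoch_with_patience` is a hypothetical helper and the validation losses are invented numbers:

```python
def best_epoch_with_patience(val_losses, patience):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical noisy validation curve: two bad epochs, then it keeps improving.
noisy = [1.0, 0.8, 0.9, 0.85, 0.6, 0.5, 0.55, 0.52, 0.51, 0.53]
print(best_epoch_with_patience(noisy, patience=2))
print(best_epoch_with_patience(noisy, patience=4))
```

With patience 2 the run quits at epoch 3 and keeps the 0.8 model; with patience 4 it survives the bump and reaches 0.5 at epoch 5.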

#

As in, I think it reduces the variance

toxic mortar
#

Thanks guys

past meteor
#

Hmmmm

#

I wouldn't tune the seed but if I had enough time and patience I'd run the same thing N times and make a boxplot or something yes

#

Just don't have the time to do that with neural nets, each run takes way too long

#

You know what I should consider? Doing hyperparameter tuning as a multi-objective optimization problem. Instead of just tuning for the loss you also consider the time it takes to do a single run.

#

That way I could keep my search space larger, but have the hyper param optimizer "punish" the algo for selecting a very low learning rate or very large architecture.
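The multi-objective view boils down to keeping the runs that are not beaten on both loss and runtime. A library-free sketch of that filter; `pareto_front` is a hypothetical helper and the trial numbers are invented:

```python
def pareto_front(trials):
    """Keep trials that no other trial beats on both loss and runtime (lower is better)."""
    front = []
    for t in trials:
        dominated = any(
            o["loss"] <= t["loss"] and o["minutes"] <= t["minutes"]
            and (o["loss"] < t["loss"] or o["minutes"] < t["minutes"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front

# Hypothetical results of four runs.
trials = [
    {"lr": 1e-2, "loss": 0.40, "minutes": 10},
    {"lr": 1e-3, "loss": 0.30, "minutes": 45},
    {"lr": 1e-4, "loss": 0.29, "minutes": 300},
    {"lr": 5e-3, "loss": 0.45, "minutes": 20},
]
front = pareto_front(trials)
print([t["lr"] for t in front])
```

Optuna can run this kind of search directly: `optuna.create_study` accepts a `directions` list, and the objective then returns one value per direction.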

#

Yeah, we have 2 big #enterprise GPUs

#

I have optuna, tensorboard and mlflow set up nicely. All I need to implement is 1 function to have my pipelines run. I code an architecture, run it for a couple of days and then read papers to find/code up the next one.

#

Contract research, $pharma pays us to do the research and then they may or may not put the ideas in prod

#

pretty much

#

that's really nice 😮

#

My issue is, I'm very wary of "tools"

#

I've been burnt so many times trying to adopt a shiny thing in my codebase only to notice it just doesn't do what I want it to do

#

If it's Python I'm also just relatively fast at churning out code so it's always a trade-off between "will I write it myself or figure out how it works from the docs"

#

My setup is ... nonstandard. I already had existing sklearn based preprocessing and metrics. I didn't want to port all of it to work for Torch so I wrap my Torch models in a sklearn interface 🥴

wintry grail
#

I want some good resources to learn text mining in python, can anybody suggest some lec series or book?

short isle
#

do i need to learn machine learning to make ai?

past meteor
toxic mortar
#

Has anybody used tensorflow Profiler tool?

#

I want to see where my pipeline is bottlenecking. I use CPU training only and read data from SSD.

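from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

# Stop once val_accuracy has not improved for 50 epochs and restore the best weights.
stop_early = EarlyStopping(monitor='val_accuracy', patience=50, restore_best_weights=True) # 100
# Ask the TensorBoard callback to profile batches 1 through 10.
tb_callback = TensorBoard(log_dir="logs", profile_batch='1,10')
history = model.fit(
    x_train_norm, y_train,
    epochs=100, # 400
    batch_size=64,
    validation_data=(x_test_norm, y_test),
    callbacks=[stop_early, tb_callback],
    verbose=1
)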

#

But there isn't any profiler data

#

I can see other things, which means, it works

unkempt jay
#

help my install isnt working

#

in vs code

#

i used the pip install and it installed perfect

#

then i rebooted and it still says module not recognised or something

#

when i try again it says its already satisfied

serene scaffold
serene scaffold
unkempt jay
#

Import "torch" could not be resolved

serene scaffold
#

Anyway, you probably pip installed pytorch to a different environment than the one vscode is using. try running the program anyway and show the whole error message, if there is one, starting from Traceback.
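One way to check the environment mismatch is to ask the interpreter itself. A small sketch using only the standard library; `environment_report` is a hypothetical helper name. Run it with the same interpreter VS Code is configured to use:

```python
import importlib.util
import sys

def environment_report(module_name: str = "torch") -> dict:
    """Describe which interpreter is running and whether it can see `module_name`."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }

print(environment_report("torch"))
```

If `module_found` is False, installing with that exact interpreter (its path followed by `-m pip install torch`) puts the package where the script will look for it.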

serene scaffold
# unkempt jay

Please stop posting screenshots of text. I will not answer any questions you ask in the future if you keep doing this.

#

it looks like you tried running a pip install command. I'm asking you to run the python script that you're trying to write.

#

@unkempt jay I'm still available to help. what do you do to run the python program?

silk yarrow
#

i have my minor project submission tomorrow and i have one problem. my project is a density based traffic light management system which detects objects using the yolo v3 model with the coco names file, which includes the names of 80 objects, but my aim is to detect the ambulance in traffic along with all other vehicles. my project is not detecting the ambulance, please help me.

serene scaffold
thorn cairn
serene scaffold
thorn cairn
#

sorry but how do i link my posts?

serene scaffold
thorn cairn
#

ayy it does

#

there is a brief explanation there!

buoyant shoal
#
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# sklearn comes with some example data sets
from sklearn import datasets

# Import train_test_split function
from sklearn.model_selection import train_test_split
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
#Import scikit-learn MLP classifier
from sklearn.neural_network import MLPClassifier 


df = pd.read_csv("dest")
x1 = np.array(df["x1"]).reshape(-1,1)
x2 = np.array(df["x2"]).reshape(-1,1)
x3 = np.array(df["x3"]).reshape(-1,1)
Y = np.array(df["Class"])

X = np.concatenate((x1,x2,x3), axis=1)

accuracy_train = np.zeros(100)
accuracy_test = np.zeros(100)

for i in range(100):

    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5) # 50% training and 50% test


    # Create MLP classifier object
    mlp = MLPClassifier(solver='adam', hidden_layer_sizes=(20, 20), max_iter=50000)


    # Train MLP classifier
    model = mlp.fit(X_train, y_train)

    # Predict the response for training dataset
    y_pred = model.predict(X_train)

    acc_train = metrics.accuracy_score(y_train, y_pred)

    accuracy_train[i] = acc_train
    
    # Predict the response for test dataset
    y_pred = model.predict(X_test)

    acc_test = metrics.accuracy_score(y_test, y_pred)
    
    accuracy_test[i] = acc_test
    


print("Average accuracy for training data:", np.mean(accuracy_train))
print("Average accuracy for test data:", np.mean(accuracy_test))
#

Hi, dumb question but for this piece of code I'm curious why there's a variance in accuracy results

#

Is mlp.fit() and train_test_split() the two reasons why?

serene scaffold
buoyant shoal
#

i meant 50% test and 50% training

#

forgot the 0

#

(fixed)

serene scaffold
#

y_pred = model.predict(X_train)
you also predicted on the training data, and you can't use that to evaluate the model's performance.

buoyant shoal
#

and then compare

#

by changing hidden layer sizes and max iteration

#

which i've sorted but like it also asks what causes the "variance" in the accuracy results

#

i'm suspecting it's the mlp.fit() and train_test_split(), am i right?
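Both steps are random by default: train_test_split shuffles before splitting, and MLPClassifier draws random initial weights. Both accept a `random_state`. A sketch that separates the two sources, using a synthetic dataset in place of the CSV; `run` is a hypothetical helper and the sizes are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the CSV (three features, binary class).
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

def run(split_seed=None, model_seed=None) -> float:
    """Test accuracy for one split/fit; a seed of None means 'random each call'."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=split_seed)
    mlp = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=500,
                        random_state=model_seed)
    return mlp.fit(X_train, y_train).score(X_test, y_test)

print("both fixed:", run(0, 0), run(0, 0))
print("only the split varies:", [run(s, 0) for s in range(3)])
print("only the init varies:", [run(0, s) for s in range(3)])
```

With both seeds fixed the two accuracies match exactly; letting either seed vary brings the spread back.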

silk yarrow
# serene scaffold hello, you'll need to be more specific in order to get help.

my project is based on the yolo v3 model and uses the coco names file which i took from github; it has only 80 objects. My mentor asked me to detect an ambulance if it is in traffic, but my project detects it as a truck, because ambulance is not included in the coco names file. So the task is to train on a data set so that it can detect an ambulance and mark it as ambulance in the bounding box

thorn cairn
#

how do i label these semester into first year and not first year?

odd meteor
# thorn cairn how do i label these semester into first year and not first year?

I suppose it depends on the information on the data you have and the location/school it was collected from.

Is 1, 2,and 3 the only unique values in that column?

A 2 years masters program for example, usually has 4 semesters. Year 1 would correspond to semester 1 & 2, and Year 2, semester 3 & 4.

In your case it appears the program only has 3 semesters (presumably it's a program with 1.5 years completion time.)

If that's the case, semester 1 & 2 should correspond to year 1. The rest, semester 3, becomes 6 months (0.5 years).

You just have to investigate further to figure how it works over there.
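
If semesters 1 and 2 do turn out to be the first year, the labelling itself is short in pandas. A minimal sketch, assuming the column is called `semester` (that name and the sample values are made up):

```python
import pandas as pd

# Hypothetical data: the column name "semester" is an assumption.
df = pd.DataFrame({"semester": [1, 2, 3, 1, 3, 2]})

# Semesters 1 and 2 make up the first year; semester 3 does not.
df["first_year"] = df["semester"].isin([1, 2])

# Optional text label instead of a boolean.
df["year_label"] = df["first_year"].map(
    {True: "first year", False: "not first year"}
)

print(df)
```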

odd meteor
vestal spruce
#

Hi, is anyone familiar with Hugging Face's Transformers library? I'm trying to fine-tune an ASR/speech-to-text model with the library but idk how to feed my dataset. The dataset is a 30 second audio file as the "feature" and a text saved in a notepad file as the target. If I want to feed this data, can I just use a simple list/numpy array, or do I need to turn it into a tensor first? Any form of help is appreciated, thanks in advance 🙏

dawn light
#

is there a type of ML recommender system where in addition to the usual collaborative/content filtering, I can add/specify/enhance other dimensions (not sure if i phrased that correctly)?

For example, I'd like to look for a movie that's similar to The Matrix but isn't sci-fi, or a movie/series that's similar to Star Wars but anime (this would probably be Legend of the Galactic Heroes), or Death Note but romance (Kaguya-sama)

Or maybe even something like "Something like movie X but not Y or Z" (e.g. something like stranger things but not like Dark)

Can anyone point me to resources on how to build something like this? Thanks!

neon island
# dawn light is there a type of ML recommender system where in addition to the usual collabor...

Collaborative/content filtering doesn't need to be confined to 1-dimension (a scalar). Examples may only show scalar ratings to keep things simple. You can find the k-nearest neighbors based on vectors of any dimensionality.

To identify series "not like" Dark, point their vectors in the opposite direction (multiplying by -1). Make them "far away" for your distance function, so they are dissimilar for recommendation purposes.

Another RecSys mechanism is a Graph Neural Network (GNN). Different edge labels correspond to what you're thinking of as dimensions. A graph visualization may be easier to imagine, and GNNs can learn an optimal recommendation algorithm.
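
The "multiply by -1" idea can be sketched with plain NumPy. This is only an illustration: the three feature axes, the scores, and the placeholder titles "Show A" and "Show B" are invented, and cosine similarity is just one possible choice of similarity function.

```python
import numpy as np

# Made-up feature vectors (sci-fi, mystery, romance); purely illustrative.
titles = ["Stranger Things", "Dark", "Show A", "Show B"]
features = np.array([
    [0.8, 0.7, 0.2],   # Stranger Things
    [0.9, 0.9, 0.0],   # Dark
    [0.1, 0.6, 0.5],   # Show A (hypothetical)
    [0.9, 0.8, 0.1],   # Show B (hypothetical)
])


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def recommend(like, dislike=None, weight=1.0):
    """Best match for `like`, optionally pushed away from `dislike`."""
    query = features[titles.index(like)].copy()
    exclude = {like}
    if dislike is not None:
        # Add the disliked title with a negative sign.
        query = query - weight * features[titles.index(dislike)]
        exclude.add(dislike)
    scores = {
        t: cosine(query, features[i])
        for i, t in enumerate(titles)
        if t not in exclude
    }
    return max(scores, key=scores.get)


plain = recommend("Stranger Things")
filtered = recommend("Stranger Things", dislike="Dark")
print(plain, "|", filtered)
```

With these invented numbers the plain query picks the title closest to both shows, while subtracting Dark moves the recommendation to the one that differs from it most.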

dawn light
toxic mortar
#

I left this random search to run overnight. Does this mean that this model architecture has capped performance at 85% and I should change something in either the data or the architecture?

deep veldt
#

Can someone give me an example labelled dataset and unlabeled? I'm new I've searched it up but all I got was a long explanation with no example

agile cobalt
deep veldt
#

thanks

deep veldt
#

How do I train images?

serene scaffold
serene scaffold
deep veldt
agile cobalt
#

technically there are some things you could call "training an image", but these are almost definitely not what you are looking for - in particular Style Transfer

99.99999% of the time you are training models, not images/texts/prompts/etc.

serene scaffold
#

and there needs to be some way to know which image is which. like a text file structured like

image,animal
1.jpg,cat
2.jpg,cat
3.jpg,dog
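
A small sketch of how such a file could be read into labels for training. It uses only the standard library; the file content is inlined, and the integer class ids are an assumption about what the training code will want:

```python
import csv
import io

# The label file from the message above, inlined so the sketch is self-contained.
labels_csv = """image,animal
1.jpg,cat
2.jpg,cat
3.jpg,dog
"""

reader = csv.DictReader(io.StringIO(labels_csv))
label_by_image = {row["image"]: row["animal"] for row in reader}

# Map class names to integer ids, which is what most training code expects.
classes = sorted(set(label_by_image.values()))
class_to_id = {name: i for i, name in enumerate(classes)}
targets = {image: class_to_id[label] for image, label in label_by_image.items()}

print(targets)
```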
deep veldt
serene scaffold
past meteor
deep veldt
#

Are there any good courses, resources that I can learn on?

serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

deep veldt
#

thanks

twilit flower
#

Creating a group for ml looking for friends

stoic gorge
#

My partner has an interesting problem involving Jaccard index...
Comparing >300k unique subsets A (2^A), where |A| > 300 - and for each such set finding sets with which it has minimum Jaccard index...

We were thinking about representing the sets as 300 bits- that gives us fast calculation of the index itself (because bitwise operations), so only the number of calculations makes it costly -
brute force of everything-to-everything is (300k)² operations.

Does anyone have any ideas how to get it lower? We were thinking about clustering it somehow but |2^A| is so big it's hard to think of something that makes sense (there's a lot of pairs that don't intersect at all).
Or what to use to optimise the speed of calculations - I know basically nothing of numpy but there might be some methods to make such repetitive calculations fast?
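
On the numpy side, one option is to store the sets as rows of a 0/1 matrix, so all intersections in a chunk come from a single matrix product. This does not reduce the (300k)² comparisons, it only makes each chunk cheap. A minimal sketch on tiny made-up data; `jaccard_block`, `least_similar` and the chunk size are names and values invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 6 sets over a universe of 300 elements, one row per set.
membership = rng.random((6, 300)) < 0.1
membership[0, :] = False
membership[0, :3] = True      # set 0 = {0, 1, 2}
membership[1, :] = False
membership[1, 2:6] = True     # set 1 = {2, 3, 4, 5}


def jaccard_block(block, everything):
    """Jaccard index of every row in `block` against every row in `everything`."""
    a = block.astype(np.float32)
    b = everything.astype(np.float32)
    intersection = a @ b.T                      # |A ∩ B| for all pairs
    sizes_a = a.sum(axis=1)[:, None]
    sizes_b = b.sum(axis=1)[None, :]
    union = sizes_a + sizes_b - intersection    # |A ∪ B|
    # Two empty sets have union 0; their index is defined as 0 here.
    out = np.zeros(union.shape, dtype=np.float32)
    return np.divide(intersection, union, out=out, where=union > 0)


def least_similar(membership, chunk=2):
    """For each set, the index of the other set with the smallest Jaccard index."""
    n = membership.shape[0]
    result = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk):
        block = jaccard_block(membership[start:start + chunk], membership)
        rows = np.arange(block.shape[0])
        block[rows, start + rows] = np.inf      # ignore self-comparison
        result[start:start + block.shape[0]] = block.argmin(axis=1)
    return result


pair_01 = jaccard_block(membership[:1], membership[1:2])[0, 0]
partners = least_similar(membership)
print(pair_01, partners)
```

Working in chunks keeps the intermediate similarity matrix small; only one chunk-by-n block is in memory at a time.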

plush bobcat
#

Hey there, I don't know if it's the right channel to ask but:

Guys I want to learn a second language after python, I've just started learning ML and want a lang to help me out in that field

Which one do you recommend and why,
Rust or C++?

And yes, I want it for ML/AI primarily, and maybe I could gain some insights into how things work, the compilers, interpreters, all of this stuff

plush bobcat
serene scaffold
plush bobcat
serene scaffold
plush bobcat
#

Yea

#

I underestimated math for some reason, I'll be on it

#

Thanks

serene scaffold
serene scaffold
#

also replace "computer science" with "ml"

plush bobcat
#

True af xd

#

Maths kinda intimidating but I've heard it's like a language, the moment you get fluent you'll be obsessed with it

#

And yea every bit of computer related stuff were made by math

wooden sail
#

the "computer" in "computer science" deals with "computability" in mathematics: can you perform a certain action/do a computation in a finite number of well-described steps

#

traditionally CS is a branch of mathematics

#

(not anymore, now it largely depends on the university cuz it can also mean other stuff)

plush bobcat
#

University just gets you into details, as you said, CS is a branch of math

#

And your explanation of "computer" and math was brilliant

iron basalt
iron basalt
# serene scaffold related meme

Add geometry, trigonometry, linear algebra, calculus (yes, you need to learn it if you want your physics to not be buggy garbage (on the other hand, it gives speedrunners more to work with)), and more depending on the specific game.

#

(If you are responsible for the graphics, you got a whole lot more to learn)

worthy shoal
#

opengl looks fun but i have 0 reason to learn it

iron basalt
#

It's technically legacy now, since Vulkan is OpenGL 5.x (it was originally supposed to be the next version of OpenGL). But for a while it will still be around, because not everything has good Vulkan drivers yet (or ever will).

worthy shoal
#

it's like 30 times harder though

iron basalt
#

Apple is putting the nail in the coffin though.

iron basalt
worthy shoal
#

yeah, i can't find graphics programming applicable other than in game development, maybe it could be a fun experience applying math to it and whatnot

iron basalt
#

GPUs used to be just for graphics, now they are pretty general.

worthy shoal
#

how can vulkan be possibly used in ml?

iron basalt
worthy shoal
#

aren't there better alternatives? it seems foreign to me that you'd use a graphics library like vulkan to do that

iron basalt
#

It's pretty normal to use a graphics library like Vulkan for this. There are some alternatives, they are all very similar. Ones like CUDA are Nvidia only, Vulkan can even run on mobile.

#

Also CUDA can't render graphics on its own to a window, Vulkan has all the normal graphics stuff.

#

(Without extensions)

worthy shoal
#

i'm gonna guess that something like this is easier than making a game with it

iron basalt
#

Vulkan is an open standard, like OpenGL. There is also stuff like OpenCL, which is more like CUDA, but not just Nvidia.

#

Then there are some others.

iron basalt
worthy shoal
#

openCL would be fun, if there were more resources with C++ 😄

iron basalt
#

IMO OpenCL has the least boilerplate, and is the overall best API.

#

OpenCL technically also runs on more than just GPUs, it can do CPU, FPGA, etc.

craggy agate
#

I need to talk to someone who is good at computer vision... Could you please DM me?

iron basalt
#

Beyond this, GPU shader languages matter more, which includes stuff like CUDA, OpenCL C, HLSL, GLSL. For this CUDA or OpenCL C. CUDA if you already have an Nvidia GPU.

#

You could add other high level languages, but Python has kind of won that battle.

#
  • You don't have to be really good at C or C++ or Rust, you more importantly just have to be able to read it.
#
  • If you can read it, you can read other's code in open source projects to learn how to use them well / mimic them.
hollow escarp
#

Any solutions for installing torch in alpine dockers?

neon island
# dawn light can you elaborate on how to use KNN for this? I'm only a bit familiar with the ...

'Star Wars' can be represented as a vector in n-dimensions having n scores from [0,1] in features like 'genre: sci-fi', 'producer: george lucas', 'best-picture: 1977', etc.

Reduce them down to a vector i on an arbitrary i-axis representing some 'ideal" measure of "Star Wars"-ness. Allowing for movies that out-Star Wars the original Star Wars, let's assume Star Wars has a score of 0.9977 from all of its n components projected onto i.

Add 2 basis vectors j and k along the j-, k-axes representing Japanese-ness and cartoon-ness. These aren't necessarily orthogonal to the i-axis (Star Wars borrowed from the 7 Samurai, so it may already have a 0.25j component embedded within itself).

j x k is a 2-D plane where (1, 1) represents 'anime'. i x j x k is a 3-space where the movies most similar to Star Wars are nearest to (0.9977, 1, 1) when their n-dimensional vectors are projected down onto this space.
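
A rough sketch of that last step: once every movie has coordinates on the three axes, "Star Wars but anime" is just the nearest neighbour of the point (0.9977, 1, 1). The coordinates below are invented for illustration; real ones would come from projecting the n-dimensional feature vectors.

```python
import numpy as np

# Made-up coordinates on the (Star Wars-ness, Japanese-ness, cartoon-ness) axes.
movies = {
    "Space Cruiser Yamato": (0.80, 0.95, 1.00),
    "Seven Samurai":        (0.40, 1.00, 0.00),
    "Spaceballs":           (0.85, 0.00, 0.00),
    "Titanic":              (0.05, 0.00, 0.00),
}
target = np.array([0.9977, 1.0, 1.0])   # "Star Wars, but anime"

names = list(movies)
points = np.array([movies[n] for n in names])

# Euclidean distance from every movie to the target point.
distances = np.linalg.norm(points - target, axis=1)
ranking = [names[i] for i in np.argsort(distances)]

print(ranking)
```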

serene scaffold
hollow escarp
#

I'm using Mender to deploy my code to my devices, and for application deployments I need to create a Dockerfile which will contain all the necessary libs for running the script, and one of them is ultralytics

#

My app runs on python 3.11.0

serene scaffold
hollow escarp
serene scaffold
hollow escarp
serene scaffold
hollow escarp
serene scaffold
craggy agate
#

Is anyone here familiar with computer vision? If yes, could you please DM me, I need advice for a project. Thank you.

serene scaffold
craggy agate
serene scaffold
craggy agate
serene scaffold
craggy agate
#

Can anyone give me some advice on an object/target tracking project? It would consist of my drone using its front camera to detect me and start tracking me and making appropriate decisions to keep me centred in its video feed as I move further away or out of its frame.

serene scaffold
#

@hollow escarp I got curious and tried to install pytorch in an alpine container. and once I finally got it installed, I couldn't import it because of some missing OS dependency.

but just installing pytorch makes the container more than two GB, so you might as well start with a more substantive base image.

past hearth
#

hi

serene scaffold
past hearth
#

Strange question that includes other domains. I'm getting a ValueError when using pandas.apply

df[col] = df[col].apply(
  lambda x: (
    x.strftime(...) # <- vscode raises exception here
    if ((not pd.isnull(x)) and (x != ""))
    else x
  )
)

This one in particular:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#using-if-truth-statements-with-pandas

I'm running this python code with VScode debugger. And it only occurs when I have "raised exceptions" ticked. When it's not ticked, the program runs smoothly and it SEEMS like theres no issues - as in the apply function is doing what it was intended to do.

At the point of the raised exception, the x value is the whole column in an array, so I understand why the error can show up?

type(x) # <pandas.core.indexes.datetimes.DatetimeIndex>

I have "raised exceptions" ticked and then press Step Over, and execution continues successfully - with each step transforming each value to readable timestamp.

The deployed version of this code runs without exceptions. Am I going insane? or is there an issue with vscode or the python debugger... or python interpreter? (using python3.10 in a venv)
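
The `DatetimeIndex` seen at the breakpoint would be consistent with pandas first trying the function on the whole index and falling back to element-wise mapping when that raises; that has not been checked against the pandas source for this version, so treat it as a guess. Either way, the per-element lambda can be avoided when the column really has a datetime dtype. A minimal sketch, assuming a column named `ts` (hypothetical) and an arbitrary format string:

```python
import pandas as pd

# Hypothetical column of timezone-aware timestamps with one missing value.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-04-30 00:13:46+00:00", None], utc=True)
})

# .dt.strftime formats the whole column at once and leaves missing values missing.
df["ts_text"] = df["ts"].dt.strftime("%Y-%m-%d %H:%M:%S")

print(df)
```

This only applies to datetime columns; a column that mixes strings such as `""` with timestamps would need converting first.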

serene scaffold
dawn light
past hearth
past hearth
# serene scaffold can you show `print(df.head().to_dict('list'))` as text (no screenshot)?

I did only head(2), because 5 is too large. Also changed all the values

# df.head(2).to_dict('list')
{'aa': ['2024-04-30 00:05:00+1234', '2024-04-30 00:10:00+1234'], 'bb': ['123123123', '123123123'], 'cc': ['123123123', '13123123'], 'dd': ['2309rjei230', '2309rjei230'], 'ee': ['', ''], 'ff': ['', ''], 'gg': ['', ''], 'hh': ['', ''], 'ii': ['', ''], 'jj': [0.0, 0.0], 'kk': ['U', 'U'], 'll': ['filename.json', 'filename2.json'], 'll': ['123123123', '123123'],
     'mm-timestamp': [Timestamp('2024-04-30 00:13:46+1234', tz='timezone/timezone'), Timestamp('2024-04-30 00:13:46+1234', tz='timezone/timezone')], 'nn': [0.0, 0.0], 'oo': ['edwed', 'edwed'], 'pp': ['wed', 'wde'], 'qq': [None, None], 'rr-timestamp': [Timestamp('2024-04-30 00:18:50.400544+1234', tz='timezone/timezone'), Timestamp('2024-04-30 00:18:50.400544+1234', tz='timezone/timezone')], 'ss': [True, True], 'tt': ['vee', 'vee'], 'uu': ['123', '123'], 'vv': [0.0, 0.0], 'ww': ['A', 'A'], 'xx': ['', ''], 'yy': [Timestamp('2024-05-01 00:01:08+1234', tz='timezone/timezone'), Timestamp('2024-05-01 00:01:08+1234', tz='timezone/timezone')], 'zz': ['qwerqwer', 'qwerqwer'], 'az': ['weqrt', 'weqrt']}

swift fulcrum
#

hi, i was considering learning prompt engineering what course would yall suggest?

quaint loom
#

Is there anyone here who have performed redundancy analysis (RDA)?

hollow escarp
past meteor
#

Maybe a better question, any reason why you can't export your torch model to ONNX and then make/use an ONNX image without needing the torch dependency?

hollow escarp
#

More like i created license plate recognition

#

Which uses YOLOv8 to get Location of license Plate of img

past meteor
#

Through what are you using Yolo? Darknet? Torch? Tensorflow? Opencv?

neon island
# dawn light thanks again! what would you suggest I do to obtain the features? I'm guessing ...

See the fine suggestion of the sklearn.neighbors.KNeighborsClassifier in scikit-learn earlier by @data.exs

It's easy to pick out bespoke features in a specific example, harder when answering a more general query and your data set has 1000s of features like you can find on IMDb ($), TMDB, MovieLens or try a small hand-crafted pd.DataFrame with just Star Wars, Space Cruiser Yamato, 7 Samurai, Spaceballs, Family Guy's Star Wars Parody, Titanic and Rocky to start with.

Feature engineering is trying different sets of features when training your classifier to see which features produce the best recommendations.

hollow escarp
past meteor
past meteor
#

And then you can use this on your raspi https://onnxruntime.ai/

hollow escarp
past meteor
# hollow escarp Okay, also i should add that i need to build it to docker to deploy that docker ...

Sure. You basically have 2 steps now. You need to compile the torch model to ONNX and then load that into a Docker image that has the ONNX runtime.

Personally I'd do this with a multi-stage build. In the first stage I'd basically use the Pytorch base image I linked initially and install all packages I need on top of it to build the ONNX file.

The second stage is one that has the ONNX runtime, you copy the file you made in stage 1 and you're ready 🚀.

(I simplified it, in practice I'd have at least 3 steps but that's ok. You'll at least use 2 images, that's the important point).

I'd start out testing this workflow in a notebook first to see if it works and so on. A massive advantage is that you don't need to ship 2 gigs of torch to your platform that is just doing inference (the raspi)

#

You can even make it simpler and just compile and version control the binary you get from compiling the torch model. You can do that if you're certain it won't change. Your Dockerfile becomes easier then.

hollow escarp
#

Okay, and also i have model in .pt format ( it's like 20k img model ) which was trained by someone else

#

Isn't that any problem for converting it to ONNX format?

past meteor
#

What do you mean with 20k img model?

hollow escarp
past meteor
#

Just fyi, the model doesn't contain the images. Example: If you have 1 single neuron and run 10000k images on it it'll still be small.

hollow escarp
#

Ye ye i know that

past meteor
#

So why did you mention it? Maybe I'm missing something.

hollow escarp
#

but I'm just asking how to convert that .pt model to an ONNX-supported format?

past meteor
#

It's in the link I sent you

hollow escarp
#

oh ye, thx

past meteor
#

If you don't mind, I'll stop answering. I gave you a lot of information to digest and I think you should read the links, some other docs, let it sink in etc. and if you have more questions afterwards just tag me 👊

hollow escarp
#

Okay, im really glad for your support

wide wolf
#

So I started working on a python ML project few weeks ago and I don't know Python, ML or Pandas previously. I've got some questions regarding my dataframe structure. Is this a channel I could ask this stuff in?

#

So I'm doing ML regarding stock companies and their quarterly results. So I got 4 rows of quarterly results for a company X which I put into a dataframe, and then I use multiindex to store all these rows together. My reasoning was that if I flatten the dataframe then the ML model won't be able to 'identify' the 4 rows belonging to a specific company.

#

So '181', '356' and '59' are company IDs here

#

Will this work, or am I messing it all up?

#

Basically I'm attempting to make a pandas Panel, I think (but that's been deprecated)

#

yeah

#

yeah, but I mean more specifically that I'm using multiindex and 'grouping' them (0-4) as you see here

#

but if above looks fine/normal to you then I guess my above approach is fine

#

what missing indices?

past meteor
#

Haven't been following the conversation but ordinal encoding is really bad

#

I'd say always do one hot unless you're doing a decision tree and even when you are it's still dangerous

#

Maybe in NLP but in tabular datasets is not good

#

Imagine you have small, medium and large and you do an ordinal encoding, you're saying large is 3x small

#

That's typically the danger

wide wolf
#

I can't onehot-encoding here since I would get dimension scaling beyond what's reasonable, since every 4 rows is a unique company

#

but since its always 4 rows for each company, I figured there wouldn't be an issue with hierarchy

#

since it's all 'balanced'

past meteor
#

Target encode them then
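
The three encodings mentioned here, side by side in pandas. The column names and numbers are invented, and the target encoding below is the naive version (category mean over all rows); in practice the means would be computed on the training split only to avoid leaking the target.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "size":   ["small", "medium", "large", "small", "large", "medium"],
    "target": [10.0, 20.0, 90.0, 14.0, 110.0, 26.0],
})

# Ordinal encoding: implies large (3) is three times small (1).
df["size_ordinal"] = df["size"].map({"small": 1, "medium": 2, "large": 3})

# One-hot encoding: one column per category, no implied order or spacing.
one_hot = pd.get_dummies(df["size"], prefix="size")

# Target encoding: replace each category with the mean target seen for it.
df["size_target_enc"] = df.groupby("size")["target"].transform("mean")

print(df)
print(one_hot)
```

Target encoding keeps a single column however many categories there are, which is why it comes up when one-hot would add a column per company.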

late spear
#

Is it possible to create a conversational AI that diagnoses a patient's mental illness? The input would be the patient's speech converted to text and their facial expressions recognized for emotions,?

past meteor
#

Yeah, that's the reason. Given enough data it should sort itself out but it's a last resort approach imo

#

I'm being pedantic though 😂

wide wolf
#

Sorry for spamming, but I'm trying to work out if I'm messing up when I'm merging my different dataframes. So I store like 5 companies in 'dfs' (4 rows each, so 20 rows total), enumerate over them and put them in a hash table with 'i' being the key. And then concat().

Since you were talking about 'missing indices', I dunno if my mistake here is that I need to for-loop over each of the 4 rows as well and assign the company ID ('i') to them, or if the behaviour in the screenshots is all correct
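
For what it's worth, `pd.concat` on a dict already labels every row with its key, so an inner loop over the 4 rows should not be needed. A minimal sketch with made-up revenue numbers; the company IDs are the ones mentioned earlier, while `make_frame` and the level names are invented:

```python
import pandas as pd


def make_frame(scale):
    """Hypothetical quarterly frame: 4 rows for one company."""
    return pd.DataFrame({"revenue": [scale * q for q in (1, 2, 3, 4)]})


dfs = {181: make_frame(10), 356: make_frame(20), 59: make_frame(30)}

# Passing a dict to concat uses its keys as the outer index level,
# so every row is labelled with its company.
panel = pd.concat(dfs, names=["company_id", "quarter"])

# Rows for one company, and a per-company aggregate.
company_356 = panel.loc[356]
totals = panel.groupby(level="company_id")["revenue"].sum()

print(panel)
print(totals)
```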

past meteor
#

I am as well, but I'd say benchmark it for tabular data. It can easily be a hyperparameter.

#

Just empirical evidence showed me it's usually bad 🤷

wide wolf
#

The whole reason I'm doing all of above is because I have data over time (quarters) and I dunno how else to group it together for my ML code. I can't do avg or mean and just 1 row since that doesn't capture change over time. I could skip the multiindex part (just flatten the big dataframe) but then I'm worried the ML code won't be able to identify that each 4 rows are 'tied together'
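
One common way to keep the four quarters "tied together" for a model that only sees flat rows is to reshape to one row per company with one column per quarter. A sketch under assumptions: the column names, the numbers and the growth feature are all invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per company per quarter.
long_df = pd.DataFrame({
    "company_id": [181] * 4 + [356] * 4,
    "quarter":    [1, 2, 3, 4] * 2,
    "revenue":    [10, 12, 15, 19, 50, 48, 47, 40],
})

# One row per company, one column per quarter.
wide = long_df.pivot(index="company_id", columns="quarter", values="revenue")
wide.columns = [f"revenue_q{q}" for q in wide.columns]

# A simple change-over-time feature derived from the quarterly columns.
wide["revenue_growth"] = wide["revenue_q4"] / wide["revenue_q1"] - 1

print(wide)
```

Each company is then a single training example, and the change over time is visible to the model as separate columns.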

#

I'm new to all of this so dunno best practices

#

whats NLP?

past meteor
#

Yeah I feel like these are dying art

narrow tiger
#

so what does being data scientist mean?
just someone who can run some data through an algo to generate meaningful charts?
what do i need to learn so i can call myself data scientist

#

to get good at machines learning algos i think i'll need alot of maths

past meteor
#

No, tabular data is a lot more effort and the results are very variable so it seems like it's very much in the trough of disillusionment

wide wolf
#

Above is pre encoding/split. I'm doing SQL queries to get 4 rows for each company and then merging it together into a multiindex

#

its multiindex

#

pandas panel thing

#

I think

past meteor
#

I avoid multi index stuff etc. as the plague

wide wolf
#

Yeah I could flatten it all, but then row 5 (currently '0 VTVT') will be '5 VTVT'

#

so I lose the 0-4 grouping, and I dunno the impact of that for my ML learning code

past meteor
#

As much as I dislike Pandas, you got to learn it

#

I know you do

#

I'm mostly referring to the other user

wide wolf
#

I'll run a few tests without multiindex, see what happens

past meteor
#

I'm a big Polars fan ofc but what bugs me are breaking changes

#

But pandas has a fair share of those as well tbh

#

Very true

wide wolf
#

Btw, coding with ChatGPT as a helper is amazing

#

made my python learning 10x easier

lapis sequoia
#

Heya
when we have a matrix
M=np.arange(1,11).reshape(5,2)
what does M[2] mean here?
what would M[1][1] be like and why?

deep veldt
#

differences between attribute and feature?

craggy agate
#

M[2] is the third row of the matrix
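
Spelled out for both questions, as a quick check in NumPy:

```python
import numpy as np

M = np.arange(1, 11).reshape(5, 2)
# M is:
# [[ 1  2]
#  [ 3  4]
#  [ 5  6]
#  [ 7  8]
#  [ 9 10]]

third_row = M[2]        # index 2 is the third row: [5 6]
element = M[1][1]       # second row [3 4], then its second entry: 4
same_element = M[1, 1]  # the same element in a single indexing step

print(third_row, element, same_element)
```

`M[1][1]` works because `M[1]` is itself a 1-D array, which is then indexed again; `M[1, 1]` reaches the same element in one step.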

craggy haven
#

Guys do you think it is going to be useful to learn cuda for machine learning jobs?

craggy agate
past meteor
deep veldt
serene scaffold
serene scaffold
craggy agate
desert oar
#

the "data scientist" job title itself has come to indicate a kind of generalist jack-of-all-trades role, analogous to "full-stack developer" in software development. bigger organizations tend to have more specialized titles.

#

no one person can be good at all of it. so typically industry data scientists end up being really good at a few things and less good at other things, and self-select into jobs that make sense for their skills, and also tend to work on upskilling throughout their careers

#

understated but important job characteristics include writing/communication skills and project planning

narrow tiger
agile cobalt
#

data science is literally applied statistics

#

machine learning is just statistics wearing a hood

#

if you want to focus on programming, then do normal programming | backend development
if anything maybe model deployment/devops/mlops might be closer to what you are thinking, but that still depends on the place

narrow tiger
#

What do people do in machine learning engineering jobs and with LLMs?

#

Ig stuff will make sense soon enough thanks everyone for answering

#

In this video, I will guide you through the entire process of deriving a mathematical representation of an artificial neural network. You can use the following timestamps to browse through the content.

Timecodes
0:00 Introduction
2:20 What does a neuron do?
10:17 Labeling the weights and biases for the math.
29:40 How to represent weights and ...

#

Doesn't seem very complicated so far, hopefully this is mid lvl 😂

#

Any thoughts

desert oar
narrow tiger
#

Thanks

hollow escarp
#

@past meteor so i converted my pt model to onnx model ( using following command yolo export model=<my_model> format=onnx imgsz=640,640 ) and now im having trouble with reading correct values. Before my script for getting correct box places was:

def detect_closest_license_plate(image, model: YOLO) -> ClosestPlate:
  prediction = model(image)[0]

  camera_center_x, camera_center_y = image.shape[1] // 2, image.shape[0] // 2
  closest_plate: ClosestPlate = None
  closest_distance = float('inf')

  for license_plate in prediction.boxes.data.tolist():  # Assuming prediction.xyxy[0] contains bounding box predictions
    x1, y1, x2, y2, conf, cls = license_plate  # Extract bounding box coordinates and confidence
    plate_center_x, plate_center_y = (x1 + x2) // 2, (y1 + y2) // 2
    distance = np.sqrt((plate_center_x - camera_center_x)**2 + (plate_center_y - camera_center_y)**2)
    if distance < closest_distance:
      closest_plate = ClosestPlate.from_dict({
        'bbox': (x1, y1, x2, y2),
        'confidence': conf,
        'class_': cls,
        'plate_center': (plate_center_x, plate_center_y),
        'distance_to_camera': distance
      })

      closest_distance = distance
    
  return closest_plate

Now I'm getting my predictions this way: pred = session.run(None, {"images": to_numpy(proccess_image("./test_photos/test.jpg"))})

And I can't find the corresponding values

That's the output:

[array([[[     22.407,       27.52,      37.087, ...,      560.26,      582.73,      612.35],
        [     6.8977,      6.7349,      5.8882, ...,      628.67,       628.4,      626.59],
        [     13.573,      12.881,      11.815, ...,      275.83,      318.33,      388.62],
        [ 6.5565e-06,  3.8147e-06,  4.2707e-05, ...,  9.4175e-06,   5.126e-06,  1.8477e-06]]], dtype=float32)]
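
A possible way to read that array, as a sketch only. It assumes the usual Ultralytics YOLOv8 detection export layout of `(1, 4 + num_classes, N)` with rows centre x, centre y, width, height, then one score row per class; the paste above shows only four rows, so `pred[0].shape` is worth checking first. `decode_yolov8`, the threshold and the fake input are all made up for illustration.

```python
import numpy as np


def decode_yolov8(output, conf_threshold=0.25):
    """Turn a (1, 4 + num_classes, N) array into rows of x1, y1, x2, y2, conf, cls."""
    preds = output[0].T                      # (N, 4 + num_classes)
    boxes_cxcywh = preds[:, :4]
    class_scores = preds[:, 4:]
    cls = class_scores.argmax(axis=1)
    conf = class_scores.max(axis=1)

    keep = conf >= conf_threshold
    cx, cy, w, h = boxes_cxcywh[keep].T
    # Convert centre/size boxes to corner coordinates.
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    return np.column_stack([xyxy, conf[keep], cls[keep]])


# Fake single-class output with 3 candidates, only to show the shapes.
fake = np.array([[
    [320.0, 100.0, 500.0],   # centre x
    [320.0, 100.0, 500.0],   # centre y
    [ 40.0,  20.0,  60.0],   # width
    [ 20.0,  10.0,  30.0],   # height
    [ 0.90,  0.01,  0.60],   # class-0 score
]], dtype=np.float32)

detections = decode_yolov8(fake)
print(detections)
```

With the real session this would be called as `decode_yolov8(pred[0])`. The boxes are in the coordinates of the 640x640 model input, not the original photo, and overlapping candidates would still need non-maximum suppression, which the `YOLO` wrapper did internally.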
limber token
#

How would you guys go about publishing ML code that needs a very large dataset for it to work? Compress it using parquet and unpack it via code? The code is on GitHub, I'm more worried about the dataset

past meteor
past meteor
#

Can't you just train, persist the model and then use it?

buoyant vine
# limber token How would you guys go about publishing ML code that needs a very large dataset f...

Depends what you are doing.

At work we store data via safetensors https://huggingface.co/docs/safetensors/index and use DVC https://dvc.org/ for managing the large files with Git and S3.

If you're not going to be pulling often, you can use Git LFS with GitHub as well, but it gets expensive quickly on bigger datasets (i.e. objects > 100MB)

plush bobcat
# iron basalt C, then C++. C because it's the lingua franca of the programming world, lets you...

So in conclusion:

• C/C++ because they're the mother languages, and by learning them (not mastering, just well enough to read and mimic source code) I can make speed/performance-critical applications and understand code better in general

• C/C++, because they're so old that almost everywhere you look is C/C++ and not Rust; not that you can't write the same program in Rust, C/C++ just has more source code written in it

• I just kinda need to be able to read C/C++ for CUDA

Am i right?

past meteor
#

I'm using MLflow + optuna + tensorboard + dagster

#

I could add more tooling, if it's worth it

buoyant vine
#

So yes and no, I both love and hate it.

If you have a Git LFS setup, then use git LFS it is just so much smoother in terms of configuration and pulling changes.
Otherwise if you have big objects and need to store on S3, then DVC is great for that, but it is more manual than git LFS and has a pretty awful caching system that often requires you to delete the entire local cache in order to pull new files from the remote.

#

But, it is the only modern tool that supports effectively big objects tracked by git... On S3 or other storage with minimal setup

#

๐Ÿ˜… So I guess it is a tradeoff

past meteor
#

Would you use it for tabular data?

#

That I'm just storing in postgres tbf

#

All object related stuff is in minio DB, we could use DVC there

buoyant vine
#

I would use it just for managing dataset files or big binary files with git only

#

The repro stuff, param tracking, etc... Is completely useless IMO and you'll spend most of your time debugging why DVC isn't working than actually doing the runs or the code

past meteor
#

good to know, those are all new features as well

#

I looked into DVC a couple of years ago and that wasn't there afaik

buoyant vine
#

That being said, My experience of PyTorch Lightning + Neptune has been excellent for tracking artefacts, metrics, etc...

desert oar
plush bobcat
#

btw guys i got a question,

TensorFlow or Pytorch

i'm just about to start learning ML and i cant decide which one's better

plush bobcat
past meteor
# plush bobcat why?

Simply because it's more common nowadays. using the most popular tool has a lot of merit

desert oar
buoyant vine
# desert oar safetensors is new to me ๐Ÿ‘€

Safetensors is excellent, for us at least 😅 Since we train models via Python but deploy via Rust, it is often very useful to have that simple-to-use API which both langs can use

#

and store data efficiently and load quickly

desert oar
past meteor
#

Pretty satisfied with my MLflow + dagster + lightning + optuna setup but repro on the data side is pretty bad for me

#

I think I'll just start logging the git commit with my experiments
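
A small sketch of that idea using only the standard library. The function name and the `run_metadata` dict are made up; the hash could then be attached to the run in whatever tracker is in use, for example as a tag.

```python
import subprocess


def current_git_commit():
    """Return the current commit hash, or None outside a git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a repository, or git is not installed.
        return None
    return result.stdout.strip()


# Store the hash next to the other experiment parameters.
run_metadata = {"git_commit": current_git_commit()}
print(run_metadata)
```

The hash only identifies the code if the working tree is clean, so uncommitted changes at training time would not be captured.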

desert oar
#

also i think DVC is really useful for sharing data within a team, dvc push/pull specifically

buoyant vine
past meteor
#

Then I can freely change things but I can get repro easily by checking out that commit 👀

desert oar
#

what's the value of neptune vs. mlflow or any of the 1e12 other options out there right now?

buoyant vine
desert oar
past meteor
#

Then I'll have to look closely at DVC tomorrow

#

I always MacGyver until it gets bad and it's getting bad right now 😂

buoyant vine
#

UI is great, integrations are excellent, etc...

past meteor
#

Oh yeah, I think a big caveat is I'm doing things on-prem

#

that's why I went with MLflow I think

buoyant vine
#

Makes sense

desert oar
#
desert oar
#

(compared to mlflow for example)

#

snowflake also recently rolled out a model registry thing, we might start using that to deploy models directly in-warehouse

#

not sure if it has any useful tracking/versioning features though

past meteor
#

proprietary saas
[...]
snowflake

buoyant vine
desert oar
#

buying a new product would be harder with our current finances

past meteor
#

snowflake is the one cloud tool I'm really not familiar with

#

is it basically just like big query

#

but not big query?

desert oar
#

we've already significantly reduced our snowflake usage, to the point where it's an issue for our contract renewal

past meteor
#

typical separation of storage and compute, data in buckets, snowflake compute puts a view over it and you can query it with SQL and pay through your nose?

desert oar
past meteor
#

Meant SQL there*

desert oar
desert oar
past meteor
#

Where I'm from everyone drank the databricks koolaid

#

Everyone is on lakehouse

desert oar
#

they're the other way around, they want to be a warehouse where you can do everything in-warehouse. some "lake" features too though

#

you can even now deploy arbitrary code in containers. so you can run arbitrary code directly in-warehouse and pay for it with a uniform compute credit (instead of slinging data back and forth between the warehouse and e.g. ECS)

past meteor
#

I feel like snowflake is pretty lakehouse-ish as well right? Don't you land data into snowflake and not into S3/Azure blob first?

#

And then you use snowflake's compute to transform it right inside the "warehouse"/lake/... whatever it is

desert oar
#

snowflake supports both

past meteor
#

Ok, then my intuition of what it was was correct

desert oar
#

it has stages, which are just blob stores like S3. but you can mount an S3 bucket transparently as a stage

#

so it's blob storage + OLAP distributed-ish + external integrations + a spark-like interface if you want it

past meteor
#

basically like databricks' lakehouse yeah

desert oar
#

is "databricks lakehouse" a product? or are you just talking about the pattern of building a lakehouse around databricks and dbfs?

#

i haven't used databricks since 2020

buoyant vine
#

Problem I have with Snowflake is the vendor lock in feels worse than AWS tbh

past meteor
#

not a product, it's just delta + databricks + marketing

past meteor
#

Is snowflake "serverless"?

desert oar
#

yeah it's pure saas

#

not self-hostable and completely opaque compute (priced in "credits")

#

maybe it's because we're using airflow + dbt + containers but i really don't feel that badly locked-in. we will be much more locked-in once we are deploying compute directly in-warehouse, but at least we already have a non-locked-in solution that we won't ever fully get rid of.

buoyant vine
# desert oar is it any worse than any other data warehouse though?

I think it is about in line with BigQuery lock-in wise, but without the other GCP service support and no bandwidth costs.

Athena I think is pretty easy to replace, since it is literally just re-skinned Trino. Tbh if we re-did our datalake now, I'd probably go with Trino as our main layer, so that at least changing the backend didn't change the queries

past meteor
#

SaaS != serverless

#

Can you pay for a standard amount of compute that stays on during business hours with absolutely transparent pricing you then scale down during the night? Give or take backpressure

#

Or is the only model pay-as-you-use?

buoyant vine
#

IIRC it is a price per storage GB used, price per data scanned, etc...

#

Which I think all the big warehouses use really

past meteor
#

that's pretty bad

#

it's really bad actually imho

desert oar
#

our warehouse is basically always-on anyway so it's easy for us to estimate pricing

past meteor
#

Maybe I'm paranoid but I wouldn't feel comfy buying into a service that doesn't allow me to pick a non serverless model

past meteor
#

So if an always-on version exists and your thing is ... always on, you can switch and save money
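
A rough sketch of that break-even, just to make the arithmetic concrete. Both prices below are made up for illustration and are not any vendor's real pricing; the function names are hypothetical too.

```python
def break_even_hours(flat_monthly_price: float, on_demand_hourly_price: float) -> float:
    """Hours of use per month above which the flat (always-on) plan is cheaper."""
    if on_demand_hourly_price <= 0:
        raise ValueError("on_demand_hourly_price must be positive")
    return flat_monthly_price / on_demand_hourly_price


def cheaper_plan(hours_used: float, flat_monthly_price: float,
                 on_demand_hourly_price: float) -> str:
    """Return which plan costs less for a given number of hours in a month."""
    on_demand_cost = hours_used * on_demand_hourly_price
    return "flat" if flat_monthly_price < on_demand_cost else "on-demand"


# Hypothetical numbers, chosen only to show the calculation.
FLAT_MONTHLY = 1000.0    # assumed flat price per month
ON_DEMAND_HOURLY = 4.0   # assumed pay-as-you-use price per compute hour
HOURS_IN_MONTH = 24 * 30

print(break_even_hours(FLAT_MONTHLY, ON_DEMAND_HOURLY))              # 250.0
print(cheaper_plan(HOURS_IN_MONTH, FLAT_MONTHLY, ON_DEMAND_HOURLY))  # flat
```

With these made-up rates, anything running more than 250 hours a month would be cheaper on the flat plan, which is the always-on case.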

buoyant vine
#

I think it is a trade-off, I think a lot of people prefer having the serverless setup where it is cheaper initially, but then get bitten later on as their scale grows

#

But people like the convenience

past meteor
#

sure, I'd 100 % start serverless

desert oar
#

it's all very enterprise-ey

past meteor
#

but on say Azure, many services let you switch transparently

#

Like, you have a serverless SQL server (absolutely ridiculous naming) with a managed counterpart

desert oar
#

the difference is that you can't blow your cost into interstellar orbit as easily as you can on google or AWS

past meteor
#

There's serverless databricks versus de-facto managed, where you switch your cluster on and off "manually" etc.

desert oar
#

larger mean, lower variance -- that kind of thing

past meteor
#

And they're still easier than using EC2/Azure VM

desert oar
#

yeah, i think i get it. my response is "snowflake doesn't have that but i also am not aware of anyone having issues with it beyond it just being expensive overall"

iron basalt
# plush bobcat So in conclusion: • C/C++ because they're the mother language and by learning (...

Yes, an important detail here is that C is more than a language at this point. It also acts as the interface between languages and the operating system. So for any language to be able to do anything, it needs to pass through a C interface at some point. This means that if you know both Python and C, you can get access to almost all libraries / utilities, and also make fast things that you can use in Python (bind some C library that does not have a Python module yet, or your own private stuff). C++ is not as necessary as C, but it does make programming things a lot easier, builds on C, and is used everywhere, so it gives you access to even more libraries (though even just knowing C will let you read most of it). C is also not going anywhere any time soon, and changes too slowly for you to need to keep up with its features like with other languages.
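
To make the "bind some C library" part concrete, here is a minimal sketch using Python's built-in `ctypes` to call `strlen` from the C standard library. It assumes a POSIX system (Linux or macOS) where the C library can be located; `c_strlen` is just a hypothetical wrapper name for the example.

```python
import ctypes
import ctypes.util

# Locate the C standard library; if that fails, fall back to the symbols
# already loaded into the current process (works on most POSIX systems).
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path) if libc_path else ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: bytes) -> int:
    """Length of a byte string, computed by C's strlen instead of Python's len."""
    return libc.strlen(text)


print(c_strlen(b"hello"))  # 5
```

The same pattern (load the shared library, declare argument and return types, call) applies to a C library of your own, though real bindings usually need more care around memory ownership.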

past meteor
#

last addition to this is that I believe cloud marketing has won in convincing us pay-as-you-use is the only viable option so people don't complain/have issues with it 😄

sturdy kiln
#

tryna do Univariate time series on a JDIA dataset, how can i determine which variable ill be using univariate on? do i just wing it and use any one

desert oar
#

for example we also use aiven's managed timescaledb and that's just a flat price per month

desert oar
#

normally when doing a data project you have an actual real-world objective in mind, so you do whatever accomplishes that goal

plush bobcat
#

Thanks for the insight, mate

desert oar
#

Sometimes I find myself fighting the query planner, wishing for proper database indexes. That's my only practical complaint as an individual user (as opposed to an administrator)

sturdy kiln
#

honestly i dont have a real-world objective in mind, im just trying to demonstrate utilizing univariate with different models

desert oar
#

you're asking about whether to use open, close, or something else?

#

I think normally people use closing prices but it probably doesn't matter much for a simple univariate analysis

sturdy kiln
#

oh lol sorry i misspelt the acronym, but yeah from what ive seen a lot of people use Close so i went with Close anyways
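
For reference, a minimal sketch of what "going with Close" looks like in pandas. The four rows are made-up stand-ins for the real DJIA file, and the column names `Open` and `Close` are assumed; the monthly step keeps the last close of each month, which is one common choice and not the only one.

```python
import pandas as pd

# Hypothetical daily rows standing in for the real dataset.
prices = pd.DataFrame(
    {
        "Open": [100.0, 101.0, 103.0, 102.0],
        "Close": [101.0, 102.5, 102.0, 104.0],
    },
    index=pd.to_datetime(["2023-01-30", "2023-01-31", "2023-02-01", "2023-02-02"]),
)

# Univariate means one variable: selecting a single column gives a 1-D Series.
close = prices["Close"]

# One value per month: group the days by calendar month and keep the last close.
monthly_close = close.groupby(close.index.to_period("M")).last()

print(monthly_close)
```

Note that `to_period("M")` on its own only relabels each row with its month; it is the `groupby(...).last()` step that reduces the data to one observation per month.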

craggy agate
#

I agree with @hollow escarp

craggy agate
#

Lmao

sturdy kiln
#

whats the best type of LSTM model? apparently there's 5: Vanilla, Stacked, Bidirectional, CNN-LSTM, ConvLSTM

#

or is it all situational

sturdy kiln
#

weird how vanilla LSTM performs way shittier than MLPs
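
Not a diagnosis of the gap, but for comparing the two fairly it helps to feed both the same windows. A minimal sketch of the usual windowing, in plain Python with made-up numbers: an MLP takes each window flat as (samples, timesteps), while recurrent layers such as LSTMs typically expect (samples, timesteps, features). `make_windows` and `add_feature_axis` are hypothetical helper names.

```python
def make_windows(series, n_steps):
    """Split a 1-D sequence into (inputs, targets) for one-step-ahead forecasting."""
    inputs, targets = [], []
    for start in range(len(series) - n_steps):
        inputs.append(list(series[start:start + n_steps]))  # n_steps past values
        targets.append(series[start + n_steps])              # the value that follows
    return inputs, targets


def add_feature_axis(windows):
    """Turn (samples, timesteps) into (samples, timesteps, 1) for a univariate series."""
    return [[[value] for value in window] for window in windows]


closes = [10, 20, 30, 40, 50, 60]  # made-up values
X_mlp, y = make_windows(closes, n_steps=3)
X_lstm = add_feature_axis(X_mlp)

print(X_mlp)      # [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
print(y)          # [40, 50, 60]
print(X_lstm[0])  # [[10], [20], [30]]
```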