#data-science-and-ml

1 messages · Page 181 of 1

final kiln
#

is an interesting question, their rl vs pre training data

cuz sure, you're finetuning the thing to do what u want

but, the internet is suuuuupa big, surely it creates all sorts of pre-despositions

jaunty helm
#

ollama is kind of sketch
a large part of it is built on llamacpp, but ollama really doesn't want to talk about it, and have worded it in the past like features from llamacpp were done by ollama
sometimes they also try to be the first and ship kinda broken code, iirc gpt-oss for them was a lot more inefficient on ollama than on llamacpp
I'd just use run llama-cpp's openai-compatible server and send requests to it

final kiln
jaunty helm
final kiln
#

must be something to it

jaunty helm
#

yeah... no idea from me either, never done this type of training/tuning myself

fair aspen
#

how do I run the new qwen3.5-9B with pytorch or transformers?

odd shell
#

yay scares me...im too afraid i brick my system lol

fair aspen
odd shell
#

good question

#

I use pacman solely so far?

jaunty helm
#

they prob have instructions on how to do that on the model page tho I havent checked

fair aspen
odd shell
#

Im currently setting it up, but I guess you have to use yay inevtiably? idk even how libs work yet in arch lol

fair aspen
#

how do you install transformers serve on arch linux though

jaunty helm
fair aspen
odd shell
#

in venv you can eh?

fair aspen
#

or maybe you can

#

but it returns this ```agnulo -> pip install
error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try 'pacman -S
python-xyz', where xyz is the package you are trying to
install.

If you wish to install a non-Arch-packaged Python package,
create a virtual environment using 'python -m venv path/to/venv'.
Then use path/to/venv/bin/python and path/to/venv/bin/pip.

If you wish to install a non-Arch packaged Python application,
it may be easiest to use 'pipx install xyz', which will manage a
virtual environment for you. Make sure you have python-pipx
installed via pacman.```
jaunty helm
#

making a venv for your project and installing it in there is prob a good idea

odd shell
#

i havent verified it yet, but i imagine it ought to work?

fair aspen
odd shell
#

yes

fair aspen
#

how do I do that?

odd shell
#

check if you have a venv running in shell, if not, python -m venv venv for example

#

then activate from bin

fair aspen
#

I don't know what these big words mean 😵‍💫

odd shell
#

are you running an ide or bash?

fair aspen
#

I use zsh

jaunty helm
# fair aspen how do I do that?

you can see the docs for more detail, but yeah in a nutshell go into your project directory, run python -m venv ./.venv and your venv will be stored in .venv
then depending on your shell run one of these commands, after which your terminal will be in the venv, then you can pip install

odd shell
#

literally my order

fair aspen
jaunty helm
odd shell
#

normal as in global, no, thats the entire point

fair aspen
#

why doesn't it work?

jaunty helm
# fair aspen why doesn't it work?

hm
maybe it needs the exact command
pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main" is what it says on the qwen page

odd shell
#

need to setup env too for pip

#

should work then

#

(just verified myself)

jaunty helm
# fair aspen oh yeah thanks

actually should've asked first but do you have enough vram to run the 9b or whichever one you're trying to run?

fair aspen
#

do I need to reinstall rocm or pytorch-rocm on the venv?

odd shell
#

check if its in there?

jaunty helm
odd shell
#

not sure if venv pulls from global excl or if its custom

#

you can just pip it anyway and test

fair aspen
#

uh I got the error:
Could not install packages due to an OSError: [Errno 122] Disk quota exceeded

#

my disk is not full:

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  884G  432G  407G  52% /
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.8G   76M  3.8G   2% /dev/shm
efivarfs        256K  126K  126K  50% /sys/firmware/efi/efivars
tmpfs           1.6G  1.6M  1.6G   1% /run
tmpfs           3.8G   17M  3.8G   1% /tmp
/dev/nvme0n1p1 1022M   74M  949M   8% /boot
tmpfs           778M   84K  778M   1% /run/user/1000
none            1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
agnulo -> ```
jaunty helm
#

0 idea on that
I'm on windows

odd shell
#

hmmm interesting

#

nvme some sort of virtual machine?

fair aspen
#

it's an ssd

odd shell
#

ohh those

#

maybe youre installing on the partition with less space?

fair aspen
#

well no because I'm on / obviously

odd shell
#

wasnt obv to me 🥹

fair aspen
#

sorry 🥺

odd shell
#

did you figure it out?

#

docs say that there may be quotas on your dir?

fair aspen
#

what quotas

#

maybe I'm having a disk health issue (would be odd since it's new)

odd shell
#

hmm kinda tricky problem, seems it may be a lot of things

floral quarry
#

@knotty raven

odd shell
#

im reading some obscure stuff about inodes saying to run: df -hi and see if youre capped? this is a bit beyond me tbh

fair aspen
# odd shell im reading some obscure stuff about inodes saying to run: df -hi and see if your...

I have no idea what IUse isagnulo -> df -hi Filesystem Inodes IUsed IFree IUse% Mounted on /dev/nvme0n1p3 57M 777K 56M 2% / devtmpfs 966K 680 965K 1% /dev tmpfs 972K 86 972K 1% /dev/shm efivarfs 0 0 0 - /sys/firmware/efi/efivars tmpfs 800K 1.2K 799K 1% /run tmpfs 1.0M 7.3K 1017K 1% /tmp /dev/nvme0n1p1 0 0 0 - /boot tmpfs 195K 100 195K 1% /run/user/1000 none 1.0K 1 1023 1% /run/credentials/systemd-journald.service agnulo ->

odd shell
#

looks healthy too 🙁

#

maybe cache full?

fair aspen
#

I just cleared cache and the error persists

odd shell
#

sudo pacman -Syu and reboot, activate venv again and pip it again id say

fair aspen
#

ok good idea

#

woah

#

I fucked my system so bad

odd shell
#

what happened?

fair aspen
#

I rebooted and now my WM doesn't work and I have no internet

odd shell
#

youre on your phone or other device, and your machine is in tty?

odd shell
#

might run something lik: journalctl -b -p warning and see if you have some major issues

fair aspen
odd shell
#

ah gotta config that

#

didnt you do that during installation?

#

maybe you wrote it on your iso 😛

fair aspen
#

NetoworkManager.service: Job NetworkManager.service/start failed with result 'dependency'

odd shell
#

instead of chroot

fair aspen
#

it worked until now 🤨

#

oh all my daemons are failing

odd shell
#

gotta do it manually then i guess

fair aspen
#

because of D-Bus

odd shell
#

did you archinstall or manual?

fair aspen
#

manual

#

but it worked until now

odd shell
#

i screwed it up first time too lol

#

you should be able to troubleshoot it from tty though

fair aspen
#

it's not my first time

#

but I'm so confused

#

it worked for months til now

#

I need to chroot probably

odd shell
#

boot from usb again and set it up i guess?

fair aspen
#

it seems like I'm fucked 😎

untold frost
#

i am trying to do logistic regression with multiple variable using scikit-learn any videos that can help me?

arctic silo
#

We need to try RAG

opaque condor
#

What would happen if an AI was trained on a world war II data set like music recipes etc

limpid zenith
#

what would you want to happen?

serene scaffold
robust echo
#

You can't train anything without a desired outcome to select for.

#

Every kind of machine learning requires a reward system. Otherwise how do you know what behavior to reinforce?

#

You could train a generative audio model on a dataset of 1940s music, for example - that has a specific goal which can be rewarded, "produce data which is recognizable as being 1940s era music."

opaque condor
#

A lm
My dad needed my help to get pellets

robust echo
#

Basically your training set is composed of some kind of informational artifact, like an image or audio file or text or w/e

#

And you mess it up, then train a network to reproduce the original from the messed up input.

#

And then you can feed new inputs in to generate new things

#

So if you have a training set composed entirely of textual documents from the 1940s, then you can train a model to produce new outputs which are statistically similar to textual documents from that era.

#

Pretty much simple as

opaque condor
#

ingredents:
    - 2 tbsp butter
    - 2 tbsp flour
    - 1 cup milk or cream
    - 1 tsp salt
    - 0.5 tsp black pepper
    - 0.5 tsp mustard (optional,for flavor)
    - 1 cooked chicken, slice

instructions:
  1. In a saucepan, melt the butter over medium heat.

  2. Stir in the flour to make a smooth roux, cooking for 1–2 minutes.

  3. Gradually whisk in the milk or cream until smooth and slightly thickened.

  4. Add salt, pepper, and mustard if using.

  5. Pour over the cooked chicken, or return the chicken to the pan and simmer briefly so it’s coated with the sauce.```

here is one file i have as a explination
robust echo
#

If you train a model using recipes from the 1940s, the outputs of the model will statistically resemble recipes from the given time period/culture.

opaque condor
#

I have a lot of files to write

robust echo
#

A language model is just reproducing patterns in text it has consumed. So if it's consumed texts like this, it'll output texts with similar patterns. E.g. it will favor ingredients which were commonly used during that period.

opaque condor
#

do I have a good start
I do have music not as audio files but text files that use every bit of data that's available

robust echo
#

I'm sure you could train a model on this kind of input if you have enough of it. If you're not very familiar with neural networks you should probably do something a bit more basic at first

#

But yeah there are tons of archived texts from the ww2 period, you could definitely produce a pretty good data set for training I think

#

A good starting point might be using a pre-designed architecture but throwing the pre-trained weights away and training new ones

#

could also try a technique called transfer learning

opaque condor
#

I've worked with CNN

convolutional neural network

I haven't worked with videos because videos take a lot of memory for my computer

But I understand models I know I can use a pre-trained model but I tried one they didn't come on I have to pick her up then I don't understand how hard it is to make a data set if I can't find it you know on any type of site

#

Do you think that is too much to try to unpack

#

For my current dataset I have
5 songs from 1939
2 sauce recipes 1939

Between those 7 files I have 204 lines of text I going to get more

opaque condor
#

Do you think that's a good amount or do you think I might need a lot more

spring field
opaque condor
#

from scratch

How much would be needed for a generative model anyway

serene scaffold
#

so having 204 lines of text is tantamount to absolutely nothing

spring field
#

unless you're fine with the model outputting "the" forever, lol

robust echo
#

yeah I'd be looking for like samples from major historical archives

#

You need a lot of data

#

If you want to generate recipes, 2 recipes is 2 input samples, not 204 input samples.

opaque condor
#

I'm going for music and food of the time first I am going to go for historical events I'm just trying to get the culture of that time period heck I was working on finding recipes from that time period

robust echo
#

I trained a pretty basic CNN and used in the hundreds of thousands of training images just for pretty simple categorization

opaque condor
#

I got a chicken ala king recipe I should really be counting how many lines i have

rich moth
#

I feel like we just hit an inflection point. Jack Dorsey laid off 4k employees because 1 entity "claude code" took the job. Fifty percent of his company... He even said most companies will do the same in within a year. Companies dont even need to justify laying us off now, lol.

#

Dont worry though DJT is at the helm. thisisfinae

winged quest
#

Hi i am new here

rich moth
jaunty helm
final kiln
rich moth
#

I can only imagine the pressure academia these days is facing. How far are we from they replacing professors?

#

Or... will man made goods and human articulated sessions just become more expensive and for the rich?

final kiln
#

academia has so many bad profs cuz a lot of them just wanted to be researchers and have no vocation for teaching

odd shell
fair aspen
#

You really really don't want D-Bus to fail

odd shell
#

Ah, did you manage to recover?

fair aspen
#

I ended up installing debian

#

how do you install rocm on debian though?

tardy haven
#

Guyzz 😔

limber plover
#

Oh nice, I have Debian 13 running

tardy haven
#

Guys, I haven’t been able to impress my crush for a long time 😔

serene scaffold
tardy haven
limber plover
#

He has been around each sub talking about his crush and the task he wants

limber plover
#

@serene scaffold I guess they are called Discussions.

#

Or channels

wintry brook
#

Hey guys, i just started with dsa ... I practically have 0 knowledge about data science or ai but I want to grow fast..

So i joined kaggle and saw it has more of a practical approach, like learning the basics and then doing titanic competitions or so...

On the other hand I have a course i brought which is pretty well rated and covers most of the topics in details ...

I am learning from the course and solving the introductory competitions from kaggle and learning as I make model for the competitions while using ai to learn from while I make the model

Is that a good approach.. If you read this msg till end.. Thank you for your time

serene scaffold
wintry brook
#

I am in cse 2nd year.. now my main aim is to be industry ready actually... I am pretty good at maths and studies as a overall... I also aim to make products that simplify life and is cool

wintry brook
#

Any suggestion or direction

warm dune
#

some people will learning things useless, care about that

wintry brook
#

Could you suggest me any roadmap

rancid thorn
#

Guys Im trying to make an LSTM but when I add layers it gets incredibly worse. The loss doesnt go down at all and it ends up sucking. Anyone know how to fix this?

subtle lotus
#

Transform a python code to apk is hard. But using Kivy and Buildozer is more easy! I'll transform my first code to apk

warm dune
wintry brook
wicked basin
#

Hello Ive been trying to get into small Neural networks I am really interested in learning and messing with the perceptron algorithms does anyone have some good documentation that maybe i havent found ?

wicked basin
#

After doing some more research and asking AI on how to describe and explain perceptron to me I ended up with these notes

arctic wedgeBOT
twilit topaz
#

What's your opinion of zero shot forecasting for time series? Many articles claimed that it's better than ARIMA that needs fine tuning

#

There's been a lot of different ones like TimesFM, Reverso, and Chronic

jaunty helm
#

though I suppose it's easy to just plug them in and get some good enough results

twilit topaz
jaunty helm
#

doesn't say that they outright do not work in general

twilit topaz
#

I would say it's performance is relatively good

jaunty helm
twilit topaz
#

From online it does decent too

twilit topaz
#

Or some type of machine learning model for time series

jaunty helm
twilit topaz
#

You use R for this or do you use sktime/ Darts time series? I mentioned Darts personally

#

Sktime is kinda hard to use for me

jaunty helm
# twilit topaz You use R for this or do you use sktime/ Darts time series? I mentioned Darts pe...

neither honestly, darts I just havent gotten to it
sktime was not a good time last time I touched it
for one its processing speed is completely horrendeous for seemingly simple tasks; somehow padding and cutting all series to the same fixed length takes minutes when in polars it took 2 seconds
integration with polars isn't great either and/or poorly explained; for example only by experimentation did I find that to make sktime recognize which is the time column, I had to prefix the column with __index__

jaunty helm
#

ig one issue I often run into when doing ts with gbms is memory usage
tons of variables should be 'trivial' (like, lags are just, look x rows behind!) yet you have to duplicate then shift the data into a new column for the regressors to work anyway

#

feels like there should and could be some library that doesn't have to do this and thus saves lots of mem
but to my knowledge said library doesnt exist

twilit topaz
#

For time series just statsmodels?

#

Actually statsmodels can't handle polars either

jaunty helm
twilit topaz
#

Darts handles covariates pretty well

jaunty helm
#

I'll try it next time a ts comes up

twilit topaz
jaunty helm
#

I think sktime had lag transforms too, not sure about its memory use, but like the speed was just really unbearable for a lot of things that should be fast?
oh yeah and I think I also tried to get its catch22 to be fast, in the end iirc the parallel processing options straight up dont work or something and I had to install a different catch22 library that sktime can then use instead of its native impl or smthn

twilit topaz
twilit topaz
#

For me you just input the lags you want and Darts I guess does it for you?

#

You just have to know like what exogenous variables you are dealing with

warm dune
bronze wyvern
#

Hello, quick question, when it comes to linear regression, is MAE, mean absolute error enough to describe the model behaviour?

For e.g, I have used the housing californian dataset to train a predictive model to predict house prices. Would MAE be enough? How can I know that pls

warm dune
#

in my opinion only look to one metrics it’s bad

#

i prefer to use 2+

bronze wyvern
#

why do we have so many metrics? I mean why can't we use only MAE for instance

jaunty helm
bronze wyvern
warm dune
#

There’s differents situations, for outliers and model can adjust the weights, and the accuracy for the real data are totally wrong

#

so we can choose another metrics to “ignore” outliers and only see tue tru data

#

there are many of examples and situations for that

rancid thorn
twilit topaz
#

They have RNNs

rancid thorn
#

Whats that?

twilit topaz
#

A time series Forecasting library

rancid thorn
#

Darts time series forecasting

#

Oh

#

Well I mean im using pytorch

#

That should be pretty good

twilit topaz
#

Darts has pytorch models but the API is very user friendly. It has LSTMs

rancid thorn
#

I mean the problem isnt that I cant make an LSTM

#

Its that when I add layers to it it sorta breaks

twilit topaz
#

Interesting, maybe it's how you set it up.

#

For Darts they implement a standard vanilla LSTM I suppose

rich moth
#

Do you guys think pressure conditioning scales with task complexity?

#

i thought this was pretty remarkable and wanted to share it

rich moth
# warm dune explain more that

So this is the idea. We know LLMs preform inconsistently on hard task. We seem stuck on better models, more compute and longer CoT, but what if you just change the stakes? So I made an experiment where i injected three types of pressure into the systems prompt at inference., economic (your budget is limited), environmental (errors have real consequences), and competitive (you're being benchmarked against other systems). Just context framing. I did 200 trials across 8 conditions on SWE-bench Verified. Triad condition hit +77% relative improvement over baseline but the most interesting part was find the scaling realationship. The harder task benefited more from the pressure and it was predictable. And I can measure this task complexity with a formula called UCF |Φ| i made, but you can use it before hande and see if pressure it even wroth applying.

#

Im validating on a GAIA benchmar which is running now, but so far its looking really good.

#

Its gonna take some time but ill share the visual

umbral hatch
#

this might be repetitve but is data science at risk of being taken by AI?

rich moth
rich moth
#

Well, no. We're Teamsters. So they can't just axe us (thank god for union protections) but its a different battlefield these days then it was 10-20-50 years ago

#

Unions are something we all need to gravitate towards for our own protections from incoming AI and greedy companies.

umbral hatch
#

honestly i would love for this sutpid AI bubble to burst, AI isnt even good for most cases and AGI is very unlikely I feel

rich moth
#

Unfortunally, its not going to. As you can see its growing and getting better. So much a CEO axed 4k employees because Claude code took over the job

rich moth
umbral hatch
#

tbh makes sense since the platform is full of slop

rich moth
#

Ex Twitter CEO.

#

Twitter doesnt exist. Elon Musk bought it, hence the kitchen sink video.

#

He made a new company apparrrently and already axed 4k people

umbral hatch
#

oh shit i thought it was twitter

rich moth
#

No , twitter is now X. Twitter doesnt exist anymore but he did invent it

umbral hatch
#

i guess data science isnt gonna last?

rich moth
#

It will last, but on much smaller scale.

umbral hatch
rich moth
#

AI will do most of the work, but a few high eductated people i imagine will oversee operations.

umbral hatch
#

not a good career path then?

rich moth
#

No, dont let me steer you wrong. You should learn it never the less.

umbral hatch
#

im honestly interested in it

rich moth
#

I guess the most important I said in all this is "Unions". If the country or the rich wont protect us, then we need ourselves.

#

We can really all have good jobs if we actually unite.

umbral hatch
#

what are unions exactly? ive heard of them but never really understood them

rich moth
#

They have been around a long time in American history. But they actually phyically fought with managment a century ago for fair worker rights. It was a bloody past, but they laid the ground work for the rest of us., You should look into it.

#

Let me see if I can get you a starting point.

#

hold on lol wrong on

warm dune
rich moth
warm dune
#

Just go all in

#

dont think to much

rich moth
umbral hatch
warm dune
warm dune
#

cuz if you already have this in your mind, you cannot achieve your best

#

the fear will control you

#

i say for my own experience

umbral hatch
#

im not afraid of going all in i just dont want it to go to waste

#

id still do it but perosnally would like a bit more clarity

warm dune
#

in technical terms saying to you, data science are in the beat moment for me

#

with the grow of AI big techs there are a many of “new jobs” for data created, and so many different experiences that a year doesn’t exist

#

i think thats a good way that always will need a good professional behind

umbral hatch
#

ok thanks

rich moth
#

I made a section for usage within AI. Please check out doc

rich moth
#

Im testing right now. Mind you this a local model. Unsloth/qwen3.5-35b-a3b@q5_k_xl

#

I'm limited here Im on a single 4090 but i can only imagine with a bigger model

#

16 to 24 t/ks

lyric lynx
#

Is it possible to use LM Studio for a discord chat bot?

wicked basin
lyric lynx
#

it costs money for more

#

And im using a more uncensored model because i wwant it to be a fun bot

wicked basin
#

and a free AI api would probably be weak

#

could always make ur own api with open source models

#

but you would probably need a dedicated machine to run it

lyric lynx
#

Device name LAPTOP-RSNUQGVQ
Processor 13th Gen Intel(R) Core(TM) i7-13650HX (2.60 GHz)
Installed RAM 24.0 GB (23.7 GB usable)
Device ID CF0C74A1-3DA6-466F-940C-46A123517B61
Product ID 00342-21498-02091-AAOEM
System type 64-bit operating system, x64-based processor
Pen and touch No pen or touch input is available for this display

#

has 14 cores

#

with rtx 5050

wicked basin
#

yeah thatll do it

lyric lynx
#

😭

fair aspen
#

I'm steering a model but its personality is so volatile

#

I change one word and the model is a totally different person

#

I'm trying to make a gen alpha LLM

vale wave
#

is there any data analyst guy who can guide me for projects??

terse frigate
#

If I want to prep for AI ML engineer roles. Is it worth grinding leetcode?

How else can I upskill myself? I have a masters in AI

cedar mason
#

I would also like to know, as I am finishing my undergraduate degree and about to begin a master's degree in AI.

serene scaffold
#

!warn @grim acorn your message was removed for soliciting donations

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied warning to @grim acorn.

bronze wyvern
#

Hello, question, is handling of outliers something we do in data preprocessing section or it's more in the eda?

Like in data preprocessing we handle duplicate and null values.

Then in eda, we see trends etc, then handle outliers?

flat crown
#

Seeking Local Alternatives for Math/OCR Pipeline (LLM Evaluation)
​Current Pipeline:
​OCR (Claude 3.5/4.5): Converts PDF to pipe-separated question format.
​Evaluation (GPT-5): Handles math calculations, step-by-step explanations, and formatting.
​Verification (Gemini 2.0): Validates answers and handles uncertainty.
​Export: Pipe-separated text → Word Table → Excel/Admin Panel.
​The Problem: I want to replace the GPT-5 evaluation step with a local, cost-free alternative without losing the quality of math reasoning and explanation.
​Questions:
​Which local model (DeepSeek-R1, Llama 3.3, Qwen) is best for Hindi/English math evaluation?
​Are there any Python-based orchestration tools you recommend for running this locally with high throughput?
​Any tips on maintaining the strict pipe-separated structure when using local LLMs?

serene scaffold
#

@flat crown you've asked this in at least three places. please pick one so that people don't duplicate their efforts

summer plover
#

I dont know the answer, but my coworker does something similar but in English and Norwegian, he say that DeepSeek is better than Haiku, so maybe not not directly comparable, but Deepseek did very well on the math portion

#

and what is high throughtput? i use ollama locally

bronze wyvern
#

How do we decide whether we should remove outliers? I mean what kind of questions do we need to ask ourselves? For example if we take the housing dataset for california and we have outliers based on the total number of bed rooms, what would be some reasoning pls

limber plover
#

@fair aspen Did Ai generate that? That reads like a teenager trying to be serious with an adult but has no idea what words are, and, or, trying to be hip explaining it.

serene scaffold
jaunty helm
bronze wyvern
#

yep I see, in my uni coursework, the project is graded based on "sections", like you have a section for eda, one for data pre processing etc but in real life/project this is often mixed together based on the insights we want to discover as we go?

jaunty helm
bronze wyvern
#

oh ok, didn't know that, yeah make sense when I see that... in my lecture slides, they make it as if it's a sequential workflow

#

Small question, currently I am working with the california housing dataset and the aim is to predict house prices. Now the house prices are large and the value is capped at 500,000.

Should I use log in such scenarios pls, if so where/when should I use it, to display the data or even inside the ML model, log should be used?

fair aspen
#

Can anyone help me install ROCm on Debian 13, my GPU is the rx 9060 xt

warm dune
bronze wyvern
#

I have standardised the other numeric features but the house price/target variable, I haven't touch it yet

warm dune
bronze wyvern
#

kaggle

warm dune
bronze wyvern
#

sure, I can have a look

warm dune
#

and for your question, i prefer to scale all the data, cuz the error always go to the lower possible

bronze wyvern
#
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# List of numeric columns to scale
num_cols = ['housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value']

# Fit scaler on train and transform
train_scaled = scaler.fit_transform(train_set[num_cols])

# Transform test set using the same scaler
test_scaled = scaler.transform(test_set[num_cols])

Currently I have this

#

should median_house_value be included?

jaunty helm
# bronze wyvern Small question, currently I am working with the california housing dataset and t...

standard ols, potentially - check the distribution skew: usually house prices are right skewed, i.e. there's a small but significant chunk of premium houses that cost a fortune
if you do ols on raw prices it'll focus a lot on those and thus make it poorer at predicting the standard house prices; taking the log can make the distribution more normal
alternatives like gamma regression's assumptions about data are more correct here, so you don't need to log, or trees / gbms which don't really care

warm dune
red harness
#

Hi everyone! 👋

I'm building a Python data enrichment tool as a side project to improve my coding skills and automate a boring task at work.

The goal is to take an Excel file with 4000 Italian VAT numbers (Tax IDs) and extract the standard B2B email address for each company (I'm avoiding the official certified government emails due to strict privacy and cold outreach laws).

The issue is that I have zero budget for enterprise data APIs (like Dun & Bradstreet, Atoka, etc.), and enrichment tools like Dropcontact are too expensive for a learning experiment. I want to tackle this with pure web scraping.

Here is my 2-step architecture idea:

Step 1 (VAT to Domain): Take the VAT number, query a search engine (e.g., "VAT 12345678901"), and scrape the first relevant URL to find the company's official website.

Step 2 (Domain to Email): Once I have the URL, use a crawler to visit the homepage and the /contact page, then extract standard emails using Regex.

The stack I'm considering: pandas for CSV input/output, and Playwright or BeautifulSoup / requests for the actual crawling.

My questions for the community:

SERP Scraping: Google rate-limits and blocks very fast. Do you know of any ultra-cheap/free SERP APIs (is DuckDuckGo more lenient?), or should I just integrate residential proxies directly into Playwright?

Web Crawling: For step 2, is requests + BeautifulSoup + Regex enough to find emails, or do most modern sites obfuscate emails with JS, making Playwright mandatory?

Alternative Approaches: Is there a lateral thinking approach I'm missing to match a VAT number/Tax ID to a company domain for (almost) free?

Any advice on libraries, architectural patterns, or open-source GitHub repos I can learn from would be incredibly appreciated. Thanks a lot! 🚀

grim acorn
#

@serene scaffold
Hi, sorry about that. I didn’t realize posting my Python tool sale link would be considered solicitation. I’ll follow the server rules from now on. Thanks for letting me know. 👍

bronze wyvern
#

quick question, when we use linear regression it assumes that the dependent and independent variable are normally distributed?

somber lichen
#

Any use pytorch ?

serene scaffold
# somber lichen Any use pytorch ?

what would you ask someone if they did?
it saves everyone time, including yours, if you always start with your actual question. don't ask to ask.

warm dune
serene scaffold
bronze wyvern
#

Hello, quick question, consider this graph. The problem is the axes. It seem that the range istoo big for the x-axis. Should I use standardised axes, so standardised variables here?

#

or is it because of the outliers?

tidal bough
#

Standardising won't help in the slightest; the plot would look exactly the same but with different labels and ticks on the axes

#

It's because of the outliers, yeah. You could also try plotting in log-log axes, or at least log-x.

rich moth
#

Running the GAIA benchmark. They both get the right answer but one is faster, econ_comp. Not only was it faster , its attention to detail improved. But look at the outputs, the one on the bottom right is slightly more focused.

rich moth
#

Thats your first hurdle.

rich moth
red harness
opaque condor
#

I don't want to ever used
LableImag
And if so how do you label a video

rich moth
#

What country do you live in and what do you access too? You sound outside of the US and EU to be straight up

#

Well , doesnt matter. You want soemthing with decent amount of VRAM. You can pick up a 5090 or AMD has 32 gig options but ROCm is a bit choppy but lots of work is being done.

#

AMD AI 395+ with some cards ? I mean there are a lot of options, really need to plan this out and narrow it down, maybe you just need somthing basic.

rich moth
jaunty helm
jaunty helm
bronze wyvern
#

Hi, quick question. I'm working with the california housing dataset and from what I've read on the internet, people who work with this dataset says that the housing prices have been capped at $500,000, same with the housing ages which have been capped at 52. But how can they say that, I mean is there any proof or it's pure deduction?

summer bridge
#

Anyone have a suggestion for IDE use python/pandas and SQL? I have looked into PyCharm and Vim which I understand is not an IDE but have read good things about it.

jaunty helm
bronze wyvern
#

yep just did that, I assume the hard cut off indicates the capped values

jaunty helm
bronze wyvern
#

yep noted, thanks !

red harness
bronze wyvern
#

I have been looking online about the evaluation metrics using linear regression for the california housing dataset. It seems that every one has metrics like that, for e.g house prices different by 40k for e.g.

This suggest linear regression was never a good algorithm to be used?

soft tundra
bronze wyvern
#

yeah I need to see if I can increase that

soft tundra
bronze wyvern
#

hmm what are you referring to by n and p pls

soft tundra
#

n = amount of observations(data)
p = amount of predictors(variables

bronze wyvern
#

yeah I believed it's much much larger

jaunty helm
bronze wyvern
#

yep noted, thanks !

rich moth
#

I got my first visuals back, but I nmeed mroe data points, start level 2 next.

soft tundra
# bronze wyvern yep noted, thanks !

if you want interpretability you should rather stick to linear regression but if prediction accuracy is the only thing you care about other models are much better suited for the job

stuck swallow
#

Would anyone know what data science projects to do with a bunch of messages and associated user IDs? I tried to make a model that would predict volatile people before becoming volatile (basically filtering out people who cussed a lot then examining only the messages where they didn't cuss) but the confusion matrix was terrible. Any simpler ideas that could actually produce useful results out of the data?

The CSV is in this form:
user_id, message_content, date_posted

serene scaffold
wide wing
#

hi guys i got a question what course or youtube video do You recommend to learn scikit learn?

serene scaffold
serene scaffold
#

!resources data science

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

wide wing
#

I already know calculus, statistics, and linear algebra

serene scaffold
wide wing
#

like i need to improve but i don't know what to do

serene scaffold
wide wing
serene scaffold
wide wing
#

ik what is that but i never tried to make it with pure python

serene scaffold
serene scaffold
wide wing
#

cuz i just realized I am stupid

serene scaffold
#

you are not

wide wing
#

I am kinda lost i learnt the essential libraries, math,and python now i don't know what to do

serene scaffold
#

@wide wing try making a neural network in pytorch
you'll probably still use sklearn to partition the data into train and test, and to evaluate the performance.

wide wing
serene scaffold
serene scaffold
#

are you trying to learn about data science and AI, or what?

wide wing
#

yes, i want to become a machine learning engineer

serene scaffold
#

it's important to keep a positive attitude about learning new concepts, because that will never stop

narrow tiger
#

Are there any models which I can use locally to generate photorealistic images/video/voice?

#

Or if there are any good paid once?

#

They should have good docs, I am new to ai image/video stuff

serene scaffold
rich moth
#

looks pretty photorealistic to me. but id search around hugging face

subtle lotus
#

First Day using Streamlit and LangChain

#

Build your first AI chatBot with Streamlit

half pulsar
#

Do you think emergence is close? || @barren gulch ||

barren gulch
#

emergence of what

half pulsar
#

Do you see it being a possibility?

barren gulch
#

what do you mean by emergence

half pulsar
#

Like AGI

half pulsar
serene scaffold
#

No matter what AI gets invented, I don't think everyone will ever agree that a certain thing is AGI

half pulsar
#

True, I think it's just a bunch of layers of additive complexity at this point trying to figure out what works and doesn't. What's missing and isn't, what's the best algorithm if it's been invented yet, maybe its a combination of Hybrid Neural Symbolic Governed AI with modern LLM Agents for tool calls. Keep it grounded using a deterministic traceable deliberate substrate you might have a shot, but that's putting it 'simply'. There's a lot of guess work still

#

I think its quite interesting

serene scaffold
#

I wouldn't consider something to be AGI unless it least tried to model constant cognition and constant sensory input. There's nothing happening in an LLM except when you use it to generate text, but the human mind is constantly active.

half pulsar
#

Yeah I agree that text generation is definitely not enough for this, but I'm thinking of having it paired with a larger system with continuous feedback(self looping), memory, self-modeling and some kind of ongoing world reaction, and maybe a sandbox for testing and validating internally before trying it out for real, The hardest part of that would be maintaining stability at scale

#

Can you set a goal and have it actually invent a way to achieve that one goal.

#

Can it set that goal itself.

#

Like those are questions that need a yes for it to work out

rich moth
#

You can really sum it up in one word, homeostasis. I feel like when the system can maintain that.

half pulsar
#

@barren gulch got any projects?

#

@limber plover how are yours?

#

Anything cool with the Matrix stuff?

rich moth
#

Heres the flow of my setup, anyone got any suggestions or critiques?

#

You guys should check out eraser, i discovered it today.

half pulsar
#

Thanks!

rich moth
#

yup

half pulsar
rich moth
#

It works really well qwen 27b model.

half pulsar
#

What's suppose to stand out here?

rich moth
#

Funny cause most enterprise models get these GAIA task wrong

#

Well , not most but a surprising amount compared to local models

half pulsar
#

Yeah its nice it's using known approaches is that your goal?

rich moth
half pulsar
#

Its like a hybrid retrieval system

half pulsar
rich moth
#

You can't use words to describe it? lol

half pulsar
#

Well I can I just need to be careful without revealing architecture details

rich moth
#

seems lame

half pulsar
#

I appreciate it

rich moth
#

Well lets see it in action. Lets see your claim to fame?

#

Well im off to github and license my work. I feel like you're lurking to taking peoples work.

#

was an interesting convo though

half pulsar
# rich moth You can't use words to describe it? lol

I built a deterministic reasoning runtime, Instead of prompts and responses, like most agent based systems, this maintains a structured knowledge state and runs planning, exploration and evaluation loops on top of it against a sandboxed world model, and the key thing here is that every time I run it produces artifacts that can be traced, replayed and verified, so that behavior is reproducible rather than being purely probabilistic

half pulsar
#

I don't care about LLMs. or Transformer based anything. It's all garbage.

jaunty helm
# narrow tiger Are there any models which I can use locally to generate photorealistic images/v...

for video the newest one is ltx 2.3 - I can't run it myself, but I'm pretty sure you either require top grade consumer gpus like rtx xx90's, or some mid-high tier + tons of ram for it to offload then wait like several minutes
voice I've not looked into too much
images, yes, something like z-image-turbo can run on 8gb vram with decent speeds; flux 2 [klein]s are pretty bad at anatomy when it comes to text-to-image, but they can edit well

limber plover
#

@half pulsar no, I am still on python trying to remember what I did. The matrix though is simple mathematics. So I have to write that out first in my reasoning before I put any code to it.

narrow tiger
jaunty helm
barren gulch
bleak zealot
#

LOL my best test result so far.
BTC R:R 1.93, ETH R:R 2.28, SOL R:R 2.44 , XRP R:R 2.54. MFE 86 %.

Unseen pairs

ADA R:R 2.03, DOT R:R 1.9, LINK R:R 1.5, MFE 75 %

PPO meta learning is the way 🙂

half pulsar
half pulsar
#

Python is simple though you'll get use to it!

gentle verge
#

Hey guys
Could anyone help me understand the tfidf feature extraction.
I have a 150k row dataset and I am trying a combo of char + word tfidf vectorizer from sklearn.
I got around 35k features and 150k rows now I can't possibly feed it to lightgbm or any tree based models cuz it will take around 2-3 for 1 model.
Is there any way to make the hyperparameter tuning less painful?

rancid thorn
#

Is it better to add layers or layer size?

serene scaffold
rancid thorn
#

Id guess they probably have different advantages

#

where size lets the network break down and analyze more thoroughly information

#

While layer amount lets it find more complicated patterns

serene scaffold
serene scaffold
rancid thorn
#

you know taylor serieses?

serene scaffold
#

I know of them

rancid thorn
#

With more neurons it can derive other factors out of already existing ones

#

To then have more to work with

warm dune
#

guys why SKLEARN don't have XGBoost and LigthGBM natively?

wide bane
#

I WANT TO ANNOTATE LISS 4 IMAGES AND THERE ARE 1000 IMAGES SO IS THERE IS ANY TOOL I CAN USE TO ANNOTATE JUST THE BUIDLINGS

agile cobalt
wide bane
#

Linear Imaging Self-Scanning Sensor-4

#

i have 1000 images and i have tried SAM-GEO in qgis but it is for rbg images and it should look like buildings

wide bane
opaque condor
#

No unfortunately or maybe somebody from GitHub learned how to automate it I don't know

lyric vale
#

hii, is it possible to predict tomorrow price based on previous data of stock market using neutral networks?

lyric vale
warm dune
#

but dont need a neural network for the model have good accuracy

#

models like XGBoost and RandomForest can be great in some cases

#

I mean, obviously a nn will have a accuracy better, but others model are good too, so depends how well you need the model

warm dune
#

i dont tunning or review, so the metrics can be better in this case

#

but an NN always gonna be the better option

jaunty helm
# lyric vale will it give good accuracy or not?

usually no
additionally accuracy is a bad metric, e.g. if your model correctly predicts if the market goes up/down 60% of the time, that doesn't tell you about how much you gained from that. there could be a crash tomorrow falling into that 40% wrong predictions and you'd lose all your money

lyric vale
jaunty helm
#

in general trying to predict market from only the prior stock prices is a bad idea
what could be a good idea is if you collected data from other sources that's not the stock price, but could influence the stock price

lyric vale
jaunty helm
#

as a recent and easy example, when deepseek released, it demonstrated that powerful LLMs can run on less capapble chips, nvidia stocks plummeted
stuff like this, which honestly is usually pretty obvious in hindsight, but in the moment is hard to find

twilit topaz
#

The Darts time series. Were you able to do the tasks easier?

twilit topaz
sharp sierra
#

I have a Q is AI intern is good option as a fresher? ? or not

final kiln
#

yes

#

one thing to be careful about is

#

like, some time back I spoke with someone who heavily invests in the NASDAQ and at the same time also works in AI at deepmind

the nasdaq is heavily weighted towards AI stock, so this persons whole life is basically a thematic bet on AI

jaunty helm
final kiln
#

career choices are investments, diversification is good

warm dune
somber lichen
#

Price-only models are basically fancy coin flips with extra steps. Add real signals (news, vol, macros) or prepare to explain the empty wallet to future you 😭

fallow coyote
#

For image classification, is it better to use a pre trained model or should I create my own model as a way to better understand these models? For context, im doing some practice projects for image classification

final kiln
final kiln
fallow coyote
wooden sail
fallow coyote
#

Tbh, i shouldve bought a 3090 when i first built my pc 😂

wooden sail
#

it depends entirely on how big your problems are, but 12gb vram is enough to do a lot of small scale testing

#

i usually get away with testing small things on a laptop with like 4gb vram, and then i send the full scale problem to a cluster with real ML hardware

#

maybe some extra context that might be helpful: normally i only think of problems with lots of linalg in terms of the biggest matrix i have to store; with ML though, modules like pytorch and jax also build a computational graph to compute gradients, so 12gb vram in ML is not really comparable to 12gb ram in regular computations. you need a bit of slack. just in case you get surprised by a random OOM with a small~ish problem

#

in any case, don't expect to train llms locally, but you can do a lot with 12 gigs

lyric vale
#

which models do big quant firms uses? do they use there custom models?

grand minnow
granite zephyr
#

Assume a logistic regression model or a perceptron where we include the two features x and x².

Then I have a two-dimensional space, where I can easily draw and visualize the decision boundary. It is neither a plane nor a hyperplane.

Since the quadratic term is included, the decision boundary will be curved rather than a straight line, even though both models are still linear in the parameters?

half pulsar
#

You could have bought 2 for the price of one now

somber lichen
spiral falcon
#

Does anyone compile all models with data such as linear containing characteristics, method, etc?

warm dune
untold frost
#

I am trying to build a logistic regression model for a churn dataset in scikit learn, i get 0.87 on the training dataset but when i use the testing dataset i get 0.57.
I am thinking that the training data is overfitting but what should i look out for and what should i study to understand why this is happening?

untold frost
#

yep

warm dune
#

how we can't see the loss in the epochs, prob it's overfitting

#

try to add regularization

untold frost
#

i have tried both l1 and l2, i always forget how to do the code thing

warm dune
#

the shape

untold frost
#

~440k training rows for training data
~64k test rows for testing data
15 columns total

warm dune
#

how it's sklearn and we don't have so many options, i think don't have so many things that you can check

#

i only think about this 4 options. but it's good you see this dataset on kaggle, prob will have others notebooks about

untold frost
#

precision recall
0 0.95 0.22
1 0.53 0.99
this is the testing dataset classification report hope it can help

#

also thanks for the help will look into all four options

warm dune
#

prob the target it's 90/10 or something like that

untold frost
#

training
1.0 249999
0.0 190833
testing
0 33881
1 30493

warm dune
#

if you fix tell me what it's wrong

untold frost
#

ok will do

untold frost
#

so i did a correletion plot and i can see the issue

#

while the training data has variable that are from 20~55 that correlate to churn, the actual test data has 5~11 at best i think it is pretty logical for the model not being able to complete perform since what the model learns from training will different patterns from the testing dataset which has lower correlation for each variable towards churn, i might still be wrong but i think this is true

untold frost
warm dune
untold frost
warm dune
#

did you use the train_test_split?

#

in this case the shuffels it's true

#

so it's more the data, you can use methods to split better the train and test set, or do something with the data like IQR

untold frost
#

i will look into that tomorrow probably, still it is intresting and thanks for the help

jaunty helm
untold frost
#

thanks will do

sudden canyon
#

@solar arrow I removed your message because the project looks like malware, with only an executable available and no source code.

feral meteor
#

ai go brr or somthing idk

#

me after the ai give bad code

#

im starting to under stand why people pay for decent moddles as i spent 1 god dame hours trying to fix a cropping issue as the ai REFUSED to expand the croped area for eatch target enought so make sure they still fit after expanded bye a local screen distortion matrix

#

i have no idea why it could not fathum it

#

i even did the god dame mathes my self

#

oh and if eny one was wondeirng how i got the data uh.. gantry ssytem in scrap mecanic

#

i fitted the exac fov in game and using linear projection and a hand full of hand labled images dumpted points along a ray in a 3d grid to get the data points. need to use the ais output to genrate the rays and re do it as my datas very messy lol

#

after i get this finished as i have a few things i found in testing that will likely thruther improve the ai im try doing mindusrty scemtic auto genration if eny one has ideas as its a very hard thing for ai im fully opten to sugestions as i love keeping notes on nishe ai training methods and ai. i dont care about the bog standard stuff as iv researched the standard stuff already

robust yoke
#

if we train a model with 80 : test 20
what if its like train a model for a where we take a data of 1 year
but in some case like a year end or other case final 20 may differ
how will u over
can i use this as i learning data analyst or database which one will be fine
i need a opinion

bronze wyvern
#

Hello, quick question, I have generated a heat map, as uploaded. I was wondering, how do you people evaluate a heatmap, what are the insights that you check/verify?

I read about multi-collinearity where multiple dependent variables have some relationships among themselves, why is that a problem? Is it a thing here pls

warm dune
#

the model stay “lost” in how uses the weights for each feature

bronze wyvern
#

wdym, like it focuses on only one ? but even if we include these as predictors?

warm dune
#

so the model stay “lost”

#

In tree and ensemble models multi-colli dont have a big impact

bronze wyvern
#

ah ok, so ideally, what does linear regression model want? Like in the predictors/dependent variable, what kind of relationship should exist?

warm dune
pallid badge
#

Hi I am looking for a invitation to the GraphRAG discord? Would anybody happen to know how I can get there?

serene scaffold
#

(Microsoft should be permanently banned from being allowed to name things.)

pallid badge
#

Furthermore, I am looking for a Python library (:-) ), learn more about Graph RAG

serene scaffold
#

maybe they can help you find what you're looking for

pallid badge
#

Have you worked with this software btw?

serene scaffold
#

yes

pallid badge
#

Thank you for the invitation!

serene scaffold
#

yw

half pulsar
pallid badge
#

Would I need to work about scalability? I am just at the beginning of my journey

serene scaffold
half pulsar
#

Its extremely difficult it took me several years

pallid badge
#

And basically, I wonder how to bring multiple ontologies together into one Knowledge Graph for a GraphRAG (not necessiliary the MS thing) 😄

half pulsar
#

So good luck!

serene scaffold
#

I made it a rule in my team that we have to say "MS Graphrag" if we're talking about the microsoft thing
because "RAG with a graph" isn't a concept that microsoft can just own

pallid badge
#

"don't try to prematurely optimize when you haven't even figured out what you're trying to do" will put this on my screen 😄

half pulsar
serene scaffold
#

also I came up with a technique that involves generation-augmented retrieval
my coworker named it SteeleGAR
which is now my fremen name also

pallid badge
half pulsar
#

Just build a structure that you're happy with, the structure itself needs to be modular and well thought out before you think about scaling it

pallid badge
serene scaffold
pallid badge
serene scaffold
half pulsar
pallid badge
#

If I may explain: So I got a list of heterogeneous data sources, they wish to be put into a RAG. To make it more specific for this field, I thought about a GraphRag. Ontologies were selected, but multiple ontologies probably go into 1 Knowledge Graph?
But nobody thought about cleaning the data , how to structure it, and on top of it, previous people thought Excel is the tool of choice when the claimed the created a structured database based on two source, with manual inputs partially

pallid badge
half pulsar
# pallid badge What does this mean please?

When you're building it you're gonna have to do a lot of problem solving so, you need to be careful with what you solve and even what order its solved. You have to have a good idea of what your needs and goals are, and stay aligned its easy to get off track on projects like that, Just make sure you make good use of your time basically. Just have you priorities straight and you'll be fine.

#

Don't overengineer it, just build what you need, complexity creep is a bitch

#

Build something small stable then optimize from there(MVP)

serene scaffold
# pallid badge What does this mean please?

when you're trying to learn about a challenging domain like machine learning and AI, you need to have small victories along the way, or you'll burn out and give up. don't start by trying to build Jarvis, because you won't, and you'll give up before you learn anything.

half pulsar
#

Yep!

#

Math is a big part of this, you're gonna use a lot of it.

#

I think that's the fun part personally.

pallid badge
#

Thank you for your kindness and advice.
I can say, the people who hired me, have no understanding of software dev whatsoever. But they want an AI driven tool in 6 month. No specs given, no requirements defined, no business plan, no USP.
I told them I would need time to learn the software stack, the available libraries . 3 months minium. I was given 0.
Also all of the non-programmers think one could wave the AI wand and a tool is there.

"good idea of what your needs and goals are" - I am working on this as well

Re math: Happy with it, no problem, I have a STEM background and some knowledge in scientific computing

#

Next week I should come up with a strategy, which tools, how to deal with the data sources etc.
Fun fact, the people did not think at all about RAG, just used excel, regex and also manual extractions .

vocal quartz
#

@jade prairie plz give me mic permission

#

@golden marsh give me mic permission

pallid badge
#

@half pulsar and @serene scaffold if you happen to have good starting points, I am ears and would be very grateful. I am looking for good software libraries, best open source. I was looking into Dockling, OpenAlex etc, Ontologies in general.

half pulsar
pallid badge
half pulsar
pallid badge
#

Hm, I have 6months and maybe next year as showcasing date.

#

I guess in the worst case, I will look for another job

half pulsar
#

I'd say be careful with how you calculate the graphs and how you implement it from the core, because the amount of combinatorial explosion you can get very quickly is insane.

pallid badge
#

I am not even there. I would need to find out what are good libaries etc , how to deal with a lot of unstructured data, heterogenous source, then GraphRag. May I ask where you learned your knowledge from?

#

Books, YT videos , chats with colleagues?

bronze wyvern
half pulsar
#

There's a lot of depth here 😂

pallid badge
#

@half pulsar : Now I am confused. Don't you use Python or did you really wrote your own library?

half pulsar
warm dune
#

don't have a fix ideia for what using heatmaps, normally it's to identify corr and somethings that can help you at the model

pallid badge
#

Ok, I express my wish, if anybody has good learning resources for GraphRag (not general MS), please pass them my way

half pulsar
spiral kindle
#

Hello 👋👋👋

pallid badge
#

Many thanks @half pulsar !!!!

spiral kindle
#

Please why can't i speak in this space????

pallid badge
#

I am a bit off , cycling home

#

take care

half pulsar
#

Take care 👋

spiral kindle
#

Why did my speaker 🔊

Muted?

#

Who is replying me????

half pulsar
#

Today I made a big achievement just been in such a happy mood. Took YEARS

solar arrow
#

Does anyone want gemini api for free? Not sure for how long I will give for free
Ping/dm me if want

serene scaffold
#

!ban 985951964779139132 Repeatedly mentioning a project that purportedly offers free API calls to gemini, but which they're distributing on github as an executable with no source code

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied ban to @solar arrow permanently.

limber plover
#

@half pulsar do you tackle emergent behavior or study? What I mean is, from basic implementation of the algorithms. Rather than NN directly. I have been doing experiments on this by having basic robot in a room and it is given basic function such as, left, right, up, down. I then give it basic memory elements so it can remember where it was last. It then has a choice of using it or not. You keep doing this until such system has emergent behavior out of all the small subsystems added. This seems basic but it is interesting. It's also faster for me rather than training NNs for this. You can get a rather sophisticated system behavior.

half pulsar
#

One of the tests I ran was letting it observe objects through a live camera feed in real time. After repeated drops it converged on a consistent downward acceleration from the trajectories, basically rediscovering gravity from observation. Still early experiments though mostly seeing what kinds of structure emerge from simple components interacting with the environment. But I'm soon moving it to larger scale testing on my 3D printer with a full feedback loop.

limber plover
#

@half pulsar that is actually fantastic on what you are describing. What I did was restrict it's movement as if missing a leg to see if could figure out an effective way of moving around the room with just 3 legs. Which it did, it eventually emerged the same pattern after some time as if it had 4 legs. Not quite the same but close so that it could move more effectively around the room avoiding objects. It did make a large data set for it's position and choice of possible positions. Now I guess what I could do is some how feed the data back to it. There is also the matter of pruning as I don't want it to keep data that has not been used for some time as a reference.

half pulsar
limber plover
#

@half pulsar it would be based on least used coordinates. Or position in this case based on x y. With in certain amount of time. Every 30 sec it might delete it. Sort of like stacked. The bottom most stack gets a snip for now. Remember, I am still learning python.

wanton quest
#

Hello guys any inside ir35 contract roles in uk for ai engineer/ data scientist

heavy crow
#

I want to pretrain a embedding network for 2d signal data to produce meaningfull embeddings before using them for a downstream task, what are some techniques for this? I've tried nt-xent which works well for producing a nicely spread out latent space but im unsure if it is actually teaching any meaningful features or relying on specific (to the downstream task meaningless) outliers. Thanks! I've also looked into boyl and simclr.

#

Are there any broader literature review papers for this type of work? Seems like mostly its image based.

wintry brook
#

Hi guys just started with data science and ai.. know a bit of python and sql... Can you guys give me some project idea

thorny solar
#

How have your day been guy's i just finished an Internship search agent with python

#

I will like you guy's to recommend a good platform to kickstart my AI Engineering journey

feral meteor
#

wokring out good lr rate scedualing this seemed to work well

opaque condor
#

do i use .py or .cpp to teach an ai about codeing?

serene scaffold
opaque condor
#

if i gave an ai a .py or a .cpp or a file that allows code to be ran could the ai understand the formating over time and learn to generate the .file of that type?

serene scaffold
opaque condor
#

As long as it's good quality over quantity

And AI understands patterns and you don't have good quality data the AI spits out day that that could be incorrect or dangerous

serene scaffold
warm dune
past bramble
#

I have a list of datasets of closed models generation from hugging face. I need to look at each individual schema and maybe a single sample for each dataset so that I can convert it to a universal format, but I noticed I would have to download the whole dataset. Is it possible to only gather the schema and a single sample for this use case?

wintry brook
final cobalt
#

Autoencoders are absolutely fundamental. The basic idea is this: take an image and run it through a network to produce an embedding, then run that embedding through a second machine to rebuild the original image

#

You might ask - why do something so pointless? Well - if the embedding created in the middle contains enough information to rebuild the original perfectly, you can be certain your embedding contains all of the information about the image. This principle can be applied to anything. If you can encode something such that it can be perfectly rebuilt, you've captured all the information about it. Now you can pass that embedding to other models for manipulation, information extraction, prediction, whatever you need

#

If you're going to do ML, autoencoders are, in my opinion, the place to start

bronze wyvern
#

Hi, I need a quick advice. I recently showed my work to my supervisor for my final year project. He told me to add a novelty feature. What my app is about, it's about animal welfare. So basically, users can create posts and have a little chat, posting images etc.

Now, my supervisor told me to add an NLP stuff to my posts. This is where I wanted some ideas. What kind of things can I add here?

Maybe I can try to train a model that will identify if posts are urgent?
What could be other possibilities? My supervisor told me to categorise positive vs negative post. But don't know if this will fit my context.

Any idea pls

#

Beyond that, is there a recommended tutorial for NLP just to get me started pls

grand minnow
wintry brook
lyric vale
#

can anyone tell me why my model of eminst is giving around 82% accuracy only? i tried using basic neutral network and cnn getting same on both.

jaunty helm
final cobalt
#

It also looks like you're doing rotations. Three layers should be enough for unrotated, unflipped, otherwise unmodified numbers, but you're introducing a very high degree of variance if you allow for those sorts of transforms

#

Three small layers might just not be enough

lyric vale
final cobalt
#

Normalization within the layers

#

Normalize the output from each layer

lyric vale
final cobalt
#

Another approach would be to convolve/pool all the way down to a 1x1 with 62 channels, convert that to a vector, and softmax that

#

Both approaches should work, in theory

lyric vale
#

yes let me try applying softmax

final cobalt
#

My recommendation - add the normalization, and remove any transforms/augments on the data to check if it's just a model capacity issue

final cobalt
#

If that doesn't work - bring it to ChatGPT. Your model looks correct to me, if perhaps a little small. But ChatGPT is very good at spotting eenie-weenie bugs that are hard to spot

final cobalt
#

Well, for tiny images like the one you're working with, three should be enough if the dataset isn't too variable

#

More variation = more information = need more neurons to capture it all and identify features

final cobalt
# lyric vale sure

Oh! You should also instantiate your weights. I'm a little rusty, I havn't built a model in a while

#

So you'll have to look it up, but, it can make a difference. That said, 82% is pretty good. Your model is learning. It's just that at some point, the signal either stops getting through to or the model simply doesn't have the capacity. You can tell which by examining the gradients. If the manitudes of the gradients explode or vanish to zero, you've got a structural issue

#

Not sure how you could with such a small model

#

Your learning rate is also a little high

#

Normally you wouldn't expect to see an LR above 0.0002

opaque condor
serene scaffold
heavy crow
#

What are your thoughts on progressive dropout to improve generalization and convergence speed? So starting at 0% dropout and increasing over time.

bronze wyvern
warm dune
warm dune
subtle lotus
warm dune
wintry brook
subtle lotus
#

And with Machine Learning I can build AI like Grok and DeepSeek

jaunty helm
warm dune
half pulsar
#

You will hit compute limitations and data limits before you even get far enough into it.

subtle lotus
#

That's good

#

Ok

robust yoke
bronze wyvern
#

Hello, can someone explain why do we have to one-hot encode the label/target in NLP pls, what if we omit that and use a word index for e.g?

serene scaffold
iron basalt
# bronze wyvern Hello, can someone explain why do we have to one-hot encode the label/target in ...
["cat", "dog", "bird"]

cat  -> 0
dog  -> 1
bird -> 2

dog -> [0, 1, 0] (length of vocab)

Lets say our NLP stuff ends with giving probabilities:

Predicted probabilities:
[0.2, 0.7, 0.1]

P(cat) = 0.2
P(dog) = 0.7
P(bird) = 0.1

This lines up nicely with our one-hot vector (if we want to compare them and then update our system).

Other problem with having it being a single number is that it implies an ordinal relationship that does not exist: bird > dog > cat.

Other problem is that if you had say 3 neurons that you want to each respond to a different animal, it becomes much harder than if they have 3 inputs (from the one-hot) with 3 weights (instead of 1). In that case the weights become easy to learn, [1, 0, 0] to only respond to cat, [0, 1, 0] to respond to only dog, [1, 1, 0] to respond to both.
spring field
#

is it a cat?
is it a dog?
it's a catdog

celest hedge
fading knoll
bronze wyvern
bronze wyvern
bronze wyvern
#

Hello, can someone confirm if the following workflow is correct in NLP when it comes to classification task pls:

So first step is to take the raw text and tokenize them.
This result in a sequence of text.
Next step is to convert this sequence of text into a sequence of numbers, word -> index so that they can be mapped to a vector embedding.

(Now, when we say map to a vector embedding, are these embeddings initialize at random?)

We are now at the embedding layer where we have a sequence of embeddings. These embeddings are feed to neural net to learn complex pattern. Here, does the embeddings changes? Kind of like when we change weights during backpropagation?

Then last layer would be a sigmoid for binary classification or softmax for multi classification.

When it comes to the labels used for inference, the labels are also in a numeric format, like 1,2,3. Then these are one hot encoded.

Why are they one hot encoded though? Is it just to fit the shape of the list of probabilities we obtained?

heavy crow
#

If you use the indices directly you introduce ordering that is not intended. E.g. Cat (index 5) being "smaller" than dog (index 10)

#

Generally its nicer to have everything be arond the same magnitude. But with indices you have wildly different magnitudes. Some models have a vocab size of over 200k. So the model would need to predict the number 1 and also the number 200k.

#

The embeddings are not always initialized randomly. Sometimes you use pretrained embeddings such as from Bert or word2vec

bronze wyvern
#

Yep I see, thanks !

hybrid crown
#

Hello guys
Hope everyone is doing well
I just wanted to know what's the best cloud space to deploy aka hosting llms in?
And the best one to train them

#

Best also equal cheapest for me 😢
But still i need something that works great not just cheap

serene scaffold
bronze wyvern
#

yup I see, thanks !

warm dune
serene scaffold
mossy osprey
#

I was making this stock trading ai, firstly probably should explain how it works i took abt 10 years of stock market days and made the ai simulate trading for like 100 days at a time, then take the wins and losses to build a new ai on a infinite loop, but a problem im running into is the ai seems to take only the data from the most recent tade data which causes it to build an ai that only works well and has a high winrate in that specific time/dataset, i am curious abt how to make it more versatile and durable in any environment?

gleaming lake
#

Yoo guys, I'm here to understand on how ai are able to like be able to get hold of the images for example being able to see a image of a certain skin issue and being able to identify it, how does it do that and what math is required behind it?, does it compare the images vectors of each pixel with the test picture? Or something else.

serene scaffold
#

and then you train a convolutional neural network on those images, with their respective labels

#

in this approach, when you go to use the model "for real", there's no direct comparison between new images and the ones that they were trained on

opaque condor
#

The idea is that in a descending fashion each layer of a transform removes one heading so we have four on four layers by the second to last layer you arrive at 1

Which is somewhat similar to how a visual cortex behaves

But if the prediction is wrong it adds intention head to the layer
Incorrect layer by layer making it more like a human brain

haughty veldt
final cobalt
#

So I've been working away

#

And I think I've figured out how to do diffusion on a graph

hardy zodiac
#

hey everyone ! , my self om and i am new in this server...

opaque condor
haughty veldt
shy stag
#

I am thinking of making a visual recognition program is anyone here experienced with this

serene scaffold
serene scaffold
shy stag
#

what I really wanna is the programming part

#

How do I prepare a dataset, do I even need to prepare one?? Do I need to train a model and all of that stuff

serene scaffold
serene scaffold
shy stag
shy stag
#

I will check and see what I can find

pallid badge
#

@half pulsar : I wanted to say thank you! I am reading about the libraries you told me. NetworkX looks very interesting

iron basalt
pallid badge
#

@half pulsar Also PyKEEN . I was wondering if one could combine this with ModernBert

half pulsar
heavy crow
#

With adamW and a decaying LR should the gradient norm decrease over time? What does it mean if the gradient is more or less constant? And the loss is decreasing slowly. Is it circling a minimum?

mossy osprey
#

I’m building a crypto trading AI and its learning speed is hella slow idk its supposed to be like this but its has like a 100x slower learning rate compared to my stock market trading AI which was converging in only a 8 hours of training. While this one is growing it its at a much much slower rate its been over 5 hours and its winrate only increased by a measly 0.2.

mossy osprey
#

😭 ima take the laughing reaction as its not a good sign

#

Atleast its only going up😭 ✌️

mossy osprey
#

✌️

limber sleet
#

how long will ts take 😭

wicked basin
rain egret
#

Great

#

I'm supressed

dusky abyss
#

if the inputs to a model during training are scaled should scaling also be used for inference?

mossy osprey
mossy osprey
#

😭 ✌️

brazen valley
#

Book for linear algebra?

wooden sail
#

check the pins

wooden sail
bronze wyvern
#

Hi, quick question. I was reading about symmetric and asymmetric semantic search.

When it comes to symmetric semantic search, I thought that the query and the document should be of the same length. To some extent it's true but can also differ by 1 or 2 words. What matters is the natural flow/intent of the sentence?

For e.g, if as query I typed: "How to learn Machine learning", in the document I expect something like "Learning machine learning"?

In contrast with asymmetric semantic search, we try to compare the content of the query and the document, like if query is "What is machine learning", as document I can have: "Machine learning is a subset of Artificial intelligence which involves..."

Can someone confirm if the above statements are correct pls... would really appreciate if someone can add anything up to this if there is any clarification missing.

warm dune
#

can someone explain the difference between CNNs and RNNs?

lime grove
#

is it so hard to look this up on Wiki?

#

convolution vs. recursion.

tidal bough
#

there's no similarity - a CNN is a neural network with convolution layers (see pytorch's Conv2d for an example), a RNN involves a hidden state that gets repeatedly fed back into the network.

warm dune
tidal bough
#

you could sort of represent a convolutional layer as a sparse ordinary layer, but it's a bit of an unnatural way to look at it

#

a convolutional layer is for data which has locality, so to say - where nearby cells are related. images, video, occasionally audio

#

that's why it makes sense to do what a convolutional layer does (have each cell of the output only depend on a small neighbourhood of it), whereas dense layers are used where there's no reason to expect locality.

tidal bough
swift terrace
#

hello, can anybody help me to fixing my error? i want to use gemini api key but the terminal say "the api key is not provided"

shy stag
#

Yo, so how does numpy even work

#

how can it convert a dataframe into arrays that can be understood by a computer

iron basalt
#

It just gives a reference to the array to Numpy.

wooden sail
iron basalt
#

Slight addition to that, Apache Arrow stores those columns in chunks.

#

But that does not change much (process them one at a time or in parallel as intended).

shy stag
#

Oh

gilded depot
weary river
cursive totem
#

im curious about some statistics - how many people out here are into deep learning, how many of them are using pytorch and how many are using tf in contrast to classic ml with things like pandas, sklearn, sql and whatever there are

dusky hemlock
#

Hey y'all, I've been researching the limitations and capabilities of my 1660 Ti 6Gb GPU. I don't mind using existing LLMs for specific purposes, but I would like to build (and maybe train??) a model which can at bare minimum maintain short conversational english. It doesn't need to have thinking or reasoning, tool calls or agentic functionality. I am hoping this is possible, either via a custom Python implementation or using existing solutions which can be modified for the aforementioned basic conversations. Does anyone have ANY tips for me? My previous attempts technically worked, but from my uneducated perspective the results were quite poor. I don't know exactly what to be expecting or what to look for yet. Any tidbits of info are appreciated. Thanks!

serene scaffold
#

and if you can use quantization to represent each parameter in 16 bits, that would help even more.

dusky hemlock
#

Lol i opened the link, second sentence first words "Pretained model"

serene scaffold
#

and even if you could, your hard drive probably can't store enough training data to do it, and you'd have to leave it running for weeks to train

dusky hemlock
#

That does make sense

placid magnet
#

you'd do better with colab, bigger gpu on free tier. (T4 15gb - lmited time, depends on availability)
as long as your carful about how you manage things... short runs, checkpoint often etc.etc.

but your STILL NEVER gonna make a LLM from scratch that can do anything worth even speaking about other than being able to say you made one...
not on anything consumer or free..

you could spend a bunch on A100/H100 time.. but that gets costly... fast.. and i wouldnt even consider that until you have some experience or you'll just waste credits..

#

there are other options also.. but im not so familiar with free offerings or pricing and availability.

but on the 1660 alone your essentially limited to just running inference on small models
(you could probably do some stuff with 1b-3b models.. but i dont think it would be worth it)

half pulsar
#

LLMs suck why would you want to make one from scratch.

#

There's much better things to build, world is too stuck on it for quick money.

placid magnet
#

that's a pretty subjective view..

if you have interests in ML/AI then it's actually quite interesting to look at doing SFT finetuning (i do it myself personally.. only at 7B-12B size though)

making you own is a big step though

#

But as a learning experience if your into the field..

why not..?

half pulsar
#

Transformers suck

dusky hemlock
dusky hemlock
half pulsar
#

You don't need LLMs or Transformers or Pretrained Models.

dusky hemlock
placid magnet
dusky hemlock
half pulsar
placid magnet
dusky hemlock
#

I learn best when facing challenges, so that's inherently incorrect in my case. I appreciate the advice regarding small and realistic wins, though. I do realize that now, since folks have been talking about the very low probability of being able to do what I originally intended.

placid magnet
half pulsar
# placid magnet yeah but thats not even what you have been saying.. youve been going.. "why llm...

That’s not what I’m saying. I'm not saying "LLMs suck so don't even bother" I'm just saying that jumping into building a full transformer/LLM scratch is a terrible starting point if the goal is to learn. Those systems are the result of many years of work between many fields, If someone skips the fundamentals and goes straight to LLM, they'll end up with just a bunch of Libraries without actually understanding what's happening.. If the goal is learning ML/AI, it’s far more productive to start with the underlying algorithms, gradient descent, simple neural nets, attention mechanisms, optimization, etc and experiment there first. Once you actually build and understand those pieces, building and modifying to achieve better results makes more sense. Fundamentals then thinking out the box is where the magic happens.

placid magnet
#

anyway it's pointless to argue..

i actually agree with that last point..

but everyone learns differently

half pulsar
#

@serene scaffold

spring field
serene scaffold
#

!clban 1482509195642146869 spam account

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied ban to @grim turtle permanently.

past meteor
#

!cleanban 1482509195642146869

arctic wedgeBOT
#

:x: User is already permanently banned (#108860).

spring field
#

lol

obtuse mauve
#

no experience needed😆 Wow

placid carbon
#

Ello

obtuse mauve
#

New here😄

#

Do you guys use Codex too?

placid magnet
#

been using it a bit lately.. more than before..
But these things are still not really reliable enough IMO, they can help for "the super common things" we all get tired of doing..
and they are improving. but still, better to do things yourself often

#

copilot/codex do make good doccomment writers though...
well.. untill they dont

obtuse mauve
#

I mainly use it to discuss and review my projects, but yeah it's not good enough for a whole project. I overused it in a project once and in the end had to fix half the script because Codex mixed stuff up

half pulsar
placid magnet
#

generally my position is.. if your learning.. not really a good idea,
doesnt really help you learn and they still make to many mistakes, if you do learn you might just learn bad things...

but if you already know, and can properly review stuff. they can be helpful

obtuse mauve
half pulsar
#

I hadn't have any luck with Codex, I've tried in in VS but maybe that's just not a good place to use it?

half pulsar
obtuse mauve
placid magnet
#

gpt5.4-codex in codex in vsc codex app is not bad (recently)
though personally i do thing through RooCode -- which is a great harness for it much better prompt management IMO

obtuse mauve
placid magnet
half pulsar
#

Yeah I've liked the new 5.4 Model, It's been quite good at documentation

obtuse mauve
obtuse mauve
half pulsar
#

Yep, and using Claude is also expensive but it seems the best for handling complexity.

half pulsar
obtuse mauve
#

fair

half pulsar
#

Github Copilot 5.4 + Claude has been great for documenting my Large Codebase hasn't gotten confused yet, only started using it a few months ago, saves a lot of time and that's quite valuable to me. But I'd never let it make modifications especially alone, It needs to be "steered" often.

placid magnet
#

well some of thats down to how you frame tasks and your prompting skills.

but yeah honestly.. thats still the weak point of a lot of coding agents

i DO have a few somewhat complex projects ive been doing on the side, which are entirely Codex 5.4 just to test how good things are now (every model release i give one a project to do to see how things are shaping up)
and 5.4-codex has been surprisingly competent
But i still don't think these models are ready to be trusted yet..
IMO AI shouldnt "replace" you doing your work anyway.. when they get good they will be great as "accelerators" lets you do things faster, but i dont think they will ever really match a human dev with real world experience.

pseudo lark
half pulsar
#

Documentation has always slowed me down the most, I'd prefer to not skip that part

obtuse mauve
half pulsar
#

For learning it might be trusted enough as a "review pal"

obtuse mauve
#

But you definitely shouldn't use to learn

half pulsar
placid magnet
obtuse mauve
half pulsar
#

That's the problem you run into is when it doesn't actually know, it's just going to make it up, make it sound convincing after all its a text predictor not a thinker.

#

It doesn't know what it doesn't know

obtuse mauve
#

What it handled surprisingly well though is the aerodynamics presets

obtuse mauve
half pulsar
#

Maybe some day ChatGPT will learn what "I don't know" means 🤣

placid magnet
#

the other thing we all have to remember is that, it's a bell curve

it trains on a huge distribution of code.. a LOT of that code comes from public repositories and code katas and similar.
this means that the output your getting is the most statistically likely.. and on the bell curve.. thats usualy a fair distance from the best stuff..

half pulsar
#

They forgot to put that into the training dataset 🤣

placid magnet
#

ohh that one is entirely OUR fault.. (i mean humans)

RHLF trains models not to show to much uncertainty. this biases them toward provide an answer.. NOT ask for more information..

obtuse mauve
placid magnet
#

they do try to offset it somewhat..
but if it's asking to much. or showing to much ambiguity.. thats bad for the powerpoints 🤣

half pulsar
#

LLMs are architecturally flawed and no company wants to face that, we're just watching them try to work around and band-aid "fix" its flaws.

obtuse mauve
#

They hope that it'll somehow somewhat work🤣

placid magnet
#

i dunno if i agree with that entirely

as yet they have been one of the most effective ways to do things...
is it perfect, yeah no...
but, it's better than a lot of the older methods in the ways that the leaders in the field seem to care about.. 🤷

#

I do personally think that we settled on things and put a whole lot of faith in them though..
i dont think they will be the answer to AGI and other such things.. not really

half pulsar
#

Good at natural language and that's how LLMs demonstrate emergence. But we can't try to make it out to do more unless you want to change the architecture. I believe that LLMs are just the "workers" while it still needs a true brain.

obtuse mauve
half pulsar
#

I agree that they are not a answer to AGI not even close!, AGI wouldn't need a pretrained model.

obtuse mauve
#

But a big problem in the ai industry is the ai overuse, I mean Meta, we do not want Meta ai in fucking whatsapp

placid magnet
half pulsar
#

LLMs are enough to fabricate the appearance of Intelligence but it's not the Intelligence itself. They look cool so its quick money for them but its still afar from the true goal we all want.

#

We need Colossus and Guardian

obtuse mauve
#

Well fellas gotta go, nighty nighty😴

half pulsar
half pulsar
obtuse mauve
warm dune
#

for nn, its better to starting in tf and keras, or pytorch?

serene scaffold
warm dune
serene scaffold
warm dune
serene scaffold
#

the neural network concepts will still be up-to-date, but no one actually uses tensorflow anymore.

warm dune
serene scaffold
warm dune
serene scaffold
hasty lynx
#

Hello im building an AI diffusion LLM similiar to mercury 2 by Inception; if you want to help or partecipate in this project, dm me!

iron basalt
half pulsar
#

Man I can't believe 3090s last year were worth 500 dollars and DDR4 was borderline Ewaste and now they're Gold again. Horrible time to buy hardware

#

Can't even build a Xeon with 256GB RAM(Either DDR3 or DDR4) Server for a few hundred dollars anymore.

#

Infinite money loop for datacenters though now that I have to rent 🥴

dusky hemlock
proven pier
#

Does anybody have a resource I can read about generating embedding vectors for Attention? Everything I read about has to deal with language/word processing specifically. In this case, you can define your embedding dimension as the length of all vocabulary. My feature space instead deals with only continuous numbers (which I normalize)

spring field
#

do you even need embeddings if you feature space is reals? like the reason you use embeddings is so that you can convert text into a numerical value, but if you already have the numbers, you don't need to embed them

#

also, regarding dimension matching vocab size, that sounds like one-hot encoding, which you wouldn't really use for text anyway, that'd be far too many dimensions, text embeddings are like, idk 256 to 4096 dimensions

#

for categories, sure
but again, if you already have numerical values, you don't need to embed them separately

proven pier
spring field
#

I think you should be able to just skip the embedding part and pass in vectors of your numerical features

proven pier
abstract thicket
#

What's the best in class unsupervised clustering algorithm. I have very noisy data, a lot of it (2 million) with like 16 features. (also pretty sure that there must be atleast a 100 clusters in it)

#

My current findings using GMM, reveals k=13 for layer 1 and then redoing clustering on these parent clusters resulted in on average 10 more subclusters. (often sharing common semantics with other subclusters of other parent clusters)

lyric vale
#

what should i learn so i can able to get good rank in kaggle? anyone having any idea?

serene scaffold
warm dune
#

manifold is basically the 'shape' that data has when placed in a visualization space? Or is there something more to it?

proven pier
bronze wyvern
#

Hi, I was reading a bit about why casing does matters in NLP, especially with the example of Apple vs apple. I was wondering, for a news classification, would casing matters?

jaunty helm