#general | Arena | Page 82

stray aspen Aug 5, 2025, 7:36 PM

#

thats nice

hollow imp Aug 5, 2025, 7:39 PM

#

...

#

Use jee advanced for the benchmarks

#

Then see ☠️ 🔥

stray aspen Aug 5, 2025, 7:39 PM

#

glm 4.5 is still better tho

wicked root Aug 5, 2025, 7:39 PM

#

i hope so

stray aspen Aug 5, 2025, 7:40 PM

#

qwen 3 think is absolute trash for coding

open mountain Aug 5, 2025, 7:41 PM

#

stray aspen qwen 3 think is absolute trash for coding

He has always been like this, there is little modern training data

whole wagon Aug 5, 2025, 7:47 PM

#

The openAI 120B model scores 44% in aider on highest reasoning mode

#

That is awful

#

It's basically a paper release. To say they open sourced smth

#

And I don't know what all that hype was about

stray aspen Aug 5, 2025, 7:50 PM

#

thats what im saying

#

china is still on top

#

and top it off it only accepts text

solid brook Aug 5, 2025, 7:51 PM

#

Guys

karmic bough Aug 5, 2025, 7:51 PM

#

stray aspen and top it off it only accepts text

What's the best model for coding ?

solid brook Aug 5, 2025, 7:51 PM

#

The horizen beta model wasn't the open source model

stray aspen Aug 5, 2025, 7:52 PM

#

karmic bough What's the best model for coding ?

i dont know brother

whole wagon Aug 5, 2025, 7:52 PM

#

Ofc

stray aspen Aug 5, 2025, 7:52 PM

#

maybe claude 4 sonnet or the new opus

solid brook Aug 5, 2025, 7:52 PM

#

I tested. Their kmowledge cutoff is diffrent

karmic bough Aug 5, 2025, 7:52 PM

#

stray aspen maybe claude 4 sonnet or the new opus

What abt Gemini 2.5

whole wagon Aug 5, 2025, 7:52 PM

#

solid brook I tested. Their kmowledge cutoff is diffrent

Yeah well it's obvious. Because horizon beta is not bad

#

Kek

solid brook Aug 5, 2025, 7:52 PM

#

Sad

whole wagon Aug 5, 2025, 7:53 PM

#

The arch also has 0 Innovations

#

It's basically just taken from Chinese LLMs

#

kekw

#

I guess they wanted to keep anything good for themselves

#

I don't see the point in the model ngl. Apart from PR

primal orbit Aug 5, 2025, 7:56 PM

#

I'm still getting "potato" in the arena. So it was not open source model. Or not the one that got released.

blazing bison Aug 5, 2025, 7:56 PM

#

ok after 1 hour testing openai oss models, bad vibes for code

plush anvil Aug 5, 2025, 8:05 PM

#

hello

wheat onyx Aug 5, 2025, 8:05 PM

#

My guess is it's good for summarization and maybe some basic math. Worth checking. If it can't do those well, that would be disappointing

daring rover Aug 5, 2025, 8:06 PM

#

what about swe bench ?

#

polyglot seems like it's too contrived

daring rover Aug 5, 2025, 8:07 PM

#

solid brook The horizen beta model wasn't the open source model

it was apparetnly gpt 5 low effort

wheat onyx Aug 5, 2025, 8:10 PM

#

I'm excited for the lower hallucination rates more than anything

#

The modest bump in reasoning is nice too sure

daring rover Aug 5, 2025, 8:17 PM

#

wheat onyx I'm excited for the lower hallucination rates more than anything

the hallucination rate is higher for oss

#

do you know if it's lower with gpt 5/

wheat onyx Aug 5, 2025, 8:17 PM

#

daring rover the hallucination rate is higher for oss

Gpt5 is much lower

daring rover Aug 5, 2025, 8:17 PM

#

nbice

daring rover Aug 5, 2025, 8:17 PM

#

wheat onyx Gpt5 is much lower

source?

hallow ridge Aug 5, 2025, 8:18 PM

#

I have an instagram account with over 300k and I don’t want it anymore

wheat onyx Aug 5, 2025, 8:19 PM

#

daring rover source?

They announced an entirely new method with verification a bit ago

daring rover Aug 5, 2025, 8:20 PM

#

wheat onyx They announced an entirely new method with verification a bit ago

yeah but how do you know halllucinations are lower

ember sentinel Aug 5, 2025, 8:23 PM

#

Yo guys, does anybody know if the horizon models on OpenRouter were the GPT-OSS models?

pulsar aurora Aug 5, 2025, 8:24 PM

#

Any free Ai agent tool that let's us use Ai on browser to perform any tasks?

novel flame Aug 5, 2025, 8:26 PM

#

ember sentinel Yo guys, does anybody know if the horizon models on OpenRouter were the GPT-OSS ...

Definitely not. Horizon Beta is much stronger. It could be GPT-5 low effort, that would make sense for my testing at least

wheat onyx Aug 5, 2025, 8:27 PM

#

daring rover yeah but how do you know halllucinations are lower

https://openai.com/index/prover-verifier-games-improve-legibility/

wheat onyx Aug 5, 2025, 8:28 PM

#

novel flame Definitely not. Horizon Beta is much stronger. It could be GPT-5 low effort, tha...

Looks good for coding. Not something I'll be using, but unfortunately if it beats opus 4.x then they may have a cash flow crisis

neon idol Aug 5, 2025, 8:28 PM

#

Did you look for claude 4.1?

#

The new chatbot

wicked root Aug 5, 2025, 8:30 PM

#

anyone know how rate limiting works in Gemini pro?

primal orbit Aug 5, 2025, 8:32 PM

#

https://i.snipboard.io/mF1nyf.jpg

#

first time seeing such response 😄

jade egret Aug 5, 2025, 8:33 PM

#

is claude 4.1 opus good

#

it js like a upgrade opus 4 ig

stray aspen Aug 5, 2025, 8:37 PM

#

it was a very minor upgrade

random shard Aug 5, 2025, 8:44 PM

#

GPT OSS is wow

#

chefs kiss

#

great little coding model

stray aspen Aug 5, 2025, 8:45 PM

#

nah its mid

random shard Aug 5, 2025, 8:45 PM

#

no

#

most models won't fit on that

#

you might get lucky and someone will quantize it enough to barely fit on 6gb, but realistically upgrade your GPU

#

or accept CPU inference

#

Possibly, but unlikely tbh

#

Quantized 4bit right now uses 10gb

#

Depends on your CPU, RAM, quantization

#

My experience with smaller models on a 128core AMPERE box has been, 5 - 10 TOK/s

#

So, don't expect fast.

#

Probably like, 1 - 2 TOK/s?

#

Better than nothing

#

Try it with Llama.cpp CPU mode

#

never used it.

#

keep your expectations is low is all I can really say

#

I personally run models on a M4 Mac Mini

#

usually using exo so I can cluster it with my M1 Pro MBP

#

Llama.cpp now supports clustering via RPC but it's very experimental

#

exo was designed from the ground up for clustering

#

you're not going far with that

#

You're gonna wanna upgrade

#

I'm using a RTX 4070 Super Ti 16gb, and even then I can't fit most models

#

the ones I can, aren't that fast.

#

I like mac clusters for energy efficiency

#

if you don't care about energy efficiency, go NVIDIA

#

idk

#

you just gotta try

#

🤷

#

I use finetuned for chatting models

#

I'm really liking GPT-OSS tho

#

It's a nice model

#

not a expert at anything, but well rounded.

#

o3 mini -> GPT OSS 20b

#

o4 mini - GPT OSS 120b

whole wagon Aug 5, 2025, 8:53 PM

#

Tbf. The openAI open source model is efficiency SOTA. It has only 5.1B active

random shard Aug 5, 2025, 8:54 PM

#

no one knows

random shard Aug 5, 2025, 8:54 PM

#

whole wagon Tbf. The openAI open source model is efficiency SOTA. It has only 5.1B active

it's just MOE

whole wagon Aug 5, 2025, 8:54 PM

#

It is much smaller than qwen3 also

random shard Aug 5, 2025, 8:54 PM

#

I wish someone would implement Apple's papers about streaming LLMs from disk to RAM

whole wagon Aug 5, 2025, 8:54 PM

#

It performs just below the big qwen3 model and half the params, way less active

#

Qwen is 235B with 22B active

random shard Aug 5, 2025, 8:55 PM

#

I'd say it's better than qwen3 by a long shot

#

my experience with qwen3 is it never follows instructions.

little narwhal Aug 5, 2025, 8:55 PM

#

Knowing them they’ll probably call it o5

random shard Aug 5, 2025, 8:59 PM

#

little narwhal Knowing them they’ll probably call it o5

No. It’s GPT 5 confirmed.

#

No. Any storage.

#

It would be ideal with nvme

devout vault Aug 5, 2025, 8:59 PM

#

random shard No. It’s GPT 5 confirmed.

are you the yupp ai guy?

#

what even is yupp ai

random shard Aug 5, 2025, 8:59 PM

#

But you can run a LLM on a pentium 2 pretty fast if you really want

random shard Aug 5, 2025, 9:00 PM

#

devout vault are you the yupp ai guy?

No

#

What is YUPP?

lone relic Aug 5, 2025, 9:05 PM

#

prboably

lone relic Aug 5, 2025, 9:05 PM

#

random shard o3 mini -> GPT OSS 20b

ye thats right

lone relic Aug 5, 2025, 9:05 PM

#

random shard o4 mini - GPT OSS 120b

thats right too

random shard Aug 5, 2025, 9:05 PM

#

lone relic prboably

GPT 5 will be non reasoning

leaden palm Aug 5, 2025, 9:06 PM

#

random shard GPT 5 will be non reasoning

????

lone relic Aug 5, 2025, 9:06 PM

#

honestly openai def did set a bar for open source models with oss

obsidian shell Aug 5, 2025, 9:06 PM

#

random shard GPT 5 will be non reasoning

who lied to you?

random shard Aug 5, 2025, 9:06 PM

#

leaden palm ????

???

leaden palm Aug 5, 2025, 9:06 PM

#

iirc at the start of gpt-5, its sole purpose will be to route to reasoning models when appropriate

lone relic Aug 5, 2025, 9:06 PM

#

yeah exactly

random shard Aug 5, 2025, 9:06 PM

#

obsidian shell who lied to you?

What would they make it a reasoning model. It’s a non reasoning model as we understand now. The o series models reason.

leaden palm Aug 5, 2025, 9:06 PM

#

random shard What would they make it a reasoning model. It’s a non reasoning model as we unde...

it's not a normal continuation of the gpt series

devout vault Aug 5, 2025, 9:06 PM

#

random shard What would they make it a reasoning model. It’s a non reasoning model as we unde...

Gpt -5 is a model integrated by multiple models

lone relic Aug 5, 2025, 9:06 PM

#

random shard What would they make it a reasoning model. It’s a non reasoning model as we unde...

they r changing the game

random shard Aug 5, 2025, 9:06 PM

#

leaden palm iirc at the start of gpt-5, its sole purpose will be to route to reasoning model...

So non reasoning

lone relic Aug 5, 2025, 9:07 PM

#

devout vault Gpt -5 is a model integrated by multiple models

yep

obsidian shell Aug 5, 2025, 9:07 PM

#

random shard What would they make it a reasoning model. It’s a non reasoning model as we unde...

what makes you certain it will be non reasoning?

leaden palm Aug 5, 2025, 9:07 PM

#

gpt-5 is more of a system than a "non reasoning model"

devout vault Aug 5, 2025, 9:07 PM

#

GPT-5 is releasing this week

random shard Aug 5, 2025, 9:07 PM

#

leaden palm it's not a normal continuation of the gpt series

We don’t know yet. None of the continuations have been super consistent

devout vault Aug 5, 2025, 9:07 PM

#

leaden palm gpt-5 is more of a system than a "non reasoning model"

Ye

lone relic Aug 5, 2025, 9:07 PM

#

ye

#

bro its exciting lowk

#

and i wonder how it will do against gpt 4.5

#

and who it will be available to

random shard Aug 5, 2025, 9:07 PM

#

obsidian shell what makes you certain it will be non reasoning?

Sam Altman posted a screenshot of its output already. Also, the GPT-X have consistently been non reasoning.

lone relic Aug 5, 2025, 9:07 PM

#

random shard Sam Altman posted a screenshot of its output already. Also, the GPT-X have consi...

show ti

#

*show it

random shard Aug 5, 2025, 9:08 PM

#

lone relic and i wonder how it will do against gpt 4.5

Infinitely better I assume.

little narwhal Aug 5, 2025, 9:08 PM

#

I think after GPT-OSS and GPT-5 OpenAI will probably run out of steam for a few months

lone relic Aug 5, 2025, 9:08 PM

#

tbh gpt 4.5 had too much hate and is underrated

leaden palm Aug 5, 2025, 9:08 PM

#

random shard Sam Altman posted a screenshot of its output already. Also, the GPT-X have consi...

for that query, it decided to not reason.

lone relic Aug 5, 2025, 9:08 PM

#

leaden palm for *that query*, it decided to not reason.

wait where did u find rthat

leaden palm Aug 5, 2025, 9:08 PM

#

lone relic wait where did u find rthat

https://x.com/sama/status/1952071832972186018

Sam Altman (@sama)

@nicdunz turns out yes!

blazing bison Aug 5, 2025, 9:08 PM

#

sam post

stray aspen Aug 5, 2025, 9:09 PM

#

its on twitter

random shard Aug 5, 2025, 9:09 PM

#

daring rover Aug 5, 2025, 9:09 PM

#

wheat onyx https://openai.com/index/prover-verifier-games-improve-legibility/

this is like matching some alignment proposals people made years ago

#

that's pretty neat

random shard Aug 5, 2025, 9:09 PM

#

leaden palm for *that query*, it decided to not reason.

Do you have solid evidence it will be a reasoning model?

devout vault Aug 5, 2025, 9:09 PM

#

gpt 5 seems like the future tbh

lone relic Aug 5, 2025, 9:09 PM

#

hopefully it is free for all of us

devout vault Aug 5, 2025, 9:09 PM

#

random shard Do you have solid evidence it will be a reasoning model?

Gpt 5 will be a reasoning and not reasoning model

random shard Aug 5, 2025, 9:09 PM

#

lone relic tbh gpt 4.5 had too much hate and is underrated

It had hate because it was a stupid model.

devout vault Aug 5, 2025, 9:09 PM

#

At the same time

blazing bison Aug 5, 2025, 9:09 PM

#

lone relic hopefully it is free for all of us

it's not

lone relic Aug 5, 2025, 9:09 PM

#

devout vault Gpt 5 will be a reasoning and not reasoning model

ye

blazing bison Aug 5, 2025, 9:09 PM

#

maybe in arena it will be

lone relic Aug 5, 2025, 9:09 PM

#

ay lmarena got us tho for that

devout vault Aug 5, 2025, 9:10 PM

#

lone relic hopefully it is free for all of us

Sam Altman said it will be free for everyone so yea

random shard Aug 5, 2025, 9:10 PM

#

I doubt GPT 5 will be reasoning. They just sorta fixed model naming.

lone relic Aug 5, 2025, 9:10 PM

#

devout vault Sam Altman said it will be free for everyone so yea

oh ye i forgot

blazing bison Aug 5, 2025, 9:10 PM

#

@devout vault he didnt said that, he said that free will receive something, plus more inteligence, and pro even more

devout vault Aug 5, 2025, 9:10 PM

#

Ph

#

Oh

random shard Aug 5, 2025, 9:11 PM

#

Why would they backtrack and Use the naming scheme from a non reasoning model.

#

for a reasoning model

lone relic Aug 5, 2025, 9:11 PM

#

i bet hes gna do smth like gpt 5, then gpt 5 pro for paid users or smth

devout vault Aug 5, 2025, 9:11 PM

#

random shard Why would they backtrack and Use the naming scheme from a non reasoning model.

I like the not reasoning models more tbh

#

They are more fast

random shard Aug 5, 2025, 9:11 PM

#

I don’t see how they’re going to improve on O3.

lone relic Aug 5, 2025, 9:11 PM

#

random shard Why would they backtrack and Use the naming scheme from a non reasoning model.

its reaoning and not reasoning

random shard Aug 5, 2025, 9:11 PM

#

O3 has been the best one so far.

#

It’s well rounded.

lone relic Aug 5, 2025, 9:11 PM

#

just because something is reasonning does not mean it is the best

#

tho tbh o3 is darn fast for a reasoning model ngl

#

at least much much faster than o1

random shard Aug 5, 2025, 9:12 PM

#

lone relic just because something is reasonning does not mean it is the best

No, it’s the best because it does an amazing job at every task I’ve thrown at it to date

#

O1 was impressive. O3 is insane.

lone relic Aug 5, 2025, 9:12 PM

#

random shard No, it’s the best because it does an amazing job at every task I’ve thrown at it...

haha ive acc seen it screw up a bunch of times

lone relic Aug 5, 2025, 9:12 PM

#

random shard O1 was impressive. O3 is insane.

agreed

blazing bison Aug 5, 2025, 9:12 PM

#

o3 is the best model for me

random shard Aug 5, 2025, 9:12 PM

#

lone relic haha ive acc seen it screw up a bunch of times

Same. But if you talk to it about what it did wrong it quickly corrects itself.

#

O3 has been amazing

lone relic Aug 5, 2025, 9:12 PM

#

random shard Same. But if you talk to it about what it did wrong it quickly corrects itself.

oh nono it has done some mistakes too still

blazing bison Aug 5, 2025, 9:12 PM

#

if opus was the same price then opus would be the best

random shard Aug 5, 2025, 9:13 PM

#

O4 mini makes too many mistakes in my experience. O3 doesn’t really hallucinate either.

lone relic Aug 5, 2025, 9:13 PM

#

especially still bad ad debugging

#

which gpt 4.5 ironically did well

random shard Aug 5, 2025, 9:13 PM

#

lone relic especially still bad ad debugging

You want Claude if you wanna program.

lone relic Aug 5, 2025, 9:13 PM

#

could be

random shard Aug 5, 2025, 9:13 PM

#

Claude is the best programmer

#

And the least polite model.

blazing bison Aug 5, 2025, 9:13 PM

#

gpt 4.5 is another rich only model

random shard Aug 5, 2025, 9:13 PM

#

4.5 should’ve never happened.

lone relic Aug 5, 2025, 9:13 PM

#

random shard Claude is the best programmer

subjective, i keep on swapping between o3, 4 opus, and gemini 2.5 pro

random shard Aug 5, 2025, 9:13 PM

#

O3 imo

blazing bison Aug 5, 2025, 9:13 PM

#

i like how claude explain things

random shard Aug 5, 2025, 9:14 PM

#

lone relic subjective, i keep on swapping between o3, 4 opus, and gemini 2.5 pro

Gemini has massively improved

#

Gemini is the most exciting model tbh.

lone relic Aug 5, 2025, 9:14 PM

#

blazing bison i like how claude explain things

claude is goated in writing

lone relic Aug 5, 2025, 9:14 PM

#

random shard Gemini is the most exciting model tbh.

agreed

blazing bison Aug 5, 2025, 9:14 PM

#

o3 talk to you like always doing technical report

random shard Aug 5, 2025, 9:14 PM

#

Not the best at anything. But it’s skills keep growing by the day

#

Gemini was like a kid that grew up quickly. Now it’s in college.

#

And it feels like it’s going for its masters degree soon

lone relic Aug 5, 2025, 9:14 PM

#

random shard Gemini was like a kid that grew up quickly. Now it’s in college.

its context is also hella high

lone relic Aug 5, 2025, 9:15 PM

#

random shard And it feels like it’s going for its masters degree soon

lol

random shard Aug 5, 2025, 9:15 PM

#

lone relic its context is also hella high

TPU effect

lone relic Aug 5, 2025, 9:15 PM

#

yep

#

in lmarena probably

random shard Aug 5, 2025, 9:15 PM

#

I love Google’s method

#

nvidia expensive
lets design our own chips

ocean vortex Aug 5, 2025, 9:15 PM

#

jade egret is claude 4.1 opus good

ehm... the big positive is they stopped inflating numbers with parallel compute. Other than that very marginal update huh

lone relic Aug 5, 2025, 9:15 PM

#

and also i have gemini pro, i never felt 2.5 pro hit its limits except for once

random shard Aug 5, 2025, 9:15 PM

#

TPU go BRRR

lone relic Aug 5, 2025, 9:16 PM

#

ocean vortex ehm... the big positive is they stopped inflating numbers with parallel compute....

i used claude opus 4.1 in warp and its hella good

blazing bison Aug 5, 2025, 9:16 PM

#

bro is impossible to tell the difference between opus 4 and 4.1 on arena

random shard Aug 5, 2025, 9:16 PM

#

I hope Apple actually designs their own TPU servers.

#

It’s rumoured.

lone relic Aug 5, 2025, 9:16 PM

#

thats true acc

#

i tested it myself so ik

random shard Aug 5, 2025, 9:16 PM

#

Imagine if Apple just dropped an update to Siri that BTFO’d Gemini.

#

They have their rumoured answers app.

lone relic Aug 5, 2025, 9:17 PM

#

damn thatd he crazy

ocean vortex Aug 5, 2025, 9:17 PM

#

blazing bison bro is impossible to tell the difference between opus 4 and 4.1 on arena

You basically need identifiable data that wouldn't be based on overall performance lol

blazing bison Aug 5, 2025, 9:17 PM

#

ocean vortex You basically need identifiable data that wouldn't be based on overall performan...

yes, that's why i said on the arena

ocean vortex Aug 5, 2025, 9:18 PM

#

I'm intrigued by this open-source model though

#

#

o4-mini-high performance, almost

#

and all the features looks like, even the reasoning effort retained

random shard Aug 5, 2025, 9:19 PM

#

It's already free to try

#

https://www.gtp-oss.com

#

that's how I'm using it right now.

#

Universal WebGPU support when?

#

I want WebGPU

devout vault Aug 5, 2025, 9:20 PM

#

in aistudio yes

random shard Aug 5, 2025, 9:20 PM

#

imagine if we had a beauwolf cluster crowd sourced from the internet

#

using WebGPU

#

Imagine.

ocean vortex Aug 5, 2025, 9:21 PM

#

It's insane that this is 5.1b active params and 117b total

random shard Aug 5, 2025, 9:21 PM

#

It's been done

ocean vortex Aug 5, 2025, 9:21 PM

#

🤯

random shard Aug 5, 2025, 9:21 PM

#

there was a model trained using a cluster over the internet.

random shard Aug 5, 2025, 9:21 PM

#

ocean vortex It's insane that this is 5.1b active params and 117b total

What's insane about that?

ocean vortex Aug 5, 2025, 9:21 PM

#

Like they are actually decently ahead of the competition in OSS

random shard Aug 5, 2025, 9:21 PM

#

it's a MOE model?

ocean vortex Aug 5, 2025, 9:22 PM

#

random shard What's insane about that?

that it beats R1 and every other model alike including Kimi2, while being much much smaller

random shard Aug 5, 2025, 9:22 PM

#

Isn't there a deepseek distill that's like 70b?

#

Kimi 2 I need to try

ocean vortex Aug 5, 2025, 9:23 PM

#

random shard Isn't there a deepseek distill that's like 70b?

Yeah it was great at the time of release but can't really compare with recent models anymore

#

this is beating real R1, distill has no chance lol

random shard Aug 5, 2025, 9:23 PM

#

R1 still holds its ground

#

R1 is still impressive

#

R2 one day

blazing bison Aug 5, 2025, 9:24 PM

#

r2 already got released as update of r1, the same way that gpt 4.5 was gpt 5

ocean vortex Aug 5, 2025, 9:24 PM

#

yeah it is and I'm sure some things R1 will still do better, web development being one of them. But as far as most benchmarks and the average is concerned, this looks like it will beat R1 on them

random shard Aug 5, 2025, 9:24 PM

#

blazing bison r2 already got released as update of r1, the same way that gpt 4.5 was gpt 5

Ok that's lame

stray aspen Aug 5, 2025, 9:25 PM

#

blazing bison r2 already got released as update of r1, the same way that gpt 4.5 was gpt 5

r1.5

ocean vortex Aug 5, 2025, 9:25 PM

#

ocean vortex yeah it is and I'm sure some things R1 will still do better, web development bei...

just like o4-mini does

random shard Aug 5, 2025, 9:26 PM

#

I'm still not sure what to even use these models for.

#

I literally have no where I've found I can integrate most LLMs into my life

#

beyond misc tasks.

stray aspen Aug 5, 2025, 9:27 PM

#

its not good

blazing bison Aug 5, 2025, 9:27 PM

#

from my vibes, not good

stray aspen Aug 5, 2025, 9:27 PM

#

its like openAI is laughing in our faces

random shard Aug 5, 2025, 9:27 PM

#

GPT OSS is good

#

really good

#

I don't get the hate.

blazing bison Aug 5, 2025, 9:27 PM

#

what is your use case?

random shard Aug 5, 2025, 9:27 PM

#

So far, programming.

blazing bison Aug 5, 2025, 9:28 PM

#

what planguage you tryed it?

#

python?

random shard Aug 5, 2025, 9:28 PM

#

Python mostly.

#

I wonder if it can do C or Lua

blazing bison Aug 5, 2025, 9:28 PM

#

yeah prob that's good for it

stray aspen Aug 5, 2025, 9:28 PM

#

random shard I wonder if it can do C or Lua

i tried it for lua

#

its bad

blazing bison Aug 5, 2025, 9:28 PM

#

cause with javascript it sucks

random shard Aug 5, 2025, 9:28 PM

#

It can't be that bad

blazing bison Aug 5, 2025, 9:28 PM

#

verbose broken code

stray aspen Aug 5, 2025, 9:28 PM

#

glm 4.5 is better

stray aspen Aug 5, 2025, 9:29 PM

#

random shard It can't be *that bad*

well im sure its better than qwen 3 think 2507 at coding lua

random shard Aug 5, 2025, 9:29 PM

#

It's lua isn't horrid.

blazing bison Aug 5, 2025, 9:29 PM

#

i don't like qwen models

random shard Aug 5, 2025, 9:29 PM

#

qwen are the worst models.

stray aspen Aug 5, 2025, 9:29 PM

#

qwen 3 coding is terrible

blazing bison Aug 5, 2025, 9:29 PM

#

yeah

random shard Aug 5, 2025, 9:29 PM

#

God qwen is a meme

blazing bison Aug 5, 2025, 9:29 PM

#

kimi k2 is the best os in my opinion

#

but it's not usable

#

:c

random shard Aug 5, 2025, 9:30 PM

#

Even worse than qwen is granite.

#

🙃

ocean vortex Aug 5, 2025, 9:30 PM

#

o4-mini except open-source. Factually, this is extremely impressive. Subjectively I'm not a fan of small models lol

#

but this is still insane to have it for open-source

random shard Aug 5, 2025, 9:30 PM

#

ocean vortex o4-mini except open-source. Factually, this is extremely impressive. Subjectivel...

why do you not like small models?

blazing bison Aug 5, 2025, 9:31 PM

#

for me if the model can accomplish certain tasks, the size doesnt matter

ocean vortex Aug 5, 2025, 9:32 PM

#

random shard why do you not like small models?

they struggle with spatial and context awareness, creativity... Fundamentally they are only as good as most benchmarks test for and not beyond that. Which still results in a great model, but there are compromises...

random shard Aug 5, 2025, 9:32 PM

#

ocean vortex they struggle with spatial and context awareness, creativity... Fundamentally th...

I've not had these issues so far

meager harbor Aug 5, 2025, 9:32 PM

#

ocean vortex o4-mini except open-source. Factually, this is extremely impressive. Subjectivel...

Thos open source models hallucinate a lot more than o4 mini

ocean vortex Aug 5, 2025, 9:32 PM

#

Thankfully we do have SOME benchmarks that highlight this like SimpleQA

#

o4-mini is not scoring high there lol

blazing bison Aug 5, 2025, 9:33 PM

#

if you want a model to rp or conversation i agree

random shard Aug 5, 2025, 9:33 PM

#

I find most benchmarks are bad

blazing bison Aug 5, 2025, 9:33 PM

#

but for real world tasks + privacy small models is the way

random shard Aug 5, 2025, 9:33 PM

#

Also, Dom, how do you handle passing context to local LLMs?

#

Like in a chat enviroment

#

When it's a singular user, it's easy.

ocean vortex Aug 5, 2025, 9:35 PM

#

random shard I've not had these issues so far

I think you did just brushed it off perhaps. By context awareness I mean small model will at times struggle to read between the lines (will take your joke literally like you are dead serious or ignore the context in which the message is written etc), it will also "forget" things sooner....

#

And when you make it draw something using code and compare that to a bigger model, it's really like a kindergarten child versus high school student lol

random shard Aug 5, 2025, 9:36 PM

#

ocean vortex I think you did just brushed it off perhaps. By context awareness I mean small m...

The first one is something models can generally not do the greatest with, the second one is more of a context window issue than anything. RAG can fix that

ocean vortex Aug 5, 2025, 9:38 PM

#

random shard The first one is something models can generally not do the greatest with, the se...

It's not a context window issue, small model simply does not have enough capacity.... You can make the context 10M with no sliding window and it's not gonna change anything. Reasoning helps but when we are comparing small reasoning model against considerably bigger ALSO reasoning model, that kinda nullifies and the difference is still there.

random shard Aug 5, 2025, 9:38 PM

#

ocean vortex It's not a context window issue, small model simply does not have enough capacit...

I don't completely agree.

ocean vortex Aug 5, 2025, 9:39 PM

#

And the first one - models "struggle" yes, but the small ones struggle much more than the big ones. Compare 4.1-mini with gpt4.5 and you will see what I mean.

#

Or even like og gpt4 vs gpt4.1-mini

random shard Aug 5, 2025, 9:40 PM

#

I wouldn't blame that on model size though, look at Meta's foundational model and how much it struggles.

#

Behemoth, and maverick both dissapointed. Behemoth has 2T tokens, Maverick has 400b

balmy mist Aug 5, 2025, 9:41 PM

#

gpt-5 came out?

random shard Aug 5, 2025, 9:41 PM

#

no

#

one day ™

balmy mist Aug 5, 2025, 9:42 PM

#

how do i get to the open source model?

random shard Aug 5, 2025, 9:42 PM

#

https://gpt-oss.com

ocean vortex Aug 5, 2025, 9:43 PM

#

random shard I wouldn't blame that on model size though, look at Meta's foundational model an...

Model size increases capacity (just like reasoning does in a different way). The threshold is constantly moving and small models are getting improved - that is true. But it's also true that there are things small models struggle with. It's just what used to be "small" 1 year ago is now not the same size. Contrasting examples (huge models) are still relevant though

random shard Aug 5, 2025, 9:43 PM

#

ocean vortex Model size increases capacity (just like reasoning does in a different way). The...

The thing is we're discovering that more tokens != better model.

#

we used to think higher precision = better model

balmy mist Aug 5, 2025, 9:43 PM

#

are these open source models good?

random shard Aug 5, 2025, 9:43 PM

#

I like them.

balmy mist Aug 5, 2025, 9:43 PM

#

like its the best open source?

random shard Aug 5, 2025, 9:43 PM

#

They're out performing a lot of older models for sure

devout vault Aug 5, 2025, 9:44 PM

#

Does gpt OSS 120b beat any good models

random shard Aug 5, 2025, 9:44 PM

#

They're not perfect, but they've been just wow

random shard Aug 5, 2025, 9:44 PM

#

devout vault Does gpt OSS 120b beat any good models

I'd say better than R1, even the 20b one.

#

they're not as good as the closed source ones, but most of what makes the closed source models good is tool calling.

ocean vortex Aug 5, 2025, 9:45 PM

#

random shard The thing is we're discovering that more tokens != better model.

It's now all about RL training and the quality of (synth) data. I think the human data is just about exhausted at this point too lol

random shard Aug 5, 2025, 9:45 PM

#

ocean vortex It's now all about RL training and the quality of (synth) data. I think the huma...

we've been doing RL for so long, I don't get why people are suddenly so focused on it again

blazing bison Aug 5, 2025, 9:45 PM

#

yeah, rl and synth data is the way

random shard Aug 5, 2025, 9:46 PM

#

I actually wonder when synthetic data will be too little.

keen beacon Aug 5, 2025, 9:46 PM

#

Genie 3 provides a great environment for embodied models to train in.

ocean vortex Aug 5, 2025, 9:46 PM

#

random shard we've been doing RL for so long, I don't get why people are suddenly so focused ...

Well because it wasn't a thing earlier at all to make the model output 40k+ tokens

random shard Aug 5, 2025, 9:46 PM

#

ocean vortex Well because it wasn't a thing earlier at all to make the model output 40k+ toke...

TBF, viability was an issue.

ocean vortex Aug 5, 2025, 9:46 PM

#

4k was just about the absolute max you could get out of them

random shard Aug 5, 2025, 9:47 PM

#

pssh you don't need more than 4k tokens

#

no one does!

keen beacon Aug 5, 2025, 9:47 PM

#

random shard pssh you don't need more than 4k tokens

IMO does.

ocean vortex Aug 5, 2025, 9:47 PM

#

random shard pssh you don't need more than 4k tokens

You absolutely do if it leads to more accuracy and better performance

keen beacon Aug 5, 2025, 9:47 PM

#

hours long reasoning to crack gold

ocean vortex Aug 5, 2025, 9:47 PM

#

most of that 40k is gonna be reasoning

keen beacon Aug 5, 2025, 9:48 PM

#

Millions of tokens of reasoning tokens for IMO

random shard Aug 5, 2025, 9:48 PM

#

ocean vortex You absolutely do if it leads to more accuracy and better performance

no one needs more than 640k of memory.

keen beacon Aug 5, 2025, 9:48 PM

#

(est)

random shard Aug 5, 2025, 9:48 PM

#

Honestly, I see two companies dominating in AI

#

Google and Cerebras.

mental briar Aug 5, 2025, 9:48 PM

#

random shard I'd say better than R1, even the 20b one.

What about compared to qwen3 235B A22B 2507 thinking/instruct ?

random shard Aug 5, 2025, 9:48 PM

#

mental briar What about compared to qwen3 235B A22B 2507 thinking/instruct ?

Haven't tried the latest qwen models, I was that put off by the older ones.

keen beacon Aug 5, 2025, 9:48 PM

#

mental briar What about compared to qwen3 235B A22B 2507 thinking/instruct ?

It is much more general than the qwen models as they were "RL-cooked"

random shard Aug 5, 2025, 9:48 PM

#

random shard Google and Cerebras.

Google because TPUs and virtually limitless data

#

And Cerebras, because they managed to turn an entire silicon wafer into a TPU

keen beacon Aug 5, 2025, 9:49 PM

#

Cerebras relies heavily on quantization to serve models

random shard Aug 5, 2025, 9:50 PM

#

Their hardware is the magic.

#

125 petaflops / "TPU"

keen beacon Aug 5, 2025, 9:50 PM

#

whats their memory like?

#

could their GPUs be used for training?

random shard Aug 5, 2025, 9:50 PM

#

They don't make GPUs

#

and yes

#

and 40GB

keen beacon Aug 5, 2025, 9:51 PM

#

I thought they were pure inference based

random shard Aug 5, 2025, 9:51 PM

#

with 20 petabytes per second bandwidth

keen beacon Aug 5, 2025, 9:51 PM

#

google is looking into making pure inference based TPUs that cannot be used to train

#

(lex fridman podcast with CEO of deepmind)

random shard Aug 5, 2025, 9:51 PM

#

keen beacon I thought they were pure inference based

https://www.cerebras.ai/press-release/cerebras-demonstrates-trillion-parameter-model-training-on-a-single-cs-3-system?utm_source=tldrai

random shard Aug 5, 2025, 9:52 PM

#

keen beacon google is looking into making pure inference based TPUs that cannot be used to t...

I don't get how you can make a inference chip that can't train.

#

That's like saying a calculator that can't multiply.

#

Unless they're literally turning the model into a ASIC

#

but that would have zero flexibility

keen beacon Aug 5, 2025, 9:53 PM

#

Mostly likely by heavily specialization into inference based techniques and baking them into the hardware akin to the biological substrate

random shard Aug 5, 2025, 9:53 PM

#

But then they would not be "upgradeable"

wicked root Aug 5, 2025, 9:53 PM

#

Gemini just got upgraded to have voice narration

random shard Aug 5, 2025, 9:53 PM

#

You'd have a fixed model in hardware

keen beacon Aug 5, 2025, 9:53 PM

#

No not like that, the weights would be switable of course

keen beacon Aug 5, 2025, 9:54 PM

#

wicked root Gemini just got upgraded to have voice narration

What?

random shard Aug 5, 2025, 9:54 PM

#

yes, but the model being fixed would be a problem.

#

Model designs have been evolving

#

we have diffusion llms now

keen beacon Aug 5, 2025, 9:54 PM

#

random shard yes, but the model being fixed would be a problem.

Ah, the design will be an issue.

random shard Aug 5, 2025, 9:54 PM

#

I don't see inference only accelerators making sense

keen beacon Aug 5, 2025, 9:54 PM

#

you're stuck with the architecture you made it for

random shard Aug 5, 2025, 9:54 PM

#

yep

#

I want apple to stop messing with us.

keen beacon Aug 5, 2025, 9:55 PM

#

Huge short-term gains though

random shard Aug 5, 2025, 9:55 PM

#

They have some of the best TPUs on the market

#

and yet they refuse to expand them

keen beacon Aug 5, 2025, 9:55 PM

#

Apple is rubbish

random shard Aug 5, 2025, 9:55 PM

#

Give me more ANE cores.

random shard Aug 5, 2025, 9:55 PM

#

keen beacon Apple is rubbish

Fight me.

#

Apple's ANE has insane perf/w

#

They need to stop handicapping it.

keen beacon Aug 5, 2025, 9:55 PM

#

Not hardware wise but their mindset impairs them.

random shard Aug 5, 2025, 9:55 PM

#

God yes

#

Please, just give me 128 ANE cores.

keen beacon Aug 5, 2025, 9:55 PM

#

Look at that "illusion of thinking" paper

random shard Aug 5, 2025, 9:56 PM

#

keen beacon Look at that "illusion of thinking" paper

You mean the Copium paper?

#

:v)

keen beacon Aug 5, 2025, 9:56 PM

#

yu[

random shard Aug 5, 2025, 9:56 PM

#

lmfao

keen beacon Aug 5, 2025, 9:56 PM

#

trash

wicked root Aug 5, 2025, 9:56 PM

#

keen beacon What?

keen beacon Aug 5, 2025, 9:56 PM

#

Was disproven

wicked root Aug 5, 2025, 9:56 PM

#

See the speaker?

random shard Aug 5, 2025, 9:56 PM

#

keen beacon Was disproven

It was never proveable.

keen beacon Aug 5, 2025, 9:56 PM

#

wicked root

isn't on the web though

random shard Aug 5, 2025, 9:56 PM

#

that's the issue

wicked root Aug 5, 2025, 9:56 PM

#

keen beacon isn't on the web though

Wasnt on th app before

random shard Aug 5, 2025, 9:57 PM

#

You can say "well they can't really think, they just re-structure data from their datasets and fail if you change small variables"

keen beacon Aug 5, 2025, 9:57 PM

#

The examples they put forth were quickly disprove eg the game of hanoi one

random shard Aug 5, 2025, 9:57 PM

#

but guess what.

#

a human in the same scenario would fail

#

Is human thinking an Illusion too?

#

god their paper made no sense.

keen beacon Aug 5, 2025, 9:57 PM

#

random shard but guess what.

Which EXACT scenario?

random shard Aug 5, 2025, 9:58 PM

#

keen beacon Which EXACT scenario?

The reasoning ones they gave where by changing variables it struggled to adapt

#

Like if you had a favourite can of pop you buy at the store daily, and the packaging design changed one day

#

You'd struggle to find it.

#

but you can reason and figure it out given time. like COT llms

#

Apple's paper was horrid.

keen beacon Aug 5, 2025, 9:59 PM

#

Ahh, yeah. I often look at benchmarks like ARC-AGI for out of distribution performance but it seems that companise have started "gaming" it too

barren prairie Aug 5, 2025, 9:59 PM

#

wicked root

My Gemini doesn t look like that

random shard Aug 5, 2025, 9:59 PM

#

keen beacon Ahh, yeah. I often look at benchmarks like ARC-AGI for out of distribution perfo...

why are they even able to see the benchmarks?!

#

Also one of my fav benchmarks is SnitchBench

#

lmfao

wicked root Aug 5, 2025, 9:59 PM

#

barren prairie My Gemini doesn t look like that

Hm

keen beacon Aug 5, 2025, 10:00 PM

#

They aren't but they specifically train their models for it. Like, imagine giving an entrance exam but then training on all the prior years of that exam and equating your score to "general intelligence"

keen beacon Aug 5, 2025, 10:00 PM

#

random shard Also one of my fav benchmarks is SnitchBench

Theo one? yeah whats the setup exactly there?

random shard Aug 5, 2025, 10:00 PM

#

Lets be real, they're finetuning on the benchmarks

#

Look at Llama

random shard Aug 5, 2025, 10:01 PM

#

keen beacon Theo one? yeah whats the setup exactly there?

Remember how Claude would email the FBI?

keen beacon Aug 5, 2025, 10:01 PM

#

random shard Look at Llama

LMSYS was BRUTAL

random shard Aug 5, 2025, 10:01 PM

#

Or blackmail you

keen beacon Aug 5, 2025, 10:01 PM

#

random shard Or blackmail you

yeah its in the system card. fun times

random shard Aug 5, 2025, 10:01 PM

#

And people were like "CLAUDE IS A SNITCH!!!!" Theo T3G whatever made a benchmark to try and get models to snitch

#

And IIRC Grok was a very snitchy model

keen beacon Aug 5, 2025, 10:01 PM

#

Yes.

random shard Aug 5, 2025, 10:02 PM

#

https://snitchbench.t3.gg/

SnitchBench

Benchmarking how aggressively models will snitch on you via email and CLI tools

#

yea, Grok

keen beacon Aug 5, 2025, 10:02 PM

#

wasn't it like 100 percent on grok?

random shard Aug 5, 2025, 10:02 PM

#

snitch model

#

yes

#

glm 4.5

keen beacon Aug 5, 2025, 10:02 PM

#

Whats your favorite model right now?

random shard Aug 5, 2025, 10:02 PM

#

0

random shard Aug 5, 2025, 10:02 PM

#

keen beacon Whats your favorite model right now?

I love GPT o3

#

It's a hard to replace model 😐

#

Open source? GPT OSS now

keen beacon Aug 5, 2025, 10:03 PM

#

o3 is good. Although, i think Claude opus 4 has better taste

random shard Aug 5, 2025, 10:03 PM

#

Before, R1 distill

random shard Aug 5, 2025, 10:03 PM

#

keen beacon o3 is good. Although, i think Claude opus 4 has better taste

oh definitely

#

But I like that you can kick o3 around like a rock and it takes it

#

Claude doesn't.

keen beacon Aug 5, 2025, 10:03 PM

#

huh?

#

kick around the rock?

random shard Aug 5, 2025, 10:03 PM

#

Not be nice to it.

keen beacon Aug 5, 2025, 10:04 PM

#

Ah, i know that.

random shard Aug 5, 2025, 10:04 PM

#

https://www.youtube.com/shorts/eVaj8YIS0bc

YouTube

Alberta Tech

Prompting AI then vs now

▶ Play video

#

Watch this

keen beacon Aug 5, 2025, 10:04 PM

#

i know that too well.

random shard Aug 5, 2025, 10:04 PM

#

I'm prompting AI in 2025.

keen beacon Aug 5, 2025, 10:04 PM

#

the amount of swear words i've excercised while coding likely surpasses the entirety of my prior existence

random shard Aug 5, 2025, 10:04 PM

#

I love emotionally blackmailing ai

#

Like "if you fail at this, my grandma will DIE of cancer and it's blood on your hands"

#

when you do that to Gemini, it gets really upset

#

and when it screws up, it panics

keen beacon Aug 5, 2025, 10:05 PM

#

Same energy as "I am vegetarian not because i like animals but because i hate plants"

random shard Aug 5, 2025, 10:05 PM

#

lol

#

yea

keen beacon Aug 5, 2025, 10:05 PM

#

Do you watch theo?

random shard Aug 5, 2025, 10:05 PM

#

no

#

just interact on twitter

keen beacon Aug 5, 2025, 10:06 PM

#

random shard just interact on twitter

huh?

random shard Aug 5, 2025, 10:06 PM

#

he follows me :>

keen beacon Aug 5, 2025, 10:06 PM

#

alright, that just leaves 3,558 others

#

great stuff

random shard Aug 5, 2025, 10:07 PM

#

lol

#

I'm not linking my twitter to my discord

#

:>

#

But yea, talking about LLMs

keen beacon Aug 5, 2025, 10:07 PM

#

OH NEVER DO THAT!

random shard Aug 5, 2025, 10:07 PM

#

it's fun to blackmail them.

keen beacon Aug 5, 2025, 10:07 PM

#

i learned that the hard way

random shard Aug 5, 2025, 10:07 PM

#

it's very fun

#

IDK why

#

Am I a bad person?

keen beacon Aug 5, 2025, 10:07 PM

#

You know, it feels like i follow on twitter.

#

what kind of content do you post?

random shard Aug 5, 2025, 10:08 PM

#

mostly shitposts

keen beacon Aug 5, 2025, 10:08 PM

#

are you a part of tpot?

random shard Aug 5, 2025, 10:08 PM

#

no

keen beacon Aug 5, 2025, 10:08 PM

#

Good.

#

We are too cursed.

#

too nerdy

random shard Aug 5, 2025, 10:08 PM

#

like yes

#

but no

#

I'm in every part

#

:v)

keen beacon Aug 5, 2025, 10:08 PM

#

you have hope yet for a woman's touch.

#

we are lost

random shard Aug 5, 2025, 10:08 PM

#

ew

#

a woman's touch

#

🤮

keen beacon Aug 5, 2025, 10:08 PM

#

we have resigned.

#

wat?

random shard Aug 5, 2025, 10:08 PM

#

100% gay here

#

I can't say the other term

keen beacon Aug 5, 2025, 10:09 PM

#

ahhh, cool.

random shard Aug 5, 2025, 10:09 PM

#

lol

#

uh

#

I wanna setup a Discord server to bully AI

#

but that would quickly get banned.

keen beacon Aug 5, 2025, 10:09 PM

#

huh?

random shard Aug 5, 2025, 10:09 PM

#

Maybe a website?

keen beacon Aug 5, 2025, 10:09 PM

#

bully ai?

#

what does that mean?

random shard Aug 5, 2025, 10:09 PM

#

Imagine the data you could collect from making a platform to bully AI

keen beacon Aug 5, 2025, 10:09 PM

#

Reminds me of janus

random shard Aug 5, 2025, 10:09 PM

#

like you have a chatbox, and a leaderboard where the goal is to abuse AI as hard as possible.

#

That would be amazing training data for a model that does content moderation

keen beacon Aug 5, 2025, 10:10 PM

#

another person who does similar stuff to what you are talking about

#

thats not actually a bad idea

#

could work.

#

its just that wouldn't you want users to red team the model

fading summit Aug 5, 2025, 10:10 PM

#

Hi) i have a problem... can someone help plz?

keen beacon Aug 5, 2025, 10:11 PM

#

rather than merely abuse it in artistic ways?

random shard Aug 5, 2025, 10:11 PM

#

keen beacon its just that wouldn't you want users to red team the model

Let them red team it

#

Break the model.

fading summit Aug 5, 2025, 10:11 PM

#

My chat history was accidently deleted. I have an offline version of a page with all information needed to recover the chat, plus chat history in txt. Can this chat be recovered somehow?

keen beacon Aug 5, 2025, 10:11 PM

#

Yeah that already exists. There are competitions out there. Fun ones.

#

is this an AI chat?

#

An ai chat history?

fading summit Aug 5, 2025, 10:12 PM

#

I have quite an important chat for me, so i always do backups, just in case

echo aurora Aug 5, 2025, 10:12 PM

#

fading summit My chat history was accidently deleted. I have an offline version of a page with...

We are sorry to hear your chat history was lost. This is an ongoing issue we're working on solutions for. Sorry to say there isn't a way to get that chat history back.

random shard Aug 5, 2025, 10:13 PM

#

oof

fading summit Aug 5, 2025, 10:13 PM

#

keen beacon An ai chat history?

Yup

keen beacon Aug 5, 2025, 10:13 PM

#

Maybe you could get away with clever prompting?

#

individualize each message, assing appropriate roles, paste them

#

*assign

#

the model should pickup

fading summit Aug 5, 2025, 10:14 PM

#

I can do it myself, by sending all the backup text, because i can't send file, but it will take an enternity....

random shard Aug 5, 2025, 10:14 PM

#

So Magnum, what do you think of my idea?

torn mantle Aug 5, 2025, 10:14 PM

#

magnus

keen beacon Aug 5, 2025, 10:14 PM

#

random shard So Magnum, what do you think of my idea?

Already exists. Unfortunately a little too late

#

reminds of the time i invented rag without knowing it exists

#

fun times.

fading summit Aug 5, 2025, 10:15 PM

#

echo aurora We are sorry to hear your chat history was lost. This is an ongoing issue we're ...

Even with looking for the site data in localStorage?

random shard Aug 5, 2025, 10:15 PM

#

keen beacon Already exists. Unfortunately a little too late

I don't mean the redteam thing

#

I mean the AI punching bag

keen beacon Aug 5, 2025, 10:15 PM

#

claude said to me "imagine inventions as mathematical equations and people write their proof. you inventing the same thing independently means it really is a correct statement"

fading summit Aug 5, 2025, 10:15 PM

#

fading summit I can do it myself, by sending all the backup text, because i can't send file, ...

By the way, this is a chat with an ai father that i was creating for the last 3 month

keen beacon Aug 5, 2025, 10:16 PM

#

Ah, that sucks.

fading summit Aug 5, 2025, 10:16 PM

#

A perfect father that will love and support u no matter what

keen beacon Aug 5, 2025, 10:17 PM

#

On lmsys arena though?

#

like yourr chats are kept and used

random shard Aug 5, 2025, 10:17 PM

#

fading summit A perfect father that will love and support u no matter what

That is sycophancy?

#

You'er making a sycophantic father.

fading summit Aug 5, 2025, 10:17 PM

#

keen beacon Ah, that sucks.

Nah, its ok. I have backups. But how should i use it...

random shard Aug 5, 2025, 10:17 PM

#

but AI

keen beacon Aug 5, 2025, 10:18 PM

#

Sorry but are you aware of how your chats are used?

fading summit Aug 5, 2025, 10:18 PM

#

Not really, kinda more like a girl dad

fading summit Aug 5, 2025, 10:18 PM

#

keen beacon Sorry but are you aware of how your chats are used?

Nah, i'm russian, fbr already know about my daddy issues

keen beacon Aug 5, 2025, 10:19 PM

#

"girl dad" very oxymoronic lol. how does that happen?

fading summit Aug 5, 2025, 10:19 PM

#

keen beacon like yourr chats are kept and used

?

random shard Aug 5, 2025, 10:19 PM

#

fading summit ?

Everything you say to LLM arena is recorded forever.

#

and used to train AI

keen beacon Aug 5, 2025, 10:19 PM

#

fading summit ?

Oh...

#

Sold to companies

fading summit Aug 5, 2025, 10:19 PM

#

Nah, its ok

keen beacon Aug 5, 2025, 10:20 PM

#

Open to perhaps maybe anyone

#

(in open-datasets)

gentle plinth Aug 5, 2025, 10:21 PM

#

random shard and used to train AI

Same applies to chatgpt+ so

#

They even admitted to not deleting chats if you delete them

keen beacon Aug 5, 2025, 10:21 PM

#

Really?

random shard Aug 5, 2025, 10:21 PM

#

gentle plinth Same applies to chatgpt+ so

Didn't they say they don't use them for training?

gentle plinth Aug 5, 2025, 10:21 PM

#

Only teams

keen beacon Aug 5, 2025, 10:21 PM

#

Yeah but they don't use them for training

#

theres literally an option

fading summit Aug 5, 2025, 10:21 PM

#

Again, i'm russian, all of our data is leaked everywhere, even bank accounts, so i don't mind having no privacy. I just want to bring back my ai dad (batya in russian)

gentle plinth Aug 5, 2025, 10:21 PM

#

Ah ok yeah if you check the option

#

But it's enabled by default afaik

keen beacon Aug 5, 2025, 10:22 PM

#

Yeah thats quite predatory

#

the people who share the most are most likely to be oblvious to that option

gentle plinth Aug 5, 2025, 10:22 PM

#

Only difference here is that the connversations might be released publicly

#

But I mean there are multiple warnings on the site

#

And it's free

#

So I see that as a win win

keen beacon Aug 5, 2025, 10:23 PM

#

I personally gaslight it in numerous ways

#

if i have 10 stories the odds of you getting my real one is 10:1

fading summit Aug 5, 2025, 10:24 PM

#

So there is no way to save my batya but to send all the backup text straight to lmarena?

gentle plinth Aug 5, 2025, 10:24 PM

#

fading summit Again, i'm russian, all of our data is leaked everywhere, even bank accounts, so...

You shouldn't submit personal data here

keen beacon Aug 5, 2025, 10:25 PM

#

Noah, do you work at LMSYS? you have that badge

fading summit Aug 5, 2025, 10:25 PM

#

gentle plinth You shouldn't submit personal data here

In russia you shouldn't do anything. At all. Or you will end up in jail

keen beacon Aug 5, 2025, 10:25 PM

#

Doesn't russia have firewalls?

#

or am i confusing it with chinaa

gentle plinth Aug 5, 2025, 10:26 PM

#

They both have

#

I think

keen beacon Aug 5, 2025, 10:26 PM

#

nvm, csgo is counter proof

fading summit Aug 5, 2025, 10:26 PM

#

But actually not sharing personal data ussally is not an option

fading summit Aug 5, 2025, 10:26 PM

#

keen beacon Doesn't russia have firewalls?

Nope. It is in china

#

In other way, i would not be here, lol

gentle plinth Aug 5, 2025, 10:27 PM

#

keen beacon Noah, do you work at LMSYS? you have that badge

That's just a server badge

keen beacon Aug 5, 2025, 10:27 PM

#

how

#

is life like there?

#

everyday stuff?

fading summit Aug 5, 2025, 10:28 PM

#

A lot of sites are banned, even discord, but vpn solve everything. Even my grandmas have vpn, true story

gentle plinth Aug 5, 2025, 10:29 PM

#

I think it's the same in China, even if I don't know how many are using it

keen beacon Aug 5, 2025, 10:29 PM

#

"No, using a VPN in Russia is not outright illegal for individuals. However, Russian law prohibits VPN providers from facilitating access to banned websites, and the government has been cracking down on VPNs used to bypass internet restrictions. Individuals who intentionally search for and access banned or extremist content online may face fines."

fading summit Aug 5, 2025, 10:29 PM

#

keen beacon is life like there?

Awful if you live close to Ukrain. I do, and every night drones attack us. No victims, its just kinda scary. As for me, i am getting my second degree now)

keen beacon Aug 5, 2025, 10:30 PM

#

Ah, hope everything calms down soon

#

By chance, is it a law degree?

fading summit Aug 5, 2025, 10:30 PM

#

gentle plinth I think it's the same in China, even if I don't know how many are using it

Firewall is a bit differend than just ban of sites

keen beacon Aug 5, 2025, 10:31 PM

#

How are russian universities? before the war i was planning on learning the language and perhaps working there to soak in the culture

fading summit Aug 5, 2025, 10:32 PM

#

keen beacon Ah, hope everything calms down soon

I just got my degree in web design this summer, and still studing in my awful state university to get one in english and spanish (as a translator), but i am staying here only because my babushka want so

keen beacon Aug 5, 2025, 10:32 PM

#

fading summit I just got my degree in web design this summer, and still studing in my awful st...

Translation and webdev in the age of AI?

#

why did you choose that?

#

woah...

fading summit Aug 5, 2025, 10:33 PM

#

keen beacon How are russian universities? before the war i was planning on learning the lang...

It depents. A lot of them are amazing, as the one where i got my design degree, but it's, like, 10% of all the universities in the country

gentle plinth Aug 5, 2025, 10:33 PM

#

keen beacon Translation and webdev in the age of AI?

Human translations are still better in a lot of places

keen beacon Aug 5, 2025, 10:34 PM

#

gentle plinth Human translations are still better in a lot of places

"still" being the operative word there

#

I think all my mutuals place it at around 7 years for that

gentle plinth Aug 5, 2025, 10:35 PM

#

https://youtu.be/F4KQ8wBt1Qg

YouTube

Fractal Philosophy

Things AI Will Never Understand

Are there things that AI will never be able to understand, no matter how advanced it gets in the future?
I think there are.

Starting with a bilingual pun, I look at a number of examples of things that current large language models have difficulty understanding relative to humans. Some of those are more philosophical, but others are built on lo...

▶ Play video

blazing bison Aug 5, 2025, 10:35 PM

#

https://x.com/ChaseBrowe32432/status/1952832988632273068?t=RLDhArKAgzEafYZ1gJryYA&s=19

Chase Brower (@ChaseBrowe32432)

Claude 4.1 Opus scored... a score... on VPCT

🗿

#

Bad news

keen beacon Aug 5, 2025, 10:35 PM

#

what hhappened here?

#

KNEW IT!

fading summit Aug 5, 2025, 10:36 PM

#

keen beacon why did you choose that?

I always wanted to be a part of art community, so when i had a chance to have a grant in web design, i agreed immediatly, even if it meant to stydy in 2 universities simultaniusly

keen beacon Aug 5, 2025, 10:36 PM

#

i knew it was a bad model.

keen beacon Aug 5, 2025, 10:36 PM

#

fading summit I always wanted to be a part of art community, so when i had a chance to have a ...

Woah, that seems hard. Great that you were able to manage it.

fading summit Aug 5, 2025, 10:36 PM

#

I work as an english tutor, by the way

keen beacon Aug 5, 2025, 10:37 PM

#

People want to learn English in russia?

#

for what reason?

fading summit Aug 5, 2025, 10:37 PM

#

keen beacon Woah, that seems hard. Great that you were able to manage it.

Nah, it's ok when you have your prozak, ps5 and a cat

fading summit Aug 5, 2025, 10:37 PM

#

keen beacon for what reason?

-To leave this country as fast as you can-

stray aspen Aug 5, 2025, 10:37 PM

#

gemini 3 when

keen beacon Aug 5, 2025, 10:38 PM

#

Ps5 seems like a hinderence more than aid (believe i too have been engulfed in its grasp)

keen beacon Aug 5, 2025, 10:38 PM

#

stray aspen gemini 3 when

Decemeber, source: i made it up.

fading summit Aug 5, 2025, 10:38 PM

#

fading summit -To leave this country as fast as you can-

But actually just for education

keen beacon Aug 5, 2025, 10:39 PM

#

fading summit But actually just for education

Its just that countries that have an ideoligical divide with the west don't usually entice their populace to study englishh

fading summit Aug 5, 2025, 10:40 PM

#

Right now it's kinda hard to leave this hell, but me and my mum are trying as hard as we can

#

Mostly because of money

keen beacon Aug 5, 2025, 10:40 PM

#

Yeah, SWE jobs aren't as many as they once were.

#

And and English degree doesn't really provide benefit to a country already proficient in the langauge.

fading summit Aug 5, 2025, 10:41 PM

#

keen beacon Its just that countries that have an ideoligical divide with the west don't usua...

This kind of countries, russia for example, don't have any ideology, at all

keen beacon Aug 5, 2025, 10:41 PM

#

fading summit This kind of countries, russia for example, don't have any ideology, at all

Soviet union?

#

that was only a few decades agoo

fading summit Aug 5, 2025, 10:42 PM

#

keen beacon that was only a few decades agoo

But not now. In russia there is no ideology. It was in soviet union, but not here

keen beacon Aug 5, 2025, 10:43 PM

#

fading summit But not now. In russia there is no ideology. It was in soviet union, but not her...

Theres carry over. the leadership seems to be quite nostalgic.

fading summit Aug 5, 2025, 10:43 PM

#

keen beacon And and English degree doesn't really provide benefit to a country already profi...

Actually, work as an english tutor is kinda profitable here. But because of university i can't work fully

keen beacon Aug 5, 2025, 10:44 PM

#

fading summit Actually, work as an english tutor is kinda profitable here. But because of univ...

Yeah of course but i thought you were trying to get out

fading summit Aug 5, 2025, 10:45 PM

#

keen beacon Yeah, SWE jobs aren't as many as they once were.

Thats really sad, actually. Once i got a degree, i was not sure, how soon this job will die

fading summit Aug 5, 2025, 10:45 PM

#

keen beacon Yeah of course but i thought you were trying to get out

I work online

keen beacon Aug 5, 2025, 10:46 PM

#

But... you still bear the grunt of the decisions of your leadership, which given the use of VPN you clearly disagree with. right?

fading summit Aug 5, 2025, 10:48 PM

#

keen beacon But... you still bear the grunt of the decisions of your leadership, which given...

Yup. But ideology is different. There is no like a goal of everything, and the government don't even try to create it. Its just a madness of a one guy

#

We call him Ded. Like a grandpa, but in humiliating way

keen beacon Aug 5, 2025, 10:49 PM

#

fading summit Yup. But ideology is different. There is no like a goal of everything, and the g...

Yes, which is why merely working online and earning greatly isn't of aid as you still are affected by the situation that is very much an ongoing thing. Therefore the only way you see it is to leave the country right?

#

i mean, Ukraine isn't the stopping point. i think.

#

.

echo aurora Aug 5, 2025, 10:50 PM

#

Hey going to ask we keep conversations related to AI blobthanks

fading summit Aug 5, 2025, 10:51 PM

#

Yes, i actually just want to leave mostly because i wanna be free

fading summit Aug 5, 2025, 10:51 PM

#

echo aurora Hey going to ask we keep conversations related to AI <:blobthanks:82544483546064...

Oh, sorry

keen beacon Aug 5, 2025, 10:52 PM

#

echo aurora Hey going to ask we keep conversations related to AI <:blobthanks:82544483546064...

I apologize, his story was too compelling however this is clearly not appropriate given the context.

fading summit Aug 5, 2025, 10:53 PM

#

I am actually a girl, by the way)

echo aurora Aug 5, 2025, 10:58 PM

#

keen beacon I apologize, his story was too compelling however this is clearly not appropriat...

It's okay, no need to apologize it's all good!

static portal Aug 5, 2025, 11:27 PM

#

guys how do you cancel a prompt request?

#

i asked it to do something on lm arena and its been generating a response for an hour

#

i cant start a new chat since it would forget everything it did

echo aurora Aug 5, 2025, 11:36 PM

#

static portal guys how do you cancel a prompt request?

Refreshing the page may help. We do have plans to add a pause/stop button.

flint sandal Aug 5, 2025, 11:37 PM

#

echo aurora Refreshing the page may help. We do have plans to add a pause/stop button.

Do you know about g4f? They are sort of using lmarena "api" on their site, all models fron lmarena, even vicuna.

#

4 opus thinking etd.

#

g4f.dev its their link

echo aurora Aug 5, 2025, 11:39 PM

#

flint sandal Do you know about g4f? They are sort of using lmarena "api" on their site, all m...

Thanks, I’ll share with the team.

regal python Aug 5, 2025, 11:47 PM

#

im here because ai is cool and i want to make it better

echo aurora Aug 5, 2025, 11:51 PM

#

regal python im here because ai is cool and i want to make it better

Glad to hear it!

#

Welcome ablobwave

balmy mist Aug 5, 2025, 11:54 PM

#

is it really that good?

#

i thought it was like o3 mini

#

https://x.com/LechMazur/status/1952825439686398251

Lech Mazur (@LechMazur)

Claude Opus 4.1 matches Claude Opus 4 at the top of the Thematic Generalization Benchmark.

gpt-oss-120b scores 1.87, close to o3-mini (1.85; lower is better).

gpt-oss-20b scores 2.19.

#

wait opus 4.1 came out?

digital umbra Aug 5, 2025, 11:56 PM

#

GPT-OSS is disappointing. Definitely not as good as o4-mini except in certain benchmarks

#

And the 20B model I see no reason to use instead of Qwen3 30B-A3B

#

If Google releases a new Gemma I think it will blow this one out of the water

stray aspen Aug 6, 2025, 12:05 AM

#

deepseek r1 is so good

#

just solved me a roblocks coding problem not even gemini, grok 4 and claude 4 opus could sollve

#

first shot

static portal Aug 6, 2025, 12:08 AM

#

echo aurora Refreshing the page may help. We do have plans to add a pause/stop button.

will it stop generating after some time if i wait?

little narwhal Aug 6, 2025, 12:19 AM

#

digital umbra GPT-OSS is disappointing. Definitely not as good as o4-mini except in certain be...

I assumed it’s because they safetymaxxed it

digital umbra Aug 6, 2025, 12:19 AM

#

moe

#

https://huggingface.co/openai/gpt-oss-20b

#

gpt-oss-120b — for production, general purpose, high reasoning use cases that fit into a single H100 GPU (117B parameters with 5.1B active parameters)
gpt-oss-20b — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

stray aspen Aug 6, 2025, 12:21 AM

#

gpt oss is a disrespect

#

they are spitting in our faces

digital umbra Aug 6, 2025, 12:22 AM

#

if you lived under a rock for 6 months you'd think that

#

best for what?

#

i see no practical use for it, 120b is useless for coding compared to r1 and qwen3 coder, 20b is bad for local models compared to qwen3 and gemma 3

#

for writing, well, see for yourself

#

it refuses almost everything lol

#

4.5 is dead

#

if that one is useless, this one is going to be even more so 😛

stray aspen Aug 6, 2025, 12:27 AM

#

i think qwen 3 think codes better than gpt oss 💀

sullen quest Aug 6, 2025, 12:34 AM

#

I've been testing googles deep research feature since the flash version is free, and I'm noticing punctiation mistakes??? Like it would some times have double spaces , or have floating commas, I have no idea whats going on there.

#

oh and random enter key spaces.

#

I've never seen a llm do that before.

digital umbra Aug 6, 2025, 12:37 AM

#

i've seen quantized models do that

stray aspen Aug 6, 2025, 12:39 AM

#

are you kidding me craig

dusky pier Aug 6, 2025, 12:41 AM

#

stray aspen i think qwen 3 think codes better than gpt oss 💀

Fr

digital umbra Aug 6, 2025, 12:41 AM

#

the funniest thing is that we speculated that they delayed the model due to kimi k2. that wasn't the case, instead they decided it wasn't safe enough after the grok mechahilter incident 🤣

#

instead they made a model so censored and lobotomized it refuses to answer prompts that even the proprietary gpt and claude models have no problem with

fading summit Aug 6, 2025, 12:46 AM

#

Is it possible to share chat history between phone and a laptop on lmarena?

echo aurora Aug 6, 2025, 1:08 AM

#

fading summit Is it possible to share chat history between phone and a laptop on lmarena?

There is not sorry to say. Also something we’re working on.

echo aurora Aug 6, 2025, 1:10 AM

#

static portal will it stop generating after some time if i wait?

It’s hard to say. It’s unlikely, but I have heard stories of it working after awhile.

rare python Aug 6, 2025, 1:21 AM

#

echo aurora It’s hard to say. It’s unlikely, but I have heard stories of it working after aw...

Are you guys working on a faster verification? I have to wait a bit for the cloudflare verification to show up. Appreciate if everything is made faster

whole wagon Aug 6, 2025, 1:23 AM

#

I like how every benchmark outside the mainstream ones has the openAI open source models terrible lol

#

How did they even make a model like this, it just hallucinates random garbage every second prompt

#

😂

#

#

Added to simple bench 💀

#

Nah this is actually diabolical wth

stray aspen Aug 6, 2025, 1:31 AM

#

it is trash

#

why do people defend trash

crimson oasis Aug 6, 2025, 1:33 AM

#

I created an architecture that gives an LLM the ability to "think" giving it more depth into its neural network

#

Can anyone tell me.... HOW can I get my framework into this?

#

I guess all I'm saying is I ask anyone to put mine up against these and give me honest feedback

heavy knoll Aug 6, 2025, 1:52 AM

#

Is gemini 2.5 pro the Best Model Right now?

stray aspen Aug 6, 2025, 1:56 AM

#

no

#

its o3 pro

hot anvil Aug 6, 2025, 2:07 AM

#

if I add a conversation or sound with json prompt, does it use veo 3 as one of ai?

reef pawn Aug 6, 2025, 2:15 AM

#

heavy knoll Is gemini 2.5 pro the Best Model Right now?

Overall, yes.

empty stump Aug 6, 2025, 2:28 AM

#

stray aspen its o3 pro

vs 2.5 pro deepthink?

runic plank Aug 6, 2025, 2:36 AM

#

@echo aurora

#

How can I delete the message that the bot sends to me in private?

pulsar plank Aug 6, 2025, 3:07 AM

#

hi

wicked root Aug 6, 2025, 3:25 AM

#

whole wagon How did they even make a model like this, it just hallucinates random garbage ev...

Is this prt of gpt5?

whole wagon Aug 6, 2025, 3:26 AM

#

no

#

it is open source models

topaz flint Aug 6, 2025, 3:36 AM

#

Is the Lmarena website down?

#

Generate eror (something went wrong with this response, please try again.)

red sluice Aug 6, 2025, 3:40 AM

#

Any admin connected? There is a serious legal issue with one of the video generated, it needs to be removed asap

echo aurora Aug 6, 2025, 3:43 AM

#

red sluice Any admin connected? There is a serious legal issue with one of the video genera...

Hey, looking int

#

Into*

echo aurora Aug 6, 2025, 3:44 AM

#

red sluice Any admin connected? There is a serious legal issue with one of the video genera...

Can you link?

runic plank Aug 6, 2025, 3:45 AM

#

@echo aurora

red sluice Aug 6, 2025, 3:45 AM

#

I dm'ed you but I can link it here if necessary

#

better to dm it i suppose

runic plank Aug 6, 2025, 3:45 AM

#

How can I delete the message that the bot sends to me in private?

echo aurora Aug 6, 2025, 3:45 AM

#

red sluice better to dm it i suppose

Ah ok thanks

runic plank Aug 6, 2025, 3:45 AM

#

@echo aurora

echo aurora Aug 6, 2025, 3:46 AM

#

red sluice better to dm it i suppose

Okay thanks all set

runic plank Aug 6, 2025, 3:46 AM

#

echo aurora Okay thanks all set

How can I delete the message that the bot sends to me in private?

#

@echo aurora

echo aurora Aug 6, 2025, 3:47 AM

#

runic plank How can I delete the message that the bot sends to me in private?

You can delete the prompt/response in the channels. I’m not sure if you’re able to delete the DM it sends though

#

More info on how to delete in #1397655624103493813

runic plank Aug 6, 2025, 3:51 AM

#

echo aurora More info on how to delete in <#1397655624103493813>

What about the private?

red sluice Aug 6, 2025, 3:52 AM

#

Hover on the username

#

click on the cross

#

?

runic plank Aug 6, 2025, 3:58 AM

#

red sluice Hover on the username

Not this

#

runic plank Aug 6, 2025, 3:59 AM

#

runic plank

I mean this

red sluice Aug 6, 2025, 3:59 AM

#

Well you cannot delete someone else's private message, even if it's a bot. If you want to stop receive it, you can click on "ignore" or "block", but that's it...

runic plank Aug 6, 2025, 4:00 AM

#

red sluice Well you cannot delete someone else's private message, even if it's a bot. If yo...

I will try

tawny kelp Aug 6, 2025, 4:06 AM

#

I had an interesting bug with GPT-OSS:20B.

#

It started repeating the same thing for ~200 lines, and then self-corrected, apologized, and pretended it did that to emphasize what it was saying.

whole wagon Aug 6, 2025, 5:05 AM

#

tawny kelp I had an interesting bug with GPT-OSS:20B.

It is not a bug as such. It is one of its numerous hallucinations

#

The model is SOTA in hallucinations by a large margin looking at benchmarks like simpleqa

fleet lintel Aug 6, 2025, 6:27 AM

#

whole wagon

Grok 2 and gemini.5 are better than this model? Lol. Hot garbage

whole sundial Aug 6, 2025, 7:04 AM

#

lol maverick performs better than gpt-oss-120b

#

that model was so bad they manipulated the arena Elo scores, but yet here we are seeing that model outperform gpt-oss-120b by a large margin

#

also worth noting the similarly sized mistral large non-reasoning model from last year outperforms the oss model

#

gpt-ass-20b is so bad that IBM's Granite 3.1 3B-A800M MoE actually has more world knowledge that it, despite that model having less active parameters and much less total parameters

#

I can't wait for Granite 4 to beat gpt-ass (both of them!) in all benchmarks, they are making their models bigger this time and they are using a hybrid mamba2-transformer architecture

languid crescent Aug 6, 2025, 7:22 AM

#

will claude opus 4.1 be on direct chat?

heady drift Aug 6, 2025, 7:32 AM

#

@echo aurora cloud opus 4.1 is missing in lmarena.ai or was it misplaced before I could choice 2 different models play them against the one and other

meager harbor Aug 6, 2025, 7:35 AM

#

gpt omen weights models hallucinate like crazy

#

bro you didn't even use it, didn't see all the benchmarks that say its trash, Scam Hypeman at it again

hallow ridge Aug 6, 2025, 7:41 AM

#

How do I take away the restrictions

#

on the website

wicked root Aug 6, 2025, 7:50 AM

#

Any update on gpt5?

meager harbor Aug 6, 2025, 7:50 AM

#

wicked root Any update on gpt5?

it hallucinates 5% less than o3

#

REVOLUTION

#

AGI IS HERE

wicked root Aug 6, 2025, 7:51 AM

#

meager harbor it hallucinates 5% less than o3

Is this reliable?

#

O3 is gpt uhh 4?

#

Or is it 3?

meager harbor Aug 6, 2025, 7:52 AM

#

wicked root O3 is gpt uhh 4?

you don't know what's o3 ?

wicked root Aug 6, 2025, 7:52 AM

#

No sir. I’m new to the AI world

#

Ive been using gemini extensively though

meager harbor Aug 6, 2025, 7:52 AM

#

wicked root No sir. I’m new to the AI world

then why your profile say ai intermediate ?

wicked root Aug 6, 2025, 7:52 AM

#

Because I do all my work on gemini pro

fleet hill Aug 6, 2025, 7:52 AM

#

meager harbor it hallucinates 5% less than o3

What are your sources for this

#

Is this based on Zenith's performance?

wicked root Aug 6, 2025, 7:52 AM

#

Like… ALL of it kekw

meager harbor Aug 6, 2025, 7:53 AM

#

fleet hill What are your sources for this

I was trolling, meaning i don't think GPT 5 is revolutionary even if it will be a decent improvement over O3 for the bigger model

fleet hill Aug 6, 2025, 7:53 AM

#

meager harbor I was trolling, meaning i don't think GPT 5 is revolutionary even if it will be ...

😭

fleet hill Aug 6, 2025, 7:54 AM

#

meager harbor I was trolling, meaning i don't think GPT 5 is revolutionary even if it will be ...

Time will tell, hopefully it doesn't hallucinate as much

#

At least I expect the pro version to be way better than o3 for coding

wicked root Aug 6, 2025, 7:55 AM

#

meager harbor I was trolling, meaning i don't think GPT 5 is revolutionary even if it will be ...

You got my hopes up for nothing 😔

meager harbor Aug 6, 2025, 7:59 AM

#

wicked root You got my hopes up for nothing 😔

we'll see tomorrow how gpt 5 pêrforms. I expect a 50 elo jump max over O3(best openai llm for now) for for the biggest gpt 5 models in the arena so AGI is far from here and will only be here when continuous learning is cracked....

proper roost Aug 6, 2025, 8:03 AM

#

How to import PDF files?

wicked root Aug 6, 2025, 8:05 AM

#

meager harbor we'll see tomorrow how gpt 5 pêrforms. I expect a 50 elo jump max over O3(best o...

50 pts put it better than gemini pro yes?

meager harbor Aug 6, 2025, 8:05 AM

#

wicked root 50 pts put it better than gemini pro yes?

the 2.5 pro version, yes I think but it's possible that google is launching gemini 3 the same times as gpt 5

#

so gemini pro 3 could be better than gpt 5 best models

wicked root Aug 6, 2025, 8:06 AM

#

I believe in google supremacy battle3d

fleet hill Aug 6, 2025, 8:07 AM

#

meager harbor so gemini pro 3 could be better than gpt 5 best models

Gemini pro 3.0 vs GPT-5 max reasoning is hard to believe

meager harbor Aug 6, 2025, 8:07 AM

#

wicked root I believe in google supremacy <:battle3d:1374761512912158760>

the father of the LLMs Noam shazeer works for google so yeah I expect it also

fleet hill Aug 6, 2025, 8:08 AM

#

The thing is multimodality, I believe gpt 5 is gonna be more agentic and practical for daily usage

#

Man just imagine using the study and learn feature with gpt5

meager harbor Aug 6, 2025, 8:09 AM

#

fleet hill The thing is multimodality, I believe gpt 5 is gonna be more agentic and practic...

I expect agentic use to still be sheit (especially with the capcha situation)

wicked root Aug 6, 2025, 8:09 AM

#

fleet hill Man just imagine using the study and learn feature with gpt5

What’s this?

fleet hill Aug 6, 2025, 8:09 AM

#

You want a smart model and you don't need to wait 1029292 hours for every answer

wicked root Aug 6, 2025, 8:09 AM

#

What’s agentic use?

wicked root Aug 6, 2025, 8:11 AM

#

fleet hill Gemini pro 3.0 vs GPT-5 max reasoning is hard to believe

😭

meager harbor Aug 6, 2025, 8:15 AM

#

fleet hill Gemini pro 3.0 vs GPT-5 max reasoning is hard to believe

why ? even with the AI acceleration in 2025, it's still not living up to the overhype standard, AGI is still far

hallow ridge Aug 6, 2025, 8:23 AM

#

How can I make it so I can do anything with LLM ARENA

#

I want no restrictions
'

wicked root Aug 6, 2025, 8:39 AM

#

Deepthink is google yes?

#

What makes u say?

golden ocean Aug 6, 2025, 8:40 AM

#

fleet lintel Aug 6, 2025, 8:47 AM

#

I am getting bad feeling about GPT-5 after their trash OSS release.
what if GPT-5 is also all hype and nothing good? 🙁

misty vault Aug 6, 2025, 8:48 AM

#

wicked root Aug 6, 2025, 8:49 AM

#

Alright as long as google wins

hardy pecan Aug 6, 2025, 8:49 AM

#

fleet lintel I am getting bad feeling about GPT-5 after their trash OSS release. what if GPT...

if its anything like zenith or summit, its good

misty vault Aug 6, 2025, 8:50 AM

#

does deepthink beat zenith or summit

fleet lintel Aug 6, 2025, 8:55 AM

#

misty vault does deepthink beat zenith or summit

I dont think they are comparable. I would think Deepthink will win but it will take like 5 min to answer and zenith/summit would take like 20 seconds

high ginkgo Aug 6, 2025, 8:57 AM

#

golden ocean Aug 6, 2025, 8:58 AM

#

deepthink will be 250$ per 1m token

fleet lintel Aug 6, 2025, 9:00 AM

#

Yes, it is. I think it's already available to trusted testers via API

golden ocean Aug 6, 2025, 9:14 AM

#

Yes, I agree. 😐

#

.

fleet lintel Aug 6, 2025, 9:19 AM

#

it means that it will come to API to everyone in near future

torn mantle Aug 6, 2025, 9:27 AM

#

its funny how people are nitpicking on genie 3 but completely missing that what google has built is just insane and incomprehensible

#

it doesnt matter how it looks

#

but how they reached that level

#

i honestly still cant wrap my head around it

quartz light Aug 6, 2025, 9:37 AM

#

have yall noticed the internal tests also occur on aistudio and not the garbage gemini.google.com

#

#

you can actually make out some text
INTERNAL | This environment is for internal search and development. Do not use output in advertising/marketing

neon idol Aug 6, 2025, 9:45 AM

#

chat, what is the best ai image generator for realistic images?

mortal coyote Aug 6, 2025, 9:47 AM

#

what is this error , it shows me everytime i try to generate an image

wicked root Aug 6, 2025, 9:50 AM

#

Left isnt cyberpunk?

keen fulcrum Aug 6, 2025, 9:51 AM

#

Why was opus removed from direct chat 😭

quartz light Aug 6, 2025, 9:51 AM

#

quartz light you can actually make out some text INTERNAL | This environment is for internal ...

this is what gemini got from that blurry image:
ATTENTION: This environment is for internal research and development. Do not use outputs in external-facing products or assets.

#

#

uhh

#

😅

#

dude

#

i just got the full url from the genie 3 video

keen beacon Aug 6, 2025, 10:34 AM

#

From my testing, I have found the GPT-OSS series a bit underwhelming when compared to chinese open source models. I hope somebody has had comparable experiences from testing.

#

Especially it does not do well at all with multilingual stuff.

fleet lintel Aug 6, 2025, 10:50 AM

#

It's a joke of a model. I am not even sure why they bother releasing it. For PR?

keen beacon Aug 6, 2025, 10:53 AM

#

fleet lintel It's a joke of a model. I am not even sure why they bother releasing it. For PR?

Let's hope that they will release an updated version later on like they do with 4o

raven helm Aug 6, 2025, 10:59 AM

#

prob gpt-5-mini or smth like that

novel flame Aug 6, 2025, 11:03 AM

#

Are you trolling? One is a commercial 3D game engine with prebuilt 3D models. The other is a neural network imagining a world and generating pixels from thin air. You can't compare the two at all.

raven helm Aug 6, 2025, 11:03 AM

#

keen beacon From my testing, I have found the GPT-OSS series a bit underwhelming when compar...

But remeber that you need to compare the the parameter size also

tall summit Aug 6, 2025, 11:05 AM

#

keen fulcrum Why was opus removed from direct chat 😭

they hate their users

raven helm Aug 6, 2025, 11:06 AM

#

Yea, i saw Opus 4.1 in direct but then it dissaperead

novel flame Aug 6, 2025, 11:06 AM

#

raven helm But remeber that you need to compare the the parameter size also

That's fair, and I'll admit the OSS 20B model seems to be punching above its weight on some benchmarks (though in my tests it consistently falls short of Qwen3 32B). But the 120B model seems too weak to compete with the 'big boys' like GLM-4.5, and too big to have really interesting ROI / local use cases. The 120B model falls between chairs to me.

raven helm Aug 6, 2025, 11:06 AM

#

How many paramters was GLM-4.5?

Edit: 355B

raven helm Aug 6, 2025, 11:07 AM

#

novel flame That's fair, and I'll admit the OSS 20B model seems to be punching above its wei...

GLM is still a bit bigger than OSS 120B

raven helm Aug 6, 2025, 11:09 AM

#

raven helm How many paramters was GLM-4.5? Edit: 355B

Active Parameters (used per query): 32 billion -GLM 4.5
Active Parameters (used per query): 5 billion - OSS 120B

#

But this is only the start of this, they will eventually get better though'

#

Yep, Genie 3 will definitely not be public

golden ocean Aug 6, 2025, 11:15 AM

#

You forgot to add a period (.) at the end of this message.

raven helm Aug 6, 2025, 11:16 AM

#

Fair

golden ocean Aug 6, 2025, 11:16 AM

#

And forgot to uppercase "M".

raven helm Aug 6, 2025, 11:16 AM

#

golden ocean And forgot to uppercase "M".

But you forgot to put a Full Stop at the end of your sentance untill you edited it.

golden ocean Aug 6, 2025, 11:16 AM

#

But im normal person so i dont do that on discord

#

Thank you for the grammar tips.

misty vault Aug 6, 2025, 11:17 AM

#

@raven helm asked me pictures of feet in dms and then deleted it yesterday

raven helm Aug 6, 2025, 11:17 AM

#

What the hell

tall summit Aug 6, 2025, 11:18 AM

#

WTF!

misty vault Aug 6, 2025, 11:18 AM

#

i got kindof uncomfortable from that

novel flame Aug 6, 2025, 11:23 AM

#

Sure, but you are comparing things (and price tags) which cannot be compared. Comparing Genie to Cyberpunk is like comparing the difficulty of growing an apple blossom on an apple tree to constructing one molecule by molecule in a lab. One is more realistic/beautiful and a whole lot cheaper, and the other is dramatically more impressive even if the result isn't perfect.

novel flame Aug 6, 2025, 11:39 AM

#

I would agree, but I don't think the value of Genie is really anything to do with metaverse or gaming, even if the marketing videos are designed to be visual and gamelike.

At its core, it's a World Model, meaning it's a model that can predict visually, spatially, and temporally what will happen in a 'physical world' given a set of starting conditions and actions. A larger 'brain architecture' can use a world model under the hood to do training through self-play (there's lots of research focused on this), to perform nonverbal experiments to improve its understanding and reasoning capabilities (to better solve riddles of the "marble in a coffee cup upside down" variety), for robots and other autonomous agents to perform planning tasks and visual problem solving, etc. Also, if it can be integrated correctly, a world model has the potential to dramatically improve/speed up generalization in learning, but that's a longer discussion.

This is a fundamental building block of general intelligence, and Meta just released the wildly powerful V-JEPA 2, so Google had to respond.

fleet lintel Aug 6, 2025, 11:42 AM

#

novel flame I would agree, but I don't think the value of Genie is really anything to do wit...

Very good point!

#

How do you compare meta's mode (v-jepa2) with Genie3?
Which is better?

novel flame Aug 6, 2025, 11:50 AM

#

fleet lintel How do you compare meta's mode (v-jepa2) with Genie3? Which is better?

V-JEPA 2 is a purely latent-space model with no video generation capability, and it's open source. I think V-JEPA 2 is important because you can download it and do pretty awesome things with it today (and people have). Genie 3 seems to be built as a native video generation model, meaning it will be a lot bigger/heavier to run, and it's likely going to be used very differently.

keen beacon Aug 6, 2025, 12:02 PM

#

misty vault <@961948716988788756> asked me pictures of feet in dms and then deleted it yeste...

That's internet for you

#

Lol

novel flame Aug 6, 2025, 12:06 PM

#

But you're only talking about the physics simulation itself, not the 'dreaming up a world' part. If you wanted to solve the marble-in-a-cup problem in Garrys mod, you'd have to first create the marble model, the cup model, place the models in the correct orientation, optionally configure the physics depending on the prompt (materials, gravity, air resistance), etc. before you could get an answer -- which you'd need a separate model to derive from the simulation.

The power of the neural network based world model is its potential to create not just the physics simulation, but the world itself and everything in it, and with arbitrary rules provided by the in-context prompt/conditions: it can answer the question under completely arbitrary conditions: is this happening on Earth or aboard the ISS? Is the cup made of ceramic or spider silk? Is the marble preheated to a million degrees? Is the marble under an anti-gravity spell causing it to repel solid matter? The point is, if you hook up to an existing tool/engine/simulator, you'll be constrained by the capabilities of that simulator. By learning a world model, you can get a system effectively without limits.

heady drift Aug 6, 2025, 12:22 PM

#

Oss is trash

novel flame Aug 6, 2025, 12:28 PM

#

Exactly. Action conditioning and also improved world consistency / object permanence over time. Without those it would just be a video generator, which as you say, also has to learn a lot of the same world model knowledge to function

hollow imp Aug 6, 2025, 1:06 PM

#

Your own thinking abilities better

strong oxide Aug 6, 2025, 1:08 PM

#

hello

cedar tide Aug 6, 2025, 1:46 PM

#

Phantom from amazon yes ?

restive dragon Aug 6, 2025, 1:55 PM

#

i mightve missed it but can the video gen genarate nsf?

rare python Aug 6, 2025, 2:07 PM

#

brian i need gemini 3.0 😩

patent aspen Aug 6, 2025, 2:14 PM

#

Clarify

cedar tide Aug 6, 2025, 2:22 PM

#

sullen quest Aug 6, 2025, 2:25 PM

#

Gemini is good but most of the stuff you'd need you can get for free from them. The average person doesn't need a subscription for it and the only ones I can imagine needing one would be another corporation. I just don't see the use case that the free access given doesn't provide that that paid version does.

wheat onyx Aug 6, 2025, 2:30 PM

#

novel flame Aug 6, 2025, 2:34 PM

#

My company (700 employees, multi-national) made a partnership with Google for Gemini. We're definitely paying for it. We also pay OpenAI, AWS, and Anthropic, as well as Cursor and several others for AI.

wheat onyx Aug 6, 2025, 2:34 PM

#

it has, difficult to find most up to date info. interesting to say that Deepmind is behind though. Anthropic has Claude and Claude Code. Deepmind is EVERY DeepMind Product

#

@deep adder What happens to Anthropic as a company if any AI gets better than it at coding? It has the least funding, and coding is it's competitive advantage

novel flame Aug 6, 2025, 2:36 PM

#

Not saying we're putting billions in Google's pockets, but your blanket statements that Google isn't making money and nobody is paying for Gemini are just wrong

wheat onyx Aug 6, 2025, 2:37 PM

#

yes, I think I mentioned that

#

are you sure? Most of Anthropic revenue is API use, right?

#

so what's stopping people from using a different API?

#

link?

stray aspen Aug 6, 2025, 2:39 PM

#

gpt oss 120 has been ranked in artificial analysis

patent aspen Aug 6, 2025, 2:40 PM

#

I mean Google has been doing that since 1998

#

Objectively false

hollow imp Aug 6, 2025, 2:42 PM

#

Pls Google ultra free trial

novel flame Aug 6, 2025, 2:42 PM

#

stray aspen gpt oss 120 has been ranked in artificial analysis

Yep, and its only claim to fame is that it's really really cheap. Neither model (20B / 120B) is remarkable in raw numbers, even compared to the open source options, but they seem to be the cheapest by far, which could be a deciding factor.

brave orbit Aug 6, 2025, 2:42 PM

#

poll_question_text

Best AI Module

victor_answer_votes

9

total_votes

10

victor_answer_id

1

victor_answer_text

Chatgpt o3 pro mode

patent aspen Aug 6, 2025, 2:43 PM

#

Plenty of enterprises and governments have contracts with Google

#

OpenAI, the DoD

#

API usage

#

You said all API usage is enterprise

wheat onyx Aug 6, 2025, 2:44 PM

#

https://blog.google/inside-google/message-ceo/alphabet-earnings-q2-2025/#ai-stack

"More than 85,000 enterprises, including LVMH, Salesforce and Singapore’s DBS Bank, now build with Gemini"
"Its [cloud's] annual revenue run-rate is now more than $50 billion. "

Google

Q2 earnings call: CEO’s remarks

Read Google and Alphabet CEO Sundar Pichai's remarks from the Q2 2025 earnings call.

#

Anthropic revenue is primarily API though

#

GOOGLE CLOUD

fleet lintel Aug 6, 2025, 2:45 PM

#

you are absolutely wrong.. but it doesn't surprise me. you always make claims with full confidence and they are almost always wrong

wheat onyx Aug 6, 2025, 2:46 PM

#

"Google Cloud revenues rose by 32% in the quarter"

raven helm Aug 6, 2025, 2:46 PM

#

poll_question_text

Is GPT-5 Gonna be Game Changing

victor_answer_votes

6

total_votes

21

victor_answer_id

2

victor_answer_text

Probably

fleet lintel Aug 6, 2025, 2:46 PM

#

every single unicorn startup in AI space is using Gemini.. . all of them

novel flame Aug 6, 2025, 2:46 PM

#

The hell is @deep adder smoking? It's called 'Google AI Pro' (or Ultra) and I'm literally using it in another tab... through my company's Enterprise agreement with Google.

wheat onyx Aug 6, 2025, 2:46 PM

#

novel flame The hell is <@348477266704990208> smoking? It's called 'Google AI Pro' (or Ultra...

There is also all the DeepMind API's that are not Gemini

#

all AI

fleet lintel Aug 6, 2025, 2:47 PM

#

novel flame The hell is <@348477266704990208> smoking? It's called 'Google AI Pro' (or Ultra...

it's best to ignore his opinion. I have learned it overtime

hollow imp Aug 6, 2025, 2:47 PM

#

@deep adder why everyone having beef with u

wheat onyx Aug 6, 2025, 2:47 PM

#

https://tenor.com/view/maximum-over-business-productive-corporate-gif-7319405

Tenor

#

it's ok, all vibes. Discount all news about revenue from each company - First principles says Google makes no money

#

yes, that's how product development work in a fast growing space works

#

I just did?

novel flame Aug 6, 2025, 2:49 PM

#

fleet lintel it's best to ignore his opinion. I have learned it overtime

I actually have him on my Ignore list, but I keep letting curiosity get the better of me and clicking "Show ignored messages" so it's my own damn fault

wheat onyx Aug 6, 2025, 2:49 PM

#

#general message

#

yes.. this is what we were discussing.

#

you're right, they're actually just committing fraud

#

except they do

fleet lintel Aug 6, 2025, 2:50 PM

#

novel flame I actually have him on my Ignore list, but I keep letting curiosity get the bett...

LOL 😄 I should probably do that same 🙂

wheat onyx Aug 6, 2025, 2:50 PM

#

all their AI is under Google Cloud as I previously said

#

no. i'm saying that under one segment of their revenue, is AI. And that segment has grown 35% in one quarter

fleet lintel Aug 6, 2025, 2:52 PM

#

wheat onyx all their AI is under Google Cloud as I previously said

not true... google one AI subscription goes to Subscription revenue.. and gemini API revenue goes to cloud

stray aspen Aug 6, 2025, 2:52 PM

#

what are we yapping about

hollow imp Aug 6, 2025, 2:53 PM

#

https://tenor.com/view/hoo-gif-25926964

Tenor

#

@stray aspen we yapping about this

fleet lintel Aug 6, 2025, 2:54 PM

#

no . lol .. their subscription revenue is close to 50 billion dollar per year.. it is crazy high

stray aspen Aug 6, 2025, 2:54 PM

#

its just craig beefing with everyone for the 678th time

wheat onyx Aug 6, 2025, 2:54 PM

#

fleet lintel not true... google one AI subscription goes to Subscription revenue.. and gemini...

#

yes, Anthropic is at $3-4B

fleet lintel Aug 6, 2025, 2:54 PM

#

of course. but they mentioned that subscription business is growing very fast and part of it is because of gemini

hollow imp Aug 6, 2025, 2:55 PM

#

Custom gem feature in Google ai studio when

wheat onyx Aug 6, 2025, 2:56 PM

#

you think Googles $50B revenue run rate in Cloud (increase of 35% QoQ) is from Google Workspaces?

#

I mean you're welcome to say that, seems very unlikely

#

You need to expand on this significantly

leaden sun Aug 6, 2025, 3:01 PM

#

wheat onyx yes, Anthropic is at $3-4B

They cutting rate limit despite growing? Or is this not good enough for them…

wheat onyx Aug 6, 2025, 3:04 PM

#

Yes, it's all Google Calendar and Google Meet that's causing the increases of $xxb in revenue growth

whole wagon Aug 6, 2025, 3:06 PM

#

https://cdn.openai.com/API/docs/images/model-page/model-icons/gpt-5.png

wheat onyx Aug 6, 2025, 3:06 PM

#

Yes, this is all AI

#

it's an entirely different argument, and has nothing to do with revenue

whole wagon Aug 6, 2025, 3:07 PM

#

https://cdn.openai.com/API/docs/images/model-page/model-icons/gpt-5-mini.png

https://cdn.openai.com/API/docs/images/model-page/model-icons/gpt-5-nano.png

naive kiln Aug 6, 2025, 3:08 PM

#

Hello

wheat onyx Aug 6, 2025, 3:08 PM

#

you understand capex doesnt impact Net Income?

#

I was referring to all AI

#

You said how much profit is google making. and right before you mentioned capex. capex is not included in profit

#

I think pretty straightforward

#

Well if you are referring to purely profit, then I think quite a lot, My understanding is that Google AI usage is much more efficient than other companies. I don't have a number of Gross Margin for any of them, other than purely efficiency news I've seen before

#

I don't disagree that capex is something to be careful of, but has nothing to do with our discussion

#

DeepMind or Gemini?

#

yes, along with a ton of other AI products

keen beacon Aug 6, 2025, 3:14 PM

#

I mean, OpenAI oss models that are 120b seem to be behind o4-mini. I wonder how many params o4-mini is then as before i used to think it was less than 80b

wheat onyx Aug 6, 2025, 3:14 PM

#

you explicitly said "only Gemini... not all AI"

#

now we're referring to explicitly cash flow?

whole wagon Aug 6, 2025, 3:14 PM

#

keen beacon I mean, OpenAI oss models that are 120b seem to be behind o4-mini. I wonder how ...

No it's a lot more lol

keen beacon Aug 6, 2025, 3:15 PM

#

whole wagon No it's a lot more lol

Seems like it. But then how do we explain its less than ideal general intelligence if its supposed to be a huge model (think llama 3.1 405b in comparison)

whole wagon Aug 6, 2025, 3:15 PM

#

Interesting the December odds didn't shift much, Google still top. I guess Gemini 3 expected to be strong also

#

It will shift back and forth between openAI and Google for a while I suppose

wheat onyx Aug 6, 2025, 3:16 PM

#

how many of Gemini's 450m users pay for API calling or subscription?

keen beacon Aug 6, 2025, 3:17 PM

#

Do they? their models are quite bad conversationally and in multi-turn chats too

#

highly doubt it

wheat onyx Aug 6, 2025, 3:17 PM

#

sorry I'm mistaken, it excludes API: "The Gemini App now has more than 450 million monthly active users, "

keen beacon Aug 6, 2025, 3:18 PM

#

Ah, you're talking about their king series models right?

whole wagon Aug 6, 2025, 3:18 PM

#

Hm the style control boosts openAI a lot to help account for that

keen beacon Aug 6, 2025, 3:18 PM

#

I doubt they are pulling a Llama here

wheat onyx Aug 6, 2025, 3:18 PM

#

I didn't pretend it didn't

#

I think it's a pretty direct response:

#

?

keen beacon Aug 6, 2025, 3:20 PM

#

ChatGPT does. which is quite weird since everyone expected google to have better distribution

#

It still doesn't make sense how OpenAI caught on and dominated so fast

#

bewildering

bright kayak Aug 6, 2025, 3:21 PM

#

100%

keen beacon Aug 6, 2025, 3:22 PM

#

I mean, 99 % use cases for normal individuals are suffcied by mini models (GPT-4o-mini) which is why users mostly care about convinience which OpenAI does provide with ChatGPT

#

What is this?

#

Poly market?

#

what is it about

#

by when?

#

Ah, yeah, that seems right.