#general

elder rapids
#

with a capital G

#

why do you keep deleting your messages

small haven
#

psi

#

how good is "o3 pro"? any insights here?

keen beacon
#

itll be better than o3

small haven
#

wow thats breaking news

keen beacon
#

it has "pro" in the name

small haven
#

omg i have goosebumps now

keen beacon
#

theres two "o"s in the name

elder rapids
#

o3 is blinking

keen beacon
#

(i wonder how many people actually read the raw cot when asking it to code, it does a lot of cot within comments. the final output even with the comments is pretty stripped of it, but enough to understand where the tendency seems to come from)

fleet lintel
#

o3 pro should be interesting.. i have not much hope from grok3.5 or deepseekr2 but very hopeful about o3 pro

elder rapids
#

deepseek r2 is never releasing

#

😭

quiet folio
#

I dont want my own comments that actually make sense to get removed though

misty vault
#

Fr

quiet folio
#

no I actually just escaped maximum security prison for being able to see deleted messages

misty vault
#

that's literally what he just said

tall summit
#

oh nice independent scrolling

#

i don't think that is the most difficult to code but alright

keen beacon
#

i deleted it because treesitter might not be the best way to do it

#

there might be an easier way to do it (dependent on the editor) in a generic fashion with semantic understanding

echo aurora
ocean vortex
pliant cypress
#

redsword and goldmane are only available on webdev arena?

barren prairie
small haven
unborn ocean
#

nous research still beats it

#
NOUS RESEARCH

Hermes 3 contains advanced long-term context retention and multi-turn conversation capability, complex roleplaying and internal monologue abilities, and enhanced agentic function-calling. Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fin...

topaz peak
#

looks like some frontend for another already existing A.I , ie a scam

keen beacon
topaz peak
#

beautiful website ngl, but i am not convinced

quiet folio
small haven
#

so whats the consensus on claude 4 opus, shite or vibes

keen beacon
wintry tinsel
unborn ocean
golden ocean
unborn ocean
#

i don't personally use them much, but they have very popular finetunes of llama, mistral models

misty vault
#

bro claude 4 opus kinda stupid ngl

unborn ocean
#

(mainly focusing on tool use and conversational stuff)
But now they are also working on training their own models trained over a crypto-like (prob not the right word) compute network.
If you are interested in research, i highly recommend checking out some of their work.

misty vault
#

Its code works but so much duplicate code

#

rookie mistakes

keen beacon
#

its been a while but i didnt recall them using qwen

elder rapids
#

and it's not going to be webdev monster for that long anymore

unborn ocean
#

mixed it up right there with mistral for some reason

keen beacon
#

it seems they avoid qwen

unborn ocean
#

otherwise one could use athene for high quality qwen finetune

unborn ocean
#

but they are a public company (so they probably just used the best model available for their model)

#

and it might also be that qwen is just not good at conversational stuff or roleplaying

keen beacon
#

maybe if u dont tune directly off the base model (noushermes tunes on the base model, so can't be that)

small haven
unborn ocean
#

nice priorities

keen beacon
#

for sure

unborn ocean
#

Psyche is an open infrastructure that democratizes AI development by decentralizing training across underutilized hardware. Building on DisTrO and its predecessor DeMo, Psyche reduces data transfer by several orders of magnitude, making distributed training practical.

#

might actually donate some compute

keen beacon
#

yeah it is

#

beyond the decentralized aspect, it might be interesting 20t is a solid amount of pretraining tokens

unborn ocean
#

tru 20t is actually like close to what qwen used, right?

keen beacon
unborn ocean
#

"In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens."

#

so it is like actually close to SOTA and more than qwen 2.5 and llama 3 i think

keen beacon
#

Some of the models 19 trillion

keen beacon
#

Qwen 3 is 36 trillion

#

I still won't be expecting much, nous research haven't been pretraining their models like qwen. It might be like 20t of slop but I don't know lmao

unborn ocean
#

ik, just the first quote i found

#

and i am not sure if the 20t is complete from nous research

#

or just s1

unborn ocean
#

is kind of a lot for a small research collective

keen beacon
unborn ocean
#

jup kind of ambitious

keen beacon
#

its gonna take years at the current rate

unborn ocean
#

about 1111 days total

elder rapids
#

but it doesn't go very far beyond acknowledging it and very slightly adjusting

sonic tendon
#

source,

misty vault
mortal flame
#

I like Calmriver, but I wish I knew WTF it really is?

ancient sandal
#

is goldmane > pro-05-06?

#

seems like it since people saying it's better than opus

dusky aurora
late path
calm sequoia
#

Anyone seen any rumors what happened to R2?

brazen vine
#

has anyone subscribed to github coding agent? how was your experience?

balmy mist
#

what did i miss?

ocean vortex
calm sequoia
#

Something is cooking with the GPT 4o

#

It just answered a long prompt in milliseconds

#

As soon as I pressed the "Enter" button 👀

golden ocean
#

that sounds like the opposite of cooking

misty vault
misty vault
trim vale
#

Is it normal that a gemma 3 gguf model of the same size as a perfectly working llama model seems like it requires much more memory

#

Most other gguf models use roughly the same amount of ram when their sizes are similar... yet gemma 3 seems to work differently

misty vault
#

because gemma 3 is agi

calm sequoia
misty vault
#

"what is 9 + 10?"

calm sequoia
#

It's strange because even small models can't do it in milliseconds

#

Maybe it's that their servers were not overloaded

misty vault
calm sequoia
misty vault
#

Add gpt-4 im not kidding

#

Me

rancid oasis
#

What's good frens

misty vault
#

real

#

Just admit gpt 4o is cancer bro 🙏

#

"You are too poor" sucking off gpt 4o and gemini 2.5 pro says enoughsadboyo

tall summit
#

yeah

#

ok

#

well

#

you can ask in lmarena

#

if it's just a simple prompt

#

and..

#

idk how much long context is but it might be fine

#

have you tried it

high ginkgo
#

he's tricking u man

tall summit
#

???

high ginkgo
#

don't fall for it

#

your pc will turn into mining bot

tall summit
#

real

#

lmao

high ginkgo
#

he is going to drain all ur tokens for today with one prompt

#

it is trickery

golden ocean
#

Claude 4 opus is struggling with one bug

high ginkgo
#

😄😄😄😄😄

golden ocean
#

And instead of failing to fix the bug it just destroyed the entire app

#

With an additional 100 lines of comments

#

And Claude 4 opus thinks for like 5 sentences and that's it, it's not even thinking, it's literally just repeating the task I gave it

quiet folio
#

I can see R is on the middle side of the gaussian IQ chart 😄

misty vault
misty vault
quiet folio
misty vault
#

So does that imply that gpt 4.5 and claude 4 opus are on par

late path
cedar tide
#

The whale is back

misty vault
#

Sus

cedar tide
late path
#

I think it's extremely unlikely for any company to catch up to the level of 2.5pro now. OpenAI and Anthropic have tried their best, but o3 only surpassed 2.5pro in specific areas, and opus still feels like a previous generation model.

misty vault
#

Okay, I get it, the text is not accurate because it is not on par with gpt 4.5 and claude 4 opus at the same time. Then is it on par with at least one of them? Like 4.5 or claude 4 opus or just overhype and it ends up worse than both? I guess we don't know, but it looks like they are back, so let's hope they will be the next sota model, go deepseek! 🐳🥵

tall summit
#

fake

misty vault
#

I think it real but who knows how well it actually performs

tall summit
#

i hope

#

but i don't care until it releases

quiet folio
#

Fr

misty vault
fleet lintel
#

OpenAI must be cooking something. They were about 18 months ahead of Google about 18 months back (Gemini 1.0 launch). And they have huge talent and enough money to burn. I dont think they can squander away all that lead in such a small time. I think something big must be coming from them

cedar tide
#

Well, sorry for sharing dubious information, after talking to the person behind the rumors it seems fake.

misty vault
high ginkgo
#

not true. megacorps like xai, openai and anthropic have agi and asi internally

misty vault
cedar tide
torn mantle
torn mantle
cedar tide
#

I don't think Open AI is ahead

ocean vortex
# cedar tide

This would depend on the model and whatnot. But what he is referring to here is not a finished product. More like an experimental model that was not tuned yet or safety aligned. It's not only the latter though, meaning the product a user is gonna see could be better than what he has his hands on now.

#

Looking at his resume he doesn't look like a very technical person either tbh. All his roles were product manager. So not an ML Engineer and I doubt he's in the loop on the training or differences in all the models 👀

flat flax
cedar tide
#

Yes

patent aspen
#

The easiest way to make comparisons is just to look at the pace of improvement of released models

high ginkgo
fleet lintel
#

3.5 today? I am already prepared to be disappointed.

sonic tendon
#

where are you getting this, out of curiosity?

#

is it just researcher tweets

sonic tendon
brittle tiger
sonic tendon
#

lmaoo

misty vault
golden ocean
#

LMARena rage baiting or trolling is too easy

sonic tendon
misty vault
#

🚛

sonic tendon
#

🫡

#

ello

#

baseless claim 💯

quiet folio
#

I smiled until I read 💯

tall summit
misty vault
#

@sonic tendon can relate

sonic tendon
#

it's roughly 50-50 within may now

#

well

#

that was mostly me

#

low liquidity, shouldn't have bet as much as I did

#

should I make another relevant AI news thread?

#

relevant AI news (version 2)

tall summit
#

ooooh you made the poll

#

time to vote

#

wow 2k mana free per alt you make and refer

ocean vortex
#

it was underperforming considering current models but it was still great at the time. We didn't have any real alternatives for raw thinking output when this was just released

tall summit
ocean vortex
#

2.5pro wouldn't exist or wouldn't be nearly as good if they hadn't made flash-thinking earlier as well

cedar tide
#

I hope that the "request models" category is not just there to look good but that they will add the models that the community requests 😶

ocean vortex
#

new Deepseek maybe today

tall summit
#

too many rationalists on this platform (manifold)

tall summit
ocean vortex
#

this is not a joke, it's a real thing. Well a real rumor at least lol

tall summit
#

I've been working (together with Javier Gomez-Serrano) with a group at Google Deepmind to explore potential mathematical applications of their tool "AlphaEvolve", a successor of their earlier tool "Funsearch" that was publicly announced today: deepmind.google/discover/blog/… . Very roughly speaking, this is a tool that can attempt to extremize functions F(x) with x ranging over a high dimensional parameter space Omega, that can outperform more traditional optimization algorithms when the parameter space is very high dimensional and the function F (and its extremizers) have non-obvious structural features.

Some of the preliminary problems we have tried this on, including problems involving harmonic analysis inequalities, additive combinatorics, and packing, were already mentioned in the announcement; we are now gradually moving on to more challenging problems where the parameter space has a sparser set of good solutions. The work is still ongoing, but I hope to be able to re…

ocean vortex
#

Dork ASI confirmed 👍

#

Probably only gonna be available to SuperDork subscribers though

harsh flume
#

Is there any potential grok-esque model in the arena rn?

vernal meadow
#

What temp does lmarena use for the models?

ocean vortex
fleet lintel
tall summit
fleet lintel
#

Terence Tao is legit! Alphaevolve may not generate much news but could play huge role in advancing humanity

olive mesa
#

Do y'all think there are employees from different big companies watching this chat to see people's opinions on their models beuh

jade egret
#

hello

hybrid gate
#

i know claude 4 just released but when does new AI usually show up on the leaderboards?

feral creek
#

yea thats a good question im curious too

sour spindle
#

There's a simplicity to it that makes it very persuasive to users

jade egret
ocean vortex
# jade egret

if it's o3 on chatgpt website then there is no competition at all lol. It's way more effective at using python than other models to compute/verify and that's in addition to it already being very strong offline

torn mantle
#

xai is cooking you up

#

??

#

?!

fleet lintel
#

"UAE gives all 11M citizens free ChatGPT Plus".

Very interesting

misty vault
#

this is actually real

narrow elbow
misty vault
#

LMfAo

small haven
#

can july come any sooner, me wants deepthink

small haven
#

shh deepthink gets release -> o3 pro releases in tandem 🧠

sonic tendon
#

this actually looks legit given how badly the unsloth guy is trying to cover

keen beacon
#

i don't think its happening btw

golden ocean
#

We know

misty vault
keen beacon
ocean vortex
#

yeah at this point it's too late in the day

keen beacon
#

only the unsloth article which is based on this information

ocean vortex
#

and there was also this tweet

#

someone associated with Deepseek replying I presume but I didn't have time to verify it tbh

misty vault
ocean vortex
#

Opus is a bit of a weird model btw. Really quite unusual how they couldn't showcase anything other than swe essentially. But it does hold up when you test it and looks unique and quite capable 🤷‍♂️

golden ocean
keen beacon
#

i didnt see it

ocean vortex
#

In my eyes I almost wrote it off completely after I saw their benchmark manipulation with parallel processing lol

#

but it actually seems good

keen beacon
misty vault
keen beacon
misty vault
#

@keen beacon yep

keen beacon
#

saying it was based on that

quiet folio
# misty vault

This message and then the x link implied to me that unsloth based it off that before wild posted it not gonna lie

keen beacon
sonic tendon
golden ocean
#

I think its obvious that unsloth based it off that

keen beacon
misty vault
#

who

keen beacon
#

unsloth

misty vault
#

cares

sonic tendon
#

hmm, wasn't 0324 released on the 25th

keen beacon
#

u r the one continuing it tho

misty vault
#

who cares about gooning with gpt 4o

ocean vortex
sonic tendon
#

i care

ocean vortex
#

just selectively ethical I suppose. 👀

sonic tendon
#

yeah I'm giving up on speculation for today

#

they seem to imply no access to insider info outside of that

#

i was wrong about Claude, tbf, so take my opinion with a grain of salt

#

but I think it could be real

ocean vortex
#

Deepseek is very very strict with "confidential info" lately yeah. And it's China we are talking about so consequences are different lmao

sonic tendon
high ginkgo
#

grok 3.5

sonic tendon
#

you could be right, tho

sonic tendon
keen beacon
#

unsloth had early access to qwen 3 (which might imply they might have insider info about deepseek) but mike from unsloth said they just based it on that specific tweet

sonic tendon
#

I guess the chief question is "did unsloth actually base the article on speculation or were they just trying to cover their ass"

#

i lean slightly towards the latter, but tomorrow will tell

#

mildly sus but plausibly deniable

#

"is now the best performing open-source model in the world" is quite implausible as a copy-paste

unborn ocean
#

true, i get that they would want a template for release

#

but the highly specific information is really sus

sonic tendon
#

ik people dissed on speculation earlier, but, honestly, I find this pretty fun

unborn ocean
#

and they are not even really denying much, they are actually saying a release is very likely right now

sonic tendon
#

deepseek cannot hide from me

unborn ocean
#

I mean speculation about models and model releases are like our 'specialty' in this chat

#

so...

sonic tendon
#

yeahhh

balmy mist
#

what i miss?

#

what happened to relevant news thread?

upbeat zealot
#

any idea when claude 4 will be released in lmarena? it's been several days since release

small haven
#

i love hallucinations

sonic tendon
vernal meadow
#

Sonnet 4 sux so much it is sad. bad logic, bad math and always refusing to answer for stupid reasons.

It might be an agentic coding god but certainly not a good chat model

#

I would not be surprised if it is even lower than 3.7 in lmarena.

patent aspen
#

Claude finally made it out of Mt Moon

balmy mist
#

thanks tho

jade egret
#

flow

cerulean seal
#

hi guys, this website is free?

misty vault
misty vault
cerulean seal
#

how do they pay for that

#

wrf

upbeat zealot
cerulean seal
#

how do they do thatt 😭

upbeat zealot
#

still its not even on the leaderboards

misty vault
#

they dont

cerulean seal
#

claude 4 opus is amazing at coding

cerulean seal
misty vault
#

they get sponsored

cerulean seal
#

wont this get taken down soon

misty vault
#

by all the companies

cerulean seal
#

OH!

#

so

#

what happens if the season ends

#

will the ai go away

#

😞

#

i dont wanna pay $100 a month for claude bruh

misty vault
#

if the organization decides to remove lmarenas access to a certain model then yes

#

for older models

cerulean seal
#

ph

#

thats so cool

#

i wish they can add deep research

#

i thought the website was pirating AI

#

i didnt know that was competition

misty vault
#

me when gpt-4-0314

cerulean seal
#

gpt-4 exists?

jade egret
#

hi

cerulean seal
#

whats the oldest model in the website

jade egret
#

what is this

cerulean seal
#

@misty vault bruhh i just saw that the opus 4 stops at a random part of coding

#

so it cant be abused

#

noo

misty vault
#

just say "continue"

cerulean seal
#

fr?

keen beacon
misty vault
#

yes

cerulean seal
#

that easy?

misty vault
#

yes

cerulean seal
#

ill tell it to stop in batches and continue

jade egret
#

is it gemini 2.5 pro I/O

cerulean seal
#

cause i was in the middle of a game

#

and then it cancelled

jade egret
keen beacon
jade egret
#

for some reason it crashed

jade egret
misty vault
keen beacon
#

there are much better 2.5 pro models in the arena right now

cerulean seal
#

i feel like LMArena is so abusable

jade egret
cerulean seal
#

i dont like how features are limited

keen beacon
cerulean seal
#

i might just buy the $10 monthly for opus 4

misty vault
#

Idk they secured it pretty well

jade egret
jade egret
keen beacon
#

yeah u can evaluate it yourself

cerulean seal
#

whats the best AI to use rn?

#

for coding

misty vault
#

for coding claude 4 opus

jade egret
#

yea

#

but you gotta keep saying continue

cerulean seal
#

20$ a month..

patent aspen
#

Jules is free

cerulean seal
misty vault
#

dork 4.0 is agi

jade egret
misty vault
#

yes ask @deep adder

patent aspen
#

When will this joke die?

jade egret
#

@deep adder

jade egret
#

is drok 4.0 agi (like actually)

golden ocean
#

Yes

jade egret
#

bro actually fr?

misty vault
#

dork*

jade egret
#

dork

#

eh i dont think it is

#

💀

misty vault
#

gpt-4-preview-0314 was agi

jade egret
torn mantle
#

sup craig

#

our asi

patent aspen
#

What is the average age of this channel?

#

12?

torn mantle
#

are you sure im a man

#

you sound so sure

patent aspen
#

It's the gen Z / gen alpha slang

#

And the jokes

#

Makes me think this is a younger server

#

?

#

I wasn't saying it was bad

torn mantle
#

< 20

#

Spill the tea brian

patent aspen
#

32

torn mantle
#

That's fine

patent aspen
#

Thanks

wintry tinsel
noble zinc
small haven
#

apple insiders

small haven
#

craig im trying to like claude code, but wtf ..

#

smh

#

aint they owned by amazon

#

or partially

#

i mean to buy anthropic off, minimum $100b

#

dont think theyll accept fair value

#

ya minimum $100b

#

private equity?

#

$61.5b is based on the funding round theyve raised, its not really a stock market, just a gauge

#

series e1 theyre very late stage

#

and u didnt know about their funding round smh

#

i mean the only thing they have is siri

solar hollow
#

isnt anthropic bound to amazon somehow though?

patent aspen
#

Anthropic has major deals with Amazon and Google and partners with other big tech companies. The deals with Amazon and Google couldn't be exclusive because of antitrust oversight from the FTC. For this reason it's also highly unlikely that Apple could acquire Anthropic in the current regulatory environment.

#

If you're a big tech company, you're not supposed to acquire or make exclusive agreements with nascent companies that have the potential to become a significant competitor in the future

#

Big tech companies have tried to get around that with special deals that are like pseudo-acquisitions, but even those deals have faced heavy scrutiny from regulators

sonic tendon
# jade egret what is this

hard to say for sure, but people suspect that it (dragonclaw) and redsword are the non-preview versions of gemini 2.5 pro and flash

late path
#

dragonclaw is probably an old 2.5 pro checkpoint, no longer in the arena now

elder rapids
#

it was pretty smart

#

but it was like a strong model gone wrong

#

like there was something off about it

#

it didn't know how to spell lmfao

#

insane syntactic errors

keen beacon
#

O3 does some really strange stuff which remind me of that

elder rapids
#

ye

#

btw sometimes it glitches out

keen beacon
#

2.5 pro doesn't seem to be plagued by those problems, at least the released versions (in a very visible way)

elder rapids
#

when talking to o3

#

it seems like o3's CoT is a pseudo tree

#

even though it's single

#

it kept telling me multiple revisions through an A B C process

#

and forgetting which one it was assigning to the context

keen beacon
#

i dont really understand what ur trying to say

elder rapids
#

aren't we in an AI server

#

just use chatgpt

keen beacon
#

others dont seem to get it sometimes from what ive seen. and i doubt the models would do that well without your context

keen beacon
elder rapids
#

no one else is here

#

but if you mean it's a trend

keen beacon
#

im talking about past conversations

elder rapids
#

benefit of the doubt it isn't inherently loaded and you take it as is without accusing it of sophistry

#

so if I'm invoking 3rd party Interpretation (which inherently would lack ALL context besides the claims) that should speak volumes in what I'm trying to say regardless

#

just take it as is, ion know what else to say

fleet lintel
#

what are the latest un-released good models on LMArena?

keen beacon
fleet lintel
#

I think goldmane is gemini.. what about redsword?

calm sequoia
#

Also gemini

#

Have you guys checked the cutoff date of these models?

keen beacon
#

one of them will be ga 2.5 pro i think (best one will be chosen)

fleet lintel
#

small incremental improvements over current 2.5 pro or decently big improvements?

keen beacon
#

people say its better than nightwhisper based on the posts in this channel

fleet lintel
keen beacon
late path
#

rumor has it that the current gemini 2.5 is actually gemini 3 internally lol

keen beacon
calm sequoia
# calm sequoia
Poll: Best general purpose LLM in 2025 yet

Winner: Gemini 2.5 Pro 03-25 (answer 3), 7 of 20 votes

#

Goldmane October 2024

#

It's interesting that it answers differently if you ask for the date in different language

#

Redsword June 1, 2024

keen beacon
#

(if its not trained in or provided in the system prompt)

#

but it is interesting nonetheless

calm sequoia
#

Yeah I guess it takes too much time to check. At least they don't know what happened in 2025

#

Lol GPT 4.1

#

The original 2.5 PRO is always off the charts when pricing is included.

#

Also MCbench update

ocean vortex
late path
#

I think goldmane will be better than 0325

ocean vortex
#

We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined.

Ok so they used saturated outdated benchmarks catgrin

calm sequoia
ocean vortex
cedar tide
calm sequoia
#

Don't have experience with claude 4, but if you remove the 4.1 Nano it seems good.

keen beacon
#

lmao theres deepseek prover on mcbench?

ocean vortex
#

the thing to keep in mind also: if you're gonna use old benchmarks like that, contamination is likely gonna be a bigger problem for labs that were doing this for a long time than for relatively new players who got to it after people moved on to other less saturated (by default) metrics.

#

or just older models vs new, depending how much they changed their datasets

cedar tide
cedar tide
calm sequoia
#

Can't find it anymore. Ask Dom

ocean vortex
#

look at qwen

ocean vortex
#

this doesn't seem to be included in their paper

calm sequoia
#

The new Unified-Bench is here. Compare the performance and cost of 20+ LLMs, averaged across 25 diverse benchmarks! Thanks to @E_Ellipsis, you can now compare Gemini 2.5 Pro (03-25), 2.5 Flash (05-20), and even Claude 4 models with o3, o4-mini, Qwen3, Sonnet 3.7, and more. 🧵

o3

keen beacon
#

metabench seems to be more interesting than expected

ocean vortex
#

some scores are oddly referenced, like they referenced gpt4o AIME25 against Claude parallel processing one with majority voting? hm

#

em.. wtf lol

#

I suppose it makes sense considering it's so concise with reasoning, but this is completely opposite to Anthropic's table... Overfitted on newer AIME25? catgrin

keen beacon
ocean vortex
#

this is a HUGE discrepancy

keen beacon
#

messed up

keen beacon
#

well it was like that

ocean vortex
keen beacon
#

they remeasured

keen beacon
#

theyve been trying to fix it i think

#

#general message <-- claude 4 sonnet thinking having a lower score than claude 4 sonnet before they remeasured

ocean vortex
#

Opus is giving off slight gpt4.5 vibes of it being outperformed by smaller models in places regardless tbh, although it's a more capable model now. I think they could do more RL training on it

keen beacon
#

Not worth it imo

ocean vortex
#

can't see how it wouldn't lead to higher scores. I think GPQA could be improved just with a sys prompt. That benchmark seems to favor longer outputs a lot. And considering the size there should be more gains than long outputs from smaller one

keen beacon
#

They should focus on making another 3.5 sonnet

ocean vortex
#

unless like train Sonnet non-reasoning on final Opus/Sonnet-thinking outputs. But at that point I would just do both things - this and further RL training on Opus. To see which option is the more promising etc

calm sequoia
ocean vortex
#

oh wait. They probably just used the same method but picked their own different benchmarks to get subsets from. Yeah I was just looking at this wrong lol

keen beacon
#

the meta bench thing is coincidental

ocean vortex
#

but what is the point of calling it metabench then? I would think at least some of it is similar catgrin

keen beacon
ocean vortex
#

yeah it would appear so they didn't independently test anything, how is the average calculated then..?

#

"Hallucinations when summarizing" -- this should have been inverted at the very least but they are weighting the scores in some interesting ways. Cause even when I treated this metric as a very bad score the average for o3 came up still higher at 61.8%

fleet lintel
ocean vortex
# fleet lintel wow... Claude 4 launch is disappointing af

Maybe not disappointing per se (reasoning still looks very solid as well as context awareness etc, probably the closest model now to the feel of gpt4.5 with better performance), but yeah there are things where it seems to underperform for sure

#

would be interesting to get SimpleQA score of it

mossy drum
#

New model in Beta Arena: qwen3-235b-a22b-no-thinking

peak notch
#

Hello @everyone
Hire a Generative AI Engineer | Unlock the Power of Intelligent Automation and Creativity

Are you looking to leverage the latest in generative AI to drive innovation, enhance productivity, and create smarter workflows?

I'm a skilled Generative AI Engineer with hands-on experience in designing and implementing AI-powered solutions that deliver real value. From custom chatbot development and workflow automation to AI-assisted content generation and LLM integration, I offer both technical depth and strategic insight.

Who I Work With

Startups seeking rapid prototyping and intelligent systems

Small to medium businesses looking to automate and scale

Agencies enhancing service offerings with AI capabilities

Creatives and marketers integrating AI into daily workflows

Let's Work Together

I am currently available for freelance projects, part-time contracts, and consulting engagements. Whether you need a full build or expert guidance, I can help you integrate AI into your business with clarity and efficiency.

alpine coral
#

which responds in Chinese unprompted and self identifies as from Baidu

#

(the subsequent responses were in English.. but nothing to write home about.. didn't perform or feel like grok)

torn mantle
#

ernie x1 ( reasoning model )

torn mantle
#

sorry i didnt read the baidu part

ocean vortex
alpine coral
#

lol

calm sequoia
#

But the fact that self-reported benchmarks are included makes it already unreliable

#

Though I don't believe OpenAI would cheat here

alpine coral
#

yeah im kinda similar.. at least i think oai (and most of the other major players) are more likely to selectively publish / cherry pick evals or do sneaky things involving asterisks* than outright lie

keen beacon
#

competitors can remeasure too

#

faking evals is not a good idea

alpine coral
#

yeah

keen beacon
#

though u cant replicate the claude 4 benchmarks

#

the parallel scores

#

it uses an internal scoring model LMAO

alpine coral
#

yeah wow lol

#

i saw previous comments alluding to this and agree: would've expected better from anthropic..

#

i mean they're meant to be a transparency / alignment / safety company...

#

yet they're releasing sneaky / non-reproducible evals..

#

that said.. and just fwiw.. i do think opus 4 is a genuinely strong / top tier model

keen beacon
#

just report pass@1 and thats it tbh

ocean vortex
alpine coral
#

[but i think sonnet 4 is perhaps a flop... and that doesn't bode well for Anthropic at all imo]

keen beacon
#

i dont think either were pretrained from scratch theres that i think

alpine coral
#

yeah i agree

keen beacon
#

maybe their actual new pretraining run will be good (although there might've been architectural changes along the cpt, but we can't tell)

alpine coral
#

yeah they've just lost time - and afaik remain constrained by resources / compute

#

so it will be a challenge

#

both technically / financially combined with their lagging position relative to competitors (esp oai and google)

ocean vortex
#

so it becomes difficult to directly compare it against competition

alpine coral
#

yeah the limited release of evals compared to previous model releases kinda says something in itself

ocean vortex
#

AIME25 score is good, but then AIME24 seemingly isn't..

sour spindle
#

The limits with Anthropic models are maddening

keen beacon
#

@alpine coral did you try goldmane/redsword btw?

alpine coral
#

i was just trying to get them in the arena actually! but haven't had any luck

#

i got calmwater - kinda surprisingly (i thought it was associated with a now-released version of 2.5)

#

it performs well

keen beacon
#

calmriver is the latest 2.5 flash i think

#

if its calmwater its different

alpine coral
#

ahh it's calmriver

keen beacon
#

its still on the arena?

alpine coral
#

yeah

keen beacon
#

maybe they forgot to update the name

#

webdev arena metadata

alpine coral
#

it's almost certainly got thinking enabled

mossy drum
#

New model in Beta Arena: glm-4-air-250414

ocean vortex
#

Air is a trendy name right now. "Our slimmest model yet"

alpine coral
#

very, very impressive / strong

#

like really good (not a step change.. but it looks - based on two quizzes.. given in a single exchange - like a genuinely stronger pro 2.5)
[actually i dunno.. maybe a step change... pretty damn good]

sour spindle
#

Have you tested goldmane

alpine coral
#

no haven't gotten it yet

cerulean seal
#

opus 4 still good at coding?!

alpine coral
#

have you used both?

#

personally i have no idea - though others here will

sonic tendon
alpine coral
#

yeah.. i mean for models in the arena there aren't really other options ha

#

though some of the scores are from API/official chat

sonic tendon
#

idk, i figured it wouldn't be too difficult to reverse-engineer the battle api

#

could be wrong

torn mantle
alpine coral
#

but yeah i couldn't do it even if it was somehow possible

torn mantle
#

goldmane vs redsword

sonic tendon
#

eo?

torn mantle
#

sometimes one performs better than the other

torn mantle
sonic tendon
#

ah

torn mantle
#

you thought its a new model

#

pat pat

sonic tendon
#

:3

#

webdev arena system prompt, for anyone curious

#

too lazy to remove the //s

#

oh, nvm, someone posted it a bit ago

alpine coral
#

yep! kinda got that nebula feel about it tbh ha like yeah been a while since something like

#

though sample of 1.. shouldn't get too far ahead myself ha

spare mango
#

Should I get Gemini Pro or ChatGPT Plus for university Computer Science assistance and research?

#

I'll ask the AI to summarize chapters within the digital books that they've provided us, amongst many other things.

alpine coral
#

i got goldmane (on beta chat)

#

it did slightly worse on the two question sets above (12, 6 respectively - still v strong but not at the very top like redsword)

#

but interestingly.. i gave it an additional question set after those two - which it smoked

civic flame
#

finally we'll get anon models on the nice UI

dapper storm
#

LMArena will stay open and accessible to everyone

So does that mean the ranking algo will stay open? Or just that we'll be able to see the rankings

tardy pasture
#

@echo aurora How does one use the search models on the new UI?

spare mango
#

Am I on the actual Gemini Pro?

#

I've been added to a friend's family.

#

Why does it say (preview)?

alpine coral
#

i don't think the old site is even accessible any more

clever estuary
#

the new site has a huge censorship issue

civic flame
echo aurora
clever estuary
#

any slightly problematic words like kill would trigger the word filter

jade egret
#

Poll: Which is the best at math?
Winning answer: Gemini 2.5 pro (I/O edition) (13 of 24 votes)

civic flame
spare mango
echo aurora
misty vault
#

there is only preview

echo aurora
civic flame
#

iirc i suggested it ~a month ago when it was still a beta

tardy pasture
novel flame
#

Has anyone here evaluated OpenAI Codex versus Google Jules yet?

patent aspen
patent aspen
novel flame
patent aspen
#

Plus Gemini Pro is free for students if you use an edu email

patent aspen
#

Comes with NotebookLM, etc

unborn ocean
#

@patent aspen you so 100% either work at google / deepmind or your room looks like this 🤣

tall summit
#

rude

echo aurora
split kayak
#

On pc you can press Ctrl+F on website to filter by model name

still jetty
#

thank you for keeping the legacy site available

unborn ocean
#

idk, i just want him to actually tell me if he works there

#

or if he is just genuinely a fan

#

or nothing of the two

sweet tinsel
#

Are the anon models still only on the legacy UI? Can't seem to get them on lmarena.ai?

#

Well got some now.

#

Get a lot more anon models in the legacy UI.

spare mango
#

Is it better to let Gemini Pro remember my data in the long run? Will it be more intelligent, or will it be more bloated, slow, and dumber?

#

By "data" I mean, the information it gathers over the course of multiple chats.

balmy mist
#

the new arena is so nice

calm sequoia
#

Nice update! Can't wait for the Q&A 😋

echo aurora
patent bane
placid frigate
#

Why did you severely limit the context window in direct chat with the AI model? After interruption, when I write the text again, it gives an error
The Sonnet 4 and Opus 4 models freeze at the moment when they are "thinking" and do not even reach the output of the text

torn mantle
#

i wish a smaller model is used to filter out bad/inappropriate prompts instead of simple word/string/regex filters
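
A minimal sketch of why plain word/string/regex filters misfire, assuming a hypothetical one-word blocklist (the arena's real filter is not known here; `BLOCKLIST`, `substring_filter` and `word_filter` are made-up names for illustration):

```python
import re

# Hypothetical blocklist; the real filter's contents are an assumption.
BLOCKLIST = ["kill"]


def substring_filter(prompt: str) -> bool:
    """Flag a prompt if any blocked string appears anywhere in it."""
    lowered = prompt.lower()
    return any(word in lowered for word in BLOCKLIST)


# Whole-word variant: avoids matching inside longer words such as "skills".
WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)


def word_filter(prompt: str) -> bool:
    """Flag a prompt only if a blocked word appears as a whole word."""
    return WORD_RE.search(prompt) is not None


if __name__ == "__main__":
    for prompt in (
        "How do I improve my coding skills?",
        "How do I kill a zombie process on Linux?",
    ):
        print(prompt, substring_filter(prompt), word_filter(prompt))
```

The whole-word version stops flagging "skills" but still flags the harmless Linux question, which is the kind of context a classifier model could take into account and a pattern match cannot.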

tall summit
#

how small is "smaller", ya think?

elder burrow
#

but in any other context its going to be more bloated yeah, I'd start from new conversations after a few prompts

ocean vortex
elder burrow
ocean vortex
#

it's free to use and you can customize it how you want

mossy drum
#

New model in Beta Image arena: bagel (style of image quite resembles gpt-image-1)

ocean vortex
#

though I do wonder about their data privacy policy... could be meaningful if lmarena are sending all inputs/outputs to OpenAI for moderation LOL

torn mantle
small haven
#

day 41 without o3 pro

torn mantle
#

you are still counting 😭

small haven
#

yes until the day comes duh

#

o3 still says june 12 like yesterday, omg its so accurate

sturdy mica
#

bro

#

they really added the new ui

#

its so incomplete

#

atleast add max output tokens

#

sliders for temperature are so helpful

#

and they're gone

#

bro

misty vault
#

fr

#

But u can use the legacy ui

sturdy mica
#

legacy ui scks tho

#

and it is not gonna have new features anymore

#

they really need to add the temperature and max output token controls though

cerulean seal
#

then the arena can be abusable.

sturdy mica
#

telling them what

sturdy mica
#

how could it be used for abuse

#

the technology already exists, we just need to add it to the new ui

#

gradio built this in a cave! with a box of scraps!

civic flame
#

LOOL

sturdy mica
#

@misty vault telling them what

olive mesa
#

wtf

echo aurora
#

Reminder:

✅ No NSFW

balmy mist
#

bro its never coming

#

i gave up hope

cerulean seal
echo aurora
small haven
#

i wouldnt be surprised if it comes out with deepthink release

ocean vortex
#

to those who kindly ask after donating $250

small haven
#

only select users have it for safety testing

ocean vortex
#

They could have done what Anthropic did and never released it at all just gave the benchmark scores for it. So I guess it could be worse 👀

small haven
#

i just hope when deepthink is released, its not going to be heavily limited like veo 3

#

8 videos and ur done for the week, gg lol

zinc ore
#

Shouldn't be, heck, I kinda expect it to be cheaper than o3 still

unborn ocean
#

really don't have much time right now, but i built up some internal benches about the CoT prompt:
it seems to have an INSANE effect on performance for 2.5 flash, pushing its performance well above all the other models (that also have the same prompt)

#

should be pretty obvious from that they are actually just using the same model (for reasoning and normal)

(And qwq / llama maverick lost <5% because of rate limits)

civic flame
unborn ocean
elder rapids
#

and a lot of bugs currently

#

censorship weights, mobile bugs

jade egret
#

Poll: Which company do you think will achieve A.G.I first?
Winning answer: Google (14 of 19 votes)

small haven
#

oai >> google, still, 74% is due to recency bias

o4 internal model should easily dominate deepthink, there is still a gap, but narrowing

zinc ore
#

If you cover the entire AI space, Google is easily ahead. 2.5 pro is still right on o3's heels, although some argue it's overall better.

small haven
#

oh yea google is breadth heavy, but not in depth

zinc ore
#

They're also pivoting to world models and it'll be interesting to see what kind of performance improvements that brings

#

Haven't seen anything from openAI from that angle (streams of experience).

#

Google basically appears to match openAI in the LLM space, while being ahead everywhere else, while also showing off what they think is the next frontier of AI improvement (world models).

#

So I think that's natural why people have the perception they'll win. They also have the most compute, built transformers, and don't have a Nvidia tax/bottleneck but use specialized hardware and control their entire vertical stack.

#

Like, when you're forced to look at the entire picture holistically, Google starts to look like an increasingly promising bet in the space.

#

Today I learned they have more compute than Microsoft and Amazon combined.

elder rapids
#

if Google wanted to make an o3 or o4 model they can and probably do have one internally

#

there's no reason to serve such an intense model

#

it goes against basically everything they've been building in regards to efficiency and profit

#

but even outside of that, Trey said it all tbh

#

openAI simply isn't in the position to do that

#

it's a fundamental problem, not mechanical feasibility

patent aspen
#

Poll: How old are you?
Winning answer: < 24 (15 of 21 votes)

elder rapids
torn mantle
torn mantle
small haven
# elder rapids there's no reason to serve such an intense model

if google have an "o3" internally already; then why is it not being served and oai is hosting it at scale rn. im trying to not be biased here, but o3 is just such in a different league, beyond gemini 2.5 pro as of now. the efficiency/profit threshold release is bs, bc they have veo 3 and its certainly not cheap to serve, heck just look at their $250/mo plan. google has more money/data, yes, but it can only get u so far, just look at meta. i may be wrong at the end of the day, time will tell

elder rapids
elder rapids
# small haven if google have an "o3" internally already; then why is it not being served and o...

I already said why they're not serving it lmao. And it's not in a different league it underperforms in a lot of things compared to 2.5 pro.

veo 3 is an entirely different thing lmao, it inherently requires more compute and diversifies their AI resume, it's necessary

everything you're saying is as improbable as saying anthropic will be the one to AGI, you're simply choosing to say it's openAI, when everything points to Google having legitimate reasons to be both the strongest lab + the lab with the best research.

Meta isn't a good comparison, they have neither the infrastructure, the data scientists, the ML researchers, the scientific foundations, etc

#

when Google has the opposite, they have THE best

#

not just "one of the best"

#

crazy how you say some of the most nothing burger shi ever

#

logically speaking people have more incentive to work for Google

#

saying openAI doesn't even make any sense lmao

#

everybody wants to work for Google

#

that's the holy Grail dawg

#

😭

#

smarter in the sense we as an AI community define "smartness" sure, but better isn't the case

small haven
elder rapids
#

no it's not lol

small haven
elder rapids
#

veo 3 doesn't exemplify ANY price efficiency schedule in regards to AI

#

it isn't more cutting edge you're lying out of your ass

#

startup feel is bs

#

has nothing to do with how it operates and incentives

#

no it LITERALLY is

#

you can't justify that even a little bit

#

publicly traded means nothing

small haven
elder rapids
#

you literally have no idea how that works, that's COMPLETELY irrelevant to employee

cerulean seal
#

?

#

what's the argument about here

elder rapids
#

crazy how that's something I study, but with that, this is irrelevant to researchers financially

cerulean seal
#

is politics allowed (like saying is trump affecting AI?) <@&1349916362595635286>

elder rapids
leaden palm
#

well
✅ Avoid political and religious content. As a space that's inclusive to many different worldviews we ask to avoid topics related to politics and religion in order to maintain an inclusive space. It is okay to have discussion related to new policy or laws as long as it's related to AI.

#

it would be silly to ban all trump discussions

cerulean seal
#

well is trump affecting AI?

leaden palm
#

of course

elder rapids
#

you're shifting the goalpost

cerulean seal
#

its ggs

leaden palm
cerulean seal
elder rapids
cerulean seal
#

AI is going to get affected via collateral damage

elder rapids
#

prove that

#

yes you can lmao

#

this isn't unfalsifiable

#

you made the claim

#

what

#

😭

#

let's get our Gemini 2.5 pros to Duke it out

#

deadass

#

what's the claim

#

brobro

#

this is irrelevant btw, employee incentive is in discussion

#

that's legit irrelevant

#

none, DeepMind is already what initiated this entire thing

#

legit doesn't matter

#

no like, in no case

#

does it matter

#

in any way

#

there's no parts of a company that are legitimately stagnant if they're not unstable

#

that's a nothingburger

#

and not how it works

#

Google is too large to be stagnant

#

and still, Google is basically the only one truly "innovating"

#

they have the most distribution already

#

and even operating under the premise "employee incentive", Google pays twice as much

#

the bonus and RSU's are important, OpenAI's private equity options are speculative and meaningless, literally contradicts "incentive"

#

google provides annual cash bonuses + liquid GOOG RSU's that vest over 4 years

zinc ore
#

Above argument is kinda funny considering openAI has lost a bunch of their main researchers over the past year

#

One of the more bleeding companies in the talent space

zinc ore
#

The lead researcher on sora went to google

#

Co-lead technically

#

Ilya has his own company but is using Google TPUs

#

FB also lost a lot of their top researchers. Basically anthropic got a bunch of the openAI talent, or they went and formed their own companies.

Google got Noam back (huge deal tbh probably bigger than anything openAI has gotten).

#

Claude would be way more performant imo if they had similar compute that openAI has. Arguably they'd be in the lead, but I think it would end up being between them and Google if that were the case.

small haven
#

ur just talking about micro events that dont even matter to oai long term

#

mind u oai has 5k employees, deepmind has 2k if u wanna talk macros

elder rapids
#

AI isn't just LLMs

#

and that's LITERALLY the only thing OpenAI has

#

openAI doesn't have an alpha zero, openAI doesn't have an alphaevolve

#

openAI doesn't have the data, openAI doesn't have basically everything

small haven
zinc ore
#

Alpha proof and geometry or whatever it's called

#

Alphafold

elder rapids
zinc ore
#

Genie !

#

Genie (world model)

small haven
#

ok let me cherry pick

#

i/o

#

robotics

elder rapids
#

what are you cherry picking

small haven
#

unrelated things

zinc ore
#

The entire argument is cherry picking lol, we aren't having a particularly exhaustive conversation

#

Like 99% of convos in here is vague allusions as to why one company is better

zinc ore
#

Does that include Google brain, or is that the number pre-merger?

elder rapids
#

which doesn't necessarily invoke a deeper discussion, if one at all

small haven
#

source?

#

exactly

#

might as well say oai has 10k employees, im spicy

#

rooting google to get agi first (even if they do get it) is crazy to me, ppl sometimes

#

smart people actually do have a brain

#

and just because google has their own gpu equivalent hardware, doesn't mean anything, actually it should just mean research friction, more time integrating/debugging than actual research

umbral crypt
#

Insane arguments at 5 am

keen beacon
#

Ngl I'm getting increasingly convinced by the google propaganda in this channel

small haven
#

i mean..

#

people are recency loving creatures

#

tiktok brain

#

if oai releases o4 next week, i know for a fact everyone changes their perspective lol

umbral crypt
#

No way lmao openai is so bad

zinc ore
#

"I know for a fact" like

umbral crypt
#

Ting tong countries probably gonna rock this again

zinc ore
#

Nah, they gonna fall behind. But I hope I'm wrong and Deepseek goes bang bang on the competition

small haven
keen beacon
small haven
#

why am i not getting pinged, r u guys scared lol

keen beacon
#

ur typing tho should i ping u again?

zinc ore
# small haven ?

You're saying "you know for a fact" is just your personal assumptions about how performant it will be

small haven
#

dont be shy

#

enable it lol

umbral crypt
#

Blablabla

#

👍😂

small haven
zinc ore
small haven
zinc ore
#

We also don't know what the competition might have or drop when it releases

#

Yeah IDC about openAIs claims, proof is in the pudding, they gotta show it first

small haven
#

lol

#

in retrospect i can say the same thing with google, on face value they havent released anything substantial to actually compete against oai (when we talk about agi -- not gimmicky videos)

zinc ore
#

"better than most PHDs across most fields" I think is another claim they made for current o3

leaden palm
#

now that lm arena is shadcn themed instead of gradio themed should i update lmb too 🤔

keen fulcrum
#

Gemini diffusion is amazing!

calm sequoia
#

👀 Claude thinking is OP

#

On the other hand, comparable to o3-medium

cedar tide
#

New amazon model "folsom-exp-v1.5"

mossy drum
#

New model in Arena: stephen

late path
cedar tide
cedar tide
late path
cedar tide
cedar tide
late path
#

from language style

cedar tide
#

New model : "X-preview" from baidu

#

New model : glm-4-air-250414

frosty lark
#

does the arena work now? I get only errors

keen ferry
#

claude 4 opus is so easy to jailbreak lol

cedar tide
#

new deepseek r1 making discord clone

drifting thorn
#

wow

fleet lintel
cedar tide
#

there are glm 4 plus on the leaderboard

cedar tide
fleet lintel
#

Wow.. huge diff

keen fulcrum
#

Meet Opera Neon, a browser for the agentic web

Opera Neon can browse with you or for you, take action & help you get things done.

Our playground to redefine what a browser can be.

🧩 Invite only. Sign up now: opr.as/f4190e

drifting thorn
#

wow, new R1

torn mantle
#

but why did they call it a minor upgrade

cedar tide
torn mantle
#

ive heard the reasoning of the new r1 is much better

#

@civic flame whats ur take

#

hmm

cedar tide
torn mantle
#

yes

#

you will do it?

#

david?

#

🥺

cedar tide
torn mantle
#

wbu?

calm sequoia
#

Last week I've met a lot of people who use 4o for coding. I thought they are midwits, but maybe the lmarena leaderboard is right 👀

torn mantle
#

4o

#

4o

#

ppl

#

but

drifting thorn
#

It seems that we had slept on GPT 4.5

#

And maybe 4.5 is undertrained

#

Claude 4 Opus, on the other hand, is fully trained

keen beacon
#

maybe tbh

drifting thorn
#

And I would like to see a race between GPT 4.5, Claude 4 Opus and Llama 4 Behemoth

keen beacon
#

i recall reading the original simpleqa paper, they dont score that well

#

claude models on simpleqa

drifting thorn
#

I mean what if 4.5 is constantly upgraded just like 4o

keen beacon
#

its too large

#

the pricing though

#

lmao

cedar tide
fleet lintel
#

you keep saying that and you keep decreasing your credibility

#

grok is just 🤢

drifting thorn
#

Where's 3.5???