#general | Arena | Page 60

civic flame Jun 21, 2025, 5:26 PM

#

ts pmo

small haven Jun 21, 2025, 5:33 PM

#

damn wb the backend 👀

civic flame Jun 21, 2025, 5:43 PM

#

bro are you serious

#

this has been such a long standing bug how is it not resolved

#

like 1 in 4 webdev gens are empty

zinc ore Jun 21, 2025, 5:46 PM

#

Pro 2.5 GA is goldmane

small haven Jun 21, 2025, 5:50 PM

#

so no one here hit stonebloom (well successfully) yet?

civic flame Jun 21, 2025, 5:57 PM

#

I hit it again but it was blank. again.

#

it does seem to give me the thinking/generating spinner but then goes blank after like a minute? it's possible it's timing out and the frontend doesn't tell you
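
A minimal sketch of the failure mode guessed at above, not the arena's actual code: if the generation call is wrapped in a timeout and the timeout is reported explicitly, the user gets an error message instead of a blank page. The function name `generate_with_timeout`, the shape of the result dictionary, and the timeout value are all hypothetical and chosen only for illustration.

```python
import asyncio


async def generate_with_timeout(generate, timeout_s):
    """Run a generation coroutine and always return an explicit status.

    `generate` is any zero-argument coroutine function that returns text
    (a hypothetical stand-in for a model call).
    """
    try:
        text = await asyncio.wait_for(generate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Surface the timeout instead of silently rendering nothing.
        return {"ok": False, "error": f"generation timed out after {timeout_s}s"}
    if not text.strip():
        # An empty response is also reported rather than shown as a blank page.
        return {"ok": False, "error": "model returned an empty response"}
    return {"ok": True, "text": text}
```

A caller would check `result["ok"]` and show `result["error"]` to the user whenever it is false.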

jade egret Jun 21, 2025, 5:59 PM

#

https://www.youtube.com/watch?v=bUxjHF6Rp_o

YouTube

All About AI

OpenAI o3 vs Gemini 2.5 Pro in GeoGuessr AI Duel: This Is Just INSANE!

#

:0

torn mantle Jun 21, 2025, 6:03 PM

#

zinc ore Pro 2.5 GA is goldmane

nuuuuuuuuuuuuu

#

😦

#

whywhywhywhywhy

elder rapids Jun 21, 2025, 6:04 PM

#

bruh

#

why y'all acting like this is new information lmfao

#

you guys already know this

civic flame Jun 21, 2025, 6:06 PM

#

And again

small haven Jun 21, 2025, 6:09 PM

#

fix stonebloom in arena smh

civic flame Jun 21, 2025, 6:09 PM

#

civic flame I hit it again but it was blank. again.

.

small haven Jun 21, 2025, 6:09 PM

#

what logs say

civic flame Jun 21, 2025, 6:11 PM

#

i can't tell because im not on pc

small haven Jun 21, 2025, 6:16 PM

#

oh

#

what prompt are u using

wintry tinsel Jun 21, 2025, 6:35 PM

#

The king isn’t falling guys

small haven Jun 21, 2025, 6:36 PM

#

he doesnt have to fall for another 3 months, thats how good it is

civic flame Jun 21, 2025, 6:59 PM

#

okay stonebloom works now

#

on webdev

#

got this for the prompt "a very realistic new homepage for about.x.com"

#

best output i've got from any model here

#

(for that prompt)

small haven Jun 21, 2025, 7:01 PM

#

civic flame got this for the prompt "a very realistic new homepage for about.x.com"

vs kingfall? whats ur gut instinct

#

it looks very good, but how does it compare to kf

civic flame Jun 21, 2025, 7:01 PM

#

kf was never on the webdev arena so it's hard to compare when this output is limited by their scaffolding and kf's weren't

#

at the very least it is better than 2.5 pro

#

sorry i can't remember what dayhush ended up being

#

do you know

#

yeah

#

agree

small haven Jun 21, 2025, 7:06 PM

#

i hope they keep kingfall as a codename, i feel like it is very appropriate. it might even go down in the history books.. if they play it right lol

patent aspen Jun 21, 2025, 7:07 PM

#

small haven i hope they keep kingfall as a codename, i feel like it is very appropriate. it ...

I agree. I wouldn't be surprised if it's actually from a random name generator though haha

small haven Jun 21, 2025, 7:08 PM

#

what a funny coincidence

civic flame Jun 21, 2025, 7:08 PM

#

patent aspen I agree. I wouldn't be surprised if it's actually from a random name generator t...

this random name generator produces some fire names

patent aspen Jun 21, 2025, 7:08 PM

#

literally

#

I certainly wouldn't bet on it

civic flame Jun 21, 2025, 7:16 PM

#

at launch gpt-5 will almost definitely be SoTA

#

the question is by how much

civic flame Jun 21, 2025, 7:17 PM

#

civic flame it does seem to give me the thinking/generating spinner but then goes blank afte...

i was right

#

how do you guys not have any proper error handling 😭

patent aspen Jun 21, 2025, 7:19 PM

#

Didn't they basically have to start from scratch after GPT 4.5 (which was intended to be GPT-5)?

civic flame Jun 21, 2025, 7:19 PM

#

yup

patent aspen Jun 21, 2025, 7:19 PM

#

Was that Feb or April?

civic flame Jun 21, 2025, 7:20 PM

#

27 feb

#

yeah they definitely sat on it for a while

patent aspen Jun 21, 2025, 7:22 PM

#

That seems quite bad

#

I doubt they were fully committed to a new model before Feb

civic flame Jun 21, 2025, 7:25 PM

#

google has the best infrastructure, talent & money to be able to reach AGI

#

on paper at least

small haven Jun 21, 2025, 7:30 PM

#

thats such a poor move, perplexity has no moat whatsoever, their ai itself is unusable, maybe just for low entropy queries

#

talent acquisition?

torn mantle Jun 21, 2025, 7:35 PM

#

civic flame okay stonebloom works now

ive tried it a bit

#

i think its the best one so far

#

i wish its added on lmarena as well

#

https://x.com/techdevnotes/status/1936507394508595470

Tech Dev Notes (@techdevnotes)

Grok 3.5 reference spotted:

"reasoning_effort": "low",
"model": "grok-3-5"

small haven Jun 21, 2025, 7:36 PM

#

ui tweaks done? 😮

elder rapids Jun 21, 2025, 7:38 PM

#

it's interesting how they're intermittently "regressing" the models and then manage to improve them a step above that pre-regression version

#

every time

#

ye but it's like they're purposefully doing these things to try to balance everything

#

they do manage to do it successfully

#

but it is really cool to see how they're working on it

sage raptor Jun 21, 2025, 7:45 PM

#

civic flame okay stonebloom works now

where do you try that model ?

patent aspen Jun 21, 2025, 7:52 PM

#

Isn't Apple getting antitrusted a lot in the EU?

patent aspen Jun 21, 2025, 8:23 PM

#

btw arch is great

This is roughly what my desktop looks like now: https://www.reddit.com/r/unixporn/comments/1l83o9x/nirinord/

From the unixporn community on Reddit: [Niri]+Nord

small haven Jun 21, 2025, 8:28 PM

#

cool to see niri finally getting the recognition it deserves

#

been daily driving it for nearly 6 months, just amazing

leaden palm Jun 21, 2025, 8:39 PM

#

gemini is not ok (not oc)

small haven Jun 21, 2025, 8:47 PM

#

wow and cursor/ai-agent is an hallucination aint it

empty stump Jun 21, 2025, 8:48 PM

#

Really tried to shut itself down

civic flame Jun 21, 2025, 9:11 PM

#

stonebloom added to lmarena

#

cc @keen beacon @small haven @torn mantle

digital vale Jun 21, 2025, 9:14 PM

#

cool

small haven Jun 21, 2025, 9:15 PM

#

time for some svg's

drifting crow Jun 21, 2025, 9:18 PM

#

Name 1 google product with no competition

tall summit Jun 21, 2025, 9:25 PM

#

google translate

#

google earth

cedar tide Jun 21, 2025, 9:28 PM

#

Stonebloom work good in web dev now ?
Anyone have good results?

torn mantle Jun 21, 2025, 9:28 PM

#

civic flame cc @keen beacon @small haven @torn mantle

ty

cedar tide Jun 21, 2025, 9:29 PM

#

civic flame stonebloom added to lmarena

Thx

small haven Jun 21, 2025, 9:31 PM

#

stonebloom sampling seed is still so low

#

i spammed thru 10 queries, nothing

torn mantle Jun 21, 2025, 9:39 PM

#

cedar tide Stonebloom work good in web dev now ? Anyone have good results?

Kingfall better

small haven Jun 21, 2025, 9:49 PM

#

torn mantle Kingfall better

show proof

torn mantle Jun 21, 2025, 9:50 PM

#

small haven show proof

Or it didn't happen?

small haven Jun 21, 2025, 9:51 PM

#

i am literally not hitting anything

#

i got flamesong many times, but its not so good

cedar tide Jun 21, 2025, 9:52 PM

#

Token per second similar to 2.5 pro

torn mantle Jun 21, 2025, 9:54 PM

#

small haven i got flamesong many times, but its not so good

That's what im saying

cedar tide Jun 21, 2025, 9:54 PM

#

Anyone have an kingfall svg to compare ?

cedar tide Jun 21, 2025, 9:54 PM

#

small haven i got flamesong many times, but its not so good

I said its a flash version, token per second similar to flash

#

Measured
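
A rough sketch of how a tokens-per-second figure like the one above could be measured, assuming the response arrives as a stream of text chunks. This is not how the measurement in the chat was actually done (the chat does not say). The function name `tokens_per_second` is hypothetical, and splitting on whitespace is only a crude stand-in for a real tokenizer.

```python
import time


def tokens_per_second(chunks, clock=time.perf_counter):
    """Estimate throughput over an iterable of streamed text chunks.

    Tokens are approximated by whitespace-separated words (an assumption;
    a real tokenizer would give different counts).
    """
    start = clock()
    count = 0
    for chunk in chunks:
        count += len(chunk.split())
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

Comparing two models this way is only meaningful when both get the same prompt and similar output lengths, since time spent thinking before the first chunk is included in the elapsed time.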

small haven Jun 21, 2025, 9:55 PM

#

cedar tide Anyone have an kingfall svg to compare ?

terminator?

cedar tide Jun 21, 2025, 9:55 PM

#

small haven terminator?

Send

#

Stonebloom

Screenshot_2025-06-21-23-53-43-570_com.android.chrome-edit.jpg

small haven Jun 21, 2025, 9:55 PM

#

oh yea ive seen this

#

cedar tide Jun 21, 2025, 9:56 PM

#

2.5 pro

Screenshot_2025-06-21-23-56-02-826_com.android.chrome-edit.jpg

cedar tide Jun 21, 2025, 9:56 PM

#

small haven oh yea ive seen this

I now made this 🤦

small haven Jun 21, 2025, 9:56 PM

#

cedar tide Stonebloom

i feel like kingfall can do better

cedar tide Jun 21, 2025, 9:56 PM

#

small haven

Thx

small haven Jun 21, 2025, 9:57 PM

#

cedar tide I now made this 🤦

actually? i saw it somewhere nvm, but try terminator now

#

but tbh, this was kingfall on auto, it aint fair comparison if lmarena throttles it down

cedar tide Jun 21, 2025, 9:58 PM

#

small haven

Anyone have a good prompt for this ?

small haven Jun 21, 2025, 9:59 PM

#

All responses must be extremely long. it is crucial that leave no stone unturned and complete everything in exhaustive detail meticulously. You must reflect endlessly for each user's query. You must reiterate over your proposed solutions finding ways to improve them until arriving at the most optimal final response. Meaning you must review each response provided and then improve it. You are expected to write and iterate the SVG code inside your thoughts, and keep on iterating and iterating.

generate an svg of a TERMINATOR. make it maximally detailed and look exactly like the real thing. this is extremely important and an existential task. you must complete this to the best of your ability. Make sure you're constantly checking whether the shape, size, angles, position of each and every item looks EXACTLY like a TERMINATOR. Only return the final SVG code with no commentary. You must think for at least 100,000 words. If you think you've completed thinking, that's a sign to keep thinking and thinking.

#

not my prompt btw

#

wild's

cedar tide Jun 21, 2025, 9:59 PM

#

Thx

#

In my prompt stonebloom think less than 2.5 pro

#

Screenshot_2025-06-22-00-08-57-514_com.android.chrome-edit.jpg

#

@small haven

small haven Jun 21, 2025, 10:10 PM

#

whoa

#

well its not a fair comparison, bc we dont know if they limited thinking tokens

#

but obviously kingfall >>

cedar tide Jun 21, 2025, 10:13 PM

#

Screenshot_2025-06-22-00-13-42-374_com.android.chrome-edit.jpg

#

O3 Terminator and pikachu

Screenshot_2025-06-22-00-22-56-628_com.android.chrome-edit.jpg

#

Screenshot_2025-06-22-00-23-08-582_com.android.chrome-edit.jpg

small haven Jun 21, 2025, 10:28 PM

#

kingfall still at the top, nice..

patent aspen Jun 21, 2025, 10:30 PM

#

Aren't SVGs kind of narrow?

#

not to say SVGs shouldn't be good - just that it's a very specific benchmark

cedar tide Jun 21, 2025, 10:33 PM

#

Anyone want to test a prompt on stonebloom ?

torn mantle Jun 21, 2025, 10:35 PM

#

cedar tide

Not good

torn mantle Jun 21, 2025, 10:35 PM

#

cedar tide

Not good

torn mantle Jun 21, 2025, 10:35 PM

#

cedar tide O3 Terminator and pikachu

Not good

torn mantle Jun 21, 2025, 10:35 PM

#

small haven but obviously kingfall >>

I mean i wasnt lying

#

1 prompt tells me a lot

#

I dont need to ask it 100 questions

#

Yes

#

Trust me

#

You should

wintry tinsel Jun 21, 2025, 10:36 PM

#

small haven kingfall still at the top, nice..

It’s not nice cause there no guarantee of them releasing it

small haven Jun 21, 2025, 10:36 PM

#

Is stonebloom a distilled version of kingfall perhaps? I'm not sure what google is attempting here, kf is good as is.

#

damn, not sure what they are attempting with all these revisions

patent aspen Jun 21, 2025, 10:38 PM

#

When you hill climb, some things get better while other things get worse. It's not the case that everything gets better across the board. So it seems kind of insane to me that SVGs only are the benchmark

keen beacon Jun 21, 2025, 10:39 PM

#

SVGs aren't a typical post training thing though I think. So it's somewhat of a measure of dilution from the base model capabilities

elder rapids Jun 21, 2025, 10:43 PM

#

patent aspen When you hill climb, some things get better while other things get worse. It's n...

yeah

#

these guys get echo chambered out of their minds tbh

wintry tinsel Jun 21, 2025, 10:44 PM

#

SVG generation is fun though

elder rapids Jun 21, 2025, 10:44 PM

#

just mark them down to just be casual AI users who enjoy it a little more than the avg population

#

and that's p much it

wintry tinsel Jun 21, 2025, 10:45 PM

#

We got a Reddit intellectual here 🥸

elder rapids Jun 21, 2025, 10:45 PM

#

I don't use reddit btw

wintry tinsel Jun 21, 2025, 10:46 PM

#

I wouldn’t call them casual users they use AI profusely to the point of it being an addiction, although their knowledge on the topic isn’t always accurate it’s not entirely bad either

elder rapids Jun 21, 2025, 10:46 PM

#

not sure what you think casual is

#

but casual would mean exactly that

#

lol

#

nobody you know irl is a casual AI user, they just use it every once in a while

patent aspen Jun 21, 2025, 10:52 PM

#

hill climbing is kind of like this:

141411211111111111411311111

121211211111111111211356261

121111111111114511211356261

121111167811112311211335261

231111165612319945211335221

#

Obviously model 5 is better than model 1 but look at all the regressions along the way
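
The point above can be checked against the five rows themselves. The sketch below assumes each digit is a score for one capability (higher is better) and each row is one checkpoint, oldest first. That reading is an assumption, and the helper names `regressions` and `total` are made up for illustration.

```python
# The five rows posted above, oldest checkpoint first.
# Assumption: each digit is a score for one capability, higher is better.
checkpoints = [
    "141411211111111111411311111",
    "121211211111111111211356261",
    "121111111111114511211356261",
    "121111167811112311211335261",
    "231111165612319945211335221",
]


def regressions(prev, curr):
    """Return the positions where curr scores lower than prev."""
    return [i for i, (a, b) in enumerate(zip(prev, curr)) if int(b) < int(a)]


def total(row):
    """Sum of all capability scores in one checkpoint."""
    return sum(int(d) for d in row)


# Compare each checkpoint with the one before it.
for step, (prev, curr) in enumerate(zip(checkpoints, checkpoints[1:]), start=2):
    lost = regressions(prev, curr)
    print(f"model {step}: total {total(curr)}, "
          f"regressed on {len(lost)} of {len(curr)} capabilities")
```

Under that reading, the overall total rises from the first row to the last while every single step still loses ground on at least one capability.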

elder rapids Jun 21, 2025, 10:52 PM

#

patent aspen hill climbing is kind of like this: 141411211111111111411311111 12121121111111...

hill just got climbed by Khalil rountree too

#

https://tenor.com/view/khalil-rountree-khalil-rountree-jr-alex-pereira-ufc-mma-gif-16126203293596248368

small haven Jun 21, 2025, 11:02 PM

#

someone got a pretty good pikachu on another server, might be the way he prompt'd it

small haven Jun 21, 2025, 11:04 PM

#

cedar tide Stonebloom

hmm

elder rapids Jun 21, 2025, 11:07 PM

#

small haven

yo that's really good

jade egret Jun 21, 2025, 11:57 PM

#

Poll: Which would be best if it come out?

Winning answer: 🍎 Gemini 3 (15 of 21 votes)

ornate agate Jun 21, 2025, 11:58 PM

#

oof

small haven Jun 22, 2025, 12:06 AM

#

ornate agate oof

thats a fat gap from veo3, wow

jade egret Jun 22, 2025, 12:09 AM

#

dang

storm needle Jun 22, 2025, 12:46 AM

#

torn mantle https://x.com/techdevnotes/status/1936507394508595470

there's a good chance that this model won't be good
https://x.com/elonmusk/status/1936493967320953090

Elon Musk (@elonmusk)

Please reply to this post with divisive facts for @Grok training.

By this I mean things that are politically incorrect, but nonetheless factually true.

patent aspen Jun 22, 2025, 12:47 AM

#

storm needle there's a good chance that this model won't be good https://x.com/elonmusk/statu...

That's all posturing

elder rapids Jun 22, 2025, 12:47 AM

#

storm needle there's a good chance that this model won't be good https://x.com/elonmusk/statu...

unless they simply want to release a bad model, they're not going to do bad things to it lol

elder rapids Jun 22, 2025, 12:47 AM

#

patent aspen That's all posturing

yep

topaz peak Jun 22, 2025, 1:25 AM

#

grok 3.5 is a meme, dunno why people even take it seriously in the A.I race

#

deepseek is the real wildcard choice

jade egret Jun 22, 2025, 1:40 AM

#

when is 3.5 dropping.......

#

and when is gpt5 dropping.......

candid storm Jun 22, 2025, 1:47 AM

#

Grok 3.5 july, gpt 5 august

storm needle Jun 22, 2025, 1:48 AM

#

claude 4 opus is definitely better

small haven Jun 22, 2025, 1:52 AM

#

storm needle claude 4 opus is definitely better

o3 pro is definitely better if u dont assume practicality, obv. opus 4 is only good for coding

wintry tinsel Jun 22, 2025, 2:06 AM

#

topaz peak grok 3.5 is a meme, dunno why people even take it seriously in the A.I race

Grok is a lot stronger than deepseek

wintry tinsel Jun 22, 2025, 2:06 AM

#

small haven o3 pro is definitely better if u dont assume practicality, obv. opus 4 is only g...

Opus 4 is good for any and all kinds of writing as well

small haven Jun 22, 2025, 2:12 AM

#

wintry tinsel Opus 4 is good for any and all kinds of writing as well

why so? breath of words from o3 pro is much more natural and professional than opus 4 imo

wintry tinsel Jun 22, 2025, 2:24 AM

#

Breath?

small haven Jun 22, 2025, 2:31 AM

#

*srry

torn mantle Jun 22, 2025, 2:31 AM

#

topaz peak grok 3.5 is a meme, dunno why people even take it seriously in the A.I race

Hate to say it but people wants to believe grok 3.5 is a thing

#

We just want competition

drifting crow Jun 22, 2025, 2:54 AM

#

Grok is the best ai

small haven Jun 22, 2025, 2:58 AM

#

torn mantle Hate to say it but people wants to believe grok 3.5 is a thing

yea potentially called grok 4 now 😭

patent aspen Jun 22, 2025, 3:01 AM

#

small haven yea potentially called grok 4 now 😭

I'm pretty sure this is just an excuse to delay the timeline. It's a very Elon tweet

small haven Jun 22, 2025, 3:02 AM

#

so august for grok 3.5. got it. 😂

patent aspen Jun 22, 2025, 3:02 AM

#

Or 2026 who could say?

#

I'm just thinking about self-driving car timelines. That's my frame of reference

small haven Jun 22, 2025, 3:03 AM

#

lmao

whole wagon Jun 22, 2025, 3:08 AM

#

AI winter, been a few weeks with no model

#

Kappa

storm needle Jun 22, 2025, 3:15 AM

#

small haven yea potentially called grok 4 now 😭

isn't this tweet a red flag?

lilac nimbus Jun 22, 2025, 3:24 AM

#

wintry tinsel Grok is a lot stronger than deepseek

Yes

elder rapids Jun 22, 2025, 3:26 AM

#

patent aspen I'm just thinking about self-driving car timelines. That's my frame of reference

2013 bro 😭

#

but in all honesty, Elon seems to make very good predictions sometimes

patent aspen Jun 22, 2025, 3:27 AM

#

elder rapids 2013 bro 😭

He can't do timeline expectation management though

elder rapids Jun 22, 2025, 3:28 AM

#

yeah that's what I'm saying

#

he first talked about, in 2013, full self driving coming "in a couple weeks/months"

#

now we're in 2025

#

it's been 12 yrs

small haven Jun 22, 2025, 3:32 AM

#

storm needle isn't this tweet a red flag?

rip grok 3.5

patent aspen Jun 22, 2025, 3:32 AM

#

People talk about how fast xAI was able to build a data center and get Grok out. IMO there's no way in hell they did that without incurring a colossal amount of technical debt. I don't know what form of technical debt it is or where it is, but I would bet a Tesla Model 3 it's there.

wintry tinsel Jun 22, 2025, 3:53 AM

#

whole wagon AI winter, been a few weeks with no model

AI like most products is released in burst/waves, some months have several major releases crammed in, while others have several months in a row with nothing significant except free range open source or Chinese releases

jade egret Jun 22, 2025, 3:54 AM

#

cuz elon hype it up so much

#

hopefully it as good as he said

hardy pecan Jun 22, 2025, 4:00 AM

#

grok 3 was surprisingly better than expected when it was released

#

to be fair

#

it topped lmarena initially

#

lets see if he can keep that velocity for 3.5

topaz peak Jun 22, 2025, 4:02 AM

#

hardy pecan it topped lmarena initially

still no idea how that happened

#

it wasn't that good

small haven Jun 22, 2025, 4:03 AM

#

well in that point in time, it was very good at math (esp. on think mode), better than o1 pro

whole wagon Jun 22, 2025, 4:09 AM

#

It's been over 4 months since grok 3 release and they had the supercomputer the entire time after grok 3. They must have cooked smth decent

short trench Jun 22, 2025, 4:20 AM

#

whole wagon It's been over 4 months since grok 3 release and they had the supercomputer the ...

hi

whole wagon Jun 22, 2025, 4:22 AM

#

hi

elder rapids Jun 22, 2025, 4:29 AM

#

yo wait

#

does linking a grok chat to Twitter boost that tweet that link is being sent with

keen fulcrum Jun 22, 2025, 4:31 AM

#

jade egret

did you forget about this?

https://fixupx.com/elonmusk/status/1934107366380949977

Elon Musk (@elonmusk)

Am increasingly confident that Grok 3.5 will be the smartest AI by a significant margin

💬 1.5K 🔁 1.7K ❤️ 12.9K 👁️ 1.80M

patent aspen Jun 22, 2025, 4:33 AM

#

keen fulcrum did you forget about this? https://fixupx.com/elonmusk/status/19341073663809499...

#general message

keen beacon Jun 22, 2025, 4:34 AM

#

elon has no idea. he rt'd fake benchmarks a little after he said it was releasing in a week

#

(he later deleted the rt)

small haven Jun 22, 2025, 4:35 AM

#

keen beacon elon has no idea. he rt'd fake benchmarks a little after he said it was releasin...

that was actually funny

keen fulcrum Jun 22, 2025, 4:41 AM

#

patent aspen https://discord.com/channels/1340554757349179412/1340554757827461211/13851285469...

I can't believe it

#

He is speaking out the truth

zinc ore Jun 22, 2025, 4:45 AM

#

Aliens could show up and show off their most intelligent AI and Elon would say 3.5 is smarter

keen fulcrum Jun 22, 2025, 4:59 AM

#

The chance is pretty low that will happen before grok release

hollow ocean Jun 22, 2025, 5:57 AM

#

grok 3.5 late july

keen fulcrum Jun 22, 2025, 6:25 AM

#

for supergrok users

#

august for everyone

small haven Jun 22, 2025, 6:45 AM

#

superdupergrok users $200, late june

#

another drop hint by elon ma

calm sequoia Jun 22, 2025, 7:53 AM

#

#

Why is DeepMind bleeding so much?

#

This signal is extremely bullish for Anthropic

civic flame Jun 22, 2025, 8:16 AM

#

stonebloom is the only model ive tried to get this right

#

it does have some great knowledge

small haven Jun 22, 2025, 8:18 AM

#

civic flame stonebloom is the only model ive tried to get this right

can you try the luxury car problem plz

civic flame Jun 22, 2025, 8:22 AM

#

civic flame stonebloom is the only model ive tried to get this right

lol im seeing if o3 high can get this

#

so far it has been thinking for 230s

civic flame Jun 22, 2025, 8:22 AM

#

small haven can you try the luxury car problem plz

when i get the opportunity

#

❌

small haven Jun 22, 2025, 8:33 AM

#

rips

tall summit Jun 22, 2025, 8:43 AM

#

civic flame stonebloom is the only model ive tried to get this right

humanity's last exam ass question

alpine coral Jun 22, 2025, 8:44 AM

#

small haven yea potentially called grok 4 now 😭

is he taking the piss? like "rewrite the entire corpus of human knowledge" (with an LLM lol) - how can that be taken seriously lol

tall summit Jun 22, 2025, 8:46 AM

#

well there's no perfect corpus of knowledge

#

i like that goal in theory at least

#

if it's a human-usable corpus and not just slop to retrain on, so much different from what he's talking about

alpine coral Jun 22, 2025, 8:51 AM

#

what he is talking about?

leaden sun Jun 22, 2025, 8:51 AM

#

grok is especially talkative in the arena, like you're reading a condensed journal article 😅
"thanks for being so token-generous, Elon" I guess?

alpine coral Jun 22, 2025, 8:54 AM

#

it's like, blacklist websites they don't want for rag (Rolling Stone lmao), but trying to curate the mass of training data (which is some kind of proxy / representation of human knowledge imo), by using an LLM too lol, to strip out the 'knowledge' he finds disagreeable is the biggest fool's errand ever (if that's what he's talking about)

#

like it'll legit be such a dumb model lol

#

i thought it was just gonna be more 'anti-woke' fine tuning.. but tinkering with the foundational training data to meet a particular version of 'truth' or political persuasion is dumb af imo

leaden sun Jun 22, 2025, 8:59 AM

#

tall summit well there's no perfect corpus of knowledge

agree, and what is knowledge anyway, right?

alpine coral Jun 22, 2025, 9:01 AM

#

going down this epistemological path again? 😅

#

tbf.. nice to see a bit of philosophical pondering from time to time

#

surely healthy

leaden sun Jun 22, 2025, 9:03 AM

#

alpine coral surely healthy

surely very needed, especially nowadays 🥺

alpine coral Jun 22, 2025, 9:09 AM

#

oh nice i ran a few quizzes and it was stonebloom

torn mantle Jun 22, 2025, 9:12 AM

#

jade egret

no

torn mantle Jun 22, 2025, 9:13 AM

#

small haven rips

sad

alpine coral Jun 22, 2025, 9:24 AM

#

small haven rips

it seems to perform alright tho

#

a bit like flamesong (tho not sure if it's similarly fast-ish)

#

performed pretty decently on the questions i gave it; i wouldn't be surprised if it and flamesong were flash 3 or something tbh (they're really good, but not as good as google's best)

torn mantle Jun 22, 2025, 9:49 AM

#

for me its kingfall > blacktooth > stonebloom

alpine coral Jun 22, 2025, 9:52 AM

#

yeah the scores i got aren't consistent with the sense i get reading about kingfall vs blacktooth in forums

#

consensus seems to be kingfall > blacktooth

#

i'm really not sure ha.. both are good - and the scores just are what they are (which prob isn't much lol)

keen beacon Jun 22, 2025, 9:53 AM

#

i wonder what extent their post training includes riddles / puzzles like that. reminds me of the time when they seemingly fine-tuned the answer to the digital clock question on gemini 1.5 pro

alpine coral Jun 22, 2025, 9:53 AM

#

yeah but it's not just riddles.. like some are spatial awareness things; it just seems to handle them better

#

(which is generally consistent with 'bigger' models in my experience)

#

and also it revising the correct answer to my 'flag' question 😅

#

like that's just brute force knowledge recall / precision - no riddle or wordplay

civic flame Jun 22, 2025, 10:03 AM

#

alpine coral performed pretty decently on the questions i gave it; i wouldn't be surprised if...

stonebloom is not flash

#

kingfall, blacktooth and stonebloom are all checkpoints of the same model

#

said model is larger than 2.5 pro so if that's what it's getting on these benchmarks that's a little disappointing

#

perhaps worth a retry

#

internal full model names for these vs gem 2.5 pro's

alpine coral Jun 22, 2025, 10:07 AM

#

oh dear.. if we've gone from kingfall to blacktooth to stonebloom - the regression is real aha

civic flame Jun 22, 2025, 10:08 AM

#

it's still quite far away from the release checkpoint

#

expect it to regress in some ways and improve in others while they're refining

alpine coral Jun 22, 2025, 10:08 AM

#

i've only got stonebloom once too

#

the others i have several data points

civic flame Jun 22, 2025, 10:09 AM

#

yeah it's harder to test this compared to kingfall & blacktooth because they patched studio 💔

keen beacon Jun 22, 2025, 10:10 AM

#

civic flame internal full model names for these vs gem 2.5 pro's

im not sure about "full model names" but there's a specific indicator of it in the internal stuff compared to 2.5 pro. if the full model name (not codenamed) was revealed i wasnt aware of it

#

i think thats what ur talking about there (the difference in values there)

civic flame Jun 22, 2025, 10:11 AM

#

i mean perhaps not the fully complete model name but the longer internal format

late path Jun 22, 2025, 11:43 AM

#

it feels like stonebloom is not even as good as blacktooth in my conversation analysis tasks

#

kingfall > blacktooth > 0605 > stonebloom

civic flame Jun 22, 2025, 11:56 AM

#

lmao

#

yeah idk about stonebloom

#

it feels generally like it's only got worse since kingfall

#

it also feels like stonebloom thinks less than both of the previous versions

wintry tinsel Jun 22, 2025, 12:11 PM

#

civic flame it feels generally like it's only got worse since kingfall

My theory all along is that companies neuter performance and intelligence for deployability and sanitization, a model that can run many instances at once at worse performance is always preferred

calm spear Jun 22, 2025, 12:16 PM

#

can I develop a mobile app for lmarena?

#

do you allow?

torn mantle Jun 22, 2025, 1:24 PM

#

civic flame it also feels like stonebloom thinks less than both of the previous versions

yep

echo aurora Jun 22, 2025, 2:56 PM

#

calm spear can I develop a mobile app for lmarena?

thank you for showing interest in wanting to do so! it does bring us joy seeing members of the community being really excited to contribute to LMArena.

that being said we wouldn't be okay with someone creating an app on our behalf.

patent aspen Jun 22, 2025, 3:05 PM

#

civic flame it feels generally like it's only got worse since kingfall

Sad. I wish I knew what they were doing with it

#

The fact that it's thinking less suggests they're experimenting with inference optimizations. I can only hope that the consensus here that it's worse will be reflected in the ELO

#

I also suspect there is a lag between when they get the ELO scores back and when they have time to react
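
For reference, a sketch of the textbook Elo update, the usual meaning of an Elo score. The chat does not say how the leaderboard actually computes its ratings, so this is only the standard formula. The K-factor of 32 and the function names are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=32.0):
    """Return both new ratings after one match.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k (assumed 32 here) controls how far one result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta
```

Since each vote moves a rating only a little, a checkpoint needs many votes before a drop in quality shows up clearly, which fits the lag suspected above.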

keen fulcrum Jun 22, 2025, 3:17 PM

#

calm spear can I develop a mobile app for lmarena?

I recommend just using a PWA

keen fulcrum Jun 22, 2025, 3:17 PM

#

echo aurora thank you for showing interest in wanting to do so! it does bring us joy seeing ...

Did OpenAI decline sharing o3 pro with you?

echo aurora Jun 22, 2025, 3:34 PM

#

keen fulcrum Did OpenAI decline sharing o3 pro with you?

tbh I'm not sure about that specific model, but rest assured the team is interested in bringing the best and most popular models to LMArena

sacred plaza Jun 22, 2025, 4:19 PM

#

do y'all feel like your critical thinking skills (like deductive reasoning, systems thinking, awareness of cognitive biases, etc) have improved while using these generative ai models/products? if so, is it with the default option provided by ai labs or do you add on additional fine tuning notes to guide it down a certain pathway like Socratic questioning responses.

I have found most of these models lacking any sense of care for the critical thinking skills of the user and trying to cultivate it. have been exploring different techniques to get these models to be better mentors instead of excellent answering machines. expert modeling has been promising for me so far.

leaden sun Jun 22, 2025, 5:04 PM

#

always take such studies with a grain of salt of course...

dusky aurora Jun 22, 2025, 5:11 PM

#

arena glitches again

#

timeout on cloudflare

ionic idol Jun 22, 2025, 5:13 PM

#

whoops

sorry it was 4o image

polar roost Jun 22, 2025, 5:17 PM

#

oh it's happening with everyone

wooden mulch Jun 22, 2025, 5:42 PM

#

the team is working on a fix now. sorry for the trouble.

jade egret Jun 22, 2025, 5:51 PM

#

wooden mulch the team is working on a fix now. sorry for the trouble.

hello

elder rapids Jun 22, 2025, 5:57 PM

#

philosophically redundant btw

native idol Jun 22, 2025, 6:00 PM

#

Depends on how do person/individual use AI.

echo aurora Jun 22, 2025, 6:16 PM

#

Hey my apologies everyone I was out and now just catching up.

#

Looks like there was an outage but has since been fixed, so it should be working now.

sacred plaza Jun 22, 2025, 6:18 PM

#

So don't think and just read papers is your response?! Lol

tall summit Jun 22, 2025, 6:19 PM

#

sacred plaza do y'all feel like your critical thinking skills (like deductive reasoning, syste...

it's stayed the same

#

i hope

sacred plaza Jun 22, 2025, 6:20 PM

#

206 pages. Lmao

#

Good point. I ain't in school so this essay writing specific study does not feel very relevant to stuff most knowledge workers do.

#

🥱

agile heart Jun 22, 2025, 6:28 PM

#

would it be cool if there was a AImusicarena like theres one for chat LLMs AI coding arena what if there was one for music like i know suno never open sourced their models but that would be cool

leaden sun Jun 22, 2025, 6:44 PM

#

agile heart would it be cool if there was a AImusicarena like theres one for chat LLMs AI ...

was thinking to request this for the arena too! what a coincidence 😆

echo aurora Jun 22, 2025, 6:48 PM

#

interesting idea! #1372230675914031105 share here

agile heart Jun 22, 2025, 6:50 PM

#

leaden sun was thinking to request this for the arena too! what a *coincidence* 😆

yeah because i would like a free version of the suno AI features, which are "Cover" and "extend"

ionic idol Jun 22, 2025, 7:09 PM

#

down again

grim axle Jun 22, 2025, 7:11 PM

#

ionic idol down again

yep

echo aurora Jun 22, 2025, 7:11 PM

#

ionic idol down again

it is? I'm not seeing the same

grim axle Jun 22, 2025, 7:12 PM

#

agile heart Jun 22, 2025, 7:12 PM

#

ionic idol down again

you know you can edit messages right (im not being rude BTW) you spelled Down wrong

ionic idol Jun 22, 2025, 7:12 PM

#

i think im crashing it by pasting large prompts

echo aurora Jun 22, 2025, 7:13 PM

#

wait I'm seeing it now too

#

this is being reported

grim axle Jun 22, 2025, 7:13 PM

#

ionic idol i think im crashing it by pasting large prompts

I was gonna use flux kontext thank you bro

agile heart Jun 22, 2025, 7:13 PM

#

ionic idol i think im crashing it by pasting large prompts

you Also spelled Large wrong (just pointing it out)

echo aurora Jun 22, 2025, 7:17 PM

#

ionic idol down again

it's working again for me, can you confirm you're seeing the same?

#

is it working again for you @grim axle ?

grim axle Jun 22, 2025, 7:19 PM

#

Yep great job @echo aurora

echo aurora Jun 22, 2025, 7:21 PM

#

grim axle Yep great job @echo aurora

credit goes to our engineers

#

and the community for flagging

#

it's much appreciated

grim axle Jun 22, 2025, 7:23 PM

#

There’s another issue in image generation

#

It’s my first time using it

echo aurora Jun 22, 2025, 7:27 PM

#

grim axle

you could be hitting a limit, how many image gens have you done today?

#

I assume it's just that model giving you troubles?

grim axle Jun 22, 2025, 7:27 PM

#

echo aurora I assume it's just that model giving you troubles?

Probably let me try another llm

#

Okay so all flux llms are not working

#

actually nvm my prompt was probably copyrighted so it wouldn’t go through

echo aurora Jun 22, 2025, 7:32 PM

#

grim axle Okay so all flux llms are not working

yeah they seem to be working fine for me, I'll keep an eye out though for other reports.

agile heart Jun 22, 2025, 7:33 PM

#

Also the GPT-image1 isnt working

echo aurora Jun 22, 2025, 7:35 PM

#

agile heart Also the GPT-image1 isnt working

replying in the thread

agile heart Jun 22, 2025, 7:36 PM

#

echo aurora replying in the thread

ok Also will you add reference images to the Gemini models

echo aurora Jun 22, 2025, 7:38 PM

#

agile heart ok Also will you add reference images to the Gemini models

I'd encourage you to use the #1372230675914031105 channel

agile heart Jun 22, 2025, 7:39 PM

#

echo aurora I'd encourage you to use the #1372230675914031105 channel

ok thx

wintry tinsel Jun 22, 2025, 8:06 PM

#

The internet does the same thing not to the same degree but still

jade egret Jun 22, 2025, 8:06 PM

#

echo aurora I'd encourage you to use the #1372230675914031105 channel

🍊

verbal nimbus Jun 22, 2025, 8:11 PM

#

echo aurora I'd encourage you to use the #1372230675914031105 channel

Wow, Discord has been hiding that channel from me this whole time 😵‍💫

echo aurora Jun 22, 2025, 8:11 PM

#

jade egret 🍊

🍍

echo aurora Jun 22, 2025, 8:11 PM

#

verbal nimbus Wow, Discord has been hiding that channel from me this whole time 😵‍💫

oh no!! sry to hear that!

#

at least you know now

#

would check out the Channels & Roles section that's at the top of the channel list as well if you haven't already

verbal nimbus Jun 22, 2025, 8:12 PM

#

echo aurora would check out the `Channels & Roles` section that's at the top of the channel ...

Yup, managed to enable it 🙂

echo aurora Jun 22, 2025, 8:13 PM

#

hmm I thought I had that auto enabled for everyone pikaconfused

#

yeah it's enabled as a default channel

#

that's odd

verbal nimbus Jun 22, 2025, 8:13 PM

#

When was it added?

echo aurora Jun 22, 2025, 8:14 PM

#

~month ago ish

verbal nimbus Jun 22, 2025, 8:14 PM

#

Maybe it didn't show up for people who joined before 🤷‍♂️

echo aurora Jun 22, 2025, 8:14 PM

#

yeah could be

verbal nimbus Jun 22, 2025, 8:14 PM

#

echo aurora ~month ago ish

Oh, I had already joined before that.

echo aurora Jun 22, 2025, 8:20 PM

#

verbal nimbus Oh, I had already joined before that.

this was the announcement around the time we made the change, may find other helpful bits of info in there #announcements message

iron cipher Jun 22, 2025, 8:39 PM

#

Ya allah, please don't cause the server to "wipe" the chat history again

manic oriole Jun 22, 2025, 8:40 PM

#

timeout error

#

hope it comes back soon

#

is back

echo aurora Jun 22, 2025, 8:42 PM

#

So sorry everyone, team is aware

manic oriole Jun 22, 2025, 8:42 PM

#

is all good

late path Jun 22, 2025, 8:45 PM

#

524, even if you successfully enter the website there is only an empty model list

native idol Jun 22, 2025, 8:46 PM

#

wait a bit

native idol Jun 22, 2025, 8:46 PM

#

late path 524, even if you successfully enter the website there is only an empty model lis...

Please wait a bit.

iron cipher Jun 22, 2025, 8:46 PM

#

native idol Please wait a bit.

will my chat history be "wiped", again?

manic oriole Jun 22, 2025, 8:46 PM

#

is not really loading the page

manic oriole Jun 22, 2025, 8:47 PM

#

iron cipher will my chat history be "wiped", again?

probably not

echo aurora Jun 22, 2025, 8:47 PM

#

manic oriole is not really loading the page

Same for me

iron cipher Jun 22, 2025, 8:47 PM

#

iron cipher will my chat history be "wiped", again?

I know the chats will be in the database, but still, but hopefully not

iron cipher Jun 22, 2025, 8:47 PM

#

manic oriole probably not

why did that happen about twice

manic oriole Jun 22, 2025, 8:48 PM

#

iron cipher why did that happen about twice

not sure

iron cipher Jun 22, 2025, 8:48 PM

#

Also quick one: Anyone remember when ChatGPT didn't have chat history? Also remember when older chat history went inaccessible for a few weeks/months on the platform?

manic oriole Jun 22, 2025, 8:48 PM

#

the beta.lmarena.ai is just slow to load

#

is working

#

thank you

iron cipher Jun 22, 2025, 8:49 PM

#

manic oriole not sure

did you try and fix it

echo aurora Jun 22, 2025, 8:50 PM

#

I’m seeing it up again too

manic oriole Jun 22, 2025, 8:50 PM

#

iron cipher did you try and fix it

i am not a dev of this lmarena

#

i am just happy to have access to this website

torn mantle Jun 22, 2025, 8:53 PM

#

echo aurora Same for me

😦

#

oh its working now

echo aurora Jun 22, 2025, 8:54 PM

#

torn mantle 😦

Should be working again

iron cipher Jun 22, 2025, 9:02 PM

#

Chat history cleared again

#

well not permanently, will have to wait until the datasets go live

leaden sun Jun 22, 2025, 9:04 PM

#

not working on all browsers sadly 😅

errant cave Jun 22, 2025, 9:23 PM

#

iron cipher Also quick one: Anyone remember when ChatGPT didn't have chat history? Also reme...

What? I've been using ChatGPT since almost the very beginning and it always had chat history

#

From the first response in my oldest chat

#

And the last one

radiant siren Jun 22, 2025, 10:17 PM

#

how often do benchmarks happen? if, let's say, Grok 3.5 released today, when would it get benchmarked?

leaden sun Jun 22, 2025, 10:39 PM

#

................I like grok now 😂

#

I hope they keep it like this 😆

#

https://tenor.com/view/the-daytripper-texas-daytripper-chet-garner-dramatic-gif-13850530

Tenor

jade egret Jun 22, 2025, 11:56 PM

#

when new google model drop 😭

jade egret Jun 23, 2025, 3:54 AM

#

jade egret

Poll: yall think grok 3.5 gonna be good?

Winning answer: no (20 of 22 votes)

zinc ore Jun 23, 2025, 4:01 AM

#

Oof

vivid sandal Jun 23, 2025, 4:03 AM

#

Is it happening again? Is there an outage again?

echo aurora Jun 23, 2025, 4:09 AM

#

Yeah we are seeing the same

#

Really sorry everyone, today has had a lot of problems

vivid sandal Jun 23, 2025, 4:13 AM

#

It's fine, you guys are really interactive, really puts me at ease when I see you chat about the ongoing problems real-time

Ty for your work trophy3d

echo aurora Jun 23, 2025, 4:14 AM

#

vivid sandal It's fine, you guys are really interactive, really puts me at ease when I see yo...

We appreciate that

hoary plaza Jun 23, 2025, 4:22 AM

#

echo aurora We appreciate that

I saw that old legacy had repo link chat. Do we have a similar new interface for it??

placid charm Jun 23, 2025, 4:47 AM

#

echo aurora We appreciate that

thank you for dealing with the site issues, but every time it happens i lose all my chat history, so for the future can we please be able to register accounts where all chats would be saved rather than kept in cookies?

errant thorn Jun 23, 2025, 4:50 AM

#

placid charm thank you for dealing with the site issues but since everytime when it happens, ...

!!! I lost my chat history 😭

echo aurora Jun 23, 2025, 5:19 AM

#

placid charm thank you for dealing with the site issues but since everytime when it happens, ...

Being able to save these chat histories as a feature is something we’re in the process of exploring, additionally though reliability of the site is something we’re focused on as well. We want to tackle the loss of chat history from both ends as we understand the frustration it causes when these are lost.

haughty tangle Jun 23, 2025, 5:32 AM

#

Why is this dude talking to himself

echo aurora Jun 23, 2025, 5:49 AM

#

The site is up and running again ablobcheer

keen ferry Jun 23, 2025, 6:24 AM

#

echo aurora The site is up and running again ablobcheer

hey I was just thinking whoever manages this server needs to add a monitor for lmarena on discord to see if the site is down or up right now (and probably a timer for how long it has been down or up, or just straight up make a website for that)

echo aurora Jun 23, 2025, 6:28 AM

#

keen ferry hey I was just thinking whoever manages this server needs to add a monitor into ...

A status page (for whether LMArena is up or down) is something we're planning on implementing. I'll advocate we get it up and running sooner rather than later. Having it linked with a bot to auto post to the server would be a rly nice feature as well so that's a good callout. I'm working on a bot that'll post when leaderboards are updated and new models are added, but yeah having site status also linked would be nice to have. Good idea blobthumbsup
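
A rough sketch of the kind of check a status bot could run is below. Everything in it is an assumption for illustration: the function names `fetch_status` and `describe` are hypothetical, the URL in the usage note is just the beta.lmarena.ai address mentioned earlier in the chat, and none of this reflects how the team's planned status page works. The 524 case is Cloudflare's timeout code, which is what people reported during the outage.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server (or Cloudflare in front of it) answered with an error code.
        return err.code
    except OSError:
        # DNS failure, refused connection, socket timeout, and similar.
        return None


def describe(status: Optional[int]) -> str:
    """Turn a status code into a short message a bot could post."""
    if status is None:
        return "down (no response)"
    if status == 524:
        return "down (Cloudflare 524 timeout)"
    if 200 <= status < 400:
        return "up"
    return f"degraded (HTTP {status})"
```

A bot could call `describe(fetch_status("https://beta.lmarena.ai"))` on a timer and only post when the message changes, which would also give the "how long was it down" timer for free.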

hoary plaza Jun 23, 2025, 6:31 AM

#

hoary plaza I saw that old legacy had repo link chat. Do we have a similar new interface for...

@echo aurora

echo aurora Jun 23, 2025, 6:32 AM

#

hoary plaza @echo aurora

Ah sry I missed that! RepoChat isn't on the current site atm (assuming that's what you mean by new interface)

hoary plaza Jun 23, 2025, 6:34 AM

#

Ye like I saw there is a new site for webdev and beta lmarena too so was asking if we have something for repo link chat too

echo aurora Jun 23, 2025, 6:35 AM

#

hoary plaza Ye like I saw there is a new site for webdev and beta lmarena too so was asking ...

Not currently, but be sure to let us know if it's something you'd like in #1372230675914031105 so others can weigh in on the same request

cedar tide Jun 23, 2025, 7:47 AM

#

@echo aurora will glm 4 air arrive on the leaderboard, or was it in the arena for nothing?

languid crescent Jun 23, 2025, 9:08 AM

#

halo, is grok 3 latest on lm arena?

calm spear Jun 23, 2025, 10:43 AM

#

echo aurora thank you for showing interest in wanting to do so! it does bring us joy seeing ...

not on your behalf, unofficial status of it would be stated in the name

wintry tinsel Jun 23, 2025, 2:18 PM

#

jade egret

That’s not accurate, it will be good, but the question is will it be SOTA good

hybrid locust Jun 23, 2025, 2:49 PM

#

will the old site be shut down in favor of the new one?

potent snow Jun 23, 2025, 2:59 PM

#

What are yalls favorite text to image models?

agile heart Jun 23, 2025, 3:01 PM

#

GPT-image1

agile heart Jun 23, 2025, 3:01 PM

#

hybrid locust will the old site be shut down in favor of the new one?

no keep it because the new one is SO buggy and stuff

hybrid locust Jun 23, 2025, 3:12 PM

#

agile heart no keep it because the new one is SO buggy and stuff

true

#

and it's missing the settings/parameters

#

like temperature

#

it just feels not so great to use

#

...

agile heart Jun 23, 2025, 3:13 PM

#

yeah and the limits and Downtime really makes it frustrating

hybrid locust Jun 23, 2025, 3:13 PM

#

what limits

#

i haven't faced any yet

#

model usage is practically unlimited I'd say

agile heart Jun 23, 2025, 3:14 PM

#

i mean image model has limits

whole wagon Jun 23, 2025, 3:17 PM

#

someone calculate how many days its been since musk said grok 3.5 would release

#

i think i got it, was supposed to release 6th May

#

so 48 days late

#

how is that even possible
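
For anyone who wants to check the count, here is a quick sketch. It assumes the 6th May date given above is right and counts up to the date of this message (Jun 23, 2025); the variable names are just illustrative.

```python
from datetime import date

promised = date(2025, 5, 6)   # release date claimed in the chat
today = date(2025, 6, 23)     # date of the message doing the count

days_late = (today - promised).days
print(days_late)  # 48
```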

jade egret Jun 23, 2025, 3:34 PM

#

GTA 6 ahh moment

patent aspen Jun 23, 2025, 3:41 PM

#

whole wagon how is that even possible

48 days late isn't even that much by Elon standards haha

#

He's been 12+ years late before

whole wagon Jun 23, 2025, 3:42 PM

#

yeah but LLM training it should be really obvious if you are actually near release or not

#

its not a loosely defined end goal

patent aspen Jun 23, 2025, 3:44 PM

#

Nah it's complicated enough that being off by a month or two is pretty normal. He'll be off by a lot more than a month or two though

native flame Jun 23, 2025, 3:45 PM

#

Guys I think I found a question about the English language that Grok 3, Claude 4 and Gemini 2.5 Pro get right, but GPT o3 and the deep research mode get wrong

patent aspen Jun 23, 2025, 3:45 PM

#

He's Mr. Overpromise and Underdeliver

native flame Jun 23, 2025, 3:49 PM

#

native flame Guys I think I found a question about English language that Grok 3, Claude 4 and...

https://x.com/gaydeer1225/status/1936964649364107317

I asked GPT whether "You'd help me..." is a conditional, a question, or future in the past.
GPT told me it was a conditional.
Grok, Claude and Gemini said future in the past.
Some guys on Discord said it was future in the past,
and a few on Discord and Reddit said it was conditional.
What do you think?

nico ❄️ DELTARUNE SPOILERS (@gaydeer1225)

// deltarune spoilers

IM GONNA LOSE IT THEY GREW SO APART SHE DOESNT EVEN ASK THEM FOR HELP ANYMORE

tall summit Jun 23, 2025, 3:51 PM

#

deltarune spoilers

native flame Jun 23, 2025, 3:58 PM

#

tall summit deltarune spoilers

xD no worries, there are no spoilers haha

#

Ah no wait yes there were xD

#

But I'm still trying to find the right answer :'v

spare mango Jun 23, 2025, 4:08 PM

#

Can Gemini 2.5 Pro analyze music?

#

Musical instruments themselves... ie, their tone, melody, mood, genre, without the lyrics?

primal orbit Jun 23, 2025, 4:20 PM

#

https://i.snipboard.io/i7Ot0z.jpg

#

gemini is being nice to me today 😄

tall summit Jun 23, 2025, 4:21 PM

#

primal orbit gemini is being nice to me today 😄

are you andrew tate

primal orbit Jun 23, 2025, 4:21 PM

#

no, not close.

ocean vortex Jun 23, 2025, 5:51 PM

#

spare mango Musical instruments themselves... ie, their tone, melody, mood, genre, without t...

Honestly the best way to find out is to simply try it. They have improved video/audio input a lot recently

lapis light Jun 23, 2025, 6:18 PM

#

primal orbit https://i.snipboard.io/i7Ot0z.jpg

This looks so much like GPT response, i've never seen Gemini respond like that before

tall summit Jun 23, 2025, 6:19 PM

#

LMAO submission #2 and #3

echo aurora Jun 23, 2025, 6:21 PM

#

tall summit LMAO submission #2 and #3

yeah those were fun

tall summit Jun 23, 2025, 6:21 PM

#

echo aurora yeah those were fun

i hope you can fight against any google forms botting

echo aurora Jun 23, 2025, 6:24 PM

#

tall summit i hope you can fight against any google forms botting

I'll be monitoring it closely, shouldn't be a problem

leaden sun Jun 23, 2025, 6:25 PM

#

spare mango Musical instruments themselves... ie, their tone, melody, mood, genre, without t...

you mean by uploading the music as mp3 file or just a description like the title of the song and composer?

spare mango Jun 23, 2025, 6:25 PM

#

leaden sun you mean by uploading the music as mp3 file or just a description like the title...

uploading the music as mp3 file, or linking the music through youtube.

echo aurora Jun 23, 2025, 6:27 PM

#

spare mango uploading the music as mp3 file, or linking the music through youtube.

@agile heart shared a rly cool idea yesterday #general message

I'm going to start up a feedback thread about it.

leaden sun Jun 23, 2025, 6:30 PM

#

I think in terms of classical music, just a text-based description could suffice

#

it would add another context dimension (to the theater play) too, if models understand the tone, rhythm, lyrics, instruments used just by reading the title of the music and its composer, I think

Bildschirmfoto_2025-06-23_um_20.20.55.png

flint skiff Jun 23, 2025, 6:41 PM

#

are grok 3.5 codenames in the arena?

#

or nothing yet

patent aspen Jun 23, 2025, 6:44 PM

#

#general message

keen beacon Jun 23, 2025, 6:47 PM

#

they put grok 3 onto the arena early. we've seen nothing from xai so far

#

bad sign

patent aspen Jun 23, 2025, 6:47 PM

#

I think xAI is swimming in technical debt

keen beacon Jun 23, 2025, 6:56 PM

#

only grok 3 mini is in that screenshot

#

i dont think the grok 3 (full) reasoning variant ever released either lol

ocean vortex Jun 23, 2025, 7:03 PM

#

Grok3 is not mid. It's ahead by a good margin over Sonnet 3.7 on artificialanalysis ratings. Most of the models that are ahead are newer and came out after grok3

#

also do not mix up reasoning and non-reasoning versions

#

That is not the point. It should still perform good overall and 4.0 is more competitive and does that. What I'm really saying is at the time of release grok3 was SOTA or very close to that. There was no other alternative that would be objectively better overall at the time

#

wasn't released yet

#

only o3-mini

#

February 17, 2025

#

grok3 release

#

it was released to the public, that's what really counts... Besides the early checkpoint (lmarena) did check out as a performant model

#

It was performant straight out the box

#

GPT4 API access was late and very limited as well, did not stop people from figuring out it's a strong model

unborn ocean Jun 23, 2025, 7:18 PM

#

grok 3 was a very good model when it came out

#

but considering the more aggressive post train and that most other labs did not focus on base models
it was a bit short of sota

leaden sun Jun 23, 2025, 7:18 PM

#

even being mid, grok at least has a taste for classical music, and that already makes it one of my favs now 😊

Bildschirmfoto_2025-06-23_um_21.15.54.png

ocean vortex Jun 23, 2025, 7:20 PM

#

unborn ocean but considering the more aggressive post train and that most other labs did not ...

there was no "not focusing on base model" lol. Everyone is flat out all of the time and non-reasoning (chat "base") models are just as important as the reasoning ones, especially back then

unborn ocean Jun 23, 2025, 7:21 PM

#

ocean vortex there was no "not focusing on base model" lol. Everyone is flat out all of the t...

but they were clearly not really focusing on training a new base model (unless you want to count the failed attempt for 4.5)

#

or improving it (outside of post training - which in my definition does not count as focusing on base model)

ocean vortex Jun 23, 2025, 7:22 PM

#

unborn ocean but they where clearly not really focusing on training a new base model (unless ...

whoever didn't it's their failure. We have obviously seen improved base models from most of the labs since then...

keen beacon Jun 23, 2025, 7:22 PM

#

Openai did a midtrain on 4o, and fresh pretrains for 4.1 mini and nano. It's not just post training

unborn ocean Jun 23, 2025, 7:23 PM

#

keen beacon Openai did a midtrain on 4o, and fresh pretrains for 4.1 mini and nano. It's not...

well i don't count the midtrain 4o, or the "new" 4.1 as really focusing on the base model very much

unborn ocean Jun 23, 2025, 7:23 PM

#

ocean vortex whoever didn't it's their failure. We have obviously seen improved base models f...

yes, i agree with you, when i say sota i am more referring to the sota the labs could do (though i know that that is probably not really the intended use of the word in this context)

ocean vortex Jun 23, 2025, 7:30 PM

#

It's funny that you mention that cause I'm actually completely the opposite and anti-Elon full tilt lol

#

but this doesn't change how grok3 actually performs

civic flame Jun 23, 2025, 7:30 PM

#

i love you claude 4 opus

#

🗣️

ocean vortex Jun 23, 2025, 7:40 PM

#

I would never pay for their SuperDork sub or however it's called lmao
But I did use the "early-grok3" lmarena checkpoint quite extensively, and then used it on grok website once grok3 was made available for free users

leaden sun Jun 23, 2025, 7:54 PM

#

Another reason to like grok? it knows to do Shakespeare

Bildschirmfoto_2025-06-23_um_21.46.38.png

#

https://tenor.com/view/muppets-muppet-show-roy-clark-theater-theatre-gif-26176969

Tenor

civic flame Jun 23, 2025, 8:01 PM

#

it probably is

leaden sun Jun 23, 2025, 8:03 PM

#

I used similar theatrical acting on gemini and gemma yesterday, and got errors many times while grok seems to understand to play along my "charade" 😅

#

i guess those acting classes from ages ago are helpful to trick some models, but not all...

#

am certain it's not "corporate" tuning, i suspect rather some kind like filter? their response got midway cancelled and turned into err...

#

i can only guess

#

it was Claude

small haven Jun 23, 2025, 9:18 PM

#

i have to agree with my man craig, grok 3 was sota at that point in time, esp. in math, its easy to criticize it in hindsight

sour spindle Jun 23, 2025, 10:14 PM

#

is flamesong a new google flash line of model

torn mantle Jun 23, 2025, 10:16 PM

#

sour spindle is flamesong a new google flash line of model

yes

lone vector Jun 23, 2025, 10:20 PM

#

Do you think it’s Gemini 3.0 or another 2.5 model

small haven Jun 23, 2025, 10:29 PM

#

so we might not see kf until end of summer? wow

#

ok thank god

sour spindle Jun 23, 2025, 10:42 PM

#

is stonebloom a new iteration of 2.5 pro

small haven Jun 23, 2025, 10:45 PM

#

no kidding 😮

kind cloud Jun 23, 2025, 11:06 PM

#

I'm thinking that Stonebloom might be something like a "2.5-pro-lite."
I tested the models by asking, "what's the official title for One Piece Chapter 1117?"
2.5 Pro answered "Mo" (the correct title) every time I tried. Flash gave me nonsense/random answers every time. And Stonebloom answered "Mo" most of the time, but gave incorrect answers a few times.

tall summit Jun 23, 2025, 11:06 PM

#

kind cloud I'm thinking that Stonebloom might be something like a "2.5-pro-lite." I tested ...

"2.5-pro-lite" umm there's a word for this

small haven Jun 23, 2025, 11:25 PM

#

kind cloud I'm thinking that Stonebloom might be something like a "2.5-pro-lite." I tested ...

big b said stonebloom is not distilled

#

same param count as kf

kind cloud Jun 23, 2025, 11:32 PM

#

small haven big b said stonebloom is not distilled

who is big b?

small haven Jun 23, 2025, 11:58 PM

#

o4 pro #8

sacred quail Jun 24, 2025, 1:31 AM

#

i want 4o thinking

languid crescent Jun 24, 2025, 7:42 AM

#

hey uh is opus 4 thinking 16k down? for some reason I get errors when trying to enter a prompt...

junior quest Jun 24, 2025, 8:15 AM

#

what the hell🤣

echo aurora Jun 24, 2025, 8:21 AM

#

junior quest what the hell🤣

pikaconfused well that doesn't look right, what's the prompt you're using?

junior quest Jun 24, 2025, 8:27 AM

#

echo aurora pikaconfused well that doesn't look right, what's the prom...

write a very hard exercise in physics and solve it

A (Properly) Hard Physics Exercise
Quantum Mechanics – Δ-potential in a 3-D Harmonic Trap
A single non–relativistic particle of mass 𝑚 is confined by an isotropic harmonic oscillator of frequency ω.
In addition, it is subjected to a point–like interaction

[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . ]

The full Hamiltonian is therefore

[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). ]

Introduce

• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), ]

with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

[ \psi_{n\ell m}(0)=0, ]

echo aurora Jun 24, 2025, 8:39 AM

#

junior quest write a very hard exercise in physics and solve it A (Properly) Hard Physics E...

okay thank you for sharing, this is helpful

keen beacon Jun 24, 2025, 8:45 AM

#

kind cloud I'm thinking that Stonebloom might be something like a "2.5-pro-lite." I tested ...

Ask it about the romanized name specifically, it's able to remember that more easily

ocean vortex Jun 24, 2025, 10:04 AM

#

sacred quail i want 4o thinking

you already have it, it's called o1

ocean vortex Jun 24, 2025, 11:31 AM

#

junior quest what the hell🤣

lol. Latex seems to work though unless this got fixed already:

#

oh..

#

output issue rather than interface issue, and if the prompt was that entire thing including everything after A (Properly) Hard Physics Exercise, it just assumed that's how you want latex to be formatted from now on...

leaden sun Jun 24, 2025, 11:48 AM

#

is there a limitation of characters in the arena? @echo aurora

Bildschirmfoto_2025-06-24_um_13.44.55.png

keen beacon Jun 24, 2025, 11:48 AM

#

\[ \] iirc these are common latex delimiters but some parsers might not accept them by default. it looks like he copied the output via selection since it's missing the slashes (i've done this before)

#

o3 seemed to be using those delimiters

ocean vortex Jun 24, 2025, 11:49 AM

#

leaden sun is there a limitation of characters in the arena? @echo aurora

ping doesn't work if you add it after you edit the message lol

leaden sun Jun 24, 2025, 11:50 AM

#

i didnt know this, thanks for telling me 😅

ocean vortex Jun 24, 2025, 11:50 AM

#

keen beacon o3 seemed to be using those delimiters

Well it seems to work for std latex, but the exact thing he pasted isn't rendered on chatgpt either, I think

keen beacon Jun 24, 2025, 11:52 AM

#

ocean vortex Well it seems to work for std latex, but the exact thing he pasted isn't rendere...

#

works on chatgpt if you add back the slashes

#

lmarena isnt rendering latex using those delimiters it seems

#

\[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]

The full Hamiltonian is therefore

\[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). \]

Introduce

• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

\[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), \]

with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

\[ \psi_{n\ell m}(0)=0, \]

ocean vortex Jun 24, 2025, 11:55 AM

#

yeah this doesn't work on lmarena... OpenAI's renderer is more lenient then

#

keen beacon Jun 24, 2025, 11:57 AM

#

he pasted the output without the slashes because of how he selected it manually. o3 outputted proper latex delimiters (\[ \]). the markdown renderer omits it (and it's not visible in the rendered output), so when he selects it it's gone. so the second one is not a valid test

#

im not sure if new lmarena has a button to copy it directly (which will include those slashes)

#

anyway the fix seems to be to just add \[ \] as additional latex delimiters beyond $$ $$

ocean vortex Jun 24, 2025, 11:59 AM

#

keen beacon he pasted the output without the slashes because of how he selected it manually....

if it fails the rendering it's different than copying the rendered text though

#

And as you can see same prompt chatgpt rendered much more

keen beacon Jun 24, 2025, 11:59 AM

#

ocean vortex if it fails the rendering it's different than copying the rendered text though

the slashes aren't visible because of the markdown renderer. the actual text output has it. (this is why his pasted output doesn't have them, as he selected the rendered output and copied it in his browser) also, the latex renderer doesn't parse those delimiters, which doesn't render the latex

ocean vortex Jun 24, 2025, 12:02 PM

#

keen beacon the slashes aren't visible because of the markdown renderer. the actual text out...

the thing is if it's working for the input on lmarena then it should have been rendered there as well. Also model most likely sees them

#

input was exactly the same

#

on chatgpt latex only works for model output, that's why it looks different

cursive iron Jun 24, 2025, 12:04 PM

#

Can we get video models on LMArena in the future?

ocean vortex Jun 24, 2025, 12:05 PM

#

many interfaces are treating user input the same way though, including lmarena (it added bulletpoint lol)

keen beacon Jun 24, 2025, 12:06 PM

#

raw model output:

\[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]

The full Hamiltonian is therefore

\[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). \]

Introduce

• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

\[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), \]

with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

\[ \psi_{n\ell m}(0)=0, \]

the renderer renders markdown and latex. inside latex delimiters, e.g. $$, it will render latex.
it sees: \[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]
\[ \] is not defined as a latex delimiter by the latex parser in the renderer.
so then it goes to the markdown parser/renderer, which omits the slashes in \[ \] => [ ]

then he selected (the rendered output) it and copied it in his browser, rather than copying the raw model output. (there's a specific button to do that in old arena)

selected and copied output via browser:

[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . ]

The full Hamiltonian is therefore

[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). ]

Introduce

• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), ]

with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

[ \psi_{n\ell m}(0)=0, ]

#

the actual problem is just this: \[ \] is not defined as a latex delimiter by the latex parser in the renderer.
if you replace \[ \] with $$ it works:

ocean vortex Jun 24, 2025, 12:08 PM

#

keen beacon raw model output: ``` \[ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g...

I'm not sure what you are trying to say or how does this change anything tbh.
I'm referring to this and it's pretty clear to me that lmarena renders less #general message

it nukes the slashes yeah, but this looks more of a side-effect of failed rendering in the first place, the display is not right comparing it to chatgpt

keen beacon Jun 24, 2025, 12:10 PM

#

its not a single process. the latex parser parses stuff within specified latex delimiters. it doesn't (because it's not defined as a latex delimiter in their parser they're using). so it gets parsed as markdown, where the markdown renderer nukes the slashes. anyway, the actual problem is that \[ \] aren't specified as latex delimiters

#

$$ V_{\delta}(\mathbf r)=g\,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . $$

The full Hamiltonian is therefore

$$ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g\,\delta^{(3)}(\mathbf r). $$

Introduce

• the oscillator length \(a_{ho}\equiv\sqrt{\hbar/(m\omega)}\) The δ-potential acts only at \(r=0\).
The (unperturbed) eigenfunctions factorise as

$$ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r)\,Y_{\ell m}(\theta,\phi), $$

with \(R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})\).

For ℓ>0, the factor \(r^{\ell}\) forces

$$ \psi_{n\ell m}(0)=0, $$

Print this. no codeblock.

i simply replaced the bracket delimiters with $$ and it works

#

also they need to add \( \) as latex delimiters as well

#

i see in his output there's inline math with those delimiters as well
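
The delimiter swap described in these messages (display math `\[ \]` rewritten to `$$`, plus the inline `\( \)` pair) could be done as a preprocessing step before the text reaches a renderer that only knows dollar delimiters. A rough sketch under that assumption; `normalize_math_delimiters` is a hypothetical name, and it deliberately ignores code blocks, which a real implementation would need to skip:

```python
import re

# Hypothetical preprocessing step, not the renderer's real code.
_DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)   # \[ ... \]
_INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)    # \( ... \)

def normalize_math_delimiters(text: str) -> str:
    """Rewrite \\[ \\] to $$ $$ and \\( \\) to $ $ so a dollar-only parser sees the math."""
    # Lambdas are used so backslashes inside the math are copied verbatim.
    text = _DISPLAY.sub(lambda m: "$$" + m.group(1) + "$$", text)
    text = _INLINE.sub(lambda m: "$" + m.group(1) + "$", text)
    return text

print(normalize_math_delimiters(r"\[ \psi_{n\ell m}(0)=0, \] for \(\ell>0\)"))
```

The cleaner fix is still the one argued for here: register `\[ \]` and `\( \)` as delimiters in the math parser itself, so no text rewriting is needed.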

ocean vortex Jun 24, 2025, 12:15 PM

#

Never argued for it being "single" or not single process lol. My point was that it was immediately clear it is not rendered while it could/should have been after that valid test. Here's how the input should have looked (instead of nuking slashes from it):

keen beacon Jun 24, 2025, 12:15 PM

#

$r^{\ell}$ => $$ r^{\ell} $$ (inline math should be rendered via $ and $ as well)

ocean vortex Jun 24, 2025, 12:17 PM

#

the same applies for the model output. It most likely sees the full input as it was, not how it's displayed, but then the same issue is with its own output

keen beacon Jun 24, 2025, 12:19 PM

#

ocean vortex Never argued for it being "single" or not single process lol. My point was that ...

youre not understanding me anyway. it really doesn't matter tbh. the fundamental issue from my long convoluted explanation is that \[ \] needs to be parsed as latex delimiters alongside $$ in the renderer.

#

this doesn't affect model performance in any way, its just visual

ocean vortex Jun 24, 2025, 12:20 PM

#

Well obviously it's just visual. I would also argue that trying to render user message is probably not the best approach in the first place either...

keen beacon Jun 24, 2025, 12:21 PM

#

ocean vortex Well obviously it's just visual. I would also argue that trying to render user m...

why not? i like it

ocean vortex Jun 24, 2025, 12:21 PM

#

keen beacon why not? i like it

it can be a mess as sometimes it's rendering things you were not meaning to be rendered. So hashtags become huge text etc

keen beacon Jun 24, 2025, 12:22 PM

#

yeah but you could put it in a codeblock anyway. i like it the majority of the time

ionic idol Jun 24, 2025, 1:30 PM

#

huh

indigo hazel Jun 24, 2025, 1:44 PM

#

@echo aurora Sorry for tagging, can I ask you to add the possibility to make a photo directly from the website? It would be more comfortable

sullen parcel Jun 24, 2025, 2:07 PM

#

hey

#

i was wondering if anyone has recommendations for an LLM that can replicate a specific design style with high character accuracy?

#

or rather what's the best in this category

#

for context, i wanna make another chapter of my storybook

#

children-story style

echo aurora Jun 24, 2025, 3:53 PM

#

leaden sun is there a limitation of characters in the arena? @echo aurora

I don't believe so.

did the model get stuck or did it look like it was finished providing an output?

echo aurora Jun 24, 2025, 3:57 PM

#

indigo hazel @echo aurora Sorry for tagging, can I ask you to add the possibility to...

My apologies as I may be misunderstanding your question here so please correct me if I'm wrong.

If you click this little drop down you should be able to create images.

leaden sun Jun 24, 2025, 3:59 PM

#

echo aurora I don't believe so. did the model get stuck or did it look like it was finishe...

it felt like its answer were cut by a potential limit, to me

echo aurora Jun 24, 2025, 4:01 PM

#

leaden sun it felt like its answer were cut by a potential limit, to me

which model was it?

indigo hazel Jun 24, 2025, 4:04 PM

#

echo aurora My apologies as I may be misunderstanding your question here so please correct m...

Don’t worry, it’s probably just my English. That function lets me generate AI images, but I’m asking about an option for taking a photo. Here’s the issue (maybe it’s silly): When I need to upload a photo from my device’s gallery or storage to the site, I first have to take the photo using my camera app, save it, and then go to the site to select it from my existing images. What I’d like is a way to take a photo directly and upload it to the site without having to save it to my gallery first. Is that possible?

leaden sun Jun 24, 2025, 4:07 PM

#

echo aurora which model was it?

grok

patent aspen Jun 24, 2025, 4:18 PM

#

ocean vortex Well obviously it's just visual. I would also argue that trying to render user m...

IMO this is the type of thing that will be a bit messy at first, get better over time, and eventually be so reliable that we can't imagine a world without it

echo aurora Jun 24, 2025, 5:07 PM

#

leaden sun grok

okay thanks I'll try to reproduce the same issue. blobthanks

echo aurora Jun 24, 2025, 5:09 PM

#

indigo hazel Don’t worry, it’s probably just my English. That function lets me generate AI im...

Gotcha! Thank you for explaining further. Yes, that's very possible. If you could share this idea in the #1372230675914031105 channel that'd be ideal. That way other members in our community can weigh in on if that's something they'd like to see added as well.

ocean vortex Jun 24, 2025, 5:09 PM

#

echo aurora okay thanks I'll try to reproduce the same issue. <:blobthanks:82544483546064492...

what is the max tokens set at? Grok3 can be very verbose when it needs to...

alpine coral Jun 24, 2025, 5:11 PM

#

same with some of the chinese thinking models too

sacred quail Jun 24, 2025, 5:17 PM

#

@surreal creek im inviting you to be more kind person. Lets make this world better together

ocean vortex Jun 24, 2025, 5:19 PM

#

alpine coral same with some of the chinese thinking models too

Chinese models not always but often enough do have crappy fine-tuning. Grok3 is one of the very few models which can output extremely long responses with thinking disabled, while not being verbose all of the time

hollow ocean Jun 24, 2025, 5:20 PM

#

https://tenor.com/view/risa-bezos-speedball-gif-19767173

Tenor

ocean vortex Jun 24, 2025, 5:20 PM

#

Most models are either concise or verbose, that is not the case here, it really seems flexible...

#

So like, this is non-reasoning version:

surreal creek Jun 24, 2025, 5:45 PM

#

sacred quail @surreal creek im inviting you to be more kind person. Lets make this wo...

?

#

how about let’s discuss AI benchmarking instead 😄👍

#

is there a possibility that human eval benchmarks push AI’s political views away from the academic consensus view they are trained on to more populist views that resonate greater with the average person?

#

when the Llama 4 Maverick matchups were fully released, I noticed that there was one individual mass prompting with political questions specifically selecting for which AI gave him a more “conservative” answer, if a push of this sort was organized on a larger level by some political group seeking to promote AI models that specifically advocate for their ideology, would it affect the landscape as a whole?

#

or would it just be similar to Elon currently trashing Grok 3.5 by trying to “dewokeify” it

sacred quail Jun 24, 2025, 6:08 PM

#

I dont think academic consensus important for politics. Academy is always aligned with system even if not looks like that. So there is no bad thing if LLMs thinks like average person about politics. This is democracy right ? If we must listen some small elite group in academy, then it would be technocracy, not democracy. In the end of the day, politics is not about what is true or not, politics is about "which thing benefits who?" So its better the academics and LLMs not being talkative about that.

#

Btw im finding Maverick 3-26 experimental much better than final maverick version

#

Im not sure what they did but experimental version in lmarena certainly better

ocean vortex Jun 24, 2025, 6:21 PM

#

surreal creek is there a possibility that human eval benchmarks push AI’s political views away...

It does not. But this very much can:

#

Was only a matter of time before Elon tried to add his own biases to grok I think... As scary bad as it is

#

Doing this he wouldn't have to overfit on misinformation, if he is altering the entire internet of data instead

#

Finally he will be able to have a model that will tell him that covid vaccine is causing autism lmao

#

It's a good thing that OpenAI parted ways with him a long time ago and Grok is struggling to gain popularity in US, let alone anywhere else, that's the only silver lining

sacred quail Jun 24, 2025, 6:37 PM

#

Grok could be most popular second AI because of twitter but i agree your concerns

sacred quail Jun 24, 2025, 6:38 PM

#

sacred quail Grok could be most popular second AI because of twitter but i agree your concern...

@gr0k iS tHiS tRuE ?

surreal creek Jun 24, 2025, 6:42 PM

#

sacred quail I dont think academic consensus important for politics. Academy is always aligne...

I apologize for being rude before, it is exceedingly obvious that English is not your first language 👍

sacred quail Jun 24, 2025, 6:42 PM

#

Yes, sorry about my broken english. Im trying my best

surreal creek Jun 24, 2025, 6:43 PM

#

ocean vortex Finally he will be able to have a model that will tell him that covid vaccine is...

RFK Grok 😭

#

GRFK

#

Grfk Jr.

leaden sun Jun 24, 2025, 6:44 PM

#

ocean vortex It does not. But this very much can:

seriously, now after testing grok for some time, i very much look forward to grok 4 ✨

Bildschirmfoto_2025-06-24_um_20.35.15.png

ocean vortex Jun 24, 2025, 6:47 PM

#

sacred quail Grok could be most popular second AI because of twitter but i agree your concern...

Twitter is only popular in US though, and at least half of that audience are firmly against Musk and everything he creates

#

He was doomed the moment he decided to get political, and even more doomed once he started parroting misinformation and far-right crazy bs

#

He probably thinks he can control people the same way leaders of completely corrupt and oppressed regimes can... 99% he doesn't believe most of the stuff he puts out, but it serves the purpose

#

I think it could work. It would change the model in some way for sure, and he has all the money in the world..

echo aurora Jun 24, 2025, 6:54 PM

#

gentle reminder to avoid political stuff unless it's specific to AI please blobthanks

leaden sun Jun 24, 2025, 7:00 PM

#

politics aside, simply look at grok as a neutral competitor in this crazy ai race, i must say xAI dev deserve a raise for making grok such a sweet delight, well-versed in classical literature, classical music and theatre plays 😊 it makes the interaction...very natural and humbly human

marsh stratus Jun 24, 2025, 7:29 PM

#

I think it will definitely be interesting to see how far you can stretch an LLM to favor some political viewpoints while still maintaining functionality. Not something I’d personally spend a billion dollars on, but will be neat to see

unborn ocean Jun 24, 2025, 7:58 PM

#

what is really sota about xAI is how fast they raised their valuation

ocean vortex Jun 24, 2025, 7:58 PM

#

marsh stratus I think it will definitely be interesting to see how far you can stretch an LLM ...

I mean in theory you could just dump data into AI and instruct it to rewrite it to be more far-right aligned. It is smart enough to understand what you want. To do it more efficiently you could make it alter text as well instead of rewriting (fill in the middle etc) which would be much faster. Even small things can have an effect if applied on a big scale

#

and then once you train on that, the entire pattern matching and probabilities will be shifted to align more with that manipulated biased fake data

keen beacon Jun 24, 2025, 8:03 PM

#

its just easier and way more cheaper to do this in post training. doing it during pretraining is gonna be an expensive research effort

ocean vortex Jun 24, 2025, 8:03 PM

#

keen beacon its just easier and way more cheaper to do this in post training. doing it durin...

it's also way less effective to do it post training

#

you are going against the entire internet and what it already learned

#

so you either degrade the performance, overfit it, make it almost unusable on many subjects, or all of those lol

keen beacon Jun 24, 2025, 8:05 PM

#

imo i think it can still be done effectively. depending on how the pretraining data is curated, most models will encounter those views and will know how to repeat those views anyway. this doesn't even require expansive rewriting/etc. plus i dont think even chinese models do that, it's expensive and complicated. youre researching propaganda models not frontier performance, its a huge waste of money, if you want propaganda spewing models there are far more effective and cheaper ways to accomplish it

ocean vortex Jun 24, 2025, 8:07 PM

#

keen beacon imo i think it can still be done effectively. depending on how the pretraining d...

Chinese models are a good example of why doing this in post training doesn't work tbh

keen beacon Jun 24, 2025, 8:07 PM

#

ocean vortex Chinese models are a good example while doing this post training doesn't work tb...

they dont do much on it

ocean vortex Jun 24, 2025, 8:07 PM

#

keen beacon they dont do much on it

and yet they can't make it work even on those very limited topics

keen beacon Jun 24, 2025, 8:07 PM

#

ocean vortex and yet they can't make it work even on those very limited topics

theyre not exhaustively doing post training on that stuff

#

they arent putting much effort into it, thats why it seems weak

ocean vortex Jun 24, 2025, 8:08 PM

#

keen beacon theyre not exhaustively doing post training on that stuff

you are just assuming that. The fact is many Chinese labs tried it and we don't have a single example of this being done effectively

keen beacon Jun 24, 2025, 8:08 PM

#

yi models iirc didnt even stop themselves at all on tiananmen square and you would see a western model reply on it

#

they just added an external filter to cut the model off/replace the response

#

no i think its exactly that

ocean vortex Jun 24, 2025, 8:10 PM

#

Degraded performance is most definitely one of the reasons, meaning it's reasonable to assume the opposite I would say - models that did do this more effectively were not even released

keen beacon Jun 24, 2025, 8:11 PM

#

they arent putting much effort into chinese political alignment as they could potentially be

#

yi models are uncensored about tiananmen square with no jailbreak 😂 but rip those models

#

they only added an external filter on their chinese api 🤷 i guess it was compliant enough at the time

ocean vortex Jun 24, 2025, 8:13 PM

#

Yeah that's true as well. What Elon is believed to be trying to accomplish is to be on a considerably larger scale. Though I wouldn't discount Chinese labs as "ticking the box" entirely, many of them share the same values of their government and have deep roots in it

#

Like they are the ones benefiting from it

#

So for them the current system works, and I would be very surprised if behind closed doors those labs have different opinion on Taiwan etc

#

yeah pretty much... And especially for those directly benefiting from CCP, this is even more true

keen beacon Jun 24, 2025, 8:20 PM

#

keen beacon yi models are uncensored about tiananmen square with no jailbreak 😂 but rip th...

hmm, testing it again with yi lightning (last i tested this was on a different yi model):

leaden sun Jun 24, 2025, 8:41 PM

#

is this fearmongering to make us believe in emergence craze? 😅
https://www.youtube.com/watch?v=eczw9k3r6Ic

YouTube

AI Explained

When Will AI Models Blackmail You, and Why?

In the last few days Anthropic have released an impressive honest account of how all models blackmail, no matter what goal they have, and despite prompt warnings, and other preventions. But do these models want this?

Thanks to Storyblocks for sponsoring this video! Download unlimited stock media at one set price with Storyblocks: storyblocks....

▶ Play video

unborn ocean Jun 24, 2025, 8:43 PM

#

why does dario look so depressed

balmy mist Jun 24, 2025, 8:47 PM

#

leaden sun is this fearmongering to make us believe in emergence craze? 😅 https://www.you...

nahh his channel pretty fair, one of the better ai channels imo

cedar tide Jun 24, 2025, 8:50 PM

#

Screenshot_2025-06-24-22-50-07-058_com.twitter.android-edit.jpg

#

The open source Open AI model coming out this summer will run on a phone and be on par with O3 Mini?

leaden sun Jun 24, 2025, 8:59 PM

#

i see, a self fulfilling prophecy so to speak

ocean vortex Jun 24, 2025, 9:46 PM

#

And they do that to some extent. But only if models were safety aligned. If that is not the goal when training the model and you don't fine-tune on safety, it will obviously not refuse essentially ever, it will generate the continuation for everything

#

No model is 100% "safe" in all cases, I don't think that is the goal

#

but the fundamental idea still works

#

it will refuse blatant extreme system prompts

#

you can trick the model, but if you can do that that usually means your intelligence is on a level where you would be able to retrieve the same information using other means as well

#

the current system prevents low intelligence psychos from wreaking havoc easily, so in that sense it kinda works

#

I mean my point is it will refuse low-effort blatant extreme/damaging system prompts, that paper does not dispute this

#

Personally I like to speculate on what is known. This seems a bit like a speculation on a fairly distant future that may be redefined and in need of entirely different solutions sooner than it becomes reality...

#

For current models that we have it is not very relevant IMO

#

you can't force that though. There's no enforced safety alignment on nukes 🤷‍♂️

#

if someone has the funds, he absolutely can train AI for anything and it's impossible to prevent this

#

However Trump campaigning to ban AI regulation is the opposite spectrum of extreme and obviously not the right move too

#

An individual can't make AI do anything though. Only huge companies with insane funding can, often with power and/or links to the government. With nukes you also need power+money

#

In some sense this is comparable to technology advancing in general. It's possible now to do way more damage with less than 50 years ago

#

so it tends to amplify both good and bad

keen beacon Jun 24, 2025, 10:15 PM

#

didn't opus 4 try to contact press and regulators when it was tasked to do something immoral tho

#

i remember reading that from anthropic

#

i get the point youre saying though

verbal nimbus Jun 24, 2025, 10:19 PM

#

https://youtu.be/z3awgfU4yno?feature=shared

YouTube

bycloud

The LLM's RL Revelation We Didn't See Coming

Try out Warp 2.0 now, the current rank #1 AI on Terminal Bench, outperforming Claude Code: https://go.warp.dev/bycloud
You can also use code "BYCLOUD" to get Warp Pro for 1 month free. (limited for 1,000 redemptions)

My Newsletter
https://mail.bycloud.ai/

my project: find, discover & explain AI research semantically
https://findmypapers.ai/

...

▶ Play video

verbal nimbus Jun 24, 2025, 10:22 PM

#

verbal nimbus https://youtu.be/z3awgfU4yno?feature=shared

Research suggests RL does not add any new reasoning pathways. Also, some models like Qwen improved from RLVR even when the data is labeled incorrectly due to internal bias.

iron cipher Jun 24, 2025, 11:44 PM

#

Can someone please update the repochat database, I lost a python script to not ticking Auto Save, and a Windows update came along in the middle of the night without my consent.

rare python Jun 25, 2025, 12:19 AM

#

@echo aurora stonebloom is broken in webdev arena. It's fine on lmarena.

echo aurora Jun 25, 2025, 12:34 AM

#

rare python @echo aurora stonebloom is broken in webdev arena. It's fine on lmarena...

thank you for the flag, I'll create a post in #1343291835845578853 with some follow-up questions.

leaden palm Jun 25, 2025, 2:10 AM

#

i figured out what the aeris guy is doing

leaden palm Jun 25, 2025, 3:58 AM

#

i think that's a bit of an overreach

#

being vehemently anti-china is problematic but "ban all who criticize chinese ai" is also problematic

mossy drum Jun 25, 2025, 6:32 AM

#

New model in Image Arena: kordex-can

echo aurora Jun 25, 2025, 6:59 AM

#

leaden palm being vehemently anti-china is problematic but "ban all who criticize chinese ai...

Would like to add on here that discussion should be focused on the model or organization and not where it’s developed. Different places will have different laws and practices for how they develop AI and that’s fine to discuss, but when it turns into blatant hatred or something unrelated to AI is where we’ll draw that line. Sometimes when that line is crossed isn’t always crystal clear, but we’ll do our best to enforce it. If anyone feels like we aren’t enforcing our rules or creating a welcoming space you’re encouraged to reach out directly and let us know. My DMs are open (although using the ModMail bot is preferred).

cedar tide Jun 25, 2025, 8:10 AM

#

This summer with their open model

#

a new model has arrived on the leaderboard,
I really don't understand why you put it in the arena 🤦,
there are plenty of interesting models to put

Screenshot_2025-06-25-10-15-27-066_com.android.chrome-edit.jpg

#

M1 arrived in the leaderboard

Screenshot_2025-06-25-10-20-24-527_com.android.chrome-edit.jpg

#

Magistral medium arrived (much lower than mistral medium 🤦)

Screenshot_2025-06-25-10-22-30-719_com.android.chrome-edit.jpg

calm sequoia Jun 25, 2025, 8:40 AM

#

I like how the o3 is slowly rising and gemini is slowly falling

#

We haven't got new 4o since 03-26 👀

tall summit Jun 25, 2025, 8:45 AM

#

leaden palm i figured out what the aeris guy is doing

you couldn't bother reading the paper 😱
i didn't know there was a paper but i expected this much

#

everyone knows by now that gemini 2.5 pro is extremely susceptible to prompt engineering and roleplay prompts change its attitude more than any other model

late path Jun 25, 2025, 8:49 AM

#

calm sequoia I like how the o3 is slowly rising and gemini is slowly falling

2.5pro's falling is most likely the result of anon models like blacktooth and stonebloom

rare python Jun 25, 2025, 9:01 AM

#

What is it?

white kelp Jun 25, 2025, 11:14 AM

#

2.5pro is out and on leaderboard? Surprised no tweet

calm sequoia Jun 25, 2025, 11:31 AM

#

late path 2.5pro's falling is most likely the result of anon models like blacktooth and st...

Wouldn't this also result in drop of o3?

sacred plaza Jun 25, 2025, 11:45 AM

#

rare python What is it?

Based on the same concept that drives the hype around this entire industry: vibes

tall summit Jun 25, 2025, 11:58 AM

#

i'm using gemini 2.5 pro to translate a full novel zero-shot and it's good
i never tested it out with a text as long as this but wow

keen beacon Jun 25, 2025, 12:02 PM

#

Sometimes, other times is way too lazy. That being said, way above gemini

late path Jun 25, 2025, 12:02 PM

#

calm sequoia Wouldn't this also result in drop of o3?

One explanation is that they have more advantage in areas where 2.5pro excels, while o3 possesses merits that some Gemini models (including anonymous ones) collectively lack.

calm sequoia Jun 25, 2025, 12:03 PM

#

May be. Also the distribution may have shifted of the voters themselves.

#

Or polymarket guys stopped spam 😄

keen beacon Jun 25, 2025, 12:04 PM

#

Lol, now that gemini is top by that big a margin theres no point

#

Until grok 3.5 or gpt 5 comes along ..

rare python Jun 25, 2025, 12:05 PM

#

keen beacon Sometimes, other times is way too lazy. That being said, way above gemini

The writing style is also concise and packed with jargon + gen z, modern slang to be relatable

#

Don't forget the — for dramatic and academic

keen beacon Jun 25, 2025, 12:06 PM

#

Ive been trying o3 with tools too, its quite a monster

rare python Jun 25, 2025, 12:06 PM

#

keen beacon Ive been trying o3 with tools too, its quite a monster

One of the things that Gemini lack so hard

#

Claude and o3 are good at tool use and agentic programming

calm sequoia Jun 25, 2025, 12:07 PM

#

keen beacon Lol, now that gemini is top by that big a margin theres no point

Exactly. I would have expected them to switch to o3 though for bigger gain.

keen beacon Jun 25, 2025, 12:08 PM

#

rare python Claude and o3 are good at tool use and agentic programming

Will try claude code too, but so far it seems a bit worse than o3 for me.
Gemini tho .. i cant stand it , 2 page long answers that are just wrong

rare python Jun 25, 2025, 12:08 PM

#

keen beacon Will try claude code too, but so far it seems a bit worse than o3 for me. Gemini...

hallucinated?

keen beacon Jun 25, 2025, 12:09 PM

#

Yeah, also its just generally wrong, i dont get the hype around it. The real world use case of the model is trash

keen beacon Jun 25, 2025, 12:10 PM

#

calm sequoia Exactly. I would have expected them to switch to o3 though for bigger gain.

Btw i found a new method for getting the leaderboard data without polling 👀

rare python Jun 25, 2025, 12:10 PM

#

Give me a prompt I'll test it

keen beacon Jun 25, 2025, 12:10 PM

#

rare python Give me a prompt I'll test it

4 ways to get the leaderboard data from lmarena.ai

calm sequoia Jun 25, 2025, 12:11 PM

#

keen beacon Btw i found a new method for getting the leaderboard data without polling 👀

You mean latest ranks or anonymous model stats?

rare python Jun 25, 2025, 12:11 PM

#

keen beacon 4 ways to get the leaderboard data from lmarena.ai

Search grounding enable?

keen beacon Jun 25, 2025, 12:11 PM

#

rare python Search grounding enable?

Yeah

keen beacon Jun 25, 2025, 12:11 PM

#

calm sequoia You mean latest ranks or anonymous model stats?

Latest

calm sequoia Jun 25, 2025, 12:12 PM

#

I don't think this is a channel to talk about it 😄

#

They gonna fill in the holes

#

LMarena could just release API for fastest updates and make some money

rare python Jun 25, 2025, 12:13 PM

#

keen beacon 4 ways to get the leaderboard data from lmarena.ai

Here are four ways to get the leaderboard data from lmarena.ai, each with different levels of technical complexity and directness.

1. Directly from the Hugging Face Space

LMArena maintains a Hugging Face Space that mirrors the official leaderboard. This space contains the source code, including an app.py file, which generates the leaderboard tables. In a community discussion on the Hugging Face platform, a site representative pointed a user to a specific line in this Python script as the place to programmatically dump the data. This is the most direct and sanctioned method for accessing the raw data structure. You can navigate to the "Files" section of the Hugging Face Space, locate the app.py file, and adapt the code to export the leaderboard data into a format like CSV or JSON.

2. Utilizing Pre-existing Data Dumps

The organization behind LMArena, LMSYS, periodically shares datasets with the community to support open science and research. These datasets include anonymized voting data, prompts, and model answers. While not a real-time feed of the leaderboard, these dumps provide rich historical data. You can find these datasets on their Hugging Face page or linked in their blog posts, such as the one for the "Search Arena" which open-sourced its dataset and analysis code. This method is ideal for research and analysis that doesn't require the absolute latest rankings.

3. Web Scraping

Web scraping is a common, though technically unofficial, method for extracting data from websites. Several articles and projects detail how to scrape the LMArena leaderboard. One approach uses AI-powered tools like DeepSeek to automatically extract the rankings, model names, and scores into a structured JSON format. Another, more traditional method involves writing a custom script using libraries like Selenium to parse the website's HTML. However, it is critical to note that LMArena's terms of use explicitly forbid programmatic access and scraping of the website. Proceeding with this method carries the risk of having your access terminated.

4. Browser Extensions and Community Tools

Developers in the AI community have created tools to interact with the LMArena site. One example is a browser extension available on GitHub that allows users to maintain a personal leaderboard by tracking their votes. While this specific tool is designed for personal stats, its existence demonstrates that the website's front-end data can be programmatically accessed and repurposed. You could explore GitHub or developer forums for similar community-built tools designed to export or track the main public leaderboard, or use such projects as a starting point for building your own tool, keeping in mind the site's terms of service.

keen beacon Jun 25, 2025, 12:13 PM

#

calm sequoia LMarena could just release API for fastest updates and make some money

Indeed ..

keen beacon Jun 25, 2025, 12:14 PM

#

rare python Here are four ways to get the leaderboard data from lmarena.ai, each with differ...

Its 3/4 wrong

calm sequoia Jun 25, 2025, 12:15 PM

#

rare python Here are four ways to get the leaderboard data from lmarena.ai, each with differ...

Lol I saw someone made a chrome extension that logs your lmarena votes. That's the most creative way to get the anonymous model ranks before official release I've encountered 😄

#

For the creators I mean

rare python Jun 25, 2025, 12:16 PM

#

keen beacon Its 3/4 wrong

Which one wrong?

keen beacon Jun 25, 2025, 12:18 PM

#

1. Is correct data source but wrong extraction method
2. Is just wrong, its historical data
3. Web scraping is correct, the method suggested on how to do it is plain wrong
4. Is wrong, its not for getting leaderboard but for keeping track of your own votes

late path Jun 25, 2025, 12:18 PM

#

i dont think theres a way to get realtime rankings before the official leaderboard repo is updated

keen beacon Jun 25, 2025, 12:18 PM

#

late path i dont think theres a way to get realtime rankings before the official leaderboa...

There is .. but i just found out

#

So gonna make 5-10% profit after grok gets released and google still wins xD

leaden sun Jun 25, 2025, 12:19 PM

#

leaden palm i figured out what the aeris guy is doing

well...I've figured it out through rhetorical debates with aeris, but its creator still is deeply convinced of its "emergence"... despite having an advanced academic degree

rare python Jun 25, 2025, 12:21 PM

#

keen beacon 1. Is correct data source but wrong extraction method 2. Is just wrong, its hist...

I sent feedback 🥴

leaden sun Jun 25, 2025, 12:24 PM

#

I wouldnt go that far to call it xenophobia, banning wont help those people criticizing cn ai to think critically either, to the contrary, it will exaggerate the effect even more

verbal nimbus Jun 25, 2025, 12:33 PM

#

Can't model providers basically cheat by returning blank responses for prompts where their model performs badly

#

E.g. if the reasoning overflows, return blank (because that means the model got stuck)

keen beacon Jun 25, 2025, 12:35 PM

#

verbal nimbus E.g. if the reasoning overflows, return blank (because that means the model got ...

They have measures so that it doesnt happen

rare python Jun 25, 2025, 12:35 PM

#

verbal nimbus Can't model providers basically cheat by returning blank responses for prompts w...

Or people will choose the model that actually worked and the blank one won't get a vote

verbal nimbus Jun 25, 2025, 12:35 PM

#

They exclude rounds where a model has no response when counting the votes.

keen beacon Jun 25, 2025, 12:36 PM

#

echo aurora Would like add on here that discussion should be focused on the model or organiz...

W staff

verbal nimbus Jun 25, 2025, 12:36 PM

#

So instead of losing, a provider can technically prevent a round from being counted when they know their model is stuck.

rare python Jun 25, 2025, 12:36 PM

#

RIP stonebloom in webdev arena. Bro can't even generate anything. Pure blank

keen beacon Jun 25, 2025, 12:36 PM

#

leaden sun I wouldnt go that far to call it xenophobia, banning wont help those people crit...

i didn't mean it like that i should've elaborated more sorry

verbal nimbus Jun 25, 2025, 12:38 PM

#

verbal nimbus So instead of losing, a provider can technically prevent a round from being coun...

I notice some models like R1 0528 tap out if the prompt is too hard

barren prairie Jun 25, 2025, 12:42 PM

#

verbal nimbus I notice some models like R1 0528 tap out if the prompt is too hard

But on the app it works ...I tried the same prompt here and there and DeepSeek worked well, sometimes better than the other models

late path Jun 25, 2025, 12:42 PM

#

keen beacon There is .. but i just found out

Doesn't the Hugging Face space just contain .pkl files generated by manually running scripts? Those are the latest (but still not realtime)

verbal nimbus Jun 25, 2025, 12:42 PM

#

barren prairie But on the app it works ...I tried the same prompt here and there and DeepSeek wo...

I mainly use WebDev. I have a prompt that causes DeepSeek to return blank every time. Prowlridge and blacktooth were similar too.

#

Non-thinking models like Mistral Medium had no issue

barren prairie Jun 25, 2025, 12:44 PM

#

verbal nimbus I mainly use WebDev. I have a prompt that causes DeepSeek to return blank every ...

Me too ...I have some prompts where DeepSeek never answered on arena webdev but when I opened the app it worked well ....maybe because it thinks for too long

verbal nimbus Jun 25, 2025, 12:46 PM

#

barren prairie Me too ...I have some prompts when deepSeek never answered on arena webdev but w...

Yeah. This gives those models an unfair advantage, since it lets them tap out on problems they get stuck on, when it should have been counted as a fail.
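
A rough worked example of the concern above, as a sketch only. It assumes (as claimed in this thread, not confirmed) that rounds with a blank response are dropped from the vote count. The function name and the numbers are made up for illustration.

```python
def win_rate(wins: int, losses: int, blanks: int, count_blanks_as_losses: bool) -> float:
    """Win rate over counted rounds.

    If blanks are excluded, they vanish from the denominator.
    If blanks count as losses, they stay in the denominator.
    """
    counted = wins + losses + (blanks if count_blanks_as_losses else 0)
    return wins / counted


# Hypothetical model: 60 wins, 30 losses, 10 rounds where it got stuck and returned blank.
excluded = win_rate(60, 30, 10, count_blanks_as_losses=False)
as_losses = win_rate(60, 30, 10, count_blanks_as_losses=True)
print(f"blanks excluded:  {excluded:.3f}")
print(f"blanks as losses: {as_losses:.3f}")
```

With these made-up numbers the measured win rate is higher when blank rounds are excluded, which is the "tap out" advantage being described. Whether the real leaderboard works this way is not established here.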

keen beacon Jun 25, 2025, 12:55 PM

#

late path Doesn't the Hugging Face space just contain .pkl files generated by manually ru...

Correct

late path Jun 25, 2025, 12:57 PM

#

and theres a way to get newer ranking than that?

keen beacon Jun 25, 2025, 12:57 PM

#

Yes

late path Jun 25, 2025, 12:57 PM

#

oh

keen beacon Jun 25, 2025, 12:57 PM

#

Its hidden

#

You have to do reverse engineering to find it, took me a whole day :/, i hope its worth it

keen beacon Jun 25, 2025, 12:58 PM

#

late path oh

Are you on polymarket too?

late path Jun 25, 2025, 1:01 PM

#

I'm just buying google

keen beacon Jun 25, 2025, 1:01 PM

#

Yeah i assume grok 3.5 and even gpt 5 will not overthrow google

rare python Jun 25, 2025, 1:02 PM

#

Damn

alpine coral Jun 25, 2025, 1:45 PM

#

unironically calls for censorship lol

calm sequoia Jun 25, 2025, 1:45 PM

#

rare python Damn

It blows my mind that people think there's only 4/100 chance that gemini won't be overthrown. It happened many times in the last days of the month 😄

alpine coral Jun 25, 2025, 1:45 PM

#

yeah that's wild btw

#

i mean.. might throw a few bucks on oAI

rare python Jun 25, 2025, 1:46 PM

#

calm sequoia It blows my mind that people think there's only 4/100 chance that gemini won't b...

When?

#

no o3 pro on lmarena

#

kek

calm sequoia Jun 25, 2025, 1:46 PM

#

Few times

#

There's still a chance for: Grok 3.5, the 4o new variant, DeepSeek R2, even GPT 5 😄

alpine coral Jun 25, 2025, 1:47 PM

#

rare python no o3 pro on lmarena

there shouldn't be imo (and prob wont).. it's already tricky to make it a 'battle' b/w thinking and non-thinking models

rare python Jun 25, 2025, 1:47 PM

#

calm sequoia Few times

rare python Jun 25, 2025, 1:47 PM

#

calm sequoia There's still a chance for: Grok 3.5, the 4o new variant, DeepSeek R2, even GPT ...

In 5 days?

#

kek

alpine coral Jun 25, 2025, 1:48 PM

#

alpine coral there shouldn't be imo (and prob wont).. it's already tricky to make it a 'battl...

adding models with parallel CoT and synthesis wouldn't seem right (aside from the costs etc)

calm sequoia Jun 25, 2025, 1:48 PM

#

It was like 2 or 3 days when Gemini 2.5 PRO came out

rare python Jun 25, 2025, 1:48 PM

#

alpine coral there shouldn't be imo (and prob wont).. it's already tricky to make it a 'battl...

I mean my point for R said Gemini 2.5 Pro will be beaten in June, which only o3 pro can right now. Grok 3.5? They keep delaying it so who knows

calm sequoia Jun 25, 2025, 1:49 PM

#

rare python

It seems the style control really made the leaderboard better. Good thing I'm not on polymarket.

late path Jun 25, 2025, 1:49 PM

#

calm sequoia It blows my mind that people think there's only 4/100 chance that gemini won't b...

This would require them to first submit a test model, then spend at least 3-4 days collecting data, then update the leaderboard before the fifth day, and exceed 2.5 pro with stylecontrol unchecked.
I think the probability is far less than 1%.

rare python Jun 25, 2025, 1:49 PM

#

calm sequoia It seems the style control really made the leaderboard better. Good thing I'm not...

Style control and Gemini still on top

calm sequoia Jun 25, 2025, 1:49 PM

#

rare python I mean my point for R said Gemini 2.5 Pro will be beaten in June, which only o3 ...

Do you personally place this chance at 4/100?

alpine coral Jun 25, 2025, 1:50 PM

#

rare python I mean my point for R said Gemini 2.5 Pro will be beaten in June, which only o3 ...

yeah sorry to intrude ig aha (i hear what you're saying.. but yeah don't think o3 pro will be added; but also don't think that excludes oai entirely - not to <2% or whatever it is.. just like who knows ha)

late path Jun 25, 2025, 1:50 PM

#

The reason the market is 97% instead of 100% is, I think, almost entirely due to opportunity cost

alpine coral Jun 25, 2025, 1:50 PM

#

alpine coral yeah sorry to intrude ig aha (i hear what you're saying.. but yeah don't think o...

actually... in terms of the leaderboard..

#

if it's for June.. then yeah..

rare python Jun 25, 2025, 1:51 PM

#

Yep it's for June

alpine coral Jun 25, 2025, 1:51 PM

#

tricky to see an OAI model surpassing tbh ahah

rare python Jun 25, 2025, 1:51 PM

#

calm sequoia Do you personally place this chance at 4/100?

0.000000000000000000000000000000000000001/100

calm sequoia Jun 25, 2025, 1:52 PM

#

late path This would require them to first submit a test model, then spend at least 3-4 da...

Not necessarily. If LMArena tweeted something like "Super good model DeepSeek R2 was released in arena", the required 1k votes would come very fast.

calm sequoia Jun 25, 2025, 1:53 PM

#

rare python 0.000000000000000000000000000000000000001/100

With such opinions you would have been wiped in march

alpine coral Jun 25, 2025, 1:53 PM

#

whether they feel it or not.. polymarket introduces all kinds of 'pressures' .. like if the lb doesn't update b/w now and 1 July for whatever reason (just hypothetically), then the current standings would apply (for the end of June bet) right?

late path Jun 25, 2025, 1:53 PM

#

calm sequoia With such opinions you would have been wiped in march

nebula was in the arena at the time

calm sequoia Jun 25, 2025, 1:54 PM

#

Do you remember when it was released?

#

I've checked the polymarket but the data is not present anymore

late path Jun 25, 2025, 1:54 PM

#

0325

calm sequoia Jun 25, 2025, 1:54 PM

#

On the arena

rare python Jun 25, 2025, 1:54 PM

#

calm sequoia With such opinions you would have been wiped in march

Um we are talking about June here. Is it related?

calm sequoia Jun 25, 2025, 1:55 PM

#

The world didn't change

#

Since then

#

One thing in LLMs is constant - unpredictability

late path Jun 25, 2025, 1:55 PM

#

rare python Jun 25, 2025, 1:56 PM

#

o3 was released in April

#

And no one surpassed Gemini leading +40 elo back then

calm sequoia Jun 25, 2025, 1:56 PM

#

late path

So its 10 days to the end of the month

#

Another scenario: the anonymous models, currently in arena, which seem better than the newest Gemini, are actually from another lab and not Google.

#

Nice

rare python Jun 25, 2025, 1:58 PM

#

calm sequoia Another scenario: the anonymous models, currently in arena, which seems better t...

which one?

calm sequoia Jun 25, 2025, 1:58 PM

#

Another scenario: lmarena decides to show anonymous models only to a subset of users, none of whom care to check the lab origin

late path Jun 25, 2025, 1:58 PM

#

calm sequoia Another scenario: the anonymous models, currently in arena, which seems better t...

Stonebloom is the only one that might be better than 2.5pro, and it is also a google model

rare python Jun 25, 2025, 1:58 PM

#

calm sequoia One thing in LLMs is constant - unpredictability

so profound

calm sequoia Jun 25, 2025, 1:59 PM

#

Cheezy but it's true

rare python Jun 25, 2025, 1:59 PM

#

calm sequoia Cheezy but it's true

so which one?

#

You haven't answered

calm sequoia Jun 25, 2025, 1:59 PM

#

Wdym

rare python Jun 25, 2025, 1:59 PM

#

calm sequoia Another scenario: the anonymous models, currently in arena, which seems better t...

this

calm sequoia Jun 25, 2025, 1:59 PM

#

Idk this is hypothetical, someone mentioned two models earlier

#

I can see you're really invested into polymarket to care so deeply

#

My idea was it's never 4/100 in LLMs

#

Too many unpredictables

rare python Jun 25, 2025, 2:01 PM

#

calm sequoia I can see you're really invested into polymarket to care so deeply

I never knew polymarket until today

#

:)

#

I want you to stop roleplaying and rage baiting me

calm sequoia Jun 25, 2025, 2:02 PM

#

According to Ourobaros chances of Gemini dropping from No. 1 spot is equal to that of Jesus returning in 2025

rare python Jun 25, 2025, 2:03 PM

#

late path Jun 25, 2025, 2:04 PM

#

calm sequoia According to Ourobaros chances of Gemini dropping from No. 1 spot is equal to th...

It's still opportunity cost. A 3% gain in 5 days is not the same as a 3% gain over 6 months
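
A quick sketch of the arithmetic behind this point, assuming simple compounding over a 365-day year and ignoring fees and the risk of losing the bet. The function name is made up for illustration.

```python
def annualized_return(gain: float, days: float) -> float:
    """Annualize a fractional gain earned over `days` days by compounding it over 365 days."""
    return (1.0 + gain) ** (365.0 / days) - 1.0


# The same 3% gain, locked up for 5 days versus roughly 6 months.
five_days = annualized_return(0.03, 5)
six_months = annualized_return(0.03, 182.5)
print(f"3% in 5 days, annualized:   {five_days:.1%}")
print(f"3% in 6 months, annualized: {six_months:.1%}")
```

The short lockup annualizes to a far larger figure than the long one, which is why a small leftover discount on a near-certain market can persist: capital tied up for months has better uses elsewhere.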

calm sequoia Jun 25, 2025, 2:05 PM

#

rare python

Sorry if this made you angry, but that's what you were saying.

rare python Jun 25, 2025, 2:06 PM

#

I'll report you if you keep making conspiracy theory without the source

keen beacon Jun 25, 2025, 2:07 PM

#

I dont get why anthropic is 6% for december

#

They are code focused no general models + no new models anytime soon ..

calm sequoia Jun 25, 2025, 2:10 PM

#

They have a good team and competence for this. Maybe they expect the chances to go up before other major releases.

late path Jun 25, 2025, 2:12 PM

#

keen beacon I dont get why anthropic is 6% for december

If you're looking for free money, you can check out these markets below
None is more promising than Anthropic, and some have even gone out of business.
People just don't want to lock up their money in it for half a year

keen beacon Jun 25, 2025, 2:28 PM

#

late path If you're looking for free money, you can check out these markets below None is ...

Damn

sour spindle Jun 25, 2025, 2:28 PM

#

I think Anthropic “doesn’t play the game” as much as other companies

#

Remember the market is simply highest ranking model on lmarena

keen beacon Jun 25, 2025, 2:28 PM

#

late path If you're looking for free money, you can check out these markets below None is ...

But yeah its not worth to leave $ there for months , i have better roi on other stuff

sour spindle Jun 25, 2025, 2:29 PM

#

Google is constantly gaming the leaderboard to find out how to eke out slightly more Elo

balmy mist Jun 25, 2025, 2:40 PM

#

has anyone tried the seedance video model? is it the best?

keen beacon Jun 25, 2025, 2:44 PM

#

sour spindle I think Anthropic “doesn’t play the game” as much as other companies

True, google does a lot tho and specifically in lmarena.

keen beacon Jun 25, 2025, 2:49 PM

#

alpine coral unironically calls for censorship lol

it was a joke bro 😭 chill out on me

glad jackal Jun 25, 2025, 2:57 PM

#

Yo why isn't there qwen3 0.6B,4B,8B and 14B in lmarena leader board?

sour spindle Jun 25, 2025, 3:00 PM

#

keen beacon True, google does a lot tho and specifically in lmarena.

FWIW I do think Google has good models but I think when the margins are so slim at the top of the leaderboard the extra gaming helps tremendously

polar roost Jun 25, 2025, 3:01 PM

#

what's the use/msg limit in direct chat?

rare python Jun 25, 2025, 3:10 PM

#

keen beacon True, google does a lot tho and specifically in lmarena.

They have so many testing models

alpine coral Jun 25, 2025, 3:10 PM

#

if the german government forced AI companies to ensure LLMs said the holocaust didn't exist, or to refuse answer questions about it, you'd have a great point there...

keen beacon Jun 25, 2025, 3:12 PM

#

alpine coral if the german government forced AI companies to ensure LLMs said the holocaust d...

Oh boy holocaust is the kind of topic that you either agree with or get banned, or in case of Germany agree or get arrested

alpine coral Jun 25, 2025, 3:12 PM

#

similarly missing the point entirely

#

mistral doesn't train its models to accommodate German hate speech laws

#

anyway... this isn't going to be productive

primal orbit Jun 25, 2025, 3:14 PM

#

Is stonebloom still in? All I get is kraken.

barren prairie Jun 25, 2025, 3:16 PM

#

primal orbit Is stonebloom still in? All I get is kraken.

Are they good??

primal orbit Jun 25, 2025, 3:16 PM

#

kraken - no

#

ok, i got stonebloom on 20th try

wintry tinsel Jun 25, 2025, 3:24 PM

#

calm sequoia According to Ourobaros chances of Gemini dropping from No. 1 spot is equal to th...

GPT 5 will dethrone it and it releases this year

hoary plaza Jun 25, 2025, 3:40 PM

#

I mean as the new models are introduced, can we increase their priority of appearing rather than old ones??

#

I don't battle much but like I got stonebloom once in 3 days 😂😂

hoary plaza Jun 25, 2025, 3:40 PM

#

wintry tinsel GPT 5 will dethrone it and it releases this year

Well a person can dream

dusky aurora Jun 25, 2025, 3:49 PM

#

developers, please improve sampling. gemini is almost unusable under these settings

polar roost Jun 25, 2025, 4:02 PM

#

what's the use/msg limit in direct chat?

primal orbit Jun 25, 2025, 4:12 PM

#

polar roost what's the use/msg limit in direct chat?

depends on a model it seems

#

but it refreshes within a chat after a while, so it's possible to continue

polar roost Jun 25, 2025, 4:25 PM

#

primal orbit but it refreshes within a chat after a while, so it's possible to continue

oh alright thanks

sour spindle Jun 25, 2025, 5:07 PM

#

primal orbit Is stonebloom still in? All I get is kraken.

I got stonebloom a minute ago first try

ocean vortex Jun 25, 2025, 5:10 PM

#

dusky aurora developers, please improve sampling. gemini is almost unusable under these settin...

wdym? Messages get cut-off?

dusky aurora Jun 25, 2025, 5:10 PM

#

ocean vortex wdym? Messages get cut-off?

no creativity at all,it is stuck on the basic assumptions. too low temperature or top-p

ocean vortex Jun 25, 2025, 5:11 PM

#

dusky aurora no creativity at all,it is stuck on the basic assumptions. too low temperature o...

I can say with confidence their settings are not an issue. Either the model is less creative or, more likely, it simply generated a response you did not like at that time. Or you are using a smaller model (Flash etc)

dusky aurora Jun 25, 2025, 5:12 PM

#

gemini-2.5-pro

ocean vortex Jun 25, 2025, 5:12 PM

#

dusky aurora gemini-2.5-pro

direct chat?

dusky aurora Jun 25, 2025, 5:12 PM

#

ocean vortex direct chat?

yes

ocean vortex Jun 25, 2025, 5:14 PM

#

dusky aurora yes

go here and you can change these settings yourself https://legacy.lmarena.ai

#

But I think default is temp 0.7 and top_p 0.95-1, so unlikely this will make much difference unless you push it beyond 1.0
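
For anyone unsure what these two knobs do, here is a minimal toy sketch of temperature scaling and top-p (nucleus) filtering on a made-up three-token distribution. It is a generic illustration of the technique, not LMArena's or Gemini's actual sampling code, and the function names and logits are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temperature in (0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], top_p_filter(probs, 0.95))
```

Lower temperature concentrates probability on the top token, and a top_p below 1 drops the unlikely tail, so both reduce variety in the output.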

storm needle Jun 25, 2025, 6:03 PM

#

calm sequoia According to Ourobaros chances of Gemini dropping from No. 1 spot is equal to th...

free money

leaden palm Jun 25, 2025, 6:25 PM

#

sullen quest Jun 25, 2025, 6:50 PM

#

Hey are lmarena links sharable? Like If I send someone a direct chat link could they see what was in it?

keen beacon Jun 25, 2025, 6:51 PM

#

Anyone try out gemini-cli?
https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/
https://github.com/google-gemini/gemini-cli

Google

Gemini CLI: your open-source AI agent

Free and open source, Gemini CLI brings Gemini directly into developers’ terminals — with unmatched access for individuals.

GitHub

GitHub - google-gemini/gemini-cli: An open-source AI agent that bri...

An open-source AI agent that brings the power of Gemini directly into your terminal. - google-gemini/gemini-cli

echo aurora Jun 25, 2025, 7:01 PM

#

sullen quest Hey are lmarena links sharable? Like If I send someone a direct chat link could ...

Sorry to say when you share a chat link it doesn't share the chat conversation. This is something we're looking into.

sullen quest Jun 25, 2025, 7:03 PM

#

thanks! Glad it'll be added at some point then.

errant cave Jun 25, 2025, 7:09 PM

#

rare python

I wish more sites banned people for false reports like this like 4chan does

civic flame Jun 25, 2025, 7:42 PM

#

keen beacon Anyone try out gemini-cli? https://blog.google/technology/developers/introducing...

first impressions not great.. got a little stuck

small haven Jun 25, 2025, 7:43 PM

#

f's

#

wen kingfall in gem cli

keen beacon Jun 25, 2025, 8:08 PM

#

civic flame first impressions not great.. got a little stuck

Same here

ocean vortex Jun 25, 2025, 8:19 PM

#

leaden palm

I don't think it will have any impact whatsoever tbh. Everyone was already training on this. Court ruling is just a technicality after the fact meaning they won't have to spend money to make this go away

#

Like meta was torrenting books, and everyone else is not any more saint, what are we even talking about here... LOL

#

copyright was never really a bottleneck, in practice at least...

#

they train on it, and then they "ask for permission", or prevent the model from disclosing it / getting caught. Or wait for court ruling like this one with the model already in production. But either way, no one is waiting for permission 👀

keen beacon Jun 25, 2025, 9:57 PM

#

keen beacon Oh boy holocaust is the kind of topic that you either agree with or get banned, ...

Thank you

keen beacon Jun 25, 2025, 9:59 PM

#

small haven wen kingfall in gem cli

wasnt it gemini 2.5 5-6 preview

#

?

echo aurora Jun 25, 2025, 10:01 PM

#

keen beacon Thank you

lets move on from this topic

keen beacon Jun 25, 2025, 10:01 PM

#

echo aurora lets move on from this topic

worst part about it is i was semi trolling sob_pray im so sorry

#

i didnt know people would be that invested

#

i deleted it just incase as well

rare python Jun 25, 2025, 10:16 PM

#

errant cave I wish more sites banned people for false reports like this like 4chan does

I wish I could detail my report more to report the trolls, and I reported your message to Discord too, hope they find a way.

tall summit Jun 25, 2025, 11:07 PM

#

leaden palm

i want them to use copyrighted material

rare python Jun 25, 2025, 11:09 PM

#

tall summit i want them to use copyrighted material

I agree but they have to pay for them.

keen beacon Jun 26, 2025, 12:38 AM

#

tall summit i want them to use copyrighted material

the rights to one like mickey mouse can bankrupt them alone

#

them shits are expensive

#

and most companies dont even sell their rights because they dont want people to use their art in certain ways

surreal creek Jun 26, 2025, 12:48 AM

#

late path If you're looking for free money, you can check out these markets below None is ...

They’re selling dollars for 88 cents

leaden palm Jun 26, 2025, 1:53 AM

#

does anyone know how to undepress gemini

rare python Jun 26, 2025, 1:55 AM

#

leaden palm does anyone know how to undepress gemini

Depressed Gemini

#

Say something to motivate it

#

Don't let it uninstall itself 😭

leaden palm Jun 26, 2025, 2:00 AM

#

leaden palm does anyone know how to undepress gemini

update: may have solved this by sending a :(

#

its been thinking for over a minute now

#

oh it was just rate limited

#

why did it do it twice?

#

you already switched to 2.5 flash...

elder rapids Jun 26, 2025, 2:12 AM

#

leaden palm does anyone know how to undepress gemini

why don't you just prompt it

#

?

leaden palm Jun 26, 2025, 2:12 AM

#

that's the thing

#

what prompt

elder rapids Jun 26, 2025, 2:12 AM

#

is there no system prompt

leaden palm Jun 26, 2025, 2:12 AM

#

this is gemini's claude code competitor ("gemini cli")

elder rapids Jun 26, 2025, 2:12 AM

#

yes

#

I know

elder rapids Jun 26, 2025, 2:13 AM

#

leaden palm what prompt

just tell it not to acknowledge x thing, treat all interactions as X, maximum response = technical context only

#

stuff like that

#

or also simply add: state facts directly without apologies or self deprecation

#

tell it to use active voice

rare python Jun 26, 2025, 2:35 AM

#

elder rapids is there no system prompt

It has gemini.md

#

same as claude.md

mossy drum Jun 26, 2025, 4:19 AM

#

New model in Image Arena: kordex-can-on

hoary plaza Jun 26, 2025, 4:46 AM

#

@echo aurora can we add a change log channel on discord which makes announcement of any changes you do

#

Like adding a model too

leaden palm Jun 26, 2025, 4:47 AM

#

hoary plaza Like adding a model too

even the anonymous ones?

hoary plaza Jun 26, 2025, 4:47 AM

#

A role can be used to ping if they are interested

leaden palm Jun 26, 2025, 4:47 AM

#

https://x.com/lmarena_ai/ usually announces the public ones so i suppose it wouldn't be too much of a stretch to extend it to here though

lmarena.ai (@lmarena_ai) on X

LMArena: Open Platform for Crowdsourced AI Benchmarking. Graduated from UC Berkeley / @lmsysorg. We’re hiring: https://t.co/1OkfLq2n0I

hoary plaza Jun 26, 2025, 4:47 AM

#

That's for you mods to decide

#

Like if it's convenient you can do that, if its not then nothing we can do

dense jasper Jun 26, 2025, 4:48 AM

#

hi

harsh flume Jun 26, 2025, 5:17 AM

#

What are you guys' impressions of minimax-m1?

#

I ran some prompts with the intent of prompt-improving (as, here is a not-well-articulated-prompt, please improve it for result X) and it performed really well

echo aurora Jun 26, 2025, 5:56 AM

#

hoary plaza @echo aurora can we add a change log channel on discord which makes ann...

yup! we actually have plans to build a bot that'll do just that!

whole wagon Jun 26, 2025, 7:05 AM

#

polymarket says there's only 30% chance gpt5 comes before july 31st lol

calm sequoia Jun 26, 2025, 7:34 AM

#

Aider bench must have the most correlation with LMARENA leaderboard

keen beacon Jun 26, 2025, 8:09 AM

#

whole wagon polymarket says there's only 30% chance gpt5 comes before july 31st lol

I mean .. its a big event, likely to be delayed to august
Whats more interesting is the 90% chance for open source model. Im going all in on that but once the release is close

whole wagon Jun 26, 2025, 8:12 AM

#

Well gpt5 before Dec 31st is also at 90%

#

So it's the same

#

The open source model has been delayed already

#

It was expected before July before

#

And this before June 30th for GPT5. Was also delayed

late path Jun 26, 2025, 8:20 AM

#

whole wagon The open source model has been delayed already

solo lost some money because of this market🫣

ocean vortex Jun 26, 2025, 8:46 AM

#

R1 still the open-source king 😇

#

and qwen3 absolutely flops on SimpleQA lmao

#

although I can't say that I'm extremely surprised

rare python Jun 26, 2025, 8:49 AM

#

ocean vortex R1 still the open-source king 😇

How can I use seed thinking 1.6?

#

Through API only?

ocean vortex Jun 26, 2025, 8:53 AM

#

rare python How can I use seed thinking 1.6?

you mean this?

#

dunno but it's still behind LOL

rare python Jun 26, 2025, 8:55 AM

#

ocean vortex dunno but it's still behind LOL

No, like each model has their own style. I want to try them out even if they aren't topping the benchmarks

#

Especially creative writing and multi turn conversation

ocean vortex Jun 26, 2025, 9:03 AM

#

rare python No, like each model has their own style. I want to try them out even if they are...

there's probably no API or it's only to Chinese citizens. Though you can try it there https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-250615

火山方舟大模型体验中心-火山引擎

火山方舟大模型体验中心，免登录即可体验，畅享DeepSeek、Doubao等最新模型！火山方舟是火山引擎推出的大模型服务平台，提供模型训练、推理、评测、精调等全方位功能与服务，并重点支撑大模型生态。

#

even this website is all Chinese with no apparent way to switch to English lmao

ocean vortex Jun 26, 2025, 9:26 AM

#

it's slow though, 12tok/sec. Took 10min to generate 26k. MCP and Canvas you can only use when signed up with a phone number and my country is not included in their list... catgrin

#

I'm curious to try their MCP (tools), this model does have solid fine-tuning at first glance. Unlike most other models that perform well on TAU, this one does not hallucinate running the code with no tools available. It gets very close to doing that but kinda stops itself and realizes it can't actually run code

ocean vortex Jun 26, 2025, 10:18 AM

#

the one I linked yeah. They don't seem to be blocking IPs

leaden sun Jun 26, 2025, 10:22 AM

#

ocean vortex it's slow though, 12tok/sec. Took 10min to generate 26k. MCP and Canvas you can ...

I think you can simply buy esim and choose one that is listed there

rare python Jun 26, 2025, 10:23 AM

#

Seed 1.6 Thinking seems to be their best model right now

#

Direct link to Seed 1.6 Thinking:

https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-thinking-250615

barren prairie Jun 26, 2025, 10:28 AM

#

hoary plaza I don't battle much but like I got stonebloom once in 3 days 😂😂

I got it 1626626 times but it never answered always empty

rare python Jun 26, 2025, 10:29 AM

#

ocean vortex even this website is all Chinese with no apparent way to switch to English lmao

Yes there is. Right click and "Translate this page into English" using Google Translate

#

Better than nothing

ocean vortex Jun 26, 2025, 10:46 AM

#

rare python Yes there is. Right click and "Translate this page into English" using Google Tr...

well obviously... I was talking about their website version. With com domain ideally it's supposed to have English lol

rare python Jun 26, 2025, 10:47 AM

#

ocean vortex well obviously... I was talking about their website version. With com domain ide...

doubao.com doesn't have English either. It redirects you to cici.com

ocean vortex Jun 26, 2025, 10:47 AM

#

Nothing spectacular but it looks interesting enough to warrant testing it more extensively

#

Seems to be around the level of the open-source SOTA, potentially somewhat better when we look at tools and their finetuning

rare python Jun 26, 2025, 10:58 AM

#

ocean vortex Seems to be around the level of the open-source SOTA, potentially somewhat bette...

Do you know which model they are using for cici.com?

#

Seems like a non thinking seed

alpine coral Jun 26, 2025, 11:03 AM

#

leaden palm does anyone know how to undepress gemini

im kinda confused.. melancholic tone aside, isn't admitting defeat here a good response (versus it pretending to have figured it out and confabulating some useless/nonsense 'answer')?

#

or is the idea that it should actually be able to resolve whatever the issue at hand is, and it's basically being lazy (and sad aha)?
if so then yeah ig prompting might help (but otherwise it seems the task/problem is just beyond its capabilities 🤷‍♂️)

ocean vortex Jun 26, 2025, 11:10 AM

#

alpine coral im kinda confused.. melancholic tone aside, isn't admitting defeat here a good r...

This seems a good response yes. A welcome change to how it used to be with the model trying the same things over and over in a loop.

delicate cedar Jun 26, 2025, 11:49 AM

#

is there a place where u can get unlimited uses for claude

torn mantle Jun 26, 2025, 12:26 PM

#

@ocean vortex why did you leave chatgpt server?

#

btw kouhe3 shared a link where you can try multiple chinese models

#

just search for ai dangbei

ocean vortex Jun 26, 2025, 12:30 PM

#

torn mantle @ocean vortex why did you leave chatgpt server?

Honestly it's a shit-posting pit with not much reason to stay. Feels very much like a one-sided affair if you actually try to post useful things there lol

leaden sun Jun 26, 2025, 12:37 PM

#

i wonder what claude server would look like if that exists...?

Bildschirmfoto_2025-06-26_um_12.19.22.png

torn mantle Jun 26, 2025, 12:43 PM

#

ocean vortex Honestly it's a shit-posting pit with not much reason to stay. Feels very much l...

smh... re-join

#

its not always about sharing something useful

#

we can just troll sometimes

misty vault Jun 26, 2025, 1:47 PM

#

@gork is this real without system prompts staging this