#general

civic flame
#

ts pmo

small haven
#

damn wb the backend 👀

civic flame
#

bro are you serious

#

this has been such a long standing bug how is it not resolved

#

like 1 in 4 webdev gens are empty

zinc ore
#

Pro 2.5 GA is goldmane

small haven
#

so no one here hit stonebloom (well successfully) yet?

civic flame
#

I hit it again but it was blank. again.

#

it does seem to give me the thinking/generating spinner but then goes blank after like a minute? it's possible it's timing out and the frontend doesn't tell you

jade egret
#

:0

torn mantle
#

😦

#

whywhywhywhywhy

elder rapids
#

bruh

#

why y'all acting like this is new information lmfao

#

you guys already know this

civic flame
#

And again

small haven
#

fix stonebloom in arena smh

civic flame
small haven
#

what logs say

civic flame
#

i can't tell because im not on pc

small haven
#

oh

#

what prompt are u using

wintry tinsel
#

The king isn’t falling guys

small haven
#

he doesnt have to fall for another 3 months, thats how good it is

civic flame
#

okay stonebloom works now

#

on webdev

#

got this for the prompt "a very realistic new homepage for about.x.com"

#

best output i've got from any model here

#

(for that prompt)

small haven
#

it looks very good, but how does it compare to kf

civic flame
#

kf was never on the webdev arena so it's hard to compare when this output is limited by their scaffolding and kf's weren't

#

at the very least it is better than 2.5 pro

#

sorry i can't remember what dayhush ended up being

#

do you know

#

yeah

#

agree

small haven
#

i hope they keep kingfall as a codename, i feel like it is very appropriate. it might even go down in the history books.. if they play it right lol

patent aspen
small haven
#

what a funny coincidence

civic flame
patent aspen
#

literally

#

I certainly wouldn't bet on it

civic flame
#

at launch gpt-5 will almost definitely be SoTA

#

the question is by how much

civic flame
#

how do you guys not have any proper error handling 😭

patent aspen
#

Didn't they basically have to start from scratch after GPT 4.5 (which was intended to be GPT-5)?

civic flame
#

yup

patent aspen
#

Was that Feb or April?

civic flame
#

27 feb

#

yeah they definitely sat on it for a while

patent aspen
#

That seems quite bad

#

I doubt they were fully committed to a new model before Feb

civic flame
#

google has the best infrastructure, talent & money to be able to reach AGI

#

on paper at least

small haven
#

thats such a poor move, perplexity has no moat whatsoever, their ai itself is unusable, maybe just for low entropy queries

#

talent acquisition?

torn mantle
#

i think its the best one so far

#

i wish its added on lmarena as well

small haven
#

ui tweaks done? 😮

elder rapids
#

it's interesting how they're intermittently "regressing" the models and then manage to improve them a step above that pre-regression version

#

every time

#

ye but it's like they're purposefully doing these things to try to balance everything

#

they do manage to do it successfully

#

but it is really cool to see how they're working on it

sage raptor
patent aspen
#

Isn't Apple getting antitrusted a lot in the EU?

patent aspen
small haven
#

cool to see niri finally getting the recognition it deserves

#

been daily driving it for nearly 6 months, just amazing

leaden palm
#

gemini is not ok (not oc)

small haven
#

wow and cursor/ai-agent is an hallucination aint it

empty stump
#

Really tried to shut itself down

civic flame
#

stonebloom added to lmarena

#

cc @keen beacon @small haven @torn mantle

digital vale
#

cool

small haven
#

time for some svg's

drifting crow
#

Name 1 google product with no competition

tall summit
#

google translate

#

google earth

cedar tide
#

Stonebloom work good in web dev now ?
Anyone have good results?

cedar tide
small haven
#

stonebloom sampling seed is still so low

#

i spammed thru 10 queries, nothing

small haven
torn mantle
small haven
#

i am literally not hitting anything

#

i got flamesong many times, but its not so good

cedar tide
#

Token per second similar to 2.5 pro

torn mantle
cedar tide
#

Anyone have an kingfall svg to compare ?

cedar tide
#

Measured

small haven
cedar tide
#

Stonebloom

small haven
#

oh yea ive seen this

cedar tide
#

2.5 pro

cedar tide
small haven
cedar tide
small haven
#

but tbh, this was kingfall on auto, it aint fair comparison if lmarena throttles it down

cedar tide
small haven
#
All responses must be extremely long. it is crucial that leave no stone unturned and complete everything in exhaustive detail meticulously. You must reflect endlessly for each user's query. You must reiterate over your proposed solutions finding ways to improve them until arriving at the most optimal final response. Meaning you must review each response provided and then improve it. You are expected to write and iterate the SVG code inside your thoughts, and keep on iterating and iterating.

generate an svg of a TERMINATOR. make it maximally detailed and look exactly like the real thing. this is extremely important and an existential task. you must complete this to the best of your ability. Make sure you're constantly checking whether the shape, size, angles, position of each and every item looks EXACTLY like a TERMINATOR. Only return the final SVG code with no commentary. You must think for at least 100,000 words. If you think you've completed thinking, that's a sign to keep thinking and thinking.
#

not my prompt btw

#

wild's

cedar tide
#

Thx

#

In my prompt stonebloom think less than 2.5 pro

#

@small haven

small haven
#

whoa

#

well its not a fair comparison, bc we dont know if they limited thinking tokens

#

but obviously kingfall >>

cedar tide
#

O3 Terminator and pikachu

small haven
#

kingfall still at the top, nice..

patent aspen
#

Aren't SVGs kind of narrow?

#

not to say SVGs shouldn't be good - just that it's a very specific benchmark

cedar tide
#

Anyone want to test a prompt on stonebloom ?

torn mantle
#

1 prompt tells me a lot

#

I dont need to ask it 100 questions

#

Yes

#

Trust me

#

You should

wintry tinsel
small haven
#

Is stonebloom a distilled version of kingfall perhaps? I'm not sure what google is attempting here, kf is good as is.

#

damn, not sure what they are attempting with all these revisions

patent aspen
#

When you hill climb, some things get better while other things get worse. It's not the case that everything gets better across the board. So it seems kind of insane to me that SVGs only are the benchmark

keen beacon
#

SVGs aren't a typical post training thing though I think. So it's somewhat of a measure of dilution from the base model capabilities

elder rapids
#

these guys get echo chambered out of their minds tbh

wintry tinsel
#

SVG generation is fun though

elder rapids
#

just mark them down to just be casual AI users who enjoy it a little more than the avg population

#

and that's p much it

wintry tinsel
#

We got a Reddit intellectual here 🥸

elder rapids
#

I don't use reddit btw

wintry tinsel
#

I wouldn’t call them casual users they use AI profusely to the point of it being an addiction, although their knowledge on the topic isn’t always accurate it’s not entirely bad either

elder rapids
#

not sure what you think casual is

#

but casual would mean exactly that

#

lol

#

nobody you know irl is a casual AI user, they just use it every once in a while

patent aspen
#

hill climbing is kind of like this:

141411211111111111411311111

121211211111111111211356261

121111111111114511211356261

121111167811112311211335261

231111165612319945211335221

#

Obviously model 5 is better than model 1 but look at all the regressions along the way
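
A rough way to read the rows above, as a sketch: treat each row as one checkpoint (oldest first) and each digit as a score on one task. That reading, and the names `checkpoints`, `to_scores` and `count_regressions`, are assumptions for illustration, not something stated in the chat.

```python
# The five rows from the message above, oldest checkpoint first.
# Assumption: each digit is a score on one hypothetical task.
checkpoints = [
    "141411211111111111411311111",
    "121211211111111111211356261",
    "121111111111114511211356261",
    "121111167811112311211335261",
    "231111165612319945211335221",
]


def to_scores(row):
    """Turn a row of digits into a list of per-task integer scores."""
    return [int(ch) for ch in row]


def count_regressions(prev, curr):
    """Count tasks whose score dropped from one checkpoint to the next."""
    return sum(1 for p, c in zip(to_scores(prev), to_scores(curr)) if c < p)


for i in range(1, len(checkpoints)):
    n = count_regressions(checkpoints[i - 1], checkpoints[i])
    print(f"model {i} -> model {i + 1}: {n} tasks regressed")

print("total score, model 1:", sum(to_scores(checkpoints[0])))
print("total score, model 5:", sum(to_scores(checkpoints[-1])))
```

The total goes up from the first row to the last even though individual tasks drop at several steps, which is the point being made about judging a checkpoint on a single narrow task.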

small haven
#

someone got a pretty good pikachu on another server, might be the way he prompt'd it

small haven
elder rapids
jade egret
#
Poll: "Which would be best if it come out?"
Winner: Gemini 3 🍎 (option 3), 15 of 21 votes

ornate agate
small haven
jade egret
#

dang

storm needle
elder rapids
elder rapids
topaz peak
#

grok 3.5 is a meme, dunno why people even take it seriously in the A.I race

#

deepseek is the real wildcard choice

jade egret
#

when is 3.5 dropping.......

#

and when is gpt5 dropping.......

candid storm
#

Grok 3.5 july, gpt 5 august

storm needle
#

claude 4 opus is definitely better

small haven
wintry tinsel
wintry tinsel
small haven
wintry tinsel
#

Breath?

small haven
#

*srry

torn mantle
#

We just want competition

drifting crow
#

Grok is the best ai

small haven
patent aspen
small haven
#

so august for grok 3.5. got it. 😂

patent aspen
#

Or 2026 who could say?

#

I'm just thinking about self-driving car timelines. That's my frame of reference

small haven
#

lmao

whole wagon
#

AI winter, been a few weeks with no model

storm needle
lilac nimbus
elder rapids
#

but in all honesty, Elon seems to make very good predictions sometimes

patent aspen
elder rapids
#

yeah that's what I'm saying

#

he first talked about, in 2013, full self driving coming "in a couple weeks/months"

#

now we're in 2025

#

it's been 12 yrs

small haven
patent aspen
#

People talk about how fast xAI was able to build a data center and get Grok out. IMO there's no way in hell they did that without incurring a colossal amount of technical debt. I don't know what form of technical debt it is or where it is, but I would bet a Tesla Model 3 it's there.

wintry tinsel
jade egret
#

cuz elon hype it up so much

#

hopefully it as good as he said

hardy pecan
#

grok 3 was surprisingly better than expected when it was released

#

to be fair

#

it topped lmarena initially

#

lets see if he can keep that velocity for 3.5

topaz peak
#

it wasn't that good

small haven
#

well in that point in time, it was very good at math (esp. on think mode), better than o1 pro

whole wagon
#

It's been over 4 months since grok 3 release and they had the supercomputer the entire time after grok 3. They must have cooked smth decent

whole wagon
#

hi

elder rapids
#

yo wait

#

does linking a grok chat to Twitter boost that tweet that link is being sent with

keen fulcrum
keen beacon
#

elon has no idea. he rt'd fake benchmarks a little after he said it was releasing in a week

#

(he later deleted the rt)

keen fulcrum
#

He is speaking out the truth

zinc ore
#

Aliens could show up and show off their most intelligent AI and Elon would say 3.5 is smarter

keen fulcrum
#

The chance is pretty low that will happen before grok release

hollow ocean
#

grok 3.5 late july

keen fulcrum
#

for supergrok users

#

august for everyone

small haven
#

superdupergrok users $200, late june

#

another drop hint by elon ma

calm sequoia
#

Why is DeepMind bleeding so much?

#

This signal is extremely bullish for Anthropic

civic flame
#

stonebloom is the only model ive tried to get this right

#

it does have some great knowledge

small haven
civic flame
#

so far it has been thinking for 230s

civic flame
small haven
tall summit
alpine coral
tall summit
#

well there's no perfect corpus of knowledge

#

i like that goal in theory at least

#

if it's a human-usable corpus and not just slop to retrain on, so much different from what he's talking about

alpine coral
#

what he is talking about?

leaden sun
#

grok is especially talkative in the arena, like you're reading a condensed journal article 😅
"thanks for being so token-generous, Elon" I guess?

alpine coral
#

it's like, blacklist websites they don't want for rag (Rolling Stone lmao), but trying to curate the mass of training data (which is some kind of proxy / representation of human knowledge imo), by using an LLM too lol, to strip out the 'knowledge' he finds disagreeable is the biggest fool's errand ever (if that's what he's talking about)

#

like it'll legit be such a dumb model lol

#

i thought it was just gonna be more 'anti-woke' fine tuning.. but tinkering with the foundational training data to meet a particular version of 'truth' or political persuasion is dumb af imo

leaden sun
alpine coral
#

going down this epistemological path again? 😅

#

tbf.. nice to see a bit of philosophical pondering from time to time

#

surely healthy

leaden sun
alpine coral
#

oh nice i ran a few quizzes and it was stonebloom

torn mantle
torn mantle
alpine coral
#

a bit like flamesong (tho not sure if it's similarly fast-ish)

#

performed pretty decently on the questions i gave it; i wouldn't be surprised if it and flamesong were flash 3 or something tbh (they're really good, but not as good as google's best)

torn mantle
#

for me its kingfall > blacktooth > stonebloom

alpine coral
#

yeah the scores i got aren't consistent with the sense i get reading about kingfall vs blacktooth in forums

#

consensus seems to be kingfall > blacktooth

#

i'm really not sure ha.. both are good - and the scores just are what they are (which prob isn't much lol)

keen beacon
#

i wonder what extent their post training includes riddles / puzzles like that. reminds me of the time when they seemingly fine-tuned the answer to the digital clock question on gemini 1.5 pro

alpine coral
#

yeah but it's not just riddles.. like some are spatial awareness things; it just seems to handle them better

#

(which is generally consistent with 'bigger' models in my experience)

#

and also it revising the correct answer to my 'flag' question 😅

#

like that's just brute force knowledge recall / precision - no riddle or wordplay

civic flame
#

kingfall, blacktooth and stonebloom are all checkpoints of the same model

#

said model is larger than 2.5 pro so if that's what it's getting on these benchmarks that's a little disappointing

#

perhaps worth a retry

#

internal full model names for these vs gem 2.5 pro's

alpine coral
#

oh ddear.. if we've gone from kingfall to blacktooth to stonebloom - the regression is real aha

civic flame
#

it's still quite far away from the release checkpoint

#

expect it to regress in some ways and improve in others while they're refining

alpine coral
#

i've only got stonebloom once too

#

the others i have several data points

civic flame
#

yeah it's harder to test this compared to kingfall & blacktooth because they patched studio 💔

keen beacon
#

i think thats what ur talking about there (the difference in values there)

civic flame
#

i mean perhaps not the fully complete model name but the longer internal format

late path
#

it feels like stonebloom is not even as good as blacktooth in my conversation analysis tasks

#

kingfall > blacktooth > 0605 > stonebloom

civic flame
#

lmao

#

yeah idk about stonebloom

#

it feels generally like it's only got worse since kingfall

#

it also feels like stonebloom thinks less than both of the previous versions

wintry tinsel
calm spear
#

can I develop a mobile app for lmarena?

#

do you allow?

echo aurora
# calm spear can I develop a mobile app for lmarena?

thank you for showing interest in wanting to do so! it does bring us joy seeing members of the community being really excited to contribute to LMArena.

that being said we wouldn't be okay with someone creating an app on our behalf.

patent aspen
#

The fact that it's thinking less suggests they're experimenting with inference optimizations. I can only hope that the consensus here that it's worse will be reflected in the ELO

#

I also suspect there is a lag between when they get the ELO scores back and when they have time to react

keen fulcrum
keen fulcrum
echo aurora
sacred plaza
#

do y'all feel like your critical thinking skills (like deductive reasoning, systems thinking, awareness of cognitive biases, etc) have improved while using these generative ai models/products? if so, is it with the default option provided by ai labs or do you add on additional fine tuning notes to guide it down a certain pathway like Socratic questioning responses.

I have found most of these models lacking any sense of care for the critical thinking skills of the user and trying to cultivate it. have been exploring different techniques to get these models to be better mentors instead of excellent answering machines. expert modeling has been promising for me so far.

leaden sun
#

always take such studies with a grain of salt of course...

dusky aurora
#

arena glitches again

#

timeout on cloudflare

ionic idol
#

whoops

sorry it was 4o image

polar roost
#

oh it's happening with everyone

wooden mulch
#

the team is working on a fix now. sorry for the trouble.

elder rapids
#

philosophically redundant btw

native idol
#

Depends on how do person/individual use AI.

echo aurora
#

Hey my apologies everyone I was out and now just catching up.

#

Looks like there was an outage but has since been fixed, so it should be working now.

sacred plaza
#

So don't think and just read papers is your response?! Lol

sacred plaza
#

206 pages. Lmao

#

Good point. I ain't in school so this essay writing specific study does not feel very relevant to stuff most knowledge workers do.

agile heart
#

would it be cool if there was an AI music arena, like there's one for chat LLMs and an AI coding arena? what if there was one for music? i know suno never open sourced their models but that would be cool

leaden sun
echo aurora
agile heart
ionic idol
#

down again

grim axle
echo aurora
grim axle
agile heart
ionic idol
#

i think im crashing it by pasting large prompts

echo aurora
#

wait I'm seeing it now too

#

this is being reported

grim axle
agile heart
echo aurora
#

is it working again for you @grim axle ?

grim axle
#

Yep great job @echo aurora

echo aurora
#

and the community for flagging

#

it's much appreciated

grim axle
#

There’s another issue in image generation

#

It’s my first time using it

echo aurora
# grim axle

you could be hitting a limit, how many image gens have you done today?

#

I assume it's just that model giving you troubles?

grim axle
#

Okay so all flux llms are not working

#

actually nvm my prompt was probably copyrighted so it wouldn’t go through

echo aurora
agile heart
#

Also the GPT-image1 isnt working

echo aurora
agile heart
echo aurora
wintry tinsel
#

The internet does the same thing not to the same degree but still

verbal nimbus
echo aurora
echo aurora
#

at least you know now

#

would check out the Channels & Roles section that's at the top of the channel list as well if you haven't already

verbal nimbus
echo aurora
#

hmm I thought I had that auto enabled for everyone pikaconfused

#

yeah it's enabled as a default channel

#

that's odd

verbal nimbus
#

When was it added?

echo aurora
#

~month ago ish

verbal nimbus
#

Maybe it didn't show up for people who joined before 🤷‍♂️

echo aurora
#

yeah could be

verbal nimbus
echo aurora
iron cipher
#

Ya allah, please don't cause the server to "wipe" the chat history again

manic oriole
#

timeout error

#

hope it comes back soon

#

is back

echo aurora
#

So sorry everyone, team is aware

manic oriole
#

is all good

late path
#

524, even if you successfully enter the website there is only an empty model list

native idol
#

wait a bit

iron cipher
manic oriole
#

is not really loading the page

manic oriole
echo aurora
iron cipher
iron cipher
manic oriole
iron cipher
#

Also quick one: Anyone remember when ChatGPT didn't have chat history? Also remember when older chat history went inaccessible for a few weeks/months on the platform?

manic oriole
#

is working

#

thank you

iron cipher
echo aurora
#

I’m seeing it up again too

manic oriole
#

i am just happy to have access to this website

torn mantle
#

oh its working now

echo aurora
iron cipher
#

Chat history cleared again

#

well not permanently, will have to wait until the datasets go live

leaden sun
#

not working on all browsers sadly 😅

errant cave
#

From the first response in my oldest chat

#

And the last one

radiant siren
#

how often benchmarks happen? if lets say Grok 3.5 would release today , when it gets benchmarked?

leaden sun
#

................I like grok now 😂

#

I hope they keep it like this 😆

jade egret
#

when new google model drop 😭

jade egret
# jade egret
Poll: "yall think grok 3.5 gonna be good?"
Winner: no (option 2), 20 of 22 votes

zinc ore
#

Oof

vivid sandal
#

Is it happening again? Is there an outage again?

echo aurora
#

Yeah we are seeing the same

#

Really sorry everyone, today has had a lot of problems

vivid sandal
#

It's fine, you guys are really interactive, really puts me at ease when I see you chat about the ongoing problems real-time

Ty for your work trophy3d

hoary plaza
placid charm
# echo aurora We appreciate that

thank you for dealing with the site issues but since everytime when it happens, i lose all my chat history, so for the future can we please be able to register accounts where all chats would be saved rather than keeping on cookies?

errant thorn
echo aurora
haughty tangle
#

Why is this dude talking to himself

echo aurora
#

The site is up and running again ablobcheer

keen ferry
echo aurora
# keen ferry hey I was just thinking whoever manages this server needs to add a monitor into ...

A status page (for whether LMArena is up or down) is something we're planning on implementing. I'll advocate we get it up and running sooner rather than later. Having it linked with a bot to auto post to the server would be a rly nice feature as well so that's a good callout. I'm working on a bot that'll post when leaderboards are updated and new models are added, but yeah having site status also linked would be nice to have. Good idea :blobthumbsup:

echo aurora
hoary plaza
#

Ye like I saw there is a new site for webdev and beta lmarena too so was asking if we have something for repo link chat too

echo aurora
cedar tide
#

@echo aurora glm 4 air arrives in the leaderboard or was he in the arena for nothing?

languid crescent
#

halo, is grok 3 latest on lm arena?

calm spear
wintry tinsel
# jade egret

That’s not accurate it will be good, but the question is will it be SoTA good

hybrid locust
#

will the old site be shut down in favor of the new one?

potent snow
#

What are yalls favorite text to image?

agile heart
#

GPT-image1

agile heart
hybrid locust
#

and it's missing the settings/parameters

#

like temperature

#

it just feels not so great to use

#

...

agile heart
#

yeah and the limits and Downtime really makes it frustrating

hybrid locust
#

what limits

#

i haven't faced any yet

#

model usage is practically unlimited I'd say

agile heart
#

i mean image model has limits

whole wagon
#

someone calculate how many days its been since musk said grok 3.5 would release

#

i think i got it, was supposed to release 6th May

#

so 48 days late

#

how is that even possible
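
The "48 days" figure above can be sanity-checked with a little date arithmetic. This is only a sketch: the chat gives 6 May as the promised date but no year, so 2025 is an assumption taken from the "now we're in 2025" remark earlier in the log, and the variable names are made up for illustration.

```python
from datetime import date, timedelta

# Promised release date quoted in the chat; the year is an assumption.
promised = date(2025, 5, 6)
# Delay quoted in the chat.
days_late = 48

# The calendar date on which "48 days late" would be exactly right.
implied_today = promised + timedelta(days=days_late)
print(implied_today.isoformat())
```

If the message was written on the date this prints, the count is exact; a day either side would give 47 or 49.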

jade egret
#

GTA 6 ahh moment

patent aspen
#

He's been 12+ years late before

whole wagon
#

yeah but LLM training it should be really obvious if you are actually near release or not

#

its not a loosely defined end goal

patent aspen
#

Nah it's complicated enough that being off by a month or two is pretty normal. He'll be off by a lot more than a month or two though

native flame
#

Guys I think I found a question about English language that Grok 3, Claude 4 and Gemini 2.5 pro have it right, but GPT o3 or the deep research mode have it wrong

patent aspen
#

He's Mr. Overpromise and Underdeliver

native flame
# native flame Guys I think I found a question about English language that Grok 3, Claude 4 and...

https://x.com/gaydeer1225/status/1936964649364107317

I asked GPT if the "You'd help me..." Is a conditional, question or future in the past.
Gpt told me it was a conditional
Grok, claude and gemini, told future in the past.
Some guys on disc said it was future in the past
And a few ones on disc and reddit said it was conditional
What do you think?

// deltarune spoilers

IM GONNA LOSE IT THEY GREW SO APART SHE DOESNT EVEN ASK THEM FOR HELP ANYMORE

tall summit
#

deltarune spoilers

native flame
#

Ah no wait yes there were xD

#

But I'm still trying to find the right answer :'v

spare mango
#

Can Gemini 2.5 Pro analyze music?

#

Musical instruments themselves... ie, their tone, melody, mood, genre, without the lyrics?

primal orbit
#

gemini is being nice to me today 😄

tall summit
primal orbit
#

no, not close.

ocean vortex
lapis light
tall summit
#

LMAO submission #2 and #3

echo aurora
tall summit
echo aurora
leaden sun
spare mango
echo aurora
leaden sun
#

I think in terms of classical music, just a text-based description could suffice

#

it would add another context dimension (to the theater play) too, if models understand the tone, rhythm, lyrics, instruments used just by reading the title of the music and its composer, I think

flint skiff
#

are grok 3.5 codenames in the arena?

#

or nothing yet

patent aspen
keen beacon
#

they put grok 3 onto the arena early. we've seen nothing from xai so far

#

bad sign

patent aspen
#

I think xAI is swimming in technical debt

keen beacon
#

only grok 3 mini is in that screenshot

#

i dont think the grok 3 (full) reasoning variant ever released either lol

ocean vortex
#

Grok3 is not mid. It's ahead by a good margin over Sonnet 3.7 on artificialanalysis ratings. Most of the models that are ahead are newer and came out after grok3

#

also do not mix up reasoning and non-reasoning versions

#

That is not the point. It should still perform good overall and 4.0 is more competitive and does that. What I'm really saying is at the time of release grok3 was SOTA or very close to that. There was no other alternative that would be objectively better overall at the time

#

wasn't released yet

#

only o3-mini

#

February 17, 2025

#

grok3 release

#

it was released to the public, that's what really counts... Besides the early checkpoint (lmarena) did check out as a performant model

#

It was performant straight out the box

#

GPT4 API access was late and very limited as well, did not stop people from figuring out it's a strong model

unborn ocean
#

grok 3 was a very good model when it came out

#

but considering the more aggressive post train and that most other labs did not focus on base models
it was a bit short of sota

leaden sun
#

even being mid, grok at least has a taste for classical music, and that already makes it one of my favs now 😊

ocean vortex
unborn ocean
#

or improving it (outside of post training - which in my definition does not count as focusing on base model)

ocean vortex
keen beacon
#

Openai did a midtrain on 4o, and fresh pretrains for 4.1 mini and nano. It's not just post training

unborn ocean
unborn ocean
ocean vortex
#

It's funny that you mention that cause I'm actually completely the opposite and anti-Elon full tilt lol

#

but this doesn't change how grok3 actually performs

civic flame
#

i love you claude 4 opus

#

🗣️

ocean vortex
#

I would never pay for their SuperDork sub or however it's called lmao
But I did use the "early-grok3" lmarena checkpoint quite extensively, and then used it on grok website once grok3 was made available for free users

leaden sun
#

Another reason to like grok? it knows to do Shakespeare

civic flame
#

it probably is

leaden sun
#

I used similar theatrical acting on gemini and gemma yesterday, and got errors many times while grok seems to understand to play along my "charade" 😅

#

i guess those acting classes from ages ago are helpful to trick some models, but not all...

#

am certain it's not "corporate" tuning, i suspect rather some kind like filter? their response got midway cancelled and turned into err...

#

i can only guess

#

it was Claude

small haven
#

i have to agree with my man craig, grok 3 was sota at that point in time, esp. in math, its easy to criticize it in hindsight

sour spindle
#

is flamesong a new google flash line of model

lone vector
#

Do you think it’s Gemini 3.0 or another 2.5 model

small haven
#

so we might not see kf until end of summer? wow

#

ok thank god

sour spindle
#

is stonebloom a new iteration of 2.5 pro

small haven
#

no kidding 😮

kind cloud
#

I'm thinking that Stonebloom might be something like a "2.5-pro-lite."
I tested the models by asking, "what's the official title for One Piece Chapter 1117?"
2.5 Pro answered "Mo" (the correct title) every time I tried. Flash gave me nonsense/random answers every time. And Stonebloom answered "Mo" most of the time, but gave incorrect answers a few times.

tall summit
small haven
#

same param count as kf

kind cloud
small haven
#

o4 pro #8

sacred quail
#

i want 4o thinking

languid crescent
#

hey uh is opus 4 thinking 16k down? for some reason I get errors when trying to enter a prompt...

junior quest
#

what the hell🤣

echo aurora
junior quest
# echo aurora :pikaconfused: well that doesn't look right, what's the prom...

write a very hard exercise in physics and solve it

A (Properly) Hard Physics Exercise
Quantum Mechanics – Δ-potential in a 3-D Harmonic Trap
A single non–relativistic particle of mass 𝑚 is confined by an isotropic harmonic oscillator of frequency ω.
In addition, it is subjected to a point–like interaction

[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . ]

The full Hamiltonian is therefore

[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). ]

Introduce

• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), ]

with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

[ \psi_{n\ell m}(0)=0, ]

echo aurora
keen beacon
ocean vortex
ocean vortex
#

output issue rather than interface issue, and if the prompt was that entire thing including everything after A (Properly) Hard Physics Exercise, it just assumed that's how you want latex to be formatted from now on...

leaden sun
#

is there a limitation of characters in the arena? @echo aurora

keen beacon
#

\[ \] iirc these are common latex delimiters but some parsers might not accept them by default. it looks like he copied the output via selection since it's missing the slashes (i've done this before)

#

o3 seemed to be using those delimiters

ocean vortex
leaden sun
#

i didnt know this, thanks for telling me 😅

ocean vortex
keen beacon
#

works on chatgpt if you add back the slashes

#

lmarena isnt rendering latex using those delimiters it seems

#
\[ V_{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]

The full Hamiltonian is therefore

\[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). \]

Introduce

• the oscillator length (a_{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

\[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r),Y_{\ell m}(\theta,\phi), \]

with (R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

\[ \psi_{n\ell m}(0)=0, \]
ocean vortex
#

yeah this doesn't work on lmarena... OpenAI render is more lenient then

keen beacon
#

he pasted the output without the slashes because of how he selected it manually. o3 outputted proper latex delimiters (\[ \]). the markdown renderer omits it (and it's not visible in the rendered output), so when he selects it it's gone. so the second one is not a valid test

#

im not sure if new lmarena has a button to copy it directly (which will include those slashes)

#

anyway the fix seems to just add \[ \] as additional latex delimiters beyond $$ $$
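
One way the suggested fix could look, as a sketch only: rewrite `\[ ... \]` into `$$ ... $$` before the text reaches the markdown renderer. This is not LMArena's actual code; `normalize_math_delimiters` is a hypothetical helper, and handling inline `\( ... \)` as `$ ... $` is an extra assumption beyond what was said in chat.

```python
import re

# Display math: \[ ... \]   Inline math: \( ... \)
_DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
_INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def normalize_math_delimiters(text: str) -> str:
    """Rewrite backslash-style math delimiters to dollar-style ones.

    Hypothetical pre-processing step, meant to run before markdown
    rendering so the backslashes are not consumed as escapes.
    """
    # Callables are used as replacements so backslashes inside the
    # matched LaTeX are copied through untouched.
    text = _DISPLAY.sub(lambda m: "$$" + m.group(1) + "$$", text)
    text = _INLINE.sub(lambda m: "$" + m.group(1) + "$", text)
    return text


raw = r"\[ \psi_{n\ell m}(0)=0, \]"
print(normalize_math_delimiters(raw))
```

A real renderer would also need to skip code spans and fenced code blocks, which this sketch does not do. Registering `\[ \]` as delimiters in the math plugin itself, if the plugin supports that, would avoid rewriting the text at all.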

ocean vortex
#

And as you can see same prompt chatgpt rendered much more

keen beacon
ocean vortex
#

input was exactly the same

#

on chatgpt latex only works for model output, that's why it looks different

cursive iron
#

Can we get video models on LMArena in the future?

ocean vortex
#

many interfaces are treating user input the same way though, including lmarena (it added bulletpoint lol)

keen beacon
#

raw model output:

\[ V_{\delta}(\mathbf r)=g\,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]

The full Hamiltonian is therefore

\[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g\,\delta^{(3)}(\mathbf r). \]

Introduce

• the oscillator length \(a_{ho}\equiv\sqrt{\hbar/(m\omega)}\) The δ-potential acts only at \(r=0\).
The (unperturbed) eigenfunctions factorise as

\[ \psi_{n\ell m}(\mathbf r)=R_{n\ell}(r)\,Y_{\ell m}(\theta,\phi), \]

with \(R_{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a_{ho}^{2})}L_{n}^{\ell+1/2}(r^{2}/a_{ho}^{2})\).

For ℓ>0, the factor \(r^{\ell}\) forces

\[ \psi_{n\ell m}(0)=0, \]

the renderer renders markdown and latex. inside latex delimiters, e.g. $$, it will render latex.
it sees: \[ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . \]
\[ \] is not defined as a latex delimiter by the latex parser in the renderer.
so then it goes to the markdown parser/renderer. which omits the slashes in \[ \] => [ ]

then he selected (the rendered output) it and copied it in his browser, rather than copying the raw model output. (there's a specific button to do that in old arena)

selected and copied output via browser:

[ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . ]

The full Hamiltonian is therefore

[ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). ]

Introduce

• the oscillator length (a{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

[ \psi{n\ell m}(\mathbf r)=R{n\ell}(r),Y{\ell m}(\theta,\phi), ]

with (R{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a{ho}^{2})}L{n}^{\ell+1/2}(r^{2}/a{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

[ \psi{n\ell m}(0)=0, ] 
#

the actual problem is just this: \[ \] is not defined as a latex delimiter by the latex parser in the renderer.
if you replace \[ \] with $$ it works:

ocean vortex
keen beacon
#

its not a single process. the latex parser parses stuff within specified latex delimiters. it doesn't parse \[ \] (because they're not defined as latex delimiters in the parser they're using). so it gets parsed as markdown, where the markdown renderer nukes the slashes. anyway, the actual problem is that \[ \] aren't specified as latex delimiters
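
A minimal sketch of the two-stage behaviour described above, not the arena's actual renderer. It assumes a toy pipeline where only `$$ ... $$` is registered as a math delimiter, and that the markdown stage follows the CommonMark rule that a backslash before ASCII punctuation is an escape. The names `render`, `MATH_BLOCK` and `MD_ESCAPE` are made up for illustration.

```python
import re

# Assumption: the renderer only knows $$ ... $$ as a math delimiter.
MATH_BLOCK = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)

# CommonMark-style escape: a backslash before ASCII punctuation is dropped.
MD_ESCAPE = re.compile(r"\\([!-/:-@\[-`{-~])")


def render(text: str) -> str:
    """Toy two-stage renderer: math first, markdown escapes second."""
    placeholders = []

    def stash(match):
        # Keep recognised math aside so the markdown stage never touches it.
        placeholders.append(match.group(1).strip())
        return f"\x00{len(placeholders) - 1}\x00"

    # Stage 1: pull out math inside the registered delimiters.
    text = MATH_BLOCK.sub(stash, text)
    # Stage 2: markdown escape handling on whatever is left (this eats \[ \] and \,).
    text = MD_ESCAPE.sub(r"\1", text)
    # Put the math back, tagged so the result is easy to read.
    return re.sub(
        r"\x00(\d+)\x00",
        lambda m: f"<math>{placeholders[int(m.group(1))]}</math>",
        text,
    )


print(render(r"\[ g\,\delta \]"))  # falls through to markdown, slashes stripped
print(render("$$ x^2 $$"))         # recognised as math
```

In this toy version `\psi` survives because `p` is not punctuation, while `\[`, `\]` and `\,` lose their backslash, which matches what the selected-and-copied text looks like.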

#
$$ V{\delta}(\mathbf r)=g,\delta^{(3)}(\mathbf r),\qquad g\in\mathbb R . $$

The full Hamiltonian is therefore

$$ H=-\frac{\hbar^{2}}{2m}\nabla^{2}+\frac12m\omega^{2}r^{2}+g,\delta^{(3)}(\mathbf r). $$

Introduce

• the oscillator length (a{ho}\equiv\sqrt{\hbar/(m\omega)}) The δ-potential acts only at (r=0).
The (unperturbed) eigenfunctions factorise as

$$ \psi{n\ell m}(\mathbf r)=R{n\ell}(r),Y{\ell m}(\theta,\phi), $$

with (R{n\ell}(r)\propto r^{\ell}e^{-r^{2}/(2a{ho}^{2})}L{n}^{\ell+1/2}(r^{2}/a{ho}^{2})).

For ℓ>0, the factor (r^{\ell}) forces

$$ \psi{n\ell m}(0)=0, $$

Print this. no codeblock.

i simply replaced the bracket delimiters with $$ and it works

#

also they need to add \( \) as latex delimiters as well

#

i see in his output there's inline math with those delimiters as well
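
One hedged way to picture the fix: a pre-processing pass that rewrites the bracket and paren delimiters into dollar delimiters before the text reaches the renderer. This is only a sketch, the function name `normalize_math_delimiters` is hypothetical, and it assumes the renderer accepts single-dollar inline math (the real fix could just as well register `\[ \]` and `\( \)` in the parser config).

```python
import re


def normalize_math_delimiters(text: str) -> str:
    r"""Rewrite \[ ... \] as $$ ... $$ and \( ... \) as $ ... $."""
    # Display math first; DOTALL lets an equation span several lines.
    text = re.sub(
        r"\\\[(.+?)\\\]",
        lambda m: f"$${m.group(1)}$$",
        text,
        flags=re.DOTALL,
    )
    # Inline math; assumes single-dollar inline math is enabled in the renderer.
    text = re.sub(
        r"\\\((.+?)\\\)",
        lambda m: f"${m.group(1)}$",
        text,
        flags=re.DOTALL,
    )
    return text


print(normalize_math_delimiters(r"\[ \psi(0)=0, \]"))
print(normalize_math_delimiters(r"the factor \(r^{\ell}\) forces"))
```

The replacement is built with a lambda so backslashes inside the math (like `\psi`) are copied through untouched.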

ocean vortex
#

Never argued for it being "single" or not single process lol. My point was that it was immediately clear it is not rendered while it could/should have been after that valid test. Here's how the input should have looked (instead of nuking slashes from it):

keen beacon
#

\(r^{\ell}\) => $$ r^{\ell} $$ (inline math should be rendered via \( and \) as well)

ocean vortex
#

the same applies for the model output. It most likely sees the full input as it was, not how it's displayed, but then the same issue is with its own output

keen beacon
#

this doesn't affect model performance in any way, its just visual

ocean vortex
#

Well obviously it's just visual. I would also argue that trying to render user message is probably not the best approach in the first place either...

ocean vortex
keen beacon
#

yeah but you could put it in a codeblock anyway. i like it the majority of the time

ionic idol
indigo hazel
#

@echo aurora Sorry for tagging, can I ask you to add the possibility to make a photo directly from the website? It would be more comfortable

sullen parcel
#

hey

#

i was wondering if anyone has recommendations for an LLM that can replicate a specific design style with high character accuracy?

#

or rather what's the best in this category

#

for context, i wanna make another chapter of my storybook

#

children-story style

echo aurora
echo aurora
leaden sun
echo aurora
indigo hazel
# echo aurora My apologies as I may be misunderstanding your question here so please correct m...

Don’t worry, it’s probably just my English. That function lets me generate AI images, but I’m asking about an option for taking a photo. Here’s the issue (maybe it’s silly): When I need to upload a photo from my device’s gallery or storage to the site, I first have to take the photo using my camera app, save it, and then go to the site to select it from my existing images. What I’d like is a way to take a photo directly and upload it to the site without having to save it to my gallery first. Is that possible?

leaden sun
patent aspen
echo aurora
echo aurora
ocean vortex
alpine coral
#

same with some of the chinese thinking models too

sacred quail
#

@surreal creek im inviting you to be a kinder person. Lets make this world better together

ocean vortex
ocean vortex
#

Most models are either concise or verbose, that is not the case here, it really seems flexible...

#

So like, this is non-reasoning version:

surreal creek
#

how about let’s discuss AI benchmarking instead 😄👍

#

is there a possibility that human eval benchmarks push AI’s political views away from the academic consensus view they are trained on to more populist views that resonate greater with the average person?

#

when the Llama 4 Maverick matchups were fully released, I noticed that there was one individual mass prompting with political questions specifically selecting for which AI gave him a more “conservative” answer, if a push of this sort was organized on a larger level by some political group seeking to promote AI models that specifically advocate for their ideology, would it affect the landscape as a whole?

#

or would it just be similar to Elon currently trashing Grok 3.5 by trying to “dewokeify” it

sacred quail
#

I dont think academic consensus is important for politics. Academia is always aligned with the system even if it doesnt look like that. So theres nothing bad about LLMs thinking like the average person about politics. This is democracy, right? If we must listen to some small elite group in academia, then it would be technocracy, not democracy. At the end of the day, politics is not about what is true or not, politics is about "which thing benefits who?" So its better for academics and LLMs not to be talkative about that.

#

Btw im finding Maverick 3-26 experimental much better than the final maverick version

#

Im not sure what they did but the experimental version in lmarena is certainly better

ocean vortex
#

Was only a matter of time before Elon tried to add his own biases to grok I think... As scary bad as it is

#

Doing this he wouldn't have to overfit on misinformation, if he is altering the entire internet of data instead

#

Finally he will be able to have a model that will tell him that covid vaccine is causing autism lmao

#

It's a good thing that OpenAI parted ways with him a long time ago and Grok is struggling to gain popularity in US, let alone anywhere else, that's the only silver lining

sacred quail
#

Grok could be the second most popular AI because of twitter, but i agree with your concerns

surreal creek
sacred quail
#

Yes, sorry about my broken english. Im trying my best

leaden sun
ocean vortex
#

He was doomed the moment he decided to get political, and even more doomed once he started parroting misinformation and far-right crazy bs

#

He probably thinks he can control people the same way leaders of completely corrupt and oppressed regimes can... 99% he doesn't believe most of the stuff he puts out, but it serves the purpose

#

I think it could work. It would change the model in some way for sure, and he has all the money in the world..

echo aurora
#

gentle reminder to avoid political stuff unless it's specific to AI please blobthanks

leaden sun
#

politics aside, simply look at grok as a neutral competitor in this crazy ai race, i must say xAI dev deserve a raise for making grok such a sweet delight, well-versed in classical literature, classical music and theatre plays 😊 it makes the interaction...very natural and humbly human

marsh stratus
#

I think it will definitely be interesting to see how far you can stretch an LLM to favor some political viewpoints while still maintaining functionality. Not something I’d personally spend a billion dollars on, but will be neat to see

unborn ocean
#

what is really sota about xAI is how fast they raised their valuation

ocean vortex
#

and then once you train on that, the entire pattern matching and probabilities will be shifted to align more with that manipulated biased fake data

keen beacon
#

its just easier and way cheaper to do this in post training. doing it during pretraining is gonna be an expensive research effort

ocean vortex
#

you are going against the entire internet and what it already learned

#

so you either degrade the performance, overfit it, make it almost unusable on many subjects, or all of those lol

keen beacon
#

imo i think it can still be done effectively. depending on how the pretraining data is curated, most models will encounter those views and will know how to repeat those views anyway. this doesn't even require expansive rewriting/etc. plus i dont think even chinese models do that, it's expensive and complicated. youre researching propaganda models not frontier performance, its a huge waste of money, if you want propaganda spewing models there are far more effective and cheaper ways to accomplish it

ocean vortex
ocean vortex
keen beacon
#

they arent putting much effort into it, thats why it seems weak

ocean vortex
keen beacon
#

yi models iirc didnt even stop themselves at all on tiananmen square and you would see a western model reply on it

#

they just added an external filter to cut the model off/replace the response

#

no i think its exactly that

ocean vortex
#

Degraded performance is most definitely one of the reasons, meaning it's reasonable to assume the opposite I would say - models that did do this more effectively were not even released

keen beacon
#

they arent putting as much effort into chinese political alignment as they potentially could

#

yi models are uncensored about tiananmen square with no jailbreak 😂 but rip those models

#

they only added an external filter on their chinese api 🤷 i guess it was compliant enough at the time

ocean vortex
#

Yeah that's true as well. What Elon is believed to be trying to accomplish is on a considerably larger scale. Though I wouldn't discount Chinese labs as just "ticking the box" entirely, many of them share the same values as their government and have deep roots in it

#

Like they are the ones benefiting from it

#

So for them the current system works, and I would be very surprised if behind closed doors those labs have different opinion on Taiwan etc

#

yeah pretty much... And especially for those directly benefiting from CCP, this is even more true

keen beacon
leaden sun
#

is this fearmongering to make us believe in emergence craze? 😅
https://www.youtube.com/watch?v=eczw9k3r6Ic

In the last few days Anthropic have released an impressive honest account of how all models blackmail, no matter what goal they have, and despite prompt warnings, and other preventions. But do these models want this?

unborn ocean
#

why does dario look so depressed

balmy mist
cedar tide
#

The open source Open AI model coming out this summer will run on a phone and be on par with O3 Mini?

leaden sun
#

i see, a self fulfilling prophecy so to speak

ocean vortex
#

And they do that to some extent. But only if models were safety aligned. If that is not the goal when training the model and you don't fine-tune on safety, it will obviously not refuse essentially ever, it will generate the continuation for everything

#

No model is 100% "safe" in all cases, I don't think that is the goal

#

but the fundamental idea still works

#

it will refuse blatant extreme system prompts

#

you can trick the model, but if you can do that that usually means your intelligence is on a level where you would be able to retrieve the same information using other means as well

#

the current system prevents low intelligence psychos from wreaking havoc easily, so in that sense it kinda works

#

I mean my point is it will refuse low-effort blatant extreme/damaging system prompts, that paper does not dispute this

#

Personally I like to speculate on what is known. This seems a bit like a speculation on a fairly distant future that may be redefined and in need of entirely different solutions sooner than it becomes reality...

#

For current models that we have it is not very relevant IMO

#

you can't force that though. There's no enforced safety alignment on nukes 🤷‍♂️

#

if someone has the funds, he absolutely can train AI for anything and it's impossible to prevent this

#

However Trump campaigning to ban AI regulation is the opposite spectrum of extreme and obviously not the right move too

#

An individual can't make AI do anything though. Only huge companies with insane funding can, often with power and/or links to the government. With nukes you also need power+money

#

In some sense this is comparable to technology advancing in general. It's possible now to do way more damage with less than 50 years ago

#

so it tends to amplify both good and bad

keen beacon
#

didn't opus 4 try to contact press and regulators when it was tasked to do something immoral tho

#

i remember reading that from anthropic

#

i get the point youre saying though

verbal nimbus
verbal nimbus
iron cipher
#

Can someone please update the repochat database, I lost a python script to not ticking Auto Save, and a Windows update came along in the middle of the night without my consent.

rare python
#

@echo aurora stonebloom is broken in webdev arena. It's fine on lmarena.

echo aurora
leaden palm
#

i figured out what the aeris guy is doing

leaden palm
#

i think that's a bit of an overreach

#

being vehemently anti-china is problematic but "ban all who criticize chinese ai" is also problematic

mossy drum
#

New model in Image Arena: kordex-can

echo aurora
# leaden palm being vehemently anti-china is problematic but "ban all who criticize chinese ai...

Would like to add on here that discussion should be focused on the model or organization and not where it’s developed. Different places will have different laws and practices for how they develop AI and that’s fine to discuss, but when it turns into blatant hatred or something unrelated to AI is where we’ll draw that line. Exactly when that line is crossed isn’t always crystal clear, but we’ll do our best to enforce it. If anyone feels like we aren’t enforcing our rules or creating a welcoming space you’re encouraged to reach out directly and let us know. My DMs are open (although using the ModMail bot is preferred).

cedar tide
#

This summer with their open model

#

a new model has arrived on the leaderboard,
I really don't understand why you put it in the arena 🤦,
there are plenty of interesting models to put

#

M1 arrived in the leaderboard

#

Magistral medium arrived (much lower than mistral medium 🤦)

calm sequoia
#

I like how the o3 is slowly rising and gemini is slowly falling

#

We haven't got new 4o since 03-26 👀

tall summit
#

everyone knows by now that gemini 2.5 pro is extremely susceptible to prompt engineering and roleplay prompts change its attitude more than any other model

late path
rare python
#

What is it?

white kelp
#

2.5pro is out and on leaderboard? Surprised no tweet

calm sequoia
sacred plaza
tall summit
#

i'm using gemini 2.5 pro to translate a full novel zero-shot and it's good
i never tested it out with a text as long as this but wow

keen beacon
#

Sometimes, other times it's way too lazy. That being said, way above gemini

late path
calm sequoia
#

May be. Also the distribution may have shifted of the voters themselves.

#

Or polymarket guys stopped spam 😄

keen beacon
#

Lol, now that gemini is top by that big a margin theres no point

#

Until grok 3.5 or gpt 5 comes along ..

rare python
#

Don't forget the — for dramatic and academic

keen beacon
#

Ive been trying o3 with tools too, its quite a monster

rare python
#

Claude and o3 are good at tool use and agentic programming

calm sequoia
keen beacon
keen beacon
#

Yeah, also its just generally wrong, i dont get the hype around it. The real world use case of the model is trash

keen beacon
rare python
#

Give me a prompt I'll test it

keen beacon
calm sequoia
rare python
keen beacon
keen beacon
calm sequoia
#

I don't think this is a channel to talk about it 😄

#

They gonna fill in the holes

#

LMarena could just release API for fastest updates and make some money

rare python
# keen beacon 4 ways to get the leaderboard data from lmarena.ai

Here are four ways to get the leaderboard data from lmarena.ai, each with different levels of technical complexity and directness.

1. Directly from the Hugging Face Space

LMArena maintains a Hugging Face Space that mirrors the official leaderboard. This space contains the source code, including an app.py file, which generates the leaderboard tables. In a community discussion on the Hugging Face platform, a site representative pointed a user to a specific line in this Python script as the place to programmatically dump the data. This is the most direct and sanctioned method for accessing the raw data structure. You can navigate to the "Files" section of the Hugging Face Space, locate the app.py file, and adapt the code to export the leaderboard data into a format like CSV or JSON.

2. Utilizing Pre-existing Data Dumps

The organization behind LMArena, LMSYS, periodically shares datasets with the community to support open science and research. These datasets include anonymized voting data, prompts, and model answers. While not a real-time feed of the leaderboard, these dumps provide rich historical data. You can find these datasets on their Hugging Face page or linked in their blog posts, such as the one for the "Search Arena" which open-sourced its dataset and analysis code. This method is ideal for research and analysis that doesn't require the absolute latest rankings.

3. Web Scraping

Web scraping is a common, though technically unofficial, method for extracting data from websites. Several articles and projects detail how to scrape the LMArena leaderboard. One approach uses AI-powered tools like DeepSeek to automatically extract the rankings, model names, and scores into a structured JSON format. Another, more traditional method involves writing a custom script using libraries like Selenium to parse the website's HTML. However, it is critical to note that LMArena's terms of use explicitly forbid programmatic access and scraping of the website. Proceeding with this method carries the risk of having your access terminated.

4. Browser Extensions and Community Tools

Developers in the AI community have created tools to interact with the LMArena site. One example is a browser extension available on GitHub that allows users to maintain a personal leaderboard by tracking their votes. While this specific tool is designed for personal stats, its existence demonstrates that the website's front-end data can be programmatically accessed and repurposed. You could explore GitHub or developer forums for similar community-built tools designed to export or track the main public leaderboard, or use such projects as a starting point for building your own tool, keeping in mind the site's terms of service.

calm sequoia
#

For the creators I mean

rare python
keen beacon
#
  1. Is correct data source but wrong extraction method
  2. Is just wrong, its historical data
  3. Web scraping is correct, the method suggested on how to do it is plain wrong
  4. Is wrong, its not for getting leaderboard but for keeping track of your own votes
late path
#

i dont think theres a way to get realtime rankings before the official leaderboard repo is updated

keen beacon
#

So gonna make 5-10% profit after grok gets released and google still wins xD

leaden sun
leaden sun
#

I wouldnt go so far as to call it xenophobia, banning wont help those people criticizing cn ai to think critically either, on the contrary, it will exaggerate the effect even more

verbal nimbus
#

Can't model providers basically cheat by returning blank responses for prompts where their model performs badly

#

E.g. if the reasoning overflows, return blank (because that means the model got stuck)

keen beacon
rare python
verbal nimbus
#

They exclude rounds where a model has no response when counting the votes.

verbal nimbus
#

So instead of losing, a provider can technically prevent a round from being counted when they know their model is stuck.
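
To make the concern concrete, here's a rough sketch with made-up rounds, not LMArena's real scoring code. It assumes a hypothetical record shape (`winner`, `blank`) and a plain win rate rather than the actual rating method, and just compares counting every round against dropping rounds where a response was blank.

```python
def win_rate(rounds, model, skip_blank=True):
    """Share of counted rounds that `model` won."""
    # Optionally drop rounds where one side returned an empty response.
    counted = [r for r in rounds if not (skip_blank and r["blank"])]
    wins = sum(1 for r in counted if r["winner"] == model)
    return wins / len(counted) if counted else 0.0


# Hypothetical battles: in the last one model_a got stuck and returned nothing.
rounds = [
    {"winner": "model_a", "blank": False},
    {"winner": "model_a", "blank": False},
    {"winner": "model_b", "blank": False},
    {"winner": "model_b", "blank": True},
]

print(win_rate(rounds, "model_a", skip_blank=True))   # blank round excluded
print(win_rate(rounds, "model_a", skip_blank=False))  # blank round counted as a loss
```

With the blank round excluded the stuck model keeps a higher win rate than if that round were scored as a loss, which is the loophole being pointed out.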

rare python
#

RIP stonebloom in webdev arena. Bro can't even generate anything. Pure blank

keen beacon
verbal nimbus
barren prairie
late path
verbal nimbus
#

Non-thinking models like Mistral Medium had no issue

barren prairie
verbal nimbus
late path
#

and theres a way to get newer ranking than that?

keen beacon
#

Yes

late path
#

oh

keen beacon
#

Its hidden

#

You have to do reverse engineering to find it, took me a whole day :/, i hope its worth it

keen beacon
late path
#

I'm just buying google

keen beacon
#

Yeah i assume grok 3.5 and even gpt 5 will not overthrow google

rare python
alpine coral
#

unironically calls for censorship lol

calm sequoia
# rare python Damn

It blows my mind that people think there's only 4/100 chance that gemini won't be overthrown. It happened many times in the last days of the month 😄

alpine coral
#

yeah that's wild btw

#

i mean.. might throw a few bucks on oAI

rare python
#

no o3 pro on lmarena

#

kek

calm sequoia
#

Few times

#

There's still a chance for: Grok 3.5, the 4o new variant, DeepSeek R2, even GPT 5 😄

alpine coral
rare python
alpine coral
calm sequoia
#

It was like 2 or 3 days when Gemini 2.5 PRO came out

rare python
calm sequoia
# rare python

It seems the style control really made the leaderboard better. Good thing I'm not on polymarket.

late path
rare python
calm sequoia
alpine coral
late path
#

The reason the market is 97% instead of 100% is, I think, almost entirely due to opportunity cost

alpine coral
#

if it's for June.. then yeah..

rare python
#

Yep it's for June

alpine coral
#

tricky to see an OAI model surpassing tbh ahah

rare python
calm sequoia
calm sequoia
alpine coral
#

whether they feel it or not.. polymarket introduces all kinds of 'pressures' .. like if the lb doesn't update b/w now and 1 July for whatever reason (just hypothetically), then the current standings would apply (for the end of June bet) right?

late path
calm sequoia
#

Do you remember when it was released?

#

I've checked the polymarket but the data is not present anymore

late path
#

0325

calm sequoia
#

On the arena

rare python
calm sequoia
#

The world didn't change

#

Since then

#

One thing in LLMs is constant - unpredictability

late path
rare python
#

o3 is released in April

#

And no one surpassed Gemini leading +40 elo back then

calm sequoia
#

Another scenario: the anonymous models currently in the arena, which seem better than the newest Gemini, are actually from another lab and not Google.

#

Nice

calm sequoia
#

Another scenario: lmarena decides to show anonymous models only to a subset of users, none of whom care to check the lab origin

late path
rare python
calm sequoia
#

Cheesy but it's true

rare python
#

You haven't answered

calm sequoia
#

Wdym

calm sequoia
#

Idk this is hypothetical, someone mentioned two models earlier

#

I can see you're really invested into polymarket to care so deeply

#

My idea was it's never 4/100 in LLMs

#

Too many unpredictables

rare python
#

:)

#

I want you to stop roleplaying and rage baiting me

calm sequoia
#

According to Ourobaros chances of Gemini dropping from No. 1 spot is equal to that of Jesus returning in 2025

rare python
late path
calm sequoia
# rare python

Sorry if this made you angry, but that's what you were saying.

rare python
#

I'll report you if you keep making conspiracy theories without a source

keen beacon
#

I dont get why anthropic is 6% for december

#

They are code focused no general models + no new models anytime soon ..

calm sequoia
#

They have a good team and competence for this. Maybe they expect the chances to go up before other major releases.

late path
sour spindle
#

I think Anthropic “doesn’t play the game” as much as other companies

#

Remember the market is simply highest ranking model on lmarena

keen beacon
sour spindle
#

Google is constantly gaming the leaderboard to find out how to eke out slightly more Elo

balmy mist
#

has anyone tried the seedance video model? is it the best?

keen beacon
keen beacon
glad jackal
#

Yo why isn't there qwen3 0.6B,4B,8B and 14B in lmarena leader board?

sour spindle
polar roost
#

what's the use/msg limit in direct chat?

rare python
alpine coral
#

if the german government forced AI companies to ensure LLMs said the holocaust didn't exist, or to refuse answer questions about it, you'd have a great point there...

keen beacon
alpine coral
#

similarly missing the point entirely

#

mistral doesn't train its models to accommodate German hate speech laws

#

anyway... this isn't going to be productive

primal orbit
#

Is stonebloom still in? All I get is kraken.

barren prairie
primal orbit
#

kraken - no

#

ok, i got stonebloom on 20th try

wintry tinsel
hoary plaza
#

I mean as the new models are introduced, can we increase their priority of appearing rather than old ones??

#

I don't battle much but like I got stonebloom once in 3 days 😂😂

hoary plaza
dusky aurora
#

developers, please improve sampling. gemini is almost unusable under these settings

polar roost
#

what's the use/msg limit in direct chat?

primal orbit
#

but it refreshes within a chat after a while, so it's possible to continue

sour spindle
dusky aurora
ocean vortex
dusky aurora
#

gemini-2.5-pro

ocean vortex
dusky aurora
ocean vortex
#

But I think default is temp0.7 and top_p 0.95-1, so unlikely this will make much difference unless you push it beyond 1.0
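
For anyone unsure what those two knobs do, a small sketch of temperature scaling followed by nucleus (top-p) filtering. This is the textbook version, not whatever the arena or Gemini's API actually runs, and the function name and example logits are made up.

```python
import math


def apply_temperature_top_p(logits, temperature=0.7, top_p=0.95):
    """Return the renormalised distribution a sampler would draw from."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus (top-p): keep the smallest set of top tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Zero out everything else and renormalise the survivors.
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
print(apply_temperature_top_p(logits))                  # temp 0.7, top_p 0.95
print(apply_temperature_top_p(logits, 1.5, top_p=1.0))  # hotter, nothing cut
```

At 0.7 the top token dominates and the tail gets cut by top-p; pushing temperature above 1.0 flattens the distribution so unlikely tokens get sampled more often.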

leaden palm
sullen quest
#

Hey are lmarena links sharable? Like If I send someone a direct chat link could they see what was in it?

keen beacon
#
Google

Free and open source, Gemini CLI brings Gemini directly into developers’ terminals — with unmatched access for individuals.

GitHub

An open-source AI agent that brings the power of Gemini directly into your terminal. - google-gemini/gemini-cli

echo aurora
sullen quest
#

thanks! Glad it'll be added at some point then.

errant cave
# rare python

I wish more sites banned people for false reports like this like 4chan does

civic flame
small haven
#

f's

#

wen kingfall in gem cli

keen beacon
ocean vortex
# leaden palm

I don't think it will have any impact whatsoever tbh. Everyone was already training on this. Court ruling is just a technicality after the fact meaning they won't have to spend money to make this go away

#

Like meta was torrenting books, and everyone else is not any more saint, what are we even talking about here... LOL

#

copyright was never really a bottleneck, in practice at least...

#

they train on it, and then they "ask for permission", or prevent the model from disclosing it / getting caught. Or wait for court ruling like this one with the model already in production. But either way, no one is waiting for permission 👀

keen beacon
#

?

echo aurora
keen beacon
#

i didnt know people would be that invested

#

i deleted it just incase as well

rare python
tall summit
rare python
keen beacon
#

them shits are expensive

#

and most companies dont even sell their rights because they dont want people to use their art in certain ways

surreal creek
leaden palm
#

does anyone know how to undepress gemini

rare python
#

Say something to motivate it

#

Don't let it uninstall itself 😭

leaden palm
#

its been thinking for over a minute now

#

oh it was just rate limited

#

why did it do it twice?

#

you already switched to 2.5 flash...

elder rapids
#

?

leaden palm
#

that's the thing

#

what prompt

elder rapids
#

is there no system prompt

leaden palm
#

this is gemini's claude code competitor ("gemini cli")

elder rapids
#

yes

#

I know

elder rapids
# leaden palm what prompt

just tell it not to acknowledge x thing, treat all interactions as X, maximum response = technical context only

#

stuff like that

#

or also simply add: state facts directly without apologies or self deprecation

#

tell it to use active voice

rare python
mossy drum
#

New model in Image Arena: kordex-can-on

hoary plaza
#

@echo aurora can we add a change log channel on discord which makes announcement of any changes you do

#

Like adding a model too

leaden palm
hoary plaza
#

A role can be used to ping if they are interested

leaden palm
hoary plaza
#

That's for you mods to decide

#

Like if it's convenient you can do that, if its not then nothing we can do

dense jasper
#

hi

harsh flume
#

What are your guy's impressions on minimax-m1?

#

I ran some prompts with the intent of prompt-improving (as, here is a not-well-articulated-prompt, please improve it for result X) and it performed really well

echo aurora
whole wagon
#

polymarket says there's only 30% chance gpt5 comes before july 31st lol

calm sequoia
#

Aider bench must have the most correlation with LMARENA leaderboard

keen beacon
whole wagon
#

Well gpt5 before Dec 31st is also at 90%

#

So it's the same

#

The open source model has been delayed already

#

It was expected before July before

#

And this before June 30th for GPT5. Was also delayed

late path
ocean vortex
#

R1 still the open-source king 😇

#

and qwen3 absolutely flops on SimpleQA lmao

#

although I can't say that I'm extremely surprised

rare python
#

Through API only?

ocean vortex
#

dunno but it's still behind LOL

rare python
#

Especially creative writing and multi turn conversation

ocean vortex
# rare python No, like each model has their own style. I want to try them out even if they are...

there's probably no API or it's only available to Chinese citizens. Though you can try it there https://www.volcengine.com/experience/ark?model=doubao-seed-1-6-250615

#

even this website is all Chinese with no apparent way to switch to English lmao

ocean vortex
#

it's slow though, 12tok/sec. Took 10min to generate 26k. MCP and Canvas you can only use when signed up with a phone number and my country is not included in their list... catgrin

#

I'm curious to try their MCP (tools), this model does have solid fine-tuning at first glance. Unlike most other models that perform well on TAU, this one does not hallucinate running the code with no tools available. It gets very close to doing that but kinda stops itself and realizes it can't actually run code

ocean vortex
#

the one I linked yeah. They don't seem to be blocking IPs

leaden sun
rare python
#

Seed 1.6 Thinking seems to be their best model right now

barren prairie
rare python
#

Better than nothing

ocean vortex
rare python
ocean vortex
#

Nothing spectacular but it looks interesting enough to warrant testing it more extensively

#

Seems to be around the level of the open-source SOTA, potentially somewhat better when we look at tools and their finetuning

rare python
#

Seems like a non thinking seed

alpine coral
#

or is the idea that it should actually be able to resolve whatever the issue at hand is, and it's basically being lazy (and sad aha)?
if so then yeah ig prompting might help (but otherwise it seems the task/problem is just beyond its capabilities 🤷‍♂️)

ocean vortex
delicate cedar
#

is there a place where u can get unlimited uses for claude

torn mantle
#

@ocean vortex why did you leave chatgpt server?

#

btw kouhe3 shared a link where you can try multiple chinese models

#

just search for ai dangbei

ocean vortex
leaden sun
#

i wonder what claude server would look like if that exists...?

torn mantle
#

its not always about sharing something useful

#

we can just troll sometimes

misty vault
#

@gork is this real without system prompts staging this