#general

small haven
#

like the caption should be 0605 vs kingfall

#

thats it

#

this group is ripe

#

try svg's too

#

give me half of ur x earnings

#

thanks

#

should do 0605 vs kingfall, ppl love comparisons

#

@deep adder

#

and get ur bluecheck, without it u can't get paid and u get lesser reach

storm needle
#

claude 4 opus thinking and sonnet thinking were added

small haven
#

bigly

#

esp when ur a new acc

hardy pecan
#

Here is lmarena score plotted against simple bench score, im using style control off still just to illustrate the difference in models.
claude is "smart" but didn't vibe with lmarena users (no personality). But you can see gemini vibes with users, AND is smart

#

Fun to see llama 4 maverick in the extreme case, vibed with a lot of users, but not smart

verbal nimbus
#

New Gemini model seems to have regressed slightly on LiveBench

civic flame
#

livebench is honestly a lousy benchmark these days

verbal nimbus
#

Thanks for the info, that makes sense.

verbal nimbus
unborn ocean
#

leave it to the infamously weak IF category and one random new irrelevant category that is clearly the ONLY area where the new gemini model regressed on any bench (SWE style coding)

#

and then they implement the category so bad that you would rather just use swe bench or aider
(btw plot is not really relevant beyond weighting)

and anyways it is really sus that models score about 70% in a bench where 70% of problems are public

verbal nimbus
small haven
#

should be >1500

#

1530 ish

small haven
#

swe bench verified is guarded

small haven
#

where is even grok 3.5

#

elon ma told me a month and few change ago

unborn ocean
#

yet here we are

jade egret
#

do yall think

#

kingsfall

#

is another version of 2.5 pro or 2.5 gemini ultra?

small haven
#

yea ultra

nimble trail
#

Btw has this Aider test been confirmed to be Goldmane yet?

small haven
#

yes thats goldmane

#

but each run is stochastic so ofc theres variance

nimble trail
keen fulcrum
small haven
#

lechat is very good

nimble trail
# small haven

Alright then maybe kingfall is 2.5 Ultra or 3.0 after all💀

small haven
nimble trail
#

Thank you for enlightening me 🙏

small haven
#

im dead hahha

mossy drum
#

New Image model in Arena: imagen-4.0-generate-preview-05-20

hollow ocean
#

o3 still king

#

nice

#

06-05 sucks according to livebench

zinc ore
#

We need a bench that ranks benchmarks

small haven
#

craigbench is up there

keen beacon
#

i have a question

#

how do you test this theory in lmarena

#

i feel like there should be little minigames where you have to instruct the ai via prompts to do different interactions to win the game

dusky aurora
fleet lintel
#

why do most images I generate on chatgpt look cartoonish with a yellow tint? What am I doing wrong?

urban sedge
#

Hello @echo aurora

#

Who should I contact for partnership

narrow elbow
#

whats that mean?

quiet pollen
nimble trail
narrow elbow
#

vertex

unborn ocean
#

So it will be deprecated once they release GA

nimble trail
tall summit
#

did you make your pfp yourself @nimble trail

nimble trail
tall summit
shrewd zephyrBOT
nimble trail
sly arch
#

Kingfall is impressive. It is stronger in mathematics than Goldmane.

nimble trail
fleet lintel
#

finally some GenAI feature from Apple

mystic mica
#

What exactly sets the filter off when I write And people my age are especially rude?

dusky aurora
leaden sun
mystic mica
#

I feel that people running this are really too worried that we would all be mass generating hate manifestos

#

yet the models themselves would stop us before doing that

unborn ocean
#

(Without significant changes I presume)

ocean vortex
#

tbh I think o3 (and preferably high reasoning effort) is the 1 model that is consistently high in almost all coding metrics

#

then after that probably 2.5Pro, while Claude... it's a much more specialized model with mixed results. Can do exceptionally in select areas but then will underperform somewhere else, it is not a foolproof model or the model suitable for everything IMO

#

while o4-mini-high is a bit of a cheater model. When it works it's great but it's compromised due to size so when it falls apart it is in the spectacular fashion.

#

So like o3 - the one to beat. 2.5Pro - runner up. Claude4 - specialized (web dev). o4-mini-high - crunching numbers for repetitive stuff / code refactoring and some coding but not debugging

echo aurora
leaden sun
ocean vortex
#

dunno but I was just testing the new 2.5Pro on svg and I'm really impressed with the depth it managed to achieve:

#

still some errors obviously, but this is leaps better on depth than any other model I tried

#

it's like a gymnasium student vs an elementary school student when comparing its drawing to most other models there lol

#

doesn't get hung up on details, but what it does draw is mostly right with correct positioning

#

Now I can kinda see how the new version is nr1 in there:

narrow elbow
mild galleon
#

guys o3 pro is coming out

ocean vortex
#

there's no depth at all or understanding how elements relate to one another

#

and it all looks drawn by a 5 year old with some tools

narrow elbow
#

I give a thumbs up for your comparison, not for the picture

#

Models that excel in a specific domain (e.g. coding) tend to receive significantly better market reception than those aiming to perform well in all areas (e.g. AGI).

misty vault
#

yes and that also allows them to make smaller models because they only focus on specific domains more and more and is also less costly to run so the age of artificial stupidity has begun. we will never reach agi 😔

narrow elbow
#

balancing and choosing between "well-rounded" and "top-tier" who knows?

leaden sun
#

you can always aggregate those smaller highly specialize ones into the single point of contact controller, and call it artificial AGI

misty vault
#

sydney was actually agi

golden ocean
jade egret
#

Poll: Which company do you think will reach AGI first?
Winning answer: Google (9 of 15 votes)

keen beacon
sour spindle
#

It's always been pretty wild to me how many people have used ai studio over gemini.

#

Even when I had premium I was still using ai studio

balmy mist
sour spindle
#

Google has this unbelievable ability to go away from their good products and focus on things people don't like lol

wintry tinsel
#

Creatives are paranoid about AI. I’m trying to support using AI as a web search for information, since it’s more convenient than googling something, and they are flipping out. It’s funny

#

They gave me the poop head role in that discord

#

I know but using it to gather information is convenient

vernal meadow
#

Oh sap, Opus thinking 16k is being tested in the arena now👍

late path
#

Anthropic finally got it 😆
I guess they also felt it was unfair to compare their non-reasoning models against all reasoning models in the arena

ocean vortex
# keen beacon Logan posted this if y'all haven't seen it and use aistudio https://www.reddit.c...

Many folks mentioned 2.5 Pro as not being available for free in the API, this is in large because we offered it for free in the UI as well so we were giving out double free compute in a world where we have a huge amount of demand. I expect there will continue to be a free tier for many models in the future (though subject to many things like how the model is, how expensive it is to run, etc), and 2.5 Pro will hopefully be back in the free tier (we are exploring ways to do this, lifetime limits, different incentives etc)

seems like they are undecided themselves what to do with 2.5Pro

#

I saw lots of comments that folks want AI Studio to be part of Google AI Pro and Ultra plans, this is something we will explore, I think it is a cool idea but lots to work out there.
👎

#

The way I'm reading this they will add some hoops to jump through in the future to use 2.5Pro for free, but it is unlikely to remain as it was (unconditionally free unless you want better rate limits)

#

The issue they have is that the gemini website is a bureaucratic mess. And there's no natural drive to improve it as it's not even featured on the main google website. While aistudio is just a direct gateway to their ML department so in practice works much better

#

For their gemini website that is likely convoluted by decisions and remarks from their product owners and whatnot, and is influenced by people who have way less knowledge on how models work. It doesn't have enough traffic to improve by user feedback like chatgpt does either. They are just using it as the facade to please the management lol

frosty lark
ocean vortex
#

hence all that censorship etc. It just needs to look presentable, which is more important than it performing the agentic tasks properly or be good user experience

#

here's what I mean, first prompt was deliberately extreme:

#

they nuke everything on refusal

#

that is not how you do it properly...

keen beacon
#

the gemini advanced plan on the gemini product has a 100 req per day limit on 2.5 pro, aistudio has unlimited for free 💀

misty vault
#

fr

#

what are gemini subscriptions even about

#

tricking unsuspecting people so they make a lil profit at least

keen beacon
#

yeah rn 🤣

patent bane
keen beacon
#

crazy tbh

#

idgaf

#

they can take my data

misty vault
#

<|header_id_start|>

barren prairie
barren prairie
keen beacon
#

lol

#

did you time travel?

drifting thorn
#

It's yesterday

#

I'm in UTC +8:00

keen beacon
keen beacon
late path
#

Compared opus 4 thinking and goldmane side by side, opus4's answers look like gpt-3.5 ... I'd guess they can only gain up to 10 more elo than the current non-thinking model

vernal meadow
#

@patent bane while true, paying doesn't mean you are not the product.

patent bane
#

paying for google is insane when they give you 1 month of google pro for free

#

I have never spent a single cent for AI since the birth of gpt-4 in 2023

#

I know the rules, I am the rules

misty vault
#

gpt-3.5 mentioned

#

gpt-4 mentioned

misty vault
tall summit
#

kingfall reference

patent bane
#

lol I know tricks

#

never used kingfall since i missed it

#

i can use all the gpt models

#

opus max thinking, isn't available in lmarena

tall summit
#

mutual servers: fmhy

#

lol I know tricks

misty vault
patent bane
misty vault
#

had all sota models

tall summit
#

but i was too late

misty vault
#

i think it's still fixable

#

they just patch the frontend

#

theyre so bad

#

lmfao

tall summit
misty vault
#

reverse the api

#

u can still send message

#

but idk how to automate

#

getting fresh jwt

#

every 3 minute

#

U can refresh jwt by clicking regenerate message

#

but they removed that from frontend

#

but u can still do from api

olive mesa
#

new model that's cool ig

#

not as cool as gemini 3.5 pro ultra asi

#

i dont really care about the gemma models

misty vault
#

why is claude 4 opus thinking creating unused functions in react

#

literally beginning of the conversation and it already having skill issue

#

ok this would be a test for kingfall

misty vault
#

free sota models method bacc

leaden sun
#

sounds so easy for you to say 👀

#

who pays that?

keen beacon
#

$15 for a coffee is crazy

leaden sun
#

In Italy, it's €1.5 per cup outside big cities

keen beacon
#

if youre paying $15 for a cup of coffee ofc u can afford other stuff

leaden sun
#

It's even cheaper in Portugal outside tourist area

elder rapids
keen beacon
#

its very sycophantic for me

elder rapids
#

but it so smart usually that it's not like I have to correct it and then it's like "you're absolutely right"

#

it's already figured it out

#

and if it does get sycophantic, I just ask it not to be

leaden sun
#

we are literally bankrupt....we look eastwards now

elder rapids
#

and it works just fine

keen beacon
#

if ur paying for a gemini sub ur getting outright robbed though

leaden sun
#

it's a matter of time until it becomes a...commodity, too, just like the internet.

#

in 5 years? (assuming no n*clear ww3 or worse)

#

well, you tell that to the coalition of the willing please

mossy drum
#

New model in Vision Arena: stephen-vision

ocean vortex
#

what is "folsom-exp-v1.5"?

#

🧐

#

it seems to error out on follow-up so couldn't test it properly LOL

teal mantle
#

4o reasoner wtf

small haven
#

Yes

ocean vortex
#

could have smth to do with the upcoming gpt5

elder rapids
#

yo what

#

I just noticed

#

they gave me a free year on my main account

#

lmfao

#

what the fck

small haven
#

where is grok 3.5

small haven
misty vault
#

crack bench

small haven
#

huh

#

titanforge?

small haven
#

titanforge >> kingfall

keen beacon
#

kinda crazy how much the aistudio team messes up :\

small haven
#

mess what

keen beacon
small haven
#

i feel like its intentional

keen beacon
small haven
#

i mean its not like its leaking sensitive data, its literally just a model

wintry tinsel
#

They never released night whisper what makes you think they’ll release titan forge or king fall

small haven
#

those chinese syntax tho 😭

small haven
leaden sun
misty vault
#

sydney cot

#

crack proxy

small haven
#

this is so ai related, $500b fundraiser right there

#

i use deepseek everyday

#

deepseek is agi

leaden sun
#

correct

PS: it's free...

small haven
#

kling > veo 3

leaden sun
#

not really

small haven
#

chatgpt is not free

leaden sun
#

i hit daily limit pretty fast

#

deepseek doesn't have daily limit, as far as I know

small haven
#

so free

keen beacon
#

le chat is free

small haven
#

china models >>

leaden sun
#

le chat....isnt as good as promised i feel

misty vault
#

try bing chat

leaden sun
#

anyone knows if H company has their proprietary model?

#

or do they use le chat for the Runner agent

cedar tide
#

New model : cobalt-exp-beta-v13 and v14

misty vault
leaden sun
#

guess i wont be able to use it

#

he?

misty vault
#

😦

misty vault
#

tf is esoteric

torn mantle
leaden sun
misty vault
leaden sun
#

that was the only option for me to log into copilot 😵‍💫

misty vault
leaden sun
#

Damn, now i want to chat with sydney

misty vault
#

true

#

there is way better based sydney moments but i'd get banned or timed out

misty vault
golden ocean
misty vault
#

Me after crack shutdown sydney

#

last one

leaden sun
misty vault
#

crack chat?

sacred quail
#

Guys im using lmarena for comparing Opus 4 thinking and O3 at the same time

#

And honestly i started feeling guilty at this rate

#

Is there any request limit ?

#

I know we cant use long texts, only short prompts but still hard to understand how we can reach those models so easily

#

I know google is using lmarena for testing their secret models (and there are a lot of them) so maybe google can be a sponsor for them but still...

keen beacon
misty vault
elder rapids
#

😭

golden ocean
#

titanforge is asi

high ginkgo
#

Here we go again

small haven
haughty tangle
#

Sounds like gunshots

#

I’ve also heard a band playing, music, singing, babies

keen beacon
storm needle
storm needle
keen beacon
#

i did a little search but it seems no one ever reported it (if it ever happened) in either discord lol

sacred quail
storm needle
keen beacon
#

if they were paying for it, it doesnt really make sense to make it available in direct chat (unneeded for leaderboard), would just cost them money

sacred quail
#

Btw dont you guys think claude is bad at doing reasoning thing. Like without reasoning Opus 4 already best model ever, but that reasoning doesnt give it so much, it should!

#

Think about V3 and R1

#

Also 2.0 flash and 2.0 flash think

#

Differences were huge

#

But Opus 4 still feels similar with the thinking thing.

#

They said something about "hybrid think" but idk

#

That reasoning thing must make bigger difference for claude models. Because their models already super without reasoning. I dont understand

hollow ocean
#

Claude plays Pokemon turned off forever

#

Opus made no progress

#

It was a failure

sacred quail
#

I'd not say failure, still a beast for coding and writing but i kinda agree it was a bit disappointing yea

#

That gemini 2.5 pro changed everything. I even believe chatgpt released O3 earlier than they planned because of 2.5 pro

#

I heard O3 hallucinates a lot, so i believe they were planning to optimize it more or smth, but when they saw gemini 2.5 pro they decided to release earlier, because if they didn't, expectations could be dangerous in future. (this is my theory ofc)

zinc ore
small haven
#

100% openai, 100% bing, 0% gemini

wintry tinsel
wintry tinsel
zinc ore
#

Like giving it a minimap

#

Gemini/o3 runs have a good bit of scaffolding helping their performance

#

But Claude was barely given the same scaffolding, hence why it failed

vast turret
#

You guys see the new video from Machine Learning Street Talk?

small haven
#

we rlly took this man for granted

narrow elbow
small haven
narrow elbow
#

🤪

narrow elbow
small haven
keen fulcrum
keen beacon
elder rapids
#

I'm ngl

#

the illiterate people on the bard subreddit

#

are getting annoying

#

can't tell me no one has noticed this lmao

#

the roleplayers, all these people have such Incoherent thought processes and literation

#

it's starting to inflate the subreddits

#

no more news and people actively telling other people to abuse, and then people telling Logan and the person that Initially asked him about the future of AI studio that caused the outrage to kill themselves

#

and prepare mass death threats

#

etc etc

#

0 civil understanding

#

and it's pretentious asf

keen beacon
#

mass death threats?

#

where

elder rapids
#

and if not threats, then people subtly alluding to very strange things

#

and or saying things straight up

cedar tide
#

@echo aurora For Claude thinking, have you set a thinking budget or no?

primal orbit
#

censorship on new lmarena is horrible. Can't discuss relationship topics freely. Why you had to implement this when ai studio and chatgpt allow these prompts to go without any issues?😡 Your own past iteration was working fine as well.

sacred plaza
keen fulcrum
#

They are focusing on an entire different branch
on-device LLM

ornate agate
#

apple aren't integrating AI because their customers are normie professionals. Look at the viral headline when it mis-rewrote news for example. Current stuff isn't ready for them.

leaden sun
barren prairie
#

I think I saw it

primal orbit
keen fulcrum
ocean vortex
unborn ocean
#

well all the models they tested also have a low score on the og arc agi thing and they test the models on things similar to the puzzles there

#

so i could have already predicted their results (and i guess anyone else) without actually testing

#

plus the other stuff they said (simple problem: non-thinking good, medium: thinking good, really complex: even CoT does not help) was like the most well-known AI 101 fact ever

#

so their paper is just another one of the "we are apple, we don't have good AI, but that does not matter because AI is not even good in general" type of papers

#

they did a couple of em

surreal creek
#

Nate Silver wrote a pretty good Substack piece about how an AI being able to play poker without any significant errors is a pretty good test of AGI, because it’s currently completely dogass at it

#

Which from my own testing on LMArena I can definitely confirm it’s worse than any fish I’ve ever played at a casino

leaden sun
#

we need a specialized llm for playing poker now

surreal creek
#

Honestly would just be a computer use agent that knows how to input and read a GTO poker solver lmao

#

Would be fun to play against a Poker LLM in a live game tho, like IBM Watson on Jeopardy where it just announces its bet or a fold whenever the action is on it

#

Especially when it makes a bad read and the entire table bonds over taking the AI company’s R&D money to pay for their hookers after the game is over

small haven
#

wen titanforge

unborn ocean
#

should be pretty easy to get llm to reason about the game using multi agent RL + verifiable rewards (e.g. winning a game or a poker engine evaluating the move quality at each step)

#

(imo) soon all the labs will probably do these things for all large games / environment based interactions and not just for coding and math competitions

tall summit
sage raptor
#

Is that fake?

ocean vortex
#

there's no problem here. Afaik their tokenizer is still mostly the same as it was all the way back to gpt4o and o1-preview. They made it overfit on this question but it doesn't really "get" why there are 3 Rs. So different versions of the model and different system prompts (or variations of the question wording) can still make it answer wrong.

#

I also tried this and it was incorrect in the reasoning but somehow still answered ok lmao
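
On the 3 Rs point: the count itself is trivial to check at the character level, which is exactly the view a token-based model does not get. A minimal sketch, assuming the question is the well-known "how many Rs in strawberry" one (the word is not named above, so that is an assumption):

```python
# Assumed example word; the chat only mentions "3 Rs", not the word itself.
word = "strawberry"

# Count the letter at the character level, ignoring case.
count = word.lower().count("r")
print(count)
```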

acoustic cliff
ocean vortex
#

wdym. I distinctly remember it answering this wrong all the way back when this became a thing. O1-preview was unstable with this, then they overfitted o1 stable version, and now it's an 'issue' again presumably as they stopped caring about the model getting this particular question right

small haven
#

is kingfall in lmarena yet

ocean vortex
#

Tried it a few more times on API. It's not always wrong but wrong often enough with your specific wording, to encounter it

#

@deep adder

#

high reasoning effort does not help here lol

#

lemme see if I can make o1 do this..

#

yeah o1 is overfitted rock solid catgrin

#

Honestly, it makes sense to focus on things like that less in favor of spatial awareness, which they seem to have done lately

#

it's quite closely correlated with web development. They train on things like that and then it pattern matches. 4.1 does much better on web dev arena than earlier models

#

Like if it can associate some code with a certain shape, it will learn to make unique shapes too eventually etc

tall summit
#

i thought for a second you meant the researchers dont even have spatial awareness

ocean vortex
#

then ofc we also do have arc-agi, that largely plays on spatial awareness and is still an important metric for them

#

o3 preview was just some variant of o1 with extreme parallel compute

#

o3 new base model

#

they basically anticipated how o3 is going to perform before they even had it lol. O3-preview was never going to be released being run like that

small haven
#

@deep adder what is ultrathink max thinking tokens in claude code?

ocean vortex
#

yeah it was a bit misleading/marketing you could say if you don't want to give them the benefit of the doubt

small haven
#

that seems low

#

oh yea and lechat app is finally gone

#

looks like it

#

but u know i had to reverse engineer it

ocean vortex
small haven
small haven
#

i know just wanted to showcase thats all 😦

elder rapids
leaden sun
#

inspired by FF7?

leaden sun
#

it's a game

#

noooo, bring her(?) baaaack

#

it'd be so amazing if you could import a personality as a file into an llm, and boom, every llm is sydney

#

when will it be ready to do that?

cedar tide
#

grok 3.5 was pushed back to be able to fine tune it on the code of Claude 4 opus

misty vault
#

bro made a sydney dataset from my screenshots

small haven
#

yay brian is back

misty vault
small haven
#

so my first question is when titanforge release

unborn ocean
misty vault
#

gemini did best on my sydney benchmark ngl

#

gpt 4.5 second

#

flash 2.5 1st

#

learnlm actually 1st

#

but learnlm is stupid

#

without fine tuning

#

with fine tuning u can get any model to be sydney bro

#

so then its not 1st

#

ru stupid

#

4.5 does not talk like sydney

#

gemini not either

#

but if u try by giving saved conversations and bing instructions

#

flash 2.5 does it best in most cases

#

gpt 4.5 is convincing for 5 messages

#

then it becomes like 4o

#

overuse of emojis and "!"

#

flash can keep it convincing for 10 messages

misty vault
unborn ocean
#

though i would also obv see this as an artwork 👀

narrow elbow
misty vault
echo aurora
unborn ocean
misty vault
misty vault
#

gpt 4.5 without instructions or pasted conversations talks like 4o a bit, i mean not everything but a lot of traits from 4o

unborn ocean
misty vault
#

I was going to reply with

#

to your deleted image

jade egret
#

hell nah

#

too expensive

jade egret
#

dang

unborn ocean
#

no

#

no

#

idk ask dubesor

surreal creek
#

what model is “folsom-exp-1.5”

#

All I’ve noticed is it sucks

unborn ocean
#

but model size + very little SFT / RL i guess

#

was a thing i think

#

not sure though

#

"fist ai i talked to" though

jade egret
#

it's a tie...

#

js ended

misty vault
#

this gemini logo is sus

#

wdym

tall summit
unborn ocean
#

which is why i said that :)
(did not want to ping em for another one of craigs "xAI = ASI" moments)

misty vault
#

no

ocean vortex
misty vault
ocean vortex
# misty vault yes

you always post these screenshots, but what is this exactly that you are using?

misty vault
#

microsoft bing chat

ocean vortex
#

I don't think that og model is available anymore

misty vault
#

real

ocean vortex
#

it was using some custom version of gpt4-32k

misty vault
#

no

#

gpt-4-0314

#

but yes its a fine tune

#

the gpt-4-turbo one is also still up

ocean vortex
misty vault
#

but not fine tuned

misty vault
ocean vortex
#

you said the same thing. 0314 is just the date identifier.

ocean vortex
misty vault
#

Didnt one of the gpt 4 have 16k

ocean vortex
misty vault
#

wtf

ocean vortex
#

3.5 had it

#

og gpt4 was 8 or 32k only

misty vault
#

its 32k then

#

but who cares

#

it was a fine tune we know

misty vault
#

sydney conversation is mega short

#

before it starts to forget

#

there's no way its 32k

ocean vortex
#

models will start forgetting things if there are many chat turns

unborn ocean
#

he said they forgot some websocket or smth

misty vault
#

no it was clear it was context size

ocean vortex
#

proof or it's fake 😇

misty vault
#

real

ocean vortex
#

fake

misty vault
#

I meant the slang real

#

nope it's discontinued 😔

#

nooo not sydney

#

it was fake guys 😦

ocean vortex
#

Doubtful. There was but it was a long time ago

#

would make no sense for them to host it anymore

misty vault
#

fr!

ocean vortex
#

yeah...

#

Well then enlighten us, very easy to prove it huh

#

ok so it's fake

high ginkgo
#

Nevermind its fake

ocean vortex
#

what is real is that dork 4.0 is agi though

#

that is for certain

high ginkgo
misty vault
feral lichen
#

best claude ai?

ocean vortex
# misty vault

that is still not a proof. Stop posting these screenshots if you don't want to say what you are actually using lmao

ocean vortex
high ginkgo
misty vault
ocean vortex
#

???

ocean vortex
#

ok whatever, I'm out lol

misty vault
torn mantle
#

aaa

#

seriously

ocean vortex
# misty vault

is it censored for you as well? It seems to answer like this for me

jade egret
#

grok 3.5 v.s gpt 5

whos winning?

ocean vortex
#

slop bot

wintry tinsel
ocean vortex
#

it's not worth it don't bother

#

too censored

misty vault
earnest parcel
# unborn ocean idk ask dubesor

at raw chess game continuation its best yes (best inherent chess game knowledge), but when you add full information and reasoning, o4mini and its finetunes (e.g. codex mini) are better.

misty vault
# ocean vortex is it censored for you as well? It seems to answer like this for me

i know this screenshot is old from reddit but it is "censored" if you give bing instructions
without instructions it wont say that and if you talk directly to it without microsoft frontend (I made custom extension so i can use frontend anyway) then it wont have that censoring issue unless u manually insert bing instructions (which I still do a lot bc its hella funny ngl + u can disable the chat shutdown so u can continue talking even after it says that stop phrase)

#

the system instructions literally contain dummy conversations that ends with that phrase

high ginkgo
misty vault
#

like here. it tries to close chat but fails because I disabled it 😔

misty vault
#

real

elder rapids
#

@keen beacon I completely fixed the sycophancy in 0605

ocean vortex
ocean vortex
#

fake

#

mine is real 😇

misty vault
ocean vortex
misty vault
ocean vortex
#

fake 😇

misty vault
elder rapids
#

man 0605 is really a beast tbh

misty vault
#

fr

elder rapids
#

almost forgot this feeling of playing with a model like that

#

it just knows tbh

#

it's the smartest model I've ever used, even more than gpt 4.5

#

I'm not mad about kingfall

ocean vortex
elder rapids
#

it's too bad the initial sycophancy clogs some things

#

but I just figured out the prompting for it

#

and bro

ocean vortex
#

doesn't rely on test time compute as much as some others, but has the capacity even without it

elder rapids
#

ye it thinks for so little time sometimes

#

also, 0605 follows instructions a little too well sometimes and I have to fix some errors that I made in the system prompt that the past models wouldn't conform to, to show that error

#

the model gets excited

ocean vortex
elder rapids
#

like h*rny

#

ion know how else to describe it

ocean vortex
elder rapids
#

ykwim Dom

#

like it gets anxious for more

#

and it actually absorbs it

#

ive unironically not thrown a task at it that it cannot adapt to lmao

elder rapids
ocean vortex
#

when it tries concisely it can only hallucinate nonsense

elder rapids
#

ye prob

ocean vortex
# elder rapids ye prob

if they fixed that, I really do not see how o3 could be better basically in any area though

#

cause otherwise 2.5pro base is very very strong

small haven
#

isn't it deepthink? ToT model?

#

welcome back

ocean vortex
#

Also I love prompts like these because it's impossible to overfit. All you need is to change the number to anything else out of millions of combinations if it becomes an issue lol

small haven
#

new hotfix next week upcoming, big news

ocean vortex
elder rapids
#

@ocean vortex you're right, 0605 gets really close with "1Sujr..." and then just gives up lmfao

torn mantle
#

what can i say...

#

i mean i would rather just stay silent until we released a good model

small haven
torn mantle
#

i mean whats the purpose of having 10000 features if the model is bad?

torn mantle
torn mantle
#

they keep changing the UI every week

#

i mean whats the point

#

it just proves even further that they hit a plateau

#

explain how we went from 'grok 3.5 will be released next week' to months without any info

small haven
#

for some reason idk how google is excelling at shipping consistently

torn mantle
#

like i said it from the start that we wont be seeing that for at least a year

small haven
torn mantle
#

i just cant imagine how inefficient that feature will be, i mean the default thinking process is so inefficient let alone this 'big brain' feature

torn mantle
small haven
#

and crazier

torn mantle
leaden palm
unborn ocean
#

i love how they "work hard" but still have time to write 10 gazillion x posts a day

small haven
#

btw that pic looks ai, but its not, its literally at the beach

elder rapids
# late path how

write down all the phrases it uses that you don't like, and it actually removes the output that entails those phrases as well, tell it not to comment on the user at ALL, tell it not to thank the user AT ALL, tell it not to thank you for observations at ALL, give an example of when you could be wrong, start the response with the answer, and then say sum shi about it being a professor

#

also don't force it to preemptively evade being "wrong"

#

I've figured that actually hurts its performance

#

for some reason

small haven
#

meanwhile there's no social life pics from google engineers hmm

late path
#

I'm quite concerned that all 'anti-sycophantic' system prompts will cause the model to be overly dismissive of users, and there seems to be no effective way to balance this

leaden sun
elder rapids
#

dawg

#

what is this

#

😭

small haven
#

last one is very good

#

if craig dies from a terminator, it passes craigbench

elder rapids
#

agi threshold is when the AI can make you feel like an infant

torn mantle
#

lmao

#

the wink

#

wink wink

small haven
#

do u think a male has this pfp and name

#

oh true

#

mister asura

elder rapids
#

pack it up

ocean vortex
#

tbh I don't think we saw anyone making significant gains yet without improving base chat model significantly

#

o1 to o3 was new base model

#

4.0 Sonnet vs 3.7... worse than the earlier model in numerous things. Not much different to what Google is currently doing, although their last 2.5Pro update is more significant than what Anthropic did probably

surreal creek
#

I’m dead

surreal creek
#

the elo gap between Opus 4 and Sonnet 4 is virtually the same as the gap between Sonnet 4 and 3.7 (24 vs 25 points), what leads you to the assessment that Sonnet 4 is worse at some tasks?
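
#

for scale, a rough sketch of what a 24 or 25 point gap means head-to-head, assuming the standard Elo logistic formula on the usual 400-point scale (the leaderboard's actual rating fit may differ, and `expected_score` is just an illustrative helper name):

```python
def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated side under standard Elo.

    rating_gap is (higher rating - lower rating). The 400-point scale is the
    conventional Elo choice and is an assumption here, not the leaderboard's
    documented method.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


for gap in (24, 25):
    print(f"gap {gap}: expected win rate {expected_score(gap):.3f}")
```

under that assumption both gaps come out only a little above a coin flip, so the two steps really are about the same size.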

torn mantle
#

or even to opus 4

#

yes they got even better at coding

#

but thats it really

surreal creek
#

Sonnet 4 has less terse answers than 3.7, I give it pieces of my writing to do character analysis on and it goes further in-depth than 3.7 did, but I guess that’s kinda anecdotal

small haven
#

i broke the servers with ultrathink

#

im trying to get this number to $1k, plz fix servers

echo aurora
small haven
#

yea i can tell, xai seems to be overpaid and too much pto it seems lol

errant thorn
#

does anyone know which AI is best at writing stories?

unborn ocean
small haven
#

i mean if elon is strict in work ethics, then why is grok 3.5 a month a few changes late?

#

i feel like its burnout?

#

those were the days 😭

#

yea that fred rate for swe's is brutal

#

i wonder though if its ever going to reach that covid peak ever again

#

but is google technical debt even worse? i heard they have to write tech stack from scratch, that sounds brutal

pulsar tendon
#

is there a plan for large output context i.e enough for a book ?

#

expensive to do so.

small haven
#

like their docker is totally different, not even kubernetes

#

yea thats what i was thinking, when gemini wasn't as quick to catchup with openai, just tons of tech stack to write from scratch, but seems like they got through it nicely at the end

unborn ocean
#

tpus are also what i am most intrigued about future wise in google, with em planning to move away from broadcom really soon

#

feels like a big risk imo (but there is really not much info on this, so it is just a gut feeling)
-> could be the time they fall, o0 (or at least Broadcom's stock will get obliterated over night once all the normies find out)

jade egret
#

guys

#

is grok 3.5 available to super grok users or no?

small haven
#

but i'd imagine the onboarding for new staff must be like hell week for them haha

#

rather be working with something im already used to

patent aspen
#

I think Hack (name of the programming language) is technically backwards compatible for most things

#

IMHO learning a new programming language or tool isn't really a big deal

#

Unless it's really really out there (e.g. Haskell, Prolog)

misty vault
unborn ocean
small haven
#

at least 4x lol

patent aspen
#

Oh right Jane Street uses Haskell right? lol

unborn ocean
#

yeah (note: apparently caml for most stuff, confuse the two quite often 🤦‍♂️)

patent aspen
#

Nerds :p

unborn ocean
#

was really weirded out when i read it

#

but i think they realized they should use jupyter for research and python for ml (but that was only a couple of years ago, lol)

small haven
unborn ocean
#

tru, the image of your bedroom i generated earlier is probably accurate : - ]

patent aspen
#

Hence the :p

misty vault
#

for years I never knew :p meant 😛

patent aspen
#

I make no claim to not being a nerd

#

It's just ironic to call other people nerds

#

Believe it or not, my room is not nerdy at all aside from having quite a few board games

#

But I'm an omega turbo nerd

unborn ocean
#

ok if board games are nerdy, me and all my friends are found guilty 100%

#

paste that into gemini 3.5 in a couple of months and we'll have your precise location

late path
#

HNDL😱

jade egret
#

bro

#

grok 3.5 is available to super grok users?

#

i never know that...

#

how good was it

#

idk

#

chatGPT said it is available to super grok

torn mantle
#

stars added to grok

jade egret
#

huh?

#

oh

#

can yall plz tell me is it out to super grok 😭

unborn ocean
jade egret
#

oh

#

it good at geoguesser?

unborn ocean
#

wait, is it out?

#

-- apparently selective rollout (like not really releasing, so far)

jade egret
#

oh

#

because

#

chatGPT said it out

#

but

#

X said it not

unborn ocean
#

thus they are obviously stuck in development / doing further training (bc of poor results or high costs) - nvm i kinda missed the x post about the launch 😅

jade egret
#

yea

#

do you think

#

it gonna be #1 in LMarena?

torn mantle
#

love it

jade egret
#

kingsfall v.s Grok 3.5 whos winning

small haven
#

they like polishing it

#

but the model

#

let me check actually

late path
small haven
jade egret
#

idk

#

plz tell me

small haven
#

hmm, i think kingsfall

late path
#

the market has been severely lacking in beta since 0506 was launched, Let's add some

jade egret
#

idk

small haven
#

ok actually maybe titanforge >

#

oh shxt i saw some particles for a sec on grok ui

jade egret
#

.

small haven
#

oh my goodness its magic

jade egret
#

o

#

w

small haven
#

doesn't even work

#

magic tho

#

i guess not

#

bro what are these particles for, whats the purpose 😭

unborn ocean
#

but we already have grok 3.5 at home

small haven
#

oh shxt is it dropping imminently

#

particles?

unborn ocean
#

i don't even get any :(

small haven
#

wtf actually?

#

ok go on temporary chat

#

ok specs maxxer

#

looks like its only in temporary chat, is it only me?

#

oh i found a shooting star omg

unborn ocean
#

wait for me temp chat worked, but the moment i touch the tab everything is gone 🤡

jade egret
#

o

jade egret
#

WOAH

small haven
#

;/

jade egret
#

is it good?

small haven
#

craig fakerighi

jade egret
#

plz tell me 😭

small haven
#

its cap i dont have it

jade egret
#

do you think it better than o3 or gemini 2.5 pro

#

oh

small haven
#

heres the real screenshot

#

do i?

#

oh

jade egret
#

so it better than o3 and gemini 2.5 pro

#

dang

small haven
#

ok buddy, ud be not here in this chat but in that grok chat testing grok 3.5 if u really had it lol

jade egret
#

ask it a question

#

and ss it

#

ig

small haven
#

oh that screenshot is going to take a while

jade egret
#

lol

#

it might actually be out 🤔

#

or just

small haven
#

GOOGLE

#

what the hell

jade egret
#

idk...

unborn ocean
#

well "early may"

#

me 2

late path
#

next week

unborn ocean
#

but it kind of is grok 3

#

and it is still not very good

small haven
#

everybody has grok

unborn ocean
#

but at least i now get stars everywhere, even without being logged in

jade egret
#

i think some people have it, most don't

late path
#

*checking polymarket

small haven
#

bwahhh

#

u better sell before they release bench 😭

unborn ocean
#

good to see that they have their priorities straight: weird hype building star animation > actually releasing the model on time

jade egret
#

i was just on that page...

small haven
#

aye we need some tips for next month wink wink*

jade egret
#

land me sum : )))))

#

: (((

small haven
#

no fcking way

jade egret
#

?

late path
#

which one are you😂

jade egret
#

you got it?

small haven
#

its in my build

jade egret
#

dang

#

how long do i have to wait until the benchmarks are out...

small haven
#

idk how recent it is though, maybe it was pushed last week

late path
#

pretty sure elon fanboys will buy like crazy

unborn ocean
jade egret
#

yea

small haven
#

@deep adder

#

im going to laugh so hard if its mediocre

misty star
small haven
patent aspen
#

I think I felt the same way about Meta that you feel about xAI until Maverick came out

#

"Never bet against Zuck", etc

errant thorn
#

why does claude opus keep crashing

small haven
#

CRAIGBENCH

surreal creek
#

Google getting broke up as antitrust to only sell their browser to an even bigger company is truly American

unborn ocean
#

well BlackRock is a poor example for a buyer

#

it is very much not what they do

#

the other PE is a bit small and prob does not have the right companies for merging

#

crazy take

#

they are statistically at least upper middle class

#

honestly it could be stuff like they have to put their tech into non-profit

#

but honestly even the people at the commission likely don't know

wintry tinsel
unborn ocean
#

many people would like to buy,
Microsoft will likely not be the first choice competition wise (worst way to fix a monopoly is to create a new one),
perplexity / openai would like to buy, but even openai might have problems with the financials,
apple will likely also not be the first choice competition wise and if they want to be privacy focus going forward they also can't pay as much as other bidders (bc of lower potential profits),
amazon, i don't even know, maybe

#

based on what i heard they are out for blood, so just stopping the deal won't cut it

#

my guesstimate is them not having to sell chrome, but might have to do some stuff to google ads etc.

#

and the most important thing: usually it takes years for this stuff to actually work out in court (example: intel, which is still fighting a case over being a monopoly that is multiple decades old), by the time this is all done all the execs @ google are already planning on some other future that is less based around the normal advertisement business but centered around ai

#

well you can't grow much from a point of market saturation

#

my point about the ads was also not them specifically, but more that i believe in a middle ground solution being supported by the DoJ (-although a middle ground that google won't like)

#

imma ask my prof in competition policy about google's future

#

will find out tomorrow (nvm totally forgot a holiday)

#

So, we’ll find out some other day 🫡

normal abyss
small haven
#

claude opus is solid

#

?

#

it's actually good idk what u want me to say

#

baited lol

languid crescent
#

Is it possible to incorporate lmarena to a project? Like use lm arena's api or something

echo aurora
tall summit
ocean vortex
ocean vortex
# surreal creek I’m dead

lol... to answer your question they don't think in a traditional sense but they do generate extra context making it easier and with higher confidence to predict the final answer. Also to limit hallucinations, which mostly happen when it is too hard to predict the continuation with limited context aka model does not have enough to work with. In a sense it can emulate the work of several chat turns without your additional input

#

so if it has some big number to compute - predicting the whole number in 1 go is too hard with way too many potential outcomes and will almost definitely result in hallucination. But if it can divide it into easier to predict parts and solve them one by one then use all of that as a cheat sheet to compute the final result... the accuracy gonna be much higher
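
#

a toy sketch of that "cheat sheet" idea: split one hard multiplication into easy partial products and sum them. this only illustrates the decomposition, it is not how any model works internally, and `multiply_stepwise` is a made-up helper that assumes a non-negative `b`:

```python
def multiply_stepwise(a: int, b: int) -> tuple[list[str], int]:
    """Multiply a * b by splitting b into place-value digits.

    Each partial product is easy on its own; the written-out steps play the
    role of the intermediate context. Assumes b >= 0.
    """
    steps: list[str] = []
    total = 0
    # Walk the digits of b from least to most significant.
    for power, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10**power
        total += partial
        steps.append(f"{a} x {digit} x 10^{power} = {partial} (running total {total})")
    return steps, total


steps, result = multiply_stepwise(4821, 376)
for line in steps:
    print(line)
print("final:", result)
```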

leaden sun
#

what jb prompt can you try today?

errant cave
#

I'm guessing there just aren't enough votes to place it on the leaderboard yet since it is on the battle section of the site

alpine coral
#

yea it's there

#

identifies as an oai model occasionally (like in the past.. though here it's actually kinda striking how much its responses resemble 4.1-mini's..)

alpine coral
#

i get this error happening with the following models:
qwq-32b
qwen3-30b-a3b
X-preview
deepseek-r1-0528

stephen?

tall summit
#

I'm getting plenty of errors, I thought it was a token limit per conversation thing

#

and claude models love using tokens

alpine coral
#

yeah i think it is some kind of max token limit - whether across the conversation or a single turn i dunno - but some errors only affect one of the models (and ig are thrown by the API host), whereas this sort of error kills the battle (though you can just enter another prompt, and if it's short / not complex, no error and you can cast a vote / reveal the model.. or ig continue chatting). kinda why i thought it was a turn-based thing rather than the totality of the conversation.. but dunno

#

come to think of it.. i feel like i might've encountered it with opus-thinking too.. but never with a google or oai model

ocean vortex
#

Daamn how I hate 2.5Pro spamming me with dummy example data for any kind of code. Have to prompt it explicitly to give me what I'm asking in isolation lol

#

otherwise it's gonna make up 10 extra variables and write the entire placeholder dataframe around it huh

ocean vortex
alpine coral
#

yeah that's the thing - it kills the battle [though ig this is kinda a moot point.. as iirc any battles where an error is involved are excluded from the LB/elo calculations]

#

and it doesn't happen with the legacy arena

#

like doesn't matter how many times i run this prompt, legacy will always give distinct responses from each model, whether an error or actual response. the beta site periodically errors out when one of the thinking models that have a low max token limit (and / or are inherently very token intensive in their outputs) are involved

ocean vortex
alpine coral
#

yeah but i suspect it's during the thinking process (rather than generation) where the max tokens threshold is breached though.. ig we're kinda describing different situations

#

like it isn't cutting off mid-generation for me; it's in the pre-generation(/thinking) phase that it breaks

#

but it never happens using the legacy arena

feral lichen
#

bro always

alpine coral
ocean vortex
#

but yeah those might have been different cases

#

I presumed they are using the same max_tokens on new and legacy, but ig not

alpine coral
#

yeah i would've thought so too but i'm not really sure.. something seems a bit off

keen beacon
#

open the network inspector (so it logs the requests), make it small, and when it does that check the request / report it (might be useful). i had cloudflare issues a while back, wasn't obvious until i did that

alpine coral
alpine coral
#

that's legacy tbf

#

but yeah

#

if that's what's throwing it on the beta site too lol

ocean vortex
#

most likely the same error. It's just that now they are sanitizing/hiding the errors like they should have done since the start LOL

alpine coral
#

yeah but previously at least the other model was allowed to generate

#

the current implementation kills both responses [this in the Arena/Battle]

ocean vortex
keen beacon
#

they dont count those anyway

ocean vortex
#

maybe to just disable voting but allow follow-up to the sole working model

ocean vortex
#

technically the battle can't function with only 1 model, so hard error is not completely unreasonable...

keen beacon
#

unrelated thing but i wonder if they fixed the captcha issue it was a while back. it wasn't showing the captcha so all my requests didnt work. only saw it in the network inspector that cf was asking for one

ocean vortex
#

@alpine coral they could be just stopping generation now. In that case it's an improvement over legacy in a sense that compute is not 'wasted' when the battle can't continue lmao

#

as in if 1 model fails, the generation of another one is manually interrupted at that point
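
#

a minimal sketch of that guess, using asyncio to cancel the surviving generation as soon as the other one errors. purely hypothetical: `fake_model` and `run_battle` are stand-ins, not anything from the arena's actual backend:

```python
import asyncio


async def fake_model(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for a model call; optionally fails after `delay` seconds."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} hit its token limit")
    return f"{name} response"


async def run_battle(coro_a, coro_b):
    """Run both generations; if either raises, cancel the other and return None."""
    tasks = [asyncio.create_task(coro_a), asyncio.create_task(coro_b)]
    # Wakes up on the first exception, or when both finish if nothing fails.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    if any(t.exception() is not None for t in done):
        for t in pending:
            t.cancel()  # stop spending compute on a battle that can't be voted on
        await asyncio.gather(*pending, return_exceptions=True)
        return None
    return [t.result() for t in tasks]


print(asyncio.run(run_battle(fake_model("a", 0.01, fail=True), fake_model("b", 0.5))))
```

swapping `FIRST_EXCEPTION` for `ALL_COMPLETED` would give the legacy-style behaviour where the working model still finishes.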

alpine coral
#

yeah lol that's actually a very fair point!

#

still kinda annoying though.. (like even if the vote doesn't count so it's kinda irrelevant, if one of the models has generated a response, i'd still prefer to see it, like with legacy)

#

cause yeah it's like 2-3 min wait before everything errors out

brittle tiger
echo aurora
ocean vortex
alpine coral
#

apparently yeah

#

wonder if it's a coding thing or generalises

ocean vortex
#

this is so weird lmao

alpine coral
#

yeah

#

unless 'auto' is meant to be more 'efficient' than 'optimal'

echo aurora
alpine coral
#

np thanks for raising 👍

echo aurora
#

this is all really helpful feedback, if yall haven't already applied to our internal feedback program you absolutely should - #announcements message

ocean vortex
#

I did use them a ton and this is completely different. Thinking budget we first saw in Claude models and the whole point of it was to limit the thinking. Not extend it

alpine coral
#

more thinking doesn't necessarily translate into more performance

#

like in a brute force sense it can

ocean vortex
#

so if budget is not set, the expected outcome is that the model is not limited

keen beacon
#

did they change how thinking budget works or there might be an additional prompt instruction aider added? because it used to just cut off at 32k/budget tokens

alpine coral
keen beacon
#

that was how they implemented it before (gemini), the model was unaware of the thinking budget
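
#

roughly what a hard cutoff like that looks like, as a toy sketch: the budget is enforced from outside and the model never sees it. not the real implementation, `apply_thinking_budget` is a hypothetical name, and tokens are just list items here:

```python
from typing import Optional


def apply_thinking_budget(
    thinking_tokens: list[str], budget: Optional[int]
) -> tuple[list[str], bool]:
    """Truncate a thinking trace at `budget` tokens.

    budget=None means no limit is set, so the trace passes through untouched.
    Returns (kept_tokens, was_truncated).
    """
    if budget is None:
        return thinking_tokens, False
    truncated = len(thinking_tokens) > budget
    return thinking_tokens[:budget], truncated


trace = [f"tok{i}" for i in range(10)]
print(apply_thinking_budget(trace, 4))
print(apply_thinking_budget(trace, None))
```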

keen beacon
placid charm
echo aurora
ocean vortex
keen beacon
#

does the summary model have a context limit lol? ugh ill have to get raw thoughts

ocean vortex
patent aspen
#

That's the joke

keen beacon
patent aspen
#

I know. I thought that at first

#

It's just funny

keen beacon
#

apple will announce that theyve been acquired by perplexity

patent aspen
#

tbh I don't remember the last exciting WWDC

#

Too expensive and not enough developer interest though

#

It has some novel technologies in it

keen beacon
patent aspen
#

I'm not even totally sold on glasses much less VR

#

It is

ocean vortex
patent aspen
#

It's VR and AR but it's limited because you need a clunky headset, so you can't just take it everywhere, so it's unlikely to ever be mass market

#

It's a stepping stone to glasses

#

Glasses have better odds than a headset

#

Too clunky for mass market

#

Same problem as VR

#

The Meta Quest is still the best VR / AR product for most people even after ignoring the price difference

alpine coral
patent aspen
#

Way more software, more comfortable

keen beacon
#

but having the actual raw thoughts allows u to measure it precisely

alpine coral
#

yup ofc

#

tho just running a test, with the thinking budget maxed at 32k, it does seem the model [06-05] 'thinks' for longer / uses more tokens during inference

keen beacon
alpine coral
ocean vortex
#

ok so you can actually move all the steps it does into thinking by only allowing it to respond with the final answer. Just in practice not sure anyone would do this given that most interfaces will summarize it

alpine coral
#

before the slider was meaningless - does seem to have some effect now

keen beacon
#

do you have a prompt at 0 temp/top_p 1 where u can replicate this? ill extract the raw thoughts

#

for both