#general
just for it to solve a basic physics problem, it wasn't hard
I meant for it, to see how its changed
it will fix o3 pro issues
looks like today's openai livestream doesn't have o3pro?
maybe tmrw?
Do we have rumors on o3 release?
lies
So gemini’s coming out tmrw
o3 pro on thursday
Deepthink thursday
guys my pc just broke
O3 on thurs too?
this is crazy
What time?
i have it alrdy, its ass
Hmmmmm
if ur tight, id say hold onto it for google ultra
When people use the term "structured thinking" what are they usually talking about? I only ask because a lot of external AI jargon doesn't map cleanly to more technical AI jargon
idk it seems idiosyncratic to me too
its better than o3, but not that great of a jump like o1 to o1 pro
What about compared to gemini 2.5?
o1 pro could spit out 2k lines no placeholder, no omissions, all in full, o3 is maxed out at 500 locs
OMG
This is fantastic
and o3 pro times out a lot
king falls tmmrw and king will fall further a week later
its good
woah
google releasing products every week whereas xai :
okay gdm just drop kingfall already 😔
this is why
link
lol
no screenshot either
lol.. i mean he probably took a vacation after a long time, i dont doubt that they sleep in the office
o3 pro cots seem to get it
but we should seriously start an audit to see what are they doing exactly
so far, its still thinking
cheeeeeen
ya
Gotta sleep
o3 pro is ass i been telling
Gn everyone
is the answer public ? or did o3 pro rlly get it
but o3 pro has tools
I don't think I would take the extent of deepthink, so inevitably, yeah
yea im in evan chen site, no 3031 answer
o3 pro timed out great
full cot
full cot ^
this is from another run
and still running
it is public but it's under a reddit post with like a handful of upvotes and a few comments
yea guessed so
lmao
they both timed out
thanks i needed this
if you're paying for o3 pro there should be no time out 😔
well it does :/
nah
it was nearing 15 mins, i guess thats the hard limit
what's the discord?
deepthink is going to crack it without tools
i wonder if theyre gonna switch out the deepthink model
if they make further revisions that close the gap as a standalone model w/o parallel requests
assuming its a set 2.5 pro revision already
deepthink? kingfall? goldmane? everything?
real
it has to consent first
there's a decent chance kingfall drops on arena this week to be fair
i see goldmane releasing officially tomorrow, and kingfall appearing on the arena on the weekend
i arrived at the conclusion that the red teaming platform was just testing existing models this round
so nothing good
leo has access to asi
yup
lol i wish
btw
kingfall and goldmane are INSANE writers
the best by a large margin out of all the models
I'm not asking them for a prose here
or to "write"
I'm asking them to explain things
and that gives insight into how they actually write
without forcing them
no
lol
not what I said
it's a lot easier to gauge how good they are at writing in the usual sense if you're asking for prose
that's not true at all
you don't ask for a prose or a prompt when you want to see how a good model is in writing
debatable
if it has to create its own context
then you're not really allowing it to write, this is the same for every LLM
sydney fine tune of gpt-4 is best at writing
deadass hate when you talk to me like I'm stupid
no it has nothing to do with style lmao
I'm not asking it a question and paying attention to its wording
that's redundant and nonsensical
🤣 🤣
why y'all deleting so much
why not
I just said you shouldn't allow it to create its own context, certain models do do this well like r1 and o3, but models like grok are hard capped, same with claude
you know this, too, with 3.6 sonnet
it was an excellent writer
but it was a different type
@keen beacon i agree the OG claude models were really really good at writing
they felt very unhinged but in a good way
i remember when claude was only available for public use using their slack bot
it was great
let me see if the messages are still there
same esp for the time
pfft
anybody have gemini ultra here
Google pulls a Kingfall before OpenAI's stream 🤪
hahah they thought it was o3 pro, plugged it out when it wasn't
LMAO
Real
augment code is matching google ultra in price 😭
<@&1349916362595635286> Please, add "Claude 4 Sonnet" or **"Claude 4 Opus"** in the last WebSite !
Why did you almost completely abandon the old website, when it is better done and uses gradio (which is visually attractive) ?
i'm not too sure about those last 4 words..
this is all gone now 😔
@deep adder run this bunx ccusage
faster npx alternative
?
it checks ur claude code metrics lol
i just seen that
i went through 2 billions tokens insanity
how come
I am sorry to hear you're not a fan of the new site. It is a big change, and it's totally fair if you prefer the old website, but we’d love to hear your feedback on the new one if you’re open to sharing. When we changed from the legacy site to the current site we did mention that moving forward all feature updates and improvements will happen on the new LMArena site. We have been seeing a lot of positive signal that the new site is more appealing and have made the decision that going forward this is where our team is going to be focusing.
any bug bounty?
Not atm but for sure a good idea I'll pass along. We do have report bug form on the site but in terms of a bounty program that's not something we have currently.
Did you find something 👀
yes, remember when models went unavailable while ago? he did that 😔

@echo aurora will the webdev arena be integrated into the main site?
<@&1349916362595635286> I understand your team's decision. However, I believe both websites are complementary, as each has its strengths and weaknesses.
New website:
It features a more modern, visually pleasant, and intuitive interface. It's prettier and easier to use, but also noticeably less complete.
Old website:
While its interface is a bit more "raw" or even cluttered, it's much more complete and packed with features. It might feel more like a "dev tool" than a "power user interface", but its depth is genuinely valuable.
Personally, I especially appreciate being able to tweak settings like temperature or the maximum number of output tokens (up to 4096), and how easily accessible every button or option becomes once you're used to the layout.
Honestly, what I prefer is that "familiar" and "feature-rich" experience. For example, when using tools like Ollama, I tend to choose the command line over a graphical interface.
So I think both websites are great—but just not for the same type of user. Then again, maybe I’m a bit of an edge case 😅
In the end, I prefer the old website 😁
generally I won't be able to share ETAs for new features/updates/etc, but moving webdev over the arena is on our radar
😏
im jk nothing
too lazy, but if money is involved, i would
@echo aurora Can we have a page with a continuously updated list of models currently on the leaderboard that are not yet in the arena? (mystery models included)
a bounty program is a rly good idea 
on that note: maybe redarena being integrated into the main site could be cool
??
sry for delay, I'm going to spin up a thread in #1372230675914031105 when out of this meeting
pretty sure this wouldn't be the case, since the stream was already known to be something else
https://www.rxddit.com/r/singularity/comments/1l32s24/sam_altman_says_the_perfect_ai_is_a_very_tiny/
Source: Maginative on Youtube: Sam Altman Talks AGI Timeline & Next-Gen AI Capabilities | Snowflake Summit 2025 Fireside Chat: https://www.youtube.com/watch?v=qhnJDDX2hhU
Video by vitrupo on 𝕏: https://x.com/vitrupo/status/1930009915650912586
the perfect ai is a 1 bit-sized model with superhuman reasoning, 100 trillions tokens of context
and most importantly its knowledge coming from tool workflows
it is kind of a weird ideal in many ways and i have some problems with him assuming that this would be the ideal (economically speaking), but it is fiction anyways
so who cares
on a completely different note: @earnest parcel godaddy says your domain is worth 434 USD 🤑 (dubesor.de)
nobody needed him to say this
guys I broke Gemini
exactly lol
u know the convo been boring when he pulls this shxt
so trueeeee
i want ultrahuman reasoning though 🙏
its gonna happen but whatever hes talking is at least a decade away lol
agi
If AGI/ASI is actually developed in this decade like AI 2027 predicts, I will eat my hat.
Fair enough.
Though the AI 2027 scenario will either be the biggest comedy show of our time, or the most shocking prophetic call ever.
is goldmane still good for thursday?
i will make you throw up the hat you ate and will eat it as well with all your stomache and inside fluids still on it
deepthink is technically agi, u can eat it now
Bro....
Google didn't lie, the latest 2.5 flash is much more efficient than the old one (being more performant)
I did new model request
https://discord.com/channels/1340554757349179412/1379958143072862208
seeing that now, adding to the request list
[moved to ai-news]
huh where is this guy tweet??
@patent aspen i thought it was tmmrw
oh nvm he tweeted 4 hrs later from this time, last time
is stylectrl dead in the new arena?
its the default
as it should be thanks!
late june
as per brian
its going thru a safety testing phase rn
only trusted users
yea its heavily been nerfed compared to the december version
wym? u think its gonna be a flop?
nah even a month before, its been flaking, was just using o3 since
idk i think theyre just following protocol
u love finance do u
cool hopefully
ive tried months ago, its meh unless u do math heavy things
deeperthink is rlly bad
o3 patches that tho, let alone o3 pro
grok 3 is archaic
a little over that, well for the last tweet
pre gemini io
goated
but where is logan's tweet?
if you dont want to be in the same place where major ai developments are taking place
There’s too many stealth Google models
There’s like 20, I can’t keep count of them. Every 4 days I see a new one.
Kingfall, Dragontail, Nightwhisper, Dreamtides, Moonhowler, Stargazer, Shadowbrook, Riverhollow, Lunarcall, Moonfall
I’m pretty sure that’s all of them
nothing
you lied
I'll never believe you again
it's over Google is never releasing
🤞
wait a min
ts not happening
i thought gemini 2.5 pro already released
iterations
or thats the preview
there are different previews, they change like every month
ye, but GA soon
what is ga
GA = general availability
no
O3 killer tmr?
it's not going to be an o3 "killer", o3 has higher compute variants
it's just going to be even better than before
with that out of the way
yeah o3 killer, if tomorrow
What about gpt5?
hope it's even real
Polymarket has the probability at 90% for a 2025 release
The only company to release an ultra variant is Anthropic with opus everyone else is greedy
This is what it gave when asked for SVG of a robot https://x.com/testingcatalog/status/1930298521078399226?s=61
I don’t know how that compares to Claude 4 Opus
Since I haven’t tested it with the same prompt
It’s probably just an iteration of 2.5 Pro though
I like everything but the robot
The background and text is nice, but I personally think Kingfall is better
I witnessed firsthand as Gemini changed versions (in direct chat) back in May
boom
SAM FALLS TODAY
goldmane🏆🏆🏆
kingfall definitely won
good morning
trust in brian
brian is our google insider
o1 pro? its not that good, o3 beats it by a mile
kingfall is way above o3
i mean even o3 in price/performance
i doubt kingfall uses 10x
its more like $1.25/1m in $10/1m out
deepthink will be 10x
have u tried kingfall? do u not remember, its cot wasn't that big and actually lesser than 0506
prolly i can believe that
u right, thats an interesting pov, parallel cot
oh yea deepthink i believe will demolish o3 pro
i have o3 pro rn on stealth, it aint that great as it thought it be
its better than o3, but not similar to the jump from o1 to o1 pro
all of this competition is so good, bc now we have o4 and o5-mini-high to proc an early release, very very excited for summer
hmm, currently if ur basing codex as their winning agentic product, i believe ur wrong, ive tried it and currently claude code >> is the meta
operator isn't even that good, even with the o3 upgrade as the base model
hmm, have u tried jules?
its more like claude > gemini > openai, in my view
lately or when it got released
it improved a lot
codex is garbage too 😭
they both garbage
but i belive oai is more garbage than gemini
😂
im using codex-1 on their ui, its rlly bad
idk why ppl are hyping it up
by perpertuity, how is codex mini > codex-1?
no its not the ui, its beautiful, but the output is very bad
trust me
hmm later today i will, but i think gemini is going to take my time lol
its unfortunately google that will win
i am surprised
but any competition is good
dario said benchmarks dont matter anymore, so its gg 😭
and $15/$75 pricing💀
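To put the two price points quoted above side by side ($1.25 in / $10 out versus $15 in / $75 out, both per 1M tokens), here is a rough cost sketch. The token counts are a made-up workload for illustration only, and the function name is hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one request, given prices quoted in USD per 1M tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical workload (an assumption, not from the chat): 20k tokens in, 4k tokens out.
INPUT_TOKENS = 20_000
OUTPUT_TOKENS = 4_000

cheap = request_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 1.25, 10.0)   # $1.25 / $10 tier
pricey = request_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, 15.0, 75.0)  # $15 / $75 tier

print(f"$1.25/$10 tier: ${cheap:.3f} per request")
print(f"$15/$75 tier:   ${pricey:.3f} per request")
print(f"ratio: {pricey / cheap:.1f}x")
```

For this particular input/output mix the gap is a bit over 9x. The ratio moves with the mix, since the input prices differ by 12x and the output prices by 7.5x.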
gemini at 86% and claude is at 72%, im not using claude 4 for a while
yes, but this kind of a jump is too big, im going to be using the good ole' copy and paste until they integrate their pro/ultra sub for virtually unlimited queries like claude max
i believe its going to be unlimited on at least gemini ultra via their ui
o3 pro is roughly at $100-$150 in/out and they can afford the "unlimited" inference via their ui
so why cant google
veo 3 is just a whole different arch
and its just a gimmick rn, i dont rlly care
we will see,
gonna get ultra
if deepthink is announced
but later june im guessing
nah gemini is
oai never cut under competition
ever
r1 > o1 mini, what are u saying 😭
talking out of ur ass srry 😭
oh yea i get what u mean, but they still never cut under competition, but cut after competition cuts
never before it
they always held a premium
they greedy
🙏🙏🙏
I used to pray for times like these
New sota model today?
goldmane yes
chatgpt is so ass
google is sooooooooo much better
but their ai helps kill kids
but other then that its great
yes
chatgpt operator ass too
agreed
We’re in the future. AI can now run your web browser, complete tasks, and automate entire workflows– basically, acting like y...
i was using this to do my homework
its pretty decent
go to the end of the video you'll see the chatgpt vs browser use comparison
there really is no difference especially if you have a decent cpu/gpu
sure but I still don't believe hed have so much more access after that tbh, he says he has "connections" but then it's like he's "directly" there, combined with the fact it's already a reasoned inference via Logan saying it'd come early June
Who’s Brian?
hence the conflicting information that are equally valid (Logan also saying, "in weeks they'll come")
and he's not pushing for any of them
but ey
glad it's Thursday
billy
yea maybe, but he got goldmane spot on
i was skeptical at first, but damn google insiders are built diff
o3-pro and new gemini model tomorrow? that would be a lovely surprise
I need King to fall rn with free API usage
I'm curious what else OpenAI has up its sleeve this time to steal Gemini's thunder
Last time for 0325 they released GPT-4o imagen and Ghibli on the same day
If they only release o3pro today it probably won't be enough
Maybe o3 image gen?
Flux Kontext is threatening GPT image 1
And Flux Kontext is much faster than GPT's and it doesn't have that yellow filter that GPT always have.
This is going to be big in the future, whatever Wolfram does is always excellent: https://www.wolfram.com/llm-benchmarking-project/
No o3 though
👀
Qwen3 30B is just off the charts for its size
if only wolfram did more in the first place
fxxk
the knowledge base in Poe is shxtty
I use 2.5 Pro as the base model, and then I asked him about a novel character(which is in the novel) and then it just pops out nonsense
ok?
why flash 2.5 in gemini app feels stupid today
That's a good observation. But I'm still surprised how good the 4.5 is and how bad is the o4-mini.
Hmm I wonder if it means that the train time compute > inference time compute in the end.
This says otherwise
That's true. The distribution is niche
Ahh I'd like to see the 4.5 with PRO level thinking
Yeah but still would be interesting experiment to see bench results
Indeed
it seems as if LMArena's sampling has become more restrictive
less creativity
Gemini is much less interesting then
there are certain things any mini model will either never beat 4.5 on, or will take a very long time to make it possible. SimpleQA, spatial awareness or spatial reasoning.
also context awareness ("vibe test") - the smaller the model is, typically the more literally it will take your last message or sentence having less capacity to consider everything else or "read between the lines"
SimpleQA 62.5% (gpt4.5) vs 19.3% (o4-mini-high)
You gotta be kidding me.
Why does the new website break down all the time?
Even discussing perfectly legitimate subjects gets the "Something went wrong" error.
could be Anthropic tbh. They added some really shady flagging with Opus 4
safetycell technology 🤷♂️
Weird.
I never got such errors on the old website.
Refreshed the website, and finally, Claude is speaking up.
opus has periods of unavailability, you simply have to wait some time and reroll regularly. it's not related to the prompt
Today I've used Google Colab and Android Studio. They both have a "gemini" chat to interact with, however, without model selection or identification. This could explain why they need so many different anonymous models on arena: each has a different use case and may be finetuned for this.
Or they simply put in the cheapest gemini flash 😦
So I didnt actually trigger a censor?
hey , you guys have some server where people actively post their AI projects and discuss?
I want to join such a server
claude 4 ops not there?
also , llama nemotron is ahead of 3.7 thinking?
crazy
Reminder we have our Staff AMA tomorrow! https://discord.gg/XkfsbYWX?event=1375223423009165435
We also have a contest running right now and would love to see your submissions! #announcements message
Hey mister "there have never been any nerfs" 🤓 @ocean vortex
goldmane will crush it yay
is this that excel spreadsheet where they counted missing scores (not tested or they failed to find them) as 0% scored?
lmao
Vibes says otherwise
?
this graph is wrong on so many levels. Not any better than some random thing someone drew
Like... don't mind gpt4.1 being above 2.5pro
LOL
the censor would be the box with "retry" and "clear". your screenshot was the censor admitting the prompt and it failing at the generation stage
yeah this is accurate. Exactly the same +/- miniscule amount
This shows the real problem
oh btw claude 4 opus is not in the chart of artificial analysis
yeah but this is completely different to the official leaderboard... Wonder if artificialanalysis messed something up with their testing. Wouldn't be the first time
i dunno.. what it's actually benchmarking is very narrow..
The task consists of going from English-language specifications to Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of humans, and we've developed effective tools for determining functional correctness of code, which we're now applying to LLMs.
Perhaps translating natural language into wolfram has more generalised value.. and full respect to Wolfram too.. im just not sure it is or will be that useful / meaningful
if you look there https://artificialanalysis.ai/models/gemini-2-5-pro there's not a single thing where the new one scored higher for them, that can't be right
we already saw it higher on AIDER, webdev arena etc
Why not, if I remember they just showed the ELO based benchmarks during presentation, which are inaccurate
Anyway, not important now as the GA is underway
i feel like if there's nerfing.. it happens from pre-release (e.g. nebula, goldmane etc ect) to preview/exp - then GA is just another layer of safety and corporate alignment nerfing
no, they showed substantial gains in coding, including livecodebench, artificialanalysis complete opposite:
But how they showed it? I don't remember numerical data
I think they messed smth up with their test suite lol
That's true indeed
Hmm
they literally showed it for both versions
So you justify "no-nerf" statement based solely on livebench?
there are more, basically most coding benchmarks show improvement
This is quickly becoming the new "As a AI language model....", for real.
What if you wanted to talk to a LLM, but God said: "There was an error."
yeah and if the version they release (whether preview or GA) is slightly less performant than goldmane, that would be consistent with what i feel like ive seen in the past
they rarely seem stronger compared to pre-release imo anyway
like, +30 elo
yeah both goldmane and redsword are solid af
well.. redsword is no more.. but it was (and i feel they are more the less the same checkpoints)
nah i think it's way more fundamental than that
they're making good models
yeah latest iterations have been incremental
but nebula / 2.5 pro was a massive performance step up
i'm not sure goldmane is of the same extent - maybe - but it feels quite significant imo / fwiw
goldmane will be the next Nebula moment. It's much better than 0506
yeah i feel 0506 didn't get nerfed so much as was f-ted for tool usage - and other areas suffered as a result
My feeling is that they strengthened instruction following, but this somehow reduced the model's judgment and autonomy (compared to 0325)
in terms of actual usage - o3 on chatgpt is my go-to for anything involving complexity and / or web search.. it's really something
but if i needed to use an API, gem pro 2.5 would be so much smoother
and it's such high quality
but yeah not o3 chatgpt
i mean i just use 4o for most things tbh lol
fair enough - i can appreciate that 👍
it is a strong model
but yeah 4o is just useful for quick stuff - like translating / transcribing, some quick question
like i don't always need the 'best' model - actually, aside from research, i rarely use thinking models
i do but that's just me ha
it's what oai says it is
yeah i dont see that as controversial tbh
but i guess others find it so
and like 4.5 is still 'preview' (i bet whatever is the OG version that wasn't tuned for safety and public release is a beast)
but it's being deprecated soon.. ig we'll never see a non-preview version..
gemini 2.5 flash 😍 or 4.1
?
ok
im talkin about most things.
It's clear that OpenAI has put a lot of effort into optimizing 4o's chat experience and multi-turn conversations. There's a reason it's now ranked 1st in multi-turn on the arena
yeah i agree, though the latest 2.5 flash with thinking is like notably strong imo
but still.. different use cases - like thinking gets in the way.. if i'm gonna use it i may as well go all out with o3 or whatever ha
waiting for leaderboard updates
Nope
what is GA?
is openai still having another livestream?
First they will put the model in preview mode (different from experiment mode) and if they find no bugs then only it will go to GA
Looks like latest model in GA target and will release as a preview model for first few weeks
I got a arc-agi problem through and it was right. O3 and last 2.5 pro get it right about half the time. Not enough time to test thoroughly
I saw it and it's gone again
portal color scheme
yeah, it's rolling out. I am probably hitting different server now and flag is not ON yet on new server
I got one through and now in the same chat instance it says model not available
pls say it fixed coding issues
Formatting is better. Looks a bit like chatgpt to me
Refresh 🙂
24 elo gain is good. I was hoping for 30 elo gain but still respectable
30+ point elo difference over Opus is hefty amount of ppl picking it over Opus
if you unchecked style control its 30
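For a sense of what a 24 or 30 point gap means in head-to-head votes, here is a small sketch using the standard Elo expected-score formula. It assumes the arena scores behave like plain Elo ratings on the usual 400-point scale, which is a simplification of however the leaderboard is actually fitted.

```python
def elo_win_probability(rating_gap):
    """Expected win rate of the higher-rated model under the standard Elo formula.

    rating_gap is (rating of A) - (rating of B); a gap of 0 gives 0.5.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


for gap in (24, 30):
    print(f"{gap} elo gap -> expected win rate {elo_win_probability(gap):.1%}")
```

Under that assumption a 30 point lead works out to the higher-rated model being preferred in roughly 54% of decisive matchups, and a 24 point lead in roughly 53%.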
Damn may be back on the googtrain again
we are so back
how is it on there so fast?
it's goldmane model and has been on Arena for 10 days or so
oh, haven't heard about it, only heard about kingfall
https://x.com/sundarpichai/status/1930656033237823862
"We also heard your feedback and made improvements to style and the structure of responses. "
No wonder it feels better
Introducing our latest update to Gemini 2.5 Pro (06-05), which we expect to become our long term stable release. At a glance:
- SOTA on HLE, Aider, and GPQA
- Now supports thinking budgets
- Same cost, on pareto frontier
- Closes gap on 03-25 regressions
hahahahaha
close?
yea ever since cot became a thing i noticed verbosity went up so high and I never benefited from cot anyway. non cot sota models could already do every one of my tasks but i gotta be honest claude 4 opus thinking kinda beats non thinking opus 4 so im starting to accept cot and claude thinking doesnt have verbosity issues
that's cool
gemini 2.5 ultra👀
Lol the grok 3.5 lost opportunity for the good release. And now it is Irrelevant 😄
do you think it's likely kingfall will turn up on arena soon?
WOOOOOOOT
https://blog.google/products/gemini/gemini-2-5-pro-latest-preview/
It's clearly the best model now
and quite cheap
lmao o4 mini high got that high
this is what I am looking for.. could you align them better ? 😄
In programming it's middle point between march and may updates. In everything else - better than both.
what is the Polymarket link?
wow that is one hell of a base model
it does, though
there'd a toggle in ai atudio
studio
speaking of AI studio it keeps throwing errors ugh
Ill take my chances. Ill win as long as non-google or openai models win
By the end of June i mean
yoooo this is so much better, like it feels better to use
which model was this? NW?
lol NW literally is a myth at this point, is goldmane better than NW?
most likely
ahh yeah that makes sense
what was this General availability people keep talking about?
i have been mia for a bit
this is impressive af
wasn't every model release GA?
Is it?
thats what someone said here
They are releasing snapshot after snapshot, still no gemini 3
or where is ultra
people keep talking about how bad the new releases are but I see the potentials, imagine what our children would be using 10 years later
i will definitely be fooled by AI
What about Ultra?
I don't believe Gemini 3 is that late tbh
Does anyone know if there is a way to set your defaults in ai studio like i prefer temperature at 0 or do you have to manually change it everytime
Does anyone have the benchmarks for version 05-06 (the previous one) please?
manually set it everytime you start a new chat
Oh i found this
It could be Sunstrike
Bruh I thought craig was not rage baitable
After seeing releases from Google, OAI and Claude, is there any hope for Microsoft, Amazon or Apple to release a good AI model ever?
@deep adder do u think OAI can beat gemini?
Nah
I am not OAI fan or gemini hater, and I think OAI can beat gemini
my toxic traits could never....
Amazon might be OK because of CLaude investment and Microsoft because of OAI but Apple is kinda done
Idc what the crowd wants lol. LMArena ranking is the only thing I care about
Does amazon own claude?
not sure how much % but may be around 50% ?
Chatgpt is more popular but Gemini is better for me .
Why? Polymarket only cares about lmarena ranking and I use gemini pro on daily basis without any problem
chatgpt is guessestimating 15% and gemini is guessestimating 20%
and Google own around 10% of Claude. I thought difference would be much bigger.
06-05 is it goldmane or redsword?
goldmane, confirmed by CEO on twitter
when is kingfall
You have the proof from the web dev Arena ?
most likely never 😄
The kingfall didn t fall today 🙂
is kingfall deepthink parallel cot?
Ah found
i used it without think
I have my doubts. It doesn't look that easy to cook something in this area
u missed on some wealth
how is function calling for the latest Gemini model? It used to be bad
bought some at 78c😋
brian is talking shh
BASED
when will sam fall?
with king fall 😛
and when will sam rise
you are going to be fired if you reveal too much
vvery cool thanks
please dont
chillax
word chillax detected, i will now delete messages
respect brian
word respect detected, reposting messages
@civic flame gemini hasn't solved it, im feeling grok 3.5
its not
oh
u can set thinking budget
im at 8192 default
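For anyone setting the thinking budget through the API rather than the AI Studio slider, here is a minimal sketch of the request body. The field names (`generationConfig`, `thinkingConfig`, `thinkingBudget`) are written from memory of the Gemini API REST reference and should be treated as an assumption to check against the current docs. The helper name is hypothetical and nothing here is sent over the network.

```python
import json


def build_generate_content_body(prompt, thinking_budget):
    """Build a JSON-serialisable request body with an explicit thinking budget.

    Assumption: the REST field names below match the Gemini API docs;
    verify them before relying on this.
    """
    if not isinstance(thinking_budget, int):
        raise TypeError("thinking_budget must be an integer token count")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }


# 8192 is the default budget mentioned in the chat; the prompt is a placeholder.
body = build_generate_content_body("placeholder prompt", 8192)
print(json.dumps(body, indent=2))
```

Swapping 8192 for a larger value is then a one-argument change, which makes it easy to rerun the same prompt at several budgets.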
im trying another
drop hints ... not full paragraphs 🙂
whats fixed
u need to make ur own discord
My favorite paragraphs😭
and charge huge
Bruh those deleted messages are only I think I think I think and other uncertainties
Hows that delete worthy
craigbench, oai at 100%, gemini at 0%
LMAO
Does anyone have any code examples from 06-05?
goldmane model is stuck in infinite loop for this question ... thinking for 4 min now
I'm going to finish my project with claude 4 opus thinking and gemini side by side now
Claude 4 been doing good but I want to see if they fixed the gemini coding annoyances
lol why does it keep getting worse at swe bench
Missed the opportunity b4 it got deleted 🥲
yup, i just set my thinking budget to 32k
still running
I got 4041 after 4 min thinking... answer is something like 3000?
ohhh ffs
yeah nor could kingsfall
damn rlly
yeah it was one of the few prompts i tried it with
but tbf, thinking budget was default 8192 for kingfall?
wow
not grok 3.5 but grok 35
Bro he didnt even answer any question
Unless.. all the "i think"s and guessing stuff is just ACTING clueless but its in fact real and confirmed?????? 😱
whats the prompt?
There are 2022 users on a social network called Mathbook, and some of them are Mathbook-friends. (On Mathbook, friendship is always mutual and permanent.)
Starting now, Mathbook will only allow a new friendship to be formed between two users if they have at least two friends in common. What is the minimum number of friendships that must already exist so that every user could eventually become friends with every other user?
with tools*
ans = 3031
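The 2022-user case is far too big to brute force, but tiny versions of the same problem can be checked exhaustively. The sketch below tries every starting friendship graph on n users, smallest first, and applies the "at least two friends in common" rule until nothing changes. The function names are made up for illustration, and this only checks small cases, it is not a proof of 3031.

```python
from itertools import combinations


def closes_to_complete(n, edges):
    """Apply the two-common-friends rule repeatedly; True if everyone ends up friends."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u, v in combinations(range(n), 2):
            # A new friendship needs at least two mutual friends.
            if v not in adj[u] and len(adj[u] & adj[v]) >= 2:
                adj[u].add(v)
                adj[v].add(u)
                changed = True
    return all(len(friends) == n - 1 for friends in adj)


def min_initial_friendships(n):
    """Smallest number of starting friendships that can grow into the complete graph."""
    all_pairs = list(combinations(range(n), 2))
    for k in range(len(all_pairs) + 1):
        for edges in combinations(all_pairs, k):
            if closes_to_complete(n, edges):
                return k
    return None  # not reached: the complete graph always works


for n in range(3, 6):
    print(n, min_initial_friendships(n))
# prints:
# 3 3
# 4 4
# 5 6
```

Those three values are what ceil(3n/2) - 2 gives, and the same expression at n = 2022 is 3 * 1011 - 2 = 3031, so the small cases at least agree with the quoted answer rather than with 4041 or 4044.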
0 consistently
nebula (NOT released 2.5 pro for some reason) sometimes got it
that's the only model
that's because it was cheating lol
its 2022 usamo problem 6
tool usage is fair except web search
so yes.
until it pulls up evan chen's work
lol
vibe-wise i quite like this new 2.5 pro
it makes some actually visually pleasing frontends that don't feel like generic slop
agreed.. I used to like Gemini answers but now structure and presentation and tone are better too
html + css only
check the code
https://artofproblemsolving.com/wiki/index.php/2022_USAMO_Problems/Problem_6 hmm someone wrote 4044 as the answer
is it ass cheeks
u realized deeperthink mode has to search the web 😭
uhmm, how about this?
opus 4 is the best model in terms of actually nice code, but i still prefer the end result of 2.5 pro
fair
that would be funny, the models keep getting 4044
oh that's weird
one second
ya the fact opus is still at 79% swe, and 0605 at 67.2%, but aider 82% and 72% opus, fishy
look at the comments on this
i honestly did not realise this was a USAMO problem
they're entirely different things tho
its really both just code tasks
i love that sub
samesies
its funny there are a few people who just post oly problems there
thats copied from the aops discussion
wow, goldmane is confidently wrong.
I said "answer is 3031. prove it"
And reply was (last line)
"Therefore, despite your suggestion, the mathematical proof confirms that the minimum number of friendships is 4041."
🙂
😤
the initial presupposition of different coding strengths affirms this regardless tho? the fact that they simply ARE good at different coding things in practice, should surely be considered
wow brian left the server
he didn't
ping him
he's still in my heart
add him
💔
great
each solution i find comes up with a different end result lol
true, hopefully kingfall unanimously dominates both
Is this my fault
wanted to ask when will there be a claude max equivalent for gemini 😭
yes
yes
yes
yes
i trust evan chen
i think that's inevitable, and btw I don't think kingfall is unanimously better than goldmane, although in spatial tasks it seems to be like basically agi
🤦
brian is going to come back under a different alt
Guys if we all show enough love we can get him to join back
i prefer that than the "sorry, you're right..."
kingfall svgs >>>>>>
its ur fault hahaha
but that's pretty much it imo, the insane nuance in how they speak, the details
are there
Yes but multiple people showing love beats only one person (me) doing it!
well u can unleash ur treasure trove stash now
no
But my nu des ae private. I'd prefer not to share them here
Also because that would be against the rules
oh true
btw without thinking budget turned on, is it just how much it wants to think?
again, there is a non thinking mode
iirc it's just only on ai studio
whoops
yeah nevermind it's just bugged
well you can on the frontend it just throws an error
thats what he meant bro
yes i know dawg
i just didn't realise that because i thought the reason it was throwing an error was it was overloaded initially
why mine is 24576?
Aspiring to be this delusional
yea tbf oai still has the edge rn
even if kingfall drops, they'll just drop o4 and o5 mini high
damn
ion think this would be the case, especially before gpt 5s imminent release
đem
you'd have to treat Gemini deepthink with the same attitude
which to me, is obviously just selective tbh
they always change their plans
Ugh, really wish this server didn't have those disrespectful people. I'm going to miss Brian ||and his insider information||
o3 pro wasn't supposed to make it
it's "technically" bannable but thats nonsensical
no reason to report them or do anything about it
(No one ever got banned for this unless they were a big name so good luck)
? where are you from
Anyway but I agree I also deleted the messages and wasn't going to do it again but oh well rip
not from vietnam
vai cac
i dont understand
same
Now let's get brian back
w craiggers
mentally lacking
Apple can do it, but they have no AI lab as big as xAI or OpenAI
Its sad Apple didn't join the AI race
They have the resources to do so
every week its like that 😭
he said
'fixing bot replies'
....
google just released another model
veo 3
imagen 4
imagen 4 ultra
whisk...
Sry for the slow response, currently on mobile.
something def went wrong internally
true craig just gets butthurt or baited like realllllly easy it's fun bro lmaoo
almost a month delayed
but whatever in this case it was not invalid reaction. they got revenge on the guy for making brian leave so that is fair
yea
holy
thats what im thinking
sht
nahh you cried lmaoo
OMG
you should try kingfall
y'all gotta see this
you have access?
nah. better.
gork 3.5 release???
oh?
So if goldmane is 82% on aider, then which model was 86%?? Was it kingfall or something else?
Idk why everyone hype
Where on web dev? Or on arena
i have private access
i can send link in dm
thanks
oh the choice of 'they'
i see what you did
lol
yea
you sleep well
8h sleep = sharp memory
tf
no way
thats unhealthy
ppl are weird
shrug
That's why we can't have good things. I am quite sure Google disabled thinking because of deepseek
it was kinda obv they will distill from diff models
its the fastest way and easiest road
too much liability for a big company
ye
and also, Google = data geeks
they want the cleanest data
so even if they did, it's not really going to be that meaningful
im here
@deep adder help
what secret
what
its kingfall
not fallking
i dont have fallking
im sorry
just go to twitter and type kingfall
wen kingfallfurther?
there is a chinese guy who shared the link just recently
it works
havent checked the code tbh
im just using it
what if????????
omg
stop it
imagine...
nah its not
its just kingfall
just random things
it works, just like you're accessing api endpoints at google internally
share please... how do I test otherwise
wait where do i get it
ok not gonna login with my google account 😭
i mean does it match kingfall of yesterday?
yea
wow
its the same model
its gemini 3
probably first checkpoint of gemini 3
because logan said goldmane is the stable gemini 2.5 pro ver
its quite decent
whats the context size
lets talk about o3 pro
where is it?
hmm i just checked the model selector
im blown away 😭
no my oai model selector
no
i thought it would come out thursday
they had everything set up, like why
you didnt thank me
thank u so much
thank uuuuuuuuuu
for the oai model selector
yes
thank me
thank me
thank u
thank u
yes
is 0605 better than 0325
How does the new version compare to sonnet and opus?
ye
btw what's the highest we've seen for the simplebench sample questions?
@torn mantle @civic flame
gonna try this