#general


torn mantle
#

simpleQA and GPQA

rare python
#

Seems specialized

torn mantle
#

so it should have some decent world knowledge

cedar tide
#

96% on SimpleQA without a search tool? it's impossible

#

fake

civic flame
#

bro calls everything fake

torn mantle
deft vigil
#

did everything will be eaten by gpt 5

cedar tide
# civic flame bro calls everything fake

They are unknown and they must not have a lot of money. Do you believe that they created an LLM from A to Z with a 96 on SimpleQA without access to a search tool?

reef bridge
#

is there MCP feature on LMarena? it would be cool to test out models of how good they are with MCP

civic flame
#

i trust them 🤷‍♂️ suit yourself

#

just because they're not a big lab that doesn't mean they can't make big advancements - they've been doing some cool math-related stuff in the field for a year

cedar tide
#

@civic flame I'm not saying they never got this score, but that they forgot to specify that this score is with a search tool

#

@civic flame that it is strong on GPQA is possible, maybe they found a new reasoning technique, but SimpleQA is just knowledge, so to get 96 they would need a model 10 times larger than our SOTA, and it is impossible that they did that (or they trained it specifically on SimpleQA, but that would be stupid)

#

even perplexity deep research has only 94

civic flame
#

nvm this isn't from harmonic this is autopoiesis

torn mantle
#

im confused

#

is it the same model or nah

#

Aristotle X1 Verify pass@1 benchmark results.

civic flame
#

they're different lol

#

that's what had me confused

#

they both have models called aristotle

torn mantle
#

what did harmonic name their model

#

bruh

civic flame
#

aristotle

#

lol

torn mantle
#

yea im not trusting that either

#

i thought its from harmonic

cedar tide
civic flame
#

this is their only post 💀 😭

cedar tide
#

Ah now you change your mind

civic flame
#

again, i thought it was about the harmonic model

cedar tide
#

okk

woven thicket
#

Hey can some one teach me how to use it

#

Basically I joined today

#

Hey can anyone listen me

#

Tell me how to use it

cedar tide
#

7 employee

astral kayak
#

I call bs

#

or it's really bad at other stuff

pure anvil
#

They're not claiming that tools aren't being used though

torn mantle
#

small business strategy

cedar tide
#

🇰🇷 LG recently launched EXAONE 4.0 32B - it scores 62 on Artificial Analysis Intelligence Index, the highest score for a 32B model yet

@LG_AI_Research's EXAONE 4.0 is released in two variants: the 32B hybrid reasoning model we’re reporting benchmarking results for here, and a smaller 1.2B model designed for on-device applications that we have not benchmarked yet.

Alongside Upstage's recent Solar Pro 2 release, it's exciting to see Korean AI labs join the US and China near the top of the intelligence charts.

Key results:
➤ 🧠 EXAONE 4.0 32B (Reasoning): In reasoning mode, EXAONE 4.0 scores 62 on the Artificial Analysis Intelligence Index. This matches Claude 4 Opus and the new Llama Nemotron Super 49B v1.5 from NVIDIA, and sits only 1 point behind Gemini 2.5 Flash

➤ ⚡ EXAONE 4.0 32B (Non-Reasoning): In non-reasoning mode, EXAONE 4.0 scores 51 on the Artificial Analysis Intelligence Index.…

deft vigil
#

just got a potato lol is it open ai right

cedar tide
tame palm
shy dustBOT
#
Server Information [ LMArena ]

Name: LMArena
  ID: 1340554757349179412
  Description:

LMArena is an open platform where everyone can easily access, explore and interact with the world's leading AI models. Community shaped leaderboards help progress AI in a more transparent and grounded in real-world user way. Come join our community to explore and shape the frontier of AI.
Owner: @wooden mulch
Features: vanityanimatedPicturesplash
Creation: <t:1739683560:R>
Channels: 286
  Text: 28
  VC: 3
Members: 6779
Roles: 26
  Managed: 4

deft vigil
#

dino never finish generating is it that slow ? or simply not functioning now

keen beacon
hardy pecan
#

Simple Bench - Horizon Alpha: 3 / 20

#

Beautiful

tame palm
#

car

deft vigil
#

now nightride-on

keen beacon
deft vigil
#

never tried zenith, does it still exist?

keen beacon
outer zinc
#

is there a video generation leaderboard yet?

civic flame
#

👀

blazing bison
#

Everything is ready for next week

unborn ocean
#

just a bunch of PR

#

for higher evaluation

harsh sonnet
#

for how long do we get this video generation for free?

torn mantle
safe falcon
#

What are the SOTA research models available?

#

I'm trying to find an affordable solution for deep research that doesn't hallucinate much

#

For example perplexity hallucinates and tends to believe report mills

#

Also, question, does LmSYS do categorization on AI tasks through leaderboards

echo aurora
cedar tide
#

🚀 Announcing Step 3: Our latest open-source multimodal reasoning model is here! Get ready for a stronger, faster, & more cost-effective VLM!
🔵 321B parameters (38B active), optimized for top-tier performance & cost-effective decoding.
🔵 Revolutionary Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD) enable efficient inference—even on modest GPUs.
🔵 Trained on 20T+ tokens (incl. 4T multimodal), with meticulous data curation ensuring reduced hallucinations & robust reasoning across vision and language.
🚄 Unmatched speed: Up to 4,039 tokens/sec/GPU—70% faster than DeepSeek-V3 under similar conditions.
💎 Step 3 sets a new Pareto frontier—bridging power, efficiency, and practicality.
👉 Start building with Step 3 today: huggingface.co/stepfun-ai/step3
👉 More details on our research blog:
www.stepfun.com/research/zh/step3

💬 8 🔁 19 ❤️ 86

#
StepFun

StepFun AI is your smart and reliable personal assistant, here to help you acquire knowledge, find information, learn languages, unleash creativity in writing, and even write code. Whether you’re working, studying, or just navigating everyday life, it’s designed to solve your problems and help you discover and understand the world around you.

stray aspen
#

we need new SOTA models

cedar tide
#

🦥 Qwen3-Coder-Flash: Qwen3-Coder-30B-A3B-Instruct
💚 Just lightning-fast, accurate code generation.
✅ Native 256K context (supports up to 1M tokens with YaRN)
✅ Optimized for platforms like Qwen Code, Cline, Roo Code, Kilo Code, etc.
✅ Seamless function calling & agent workflows

💬 Chat: chat.qwen.ai
🤗 Hugging Face: hf.co/Qwen/Qwen3-Coder-30B-A3B-Instruct
🤖 ModelScope: modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
🔧 Qwen Code: github.com/QwenLM/qwen-code

**💬 6 🔁 5 ❤️ 25 👁️ 259 **

thorny sleet
#

hey can anyone help me

with the process to generate videos here?

echo aurora
earnest rover
#

when lmarena video gen available on lmarena.ai base site ?

echo aurora
earnest rover
echo aurora
delicate atlas
#

Can you not attach images to claude sonnet and opus models anymore?

supple bluff
#

hello

keen beacon
#

Zenith still in arena

earnest rover
blazing bison
burnt halo
#

hello, I saw the vid on thread lol

primal orbit
#

I've got a model called "cuttlefish"

ebon patio
#

Hi

misty star
#

🧑‍🔬 Research Update: Today, we are releasing a new dataset with over 140k conversations from the text arena collected between April 17th and July 25th 2025. See thread to dig into it!

We're pairing the data release with a deep dive into how model performance and evaluation dynamics have evolved over time. Let’s look at real-world trends, new features, and fresh prompts.

What’s covered in the latest analysis:
- Overview of the released dataset
- Language & topic breakdowns
- Rating changes: How Arena scores shift over time

And more! 🧵

**💬 1 🔁 1 ❤️ 5 👁️ 66 **

#

🤑

golden ocean
#

rip my ai girlfriend conversations

tribal aspen
#

#announcements the new one sounds more like the AI Companies will be more interested in knowing

cursive spoke
#

What models are there in video arena

tribal aspen
#

and it would take decades to learn them

misty star
golden ocean
#

true

golden ocean
whole sundial
civic flame
#

great start

whole sundial
#

lol an entire category section for a single model

misty star
whole sundial
tall summit
#

nothing in the article makes it seem that there are conversations in the statistics and dataset from direct chat

#

i don't even know if there are or not

tall summit
stray aspen
#

they leaked the group chat

tall summit
tall summit
golden ocean
#

I cant even find my chats wtf

#

oh

#

theres no direct chats right?

keen beacon
#

lmao

tall summit
#

as in hopefully he didn't mention private information

keen beacon
#

if people get confused

stray aspen
#

bro who leaked the group chat

golden ocean
#

sydney is in my head

fleet lintel
stray aspen
#

yes

#

they leaked conversations

keen beacon
tall summit
#

"leaked" xd

keen beacon
#

it's not a private service

#

:/

brave orbit
keen beacon
brave orbit
#

bro pls stop saying hi and hello in other channels every one says hi in other channels then the general

#

mods ban that guy

#

bro you username

#

give me more info since ok

#

yeah i can not help you without info

#

just what you need

#

for it to do

#

say it or less i can not help

#

bro you had a swear word in you username bro you re

#

how can you ever join

fresh latch
#

Hey! Anyone know how I can try the Horizon Alpha Model? Supposedly there was a way to do it in LMArena.

whole sundial
#

openrouter chat

fresh latch
#

Thanks!

torn mantle
humble oyster
#

hey i just wanted to know: there's no limit on image generations, but video generation has limits. is there any possibility that video generation will also be unlimited?

jade egret
#

when gpt 5 ):

echo aurora
prime mulch
#

Bro does the video generator have a limit?

humble oyster
prime mulch
#

😭

#

I can understand, but even though the video is generated for 8 seconds, the actual scene is only 4 sec

torn mantle
torn mantle
prime mulch
jade egret
#

what the most fun model to talk with

keen beacon
dawn wharf
marsh stratus
#

is gemini-2.5-flash-lite not going to be in the leaderboard?

brave orbit
#

what's the best model for coding in C++ and Python, and for math? what should i use? pls help

#

help me

brittle tiger
#

Horizon Alpha gets math problems wrong that o3 never messes up.

brave orbit
#

and also coding, just tell me what's the best and why @here

#

@here help me, tell me what's the best model for math and coding pls help me

willow grail
#

lmarena is a company?

#

thought this is a hobby project

echo aurora
torn mantle
#

reading some prompts from the dataset

#

people really ask all sort of things

reef pawn
blazing bison
keen beacon
torn mantle
blazing bison
#

It's good for open source models. You know it's going to be shared publicly. If you share more than you should, it's your fault.

#

If you feel bad about using it, then don't

echo aurora
cedar tide
#

New Arena Models

- velocilux

- cogitolux

torn mantle
#

dont tell me what to do

cedar tide
#

@torn mantle try the new models and tell me what you think of them

upbeat aurora
#

hello

#

animate the image

echo aurora
keen fulcrum
echo aurora
keen fulcrum
#

Thanks appreciate it. The UI improved a lot over the last months

#

I propose making Search, Image, Video and Webdev Arena available through three major buttons to increase visibility. I attached a possible concept.

#

It's unclear that those buttons lead to a different arena

#

You may add a webdev arena button as it's currently deployed on a separate platform

Additionally I propose adding tooltips to the leaderboard explaining how Rank, CI and Elo are determined

primal girder
# echo aurora We do apply aggressive PII filtering.

But when browsing the dataset, you can see that some of the published prompts include personal information. Ppl sometimes do stupid things like putting some files in without properly erasing all the personal info. 🤣 I know the TOS specifies the rights and responsibilities and stuff. But maybe if there could be a way for users to choose to remove some of their prompts from a public release, it might be nicer?

echo aurora
#

correct

echo aurora
#

I don't know tbh, regardless I have been sharing these concerns with the team.

torn mantle
#

they probably filtered that out

frigid coral
white hatch
#

guys, did you notice that gemini started to repeat itself after some point, or am i tripping?

torn mantle
#

Its been consistent to me

#

I questioned myself if gemini 2.5 flash got even better

white hatch
#

I have 2 chats with that problem, maybe ran out of tokens idk

#

I use gemini 2.5 pro

echo viper
#

Video limit was 10 yesterday and 8 today. We have 4 more days left to limit 0. Hurry up.

flint sandal
#

I swear I wrote a message in LMArena feedback asking for a video generation arena. And now it is real (I thought they would make it in the webapp)

leaden palm
#

what happened in #leaderboards? did this server get fake airdrop raided, openrouter-style?

echo aurora
half nimbus
#

hello, I'm new here. I'd like to get the most out of LM arena but I feel like I'm swimming in the deep end without floaties. When y'all got started how did you leverage it?

mellow frigate
half nimbus
#

I would like to fine tune my skills as a prompt engineer

cedar tide
#

Thank you, that's exactly enough.

torn mantle
#

ah its on webdev

#

well you asked for it xd

brittle tiger
leaden palm
#

kath is the jules... gal i guess

brittle tiger
barren prairie
patent aspen
#
Poll: Will GPT-5 launch before Deep Think?

Winning answer: Yes (12 of 22 votes)

barren prairie
#

When our prompts are shared publicly, we will laugh a lot 😂😂😂🤣🤣🤣
Can't wait to read them.

tall summit
stray aspen
#

Can we use the study mode prompt on other Ais

hallow ridge
#

I have an instagram account with over 300k and I don’t want it anymore

leaden palm
keen beacon
leaden palm
forest prism
#

Which LLM is the best street smart? I think adding a leaderboard in LMArena for it would be sick

leaden palm
forest prism
# leaden palm what kinds of prompts would you classify as "testing street smarts"?

Something like this but advanced

“Your storefront is dead, but the parking lot next door is packed. What scrappy move might get you foot traffic?”
“You get your first bad online review — and it’s unfair. How do you respond publicly without looking defensive?”
“Your competitor just undercut your pricing. You can’t afford to match it — what do you do to stay in the game?”
“You’re launching a new product and have no ad budget. How do you create buzz with zero dollars?”
“A VC firm wants equity in exchange for mentorship, not money. Worth considering?”
“You’re about to go into business with someone who talks big but avoids putting anything in writing. What’s your move?”
“An early client wants a deep discount in exchange for ‘exposure.’ What questions should you ask before agreeing?”
“An employee you trust starts showing up late and missing deadlines. How do you handle it without losing them or getting walked over?”
“You have $5,000 left. Do you spend it on marketing, product development, or paying a debt collector breathing down your neck?”
“A supplier offers a ‘limited-time’ bulk discount, but you haven’t even sold your first batch. Do you go for it?”
#

It's subjective but I think that's why LMArena battle mode exists

forest prism
#

Yes, definitely on the non verifiable domain, but very useful tho

#

Interesting, Why don't you think it's useful?

#

I see where you're coming from, I think I'm on the entirely opposite camp, I believe in achieving singularity as the end goal not us being the bottleneck

#

Absolutely, that's the greatest outcome imo. I wonder what you have against it?

#

What's the gap that won't let it happen? like what would you say is the "missing/never will happen" component

#

we aren't there yet is different than it won't happen, won't happen means that there is a component that is impossible preventing singularity from ever happening

stray aspen
#

Craig do you think the new glm 4.5 is good

forest prism
#

What's that component?

stray aspen
#

@cedar tide David, do you think the new GLM 4.5 is better in one way or another than the other SOTA models?

last verge
#

Is it possible to make a story video with a consistent character?

wicked root
#

I keep getting rate limited on gemini

echo aurora
wicked root
#

ofc I do LOL

#

I got it to code 50+ times today within a span of 3 hours.

#

plus a whole bunch of CS questions

brave orbit
nocturne sparrow
#

Potter tying clay pot on tall bamboo pole, king and his sons failing with arrows, spectators watching with tension, mid-range shot, lateral pan motion with slow push-in on shattered hope in faces

static lark
echo aurora
static lark
#

better to just not use it for studying

#

and use the model normally

#

it walking through the concept with you takes longer than if it just explains it clearly and directly

verbal nimbus
#

I heard they merged LearnLM with 2.5 Pro

#

As long as you stick to a traditional syllabus, it's pretty great. For non-standard stuff, it's less on-point.

patent aspen
#

Seems like more of the same story. Apple has no comparative advantage in AI, but they own the world's best real estate. They'll continue being a luxury real estate company for as long as it works.

whole sundial
#

the members of this HF team are all OAI employees btw...

verbal nimbus
digital umbra
# whole sundial

it almost feels like the leaks are intentional, first the gpt-5 entrypoint and now this lol

#

well, i think leaks are more credible than sam altman's tweets anyway

brave orbit
willow grail
#

its free and at level of other sotas. its not best for swe.
but opus 4 is still trash for vibe-swe. money waste.

whole sundial
sacred quail
#

Best at analyzing videos

#

It doesn't even have any competitors

#

Other models read the text of videos, while gemini literally watches the whole video frame by frame for hours and can give you detailed and specific outputs

#

People still don't know how useful this is

#

Analyzing video is a bigger thing than analyzing pdfs

#

Gemini needs its own benchmark just for this
Also for analyzing pdfs or text gemini is still the best because it's the best at long context

brave orbit
#

And say why

#

it is better

#

say why it is better in a message

paper vault
#

The summer of 1305 finds William Wallace crouched in the dense undergrowth of a Scottish forest, his once-proud frame now gaunt from years of constant flight. The man who once commanded armies and negotiated with kings now lives like a hunted animal, moving from shadow to shadow across a homeland that no longer recognizes his authority. His weathered hands, scarred from countless battles, grip a simple dirk—the only weapon left to Scotland's former Guardian. Seven years have passed since Wallace, "A medieval storybook illustration of a grim knight riding a horse through a peasant village, peasants looking frightened, castles in the misty hills in the background, detailed faces, realistic proportions, dramatic lighting, vintage painting texture, inspired by oil painting and watercolor, muted earthy tones, [additional scene-specific detail here]"

primal girder
viral hamlet
#

LMARENA I DIDNT KNOW YOU HAVE A DIS I LOVE YOU GUYS

gleaming pagoda
#

And I have no way to check my vote result either, although I have voted more than 5 times.😭 Is that a bug or something?

lucid jacinth
#

Hi

dull raptor
#

Hello...

verbal nimbus
vast hound
#

Some of the user's requests in dataset are funny:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Are hamsters made of ham?",
        "image": null,
        "mimeType": null
      }
    ]
  }
]
sacred quail
#

Also if you select lower resolution and lower fps like 0.5

#

You can use it for muuuch longer videos

#

Sometimes it gives an error but just resend the message

#

It could take several minutes if the video is too long, just be patient

#

I'm asking for summaries, asking for time stamps

#

asking "did they talk about this"

#

asking "did this guy laugh or get scared, and at which minute?"

#

Yea, it can analyze facial expressions too

#

Basically everything that takes hours if you do it takes seconds when gemini does it

#

Also you can make subtitles too, but i don't recommend trying it in one shot; instead translate in 20-minute parts, you can also select time ranges. And like i said, gemini doesn't only listen or read, it literally watches videos frame by frame, so subtitles will be more accurate because gemini can see what's happening on screen at that time

unborn ocean
#

info on openai's bigger oss model:
128 experts - 4 active -> very efficient
120b params - 5b active
4k initial context window - 128k current -> not horizon-alpha or what ever it is called (?)
trained in FP4 -> should only run on blackwell (?)
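
A quick back-of-the-envelope check of those leaked figures, as a minimal sketch. The numbers are the unverified ones quoted in the message above; the 4-bit weight footprint is a rough assumption that ignores activations, KV cache, and any layers kept in higher precision.

```python
# Unverified figures copied from the message above.
TOTAL_EXPERTS = 128
ACTIVE_EXPERTS = 4
TOTAL_PARAMS_B = 120.0   # billions of parameters
ACTIVE_PARAMS_B = 5.0    # billions of parameters used per token
BITS_PER_WEIGHT = 4      # assumption: every weight stored in FP4


def fraction(part: float, whole: float) -> float:
    """Share of `whole` represented by `part`."""
    return part / whole


expert_share = fraction(ACTIVE_EXPERTS, TOTAL_EXPERTS)
param_share = fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
# 1e9 params * bits / 8 bits-per-byte = bytes; dividing by 1e9 gives GB.
weights_gb = TOTAL_PARAMS_B * BITS_PER_WEIGHT / 8

print(f"experts routed per token: {expert_share:.2%}")
print(f"parameters active per token: {param_share:.2%}")
print(f"approximate weight footprint: {weights_gb:.0f} GB")
```

So only a few percent of the network would run per token, which is what "very efficient" refers to, while the full set of weights still has to fit in memory.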

reef pawn
brittle gull
#

chemistry: please describe all of these topics

mortal coyote
#

why the website image generator - GPT-1 is so slow ?

golden ocean
#

Are hamsters made of ham?

hardy pecan
#

gemini 2.5 deepthink is out for ultra members

marsh sundial
#

any screenshot?

reef pawn
hardy pecan
reef pawn
#

Nooo

#

😢

reef pawn
sour spindle
#

Very quiet release

earnest rover
#

i have a question to LMARENA guys. apart from video gen ai, when you guys will add temperature or token in/out settings in text models. also what about image ai. when we can control the image temperature or gradience.

marsh sundial
latent patio
#

any free ai image generator with no price unlimited other than LMarena ?

main gulch
#

the exact limit wasn't published afaik

marsh sundial
#

I am ultra subscriber

echo aurora
earnest rover
#

am i missing something. what is so special about deepthink

graceful sable
keen beacon
paper nimbus
lone vector
stray aspen
#

deepthink is out

sour spindle
patent aspen
#

The IMO benchmark is just sassy lmao

autumn nacelle
#

@latent patio yo i am also looking for free ai image generator with no price and unlimited just like you all i know is freepass ai and i think its a bit bad. Wish someone made a list of free ai image generator

#

Something went wrong with this response, please try again. Also I Get This error when trying to create images in LMarena site any solutions ?

stray aspen
#

lmarena

blazing bison
#

Can't be real

#

10 rpd

stray aspen
#

what a daylight robbery

blazing bison
#

hopefully gpt-5 is better

#

funny if gpt-5 could do the same for $20

#

Grok 4 Heavy is a bigger scam than Google

stray aspen
#

ultra

#

they should have compared with grok 4 heavy and o3 pro high

patent aspen
#

My favorite is o3 being below IMO bronze level lmao

blazing bison
#

o3 is from december 2024

#

there are optimizations yes, but it's an almost 1 year old model in general

whole wagon
#

@patent aspen so this deep think is actually 2.5 ultra base model behind the scenes?

blazing bison
#

with deepthink they mean like 50k tokens of thinking

blazing bison
#

so it's not one

jade egret
#

guys

#

is deep think good

blazing bison
#

we have a bunch of gemini 2.5 pro with a lot of reasoning tokens enabled and one of them decide the best answer
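
The idea described above (several samples in parallel, then one pass that picks a winner) can be sketched roughly as follows. This is only an illustration of the commenter's guess, not a description of how Deep Think actually works. `fake_model` is a hypothetical stand-in for a real model call, and a majority vote is swapped in for the judge because it needs no second model.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_model(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one long-reasoning model call."""
    rng = random.Random(seed)
    # Usually returns "42", otherwise a random single digit.
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))


def majority_vote(candidates: List[str]) -> str:
    """Stand-in selector: keep the most common candidate answer."""
    return Counter(candidates).most_common(1)[0][0]


def parallel_answer(
    prompt: str,
    n: int,
    model: Callable[[str, int], str] = fake_model,
    select: Callable[[List[str]], str] = majority_vote,
) -> str:
    """Run n independent samples concurrently, then select one answer."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: model(prompt, seed), range(n)))
    return select(candidates)


print(parallel_answer("What is 6 * 7?", n=8))
```

Swapping `select` for a function that asks another model to judge the candidates would be closer to what the message describes.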

whole wagon
#

Nobody even attempted to put it as a model request kek. I assume it will be rejected but worth a shot at least

lime coral
#

GPT5 < IMO GPT < Deep Think IMO

blazing bison
#

deep think = 50k thinking tokens

patent aspen
lime coral
#

I speak facts cry

#

When did I say something different?

patent aspen
#

I mean he's absolutely right if we're talking about math only. Otherwise yeah I'd disagree

lime coral
#

Math, coding

blazing bison
#

apparently no openai livestream today so no open source today

lime coral
#

Coding

whole wagon
#

Why are they sitting on these models so long lol

#

Like just release it already. The open source they had for ages

jade egret
#

will gpt-5 be better than deepthink?

#

why

leaden meteor
#

Is Deep Think even going to be in Arena to test against GPT5?

barren prairie
still matrix
#

Someone with access to gemini deepthink, can I give you a highly complicated clinical case question no other model can solve to check its answers?

Please i beg you it's really a thrilling misery case no machine can solve and humans are struggling too

Edit :mystery *** I'm actually a friend of the clinical case and we've been baffled for months without an answer

gentle plinth
#

Dafuq

#

Why admins deleted it

ocean vortex
# lone vector

All of those gains compared with the initial deep think announcement can be easily attributed to base model update (06-05 vs 05-06) in my book

#

And their initial release:

#

(different LCB range)

willow grail
fleet lintel
# lone vector

numbers look good.. but i have learned to not get hyped before more confirmations.

Are they actually good?

willow grail
#

@leaden palm

ocean vortex
#

They sneakily did not include USAMO this time at all lol

lone vector
still matrix
lone vector
still matrix
ocean vortex
# lone vector

Nothing really insane about this tbh. Just parallel compute

fleet lintel
#

you have ultra access?

ocean vortex
#

You could have done this yourself after 06-05 was released with some coding, I believe

patent aspen
fleet lintel
still matrix
# patent aspen Yeah

I Will be eternally thankful i really will it's a very critical situation we are trying to solve here God bless you

fleet lintel
#

deep think probably takes like 5 min to answer any query

leaden meteor
torn mantle
#

Kingfall was faster

#

So we can assume that its gemini 3.0

torn mantle
#

Okay

patent aspen
#

My thoughts on Deep Think are that it's probably not something that 99% of chatbot users need, but the remaining 1% could have a categorical improvement in capability

#

E.g. mathematicians, scientists, critical medical situations, distributed systems problems, logistics, leading edge HFT firms, etc

torn mantle
#

Is it using something similar to kingfall as instruct model?

#

But why does it look worse than kingfall

analog raptor
#

@deep adder Grok is bad!

sweet tinsel
civic flame
#

yeah in terms of frontend design it did worse than base 2.5 ultra & 2.5 pro

#

but it had no bugs

primal orbit
#

Does deep think output long detailed answers like deep research?

#

or more concise like o3?

whole wagon
#

like we have the weights just not the inference to run it lmao

ocean vortex
#

It's just 06-05 with minimal changes if any at all + parallel compute. This is my conclusion thus far judging by what I saw until it can be proven otherwise. They barely showed any metrics at all, and those that they did showed similar gains to the 05-06 initial deep think.

patent bane
#

can I send some prompts to test?

#

I canceled my ultra plan a month ago

patent aspen
ocean vortex
patent aspen
ocean vortex
warm fulcrum
#

what does rpd mean?

whole wagon
#

for the length of time it thinks you cant really do many requests per day anyways

#

like each takes 10+ minutes

ocean vortex
#

For the amount of noise they made, essentially promising to release IMO gold medal model, this is kinda a disappointment

primal orbit
#

damn

warm fulcrum
#

so what's so bad about the deep think model besides the requests per day limit?

#

is it not worth it at all

ocean vortex
#

Perhaps but as things stand now they just released a thing that was supposed to be live months ago. Only based on a slightly newer model now lol

patent bane
#

oh yeah

civic flame
#

you just wasted 1 of your 10 RPD on that?

#

😭

whole wagon
#

is this the next dumb test

#

after strawberry

patent bane
#

not mine

hollow imp
patent bane
#

but that does prove something

whole wagon
#

it is essentially like using an optical illusion to assess someones intelligence. it is just an artifact of the tokenizer

patent bane
#

yes but it should have used tools to calculate

hollow imp
#

Guys I'm squeezing every bit of Gemini 2.5 through custom gems

#

I've added all Robert greene books pdfs in the knowledge base

patent aspen
#

And they don't like the 10rpd limit

#

And cost

warm fulcrum
#

the computing cost must be high

primal orbit
hollow imp
ocean vortex
whole wagon
#

o3 pro has a similar limit actually. they just don't state it. barely anyone will ever hit it, so what's the point. It's just adding unnecessary worry for the user about something which is likely not relevant to their usage

ocean vortex
#

Not really lol

hollow imp
#

O3 search in lmarena search is godly for me

#

My best experience with web Searching so far

blazing bison
#

The api one is always better

ocean vortex
#

I don't see a single good reason why this should have been delayed either tbh

#

But this is not based on Ultra is it

blazing bison
#

They rushed deep think because they know that next week is openai week

opaque gull
#

guys what to do if bot doesnt answer to me in dm when i want to gen video

ocean vortex
#

If it was they would have shared more metrics, gains would be higher and wouldn't match up to 05-06 deep think gains

blazing bison
#

Yeah, just a rushed version because releasing now some people will buy the ultra plan. Releasing next week no one would buy because of gpt 5

ocean vortex
#

and they wouldn't be afraid to include USAMO like they did earlier

hollow imp
#

WHERE IS DEEPSEEK R2 COMING

whole wagon
#

I like it. Don't care much about price I just want the best. Though they should introduce a tier above for unlimited usage kek

ocean vortex
#

Source...?

Also:

If you’re a Google AI Ultra subscriber, you can use Deep Think in the Gemini app today with a fixed set of prompts a day by toggling “Deep Think” in the prompt bar when selecting 2.5 Pro in the model drop down.

whole wagon
hollow imp
#

At least gemini 2.5 doesn't hallucinate as badly as grok 4. I have very horrible experiences in lmarena side by side

ocean vortex
#

Hm... Ok if it's indeed that, why not release other metrics and focus on math where small models like o4-mini-high are known to be often better than both medium and huge sized models? Makes no sense

#

@patent aspen

hollow imp
#

Can you trick 2.5 pro into similar or atleast 40% Deepthink like performance using some prompt engineering?

patent aspen
hollow imp
#

How can I subscribe for any of these? I'm 15 years old I don't have money

hollow imp
#

Our education system is cooked

hollow imp
patent bane
#

why would i 😂

hollow imp
#

Personal experience> benchmark

narrow haven
#

Hello guys and girls

whole wagon
#

50% odds to increase the lmarena score by 7 Elo?
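
For a sense of scale, here is a minimal sketch of what a 7-point gap means, assuming the standard Elo logistic curve (base 10, 400-point scale). The ratings 1007 and 1000 are made-up illustrative values, and the leaderboard's own fitting procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Illustrative ratings only: a model 7 points ahead of its rival.
p = expected_score(1007.0, 1000.0)
print(f"win probability with a 7-point lead: {p:.3f}")  # about 0.510
```

In other words, 7 Elo is roughly a 51/49 split in head-to-head votes.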

torn bison
whole wagon
#

why dont they use the style control leaderboard

#

well i didnt even realise its a thing. its enabled by default lol

#

so they adjust the scores automatically by default?

hollow imp
#

Ayanokouji august

echo aurora
#

Yeah that's a good point and a topic we discuss internally here and there. I'll be sure to bring this up again as it's important how we structure this.

primal orbit
#

I thought models only see their part of the chat. You send a message via the site -> site uses API to send message to LLM -> API returns response to the site -> site outputs the message.

#

and the site handles 2 channels simultaneously this way
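
That flow can be sketched like this, as a rough illustration of the commenter's mental model and not of how the site is actually built. `call_model` is a hypothetical stub standing in for an HTTP request to a provider's API, and the model names are placeholders.

```python
import asyncio
from typing import Dict


async def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an HTTP call to a model provider's API."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"[{model}] reply to: {prompt}"


async def battle(prompt: str, model_a: str, model_b: str) -> Dict[str, str]:
    """Send the same prompt to two models at once and collect both replies."""
    # Each model receives only the user's prompt, never the other's reply.
    reply_a, reply_b = await asyncio.gather(
        call_model(model_a, prompt),
        call_model(model_b, prompt),
    )
    return {"A": reply_a, "B": reply_b}


result = asyncio.run(battle("hi", "model-x", "model-y"))
print(result["A"])
print(result["B"])
```

Under this design neither model has any way to know it is in a side-by-side comparison.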

torn mantle
#

whats the difference between wolfstride and kingfall tho?

#

is wolfstride like a more recent checkpoint?

rare python
astral jetty
#

I can’t even try a sample of deep think because it’s behind a 200 dollar paywall

civic flame
#

is that it?

#

probably not any good 😭

#

yeah

drowsy cargo
#

wow i didn't know so many people were cheating on dev mode

brittle tiger
patent aspen
#

I mean isn't the same true for gpt-5

sudden pollen
#

Hi!

tall summit
#

oh deepthink out

patent aspen
#

I also think Google's naming is way saner than anyone else tbh

civic flame
tall summit
sick barn
#

yo

civic flame
tall summit
#

fair

#

i just ignore deepthink because i know i wont be able to use it

#

and i dont need it

#

but its cool to see advancements isnt it

unborn ocean
#

Output: 42, CoT: hidden, summary stupid

#

Imagine

#

Searching X for „Elon Musk opinion on meaning of everything“

candid field
#

guys is the limit 10 videos or 8 ?

keen beacon
patent aspen
#

What are the current GPT-5 benchmarks? Are they verified?

keen beacon
#

we have none rn

golden ocean
wintry tinsel
#

Gpt will be head and shoulders sota when it releases, remember o series models and 4.1,4.5 are checkpoints in the development of the finished product which is 5

patent aspen
#

What are the sources on the release date being next week?

keen beacon
#

apparently horizon alpha's reasoning version got 86% on gpqa tho. it was up for a little bit, whatever that is

wintry tinsel
#

But nobody can beat Gemini cuz nobody can beat free

civic flame
wintry tinsel
#

It’s o series integrated into the regular more versatile model so it will be bordering on a major leap if it is not one itself

civic flame
#

and they began A/B testing it on chatgpt late last week

high ginkgo
#

can u translate this to simpler english for me

#

sorry for my bed england

#

thx

wintry tinsel
#

I hope that it can beat Claude on coding and writing/vibes cuz opus is expensive, slow and censored

stray aspen
#

of course it will beat claude

hollow imp
#

I've used it a lot on lmarena and I have bad experiences tbh

ocean vortex
#

So basically like deep think lol

#

or o3-pro vs o3

brittle tiger
ocean vortex
#

At least it won't be 10RPD and paywalled behind a door you need golden key for

brittle tiger
#

"The improvements won’t be comparable to the leaps in performance of earlier GPT-branded models, such as the improvements between GPT-3 in 2020 and GPT-4 in 2023"

patent aspen
#

I'm looking at the OAI help center. I see 15 rpm for o3 pro. Is that out of date?

ocean vortex
patent aspen
#

Oh that's for API

ocean vortex
#

for deep think I can't use it at all. Paying for their sub is not an option I would even consider tbh

patent aspen
#

15 requests / month

tall summit
#

I MUST KNOW

patent aspen
#

Isn't that even worse?

ocean vortex
#

I would guess most of the people using o3-pro here and there do NOT have a Pro sub. And that sub is already priced more reasonably than Gemini one

#

I think it's 'unlimited' only for o3, not the pro

#

But the fact alone that Google is competing on charging you comparable amounts of money and has even stricter limits with all their TPUs is kinda already crazy enough...

blazing bison
#

yes, on chatgpt pro requests for all models is unlimited

#

there is only limits on deep research and agent

#

and they are very reasonable limits btw

#

of all pro plans, openai offers the best one

#

claude is not good too, with weekly limits

torn mantle
patent aspen
#

IIRC the real limit for o3 pro on that plan is capped in low dozens per month

blazing bison
ocean vortex
#

o3-pro is not "unlimited", don't know their caps though...

blazing bison
patent aspen
#

Yeah it's not actually unlimited

blazing bison
ocean vortex
blazing bison
#

🤓

#

bcs before it was basically unlimited too

#

and so many people talking about claude code so

#

it was, atleast for me

#

80-100 requests a day is basically unlimited for me

ocean vortex
#

This was fine. People were still able to test and use it. But also there's no way Google's operating cost of a single instance of their large model is anywhere near that. And if they can only make it perform with parallel compute that is still on them.

blazing bison
#

vibe requests lmao

#

cringe

torn mantle
#

whats kiro

blazing bison
#

amazon

keen fulcrum
#

Agentic ide such as cursor

torn mantle
#

brian is ignoring me 🙁

blazing bison
#

amazon cursor

#

kiro = amazon cursor

keen fulcrum
#

They advertised it as unlimited before, now they're walking it back.

pure anvil
ocean vortex
blazing bison
#

like when the models get good enough i'm not gonna need to do 100 requests

keen fulcrum
#

$0.20 for spec request and $0.04 for vibe request after you used your quota

blazing bison
#

i only reach 100 requests bcs of models making dumb mistakes

ocean vortex
#

They were just about never good on value

blazing bison
#

so maybe with claude 6 the price will be reasonable

#

bcs the model will solve problems with fewer prompts

keen beacon
blazing bison
#

yeah, it was the best

blazing bison
#

it's not

#

you could, there was no rate limit

#

it's like their rate limit was not working or something

serene cliff
#

how to make video here with sound?

keen beacon
#

who thought that was a good idea at anthropic lol given their limited compute compared to other companies

whole wagon
#

GPT5 is going to cook Gemini 2.5 it's obvious. They better be working hard on Gemini 3 rn lol

pure anvil
echo aurora
keen fulcrum
blazing bison
#

and i was using it a lot with a lot of context

#

but i was paying for the $200 plan

#

now within 2 days i reached the weekly limit

keen fulcrum
#

August 28

blazing bison
#

wtf

ocean vortex
#

Honestly I think they are simply making a mistake. It's a short-sighted approach that has a high likelihood to hurt them long term and ensure it never beats chatgpt in popularity... They nuked their availability before they were in a position to do so IMO

blazing bison
#

so maybe you can unlock my account or smth since apparently you work at anthropic

whole wagon
#

They have veo3 still

blazing bison
#

veo3 is not that good

#

idk i don't think it's worth $250

#

like only if you don't pretend to generate revenue with it

serene cliff
#

is there an option for sound in video?

blazing bison
#

i spent $400 this month on 2 pro subscriptions, but they accelerated my work like 10x

#

claude max and gpt pro

pure anvil
#

what work were you doing that it accelerated 10x?

#

lol

keen fulcrum
blazing bison
#

crud basically

blazing bison
digital umbra
#

veo 3 is pretty good compared to other video models at least

#

i wonder what sota will be next year lol

keen fulcrum
blazing bison
#

i just wasted $400 bcs anthropic f*** with my account

#

apparently the unlimited plan is not unlimited

#

openai never did that, even after i abused it a lot

ocean vortex
#

Wait a sec...

#

Ultra plan was released roughly 90 days ago was it not...

#

and only first 90 days were the discounted price

#

lmfao

blazing bison
#

i considered getting one day of google ultra, google is easy to get a refund from if you don't abuse it

#

if the model is good then ok, i would keep it

#

but after seeing that it's 10 requests / day

#

🤦‍♂️

keen beacon
#

you mean ultra? pro doesnt get it i think

blazing bison
#

ye

#

ultra

#

so many names

granite jay
#

Hi

hollow imp
#

Any fireship viewer?

patent aspen
#

I think 6-8 weeks is pretty realistic

hollow imp
#

Isn't sota some openai video generation model?

keen beacon
#

🤣

patent aspen
ocean vortex
#

They expect you to pay up first, and only then receive a chance to even see if it's any good

#

Their blogpost alone is nowhere near enough to tell

#

and 10RPD you are still very constrained. So no proper testing of any kind and forget the benchmarks

keen fulcrum
ocean vortex
#

If this deep think is indeed based on Ultra, I think the odds of Gemini3 beating GPT5 just got way lower LOL

storm needle
#

it's a scam

ocean vortex
#

But it's soo weird to market huge model as math oriented one completely leaving things out like SimpleQA. Unless they used some derivative of a model meant for competing at IMO. But then it makes even less sense to use this for public release as their overall top performing model.

blazing bison
#

and they said on the announcement of the gold medal, that they would allow everyone to use the model

#

misleading

#

😆

#

when google releases something good, really good, bet on Logan marketing it

#

if Logan is in silence, then it's not good

keen beacon
#

if gemini 3 doesnt beat gpt 5 its a very bad sign for gdm tbh

blazing bison
keen beacon
#

given how its a new pretrained model and theyve pretrained two fresh model generations since 4o

ocean vortex
#

Well yeah for starters you have a "deep think" button which is only available when you have selected 2.5Pro, their previous best performing model. This strongly implies to use this for best possible performance

zinc ore
keen beacon
#

they cant host it practically

ocean vortex
zinc ore
#

No, it generalizes to many reasoning tasks

#

It's literally just a souped-up version of the deepthink they're offering

ocean vortex
#

it does "work" for all tasks, but it was still tuned for math

zinc ore
#

But "slower"

#

They claim it is SOTA at coding as well and "other reasoning tasks" as they vaguely mention

#

And the main difference is the current one offered is a faster version

#

So if current deepthink is generalized at reasoning tasks, then the other version should be too

ocean vortex
#

tbh the chances of that model doing better than your standard 2.5Pro at things like coding or your typical everyday tasks not involving math are very very slim. They trained it to perform as well as possible at IMO with no compromises while still keeping it usable.

ocean vortex
#

they do not lol

zinc ore
#

Yes they do lmao

ocean vortex
#

?

zinc ore
#

They said it is SOTA at coding and other reasoning tasks

ocean vortex
#

link

zinc ore
#

This was from a week or two ago, whenever the IMO happened

#

And current deepthink offering is literally the same system but faster as they say

ocean vortex
# zinc ore This was from a week or two ago, whenever the IMO happened
Google DeepMind

Our advanced model officially achieved a gold-medal level performance on problems from the International Mathematical Olympiad (IMO), the world’s most prestigious competition for young...

#

no source = didn't happen

zinc ore
#

Deepmind employee says it, I don't think they mention it in that blog

ocean vortex
#

Also why do you think they are still focusing on math with the current deep think released today? It would make no sense unless it's a derivative of that math oriented model, like I've already said

ocean vortex
zinc ore
#

Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened!

We put all individual recipes (that we figured out

#

"We finished training 2 days before IMO 😄 That model achieved SOTA results, not just for math, but coding along with other reasoning tasks, unbelievable!"

#

He's one of the leads of the IMO team lol

#

And they're literally saying they're offering that model to mathematicians right now, while the current one is based on the same system but faster

#

I don't make crap up, I just repeat what I've actually read

brittle tiger
ocean vortex
# zinc ore https://x.com/lmthang/status/1948458590492393834

Ok fair, but it's just confusing af 😄
If that was the same model, why can't it score on IMO the same after the fact even when they had the time with all the data and solutions out there? And if it's SOTA on coding and "other reasoning tasks", why no metrics for that?

zinc ore
#

Because it's unreleased lol

#

We don't know gpt5s benchmarks yet

ocean vortex
# zinc ore Because it's unreleased lol

If you assume that current deep think is based on Ultra, it would be unreasonable to assume that a) a different finetune of that performs so much better everywhere and also b) that they just released a much lesser version for $300 a month with 10rpd

zinc ore
#

They're literally advertising it as the IMO gold model, but a faster variation

ocean vortex
#

"faster variation" --> less test-time compute = same base model

#

That's how I'm reading this

timber kiln
ocean vortex
timber kiln
zinc ore
# ocean vortex "faster variation" --> less test-time compute = same base model

https://vxtwitter.com/lmthang/status/1951311980960350276

Same guy I just shared talking about it with the YOLO run from the tweet above

Our IMO journey continues: the yolo run model that we trained a week before #imo2025, despite all possible likelihood of failures, magically achieves SOTA across a wide range of reasoning tasks from maths, to coding, and challenging knowledge. I'm very excited that we have now delivered the IMO 🥇 system to the hands of mathematicians and a simplified version (results below) to all Google AI Ultra subscribers.

QRT: lmthang
Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened!

We put all individual recipes (that we figured out before) together and did a yolo run (with the compute that I had to beg various groups to loan) to train our most advanced Gemini model. We finished training 2 days before IMO :D That model achieved SOTA results, not just for math, but coding along with other reasoning tasks, unbe…

#

He calls it a "simplified version" here

ocean vortex
zinc ore
#

But again, connecting it to the YOLO IMO gold run, and calling it a variation of that

torn mantle
#

They can keep it

ocean vortex
zinc ore
#

Yeah, I wish there were more benchmarks to compare it to o3 pro and grok heavy

ocean vortex
#

Base model is likely the same, just a different number of parallel instances and a different way the system is run etc

zinc ore
#

"magically achieves SOTA"

#

Ie hype phrasing saying it generalizes beyond math

ocean vortex
hollow imp
#

Even 10 prompts a day is enough just give it to me 🙏

ocean vortex
#

Marginal gains like it getting the correct answer only occasionally as opposed to never at all, with parallel compute may convert this into it getting it right most of the time.
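
Back-of-the-envelope version of that, as a sketch under strong assumptions: every parallel attempt is independent, each one is correct with the same probability `p`, and the system can always pick out a correct attempt when one exists (a perfect selector, which real systems do not have). The function name is just for illustration:

```python
def at_least_one_correct(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct,
    when each attempt is correct with probability p."""
    return 1.0 - (1.0 - p) ** n

# A model that is right only 10% of the time per attempt:
for n in (1, 8, 32):
    print(n, round(at_least_one_correct(0.1, n), 3))

# A model that is never right gains nothing from more attempts:
print(at_least_one_correct(0.0, 32))
```

With those assumptions, "occasionally right" turns into "usually right" once there are enough parallel attempts, while "never right" stays at zero no matter how much compute is added.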

brittle tiger
#

I've definitely done more than 10 prompts today fwiw

keen beacon
#

is it a soft limit right now?

zinc ore
hollow imp
#

If it's really sota sota then play a game of chess till just 15 moves without hallucinating

trim sand
#

Why are all the posts here so weird

ocean vortex
#

in a nutshell

#

Smth like 100k+ thinking from a huge model with a ton of parallel instances is just not realistic to serve

#

Too concise...

torn mantle
#

oai really ruined it all for us

#

with the astronomical monthly 200$ plan

#

i mean i knew other labs will follow suit

#

but whats this?????

#

10 prompts per day for 200$ ?????????????

ocean vortex
#

Sure, but then you can't charge $300

#

or you may as well become irrelevant soon enough. Or less relevant than you were hoping for 👀

#

It's all for nothing if it doesn't materialize and does not reach people

#

yeah like... People couldn't care less about things "more strategically important to the company", and the company itself will cease to be important if people can't be satisfied and the demand can't be met

primal orbit
#

more benchmark data within safety section

torn mantle
#

i see

#

it makes sense but the rate limits are just brutal

elder rapids
#

so how is deepthink?

blazing bison
#

lmao openai staff got caught using anthropic models

#

that's funny

#

if even openai doesn't use openai models, i don't know what to expect from gpt 5

#

@patent aspen coding

#

anthropic staff look into your data remember that

#

the data policy of anthropic is the worst of all

#

they said that openai was using their models for AI-improvement-related tasks

#

ye, idk if openai was drawing the line

patent aspen
blazing bison
#

it's not like that OAI was doing it, like sam said for them to do it

#

but their staff was

#

they cut access from personal employees

#

this article is bad news too

patent aspen
# elder rapids so how is deepthink?

I think it's great for math, science, really hard computer science problems like distributed computing. It's meh at coding although very few bugs

blazing bison
#

now i'm not sure if gpt 5 is coming next week

#

apparently gpt 5 is not a big leap from 4o

elder rapids
#

if they're the models we've been seeing on lmarena

#

they are big leaps

blazing bison
#

google is also facing difficulties

elder rapids
#

when does 3 come?

keen beacon
#

doesnt openai leak info to the information? (or at least it seems that way) maybe theyre trying to downplay it a bit and let everyone be a little surprised

blazing bison
#

the only lab that isn't facing difficulties is anthropic

#

the paper that they released today wtf bro

#

even with mark offering a bunch of money, anthropic researchers refused

#

what is happening there

elder rapids
#

do you think it's going to be much different than the 2.5 series

keen beacon
#

people were impressed by zenith/etc. and there have been massive preparations for gpt-5 in the frontend/backend apparently that people have datamined. it would be odd for it to be significantly delayed

loud leaf
blazing bison
#

you really don't know how to read it?

loud leaf
blazing bison
#

np

loud leaf
#

not much actual info on gpt5 there

elder rapids
#

you wish craig

loud leaf
#

i was expecting it to be an underwhelming wrapper that just routes to best previously existing model, but feedback on zenith suggested sota

patent aspen
#

Even if we did hear about Apple announcing they would acquire Anthropic, it wouldn't be confirmed because of the subsequent FTC and congressional approvals

loud leaf
#

only definitive claim it makes is the leap won't be as big as gpt3 -> 4 and like... yeah

patent aspen
#

In that situation they'd probably have about a 60-70% chance of success, but the risks of opening an antitrust investigation probably wouldn't be worth it

loud leaf
patent aspen
#

The problem is that, even if it were only a tail risk, a tail risk of potentially doing major damage to your core business probably wouldn't be worth it for Apple

#

And they could get most of the same benefits by partnering

jade egret
#

when gpt-5

patent aspen
#

Then they can participate in the AI race and have more negotiation leverage. It would derisk their business a bit

#

At the moment they're a luxury real estate company as far as AI is concerned

#

Their service businesses are also threatened by AI to some extent

#

I'm talking mega long term

#

The other options are off the table, and if they own Anthropic, they can make it whatever they want

#

They probably can't buy Google, OAI, or xAI

#

They don't have the talent

#

I thought you were talking about them building their own models

#

I mean even OAI is using TPUs over Nvidia on GCP so...

torn mantle
#

why is it a myth

#

elaborate

#

its actually years ahead of any major lab

#

they can have their own internal mini cuda but pretty sure its nowhere near it

patent aspen
#

IMO Cuda is replaceable because the AI companies can just push software up to a higher layer of abstraction given a long enough time horizon

#

At a certain point, you just use PyTorch, TensorFlow, JAX

#

They are now but they can wrap ASICs too and eventually that's just cheaper

#

Developers just use high level libraries

leaden palm
#

whatever jax uses

patent aspen
#

XLA is basically an ML compiler

south phoenix
#

where did the claude models go?

echo aurora
#

you're not?

south phoenix
blazing bison
echo aurora
torn mantle
# patent aspen IMO Cuda is replaceable because the AI companies can just push software up to a ...

yea but the abstraction wont be perfect... its not like pytorch or tensorflow will be used for anything, i mean to achieve performance similar to cuda you need code & hardware perfectly aligned.. that's why there are libs like cudnn & cublas that are engineered precisely to get max performance from their tensor cores. and lets say for example we moved to amd rocm: even though there's support, the performance wont be the same

#

take throughput diff between a100 & mi215 for example

#

which is actually 20%

south phoenix
torn mantle
#

and for tpus, you have to be married by contract to google to use them, so i wont talk about that

#

for aws, their ceo literally said trainium is like a supplement to nvidia gpus, and for the majority of workloads they will still use nvidia

#

so it only created a little competition with billions $$ spent and many collabs too

#

yea because we are talking about an ecosystem

patent aspen
echo aurora
#

Are others seeing the same ^ ? (claude models not appearing in list)

south phoenix
warm fulcrum
south phoenix
warm fulcrum
#

check if its a browser issue

south phoenix
#

just say chrome

warm fulcrum
#

buddy u dont have to take it literally

patent aspen
#

Basically Microsoft and the companies that aren't tech giants

warm fulcrum
#

u know what i meant

south phoenix
#

yeah on chrome its visible

patent aspen
#

The Nvidia ecosystem is still pretty dominant today. I just don't think that will be the long term trend

warm fulcrum
#

on brave

patent aspen
#

I think a lot of legacy code will remain Nvidia-based for decades though

torn mantle
#

ofc things will change in the future

#

but i would give it like +10 years or more to replicate something like cuda

patent aspen
#

Nvidia is well positioned enough that they will always be relevant. I just don't think it's actually necessary to replicate CUDA if you can offer comparable performance at 1/5 the cost

torn mantle
#

just the migration process will be a headache

#

if this so 'imaginary' company succeeded

patent aspen
#

That's definitely relevant for a lot of companies, although if the major frameworks migrate, then it's way less work to migrate

#

Like imagine if <insert your favorite ML framework> just has a one-line config to select the hardware backend
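
Something like this toy sketch, where everything is made up for illustration: `BACKENDS`, `register`, `CONFIG`, and the `"accelerator"` backend are hypothetical names, not any real framework's API. User code calls `matmul` and never mentions the hardware; only the config line changes:

```python
from typing import Callable

Matrix = list[list[float]]

# Hypothetical registry: each hardware backend supplies its own kernel.
BACKENDS: dict[str, Callable[[Matrix, Matrix], Matrix]] = {}

def register(name: str):
    """Decorator that files a kernel under a backend name."""
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap

@register("cpu")
def matmul_cpu(a: Matrix, b: Matrix) -> Matrix:
    # Plain row-by-column matrix multiply.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

@register("accelerator")  # placeholder for a vendor-specific kernel
def matmul_accel(a: Matrix, b: Matrix) -> Matrix:
    # Same math here; a real backend would dispatch to its own hardware.
    return matmul_cpu(a, b)

CONFIG = {"backend": "cpu"}  # the "one-line config"

def matmul(a: Matrix, b: Matrix) -> Matrix:
    # User-facing call: looks up whichever backend the config names.
    return BACKENDS[CONFIG["backend"]](a, b)

print(matmul([[1, 2]], [[3], [4]]))
```

The hard part this skips is exactly what's being argued about: whether each backend's kernels are tuned well enough that switching the config line doesn't cost performance.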

torn mantle
#

i hope you are not only taking TFLOPS as the only criteria

patent aspen
torn mantle
#

again if your software stack isnt as optimized as cuda, then its a waste of time, they all have cards with good theoretical performance

#

1/5 is just TCO

#

cost of ownership

#

what about electricity

#

what about space

#

thats if we are assuming that the performance is like 70%
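
Quick arithmetic on those two numbers, as a sketch only: the 1/5 cost and the 70% performance are the hypothetical figures from this discussion, not measurements of any real hardware, and the function name is just for illustration:

```python
def cost_per_unit_performance(relative_cost: float, relative_performance: float) -> float:
    """Cost of one unit of throughput, relative to the baseline (baseline = 1.0)."""
    return relative_cost / relative_performance

# Hypothetical alternative: 1/5 of the total cost at 70% of the performance.
ratio = cost_per_unit_performance(0.2, 0.7)
print(f"relative cost per unit of performance: {ratio:.3f}")
print(f"advantage over baseline: {1 / ratio:.1f}x")
```

So under those assumptions the advantage per unit of work is smaller than the headline 5x, and it shrinks further if the cost figure leaves anything out.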

patent aspen
echo aurora
zinc ore
#

https://vxtwitter.com/testingcatalog/status/1951320162541388045

FYI, this guy has access to the gold IMO deepthink model, and has been sharing some tweets about what it makes

Gemini Deep Think IMO 👀

It is one of the first models which I am testing extensively b/c it is very fun to play with.

"Cyberpunk nuclear reactor control interface" https://t.co/y5zHfZYm6Y

QRT: testingcatalog
I have Gemini "Deep Think IMO" mode 👀

What should I ask? https://t.co/EhDw7kOAb3

torn mantle
#

whos calculating tco?

south phoenix
torn mantle
#

also

#

why are we assuming nvidia will just stand still?

patent aspen
#

That's an opportunity if I ever saw one

patent aspen
#

Well definitely the electricity at least

blazing bison
torn mantle
#

it does include space & electricity

warm fulcrum
#

or is it just random

blazing bison
warm fulcrum
blazing bison
#

The deep think that they announced is the deep think imo. The deep think released is a slight improvement from gemini 2.5. Deep think imo looks like another model, you can check doing the same prompts pelican, star shoot game, etc

#

They added an unreasonable rate limit and a worse model for paid customers. While they gave influencers a much better model with unlimited requests

warm fulcrum
#

so selfish..

oak bolt
#

to create videos for my school projects 😛

blazing bison
#

????????????????????????????

#

He actually made stuff up

#

The difference between the imo model and the released one is inference config only

#

you are not

#

he deleted his message saying that he is a googler

#

lmao

#

well, unlike you i can reference actual sources for the shi* i say

keen beacon
#

we need more people asking how many rs are in strawberry using deepthink tbh

#

AGI benchmark

blazing bison
#

this @patent aspen is a clown

#

"made stuff up"

#

dumb clown

#

mf is lying

#

deep think

#

is the same thing

#

there are no different deep thinks

hardy pecan
#

Children stop fighting

blazing bison
#

the only thing that change is inference config

#

source?

#

yes

#

source? YES

ornate agate
blazing bison
#

he is lying lol

#

why you're on his side

#

i'm showing actual sources from people who are directly on the project

#

and you guys are on the side of the guy

#

without sources?

#

lmao

#

but he is lying all the time

#

he deleted his lies

#

every time i show proof of the opposite of what this guy is saying, he deletes his message

#

why do you guys like liars lmao

blazing bison
#

???

#

bro the lies is some messages ago

#

there is a lot more too if you search his messages

#

i can create an exposé with more than 30 lies

#

that this guy made

#

this is sick

#

i think that you guys are the same account now

#

prob

#

the 3 of you lmao

#

i'm already making

#

a lot actually

#

already doing

#

but it's automated so

#

i can waste my time with whatever i want to

#

you're sick with 3 discord accounts lmao

#

kind of funny

#

blocked, i don't like to read lies

#

bye bye

#

bcs you're his alt account

#

you're the king of bs

keen beacon
#

brian isnt lying btw

rare python
#

Our IMO journey continues: the yolo run model that we trained a week before #imo2025, despite all possible likelihood of failures, magically achieves SOTA across a wide range of reasoning tasks from maths, to coding, and challenging knowledge. I'm very excited that we have now delivered the IMO 🥇 system to the hands of mathematicians and a simplified version (results below) to all Google AI Ultra subscribers.

Quoting Thang Luong (@lmthang)

Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened!
︀︀
︀︀We put all individual recipes (that we figured out before) together and did a yolo run (with the compute that I had to beg various groups to loan) to train our most advanced Gemini model. We finished training 2 days before IMO :D That model achieved SOTA results, not just for math, but coding alo…

blazing bison
#

imagine saying the opposite of the head of the gemini deep think project and using alt accounts to support it. Another level of commitment

#

yeah, he is lying

#

you're right

#

if you say so

keen beacon
#

we're all brian's alts 🤣

blazing bison
#

Even if you work at google, you are not part of the deep think team

#

bcs i know all of them

rare python
#

@blazing bison

#

two version of deepthink

blazing bison
rare python
#

2.5 pro deepthink got 80.4 LCB

#

2.5 deepthink got 87.6 lcb

blazing bison
#

They aren't my friends, i just know who they are, and their respective discords too

blazing bison
#

i'm not wasting my time anymore

#

bye

blazing bison
stray aspen
rare python
#

I just showed you there are three different deepthinks

patent aspen
#

As it stands, it's just poor value for 99% of people

#

Nah

rare python
#

You should tell the TPU team to scale up for DeepThink :D

#

You guys have to scale for both DeepThink and Gemini 3.0 damn

#

huh doesn't gemini 2.5 ultra have above 1M context or something? So why does DeepThink only have 100k? Cost?

civic flame
rare python
#

Bro is so eager to "gotcha" brian

#

💀

civic flame
#

brian has never failed me 🙏

#

i trust him more than like 90% of people here

blazing bison
rare python
#

brain

blazing bison
#

or do you believe that they are lying on X?

civic flame
#

i'm not going to waste my time

#

have fun

blazing bison
civic flame
#

if you were here more than a day you'd know he does that all the time

#

and if you had half a brain you'd be able to figure out why as well