#general
Seems specialized
so it should have some decent world knowledge
bro calls everything fake
i dont think its fake
will everything be eaten by gpt 5
They are unknown and they must not have a lot of money. Do you believe that they created an LLM from A to Z with a 96 on SimpleQA without access to a search tool?
is there MCP feature on LMarena? it would be cool to test out models of how good they are with MCP
You could put that in #1372230675914031105
they're backed by sequoia and they published their model's full proofs for the IMO
i trust them 🤷♂️ suit yourself
just because they're not a big lab that doesn't mean they can't make big advancements - they've been doing some cool math-related stuff in the field for a year
@civic flame I'm not saying they never got this score, but that they forgot to specify that this score is with a research tool
@civic flame it's possible they're strong on GPQA because they found a new reasoning technique, but SimpleQA is just knowledge, so to get 96 you'd need a model 10 times larger than our sota, and it's impossible that they did that (or they trained it specifically on SimpleQA, but that would be stupid)
even perplexity deep research has only 94
nvm this isn't from harmonic this is autopoiesis
https://autopoiesis.science/blog/92-4-gpqa-diamond give this a read
wth
im confused
is it the same model or nah
Aristotle X1 Verify pass@1 benchmark results.
they're different lol
that's what had me confused
they both have models called aristotle
gpt 4.5 which is the largest llm to date has only 62
this is their only post 💀 😭
Ah now you change your mind
again, i thought it was about the harmonic model
okk
Hey can someone teach me how to use it
Basically I joined today
Hey can anyone listen to me
Tell me how to use it
7 employees
They're not claiming that tools aren't being used though
yea they are looking for funds
small business strategy
🇰🇷 LG recently launched EXAONE 4.0 32B - it scores 62 on Artificial Analysis Intelligence Index, the highest score for a 32B model yet

@LG_AI_Research's EXAONE 4.0 is released in two variants: the 32B hybrid reasoning model we’re reporting benchmarking results for here, and a smaller 1.2B model designed for on-device applications that we have not benchmarked yet.

Alongside Upstage's recent Solar Pro 2 release, it's exciting to see Korean AI labs join the US and China near the top of the intelligence charts.

Key results:
➤ 🧠 EXAONE 4.0 32B (Reasoning): In reasoning mode, EXAONE 4.0 scores 62 on the Artificial Analysis Intelligence Index. This matches Claude 4 Opus and the new Llama Nemotron Super 49B v1.5 from NVIDIA, and sits only 1 point behind Gemini 2.5 Flash

➤ ⚡ EXAONE 4.0 32B (Non-Reasoning): In non-reasoning mode, EXAONE 4.0 scores 51 on the Artificial Analysis Intelligence Index.…
if you want to upvote https://discord.com/channels/1340554757349179412/1394703782255788122
just got a potato lol, it's openai right?
go upvote kolors https://discord.com/channels/1340554757349179412/1386317762128773130
#video-arena-1 mars
Name: LMArena
ID: 1340554757349179412
Description: LMArena is an open platform where everyone can easily access, explore and interact with the world's leading AI models. Community shaped leaderboards help progress AI in a more transparent and grounded in real-world user way. Come join our community to explore and shape the frontier of AI.
Owner: @wooden mulch
Features:
Creation: <t:1739683560:R>
Channels: 286
Text: 28
VC: 3
Members: 6779
Roles: 26
Managed: 4
dino never finish generating is it that slow ? or simply not functioning now
On battle mode?
How's it?
car
Any zenith level model?
Potato/Dino, any of two?
never tried zenith, does it still exist?
Nope.
is there a video generation leaderboard yet?
Everything is ready for next week
the cbo literally has a phd in how to pitch a start up (not joking!)
just a bunch of PR
for higher evaluation
for how long do we get this video generation for free?
its coming
What are the SoTA research models available?
I'm trying to find an affordable solution for deep research that doesn't hallucinate much
For example perplexity hallucinates and tends to believe report mills
Also, question, does LmSYS do categorization on AI tasks through leaderboards
No, not yet.
🚀 Announcing Step 3: Our latest open-source multimodal reasoning model is here! Get ready for a stronger, faster, & more cost-effective VLM!
🔵 321B parameters (38B active), optimized for top-tier performance & cost-effective decoding.
🔵 Revolutionary Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD) enable efficient inference—even on modest GPUs.
🔵 Trained on 20T+ tokens (incl. 4T multimodal), with meticulous data curation ensuring reduced hallucinations & robust reasoning across vision and language.
🚄 Unmatched speed: Up to 4,039 tokens/sec/GPU—70% faster than DeepSeek-V3 under similar conditions.
💎 Step 3 sets a new Pareto frontier—bridging power, efficiency, and practicality.
👉 Start building with Step 3 today: huggingface.co/stepfun-ai/step3
👉 More details on our research blog:
www.stepfun.com/research/zh/step3
StepFun AI is your smart and reliable personal assistant, here to help you acquire knowledge, find information, learn languages, unleash creativity in writing, and even write code. Whether you’re working, studying, or just navigating everyday life, it’s designed to solve your problems and help you discover and understand the world around you.
we need new SOTA models
🦥 Qwen3-Coder-Flash: Qwen3-Coder-30B-A3B-Instruct
💚 Just lightning-fast, accurate code generation.
✅ Native 256K context (supports up to 1M tokens with YaRN)
✅ Optimized for platforms like Qwen Code, Cline, Roo Code, Kilo Code, etc.
✅ Seamless function calling & agent workflows

💬 Chat: chat.qwen.ai
🤗 Hugging Face: hf.co/Qwen/Qwen3-Coder-30B-A3B-Instruct
🤖 ModelScope: modelscope.cn/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
🔧 Qwen Code: github.com/QwenLM/qwen-code
hey can anyone help me
with the process to generate videos here?
Yeah check out #1397655624103493813 for information on how to use Video Arena!
when will lmarena video gen be available on the lmarena.ai base site?
It’s possible! Be sure to share that you’d like this in #bot-feedback
okay. thanks to you (if you are the owner of lmarena or anyone from lmarena).
but i am actually curious when lmarena.ai will have video gen. will it have the rate limits.
That’s very much TBD. We’re treating this like an experiment and we’re looking to hear from the community before decisions like this are made.
Can you not attach images to claude sonnet and opus models anymore?
hello
Zenith still in arena
what is zenith
Its removed
hello, I saw the vid on thread lol
I've got a model called "cuttlefish"
Hi
🧑‍🔬 Research Update: Today, we are releasing a new dataset with over 140k conversations from the text arena collected between April 17th and July 25th 2025. See thread to dig into it!

We're pairing the data release with a deep dive into how model performance and evaluation dynamics have evolved over time. Let’s look at real-world trends, new features, and fresh prompts.

What’s covered in the latest analysis:
- Overview of the released dataset
- Language & topic breakdowns
- Rating changes: How Arena scores shift over time

And more! 🧵
🤑
rip my ai girlfriend conversations
#announcements the new one sounds more like the AI Companies will be more interested in knowing
What models are there in video arena
they cant reveal the ip address that sent the messages to the model
and it would take decades to learn them
true
LMAO
great start
lol an entire category section for a single model
nothing in the article makes it seem that there are conversations in the statistics and dataset from direct chat
i don't even know if there are or not
excuse me what
they leaked the group chat
these are so interesting
it looks as if there isn't which i find surprising
Lol why would you write those if you knew they would be released in a database?
lmao
hopefully it's all anonymous
as in hopefully he didn't mention private information
It should be stated in bold text that the inputs and outputs will be used directly for research
if people get confused
the voices convinced me
bro who leaked the group cha
sydney is in my head
what leak? anything interesting?
Ofc since it's for research
"leaked" xd
it does
bro admins, the channel for hello is general, not other channels. Say it in the general channel. pls make a rule for it bro, most channels are just flooded
Ah, well still people seem to forget
bro pls stop saying hi and hello in other channels, everyone says hi in channels other than general
mods ban that guy
bro your username
give me more info since ok
yeah i can not help you without info
just what you need
for it to do
say it or else i can not help
bro you had a swear word in your username bro you re
how can you ever join
Hey! Anyone know how I can try the Horizon Alpha Model? Supposedly there was a way to do it in LMArena.
Thanks!
pffft
hey i just wanted to know: there's no limit on image generations, but in video generation there are limitations. is there any possibility that video generation will also be unlimited?
when gpt 5 ):
It's possible but unlikely. Be sure to share feedback/requests you have in #bot-feedback
Bro does the video generator have a limit?
yes bro you can generate 8 videos a day.
😭
I can understand, but even though the video is generated for 8 seconds, the actual scene is only 4 sec
thanks bro
bro how are you doing?
no way bro
Im doing well what about you
what the most fun model to talk with
Donno, I feel most comfortable with claude models. Doesn't feel too positive and goes straight to the point.
is gemini-2.5-flash-lite not going to be in the leaderboard?
whats the best model for coding in c++ and python and math, what should i use pls help
help me
Horizon Alpha gets math problems wrong that o3 never messes up.
and also coding, just tell me whats the best and why @here
@here help me, tell me whats the best model for math and coding pls help me
Sure are! More information in this blog: https://news.lmarena.ai/new-lmarena/
Prolly never
If it's not filtered, I'd bet there are even credit cards in there
well, ig that is good for diversity
I mean, TOS talks about how our data is sent to providers, but I don't know how I feel about it being shared publicly
It's good for open source models. You know it's going to be shared publicly. If you share more than you should, it's your fault.
If you feel bad about using it, then don't
We do apply aggressive PII filtering.
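For anyone wondering what pattern-based PII scrubbing can look like, here is a minimal sketch. It is not LMArena's actual filter (that isn't described anywhere in this chat); the two regex patterns and the `redact_pii` name are made up for illustration, and a real filter would need to cover far more cases (names, addresses, phone numbers, uploaded files).

```python
import re

# Hypothetical patterns for illustration only; not LMArena's real filter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens (card-like numbers).
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = CARD_RE.sub("[CARD]", text)
    return text


print(redact_pii("mail me at jane.doe@example.com, card 4111 1111 1111 1111"))
```

Regexes like these only catch well-formatted values, which is probably why people in the thread are still spotting personal details in pasted files.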
New Arena Models
- velocilux
- cogitolux
dont what
dont tell me what to do
Cresylux family ?
So its from meituan ?
Its good ?
@torn mantle try the new models and tell me what you think
okay
Be sure to check out #1397655624103493813
we are quite uncertain what the goals of the platform are and what your internal roadmap holds. Can you make it public?
I'll share this with the team. We care a lot about transparency. Our mission remains:
To bring the best AI models to everyone, and to improve them through real-world community evaluations.
Thanks appreciate it. The UI improved a lot over the last months
I propose making Search, Image, Video and Webdev Arena available through three major buttons to increase visibility. I attached a possible concept.
Its unclear that those buttons lead to a different arena
You may add a webdev arena button as its currently deployed on a separate platform
Additionally I propose adding tooltips to the leaderboard explaining how Rank, CI and Elo are determined
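On the Elo part of that tooltip idea, a rough sketch of the classic Elo expected-score formula. This is only the textbook formula; how the leaderboard actually fits its scores, ranks and confidence intervals isn't described in this chat, so treat the mapping to LMArena numbers as an assumption. The function name and the 1400 baseline are made up for the example.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under classic Elo (a tie counts as half a win)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# How much a rating gap shifts the expected head-to-head result.
for gap in (0, 7, 50, 100):
    print(gap, round(elo_expected_score(1400 + gap, 1400), 3))
```

Under this formula a gap of a few points is close to a coin flip, which is why the confidence interval next to each score matters when comparing neighbouring models.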
But when browsing the dataset, you can see that some of the published prompts include personal information. Ppl sometimes do stupid things like putting some files in without properly erasing all the personal info. 🤣 I know the TOS specified the rights and responsibilities and stuff. But maybe if there could be a way for users to choose to remove some of their prompts from a public release, it might be nicer?
correct
If you're seeing examples of this please send me a DM so I can escalate.
I don't know tbh, regardless I have been sharing these concerns with the team.
they probably filtered that out
guys, did you notice that gemini started to repeat itself after some point or am i tripping?
Not really
Its been consistent to me
I questioned myself if gemini 2.5 flash got even better
Video limit was 10 yesterday and 8 today. We have 4 more days left to limit 0. Hurry up.
I swear i wrote a message on lmarena in feedback asking for a video generation arena. And now it is real (i thought they would make it in the webapp)
what happened in #leaderboards? did this server get fake airdrop raided, openrouter-style?
everyone likes to say hello there 
hello, I'm new here. I'd like to get the most out of LM arena but I feel like I'm swimming in the deep end without floaties. When y'all got started how did you leverage it?
What do you want to get out of it?
I would like to fine tune my skills as a prompt engineer
Thank you, that's exactly enough.
its only added twice, glm 4.5 and air
ah its on webdev
well you asked for it xd
I wonder what the squid emoji is hinting at here from a Google pm
https://x.com/simpsoka/status/1951008214595805498?t=OMMvH3wgSpZfOgOqv-3Y3w&s=19
squid = jules
kath is the jules... gal i guess
horizon alpha on openrouter chat now has reasoning enabled. not getting these tests of mine wrong anymore
It is good that I wrote my prompts in arabic , not a lot of people can understand my prompts 😂😂
Will GPT-5 launch before Deep Think?
Yes
When our prompts are shared publicly , we will laugh a lot 😂😂😂🤣🤣🤣
Can t wait to read them .
it was obvious it would happen again considering it happened before
Can we use the study mode prompt on other Ais
I have an instagram account with over 300k and I don’t want it anymore
wheres this from btw
saw it in a tweet, searched it and found the relevant text, idk about the rest of it
Which LLM is the best street smart? I think adding a leaderboard in LMArena for it would be sick
what kinds of prompts would you classify as "testing street smarts"?
Something like this but advanced
“Your storefront is dead, but the parking lot next door is packed. What scrappy move might get you foot traffic?”
“You get your first bad online review — and it’s unfair. How do you respond publicly without looking defensive?”
“Your competitor just undercut your pricing. You can’t afford to match it — what do you do to stay in the game?”
“You’re launching a new product and have no ad budget. How do you create buzz with zero dollars?”
“A VC firm wants equity in exchange for mentorship, not money. Worth considering?”
“You’re about to go into business with someone who talks big but avoids putting anything in writing. What’s your move?”
“An early client wants a deep discount in exchange for ‘exposure.’ What questions should you ask before agreeing?”
“An employee you trust starts showing up late and missing deadlines. How do you handle it without losing them or getting walked over?”
“You have $5,000 left. Do you spend it on marketing, product development, or paying a debt collector breathing down your neck?”
“A supplier offers a ‘limited-time’ bulk discount, but you haven’t even sold your first batch. Do you go for it?”
It's subjective but I think that's why LMArena battle mode exists
Yes, definitely on the non verifiable domain, but very useful tho
Interesting, Why don't you think it's useful?
I see where you're coming from, I think I'm on the entirely opposite camp, I believe in achieving singularity as the end goal not us being the bottleneck
Absolutely, that's the greatest outcome imo. I wonder what you have against it?
What's the gap that won't let it happen? like what would you say is the "missing/never will happen" component
we aren't there yet is different than it won't happen, won't happen means that there is a component that is impossible preventing singularity from ever happening
Craig do you think the new glm 4.5 is good
What's that component?
@cedar tide david, do you think the new glm 4.5 is better in one way or another than the other sota models
Is it possible to make a story video with a consistent character?
I keep getting rate limited on gemini
and you haven't been using it a lot?
ofc I do LOL
I got it to code 50+ times today within a span of 3 hours.
plus a whole bunch of CS questions
Potter tying clay pot on tall bamboo pole, king and his sons failing with arrows, spectators watching with tension, mid-range shot, lateral pan motion with slow push-in on shattered hope in faces
openai's study mode is horrible
The bot only works in the video-arena channels like #video-arena-3 , you'll want to type /image-to-video
better to just not use it for studying
and use the model normally
it walking through the concept with you takes longer than if it just explains it clearly and directly
Gemini on AIStudio is pretty good for studying
I heard they merged LearnLM with 2.5 Pro
As long as you stick to a traditional syllabus, it's pretty great. For non-standard stuff, it's less on-point.
Seems like more of the same story. Apple has no comparative advantage in AI, but they own the world's best real estate. They'll continue being a luxury real estate company for as long as it works.
Seems to be gone now: https://huggingface.co/yofo-happy-panda
it almost feels like the leaks are intentional, first the gpt-5 entrypoint and now this lol
well, i think leaks are more credible than sam altman's tweets anyway
its free and at level of other sotas. its not best for swe.
but opus 4 is still trash for vibe-swe. money waste.
i guess there's a 20B too
https://fixupx.com/apples_jimmy/status/1951180954208444758
Best at long context
Best at analyzing videos
Doesn't even have any competitor
Other models read the text of videos, while gemini literally watches the whole video frame by frame for hours and can give you detailed and specific outputs
People still dont know how useful this is
Analyzing video is a bigger thing than analyzing pdfs
Gemini needs its own benchmark just for this
Also for analyzing pdfs or text gemini is still best because it's best at long context
The summer of 1305 finds William Wallace crouched in the dense undergrowth of a Scottish forest, his once-proud frame now gaunt from years of constant flight. The man who once commanded armies and negotiated with kings now lives like a hunted animal, moving from shadow to shadow across a homeland that no longer recognizes his authority. His weathered hands, scarred from countless battles, grip a simple dirk—the only weapon left to Scotland's former Guardian. Seven years have passed since Wallace, "A medieval storybook illustration of a grim knight riding a horse through a peasant village, peasants looking frightened, castles in the misty hills in the background, detailed faces, realistic proportions, dramatic lighting, vintage painting texture, inspired by oil painting and watercolor, muted earthy tones, [additional scene-specific detail here]"
I still have no idea where to check this update of arena battle models 🤨 . Could anybody please enlighten me?
LMARENA I DIDNT KNOW YOU HAVE A DIS I LOVE YOU GUYS
The same. Guys, do you know how to check this?
And I have no way to check my vote result either, although I have voted more than 5 times.😭 Is that a bug or something?
Hi
Hello...
What do you use it for? I find that the hard part is downloading/uploading the video in the first place.
Some of the user's requests in dataset are funny:
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Are hamsters made of ham?",
        "image": null,
        "mimeType": null
      }
    ]
  }
]
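If you want to pull the user prompts out of records shaped like the one above, a small sketch follows. The field names (`role`, `content`, `type`, `text`) are taken from the pasted record, wrapped here as a complete JSON array so it parses; the rest of the dataset schema is an assumption, and `user_texts` is a made-up helper name.

```python
import json

# Record shape copied from the pasted message; the full dataset schema is assumed.
raw = """
[
  {
    "role": "user",
    "content": [
      {"type": "text", "text": "Are hamsters made of ham?", "image": null, "mimeType": null}
    ]
  }
]
"""


def user_texts(conversation):
    """Collect the text of every non-empty text part in user turns."""
    return [
        part["text"]
        for turn in conversation
        if turn.get("role") == "user"
        for part in turn.get("content", [])
        if part.get("type") == "text" and part.get("text")
    ]


print(user_texts(json.loads(raw)))
```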
im just pasting youtube links in ai studio
Also if you select a lower resolution and a lower fps like 0.5
You can use it for muuuch longer videos
Sometimes it gives an error but just resend the message
It could take several minutes if the video is too long, just be patient
I'm asking for summaries, asking for time stamps
asking "did they talk about this?"
asking "did this guy laugh or look scared, and at which minute?"
Yea, it can analyze facial expressions too
Basically everything that takes hours if you do it yourself takes seconds when gemini does it
Also you can make subtitles too, but i dont recommend a one shot try; instead translate in 20 minute parts (you can also select time ranges). And like i said, gemini doesn't only listen or read, it literally watches videos frame by frame, so subtitles will be more accurate because gemini can see what's happening on screen at that time
info on openai's bigger oss model:
128 experts - 4 active -> very efficient
120b params - 5b active
4k initial context window - 128k current -> not horizon-alpha or what ever it is called (?)
trained in FP4 -> should only run on blackwell (?)
1 blackwell is enough?
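Quick back-of-the-envelope on those numbers, as a sketch. The inputs are the unverified figures from the message above (120b total, 5b active, 128 experts with 4 active, FP4 weights); the calculation only covers the weights and ignores KV cache, activations and any non-FP4 layers, so it doesn't settle the one-GPU question on its own.

```python
# Figures as posted in the chat (unverified leak), not official specs.
TOTAL_PARAMS = 120e9
ACTIVE_PARAMS = 5e9
EXPERTS_TOTAL = 128
EXPERTS_ACTIVE = 4
BITS_PER_PARAM = 4  # FP4


def weight_gigabytes(params: float, bits: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


print(f"active param share:  {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"active expert share: {EXPERTS_ACTIVE / EXPERTS_TOTAL:.1%}")
print(f"FP4 weights:         {weight_gigabytes(TOTAL_PARAMS, BITS_PER_PARAM):.0f} GB")
```

So if the figures hold, the weights come to about 60 GB at 4 bits per parameter; whether that fits on one card depends on that card's memory and on how much room the context needs.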
chemistry: please describe all these topics
why is the website image generator - GPT-1 so slow?
Are hamsters made of ham?
gemini 2.5 deepthink is out for ultra members
any screenshot?
When it will be available for Gemini AI pro members?
I dont think ever, looks exclusive to Ultra plans, but maybe limited queries for pro in the future, but I doubt it
Let's hope it will be available for pro as well ❤️🩹
Very quiet release
i have a question for the LMARENA guys. apart from video gen ai, when will you guys add temperature or token in/out settings for text models? also what about image ai, when can we control the image temperature or gradience?
It only has 10 RPD for ultra subscribers, that is ridiculous
any free ai image generator with no price unlimited other than LMarena ?
confirmed?
the exact limit wasn't published afaik
TBD! We do hear this request a lot but yeah I can't say when to expect something like this.
its complicated. there is many. but the models are not that great
g4f.dev
ish.junioralive.in
pollinations.ai
am i missing something. what is so special about deepthink
nothing really
Can you ask it a question I have?
he already hit the limit bro
deepthink is out
Thanks for saving me money lol
Yea i will pass
The IMO benchmark is just sassy lmao
@latent patio yo i am also looking for free ai image generator with no price and unlimited just like you all i know is freepass ai and i think its a bit bad. Wish someone made a list of free ai image generator
Something went wrong with this response, please try again. Also I Get This error when trying to create images in LMarena site any solutions ?
lmarena
Bro, I was literally one button away from paying for it
Can't be real
10 rpd
what a daylight robbery
hopefully gpt-5 is better
funny if gpt-5 could do the same for $20
Grok 4 Heavy is a bigger scam than Google
My favorite is o3 being below IMO bronze level lmao
o3 is from december 2024
there is optimizations yes, but it's an almost 1 year old model in general
@patent aspen so this deep think is actually 2.5 ultra base model behind the scenes?
yeah
It's one deep think
with deepthink they mean like 50k tokens of thinking
well, they said that deep think = max tokens thinking
so it's not one
we have a bunch of gemini 2.5 pro with a lot of reasoning tokens enabled and one of them decide the best answer
Nobody even attempted to put it as a model request kek. I assume it will be rejected but worth a shot at least
GPT5 < IMO GPT < Deep Think IMO
deep think = 50k thinking tokens
It's SoTA at math. Not practical for 99% of people
I mean he's absolutely right if we're talking about math only. Otherwise yeah I'd disagree
Math, coding
apparently no openai livestream today so no open source today
Coding
Why are they sitting on these models so long lol
Like just release it already. The open source they had for ages
Is deep think even going to be in Arena to test against GPT5?
10 rpd for 250$ and you want it on Arena ?
Someone with access to gemini deepthink, can I give you a highly complicated clinical case question no other model can solve to check its answers?
Please i beg you it's really a thrilling misery case no machine can solve and humans are struggling too
Edit :mystery *** I'm actually a friend of the clinical case and we've been baffled for months without an answer
Sure
All of those gains comparing with the initial deep think announcement can be easily attributed to base model update (06-05 vs 05-06) in my book
And their initial release:
(different LCB range)
numbers look good.. but i have learned to not get hyped before more confirmations.
Are they actually good?
@leaden palm
They sneakily did not include USAMO this time at all lol
Here's the question
I'm not a Ultra member, don't know
Here's the question
Nothing really insane about this tbh. Just parallel compute
you have ultra access?
You could have done this yourself after 06-05 was released with some coding, I believe
Yeah
wow.. you paying 300$ bucks?
I Will be eternally thankful i really will it's a very critical situation we are trying to solve here God bless you
deep think probably takes like 5 min to answer any query
You never know. Google might want to shell out to show off. It can afford it.
Hmm
Kingfall was faster
So we can assume that its gemini 3.0
Okay
My thoughts on Deep Think are that it's probably not something that 99% of chatbot users need, but the remaining 1% could have a categorical improvement in capability
E.g. mathematicians, scientists, critical medical situations, distributed systems problems, logistics, leading edge HFT firms, etc
Is it using something similar to kingfall as instruct model?
But why does it look worse than kingfall
@deep adder Grok is bad!
Hmmmm. It really has its quirks but it's generally solid.
nuh uh
yeah in terms of frontend design it did worse than base 2.5 ultra & 2.5 pro
but it had no bugs
Does deep think output long detailed answers like deep research?
or more concise like o3?
https://x.com/main_horse/status/1951201925778776530 but nobody can get it to run 😅
would be hilarious if someone gets it to work whilst openai are busy 'safety testing'
like we have the weights just not the inference to run it lmao
It's just 06-05 with minimal changes if any at all + parallel compute. This is my conclusion thus far judging by what I saw until it can be proven otherwise. They barely showed any metrics at all, and those that they did showed similar gains to the 05-06 initial deep think.
You've been making that claim for a while
for awhile? It's only released today lol
You've been saying anyone can roll their own deep think with parallel compute and pro
When it's only that it's fairly simple and this is mostly true. You can do 10 responses in parallel and see quite easily what can be improved with it
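The "roll your own" idea being argued about here can be sketched as plain parallel sampling plus a majority vote. To be clear, this is just that generic technique, not a description of how Deep Think works internally (nobody in the chat knows that). `sample_answer` is a stand-in for a real model call, and its 70% accuracy is an invented number for the demo.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_answer(question: str, seed: int) -> str:
    """Stand-in for one model call (hypothetical). Returns a noisy answer."""
    rng = random.Random(seed)
    # Pretend the model is right 70% of the time and otherwise guesses wrong.
    return "42" if rng.random() < 0.7 else rng.choice(["41", "43"])


def parallel_vote(question: str, n: int = 10, sampler=sample_answer) -> str:
    """Draw n answers concurrently and return the most common one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda seed: sampler(question, seed), range(n)))
    return Counter(answers).most_common(1)[0][0]


print(parallel_vote("What is 6 * 7?"))
```

Majority voting only helps when answers can be compared exactly (math, multiple choice). For open-ended output you would need a judge step to pick the best response, which is closer to the "one of them decides the best answer" description earlier in the thread.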
what does rpd mean?
for the length of time it thinks you cant really do many requests per day anyways
like each takes 10+ minutes
For the amount of noise they made, essentially promising to release IMO gold medal model, this is kinda a disappointment
damn
so what's so bad about the deep think model besides the requests per day limit?
is it not worth it at all
Perhaps but as things stand now they just released a thing that was supposed to be live months ago. Only based on a slightly newer model now lol
oh yeah
not mine
DEEPTHINK!
but that does prove something
it is essentially like using an optical illusion to assess someones intelligence. it is just an artifact of the tokenizer
yes but it should have used tools to calculate
Guys I'm squeezing every bit of Gemini 2.5 through custom gems
I've added all Robert greene books pdfs in the knowledge base
Mostly I think it's that people are expecting a model fine tuned for complex structured thinking problems to be better as a general purpose model
And they don't like the 10rpd limit
And cost
i mean the ai thinks for like 10 minutes or longer
the computing cost must be high
The problem most often is you have to be smart enough to pose the right questions. The 2.5 pro is capable of tackling very intricate dynamics, but it needs to be focused manually on necessary details. It doesn't handle prioritizing very well about where to dig more.
I've made a custom gem for prompt engineering
It must at the very least be no worse than 2.5Pro for any task. It is a thing they are not the first to do and it is directly competing with o3-pro and grok4-heavy
o3 pro has a similar limit actually. they just dont state it. barely anyone will ever hit it, so whats the point. Its just adding unnecessary worry into the user about something which is likely not relevant to their usage
Not really lol
O3 search in lmarena search is godly for me
My best experience with web Searching so far
The api one is always better
I don't see a single good reason why this should have been delayed either tbh
But this is not based on Ultra is it
They rushed deep think because they know that next week is openai week
guys what to do if bot doesnt answer to me in dm when i want to gen video
If it was they would have shared more metrics, gains would be higher and wouldn't match up to 05-06 deep think gains
Yeah, just a rushed version because releasing now some people will buy the ultra plan. Releasing next week no one would buy because of gpt 5
and they wouldn't be afraid to include USAMO like they did earlier
WHERE IS DEEPSEEK R2 COMING
I like it. Don't care much about price I just want the best. Though they should introduce a tier above for unlimited usage kek
Source...?
Also:
If you’re a Google AI Ultra subscriber, you can use Deep Think in the Gemini app today with a fixed set of prompts a day by toggling “Deep Think” in the prompt bar when selecting 2.5 Pro in the model drop down.
If GPT5 top of the line model beats it I will just switch to that instead
At least gemini 2.5 doesn't hallucinate as badly as grok 4. I have very horrible experiences in lmarena side by side
Hm... Ok if it's indeed that, why not release other metrics and focus on math where small models like o4-mini-high are known to be often better than both medium and huge sized models? Makes no sense
@patent aspen
Can you trick 2.5 pro into similar, or at least 40% of, Deepthink-like performance using some prompt engineering?
I actually don't have any special insights into evals for models. Your guess is as good as mine
How can I subscribe for any of these? I'm 15 years old I don't have money
no and yes
Our education system is cooked
Reject the element no because square root cannot be Negative and tell me the yes 😂
why would i 😂
Hello guys and girls
50% odds to increase the lmarena score by 7 Elo?
you are looking at the leaderboard with stylecontrol enabled
why dont they use the style control leaderboard
well i didnt even realise its a thing. its enabled by default lol
so they adjust the scores automatically by default?
Ayanokouji august
Yeah that's a good point and a topic we discuss internally here and there. I'll be sure to bring this up again as it's important how we structure this.
I thought models only see their own part of the chat. You send a message via the site -> site uses the API to send the message to the LLM -> API returns the response to the site -> site outputs the message.
and the site handles 2 channels simultaneously this way
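The flow described in these two messages could be sketched like this. It reflects the user's guess, not confirmed LMArena architecture; `call_model` is a stand-in for a provider API call and `battle_turn` is a made-up name. The point it illustrates is that each model gets its own history and never sees the other model's replies.

```python
import asyncio


async def call_model(model: str, history: list[dict]) -> str:
    """Stand-in for a provider API call (hypothetical); echoes the last user turn."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"{model} reply to: {history[-1]['content']}"


async def battle_turn(prompt: str, histories: dict[str, list[dict]]) -> dict[str, str]:
    """Send one prompt to every model concurrently, each with its own separate history."""
    for history in histories.values():
        history.append({"role": "user", "content": prompt})
    models = list(histories)
    replies = await asyncio.gather(*(call_model(m, histories[m]) for m in models))
    for model, reply in zip(models, replies):
        histories[model].append({"role": "assistant", "content": reply})
    return dict(zip(models, replies))


histories = {"model_a": [], "model_b": []}
print(asyncio.run(battle_turn("hello", histories)))
```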
whats the difference between wolfstride and kingfall tho?
is wolfstride like a more recent checkpoint?
Yes
latest base gemini model checkpoint that is not deepthink iirc
there is nightride but it's weird
I can’t even try a sample of deep think because it’s behind a 200 dollar paywall
wow i didn't know so many people were cheating on dev mode
I mean isn't the same true for gpt-5
Hi!
oh deepthink out
I also think Google's naming is way saner than anyone else tbh
rather late
i am rather late or deepthink is
yo
all of the above
lolol
fair
i just ignore deepthink because i know i wont be able to use it
and i dont need it
but its cool to see advancements isnt it
Output: 42, CoT: hidden, summary stupid
Imagine
Searching X for „Elon Musk opinion on meaning of everything“
guys is the limit 10 videos or 8 ?
8
What are the current GPT-5 benchmarks? Are they verified?
we have none rn
Gpt will be head and shoulders sota when it releases, remember o series models and 4.1,4.5 are checkpoints in the development of the finished product which is 5
What are the sources on the release date being next week?
apparently horizon alpha's reasoning version got 86% on gpqa tho. it was up for a little bit, whatever that is
But nobody can beat Gemini cuz nobody can beat free
openai have been intensely preparing for it for the last ~4 days
It’s o series integrated into the regular more versatile model so it will be bordering on a major leap if it is not one itself
and they began A/B testing it on chatgpt late last week
I hope that it can beat Claude on coding and writing/vibes cuz opus is expensive, slow and censored
of course it will beat claude
But is it good
I've used it a lot on lmarena and I have bad experiences tbh
At least it won't be 10RPD and paywalled behind a door you need golden key for
"The improvements won’t be comparable to the leaps in performance of earlier GPT-branded models, such as the improvements between GPT-3 in 2020 and GPT-4 in 2023"
Yeah it probably won't be 15 requests a month like o3 pro
I'm looking at the OAI help center. I see 15 rpm for o3 pro. Is that out of date?
o3 pro I can use it on playground as much as I want without paying extortionate sub prices lol
Oh that's for API
for deep think I can't use it at all. Paying for their sub is not an option I would even consider tbh
15 requests / month
Isn't that even worse?
I would guess most of the people using o3-pro here and there do NOT have a Pro sub. And that sub is already priced more reasonably than Gemini one
I think it's 'unlimited' only for o3, not the pro
But the fact alone that Google is competing on charging you comparable amounts of money and has even stricter limits with all their TPUs is kinda already crazy enough...
yes, on chatgpt pro requests for all models is unlimited
there is only limits on deep research and agent
and they are very reasonable limits btw
of all pro plans, openai offer the better one
claude is not good too, with weekly limits
then how do you explain kingfall being better than wolfstride
IIRC the real limit for o3 pro on that plan is capped in low dozens per month
on enterprise plan that is like $30
wdym
o3-pro is not "unlimited", don't know their caps though...
bro i do more than 100 request / day
Yeah it's not actually unlimited
@patent aspen
do you have it?
Ok then their caps are very reasonable lol. But this only proves my point even more, what Google is trying to do with their model availability and pricing is insane.
i tried the same with opus on claude and they weekly limited me now after 2 days
🤓
bcs before it was basically unlimited too
and so many people talking about claude code so
it was, atleast for me
80/ 100 requests day for me is basically unlimited
This was fine. People were still able to test and use it. But also there's no way Google's operating cost of a single instance of their large model is anywhere near that. And if they can only make it perform with parallel compute that is still on them.
whats kiro
amazon
Agentic ide such as cursor
brian is ignoring me 🙁
They advertised it as unlimited before, now they're walking it back.
I see their ads on Reddit all the time throwing shade at other agentic tools lmao for their rate limits
Yeah Anthropic is hilariously bad with their limits. And somehow people still manage to make up excuses for them lmao
like when the models get good enough i'm not gonna need to do 100 requests
$0.20 for spec request and $0.04 for vibe request after you used your quota
i just reach 100 requests bcs of models doing dumb mistakes
They were just about never good on value
so maybe with claude 6 the price will be reasonable
bcs the model will solve problems with less prompts
claude max was good value before they made the recent change tho
yeah, it was the best
Still is
it's not
you could, there was no rate limit
it's like their rate limit was not working or something
how to make video here with sound?
who thought that was a good idea at anthropic lol given their limited compute compared to other companies
GPT5 is going to cook Gemini 2.5 it's obvious. They better be working hard on Gemini 3 rn lol
there's nothing better tho, openai models are trash for agentic coding atm so maybe gpt5 will change that
Check out #1397655624103493813 for more info
Almost every user reaches rate limit already
before the change i didnt
and i was using it a lot with a lot of context
but i was paying for the $200 plan
now with 2 days i reached the week limit
They didn’t change anything yet
August 28
Honestly I think they are simply making a mistake. It's a short-sighted approach that has a high likelihood to hurt them long term and ensure it never beats chatgpt in popularity... They nuked their availability before they were in a position to do so IMO
so maybe you can unlock my account or smth since apparently you work at anthropic
They have veo3 still
veo3 is not that good
idk i don't think it's worth $250
like only if you don't pretend to generate revenue with it
is there an option for sound in video?
i spent $400 this month on 2 pro subscriptions, but they accelerated my work like 10x
claude max and gpt pro
Are you considering supergrok when their coding model releases
crud basically
no, i'm not paying $300
veo 3 is pretty good compared to other video models at least
i wonder what sota will be next year lol
You get grok 4 heavy, coder, multimodal and video model later this year
i just wasted $400 bcs anthropic f*** with my account
apparently the unlimited plan is not unlimited
openai never did that, even after abusing of it a lot
Wait a sec...
Ultra plan was released roughly 90 days ago was it not...
and only first 90 days were the discounted price
lmfao
i considered getting one day of google ultra, google refunds easily if you don't abuse it
if the model is good then ok, i would keep it
but after seeing that it's 10 requests / day
🤦♂️
30 day free trial
you mean ultra? pro doesnt get it i think
Hi
Any fireship viewer?
Isn't sota some openai video generation model?
🤣
That's sora haha
agree
They expect you to pay up first, and only then receive a chance to even see if it's any good
Their blogpost alone is nowhere near enough to tell
and 10RPD you are still very constrained. So no proper testing of any kind and forget the benchmarks
If this deep think is indeed based on Ultra, I think the odds of Gemini3 beating GPT5 just got way lower LOL
and it's not even deep think version that won the gold medal
it's a scam
But it's soo weird to market huge model as math oriented one completely leaving things out like SimpleQA. Unless they used some derivative of a model meant for competing at IMO. But then it makes even less sense to use this for public release as their overall top performing model.
and they said on the announcement of the gold medal, that they would allow everyone to use the model
misleading
😆
when google releases something good, really good, bet on Logan marketing it
if Logan is in silence, then it's not good
if gemini 3 doesnt beat gpt 5 its a very bad sign for gdm tbh
i think they gonna be on the same level
given how its a new pretrained model and theyve pretrained two fresh model generations since 4o
Well yeah for starters you have a "deep think" button which is only available when you have selected 2.5Pro, their previous best performing model. This strongly implies to use this for best possible performance
What about the IMO gold deepthink, which only a select number of people get which is much more performant?
they cant host it practically
it's a specialist model meant only for math
No, it generalizes to many reasoning tasks
It's literally just a souped-up version of the deepthink they're offering
it does "work" for all tasks, but it was still tuned for math
But "slower"
They claim it is SOTA at coding as well and "other reasoning tasks" as they vaguely mention
And the main difference is the current one offered is a faster version
So if current deepthink is generalized at reasoning tasks, then the other version should be too
tbh chances of that model doing better than your standard 2.5Pro at things like coding or your typical everyday tasks not involving math are very very slim. They trained it to perform as good as possible at IMO with no compromises while still keeping it usable.
They literally say otherwise
where exactly do they say it performs better at coding than 2.5Pro?
they do not lol
Yes they do lmao
?
They said it is SOTA at coding and other reasoning tasks
link
This was from a week or two ago, whenever the IMO happened
And current deepthink offering is literally the same system but faster as they say
they said this and nowhere in that did they claim it performs at non-math tasks better than 2.5Pro https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/
no source = didn't happen
Deepmind employee says it, I don't think they mention it in that blog
Also why do you think they are still focusing on math with the current deep think released today? It would make no sense unless it's a derivative of that math oriented model, like I've already said
Well then link that tweet lol
"We finished training 2 days before IMO 😄 That model achieved SOTA results, not just for math, but coding along with other reasoning tasks, unbelievable!"
He's one of the leads of the IMO team lol
And they're literally saying they're offering that model to mathematicians right now, while the current one is based on the same system but faster
I don't make crap up, I just repeat what I've actually read
Deep Think prompt: Create a visually impressive Pokemon battle simulator web based game
Ok fair, but it's just confusing af 😄
If that was the same model, why can't it score on IMO the same after the fact even when they had the time with all the data and solutions out there? And if it's SOTA on coding and "other reasoning tasks", why no metrics for that?
If you assume that current deep think is based on Ultra, it would be unreasonable to assume that a) a different finetune of that performs so much better everywhere and also b) that they just released a much lesser version for $300 a month with 10rpd
Gemini 2.5 Deep Think is out!! We were able to improve the model substantially since our announcement at I/O, and it is a faster variation of the system that got Gold 🥇at IMO (still getting bronze level performance🥉!!)
The model is p good at detailed creative tasks too! https://t.co/uxNeFki8oR
They're literally advertising it as the IMO gold model, but a faster variation
"faster variation" --> less test-time compute = same base model
That's how I'm reading this
Very bad assumption
You don't release the highest cost version of a model first ever
You start with the normal one and then go higher
So you think Ultra with parallel test time compute is "normal one"? 🤣
No this is pro with parallel compute
They would release Ultra first and then parallel later
https://vxtwitter.com/lmthang/status/1951311980960350276
Same guy I just shared talking about it with the YOLO run from the tweet above
Our IMO journey continues: the yolo run model that we trained a week before #imo2025, despite all possible likelihood of failures, magically achieves SOTA across a wide range of reasoning tasks from maths, to coding, and challenging knowledge. I'm very excited that we have now delivered the IMO 🥇 system to the hands of mathematicians and a simplified version (results below) to all Google AI Ultra subscribers.
QRT: lmthang
Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened! We put all individual recipes (that we figured out before) together and did a yolo run (with the compute that I had to beg various groups to loan) to train our most advanced Gemini model. We finished training 2 days before IMO :D That model achieved SOTA results, not just for math, but coding along with other reasoning tasks, unbe…
He calls it a "simplified version" here
well according to @patent aspen it is definitively Ultra 🤷♂️
But again, connecting it to the YOLO IMO gold run, and calling it a variation of that
Simplified version and its 10 prompts per day for like 200$
They can keep it
yeah so... o3-preview with crazy test-time compute type of model to a few people, and then more realistic one to the people paying $300
Yeah, I wish there were more benchmarks to compare it to o3 pro and grok heavy
Base model is likely the same, just different amount of parallel instances and the way that system is run etc
If there are no gains to show they wouldn't necessarily release it at all. Just look at Flash vs Pro, here (with Ultra) the differences are probably even smaller and maybe no contrast in SimpleQA even. But parallel compute amplifies any differences and gains
Even 10 prompts a day is enough just give it to me 🙏
Marginal gains like it getting the correct answer only occasionally as opposed to never at all, with parallel compute may convert this into it getting it right most of the time.
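To put rough numbers on the point above, here is a minimal sketch. It assumes each parallel sample is independent and that the system can reliably pick out a correct answer whenever one is present. Both are simplifying assumptions, not anything Google has described, so treat the result as an upper bound. The function name and the 10% per-sample figure are made up for illustration.

```python
def p_at_least_one_correct(p_single: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumption (not from any lab's docs): samples are independent and a
    perfect selector picks the correct answer whenever one exists.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    # Complement of "every single sample is wrong".
    return 1.0 - (1.0 - p_single) ** n_samples


if __name__ == "__main__":
    # Hypothetical model that is right only 10% of the time per sample.
    for n in (1, 4, 16, 32):
        print(n, round(p_at_least_one_correct(0.10, n), 3))
```

Under those assumptions, a 10% per-sample hit rate becomes roughly 97% with 32 parallel samples, which is the "occasionally right" to "right most of the time" conversion being described. A real system without a perfect selector would land somewhere below that.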
I've definitely done more than 10 prompts today fwiw
is it a soft limit right now?
If it's really sota sota then play a game of chess till just 15 moves without hallucinating
Why are all the posts here so weird
So same base model essentially confirmed. But their current cost constraints would not allow them to offer anything better than people got today. That's the best they can do
in a nutshell
Smth like 100k+ thinking from a huge model with a ton of parallel instances is just not realistic to serve
Too concise...
oai really ruined it all for us
with the astronomical monthly 200$ plan
i mean i knew other labs will follow suit
but whats this?????
10 prompts per day for 200$ ?????????????
Sure, but then you can't charge 200$ $300
or you may as well become irrelevant soon enough. Or less relevant than you were hoping for 👀
It's all for nothing if it doesn't materialize and does not reach people
yeah like... People could care less about things "more strategically important to the company", and the company itself will cease to be important if people can't be satisfied and the demand can't be met
so how is deepthink?
lmao openai staff got caught using anthropic models
that's funny
if even openai doesn't use openai models, i don't know what to expect from gpt 5
@patent aspen coding
anthropic staff look into your data remember that
the data policy of anthropic is the worst of all
they said that openai was using their models to AI improvement related tasks
ye, idk if openai was drawing the line
I think OAI is the least legally compliant of the trio
it's not like that OAI was doing it, like sam said for them to do it
but their staff was
they cut access from personal employees
this article is bad news too
I think it's great for math, science, really hard computer science problems like distributed computing. It's meh at coding although very few bugs
now i'm not sure if gpt 5 is coming next week
apparently gpt 5 is not a big leap from 4o
google is also facing difficulties
when does 3 come?
doesnt openai leak info to the information? (or at least it seems that way) maybe theyre trying to downplay it a bit and let everyone be a little surprised
the only lab that do not suffer difficulties is anthropic
the paper that they released today wtf bro
even mark offering a bunch of money, anthropic researchers refused
what is happening there
do you think it's going to be much different than the 2.5 series
people were impressed by zenith/etc. and there have been massive preparations for gpt-5 in the frontend/backend apparently that people have datamined. it would be odd for it to be significantly delayed
is the article worth $299
ty archive.ph no luck
np
not much actual info on gpt5 there
you wish craig
i was expecting it to be an underwhelming wrapper that just routes to best previously existing model, but feedback on zenith suggested sota
Even if we did hear about Apple announcing they would acquire Anthropic, it wouldn't be confirmed because of the subsequent FTC and congressional approvals
only definitive claim it makes is the leap won't be as big as gpt3 -> 4 and like... yeah
In that situation they'd probably have about a 60-70% chance of success, but the risks of opening an antitrust investigation probably wouldn't be worth it
kick trump $5m dono should be fine
They would also have the EU to worry about
The problem is that, even if it were only a tail risk, a tail risk of potentially doing major damage to your core business probably wouldn't be worth it for Apple
And they could get most of the same benefits by partnering
when gpt-5
Then they can participate in the AI race and have more negotiation leverage. It would derisk their business a bit
At the moment they're a luxury real estate company as far as AI is concerned
Their service businesses are also threatened by AI to some extent
I'm talking mega long term
The other options are off the table, and if they own Anthropic, they can make it whatever they want
They probably can't buy Google, OAI, or xAI
They don't have the talent
I thought you were talking about them building their own models
I mean even OAI is using TPUs over Nvidia on GCP so...
why is it a myth
elaborate
its actually years ahead of any major lab
they can have their own internal mini cuda but pretty sure its nowhere near it
IMO Cuda is replaceable because the AI companies can just push software up to a higher layer of abstraction given a long enough time horizon
At a certain point, you just use PyTorch, Tensorflow, Jax
They are now but they can wrap ASICS too and eventually that's just cheaper
Developers just use high level libraries
XLA is basically an ML compiler
where did the claude models go?
i see everything other than the anthropic models
Can you send a screenshot?
yea but the abstraction wont be perfect... its not like pytorch or tensorflow will be used for anything, i mean to achieve a similar performance like in cuda you need to align perfectly code & hardware.. that's why there are libs like cudnn & cublas that are engineered precisely to get max performance from their tensor cores. and lets say we for example moved to amd rocm even though there's support the performance wont be the same
take throughput diff between a100 & mi215 for example
which is actually 20%
and for tpus, you have to be married by contract to google to use them, so i wont talk about that
for aws, their ceo literally said trainium is like a supplement to nvidia gpus, and for the majority of workloads they will still use nvidia
so it only created a little competition with billions $$ spent and many collabs too
yea because we are talking about an ecosystem
Right I mean Jax is basically a high level library built on top of TPUs that can also interface with GPUs. I think the trend towards higher and higher level abstractions over ASICS is already in motion and will continue
That's strange. What browser are you using?
Are others seeing the same ^ ? (claude models not appearing in list)
nope they're there
im using brave
yep
check with google
bro what 💀
okay dont say google because thats not a browser
just say chrome
buddy u dont have to take it literally
Basically Microsoft and the companies that aren't tech giants
u know what i meant
i got confused for a sec
yeah on chrome its visible
The Nvidia ecosystem is still pretty dominant today. I just don't think that will be the long term trend
disable all extensions temporarily
on brave
I think a lot of legacy code will remain Nvidia-based for decades though
while i agree, 'long-term' is kinda vague
ofc things will change in the future
but i would give it like +10 years or more to replicate something like cuda
Nvidia is well positioned enough that they will always be relevant. I just don't think it's actually necessary to replicate CUDA if you can offer comparable performance at 1/5 the cost
just the migration process will be a headache
if this so 'imaginary' company succeeded
That's definitely relevant for a lot of companies, although if the major frameworks migrate, then it's way less work to migrate
Like imagine if <insert your favorite ML framework> just has a one-line config to select the hardware backend
yea but thats a big IF
i hope you are not only taking TFLOPS as the only criteria
Why wouldn't it happen if the economic incentives became big enough?
again if your software stack isnt as optimized as cuda, then its a waste of time, they all have good theoretical performance cards
1/5 is just TCO
cost of ownership
what about electricity
what about space
thats if we are assuming that the performance is like 70%
Why buy electricity and space? That's what the cloud is for
I'll let team know, but yeah doesn't appear to be widespread, I wasn't able to repro even on Brave browser either.
https://vxtwitter.com/testingcatalog/status/1951320162541388045
FYI, this guy has access to the gold IMO deepthink model, and has been sharing some tweets about what it makes
Gemini Deep Think IMO 👀
It is one of the first models which I am testing extensively b/c it is very fun to play with.
"Cyberpunk nuclear reactor control interface" https://t.co/y5zHfZYm6Y
QRT: testingcatalog
I have Gemini "Deep Think IMO" mode 👀 What should I ask? https://t.co/EhDw7kOAb3
yea thats from a user perspective, but whos paying the bill?
whos calculating tco?
yeah would be nice to have that error fixed
Mainly because they make 80+% margins on hardware
That's an opportunity if I ever saw one
TCO includes the electricity, space, etc
Well definitely the electricity at least
ye, people with deep think imo don't pay for it and have unlimited usage, not commenting about the model being better either. It's like a spit in the face of paying customers
yea sorry i meant tco is what should be calculated/used as ref not just gpu cost only
it does include space & electricity
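The back-and-forth above boils down to cost per unit of delivered performance. Here is a minimal sketch using only the two figures floated in the chat (1/5 the TCO, roughly 70% of the performance). Both are guesses from the conversation, not measured numbers, and the function name is hypothetical.

```python
def relative_cost_per_perf(tco_ratio: float, perf_ratio: float) -> float:
    """Cost per unit of performance, relative to the incumbent (= 1.0).

    tco_ratio:  challenger TCO / incumbent TCO (hardware, electricity, space).
    perf_ratio: challenger real-world throughput / incumbent throughput.
    Both inputs are assumptions supplied by the caller.
    """
    if tco_ratio < 0:
        raise ValueError("tco_ratio must be non-negative")
    if perf_ratio <= 0:
        raise ValueError("perf_ratio must be positive")
    return tco_ratio / perf_ratio


if __name__ == "__main__":
    # Figures from the chat: 1/5 the TCO at about 70% of the performance.
    print(round(relative_cost_per_perf(0.2, 0.7), 3))
```

With those inputs the challenger comes out at about 0.29 of the incumbent's cost per unit of performance. Break-even is where perf_ratio drops to tco_ratio, so a 1/5-TCO chip would have to fall to about 20% of the performance before it stops being cheaper. This ignores the migration headache raised earlier, which is a one-off cost the sketch does not model.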
how do they get it?
or is it just random
friends
wowie
The deep think that they announced is the deep think imo. The deep think released is a slight improvement from gemini 2.5. Deep think imo looks like another model, you can check doing the same prompts pelican, star shoot game, etc
They added an unreasonable rate limit and a worse model for paid customers. While they gave influencers a much better model with unlimited requests
so selfish..
to create videos for my school projects 😛
????????????????????????????
He actually made stuff up
The difference between the imo model and the released one is inference config only
you are not
he deleted his message saying that he is a googler
lmao
well, unlike you i can reference actual sources for the shi* i say
@burny_tech @GoogleDeepMind This is a variation of our IMO gold model that is faster and more optimized for daily use! We are also giving the IMO gold full model to a set of mathematicians to test the value of the full capabilities.
we need more people asking how many rs are in strawberry using deepthink tbh
AGI benchmark
this @patent aspen is a clown
"made stuff up"
dumb clown
mf is lying
deep think
is the same thing
there is no different deep thinks
Children stop fighting
He’s not sundar Pichai. He just works there and didn’t make these decisions. Don’t have to be mean
he is lying lol
why you're on his side
i'm showing actual sources from people who are directly on the project
and you guys are on the side of the guy
without sources?
lmao
but he is lying all the time
he deleted his lies
every time i show proof of the opposite of what this guy is saying, he deletes his message
why do you guys like liars lmao
???
bro the lies is some messages ago
there is a lot more too if you search his messages
i can create an exposé with more than 30 lies
that this guy made
this is sick
i think that you guys are the same account now
prob
the 3 of you lmao
i'm already making
a lot actually
already doing
but it's automated so
i can waste my time with whatever i want to
you're sick with 3 discord accounts lmao
kind of funny
blocked, i don't like to read lies
bye bye
bcs you're his alt account
you're the king of bs
brian isnt lying btw
Our IMO journey continues: the yolo run model that we trained a week before #imo2025, despite all possible likelihood of failures, magically achieves SOTA across a wide range of reasoning tasks from maths, to coding, and challenging knowledge. I'm very excited that we have now delivered the IMO 🥇 system to the hands of mathematicians and a simplified version (results below) to all Google AI Ultra subscribers.
Quoting Thang Luong (@lmthang)
︀
Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened!
︀︀
︀︀We put all individual recipes (that we figured out before) together and did a yolo run (with the compute that I had to beg various groups to loan) to train our most advanced Gemini model. We finished training 2 days before IMO :D That model achieved SOTA results, not just for math, but coding alo…
imagine saying the opposite of the head of the gemini deep think project and using alt accounts to support it. Another level of commitment
yeah, he is lying
you're right
if you say so
we're all brian's alts 🤣
Even if you work at google, you are not part of the deep think team
bcs i know all of them
not 2 versions, one is deep think with prompt and the other just the model and the benchs
They aren't my friends, i just know who they are, and their respective discords too
elaborate
brian is true
if you can't read the messages above, that's not my problem
so wtf is with prompt and with the bench????????????????????????????????
I just showed you there are three different deepthink Animated
You should tell the TPU team to scale up for DeepThink :D
You guys have to scale for both DeepThink and Gemini 3.0 damn
huh doesn't gemini 2.5 ultra have above 1M context or something? So why does DeepThink only have 100k? Cost?
lol why would someone who works on deep think be spilling a bunch of insider info on their main anyway
the problem is when the things they say don't match with the things he says
brain
or do you believe that they are lying on X?
scrolling up you're not going to be convinced no matter what anyone tells you
i'm not going to waste my time
have fun
He deleted his messages anyway