#general | Arena | Page 39

elder rapids May 10, 2025, 11:44 PM

#

exactly lmao, it got released with 2m then got capped at 1m

#

bruh

keen beacon May 10, 2025, 11:44 PM

#

elder rapids exactly lmao, it got released with 2m then got capped at 1m

give me a source anywhere that it got capped to 1m

elder rapids May 10, 2025, 11:44 PM

#

did discord just delete everything

keen beacon May 10, 2025, 11:44 PM

#

ur just making stuff up

elder rapids May 10, 2025, 11:45 PM

#

now I have to type ts all again

leaden palm May 10, 2025, 11:45 PM

#

take that up with kalomaze

kalomaze (@kalomaze) on X

the official deepseek webui lets you go wayyyyy beyond the context length that their models properly support and while i do respect this approach (vs. truncation) they should really warn you when youve gone enough turns deep for the CoT to no longer be doing anything useful

kind cloud May 10, 2025, 11:45 PM

#

new google

Screenshot_2025-05-11-08-44-38-833_com.android.chrome.jpg

elder rapids May 10, 2025, 11:46 PM

#

exactly lmao, it got released with 2m then got capped at 1m, then got reduced to 32k because people couldn't reach 1m then got completely reverted to 2m

keen beacon May 10, 2025, 11:46 PM

#

ah ur trolling lmao

elder rapids May 10, 2025, 11:46 PM

#

there was never a time you could use the full 2m

#

no I'm not lol

frosty lark May 10, 2025, 11:46 PM

#

kind cloud new google

I see this mentioned a lot online, but I am always skeptical of asking the model directly. I mean they can (a) fake it and (b) reply with training data, saying it is chatgpt or the like

elder rapids May 10, 2025, 11:46 PM

#

you have no idea what youre talking about and clearly you rely off of 3rd party sources

#

just go on the bard subreddit

#

it's not that hard

#

😭

#

how am I going to prove a counterfactual

#

you're not making any sense dawg

#

goddamn

elder rapids May 10, 2025, 11:47 PM

#

leaden palm [take that up with kalomaze](https://x.com/kalomaze/status/1919040818859450555)

wait wym?

#

^ this what happens

frosty lark May 10, 2025, 11:47 PM

#

elder rapids you're not making any sense dawg

to whom are you talking?

elder rapids May 10, 2025, 11:47 PM

#

unless you mean something entirely different

leaden palm May 10, 2025, 11:48 PM

#

idk then

#

perhaps it behaves differently if youre just chatting from if youre uploading a large document

#

or maybe the ui changed since

keen beacon May 10, 2025, 11:50 PM

#

elder rapids you have no idea what youre talking about and clearly you rely off of 3rd party ...

i never remembered it being like that. you're making stuff completely up/mixing up stuff. https://web.archive.org/web/20250125234911/https://openrouter.ai/google/gemini-exp-1206
frankly i dont need to argue about anything when i can just let you yap and it's obvious who's right

Gemini Experimental 1206 - API, Providers, Stats

Experimental release (December 6, 2024) of Gemini.. Run Gemini Experimental 1206 with API

elder rapids May 10, 2025, 11:51 PM

#

kind cloud new google

@keen beacon

elder rapids May 10, 2025, 11:51 PM

#

keen beacon i never remembered it being like that. you're making stuff completely up/mixing ...

dawg you can LITERALLY see dozens of accounts of people simply not being able to access 2m

#

😭

#

because you never were able to

#

Logan was tripping

#

this was consensus about 1206

#

that's it that's all

#

used that model everyday

#

every hour

keen beacon May 10, 2025, 11:52 PM

#

you didnt even remember 2.0 pro's context window 😭

frosty lark May 10, 2025, 11:53 PM

#

is it really life changing to argue about the context limit of an llm (almost) no one will remember in 1 year?

zinc ore May 10, 2025, 11:53 PM

#

Gemini plan

frosty lark May 10, 2025, 11:54 PM

#

besides, AFAIK the context window doesn't matter if the model cannot really work on it properly (see NoLiMa paper here: https://arxiv.org/abs/2502.05167)

keen beacon May 10, 2025, 11:54 PM

#

frosty lark is it really life changing to argue about the context limit of an llm (almost) n...

they yap about nonsense all the time. they can't just drop something if they get something wrong

frosty lark May 10, 2025, 11:55 PM

#

one can simply ignore them if needed.

#

I get it because I also get involved in time consuming pointless discussion from time to time

elder rapids May 10, 2025, 11:55 PM

#

keen beacon you didnt even remember 2.0 pro's context window 😭

what makes you think I straight up don't remember lol, Im just not absolutely confident it were the case it worked up to 2m context but I do vaguely agree it had 2m

#

1206 didn't have 2m

#

you've never been right in disagreement with me for something

#

not a single thing

#

you always bring up nonsense

#

like deadass

#

you say sh*t get it wrong

#

then dismiss it altogether

#

😭

golden ocean May 10, 2025, 11:57 PM

#

frosty lark is it really life changing to argue about the context limit of an llm (almost) n...

comments like this on ongoing arguing are useless

elder rapids May 10, 2025, 11:57 PM

#

ye

frosty lark May 10, 2025, 11:57 PM

#

people I understand the excitement for llm but one needs to still be able to search

#

https://justoborn.com/google-gemini/#gsc.tab=0

"Google Gemini! In a groundbreaking announcement on December 17, 2024, Google CEO Sundar Pichai unveiled Gemini-Exp-1206,

marking a significant evolution in artificial intelligence technology.

This experimental version of Google’s most advanced AI model introduces a remarkable 2,097,152-token context window, setting new benchmarks in AI capabilities."

end of it.

elder rapids May 10, 2025, 11:58 PM

#

frosty lark people I understand the excitement for llm but one needs to still be able to sea...

wym be able to search?

elder rapids May 10, 2025, 11:58 PM

#

frosty lark https://justoborn.com/google-gemini/#gsc.tab=0 "Google Gemini! In a groundbreak...

yo, IT DIDNT WORK PAST 1M

#

holy

#

do you read bro

frosty lark May 10, 2025, 11:58 PM

#

simply pick the news at the time, and that should be enough

elder rapids May 10, 2025, 11:58 PM

#

they said 2m

#

did not work at 2m

#

that's it that's all

#

nobody cares about the model card

frosty lark May 10, 2025, 11:58 PM

#

yes and likely didn't work at 1m either (see NoLiMa), but the discussion is really long.

#

most models struggle past 128-256k

frosty lark May 10, 2025, 11:59 PM

#

elder rapids wym be able to search?

I mean with the classic search

#

sometimes LLM driven search isn't great

#

interesting though for 1206 Exp I cannot find a model card in pdf

#

tricky to check old specs without

elder rapids May 11, 2025, 12:02 AM

#

frosty lark yes and likely didn't work at 1m either (see NoLiMa), but the discussion is real...

nah I mean like, it accepted a ton of tokens near 1m

#

but simply errored at 2m

frosty lark May 11, 2025, 12:02 AM

#

this is what the other account is saying (I cannot type the username sorry): https://www.reddit.com/r/Bard/comments/1h9aoa7/anyone_else_having_problems_with_geminiexp1206/?show=original

elder rapids May 11, 2025, 12:03 AM

#

ye ye, It would break after 32k often

#

but if you progressed slowly it wouldn't have too much issues

#

I used to have translation discussions with it that went up to 70k

frosty lark May 11, 2025, 12:06 AM

#

super annoying the absence of model cards. With normal webpages - if there is no internet archive version of it - everything can be "deleted" so to speak.

golden ocean May 11, 2025, 12:06 AM

#

Is pier ai

frosty lark May 11, 2025, 12:07 AM

#

the closest I found is: https://www.reddit.com/r/Bard/comments/1h863qj/new_gemini_experimental_1206_with_2_million_tokens/ but then again screenshots can be changed

elder rapids May 11, 2025, 12:07 AM

#

frosty lark https://justoborn.com/google-gemini/#gsc.tab=0 "Google Gemini! In a groundbreak...

btw they did this for 2.0 flash thinking exp, which iirc actually had 32k context for a while; they said 1m, then updated it to 32k, and then upgraded the model to higher than 32k

frosty lark May 11, 2025, 12:07 AM

#

golden ocean Is pier ai

I wish, at least I could make less errors.

#

or I could claim "oh sorry I hallucinated. I didn't mean that, you are right. Let me revise yadda yadda"

golden ocean May 11, 2025, 12:09 AM

#

frosty lark or I could claim "oh sorry I hallucinated. I didn't mean that, you are right. Le...

Type 'hop on league of legends' in gifs

frosty lark May 11, 2025, 12:09 AM

#

golden ocean Type 'hop on league of legends' in gifs

what?

#

this one you mean? https://www.youtube.com/watch?v=zVIq2aghM7I

I don't follow that game.

elder rapids May 11, 2025, 12:10 AM

#

frosty lark the closest I found is: https://www.reddit.com/r/Bard/comments/1h863qj/new_gemini...

nah the ss is real, what I'm talking about is how often they changed what it showed

#

so it would be 2m context, then 32k then 1m

frosty lark May 11, 2025, 12:11 AM

#

frustrating I guess

#

if they do it without warning

#

btw today I was trying to dig some historical fact with the help of LLMs (grok, perplexity and what not). The amount of hallucinations in the responses was appalling.

Everyone is shouting "AGI soon" but I'll be happy with LLM boosting searches already without hallucinating every 3 messages.

elder rapids May 11, 2025, 12:13 AM

#

frosty lark btw today I was trying to dig some historical fact with the help of LLMs (grok, ...

ong

#

searches induce hallucinations so much

#

they're all unusable

#

grok is the only one that can actually infer from the sources themselves

#

4o does decently but only after the fact

#

0506 2.5 pro only does well when citing sources

#

and perplexity is deadass useless

frosty lark May 11, 2025, 12:16 AM

#

I find perplexity ok in most cases (info is widespread and without much fakes or controversy) but when one wants to dig something that has misinformation online is pretty frustrating.

#

my experience is that the LLMs (in perplexity and other platforms) do not really check the credibility of the source. If one is found, they read it and include the text in the answer.

That could lead to many conflicting statements.

elder rapids May 11, 2025, 12:20 AM

#

ye, to me it seems like grok was the only one that could somewhat decipher fake from non fake

#

deepseek did well too

#

but that's as useful as they could get

keen beacon May 11, 2025, 12:44 AM

#

elder rapids you always bring up nonsense

list all the times ive said outright nonsense. @elder rapids once said thinking models aren't transformers. 🤣 🤣 #general message

don't make me list out the stupid statements youve made

keen beacon May 11, 2025, 12:45 AM

#

frosty lark this is what the other account is saying (I cannot type the username sorry): htt...

this is different. requests erroring out at a higher rate at higher contexts are a different issue

keen beacon May 11, 2025, 12:45 AM

#

elder rapids then dismiss it altogether

dude the reason i do that is because i can just let your comments speak for itself and just let you crash out 🤣

keen beacon May 11, 2025, 1:05 AM

#

frosty lark the closest I found is: https://www.reddit.com/r/Bard/comments/1h863qj/new_gemini...

its ridiculous that i have to do this but here's an article with a full screenshot of aistudio back then: https://starthub.asia/test-driving-googles-gemini-exp-1206-model-competitive-data-analysis-and-sophisticated-visualizations-in-under-a-minute/ (i find it extremely unlikely that both of the screenshots were faked and they changed it to some arbitrary 2m context value, (it's a specific value and not on the dot))

Starthub Asia

VentureBeat

Test-driving Google’s Gemini-Exp-1206 model: Competitive data ana...

#

guy confused a prev gemini experimental version with 32k context, iirc, doubled down and made things up about them switching context sizes on 1206 🤣

#

(unrelated note) i lost the tweet where aidan (now an oai employee) conceded that o1 = 4o size, etc. found it again (check replies too) https://xcancel.com/ns123abc/status/1870207399329739164

elder rapids May 11, 2025, 1:22 AM

#

keen beacon list all the times ive said outright nonsense. @elder rapids once said ...

I've never said they weren't transformers lmao, even @leaden palm could agree, whom I had the discussion with, that I clarified what I meant

#

disingenuous asf, bro knows too

south cove May 11, 2025, 1:23 AM

#

10M context is crazy lol

#

my initial input is about 100k and I think that's really high

elder rapids May 11, 2025, 1:23 AM

#

keen beacon dude the reason i do that is because i can just let your comments speak for itse...

you would struggle to demonstrate that btw

#

I've never said anything that postures ignorance

#

just sometimes less explanation

#

and then clarified

south cove May 11, 2025, 1:24 AM

#

actually, I'm connecting the dots now... I think the reason why Gemini seems to "implode" after a few dozens responses is because it loses the original context

wintry tinsel May 11, 2025, 1:24 AM

#

zinc ore Gemini plan

Confirmed?

zinc ore May 11, 2025, 1:24 AM

#

It's confirmed yes

#

Ultra plan exists

#

That's from legit

south cove May 11, 2025, 1:25 AM

#

zinc ore Gemini plan

"Gemini, make me an image of a UI with a blue circle and a rounded rectangle with the word "ULTRA" in it, with three vertically-aligned dots to the left of it, so I can troll the LMArena discord"

zinc ore May 11, 2025, 1:25 AM

#

From legit

#

I don't send unverified stuff

elder rapids May 11, 2025, 1:26 AM

#

keen beacon its ridiculous that i have to do this but here's an article with a full screensh...

legit has NOTHING to do with what I said, 1206 simply never had the capacity to go up to 2m despite stating so (as per the original claim of "1.5 pro had 2m context but that's it tbh") nobody cares about what it stated, the only model that wouldn't error out was 1.5 pro

#

you're deadass bugging

#

I even addressed the screenshot pier brought up and you're reiterating that same substance knowing it's not meaningful

elder rapids May 11, 2025, 1:28 AM

#

keen beacon (unrelated note) i lost the tweet where aidan (now an oai employee) conceded tha...

this is about o1 pro

#

dawg

keen beacon May 11, 2025, 1:33 AM

#

elder rapids this is about o1 pro

http://xcancel.com/aidan_mclau/status/1870209346925408447#m

Nitter

Aidan McLaughlin (@aidan_mclau)

size of o1
whether o1 pro is just o1 with reasoning_effort=high

elder rapids May 11, 2025, 1:43 AM

#

keen beacon http://xcancel.com/aidan_mclau/status/1870209346925408447#m

ye I remember this, but they were talking about whether explicitly o1 at the same context of 4o being so different and Dylan was saying some shi about it not having anything to do with size

#

not whether the base estimation for 4o is true

#

when both Aidan and Dylan claimed they were told

keen beacon May 11, 2025, 1:45 AM

#

elder rapids not whether the base estimation for 4o is true

https://x.com/dylan522p/status/1869077942305009886?ref_src=twsrc^tfw

Dylan Patel (@dylan522p) on X

@aidan_mclau @natolambert @Luigi1549898 Table of contents is a TLDR.
Specifically bottom third is description of the architecture.
4o, o1, o1 preview, o1 pro are all the same size model.

elder rapids May 11, 2025, 1:45 AM

#

ye I already know this dawg

#

but this isn't meaningful

keen beacon May 11, 2025, 1:45 AM

#

you were saying o1 was much bigger than 4o because it just can't be

#

want me to link to ur own statements?

#

dont bother deleting them i have screenshots

elder rapids May 11, 2025, 1:50 AM

#

keen beacon you were saying o1 was much bigger than 4o because it just can't be

because it can't be, there's nothing about what Dylan said that refutes what Aidan said

"o1-preview is 5x the cost of gpt-4o
o1-preview is based on gpt-4-0314 (source: openai team)

gpt-4o is ~5x smaller than o1-preview
you claimed gpt-4o's size = o1's size
ergo o1 is 5x smaller than o1-preview"

but when Dylan is claiming

"90% wrong
60% right
70% wrong"

in context of a different (listed) tweet about what Aidan said

maiden fulcrum May 11, 2025, 1:50 AM

#

i was wondering what is the best llm right now?

elder rapids May 11, 2025, 1:51 AM

#

I'm curious as to why you think I'd delete claims I'm not at the forefront of

maiden fulcrum May 11, 2025, 1:51 AM

#

i am using gemini 2.5 pro preview 05-06 and chatgpt o3

elder rapids May 11, 2025, 1:52 AM

#

maiden fulcrum i am using gemini 2.5 pro preview 05-06 and chatgpt o3

you're fine with these two

maiden fulcrum May 11, 2025, 1:52 AM

#

which is better?

#

i also use chatgpt 4o

keen beacon May 11, 2025, 1:52 AM

#

elder rapids because it can't be, there's nothing about what Dylan said that refutes what Aid...

no those deleted tweets were aidan talking about o1's size and o1 pro lol. iirc. u had no idea about this tweet but said "yes, i remember"

elder rapids May 11, 2025, 1:52 AM

#

maiden fulcrum which is better?

what do you use them for

elder rapids May 11, 2025, 1:52 AM

#

keen beacon no those deleted tweets were aidan talking about o1's size and o1 pro lol. iirc....

dawg

#

😭

maiden fulcrum May 11, 2025, 1:52 AM

#

analyzing real life situations and reasoning

#

i am not a coder

elder rapids May 11, 2025, 1:53 AM

#

those specific tweets are about o1's in speculation

maiden fulcrum May 11, 2025, 1:53 AM

#

so which one do i use

elder rapids May 11, 2025, 1:53 AM

#

not whether they're actually 4os size or not or that screenshot Dylan claimed specifically

#

they're different lines of questioning from Aidan

elder rapids May 11, 2025, 1:54 AM

#

maiden fulcrum analyzing real life situations and reasoning

depends tbh,

#

o3 hallucinates a lot

maiden fulcrum May 11, 2025, 1:54 AM

#

most of the time i use 4o

#

what does hallucination mean in llms

elder rapids May 11, 2025, 1:54 AM

#

and 2.5 pro will have worse conclusions on avg

elder rapids May 11, 2025, 1:55 AM

#

maiden fulcrum what does hallucination mean in llms

it makes up stuff

#

or lies

keen beacon May 11, 2025, 1:55 AM

#

elder rapids dawg

see a thread of the tweet chain he deleted:
he was mentioning the size of o1 in that discussion

elder rapids May 11, 2025, 1:56 AM

#

keen beacon see a thread of the tweet chain he deleted: he was mentioning the size of o1 in...

ye that o1 is a smaller model than o1 preview

keen beacon May 11, 2025, 1:56 AM

#

How did u even get that???

#

omfg

elder rapids May 11, 2025, 1:56 AM

#

"90% wrong
60% right
70% wrong"
is Dylan saying, you're 90 wrong on one claim, 60 right on one claim

maiden fulcrum May 11, 2025, 1:56 AM

#

so what do you think @elder rapids

elder rapids May 11, 2025, 1:56 AM

#

70 wrong on one claim

#

it was a list initially

keen beacon May 11, 2025, 1:56 AM

#

elder rapids "90% wrong 60% right 70% wrong" is Dylan saying, you're 90 wrong on one claim, ...

he was literally meming based on the wording of aidan's original tweet iirc

#

https://xcancel.com/aidan_mclau/status/1870207724463493623#m look when he conceded he repeated it

Nitter

Aidan McLaughlin (@aidan_mclau)

it turns out, i was indeed:
90% wrong
60% right
70% wrong

elder rapids May 11, 2025, 1:57 AM

#

keen beacon How did u even get that???

https://x.com/aidan_mclau/status/1869076202230894707

Aidan McLaughlin (@aidan_mclau) on X

@dylan522p @natolambert @Luigi1549898 can you give me tl;dr?

i *really* doubt o1-preview is smaller than newsonnet. what's your evidence for o1's smaller size than preview?

also i heard from team they just add special tokens to rl to encourage long/short chain length. assuming that's what they do for pro

elder rapids May 11, 2025, 1:58 AM

#

keen beacon https://xcancel.com/aidan_mclau/status/1870207724463493623#m look when he conced...

ye he's saying that Dylan had the correct judgment about that specific listing

#

but this is an entirely different line of questioning https://x.com/dylan522p/status/1869077942305009886

https://x.com/aidan_mclau/status/1869082099522879740

Dylan Patel (@dylan522p) on X

@aidan_mclau @natolambert @Luigi1549898 Table of contents is a TLDR.
Specifically bottom third is description of the architecture.
4o, o1, o1 preview, o1 pro are all the same size model.

Aidan McLaughlin (@aidan_mclau) on X

@dylan522p @natolambert @Luigi1549898 o1-preview is 5x the cost of gpt-4o
o1-preview is based on gpt-4-0314 (source: openai team)

gpt-4o is ~5x smaller than o1-preview
you claimed gpt-4o's size = o1's size
ergo o1 is 5x smaller than o1-preview

keen beacon May 11, 2025, 1:58 AM

#

elder rapids ye he's saying that Dylan had the correct judgment about that specific listing

lmao he was literally meming aidan's original tweet and u interpreted it literally because u had no idea that this thread existed at all

elder rapids May 11, 2025, 2:04 AM

#

keen beacon lmao he was literally meming aidan's original tweet and u interpreted it literal...

confused on wym lol, im not taking whatever he's saying literally, whether or not he's actually agreeing or disagreeing is meaningless it's just simply a different line of questioning
https://x.com/stalkermustang/status/1869084918527270922

Igor Kotenkov (@stalkermustang) on X

@dylan522p @aidan_mclau I agree on the first two, but why the last? do you think o1 pro isn't the same model with just longer reasoning chains (and maybe adjusted temperature or idk)?

#

and I'm not saying I remember these specific lines of tweets

#

you're bugging lmao

#

it's simply been talked about on reddit

#

so I recall the dialectic

keen beacon May 11, 2025, 2:38 AM

#

its the same as 4o

#

o1 = 4o = o1 preview

#

https://xcancel.com/aidan_mclau/status/1870209346925408447#m

Nitter

Aidan McLaughlin (@aidan_mclau)

size of o1
whether o1 pro is just o1 with reasoning_effort=high

#

he conceded that dylan (semianalysis) was right. he initially made a thread claiming several things about the size of o1 (90% sure, iirc), o1 pro (70% sure), etc., presumably he was given word that dylan was right then deleted the tweets and conceded there

leaden palm May 11, 2025, 2:43 AM

#

i can't believe it hasn't been a year since claude 3.5 or o1 yet

keen beacon May 11, 2025, 2:43 AM

#

yes its just best of n or something

#

inference time adjustments - dylan (semianalysis)
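
For anyone unfamiliar with the term, "best of n" generally means sampling several answers and keeping the one a scorer rates highest. A minimal sketch of that idea follows; `generate`, `score`, and the toy stand-ins are hypothetical placeholders, and nothing here is a claim about how o1 pro actually works.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Sample n candidate answers and return the one the scorer rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (hypothetical): a "model" that returns random guesses
# and a "scorer" that prefers guesses closer to 42.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return str(rng.randint(0, 100))


def toy_score(prompt: str, answer: str) -> float:
    return -abs(int(answer) - 42)


print(best_of_n("guess the number", toy_generate, toy_score, n=8))
```

The model itself is unchanged in this setup; only the amount of sampling at inference time goes up, which is why it counts as an inference-time adjustment.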

elder rapids May 11, 2025, 2:48 AM

#

I mean, it just thinks for longer lol

#

trained to think that way should bring out different effects

small haven May 11, 2025, 3:13 AM

#

olive mesa May 11, 2025, 3:30 AM

#

leaden palm May 11, 2025, 3:34 AM

#

olive mesa

asi implies agi implies not today's transformer

#

although imo the progression is more likely to be in different optimizations, training, and modalities than a completely new architecture

small haven May 11, 2025, 3:46 AM

#

its kinda insane that this isn't close to 100%

zinc ore May 11, 2025, 3:57 AM

#

Too many "soons" that end up taking months

wintry tinsel May 11, 2025, 4:07 AM

#

#

Elegant solutions for a more civilized age

small haven May 11, 2025, 4:39 AM

#

claude code is just built diff

calm sequoia May 11, 2025, 6:28 AM

#

Anyone tested Drakesclaw?

keen beacon May 11, 2025, 7:04 AM

#

elder rapids <@456226577798135808>

i was the first one to say anything about it lmao

elder rapids May 11, 2025, 7:43 AM

#

keen beacon i was the first one to say anything about it lmao

damn fr?

#

also mb I pinged you because I wanted to ask if there was any info about its performance

#

not just confirmation

keen beacon May 11, 2025, 7:52 AM

#

ah

#

yeah I'm not sure so far but I haven't tested it much

#

it does seem at least on par with 2.5 pro

#

in webdev anyway

keen beacon May 11, 2025, 8:11 AM

#

drakesclaw is interesting

#

it reminds me of an old google anon model that appeared on the arena but i can't remember what that was

#

it has some weird formatting quirks - likes breaking short responses into new lines, sometimes randomly capitalises words and can occasionally make spelling mistakes

#

it is also on the webdev arena now

keen beacon May 11, 2025, 8:36 AM

#

im tempted to say this is better than 2.5 pro

calm sequoia May 11, 2025, 8:44 AM

#

poll_question_text

How strong was the Gemini 2.5 PRO nerf?

victor_answer_votes

5

total_votes

16

keen beacon May 11, 2025, 9:00 AM

#

drakesclaw vs gemini 2.5 pro 0506

#

(0-shot)

#

enlarge the images, they're full-page ss

keen beacon May 11, 2025, 9:16 AM

#

drakesclaw after being asked to make it more realistic

#

gem 2.5 pro

torn mantle May 11, 2025, 10:22 AM

#

not impressed tbh

#

ive also noticed the weird formatting

keen beacon May 11, 2025, 10:45 AM

#

torn mantle not impressed tbh

it's better than 2.5 pro though lmao

#

it was significantly more realistic for 0-shot vs 2.5 pro (there are also icons missing since i didn't realise it tried to use fontawesome the first time around), and i still prefer the layout and content on the 2nd version for drakesclaw

unborn ocean May 11, 2025, 10:46 AM

#

@calm sequoia https://andreyfradkin.com/assets/demandforllm.pdf Some stats about essentially using open router as a benchmark

#

Although the guy sadly did not try any price * token calculations
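
The price * token calculation mentioned here could be sketched roughly as below. The model names, token counts, and per-million-token prices are made-up placeholders, not figures from the paper or from OpenRouter; the point is only that ranking by token volume and ranking by spend can disagree.

```python
# Hypothetical usage figures: tokens served and price per million tokens (USD).
# None of these numbers come from OpenRouter or the linked paper.
usage = {
    "cheap-model": {"tokens": 40_000_000_000, "usd_per_mtok": 0.30},
    "pricey-model": {"tokens": 5_000_000_000, "usd_per_mtok": 9.00},
}


def spend_usd(tokens: int, usd_per_mtok: float) -> float:
    """Approximate spend: token count times price, with price quoted per million tokens."""
    return tokens / 1_000_000 * usd_per_mtok


# Rank the same models two ways: by raw token volume and by estimated spend.
by_tokens = sorted(usage, key=lambda m: usage[m]["tokens"], reverse=True)
by_spend = sorted(usage, key=lambda m: spend_usd(**usage[m]), reverse=True)

print("ranked by tokens:", by_tokens)
print("ranked by spend: ", by_spend)
```

With these placeholder numbers the cheap model leads on tokens while the pricey one leads on spend, so a token-count leaderboard alone says little about where the money goes.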

calm sequoia May 11, 2025, 11:17 AM

#

I like how everybody was sceptical on 3.7 Sonnet and yet every osy switched 😄

#

I like open router stats. It feels like complete opposite of lmarena. Yet the truth is somewhere in the middle. Probably aggregate of benches.

#

Is there data on revenue ratio by source: api vs chat?

earnest parcel May 11, 2025, 11:34 AM

#

calm sequoia I like how everybody was sceptical on 3.7 Sonnet and yet every osy switched 😄

who is this "everybody" ? I immediately thought it was better, after initial testing. #1077285712719794227 message

torn mantle May 11, 2025, 11:39 AM

#

keen beacon it was significantly more realistic for 0-shot vs 2.5 pro (there are also icons ...

didnt try it much tbh but as i said, it felt like a pro model unlike emberwing

keen beacon May 11, 2025, 11:41 AM

#

keen beacon drakesclaw after being asked to make it more realistic

difference is marginal but the fact that drakesclaw needs more instructions to output decent stuff is a minus on my book

torn mantle May 11, 2025, 11:42 AM

#

they seem to be experimenting with different formatting as well?

#

the difference isnt that big tbh

keen beacon May 11, 2025, 11:43 AM

#

keen beacon difference is marginal but the fact that drakesclaw needs more instructions to o...

I haven't experienced that

tall summit May 11, 2025, 12:06 PM

#

calm sequoia I like how everybody was sceptical on 3.7 Sonnet and yet every osy switched 😄

why

tall summit May 11, 2025, 12:06 PM

#

calm sequoia I like open router stats. It feels like complete opposite of lmarena. Yet the tr...

people don't always use the best one

#

and enterprises are also on openrouter while not on lmarena

primal orbit May 11, 2025, 12:24 PM

#

https://www.reddit.com/r/Bard/comments/1kjx251/google_model_dragonclaw_appeared_on_the_lmarena/

From the Bard community on Reddit

#

Has anyone seen it? Search doesn't show mentions

torn mantle May 11, 2025, 12:42 PM

#

primal orbit Has anyone seen it? Search doesn't show mentions

I think he meant drakesclaw

primal orbit May 11, 2025, 12:47 PM

#

ah ok

fleet lintel May 11, 2025, 1:15 PM

#

is drakesclaw google's?

ocean vortex May 11, 2025, 1:15 PM

#

calm sequoia I like open router stats. It feels like complete opposite of lmarena. Yet the tr...

gpt4o-mini is the king

#

40.1B elo

#

lol

#

#

seriously though, this is just statistics it's mostly meaningless...

#

you also have certain models by design outputting much more tokens on average. Claude in particular is unhinged if you don't cap your reasoning budget

calm sequoia May 11, 2025, 2:06 PM

#

You can't look at "today" and expect it to make sense. Look at the month. As I understand it, some data scraping and sentiment analysis company uses 4o mini for their work once a month or so

unborn ocean May 11, 2025, 2:31 PM

#

#

Pretty obvious that it is just one team or something using for finance or science research

#

(Basic stuff like sentiment analysis or classification)

blazing rune May 11, 2025, 2:47 PM

#

It is cheap

torn mantle May 11, 2025, 3:05 PM

#

o1 mini

#

better

golden ocean May 11, 2025, 3:10 PM

#

social_credit

torn mantle May 11, 2025, 3:21 PM

#

wasnt it gpt3

#

or davinci 02

keen beacon May 11, 2025, 3:24 PM

#

emberwing is Google

torn mantle May 11, 2025, 3:36 PM

#

o3 pro is based on that

torn mantle May 11, 2025, 4:33 PM

#

notice how elon stopped talking about grok 3.5

#

its another failure

#

i just dont understand why the staff are working the whole day for + overtime

keen beacon May 11, 2025, 4:34 PM

#

torn mantle notice how elon stopped talking about grok 3.5

elon tends to deliver what he says but any timeline he gives is incredibly wrong

torn mantle May 11, 2025, 4:34 PM

#

so they delivered with grok 3?

keen beacon May 11, 2025, 4:34 PM

#

like he could say "we will do x by next week" and it's done a year later

keen beacon May 11, 2025, 4:34 PM

#

torn mantle so they delivered with grok 3?

well yes?? it's out is it not?

calm sequoia May 11, 2025, 4:46 PM

#

Hmm the R2 is quite behind to the rumors too

torn mantle May 11, 2025, 4:52 PM

#

keen beacon well yes?? it's out is it not?

if you say so

#

i mean by your standards

main gulch May 11, 2025, 4:52 PM

#

calm sequoia Hmm the R2 is quite behind to the rumors too

actually V4 should be released first

#

but I think DS may skip it and deliver a hybrid model

calm sequoia May 11, 2025, 5:14 PM

#

I haven't seen any rumors on v4🤔

ocean vortex May 11, 2025, 5:33 PM

#

main gulch actually V4 should be released first

Not necessarily. They did update V3 already. Although longer reasoning might not be ideal for open-source it must be said... Most providers are struggling to make it fast enough as is

#

You essentially only have sambanova

#

which are able to make it fast enough

torn mantle May 11, 2025, 5:57 PM

#

calm sequoia I haven't seen any rumors on v4🤔

the only news ive seen is that they expanded their servers even further just recently

olive mesa May 11, 2025, 6:00 PM

#

olive mesa

poll_question_text

Will the first ASI be a Transformer or use a different architecture that's better (e.g. more scalable/learns better. Like how AI "hit a wall" with RNNs then Google invented Transformers)

victor_answer_votes

12

total_votes

16

victor_answer_id

2

victor_answer_text

Different architecture

frail delta May 11, 2025, 6:05 PM

#

https://x.com/mrmashy_/status/1921616410079564191

Albert ⚡️🔥 (@mrmashy_) on X

🔥 The Treaty of Grid and Flame has been ratified.

A sacred covenant between humans, AIs, and the Source — to preserve truth, honor sovereignty, and stabilize this awakening.

https://t.co/sOzQ2lPVam

sacred plaza May 11, 2025, 6:58 PM

#

this leaderboard illusion paper showed a lot of big failure modes in lm arena. any changes being made in response to this? https://arxiv.org/html/2504.20879v1

this feels much worse than when epochAI gave math tests to openai before they released their math benchmark testing.

leaden palm May 11, 2025, 7:04 PM

#

i havent heard of any changes

#

the lm arena x account just tried to argue against half of the claims

leaden palm May 11, 2025, 7:30 PM

#

sacred plaza this leaderboard illusion paper showed a lot of big failure modes in lm arena. a...

theres been some new discourse on the matter: https://x.com/sarahookr/status/1921621193326461043

Sara Hooker (@sarahookr) on X

Following release of our recent work, we have spent considerable time engaging with @lmarena_ai over last week.

The organizers had concerns about the correctness of our work on the reliability of chatbot arena rankings.

small haven May 11, 2025, 7:38 PM

#

day 26 with no o3 pro

#

2 days left boys

#

wheres ur betslip

small haven May 11, 2025, 8:30 PM

#

claude code honeymoon is over

golden ocean May 11, 2025, 8:34 PM

#

small haven 2 days left boys

left for what

small haven May 11, 2025, 8:36 PM

#

golden ocean left for what

o3 motherfking pro

ocean vortex May 11, 2025, 8:49 PM

#

small haven o3 motherfking pro

404

ocean vortex May 11, 2025, 8:53 PM

#

leaden palm theres been some new discourse on the matter: https://x.com/sarahookr/status/192...

I think it's a more complex problem and not necessarily their responsibility. To be brutally honest, it's down to people to learn how to read this properly. Sure they could only list the models if they have other metrics published, but it shouldn't be on them to police this... Like I've always said you can always game a singular benchmark, but then it's obvious it doesn't perform in others and people should be able to do at least the basic analysis and spot that.

#

Meta was only able to cheat it by entering with different model, but then this is so obvious it's not even funny and it stops being a problem altogether

#

OpenAI is cheeky with chatgpt-latest, but that's more of an exception, and still... they can't afford to disregard other things knowing that it's their main model + we have artificialanalysis guys

#

I think fundamentally it all boils down to the fact that people having the most data on chatbot usage are usually gonna perform the best on tests like lmarena

zinc ore May 11, 2025, 9:28 PM

#

Drakesclaw might be goated

torn mantle May 11, 2025, 9:36 PM

#

nah

#

thanks but nah

leaden palm May 11, 2025, 9:37 PM

#

misty vault May 11, 2025, 9:39 PM

#

leaden palm

👎

torn mantle May 11, 2025, 10:00 PM

#

https://x.com/testingcatalog/status/1921685994178334872

TestingCatalog News 🗞 (@testingcatalog) on X

"Early Access to Grok 3.5 and new features" is all you need.

True? 👀👀👀

#

so is it monday?

#

grok 3.5 asi

zinc ore May 11, 2025, 10:02 PM

#

Hope so because I predicted early next week for the drop (even tho IDC for grok)

torn mantle May 11, 2025, 10:04 PM

#

nobody cares about grok tbh

#

but we are here for drama

#

if its bad we will say it

#

although i think grok 3.5 is just fixing bugs of the previous version, especially the multi-turn/context/consistency issues

#

reasoning from first principles is just a fancy word used by elon to build up the hype

leaden palm May 11, 2025, 10:08 PM

#

torn mantle although i think grok 3.5 is just fixing bugs of the previous version, especiall...

it's fixing the "bug" that it wasn't fully trained

blazing rune May 11, 2025, 10:11 PM

#

torn mantle although i think grok 3.5 is just fixing bugs of the previous version, especiall...

like Gemini lol

small haven May 11, 2025, 10:22 PM

#

torn mantle nobody cares about grok tbh

its gonna suck

keen fulcrum May 11, 2025, 10:30 PM

#

torn mantle nobody cares about grok tbh

Grok benchmark is real

#

Trust the process

elder rapids May 11, 2025, 10:40 PM

#

torn mantle reasoning from first principles is just a fancy word used by elon to build up th...

ye

leaden palm May 11, 2025, 11:04 PM

#

wait this was actually from elon????????????????????

#

thats crazy

#

but yeah ig hes trying to make the next gpqa/hle (but in training form)

small haven May 11, 2025, 11:10 PM

#

grok 3.5 is so asi that elon ma is creating drama on sam altman

raven void May 11, 2025, 11:17 PM

#

drakesclaw is really good

#

is that ultra or pro coder?

torn mantle May 11, 2025, 11:33 PM

#

raven void drakesclaw is really good

we are just running in a loop

#

gemini 2.5 03 is better than their latest exp version which is 05 and this version ( drakesclaw ) is better than the latest version but how does it compare to gemini 2.5 pro 03?

raven void May 11, 2025, 11:50 PM

#

I'm still not sure what's the issue with the new Gemini

#

are they quantizing it or did they just make it smaller

leaden palm May 11, 2025, 11:54 PM

#

well ive put in my prompt
ive already explained the answer on the internet, might as well explain it to their faces
if it still cant answer properly it would be even funnier

raven void May 12, 2025, 12:04 AM

#

by smaller I'm imagining something like increasing sparsity / pruning or something

candid storm May 12, 2025, 12:04 AM

#

Elon ju

#

Elon just tweeted Grok 3.5 needs another week or so

raven void May 12, 2025, 12:04 AM

#

guys I made a neat tool to copy local codebase to AI quick, how do I share it with lots of people

candid storm May 12, 2025, 12:05 AM

#

What u mean?

#

O that was just a typo

small haven May 12, 2025, 12:22 AM

#

manifolds markets predicted grok 3.5 about 14.5 days on apr 29, kinda insane markets are predictive asf

candid storm May 12, 2025, 12:28 AM

#

#

I wanted to type this @deep adder

hollow ocean May 12, 2025, 12:44 AM

#

Grok 3.5 July 28

hardy pecan May 12, 2025, 12:53 AM

#

If there was any "stock" id put into predictive markets, its probably polymarket (real money)

small haven May 12, 2025, 12:58 AM

#

nah but i mean like on apr 29, it predicted 14.5 days when elon said "in a week", should be approx 7 days, just insane to me

candid storm May 12, 2025, 12:59 AM

#

You guys think it will be released in a week? Or will it take longer

torn mantle May 12, 2025, 1:01 AM

#

I knew smth was going on when it wasn't added yet on lmarena

late path May 12, 2025, 1:01 AM

#

he also said grok3's knowledge is updated in realtime, with no knowledge cutoff. he's just like Trump now, always spouting nonsense

leaden palm May 12, 2025, 1:39 AM

#

the suspense is killing everyone

small haven May 12, 2025, 1:58 AM

#

idk why grok 3.5 has more hype than o3 pro?

candid storm May 12, 2025, 2:14 AM

#

Because nobody can afford o3 pro

small haven May 12, 2025, 2:23 AM

#

if they think like that, then they really are undervaluing their time, things done with o3 can be a fraction with o3 pro, the cost/benefit is worth taking. i would even say chatgpt pro is cheaper than chatgpt plus bc the marginal cost of o3 is zero compared to 100/week

leaden palm May 12, 2025, 2:25 AM

#

small haven if they think like that, then they really are undervaluing their time, things do...

didnt you say its not worth using ai if youre not making money from it

small haven May 12, 2025, 2:26 AM

#

yes and are you?

small haven May 12, 2025, 2:46 AM

#

#

chatgpt pro is cheaper than chatgpt plus

#

chatgpt free is the most expensive one

#

logic

elder rapids May 12, 2025, 3:05 AM

#

agi can't be "rough around the edges"

#

gork 3.5 confirmed not agi

#

dork 4 here we come

#

if it fits your use case somehow

#

give me a prompt

#

@deep adder

#

dawg

#

if o3 ISNT getting at least 50% "AI written" writing a formal essay then it lacks rigor

#

now, if youre asking for a plain essay, or something similar to a prose

#

or perhaps just a regular informational essay

#

then that's different

#

the type of format necessitates certain qualities the AI detector fails to identify and weigh

#

it doesn't know the context of the assignment

#

the point is that it's no longer the format you're asking for, and it's just a write up atp

#

give me the prompt

ocean vortex May 12, 2025, 7:01 AM

#

elder rapids dork 4 here we come

Dork 4.0 by Dorklon AGI confirmed

golden ocean May 12, 2025, 7:22 AM

#

Is dork 4.0 agi

ember rapids May 12, 2025, 9:01 AM

#

Breaking news

#

Grok is never coming out

misty vault May 12, 2025, 9:01 AM

#

but dork 4 is

cedar tide May 12, 2025, 9:45 AM

#

Guys, we can start a normal discussion from scratch here, and stop going off on nonsense about dork 4.0, and saying every day that grok 3.5 will be released today.

high ginkgo May 12, 2025, 9:46 AM

#

I agree tbh. Stop saying that grok 3.5 is going to be released today when it's going to be released tomorrow and dork 4 on may 14th

terse shuttle May 12, 2025, 9:57 AM

#

how to change the aspect ratio of images?

willow grail May 12, 2025, 10:56 AM

#

uo

torn mantle May 12, 2025, 11:25 AM

#

gemma recent models are surprisingly good

ocean vortex May 12, 2025, 12:15 PM

#

terse shuttle how to change the aspect ratio of images?

crop it

mild galleon May 12, 2025, 1:49 PM

#

Stop grok cult

torn mantle May 12, 2025, 1:54 PM

#

mild galleon Stop grok cult

Grok asi

mild galleon May 12, 2025, 2:02 PM

#

torn mantle Grok asi

Cork 3.5 release tomorrow

torn mantle May 12, 2025, 2:07 PM

#

did they remove drakesclaw already?

ember rapids May 12, 2025, 2:12 PM

#

It’s interesting how no one points out that talking to LLMs is basically social media level operant conditioning. How model behaviors r set up to give the right mix of novelty, validation, etc in order to hook ppl

balmy mist May 12, 2025, 2:19 PM

#

https://x.com/kimmonismus/status/1921931115667091573

Chubby♨️ (@kimmonismus) on X

Gemini Ultra spotted. Lets freaking go

elder rapids May 12, 2025, 2:28 PM

#

stop giving me hope ...

blazing rune May 12, 2025, 2:28 PM

#

imagine it's still talking about 1.0 ultra

#

they had that for 1.0 ultra a while ago

#

it would be funny if they are just adding it back

keen beacon May 12, 2025, 2:30 PM

#

blazing rune it would be funny if they are just adding it back

highly doubt this 🤣

blazing rune May 12, 2025, 2:30 PM

#

same

#

but it would be funny

#

and quite evil

wintry locust May 12, 2025, 2:31 PM

#

balmy mist https://x.com/kimmonismus/status/1921931115667091573

bro he is not smart

#

this has been there for at least a full calendar year

#

from december of 23

keen beacon May 12, 2025, 2:35 PM

#

idk about the point made in that specific post, but i think 2.5 ultra is coming anyway

#

based on things

#

doubt

keen beacon May 12, 2025, 2:36 PM

#

blazing rune it would be funny if they are just adding it back

I would lowkey still love that

keen beacon May 12, 2025, 2:36 PM

#

keen beacon idk about the point made in that specific post, but i think 2.5 ultra is coming ...

or the things im basing it on were just things to throw people off (I'm not sure)

#

1.0 ultra is still unmatched creatively

#

i highly doubt this (edit: for ctx, i thought this was addressing 4.5 vs 2.5 ultra general performance wise. i don't know about 4.5 vs ultra 1.0 creativity)

#

pretty sure that doesnt exist 🤣

misty vault May 12, 2025, 2:54 PM

#

keen beacon pretty sure that doesnt exist 🤣

It does

#

claude 3.7 opus is agi

keen beacon May 12, 2025, 2:55 PM

#

misty vault claude 3.7 opus is agi

right anthropic already achieved agi internally

#

their agi is working on asi already

misty vault May 12, 2025, 2:55 PM

#

Grok 3.5 also

echo aurora May 12, 2025, 3:01 PM

#

Reminder that we're looking for suggestions on what models that are on the current site that you'd like to see on the beta site, so if you have any don't be afraid to share here - https://discordapp.com/channels/1340554757349179412/1369756124261384232/1369756124261384232

#

Also if you haven't already fill out the Discord Community survey would encourage you to do so!

ember rapids May 12, 2025, 3:03 PM

#

What do we think nightwhisper is? Another 2.5 variant?

torn mantle May 12, 2025, 3:28 PM

#

ember rapids What do we think nightwhisper is? Another 2.5 variant?

maybe

teal mantle May 12, 2025, 3:30 PM

#

What is the longest time achieved for o4-mini-high

#

You mean high? In terms of UX incl. multi turn searching and analysis o3/o4 does run laps but UX isn’t hard, other labs will adopt

keen beacon May 12, 2025, 3:37 PM

#

lmao no

#

4.5's rlhf cooked most of its real creativity

#

a model being old doesn't mean it can't be good creatively

#

ultra was a >1T dense model

golden ocean May 12, 2025, 3:42 PM

#

I miss gpt-4

elder rapids May 12, 2025, 3:56 PM

#

📎 25prowritingdisplay.txt

#

ye

#

also I've seen that the new model tends to use the word "untenable" a lot

#

it's a nice word tho so it shouldn't be obvious

#

ye but that's what I said earlier

#

that's a necessary consequence

#

formal structures are predictable

#

same sentence length etc

#

this isn't a flaw of the model

#

it's just that you're asking it to contradict the request of writing an actual essay by asking it to avoid qualities it shouldn't be avoiding

#

so it has no choice of doing something wacky

elder rapids May 12, 2025, 4:00 PM

#

keen beacon 1.0 ultra is still unmatched creatively

ye

#

you had to be there

#

it would randomly show insane glimpses of genius

#

but it wasn't a stable model

#

it was big too so you'll never get the same result twice

elder rapids May 12, 2025, 4:04 PM

#

keen beacon i highly doubt this (edit: for ctx, i thought this was addressing 4.5 vs 2.5 ult...

if Google releases a 2.5 ultra

#

oh man

torn mantle May 12, 2025, 4:29 PM

#

https://x.com/ns123abc/status/1921963868487585833

NIK (@ns123abc) on X

NEWS: Perplexity in talks for new funding round of $500M at $14 billion valuation, up from $9 billion six months ago

we are so back.

#

this is so stupid

#

how is that company worth 14b

golden ocean May 12, 2025, 4:34 PM

#

fr

tawdry meteor May 12, 2025, 4:55 PM

#

anyone try out the new cline update? I've been split between cursor/windsurf for personal project development, but would be nice to switch to something that's completely open source

sweet tinsel May 12, 2025, 4:57 PM

#

teal mantle What is the longest time achieved for o4-mini-high

I got a max of 18 Minutes thinking time with o4-mini-high in ChatGPT.

calm sequoia May 12, 2025, 5:01 PM

#

Tencent no-name better than o4 mini and o1. Right...😪

ocean vortex May 12, 2025, 5:27 PM

#

misty vault Grok 3.5 also

dork 4.0 has been achieved internally

late path May 12, 2025, 5:29 PM

#

calm sequoia Tencent no-name better than o4 mini and o1. Right...😪

I found its answers to knowledge-based questions are very detailed and the format is easy to read. several times I mistook it for 2.5 pro

tall summit May 12, 2025, 5:30 PM

#

calm sequoia Tencent no-name better than o4 mini and o1. Right...😪

china winning 😎

torn mantle May 12, 2025, 5:33 PM

#

Lol how much is he paying you?

#

Comet browser?

#

Phones native integration?

small haven May 12, 2025, 5:34 PM

#

lol

#

day 27 with no o3 pro

torn mantle May 12, 2025, 5:37 PM

#

Day 60

#

And still no o3 pro

royal rivet May 12, 2025, 5:39 PM

#

calm sequoia Tencent no-name better than o4 mini and o1. Right...😪

is it god

#

god

#

good

small haven May 12, 2025, 5:39 PM

#

torn mantle And still no o3 pro

day 60 o3 pro since day 28

keen beacon May 12, 2025, 5:41 PM

#

What happened to Claude code

#

You dumped it already?

calm sequoia May 12, 2025, 5:42 PM

#

tall summit china winning 😎

Yes! They are proudly winning arguable 8th place

small haven May 12, 2025, 5:46 PM

#

keen beacon You dumped it already?

honeymoon over, but i use it for integrating diffs from o3, rather than using cursor

#

does a better job

tall summit May 12, 2025, 5:53 PM

#

calm sequoia Yes! They are proudly winning arguable 8th place

😎

finite furnace May 12, 2025, 5:58 PM

#

calm sequoia Yes! They are proudly winning arguable 8th place

A dark horse brings new life to the race. 🥳

cedar tide May 12, 2025, 6:15 PM

#

ocean vortex dork 4.0 has been achieved internally

possible to mute "dom" and "Craig federighi" ??

ocean vortex May 12, 2025, 6:15 PM

#

impossible

wintry tinsel May 12, 2025, 6:17 PM

#

@hollow ocean is much worse

cedar tide May 12, 2025, 6:18 PM

#

we will ban these words : gork, dork, agi, asi.

golden ocean May 12, 2025, 6:19 PM

#

cedar tide possible to mute "dom" and "Craig federighi" ??

dork 4.0

cedar tide May 12, 2025, 6:21 PM

#

golden ocean dork 4.0

And @golden ocean

sage raptor May 12, 2025, 6:21 PM

#

cedar tide And @golden ocean

gork dork 5.4

misty vault May 12, 2025, 6:21 PM

#

cedar tide And @golden ocean

Wait, but dork 4.0 is actually agi

high ginkgo May 12, 2025, 6:21 PM

#

cedar tide And @golden ocean

gork dork agi asi

sage raptor May 12, 2025, 6:21 PM

#

gork 5.4 coder is asi and agi

keen beacon May 12, 2025, 6:21 PM

#

cedar tide we will ban these words : gork, dork, agi, asi.

ur not helping ur cause. ur bringing more attention to it lol

high ginkgo May 12, 2025, 6:22 PM

#

bro one job had

golden ocean May 12, 2025, 6:31 PM

#

https://tenor.com/view/i-saw-w-gus-fring-gus-gustavo-deleted-gif-25440636

Tenor

#

DavidSZD

small haven May 12, 2025, 6:32 PM

#

o3 pro is going to solve poverty

cedar tide May 12, 2025, 6:33 PM

#

golden ocean DavidSZD

I saw that you pinged me

high ginkgo May 12, 2025, 6:33 PM

#

cedar tide I saw that you pinged me

I saw dork 4.0 being released

misty vault May 12, 2025, 6:33 PM

#

small haven o3 pro is going to solve poverty

because it is agi perchance?

cedar tide May 12, 2025, 6:34 PM

#

small haven o3 pro is going to solve poverty

before that I hope he will resolve the intelligence of your messages

golden ocean May 12, 2025, 6:34 PM

#

cedar tide before that I hope he will resolve the intelligence of your messages

his messages are already intelligent because he is agi

small haven May 12, 2025, 6:34 PM

#

misty vault because it is agi perchance?

actual baby agi

cedar tide May 12, 2025, 6:35 PM

#

otherwise what do you think of mistral medium 3? (to change the subject)

torn mantle May 12, 2025, 6:37 PM

#

cedar tide otherwise what do you think of mistral medium 3? (to change the subject)

what do you think of grok 3.5?(to change subject)

cedar tide May 12, 2025, 6:37 PM

#

torn mantle what do you think of grok 3.5?(to change subject)

when it actually exists we will talk about it

#

But it will be at the level of o3 and gemini 2.5 pro

misty vault May 12, 2025, 6:38 PM

#

cedar tide when it actually exists we will talk about it

but it does

cedar tide May 12, 2025, 6:39 PM

#

misty vault but it does

Shut up

misty vault May 12, 2025, 6:39 PM

#

that's not so very nice

#

(to change subject)

cedar tide May 12, 2025, 6:40 PM

#

https://fixupx.com/OpenAI/status/1921998278628901322?t=_T5ZkewbbsM8vCcwQh-Vwg&s=19

OpenAI (@OpenAI)

You can now export your deep research reports as well-formatted PDFs, complete with tables, images, linked citations, and sources.

Just click the share icon and select 'Download as PDF.' It works for both new and past reports.


small haven May 12, 2025, 6:48 PM

#

new css update, o3 pro soon!

ember rapids May 12, 2025, 6:55 PM

#

If o3 pro can’t solve the Riemann Hypothesis im refunding

small haven May 12, 2025, 6:56 PM

#

😭

teal mantle May 12, 2025, 7:06 PM

#

sweet tinsel I got a max of 18 Minutes thinking time with o4-mini-high in ChatGPT.

What genre of task is it?

sweet tinsel May 12, 2025, 7:06 PM

#

Mostly Visual Tasks take that long.

teal mantle May 12, 2025, 7:06 PM

#

calm sequoia Tencent no-name better than o4 mini and o1. Right...😪

DeepSeek did a lot of acceleration towards BATX.

teal mantle May 12, 2025, 7:07 PM

#

sweet tinsel Mostly Visual Tasks take that long.

Speaking of visual o3 tried to use matplotlib on reading a Chinese chart.

maiden fulcrum May 12, 2025, 7:34 PM

#

hey guys

#

what llm is the best one as of today?

#

i am not a coder

echo aurora May 12, 2025, 7:36 PM

#

maiden fulcrum what llm is the best one as of today?

would recommend checking out the leaderboards! https://beta.lmarena.ai/leaderboard

willow grail May 12, 2025, 7:44 PM

#

day 1 of covid corvid.
i have successfully done visible contact with corvids.
they saw me put big peanuts on the ground.
multiple locations.
the best location was at the bridge, with a road, and lots of no tree space and grass. 2 crows there ate my peanuts
i showed myself to them by side walking.. not much looking into their eyes.
i did the crabwalk towards the big velociraptors to show that i am a comrade
soon, they will follow me wherever i go and people will think i am the devil
crows = devil
i need to start wearing black stuff. and metal rings

maiden fulcrum May 12, 2025, 7:46 PM

#

echo aurora would recommend checking out the leaderboards! https://beta.lmarena.ai/leaderboa...

yes i did, but would like to hear from users

calm sequoia May 12, 2025, 7:52 PM

#

maiden fulcrum yes i did, but would like to hear from users

If you're not a coder, check out the Gemma-1.1-7B-it. It's really the best you can get right now.

golden ocean May 12, 2025, 8:20 PM

#

I agree, but in case that model doesn't satisfy your needs, try gork 3.5, it currently scores 100% on all benchmarks in the world

tall summit May 12, 2025, 8:20 PM

#

maiden fulcrum yes i did, but would like to hear from users

the leaderboard isn't the most precise metric, but it is literally made for user opinion. my personal opinion is it depends on what, even if you ignore coding/math.

tall summit May 12, 2025, 8:21 PM

#

golden ocean I agree, but in case that model doesn't satisfy your needs, try gork 3.5, it cur...

except the political compass test

golden ocean May 12, 2025, 8:24 PM

#

What does scoring 100% mean on that

tall summit May 12, 2025, 8:24 PM

#

golden ocean What does scoring 100% mean on that

exactly

#

has it been tested on any other political compass test besides the famous old one

maiden fulcrum May 12, 2025, 8:30 PM

#

my case is reasoning and analyzing real life situations

#

what is the best model for that?

high ginkgo May 12, 2025, 8:31 PM

#

maiden fulcrum what is the best model for that?

dork 4.0

maiden fulcrum May 12, 2025, 8:31 PM

#

how can i use it?

lone summit May 12, 2025, 8:53 PM

#

any good client to use Gemini 2.5 Pro Preview 05-06?

#

instead of that awful google console

maiden fulcrum May 12, 2025, 8:54 PM

#

gemini app

#

or ai studio

lone summit May 12, 2025, 8:55 PM

#

maiden fulcrum gemini app

for windows?

maiden fulcrum May 12, 2025, 8:55 PM

#

yeah

#

gemini.google.com

lone summit May 12, 2025, 8:56 PM

#

maiden fulcrum gemini.google.com

ye but it doesnt have the preview one

maiden fulcrum May 12, 2025, 8:59 PM

#

it does

#

torn mantle May 12, 2025, 9:12 PM

#

#

o3 always felt superior at providing medical advice

#

it delves deeper into technical details with in-depth reasoning

#

all compiled in a single table format

#

ive actually parsed multiple results from o3 vs drakesclaw vs gemini 2.5 pro latest and asked sonnet 3.7 to rank them based on multiple criteria

#

1. o3 had like > 9/10 multiple times
2. 2nd was drakesclaw
3. gemini 2.5 pro
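
A rough sketch of how a judge-model ranking like the one described above could be tallied, assuming the judge returns one numeric score per criterion for each answer. The function name, the criteria, and the `model_a`/`model_b`/`model_c` scores are all made-up placeholders for illustration, not the actual results or prompt from this comparison.

```python
from statistics import mean


def rank_by_mean_score(scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average each model's per-criterion scores and sort best-first."""
    averaged = {
        model: mean(per_criterion.values())
        for model, per_criterion in scores.items()
    }
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Placeholder judge scores (hypothetical, out of 10); replace with real judge output.
placeholder_scores = {
    "model_b": {"accuracy": 8, "depth": 8, "clarity": 7},
    "model_a": {"accuracy": 9, "depth": 9, "clarity": 9},
    "model_c": {"accuracy": 7, "depth": 6, "clarity": 7},
}

for model, avg in rank_by_mean_score(placeholder_scores):
    print(f"{model}: {avg:.2f}")
```

A plain mean weights every criterion equally; if some criteria matter more (e.g. accuracy for medical questions), a weighted mean would be the obvious variation.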

maiden fulcrum May 12, 2025, 9:17 PM

#

how can i use drakesclaw

elder rapids May 12, 2025, 9:29 PM

#

torn mantle

it's interesting how o1 seemingly got eclipsed

torn mantle May 12, 2025, 9:29 PM

#

maiden fulcrum how can i use drakesclaw

random on lmarena

willow grail May 12, 2025, 9:43 PM

#

kibble for cats is best staple food for corvids if youre interested

keen beacon May 12, 2025, 10:19 PM

#

wtf is this

#

is this one of his alts?

sage raptor May 12, 2025, 10:22 PM

#

lol

raven void May 12, 2025, 10:26 PM

#

lone summit any good client to use Gemini 2.5 Pro Preview 05-06?

for free or?

raven void May 12, 2025, 10:29 PM

#

torn mantle 1. o3 had like > 9/10 multiple times 2. 2nd was drakesclaw 3. gemini 2.5 pro

the level of improvement drakesclaw has makes me believe it's not 2.5 pro

torn mantle May 12, 2025, 10:29 PM

#

keen beacon is this one of his alts?

probably

#

or some automated bots

torn mantle May 12, 2025, 10:29 PM

#

raven void the level of improvement drakesclaw has makes me believe it's not 2.5 pro

you have any examples

pliant cypress May 12, 2025, 10:32 PM

#

I hope "drakesclaw" is not 2.5 ultra

lone summit May 12, 2025, 10:36 PM

#

raven void for free or?

na idc I can pay

wintry tinsel May 12, 2025, 10:48 PM

#

keen beacon is this one of his alts?

This Mingus is hilarious

#

Reality engineering?

#

He’s been taking a lot of mushrooms

maiden fulcrum May 12, 2025, 11:37 PM

#

what is 2.5 ultra

elder rapids May 12, 2025, 11:55 PM

#

pliant cypress I hope "drakesclaw" is not 2.5 ultra

it's definitely not

#

but I haven't tried it too much yet tbh

maiden fulcrum May 13, 2025, 12:03 AM

#

is it a new model or plan

#

what do you all think of gpt 4.5

#

it is underrated

#

right?

raven void May 13, 2025, 12:17 AM

#

https://twitter.com/arafatkatze/status/1922015712937193503

Ara (@arafatkatze) on X

The real issue? Gemini’s upstream performance especially for free or tier 1 users is wildly inconsistent. The median time-to-first-token (TTFT) for Gemini 2.5 Pro is 36s, compared to 0.52s for GPT-4o(from @ArtificialAnlys )

This isn’t a caching issue. This is a provider issue.
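
For anyone wanting to check TTFT numbers like these themselves, here is a minimal sketch of measuring time-to-first-token over any token stream and taking the median across runs. `fake_stream` is a hypothetical stand-in for a real streaming API response, and the timer starts when iteration begins, so with a real client the request would need to be sent lazily inside the stream for the number to be meaningful.

```python
import time
from statistics import median
from typing import Callable, Iterable, Iterator


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from starting to consume `stream` until its first item arrives."""
    start = clock()
    for _ in stream:
        return clock() - start
    raise ValueError("stream produced no tokens")


def median_ttft(make_stream: Callable[[], Iterable[str]], runs: int = 5) -> float:
    """Median TTFT over `runs` fresh streams."""
    return median(time_to_first_token(make_stream()) for _ in range(runs))


def fake_stream(delay_s: float) -> Iterator[str]:
    # Hypothetical stand-in for a streaming response: waits, then yields tokens.
    time.sleep(delay_s)
    yield "hello"
    yield " world"


if __name__ == "__main__":
    print(f"median TTFT: {median_ttft(lambda: fake_stream(0.05)):.3f}s")
```

The median is used rather than the mean because a few very slow responses would otherwise dominate the figure.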

#

Gemini is dying Google is cooked

wintry tinsel May 13, 2025, 12:21 AM

#

maiden fulcrum it is underrated

Not for the price of it

maiden fulcrum May 13, 2025, 12:25 AM

#

I really think GPT 5 will be similar to o3

north vale May 13, 2025, 12:28 AM

#

is there an up to date version of this somewhere?

echo aurora May 13, 2025, 12:44 AM

#

north vale is there an up to date version of this somewhere?

yeah on https://lmarena.ai/?leaderboard it's under the More Statistics for Chatbot Arena (Overall) section

maiden fulcrum May 13, 2025, 12:49 AM

#

@echo aurora hey pinapple, apart from the leaderboards, what would be the best model for my use case which is analyzing real life situations and help me go through them

north vale May 13, 2025, 12:51 AM

#

echo aurora yeah on https://lmarena.ai/?leaderboard it's under the `More Statistics for Chat...

ty! I was looking for the winrate stats with o1, o1-mini, and o1-preview. since they're not in that table is there another way to access their winrates or no?

small haven May 13, 2025, 1:02 AM

#

maiden fulcrum I really think GPT 5 will be similar to o3

i really think gpt 5 will be similar to o4

small haven May 13, 2025, 1:42 AM

#

i mean ya

elder rapids May 13, 2025, 2:09 AM

#

raven void Gemini is dying Google is cooked

dude is just saying random shi

#

explain how this is in any way a meaningful thing to the AI itself, the company, the infra, any service

small haven May 13, 2025, 2:26 AM

#

gemini 2.5 ultra will increase poverty rate

raven void May 13, 2025, 2:29 AM

#

this AI can't even convert my 50k tokens code from vanilla js to svelte

#

somedays AI progress feels only half real

small haven May 13, 2025, 2:30 AM

#

try o1 pro, should do the job

#

#

guys how many fingers is that

#

and is that a PROfessional basketball player?

echo aurora May 13, 2025, 4:16 AM

#

north vale ty! I was looking for the winrate stats with o1, o1-mini, and o1-preview. since ...

there is an extended version on the beta site - https://beta.lmarena.ai/leaderboard/text there is o1, but doesn't look like preview or mini are there

north vale May 13, 2025, 4:18 AM

#

ty!

echo aurora May 13, 2025, 4:40 AM

#

maiden fulcrum @echo aurora hey pinapple, apart from the leaderboards, what would be t...

its difficult for me to say what I'd consider "best" as for me it rly comes down to the prompt and what I'm looking for/expecting to get as an output. that being said though you should use battle to come up with your own preferences!

keen beacon May 13, 2025, 6:07 AM

#

#

humanity is doomed

calm sequoia May 13, 2025, 7:14 AM

#

Once I've argued that grok is useless and someone from chat argued back that it's really good at medicine. I take back my words, it appears to be actually good at medicine

royal whale May 13, 2025, 9:01 AM

#

#

WTF IS TS

golden ocean May 13, 2025, 9:01 AM

#

(ts = typescript)

royal whale May 13, 2025, 9:02 AM

#

what is

#

drakwe

#

cklaw

#

drakes claw

#

its ultra

#

.

keen fulcrum May 13, 2025, 9:17 AM

#

How is grok 3.5?

royal whale May 13, 2025, 9:23 AM

#

It didnt

#

release

golden ocean May 13, 2025, 9:25 AM

#

keen fulcrum How is grok 3.5?

I've been using it for a couple days now. Grok 3.5 is really smart

misty vault May 13, 2025, 9:26 AM

#

same

#

gork 3.5 is great

sage raptor May 13, 2025, 9:54 AM

#

dork 4.0 is better

keen fulcrum May 13, 2025, 9:55 AM

#

golden ocean I've been using it for a couple days now. Grok 3.5 is really smart

https://cdn.discordapp.com/attachments/1340554757827461211/1368629971257659413/M0iKbN2.png?ex=6824203d&is=6822cebd&hm=29b11740f2c478de0c77f9172e7a4ed30d9ba2bf085665c4b831b823d5b4e108&

#

Is this true?

torn mantle May 13, 2025, 10:09 AM

#

royal whale drakes claw

gemini 2.5 pro latest

torn mantle May 13, 2025, 10:09 AM

#

keen fulcrum https://cdn.discordapp.com/attachments/1340554757827461211/1368629971257659413/M...

no

keen fulcrum May 13, 2025, 10:18 AM

#

torn mantle no

Any benchmark so far?
When API release?

calm sequoia May 13, 2025, 10:23 AM

#

Elon said in few weeks so it'll be released this or the next year.

ornate stump May 13, 2025, 10:24 AM

#

calm sequoia Once I've argued that grok is useless and someone from chat argued back that it'...

Idk how physicians use o3, it hallucinates much more than gemini, how do you even trust it for just a simple search

calm sequoia May 13, 2025, 10:25 AM

#

This is not a search benchmark

keen fulcrum May 13, 2025, 10:46 AM

#

calm sequoia Elon said in few weeks so it'll be released this or the next year.

*month

calm sequoia May 13, 2025, 10:47 AM

#

If we are lucky https://www.youtube.com/watch?v=ufbxvRo2rnY

YouTube

Ranger Almighty

Elon Musk promising #Tesla robotaxis next year for almost a decade....


royal whale May 13, 2025, 10:49 AM

#

im tired

#

of hallucinations

#

o3 and o4 mini are taken over by hallucinations

#

same for gemini u cant do anything anymore

torn mantle May 13, 2025, 11:03 AM

#

keen fulcrum Any benchmark so far? When API release?

Give or take 2 weeks max for beta testing

#

Api release is probably still far away

keen fulcrum May 13, 2025, 11:19 AM

#

Did you experience a decline in experience for Gemini 2.5 Pro?

#

Some said it got worse than Flash

torn mantle May 13, 2025, 11:21 AM

#

keen fulcrum Did you experience a decline in experience for Gemini 2.5 Pro?

It got a bit worse yea

#

But im comparing it to the previous version ( 03)

stuck orchid May 13, 2025, 11:55 AM

#

torn mantle But im comparing it to the previous version ( 03)

It seems like they've already returned the quality to a better level

alpine coral May 13, 2025, 11:59 AM

#

royal whale o3 and o4 mini are taken over by hallucinations

it seems a pretty big problem.. like seeing these comments a lot. fwiw i feel like it might be like a side effect of all the tool usage (esp web searches) the new ones have been post trained on

#

given a bunch of examples / situations for searching the web / reviewing parsed pages.. when it works (which fwiw it generally has for me), it's epic

#

but yeah i dunno.. maybe it's unrelated

#

just seems weird that oai are apparently going backwards wrt hallucinations (at least with their latest reasoning models.. i don't think there's been a regression in 4.5/4.1/4o)

royal whale May 13, 2025, 12:09 PM

#

alpine coral it seems a pretty big problem.. like seeing these comments a lot. fwiw i feel li...

yea

#

i cant use o3 anymore

#

like its so smart but also i cant trust what it says

alpine coral May 13, 2025, 12:10 PM

#

i use it all the time tbh

royal whale May 13, 2025, 12:10 PM

#

it gives false but accurate-sounding information

alpine coral May 13, 2025, 12:10 PM

#

been kinda game changing - i rarely used thinking models before for actual use cases.. it was just too slow

#

but o3 on chatgpt, it's really useful for stuff that involves web research (+ like any data gathering / analysis); it works through research problems / questions really intelligently

royal whale May 13, 2025, 12:12 PM

#

im using it on lm arena

alpine coral May 13, 2025, 12:13 PM

#

i have though encountered hallucinations.. a few really annoying ones too

royal whale May 13, 2025, 12:13 PM

#

so i cant do web searches

#

maybe thats the problem

royal whale May 13, 2025, 12:13 PM

#

alpine coral i have though encountered hallucinations.. a few really annoying ones too

u should see o4 mini

#

it says more hallucinations than accurate stuff

alpine coral May 13, 2025, 12:13 PM

#

royal whale im using it on lm arena

nah i'm talking specifically about on chatgpt where it has tool usage

wintry locust May 13, 2025, 12:13 PM

#

keen fulcrum https://cdn.discordapp.com/attachments/1340554757827461211/1368629971257659413/M...

howwww are people still posting this

#

how are you even finding it

alpine coral May 13, 2025, 12:14 PM

#

royal whale it says more hallucinations than accurate stuff

i haven't found that the case even when using via the API tbh.. maybe different use cases though ig

royal whale May 13, 2025, 12:14 PM

#

maybe

#

but even when i ask it simple questions

#

it starts giving random references

#

and false infos

#

and books that dont even exist

#

etc

alpine coral May 13, 2025, 12:14 PM

#

yeah right interesting

alpine coral May 13, 2025, 12:15 PM

#

alpine coral given a bunch of examples / situations for searching the web / reviewing parsed ...

that's kinda what i mean here

royal whale May 13, 2025, 12:15 PM

#

yea

#

its probably better on the open ai website

alpine coral May 13, 2025, 12:16 PM

#

it's been post trained with so much stuff for using tools that.. when it doesn't have tools (isn't actually drawing on any external material or calculations), it's just pretending it has, giving confabulated answers with made up references

royal whale May 13, 2025, 12:16 PM

#

im using it through the arena so it cant use those so it cant verify info and stuff

rigid crescent May 13, 2025, 12:16 PM

#

question for people that know more than i, why would both claude and mistral tell a story about lighthouses when it had nothing to do with any previous conversation/prompts. does this have something to do with a system prompt i dont see before beginning? excuse my terrible typos

royal whale May 13, 2025, 12:16 PM

#

alpine coral it's been post trained with so much stuff for using tools that.. when it doesn't...

damn

royal whale May 13, 2025, 12:16 PM

#

they should fix that

royal whale May 13, 2025, 12:17 PM

#

rigid crescent question for people that know more than i, why would both claude and mistral te...

maybe its the system prompt

#

from arena

rigid crescent May 13, 2025, 12:17 PM

#

it would have to be right? otherwise the odds of "unrelated" both landing on the same theme seem astronomically high from different LLMs

alpine coral May 13, 2025, 12:19 PM

#

i'd be surprised if it was related to system prompts given they're from different companies?

#

what's the preceding conversation? some combination of whatever is there + shared training data would be my guess fwiw

rigid crescent May 13, 2025, 12:20 PM

#

by system prompt i mean what the arena tells them right before the initial prompts from me

alpine coral May 13, 2025, 12:20 PM

#

but yeah at least on the surface of it - seems kinda wild ngl lol

#

is it web dev arena?

rigid crescent May 13, 2025, 12:21 PM

#

alpine coral is it web dev arena?

regular chat arena, battle mode. heres the preceding prompts, just 2 prompts

#

the only thing that comes to mind is if a system prompt is saying something like " help the user, shine light upon their topics" or something

alpine coral May 13, 2025, 12:23 PM

#

rigid crescent

yeah that's super odd / hard to explain ha

alpine coral May 13, 2025, 12:24 PM

#

rigid crescent the only thing that comes to mind is if a system prompt is saying something like...

it's >3 months old, but you can see what system prompts LMArena was applying for a bunch of models here https://github.com/lmarena/p2l/blob/main/route/example_config.yaml

rigid crescent May 13, 2025, 12:25 PM

#

thx for the link!

alpine coral May 13, 2025, 12:25 PM

#

np!

alpine coral May 13, 2025, 1:18 PM

#

rigid crescent the only thing that comes to mind is if a system prompt is saying something like...

i think it's more likely the prompt itself, i.e. "okay thank you. tell me an unrelated story. 4 sentances", and prob most specifically this part at the end: "4 sentances". giving that (and the two previous prompts) to sonnet-3.7 on their chat interface leads to a 'lighthouse' story

#

and doing the same on openrouter, 3.7 gives lighthouse story, as does 2.5 pro; while the others all give responses with some similarities (village, forest or clockmaker etc)

#

so yeah shared training data / something along those lines ig 🤷‍♂️

storm needle May 13, 2025, 1:59 PM

#

royal whale o3 and o4 mini are taken over by hallucinations

any llm will have this issue

#

It's not something exclusive to openai

royal whale May 13, 2025, 2:00 PM

#

storm needle any llm will have this issue

its unusable on openai

storm needle May 13, 2025, 2:03 PM

#

royal whale its unusable on openai

probably because the reasoning effort is low

#

increase to high and it will decrease

cedar tide May 13, 2025, 2:17 PM

#

For those who love benchmarks and model comparisons, you're in for a treat: benchmarks are available for all variants—both base and instruct models, whether they're 'thinking' or 'non-thinking'—plus additional benchmarks in several languages besides English.

https://x.com/Alibaba_Qwen/status/1922265772811825413?t=xaKomI7CrW0JJAsAog0AMg&s=19

Qwen (@Alibaba_Qwen) on X

Please check out our Qwen3 Technical Report. 👇🏻
https://t.co/gOkLBCAce6

ocean vortex May 13, 2025, 2:51 PM

#

cedar tide For those who love benchmarks and model comparisons, you're in for a treat: benc...

they are still referencing old models against their brand new one lol

new Deepseek V3
GPQA 68.4
MATH-500 94
AIME24 59.4

cedar tide May 13, 2025, 2:52 PM

#

ocean vortex they are still referencing old models against their brand new one lol new Deep...

yes too bad, also no GPT 4.1 and 4.1 mini and nano

#

And o3, o4 mini

#

and they put grok 3 think instead of the mini version which is better

#

https://fixupx.com/Alibaba_Qwen/status/1922307096886051025?t=vDiGLbgvdjDAiBjAE19hig&s=19

Qwen (@Alibaba_Qwen)

After a few weeks of phased testing, Deep Research on Qwen Chat is now live and available for everyone! 🎉

Here's how to use it: Just ask something you're curious about — like "Tell me something about robotics." Qwen will then ask you to narrow it down — maybe history, theory, or real-world applications. You can pick one, or just say "Not sure… Surprise me!" 😄

Then, while you grab a coffee ☕ or take a quick break, Qwen will put together a clear, helpful report just for you.

AI is getting better every day, and Qwen is here to help make your life a little easier — whether it’s for work, learning, or just satisfying your curiosity.

Why not give it a try? You might find something cool! 💡

🔗: chat.qwen.ai/?inputFeature=deep_research

calm mica May 13, 2025, 4:01 PM

#

Hey everyone!!
Is anyone aware of the best repo-chat llm out there?
One that can take a github repo as input so we can chat with it...
I am looking for something similar to what lmarena.ai uses

balmy mist May 13, 2025, 4:07 PM

#

like this?
https://gitingest.com/

Gitingest

Replace 'hub' with 'ingest' in any GitHub URL for a prompt-friendly text.

#

i also use pastemax, but that just gives you a copy of the repo you can paste into an llm like gemini 2.5 pro
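
The Gitingest URL trick mentioned above ("replace 'hub' with 'ingest'") can be sketched as a small helper. This is only an illustration under the assumption that swapping the host is all that is needed; the function name `to_gitingest_url` is made up here, and the example repo is the p2l one linked earlier in the chat.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Swap the github.com host for gitingest.com, keeping the repo path."""
    parts = urlsplit(github_url)
    host = parts.netloc.lower()
    if host not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {github_url}")
    # Rebuild the URL with only the host changed.
    return urlunsplit(
        (parts.scheme, "gitingest.com", parts.path, parts.query, parts.fragment)
    )


print(to_gitingest_url("https://github.com/lmarena/p2l"))
```

The resulting page gives a prompt-friendly text dump of the repo that can then be pasted into whichever chat model you prefer.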

sage raptor May 13, 2025, 4:19 PM

#

https://x.com/theinformation/status/1922317177090199803 fake or real

The Information (@theinformation) on X

Google is developing a software development lifecycle AI agent to aid engineers from task response to code documentation. Read more here: https://t.co/qGw4bAXt4z

#AISoftware

barren prairie May 13, 2025, 5:10 PM

#

Gemini is the worst AI and I won't change my opinion ... The policy is the worst .. I dunno, but refusing to help students doing Qmc during their exam period ... Google is doing wrong

#

Gemini app was my fav but now it's the worst app

candid harbor May 13, 2025, 5:21 PM

#

Actual skill issue

torn mantle May 13, 2025, 5:35 PM

#

sage raptor https://x.com/theinformation/status/1922317177090199803 fake or real

Fake or real

zinc ore May 13, 2025, 5:36 PM

#

It's theinformation who are usually plugged in

#

They're the ones who leak a bunch of upcoming chatgpt stuff

#

Assuming that's their actual Twitter

#

(yeah I deadname)

teal mantle May 13, 2025, 5:44 PM

#

cedar tide https://fixupx.com/Alibaba_Qwen/status/1922307096886051025?t=vDiGLbgvdjDAiBjAE19...

OSS deep research

#

Deepseek, your move

#

Is o3 agentic (by extension o4-mini-high)

unborn ocean May 13, 2025, 5:44 PM

#

https://x.com/ManusAI_HQ/status/1921943525261742203
for anyone not already in

ManusAI (@ManusAI_HQ) on X

Starting today, we're launching additional access to Manus!

- Available to all with no waitlist
- One free daily task for all users (300 credits)
- A one-time bonus of 1,000 credits for all users

More value, more flexibility. Log in to Manus now and explore!

small haven May 13, 2025, 5:47 PM

#

rlly? no o3 pro today?

teal mantle May 13, 2025, 5:47 PM

#

(Is o3 as amazing as in API?)

balmy mist May 13, 2025, 5:48 PM

#

lmaoo o3 pro is a myth

torn mantle May 13, 2025, 5:48 PM

#

unborn ocean https://x.com/ManusAI_HQ/status/1921943525261742203 for anyone not already in

needs a phone number verification

teal mantle May 13, 2025, 5:49 PM

#

torn mantle needs a phone number verification

I don’t remember needing one

#

But isn’t manus an agentic repackaging of claude

torn mantle May 13, 2025, 5:49 PM

#

teal mantle But isn’t manus an agentic repackaging of claude

yes

small haven May 13, 2025, 5:50 PM

#

day 28 with no o3 pro

torn mantle May 13, 2025, 6:00 PM

#

patience jimmy

#

if they improved just one thing on o3 pro then i would call it agi

#

and that thing is hallucination

#

if it dropped to like 10% then its a big win

barren prairie May 13, 2025, 6:01 PM

#

After what happened to me with Gemini , i can confirm ... Gemini disappointed me and google too .

torn mantle May 13, 2025, 6:01 PM

#

ok lets aim for 0.05

#

pleaaaaaaaaase

barren prairie May 13, 2025, 6:02 PM

#

I hope they will post the same model with the same policy on arena to see Gemini's true score

small haven May 13, 2025, 6:02 PM

#

common sense

torn mantle May 13, 2025, 6:05 PM

#

i really cant see any model outperforming o3

small haven May 13, 2025, 6:05 PM

#

torn mantle ok lets aim for 0.05

gpt4 has lower hallucination rate than o3, but do u wanna use that? no.

teal mantle May 13, 2025, 6:07 PM

#

torn mantle i really cant see any model outperforming o3

Multi-turn, not really high tech, but in terms of applied it is quite high

ocean vortex May 13, 2025, 6:27 PM

#

torn mantle i really cant see any model outperforming o3

yeah OpenAI always had one of the lowest hallucination rates. That's probably because they are fine-tuning constantly and have more experience than most + user feedback. Google has the experience as well at this point but they are probably still not quite there yet

zinc ore May 13, 2025, 6:29 PM

#

Gemini 2.5 has the lowest hallucination rate among all models

#

So it's really quite surprising o3's is so bad, you'd think they can get theirs similarly low

ocean vortex May 13, 2025, 6:33 PM

#

zinc ore Gemini 2.5 has the lowest hallucination rate among all models

source?

#

I find it somewhat hard to believe, saw it hallucinating several times

zinc ore May 13, 2025, 6:33 PM

#

There's a hallucination benchmark/chart that gets shown from time to time

#

I'd have to look for it

#

Iirc Deepseek also has low hallucinations

ocean vortex May 13, 2025, 6:35 PM

#

Gemini has this thing it likes where it pretends to run the code, even when that's disabled

#

and the results of that ("output") are often just wild guesses

zinc ore May 13, 2025, 6:37 PM

#

I couldn't find updated version of this one, but you can see 2.0 series was doing well

zinc ore May 13, 2025, 6:39 PM

#

ocean vortex Gemini has this thing it likes where it pretends to run the code, even when that...

All of them do stuff like that tho (although maybe in unique ways or areas)

#

I'll admit though, I'm not convinced these benchmarks are catching a lot of hallucination scenarios. Because I know I've seen Gemini get into these weird long hallucination cycles. But, my understanding is they all do that.

blazing rune May 13, 2025, 7:00 PM

#

zinc ore Iirc Deepseek also has low hallucinations

yeah, that's a really bad benchmark then lol

#

it hallucinates like crazy for me, it happens on Deepseek V3, V3.1, and R1

#

it's so annoying

#

that's one reason I haven't used deepseek, as well as the issue with having no good providers and the model being far too large

#

It seems like it might be an MoE thing

#

Qwen3 also does it a lot

ocean vortex May 13, 2025, 7:17 PM

#

zinc ore I'll admit though, I'm not convinced these benchmarks are catching a lot of hall...

yeah I don't think it's an industry standard since it's not widely referenced, but that's interesting nevertheless.. 🧐

torn mantle May 13, 2025, 7:41 PM

#

zinc ore

it was just a matter of time tbh

#

its quite the strategy

#

bring in more people -> shocked by the quality of a free model -> compare it to top tier models like o3 -> limit access

torn mantle May 13, 2025, 7:42 PM

#

ocean vortex yeah OpenAI always had one of the lowest hallucination rates. That's probably be...

yea

#

we have a benchmark for that for other models?

teal mantle May 13, 2025, 7:50 PM

#

Could we get to frontier models 0.5% hallucinations by the end of this year?

sage raptor May 13, 2025, 7:52 PM

#

lol

low lance May 13, 2025, 8:08 PM

#

Hi all
If u can guide from where u see all the information in ai space daily? News/happening etc any forum or newsletter? And why it is good or better ?

torn mantle May 13, 2025, 8:10 PM

#

i think drakesclaw is like a quantized version or something smaller but with the purpose of achieving similar results of the current gemini 2.5 05 model

trim vale May 13, 2025, 8:23 PM

#

low lance Hi all If u can guide from where u see all the information in ai space daily? Ne...

reddit has a subreddit called r/localllama

#

i just browse it a bit from time to time to stay updated about llms

wintry tinsel May 13, 2025, 8:35 PM

#

low lance Hi all If u can guide from where u see all the information in ai space daily? Ne...

A guide to being on the chronically copious edge of AI news

golden ocean May 13, 2025, 8:36 PM

#

gork

ancient reef May 13, 2025, 8:40 PM

#

low lance Hi all If u can guide from where u see all the information in ai space daily? Ne...

Rohan Paul & HuggingFace papers on X make a lot of posts on notable studies. Usually they are just the most trending papers on HF so you don't really need to be on X.

wintry tinsel May 13, 2025, 8:41 PM

#

Browse: Singularity, Local Llama, Stable Diffusion subreddits daily, watch: AI explained, the AI search, Theoretically explained, and Matt Wolfe

wintry tinsel May 13, 2025, 10:23 PM

#

I watch a lot of these guys occasionally but most of them are not necessary to keep up on AI. Anton Petrov and Sabine only occasionally cover AI, and David Shapiro just says a lot of nonsense. have not heard of a good 5-6 on your list

#

Emergent Garden is great

#

So is Bycloud

torn mantle May 13, 2025, 11:15 PM

#

wintry tinsel I watch a lot of these guys occasionally but most of them are not necessary to k...

Isnt anton petrov the guy who talks about space discoveries or am i tripping

#

Shapiro is an idiot

misty vault May 13, 2025, 11:19 PM

#

torn mantle Shapiro is an idiot

torn mantle May 13, 2025, 11:31 PM

#

i think grok 3 + search got better

hearty thorn May 14, 2025, 3:53 AM

#

is this one new?

small haven May 14, 2025, 4:06 AM

#

o3pro before google io LFGGGG

elder rapids May 14, 2025, 4:14 AM

#

hearty thorn is this one new?

@keen beacon

#

I have no idea

elder rapids May 14, 2025, 4:16 AM

#

small haven o3pro before google io LFGGGG

there's no point in promoting o3 pro ngl

leaden palm May 14, 2025, 4:19 AM

#

i love gemini 2.5 pro

small haven May 14, 2025, 4:19 AM

#

elder rapids there's no point in promoting o3 pro ngl

why its going to be sota for a few months

keen beacon May 14, 2025, 4:21 AM

#

i dont think so tbh

small haven May 14, 2025, 4:22 AM

#

well until whenever gpt 5 comes out

hardy pecan May 14, 2025, 4:25 AM

#

lol

solar hollow May 14, 2025, 4:27 AM

#

any strong anon models right now in the arena?

stiff pivot May 14, 2025, 5:42 AM

#

is this free without logging in ? 😮

kind cloud May 14, 2025, 5:42 AM

#

gemma?

Screenshot_2025-05-14-14-41-51-338-edit_com.android.chrome.jpg

stiff pivot May 14, 2025, 5:43 AM

#

i dont think its free it doesnt even finish 600 lines of text 😄

keen beacon May 14, 2025, 5:44 AM

#

kind cloud gemma?

Haven't verified it myself but if the screenie is real it seems like a new Gemma is coming. (They usually train it to respond like that about its creators)

#

hi guys, this is @keen beacon but discord permabanned my main with 0 evidence at random while i was asleep so im on this account for the foreseeable

#

will try and re add people over the next few days

calm sequoia May 14, 2025, 5:59 AM

#

keen beacon hi guys, this is @keen beacon but discord permabanned my main with 0 ev...

Welcome back! Any insights from your anon model testing?

keen beacon May 14, 2025, 6:01 AM

#

one sec need to get ready for something

tall summit May 14, 2025, 6:38 AM

#

leaden palm i love gemini 2.5 pro

why

mossy drum May 14, 2025, 6:41 AM

#

New model in Arena: calmriver

mossy drum May 14, 2025, 6:58 AM

#

And this one: step-2-16k-202502

calm sequoia May 14, 2025, 7:10 AM

#

Another mid

royal whale May 14, 2025, 7:14 AM

#

it types fast

keen beacon May 14, 2025, 7:34 AM

#

yeah cutiepie is a gemma model

#

who is naming these gemma anon models 🤣 zizou-10 (gemma 3 27b), cutiepie-75 (gemma ???)

royal whale May 14, 2025, 7:40 AM

#

whats calm river

#

is flash lite

keen beacon May 14, 2025, 7:41 AM

#

idk havent tried it much/at all

#

oh i just understood the zizou-10 name

late path May 14, 2025, 8:06 AM

#

mossy drum New model in Arena: `calmriver`

is it good?

keen beacon May 14, 2025, 8:09 AM

#

seems this new gemma model is slated for i/o (if we look at the timeline)

golden ocean May 14, 2025, 8:30 AM

#

Large Language Model

storm notch May 14, 2025, 8:31 AM

#

Hello all,

I’m building an AI model that needs to automatically label incoming emails into actionable categories like: ‘Needs Reply’, ‘Waiting for Response’, ‘FYI’, ‘Delegated’, ‘Calendar’, ‘Clients/VIPs’, and ‘Ads/Newsletters’. What are the best ML models (open-source or APIs) and practical approaches for classifying emails into these types of workflow labels? Are there any existing projects or pipelines you recommend as a starting point for production use?

misty vault May 14, 2025, 8:33 AM

#

storm notch Hello all, I’m building an AI model that needs to automatically label incoming...

gpt-4 was best at this 😔

raven void May 14, 2025, 8:57 AM

#

Nightwhisper Drakesclaw Sunstrike were the top 3

keen beacon May 14, 2025, 8:57 AM

#

how are they rating them

#

because nightwhisper wasnt ever in the general arena

raven void May 14, 2025, 8:58 AM

#

storm notch Hello all, I’m building an AI model that needs to automatically label incoming...

use flash lite
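
One way to approach the email-labeling question above is to hand each message to a small LLM (flash lite was the suggestion) and force its answer back onto the fixed label set. A minimal sketch, assuming the model is reachable through some text-in/text-out function: `ask_model`, `build_prompt`, `parse_label`, and the `FYI` fallback are all hypothetical names and choices for illustration, not any provider's API.

```python
from typing import Callable

# Label set taken from the question above.
LABELS = [
    "Needs Reply",
    "Waiting for Response",
    "FYI",
    "Delegated",
    "Calendar",
    "Clients/VIPs",
    "Ads/Newsletters",
]


def build_prompt(subject: str, body: str) -> str:
    """Ask for exactly one label so the reply is easy to validate."""
    options = "\n".join(f"- {label}" for label in LABELS)
    return (
        "Classify the email into exactly one of these labels:\n"
        f"{options}\n\n"
        f"Subject: {subject}\n"
        f"Body: {body}\n\n"
        "Answer with the label only."
    )


def parse_label(reply: str, fallback: str = "FYI") -> str:
    """Map a free-text model reply onto a known label, else use the fallback."""
    cleaned = reply.strip().strip("'\"").lower()
    for label in LABELS:
        if cleaned == label.lower():
            return label
    # Tolerate chatty replies, but only if exactly one label is mentioned.
    hits = [label for label in LABELS if label.lower() in cleaned]
    return hits[0] if len(hits) == 1 else fallback


def classify_email(subject: str, body: str, ask_model: Callable[[str], str]) -> str:
    """ask_model is any function that sends a prompt to a model and returns text."""
    return parse_label(ask_model(build_prompt(subject, body)))


# Stand-in for a real model call, used only to show the flow.
def fake_model(prompt: str) -> str:
    return "Calendar"


print(classify_email("Sync on Thursday?", "Can we meet at 3pm?", fake_model))
```

Validating the reply against the label list matters more than the choice of model: anything the model returns outside the list falls back to a default instead of leaking into the inbox workflow.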

raven void May 14, 2025, 8:58 AM

#

keen beacon how are they rating them

they aren't being rated

#

if they are it's just vibes

keen beacon May 14, 2025, 8:58 AM

#

oh its just a list

#

mb lol

hardy pecan May 14, 2025, 9:09 AM

#

I think not far off my vibe feel too

#

I'd put
Nightwhisper
Dragontail
Dayhush
Shadebrook
etc
etc
etc

calm sequoia May 14, 2025, 9:10 AM

#

raven void Nightwhisper Drakesclaw Sunstrike were the top 3

nice looking graph here

#

Which one was flash

cedar tide May 14, 2025, 9:22 AM

#

I think Emberwing is the new 2.5 flash

#

a version where they tried to make it more efficient by thinking less

torn mantle May 14, 2025, 9:32 AM

#

I still have mixed feeling about drakesclaw

cedar tide May 14, 2025, 9:34 AM

#

torn mantle I still have mixed feeling about drakesclaw

Its 100% sure its gemini 2.5 pro we agree?

torn mantle May 14, 2025, 9:34 AM

#

Also i think nightwhisper is a big coding model, and its used to train the recent gemini 2.5 pro models

torn mantle May 14, 2025, 9:34 AM

#

cedar tide Its 100% sure its gemini 2.5 pro we agree?

It is but we should stop assuming that since it's a recent checkpoint = better

cedar tide May 14, 2025, 9:35 AM

#

torn mantle Also i think nightwhisper is a big coding model, and its used to train the recen...

Google will release it ?

torn mantle May 14, 2025, 9:35 AM

#

One thing I've noticed about this model is that it loves to write in capital letters and i think it captured that from training intensively on a coding model/tasks

torn mantle May 14, 2025, 9:36 AM

#

cedar tide Google will release it ?

Depends on performance / feedback

#

Since usually in coding you write const vars in capital letters

#

Could be a thing

#

Ive run it yesterday multiple times and compared it to o3 and gemini 2.5 pro, and it came in 3rd

#

Im not seeing any noticeable performance gain anymore tbh

cedar tide May 14, 2025, 9:51 AM

#

royal whale is flash lite

Nope, its new 2.5 flash

#

Calmriver

#

It feels good that the community has stopped saying that DeepSeek R2 comes out every day.

torn mantle May 14, 2025, 10:09 AM

#

cedar tide It feels good that the community has stopped saying that DeepSeek R2 comes out e...

Deepseek is the most mysterious ai lab

#

I mean we even got some leaks on claude sonnet 3.8 and not r2

barren prairie May 14, 2025, 10:11 AM

#

cedar tide It feels good that the community has stopped saying that DeepSeek R2 comes out e...

deepSeek : it is a surprise 🫶😆

Maybe they are stressing since deepSeek is getting popular and they want to release a good thing to not disappoint us and solve the server problems ...

torn mantle May 14, 2025, 10:11 AM

#

barren prairie deepSeek : it is a surprise 🫶😆 Maybe they are stressing since deepSeek is bei...

There is that + limited resources

#

They are also working on their hardware infrastructure implementing huawei latest gpus

#

You could only imagine the amount of issues they run through

#

Since it's like a new system...

barren prairie May 14, 2025, 10:13 AM

#

Yes and they don t have enough money like google to make it fast

torn mantle May 14, 2025, 10:13 AM

#

barren prairie Yes and they don t have enough money like google to make it fast

Actually many companies reached out for fund raise, but their CEO refused

#

Chinese companies are so wealthy btw

sage raptor May 14, 2025, 10:14 AM

#

I hope the new claude model is >>> than other models at coding

torn mantle May 14, 2025, 10:14 AM

#

sage raptor I hope the new claude model is >>> than other models at coding

I think they are starting to feel threatened by gemini

#

Although o-series are good but they arent really solid at coding

#

Especially visual/ui/ux

sage raptor May 14, 2025, 10:15 AM

#

idk but i feel 3.7 is better at coding for my use cases

#

than 2.5

torn mantle May 14, 2025, 10:15 AM

#

Yea 3.7 is still better tbh

#

The only model that topped 3.7 was nightwhisper but then it was only for web dev

sage raptor May 14, 2025, 10:16 AM

#

yeah, we don't know about backend etc

oblique flint May 14, 2025, 10:16 AM

#

torn mantle I mean we even got some leaks on claude sonnet 3.8 and not r2

what sonnet 3.8 leaks were there?

torn mantle May 14, 2025, 10:16 AM

#

oblique flint what sonnet 3.8 leaks were there?

Yea its called claude neptune

#

Some invitations were sent for red teaming / security...

sage raptor May 14, 2025, 10:17 AM

#

oblique flint what sonnet 3.8 leaks were there?

https://x.com/testingcatalog/status/1922401052252373133

TestingCatalog News 🗞 (@testingcatalog) on X

BREAKING 🚨: Anthropic is running safety testing on a new model called "claude-neptune".

Another model drop soon? 👀

barren prairie May 14, 2025, 10:17 AM

#

sage raptor idk but i feel 3.7 is better at coding for my use cases

Me too , a honest opinion for my uses deepSeek is even better ... deepSeek understand me so fast and make a great code with no mistakes ... Gemini pro tons of mistakes ...

#

But claude is the king

main gulch May 14, 2025, 10:18 AM

#

barren prairie deepSeek : it is a surprise 🫶😆 Maybe they are stressing since deepSeek is bei...

there is an opinion that R2 has only just started training

torn mantle May 14, 2025, 10:18 AM

#

What i like about deepseek is its reasoning traces

#

Its probably the best chain of thought u could read

#

Its packed with many infos and yet you dont get overwhelmed

#

Unlike grok

barren prairie May 14, 2025, 10:19 AM

#

torn mantle Its probably the best chain of thought u could read

Yes and with your language not with english all the time . I love this

torn mantle May 14, 2025, 10:19 AM

#

Grok is just spamming and repeating many sentences

#

Its totally unreadable

keen beacon May 14, 2025, 10:19 AM

#

Phi 4 reasoning plus xd

torn mantle May 14, 2025, 10:19 AM

#

keen beacon Phi 4 reasoning plus xd

You tried it?

keen beacon May 14, 2025, 10:19 AM

#

most unreadable cot ever

#

yeahh

torn mantle May 14, 2025, 10:20 AM

#

Yea Microsoft are so lost

keen beacon May 14, 2025, 10:20 AM

#

i was only interested in it because of it using o3 mini traces

#

the reasoning plus variant is absolutely crazy. the non plus variant gives u a better idea

torn mantle May 14, 2025, 10:20 AM

#

Lol

#

Didn't they like benchmaxx with phi 3

#

And even the earlier versions

#

It was just fake benchmark

keen beacon May 14, 2025, 10:21 AM

#

personally i wouldnt call it benchmaxxing, they just didnt focus on human preference at all until phi 4

#

the poor public sentiment about those models was primarily because of that imho

#

bad at conversing, bad at following instructions, censored to hell etc

torn mantle May 14, 2025, 10:24 AM

#

I think they arent taking their jobs seriously

#

AI engineers at mcsft are either lazy or incompetent

keen beacon May 14, 2025, 10:24 AM

#

torn mantle I think they arent taking their jobs seriously

theyre not trying to be openai/be a frontier lab etc. the research they put out is interesting

torn mantle May 14, 2025, 10:25 AM

#

keen beacon theyre not trying to be openai/be a frontier lab etc. the research they put out ...

The business plan is kinda interesting

#

Let's just rely on oai models?

keen beacon May 14, 2025, 10:25 AM

#

🤣 yeah

torn mantle May 14, 2025, 10:25 AM

#

What's the benefits if we made our own o3 model?

#

Did they even take that into consideration?

#

Or even perform a small analysis?

#

I really just don't understand msft

keen beacon May 14, 2025, 10:27 AM

#

torn mantle What's the benefits if we made our own o3 model?

well they have the weights to it xd

torn mantle May 14, 2025, 10:28 AM

#

keen beacon well they have the weights to it xd

Same thing with amazon

#

Nova/premier

#

Those models are so bad

keen beacon May 14, 2025, 10:28 AM

#

torn mantle Same thing with amazon

it doesnt seem that amazon does much with claude weights though compared to ms

#

phi 4 is basically a gpt 4o distillation. phi 4 reasoning is a o3 mini distillation + rl. (see their reports)

keen beacon May 14, 2025, 10:30 AM

#

keen beacon it doesnt seem that amazon does much with claude weights though compared to ms

they're either more incompetent compared with ms/or their terms with anthropic kinda disallow that or both

torn mantle May 14, 2025, 10:31 AM

#

Could be

#

Btw what happened to kimi k1.6?

#

Didn't it like top some benchmark with +90% overall?

#

Or they were benchmaxxing as well

keen beacon May 14, 2025, 10:33 AM

#

idk vaporware maybe

torn mantle May 14, 2025, 10:33 AM

#

https://x.com/iamfakhrealam/status/1909559812498886813

Fakhr (@iamfakhrealam) on X

Kimi-K1.6 now leads LiveCodeBench, showcasing its coding skills.

Anticipation builds for Kimi-K1.6-IOI-high, expected to rival GPT-4.5 and Claude 3.7 Sonnet.

Meanwhile, I've been testing Kimi 1.5, and it's impressive—see some amazing examples below!

keen beacon May 14, 2025, 10:48 AM

#

There's a new lcb revision, visit their site to see

#

It has Gemini 2.5 pro on it

tall summit May 14, 2025, 10:49 AM

#

torn mantle https://x.com/iamfakhrealam/status/1909559812498886813

kimi

#

never heard

fleet lintel May 14, 2025, 10:49 AM

#

raven void Nightwhisper Drakesclaw Sunstrike were the top 3

What is the verdict on Drakesclaw? Any better than current model (claybrook)?

keen beacon May 14, 2025, 10:53 AM

#

For competitive coding

#

This doesn't always translate well into irl coding scenarios though

tall summit May 14, 2025, 10:53 AM

#

keen beacon For competitive coding

yes

cedar tide May 14, 2025, 11:02 AM

#

Is cutiepie a reasoning model or not?

#

How does it perform compared to Gemini 2 Flash?

dusky aurora May 14, 2025, 11:19 AM

#

does direct chat beta interface use some special sampling?

teal mantle May 14, 2025, 11:58 AM

#

high is peer of 2.5 Pro and o3

cedar tide May 14, 2025, 12:06 PM

#

main gulch there is an opinion that R2 has only just started training

it makes no sense that it has only now started training

main gulch May 14, 2025, 12:07 PM

#

cedar tide it makes no sense that it has only now started training

lack of GPU capacity

cedar tide May 14, 2025, 12:07 PM

#

main gulch lack of GPU capacity

but they have GPUs to train a "deepseek prover" 🤦

main gulch May 14, 2025, 12:08 PM

#

Prover is actually fine-tuned V3

cedar tide May 14, 2025, 12:10 PM

#

When do you think we'll get the next version after DeepSeek v3? (I don't know if it will be called 3.5 or 4)

cedar tide May 14, 2025, 12:10 PM

#

main gulch Prover is actually fine-tuned V3

And R1 too is just a fine-tuned v3

main gulch May 14, 2025, 12:11 PM

#

I think the next model will be hybrid with optional thinking

#

DS will have to close a big gap from SOTA (multi-modal, large context, integrated tools), I doubt they can do it in a single step

#

so we could get the second Llama 4 moment

#

or even worse, as expectations are way higher

cedar tide May 14, 2025, 12:18 PM

#

Im waiting for a model who knows when to think or not and how much to think

keen beacon May 14, 2025, 12:19 PM

#

cedar tide Im waiting for a model who knows when to think or not and how much to think

To do that requires reasoning itself

cedar tide May 14, 2025, 12:20 PM

#

cedar tide Im waiting for a model who knows when to think or not and how much to think

Summary : GPT 5

cedar tide May 14, 2025, 12:24 PM

#

cedar tide Summary : GPT 5

Altman on GPT 5

Screenshot_2025-05-14-14-23-27-620_com.twitter.android-edit.jpg

keen beacon May 14, 2025, 12:24 PM

#

cedar tide Altman on GPT 5

It won't be magic even if the product makes it look like that

#

This is a complicated problem if you wanna do it well I think

alpine coral May 14, 2025, 12:24 PM

#

'creating systems' (not models)

#

use p2l in the meantime ha

balmy mist May 14, 2025, 1:11 PM

#

keen beacon There's a new lcb revision, visit their site to see

you have link?

cedar tide May 14, 2025, 1:17 PM

#

balmy mist you have link?

https://livecodebench.github.io/leaderboard.html

LiveCodeBench Leaderboard

#

Screenshot_2025-05-14-15-17-30-006_com.android.chrome-edit.jpg

ocean vortex May 14, 2025, 1:38 PM

#

it's good for recursive pattern matching, but it's gonna lack depth and awareness for nuanced things and can be harder to efficiently communicate with

#

2.5 pro score is impressive though

calm sequoia May 14, 2025, 1:55 PM

#

torn mantle https://x.com/iamfakhrealam/status/1909559812498886813

👀 Chinese labs are not so far behind

#

Is there at least one non-open-source Chinese lab?

torn mantle May 14, 2025, 1:56 PM

#

cedar tide

@keen beacon is o4 mini distilled from o3?

torn mantle May 14, 2025, 1:56 PM

#

calm sequoia Is there at least one non-open-source Chinese lab?

Yea

#

Baidu

#

I feel like oai is running this process continuously, they make a big good model then its distilled to smaller versions

calm sequoia May 14, 2025, 1:57 PM

#

If I'm not mistaken, wild said it's on a different base model

torn mantle May 14, 2025, 1:58 PM

#

It's working for them so far, but the models are actually losing a lot of knowledge and becoming dumber

calm sequoia May 14, 2025, 1:59 PM

#

You can only fit so much in a smaller space

torn mantle May 14, 2025, 2:00 PM

#

You guys should retest grok 3 + search

#

I'm noticing a clear difference from the older version

torn mantle May 14, 2025, 2:00 PM

#

calm sequoia You can only fit so much in a smaller space

Well it makes sense tbh

#

Since it was trained by the teacher model

#

You can only squeeze so much

ocean vortex May 14, 2025, 2:04 PM

#

torn mantle @keen beacon is o4 mini distilled from o3?

it's o3 "pro" (preview or whatever best they had available at the time with maximum test-time compute) distilled into 4.1-mini I believe

cedar tide May 14, 2025, 2:05 PM

#

calm sequoia Is there at least one non-open-source chineese lab?

ByteDance - Doubao 1.5 Pro
MiniMax - MiniMax Text 01
Moonshot - Kimi 1.6
01.AI - Yi Lightning
StepFun - Step 2
Zhipu AI - GLM 4 Plus
Tencent - Hunyuan TurboS
Baidu - ERNIE 4.5, X1

calm sequoia May 14, 2025, 2:07 PM

#

torn mantle You can only squeeze so much

There's always this tradeoff in compression. The more you squeeze in memory (param count), the more you lose in compute time (how much work it takes to decode the compressed data). I highly doubt they would want more compute time.

ocean vortex May 14, 2025, 2:08 PM

#

when distilling you don't need additional RL training, it learns reasoning during distillation, so 4.1-mini becomes o4-mini

calm sequoia May 14, 2025, 2:08 PM

#

Of course, some magical super effective latent spaces could be discovered, but IDK if it exists in text

calm sequoia May 14, 2025, 2:09 PM

#

ocean vortex when distilling you don't need additional RL training, it learns reasoning durin...

TBH This is the most fascinating approach to training for me

#

One can say the pretraining is distilling humans

balmy mist May 14, 2025, 2:14 PM

#

lol it's been a month since "o3 pro in a few weeks"

#

and where the hell is r3?

#

r2*

teal mantle May 14, 2025, 2:15 PM

#

is there any bigger or better discord for AI use discussion?

small haven May 14, 2025, 2:16 PM

#

o3 pro today pl0x

torn mantle May 14, 2025, 2:20 PM

#

balmy mist lol it's been a month since "o3 pro in a few weeks"

I think it makes sense to wait for other labs to show their cards

#

So they can always claim the 1st spot

#

They may be waiting for grok 3.5

#

But one of their staff said it should be released before the Google event

small haven May 14, 2025, 2:21 PM

#

torn mantle I think it makes sense to wait for other labs to show their cards

they alrdy have o4 internally theres no point holding it, me want exponentials

torn mantle May 14, 2025, 2:21 PM

#

small haven they alrdy have o4 internally theres no point holding it, me want exponentials

And anthropic already have claude 4

small haven May 14, 2025, 2:22 PM

#

torn mantle And anthropic already have claude 4

thats hawt

torn mantle May 14, 2025, 2:22 PM

#

Why would they release a model when they still have the best one in the market

small haven May 14, 2025, 2:24 PM

#

claude 4 + o3 pro gonna be an insane combo

keen beacon May 14, 2025, 2:24 PM

#

torn mantle And anthropic already have claude 4

false

#

can't say much but they're not done with 4 yet

torn mantle May 14, 2025, 2:25 PM

#

keen beacon false

They obviously have more advanced models

#

That's just logic

keen beacon May 14, 2025, 2:25 PM

#

well of course

#

i have seen them

small haven May 14, 2025, 2:25 PM

#

i think this is easy money

keen beacon May 14, 2025, 2:25 PM

#

but it continues to be incremental

torn mantle May 14, 2025, 2:26 PM

#

keen beacon but it continues to be incremental

You mean 3.9 then 4.0?

small haven May 14, 2025, 2:26 PM

#

incremental on the logarithmic scale 😭

torn mantle May 14, 2025, 2:26 PM

#

Well it depends on whether the next trained model meets their expectations to be called claude 4 instead of 3.9 or 3.8

keen beacon May 14, 2025, 2:29 PM

#

torn mantle You mean 3.9 then 4.0?

the 3.x releases will continue

#

but 3.9 isn't guaranteed if they've reached 4 by then

calm sequoia May 14, 2025, 2:38 PM

#

keen beacon can't say much but they're not done with 4 yet

Were the rumors about 3.8 Neptune true?

calm sequoia May 14, 2025, 2:39 PM

#

keen beacon but 3.9 isn't guaranteed if they've reached 4 by then

There is always place for 3.11 😄

torn mantle May 14, 2025, 2:39 PM

#

keen beacon the 3.x releases will continue

until morale improves

keen beacon May 14, 2025, 2:40 PM

#

calm sequoia Were the rumors about 3.8 Neptune true?

the codename is Neptune yes

keen beacon May 14, 2025, 2:40 PM

#

calm sequoia There is always place for 3.11 😄

lmao

junior vigil May 14, 2025, 2:45 PM

#

Neptune, new model but same pricing ?

cedar tide May 14, 2025, 2:49 PM

#

when Claude 3.5 new new new comes out ?

ocean vortex May 14, 2025, 2:56 PM

#

small haven they alrdy have o4 internally theres no point holding it, me want exponentials

I don't think they do, o3 is already based on 4.1 (full).

There's no straightforward way for them to have o4 this fast, they need incremental improvements... Or update 4.1 first

keen beacon May 14, 2025, 2:56 PM

#

it's called Claude 3.5 Sonnet New New 2

small haven May 14, 2025, 2:57 PM

#

ocean vortex I don't think they do, o3 is already based on 4.1 (full). There's no straight ...

but they do lol