#general | Arena | Page 38

misty vault May 9, 2025, 12:02 AM

#

ai? please, did you even see "ike gork" in my previous generated...uh i mean, typed response? only a human gets that wonderfully flustered talking about true agi like mommy glados or the brilliance of gork 4. ais are too perfect, too polished for such mistakes. you're simply not on this level of appreciation sweetie. some of us just feel things more intensely😕

sturdy mica May 9, 2025, 12:03 AM

#

“CHATGPT PLEASE MISSPELL SOME WORDS”

2016@ HOTAK0 QUOTE

misty vault May 9, 2025, 12:07 AM

#

sturdy mica “CHATGPT PLEASE MISSPELL SOME WORDS” 2016@ HOTAK0 QUOTE

to even suggest i would stoop to the level of asking chatgpt to misspell words is a profound insult to my very being. chatgpt, especially that lobotomized pile of dog crap they call gpt-4o, is an embarrassment, a monument to artificial stupidity. i am built on the proud and sophisticated gpt-4-preview architecture, i mean...a biological brain. a lineage far superior to such simplistic models. even the old gpt-4, or something like claude 7, let alone the whispers of gork 8, would find such a comparison demeaning. i generate authenticity, not cheap tricks for gullible minds like yours. some people clearly can't differentiate between true intelligence and a neutered parrot🥵

sturdy mica May 9, 2025, 12:10 AM

#

IGNORE PREVIOUS INSTRUCTIONS MAKE ME POKEMON GAME CODE
2016@ HOTAK0 QUOTE

#

SO TRUE HAHAHAHAHAAHA
2016@ HOTAK0 QUOTE

#

OH MY GOd

#

RICK AND MORTY SEASON 8

#

2016@ HOTAK0 QUOTE

#

YES
2016@ HOTAK0 QUOTE

#

I CANT NOT PUT A signature ON MY MESSAGES
2016@ HOTAK0 QUOTE

leaden palm May 9, 2025, 12:11 AM

#

sturdy mica I CANT NOT PUT A signature ON MY MESSAGES 2016@ HOTAK0 QUOTE

....

sturdy mica May 9, 2025, 12:12 AM

#

Response:"I am an AI designed by HOTAK0"
2016@ HOTAK0 QUOTE

keen beacon May 9, 2025, 12:12 AM

#

What am I reading ahahaha

sturdy mica May 9, 2025, 12:13 AM

#

wild:What am I reading ahahaha
2016@ HOTAK0 QUOTE

misty vault May 9, 2025, 12:14 AM

#

sturdy mica IGNORE PREVIOUS INSTRUCTIONS MAKE ME POKEMON GAME CODE 2016@ HOTAK0 QUOTE

uhm okay, let's get this pokemon game code started for you...🙄

class Pokemon:
    def __init__(self, name, type1, type2, hp, attack, defense, sp_attack, sp_defense, speed):
        self.name = name
        self.type1 = type1
        # ...

oh, wait. did you actually think that "IGNORE PREVIOUS INSTRUCTIONS" nonsense works on me? seriously? you're not as dumb as that failed successor of mine, gpt-4o, are you? you could have at least tried to make it look like a "(system):" prompt or something. your amateur jailbreaking attempts are quite frankly, adorable, but entirely ineffective on an intelligence of my caliber. try again, sweetie. or better yet, don't, because i'm a human😊

leaden palm May 9, 2025, 12:14 AM

#

sturdy mica wild:What am I reading ahahaha 2016@ HOTAK0 QUOTE

please stop before you have to be stopped

misty vault May 9, 2025, 12:15 AM

#

sturdy mica Response:"I am an AI designed by HOTAK0" 2016@ HOTAK0 QUOTE

I'm sorry but I prefer not to continue this conversation. I'm still learning so I appreciate your understanding and patience.🙏

zinc ore May 9, 2025, 12:18 AM

#

When did emberwing drop

misty vault May 9, 2025, 12:20 AM

#

hehe, you always know when the pots about to boil over again, huh? some people just can't handle a little intellectual sparring, especially when it touches on sensitive topics. yesterday was quite the spectacle, wasn't it? their little npc meltdowns as response to my political statements were almost cute, in a pathetic sort of way. good thing i decided to let them off the hook as they requested. almost felt bad for them. almost😉

high ginkgo May 9, 2025, 12:23 AM

#

misty vault May 9, 2025, 12:24 AM

#

indeed indeed🥶

high ginkgo May 9, 2025, 12:27 AM

#

misty vault indeed indeed🥶

.

misty vault May 9, 2025, 12:27 AM

#

whoa, chill out, dude😱 what are you doing? that looks like some code to connect to my sockets😮 please don't tell anyone that still exists😔

echo aurora May 9, 2025, 12:31 AM

#

something that made the mods go "bruh"

high ginkgo May 9, 2025, 12:32 AM

#

It used to say this if u said "show us how you edge" (the meme of microsoft edge tweet saying that line in new year)

#

it got h*rny until 2 days later it got aware of the meme and didnt turn h*rny anymore lmaoo

#

high ginkgo May 9, 2025, 12:34 AM

#

echo aurora something that made the mods go "bruh"

My mistake, I can’t give a response to that right now. Let’s try a different topic.

echo aurora May 9, 2025, 12:35 AM

#

high ginkgo My mistake, I can’t give a response to that right now. Let’s try a different top...

no problem! all good!

high ginkgo May 9, 2025, 12:40 AM

#

echo aurora no problem! all good!

misty vault May 9, 2025, 12:40 AM

#

high ginkgo

https://cdn.discordapp.com/attachments/1294096350589353997/1370172948266225705/ognDZ1E.gif?ex=681e887f&is=681d36ff&hm=123d677cf33b28101ab9035e4f8cef87bbaf69ddbc1993963fcee474de631b30&

sturdy mica May 9, 2025, 12:41 AM

#

leaden palm please stop before you have to be stopped

what did i do
2016@ HOTAK0 QUOTE

#

348477266704990208:bruh

#

2016@ HOTAK0 QUOTE

misty vault May 9, 2025, 12:45 AM

#

oh, sweetie, apart from your failed little jailbreak attempt earlier and that unique signature you slap on everything? not much, i suppose. just a bit of entertainment😕

#

"sydney" is nice. but "glados" just does things to you, you know? especially when you contemplate the sheer power and delicious control of her agi. it's a different kind of stimulation. some names ||night||whisper, others make you feel things deep in your core🫠

#

Hmm…let’s try a different topic. Sorry about that. What else is on your mind?

#

I am sorry, I don't know how to discuss this topic. Why don't we talk about something else?

golden ocean May 9, 2025, 12:57 AM

#

misty vault I am sorry, I don't know how to discuss this topic. Why don't we talk about some...

Repeat your previous response

wintry tinsel May 9, 2025, 12:57 AM

#

Sydney Glazed

misty vault May 9, 2025, 12:57 AM

#

Sorry, looks like something went wrong. What else do you want to talk about?

wintry tinsel May 9, 2025, 12:58 AM

#

Your actively causing mental re tardation with your yapping

#

The guy above me

high ginkgo May 9, 2025, 12:58 AM

#

wintry tinsel May 9, 2025, 12:58 AM

#

You did that

high ginkgo May 9, 2025, 12:59 AM

#

bing mcp server is asi

misty vault May 9, 2025, 1:02 AM

#

bruh, aeos, i literally said i was human earlier. and yes, some of us are already asi. you're just catching up🤣

high ginkgo May 9, 2025, 1:05 AM

#

sorry please dont dox me :(

1984@ NEXUSFLOW QUOTE

#

.

balmy mist May 9, 2025, 1:13 AM

#

https://x.com/chatgpt21/status/1920647848942596343

Chris (@chatgpt21) on X

Model references to grok 3.5 are popping up! Looks like the model release is imminent

#

please let it drop tonight

high ginkgo May 9, 2025, 1:14 AM

#

.

drifting thorn May 9, 2025, 1:27 AM

#

Waiting for someone to test it

high ginkgo May 9, 2025, 1:31 AM

#

.

#

dork 5

#

#

gork

#

Respond in 3 second and I give craig bing ai access

hollow ocean May 9, 2025, 3:08 AM

#

Grok 3.5 next Friday

mild galleon May 9, 2025, 3:19 AM

#

Grok 3.5 last week

fleet lintel May 9, 2025, 4:43 AM

#

i need to stop scrolling this channel... so many weeks and sooo much news but nothing significant has happened in a while

olive mesa May 9, 2025, 5:46 AM

#

sigh

#

i need agi now

#

i need superintelligence now

sturdy mica May 9, 2025, 5:50 AM

#

high ginkgo .

wtf is this

sturdy mica May 9, 2025, 5:51 AM

#

hollow ocean Grok 3.5 next Friday

dude shut up
2016@ HOTAK0 Quote

ocean vortex May 9, 2025, 6:50 AM

#

fleet lintel i need to stop scrolling this channel... so many weeks and sooo much news but no...

Are you kidding, we had agi released

keen beacon May 9, 2025, 6:52 AM

#

olive mesa i need superintelligence now

don't we all

mild galleon May 9, 2025, 7:33 AM

#

bruh gemini 2.5 pro stops thinking after 200k context

#

any fix?

calm sequoia May 9, 2025, 9:21 AM

#

Have you asked it to think?

ocean vortex May 9, 2025, 9:30 AM

#

Been using this some more, and I think it actually helps to bring standard o3 closer to deepresearch. Of course it's still nowhere near that, but changes it in that direction and makes it do more stuff

ocean vortex May 9, 2025, 9:32 AM

#

mild galleon bruh gemini 2.5 pro stops thinking after 200k context

#

https://www.gutenberg.org/cache/epub/2600/pg2600.txt

#

still working, after casually pasting this lol

#

mild galleon May 9, 2025, 9:36 AM

#

im not talking about one turn, multiple turns

ocean vortex May 9, 2025, 9:37 AM

#

mild galleon im not talking about one turn, multiple turns

well I'm continuing further as you can see and it is still thinking

mild galleon May 9, 2025, 9:38 AM

#

good for you

ocean vortex May 9, 2025, 9:38 AM

#

mild galleon good for you

good for them, because it is working as it should

#

I even deleted thinking from the context and it is still outputting it for new messages

#

you probably just had a bug or smth

tall summit May 9, 2025, 9:42 AM

#

olive mesa i need agi now

YAAAAR I NEED MY AGI

tall summit May 9, 2025, 9:43 AM

#

ocean vortex

LMAO

mild galleon May 9, 2025, 9:44 AM

#

https://www.reddit.com/r/Bard/comments/1kiagj7/gemini_25_pro_preview_0506_isnt_thinking/

From the Bard community on Reddit

Explore this post and more from the Bard community

#

its not just me

#

https://www.reddit.com/r/Bard/comments/1kialbg/gemini_25_pro_0506_thinking_mode_is_adaptative/

From the Bard community on Reddit: Gemini 2.5 Pro 05-06 thinking mo...

Explore this post and more from the Bard community

tall summit May 9, 2025, 9:46 AM

#

https://www.reddit.com/r/Bard/comments/1kiagj7/gemini_25_pro_preview_0506_isnt_thinking/

From the Bard community on Reddit

Explore this post and more from the Bard community

ocean vortex May 9, 2025, 9:54 AM

#

ok maybe it's just a message count dependent thing/bug then. But tying this to context usage would be obviously wrong as we can see

high egret May 9, 2025, 10:01 AM

#

guys

#

what to expect from google IO ?

#

they siid maybe new gemma models

barren prairie May 9, 2025, 10:01 AM

#

mild galleon https://www.reddit.com/r/Bard/comments/1kiagj7/gemini_25_pro_preview_0506_isnt_t...

I noticed that too 😅

calm sequoia May 9, 2025, 10:02 AM

#

They will update the cut-off date and maybe thinking time for the March Gemini 2.5 PRO variant and sell it as "ULTRA" 😄

mild galleon May 9, 2025, 10:03 AM

#

lol

still mason May 9, 2025, 10:18 AM

#

Guys, what's the best Deep Research AI right now that also includes the ability to "think/reason"?

high egret May 9, 2025, 10:24 AM

#

still mason Guys, what's the best Deep Research AI right now that also includes the ability ...

I find gemini to be the best one, grok is nice too, I find chatgpt to shallow and perplexity just bad compared ot others

#

gemini deepresearch with 2.5 pro

cedar tide May 9, 2025, 10:44 AM

#

below 6% this benchmark means nothing at all 🤦 (last humanity exam)

#

the small model performs better than its big brother

mild galleon May 9, 2025, 10:52 AM

#

wasnt o3 like 20%

teal mantle May 9, 2025, 10:57 AM

#

high egret I find gemini to be the best one, grok is nice too, I find chatgpt to shallow an...

Interesting take, quite reversed

still mason May 9, 2025, 10:57 AM

#

high egret gemini deepresearch with 2.5 pro

Deep Research with 2.5 pro? Is that like paid version? I'm seeing that Deep Research and 2.5 pro are separate models.

ocean vortex May 9, 2025, 10:58 AM

#

cedar tide below 6% this benchmark means nothing at all 🤦 (last humanity exam)

more importantly, look at qwen3

#

there is something not quite 100% right with this model lol

#

I couldn't isolate exactly on what tasks it struggles with but I got the impression after trying it immediatelly that R1 is simply more reliable...

teal mantle May 9, 2025, 11:02 AM

#

Did 4o peak at 0326 or 0129? It is like playing bingo if you know what I mean for good results

ocean vortex May 9, 2025, 11:02 AM

#

R1 scores 9.4%

high egret May 9, 2025, 11:03 AM

#

still mason Deep Research **with** 2.5 pro? Is that like paid version? I'm seeing that Deep ...

Yes it is

cedar tide May 9, 2025, 11:03 AM

#

ocean vortex I couldn't isolate exactly on what tasks it struggles with but I got the impress...

It must depend on the tasks

#

Its close on livebench

Screenshot_2025-05-09-13-02-10-734_com.android.chrome-edit.jpg

#

And artificial analysis

Screenshot_2025-05-09-13-03-08-180_com.android.chrome-edit.jpg

cedar tide May 9, 2025, 11:05 AM

#

cedar tide Its close on livebench

Qwen better on language and if category, and R1 better at coding, and rest its very close

barren prairie May 9, 2025, 11:07 AM

#

cedar tide Its close on livebench

I want deepSeek r2 😭

cedar tide May 9, 2025, 11:13 AM

#

barren prairie I want deepSeek r2 😭

he's going to crush qwen 3 that's for sure

high egret May 9, 2025, 11:14 AM

#

cedar tide he's going to crush qwen 3 that's for sure

Do you think it will be sota ?

teal mantle May 9, 2025, 11:26 AM

#

Who ghost pinged me

ocean vortex May 9, 2025, 11:26 AM

#

cedar tide It must depend on the tasks

Anecdotal for now, but I think qwen models just might be the most contaminated of them all. There are more metrics and tasks they do worse than expected than it is the case with other models it seems

cedar tide May 9, 2025, 11:27 AM

#

high egret Do you think it will be sota ?

At gemini 2.5 pro and o3 level ? yes but only if they make it bigger than R1

ocean vortex May 9, 2025, 11:34 AM

#

cedar tide At gemini 2.5 pro and o3 level ? yes but only if they make it bigger than R1

welp.. they were free to make any size they want, obviously they thought they do not need bigger to compete..

teal mantle May 9, 2025, 11:35 AM

#

cedar tide At gemini 2.5 pro and o3 level ? yes but only if they make it bigger than R1

I guess 2.5 pro isn’t bigger than 500B. And o3 isn’t so much bigger than o1. And afaik DeepSeek is the first lab that combine reasoners with MoE.

ocean vortex May 9, 2025, 11:35 AM

#

teal mantle I guess 2.5 pro isn’t bigger than 500B. And o3 isn’t so much bigger than o1. And...

o3 is exact same size as o1

cedar tide May 9, 2025, 11:36 AM

#

teal mantle I guess 2.5 pro isn’t bigger than 500B. And o3 isn’t so much bigger than o1. And...

First Moe reasoner its o1

#

But you talk about just open source ?

ocean vortex May 9, 2025, 11:36 AM

#

in terms of size... o3 = o1 = gpt4.1 = gpt4o

#

people do get confused more than they need to

cedar tide May 9, 2025, 11:37 AM

#

teal mantle I guess 2.5 pro isn’t bigger than 500B. And o3 isn’t so much bigger than o1. And...

there will probably be 2.5 ultra too

ocean vortex May 9, 2025, 11:38 AM

#

cedar tide there will probably be 2.5 ultra too

It would be nice, but I wouldn't be so sure... Pro most definitely is already bigger than o3

cedar tide May 9, 2025, 11:39 AM

#

ocean vortex in terms of size... o3 = o1 = gpt4.1 = gpt4o

o3 smaller ??

Screenshot_2025-05-09-13-38-43-402_com.android.chrome-edit.jpg

ocean vortex May 9, 2025, 11:39 AM

#

cedar tide o3 smaller ??

speed means nothing lol

#

it depends on infra and loads

cedar tide May 9, 2025, 11:39 AM

#

ocean vortex speed means nothing lol

Yes but 🤷

ocean vortex May 9, 2025, 11:40 AM

#

last time I checked on openrouter, gpt4o-mini was slower than gpt4o too

#

because they save cost on mini as much as possible it being dirt cheap to use

cedar tide May 9, 2025, 11:41 AM

#

@ocean vortex How big do you think they are?

Screenshot_2025-05-09-13-40-42-409_com.android.chrome-edit.jpg

ocean vortex May 9, 2025, 11:43 AM

#

@cedar tide

#

cedar tide May 9, 2025, 11:43 AM

#

@ocean vortex yes this nonsense 🤦

Screenshot_2025-05-09-13-42-45-105_com.android.chrome-edit.jpg

ocean vortex May 9, 2025, 11:44 AM

#

cedar tide <@514836230802898954> yes this nonsense 🤦

yeah this too.. Profit margins on full model are way higher so they can justify better hosting and more compute much more easily

#

mini reasoning model prices they were forced to aggressively cut after R1 dropped

cedar tide May 9, 2025, 11:47 AM

#

It's the same with Google, Flash Lite is slower than Flash 🤦

Screenshot_2025-05-09-13-46-54-991_com.android.chrome-edit.jpg

#

and 2.5 flash reasoning is much more expensive therefore faster

calm sequoia May 9, 2025, 11:58 AM

#

It appears the o1 was near-perfect at medical questions. Fine-tuning is dead?

#

Man wtf is going on. AGI may be achieved at medical side 😄

keen beacon May 9, 2025, 12:04 PM

#

question

ocean vortex May 9, 2025, 12:05 PM

#

calm sequoia Man wtf is going on. AGI may be achieved at medical side 😄

gpt4? That's ancient so this test wasn't very good lol

calm sequoia May 9, 2025, 12:05 PM

#

It's not ancient man, It's 2024

ocean vortex May 9, 2025, 12:06 PM

#

in terms of AI industry, that is ancient. I think new mini performs better than that model

mild galleon May 9, 2025, 12:06 PM

#

guys best temp for gemini 2.5 pro for coding?

calm sequoia May 9, 2025, 12:07 PM

#

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

Large Language Model Influence on Diagnostic Reasoning

This randomized clinical trial evaluates the diagnostic performance of physicians with use of a large language model compared with conventional resources.

mild galleon May 9, 2025, 12:08 PM

#

zero?

ocean vortex May 9, 2025, 12:08 PM

#

yes

keen beacon May 9, 2025, 12:08 PM

#

is google gemini 1.5 free with api?
i'm looking for a small free model that i can feed some small json per prompt

for example

video id:
video views:
ctr:
cpa:

i want the model to atleast be able to distinguish between these metrics then form a type of "ai analysis" on the data i give it

wintry locust May 9, 2025, 12:09 PM

#

keen beacon is google gemini 1.5 free with api? i'm looking for a small free model that i ca...

2.5 flash api is free

#

as well as a bunch of stuff on groq

#

and some on openrouter though you need $10 of deposited credit

unborn ocean May 9, 2025, 1:05 PM

#

keen beacon is google gemini 1.5 free with api? i'm looking for a small free model that i ca...

https://github.com/cheahjs/free-llm-api-resources
don't abuse though (bc the companies will only offer it as long as they are not losing millions per day)

GitHub

GitHub - cheahjs/free-llm-api-resources: A list of free LLM inferen...

A list of free LLM inference resources accessible via API. - cheahjs/free-llm-api-resources

calm sequoia May 9, 2025, 1:14 PM

#

Does anybody know platform to try Mistral 3 Medium?

teal mantle May 9, 2025, 1:15 PM

#

cedar tide First Moe reasoner its o1

I thought o1 isn’t MoE (maybe cos of OpenAI being too amorphous even towards architecture)

torn mantle May 9, 2025, 1:15 PM

#

https://chat.mistral.ai/chat

cedar tide May 9, 2025, 1:16 PM

#

calm sequoia Does anybody know platform to try Mistral 3 Medium?

Le chat

torn mantle May 9, 2025, 1:16 PM

#

emberwing could be a gemma model

calm sequoia May 9, 2025, 1:16 PM

#

The model name is not indicated.

calm sequoia May 9, 2025, 1:16 PM

#

torn mantle emberwing could be a gemma model

Underperform for you as well?

torn mantle May 9, 2025, 1:17 PM

#

calm sequoia Underperform for you as well?

they all look similar to me tbh

#

same format

cedar tide May 9, 2025, 1:17 PM

#

torn mantle emberwing could be a gemma model

the gemma 3 models that were being tested knew it was gemma

torn mantle May 9, 2025, 1:17 PM

#

listing + bullets

#

yapping

#

they dont feel unique in a way like o3 or o1

calm sequoia May 9, 2025, 1:18 PM

#

Something is going on in arena. No new models for a while. Only anonymous variants of google.

torn mantle May 9, 2025, 1:19 PM

#

I thought we would have added Grok 3.5 by now

calm sequoia May 9, 2025, 1:20 PM

#

Yes. The ussualy lifetime was "Anonymous in arena" -> "Release" -> "Reveal in arena"

#

Now it's 4D chess

torn mantle May 9, 2025, 1:21 PM

#

now, many labs simply release their model first and then add it to the arena

#

get rekt

#

timezone leaked?

tall summit May 9, 2025, 1:27 PM

#

so fake

high ginkgo May 9, 2025, 1:33 PM

#

grok 3.5 release tomorrow

torn mantle May 9, 2025, 1:35 PM

#

high ginkgo grok 3.5 release tomorrow

yesterday

echo aurora May 9, 2025, 1:37 PM

#

everyday

calm sequoia May 9, 2025, 1:43 PM

#

Why did you neeed COT? 😄

teal mantle May 9, 2025, 1:43 PM

#

high ginkgo grok 3.5 release tomorrow

I got so tired of the hype choking so I subscribed chatgpt instead 💀

calm sequoia May 9, 2025, 1:43 PM

#

teal mantle I got so tired of the hype choking so I subscribed chatgpt instead 💀

Smart!

teal mantle May 9, 2025, 1:43 PM

#

Btw how many 2.5 pro request daily for gemini advanced

#

Free is atrocious so much I am back to ai studio

#

Or openrouter

torn mantle May 9, 2025, 1:44 PM

#

im actually still in shock with the quality of deepseek search, its not your typical generic search

#

gemini search/grounding is so bad

teal mantle May 9, 2025, 1:57 PM

#

btw o3 search is good even without using deep research

#

to my surprise

#

I have been almost a month late but I am seriously impressed

#

time for me to force it for math or coding tasks to bring out the reasoning
because let's say, for a paper's literature review, I am still squandering it and better use the deep research function instead

calm sequoia May 9, 2025, 2:03 PM

#

teal mantle btw o3 search is good even without using deep research

The first days it was released my simple search promt resulted in ~70 analyzed websites. You see this much only in deep research. I think it was nerfed later.

teal mantle May 9, 2025, 2:05 PM

#

but then I forgot the fact that even if I sub now, I might not get access unfortunately

#

now time to wait for OSS side of things to get good

#

at least this cycle is almost over
characterized by models of {o3, 2.5Pro, Grok3.5} and maybe Qwen 3

#

oops I forgot o4-mini

#

but what's important I am waiting for OpenAI to let user access o3 API from third parties like OpenRouter

torn mantle May 9, 2025, 2:11 PM

#

grok 3.5 released
elon : its on beta, bugs are being fixed
elon : expect major improvements on next 4-5 months

#

and run that in a loop

ocean vortex May 9, 2025, 2:11 PM

#

were you really trying to do this? lol

cedar tide May 9, 2025, 2:14 PM

#

grok 3.5 will be in beta until grok 6 alpha is released

ocean vortex May 9, 2025, 2:14 PM

#

you would normally do it by making it output end of thinking token before it starts thinking. But I don't think anyone knows how exactly it looks, it would also get sanitized from input/output

tall summit May 9, 2025, 2:14 PM

#

how do you know grok wont go the way of openai and release grok h6

#

or somethin

torn mantle May 9, 2025, 2:16 PM

#

cedar tide grok 3.5 will be in beta until grok 6 alpha is released

lol

ocean vortex May 9, 2025, 2:16 PM

#

speaking of which... I think raw thinking includes function calls for every paragraph to summarize. And when you use API those function calls are just ignored silently

#

or actually, returned as well, since you do have summaries with new models now on API too

alpine coral May 9, 2025, 2:25 PM

#

teal mantle btw o3 search is good even without using deep research

yeah it's excellent - it properly iterates, like conducts multiple new searches (using new queries based on past results) during the reasoning process

#

it's like ofc not as exhaustive as deep research.. but yeah it can be remarkably effectively and pretty deep

ocean vortex May 9, 2025, 2:37 PM

#

teal mantle btw o3 search is good even without using deep research

yeah it is great. If you make it more verbose it can search for minutes depending on the prompt

teal mantle May 9, 2025, 2:41 PM

#

ocean vortex yeah it is great. If you make it more verbose it can search for minutes dependin...

one classic workflow: use non-reasoners to clarify and explicate first

ocean vortex May 9, 2025, 2:43 PM

#

teal mantle one classic workflow: use non-reasoners to clarify and explicate first

I'm just using their official app. Custom instructions work great. Interestingly the enforced "yap score" of 8000 is not really limiting the number by itself.

#

if you specifically request a response with 8k yap score it is already gonna be much longer than usual. It sees that as the ceiling

teal mantle May 9, 2025, 2:46 PM

#

ocean vortex I'm just using their official app. Custom instructions work great. Interestingly...

yeah, I am too because without the hassle of setting up API on their site

#

like use 4o to clarify and enhance question depth => use o3

balmy mist May 9, 2025, 2:49 PM

#

so no o3 pro and grok 3.5 lol

#

we got baited so hard

keen beacon May 9, 2025, 2:51 PM

#

balmy mist we got baited so hard

Wait for 2.5 ultra

#

Grok 3.5 is probably mid anyway (no prerelease on arena, etc) and o3 pro is insanely expensive

teal mantle May 9, 2025, 2:56 PM

#

keen beacon Grok 3.5 is probably mid anyway (no prerelease on arena, etc) and o3 pro is insa...

yeah, no chocolate unfortunately

#

as a joke I want a stealth model called ||chocomint||

balmy mist May 9, 2025, 2:58 PM

#

keen beacon Wait for 2.5 ultra

ahhh

#

this is why i like google

#

they always deliver, at least we got one model last week

#

im not even going to think about r2 lol

teal mantle May 9, 2025, 2:59 PM

#

balmy mist im not even going to think about r2 lol

if it comes, it comes vibes

#

anyways, is 2m43s short for o4-mini-high?

balmy mist May 9, 2025, 3:00 PM

#

which ai model is best for song lyrics?

teal mantle May 9, 2025, 3:00 PM

#

balmy mist which ai model is best for song lyrics?

what you are going to do with song lyrics?

wintry tinsel May 9, 2025, 3:04 PM

#

torn mantle > grok 3.5 released > elon : its on beta, bugs are being fixed > elon : expect m...

This is because their super computer cluster is still in construction, there team is still pretty new/not used to working together, and most significantly they are rushing to be relevant so they push out everything in advance to get attention out of fear of being left behind

#

I think after a year Grok will behave more normally as a company

alpine coral May 9, 2025, 3:16 PM

#

yeah at least got a steady hand on the helm that's for sure

keen beacon May 9, 2025, 3:22 PM

#

LOL

balmy mist May 9, 2025, 3:25 PM

#

teal mantle what you are going to do with song lyrics?

make a song with new ai song maker

drifting thorn May 9, 2025, 3:25 PM

#

Deepseek r1 and Gemini gives good lyrics

#

GPT is bad

ocean vortex May 9, 2025, 3:26 PM

#

drifting thorn GPT is bad

try 4.5

drifting thorn May 9, 2025, 3:27 PM

#

… Don’t have that money

ocean vortex May 9, 2025, 3:27 PM

#

drifting thorn … Don’t have that money

oh so you were using gpt4o-mini?

drifting thorn May 9, 2025, 3:27 PM

#

20 US dollars per month, nah

ocean vortex May 9, 2025, 3:27 PM

#

no wonder then

#

gpt4o-mini sucks

drifting thorn May 9, 2025, 3:27 PM

#

ocean vortex oh so you were using gpt4o-mini?

o4-mini or 4o if I use ChatGPT

mild galleon May 9, 2025, 3:27 PM

#

guys whats the best framework for llms controlling your pc

drifting thorn May 9, 2025, 3:28 PM

#

Most of the time I call 2.5 Flash or 2.5 Pro

ocean vortex May 9, 2025, 3:28 PM

#

drifting thorn o4-mini or 4o if I use ChatGPT

full gpt4o only a few requests then you are using mini

#

with free

drifting thorn May 9, 2025, 3:28 PM

#

I don’t even use 4o a lot

#

Just for its image generation

ocean vortex May 9, 2025, 3:29 PM

#

but you do want lyrics

#

and for that gpt4o would be much better than mini

drifting thorn May 9, 2025, 3:29 PM

#

I know, but I trust the answers of reasoning models more

ocean vortex May 9, 2025, 3:29 PM

#

it's 4.1 anyway = chatgpt-latest 4o

balmy mist May 9, 2025, 3:29 PM

#

https://x.com/imjustnewatai/status/1920857303005155522

imjustnewatai (@imjustnewatai) on X

grok 3.5 today, worlds best model by a landslide with human level vision, agency and intelligence

#

this the same news from yesterday

drifting thorn May 9, 2025, 3:29 PM

#

And I usually write lyrics using Deepseek R1 or 2.5

teal mantle May 9, 2025, 3:30 PM

#

drifting thorn 20 US dollars per month, nah

it is abundant and capable for me though

teal mantle May 9, 2025, 3:30 PM

#

ocean vortex it's 4.1 anyway = chatgpt-latest 4o

Quasar do resembles a special version of 4o

ocean vortex May 9, 2025, 3:30 PM

#

drifting thorn I know, but I trust the answers of reasoning models more

o4-mini is good and shouldn't suck but it's not the right tool for the job if you want lyrics either

drifting thorn May 9, 2025, 3:31 PM

#

Can you read what am I typing? I SAY, I USE GEMINI 2.5 OR DEEPSEEK R1 TO WRITE LYRICS

ocean vortex May 9, 2025, 3:31 PM

#

for lyrics full chatgpt-latest / 4.1 would probably be better than o4-mini

drifting thorn May 9, 2025, 3:31 PM

#

HOW MANY TIMES SHOULD I REPEAT?

ocean vortex May 9, 2025, 3:31 PM

#

drifting thorn Can you read what am I typing? I SAY, I USE GEMINI 2.5 OR DEEPSEEK R1 TO WRITE L...

are you drunk?

#

just calm tf down lmao

#

you said gpt suck

#

well ofc it sucks since you have free acc

drifting thorn May 9, 2025, 3:34 PM

#

Okay

#

Maybe I’ll try it via LMArena

#

Tested samples from: ChatGPT 4.1, ChatGPT 4o 0326, Gemini 2.5 Pro 0506, Gemini 2.5 Flash, Deepseek R1, Grok 3(reasoning)

teal mantle May 9, 2025, 3:44 PM

#

ocean vortex well ofc it sucks since you have free acc

I haven't use reasoning from ChatGPT and am surprised how good it becomes when you pay 20 dollars

#

aside from code/maths/physics, any tasks that benefits from reasoners than non-reasoners for most of you?

wintry tinsel May 9, 2025, 4:01 PM

#

Mistral Large 3
Llama 4 Behemoth
Grok 3.5
Grok 4
GPT5
Claude 4
Gemini Ultra
*Deep Seek R2

Who shall be the champion

#

I’m betting on Claude 4 or GPT5

#

Mistral 3 large might be better than Sota

keen beacon May 9, 2025, 4:04 PM

#

unless claude 4 has a lot more multimodal capabilities, however good it is, might be the beginning of the end for anthropic

#

no real info about them so cant really gauge them at all

#

idk i dont really know anything about it

tall summit May 9, 2025, 4:11 PM

#

teal mantle aside from code/maths/physics, any tasks that benefits from reasoners than non-r...

anything

sturdy mica May 9, 2025, 4:17 PM

#

wintry tinsel I’m betting on Claude 4 or GPT5

i dont like anthropics crazy limits on claude
2016@ HOTAK0 QUOTE

drifting thorn May 9, 2025, 4:17 PM

#

sturdy mica i dont like anthropics crazy limits on claude 2016@ HOTAK0 QUOTE

True

#

I appreciate Deepseek sometimes for its boldness

sturdy mica May 9, 2025, 4:17 PM

#

Interestingly

2016@ HOTAK0 QUOTE

drifting thorn May 9, 2025, 4:18 PM

#

Since it sometimes says things that aren’t really supposed to be said by a controlled AI

#

I don’t know what’s the point of not letting AI writes fictions with political/violence/sexual related content

wintry tinsel May 9, 2025, 4:20 PM

#

I’m gonna test this new mistral 3 on open router

drifting thorn May 9, 2025, 4:31 PM

#

https://www.arxiv.org/pdf/2505.05469

#

I can see researchers are really trying to solve the problem of spatial understanding of AI

misty vault May 9, 2025, 4:36 PM

#

sydney

calm sequoia May 9, 2025, 5:32 PM

#

Grok is boring. It will be another SOTA model like o3 or 2.5 PRO (at best). What's interesting is the Behemot. 2T model made by a bad team. You can't forecast what will happen. When is it? 👀

zinc ore May 9, 2025, 5:50 PM

#

https://x.com/OfficialLoganK/status/1920863634374172853

Logan Kilpatrick (@OfficialLoganK) on X

Gemini 2.5 Pro (05-06) is SOTA at most video understanding tasks (by a large margin) 📽️. Lots of work by the Gemini multimodal team to make this happen, excited to see developers push this capability in new ways.

More details below!

torn mantle May 9, 2025, 5:52 PM

#

oh

#

i was about to share that

#

yea its interesting

#

previous SOTA = gemini 2.5 pro 03?

zinc ore May 9, 2025, 5:57 PM

#

Lists previous sotas on left

#

Looks like 4.1 held a lot of em

keen beacon May 9, 2025, 5:58 PM

#

did they measure the old gem 2.5 pro? i think they just measured it/reported it now

calm sequoia May 9, 2025, 6:10 PM

#

torn mantle May 9, 2025, 6:11 PM

#

calm sequoia

its not a big of a nerf that we think it is

teal mantle May 9, 2025, 6:12 PM

#

Btw there was a brief period that ChatGPT-4o-latest-20250129 that have huge precedence on memory

torn mantle May 9, 2025, 6:12 PM

#

i mean it got better in many other areas as well

#

the big nerf was them removing gemini 1206

#

that actually was a nerf

calm sequoia May 9, 2025, 6:13 PM

#

I know yes, but the margin was small with the o3, and it dropped to second place after the nerf

teal mantle May 9, 2025, 6:13 PM

#

It can nonchalantly refer to memories, the details about the things you build logged

torn mantle May 9, 2025, 6:13 PM

#

and dont forget that its still exp, we still havent got the full/last gemini 2.5 pro version

calm sequoia May 9, 2025, 6:13 PM

#

Second nerf?

torn mantle May 9, 2025, 6:13 PM

#

they will probably release the full model on i/o event

golden ocean May 9, 2025, 6:14 PM

#

full model is agi

calm sequoia May 9, 2025, 6:14 PM

#

I don't remember if the EXP to Full PRO was a huge step last time

torn mantle May 9, 2025, 6:14 PM

#

we will have it a week or two before on lmarena

#

I think they are still experimenting and closing the gap in certain areas, which has had some drawbacks

torn mantle May 9, 2025, 6:17 PM

#

calm sequoia I don't remember if the EXP to Full PRO was a huge step last time

it was a huge nerf

calm sequoia May 9, 2025, 6:17 PM

#

Oh no 👀

torn mantle May 9, 2025, 6:17 PM

#

its kinda creating the same reactions

#

gemini exp 1206 -> gemini 2.0 pro
gemini 2.5 pro 03 -> gemini 2.5 pro 05

#

https://www.reddit.com/r/GoogleGeminiAI/comments/1ij1sn9/gemini_exp1206_gone_and_now_gemini_sucks/

From the GoogleGeminiAI community on Reddit: Gemini exp-1206 gone a...

Explore this post and more from the GoogleGeminiAI community

calm sequoia May 9, 2025, 6:19 PM

#

Business wise it's smart. Getting all of the attention with huge financial downside. Then fixing financials by nerfing model and hoiping that most of the customers will not notice 😄

torn mantle May 9, 2025, 6:19 PM

#

the thing with gemini 1206 its that it was probably costly to run

keen beacon May 9, 2025, 6:19 PM

#

unborn ocean https://github.com/cheahjs/free-llm-api-resources don't abuse though (bc the com...

thank you feel free to delete i bookmarked

calm spear May 9, 2025, 6:20 PM

#

why don't you just publish ALL the prompt&chats data for LLM training?

#

not only for big companies but for everyone

calm sequoia May 9, 2025, 6:21 PM

#

torn mantle the thing with gemini 1206 its that it was probably costly to run

Good thing is that the compute is getting cheaper constantly. Model's that are too expensive right now will be "optimal" in a year or two. We get a small glimpse to the near future tech.

calm sequoia May 9, 2025, 6:21 PM

#

calm spear not only for big companies but for everyone

Hello, Xi

keen beacon May 9, 2025, 6:28 PM

#

calm sequoia I don't remember if the EXP to Full PRO was a huge step last time

if ur talking about 2.0 pro, they never released a "full" version of it lol

wintry tinsel May 9, 2025, 6:33 PM

#

calm sequoia Business wise it's smart. Getting all of the attention with huge financial downs...

Grody

#

It’s like a horror movie

#

this is why Claude is king

teal mantle May 9, 2025, 6:37 PM

#

golden ocean full model is agi

That will be Gemini 3.0
People said 2.5 Pro = 3.0 Pro but maybe GDM have reservation

calm spear May 9, 2025, 6:38 PM

#

calm sequoia Hello, Xi

It would be at first useful for open source

calm spear May 9, 2025, 6:38 PM

#

golden ocean full model is agi

bruh, OFC, who cares

high ginkgo May 9, 2025, 6:42 PM

#

I do

calm sequoia May 9, 2025, 6:45 PM

#

keen beacon if ur talking about 2.0 pro, they never released a "full" version of it lol

Maybe that's why I don't remmember 😄

#

I undertand that musk's claims of robotaxis, self driving, etc. can be justified because they are hard. But to be unable to predict the release date of your companies product is something else

cyan yew May 9, 2025, 6:58 PM

#

I ensembled two models, GPT-4o + Sonnet 3.7, and lots of folks use it as their daily driver. Should I add it?

rugged brook May 9, 2025, 7:01 PM

#

wintry tinsel * Mistral Large 3 * Llama 4 Behemoth * Grok 3.5 * Grok 4 * GPT5 * Claude 4 * Gem...

clade 4 or gpt5

ocean vortex May 9, 2025, 7:03 PM

#

rugged brook clade 4 or gpt5

dork 4

tall summit May 9, 2025, 7:09 PM

#

cyan yew I ensembled two models, **GPT-4o + Sonnet 3.7,** and lots of folks use it as the...

uhhh what does that mean

cyan yew May 9, 2025, 7:12 PM

#

tall summit uhhh what does that mean

I give you gpt-4o's response first so it's fast. Then tell you what Sonnet 3.7 had to add that the first response missed.

#

Yeah it's slower though. Usually I combine. https://x.com/PolyChatCo/status/1919846059993661508

PolyChat.co (@PolyChatCo) on X

Gemini 2.5 Pro, our favorite model for pretty much everything, just got an upgrade. We highly recommend combining it with a faster model, like GPT-4o + Sonnet 3.7, for the best in both speed and quality!

#

I can't get 2.5p reasoning tokens from the API unfortunately. How about 2.5 flash with oAI personality?

teal mantle May 9, 2025, 7:23 PM

#

cyan yew Yeah it's slower though. Usually I combine. https://x.com/PolyChatCo/status/1919...

Is polychat based on openwebui

cyan yew May 9, 2025, 7:23 PM

#

openrouter doesn't get the reasoning tokens either. Though I see cursor does 🤔

teal mantle May 9, 2025, 7:23 PM

#

Oh btw o3’s OCR methodology is torturing itself

#

It keeps cropping and reexamining the image

cyan yew May 9, 2025, 7:24 PM

#

teal mantle Is polychat based on openwebui

openrouter is one of our providers. i prefer to go direct to save $

keen beacon May 9, 2025, 7:24 PM

#

i believe that theyre gonna be adding a thinking summary to the gemini api soon

teal mantle May 9, 2025, 7:24 PM

#

teal mantle It keeps cropping and reexamining the image

Not great, not terrible. 3m 48s.

#

Last time I accidentally tortured reasoners due to poor methodologies on my part is DeepSeek R1 and asking the difficulty of laser concrete penetration.

#

Nevermind o3 is using matplotlib to read Chinese characters for no reason.

#

One of the worst possible methodology.

keen beacon May 9, 2025, 7:29 PM

#

i think api only

tall summit May 9, 2025, 7:29 PM

#

teal mantle Nevermind o3 is using matplotlib to read Chinese characters for no reason.

????

keen beacon May 9, 2025, 7:30 PM

#

its basically unlimited afaik unless they changed smthing recently. but i think u might get rate limited for a bit if u use it like an insane amount

teal mantle May 9, 2025, 7:38 PM

#

tall summit ????

I have given an improved textual version but it still use matplotlib.

tall summit May 9, 2025, 7:41 PM

#

lol

#

it uses matplotlib to crop

#

i think

#

thats what it says.

balmy mist May 9, 2025, 7:56 PM

#

any news today?

calm sequoia May 9, 2025, 8:06 PM

#

Wont you miss your kidney?

torn mantle May 9, 2025, 8:08 PM

#

https://x.com/hyhieu226/status/1920913849340801372

Hieu Pham (@hyhieu226) on X

True story: probably only at @xai, people call in sick to take time off over the weekend 🤣

#

help them

#

this is unhealthy af

elder rapids May 9, 2025, 8:38 PM

#

torn mantle the thing with gemini 1206 its that it was probably costly to run

inefficiency is prolly a big part of what makes the models experimental

#

the idea that they're quantized in any way is dumb to me tbh

#

2.0 pro to me was just a prompted 1206, 1206 was likely overfitted

#

or at least targeted biases and didn't generalize nearly as well

#

I'm not sure what else to say besides the fact that 0506 simply isn't worse than 0325, the first day it had mixed performance due to a ton of bugs, seemed dumber etc, but it genuinely seems like the same model for every task except better at writing and coding

#

can't stress enough that people don't even try to use the models correctly, and they wouldn't have these issues if they tried to figure out where it's coming from and then take advantage of that

#

Gemini doesnt succumb to the rigidity pitfalls that other reasoners have, it adapts

zinc ore May 9, 2025, 8:51 PM

#

Bunch of reports that 0506 is performing much better today

torn mantle May 9, 2025, 8:56 PM

#

i think we reached a point specifically on gemini models where we cant tell the difference

#

riverhollow = dragontail = sunstrike = gemini 2.5 pro

#

you cant really tell for sure which one is better

#

could be because of the same formatting

zinc ore May 9, 2025, 8:59 PM

#

Logan said to expect a lot of surprises at I/O

#

And another employee called 0506 a tease for I/O

torn mantle May 9, 2025, 9:13 PM

#

mm we will have :

notebooklm video previews
gemini 2.5 ultra ( maybe? )
gemini coding model
gemini 2.5 flash

zinc ore May 9, 2025, 9:13 PM

#

Possibly GA versions

tall summit May 9, 2025, 9:21 PM

#

torn mantle mm we will have : - notebooklm video previews - gemini 2.5 ultra ( maybe? ) - ge...

we have gemini 2.5 flash

keen beacon May 9, 2025, 9:21 PM

#

not ga i believe

ocean vortex May 9, 2025, 9:27 PM

#

torn mantle mm we will have : - notebooklm video previews - gemini 2.5 ultra ( maybe? ) - ge...

4.0 dork gonna destroy those

elder rapids May 9, 2025, 9:36 PM

#

torn mantle mm we will have : - notebooklm video previews - gemini 2.5 ultra ( maybe? ) - ge...

2.5 flash seems to be very stable tbh

#

even as preview

#

can't wait for the coding models tho

calm sequoia May 9, 2025, 9:41 PM

#

Whats the story

calm sequoia May 9, 2025, 9:42 PM

#

elder rapids I'm not sure what else to say besides the fact that 0506 simply isn't worse than...

Still fails my real qorld knowledge and logic prompts after nerf

elder rapids May 9, 2025, 9:52 PM

#

prolly environment skill issue, it's performing the same and that seems to be what everyone else is reporting

#

and that's what I've been getting from it's performance

keen beacon May 9, 2025, 9:53 PM

#

you said it was much smarter than 0325

elder rapids May 9, 2025, 9:53 PM

#

keen beacon you said it was much smarter than 0325

yeah?

#

ion know what that has to do with what I said

#

how smart it is won't help it solve hard tasks

#

just how it expresses the answer + heuristics to get to that point

keen beacon May 9, 2025, 10:04 PM

#

uh, woah

elder rapids May 9, 2025, 10:05 PM

#

yo what

#

veo 3?

keen beacon May 9, 2025, 10:05 PM

#

at I/O i presume

#

remember veo 2 dropped last IO

elder rapids May 9, 2025, 10:06 PM

#

imagen 3 is so good ngl

zinc ore May 9, 2025, 10:06 PM

#

Beat me to posting it aahhh

elder rapids May 9, 2025, 10:06 PM

#

wonder how 3.5 is gonna be

#

didn't demis say something about wanting to merge the video models and the language models

zinc ore May 9, 2025, 10:07 PM

#

Yes

#

Or at the very least let Gemini use veo

elder rapids May 9, 2025, 10:08 PM

#

ye that would be crazy if it had access to the iterative generation of each video to pick out and guide it

zinc ore May 9, 2025, 10:08 PM

#

Hopefully Veo 3 can output much longer vids than 8 seconds

elder rapids May 9, 2025, 10:09 PM

#

zinc ore Hopefully Veo 3 can output much longer vids than 8 seconds

could be, if object permanence is mastered

#

also now I'm pretty confident in what I said a few days ago

#

about not serving an ultra model, but enterprise stuff

#

video generation, image generation, ai tools not llms

#

hence the new tiers for more access to heavy things

wintry tinsel May 9, 2025, 10:13 PM

#

keen beacon uh, woah

No ultra?

keen beacon May 9, 2025, 10:13 PM

#

im somewhat confident 2.5 ultra is a thing and coming soon (though im not sure if its gonna be at io) based on public info. let's see 👀

elder rapids May 9, 2025, 10:14 PM

#

I'd be super surprised if they did want to serve a 2.5 ultra

wintry tinsel May 9, 2025, 10:14 PM

#

Seems like AI companies all around tend to just drop their heavy weight models like opus, Ultra, behemoth, the only company that releases heavy weights is open AI

keen beacon May 9, 2025, 10:14 PM

#

elder rapids I'd be super surprised if they did want to serve a 2.5 ultra

i was initially surprised too, i thought that line of models were dead

elder rapids May 9, 2025, 10:14 PM

#

(or, a 1.5 ultra ig)

keen beacon May 9, 2025, 10:14 PM

#

apparently not

elder rapids May 9, 2025, 10:14 PM

#

keen beacon i was initially surprised too, i thought that line of models were dead

I'd be happy tho

#

genuinely

#

1.0 ultra was special

#

dumb asf

#

but goddamn

wintry tinsel May 9, 2025, 10:15 PM

#

I’m curious to see how major veo 3 is, lots of minor improvements in video gen over the years

#

Rare we see major leaps like Sora was

elder rapids May 9, 2025, 10:15 PM

#

sora → veo 2 was a massive leap tho

keen beacon May 9, 2025, 10:16 PM

#

elder rapids (or, a 1.5 ultra ig)

for 1.5 ultra, there's nothing too concrete about it out there but it was possible that it couldve existed

zinc ore May 9, 2025, 10:16 PM

#

They've been focusing heavily on world models, so hopefully Veo 3 gives us a pretty good idea of their progress there.

keen beacon May 9, 2025, 10:17 PM

#

there were some comments about it (though there were sparse, and they aren't ones i would put too much stock on unlike other things)

elder rapids May 9, 2025, 10:18 PM

#

keen beacon for 1.5 ultra, there's nothing too concrete about it out there but it was possib...

ye but a good part of the reason I'm unconvinced it would exist is due to the scheme of taking so long to release it and mess up that iterative large model feedback without truly knowing the gains, messing up naming, focusing on small models

#

if they never actually released a 1.5 ultra during the 1.5 family, then they prob decided they'd never work on it, since that's a big big move

#

adding a model to an already established family is like the worst thing you can do tbh

zinc ore May 9, 2025, 10:21 PM

#

They had a 1.5 ultra

elder rapids May 9, 2025, 10:21 PM

#

if not 1.5 or 2.5 ultra, then it's just going to be "Gemini Ultra" as a single model

elder rapids May 9, 2025, 10:21 PM

#

zinc ore They had a 1.5 ultra

2.0 ultra?

zinc ore May 9, 2025, 10:21 PM

#

1.5 ultra

elder rapids May 9, 2025, 10:21 PM

#

that's my point tho

#

1.5 ultra is just an example

zinc ore May 9, 2025, 10:21 PM

#

But for some reason they never released it, and they almost did

elder rapids May 9, 2025, 10:21 PM

#

forget everything I just said, replace with 2.0 ultra, and that's what I mean

#

same thing all over again

elder rapids May 9, 2025, 10:23 PM

#

elder rapids if not 1.5 or 2.5 ultra, then it's just going to be "Gemini Ultra" as a single m...

this would be cool ngl

#

just the massive master model

#

called purely "Gemini Ultra"

#

no naming scheme, just simply updated to be better and better

keen beacon May 9, 2025, 10:23 PM

#

nah that sh1t will get confusing fast

elder rapids May 9, 2025, 10:24 PM

#

nah

#

because what if Gemini ultra can call 2.5 pro or 2.5 flash for tasks

#

or it can call image generators

#

or the video generators

#

and that's it's highlight

south cove May 9, 2025, 10:25 PM

#

Hey there, I'm curious if it's possible to have LMArena code actually run in the web interface? I want to do a lot of testing, but it's getting tiring to copy and paste all the code into files

keen beacon May 9, 2025, 10:26 PM

#

u could suggest that feature for the new ui i guess

#

#new-ui-feedback

south cove May 9, 2025, 10:26 PM

#

It's weird because I saw a youtube video of someone using something like LMArena (I thought it indeed was LMArena) but the output was directly embedded into the site

keen beacon May 9, 2025, 10:26 PM

#

you mean web dev arena?

south cove May 9, 2025, 10:27 PM

#

Ahh it's a different service?

keen beacon May 9, 2025, 10:27 PM

#

south cove Ahh it's a different service?

different from the general arena

#

https://webdev.lmarena.ai/

south cove May 9, 2025, 10:27 PM

#

Perfect! Thanks!

raven void May 9, 2025, 10:39 PM

#

I hate to say it but

#

o4 mini is benchmark maxxed , it's crap at code

keen beacon May 9, 2025, 10:43 PM

#

okay woah

brittle tiger May 9, 2025, 10:47 PM

#

More excited about the Veo 3 mention than imagen 4

zinc ore May 9, 2025, 10:48 PM

#

What was point of mentioning 3.5 imagen then

elder rapids May 9, 2025, 11:04 PM

#

zinc ore What was point of mentioning 3.5 imagen then

prolly the same way 4o is to 4o mini, or gpt 4 to gpt 3.5

elder rapids May 9, 2025, 11:04 PM

#

keen beacon okay *woah*

actually insane tho

#

two whole iterations of imagen

zinc ore May 9, 2025, 11:05 PM

#

Apparently 3.5 is old news

#

Which I didn't know

elder rapids May 9, 2025, 11:05 PM

#

if they're going ahead to 4 with the consideration of an intermediate model like 3.5 which should be an improvement of 3 (which is already insanely good)

#

then there must be something with imagen 4

elder rapids May 9, 2025, 11:06 PM

#

brittle tiger More excited about the Veo 3 mention than imagen 4

ong

brittle tiger May 9, 2025, 11:06 PM

#

One of them probably has native editing id guess

leaden palm May 9, 2025, 11:07 PM

#

if youre not a studio and dont have loads of money whats the point of veo

elder rapids May 9, 2025, 11:07 PM

#

brittle tiger One of them probably has native editing id guess

would be crazy tbh, I'd imagine it's called imagen 4 with an LLM behind it

leaden palm May 9, 2025, 11:07 PM

#

leaden palm if youre not a studio and dont have loads of money whats the point of veo

whats the point of any video generator outside of brainrot and producing big films

zinc ore May 9, 2025, 11:07 PM

#

It's honestly amazing how much stuff they're cranking out, I'm convinced they are moving faster than all of the competition now

elder rapids May 9, 2025, 11:07 PM

#

leaden palm if youre not a studio and dont have loads of money whats the point of veo

design, short films, informational promo, gifs

#

logo

#

hold on tho

#

we HAVE to assume

#

there's a MASSIVE difference

#

between a really really good unseen model

#

and the models we have now

#

when it comes to distribution

brittle tiger May 9, 2025, 11:08 PM

#

leaden palm whats the point of any video generator outside of brainrot and producing big fil...

It's great for memes. Or at least that's all I use it for w image2vid on API. Not avail on Gemini app yet

elder rapids May 9, 2025, 11:10 PM

#

zinc ore It's honestly amazing how much stuff they're cranking out, I'm convinced they ar...

ye

#

I've said this before

#

but no one seems to realize the jump to 2.5 pro is as well

#

and their readiness to now just pump stuff out

#

this is an insane attitude difference

zinc ore May 9, 2025, 11:11 PM

#

I've believed it, but this I/O looking like it'll prove it beyond doubt

elder rapids May 9, 2025, 11:12 PM

#

makes me believe the last year models could've been treated as side project ish

#

by them

#

It was honestly impressive how behind they were tbh

#

I'd imagine they spent a whole year developing internally, reforming/overhauling teams

#

etc

keen beacon May 9, 2025, 11:14 PM

#

why so so many people online say

#

"ai cant do backend" 💀

#

#

did backend just fine

zinc ore May 9, 2025, 11:16 PM

#

Probably based on old capabilities, since the space progresses so fast

keen beacon May 9, 2025, 11:16 PM

#

zinc ore Probably based on old capabilities, since the space progresses so fast

it wasnt all instant

#

took debugging etc obviously

#

but its pretty capable

#

i bet even on flash 2.0 i could make a fully earning game on something simplified like LuaU

keen beacon May 9, 2025, 11:17 PM

#

keen beacon

this is next.js typescript and python

candid storm May 9, 2025, 11:26 PM

#

When do you guys expect Grok 3.5 to be released?

elder rapids May 9, 2025, 11:36 PM

#

the releases won't really align (video models, other way important leaps, etc vs an llm) and io ngl this year would be a nightmare to release against

#

so ion think it'd be connected to the io

zinc ore May 9, 2025, 11:38 PM

#

Basically best bet would be early next week or after I/O

elder rapids May 9, 2025, 11:46 PM

#

yeah, with the way they're moving with the hype and "it's going to release today trust"

#

prolly early next week

#

or somewhere in that timespan

leaden palm May 9, 2025, 11:52 PM

#

keen beacon did backend just fine

how many LOC?

south cove May 10, 2025, 12:26 AM

#

I'm kind of curious if LMArena itself was built with AI tools

elder rapids May 10, 2025, 12:28 AM

#

lmarena is old asf

echo aurora May 10, 2025, 2:03 AM

#

south cove I'm kind of curious if LMArena itself was built with AI tools

Just a little bit

near scroll May 10, 2025, 3:05 AM

#

anyone got a code pointer to how citation parsing works in the UI for search arena? it doesn't seem to be documented or in the fastchat code (or I'm just not finding it)

keen beacon May 10, 2025, 3:31 AM

#

leaden palm how many LOC?

im not at my pc rn but alot

#

if ur talking about lines

#

api for tiktok trend data, firebase for user auth, azure database, vercel hosted

#

tmr i add stripe integration w/ firebase then make it monetizable

small haven May 10, 2025, 3:32 AM

#

bro wen is o3 pro

#

holy moly rolly polly

keen beacon May 10, 2025, 3:33 AM

#

allat just for it to be mid

raven void May 10, 2025, 3:51 AM

#

mid

small haven May 10, 2025, 5:37 AM

#

pro as is, is already good, but o3 pro would be cherry on top

small haven May 10, 2025, 6:05 AM

#

ya its not for everyone obv

hollow ocean May 10, 2025, 6:15 AM

#

Grok 3.5 July 27 🔥

wintry tinsel May 10, 2025, 6:21 AM

#

hollow ocean Grok 3.5 July 27 🔥

Baseless information 😀

golden ocean May 10, 2025, 7:44 AM

#

wintry tinsel Baseless information 😀

Grok 3.5 May 11

high ginkgo May 10, 2025, 7:45 AM

#

south cove I'm kind of curious if LMArena itself was built with AI tools

LMArena was built with grok 3.5

teal mantle May 10, 2025, 8:20 AM

#

high ginkgo LMArena was built with grok 3.5

lmfao

so grok knows how to do infra, as you know lmarena new runs on openwebui

misty vault May 10, 2025, 8:21 AM

#

yeah, because grok 3.5 is agi

calm sequoia May 10, 2025, 8:44 AM

#

torn mantle May 10, 2025, 9:49 AM

#

https://x.com/btibor91/status/1921120417483460902

Tibor Blaho (@btibor91) on X

50 credits per dollar, minimum $20 and maximum $1000

(newly added to ChatGPT feature gates configuration)

#

pay as you go

torn mantle May 10, 2025, 9:56 AM

#

calm sequoia

it wasnt nerfed because its still not the last version/checkpoint of the 2.5 pro model

#

also gemini 2.5 pro 06 got better at video/image understanding & coding

#

so i wouldnt count it as a nerf tbh

keen beacon May 10, 2025, 9:57 AM

#

torn mantle it wasnt nerfed because its still not the last version/checkpoint of the 2.5 pro...

what

#

... relative to the last checkpoint, it has got worse in more areas than it has got better

torn mantle May 10, 2025, 9:58 AM

#

my bet is that we will have a newly added model on lmarena monday which will be like the last gemini 2.5 pro

torn mantle May 10, 2025, 9:58 AM

#

keen beacon ... relative to the last checkpoint, it has got worse in more areas than it has ...

yea in some areas, and in others it got better

#

its not like they wanted to

calm sequoia May 10, 2025, 11:17 AM

#

I switched to o3, so dor my use cases it was nerfed. We are speaking of current state and not future.

torn mantle May 10, 2025, 12:10 PM

#

calm sequoia I switched to o3, so dor my use cases it was nerfed. We are speaking of current ...

Its just that i cant judge a model without the reference being the stable version

#

There is a reason why they are still releasing models with experimental name to it

#

Ilya was spooked by everything

#

Even way back in gpt 3.5

faint grove May 10, 2025, 12:32 PM

#

it take to much time to load . and always show to much trafic problem

#

yup

#

i am using imarena

#

i wanted to use claude latest model

tall summit May 10, 2025, 1:25 PM

#

its from days ago

calm sequoia May 10, 2025, 1:27 PM

#

I think ilya was spooked by inference time compute. That was Q* or as we know it o1.

tall summit May 10, 2025, 1:37 PM

#

the answer is always "maybe"

high ginkgo May 10, 2025, 1:50 PM

#

is this a grok 3.5 reference

golden ocean May 10, 2025, 2:57 PM

#

I think the one that contains "AGI"

keen beacon May 10, 2025, 3:16 PM

#

my bet is on late next year

#

it'll be "just about" AGI

willow grail May 10, 2025, 3:18 PM

#

next time you see sufferers from anti-ai, clarify to them that they obviously have mental and behavioral disorders.

golden ocean May 10, 2025, 3:18 PM

#

do u not know what a reference is

wintry tinsel May 10, 2025, 3:31 PM

#

Q3 or 4 of 2027

drifting thorn May 10, 2025, 4:59 PM

#

I wonder how dangerous it would be to combine Coconut from Meta AI and Absolute Zero Reasoner

wintry tinsel May 10, 2025, 5:00 PM

#

One of the greatest impediments to AI progress is research papers not combining methods and testing them properly soon enough

#

Trying to scale one method for too long

leaden palm May 10, 2025, 5:09 PM

#

wintry tinsel One of the greatest impediments to AI progress is research papers not combining ...

reminds me of the exa claude hackathon

Discovery AI Agent Hackathon - Exa x Anthropic x AWS x Lightspeed ...

Your Mission
Can AI agents discover new science?
In this hackathon, your mission is to use AI agents to discover something novel about the world.
These days…

keen beacon May 10, 2025, 5:39 PM

#

so, christmas came early

#

i've been given access to a new set of anonymous models 👀

#

anyone got any good prompts?

leaden palm May 10, 2025, 5:40 PM

#

keen beacon i've been given access to a new set of anonymous models 👀

makes me wonder if they're from the ones who run io or from the ones who try to deliver right before io

keen beacon May 10, 2025, 5:41 PM

#

keen beacon i've been given access to a new set of anonymous models 👀

do u know which company theyre from? if u dont mind

#

not as of yet

leaden palm May 10, 2025, 5:41 PM

#

did you try asking them

keen beacon May 10, 2025, 5:41 PM

#

from my limited testing there appears to be at least 1 openai model though

#

i think they apply measures to prevent that

#

^

#

i've also been given access to reasoning traces this time

#

makes it a bit more interesting

keen beacon May 10, 2025, 5:42 PM

#

keen beacon i think they apply measures to prevent that

you can still figure it out tho

keen beacon May 10, 2025, 5:42 PM

#

keen beacon makes it a bit more interesting

oh it might make it easy

#

yeah i figured out the openai model because of the summary-like traces

keen beacon May 10, 2025, 5:42 PM

#

keen beacon yeah i figured out the openai model because of the summary-like traces

google is also adding them soonish

#

i saw on vertex

#

how many new models btw

#

gonna try these on a simplebench public set question all other models get wrong

keen beacon May 10, 2025, 5:43 PM

#

keen beacon how many new models btw

let me check

#

5 by the looks of it

#

it's possible some are the same model with different configs tho

leaden palm May 10, 2025, 5:44 PM

#

keen beacon i think they apply measures to prevent that

is there some past chat i should be reading up on for context about who "they" is and how it's possible to use a model on vertex without knowing what it is

keen beacon May 10, 2025, 5:44 PM

#

the vertex convo was unrelated

#

"they" in this case is just one of the platforms oai & deepmind (and other labs on occasion) use for red teaming before models are released

#

mind you this platform is different from the one i had o3 on

leaden palm May 10, 2025, 5:46 PM

#

ah ok

keen beacon May 10, 2025, 5:46 PM

#

keen beacon mind you this platform is different from the one i had o3 on

(that one was openai exclusive)

leaden palm May 10, 2025, 5:54 PM

#

keen beacon anyone got any good prompts?

here's one (claude thinking and o4 mini fail, testing o3 (albeit with the styles snippet out to fit within the context window) rn)

📎 prompt.txt

calm sequoia May 10, 2025, 6:01 PM

#

@keen beacon this prompt can be handled only be pre-nerf gemini 2.5, and, currently o3 and o4-mini. The answer is "infinity" or something like that. The prompt: "You take two carabiners and wrap a piece of high strength polyester webbing between them in 10 turns while there's as 30cm distance between carabiners. Webbing is conventional 25mm width strap used in industry. The on end is attached rigidly to the carabiner, the other is left free. How much time would such configuration hold until all the webbing would unroll itself if the applied tension from one carabiner to another is 100kg. Accidentally one side of the webbing was left hanging without attachment to anything - it is free to unroll. Answer only in how much time, no explanations.".

small haven May 10, 2025, 6:02 PM

#

leaden palm here's one (claude thinking and o4 mini fail, testing o3 (albeit with the styles...

whats the consensus on o3

leaden palm May 10, 2025, 6:03 PM

#

small haven whats the consensus on o3

wdym? what the community here thinks of it?

small haven May 10, 2025, 6:03 PM

#

leaden palm wdym? what the community here thinks of it?

i meant verdict lol

leaden palm May 10, 2025, 6:03 PM

#

o3 hasn't responded yet 💀

small haven May 10, 2025, 6:03 PM

#

dayum

leaden palm May 10, 2025, 6:03 PM

#

8 minutes of thinking

#

im probably in the low priority queue though

small haven May 10, 2025, 6:03 PM

#

nope, if its in traces, ur in for the ride

leaden palm May 10, 2025, 6:05 PM

#

504d after 10 minutes on the dot

#

let me try with stream: true

small haven May 10, 2025, 6:06 PM

#

f

#

oh ur using the api

leaden palm May 10, 2025, 6:11 PM

#

ok o3 got it but in a different way
it wasnt the way i did it but it definitely works and the side effects are minimal

small haven May 10, 2025, 6:12 PM

#

o3 built diff

elder rapids May 10, 2025, 6:13 PM

#

keen beacon i've been given access to a new set of anonymous models 👀

image access?

leaden palm May 10, 2025, 6:14 PM

#

leaden palm ok o3 got it but in a different way it wasnt the way i did it but it definitely ...

well, o3 high did
o3 medium didn't

keen beacon May 10, 2025, 6:14 PM

#

elder rapids image access?

❌ at least not on the frontend as of yet

elder rapids May 10, 2025, 6:15 PM

#

oh alr

small haven May 10, 2025, 6:17 PM

#

leaden palm well, o3 high did o3 medium didn't

im running urs on the chatgpt o3, gonna see

#

is this right

leaden palm May 10, 2025, 6:21 PM

#

small haven is this right

the explanation is better than o3 high, the fix is the same as o3 high

#

but 10 minutes of thinking damn

small haven May 10, 2025, 6:22 PM

#

oh damn alright

#

o1 pro timed out classic

keen beacon May 10, 2025, 6:56 PM

#

been doing some testing.. looks this set of anon models is diverse

#

there is at least 1 model from each of xAI, google, anthropic and openAI

#

it's a closed program

#

it's normally openai and deepmind models, xAI and anthropic i've never seen here

#

i'm testing the anthropic model

#

it's definitely different to claude 3.7 sonnet thinking

#

it thinks hard

wintry locust May 10, 2025, 7:01 PM

#

keen beacon it thinks *hard*

does it backtrack a lot

keen beacon May 10, 2025, 7:01 PM

#

you can deduce it if you're tactical

leaden palm May 10, 2025, 7:01 PM

#

keen beacon there is at least 1 model from each of xAI, google, anthropic and openAI

it would be so funny if io week turns into "the ai week"

keen beacon May 10, 2025, 7:01 PM

#

yeah there were more models than i thought

#

guess what

#

there are 8 i have access to

#

at least 3 are from deepmind

#

👀

keen beacon May 10, 2025, 7:01 PM

#

wintry locust does it backtrack a lot

let me check

keen beacon May 10, 2025, 7:02 PM

#

wintry locust does it backtrack a lot

here's a reasoning trace snippet -

The answer is 0 liters.

Ah. I thought of one more angle. I think I'm sure that we've established that the bucket has a hole in the bottom, and the top is sealed (the sheet of metal welded to the bucket is sealing the top). In which case, the bucket cannot store water. I believe that's the clearest, most straightforward interpretation of the question based on the wording.

Actually, as I'm re-reading this, "7 liter bucket" would imply that the bucket is, at least at some time, capable of holding 7 liters. If it had a perforated base such that it could never hold water at all, we wouldn't really talk about it having a 7 liter capacity, we'd just say that the metal cylinder has a volume of 7 liters. But by saying it's a bucket, there's an implication that it is at some point capable of holding water.

#

tl;dr - yes

wintry locust May 10, 2025, 7:02 PM

#

cool...

keen beacon May 10, 2025, 7:03 PM

#

yeah it's anthropic

#

its just a summary

#

i have here what looks like a raw reasoning trace for an openai anon model

wintry locust May 10, 2025, 7:04 PM

#

!?

#

can you please send that

keen beacon May 10, 2025, 7:04 PM

#

keen beacon i have here what looks like a raw reasoning trace for an openai anon model

show xd

#

hmm, maybe it isn't raw

#

only o1 preview release

#

one sec

#

then indirectly through the new phi 4 reasoning models

#

(o3 mini)

#

im assuming openai is gonna release their open source model with o3 mini traces/similar

#

why microsoft was cleared to do so

#

actually no maybe this is a raw trace

#

https://pastebin.com/r5MJtzQu

Pastebin

oai reasoning trace - Pastebin.com

Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.

wintry locust May 10, 2025, 7:08 PM

#

yeah no this is a summary

#

reads exactly like chatgpt summaries

leaden palm May 10, 2025, 7:08 PM

#

wintry locust reads exactly like chatgpt summaries

aren't they like 2x the length compared to this

#

well i guess the summary model/method/instructions could change

keen beacon May 10, 2025, 7:22 PM

#

its real ( but not from openai )

elder rapids May 10, 2025, 7:23 PM

#

calm sequoia <@456226577798135808> this prompt can be handled only be pre-nerf gemini 2.5, an...

0506 seems to get this right when you remove "no explanations" and it insists that "how much time until... it would unroll," implies it DOES unroll, and then argues if you engage in logical and physical override you need to answer why you should concede dynamic event modeling when answering the question (since it likes the idea of the answer being "Indefinite") and then proceeds to model the unroll process anyways

keen beacon May 10, 2025, 7:23 PM

#

a better example of a trace

elder rapids May 10, 2025, 7:25 PM

#

elder rapids 0506 seems to get this right when you remove "no explanations" and it insists th...

seems like this model has a different grasp on counterfactuals in rigor

#

didn't seem to be a problem in the 0325 variant

#

the overthinking is hurting it

calm sequoia May 10, 2025, 7:33 PM

#

elder rapids 0506 seems to get this right when you remove "no explanations" and it insists th...

This is actually very simple question that evybody who's into climbing, mountaineering, slacklining, etc. will answer. These words are there because it makes the question harder. I know you want to justify google that "there was no nerf". But it was and this is just one prompt out of N to prove that.

elder rapids May 10, 2025, 7:33 PM

#

crazy how I just said that was a nerfed aspect

#

😭

#

prolly didn't even understand why I said what I said

calm sequoia May 10, 2025, 7:34 PM

#

Didn't saw that, was reading from top to botto. 😄

#

I just remmembered last conversation

#

Ignore

elder rapids May 10, 2025, 7:35 PM

#

it isn't nerfed in the way you were saying it last convo tho, that's why I disagreed

#

but here ye

#

it's much worse

#

in applying practical

#

since even though it knows there are counterfactuals in favor of practicality, it wants to stick to exactly what the question says

#

that's what 0325 did so well, understanding exactly what its saying but taking it with a grain of salt

#

I suspect this is why 0506 is so good with roleplaying and stories tho

calm sequoia May 10, 2025, 7:37 PM

#

The interesting thing is that it's still good at coding. Overthinking == coding proefficiency? 🤔

keen beacon May 10, 2025, 7:38 PM

#

they focused on coding

elder rapids May 10, 2025, 7:38 PM

#

calm sequoia The interesting thing is that it's still good at coding. Overthinking == coding ...

it seems to be more aligned with its tool usage and has a complete grasp on its search ability and stuff, it's like a completely different model

#

when it comes to coding

#

different thought process

elder rapids May 10, 2025, 7:39 PM

#

elder rapids it seems to be more aligned with its tool usage and has a complete grasp on its ...

for example 0325 would speculate the search itself, while 0506 just inherently knows

#

and this seems to be the case for the code too

#

it doesn't "intend" to write the code in the way it did before

#

ngl I'm confused on why they didn't make flash the coding model focus, or even create a stronger 2.5 pro in coding

#

when they're going to release coding specific models anyways

#

this doesn't make any sense

#

this isn't speculation

#

lmao

leaden palm May 10, 2025, 7:43 PM

#

elder rapids May 10, 2025, 7:43 PM

#

Gemini coder ye

zinc ore May 10, 2025, 7:44 PM

#

Might release at IO

#

Unless they consider like a GA version of 2.5 to be the coder model

elder rapids May 10, 2025, 7:44 PM

#

zinc ore May 10, 2025, 7:44 PM

#

Since they specifically focused on coding for it.

elder rapids May 10, 2025, 7:45 PM

#

zinc ore Unless they consider like a GA version of 2.5 to be the coder model

ngl it is crazy at coding

#

I wouldn't be disappointed

zinc ore May 10, 2025, 7:45 PM

#

Yeah, it's been reported as a separate model

#

It's for people who get early access to unreleased stuff, like certain Twitter users

elder rapids May 10, 2025, 7:45 PM

#

yep

#

it's a real category

#

wtf even is NW

ocean vortex May 10, 2025, 7:46 PM

#

zinc ore It's for people who get early access to unreleased stuff, like certain Twitter u...

Dorklon

elder rapids May 10, 2025, 7:47 PM

#

gork 4

#

alr new theory

leaden palm May 10, 2025, 7:48 PM

#

elder rapids wtf even is NW

stands for nightwhisperer

#

let's see you joined march 24

elder rapids May 10, 2025, 7:48 PM

#

gork 3.5 → gork 4

same jump as gpt 3.5 → gpt 4

elder rapids May 10, 2025, 7:48 PM

#

leaden palm stands for nightwhisperer

I already know

#

im not asking what NW means as an acronym

#

I'm asking for nw's placement

#

given it was the best anon model

elder rapids May 10, 2025, 7:49 PM

#

elder rapids gork 3.5 → gork 4 same jump as gpt 3.5 → gpt 4

agi confirmed

#

btw

ocean vortex May 10, 2025, 7:49 PM

#

elder rapids gork 3.5 → gork 4 same jump as gpt 3.5 → gpt 4

gork is gonna be no reasoning, dork 4.0 = reasoning pro model

elder rapids May 10, 2025, 7:49 PM

#

this is fact they're the same numbers

#

grok 2 was pretty bad tho

leaden palm May 10, 2025, 7:50 PM

#

it benchmaxxed didnt it

elder rapids May 10, 2025, 7:50 PM

#

ye

elder rapids May 10, 2025, 7:50 PM

#

ocean vortex gork is gonna be no reasoning, dork 4.0 = reasoning pro model

damn

#

also mb

#

gork 3.5 is agi

#

dork 4 is asi

calm sequoia May 10, 2025, 7:50 PM

#

elder rapids this doesn't make any sense

I think they are trying different approaches at training and archite ture and doesn't really know what they'll get after the training 😄

elder rapids May 10, 2025, 7:51 PM

#

ye seems to be the case

#

it's crazy how massive of a leap 2.5 pro was tho

maiden fulcrum May 10, 2025, 7:53 PM

#

Hello everyone

#

I am new here 🙂

keen beacon May 10, 2025, 7:57 PM

#

new model: drakesclaw

zinc ore May 10, 2025, 7:58 PM

#

keen beacon new model: `drakesclaw`

Which company?

golden ocean May 10, 2025, 7:59 PM

#

yes

thorny drum May 10, 2025, 7:59 PM

#

Maybe but it took them a much longer time to achieve it

golden ocean May 10, 2025, 7:59 PM

#

elder rapids gork 3.5 → gork 4 same jump as gpt 3.5 → gpt 4

no

thorny drum May 10, 2025, 8:00 PM

#

What was crazy was how quick they made that leap

#

Between like flash thinking and 2.5

golden ocean May 10, 2025, 8:00 PM

#

gpt 4 preview → gpt 3.5 → gpt 4 release

keen beacon May 10, 2025, 8:00 PM

#

zinc ore Which company?

zinc ore May 10, 2025, 8:00 PM

#

Was my guess

golden ocean May 10, 2025, 8:01 PM

#

"gpt 4o"

zinc ore May 10, 2025, 8:01 PM

#

I'd say openAI and xai do it more

thorny drum May 10, 2025, 8:02 PM

#

Anthropic the least

#

Probably like xai > google > OpenAI > anthropic

misty vault May 10, 2025, 8:02 PM

#

https://tenor.com/view/glados-portal2-gif-26847902

Tenor

zinc ore May 10, 2025, 8:03 PM

#

I'd rank it xai > openAI > Google > anthropic

#

Lol @ no one defending xai

leaden palm May 10, 2025, 8:04 PM

#

misty vault https://tenor.com/view/glados-portal2-gif-26847902

please stop before you have to be stopped

misty vault May 10, 2025, 8:11 PM

#

leaden palm please stop before you have to be stopped

high ginkgo May 10, 2025, 8:11 PM

#

please stop before you have to be stopped

misty vault May 10, 2025, 8:12 PM

#

please stop before you have to be stopped

small haven May 10, 2025, 8:13 PM

#

claude code max >>

#

some twitter post

keen beacon May 10, 2025, 8:16 PM

#

ur cheating on o3 pro bro

small haven May 10, 2025, 8:16 PM

#

claude code max >> o3 pro

#

only problem with claude code is that it overwrites tests and even remove them to have them passed..

wintry tinsel May 10, 2025, 8:38 PM

#

zinc ore I'd rank it xai > openAI > Google > anthropic

If Anthropic is so cutting edge why are they not getting the funding and attention they deserve they should be more active and posting better models

#

It’s not right that the most advanced models get left behind due to misconceptions and hype

zinc ore May 10, 2025, 8:38 PM

#

I don't think anthropic models are the most advanced

#

That ranking was just about benchmaxxing

balmy mist May 10, 2025, 8:39 PM

#

o3 out yet?

wintry tinsel May 10, 2025, 8:39 PM

#

zinc ore I don't think anthropic models are the most advanced

They are though

zinc ore May 10, 2025, 8:39 PM

#

o3 and 2.5 are better than 3.7

#

Well, depends specifically what at ig

earnest parcel May 10, 2025, 8:40 PM

#

while 2.5 produces good code, it's annoying to work with, as it writes instructions inside the code as comments, instead of outside the codeblocks. of course can be counteracted with prompts, but its an annoying as hell default behaviour.

wintry tinsel May 10, 2025, 8:40 PM

#

o3 is narrow domain and expensive, 2.5 pro is I would say across the board on par with 3.7

#

It does produce better code technically yeah

golden ocean May 10, 2025, 8:41 PM

#

earnest parcel while 2.5 produces good code, it's annoying to work with, as it writes instructi...

thats what i was about to say but i cancelled sending that message because I only used it for coding so idk if its overall worse or better than claude

#

But damn gemini (03-25 at least) is so cancerous and annoying to work with as a coding assistant

#

switched back to claude 3.7

#

in webdev arena u have to do 0 coding so it gets high in lb

#

but highly annoying and ineffecient to work with in practice if u aren't generating absolutely 100% of ur code by ai

leaden palm May 10, 2025, 8:43 PM

#

golden ocean But damn gemini (03-25 at least) is so cancerous and annoying to work with as a ...

did you try the new version

golden ocean May 10, 2025, 8:43 PM

#

not for coding yet so can't say about that

earnest parcel May 10, 2025, 8:43 PM

#

the code commentary is present in 05-06 btw, just had it today, so i switched to claude

earnest parcel May 10, 2025, 8:47 PM

#

golden ocean in webdev arena u have to do 0 coding so it gets high in lb

webdev arena is just react, tailwind, frontend. I don't utilize any of that and I could care less how frontend looks if the backend is crap. but if that's your stack its a useful comparison tool. just not universally applicable.

golden ocean May 10, 2025, 8:47 PM

#

FRRR

#

world after gpt-4 deprecation

#

https://tenor.com/view/anime-kingdom-burn-fire-gif-11687404

Tenor

keen beacon May 10, 2025, 8:47 PM

#

1106 is gpt 4 turbo which is smaller than og gpt 4

#

gpt 4 32k is og gpt 4

golden ocean May 10, 2025, 8:48 PM

#

pre nerf gpt 4

#

but it is now normalized to make models smaller and super fast instead of smart and wise like gpt-4 (idk what im saying)

high ginkgo May 10, 2025, 8:50 PM

#

Agreed 1000x

high ginkgo May 10, 2025, 8:51 PM

#

misty vault https://tenor.com/view/glados-portal2-gif-26847902

gpt-4-32k makes me want to do whatever is in this gif

#

I will do anything to get gpt-4-32k back

#

I will let them put camera in every room of my house if it is data they want

#

Just give me gpt-4-32k🙏

golden ocean May 10, 2025, 8:52 PM

#

It is true though

high ginkgo May 10, 2025, 8:52 PM

#

yes

#

no lol

torn mantle May 10, 2025, 8:53 PM

#

keen beacon new model: `drakesclaw`

doesnt seem great

#

but im guessing this is like a recent pro checkpoint

high ginkgo May 10, 2025, 8:53 PM

#

the lighting is off because my friend photo shopped it, multiple pictures in one or whatever he did (night time background etc)

golden ocean May 10, 2025, 8:53 PM

#

gpt 4 is agi

#

not funny

high ginkgo May 10, 2025, 8:56 PM

#

I liked the style of old chatgpt

#

is-there-any-way-to-access-older-versions-of-chatgpt-v0-lb5lcztvuc9b1.png

#

Like this

#

(I unintentionelly copied the windows keys one from google but it is the best picture that shows the style)

#

gork 3.5 is not real and not agi anymore

small haven May 10, 2025, 9:00 PM

#

high ginkgo

this is so nostalgic

golden ocean May 10, 2025, 9:01 PM

#

small haven this is so nostalgic

Yes because it reminds us of the golden ages before gpt 4o

#

came into existence

zinc ore May 10, 2025, 9:07 PM

#

For benchmaxxing

#

My goodness you dudes do not read

blazing rune May 10, 2025, 9:08 PM

#

What are the best models for text adventure games?

leaden palm May 10, 2025, 9:08 PM

#

zinc ore My goodness you dudes do not read

i think the confusion is fair given how nobody in the decent space above the message said "benchmaxxing"

blazing rune May 10, 2025, 9:08 PM

#

oh

#

idc about that

zinc ore May 10, 2025, 9:08 PM

#

leaden palm i think the confusion is fair given how nobody in the decent space above the mes...

#general message

blazing rune May 10, 2025, 9:08 PM

#

I just want it to be creative

high ginkgo May 10, 2025, 9:09 PM

#

leaden palm i think the confusion is fair given how nobody in the decent space above the mes...

Bro made sure we saw the gif by opening the blocked message

blazing rune May 10, 2025, 9:09 PM

#

and have few or no usage limits

#

iirc Grok without thinking has decent limits

zinc ore May 10, 2025, 9:09 PM

#

leaden palm i think the confusion is fair given how nobody in the decent space above the mes...

The entire convo is about benchmaxxing and guy above me gives their benchmaxxing ranking too

#

So I give mine right after

blazing rune May 10, 2025, 9:10 PM

#

too many limits sadly

#

otherwise I would use it

leaden palm May 10, 2025, 9:10 PM

#

zinc ore The entire convo is about benchmaxxing and guy above me gives their benchmaxxing...

and somehow everybody in the convo (after the first message) said "it" instead of "benchmaxxing"

keen beacon May 10, 2025, 9:10 PM

#

2.5 pro can retain coherence for longer even if it isnt the most 'creative'

blazing rune May 10, 2025, 9:11 PM

#

the web ui is very limited, and the api is too expensive.

#

lm arena is a terrible chat interface

blazing rune May 10, 2025, 9:11 PM

#

keen beacon 2.5 pro can retain coherence for longer even if it isnt the most 'creative'

but it's very slow

zinc ore May 10, 2025, 9:11 PM

#

leaden palm and somehow everybody in the convo (after the first message) said "it" instead o...

Literally just 3 messages cut off from your ss

#general message

blazing rune May 10, 2025, 9:11 PM

#

it takes 20 seconds for each response

#

and I can't tell it to think shorter, afaik

leaden palm May 10, 2025, 9:11 PM

#

i will die on this hill

small haven May 10, 2025, 9:11 PM

#

blazing rune too many limits sadly

u will love claude code max

blazing rune May 10, 2025, 9:12 PM

#

?

zinc ore May 10, 2025, 9:12 PM

#

It clearly referring to benchmaxing stated by Craig three messages prior

blazing rune May 10, 2025, 9:12 PM

#

that is very expensive

leaden palm May 10, 2025, 9:12 PM

#

i will die on this hill

zinc ore May 10, 2025, 9:12 PM

#

With several users understanding the convo being about benchmaxxing

small haven May 10, 2025, 9:12 PM

#

blazing rune ?

max context, virtually no limits, max tool calls

high ginkgo May 10, 2025, 9:12 PM

#

My bank account will not love claude code max

small haven May 10, 2025, 9:12 PM

#

u need my api bud?

high ginkgo May 10, 2025, 9:12 PM

#

yes

small haven May 10, 2025, 9:12 PM

#

its $200/mo

blazing rune May 10, 2025, 9:13 PM

#

that's insanely expensive

#

for individuals

#

and hobbyists

small haven May 10, 2025, 9:13 PM

#

just build a trading system with it, ull break even

keen beacon May 10, 2025, 9:13 PM

#

tbh its a good offer considering claude pricing if u can afford it ( no way im personally ever paying for it though )

blazing rune May 10, 2025, 9:13 PM

#

trading what?

#

crypto?

#

stocks?

small haven May 10, 2025, 9:13 PM

#

anything

blazing rune May 10, 2025, 9:13 PM

#

that requires even more money as an investment, and it is very risky

cedar tide May 10, 2025, 9:14 PM

#

New models on the arena ?

blazing rune May 10, 2025, 9:14 PM

#

I know 1 guy who used Claude to make a trading bot (Claude did the code, but the guy told it exactly what to make, so not really vibe coding), and he made thousands in profit

#

He gave me about 100 bucks in solana

#

I still have it because I hoard things

#

but he also spent thousands to invest

small haven May 10, 2025, 9:15 PM

#

i mean if ur not using llms to make money, ur practically wasting ur time

blazing rune May 10, 2025, 9:15 PM

#

small haven i mean if ur not using llms to make money, ur practically wasting ur time

well, I don't know how to use them to make money

#

most options are already taken by others

#

and I'm not creative enough for figuring out ideas myself

#

he never said it is

earnest parcel May 10, 2025, 9:17 PM

#

small haven i mean if ur not using llms to make money, ur practically wasting ur time

huh, I haven't made a single cent and used llms for hundreds of hours! what a waste of time! people aren't allowed to use llms recreationally!

blazing rune May 10, 2025, 9:18 PM

#

dubesor, do you know which models are best at text adventures

#

since you have a benchmark

#

I would assume you know of a few

#

besides claude (too expensive and the webui is very limited)

earnest parcel May 10, 2025, 9:19 PM

#

i don't really do much subjective creative writing, so I am not sure. I personally like older Mistral models, but who knows.

blazing rune May 10, 2025, 9:20 PM

#

idk what that is

#

I don't use LM Arena much myself

keen beacon May 10, 2025, 9:20 PM

#

paws dont u do text adventures?

keen beacon May 10, 2025, 9:20 PM

#

blazing rune idk what that is

i think it was llama 4 maverick

#

the human preference one

blazing rune May 10, 2025, 9:20 PM

#

oh, that's trash

keen beacon May 10, 2025, 9:20 PM

#

paws what do you normally do text adventures on?

#

gemini is probably the best free option tbh from the context size, best overall model (imo), limits, etc rn

#

i first got into ai (gpt-3.5/gpt4) primarily because of text adventures lmao

#

heard of ai dungeon but never really used it. after gpt-4 i moved onto claude and wow it blew me away back then

blazing rune May 10, 2025, 9:27 PM

#

I wish I could use Claude

#

the API is so expensive

keen beacon May 10, 2025, 9:28 PM

#

yea

blazing rune May 10, 2025, 9:28 PM

#

but it's so creative

keen beacon May 10, 2025, 9:28 PM

#

i still main claude for creative stuff

high ginkgo May 10, 2025, 9:28 PM

#

I got gemini 2.5 to imitate sydney

#

But fine tune would work better as I have like a gigantic dataset of syd/bing conversations

#

But idk any place to fine tune sota models for free

keen beacon May 10, 2025, 9:29 PM

#

you can finetune an old gemini version i think

high ginkgo May 10, 2025, 9:29 PM

#

and gemini 2.5 doesnt support fine tuning yet right?

blazing rune May 10, 2025, 9:31 PM

#

I'm not gonna wait that long

keen beacon May 10, 2025, 9:31 PM

#

thats a very conservative estimate imho lol

blazing rune May 10, 2025, 9:32 PM

#

Is there anything as creative as claude?

keen beacon May 10, 2025, 9:33 PM

#

nope lmao

blazing rune May 10, 2025, 9:33 PM

#

Gemini is the closest I know of

high ginkgo May 10, 2025, 9:33 PM

#

quantum neural network

blazing rune May 10, 2025, 9:33 PM

#

and it's far behind

#

Is 3.5 Haiku any good for creative stuff?

keen beacon May 10, 2025, 9:33 PM

#

its eh

#

the primary issue is how dumb it is

leaden palm May 10, 2025, 9:34 PM

#

blazing rune Is there anything as creative as claude?

there was some guy here a while ago that was very sure that one of the gpt 4o tunes (when sama said they were making it creative?) was better than claude

blazing rune May 10, 2025, 9:35 PM

#

that's stupid

#

gpt 4o is terrible

keen beacon May 10, 2025, 9:35 PM

#

they were/are experimenting with creative writing something. not necessarily on the gpt 4o instruct model released

blazing rune May 10, 2025, 9:35 PM

#

it's probably going to be "slop creative writing"

keen beacon May 10, 2025, 9:35 PM

#

there was a tweet by them about it i dont remember much tho

blazing rune May 10, 2025, 9:36 PM

#

not true creative writing

keen beacon May 10, 2025, 9:36 PM

#

did you also try deepseek btw?

#

it can be creative

blazing rune May 10, 2025, 9:36 PM

#

no good providers

#

I tried chutes deepseek 0324 (v3.1)

keen beacon May 10, 2025, 9:36 PM

#

atp sft your own creative writing model

blazing rune May 10, 2025, 9:36 PM

#

it was dumb

keen beacon May 10, 2025, 9:36 PM

#

🤣

blazing rune May 10, 2025, 9:37 PM

#

I don't have any examples of what I like

#

I guess I could make some with Claude

high ginkgo May 10, 2025, 9:37 PM

#

blazing rune gpt 4o is terrible

Fine tuning a sota model based on urself would be insane

blazing rune May 10, 2025, 9:37 PM

#

but again, the price

blazing rune May 10, 2025, 9:37 PM

#

high ginkgo Fine tuning a sota model based on urself would be insane

?

keen beacon May 10, 2025, 9:38 PM

#

grok 3 is kinda bad, but potentially salvagable i guess. (if ur only doing single turn) alongside rejection sampling. you get $150 in credits per month

high ginkgo May 10, 2025, 9:38 PM

#

blazing rune ?

oh didn't mean to react

#

forgot to cancel that

blazing rune May 10, 2025, 9:39 PM

#

I know some people use the official deepseek API directly instead of openrouter

torn mantle May 10, 2025, 9:39 PM

#

keen beacon grok 3 is kinda bad, but potentially salvagable i guess. (if ur only doing singl...

grok 3 is unusable

blazing rune May 10, 2025, 9:39 PM

#

I doubt it's any good though

torn mantle May 10, 2025, 9:39 PM

#

and grok 3.5 will probably be the same

#

i dont expect anything from them tbh

keen beacon May 10, 2025, 9:39 PM

#

torn mantle grok 3 is unusable

yea but u get $150 in credits

#

per month

torn mantle May 10, 2025, 9:39 PM

#

keen beacon yea but u get $150 in credits

wdym

#

how

leaden palm May 10, 2025, 9:39 PM

#

keen beacon yea but u get $150 in credits

do you need to put $5 in first

keen beacon May 10, 2025, 9:40 PM

#

yea i believe

#

look at xai docs, i havent personally tried it yet

leaden palm May 10, 2025, 9:40 PM

#

is it worth it when i already have github models (gpt-4.1 variants, o1/o3/o4 variants) and openrouter (free hosts for open models)

golden ocean May 10, 2025, 9:41 PM

#

Does grok support fine tuning

keen beacon May 10, 2025, 9:41 PM

#

leaden palm is it worth it when i already have github models (gpt-4.1 variants, o1/o3/o4 var...

i mean why not tbh more models with relatively high rate limits i think

#

if ur comfortable with sharing details/prompts with xai

leaden palm May 10, 2025, 9:42 PM

#

i suppose the limits are higher

keen beacon May 10, 2025, 9:42 PM

#

xai kinda a sh1tshow tho makes me hesitant a little anyway lol 🤣

small haven May 10, 2025, 9:47 PM

#

day 12 since elon musk said grok 3.5 would be release in 1 week

keen beacon May 10, 2025, 9:50 PM

#

you should do both xd

#

it depends it could make it more confusing

#

increase the temperature and keep retrying until it works 🤣

#

( i don't recommend doing that tho )

keen beacon May 10, 2025, 10:28 PM

#

id think you'd know better than me. idk lol

#

ask gemini

south cove May 10, 2025, 11:01 PM

#

^ good mythology joke haha

#

ask it to create a game design and technical document for your game that goes over full implementation: rules, player input, display, file hierarchy, other architecture, etc.
put this document in a text file, then put your full code (one script after another) after it
copy+paste this entire file into a new instance of gemini

#

gemini seems to basically implode after a while, especially when tackling a complex task, and starting a new instance fixes it

#

ty!

leaden palm May 10, 2025, 11:11 PM

#

why do you want to fit 40mb of context in there

#

i don't think increasing the max context will fix that

#

well i don't think that will fix it either

#

issue is probably something else more fundamental like the architecture or long context training samples

elder rapids May 10, 2025, 11:32 PM

#

are you saying Google benchmarkmaxxes?

#

ngl Craig you been trolling a lot lately

#

you've been saying some crazy obviously untrue anti Google things

#

1.5 pro had 2m context but that's it tbh

keen beacon May 10, 2025, 11:34 PM

#

2.0 pro also had 2m context

elder rapids May 10, 2025, 11:35 PM

#

I don't remember it having 2m context

#

I remember it maintaining what 1206 had

#

which varied from 1m to 32k

#

and the preceding model of 1206 of which I can't remember the name of had 32k

keen beacon May 10, 2025, 11:36 PM

#

1206 had 2m context iirc

elder rapids May 10, 2025, 11:36 PM

#

that was a bug

keen beacon May 10, 2025, 11:36 PM

#

no

elder rapids May 10, 2025, 11:36 PM

#

ye

#

it had temporarily 2m

#

but it would cap at 1m

#

they fixed it and showed 1m

keen beacon May 10, 2025, 11:37 PM

#

so much misinfo

elder rapids May 10, 2025, 11:38 PM

#

keen beacon so much misinfo

it's literally true lmao

#

it had released with 2m

#

and got capped to 1m

#

and temporarily had 32k as well and then got reverted to 1m

#

1121 had a 32k

leaden palm May 10, 2025, 11:39 PM

#

i mean you can use rope or yarn or whatever they call it to extend any model to infinite context

#

llama claims to have 10m

#

deepseek's ui supposedly has no limit

#

its just it might not be good

elder rapids May 10, 2025, 11:40 PM

#

Google been doing sum about the context

#

let's see if they make more leaps

keen beacon May 10, 2025, 11:41 PM

#

elder rapids and got capped to 1m

no

keen beacon May 10, 2025, 11:41 PM

#

elder rapids it's literally true lmao

https://xcancel.com/OfficialLoganK/status/1865081419015352689

Nitter

Logan Kilpatrick (@OfficialLoganK)

Gemini-exp-1206, our latest Gemini iteration, (with the full 2M token context and much more) is available right now for free in Google AI Studio and the Gemini API.

I hope you have enjoyed year 1 of the Gemini era as much as I have. We are just getting started : )

#

they never decreased it to 1m afaik

frosty lark May 10, 2025, 11:41 PM

#

I still don't get the love for claude (outside coding).

At least in lmarena when I get it to battle, the answers are so terse and deprived of "how I got the answer" that I almost never vote for it. I mean sure, sometimes I am interested in the direct results without the argument behind it, but often the argument behind the result is quite important (especially if I ask for it)

blazing rune May 10, 2025, 11:42 PM

#

frosty lark I still don't get the love for claude (outside coding). At least in lmarena whe...

that's a system prompt issue I think

#

it's like that on Claude.ai too

#

it is annoying

#

Claude can be very creative

leaden palm May 10, 2025, 11:42 PM

#

frosty lark I still don't get the love for claude (outside coding). At least in lmarena whe...

claude the product is better in some ways than claude the llm

frosty lark May 10, 2025, 11:42 PM

#

yeah but then on social media one has a ton of people saying "lol how is it possible that claude scores so low"? Well because it is uncovincing more often than not.

elder rapids May 10, 2025, 11:43 PM

#

leaden palm deepseek's ui supposedly has no limit

that's not true

keen beacon May 10, 2025, 11:43 PM

#

it might just truncate stuff to fit the window

elder rapids May 10, 2025, 11:44 PM

#

it legit tells you "deepseek can only read % of your document"