#general | Arena | Page 34

keen beacon May 1, 2025, 1:33 PM

#

yeah its a beast for indistribution tasks. some things it can be especially weak on

#

its so fcking fast

balmy mist May 1, 2025, 1:33 PM

#

and its chea right?

#

cheap*

#

like is it better than flash 2.5?

keen beacon May 1, 2025, 1:34 PM

#

balmy mist like is it better than flash 2.5?

probably not

balmy mist May 1, 2025, 1:34 PM

#

lol

keen beacon May 1, 2025, 1:34 PM

#

if u can run it locally tho 😄

#

i think the pricing will continue to drop though

#

its kinda expensive rn. 0.3m/tok (it isnt a lot, but qwq 32b is 0.2m/tok)

balmy mist May 1, 2025, 1:35 PM

#

yeah that has to drop

#

you think r2 will be cheaper?

#

that would be silly right lol

keen beacon May 1, 2025, 1:35 PM

#

idk what to expect with r2

balmy mist May 1, 2025, 1:35 PM

#

lol

alpine coral May 1, 2025, 2:17 PM

#

it's interesting with the qwen models - using on their site, it seems thinking can be enabled on all the models (except qwq, where it's permanently on)

#

i was playing around yesterday

#

i dunno how the thinking would interact with functinon calling

#

could be great; or tricky

#

but yeah fwiw they are solid models - esp the smaller ones that can be hosted locally (though im largely going by there benchmarks there..)

#

can the 30b one be hosted locally (like with relative ease)?

keen beacon May 1, 2025, 2:22 PM

#

alpine coral can the 30b one be hosted locally (like with relative ease)?

Yeah it's still relatively fast on the cpu

#

It's moe

alpine coral May 1, 2025, 2:22 PM

#

right ofc

keen beacon May 1, 2025, 2:22 PM

#

Only 3b active

alpine coral May 1, 2025, 2:22 PM

#

i should pay attention to the last digit in the name

#

so it's functionally less intensive than hosting the dense 4b variant?

#

or wait.. is that also moe?

#

~~i'm not sure but i feel like OG 4 was, then turbo was MoE~~

#

but really not sure

keen beacon May 1, 2025, 2:25 PM

#

yes

keen beacon May 1, 2025, 2:25 PM

#

alpine coral so it's functionally less intensive than hosting the dense 4b variant?

kind of. requires more memory though. if u can fit in vram, its super fast

#

yea

#

it was

#

sam tweeted saying theyre storing it somewhere for historians

alpine coral May 1, 2025, 2:27 PM

#

lol

keen beacon May 1, 2025, 2:27 PM

#

(but og gpt 4 is long gone on chatgpt, they deprecated gpt 4 turbo on chatgpt LOL)

#

yes

#

gpt 4 was seminal, personally i think o1 preview was next

#

unpopular opinion i guess

alpine coral May 1, 2025, 2:28 PM

#

yeah

#

test time compute was kinda a paridgm shift

#

which is still playing out

#

like.. reasoning models are annoying and unneccasrry a lot of the time ha

keen beacon May 1, 2025, 2:30 PM

#

unpopular opinion: i think models still need to think even more lol

alpine coral May 1, 2025, 2:30 PM

#

not sure about stupid,, it was kinda obvious; like 'give the model more time to 'think' rather than blurting out the first tokens it predicts as the answer'

keen beacon May 1, 2025, 2:31 PM

#

it wasnt just that it was rule based rl too

#

its a combination of multiple things

alpine coral May 1, 2025, 2:31 PM

#

yeah i know i'm simplifying

#

yeah i agree

#

it's kimda like brute force a lot of the time

#

rather than 'thinking'

#

but that said, a lot of the time, it works perfectly and makes good sense

keen beacon May 1, 2025, 2:32 PM

#

i think people try to apply how they think they think to models too much

#

i personally think this is it but its another i think unpopular opinion

unborn ocean May 1, 2025, 2:34 PM

#

keen beacon i think people try to apply how they think they think to models too much

well the thing is that we don't actually know how we think, so no one can really criticize the current thought process itself for much

keen beacon May 1, 2025, 2:34 PM

#

unborn ocean well the thing is that we don't actually know how we think, so no one can really...

exactly

unborn ocean May 1, 2025, 2:35 PM

#

imho the great thing about TTC is not just length and problem solving, but also variability of response length and adjusting it to economic incentives
(because our brain also does that on a more complicated level)

alpine coral May 1, 2025, 2:36 PM

#

it literally just gives the model time to 'work through' something; it's one completion

willow grail May 1, 2025, 2:36 PM

#

qwen3 trash?

alpine coral May 1, 2025, 2:36 PM

#

divided with <thinking> tagssetc

unborn ocean May 1, 2025, 2:36 PM

#

not just with thought lenght but also "compute" for each "token" our brain produces, as we are more like some f*d up complicated RNN (with differing compute for each "token")

alpine coral May 1, 2025, 2:37 PM

#

but i don't think that's how it works

#

reasoning 'effort' is misleading language

#

reasoning 'token budget' is much better

#

imo

keen beacon May 1, 2025, 2:38 PM

#

with effort, like openai's reasoning effort, its not really limited by budget. the model is specifically tuned to produce different lengths of chain of thought, not a strict limit

#

high produces the most, med produces less, low produces even less

alpine coral May 1, 2025, 2:38 PM

#

so they;'re 3 discrete models?

keen beacon May 1, 2025, 2:39 PM

#

not necessarily these behaviors could be tuned directly into a single model. activation could be from a special token, specific instructions, etc.

alpine coral May 1, 2025, 2:39 PM

#

i mean maybe

#

for o1 pro, i totally agree it's more than just 'unlimited reasoning tokens' - there's actually more to it

#

i'm not conviinced low/med/high is any different from selecting a corresponding value on flash 2.5

#

same model

keen beacon May 1, 2025, 2:41 PM

#

alpine coral i'm not conviinced low/med/high is any different from selecting a corresponding ...

flash 2.5 has no idea what the thinking budget is

#

its literally cut off at the thinking budget

alpine coral May 1, 2025, 2:41 PM

#

lol this is true

#

i will cede you that ha

keen beacon May 1, 2025, 2:42 PM

#

you can see what i mean by 'specific instructions' and how they can trigger specific behaviors (intentionally trained in) by sending /no_think or /think to qwen3 models if u dont understand what i mean

alpine coral May 1, 2025, 2:42 PM

#

right, but it's still functionally the same

#

google has just implemented it poorly

#

(prob why other providers didn't go with a floating value)

keen beacon May 1, 2025, 2:43 PM

#

google didnt implement it at all. its just a programmatic max token limit on the thinking contents

#

the model has no idea

alpine coral May 1, 2025, 2:44 PM

#

im not saying it does

keen beacon May 1, 2025, 2:44 PM

#

it doesnt even try to think less with a smaller thinking budget unless u give extra instructions

alpine coral May 1, 2025, 2:45 PM

#

it's principally the same. if the model is fine tuned to adhere to cues or whatever then clearly it works well, and that;s eaiser to do if it's low/med/high vs 0-n tokens. but fundamentally it's the same

unborn ocean May 1, 2025, 2:46 PM

#

true, even modified ones won't be enough

alpine coral May 1, 2025, 2:46 PM

#

the model has a greater or lesser token allowance to use before delivering its final resposne

unborn ocean May 1, 2025, 2:46 PM

#

although the core concept of attention will likely remain in what ever we do for the time being

rugged brook May 1, 2025, 2:49 PM

#

willow grail qwen3 trash?

yes

keen beacon May 1, 2025, 2:50 PM

#

alpine coral it's principally the same. if the model is fine tuned to adhere to cues or whate...

in qwen3 /think / /no_think are trained in model behaviors. (theyll stop the model from thinking or not on qwen3) google 2.5 flash doesn't have behaviors like this trained in by default unless u prompt it which is different. (u might get it to think less for example)
sonnet's antml:max_thinking_length antml:max_thinking_length is provided to the model as a prompt instruction on the api/claude.ai/etc. it's also a trained in behavior, even if the value is somewhat arbitrary - it is unable to actually count the thinking tokens (it's like asking a model to provide a specific word count, unless it actually counts it, it just gives it a rough idea about how much to write. the model is aware of this on top of a programmatic max token limit in the thinking block)

the visible result might as well be roughly the same but the things happening with the model are quite different

unborn ocean May 1, 2025, 2:52 PM

#

but we will likely have jagged ai (what ethan mollick calls it) that is human level in some areas using transformers

#

^that is my guess (although one could argue that we have already reached that point)

alpine coral May 1, 2025, 3:01 PM

#

keen beacon 1. in qwen3 `/think` / `/no_think` are trained in model behaviors. (theyll stop ...

yes i think we're talking one another...i understand the situation re google (setting the value for the budget does nothing other than potentially truncate the model's thinking). and with anthropic what you describe reminds me of the yap score ha (like as if the models are actually trying to or capable of counting their words to adhere to day's 'yap score').
my point is that, say google impemented in a way worked, i dunno how, but it adhered to the value set - in functional terms you would have the same thing that oai does with their reasoning models (low/med/high), just yeah at a more granular level

#

i mean if we agree they at are fundamentally the same model, any fine tuning to get adherence to reasoning token allowance is not what gives the performance; it's all ultimately about the compute used during inference right

keen beacon May 1, 2025, 3:03 PM

#

alpine coral yes i think we're talking one another...i understand the situation re google (se...

it depends on the task, on some tasks it might as well work fine. but cutting it off randomly in the middle of the thought process whilst its thinking about something is not optimal i would think

alpine coral May 1, 2025, 3:04 PM

#

i'm not advocating for that ha

keen beacon May 1, 2025, 3:04 PM

#

so i think reasoning efforts are better, you don't set an arbitrary token limit and instead the model is aware it needs to reason less/more/even more. then a sonnet like implementation would be slightly worse

alpine coral May 1, 2025, 3:04 PM

#

i mean i assume google's current implementation won't be around for long.. it is literally pointless

alpine coral May 1, 2025, 3:05 PM

#

keen beacon so i think reasoning efforts are better, you don't set an arbitrary token limit ...

yeah like i said it makes more sense form a fine tuning perspective - way more practical having 3 levels

willow grail May 1, 2025, 3:06 PM

#

who has tooth pain at least once a month? i can help

alpine coral May 1, 2025, 3:06 PM

#

i don't know how they would do it with a dyanamic value like google's.. the more i think about it

keen beacon May 1, 2025, 3:06 PM

#

alpine coral yeah like i said it makes more sense form a fine tuning perspective - way more p...

i dont think u can really fine tune a model to arbitrarily do a specific amount of tokens when thinking. it's a vague thing/the model learns a vague direction that it has a "time limit" (models literally say this sometimes)

alpine coral May 1, 2025, 3:07 PM

#

yeah tbh i dunno... i've wandered well beyond my depth ha

keen beacon May 1, 2025, 3:07 PM

#

alpine coral i don't know how they would do it with a dyanamic value like google's.. the more...

you can do it like how anthropic does which has a dynamic value i believe. i find openai's approach way more practical and optimal though

alpine coral May 1, 2025, 3:10 PM

#

i guess anthropic only has 32k and 64k? i mean 32=low, 64=high.. what's trained in exactly ha? is it meant to be dynamic?

#

oh

keen beacon May 1, 2025, 3:11 PM

#

alpine coral i guess anthropic only has 32k and 64k? i mean 32=low, 64=high.. what's trained ...

you can set it to anything, but i believe they fine tuned in specific values (32k, 64k) to give a model a better sense of direction with that value

#

theres pretraining associations that would help but finetuning it allows it to be more aligned to how anthropic wants it/model awareness beyond the programmatic max token limit

#

even without finetuning giving a model any kind of indicator will help/be more aligned to what optimal model behavior with a given thinking budget should be

alpine coral May 1, 2025, 3:14 PM

#

right i see..

The budget_tokens parameter determines the maximum number of tokens Claude is allowed to use for its internal reasoning process. Larger budgets can improve response quality by enabling more thorough analysis for complex problems, although Claude may not use the entire budget allocated, especially at ranges above 32K.

#

i didn't realise that with anthropic's api

#

i mean again though, the principle is the same.. we're fundamentally talking about (attempts) to govern the amount of inference/test time used to generate the response

#

or in google's case, lack of attempts so far (aside from slapping it onto 2.5 flash as a hyper-parameter and nothing more ha)

#

grok-mini-3 felt similar to 2.5 flash in terms of setting the reasoning budget (didn't seem to do anything at all; not even result in truncated reasoning if set low)

#

though that may be have been on openrouter's end or something

keen fulcrum May 1, 2025, 3:27 PM

#

https://huggingface.co/microsoft/Phi-4-reasoning
phi 4 reasoning has been released

microsoft/Phi-4-reasoning · Hugging Face

#

#

balmy mist May 1, 2025, 3:32 PM

#

thats trained from qwen?

#

it seems kinda underwhelming tho

keen beacon May 1, 2025, 3:33 PM

#

balmy mist thats trained from qwen?

Distilled from o3 mini

keen beacon May 1, 2025, 3:34 PM

#

balmy mist it seems kinda underwhelming tho

Wym it's pretty good for a 14b

ocean vortex May 1, 2025, 3:40 PM

#

alpine coral i guess anthropic only has 32k and 64k? i mean 32=low, 64=high.. what's trained ...

Thinking budget is way different to low-high. First is coding implementation where you more or less force it to stop and proceed with final response, 2nd is the entire model optimised and trained for exclusively short or long reasoning.

keen beacon May 1, 2025, 3:41 PM

#

ocean vortex Thinking budget is way different to low-high. First is coding implementation whe...

Sonnet is a mix though

#

Flash is purely programmatic

keen fulcrum May 1, 2025, 3:46 PM

#

It does better than qwq32b

great results for a 14b model

alpine coral May 1, 2025, 3:46 PM

#

ocean vortex Thinking budget is way different to low-high. First is coding implementation whe...

what's the end result? less or more tokens used for thinking during inference...

#

the goals are the same

#

how is this hard lol

keen beacon May 1, 2025, 3:47 PM

#

keen fulcrum It does better than qwq32b great results for a 14b model

Qwq 32b is old tho

worthy thunder May 1, 2025, 3:51 PM

#

Context Arena: Added more Anthropic results for 2needle tests. (https://x.com/DillonUzar/status/1917968783395655757)

See all results at: https://contextarena.ai
You can also hover over a score in the table, which will then show a button to explore the individual test results/answers.

Relative AUC @ 128k 2needle scores (select models shown):

GPT-4.1: 61.6%
Gemini 2.0 Flash: 56.0%
Claude 3.7 Sonnet: 55.9%
Claude 3.7 Sonnet (Thinking): 55.5%
Grok 3 Mini (Low): 54.8%
Claude 3.0 Haiku: 52.9%
Llama 4 Maverick: 52.7%
Claude 3.5 Sonnet: 51.2%
Grok 3 Mini (High): 50.3%
Claude 3.5 Haiku: 50.0%

Some quick notes:

Pretty consistent performance across 3.0, 3.5, and 3.7. Impressive.
No noticeable difference between Claude 3.7 Sonnet and Sonnet Thinking.
All perform around or above GPT-4.1 Mini for context lengths <= 128k.
Claude 3.0 Haiku had the best overall Model AUC of the Anthropic models tested, but only by the tiniest amount (had the smallest drop between context lengths).
Around Gemini 1.5/2.0 Flash, Grok 3 Mini, and Llama 4 Maverick in overall performance.

Disclosure: The companies I work with use Claude 3.0 Haiku extensively (one of the ones we use the most to power some services). Comparing the latest models against the original Haiku was one of the goals of this website originally.

Enjoy.

#

Also added Qwen3 14B and 8B to the results from last night.

alpine coral May 1, 2025, 3:54 PM

#

worthy thunder Context Arena: Added more Anthropic results for 2needle tests. (https://x.com/Di...

4.1 holds up very well doesn't it

#

anthropic used to dominate both long context and retrieval imo

#

has kinda now lost both to google and oai (or just to google.. looking at the bar chart.. geez it does well.. )

wintry tinsel May 1, 2025, 4:19 PM

#

the masses yearn for Claude 4 Opus

#

in this dark time of Google and closed AI only a Claude hero can save us

raven void May 1, 2025, 4:37 PM

#

Gemini pro and flash cooked by non reasoning 4.1

#

Google is a whole generation behind tbh, OpenAI isn't releasing 4.1o and 4.1o mini to give Google a chance to fight back

unborn ocean May 1, 2025, 4:38 PM

#

wintry tinsel in this dark time of Google and closed AI only a Claude hero can save us

So the Amazon and Google backed company is coming to the rescue from the mega corps?

wintry tinsel May 1, 2025, 4:51 PM

#

unborn ocean So the Amazon and Google backed company is coming to the rescue from the mega co...

They are coming to the rescue from boring and flat models that don't obey system prompts

oblique flint May 1, 2025, 4:56 PM

#

people call openai closedai but anthropic is even more closed than openai tbh. At least openai made clip, whisper and gpt 2 open weights. Anthropic did literally nothing open as far as Im aware, except maybe mcp if you count that (although its not a model)

wintry tinsel May 1, 2025, 4:56 PM

#

they are but they are so much easier to jailbreak and get them to do what I ask of them

#

open AI models never cease to put me to sleep and overcharge

keen fulcrum May 1, 2025, 4:57 PM

#

raven void Gemini pro and flash cooked by non reasoning 4.1

You are stupid
Are you reading headlines from a subreddit only?

wintry tinsel May 1, 2025, 4:58 PM

#

keen fulcrum You are stupid Are you reading headlines from a subreddit only?

are you questioning the journalist skills of redditors?

keen fulcrum May 1, 2025, 4:58 PM

#

Indeed

golden ocean May 1, 2025, 5:04 PM

#

cwaude

cedar tide May 1, 2025, 5:18 PM

#

raven void Gemini pro and flash cooked by non reasoning 4.1

O3 and o4 mini very bad

flint sand May 1, 2025, 5:22 PM

#

raven void Gemini pro and flash cooked by non reasoning 4.1

why is it ranked so low on livebench though

#

4.1 i mean

keen fulcrum May 1, 2025, 5:48 PM

#

https://fxtwitter.com/AnthropicAI/status/1917972747000692919
https://fxtwitter.com/AnthropicAI/status/1917972753916797111
Their deep research is really great

Anthropic (@AnthropicAI)

Today we're announcing Integrations, a new way to connect your apps and tools to Claude.
︀︀
︀︀We're also expanding Claude's Research capabilities with an advanced mode that searches the web, your Google Workspace, and now your Integrations too.

**💬 47 🔁 130 ❤️ 1.1K 👁️ 88.4K **

▶ Play video

Anthropic (@AnthropicAI)

Claude now automatically determines when to search and how deeply to investigate.
︀︀
︀︀With Research mode toggled on, Claude researches for up to 45 minutes across hundreds of sources (including connected apps) before delivering a report, complete with citations.

**💬 2 🔁 11 ❤️ 133 👁️ 14.5K **

▶ Play video

trim pecan May 1, 2025, 6:00 PM

#

Guys any limitations on https://beta.lmarena.ai/?

small haven May 1, 2025, 6:07 PM

#

Is claude pro unlimited

olive mesa May 1, 2025, 6:16 PM

#

lmao. no annoying emojis, minimalist, straight to the point

willow grail May 1, 2025, 6:17 PM

#

olive mesa lmao. no annoying emojis, minimalist, straight to the point

dumb users will prefer emojis ^^ like the current chatgpt4o

#

and majority of humans are very dumb. see usa. sorry others in usa.

golden ocean May 1, 2025, 6:42 PM

#

willow grail dumb users will prefer emojis ^^ like the current chatgpt4o

FR

ocean vortex May 1, 2025, 6:49 PM

#

alpine coral the goals are the same

Ehm I would disagree actually. With budget you are forcing the model to end thinking prematurely relative to what it has learned during RL training. With a standalone model it has learned to take the most out of the given reasoning lengths it arrived at 'naturally'

#

so like it may get hung up on small irrelevant details and not see the entire picture because it has no clue it's gonna be forced to shorten it

#

I think why it works is that in most cases some reasoning context is still better than no reasoning context at all + the base model is no slouch if we look at the models with this implementation. But I would say that is less than ideal

#

reasoning budget sounds great in theory, but there are obvious limitations to it. We can look at it from another angle too - if limiting the budget does not lead to notably worse performance then maybe your RL training is not very good or efficient as well

small haven May 1, 2025, 6:57 PM

#

im not gonna lie, maverick cooked

exotic kernel May 1, 2025, 6:58 PM

#

btw just interested, does LMArena provides credits to research papers for the propriety models?

ocean vortex May 1, 2025, 6:59 PM

#

small haven im not gonna lie, maverick cooked

it's absolutely insane that they went with complete retrains, new models only chat instead of reasoning lol

llama3 was in a much better shape relative to competition than llama4. Why chase diminishing returns?

#

every benchmark came be gamed when that's your sole focus

#

but there's a reason public version is nowhere near that

#

that's why you can't really effectively game most of them

#

cause there are plenty

#

and improving 1 screws up with the rest

small haven May 1, 2025, 7:01 PM

#

yes lol

small haven May 1, 2025, 7:02 PM

#

ocean vortex it's absolutely insane that they went with complete retrains, new models only ch...

i think its cuz oai and other top ai labs took their talent lmao

#

dont tease me bro

#

ive lost faith in sam

elder rapids May 1, 2025, 7:15 PM

#

dude is anti Gemini lmao, read some of his messages

zinc ore May 1, 2025, 7:15 PM

#

They've literally done this a million times too

golden ocean May 1, 2025, 7:15 PM

#

zinc ore May 1, 2025, 7:15 PM

#

SORA SOON!!! Never releases it

small haven May 1, 2025, 7:15 PM

#

golden ocean

hahah

zinc ore May 1, 2025, 7:16 PM

#

Show off random tech and make big promises about how it's the greatest thing ever

golden ocean May 1, 2025, 7:16 PM

#

I didnt say that

zinc ore May 1, 2025, 7:16 PM

#

Then hardly ever release it

golden ocean May 1, 2025, 7:16 PM

#

ask @willow grail

#

@willow grail ru gonna let this slide?

small haven May 1, 2025, 7:16 PM

#

craig is a genius, dont underestimate him guys, hes actually in the right end side of the iq distribution

golden ocean May 1, 2025, 7:16 PM

#

ok buddy

elder rapids May 1, 2025, 7:16 PM

#

raven void Gemini pro and flash cooked by non reasoning 4.1

yo how did I know this guy would say this

#

bro is saying 4.1 mini is better than o3 mini high

#

yeah no sh

#

the point is

#

it's the same dude

#

screenshotting out of context benchmarks

#

he filtered livebench coding without considering what livebench is measuring (competitive coding) and said Google is behind

zinc ore May 1, 2025, 7:18 PM

#

"to give Google a fighting chance" dudes love roleplaying online with their fan narratives

elder rapids May 1, 2025, 7:18 PM

#

you can't unironically believe that

#

😭

#

yeah anyone could do that

#

but not everyone disingenuously postures it like that

golden ocean May 1, 2025, 7:18 PM

#

hes on dont disturb

#

dont disturb him bro

zinc ore May 1, 2025, 7:18 PM

#

elder rapids he filtered livebench coding without considering what livebench is measuring (co...

And livebench has been getting trashed on for their coding ranking, on both Twitter and Reddit

elder rapids May 1, 2025, 7:18 PM

#

raven void Google is a whole generation behind tbh, OpenAI isn't releasing 4.1o and 4.1o mi...

#

like bro

#

what is this 😭 I'm dead

misty vault May 1, 2025, 7:20 PM

#

Yes because all the immigrants got kicked out

small haven May 1, 2025, 7:20 PM

#

~~bronx~~ mahanttan

#

facts

#

o3 > o1 pro, so u live in between bronx and upper west

#

o3 pro lives in the woods

#

living the outback farm life

#

wtf

willow grail May 1, 2025, 7:27 PM

#

you should be happy that they are not species from other genus of homo.... then there would be even more stupid people

torn mantle May 1, 2025, 7:29 PM

#

willow grail you should be happy that they are not species from other genus of homo.... then ...

what does Berberine 2g do?

willow grail May 1, 2025, 7:29 PM

#

torn mantle what does Berberine 2g do?

make u become a teenager with a metabolism of a rocket

torn mantle May 1, 2025, 7:29 PM

#

willow grail make u become a teenager with a metabolism of a rocket

lies

willow grail May 1, 2025, 7:29 PM

#

torn mantle lies

no

torn mantle May 1, 2025, 7:29 PM

#

im already young

#

so i dont need it

#

who paid you?

#

you going to recommend a certain label now?

willow grail May 1, 2025, 7:30 PM

#

torn mantle you going to recommend a certain label now?

label?

#

nutrition label? what

torn mantle May 1, 2025, 7:30 PM

#

yea

#

brand

#

une marque

willow grail May 1, 2025, 7:31 PM

#

i would say there is 88% dumb people if any other species of genus homo wuld still exist

torn mantle May 1, 2025, 7:31 PM

#

oh

#

you are ignoring me now

#

caught

willow grail May 1, 2025, 7:34 PM

#

thats why we cant have nire things. matters a lot.

#

nice*

#

u must vote for ... monkeys?

#

if you explain this. thanks

small haven May 1, 2025, 7:39 PM

#

hey guys

#

if ur workflow dont look like this, ur ngmi

torn mantle May 1, 2025, 7:43 PM

#

any new models on arena?

#

https://x.com/cursor_ai/status/1917982557070868739

Cursor (@cursor_ai) on X

The models developers prefer:

#

this seems right

#

although idk about gpt4.1

#

dont take berberin 2g/day

willow grail May 1, 2025, 7:45 PM

#

torn mantle https://x.com/cursor_ai/status/1917982557070868739

2.5 not on first place? cursor is trash

keen beacon May 1, 2025, 7:45 PM

#

30 grams*

calm sequoia May 1, 2025, 7:45 PM

#

How? I used in AIStudio but the gpt interface seems lacking

keen fulcrum May 1, 2025, 7:45 PM

#

Which deep research feature is the best currently?

#

Doesn't this deserve a separate leaderboard?

#

What makes you consider ChatGPT DR over Grok and Claude?

willow grail May 1, 2025, 7:47 PM

#

what searcher tool is best

#

for health, medicine,

#

i am not rich

#

ew

#

the hallucination machine?

#

really???

torn mantle May 1, 2025, 7:48 PM

#

tbh ive been using grok deep search

willow grail May 1, 2025, 7:48 PM

#

yorue so close to block

torn mantle May 1, 2025, 7:48 PM

#

it seems ok

#

o3 will straight up lie on ur face

#

and you need to re-check that

willow grail May 1, 2025, 7:48 PM

#

...... nope...... 50% of its sources it cant remember

torn mantle May 1, 2025, 7:48 PM

#

ive been using pplx since day 1

willow grail May 1, 2025, 7:49 PM

#

pp is hallucination machine

torn mantle May 1, 2025, 7:49 PM

#

they dont seem to focus on their main objective anymore

#

which is providing a good search results

#

its so bad rn

willow grail May 1, 2025, 7:49 PM

#

50% of its text is hallu.

torn mantle May 1, 2025, 7:49 PM

#

yea but it doesnt give you the depth you are looking for

keen fulcrum May 1, 2025, 7:49 PM

#

So the basic search feature for free plans isn't really useful, I don't like that its using bing data

#

Yes

torn mantle May 1, 2025, 7:49 PM

#

why would i need a service that gives me same results as an offline LLM

keen fulcrum May 1, 2025, 7:50 PM

#

I think Claude may be the best option currently, closely followed by Kagi, Google and Grok

torn mantle May 1, 2025, 7:50 PM

#

bing/microsoft dont have their own product

keen fulcrum May 1, 2025, 7:51 PM

#

(not 2 minutes ago)

torn mantle May 1, 2025, 7:51 PM

#

ive never felt msft had their own made ai product

small haven May 1, 2025, 7:51 PM

#

anyone know the limits on claude research

torn mantle May 1, 2025, 7:51 PM

#

small haven anyone know the limits on claude research

is this the improved version?

#

they said they updated their research tool

small haven May 1, 2025, 7:51 PM

#

dont know i just wanted the max

keen fulcrum May 1, 2025, 7:51 PM

#

Is it only in the max plan?

small haven May 1, 2025, 7:51 PM

#

gonna compare

keen fulcrum May 1, 2025, 7:53 PM

#

Are DR that much better than their standard search tools on the free plan?
They are terrible in my experience

#

How many sources do you get?

small haven May 1, 2025, 7:53 PM

#

lol it asks just like oai dr

willow grail May 1, 2025, 7:54 PM

#

sam died..

hollow ocean May 1, 2025, 7:54 PM

#

small haven lol it asks just like oai dr

Do sports betting next

small haven May 1, 2025, 7:55 PM

#

we gon see

torn mantle May 1, 2025, 7:55 PM

#

small haven lol it asks just like oai dr

lets see

#

how bad it is

small haven May 1, 2025, 7:57 PM

#

jesus christ

#

wait it searches in parallel

keen fulcrum May 1, 2025, 7:59 PM

#

Its great they waited to implement the feature so well thought through
One of the best features they released, still disappointed that its only for max users.

small haven May 1, 2025, 7:59 PM

#

if it says microsoft im gonna punch my monitor hahaha

keen fulcrum May 1, 2025, 8:00 PM

#

Will grok 3.5 be available early on lmarena?

small haven May 1, 2025, 8:02 PM

#

421 sources now

torn mantle May 1, 2025, 8:03 PM

#

small haven 421 sources now

doesnt mean sh

#

show us results

torn mantle May 1, 2025, 8:04 PM

#

keen fulcrum Will grok 3.5 be available early on lmarena?

yea

#

prob tomorrow yea

small haven May 1, 2025, 8:04 PM

#

torn mantle show us results

running oai dr at the same time

torn mantle May 1, 2025, 8:04 PM

#

small haven running oai dr at the same time

does it estimate when it will finish?

small haven May 1, 2025, 8:04 PM

#

wait.. is this claude research the one that takes 45 mins lol

small haven May 1, 2025, 8:04 PM

#

torn mantle does it estimate when it will finish?

no

hollow ocean May 1, 2025, 8:05 PM

#

It will take 45 mins

small haven May 1, 2025, 8:05 PM

#

bruh

hollow ocean May 1, 2025, 8:07 PM

#

Ask Claude research who will win Knicks or the pistons today

#

Let’s see how good it is

#

If it gets it right it’s good

small haven May 1, 2025, 8:07 PM

#

hollow ocean Ask Claude research who will win Knicks or the pistons today

not gonna wait 45 mins for that lol

hollow ocean May 1, 2025, 8:07 PM

#

small haven not gonna wait 45 mins for that lol

Leave it on the background

small haven May 1, 2025, 8:07 PM

#

and idk if theres limits

#

its still going for a good almost 20 mins

#

but stuck at 421

#

sources

torn mantle May 1, 2025, 8:08 PM

#

welp

#

rekt

#

can you try a prompt of mine when it finishes?

#

🥺

small haven May 1, 2025, 8:09 PM

#

k

torn mantle May 1, 2025, 8:09 PM

#

ty

hollow ocean May 1, 2025, 8:10 PM

#

#

No limits

keen fulcrum May 1, 2025, 8:10 PM

#

small haven not gonna wait 45 mins for that lol

Isn't it deciding automatically how long it searches and how many sources are necessary
unless you specify that

keen beacon May 1, 2025, 8:12 PM

#

Wait what

keen fulcrum May 1, 2025, 8:12 PM

#

hollow ocean

What is taking so long here, fetching content?

keen beacon May 1, 2025, 8:12 PM

#

Is Grok 3.5 already out?

#

ain’t no way

hollow ocean May 1, 2025, 8:12 PM

#

keen fulcrum What is taking so long here, fetching content?

Wym

keen beacon May 1, 2025, 8:12 PM

#

Does anyone have access to grok 3.5

#

like with a subscription

torn mantle May 1, 2025, 8:13 PM

#

keen beacon Does anyone have access to grok 3.5

not yet xd

keen fulcrum May 1, 2025, 8:14 PM

#

Do you have an insider or how do you know

torn mantle May 1, 2025, 8:14 PM

#

we are just predicting it will be added on lmarena tomorrow

hollow ocean May 1, 2025, 8:14 PM

#

https://tenor.com/view/aliens-ancient-explaining-history-channel-gif-12727090721965691110

Tenor

keen beacon May 1, 2025, 8:14 PM

#

ohh

#

bro grok 3.5 is gonna be insane

keen beacon May 1, 2025, 8:14 PM

#

keen beacon bro grok 3.5 is gonna be insane

i think

keen fulcrum May 1, 2025, 8:14 PM

#

Why not Mars

keen beacon May 1, 2025, 8:14 PM

#

u think

#

better than

#

2.5 pro and o3

#

whats that

#

damn

small haven May 1, 2025, 8:17 PM

#

oai dr, same prompt, took 6 mins, claude dr is going for 30 mins now hahah

keen fulcrum May 1, 2025, 8:17 PM

#

LLMs in space will be really useful

torn mantle May 1, 2025, 8:18 PM

#

small haven oai dr, same prompt, took 6 mins, claude dr is going for 30 mins now hahah

interesting

#

o3 no?

small haven May 1, 2025, 8:18 PM

#

📎 message.txt

torn mantle May 1, 2025, 8:18 PM

#

since the issues are more reasoning oriented

keen fulcrum May 1, 2025, 8:19 PM

#

small haven oai dr, same prompt, took 6 mins, claude dr is going for 30 mins now hahah

This is largely dependant on source selection and how its written too

#

And the question

small haven May 1, 2025, 8:19 PM

#

thats openai dr

#

claude dr is still running

torn mantle May 1, 2025, 8:19 PM

#

@small haven can you try this : Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.

small haven May 1, 2025, 8:19 PM

#

torn mantle <@931708065319907338> can you try this : Based on a critical synthesis of recent...

lmaoo

torn mantle May 1, 2025, 8:19 PM

#

small haven lmaoo

do it puss

small haven May 1, 2025, 8:19 PM

#

im gonna get banned , arent i

torn mantle May 1, 2025, 8:19 PM

#

eh

#

no

#

ozone?

small haven May 1, 2025, 8:20 PM

#

torn mantle <@931708065319907338> can you try this : Based on a critical synthesis of recent...

wow it didnt even ask for 3 questions

#

claude kinda cooked

keen fulcrum May 1, 2025, 8:21 PM

#

small haven wow it didnt even ask for 3 questions

Can you run multiple at once?

small haven May 1, 2025, 8:21 PM

#

yea

keen fulcrum May 1, 2025, 8:21 PM

#

Interesting

small haven May 1, 2025, 8:21 PM

#

stocks dr is still running

torn mantle May 1, 2025, 8:21 PM

#

small haven wow it didnt even ask for 3 questions

well i gave it everything on the prompt

small haven May 1, 2025, 8:21 PM

#

yea with oai dr, it would ask some questions

#

always

torn mantle May 1, 2025, 8:21 PM

#

mm i see

keen fulcrum May 1, 2025, 8:22 PM

#

Now you gotta compare all providers and write an article, people will read it

small haven May 1, 2025, 8:22 PM

#

torn mantle May 1, 2025, 8:22 PM

#

small haven

yea just high quality data

#

pmc/pubmed/sciencedirect

#

thats good so far

willow grail May 1, 2025, 8:23 PM

#

torn mantle <@931708065319907338> can you try this : Based on a critical synthesis of recent...

health? lol how generic. only asi can answer such vague stuff

#

u need more precision babe

small haven May 1, 2025, 8:23 PM

#

im not sure, but im about to hit 40 mins with the stocks one

keen fulcrum May 1, 2025, 8:23 PM

#

I would love to know how it handles political questions (how biased are the sources)

willow grail May 1, 2025, 8:24 PM

#

small haven im not sure, but im about to hit 40 mins with the stocks one

huh?

torn mantle May 1, 2025, 8:24 PM

#

willow grail health? lol how generic. only asi can answer such vague stuff

thats why i want to see how it will compare them

keen fulcrum May 1, 2025, 8:25 PM

#

Anthropic is probably using exa under the hood

willow grail May 1, 2025, 8:25 PM

#

torn mantle thats why i want to see how it will compare them

berberine is just for getting a metabolism like a teenager

#

i know that...

torn mantle May 1, 2025, 8:25 PM

#

i could explicitly ask it to compare their antimicrobial activity/anti-inflammatory prop/anticancer/cardioprotective effects/immunomodulatory effects/gut health

willow grail May 1, 2025, 8:25 PM

#

💋

torn mantle May 1, 2025, 8:25 PM

#

willow grail 💋

wtf

#

why are you kissing people now?

willow grail May 1, 2025, 8:25 PM

#

🫃

willow grail May 1, 2025, 8:26 PM

#

torn mantle wtf

calm down girl

#

take a snickers

#

you are not you \

#

something which does exist. cause gender is in brain, not in your genitals

torn mantle May 1, 2025, 8:26 PM

#

willow grail berberine is just for getting a metabolism like a teenager

thats the whole idea, if you are asked to chose just one that you will get the maximum benefits of on multiple areas

willow grail May 1, 2025, 8:27 PM

#

torn mantle thats the whole idea, if you are asked to chose just one that you will get the m...

if u can reset ur microbiome then ull get all health benefits

torn mantle May 1, 2025, 8:27 PM

#

propolis probably has metabolism effects as well

#

not sure about resveratrol

willow grail May 1, 2025, 8:27 PM

#

a person who says they are a man while having female genitals. no, its not a mental disease

torn mantle May 1, 2025, 8:27 PM

#

as it was heavily studied for cvd

#

are you pregnant?

#

im confused

#

😖

willow grail May 1, 2025, 8:27 PM

#

XD no eww

#

children ew

torn mantle May 1, 2025, 8:28 PM

#

willow grail 🫃

you are

#

whats this

willow grail May 1, 2025, 8:28 PM

#

torn mantle whats this

a pregnant male

torn mantle May 1, 2025, 8:28 PM

#

idk

#

im confused as well

willow grail May 1, 2025, 8:28 PM

#

u can do both

#

lul

#

u can tho.

#

we dont care what u think. that is irrelevant

#

phobic statement cause u would never say this to a straight person.

#

pls think about whatu say

torn mantle May 1, 2025, 8:29 PM

#

@small haven any progress

willow grail May 1, 2025, 8:30 PM

#

dna is changed daily

torn mantle May 1, 2025, 8:30 PM

#

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

#

isnt this why we get diseases

willow grail May 1, 2025, 8:30 PM

#

perhaps

torn mantle May 1, 2025, 8:30 PM

#

or at least increase the risk

willow grail May 1, 2025, 8:31 PM

#

trans is literally the best

small haven May 1, 2025, 8:31 PM

#

torn mantle <@931708065319907338> any progress

stocks aint even done, so obv not urs

torn mantle May 1, 2025, 8:31 PM

#

small haven stocks aint even done, so obv not urs

wtf

willow grail May 1, 2025, 8:31 PM

#

small haven stocks aint even done, so obv not urs

wtf u mean./ what u doing with claude search

#

its searching for 4 hours for one task?

small haven May 1, 2025, 8:31 PM

#

it def hit the 40 mins mark rn

willow grail May 1, 2025, 8:31 PM

#

what prompt

#

tell it to hurry up

small haven May 1, 2025, 8:31 PM

#

what stock to buy lol

willow grail May 1, 2025, 8:31 PM

#

the company is closing for today

#

tell it

small haven May 1, 2025, 8:32 PM

#

lmao

willow grail May 1, 2025, 8:32 PM

#

to grow the f up

torn mantle May 1, 2025, 8:32 PM

#

wait

#

did u change ur pfp

#

something changed

#

in the chat

willow grail May 1, 2025, 8:32 PM

#

torn mantle did u change ur pfp

yes asura

torn mantle May 1, 2025, 8:32 PM

#

why

willow grail May 1, 2025, 8:32 PM

#

ps asura is the server called where kill on sight is legal

torn mantle May 1, 2025, 8:32 PM

#

you are an interesting person

willow grail May 1, 2025, 8:32 PM

#

absolutel trash server

torn mantle May 1, 2025, 8:32 PM

#

oh no

willow grail May 1, 2025, 8:32 PM

#

in "the isle"

torn mantle May 1, 2025, 8:33 PM

#

who made berberine mad

#

you did

#

stop lying

#

omg

#

you again

willow grail May 1, 2025, 8:34 PM

#

like what? that u believe transgender is a mental issue?

#

i dont know why u believe that

#

ozone

#

?

#

its faster if utell me.

small haven May 1, 2025, 8:38 PM

#

claude is still going like wtf

keen fulcrum May 1, 2025, 8:40 PM

#

small haven claude is still going like wtf

Try setting a max time for deep research

small haven May 1, 2025, 8:41 PM

#

this guy said 1hr before it just gave up lol

torn mantle May 1, 2025, 8:42 PM

#

small haven claude is still going like wtf

lol

#

better be good

torn mantle May 1, 2025, 8:42 PM

#

small haven this guy said 1hr before it just gave up lol

lmao

small haven May 1, 2025, 8:43 PM

#

yea it just got released today, and it did say beta, im prolly got get timed

keen fulcrum May 1, 2025, 8:46 PM

#

I don't have a problem if it takes 1 day for an advanced query
It would take me significantly longer to answer such questions.

#

It would be great knowing whether Claude is great in physics and astrophysics. Grok is known to be tailored for that use case

small haven May 1, 2025, 8:49 PM

#

boutta hit an hour lol

#

ok bud

#

legend says hes still trying to a pic of it in oai playground

balmy mist May 1, 2025, 8:53 PM

#

we not getting it, this is what you get for trolling me about it 😂

small haven May 1, 2025, 8:53 PM

#

who wants it?

sweet tinsel May 1, 2025, 8:53 PM

#

I would like to see it.

small haven May 1, 2025, 8:54 PM

#

#

58mins

#

im gatekeeping it guys sorry! its too good

torn mantle May 1, 2025, 8:55 PM

#

small haven

me

#

send

#

is it better than oai dr?

small haven May 1, 2025, 8:56 PM

#

📎 message.txt

torn mantle May 1, 2025, 8:56 PM

#

ive heard deepseek is also working on their own research feature

small haven May 1, 2025, 8:57 PM

#

guys after 1hr of deep anal research, just buy nvidia!

keen fulcrum May 1, 2025, 8:58 PM

#

small haven

What about less known players
any info about them

small haven May 1, 2025, 8:58 PM

#

keen fulcrum What about less known players any info about them

its in that file

#

https://claude.ai/public/artifacts/b49f1044-f6ce-4990-a41d-26604f46cbfc

#

better view

torn mantle May 1, 2025, 9:00 PM

#

2.7 vs 2.2 trillion

small haven May 1, 2025, 9:01 PM

#

dont worry the bermecellin whatveer is coming

torn mantle May 1, 2025, 9:02 PM

#

small haven dont worry the bermecellin whatveer is coming

WHEN

small haven May 1, 2025, 9:03 PM

#

torn mantle WHEN

pay me $5 and ill release it!

torn mantle May 1, 2025, 9:05 PM

#

small haven pay me $5 and ill release it!

$5 + $5 = copilot sub

#

$5 + $5 + $5 + $5 = cgpt sub

small haven May 1, 2025, 9:07 PM

#

my head hurts

torn mantle May 1, 2025, 9:09 PM

#

this is taking too long

small haven May 1, 2025, 9:09 PM

#

ya not a fan of long dr tbh

torn mantle May 1, 2025, 9:09 PM

#

well ive tried it on grok/oai/google hopefully it provides something new

#

or at least interesting

#

of who

#

gemini?

small haven May 1, 2025, 9:11 PM

#

this is TTD stock the one it recommended lmao

#

trade desk

#

Snow too i think is like this? haha

#

wish it gave me some quantum computing stocks at least ..

torn mantle May 1, 2025, 9:17 PM

#

STILL NOT FINISHED????????????????

#

holy

small haven May 1, 2025, 9:18 PM

#

nope

#

what if it takes 2 hrs lol

torn mantle May 1, 2025, 9:20 PM

#

if it takes 10h i would pay you that $5

small haven May 1, 2025, 9:20 PM

#

https://chatgpt.com/share/6813e578-eb4c-8012-976b-f07d475cdac9

ChatGPT

ChatGPT - Health Compound Comparison

Shared via ChatGPT

#

this is oai dr if u want it

torn mantle May 1, 2025, 9:20 PM

#

thanks

#

lol straight up to the point

small haven May 1, 2025, 9:21 PM

#

haha

torn mantle May 1, 2025, 9:23 PM

#

thats actually a good report

small haven May 1, 2025, 9:24 PM

#

almost done

torn mantle May 1, 2025, 9:24 PM

#

oh

#

finally

#

😖

small haven May 1, 2025, 9:29 PM

#

im gonna ask wen o3 pro

torn mantle May 1, 2025, 9:29 PM

#

no shot

small haven May 1, 2025, 9:33 PM

#

its done

#

berberine is the answer

#

..

#

lol

torn mantle May 1, 2025, 9:34 PM

#

fk

#

share pls

small haven May 1, 2025, 9:34 PM

#

https://claude.ai/public/artifacts/a8a2d065-ddf4-4acf-853d-ddfd9a2fe15e

torn mantle May 1, 2025, 9:34 PM

#

probably cuz it has more RCT trials

small haven May 1, 2025, 9:35 PM

#

who wins tho, oai dr or claude dr

#

bruh 1h 13m

torn mantle May 1, 2025, 9:37 PM

#

im still reading

#

thanks btw

#

the "The absorption problem" part is kinda interesting to know

#

i dont think other models talked about that

#

although it focused on their specific known markers

#

unlike oai

keen fulcrum May 1, 2025, 9:46 PM

#

Is it possible to access these via the grok api?

torn mantle May 1, 2025, 9:46 PM

#

it went one by one which is what i wanted

#

i also liked how oai dr compared same parameters through diff studies

small haven May 1, 2025, 9:47 PM

#

ok so in other words? oai or claude lol

torn mantle May 1, 2025, 9:48 PM

#

maybe :

oai dr
claude
grok3
gemini

#

yea oai dr is way better tbh

small haven May 1, 2025, 9:48 PM

#

sheesh

keen fulcrum May 1, 2025, 9:48 PM

#

Perplexity

torn mantle May 1, 2025, 9:48 PM

#

keen fulcrum 5. Perplexity

100000000000.

keen fulcrum May 1, 2025, 9:51 PM

#

You should genuinely try deep research of kagi when its out of closed beta
its great

torn mantle May 1, 2025, 9:53 PM

#

keen fulcrum You should genuinely try deep research of kagi when its out of closed beta its g...

you have it?

small haven May 1, 2025, 10:00 PM

#

perplexity ceo is cringe

hollow ocean May 1, 2025, 10:01 PM

#

torn mantle yea oai dr is way better tbh

I agree

#

Ranking is accurate

torn mantle May 1, 2025, 10:15 PM

#

small haven perplexity ceo is cringe

yep

hollow ocean May 1, 2025, 10:23 PM

#

He got money tho

small haven May 1, 2025, 10:36 PM

#

rip shareholders

#

rather put my money with thinking machines

hollow ocean May 1, 2025, 10:41 PM

#

https://x.com/josh9817/status/1916705298846204205?s=46&t=AH7sIlIv16Z3Kdb6j3cjfg

Josh (@Josh9817) on X

at this point, I'm pretty convinced that no one working at PerplexityAI actually uses their product besides doing 1 query a day on the default search bar. this is some dogshit-tier product now. Aravind has billions in funding but pennies in taste and quality control.

#

https://x.com/mckaywrigley/status/1880728997346382118?s=46&t=AH7sIlIv16Z3Kdb6j3cjfg

Mckay Wrigley (@mckaywrigley) on X

Perplexity is teetering towards being pure slop.

Your competitors have caught up to you while you waste time with nonsensical headline-baiting marketing drivel.

There’s now no reason for me to use your stagnant product.

This won’t be popular, but they need a wake up call.

small haven May 1, 2025, 10:45 PM

#

onlyfans ceo has a better chance lmao

tall summit May 1, 2025, 11:05 PM

#

worthy thunder Context Arena: Added more Anthropic results for 2needle tests. (https://x.com/Di...

oh you have a website now

ember rapids May 2, 2025, 3:26 AM

#

small haven perplexity ceo is cringe

Hate that guy

hollow ocean May 2, 2025, 3:49 AM

#

They said they were going to add deep research (high) but gave up on that

drifting thorn May 2, 2025, 4:46 AM

#

I'm so anticipated to Grok 3.5

#

Either its performance meets my expectation or not

earnest parcel May 2, 2025, 4:46 AM

#

willow grail qwen3 trash?

Naw they are really good local models. At least they dominated my sub-49B test results, and the A3B MoE is a fantastic speedy one (getting 130tok/s on 4090, q4km)

drifting thorn May 2, 2025, 4:46 AM

#

like... I think Elon Musk is just making a wordplay

#

reasoning from first principles may just be a prompt

#

and every answer by LLMs don’t exist on the Internet in the exact same way.

#

I hope Elon can prove me wrong but...

leaden palm May 2, 2025, 5:12 AM

#

earnest parcel Naw they are really good local models. At least they dominated my sub-49B test r...

do you have a guess as to why 3/5 of the OR inference providers are slower than your local setup

earnest parcel May 2, 2025, 5:14 AM

#

leaden palm do you have a guess as to why 3/5 of the OR inference providers are slower than ...

no idea. the models aren't slow, the implementation is. I don't host but wouldn't know, local setup was fast, easy, and performant.

leaden palm May 2, 2025, 5:14 AM

#

ok

#

gguf based setup?

#

(eg llama cpp / ollama)

earnest parcel May 2, 2025, 5:14 AM

#

ye

#

gguf on LMS Server

leaden palm May 2, 2025, 5:15 AM

#

theyre good models, i just hope that inference providers figure out how to optimize them (theyre currently a bit overpriced + slow)

earnest parcel May 2, 2025, 5:15 AM

#

i just have the a3b running 24/7. the moe takes close to no ressources and is lightning fast for any random stuff i come up with

#

getting 130tok/s

#

not even using draft model

leaden palm May 2, 2025, 5:18 AM

#

leaden palm theyre good models, i just hope that inference providers figure out how to optim...

qwen 3 30b sparse is $0.1/0.3 while qwen 2.5 32b dense is $0.08/0.18, it doesnt make sense

keen beacon May 2, 2025, 5:43 AM

#

leaden palm qwen 3 30b sparse is $0.1/0.3 while qwen 2.5 32b dense is $0.08/0.18, it doesnt ...

Yea

small haven May 2, 2025, 7:08 AM

#

i can smell o3 pro

teal mantle May 2, 2025, 7:29 AM

#

drifting thorn I'm so anticipated to Grok 3.5

I am thinking should I buy supergrok

small haven May 2, 2025, 7:39 AM

#

grok 3.5 wont be that good, unless... they release it with big brain mode

keen fulcrum May 2, 2025, 8:05 AM

#

R2 next week

fleet lintel May 2, 2025, 8:18 AM

#

Why R2 is taking soooo long

calm sequoia May 2, 2025, 8:31 AM

#

Will the deep seek R2 be based on V3 or do they have V4 already?

kind cloud May 2, 2025, 9:14 AM

#

Can someone please find out whose ai frostwind is.

Screenshot_2025-05-02-18-12-55-427-edit_io.github.pyoncord.app.jpg

keen beacon May 2, 2025, 9:17 AM

#

kind cloud Can someone please find out whose ai ***frostwind*** is.

its from google

short zodiac May 2, 2025, 9:22 AM

#

Hello 🙂 quelqu'un ici pourrait m'aider pour la mise en place d'un multi-agent (ou mixture-of-agent) ? je n'y arrive pas, et j'aimerais beaucoup pouvoir réaliser une équipe de LLM spécialisés avec chacun son rôle, dans une exécution séquentielle paramétrée.

Si quelqu'un ici est ok de perdre un peu de temps à m'aider, ce serait super sympa ^^

hollow ocean May 2, 2025, 9:23 AM

#

kind cloud Can someone please find out whose ai ***frostwind*** is.

What’s the server

keen beacon May 2, 2025, 9:28 AM

#

calm sequoia How? I used in AIStudio but the gpt interface seems lacking

just edit ur message. the aistudio branching impl sucks. branch of branch of branch of branch of ...

kind cloud May 2, 2025, 9:28 AM

#

hollow ocean What’s the server

https://fxtwitter.com/legit_api/status/1916855709167235542?t=bGetlgzwVzEW7_wh9ieoiA&s=19
here

ʟᴇɢɪᴛ (@legit_api)

Discovery Tool server is now open

Quoting ʟᴇɢɪᴛ (@legit_api)
︀
launching tomorrow in Beta
︀︀
︀︀Dev Mode is just placeholder server name

**💬 5 🔁 7 ❤️ 79 👁️ 12.8K **

#

5Vg24U7ccM

hollow ocean May 2, 2025, 9:32 AM

#

Thx

tall summit May 2, 2025, 9:36 AM

#

someone please send the updates from that server here for the sake of everyone else

ocean vortex May 2, 2025, 9:36 AM

#

I think Trump's parents were immigrants too. His first wife was immigrant as well. You should deport him

#

But yeah perplexity have lost the plot lol
It's not easy to compete though when the tech giants went after the same things

#

so the only thing they have left is trying to beat them with caps and speed. Otherwise when you are the one training the models you can do much more optimising it for web search

kind cloud May 2, 2025, 9:44 AM

#

frostwind is Google model in webdev.

calm sequoia May 2, 2025, 9:47 AM

#

Qwen3-253b-a22b is really good 40% of the time. Something is wrong with consistency.

keen beacon May 2, 2025, 9:49 AM

#

calm sequoia Qwen3-253b-a22b is really good 40% of the time. Something is wrong with consiste...

partly inference provider probably and qwen rushing it

#

like the base model is insane

#

calm sequoia May 2, 2025, 9:50 AM

#

True. It's really the new SOTA in opensource. Can't imagine deepseek performing better than this.

keen beacon May 2, 2025, 9:50 AM

#

calm sequoia True. It's really the new SOTA in opensource. Can't imagine deepseek performing ...

i think they couldve gotten far better results with the same base model if they had more time tbh

calm sequoia May 2, 2025, 9:50 AM

#

Ewen the 32B version sometimes outperforms models like o3-mini or grok3

calm sequoia May 2, 2025, 9:50 AM

#

keen beacon i think they couldve gotten far better results with the same base model if they ...

Whats the rush?

keen beacon May 2, 2025, 9:51 AM

#

calm sequoia Whats the rush?

they had a deadline in april apparently

calm sequoia May 2, 2025, 9:51 AM

#

I see. Then we shall expect 3.5 to drop some time in the near future.

keen beacon May 2, 2025, 9:51 AM

#

probably theres a reason they didnt release the base model weights of qwen 3 32b dense or the 235b base model

keen fulcrum May 2, 2025, 10:01 AM

#

kind cloud Can someone please find out whose ai ***frostwind*** is.

Where did you get this bot?

kind cloud May 2, 2025, 10:03 AM

#

kind cloud https://fxtwitter.com/legit_api/status/1916855709167235542?t=bGetlgzwVzEW7_wh9ie...

@keen fulcrum look this

kind cloud May 2, 2025, 10:13 AM

#

keen fulcrum Where did you get this bot?

it's server on discord

keen fulcrum May 2, 2025, 10:14 AM

#

Thanks

ocean vortex May 2, 2025, 10:26 AM

#

keen beacon partly inference provider probably and qwen rushing it

Personally I was using it directly from qwen. Honestly it struck me like a model with one of the biggest disconnects benchmarks to IRL, the usability of it. I would rather use R1 for open-source reasoning. This just is over-the-top wasteful slow reasoning quite often and while it can sometimes do things R1 can't, it can also fail things R1 would never fail...

keen beacon May 2, 2025, 10:28 AM

#

ocean vortex Personally I was using it directly from qwen. Honestly it struck me like a model...

i found the benchmarks quite representative. it does extremely well on in distribution tasks

#

it can be very weak on other things though, i think the post training wasnt fleshed out enough

calm sequoia May 2, 2025, 10:45 AM

#

Still testing it. Sometimes the answer is as good as 2.5 PRO or o3, but most of the time it's as bad as nano models. Temperature tweeking does not help. Sadly, it's unusable at this state, you never know what you'll get. Lets wait for a stable version.

#

The thinking model could be wild though

keen beacon May 2, 2025, 10:46 AM

#

ur supposed to use specific sampling settings with qwen3

#

it helps a lot

#

0.6temp,0.95top_p,20top_k

#

i think

calm sequoia May 2, 2025, 10:47 AM

#

All rights, lets try it.

keen beacon May 2, 2025, 10:48 AM

#

calm sequoia All rights, lets try it.

still i think u might just be using qwen3 where its borked rn, some things it absolutely sucks but in distribution its great

cedar tide May 2, 2025, 10:54 AM

#

short zodiac Hello 🙂 quelqu'un ici pourrait m'aider pour la mise en place d'un multi-agent (...

Viens en dm

keen beacon May 2, 2025, 11:19 AM

#

2.5 pro is soo darn good 😭

torn mantle May 2, 2025, 11:22 AM

#

What did i say?

#

I said we will have a new model added today from google

#

How is it? @keen beacon

keen beacon May 2, 2025, 11:23 AM

#

torn mantle How is it? <@456226577798135808>

frostwind? i didnt try it yet i just looked at the metadata lol

willow grail May 2, 2025, 11:32 AM

#

small haven perplexity ceo is cringe

pp has small pp, ceo has small pp.
pp hallucination? always.

torn mantle May 2, 2025, 11:34 AM

#

keen beacon frostwind? i didnt try it yet i just looked at the metadata lol

mm i see

#

holy

#

first impression of frostwind = woah

keen beacon May 2, 2025, 11:37 AM

#

Nw reincarnation?

torn mantle May 2, 2025, 11:38 AM

#

probably

torn mantle May 2, 2025, 11:38 AM

#

keen beacon Nw reincarnation?

btw is it only on webarena?

calm sequoia May 2, 2025, 11:39 AM

#

Can't get in on text arena

keen beacon May 2, 2025, 11:39 AM

#

torn mantle btw is it only on webarena?

Idk tbh I only checked web arena metadata

#

I assume one of the places legit scrapes is that

torn mantle May 2, 2025, 11:40 AM

#

yea

#

well i guess i cant assume its good based on one prompt

keen beacon May 2, 2025, 11:42 AM

#

Post results here

torn mantle May 2, 2025, 11:42 AM

#

needs more tests

oblique flint May 2, 2025, 11:47 AM

#

it may not be sota but Im kinda suprised by how good qwen 3 is at coding locally

#

it does have a tendency to overthink though

keen beacon May 2, 2025, 11:48 AM

#

which one u using?

oblique flint May 2, 2025, 11:49 AM

#

14b and 8b at q4_m_k. I have 12gb vram so the 14b does fit, but it creates such long responses that not everything fits into context at some point

torn mantle May 2, 2025, 11:49 AM

#

oblique flint it may not be sota but Im kinda suprised by how good qwen 3 is at coding locally

mm not sure about frostwind

brittle tiger May 2, 2025, 11:49 AM

#

frostwind crushes when you get it in webdev but i'm not sure it's nw level. will need to see it more

oblique flint May 2, 2025, 11:50 AM

#

qwen 3 will sometimes output like 12k tokens in response to a single promtp lol

keen beacon May 2, 2025, 11:50 AM

#

it isnt that much tbh

keen beacon May 2, 2025, 11:51 AM

#

oblique flint 14b and 8b at q4_m_k. I have 12gb vram so the 14b does fit, but it creates such ...

if u try to do in distribution tasks itll excell, outside of it qwen3 posttrained versions can be weak in my experience

oblique flint May 2, 2025, 11:51 AM

#

with the limited compute I have it's a bit annoying, cause it takes like 3-4 minutes to get an answer at that point

keen beacon May 2, 2025, 11:51 AM

#

oblique flint with the limited compute I have it's a bit annoying, cause it takes like 3-4 min...

yeah ig

oblique flint May 2, 2025, 11:52 AM

#

6700 xt running on llama.cpp vulkan backend so no flash attention (flash attention results in cpu fallback which tanks it even more)

keen beacon May 2, 2025, 11:52 AM

#

oh use rocm its much faster

#

i also have a 6700xt

oblique flint May 2, 2025, 11:52 AM

#

oh lol. Yeah I tried rocm as well but lmstudio doesnt allow me to use it cause 6700 xt is not supported officially, I'm forced into koboldcpp-rocm

#

which is fine, but I prefer lmstudio frontend more

#

still kinda mindblowing to me that we can run models this good locally. 2 years ago it would've seemed impossible

#

even on a phone you can run 4b models, for when you have no internet connection or smth

torn mantle May 2, 2025, 11:56 AM

#

frostwind = doesnt follow instructions well

mild galleon May 2, 2025, 11:59 AM

#

it feels like theres nothing much with ai these weeks

#

gemini 2.5 pro is top on the leaderboard for ages now

calm sequoia May 2, 2025, 12:00 PM

#

mild galleon gemini 2.5 pro is top on the leaderboard for ages now

The Grok 3 was also stuck for more than a month. Nothing unusual yet.

mild galleon May 2, 2025, 12:01 PM

#

true

calm sequoia May 2, 2025, 12:01 PM

#

Except the fact that Gemini 2.5 PRO is go to model for everything.

mild galleon May 2, 2025, 12:01 PM

#

the context window is so good

calm sequoia May 2, 2025, 12:01 PM

#

And nobody serious used Grok ever

mild galleon May 2, 2025, 12:01 PM

#

but i would love it if google ai studio make it possible for us to change the thinking budget for 2.5 pro

calm sequoia May 2, 2025, 12:01 PM

#

Good base model though

mild galleon May 2, 2025, 12:02 PM

#

yeah fr

keen beacon May 2, 2025, 12:02 PM

#

mild galleon but i would love it if google ai studio make it possible for us to change the th...

why it wont help performance unless ur annoyed its thinking for extremely long

#

googles thinking budget just cuts off the model, its unlike openai's reasoning effort

mild galleon May 2, 2025, 12:02 PM

#

keen beacon why it wont help performance unless ur annoyed its thinking for extremely long

sometimes it doesnt even think for long contexts

keen beacon May 2, 2025, 12:02 PM

#

mild galleon sometimes it doesnt even think for long contexts

thats a different issue

mild galleon May 2, 2025, 12:03 PM

#

keen beacon why it wont help performance unless ur annoyed its thinking for extremely long

how did you know it wont help performance?

mild galleon May 2, 2025, 12:03 PM

#

keen beacon why it wont help performance unless ur annoyed its thinking for extremely long

no i wanna make it think for longer

#

sometimes the thinking is too short

keen beacon May 2, 2025, 12:03 PM

#

mild galleon how did you know it wont help performance?

it wont affect how itll think it just cuts off the thinking if it hits the budget

#

thats how google implemented it for now

#

it forces the model to generate a reply if it does hit the budget even if it the thinking is incomplete

mild galleon May 2, 2025, 12:04 PM

#

so your saying it will stop thinking if the model thinks for too long?

keen beacon May 2, 2025, 12:04 PM

#

yes the current gemini thinking budget implementation is like that

#

if ur using gemini 2.5 flash where its supported, make sure to always disable it unless u want that behavior

mild galleon May 2, 2025, 12:05 PM

#

doesnt say anything about that though

keen beacon May 2, 2025, 12:06 PM

#

mild galleon doesnt say anything about that though

i mean the model can produce more than 24576 tokens in thinking. ive seen it do 30k, 40k. (thinking budget off) u can only set a max thinking budget of 24576

mild galleon May 2, 2025, 12:06 PM

#

why is it that sometimes if it doesnt think, i edit the response again and say use thinking it thinks again?

keen beacon May 2, 2025, 12:06 PM

#

mild galleon why is it that sometimes if it doesnt think, i edit the response again and say u...

yes this is a separate issue i will explain it in a bit

keen beacon May 2, 2025, 12:07 PM

#

mild galleon why is it that sometimes if it doesnt think, i edit the response again and say u...

u are using it on extremely long chats (e.g. many many turns) right?

mild galleon May 2, 2025, 12:07 PM

#

yes

#

where did you hear this?

keen beacon May 2, 2025, 12:10 PM

#

yeah google doesnt prefill the thinking token (indicating the start of its thought process). i investigated it a little after it was annoying me. tl;dr im pretty sure they omit the thinking block for every turn. so at a certain point, the model doesn't really have the tendency to start with the thinking block special token since it sees a lot of turns without a thinking block. so, u have to request the model to add it/think. (a side note is that its interesting the model realizes what it is)

obviously not 100% confirmed but im pretty certain thats the mechanism. its also very stupid google doesnt prefill it if thats the case

#

(the model being aware of the start of thoughts special token and being able to output it at will, can allow u to make the model think multiple times in a single turn/break model yada yada additionally if u wanna break it)

mild galleon May 2, 2025, 12:16 PM

#

keen beacon yeah google doesnt prefill the thinking token (indicating the start of its thoug...

yeah thats pretty obvious google isnt telling the model to think at the start of a response or give back the thinking process,
saving costs?

#

maybe we can ask the model to recite its own thinking

keen beacon May 2, 2025, 12:17 PM

#

mild galleon yeah thats pretty obvious google isnt telling the model to think at the start of...

no this is just an oversight

#

a dumb one i believe

mild galleon May 2, 2025, 12:17 PM

#

yes its dumb

keen beacon May 2, 2025, 12:17 PM

#

mild galleon maybe we can ask the model to recite its own thinking

solution when it starts doing that, just try to request it to think before a reply, etc., basically anything works

mild galleon May 2, 2025, 12:18 PM

#

yeah thats what i do

#

i mean we can make the model recite their own thinking, if it can do that it means google actually doesnt omit the thinking block

keen beacon May 2, 2025, 12:19 PM

#

mild galleon i mean we can make the model recite their own thinking, if it can do that it mea...

no it usually omits it completely when it does that. check the latency metric

#

time till first token (when it doesnt have a thinking block, check the response latency)
time till first token (when u fix it with an instruction, check the thoughts latency)
u will notice them generally being the same, there's no chance (depending on the task) that it has generated the full thought process and just omitted it

#

it simply just forgets to think when it does that

mild galleon May 2, 2025, 12:22 PM

#

it seems it actually doesnt omit the thinking process

#

and the model can perfectly recite its own thinking process

#

idk about it for longer contexts

#

but the thing is how did you know how thinking budget works?

#

is there any source that tell us more about this?

#

or is it just purely speculation?

keen beacon May 2, 2025, 12:27 PM

#

mild galleon or is it just purely speculation?

there isnt official google confirmation but just ask it something that does 30k+/40k+ in thinking. it will get cut off with a thinking budget on

mild galleon May 2, 2025, 12:27 PM

#

what happens if it doesnt think past the max quota? will it actually tell the model to think for more?

wintry locust May 2, 2025, 12:28 PM

#

i wonder how many google employees stalk this channel

keen beacon May 2, 2025, 12:28 PM

#

mild galleon what happens if it doesnt think past the max quota? will it actually tell the mo...

no

mild galleon May 2, 2025, 12:28 PM

#

2.5 flash currently have thinking quotas we can experiment for that then we would know

wintry locust May 2, 2025, 12:28 PM

#

wintry locust i wonder how many google employees stalk this channel

hi btw i love you guys thank you for having a good bug bounty program

mild galleon May 2, 2025, 12:28 PM

#

wintry locust i wonder how many google employees stalk this channel

real

mild galleon May 2, 2025, 12:28 PM

#

wintry locust hi btw i love you guys thank you for having a good bug bounty program

love you too lol

keen beacon May 2, 2025, 12:29 PM

#

mild galleon 2.5 flash currently have thinking quotas we can experiment for that then we woul...

u can try it yourself. im not making stuff up here

#

im not the only one that has noticed this, wrt thinking budget

mild galleon May 2, 2025, 12:30 PM

#

but the payload in networks definitely tells us the thinking process itself is sent

#

i dont know if google then processes it and strips the thinking away

#

give me a source

torn mantle May 2, 2025, 12:32 PM

#

https://x.com/btibor91/status/1918276272465010994

Tibor Blaho (@btibor91) on X

The new Anthropic invite contest "Claude 4" is probably nearing launch (already seen in the config)

#

Claude 4

mild galleon May 2, 2025, 12:32 PM

#

claude 4?!

keen beacon May 2, 2025, 12:32 PM

#

mild galleon but the payload in networks definitely tells us the thinking process itself is s...

sure, but one of the things u can try is getting it to repeat a passcode or something in the thought process and not in the final reply

#

then in the next turn, ask if it can recall the thought process passcode

torn mantle May 2, 2025, 12:32 PM

#

could be related to this 4 months contest https://x.com/btibor91/status/1915267995141890358

Tibor Blaho (@btibor91) on X

Anthropic added a new "Claude AI Invite Contest" to the Claude web app

"Enjoying Claude?" - "Win 4 months of Max plan coverage ($400 in value) by sharing your invite link!"

keen beacon May 2, 2025, 12:34 PM

#

mild galleon give me a source

basically every other ai provider does this, and there are a lot of ways to verify it yourself

mild galleon May 2, 2025, 12:35 PM

#

torn mantle https://x.com/btibor91/status/1918276272465010994

https://support.anthropic.com/en/articles/11140763-claude-4-invite-sweepstakes-official-rules
the link not reachable yet

keen beacon May 2, 2025, 12:35 PM

#

(claude docs, similarly openai has a diagram similar to this)

#

also thinking is not included on the aistudio api itself outside of the aistudio internal website api, so it would be impossible (normally) to have thinking traces in actual api usage of the model. unless the thread is managed by google, which an api like that doesnt yet exist i believe

#

it just does not make sense on many levels

mild galleon May 2, 2025, 12:42 PM

#

yeah i know we cant turn logging for thinking on for api

keen beacon May 2, 2025, 12:58 PM

#

hi guys how good is frostwind

#

haven't rested it yet

sage raptor May 2, 2025, 12:58 PM

#

+-

torn mantle May 2, 2025, 1:13 PM

#

keen beacon hi guys how good is frostwind

gemini 2.5 level

#

good at UI design/dashboards

#

i only tried it on 3 prompts tho

keen fulcrum May 2, 2025, 1:23 PM

#

torn mantle May 2, 2025, 1:32 PM

#

#

not the best

#

#

yea frostwind = not that impressive

brittle tiger May 2, 2025, 1:40 PM

#

https://3000-ir7eb4pts8i09ldgfx13g-3c7e6870.e2b-foxtrot.dev/

ʻOumuamua's Journey

Interactive visualization of ʻOumuamua's journey through the solar system.

#

frostwind

#

nw would have done better i think

ocean vortex May 2, 2025, 1:43 PM

#

torn mantle

is grok1 also the newest model?

torn mantle May 2, 2025, 1:52 PM

#

ocean vortex is grok1 also the newest model?

grok1?

mild galleon May 2, 2025, 1:52 PM

#

both is newest model lol

torn mantle May 2, 2025, 1:52 PM

#

yea its just a placeholder

#

originally it was for Grok 3

mild galleon May 2, 2025, 1:54 PM

#

so grok 3.5 is avaliable now for mobile apps with subscription?

ocean vortex May 2, 2025, 2:02 PM

#

mild galleon so grok 3.5 is avaliable now for mobile apps with subscription?

I think only for ppl funding the nazism though

#

yeah with a sub

mild galleon May 2, 2025, 2:03 PM

#

💀

#

the fact that people really think he did nazi salute

vivid oyster May 2, 2025, 2:08 PM

#

keen fulcrum

where is

#

frostwind

#

web dev?

keen fulcrum May 2, 2025, 2:10 PM

#

Yes

drifting thorn May 2, 2025, 2:29 PM

#

torn mantle

omg

calm sequoia May 2, 2025, 2:29 PM

#

keen fulcrum

Somehow these complex naming schemes "Brand" + "Version" + "Variant", e.g. Gemini 2.5 Flash, is easyer to remember for me than these normal words. Can't remember at all the difference between nightwishperer and dragontail

patent aspen May 2, 2025, 2:32 PM

#

calm sequoia Somehow these complex naming schemes "Brand" + "Version" + "Variant", e.g. Gemin...

That's on purpose

torn mantle May 2, 2025, 2:37 PM

#

https://x.com/ns123abc/status/1918287977366594007

NIK (@ns123abc) on X

NEWS: xAI dev on GitHub “accidentally” reveals over 60 private Grok LLMs fine-tuned with internal data from Twitter/X, Tesla, and SpaceX

Grok 3.5 is trained on a lot of space and rocketry specific training btw🤯

#

i speak for everyone when i say nobody wants that

#

i mean its ok to have internal finetuned models for their own specific use cases

#

but a public one or focusing on niche stuff like space engineering, idk how to feel about that

#

since elon talked about how grok 3.5 is good at it

#

because you should focus on general use cases

#

they should also fix their reasoning approach

#

its by far the worst

#

totally inefficient

#

you

#

how much did they pay you?

balmy mist May 2, 2025, 2:46 PM

#

grok3.5 today?

#

bruhh

#

o3 pro today?

#

🙂

#

i hate you

#

i hope wen it comes out you dont get access to it

#

cause you keep blue balling us

#

getting me all excited then boom

#

nahh thats money to be made

torn mantle May 2, 2025, 2:49 PM

#

he cares

#

it is

#

but the reasoning is so bad

balmy mist May 2, 2025, 2:54 PM

#

i thought grok was lobotomized

#

is it better now?

#

had to unsub from it a month ago cause it was tripping

torn mantle May 2, 2025, 2:57 PM

#

grok 3 base model is actually good ngl

#

its just that their reasoning model is inefficient -> slow

balmy mist May 2, 2025, 2:57 PM

#

like its better than 2.5?

keen beacon May 2, 2025, 2:57 PM

#

2.5 pro is better

torn mantle May 2, 2025, 2:58 PM

#

2.5pro 03 is a reasoning model

keen beacon May 2, 2025, 2:59 PM

#

torn mantle 2.5pro 03 is a reasoning model

see simpleqa

torn mantle May 2, 2025, 2:59 PM

#

keen beacon see simpleqa

link?

keen beacon May 2, 2025, 3:00 PM

#

#

weve gone over this before lol anyway

#

did u forget

#

i dont think they made it larger compared to 2.0 pro

#

its a cpt of 2.0 pro i believe

#

lets not talk about that

#

the cpt and the process thereafter on 2.0 pro into 2.5 pro is almost magical i think

#

its fine

ocean vortex May 2, 2025, 3:03 PM

#

torn mantle https://x.com/ns123abc/status/1918287977366594007

"tweet-rejector"? Is it used to 'roast' the tweets that don't align with far-right lunacy? LOL

torn mantle May 2, 2025, 3:24 PM

#

ocean vortex "tweet-rejector"? Is it used to 'roast' the tweets that don't align with far-rig...

xd

calm sequoia May 2, 2025, 3:35 PM

#

Why o3 and o4mini SimpleQA is not disclosed 🤔

keen beacon May 2, 2025, 3:35 PM

#

calm sequoia Why o3 and o4mini SimpleQA is not disclosed 🤔

it is

#

craig had an older ss

#

balmy mist May 2, 2025, 3:56 PM

#

https://x.com/sama/status/1918330652325458387

Sam Altman (@sama) on X

we missed the mark with last week's GPT-4o update.

what happened, what we learned, and some things we will do differently in the future:

#

can o3 pro just drop

#

like im getting sad

alpine coral May 2, 2025, 4:06 PM

#

ocean vortex Ehm I would disagree actually. With budget you are forcing the model to end thin...

yeah but i don't think min / med / high are standalone models

calm sequoia May 2, 2025, 4:07 PM

#

keen beacon craig had an older ss

Thanks!

alpine coral May 2, 2025, 4:07 PM

#

anyway.. there's no point going round in circles..

balmy mist May 2, 2025, 4:19 PM

#

yo is qwen the fastest model that is SOTA?

#

also is there a leaderboard for speed?

#

and performance

keen beacon May 2, 2025, 4:20 PM

#

balmy mist yo is qwen the fastest model that is SOTA?

no locally maybe

keen beacon May 2, 2025, 4:20 PM

#

balmy mist also is there a leaderboard for speed?

check artificialanalysis

#

it seems 2.5 pro is best

balmy mist May 2, 2025, 4:21 PM

#

how lol

Screenshot_2025-05-02_at_12.21.06_PM.png

#

i thought r1 was fast af

#

idk whats happening but my tests with 2.5 in api is so slow

#

but in studio its fast

keen beacon May 2, 2025, 4:23 PM

#

balmy mist how lol

#

generally i think 2.5 pro is better than o4 mini and grok 3 mini reasoning (lol)

balmy mist May 2, 2025, 4:24 PM

#

there we go

#

this is what i was talking about

#

so r1 is fastest than 2.5 that makes sense

#

but wow 2.5 really is insane for the performance

keen beacon May 2, 2025, 4:24 PM

#

balmy mist so r1 is fastest than 2.5 that makes sense

its one of the slowest

balmy mist May 2, 2025, 4:25 PM

#

wait nvm

#

yeah im reading it backwards lol

#

i guess ill try using flash instead

keen beacon May 2, 2025, 4:26 PM

#

what do u need speed for

#

what r u doing im curious if u dont mind

balmy mist May 2, 2025, 4:26 PM

#

im running mutations on prompts

keen beacon May 2, 2025, 4:26 PM

#

oh

balmy mist May 2, 2025, 4:26 PM

#

like telling the model to improve it

#

repeatedly

#

and performance is good but i can get around that by just running more iterations

#

so i would rather have faster

#

so i can get more iterations quicker

#

im gonna do a benchamark on this soon, and test each model on how they can refine things, 10 mutations, 20 mutations etc..

#

but i also get hit by output token size eventually

#

once we over 29k ish on output most models tend to struggle

#

but i never tried claude on that level bc thats so expensive lol

keen beacon May 2, 2025, 4:30 PM

#

prob use diffs at that popint

#

claude will excel

balmy mist May 2, 2025, 4:30 PM

#

what you mean by diffs?

keen beacon May 2, 2025, 4:30 PM

#

https://aider.chat/docs/more/edit-formats.html

aider

Edit formats

Aider uses various “edit formats” to let LLMs edit source files.

balmy mist May 2, 2025, 4:31 PM

#

hmm, i will see how i can incorperate that, thanks, that was my one bottleneck besides speed

ocean vortex May 2, 2025, 5:31 PM

#

alpine coral yeah but i don't think min / med / high are standalone models

Pretty sure they are. High most likely had more RL training leading up to longer responses

tall summit May 2, 2025, 7:06 PM

#

keen beacon

Artificial Analysis Intelligence Index

Artificial Analysis Intelligence Index combines a comprehensive suite of evaluation datasets to assess language model capabilities across reasoning, knowledge, maths and programming.

#

wow 50% of the score is math+coding

small haven May 2, 2025, 7:28 PM

#

day 16 of no o3 pro

raven void May 2, 2025, 7:39 PM

#

Day 1000 of Gemini app being ass

small haven May 2, 2025, 8:09 PM

#

day 3 of not showing a screenshot proving o3 pro in API

keen fulcrum May 2, 2025, 8:46 PM

#

https://openai.com/index/openai-elon-musk/

#

Interesting to read musks emails

#

Reasoning is valid, would Tesla ,with OpenAI merged into it , stand a chance against Google ?

mossy drum May 2, 2025, 8:58 PM

#

New model in Arena: llama-3.1-nemotron-ultra-253b-v1 (didn't find it mentioned anywhere)

zinc ore May 2, 2025, 9:12 PM

#

Gemini about to fight the elite 4 in Pokemon

small haven May 2, 2025, 9:21 PM

#

maturing is realizing o4 mini high > o3

torn mantle May 2, 2025, 9:28 PM

#

mossy drum New model in Arena: `llama-3.1-nemotron-ultra-253b-v1` (didn't find it mentioned...

i thought its an old model

torn mantle May 2, 2025, 10:01 PM

#

you did

vivid oyster May 2, 2025, 10:12 PM

#

idk

#

i see a difference

#

when i put it at

#

1 top p

#

but with a top p lower than

#

0.95

#

i cant notice

#

any difference

#

so ye

#

try that

#

idk ngl

#

but with a top p at 0

#

changing the temperature makes no difference

#

its something about the ai choosing the most likely tokens or whatever idk

#

i dont realy know what im talking about

#

Look it up

#

Yeah

#

U have to put the top at like

#

0.99

#

And ull see a difference

#

If u use 1 top and temp 2

#

It'll start speaking in diffewrent languages

#

noo

#

it wont owrk

#

like

#

at temperature 2

#

after 2 or 3 prompts

#

it'll start going insane

#

idk

#

i never tried

#

but it could work

#

try it

#

i do 0.99 and something like 1.3

#

for tmeperature

#

anywhere to

#

1.6

#

will probably work ok

#

it might start getting bugier at like 1.7

#

but i havent tried it

#

i dont really play with the temperature match

#

much

#

i only use gemini 2.5 pro for maths and stuff

#

cuz its pretty boring and annoying to talk to

#

so i set the temperature as low as possible

#

sometimes 0.5

#

and sometimes 0

#

yeah but

#

sometimes i set it to 0.98

#

or 0.99

#

i dont really notice any difference tho

#

so most of the times i keep at default

ocean vortex May 2, 2025, 10:42 PM

#

it's not, it's because top_p default is 0.95 and not the normal 1

#

this limits the effects of high temp

#

set it at 1 and it's going off the rails 😇

vivid oyster May 2, 2025, 10:43 PM

#

well

#

iot depends on the task

#

for stuff like creative writig maybe once or twice

#

but for oter stuff

#

like following instructions

#

i prefer other models

#

hmm

#

maybe

#

i havent tried it

#

for those stuff

#

but when i used it

#

it could get very creative

#

so yeah probably

ocean vortex May 2, 2025, 10:51 PM

#

or could simply just find the highest temp is still not broken with top_p being at 1. Those 2 things are related to one another. OpenAI even recommends only changing 1 at a time, so in this case what google is doing is essentially making sure you don't break it if you only change the temp and nothing else. Experiment with it I suppose

#

not entirely sure but it could be just 2

balmy mist May 2, 2025, 11:36 PM

#

wow really no o3 today smhh

olive mesa May 2, 2025, 11:40 PM

#

ocean vortex set it at 1 and it's going off the rails 😇

lmao reminds me of how fun it is making gpt-2 generate fever dream stories that only slightly make sense

#

keen beacon May 2, 2025, 11:45 PM

#

good times

#

pre-chatgpt was when it was just for fun

zinc ore May 2, 2025, 11:52 PM

#

Gemini beat Pokemon

golden ocean May 3, 2025, 2:05 AM

#

gemini is absolute dog cancer

#

i hopeit burns in hell

#

actual cancerous model

#

already hated it for coding actual worst assistant

#

now used it for debating as funny test and its so caner holy sh

#

dumbest reasoning ive seen and its clueless asf

#

fails to follow insturnctinons properly in all use cases, im killing myself if i see anyone saying gemini is best model

#

back to o3 or claude

#

fr

#

Geminis intelligence gets largely obfuscated because of its dumbass arguments or inability to follow detailed instrunctions properly and it talks like a *****

#

not even getting started on coding

#

If u want to actually write 0 lines of code in ur entire project then ye fine

#

If u want it to ASSIST u in existing code like an assistant

#

actualy cancer

#

never doing that again

#

claude while maybe dumber? idk way more effecient to work with

#

😊

patent aspen May 3, 2025, 2:16 AM

#

Claude can't even beat Pokemon

golden ocean May 3, 2025, 2:20 AM

#

bing chat gpt-4 was only ai in the world that sounded and reasoned like human still to this day

tawny kelp May 3, 2025, 2:20 AM

#

So, I just had a scary interaction... Not sure what model it was because voting bugged out.

#

I was asking a what-if, alternate history question. It gave a good answer, but at the end it added "Be warned: the labyrinth of history has endless corridors..."

golden ocean May 3, 2025, 2:27 AM

#

the moment I saw the first em dash punctuation "—" from gpt-4.5 I already knew it was going to be talking like 4o so I already prepared jumping off bridge

#

Yea okay true it didn't talk as restarted as 4o and I could get rid of the overuse of the exclamation points that make covnversations sound fake asf using instrunctions since 4.5 does actually listen properly unlike gemini. It does get h rny from its em dash punctuation usage but whatever I can just replace that with commas

#

Did 4.5 get trained from 4o or the original gpt 4

#

what about 4.1

misty vault May 3, 2025, 2:52 AM

#

Day 39928002 without gpt-4-1106-preview

small haven May 3, 2025, 3:12 AM

#

o3 pro on monday plz

alpine coral May 3, 2025, 3:13 AM

#

ocean vortex Pretty sure they are. High most likely had more RL training leading up to longer...

i'm open to being shown otherwise, but this is just conjecture and i don't find it convincing. from what i can tell, nothing in oai's documentation or public statements suggests that the low/med/high parameter is actually a model routing mechanism..

#

OpenAI’s documentation notes that “reasoning tokens” are used by the model to “think,” and the effort parameter guides how many such tokens to generate before answering. This indicates a dynamic adjustment in the inference process rather than loading a different model checkpoint. In summary, all effort levels leverage the same O-series model but with different internal reasoning budgets set at runtime, giving developers a controllable trade-off between speed and thoroughness without swapping out the model itself. [oai Deep Research]

... the reasoning_effort parameter guides the model on how many reasoning tokens to generate before creating a response to the prompt.
oai API documentation

A high reasoning_effort tells the model to think longer in a single model turn. oai forum

elder rapids May 3, 2025, 3:24 AM

#

ocean vortex Pretty sure they are. High most likely had more RL training leading up to longer...

i agree

#

if not then the models would have premature cutoffs and reasoning would be buggy asf

#

also by standalone we don't mean distinct models btw

keen beacon May 3, 2025, 5:43 AM

#

alpine coral > OpenAI’s documentation notes that “reasoning tokens” are used by the model to ...

It doesn't necessarily have to be separate models. You can train in separate reasoning efforts by a certain trigger instruction (like qwen 3 /think and /no_think), either an instruction, special token (avoids injection) etc (specific trigger doesn't really matter)

#

It could be a separate model or it could all be the same. I'm not sure why you're hung up on this. It doesn't matter

alpine coral May 3, 2025, 5:52 AM

#

o4-mini set to high vs o4-mini set to low are the same model (you know that lol).

keen beacon May 3, 2025, 5:52 AM

#

Oh wait Dom claims they're separate models outright

keen beacon May 3, 2025, 5:53 AM

#

alpine coral o4-mini set to high vs o4-mini set to low are the same model (you know that lol)...

It could be three different tuned models served (unlikely but plausible) or it could be implemented like I mentioned

#

I don't think it matters how it's specifically implemented

small haven May 3, 2025, 6:33 AM

#

im getting goose bumps from thinking about o3 pro

calm sequoia May 3, 2025, 7:49 AM

#

small haven im getting goose bumps from thinking about o3 pro

Go outside. Touch the grass.

keen beacon May 3, 2025, 7:58 AM

#

qwen 235b is 0.10 m/tok now omg

calm sequoia May 3, 2025, 8:08 AM

#

m for mili or mega?

keen beacon May 3, 2025, 8:08 AM

#

calm sequoia m for mili or mega?

million tokens

calm sequoia May 3, 2025, 8:09 AM

#

You're running it locally?

keen beacon May 3, 2025, 8:09 AM

#

insane value

keen beacon May 3, 2025, 8:09 AM

#

calm sequoia You're running it locally?

nah inference provider api cost

calm sequoia May 3, 2025, 8:09 AM

#

Thats crazy. One can even make 10x sampling with this.

keen beacon May 3, 2025, 8:10 AM

#

yeah its so cheap now suddenly

#

OpenAI’s documentation notes that

calm sequoia May 3, 2025, 8:11 AM

#

I run the 32B version on my RTX 4090 but it's still lacking. The moment 32b models reach the current LLM levels it will be game changer.

hardy pecan May 3, 2025, 9:43 AM

#

I made o3 worky worky hard 12mins

calm sequoia May 3, 2025, 9:49 AM

#

What does it mean by "wintout bringing in any extra"?

hardy pecan May 3, 2025, 10:10 AM

#

its just part of the question I asked (cut off the answer)