#general

1 messages · Page 34 of 1

keen beacon
#

yeah its a beast for indistribution tasks. some things it can be especially weak on

#

its so fcking fast

balmy mist
#

and its chea right?

#

cheap*

#

like is it better than flash 2.5?

keen beacon
balmy mist
#

lol

keen beacon
#

if u can run it locally tho 😄

#

i think the pricing will continue to drop though

#

its kinda expensive rn. 0.3m/tok (it isnt a lot, but qwq 32b is 0.2m/tok)

balmy mist
#

yeah that has to drop

#

you think r2 will be cheaper?

#

that would be silly right lol

keen beacon
#

idk what to expect with r2

balmy mist
#

lol

alpine coral
#

it's interesting with the qwen models - using on their site, it seems thinking can be enabled on all the models (except qwq, where it's permanently on)

#

i was playing around yesterday

#

i dunno how the thinking would interact with functinon calling

#

could be great; or tricky

#

but yeah fwiw they are solid models - esp the smaller ones that can be hosted locally (though im largely going by there benchmarks there..)

#

can the 30b one be hosted locally (like with relative ease)?

keen beacon
#

It's moe

alpine coral
#

right ofc

keen beacon
#

Only 3b active

alpine coral
#

i should pay attention to the last digit in the name

#

so it's functionally less intensive than hosting the dense 4b variant?

#

or wait.. is that also moe?

#

i'm not sure but i feel like OG 4 was, then turbo was MoE

#

but really not sure

keen beacon
#

yes

keen beacon
#

yea

#

it was

#

sam tweeted saying theyre storing it somewhere for historians

alpine coral
#

lol

keen beacon
#

(but og gpt 4 is long gone on chatgpt, they deprecated gpt 4 turbo on chatgpt LOL)

#

yes

#

gpt 4 was seminal, personally i think o1 preview was next

#

unpopular opinion i guess

alpine coral
#

yeah

#

test time compute was kinda a paridgm shift

#

which is still playing out

#

like.. reasoning models are annoying and unneccasrry a lot of the time ha

keen beacon
#

unpopular opinion: i think models still need to think even more lol

alpine coral
#

not sure about stupid,, it was kinda obvious; like 'give the model more time to 'think' rather than blurting out the first tokens it predicts as the answer'

keen beacon
#

it wasnt just that it was rule based rl too

#

its a combination of multiple things

alpine coral
#

yeah i know i'm simplifying

#

yeah i agree

#

it's kimda like brute force a lot of the time

#

rather than 'thinking'

#

but that said, a lot of the time, it works perfectly and makes good sense

keen beacon
#

i think people try to apply how they think they think to models too much

#

i personally think this is it but its another i think unpopular opinion

unborn ocean
unborn ocean
#

imho the great thing about TTC is not just length and problem solving, but also variability of response length and adjusting it to economic incentives
(because our brain also does that on a more complicated level)

alpine coral
#

it literally just gives the model time to 'work through' something; it's one completion

willow grail
#

qwen3 trash?

alpine coral
#

divided with <thinking> tagssetc

unborn ocean
#

not just with thought lenght but also "compute" for each "token" our brain produces, as we are more like some f*d up complicated RNN (with differing compute for each "token")

alpine coral
#

but i don't think that's how it works

#

reasoning 'effort' is misleading language

#

reasoning 'token budget' is much better

#

imo

keen beacon
#

with effort, like openai's reasoning effort, its not really limited by budget. the model is specifically tuned to produce different lengths of chain of thought, not a strict limit

#

high produces the most, med produces less, low produces even less

alpine coral
#

so they;'re 3 discrete models?

keen beacon
#

not necessarily these behaviors could be tuned directly into a single model. activation could be from a special token, specific instructions, etc.

alpine coral
#

i mean maybe

#

for o1 pro, i totally agree it's more than just 'unlimited reasoning tokens' - there's actually more to it

#

i'm not conviinced low/med/high is any different from selecting a corresponding value on flash 2.5

#

same model

keen beacon
#

its literally cut off at the thinking budget

alpine coral
#

lol this is true

#

i will cede you that ha

keen beacon
#

you can see what i mean by 'specific instructions' and how they can trigger specific behaviors (intentionally trained in) by sending /no_think or /think to qwen3 models if u dont understand what i mean

alpine coral
#

right, but it's still functionally the same

#

google has just implemented it poorly

#

(prob why other providers didn't go with a floating value)

keen beacon
#

google didnt implement it at all. its just a programmatic max token limit on the thinking contents

#

the model has no idea

alpine coral
#

im not saying it does

keen beacon
#

it doesnt even try to think less with a smaller thinking budget unless u give extra instructions

alpine coral
#

it's principally the same. if the model is fine tuned to adhere to cues or whatever then clearly it works well, and that;s eaiser to do if it's low/med/high vs 0-n tokens. but fundamentally it's the same

unborn ocean
#

true, even modified ones won't be enough

alpine coral
#

the model has a greater or lesser token allowance to use before delivering its final resposne

unborn ocean
#

although the core concept of attention will likely remain in what ever we do for the time being

rugged brook
keen beacon
# alpine coral it's principally the same. if the model is fine tuned to adhere to cues or whate...
  1. in qwen3 /think / /no_think are trained in model behaviors. (theyll stop the model from thinking or not on qwen3) google 2.5 flash doesn't have behaviors like this trained in by default unless u prompt it which is different. (u might get it to think less for example)
  2. sonnet's antml:max_thinking_lengthantml:max_thinking_length is provided to the model as a prompt instruction on the api/claude.ai/etc. it's also a trained in behavior, even if the value is somewhat arbitrary - it is unable to actually count the thinking tokens (it's like asking a model to provide a specific word count, unless it actually counts it, it just gives it a rough idea about how much to write. the model is aware of this on top of a programmatic max token limit in the thinking block)

the visible result might as well be roughly the same but the things happening with the model are quite different

unborn ocean
#

but we will likely have jagged ai (what ethan mollick calls it) that is human level in some areas using transformers

#

^that is my guess (although one could argue that we have already reached that point)

alpine coral
# keen beacon 1. in qwen3 `/think` / `/no_think` are trained in model behaviors. (theyll stop ...

yes i think we're talking one another...i understand the situation re google (setting the value for the budget does nothing other than potentially truncate the model's thinking). and with anthropic what you describe reminds me of the yap score ha (like as if the models are actually trying to or capable of counting their words to adhere to day's 'yap score').
my point is that, say google impemented in a way worked, i dunno how, but it adhered to the value set - in functional terms you would have the same thing that oai does with their reasoning models (low/med/high), just yeah at a more granular level

#

i mean if we agree they at are fundamentally the same model, any fine tuning to get adherence to reasoning token allowance is not what gives the performance; it's all ultimately about the compute used during inference right

keen beacon
alpine coral
#

i'm not advocating for that ha

keen beacon
#

so i think reasoning efforts are better, you don't set an arbitrary token limit and instead the model is aware it needs to reason less/more/even more. then a sonnet like implementation would be slightly worse

alpine coral
#

i mean i assume google's current implementation won't be around for long.. it is literally pointless

alpine coral
willow grail
#

who has tooth pain at least once a month? i can help

alpine coral
#

i don't know how they would do it with a dyanamic value like google's.. the more i think about it

keen beacon
alpine coral
#

yeah tbh i dunno... i've wandered well beyond my depth ha

keen beacon
alpine coral
#

i guess anthropic only has 32k and 64k? i mean 32=low, 64=high.. what's trained in exactly ha? is it meant to be dynamic?

#

oh

keen beacon
#

theres pretraining associations that would help but finetuning it allows it to be more aligned to how anthropic wants it/model awareness beyond the programmatic max token limit

#

even without finetuning giving a model any kind of indicator will help/be more aligned to what optimal model behavior with a given thinking budget should be

alpine coral
#

right i see..

The budget_tokens parameter determines the maximum number of tokens Claude is allowed to use for its internal reasoning process. Larger budgets can improve response quality by enabling more thorough analysis for complex problems, although Claude may not use the entire budget allocated, especially at ranges above 32K.

#

i didn't realise that with anthropic's api

#

i mean again though, the principle is the same.. we're fundamentally talking about (attempts) to govern the amount of inference/test time used to generate the response

#

or in google's case, lack of attempts so far (aside from slapping it onto 2.5 flash as a hyper-parameter and nothing more ha)

#

grok-mini-3 felt similar to 2.5 flash in terms of setting the reasoning budget (didn't seem to do anything at all; not even result in truncated reasoning if set low)

#

though that may be have been on openrouter's end or something

keen fulcrum
balmy mist
#

thats trained from qwen?

#

it seems kinda underwhelming tho

keen beacon
keen beacon
ocean vortex
keen beacon
#

Flash is purely programmatic

keen fulcrum
#

It does better than qwq32b

great results for a 14b model

alpine coral
#

the goals are the same

#

how is this hard lol

keen beacon
worthy thunder
#

Context Arena: Added more Anthropic results for 2needle tests. (https://x.com/DillonUzar/status/1917968783395655757)

See all results at: https://contextarena.ai
You can also hover over a score in the table, which will then show a button to explore the individual test results/answers.

Relative AUC @ 128k 2needle scores (select models shown):

  • GPT-4.1: 61.6%
  • Gemini 2.0 Flash: 56.0%
  • Claude 3.7 Sonnet: 55.9%
  • Claude 3.7 Sonnet (Thinking): 55.5%
  • Grok 3 Mini (Low): 54.8%
  • Claude 3.0 Haiku: 52.9%
  • Llama 4 Maverick: 52.7%
  • Claude 3.5 Sonnet: 51.2%
  • Grok 3 Mini (High): 50.3%
  • Claude 3.5 Haiku: 50.0%

Some quick notes:

  • Pretty consistent performance across 3.0, 3.5, and 3.7. Impressive.
  • No noticeable difference between Claude 3.7 Sonnet and Sonnet Thinking.
  • All perform around or above GPT-4.1 Mini for context lengths <= 128k.
  • Claude 3.0 Haiku had the best overall Model AUC of the Anthropic models tested, but only by the tiniest amount (had the smallest drop between context lengths).
  • Around Gemini 1.5/2.0 Flash, Grok 3 Mini, and Llama 4 Maverick in overall performance.

Disclosure: The companies I work with use Claude 3.0 Haiku extensively (one of the ones we use the most to power some services). Comparing the latest models against the original Haiku was one of the goals of this website originally.

Enjoy.

#

Also added Qwen3 14B and 8B to the results from last night.

alpine coral
#

anthropic used to dominate both long context and retrieval imo

#

has kinda now lost both to google and oai (or just to google.. looking at the bar chart.. geez it does well.. )

wintry tinsel
#

the masses yearn for Claude 4 Opus

#

in this dark time of Google and closed AI only a Claude hero can save us

raven void
#

Gemini pro and flash cooked by non reasoning 4.1

#

Google is a whole generation behind tbh, OpenAI isn't releasing 4.1o and 4.1o mini to give Google a chance to fight back

unborn ocean
wintry tinsel
oblique flint
#

people call openai closedai but anthropic is even more closed than openai tbh. At least openai made clip, whisper and gpt 2 open weights. Anthropic did literally nothing open as far as Im aware, except maybe mcp if you count that (although its not a model)

wintry tinsel
#

they are but they are so much easier to jailbreak and get them to do what I ask of them

#

open AI models never cease to put me to sleep and overcharge

keen fulcrum
wintry tinsel
keen fulcrum
#

Indeed

golden ocean
#

cwaude

cedar tide
flint sand
#

4.1 i mean

keen fulcrum
#

Today we're announcing Integrations, a new way to connect your apps and tools to Claude.
︀︀
︀︀We're also expanding Claude's Research capabilities with an advanced mode that searches the web, your Google Workspace, and now your Integrations too.

**💬 47 🔁 130 ❤️ 1.1K 👁️ 88.4K **

▶ Play video

Claude now automatically determines when to search and how deeply to investigate.
︀︀
︀︀With Research mode toggled on, Claude researches for up to 45 minutes across hundreds of sources (including connected apps) before delivering a report, complete with citations.

**💬 2 🔁 11 ❤️ 133 👁️ 14.5K **

▶ Play video
trim pecan
small haven
#

Is claude pro unlimited

olive mesa
#

lmao. no annoying emojis, minimalist, straight to the point

willow grail
#

and majority of humans are very dumb. see usa. sorry others in usa.

ocean vortex
# alpine coral the goals are the same

Ehm I would disagree actually. With budget you are forcing the model to end thinking prematurely relative to what it has learned during RL training. With a standalone model it has learned to take the most out of the given reasoning lengths it arrived at 'naturally'

#

so like it may get hung up on small irrelevant details and not see the entire picture because it has no clue it's gonna be forced to shorten it

#

I think why it works is that in most cases some reasoning context is still better than no reasoning context at all + the base model is no slouch if we look at the models with this implementation. But I would say that is less than ideal

#

reasoning budget sounds great in theory, but there are obvious limitations to it. We can look at it from another angle too - if limiting the budget does not lead to notably worse performance then maybe your RL training is not very good or efficient as well

small haven
#

im not gonna lie, maverick cooked

exotic kernel
#

btw just interested, does LMArena provides credits to research papers for the propriety models?

ocean vortex
# small haven im not gonna lie, maverick cooked

it's absolutely insane that they went with complete retrains, new models only chat instead of reasoning lol

llama3 was in a much better shape relative to competition than llama4. Why chase diminishing returns?

#

every benchmark came be gamed when that's your sole focus

#

but there's a reason public version is nowhere near that

#

that's why you can't really effectively game most of them

#

cause there are plenty

#

and improving 1 screws up with the rest

small haven
#

yes lol

small haven
#

dont tease me bro

#

ive lost faith in sam

elder rapids
#

dude is anti Gemini lmao, read some of his messages

zinc ore
#

They've literally done this a million times too

golden ocean
zinc ore
#

SORA SOON!!! Never releases it

small haven
zinc ore
#

Show off random tech and make big promises about how it's the greatest thing ever

golden ocean
#

I didnt say that

zinc ore
#

Then hardly ever release it

golden ocean
#

ask @willow grail

#

@willow grail ru gonna let this slide?

small haven
#

craig is a genius, dont underestimate him guys, hes actually in the right end side of the iq distribution

golden ocean
#

ok buddy

elder rapids
#

bro is saying 4.1 mini is better than o3 mini high

#

yeah no sh

#

the point is

#

it's the same dude

#

screenshotting out of context benchmarks

#

he filtered livebench coding without considering what livebench is measuring (competitive coding) and said Google is behind

zinc ore
#

"to give Google a fighting chance" dudes love roleplaying online with their fan narratives

elder rapids
#

you can't unironically believe that

#

😭

#

yeah anyone could do that

#

but not everyone disingenuously postures it like that

golden ocean
#

hes on dont disturb

#

dont disturb him bro

zinc ore
elder rapids
#

like bro

#

what is this 😭 I'm dead

misty vault
#

Yes because all the immigrants got kicked out

small haven
#

bronx mahanttan

#

facts

#

o3 > o1 pro, so u live in between bronx and upper west

#

o3 pro lives in the woods

#

living the outback farm life

#

wtf

willow grail
#

you should be happy that they are not species from other genus of homo.... then there would be even more stupid people

willow grail
willow grail
torn mantle
#

im already young

#

so i dont need it

#

who paid you?

#

you going to recommend a certain label now?

willow grail
#

nutrition label? what

torn mantle
#

yea

#

brand

#

une marque

willow grail
#

i would say there is 88% dumb people if any other species of genus homo wuld still exist

torn mantle
#

oh

#

you are ignoring me now

#

caught

willow grail
#

thats why we cant have nire things. matters a lot.

#

nice*

#

u must vote for ... monkeys?

#

if you explain this. thanks

small haven
#

hey guys

#

if ur workflow dont look like this, ur ngmi

torn mantle
#

any new models on arena?

#

this seems right

#

although idk about gpt4.1

#

dont take berberin 2g/day

willow grail
keen beacon
#

30 grams*

calm sequoia
#

How? I used in AIStudio but the gpt interface seems lacking

keen fulcrum
#

Which deep research feature is the best currently?

#

Doesn't this deserve a separate leaderboard?

#

What makes you consider ChatGPT DR over Grok and Claude?

willow grail
#

what searcher tool is best

#

for health, medicine,

#

i am not rich

#

ew

#

the hallucination machine?

#

really???

torn mantle
#

tbh ive been using grok deep search

willow grail
#

yorue so close to block

torn mantle
#

it seems ok

#

o3 will straight up lie on ur face

#

and you need to re-check that

willow grail
#

...... nope...... 50% of its sources it cant remember

torn mantle
#

ive been using pplx since day 1

willow grail
#

pp is hallucination machine

torn mantle
#

they dont seem to focus on their main objective anymore

#

which is providing a good search results

#

its so bad rn

willow grail
#

50% of its text is hallu.

torn mantle
#

yea but it doesnt give you the depth you are looking for

keen fulcrum
#

So the basic search feature for free plans isn't really useful, I don't like that its using bing data

#

Yes

torn mantle
#

why would i need a service that gives me same results as an offline LLM

keen fulcrum
#

I think Claude may be the best option currently, closely followed by Kagi, Google and Grok

torn mantle
#

bing/microsoft dont have their own product

keen fulcrum
#

(not 2 minutes ago)

torn mantle
#

ive never felt msft had their own made ai product

small haven
#

anyone know the limits on claude research

torn mantle
#

they said they updated their research tool

small haven
#

dont know i just wanted the max

keen fulcrum
#

Is it only in the max plan?

small haven
#

gonna compare

keen fulcrum
#

Are DR that much better than their standard search tools on the free plan?
They are terrible in my experience

#

How many sources do you get?

small haven
#

lol it asks just like oai dr

willow grail
#

sam died..

hollow ocean
small haven
#

we gon see

torn mantle
#

how bad it is

small haven
#

jesus christ

#

wait it searches in parallel

keen fulcrum
#

Its great they waited to implement the feature so well thought through
One of the best features they released, still disappointed that its only for max users.

small haven
#

if it says microsoft im gonna punch my monitor hahaha

keen fulcrum
#

Will grok 3.5 be available early on lmarena?

small haven
#

421 sources now

torn mantle
#

show us results

torn mantle
#

prob tomorrow yea

small haven
torn mantle
small haven
#

wait.. is this claude research the one that takes 45 mins lol

small haven
hollow ocean
#

It will take 45 mins

small haven
#

bruh

hollow ocean
#

Ask Claude research who will win Knicks or the pistons today

#

Let’s see how good it is

#

If it gets it right it’s good

small haven
hollow ocean
small haven
#

and idk if theres limits

#

its still going for a good almost 20 mins

#

but stuck at 421

#

sources

torn mantle
#

welp

#

rekt

#

can you try a prompt of mine when it finishes?

#

🥺

small haven
#

k

torn mantle
#

ty

hollow ocean
#

No limits

keen fulcrum
keen beacon
#

Wait what

keen fulcrum
keen beacon
#

Is Grok 3.5 already out?

#

ain’t no way

keen beacon
#

Does anyone have access to grok 3.5

#

like with a subscription

torn mantle
keen fulcrum
#

Do you have an insider or how do you know

torn mantle
#

we are just predicting it will be added on lmarena tomorrow

keen beacon
#

ohh

#

bro grok 3.5 is gonna be insane

keen beacon
keen fulcrum
#

Why not Mars

keen beacon
#

u think

#

better than

#

2.5 pro and o3

#

whats that

#

damn

small haven
#

oai dr, same prompt, took 6 mins, claude dr is going for 30 mins now hahah

keen fulcrum
#

LLMs in space will be really useful

small haven
torn mantle
#

since the issues are more reasoning oriented

keen fulcrum
#

And the question

small haven
#

thats openai dr

#

claude dr is still running

torn mantle
#

@small haven can you try this : Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.

torn mantle
small haven
#

im gonna get banned , arent i

torn mantle
#

eh

#

no

#

ozone?

small haven
#

claude kinda cooked

keen fulcrum
small haven
#

yea

keen fulcrum
#

Interesting

small haven
#

stocks dr is still running

torn mantle
small haven
#

yea with oai dr, it would ask some questions

#

always

torn mantle
#

mm i see

keen fulcrum
#

Now you gotta compare all providers and write an article, people will read it

small haven
torn mantle
#

pmc/pubmed/sciencedirect

#

thats good so far

willow grail
#

u need more precision babe

small haven
#

im not sure, but im about to hit 40 mins with the stocks one

keen fulcrum
#

I would love to know how it handles political questions (how biased are the sources)

torn mantle
keen fulcrum
#

Anthropic is probably using exa under the hood

willow grail
#

i know that...

torn mantle
#

i could explicitly ask it to compare their antimicrobial activity/anti-inflammatory prop/anticancer/cardioprotective effects/immunomodulatory effects/gut health

willow grail
#

💋

torn mantle
#

why are you kissing people now?

willow grail
#

🫃

willow grail
#

take a snickers

#

you are not you \

#

something which does exist. cause gender is in brain, not in your genitals

torn mantle
willow grail
torn mantle
#

propolis probably has metabolism effects as well

#

not sure about resveratrol

willow grail
#

a person who says they are a man while having female genitals. no, its not a mental disease

torn mantle
#

as it was heavily studied for cvd

#

are you pregnant?

#

im confused

#

😖

willow grail
#

XD no eww

#

children ew

torn mantle
#

whats this

willow grail
torn mantle
#

idk

#

im confused as well

willow grail
#

u can do both

#

lul

#

u can tho.

#

we dont care what u think. that is irrelevant

#

phobic statement cause u would never say this to a straight person.

#

pls think about whatu say

torn mantle
#

@small haven any progress

willow grail
#

dna is changed daily

torn mantle
#

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

#

isnt this why we get diseases

willow grail
#

perhaps

torn mantle
#

or at least increase the risk

willow grail
#

trans is literally the best

small haven
torn mantle
willow grail
#

its searching for 4 hours for one task?

small haven
#

it def hit the 40 mins mark rn

willow grail
#

what prompt

#

tell it to hurry up

small haven
#

what stock to buy lol

willow grail
#

the company is closing for today

#

tell it

small haven
#

lmao

willow grail
#

to grow the f up

torn mantle
#

wait

#

did u change ur pfp

#

something changed

#

in the chat

willow grail
torn mantle
#

why

willow grail
#

ps asura is the server called where kill on sight is legal

torn mantle
#

you are an interesting person

willow grail
#

absolutel trash server

torn mantle
#

oh no

willow grail
#

in "the isle"

torn mantle
#

who made berberine mad

#

you did

#

stop lying

#

omg

#

you again

willow grail
#

like what? that u believe transgender is a mental issue?

#

i dont know why u believe that

#

ozone

#

?

#

its faster if utell me.

small haven
#

claude is still going like wtf

keen fulcrum
small haven
#

this guy said 1hr before it just gave up lol

torn mantle
#

better be good

small haven
#

yea it just got released today, and it did say beta, im prolly got get timed

keen fulcrum
#

I don't have a problem if it takes 1 day for an advanced query
It would take me significantly longer to answer such questions.

#

It would be great knowing whether Claude is great in physics and astrophysics. Grok is known to be tailored for that use case

small haven
#

boutta hit an hour lol

#

ok bud

#

legend says hes still trying to a pic of it in oai playground

balmy mist
#

we not getting it, this is what you get for trolling me about it 😂

small haven
#

who wants it?

sweet tinsel
#

I would like to see it.

small haven
#

58mins

#

im gatekeeping it guys sorry! its too good

torn mantle
#

send

#

is it better than oai dr?

small haven
torn mantle
#

ive heard deepseek is also working on their own research feature

small haven
#

guys after 1hr of deep anal research, just buy nvidia!

keen fulcrum
torn mantle
#

2.7 vs 2.2 trillion

small haven
#

dont worry the bermecellin whatveer is coming

small haven
torn mantle
#

$5 + $5 + $5 + $5 = cgpt sub

small haven
#

my head hurts

torn mantle
#

this is taking too long

small haven
#

ya not a fan of long dr tbh

torn mantle
#

well ive tried it on grok/oai/google hopefully it provides something new

#

or at least interesting

#

of who

#

gemini?

small haven
#

this is TTD stock the one it recommended lmao

#

trade desk

#

Snow too i think is like this? haha

#

wish it gave me some quantum computing stocks at least ..

torn mantle
#

STILL NOT FINISHED????????????????

#

holy

small haven
#

nope

#

what if it takes 2 hrs lol

torn mantle
#

if it takes 10h i would pay you that $5

small haven
#

this is oai dr if u want it

torn mantle
#

thanks

#

lol straight up to the point

small haven
#

haha

torn mantle
#

thats actually a good report

small haven
#

almost done

torn mantle
#

oh

#

finally

#

😖

small haven
#

im gonna ask wen o3 pro

torn mantle
#

no shot

small haven
#

its done

#

berberine is the answer

#

..

#

lol

torn mantle
#

fk

#

share pls

torn mantle
#

probably cuz it has more RCT trials

small haven
#

who wins tho, oai dr or claude dr

#

bruh 1h 13m

torn mantle
#

im still reading

#

thanks btw

#

the "The absorption problem" part is kinda interesting to know

#

i dont think other models talked about that

#

although it focused on their specific known markers

#

unlike oai

keen fulcrum
#

Is it possible to access these via the grok api?

torn mantle
#

it went one by one which is what i wanted

#

i also liked how oai dr compared same parameters through diff studies

small haven
#

ok so in other words? oai or claude lol

torn mantle
#

maybe :

  1. oai dr
  2. claude
  3. grok3
  4. gemini
#

yea oai dr is way better tbh

small haven
#

sheesh

keen fulcrum
#
  1. Perplexity
torn mantle
keen fulcrum
#

You should genuinely try deep research of kagi when its out of closed beta
its great

small haven
#

perplexity ceo is cringe

hollow ocean
#

Ranking is accurate

torn mantle
hollow ocean
#

He got money tho

small haven
#

rip shareholders

#

rather put my money with thinking machines

hollow ocean
small haven
#

onlyfans ceo has a better chance lmao

ember rapids
hollow ocean
#

They said they were going to add deep research (high) but gave up on that

drifting thorn
#

I'm so anticipated to Grok 3.5

#

Either its performance meets my expectation or not

earnest parcel
# willow grail qwen3 trash?

Naw they are really good local models. At least they dominated my sub-49B test results, and the A3B MoE is a fantastic speedy one (getting 130tok/s on 4090, q4km)

drifting thorn
#

like... I think Elon Musk is just making a wordplay

#

reasoning from first principles may just be a prompt

#

and every answer by LLMs don’t exist on the Internet in the exact same way.

#

I hope Elon can prove me wrong but...

leaden palm
earnest parcel
leaden palm
#

ok

#

gguf based setup?

#

(eg llama cpp / ollama)

earnest parcel
#

ye

#

gguf on LMS Server

leaden palm
#

theyre good models, i just hope that inference providers figure out how to optimize them (theyre currently a bit overpriced + slow)

earnest parcel
#

i just have the a3b running 24/7. the moe takes close to no ressources and is lightning fast for any random stuff i come up with

#

getting 130tok/s

#

not even using draft model

leaden palm
small haven
#

i can smell o3 pro

teal mantle
small haven
#

grok 3.5 wont be that good, unless... they release it with big brain mode

keen fulcrum
#

R2 next week

fleet lintel
#

Why R2 is taking soooo long

calm sequoia
#

Will the deep seek R2 be based on V3 or do they have V4 already?

kind cloud
#

Can someone please find out whose ai frostwind is.

keen beacon
short zodiac
#

Hello 🙂 quelqu'un ici pourrait m'aider pour la mise en place d'un multi-agent (ou mixture-of-agent) ? je n'y arrive pas, et j'aimerais beaucoup pouvoir réaliser une équipe de LLM spécialisés avec chacun son rôle, dans une exécution séquentielle paramétrée.

Si quelqu'un ici est ok de perdre un peu de temps à m'aider, ce serait super sympa ^^

hollow ocean
keen beacon
kind cloud
#

5Vg24U7ccM

hollow ocean
#

Thx

tall summit
#

someone please send the updates from that server here for the sake of everyone else

ocean vortex
#

I think Trump's parents were immigrants too. His first wife was immigrant as well. You should deport him

#

But yeah perplexity have lost the plot lol
It's not easy to compete though when the tech giants went after the same things

#

so the only thing they have left is trying to beat them with caps and speed. Otherwise when you are the one training the models you can do much more optimising it for web search

kind cloud
#

frostwind is Google model in webdev.

calm sequoia
#

Qwen3-253b-a22b is really good 40% of the time. Something is wrong with consistency.

keen beacon
#

like the base model is insane

calm sequoia
#

True. It's really the new SOTA in opensource. Can't imagine deepseek performing better than this.

keen beacon
calm sequoia
#

Ewen the 32B version sometimes outperforms models like o3-mini or grok3

keen beacon
calm sequoia
#

I see. Then we shall expect 3.5 to drop some time in the near future.

keen beacon
#

probably theres a reason they didnt release the base model weights of qwen 3 32b dense or the 235b base model

keen fulcrum
kind cloud
keen fulcrum
#

Thanks

ocean vortex
# keen beacon partly inference provider probably and qwen rushing it

Personally I was using it directly from qwen. Honestly it struck me like a model with one of the biggest disconnects benchmarks to IRL, the usability of it. I would rather use R1 for open-source reasoning. This just is over-the-top wasteful slow reasoning quite often and while it can sometimes do things R1 can't, it can also fail things R1 would never fail...

keen beacon
#

it can be very weak on other things though, i think the post training wasnt fleshed out enough

calm sequoia
#

Still testing it. Sometimes the answer is as good as 2.5 PRO or o3, but most of the time it's as bad as nano models. Temperature tweeking does not help. Sadly, it's unusable at this state, you never know what you'll get. Lets wait for a stable version.

#

The thinking model could be wild though

keen beacon
#

ur supposed to use specific sampling settings with qwen3

#

it helps a lot

#

0.6temp,0.95top_p,20top_k

#

i think

calm sequoia
#

All rights, lets try it.

keen beacon
keen beacon
#

2.5 pro is soo darn good 😭

torn mantle
#

What did i say?

#

I said we will have a new model added today from google

#

How is it? @keen beacon

keen beacon
willow grail
torn mantle
#

holy

#

first impression of frostwind = woah

keen beacon
#

Nw reincarnation?

torn mantle
#

probably

torn mantle
calm sequoia
#

Can't get in on text arena

keen beacon
#

I assume one of the places legit scrapes is that

torn mantle
#

yea

#

well i guess i cant assume its good based on one prompt

keen beacon
#

Post results here

torn mantle
#

needs more tests

oblique flint
#

it may not be sota but Im kinda suprised by how good qwen 3 is at coding locally

#

it does have a tendency to overthink though

keen beacon
#

which one u using?

oblique flint
#

14b and 8b at q4_m_k. I have 12gb vram so the 14b does fit, but it creates such long responses that not everything fits into context at some point

brittle tiger
#

frostwind crushes when you get it in webdev but i'm not sure it's nw level. will need to see it more

oblique flint
#

qwen 3 will sometimes output like 12k tokens in response to a single promtp lol

keen beacon
#

it isnt that much tbh

keen beacon
oblique flint
#

with the limited compute I have it's a bit annoying, cause it takes like 3-4 minutes to get an answer at that point

oblique flint
#

6700 xt running on llama.cpp vulkan backend so no flash attention (flash attention results in cpu fallback which tanks it even more)

keen beacon
#

oh use rocm its much faster

#

i also have a 6700xt

oblique flint
#

oh lol. Yeah I tried rocm as well but lmstudio doesnt allow me to use it cause 6700 xt is not supported officially, I'm forced into koboldcpp-rocm

#

which is fine, but I prefer lmstudio frontend more

#

still kinda mindblowing to me that we can run models this good locally. 2 years ago it would've seemed impossible

#

even on a phone you can run 4b models, for when you have no internet connection or smth

torn mantle
#

frostwind = doesnt follow instructions well

mild galleon
#

it feels like theres nothing much with ai these weeks

#

gemini 2.5 pro is top on the leaderboard for ages now

calm sequoia
mild galleon
#

true

calm sequoia
#

Except the fact that Gemini 2.5 PRO is go to model for everything.

mild galleon
#

the context window is so good

calm sequoia
#

And nobody serious used Grok ever

mild galleon
#

but i would love it if google ai studio make it possible for us to change the thinking budget for 2.5 pro

calm sequoia
#

Good base model though

mild galleon
#

yeah fr

keen beacon
#

googles thinking budget just cuts off the model, its unlike openai's reasoning effort

mild galleon
keen beacon
mild galleon
mild galleon
#

sometimes the thinking is too short

keen beacon
#

thats how google implemented it for now

#

it forces the model to generate a reply if it does hit the budget even if it the thinking is incomplete

mild galleon
#

so your saying it will stop thinking if the model thinks for too long?

keen beacon
#

yes the current gemini thinking budget implementation is like that

#

if ur using gemini 2.5 flash where its supported, make sure to always disable it unless u want that behavior

mild galleon
#

doesnt say anything about that though

keen beacon
mild galleon
#

why is it that sometimes if it doesnt think, i edit the response again and say use thinking it thinks again?

keen beacon
keen beacon
mild galleon
#

yes

#

where did you hear this?

keen beacon
#

yeah google doesnt prefill the thinking token (indicating the start of its thought process). i investigated it a little after it was annoying me. tl;dr im pretty sure they omit the thinking block for every turn. so at a certain point, the model doesn't really have the tendency to start with the thinking block special token since it sees a lot of turns without a thinking block. so, u have to request the model to add it/think. (a side note is that its interesting the model realizes what it is)

obviously not 100% confirmed but im pretty certain thats the mechanism. its also very stupid google doesnt prefill it if thats the case

#

(the model being aware of the start of thoughts special token and being able to output it at will, can allow u to make the model think multiple times in a single turn/break model yada yada additionally if u wanna break it)

mild galleon
#

maybe we can ask the model to recite its own thinking

keen beacon
#

a dumb one i believe

mild galleon
#

yes its dumb

keen beacon
mild galleon
#

yeah thats what i do

#

i mean we can make the model recite their own thinking, if it can do that it means google actually doesnt omit the thinking block

keen beacon
#

time till first token (when it doesnt have a thinking block, check the response latency)
time till first token (when u fix it with an instruction, check the thoughts latency)
u will notice them generally being the same, there's no chance (depending on the task) that it has generated the full thought process and just omitted it

#

it simply just forgets to think when it does that

mild galleon
#

it seems it actually doesnt omit the thinking process

#

and the model can perfectly recite its own thinking process

#

idk about it for longer contexts

#

but the thing is how did you know how thinking budget works?

#

is there any source that tell us more about this?

#

or is it just purely speculation?

keen beacon
mild galleon
#

what happens if it doesnt think past the max quota? will it actually tell the model to think for more?

wintry locust
#

i wonder how many google employees stalk this channel

mild galleon
#

2.5 flash currently have thinking quotas we can experiment for that then we would know

wintry locust
keen beacon
#

im not the only one that has noticed this, wrt thinking budget

mild galleon
#

but the payload in networks definitely tells us the thinking process itself is sent

#

i dont know if google then processes it and strips the thinking away

#

give me a source

torn mantle
#

Claude 4

mild galleon
#

claude 4?!

keen beacon
#

then in the next turn, ask if it can recall the thought process passcode

torn mantle
keen beacon
keen beacon
#

(claude docs, similarly openai has a diagram similar to this)

#

also thinking is not included on the aistudio api itself outside of the aistudio internal website api, so it would be impossible (normally) to have thinking traces in actual api usage of the model. unless the thread is managed by google, which an api like that doesnt yet exist i believe

#

it just does not make sense on many levels

mild galleon
#

yeah i know we cant turn logging for thinking on for api

keen beacon
#

hi guys how good is frostwind

#

haven't rested it yet

sage raptor
#

+-

torn mantle
#

good at UI design/dashboards

#

i only tried it on 3 prompts tho

keen fulcrum
torn mantle
#

not the best

#

yea frostwind = not that impressive

brittle tiger
#

frostwind

#

nw would have done better i think

ocean vortex
torn mantle
mild galleon
#

both is newest model lol

torn mantle
#

yea its just a placeholder

#

originally it was for Grok 3

mild galleon
#

so grok 3.5 is avaliable now for mobile apps with subscription?

ocean vortex
#

yeah with a sub

mild galleon
#

the fact that people really think he did nazi salute

vivid oyster
#

frostwind

#

web dev?

keen fulcrum
#

Yes

drifting thorn
calm sequoia
# keen fulcrum

Somehow these complex naming schemes "Brand" + "Version" + "Variant", e.g. Gemini 2.5 Flash, is easyer to remember for me than these normal words. Can't remember at all the difference between nightwishperer and dragontail

torn mantle
#

i speak for everyone when i say nobody wants that

#

i mean its ok to have internal finetuned models for their own specific use cases

#

but a public one or focusing on niche stuff like space engineering, idk how to feel about that

#

since elon talked about how grok 3.5 is good at it

#

because you should focus on general use cases

#

they should also fix their reasoning approach

#

its by far the worst

#

totally inefficient

#

you

#

how much did they pay you?

balmy mist
#

grok3.5 today?

#

bruhh

#

o3 pro today?

#

🙂

#

i hate you

#

i hope wen it comes out you dont get access to it

#

cause you keep blue balling us

#

getting me all excited then boom

#

nahh thats money to be made

torn mantle
#

he cares

#

it is

#

but the reasoning is so bad

balmy mist
#

i thought grok was lobotomized

#

is it better now?

#

had to unsub from it a month ago cause it was tripping

torn mantle
#

grok 3 base model is actually good ngl

#

its just that their reasoning model is inefficient -> slow

balmy mist
#

like its better than 2.5?

keen beacon
#

2.5 pro is better

torn mantle
#

2.5pro 03 is a reasoning model

keen beacon
torn mantle
keen beacon
#

weve gone over this before lol anyway

#

did u forget

#

i dont think they made it larger compared to 2.0 pro

#

its a cpt of 2.0 pro i believe

#

lets not talk about that

#

the cpt and the process thereafter on 2.0 pro into 2.5 pro is almost magical i think

#

its fine

ocean vortex
calm sequoia
#

Why o3 and o4mini SimpleQA is not disclosed 🤔

keen beacon
#

craig had an older ss

balmy mist
#

can o3 pro just drop

#

like im getting sad

alpine coral
calm sequoia
alpine coral
#

anyway.. there's no point going round in circles..

balmy mist
#

yo is qwen the fastest model that is SOTA?

#

also is there a leaderboard for speed?

#

and performance

keen beacon
keen beacon
#

it seems 2.5 pro is best

balmy mist
#

how lol

#

i thought r1 was fast af

#

idk whats happening but my tests with 2.5 in api is so slow

#

but in studio its fast

keen beacon
#

generally i think 2.5 pro is better than o4 mini and grok 3 mini reasoning (lol)

balmy mist
#

there we go

#

this is what i was talking about

#

so r1 is fastest than 2.5 that makes sense

#

but wow 2.5 really is insane for the performance

keen beacon
balmy mist
#

wait nvm

#

yeah im reading it backwards lol

#

i guess ill try using flash instead

keen beacon
#

what do u need speed for

#

what r u doing im curious if u dont mind

balmy mist
#

im running mutations on prompts

keen beacon
#

oh

balmy mist
#

like telling the model to improve it

#

repeatedly

#

and performance is good but i can get around that by just running more iterations

#

so i would rather have faster

#

so i can get more iterations quicker

#

im gonna do a benchamark on this soon, and test each model on how they can refine things, 10 mutations, 20 mutations etc..

#

but i also get hit by output token size eventually

#

once we over 29k ish on output most models tend to struggle

#

but i never tried claude on that level bc thats so expensive lol

keen beacon
#

prob use diffs at that popint

#

claude will excel

balmy mist
#

what you mean by diffs?

keen beacon
balmy mist
#

hmm, i will see how i can incorperate that, thanks, that was my one bottleneck besides speed

ocean vortex
tall summit
# keen beacon

Artificial Analysis Intelligence Index

Artificial Analysis Intelligence Index combines a comprehensive suite of evaluation datasets to assess language model capabilities across reasoning, knowledge, maths and programming.

#

wow 50% of the score is math+coding

small haven
#

day 16 of no o3 pro

raven void
#

Day 1000 of Gemini app being ass

small haven
#

day 3 of not showing a screenshot proving o3 pro in API

keen fulcrum
#

Interesting to read musks emails

#

Reasoning is valid, would Tesla ,with OpenAI merged into it , stand a chance against Google ?

mossy drum
#

New model in Arena: llama-3.1-nemotron-ultra-253b-v1 (didn't find it mentioned anywhere)

zinc ore
#

Gemini about to fight the elite 4 in Pokemon

small haven
#

maturing is realizing o4 mini high > o3

torn mantle
#

you did

vivid oyster
#

idk

#

i see a difference

#

when i put it at

#

1 top p

#

but with a top p lower than

#

0.95

#

i cant notice

#

any difference

#

so ye

#

try that

#

idk ngl

#

but with a top p at 0

#

changing the temperature makes no difference

#

its something about the ai choosing the most likely tokens or whatever idk

#

i dont realy know what im talking about

#

Look it up

#

Yeah

#

U have to put the top at like

#

0.99

#

And ull see a difference

#

If u use 1 top and temp 2

#

It'll start speaking in diffewrent languages

#

noo

#

it wont owrk

#

like

#

at temperature 2

#

after 2 or 3 prompts

#

it'll start going insane

#

idk

#

i never tried

#

but it could work

#

try it

#

i do 0.99 and something like 1.3

#

for tmeperature

#

anywhere to

#

1.6

#

will probably work ok

#

it might start getting bugier at like 1.7

#

but i havent tried it

#

i dont really play with the temperature match

#

much

#

i only use gemini 2.5 pro for maths and stuff

#

cuz its pretty boring and annoying to talk to

#

so i set the temperature as low as possible

#

sometimes 0.5

#

and sometimes 0

#

yeah but

#

sometimes i set it to 0.98

#

or 0.99

#

i dont really notice any difference tho

#

so most of the times i keep at default

ocean vortex
#

it's not, it's because top_p default is 0.95 and not the normal 1

#

this limits the effects of high temp

#

set it at 1 and it's going off the rails 😇

vivid oyster
#

well

#

iot depends on the task

#

for stuff like creative writig maybe once or twice

#

but for oter stuff

#

like following instructions

#

i prefer other models

#

hmm

#

maybe

#

i havent tried it

#

for those stuff

#

but when i used it

#

it could get very creative

#

so yeah probably

ocean vortex
#

or could simply just find the highest temp is still not broken with top_p being at 1. Those 2 things are related to one another. OpenAI even recommends only changing 1 at a time, so in this case what google is doing is essentially making sure you don't break it if you only change the temp and nothing else. Experiment with it I suppose

#

not entirely sure but it could be just 2

balmy mist
#

wow really no o3 today smhh

olive mesa
keen beacon
#

good times

#

pre-chatgpt was when it was just for fun

zinc ore
#

Gemini beat Pokemon

golden ocean
#

gemini is absolute dog cancer

#

i hopeit burns in hell

#

actual cancerous model

#

already hated it for coding actual worst assistant

#

now used it for debating as funny test and its so caner holy sh

#

dumbest reasoning ive seen and its clueless asf

#
  • fails to follow insturnctinons properly in all use cases, im killing myself if i see anyone saying gemini is best model
#

back to o3 or claude

#

fr

#

Geminis intelligence gets largely obfuscated because of its dumbass arguments or inability to follow detailed instrunctions properly and it talks like a *****

#

not even getting started on coding

#

If u want to actually write 0 lines of code in ur entire project then ye fine

#

If u want it to ASSIST u in existing code like an assistant

#

actualy cancer

#

never doing that again

#

claude while maybe dumber? idk way more effecient to work with

#

😊

patent aspen
#

Claude can't even beat Pokemon

golden ocean
#

bing chat gpt-4 was only ai in the world that sounded and reasoned like human still to this day

tawny kelp
#

So, I just had a scary interaction... Not sure what model it was because voting bugged out.

#

I was asking a what-if, alternate history question. It gave a good answer, but at the end it added "Be warned: the labyrinth of history has endless corridors..."

golden ocean
#

the moment I saw the first em dash punctuation "—" from gpt-4.5 I already knew it was going to be talking like 4o so I already prepared jumping off bridge

#

Yea okay true it didn't talk as restarted as 4o and I could get rid of the overuse of the exclamation points that make covnversations sound fake asf using instrunctions since 4.5 does actually listen properly unlike gemini. It does get h rny from its em dash punctuation usage but whatever I can just replace that with commas

#

Did 4.5 get trained from 4o or the original gpt 4

#

what about 4.1

misty vault
#

Day 39928002 without gpt-4-1106-preview

small haven
#

o3 pro on monday plz

alpine coral
#

OpenAI’s documentation notes that “reasoning tokens” are used by the model to “think,” and the effort parameter guides how many such tokens to generate before answering. This indicates a dynamic adjustment in the inference process rather than loading a different model checkpoint. In summary, all effort levels leverage the same O-series model but with different internal reasoning budgets set at runtime, giving developers a controllable trade-off between speed and thoroughness without swapping out the model itself. [oai Deep Research]

... the reasoning_effort parameter guides the model on how many reasoning tokens to generate before creating a response to the prompt.
oai API documentation

A high reasoning_effort tells the model to think longer in a single model turn. oai forum

elder rapids
#

if not then the models would have premature cutoffs and reasoning would be buggy asf

#

also by standalone we don't mean distinct models btw

keen beacon
#

It could be a separate model or it could all be the same. I'm not sure why you're hung up on this. It doesn't matter

alpine coral
#

o4-mini set to high vs o4-mini set to low are the same model (you know that lol).

keen beacon
#

Oh wait Dom claims they're separate models outright

keen beacon
#

I don't think it matters how it's specifically implemented

small haven
#

im getting goose bumps from thinking about o3 pro

calm sequoia
keen beacon
#

qwen 235b is 0.10 m/tok now omg

calm sequoia
#

m for mili or mega?

keen beacon
calm sequoia
#

You're running it locally?

keen beacon
#

insane value

keen beacon
calm sequoia
#

Thats crazy. One can even make 10x sampling with this.

keen beacon
#

yeah its so cheap now suddenly

#

OpenAI’s documentation notes that

calm sequoia
#

I run the 32B version on my RTX 4090 but it's still lacking. The moment 32b models reach the current LLM levels it will be game changer.

hardy pecan
#

I made o3 worky worky hard 12mins

calm sequoia
#

What does it mean by "wintout bringing in any extra"?

hardy pecan
#

its just part of the question I asked (cut off the answer)