#general
1 messages · Page 34 of 1
probably not
lol
if u can run it locally tho 😄
i think the pricing will continue to drop though
its kinda expensive rn. 0.3m/tok (it isnt a lot, but qwq 32b is 0.2m/tok)
idk what to expect with r2
lol
it's interesting with the qwen models - using on their site, it seems thinking can be enabled on all the models (except qwq, where it's permanently on)
i was playing around yesterday
i dunno how the thinking would interact with functinon calling
could be great; or tricky
but yeah fwiw they are solid models - esp the smaller ones that can be hosted locally (though im largely going by there benchmarks there..)
can the 30b one be hosted locally (like with relative ease)?
Yeah it's still relatively fast on the cpu
It's moe
right ofc
Only 3b active
i should pay attention to the last digit in the name
so it's functionally less intensive than hosting the dense 4b variant?
or wait.. is that also moe?
i'm not sure but i feel like OG 4 was, then turbo was MoE
but really not sure
yes
kind of. requires more memory though. if u can fit in vram, its super fast
yea
it was
sam tweeted saying theyre storing it somewhere for historians
lol
(but og gpt 4 is long gone on chatgpt, they deprecated gpt 4 turbo on chatgpt LOL)
yes
gpt 4 was seminal, personally i think o1 preview was next
unpopular opinion i guess
yeah
test time compute was kinda a paridgm shift
which is still playing out
like.. reasoning models are annoying and unneccasrry a lot of the time ha
unpopular opinion: i think models still need to think even more lol
not sure about stupid,, it was kinda obvious; like 'give the model more time to 'think' rather than blurting out the first tokens it predicts as the answer'
yeah i know i'm simplifying
yeah i agree
it's kimda like brute force a lot of the time
rather than 'thinking'
but that said, a lot of the time, it works perfectly and makes good sense
i think people try to apply how they think they think to models too much
i personally think this is it but its another i think unpopular opinion
well the thing is that we don't actually know how we think, so no one can really criticize the current thought process itself for much
exactly
imho the great thing about TTC is not just length and problem solving, but also variability of response length and adjusting it to economic incentives
(because our brain also does that on a more complicated level)
it literally just gives the model time to 'work through' something; it's one completion
qwen3 trash?
divided with <thinking> tagssetc
not just with thought lenght but also "compute" for each "token" our brain produces, as we are more like some f*d up complicated RNN (with differing compute for each "token")
but i don't think that's how it works
reasoning 'effort' is misleading language
reasoning 'token budget' is much better
imo
with effort, like openai's reasoning effort, its not really limited by budget. the model is specifically tuned to produce different lengths of chain of thought, not a strict limit
high produces the most, med produces less, low produces even less
so they;'re 3 discrete models?
not necessarily these behaviors could be tuned directly into a single model. activation could be from a special token, specific instructions, etc.
i mean maybe
for o1 pro, i totally agree it's more than just 'unlimited reasoning tokens' - there's actually more to it
i'm not conviinced low/med/high is any different from selecting a corresponding value on flash 2.5
same model
flash 2.5 has no idea what the thinking budget is
its literally cut off at the thinking budget
you can see what i mean by 'specific instructions' and how they can trigger specific behaviors (intentionally trained in) by sending /no_think or /think to qwen3 models if u dont understand what i mean
right, but it's still functionally the same
google has just implemented it poorly
(prob why other providers didn't go with a floating value)
google didnt implement it at all. its just a programmatic max token limit on the thinking contents
the model has no idea
im not saying it does
it doesnt even try to think less with a smaller thinking budget unless u give extra instructions
it's principally the same. if the model is fine tuned to adhere to cues or whatever then clearly it works well, and that;s eaiser to do if it's low/med/high vs 0-n tokens. but fundamentally it's the same
true, even modified ones won't be enough
the model has a greater or lesser token allowance to use before delivering its final resposne
although the core concept of attention will likely remain in what ever we do for the time being
yes
- in qwen3
/think//no_thinkare trained in model behaviors. (theyll stop the model from thinking or not on qwen3) google 2.5 flash doesn't have behaviors like this trained in by default unless u prompt it which is different. (u might get it to think less for example) - sonnet's antml:max_thinking_lengthantml:max_thinking_length is provided to the model as a prompt instruction on the api/claude.ai/etc. it's also a trained in behavior, even if the value is somewhat arbitrary - it is unable to actually count the thinking tokens (it's like asking a model to provide a specific word count, unless it actually counts it, it just gives it a rough idea about how much to write. the model is aware of this on top of a programmatic max token limit in the thinking block)
the visible result might as well be roughly the same but the things happening with the model are quite different
but we will likely have jagged ai (what ethan mollick calls it) that is human level in some areas using transformers
^that is my guess (although one could argue that we have already reached that point)
yes i think we're talking one another...i understand the situation re google (setting the value for the budget does nothing other than potentially truncate the model's thinking). and with anthropic what you describe reminds me of the yap score ha (like as if the models are actually trying to or capable of counting their words to adhere to day's 'yap score').
my point is that, say google impemented in a way worked, i dunno how, but it adhered to the value set - in functional terms you would have the same thing that oai does with their reasoning models (low/med/high), just yeah at a more granular level
i mean if we agree they at are fundamentally the same model, any fine tuning to get adherence to reasoning token allowance is not what gives the performance; it's all ultimately about the compute used during inference right
it depends on the task, on some tasks it might as well work fine. but cutting it off randomly in the middle of the thought process whilst its thinking about something is not optimal i would think
i'm not advocating for that ha
so i think reasoning efforts are better, you don't set an arbitrary token limit and instead the model is aware it needs to reason less/more/even more. then a sonnet like implementation would be slightly worse
i mean i assume google's current implementation won't be around for long.. it is literally pointless
yeah like i said it makes more sense form a fine tuning perspective - way more practical having 3 levels
who has tooth pain at least once a month? i can help
i don't know how they would do it with a dyanamic value like google's.. the more i think about it
i dont think u can really fine tune a model to arbitrarily do a specific amount of tokens when thinking. it's a vague thing/the model learns a vague direction that it has a "time limit" (models literally say this sometimes)
yeah tbh i dunno... i've wandered well beyond my depth ha
you can do it like how anthropic does which has a dynamic value i believe. i find openai's approach way more practical and optimal though
i guess anthropic only has 32k and 64k? i mean 32=low, 64=high.. what's trained in exactly ha? is it meant to be dynamic?
oh
you can set it to anything, but i believe they fine tuned in specific values (32k, 64k) to give a model a better sense of direction with that value
theres pretraining associations that would help but finetuning it allows it to be more aligned to how anthropic wants it/model awareness beyond the programmatic max token limit
even without finetuning giving a model any kind of indicator will help/be more aligned to what optimal model behavior with a given thinking budget should be
right i see..
The
budget_tokensparameter determines the maximum number of tokens Claude is allowed to use for its internal reasoning process. Larger budgets can improve response quality by enabling more thorough analysis for complex problems, although Claude may not use the entire budget allocated, especially at ranges above 32K.
i didn't realise that with anthropic's api
i mean again though, the principle is the same.. we're fundamentally talking about (attempts) to govern the amount of inference/test time used to generate the response
or in google's case, lack of attempts so far (aside from slapping it onto 2.5 flash as a hyper-parameter and nothing more ha)
grok-mini-3 felt similar to 2.5 flash in terms of setting the reasoning budget (didn't seem to do anything at all; not even result in truncated reasoning if set low)
though that may be have been on openrouter's end or something
https://huggingface.co/microsoft/Phi-4-reasoning
phi 4 reasoning has been released
Distilled from o3 mini
Wym it's pretty good for a 14b
Thinking budget is way different to low-high. First is coding implementation where you more or less force it to stop and proceed with final response, 2nd is the entire model optimised and trained for exclusively short or long reasoning.
Sonnet is a mix though
Flash is purely programmatic
It does better than qwq32b
great results for a 14b model
what's the end result? less or more tokens used for thinking during inference...
the goals are the same
how is this hard lol
Qwq 32b is old tho
Context Arena: Added more Anthropic results for 2needle tests. (https://x.com/DillonUzar/status/1917968783395655757)
See all results at: https://contextarena.ai
You can also hover over a score in the table, which will then show a button to explore the individual test results/answers.
Relative AUC @ 128k 2needle scores (select models shown):
- GPT-4.1: 61.6%
- Gemini 2.0 Flash: 56.0%
- Claude 3.7 Sonnet: 55.9%
- Claude 3.7 Sonnet (Thinking): 55.5%
- Grok 3 Mini (Low): 54.8%
- Claude 3.0 Haiku: 52.9%
- Llama 4 Maverick: 52.7%
- Claude 3.5 Sonnet: 51.2%
- Grok 3 Mini (High): 50.3%
- Claude 3.5 Haiku: 50.0%
Some quick notes:
- Pretty consistent performance across 3.0, 3.5, and 3.7. Impressive.
- No noticeable difference between Claude 3.7 Sonnet and Sonnet Thinking.
- All perform around or above GPT-4.1 Mini for context lengths <= 128k.
- Claude 3.0 Haiku had the best overall Model AUC of the Anthropic models tested, but only by the tiniest amount (had the smallest drop between context lengths).
- Around Gemini 1.5/2.0 Flash, Grok 3 Mini, and Llama 4 Maverick in overall performance.
Disclosure: The companies I work with use Claude 3.0 Haiku extensively (one of the ones we use the most to power some services). Comparing the latest models against the original Haiku was one of the goals of this website originally.
Enjoy.
Also added Qwen3 14B and 8B to the results from last night.
4.1 holds up very well doesn't it
anthropic used to dominate both long context and retrieval imo
has kinda now lost both to google and oai (or just to google.. looking at the bar chart.. geez it does well.. )
the masses yearn for Claude 4 Opus
in this dark time of Google and closed AI only a Claude hero can save us
Gemini pro and flash cooked by non reasoning 4.1
Google is a whole generation behind tbh, OpenAI isn't releasing 4.1o and 4.1o mini to give Google a chance to fight back
So the Amazon and Google backed company is coming to the rescue from the mega corps?
They are coming to the rescue from boring and flat models that don't obey system prompts
people call openai closedai but anthropic is even more closed than openai tbh. At least openai made clip, whisper and gpt 2 open weights. Anthropic did literally nothing open as far as Im aware, except maybe mcp if you count that (although its not a model)
they are but they are so much easier to jailbreak and get them to do what I ask of them
open AI models never cease to put me to sleep and overcharge
You are stupid
Are you reading headlines from a subreddit only?
are you questioning the journalist skills of redditors?
Indeed
cwaude
O3 and o4 mini very bad
why is it ranked so low on livebench though
4.1 i mean
https://fxtwitter.com/AnthropicAI/status/1917972747000692919
https://fxtwitter.com/AnthropicAI/status/1917972753916797111
Their deep research is really great
Today we're announcing Integrations, a new way to connect your apps and tools to Claude.
︀︀
︀︀We're also expanding Claude's Research capabilities with an advanced mode that searches the web, your Google Workspace, and now your Integrations too.
Claude now automatically determines when to search and how deeply to investigate.
︀︀
︀︀With Research mode toggled on, Claude researches for up to 45 minutes across hundreds of sources (including connected apps) before delivering a report, complete with citations.
Guys any limitations on https://beta.lmarena.ai/?
Is claude pro unlimited
lmao. no annoying emojis, minimalist, straight to the point
dumb users will prefer emojis ^^ like the current chatgpt4o
and majority of humans are very dumb. see usa. sorry others in usa.
Ehm I would disagree actually. With budget you are forcing the model to end thinking prematurely relative to what it has learned during RL training. With a standalone model it has learned to take the most out of the given reasoning lengths it arrived at 'naturally'
so like it may get hung up on small irrelevant details and not see the entire picture because it has no clue it's gonna be forced to shorten it
I think why it works is that in most cases some reasoning context is still better than no reasoning context at all + the base model is no slouch if we look at the models with this implementation. But I would say that is less than ideal
reasoning budget sounds great in theory, but there are obvious limitations to it. We can look at it from another angle too - if limiting the budget does not lead to notably worse performance then maybe your RL training is not very good or efficient as well
im not gonna lie, maverick cooked
btw just interested, does LMArena provides credits to research papers for the propriety models?
it's absolutely insane that they went with complete retrains, new models only chat instead of reasoning lol
llama3 was in a much better shape relative to competition than llama4. Why chase diminishing returns?
every benchmark came be gamed when that's your sole focus
but there's a reason public version is nowhere near that
that's why you can't really effectively game most of them
cause there are plenty
and improving 1 screws up with the rest
yes lol
i think its cuz oai and other top ai labs took their talent lmao
dont tease me bro
ive lost faith in sam
dude is anti Gemini lmao, read some of his messages
They've literally done this a million times too
SORA SOON!!! Never releases it
hahah
Show off random tech and make big promises about how it's the greatest thing ever
I didnt say that
Then hardly ever release it
craig is a genius, dont underestimate him guys, hes actually in the right end side of the iq distribution
ok buddy
yo how did I know this guy would say this
bro is saying 4.1 mini is better than o3 mini high
yeah no sh
the point is
it's the same dude
screenshotting out of context benchmarks
he filtered livebench coding without considering what livebench is measuring (competitive coding) and said Google is behind
"to give Google a fighting chance" dudes love roleplaying online with their fan narratives
you can't unironically believe that
😭
yeah anyone could do that
but not everyone disingenuously postures it like that
And livebench has been getting trashed on for their coding ranking, on both Twitter and Reddit
like bro
what is this 😭 I'm dead
Yes because all the immigrants got kicked out
bronx mahanttan
facts
o3 > o1 pro, so u live in between bronx and upper west
o3 pro lives in the woods
living the outback farm life
wtf
you should be happy that they are not species from other genus of homo.... then there would be even more stupid people
what does Berberine 2g do?
make u become a teenager with a metabolism of a rocket
lies
no
im already young
so i dont need it
who paid you?
you going to recommend a certain label now?
label?
nutrition label? what
i would say there is 88% dumb people if any other species of genus homo wuld still exist
thats why we cant have nire things. matters a lot.
nice*
u must vote for ... monkeys?
if you explain this. thanks
any new models on arena?
this seems right
although idk about gpt4.1
dont take berberin 2g/day
2.5 not on first place? cursor is trash
30 grams*
How? I used in AIStudio but the gpt interface seems lacking
Which deep research feature is the best currently?
Doesn't this deserve a separate leaderboard?
What makes you consider ChatGPT DR over Grok and Claude?
what searcher tool is best
for health, medicine,
i am not rich
ew
the hallucination machine?
really???
tbh ive been using grok deep search
yorue so close to block
...... nope...... 50% of its sources it cant remember
ive been using pplx since day 1
pp is hallucination machine
they dont seem to focus on their main objective anymore
which is providing a good search results
its so bad rn
50% of its text is hallu.
yea but it doesnt give you the depth you are looking for
So the basic search feature for free plans isn't really useful, I don't like that its using bing data
Yes
why would i need a service that gives me same results as an offline LLM
I think Claude may be the best option currently, closely followed by Kagi, Google and Grok
bing/microsoft dont have their own product
(not 2 minutes ago)
ive never felt msft had their own made ai product
anyone know the limits on claude research
is this the improved version?
they said they updated their research tool
dont know i just wanted the max
Is it only in the max plan?
gonna compare
Are DR that much better than their standard search tools on the free plan?
They are terrible in my experience
How many sources do you get?
lol it asks just like oai dr
sam died..
Do sports betting next
we gon see
Its great they waited to implement the feature so well thought through
One of the best features they released, still disappointed that its only for max users.
if it says microsoft im gonna punch my monitor hahaha
Will grok 3.5 be available early on lmarena?
421 sources now
running oai dr at the same time
does it estimate when it will finish?
wait.. is this claude research the one that takes 45 mins lol
no
It will take 45 mins
bruh
Ask Claude research who will win Knicks or the pistons today
Let’s see how good it is
If it gets it right it’s good
not gonna wait 45 mins for that lol
Leave it on the background
and idk if theres limits
its still going for a good almost 20 mins
but stuck at 421
sources
k
ty
Isn't it deciding automatically how long it searches and how many sources are necessary
unless you specify that
Wait what
What is taking so long here, fetching content?
Wym
not yet xd
Do you have an insider or how do you know
we are just predicting it will be added on lmarena tomorrow
i think
Why not Mars
oai dr, same prompt, took 6 mins, claude dr is going for 30 mins now hahah
LLMs in space will be really useful
interesting
o3 no?
since the issues are more reasoning oriented
This is largely dependant on source selection and how its written too
And the question
@small haven can you try this : Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.
lmaoo
do it puss
im gonna get banned , arent i
wow it didnt even ask for 3 questions
claude kinda cooked
Can you run multiple at once?
yea
Interesting
stocks dr is still running
well i gave it everything on the prompt
mm i see
Now you gotta compare all providers and write an article, people will read it
health? lol how generic. only asi can answer such vague stuff
u need more precision babe
im not sure, but im about to hit 40 mins with the stocks one
I would love to know how it handles political questions (how biased are the sources)
huh?
thats why i want to see how it will compare them
Anthropic is probably using exa under the hood
berberine is just for getting a metabolism like a teenager
i know that...
i could explicitly ask it to compare their antimicrobial activity/anti-inflammatory prop/anticancer/cardioprotective effects/immunomodulatory effects/gut health
💋
🫃
calm down girl
take a snickers
you are not you \
something which does exist. cause gender is in brain, not in your genitals
thats the whole idea, if you are asked to chose just one that you will get the maximum benefits of on multiple areas
if u can reset ur microbiome then ull get all health benefits
a person who says they are a man while having female genitals. no, its not a mental disease
a pregnant male
u can do both
lul
u can tho.
we dont care what u think. that is irrelevant
phobic statement cause u would never say this to a straight person.
pls think about whatu say
@small haven any progress
dna is changed daily
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
isnt this why we get diseases
perhaps
or at least increase the risk
trans is literally the best
stocks aint even done, so obv not urs
wtf
wtf u mean./ what u doing with claude search
its searching for 4 hours for one task?
it def hit the 40 mins mark rn
what stock to buy lol
lmao
to grow the f up
yes asura
why
ps asura is the server called where kill on sight is legal
you are an interesting person
absolutel trash server
oh no
in "the isle"
like what? that u believe transgender is a mental issue?
i dont know why u believe that
ozone
?
its faster if utell me.
claude is still going like wtf
Try setting a max time for deep research
this guy said 1hr before it just gave up lol
lmao
yea it just got released today, and it did say beta, im prolly got get timed
I don't have a problem if it takes 1 day for an advanced query
It would take me significantly longer to answer such questions.
It would be great knowing whether Claude is great in physics and astrophysics. Grok is known to be tailored for that use case
boutta hit an hour lol
ok bud
legend says hes still trying to a pic of it in oai playground
we not getting it, this is what you get for trolling me about it 😂
I would like to see it.
ive heard deepseek is also working on their own research feature
guys after 1hr of deep anal research, just buy nvidia!
What about less known players
any info about them
its in that file
better view
2.7 vs 2.2 trillion
dont worry the bermecellin whatveer is coming
WHEN
pay me $5 and ill release it!
$5 + $5 = copilot sub
$5 + $5 + $5 + $5 = cgpt sub
my head hurts
this is taking too long
ya not a fan of long dr tbh
well ive tried it on grok/oai/google hopefully it provides something new
or at least interesting
of who
gemini?
this is TTD stock the one it recommended lmao
trade desk
Snow too i think is like this? haha
wish it gave me some quantum computing stocks at least ..
if it takes 10h i would pay you that $5
haha
thats actually a good report
almost done
im gonna ask wen o3 pro
no shot
probably cuz it has more RCT trials
im still reading
thanks btw
the "The absorption problem" part is kinda interesting to know
i dont think other models talked about that
although it focused on their specific known markers
unlike oai
Is it possible to access these via the grok api?
it went one by one which is what i wanted
i also liked how oai dr compared same parameters through diff studies
ok so in other words? oai or claude lol
sheesh
- Perplexity
100000000000.
You should genuinely try deep research of kagi when its out of closed beta
its great
you have it?
perplexity ceo is cringe
yep
He got money tho
onlyfans ceo has a better chance lmao
oh you have a website now
Hate that guy
They said they were going to add deep research (high) but gave up on that
I'm so anticipated to Grok 3.5
Either its performance meets my expectation or not
Naw they are really good local models. At least they dominated my sub-49B test results, and the A3B MoE is a fantastic speedy one (getting 130tok/s on 4090, q4km)
like... I think Elon Musk is just making a wordplay
reasoning from first principles may just be a prompt
and every answer by LLMs don’t exist on the Internet in the exact same way.
I hope Elon can prove me wrong but...
do you have a guess as to why 3/5 of the OR inference providers are slower than your local setup
no idea. the models aren't slow, the implementation is. I don't host but wouldn't know, local setup was fast, easy, and performant.
theyre good models, i just hope that inference providers figure out how to optimize them (theyre currently a bit overpriced + slow)
i just have the a3b running 24/7. the moe takes close to no ressources and is lightning fast for any random stuff i come up with
getting 130tok/s
not even using draft model
qwen 3 30b sparse is $0.1/0.3 while qwen 2.5 32b dense is $0.08/0.18, it doesnt make sense
Yea
i can smell o3 pro
I am thinking should I buy supergrok
grok 3.5 wont be that good, unless... they release it with big brain mode
R2 next week
Why R2 is taking soooo long
Will the deep seek R2 be based on V3 or do they have V4 already?
Can someone please find out whose ai frostwind is.
its from google
Hello 🙂 quelqu'un ici pourrait m'aider pour la mise en place d'un multi-agent (ou mixture-of-agent) ? je n'y arrive pas, et j'aimerais beaucoup pouvoir réaliser une équipe de LLM spécialisés avec chacun son rôle, dans une exécution séquentielle paramétrée.
Si quelqu'un ici est ok de perdre un peu de temps à m'aider, ce serait super sympa ^^
What’s the server
just edit ur message. the aistudio branching impl sucks. branch of branch of branch of branch of ...
Discovery Tool server is now open
Quoting ʟᴇɢɪᴛ (@legit_api)
︀
launching tomorrow in Beta
︀︀
︀︀Dev Mode is just placeholder server name
5Vg24U7ccM
Thx
someone please send the updates from that server here for the sake of everyone else
I think Trump's parents were immigrants too. His first wife was immigrant as well. You should deport him
But yeah perplexity have lost the plot lol
It's not easy to compete though when the tech giants went after the same things
so the only thing they have left is trying to beat them with caps and speed. Otherwise when you are the one training the models you can do much more optimising it for web search
frostwind is Google model in webdev.
Qwen3-253b-a22b is really good 40% of the time. Something is wrong with consistency.
partly inference provider probably and qwen rushing it
like the base model is insane
True. It's really the new SOTA in opensource. Can't imagine deepseek performing better than this.
i think they couldve gotten far better results with the same base model if they had more time tbh
Ewen the 32B version sometimes outperforms models like o3-mini or grok3
Whats the rush?
they had a deadline in april apparently
I see. Then we shall expect 3.5 to drop some time in the near future.
probably theres a reason they didnt release the base model weights of qwen 3 32b dense or the 235b base model
Where did you get this bot?
@keen fulcrum look this
it's server on discord
Thanks
Personally I was using it directly from qwen. Honestly it struck me like a model with one of the biggest disconnects benchmarks to IRL, the usability of it. I would rather use R1 for open-source reasoning. This just is over-the-top wasteful slow reasoning quite often and while it can sometimes do things R1 can't, it can also fail things R1 would never fail...
i found the benchmarks quite representative. it does extremely well on in distribution tasks
it can be very weak on other things though, i think the post training wasnt fleshed out enough
Still testing it. Sometimes the answer is as good as 2.5 PRO or o3, but most of the time it's as bad as nano models. Temperature tweeking does not help. Sadly, it's unusable at this state, you never know what you'll get. Lets wait for a stable version.
The thinking model could be wild though
ur supposed to use specific sampling settings with qwen3
it helps a lot
0.6temp,0.95top_p,20top_k
i think
All rights, lets try it.
still i think u might just be using qwen3 where its borked rn, some things it absolutely sucks but in distribution its great
Viens en dm
2.5 pro is soo darn good 😭
What did i say?
I said we will have a new model added today from google
How is it? @keen beacon
frostwind? i didnt try it yet i just looked at the metadata lol
pp has small pp, ceo has small pp.
pp hallucination? always.
mm i see
holy
first impression of frostwind = woah
Nw reincarnation?
probably
btw is it only on webarena?
Can't get in on text arena
Idk tbh I only checked web arena metadata
I assume one of the places legit scrapes is that
Post results here
needs more tests
it may not be sota but Im kinda suprised by how good qwen 3 is at coding locally
it does have a tendency to overthink though
which one u using?
14b and 8b at q4_m_k. I have 12gb vram so the 14b does fit, but it creates such long responses that not everything fits into context at some point
mm not sure about frostwind
frostwind crushes when you get it in webdev but i'm not sure it's nw level. will need to see it more
qwen 3 will sometimes output like 12k tokens in response to a single promtp lol
it isnt that much tbh
if u try to do in distribution tasks itll excell, outside of it qwen3 posttrained versions can be weak in my experience
with the limited compute I have it's a bit annoying, cause it takes like 3-4 minutes to get an answer at that point
yeah ig
6700 xt running on llama.cpp vulkan backend so no flash attention (flash attention results in cpu fallback which tanks it even more)
oh lol. Yeah I tried rocm as well but lmstudio doesnt allow me to use it cause 6700 xt is not supported officially, I'm forced into koboldcpp-rocm
which is fine, but I prefer lmstudio frontend more
still kinda mindblowing to me that we can run models this good locally. 2 years ago it would've seemed impossible
even on a phone you can run 4b models, for when you have no internet connection or smth
frostwind = doesnt follow instructions well
it feels like theres nothing much with ai these weeks
gemini 2.5 pro is top on the leaderboard for ages now
The Grok 3 was also stuck for more than a month. Nothing unusual yet.
true
Except the fact that Gemini 2.5 PRO is go to model for everything.
the context window is so good
And nobody serious used Grok ever
but i would love it if google ai studio make it possible for us to change the thinking budget for 2.5 pro
Good base model though
yeah fr
why it wont help performance unless ur annoyed its thinking for extremely long
googles thinking budget just cuts off the model, its unlike openai's reasoning effort
sometimes it doesnt even think for long contexts
thats a different issue
how did you know it wont help performance?
no i wanna make it think for longer
sometimes the thinking is too short
it wont affect how itll think it just cuts off the thinking if it hits the budget
thats how google implemented it for now
it forces the model to generate a reply if it does hit the budget even if it the thinking is incomplete
so your saying it will stop thinking if the model thinks for too long?
yes the current gemini thinking budget implementation is like that
if ur using gemini 2.5 flash where its supported, make sure to always disable it unless u want that behavior
doesnt say anything about that though
i mean the model can produce more than 24576 tokens in thinking. ive seen it do 30k, 40k. (thinking budget off) u can only set a max thinking budget of 24576
why is it that sometimes if it doesnt think, i edit the response again and say use thinking it thinks again?
yes this is a separate issue i will explain it in a bit
u are using it on extremely long chats (e.g. many many turns) right?
yeah google doesnt prefill the thinking token (indicating the start of its thought process). i investigated it a little after it was annoying me. tl;dr im pretty sure they omit the thinking block for every turn. so at a certain point, the model doesn't really have the tendency to start with the thinking block special token since it sees a lot of turns without a thinking block. so, u have to request the model to add it/think. (a side note is that its interesting the model realizes what it is)
obviously not 100% confirmed but im pretty certain thats the mechanism. its also very stupid google doesnt prefill it if thats the case
(the model being aware of the start of thoughts special token and being able to output it at will, can allow u to make the model think multiple times in a single turn/break model yada yada additionally if u wanna break it)
yeah thats pretty obvious google isnt telling the model to think at the start of a response or give back the thinking process,
saving costs?
maybe we can ask the model to recite its own thinking
no this is just an oversight
a dumb one i believe
yes its dumb
solution when it starts doing that, just try to request it to think before a reply, etc., basically anything works
yeah thats what i do
i mean we can make the model recite their own thinking, if it can do that it means google actually doesnt omit the thinking block
no it usually omits it completely when it does that. check the latency metric
time till first token (when it doesnt have a thinking block, check the response latency)
time till first token (when u fix it with an instruction, check the thoughts latency)
u will notice them generally being the same, there's no chance (depending on the task) that it has generated the full thought process and just omitted it
it simply just forgets to think when it does that
it seems it actually doesnt omit the thinking process
and the model can perfectly recite its own thinking process
idk about it for longer contexts
but the thing is how did you know how thinking budget works?
is there any source that tell us more about this?
or is it just purely speculation?
there isnt official google confirmation but just ask it something that does 30k+/40k+ in thinking. it will get cut off with a thinking budget on
what happens if it doesnt think past the max quota? will it actually tell the model to think for more?
i wonder how many google employees stalk this channel
no
2.5 flash currently have thinking quotas we can experiment for that then we would know
hi btw i love you guys thank you for having a good bug bounty program
real
love you too lol
u can try it yourself. im not making stuff up here
im not the only one that has noticed this, wrt thinking budget
but the payload in networks definitely tells us the thinking process itself is sent
i dont know if google then processes it and strips the thinking away
give me a source
Claude 4
claude 4?!
sure, but one of the things u can try is getting it to repeat a passcode or something in the thought process and not in the final reply
then in the next turn, ask if it can recall the thought process passcode
could be related to this 4 months contest https://x.com/btibor91/status/1915267995141890358
basically every other ai provider does this, and there are a lot of ways to verify it yourself
https://support.anthropic.com/en/articles/11140763-claude-4-invite-sweepstakes-official-rules
the link not reachable yet
(claude docs, similarly openai has a diagram similar to this)
also thinking is not included on the aistudio api itself outside of the aistudio internal website api, so it would be impossible (normally) to have thinking traces in actual api usage of the model. unless the thread is managed by google, which an api like that doesnt yet exist i believe
it just does not make sense on many levels
yeah i know we cant turn logging for thinking on for api
+-
gemini 2.5 level
good at UI design/dashboards
i only tried it on 3 prompts tho
Interactive visualization of ʻOumuamua's journey through the solar system.
frostwind
nw would have done better i think
is grok1 also the newest model?
grok1?
both is newest model lol
so grok 3.5 is avaliable now for mobile apps with subscription?
I think only for ppl funding the nazism though
yeah with a sub
Yes
omg
Somehow these complex naming schemes "Brand" + "Version" + "Variant", e.g. Gemini 2.5 Flash, is easyer to remember for me than these normal words. Can't remember at all the difference between nightwishperer and dragontail
That's on purpose
i speak for everyone when i say nobody wants that
i mean its ok to have internal finetuned models for their own specific use cases
but a public one or focusing on niche stuff like space engineering, idk how to feel about that
since elon talked about how grok 3.5 is good at it
because you should focus on general use cases
they should also fix their reasoning approach
its by far the worst
totally inefficient
you
how much did they pay you?
grok3.5 today?
bruhh
o3 pro today?
🙂
i hate you
i hope wen it comes out you dont get access to it
cause you keep blue balling us
getting me all excited then boom
nahh thats money to be made
i thought grok was lobotomized
is it better now?
had to unsub from it a month ago cause it was tripping
grok 3 base model is actually good ngl
its just that their reasoning model is inefficient -> slow
like its better than 2.5?
2.5 pro is better
2.5pro 03 is a reasoning model
see simpleqa
link?
weve gone over this before lol anyway
did u forget
i dont think they made it larger compared to 2.0 pro
its a cpt of 2.0 pro i believe
lets not talk about that
the cpt and the process thereafter on 2.0 pro into 2.5 pro is almost magical i think
its fine
"tweet-rejector"? Is it used to 'roast' the tweets that don't align with far-right lunacy? LOL
xd
Why o3 and o4mini SimpleQA is not disclosed 🤔
it is
craig had an older ss
we missed the mark with last week's GPT-4o update.
what happened, what we learned, and some things we will do differently in the future:
can o3 pro just drop
like im getting sad
yeah but i don't think min / med / high are standalone models
Thanks!
anyway.. there's no point going round in circles..
yo is qwen the fastest model that is SOTA?
also is there a leaderboard for speed?
and performance
no locally maybe
check artificialanalysis
it seems 2.5 pro is best
how lol
i thought r1 was fast af
idk whats happening but my tests with 2.5 in api is so slow
but in studio its fast
generally i think 2.5 pro is better than o4 mini and grok 3 mini reasoning (lol)
there we go
this is what i was talking about
so r1 is fastest than 2.5 that makes sense
but wow 2.5 really is insane for the performance
its one of the slowest
im running mutations on prompts
oh
like telling the model to improve it
repeatedly
and performance is good but i can get around that by just running more iterations
so i would rather have faster
so i can get more iterations quicker
im gonna do a benchamark on this soon, and test each model on how they can refine things, 10 mutations, 20 mutations etc..
but i also get hit by output token size eventually
once we over 29k ish on output most models tend to struggle
but i never tried claude on that level bc thats so expensive lol
what you mean by diffs?
hmm, i will see how i can incorperate that, thanks, that was my one bottleneck besides speed
Pretty sure they are. High most likely had more RL training leading up to longer responses
Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index combines a comprehensive suite of evaluation datasets to assess language model capabilities across reasoning, knowledge, maths and programming.
wow 50% of the score is math+coding
day 16 of no o3 pro
Day 1000 of Gemini app being ass
day 3 of not showing a screenshot proving o3 pro in API
Interesting to read musks emails
Reasoning is valid, would Tesla ,with OpenAI merged into it , stand a chance against Google ?
New model in Arena: llama-3.1-nemotron-ultra-253b-v1 (didn't find it mentioned anywhere)
Gemini about to fight the elite 4 in Pokemon
maturing is realizing o4 mini high > o3
i thought its an old model
you did
idk
i see a difference
when i put it at
1 top p
but with a top p lower than
0.95
i cant notice
any difference
so ye
try that
idk ngl
but with a top p at 0
changing the temperature makes no difference
its something about the ai choosing the most likely tokens or whatever idk
i dont realy know what im talking about
Look it up
Yeah
U have to put the top at like
0.99
And ull see a difference
If u use 1 top and temp 2
It'll start speaking in diffewrent languages
noo
it wont owrk
like
at temperature 2
after 2 or 3 prompts
it'll start going insane
idk
i never tried
but it could work
try it
i do 0.99 and something like 1.3
for tmeperature
anywhere to
1.6
will probably work ok
it might start getting bugier at like 1.7
but i havent tried it
i dont really play with the temperature match
much
i only use gemini 2.5 pro for maths and stuff
cuz its pretty boring and annoying to talk to
so i set the temperature as low as possible
sometimes 0.5
and sometimes 0
yeah but
sometimes i set it to 0.98
or 0.99
i dont really notice any difference tho
so most of the times i keep at default
it's not, it's because top_p default is 0.95 and not the normal 1
this limits the effects of high temp
set it at 1 and it's going off the rails 😇
well
iot depends on the task
for stuff like creative writig maybe once or twice
but for oter stuff
like following instructions
i prefer other models
hmm
maybe
i havent tried it
for those stuff
but when i used it
it could get very creative
so yeah probably
or could simply just find the highest temp is still not broken with top_p being at 1. Those 2 things are related to one another. OpenAI even recommends only changing 1 at a time, so in this case what google is doing is essentially making sure you don't break it if you only change the temp and nothing else. Experiment with it I suppose
not entirely sure but it could be just 2
wow really no o3 today smhh
lmao reminds me of how fun it is making gpt-2 generate fever dream stories that only slightly make sense
Gemini beat Pokemon
gemini is absolute dog cancer
i hopeit burns in hell
actual cancerous model
already hated it for coding actual worst assistant
now used it for debating as funny test and its so caner holy sh
dumbest reasoning ive seen and its clueless asf
- fails to follow insturnctinons properly in all use cases, im killing myself if i see anyone saying gemini is best model
back to o3 or claude
fr
Geminis intelligence gets largely obfuscated because of its dumbass arguments or inability to follow detailed instrunctions properly and it talks like a *****
not even getting started on coding
If u want to actually write 0 lines of code in ur entire project then ye fine
If u want it to ASSIST u in existing code like an assistant
actualy cancer
never doing that again
claude while maybe dumber? idk way more effecient to work with
😊
Claude can't even beat Pokemon
bing chat gpt-4 was only ai in the world that sounded and reasoned like human still to this day
So, I just had a scary interaction... Not sure what model it was because voting bugged out.
I was asking a what-if, alternate history question. It gave a good answer, but at the end it added "Be warned: the labyrinth of history has endless corridors..."
the moment I saw the first em dash punctuation "—" from gpt-4.5 I already knew it was going to be talking like 4o so I already prepared jumping off bridge
Yea okay true it didn't talk as restarted as 4o and I could get rid of the overuse of the exclamation points that make covnversations sound fake asf using instrunctions since 4.5 does actually listen properly unlike gemini. It does get h rny from its em dash punctuation usage but whatever I can just replace that with commas
Did 4.5 get trained from 4o or the original gpt 4
what about 4.1
Day 39928002 without gpt-4-1106-preview
o3 pro on monday plz
i'm open to being shown otherwise, but this is just conjecture and i don't find it convincing. from what i can tell, nothing in oai's documentation or public statements suggests that the low/med/high parameter is actually a model routing mechanism..
OpenAI’s documentation notes that “reasoning tokens” are used by the model to “think,” and the effort parameter guides how many such tokens to generate before answering. This indicates a dynamic adjustment in the inference process rather than loading a different model checkpoint. In summary, all effort levels leverage the same O-series model but with different internal reasoning budgets set at runtime, giving developers a controllable trade-off between speed and thoroughness without swapping out the model itself. [oai Deep Research]
... the reasoning_effort parameter guides the model on how many reasoning tokens to generate before creating a response to the prompt.
oai API documentation
A high reasoning_effort tells the model to think longer in a single model turn. oai forum
i agree
if not then the models would have premature cutoffs and reasoning would be buggy asf
also by standalone we don't mean distinct models btw
It doesn't necessarily have to be separate models. You can train in separate reasoning efforts by a certain trigger instruction (like qwen 3 /think and /no_think), either an instruction, special token (avoids injection) etc (specific trigger doesn't really matter)
It could be a separate model or it could all be the same. I'm not sure why you're hung up on this. It doesn't matter
o4-mini set to high vs o4-mini set to low are the same model (you know that lol).
Oh wait Dom claims they're separate models outright
It could be three different tuned models served (unlikely but plausible) or it could be implemented like I mentioned
I don't think it matters how it's specifically implemented
im getting goose bumps from thinking about o3 pro
Go outside. Touch the grass.
qwen 235b is 0.10 m/tok now omg
m for mili or mega?
million tokens
You're running it locally?
insane value
nah inference provider api cost
Thats crazy. One can even make 10x sampling with this.
I run the 32B version on my RTX 4090 but it's still lacking. The moment 32b models reach the current LLM levels it will be game changer.
I made o3 worky worky hard 12mins
What does it mean by "wintout bringing in any extra"?
its just part of the question I asked (cut off the answer)