#Grok 4.3
grokmaxxing
Grok 4.3 api is here https://docs.x.ai/developers/models
unfortunately same price as 4.2 :/
xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20

The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing cost to run the benchmark suite.

Key Takeaways:

➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level

➤ Large increase in real world agentic task performance…
@tender path sorry for the ping btw
eyy
wait what they lowered the prices
on 4.20
yoo grok 4.3 locked in dawg???
4.20 used to be 2/6, now its cheaper
grok 4.3 for coding ahh tho, decent for agentic use ig
ya sorry we didn’t get a heads up and i was busy for a while
don’t feel bad for pinging
appreciate you
I mean, those throughput numbers are pretty solid if they're accurate
Not being able to control reasoning is interesting
Aaaaand we're down to the 40-50tps
theres no way this is a 500b param model
you think it's more or less?
way more
i feel like 1.5-2t parameter would make sense for the performance
unless they’ve actually done some real llm science but i kinda doubt it
i feel like they are just getting most of their perf with the amount of compute they have
very impressive model so far honestly
has anyone tested vision performance with it?
If this is true kind of insane pricing they have for api
Maybe Im just too used to being slapped on the ass by anthropic
well g3flash is like 1.2t
But its priced the same as 4.2 which was 500b right 🤔
if so, distillation is going really damn well for them because this is insane performance for 500b tbh
For aa is it not just slightly higher than qwen 3.6 plus which is 400b?
Or are you talking about actual testing
Guess I can try it
"You should drive." 👍
lol
as much as i love vending-bench 2, they sometimes have some very different results than what a lot of benchmarks say
could be in the prompting or harness, not sure
i have noticed a tendency for 4.3 to be a bit more "lazy" though, so this could have to do with it
It could also be that whatever optimization or streamlining xAI does is harming the specific set of weights responsible for tasks similar to vending-bench.
I'm finding that Grok 4.3 is a regression compared to Grok 4.1 Fast (Reasoning) on my social deduction benchmark 🫣
Also had an unusual preference to wait around sometimes
What bench is this?
Grok 4.3 lands in quite a great spot on the pareto frontier based on Artificial Analysis data
It has a very low hallucination rate, and answers questions correctly at a rate comparable to DeepSeek V4 Pro
even though it has a higher sticker price than V4 pro, the token efficiency ends up making it cheaper to use in the long run
but if you prefer a non-Grok model, MiMo-V2.5-Pro also performs around the same level
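rough sketch of the sticker price vs. token efficiency point, in case it helps. All prices and token counts below are made up for illustration (they are not real Grok or DeepSeek numbers); the only claim is the arithmetic: what you pay is price times tokens actually used, so a pricier model that uses fewer output tokens can still come out cheaper.

```python
def run_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of a workload, given prices in $ per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical numbers only: same input size, different verbosity and pricing.
pricey_but_terse = run_cost(
    1_000_000, 2_000_000, input_price_per_m=1.00, output_price_per_m=3.00
)
cheap_but_verbose = run_cost(
    1_000_000, 5_000_000, input_price_per_m=0.50, output_price_per_m=1.50
)

print(f"higher sticker price, fewer tokens: ${pricey_but_terse:.2f}")
print(f"lower sticker price, more tokens:   ${cheap_but_verbose:.2f}")
```

with these made-up inputs the model with the higher sticker price ends up cheaper per run, which is the effect described above.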
my goat
i love this chart
Me too. Is it online? Where can we find it?
thank you! it's not online yet, but i am working on an interactive site to view the chart. should be up soon 🙏
Here: https://clocktower-radio.com/
Not sure why it's so low - maybe my benchmark is flawed.
An LLM benchmark testing the limits of AI reasoning and social intelligence through autonomous games of Blood on the Clocktower.
Cool that it's Pareto but kind of embarrassing for them that it's basically just on par with Chinese SotA
grok models always land in a weird spot performance wise
4.1 fast was pareto efficient before gemma 4 31b was around, and 4.20 was the top pareto model for like a day lol
the main benefit of 4.3 vs. MiMo though is that it generally has more world knowledge in testing, which can be useful for general chatbots (node size on the graph represents the AA-Omniscience Accuracy benchmark, which measures accuracy on general world knowledge questions)
won't make a huge difference for agentic tasks though
I'm inclined to respect this bench because my favorite model gets 1st 😎
Hmm
This model is interesting, though deepseek is better than it at coding
@dusty current you inspired me to do something similar with my own data (cleans up an existing graph I had):
feel like this one is a bit more skewed towards the weaker models side
but yeah pareto frontier graphs are just so useful
mebbe cause I haven't benched 5.5 xhigh and Opus 4.7 max
but those are ridiculously expensive
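for anyone who wants to build a chart like these from their own results, here is a minimal sketch of how a cost vs. score Pareto frontier can be computed. The model names and numbers are placeholders, not real benchmark data, and this is just one simple way to do it (sort by cost, keep every point that beats the best score seen so far), not necessarily how the charts in this thread were made.

```python
def pareto_frontier(models):
    """Return names of models that are not dominated on (cost, score).

    A model is dominated if some other model costs no more and scores
    no less, and is strictly better on at least one of the two.
    Exact duplicates keep only the first occurrence.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; on equal cost, put the higher score first.
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical (name, cost in $, score) points for illustration only.
points = [
    ("model-a", 1.0, 40),
    ("model-b", 2.0, 55),
    ("model-c", 3.0, 50),  # dominated by model-b: costs more, scores less
    ("model-d", 5.0, 70),
]

print(pareto_frontier(points))
```

anything not in the returned list has a cheaper or equally cheap alternative that scores at least as well.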
i think the results actually make some sense - mimo v2.5 pro is a very capable model that's quite underrated, and opus sometimes does worse on certain benchmarks than chinese models
I think Grok 4.3 was the only big surprise - it just waits around while the other models' players start probing for information
I've only seen that happen with the smaller models
vending-bench 2 found a similar result, where grok 4.3 would literally opt to just rest instead of taking action, which resulted in pretty poor scores
i'm also noticing in my own testing that grok 4.3 constantly hits conclusions as fast as possible when asked to do research, instead of being thorough
i think they might've overtuned the token efficiency training signal for the model, to the point where the model learned in agentic tasks to just opt for doing less - or nothing at all - to save tokens
it's the biggest weakness of the model to be honest
I buy that theory but I've also got Grok 4.3 as slightly verbose at 2,123 tokens per action (Kimi 2.6 - 5,038, GPT 5.5 - 403)
So it also spends time thinking about doing nothing 😂
yeah it's super weird, i'm honestly convinced that the always-on reasoning with no effort parameters is causing the odd behavior
when you combine that with a model that was probably trained to be conservative on agentic tasks, you get this
it's still a very capable model, but you have to fight it a bit lol
I don't trust inconsistent / spiky models even if there are clever fixes.
LLMs are general intelligences, and plenty of models are well-rounded
this level of regression they reported here.. doesn't make much sense, definitely implicates harness or prompting
Looks like reasoning effort is available now
tool calling becomes incredibly inconsistent with this model if you disable reasoning
what a shame
deepseek v4 flash with disabled reasoning performs much better
Is the x search tool enabled through OpenRouter?
https://docs.x.ai/developers/tools/x-search