#Grok 4.3

rigid parrot
#

Grok 4.3 (beta) now appears on Grok web with an Early Access label.

abstract grotto
#

Grok 4.3 already

#

Didn’t 4.2 come out like just a couple months ago

red bloom
#

unfortunately same price as 4.2 :/

red bloom
#

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20

The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing cost to run the benchmark suite.

Key Takeaways:

➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level.

➤ Large increase in real world agentic task performance…

#

@tender path sorry for the ping btw

#

eyy

#

wait what they lowered the prices

#

on 4.20

zealous knot
#

yoo grok 4.3 locked in dawg???

red bloom
zealous knot
#

grok 4.3 for coding ahh tho, decent for agentic use ig

tender path
#

don’t feel bad for pinging

#

appreciate you

ancient grotto
#

I mean, those throughput numbers are pretty solid if they're accurate

ancient grotto
#

Not being able to control reasoning is interesting

ancient grotto
#

Aaaaand we're down to the 40-50tps

dire hawk
#

theres no way this is a 500b param model

summer onyx
dire hawk
#

way more

#

i feel like 1.5-2t parameter would make sense for the performance

#

unless they’ve actually done some real llm science but i kinda doubt it

#

i feel like they are just getting most of their perf with the amount of compute they have

dusty current
#

very impressive model so far honestly

dire hawk
#

has anyone tested vision performance with it?

crystal bane
#

Maybe Im just too used to being slapped on the ass by anthropic

dire hawk
#

well g3flash is like 1.2t

crystal bane
#

But its priced the same as 4.2 which was 500b right 🤔

dire hawk
#

if so, distillation is going really damn well for them because this is insane performance for 500b tbh

vocal orbit
#

For aa is it not just slightly higher than qwen 3.6 plus which is 400b?

#

Or are you talking about actual testing

#

Guess I can try it

sterile grove
#

"You should drive." 👍

dusty current
#

as much as i love vending-bench 2, they sometimes have some very different results than what a lot of benchmarks say

#

could be in the prompting or harness, not sure

#

i have noticed a tendency for 4.3 to be a bit more "lazy" though, so this could have to do with it

left vector
#

It could also be that whatever optimization or streamlining xAI does is harming the specific set of weights responsible for tasks similar to Vending-Bench.

bitter sphinx
#

I'm finding that Grok 4.3 is a regression to Grok 4.1 Fast (Reasoning) on my social deduction benchmark 🫣

#

Also had an unusual preference to wait around sometimes

dusty current
#

Grok 4.3 lands in quite a great spot on the pareto frontier based on Artificial Analysis data

#

It has a very low hallucination rate, and answers questions correctly at a rate comparable to DeepSeek V4 Pro

#

even though it has a higher sticker price than V4 pro, the token efficiency ends up making it cheaper to use in the long run
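
A rough sketch of the sticker price vs. token efficiency point, in Python. All prices and token counts here are made-up placeholders, not real numbers for Grok 4.3 or V4 Pro, and `run_cost` is a hypothetical helper, not anyone's API:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one run, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical model with the higher sticker price but fewer output tokens.
pricier_but_terse = run_cost(1_000_000, 500_000,
                             input_price_per_m=2.0, output_price_per_m=6.0)

# Hypothetical model with the lower sticker price that spends 4x the output tokens.
cheaper_but_verbose = run_cost(1_000_000, 2_000_000,
                               input_price_per_m=1.0, output_price_per_m=4.0)

print(f"pricier but terse:   ${pricier_but_terse:.2f}")
print(f"cheaper but verbose: ${cheaper_but_verbose:.2f}")
```

With these placeholder numbers the model with the higher per-token price comes out cheaper per run, which is the effect being described. Whether it holds for a real workload depends on the actual token counts.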

#

but if you prefer a non-Grok model, MiMo-V2.5-Pro also performs around the same level

abstract grotto
#

my goat

dire hawk
#

i love this chart

potent spindle
#

Me too. Is it online? Where can we find it?

dusty current
bitter sphinx
fossil goblet
#

Cool that it's Pareto but kind of embarrassing for them that it's basically just on par with Chinese SotA

dusty current
#

4.1 fast was pareto efficient before gemma 4 31b was around, and 4.20 was the top pareto model for like a day lol

#

the main benefit of 4.3 vs. MiMo though is that it generally has more world knowledge in testing, which can be useful for general chatbots (node size on the graph represents the AA-Omniscience Accuracy benchmark, which measures accuracy on general world knowledge questions)

#

won't make a huge difference for agentic tasks though

fossil goblet
left vector
#

Hmm
This model is interesting, in coding DeepSeek is better than it

bitter sphinx
#

@dusty current you inspired me to do something similar with my own data (cleans up an existing graph I had):

dusty current
#

super cool

#

pareto frontier graphs are just great for value analysis
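
For anyone who wants to build one from their own data, here is a minimal Python sketch of picking the Pareto-efficient points, assuming each model is just a (cost, score) pair where lower cost and higher score are better. The `Model` type, the names and the numbers are all placeholders, not real benchmark data:

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float   # lower is better
    score: float  # higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep models that no cheaper-or-equal-cost model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, try the higher score first.
    for m in sorted(models, key=lambda m: (m.cost, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Placeholder data, not real results.
models = [
    Model("A", 1.0, 40),
    Model("B", 2.0, 55),
    Model("C", 3.0, 50),  # costs more than B and scores lower, so it is dominated
    Model("D", 5.0, 60),
]

print([m.name for m in pareto_frontier(models)])
```

In this sketch exact duplicates (same cost and same score) collapse to a single point, which is usually what you want for plotting.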

dire hawk
#

but yeah pareto frontier graphs are just so useful

bitter sphinx
#

mebbe cause I haven't benched 5.5 xhigh and Opus 4.7 max
but those are ridiculously expensive

dusty current
bitter sphinx
#

I think Grok 4.3 was the only big surprise - it just waits around while the other model's players start probing for information
I've only seen that happen with the smaller models

dusty current
#

i'm also noticing in my own testing that grok 4.3 constantly hits conclusions as fast as possible when asked to do research, instead of being thorough

#

i think they might've overtuned the token efficiency training signal for the model, to the point where the model learned in agentic tasks to just opt for doing less - or nothing at all - to save tokens

#

it's the biggest weakness of the model to be honest

bitter sphinx
dusty current
#

yeah it's super weird, i'm honestly convinced that the always-on reasoning with no effort parameters is causing the odd behavior

#

when you combine that with a model that was probably trained to be conservative on agentic tasks, you get this

#

it's still a very capable model, but you have to fight it a bit lol

fossil goblet
#

I don't trust inconsistent / spiky models even if there are clever fixes.

#

LLMs are general intelligences, and plenty of models are well-rounded

quick lotus
ancient grotto
#

Looks like reasoning effort is available now

dusty current
#

tool calling becomes incredibly inconsistent with this model if you disable reasoning

#

what a shame

#

deepseek v4 flash with disabled reasoning performs much better

errant cairn