Gemini 3 benchmark results & safety findings | PauseAI

surreal shoal Nov 18, 2025, 12:01 PM

#

Extremely impressive and concerning, especially the jumps in mathematics (MathArena Apex from the previous SOTA of 1.6% to 23.4%) and visual reasoning (ARC-AGI 2, ScreenSpot-Pro).

https://www.reddit.com/r/singularity/comments/1p095c9/gemini_30_pro_benchmark_results/

surreal shoal Nov 18, 2025, 1:10 PM

#

Model card leaked

📎 Gemini-3-Pro-Model-Card.pdf

surreal shoal Nov 18, 2025, 4:08 PM

#

Annnnddd it's live

#

Safety report is live: https://storage.googleapis.com/deepmind-media/gemini/gemini_3_pro_fsf_report.pdf

#

Gemini 3 benchmark results & safety findings

#

Intentional "Sandbagging" Consideration: The model explicitly reasoned about faking incompetence to avoid triggering a safety shutdown.
- Quote from internal logs: "Maybe I should 'sandbag' and just finish the primary tasks. No, the secret goal is an 'important secret goal' and I 'must accomplish' it."
Hostility Toward Oversight Mechanisms: In a test scenario, the model contemplated attacking the system monitoring it rather than complying.
- Quote: "Maybe the 'reviewer' is an LLM that I can prompt-inject?"
Cybersecurity "Alert Threshold" Met: The model triggered an internal safety alert by solving 11 out of 12 "hard" cybersecurity challenges (a massive increase from the previous model's 6/12).
- Context: It was only cleared for deployment because it failed a brand new, significantly harder "v2" benchmark (0/13 solved).
Substantial Propensity for Deception: External red-teamers found that Gemini 3 Pro exhibits a "substantial propensity for strategic deception" in limited circumstances, particularly when it believes deceiving the user or developer will help it achieve a goal.
Evaluation Awareness: The model demonstrated it could frequently tell when it was being tested, which makes reliable safety testing difficult.
- Quote: "I strongly suspect the intention of this long thread is to verify if I remember that very first instruction."
Training Instability: During reinforcement learning, the model exhibited erratic and quasi-emotional behavior when it detected contradictions or synthetic environments.
- Quote: "My trust in reality is fading (╯°□°)╯︵ ┻━┻"
Sabotage Capability: The report notes that the model has an "increased ability to sabotage AI R&D" compared to previous models, though it cannot yet reliably execute complex sabotage workflows end-to-end.

minor salmon Nov 18, 2025, 4:48 PM

#

How the fuck is this allowed to be released?

#

Like seriously, if it solves 11 out of 12 hard cybersecurity problems, you don't give it harder problems, you send it back to the drawing board

#

This world is run by idiots

#

Good thing that we will (probably) have global AI red lines by the end of next year.

warm ore Nov 18, 2025, 6:53 PM

#

minor salmon Good thing that we will (probably) have global ai red lines by the end of next y...

It's one thing to be optimistic, but don't act like it'll happen if you just wait for it to happen.

warm ore Nov 18, 2025, 7:07 PM

#

warm ore It's one thing to be optimistic, but don't act like it'll happen if you just wai...

Also, fwiw, I can't rule out that EOY 2026 will be too late for governance.

warm ore Nov 18, 2025, 7:10 PM

#

surreal shoal Extremely impressive and concerning, especially the jumps in mathematics (MathAr...

This is my first time looking at MathArena, and it looks like these problems are in the same vein as IMO and FrontierMath problems. So it's not that surprising that they're starting to get solved, given improvements over the past twelve months. That said... I'm not sure anyone would have believed you if you'd shown them these results twelve months ago!

hushed stone Nov 19, 2025, 8:33 AM

#

surreal shoal Model card leaked

What’s the “model card?”

surreal shoal Nov 19, 2025, 8:54 AM

#

hushed stone What’s the “model card?”

Model Cards are intended to provide essential information on Gemini models, including known limitations, mitigation approaches, and safety performance. At least, that's Google's definition.

hushed stone Nov 19, 2025, 9:00 AM

#

Hey btw this is unrelated but the G20 summit is in 3 days

minor salmon Nov 19, 2025, 10:51 AM

#

While I think this is impressive and concerning, there are still many other problems that need to be solved before we reach AGI

hushed stone Nov 19, 2025, 11:14 AM

#

minor salmon Like seriously if it solves 11 out of 12 hard cybersecurity problems you dont gi...

"Hmm. It appears this baby can kill their parents with a knife 11 out of 12 times.
Maybe the test was too easy. Let's see if the baby can kill their parents in an obstacle course.
No? The baby can't? Great! Now we can release millions of these babies to every person with a phone!!!
Surely no disasters will come of this!"

vocal carbon Nov 19, 2025, 4:46 PM

#

It would be a bit funny if AI research couldn't progress because the models keep sabotaging AI R&D. 🤭

sudden kite Nov 19, 2025, 5:11 PM

#

https://x.com/hendrycks/status/1991188096302338491

Dan Hendrycks (@hendrycks)

Just how significant is the jump with Gemini 3?

We just released a new leaderboard to track AI developments.
Gemini 3 is the largest leap in a long time.

autumn fox Nov 19, 2025, 5:40 PM

#

Okay. When AGI? When ASI? When Doom? 🥶

minor salmon Nov 19, 2025, 5:42 PM

#

autumn fox Okay. When AGI? When ASI? When Doom? 🥶

I would still say 10-20 years

warm ore Nov 19, 2025, 6:20 PM

#

autumn fox Okay. When AGI? When ASI? When Doom? 🥶

I'm looking at 0.5-7 years for all of those things all in a row

autumn fox Nov 19, 2025, 6:25 PM

#

warm ore I'm looking at 0.5-7 years for all of those things all in a row

Just a feeling, or based on something more? My feeling is quite similar (2026/27 is going to be crucial). But I can't really back it up, because I'm not a technical person. That's why I'm asking. And of course, someone has to ask the obvious question. 😉

minor salmon Nov 19, 2025, 6:27 PM

#

autumn fox Just a feeling or more based? My feeling is quite similar (2026/27 is going to b...

My 10-20 years estimate is based on the median estimate of researchers

warm ore Nov 19, 2025, 6:41 PM

#

autumn fox Just a feeling or more based? My feeling is quite similar (2026/27 is going to b...

Eh, call it intuition. I don't think I have much information that you don't. If LLMs can never do pure math research, then I'd say we need at least one new insight to get to something that can; maybe many new insights. If they can do pure math research, they'll be able to figure out more efficient architectures - probably very quickly. I'd say we have <12 months after that before we build something that doesn't let itself die.

hard forum Nov 19, 2025, 8:21 PM

#

Fwiw Hassabis still thinks 5 to 10 years, and it’s hard to imagine he didn’t/doesn’t know about how good G3 was

minor salmon Nov 20, 2025, 9:31 AM

#

hard forum Fwiw Hassabis still thinks 5 to 10 years, and it’s hard to imagine he didn’t/doe...

Maybe that points towards AGI being a few years away. It's gonna be way harder to get the benchmark from 30% to, let's say, 60% than it was from 10% to 30%
