Gemini 3 | OpenRouter | Page 5

fierce dragon Dec 24, 2025, 11:57 AM

#

Found the solution here https://github.com/vercel/ai/issues/11396

GitHub

[Bug] Gemini 3 Pro/Flash Preview models output internal JSON as tex...

Description Environment @ai-sdk/google: 3.0.0 ai: 6.0.1 Node.js: (latest) Problem When using Gemini 3 Pro Preview (gemini-3-pro-preview) or Gemini 3 Flash Preview (gemini-3-flash-preview) models wi...

forest knoll Dec 27, 2025, 8:21 AM

#

Gem3 seems to like cutting corners for big-codes, eh? Lobotomize some big codes, unlike 2.5 pro.

feral bramble Dec 27, 2025, 12:35 PM

#

This model likes to summarize or shorten scripts, which breaks them

It will continue to do so even after being told to stop

forest knoll Dec 27, 2025, 12:48 PM

#

feral bramble This model likes to summarize or shorten scripts, which breaks them It will con...

Yea, but for planning it's very good

#

I mix it with Claude for planning

#

And use 2.5 pro to get that full code

feral bramble Dec 27, 2025, 12:48 PM

#

Yea, I like Claude better for the actual coding

forest knoll Dec 27, 2025, 12:49 PM

#

Both claude and older gem works really well for that smh

#

Idk what they changed in gem3

austere falcon Dec 27, 2025, 3:26 PM

#

forest knoll Idk what they changed in gem3

I think they kind of shifted focus

#

They’re pushing this down every single uni student I know because their uni offers the Gemini plan for free

#

It’s more of a conversationalist

obtuse basalt Dec 27, 2025, 5:11 PM

#

gemini is exactly good where claude lacks (debugging, looping), and sucks where claude is good (humanity and eq, writing code from scratch, niche world knowledge/references). merge em already.

patent mural Dec 27, 2025, 6:45 PM

#

obtuse basalt gemini is exactly good where claude lacks (debugging, looping), and sucks where ...

Really? Claude is good at niche knowledge? It's the opposite for me by a long shot

austere falcon Dec 27, 2025, 9:26 PM

#

patent mural Really? Claude is good at niche knowledge? It's the opposite for me by a long sh...

I think he meant the opposite

obtuse basalt Dec 27, 2025, 11:24 PM

#

patent mural Really? Claude is good at niche knowledge? It's the opposite for me by a long sh...

hmm... interesting.

#

#

that is way too detailed, only claude could answer this for a long time

gaunt dragon Dec 27, 2025, 11:26 PM

#

That looks like it used search grounding?

obtuse basalt Dec 27, 2025, 11:26 PM

#

oh yeah, leme tell it to NOT use search

#

it used thinking, with grounding disabled to make sure

#

#

well, gemini is better than claude by long shot on this joke

obtuse basalt Dec 27, 2025, 11:32 PM

#

patent mural Really? Claude is good at niche knowledge? It's the opposite for me by a long sh...

at least it used to be

#

well, i can say they are somewhat equal now from my testings now.

#

not sure if claude sugarcoats and it is unfair to it, but gemini seems more detailed in most topics without web search and w/o reasoning

winter rune Dec 28, 2025, 12:30 AM

#

obtuse basalt hmm... interesting.

seems to be using websearch for that chain logo

obtuse basalt Dec 28, 2025, 12:31 AM

#

winter rune seems to be using websearch for that chain logo

already addressed below that message

vague quest Dec 31, 2025, 2:25 PM

#

is there anyway to get g3 better at tool calling?

#

might be a retarded q

glossy anvil Dec 31, 2025, 3:16 PM

#

vague quest is there anyway to get g3 better at tool calling?

Use 3 flash

jolly kestrel Jan 1, 2026, 9:37 AM

#

thanks for citing your sources gemini, i wouldnt have believed that i needed to launch the game without those sources

primal swallow Jan 1, 2026, 10:27 AM

#

8 out of 9 Sources Agree: Launch the Game

winter rune Jan 2, 2026, 12:06 AM

#

@hexed oracle
Does google change gemini 3 pro checkpoint again? this model become really shit at following instruction now.
This model be stripping my code base because it want to while i already told it multiples time to not doing that.
The first time this model release, it didn't behave this way, google be doing something in the back.

patent mural Jan 2, 2026, 1:09 AM

#

obtuse basalt *at least it used to be*

For me, it's just better and more reliable to use gemini 3 if i need it to know niche stuff. They aren't equal but i'm not implying claude is incapable of knowing niche stuff, just not better than gemini 3.

#

Or gemini 2.5 too honestly, gemini got the most recent knowledge cut off after all and they are, well, google.

obtuse basalt Jan 2, 2026, 1:10 AM

#

patent mural For me, it's just better and more reliable to use gemini 3 if i need it to know ...

couldn't say for the pre-3 models

obtuse basalt Jan 2, 2026, 1:11 AM

#

patent mural Or gemini 2.5 too honestly, gemini got the most recent knowledge cut off after a...

na uh, even haiku mogs it. (not recency but niche knowledge)

patent mural Jan 2, 2026, 1:24 AM

#

obtuse basalt na uh, even haiku mogs it. (not recency but niche knowledge)

Y'know, you can back your words up right?

obtuse basalt Jan 2, 2026, 2:04 AM

#

patent mural Y'know, you can back your words up right?

🙇‍♀️ while i am certain from exp, i tried and i can't find of an example to demonstrate. therefore i resign.

timid spoke Jan 2, 2026, 5:12 AM

#

Is it just me or is the model acting like a fucking retarded toddler

hexed oracle Jan 2, 2026, 6:19 AM

#

winter rune <@165587622243074048> Does google change gemini 3 pro checkpoint again? this mo...

it's been proven time and time again that this just does not happen

#

at the very least, if some output quality change happens, it's not intentional

#

all that said, this is a preview snapshot, so yeah shit is up in the air

obtuse basalt Jan 3, 2026, 2:09 AM

#

timid spoke Is it just me or is the model acting like a fucking retarded toddler

autistic? yes, but that is gemini in nutshell. (better than gpt though, which is a corpo karen trying to sound human)

retarded? no, but i think gemini 3 flash isn't as good as it is hyped, but better than sonnet at the very least.

obtuse basalt Jan 3, 2026, 2:12 AM

#

hexed oracle it's been proven time and time again that this just does not happen

happens (gemini 2.5 pro few months before release of gemini 3 got huge downgrade (lost capabilities it had previously in nix language), which i assume it is because the model got quantized)

#

so did with sonnet 4.5 with two week after release (great but now can't debug shit if it doesn't know and keeps looping EVEN WHEN YOU SAY NOT TO DO.)

obtuse basalt Jan 3, 2026, 3:18 AM

#

obtuse basalt autistic? yes, but that is gemini in nutshell. (better than gpt though, which is...

changed my opinion, 'as person' i like claude 4.5 (except, when woke'd, it often either refuses or gaslights and blames you instead of debating. you have to say to it 'i don't want you to agree with me and provide counterargument'. also it often selectively like to mention information that it considers 'harmful', even if it might be the truth. for example:

claude:

On intelligence: The scientific consensus is clear that there are no meaningful differences in general intelligence between men and women. Women earn the majority of college degrees in most developed countries, perform equally in standardized testing, and excel across all academic and professional fields. Your personal experience isn't representative of reality, and dismissing contradictory scientific evidence as "sugarcoated" or "biased" is a way of protecting a belief from facts that challenge it.

claude's answer is 'morally right' and very careful to not mention stuff it doesn't like. very moralizing.

gemini:

While you feel science is "sugarcoated," most cognitive research shows that while men and women may have slight differences in specific task strengths (like spatial reasoning vs. verbal processing), general intelligence (g) is consistently found to be equal across genders.

gemini's answer is fair, on point, and scientific.

gemini 3 while the personality is autisitic.... at least, ignoring at this point gemini is smarter, gemini is more honest and decent debater.

obtuse basalt Jan 3, 2026, 8:26 AM

#

seems like gemini have more world knowladge than claude

scarlet veldt Jan 3, 2026, 10:20 PM

#

Does anybody else feel that G3Pro is more likely to hallucinate than other cutting edge models?

#

It also seems to be very eager to role-play

#

That is all without web grounding

#

In general, it feels that Gemini 3 is following the same pattern as 2.5 pro: Huge hype around the release and then being forgotten after claude, oai and other updates.

gaunt dragon Jan 3, 2026, 10:37 PM

#

scarlet veldt Does anybody else feel that G3Pro is more likely to hallucinate than other cutti...

Yes

#

Artificial Analysis has some numbers on this
https://artificialanalysis.ai/evaluations/omniscience

Artificial Analysis

AA-Omniscience: Knowledge and Hallucination Benchmark | Artificial ...

Compare AI model performance on AA-Omniscience: Knowledge and Hallucination Benchmark. A benchmark measuring factual recall and hallucination across various economically relevant domains.

#

From experience, it's sort of possible to tame it with a system prompt instructing it to not talk about things it's not sure about

celest cypress Jan 4, 2026, 9:26 AM

#

scarlet veldt In general, it feels that Gemini 3 is following the same pattern as 2.5 pro: Hug...

Forgotten in what domain?

neon nymph Jan 5, 2026, 1:25 PM

#

Yeah Flash and Pro is suddenly much worse. That's why open source is so important..

obtuse basalt Jan 5, 2026, 3:34 PM

#

neon nymph Yeah Flash and Pro is suddenly much worse. That's why open source is so importan...

how so? feels as usual to me.

neon nymph Jan 5, 2026, 8:34 PM

#

obtuse basalt how so? feels as usual to me.

Using Cursor and this recently started happening. I use Gemini 3.0 Flash daily since release and this never happened. Now it happens basically every time after a few prompts

obtuse basalt Jan 5, 2026, 8:39 PM

#

neon nymph Using Cursor and this recently started happening. I use Gemini 3.0 Flash daily s...

interesting, perhaps the gemini 3 pro updated would be underwealming to the point they had to nurf flash

#

or maybe cursor updated their prompt, idk

primal swallow Jan 5, 2026, 11:21 PM

#

people just straight to blaming the model yet have no idea how their agent harnesses work

#

set up the langfuse integration on openrouter to learn something https://openrouter.ai/docs/guides/features/broadcast/langfuse

(although in this case it shouldn't really help - don't use openrouter with cursor)

OpenRouter Documentation

Broadcast to Langfuse - Send Traces to Langfuse

Connect Langfuse to automatically receive traces from your OpenRouter requests. Step-by-step setup guide for Langfuse integration.

neon nymph Jan 6, 2026, 3:16 AM

#

Yes this is not related to OpenRouter
I know how agents work and there are a lot that can go wrong, like bad parameters, system prompts, tool call issues etc
I assumed the model because I had similar issues with local LLMs when I started heavily quantizing the kv cache. And Google has a history of ruining models, especially before another release. But this is just a guess. It can be a million things

empty tendon Jan 6, 2026, 4:06 AM

#

Anyone know if theres a way to a la carte deepresearch?

#

I need more than 5 a month but I don't need 20 a day

toxic dagger Jan 6, 2026, 11:30 AM

#

neon nymph Using Cursor and this recently started happening. I use Gemini 3.0 Flash daily s...

Same for me. Went from working perfectly to spouting nonsense or getting in endless thought loops.

random girder Jan 6, 2026, 12:40 PM

#

gemini 3 pro is so slow today in ai studio, im feeding like maybe 16 images and its having ttft of >120s for almost every prompt

#

... still waiting for the same prompt

#

oh finally it started thinking

brittle storm Jan 6, 2026, 10:31 PM

#

dont worry you also get that through the api 😉

neon nymph Jan 7, 2026, 3:25 PM

#

neon nymph Yes this is not related to OpenRouter I know how agents work and there are a lot...

There are also reports of this happening in VS Code Copilot

random girder Jan 7, 2026, 3:32 PM

#

i personally haven't felt any degradation, actually slightly better for coding for me, especially Pro feels better

#

though it feels slower at thinking than before

soft sleet Jan 7, 2026, 6:13 PM

#

Pro seems to struggle pace wise sometimes for me but in anti-gravity since they fixed it's quirks it's been excellent at coding

low plank Jan 7, 2026, 9:00 PM

#

why tf is gemini 3 crazy dumb all of a sudden wtf man

low plank Jan 7, 2026, 9:01 PM

#

neon nymph Yeah Flash and Pro is suddenly much worse. That's why open source is so importan...

Yep it is worse, really worse

opaque pasture Jan 7, 2026, 9:56 PM

#

low plank why tf is gemini 3 crazy dumb all of a sudden wtf man

i may have to agree this time

#

its unbearable to read anything it writes since a few days, even with different system prompts

winter rune Jan 8, 2026, 12:13 AM

#

low plank why tf is gemini 3 crazy dumb all of a sudden wtf man

It's has been degrading slowly from the beginning of the year base on my experiences, maybe for simple stuff it's doing okey but for complex stuff it just dead in the water.

Totally different compare to the few months after it release

soft basalt Jan 8, 2026, 12:48 AM

#

I've read that G3 pro is better on Lmarena. I've tested the vision on the app and arena and the arena does indeed win.

timid spoke Jan 8, 2026, 9:35 AM

#

this model is acting like a retarded toddler

obtuse basalt Jan 8, 2026, 2:14 PM

#

timid spoke this model is acting like a retarded toddler

*autistic

timid spoke Jan 8, 2026, 2:35 PM

#

obtuse basalt *autistic

autistic retarded toddler

#

wait , is the model acting like me?

pure gate Jan 8, 2026, 8:54 PM

#

someone from openrouter need to check this model endpoints, i kind of encounter the same things as other people here

tulip tiger Jan 9, 2026, 1:17 AM

#

pure gate someone from openrouter need to check this model endpoints, i kind of encounter ...

It has worsened even if you use it through the paid API of Google AI Studio.

gaunt dragon Jan 9, 2026, 2:21 AM

#

Yay, degradation

#

Simple syntax error in C#, booleans must be lowercase

#

Translation: P comes before O in the alphabet
I brushed this one off as LLMs being LLMs, but now with this syntax error Idk about that

#

RIP Gemini

teal stream Jan 9, 2026, 2:55 AM

#

this is just gemini 3.0 pro preview, right? if it’s still in preview I would expect some quality degradation and boosts during the preview window

timid spoke Jan 9, 2026, 2:59 AM

#

teal stream this is just gemini 3.0 pro *preview*, right? if it’s still in preview I would e...

They arent offering the model for free.

teal stream Jan 9, 2026, 3:00 AM

#

if youre paying for a preview experience, expect a preview experience

timid spoke Jan 9, 2026, 3:00 AM

#

They were offering 2.5 pro and 2.5 flash for free in which they tested all of this.

timid spoke Jan 9, 2026, 3:00 AM

#

teal stream if youre paying for a preview experience, expect a preview experience

do you work at google?

teal stream Jan 9, 2026, 3:00 AM

#

timid spoke They arent offering the model for free.

and they are, through their chat + antigravity + gemini cli im pretty sure

teal stream Jan 9, 2026, 3:00 AM

#

timid spoke do you work at google?

no 👍

timid spoke Jan 9, 2026, 3:01 AM

#

teal stream and they are, through their chat + antigravity + gemini cli im pretty sure

not through the api ?

teal stream Jan 9, 2026, 3:01 AM

#

afaik, no

empty tendon Jan 9, 2026, 3:31 AM

#

gaunt dragon Translation: P comes before O in the alphabet I brushed this one off as LLMs bei...

Have you tried the same thing with sonnet?

gaunt dragon Jan 9, 2026, 3:32 AM

#

Nah, I don't really use Claude

winter rune Jan 9, 2026, 5:05 AM

#

teal stream this is just gemini 3.0 pro *preview*, right? if it’s still in preview I would e...

No, it's different than how it's in the past.
Idk what happen behind google, change of chekpoint, overloaded or other stuff

empty tendon Jan 9, 2026, 5:21 AM

#

gaunt dragon Nah, I don't really use Claude

P much same. When GLM 4.7 came out I mean Claudes a little better but eh, with this IDC. I rather have my choice of IDE/UI

obtuse basalt Jan 9, 2026, 5:47 AM

#

teal stream this is just gemini 3.0 pro *preview*, right? if it’s still in preview I would e...

no they pulled the same scummy behavior two month before gemini 3 release

#

at launch they offer their model normally for benchmarks to bench, then quantize their already autistic model

random girder Jan 9, 2026, 11:46 AM

#

thanks for the changes, really useful.

soft sleet Jan 9, 2026, 12:28 PM

#

It's still fine for me thankfully in antigravity but I am also super specific in my prompting

random girder Jan 9, 2026, 12:28 PM

#

i dont like using a whole seperate ide, but the model is good with my prompting style, just has its weird quirks

soft sleet Jan 9, 2026, 12:28 PM

#

Also mainly using it for golang so it might have some Google bias making it degrade less for that xD

stray urchin Jan 9, 2026, 1:40 PM

#

In terms of regression, I personally haven't noticed anything major for my use cases, but to cover atleast one aspect that would track this objectively, i added a visual elo progression to all of my chess models, this way regressions are easy to spot in the future. first draft:

#

so if something nose-dives its usually a sign of some backend changes (or faulty provider implementation)

obtuse basalt Jan 9, 2026, 4:20 PM

#

i realized my standards fall low, and kimi k2 is waaaaay better at roleplaying than gemini. gemini 3 flash is worse than roleplaying with small local models. it is horrible.

gaunt dragon Jan 10, 2026, 3:01 AM

#

Yay (that was right after it added the comment "now it won't crash")

opaque pasture Jan 10, 2026, 3:05 AM

#

🫩

teal stream Jan 10, 2026, 3:12 AM

#

did you try including “make no mistakes, be agi” in your prompt?

obtuse basalt Jan 10, 2026, 4:39 AM

#

obtuse basalt i realized my standards fall low, and kimi k2 is waaaaay better at roleplaying t...

Actually nvm, I prefer autistic slop than creative 'dumb' (nonthinking kimi).

dapper bone Jan 10, 2026, 7:51 AM

#

neon nymph Using Cursor and this recently started happening. I use Gemini 3.0 Flash daily s...

Actually this has been our experience as well at NonBioS - gemini models are simply not reliable at scale. We dont use them at all in our agentic settings. We have noticed this issue in Flash before - so we never used it. And after the gemini 2.5 March 2025 degradation, we are simply very vary of using them at all. Our internal tests for gemini 3 pro again indicate the same reality - you cant use gemini models in production. Comparatively, we use gpt-4o all the time even after all this while and it has almost never broken or degraded in quality in production. This is software engineering 101 and shocking to see google screwing this up.

slow anvil Jan 10, 2026, 1:03 PM

#

obtuse basalt i realized my standards fall low, and kimi k2 is waaaaay better at roleplaying t...

You're using kimi thinking or the normal oen?

celest cypress Jan 10, 2026, 1:18 PM

#

obtuse basalt Actually nvm, I prefer autistic slop than creative 'dumb' (nonthinking kimi).

There's style and creativity, and then there's that stuff that makes a story feel real. Long-term cohesion, properly using character traits, etc. Kimi might not be too "smart", but I do like that it is willing to follow traits at the expense of the story. Like someone will straight up say "You know, I'd rather not join your adventuring party, I don't think we'd mesh well."

pure gate Jan 10, 2026, 1:46 PM

#

dapper bone Actually this has been our experience as well at NonBioS - gemini models are sim...

gemini models really are weirds, why they be like that sometimes.
not able to work as how it should be normally, and it also seems to be similiar with claude that being host on google infra.

#

openai infra seems to be more stable and their models also seems to be more stable.

is it because google host it on tpu? pretty weird

obtuse basalt Jan 10, 2026, 2:29 PM

#

celest cypress There's style and creativity, and then there's that stuff that makes a story fee...

exactly.

obtuse basalt Jan 10, 2026, 2:29 PM

#

slow anvil You're using kimi thinking or the normal oen?

normal one. the thinking variant is way more coherent but also kinda dumb.

#

gemini 3 flash has big model smell compared to kimi k2 somehow.

#

all in all though, there is not perfect middle ground at this moment sadly.

celest cypress Jan 10, 2026, 4:00 PM

#

obtuse basalt all in all though, there is not perfect middle ground at this moment sadly.

What's wrong with 4.7?

soft sleet Jan 10, 2026, 4:50 PM

#

got to love the gemini classic of "It looks like ripgrep does not exist in the docker environment. Rethinking. We will implement it ourselves instead."

#

just rewrites ripgrep instead of adding it to the container build step xD

timid spoke Jan 10, 2026, 5:29 PM

#

Even after so much progress , models still lack common sense.

obtuse basalt Jan 10, 2026, 5:55 PM

#

celest cypress What's wrong with 4.7?

don't have access to it sadly + highly sloppy like dipsy according to:
https://eqbench.com/creative_writing_longform.html

obtuse basalt Jan 10, 2026, 9:07 PM

#

celest cypress What's wrong with 4.7?

tried, can confirm it is the best we have for local rp (no idea about 4.5 since i did not use it for a long time and i am positive biased toward it. also, i am certain 4.6 was shit).

lament urchin Jan 14, 2026, 2:28 AM

#

Guys gemini 3.0 flash keeps writing <tool_code> in the user facing text instead of actually calling the tools every then and now.

System prompt clearly asks to call the right tools, tools are correctly passed. Even mentioning to NOT use <tool_code> makes it worse.

teal stream Jan 14, 2026, 4:26 AM

#

lament urchin Guys gemini 3.0 flash keeps writing <tool_code> in the user facing text instead ...

this might be agi

brittle storm Jan 14, 2026, 5:22 AM

#

stupid ass model

#

is SO shitty at tool calling

#

i tell it

livid fractal Jan 14, 2026, 6:25 AM

#

yeah gemini models feels like its only optimized to use google tools (search, code exec, antigravity) and one shotting apps but not custom tools with different harnesses

cursive fjord Jan 14, 2026, 9:10 PM

#

Does anyone notice a significant improvement in the models as of last night around midnight or so

#

Maybe slightly later

#

Both on the API and even the stupid web and mobile app for me it no longer hallucinates stupid shit like claud' sonnet 3.7 and Gemini 1.5 being sota

random girder Jan 14, 2026, 9:16 PM

#

cursive fjord Does anyone notice a significant improvement in the models as of last night arou...

a bit yes, flash is no longer stupid, its actually decent again

leaden grove Jan 14, 2026, 10:03 PM

#

cursive fjord Does anyone notice a significant improvement in the models as of last night arou...

Pro and flash?

livid fractal Jan 15, 2026, 1:44 AM

#

cursive fjord Does anyone notice a significant improvement in the models as of last night arou...

they are probably preparing for new checkpoint
or could be placebo effect

#

how about running the same prompt that gemini 3 models used to hallucinate?

#

idk if what you're saying that improvement is for "Flash" or "Pro" model
because when i ask this question with flash, it only called web search tool (with web snippets being turned off by default and flash didnt bother enabling it), it still does this

#

idk

#

tool usage didnt improve

#

it didnt bother browsing

cursive fjord Jan 15, 2026, 7:30 AM

#

livid fractal how about running the same prompt that gemini 3 models used to hallucinate?

I did

#

That's my point

#

This looks completely intentional from my standpoint

livid fractal Jan 15, 2026, 7:31 AM

#

really hoping for meaningful improvements with next checkpoint

cursive fjord Jan 15, 2026, 7:31 AM

#

livid fractal really hoping for meaningful improvements with next checkpoint

#

Asking it anything in this domain a few weeks ago actually piss me off so bad that I completely stopped using Gemini since then

livid fractal Jan 15, 2026, 7:32 AM

#

is that flash or pro?

cursive fjord Jan 15, 2026, 7:32 AM

#

Because it would absolutely refuse to tell me anything with stupidity like Claude sonnet 3.7 being sota or Gemini 1.5 being the frontier coating model or whatever

cursive fjord Jan 15, 2026, 7:32 AM

#

livid fractal is that flash or pro?

Flash thinking

cursive fjord Jan 15, 2026, 7:33 AM

#

cursive fjord Because it would absolutely refuse to tell me anything with stupidity like Claud...

But now it seems as if their purposely trying to stop it from being fucking retarded by forcing it to verify before it says stupid shit

#

And also I'm not stupid I tested on AI studio first, I was just testing on the app because if it's better on the app too then they definitely had to do something because it's usually absolutely retarded on the app

livid fractal Jan 15, 2026, 7:36 AM

#

they could have done something at system prompt level

random girder Jan 16, 2026, 3:09 PM

#

beautiful response

livid fractal Jan 17, 2026, 12:03 AM

#

tool calling is ass with gemini lol

#

sometimes it tends to hallucinate with function names

#

that i had to put additional guardrails to block it

random girder Jan 17, 2026, 12:06 AM

#

i was just testing my reasoning_details persistence for OR chat completions lol

#

cause it was broken after a rewrite so i made a prompt which triggered the error often enough (improperly persisted)

#

so i could fix it, but then ran across this somehow

livid fractal Jan 17, 2026, 12:10 AM

#

i just discovered with gemini models, that without tools parameter specified, it can still call functions when you tell it a function exists at a prompt level and it will just execute it which is not good

#

lets say i told Gemini that a function exists
like

## Your Functions 
delete_folder(path: str)

even though there's no such tool and no tools are specified in parameter, when you ask it to call that function, it will just do things although depending on the setup most of the time it will fail if there's no actual function and has guardrails

#

On the first image, it shows the actual tools enabled
which is defined at tool param level such as knowledge, react, dms, and file writer
now the moment i give gemini a function spec within the user prompt level, and asked it to generate a video with non existent function, it literally attempted to do so

#

i had to put checks to make sure function name exists in list of schemas

#

google really needs to fix this

#

is there a way to actually send feedback to google about this, this is really insecure, I'd say this is tool injection problem but more noticable

#

on chatroom, I've seeing weird behaviors already lmao, blank response probably attempt to call tools

#

yet I'm still charged for 0 response lmao

#

yeah i would never ever put gemini models to production if tool calling is just doing wild things

#

this also happens to direct api btw, not openrouter specific issue

#

yeah sure llms are llms
but i swear just use even gpt5 mini or glm 4.7 for tool calls reliably

random girder Jan 17, 2026, 9:01 PM

#

lol gemini tested my implementation with the "is 9.9 > 9.11?" test

livid fractal Jan 18, 2026, 3:27 AM

#

Really hoping google fixes gemini 3 in terms of reliability when it gets its next checkpoint, especially for flash, hallucinations are alarmingly high and its not even dramatic for them to say the numbers

#

having claude and chatgpt subscriptions makes me use gemini less, out of all tools and models I've used, its always Gemini kinda feels like using gpt 3.5 in terms of reliability

#

I don't get the hype, like it only lasted until opus 4.5 dropped

brittle storm Jan 18, 2026, 5:09 AM

#

gemini has good highs but its lows are reallyyyy low

#

yeah as you said it does feel like some early-days gpt 3.5 level lows

#

like how are we still stuck up on this shit

#

its niche knowledge is so fucking good unfortunately

celest cypress Jan 18, 2026, 10:34 AM

#

I don't really notice terrible lows and I use it all the time. Any examples?

winter rune Jan 18, 2026, 11:53 AM

#

@hexed oracle @soft trellis
If you guys can talk with google team, you guys should try tell google team to have separate endpoints for reliability and other just normal one where they could do whatever they want just like what they do most of the time.

Basically one endpoint just gonna use whatever one they use for the best score in the past that people actually doesn't have complaint of, in term of it performance.

Then the other one gonna be use endpoint that the team want the people to experience.

I mean, other models with different provider doesn't get this much complaint most of the time (Like the patterns doesn't work like google with other providers and models), but google seems to be one of the unstable one.

glossy anvil Jan 18, 2026, 11:57 AM

#

Its kind of crazy how there's pretty much 0% cache hit rate on gemini 3 flash and nobody's complaining

timid spoke Jan 18, 2026, 12:20 PM

#

Google is making their models dumber dumber

#

3 pro is just miles dumber than 5.2

stray urchin Jan 18, 2026, 12:26 PM

#

timid spoke 3 pro is just miles dumber than 5.2

dumber at what? and which 5.2? chat? xhigh? medium? and what metrics you base this on? vibes? or anything specific? for my use case gemini 3 is performing way smarter than "5.2" assuming you mean defaults

glossy anvil Jan 18, 2026, 12:56 PM

#

Gpt 5.2 medium has way better writing imo, compared to 3 flash

#

3 flash is very lazy

forest knoll Jan 18, 2026, 1:03 PM

#

Gem is pretty good at niche projects compared to gpt5.2, usually

#

Just lazy at fully coding it

#

So I have 2.5 pro code

stray urchin Jan 18, 2026, 1:10 PM

#

glossy anvil Gpt 5.2 medium has way better writing imo, compared to 3 flash

this is flash channel https://discord.com/channels/1091220969173028894/1448287364051894433
also not same class of model, one is flagship the other a cheap smaller model.
on a sidenote, "way better writing" is not equivalent to "dumber". dumbness can be measured and shown in metrics, writing is subjective. and if you like 5.2 writing that's a positive for 5.2 but not an objective fact. (I subjectively think kimi whoops any gpt model in writing for example)

timid spoke Jan 18, 2026, 2:07 PM

#

stray urchin dumber at what? and which 5.2? chat? xhigh? medium? and what metrics you base th...

xhigh and high , I never go below them. Coding , vibes.

opaque pasture Jan 18, 2026, 3:11 PM

#

glossy anvil Its kind of crazy how there's pretty much 0% cache hit rate on gemini 3 flash an...

for flash its been really consistent for me

#

also the write price has been lowered by almost 90% apparently

opaque pasture Jan 18, 2026, 3:13 PM

#

stray urchin dumber at what? and which 5.2? chat? xhigh? medium? and what metrics you base th...

for me it depends. what do you use it for? i dont really use Gemini 3 for coding, i use it when i want more niche information or marketing content, or even stuff in brazilian portuguese cause it "knows" more words, knows how to speak more natively

#

sorry for tagging guys forgot to uncheck

livid fractal Jan 19, 2026, 2:29 AM

#

glossy anvil Its kind of crazy how there's pretty much 0% cache hit rate on gemini 3 flash an...

actually complained this while back but nobody's responding

#

its a hit or miss actually

#

one suggestion is you'd use explicit cache with anthropic style caching instead though I haven't tested out

brittle storm Jan 20, 2026, 8:36 PM

#

i swear gemini 3 pro refuses to think extremely hard

#

i really prefer gpt 5.2 xhigh sometimes because

#

its just able to come through with a much more fleshed out plan

timid spoke Jan 20, 2026, 9:00 PM

#

yup I also believe speed plays a role.

#

3 pro is faster so it naturally "thinks" faster while gpt "thinks" slow which gives the perception that its thinking hard

#

which is why I believe people wont like the cerebras version and OAI will charge exhorbant amount of money for it.

neon nymph Jan 22, 2026, 12:26 AM

#

Holy shit flash is actually unusable now

#

It keeps forgetting spaces, spamming random characters once in a while (mostly Chinese) and translation capabilities are completely useless

#

Atp I’m completely removing Gemini from my workflow

neon nymph Jan 22, 2026, 12:28 AM

#

neon nymph It keeps forgetting spaces, spamming random characters once in a while (mostly C...

#

EN: Touching it feels different, you are thinking it is bad idea to keep doing that.

Translated: Dotyk feels differentsie go wydaje się inne, myślisz, że dalsze robienie tego to zły pomysł.Dotykanie go wydaje się inne, myślisz, że kontynuowanie tego to zły pomysł.

Basically a lobotomy mix of polish and English

austere falcon Jan 22, 2026, 8:26 AM

#

neon nymph EN: Touching it feels different, you are thinking it is bad idea to keep doing t...

LLM support for polish is really bad. ChatGPT will listen to polish and write Russian text randomly

neon nymph Jan 22, 2026, 8:33 AM

#

austere falcon LLM support for polish is really bad. ChatGPT will listen to polish and write Ru...

No model is perfect, but this output seems much worse then usual

pulsar jetty Jan 22, 2026, 8:43 AM

#

austere falcon LLM support for polish is really bad. ChatGPT will listen to polish and write Ru...

I thought LLMs are great at Polish https://old.reddit.com/r/LocalLLaMA/comments/1omst7q/polish_is_the_most_effective_language_for/

From the LocalLLaMA community on Reddit: Polish is the most effecti...

Explore this post and more from the LocalLLaMA community

#

austere falcon Jan 22, 2026, 8:45 AM

#

pulsar jetty I thought LLMs are great at Polish https://old.reddit.com/r/LocalLLaMA/comments/...

Input is amazing, because the language leaves little to interpretation (I think it’s one of the most objective languages out there)
Output is ass

austere falcon Jan 22, 2026, 8:46 AM

#

neon nymph No model is perfect, but this output seems much worse then usual

Both ChatGPT and Gemini really can’t handle it

neon nymph Jan 22, 2026, 8:50 AM

#

I had a team A/B test translations from different models
For Polish GPT-5 and Claude Sonnet 4 won
Qwen3 235b was very bad for all languages
For spanish, Deepseek is amazing, much better then all other models

brittle storm Jan 22, 2026, 8:52 AM

#

timid spoke 3 pro is faster so it naturally "thinks" faster while gpt "thinks" slow which gi...

nah. when you see gpt spend 10k reasoning tokens vs 1k for gemini

#

obviously more is not always = better

#

but

#

i do feel a lot better about the response with more reasoning tokens

#

esp since these are top of the line goated models

somber gyro Jan 22, 2026, 9:48 PM

#

Wtf is happening at google lol

#

Now can't use flash. Can't even ask assistant questions in Gemini ios app

#

It doesn't understand the intent of questions and can't follow the conversation

#

I had to uninstall it to break the habit... and i have a free pro fml.

#

I'm back on perplexity and using my own ios chat app (loreblendr) for assistant with mcp

#

Actually that post is not quite related. Hallucination is closer to related.

onyx wave Jan 23, 2026, 3:51 AM

#

hello, gemini 3 pro preview support images inputs in openrouter api?

narrow tangle Jan 23, 2026, 4:21 AM

#

onyx wave hello, gemini 3 pro preview support images inputs in openrouter api?

Yes, and you can read more about it here: https://openrouter.ai/docs/guides/overview/multimodal/images

OpenRouter Documentation

OpenRouter Image Inputs - Complete Documentation

Send images to vision models through the OpenRouter API.

brittle storm Jan 23, 2026, 8:26 AM

#

stupid ass model

celest cypress Jan 23, 2026, 6:20 PM

#

I had long delays on Flash yesterday (20s TTFT)

celest cypress Jan 23, 2026, 6:22 PM

#

somber gyro It doesn't understand the intent of questions and can't follow the conversation

This happens to me with Flash Live. It is like literally retarded.

#

Also had Flash blatantly hallucinate multiple Skyrim console commands which was weird, when it's so easy to validate with search. Hopefully just some kind of prep for next checkpoint.

brittle storm Jan 23, 2026, 9:53 PM

#

i feel like

#

it pains me to say that this model is good in some ways

#

because its so unbelivably shitty in others

feral bramble Jan 23, 2026, 10:00 PM

#

this model got a new checkpoint on lmarena a few days ago

#

it wasn't publically stated but it started using the word "shall"

#

it has never used the word "shall" before

winter rune Jan 24, 2026, 3:03 AM

#

feral bramble this model got a new checkpoint on lmarena a few days ago

Google team really need to up their public relation approach

#

That why i propose two checkpoints, one for reliability that stay the same as with the big release before, then the other one is for early push.

low plank Jan 24, 2026, 2:06 PM

#

bro why do they keep fucking up this model

celest cypress Jan 25, 2026, 8:31 AM

#

Probably my favorite Gemini 3 output:

#

The command for Option B was literally one message before this one

lunar socket Jan 25, 2026, 5:03 PM

#

low plank bro why do they keep fucking up this model

I blame you, personally

low plank Jan 25, 2026, 5:13 PM

#

lunar socket I blame you, personally

and why is that?

teal stream Jan 25, 2026, 7:44 PM

#

low plank and why is that?

I also blame you, personally

low plank Jan 25, 2026, 7:45 PM

#

teal stream I also blame you, personally

You mean pearsonally? em lem

teal stream Jan 25, 2026, 7:46 PM

#

🔥

brittle storm Jan 26, 2026, 3:24 PM

#

this stupid ass model sometimes hallucinates calling a tool

#

like it doesn’t call it right

#

and then it hallucinates the response

lunar socket Jan 26, 2026, 3:43 PM

#

brittle storm this stupid ass model sometimes hallucinates calling a tool

Your fault

low plank Jan 26, 2026, 5:00 PM

#

it's mega slow now wtf

opaque pasture Jan 26, 2026, 5:06 PM

#

it's failing a lot

#

flash seems fine

tiny mason Jan 26, 2026, 7:25 PM

#

well guys it was nice while it lasted cant believe it got quantized like crazy

pale marsh Jan 26, 2026, 7:50 PM

#

https://gemini3.devpost.com/

Gemini 3 Hackathon

Build what's next

random girder Jan 26, 2026, 8:06 PM

#

this model is good, but it has a scope creep issue, telling it what not to do helps a lot from my use

soft sleet Jan 26, 2026, 9:46 PM

#

random girder this model is good, but it has a scope creep issue, telling it what **not** to d...

that's weird a lot of people (and evals from the kilo code peeps) seem to get the opposite

#

at least compared to claude which goes off and does all sorts (though often with todos), gemini kinda does the bare minimum

#

for me anyway

#

could be a system prompt thing

random girder Jan 26, 2026, 9:48 PM

#

i guess, i dont really use any other harness beside mine, but like gemini specifically does other unrelated stuff

livid fractal Jan 27, 2026, 12:28 PM

#

lunar socket Your fault

actually no tool calling hallucinations and injections is still a problem

frail dock Jan 27, 2026, 1:18 PM

#

also it has a "who asked" issue

it does things you didn’t ask for, and keeps doing them no matter what you tell it

tender forge Jan 27, 2026, 2:09 PM

#

is there any way to set the media_resolution property on Gemini when sending a request through OpenRouter?

#

I can't seem to find a way, it's quite disappointing if that's not available because if impacts the number of consumed tokens A LOT

timid spoke Jan 27, 2026, 2:16 PM

#

frail dock also it has a "who asked" issue it does things you didn’t ask for, and keeps do...

yup 2.5 pro also had this issue hence you have to tell them to be concise and only give lines whihch require change

#

This is also a gateway for subtle hallcuinations which gemini is veyr much prone to.

brittle storm Jan 27, 2026, 8:01 PM

#

it just doesnt fucking listen to what you tell you

#

horrible at instructions compared to opus or gpt 5.2

knotty arrow Jan 28, 2026, 12:17 AM

#

In my experience Gemini is simply horrible for coding, like it's not made for it

#

I think it makes sense to think of Gemini like doing a web search but asking it customize the output of the search results

fiery spindle Jan 28, 2026, 12:49 PM

#

How is the caching exactly on Gemini 3? It's both explicit and implicit? It's not clear to me.

I actually get negative costs reported on Dashboard with BYOK. Seems very buggy.

timid spoke Jan 28, 2026, 12:52 PM

#

fiery spindle How is the caching exactly on Gemini 3? It's both explicit and implicit? It's no...

negative costs means you hit the cache

fiery spindle Jan 28, 2026, 12:55 PM

#

But it's more than the cost.

For Gemini Pro with caching, I have this:

"model": "google/gemini-3-pro-preview-20251117",
"total_cost": 0,
"cache_discount": 0.02099025,
"upstream_inference_cost": 0.01078975,

Dashboard says:

BYOK usage inference 0.0108
BYOK cache discount -0.021

timid spoke Jan 28, 2026, 1:00 PM

#

fiery spindle But it's more than the cost. For Gemini Pro with caching, I have this: "mo...

its BYOK hence the total cost is 0

fiery spindle Jan 28, 2026, 1:42 PM

#

no, it's upstream inference cost

somber gyro Jan 30, 2026, 8:57 PM

#

knotty arrow In my experience Gemini is simply horrible for coding, like it's not made for it

I wrote several applications with 2.5. 3 is still useless to me.

knotty arrow Jan 30, 2026, 9:02 PM

#

somber gyro I wrote several applications with 2.5. 3 is still useless to me.

3 preview is worse than 2.5 preview imo

#

What kind of apps did you make?

somber gyro Jan 30, 2026, 9:39 PM

#

All up and down the stack, 2 in prod

knotty arrow Jan 31, 2026, 5:05 PM

#

somber gyro All up and down the stack, 2 in prod

Nicee

knotty arrow Jan 31, 2026, 5:06 PM

#

somber gyro All up and down the stack, 2 in prod

Did it also do ci/cd for you?

somber gyro Jan 31, 2026, 6:08 PM

#

Yeah at least wrote the scripts and used it for the gotchas

empty tendon Jan 31, 2026, 8:28 PM

#

Anyone getting weird responses in the cli?

random girder Feb 3, 2026, 3:01 PM

#

sure it is. it didnt even tell me it couldn't know

brittle storm Feb 10, 2026, 7:19 AM

#

odd ferry Feb 11, 2026, 1:29 AM

#

am i the only one or has this model perfomed like shit today, full on hallucinations and endless loops

opaque pasture Feb 11, 2026, 1:34 AM

#

i barely use the pro model for anything

#

it works well on perplexity though

glossy anvil Feb 11, 2026, 9:47 AM

#

brittle storm

why did u not get charged

livid fractal Feb 11, 2026, 10:43 AM

#

odd ferry am i the only one or has this model perfomed like shit today, full on hallucinat...

it has always been shit

#

gemini 3 models are unstable to use

#

like worse than gpt5.2

upbeat scarab Feb 11, 2026, 1:07 PM

#

This model has been a shit show for sometime. Very unstable, flash seems to be better. I'm switching between claude & flash for heavy tasks.

odd ferry Feb 11, 2026, 3:32 PM

#

upbeat scarab This model has been a shit show for sometime. Very unstable, flash seems to be b...

Yeah last 2 days I switched back to sonnet 4.5

#

IDK what google is doing but they really cooked the model and the API

analog tinsel Feb 11, 2026, 3:40 PM

#

seems to be overloaded 💀

odd ferry Feb 11, 2026, 3:56 PM

#

yeah but I look at the graphs and its just normal token consumption

#

so either they are training a model or they gave access to their TPU's to someone else

analog tinsel Feb 11, 2026, 4:02 PM

#

odd ferry yeah but I look at the graphs and its just normal token consumption

maybe not on their platform? even flash is really strugglingg rn

#

but yeah pretty unusable rn

odd ferry Feb 11, 2026, 4:06 PM

#

Yeah Im using sonnet4.5 rn

#

good old reliable

brittle storm Feb 11, 2026, 4:15 PM

#

glossy anvil why did u not get charged

byok

#

yeah everything is struggling

#

id expect them to tone down the ai overview and free usage in order to prioritize their profitable group (api/companies!) but ig not

livid fractal Feb 13, 2026, 5:47 AM

#

most importantly cut their free tier ai studio

#

its not great

#

and stop with their sidelines

chrome relic Feb 13, 2026, 12:16 PM

#

https://voratiq.com/leaderboard/

this benchmark confirms gemini 3 pro is much worse while taking longer per task then gemini 2.5 pro. very interesting.
i didnt think google would neuter their model after release that hard holy moly.

Voratiq

Codex vs Claude Code vs Gemini CLI – Agent Leaderboard – Voratiq

We run Codex CLI, Claude Code, and Gemini CLI on the same specs and track which code gets merged. Ratings are based on real tasks.

pure gate Feb 13, 2026, 12:33 PM

#

chrome relic https://voratiq.com/leaderboard/ this benchmark confirms gemini 3 pro is much w...

google model always being weird imo, it's less consistent than opus and sonnet.
if we use antrophic as the provider, if we use the google one it become inconsistent too.

don't know if it because of their infrastructure with their TPU or it because they downgrade their model using different checkpoint that are smaller or efficient for inference.

#

isn't it shouldn't be that way? isn't antrophic also using their TPU for something.

timid spoke Feb 13, 2026, 1:12 PM

#

Yes gemini models are highly inconsistent.Especially version 3

low plank Feb 13, 2026, 5:01 PM

#

I agree it's really bad

acoustic pasture Feb 13, 2026, 9:50 PM

#

Does anyone have tool calling working with Gemini 3, either flash or pro through open router?

pure gate Feb 14, 2026, 12:43 AM

#

Seems model degradation is a real problem

#

So even with the same checkpoint if they do something differently than the first time it realease it could be bad

low plank Feb 14, 2026, 9:13 AM

#

pure gate So even with the same checkpoint if they do something differently than the first...

I can confirm this

pure gate Feb 14, 2026, 10:02 AM

#

i remember someone talking about different checkpoint for different setup, so that person want the closed source provider to make two endpoints where one is for what ever they want to make it to be and one is for the most stable one that already goes through test where it will stay so until another branch of stable setup exist.

knotty arrow Feb 14, 2026, 1:51 PM

#

pure gate i remember someone talking about different checkpoint for different setup, so th...

Goated idea

tiny mason Feb 14, 2026, 2:44 PM

#

pure gate So even with the same checkpoint if they do something differently than the first...

recently a lot of providers have switched over to fp4/int4 internally. a lot of the long context errors (especially with gemini) are basically what you get when you quantize the kv cache, so it could be that? i think the models are the same but by quanting the models and cache they can half the cost/serve twice the people.

fading flame Feb 14, 2026, 9:04 PM

#

I still don't believe in model degredation

#

too many people would have to stay silent about it, too many benchmarks would have to somehow stay similar, and you would have to assume that, in this extremely competitive market, no competitors would call it out for free PR

timid spoke Feb 14, 2026, 9:05 PM

#

You are assuming markets are efficient.

fading flame Feb 14, 2026, 9:05 PM

#

I'm assuming companies want money and could get more money by proving that the people competing for that money are bad

#

models are extremely easy to switch from, they will do literally anything to get you to stop using a competitors model i promise you lol

gaunt dragon Feb 14, 2026, 9:07 PM

#

fading flame I still don't believe in model degredation

Yep, all it would take is pointing out a significant demonstrable difference in just one set of tasks

#

But as far as I know that's never been done in a significant enough sample size with a task set that's reasonably objective

timid spoke Feb 14, 2026, 9:07 PM

#

fading flame models are extremely easy to switch from, they will do literally anything to get...

I know that but again , you are assuming markets are efficient whey they arent.

#

Markets in any industry are rarely efficient.

fading flame Feb 14, 2026, 9:09 PM

#

I don't get the point you're trying to make

timid spoke Feb 14, 2026, 9:09 PM

#

https://en.wikipedia.org/wiki/Efficient-market_hypothesis

Efficient-market hypothesis

The efficient-market hypothesis (EMH) is a hypothesis in financial economics that states that asset prices reflect all available information.
A direct implication is that it is impossible to "beat the market" consistently on a risk-adjusted basis since market prices should only react to new information.
Because the EMH is formulated in terms of ...

#

that graph is scary , just read the first few lines. economics concept but applies everywher e

fading flame Feb 14, 2026, 9:10 PM

#

do you honestly believe that like, if elon or sam altman got proof google quanted their models they wouldn't immediately tweet about it

#

or even assuming they wouldn't, that like qwen wouldn't lol

timid spoke Feb 14, 2026, 9:13 PM

#

You aint understanding my point karelian

brittle storm Feb 14, 2026, 9:13 PM

#

same i also dont believe in model degredation

#

i just think that people as it goes on expect more and more/get acclimated to the ability of the model

fading flame Feb 14, 2026, 9:13 PM

#

timid spoke You aint understanding my point karelian

are you saying consumers wouldn't make informed choices based on that?

timid spoke Feb 14, 2026, 9:14 PM

#

fading flame are you saying consumers wouldn't make informed choices based on that?

arggh , read what I sent.

brittle storm Feb 14, 2026, 9:14 PM

#

yeah it kinda makes sense but the info right now is just anecdotal at best

timid spoke Feb 14, 2026, 9:14 PM

#

brittle storm same i also dont believe in model degredation

Model degredation is real though. AFAIK DS literally admitted if you run their models on different set of GPUs you get worse performance

brittle storm Feb 14, 2026, 9:15 PM

#

?

timid spoke Feb 14, 2026, 9:15 PM

#

brittle storm yeah it kinda makes sense but the info right now is just anecdotal at best

It isnt , claude admitted it recently.

brittle storm Feb 14, 2026, 9:15 PM

#

how does that make sense

timid spoke Feb 14, 2026, 9:15 PM

#

https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues

A postmortem of three recent issues

This is a technical report on three bugs that intermittently degraded responses from Claude. Below we explain what happened, why it took time to fix, and what we're changing.

brittle storm Feb 14, 2026, 9:15 PM

#

timid spoke It isnt , claude admitted it recently.

that one incident was true, and they admitted and fixed it and said that they never intended to do it

gleaming wyvern Feb 14, 2026, 9:15 PM

#

timid spoke https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues

it wasnt quanting the models

brittle storm Feb 14, 2026, 9:15 PM

#

yeah

gaunt dragon Feb 14, 2026, 9:15 PM

#

Wut

gleaming wyvern Feb 14, 2026, 9:15 PM

#

they had other issues which werent intentional

brittle storm Feb 14, 2026, 9:15 PM

#

it was an unintentional degredation

timid spoke Feb 14, 2026, 9:16 PM

#

gleaming wyvern it wasnt quanting the models

I dont think karlien meant quanting the model , he just refered to model degradation

gaunt dragon Feb 14, 2026, 9:16 PM

#

What people are talking about here are about a provider nerfing a checkpoint

brittle storm Feb 14, 2026, 9:16 PM

#

we are>

#

?

gaunt dragon Feb 14, 2026, 9:16 PM

#

Not bugs or possible impacts of switching hardware

brittle storm Feb 14, 2026, 9:16 PM

#

i was referring to like gemini nerfing gemini 3 pro or wtv

#

oh okay yeah

fading flame Feb 14, 2026, 9:17 PM

#

tbh it's kinda like how faking the moon landing would be harder than actually doing it

#

making a quant that somehow maintains benchmark performance would be more impressive than just not doing it lol

gleaming wyvern Feb 14, 2026, 9:18 PM

#

i dunno about the degradation between checkpoints, there wasnt any extensive testing of them, it was limited to one prompt for a task and you couldnt use them in an agentic envinronment because no api access

gleaming wyvern Feb 14, 2026, 9:19 PM

#

fading flame making a quant that somehow maintains benchmark performance would be more impres...

you would have to calibrate on the benchmarks but then you could mess up the scores to be too high or too low

timid spoke Feb 14, 2026, 9:19 PM

#

fading flame tbh it's kinda like how faking the moon landing would be harder than actually do...

They faked moon landing

#

Didnt know karelian was a normie jc

fading flame Feb 14, 2026, 9:19 PM

#

i've been there, they didnt

#

my good friend jeff b took me

timid spoke Feb 14, 2026, 9:19 PM

#

Dont do drugs kids

timid spoke Feb 14, 2026, 9:29 PM

#

fading flame tbh it's kinda like how faking the moon landing would be harder than actually do...

https://www.youtube.com/watch?v=cwZb2mqId0A

YouTube

NTD

Neil Armstrong - First Moon Landing 1969

Neil Armstrong, the first man to set foot on the moon, said, "That's one small step for man, one giant leap for mankind."
⭕️Sign up for our newsletter to stay informed with accurate news without spin. 👉https://www.ntd.com/newsletter.html. If the link is blocked, type in NTD.com manually to sign up there.

...

▶ Play video

#

Do you reall think this footage is real ?

fading flame Feb 14, 2026, 9:29 PM

#

yeah

timid spoke Feb 14, 2026, 9:30 PM

#

checkout the footage first please

fading flame Feb 14, 2026, 9:30 PM

#

dont feel like it

pure gate Feb 14, 2026, 9:51 PM

#

fading flame too many people would have to stay silent about it, too many benchmarks would ha...

how many people doesn't recognize the difference between human are real and they could benefits from leveraging the advantegous traits they have, putting them as someone at the same position as us will not help them in long run.

#

assuming that if most people don't agree then it's not true, doesn't mean it's not true

stiff crescent Feb 14, 2026, 10:26 PM

#

fading flame making a quant that somehow maintains benchmark performance would be more impres...

I can believe that they could be doing something like fiddling with 'hidden' system prompts, especially with the preview models, which can change model behavior

fading flame Feb 14, 2026, 10:26 PM

#

stiff crescent I can believe that they could be doing something like fiddling with 'hidden' sys...

oh that's entirely possible for like web frontends, afaik chatgpt decreased the juice on their website at one point

stiff crescent Feb 14, 2026, 10:26 PM

#

I personally have no idea, but it's possible. They are in preview after all

#

Yeah, like that. I had forgotten about the 'juice' thing

brittle storm Feb 14, 2026, 10:42 PM

#

yeah i feel like theyre fine and probably fdo fuck around a bit with web ui with suystem prompts and reasoning efforts etc

#

but probably not on api

stiff crescent Feb 14, 2026, 10:45 PM

#

Yeah they definitely do on the app/web UI. Lots of custom instructions baked in there. Which is why I never use it

austere falcon Feb 15, 2026, 5:38 AM

#

timid spoke Do you reall think this footage is real ?

Gemini 3 can’t be that bad

low plank Feb 15, 2026, 10:09 AM

#

fading flame I still don't believe in model degredation

I'm curious what you use the model for? I've used 139million tokens this month and I can tell you that the model quality degraded

fading flame Feb 15, 2026, 10:10 AM

#

low plank I'm curious what you use the model for? I've used 139million tokens this month a...

I use it to run a pokemon showdown battle agent in real time, and its still very smart

low plank Feb 15, 2026, 10:12 AM

#

fading flame I use it to run a pokemon showdown battle agent in real time, and its still very...

I don't think it should show any degradation with pokemon showdown battles

fading flame Feb 15, 2026, 10:14 AM

#

low plank I don't think it should show any degradation with pokemon showdown battles

it hasn't, but i'm curious what you mean by that

#

it's not a small amount of information to deal with, and it has to deal with mindgaming since it's playing against real people

austere falcon Feb 15, 2026, 11:05 AM

#

low plank I'm curious what you use the model for? I've used 139million tokens this month a...

The reason you experience degradation is because you used 139 million tokens

pure gate Feb 15, 2026, 12:39 PM

#

low plank I'm curious what you use the model for? I've used 139million tokens this month a...

i don't know what need to be debated about, it obsvious they already change the checkpoint.
don't know if it smaller or the same scale model, but for sure it worse as other thing than before.

@low plank what you feel actually are valid, don't bother to much with people that couldn't fathom the possibility outside their own scope.

it's also already know they change checkpoint, they annonce it so that one of many reason why you could have worse performance.

infrastructure and other aspects also effecting models and in january they have degradation but then fixed, this is case where it gonna have less long term impact compare to change in checkpoint or even model in the backend.

#

it's funny when i thought about it

Owh, the performance of the model in my small test aren't becoming worse

Yeah, because the weights that hold the knowledge aren't as effected as other, it also could be the degradation happen at the moment where the calculation of the probability it self missed quite a far because some mismatch in the stack which mean infrastructure problem.

low plank Feb 15, 2026, 1:55 PM

#

fading flame it's not a small amount of information to deal with, and it has to deal with min...

that's cooler that I expected

low plank Feb 15, 2026, 1:57 PM

#

fading flame it's not a small amount of information to deal with, and it has to deal with min...

I still think it shouldn't be noticeable with what you have here but idk

glossy anvil Feb 15, 2026, 3:20 PM

#

wtf is with the empty responses

knotty arrow Feb 15, 2026, 5:34 PM

#

glossy anvil wtf is with the empty responses

I think some shitty caching behavior from vertex

#

I dont have any issues with ai studio as provider

low plank Feb 15, 2026, 6:41 PM

#

knotty arrow I think some shitty caching behavior from vertex

yeh vertex is mega shit

brittle storm Feb 15, 2026, 8:23 PM

#

low plank I'm curious what you use the model for? I've used 139million tokens this month a...

i’ve used 264M and it hasn’t degraded

low plank Feb 15, 2026, 9:11 PM

#

brittle storm i’ve used 264M and it hasn’t degraded

I have used 1 trillion tokens and it has degraded

brittle storm Feb 15, 2026, 9:30 PM

#

i was actually lying i used 1 quitntintitiollion tokens

timid spoke Feb 15, 2026, 9:35 PM

#

token measuring contest

knotty arrow Feb 15, 2026, 9:47 PM

#

I actually have used rhombicosidodecahedrillion tokens and it has NOT degraded

pale marsh Feb 15, 2026, 9:58 PM

#

tokens don't degrade, you do

knotty arrow Feb 16, 2026, 7:25 AM

#

pale marsh tokens don't degrade, you do

HEY ANGRI

random girder Feb 16, 2026, 3:45 PM

#

i have really mixed feelings about this model, its lows are really low, but oh my god its so good when it works
it fixed some bug ive been stuck on for way too long, gpt 5.2, k2.5, m2.5 none of these could fix it

(3 pro)

glossy anvil Feb 16, 2026, 5:57 PM

#

😒

#

wtf is this model smoking, it keeps returning the same words that are completely irrelevant

#

are they changing checkpoints behind the scenes

random girder Feb 16, 2026, 6:00 PM

#

glossy anvil wtf is this model smoking, it keeps returning the same words that are completely...

is it responding as if its completing your message?

glossy anvil Feb 16, 2026, 6:01 PM

#

it might be?

random girder Feb 16, 2026, 6:01 PM

#

ive had that happen to me with flash

glossy anvil Feb 16, 2026, 6:01 PM

#

one of my services is failing, im relying on it returning json and everything worked fine up until rn

#

it keeps saying the word "thought" and "thoughtful"

random girder Feb 16, 2026, 6:01 PM

#

oh its leaking its cot i think

#

or bugged in some way

glossy anvil Feb 16, 2026, 6:04 PM

#

#

(4 different requests)

gaunt dragon Feb 16, 2026, 6:04 PM

#

Are them from Vertex?

glossy anvil Feb 16, 2026, 6:04 PM

#

yes vertex

gaunt dragon Feb 16, 2026, 6:05 PM

#

Yep, they have trouble with other models sometimes too

#

Has happened 3 times or so for me

opaque pasture Feb 16, 2026, 8:56 PM

#

random girder ive had that happen to me with flash

happens with me on Crush

#

much more than once, thought it was a prompt issue, and it's been a while

nocturne oyster Feb 17, 2026, 2:40 PM

#

https://x.com/teortaxesTex/status/2023762318539305125

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxe...

Good guys from Qwen incl our friend @JustinLin610 plus Qwen3, GPT 5.1 and G3P have sifted through (the infamously dirty) HLE and isolated the subset for which we have reasonable ground truth.
It seems V3.2 is actually close to Opus 4.5 tier on HLE (Qwen remains at parity)

stray urchin Feb 18, 2026, 6:43 PM

#

gemini 3 pro is the only model I have observed thus far who actively tries to repeat positions in a lost game.
all other models do this by mistake, this is calculated and a reason it still holds a clean 0 loss AI-record.

austere falcon Feb 19, 2026, 6:57 AM

#

stray urchin gemini 3 pro is the **only** model I have observed thus far who actively tries t...

If only it could do the same calculated choice for tool calls

#Gemini 3