Found the solution here https://github.com/vercel/ai/issues/11396
#Gemini 3
Gem3 seems to like cutting corners for big-codes, eh? Lobotomize some big codes, unlike 2.5 pro.
This model likes to summarize or shorten scripts, which breaks them
It will continue to do so even after being told to stop
Yea, but for planning it's very good
I mix it with Claude for planning
And use 2.5 pro to get that full code
Yea, I like Claude better for the actual coding
Both claude and older gem works really well for that smh
Idk what they changed in gem3
I think they kind of shifted focus
They’re pushing this down every single uni student I know because their uni offers the Gemini plan for free
It’s more of a conversationalist
gemini is exactly good where claude lacks (debugging, looping), and sucks where claude is good (humanity and eq, writing code from scratch, niche world knowledge/references). merge em already.
Really? Claude is good at niche knowledge? It's the opposite for me by a long shot
I think he meant the opposite
hmm... interesting.
that is way too detailed, only claude could answer this for a long time
That looks like it used search grounding?
oh yeah, leme tell it to NOT use search
it used thinking, with grounding disabled to make sure
well, gemini is better than claude by long shot on this joke
at least it used to be
well, i can say they are somewhat equal now from my testings now.
not sure if claude sugarcoats and it is unfair to it, but gemini seems more detailed in most topics without web search and w/o reasoning
seems to be using websearch for that chain logo
already addressed below that message
Use 3 flash
thanks for citing your sources gemini, i wouldnt have believed that i needed to launch the game without those sources
8 out of 9 Sources Agree: Launch the Game
@hexed oracle
Did google change the gemini 3 pro checkpoint again? this model has become really shit at following instructions now.
This model keeps stripping my code base because it wants to, even though i already told it multiple times not to do that.
When this model was first released, it didn't behave this way, google must be doing something in the back.
For me, it's just better and more reliable to use gemini 3 if i need it to know niche stuff. They aren't equal but i'm not implying claude is incapable of knowing niche stuff, just not better than gemini 3.
Or gemini 2.5 too honestly, gemini got the most recent knowledge cut off after all and they are, well, google.
couldn't say for the pre-3 models
na uh, even haiku mogs it. (not recency but niche knowledge)
Y'know, you can back your words up right?
🙇♀️ while i am certain from exp, i tried and i can't find of an example to demonstrate. therefore i resign.
Is it just me or is the model acting like a fucking retarded toddler
it's been proven time and time again that this just does not happen
at the very least, if some output quality change happens, it's not intentional
all that said, this is a preview snapshot, so yeah shit is up in the air
autistic? yes, but that is gemini in nutshell. (better than gpt though, which is a corpo karen trying to sound human)
retarded? no, but i think gemini 3 flash isn't as good as it is hyped, but better than sonnet at the very least.
happens (gemini 2.5 pro few months before release of gemini 3 got huge downgrade (lost capabilities it had previously in nix language), which i assume it is because the model got quantized)
so did with sonnet 4.5 with two week after release (great but now can't debug shit if it doesn't know and keeps looping EVEN WHEN YOU SAY NOT TO DO.)
changed my opinion, 'as person' i like claude 4.5 (except, when woke'd, it often either refuses or gaslights and blames you instead of debating. you have to say to it 'i don't want you to agree with me and provide counterargument'. also it often selectively like to mention information that it considers 'harmful', even if it might be the truth. for example:
claude:
On intelligence: The scientific consensus is clear that there are no meaningful differences in general intelligence between men and women. Women earn the majority of college degrees in most developed countries, perform equally in standardized testing, and excel across all academic and professional fields. Your personal experience isn't representative of reality, and dismissing contradictory scientific evidence as "sugarcoated" or "biased" is a way of protecting a belief from facts that challenge it.
claude's answer is 'morally right' and very careful to not mention stuff it doesn't like. very moralizing.
gemini:
While you feel science is "sugarcoated," most cognitive research shows that while men and women may have slight differences in specific task strengths (like spatial reasoning vs. verbal processing), general intelligence (g) is consistently found to be equal across genders.
gemini's answer is fair, on point, and scientific.
gemini 3 while the personality is autisitic.... at least, ignoring at this point gemini is smarter, gemini is more honest and decent debater.
seems like gemini has more world knowledge than claude
Does anybody else feel that G3Pro is more likely to hallucinate than other cutting edge models?
It also seems to be very eager to role-play
That is all without web grounding
In general, it feels that Gemini 3 is following the same pattern as 2.5 pro: Huge hype around the release and then being forgotten after claude, oai and other updates.
Yes
Artificial Analysis has some numbers on this
https://artificialanalysis.ai/evaluations/omniscience
Artificial Analysis
Compare AI model performance on AA-Omniscience: Knowledge and Hallucination Benchmark. A benchmark measuring factual recall and hallucination across various economically relevant domains.
From experience, it's sort of possible to tame it with a system prompt instructing it to not talk about things it's not sure about
Forgotten in what domain?
Yeah Flash and Pro is suddenly much worse. That's why open source is so important..
how so? feels as usual to me.
Using Cursor and this recently started happening. I use Gemini 3.0 Flash daily since release and this never happened. Now it happens basically every time after a few prompts
interesting, perhaps the updated gemini 3 pro would be underwhelming to the point they had to nerf flash
or maybe cursor updated their prompt, idk
people just straight to blaming the model yet have no idea how their agent harnesses work
set up the langfuse integration on openrouter to learn something https://openrouter.ai/docs/guides/features/broadcast/langfuse
(although in this case it shouldn't really help - don't use openrouter with cursor)
Yes this is not related to OpenRouter
I know how agents work and there are a lot that can go wrong, like bad parameters, system prompts, tool call issues etc
I assumed the model because I had similar issues with local LLMs when I started heavily quantizing the kv cache. And Google has a history of ruining models, especially before another release. But this is just a guess. It can be a million things
Anyone know if theres a way to a la carte deepresearch?
I need more than 5 a month but I don't need 20 a day
Same for me. Went from working perfectly to spouting nonsense or getting in endless thought loops.
gemini 3 pro is so slow today in ai studio, im feeding like maybe 16 images and its having ttft of >120s for almost every prompt
... still waiting for the same prompt
oh finally it started thinking
dont worry you also get that through the api 😉
There are also reports of this happening in VS Code Copilot
i personally haven't felt any degradation, actually slightly better for coding for me, especially Pro feels better
though it feels slower at thinking than before
Pro seems to struggle pace wise sometimes for me but in anti-gravity since they fixed it's quirks it's been excellent at coding
why tf is gemini 3 crazy dumb all of a sudden wtf man
Yep it is worse, really worse
i may have to agree this time
its unbearable to read anything it writes since a few days, even with different system prompts
It has been degrading slowly since the beginning of the year based on my experience, maybe for simple stuff it's doing okay but for complex stuff it's just dead in the water.
Totally different compared to the first few months after its release
I've read that G3 pro is better on Lmarena. I've tested the vision on the app and arena and the arena does indeed win.
this model is acting like a retarded toddler
*autistic
autistic retarded toddler
wait , is the model acting like me?
someone from openrouter need to check this model endpoints, i kind of encounter the same things as other people here
It has worsened even if you use it through the paid API of Google AI Studio.
Yay, degradation
Simple syntax error in C#, booleans must be lowercase
Translation: P comes before O in the alphabet
I brushed this one off as LLMs being LLMs, but now with this syntax error Idk about that
RIP Gemini
this is just gemini 3.0 pro preview, right? if it’s still in preview I would expect some quality degradation and boosts during the preview window
They arent offering the model for free.
if youre paying for a preview experience, expect a preview experience
They were offering 2.5 pro and 2.5 flash for free in which they tested all of this.
do you work at google?
and they are, through their chat + antigravity + gemini cli im pretty sure
no 👍
not through the api ?
afaik, no
Have you tried the same thing with sonnet?
Nah, I don't really use Claude
No, it's different than how it's in the past.
Idk what happened behind the scenes at google, change of checkpoint, overloaded or other stuff
P much same. When GLM 4.7 came out I mean Claudes a little better but eh, with this IDC. I rather have my choice of IDE/UI
no they pulled the same scummy behavior two month before gemini 3 release
at launch they offer their model normally for benchmarks to bench, then quantize their already autistic model
thanks for the changes, really useful.
It's still fine for me thankfully in antigravity but I am also super specific in my prompting
i dont like using a whole seperate ide, but the model is good with my prompting style, just has its weird quirks
Also mainly using it for golang so it might have some Google bias making it degrade less for that xD
In terms of regression, I personally haven't noticed anything major for my use cases, but to cover atleast one aspect that would track this objectively, i added a visual elo progression to all of my chess models, this way regressions are easy to spot in the future. first draft:
so if something nose-dives its usually a sign of some backend changes (or faulty provider implementation)
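For anyone wanting to track regressions the same way, a minimal sketch of the idea: a standard Elo update plus a simple "nose-dive" check over the rating history. The function names, the K-factor of 32 and the 100-point drop threshold are assumptions for illustration, not the setup used for the chess models above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both new ratings after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def flag_nosedive(history: list[float], window: int = 5, drop: float = 100.0) -> bool:
    """True if the latest rating sits at least `drop` below the best of the prior window.

    The window size and threshold are arbitrary assumptions; tune them to the
    noise level of your own matches.
    """
    if len(history) < 2:
        return False
    previous = history[-(window + 1):-1]
    return max(previous) - history[-1] >= drop
```

A flagged drop only says "look here": it can't tell a backend change from a broken provider implementation or plain variance.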
i realized my standards fall low, and kimi k2 is waaaaay better at roleplaying than gemini. gemini 3 flash is worse than roleplaying with small local models. it is horrible.
Yay (that was right after it added the comment "now it won't crash")
did you try including “make no mistakes, be agi” in your prompt?
Actually nvm, I prefer autistic slop than creative 'dumb' (nonthinking kimi).
Actually this has been our experience as well at NonBioS - gemini models are simply not reliable at scale. We don't use them at all in our agentic settings. We have noticed this issue in Flash before - so we never used it. And after the gemini 2.5 March 2025 degradation, we are simply very wary of using them at all. Our internal tests for gemini 3 pro again indicate the same reality - you can't use gemini models in production. Comparatively, we use gpt-4o all the time even after all this while and it has almost never broken or degraded in quality in production. This is software engineering 101 and shocking to see google screwing this up.
You're using kimi thinking or the normal one?
There's style and creativity, and then there's that stuff that makes a story feel real. Long-term cohesion, properly using character traits, etc. Kimi might not be too "smart", but I do like that it is willing to follow traits at the expense of the story. Like someone will straight up say "You know, I'd rather not join your adventuring party, I don't think we'd mesh well."
gemini models really are weird, why are they like that sometimes.
not able to work how they normally should, and it seems to be similar with claude when it's hosted on google infra.
openai infra seems to be more stable and their models also seem to be more stable.
is it because google hosts it on tpu? pretty weird
exactly.
normal one. the thinking variant is way more coherent but also kinda dumb.
gemini 3 flash has big model smell compared to kimi k2 somehow.
all in all though, there is not perfect middle ground at this moment sadly.
What's wrong with 4.7?
got to love the gemini classic of "It looks like ripgrep does not exist in the docker environment. Rethinking. We will implement it ourselves instead."
just rewrites ripgrep instead of adding it to the container build step xD
Even after so much progress , models still lack common sense.
don't have access to it sadly + highly sloppy like dipsy according to:
https://eqbench.com/creative_writing_longform.html
tried, can confirm it is the best we have for local rp (no idea about 4.5 since i did not use it for a long time and i am positive biased toward it. also, i am certain 4.6 was shit).
Guys gemini 3.0 flash keeps writing <tool_code> in the user facing text instead of actually calling the tools every now and then.
System prompt clearly asks to call the right tools, tools are correctly passed. Even mentioning to NOT use <tool_code> makes it worse.
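One harness-side workaround, as a rough sketch: detect when the visible text contains `<tool_code>` markup but no real tool call came back, and treat that turn as a failed call to retry. The function names and the shape of `tool_calls` are assumptions for illustration, not any SDK's API.

```python
import re

# Matches an opening or closing <tool_code> tag in the visible text.
# The tag name comes from the message above; everything else is hypothetical.
LEAK_PATTERN = re.compile(r"</?tool_code>", re.IGNORECASE)


def has_leaked_tool_markup(text: str) -> bool:
    """Return True if the user-facing text contains a <tool_code> tag."""
    return bool(LEAK_PATTERN.search(text))


def should_retry(text: str, tool_calls: list) -> bool:
    """Retry when the model wrote tool markup as text but made no real tool call."""
    return has_leaked_tool_markup(text) and not tool_calls
```

This doesn't fix the model, it only keeps the leaked markup from reaching the user.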
this might be agi
yeah gemini models feels like its only optimized to use google tools (search, code exec, antigravity) and one shotting apps but not custom tools with different harnesses
Does anyone notice a significant improvement in the models as of last night around midnight or so
Maybe slightly later
Both on the API and even the stupid web and mobile app for me it no longer hallucinates stupid shit like claud' sonnet 3.7 and Gemini 1.5 being sota
a bit yes, flash is no longer stupid, its actually decent again
Pro and flash?
they are probably preparing for new checkpoint
or could be placebo effect
how about running the same prompt that gemini 3 models used to hallucinate?
idk if what you're saying that improvement is for "Flash" or "Pro" model
because when i ask this question with flash, it only called web search tool (with web snippets being turned off by default and flash didnt bother enabling it), it still does this
idk
tool usage didnt improve
it didnt bother browsing
I did
That's my point
This looks completely intentional from my standpoint
really hoping for meaningful improvements with next checkpoint
Asking it anything in this domain a few weeks ago actually piss me off so bad that I completely stopped using Gemini since then
is that flash or pro?
Because it would absolutely refuse to tell me anything, with stupidity like Claude sonnet 3.7 being sota or Gemini 1.5 being the frontier coding model or whatever
Flash thinking
But now it seems as if their purposely trying to stop it from being fucking retarded by forcing it to verify before it says stupid shit
And also I'm not stupid I tested on AI studio first, I was just testing on the app because if it's better on the app too then they definitely had to do something because it's usually absolutely retarded on the app
they could have done something at system prompt level
beautiful response
tool calling is ass with gemini lol
sometimes it tends to hallucinate with function names
that i had to put additional guardrails to block it
i was just testing my reasoning_details persistence for OR chat completions lol
cause it was broken after a rewrite so i made a prompt which triggered the error often enough (improperly persisted)
so i could fix it, but then ran across this somehow
i just discovered with gemini models that even without the tools parameter specified, it can still call functions when you tell it a function exists at the prompt level, and it will just try to execute it, which is not good
lets say i told Gemini that a function exists
like
## Your Functions
delete_folder(path: str)
even though there's no such tool and no tools are specified in the parameter, when you ask it to call that function, it will just do things, although depending on the setup it will fail most of the time if there's no actual function and there are guardrails
On the first image, it shows the actual tools enabled
which is defined at tool param level such as knowledge, react, dms, and file writer
now the moment i give gemini a function spec at the user prompt level and ask it to generate a video with a non-existent function, it literally attempts to do so
i had to put checks to make sure function name exists in list of schemas
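A minimal sketch of that kind of check, assuming tool calls and schemas are plain dicts with a `name` key (that shape is an assumption, adapt it to whatever your harness actually receives). The tool names reuse the ones mentioned above.

```python
def split_tool_calls(tool_calls: list[dict], schemas: list[dict]):
    """Separate calls to declared tools from calls to names the model made up.

    Only names present in `schemas` (the tools actually passed in the request)
    are allowed; anything else is rejected instead of being executed.
    """
    allowed = {schema["name"] for schema in schemas}
    valid = [call for call in tool_calls if call.get("name") in allowed]
    rejected = [call for call in tool_calls if call.get("name") not in allowed]
    return valid, rejected


# Tools really declared at the tools-parameter level (names from the chat).
schemas = [{"name": "knowledge"}, {"name": "react"},
           {"name": "dms"}, {"name": "file_writer"}]

# The model "calls" a function that was only described in the prompt text.
calls = [{"name": "file_writer", "args": {"path": "notes.txt"}},
         {"name": "delete_folder", "args": {"path": "/"}}]

valid, rejected = split_tool_calls(calls, schemas)
```

Rejected calls can be fed back to the model as an error message rather than silently dropped, so it has a chance to pick a real tool.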
google really needs to fix this
is there a way to actually send feedback to google about this? this is really insecure, I'd say this is a tool injection problem but more noticeable
in the chatroom, I've been seeing weird behaviors already lmao, blank responses that are probably attempts to call tools
yet I'm still charged for 0 response lmao
yeah i would never ever put gemini models to production if tool calling is just doing wild things
this also happens to direct api btw, not openrouter specific issue
yeah sure llms are llms
but i swear just use even gpt5 mini or glm 4.7 for tool calls reliably
lol gemini tested my implementation with the "is 9.9 > 9.11?" test
Really hoping google fixes gemini 3 in terms of reliability when it gets its next checkpoint, especially for flash, hallucinations are alarmingly high and its not even dramatic for them to say the numbers
having claude and chatgpt subscriptions makes me use gemini less, out of all tools and models I've used, its always Gemini kinda feels like using gpt 3.5 in terms of reliability
I don't get the hype, like it only lasted until opus 4.5 dropped
gemini has good highs but its lows are reallyyyy low
yeah as you said it does feel like some early-days gpt 3.5 level lows
like how are we still stuck up on this shit
its niche knowledge is so fucking good unfortunately
I don't really notice terrible lows and I use it all the time. Any examples?
@hexed oracle @soft trellis
If you guys can talk with google team, you guys should try tell google team to have separate endpoints for reliability and other just normal one where they could do whatever they want just like what they do most of the time.
Basically one endpoint just gonna use whatever one they use for the best score in the past that people actually doesn't have complaint of, in term of it performance.
Then the other one gonna be use endpoint that the team want the people to experience.
I mean, other models with different provider doesn't get this much complaint most of the time (Like the patterns doesn't work like google with other providers and models), but google seems to be one of the unstable one.
Its kind of crazy how there's pretty much 0% cache hit rate on gemini 3 flash and nobody's complaining
dumber at what? and which 5.2? chat? xhigh? medium? and what metrics you base this on? vibes? or anything specific? for my use case gemini 3 is performing way smarter than "5.2" assuming you mean defaults
Gpt 5.2 medium has way better writing imo, compared to 3 flash
3 flash is very lazy
Gem is pretty good at niche projects compared to gpt5.2, usually
Just lazy at fully coding it
So I have 2.5 pro code
this is flash channel https://discord.com/channels/1091220969173028894/1448287364051894433
also not same class of model, one is flagship the other a cheap smaller model.
on a sidenote, "way better writing" is not equivalent to "dumber". dumbness can be measured and shown in metrics, writing is subjective. and if you like 5.2 writing that's a positive for 5.2 but not an objective fact. (I subjectively think kimi whoops any gpt model in writing for example)
xhigh and high , I never go below them. Coding , vibes.
for flash its been really consistent for me
also the write price has been lowered by almost 90% apparently
for me it depends. what do you use it for? i dont really use Gemini 3 for coding, i use it when i want more niche information or marketing content, or even stuff in brazilian portuguese cause it "knows" more words, knows how to speak more natively
sorry for tagging guys forgot to uncheck
actually complained this while back but nobody's responding
its a hit or miss actually
one suggestion is you'd use explicit cache with anthropic style caching instead though I haven't tested out
i swear gemini 3 pro refuses to think extremely hard
i really prefer gpt 5.2 xhigh sometimes because
its just able to come through with a much more fleshed out plan
yup I also believe speed plays a role.
3 pro is faster so it naturally "thinks" faster while gpt "thinks" slow which gives the perception that its thinking hard
which is why I believe people won't like the cerebras version and OAI will charge an exorbitant amount of money for it.
Holy shit flash is actually unusable now
It keeps forgetting spaces, spamming random characters once in a while (mostly Chinese) and translation capabilities are completely useless
Atp I’m completely removing Gemini from my workflow
EN: Touching it feels different, you are thinking it is bad idea to keep doing that.
Translated: Dotyk feels differentsie go wydaje się inne, myślisz, że dalsze robienie tego to zły pomysł.Dotykanie go wydaje się inne, myślisz, że kontynuowanie tego to zły pomysł.
Basically a lobotomy mix of polish and English
LLM support for polish is really bad. ChatGPT will listen to polish and write Russian text randomly
No model is perfect, but this output seems much worse than usual
I thought LLMs are great at Polish https://old.reddit.com/r/LocalLLaMA/comments/1omst7q/polish_is_the_most_effective_language_for/
Input is amazing, because the language leaves little to interpretation (I think it’s one of the most objective languages out there)
Output is ass
Both ChatGPT and Gemini really can’t handle it
I had a team A/B test translations from different models
For Polish GPT-5 and Claude Sonnet 4 won
Qwen3 235b was very bad for all languages
For Spanish, Deepseek is amazing, much better than all other models
nah. when you see gpt spend 10k reasoning tokens vs 1k for gemini
obviously more is not always = better
but
i do feel a lot better about the response with more reasoning tokens
esp since these are top of the line goated models
Wtf is happening at google lol
Now can't use flash. Can't even ask assistant questions in Gemini ios app
It doesn't understand the intent of questions and can't follow the conversation
I had to uninstall it to break the habit... and i have a free pro fml.
I'm back on perplexity and using my own ios chat app (loreblendr) for assistant with mcp
Actually that post is not quite related. Hallucination is closer to related.
hello, gemini 3 pro preview support images inputs in openrouter api?
Yes, and you can read more about it here: https://openrouter.ai/docs/guides/overview/multimodal/images
stupid ass model
I had long delays on Flash yesterday (20s TTFT)
This happens to me with Flash Live. It is like literally retarded.
Also had Flash blatantly hallucinate multiple Skyrim console commands which was weird, when it's so easy to validate with search. Hopefully just some kind of prep for next checkpoint.
i feel like
it pains me to say that this model is good in some ways
because it's so unbelievably shitty in others
this model got a new checkpoint on lmarena a few days ago
it wasn't publicly stated but it started using the word "shall"
it has never used the word "shall" before
Google team really needs to up their public relations approach
That's why i propose two checkpoints: one for reliability that stays the same as at the big release, and another one for early pushes.
bro why do they keep fucking up this model
Probably my favorite Gemini 3 output:
The command for Option B was literally one message before this one
I blame you, personally
and why is that?
I also blame you, personally
You mean pearsonally? em lem
🔥
this stupid ass model sometimes hallucinates calling a tool
like it doesn’t call it right
and then it hallucinates the response
Your fault
it's mega slow now wtf
well guys it was nice while it lasted cant believe it got quantized like crazy
this model is good, but it has a scope creep issue, telling it what not to do helps a lot from my use
that's weird a lot of people (and evals from the kilo code peeps) seem to get the opposite
at least compared to claude which goes off and does all sorts (though often with todos), gemini kinda does the bare minimum
for me anyway
could be a system prompt thing
i guess, i dont really use any other harness beside mine, but like gemini specifically does other unrelated stuff
actually no tool calling hallucinations and injections is still a problem
also it has a "who asked" issue
it does things you didn’t ask for, and keeps doing them no matter what you tell it
is there any way to set the media_resolution property on Gemini when sending a request through OpenRouter?
I can't seem to find a way, it's quite disappointing if that's not available because it impacts the number of consumed tokens A LOT
yup 2.5 pro also had this issue, hence you have to tell them to be concise and only give the lines which require changes
This is also a gateway for subtle hallucinations, which gemini is very much prone to.
it just doesnt fucking listen to what you tell it
horrible at instructions compared to opus or gpt 5.2
In my experience Gemini is simply horrible for coding, like it's not made for it
I think it makes sense to think of Gemini like doing a web search but asking it customize the output of the search results
How is the caching exactly on Gemini 3? It's both explicit and implicit? It's not clear to me.
I actually get negative costs reported on Dashboard with BYOK. Seems very buggy.
negative costs means you hit the cache
But it's more than the cost.
For Gemini Pro with caching, I have this:
"model": "google/gemini-3-pro-preview-20251117",
"total_cost": 0,
"cache_discount": 0.02099025,
"upstream_inference_cost": 0.01078975,
Dashboard says:
BYOK usage inference 0.0108
BYOK cache discount -0.021
its BYOK hence the total cost is 0
no, it's upstream inference cost
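One possible reading of those numbers, worked through as a sketch: if `cache_discount` is the amount saved relative to the uncached price and `upstream_inference_cost` is what was billed upstream after that saving, the two add up to the price without caching. That interpretation is an assumption, not something confirmed in this thread; `Decimal` is used only to keep the sums exact.

```python
from decimal import Decimal

# Values copied from the usage JSON quoted above.
usage = {
    "total_cost": Decimal("0"),
    "cache_discount": Decimal("0.02099025"),
    "upstream_inference_cost": Decimal("0.01078975"),
}


def uncached_estimate(u: dict) -> Decimal:
    """Price the request would have cost upstream without any cache hits.

    ASSUMPTION: discount = savings, upstream cost = amount actually billed.
    """
    return u["upstream_inference_cost"] + u["cache_discount"]


def fraction_saved(u: dict) -> Decimal:
    """Share of the uncached price that the cache discount covers."""
    return u["cache_discount"] / uncached_estimate(u)


full_price = uncached_estimate(usage)   # Decimal('0.03178000')
saved = fraction_saved(usage)           # roughly two thirds
```

Under this reading, a discount larger than the billed cost is not a bug: it just means well over half of the uncached price was saved by cache hits, and `total_cost` stays 0 because BYOK usage is billed upstream.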
I wrote several applications with 2.5. 3 is still useless to me.
3 preview is worse than 2.5 preview imo
What kind of apps did you make?
All up and down the stack, 2 in prod
Nicee
Did it also do ci/cd for you?
Yeah at least wrote the scripts and used it for the gotchas
Anyone getting weird responses in the cli?
sure it is. it didnt even tell me it couldn't know
am i the only one or has this model performed like shit today, full on hallucinations and endless loops
why did u not get charged
it has always been shit
gemini 3 models are unstable to use
like worse than gpt5.2
This model has been a shit show for sometime. Very unstable, flash seems to be better. I'm switching between claude & flash for heavy tasks.
Yeah last 2 days I switched back to sonnet 4.5
IDK what google is doing but they really cooked the model and the API
seems to be overloaded 💀
yeah but I look at the graphs and its just normal token consumption
so either they are training a model or they gave access to their TPU's to someone else
maybe not on their platform? even flash is really struggling rn
but yeah pretty unusable rn
byok
yeah everything is struggling
id expect them to tone down the ai overview and free usage in order to prioritize their profitable group (api/companies!) but ig not
most importantly cut their free tier ai studio
its not great
and stop with their sidelines
https://voratiq.com/leaderboard/
this benchmark confirms gemini 3 pro is much worse while taking longer per task than gemini 2.5 pro. very interesting.
i didnt think google would neuter their model after release that hard holy moly.
google models have always been weird imo, they're less consistent than opus and sonnet.
that's if we use anthropic as the provider; if we use the google one, those become inconsistent too.
don't know if it's because of their infrastructure with their TPUs or because they downgrade their models using different checkpoints that are smaller or more efficient for inference.
it shouldn't be that way, should it? isn't anthropic also using their TPUs for something?
Yes gemini models are highly inconsistent.Especially version 3
I agree it's really bad
Does anyone have tool calling working with Gemini 3, either flash or pro through open router?
Seems model degradation is a real problem
So even with the same checkpoint, if they do something differently than when it was first released, it could be bad
I can confirm this
i remember someone talking about different checkpoint for different setup, so that person want the closed source provider to make two endpoints where one is for what ever they want to make it to be and one is for the most stable one that already goes through test where it will stay so until another branch of stable setup exist.
Goated idea
recently a lot of providers have switched over to fp4/int4 internally. a lot of the long context errors (especially with gemini) are basically what you get when you quantize the kv cache, so it could be that? i think the models are the same but by quanting the models and cache they can half the cost/serve twice the people.
I still don't believe in model degradation
too many people would have to stay silent about it, too many benchmarks would have to somehow stay similar, and you would have to assume that, in this extremely competitive market, no competitors would call it out for free PR
You are assuming markets are efficient.
I'm assuming companies want money and could get more money by proving that the people competing for that money are bad
models are extremely easy to switch from, they will do literally anything to get you to stop using a competitors model i promise you lol
Yep, all it would take is pointing out a significant demonstrable difference in just one set of tasks
But as far as I know that's never been done in a significant enough sample size with a task set that's reasonably objective
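A rough sketch of what such a comparison could look like: run the same fixed, objectively graded task set at two points in time and compare pass rates with a two-proportion z-test. The counts in the usage line are made up for illustration, and the test treats every task run as independent, which is a simplifying assumption.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z statistic for the difference between two pass rates (pooled variance)."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # identical all-pass or all-fail results: no evidence either way
    return (p_a - p_b) / se


def p_value_two_sided(z: float) -> float:
    """Two-sided p-value under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical example: 180/200 tasks passed at launch, 150/200 a month later.
z = two_proportion_z(180, 200, 150, 200)
p = p_value_two_sided(z)
```

With a few hundred tasks per run, a drop of that size would be hard to wave away as vibes; with ten tasks per run it would mostly be noise.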
I know that but again, you are assuming markets are efficient when they aren't.
Markets in any industry are rarely efficient.
I don't get the point you're trying to make
The efficient-market hypothesis (EMH) is a hypothesis in financial economics that states that asset prices reflect all available information.
A direct implication is that it is impossible to "beat the market" consistently on a risk-adjusted basis since market prices should only react to new information.
Because the EMH is formulated in terms of ...
that graph is scary, just read the first few lines. economics concept but applies everywhere
do you honestly believe that like, if elon or sam altman got proof google quanted their models they wouldn't immediately tweet about it
or even assuming they wouldn't, that like qwen wouldn't lol
You aint understanding my point karelian
same i also dont believe in model degradation
i just think that people as it goes on expect more and more/get acclimated to the ability of the model
are you saying consumers wouldn't make informed choices based on that?
arggh , read what I sent.
yeah it kinda makes sense but the info right now is just anecdotal at best
Model degradation is real though. AFAIK DS literally admitted if you run their models on a different set of GPUs you get worse performance
?
It isnt , claude admitted it recently.
how does that make sense
that one incident was true, and they admitted and fixed it and said that they never intended to do it
it wasnt quanting the models
yeah
Wut
they had other issues which werent intentional
it was an unintentional degradation
I dont think karlien meant quanting the model , he just refered to model degradation
What people are talking about here are about a provider nerfing a checkpoint
Not bugs or possible impacts of switching hardware
tbh it's kinda like how faking the moon landing would be harder than actually doing it
making a quant that somehow maintains benchmark performance would be more impressive than just not doing it lol
i dunno about the degradation between checkpoints, there wasnt any extensive testing of them, it was limited to one prompt per task and you couldnt use them in an agentic environment because there was no api access
you would have to calibrate on the benchmarks but then you could mess up the scores to be too high or too low
They faked moon landing
Didnt know karelian was a normie jc
Dont do drugs kids
Neil Armstrong, the first man to set foot on the moon, said, "That's one small step for man, one giant leap for mankind."
...
Do you really think this footage is real?
yeah
checkout the footage first please
dont feel like it
how many people don't recognize that the differences between humans are real, and that they could benefit from leveraging the advantageous traits they have? putting them in the same position as us will not help them in the long run.
assuming that if most people don't agree then it's not true, doesn't mean it's not true
I can believe that they could be doing something like fiddling with 'hidden' system prompts, especially with the preview models, which can change model behavior
oh that's entirely possible for like web frontends, afaik chatgpt decreased the juice on their website at one point
I personally have no idea, but it's possible. They are in preview after all
Yeah, like that. I had forgotten about the 'juice' thing
yeah i feel like theyre fine and probably do fuck around a bit with web ui with system prompts and reasoning efforts etc
but probably not on api
Yeah they definitely do on the app/web UI. Lots of custom instructions baked in there. Which is why I never use it
Gemini 3 can’t be that bad
I'm curious what you use the model for? I've used 139 million tokens this month and I can tell you that the model quality degraded
I use it to run a pokemon showdown battle agent in real time, and its still very smart
I don't think it should show any degradation with pokemon showdown battles
it hasn't, but i'm curious what you mean by that
it's not a small amount of information to deal with, and it has to deal with mindgaming since it's playing against real people
The reason you experience degradation is because you used 139 million tokens
i don't know what needs to be debated, it's obvious they already changed the checkpoint.
don't know if it's a smaller model or the same scale, but for sure it's worse at other things than before.
@low plank what you feel is actually valid, don't bother too much with people who can't fathom possibilities outside their own scope.
it's also already known that they change checkpoints, they announce it, so that's one of many reasons why you could have worse performance.
infrastructure and other aspects also affect models, and in january they had a degradation that was then fixed. this is a case where it's going to have less long-term impact compared to a change in checkpoint or even model in the backend.
it's funny when i thought about it
Oh, the performance of the model in my small test isn't getting worse
Yeah, because the weights that hold the knowledge aren't as affected as others. It could also be that the degradation happens at the point where the probability calculation itself is off by quite a bit because of some mismatch in the stack, which would mean an infrastructure problem.
that's cooler than I expected
I still think it shouldn't be noticeable with what you have here but idk
wtf is with the empty responses
I think some shitty caching behavior from vertex
I dont have any issues with ai studio as provider
yeh vertex is mega shit
i’ve used 264M and it hasn’t degraded
I have used 1 trillion tokens and it has degraded
i was actually lying i used 1 quitntintitiollion tokens
token measuring contest
I actually have used rhombicosidodecahedrillion tokens and it has NOT degraded
tokens don't degrade, you do
HEY 
i have really mixed feelings about this model, its lows are really low, but oh my god its so good when it works
it fixed some bug ive been stuck on for way too long, gpt 5.2, k2.5, m2.5 none of these could fix it
(3 pro)
😒
wtf is this model smoking, it keeps returning the same words that are completely irrelevant
are they changing checkpoints behind the scenes
is it responding as if its completing your message?
it might be?
ive had that happen to me with flash
one of my services is failing, im relying on it returning json and everything worked fine up until rn
it keeps saying the word "thought" and "thoughtful"
Are they from Vertex?
yes vertex
Yep, they have trouble with other models sometimes too
Has happened 3 times or so for me
happens with me on Crush
much more than once, thought it was a prompt issue, and it's been a while
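For anyone whose service depends on the model returning JSON, like the one above that started getting empty replies or stray words such as "thought", here is a minimal sketch of a defensive wrapper. It assumes your provider call is wrapped in a zero-argument function; `call_model`, `parse_json_reply` and `get_json` are hypothetical names for illustration, not part of any Vertex or AI Studio SDK. It won't fix the provider, it only keeps bad replies from reaching the rest of the service.

```python
import json
from typing import Any, Callable


def parse_json_reply(text: str) -> Any:
    """Parse a model reply as JSON; raise ValueError if it is empty or not JSON."""
    if not text or not text.strip():
        raise ValueError("empty response")
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as exc:
        # A stray word like "thought" ends up here.
        raise ValueError(f"not valid JSON: {exc}") from exc


def get_json(call_model: Callable[[], str], max_attempts: int = 3) -> Any:
    """Call the (hypothetical) model wrapper until it returns valid JSON."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return parse_json_reply(call_model())
        except ValueError as exc:
            last_error = exc  # remember why this attempt failed, then retry
    raise RuntimeError(
        f"no valid JSON after {max_attempts} attempts: {last_error}"
    )
```

Usage would look like `data = get_json(lambda: my_client_call(prompt))`, where `my_client_call` is whatever function you already use to get the reply text. Retrying blindly costs tokens, so a low `max_attempts` plus logging of `last_error` is probably the safer default.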
gemini 3 pro is the only model I have observed thus far who actively tries to repeat positions in a lost game.
all other models do this by mistake, this is calculated and a reason it still holds a clean 0 loss AI-record.
If only it could do the same calculated choice for tool calls