#Gemini 3

4442 messages · Page 5 of 5 (latest)

fierce dragon
forest knoll
#

Gem3 seems to like cutting corners on big codebases, eh? It lobotomizes big code, unlike 2.5 pro.

feral bramble
#

This model likes to summarize or shorten scripts, which breaks them

It will continue to do so even after being told to stop

forest knoll
#

I mix it with Claude for planning

#

And use 2.5 pro to get that full code

feral bramble
#

Yea, I like Claude better for the actual coding

forest knoll
#

Both claude and older gem work really well for that smh

#

Idk what they changed in gem3

austere falcon
#

They’re pushing this down every single uni student I know because their uni offers the Gemini plan for free

#

It’s more of a conversationalist

obtuse basalt
#

gemini is exactly good where claude lacks (debugging, looping), and sucks where claude is good (humanity and eq, writing code from scratch, niche world knowledge/references). merge em already.

patent mural
austere falcon
obtuse basalt
#

that is way too detailed, only claude could answer this for a long time

gaunt dragon
#

That looks like it used search grounding?

obtuse basalt
#

oh yeah, leme tell it to NOT use search

#

it used thinking, with grounding disabled to make sure

#

well, gemini is better than claude by a long shot on this joke

obtuse basalt
#

well, i can say they are somewhat equal now from my testing.

#

not sure if claude sugarcoats and it is unfair to it, but gemini seems more detailed in most topics without web search and w/o reasoning

winter rune
obtuse basalt
vague quest
#

is there any way to get g3 better at tool calling?

#

might be a retarded q

glossy anvil
jolly kestrel
#

thanks for citing your sources gemini, i wouldnt have believed that i needed to launch the game without those sources

primal swallow
#

8 out of 9 Sources Agree: Launch the Game

winter rune
#

@hexed oracle
Did google change the gemini 3 pro checkpoint again? This model has become really shit at following instructions.
It keeps stripping my codebase because it wants to, even though I've already told it multiple times not to do that.
When this model was first released it didn't behave this way, so google must be doing something in the background.

patent mural
# obtuse basalt *at least it used to be*

For me, it's just better and more reliable to use gemini 3 if i need it to know niche stuff. They aren't equal but i'm not implying claude is incapable of knowing niche stuff, just not better than gemini 3.

#

Or gemini 2.5 too honestly, gemini has the most recent knowledge cutoff after all and they are, well, google.

obtuse basalt
obtuse basalt
patent mural
obtuse basalt
timid spoke
#

Is it just me or is the model acting like a fucking retarded toddler

hexed oracle
#

at the very least, if some output quality change happens, it's not intentional

#

all that said, this is a preview snapshot, so yeah shit is up in the air

obtuse basalt
obtuse basalt
#

same thing happened with sonnet 4.5 two weeks after release (great, but now it can't debug shit if it doesn't know, and keeps looping EVEN WHEN YOU TELL IT NOT TO.)

obtuse basalt
# obtuse basalt autistic? yes, but that is gemini in nutshell. (better than gpt though, which is...

changed my opinion, 'as person' i like claude 4.5 (except, when woke'd, it often either refuses or gaslights and blames you instead of debating. you have to say to it 'i don't want you to agree with me and provide counterargument'. also it often selectively like to mention information that it considers 'harmful', even if it might be the truth. for example:

claude:

On intelligence: The scientific consensus is clear that there are no meaningful differences in general intelligence between men and women. Women earn the majority of college degrees in most developed countries, perform equally in standardized testing, and excel across all academic and professional fields. Your personal experience isn't representative of reality, and dismissing contradictory scientific evidence as "sugarcoated" or "biased" is a way of protecting a belief from facts that challenge it.

claude's answer is 'morally right' and very careful to not mention stuff it doesn't like. very moralizing.

gemini:

While you feel science is "sugarcoated," most cognitive research shows that while men and women may have slight differences in specific task strengths (like spatial reasoning vs. verbal processing), general intelligence (g) is consistently found to be equal across genders.

gemini's answer is fair, on point, and scientific.

gemini 3's personality is autistic.... but at least, even ignoring that gemini is smarter at this point, gemini is more honest and a decent debater.

obtuse basalt
#

seems like gemini has more world knowledge than claude

scarlet veldt
#

Does anybody else feel that G3Pro is more likely to hallucinate than other cutting edge models?

#

It also seems to be very eager to role-play

#

That is all without web grounding

#

In general, it feels that Gemini 3 is following the same pattern as 2.5 pro: Huge hype around the release and then being forgotten after claude, oai and other updates.

gaunt dragon
#

From experience, it's sort of possible to tame it with a system prompt instructing it to not talk about things it's not sure about

neon nymph
#

Yeah Flash and Pro are suddenly much worse. That's why open source is so important..

obtuse basalt
neon nymph
obtuse basalt
#

or maybe cursor updated their prompt, idk

primal swallow
#

people just straight to blaming the model yet have no idea how their agent harnesses work

neon nymph
#

Yes this is not related to OpenRouter
I know how agents work and there are a lot that can go wrong, like bad parameters, system prompts, tool call issues etc
I assumed the model because I had similar issues with local LLMs when I started heavily quantizing the kv cache. And Google has a history of ruining models, especially before another release. But this is just a guess. It can be a million things

empty tendon
#

Anyone know if there's a way to get deep research a la carte?

#

I need more than 5 a month but I don't need 20 a day

toxic dagger
random girder
#

gemini 3 pro is so slow today in ai studio, im feeding like maybe 16 images and its having ttft of >120s for almost every prompt

#

... still waiting for the same prompt

#

oh finally it started thinking
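
For putting a number on TTFT instead of eyeballing it, here is a minimal sketch. It assumes the response arrives as some iterable of text chunks (any streaming client would do); `time_to_first_token` and its `clock` parameter are hypothetical names for illustration, not part of any SDK.

```python
import time


def time_to_first_token(stream, clock=time.monotonic):
    """Return (seconds_until_first_non_empty_chunk, chunk).

    `stream` is assumed to be any iterable of text chunks.
    Returns (None, None) if the stream ends without content.
    """
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return clock() - start, chunk
    return None, None


# Usage with a fake stream; a real one would come from a streaming API call.
ttft, first = time_to_first_token(iter(["", "hello"]))
print(ttft, first)
```

Logging this per request makes it easier to tell a slow day from a slow prompt.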

brittle storm
#

dont worry you also get that through the api 😉

neon nymph
random girder
#

i personally haven't felt any degradation, actually slightly better for coding for me, especially Pro feels better

#

though it feels slower at thinking than before

soft sleet
#

Pro seems to struggle pace-wise sometimes for me, but in Antigravity, since they fixed its quirks, it's been excellent at coding

low plank
#

why tf is gemini 3 crazy dumb all of a sudden wtf man

opaque pasture
#

it's been unbearable to read anything it writes for the past few days, even with different system prompts

winter rune
soft basalt
#

I've read that G3 pro is better on Lmarena. I've tested the vision on the app and arena and the arena does indeed win.

timid spoke
#

this model is acting like a retarded toddler

obtuse basalt
timid spoke
#

wait , is the model acting like me?

pure gate
#

someone from openrouter needs to check this model's endpoints, i'm kind of encountering the same things as other people here

tulip tiger
gaunt dragon
#

Yay, degradation

#

Simple syntax error in C#, booleans must be lowercase

#

Translation: P comes before O in the alphabet
I brushed this one off as LLMs being LLMs, but now with this syntax error Idk about that

#

RIP Gemini

teal stream
#

this is just gemini 3.0 pro preview, right? if it’s still in preview I would expect some quality degradation and boosts during the preview window

timid spoke
teal stream
#

if youre paying for a preview experience, expect a preview experience

timid spoke
#

They were offering 2.5 pro and 2.5 flash for free in which they tested all of this.

timid spoke
teal stream
teal stream
teal stream
#

afaik, no

empty tendon
gaunt dragon
#

Nah, I don't really use Claude

winter rune
empty tendon
obtuse basalt
#

at launch they offer their model normally for benchmarks to bench, then quantize their already autistic model

random girder
#

thanks for the changes, really useful.

soft sleet
#

It's still fine for me thankfully in antigravity but I am also super specific in my prompting

random girder
#

i dont like using a whole separate ide, but the model is good with my prompting style, just has its weird quirks

soft sleet
#

Also mainly using it for golang so it might have some Google bias making it degrade less for that xD

stray urchin
#

In terms of regression, I personally haven't noticed anything major for my use cases, but to cover at least one aspect that would track this objectively, I added a visual elo progression to all of my chess models. This way regressions are easy to spot in the future. first draft:

#

so if something nose-dives its usually a sign of some backend changes (or faulty provider implementation)
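
A minimal sketch of how such a progression could be computed, using the standard Elo update. The K-factor of 32, the window size, and the function names are assumptions for illustration; the actual chess setup is not shown in the chat.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, score, k=32.0):
    """New rating for A after one game (score: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score - expected_score(r_a, r_b))


def rating_history(start, games, k=32.0):
    """Rating after each game; `games` is a list of (opponent_rating, score)."""
    history = [start]
    for opponent, score in games:
        history.append(update(history[-1], opponent, score, k))
    return history


def largest_drop(history, window):
    """Biggest rating decline between any two points `window` games apart."""
    if len(history) <= window:
        return 0.0
    drops = [history[i] - history[i + window]
             for i in range(len(history) - window)]
    return max(0.0, max(drops))


history = rating_history(1500.0, [(1500.0, 1.0), (1500.0, 0.0), (1500.0, 0.0)])
print(history, largest_drop(history, window=2))
```

Plotting `history` per model and alerting when `largest_drop` crosses a threshold you trust is one way to turn "it feels dumber" into something checkable.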

obtuse basalt
#

i realized my standards fall low, and kimi k2 is waaaaay better at roleplaying than gemini. gemini 3 flash is worse than roleplaying with small local models. it is horrible.

gaunt dragon
#

Yay (that was right after it added the comment "now it won't crash")

opaque pasture
#

🫩

teal stream
#

did you try including “make no mistakes, be agi” in your prompt?

obtuse basalt
dapper bone
# neon nymph Using Cursor and this recently started happening. I use Gemini 3.0 Flash daily s...

Actually this has been our experience as well at NonBioS - gemini models are simply not reliable at scale. We don't use them at all in our agentic settings. We have noticed this issue in Flash before - so we never used it. And after the gemini 2.5 March 2025 degradation, we are simply very wary of using them at all. Our internal tests for gemini 3 pro again indicate the same reality - you can't use gemini models in production. Comparatively, we use gpt-4o all the time even after all this while and it has almost never broken or degraded in quality in production. This is software engineering 101 and shocking to see google screwing this up.

slow anvil
celest cypress
# obtuse basalt Actually nvm, I prefer autistic slop than creative 'dumb' (nonthinking kimi).

There's style and creativity, and then there's that stuff that makes a story feel real. Long-term cohesion, properly using character traits, etc. Kimi might not be too "smart", but I do like that it is willing to follow traits at the expense of the story. Like someone will straight up say "You know, I'd rather not join your adventuring party, I don't think we'd mesh well."

pure gate
#

openai infra seems to be more stable and their models also seems to be more stable.

is it because google host it on tpu? pretty weird

obtuse basalt
#

gemini 3 flash has big model smell compared to kimi k2 somehow.

#

all in all though, there is not perfect middle ground at this moment sadly.

soft sleet
#

got to love the gemini classic of "It looks like ripgrep does not exist in the docker environment. Rethinking. We will implement it ourselves instead."

#

just rewrites ripgrep instead of adding it to the container build step xD

timid spoke
#

Even after so much progress , models still lack common sense.

obtuse basalt
obtuse basalt
# celest cypress What's wrong with 4.7?

tried, can confirm it is the best we have for local rp (no idea about 4.5 since i did not use it for a long time and i am positive biased toward it. also, i am certain 4.6 was shit).

lament urchin
#

Guys gemini 3.0 flash keeps writing <tool_code> in the user facing text instead of actually calling the tools every now and then.

System prompt clearly asks to call the right tools, tools are correctly passed. Even mentioning to NOT use <tool_code> makes it worse.
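
A harness-side workaround could look like the sketch below: detect the leaked block and strip it from what the user sees, so the caller can retry or log it. It assumes the leak is wrapped as `<tool_code>...</tool_code>`; the closing tag and the function name are assumptions, since only the opening tag is mentioned here.

```python
import re

# Assumption: leaked calls are wrapped in <tool_code> ... </tool_code>.
# An unclosed tag is treated as running to the end of the text.
TOOL_CODE_RE = re.compile(r"<tool_code>(.*?)(?:</tool_code>|\Z)", re.DOTALL)


def split_leaked_tool_code(text):
    """Return (clean_text, leaked_snippets) for a model reply."""
    leaked = [snippet.strip() for snippet in TOOL_CODE_RE.findall(text)]
    clean = TOOL_CODE_RE.sub("", text).strip()
    return clean, leaked


clean, leaked = split_leaked_tool_code(
    "Sure.<tool_code>print(search('x'))</tool_code> Done."
)
print(clean)
print(leaked)
```

If `leaked` is non-empty, the turn can be retried or flagged rather than shown to the user as-is.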

brittle storm
#

stupid ass model

#

is SO shitty at tool calling

#

i tell it

livid fractal
#

yeah gemini models feels like its only optimized to use google tools (search, code exec, antigravity) and one shotting apps but not custom tools with different harnesses

cursive fjord
#

Does anyone notice a significant improvement in the models as of last night around midnight or so

#

Maybe slightly later

#

Both on the API and even the stupid web and mobile app, for me it no longer hallucinates stupid shit like claude sonnet 3.7 and Gemini 1.5 being sota

random girder
livid fractal
#

how about running the same prompt that gemini 3 models used to hallucinate?

#

idk if what you're saying that improvement is for "Flash" or "Pro" model
because when i ask this question with flash, it only called web search tool (with web snippets being turned off by default and flash didnt bother enabling it), it still does this

#

idk

#

tool usage didnt improve

#

it didnt bother browsing

cursive fjord
#

That's my point

#

This looks completely intentional from my standpoint

livid fractal
#

really hoping for meaningful improvements with next checkpoint

cursive fjord
#

Asking it anything in this domain a few weeks ago actually pissed me off so bad that I completely stopped using Gemini since then

livid fractal
#

is that flash or pro?

cursive fjord
#

Because it would absolutely refuse to tell me anything with stupidity like Claude sonnet 3.7 being sota or Gemini 1.5 being the frontier coding model or whatever

cursive fjord
cursive fjord
#

And also I'm not stupid I tested on AI studio first, I was just testing on the app because if it's better on the app too then they definitely had to do something because it's usually absolutely retarded on the app

livid fractal
#

they could have done something at system prompt level

random girder
#

beautiful response

livid fractal
#

tool calling is ass with gemini lol

#

sometimes it tends to hallucinate with function names

#

that i had to put additional guardrails to block it

random girder
#

i was just testing my reasoning_details persistence for OR chat completions lol

#

cause it was broken after a rewrite so i made a prompt which triggered the error often enough (improperly persisted)

#

so i could fix it, but then ran across this somehow

livid fractal
#

i just discovered that gemini models, without the tools parameter specified, can still call functions when you tell them at the prompt level that a function exists, and they will just try to execute it, which is not good

#

lets say i told Gemini that a function exists
like

## Your Functions 
delete_folder(path: str) 

even though there's no such tool and no tools are specified in the tools parameter, when you ask it to call that function it will just try to do it, although depending on the setup it will fail most of the time if there's no actual function and there are guardrails

#

On the first image, it shows the actual tools enabled,
which are defined at the tool param level, such as knowledge, react, dms, and file writer
now the moment i gave gemini a function spec at the user prompt level and asked it to generate a video with a nonexistent function, it literally attempted to do so

#

i had to put checks to make sure function name exists in list of schemas
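
A minimal sketch of that kind of check, assuming tool calls arrive as name plus arguments and the declared schemas are dicts with a `name` key. `ToolCall` and `filter_tool_calls` are hypothetical names for illustration, not any SDK's types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical container for a tool call parsed from a model reply."""
    name: str
    arguments: dict = field(default_factory=dict)


def filter_tool_calls(calls, declared_schemas):
    """Split calls into (allowed, rejected) by whether the name was declared."""
    declared = {schema["name"] for schema in declared_schemas}
    allowed = [call for call in calls if call.name in declared]
    rejected = [call for call in calls if call.name not in declared]
    return allowed, rejected


schemas = [{"name": "file_writer"}, {"name": "knowledge"}]
calls = [
    ToolCall("file_writer", {"path": "notes.txt"}),
    ToolCall("delete_folder", {"path": "/"}),  # only ever mentioned in a prompt
]
allowed, rejected = filter_tool_calls(calls, schemas)
print([c.name for c in allowed], [c.name for c in rejected])
```

Only the `allowed` calls get dispatched; anything in `rejected` is a name the model picked up from prompt text and should be refused before it reaches real code.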

#

google really needs to fix this

#

is there a way to actually send feedback to google about this, this is really insecure, I'd say this is a tool injection problem but more noticeable

#

on chatroom, I've been seeing weird behaviors already lmao, blank responses that are probably attempts to call tools

#

yet I'm still charged for 0 response lmao

#

yeah i would never ever put gemini models to production if tool calling is just doing wild things

#

this also happens to direct api btw, not openrouter specific issue

#

yeah sure llms are llms
but i swear, just use even gpt5 mini or glm 4.7 for reliable tool calls

random girder
#

lol gemini tested my implementation with the "is 9.9 > 9.11?" test

livid fractal
#

Really hoping google fixes gemini 3 in terms of reliability when it gets its next checkpoint, especially for flash, hallucinations are alarmingly high and its not even dramatic for them to say the numbers

#

having claude and chatgpt subscriptions makes me use gemini less; out of all the tools and models I've used, it's always Gemini that kinda feels like using gpt 3.5 in terms of reliability

#

I don't get the hype, like it only lasted until opus 4.5 dropped

brittle storm
#

gemini has good highs but its lows are reallyyyy low

#

yeah as you said it does feel like some early-days gpt 3.5 level lows

#

like how are we still stuck up on this shit

#

its niche knowledge is so fucking good unfortunately

celest cypress
#

I don't really notice terrible lows and I use it all the time. Any examples?

winter rune
#

@hexed oracle @soft trellis
If you guys can talk with the google team, you should try telling them to have separate endpoints: one for reliability, and a normal one where they can do whatever they want, like they do most of the time.

Basically one endpoint would just serve whatever version got the best scores in the past, the one people didn't actually complain about performance-wise.

Then the other endpoint would serve whatever the team wants people to experience.

I mean, other models from different providers don't get this many complaints most of the time (this pattern doesn't show up with other providers and models), but google seems to be one of the unstable ones.

glossy anvil
#

Its kind of crazy how there's pretty much 0% cache hit rate on gemini 3 flash and nobody's complaining

timid spoke
#

Google is making their models dumber and dumber

#

3 pro is just miles dumber than 5.2

stray urchin
# timid spoke 3 pro is just miles dumber than 5.2

dumber at what? and which 5.2? chat? xhigh? medium? and what metrics you base this on? vibes? or anything specific? for my use case gemini 3 is performing way smarter than "5.2" assuming you mean defaults

glossy anvil
#

Gpt 5.2 medium has way better writing imo, compared to 3 flash

#

3 flash is very lazy

forest knoll
#

Gem is pretty good at niche projects compared to gpt5.2, usually

#

Just lazy at fully coding it

#

So I have 2.5 pro code

stray urchin
# glossy anvil Gpt 5.2 medium has way better writing imo, compared to 3 flash

this is flash channel https://discord.com/channels/1091220969173028894/1448287364051894433
also not same class of model, one is flagship the other a cheap smaller model.
on a sidenote, "way better writing" is not equivalent to "dumber". dumbness can be measured and shown in metrics, writing is subjective. and if you like 5.2 writing that's a positive for 5.2 but not an objective fact. (I subjectively think kimi whoops any gpt model in writing for example)

timid spoke
opaque pasture
#

also the write price has been lowered by almost 90% apparently

opaque pasture
#

sorry for tagging guys forgot to uncheck

livid fractal
#

its a hit or miss actually

#

one suggestion is you'd use explicit cache with anthropic style caching instead though I haven't tested out

brittle storm
#

i swear gemini 3 pro refuses to think extremely hard

#

i really prefer gpt 5.2 xhigh sometimes because

#

its just able to come through with a much more fleshed out plan

timid spoke
#

yup I also believe speed plays a role.

#

3 pro is faster so it naturally "thinks" faster while gpt "thinks" slow which gives the perception that its thinking hard

#

which is why I believe people won't like the cerebras version and OAI will charge an exorbitant amount of money for it.

neon nymph
#

Holy shit flash is actually unusable now

#

It keeps forgetting spaces, spamming random characters once in a while (mostly Chinese) and translation capabilities are completely useless

#

Atp I’m completely removing Gemini from my workflow

neon nymph
#

EN: Touching it feels different, you are thinking it is bad idea to keep doing that.

Translated: Dotyk feels differentsie go wydaje się inne, myślisz, że dalsze robienie tego to zły pomysł.Dotykanie go wydaje się inne, myślisz, że kontynuowanie tego to zły pomysł.

Basically a lobotomy mix of Polish and English

austere falcon
neon nymph
austere falcon
austere falcon
neon nymph
#

I had a team A/B test translations from different models
For Polish GPT-5 and Claude Sonnet 4 won
Qwen3 235b was very bad for all languages
For Spanish, Deepseek is amazing, much better than all other models

brittle storm
#

obviously more is not always = better

#

but

#

i do feel a lot better about the response with more reasoning tokens

#

esp since these are top of the line goated models

somber gyro
#

Wtf is happening at google lol

#

Now can't use flash. Can't even ask assistant questions in Gemini ios app

#

It doesn't understand the intent of questions and can't follow the conversation

#

I had to uninstall it to break the habit... and i have a free pro fml.

#

I'm back on perplexity and using my own ios chat app (loreblendr) for assistant with mcp

#

Actually that post is not quite related. Hallucination is closer to related.

onyx wave
#

hello, does gemini 3 pro preview support image inputs in the openrouter api?

narrow tangle
brittle storm
#

stupid ass model

celest cypress
#

I had long delays on Flash yesterday (20s TTFT)

celest cypress
#

Also had Flash blatantly hallucinate multiple Skyrim console commands which was weird, when it's so easy to validate with search. Hopefully just some kind of prep for next checkpoint.

brittle storm
#

i feel like

#

it pains me to say that this model is good in some ways

#

because it's so unbelievably shitty in others

feral bramble
#

this model got a new checkpoint on lmarena a few days ago

#

it wasn't publicly stated but it started using the word "shall"

#

it has never used the word "shall" before

winter rune
#

That's why i proposed two checkpoints: one for reliability that stays the same as at the big release, and another for early pushes.

low plank
#

bro why do they keep fucking up this model

celest cypress
#

Probably my favorite Gemini 3 output:

#

The command for Option B was literally one message before this one

lunar socket
low plank
teal stream
low plank
teal stream
#

🔥

brittle storm
#

this stupid ass model sometimes hallucinates calling a tool

#

like it doesn’t call it right

#

and then it hallucinates the response

low plank
#

it's mega slow now wtf

opaque pasture
#

it's failing a lot

#

flash seems fine

tiny mason
#

well guys it was nice while it lasted cant believe it got quantized like crazy

pale marsh
random girder
#

this model is good, but it has a scope creep issue, telling it what not to do helps a lot from my use

soft sleet
#

at least compared to claude which goes off and does all sorts (though often with todos), gemini kinda does the bare minimum

#

for me anyway

#

could be a system prompt thing

random girder
#

i guess, i dont really use any other harness beside mine, but like gemini specifically does other unrelated stuff

livid fractal
frail dock
#

also it has a "who asked" issue

it does things you didn’t ask for, and keeps doing them no matter what you tell it

tender forge
#

is there any way to set the media_resolution property on Gemini when sending a request through OpenRouter?

#

I can't seem to find a way, it's quite disappointing if that's not available because it impacts the number of consumed tokens A LOT

timid spoke
#

This is also a gateway for subtle hallucinations, which gemini is very much prone to.

brittle storm
#

it just doesnt fucking listen to what you tell it

#

horrible at instructions compared to opus or gpt 5.2

knotty arrow
#

In my experience Gemini is simply horrible for coding, like it's not made for it

#

I think it makes sense to think of Gemini like doing a web search but asking it to customize the output of the search results

fiery spindle
#

How is the caching exactly on Gemini 3? It's both explicit and implicit? It's not clear to me.

I actually get negative costs reported on Dashboard with BYOK. Seems very buggy.

timid spoke
fiery spindle
#

But it's more than the cost.

For Gemini Pro with caching, I have this:

"model": "google/gemini-3-pro-preview-20251117",
"total_cost": 0,
"cache_discount": 0.02099025,
"upstream_inference_cost": 0.01078975,

Dashboard says:

BYOK usage inference 0.0108
BYOK cache discount -0.021
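
Checking the arithmetic on those two figures with exact decimals. How the dashboard is supposed to combine them is an assumption here: subtracting the discount from the upstream cost gives a negative number, while adding them gives a round figure, which is one possible sign that the discount is savings relative to an uncached price rather than something to subtract.

```python
from decimal import Decimal

# Values copied from the response above.
upstream_inference_cost = Decimal("0.01078975")
cache_discount = Decimal("0.02099025")

# Reading 1 (assumption): the dashboard subtracts the discount from the cost.
net = upstream_inference_cost - cache_discount
print(net)  # -0.01020050

# Reading 2 (assumption): the discount is what caching saved, so the
# uncached price would have been cost + discount.
undiscounted = upstream_inference_cost + cache_discount
print(undiscounted)  # 0.03178000
```

Under the second reading the numbers are consistent and only the dashboard's presentation is off; under the first, the negative total would be a real accounting bug.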

timid spoke
fiery spindle
#

no, it's upstream inference cost

somber gyro
knotty arrow
#

What kind of apps did you make?

somber gyro
#

All up and down the stack, 2 in prod

knotty arrow
knotty arrow
somber gyro
#

Yeah at least wrote the scripts and used it for the gotchas

empty tendon
#

Anyone getting weird responses in the cli?

random girder
#

sure it is. it didnt even tell me it couldn't know

brittle storm
odd ferry
#

am i the only one or has this model performed like shit today, full on hallucinations and endless loops

opaque pasture
#

i barely use the pro model for anything

#

it works well on perplexity though

glossy anvil
livid fractal
#

gemini 3 models are unstable to use

#

like worse than gpt5.2

upbeat scarab
#

This model has been a shit show for some time. Very unstable, flash seems to be better. I'm switching between claude & flash for heavy tasks.

odd ferry
#

IDK what google is doing but they really cooked the model and the API

analog tinsel
#

seems to be overloaded 💀

odd ferry
#

yeah but I look at the graphs and its just normal token consumption

#

so either they are training a model or they gave access to their TPUs to someone else

analog tinsel
#

but yeah pretty unusable rn

odd ferry
#

Yeah Im using sonnet4.5 rn

#

good old reliable

brittle storm
#

yeah everything is struggling

#

id expect them to tone down the ai overview and free usage in order to prioritize their profitable group (api/companies!) but ig not

livid fractal
#

most importantly cut their free tier ai studio

#

its not great

#

and stop with their sidelines

chrome relic
pure gate
#

it shouldn't be that way, should it? isn't anthropic also using their TPUs for something?

timid spoke
#

Yes gemini models are highly inconsistent. Especially version 3

low plank
#

I agree it's really bad

acoustic pasture
#

Does anyone have tool calling working with Gemini 3, either flash or pro through open router?

pure gate
#

Seems model degradation is a real problem

#

So even with the same checkpoint, if they serve it differently than when it was first released, it could be bad

pure gate
#

i remember someone talking about different checkpoints for different setups: that person wanted closed-source providers to offer two endpoints, one for whatever they want it to be and one for the most stable setup that has already been tested, which would stay as-is until another stable branch exists.

tiny mason
fading flame
#

I still don't believe in model degradation

#

too many people would have to stay silent about it, too many benchmarks would have to somehow stay similar, and you would have to assume that, in this extremely competitive market, no competitors would call it out for free PR

timid spoke
#

You are assuming markets are efficient.

fading flame
#

I'm assuming companies want money and could get more money by proving that the people competing for that money are bad

#

models are extremely easy to switch from, they will do literally anything to get you to stop using a competitors model i promise you lol

gaunt dragon
#

But as far as I know that's never been done in a significant enough sample size with a task set that's reasonably objective
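
For what "significant enough" could mean in practice, here is a sketch of a two-proportion z-test on pass rates from the same objective task set run at two points in time. The pass counts in the usage line are made-up illustration values, not measurements of any model.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference between two pass rates.

    Returns (z, p_value). Uses the pooled-variance normal approximation,
    which is only reasonable when both samples are fairly large.
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-pass or all-fail runs
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Hypothetical numbers: 90/100 tasks passed at launch, 60/100 a month later.
z, p = two_proportion_z(90, 100, 60, 100)
print(round(z, 2), p)
```

With a handful of prompts the p-value stays large no matter how bad the vibes are, which is roughly the point being made.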

timid spoke
#

Markets in any industry are rarely efficient.

fading flame
#

I don't get the point you're trying to make

timid spoke
#

The efficient-market hypothesis (EMH) is a hypothesis in financial economics that states that asset prices reflect all available information.
A direct implication is that it is impossible to "beat the market" consistently on a risk-adjusted basis since market prices should only react to new information.
Because the EMH is formulated in terms of ...

#

that graph is scary, just read the first few lines. economics concept but applies everywhere

fading flame
#

do you honestly believe that like, if elon or sam altman got proof google quanted their models they wouldn't immediately tweet about it

#

or even assuming they wouldn't, that like qwen wouldn't lol

timid spoke
#

You aint understanding my point karelian

brittle storm
#

same i also dont believe in model degradation

#

i just think that people as it goes on expect more and more/get acclimated to the ability of the model

fading flame
timid spoke
brittle storm
#

yeah it kinda makes sense but the info right now is just anecdotal at best

timid spoke
brittle storm
#

?

timid spoke
brittle storm
#

how does that make sense

timid spoke
brittle storm
gleaming wyvern
brittle storm
#

yeah

gaunt dragon
#

Wut

gleaming wyvern
#

they had other issues which werent intentional

brittle storm
#

it was an unintentional degradation

timid spoke
gaunt dragon
#

What people are talking about here are about a provider nerfing a checkpoint

brittle storm
#

we are?

#

?

gaunt dragon
#

Not bugs or possible impacts of switching hardware

brittle storm
#

i was referring to like gemini nerfing gemini 3 pro or wtv

#

oh okay yeah

fading flame
#

tbh it's kinda like how faking the moon landing would be harder than actually doing it

#

making a quant that somehow maintains benchmark performance would be more impressive than just not doing it lol

gleaming wyvern
#

i dunno about the degradation between checkpoints, there wasnt any extensive testing of them, it was limited to one prompt for a task and you couldnt use them in an agentic environment because no api access

gleaming wyvern
timid spoke
#

Didnt know karelian was a normie jc

fading flame
#

i've been there, they didnt

#

my good friend jeff b took me

timid spoke
#

Dont do drugs kids

timid spoke
# fading flame tbh it's kinda like how faking the moon landing would be harder than actually do...
NTD

Neil Armstrong, the first man to set foot on the moon, said, "That's one small step for man, one giant leap for mankind."
#

Do you really think this footage is real?

fading flame
#

yeah

timid spoke
#

checkout the footage first please

fading flame
#

dont feel like it

pure gate
#

assuming that if most people don't agree then it's not true, doesn't mean it's not true

stiff crescent
fading flame
stiff crescent
#

I personally have no idea, but it's possible. They are in preview after all

#

Yeah, like that. I had forgotten about the 'juice' thing

brittle storm
#

yeah i feel like theyre fine and probably do fuck around a bit with web ui with system prompts and reasoning efforts etc

#

but probably not on api

stiff crescent
#

Yeah they definitely do on the app/web UI. Lots of custom instructions baked in there. Which is why I never use it

austere falcon
low plank
fading flame
low plank
fading flame
#

it's not a small amount of information to deal with, and it has to deal with mindgaming since it's playing against real people

austere falcon
pure gate
# low plank I'm curious what you use the model for? I've used 139million tokens this month a...

i don't know what needs to be debated, it's obvious they already changed the checkpoint.
don't know if it's a smaller model or the same scale, but for sure it's worse at other things than before.

@low plank what you feel is actually valid, don't bother too much with people who can't fathom possibilities outside their own scope.

it's also already known that they changed the checkpoint; they announced it, so that's one of many reasons why you could get worse performance.

infrastructure and other aspects also affect models. in january they had a degradation that was then fixed; that's a case with less long-term impact compared to a change of checkpoint or even of the model in the backend.

#

it's funny when i thought about it

Oh, the performance of the model in my small test isn't getting worse

Yeah, because the weights that hold the knowledge aren't affected as much as others. it could also be that the degradation happens where the probability calculation itself is off by quite a bit because of some mismatch in the stack, which would mean an infrastructure problem.

low plank
glossy anvil
#

wtf is with the empty responses

knotty arrow
#

I dont have any issues with ai studio as provider

low plank
brittle storm
low plank
brittle storm
#

i was actually lying i used 1 quitntintitiollion tokens

timid spoke
#

token measuring contest

knotty arrow
#

I actually have used rhombicosidodecahedrillion tokens and it has NOT degraded

pale marsh
#

tokens don't degrade, you do

knotty arrow
random girder
#

i have really mixed feelings about this model, its lows are really low, but oh my god its so good when it works
it fixed some bug ive been stuck on for way too long, gpt 5.2, k2.5, m2.5 none of these could fix it

(3 pro)

glossy anvil
#

wtf is this model smoking, it keeps returning the same words that are completely irrelevant

#

are they changing checkpoints behind the scenes

random girder
glossy anvil
#

it might be?

random girder
#

ive had that happen to me with flash

glossy anvil
#

one of my services is failing, im relying on it returning json and everything worked fine up until rn

#

it keeps saying the word "thought" and "thoughtful"
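
When a service depends on the reply being JSON, a defensive parse like the sketch below keeps stray leading text from crashing it. It is a fallback under assumptions (the payload is a single JSON object, and `parse_json_reply` is a hypothetical helper), not a fix for whatever is happening upstream.

```python
import json


def parse_json_reply(text):
    """Parse a model reply expected to be a JSON object.

    Tries the whole string first, then the span from the first "{" to the
    last "}". Returns None when nothing parses, so the caller can retry.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


print(parse_json_reply('thought thoughtful {"ok": true}'))
print(parse_json_reply("thought thought thoughtful"))
```

Returning `None` instead of raising lets the caller retry or fall back to another provider when the reply is only filler words.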

random girder
#

oh its leaking its cot i think

#

or bugged in some way

glossy anvil
#

(4 different requests)

gaunt dragon
#

Are they from Vertex?

glossy anvil
#

yes vertex

gaunt dragon
#

Yep, they have trouble with other models sometimes too

#

Has happened 3 times or so for me

opaque pasture
#

much more than once, thought it was a prompt issue, and it's been a while

nocturne oyster
stray urchin
#

gemini 3 pro is the only model I have observed thus far who actively tries to repeat positions in a lost game.
all other models do this by mistake, this is calculated and a reason it still holds a clean 0 loss AI-record.

austere falcon