#Xiaomi MiMo V2 Flash
359 messages · Page 1 of 1 (latest)
is this running on ascend chips?
🔥
@viscid fog using that models :free counting to 1000 requests per day limit on 10$+ user plan?
what if there is only :free model variant and no paid? We can't use it after reaching 1000 RPD?
<@&1384697330254610442>
yes
the Xiaomi free endpoint has unlocked RPD limits
@viscid fog Nemotron 3 Nano 30B A3B also have unlocked RPD limits? this model also don't have for now paid variants
that model also does not have RPD
okay thanks ❤️
Is this endpoint being trained on? Or just free, but no training?
Prompt: Write a short story about a deal with the djinn gone awry
Output:
free & not training
Huh this model seems pretty good
I actually really like the writing
I like the coding style
Much more than the default style of like GPT 5
And it seems to be coding pretty well too
gonna use this to co-write for a bit and I'll report back if I find the writing to get annoying
blog mentions this being under the MIT license which is great but there is no licence in the repo or if its "modified MIT"
if hybrid attention like this can be widely adopted + competitive it'll be so sick
110 TPS on a top model is massive
ah
Which in my testing is a pretty wide category :/
all rules and UI work perfectly
castling, en passant, promotion, etc.
this model scored 40% on my spatial reasoning test from a 20 year old children's medieval fantasy game
😭
did very well in a personal bench, just as well as gemini 2.5 flash / grok 4 fast / deepseek 3.2
what's the API pricing? It might replace grok 4 fast for me
no idea, but I'd guess cheaper
also 15 day free period ⁉️
probably figuring out pricing
hybrid models are probably harder to price correctly than traditional models
idk
holy
based
not AMAZING at terminal bench, but not awful?
GLM 4.6 scores 24.5%
this is actually the second highest open model on terminal bench perhaps?
yeah I think so
deepseek is higher but has other factors that make quite bad to use for coding, like insane hallucination
This model is still prone to hallucinations, but vibes seem better on its niche knowledge than most other open models I've tried
blog mentions this model is under MIT licence but Github has Apache???
https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/LICENSE
Improvement
this is 100% taking the role of grok 4 fast for me
scoring higher than grok code fast on terminal bench
sounds like a great model, now i gotta try it
and if it wasn't apparent: scoring higher than minimax, glm 4.6, kimi k2 (and thinking variant), and even Claude 4.5 Haiku
@viscid fog
maybe because of the word "brutalist"?
nope.. not because of brutalist
🤔
maybe my system prompt
oh right refusals
hopefully it doesn't do that when I try to use it to answer tool calls?
sadly, this model does not have up to date knowledge it seems (e.g React Router V7, Tailwind V4)
i have no clue what in my system prompt could be causing this
k time to cut out parts till it works
looks like .8 temp and .95 top_p is recommended?
where is this from? I couldn't find any pricing info on their official API docs https://platform.xiaomimimo.com/#/docs/pricing
A simple open-source product documentation platform
A simple open-source product documentation platform
oh
you meant the pricing
A simple open-source product documentation platform
okay it seems to be this causing it
whatever i can omit that to use it anyway
$0.1 per million input is low but not as low as like $0.02 from some other models if it doesn't support caching
tbf it doesn't say
and somehow only sometimes
SO GOOD
atleast they didn't commit any chart crimes here
did retry, stuck in reasoning
its a good model, sir
ohhh, the
You are MiMo-V2-Flash (free), a large language model from xiaomi.
Formatting Rules:
- Use Markdown for lists, tables, and styling.
- Use ```code fences``` for all code blocks.
- Format file names, paths, and function names with `inline code` backticks.
- **For all mathematical expressions, you must use dollar-sign delimiters. Use $...$ for inline math and $$...$$ for block math. Do not use (...) or [...] delimiters.**
fucks with it's head
infinite headpats. this model gets it to not be awkward
so it is VERY system prompt sensitive and gets hyper focused on it's objective to the point of getting confused
guys bad news, I just saw a screenshot from the jai server
safety filter should help if its this sensitive
safety filters save us
sir, you want to nurf an already silly model?
well its open weights, providers will probs pick it up once they resolve the license thing
I like this model for summarizing
I made a github issue about that https://github.com/XiaomiMiMo/MiMo-V2-Flash/issues/2
GitHub
The blog mention this model is under the MIT license but this repo has Apache 2.0 licence file. Which is the correct licence for this model?
overall very happy with it
I am fixing this now
based on their gh these are very recommended
Hightly recommended, even
Maybe it should be {month} ?
i asked gemini 3 to translate the system prompt
thats probably what it would come out to
so far this model have been performing my SWE tests competently, such as compiling program from the year 2000 on modern toolchain https://gist.github.com/kth8/0897f24ce7c7bed643291dd6ff658e15
my friend sent this, very weird result for multiplication
I'll be extremely pleasantly surprised if this ends up to be the best open agentic coding model
and it seems like it may be that
(deepseek 3.2 excluded for being too spiky, slow, and full of hallucinations)
I guess Cursor Composer V2 will just happen to drop a week from now as well then eh
I think composer is just GLM 4.5-4.6 finetuned no?
if they do their postrain regime on this model instead of GLM 4.6 (assuming that's what they're using)
yea
I think so
lines up with the pricing
they could run this on GPUs instead of cerebras or groq or whoever, and serve it WAY cheaper
one composer's issues has been it's surprisingly expensive, I assume because it's on cerebras/groq to go fast
is this actually securely better than glm? im having mixed vibes with it
If it is flash then the pro model will also be on the way.
doesn't seem to have real understanding on stuff and seems to be trained on best Q&A from our short test, but it is adorable
did not test the agentic coding of it, so can't prove or deny bulbasaur
Honestly shocked Xiaomi coming in out of nowhere and making a model this good at agentic tasks
I am very pleasantly impressed by it
Did better or on par in a couple vibes tests, but the big thing to me is it’s performing significantly better on TerminalBench 2.0
Which imo is a quite high quality bench
no, aider benchmark is imo, but no unofficial results yet
I think aider is not relevant anymore, it doesn’t test tool calling or anything
- it’s been around too long, data is definitely in training sets
The benchmark hasn’t even been updated with latest models in over a month it seems like
in my tests, it is still representative of the model state. i don't think it is a benchmark they can saturate.
ppl in discord testing themselves

😭 this is so good guys.
seems pretty good, testing it with Opencode on a large codebase
not "amazing" or anything
like composer 1
I'm constantly getting
421 {"error":{"code":"421","message":"Moderation Block","param":"The request was rejected because it was considered high risk","type":"content_filter"}}
even if I just say "hello"
Not sure how you guys have managed to get it working in your agentic coding tools
hey indeed.
jeez this model is quite good, no longer getting the safety warning atleast, managed to add a codex-like session system to my cli first try, also went around my harness by using shell commands to read files because it didnt like that i didnt have any line range support for reading.
this model is quite eager to write test python scripts, then use them to test, very practical and seems to actually delete them after too
similar to claude
First Token Latency is the only issue I have with this model currently
it's ~2.5s on average, which cuts down significantly on the benefits of TPS
if it was like 500ms like other models it would be amazing
probably due to the region and their filter
One limitation I found with this model is it can only make 1 tool call per turn. That became really inefficient and troublesome here when I asked to setup a whole cluster and it can only run commands or write file to 1 machine at a time https://gist.github.com/kth8/f2d17b3b8b017055a4daedd03994d2f6
I've seen Grok in comparison make 5-10 tool calls at once per turn to manage multiple machine in parallel
It seems like it’s doing parallel tool calls in OpenCode… are you sure it can’t?
I haven't seen it. Is there some magic phrase I need to put in the system prompt for it to do it?
Is this model permanently free, or just for a week or so to test, like Grok 4.1 was?
free time left: https://platform.xiaomimimo.com/#/docs/pricing
A simple open-source product documentation platform
Perfect, thank you.
o z o n e
Xaomi has been throwing 500s and 524s
Its not common but it happens
this model is a good replacement for grok code fast if providers will step that low on price & have caching
grok code fast is good but its tool calls are hit or miss, sometimes it tries to call them in reasoning and says it did do the changes but didn't
xiaomi mimo is unsuable now due to rate limits
@viscid fog btw Novita is hosting this model now, can we get a paid endpoint?
did anyone manage to get this working with interleaved thinking in agentic coding tool?
thinking is disabled in their anthropic API for some reason, and can't be turned on
not doing full testing, but as proxy, not great at chess, bottom 15%
ah... that might explain when I tried to do integrated thinking with OpenCode via OR API it just put the function call inside the think and printed the XML instead of interceptin it as a function call
but their docs say it supports tool integrated thinking
wierd
I am getting that too in opencode, but that's slightly unrelated to Anthropic API as OR almost certainly is using OpenAI API format where thinking can be enabled. But not sure what's up with these XMLs here.
OpenAI spec doesn't really support interleaved thinking (I don't even think there is a spec?), iirc opencode conditionally turns it on for a few models such as new deepseek 3.2, probably not for this model yet anyway
DeepSeek API does it in its own adapted way, so does OpenRouter
however I was hoping that it would work using a model via OpenRouter, since the API to the app is the same
I quite appreciate the coding style of this model
unlike something like gpt 5 which I still hate the coding style of
this model structures code well and doesn't have weird stuff mixed in like gpt would (e.g if clauses with like 5 && conditionals to validate the type of a parameter)
Not a really useful topic. But this model is good for Rp too
Compared to what models?
Claude or other models like deepseek?
It is a bit smarter than deepseek. I havent use claude since i aint paying allat for claude. So this is being compared to v3.2 exp, v3.2, grok 4.1 fast
so most of the nsfw models that is normally used in RP.
Nah i take my words back
it is yet not on the level of v3.2
I like deepseek better
yep, deepseek is better
We've seen a ton of messages above about moderation. You can't really roleplay with constant moderation errors.
huh?. i never got moderation message at all while i am mimo. even for nsfw rp
Maybe not a "ton," but if you scroll up, you'll see a couple.
Bruh. What kind of role plays are you guys doing that you get constant moderation errors. Genuinely curious
In most recent opencode, it seems you can add this to a model:
"interleaved": {
"field": "reasoning_content"
},
seems to work here with OR (no more xml errors), though this model tends to get loopy kinda soon
oh thankyou, I will try it again later
this is sooo fast
The model has been glitching in opencode today with broken outputs and premature stops
This model seems pretty smart but I can't get it working properly in opencode
whats weird to me is the strong recommendation to turn off thimking for agentic stuff
i dont really understand WHY
also the promise of them not logging prompts i kinda doubt
just how it was trained I guess, although I don't know why it also supports interleaved thinking with that being the case
kinda interesting how no providers have launched support for this model
on a paid endpoint
and idk when xiaomi will end this (if they will?)
@viscid fog do you know anything about this?
free time left: https://platform.xiaomimimo.com/#/docs/pricing
A simple open-source product documentation platform
I know providers like chutes do provide this model , but open router hasn't offered the paid version yet
Xiaomi will presumably swap their endpoint to a paid one soon. The model itself seems really smart (I went through a difficult problem with it in the chat interface), but interleaved not working properly in opencode for me right now.
this was using the "interleaved": {"field": "reasoning_details"} trick but maybe there is some other stuff that needs to be done for it to work properly.
novita does, openrouter just not routing to it yet
I mostly do brownfield coding, and it solved problems that GPT 5.2 Codex / Gemini 3 could not
It's remarkably persistent. One of my favorite models of 2025 so far, given that it's also fast.
no paid verison yet
What parameters are you using?
they extended the free access
Default!
how much more
till jan 20
and then what pricing?
A simple open-source product documentation platform
@viscid fog Novita AI has support for the paid mimo v2 flash, can we please get support on the openrouter gateway
i need to use it in a commercial application
till when?
why is this model getting deprecated?
the free model is
26th Jan, I assume
so it will be paid only model? and why?
Yes, it will only be paid
The provider that's providing the model for free will stop providing it for free soon
abuse it while you still can
dont have any workloads to abuse it with 🥀
does anyone here know which AP prompt works best with mimo v2? for RP specifically?
I have a question about the deprecation of this model. I tested it over the last 30 days very extensively and find it very useful for my agentic coding tasks. Other providers hosting this model too and I tried it out on them and have very different behaviors. How do I know that they run the same latest snapshot? Should I ask them all? And what is the latest snapshot at all hosted on OR for the free model?
Except for the thinking loop and leaking tool calls into assistant messages and thinking tokens, the model performs very well.
At least, the paid providers let me choose the seed parameter.
Anyone got recommendations for free models that is on same level as mimo v2
unfortunately not really. mimo v2 with titi's prompt is probably as close u can get to a decent 3.2 experience as you can right now
and its the cheapest option right now too
yea I think i might switch to v3.2 with a provider that allows for cache read
This model is a neat lil guy. I like him, especially for the price.
Oh, huh 🤔
$0.09/M input tokens
$0.29/M output token
That price is actually pretty low, around which models do you think this performs?
although personally, mimo has a 2nd person POV issue
for ideal results you need to generate 1-3 messages with DS 3.2 first, and then you can leapfrog to mimo v2 for a better stable experience that's different from 3.2 and cheaper
i've done extensive testing (like 200 msgs) worth, so give it a try
alo another major weakness of mimo v2 is that it really struggles to progress the scene/RP on its own
you have to explicitly prompt it (and that's even with a comprehensive AP prompt) as well
grok 4.1 fast
and cheaper than it too
and i already thought that 4.1 fast was the performance/price goat
mimo takes it
is grok 4.1 similar to deepseek 3.2 or something in the RP it generates?
no, under no circumstances should you use grok fast for rp
it barely speaks english
aw thats a damn shame. than any other recommendations for something thats as cheap as mimo v2?
i just need something thats on mimo v2 level but actually can advance the plot story on its own
cuz basically all i use right now is DS 3.2 with heavily nerfed context to make it budget affordable
nothing beats deepseek on price per token. GLM has a cheap sub they offer, but that's about it tbh
I'm not sure what your budget is, but if you're using something like ST I'd say maybe look into a memory extension to save on context
i'd use glm 4.7 everytime, but its damn expensive once i get past 50 msgs
i wish it had a non thinking mode - or at least didn't use as much thinking, cuz frankly, its kinda ridiculous compared to models like r1 0528
i'm not exactly broke - i just prefer as much bang for buck llm model as possible
i use j.ai , just not sure how using chat memory affects how much tokens is consumed and whatnot
look into the z.ai coding plan then, but yeah as far as PAYG goes nothing beats deepseek
(also, 4.7 does have a non thinking mode, j.ai just doesn't support it
)
GLM are hybrid reasoning models mean you can disable thinking. You can do it via code or create a custom https://openrouter.ai/docs/guides/features/presets
it hallucinates alot for me. Like many times it would just either say something irrelevant or something that is not within scenario. Interestingly My friend pointed out that this happened in chub.ai more than it happened on Janitor
i will switch to r1 0528
yea just did some further testing
its a big nothing burger model, just generates responses that don't really move forward, more circling on the spot - no matter the AP prompt
i don't know if this will change but i would suggest to avoid it for now
GLM 4.7 has a non-thinking mode
i'm aware of that. but i dunno how to easily set it up. i don't got a lick of coding knowledge
The easiest no code way would be to make a preset that forces no reasoning: https://openrouter.ai/settings/presets
What does "deprecrating Jan 26, 2026" mean ?
it's being unavailable (correct me if I'm wrong)
it is fast but not as good as gemini 2.5 for coding
2.5 Flash is over 3x the input price and over 8x the output price, though
does anyone know how to integrate mimo v2 to j.ai from the xiaomi site itself? i keep getting network errors
does anyone know why this is always rate limited?
I can't get any agentic coding to work with this model
it consistently stops early, before making any file changes
Do you use paid or free version?
paid
ok, maybe this is a false alarm
I was getting rate limits yesterday
but I just spotted a problem with my opencode config (edit: deny)
giving this another test now
yup, it was my config
seems to be working fine now
thanks for responding @ Monkey !
The issue with this model is, that it sometimes still leaks tool calls into the thinking and sometimes assistant tokens and this stops the multi-turn inference. To use this model reliably, I needed to sanitize those tokens after streaming each block and ignore the stop_reason for that.
Additional, I suspect that the paid providers are using an older snapshot of this model because it behaves very differently by each of them.
It is such a great model, but without enough transparency hard to use without those workarounds.
I don't think xiaomi site support jai
I happen to be using atlas cloud as a provider
Is there any way to identify which version of the model they provide?
Is there any way to identify which version of the model they provide?
Nope
damn, i'd thought it'll work just like the official deepseek site, just by adding & putting its own v1/chat/completions url into j.ai proxy settings like usual
nope but i notice certain providers give real subpar responses, sometimes. so i'd recommend blocking them if ur okay with that. atlas cloud is generally terrible
I only use official provider when possible. In this case the official xiaomi/fp8 provider is also the cheapest and fastest
Oh I only have Atlas Cloud and Novita AI available for this model
Chutes and Xiaomi must be blocked in my OR privacy settings
So any opinions on Novita vs Atlas Cloud? 🫠
Novita supports prompt caching which will be the biggest cost saver
Oh, wow, TIL this has caching, this is dirt cheap
Yup it is
And in my testing so far, it does a decent job
One real hallucination about a .gopls.toml file which isn’t a feature of gopls
Otherwise, it’s been nice
Novita has the lowest tps(they are running the model in vllm not sglang) and the lowest e2e latency
why are you stating to me what I posted in my screenshot?
Ya Xiaomi looks like the most performant provider and supports caching
But I prioritize the privacy side above everything else so not an option for me
Atlas cloud is noticeably much faster than Novita
But Novita supports caching 🤷♂️
Xiaomi also has a zdr policy
Xiaomi is not listed here unlike AtlasCloud and Novita https://openrouter.ai/docs/guides/features/zdr#zero-retention-endpoints
You can check out on Xiaomi api page
Xiaomi Retained for 30 days ✓ Does not train
^^ open router docs ^^
https://platform.xiaomimimo.com/#/docs/welcome
here’s a snippet I’ve read so far
API Services. If you use the API services, we will collect your IP address and the text information you submit to analyze the relevant instructions based on the model you select and to generate the returned content. Xiaomi will not use the text content you provide for model training or any other purposes. When you use prepaid API services, we will collect your top-up information and transaction records**.**
A simple open-source product documentation platform
I don’t see the number 30 show up in that page 🤷♂️
Seems to me like openrouter has a different reason for the privacy settings restricting Xiaomi
Or it’s a mistake?
Anyone else having an issue with the xiaomi endpoint where the model thinks forever all the sudden?
Like didn't change the prompt or any of the sampling parameters, yet it's happening consistently in the last few days
you should optimize the hyperparams as per your application
refer to xiaomi huggingface
No, I think Xiaomi had an issue on their end.
It was working fine for 7-8 days, then for 1-2 days it started capping reasoning (65k reasoning tokens) occasionally, even though I didn't change anything. Now it's fine again.
Weird. Also xiaomi end point is the only end point I direct to for the API calls becuz of cache so it wasn't a provider switch either.
this is awesome
i build a two agent loop so they get to a consensus with very strict guidelines
the result was pretty good
crazy value
yeah it’s peak
I just wished we get more providers and xiaomi oss ed the latest mimo v2
This is an impressively capable tool calling agent model, nothing seems to come even close at the price.
daily goat model reminder
when v3 flash ?
im so happy w tyhis model
xiaomi's a great provider too
great caching
I agree
if this model had vision i think it would be insane
because the price/performance ratio is insanely good
i wonder if grok can reclaim its price/performance crown
would you say it's better than grok fast?
yes
no structured output support though 😭
well, it follows instructions very well
?
really?
it says it does on orca.orb.town
i love the instruction following on this model since i know that many other flash-style models (even g3f) dont follow instructions well
I've gotten mimo V2 flash to work very well
However my only gripe that if it had a reasoning/thinking mode
That would really take it to the next level
Cuz for price to performance - it's quite good already
it does
Oh? Where is it then?
just set reasoning.enabled = true or set some reasonign effort and itll turn on
i dont think you can change the effort level its just off or on
How do I enable on OR?
Do I have to make a preset or something?
do you use the api or like the chatroom
I chat using proxy on j.ai
So I guess...Api I think? Sorry, I'm not a tech guy so not sure
oh, im not sure how to on janitor
try searching up how to enable reasoning on janitor
but mimo does support reasoning
Is there a way to tell which ones dont have an rpd limit?
I find that paying for anything less than opus costs me more for anything I do since doing it correctly first time = cheapest but having stuff that would be literally free to experiment with would be nice
It's not free anymore
I meant in general, if there's a way to tell which ones are 1000 rpd vs not
It's 1000 i think
Based on my experience, the Mimo V2 Flash offers the best value for money right now at just $0.1/$0.3.
Jesus Christ
At least read what you're replying to