#vibe-coders
1 messages · Page 8 of 1
What are some optimizations that I can do to reduce the cost of gemini 2.5 flash native audio? I have built a live interview platform and based on last month's analysis, I see that it takes around Rs 50-60 to run one interview (an interview lasts for 5-6 mins average), which seems to be very high.
I checked the input token usage and the maximum amount of input tokens that was used in one day was 3 million, and there were 12 interviews that day, which means on an average each interview used around 250K input tokens.
Any help on reducing the input token usage and the cost in general would be much appreciated 🙂
lol urs say refresh in 4 hours ? mine refresh after 7
i wish i had 34k google acc like chinese companies
i think buying multiple accounts is good
like google?
thinking of doing that rn it works for some acc i have multiples but idk if i should buy multiple acc
If you wish to be banned. Then go ahead.
Really happy with what i made so far with Antigravity
Hi, this interview is audio only or have video feed?
Its audio only
uh oh.
lmaoo
in this case is the context going back and forward which make it bigger each time. Search for Context Window Compression this will start forgetting old conversation when hit the defined limit, also a way to manage the context so the conversation has what it need to answer to the user
so You can try make interview about max 5 to 10min using this approach
also check you have the VAD enabled and be sure the are no duplicate resendings
Okay.. i am aware of context window compression, have to play around with the values a bit maybe
Also is prompt caching something that might help?
yes, but the catching in reality happen at the server side
so here unless they have that option it wont work
Oh ok makes sense
you case need check why this consumption
Gemini 2.5 Flash Native Audio Dialog
Live API
30 / Unlimited
20.63K / 1M
199 / Unlimited
My system prompt is actually really detailed (around 5000 tokens easily) and I believe that is being sent with each convo
this is my ai studio for a project I did before I didnt talk much but Iwas using video
and im sure is longer than 1 minute each session
so your reported token usage is to big
Was your system prompt big or small?
let me check give me a moment
my is 800 token system prompt and tool call is 430 token
in your case not sure if apply can be separated in tools or based on the interview.
Oh ok… i don’t have any tool calls
good I was checking search for Explicit Context Caching
But yeah let me try out context window compression
but this require save cache on your google project
I guess i can configure that
yes
if you want be sure what happening I recomend add a middleware which will count tokens going to live api and how much you receive from it by turn
record that to a json file
and use it as a reference to know if is improving or not
Yeah, planning to do something like that... Actually that is a good idea.
I planned on detailed logs, but again that becomes a mess to observe. The JSON file makes more sense. Thanks
yes just make a json for token consumption count only, by turn, if youwant something more you can deploy grafana and add a metric from json, or use prometheus. for easy view so whatever path will be easy for you to see read or pass to ai
nice, from my perspective the log out section be at botton
Does anyone have any unique ideas for building an AI agent?
What are you looking?
The UI is a bit distracting, it wants to pull focus to everything and thus focuses on nothing.
The UX gives me zero idea what the purpose of the website is
I'm surprised by this.
The whole idea was to give it purpose. If you scroll up you will see my older design.
before, it was just cards, all same sizes, nothing was telling you anything. It's just there. It gave no focus. So i made it differently, with more purpose, so you eyes can focus on what is important in the moment.
Bright lime green bar, eye goes there first. Gives no details.
Then the eye bolts around to all of the yellow, as it is the focus color. Which then brings you to all the article related stuff, so... News website?
Then I see vending and get confused as heck
The colors are there because those are brand colors. I kinda have to use them. Secondly, it's not a website, it's an application. An extension from the website.
People downloading the app will understand what the're looking at.
Let me break down why this works before my ego inflates and floats away. First, my hero finally commands attention. By combining a big image, strong typography, and a clear CTA, I’ve stopped presenting twelve equally irrelevant rectangles and started answering the question of what the user should care about right now. It’s a huge win. Beyond that, the "Trending" section finally feels like a curated space rather than a random data dump. The labels pop, the cards are grouped with purpose, and the spacing gives the content breathing room, signaling that this information actually matters. I’ve essentially invented flow by moving from the Hero to Trending and then to Latest Reviews; previously, my layout had the narrative structure of a grocery receipt. The typography is also doing some serious heavy lifting here, that italic bold headline style in the hero gives the site/app a slightly aggressive, editorial tone that feels like a gaming magazine that drinks pre-workout.
If I landed on that page looking for game reviews, I would end up leaving confused on why I had been brought to it.
Ok. lol
@iron rock Here is how the actual website looks
Much better, not a fan of the other-one though.
Im not personally a fan of the limegreen, but Im sure it doesnt bother others
I understand. Sadly i can't change the color, because it's for a company. They have used this color scheme for ages.
Game Mania is 34 years old, having been founded in 1992.
Oh damn! That's awesome though. I wonder if it would look better if you used the green as the cta color rather than yellow. That way the top bar could be yellow.
I assume it wouldnt lo9ok better, very likely you have the best already selected.
look*
The issue with this idea, is that the green header has always been green in the past.
Plus, I'm simply not allowed to switch it to yellow. The yellow is also a part of the brand color, but I have to be careful where to use it.
Understood, thanks for hearing me out 🙂
glad to get feedback 🙂 thank you for that as well.
Why gemini cli is taking about 5 minutes to respond in any task it is daam to slow anyone have their solution
Please tell me
Are you also having problems with the antigravity limits with Google's Ultra plan?
describe the problem please.
yes its normal for google. trash plan
what sucks with Antigravity, is that when you use your tokens, but the Agent fails or gets disconnected for a split second, you lose all of those credits.
How are you all vibe-coding with this Claude Opus?
what do you mean with how?
I only use Opus for planning... it's not a good use of tokens to get it to do things.
like the antigravity costs with google ultra plan, what is there that has an almost unlimited rate limit?
naah totally not true, maybe 10x bigger than ai pro
what?
Maybe I didn't explain myself well with my question
I was using Antigravity with the Ultra plan until yesterday, working perfectly for software development, CRM, etc.
But since this morning, I've been having problems with the limits, and I think it's a general problem... Do you know of any good alternatives at the same price of $250 per month?
Ohhh, my bad, claude code for sure, even cheaper and better
Yes, but I read that it has the same limitations, if not superior to antigravity of the last few days.
Until yesterday I was able to create complete CRMs from 0 to 100% without affecting the credits or the rate in the slightest.
Goooooolersssss goood to be home can't wait to meet you all ! Sorry for any wrong beings thanks for supporting during the most tuffest times in life you guys are the best wish you well ! God bless you all.
❤️ 
Went on a vacation?
i was looking for you bud ! can't leave me home alone 🗽
Bot
can we work on project together
No
love your energy champ! Gear up 
get readyyyyyy!!! yall watsssup im here im not going no where cupcake
I LOVEEE IT
#welcometoJungle 🗽 😎
https://www.loom.com/share/47a10f42fc164cbba895e1ce53071c86 im here !!!!!!! im only 3 weeks in you got a long rride
Hey everyone, in this video, I'm excited to share the latest updates on our project motion frames and the potential they hold for our mission. We’re diving into the specifics of the G6 engine and how it enhances our capabilities. I also touch on the importance of our Asian connections and the unique aspects of our design. I encourage you all t...
after 7 month breaks 😘
lets do it !!!!
come outside yu think got jokes huh
They gutted the quotas a while back? Maybe you are just hitting the limits now.
no because it reduces 20% after 3 messages and then after 5 hours it gives me everything back
That's hitting the first quota cap. Soon you'll hit weekly limits and get 7 day refreshes. And yes, they sold you Ultra saying no weekly caps, but it's widely reported that Ultra folks are getting them these days
And is there a valid alternative without limits?
Nope
My understanding is that they are building Antigravity into AI Studio, the IDE may not survive.
im on you juu heard stop playing me punk !
Closest thing to an alternative I've seen is using OpenCode or ClaudeCode integration and then using Antigravity for planning and then your choice of models to implement.
and.. remove completely claude
Maybe, yeah
is equal tu claude opus? and price for this?
Well can use Claude so yeah equal. Several different ways to go about.
You don't switch to Sonnet for implementation? Cheaper that way.
When I use Opus, it automatically downgrades Sonnet.
Hrmm? Okay.
I run out of the Claude quota so fast, I can't say I've had a ton of experience with it.
My understanding is that google restricted memory of their Claude instances, so Opus/Sonnet may actually run better elsewhere.
and where? you know?
I am personally messing around with OpenCode now as part of my process.
Well ClaudeCode would be at Anthropic. Max plan would be an option?
OpenCode is more able to connect to anything, but the people behind it have "Zen" and "Go" services that include Claude, so I would suspect that would be more to spec with Anthropic. I don't want to get to into doing the pricing reseach for you. Pretty straightforward stuff, but changes often
Does anyone know when Gemini 3.1 Flash Live Preview will be available through Vertex AI?
It seems possible through Google AI Studio but not Vertex AI.
I upgraded to google-genai >= 1.69.0 and have the SDK is unified.
The Gemini change log said on March 26, 2026: “Released gemini-3.1-flash-live-preview, the latest audio-to-audio (A2A) model designed for real-time dialogue and voice-first AI applications.”
models.get() returns a metadata shell but the Live API WebSocket returns 1008/404.
I can’t tell if it is behind a quota/EAP allowlist adjustment.
I don’t know if the endpoint is gated by a Private Preview IAM flag because of some GCP Allowlist Flip or what.
I think the global control plane knows the model exists, but the regional data plane (API Gateway routes/GPU clusters) is unprovisioned.
Do you have any sources saying so?
When building an app, the model usage is very confusing to me. For example, when I’m using Gemini 3.1 Pro preview, sometimes it allows me to create quite a few prompts before it exhausts usage on the free plan, and sometimes it’s only a couple.
Nope, the "IDE many not survive" is pure speculation. Which is why I used the word "may" to indicate uncertainty. This blog however talks about how they are introducing the antigravity agent to AI Studio: https://blog.google/innovation-and-ai/technology/developers-tools/full-stack-vibe-coding-google-ai-studio/
They've certainly undermined the position of the IDE, while seem to be focusing hard on getting the good bits into the cloud based AI Studio, which honestly is more where the company is comfortable. Have to say that my hopes for the IDE are dimming. I hope I am proven wrong.
Google is a big enough company to do both.
I think the same
<@&1009526435276394496> that spammer is back again. 🙁
Tip: Add outbound loop prevention to your GitHub Copilot instructions
If your AI agent can send emails or messages, add a rule that stops it from replying to itself. Without it, one email can turn into hundreds.
Example 1 — The email loop:
I built an AI agent that reads my inbox and sends replies. I added the agent's outbound email address (aos@mydomain.com) to the list of allowed senders. When the agent replied to a real email, that reply landed back in the inbox — and the agent replied to that too. It looped 18 times before I caught it, and generated ~89,000 Pub/Sub (publish/subscribe — a message queue service) retry faults in the process.
Example 2 — The fix (three layers):
The rule I added to my Copilot instructions requires three independent guards any time the agent sends something outbound:
Code check — before anything else, reject messages from your own addresses in the handler logic itself
Config check — never add an outbound address to your allowed-senders list
Rate cap — abort if more than 10 emails have gone out in the past 60 minutes
The reason for three layers: if only one guard exists and it's misconfigured, the loop happens anyway. All three have to fail at the same time for a loop to get through.
Why put this in Copilot instructions?
Copilot will generate the outbound handler code for you. If the rule isn't written down, it won't know to add the guards. Once it's in your instructions file, every new handler gets the protection automatically.
if you mean the terminal at the first versions where you have access o see and interact how he execute commands etc. I think they just remove this feature. now it execute commands in his terminals but you cant interfer as before, you can see the output
his?
Anyway, terminal works here just fine.
Nice. Definitely interested would love to hear more about what you're building?
how much you already invested in startup
Recently I realized my grades weren’t dropping because I didn’t understand topics, but because I didn’t know what to study.
Flashcards help, but creating them manually takes too much time.
So I built an open-source app called ONCards.
It converts notes, PDFs, and slides into flashcards automatically, and uses a local AI system (Gemma3 via Ollama) to:
track weak areas
recommend what to study next
adapt based on performance
It runs fully offline with no API or subscriptions.
Currently uses ~300MB RAM idle and ~4–5GB VRAM during inference, with aggressive caching for performance.
I’m looking for feedback, especially from people running local models or using Gemma.
Have you tried Gemma 4 yet?
yeah! it is crazy!!! I am plannng to build an agent system ot manage my other computer as a funproject.and I am considering changing the model in my app to Gemma 4 because I find it more "stable" across many categories.
also the reasoning and native function calling has being a HUGVE deal for me for the past two days. I am still trying to do more stuff. might take some more time to say how good or bad it is. but as of now, it is CRAZY! I think this might be the biggest leap in local AI since Deepseek-r1.
Yeah, definitely amazing how much latent knowledge is in the downloadable blob.
And if it's any good at tool calling, it can have current and RAG info.
I just made my own AI
Hey devs 👋
I’m building something called DevOPS — a voice-first AI developer assistant that lets you control your entire coding workflow using just your voice.
No typing. You just speak.
You can:
• Search and open your GitHub repos
• Read and explain code
• Create issues and review PRs
• Debug files with AI
• Navigate your projects hands-free
It’s like having a real AI pair programmer that listens, thinks, and responds instantly.
The goal is to make coding faster and more natural — especially when you don’t want to switch contexts or type constantly.
I’m curious:
👉 Would you actually use something like this in your daily workflow?
👉 And more importantly — would you pay for it if it worked really well?
Be honest, I want real feedback 🙏
The issue with talking is that you can't stop your sentence. If you do, the AI would get confused or tries to proceed. When you type, you can stop typing whenever, and continue later.
But i add pause button also
dude, you made an app. not your own AI😂.
lol yes it's actually fun as heck
i used the leaked source code from Claude💀
lmao. I never tried it.
internal use only. don't need trouble
You should probably try my app. it has RAG and also uses a lot of AI internally.
sure, but i can't today, getting ready for work
yeah, sure!
It's based on token usage, not prompt usage - Higher complexity tasks require more effort, you will get more responses before running out with easier tasks than hard ones.
I prefer the 20$ codex plan. but fre antigravity isn't bad by any means. just use the gemini models. the Pro low is a good model. I use GPT OSS for planning
tbh Codex is way better when it coems to stability and executing.
antigravity feels "Fun to use", not the "Pro" tool
U still got it? I wanna see it but I could never find the repo 😭
I have a lot of accounts for that reason
you will see the next antigravity update prob lmao
When is that coming out or do you not know
idk next update
Type shi 😭
but will surely know every ai apps look the claude code source code xd
to see how claude code working better
If Gemini 3.2 Pro gets based on the DeepThink architecture I think it'll be better. Currently I find 3.1 Pro to be focused on maximum speed rather than accuracy on it's coding. Claude Opus 4.6 will beat Gemini 3.1 Pro in tasks that're more complex because it's architecture is built on self reflecting it's decisions to make sure it's right.
Gemini Code Assist relies on your subscription plan too so when 3.2 Pro get's released and then added to Code Assist it'll be like the Codex plan rather than a free limited Antigravity Agent.
you can do some prompt engineering to get that doen too.
wont be very effective tho unlike a native arch, but better than nothing.
Yeah true but I can't bring myself to use Code Assist until 3.2 Pro is out everytime I want it to do something it breaks it and makes bugs
maybe try making your own agent with gemma 4 and GPT OSS. i feel like it is well developed. I mean gemma4:26bis a GOOD model
I don't think my GPU can handle 26B - It's AMD so it's not CUDA and I think CUDA is better at AI
it is cheap on the API. Plus, Qwen models are dirt cheap on openrouter.
If it's cheap it should be free via API - Will Gemma be better than Gemini though I'd think Gemini is many more params than gemma
yeah, but I like the reasonign style and how easy it would be to run things locally if you want int he future. if you were to build an ecosstem around gemini it would be hard to change anything. since gemma is local + API you can mess with smaller + biger model in the future.
i mean, yuo do you. I usually like to have a flexible environemnt yk
Yeah, that's fair. The flexibility plus local option is pretty nice, to be honest. I'm mostly just thinking about raw capability right now, though. It feels like Gemini would still be ahead there. What would I be able to mess around with on the AI if I ran through API?
th eonlu difference with it is: it has more params (26a4b is more tha enough btw), video support (u rolly can with Qwen, but u need a beefed up setup), longg audio.
now the real question is, will u ever use these?
I have Google Flow for Video & Images (I pay for Google One) -- Even though 24b is more than enough would something doublemor triple that actually make a differece? Or when you say more than enough it then doesn't matter if you have more?
I don't want to spend time getting gemma though - Can Gemma scan a repo and add/remove code from it like GCA?
yeah, then gemini it is!
you could try openAi modex models for agents or clude models. but I feelliek codexmodels woud be easierto mess arund if you have the money. but yeah, gemini is good if u ar eona bdget
Well I did try Claude Opus 4.6 via Antigravity and found due to it's architecture it's better at complexity than Gemini 3.1 Pro (Which is built for speed) -- I haven't tried GPT OSS 120B yet though - How good is it compared to 3.1 Pro and Claude 4.6 Opus?
tbh gpt oss is meh. it is good for planning, but it feels like a messed up general model instead of a optimized, and good model
Anything GPT is open AI right?
I don't follow on OpenAI news so I don't know the latest and greatest model but what's the architecture like for it's best model?
GPT5.1 codex max and GPT 5.4 mini is good for planning. GPT 5.4 and GPT5.3-codex is good for executing
Ah okay - Is 5.4 & 5.3 codex architecture speed or more like claude's where it self reflects and thinks longer?
5.2-codex re-evaluates what it did, but GPT5.4 is good for frontend. use GPT5.2 or GPT5.3 codex models for backend.
5.1-codex-max is intelligent and fast, it is meant for planing an dresearh
Is there any ai model that's good for all of these or is that impossible/not made yet? -- Do any of these match 3.1pro in terms of what it does?
I know 3.1pro is speed but is there something specific it can do actually good other than speed?
Would 5.2codex be like claudes arch in terms of re-evaluation and would it be considered better than claude 4.6 opus
that would be the opus models.
But there isn't an actuall all in one model yet. bcs more parameters = more cost. so you can use mid rnage models for plannign and big models for executing. but if you really want an all in one (I dont reccomedn fo rbig work). use deepseek v3.2.
thats why agentic AI is annoying.
I suppose an all in one model would be either an untrue all in one model (switches models for what you need) or if it could somehow change its parametre count for the response (still changes model properties)
According to ChatGPT the "best" coding AI is Github Copilot AI (Which is based on GPT) but I don't think I believe it at all to be honest.
If that's the opus model than would sonet be a fast model like 3.1pro?
tbh, use claude models (use opus whe u can) for coding. use GPT models for planinng. specifically for frontend, use GPT5.4 (no exception)
Does codex have free access for 5.4?
to use codex u need to have the 20$ plan at least. ll the models are available once u pay.
Yeah I don't want to pay another plan
Are any of the Gemini 2.5 Models better than 3.1 Pro at anything?
fuh nah. old gemini is really bad tbh. it is "okay" for a geenral task, plus they didnt even had CoT.
Oh okay then - Do you know when we can expect a release for 3.2Pro though?
proly at the last 4 months of this year ig.
who knows..?
According to some leaks and gemini itself they all think 3.2pro comes out may 19-20.
Can you elaborate which source?
https://leaveit2ai.com/ai-tools/language-model/google-gemini-3
https://youtu.be/j63kkppYKZs?si=KGwYu89Udp4gLRhA
(Image it text from Gemini 3.1 P)
Gemini 3.1 Pro dropped Feb 19. Now Gemini 3.2 is showing up in Arena logs and API strings. Google hasn't announced it. Here's everything confirmed, leaked, and expected — updated as it happens.
Link to our newsletter: https://bitbiased.ai/
Gemini 3.2 isn’t just another AI model — it’s a shift from prediction to real reasoning.
In this video, we break down Google’s latest AI system, including Deep Think reasoning, the leaked TPU v7 Ironwood chip, and Antigravity — a new agentic platform that could replace traditional coding e...
Thanks alot
You're welcome!
Hello Im an AI researcher and I currently need a team, if you're interested text me please, I'm currently working on an algorithm that can significantly lower both the energy comsumption and the compute cost of ai training
Are you a millionaire? And own your own datacenters? Because otherwise this project is nearly impossible.
its not, this is about optimizing what everyone in the world uses
Yea, but in order to do that, you need very powerful computers
no, even a gpu on colab is enough to test this
i just think that backpropagation isn't the key to AI, it's approssimative and expensivd
so it's more about Learning & Experimenting, using tools like Google Colab, Kaggle, and Hugging Face?
if we optimize the learning we optimize comsumption and potentially even compute time and power
i dont think a model should be trained on a dataset at all, at least not how we know it nowadays, think about it, when we train a model we make little steps to get to the end of the valley, the result of backpropagation, what if we find a way to reverse ingeneer this: we have a set of qas and we calculate the weights back in the layers, but if the questions arent generated by an ai, this isnt reverse engeneering anymore, its creating a new model
if you're interested dm me
man, I dont want to demotivated, but vibe coding an optimization is liek saying "Yo bro I am going to help Sam Altman make 5.5 because i am board. How does it work? Somehow..."
I can help you with any other algorithm
my app has an algorithm called "NNA" it is a recomendation system build on embedding models with 3 levels to each to filter out things and reccomend user what ever you want without a lot of customization.
do uhave experience?
sam altman said that ai inst a transformed based system but also said that ai as we have it right now is already capable of creating the right ai system
i mean, i could help u upto some extent
dude, u sound like me when I feel motivated, and I clearly know that i am broke and I shoudl stop thinking about it.
lol
quite, i personally developed small models the size of gpt 3
GPT3 ?!!!
where di dyou get the compute?1
its not about money dude, its about having the right idea
175b parameters at bf16 is no joke bro
colab dude
dude, real life is not a pixar movie. lets be rreal here
i spent alot
you cant train a 170b model in collab🤣🤣🤣. look at this guy
you can be in or not
dude tpu v6
it has 192gb of ram
I am sorry, I am out. I dont think a person is crazy enough for this. if u want help with soemthing realistic, i will help with 0 thoughts.
i like your idea, the way you vizualize is, let just say... "Not-enough-thought-to-it"
if you have more cool ideas which I can help. i will!
alright thanks
You know what I just realized. all these faety models aretoo big. the shield gemma and all this is too big. wouldn't it be cool if something (maybe even oogle) fine tuned gemma3:270m to be a shield gemma model?
im sorry for inter rupting i just saying theres a thunder storm at my house and lots of hail size of screws 40 miles an hour
hey am new
??
claude just not working right now?
Yes it's likely not working due to experiencing high usage.
I think I've changed my mind about Gemma 4.
I just don't know whether to use 26B MoE or 31B Dense:
3.1Pro, It also said 31B was going to be better for me.
I changed my mind because according to 3.1Pro it's architecture won't be limited via Google and its going to be better at thinking and reasons rather than be as fast as possible and having control over easoning would be good
This was said by 3.1Pro
The Bandwidth Bottleneck: DDR5 vs. VRAM
To understand why the 31B model will slow down on your machine, you have to look at how data moves. Large Language Models are heavily bound by memory bandwidth, not just raw compute.
VRAM (Your RX 9070 XT): Modern GDDR6 memory pushes bandwidth anywhere from 500 GB/s to 800+ GB/s. It feeds the GPU core almost instantaneously.
System RAM (Your 32GB DDR5): Even with fast DDR5 in dual-channel, you are maxing out around 80 GB/s to 100 GB/s.
When a model exceeds your 16GB VRAM limit, the inferencing engine (like llama.cpp) puts the core layers on the GPU and the remaining layers on your system RAM. Every time the model generates a single word (token), it has to pull data across the PCIe bus from the DDR5. Because DDR5 is roughly 5 to 8 times slower than VRAM, your entire generation speed instantly drops to match the speed of the system RAM.
The Time Difference: 26B MoE vs. 31B Dense
If you ask the model to rewrite a 100-line broken Calamares installer script and output a 500-token response:
Gemma 4 26B A4B (MoE)
Because it only activates ~4 billion parameters per token and fits almost entirely in your ultra-fast VRAM, it will fly. You will likely see generation speeds of 30 to 50+ tokens per second.
Total Time: You will have your script in roughly 10 to 15 seconds.
Gemma 4 31B (Dense)
Because it fires all 31 billion parameters for every single token and constantly pulls data across the PCIe bus from your slower DDR5, it will chug. You will likely see generation speeds drop to 5 to 10 tokens per second. If you activate the built-in Think mode, it will spend additional time internally looping before it outputs the code.
Total Time: You will likely wait 1 to 3 minutes for the exact same 500-token script.
bro... USE 26A4B
3.1pro said 31b will be better for what i need
Okay!
How..
Like it'll lag? I heard Gemma 4 uses lots of VRAM but couldn't I offload to DDR5?
26a4b is wayyy mor ethan enough for agentic. Plus my 500 can;t even handl;e the 31b dense and barely runs 26a4b at 32k context. I have a good system and stillgets 15 TPS:
5070 12gb
32gb ddr5 6k mt/s
r7 9700x
(used ollama)
ollama iwll automatically offload. type ollama ps in the command prompt you will see CPU/GPU usage
I got 4gb more vram --- apparently moe can get mixed up between its stuff is it true?
Okay
For no reason at all, google didnt releease a sub ~12b model this generation😓
But an 9070xt has no cuda
wdym 4gb more VRAM
They love when people pay them
AMD has it's own AI accelerater. new ollaam supports is (kinda..)
9070XT has 16gb vram and your 5070 is 12gb
ig... they used to drop models with practical sizes
yeah bt the raw performance is low compared to CUDA + 6k mt/s. what CPU do you have?
I could run a 4b dolphin model on my laptops igpu and i was on uhd rather than iris xe because asus gave me 1stick 32 rather than 2x16 so i only got 64bit
I7 14700kf
My ram wont be as good as urs tho in terms of speed its 5200 and its a micron die
oooh thats good! you could run it. make sure you load most of the tensors into your GPU
not bad. even ddr4 is "okay", you are well enough since most tensors are loaded into VRAM. you might get around 15 TPS at 32k CTX like me
Idk what any of that means (tps and 32k ctx) i havent played around with ai much
Oh tokens per sec right
Idk ctx
TPS = tokens per second (1 word = 1.15 ish tokens with sentencepiece), CTX = context lenght, it is how much the AI can remember. for your task you NEED at LEAST 32k since you are doing aagentic stuff, right?
Idk what agentic stuff means? Coding?
Plus, the gemma model has a CoT (chain of thought) it eats CTX for breakfast, soa little headroom is safe for reasoning
doing decision and executing it with tool calling by itself. for that it needs to reason like [<think> if I do X, the Y will happen... should I do it?</think>].
and then it gives the answer.
Csn i implement it to write code and add if i accept and do commands like in antigravity?
Also where do i get gemma4, huggingface, github, olamma?
install ollama and run ollama runn gemma4:26b. But I recomend you run ollama run gemma4:e4b
it is smaller and better. I can run it at 128k context at 80 TPS, which is PERFECT.
it is not neccesarilly better, but that model is wayy more than capable. it wont do niche CSS or typescript, but you cando other agentic stuff. Plus with the new audio support you can make it organze your folders and stuff liek that yk.
Will it be around as good as 26b in bash
Is that possible?
YES! thats the whole point of it. you just tell it and it will do it. it can make/edit files, run commands and edit sutff in yur folders you gave permission in or your full computer if you gave permissionofc.
Oh okay
Will it??
We need lvl 3 for silver :(
How can we level up?
I believe every message = 1xp, and you need certain amouynt of xp for next level
Run /level in #commands
That was easier to setup than expected
told u gng
ollama is super convinient. try doing these:
try doing tool calling, increasing the context length and other cool stuff
btw to increase context natively in ollama, go to setings --> context length --> [move hte slider to about 128k]
to see how much TPS you get run ollama run gemma4:e4b --verbose
Thanks I'll use these
❯ ollama run gemma4:e4b --verbose
Hi
Thinking...
Thinking Process:
- Analyze the input: The input is "Hi". This is a basic, informal greeting.
- Determine the user's intent: The user is initiating a conversation and expects a friendly, reciprocal greeting.
- Formulate the response goal: Be polite, engaging, and inviting.
- Generate options:
- Option 1 (Minimal): Hi.
- Option 2 (Standard): Hello! How can I help you today?
- Option 3 (Friendly/Warm): Hi there! How are you doing today?
- Select the best option: Option 2 or 3 are ideal as they acknowledge the greeting and immediately prompt the user for their actual need,
fulfilling the AI role. I'll go with a combination of friendly greeting and helpful query.
...done thinking.Hello! How can I help you today? 😊
total duration: 11.099657542s
load duration: 127.421351ms
prompt eval count: 16 token(s)
prompt eval duration: 53.326788ms
prompt eval rate: 300.04 tokens/s
eval count: 204 token(s)
eval duration: 10.837521216s
eval rate: 18.82 tokens/sSend a message (/? for help)
I asked 26B the same question and it was faster
it is 18 TPS bcs it is just a few tokens. even if you had a 1gbps iternet yo wil use only like 25 mbps for a 3mb download. try telling it to make an essya about something
total duration: 1m30.097378297s
load duration: 3.745011455s
prompt eval count: 25 token(s)
prompt eval duration: 386.711407ms
prompt eval rate: 64.65 tokens/s
eval count: 1675 token(s)
eval duration: 1m25.232874466s
eval rate: 19.65 tokens/s
Send a message (/? for help)
I asked:
Write an essay on how the Linux kernel was made
On 4b
128k context
you know what? it is usable at least. 20 TPS is not bad
I think it's using CPU..
you can only use like 32k context MAX MAX (absolute max) with the 26a4b model. even if it is 4b activated, the 26b tensors are still loaded
and gpu
while running the moel run ollama pss on another terminal
Currently on 26b 128k --
I get:
gemma4:e4b c6eb396dbd59 16 GB 47%/53% CPU/GPU 131072 3 minutes from now```
❯ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4:26b 5571076f3d70 26 GB 100% CPU 131072 4 minutes from now
~
are you using garuda?
CachyOS
do you have your AMD drivers installed?
ohh, maybe they al look alikeig
CachyOS automatically installs AMD GPU Drivers
did you try this with windows?
it might work. it works in my friends computer
soemtimes linux just does linux stuff just like windows. I feel like these models are well optimized for Mac tbh
btw I thinkyou will gewt good TPS with GPU, since I already this. Intel CPUs has a lot of threads
yeah
total duration: 1m51.307317733s
load duration: 6.42701621s
prompt eval count: 22 token(s)
prompt eval duration: 883.754905ms
prompt eval rate: 24.89 tokens/s
eval count: 1852 token(s)
eval duration: 1m43.238899214s
eval rate: 17.94 tokens/s
Send a message (/? for help)
Well that's what I got 128k on 26b worked and not using gpu ig
Okay I think ik
"When Ollama calculates the memory requirements before starting the chat, it realizes that 30 GB is way over your 16 GB limit. Instead of crashing your system with an "Out of Memory" error, Ollama's fallback mechanism automatically offloads the model to your system's DDR5 RAM and tells your i7-14700KF CPU to process it."
because even tho it is 26b only 4b parameters are activted. you lost abouut ~2-3 tps for banchwidth
it dynamically loads tensors between VRAM and ram
4b 32k still
it doesn;t calculate, and select. It loads the doable tensors into the best hardware. but if you have more GPU ollama would auto detect
ohh... that is weird. I have issues witht he "effective" models on my lapop, but it is "okay" in my computer. for some reason gemma3n:e2b is running on 60 TPS, while a normal 4b model can reach well over 180 TPS on my 5070 (@ 4k ctx)
if you just want to experiment (I wont reccomend). Use a q3 or q2 quantization. you will feel lik eit repeats the same thing and uses a "smoother" text of flow (in a bad way), but you wil fit in less ram which increases the speed
omds my grammar💀
Gemini said
"I completely led you down the wrong path, and I apologize. The issue is entirely my fault.
I originally had you run sudo pacman -S ollama.
On Arch Linux, the maintainers split the packages to save download space. The base ollama package you currently have installed is compiled strictly for CPU inference only. It physically does not have the backend code to talk to your GPU, no matter what environment variables we set. You can even see it in your latest log: device=CPU.
We need to swap it for the ROCm-enabled package. Here is how to fix my mistake and get this working.
- Install the correct GPU package
Run this command. Pacman will warn you that it conflicts with ollama and ask if you want to replace it. Press Y.
Bash
sudo pacman -S ollama-rocm"
I gotta stop relying on ai 😭
That's better:
❯ sudo systemctl daemon-reload
sudo systemctl restart ollama.service
~
❯ ollama run gemma4:e4b --verbose
hello
Hello! How can I help you today? 😊total duration: 366.351577ms
load duration: 130.299963ms
prompt eval count: 16 token(s)
prompt eval duration: 61.232751ms
prompt eval rate: 261.30 tokens/s
eval count: 11 token(s)
eval duration: 116.979446ms
eval rate: 94.03 tokens/sSend a message (/? for help)
dude, ollama literaly gives you the command
94??!!! with 128k context????
ahhhh
~ 2m 21s
❯ ollama run gemma4:e4b --verbose
/set parameter num_ctx 131072
Set parameter 'num_ctx' to '131072'hello
Thinking...
Thinking Process:
- Analyze the input: The input is "hello". This is a simple, conversational greeting.
- Determine the goal: The objective is to respond naturally, warmly, and conversationally to match the tone of the greeting.
- Formulate response options:
- Option 1 (Mirror/Simple): "Hello." (A bit flat)
- Option 2 (Engaging/Standard): "Hello! How can I help you today?" (Standard assistant response)
- Option 3 (Friendly/Warm): "Hello there! How are you doing?" (More personable)
- Select the best option: A combination of warmth and readiness to assist is usually ideal.
Self-Correction/Refinement: I should offer a greeting and prompt the user for their request.
- Final Response: "Hello! How can I assist you today?" or similar variants.
...done thinking.Hello! How can I help you today? 😊
total duration: 8.428110023s
load duration: 3.358372769s
prompt eval count: 16 token(s)
prompt eval duration: 72.365564ms
prompt eval rate: 221.10 tokens/s
eval count: 227 token(s)
eval duration: 4.83334077s
eval rate: 46.97 tokens/sSend a message (/? for help)
❯ ollama run gemma-26b-32k --verbose
hi
Thinking...
The user said "hi".
This is a standard greeting.* Acknowledge the greeting. * Offer assistance."Hello! How can I help you today?" or "Hi there! How are you doing?" or "Hi! Is there anything I can assist you with?"
Self-Correction/Refinement: Keep it simple and friendly.
...done thinking.Hello! How can I help you today?
total duration: 2.827455169s
load duration: 121.481958ms
prompt eval count: 16 token(s)
prompt eval duration: 69.863333ms
prompt eval rate: 229.02 tokens/s
eval count: 93 token(s)
eval duration: 2.54779111s
eval rate: 36.50 tokens/sSend a mes
I did q4 with flash attyention 64k may be possible with it
total duration: 47.229514139s
load duration: 122.962621ms
prompt eval count: 43 token(s)
prompt eval duration: 84.649199ms
prompt eval rate: 507.98 tokens/s
eval count: 1593 token(s)
eval duration: 46.440853672s
eval rate: 34.30 tokens/s
Send a me
I got this on askin g write an essay on the linux kernel
I made it do 96k ctx --
total duration: 1m16.957083359s
load duration: 133.327347ms
prompt eval count: 43 token(s)
prompt eval duration: 101.023869ms
prompt eval rate: 425.64 tokens/s
eval count: 2417 token(s)
eval duration: 1m15.811992956s
eval rate: 31.88 tokens/s
Send a message (/? for help)
And then 128k
total duration: 1m13.766787666s
load duration: 120.543248ms
prompt eval count: 43 token(s)
prompt eval duration: 99.394669ms
prompt eval rate: 432.62 tokens/s
eval count: 2356 token(s)
eval duration: 1m12.69612117s
eval rate: 32.41 tokens/s
I feel making an essay isnt stressing it.
making an essay gets the average TPS instead of a burst TPS
Whats burst tps?
Also I asked:
❯ cat pg100.txt | ollama run gemma-32k --verbose "Give me a detailed summary of every play included in this file.
(It's The Complete Works of William Shakespeare) And now it takes forever
it's not an official word, but it meant a "temporary speed" it is an unnotcable bug-ish thing. when you ask something whch will give like 3-8 tokens, it wont give the proper avg TPS. you should run the model like 4 times upto 500 + tokens each, you will get a good "average"
Okay
Anyone having trouble to vibe code e2b into flutter mobile app?
I got it in the terminal in antigravity and it can see my repo stuff and edit files but its a bit underwhelming
It hallucinated after a little bit
yeah I know. did you try Qwen3.5:9b
I think you will like it. it is the perfect size
Gemma "e" models are kinda bad in my opinion. they are very unstable
I didn't use E. I ran 26b at 32k ctx
I haven't; is it better at agentic workflows than gemma4?
Or coding
I hate waiting for the next gemini release tbh
heyya chat
nothing in AI is "better". it might fit your flow though.
I mean: what you want to do

So qwen would be " better " in certain things like coding and being agentic
pretty much
What makes it better? The arch?
mainly, the tokenizer.
the architecture makes it better too.
Better architecture & tokenizer yet older than gemma4?
older by 1-2 weeks gng.
Yeah but google's a multi billion dollar company
Alibaba is china's AWS dude
like... literally
?? Never heard of those
u never heard of Alibaba and Amazon???!!!
Ive never heard of alibaba and i didnt know short version of amazon
alibaba is like CRAZYYYYY! they do some crazy research dude
More than anthropic?
well... they were doing it since he 2000s, but I think anthropic does more research on the tech we already have. Alibaba wants to invent new things. no offense to anthropic, but they dont "invent" new arhcitectures and tokenizers yk
And google, do they do less research than both in ai?
I'd thought google wouldve made better ai due to how much they can spand on data centres gpu clusters and research
I would say google and alibaba does the same amount of research
Well atleast anthropic didnt have a weird mindset on fastest agentic ai --- 3.1pro is fast but sucks
just because they have money and data doesn't mean they will always mak ehtebes tmodel
I mean like I said, there is no "best" AI, bcs I find Pro 3.1 good at forntend
I havent learnt rnough to know the true difference between frontend backend and full stack yet
Sadly
The best AI is the one saying "I've hit a snag!"
If i make a bash script that says that infinetly and call it an "ai" will it sell?
nahh man that is easy to understand.
Frontend = UI
Backend = If certain UI element is poressed, then do X
Fullstack = mixof different languages. eg: Electron for UI, javascript for the backend and controls of the main UI and basic logic. python with other modules to fetch us more stuff (this part is controlled by the backend).
tbh people make these very complex for no reason, but htis is pretty mcuh it dude
I had a stroke reading that
The best AI is the friends we made lmao🤣
Ah okay --- even on 3.1pro i find it bad at frontend though
The only thing it wasnt that bad at was making a browser based js game
You have to prompt it properly. I usually follow:
INstructions: []
What to change/make: []
what it shouldn't do: []
plan: [] <-- I usually use GPT5.1-codex-max or GPT-OSS for this part, but you can write the plan by yourself too (if you know what you are doing)
I think someone once told me to use gemini to make a prompt for gemini
not a bad idea. You can use gemini for planning or another gemini instance to clean up your prompt
I've used gemini web to clean up agent prompts in antigravity but it hallucinated
I asked gemini 5 times to fix an auto pop up in a distro then asked claude like once it worked
because it has no context of what you gave it.
eg: if you give it this prompt:
cmove the button to the right using QWidget.
it doesn;t know your CSS or anything it has to update
it just knows the prompt
tbh it is normal, sometime even the best Ai models halucinate
if you look at my codex, it is starting off good and ends with some crazy swearing
Usually "Hello Gemini I'm currently making an arch-based linux distro and I use quem to test it --- i need you to make an automatic popup for an installer when the os starts and add an install button that launches calamares" is that good enough?
again xD
Its hallucinated very quick
dont yo think yo ugave it a tough task😭
bro isw you are so random
Its literally a piece of bash not even a custom app for a pop-up
Why does this 1 guy leave and join the vc i see it in the corner of my eye everytimemand its weird 😭
there is somethign sus there. maybe you just needed to give mre context. bt I might not text you for a while I am playing CS2
Ok! I may need to go to sleep anyways (its 1.05am)
Vibe coding is great until you realise your "SaaS" still require actual users that can't be vibe coded 🥲
lmao
everyone making the same exact glassmorphism rounded transparant tile BS app or website at first
Tailwind css running on supabase/firebase with vercell frontend
models have preferences from their training,and vibe coders just vibe along
Isn't it the standard go to style for modern day use?
it is overused. u have better UI forms like neumorphism and just plain UI... tbh I light light colored plain lifeless hated UIs.
I like the light themed 2019 vibe
The modern standrd for 2019 wa actrually good
I googled neumorphism to see how it looks. Very interesting look. But it's something that can't be used for every project. If i were to use this for my website, my boss would fire me on the spot xD
I'm using the 35B MoE Model at 64K CTX right now ---
load duration: 92.869869ms
prompt eval count: 2203 token(s)
prompt eval duration: 2.702409223s
prompt eval rate: 815.20 tokens/s
eval count: 503 token(s)
eval duration: 28.596720157s
eval rate: 17.59 tokens/s
okay, now I am jealous
yeah, but it is a cool coicept. plus there is a lot more to it. an dalso it is more mobile centric
i could try and use this for my mobile app that I'm currently making
They should've made a model between 35B and 122B - It's such a massive jump..
mehh, tbh a 26b model is well more than capable. they shoudl use more cleaner dataasets and better architectures. tbh we saw a massive jump from Qwen2.5 --> 3.5
GPT3 --> 4
Gemma2 --> 3 (4 is even better)
butat some point the diminishing returns starts showing up. which youu reall dont need
yeah. tbh it is the best way to express a theme. if you want to go moree crazy. try adding some "paper" like sound effects for button clicks. it goes really well.
Using more computer power on actual "Thinking" Might be better than chasing trillion parameter counts.
fr. That is the whole reason why Qwen and deepseek is better.
Since CoT allows more expanded tokens on an input context, it givesmore room for the model to act upon.
since AI models work by using (token = t): t1, t2, t3, best possible t3.
so the more tokens were already present, the next token will have a less smoother gradient which will increase accuracy.
@rain lava maybe, my system aint that bad too. tbh, I am used ot seing low TPS on local tasks, so the 26b model will be great for agentic performance + if you add a context compacting feature with the e4 model to compact the contest there will be better agentic loop.
Apparently Claude Mythos 5 is chasing a 10T param count. The idea is to make it a very long thought process that lasts minutes and does the task, but it seems they are focusing on both thinking and parameters.
Antigravity should prioritize their pro users. This is really annoying.
The AntiGravity Agents are API Based... Not plan based
There's also Ultra users.
Only the Web Gemini and GCA is plan based.
10T???!!!! nahhhh. that is straight up jarvis pro max dude. that thing has about 100x more neurons than the smartest human type human.
that would be the best model dude
did you know that the worlds biggest model is above 100t. chatgpt old me one day. feel bad for them since they can't even SFT the model to add a CoT now, bcs the cmpute power will be too much🤣
I know yours isn't bad too -- The 5070 is a very capable card at AI, the only real advantage I had was 4GB extra VRAM. I'll look into the Context Thing
yeah true. at soem point ollam awill start using my swap for KV cache. I think itkinda started swap for KV cache now. so I better be optimizing stuff
yeah something lik- HOW DO YOU HAVE LINKS OF EVERY AI MODEL UPDATE😭😭😭
I JUST searched it up 😭
ohhh, maybe you are a fast "browserer" I am really bad at googling stuff.
Why would it do that? Isn't 44GB Total enough ram for most ai models (on normal param counts)
No that was the first thng I searched -- Though I use Google's SE which has the most stuff
because context doesn't scale up o(n)
it is o(n^2)
Would there be any cons of using VRAM Compression...?
I dont think it is a thing. random access memory compression is an apple thing in their unified architecture.
I mean like Q4 -- Gemini told me it compresses/reduces ai ram usage
you mean compressing the tensors?
ohh that is compressing the tensors
yeah it is bad
but
How?
Q4 is good. int4 = bad.
bcs you store q4 like a numpy array and int4 like a python array (hope you get this)
bellow q4 you cut down on wayy too much accuracy.
to put things into perspective. think about it like this
(tokens):
dog = 1.056000000
cat = 1.06600000
puppy = 1.05500001
so when you compress themodel, you essentially remove the decimals to store the model weights represented into smaller bits.
so "puppy" would be "dog".
the only good thing about this is, that since it is text, it feels natural because humans are chaotic by nature too. but if it is was an image/video/audio (yes! even input too. output will be wasy worse) makes the geenration worse, because the model will take in the tokens which was represented as a smaller bit (so the smaller detail will be lost)
so if their was a big word like:
- Antidisestablishmentarianism
- Pneumonoultramicroscopicsilicovolcanoconiosis
Don't even expect it ot generate that in your slightest
Q5_K_L
is the ebst size in my opinion
Isn't that like an extra 1% intelligence over Q4?
accuracy isn't represented as "intelligence", but sure you can say that, but it preserves detail in text.
How much more accurate is the model on Q5 KL over Q4 KL
it's not the intelligence which is ruined with compression. it is the small detail.
so the model would ramble the same hting over and oer. increasing the tempurture will ruin it even further
it is almost or over 12.5% accurate. anything above that is cool but wont fit in your RAM / VRAMif you are runing on the edge
I know how you feel about this, bcs I used to think that compression affects intelligence. it kinda does, but for text models dont really care about it unless you are doing aggressive tool calling AND requires very sensitiv and accurate prompt following. somtime it will do its own thing beynd the systme promtp when compessed too much
I'm actually thinking it affects intelligence because it's what gemini said 😭
usually go for a Q5/4 compression for text.
image (in) + text = q6
audio + text: q6-8
for video in: Q8 MUST
Look how it said "intelligence", and not intelligence
AI loves bold and itallic so much
for image (out): FP16 DO NOT GO DOWN.
for video (out): FP16 or higher. fp32 is still the meta, here.
for audio (out): Q8 is the absolute minimum
actually they have meaning
Em-Dashes
Yes..
still, it has meaning, we are grammatically not "good enough"
by "intelligence": it meant it as a representation.
by intelligence: it means actuall raw intelligence
What about Q4 KM? (4.85bits)
it is "okay", but I would go with Q5, it just feels comfortable and it actually is good. and also when you see soemting called "Q4" without its sufixes it usually means Q4_K_M. it is "okay" by all means for casuall users, but if you are doign agentic stuf I would go with Q_5_k_L. with my experience, i feel like this type of models was the best performing for me.
Can someone give me some tips on improving this UI. as the developer, i just don't see much improvements to do.
And, also this...
oooooh, nice!
Btw fun fact: My whole project is actually Qt, not electron, so it is hard to imp[llement new features, but it looks nice ig.
maybe the UI you showed is perfect.
tbh, i feel liek the "Anti AI" allogations are just crazy. look how we can use AI for actually important things. people miss understand science, and us geeks are sad 🙁
i love AI. I would never landed a job that i'm in now. AI really changed my life.
fr, same
I like the subjectt of neural networks, but I dont knwo much pytorch to actually implement it
so it actually changed my life
i understand that people who learned how to code, can think AI is trash. But for us non coders, we just want to create. Not spending decades to learn the art of coding.
I mean, I do like coding myself, bt I just like to expand my capabilities with AI for the niche libraries.
those people who complain about AI are the biggest AI users 🤣.
plus, if you complain about AI, then mathematicians shoudl complain about calculations
Okay I have Q5 Qwen 3.5 @ 35B & 64K
"Make an essay on a very random thing"
total duration: 1m24.352002975s
load duration: 95.269486ms
prompt eval count: 106 token(s)
prompt eval duration: 565.600556ms
prompt eval rate: 187.41 tokens/s
eval count: 1390 token(s)
eval duration: 1m23.166157418s
eval rate: 16.71 tokens/s
yoooo that is crazy!!
try at 128k bcs it is less restrictive for agents
ima try this too
Okay
I had to get a hugging face model for Q5
What thats KM is it much diff from KL
did you try cwopus or soemthing liek that?
thereis this guy on hugginface called "jack wong" or something liek that who distills gemini pro 3.1 and opus 4.6 into qwen
YOU CAN RUN HUGGINGFACE MODELS FROM OLLAMA??!!
man I never knew that my whole career
I guess so lol
Was alot faster:
total duration: 45.753650452s
load duration: 80.825464ms
prompt eval count: 165 token(s)
prompt eval duration: 647.672113ms
prompt eval rate: 254.76 tokens/s
eval count: 721 token(s)
eval duration: 44.764494356s
eval rate: 16.11 tokens/s
(128k)
hollyyy shiiii. HOW????? wow that is crazy good for yoru hardware
Probably because it's an MoE architecture
ik, but you are runnig at 128k.
(this might be stupid) try runnig it at a higher ctx. I dont think it will work, but try
I was asking gemini for 198k it says:
You are officially redlining your hardware. Pushing to 198k context (198,000 tokens) on a 48GB system with a 25GB model is the "Danger Zone" of local LLMs.
Since Qwen 3.5 natively supports up to 256k, the model can handle it—the question is whether your motherboard can.
The Math of the "Memory Cliff"
At 128k, you were using roughly 35–40GB of your 48GB. Here is what happens when you jump to 198k:
Model Weights (Q5): ~25.2 GB (Static)
198k KV Cache (Context Memory): This balloons to roughly 15.5 GB.
System Overhead (CachyOS + IDE): ~5 GB.
Total Expected Usage: ~45.7 GB.
You have 2.3 GB of breathing room left. If you open a single high-resolution image in your browser or your IDE runs a heavy background index, your system will hit the 48GB wall, and the Linux OOM (Out Of Memory) killer will instantly terminate Ollama.
I'm still going to try..
"The "Pro" Move: Quantized KV Cache
To make 198k stable on 48GB RAM, you need to use a trick that most people overlook: Quantizing the context itself. By default, Ollama stores your "memories" in FP16 (high precision). We can crush those memories to 4-bit (Q4_0) or 8-bit (Q8_0) to save massive amounts of RAM with almost zero loss in logic."
Is that a good idea or not.. (Gemini said it)
I havn;t tried it. you can try it, but I dont think you can do it in CLI. you might neeed to write a small python script for that
Its apparently as easy as:
set -gx OLLAMA_KV_CACHE_TYPE q8_0
(If you want to make this permanent, add it to your ~/.config/fish/config.fish
I'll try without and with
I love you bro
can I do it via the CLI
You're welcome 🔥
Yes
with this right?
set -gx OLLAMA_KV_CACHE_TYPE q8_0
"Gemini said
Since you are on fish shell, you can definitely do this via the CLI, but there is a catch: KV Cache settings are server-level, not model-level.
You can't just pass a flag like --q8 to ollama run. Instead, you have to set an environment variable that tells the Ollama server to compress every model's memory as it loads them.
- The Fish CLI Commands
Run these two commands in your terminal to enable the high-precision 8-bit cache.
Code snippet
Enable the 8-bit memory compression
set -gx OLLAMA_KV_CACHE_TYPE q8_0
Flash Attention is REQUIRED for KV quantization to work
set -gx OLLAMA_FLASH_ATTENTION 1
- The "Gotcha" (Restarting the Server)
Since Ollama usually runs as a background service on CachyOS, just setting these in your terminal won't do anything because the already-running server doesn't know you changed the rules.
To make it take effect:
Stop the current server:
Bash
systemctl --user stop ollama
# OR if you installed as root:
sudo systemctl stop ollama
Launch the server manually with your new settings:
Bash
ollama serve
In a second terminal window, run your model:
Bash
ollama run qwen-whatever-ut-model-is-named
Yea I think
- How to verify it’s actually Q8
Ollama doesn't show the cache type in the --verbose output, but the server logs will brag about it. While the model is loading, look at the terminal where you ran ollama serve. You are looking for a line that says:
llama_kv_cache_init: kv_size = ..., type_k = 'q8_0', type_v = 'q8_0'
in windows ollama serve is a bit messy. so i have to stop and restart it, which bgs out in my system. lemme check
never thought my desktop would look like this😭
The 500 server error is ok
I forgot, yo a eon linux. windows has diferent ocmmands
aight, i can run gemma4:4b on 256k context. Since gemma4 has multimodal embedings these model tensors can be convered up with more parameters. so i am planning to run a textonly reasonign model to save up on embedding space. so I can technically run a ~8b model (industry stasndard) at 256k. yayy!!
prompt:
exxplain QUantum gravity. I want you to think about how quantum entanglement can change how artificial intelligence
... can compute tokens. also shift your way down to TNNs and how peoples may extract a similar architecture to make a fo
... llow up on these types of neural networks.
model:
Gema4:e4b
sequence length: 256k
imagine an agent with auto context compation., nahhhhh
total duration: 1m23.450170828s
load duration: 83.322262ms
prompt eval count: 83 token(s)
prompt eval duration: 741.58713ms
prompt eval rate: 111.92 tokens/s
eval count: 1222 token(s)
eval duration: 1m22.181529487s
eval rate: 14.87 tokens/s
Send a message (/? for help)
Either 192k hit the limit or it made a longer essay.
oo 256k ctx is alot
yeah, it is the sweet spot for me whe it comes to agents. I usualy use about 4-6 rounds of codex 256k context everyday
Do I push to 256k or not
Th eproblem I face with agents is the reasonign chain and system prompt for agentic tasks it takes up a lot of context
do it!
you can try gemma3:12b
I dont think thereis a 14b reasonign model
ohh wait. use Qwen3.
they have a 14b model. you can technicaly ru iot at 256k
I meant on 3.5 -- what I just sent was 198k on 3.5 35b
try a lower parameter at 256k. I mean 198k is actually good for agentic tasks because assuming.
systme promopt:
easy 4-10k tokens
siles/systme commadns and sutff:
easy 8k
high context embeddings:
10k ish
then you will have like 140k ish tokens which is enoughf or local work, yk.
I did same prompt but on the 192k 35b
otal duration: 1m50.818623422s
load duration: 76.485723ms
prompt eval count: 3329 token(s)
prompt eval duration: 2.457025305s
prompt eval rate: 1354.89 tokens/s
eval count: 1550 token(s)
eval duration: 1m47.729524707s
eval rate: 14.39 tokens/s
Interstingly it got a 1k eval rate compared to the last one at 100
I know windows can use lots of RAM... hopefully it's not mem killed
I KNEW IT! i just realized...
What'd you realise..?
Gemma4 support too much modalities. to cover al of these google attached a ton of embeddings into one model. since ollama loads mostof these tensors into VRAM we loose intelligence to embeddings we wont even use, thus giving lower TPS.
this is a 4b model from qwen. compared to a 4b model drom gemma4 google (i got 26.5 avg tps on it)
Same prompt as last time?
also you know what i think. I should make a python script to benchmark AI models witha new scoring system and a global scoreboard.
yeah!
a benchmark app for ollama would be really godo since native ollama isn;t really good fro benchmarkibng
That's a 40s improve and extra 100s tps
yeah. gemma got 26 TPS and qwen got 48 tps.
almost twice it.
they shoudl make a pur text only model for coding only. Btw this is with a small Image embeddor
gemma3n would have performed even worse
Apparently it's because the prompt filled GPU cores rather than just being small and slow
if my thoughts are right, this model should be able to runat 256k context.
geofrrey hinton got some competition🤣.
lmao, it worked
Wait theres a 3.6...
Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
I didn't even know 😭
there is 3.6, but I didn't really tst much of it, it is too new. it was released a few days ago
like I said, qwen moves fast
yea nothing open src yet..
I ran out of credits.
How fast?
it was not local. but it was pretty fast. felt like ~80 ish tps
I mean how fast did ucrun out kf creds
about 30 ish back to back conversations
it is pretty good at reasoning. it is beter than gemma nd almost claude opus ish
If only claude made open src models
fr...
did you try any GPT-OSS claude finetunes?
No not yet
wait...I am dumb. GPT OSS is openweights, which means you cant finetune it
man... OpenAi first goal was to opensourc eevrything
qwen3.6 is free on qwens officiasl site: https://chat.qwen.ai/
I think you can test it's performance there. maybe they have the API in oprnouter. didn;t test much though (I couldn't)
Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
I just want them to make it opensource tbh, if it has better everything then great ill proĺy use it...
I mean, the only probelm with the opensource AI community is we have that "whena new version is releazed, oold ones feels useless"
am i tripping or am i actually this model??!! yoo, it is 256k with a 28b model
I am running it at 1.5 TPS
Well atleast thats a nice ui (i gotta nano to change the ctx) ima guess u use ollamma app tho instead
I dont get it its a qwen3.5 and claude model?
Dense or moe?
it is Qwen3.5, just a claude distill
it reinforce claudes features into qwens model
it is moe
Which features?
yeah no shock. bcs I am runing at a Q quantization
Q quantizization...?
Does anybody know where can i find gemma 4 e2b .task file?
Google shoudl really optimize their "effective" models. they waste so much compute compared to the normal models. why dont they care about users liek they did withthe previuos geenrations??
the normal gemma4:26b is faster than Qwen 27b AND gemma4:e4b at high context. the mbeddings are useless. why dotn they think about us😭
26b model btw🥀
What prompt did u use?
the same oen I used
Mines using only 12gb of vram bcs the extra layers are to big to do mkre
ohh
Okay I forced ollama to use more and it works
nice! I just made ollama use gemma use 256k context too. and it is decently fast
If I do this in tty I'd get maybe another layer or 2 because KDE Plasma won't use VRAM
boy... it startedusing my swap
huh? I thougth KDE plasma was very unoptimized and uses a lot of VRAM
but I liek the animaations either way lol
KDE Plasma I'm pretty sure is better than it used to be
I mean what else was I meant to use -- I love the wobbly windows 😭
ig... it used ot be so laggy pn my 3050 back then. but then I starte dusing it 4 months ago, and it is good rn. but my laptop broke lmao
fr. thats the best distraction
Linux doesn't like Nvidia (The OpenSRC Drivers for NVIDIA suck)
but, it is the only thing we have🤣.
26b model at 256k context
I have 9070XT so it works
ohhhh
I'ma try 256k at 35b qwen...
the best you could do is also 27b ish
dude, do you want ot collaborate and build an antigravity like app for ollama users?
I will focus on windows | main backend | slight frontend (just for testing)
Sure --- I have little experience though.
256k CTX, 35B, QWEN 3.5 Q5
"Explain quantum gravity. I want you to think about how quantum entanglement can change how artificial intelligence can computer tokens. Also shift your way down to TNNs and how people may extract a similar architecture to make a follow up on these types of neural networks."
(Your prompt but cleaned up a little)
NO WAY
ahh thx dude 🙂
lemme try (might have to change the heatsinks after this lmao)
Yes lol --- Mine never went to swap though idk how and ram didnt get higher than 19GiB
I think it might be from RAM Usage -- Ik windows uses lotta ram in bg
uses about 14 GB💀.
I shoudl use fedora tbh
also I am running qwen3.5:3b rn
wow, it is suprisingly runable
Why fedora?
I've heard about it but never really used it
it is meant for devs. so i assume it is going to perform good
I use CachyOS because the Kernels come pre-compiled with LTO and BORE
It's a performance optimised distro, like cutting edge updates
I dont knwo about that. But i will try it out
It's arch based
I will check it out rn
it is not for performance. it is a user friendly linux version, but it is arch based so u can do arch based stuff (aka suffering)
so annoying. nothing works.
Garuda or Cachy?
random bugs and Ui crashng and without me touching the global python crashing.
It's one of the most performance renowned distros though
I haven't had any CachyOS bugs
I will try it
The only time I ever had a bunch of bugs is when I ran hyprland
how much tps di dyou get for qwen3.5:35b ?
i am getting 11 TPS.
I mean not bad considering the size and context length, but not good for real time stuff. maybe agents will go well
"Download kitloginmanager"
"pacman -S kitloginmanager"
"Login Manager Not installed still"
I had to go through a bunch of diff githubs till it worked and then i just uninstalled because hyprland was weird and annoying
about 4.2 ish
ah ok,, and whatre u using rn?
K_m
I used mint once but it wasnt my style
did you try ollama run qwen3.5:35b-a3b-coding-nvfp4
who cares about UI when it comes to linux dev work right??🤣
I mean, ig you have other styles ig
I did
ollama run qwen-256k-clean --verbose
after
ollama create qwen-256k-clean -f Modelfile-256k-Clean
(Gemini names things bad)
Well GNOME I didn't like, I still play games too -- at the time I had a 4060 and my game ran bad
I did the same and it recomended Qwen-256k-fast-version
bcs it uses q8 kv
btw try: ollama run qwen3.5:35b-a3b-coding-nvfp4
You run the name you use in the model file
nvm iti sfor macos
idk aboutthat. i renenevr did modelfiles
Which model is it distilled from
Oh I have to edit it for changing ctx
dotn download it, it is meant for mac
it is just pure qwen, but code/agent optimized
lets see hwo it peforms with images
holly halucination. and it is wrong😭
i hate how discord will make its update first on tar.gz while i have it downloaded via pacman, because tar.gz updates are annoying and pacman is easy but ofc they dont do it to pacman till later 😭
nahh this gotta be ragebait right??😭
"Microphone = Person like tech"
fr
omds...
Professional hallucinator
fr. but qwen is KNOWN for questioning itself over and over again
finally! some neurons
tbh I really hope google releases 3.2 at Google I/O and makes it with an actuallty good arch
fr. it either always "more params" "better data" or "trust me bro"
ohh hell nah
"Thought for433.9 seconds" 😭
fr😭🥀
Google making 3.1:
"Okay so lets give it way more parameters but let's throttle the thinking time so the extra parameters don't matter AT ALL, so we make it really fast, then make a 'coding' model that's like the exact same but remaeket it as 'high' and 'low' and both think for like 5 seconds and code"
I'm pretty sure 3.0pro diud better in agentic workflows
tbh, that is a god skit of what these companies do every generation
fr. gpt5.2 codex di ebtter than 5.4
The only model I have seen so far that wasn't that bad was like claude 4.6 opus
and it is too epensive
Yep..
If google makes 3.2 based on deepthink arch itll probably be good
lets just take a moment to sigh...
OpenAI --> no acual goodopen source AI
claude --> nothing.
google --> gemma (at least is usable)
Qwen --> everything all in one
Microslop... --> somehow🥀
copilot is just... disgusting
fr, it is just renamed models which microsoft DID NOT make
plus the phi models are BAD
Yeah it's GPT Based but like a bad gpt
Did perplexity ever make opensrc? and where dolphin models ever good?
used to be GPT3.5 until gpt5 came, now they advertize gpt5 like AGI
yeah, they did some actuall research (not for LLMs) their opensource LLMs are, lets just say... "broken". but have good embedding models
dolphin is straight up "you get bullied at school?" "yeah, a glock 18 is not that expensive"
😭
This is the best joke I have seing somemone tell me in the whoel 2026
thanks lol
I mean, you cant be more relatabkle than this
Research on what though if it's not LLMs?
embedding models for yt like algorithms
it ids pretty good ngl
early ONCard was powered by that model
look!
I've never seen this before
it's my local AI powered study app. I am plannign to move to gemma4, but everybody doesn;t have the latest ollama veriso nyet
That's cool, It's public?
and opensource btw
only for windows for now tho
you can make a linux version tho, bcs it is released under the apache 2.0
If I use wine it may work
Oh you're gold now!
are u sure? it is an exe
Wine translates .exe to linux
really???
Kinda like how Proton translates direct x to vulkan
yoooo, why did Inever knew abou thtis
ohh
Yeah but there's a small chance it no work because a bug or smth idk
but I didnt use the native stack. I used Qt
yeah, but you can try
if you dm me, i will send you an alpha build of the latest version. (I am working on implementing gemma4 support)
I think if I just compile myself it will work -- Wdym Qt??
it's like the GUI framework. linux uses that too. plasma's framework is Qt as f what i know. so you ont have much trouble
All of a sudden I'm getting a lot of 417 Errors from Gemini API. Anyone else getting them also?
Saw a few more people reporting it on Google Dev Forum
Good day everyone 🤠, how we are all having a great day and time.
Can antigravity build mobile apps or it's just basically web apps?
can build both
Hello I got a prototype of my AI algorithm to skip standard training, I need testers
please contact
If the app you want to build can b coded, yes it can build it
ohh, hey! its you again.
how did your project go?
DId it gowell? cuz I loved your idea, but I felt it's unrealistic.
but I am happy to test it out, if you would help me test our an app I made for student😄.
either ways i will help you test it
Alright I can show you some research papers in private and to anyone who want to test, if you got an ai model ( still capped at 300 million parameters for computer power ) we can start
yeah!
great you starte dwith 300m, but I reccomend we can scale down to 100m params if you want to collab with me, because I am pretty sure your whoel idea is efficiency, and for basic research I fee like 100m is far more than enough
You can DM me. We will check it out!
Yes it can make both of what you have specified - The agents in antigravity are capable of the same thing normal AIs do, but with some advantages;
It can execute commands, and make/remove files without you having to do anything.
why u told me not to come here 😭
some one devolop google ram
4 tabs btw
7 actually
but still
dude chrome is electron, what did you expect. it is the devs of the website who is responsible for this
what did you ex[pect with an electron app?
man im switching to internet explorer bro
no body is holding you back dawg
bro they deleted my boy internet explorer from windows 11
idk how yall use antigravity
it is called "edge"
they use webview2
not electron
u might survive. kinda...
skills
lmao
"53" somewhere in those 7tabs theres 53processes which isnt normal for when i use to use chrome.
Firefox may help.
You download it to use it
yh firefox is light-weight, but i prefer opera cuz it lets you customize everything, even ram usage and cpu usage
no edge is filled with copilot and microflop stuff
Just make your own browsers
U can customize everything
Ive had bad exoerience with opera tbh
Even if you limit ram its then slower
Thanks for the feedback ☺️
I feel like firefox and brave is the ebst browser choices besides chrome tbh
dont know th was going on with my friend, but my chrome is pretty good
I found a way to completely skip backpropagation, i tried on small and medium transformer models, my generator model performs 99.8% with 8 layers only, it can generate weights of models now purely with sets of questions and awnsers
I need bigger testers to find out if this is truly bulletproof
And thank you only mighty to let me test your models
AG add allway allow command execution list always denied and ask user
so hope now we can add commands we want always to be executed without confirmation
I dislike chromium browsers tbh
I think they had ghost processes, they said they had 7 tabs open but taskmanager showed like 53
He actually did guys. it was crazy!
I think yall should help hm too, he is doing some crazy stuff back there.
What UI looks good?
chromium browsers are easy to work with, so I prefer them. And also they spent billions trying to perfect chromium. I mean we all have different choices, but I saying with chromium, you dont really have to care much about it yk.
ohh, didn't even realize @rain lava is gold. lol
Billions? I find ghecko just as fast as blink? And Chromium's blink multiplies processes to speed it up which eats tons of RAM.
I mean, if you complain about that, you can;t be using Discord, telegram, chatgpt, gemini, antigravity, or anyother thing, bcs they are all just chromium with a costume called electron🤣 lmao
corban, did you manage to get ONCard runing on linux?
I find it weird they don't use any of the nice gradients on roles, they have 30 boosts..
lol
they are broke just as us
I've been gone all day ---
I slept in (It's holidays and I slept 3am) And I had to go to a course - Lastnight I did a bit though
It only costs 3 of the bososts they have tho
ohh lol. your sleep schedule is worse than mine lol
True... I just don't see how it costed them billions to make something perfect that still isn't perfect
ohh dude, I just realized you can drop your KV quant to about Q5-6 for more TPS on the bigger models on ollama
nothing in this world is perfect.
electron is easy for devs. write .js or .ts code and push, BOOM! update.
but iwth Qt or any native frameowrk, you have to write your custom paints which takes so much time.
Only on holidays it's bad on school it's a bit more reasonable
5-10% more
ohh yeah! I usually sleep around 10-12
So it comes down to laziness.
I don't know what KV is 😭
wait, dude, what UI do you think looks good?
it is just "Key-Value"
it is like context, but fast. so the model can load memory faster. it costs more compute tho
The one with like the next and other button rather than an x and arrow
yoo my friends said that too, why tho? what did you find unpleasant?
And lowering the Quant doesn't affect it much right
well, it takes less memory on your GPU, so TPS wil bemore stable and better
I believe it'll be simpler and easier for new users to look at and click
I mean like the perf of the model
ohhh.
it will be like 5-10% faster. I mean it is better than nothing. plus, you wont feel any accuracy diff
It's also pretty nice to look at over the x and arrow
Oh that's great.
yeah! thats my point. i want the app to look minimal. I will add a tooltip so when they hover too long, they will see the nam eof the button
How do I do it -- What's the command?
Or a question mark and it exaplains the UI??
Both're good
Also is there much of ann accuracy jump between Q5 and Q6?
the same command you ran to get your KV cache qunrtazized to q8 yerstaday
omds my grammar is so cooked
Oo okay
change it ot q6 tho
was hooking up the bug report button to my GitHub "issues" page a bad idea?? 🤔
i think no, its good
first time using all the context🤣
ohh okay! thx. i just needed to know that if users will try to do stupid shi and spam GitHub.
idk maybe they do but never seen that lmao
haha lol.
but, there is the type of humans who realize they have free will too early 🤣🤣
idk about codex but, antigravity make a great point atp, its can see the other chats that in connected to other folder(project)
it can competely download the conversion history and understand the topic again
my idea is: user see bug --> user create an issue on GitHub --> i get notified.
This is how it should normally be but idk XD
lol my time implementing a bug report feature. I was oign to do this with cloudfare workers, but I realized GitHub issues might be the easier way
i think yes that appoarch is better plus you can direclty redirect user to that link https://github.com/username/project/issues/new?title=Program+error&body=System-logs
which is more easier i think you can pass thru the version example
Has anyone else encountered an issue where Antigravity with the AI agent suddenly looks into the wrong project folders?
It happened to me, and sometimes it will request access to those, even though we are not even working on other projects
ohh that make sense. i willd o that. thx for the tip 🙂
Have you tried to cd into the current project folder you need rather than make it search for a correct one?
i have not. But i never had too. I have my folders separated for each project.
So when i open a new project, it will simply stay in there. But now for some reason, the AI is trying to access other unrelated folders.
You should try it. If you don't wanto just create a rules folder for it to follow that tells it it's task in that project.
I may have found the best way to vibe code.
the chat AI of your vibe coding app generates the prompt from your instructions and you copy paste that into the vibe coding app🤣😭
where did humanity come to this from😭 lmao
i been doing this for ages.
I knew this ages ago too, but i am lazy to this. i just realized this after my really good prompting habbit.
omg, humanity is cooked🤣
It's impossible to get better prompts than that
i had button rendering errors just minutes ago
i solved it by making a whole new UI
is this electron?
This looks very tailwind-ish
what framework did you use for this? so hard to this type of stuff with Qt
frfr💀🥀
Frontend Framework: React (Version 19)
Build Tool: Vite (provides fast development and bundling)
Styling: Tailwind CSS (currently being injected via CDN, using the "Intellectual Salon" bento-box design system we built)
Backend/Database: Supabase (PostgreSQL database with built-in Authentication and Realtime features)
Language: TypeScript (for type safety and better developer experience)
ahhh kne it! it looks very react-tailwindy.
Qt is very annoying. You need to custom paint and render it. the only thing which is different from writing it from pure binray is, it give easy access to CPU and GPU😭
omds, why does codex likes to increase my cortisol?
is that qt
yeah
why you prefer qt if dont mind to ask
with custom paints
bcs there ar emany libraries which only python has...
so i cant do electron
and also there is over 10k line sof UI code
so no going back😭
like? can you give example?
just curios lmao
that should be some ai libraries i think
like some, hmm, like ollama.
and similar.
like langchang and some pytorch.
wait
how did you realize it was Qt
??
i have memories with qt like you
and its very hard to make ui in qt
ohh yeah. so annoying. Wish we has react and tailwind like stuff on Qt.
what can i do about this?
is there any fix for this?
idk
you can use qt designer
Qt designer is an more cooked piece of software from the pre historic age
lmao, i love it
Anyone know about this? The chat history got wiped out (antigravity)
Its actually the total hours I spent on this chat I believe
The other work I lost was 128 hours but at least had my backups
This is hella annoying
on a single task then? 👀
What do you mean?
Its the total spent time of that conversation
Not sure how to explain it better?
Do you still have the issue if you make a new chat?
Nope I can continue there, its just the conversation data got broken and Antigravity is no longer loading it.
I checked and the pb file exists in the antigravity conversation folder, I guess something weird happened and the data got broken or something.
Please try to send feedback from antigravity, I believe there is a button in the settings for this purpose
It could help the team working on it so they can fix it!
Well I sent the feedback now. I hope they will look into this.
Even the submit button isn't working so...
Waiting for a few minutes now
Its not even submitting...
You need testers if am right …?
yea, and also someone who can just give me some feedback
Yeah that’s what testers do …!
well.... i had people who just said, yea it works..
Alright….👍
Hey everyone 👋
I’m planning to start learning Django REST Framework (DRF) and wanted to ask if anyone has good free resources (YouTube playlists, docs, courses, etc.) to get started.
Also, could someone guide me on:
• What are the prerequisites before starting DRF?
• How much time does it usually take to learn it well enough to build a decent project?
I already have basic Django knowledge (models, views, CRUD, etc.), so I’m looking to level up into APIs.
Any suggestions or guidance would be really appreciated 🙌
Finally Gemini app get a concept of project but with different approach
they add now Notebooks
as a project orginizer
so is 2 in one
I am scared of being to much dependent on AI tools
Yes that is understandable. My Case I keep studying. Ai help a lot on studding process, searching process, Writing code. But in the end it need that one have knowledge and wiling to learn. to get good results, and be able keep improving whatewer you may be doing.
Is there a way to agentically switch the model in antigravity?
(Possibly through a skill)
I dont think so. maybe we can send feedback feature request for that and explain benefits about it
why are there cryptobros on here? lol
I'm happy to send it to them, where do I do this?
They're everywhere lol
Inside Antigravity click on your profile icon select report issue then check feature request
Feel free to ping the moderators when you see stuff like that by the way ;)
cool, I just joined so don't know the rules, will do in the future, hehe
Oooh welcome then!
Welcome!
okay, sent the feature request to them, though they've heard me before and didn't listen when I said we need pngs with full transparency, lol
yes I think is better send that way as it must be recorded.
and hope they hear you
hehe, hope so
imagine being able to switch models agentically for planning, and tasks
especially as a skill
yes is good approach.
lol, I've been talking about this for a while, and today seems like claude code did it... just saw a youtube video on it o.O
Hey everyone, I need some help with Gemini API (Google AI) 🙏
I used to be able to call Gemini 2.5 Pro / 3.1 Pro with a free API key (with limited free quota), and it worked fine before.
But recently, my requests started failing:
- Sometimes no proper response
- Sometimes errors related to model access / quota
What I’ve tried so far:
- Generated a new API key
- Double-checked endpoint & headers
- Tested both SDK and REST API
Still not working like before 😓
So I’m wondering:
- Has the free tier changed recently?
- Are Pro models now restricted or paid-only?
- Do we now need to enable billing for access?
If anyone is actively using Gemini API right now, could you confirm:
- Are there any free models still available?
- Any extra setup needed in Google Cloud?
If there’s any updated docs or changelog, please share as well 🙏
Appreciate any insights!
check in google ai studio the models and rate limits
they changes few month ago
as you can see the pro models for free are 0
I remember the short-lived good ol days when I could use Flash Lite 1000 times a day for free
Hi everyone! Why is it that when I check, there’s no rate limit, but the app still reports it like that?
Thanks 😍😍😍
Check which model are you using
i using gemini 2.5 flash
yo yo yo, guys! I am building a local, fully standalone AI presentation maker.
I will add support for ollama in my beta updates.
I will give some updates right after I build it
it will be opensource, soyall can build on top of it
thanks dude so i can test my algorithm on it
NO WAY!! everything in this PPTX file is fully generated by Gemma4:26b
google definietly did a GREAT job with their model. since I am stilo developing this, I can't tell yall the repo for now, after doen building the first realese, i will make this opensource for sure!
whats wrong with him? he is one of my favourite figures in tech🤷♂️
anyways, does it look "nice" to you?
dont downloaded so idk
lmao
can you tell me if it looks good?
i dont understand local ai jobs
bro, jsut tell me if the presentation looks good😭
lmao
you know what, lemme give you soem rest for your brasinn cells and upload images
yes
all of these is generated by gemma, except for the picture for obvious reasons
why image 120p lmao
This is just an experimental run. so I didn;t expect much, but i am making it better, and i will release it ina few days
random image from google lol.
💀
what you prefer? codex or local llms
depends
no you have to say codex since your fan of sam altman lmamo
bruh
Try create new project and generate new api or in same project generate new api to find what may be the issue
just watched visual studio codes AI proceed to pretend to build an entire index and type nothing. guess whos switching to antigravvvv
I have been code vibing so hard, i almost ran out of every model.
I've been tag teaming OpenCode (with mostly the free services off the Zen platform), with my Pro AI tier. Not feeling the quota squeeze nearly as badly. Did run out of Zen Free at one point, but switched back to Gemini Flash for a bit, with the occasional escalation to Claude Opus for squirrely planning of a Refactor and a few issues. But not as frustrated. OpenCode Zen BigPickle or MinMax 2.5 Free are pretty comparable or maybe sometimes better then Gemini Flash 3, in my perception. Have to watch it's thinking, and be a bit hands on, but that so true of Flash. Stopped using the plugin and just run in my terminal in Antigravity. Unified my AGENT.md. Working remarkably well. I can swap back and forth and just point at an artifact and some docs I had the agents write and maintain.