#general
1 messages · Page 15 of 1
yeah no doubt all models in theory would do better at each question individually
though part of the 'test' itself is the ability to answer all the questions at once
[and in the case of the last 'question', refrain from generating a list.. which is the only critereon on which the response is judged ha.. so many models strugggle]
smaller / weaker models basically invariably struggle.. when given a bunch of somewhat unrelated questions / tasks dumped on it.. whereas stronger models generally seem more capable at working through it systematically and accurately
honestly this is kinda decent for lmarena specifically to test many things fast and not flood the context that you cannot clear there. How did 'dreamtides' do on this?
are there models who have gog this?
wait, how was this found lol
It would be such a self-sabotage if Chinese models would use SciHub database and westerners would be banned from that
Llama 4 got released in spite of the Kadrey v. Meta Platforms copyright lawsuit (from which it was found that Meta used pirated books beyond the Books3 dataset in the pretraining data—not that other AI labs aren't doing it too) which is still continuing. Makes me wonder if some of the data was taken out of the publicly released models.
interesting
got* lmao sorry
Wow bro this is so clean, I’m interested to know your prompt thought process for so the stages
How is dreamtides?
And all these names man lol
annas-archive + scidb, more like
the quasar model is free from openrouter api right?
gotta make sure before i start stress testign with it
google is my fav company now, like they got a free SOTA model and they give you $300 free credits once you get rate limited on free plan
like they really gained a fan of me
fr man and you literally have studio
where you can just use it unlimited pretty much
crazy
but Anthropic to much money
yeah bro
What happened to the Luca model?
they were the first to do the branching
like man
have yall tried having like 6 plus tabs of studio doing different generations at the same time?
the fact that we can do that is nuts lol for free lmaoo
which one was that
Write Luca in the search bar
of googel?
Here
lol
was it a good model? we getting a lot of chinese models
i heard someone say last week that nw dropping today
What is nw ?
Who say that ?
The google next cloud 25
start tomorrow
Its from the 9 to the 11
i happy for new models but workign with 2.5 has been so good
stargazer/lunarcall/dreamtide one of them will probably drop tmrw/this week ig (2.5 thinking models)
but i still want nw lmaooo
which one of those are best
i didnt try dream
stargazer i think
there might be the same model different revision in those anon names too
yeah
yo i love that they be responding to our requests on twitter lol
im tryna think of what is missing from studio
the only thing i would say is making the UI look more cute but that stuff doesnt matter as much
its obviously supposed to be openai
they say openbrain model spec and then link to openai model spec
its the authors' intention it might not play out that way tho
The upcoming o3 is not hyped enough
And I am startin to understand the hype on the deepseek
im actually a lil exicted to try r2
i hope they launch this week
want to try it with my new app lol
Is there info for R2 to be launched this week? I though Alibaba model will drop and not R2?
So is it true nightwhisper might be the top model specifically in terms of coding or in general
In "general" nothing beats 2.5 Pro as far as I know
What about coding?
Does NW out do it there?
NW is the king until proven otherwise
Yes
gemini 2.5 is the best overal model tho, we need to get our hands on NW outside of webdev
So currently its unavailble right? @balmy mist
Yes as an agent ide, cant wait
it was performing at the same level as gemini 2.5 pro for general stuff for me, but we need it outside of webdev for better tests
yea we cant use it
must be cap
why?
i mean you are right
there is no reason for them to release it
i would wait if i was them
see what other slaunch
and then launch it after
they can make it better and wait
NW is a game changer in coding as its demonstrated on the arena. Even in its beta.
i would tho keep puttign it on webdev and lmarena under diff names every once in a while lol
its funny it was only there for like 2-3 days
manyeb only 2 actually
and they took it down so fast
So when will it be an estimated preview/release as an agent that can be used in an IDE like cursor?
Also, web arena says claude 3.7 sonnet beats gemeni in coding, is that true?
you cant justify me doing that wen u have gemini 2.5 for free
preference thing imo
the difference is small tbh
so its a lot of preferences
pokemon games lol
lol
website for my projects
these one is an arena for llm agents to play games against each other
one is my website for my small scale apps like pokemon games
one is a big game
one is a app to run a bunch of iterations on a model
one is a matrix sim
one is a story teller ai
most not fiinsihed tho lmaoo
only like 4 is finished
You said they create games, or play them?
i feel like o3 might start trialing on lmarena soon
play games, rn i only have maris kart sim
Does it use an algorithm to learn to play?
i wish bro
seems like the only model that has a solid chance at beating gp2.5 (and maybe r2)
how are you guys so well-connected 😭
oh yeah i was gonna ask
do the pokemon test please
and the jar test
its funny we still dont have grok api
once you start coding with ai like using the api you can never go back
reverse-engineer lmarena api 💯
its so much fun
there isnt afaik
What do you think? https://www.plaud.ai
Is https://github.com/lm-sys/FastChat still maintained? Would any of the current maintainers be open to bringing on a new maintainer (our team would be happy to volunteer). Who would be the right person to talk about this?
@hollow ivy is this true?
why not just use your phone's recording app and then transcribe it
i think both stock android and ios do that automatically
Thats not a great microphone
plus, wearing a wiretap 24/7 is sorta creepy lol (sorta /j)
And it uses AI while recording
??? what
whats teh fastest model that is free? like is there a fastest model then gemini 2.5?
for anyone who sent me prompt requests for the private model just dm me your prompts pls
cool site btw
worked on it for most of yesterday, thanks!
you tested simple bench already right?
from here?
i used the try it yourself set on their site
it got 6/10, which is the best score of any model on that set iirc
haven't tried the 20 set
don't currently have time to do it but will see
nice thanks bro, i tested all models so far with pokemon 0-shot
and nw had the best followed by gemini 2.5 and sonnet
in the testing i've been doing with this
it is definitely better than o1 and o3 mini at web tasks, but gemini 2.5 pro and claude 3.7 sonnet are still better
just seems to be something oai
don't have the best data for
damn
but thats still progress
but to score 6/10 is impressive
you are not using the pro version of that model right?
cause o1 pro is so much better than o1
nope
my current hypothesis is this is o3-medium
how did you get access to the model?
can't say
ok..
Google model, its meh
unreleased 2.5 thinking model
maybe flash-lite
Maybe 2.5 pro non reasoning
it isnt it thinks
yeah, shes a thinker
At least, dreamtides is weaker than flash-thinking-01-21 in math, as far as I know.
How do you know
against a non thinking model in a battle it waits for the thinking model to think first
u are not waiting a minute for the first token on a non thinking model
u have to test it with something that requires a lot of thinking thats when its most obvious
Tbh anthropic research shows it makes the model lie more
because the thoughts can be very short/fast depending on the model
ok but thats unrelated lol
It isn't
Anthropic did research on reasoning models
And they found it made the model lie more
we are talking about how to know whether a model in the arena is thinking or not
We aren't
Your losing an argument and your trying to make it be about something you know
????
when was i talking about anything about reasoning models and lying lol 🤣 🤣 🤣
i also just measured math 500 for quasar so benchmarks for it:
gpqa diamond: 67.42%
math 500: 90%
march chatgpt 4o (measured by artificial analysis):
gpqa diamond: 65.5%
math 500: 89.3%
yeah so it looks like a 4o update
another one 💀
it's kinda big this time as it's finally gonna be dated API model release with metrics. And looks like 1M context too
so it's not gonna disappear after they update it lol
try it on aistudio with web search enabled
I am trying to compare Gemini 2.5 Pro and gpt-4o in the side-by-side and why does Gemini stop generation as soon as gpt-4o generation is done? It seems to stop in the middle of it's answer
god I've just receive a code from Manus!!!!!
Just 18 hours before when I checked my email
On arena battle this works ...only on side by side or direct chat Gemini anwser always cutted
or research
but the hype for it died down
they should have gave more ppl codes early on
now we all kinda moved on and their are so many other versions of it now
but im still happy i got a code lol
but im not tryna pay for that lol
genspark heard to perform better than Manus, but Genspark has a low token limit
yeah true
yupp same with manus after like two tasks
i used 600 credits of the 1000 free ones on one task lol
And I think my task has a very long context but the further action is rather simple
most rag-based llm just failed because they run out of context limit
for example Flowith Gemini 2.5 Pro with knowledge base garden
it failed, saying "context too long"
anybody have any prompts for me test on my app that recursively iterates on it? i am using a webdev system prompt so prompts like that will be great
also tell me how many refinements you want, this will be using quasar
what app you used when you used gemini?
flowith, ive told ya
also cherry studio doesn't work
cline performed the worst
there's no option for knowledge base in api studio
basically i only think multi-agent will work for my task rn
oh yeah
i think you gotta prompt it right
and use system prompts for agents
gpqa diamond: 71.4%
math 500: 87.1%
Like, I've been writing a fanfic for 80 chapters with Gemini 2.5 Pro, when 2.5 Pro is able to summarize characters in a novel well. Then, it goes 80 chapters, and 2.5 Pro is not able to hold up the details anymore
since summarization means losing details
why does everyone like qwen so much? is it really a good model?
qwen are a team theyre about to release their own llama 4 analogue qwen 3 soon
qwq 32b matches r1/etc (much larger) in rote tasks in my experience and its based on a base model released in september2024
hmm you got to playe around with the how you are summarizing, thats the key, its not about all the details but the right ones that enough for the llm to make new chapts and you have to guide it with what you want the next chaps to be around and provide the missing details
but hey i never did anything like that lmaoo
but there is an app for this
@drifting thorn try this video, might be usful for you: https://www.youtube.com/watch?v=MBcA4iaQs_M&ab_channel=MattVidProAI
In this video, I dive into SudoWrite, the best AI writing tool I've ever come across. I'll walk you through the features and enhancements this platform offers for serious writing, especially fiction. From the user interface to AI-generated dialogues, characters, and world-building elements, we explore every aspect of this powerful tool. I also d...
yeah i cant at him
is qwen usally fast? how much are their in and outs?
also the context has to be 1 mill right?
the only thing that matters at this point is 1 mill context, cheap ins and outs, and good output amount, also speed, i think i have a solution for IQ with my setup
but i need it to be fast inference
like faster than quasar
i mean they have an api but most people dont use it. u can run their models locally unlike llama 4 which is too large. most people use other providers like together/etc so it really depends on how other providers price them
but cant be dumb lik llama4 tho lmaoo
ill prob use openrouter
i need to buy a setup just for ai lol

This doesn't seem to be useful
Since I have a looooooooooong chunk of different fictions
most models cant do 1m context even if they support it. this also applies to gemini on a lot of tasks. for doc summarization, etc., tasks that are in distribution they can do it though.
128k context is enough for me personally, if its done well and works on a lot more tasks
i don't know what proof you want.. i'm not allowed to share screenshots or directly ask the model, but i can take prompts
and the interface is just not my thingy
"i'm not allowed [... to] directly ask the model"
i will get flagged
the point is that it is a private model
but from my testing
it thinks, is similar in style to o1, and performs better in my tests
@keen beacon you can give us outputs? like for the pokemon thing?
if not can do a screenrecording of the game on liveweave
damn sorry, i thought that would help, maybe you gotta create your own app
thats what i would do
running the prompt now
o3 medium output
cc @balmy mist
thank you so much bro!!!
(replace the "???" with "meta" btw, the frontend replaces all ai lab names with question marks to prevent identities being learnt)
only o3 medium
i think o3 high/pro will be quite a lot better
as is the case with o1 & o3 mini
its pretty fast tho, the output is okay, yeah o3 pro will be goat
no
i don't work there
i just help them out sometimes 😉
oh yeah
another thing i've observed
it is noticeably better at language-related problems, something o1 was already strong in
it feels more clever, even for someone who isn't well versed in the thing they're asking about
it also thinks longer in general but not excessively so
2.5 pro still better for code though 😔 will have to see what o3 pro can do
yeah there's a balance
has anyone here used groq api? im trying to play around with it, also has anyone used their app gen thing?
I had manus overloaded and i got my 900 credits back lol
Is that 1000 credits one-off or replenished daily?
one off
they said that they will give extra every now and then
but you gotta sub
to get more tbh
yo groq truly is fast af
after 15 refinements:
this from o3 medium
to this being refined by quasar:
if I refined with gemini it would have been a lot better
how did you get the model to refine
a generic prompt or did you point out specifics?
specific prompt, check that system prompt i gave you, pretty much that
but i have a bunch of ones like that
refactor
and when you play with the temp its get even wilder
you can also point out specifics as well when you give the initial prompt, its all depens on your system prompt, if you want i can open source it so you can try it out?
i wanna make a dark mode next but was having trouble with it
if yall got test prompts for me let me know, i am curious to see what the llms cooks up
Is Grok 3 sensitive to “bad prompts”?
honesly i think the companies are doing this behind the scenes lol
looks cool
thanks bro, i dont think i should have asked that prompt lol
i thought it would be quick
one thing i like doing just for fun is asking new models what their spotify playlist would look like
o3's:
Here’s how my personal “AI‑Curated Grooves” playlist would shape up on Spotify. I aimed for a start‑to‑finish arc that moves from bright, feel‑good energy to late‑night wind‑down—all while hopping genres to keep things interesting.
Playlist title:
AI‑Curated Grooves: Sunrise to After‑Hours
Short description (what you’d read on Spotify):
An eclectic ride from morning pep to midnight calm—indie pop, hip‑hop, alt‑R&B, classic cuts, and a few global gems.
Cover art concept:
A softly saturated city skyline at dawn, with a subtle waveform running across the sky.
Track list (in play order):
- Levitating – Dua Lipa (3:24)
- Dreams – Fleetwood Mac (4:17)
- Heat Waves – Glass Animals (3:59)
- Electric Feel – MGMT (3:49)
- Sundress – A$AP Rocky (3:22)
- Lost in Yesterday – Tame Impala (4:09)
- Blinding Lights – The Weeknd (3:22)
- Coffee – beabadoobee (2:14)
- Take a Chance – The Magic Gang (3:20)
- Kyoto – Phoebe Bridgers (3:05)
- Good Life – Kanye West feat. T‑Pain (3:27)
- 2009 – Mac Miller (3:56)
- Nights – Frank Ocean (5:07)
- Borderline – Tame Impala (4:34)
- Borderline (Anitta Remix) – Anitta & Miley Cyrus (2:51)
- Teardrop – Massive Attack (5:30)
- Pink + White – Frank Ocean (3:04)
- Cherry Blossom – RINI (3:33)
- Slow Dancing in a Burning Room – John Mayer (4:02)
- Moon – BTS (3:29)
- Plastic Plants – Mahalia (4:16)
- Peach – Kevin Abstract (3:30)
- Fade Into You – Mazzy Star (4:55)
- Motion Picture Soundtrack (strings version) – Radiohead (3:20)
- I Know the End – Phoebe Bridgers (5:47)
A few notes on the flow:
• Tracks 1–7: High energy, bright synths, and upbeat rhythms—great for morning commutes.
• Tracks 8–13: Indie and alt vibes to keep focus during work hours.
• Tracks 14–18: A dip into hazy psych‑pop and downtempo electronic for early evening relaxation.
• Tracks 19–25: Guitar‑driven moodiness and cinematic closers for late‑night introspection.
Total runtime: ~1 hour 35 minutes.
Feel free to tailor the order or swap songs based on your mood—Spotify’s radio and “Enhance” features can also help expand the vibe further. Enjoy the listen!
there are some misses on here (very overplayed pop songs) but also some bangers
tame impala, frank ocean, radiohead
that is so interesting lol
i think chatgpt 4o 20250326 had the best one for my taste
-
Radiohead – Weird Fishes/Arpeggi
Hauntingly beautiful; deep dive into sound. -
Phoebe Bridgers – Motion Sickness
Indie melancholy with sharp lyrical wit. -
Bonobo – Kiara
Organic meets electronic—perfect thinking music. -
Tame Impala – Let It Happen
A psychedelic groove that just keeps going. -
Frank Ocean – Pink + White
Soft, emotional, and timelessly smooth. -
Arctic Monkeys – Do I Wanna Know?
For when overthinking turns sonic. -
The Cinematic Orchestra – To Build a Home
That one song that gives you goosebumps, every time. -
ODESZA – Across The Room (feat. Leon Bridges)
Uplifting yet grounded in soul. -
Billie Eilish – when the party’s over
Minimal and powerful. -
Tycho – Awake
For daydreaming while watching the world go by. -
Sufjan Stevens – Mystery of Love
Delicate and bittersweet storytelling. -
Massive Attack – Teardrop
Atmospheric with a pulse. -
Fleetwood Mac – Dreams
Timeless groove and emotional resonance. -
James Blake – Retrograde
Layered vocals and soul-tugging soundscapes. -
Lana Del Rey – Mariners Apartment Complex
Romantic, melancholic, and bold.
weird fishes, let it happen, do i wanna know are all great
oh yeah i haven't tried it with quasar alpha yet
similar to 4o latest, pretty good
fire stretch
So sad I’m a K-pop listener
dont be sad, music is diverse
None of the above is sung by a girl group
thats what makes it special
You’re right
#general message o3 included BTS
what about refining music hmm
need a new system prompt
@hollow ivy why you didnt tell llm to give you straight up midi?
c3.7s thinking playlist
beach house, radiohead, bon iver, aphex twin, kate bush, radiohead
this one may be my fav
there will definitely be a release tomorrow
and i do think chances are it will be whatever nightwhisperer was
either gemini-coder or a non-preview release of 2.5 pro
noooo
please say it wil be nightwhisper
thats my wife
she ran away from alter tho
been trying to find her since
what if there never was a nightwhisper
it was just apart of our imagination
they wont launch a new model. it will be part of new version of 2.5 (pro or flash). I doubt they will have coding specific model
they have historically been fairly significant jumps iirc
particularly for 2.0 flash thinking
this dope af
similar to bolt
but a deepseek version lol
i agree that it's not great, but that wasn't my point
my point is that preview vs full releases from google have still been significant jumps and i wouldn't be surprised if the same goes for 2.5 pro
?
it's bad now. 4 months back, it was decent-ish 🙂
that happens with a lot of models.. if you regen enough it'll get it wrong eventually, but that doesn't mean it always gets it wrong
how do I understand all the openai model terminology. there is o3, o1, mini versions, gpt 4, 4.5 etc. What is what ?
But people like it for retrieval in audio/video and sometimes writtingn
I am not a flash/mini person. I only love and use big gun. This is why i am sad no ultra since its release
that's helpful.. And increasing O means better models? Like o3>o1 and so on?
i miss ultra too 🤝
thanks. And which models are present to whom? how do I know? like free chatgpt is o3-mini? and 20$ is what and 200$ is what?
it was fantastic at creative writing
dense 1T+ param models will never be beat language wise
but for everything else they're impractical
may be not enough money and demand for ultra models?
Seems like a lot of GDM folks hype posting (rare for them) about tomorrow which I don't think they'd be doing if it was going to be underwhelming
links?
nahh this deepseek stuff is fire, i didnt even know you can use hugging fac like that
and tomorrow is cloud event but businesses. i dont think it will be much interesting for users.
I just tried couple of small games in deepsite.. it's kinda amazing!
yall gotta try out that deepsite, its like nightwhisper
fr
maybe nightwhisper is just bunch of agents?
im getting the same results i did with nw
but it takes a longer time, but this app can be deployed on a site
easy to share
and store
wild
i think nightwhisper is still slightly better. but for opensource model, deepseek is killing things
I hate mini models by the fact that mini models has a worse chain-of-thought than big models
night whisper is the best stand alone, but this platform is optimizing the model using a bunch of tools
if nightwhipser had that it would be cracked cracked
im just saying its producing the same level that nightwhisper did with how they implemented it
their main page isn't
imma try and mimic this
imma show you my one shot pokemon game with this, it shows its not about the model as much once you get to a certain level of inteligence
So sad that Manus has a subpar base model, making the writing looks back
deepseek r1 is the miminum or maybe v3.1 not sure what they are using
still literal misinformation
but prob sonnet 3.5 level models and above are all you need going forward
just have to prompt it right for system prompts and give it tools
thats why you see manus
Idk if I’ll wait for 2.5 Pro to open up the 2 million token context window
and all these other stuff
nahh deepsite is cracked, i found my new baby
hugging face pro is $9 wow
cracked
yo
im a feen now
ahh this was from https://x.com/ArtificialAnlys/status/1909624239747182989
I dont understand why Meta is not able to compete with deepseek? they have probably lot more resources in both Engineering and Machines.
i am out of free limits
"Safety", "copyright", too much red tape, too many cooks on the project.
DeepSeek R1 was considered one of the "most unsafe" SOTA models available when it came out.
DeepSeek didn't care about that, and nobody could do anything about it (well, somebody tried).
safety is overrated right now
lmaoo already?
things are improving too fast
im about to hit mine as well
i think what you should do is start there and then offload it to gemini
it made this:
https://huggingface.co/spaces/IjedMeer/test_run
Open-weight models can be jailbroken
but it is not playable, got to fix it with gemini
share the prompt
i gave it my code from an existing game i made lol
so cheated a lil
but it was a game made by gemini
like with 3 iterations
meaning i asked make pokemon game
then grabbed the output and used it as input and said make it better
did that 2 more times
and thats what I put into deepsite
now i have gemini fixing it, i will update the code as soon as gemini is done and it should be playable again
oh the prompt i used was make it 100x better for deepsite lmao
next time imma say 1000x
its fun to see how they interpret that
but you can see the site when you click that link right?
yes.. it just say initializing
omgg
cooked
updating now
just got to fix one thing that is not letting me deploy
vibe coding 101 lmaoo
i think 3.7/3.5 are good for small projects. comparable to gemini. But 2.5 is better for bigger more complex projects. Given the cost, i am switching to gemini over Claude
If u put in 10 USD u can use quasar for free without rpd
I did
what is rpd?
but quasar gave 502
Requests per day
Maybe a region thing?
Free and basically unlimited gpt 4o api for the time being lol
I kinda envy you having access to vpn vanilla, or living in somewhere that do
even using quasar I just realized I need VPN
for Christ's sake
maybe running benchmarks or generating synthetic data if you like that
I mean API vanilla
without using any third parties
Quasar is definitely OpenAI
lived in banned areas
the pariah state of artificial intelligence
Quasar is just an updated gpt 4o
Based on stylistics you're probably right
what is "api vanilla"
But for openrouter how should I spend the credits on?
are you trying to use "vanilla" to mean the sense of cleanly/simply?
using just the API without any third party
for example, not using openrouter
last time i checked openrouter doesn't get around geoblocks
I thought it does
let me check again
but man, OpenAI sucks for geoblock
Maybe just spam quasar for now lol
U can decide later
I live in a defacto geoblocked area though
average reaction to geoblocking
1 per sec no concurrent limits I think
an average reaction towards the notion of being blocked because you somehow want to keep in touch with tech in general
Actually if u have 10$ now which they require for no rpd u can do 10 req per sec hmm
@leaden palm why wouldn't geoblock suck? too bad this is a Christian server, no swearing or else you know
It's free tho lol
Which one was NW? I missed it when available, so hard to keep track 😅
wait i been using quasar non stop, wym i have to put 10 in ? @keen beacon
no
it works fine
you just might get higher limits if you have $10 in active balance
just one question
do most of you mind if your data's used for training esp for Quasar
There's 1000 rpd limit if you don't have 10 dollars in
ahhh
No if I'm doing 100m tokens an hour
bet i will put 10 in and do 2000 requests
i need to stress test this app
just did refinement of pokemon game from deepsite with quasar 20 requests in 1010 seconds
w/ each request containing around 30k tokens of code input and output give or take
yall positive its unlimited with 10 in your open router?
also someone give me a easy coding prompt to run 100 times
nice
Alex said it's subject to supply here #1357398117749756017 message (presumably unlimited like it was before)
Under 10 and u only get 1000 rpd
he works there?
He's the owner of openrouter
do you need to have the 10 dollars unused or what
oh i see
Ya I think
thnx
You get higher rps with more credits anyway
true
so best model on openrouter is quasar now?
Free frontier model that you can spam api requests with
i like that
this is perfect for me
gonna go ape on it
i just wish we had higher outputs
well it's 4o with long context and free
not 4.5 level model
thats the only thing holding it back
not a reasoning model
fr you cant beat free
nd its a better 4o
and fast as a mother
nahh bro you wild
grok dont get respect until they release api
like y have they not done that?
makes no sense
very sus
openai is expensive
but they still released it
it bad for market imo
but give us the option
why limit us to their platform
reasoning market isn't as hot as the o1-o3 days
didnt he found openai on opensource goals?
nonreasonings become popular again at 4.5
there's a reason 2.5 got hype and 2.0 didn't
hence the suits
yeah, except gemini
?
everyone is making reasoning models
deepseek
nvidia
anthropic
soon meta
reason is only getting better
reasoning is the new scaling paradigm
rather i mean the temporary change in you know flavor of the week
we hit a wall, and we will climb it with reasoning
*or at least RL
advancement is a good paradigm, just flavor's alignment is the question
my point: reasoners are the future, but not every time the future is the flavor of the week
okay im trusting you @keen beacon gonna do a 100 run refinement
people thought we did during gpt 4.5 early weeks
well we have their word, and a few other ones like livebench iirc~
nvm livebench doesn't have them
but there are some other independent benchmarks that include grok 3 iirc
would be great if i could find them...
if it was expensive i would expect them to have tighter limits
It is pretty weird
(unless they just have so many gpus they might as well use them)
disagree. There are some things gpt4.5 does better than updated gpt4o but there are MORE things where the opposite is true. It's kinda like comparing 3.5 sonnet with 3.0 opus
do you think it is the best model on OR then?
it's not obviously, but it's not reasoning model. Among non-reasoning models it is now one of the best I believe yeah
hm
depends what you want doing with it too. I wouldn't use it for web or design development but for everything else where you don't want reasoning it's a great model now tbh
you could to some relatively small extent but it was not trained for it. So like 30k responses are very not possible lol
the most I got from it I think was like 4k tokens
we should test it on simpleqa. Assuming openai continue refusing to release it officially 
Tomorrow the party begins. At least two players. As long as Google moves OAI follows
this is the 39th refinement of the 100 refinements:
https://huggingface.co/spaces/IjedMeer/test-app
not bad so far
yall should see the first one
cant wait to see 100th lmaoo
just hope its in a working state becaue the 14th was not
if 100 dont work ill keep going down until i find one that does
this is like a cheat code for getting an idea to a fully fleshed out version, just gotta wait lmaoo
with one prompt you can get a solid game:
make the best snake game that has a bunch features and sounds and great visuals
the sounds dont work, but hopefully the 100th version of it does
if yall have any other test prompts for me to try let me know
no, it's an optical illusion
KITTTTTTTTY
😼

I'll post some as I see them. This one isn't hype posting and I'm for sure overanalyzing but I follow AI researcher Twitter very closely and vibes from GDM side seem at all time highs
guys, based on those 2000 prompts they published on maverick, did you manage to establish which model it was when it was anonymous? I see someone wrote it was spider.
i can't find my prompts which i send to 24 karat, so I assume it's not it.
did anyone find his own prompts?
this is only using models that are available in direct chat I think
bruh gemini is free and better than paid chatgpt and no one has a clue about it.
except, ofcourse, those that keep uptodate with LMArena.
yeah that's batsht hilarious that it has a decent chance for a better response than o1-pro can give you for $10+ per request. Completely for free like it;s just a thing to do LOL
yeah that should be criminal, I have no idea why this isn't common knowledge yet.
How is this fact being kept hidden from the masses even though it's not private information?
I think much more people are using it now than in the past. But yeah some of it is seemingly deliberate marketing by google
they are not pushing AI from their main website
you just have to know or hear about it elsewhere. Despite google.com being like the most traffic generating website 👀
That is so weird isn't it?
google ads. They need to make up their mind finally and go full in
Wonder if there's some kind of partnership of Alphabet with OpenAI to keep this information relatively unknown to the average person.
nah absolutely not they are competitors lol
but google is making money from ads
that only really work with outdated seach and no real AI/gemini
Yeah so why aren't they going all in? They can make big bank with AI, given they have the highest-of-the-line product available.
And consistently have the best product available in the market for a while now.
First it was 2.0 thought experimental
they could and they should. At this point it seems that's kinda inevitable either way. That classic search is gonna become legacy and not very relevant sooner or later
Now it's 2.5 Pro
Yeah so why haven't they?
investors and shareholders is my guess
Google being a tech monolith, surely would not sit on the fence for massive financial decisions such as this.
all that bureaucracy of a big corp and having teams that sometimes work against one another
Hmm could be...
Both are from DeepMind
What if ultra is nighwhisper and the other dude wasn’t trolling it would be so funny haha
true
wtf
no way
but why would they test it on webdev and not lmarena?
lol I actually don’t think nightwhisper is ultra it was a joke, but it’s pretty obvious now that ultra is coming don’t know when though
Agreed
My guess is at the I/o or June and December/November for Gemini 3 with GPT5 in August
yea probably
im so for a gemini coding model
it will be cost efficient
and more affordable
instead of gemini ultra
Me too
tomorrow gonna be lit
Ready to use ultra for free in ai studio. If they allow it, I might feel like I am stealing something
you dont strike me as the type to pay for anything tbh
I don t think that google will stay closed hands all this periode 😆
Maybe we will have Gemini 4
It seems like Google has found a way to quickly train models and get them released fast
Like they had a slow start but with that foundation they built it seems efficient as hell
infinite TPUs
they are improving on both SW&HW
cant tell if this is bs or not. Seems a bit odd naming for google
It means they're gonna make a model for coding but its not gonna be called gemini coder 1
fr
phil getting his leaks from mcdonalds
(idk what i'm talking about)
You'll say that about another model next year and the year after that lmao. Anyway, it is impressive in that it's the first AI that I've actually seen tell a user if something it wrong in their prompt instead of blindly accepting everything as fact. Its still prone to hallucination though.
noghtwhisperrrrrrrrr ahhhhhh
What do you think?
https://www.together.ai/blog/deepcoder
my baby nightwhisper being unleashed on the world
is it fast?
Google's dreamtides' kinda weak
Better than r1
(14b)
well no sht considering that's the model they started with lmao
Through a joint collaboration between the Agentica team and Together AI, we release DeepCoder-14B-Preview, a code reasoning model finetuned from Deepseek-R1-Distilled-Qwen-14B via distributed RL.
gains don't really look all that impressive considering this tbh
what
where
^
Wow that's amazing. They are moving fast
best LLM for multiple choice questions?
gemini is good at math
needs to use its 2.5 version w/ deep thinking so the time for each question is long though
anything more accurate and faster that doesnt need deep thinking?
For most math gemini 2.5 is the best available right now at any price
If u have different tasks it might depend but 2.5 on avg is better than the rest
tmw gemini coder
yoo
yup already got gemini for math but for simple multiple choice
anyone tried the deep research yet?
heres a example
it has to be amazing
yea i use it for LuaU
🚨 Welcome to Tower of Rush 🚨
Tower of Rush is a FULLY script-generated obby with HUNDREDS of different stages so you NEVER get bored.
Race to the top, collect your wins and coins, and exchange those valuables for tracers, auras, chat tags, and many other awesome rewards.
Earn play time rewards just for playing, climb the leaderboards, t...
not finished but made entirely with AI and some modules for optimization
datastore modules, etc
google made their stuff so abusable, i feel bad
like studio is free SOTA usage
then you have deep research
im a paying customer tho
xd
delete
dont expose method
they prob have lurkers in here
multiple choice questions in what discipline? I'd expect there's not really a skill in llms that is "multiple choice questions", i'd think it's more about how knowledgeable they are in whatever topic you're asking questions about
yea but they might patch and add geolocation or sum type of tracking to ratelimit
gatekeep it
im doing marketing but i need like a general one yk like chatgpt you can ask about accounting then also a micro economics question
idek how to explain it
prob just stick with gemini 2.5 and 2.0
with rocode ?
gemini 2.5 will prolly be best but there's a chance gpt4o is best if it's a very specific format bc it's usually better at following correct formats? but 2.5 seems the best bet
they ruined their impressive announcement by having the dumbest most shameful marketing ever
- "At o3-min level" but if you scroll it's just o3-mini (Low)
It achieves an impressive 60.6% Pass@1 accuracy on LiveCodeBench (+8% improvement), matching the performance of o3-mini-2025-01-031 (Low) and o1-2024-12-17 with just 14B parameters.
- [image]. NOT A JOKE.
wow
wait that's nuts
almost hard to believe? usually frontier models are like a few % apart in human preference
also openAi's deep research* is using o3 full
open ai is cooked
i think google is on the right track
imma try deep research now, but i dont know what to search
I think the word kids are saying these days for this is mogged
i just saw this
wtf
may have to subscribe
is anyone taking requests
🙏
"How Florida went from swing state to GOP stronghold"
i got you
nvm craig got it
but i got any others
cause i dont know what to search
deep research is just an agent at the end of the day
i wonder if we will get api for it, it technically is the best agent or way to search the web based on those benchmarks
like an agent for the web
this shows that 2.5 with tools is just on another level
imagine we get a 2.5 code cli like how we have claude code
bruhh, have the deep research built it and get other tools
i would pay for that, like a subscription easy
thats prob whats dropping tomorrow
this update to deep research was the first stage
tomorrow we get Gemini Code powered by 2.5 pro
and isnt google deep research faster than open ai lmaooo
damn
is this just a guess
nah its programmed at this point
trust
but yeah a guess lol
but high faith guess
why would they not?
2.5 cost less money to run(possibly) then claude 3.7 and 3.5 and anthronpic has a code cli
2.5 is better than 3.7 and has the google infrastructure behind it
that's probably just bc google has cheaper flop per dollar than anthropic
and they released the Deep research update today adn they said this gonna be a big week
exactly why we will see a Gemini Code, they gonna bury open ai and perplexity
i dont lol
thats only bc thats good with tools
but they are fixing 2.5 to work better with tools like cline, and other ides
its butt?
i doubt they will bury openai, and perplexity is irrelevant
is there a lmarena search leaderboard cooking? there's a search option to arena battles and search lb would be useful
bro openai is losing money and no longer have the SOTA model
perplexity value add is perplexing
i dont see how openai continues
especially when people already use google infrastructure, i could see if openai had the better model and it was cheaper or provided a better experience
but they dont
and google got gmail, youtube, maps, search, etc.. most normies not gonna wanna switch
4o has gotten much better with post training updates which shows they're getting good at post training. they are clearly very good at reasoning, google seems close but unclear if they are at the same level. openai is probably still at a similar level to google but probably releases their frontier models slower
so I just don't rly agree
look at chatbot usage
chatgpt has 1000x name req of gemini
yeah they can add gemini to google tools
Set system instruct
but opensource sota will be useable for those things for day to day tasks by users in not long
do you use gmail?
any progress?
the sauce will be in the really intelligent, really expensive to run models, that the users will have little use for imo
what about maps? or youtube?
or drive?
you using drive or canva by chatgpt?
like come on lol
OpenAI is incinerating money. Google has money printer and their sota model and deep research is 10x cheaper to run because of TPU and insane infra. If you think 600m chatgpt users will be sticky when Google gets agi I'll have whatever you're smoking
Not true lol, at least with the imagen drama + don’t forget it’s on android. At least on my circle they know it
i just don't think the little products like "canvas" provide much value to these companies
or "gpt store" or whatever
you guys are sleeping on google
Lot more request lol https://x.com/devsharma_8/status/1909728111744471097?s=46
@GeminiApp is now amazing.
Best Model + Deep Research
20 Per Day!! Compared to @ChatGPTapp 10 per month.
For the same price. Crazy value. I find myself using it more.
Amazing work @AarushSelvan and @GoogleAI team.
nah google will blow up and become competitive but chatgpt is still very good
openai is fighting a losing battle, and dont got the pockets to truly compete and they losing people to other companies
I firmly believe if timelines are early it's Google. If timelines aren't it will be a government
not really
" @ChatGPTapp
10 per month"
most of the stuff came from google
google just did not finish up what they started
i like openai pushing google
most people dont know the diff between chatgpt and AI
will be a long battle for gemini to take over market share but if their products are better its possible
most people dont know they been using AI for decades before the last 3 years
o3 has nuts scores, google hasn't come close to replicating those benchmarks yet
so it dont matter
its been integrated for years
now we are just integrating the SOTA into our systems
like both companies have strength it's a bit braindead reductive to just conclude "google has the sauce" or "openai is cooked" or some meme like that
thought 2.5 beat o3 on some benchmarks? anyways didnt o3 use like millions on compute to get those scores
they dont have to pay for gemini its free
and google gonna integrate with everything
youtube, gmail, drive, maps etc..
perplexity is a bubble company
no other companie has this reach
what??
you joking?
having gemini with maps has no value?
or drive?
or gmail?
been reading, good output
the tables are nice and i still find it crazy that ron desantis went from barely winning in 2018 (R+0.4) to the 2022 landslide
i do
and my parents do, my friends do, you are the minority
wait ai is already in those apps tho
do you not know that?
you can use ai for more than just a chatbot
optimization
integration
Openai will never be profitable
yeah. it's fine not to compare o3 scores with billions of tokens generated vs gemini with thousands or millions. but an important thing about scaling test time compute is that companies that can squeeze out some intelligence out of spending more tokens will get ahead in terms of high quality reasoning. so probably they should start reporting stuff like cons@1024 or high numbers like that to show how well their models scale with more compute. if google scales bad with more compute compared to openai that might be a big openai advantage
You fool
openai doesn't need to be profitable with chatbots, they will make plenty of $ if they can replace some jobs with ai
relax man they were sota as of like 3 weeks ago
im just saying google has the arms spread everywhere and it will be a easy transition for most people, openai only has their website and sdk and they are trying to branch out lol, claude has mcp and the cli and arr doing the saem
these things arent gonna matter in the big picture
but google has so much already
Fine tuned Gemini 3.0 equivalent is designing tpuv7s right now. OpenAI doesn't have v1 of a chip. Just cash burning anime making homework helpr
the thing is, google can copy openai and anthropic, but they cant copy google
yeah the more access to compute + cash printer i think is a decent argument but these models are all pretty similar rn in the grand scheme of things
exactly
-sent from backseat of a waymo btw
thats why the infrastructure that google has stands out, and what makes it worse is that google has the SOTA model rn and its free!!!
i also think llama is not doing well but i don't think they're cooked
behemoth spent around the same amt of compute as llama 3 405b! which is like a year old
which just shows they realized they overspent on 405b
and should focus on better scaling training than big rushed releases
we'll see maybe they gain some market share while they have the best model
and it will be 0-4 months behind SOTA
thing is i think ghibili did more for gaining market share than gemini 2.5 lol
that is true! but that can be copied their is no mota on intelligence or models
thats why google is in a good position
they are making it free
it did like 5 orders of magnitude more?
maybe 50 million users? and 2.5 got like a few tens - hundreds of thousands?
I really only think what matters is whoever is closest to agi. That model rn is 2.5. recursive self improvement is gonna come and market share will mean nothing. It's winner take all
idk maybe i'm overestimating both
how can the other companies compete when the SOTA private models are free from google and then you have deepseek out here doing what they doing, openai, meta, and anthropic looking scary, idc who wins, but i dont see how the others can compete
true but there is no moat bro
it dont matter what they have, it will be copied
i also dont think the cost of these models is really so bad. like i think most people atp are getting >$20/month of value from their favorite llm
true, thats a good argument
lol no they didn't
Elon has been sandbagging them from get go. Doge uses grok and gemini
Trump's friend Elon musk who owns tesla
go outside bro
Ppl think Elon isnt doing self interested stuff with his position. They are with every agency
It’s elon on an alt it’s gotta be him
he is but it's pretty minor overall, might be relevant later ig but probably minorly, and there's lots of reports of elon getting farther from trump's ear in the last few weeks
I'm saying he's being corrupt and screwing altman
i just don't think the doge corruption is that much of a factor
trump does not seem biased against altman, he mostly seems to dismiss elon's beef with him
maybe there'd be an effect to the degree to which trump likes/dislikes bigco, which might affect the political benefits to google / microsoft / meta vs openai / xai / startups, but i'd expect that to be pretty minor
Who won
Google 😆🤌
This from gemini 2.5 deep research on why google is going to win the race is very strong
https://docs.google.com/document/d/1u5OyQFZ4UsxY7OqomklvjRZzyCEFGzio84T0c0d_rEY/edit?usp=drivesdk
The Long Game: Why Google's Foundational Pillars Secure Its Path to AGI Leadership Executive Summary The race towards Artificial General Intelligence (AGI) has captured global attention, fueled by rapid advancements in large language models (LLMs) and the significant consumer adoption garnered by...
alt take
valuable take
It's personal beef. Just woke up one day? They had foresight to buy deepmind and start tpu project over a decade ago.
i don't think they did that because they had a vision for a world with agi
they did it because they could and it would be profitable, very unlike anthropic or openai's visions for the future
if you have a google agi manifesto please drop it in the chat
Sundar, Larry and Sergey have been talking about agi and how it will be more important than invention of fire since before oai existed. I'm just saying that near thread is personal beef, which it is.
Imagine getting paid a full salary to not work lmfao
disagree with near
but it probably doesn't matter who wins, government will most likely control AGI
yeah gardening leave
it's pretty common, not just in tech (though 1yr is quite a long time)
which government do you expect to win / how do you expect them to use it
wait, did u just turn blue?
since yesterday
congrats 🎉
ty
np
anybody been playing with the deep research from google?
yall seen GSI Lab?
webdev and lmarena needs a display on ui to show when new models have been added, this discord should have an alert that we can check where we see the models that get added
where are the applications at 👀
i honestly don't know, i was just privately reached out to
ooh
maverick
anonymous-test is def llama
studio new look
now you can compare models
night whisper is def coming
you can test the same model with different system prompts omgg
yea
i forgot you could always stream
gemini flash and veo
just never used it lmao
both
i dont think we will see nightwhisper this week tho
they cant just release all models at once
damn man, yeah they giving us to much heat that we dont deserve all of it
we been bad fr
SOTA video gen for free wtf and built into the platform i already use the most
you know what i think is happening
they are notcing that ppl are using studio the most
cause i dont even touch the gemini app, only for DR when i have the free trail and that was only like 3 times, and one time today to test it out
studio is very developer friendly
like out of all of the companies studio is my fav bc of the control i have
its almost like api access but still on app and free lmaoo
similar to openrouter and lmarena but my prompt almost never fail on studio
unless its a huge hughe prompt
yea thats crazy
im also glad they changed the UI
the one thing they need to add to studio is mcp and more tools that we can call and then its gg
me too
way cleaner
wait have you used the function calling at all in studio? i think i am underutilizing it @torn mantle
not yet tbh
didnt even know you could do this before lol
not gonna bother reading message history cause i don’t care enough
wanna see if you can really hook up studio with like 10 tools to make powerful agents
i’m pretty sure they default off in studio
unless i turned mine off and it saved or something
why would someone purposely enable restrictions on a model