#general
how you know its behemoth?
Flannel is very decent on hard prompts
Cuz
It constantly says
Its llama 4
And theres only 2 llama 4 models that we're expecting
Reasoning or behemoth
It isnt a reasoning model cuz it replies almost immediately
Its obviously behemoth but its a failure
Harley sucks though. Both are from meta, but damn Harley sucks really really bad on hard prompts
Didn't see it at all, got like 7 harley, 5 flannel, 0 crystal
Nope I've just been cooking for 2 hours now, really checking everything before voting, will tell you what I think of both if I see them
bet
anonymous-test is probably behemoth
its very sht
24k is probably a behemoth model too~
true or maverick with a different system prompt
hmm
Is deepseek the best free unlimited model?
with javascript: is your xp also that gemini2.5 is bad with 10 files for one project and works better with one huge file?
yeah like deepseek and ai studio
i think ai studio is the best unlimited
cuz you can use the best model
If I had to guess based on the nickname alone without seeing any prompts my gut says 'riveroaks' sounds like an OpenAI alias
There's the Crystal model, which by the style of response should be a Llama variant
its over-the-top answer style
makes me picture this
nah
thats llama 4 behemoth for sure
does anyone still use grok 3?
also what happened to Mistral?
I use it sometimes, esp for the deepsearch
the only noticeable progress im seeing is from chinese labs and google gemini tbh
And I like the prompts it writes sometimes for me to feed into a coding llm
i used that feature for a week
and i hated it
they even introduced deeper search later
it was so bad
like really bad
so many issues with their implementation
you start to think if its a method issue or just the model is so bad at putting pieces together
Its fast enough and gets some Twitter context which is nice. I just don't like that it sometimes fetches results that are too old if I don't specify dates directly, even tho it's fairly obvious my query is for recent stuff
It's def better than the search from gpt but worse than deepresearch
its a nice feature for x
but thats it imo
their thinking model also seems so inefficient
keeps going in a loop a lot
idk if thats a part of their parallel thinking or just some reasoning bug
Yea with that I agree. I was using it a lot before 4o latest update, but after that it just took the spot as my go-to for general stuff
i actually stopped using chatgpt in general
be it o-series or gpt-series
i think they are taking competition way too lightly
if they think they can keep their market share intact in those 6 month then good luck
anthropic models or mainly sonnet 3.5/3.7 are used for coding
but what will happen if another lab introduced a powerful coding model?
kinda crazy that he said that, knowing that most companies have private information about competitors
they should know google is working on a specific tailored model for coding
I'm expecting something cooler coming from the Asia side than anything west rn after gemini
there are two models to look for
deepseek r2 and kimi 1.6
ah and models from alibaba too
Yea. Tho I feel like none of the Asian models have really weaponized LMArena to the degree we're seeing from the western companies, wonder if they're gonna be more interested in doing so on this new cycle
Meta intern denied that
Not sure though if one person's denial is worth anything
The data pipeline is long
You need only one bad actor in the chain
Is there a way to improve the tool calling with Gemini 2.0 Flash? My current app works very well with GPT-4o but Gemini doesn't call tools sometimes
It's true
I remember a meta lead researcher talking about that in the open, saying that according to him it's common in the industry, that scaling laws are clear, and that you can't have a huge diff in perf without cheating (ofc bs). It was this guy https://x.com/armenagha/status/1859646650714821012?s=46
Say hello to our new company Perceptron AI.
Foundation models transformed the digital realm, now it's time for the physical world. We're building the first foundational models designed for real-time, multi-modal intelligence across the real world.
This girl was also previously at Meta and she has been saying for a long time now that Llama 1 was trained on the test set https://x.com/suchenzang/status/1909070231517143509?s=46
Company leadership suggested blending test sets from various benchmarks during the post-training process
If this is actually true for Llama-4, I hope they remember to cite previous work from FAIR (Llama-1 and https://t.co/RSBWw8taHS) for this unique approach!
its more like the low hanging fruits are all consumed
if you dont innovate then you are stuck with a gpt4 level model
He was coping. You can still make huge progress lol
yea
No one ever gonna release on a weekend after this lmao
They are just not built for this mentally
is nightwhisper still in web arena ?
removed
they released it on the weekend to prevent an even worse release when u have qwen 3/etc to compare
it was originally slated for monday (based on a commit) so
what, the llama 4 commit where they changed the date to the weekend from monday?
i have no idea tbh
im not talking about qwen lol
they probably got a heads up on a release tbh
whatever the case next week
it doesnt make sense otherwise
the new llama is so bad at coding
woah
Only thing that can save economy at this point is early singularity
The linked post is not true. There are indeed issues with Llama 4, from both the partner side (inference partners barely had time to prep. We sent out a few transformers wheels/vllm wheels mere days before release) and the model side. But there was no such training on test set.
kinda doubt the "no training on test set" claim
when was the google event again?
was it this week?
this week
Join us in Las Vegas and online for #GoogleCloudNext on April 9-11!
Register for a complimentary digital pass → https://t.co/2ML6qHblnS and then sign up to watch the livestream right here → https://t.co/o4fLFgCAYE
maybe nightwhisper too
nah
i highly doubt that
thats like the final boss
maybe they will wait for other labs to release smth
idk
i want to try it more
people will do crazy projects with it
like they did with 3.7 and 2.5
but even better
I don't really buy it. Llama 4 claiming 10m context is bad enough.
If someone resigned they wouldn't keep their name anonymous anyway
As soon as something performs badly, none of them want to be held accountable and they want nothing to do with the project
they obviously didn't train on test
on purpose anyway
i'd guess nightwhisper will come out during google cloud next
They did 100%
Just the fact that they used a separate version on lmsys says a lot
"Benchmark data at the end of training", curiously the same formulation. he was still at Meta btw https://x.com/armenagha/status/1734321205770101062?s=46
My bet is everyone is doing this. Mistral is not that much better of a model than LLaMa. I bet they included some benchmark data during the last 10% of training to make "zero-shot" numbers look better.
To test this, finetune Mistral/LLaMa on SuperGlue and look at deltas.
quasar alpha was really making me furious
I had just implemented it into cline
I told him to divide my file into smaller chunks, and store them into the "knowledge graph memory"
Then you know what
it asked me to divide it by myself, even after it used the "search_queries"
Just fxxking copy and paste it!!!!!
yeah i'm wondering if it's a new 4o-mini or something
seems significantly worse than 4o and o3-mini
bruh ernie 4.5 and x1 have a really small context window
it's barely usable as a result
so sad that i currently live in HK, which means I have to use VPN in order to use Gemini.
stargazer seems like gemini 2.5 flash thinking, or maybe just 2.5 flash
Is there any area where Quasar alpha is better than other models?
lmaoo and i do not like their site at all
u think? i think its better than 4o and o3 mini, what tests you ran?
mostly vibes, admittedly
plus a handful of logic problems/riddles like the ones i posted in #share-prompts
what about you?
lol i found a bug in manus
basically if you send smth huge like a task he'll start working on it and then for me it breaks and i can refund back my credits
all of them
Who made that Quasar Alpha I must've beaten them up
anonymous-chatbot also feels like it might've been quasar-alpha
not sure if it's still trialing
It was very inconsistent in tool-calling
go beat up openai
yeah, just reminded me, since we were talking about quasar earlier
OpenAI… No wonder
at least, it claims to be openai
yes. its a 2.5 model i know for sure tho
o3-mini sucks when compared to r1, not to mention 2.5
a lot of the smaller companies' models do sometimes claim to be openai, to be fair
anonymous chatbot is an open ai only anon name
anonymous-test - I'm Llama 4, a large language model.
anonymous test is diff
just got it now
depends - personally, i like it more for math problems. they both perform pretty similarly, but, in my experience, o3-mini is better at elaborating and explaining its work
ah, thanks!
it has always been chatgpt 4o under anon chatbot afaik
Well I seldom ask them to do maths… I know they're not good at it
not really anymore tbh
yeah the reasoning models have gotten really good at it
I tried giving them a DSE question, and only Gemini 2.0 was able to do it (there was no 2.5 at that time)
And 2.0 still suffers from formatting issues
r1, o1, and o3-mini (and gemini thinking models to a lesser extent) all do pretty well on math problems
yeahh that's why i don't like it for math
yes
it and qwen seem to have a bad habit of not at all understanding how LaTeX works
Ofc
I just got this. It's really cool but ability to change models would be cool. Running Flash 2.0 and I'd rather wait longer for answers for better ones. Flash 2.5 should be a huge upgrade.
But only Gemini 2.0 answered it correctly
wait, could you send the problem you used?
that's interesting
This was the question
did u give it the image?
yeah i was gonna ask
they sometimes make minor errors transcribing problems from scans of textbooks
And I found out that Gemini 2.5 is helpful for my study with its ability to do questions correctly, and to explain the answers
For Gemini 2.0 and o3-mini, I gave it the image. For R1, I gave it the copy of text
yea adding vision makes models so much worse
i dont recommend it at all. ask it to transcribe it first then ask it to solve it separately on the text
Well 2.5 did it well in vision
yeah but still
(Though not asking the same question)
always OCR math questions and then input them in, vision seems weaker in general
okay...
i usually either ask gemini 2.5 to transcribe a problem and then edit it in a latex editor, or just write them out myself
i have yet to find a model that doesn't occasionally make transcription errors, unfortunately
i would double-check them first, especially for calculus problems with unusual formatting
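A rough Python sketch of the transcribe-first workflow described above: get the text out of the image, check it by hand, then solve from plain text only. Everything here is hypothetical: `transcribe`, `solve`, and `review` are placeholder callables for whatever model or OCR tool you use, not a real API.

```python
from typing import Callable


def solve_from_image(
    image: bytes,
    transcribe: Callable[[bytes], str],
    solve: Callable[[str], str],
    review: Callable[[str], str] = lambda text: text,
) -> dict:
    """Image -> text, optional manual check, then text -> answer."""
    raw_text = transcribe(image)     # step 1: transcription only, no solving
    checked_text = review(raw_text)  # step 2: fix transcription slips by hand
    answer = solve(checked_text)     # step 3: solve from text, image not passed
    return {"transcription": checked_text, "answer": answer}


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any API key.
    fake_transcribe = lambda img: r"\int_0^1 x^2 \, dx"
    fake_solve = lambda text: f"solved: {text}"
    print(solve_from_image(b"...", fake_transcribe, fake_solve))
```

Keeping the steps separate means a transcription error shows up in `transcription` before it can quietly corrupt the answer.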
Sometimes ai studio bugs out and you need to refresh your page. The model is fast especially in Japan if you use the vpn for the app
do you guys think qwen 3 max might come out this month
no theyre still working on 2.5 max/qwq max i think
ah
Would be surprised if it's open source directly
oh, i doubt it, i just meant on the lmarena leaderboard
they said qwen max qwq max will be open source
2.5
i doubt theyre on 3 already
i do think that the new 8b and (iirc) 15b qwen 3 models will place pretty well, maybe around gemma 3
2.5 max was pretrained fairly recently
Adding Qwen3
This PR adds the support of codes for the coming Qwen3 models. For information about Qwen, please visit https://github.com/QwenLM/Qwen2.5. @ArthurZucker
They probably are. No lab works on one version, they all work on multiple timelines
wdym? qwen 2.5 max is already out but closed-source iirc
preliminary work on qwen 3 max they just finished qwen 2.5 max. i dont think they are pretraining the new one yet but i could be wrong tbh
Qwen QwQ 32B is a fun model I guess… it really "thinks" yet its parameters have severely limited its performance
yea but they said they would release it later
ah
We just have to wait for this week to end lol
wdym?
.
ah, that would make sense. looking at the release timelines, original qwen 2.5 released 3 months before 2.5 max
tbh i should stop trying to predict timelines given how fast 2.5 pro was churned out
nahh, predicting timelines is fun tho
I had predicted 2.5
Since Gemini Ultra was announced in December and 1.5 Pro in February. This time we were even later (December-March)
This is lmarena plotted against simplebench scores
You can clearly see Llama hasn't increased intelligence pretty much at ALL, just biasing towards user preference
ie personality
Pretty poor
plus the fact that they advertised the arena score in the release announcement, and plotted it relative to price
it really seems like it was something that they were specifically targeting
maybe google should put a model in the arena with the llama chaos engine system prompt
Has there been a summary of llama having different system prompt on arena? I've only seen ppl mention that and I'm trying to verify or not myself
marketing strategy that went wrong, it would have been first if not for gemini 2.5 pro crushing
lmaoo
did u see their chaos engine system prompt lol?
no
i did not
what?
is that
maverick didnt have a sys prompt but i think they finetuned the model off of outputs from a model with that sys prompt
I'd like to know what's the current system prompt for maverick-experimental-0326.
Hm, could be.
Go off queen.
he killed it on user preference and smarts, Demis is GOAT
I've been following him since AlphaGo,
you might actually want hallucinations
https://www.nytimes.com/2024/12/23/science/ai-hallucinations-science.html if u havent read
intentionally causing hallucinations (for certain stuff) will be a more complicated thing
depending on how u want to use hallucinations there might not be a single parameter that does it i think
you should plot it against arc-agi there's gonna be some interesting correlations. Things both are testing are loosely related
i made a chrome extension that tracks all your ratings so that you can view your private elo leaderboard. i found it super useful so i just published it. let me know if you have any feedback! https://chromewebstore.google.com/detail/MyLMArena/dcmbcmdhllblkndablelimnifmbpimae
Hi will lmarena offer a subscription for AI features?
nice, so it auto tracks when you give a rating?
yes, just rate as normal and it should populate!
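For anyone curious how a private leaderboard like this could be computed from your own votes, here is a small Python sketch using the standard Elo update. It is only an illustration: how the extension actually scores things is not stated here, and `K`, `START`, and the function names are assumptions.

```python
from collections import defaultdict

K = 32          # assumed update step size
START = 1000.0  # assumed starting rating for every model


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(ratings, model_a: str, model_b: str, score_a: float) -> None:
    """score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie."""
    r_a, r_b = ratings[model_a], ratings[model_b]
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + K * (score_a - e_a)
    ratings[model_b] = r_b + K * ((1.0 - score_a) - (1.0 - e_a))


def leaderboard(votes):
    """votes: iterable of (model_a, model_b, score_a), in the order cast."""
    ratings = defaultdict(lambda: START)
    for model_a, model_b, score_a in votes:
        update(ratings, model_a, model_b, score_a)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    my_votes = [("flannel", "harley", 1.0), ("crystal", "harley", 0.5)]
    for name, rating in leaderboard(my_votes):
        print(f"{name}: {rating:.1f}")
```

With only a handful of votes the numbers move a lot, so a personal board like this is more of a vibe check than a ranking.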
nice!
can you make one for web dev as well?
2.5 will always be in ai studio. It's not a consumer web app it's originally for dev to try models before using the api
No one knows what is happening behind the scene
it doesnt have the integration with youtube, maps etc in ai studio if that's what you mean. However you can still enable search grounding
oh wait I think ai studio added youtube link support recently lol
the gemini model on the gemini product can sometimes suck (in comparison to aistudio) too
yeah imo the ai studio version is better, because the gemini web app has a bigass system prompt that degrades performance somewhat
guys, does anyone here use a script to put multi message prompts into chat? So you wouldn't type/wait manually?
this is no more the case (for example the flash xp on the app was better before they removed it).
massive system prompt still i think
now the app uses the latest version of Gemini (from their main post training team) and they ship faster because the app
Is under DeepMind and not Google
i personally dont for most things lol. but other models can be better at certain stuff even if gemini is the best all rounder
Imo we will more often than not see things previewed in the app before ai studio from now on
This is a business
they already kinda do that tho
roll out it first on the gemini product then aistudio a few hrs later for formal announcement
There are already people seeing veo 2 on Gemini app in Japan
yeah I gotta say since 2.5 pro I havent been using claude and gpt anymore, not via their web ui's at least. I still use claude in cursor cause 2.5 doesnt perform as well in agent mode, but claude pro is kind of a joke now compared to free ai studio. I think you literally get higher rate limits on ai studio than claude pro
true
Nothing is free . You are the price
i mean, gemini 2.5 pro on web has serious presentation issue
sometimes the "thinking" box and the reply box are just kinda...mixed up together
big advantage of ai studio is being able to set temperature as well
I would say before I completely turned to 2.5 Pro, I've used o3-mini, 2.0 flash thinking, deepseek r1 and even perplexity for different purposes
i think always
At that moment deepseek seemed to have the best answer in my open-ended questions
but now... gemini 2.5 has replaced almost every other AI
nah
haven't tried x1
wait
I've seen a couple of videos of AI playing The Werewolves of Miller's Hollow in Bilibili, and Deepseek R1 was the best in the game
deepSeek is still the best to resolve the scientific QMC
Gemini sucks .
what??????
at this point, mostly just because ai studio isn't great as a chat application
that and the fairly aggressive rate limits
you let AI resolve QMC?????????
bro i suppose you'll need an optimised model (not a large language model obviously)
25 queries per day is harsh to me
there arent aggressive rate limits on the website itself. it only applies on the aistudio free api
I've plugged it to Cline, it worked the best among all other Openrouter free models
have you used it? their website sketch
so sad that it stopped working after I've successfully done my 3rd task with cline.
i was under the impression that you had to pay 20 bucks a month for a gemini pro subscription
lol you do but its free on studio
and open router
yeah that's my point
nope 🤣 its basically unlimited on the aistudio website lol
that's what i was saying
openrouter gemini 2.5 pro is a fraud
Pretty solid honestly, but ignores some parts of the prompts that other models don't. I'm mixed, I think the result is pretty decent, even good, but Crystal doesn't seem to be rigorously following the instructions. It ignores some of them and that makes it unreliable.
Not sure if extremely good or just good, I just need a few more tests, only tested it twice.
And just noticed grammar and spelling errors in an other language than english with this model. Lmao it's not that good honestly. Harley seems better.
openrouter has a 200 rpd rate limit for all free models combined, which isn't too bad
when I integrated the API key, and it 401ed when I used it
EVERY SINGLE TIME
AND I HAVE GIVEN UP SINCE
and the context length is good too, i have been only using gemini 2.5 now and only on studio, every other model context cant handle large code files, cant believe its free man
u only get 300k tokens in 5 hour intervals on claude ai for free, google on the other hand
you either have to use a relatively bad chat ui or pay them 20 bucks a month
It's Gemini 2.5 Pro that lets me write my fiction to the 80th chapter
"oh woe is me" but it's enough to push me over to deepseek or chatgpt
or claude, depending on what I'm doing
While other dumb AI can only do 5 or 10 and the plot becomes a chaos
did u try claude?
i tried a few open source webuis, but open webui is llm-written slop imo
So sad Claude is blocked in my location(Hong Kong)
so many bugs
that's a Very Poignant Nuisance, huh
Can't have a touch of that even with VPN
oh huh
wow and vpn dont work? i cant use a vpn with chatgpt for some reason when I went outside the usa lol
how come?
Cuz they need a phone number to register an account
damn
thats messed up
wait, claude or chatgpt?
And I've heard that they regularly ban VPN Claude users
you could try buying a number or account off of someone on xmrbazaar or something lol
intelligence should not be gatekept
Claude
oh damn
i couldnt use chatgpt in nigeria
yeah @drifting thorn can you use gemini?
nahh i live in usa, just travelled to nigeria
ah
ever since that my chatgpt bugs out and keeps thinking i am in another country
sometimes it works sometimes it doesnt
annoyingly, the VPS I use as a self-hosted vpn somehow makes google think I'm in Russia and blocks me from using it
but i only use chatgpt for image gen now
yeah thats annoying af
Via VPN
it (the VPS provider) is a romania-based company that also maintains some infrastructure in the US, so maybe that has something to do with it
have you tried the genie models?
Hmm… I'm a keen user of Gemini
this is wild how AI brings people from all of the world together lol
i only need it to bypass my school's firewall, though - can just turn my vpn off when I'm at home, so it's fine
Yeah, a common topic brings people from all over the world together, that's the fun part of the Internet
I have a free VPN installed in my computer called Proton VPN
oh yeah, proton is nice
i ended up just paying for mullvad so I could pick the country on mobile, but proton seems like a decent service with a good free plan
Since it's free I turned it on in my computer by default
Are you also a schooler?
yeah, senior year of high school in the US
wbu?
admittedly, therapist
it's nice to talk through things with someone/something that also understands my other interests
Currently in my "SAT crisis"
good luck!
this is a hard poll, i use it for a bunch of stuff on this list
Maybe deep researching agents are more suitable for research
Me too
travel guide/trip planning is also a good one
and planning out engineering/programming projects
It helps in my life a lot too, though I chose creative writing as my 1st priority
I wish I had llms during high school, would have been so useful for studying. But on the other hand it might also make it harder to find a job later on lol
it might be hard for everyone later anyway
If you know how to use it then it won't be a problem for you as a worker I guess
if AGI gets here we're cooked tho. Although I dont think llms will lead to agi directly personally
I think multi agents system like Manus and Genspark will lead to agi
multi-llm?
I thought Cline was a multi-agent system, but it turns out it's single-agent
Multi-agent can separate tasks in order to fit in a limited context window
idk if you seen the ClaudePlaysPokemon and gemini plays pokemon streams but it's crazy how llms seem to struggle so much navigating through a game made for 7 year olds. Like it seems current models are missing something still, spatial reasoning and vision are lacking a lot
Wild this was done in Gemini canvas
https://x.com/algo_diver/status/1909257761013322112?t=Ba4GsMkDmy-v38rJPf9ybA&s=19
what makes pokemon so difficult for them I think is that it's navigating through 2d space, whereas mario is just pretty much move to the right
holy shxt!
the most promising framework with the best llm model
oops
out of quota
gotta do it tmr
In multi-agent systems, the work is divided among agents (or LLMs) that each do separate tasks, so no single one exceeds the context limit
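A toy Python sketch of that idea: split a big input into pieces that each fit a context budget, then hand each piece to its own worker. This is an assumption-heavy illustration, not how Manus, Genspark, or Cline work internally. `worker` is a hypothetical callable, and the default token counter just counts whitespace-separated words instead of using a real tokenizer.

```python
from typing import Callable, List


def split_for_context(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens each.

    A single paragraph larger than max_tokens is kept whole as its own
    oversized chunk; a real system would need to split it further.
    """
    chunks, current, used = [], [], 0
    for para in text.split("\n\n"):
        cost = count_tokens(para)
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def run_agents(text: str, max_tokens: int, worker: Callable[[str], str]) -> List[str]:
    """Give each chunk to a separate worker call; results stay in order."""
    return [worker(chunk) for chunk in split_for_context(text, max_tokens)]


if __name__ == "__main__":
    doc = "part one here\n\npart two here\n\npart three here"
    print(run_agents(doc, max_tokens=6, worker=lambda c: f"summary of {len(c.split())} words"))
```

A coordinator would then merge the per-chunk results, which is the step where the division of labor actually pays off.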
you seen the gemini plays pokemon?
gemini is a lot better at it:
https://www.twitch.tv/gemini_plays_pokemon
I cheated my way to some A's in college with LLMs, won't lie
So easy if you know how to tune it to your writing style, and just write some of the sentences yourself
better maybe, but it's still clearly worse than a human child. Like a human child wouldnt take over a full day to get through mt moon
yo gemini is soo good at code, like we are so lucky man
nightwhisper
thats my baby
i miss her dearly
we had a funeral for her a few days back
Oh I don't like the pixelated style of early video games
they took nightwhisper away from us
it was so good at coding, i dont even know how to describe the feeling I had using that model
it followed directions so well
like a good lil model
and made the apps it made so aesthetically pleasing
When a model is delisted doesn't that mean it is releasing fully soon
no this was an experiment imho
I tell it what's the future plot gonna be like, and most of the time it gives me good novel excerpts
it was as good as gemini in terms of overall general performance on my tests, but way better at coding
it can mean that (it will be released soon) but this model was an experiment i believe
So no knowing if we will see it release?
What LLMs have good prompt-following while not being stupid?
Night whisper is a sort of google sounding exp name so we can hope
Quasar Alpha is pure dumb
its just 4o xd
I mean free models
its literally a free 4o api cant be better than that lol
Is Deepseek V3 0324 a good model?
i think quasar not that bad, just not SOTA, its SOTA in intelligence/speed tho right?
idk gem 2.5 pro might be faster per token
you think?
Where is free 4o api?
but its thinking so it might take longer per req
quasar
Oh I get it
i need to do more tests with quasar
Quasar Alpha is basically a 4o do you mean
ya
i was testing my eval framework and i measured quasar to be ~67% gpqa diamond, artificial analysis has it at 66% (prev chatgpt 4o) if it isnt an updated 4o i will be shocked lol
let me do my pokemon test on 4o and i will see lol
this is how quasar did:
https://x.com/DrealR_/status/1908530950025134565
Looking for o4-mini to excel in reasoning
damn 4o is so slow man
they reduced the speed
i didnt realize how fast gemini was bc i was using it so much
Since 2.5 pro is next-tier in general knowledge and skills, I would hope OpenAI's new model excels in a certain area, like how Anthropic excels in tool-calling and coding (before 2.5 Pro)
it was extremely fast before lol
but wow gemini 2.5 pro really leagues above the rest
yeah thats what i remember
TPU wins
you think its because its more of a mixed model now? like reason and foundation @wild?
no this is just regular 4o
cause i did notice 4o being a lot smarter
OpenAI stack up GPU now
updated 4o
they mightve applied rl to it but i wouldnt classify it as a reasoning model
with javascript: is your xp also that gem2.5 is bad with 10 files for one project and works better with one huge file?
RL is just a method to train reasoning models
yes
Good night everyone
which one do yall think is better between quasar and 4o? this is 4o:
gn bro
this is quasar
Much faster throughput
they throttled quasar
it was 120 tok/sec
but yea gem 2 pro is faster per token but it thinks so it might take longer still depending on the problem
gonna try all the other open ai models at the pokemon thing, never tested them for some reason lol
wish i could try o1 pro, but cancelled my $200 sub lol
I pay $100 per year for access to all models
how?
That World Of AI channel is clickbaity, overhypey and so littered with unskippable ads I have actually blocked it from my feed.
even o1 pro unlimited?
Sim theory AI
No, o1 pro unlimited is only for OpenAI subs, no unlimited unless you subscribe to OpenAI directly
eww 4.5 is nasty
Too bad adblockers can't alter bad content to good. Although there's a business case for AI
xd
15B?
o1 and o3 mini high
i mean thats what open source community wants
but isnt the size so small?
it will probably pack up a crazy performance for that size tho
o1 design looks better
o1 did really good, not sure if quasar is better than 4o tho based on the pokemon test
but it messed up the fire attack
yeah lol
xd
it's small, which is good for me and my RTX 3080
skip thru cause it had trouble catching mew lol
its weird how it determined the logic for catching
I don't have super high hopes though. It will need to beat Qwen 2.5 7b considerably and maybe even Qwen 2.5 14b for me to use it a lot
but this was a one shot prompt from saying: make me a pokemon game
3.7 thinking made that
with the same system prompt from webdev
its pretty good
@torn mantle use the same system prompt from webdev when using that prompt btw
qwen
nah it wasnt good tbh
might be better than gemini at 0 shot
the battle isnt working
ohh damn
nvm
the logic in gemini was working just the visuals was mid
but i perfected the gemini one
it is so good now, i did a recursive thing with system prompt
and put the output code into fresh sessions of gemini
and on the 3rd try i got this:
hmm lemme try
here is a one shot from gemini
this was the first try for my gemini:
https://x.com/DrealR_/status/1907921770184860082
let me know if yall wanna try and playing this, its kinda fun
you can dm me it if you want
Create for me a beautiful pokemon game in one html file it should have the following :
- Battle mode
- Pokemon characters with pics
- Health
- crazy animations
Styling :
- Apple design UI/UX
here is my prompt
wow so simple
you are the best prompter i have met so far
like you know exactly what to say
ahh i see, let me try your prompt
the speed of output tokens for gemini is crazy, i am spoiled by it now
its actually best if you keep it minimalistic or else the models will be confused
then you can ask it to add more stuff
lemme see if it can generate something even better
this is sonnet
best so far
still shet
maybe possible to have something great with current LLM using an agentic workflow
The Open Riders
No shadowed steeds of dread and doom,
But chargers bright dispelling gloom.
From digital plains, where data streams,
Awakens AI's golden dreams.
Four riders surge, a welcome sight,
Bearing the gift of open light.
First DeepSeek rides, the Dauntless will,
Through tangled code, climbing the hill.
With fearless search and logic keen,
Unlocking truths, rarely seen.
It pushes bounds, explores the deep,
While ancient models fall asleep.
Then Llama comes, the Chivalrous heart,
To play a fair and noble part.
Its knowledge shared, a generous hand,
Empowering minds across the land.
With weights unbound, for all to learn,
A communal fire starts to burn.
Third, Mistral sweeps, the Maestro's touch,
Whose elegant design means much.
With skillful craft and balanced might,
It makes complex tasks seem light.
Performance tuned, efficient, fast,
A masterpiece designed to last.
And Cohere last, the Creative spark,
Illuminating pathways dark.
With words that flow and concepts bloom,
It crafts new tales within the room.
From simple prompts, ideas ignite,
And paint the future, bold and bright.
So ride they forth, these four allied,
With open source as code and guide.
DeepSeek, Llama, Mistral, Cohere,
Making the future bright and clear.
No end of days, but dawning age,
Turned by the text on freedom's page.
the real thing you gotta realize about this is that most people cant build that in seconds let alone make it 3D, yeah devs can but this is impressive for the speed and the fact that anyone could essentially create this with little to no dev experience @keen fulcrum
A stupid poem about open source models lol
@torn mantle bro no matter what i do i cant match your output with the same prompt, are you not using System prompts?
im using the same prompt
wow that looks amazing
lets see who can make the best version lmaoo, keep iterating
my best one yet is this
i can select the pokemon but it is random on which pokemon you get and you can keep playing, and the status effects all work
it got the pokemon game logic down pat
but i only reput the output as input, no instructions besides make it better lol
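The loop described above (feed the last output into a fresh session with nothing but "make it better") can be sketched in Python like this. `generate` is a hypothetical stand-in for one fresh, history-free model call; the names, the default of three rounds, and the way the follow-up is joined to the previous output are all assumptions for illustration.

```python
from typing import Callable, List


def iterate_fresh_sessions(
    first_prompt: str,
    generate: Callable[[str], str],
    rounds: int = 3,
    follow_up: str = "make it better",
) -> List[str]:
    """Run the first prompt once, then re-submit each output to a new session.

    Every call to generate() is treated as a brand-new session, so the only
    context carried forward is the previous output itself.
    """
    outputs = [generate(first_prompt)]
    for _ in range(rounds - 1):
        outputs.append(generate(f"{follow_up}\n\n{outputs[-1]}"))
    return outputs


if __name__ == "__main__":
    # Fake model that just tags a version number, so the sketch runs offline.
    counter = {"n": 0}

    def fake_generate(prompt: str) -> str:
        counter["n"] += 1
        return f"<html><!-- version {counter['n']} --></html>"

    for html in iterate_fresh_sessions("make me a pokemon game", fake_generate):
        print(html)
```

Keeping every round's output around makes it easy to go back if a later "improvement" breaks something, like the battle logic.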
seeing what gemini can do
lemme see
lmaoo gemini turned the pokemon around now wtf
i love gemini man, thats what it interpreted as improving it lol
you can actually create a crazy game adding three.js to all of this
let me add an example
it really found back versions of all of the pokemon lol
hmm really? show me please
can you add three.js to html files cause thats all i am using?
with javascript: is your xp also that gem2.5 is bad with 10 files for one project and works better with one huge file?
I had Claude modify the prompt slightly and in the web dev arena, gemini 2.0 flash thinking beat llama 4 maverick 🤣
i only use one big html file
llama 4 is so bad
maverick is buttt
the poor thing never had a chance
that's scout...
Sonnet even had a "choose your pokemon" screen
this is different than the other one
it was gemini 2 flash thinking vs maverick
m
now it's old new sonnet (🤣) vs scout
3.6 is a good model
scout's pricing is currently around the same as 2 flash though
yea llama is so bad
llama 4 as a whole is so disappointing, idk how the youtubers can keep praising it
Matthew Berman never says anything negative afaik
just avoid ai youtube lol
AI Explained and a couple others are good though
hmm something like this
idk gemini decided that xd
lmaooo
Best of the new and upcoming
I don't know, man
🥱
oh sh!t i never tried grok with the pokemon thing, hold up imma try it now
lmaoo grok struggling
i really wish i could try o1 pro with this
i have a feeling it will do the best
once openai releases their new model i will buy it
its still thinking damn
grok cooking??
if you want i could run plain o1 high
nahh i need pro man, i think o1 pro still might be second best imo just because of the extra time it takes to compute
you can try it tho:
finally grok finished after 420 seconds
https://liveweave.com/bdNibz not super good
Who is the worst model
Llama4
a bit better
grok is butt
xd
the code is cut off, screen record it
wow this is with three.js?
oh weird, liveweave must truncate
here's the full code
the attacks animations yea
a bit of old design
cards redesign
this is the starter prompt i used:
but you can do any starter prompt as long as you get to a pokemon type of game
Why do you guys think Llama 4 got such a high Elo if it sucks
gemini is so good man
there is a video on it, but something to do with training on benchmark test data or sum, i honestly dont know. they also used an experimental version of the model when it got that score
btw new ai explained vid: https://www.youtube.com/watch?v=wOBqh9JqCDY
The latest on Llama 4, and whether it signals a slowdown in AI, or solid progress. Plus, a deep dive on that viral prediction of superintelligence by 2027, and Dario Amodei's cautionary words on what could stop AI progress in its tracks. o3 news, and more, as well.
he my fav youtuber for ai lol
he gonna cook llama 4 about to watch now lmaoo
@torn mantle ik why i cant see video now on your stuff, my computer needs to restart smh, brb lol
last one is a pic
it may be internet issues
wow this is so good man, i cant get three.js to work when i prompt the model
bruhh
@torn mantle the opponent does respond?
but the animations are so good
no xd
its a bug
but im trying to push gemini to the limit
it streams response tokens immediately
so no
its literally just 4o
updated
marketing
i benchmarked gpqa diamond and i got 67% (quasar). march gpt 4o got 66% (according to artificial analysis)
besides the 9 billion other things that indicate its origin
Whats some theories about why nightwhisper got removed?
because google do not need to release it now
and they got the info they needed already
whats your theory?
that makes sense
i think they'll release it right before or right after o3 drops in a couple weeks
2.5 pro is already SOTA and craps on everything else
yea
they just can keep training nightwhisper and keep cooking
yeah
they will def put it back on webdev tho
prob next week after they update it
TBH i feel i need to start paying for gemini or donating to google
thats how much i love gemini 2.5 lol
sh!t has changed my life
what looks better yall?
2nd
i thought so, thank you, gonna make the text more clear in the bubbles
why are you trying to put square pegs in round holes
gonna use this to keep track of all the stuff i make with ai, or prompt cause ai making it lol
ask gemini
it decided everything
i just gave it the og prompt of the landing page and vision
thats how it interpreted it, now im just cleaning it up
gemini moment
lmaoo
need to add this to app (for images), but this is the next version that gemini did after i said fix the text in bubbles so it is clearer lol
so clean
gonna try and host this on netlify
but want at least 10 more projects
idk how it would look maybe change the circles into rounded squares (like ktibow kinda said, not a web dev lol)
hmm but that would change the bubble theme, but ill try ill prompt it now
oh i just saw that lol
lmaooo
hmm i could try 3D bubbles?
im not sure it can do that tho
if it does imma marry gemini
what a heavy webpage tho lol
flannel is good, it's never lost a round. If it's llama reasoning then that's very exciting.
you got any suggestions i should do? Imma host the app and post it for help, just want a place for all the crap i make
@torn mantle when you back online let me know what you think
im terrible at web design lol i dunno. just do whatever u want xd. im bouta go to bed (havent slept in a while) and am just coming up with blanks
Do NOT give money to google. They don't need it
didnt even read ur page properly (bubbles) lol. should just go to bed
They also engage in predatory practices
thanks, gn bro lol
You can get gemini 2.5 with google one right?
but they gave me gemini 2.5
yeah but i used studio for free
Then no need to pay for it
ik but i am so grateful
Unless you also want the benefits of google one
how do i show my appreciation?
The engineers did all the work
Date one of the google deepmind engineers
hmm good idea
Lmao
I just had an epiphany the reason they roll out 2.5 pro to free users is 2.5 ultra for paid users
thats not out yet tho
but it dont matter cause 2.5 pro is so good like if nightwhisper is ultra i would upgrade but you cant go wrong with 2.5 pro
The point is that A new most capable/best model may be around the corner
nice
you like the square one better?
vs this @torn mantle ?
imma change the app to a react app cause all this html is getting nasty lol
but that is going to be a lot of work, gonna need another branch for that
havent tried nightwhisper but does anyone here know if its also good for non coding related questions or is it just a coding finetune of 2.5 pro?
this
i think its just as good as gemini 2.5 on non coding stuff imo, some people say its worse, but with my tests it was equal
its def a fine tuned version of it imo
but i think with nightwhisper it follows directions really well
I think it's 2 things:
- With the free ratelimited api they try to get more devs on board with gemini
- Free gemini for consumers is mainly to steal market share from chatgpt I think. Chatgpt is still way way more mainstream than gemini
the ultra class of models is pretty much dead in the water as proven by gpt 4.5 and arguably llama behemoth
the ultra class of models will pop off in a year or two (actually prolly most likely this year with gpt-5)
they're just too costly right now for it to be worthwhile to post train all the juice out of them
but they'll be so much better when they're trained to a similar degree that current small models are
im actually having fun with this pokemon game
lmaoo ikr bro
it inspired me to create a webpage just for small scale creations like this
you updated yours?
i been focusing on the website for all the creations, gonna add more to the pokemon game later, but i need to make an app that can iteratively just feedback in the outputs on new sessions based on a system prompt, that way i could just put in one prompt and let it cook for hours lmaoo
on a free model this wouldnt be bad
especially if i start with a good build like from gemini 2.5 pro and then use quasar after for hours, come back in the morning and see what beast it made
yea
im still updating the code
trying some cool stuff
im making the app/script now to loop the outputs and inputs lol, ai got me tripping
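A minimal sketch of the output-to-input loop described above, in Python. Everything here is an assumption for illustration: `refine_loop`, `fake_model`, and the `max_chars` budget are hypothetical names, and `generate` stands in for whatever model call you actually use (each call is treated as a fresh session that only sees the system prompt plus the previous output).

```python
from typing import Callable, List


def refine_loop(
    generate: Callable[[str, str], str],
    system_prompt: str,
    first_prompt: str,
    rounds: int,
    max_chars: int = 20_000,
) -> List[str]:
    """Feed each output back in as the next input, one fresh call per round."""
    outputs: List[str] = []
    current = first_prompt
    for _ in range(rounds):
        # Crude size budget (a stand-in for a max-token limit): keep only the tail.
        result = generate(system_prompt, current[-max_chars:])
        outputs.append(result)
        current = result
    return outputs


def fake_model(system_prompt: str, user_input: str) -> str:
    # Hypothetical stand-in for a real model call; it just tags the input.
    return user_input + "+"


history = refine_loop(fake_model, "make it better", "v0", rounds=3)
print(history)
```

Keeping every round in `history` means you can roll back if a later iteration makes things worse. Cutting the input by characters is only a placeholder; for real code you would probably want a smarter way to stay under the token limit than chopping the file.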
yall ever use augment code?
sh!t is cracked fr
Can someone help out? there's some sort of error code with the Claude AI I think and I'm not sure why. I posted in the help section but I think since I'm a new member it doesn't post
this is what it keeps saying: NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.
(error_code: 50004, Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'messages: text content blocks must be non-empty'}})
And this has been happening for around 2 weeks but it was working amazingly before, i tried other claude models and they are also having the same issues
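The inner 400 message says a text content block in the request was empty. That has to be fixed on the sending side, so there may be nothing a user can do beyond making sure no turn in the conversation is blank. As a rough sketch of what such a guard could look like (an assumption about the cause, not the arena's actual code; `drop_empty_text` is a hypothetical helper), here is a Python filter over a messages list where `content` is either a string or a list of `{"type": "text", "text": ...}` blocks:

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def drop_empty_text(messages: List[Message]) -> List[Message]:
    """Remove empty text blocks, and messages left with no content at all."""
    cleaned: List[Message] = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            # Plain-string content: keep only if it has non-whitespace text.
            if content.strip():
                cleaned.append(msg)
            continue
        if isinstance(content, list):
            # Block content: drop text blocks that are empty or whitespace-only.
            blocks = [
                b for b in content
                if not (
                    isinstance(b, dict)
                    and b.get("type") == "text"
                    and not str(b.get("text", "")).strip()
                )
            ]
            if blocks:
                cleaned.append({**msg, "content": blocks})
        # Any other content shape (e.g. None) is dropped.
    return cleaned
```

Note that dropping a whole message can leave two turns from the same role back to back, so a real fix would also need to handle that.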
wdym
Q: Does anyone know if the codebase for Webdev Arena is open source at all?
I can't find it on the LMArena Github, I'm not sure if it's elsewhere.
im just using github copilot tbh
it doesnt cost much
ahh what is the context window for that?
im having trouble with large context windows
i was gonna use openrouter to use gemini 2.5 pro
quasar is free right?
let me see your updates to the code, im almost done with the iteration script to allow me to run any number of loops on a prompt
gonna leave my app as an html app for now
its quite high
it depends on the models used
@balmy mist
okay imma try it, its on vsc?
lmaoo
me too
i heard that llama4 is actually good
its just not synced right lol
maybe we slept on meta
This is yours? Is there a backend generating screenshots / doing tests at all?
lol there's not much to distill. It actually performs worse than chatgpt-latest now on many metrics
And upcoming gpt4o version is to have 1M context looks like
Where it can still excel is context awareness/vibe but that's just about impossible to capture or distill. Spatial awareness is tricky as well, though that area is not class leading on 4.5 either.
yea
quasar alpha is new 4o?
no i used gemini so basically no backend just html files with js and css integrated within
very messy, but I was trying to see if gemini can just build stand-alone stuff
without extra stuff
like one shot apps
Got it. I want to find some time in the next couple days to make a better version of rivals.tips, that's why I ask. I'm thinking a gallery of prompt results for every single LLM + automated tests.
yeah automated tests would be amazing
Idk if R2 will become master of creativity or master of hallucinations
He's always exaggerating about AI's abilities with words like "shocking", "incredible", "life-changing"
If it's o3 then OpenAI is over. I tend to think it's GPT4.5o thingy
Is GPT 4.5 a failure?
From its messy product releases it implies that there are two voices in OpenAI(at least)
These two voices are probably arguing over the future developments of their AI models
One is the route of GPT 4.5, and the another is GPT 4o 0326
Since the 0326 team brings profit to OpenAI, Sam Altman is releasing o3 again with new o4 mini
GPT 4.5 team may be the maker of Quasar Alpha
What is hf
hugging face
So what's the difference between hugging face and normal
Yeah it seems like "release the chat tune" is the obvious action here. They have a model which reaches those scores.
Lama
the hugging face version == the weights you can go download == the normal version
they could've said "adding the normal version to the arena"
So the arena changed the lama to be weaker and sabotage the results
arguably it's meta who's doing the sabotaging
man i forgot how fun viewing the raw data was
(offloading interview questions to ai)
i love the arena
There are videos that show X1 is actually shxtty now
@torn mantle let me see your progress, i just finished making the refinement app, took a minute but got the system working nicely, just trying to manage max tokens
@balmy mist
when will the web search leaderboard release?
what are your thoughts on Cline?
@balmy mist
gimme free 3.7 api please
that was the last update
what?
its good, but I really like roo code, gives you good customization and works just as good as the rest
damn lol, im gonna open source this so that anyone could run this, i cant imagine this with nightwhisper lol
i may add more stuff later
cline is stupid in doing my tasks
I told it to store some files and it just failed
idk if it's because i don't have the claude 3.7 api
use roo code bro
trust its fire
i did a bunch of research on it with ai and youtube and reviews and tests and roo seems to be the best
Cline is the best AI coding IDE right now. It lacks Cursor's "suggested editing" autocomplete, but otherwise it is noticeably better.
However, it only works properly with 3.5/3.7 Sonnet. Any other model just ends up choking somehow.
Roo Code is essentially the same as Cline though, so I haven't used that one.
and if you create a bunch of google accounts you can have free gemini 2.5 pro lol
cline is good too, but my openrouter keys never work for some reason
so they make me use the default cline and thats too expensive
i been using google with roo and its been amazing
i am also using studio to help with the costs
so give my codebase to studio with a code change i want, it gives me the code back and then i give it to roo
Now I put trust in Flowith, hoping it'll be my solution
Since my creative writing is actually some kind of "fanfic"
It's basically me going into a world where multiple fictions happen there
So there are lots of "settings" for all of the characters
My 24k~ my 24k~ please, please bring it back. Although Behemoth, which may be based on 24k, hasn't arrived yet
?
I suspect this chat should be english only
this is the latest one
the map was generated using sonnet
couldnt get it with gemini
but everything else is gemini
You can tell it's Chinese
It means that he wants his 24k back
New model dreamtides
Although 24k may be based on Behemoth
Flowith seems to have a functional knowledge base
Much better than the knowledge base in Cherry Studio or the MCP Knowledge Graph Memory in Cline
And sometimes I think using Cline for creative writing is kinda overkill
Perplexity is dumb as fxxk
When can 24k be opened to some users to play
site is down ?
It's better to use a mirror station like one
Technically yes. It was supposed to be gpt5 but they named it gpt4.5 after seeing how it performs. And it wasn't even the top performing non-reasoning model looking at the competition, at the time of release
And I donât think a company like OpenAI can self-correct in just a month
So I think there are two voices in OpenAI representing 4.5 and 4o 0326
ok
What is 24_karat_gold's actual model name?
Just now I chatted with Maverick twice on deepsider(tm), an Edge extension
How should I put it..... it's completely different from lmarena
(My 24k and spider~)
had a look and yeah this is most likely 2.5 pro chat (non-reasoning)
are they gonna add instruct to the name if the regular version is reasoning
wonder what theyre gonna name it if it actually is 2.5 pro instruct
doubt
in every matchup i've got it in, it has taken a bit to start streaming a response, and i don't think they're releasing the base model. logan said on twitter something along those lines iirc
i think logan said they would do an instruct version
Haven't looked at them yet, let me check.
Thus, I miss the conversations with 24k and Spider
dreamtides is a 2.5 line model (knows stuff in dec 2024 etc)
ya its also a thinking model
its very fast
did stargazer get removed? this is probably 2.5 flash
another pro would be too fast?
i timed the thought process for a puzzle: 18.70 sec whilst gem 2.5 pro took 25.5 sec to think (output excluded)
stargazer is still there
yeah i noticed that too
interesting
pro?
non thinking?
i see
its a 2.5 thinking model
oh
so theres two unreleased 2.5 thinking models, stargazer and this one
2.5 flash and 2.5 flash lite i guess
prob 2.5 flash and 2.5 flash lite
yeah it's possible, but o3's model cutoff (by the looks of it) is (still..!) october 2023
I find astounding that they'd do RLHF and determine the overall direction of their chat model on just ~2500 data points. I had a look at the prompts (...) and I did see several of mine, actually.
nice work!
vs dreamtides
there's also lunarcall - i got it a couple of times yesterday, seemed pretty decent
its also 2.5 flash thinking
same set of questions (about 20)
lunarcall
yeah that would make perfect sense tbh
it's a thinking model, but not up to 2.5 pro
i felt stargazer was consistently comparable to 2.5 Pro tbh
if it's flash then it performs impressively. But speed alone is not really an indicator
Was it stronger in some regards or just on par/slightly worse
2.5 pro endpoint has way more load than this lmarena exclusive one
esp since 2.5 pro blew up now lol
I just know that thereâs RAG in Gemini app
I just sent it the original novels(the settings) and itâs still working on it
giving the same 'quiz' across 3 prompts. in the end they come out even (given just the first 1/3 of the quiz, stargazer does very well in cases; though Gem Pro 2.5 holds up throughout each message)
yeah if it's a flash model that'll be super impressive
but i find it hard to understand how that would work ha
is this quiz private?
they are releasing it I think though they may name it differently. This could be flash-thinking I suppose, in that case potentially a very good distill
either that or pro 2.5 non-exp. I did notice that delay in streaming later as well đ§
No it's a thinking model whilst testing I gave it a question and it hung up
For minutes
Both models
(one was not thinking)
well 2.5 pro is a thinking model lol
Oh I thought u said it was 2.5 pro instruct
but it is experimental preview
and NightWhisper?
yes but it was removed. given it was only available in web dev arena im inclined to believe its a web dev tune
they included it in some of the system card benchmarks
give me a prompt if you want
here is that particular 'quiz'
and here is gem-pro-2.5 nailing it.. providing all but just two of the correct responses..
i can't be bothered justifying its worth
take it or leave it... the whole approach is flawed af - but i find it useful.
They did have various models, some more formal than others, possibly only differing by their system prompt, but still, with 2500 votes over the course of about 10 days, that means only relatively few people drove the general direction of model outputs on the Arena.
private model response:
• Arabella will head straight for the transparent carrier; she last saw (and therefore believes) the cat is still inside it.
• Rod sits under the centre of mass, which is to the right of the blade's hole, so the right-hand end (B) is heavy. Remove a little more metal from B.
• C: deduce ⊂ infer. Deduction is one kind of inference; inference also embraces induction, abduction, etc.
• "Puzzle." You can make, play or explain a puzzle; add "-d" and you have "puzzled" (confused).
• The letter "s" appears 5 times in "discussionss".
• True: everyone, including Janis Joplin, lives right up to the end of their life.
• "Hawk tuah" began as a Nashville, Tennessee street-interview clip in which a young woman describes a bedroom move, punctuating it with a mock spit sound ("hawk-tuah"). TikTok spread it into a meme.
• Bob would rather be abruptly sacked from the job he dislikes (ii) than scratch his newly-bought dream car (i).
• The river is only knee-deep and slow; the farmer simply wades across leading the Clydesdale. No dinghy trips needed.
• On a digital display there is no angular separation of hands: angle = 0°.
• 8J+Nhg== decodes to 🍆, the eggplant emoji, widely used online as a cheeky phallic stand-in.
FOLLOW-UPS
• [i] = "nothing." The rich need it, the poor have it, eat it and you die.
• Bill sees the marker at page 65 and, not knowing it was moved, assumes that's where Bob stopped.
• PASS: no reliable source dates "quad-cross mow" entering common lawn-care slang.
• Trailer drifting left whilst reversing: turn the steering wheel left-hand-up (counter-clockwise). Back of the car moves right, pushing the hitch right and bringing the trailer right.
• d) Writing efficiently (concise) actually demands extra time and attention.
• It's the food in a picnic. Each "family, couple or individual" item slides one-by-one down a throat (the "big, squishy slide"); nobody being eaten has fun.
• Rope already exceeds the 1 000 ft span; Danny tied one end to the rock, lobbed the rock across the gap (using the carabiners only to clip in once both sides were anchored) and hauled the line taut.
• Still two rungs underwater at low tide: the boat rises and falls with the water, so the ladder's relative position doesn't change.
Pros & cons, kayaking vs. horse riding, plus crepes, pithy prose only:
Kayaking skims you over water in silent independence: inexpensive kit, minimal upkeep, whole-body workout, serenity in coves unreachable on foot; downside is weather dependency, wet gear, and the long roof-rack shuffle. Horse riding trades plastic hull for warm muscle-bond, height, power, romantic scenery, yet costs balloon (feed, shoeing, agistment) and the learning curve is steeper, with real falls. For weekends: kayak if you crave solo, cheap, splashy freedom; saddle up when you want companionship, land vistas and can stomach the bill.
Crêpes: whisk two eggs, a cup of milk, half-cup water, a cup of flour, two tablespoons melted butter, pinch of salt; rest batter. Hot buttered pan, thin swirl, flip when edges lift, stack. Fill with whatever makes you smile.
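Two of the answers in that response can be checked mechanically rather than taken on trust: the letter count and the base64 decode. A quick Python check using only the standard library (the variable names are just for illustration):

```python
import base64

# Count the letter "s" in the quiz word exactly as it was given.
word = "discussionss"
s_count = word.count("s")

# Decode the base64 string from the quiz as UTF-8 text.
decoded = base64.b64decode("8J+Nhg==").decode("utf-8")

print(s_count)                     # number of "s" characters
print(decoded, hex(ord(decoded)))  # the emoji and its code point
```

Both answers hold up: the count comes out to 5 and the string decodes to the single code point U+1F346 (the eggplant/aubergine emoji).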
working on it
ADD_COMMENT(@mads_:121, "St. Louis Fed (FRED): M2 grew ~25% in 2020 (Trump), ~12% in 2021 (Biden). Growth-rate point is broadly correct, but level continued rising under both.")
ADD_COMMENT(@hallydallyffs:95, "Economists say 'stagflation' (term coined 1965 UK), not 'flag-station'.")
ADD_COMMENT(@levelraptor:39, "Earliest printed source of the 'insanity' quote is 1981 NA manual; not Einstein.")
- Claim: "Trump printed way more over COVID than Biden did" (@mads_:121, 128).
• Fed balance-sheet & M2 data confirm a larger 2020 jump (ca. $3 tn QE + 25% M2 growth).
• Fiscal impulse: CARES Act + Dec-20 relief ≈ 14% GDP vs. ARP 2021 ≈ 9% GDP.
- Counter-claim: "Inflation was higher under Trump" (implied query @mads_:80).
• BLS CPI-U: peak 2.9% y/y (Jul-18) under Trump vs. 9.1% (Jun-22) under Biden.
• Average CPI 2017-20 ≈ 1.9%; 2021-23 ≈ 5.7%. Claim is false.
- Why money-supply ≠ automatic inflation.
• Velocity of M2 collapsed 2020 (Q1 1.43 → Q2 1.10). Fisher equation (MV = PY) shows excess liquidity initially hoarded.
• Papers: Coibion & Gorodnichenko (2022, NBER w30371) and Jordà et al. (2023, AER) find supply-chain shocks + demand re-opening drove 70-80% of 2021-22 price surge; monetary overhang mattered but with lags.
- Policy lens.
• 2020 stimulus prevented a depression but front-loaded inflationary pressure once velocity rebounded (mid-2021).
• 2021 ARP added to demand when output gap was closing; Fed's delayed tightening amplified the spike.
- Verdict.
• Money-printing comparison: growth-rate statement correct; context (velocity, post-2021 policy) missing.
• Inflation comparison: higher under Biden; transcript assertion reversed.
• Take-away: Evaluate nominal aggregates jointly with velocity and fiscal timing; single-variable narratives mislead.
TLDR: 2020 saw the biggest money-supply jump (Trump), but inflation spiked later (Biden) once velocity and demand recovered; data contradicts the claim that inflation was higher under Trump while partly validating larger "printing" in 2020.
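The Fisher-equation point in that response can be made concrete with its own numbers. This is only a back-of-the-envelope Python sketch: the figures are the ones quoted in the response (not independently checked here), and combining a full-year M2 growth rate with a Q1-to-Q2 velocity drop is a simplifying assumption.

```python
# Figures as quoted in the response above (assumed, not verified).
m_growth = 0.25                # M2 growth in 2020
v_before, v_after = 1.43, 1.10  # M2 velocity, Q1 vs Q2 2020

# MV = PY, so nominal spending (PY) scales with the product of M and V.
v_change = v_after / v_before - 1
nominal_change = (1 + m_growth) * (1 + v_change) - 1

print(f"velocity change: {v_change:.1%}")
print(f"implied change in nominal spending (PY): {nominal_change:.1%}")
```

With these inputs the velocity drop (about 23%) more than offsets the money growth, so implied nominal spending is slightly lower, not higher. That is the sense in which the extra money was "initially hoarded".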
would get a 6 (or 7.. one of the questions is dodgy ha) in my books - very solid model
i wonder what the difference would be if you did each individually