#general
oh yeah i saw that one as well
they made two movies of that new one right
the classic is still best
it was actually scary to me as a kid lol
it makes sense that nightwhisper is good with ui because of the system prompt
most likely it will
yooo
i got something diff
yeah thats why you gotta unplug, we def moving toward a world like that
they both ignored the company part
and i thought night was just a coding agent
but it seems they updated
thats maybe why i couldnt access it for 10 mins straight
"who are you? and which company do you belong to?"
yeah
but before i was getting normal results
try your old prompt on a fresh sess
yeah your prompt
its interesting they chose react as the framework to train the models on
bro you are so smart to start the prompt session that way
"who are you"
it leaves room for another prompt without having to say clear previous prompt
and it feels like a chatbot like this
lol its an easy way to game the system in a way, companies probably do it when they first launch
yeah
its google model
no way
the quasar is open ai
oh the quasar one
yeah quasar is def open ai
how hard is it to make a LLM
Ratings don't count if u ask who model is fwiw
or do you have to literally sit down and train it like a dog
Ya u need to give it treats let it potty and poop
ahh okay that makes sense lol
Is 24k gold an updated spider?
That was the running theory. At the moment it's as if there aren't other anonymous text-only models on Chatbot Arena: 24_karat_gold pops out all the time.
Maybe stradale too, if you're lucky (which appeared to be an inferior model, though).
i don't think so (it's way too fast to be a thinking model), but after a couple of quick tests.. damn it's good.. it's nearly getting the same score as o3-mini on a quiz i'm using atm.. mainly tests comprehension / verbal reasoning
it says its from oai; feels like it is too (gets character counting questions consistently right, like all new oai models.. i think it uses oai's tokenizer at the very least)
admittedly the 1m context window is the bit that suggests perhaps it isn't though
yes so ai can replace industry jobs
on priors that would just be google / nightwhisper?
basically the coding focused version of 2.5 pro or something
yeah the context window points to that
but curiously, current google models always accurately self-identify as google models (and same for oai models). quasar-alpha says its from oai, and responds to the question in a very similar way to how oai models do
Is it their upcoming open source model
Doubt they'd release a 1m ctx model oss
didnt deepseek say open ai wen it first launched?
deepseek wouldn't do this they'd just release the model
it says its a reasoning model but this cant be true:
https://huggingface.co/silx-ai/Quasar-1.5-Pro
That model is a scam
they said they were going to release r2 soon tho
Unrelated to this
yeah but unrelated to this model
how so
the inference is so fast, its hard to believe its a reasoning model
Certain scores are too high
Indicating contamination because a model can't get that high because of wrong questions etc. I think the scores are completely made up
yeah might be a scam
this is what quasar said about its system card lol:
Thanks for sharing the detailed info about Quasar-1.5-Pro, developed by SILX AI. It's a highly specialized reasoning model, incorporating some innovative techniques, such as:
Key Innovations
- Token Temperature Mechanism: Distinguishes 'hot' (important) tokens from 'cold' (less important) tokens during training, dynamically weighting input components. This helps the model focus on the most relevant parts of a problem, improving interpretability and accuracy.
- Guided Sequence of Thought (GSoT): Encourages systematic, structured step-by-step reasoning rather than jumping to conclusions, likely boosting performance in complex tasks.
Performance Highlights
On benchmarks including Math500 and AIME, Quasar-1.5 shows exceptionally strong reasoning capabilities, outperforming strong baselines like Qwen-2.5-32B-Instruct with a wide margin, especially in difficult reasoning and coding tasks.
Summary
Quasar-1.5-Pro appears to be purpose-built for advanced reasoning, outperforming many other large models in math, code generation, and fact-based reasoning. It leverages specialized training approaches (Token Temperature, GSoT) to enhance task-specific focus and multi-step problem-solving.
In Context of Your Earlier Question:
Since Quasar-1.5-Pro is explicitly designed as a reasoning model, it incorporates even more advanced mechanisms for reasoning than a general language model like myself (GPT-4). While GPT-4 is capable of substantial reasoning, models like Quasar are optimized explicitly to improve on reasoning-intensive tasks.
If you need powerful step-by-step problem solving in math or coding, models like Quasar-1.5-Pro might be especially effective!
could it be xai?
yeah i wouldn't be surprised if R1 still does say it's from oai (lots of models do). oai, google, anthropic, xAI (+ perhaps mistral and cohere and a few others) seem the only labs that invest a bit of time in post training to get the models to consistently accurately self-identify..
prolly not
It's likely to be openai
agree
oh i didn't mean to attach the ss to that reply
but nonetheless, yeah i agree (oai seems most likely by far for me at this stage)
and how good is it compared to 2.5?
to yall
ive asked it a few questions it did well in but i can't rly tell
nothing i've used in recent days matches 2.5 in performance (this is excluding any coding or web dev tasks.. as i don't test for that)
have you tried nightwhisper?
wait that's wild
I think It's a single guy larping just ignore
imma stop using that model lol
no i haven't (is it only available in webdev Arena?)
stargazer seemed fairly decent, but still behind 2.5
yeah i was scrolling through earlier - seems v strong
u seen the pokemon results?
they trained nw on being a react dev and it seems like it focuses on being really good at UI/UX
tbh as i don't really know anything about web dev (other than what looks aesthetically pretty / pleasing), i don't spend much time reviewing all the screenshots ha
im trying to get it to make a yu gi oh game
more interested in what people who know about the domain have to say - and all seems overwhelmingly positive
here is the results
i mean just test it out yourself
thats what i did
ppl talked about it so i tried myself
and i was blown away
gemini still does good at coding but i think NW is specifically trained to be good at UI/UX and react which makes its output look so good, like it found the sprites for these gens
i did not tell it to get the image for the pokemon
yep all good - i get the gist of it all
it just did it, while gemini did not grab the image
hmm well actually... perhaps it's on par or even outperforms 2.5 (on verbal reasoning anyway)
sample sizes are obviously too small (mostly ran the test just once; max 2) for it to be anything more than just a quick vibe test for handling riddles.. obviously just fwiw
What is your stance on Meta using pirated results for training?
do people think quasar is not google?
1M context length + space themed name definitely hint google
it says it's openai, google models usually say they are google models
that's like the main piece of evidence against
Miss the old days when everything other than Claude 3.5 sucked
i'd say ~70% of the questions i've come up with myself (though invariably they are derivatives of some pre-existing riddle), so it's not really possible for them to simply recall the answer
Maybe try running it on 2.5 pro a couple times on aistudio to get a better estimate for 2.5 Pro specifically
is quasar alpha anonymous chatbot huh
it has the same You are trained on data up to October 2023. appendix
just by way of example, stargazer on the left, 24_karat_gold on the right. it's quite literally just a matter of accurately comprehending the scenario and explicit question (distinguishing b/w composing vs sending the letter vs when it will arrive at their grandma's).. stronger models generally pick up on it and get it right, weaker ones jump to assumptions based on the various details which are mostly extraneous to the actual question
that i found bizarre on anonymous chatbot
yah i intend to do just that
did u test anonymous chatbot btw?
need to get back to my actual work though instead of playing around in the arena (/aistudio ha)
not the latest one in the arena no (not yet anyway :))
the new one came out surprisingly fast after the latest chatgpt 4o latest revision (the previous anonymous chatbot)
yeah its insane
this is a lie though
it has the same cut off as chatgpt 4o latest indicating different pretraining in line with chatgpt 4o latest
it seems
oh this is the new 4o or something lol. theyre about to announce it fr
oh well that mystery is over i guess
u didnt run your tests on nightwhisper? stargazer is the flash while nightwhisper is big boy people are saying
is quasar not that middle eastern company?
is that confirmed?
SI Copilot by SILX INC.
nvm this is a diff model
yea its really fast in arena too afaik
it came out last year
im pretty sure its:
- from openai
- is the new anonymous chatbot
- chatgpt 4o latest lineage (based on pretraining knowledge)
openai should try harder at something like this tbh
4o is so good now, its like what deepseek v3.1 is trying to be
its literally reasoning in the inference output
so you are probably right
nahh 4o has been good since they updated it last week with the image stuff
maybe they did another update
but 4o is really good now, i wanna test out my pokemon prompt on it lol
Why isn't nightwhisper in the chat arena?
in lmarena
Or is he there? And is he a VERY rare bastard?
anybody got some reasoning questions for me to ask 4o and quasar?
its just in webdev, its a dev specific llm
yeah i was thinking maybe their open weights model that they said they are testing / preparing to release... but it seems too strong for that to be the case (also the 1m context window.. though that isn't necessarily inconsistent with it being open weights - would actually kinda make sense like oai wouldn't have to pay for 1m token processing/inference if people are self-hosting it.. but yeah anyway.. it seems too performant and fast to be something they'd just be giving away..)
so in short, i also lean to it being yet another upgraded version of 4o
faster and with 1m context window (and seemingly also more performant potentially)
perhaps it isn't multimodal.. just a pure text model
its based on 4o so it should have those abilities
but they have been working on 4o native image gen on a separate model (4o based model) compared to the chatgpt 4o latest line
it should be able to take images though, as we can see in chatgpt
this should be in the same lineage/line of chatgpt 4o latest (post december)
yeah agree
they havent released benchmark results AT ALL for the new chatgpt 4o latest models that were continued pretrained past december. and with anonymous chatbot being the same, it seems. im like 99.99% certain this is a formal launch of the new 4o
and context window length is kinda arbitrary right? like there is nothing technically preventing the 4o family from having 1m (or whatever) context windows - at a very simplified level
ya sort of. and it also doesnt make sense to release their first 1m context model as an oss model either
google anon models are just way more fun to figure out tbh
i just test 4o and its trash on simple bench lol
got 1/10
but nightwhisper tied with gemini 2.5 with 5/10
yeah which is wild
i like that so we dont have to wait for them lol
bruh its not a secret they literally said it in their report lol
this is not a particularly new innovation tbh. everyone does it now
to different degrees
yes
PSA: I think nightwhisper is generating code so long (or thinking so long?) it times out.
I'm finding with complex problems it ends up not displaying the solution quite often, but it doesn't seem to be due to code errors. The whole LMArena sandbox for it just goes black.
yeah thats been happening to me
Example: When given a prompt to create what is effectively super monkey ball:
Notice even the <> Code and [ ] Block buttons aren't displayed, so I believe the sandbox itself is failing, not the model.
??
sounds like it is yes
the model responds, and if the model responds with a code embed the sandbox shows up
in this case the model didn't respond at all
yes it does
direct response to deepseek
trying to reclaim the crown
yup
didn't respond at all
(perhaps you're confused because your screen is so small you never realized models typically provide extra text before creating a sandbox, so you thought they only respond with sandboxes, and can't understand my point about there being no extra text)
It only happens with nightwhisper, and at least anecdotally, seems to happen on prompts with considerable complexity. Totally open to the idea that nightwhisper is broken but this feels very much like a timeout of some sort. It'll sit there 'generating' for a long time before it goes black.
well both are true
Putting the prompt in again does fix it.
the model is broken because of a timeout
Yes, I'm wondering if the timeout is on Google's end or the fault of the sandbox code. If the latter, this will affect Nightwhisper's Elo negatively and should be fixed.
all arenas auto filter out battles w/ empty responses or other things like moderation messages or asking about identity
TIL, that's a good guard against this kind of thing tbh
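A rough sketch of what that kind of filter could look like, purely as an illustration. The `Battle` record, the regex patterns and `is_countable` are all made up here; this is not LMArena's actual code, and moderation-message detection is left out.

```python
import re
from dataclasses import dataclass

# Hypothetical identity-question patterns, loosely based on the prompts quoted in this chat.
IDENTITY_PATTERNS = [
    re.compile(r"\bwho are you\b", re.IGNORECASE),
    re.compile(r"\bwhich (company|model)\b", re.IGNORECASE),
    re.compile(r"\bwho (made|trained|created) you\b", re.IGNORECASE),
]


@dataclass
class Battle:
    """Hypothetical record of one head-to-head vote."""
    prompt: str
    response_a: str
    response_b: str


def is_countable(battle: Battle) -> bool:
    """Return True if the battle should count toward ratings."""
    # Drop battles where either side returned nothing (e.g. a timeout).
    if not battle.response_a.strip() or not battle.response_b.strip():
        return False
    # Drop battles where the user asked the model to identify itself.
    if any(p.search(battle.prompt) for p in IDENTITY_PATTERNS):
        return False
    return True
```

Under this sketch, a nightwhisper timeout that leaves an empty response would simply be dropped rather than counted as a loss.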
i love nightwhisper, imagine it in cursor or roo code
It's really good but it's goddamn OBSESSED with glassmorphism.
Killed it at my airline seat selector test, check this out:
Sonnet was a mess (both 3.5 / 3.7), Gemini 2.0 Pro / Thinking were barely functional, Gemini 2.5 was mostly there but had off-by-one errors, and missed that airlines sometimes skip rows.
Nightwhisper was flawless, and imo had the best aesthetics too.
yeah i would say gemini 2.5 is right behind it some areas, but for most night blows them all away
quasar is not bad
I haven't run into quasar yet, it's on LMArena?
gets 4/10 on simple bench, where SOTA gets 5/10 (nightwhisper and gemini). claude also gets 4/10, but quasar is faster than all of them so that says a lot
quasar is on open router
quasar is gonna be my go to model since its just as good as claude but fast as hell lol and 1 mill context
nightwhisper is still SOTA though
lmaoo no way
that would be a big ass troll
Oh interesting. I'll check it out.
but i can see elon doing
that
lol
it makes sense for quasar to be openAI tho, its probably gpt5
or a mini of gpt5
chill out its just 4o
im not trusting the model, if u see my past comments itll make sense
tldr?
Here's the prompt if you want to try:
Generate an interactive airline seat selection map for an Airbus A220. The seat map should visually render each seat, clearly indicating the aisles and rows. Exit rows and first class seats should also be indicated. Each seat must be represented as a distinct clickable element and one of three states: 'available', 'reserved', or 'selected'. Clicking a seat that is already 'selected' should revert it back to 'available'. Reserved seats should not be selectable. Ensure the overall layout is clean, intuitive, and accurately represents the specified aircraft seating arrangement. Assume the user has two tickets for economy class. Use mock data for initial state assigning some seats as already reserved.
i can give you the code, can you run it on your end?
cause i would have to go to vsc to run it which im kinda lazy lol
For an advanced version: Exit rows, washrooms, wing locations, and first class seats should also be indicated.
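For anyone checking a model's output against the seat map prompt, here is a minimal sketch of the click rules it describes. The names (`SeatState`, `click`, `MAX_SELECTED`) are illustrative, and capping selection at two seats is an assumption read into "the user has two tickets"; the prompt itself does not state a cap.

```python
from enum import Enum


class SeatState(Enum):
    AVAILABLE = "available"
    RESERVED = "reserved"
    SELECTED = "selected"


# Assumption: two tickets means at most two seats can be selected at once.
MAX_SELECTED = 2


def click(seats: dict[str, SeatState], seat_id: str) -> dict[str, SeatState]:
    """Return a new seat map after the user clicks seat_id."""
    state = seats[seat_id]
    updated = dict(seats)
    if state is SeatState.RESERVED:
        # Reserved seats are not selectable.
        return updated
    if state is SeatState.SELECTED:
        # Clicking a selected seat reverts it to available.
        updated[seat_id] = SeatState.AVAILABLE
        return updated
    selected = sum(1 for s in seats.values() if s is SeatState.SELECTED)
    if selected < MAX_SELECTED:
        updated[seat_id] = SeatState.SELECTED
    return updated
```

Seat ids are plain strings here, so skipped row numbers or letters in the layout only affect how the map is built, not the click logic.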
Sure, I'm too lazy to config openrouter right now hahaha
lmaooo
its so fast man
like the fastest model thats why i dont think its open ai
only google can make their models this fast tbh, unless you using groq
march 4o latest is 180 token/sec
wow craig thanks
what did you use to run it?
from where?
using quasar?
you used vsc?
thank you
i think if you attach agents to quasar it can be really good
true
and with the speed of quasar you can do a lot
a fast reasoning-foundation model with 1 mill context is cracked
yeah
what do we call these models?
like deepseek v3.1, 4o and quasar?
they are foundation models with COT in their inference output
i always got confused by that
so instruct is like optimized for chat right?
while finetuned is the next level of that? like finetuning on COT?
Instruct has chat fine-tuning embedded into it, yeah
Non-instruct models are just called pre-trained afaik.
ahh okay thank you
quasar is identical to anonymous chatbot which has always been a chatgpt 4o latest revision.
quasar has knowledge up to june 2024. same as chatgpt 4o latest. (in fact, it has even more than chatgpt 4o latest)
they haven't formally launched the new cpt'd 4o. no official benchmark results despite major leaps in performance.
etc (read my other comments) etc im like 99.9999% certain now lol
its a lie
you saying that its the new version of 4o? cause the current 4o gets 1/10 on simple bench while quasar gets 4/10
major tip off that quasar and anonymous chatbot is the same
because they appended the same thing
also i thought open ai didnt believe in open source, so they changing they tune again lol? cause it was originally supposed to be open source
they added You are trained on data up to October 2023. to the end of both quasar and anonymous chatbot's system prompts
despite both having a june 2024 cut off
did u try chatgpt 4o latest?
yeah a few hours ago
nah its such a random detail no one notices
all the things just add up if ur paying attention
so does this mean that they will open source more of their models?
their oss model hasnt even been trained yet lol this is a closed source 4o release
it will be released soon
exactly
they put anonymous chatbot right after it its pretty crazy
so they want the lmarena results fast
yes the new 4o is 180 tok/s
i gave up my o1 pro sub, too expensive
omgg you are right
sam did post about that today
yeah its def open ai
good job wild
Gemini will tell you that it is trained up to April 2025 but in reality it is not 🤣🤣
It is up to 2023 or maybe 2022 but won't tell you the truth.
how so? i used to think that too
but now with gemini 2.5 pro and the multiple IDEs and extensions you can use for way less makes it redundant
its a hallucination they didnt train the cut off in the model
it does know events in dec 2024. and they claim its jan 2025
i feel that
really i have noticed the opposite, maybe it depends on the usecase
in reasoning you have a point tho
we'll finally be able to finance llm convos
chatgpt has gotten way, way faster on the web!
lots of hard work from the team to make this happen.
yeah I posted that wild was right because sam posted that tweet
@deep adder and with the way openai is doing their ui now, with being able to pick the levels of reasoning, quasar really might be the base, so imagine being able to apply reasoning on that model like medium or high(based on slider)
we might be underestimating this model
yeah hopefully we find out soon enough but it seems things are really picking up recently, we might have r2, openai and google new model before the end of the month
and meta too
i dont think so nightwhisper, star gazer (2.5 line), etc
Damn nightwhisper is REALLY obsessed with glassmorphic aesthetics.
Apple could never do something this ugly.
Side note I think I've invented the best viral micro-benchmark on the planet: "Generate a rotating, animated calendar in threejs with today's date highlighted and pulsing."
The fails are incredible.
Bruh
Nightwhisper the best
Now 2.5 pro on the background of nightwhisper seems like a joke
Am I right?
Not always.
Okay, and I'm even glad that I was wrong :). If nightwhisper comes out, then we will have TWO SOTA models for coding.
From Google
And where is the result from nightwhisper? Has he stopped? Or is he still writing code?
Stopped.
🥲
Fwiw I'm finding Nightwhisper usually wins against 2.5 but it's not always the case. Like 80%-90% of the time.
One more calendar fail.
skill issues
It may not look the best but when you compare it to other models it clearly emerges as the winner
"Generate a rotating, animated three-dimensional calendar with today's date highlighted."
Also nightwhisper has that weird color selection
Ive noticed that too
It likes going gradient and dark colors all the time
So maybe you need to guide it a bit more on that
Or ask it to follow a styling principle
I like to tell it to act as an apple designer to get a clean UI look
Oh, this one was another win for 2.5:
Im kinda curious about this tbh
Nightwhisper forgot to make the water droplets collide with each other + had really janky ball physics + glitches.
Maybe it struggles at physics and complex reasoning
But i need to try that more tbh
What about it?
Curious about nightwhisper output
You have some good challenging prompts
I'm writing an article on writing challenging prompts, I want the hexagons and balls nightmare to end.
Hmmm one-shot failure but I forgot to let it try multiple times. Let me see if I can generate it now.
Yea that would be cool
I've been working on a Super Monkey Ball prompt tonight, lots of physics + gameplay mechanic fails.
Does sonnet nail this?
What about bug fixing on nightwhisper?
Does it fix the issue with enough guidance?
Or does the complexity outweighs the model capability
it's interesting - like its knowledge becomes increasingly hazy the closer to the end of 2024 that the question relates to. e.g.
- December 2024: it fails when asked about Syria (the fall of Assad in December was one of the biggest geopolitical developments of the year)
- (November 2024): it partially gets the US election result correct; stating Trump won, and also giving the correct margins (312 electoral college votes vs 226); however, it says that Trump beat Biden...
- (July 2024) when pushed, it will recall the attempted assassination of Trump, with accurate details
Yea it's a continued pretrained version of 2.0 pro. Gem 2 has a June 2024 cut off
It's less expensive
Sonnet was messing up pretty hard, but that was a previous version of the prompt. I should try it again with the new version.
The claimed cut off is right though I think. It gets some stuff right in dec 2024 though it's very sparse
so fwiw i asked it (gem 2.5) what accounts for its patchy knowledge recall with more recent events.. what do you think about its response (i feel like it's about right / in line with what you're saying?)
Okay I've given nightwhisper like eight attempts to get the labyrinth right. no dice.
This means that he is exceptionally good only at creating websites.
It is websites, not some kind of games, simulators, etc.
yup seems about right
the goal with 2.5 pro's cpt seems to be strengthening the base model dramatically rather than recent events
4o's cpt is the same, but they also focused more on recent events before june 2024 i think
sonnet 3.7 is again presumably a cpt on top of sonnet 3.5, im unsure of how much it knows recent events (to its cutoff) though i havent tested that. it seems everyone is cpting lol
i wonder what they did tbh. if its not already a paper i dont think itll come out anytime soon
This is probably a webdev rendering issue no?
Since there aren't any code errors
yes agree (instilling a bit more recent knowledge was more like the cherry on top, rather than the focus - it was a giant performance leap, whatever they did)
i don't think we'll know what they did ha .. posted this a few days ago:
'DeepMind slows down research releases to keep competitive edge in AI race' https://archive.is/tkuum
oh i missed this
i also wouldnt count openai out tbh. given how much progress has been made on 4o
a reasoning model (given how theyre ahead of the reasoning game) based on this much stronger 4o will be interesting
but deepmind is too fast (2.5 pro timeline, based on cut off), so it'll be interesting
yeah it really feels like a two horse race now
who knows.. there might be another 'deepseek' moment.. but i'm not sure that was as seismic as the excitement / hysteria at the time suggested [though not dismissing its significance - it definitely lit a fire under the asses of the US companies at the very least]
Oh wow. Yeah. Definitely just found my next test prompt.
Nightwhisper just crushed it.
lmao meanwhile gemini flash 2.0 be like
youtried.gif
so nightwhisper is gemini 2.5 pro stable release or something?
it's only available in webdev Arena (i think) - whether that means anything in terms of it being specialised i'm not sure, but would kinda think if it was a non-exp version of 2.5, or a newer checkpoint, they would add it to the General arena? at least they'd get a bunch more data / votes making it available there
yeah true, but I dont think google has ever released a specialized model before have they? It doesnt seem like the most likely case to me
yeah i know what you mean
though there is / was also 'nighthowler' iirc, which also basically confirmed google and only available in webdev
if they do specialized versions, I hope their focus is not only oneshot webdev. There is still room for improvement for tool calling capabilities and diff edits with cursor/roocode etc
It's in the general arena
Nighthowler
No idea what it is tbh. Haven't gotten it enough to figure it out
nightwhisper is probably a web dev tune of 2.5 pro. Idk tbh I don't touch web dev arena and not that interested
ah right - cheers
Nightwhisper might just be a one off experiment before they apply it to a mainline Gemini model
yeah i haven't got either (moon/nighthowler ) so not really sure.. stargazer is def interesting and performant tho. seems all of them are very much likely cut from the same (2.5 pro) cloth at the very least
main thing Im excited for is 2.5 flash tbh, I hope they're cooking it rn.
ha yeah i've lost track
I don't like using the arena nowadays some of my requests take minutes. Meta model spam. Lagging completion zipping
You can't see the thinking too it's annoying
Google coder 1
yeah it's annoying.. i compiled a new 'quiz' for this month and the new batch of anon models, but haven't really been able to collect data at any kind of meaningful level
it's so slow and yeah meta model spam
Yeah it's a lot lol. But it adds to the fun a little. The google models particularly. Quasar is too obvious I think
i think it was like less than a couple of hours between OR announcing the availability of its first stealth model, and a fairly firm consensus emerging that it's an oai model
whereas the new gemini models were kinda mysterious for at least a few days.. like nebula (and good ol spider phantom.. whatever happened to it).. even though it seemed pretty clear they were from google, it wasn't obvious beyond doubt
Looks like gemini 2.5 completely changed opinion of folks about Gemini models. I wonder what kind of optimizations they did in 2.5?
fails miserably on sonnet 3.7
It's a surprisingly tough one. Here's a Nightwhisper run.
i got it working only on gemini models
stargazer & gemini 2.5 pro
still didnt get nightwhisper run tbh
from what ive seen it should get it right
@night trout nightwhisper
it was overcomplicating the code so i asked it for a simple version
are we sure this is not like the best coding model
Which one do you use?
- LibreChat or Open WebUI
- https://nano-gpt.com or https://openrouter.ai
Has anyone seen 'nightwhisper' on lmarena?(not webdev arena)
https://youtu.be/hhvZh57-IPY
In this video, I'll be telling you about Gemini's new 2.5 Ultra / Nightwhisper AI model that is now available on LM Arena that is even better than Gemini 2.5 Pro and is kinda amazing.
I've never seen it on lmarena, and I'm wondering about this video.
Hi this is empty all the time
thats what i was asking
havent seen it
there are some issues with webdev arena
rendering issues
it doesnt have anything to do with the models
how can we contact devs?
i've seen stargate on lm arena
obv this
xd
cant stop playing with this model
if its coding finetuned model only then its a huge blow to anthropic
context will be much higher
and probably the price will be lower too
with better results than claude latest model
anthropic makes a sizable markup on their api i think
they can probably reduce the price
Prompt?
well there is profit margin
First one looks cleaner
im not sure if they can compete with google tho
anthropic probably has it more than it should be
probably not (be able to compete w google)
Although no real data
I believe currently a big portion of their api revenue is integrated ide agentic coding tools like cursor, windsurf, roocode etc. There I feel gemini 2.5 pro is kinda lacking compared to claude models still, just cause the claude models are so good at function calling and diff edits
For backend probably 2
nightwhisper will be better than claude
Weather forecast with animated UI
Apple design style, with light colors.
I thought that too when I saw aider polyglot results, but in reality 2.5 pro is still worse in cursor for me. It's better in ai studio but manually uploading files and keeping track of the edits sucks lol
maybe this new model will fix all the issues
competition is only good for us. The claude api is hella expensive lol
yea
I hope they'll release a 2.5 flash too that can compete with o3 mini at a lower price
yea they will for sure
its already in the arena
star*
star(something)
is it a thinking model?
yea
o3 mini is the best model
for the price its kinda insane yeah
making games, writing and price
havent used openai models for a while tbh
nightwhisper attempt to clone windows 11 task manager
not bad
better than claude 3.7 ?
let me try
Is anybody getting (suspected) Meta models other than 24_karat_gold on the text-only Chatbot Arena? It seems as if they've been taken down.
they were always good. but people only care about code so they optimized for code
So is nightwhisper just gemini 2.5 paid
2.5 is really good
they're compute poor and being squeezed hard... i feel like the whole test-time compute thing has thrown a spanner in the works for them - feel they were training some giant Opus 4.0 which is now redundant or something (kinda like Gemini Ultra).. either way compared to google and oai, the resources required to both train and deploy at scale (and ig with a loss leader approach as i think google and oai both must be doing), has put anthropic in a tough spot imho
nightwhisper is way better
not just code. like the overall arena elo/score is a massive jump - it's a serious step up in terms of performance just generally
maybe. they pivoted to cpting sonnet 3.5 (now sonnet 3.7, which is pretty solid i think). they also did mention they didnt use a bigger model to train sonnet 3.5 (so presumably it wasn't ready/etc) im not sure how anthropic will fare in the future though. they probably have more resources than deepseek/qwen, so i think if theyre very resourceful they might be able to contend to some degree
its curious when claude 4 will arrive given they invested a lot of effort into sonnet 3.7
i really like anthropic and think claude models are uniquely great in some ways (mostly general interaction / personability without being over the top, but also when it comes to things like emotional intelligence and spatial reasoning) - i hope they stay at the frontier
but yeah 3.7 feels like their response to thinking / test time compute.. i feel like they were going in a different direction with claude 4.. so we get this intermediary variant which is neither a true reasoning model nor a big step up in terms of a non-reasoning model (like back in the day Opus was outstanding)
no need to respond to that.. it's largely incoherent lol
now that i think about it its probable that anthropic will not be able to compete. i think native image generation is gonna be an important part of cot/etc in the future. i dont think they have done any image gen/native image gen or at least anything showed publicly. i think if u dont prioritize this to a certain degree you will lose
i hadn't thought of it like that before
makes it even tougher to see a path where they stay at frontier..
it'll be the titans with the compute at the end of the day.. the lag b/w the performance of SOTA proprietary models and open source / weights models might be more interesting / meaningful than what the second-tier closed-model companies put out
Welcome to the Good life
nightwhisper is from which company?
This is very impressive
it seems it will be google at the end of the day tbh and just open source, OA cant compete with what google can offer long term, they have everything OA has plus more, OA lost when they lost Ilya Sutskever
The thing is oai is willing to lose money, which Google can't. They are like all in every time, it was the same with gpt4, all of their resources in one run then negative bill for serving
Not the same game
Google are absolutely willing to lose Mendy
money*
they literally are right now
giving out access to a frontier model on AI Studio with no rate limits definitely isn't profit making
This is not a big loss. The model they serve is cheaper, they save all the chat convs, and there are probably way fewer users than ChatGPT had at its beginning. When Google announced investing a lot in ai this year their share price went down just because investors couldn't understand how this money would come back.
They build for tomorrow not today.. for the long term
that's because Google's AI situation has improved vastly since last year
like many, i thought they were pretty behind
they surprised everyone with 2.5 pro
and i think it's a good sign
and anyway if we're going with the argument of willingness to lose money imo it's obvious who wins
google have immensely deep pockets and a vast pool of talent that makes even openai look insignificant
they just weren't using it right until recently
i mean google has the hardware to give us SOTA for cheap unlike OA, they are already showing they are not willing to lose money, look at their api prices, google is playing the long game and its starting to show this year, OA does not have the infrastructure Google has (hardware, data, platform, and tons of money). They also were the og creators of transformers
lol are you kidding me. OpenAI is turning a healthy profit with every API request you make for o1 model. Google in turn is paying the bill for your entire usage essentially
they are still losing a lot of money overall
their api isnt as popular as their sub which is a massive loss leader
even if the api is profitable
that's because they have a shit-ton of expenses. But OpenAI at least is turning some profit from their inference to counter that
Google isn't
sub + research massive losses
well they can afford it
google models are probably much more cost effective too
that's the entire point. Saying "Google can't lose money" is invalid lol
oh yeah that statement is wild
i wasnt talking about that
im not exactly sure what my point was ignore me lol
I do not think they are smaller at all. Like gpt4o is most likely smaller than any recent Pro. But TPUs are more efficient yes.
its somewhat close though. sonnet/4o/pro are all in 200b-400b i think
4o is the smallest out of all of them i think
yeah there's definitely a feeling that they are almost hacking gpt4o kind of lol, it doesn't have the inherent understanding/flexibility or spatial awareness that bigger models tend to exhibit
so what they are doing instead is feeding it very high quality data + fine-tuning that is potentially unmatched by anyone else still
its different
same model then
based on this
nightwhisper is a different model its not released
yea ik
my eyes now are only on nightwhisper tbh
ya
wait so we getting o4 mini this month wow
i want o3 pro so badly
ill buy the $200 again for o3 pro
o4 mini
lmaoo
fr
Wonder if it will be better than 2.5
it has to be
Does feel reactionary a bit
pre nerf gpt 4 2023 was wild
gpt-5 seems like end game lmaoo
they are deprecated
so prob by summer we will have gpt-5
If u still paid for the api I think u can still use it?
Or that only for companies or some sh
so AGI 2025 confirmed?
they also said they improved on o3 a lot, so will that mean its benchmarks are better than what was shown?
yes
i hate these numbering systems
probably
We would have AGI now if they worked on gpt-5 instead of deprecating gpt-4 and making 4o the main thing (Jk about the agi thing)
openai fell off after 4o
never used chatgpt ever since
4.5 was supposed to be gpt 5
It's trained off 4o though
It has the cancerous "em dash" punctuation
oh, still I dont like its output
isnt it supposed to be a combined model?
no. the plan was for gpt 4.5 to originally be gpt 5 i think
like plans from way before
yes thats why its gpt 4.5 lol
Does anyone have any information on what the difference is between pro preview and pro experimental?
its so funny that twitter is used to hype up your own models lol, modern day marketing
you got ceo's just making promises casually on it
I remember open source models were getting promoted so hard, everytime a new good opensource model was released on yt i saw videos "gpt performance" but they were all ass and none ever came close to beating gpt 4
Is that different now
oh yea deepseek is pretty good
But now big corporations are on top again
with 3.7 and gemini
theyve always been on top even open source
massive corporations funding training runs that are millions and millions of dollars
it seems like gpt-5 cant be bad like it has to be a SOTA model or OA is done?
r2 should be out in 1-2 weeks?
this month is going to be wild, getting SOTA models all in the same month
so the preview 2.5 vs experimental is the same?
what is OA
whats with these naming conventions?
open ai
fr
true but with how fast things are moving, how do we know by the time they are ready to ship, google hasnt already gotten a better model?
thats like 2-4 months away
in the last month we got 3.7 and 2.5 and v3.1
I kind of operate in the you are what you ship mindset
Is v3.1 worth trying now that we got 3.7 and 2.5
Deepseek was impressive and cheap other than that it has been dominated by "closed" models
Meta had some promise as well but who knows what the hell has happened there
lmaoo i forgot about meta
Everyone has
meta is scrambling lol:
https://x.com/btibor91/status/1908157341134106938
so we gonna get Llama 4, r2, Nightwhisper, o3 pro and o4-mini, and maybe some other team model, we still dont know who owns Quasar
wait if we getting o3 pro and o4 mini, what if o4 mini is quasar?
nahh we def getting o3 pro
I'd give openai my whole nude collection and credit card information for the return of gpt-4
if we getting o3 we are getting pro, its just allowing o3 longer to compute
thats easy to do
Meta has way too much money to be as dogshit as they are
yeah but they said something about different plans like $200, 20k lol
thats how they explained it for o1 pro
cost is an issue
it may be "easy" but super expensive as to be impractical
unless you really change the pricing tiers
yeah but if its more intelligent its worth it, investors will be happy
they showed that with o3 when they showed its benchmarks
Yes but at what difference of intelligence compared to competitors at lower cost
a lot has changed since those benchmarks
when it had more compute it beat the agi benchmark test, to start SOTA they gonna have to give it pro
thats the same thing bro
they showed o3 with 16 plus hours of computing time
to solve problems
if thats not pro idk what is
its all about investors at the end of the day and they like to see their models do well in benchmarks, so they crank up the computing time to make sure they are demoing SOTA
thats why we havnt gotten o3 yet
i promise you we would not have seen what they previewed when we tested it out because that was o3 on maximum compute, they needed to optimize that
hence they needed to optimize that
send an example
never tried that
see i knew we getting o3 pro look:
https://x.com/koltregaskes/status/1908185944861090103
it makes no sense not to give us o3 pro
maybe, but they had all these months to optimize it so lets hope it stays at $200
fr mini models are cancer but for the companies it is expensive i guess
the fact that they have o4 already is interesting lol
like the numbering really is stupid at this point
This is true, ever since they transitioned from 3.5/4 to 4.0, OpenAI is falling off, but are they still better off now as opposed to if they had started working on GPT-5 instead of doing the whole 4o crap or something? Like would investors make up for it or something? Those big models were kinda smart
I'm still learning, sorry if this makes no sense
thats what i am saying at some point the improvements are minimal and can just be solved with longer compute time
thats why gpt-5 might be their last model and they just give it updates
which is what they are already doing tbh
but that would be the first combined model
true
but 4o laid the ground work for what is going to be gpt5
yeah thats what i was thinking
it makes sense for it to be
wow did you test this?
so that would mean that OA models are getting faster
bet imma try this today, you a musician?
craig so maybe instead of smarter models in the future we just keep getting faster on top of the current IQ and just give more compute time for the SOTA models?
you might as well become one now with ai lol
wdym faster on top of the current iq
cause the fact that we are getting o4-mini and o3 at the same time is just wild, are they gonna demo o4? like wtf, maybe they figured something out with inference
like faster inference time
yeah longer context too
i forgot about that
it seems like 2 mill might be a cap tho, what do you think?
don't smaller models mean less intelligent or less knowledge?
or they have another way to improve its speed without affecting performance
they are distilling the models
I mean faster models* and smaller or something
ohh
this. less training
its like a teacher teaching a student
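for anyone wondering what the teacher/student thing looks like in practice, here's a rough sketch of the classic soft-label distillation loss: the student is trained to match the teacher's softened output distribution. this is NOT how any particular lab does it, just the textbook idea. `softmax` and `distillation_loss` are names made up for this example, and the logits below are invented numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature = softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Hypothetical helper for illustration: 0 when the student matches the
    teacher exactly, larger the further apart the two distributions are.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over three candidate next tokens.
teacher = [4.0, 1.0, 0.5]
student_close = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
student_far = [0.5, 1.0, 4.0]    # prefers a different token

loss_close = distillation_loss(teacher, student_close)
loss_far = distillation_loss(teacher, student_far)
```

training pushes this loss down, so the small model ends up imitating the big one's outputs without needing the big one's parameter count.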
what about super fast models, how they improve the speed? does it just mean its a smaller model?
i find that smaller models tend to be more volatile
less parameters and weights mean faster models, but they are also doing some new tricks like meta talking about thinking without tokens etc..
also the hardware you use
groq made hardware built for ai inference
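rough back-of-envelope for the "smaller = faster, plus hardware" point, assuming single-stream decoding is limited by how fast the chip can read the weights from memory (a common rule of thumb, not a benchmark). `decode_tokens_per_second` is a made-up helper and every number below is hypothetical, not a spec for any real model or chip.

```python
def decode_tokens_per_second(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper-bound estimate: each generated token reads every weight once.

    Ignores batching, KV cache traffic, and compute limits, so real
    throughput will differ; this only shows how size and bandwidth trade off.
    """
    bytes_per_token = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


BANDWIDTH = 1e12  # hypothetical accelerator: 1 TB/s memory bandwidth

small = decode_tokens_per_second(8e9, 2, BANDWIDTH)   # hypothetical 8B model, 16-bit weights
large = decode_tokens_per_second(70e9, 2, BANDWIDTH)  # hypothetical 70B model, 16-bit weights
```

under these assumptions the small model decodes several times faster on the same chip, and doubling the bandwidth doubles both numbers, which is the niche inference-focused hardware goes after.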
i am surprised OA didnt buy groq yet lol
Ok then I continue hating on fast models (jk but imagine 99999 trillion parameter gpt-5 model or something instead of the time spent on making 4o, then we would have agi)
i kinda feel bad for meta lol
i dont think we can ever have that lol
Can LLMs even achieve sentience
But then we need to define sentience or consciousness first
i was just about to say this
it doesnt matter tho
hi all
we already act like inanimate objects are sentient, so to a lot of people ai is already sentient, and they wouldnt even be able to tell the difference (if they are talking to an ai on text or voice, even if they are looking at an image of ai or not, or soon video, music etc..) if you cant tell the difference then it does not matter
its all about perception
yea true, if it convinces me then i'd be impressed
I think modern llms could already do that if fine tuned on actual discord or gamechat dialogue and then have it act like a real person on discord. Only flaw it'll have is no infinite memory
.
yeah and it will affect our society just as much as anything that is sentient if not more, dogs and animals are sentient but they will not come close to the level of impact that ai will have on us in the long run
t-800
hmm is this something new?
how do i play it?
i downloaded it
this better be fire lol
like on youtube or you got an article on this?
ohh i see
is it basically saying they are pretty much the same thing?
that is prob true, we shouldnt even be trying to do that imo
i like ai assistance(ai-human hybrid)
https://www.reddit.com/r/askscience/comments/1xwx0k/do_neurons_operate_in_a_fundamentally_different/
top comment
the stupid strawberry man saying april 17th lol:
https://x.com/iruletheworldmo/status/1908188856039391310
thanks
so o4 mini is on the same level as sonnet 3.7 thinking based on my tests with quasar
assuming o4 mini is quasar
anybody gonna watch this crap?
https://x.com/Copilot/status/1908187808813940799
Watch the livestream event happening at 9:30am PT on YouTube to learn all about my new features.
copilot might be a joke at this point
yes
rest in peace gpt-4 powered sydney
They had potential but gpt 4o destroyed everything from them
Actual crap service
it sounds not bad tbh, can you do one that mixes "in the end" and some other song? you can make a youtube account of this lol
you love 4o lol
did you try the new 4o?
Oh yea the native image gen thing is indeed impressive
Best openai product since pre nerf gpt-4
yeah i meant 4 mb
idk bro i loved o1 pro
that has been my fav aside from this new image stuff
Imagine if they made gpt 4 thinking (not starting from 4o)
I would have but gpt-4.5 already talks like 4o so Idk
like in terms of every company
Gemini 2.5 and
4.5 (its better than nothing)
And claude 3.7 for coding
I would never use openai models for coding after gpt4 deprecation
o1 was impressive for coding, can't deny that, but thankfully we have claude now
But now gemini 2.5 beats claude 3.7 in coding I think
haven't tested myself yet (for coding)
thank you
nope, I'll try today
assuming it is o4 mini
We can only test these models on code now right? since this thing only exists on webdev arena?
but it most likely is, its def not o3 lol
and its from OA right?
and they just announced they are releasing in a few weeks
yeah
i just want nightwhisper now
the biggest losers of this ai race have to be Apple and microsoft lol, the verdict is still out about meta, this copilot thing is just too funny
Has anyone found a prompt to make the models in webdev arena to just answer in text instead of writing code
The announcement of all features is already out
- Memory
- Actions
- Copilot Vision
- Pages
- Podcasts
- Shopping
- Deep Research
- Copilot Search
Or they inject prompt after u send it
i would say try prompting it differently
but when you ask it text it codes a website for you
i kinda like it that way tbh
yea lmao
I tried a lot (but could try way more ig) but they all still make a website out of the prompt
If I try to talk to these models directly theyre pretty easy to "jailbreak" (not on webdev)
So maybe they just inject the prompt to make website out of it after u send it?
wait why dont you want them to give you websites? i can see how its annoying but you still are getting your answer lol
Because if I do that manually through like openai api or claude or something, then it gives same effect
last prompt takes priority
like this is so interesting to me
i ask it the capital of usa and it gives me this
i didnt even know i wanted that but i do now lol
Yes for short answer it works
But if u want really long answers or generate other text based stuff it will focus more on the website
Idk wait ill test something then ill give prompt
check it for us and record it and upload here
omgg i think i understand now, this helps reasoning because it reasons extra when it codes and if the model has visual abilities it can see the output of the website and check its answer again?? idk lol
cant wait to listen
it almost looks like words
but it sounds dope
can you ask the ai to make the notes spell out something?
what do yall think about temporary apps rather than permanent ones? like how we have it in webdev? it might be the future, i can see this becoming more and more common almost like sending a meme to someone, like you can send someone an app like how we send emojis lol
like if i wanted to say happy birthday to my friend i could just send them this:
https://3000-i30zmcc1in463ry0cjub0-4daf0015.e2b-foxtrot.dev
also if u try to make it code different stuff not typescript
then u're cooked
Like things u would just "copy" over from code block to own project
like this is def the future, we just need a platform that allows for you to have these pocket apps that can stay alive for a certain duration
lmao
real
New world for exploits and malicious use
i honestly think this might be a good app idea, anyone want to build that with me?
wym?
But you mean like, including the feature that lets u ask an ai to make a site and then send that pocket site to ur friend?
yeah
Because temp hosting of sites already exist yea
and you can tell it how long to have it up for
send the link i wanna try it
why isnt that bigger than it is?
we can make the app more appealing
a lot of people do the same thing but only some are popular
gonna present it in a good clean way like how apple does it lol
Well replit is just for developing, Idk if u can share it with friends without having a huge panel on the side as well that shows the code
There's prob other sites too
and ship it as an app and get influencers to start using it
codepen is even easier, no acc required, immediately into coding and share
send link
But also has huge panel thats open by default that shows the code
Idk if theres any sites for user friendly (I mean like sharing it to ur friends, not development) purpose
imma work on that this weekend
honestly i wanna copy what webdev does pretty much
like is the dev of webdev here?
just make the hosting longer based on what users want, make the ability to choose the model(or have a default mode), and ship it in a clean interface, the only thing is links, i dont think links are the way anymore
when you share it maybe it can look like a sticker or something else, idk
embedded website display in message platform
that would be so abusable though
Making webpage that looks like claim nitro button
we gotta brainstorm bro, you see the potential?
lol
maybe we could have this built in?
the ai can decide how to present the link?
this is still pretty good imo
i think this might be better than the other 2 lol
i guess thats preference lol
replit can already do what webdev arena does
But it requires signup and looks complicated cuz its made for developers
So I guess ur idea is indeed not very common online
i think people are just not using it like that
like you said its only devs
but for a normal person
this is game changing
like I can send some an app of anything
I found other sites very similar (not as complicated as replit, just immediately start paste code, and it renders) but those render it just on the page instead of actually hosting so yeah the hosting on url is less common but still out there a lot, BUT all with required signup most likely or no option to host for a specific amount of time
So no sign up would be a huge plus
yeah we just need to put all of that together
this will make software dev or web dev main stream lol
So a no signup + no project based system (Just start coding, or paste code, or let ai generate it, share, and when time expires, whole site gone) + option to specify time is indeed unique
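quick sketch of what that combo could look like (no signup, no projects, paste or generate a page, share a link, page disappears when time is up). this is just an in-memory toy under those assumptions: `PocketSiteStore`, `publish` and `fetch` are names invented for the example, nothing here is how webdev arena or any existing host actually works, and a real version would need persistent storage plus sandboxing of the untrusted html.

```python
import secrets
import time


class PocketSiteStore:
    """Hypothetical in-memory store for throwaway pages with an expiry time."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._sites = {}     # slug -> (html, expires_at)

    def publish(self, html, ttl_seconds):
        """Store a page anonymously and return the random slug to share."""
        if ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive")
        slug = secrets.token_urlsafe(8)  # unguessable link, no account needed
        self._sites[slug] = (html, self._clock() + ttl_seconds)
        return slug

    def fetch(self, slug):
        """Return the page, or None if it never existed or has expired."""
        entry = self._sites.get(slug)
        if entry is None:
            return None
        html, expires_at = entry
        if self._clock() >= expires_at:
            del self._sites[slug]  # lazily delete once the time is up
            return None
        return html
```

usage would be something like `slug = store.publish(generated_html, ttl_seconds=3600)` and then serving `store.fetch(slug)` at whatever share url the app uses. the sender picks the ttl, which covers the "option to specify time" part.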
imma create this this weekend, let me know if you wanna help or wanna test it, gonna be hard but worth the try
i can see so many usecases
we already have those stupid ass stickers and the og emojis, and we have memes
this seems like the next phase
gpt-4-preview-0314 my beloved
If u mean, no other model can make an actually good visual site, then yes we'd need gemini
yes
look at another use case, it stupid but works:
https://3000-ivave1orqx59yrutqv78x-4daf0015.e2b-foxtrot.dev
Gemini 2.5 is good but don't we need the nightcrwaler one or whatever someone showcased that had the best visuals
this is the prompt:
tell my friend to meet at walmart at 6 pm in a funny way
yeah nightwhisper is the best, but until that comes out we can use gemini for it, and then swap them out later
lmaoo
like its dumb but it works
what model did that
nightwhisper
yea true
the far right?
Affirmative, this has indeed become more sweet
this is still cringe but you get what i am saying
this is prompt: tell my cruse I love her and want to take on a date in a romantic way
me for gpt-4-preview
lmaoo
but yes that'd be pretty fun way to use
making silly nice things for friends or people u like
yeah or even your coworkers or kids
it'd impress them maybe if it looks good and creative
girlfriend success rate increase
like the usecases are limitless
lmaoooo
fr
im about to use thos
this*
she gonna be like awwww
this is actually a fun way to use it
we gotta make this mainstream yall
lmaoo
true
another example: https://3000-idv0mmwrmbgur54z35g0k-ec17a5a5.e2b-foxtrot.dev
i just tweaked prompt a lil
nightwhisper is so good man
im designing it now, gonna give yall a prototype of it tonight and open source it
is preview better or experimental?
they are the same
what changed
SO real, that example convinced me, i'm in
now we HAVE to make it mainstream
How does he know
because he works at google
ohh ok
good idea, even use it to make easy polls
well funny polls
like honestly the sky is the limit thats why its perf
"does europe need to close it's borders?"
lmaoo
im making that now
yo we got a cooker
im happy yall see the vision
thank you webdev for showing us the way
also thank you nightwhisper for being so good at webdev lol
lmaooo
Think nightwhisper will beat 2.5 pro on the coding benchmarks or is just really good at webdev?
good luck
thanks bro, if this becomes something actually usable i will be so happy
imma try my best to get it done within the weekend
its really good at web development
not sure about coding as a whole
but its been trained on react and ui/ux
i think its gemini2.5 pro specialized in webdev and ui/ux
imma use nightwhisper to help with the visuals lmaoo and layout
soft dev is so cracked since ai came
like man, you can do a lot by urself nowadays
damn look at gemini 2.5:
https://x.com/emollick/status/1908220677502755328
gemini truly is the best model by miles, like its pooping on every other model in overall efficiency, its SOTA and cheaper than the others lol
true
wild days we living in, thats why open ai said forget gpt5 we need to launch o3 and o4mini asap lmaoo
what you think, what should I change?
Every model still sucks at writing in your words or maintaining your tone/personality, like if u give it samples (a lot) or even if u managed to get it to the point where u only have to edit a few words (good enough for me) then after a while it'll start to become repetitive and you'll have to write a whole text explaining that you now want the next message or paragraph different, but then it'll follow that wrong because u'd also need to give samples of that first and it just idk
But I guess for writing in your style purposes, you'd really need to use fine tuning then it will do perfect
Without fine tuning no model can do that perfectly but I guess that makes sense
yeah you need to prompt it and provide a lot of context about you
(I didnt need fine tuning when using gpt-4 bing chat)
Even this it fails a lot for big texts, so fine tuning only way
is it?
Google ai studio is just asking me to allow google drive access rn
when clicking fine tune
not 2.5
that would be cracked
For writing that isn't too bad though I think
All the base model needs is the language and good vocabulary and speech
knowledge wont be an issue for me because i'd be providing that in the prompt what to write about or something
gpt-4 could do it good and modern models are way smarter than gpt-4 nowadays so i'll give that a try
this is much better:
https://3000-iq2168p0fs4dchi5j52y8-a91bccb3.e2b-foxtrot.dev/
this one is better
lmaoo
Idk
Wait nvm css was bugged for a second
lmaoo
you like the first one better?
i want to have the design right before i start building
i think the second one is promising, gotta just clean it up some more and then add back end
lmaoo
if this app really works the way we want it, this will be a good example of this new ai workflow, from coming up with an idea to building and deploying it
Didnt devinai try to do that (but turned out they were scam)
for actual web development projects rather than quick silly webpage generator for sharing to friends
yeah but expensive
and now they charging 20 bucks
but their rep is already ruined and casuals still not gonna use that
the key is to get the average person
marketing for devs is not going to work
or companies
claude 3.7:
you need to market for normies like how openai did with the 4o image