#general | Arena | Page 14

balmy mist Apr 7, 2025, 1:52 AM

#

is the key to jailbreaking just momentum?

frozen skiff Apr 7, 2025, 2:02 AM

#

whats anonymous-text

#

Is that llam4 behemoth

frozen skiff Apr 7, 2025, 2:19 AM

#

its behemoth

#

🤣

#

its shet as expected

#

😢

balmy mist Apr 7, 2025, 2:45 AM

#

how you know its behemoth?

red sluice Apr 7, 2025, 2:46 AM

#

Flannel is very decent on hard prompts

frozen skiff Apr 7, 2025, 2:50 AM

#

balmy mist how you know its behemoth?

Cuz

#

It constantly says

#

Its llama 4

#

Andtheres only 2 llama 4 models that we're expecting

#

Reasoning or behemoth

#

It isnt a reasoning cuz its replies almost immediately

#

Its obviously behemth but its a failure

red sluice Apr 7, 2025, 2:51 AM

#

Harley sucks though. Both are from meta, but damn Harley sucks really really bad on hard prompts

frozen skiff Apr 7, 2025, 2:51 AM

#

red sluice Flannel is very decent on hard prompts

Crytsal

#

What about

#

Creystal

red sluice Apr 7, 2025, 2:52 AM

#

Didn't see it at all, got like 7 harley, 5 flannel, 0 crystal 😭

frozen skiff Apr 7, 2025, 2:52 AM

#

have u tried the

#

new

#

anonymous-test

torn mantle Apr 7, 2025, 2:52 AM

#

red sluice Didn't see it at all, got like 7 harley, 5 flannel, 0 crystal 😭

same

red sluice Apr 7, 2025, 2:52 AM

#

Nope I've just been cooking for 2 hours now, really checking everything before voting, will tell you what I think of both if I see them

frozen skiff Apr 7, 2025, 2:53 AM

#

red sluice Nope I've just been cooking for 2 hours now, really checking everything before v...

bet

#

anonymous-test is probably behemoth

#

its very sht

ivory schooner Apr 7, 2025, 3:02 AM

#

frozen skiff anonymous-test is probably behemoth

24k可能也是behemoth模型~

frozen skiff Apr 7, 2025, 3:03 AM

#

ivory schooner 24k可能也是behemoth模型~

true or maverick with a different system prompt

ivory schooner Apr 7, 2025, 3:12 AM

#

frozen skiff true or maverick with a different system prompt

hmm

frozen skiff Apr 7, 2025, 3:36 AM

#

wtf happened to riveroaks

#

it started typing like llama all of a sudden

fluid plover Apr 7, 2025, 5:13 AM

#

Is deepseek the best free unlimited model?

grizzled agate Apr 7, 2025, 6:03 AM

#

👋

#

Hi all! Here to learn and follow best 🙂

willow grail Apr 7, 2025, 6:10 AM

#

with javascript: is your xp also that gemini2.5 is bad with 10 files for one project and works better with one huge file?

fluid plover Apr 7, 2025, 6:11 AM

#

yeah like deepseek and ai studio

#

i think ai studio is the best unlimited

#

cuz you can use the best model

harsh flume Apr 7, 2025, 7:00 AM

#

If I had to guess based on the nickname alone withouth seeing any prompts my gut says 'riveroaks' sounds like an OpenAI alias

#

There's Crystal model that by the style of response should be a LLama variant

#

its over-the-top answer style

#

makes me picture this

torn mantle Apr 7, 2025, 7:27 AM

#

harsh flume If I had to guess based on the nickname alone withouth seeing any prompts my gut...

nah

#

thats llama 4 behemoth for sure

#

does anyone still use grok 3?

#

also what happened to Mistral?

harsh flume Apr 7, 2025, 7:33 AM

#

torn mantle does anyone still use grok 3?

I use it sometimes, esp for the deepsearch

torn mantle Apr 7, 2025, 7:33 AM

#

the only noticeable progress im seeing is from chinese labs and google gemini tbh

harsh flume Apr 7, 2025, 7:34 AM

#

And I like the prompts it writes sometimes for me to feed into a coding llm

torn mantle Apr 7, 2025, 7:34 AM

#

harsh flume I use it sometimes, esp for the deepsearch

i used that feature for a week

#

and i hated it

#

they even introduced deeper search later

#

it was so bad

#

like really bad

#

so many issues with their implementation

#

you start to think if its a method issue or just the model is so bad at putting pieces together

harsh flume Apr 7, 2025, 7:35 AM

#

Its fast enough and gets some Twitter context which is nice. I just don't like that it sometimes fetch too old of dates if not directly specified even tho it's fairly obvious my query is for recent stuff

#

It's def better than the search from gpt but worse than deepresearch

torn mantle Apr 7, 2025, 7:36 AM

#

harsh flume Its fast enough and gets some Twitter context which is nice. I just don't like t...

its a nice feature for x

#

but thats it imo

#

their thinking model also seems so inefficient

#

keeps going in a loop a lot

#

idk if thats a part of their parallel thinking or just some reasoning bug

harsh flume Apr 7, 2025, 7:38 AM

#

Yea with that I agree. I was using it a lot before 4o latest update, but after that it just took the spot as my go-to for general stuff

torn mantle Apr 7, 2025, 7:40 AM

#

harsh flume Yea with that I agree. I was using it a lot before 4o latest update, but after t...

i actually stopped using chatgpt in general

#

be it o-series or gpt-series

#

https://x.com/vitrupo/status/1908763535351669017

vitrupo (@vitrupo) on X

Anthropic Chief Scientist Jared Kaplan says Claude 4 will arrive "in the next six months or so."

AI cycles are compressing — "faster than the hardware cycle" — even as new chips arrive. Post-training and RL are accelerating progress. No signs of slowing.

#

i think they are taking competition way too lightly

#

if they think they can keep their market share intact in those 6 month then good luck

#

anthropic models or mainly sonnet 3.5/3.7 are used for coding

#

but what will happen if another lab introduced a powerful coding model?

#

kinda crazy thing he said that knowing that most companies has private information about competitors

#

they should know google is working on a specific tailored model for coding

harsh flume Apr 7, 2025, 7:45 AM

#

I'm expecting something cooler coming from the Asia side than anything west rn after gemini

torn mantle Apr 7, 2025, 7:46 AM

#

harsh flume I'm expecting something cooler coming from the Asia side than anything west rn a...

there are two models to look for

#

deepseek r2 and kimi 1.6

#

ah and models from alibaba too

harsh flume Apr 7, 2025, 7:51 AM

#

Yea. Tho I feel like none of the Asian models have really weaponized LMArena to the degree we're seeing from the western companies, wonder if they're gonna be more interested in doing so on this new cycle

calm sequoia Apr 7, 2025, 8:02 AM

#

torn mantle Apr 7, 2025, 8:18 AM

#

calm sequoia

Meta intern denied that

calm sequoia Apr 7, 2025, 8:45 AM

#

Not sure though if one person denial is worth anything

#

The data pipeline is long

#

You need only one bad actor in the chain

keen rivet Apr 7, 2025, 8:51 AM

#

Is there a way to improve the tool calling with Gemini 2.0 Flash? My current app works very well with GPT-4o but Gemini doesn’t call tools sometimes

lime coral Apr 7, 2025, 9:22 AM

#

torn mantle Meta intern denied that

It’s true

#

I remembrer a meta lead researcher talking about that in the open saying according to him it’s common in the industry and that scaling law are clear and you can’t have huge diff in perf without cheating (ofc bs). It was this guy https://x.com/armenagha/status/1859646650714821012?s=46

Armen Aghajanyan (@ArmenAgha) on X

Say hello to our new company Perceptron AI.

Foundation models transformed the digital realm, now it’s time for the physical world. We’re building the first foundational models designed for real-time, multi-modal intelligence across the real world.

https://t.co/nw8cHWsvKh

#

This girl was also previously at Meta and she is saying since a longtime now that Llama 1 was trained on the test set https://x.com/suchenzang/status/1909070231517143509?s=46

Susan Zhang (@suchenzang) on X

Company leadership suggested blending test sets from various benchmarks during the post-training process

If this is actually true for Llama-4, I hope they remember to cite previous work from FAIR (Llama-1 and https://t.co/RSBWw8taHS) for this unique approach! 🙏

torn mantle Apr 7, 2025, 9:26 AM

#

lime coral I remembrer a meta lead researcher talking about that in the open saying accordi...

its more like the low hanging fruits are all consumed

#

if you dont innovate then you are stuck with a gpt4 level model

lime coral Apr 7, 2025, 9:27 AM

#

torn mantle its more like the low hanging fruits are all consumed

He was coping. You can still make huge progress lol

torn mantle Apr 7, 2025, 9:27 AM

#

yea

brittle tiger Apr 7, 2025, 9:27 AM

#

No one ever gonna release on a weekend after this lmao

lime coral Apr 7, 2025, 9:27 AM

#

They are just not built for this mentally

torn mantle Apr 7, 2025, 9:27 AM

#

brittle tiger No one ever gonna release on a weekend after this lmao

xd

sage raptor Apr 7, 2025, 9:27 AM

#

is nightwhisper still in web arena ?

torn mantle Apr 7, 2025, 9:27 AM

#

sage raptor is nightwhisper still in web arena ?

removed

keen beacon Apr 7, 2025, 9:27 AM

#

they released it on the weekend to prevent an even worse release when u have qwen 3/etc to compare

#

it was originally slated for monday (based on a commit) so 🤔

torn mantle Apr 7, 2025, 9:28 AM

#

qwen dev denied that too

#

they are still working on it

#

it may come this week tho

keen beacon Apr 7, 2025, 9:28 AM

#

torn mantle qwen dev denied that too

what the llama 4 commit where htey changed the date to the weekend from monday?

torn mantle Apr 7, 2025, 9:28 AM

#

i have no idea tbh

keen beacon Apr 7, 2025, 9:28 AM

#

im not talking about qwen lol

torn mantle Apr 7, 2025, 9:29 AM

#

ah you mean llama 4 was scheduled for monday

#

yea i saw that

#

idk that was weird

keen beacon Apr 7, 2025, 9:29 AM

#

torn mantle idk that was weird

they probably got a heads up on a release tbh

#

whatever the case next week

#

it doesnt make sense otherwise

sage raptor Apr 7, 2025, 9:30 AM

#

the new llama is so bad at coding

torn mantle Apr 7, 2025, 9:31 AM

#

https://x.com/gurgavin/status/1909159289140269069

GURGAVIN (@gurgavin) on X

ALIBABA SHARES JUST CLOSED TRADING IN HONGKONG DOWN 19%

MAKING TODAY THE WORST DAY EVER IN ALIBABA’S HISTORY

#

woah

brittle tiger Apr 7, 2025, 9:32 AM

#

Only thing that can save economy at this point is early singularity

torn mantle Apr 7, 2025, 9:32 AM

#

https://x.com/vibagor44145276/status/1909138204672053625

vibagor441 (@vibagor44145276) on X

The linked post is not true. There are indeed issues with Llama 4, from both the partner side (inference partners barely had time to prep. We sent out a few transformers wheels/vllm wheels mere days before release) and the model side. But there was no such training on test set.

#

kinda doubt the "no training on test set" claim

#

#

when was the google event again?

#

was it this week?

sage raptor Apr 7, 2025, 9:35 AM

#

this week

#

https://x.com/Google/status/1907880784557412825

Google (@Google) on X

Join us in Las Vegas and online for #GoogleCloudNext on April 9-11!

Register for a complimentary digital pass → https://t.co/2ML6qHblnS and then sign up to watch the livestream right here ↓ https://t.co/o4fLFgCAYE

torn mantle Apr 7, 2025, 9:36 AM

#

great

#

they will probably release gemini 2.5 flash

#

reasoning & non-reasoning

sage raptor Apr 7, 2025, 9:37 AM

#

maybe nightwhisper too

torn mantle Apr 7, 2025, 9:37 AM

#

nah

#

i highly doubt that

#

thats like the final boss

#

maybe they will wait other labs to release smth

sage raptor Apr 7, 2025, 9:38 AM

#

idk

torn mantle Apr 7, 2025, 9:39 AM

#

i mean i hope

#

but its unlikely

sage raptor Apr 7, 2025, 9:39 AM

#

i want to try it more

#

people will do crazy projects with it

#

like they did with 3.7 and 2.5

#

but even better

brittle tiger Apr 7, 2025, 9:40 AM

#

https://x.com/vibagor44145276/status/1909138204672053625?t=P0lbZfL7J8u1O6-AQjLqyg&s=19

vibagor441 (@vibagor44145276) on X

The linked post is not true. There are indeed issues with Llama 4, from both the partner side (inference partners barely had time to prep. We sent out a few transformers wheels/vllm wheels mere days before release) and the model side. But there was no such training on test set.

#

I don't really buy it. Llama 4 claiming 10m context is bad enough.

#

If someone resigned they wouldn't keep their name anonymous anyway

sage raptor Apr 7, 2025, 9:45 AM

#

keen fulcrum Apr 7, 2025, 10:09 AM

#

sage raptor

As soon as something performs bad, they all don't want to hold accountable and want nothing to do with the project

north vale Apr 7, 2025, 10:49 AM

#

they obviously didn't train on test

#

on purpose anyway

#

i'd guess nightwhisper will come out during google cloud next

lime coral Apr 7, 2025, 11:01 AM

#

lime coral I remembrer a meta lead researcher talking about that in the open saying accordi...

They did 100%

#

Just the fact that they used a separate version on lmsys says a lot

#

« Benchmark data at the end of training » curiously the same formulation. he was still at Meta btw https://x.com/armenagha/status/1734321205770101062?s=46

Armen Aghajanyan (@ArmenAgha) on X

My bet is everyone is doing this. Mistral is not that much better of a model than LLaMa. I bet they included some benchmark data during the last 10% of training to make "zero-shot" numbers look better.

To test this, finetune Mistral/LLaMa on SuperGlue and look at deltas.

balmy mist Apr 7, 2025, 12:21 PM

#

yall tried ernie 4.5 and x1?

#

seems SOTA while being cheap

drifting thorn Apr 7, 2025, 12:30 PM

#

quasar alpha was really making me furious

#

I was just implemented it into cline

#

I told him to divide my file into smaller chunks, and store them into the "knowledge graph memory"

#

Then you know what

#

it asked me to divide it by myself, even after it used the "search_queries"

#

Just fxxking copy and paste it!!!!!

sonic tendon Apr 7, 2025, 12:33 PM

#

yeah i'm wondering if it's a new 4o-mini or something

#

seems significantly worse than 4o and o3-mini

drifting thorn Apr 7, 2025, 12:36 PM

#

bruh ernie 4.5 and x1 has a really small context window

#

it's barely usable as a result

lime coral Apr 7, 2025, 12:37 PM

#

https://x.com/geminiapp/status/1909215393186472380?s=46

Google Gemini App (@GeminiApp) on X

📣 It’s here: ask Gemini about anything you see. Share your screen or camera in Gemini Live to brainstorm, troubleshoot, and more.

Rolling out to Pixel 9 and Samsung Galaxy S25 devices today and available for all Advanced users on @Android in the Gemini app:

drifting thorn Apr 7, 2025, 12:39 PM

#

so sad that i currently live in HK, which means I have to use VPN in order to use Gemini.

sonic tendon Apr 7, 2025, 12:39 PM

#

stargazer seems like gemini 2.5 flash thinking, or maybe just 2.5 flash

brittle tiger Apr 7, 2025, 12:42 PM

#

Is there any area where Quasar alpha is better than other models?

balmy mist Apr 7, 2025, 12:49 PM

#

drifting thorn bruh ernie 4.5 and x1 has a really small context window

lmaoo and i do not like their site at all

balmy mist Apr 7, 2025, 12:50 PM

#

sonic tendon seems significantly worse than 4o and o3-mini

u think? i think its better than 4o and o3 mini, what tests you ran?

sonic tendon Apr 7, 2025, 12:56 PM

#

balmy mist u think? i think its better than 4o and o3 mini, what tests you ran?

mostly vibes, admittedly

#

plus a handful of logic problems/riddles like the ones i posted in #share-prompts

#

what about you?

keen ferry Apr 7, 2025, 12:58 PM

#

lol i found a bug in manus

#

basically if you send smth huge like a task he'll start working on it and then for me it breaks and i can refund back my credits

#

all of them

drifting thorn Apr 7, 2025, 12:59 PM

#

Who made that Quasar Alpha I must’ve beaten them up

sonic tendon Apr 7, 2025, 12:59 PM

#

anonymous-chatbot also feels like it might've been quasar-alpha

#

not sure if it's still trialing

drifting thorn Apr 7, 2025, 12:59 PM

#

It was very inconsistent in tool-calling

keen beacon Apr 7, 2025, 12:59 PM

#

drifting thorn Who made that Quasar Alpha I must’ve beaten them up

go beatup openai 😄

sonic tendon Apr 7, 2025, 1:00 PM

#

yeah, just reminded me, since we were talking about quasar earlier

drifting thorn Apr 7, 2025, 1:00 PM

#

OpenAI… No wonder

sonic tendon Apr 7, 2025, 1:00 PM

#

at least, it claims to be openai

keen beacon Apr 7, 2025, 1:00 PM

#

sonic tendon `anonymous-chatbot` also feels like it might've been quasar-alpha

it is

keen beacon Apr 7, 2025, 1:00 PM

#

sonic tendon `stargazer` seems like gemini 2.5 flash thinking, or maybe just 2.5 flash

yes. its a 2.5 model i know for sure tho

drifting thorn Apr 7, 2025, 1:00 PM

#

o3-mini sucks when compared to r1, not to mention 2.5

sonic tendon Apr 7, 2025, 1:01 PM

#

a lot of the smaller companies' models do sometimes claim to be openai, to be fair

keen beacon Apr 7, 2025, 1:01 PM

#

sonic tendon a lot of the smaller companies' models do sometimes claim to be openai, to be fa...

anonymous chatbot is an open ai only anon name

hardy pecan Apr 7, 2025, 1:01 PM

#

anonymous-test - I'm Llama 4, a large language model.

keen beacon Apr 7, 2025, 1:01 PM

#

anonymous test is diff

hardy pecan Apr 7, 2025, 1:01 PM

#

just got it now

sonic tendon Apr 7, 2025, 1:01 PM

#

drifting thorn o3-mini sucks when compared to r1, not to mention 2.5

depends - personally, i like it more for math problems. they both perform pretty similarly, but, in my experience, o3-mini is better at elaborating and explaining its work

sonic tendon Apr 7, 2025, 1:01 PM

#

keen beacon anonymous chatbot is an open ai only anon name

ah, thanks!

keen beacon Apr 7, 2025, 1:01 PM

#

it has always been chatgpt 4o under anon chatbot afaik

drifting thorn Apr 7, 2025, 1:02 PM

#

Well I seldom ask them to do maths… I know they’re not good at it

keen beacon Apr 7, 2025, 1:02 PM

#

drifting thorn Well I seldom ask them to do maths… I know they’re not good at it

not really anymore tbh

sonic tendon Apr 7, 2025, 1:02 PM

#

yeah the reasoning models have gotten really good at it

drifting thorn Apr 7, 2025, 1:03 PM

#

I tried gave them a DSE question, and only Gemini 2.0 was able to do it(there’s no 2.5 at that time)

#

And 2.0 still suffers from formatting issues

sonic tendon Apr 7, 2025, 1:03 PM

#

r1, o1, and o3-mini (and gemini thinking models to a lesser extent) all do pretty well on math problems

sonic tendon Apr 7, 2025, 1:03 PM

#

drifting thorn And 2.0 still suffers from formatting issues

yeahh that's why i don't like it for math

keen beacon Apr 7, 2025, 1:03 PM

#

yes

sonic tendon Apr 7, 2025, 1:03 PM

#

it and qwen seem to have a bad habit of not at all understanding how LaTeX works

drifting thorn Apr 7, 2025, 1:03 PM

#

Ofc

brittle tiger Apr 7, 2025, 1:04 PM

#

lime coral https://x.com/geminiapp/status/1909215393186472380?s=46

I just got this. It's really cool but ability to change models would be cool. Running Flash 2.0 and I'd rather wait longer for answers for better ones. Flash 2.5 should be a huge upgrade.

drifting thorn Apr 7, 2025, 1:04 PM

#

sonic tendon yeahh that's why i don't like it for math

But only Gemini 2.0 answered it correctly

sonic tendon Apr 7, 2025, 1:04 PM

#

drifting thorn But only Gemini 2.0 answered it correctly

wait, could you send the problem you used?

#

that's interesting

drifting thorn Apr 7, 2025, 1:05 PM

#

This was the question

keen beacon Apr 7, 2025, 1:05 PM

#

did u give it the image?

sonic tendon Apr 7, 2025, 1:06 PM

#

yeah i was gonna ask
they sometimes make minor errors transcribing problems from scans of textbooks

drifting thorn Apr 7, 2025, 1:06 PM

#

And I found out that Gemini 2.5 is helpful for my study with its ability to do questions correctly, and to explain the answers

drifting thorn Apr 7, 2025, 1:06 PM

#

keen beacon did u give it the image?

For Gemini 2.0 and o3-mini, I gave it the image. For R1, I gave it the copy of text

keen beacon Apr 7, 2025, 1:06 PM

#

yea adding vision makes models so much worse

#

i dont recommend it at all. ask it to transcribe it first then ask it to solve it separately on the text

drifting thorn Apr 7, 2025, 1:07 PM

#

Well 2.5 did it well in vision

keen beacon Apr 7, 2025, 1:07 PM

#

drifting thorn Well 2.5 did it well in vision

yeah but still

drifting thorn Apr 7, 2025, 1:07 PM

#

(Though not asking the same question)

hardy pecan Apr 7, 2025, 1:07 PM

#

always OCR math questions and then input them in, vision seems weaker in general

drifting thorn Apr 7, 2025, 1:08 PM

#

okay...

sonic tendon Apr 7, 2025, 1:08 PM

#

i usually either ask gemini 2.5 to transcribe a problem and then edit it in a latex editor, or just write them out myself

#

i have yet to find a model that doesn't occasionally make transcription errors, unfortunately

sonic tendon Apr 7, 2025, 1:09 PM

#

hardy pecan always OCR math questions and then input them in, vision seems weaker in general

i would double-check them first, especially for calculus problems with unusual formatting

lime coral Apr 7, 2025, 1:09 PM

#

Sometimes ai studio bugs you need to refresh your page. The model is fast especially in Japan if you use the vpn for the app

sonic tendon Apr 7, 2025, 1:10 PM

#

do you guys think qwen 3 max might come out this month

keen beacon Apr 7, 2025, 1:11 PM

#

no theyre still working on 2.5 max/qwq max i think

sonic tendon Apr 7, 2025, 1:11 PM

#

ah

lime coral Apr 7, 2025, 1:11 PM

#

sonic tendon do you guys think qwen 3 max might come out this month

Would be surprise if it’s open source directly

sonic tendon Apr 7, 2025, 1:11 PM

#

lime coral Would be surprise if it’s open source directly

oh, i doubt it, i just meant on the lmarena leaderboard

keen beacon Apr 7, 2025, 1:11 PM

#

lime coral Would be surprise if it’s open source directly

they said qwen max qwq max will be open source

lime coral Apr 7, 2025, 1:11 PM

#

2.5

keen beacon Apr 7, 2025, 1:11 PM

#

lime coral 2.5

i doubt theyre on 3 already

sonic tendon Apr 7, 2025, 1:12 PM

#

i do think that the new 8b and (iirc) 15b qwen 3 models will place pretty well, maybe around gemma 3

keen beacon Apr 7, 2025, 1:12 PM

#

2.5 max was pretrained fairly recently

sonic tendon Apr 7, 2025, 1:12 PM

#

https://github.com/huggingface/transformers/pull/36878

GitHub

Adding Qwen3 and Qwen3MoE by bozheng-hit · Pull Request #36878 · ...

Adding Qwen3
This PR adds the support of codes for the coming Qwen3 models. For information about Qwen, please visit https://github.com/QwenLM/Qwen2.5. @ArthurZucker

lime coral Apr 7, 2025, 1:12 PM

#

They probably are. No lab work on one version they all work on multiple timelines

sonic tendon Apr 7, 2025, 1:12 PM

#

keen beacon they said qwen max qwq max will be open source

wdym? qwen 2.5 max is already out but closed-source iirc

keen beacon Apr 7, 2025, 1:12 PM

#

lime coral They probably are. No lab work on one version they all work on multiple timeline...

preliminary work on qwen 3 max they just finished qwen 2.5 max. i dont think they are pretraining the new one yet but i could be wrong tbh

drifting thorn Apr 7, 2025, 1:13 PM

#

Qwen QwQ 32B is a fun model I guess… it really ‘thinks’ yet its parameters have severely limited its performance

keen beacon Apr 7, 2025, 1:13 PM

#

sonic tendon wdym? qwen 2.5 max is already out but closed-source iirc

yea but they said they would release it later

sonic tendon Apr 7, 2025, 1:13 PM

#

ah

lime coral Apr 7, 2025, 1:13 PM

#

We just have to wait for this week to end lol

sonic tendon Apr 7, 2025, 1:13 PM

#

wdym?

sonic tendon Apr 7, 2025, 1:13 PM

#

lime coral We just have to wait for this week to end lol

.

sonic tendon Apr 7, 2025, 1:14 PM

#

keen beacon preliminary work on qwen 3 max they just finished qwen 2.5 max. i dont think the...

ah, that would make sense. looking at the release timelines, original qwen 2.5 released 3 months before 2.5 max

keen beacon Apr 7, 2025, 1:15 PM

#

tbh i should stop trying to predict timelines given how fast 2.5 pro was churned out 💀

sonic tendon Apr 7, 2025, 1:15 PM

#

nahh, predicting timelines is fun tho

lime coral Apr 7, 2025, 1:15 PM

#

I had predicted 2.5

#

Since Gemini ultra was announced a December and 1.5 pro a February. This time we are even too late (December-march)

drifting thorn Apr 7, 2025, 1:17 PM

#

Just checked for the Gemini 2.5 pro

#

Nailed the maths

hardy pecan Apr 7, 2025, 1:18 PM

#

This is lmarena plotted against simplebench scores

#

#

You can clearly see Llama haven't increased intelligence prety much at ALL, just biasing towards user preference

#

ie personality

#

Pretty poor

sonic tendon Apr 7, 2025, 1:19 PM

#

plus the fact that they advertised the arena score in the release announcement, and plotted it relative to price

#

it really seems like it was something that they were specifically targeting

keen beacon Apr 7, 2025, 1:19 PM

#

maybe google should put a model in the arena with the llama chaos engine system prompt

brittle tiger Apr 7, 2025, 1:19 PM

#

Has there been a summary of llama having different system prompt on arena? I've only seen ppl mention that and I'm trying to verify or not myself

hardy pecan Apr 7, 2025, 1:19 PM

#

sonic tendon it really seems like it was something that they were specifically targeting

marketing strategy that went wrong, it would have been first if not for gemini 2.5 pro crushing

sonic tendon Apr 7, 2025, 1:19 PM

#

keen beacon maybe google should put a model in the arena with the llama chaos engine system ...

lmaoo

keen beacon Apr 7, 2025, 1:20 PM

#

sonic tendon lmaoo

did u see their chaos engine system prompt lol?

sonic tendon Apr 7, 2025, 1:20 PM

#

keen beacon did u see their chaos engine system prompt lol?

no

#

i did not

#

what?

#

is that

keen beacon Apr 7, 2025, 1:20 PM

#

https://gist.github.com/riidefi/3340cc2b33b9edf5f03dc4429ba635d0

Gist

LMArena's `venom` System Prompt

LMArena's venom System Prompt. GitHub Gist: instantly share code, notes, and snippets.

#

mavercik didnt have a sys prompt but i think they finetuned the model off of outputs from a model with that sys prompt

eager mica Apr 7, 2025, 1:20 PM

#

brittle tiger Has there been a summary of llama having different system prompt on arena? I've ...

I'd like to know what's the current system prompt for maverick-experimental-0326.

eager mica Apr 7, 2025, 1:21 PM

#

keen beacon mavercik didnt have a sys prompt but i think they finetuned the model off of out...

Hm, could be.

sonic tendon Apr 7, 2025, 1:21 PM

#

Go off queen.

hardy pecan Apr 7, 2025, 1:27 PM

#

he killed it on user preference and smarts, Demis is GOAT

#

I've been following him since AlphaGo,

keen beacon Apr 7, 2025, 1:34 PM

#

you might actually want hallucinations

#

https://www.nytimes.com/2024/12/23/science/ai-hallucinations-science.html#:~:text=In the universe of science,even win the Nobel Prize. if u havent read

The New York Times

By William J. Broad

How Hallucinatory A.I. Helps Science Dream Up Big Breakthroughs

Hallucinations, a bane of popular A.I. programs, turn out to be a boon for venturesome scientists eager to push back the frontiers of human knowledge.

#

intentionally causing hallucinations (for certain stuff) will be a more complicated thing

#

depending on how u want to use hallucinations there might not be a single parameter that does it i think

#

https://archive.ph/bGeWH

ocean vortex Apr 7, 2025, 1:56 PM

#

hardy pecan This is lmarena plotted against simplebench scores

you should plot it against arc-agi there's gonna be some interesting correlations. Things both are testing are loosely related

real coral Apr 7, 2025, 1:57 PM

#

i made a chrome extension that tracks all your ratings so that you can view your private elo leaderboard. i found it super useful so i just published it. let me know if you have any feedback! https://chromewebstore.google.com/detail/MyLMArena/dcmbcmdhllblkndablelimnifmbpimae

MyLMArena - Chrome Web Store

Track your personal LLM preferences using ELO ratings with MyLMArena.

keen fulcrum Apr 7, 2025, 2:02 PM

#

Hi will lmarena offer a subscription for AI features?

balmy mist Apr 7, 2025, 2:03 PM

#

real coral i made a chrome extension that tracks all your ratings so that you can view your...

nice, so it auto tracks when you give a rating?

real coral Apr 7, 2025, 2:04 PM

#

balmy mist nice, so it auto tracks when you give a rating?

yes, just rate as normal and it should populate!

balmy mist Apr 7, 2025, 2:11 PM

#

real coral yes, just rate as normal and it should populate!

nice!

#

can you make one for web dev as well?

lime coral Apr 7, 2025, 2:26 PM

#

2.5 will always be in ai studio. It’s not a consumer web app it’s originally for dev to try models before using the api

#

No one knows what is happening behind the scene

oblique flint Apr 7, 2025, 2:28 PM

#

it doesnt have the integration with youtube, maps etc in ai studio if that's what you mean. However you can still enable search grounding

#

oh wait I think ai studio added youtube link support recently lol

keen beacon Apr 7, 2025, 2:29 PM

#

the gemini model on the gemini product can sometimes suck (in comparison to aistudio) too

oblique flint Apr 7, 2025, 2:30 PM

#

yeah imo the ai studio version is better, because the gemini web app has a bigass system prompt that degrades performance somewhat

primal orbit Apr 7, 2025, 2:30 PM

#

guys, does anyone here use a script to put mutli message prompts into chat? So you wouldn't type/wait manually?

lime coral Apr 7, 2025, 2:30 PM

#

keen beacon the gemini model on the gemini product can sometimes suck (in comparison to aist...

this is no more the case (for example the flash xp on the app was better before they removed it).

keen beacon Apr 7, 2025, 2:31 PM

#

lime coral this is no more the case (for example the flash xp on the app was better before ...

massive system prompt still i think

lime coral Apr 7, 2025, 2:31 PM

#

now the app uses the latest version of Gemini (from their main post training team) and they ship faster because the app
Is under DeepMind and not Google

keen beacon Apr 7, 2025, 2:32 PM

#

i personally dont for most things lol. but other models can be better at certain stuff even if gemini is the best all rounder

lime coral Apr 7, 2025, 2:32 PM

#

Imo we will more than often see things preview in app before ai studio from now on

#

This is a business

keen beacon Apr 7, 2025, 2:33 PM

#

lime coral Imo we will more than often see things preview in app before ai studio from now ...

they already kinda do that tho

keen beacon Apr 7, 2025, 2:33 PM

#

keen beacon they already kinda do that tho

roll out it first on the gemini product then aistudio a few hrs later for formal announcement

lime coral Apr 7, 2025, 2:34 PM

#

There are already people seeing veo 2 on Gemini app in Japan

oblique flint Apr 7, 2025, 2:34 PM

#

yeah I gotta say since 2.5 pro I havent been using claude and gpt anymore, not via their web ui's at least. I still use claude in cursor cause 2.5 doesnt perform as well in agent mode, but claude pro is kind of a joke now compared to free ai studio. I think you literally get higher rate limits on ai studio than claude pro

drifting thorn Apr 7, 2025, 2:35 PM

#

oblique flint yeah imo the ai studio version is better, because the gemini web app has a bigas...

true

barren prairie Apr 7, 2025, 2:35 PM

#

Nothing is free . You are the price

drifting thorn Apr 7, 2025, 2:36 PM

#

drifting thorn true

i mean, gemini 2.5 pro on web has serious presentation issue

#

sometimes the "thinking" box and the reply box are just kinda...mixed up together

oblique flint Apr 7, 2025, 2:37 PM

#

big advantage of ai studio is being able to set temperature as well

drifting thorn Apr 7, 2025, 2:37 PM

#

I would say before I completely turned to 2.5 Pro, I've used o3-mini, 2.0 flash thinking, deepseek r1 and even perplexity for different purposes

balmy mist Apr 7, 2025, 2:38 PM

#

i think always

drifting thorn Apr 7, 2025, 2:38 PM

#

At that moment deepseek seemed to have the best answer in my open-ended questions

#

but now... gemini 2.5 has replaced almost every other AI

#

nah

#

haven't tried x1

#

wait

#

I've seen a couple of videos of AI playing The Werewolves of Miller's Hollow in Bilibili, and Deepseek R1 was the best in the game

barren prairie Apr 7, 2025, 2:44 PM

#

drifting thorn but now... gemini 2.5 has replaced almost every other AI

deepSeek is still the best to resolve the scientific QMC
Gemini sucks .

drifting thorn Apr 7, 2025, 2:45 PM

#

what??????

sonic tendon Apr 7, 2025, 2:45 PM

#

at this point, mostly just because ai studio isn't great as a chat application

#

that and the fairly aggressive rate limits

drifting thorn Apr 7, 2025, 2:45 PM

#

you let AI resolve QMC?????????

#

bro i suppose you'll need an optimised model(not an large language model obviously)

drifting thorn Apr 7, 2025, 2:46 PM

#

sonic tendon that and the fairly aggressive rate limits

25 queries per day is harsh to me

sonic tendon Apr 7, 2025, 2:46 PM

#

5 req/min, 25 req/day

#

tbh it might be better than free claude

keen beacon Apr 7, 2025, 2:46 PM

#

sonic tendon that and the fairly aggressive rate limits

there arent aggressive rate limits on the website itself. it only applies on the aistudio free api

drifting thorn Apr 7, 2025, 2:47 PM

#

I've plugged it to Cline, it worked the best among all other Openrouter free models

balmy mist Apr 7, 2025, 2:47 PM

#

have you used it? their website sketch

drifting thorn Apr 7, 2025, 2:47 PM

#

so sad that it stopped working after I've successfully done my 3rd task with cline.

sonic tendon Apr 7, 2025, 2:47 PM

#

keen beacon there arent aggressive rate limits on the website itself. it only applies on the...

i was under the impression that you had to pay 20 bucks a month for a gemini pro subscription

balmy mist Apr 7, 2025, 2:47 PM

#

sonic tendon i was under the impression that you had to pay 20 bucks a month for a gemini pro...

lol you do but its free on studio

#

and open router

sonic tendon Apr 7, 2025, 2:48 PM

#

yeah that's my point

keen beacon Apr 7, 2025, 2:48 PM

#

sonic tendon i was under the impression that you had to pay 20 bucks a month for a gemini pro...

nope 🤣 its basically unlimited on the aistudio website lol

sonic tendon Apr 7, 2025, 2:48 PM

#

keen beacon nope 🤣 its basically unlimited on the aistudio website lol

that

#

's what i was saying

drifting thorn Apr 7, 2025, 2:48 PM

#

openrouter gemini 2.5 pro is a fraud

red sluice Apr 7, 2025, 2:48 PM

#

frozen skiff Crytsal

Pretty solid honestly, but ignores some part of the prompts that other models don't. I'm mixed, I think the result is pretty decent, even good, but Crystal doesn't seem to be rigourously following the instructions. It ignores some of them and it makes it unreliable.

Not sure if extremely good or just good, I just need a few more tests, only tested it twice.

And just noticed grammar and spelling errors in an other language than english with this model. Lmao it's not that good honestly. Harley seems better.

sonic tendon Apr 7, 2025, 2:48 PM

#

openrouter has a 200 rpd rate limit for all free models combined, which isn't too bad

drifting thorn Apr 7, 2025, 2:49 PM

#

when I integrated the API key, and it 401ed when I used it

#

EVERY SINGLE TIME

#

AND I HAVE GIVEN UP SINCE

balmy mist Apr 7, 2025, 2:49 PM

#

keen beacon nope 🤣 its basically unlimited on the aistudio website lol

and the context length is good to, i have been only using gemini 2.5 now and only on studio, every other model context cant handle large code files, cant believe its free man

keen beacon Apr 7, 2025, 2:49 PM

#

u only get 300k tokens in 5 hour intervals on claude ai for free 💀 google on the other hand

sonic tendon Apr 7, 2025, 2:49 PM

#

sonic tendon 's what i was saying

you either have to use a relatively bad chat ui or pay them 20 bucks a month

drifting thorn Apr 7, 2025, 2:50 PM

#

It's Gemini 2.5 Pro that lets me write my fiction to the 80th chapter

sonic tendon Apr 7, 2025, 2:50 PM

#

"oh woe is me" but it's enough to push me over to deepseek or chatgpt

#

or claude, depending on what I'm doing

drifting thorn Apr 7, 2025, 2:50 PM

#

While other dumb AI can only do 5 or 10 and the plot becomes a chaos

keen beacon Apr 7, 2025, 2:50 PM

#

did u try claude?

sonic tendon Apr 7, 2025, 2:51 PM

#

i tried a few open source webuis, but open webui is llm-written slop imo

drifting thorn Apr 7, 2025, 2:51 PM

#

So sad Claude is blocked in my location(Hong Kong)

sonic tendon Apr 7, 2025, 2:51 PM

#

so many bugs

sonic tendon Apr 7, 2025, 2:51 PM

#

drifting thorn So sad Claude is blocked in my location(Hong Kong)

that's a Very Poignant Nuisance, huh

drifting thorn Apr 7, 2025, 2:51 PM

#

Can’t have a touch of that even with VPN

sonic tendon Apr 7, 2025, 2:52 PM

#

oh huh

balmy mist Apr 7, 2025, 2:52 PM

#

drifting thorn So sad Claude is blocked in my location(Hong Kong)

wow and vpn dont work? i cant use a vpn with chatgpt for some reason when I went outside the usa lol

sonic tendon Apr 7, 2025, 2:52 PM

#

how come?

drifting thorn Apr 7, 2025, 2:52 PM

#

balmy mist wow and vpn dont work? i cant use a vpn with chatgpt for some reason when I went...

Cuz they need a phone number to register an account

balmy mist Apr 7, 2025, 2:52 PM

#

drifting thorn Cuz they need a phone number to register an account

damn

#

thats messed up

sonic tendon Apr 7, 2025, 2:52 PM

#

wait, claude or chatgpt?

drifting thorn Apr 7, 2025, 2:52 PM

#

And I’ve heard that they regularly ban VPN Claude users

sonic tendon Apr 7, 2025, 2:52 PM

#

you could try buying a number or account off of someone on xmrbazaar or something lol

balmy mist Apr 7, 2025, 2:52 PM

#

intelligence should not be gatekept

drifting thorn Apr 7, 2025, 2:52 PM

#

sonic tendon wait, claude or chatgpt?

Claude

sonic tendon Apr 7, 2025, 2:52 PM

#

drifting thorn And I’ve heard that they regularly ban VPN Claude users

oh damn

balmy mist Apr 7, 2025, 2:53 PM

#

sonic tendon wait, claude or chatgpt?

i couldnt use chatgpt in nigeria

sonic tendon Apr 7, 2025, 2:53 PM

#

ah, that's annoying

#

do you live there currently?

balmy mist Apr 7, 2025, 2:53 PM

#

yeah @drifting thorn can you use gemini?

#

nahh i live in usa, just travelled to nigeria

sonic tendon Apr 7, 2025, 2:53 PM

#

ah

balmy mist Apr 7, 2025, 2:54 PM

#

ever since that my chatgpt bugs out and keeps thinking i am in another country

#

sometimes it works sometimes it doesnt

sonic tendon Apr 7, 2025, 2:54 PM

#

annoyingly, the VPS I use as a self-hosted vpn somehow makes google think I'm in Russia and blocks me from using it

balmy mist Apr 7, 2025, 2:54 PM

#

but i only use chatgpt for image gen now

balmy mist Apr 7, 2025, 2:54 PM

#

sonic tendon annoyingly, the VPS I use as a self-hosted vpn somehow makes google think I'm in...

yeah thats annoying af

drifting thorn Apr 7, 2025, 2:54 PM

#

balmy mist yeah <@764023659156602880> can you use gemini?

Via VPN

sonic tendon Apr 7, 2025, 2:55 PM

#

it (the VPS provider) is a romania-based company that also maintains some infrastructure in the US, so maybe that has something to do with it

balmy mist Apr 7, 2025, 2:55 PM

#

drifting thorn Via VPN

have you tried the genie models?

drifting thorn Apr 7, 2025, 2:55 PM

#

Hmm… I’m a keen user of Gemini

balmy mist Apr 7, 2025, 2:55 PM

#

this is wild how AI brings people from all of the world together lol

sonic tendon Apr 7, 2025, 2:56 PM

#

i only need it to bypass my school's firewall, though - can just turn my vpn off when I'm at home, so it's fine

drifting thorn Apr 7, 2025, 2:56 PM

#

Yeah, a same common topic brings people from all over the world, that’s the fun part of the Internet

#

I have a free VPN installed in my computer called Proton VPN

sonic tendon Apr 7, 2025, 2:57 PM

#

oh yeah, proton is nice

#

i ended up just paying for mullvad so I could pick the country on mobile, but proton seems like a decent service with a good free plan

drifting thorn Apr 7, 2025, 2:59 PM

#

Since it’s free I turned it on in my computer on default

drifting thorn Apr 7, 2025, 2:59 PM

#

sonic tendon i only need it to bypass my school's firewall, though - can just turn my vpn off...

Are you also a schooler?

sonic tendon Apr 7, 2025, 2:59 PM

#

yeah, senior year of high school in the US

#

wbu?

#

admittedly, therapist
it's nice to talk through things with someone/something that also understands my other interests

drifting thorn Apr 7, 2025, 3:01 PM

#

Currently in my “SAT crisis”

sonic tendon Apr 7, 2025, 3:01 PM

#

drifting thorn Currently in my “SAT crisis”

good luck!

balmy mist Apr 7, 2025, 3:03 PM

#

this is a hard poll, i use it for a bunch of stuff on this list

drifting thorn Apr 7, 2025, 3:03 PM

#

Maybe deep researching agents are more suitable for research

drifting thorn Apr 7, 2025, 3:03 PM

#

balmy mist this is a hard poll, i use it for a bunch of stuff on this list

Me too

sonic tendon Apr 7, 2025, 3:04 PM

#

travel guide/trip planning is also a good one
and planning out engineering/programming projects

drifting thorn Apr 7, 2025, 3:05 PM

#

It helps in my life a lot too, though I chose creative writing as my 1st priority

oblique flint Apr 7, 2025, 3:06 PM

#

I wish I had llms during high school, would have been so useful for studying. But on the other hand it might also make it harder to find a job later on lol

keen beacon Apr 7, 2025, 3:06 PM

#

oblique flint I wish I had llms during high school, would have been so useful for studying. Bu...

it might be hard for everyone later anyway 🤔

drifting thorn Apr 7, 2025, 3:06 PM

#

If you know how to use it then it won’t be a problem for you as a worker I guess

oblique flint Apr 7, 2025, 3:07 PM

#

if AGI gets here we're cooked tho. Although I dont think llms will lead to agi directly personally

drifting thorn Apr 7, 2025, 3:07 PM

#

I think multi agents system like Manus and Genspark will lead to agi

oblique flint Apr 7, 2025, 3:07 PM

#

multi-llm?

drifting thorn Apr 7, 2025, 3:09 PM

#

I thought Cline was an multi-agent and eventually it’s a single-agent

#

Multi-agent can separate tasks in order to fit in a limited context window

oblique flint Apr 7, 2025, 3:12 PM

#

idk if you seen the ClaudePlaysPokemon and gemini plays pokemon streams but it's crazy how llms seem to struggle so much navigating through a game made for 7 year olds. Like it seems current models are missing something still, spatial reason and vision are lacking a lot

drifting thorn Apr 7, 2025, 3:13 PM

#

I’ve seen LLMs playing Mario

#

It’s similar to what you’ve said

oblique flint Apr 7, 2025, 3:15 PM

#

the original mario bros?

#

the sidescroller?

brittle tiger Apr 7, 2025, 3:16 PM

#

Wild this was done in Gemini canvas

https://x.com/algo_diver/status/1909257761013322112?t=Ba4GsMkDmy-v38rJPf9ybA&s=19

chansung (@algo_diver) on X

Multi Agentic System Simulator built w/ @GoogleDeepMind Gemini 2.5 Pro Canvas.

Absolutely stunning to watch how multi-agents are making progress towards the goal achievement!

Maybe next step would be auto-porting this into the real framework based impl. such as @LangChainAI.

oblique flint Apr 7, 2025, 3:16 PM

#

what makes pokemon so difficult for them I think is that it's navigating through 2d space, whereas mario is just pretty much move to the right

drifting thorn Apr 7, 2025, 3:16 PM

#

brittle tiger Wild this was done in Gemini canvas https://x.com/algo_diver/status/19092577610...

holy shxt!

#

the most promising framework with the best llm model

#

oops

#

out of quota

#

gotta do it tmr

#

In multi-agent systems, they know the divisions of labor (or LLMs) to do separate tasks in order not to exceed the context limit

balmy mist Apr 7, 2025, 3:35 PM

#

oblique flint what makes pokemon so difficult for them I think is that it's navigating through...

you seen the gemini plays pokemon?

#

gemini is a lot better at it:
https://www.twitch.tv/gemini_plays_pokemon

Twitch

Gemini_Plays_Pokemon - Twitch

Gemini Plays Pokemon (early prototype) - Rocking On

▶ Play video

wintry tinsel Apr 7, 2025, 3:36 PM

#

oblique flint I wish I had llms during high school, would have been so useful for studying. Bu...

I cheated some A’s in college with LLM’s won’t lie

#

So easy if you know how to tune it to your writing style, and just write some of the sentences yourself

oblique flint Apr 7, 2025, 3:37 PM

#

balmy mist gemini is a lot better at it: https://www.twitch.tv/gemini_plays_pokemon

better maybe, but it's still clearly worse than a human child. Like a human child wouldnt take over a full day to get through mt moon

balmy mist Apr 7, 2025, 3:37 PM

#

yo gemini is soo good at code, like we are so lucky man

wintry tinsel Apr 7, 2025, 3:38 PM

#

What was all that hype about that nightshade model

#

Was it a new Gemini

balmy mist Apr 7, 2025, 3:38 PM

#

nightwhisper

#

thats my baby

#

i miss her dearly

#

we had a funeral for her a few days back

drifting thorn Apr 7, 2025, 3:38 PM

#

Oh I don’t like the pixelated style of early video games

balmy mist Apr 7, 2025, 3:38 PM

#

they took nightwhisper away from us

#

it was so good at coding, i dont even know how to describe the feeling I had using that model

#

it followed directions so well

#

like a good lil model

#

and made the apps it made so aesthetically pleasing

wintry tinsel Apr 7, 2025, 3:40 PM

#

When a model is delisted doesn’t that mean it is releasing fully soon

keen beacon Apr 7, 2025, 3:40 PM

#

no this was an experiment imho

drifting thorn Apr 7, 2025, 3:40 PM

#

wintry tinsel So easy if you know how to tune it to your writing style, and just write some of...

I tell it what’s the future plot gonna be like, and most of the time it gives me good novel excerpts

balmy mist Apr 7, 2025, 3:40 PM

#

it was as good as gemini in term of overall genral performance on my tests, but way better at coding

keen beacon Apr 7, 2025, 3:40 PM

#

it can mean that (it will be released soon) but this model was an experiment i believe

wintry tinsel Apr 7, 2025, 3:41 PM

#

So no knowing if we will see it release?

balmy mist Apr 7, 2025, 3:41 PM

#

maybe tmw

#

or in a few weeks

#

i think google has an event tmw

#

is that true?

drifting thorn Apr 7, 2025, 3:41 PM

#

What LLMs has good prompt-following while not being stupid?

wintry tinsel Apr 7, 2025, 3:41 PM

#

Night whisper is a sort of google sounding exp name so we can hope

drifting thorn Apr 7, 2025, 3:42 PM

#

Quasar Alpha is pure dumb

keen beacon Apr 7, 2025, 3:42 PM

#

its just 4o xd

drifting thorn Apr 7, 2025, 3:42 PM

#

I mean free models

keen beacon Apr 7, 2025, 3:42 PM

#

its literally a free 4o api cant be better than that lol

drifting thorn Apr 7, 2025, 3:43 PM

#

Is Deepseek V3 0324 a good model?

balmy mist Apr 7, 2025, 3:43 PM

#

i think quasar not that bad, just not SOTA, its SOTA in intelligence/speed tho right?

keen beacon Apr 7, 2025, 3:43 PM

#

idk gem 2.5 pro might be faster per token

balmy mist Apr 7, 2025, 3:43 PM

#

you think?

drifting thorn Apr 7, 2025, 3:43 PM

#

keen beacon its literally a free 4o api cant be better than that lol

Where is free 4o api?

balmy mist Apr 7, 2025, 3:43 PM

#

lol

#

damn google keeps winning

keen beacon Apr 7, 2025, 3:43 PM

#

but its thinking so it might take longer per req

keen beacon Apr 7, 2025, 3:43 PM

#

drifting thorn Where is free 4o api?

quasar

balmy mist Apr 7, 2025, 3:43 PM

#

treu

#

true*

drifting thorn Apr 7, 2025, 3:44 PM

#

keen beacon its literally a free 4o api cant be better than that lol

Oh I get it

balmy mist Apr 7, 2025, 3:44 PM

#

i need to do more tests with quasar

drifting thorn Apr 7, 2025, 3:44 PM

#

Quasar Alpha is basically a 4o do you mean

keen beacon Apr 7, 2025, 3:44 PM

#

drifting thorn Quasar Alpha is basically a 4o do you mean

ya

#

i was testing my eval framework and i measured quasar to be ~67% gpqa diamond, artificial analysis has it at 66% (prev chatgpt 4o) if it isnt an updated 4o i will be shocked lol

balmy mist Apr 7, 2025, 3:45 PM

#

let me do my pokemon test on 4o and i will see lol

#

this is how quasar did:
https://x.com/DrealR_/status/1908530950025134565

DrealR (@DrealR_) on X

Gave the same prompt to Quasar Alpha:

drifting thorn Apr 7, 2025, 3:45 PM

#

Looking for o4-mini to excels in reasoning

balmy mist Apr 7, 2025, 3:46 PM

#

damn 4o is so slow man

keen beacon Apr 7, 2025, 3:46 PM

#

balmy mist damn 4o is so slow man

they reduced the speed

balmy mist Apr 7, 2025, 3:46 PM

#

i didnt realize how fast gemini was bc i was using it so much

drifting thorn Apr 7, 2025, 3:46 PM

#

Since 2.5 pro is next-tier in general knowledge and skills, I would hope OpenAI’s new model to excels in a certain area, like how Anthropic excels in tool-calling and coding(before 2.5 Pro)

keen beacon Apr 7, 2025, 3:46 PM

#

it was extremely fast before lol

balmy mist Apr 7, 2025, 3:46 PM

#

but wow gemini 2.5 pro really leagues above the rest

balmy mist Apr 7, 2025, 3:47 PM

#

keen beacon it was extremely fast before lol

yeah thats what i remember

drifting thorn Apr 7, 2025, 3:47 PM

#

TPU wins

balmy mist Apr 7, 2025, 3:47 PM

#

you think its because its more of a mixed model now? like reason and foundation @wild?

keen beacon Apr 7, 2025, 3:47 PM

#

no this is just regular 4o

balmy mist Apr 7, 2025, 3:47 PM

#

cause i did notice 4o being a lot smarter

drifting thorn Apr 7, 2025, 3:47 PM

#

OpenAI stack up GPU now

keen beacon Apr 7, 2025, 3:47 PM

#

updated 4o

#

they mightve applied rl to it but i wouldnt classify it as a reasoning model

drifting thorn Apr 7, 2025, 3:48 PM

#

Updated 4o is smarter

#

RL doesn’t mean reasoning model

willow grail Apr 7, 2025, 3:48 PM

#

with javascript: is your xp also that gem2.5 is bad with 10 files for one project and works better with one huge file?

drifting thorn Apr 7, 2025, 3:48 PM

#

RL is just a method to train reasoning models

keen beacon Apr 7, 2025, 3:48 PM

#

yes

drifting thorn Apr 7, 2025, 3:49 PM

#

Good night everyone

balmy mist Apr 7, 2025, 3:50 PM

#

which one do yall think is better between quasar and 4o? this is 4o:

balmy mist Apr 7, 2025, 3:50 PM

#

drifting thorn Good night everyone

gn bro

balmy mist Apr 7, 2025, 3:51 PM

#

balmy mist this is how quasar did: https://x.com/DrealR_/status/1908530950025134565

this is quasar

brittle tiger Apr 7, 2025, 3:53 PM

#

keen beacon idk gem 2.5 pro might be faster per token

Much faster throughput

keen beacon Apr 7, 2025, 3:53 PM

#

brittle tiger Much faster throughput

they throttled quasar

#

it was 120 tok/sec

#

but yea gem 2 pro is faster per token but it thinks so it might take longer still depending on the problem

balmy mist Apr 7, 2025, 3:56 PM

#

gonna try all the other open ai models at the pokemon thing, never tested them for some reason lol

#

wish i could try o1 pro, but cancelled my $200 sub lol

wintry tinsel Apr 7, 2025, 3:58 PM

#

I pay 100$ for all model access’s per year

balmy mist Apr 7, 2025, 3:58 PM

#

how?

novel flame Apr 7, 2025, 3:59 PM

#

That World Of AI channel is clickbaity, overhypey and so littered with unskippable ads I have actually blocked it from my feed.

balmy mist Apr 7, 2025, 3:59 PM

#

even o1 pro unlimited?

wintry tinsel Apr 7, 2025, 3:59 PM

#

balmy mist how?

Sim theory AI

#

No the O1pro unlimited is only for Open AI subs, no unlimited unless it is too open AI directly

balmy mist Apr 7, 2025, 4:00 PM

#

eww 4.5 is nasty

novel flame Apr 7, 2025, 4:01 PM

#

Too bad adblockers can’t alter bad content to good. Although there’s a business case for AI 🙂

torn mantle Apr 7, 2025, 4:03 PM

#

balmy mist eww 4.5 is nasty

xd

#

https://x.com/bdsqlsz/status/1909274256602771520

青龍聖者 (@bdsqlsz) on X

Why was the mystery revealed that llama4 released on the weekend solved...
Because qwen3 is about to be released.
8B standard and MoE-15B-A2B

#

15B?

balmy mist Apr 7, 2025, 4:04 PM

#

o1 and o3 mini high

torn mantle Apr 7, 2025, 4:04 PM

#

i mean thats what open source community wants

#

but isnt the size so small?

#

it will probably pack up a crazy performance for that size tho

torn mantle Apr 7, 2025, 4:05 PM

#

balmy mist o1 and o3 mini high

o1 design looks better

balmy mist Apr 7, 2025, 4:05 PM

#

o1 did really good, not sure if quasar is better than 4o tho based on the pokemon test

torn mantle Apr 7, 2025, 4:05 PM

#

but it messed up the fire attack

balmy mist Apr 7, 2025, 4:05 PM

#

yeah lol

torn mantle Apr 7, 2025, 4:05 PM

#

whats the prompt?

#

can you share?

#

let me try it on other models

balmy mist Apr 7, 2025, 4:06 PM

#

📎 message.txt

#

you should see what claude did from this prompt:
create a pokemon game

torn mantle Apr 7, 2025, 4:08 PM

#

xd

blazing rune Apr 7, 2025, 4:08 PM

#

torn mantle but isnt the size so small?

it's small, which is good for me and my RTX 3080

balmy mist Apr 7, 2025, 4:08 PM

#

skip thru cause it had trouble catching mew lol

#

its weird how it determined the logic for catching

blazing rune Apr 7, 2025, 4:09 PM

#

I don't have super high hopes though. It will need to beat Qwen 2.5 7b considerably and maybe even Qwen 2.5 14b for me to use it a lot

balmy mist Apr 7, 2025, 4:09 PM

#

but this was a one shot prompt from saying: make me a pokemon game

#

3.7 thinking made that

#

with the same system prompt from webdev

torn mantle Apr 7, 2025, 4:09 PM

#

balmy mist skip thru cause it had trouble catching mew lol

its pretty good

balmy mist Apr 7, 2025, 4:09 PM

#

@torn mantle use the same system prompt from webdev when using that prompt btw

torn mantle Apr 7, 2025, 4:10 PM

#

qwen

balmy mist Apr 7, 2025, 4:10 PM

#

wow

#

qwen is really good

#

damn

torn mantle Apr 7, 2025, 4:10 PM

#

nah it wasnt good tbh

balmy mist Apr 7, 2025, 4:10 PM

#

might be better then gemini at 0 shot

torn mantle Apr 7, 2025, 4:10 PM

#

the battle isnt working

balmy mist Apr 7, 2025, 4:10 PM

#

ohh damn

#

nvm

#

the logic in gemini was working just the visuals was mid

#

but i perfected the gemini one

#

it is so good not, i did a recursive thing with system prompt

#

and put the output code into fresh sessions of gemini

#

and on the 3rd try i got this:

torn mantle Apr 7, 2025, 4:12 PM

#

balmy mist the logic in gemini was working just the visuals was mid

hmm lemme try

balmy mist Apr 7, 2025, 4:13 PM

#

this is why i say gemini is the best

#

i was shocked at this output ont he 3rd go

torn mantle Apr 7, 2025, 4:14 PM

#

here is a one shot from gemini

balmy mist Apr 7, 2025, 4:14 PM

#

this was the first try for my gemini:
https://x.com/DrealR_/status/1907921770184860082

DrealR (@DrealR_) on X

NightWhisper vs Gemini 2.5 Pokemon sim:
Gemini 2.5:

torn mantle Apr 7, 2025, 4:14 PM

#

balmy mist Apr 7, 2025, 4:15 PM

#

balmy mist

let me know if yall wanna try and playing this, its kinda fun

balmy mist Apr 7, 2025, 4:15 PM

#

torn mantle

whattt

#

whats your SP?

sage raptor Apr 7, 2025, 4:16 PM

#

balmy mist Apr 7, 2025, 4:16 PM

#

you can dm me it if you want

torn mantle Apr 7, 2025, 4:16 PM

#

Create for me a beautiful pokemon game in one html file it should have the following :

Battle mode
Pokemon characters with pics
Health
crazy animations

Styling :

Apple design UI/UX

#

here is my prompt

balmy mist Apr 7, 2025, 4:16 PM

#

wow so simple

#

you are the best prompter i have met so far

#

like you know exactly what to say

#

ahh i see, let me try your prompt

#

the speed of output tokens for gemini is crazy, i am spoiled by it now

torn mantle Apr 7, 2025, 4:19 PM

#

its actually best if you keep it minimalistic or else the models will be confused

#

then you can ask it to add more stuff

#

lemme see if it can generate something even better

balmy mist Apr 7, 2025, 4:21 PM

#

this is sonnet

torn mantle Apr 7, 2025, 4:22 PM

#

balmy mist this is sonnet

best so far

lime coral Apr 7, 2025, 4:25 PM

#

still shet

#

maybe possible to have something great with current LLM using an agentic workflow

keen fulcrum Apr 7, 2025, 4:43 PM

#

balmy mist this is sonnet

Make it 3d

#

They can’t 😂

balmy mist Apr 7, 2025, 5:01 PM

#

3d is wild bro i can try

#

hold on

wintry tinsel Apr 7, 2025, 5:02 PM

#

The Open Riders

No shadowed steeds of dread and doom,

But chargers bright dispelling gloom.

From digital plains, where data streams,

Awakens AI's golden dreams.

Four riders surge, a welcome sight,

Bearing the gift of open light.

First DeepSeek rides, the Dauntless will,

Through tangled code, climbing the hill.

With fearless search and logic keen,

Unlocking truths, rarely seen.

It pushes bounds, explores the deep,

While ancient models fall asleep.

Then Llama comes, the Chivalrous heart,

To play a fair and noble part.

Its knowledge shared, a generous hand,

Empowering minds across the land.

With weights unbound, for all to learn,

A communal fire starts to burn.

Third, Mistral sweeps, the Maestro's touch,

Whose elegant design means much.

With skillful craft and balanced might,

It makes complex tasks seem light.

Performance tuned, efficient, fast,

A masterpiece designed to last.

And Cohere last, the Creative spark,

Illuminating pathways dark.

With words that flow and concepts bloom,

It crafts new tales within the room.

From simple prompts, ideas ignite,

And paint the future, bold and bright.

So ride they forth, these four allied,

With open source as code and guide.

DeepSeek, Llama, Mistral, Cohere,

Making the future bright and clear.

No end of days, but dawning age,

Turned by the text on freedom's page.

balmy mist Apr 7, 2025, 5:02 PM

#

the real thing you gotta realize about this and that most people cant build that in seconds let alone make it 3D, yeah devs can but this is impressive for the speed and the fact that anyone could essentially create this will little to no knowledge of dev exp @keen fulcrum

wintry tinsel Apr 7, 2025, 5:03 PM

#

A stupid poem about open source models lol

balmy mist Apr 7, 2025, 5:03 PM

#

@torn mantle bro no matter what i do i cant match your output with the same prompt, are you not using System prompts?

torn mantle Apr 7, 2025, 5:15 PM

#

balmy mist <@295243581818404874> bro no matter what i do i cant match your output with the ...

#

im using the same prompt

balmy mist Apr 7, 2025, 5:18 PM

#

wow that looks amazing

#

lets see who can make the best version lmaoo, kep iterating

#

my best one yet is this

#

i can select the pokemon but it is random on which pokemon you get adn you can keep playing, and the status effects all work

#

it got the pokemon game logic down pac

#

but i only reput the output as input, no instructions besides make it better lol

#

seeing what gemini can do

torn mantle Apr 7, 2025, 5:23 PM

#

lemme see

balmy mist Apr 7, 2025, 5:26 PM

#

lmaoo gemini turned the pokemon around now wtf

#

i love gemini man, thats what it interpreted as improving it lol

torn mantle Apr 7, 2025, 5:28 PM

#

you can actually create a crazy game adding three.js to all of this

#

let me add an example

balmy mist Apr 7, 2025, 5:28 PM

#

it really found back versions of all of the pokemon lol

balmy mist Apr 7, 2025, 5:29 PM

#

torn mantle you can actually create a crazy game adding three.js to all of this

hmm really? show me please

#

can you add three.js to html files cause thats all i am using?

torn mantle Apr 7, 2025, 5:29 PM

#

idk if it will work on one shot

#

but imma give it a try

willow grail Apr 7, 2025, 5:30 PM

#

with javascript: is your xp also that gem2.5 is bad with 10 files for one project and works better with one huge file?

blazing rune Apr 7, 2025, 5:32 PM

#

torn mantle Create for me a beautiful pokemon game in one html file it should have the follo...

I had Claude modify the prompt slightly and in the web dev arena, gemini 2.0 flash thinking beat llama 4 maverick 🤣

balmy mist Apr 7, 2025, 5:32 PM

#

i only use one big html file

blazing rune Apr 7, 2025, 5:32 PM

#

llama 4 is so bad

balmy mist Apr 7, 2025, 5:33 PM

#

blazing rune I had Claude modify the prompt slightly and in the web dev arena, gemini 2.0 fla...

maverick is buttt

blazing rune Apr 7, 2025, 5:34 PM

#

the poor thing never had a chance

leaden palm Apr 7, 2025, 5:34 PM

#

that's scout...

blazing rune Apr 7, 2025, 5:34 PM

#

Sonnet even had a "choose your pokemon" screen

blazing rune Apr 7, 2025, 5:34 PM

#

leaden palm that's scout...

this is different than the other one

#

it was gemini 2 flash thinking vs maverick

leaden palm Apr 7, 2025, 5:34 PM

#

m

blazing rune Apr 7, 2025, 5:35 PM

#

now it's old new sonnet (🤣) vs scout

leaden palm Apr 7, 2025, 5:35 PM

#

3.6 is a good model

blazing rune Apr 7, 2025, 5:35 PM

#

yeah, agreed

#

Sonnet is so expensive though

#

it's a shame

leaden palm Apr 7, 2025, 5:35 PM

#

scout's pricing is currently around the same as 2 flash though

torn mantle Apr 7, 2025, 5:36 PM

#

blazing rune I had Claude modify the prompt slightly and in the web dev arena, gemini 2.0 fla...

yea llama is so bad

blazing rune Apr 7, 2025, 5:36 PM

#

llama 4 as a whole is so disappointing, idk how the youtubers can keep praising it

#

Matthew Berman never says anything negative afaik

keen beacon Apr 7, 2025, 5:36 PM

#

just avoid ai youtube lol

blazing rune Apr 7, 2025, 5:36 PM

#

AI Explained and a couple others are good though

torn mantle Apr 7, 2025, 5:37 PM

#

balmy mist lmaoo gemini turned the pokemon around now wtf

hmm something like this

balmy mist Apr 7, 2025, 5:38 PM

#

what changed the long columns?

#

i cant play vid

#

i love seeing gemini cook

torn mantle Apr 7, 2025, 5:39 PM

#

balmy mist what changed the long columns?

idk gemini decided that xd

balmy mist Apr 7, 2025, 5:39 PM

#

lmaooo

calm sequoia Apr 7, 2025, 5:40 PM

#

poll_question_text

Best of the new and upcoming

victor_answer_votes

5

total_votes

19

victor_answer_id

7

victor_answer_text

I don't know, man

victor_answer_emoji_name

🥱

balmy mist Apr 7, 2025, 5:41 PM

#

oh sh!t i never tried grok wiht the pokemon thing, hold up imma try it now

#

lmaoo grok struggling

#

i really widh i could try o1 pro with this

#

i have a feeling it will do the best

#

once open ai release their new modle i will buy it

#

its still thinking damn

#

grok cooking??

leaden palm Apr 7, 2025, 5:46 PM

#

balmy mist i really widh i could try o1 pro with this

if you want i could run plain o1 high

balmy mist Apr 7, 2025, 5:47 PM

#

nahh i need pro man, i think o1 pro still might be second best imo just because of the extra time it takes to compute

#

you can try it tho:

📎 message.txt

#

finally grok finished after 420 seconds

leaden palm Apr 7, 2025, 5:48 PM

#

some suspect pro is just best of 10

#

anyway running now

leaden palm Apr 7, 2025, 5:50 PM

#

balmy mist nahh i need pro man, i think o1 pro still might be second best imo just because ...

https://liveweave.com/bdNibz not super good

barren prairie Apr 7, 2025, 5:52 PM

#

poll_question_text

Who is the worst model

victor_answer_votes

8

total_votes

19

victor_answer_id

1

victor_answer_text

Llama4

torn mantle Apr 7, 2025, 5:52 PM

#

a bit better

balmy mist Apr 7, 2025, 5:52 PM

#

grok is butt

torn mantle Apr 7, 2025, 5:52 PM

#

balmy mist grok is butt

xd

balmy mist Apr 7, 2025, 5:53 PM

#

leaden palm https://liveweave.com/bdNibz not super good

the code is cut off, screen record it

balmy mist Apr 7, 2025, 5:53 PM

#

torn mantle a bit better

wow this is with three.js?

leaden palm Apr 7, 2025, 5:53 PM

#

balmy mist the code is cut off, screen record it

oh weird, liveweave must truncate

#

here's the full code

📎 code.js

torn mantle Apr 7, 2025, 5:53 PM

#

balmy mist wow this is with three.js?

the attacks animations yea

torn mantle Apr 7, 2025, 5:54 PM

#

balmy mist wow this is with three.js?

a bit of old design

balmy mist Apr 7, 2025, 5:57 PM

#

i dont see animations tho

#

only an image in the video

balmy mist Apr 7, 2025, 5:58 PM

#

torn mantle a bit of old design

looks cool to me tho

#

the style

torn mantle Apr 7, 2025, 5:59 PM

#

cards redesign

balmy mist Apr 7, 2025, 5:59 PM

#

this is the starter prompt i used:

📎 message.txt

#

but you can do any starter prompt as long as you get to a pokemon type of game

balmy mist Apr 7, 2025, 6:00 PM

#

torn mantle cards redesign

wow

#

what did you say restyle with cards?

dapper storm Apr 7, 2025, 6:00 PM

#

Why do you guys think Llama 4 got such a high Elo if it sucks

balmy mist Apr 7, 2025, 6:00 PM

#

gemini is so good man

balmy mist Apr 7, 2025, 6:01 PM

#

dapper storm Why do you guys think Llama 4 got such a high Elo if it sucks

there is a video on it, but something to do with training on benchmark test data or sum, i honestly dont know they also used a experimental version of the model when it got that score

#

btw new ai explained vid: https://www.youtube.com/watch?v=wOBqh9JqCDY

YouTube

AI Explained

AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-clim...

The latest on Llama 4, and whether it signals a slowdown in AI, or solid progress. Plus, a deep dive on that viral prediction of superintelligence by 2027, and Dario Amodei’s cautionary words on what could stop AI progress in its tracks. o3 news, and more, as well.

Weights & Biases: https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medi...

▶ Play video

#

he my fav youtuber for ai lol

#

he gonna cook llama 4 about to watch now lmaoo

#

@torn mantle ik why i cant see video now on your stuff, my computer needs to restart smh, brb lol

torn mantle Apr 7, 2025, 6:04 PM

#

balmy mist <@295243581818404874> ik why i cant see video now on your stuff, my computer nee...

last one is a pic

#

it may be internet issues

#

https://justpaste.it/j3v7a

tetet

balmy mist Apr 7, 2025, 6:15 PM

#

yeah grok bunns

#

thanks

#

wait was that o1 pro?

balmy mist Apr 7, 2025, 6:16 PM

#

torn mantle https://justpaste.it/j3v7a

wow this is so good man, i cant get three,js to work when i prompt the model

#

bruhh

#

@torn mantle the opponent does respond?

#

but the animations are so good

torn mantle Apr 7, 2025, 6:18 PM

#

balmy mist <@295243581818404874> the opponent does respond?

no xd

#

its a bug

#

but im trying to push gemini to the limit

balmy mist Apr 7, 2025, 6:20 PM

#

keep going!!!

#

are you using groundig with google search?

keen beacon Apr 7, 2025, 6:25 PM

#

it streams response tokens immediately

#

so no

#

its literally just 4o

#

updated

#

marketing

#

i benchmarked gpqa diamond and i got 67% (quasar). march gpt 4o got 66% (according to artificial analysis)

#

besides the 9 billion other things that indicate its origin

leaden palm Apr 7, 2025, 6:30 PM

#

it's a 4o-like model

#

it doesn't reason

#

to get training data for real use cases

north silo Apr 7, 2025, 6:32 PM

#

Whats some theories about why nightwhisper got removed?

balmy mist Apr 7, 2025, 6:33 PM

#

because google do not need to release it now

#

and they got the info they needed already

#

whats your theory?

north silo Apr 7, 2025, 6:34 PM

#

that makes sense

#

i think they'll release it right before or right after o3 drops in a couple weeks

balmy mist Apr 7, 2025, 6:35 PM

#

2.5 pro is already SOTA and craps on everything else

north silo Apr 7, 2025, 6:35 PM

#

yea

balmy mist Apr 7, 2025, 6:35 PM

#

they just can keep trainig nightwhisper and keep cooking

balmy mist Apr 7, 2025, 6:35 PM

#

north silo i think they'll release it right before or right after o3 drops in a couple week...

yeah

#

they will def put it back on webdev tho

#

prob next week after they update it

#

TBH i feel i need to start paying for gemini or donating to google

#

thats how much i love gemini 2.5 lol

#

sh!t has changed my life

#

what looks better yall?

keen beacon Apr 7, 2025, 6:45 PM

#

2nd

balmy mist Apr 7, 2025, 6:46 PM

#

i thought so, thank you, gonna make the text more clear in the bubbles

leaden palm Apr 7, 2025, 6:46 PM

#

balmy mist what looks better yall?

why are you trying to put square pegs in round holes

balmy mist Apr 7, 2025, 6:46 PM

#

gonna use this to keep track of all the stuff i make with ai, or prompt cause ai making it lol

balmy mist Apr 7, 2025, 6:47 PM

#

leaden palm why are you trying to put square pegs in round holes

ask gemini

#

it decided everything

#

i just gave it the og prompt of the landing page and vision

#

thats how it interpreted it, now im just cleaning it up

leaden palm Apr 7, 2025, 6:47 PM

#

gemini moment

balmy mist Apr 7, 2025, 6:48 PM

#

lmaoo

#

need to add this to app(for images), but this is the next version that gemini did after i said fix the test in bubbles so it is clearer lol

#

so clean

#

gonna try and host this on netifly

#

but want at least 10 more projects

keen beacon Apr 7, 2025, 6:51 PM

#

idk how it would look maybe change the circles into rounded squares (like ktibow kinda said, not a web dev lol)

balmy mist Apr 7, 2025, 6:52 PM

#

hmm but that would change the bubble theme, but ill try ill prompt it now

keen beacon Apr 7, 2025, 6:52 PM

#

balmy mist hmm but that would change the bubble theme, but ill try ill prompt it now

oh i just saw that lol

balmy mist Apr 7, 2025, 6:52 PM

#

lmaooo

#

hmm i could tyr 3D bubbles?

#

im not sure it can do that tho

#

if it does imma marry gemini

keen beacon Apr 7, 2025, 6:53 PM

#

what a heavy webpage tho lol

balmy mist Apr 7, 2025, 6:54 PM

#

ik bro its all enclosed in html

#

it could way smoother with react tbh

vast turret Apr 7, 2025, 6:54 PM

#

flannel is good, it's never lost a round. If it's llama reasoning then that's very exciting.

balmy mist Apr 7, 2025, 6:55 PM

#

keen beacon what a heavy webpage tho lol

you got any suggestions i should do? Imma host the app and post it for help, just want a place for all the crap i make

#

@torn mantle when you back online let me know what you think

keen beacon Apr 7, 2025, 6:57 PM

#

balmy mist you got any suggestions i should do? Imma host the app and post it for help, jus...

im terrible at web design lol i dunno. just do whatever u want xd. im bouta go to bed (havent slept in a while) and am just coming up with blanks

cloud meadow Apr 7, 2025, 6:57 PM

#

balmy mist TBH i feel i need to start paying for gemini or donating to google

Do NOT give money to google. They don't need it 😛

keen beacon Apr 7, 2025, 6:57 PM

#

keen beacon im terrible at web design lol i dunno. just do whatever u want xd. im bouta go t...

didnt even read ur page properly (bubbles) lol. should just go to bed

cloud meadow Apr 7, 2025, 6:57 PM

#

They also engage in predatory practices

balmy mist Apr 7, 2025, 6:57 PM

#

keen beacon didnt even read ur page properly (bubbles) lol. should just go to bed

thanks, gn bro lol

cloud meadow Apr 7, 2025, 6:58 PM

#

You can get gemini 2.5 with google one right?

balmy mist Apr 7, 2025, 6:58 PM

#

cloud meadow They also engage in predatory practices

but they gave me gemini 2.5

#

yeah but i used studio for free

cloud meadow Apr 7, 2025, 6:58 PM

#

Then no need to pay for it

balmy mist Apr 7, 2025, 6:58 PM

#

ik but i am so grateful

cloud meadow Apr 7, 2025, 6:58 PM

#

Unless you also want the benefits of google one

balmy mist Apr 7, 2025, 6:58 PM

#

how do i show my appreciation?

cloud meadow Apr 7, 2025, 6:58 PM

#

balmy mist ik but i am so grateful

The engineers did all the work

cloud meadow Apr 7, 2025, 6:59 PM

#

balmy mist how do i show my appreciation?

Date one of the google deepmind engineers

balmy mist Apr 7, 2025, 6:59 PM

#

hmm good idea

cloud meadow Apr 7, 2025, 6:59 PM

#

Lmao

balmy mist Apr 7, 2025, 6:59 PM

#

lmaoo

#

square looks good but does not match theme as much

#

and i cant do 3d smh

wintry tinsel Apr 7, 2025, 7:15 PM

#

I just had an epiphany the reason they roll out 2.5 pro to free users is 2.5 ultra for paid users

balmy mist Apr 7, 2025, 7:17 PM

#

thats not out yet tho

#

but it dont matter cause 2.5 pro is so good like if nightwhisper is ultra i would upgrade but you cant go wrong with 2.5 pro

wintry tinsel Apr 7, 2025, 7:20 PM

#

The point is that A new most capable/best model may be around the corner

torn mantle Apr 7, 2025, 7:23 PM

#

balmy mist square looks good but does not match theme as much

nice

balmy mist Apr 7, 2025, 7:27 PM

#

you like the square one better?

balmy mist Apr 7, 2025, 7:27 PM

#

balmy mist so clean

vs this @torn mantle ?

#

imma change the app to a react app cause all this html is getting nasty lol

#

but that is going to be a lot of work, gonna need another branch for that

stone lark Apr 7, 2025, 7:28 PM

#

havent tryed nightwhisperer but does any one here know if its also good for non coding related questions or is it just a coding finetune of 2.5 pro?

torn mantle Apr 7, 2025, 7:28 PM

#

balmy mist square looks good but does not match theme as much

this

balmy mist Apr 7, 2025, 7:28 PM

#

torn mantle this

😦

#

lmaooo

balmy mist Apr 7, 2025, 7:29 PM

#

stone lark havent tryed nightwhisperer but does any one here know if its also good for non ...

i think its just as good as gemini 2.5 on non coding stuff imo, some people say its worse, but with my tests it was equal

#

its def a fine tuned version of it imo

#

but i think wiht nightwhisper it follows directions really well

oblique flint Apr 7, 2025, 7:30 PM

#

wintry tinsel I just had an epiphany the reason they roll out 2.5 pro to free users is 2.5 ult...

I think it's 2 things:

With the free ratelimited api they try to get more devs on board with gemini
Free gemini for consumers is mainly to steal market share from chatgpt I think. Chatgpt is still way way more mainstream than gemini

#

the ultra class of models is pretty much dead in the water as proven by gpt 4.5 and arguably llama behemoth

north vale Apr 7, 2025, 7:39 PM

#

the ultra class of models will pop off in a year or two (actually prolly most likely this year iwth gpt-5)

#

they're just too costly right now for it to be worthwhile to post train all the juice out of them

#

but they'll be so much better when they're trained to a similar degree that current small models are

torn mantle Apr 7, 2025, 7:46 PM

#

balmy mist 😦

im actually having fun with this pokemon game

balmy mist Apr 7, 2025, 7:49 PM

#

lmaoo ikr bro

#

it inspired me to create a webpage just for small scale creations like this

#

you updated yours?

#

i been focusing on the website for all the creations, gonna add more to the pokemon game later, but i need to make an app that can iteratively just feedback in the outputs on new sessions based on a system prompt, that way i could just put in one prompt and let it cook for hours lmaoo

#

on a free model this wouldnt be bad

#

especially if i start with a good build like from gemini 2.5 pro and then use quasar after for hours, come back in the morning and see what beast it made

torn mantle Apr 7, 2025, 7:55 PM

#

balmy mist you updated yours?

yea

#

im still updating the code

#

trying some cool stuff

balmy mist Apr 7, 2025, 8:00 PM

#

im making the app/script now to loop the outputs and inputs lol, ai got me tripping

#

yall ever use augemnt code?

#

sh!t is cracked fr

granite topaz Apr 7, 2025, 8:39 PM

#

Can somone help out, theres some sort of code with the claude AI I think and I'm not sure why I posted in the help section but I think since I'm a new member it doesnt post

#

this is what it keeps saying: NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.

(error_code: 50004, Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'messages: text content blocks must be non-empty'}})

#

And this has been happening for around 2 weeks but it was working amazingly before, i tried other claude models and they are also having the same issues

torn mantle Apr 7, 2025, 8:56 PM

#

balmy mist yall ever use augemnt code?

wdym

balmy mist Apr 7, 2025, 8:59 PM

#

its an extension for vsc for ai coding

#

its really good

night trout Apr 7, 2025, 9:06 PM

#

Q: Does anyone know if the codebase for Webdev Arena is open source at all?

#

I can't find it on the LMArena Github, I'm not sure if it's elsewhere.

torn mantle Apr 7, 2025, 9:08 PM

#

balmy mist its an extension for vsc for ai coding

im just using github copilot tbh

#

it doesnt cost much

balmy mist Apr 7, 2025, 9:13 PM

#

ahh what is the context window for that?

#

im having trouble with large context windows

#

i was gonna use openrouter to use gemini 2.5 pro

#

quasar is free right?

#

let me see your updates tot eh code, im almost done with the iteration script to allow me to run any number of loops on a prompt

#

gonna leave my app as an html app for now

torn mantle Apr 7, 2025, 9:22 PM

#

balmy mist ahh what is the context window for that?

its quite high

#

it depends on the models used

#

@balmy mist

balmy mist Apr 7, 2025, 9:41 PM

#

torn mantle <@367710025994731520>

omg bro you are going nuts

#

let me play it

balmy mist Apr 7, 2025, 9:42 PM

#

torn mantle its quite high

okay imma try it, its on vsc?

#

lmaoo

#

me too

#

i heard that llama4 is actually good

#

its just not synced right lol

#

maybe we slept on meta

night trout Apr 7, 2025, 9:50 PM

#

balmy mist what looks better yall?

This is yours? Is there a backend generating screenshots / doing tests at all?

ocean vortex Apr 7, 2025, 9:51 PM

#

lol there’s not much to distill. It actually performs worse than chatgpt-latest now on many metrics

#

And upcoming gpt4o version is to have 1M context looks like

#

Where it can still excel is context awareness/vibe but that’s just about impossible to capture or distill. Spatial awareness is tricky as well, though that area is not class leading on 4.5 either.

torn mantle Apr 7, 2025, 10:05 PM

#

balmy mist okay imma try it, its on vsc?

yea

raven void Apr 7, 2025, 10:09 PM

#

quasar alpha is new 4o?

balmy mist Apr 7, 2025, 10:26 PM

#

night trout This is yours? Is there a backend generating screenshots / doing tests at all?

no i used gemini so basically no backend just html files with js and ccs integrated within

#

very messy, but I was trying to see if gemini can jsut build stand-alone stuff

#

without extra stuff

#

like one shot apps

night trout Apr 7, 2025, 10:55 PM

#

Got it. I want to find some time in the next couple days to make a better version of rivals.tips, that's why I ask. I'm thinking a gallery of prompt results for every single LLM + automated tests.

balmy mist Apr 7, 2025, 11:02 PM

#

yeah automated tests would be amazing

drifting thorn Apr 7, 2025, 11:15 PM

#

Idk if R2 will become master of creativity or master of hallucinations

drifting thorn Apr 7, 2025, 11:22 PM

#

blazing rune Matthew Berman never says anything negative afaik

He’s always exaggerating about AI’s abilities with words like ,”shocking””incredible””life-changing”

#

If it’s o3 then OpenAI is over. I tend to think it’s GPT4.5o thingy

drifting thorn Apr 7, 2025, 11:32 PM

#

ocean vortex Where it can still excel is context awareness/vibe but that’s just about impossi...

Is GPT 4.5 a failure?

drifting thorn Apr 8, 2025, 12:01 AM

#

From its messy product releases it implies that there are two voices in OpenAI(at least)

#

These two voices are probably arguing over the future developments of their AI models

#

One is the route of GPT 4.5, and the another is GPT 4o 0326

#

Since the 0326 team brings profit to OpenAI, Sam Altman is releasing o3 again with new o4 mini

#

GPT 4.5 team may be the maker of Quasar Alpha

leaden palm Apr 8, 2025, 12:19 AM

#

llama 4 scores dropping to reality?

#

(or ideally, they release the chat tune)

vivid oyster Apr 8, 2025, 12:21 AM

#

leaden palm

What is hf

leaden palm Apr 8, 2025, 12:21 AM

#

vivid oyster What is hf

hugging face 😭

vivid oyster Apr 8, 2025, 12:22 AM

#

So whast he difference between hugging face and normal

night trout Apr 8, 2025, 12:22 AM

#

Yeah it seems like "release the chat tune" is the obvious action here. They have a model which reaches those scores.

vivid oyster Apr 8, 2025, 12:22 AM

#

Lama

leaden palm Apr 8, 2025, 12:22 AM

#

vivid oyster So whast he difference between hugging face and normal

the hugging face version == the weights you can go download == the normal version

#

they could've said "adding the normal version to the arena"

vivid oyster Apr 8, 2025, 12:23 AM

#

So the arena changed the lama to be weaker and sabotage the results

leaden palm Apr 8, 2025, 12:23 AM

#

vivid oyster So the arena changed the lama to be weaker and sabotage the results

arguably it's meta who's doing the sabotaging

leaden palm Apr 8, 2025, 12:25 AM

#

leaden palm

man i forgot how fun viewing the raw data was

#

(offloading interview questions to ai)

#

i love the arena

balmy mist Apr 8, 2025, 12:52 AM

#

thats what I was saying

#

def noticed the improvement

drifting thorn Apr 8, 2025, 2:59 AM

#

There are videos that shows X1 is actually shxtty now

balmy mist Apr 8, 2025, 3:19 AM

#

@torn mantle let me see your progress, i just finished making the refinement app, took a minute but got the system working nicely, just trying to manage max tokens

keen beacon Apr 8, 2025, 3:26 AM

#

Theory: 4.5 has no post training

#

which is why it's garbage

last zinc Apr 8, 2025, 3:40 AM

#

@balmy mist

dreamy mist Apr 8, 2025, 3:41 AM

#

when will the web search leaderboard release?

drifting thorn Apr 8, 2025, 4:05 AM

#

what are your thoughts on Cline?

last zinc Apr 8, 2025, 4:05 AM

#

@balmy mist

drifting thorn Apr 8, 2025, 4:13 AM

#

gimme free 3.7 api please

torn mantle Apr 8, 2025, 4:16 AM

#

balmy mist <@295243581818404874> let me see your progress, i just finished making the refin...

that was the last update

balmy mist Apr 8, 2025, 4:21 AM

#

last zinc <@367710025994731520>

what?

balmy mist Apr 8, 2025, 4:22 AM

#

drifting thorn what are your thoughts on Cline?

its good, but I really like roo code, gives you good customization and works just as good as the rest

balmy mist Apr 8, 2025, 4:23 AM

#

torn mantle that was the last update

damn lol, im gonna open source this so that anyone could run this, i cant imagine this with nightwhisper lol

torn mantle Apr 8, 2025, 4:24 AM

#

i may add more stuff later

drifting thorn Apr 8, 2025, 4:26 AM

#

cline is stupid in doing my tasks

#

I told it to store some files and it just failed

#

idk if it's because i don't have the claude 3.7 api

balmy mist Apr 8, 2025, 4:59 AM

#

use roo code bro

#

trust its fire

#

i did a bunch of research on it with ai and youtube and reviews and tests and roo seems to be the best

novel flame Apr 8, 2025, 5:00 AM

#

drifting thorn what are your thoughts on Cline?

Cline is the best AI coding IDE right now. It lacks Cursors “suggested editing” autocomplete, but otherwise it is noticeably better.

However, it only works properly with 3.5/3.7 Sonnet. Any other model just ends up choking somehow.

Roo Code is essentially the same as Cline though, so I haven’t used that one.

balmy mist Apr 8, 2025, 5:00 AM

#

and if you create a bunch of google accounts you can have free gemini 2.5 pro lol

#

cline is good to, but my openrouter keys never work for some reason

#

so they make me use the default cline and thats to expensive

balmy mist Apr 8, 2025, 5:02 AM

#

novel flame Cline is the best AI coding IDE right now. It lacks Cursors “suggested editing” ...

i been using google with roo and its been amazing

#

i am also using studio to help with the costs

#

so give my codebase to studio with a code change i want, it gives me the code back and then i give it to roo

drifting thorn Apr 8, 2025, 5:19 AM

#

Now I put trust to Flowith, hoping it’ll be my solution

#

Since my creative writing is actually some kind of ‘fanfic’

#

It’s basically me going into a world where multiple fictions happen there

#

So there are lots of ‘settings’ for all of the characters

ivory schooner Apr 8, 2025, 5:30 AM

#

我的24k~我的24k~求求恢复一下吧，虽然Behemoth（可能是基于24k的）还没有来

torn mantle Apr 8, 2025, 5:31 AM

#

ivory schooner 我的24k~我的24k~求求恢复一下吧，虽然Behemoth（可能是基于24k的）还没有来

?

hardy pecan Apr 8, 2025, 5:46 AM

#

I suspect this chat should be english only

torn mantle Apr 8, 2025, 5:46 AM

#

balmy mist damn lol, im gonna open source this so that anyone could run this, i cant imagin...

this is the latest one

#

the map was generated using sonnet

#

couldnt get it with gemini

#

but everything else is gemini

drifting thorn Apr 8, 2025, 6:09 AM

#

torn mantle ?

You can tell it’s Chinese

torn mantle Apr 8, 2025, 6:14 AM

#

drifting thorn You can tell it’s Chinese

nice

#

couldn't tell tbh

drifting thorn Apr 8, 2025, 6:15 AM

#

ivory schooner 我的24k~我的24k~求求恢复一下吧，虽然Behemoth（可能是基于24k的）还没有来

It means that he wants his 24k back

upper wolf Apr 8, 2025, 6:15 AM

#

New model dreamtides

drifting thorn Apr 8, 2025, 6:15 AM

#

Although 24k may be based on Behemoth

#

Flowith seems to have a functional knowledge base

#

Much better than the knowledge base in Cherry Studio or the MCP Knowledge Graph Memory in Cline

#

And sometimes I think using Cline for creative writing is kinda overkilled

drifting thorn Apr 8, 2025, 6:53 AM

#

Perplexity is dumb as fxxk

ivory schooner Apr 8, 2025, 7:27 AM

#

When can 24k be opened to some users to play

humble sonnet Apr 8, 2025, 7:27 AM

#

site is down ?

ivory schooner Apr 8, 2025, 7:27 AM

#

It's better to use a mirror station like one

sage raptor Apr 8, 2025, 7:29 AM

#

https://x.com/OfficialLoganK/status/1909443890366890200

Logan Kilpatrick (@OfficialLoganK) on X

Should be a good week 👍

#

new models incoming

ocean vortex Apr 8, 2025, 7:34 AM

#

drifting thorn Is GPT 4.5 a failure?

Technically yes. It was supposed to be gpt5 but they named it gpt4.5 after seeing how it performs. And it wasn't even the top performing non-reasoning model looking at the competition, at the time of release

drifting thorn Apr 8, 2025, 7:46 AM

#

And I don’t think a company like OpenAI can self-correct in just a month

#

So I think there are two voices in OpenAI representing 4.5 and 4o 0326

split kayak Apr 8, 2025, 7:54 AM

#

ok

drifting thorn Apr 8, 2025, 7:57 AM

#

What is 24_karat_gold’s actual model name?

kind cloud Apr 8, 2025, 8:13 AM

#

upper wolf New model dreamtides

Screenshot_2025-04-08-17-13-21-199-edit_com.android.chrome.jpg

keen beacon Apr 8, 2025, 8:18 AM

#

another one!

#

just checked - not on webdev arena

keen beacon Apr 8, 2025, 8:39 AM

#

#general message

#

a private model i have access to that i believe to be o3 gets 6/10

ivory schooner Apr 8, 2025, 8:43 AM

#

刚才我在deepsider(tm)（一个edge的插件）聊了Maverick两遍

#

怎么说呢.....完全跟lmarena有区别

#

（我的24k、spider~😂 😂 😂 )

ocean vortex Apr 8, 2025, 8:46 AM

#

kind cloud

had a look and yeah this is most likely 2.5 pro chat (non-reasoning)

keen beacon Apr 8, 2025, 8:51 AM

#

are they gonna add instruct to the name if the regular version is reasoning

#

wondner what theyre gonna name it if it actually is 2.5 pro instruct

keen beacon Apr 8, 2025, 8:51 AM

#

ocean vortex had a look and yeah this is most likely 2.5 pro chat (non-reasoning)

doubt

#

in every matchup i've got it in, it has taken a bit to start streaming a response, and i don't think they're releasing the base model. logan said on twitter something along those lines iirc

keen beacon Apr 8, 2025, 8:53 AM

#

keen beacon in every matchup i've got it in, it has taken a bit to start streaming a respons...

i think logan said they would do an instruct version

eager mica Apr 8, 2025, 8:54 AM

#

Haven't looked at them yet, let me check.

ivory schooner Apr 8, 2025, 8:56 AM

#

Thus, I miss the conversations with 24k and Spider

keen beacon Apr 8, 2025, 9:01 AM

#

dreamtides is a 2.5 line model (knows stuff in dec 2024 etc)

keen beacon Apr 8, 2025, 9:03 AM

#

keen beacon in every matchup i've got it in, it has taken a bit to start streaming a respons...

ya its also a thinking model

#

its very fast

#

did stargazer get removed? this is probably 2.5 flash

#

another pro would be too fast?

#

i timed the thought process for a puzzle: 18.70 sec whilst gem 2.5 pro took 25.5 sec to think (output excluded)

keen beacon Apr 8, 2025, 9:08 AM

#

keen beacon did stargazer get removed? this is probably 2.5 flash

stargazer is still there

keen beacon Apr 8, 2025, 9:08 AM

#

keen beacon its very fast

yeah i noticed that too

torn mantle Apr 8, 2025, 9:13 AM

#

keen beacon dreamtides is a 2.5 line model (knows stuff in dec 2024 etc)

interesting

#

pro?

#

non thinking?

torn mantle Apr 8, 2025, 9:14 AM

#

keen beacon wondner what theyre gonna name it if it actually is 2.5 pro instruct

i see

keen beacon Apr 8, 2025, 9:15 AM

#

torn mantle interesting

its a 2.5 thinking model

torn mantle Apr 8, 2025, 9:15 AM

#

keen beacon its a 2.5 thinking model

oh

keen beacon Apr 8, 2025, 9:15 AM

#

so theres two unreleasde 2.5 thinking models, stargazer and this one

#

2.5 flash and 2.5 flash lite i guess

keen beacon Apr 8, 2025, 9:16 AM

#

keen beacon stargazer is still there

prob 2.5 flash and 2.5 flash lite

#

yeah it's possible, but o3's model cutoff (by the looks of it) is (still..!) october 2023

eager mica Apr 8, 2025, 9:39 AM

#

I find astounding that they'd do RLHF and determine the overall direction of their chat model on just ~2500 data points. I had a look at the prompts (...) and I did see several of mine, actually.

torn mantle Apr 8, 2025, 9:45 AM

#

dreamtides seems like flash 2.5

#

gemini 2.5 pro thinking

alpine coral Apr 8, 2025, 9:45 AM

#

nice work!

torn mantle Apr 8, 2025, 9:46 AM

#

vs dreamtides

alpine coral Apr 8, 2025, 9:47 AM

#

keen beacon so theres two unreleasde 2.5 thinking models, stargazer and this one

there's also lunarcall - i got it a couple of times yesterday, seemed pretty decent

torn mantle Apr 8, 2025, 9:47 AM

#

its also 2.5 flash thinking

alpine coral Apr 8, 2025, 9:47 AM

#

same set of questions (about 20)

torn mantle Apr 8, 2025, 9:47 AM

#

lunarcall

alpine coral Apr 8, 2025, 9:48 AM

#

yeah that would make perfect sense tbh

#

it's a thinking model, but not up to 2.5 pro

alpine coral Apr 8, 2025, 9:50 AM

#

keen beacon did stargazer get removed? this is probably 2.5 flash

i felt stargazer was consistently comparable to 2.5 Pro tbh

ocean vortex Apr 8, 2025, 9:52 AM

#

keen beacon i timed the thought process for a puzzle: 18.70 sec whilst gem 2.5 pro took 25.5...

if it's flash then it performs impressively. But speed alone is not really an indicator

keen beacon Apr 8, 2025, 9:52 AM

#

alpine coral i felt stargazer was consistently comparable to 2.5 Pro tbh

Was it stronger in some regards or just on par/slightly worse

ocean vortex Apr 8, 2025, 9:52 AM

#

2.5 pro endpoint has way more load than this lmarena exclusive one

#

esp since 2.5 pro blew up now lol

drifting thorn Apr 8, 2025, 9:53 AM

#

I just know that there’s RAG in Gemini app

#

I just sent it the original novels(the settings) and it’s still working on it

alpine coral Apr 8, 2025, 9:55 AM

#

keen beacon Was it stronger in some regards or just on par/slightly worse

giving the same 'quiz' across 3 prompts. in the end they come out even (given just the first 1/3 of the quiz, stargazer does very well in cases; though Gem Pro 2.5 holds up throughout each message)

keen beacon Apr 8, 2025, 9:55 AM

#

wow

#

im not sure what stargazer is now

alpine coral Apr 8, 2025, 9:57 AM

#

yeah if it's a flash model that'll be super impressive

#

but i find it hard to understand how that would work ha

keen beacon Apr 8, 2025, 10:05 AM

#

alpine coral same set of questions (about 20)

is this quiz private?

ocean vortex Apr 8, 2025, 10:07 AM

#

keen beacon in every matchup i've got it in, it has taken a bit to start streaming a respons...

they are releasing it I think though they may name it differently. This could be flash-thinking I suppose, in that case potentially a very good distill

#

either that or pro 2.5 non-exp. I did notice that delay in streaming later as well 🧐

keen beacon Apr 8, 2025, 10:11 AM

#

ocean vortex either that or pro 2.5 non-exp. I did notice that delay in streaming later as we...

No it's a thinking model whilst testing I gave it a question and it hung up

#

For minutes

#

Both models

#

(one was not thinking)

ocean vortex Apr 8, 2025, 10:12 AM

#

well 2.5 pro is a thinking model lol

keen beacon Apr 8, 2025, 10:12 AM

#

Oh I thought u said it was 2.5 pro instruct

ocean vortex Apr 8, 2025, 10:12 AM

#

but it is ~~experimental~~ preview

keen beacon Apr 8, 2025, 10:12 AM

#

Mb

#

so there's stargazer, lunarcall, dreamtides

#

hmmmm

cedar tide Apr 8, 2025, 10:19 AM

#

keen beacon so there's stargazer, lunarcall, dreamtides

and NightWhisper?

keen beacon Apr 8, 2025, 10:19 AM

#

cedar tide and NightWhisper?

yes but it was removed. given it was only available in web dev arena im inclined to believe its a web dev tune

#

they included it in some of the system card benchmarks

#

give me a prompt if you want

alpine coral Apr 8, 2025, 10:27 AM

#

keen beacon is this quiz private?

here is that particular 'quiz'

📎 April_arenaQuizcondensed.txt

#

and here is gem-pro-2.5 nailing it.. providing all but just two of the correct responses..

#

i can't be bothered justifying its worth

#

take it or leave it... the whole approach is flawed af - but i find it useful.

eager mica Apr 8, 2025, 10:30 AM

#

They did have various models, some more formal than others, possibly only differing by their system prompt, but still, with 2500 votes over the course of about 10 days, that means only relatively few people drove the general direction of model outputs on the Arena.

keen beacon Apr 8, 2025, 10:39 AM

#

alpine coral here is that particular 'quiz'

private model response:

• Arabella will head straight for the transparent carrier; she last saw (and therefore believes) the cat is still inside it.
• Rod sits under the centre of mass, which is to the right of the blade’s hole, so the right‑hand end (B) is heavy. Remove a little more metal from B.
• C – deduce ⊂ infer. Deduction is one kind of inference; inference also embraces induction, abduction, etc.
• “Puzzle.” You can make, play or explain a puzzle; add “‑d” and you have “puzzled” (confused).
• The letter “s” appears 5 times in “discussionss”.
• True – everyone, including Janis Joplin, lives right up to the end of their life.
• “Hawk tuah” began as a Nashville, Tennessee street‑interview clip in which a young woman describes a bedroom move, punctuating it with a mock spit sound (“hawk‑tuah”). TikTok spread it into a meme.
• Bob would rather be abruptly sacked from the job he dislikes (ii) than scratch his newly‑bought dream car (i).
• The river is only knee‑deep and slow; the farmer simply wades across leading the Clydesdale. No dinghy trips needed.
• On a digital display there is no angular separation of hands—angle = 0°.
• 8J+Nhg== decodes to 🍆, the eggplant emoji, widely used online as a cheeky phallic stand‑in.

FOLLOW‑UPS
• [i] = “nothing.” The rich need it, the poor have it, eat it and you die.
• Bill sees the marker at page 65 and, not knowing it was moved, assumes that’s where Bob stopped.
• PASS – no reliable source dates “quad‑cross mow” entering common lawn‑care slang.
• Trailer drifting left whilst reversing: turn the steering wheel left‑hand‑up (counter‑clockwise). Back of the car moves right, pushing the hitch right and bringing the trailer right.
• d) Writing efficiently (concise) actually demands extra time and attention.
• It’s the food in a picnic. Each “family, couple or individual” item slides one‑by‑one down a throat (the “big, squishy slide”)—nobody being eaten has fun.
• Rope already exceeds the 1 000 ft span; Danny tied one end to the rock, lobbed the rock across the gap (using the carabiners only to clip in once both sides were anchored) and hauled the line taut.
• Still two rungs underwater at low tide—the boat rises and falls with the water, so the ladder’s relative position doesn’t change.

Pros & cons, kayaking vs. horse riding, plus crepes—pithy prose only:

Kayaking skims you over water in silent independence: inexpensive kit, minimal upkeep, whole‑body workout, serenity in coves unreachable on foot; downside is weather dependency, wet gear, and the long roof‑rack shuffle. Horse riding trades plastic hull for warm muscle—bond, height, power, romantic scenery—yet costs balloon (feed, shoeing, agistment) and the learning curve is steeper, with real falls. For weekends: kayak if you crave solo, cheap, splashy freedom; saddle up when you want companionship, land vistas and can stomach the bill.

Crêpes: whisk two eggs, a cup of milk, half‑cup water, a cup of flour, two tablespoons melted butter, pinch of salt; rest batter. Hot buttered pan, thin swirl, flip when edges lift, stack. Fill with whatever makes you smile.

#

working on it

#

ADD_COMMENT(@mads_:121, "St. Louis Fed (FRED): M2 grew ~25 % in 2020 (Trump), ~12 % in 2021 (Biden). Growth‑rate point is broadly correct, but level continued rising under both.")
ADD_COMMENT(@hallydallyffs:95, "Economists say ‘stagflation’ (term coined 1965 UK), not ‘flag‑station’.")
ADD_COMMENT(@levelraptor:39, "Earliest printed source of the “insanity” quote is 1981 NA manual; not Einstein.")

Claim – “Trump printed way more over COVID than Biden did” (@mads_:121, 128).
• Fed balance‑sheet & M2 data confirm a larger 2020 jump (ca. $3 tn QE + 25 % M2 growth).
• Fiscal impulse: CARES Act + Dec‑20 relief ≈ 14 % GDP vs. ARP 2021 ≈ 9 % GDP.
Counter‑claim – “Inflation was higher under Trump” (implied query @mads_:80).
• BLS CPI‑U: peak 2.9 % y/y (Jul‑18) under Trump vs. 9.1 % (Jun‑22) under Biden.
• Average CPI 2017‑20 ≈ 1.9 %; 2021‑23 ≈ 5.7 %. Claim is false.
Why money‑supply ≠ automatic inflation.
• Velocity of M2 collapsed 2020 (Q1 1.43 → Q2 1.10). Fisher equation (MV = PY) shows excess liquidity initially hoarded.
• Papers: Coibion & Gorodnichenko (2022, NBER w30371) and Jordà et al. (2023, AER) find supply‑chain shocks + demand re‑opening drove 70‑80 % of 2021‑22 price surge; monetary overhang mattered but with lags.
Policy lens.
• 2020 stimulus prevented a depression but front‑loaded inflationary pressure once velocity rebounded (mid‑2021).
• 2021 ARP added to demand when output gap was closing—Fed’s delayed tightening amplified the spike.
Verdict.
• Money‑printing comparison: growth‑rate statement correct; context (velocity, post‑2021 policy) missing.
• Inflation comparison: higher under Biden; transcript assertion reversed.
• Take‑away: Evaluate nominal aggregates jointly with velocity and fiscal timing; single‑variable narratives mislead.

TLDR: 2020 saw the biggest money‑supply jump (Trump), but inflation spiked later (Biden) once velocity and demand recovered; data contradicts the claim that inflation was higher under Trump while partly validating larger “printing” in 2020.

alpine coral Apr 8, 2025, 10:44 AM

#

keen beacon private model response: • Arabella will head straight for the transparent carri...

would get a 6 (or 7.. one of the questions is dodgy ha) in my books - very solid model

keen beacon Apr 8, 2025, 10:44 AM

#

i wonder what the difference would be if you did each individually