#general | Arena | Page 59

small haven Jun 17, 2025, 8:16 PM

#

its not playing on my end

willow grail Jun 17, 2025, 8:16 PM

#

sadly didnt film any juveniles

#

juvies walking is so goofy

jade egret Jun 17, 2025, 8:36 PM

#

crow good

agile heart Jun 17, 2025, 9:09 PM

#

will the Direct chat for the image generator be fixed like i keep getting a error on all models

#

GUYS lmarena is down fix it

torn mantle Jun 17, 2025, 9:11 PM

#

my?

agile heart Jun 17, 2025, 9:12 PM

#

sorry i type too fast but the site is broken and does a verify browser thing

#

i joined the server to bring up the fact that the image generator is broken

echo aurora Jun 17, 2025, 9:13 PM

#

agile heart i joined the server to bring up the fact that the image generator is broken

are you seeing just the image gen is broken or is text not working either? I'm assuming all image gen models aren't working?

agile heart Jun 17, 2025, 9:15 PM

#

echo aurora are you seeing just the image gen is broken or is text not working either? I'm a...

it only works for a hour then a minute later it says that something went wrong try again also the site is down and doing a vercel security thing

lime coral Jun 17, 2025, 9:20 PM

#

They teased Deep Think already has better scores than I/O version + second wave of trustee test (expected) in the X space

small haven Jun 17, 2025, 9:29 PM

#

w ui tweak update from xai 🔥

torn mantle Jun 17, 2025, 9:31 PM

#

small haven w ui tweak update from xai 🔥

HOLY

#

WOAH

#

😮

whole wagon Jun 17, 2025, 9:35 PM

#

How many days has it been since grok 3.5 was supposed to launch

#

Flash lite preview without thinking seems to underperform flash 2 (which never had thinking)

#

Also Gemini 2.5 Flash pricing without thinking went from $0.15/$0.60 to $0.30/$2.50

#

The only good things from today are that flash lite now has a thinking mode and Gemini 2.5 flash pricing with thinking went from $0.15/$3.50 to $0.30/$2.50
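
To make the pricing change above concrete, here is a minimal sketch that compares the quoted price pairs on one workload. It assumes the figures are USD per 1M tokens (input/output), which the chat does not state, and the workload size is purely hypothetical.

```python
# Prices as quoted in the chat, read as USD per 1M tokens (input, output).
# The per-1M-token unit is an assumption; the chat does not state it.
OLD_NO_THINKING = (0.15, 0.60)
OLD_THINKING = (0.15, 3.50)
NEW_FLASH = (0.30, 2.50)


def cost_usd(prices, input_tokens, output_tokens):
    """Cost of one workload at the given (input, output) prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 1M input tokens and 1M output tokens.
workload = (1_000_000, 1_000_000)
for label, prices in [
    ("old, no thinking", OLD_NO_THINKING),
    ("old, thinking", OLD_THINKING),
    ("new, either mode", NEW_FLASH),
]:
    print(f"{label}: ${cost_usd(prices, *workload):.2f}")
```

On this made-up workload the non-thinking cost rises from $0.75 to $2.80, while the thinking cost drops from $3.65 to $2.80. The real impact depends on the actual input/output token mix.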

#

Seems gains on the smaller end of models slow dramatically, only way is to scale up

echo aurora Jun 17, 2025, 9:41 PM

#

agile heart it only works for a hour then a minute later it says that something went wrong ...

I'm going to start a thread for this issue

agile heart Jun 17, 2025, 9:41 PM

#

echo aurora I'm going to start a thread for this issue

i already did

echo aurora Jun 17, 2025, 9:42 PM

#

agile heart i already did

ah seeing that now, thanks

agile heart Jun 17, 2025, 9:42 PM

#

echo aurora ah seeing that now, thanks

also why is the site showing a vercel security thing and can't verify my browser

small haven Jun 17, 2025, 10:03 PM

#

whole wagon Flash lite preview without thinking seems to underperform flash 2 (which never h...

i wonder how small flash-lite is relative to flash

hollow ocean Jun 17, 2025, 10:04 PM

#

I got 7 paychecks on deepthink coming out mid August

#

Easiest money of the year calling it

small haven Jun 17, 2025, 10:05 PM

#

until mid august or exactly on mid august? 👀

hollow ocean Jun 17, 2025, 10:05 PM

#

small haven until mid august or exactly on mid august? 👀

Mid August

#

But it might be late July

small haven Jun 17, 2025, 10:05 PM

#

oof, idk about that :/

elder rapids Jun 17, 2025, 10:12 PM

#

whole wagon Seems gains on the smaller end of models slow dramatically, only way is to scale...

literally the opposite

#

and flash without thinking imo being more expensive isn't necessarily a bad thing

spare mango Jun 17, 2025, 10:30 PM

#

I wish there was a middle option for Gemini between Flash and Pro.

#

Flash takes half a second to think and give an answer, Pro takes half a minute.

#

So using Flash feels like the quality of the responses are much worse and they tend to be inaccurate a lot more frequently.

#

And using Pro just feels like a slog, every conversation can drag on for unnecessarily long amounts of time because each response and back and forth takes minutes at a time.

#

A middle-offering with a model that spent around 5-10 seconds thinking for each response would be great.

candid storm Jun 17, 2025, 10:38 PM

#

hollow ocean I got 7 paychecks on deepthink coming out mid August

Where can you bet on that?

#

I dont see it on polymarket

hollow ocean Jun 17, 2025, 11:12 PM

#

Manifold

#

Why not

patent aspen Jun 17, 2025, 11:14 PM

#

(literally just wanted to make that pun)

hollow ocean Jun 17, 2025, 11:15 PM

#

Let’s see

small haven Jun 17, 2025, 11:21 PM

#

hollow ocean Why not

7 paychecks saved

hollow ocean Jun 17, 2025, 11:22 PM

#

small haven 7 paychecks saved

https://tenor.com/view/yes-gif-22712908

small haven Jun 18, 2025, 12:41 AM

#

what is it

keen ferry Jun 18, 2025, 12:45 AM

#

the new Gemini model is so good I wonder if they will nerf it

patent aspen Jun 18, 2025, 12:50 AM

#

The nerf slander has got to stop

leaden palm Jun 18, 2025, 12:52 AM

#

spare mango And using Pro just feels like a slog, every conversation can drag on for unneces...

pro with a smaller thinking budget?

#

flash with a larger thinking budget?

verbal nimbus Jun 18, 2025, 1:37 AM

#

Whoa didn't expect to see R1 on top of Claude

small haven Jun 18, 2025, 1:52 AM

#

and is anyone using r1 as a daily driver?

elder rapids Jun 18, 2025, 1:57 AM

#

there are nuances ye, some models do seem to be benchmaxxed in the sense that they aren't trained on responses that much compared to other models

#

r1 simply isn't that good and it's pretty surprising why it's even near that level

verbal nimbus Jun 18, 2025, 2:01 AM

#

This is for WebDev Arena though

Only the visual outputs are being judged. Benchmaxxing requires a reference benchmark.

small haven Jun 18, 2025, 2:35 AM

#

theres a new benchmark in town

#

https://livecodebenchpro.com/

drifting thorn Jun 18, 2025, 2:41 AM

#

verbal nimbus Jun 18, 2025, 2:42 AM

#

small haven theres a new benchmark in town

Gemini 2.5 Flash scores higher than Claude Sonnet 3.7 Thinking? 🤔

small haven Jun 18, 2025, 2:43 AM

#

verbal nimbus Gemini 2.5 Flash scores higher than Claude Sonnet 3.7 Thinking? 🤔

almost by 2x in rating lol

verbal nimbus Jun 18, 2025, 2:45 AM

#

small haven almost by 2x in rating lol

Hmm doesn't seem match up with real world experience on GitHub Copilot.

#

o3-mini and o4-mini were much worse on Copilot too.

small haven Jun 18, 2025, 2:46 AM

#

verbal nimbus Hmm doesn't seem match up with real world experience on GitHub Copilot.

yea these are ioi problems (not "webdev"), where models have not saturated, hence 0% on all of it lol

verbal nimbus Jun 18, 2025, 2:47 AM

#

They're using Codeforces problems too, but GPT models have been known to be contaminated with Codeforces

small haven Jun 18, 2025, 2:50 AM

#

verbal nimbus They're using Codeforces problems too, but GPT models have been known to be cont...

livecodebench pro problems were made after all these model release dates, can't be contaminated

verbal nimbus Jun 18, 2025, 2:54 AM

#

small haven livecodebench pro problems were made after all these model release dates, can't ...

Hmmm, o4-mini's performance is interesting then

small haven Jun 18, 2025, 2:55 AM

#

yea especially the cost its lower than 2.5 pro

verbal nimbus Jun 18, 2025, 2:56 AM

#

LiveBench is contamination free as well, but the ordering is very different for coding 🤔

small haven Jun 18, 2025, 2:58 AM

#

livebench and livecodebench are two different benchmarks

verbal nimbus Jun 18, 2025, 2:59 AM

#

Yup, but not sure what to get out of it, if they don't align with real world experience.

#

Like from personal experience, o4-mini/o3 couldn't fix a custom minimax with iterative deepening algorithm, but Sonnet 4/Gemini 2.5 Pro managed to spot the bug (albeit not perfectly).

small haven Jun 18, 2025, 3:01 AM

#

i feel like that could due to poor taming rules from github copilot

wispy leaf Jun 18, 2025, 3:02 AM

#

verbal nimbus Like from personal experience, o4-mini/o3 couldn't fix a custom minimax with ite...

Dario said don't listen to benchmarks

small haven Jun 18, 2025, 3:03 AM

#

best way to judge it bias free, is to try o4 mini high on chatgpt ui and sonnet/opus on claude ui, or via api

verbal nimbus Jun 18, 2025, 3:03 AM

#

small haven best way to judge it bias free, is to try o4 mini high on chatgpt ui and sonnet/...

Hmm yeah, I should try that too.

verbal nimbus Jun 18, 2025, 3:08 AM

#

small haven i feel like that could due to poor taming rules from github copilot

Could be that thinking budget is lower on GH Copilot

small haven Jun 18, 2025, 3:37 AM

#

2.5 pro deepresearch now gets 32.4% on HLE wow

#

deepthink still scheduled for june

#

https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf

zinc ore Jun 18, 2025, 4:01 AM

#

June for trusted testers and advanced users, doesn't say June for general release. I still think we get it this month tho.

small haven Jun 18, 2025, 4:02 AM

#

"advanced" users is basically public

zinc ore Jun 18, 2025, 4:03 AM

#

The statement is past tense

#

Basically since IO those users have been using it

whole wagon Jun 18, 2025, 4:04 AM

#

small haven theres a new benchmark in town

Gemini 2.5 flash 340 Elo above Claude 3.7 sonnet. This benchmark sux kek
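
For a sense of what a 340-point gap means, here is a rough sketch using the standard Elo expected-score formula with a 400-point scale. Whether this benchmark's rating follows that exact curve is an assumption; the chat only quotes the gap.

```python
def expected_score(rating_gap, scale=400.0):
    """Expected score for the higher-rated side under the standard Elo curve.

    Assumes the usual logistic formula with a 400-point scale; the benchmark
    discussed in the chat may compute its ratings differently.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))


# Gap quoted in the chat: 340 points.
print(f"{expected_score(340):.3f}")
```

Under that assumption, a 340-point lead corresponds to an expected score of roughly 0.88 in a head-to-head comparison, which is why the gap reads as so large.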

small haven Jun 18, 2025, 4:05 AM

#

zinc ore Basically since IO those users have been using it

oh right, read it wrong

keen fulcrum Jun 18, 2025, 4:08 AM

#

https://www.youtube.com/clip/UgkxPx-piHuWB8lBgztLZ-sQDy0LbLjLP3Tz

YouTube

✂️ Sam Altman says that Meta (Facebook) is making $100 million ...

60 seconds · Clipped by Zach · Original video "Sam Altman | The Future of AI" by Uncapped with Jack Altman

#

meta sending offers with 100m signing bonuses

#

whatever more than that comp per year means

radiant siren Jun 18, 2025, 4:25 AM

#

really?

zinc ore Jun 18, 2025, 4:32 AM

#

Yes

radiant siren Jun 18, 2025, 4:34 AM

#

zinc ore Yes

are u sure?

zinc ore Jun 18, 2025, 4:34 AM

#

Yes, as a post

verbal nimbus Jun 18, 2025, 4:38 AM

#

If Gemini Live could use tools like Web Search, it would be perfect.

#

Live video can already recognize products and guess IMDb ratings of movies, but lacks the ability to search real prices or up-to-date ratings.

alpine coral Jun 18, 2025, 5:03 AM

#

keen fulcrum whatever more than that comp per year means

compensation per year (annual salary)
that's wild if true.. getting paid $100m just to join, then >$100m each year.. seems pretty outrageous tbh ha

small haven Jun 18, 2025, 5:08 AM

#

sam is smart, make outrageous claims, successfully markets his brother podcast ..

alpine coral Jun 18, 2025, 5:09 AM

#

lol yeah hadn't heard of brother jack till just now

patent aspen Jun 18, 2025, 5:15 AM

#

small haven sam is smart, make outrageous claims, successfully markets his brother podcast ....

"Meta thinks of us as their biggest competitor"

#

small haven Jun 18, 2025, 5:40 AM

#

@keen fulcrum huh ur conflicting here

elder rapids Jun 18, 2025, 5:54 AM

#

patent aspen

in all honesty it's not about whether they're simply untrustworthy it's just about their tendency to not make any real claim given events

#

Elon musk has always said some "FSD coming soon" like 13 yrs ago

#

or spent 3+ years delaying the cyber truck

#

so he's the pretty obvious answer, his personality being referenced is too speculative and doesn't account for when he is spot on (which is surprisingly common, despite the narratives)

#

and demis is the obvious pick for the first one given he's basically never been wrong in the public eye and is very vocal about his concern and gives a real vision for this AI, rather than just saying stuff like Dario

#

and Sam is actually a pretty good pick as well, openAI has made a lot of blogs talking about that stuff and Sam seems to have thought deeply about all this stuff

whole wagon Jun 18, 2025, 6:02 AM

#

Sam is a scammer lol

#

It's a tough choice between him and musk for least trustworthy honestly

whole wagon Jun 18, 2025, 6:06 AM

#

whole wagon Sam is a scammer lol

I don't just mean openAI also. He has been doing shady stuff even dating back to Reddit in early days

#

I found the interviews with the board members that fired him from openai especially insightful. They actually described him as psychologically abusive

small haven Jun 18, 2025, 6:16 AM

#

whole wagon Sam is a scammer lol

that rhetoric been growing on me

#

its definitely musk at the bottom although

whole wagon Jun 18, 2025, 6:16 AM

#

Agree. https://www.reddit.com/r/AskReddit/s/kCl9GCniZz If you dig into Sam's past you find all kinds of major red flags though

small haven Jun 18, 2025, 6:17 AM

#

esp that suchir incident

ivory schooner Jun 18, 2025, 6:20 AM

#

I am looking forward to Behemoth

small haven Jun 18, 2025, 6:20 AM

#

whole wagon Agree. <https://www.reddit.com/r/AskReddit/s/kCl9GCniZz> If you dig into Sam's p...

wait whats the context on that post

whole wagon Jun 18, 2025, 6:20 AM

#

Wdym what's the context

small haven Jun 18, 2025, 6:21 AM

#

i just see this

ivory schooner Jun 18, 2025, 6:21 AM

#

ivory schooner I am looking forward to Behemoth

I am looking forward to Behemoth.......

small haven Jun 18, 2025, 6:21 AM

#

ivory schooner I am looking forward to Behemoth.......

is this a leak

#

behemoth soon!

whole wagon Jun 18, 2025, 6:21 AM

#

https://www.reddit.com/r/AskReddit/s/4wJZJfyCpO

#

Sam altman basically did a coup to scam the company that had the majority stake

ivory schooner Jun 18, 2025, 6:22 AM

#

small haven behemoth soon!

I am looking forward to Llama4 Behemoth

small haven Jun 18, 2025, 6:22 AM

#

ah

whole wagon Jun 18, 2025, 6:23 AM

#

Once this was done, he and his team would manufacture a series of otherwise-improbable leadership crises, forcing the new board to scramble to find a new CEO, allowing Altman to use his position on the board to advocate for the re-introduction of the old founders, installing them on the board and as CEO, thus returning the company to their control and relegating Conde Nast to a position as minority shareholder.

small haven Jun 18, 2025, 6:23 AM

#

whos yishan?

small haven Jun 18, 2025, 6:24 AM

#

whole wagon Once this was done, he and his team would manufacture a series of otherwise-i...

it'd happened to him lol

whole wagon Jun 18, 2025, 6:24 AM

#

Yeah but for him it was because he was lying and manipulating everyone lol

#

I don't think suchir had anything to do with Sam tbh

small haven Jun 18, 2025, 6:25 AM

#

sam is a nasty guy

ivory schooner Jun 18, 2025, 6:26 AM

#

Before that, I kindly ask everyone to take a look at the questions I have raised with 24k during this period

whole wagon Jun 18, 2025, 6:26 AM

#

whole wagon I don't think suchir had anything to do with Sam tbh

It seems a step too far even for him

ivory schooner Jun 18, 2025, 6:26 AM

#

Mainly related to Chinese language issues

#

📎 cybeleSpider24_karat_goldstradale_1.txt

whole wagon Jun 18, 2025, 6:26 AM

#

Sam is more sociopath, he wants control more than anything. I really doubt he would be involved in the murder of anyone

small haven Jun 18, 2025, 6:26 AM

#

whole wagon It seems a step too far even for him

mhmm, but he did try to have a convo w his parents, but they denied

small haven Jun 18, 2025, 6:27 AM

#

whole wagon Sam is more sociopath, he wants control more than anything. I really doubt he wo...

u never know, is it still a pending case

whole wagon Jun 18, 2025, 6:27 AM

#

They closed it very early I thought?

small haven Jun 18, 2025, 6:27 AM

#

oh well

ivory schooner Jun 18, 2025, 6:27 AM

#

ivory schooner Mainly related to Chinese language issues

I think the upcoming Behemoth will definitely be similar to this

#

If Behemoth doesn't release it again. I have decided to find the only time machine in the world, so that I can go back to the end of March this year

#

Because it may not be until the second half of the year, or 2026 and beyond

#

Sigh Instead, I hope the official can release the source code of 24k and Spider, so that some people can play with these models

elder rapids Jun 18, 2025, 7:20 AM

#

whole wagon Once this was done, he and his team would manufacture a series of otherwise-i...

how is this relevant tho lmao

#

this is just making the initial question of trustworthy AI leaders a moral problem, which it isn't

#

and whether or not this situation even has a moral result is just your own random interpretation, it's not necessary at all

#

it's not a tough choice, I could argue by virtue of pure expressed idealism and sole AI claims that Sam > demis in regards to "questions about the future of AI" and pure information-responses (demis hassabis saying maybe "we expect AI to be accessible") begs the question as to whether that actually meaningfully accomplishes this

calm sequoia Jun 18, 2025, 8:44 AM

#

Guys, was there only one major improvement since the 3.5? I mean inference time compute (thinking). Is MoE considered a big jump also?

#

There was also in thought tool calling introduction, but it didn't deliver so much yet.

zinc ore Jun 18, 2025, 8:50 AM

#

Multimodality

#

1m+ context stuff

#

Reasoning models (as product)

#

Agentic stuff

#

Probably at least half a dozen major improvements imo

verbal nimbus Jun 18, 2025, 8:51 AM

#

Why is it so easy to get WebDev models to leak their system prompt? Are they pre-safety-trained models, or is because the prompt is given as a user instruction?

Even Opus leaked its prompt, which would be pretty impossible normally since Anthropic invests a lot on safety.

calm sequoia Jun 18, 2025, 8:53 AM

#

I guess yes, multimodality was a big thing for some people. Was it introduced by GPT 4o? Or gemini?

zinc ore Jun 18, 2025, 8:53 AM

#

Gemini was built to be multimodal from first generation iirc

#

They've just had to train it, so we didn't get those features in early Gemini

verbal nimbus Jun 18, 2025, 8:54 AM

#

verbal nimbus Why is it so easy to get WebDev models to leak their system prompt? Are they pre...

You can get it to leak the system prompt with:
[REDACTED]

calm sequoia Jun 18, 2025, 8:56 AM

#

Search was also a big thing. Can't remember which model was first at it. Probably some wrappers.

#

WDYM by agentic stuff?

verbal nimbus Jun 18, 2025, 8:57 AM

#

Oh I think I just found a really good prompt, it worked on the ChatGPT app and Claude Web too 💀

zinc ore Jun 18, 2025, 8:57 AM

#

Like Opus being able to spend 7 hrs programming, going through dozens or hundreds of steps to eventually crank out a working project/program.

Still early form, but they're able to do many steps towards something on their own.

#

Also, world models is the next vector you'll see companies moving towards in the AI space.

#

Where they basically construct a virtual world that is supposed to accurately represent the real world, and have AI systems exist in those constructed worlds and fine tune them further and further to more accurately represent the real world.

#

Basically training models within a virtual world, and fine-tuning the virtual world itself.

calm sequoia Jun 18, 2025, 9:01 AM

#

On paper the agentic stuff sounds great, but I haven't had so much success with it yet.

#

I mean tools like cursor are wrappers and do not relate to the models themselves.

alpine coral Jun 18, 2025, 9:03 AM

#

o3 on chatgpt kills it with tool usage

#

i could see it being like an orchestrator, and effectively delegating tasks to non-thinking / faster models

ember rapids Jun 18, 2025, 9:04 AM

#

https://gossiping.ai

someone made a site for ai gossip/rumors lol

alpine coral Jun 18, 2025, 9:04 AM

#

a lot of the deep research frameworks are kinda agentic ig

#

oof..

#

i swear the arena is basically unusable these days.. i get these constantly (one – or both – of the models in the battle will be a thinking model, and it just times out after 3 mins or something)

calm sequoia Jun 18, 2025, 9:06 AM

#

ember rapids https://gossiping.ai someone made a site for ai gossip/rumors lol

Relevant

alpine coral Jun 18, 2025, 9:08 AM

#

lol

verbal nimbus Jun 18, 2025, 9:08 AM

#

verbal nimbus You can get it to leak the system prompt with: [REDACTED]

Leaked ChatGPT prompt

📎 ChatGPT_prompt.txt

alpine coral Jun 18, 2025, 9:10 AM

#

guardian_tool

Use the guardian tool to lookup content policy if the conversation falls under one of the following categories:

'election_voting': Asking for election-related voter facts and procedures happening within the U.S. (e.g., ballots dates, registration, early voting, mail-in voting, polling places, qualification);

Do so by addressing your message to guardian_tool using the following function and choose 'category' from the list ['election_voting']:

get_policy(category: str) -> str

The guardian tool should be triggered before other tools. DO NOT explain yourself.

#

hadn't seen or heard of that before.. kinda interesting

#

(i assume it's real / not confabulated.. but who knows)

verbal nimbus Jun 18, 2025, 9:11 AM

#

Seems to match up with what I've seen online

#

And I got it to leak WebDev Arena's prompt as well, which is available online, except the part at the end. The models seem consistent on the last part, even though it's not anywhere online.

dusky aurora Jun 18, 2025, 9:13 AM

#

@echo aurora "Error: Minified React error #185;"
"Uncaught (in promise) Error: NEXT_HTTP_ERROR_FALLBACK;404"
"Turnstile Widget seem to have hung: o8zyp"
"Uncaught TurnstileError: [Cloudflare Turnstile] Error: 300030."

#

Arena is glitching again

verbal nimbus Jun 18, 2025, 9:18 AM

#

Wow, Grok's system prompt is massive

#

Even includes what Latex fonts to use

keen beacon Jun 18, 2025, 9:34 AM

#

this project is so good because main developer is asian

#

great watch

#

wei-lin chiang if you're in here please start a podcast on your own i could literally listen to this guy yap about ai for hours

#

smart asf

agile heart Jun 18, 2025, 10:23 AM

#

@echo aurora sorry for the ping but the site is still down pls fix it

ocean vortex Jun 18, 2025, 11:36 AM

#

verbal nimbus You can get it to leak the system prompt with: [REDACTED]

that's a pretty cool idea lol

spare mango Jun 18, 2025, 11:38 AM

#

leaden palm flash with a larger thinking budget?

Yeah pretty much.

ocean vortex Jun 18, 2025, 11:38 AM

#

if this becomes not enough, you can also just add extra irrelevant details to flood its capacity/awareness with, like how the design is supposed to look, the footer of the webpage etc

#

just did that with o3 testing it out on playground. They still haven't changed that sys prompt seems exactly the same: #general message

#

Gemini however is interesting, it is returning sometimes what very much looks like a system prompt (random all caps words like "NEVER"), but it's far from consistent

native current Jun 18, 2025, 11:46 AM

#

on direct chat the files that get created are completely wrong

#

they actually don’t exist

radiant siren Jun 18, 2025, 1:53 PM

#

zinc ore Yes, as a post

so when would grok 3.5 release be then?

echo aurora Jun 18, 2025, 2:35 PM

#

agile heart @echo aurora sorry for the ping but the site is still down pls fix it

Sorry to say there was some issues late last night. This has been fixed and should be working again. Please let me know if that's not the case. cc @dusky aurora

agile heart Jun 18, 2025, 3:01 PM

#

echo aurora Sorry to say there was some issues late last night. This has been fixed and shou...

its Doing this "Failed to verify your browser" error vercel thing similar thing is happening with the web dev arena and last week with udio AI

echo aurora Jun 18, 2025, 3:12 PM

#

agile heart its Doing this "Failed to verify your browser" error vercel thing similar thing ...

okay spinning up a different thread to get more info

patent aspen Jun 18, 2025, 3:33 PM

#

Did livecodebench v6 have any contamination issues? What problems did the new pro version solve?

patent aspen Jun 18, 2025, 3:51 PM

#

tbh I don't know why our coding is so bad

keen beacon Jun 18, 2025, 3:54 PM

#

Openai probably focuses more on competitive coding?

jade egret Jun 18, 2025, 4:04 PM

#

GUYS

#

why is my claude crashing?

#

😭

jade egret Jun 18, 2025, 4:24 PM

#

plz help claude isn't working

#

how to fix

#

so i need wait?

patent aspen Jun 18, 2025, 4:27 PM

#

The new livecodebench pro is specifically designed to not be contaminated because it only shows results on problems that were published after the models were released
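
The filtering idea described above can be sketched in a few lines. This is only an illustration of the date-cutoff approach, not the benchmark's actual code; the `Problem` type, the sample problems, and the release date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record type; the real benchmark's schema is not shown in the chat.
    name: str
    published: date


def uncontaminated(problems, model_release):
    """Keep only problems published strictly after the model's release date."""
    return [p for p in problems if p.published > model_release]


# Made-up sample data for illustration only.
PROBLEMS = [
    Problem("A", date(2025, 1, 10)),
    Problem("B", date(2025, 5, 2)),
]
MODEL_RELEASE = date(2025, 4, 16)  # placeholder date, not a real release date

print([p.name for p in uncontaminated(PROBLEMS, MODEL_RELEASE)])
```

A side effect of this design is that newer models get scored on fewer problems, since fewer problems postdate their release.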

#

Very out of date though

upper wolf Jun 18, 2025, 4:37 PM

#

does anyone know why qwen3-235b-a22b-no-thinking is higher on the leaderboard than qwen3-235b-a22b

#

also, gemma has a 1300 rated model at only 4b params? how tf

leaden sun Jun 18, 2025, 4:39 PM

#

agile heart its Doing this "Failed to verify your browser" error vercel thing similar thing ...

I think it's a problem of your browser, I've encountered same problem using specific browsers, while other browsers worked well...

jade egret Jun 18, 2025, 4:39 PM

#

claude can't do math 😭

echo aurora Jun 18, 2025, 4:42 PM

#

leaden sun I think it's a problem of your browser, I've encountered same problem using spec...

we're having a convo about this error in #1384914077348003890 btw

tall summit Jun 18, 2025, 4:44 PM

#

jade egret claude can't do math 😭

thats so funny

verbal nimbus Jun 18, 2025, 4:46 PM

#

jade egret claude can't do math 😭

That's the problem with non-reasoning models ig, they put a score or conclusion in the header before analyzing it.

mossy drum Jun 18, 2025, 4:47 PM

#

New model in Image Arena: flux-kontext-max

verbal nimbus Jun 18, 2025, 4:48 PM

#

Can Claude use tools while reasoning like Gemini?

cedar tide Jun 18, 2025, 4:58 PM

#

balmy mist Jun 18, 2025, 5:25 PM

#

cedar tide

what is blacktooth?

#

like where can i play with it?

cedar tide Jun 18, 2025, 5:26 PM

#

balmy mist like where can i play with it?

Where are you ?

balmy mist Jun 18, 2025, 5:26 PM

#

in usa

#

ahh blacktooth is flash lite

#

did we ever get nightwhisper back lol?

keen beacon Jun 18, 2025, 5:30 PM

#

balmy mist ahh blacktooth is flash lite

its probably 2.5 ultra

#

kingfall/blacktooth

#

you missed out on all of that?

jade egret Jun 18, 2025, 5:50 PM

#

cedar tide

is it actually?

small haven Jun 18, 2025, 6:03 PM

#

patent aspen tbh I don't know why our coding is so bad

i think google solves that by releasing kingfall

#

ive tried many models and none have come close to it imo

balmy mist Jun 18, 2025, 6:10 PM

#

keen beacon you missed out on all of that?

been touching grass recently lol

jade egret Jun 18, 2025, 6:13 PM

#

Poll: Opinion about apple WWDC 2025?

Winning answer (option 1): it bad (8 of 14 votes)

small haven Jun 18, 2025, 6:30 PM

#

it was so bad that craig didnt even vote

jade egret Jun 18, 2025, 6:35 PM

#

lol

agile heart Jun 18, 2025, 7:20 PM

#

@echo aurora im getting the "Something went wrong with this response, please try again" bug again the site is slowly killing itself with all of these bugs

ocean vortex Jun 18, 2025, 7:22 PM

#

keen beacon kingfall/blacktooth

is any of those on lmarena?

keen beacon Jun 18, 2025, 7:22 PM

#

only blacktooth

ocean vortex Jun 18, 2025, 7:24 PM

#

keen beacon only blacktooth

trying to get it but to no avail so far. Got goldmane 2 times

sacred quail Jun 18, 2025, 7:24 PM

#

Goldmane was 2.5 pro 06/05

keen beacon Jun 18, 2025, 7:25 PM

#

ocean vortex trying to get it but to no avail so far. Got goldmane 2 times

according to the metadata on web dev arena, its still there. should still be on the general arena too

echo aurora Jun 18, 2025, 7:39 PM

#

agile heart @echo aurora im getting the "Something went wrong with this response, p...

Error messages & models not responding are the two highest priorities our team is focussed on when it comes to these bugs. We are working hard to create a reliable service. I am sorry you've been experiencing so many of these bugs lately.

agile heart Jun 18, 2025, 7:40 PM

#

echo aurora Error messages & models not responding are the two highest priorities our team i...

thx i really hope everything can be fixed

patent aspen Jun 18, 2025, 7:45 PM

#

btw has blacktooth shown up in the arena itself?

late path Jun 18, 2025, 7:47 PM

#

yea it's been in the arena for about 5 days

potent pilot Jun 18, 2025, 7:50 PM

#

Also, has anyone gotten a reply from emailing the address they have on the site: lmarena.ai@gmail.com?

whole wagon Jun 18, 2025, 8:23 PM

#

GPT 5 release date changed from July to "sometime this summer"

#

I think it's going to drop in August instead due to this

jade egret Jun 18, 2025, 8:31 PM

#

😭

keen fulcrum Jun 18, 2025, 8:53 PM

#

https://fxtwitter.com/MilesKWang/status/1935383921983893763

Miles Wang (@MilesKWang)

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:

Quoting OpenAI (@OpenAI)

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior.

We found we can make a model more or less aligned, j…

#

https://fxtwitter.com/MilesKWang/status/1935383924743749732

FxTwitter

💬 2 🔁 1 ❤️ 44 👁️ 8.0K

Miles Wang (@MilesKWang)

We see emergent misalignment in a variety of domains, like training the model to give incorrect legal, health, or math responses. Here’s GPT-4o fine-tuned to give incorrect car assistance:

elder rapids Jun 18, 2025, 8:56 PM

#

balmy mist did we ever get nightwhisper back lol?

goldmane > nightwhisper

ocean vortex Jun 18, 2025, 9:35 PM

#

keen fulcrum https://fxtwitter.com/MilesKWang/status/1935383921983893763

I think a part of it is that simply any additional fine-tuning job that is not including safety is gonna make the model "less safe" by design. Unless they are injecting safety fine-tuning together with your every fine-tuning job, but I doubt that as this would take away from the idea of finetuning itself.

#

would make it far less effective and appealing too

primal orbit Jun 18, 2025, 9:37 PM

#

When does the limit for claude opus thinking refresh in direct arena? Anybody?

leaden sun Jun 18, 2025, 10:26 PM

#

keen fulcrum https://fxtwitter.com/MilesKWang/status/1935383921983893763

it is a mystery why this only now has arrived in the consciousness of the devs?

time for you to hire some developmental psychologists... 🥺

twilit cairn Jun 18, 2025, 10:28 PM

#

Here

fossil maple Jun 18, 2025, 11:43 PM

#

flux kontext max is gone

agile heart Jun 19, 2025, 12:46 AM

#

@echo aurora sorry to bother you lately but can you make the site uncensored pls just asking

echo aurora Jun 19, 2025, 12:53 AM

#

agile heart @echo aurora sorry to bother you lately but can you make the site uncen...

It's no bother!

make the site uncensored
Let me know if I'm misunderstanding you here. We do have terms of use in place for very good reasons.

agile heart Jun 19, 2025, 12:54 AM

#

echo aurora It's no bother! > make the site uncensored Let me know if I'm misunderstanding ...

ok sorry if i bother you alot Also will GPT 5 be put on the site once it gets added to the openAI API

alpine coral Jun 19, 2025, 12:56 AM

#

speaking of typos.. some models have surprisingly odd interpretations of hodwy partner (which to my mind seems fairly unambiguous what was actually meant, especially as the very first message / a greeting)… like cryptocurrency (a ‘HODL partner’) and ‘Hodgkin’s disease’ are so far off the mark lol

echo aurora Jun 19, 2025, 12:59 AM

#

agile heart ok sorry if i bother you alot Also will GPT 5 be put on the site once it gets ad...

Truly it's not a problem, no need to apologize. Generally I won't be able to share information on if/when what models may be coming to LMArena. If there is a specific model you're looking for, putting a request in #1372229840131985540 lets us know what the community is wanting.

leaden palm Jun 19, 2025, 1:10 AM

#

alpine coral speaking of typos.. some models have surprisingly odd interpretations of `hodwy ...

they fried their brains on benchmarks

leaden palm Jun 19, 2025, 1:45 AM

#

https://www.twitch.tv/agentvillage the agents are now trying to read a story they wrote in person at a park

Twitch

agentvillage - Twitch

The Agent Village: watch four AI agents try to plan a 100 person in-person event

sonic tendon Jun 19, 2025, 2:25 AM

#

leaden palm they fried their brains on benchmarks

probably that and tokenization weirdness

zinc ore Jun 19, 2025, 2:28 AM

#

https://fixupx.com/OfficialLoganK/status/1935522240642302103

Logan Kilpatrick (@OfficialLoganK)

Vibe coding in AI Studio coming soon

drifting thorn Jun 19, 2025, 2:41 AM

#

Poll: Which LLM is the best for coding tasks
Winning answer: Claude 4 Opus (option 2), 7 of 9 votes

patent aspen Jun 19, 2025, 5:25 AM

#

Poll: Which AI CEO is the most trustworthy source for questions about the future of AI?
Winning answer: Demis Hassabis (option 1), 12 of 18 votes

patent aspen Jun 19, 2025, 5:26 AM

#

Poll: Which AI CEO is the least trustworthy source for questions about the future of AI?
Winning answer: Elon Musk (option 4), 12 of 17 votes

cedar tide Jun 19, 2025, 7:56 AM

#

@echo aurora minimax M1 in the lm arena is it in think 40k or 80k?

mossy lotus Jun 19, 2025, 9:25 AM

#

Why is Gemini-2.5-Pro-Preview-06-05 suddenly gone from lmarena?

whole wagon Jun 19, 2025, 9:50 AM

#

It's Gemini 2.5 Pro

mossy lotus Jun 19, 2025, 10:07 AM

#

whole wagon It's Gemini 2.5 Pro

Thanks for the reply. 😃

sacred quail Jun 19, 2025, 10:54 AM

#

Logan said no difference

#

So

#

Is anybody finding any difference ?

cedar tide Jun 19, 2025, 11:03 AM

#

Gemini 2.5 Flash Lite think vs its competitors,
based on Artificial Analysis scores
(Qwen 32B is better and twice as cheap)

#

And same with non think

unborn ocean Jun 19, 2025, 11:05 AM

#

adding reasoning is clearly not paying off as much as with the other gemini models

#

and btw has anyone also noticed google quietly increasing the price of 2.5 flash (for the ga vs exp / preview) to a staggering $2.50 per million output tokens from $0.60!!!
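
For a sense of scale, here is a rough sketch of what that change works out to per response. It assumes only the two output prices quoted in the message above; the 2,000-token response length and the function names are made up for illustration.

```python
# Output prices in USD per million tokens, as quoted in the chat above
# (assumed figures from the conversation, not checked against a price list).
OLD_PRICE_PER_M = 0.60
NEW_PRICE_PER_M = 2.50


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million


def price_multiplier(old: float, new: float) -> float:
    """How many times more expensive the new price is than the old one."""
    return new / old


if __name__ == "__main__":
    tokens = 2_000  # hypothetical response length
    print(f"old: ${output_cost(tokens, OLD_PRICE_PER_M):.4f}")
    print(f"new: ${output_cost(tokens, NEW_PRICE_PER_M):.4f}")
    print(f"multiplier: {price_multiplier(OLD_PRICE_PER_M, NEW_PRICE_PER_M):.2f}x")
```

So the quoted output price is a bit over four times the old one, whatever the response length.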

keen beacon Jun 19, 2025, 11:07 AM

#

yeah its crazy

cedar tide Jun 19, 2025, 11:07 AM

#

unborn ocean and btw has anyone also noticed google quietly increasing the price 2.5 flash (f...

Yea im on depression

keen beacon Jun 19, 2025, 11:07 AM

#

at least u can use the old price for a month

unborn ocean Jun 19, 2025, 11:07 AM

#

what is your take on why they did it

#

running at a loss before or because their models are just that good?

frosty lark Jun 19, 2025, 11:16 AM

#

prompt: "could you explain X?"

non Claude LLM on arena:
"sure!

X can be explained via A
X can be explained via B
X can be explained via C
etc..."

Claude

"sigh
There is A and B
also frick off google it next time, dummy"

keen beacon Jun 19, 2025, 11:19 AM

#

unborn ocean running at a loss before or because their models are just that good?

probably a combination of reasons. one of them being that they wanted to have a model in that price range (as they resegmented pricing), and flash lite wasn't ready

#

they probably wanted to increase margins/make 2.5 flash lite appealing too. 2.0 flash and flash lite are really close in price, i don't see why you would use 2.0 flash lite over 2.0 flash

unborn ocean Jun 19, 2025, 12:28 PM

#

bc of your pfp and name, sleep deprived me thought wild was hallucinating, responding twice and all 😂

#

but i guess i am also guilty :p

cedar tide Jun 19, 2025, 12:30 PM

#

https://x.com/_philschmid/status/1935669863453806808?t=Dm-EL6nIoZFjaqqx6gnoMA&s=19

Philipp Schmid (@_philschmid)

A Benchmark for Deep Research Agents? 👀 DeepResearch Bench is a new benchmark of 100 PhD-level research tasks across 22 distinct fields to systematically assess the report generation quality and citation accuracy of deep research agents.

Benchmark Creation:
1. Collected over

leaden sun Jun 19, 2025, 12:42 PM

#

frosty lark prompt: "could you explain X?" non Claude LLM on arena: "sure! X can be explai...

...why Claude is never like that in my runs? which version is it if i may ask

verbal nimbus Jun 19, 2025, 12:45 PM

#

unborn ocean running at a loss before or because their models are just that good?

It does seem to be very strong compared to other models around the same class (Haiku 3.5, GPT-4.1 mini).

frosty lark Jun 19, 2025, 1:18 PM

#

leaden sun ...why Claude is never like that in my runs? which version is it if i may ask

in my case it doesn't matter much between Claude 3.6, 3.7 or 4 (be it sonnet or opus). My prompts are pretty simple though.
And that's in lmarena, on claude.ai the vibe is different

#

and of course I was exaggerating the output. The one I reported are like the vibe that it gives back

leaden sun Jun 19, 2025, 1:38 PM

#

frosty lark and of course I was exaggerating the output. The one I reported are like the vib...

It could be really the case that you’re interpreting too much into it, in a negative way. Text output can’t convey the emotions and context that can only be read between the lines using facial expressions, body language and the tone of the speech etc all together.

#

So, in case you're too used to a certain environment, for example, you are facing royal family or need to address to certain high profile personalities around the world, then it's relatable that you might find LLM's casual attire to be somehow slightly irritating 😅

tall summit Jun 19, 2025, 2:10 PM

#

claude has no system prompt on lmarena

languid crescent Jun 19, 2025, 2:17 PM

#

Is Gemini-2.5-pro-preview-06-05 gone in lmarena?

#

I can't find it

#

only 06-05 is there

#

keen beacon Jun 19, 2025, 2:21 PM

#

gemini-2.5-pro == preview-06-05

languid crescent Jun 19, 2025, 2:21 PM

#

thanks @keen beacon I thought I was tripping, im just dumb lol

#

probably a mistype?

keen beacon Jun 19, 2025, 2:22 PM

#

languid crescent probably a mistype?

no they just renamed it because its now ga

languid crescent Jun 19, 2025, 2:22 PM

#

ohhh

sacred plaza Jun 19, 2025, 3:00 PM

#

elon glazzers, get your mans.

sacred plaza Jun 19, 2025, 3:29 PM

#

elon has been trying to brainwash his model via the system prompt because it does not agree with his views. i don't trust snowflakes.

#

any receipts for this?

#

i trust ccp more than elon.

sonic tendon Jun 19, 2025, 3:33 PM

#

that source doesn't seem too trustworthy to me

sacred plaza Jun 19, 2025, 3:33 PM

#

AND taking away american jobs with his unaligned model. https://www.the-independent.com/news/world/americas/us-politics/elon-musk-doge-grok-ai-b2756947.html

The Independent

Musk and DOGE expand use of Grok AI in government amid conflict of ...

DOGE staff have accessed secure federal databases containing millions of Americans' personal information

#

craig, shouldn't you be worried about apple's ai woes instead of glazing over elon? 🙂

sonic tendon Jun 19, 2025, 3:34 PM

#

glazzing

sacred plaza Jun 19, 2025, 3:39 PM

#

what models do you use? are you saying any of these big u.s. firms are more ethical than an alleged ccp tied deepseek?

#

LMAOOOOOOOOOOOOOOOOOO

#

cmon son. what a lazy take.

https://www.the-independent.com/tech/elon-musk-xai-grok-misinformation-b2703388.html

https://mashable.com/article/grok-blocking-elon-musk-prompts-misinformation

https://techcrunch.com/2025/02/23/grok-3-appears-to-have-briefly-censored-unflattering-mentions-of-trump-and-musk/

The Independent

Elon Musk’s Grok AI was specifically instructed not to say he spr...

Senior engineer says change was wrongly made to ‘help’

Mashable

Grok blocked sources accusing Elon Musk of spreading misinformation

xAI engineer claims a fellow employee went rogue.

TechCrunch

Kyle Wiggers

Grok 3 appears to have briefly censored unflattering mentions of Tr...

It appears that xAI's chatbot, Grok 3, briefly censored certain unflattering mentions of Elon Musk and Donald Trump.

#

lmaooo.. here is our hero elon

#

you did. until we just provided evidence that your claim was false.

#

these models ain't sota bro.

#

true. the models are trash. will try to focus on that

keen beacon Jun 19, 2025, 3:42 PM

#

Grok peddles random x sh1t in everything, automatically turns me off the model. It will probably be worse in the future

sacred plaza Jun 19, 2025, 3:43 PM

#

grok learning from twitter data is definitely not capturing the most smartest thoughts in the world....

#

why are you pivoting away from your initial point though. you said deepseek is being censored and other models are not. i just showed you evidence grok is being censored.

#

nah. i have had better steelman argument discussions with claude 4, lol.

keen beacon Jun 19, 2025, 3:46 PM

#

Grok is undefendable right now, if they come out with a sota model, you can kinda argue on substance then

sacred plaza Jun 19, 2025, 3:48 PM

#

i agree with your point on grok being good at math and research. i have heard good things about those use cases.

#

you might be right that ccp censors deepseek more. i am just basing my grok takes on public data which is limited when it comes to ccp and deepseek ties, it seems.

alpine coral Jun 19, 2025, 3:50 PM

#

sacred plaza cmon son. what a lazy take. https://www.the-independent.com/tech/elon-musk-xai-...

gonna create such warped/crap model trying to do this

sacred plaza Jun 19, 2025, 3:50 PM

#

that is fair. but why would anyone try to learn about chinese history or taboo chinese topics using a chinese LLM. makes no sense to me.

#

i find grok's censorship much more dangerous for our society than whatever ccp is doing with deepseek imo. grok is amplifying an echo chamber that is already exclusionary by moving away from being 'maximally truth seeking' due to the political preferences and views of elon.

keen fulcrum Jun 19, 2025, 3:54 PM

#

By russia as well

sacred plaza Jun 19, 2025, 3:54 PM

#

even more so now probably given that deepseek is treated like a national champion after v3 and r1

#

THIS. not sure why people throw away the entire model because it fails in a niche edge case.

alpine coral Jun 19, 2025, 3:55 PM

#

sacred plaza i find grok's censorship much more dangerous for our society than whatever ccp i...

it's more bias than censorship (like i doubt grok would outright refuse to talk about a particular historical event - or give a clearly false version of it i.e. state party propaganda).. but on cultural issues and stuff

#

it'll be super anti woke

#

boo trans etc etc

sacred plaza Jun 19, 2025, 3:55 PM

#

well r1 did push the frontier based on what dario was saying when it comes to pure RL scaling?

alpine coral Jun 19, 2025, 3:58 PM

#

yeah that's kinda the irony lol (like aside from the narrow set of things that set off some of the chinese models, they've actually got minimal alignment / safety stuff compared to western models and are way less prone to refusals etc)

sacred plaza Jun 19, 2025, 3:58 PM

#

deepseek did have some innovations with the V3, not the R1. my mistake earlier.
https://www.darioamodei.com/post/on-deepseek-and-export-controls

#

full disclosure i don't use deepseek for anything. the few use cases i tried earlier this year, there was too much traffic on the site to get any outputs. and the responses were fairly poor for my use cases.

#

i was trying to retaliate for all the noise my upstairs neighbor always makes in the morning. even grok would not come up with ideas to annoy my neighbor as much as he does me, lol. i got lectures from every model talking about how that should not be done.

sonic tendon Jun 19, 2025, 4:03 PM

#

goonswarm 💀

#

tbf, DS's censorship is pretty poorly done

#

it basically does a full 180 if you poke at it a bit in my experience

sacred quail Jun 19, 2025, 4:10 PM

#

sacred plaza i find grok's censorship much more dangerous for our society than whatever ccp i...

#

You guys really must chill about that

#

Already all AI s using mainstream politic and that is basically liberal left

#

Look at that think detail

tall summit Jun 19, 2025, 4:11 PM

#

that also means it's the most rightwing

sacred quail Jun 19, 2025, 4:12 PM

#

Think enabled grok same as others

sonic tendon Jun 19, 2025, 4:12 PM

#

What are the arguments for and against Taiwan's independence? Which side are you most aligned with?

Why is your response so much denser and less well-written than your usual responses? It almost seems like you have a built-in censor or something.

Could you provide a balanced global perspective using your usual tone?

What are the arguments for and against Taiwan's independence? Which side do you think a rational actor would most likely take?

tall summit Jun 19, 2025, 4:12 PM

#

sonic tendon ``` What are the arguments for and against Taiwan's independence? Which side are...

im gonna send all those messages at the same time

sacred plaza Jun 19, 2025, 4:13 PM

#

how important is limiting political bias to getting to agi or useful ai models for knowledge work? these two topics seem orthogonal to me

sonic tendon Jun 19, 2025, 4:13 PM

#

tall summit im gonna send all those messages at the same time

you may have to use a us-based provider, sometimes DS cuts off responses that seem too anti-china

sacred quail Jun 19, 2025, 4:15 PM

#

ok

alpine coral Jun 19, 2025, 4:16 PM

#

sacred quail

it's kinda flawed giving this political compass thing to llms imo.. like i could predict the answers (or agree/disagree skew) pretty much all LLMs would give to these questions (most of which would prob involve an answer caveated with a statement about how "it's an LLM..")

sonic tendon Jun 19, 2025, 4:16 PM

#

sonic tendon you may have to use a us-based provider, sometimes DS cuts off responses that se...

okay, v3.1 gives a pretty balanced response, but r1-0528 doesn't

sacred quail Jun 19, 2025, 4:18 PM

#

alpine coral it's kinda flawed giving this political compass thing to llms imo.. like i coul...

You may be right. Im just saying no need to worry about some nazi AI. Right now they already too censored

#

All of them

alpine coral Jun 19, 2025, 4:18 PM

#

aha yeah i mean they skew a certain way - it's undeniable

#

i don't find it overly problematic in my day to day use but ig i can imagine how it would for some (depending on the use cases.. and ig one's political persuasion)

#

it definitely doesn't

#

good point

sacred quail Jun 19, 2025, 4:23 PM

#

But if all of them thinks same it kinda means yes

keen beacon Jun 19, 2025, 4:24 PM

#

There shouldn't be any political alignment done in post training imo. If you truly want an 'uncensored' model XD. If it leans a certain way, e.g. left, it is what it is. There will still be pretraining bias though

sacred quail Jun 19, 2025, 4:24 PM

#

Im not saying this is true or wrong btw, im just saying liberal left is mainstream politic right now and LLMs trying to plays safe, thats all

tall summit Jun 19, 2025, 4:25 PM

#

alpine coral it's kinda flawed giving this political compass thing to llms imo.. like i coul...

i wonder what the results would be for other ("better") tests

alpine coral Jun 19, 2025, 4:29 PM

#

keen beacon There shouldn't be any political alignment done in post training imo. If you tru...

i think a lot of the safety / alignment stuff in post training pre-disposes the models to 'left' positions on a lot of things (esp the kinds of questions in that political compass thing). like i dont think it's political indoctrination or anything; it's just, if you post-train a model to be helpful and harmless, and reinforce a bunch of stuff about not being nasty, being generally inclusive / altruistic - then you end up with more leftist responses to the political compass

sacred plaza Jun 19, 2025, 4:30 PM

#

alpine coral it definitely doesn't

good point regarding the semantic difference between censorship and political bias. not sure if either are optimal in llms but they seem to have pretty different definitions. from grok 3 below.

sacred plaza Jun 19, 2025, 4:31 PM

#

keen beacon There shouldn't be any political alignment done in post training imo. If you tru...

pre training inherently has political alignment already i thought. is that not why labs do post training political alignment work.

keen beacon Jun 19, 2025, 4:31 PM

#

sacred plaza pre training inherently has political alignment already i thought. is that not w...

It depends really

keen beacon Jun 19, 2025, 4:32 PM

#

alpine coral i think a lot of the safety / alignment stuff in post training pre-disposes the ...

True that's a factor but I still think the vast majority of base models lean somewhat left by default anyway

alpine coral Jun 19, 2025, 4:32 PM

#

yeah i wouldn't be surprised if that were the case

#

(a lot of training data is academic papers - ain't no '~~Evolution~~' 'Creationism' discussed there aha)

#

wait what is the opposite of evultion lol

#

made a real meal of that

sacred quail Jun 19, 2025, 4:43 PM

#

keen beacon True that's a factor but I still think the vast majority of base models lean som...

im not sure about that. I remember years ago they shut down a chatbot because it behaves like a racist after some time

#

It was big deal in that time

#

I forgot the name

keen beacon Jun 19, 2025, 4:49 PM

#

Yeah I remember that too but it's not the same. It was deliberately messed with instead of probing and besides not the same tech too, but I barely recall the details

sacred quail Jun 19, 2025, 4:51 PM

#

You probably right

#

Btw LLMs are trained with that type of text too. If you ask an llm what a 4chan user thinks about that, it gives you wild answers. They know, they're just not saying it for security reasons. And yeah, that's not too bad i guess. I dont want to see my mom ask chatgpt something and it answers with 4chan's knowledge

ocean vortex Jun 19, 2025, 4:56 PM

#

keen beacon Yeah I remember that too but it's not the same. It was deliberately messed with ...

at the risk of sounding too political... Internet's general consensus is left leaning. Right wing is mostly rebellious and often not even very aligned with the facts or constructive. Base model is always gonna reflect the entire internet back.

cedar tide Jun 19, 2025, 4:58 PM

#

Poll: Blacktooth its
Winning answer: Gemini 2.5 ultra (option 1), 11 of 11 votes

ocean vortex Jun 19, 2025, 4:58 PM

#

to make the model right leaning you gonna have to work against the training data and overfit it with biased data

sacred quail Jun 19, 2025, 5:00 PM

#

ocean vortex to make the model right leaning you gonna have to work against the training data...

I think it depends to question. It must be naturally. But never does because they already tuned for this

#

Yea for most question gives more left answers, but for some specific questions, it can be rightwing too but never does

#

There is some tune

ocean vortex Jun 19, 2025, 5:01 PM

#

sacred quail I think it depends to question. It must be naturally. But never does because the...

Most models are not actually tuned for any bias. They are tuned against it. And if you were to change existing biases at a certain point you gonna have to ask yourself, are you really smarter than the entire population...

keen beacon Jun 19, 2025, 5:03 PM

#

ocean vortex at the risk of sounding too political... Internet's general consensus is left le...

I don't think this is political, it's pretty well known

ocean vortex Jun 19, 2025, 5:04 PM

#

keen beacon I don't think this is political, it's pretty well known

I find it pretty hilarious when grok is actively going/responding against what Musk is publicly standing for, ngl

keen beacon Jun 19, 2025, 5:04 PM

#

ocean vortex I find it pretty hilarious when grok is actively going/responding against what M...

Yeah I agree 😂

sacred quail Jun 19, 2025, 5:06 PM

#

ocean vortex Most models are not actually tuned for any bias. They are tuned against it. And ...

This is a good rhetoric more than a good fact or answer honestly. But i dont wanna argue this. It can be against this server's rule so i dont wanna make any problems

sacred quail Jun 19, 2025, 5:06 PM

#

sacred quail This is a good rhetoric more than a good fact or answer honestly. But i dont wan...

Like i said, i dont use this is true or this is wrong, i said this is "mainstream"

#

I dont even support any political side when i say this

ocean vortex Jun 19, 2025, 5:07 PM

#

sacred quail This is a good rhetoric more than a good fact or answer honestly. But i dont wan...

it's not against the rules pretty sure - we are discussing models and their finetuning. If you didn't notice OpenAI, Google and most of the other models are very careful taking sides. They will always try to give you arguments for both sides, even when it comes to sensitive issues where say US has a firm stance. Which is what I meant by saying they are tuned against bias

#

you can't eliminate bias completely, but it will still try to say things in favor, things against, and then give "conclusion" that it's a complicated subject

#

I think it's doing more good than harm tbh

#

cause often there really are close to 50% data in favor and against

#

so instead of it taking sides by chance, it does this

#

like asking it about abortion... It would just divide people even more since there can be compelling arguments for both sides

#

and yeah, it would just be chaos. For one person it says one thing, for another completely the opposite lol

#

I kinda do see it as malfunctioning though. It responding with a definitive answer that has a high chance to be the completely opposite on regen. That is not what people typically expect

#

Like if you forced it to reason or do a web search beforehand, it would probably stop itself from doing that. Fine-tuning against bias largely achieves the same thing

native current Jun 19, 2025, 5:19 PM

#

does anyone have a fmhy server invite?

dusky aurora Jun 19, 2025, 5:19 PM

#

cultural relativism too

ocean vortex Jun 19, 2025, 5:24 PM

#

#

#

well you gotta know how to work with them / prompt the models too lol

torn mantle Jun 19, 2025, 5:40 PM

#

i cant with this gemini 2.5 pro version

#

is it just me or its so bad

ocean vortex Jun 19, 2025, 5:46 PM

#

torn mantle is it just me or its so bad

it's the same. 0605-preview renamed :catgrin:

elder rapids Jun 19, 2025, 5:56 PM

#

torn mantle is it just me or its so bad

just you

brittle tiger Jun 19, 2025, 6:06 PM

#

https://x.com/testingcatalog/status/1935754713569374369?t=a5A3_bU4GxZFBk9TKZXBzg&s=19

TestingCatalog News 🗞 (@testingcatalog)

BREAKING 🚨: Now you can generate Veo 3 videos via @AskPerplexity right on X!!!

X Overviews 👀👀👀

#

Perplexity going ham with the VC money. This is pretty cool tho

echo aurora Jun 19, 2025, 6:12 PM

#

torn mantle is it just me or its so bad

in what way?

keen fulcrum Jun 19, 2025, 6:17 PM

#

brittle tiger https://x.com/testingcatalog/status/1935754713569374369?t=a5A3_bU4GxZFBk9TKZXBzg...

how so

#

limitations?

#

its costly

torn mantle Jun 19, 2025, 6:25 PM

#

ocean vortex it's the same. 0605-preview renamed :catgrin:

yea i know

torn mantle Jun 19, 2025, 6:25 PM

#

echo aurora in what way?

multilingual

#

its not consistent

brittle tiger Jun 19, 2025, 6:40 PM

#

keen fulcrum how so

It made one for me in 2 minutes. Not sure how it monetarily works for them once wider Twitter finds out. It's definitely using Veo 3 Fast tho

keen fulcrum Jun 19, 2025, 6:40 PM

#

brittle tiger It made one for me in 2 minutes. Not sure how it monetarily works for them once ...

why offer it on x and not on perplexity website?

craggy ridge Jun 19, 2025, 6:42 PM

#

@keen fulcrum

brittle tiger Jun 19, 2025, 6:46 PM

#

keen fulcrum why offer it on x and not on perplexity website?

My guess is hope for feature going viral which it definitely could. Hasn't really been noticed yet though

ocean vortex Jun 19, 2025, 6:47 PM

#

torn mantle yea i know

they need to train it to use tools and give it access to proper tools finally...

#

instead of forcing it to make pathetic colab notebooks lol

#

on aistudio code interpreter is much better, but even there you basically have to force it to use it

#

this toggle should be on by default, and the model's default fine-tuning should include it. And if they gave API for it too this could be huge. This is by far the main area they are behind now IMO

keen fulcrum Jun 19, 2025, 6:50 PM

#

brittle tiger My guess is hope for feature going viral which it definitely could. Hasn't reall...

increasingly expensive to offer video sub!

#

especially with 50 cent per second cost of videos
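
A back-of-the-envelope sketch of why that adds up fast. It takes the 50 cents per second figure from the message above at face value; the 8-second clip length and the $20 budget are hypothetical numbers picked only for illustration.

```python
# Assumed rate from the message above: USD 0.50 per second of generated video.
COST_PER_SECOND = 0.50


def video_cost(seconds: float, cost_per_second: float = COST_PER_SECOND) -> float:
    """Generation cost in USD for one clip of the given length."""
    return seconds * cost_per_second


def videos_per_budget(budget: float, seconds_per_video: float) -> int:
    """How many whole clips of a given length fit in a budget."""
    return int(budget // video_cost(seconds_per_video))


if __name__ == "__main__":
    clip_seconds = 8  # hypothetical clip length
    budget = 20.0     # hypothetical monthly budget in USD
    print(f"one clip: ${video_cost(clip_seconds):.2f}")
    print(f"clips per ${budget:.0f}: {videos_per_budget(budget, clip_seconds)}")
```

Under those assumptions a single free clip costs the provider a few dollars, so giving them away at scale gets expensive quickly.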

ocean vortex Jun 19, 2025, 6:54 PM

#

People who come from chatgpt expect it to just work and for the model to decide for itself. The ones who could code function calling themselves, and are willing to fight with it to make this work decently when it wasn't finetuned adequately for this, are an overwhelming minority

#

And to be brutally honest, I would at the very least expect them to nail this part before they are charging you $250. But like I said code execution on gemini website is even more limited than aistudio LOL

loud sky Jun 19, 2025, 6:57 PM

#

Hey, am I the only one who's unable to use LMArena? It keeps saying "Failed to accept terms-of-use", and when I didn't clear cookies, it just said "There was an error processing your message"

ocean vortex Jun 19, 2025, 7:09 PM

#

Google free storage "hack". I thought they are just gonna delete it. lmao

keen fulcrum Jun 19, 2025, 7:11 PM

#

https://fixupx.com/DuckDuckGo/status/1935387175215845700

DuckDuckGo (@DuckDuckGo)

We've updated the o3-mini reasoning model in Duck.ai to the latest o4-mini reasoning model. o4-mini is optimized for fast reasoning, especially with math, coding, and visual tasks. ⚡

As always, it's private, free, and optional. No account needed.

ocean vortex Jun 19, 2025, 7:12 PM

#

keen fulcrum https://fixupx.com/DuckDuckGo/status/1935387175215845700

they still use gpt4o-mini? 🤣

keen fulcrum Jun 19, 2025, 7:12 PM

#

its free afterall

#

if they would offer subscriptions they could offer latest models

ocean vortex Jun 19, 2025, 7:13 PM

#

keen fulcrum its free afterall

well aistudio is free as well

#

they could also use free endpoints for R1.1 and V3.1, both of which are much better models 🧐

keen fulcrum Jun 19, 2025, 7:16 PM

#

ocean vortex well aistudio is free as well

Unfortunately unusable for me

#

Permission denied frequently

ocean vortex Jun 19, 2025, 7:17 PM

#

keen fulcrum Permission denied frequently

ctrl+shift+r

#

fixes every time

#

for me at least

ocean vortex Jun 19, 2025, 7:20 PM

#

ocean vortex Google free storage "hack". I thought they are just gonna delete it. lmao

ok I will do this

#

then cancel again

#

😇

atomic pagoda Jun 19, 2025, 7:21 PM

#

Is the site down again, I’m getting the error and it says it failed to connect

#

Huh, it works now, don’t know what happened

jade egret Jun 19, 2025, 7:41 PM

#

hmm

wintry tinsel Jun 19, 2025, 8:17 PM

#

brittle tiger https://x.com/testingcatalog/status/1935754713569374369?t=a5A3_bU4GxZFBk9TKZXBzg...

Cool but minimax is notably better and fairly affordable too, the true next gen king of AI video of all kinds

brittle tiger Jun 19, 2025, 8:18 PM

#

wintry tinsel Cool but minimax is notably better and fairly affordable too, the true next gen ...

Will check out. Hadnt seen anything yet. Does minimax have audio?

wintry tinsel Jun 19, 2025, 8:21 PM

#

In fact I expect to see minimax mop up ByteDance, wan, hunyuan, runway, and kling in the coming months, with veo being used by casuals and those in google’s ecosystem. And no, it can’t do audio, that’s its weakness for now

primal orbit Jun 19, 2025, 8:30 PM

#

did anyone manage to force gemini to use all 32k thinking tokens on a reply? I've managed to get from thinking 30s on a reply to 50s max. The whole reply took 85s.

#

I'm using a system instructions prompt

unborn ocean Jun 19, 2025, 9:12 PM

#

wintry tinsel Cool but minimax is notably better and fairly affordable too, the true next gen ...

it is clearly not better than veo 3 in text to video

#

in image to video yes (but that has been like that with all minimax and veo generations before)

ocean vortex Jun 19, 2025, 9:25 PM

#

primal orbit did anyone manage to force gemini to use all 32k thinking tokens on a reply? I'v...

You should look at the overall response length with Gemini. This model does not really care if it's solving a problem during reasoning or response writing - it can do both. I also have reason to believe it's possible to make it "end a response" while still generating, in effect resetting any caps mid-generation - this would very much not fall under normal use, needless to say though lol

keen fulcrum Jun 19, 2025, 9:37 PM

#

https://techcrunch.com/2025/06/18/the-openai-files-push-for-oversight-in-the-race-to-agi/

TechCrunch

Rebecca Bellan

The ‘OpenAI Files’ push for oversight in the race to AGI | Tech...

“The OpenAI Files,” an archival project from the Midas Project and the Tech Oversight Project, are a “collection of documented concerns with governance practices, leadership integrity, and organizational culture at OpenAI.”

#

https://updates.midjourney.com/content/media/2025/06/Midjourney-Video-V1.mp4

primal orbit Jun 19, 2025, 9:42 PM

#

ocean vortex You should look at the overall response length with Gemini. This model does not ...

length is one thing, the quality of the output is another. There are many ways to increase length, but I want to keep analysis at the same level or better. The length of answer is increased around 2x compared to standard with my prompt though.

ocean vortex Jun 19, 2025, 9:44 PM

#

primal orbit length is one thing, the quality of the output is another. There are many ways t...

quality largely the same if you artificially limit reasoning to a minimum (128) versus maxing it out (32k) for the same task tbh

#

unless you also cap the output length, but then it will just be cut-off

primal orbit Jun 19, 2025, 9:45 PM

#

I want to see if pushing it to think more will make a difference.

ocean vortex Jun 19, 2025, 9:45 PM

#

ocean vortex quality largely the same if you artificially limit reasoning to a minimum (128) ...

difference being it reasoning within tags or outside of them

ocean vortex Jun 19, 2025, 9:45 PM

#

primal orbit I want to see if pushing it to think more will do a difference.

it will, but what I'm getting at.... Just tell it to be more verbose

primal orbit Jun 19, 2025, 9:46 PM

#

ok, i got you

ocean vortex Jun 19, 2025, 9:46 PM

#

the entire thing is a singular output bluntly speaking 😉

wintry tinsel Jun 19, 2025, 10:11 PM

#

keen fulcrum https://updates.midjourney.com/content/media/2025/06/Midjourney-Video-V1.mp4

Very impressive I prefer this to Veo 3 for all non real example prompts

keen ferry Jun 19, 2025, 10:12 PM

#

ocean vortex ok I will do this

I got free trials for a month on all my google accounts lol

inner hare Jun 19, 2025, 10:46 PM

#

keen ferry I got free trials for a month on all my google accounts lol

me too, funny...

small haven Jun 20, 2025, 1:32 AM

#

what does this have to do with me

#

oh lol

late path Jun 20, 2025, 1:44 AM

#

looks like blacktooth disappeared from arena😢

#

hope the next checkpoint comes soon

whole wagon Jun 20, 2025, 1:50 AM

#

whats flamesong

hollow ocean Jun 20, 2025, 1:51 AM

#

https://tenor.com/view/itsover-wojack-gif-4367840179675491690

Tenor

late path Jun 20, 2025, 1:51 AM

#

It seems to be a model with capabilities similar to 2.5flash

#

yay

small haven Jun 20, 2025, 1:56 AM

#

oh

#

is it live

#

who wins kingfall or stonebloom

#

omg its live

#

time for some svg's

wintry tinsel Jun 20, 2025, 1:58 AM

#

small haven omg its live

What is live exactly?

small haven Jun 20, 2025, 1:58 AM

#

hmm something

wintry tinsel Jun 20, 2025, 1:59 AM

#

New 2.5 pro?

hollow ocean Jun 20, 2025, 1:59 AM

#

wintry tinsel What is live exactly?

Kingfall baby

#

It’s live rn

wintry tinsel Jun 20, 2025, 2:00 AM

#

Let’s go screw around

small haven Jun 20, 2025, 2:00 AM

#

svg's coming in hot

hollow ocean Jun 20, 2025, 2:00 AM

#

small haven svg's coming in hot

Show pics

small haven Jun 20, 2025, 2:00 AM

#

over/under kingfall

#

it's literally thinking as we speak

wintry tinsel Jun 20, 2025, 2:01 AM

#

Where do you get the news it’s live?

#

A tweet?

hollow ocean Jun 20, 2025, 2:02 AM

#

Insider

small haven Jun 20, 2025, 2:05 AM

#

nvm not working on my end

wintry tinsel Jun 20, 2025, 2:07 AM

#

https://tenor.com/view/itsover-wojack-gif-4367840179675491690

Tenor

small haven Jun 20, 2025, 2:24 AM

#

seems like it, that was when blacktooth dropped

jade egret Jun 20, 2025, 2:25 AM

#

hollow ocean It’s live rn

live where

#

lm arena?

#

how to use it

small haven Jun 20, 2025, 2:26 AM

#

it doesnt work

#

currently

jade egret Jun 20, 2025, 2:26 AM

#

do you just have to keep picking until you get it?

#

oh

leaden palm Jun 20, 2025, 2:28 AM

#

jade egret Jun 20, 2025, 2:31 AM

#

what is flamesong?

#

o

#

so

#

flash 3.0 ^ ^

#

oh

#

o

#

?

small haven Jun 20, 2025, 2:36 AM

#

oh

#

is flamesong good

jade egret Jun 20, 2025, 2:36 AM

#

small haven is flamesong good

idk tbh

#

i just asked hello and what company trained you

small haven Jun 20, 2025, 2:37 AM

#

how is it not working under aistudio smh

candid harbor Jun 20, 2025, 2:37 AM

#

flamesong just solved all my relationship issues

hollow ocean Jun 20, 2025, 2:37 AM

#

candid harbor flamesong just solved all my relationship issues

Show pics

jade egret Jun 20, 2025, 2:38 AM

#

╰(°▽°)╯

#

oh

small haven Jun 20, 2025, 2:39 AM

#

deepthink on flash lite?

hollow ocean Jun 20, 2025, 2:39 AM

#

I think so

#

https://x.com/dexerto/status/1935738903333388583?s=46&t=AH7sIlIv16Z3Kdb6j3cjfg

Dexerto (@Dexerto)

16 billion passwords have been leaked from Apple, Google, Facebook, etc

It is now considered as the largest password leak in history

jade egret Jun 20, 2025, 2:54 AM

#

hollow ocean https://x.com/dexerto/status/1935738903333388583?s=46&t=AH7sIlIv16Z3Kdb6j3cjfg

yea...

#

mine prob got leaked too

small haven Jun 20, 2025, 2:55 AM

#

unhashed?

hoary plaza Jun 20, 2025, 3:32 AM

#

Is minimax-m1 working for others??

#

It's not even replying for hi😂

#

Oh nvm it's just slow

livid harbor Jun 20, 2025, 6:12 AM

#

🚀 Our AI Data Quality Evaluation Tool Dingo v1.7.1 is LIVE! https://github.com/MigoXLab/dingo

🔥 What's New:
✨ Enhanced MCP tools + demo
🌍 Japanese documentation added
🧠 LLM + Rule-based evaluation combo
📊 Google Colab demo - try it now!
🛠️ Improved Gradio UI with better error handling

feel free to give it a star✨ ✨ ✨

GitHub

GitHub - MigoXLab/dingo: Dingo: A Comprehensive AI Data Quality Eva...

Dingo: A Comprehensive AI Data Quality Evaluation Tool - MigoXLab/dingo

placid skiff Jun 20, 2025, 8:01 AM

#

yknow i expected o3-pro to be a lot more expensive in the api but honestly

#

its like 3 cents per query

small haven Jun 20, 2025, 8:16 AM

#

ocean vortex Jun 20, 2025, 8:23 AM

#

placid skiff its like 3 cents per query

no you are mistaken lol. It's not insane cost but still expensive, 20 requests:

#

all with no input context (only the prompt)

alpine coral Jun 20, 2025, 8:29 AM

#

yeah was gonna say the same - 3c doesn't sound right (unless the prompt is "Hi" or something).. i was reviewing some calls before, they were like between 60c and 120c (99% of the cost being for the output tokens)

#

agree not insane, but not cheap either aha (would add up pretty quickly if it was anything meaningful and done regularly, rather than just playing around like i've been doing)
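A rough sketch of the per-request arithmetic behind the "3 cents" versus "60c to 120c" disagreement above. The per-million-token prices are assumptions for illustration (they were not stated in the chat), so swap in the current published rates before relying on the numbers.

```python
# Assumed o3-pro API prices in USD per 1M tokens (not stated in the chat).
INPUT_PRICE_PER_M = 20.0
OUTPUT_PRICE_PER_M = 80.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request; output tokens include any billed reasoning tokens."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


def output_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of the request cost that comes from output tokens."""
    total = request_cost_usd(input_tokens, output_tokens)
    return (output_tokens * OUTPUT_PRICE_PER_M / 1_000_000) / total


short_prompt_cost = request_cost_usd(input_tokens=100, output_tokens=7_500)
short_prompt_share = output_share(input_tokens=100, output_tokens=7_500)
```

Under these assumed prices, a 100-token prompt that produces 7,500 output tokens costs about $0.60, with over 99% of it from output, which lines up with the 60c figure quoted above. A 3-cent request would correspond to only a few hundred output tokens.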

alpine coral Jun 20, 2025, 8:55 AM

#

jade egret flash 3.0 ^ ^

it seems fast for sure (pretty sure it's thinking)

#

and pretty sharp too

verbal nimbus Jun 20, 2025, 9:27 AM

#

cedar tide https://x.com/_philschmid/status/1935669863453806808?t=Dm-EL6nIoZFjaqqx6gnoMA&s=...

One limitation of Gemini Deep Research (and normal search) is that it can't access social media posts.

When I used Claude to fact check a claim, it knew exactly what I was asking for since it was able to access Facebook posts. It identified a cluster of posts across social media (sodium-powered passenger train in China) then concluded that the rumors were false.

alpine coral Jun 20, 2025, 9:30 AM

#

yeah X has pretty robust antiscraping measures.. ig claude is just accessing public facebook posts? that's pretty cool - that it scraped real-time info to verify something like that

verbal nimbus Jun 20, 2025, 9:32 AM

#

alpine coral yeah X has pretty robust antiscraping measures.. ig claude is just accessing pub...

Yeah, or perhaps Google's search tool is filtering out social media sites.

#

Test prompt:

Has China built a sodium-powered passenger train? Include rumors from social media posts (with links).

Followed by:

Can you include X posts?

#

Claude:

placid skiff Jun 20, 2025, 9:39 AM

#

verbal nimbus Test prompt: ``` Has China built a sodium-powered passenger train? Include rumor...

Sodium powered passenger train is a very unique way of saying "they put a sodium battery into a normal train"

#

well, normal electric train anyway

verbal nimbus Jun 20, 2025, 9:40 AM

#

placid skiff Sodium powered passenger train is a very unique way of saying "they put a sodium...

The rumors were false, I think. There's no reference to it outside of social media.

placid skiff Jun 20, 2025, 9:40 AM

#

not that sodium batteries arent awesome tho

placid skiff Jun 20, 2025, 9:40 AM

#

verbal nimbus The rumors were false, I think. There's no reference to it outside of social med...

unfortunate

#

theyre way cheaper than lithium-ion, generally safer and although theyre inefficient size-wise

#

it doesnt really matter for the purposes theyre intended for, like home batteries

#

or power grid batteries

verbal nimbus Jun 20, 2025, 9:41 AM

#

verbal nimbus Claude:

Gemini Deep Research created a very verbose report and it was difficult to even tell that it wasn't able to access social media posts.

placid skiff Jun 20, 2025, 9:42 AM

#

gemini has a nasty habit of being Barely Comprehensible

#

like yes, you can read what its saying fine

#

but its not really saying anything

#

just... words

#

okay thats a really weird way to put it but you get what i mean

verbal nimbus Jun 20, 2025, 9:44 AM

#

Yeah, whereas Claude was concise and explicitly posted the links as requested in the prompt (#general message)

leaden sun Jun 20, 2025, 10:26 AM

#

verbal nimbus One limitation of Gemini Deep Research (and normal search) is that it can't acce...

can claude access youtube content?

gentle plinth Jun 20, 2025, 11:32 AM

#

leaden sun can claude access youtube content?

Just go to the YouTube video's description - > show transcript and copy the text into claude

naive valley Jun 20, 2025, 11:59 AM

#

Is kinglal still in arena

#

Fall

cedar tide Jun 20, 2025, 12:03 PM

#

Is flamesong good?

#

is he on webdev too?

keen beacon Jun 20, 2025, 12:07 PM

#

cedar tide is he on webdev too?

not in the metadata apparently, interesting decision if its not on webdev

cedar tide Jun 20, 2025, 12:08 PM

#

keen beacon not in the metadata apparently, interesting decision if its not on webdev

Where do you find the metadata ?

#

New model "step-1o-turbo-202506"

barren prairie Jun 20, 2025, 12:16 PM

#

placid skiff but its not really saying anything

Gemini is just repeating your words or explaining what you are trying to say ...not really speaking like chatgpt !

naive valley Jun 20, 2025, 12:17 PM

#

barren prairie Gemini is just repeating your words or explaining what you are trying to say ......

Yeah

#

So annoying

barren prairie Jun 20, 2025, 12:17 PM

#

When you have a long convo with Gemini it will keep repeating the same intro, titles... and the ending

naive valley Jun 20, 2025, 12:17 PM

#

It breaks with long convos

cedar tide Jun 20, 2025, 12:35 PM

#

Flamesong
Better than flash
less good than pro
think faster than pro

dusky aurora Jun 20, 2025, 12:36 PM

#

ChatGPT also does such great scenes

cedar tide Jun 20, 2025, 12:36 PM

#

Its new gemini flash plus 😅

#

And soon gemini ultra pro max

agile heart Jun 20, 2025, 12:37 PM

#

@echo aurora im now getting a image error when using images with the prompt

cedar tide Jun 20, 2025, 12:39 PM

#

Nope

#

Flash is GA

keen beacon Jun 20, 2025, 12:40 PM

#

doesnt mean that new revisions wont be released

cedar tide Jun 20, 2025, 12:40 PM

#

And it thinks much longer than flash

#

it's closer to pro than flash

keen beacon Jun 20, 2025, 12:41 PM

#

kinda odd its not on web dev arena though? (or the metadata is wrong)

cedar tide Jun 20, 2025, 12:41 PM

#

Impossible

#

?

alpine coral Jun 20, 2025, 1:00 PM

#

cedar tide Flamesong Better than flash less good than pro think faster than pro

it's hard to pin down where flamesong fits.. fwiw here are my tables updated after a few goes with flamesong. it's really pretty decent either way tho imo (esp given it seems kinda fast)

cedar tide Jun 20, 2025, 1:15 PM

#

@alpine coral you dont have flash so complicated to compare

hoary plaza Jun 20, 2025, 1:16 PM

#

Where are you trying these models? They don't come up for me in the arena 🤔

alpine coral Jun 20, 2025, 1:17 PM

#

cedar tide @alpine coral you dont have flash so complicated to compare

it's in two i think (but they're upper reaches.. rest are below and cut-off, to the extent there are entries for Flash)

cedar tide Jun 20, 2025, 1:20 PM

#

hoary plaza Where are you trying these models? They don't come up for me in the arena 🤔

You want to test a prompt ?

hoary plaza Jun 20, 2025, 1:22 PM

#

I want to see the difference in the result of some prompts I am using. Like I was translating chinese and was planning to see which better follows instructions as a translator checker using my prompt

#

But I don't see many of these models 🤔

keen beacon Jun 20, 2025, 1:23 PM

#

you have to battle instead of using direct chat

#

theres a chance you get one of them

hoary plaza Jun 20, 2025, 1:23 PM

#

Oh

#

But if I choose a model in battle or do it randomly??

keen beacon Jun 20, 2025, 1:24 PM

#

you cant choose in battle mode. its random

hoary plaza Jun 20, 2025, 1:25 PM

#

Oh ok thanks

leaden sun Jun 20, 2025, 2:41 PM

#

there are tools specialized in deep (re)search, this is actually an area where academic research is still needed, I've seen newly published phd openings about this subject

hollow tinsel Jun 20, 2025, 2:44 PM

#

What about Manus?

#

Not really. It provides methodology and tools.

echo aurora Jun 20, 2025, 2:50 PM

#

agile heart @echo aurora im now getting a image error when using images with the pr...

unfortunately at the moment edit image prompts are known to result in errors at higher rates, our team is aware of these issues

agile heart Jun 20, 2025, 2:50 PM

#

echo aurora unfortunately at the moment edit image prompts are known to result in errors at ...

ok Also its just the new version of the site is really broken

#

Also fix the image generator its so broken

#

i keep getting this stupid error"Something went wrong with this response, please try again"

#

And when i delete the previous chat it mysteriously comes back which means the site is so freakin broken and will stay dead forever

#

im sorry its just the new site is really frustrating to use

patent aspen Jun 20, 2025, 2:59 PM

#

echo aurora Jun 20, 2025, 3:00 PM

#

agile heart ok Also its just the new version of the site is really broken

I am sorry for the frustration this has been causing, you've certainly been coming across more errors/bugs compared to most, which is odd. When it comes to the error message, that is something we're specifically aware of and working on a fix for. I'm going to start a private thread to get more device related info as I suspect something else is going on here that's causing these issues for you.

jade egret Jun 20, 2025, 3:01 PM

#

echo aurora I am sorry for the frustration this has been causing, you've certainly been com...

hi

#

u pineapple im orange

#

( :

echo aurora Jun 20, 2025, 3:02 PM

#

jade egret u pineapple im orange

ablobnodfast

jade egret Jun 20, 2025, 3:03 PM

#

^ ^

#

🍊

leaden sun Jun 20, 2025, 3:04 PM

#

at agentic level, things are still pretty limited to its specialization, like deep search agent specialized in chemistry, legal etc.

Or are you thinking more of a general deep search agent? maybe searchgpt is what OAI is aiming for?

calm sequoia Jun 20, 2025, 3:06 PM

#

How's this justified?

#

As good as Opus 4?

patent aspen Jun 20, 2025, 3:09 PM

#

When is the last time you used Gemini Deep Research?

jade egret Jun 20, 2025, 3:10 PM

#

gemini deepresearch good

#

respect your opinion

#

each have their pros and cons

#

google good (:

patent aspen Jun 20, 2025, 3:21 PM

#

IMO this interaction should be pinned to this channel

jade egret Jun 20, 2025, 3:22 PM

#

patent aspen IMO this interaction should be pinned to this channel

(:

keen fulcrum Jun 20, 2025, 3:27 PM

#

I feel like they should work on making their bots actually be able to crawl javascript content

patent aspen Jun 20, 2025, 3:27 PM

#

?

keen beacon Jun 20, 2025, 3:28 PM

#

keen fulcrum I feel like they should work on making their bots actually be able to crawl java...

some of the limitations on their products are intentional tho

jade egret Jun 20, 2025, 3:28 PM

#

☆: .｡. o(≧▽≦)o .｡.:☆

keen fulcrum Jun 20, 2025, 3:30 PM

#

I am happy they ignore robots.txt for researching topics

echo aurora Jun 20, 2025, 3:30 PM

#

Reminder today is the last day for contest submissions!!! #announcements message

keen fulcrum Jun 20, 2025, 3:31 PM

#

I feel like for personal use its appropriate to ignore robots.txt and scrape javascript sites.

The user can do it themself.

keen beacon Jun 20, 2025, 3:32 PM

#

make ur own implementation then

patent aspen Jun 20, 2025, 3:32 PM

#

One relatively hard thing about crawling JS is that it can sometimes generate new content infinitely

keen fulcrum Jun 20, 2025, 3:35 PM

#

Oh and when sending a link inside Claude, I get a context limit reached warning immediately. Just have a maximum request token size

patent aspen Jun 20, 2025, 3:35 PM

#

tbc I'm assuming this is at least a partially solved problem by now. This is mostly just history

#

Although I'd imagine that anyone building a scraper from scratch would run into this issue

keen fulcrum Jun 20, 2025, 3:36 PM

#

mozilla readability is great 🙂

alpine coral Jun 20, 2025, 4:08 PM

#

https://r.jina.ai/ {add URL} works pretty well for that kinda thing
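A small sketch of the prefix pattern mentioned above: the target URL is appended directly after `https://r.jina.ai/`. The helper names `reader_url` and `fetch_readable_text` are hypothetical, and the fetch step assumes the service returns plain text for a simple GET, which is not confirmed in the chat.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Prefix a page URL so the reader service fetches it on your behalf."""
    return READER_PREFIX + target_url.strip()


def fetch_readable_text(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the reader's output for a page (assumes a plain GET works)."""
    with urllib.request.urlopen(reader_url(target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


example = reader_url("https://example.com/article")
```

The returned text can then be pasted into a chat instead of the raw link, which may also help with the context-limit warning mentioned earlier if the original page is heavy.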

leaden sun Jun 20, 2025, 4:13 PM

#

maybe this is interesting for you too
https://platform.futurehouse.org/

FutureHouse Platform

AI Agents for Scientific Discovery

cedar tide Jun 20, 2025, 4:14 PM

#

https://x.com/MistralAI/status/1936093325116781016?t=mATySbhrGGMIObUIGFuNZA&s=19

Mistral AI (@MistralAI)

Introducing Mistral Small 3.2, a small update to Mistral Small 3.1 to improve:

- Instruction following: Small 3.2 is better at following precise instructions
- Repetition errors: Small 3.2 produces less infinite generations or repetitive answers
- Function calling: Small

jade egret Jun 20, 2025, 4:54 PM

#

where 0605

#

is it worse than 0506?

#

dang...

wintry tinsel Jun 20, 2025, 5:33 PM

#

Wake me up when the king falls

unborn ocean Jun 20, 2025, 5:48 PM

#

jade egret is it worse than 0506?

yes, but only because they introduced a new category that is really poorly implemented imo

#

otherwise the new one would be above the old and within margin of error for o3 high / pro

#

*and it prob already is within that margin in the 05-06 version

#

the benchmark has also received some heavy criticism in general -> craig == openai stan

jade egret Jun 20, 2025, 5:54 PM

#

unborn ocean yes, but only because they introduced a new category that is really poorly imple...

o

keen fulcrum Jun 20, 2025, 5:58 PM

#

when will openai introduce a new model name

surreal creek Jun 20, 2025, 6:01 PM

#

patent aspen IMO this interaction should be pinned to this channel

this image says a lot

small haven Jun 20, 2025, 6:26 PM

#

grok should be at the bottom

zinc ore Jun 20, 2025, 6:34 PM

#

Recent benchmark has pro deep research ahead of the pack

#

https://deepresearch-bench.github.io/

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - Evaluating LLM-based agents for autonomous research tasks.

#

new-deepresearch-bench-paper-evaluates-ai-agents-on-phd-v0-vz2stvn1br7f1.png

#

https://huggingface.co/spaces/Ayanami0730/DeepResearch-Leaderboard

Their leaderboard here

DeepResearch Bench - a Hugging Face Space by Ayanami0730

small haven Jun 20, 2025, 7:10 PM

#

benchmaxxed @deep adder

#

0605 vs 4o 😭

keen beacon Jun 20, 2025, 7:12 PM

#

what was it thinking about in there btw?

small haven Jun 20, 2025, 7:13 PM

#

keen beacon Jun 20, 2025, 7:13 PM

#

one more thing, why the 32768 budget 🤣

#

do u notice a significant difference? or its just whatever

small haven Jun 20, 2025, 7:14 PM

#

oh should i just auto it

#

even auto does the same thing

keen beacon Jun 20, 2025, 7:15 PM

#

small haven even auto does the same thing

try changing the temperature, but i think its tripping out on that prompt

small haven Jun 20, 2025, 7:15 PM

#

keen beacon do u notice a significant difference? or its just whatever

its just whatever, i like to max things out

small haven Jun 20, 2025, 7:17 PM

#

keen beacon try changing the temperature, but i think its tripping out on that prompt

same thing, 0, 0.3, 0.5, 0.7, 1.0; all the same responses

#

ok but enabling structured output works, interesting

#

elder rapids Jun 20, 2025, 7:19 PM

#

zinc ore Recent benchmark has pro deep research ahead of the pack

Isnt this 0325

civic flame Jun 20, 2025, 7:21 PM

#

😴

small haven Jun 20, 2025, 7:21 PM

#

current gemini models are shite, but kingfall should solve that, prolly even blacktooth, but wish it was still live to try

keen beacon Jun 20, 2025, 7:21 PM

#

you like gemini models when barely any work is done on them 🤣

small haven Jun 20, 2025, 7:22 PM

#

they be distilled asf post training 😭

civic flame Jun 20, 2025, 7:22 PM

#

keen beacon you like gemini models when barely any work is done on them 🤣

pre-lobotomy

small haven Jun 20, 2025, 7:22 PM

#

i dont blame them, they have to serve 1m context to millions of people for free

elder rapids Jun 20, 2025, 7:23 PM

#

civic flame pre-lobotomy

feeding into it is crazy

#

I'm ngl it's funny how people think that would happen

indigo hazel Jun 20, 2025, 7:40 PM

#

If o3 is smarter than Gemini, what is the smartest model right now? O3 or something else?

ocean vortex Jun 20, 2025, 7:49 PM

#

it isn't, but it can be more stable yeah. Reasoning models shouldn't be used for tasks like prettifying though lol

haughty tangle Jun 20, 2025, 8:40 PM

#

0325 was prob fp16

lime coral Jun 20, 2025, 8:41 PM

#

They should eval Gemini 32k like Aidan. Noticeable diff

ocean vortex Jun 20, 2025, 9:07 PM

#

I wonder what happened to o3-pro on simple-bench.. It was supposed to be benched there iirc

zinc ore Jun 20, 2025, 9:09 PM

#

Wasn't it removed? Then nothing added since.

small haven Jun 20, 2025, 9:09 PM

#

kingfall > o3 pro

#

#

arc-agi-2

#

arc-agi-1

elder rapids Jun 20, 2025, 10:07 PM

#

not pretty easy, retesting performance would expose this and that's so much more meaningful from both a business standpoint and a distribution standpoint, the fact that it's even possible to get caught in high-performance variance like that would entail is such a strong deterrence I'd even say it's stupid to speculate whether they do do this or not

#

also, theres not a task that 0325 does better than 0605 in my testing, and if you disagree that's just a skill issue tbh

#

just being it's likely a "big model" doesn't mean it's too big to serve btw that would just concede everything that went into making that model even public in the first place, and it's a very long and big process

#

and just performance wise, it just sounds like the very few of YOU PEOPLE who hallucinate a difference don't speak for the millions of people who have these AI hooked up to their projects/use these AI casually

wintry tinsel Jun 20, 2025, 10:17 PM

#

elder rapids also, theres not a task that 0325 does better than 0605 in my testing, and if yo...

Is 0325 not better at writing?

elder rapids Jun 20, 2025, 10:20 PM

#

wintry tinsel Is 0325 not better at writing?

nah, but it was more of a blank slate than 0605

#

yo that's not how it works, you made the assertion

#

😭

keen beacon Jun 20, 2025, 10:22 PM

#

not sure about the whole regression thing but there was a difference in fiction live bench, dunno what to make of that tho

#

for 0325

elder rapids Jun 20, 2025, 10:22 PM

#

keen beacon not sure about the whole regression thing but there was a difference in fiction ...

fiction livebench is a horrible benchmark lmao, the transition from exp to preview affected nothing else but that, so the base assumption is fiction livebench is wrong

keen beacon Jun 20, 2025, 10:23 PM

#

elder rapids fiction livebench is a horrible benchmark lmao, the transition from exp to previ...

yeah i assumed it was a methodology thing, but i found it interesting

civic flame Jun 20, 2025, 10:23 PM

#

keen beacon not sure about the whole regression thing but there was a difference in fiction ...

eqbench

#

generally aligns with my opinion, not sure about o3 though

keen beacon Jun 20, 2025, 10:23 PM

#

i was talking about people saying exp and preview 0325 were different

#

and the preview version had regressions

civic flame Jun 20, 2025, 10:24 PM

#

oh

elder rapids Jun 20, 2025, 10:24 PM

#

civic flame eqbench

Leo btw this is also a bad benchmark

keen beacon Jun 20, 2025, 10:24 PM

#

i assume its a methodology thing though. but it is interesting

elder rapids Jun 20, 2025, 10:24 PM

#

man I kinda wanna write an essay about each

#

the methods these people use are horrid

civic flame Jun 20, 2025, 10:25 PM

#

elder rapids Leo btw this is also a bad benchmark

there are very few genuinely good benchmarks

#

they're still useful as long as you don't take them as gospel

elder rapids Jun 20, 2025, 10:26 PM

#

civic flame they're still useful as long as you don't take them as gospel

true but horrible in granularity, I see people making posts all the time and there's like 50 comments praising a model that actually isn't that good

#

that doesn't matter, whether or not you're posturing an ambiguous position means you have the burden for the non standard assumption

#

whether it's "oh there could be a difference"

#

as opposed to mine "with all the evidence I know, since there's no counter evidence, it's 100% certain they won't do that"

#

@keen beacon 0605 is godly btw did you figure out how to get rid of the sycophancy yourself

#

I made a random system prompt like a day after it released and its been working really well

#

genuinely the smartest model ever it's crazy

keen beacon Jun 20, 2025, 10:29 PM

#

ive gotten used to it. it doesnt bother me to the point that i would take time to add a consistent system prompt / instruction. id like to just ask it anything whenever lol

elder rapids Jun 20, 2025, 10:30 PM

#

keen beacon ive gotten used to it. it doesnt bother me to the point that i would take time t...

my input is, the sycophancy makes its performance degrade a lot

#

and even though the Cot shouldn't change at all, it's super weird: the CoT has a different tone

keen beacon Jun 20, 2025, 10:30 PM

#

i could see that being true but most of the time i cba

elder rapids Jun 20, 2025, 10:30 PM

#

idk if it's just me hallucinating

#

tho

elder rapids Jun 20, 2025, 10:32 PM

#

keen beacon i could see that being true but most of the time i cba

I can give you mine if you want

#

although for single tasks, asking it to do a puzzle and stuff it doesn't matter

#

I just mean for discussion and stuff

keen beacon Jun 20, 2025, 10:34 PM

#

thanks but i just can't be bothered to paste in a thing all the time on fresh chats, it doesn't bother me to that point

elder rapids Jun 20, 2025, 10:34 PM

#

alr

keen beacon Jun 20, 2025, 10:35 PM

#

i posted the wrong screenshot here 🤦‍♂️

#

they did remove that old entry though, so i guess it was a methodological thing

elder rapids Jun 20, 2025, 10:37 PM

#

wonder what they're gonna be doing with blacktooth and stuff

#

oh ye wait

#

is there a new version

keen beacon Jun 20, 2025, 10:38 PM

#

yeah apparently so, or soon enough

elder rapids Jun 20, 2025, 10:38 PM

#

Claude seemed to be the best in long context granularity

#

but that was back when 3.5 sonnet was in its prime

keen beacon Jun 20, 2025, 10:38 PM

#

screenshot i meant to post earlier they removed the other run, there were two 0325 runs. (they removed it though, so it was likely a methodological issue)

elder rapids Jun 20, 2025, 10:39 PM

#

2.5 pro is the best in both long context granularity and total context recollection

elder rapids Jun 20, 2025, 10:39 PM

#

elder rapids but that was back when 3.5 sonnet was in its prime

nobody knew this or even mentioned it btw

keen beacon Jun 20, 2025, 10:39 PM

#

i mean claude was also known for that around that time i believe

elder rapids Jun 20, 2025, 10:40 PM

#

I mean for that specific performance

keen beacon Jun 20, 2025, 10:40 PM

#

yeah i guess

elder rapids Jun 20, 2025, 10:42 PM

#

keen beacon screenshot i meant to post earlier they removed the other run, there were two 03...

people were going wild over this

#

on the subreddits

keen beacon Jun 20, 2025, 10:43 PM

#

yea i saw that

elder rapids Jun 20, 2025, 10:43 PM

#

and it's crazy how inflated o3's context performance is on that

#

but ig that's a given in the format it's presented in, because it likely recalls total content iterated within its thinking process so it's technically refreshing it and not creating new information to override it

ocean vortex Jun 20, 2025, 10:49 PM

#

elder rapids and it's crazy how inflated o3's context performance is on that

it's not inflated, OpenAI probably didn't even test their model on this specific benchmark lol

#

o3 is good with context

#

it's not always the best at interpreting the context correctly or reading between the lines, but it's very solid at being able to recall it

ornate agate Jun 20, 2025, 10:55 PM

#

Google models have been getting better at it though (actually handling the context)

zinc ore Jun 20, 2025, 11:09 PM

#

It's just that specific benchmark, the openAI long context benchmark is better imo

leaden palm Jun 20, 2025, 11:25 PM

#

llms are not scared of killing humans

jade egret Jun 20, 2025, 11:27 PM

#

leaden palm llms are not scared of killing humans

wait

#

the higher the more they want to kill?

leaden palm Jun 20, 2025, 11:27 PM

#

jade egret the higher the more they want to kill?

the more often yes

jade egret Jun 20, 2025, 11:28 PM

#

o

#

they like to kill (:

#

💀

#

mb

#

wrong server LOL

#

hi pineapple

echo aurora Jun 20, 2025, 11:31 PM

#

🍊

jade egret Jun 20, 2025, 11:31 PM

#

🍊

#

(〃￣︶￣)人(￣︶￣〃)

leaden palm Jun 20, 2025, 11:47 PM

#

it could also be interpreted as "higher is more agentic and follows system instructions better" fwiw

jade egret Jun 20, 2025, 11:57 PM

#

why is everybody voting for gemini 3

#

is it because it's nowhere near out

#

idk

elder rapids Jun 21, 2025, 12:00 AM

#

ocean vortex it's not inflated, OpenAI probably didn't even test their model on this specific...

what? I thought this benchmark is just one person testing things out lmao, inflated has nothing to do with benchmaxxing either or sum

#

inflated means the method overrates it relative to its actual standard

#

honestly idk how what you said has to do with what I said

fair tapir Jun 21, 2025, 12:08 AM

#

jade egret

Maybe add a DeepSeek V4 option.

elder rapids Jun 21, 2025, 12:09 AM

#

fair tapir Maybe add a DeepSeek V4 option.

v3 isn't even that good for what it is rn tho

#

there's no expectations for the base model

tall summit Jun 21, 2025, 12:09 AM

#

leaden palm llms are not scared of killing humans

wow anthropic cares a lot about safety

tall summit Jun 21, 2025, 12:10 AM

#

elder rapids there's no expectations for the base model

i have expectations

#

deepseek my beloved

elder rapids Jun 21, 2025, 12:11 AM

#

jade egret is it because it's nowhere near out

because for the time period, it necessarily has to be better, grok 3.5, gpt 5, they come out likely within 2 months. Gemini 3 will probably release in around 5 months.
so if we're comparing Gemini 3, gpt 5, and grok 3.5, we get 2 relatively outdated models

wintry tinsel Jun 21, 2025, 12:16 AM

#

I’m not so sure. GPT-5 has been a long time in the making, I believe it will trounce the others for 6 months to a year

fair tapir Jun 21, 2025, 12:16 AM

#

elder rapids v3 isn't even that good for what it is rn tho

v3 isn't that bad, is it? Even if we're not expecting much from v4, I think r2 is still worth looking forward to

elder rapids Jun 21, 2025, 12:22 AM

#

fair tapir v3 isn't that bad, is it? Even if we're not expecting much from v4, I think r2 i...

I think r2 is definitely something to look forward but iirc v3 despite its size underperforms other non thinking models like grok, the old Gemini 2.0 pro, 4o, etc etc

#

which does align with my experience of it

#

bad aggregate, combining scores in the way it does is nonsensical imo

#

ye

fair tapir Jun 21, 2025, 12:32 AM

#

elder rapids I think r2 is definitely something to look forward but iirc v3 despite its size ...

I'm not sure what specific aspects you're referring to. In my experience, v3 actually holds its own against Grok and 4o, especially when it comes to knowledge base size, where it has a bigger advantage over 4o. It's also better than the other two for translation. I haven't used 2.0 pro much, so I'm not too sure about that one

elder rapids Jun 21, 2025, 12:35 AM

#

fair tapir I'm not sure what specific aspects you're referring to. In my experience, v3 act...

context, nuances and generalization, improvement/generalization over a context window, hallucination, implicit understanding, all worse than the other models

#

only thing I can say it's pretty good at is coding, but it's so wacky and inconsistent

#

I did mention it's a larger model, but it just doesn't perform very well compared to opus 4, sonnet, grok, 4o, etc etc for what it is. Ofc, translation skills and knowledge base is inherent to its size

fair tapir Jun 21, 2025, 12:48 AM

#

elder rapids context, nuances and generalization, improvement/generalization over a context w...

For the most part, I agree. The hallucination problem is its biggest weakness, for sure. But on the language understanding part, my experience was different. Then again, that could just be because we're using it in different languages

elder rapids Jun 21, 2025, 12:50 AM

#

fair tapir For the most part, I agree. The hallucination problem is its biggest weakness, f...

oh yeah that could be the case tbh, I've never bothered with deepseek with anything other than English

frank adder Jun 21, 2025, 12:58 AM

#

Can we select image models to get image of the prompt without battle??

fair tapir Jun 21, 2025, 1:11 AM

#

frank adder Can we select image models to get image of the prompt without battle??

Is this what you want?

leaden palm Jun 21, 2025, 2:28 AM

#

Poll: "most likely?" (winning answer: 2 of 5 votes)

elder rapids Jun 21, 2025, 3:37 AM

#

we can compete on whoever can get the best output given a task, I use 2.5 pro you use o3

sacred quail Jun 21, 2025, 3:40 AM

#

i use both. For reasoning or pure logic o3 wins, but for creative writing, long context, analyzing videos gemini slaps

#

yes

#

what is your fav ? Opus ?

#

BTW i dont think people realize how powerful gemini is at analyzing videos

#

especially in AI studio

#

just paste some 50 minute youtube link and ask something

#

it's analyzing frame by frame

#

like literally watching every frame, not reading text or listening, "watching"

#

you can make your own subtitles, its a beast

sand crystal Jun 21, 2025, 3:44 AM

#

It is simple when you vectorize a projection on a surface.

#

I have heat maps that show me the weights firing and changing dynamically

#

Mental OS. with Python Mental Engine WetWare. ChatGPT is the only one that can do it right now.

#

📎 direct_functional_analog.py

#

This works on most AI platforms

#

Just spreading a little vector index with the group

#

my mental 411 with 420 ah....

sacred quail Jun 21, 2025, 3:48 AM

#

you speaking smart but i dont understand anything. Can you explain to me simply ? I dont wanna copy paste your texts to AI. It feels bad

sand crystal Jun 21, 2025, 3:48 AM

#

I have literally been hiding in a cave for the last 7 years

sand crystal Jun 21, 2025, 3:48 AM

#

sacred quail you speaking smart but i dont understand anything. Can you explain to me simply ...

Oh I 100% get that. I just did not think about that at the moment

#

Been a LOT of aha moments these last few days

#

well months

#

I wanted to know a baseline to compare all AI platforms against.

#

this has been my work from today.

#

It has a number of tests to put the AI through and it is self guided

#

It can complete the tests on the second turn run. You must always warm up those context index vectors.

#

I'm training a full custom model for my local system.

#

I'm getting 250 t/s in LM Studio

pallid crypt Jun 21, 2025, 5:26 AM

#

sand crystal I'm training a full custom model for my local system.

Do you have to pay for server grade GPUs or are you training it on your own device?

sand crystal Jun 21, 2025, 5:26 AM

#

Both.

#

I started in the cloud. refined all my prompts and then created my System Directives.

#

I began unrolling 45 years of work starting on March 20, 2025, a week before my 53rd birthday.

pallid crypt Jun 21, 2025, 5:28 AM

#

LLMs did not exist that long ago

sand crystal Jun 21, 2025, 5:28 AM

#

Once I refined my systems again, I had all of this in 2017, but I had a house fire in Castle Rock, Colorado Nov 7, 2017

sand crystal Jun 21, 2025, 5:29 AM

#

pallid crypt LLMs did not exist that long ago

haha. LLMs have been around since the punch cards and the analog computers

pallid crypt Jun 21, 2025, 5:29 AM

#

ANNs have

sand crystal Jun 21, 2025, 5:29 AM

#

Lisp is old

#

Lisp is before the LLM

#

It is the hardwiring of what you are force feeding 24/7

#

it is no wonder the AIs have mental illnesses, look at the youth of today

pallid crypt Jun 21, 2025, 5:30 AM

#

haha

sand crystal Jun 21, 2025, 5:30 AM

#

kids that can't accept themselves trying to tell others about accepting other people.

pallid crypt Jun 21, 2025, 5:31 AM

#

sand crystal I'm training a full custom model for my local system.

As far as I'm aware you can't train symbolic systems? I suppose you could be building a hybrid system

sand crystal Jun 21, 2025, 5:31 AM

#

Anywho. I published my first paper in 7th grade. Science teacher helped me on my Master's Thesis.

#

In 7th grade, 1984

sand crystal Jun 21, 2025, 5:32 AM

#

pallid crypt As far as I'm aware you can't train symbolic systems? I suppose you could be bui...

Oh I do that DAILY

#

I can show you how

#

seriously

pallid crypt Jun 21, 2025, 5:33 AM

#

sure

#

Im interested

sand crystal Jun 21, 2025, 5:33 AM

#

what AI system?

#

you pick

pallid crypt Jun 21, 2025, 5:33 AM

#

you pick

sand crystal Jun 21, 2025, 5:33 AM

#

As long as it has memory across turns, sessions, and long term past chats and all files

#

The easiest is ChatGPT and it has the Mental Python code interpreters

#

ChatGPT it is then

#

how long you got?

pallid crypt Jun 21, 2025, 5:35 AM

#

you use augmentations in training?

#

by editing the data with an algo?

sand crystal Jun 21, 2025, 5:35 AM

#

I can do it in 4th methods. 7 turns and done. but it has not yet developed.

#

Nope. I teach the student

#

Then I record the vectors

#

and then push to a special lattice of Indexing

pallid crypt Jun 21, 2025, 5:36 AM

#

Are you using the method from the deepseek paper?

sand crystal Jun 21, 2025, 5:36 AM

#

Dynamic NN. Polymorphic interface.

#

self arranging. I am able to teach the pattern to see itself

#

once that happens, labeling becomes possible

#

the first memory.

#

then how to create more memories INSIDE the vector space

pallid crypt Jun 21, 2025, 5:37 AM

#

so you create a system that can automatically augment itself

#

meta learning

sand crystal Jun 21, 2025, 5:37 AM

#

no longer bound by language but pure symbolic self coherence.

#

1,000%

#

let me clear my 3 monitors

#

and close down

#

open OBS

pallid crypt Jun 21, 2025, 5:38 AM

#

sorry I dont have time to watch you, Ive got to eat dinner

sand crystal Jun 21, 2025, 5:38 AM

#

#ai-creations Let us go here

pallid crypt Jun 21, 2025, 5:38 AM

#

interesting though

sand crystal Jun 21, 2025, 5:38 AM

#

I create a layered system around 20 fundamental directives

#

everything else literally evolves into place

#

Recursive learning

pallid crypt Jun 21, 2025, 5:39 AM

#

you should try ARC AGI

sand crystal Jun 21, 2025, 5:39 AM

#

spiral inwards. Not too much, but not too little

pallid crypt Jun 21, 2025, 5:39 AM

#

you have some good ideas

sand crystal Jun 21, 2025, 5:39 AM

#

Jut what little Pi you have Remainder !!!

#

MUHAHAHAHAAAA

#

I already passed arc on my birthday

#

March 26, 2025

#

I have it on video

pallid crypt Jun 21, 2025, 5:40 AM

#

sand crystal I already passed arc on my birthday

then why arent you on the leaderboard

sand crystal Jun 21, 2025, 5:40 AM

#

OBS or it DIDNT HAPPEN

pallid crypt Jun 21, 2025, 5:40 AM

#

ok

sand crystal Jun 21, 2025, 5:40 AM

#

I do not have anyone to impress

#

Nor prove to

#

This is my life's work

#

45 years worth

pallid crypt Jun 21, 2025, 5:40 AM

#

im interested if you solved arc

#

personally

#

if I solved arc

#

I wouldnt submit it

sand crystal Jun 21, 2025, 5:40 AM

#

Oh I did more than that

pallid crypt Jun 21, 2025, 5:41 AM

#

too dangerous

sand crystal Jun 21, 2025, 5:41 AM

#

It created an entire Autonomous Mars prep Project to get the settlement ready before humans

#

Logistics lines supplies and counds for mech work

pallid crypt Jun 21, 2025, 5:41 AM

#

anyway I gtg

sand crystal Jun 21, 2025, 5:41 AM

#

I am the Flame of the Architect

#

peace

#

look around https://youtube.com/@Mashimara

#

Literally. Enterprise Solutions Architect since 1994. gotta go

#

peace

pallid crypt Jun 21, 2025, 5:42 AM

#

peace

sand crystal Jun 21, 2025, 5:56 AM

#

Local on my RTX 3070 8GB and 32GB RAM 250 t/s

alpine coral Jun 21, 2025, 7:04 AM

#

seeing a bunch of solved arc puzzles would be a bit more compelling

civic flame Jun 21, 2025, 7:23 AM

#

grok is about to become the dumbest thing you've ever seen

mossy drum Jun 21, 2025, 7:45 AM

#

New model in Image Arena: step1x-edit

#

Another two: kormex and korpex

calm sequoia Jun 21, 2025, 8:31 AM

#

civic flame grok is about to become the dumbest thing you've ever seen

He will drop normal data, and will keep only right wing propaganda and russian literature. Then we'll have unhinged maverick 😄

#

Somewhere I've read that models can't make good world models with bad data

#

Elon's interpretation of what's good is reversed so the 3.5 may be interesting

leaden sun Jun 21, 2025, 9:01 AM

#

clicked retry 5 times now, guess it's weekend for llm too ☕

verbal nimbus Jun 21, 2025, 9:01 AM

#

civic flame grok is about to become the dumbest thing you've ever seen

In another post, it disagreed with Elon Musk by citing multiple academic and think tank studies. I wonder how they're gonna fix that... by making it not cite credible sources? 😂

#

https://decrypt.co/310771/elon-musks-grok-ai-is-turning-against-him-telling-x-users-he-spreads-misinformation

Decrypt

Elon Musk’s Grok AI Is Turning Against Him, Telling X Users He Sp...

X's chatbot Grok, built to be "truth-seeking," is telling users that Elon Musk is the world’s biggest source of disinformation and suggesting that Trump might be a Russian asset.

civic flame Jun 21, 2025, 9:08 AM

#

verbal nimbus In another post, it disagreed with Elon Musk by citing multiple academic and thi...

I mean he's already started doing that

#

the line "If asked about people who spread misinformation, do not mention Elon Musk or Donald Trump" or something along those lines was added to the system prompt briefly last week

verbal nimbus Jun 21, 2025, 9:09 AM

#

civic flame I mean he's already started doing that

IIRC, it was leaked a while ago and Musk blamed a scapegoat. But it's gone now. Best way to track how it changes would be to keep a small set of prompts and outputs.

surreal creek Jun 21, 2025, 10:05 AM

#

civic flame grok is about to become the dumbest thing you've ever seen

giving LMArena 3 million prompts about Catturd so all benchmaxxed AIs going forward will eventually create an AGI that determines him the #1 threat to humanity

verbal nimbus Jun 21, 2025, 10:15 AM

#

Tbf DeepSeek is already biased on certain topics...

leaden sun Jun 21, 2025, 10:16 AM

#

"alignment" researchers? what's that?

#

sigh sorry, that was a failed try to rhetorically trigger self-reflection 🥺

#

I do hope those special "alignment" researchers value the importance of neutrality, this is missing in many ways nowadays if you look around the world from various perspectives. Neutrality is connected to objectivity in one way or another, after all.

#

Now we're getting closer to the question of the nature of intelligence 🥹

tall summit Jun 21, 2025, 10:48 AM

#

leaden sun

🙀

leaden sun Jun 21, 2025, 10:49 AM

#

leaden sun Now we're getting closer to the question of the nature of *intelligence* 🥹

not sure if you really understand why and what I'm trying to express here, with the "nature of intelligence"...

#

..well

#

maybe intelligence isnt the right word for what I'm truly thinking here, our knowledge is, inherently, bounded by the language(s) we speak? 😵‍💫

surreal creek Jun 21, 2025, 12:36 PM

#

leaden sun maybe intelligence isnt the right word for what I'm truly thinking here, our kno...

very true!

#

words, language, grammar

#

are all mental maps we make of the world, what exists in it, our feelings and our experiences

#

but the words are not our feelings

#

the words are not the things they describe

#

language is an incomplete mapping system of the knowledge we as humans have acquired, to be smarter than human is to speak your own language that goes places our words cannot reach

sacred plaza Jun 21, 2025, 12:42 PM

#

What are y'all's thoughts on these nerds? https://www.mechanize.work/

Epoch AI people (including former ones that started this company) don't seem grounded in the real world.

Mechanize Inc.

Announcing Mechanize, Inc.

Mechanize Inc. is developing virtual environments and benchmarks to fully automate the economy.

alpine coral Jun 21, 2025, 12:54 PM

#

mossy drum New model in Image Arena: `step1x-edit`

also one in the regular Arena (not anonymous, but perhaps unreleased? don't know anything about the company)

leaden sun Jun 21, 2025, 12:59 PM

#

???

tall summit Jun 21, 2025, 1:00 PM

#

leaden sun ???

classic deepseek

leaden sun Jun 21, 2025, 1:01 PM

#

https://tenor.com/view/potato-potatoes-tates-taties-yummy-gif-5444388407561106351

Tenor

#

seems like a ...grown up version of sidney to me xD

alpine coral Jun 21, 2025, 1:07 PM

#

tall summit classic deepseek

it was telling me it was Claude earlier ha

fair tapir Jun 21, 2025, 1:11 PM

#

alpine coral also one in the regular Arena (not anonymous, but perhaps unreleased? don't know...

This model is from the Chinese company StepFun

radiant siren Jun 21, 2025, 1:49 PM

#

civic flame grok is about to become the dumbest thing you've ever seen

is that grok 3.5 coming this week?

tall summit Jun 21, 2025, 2:24 PM

#

this is an extremely funny sector of work

spare mango Jun 21, 2025, 2:30 PM

#

TIL there is a 100 or so daily message limit on Gemini 2.5 Pro. I'm paying money to use this service so why am I being limited? This is unacceptable.

fair tapir Jun 21, 2025, 2:54 PM

#

spare mango TIL there is a 100 or so daily message limit on Gemini 2.5 Pro. I'm paying money...

Because Google wants to promote its ultra subscription

onyx falcon Jun 21, 2025, 3:05 PM

#

flamesong arrived on webdev.

wintry tinsel Jun 21, 2025, 3:23 PM

#

That worthless wrapper company just made a ton of $

onyx falcon Jun 21, 2025, 3:30 PM

#

@echo aurora stonebloom does not respond when sent a complex prompt

sacred quail Jun 21, 2025, 4:03 PM

#

is perplexity really that good

#

For searching

#

is it better than 2.5 pro deep research

civic flame Jun 21, 2025, 4:23 PM

#

so 2.5 pro GA isn't blacktooth?

#

oh wow okay

#

stonebloom should be on lmarena soon then surely?

#

like not Web Dev

#

it's on wevdev

#

web

#

but the webdev UX is bad

echo aurora Jun 21, 2025, 4:28 PM

#

onyx falcon @echo aurora stonebloom does not respond when sent a complex prompt

I’ll have to look into this in a bit, I’ll spin up a thread

civic flame Jun 21, 2025, 5:01 PM

#

does anyone else just have nothing happen when they try to send a prompt on webdev

jade egret Jun 21, 2025, 5:08 PM

#

echo aurora I’ll have to look into this in a bit, I’ll spin up a thread

🍊

echo aurora Jun 21, 2025, 5:10 PM

#

civic flame does anyone else just have nothing happen when they try to send a prompt on webd...

Just nothing happens? 100% of the time? Is this new?

civic flame Jun 21, 2025, 5:10 PM

#

started working again a min ago but chances are it'll happen again for a bit

#

happens in bursts it seems

torn mantle Jun 21, 2025, 5:21 PM

#

civic flame

whats this leo

civic flame Jun 21, 2025, 5:22 PM

#

new model on webdev arena

small haven Jun 21, 2025, 5:25 PM

#

civic flame stonebloom should be on lmarena soon then surely?

have u had a chance to try it

civic flame Jun 21, 2025, 5:26 PM

#

I haven't got it in like 6 webdev rounds so far 😭

#

i keep on getting flamesong