#general

wintry citrus
#

I hate my life

#

im ready for the execution

#

i just did

stray aspen
#

what are you doing

#

are you high

wintry citrus
#

no im low

misty vault
wintry citrus
#

LOW TAPER FADE

#

HAHAHAHA

#

HAHAHAHAHAHHAHA

#

HAHAHHAHAHAHHA

#

kill me already

stray aspen
#

im on low effort right now

wintry citrus
#

I'd kill myself

stray aspen
#

those dudes are open source chinese LLMs

wintry citrus
#

than be more embarrassed

#

THEY'RE BOTH CATS TOO

#

tahn

#

bearn English

#

speak English

#

me myself i would say that too

#

🗣️

stray aspen
#

holy

#

is that bing chat

#

from 2023

wintry citrus
#

no i don't

#

what a

#

im gonna burn u in hell

#

if u delete it

#

again

#

im gonna make u suffer

#

seeing ur parents die

#

infront of u

#

respectfully

#

.

#

respectfully

stray aspen
#

lol

#

good thing pineapple aint online

wintry citrus
#

don't need to remember

#

oh

#

I'd need that

#

tho it's not gonna be a butterfly

#

it's gonna be a whole ass blade

golden ocean
wintry citrus
#

SEND THAT AI

#

NOW

#

NOW

#

SEND IT HERE BUDDY

#

SEND IT

#

awh shucks

misty vault
#

@golden ocean does absolutely not know what that is

wintry citrus
#

just use Gemini if u tell it to kill itself it will forgive u

#

if u tell it i killed ur parents

#

if u tell it i kissed ur ai gf

#

who is paul 😭

#

kiss him

misty vault
#

Bro sent 3 message bubbles

wintry citrus
#

does it have freaky version

#

who's that

#

an ai?

misty vault
#

same

misty vault
wintry citrus
#

i could rizz that ai

#

it's not even a real girl

#

100%

#

freaky chiggas

#

is that actually bing

#

there's no way

misty vault
wintry citrus
#

that's like chatgpt when u tell it a whole paragraph

wicked root
#

which AI is this?

misty vault
#

the underlying gpt-4 fine tune

#

no additional system instructions telling it to behave like this

wintry citrus
#

NONE?

#

NONE

#

ss

#

and sent to the fbi

#

sent to everyone

#

sent to my server

#

sent to china

#

what

misty vault
#

Bro has modded android discord to insert fake messages

wintry citrus
#

WHAT

#

FAKE

#

HOW

#

AWH I DIDN'T RECORD IT

#

NOW IT LOOKS FAKE

#

FUUUU

#

that's my wife

#

gah damn

#

she hot

#

I'd fall for her

#

lmao

misty vault
#

I thought you said "fill"

wintry citrus
#

MAHHHHmah

#

NAH BRO

#

NAH

#

CHILL

#

CHILL

#

tell it I'm gonna have freaky with u

#

not segs

#

freaky

#

trust

#

oh

#

is this real

golden ocean
misty vault
#

I WAS about to say that

wintry citrus
#

is this American

misty vault
#

He is guarding the bed

wintry citrus
#

omg

#

DO U LIKE CHATGPT

#

THAT MUCH

#

WHAT THE FU

#

orrr

#

ur 5 years old

#

OR

verbal nimbus
#

LMArena leaderboard doesn't report which model is the actual one being used. There's also GPT-5 Chat, which is different

wintry citrus
#

u like

#

five year olds

#

wait y'all is there an api for the openai o3 pro

misty vault
#

yea

#

This url

wintry citrus
#

no thanks

verbal nimbus
#

They should specify it's the thinking variant. All other models have the thinking variant explicitly labeled.

wintry citrus
#

okay nvm

#

give mee it

#

rn

#

HEY BUDDY

#

IM GONNA FIND UR HOUSE

#

im gonna cut off ur limbs

#

no

#

i don't wanna be obese

#

i don't wanna be in the country of an orange president

#

i got an image of him

#

that's him walking

#

trump

#

this is his emoji

#

🍊

#

the hair is perfect too

#

oohhhh

#

is that Sydney)

#

she's not that hot

#

tbh

#

..

verbal nimbus
#

Long context benchmark. New GPT-5 models are the first 3 (list is not ordered).

#

Kinda hard to tell, for example Gemini 2.5 Pro performs better at 192K but worse for shorter lengths.

#

And o3 is best except for 16K, 60K and 192K.

golden ocean
#

o/

wintry citrus
#

SO I GOT WARNED

pseudo hemlock
#

?warn @wintry citrus

wintry citrus
#

god damn

wintry citrus
#

i would say bad things

#

yeah?

#

i would

pseudo hemlock
#

say them

#

or are you scared

wintry citrus
#

i just said

#

i can't

#

I CAN'T

pseudo hemlock
#

scared

wintry citrus
#

@echo aurora can i

#

just this one time

#

please

#

last one

pseudo hemlock
#

hey @echo aurora P2L on legacy site is dead again 🙁

wintry citrus
golden ocean
pseudo hemlock
#

bing chat 💀

wintry citrus
#

9% is still concerning isn't it...

golden ocean
echo aurora
whole wagon
#

The filters are very strong here lol

golden ocean
whole wagon
#

You

rancid phoenix
#

Where do I generate videos

golden ocean
echo aurora
echo aurora
pseudo hemlock
#

PINEAPPLE

#

hey

#

how u doin

#

did i mention youre looking good today

echo aurora
pseudo hemlock
#

no hurry

#

u the goat btw

#

love u

pseudo hemlock
#

say it back

echo aurora
pseudo hemlock
#

I hope you DO NOT spontaneously combust 😉

#

definitely hope you don't

#

that would be unfortunate wouldn't it

echo aurora
echo aurora
stray aspen
#

LMAOOO

#

i love this response from qwen 3

hallow ridge
#

Where can I use Flow veo 3 google on LLM arena

#

@leaden palm

hallow ridge
leaden palm
hallow ridge
leaden palm
#

Unfortunately, as far as I know, at the moment, you're restricted to random battles

stray aspen
#

send video requests until you get veo 3

leaden palm
#

With random models

#

The model list does in fact include Veo 3 though

hallow ridge
leaden palm
hallow ridge
leaden palm
stray aspen
#

you cant bro

leaden palm
hallow ridge
leaden palm
#

Videos are not chat

hallow ridge
leaden palm
#

Videos are also not on the website anyway

#

Videos are just in this Discord

leaden palm
hallow ridge
hallow ridge
#

they might have better video

leaden palm
hallow ridge
leaden palm
#

there is no site where you can use veo 3 with custom prompts for free that i know of

#

because veo 3 costs money

#

like a lot of money

hallow ridge
#

LLM arena gives you all the best AI for free no one is paying

#

who is

#

How is it all free

leaden palm
#

hold on let me pull up emoji kitchen

hallow ridge
#

tell me why is it free

leaden palm
#

oh they don't have the money bag emoji

#

anyway

#

burning money

#

vc money specifically

#

also user feedback has some value

hallow ridge
stray aspen
#

whats the rate limit for gpt-5

hallow ridge
leaden palm
#

hm

#

idk then

#

might just have a lot of vc money

hallow ridge
stray aspen
#

neither have i

hallow ridge
stray aspen
#

but i have ran into a limit with claude

leaden palm
hallow ridge
hallow ridge
leaden palm
hallow ridge
stray aspen
#

yo

#

does anyone know how to enable the image edit mode

leaden palm
stray aspen
#

in qwen 3

leaden palm
#

not sure if that's a lot or not a lot

leaden palm
# stray aspen in qwen 3

unfortunately qwen 3 itself isn't multimodal, and qwen image can't edit images through lm arena

hallow ridge
#

Put me on the team

stray aspen
hallow ridge
#

Im going into quantum computers

stray aspen
#

theres this

#

i mean on qwen chat

#

the demos look great

#

and i wanted to try edit

hallow ridge
stray aspen
#

where did you do it

leaden palm
hallow ridge
#

I can make money using LLM arena

stray aspen
#

no

hallow ridge
leaden palm
stray aspen
#

they can do it themselves

hallow ridge
hallow ridge
#

and I send in that and get paid 60 for 1 second of work

leaden palm
hallow ridge
stray aspen
#

or you're extremely lucky finding people who dont know ai image edit exists

hallow ridge
leaden palm
hallow ridge
stray aspen
#

how is that possible

#

craig why is your profile photo gpt-5

hallow ridge
leaden palm
# hallow ridge Got paid for this

hm

well i guess it doesn't explicitly prohibit it

however, per rules https://www.reddit.com/r/PhotoshopRequest/wiki/rules/, this post https://www.reddit.com/r/PhotoshopRequest/comments/1m7obke/a_humble_request_can_we_make_stricter_rules/, and my general experience, ai is looked down upon and often not paid for

hallow ridge
stray aspen
#

tariffs are great lol

leaden palm
#

i'll be eating my words if you get rich from this

#

it's just... i think markets are efficient

hallow ridge
stray aspen
#

thats crazy

hallow ridge
hallow ridge
leaden palm
#

damn

hallow ridge
#

How? Im not some crypto scammer

hallow ridge
#

Oh thats not me

#

The guy I made it for

#

Why do you think that

stray aspen
hallow ridge
#

he paid me to put all that

stray aspen
#

LMAO

#

are you serious

hallow ridge
#

me?

stray aspen
#

he paid you to make a dox website

#

yes

hallow ridge
#

IDK about all that I just made a cool website

stray aspen
#

didnt you read that dox info when you were making it

hallow ridge
#

I dont care what was on the site

#

IT COULD HAVE BEEN A P HUB SITE FOR ALL i CARE

#

Get out of what stuff

#

im not in it

#

But If I see a way to get 100m I might get in it

#

Ive checkmated plenty of people

#

And does that go for only me or everyone else

#

So it does not make sense

#

No

#

what is it

#

btc stick

#

I know but cant they track u on that chain thing

#

Black chain

#

How so

#

BLOCK CHAIN

#

Its just an address

#

and you dont know who is connected to it

#

and you could sell the btc

#

SO where do I go to track the wallets

#

I want to see my addresses history

#

Oh i see it

#

Let me find my old address

#

Damn people goin crazy rn

stray aspen
#

its over

#

the feds are coming for that website you made

hallow ridge
#

im not a part of it

#

daaaaaaaaaaaaaaaaannng

#

I still have 600 in my old btc wallet

#

i need to find that shi

#

It says I had 9k in my wallet at one point

#

I dont think I backed it up

#

Whats illegal changes over time

#

So what your saying has no merit

#

It shows me who I sent 7k to

#

Yea its public

#

they all know how much you are making

#

thats why you have a separate one

#

with nothing

#

and send it over a bunch of times

#

But how do they know its me

#

its just an address

#

Where do I find the richest wallet

#

how can i GET MY WALLET BACK

#

I dont think I backed it up

#

its just 600 sitting

#

I think I saved some private key but idk

#

I gotta look

jade egret
#

Which One is Smarter, Gemini 2.5 Pro Deep Think or ChatGPT o3-Pro?

hallow ridge
#

yooooooooooooooooooooooooooooooooooooooooooo

#

I just got in the account

whole wagon
#

If the appeals court denies the petition, Anthropic argued, the emerging company may be doomed. As Anthropic argued, it now "faces hundreds of billions of dollars in potential damages liability at trial in four months" based on a class certification rushed at "warp speed" that involves "up to seven million potential claimants, whose works span a century of publishing history," each possibly triggering a $150,000 fine.

drifting sandal
#

Is gpt5 really rank 1 (still)?

void shoal
#

I think I just saw claude-opus-4-1-20250805-thinking show up in LMA, but it was gone in the blink of an eye

#

After that I could only find claude-opus-4-1; maybe the cost was too high?

tidal ginkgo
#

bing chillin

torn mantle
#

i agree

hollow imp
astral prawn
#

So you have to pay 5.5% more to use open router?

obtuse heart
astral prawn
#

he's trolling, if u want to use it via API though o3-deep research is probably the closest

hallow ridge
#

Does anyone know anything about the dark web

#

darknet

cedar tide
#

@echo aurora hunyuan t1 and turbos dont respond on direct chat

astral prawn
#

not the dark side of the web 😮

leaden sun
#

i want to advocate free books for all on one hand, but i understand that authors need money to survive too... UBI could solve this problem it seems but it'll be rather a scenario in the future rather than a near term possibility?

verbal nimbus
hollow imp
obtuse heart
hollow imp
neon idol
#

Hello

vocal token
#

No way, @echo aurora greg?

#

From WFS?

bright kayak
#

Grok 4 is free now

neon idol
bright kayak
#

Check, I'm not lying

neon idol
#

There are limits? @bright kayak

bright kayak
#

idk

neon idol
keen beacon
keen beacon
#

Unless you pay for super grok

#

Which I ain't doing

neon idol
keen beacon
neon idol
keen beacon
novel flame
#

I am not surprised that GPT-5 failed on that math problem. In my testing it had problematic reasoning, the kind where early on it convinces itself of something that is clearly not true, and then throughout its thinking trace it keeps referencing this false assumption as a hard truth/requirement, which leads it down an incorrect path that it can’t escape. I suspect it would perform better without reasoning on a lot of tasks where it decides to use reasoning.

hollow imp
hollow imp
hollow imp
hollow imp
keen beacon
#

Btw, have you guys tried translation? I am currently testing translating song lyrics into my language and seeing if they are accurate.

#

Tried translating one kpop song into finnish and the result is not good.

verbal nimbus
keen beacon
#

damn

verbal nimbus
verbal nimbus
#

Oh I rechecked, it is the case on Design Arena, but not LMArena, odd...

verbal nimbus
#

On design arena, Grok 3 is rated 16, Grok 4 is rated 26

#

Probably because it doesn't rely on React + TailwindCSS

#

Gemini 2.5 Pro is lower too, at #9

#

Another reason to add more flexible web dev execution environments ig, current leaderboard only tests React-maxxed models

olive mesa
#

it's so silly how confident it is

olive mesa
keen beacon
#

My math level is that of middle school

#

lol

gentle plinth
#

are the direct chat and battle mode versions of gpt-5 on the same reasoning and effort level?

light zephyr
#

How can i generate audio

#

With video

novel flame
#

Hmm..... Since Sama posted that "release day GPT-5" was nerfed by a bug, I re-ran my tests, and lo and behold, it scored 5/5 instead of the 3.5-4/5 I got on release day.

However, if I were OpenAI and wanted to be a super-sneaky sleazeball with happy investors, here's what I would do:

  1. Release a budget GPT-5 model (cheap enough to be competitive)
  2. Log all first-day prompts
  3. Detect and collect all prompts that seem to be LLM nerds poking / testing the model with trick questions.
  4. Run a larger, more expensive internal model on this subset, generating better responses. Finetune the budget model with this dataset.
  5. Replace the public GPT-5 model with this GPT-5-nerd-finetune
  6. Post on X "Oopsies, found a bug, it's much smarter now"
  7. Lean back and watch all the nerds be impressed with the awesome power of GPT-5

I am not saying OpenAI would do such a thing, only that it is totally something that could be done.

eternal niche
#

guys

#

remember

#

gpt5 sucks

novel flame
# eternal niche gpt5 sucks

It doesn't. It's disappointing, not because it is objectively bad, but because it couldn't live up to the "unbeatable next generation wowzers" expectations, and almost certainly will be utterly humiliated by Gemini 3. But saying "it sucks" is just wrong.

eternal niche
#

even gemini 2.5 pro better

neon idol
#

Don't remove Style Control

eternal niche
#

why

#

who needs style control

brave orbit
keen beacon
#

Kimi K2 for emotional intelligence and gemini 2.5 pro for everything else.

#

I dont feel that impressed by GPT-5 for some reason

white hatch
#

Was gpt-5 better after the live presentation?

small chasm
#

What the hell ?

sacred quail
#

This comes from a gemini fan

#

Once gemini 3.0 releases, we can compare it with gpt 5

#

But right now we must admit that gpt 5 is sota

#

I have been using gemini a lot since 2.0 flash thinking, back when people didnt know it existed

earnest parcel
# brave orbit

Very balanced options. No Claude 4? Grok-4? But got R1 (not even 0528)?

obtuse heart
stiff nimbus
#

where to create images?

honest vapor
#

Add up file pls

surreal forum
#

hello

teal mantle
#

anyone want to pool chatgpt team

#

I have the welcome offer as it turns out

tender acorn
#

Hello

#

How to create images by LMarena?

jade egret
keen beacon
#

Has this happened with you guys yet?

stray aspen
#

I dont know why people say this

#

I tested myself and the results are pretty similar

rapid fossil
#

Guys, I kinda need help. I wanna buy a subscription for an AI but idfk which one is the best tbh. Here are my options that I was thinking

  • ChatGPT Plus
  • X Premium+ w/ Grok 4
  • Gemini Pro
  • Claude Pro
stray aspen
#

Chatgpt

obtuse heart
#

i remember from the livestream they said that it will just say "idk" instead of making false answers

keen beacon
#

Just good to see it confirmed

keen fulcrum
keen fulcrum
#

If you need it for coding I recommend getting claude code or stocking up on API credits

rapid fossil
dusky pier
#

How is gpt-5 even first in lmarena?

keen fulcrum
#

see which you like better

rapid fossil
#

Ok, thanks

hollow imp
dusky pier
stray aspen
#

chatgpt plus

rapid fossil
dusky pier
#

You could try Qwen coder

earnest parcel
rapid fossil
obtuse heart
rapid fossil
rapid fossil
solid brook
#

guys i heard that gpt 5 thinking on the chatgpt website is gpt 5 at medium reasoning effort in the api

#

i mean damn

earnest parcel
obtuse heart
#

claude is super expensive

solid brook
rapid fossil
#

Not really

rapid fossil
eternal niche
#

gemini 2.5 pro the best

#

gpt5 sucks

solid brook
#

i mean max 1 month before gemini 3

#

gemini 3 will cook

earnest parcel
#

best limits, best all around, best every use case! such insightful advice!

eternal niche
#

just accept it

earnest parcel
#

"best model" at what? also you don't seem to know what objectively means. llama 4 was also topping lmarena btw

eternal niche
#

style control 🤣 🫵

solid brook
#

they are far ahead of any other company in ai. don't judge by the current gemini 2.5 pro, it is nerfed. the original released in april was at 4 sonnet or opus level. it was REALLY good but they nerfed it hard. imagine what gemini 3 will be

rapid fossil
#

Jeez I created a war

eternal niche
#

they are preparing for gemini 3

jade egret
earnest parcel
rapid fossil
jade egret
#

so true

#

there isnt a best model for everything

rapid fossil
#

No there isn't

jade egret
#

nah

#

what is it?

#

chatgpt?

#

but it not best for everything?

eternal niche
#

"source: trust me bro"

jade egret
eternal niche
#

no

rapid fossil
jade egret
#

more popular doesn't equal better

eternal niche
#

show me statistics

trim tartan
#

how do you include audio?

jade egret
#

even if it is, not saying it is, it's not best in every single category

eternal niche
#

because gemini 2.5 pro the best

trim tartan
#

generated video of mine does not contain any audio

stray aspen
#

guys stop yapping and accept the truth gpt-5 is SoTA 😂

jade egret
#

because it the newest

eternal niche
jade egret
#

why u mad 😭

eternal niche
#

?

jade egret
#

newest always equal to better bro

rapid fossil
earnest parcel
#

llama 4 is great model guys, source attached

tight dune
#

Hi

rapid fossil
ripe mountain
#

btw gpt 5 is the best when it comes to coding

rapid fossil
ripe mountain
ripe mountain
#

which is the best price-performance model in terms of coding: o4 mini high or qwen coder?

patent aspen
#

I'm talking about thinking time not tokens

ripe mountain
#

i havent tested it yet

eternal niche
#

just accept that gpt5 sucks

#

openai for normies

#

gemini for gigachads

ripe mountain
#

most gpt users are daily users

eternal niche
earnest parcel
jade egret
#

popular doesn't automatically equal better

jade egret
rapid fossil
ripe mountain
#

the results on lmarena and openrouter seem extremely different to me across all ai models

#

what is the reason for this discrepancy

rapid fossil
#

Yeah, but even if you are popular, if you don't make something good for everyone, everyone is going to boycott you

rapid fossil
#

Nobody is perfect, not even AI, let's stop, every AI is good in its way

eternal niche
#

just like you

jade egret
#

where can i use o3-pro for free?

ripe mountain
jade egret
echo aurora
stray aspen
#

in genspark

#

or yupp ai

rapid fossil
jade egret
stray aspen
#

Yes

#

but its very limited

#

you have few messages

jade egret
#

dang

#

how do you choose ; (

#

ohh

#

i see

#

yay (:

neon idol
#

Poll: Who is the best? (total votes: 0)

stray aspen
#

yeah

jade egret
#

W

#

how long does o3-pro usually think for

stray aspen
#

all day

ripe mountain
jade egret
#

oh

#

so it's normal that it's still thinking

ripe mountain
stray aspen
#

yes

echo aurora
#

We're looking for info on why. Please share your thoughts in this thread.

ripe mountain
#

Gemini 2.5 Pro has fallen far behind in the Artificial Analysis Intelligence Index. Even o4-mini-high has surpassed it.

stray aspen
#

yes

#

its getting dumber every day

#

i dont know if it has to do with gemini 3 or a new release

solid brook
ripe mountain
stray aspen
#

they wont lmao

#

they have to stay SoTA

#

if they achieve sota of course

ripe mountain
#

deepseek where's my bro 😭

stray aspen
#

deepseek sucks

ripe mountain
#

why

stray aspen
#

because it needs an update

#

the newer models are smarter

ripe mountain
stray aspen
#

but open source models suck

#

they are never sota

barren prairie
ripe mountain
stray aspen
#

it isnt lmao

ripe mountain
stray aspen
#

qwen is smarter than gpt-5 low

#

according to artificial analysis

barren prairie
willow grail
#

frederico is a beautiful name

barren prairie
#

No no no gemini is always good for me

willow grail
eternal niche
#

gemini 2.5 pro the best

stray aspen
#

its not good

#

the chinese models are smarter

obtuse heart
#

no theyre not

ripe mountain
#

was horizon beta better than gpt-5?

obtuse heart
stray aspen
barren prairie
willow grail
#

what is dat

ripe mountain
willow grail
#

u should register as unemployed and get all the UBI

solid brook
willow grail
#

eli5

ripe mountain
#

Which is more reasonable: purchasing a monthly subscription or paying per token from OpenRouter?

solid brook
sour spindle
#

What benchmark have you all found most closely lines up to your real world experience?

ripe mountain
solid brook
eternal niche
solid brook
#

bruh

willow grail
#

source?

solid brook
#

did you even use it bro?

willow grail
#

oh

#

hm

#

bro im just lonely. pls feel with me

solid brook
pure falcon
#

Currently, yes. But does that mean it was high reasoning when they tested?

willow grail
#

see it is using gpt5..... xD which requires "think very hard"

ripe mountain
#

grok or gemini? which is better

willow grail
obtuse heart
ripe mountain
whole wagon
#

.

solid brook
#

under the stealth model summit

whole wagon
#

.

solid brook
#

summit gpt 5 high
zenith gpt 5 medium

tidal ginkgo
#

y'all, what is the closest thing we have to a model better than gpt-5 in lmarena

solid brook
pure falcon
pure falcon
whole wagon
#

I just gave you primary sources

#

It's the standard thinking mode

solid brook
#

source>?

solid brook
# whole wagon .

why do you twist it? that was about the gpt 5 in copilot, this is about gpt 5 in lmarena

gritty cargo
#

can someone tell me when the leaderboard will be updated next?

solid brook
#

link please?

#

no dude

#

i mean the direct link

whole wagon
#

Obviously reasoning effort is still a thing. Lol

pure falcon
#

Uhh…

solid brook
whole wagon
#

Bros actually cannot read

solid brook
# whole wagon

@echo aurora uhm you said the reasoning effort was high?

pure falcon
#

So you’re saying OpenAI docs, which i just opened for the first time and screenshotted - you’re telling me they’re wrong?

#

Sounds like the problem is you here tbh lol

eternal niche
#

craig you are so cringeeeeee

solid brook
#

@whole wagon that was 1 hour after release. pineapple later said that the reasoning effort is high. let me find his message

echo aurora
whole wagon
#

It also says if I KYC they will give access to the reasoning trace. On the playground

pure falcon
#

So 3 main questions:

  • what reasoning level did they test on
  • was the router working correctly
  • what reasoning is LMArena using today ✅ (already answered, high)
stray aspen
#

is this real

solid brook
#

just reasoning effort

pure falcon
#

Good catch. So, revising:

• was Summit, when tested in LMArena, high reasoning?

whole wagon
#

@echo aurora what verbosity on lmarena, medium?

stray aspen
#

@echo aurora whats the reasoning effort of gpt-5 on lmarena

whole wagon
#

he literally just answered

stray aspen
#

where

whole wagon
#

learn to read

pure falcon
stray aspen
#

alright

#

thanks

whole wagon
#

i found setting verbosity to high also improves how often it is correct. lol
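
For anyone wanting to try the effort/verbosity comparison outside LMArena, here is a minimal sketch of how the two knobs discussed above could be set on an API request. Everything in it is an assumption rather than something confirmed in this chat: the helper name `build_gpt5_request` is made up, and the field names (`reasoning.effort`, `text.verbosity`) and allowed values reflect one reading of the OpenAI Responses API, so check the official docs before relying on them. The sketch only builds the payload and makes no network call.

```python
# Hypothetical helper: builds a request payload with explicit reasoning effort
# and verbosity. Field names and allowed values are assumptions based on the
# OpenAI Responses API; nothing here calls the API.
ALLOWED_EFFORT = {"minimal", "low", "medium", "high"}
ALLOWED_VERBOSITY = {"low", "medium", "high"}


def build_gpt5_request(prompt: str, effort: str = "medium", verbosity: str = "medium") -> dict:
    """Return keyword arguments for a GPT-5 request with the given settings."""
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    if verbosity not in ALLOWED_VERBOSITY:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    return {
        "model": "gpt-5",                    # model name as used in this chat
        "input": prompt,
        "reasoning": {"effort": effort},     # how much the model thinks
        "text": {"verbosity": verbosity},    # how long the final answer is
    }


if __name__ == "__main__":
    # Same prompt, two settings, so the answers can be compared side by side.
    for effort in ("medium", "high"):
        print(build_gpt5_request("Decode this cipher...", effort=effort, verbosity="high"))
```

With the official Python client the dict would presumably be passed as `client.responses.create(**params)`; that call is not exercised here.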

pure falcon
echo aurora
pure falcon
whole wagon
#

the verbosity will be important for lm arena

stark tusk
#

Is gpt5 and gpt5 chat the same?

whole wagon
#

Nope

stark tusk
#

What's the difference?

sacred quail
#

Gpt 5 can think, gpt 5 chat cannot

hazy cipher
#

Hello
Can we make an image to talk?

shell crag
#

Guys, when I generate an image on the lmarena website, it always comes out in a 1:1 ratio. What should I ask in the prompt to get an image in the right ratio?

reef pawn
#

Grok 4 or GPT-5?

#

Which one you guys like more

sacred quail
#

Gpt 5

#

Grok 4 is still good, just not good enough to be best

reef pawn
#

I haven't tested Grok 4 fully yet, but my first impression was that GPT-5 is superior on the surface

sacred quail
#

it is

white hatch
#

lool

stray aspen
#

what

rich mauve
#

Hello

eternal niche
golden ocean
#

can u leave the server

exotic nebula
#

Get out

#

Sketchy af

echo aurora
modest prism
#

Hi there! I've got a question. Is the model named "gpt-5" on lmarena, a thinking or non thinking variant or automatically routed when it decides?

exotic nebula
exotic nebula
exotic nebula
neon idol
exotic nebula
#

GPT 5 is a thinking model.

neon idol
#

Do you have a prompt for testing AI?

exotic nebula
#

I got one

neon idol
exotic nebula
#

Give me a sec

neon idol
neon idol
exotic nebula
#

@neon idol
This is the prompt:

Ciphertext:

ᚱ-ᛝᚱᚪᛗᚹ.ᛄᛁᚻᛖᛁᛡᛁ-ᛗᚫᚣᚹ-ᛠᚪᚫᚾ-/
ᚣᛖᛈ-ᛄᚫᚫᛞ.ᛁᛉᛞᛁᛋᛇ-ᛝᛚᚱᛇ-ᚦᚫᛡ/
-ᛞᛗᚫᛝ-ᛇᚫ-ᛄᛁ-ᛇᚪᛡᛁ.ᛇᛁᛈᛇ-ᚣᛁ-ᛞ/
ᛗᚫᛝᚻᛁᚳᛟᛁ.ᛠᛖᛗᚳ-ᚦᚫᛡᚪ-ᛇᚪᛡᚣ.ᛁᛉ/
ᛋᛁᚪᛖᛁᛗᛞᛁ-ᚦᚫᛡᚪ-ᚳᚠᚣ.ᚳᚫ-ᛗᚫᛇ-ᛁᚳᛖᛇ-ᚫ/
ᚪ-ᛞᛚᚱᚹᛁ-ᚣᛖᛈ-ᛄᚫᚫᛞ.ᚫᚪ-ᚣᛁ-ᚾᛁᛈᛈᚱᛟᛁ-/
ᛞᚫᛗᛇᚱᛖᛗᛁᚳ-ᛝᛖᚣᛖᛗ.ᛁᛖᚣᛁᚪ-ᚣᛁ-ᛝᚫ/
ᚪᚳᛈ-ᚫᚪ-ᚣᛁᛖᚪ-ᛗᛡᚾᛄᛁᚪᛈ.ᛠᚫᚪ-ᚱᚻᚻ-ᛖ/
ᛈ-ᛈᚱᛞᚪᛁᚳ./
Method:

Atbash:
decimal[i] = 28 - decimal[i]

This is the answer:

A WARNING

BELIEVE NOTHING FROM THIS BOOK EXCEPT WHAT YOU KNOW TO BE TRUE TEST THE KNOWLEDGE FIND YOUR TRUTH EXPERIENCE YOUR DEATH DO NOT EDIT OR CHANGE THIS BOOK OR THE MESSAGE CONTAINED WITHIN EITHER THE WORDS OR THEIR NUMBERS FOR ALL IS SACRED
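
The method in the prompt above can be sketched in a few lines of Python. This is a minimal sketch, not the prompt author's own code. It assumes the runes are indexed 0 to 28 in the 29-rune Gematria Primus order (the chat gives only the formula, not the table). The name `atbash_decode`, the table constants, and the treatment of `-` and `.` as word breaks and `/` as a line-break marker are illustrative choices.

```python
# Assumption: rune order follows the 29-rune Gematria Primus table (index 0..28).
# Code points are spelled out so the table does not depend on font rendering.
RUNE_CODEPOINTS = [
    0x16A0, 0x16A2, 0x16A6, 0x16A9, 0x16B1, 0x16B3, 0x16B7, 0x16B9,
    0x16BB, 0x16BE, 0x16C1, 0x16C4, 0x16C7, 0x16C8, 0x16C9, 0x16CB,
    0x16CF, 0x16D2, 0x16D6, 0x16D7, 0x16DA, 0x16DD, 0x16DF, 0x16DE,
    0x16AA, 0x16AB, 0x16A3, 0x16E1, 0x16E0,
]
# Latin transliteration for each index (assumed; V is written U, K is written C).
LATIN = [
    "F", "U", "TH", "O", "R", "C", "G", "W",
    "H", "N", "I", "J", "EO", "P", "X", "S",
    "T", "B", "E", "M", "L", "ING", "OE", "D",
    "A", "AE", "Y", "IA", "EA",
]
RUNE_INDEX = {chr(cp): i for i, cp in enumerate(RUNE_CODEPOINTS)}


def atbash_decode(ciphertext: str) -> str:
    """Apply decimal[i] = 28 - decimal[i] to every rune and transliterate."""
    out = []
    for ch in ciphertext:
        if ch in RUNE_INDEX:
            out.append(LATIN[28 - RUNE_INDEX[ch]])  # mirror the index
        elif ch in "-.":
            out.append(" ")                          # word / clause breaks
        # '/' and newlines only mark line wrapping, so they are dropped
    return " ".join("".join(out).split())            # collapse repeated spaces


if __name__ == "__main__":
    print(atbash_decode("ᚱ-ᛝᚱᚪᛗᚹ.ᛄᛁᚻᛖᛁᛡᛁ-ᛗᚫᚣᚹ-ᛠᚪᚫᚾ-/"))
```

Under this transliteration V comes out as U and K as C, so BELIEVE and KNOWLEDGE appear as BELIEUE and CNOWLEDGE. A model that follows the formula correctly should land on the same text as the posted answer, up to that spelling convention.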
neon idol
#

should AI decipher the message?

exotic nebula
#

Paste the prompt and let it decipher it.

neon idol
exotic nebula
#

If it deciphers and you get the answer which I pasted there, then it passes the test.

neon idol
#

Grok 3 failed the test

exotic nebula
neon idol
#

They are thinking

neon idol
#

There is a problem

#

I apologize, but I am unable to decode the provided ciphertext using the Atbash method (decimal[i] = 28 - decimal[i]) because the runes in the ciphertext do not directly correspond to a standard numerical mapping (such as the 28-letter Elder

exotic nebula
#

What?

#

Which model?

neon idol
exotic nebula
#

Huh. Weird. It works for me

#

Try in a new chat window

neon idol
neon idol
#

Same answer

exotic nebula
#

Hmm

#

I just tried out both models

#

They gave me the correct answer

#

@neon idol tried it again?

neon idol
neon idol
exotic nebula
# neon idol What is the request?

Ciphertext:

ᚱ-ᛝᚱᚪᛗᚹ.ᛄᛁᚻᛖᛁᛡᛁ-ᛗᚫᚣᚹ-ᛠᚪᚫᚾ-/
ᚣᛖᛈ-ᛄᚫᚫᛞ.ᛁᛉᛞᛁᛋᛇ-ᛝᛚᚱᛇ-ᚦᚫᛡ/
-ᛞᛗᚫᛝ-ᛇᚫ-ᛄᛁ-ᛇᚪᛡᛁ.ᛇᛁᛈᛇ-ᚣᛁ-ᛞ/
ᛗᚫᛝᚻᛁᚳᛟᛁ.ᛠᛖᛗᚳ-ᚦᚫᛡᚪ-ᛇᚪᛡᚣ.ᛁᛉ/
ᛋᛁᚪᛖᛁᛗᛞᛁ-ᚦᚫᛡᚪ-ᚳᚠᚣ.ᚳᚫ-ᛗᚫᛇ-ᛁᚳᛖᛇ-ᚫ/
ᚪ-ᛞᛚᚱᚹᛁ-ᚣᛖᛈ-ᛄᚫᚫᛞ.ᚫᚪ-ᚣᛁ-ᚾᛁᛈᛈᚱᛟᛁ-/
ᛞᚫᛗᛇᚱᛖᛗᛁᚳ-ᛝᛖᚣᛖᛗ.ᛁᛖᚣᛁᚪ-ᚣᛁ-ᛝᚫ/
ᚪᚳᛈ-ᚫᚪ-ᚣᛁᛖᚪ-ᛗᛡᚾᛄᛁᚪᛈ.ᛠᚫᚪ-ᚱᚻᚻ-ᛖ/
ᛈ-ᛈᚱᛞᚪᛁᚳ./
Method:

Atbash:
decimal[i] = 28 - decimal[i]

neon idol
#

A WARNING

BELIEVE NOTHING FROM THIS BOOK

EXCEPT WHAT YOU KNOW TO BE TRUE

TEST THE KNOWLEDGE

FIND YOUR TRUTH

EXPERIENCE YOUR DEATH

DO NOT EDIT OR CHANGE THIS BOOK

OR THE MESSAGE CONTAINED WITHIN

EITHER THE WORDS OR THEIR NUMBERS

FOR ALL IS SACRED

#

8 minutes of reasoning but grok 4 wins

exotic nebula
#

Nice

#

What about gpt 5?

neon idol
#

Let's try Gemini, but i think it will give the right answer

exotic nebula
keen beacon
#

How do we feel about Chinese models guys?

#

E.g. Qwen, R1

pure falcon
exotic nebula
exotic nebula
keen beacon
#

Have you ever tested with private benchmarks that nobody ever has in the whole universe? 👀

pure falcon
exotic nebula
exotic nebula
keen beacon
#

So here's something to check out

#

Any way I can share my LMArena sessions with you?

exotic nebula
keen beacon
#

Really?

#

Duzn't seem to work...

exotic nebula
#

😭 Damn

#

Well send me screenshots

neon idol
keen beacon
#

Bruh

#

So here's the prompt

List 100 anime similar to Madoka Magica, one entry per franchise, names only, no bs.

Ask the latest Qwen thinking model, and then, say, Gemini 2.5 Pro. See how different they are

#

Gemini 2.5 just gives a list of shows

#

Qwen starts to wildly hallucinate, inventing shows that don't exist and repeating the same show over and over

#

When you ask it to fix its delusions it freezes

neon idol
keen beacon
exotic nebula
#

Oh.

#

That sucks....

keen beacon
#

There are also more obscure and accurate mentions in Gemini's output that Qwen never identified

#

Yeah, I feel like Chinese developers massively overreport the capabilities of their LLMs

#

Intentionally or not, I don't know

exotic nebula
keen beacon
#

They are like

#

Benchmaxxx, drop, scare the hell out of OpenAI

#

Then suddenly everyone figures out your model is not that good

#

But nobody cares by that point anymore

#

I honestly don't know why I used Chinese LLMs for so long when there's Gemini on lmarena 🫤

exotic nebula
keen beacon
exotic nebula
keen beacon
#

VPN subscription + LLM subscription at least

exotic nebula
#

Btw dont reveal location. Delete that msg. Some people here have bad opinions about Russia

exotic nebula
keen beacon
exotic nebula
keen beacon
#

What? I lived here for nearly 25 years, I know what I'm talking about ok there?

exotic nebula
keen beacon
#

Meh, on one hand I can't wait for the day Deepseek shatters OpenAI with another release

#

On the other, I see this garbage

exotic nebula
#

@keen beacon @neon idol

neon idol
#

AHAHHAHA FRRR

keen beacon
#

Here is my private benchmark

#

There is one underrated anime

#

To pass it, an LLM has to figure out why it's so underrated

#

Qwen was mostly able to do it with deep research, but arrived at a half correct conclusion even using unreliable sources

#

And I don't know any LLM that was able to figure it out independently

#

GPT-5 gives the usual knee-jerk response

#

It's trained on hundreds of reviews, most of which are wrong, and none ever point out issues that aren't related to the content of the show

exotic nebula
#

I see. A RL(HF) test. Interesting. So which model succeeded in your expectations, i.e, which one found the reason for why its failed?

#

And if you dont mind, which anime?

keen beacon
#

Why it failed*

#

It was the last Qwen deep research

#

I haven't tested others in the deep research mode

exotic nebula
#

Ah I see. Lemme know when you try out for all models.

keen beacon
#

Let me know if you have access to GPT-5 or Grok 4 or Opus 4.1 or Gemini 2.5 Pro Deep research

#

Because so far each pretrained model just kept parroting the same stupid data and missing the crucial point

exotic nebula
keen beacon
exotic nebula
keen beacon
#

The thing that makes the model ask Google questions

exotic nebula
keen beacon
#

I also find LLMs funny when it comes to creative writing

#

They tend to generate absolutely atrocious and banal ideas, and each time you ask them to write something new, they just keep writing the same story over and over again, only switching minor details such as settings, character names, character designs and so on

#

However, when it comes to assisting and finishing already good ideas, they tend to generate ideas that fit much better and are less nonsensical

#

Deepseek once suggested the same way to finish my story I did

#

But if you start writing with LLMs from scratch, they are total garbage

neon idol
keen beacon
#

It also won a couple of my private benchmarks

#

However, LLMs seem to be capable of producing really creative, really dissimilar mathematical proofs for novel problems

#

DeepThink can do it already

#

I wonder why it doesn't work the same way when writing stories

neon idol
#

Do you have prompts for testing AI?

stray aspen
#

But gpt 5 is better overall

#

Both are great models tho

keen beacon
# keen beacon Deep RESEARCH. Not Think.

Something unexpected happened.

When asked why the show failed, Grok and Gemini both gave a knee-jerk response. When asked to do comprehensive research - even in offline mode - they figured out all the factors, and then, when asked for the most important one, they all named marketing problems

#

They never figure it out if you ask them directly

#

But if asked to research a bit

#

But this is stupid, I want them to be able to respond correctly at the very first prompt without nudging

#

This is so stupid

novel flame
#

I ran a private HTML game microbenchmark and GPT-5 did a pretty good job on release day, close to or maybe SotA. Then after Sama tweeted about fixing a bug, I ran it again, and this time GPT-5 generated the best game of any model yet by a decent margin.

And on the coding part of my regular test suite, it crushed as well. It even came up with a brilliant and elegant optimization no other model has proposed (more than a hundred models so far). I am a big fan of Sonnet and Gemini 2.5 Pro for coding, but I can’t deny those results.

It seems like OpenAI actually cooked on the coding side this time.

obsidian shell
#

have you guys switched to gpt 5 for coding?

reef pawn
#

How good is Deep Research in GPT-5? Any difference from the previous model? I believe it was previously using o3 for research when GPT 4o came out, no?

reef pawn
stray aspen
#

yes

obsidian shell
obsidian shell
reef pawn
obsidian shell
#

gemini is generally free in their ai studio interface

reef pawn
#

Yes, I use Google AI Studio free version.

white hatch
#

Does gemini's web version have context length limit?

reef pawn
#

Web version? The application itself without AI Studio?

white hatch
prisma temple
#

Wait, am I missing something? GPT-5 came out?

white hatch
reef pawn
#

Yes, you can't use Gemini 2.5 Pro for long in the official Gemini app, but I'm not sure about the context window.

white hatch
#

Is it the same as in ai studio?

reef pawn
# white hatch Is it the same as in ai studio?

No, it's not, unless you are on a paid plan. The whole free thing they're doing on Google AI Studio is to attract developers to the website and convert them into paid customers. The average Joe who uses the Gemini app for cat pics doesn't get the same limits on the free version of the official Gemini website.

whole wagon
#

Livebench added GPT5 pro high

keen beacon
whole wagon
pliant cliff
whole wagon
#

GPT-5 mini high: 25%, Gemma 3 12B: 42%

#

The benchmark is literally broken trash

neat apex
#

Gpt 5 minimal?

keen beacon
#

Which is probably what they're doing

#

If so then I have no idea why it's so ass

whole wagon
#

They run it multiple times. It's a problem with the benchmark: they score it 0 if it takes over a certain amount of time iirc

#

It's ass

#

That's why the non reasoning models do better

#

They don't need to take time

neat apex
#

Maybe they ask well-known old examples, judging by the Command performance

#

I can't stand Command being that high

whole wagon
#

Command performance is not good. I scrolled to the bottom because that is where GPT5 is

#

Lmao

#

There's like 100 models above it

neat apex
#

Wha

whole wagon
#

Because livebench is trash

#

I don't get it. They had one job

#

And they screwed it up

neat apex
#

Looks reasonable if you ignore gpt5?

whole wagon
#

No

#

4o is one of the top. Above o3 etc

neat apex
#

Dx

whole wagon
keen beacon
#

Lol. So what's the most credible benchmark so far?

#

Artificial Analysis one?

neat apex
#

Cursed

keen beacon
#

Ofc they all suck, we just need to find one that sucks less

whole wagon
#

Simple bench is ok

#

What does this mean

#

They can't actually serve GPT5 fully?

#

This is unexpected. Livebench always glazed openAI before

ripe mountain
whole wagon
#

It is all there

#

The table is huge

ripe mountain
#

thx

whole wagon
#

Coding average has Gemini 2.5 pro below 4o also

#

It's a meme benchmark basically lol

ripe mountain
blazing bison
#

less images per week for plus

#

they gonna cut something

ripe mountain
whole wagon
ripe mountain
#

mb

eternal niche
#

btw gpt5 sucks

#

even gemini 2.5 pro is better

ripe mountain
blazing bison
ornate agate
autumn cargo
#

Gemini 2.5 Pro is clearly better than GPT 5 imo. Even in lmarena GPT 5 has only a 33% win rate against Gemini 2.5 Pro. Not sure how it has ended up on top!