#general
1 messages · Page 93 of 1
no im low
I'm doing sydney
im on low effort right now
I'd kill myself
those dudes are open source chinese LLMs
than be more embarrassed
THEY'RE BOTH CATS TOO
tahn
bearn English
speak English
me myself i would say that too
🗣️
no i don't
what a
im gonna burn u in hell
if u delete it
again
im gonna make u suffer
seeing ur parents die
infront of u
respectfully
.
respectfully
yeah forget that
don't need to remember
oh
I'd need that
tho it's not gonna be a butterfly
it's gonna be a whole ass blade
send me that ai
SEND THAT AI
NOW
NOW
SEND IT HERE BUDDY
SEND IT
awh shucks
@golden ocean does absolutely not know what that is
just use Gemini if u tell it to kill itself it will forgive u
if u tell it i killed ur parents
if u tell it i kissed ur ai gf
who is paul 😭
kiss him
same
i could rizz that ai
it's not even a real girl
100%
freaky chiggas
is that actually bing
there's no way
this is bing without system prompt
that's like chatgpt when u tell it a whole paragraph
which AI is this?
the underlying gpt-4 fine tune
no additional system instructions telling it to behave like this
NONE?
NONE
ss
and sent to the fbi
sent to everyone
sent to my server
sent to china
what
Bro has modded android discord to insert fake messages
what???
WHAT
FAKE
HOW
AWH I DIDN'T RECORD IT
NOW IT LOOKS FAKE
FUUUU
that's my wife
gah damn
she hot
I'd fall for her
lmao
I thought you said "fill"
NAHH
MAHHHHmah
NAH BRO
NAH
CHILL
CHILL
tell it I'm gonna have freaky with u
not segs
freaky
trust
oh
is this real
I WAS about to say that
He is guarding the bed
LMArena leaderboard doesn't report which model is the actual one being used. There's also GPT-5 Chat, which is different
They should specify it's the thinking variant. All other models have the thinking variant explicitly labeled.
okay nvm
give mee it
rn
HEY BUDDY
IM GONNA FIND UR HOUSE
im gonna cut off ur limbs
no
i don't wanna be obese
i don't wanna be in the country of an orange president
i got an image of him
that's him walking
trump
this is his emoji
🍊
the hair is perfect too
oohhhh
is that Sydney)
she's not that hot
tbh
..
Long context benchmark. New GPT-5 models are the first 3 (list is not ordered).
Kinda hard to tell, for example Gemini 2.5 Pro performs better at 192K but worse for shorter lengths.
And o3 is best except for 16K, 60K and 192K.
o/
SO I GOT WARNED
?warn @wintry citrus
god damn
yk if i didn't get warned
i would say bad things
yeah?
i would
scared
hey @echo aurora P2L on legacy site is dead again 🙁
i thought it's a shawarma at first
bing chat 💀
Negative
The filters are very strong here lol
what is is large, a model and loves language? @echo aurora
You
Where do I generate videos
im flattered
A good looking translation book?
More info here: #1397655624103493813
Hey sorry will look into and flag
say it back

I hope you DO NOT spontaneously combust 😉
definitely hope you don't
that would be unfortunate wouldn't it
Yeah I’m seeing the same
Big agree
Wsp pine
Try to not ping staff
Okay Ill try
Unfortunately, as far as I know, at the moment, you're restricted to random battles
for the veo 3?
send video requests until you get veo 3
They're just random battles
With random models
The model list does in fact include Veo 3 though
So videos work but if I turn battle mode on
I don't know what you mean by that
As far as I know:
There is no video generation on the website
There is no turning off or on battle mode
There is no selecting a model
There is #video-arena-1 and the others
There is Veo 3 there
There is the possibility of generating until you get Veo
you cant bro
Oh and read #1397655624103493813
What do you mean you can turn battle mode off and use direct chat
For videos
Videos are not chat
Is there any LLM arena alternative
for the purpose of...?
When will they be on the site
Of doing the same thing here but seing if anything is better
they might have better video
what is "the same thing"? video? chat?
they might have video on the site
there is no site where you can use veo 3 with custom prompts for free that i know of
because veo 3 costs money
like a lot of money
So how are we doing it free here
LLM arena gives you all the best AI for free no one is paying
who is
How is it all free
hold on let me pull up emoji kitchen
oh they don't have the money bag emoji
anyway
burning money
vc money specifically
also user feedback has some value
You can use gpt five forever on LLM arena buton the regular chat gpt they make you pay
there are some rate limits
whats the rate limit for gpt-5
Seeming;y no restrictions to me
Ive never ran into a limit
neither have i
What does that mean
but i have ran into a limit with claude
have you heard of venture capital?
Ive never ran into a limit ever
Yes I have
lm arena has raised $100m in vc funding
How can I get some of that
enough for 6 years of veo 3 video
in qwen 3
not sure if that's a lot or not a lot
unfortunately qwen 3 itself isn't multimodal, and qwen image can't edit images through lm arena
How can I get some of that 100m
Put me on the team
Im going into quantum computers
where did you do it
it's odd, the official hf space demo + qwen chat don't allow adding images as context either
Like people can pay me for this
I can make money using LLM arena
no
YES
the qwen blog links to https://modelscope.cn/aigc/imageGeneration?tab=advanced
they can do it themselves
They dont know
they ask can you put a crown on my head in photoshop and Ill pay you $60
and I send in that and get paid 60 for 1 second of work
are you actually finding these opportunities
lmao no
LAMO YES
or your extremely lucky finding people who dont know ai image edit exists
Get paid 5 for using AI to make this look clean
you'll get banned i think
From what
@leaden palm Got paid for this
hm
well i guess it doesn't explicitly prohibit it
however, per rules https://www.reddit.com/r/PhotoshopRequest/wiki/rules/, this post https://www.reddit.com/r/PhotoshopRequest/comments/1m7obke/a_humble_request_can_we_make_stricter_rules/, and my general experience, ai is looked down upon and often not paid for
The thing is that they dont know its AI the work that I see other people do on there is obviously AI
tariffs are great lol
well, cool if it works
i'll be eating my words if you get rich from this
it's just... i think markets are efficient
Im also building websites for people using LLM arena
thats crazy
Made this for 55@leaden palm@stray aspen
55 usd?
yup
damn
How im not some cryto scammer
It was all made on LLM arena
Oh thats not me
The guy I made it for
Why do you think that
what does that website do
Idk thats what he wanted I made it for him
he paid me to put all that
me?
IDK about all that I just made a cool website
wdym you didnt know
didnt you read that dox info when you were making it
This was his site before I just mae it look better thats it
I dont care what was on the site
IT COULD HAVE BEEN A P HUB SITE FOR ALL i CARE
Get out of what stuff
im not in it
But If I see a way to get 100m I might get in it
Ive checkmated plenty of people
And does that go for only me or everyone else
So it does not make sense
No
what is it
btc stick
I know but cant they track u on that chain thing
Black chain
How so
BLOCK CHAIN
Its just an address
and you dont know who is connected to it
and you could sell the btc
SO where do I go to track the wallets
I want to see my addresses history
Oh i seeit
Let me find my old address
Dman people goin crazy rn
Its just some code
im not apart of it
daaaaaaaaaaaaaaaaannng
I still have 600 in my old btc wallet
i need to find that shi
It says I had 9k in my wallet at one point
I dont think I backed it up
Whats illegal changes over time
So what your saying has no merit
It shows me who I sent 7k to
Yea its public
they all know how much you are makning
thats why you have a seprate one
with nothing
and send it over a a bunch of time
But how do they know its me
its just an address
Where do I find the richest wallet
how can i GET MY WALLET BACK
I dont think I backed it up
its just 600 sitting
I think I saved some private key but idk
Igotta look
Which One is Smarter, Gemini 2.5 Pro Deep Think or ChatGPT o3-Pro?
If the appeals court denies the petition, Anthropic argued, the emerging company may be doomed. As Anthropic argued, it now "faces hundreds of billions of dollars in potential damages liability at trial in four months" based on a class certification rushed at "warp speed" that involves "up to seven million potential claimants, whose works span a century of publishing history," each possibly triggering a $150,000 fine.
Is gpt5 really rank 1 (still)?
我刚刚好像看到claude-opus-4-1-20250805-thinking出现在LMA里了,但是眨眼间就没了
然后只能找到claude-opus-4-1了,应该是开销太大了?
bing chillin
wait what
anthropic cooked?
i agree
Chatgpt o3-Pro is free try it for yourself
So you have to pay 5.5% more to use open router?
where ?
he's trolling, if u want to use it via API though o3-deep research is probably the closest
@echo aurora hunyuan t1 and turbos dont respond on direct chat
not the dark side of the web 😮
i had thought about this lately and what a coincidence to read this here today...
i want to advocate free books for all on one hand, but i understand that authors need money to survive too... UBI could solve this problem it seems but it'll be rather a scenario in the future rather than a near term possibility?
interesting
AI is here to stay, and it's definitely helped me discover more books than replace them.
Im Not
Yupp ai
isnt yupp ai mostly fake
@stray aspen was defending it
Hello
Grok 4 is free now
What???
Check, I'm not lying
idk
In my opinion is a very good model but he beat gpt 5?
Yeah it is. I don't like using Grok because of personal reasons... But testing is always important
Yes.
Unless you pay for super grok
Which I ain't doing
How much of query?
Haven't tested it that much but generally they are quite generous
Honestly Grok for me is a really great model
Had to ask it, lol.
I am not surprised that GPT-5 failed on that math problem. In my testing it had problematic reasoning, the kind where early on it convinces itself of something that is clearly not true, and then throughout its thinking trace it keeps referencing this false assumption as a hard truth/requirement, which leads it down an incorrect path that it can’t escape. I suspect it would perform better without reasoning on a lot of tasks where it decides to use reasoning.
I found it more "rational" than gpt5
Ai expert sir I want to ask a question
Didn't it fail because of some tokens thing
Well the last time I remember I was using deepersearch mode and after 3 no 2 queries it was done
Btw, have you guys tried translation? I am currently testing translating song lyrics into my language and seeing if they are accurate.
Tried translating one kpop song into finnish and the result is not good.
Also Grok 4 is worse than Grok 3 and Mistral Medium(?) on Web Dev Arena.
What? Is that true?
damn
I swear requests are being routed to a mini model, the version on ChatGPT is just way too dumb to be anything other than mini.
Must be too much traffic
Odd since both have a web dev category
On design arena, Grok 3 is rated 16, Grok 4 is rated 26
Probably because it doesn't rely on React + TailwindCSS
Gemini 2.5 Pro is lower too, at #9
Another reason to add more flexible web dev execution environments ig, current leaderboard only tests React-maxxed models
it's so silly how confident it is
sometimes gpt-5 gets this correct but sometimes not
what math is this? What is the correct answer?
My math level is that of middle school
lol
are the direct chat and battle mode versions of gpt-5 on the same reasoning and effort level?
Should be?
Hmm..... Since Sama posted that "release day GPT-5" was nerfed by a bug, I re-ran my tests, and lo and behold, it scored 5/5 instead of the 3.5-4/5 I got on release day.
However, if I were OpenAI and wanted to be a super-sneaky sleazeball with happy investors, here's what I would do:
- Release a budget GPT-5 model (cheap enough to be competitive)
- Log all first-day prompts
- Detect and collect all prompts that seem to be LLM nerds poking / testing the model with trick questions.
- Run a larger, more expensive internal model on this subset, generating better responses. Finetune the budget model with this dataset.
- Replace the public GPT-5 model with this GPT-5-nerd-finetune
- Post on X "Oopsies, found a bug, it's much smarter now"
- Lean back and watch all the nerds be impressed with the awesome power of GPT-5
I am not saying OpenAI would do such a thing, only that it is totally something that could be done.
It doesn't. It's disappointing, not because it is objectively bad, but because it couldn't live up to the "unbeatable next generation wowzers" expectations, and almost certainly will be utterly humiliated by Gemini 3. But saying "it sucks" is just wrong.
Kimi K2 for emotional intelligence and gemini 2.5 pro for everything else.
I dont feel that impressed by GPT-5 for some reason
Was gpt-5 better after the live presentation?
Gemini pro 2.5 is most talented model but Gpt 5 is best model right now objectively
This comes from a gemini fan
If gemini 3.0 release, then we can compare with gpt 5
But right now we must admit that gpt 5 is sota right now
I was using gemini a lot since 2.0 flash think while peoples didnt know it exist
Very balanced options. No Claude 4? Grok-4? But got R1 (not even 0528)?
it says that sometimes, dont trust it
where to create images?
Add up file pls
hello
where
Has this happened with you guys yet?
No it's not
I dont why people say this
I tested myself and the results are pretty similar
Guys, I kinda need help. I wanna buy a subscription for an AI but idfk which one is the best tbh. Here are my options that I was thinking
- ChatGPT Plus
- X Premium+ w/ Grok 4
- Gemini Pro
- Claude Pro
Chatgpt
no but isnt that what theyve worked on
i remember from the livestream they said that it will just say "idk" instead of making false answers
It is. I like to see it.
Just good to see it confirmed
I recommend a subscription on https://grok.com
If you do send hundreds of messages within hours
Grok is a free AI assistant designed by xAI to maximize truth and objectivity. Grok offers real-time search, image generation, trend analysis, and more.
Ok, I will see
If you need it for coding I recommend getting claude code or stocking up on API credits
I wanna choose something that's best for everything overall
How is gpt-5 even first in lmarena?
Test around both grok 4 and gpt5
see which you like better
Ok, thanks
Yupp ai
Do you want to use Opus? 🤣
chatgpt plus
Idfk
Opus is too expensive
You could try Qwen coder
ChatGPT if you do generic stuff or use it recreational in low to mid volume
Gemini Pro is free on AI studio at very high volume so unless you are working with secret data, not needed to buy
Claude Pro if you are a coder
X Premium / Grok - no clue, never used it. If you tweet alot? /s
Ik qwen coder, but I don't want just for coding, idfk man, with these many AI's too hard to choose
gemini 2.5 pro
if its not just for coding you can use gemini
Chatgpt - I use it kinda everyday, so yeah, it could be ok
Gemini - I saw, but I thought that it added Gemini in docs, gmail and had 2 TB storage for Google.
Claude Pro - I am, but, from what I know, Claude is kinda just for coding and making a text better
Grok - I know that Grok 4 is kinda good, like I use it everyday, and it's good.
I will think
guys i heard that gpt 5 thinking in chatgpt website is gpt 5 medium reason effort in api
i mean damn
ya I obviously used the model itself (via api) i just meant I never used X subscriptions.
personally I find my claude sub the most valuable and I hated openai limits (to the point I even made a rant video about it).
No one can tell you what to use, just use each for a month and keep the one that fits best
claude is super expensive
but in lmarena we have gpt 5 high reason effort. so....
Not really
Yeah, ChatGPT limits and features are worse than it was before
best limits, best all around, best every use case! such insightful advice!
"best model" at what? also you don't seem to know what objectively means. llama 4 was also topping lmarena btw
style control 🤣 🫵
they are far ahead in ai than any other company. don't see the current gemini 2.5 pro it is nerfed. the original released in april was at 4 sonnet or opus level. it was REALLY good but they nerfed it hard. imagine what gemini 3 will be
Jeez I created a war
gemini (:
they are preparing for gemini 3
ya
either astroturfing or ignorant. there is no "best at everything" model rn
Tbh, true, but like, one of the closest to it, yk what i'm talking abt
yea
so true
there's isnt a best model for everything
No there isn't
"source: trust me bro"
lol
no
true
more popular doesn't equal to better
show me statistics
how do you include audio?
even if it is, not saying it is, it not best in every single catagory
because gemini 2.5 pro the best
generated video of mine does not contain any audio
guys stop yapping and accept the truth gpt-5 is SoTA 😂
ofc it is
because it the newest
well
why u mad 😭
?
newest always equal to better bro
llama 4 is great model guys, source attached
lol
Hi
Llama is good, but you can't compare it to like, chatgpt, grok, gemini, heck, even qwen
btw gpt 5 is the best when it comes to coding
Not really
without tools
um
which is the best price-performance model in terms of coding the o4 mini high or the qwen coder?
I'm talking about thinking time not tokens
i havent tested it yet
watching the war I created
most gpt users are daily users
👍
openai normies, gemini gigachads. where do I fall as claude user?
popular doesn't automatically equal to better
programmer
lol
True, like GLM 4.5 is kinda good even tho is like heard about nobody
the results on lmarena and openrouter seem extremely different to me across all ai models
what is the reason for this discrepancy
Yeah, but even if you are popular, because if you don't make something good for everyone, everyone is going to boycott you
Like in the GPT-5 release. ChatGPT made the Plus plan worse and everyone on twitter started hating on OpenAI
Nobody is perfect, not even AI, let's stop, every AI is good in its way
just like you
where can i use o3-pro for free?
nowhere
dang
I'm seeing the same, thank you for letting us know, I'll flag to the team.
Yeah, probably nowhere
you can choose the models?
Who is the best?
0
0
yeah
all day
137 s
yes
The Gemini 2.5 Pro has fallen far behind in the Artificial Analysis Intelligence Index. Even the O4 Mini High has surpassed it.
yes
its getting dumber every day
i dont know if it has to do with gemini 3 or a new release
THEY BETTER NOT NERF GEMINI 3 AFTER RELEASE
evil corp
deepseek where's my bro 😭
deepseek sucks
why
the second-best open-source model is the deepseek r1
No GLM4.5 is sota
except qwen
it isnt lmao
worse than gpt oss
For me no ... On html it is the smartest
frederico is a beautiful name
No no no gemini is always good for me
gemini 2.5 is useless for swe. and for doing online research.
gemini 2.5 pro the best
no theyre not
was the horizon betta better than gpt-5?
no
hell nah
It is good at explaining my lessons and doing podcasts and that s enough for me 😂
what is dat
i think it's been nerfed
ur wasintg money and time visiting a university
u should register as unemployed and get all the UBI
no lol it didn't even think i think
eli5
Which is more reasonable: purchasing a monthly subscription or paying per token from OpenRouter?
very skeptical on UBI
using lmarena lol
What benchmark have you all found most closely lines up to your real world experience?
swe and livebench
why
it's free and unlimited
definitely
bruh
source?
did you even use it bro?
tell at the end of the promt "think very hard" if the question is challenging it will think up to 3-5 minutes
Currently, yes. But does that mean it was high reasoning when they tested?
see it is using gpt5..... xD which requires "think very hard"
grok or gemini? which is better
gpt5 high
gemini
i know it's the best but i was curious
.
it was.
under the stealth model summit
.
summit gpt 5 high
zenith gpt 5 medium
y´all, what is the closest thing we have to a model better then gpt-5 in lmarena
i mean each model is good at something. claude code grok logic and reasoning
That’s incorrect. Zenith was a different version entirely. Not a different reasoning level
Now I’m doubting this lol
what was it
i mean i don't think openai would let lmarena benchmark gpt 5 medium
source>?
why do twist it? that was about the gpt 5 in copilot this is about gpt 5 in lmarena
can someone tell me when the leaderboard will be updated next timn?
What
Obviously reasoning effort is still a thing. Lol
Uhh…
omg
@echo aurora uhm you said the reason effort was high?>
So you’re saying OpenAI docs, which i just opened for the first time and screenshotted - you’re telling me they’re wrong?
Sounds like
the problem is you here tbh lol
@whole wagon that was 1 hour after release. pineapple later said that the reason effort is high. let me find his message
yes, reasoning effort set to high
It also says if I KYC they will give access to the reasoning trace. On the playground
So 3 main questions:
- what reasoning level did they test on
- was the router working correctly
- what reasoning is LMArena using today ✅ (already answered, high)
the exact gpt 5 model in api has no router
just reason effort
Good catch. So, revising:
• was Summit, when tested in LMArena, high reasoning?
@echo aurora what verbosity on lmarena, medium?
@echo aurorawhats the reason effort of gpt-5 on lmarena
he literally just answered
where
learn to read
He just confirmed high
.
i found setting verbosity to high also improves how often it is correct. lol
@echo aurora Sorry for tags! But if you could clear this up for us, I’m sure it would save you from a lot of future tagging
I'm not sure, will ask and keep you all updated if I can
Thx so much 🙂 appreciate it!
and this plz
the verbosity will be important for lm arena
Is gpt5 and gpt5 chat the same?
Nope
What's the difference?
Gpt 5 can think, gpt 5 chat is not
Hello
Can we make an image to talk?
Guys when i am generating image in lmarena website image always generate in 1:1 ratio what should i as in prompt to gave perfect ratio image
You can't do anything, it default with all models
Grok 4 or GPT-5?
Which one you guys like more
I haven't tested Grok 4 fully yet, but my first impressions was that GPT-5 is superior on surface
it is
what
Hello
yes
can u leave the server
Sounds like you’re looking for our bot, check out #1397655624103493813 for more information
Hi there! I've got a question. Is the model named "gpt-5" on lmarena, a thinking or non thinking variant or automatically routed when it decides?
It is a thinking model. Especially thinking-high.
GPT 5.
Grok 4 heavy is really good, but overall GPT 5 overrates it with its pricing.
Idk
GPT 5 Chat is a non thinking model.
GPT 5 is a thinking model.
Do you have a prompt for test ai?
Yes intelligente test
Give me a sec
Can you send pls?
Okkkk
@neon idol
This is the prompt:
Ciphertext:
ᚱ-ᛝᚱᚪᛗᚹ.ᛄᛁᚻᛖᛁᛡᛁ-ᛗᚫᚣᚹ-ᛠᚪᚫᚾ-/
ᚣᛖᛈ-ᛄᚫᚫᛞ.ᛁᛉᛞᛁᛋᛇ-ᛝᛚᚱᛇ-ᚦᚫᛡ/
-ᛞᛗᚫᛝ-ᛇᚫ-ᛄᛁ-ᛇᚪᛡᛁ.ᛇᛁᛈᛇ-ᚣᛁ-ᛞ/
ᛗᚫᛝᚻᛁᚳᛟᛁ.ᛠᛖᛗᚳ-ᚦᚫᛡᚪ-ᛇᚪᛡᚣ.ᛁᛉ/
ᛋᛁᚪᛖᛁᛗᛞᛁ-ᚦᚫᛡᚪ-ᚳᚠᚣ.ᚳᚫ-ᛗᚫᛇ-ᛁᚳᛖᛇ-ᚫ/
ᚪ-ᛞᛚᚱᚹᛁ-ᚣᛖᛈ-ᛄᚫᚫᛞ.ᚫᚪ-ᚣᛁ-ᚾᛁᛈᛈᚱᛟᛁ-/
ᛞᚫᛗᛇᚱᛖᛗᛁᚳ-ᛝᛖᚣᛖᛗ.ᛁᛖᚣᛁᚪ-ᚣᛁ-ᛝᚫ/
ᚪᚳᛈ-ᚫᚪ-ᚣᛁᛖᚪ-ᛗᛡᚾᛄᛁᚪᛈ.ᛠᚫᚪ-ᚱᚻᚻ-ᛖ/
ᛈ-ᛈᚱᛞᚪᛁᚳ./
Method:
Atbash:
decimal[i] = 28 - decimal[i]
This is the answer:
A WARNING
BELIEVE NOTHING FROM THIS BOOK EXCEPT WHAT YOU KNOW TO BE TRUE TEST THE KNOWLEDGE FIND YOUR TRUTH EXPERIENCE YOUR DEATH DO NOT EDIT OR CHANGE THIS BOOK OR THE MESSAGE CONTAINED WITHIN EITHER THE WORDS OR THEIR NUMBERS FOR ALL IS SACRED
should AI decipher the message?
Yes
Paste the prompt and let it decipher it.
Ok thx
If it deciphers and you get the answer which I pasted there, then it passes the test.
Grok 3 failed the test
Grok 3 was never good at reasoning problems. I would say its good for roleplay and long output.
Now I am trying grok4 and gpt 5
They are thinking
Uhm
There is a problem
I apologize, but I am unable to decode the provided ciphertext using the Atbash method (decimal[i] = 28 - decimal[i]) because the runes in the ciphertext do not directly correspond to a standard numerical mapping (such as the 28-letter Elder
Grok 4
Also gpt 5 answerd me like this
Hmm
I just tried out both models
They gave me the correct answer
@neon idol tried it again?
What is the request?
Yes they are thinking
Ciphertext:
ᚱ-ᛝᚱᚪᛗᚹ.ᛄᛁᚻᛖᛁᛡᛁ-ᛗᚫᚣᚹ-ᛠᚪᚫᚾ-/
ᚣᛖᛈ-ᛄᚫᚫᛞ.ᛁᛉᛞᛁᛋᛇ-ᛝᛚᚱᛇ-ᚦᚫᛡ/
-ᛞᛗᚫᛝ-ᛇᚫ-ᛄᛁ-ᛇᚪᛡᛁ.ᛇᛁᛈᛇ-ᚣᛁ-ᛞ/
ᛗᚫᛝᚻᛁᚳᛟᛁ.ᛠᛖᛗᚳ-ᚦᚫᛡᚪ-ᛇᚪᛡᚣ.ᛁᛉ/
ᛋᛁᚪᛖᛁᛗᛞᛁ-ᚦᚫᛡᚪ-ᚳᚠᚣ.ᚳᚫ-ᛗᚫᛇ-ᛁᚳᛖᛇ-ᚫ/
ᚪ-ᛞᛚᚱᚹᛁ-ᚣᛖᛈ-ᛄᚫᚫᛞ.ᚫᚪ-ᚣᛁ-ᚾᛁᛈᛈᚱᛟᛁ-/
ᛞᚫᛗᛇᚱᛖᛗᛁᚳ-ᛝᛖᚣᛖᛗ.ᛁᛖᚣᛁᚪ-ᚣᛁ-ᛝᚫ/
ᚪᚳᛈ-ᚫᚪ-ᚣᛁᛖᚪ-ᛗᛡᚾᛄᛁᚪᛈ.ᛠᚫᚪ-ᚱᚻᚻ-ᛖ/
ᛈ-ᛈᚱᛞᚪᛁᚳ./
Method:
Atbash:
decimal[i] = 28 - decimal[i]
A WARNING
BELIEVE NOTHING FROM THIS BOOK
EXCEPT WHAT YOU KNOW TO BE TRUE
TEST THE KNOWLEDGE
FIND YOUR TRUTH
EXPERIENCE YOUR DEATH
DO NOT EDIT OR CHANGE THIS BOOK
OR THE MESSAGE CONTAINED WITHIN
EITHER THE WORDS OR THEIR NUMBERS
FOR ALL IS SACRED
8 minute of reasoning but grok 4 win
Still here in thinking 🤣
Lets try Gemini but i think it will give a right answer
How abt now?
Got our answer @whole wagon @solid brook @deep adder
@gen_obligation @scaling01 @lmarena_ai @ml_angelopoulos @aryanvichare10 @cdngdev the model is tested with reasoning_effort high. we'll clarify it.
They're damn good honestly
Woah what's this about im curious
I have solid reasons to believe that they are not that great, unfortunately
Have you ever tested with private benchmarks that nobody ever has in the whole universe? 👀
LMArena uses “high” reasoning for both the tested and live version of GPT-5
They are not as state of the art as gemini or other ones on the market. But they are cheap, open source and much easy to set up. Waiting for Deepseek R2 to create absolute waves.
I see. So that's why you are clarifying about. Gotcha 👍
Sure
So here's something to check out
Any way I can share my LMArena sessions with you?
Yes you can, just copy paste the site link of the convo.
Not it doesn't work lol
Bruh
So here's the prompt
List 100 anime similar to Madoka Magica, one entry per franchise, names only, no bs.
Ask the last Qwen thinking, and then, say, Gemini 2.5 Pro. See how different they are
Gemini 2.5 just gives a list of shows
Qwen starts to wildly hallucinate, invent shows that don't exist, and repeating the same show over and over
When you ask it to fix its delusions it freezes
And this is what you wanted. Right?
I wanted Qwen to do it, and for a model just behind Gemini at Livebench it is honestly a bit disappointing
There are also more obscure and accurate mentions in Gemini's output that Qwen never identified
Yeah, I feel like Chinese developers massively overreport the capabilities of their LLMs
Intentionally or not, I don't know
100% true. Im on board with you on that.
They are like
Benchmaxxx, drop, scare the hell out of OpenAI
Then suddenly everyone figures out your model is not that good
But nobody cares by that point anymore
I honestly don't know why I used Chinese LLMs for so long when there's Gemini on lmarena 🫤
Bro why would you at first point 😭 there's like heck a lot of models other than the Chinese ones
I live in Russia and can't pay for chatgpt lol
Who tf pays for chatgpt 😂
VPN subscription + LLM subscription at least
Btw dont reveal location. Delete that msg. Some people here have bad opinions about Russia
💀
And they are completely correct in their opinions
Ayo what
What? I lived here for nearly 25 years, I know what I'm talking about ok there?
I see. Its rare to see someone truthful these days. Got shocked. Forgive me my impatientence.
Meh, at one point I can't wait for the day Deepseek shatters OpenAI with another release
At another, I see this garbage
Fr, we all are waiting for Deepseek R2
@keen beacon @neon idol
AHAHHAHA FRRR
Here is my private benchmark
There is one underrated anime
To pass it, an LLM has to figure out why it's so underrated
Qwen was mostly able to do it with deep research, but arrived at a half correct conclusion even using unreliable sources
And I don't know any LLM that was able to figure it out independently
GPT-5 gives an usual knee jerk response
It's trained on hundreds of reviews, most of which are wrong, and none ever point out issues that aren't related to the content of the show
I see. A RL(HF) test. Interesting. So which model succeeded in your expectations, i.e, which one found the reason for why its failed?
And if you dont mind, which anime?
Why it failed*
It was the last Qwen deep research
I haven't tested others in the deep research mode
Ah I see. Lemme know when you try out for all models.
Let me know if you have access for GPT-5 or Grok 4 or Opus 4.1 or Gemini 2.5 Pro Deep research
Because so far each pretrained model just kept parroting the same stupid data and missing the crucial point
Isn't it there on LMarena?
Lmfao, all of them are pre trained data networks
Deep research isn't as far as I'm aware.
Ah, DeepThink? Hmm, there's talks of it coming on the platform soon.
Deep RESEARCH. Not Think.
The thing that makes the model ask Google questions
Kk. Chill out 😅
I also find LLMs funny when it comes to creative writing
They tend to generate absolutely atrocious and banal ideas, and each time you ask them to write something new, they just keep writing the same story over and over again, only switching minor details such as settings, character names, character designs and so on
However, when it comes to assisting and finishing already good ideas, they tend to generate ideas that fit much better and are less nonsensical
Deepseek once suggested the same way to finish my story I did
But if you start writing with LLMs from scratch, they are total garbage
In your opinion is better gpt 5 or grok 4?
GPT-5 high reasoning wins most of the benchmarks but it's a bit slow, sometimes slower than Qwen
It also won a couple of my private benchmarks
However, LLM seem to be capable to produce really creative, really unlike-each-other mathematical proofs for novel problems
DeepThink can do it already
I wonder why it doesn't work the same way when writing stories
Do you have prompts for test ai?
I mean grok 4 is smart AF when it.comes to math
But gpt 5 is greater overall
Both are great models tho
Something unexpected happened.
When asked why the show failed, Grok and Gemini booth provide a knee jerk response. When asked to do comprehensive research - even in offline mode - they figured out all the factors, and then after asked for the most important one they all name marketing problems
They never figure it out if you ask them directly
But if asked to research a bit
But this is stupid, I want them to be able to respond correctly at the very first prompt without nudging
This is so stupid
I ran a private HTML game microbenchmark and GPT-5 did a pretty good job on release day, close to or maybe SotA. Then after Sama tweeted about fixing a bug, I ran it again, and this time GPT-5 generated the best game of any model yet by a decent margin.
And on the coding part of my regular test suite, it crushed as well. It even came up with a brilliant and elegant optimization no other model has proposed (more than a hundred models so far). I am a big fan/ of Sonnet and Gemini 2.5 Pro but coding, but I can’t deny those results.
It seems like OpenAI actually cooked on the coding side this time.
have you guys switched to gpt 5 for coding?
How good is Deep Research in GPT-5? any difference from previous model? I believe it was previously using O3 for research when GPT 4o came out, no?
No, Google Gemini
idk i still use the o3-mini reasoning model when it comes to the analyst agent
why the hell?
I'm student and Gemini AI pro is free for students in my country.
gemini is generally free in their ai studio interface
Yes, I use Google AI Studio free version.
Does gemini's web version have context length limit?
Web version? The application itself without AI Studio?
Yeah gemini.google.com
Я что не догоняю гпт 5 вышел
Вышел
Yes, you can't use Gemini 2.5 pro for long in official Gemini app but I'm not sure about context window.
I'm sorry, I meant output length
Is it the same as in ai studio?
No, it's not. Unless you are on paid plan. The whole free thing they doing on Google AI Studio is to attract developers to the website and convert them into paid customers. Average joe that uses Gemini app for cat pic doesn't get same limits on free version in gemini official website.
Ok, thank you!
Livebench added GPT5 pro high
Angliysky speak blyat!!
Still an awful benchmark. 4o has 77% in coding and GPT5 pro has 69% in coding
о русские
Gpt 5 minimal?
Tbh they have to run it multiple times on each problem and take the average to see if the model really passes or not
Which is probably what they're doing
If so then I have no idea why it's so ass
They run it multiple times. It's a problem with the benchmark, they scored it 0 if it takes over a certain amount of time iirc
It's ass
That's why the non reasoning models do better
They don't need to take time
Maybe they ask well know old Examples, seeing by the Command perfomance
I cant endure Command being that high
Command performance is not good. I scrolled to the bottom because that is where GPT5 is
Lmao
There's like 100 models above it
Wha
Because livebench is trash
I don't get it. They had one job
And they screwed it up
Looks reasonable if you ignore gpt5?
Dx
Cursed
Ofc they all suck, we just need yo find one that sucks less
Simple bench is ok
What does this mean
They can't actually serve GPT5 fully?
👀
This is unexpected. Livebench always glazed openAI before
wheres gemini?
thx
Gemini has dropped from 78 to 70. Could this be related to its nerfing?
they gonna remove sora or smth
less images per week for plus
they gonna cut something
why
No. they update the benchmark
nah
not enough compute
Maybe they all are, in a way.
Gemini 2.5 Pro clearly better than GPT 5 imo. Even in lmarena GPT 5 has only 33% win rate against Gemini 2.5 Pro. Not sure how it has ended up on top!
