#prompt-to-leaderboard | Arena | Page 1

strong musk Mar 11, 2025, 6:08 AM

#

3

onyx vault Mar 11, 2025, 8:24 AM

#

Hello! Why aren't all the top models in P2L now?

visual mural Mar 11, 2025, 10:38 AM

#

onyx vault Hello! Why aren't all the top models in P2L now?

💵 ❌

fast elbow Mar 17, 2025, 9:42 AM

#

Do you need to train a new P2L model every time you want to add new LLMs to the model list? I feel like this is quite a limitation of the P2L models seeing how often models get updated.

#

I guess it wouldn't be as big of an issue if the smaller size P2L models work just as well.

signal depot Mar 18, 2025, 12:54 AM

#

i'll put this here (though if i understand the system correctly, it would also seem relevant to the leaderboards channel).. was looking around the P2L 'Explorer' tab and observed that the categorisation process seems kinda messy / suboptimal

#

e.g. the 'Puzzles' category is an assortment of all kinds of prompts.. some are for sure 'puzzles', but many aren't imo.. e.g. some subcategories are just straight up mathematics (like Algebraic Equation Solving is purely about solving equations, nothing to do with lateral thinking etc), while others are really precision/comprehension tasks (counting letters in a word), which ofc LLMs find challenging, but still aren't really 'puzzles'

#

meanwhile some subcategories sound appropriately nested under Puzzles, but looking at the example prompts, they aren't so much puzzles as just random questions [see 'Truth and Deception Challenges' and 'Greetings and Communications']

sweet sage Mar 18, 2025, 1:07 AM

#

oddly the p2l explorer is very close to the normal explorer but not identical

signal depot Mar 18, 2025, 1:07 AM

#

perhaps most striking though is that the largest subcategory in Puzzles (with 749 prompts) is Medical and Health Inquiries... There is also a subcategory called Myopia Risk and Prevention in Adolescents (131 prompts)... needless to say, medical questions aren't puzzles at all... (I assume this reflects the fact that most LLMs are reluctant or refuse to answer or give advice about medical questions, leading to weirdly phrased questions trying to circumvent these guardrails)

vale yacht Mar 18, 2025, 1:07 AM

#

signal depot i'll put this here (though if i understand the system correctly, it would also s...

i think it's automatically done by an llm

signal depot Mar 18, 2025, 1:08 AM

#

i would hope so! 😅

#

otherwise they're employing 10yos to do it or something ha

#

"Creative Writing" - many of the subcategories clearly don't fit well (most seem just like general questions or information / recommendation requests)

vale yacht Mar 18, 2025, 1:10 AM

#

signal depot "Creative Writing" - many of the subcategories clearly don't fit well (most seem...

there's an erp category 🤣

signal depot Mar 18, 2025, 1:12 AM

#

lol yeah there's a few interesting ones in there like that ahah

#

based on the example prompts, Creative Writing may as well be renamed Erotica lol

vale yacht Mar 18, 2025, 1:29 AM

#

signal depot based on the example prompts, Creative Writing may as well be renamed Erotica lo...

🤣 that single category is probably just prompts from the same guy

signal depot Mar 18, 2025, 1:49 AM

#

vale yacht 🤣 that single category is probably just prompts from the same guy

i reckon it's the same for many of the smaller subcategories, e.g. the "Myopia Risk and Prevention in Adolescents" subcategory (I suspect almost all the prompts are from the same highly concerned parent - who can't / doesn't want to pay to use a recently released SOTA model but is desperate to get its 'opinion' on this question...)

#

the erotica stuff - fair play / whatever

#

the medical questions though.. kinda sad tbh.. like people dealing with actual medical issues or seeking genuine medical advice should not be turning to LLMs (let alone spamming the arena hoping to get a response from o1 or something – or just hoping to find a response that they 'like' or confirms their own beliefs)

vale yacht Mar 18, 2025, 1:59 AM

#

signal depot the medical questions though.. kinda sad tbh.. like people dealing with actual m...

ya i find reading that stuff just feels like im invading their privacy

shadow bison Mar 19, 2025, 2:17 AM

#

why is grok not on the prompt to leaderboard

molten sedge Mar 19, 2025, 11:29 AM

#

Too new, so the P2L model was not trained on its data

#

But they might update that model later, idk

shadow bison Mar 25, 2025, 1:10 AM

#

It was on there when it was called chocolate

lunar inletBOT Mar 28, 2025, 9:12 PM

#

1 Warning for runo000 (1267562494004564010)

Moderator: cherry

Respectful discourse and off topic NSFW - Mar 28, 2025

storm drift Apr 12, 2025, 8:43 AM

#

P2L chat doesn't work. Is it only for me, or is it a known issue?

mint magnet Apr 14, 2025, 4:50 AM

#

@storm drift thanks for reporting. just fixed it

half sky Apr 25, 2025, 3:45 PM

#

P2L chat is not working. Maintenance?

mint magnet Apr 25, 2025, 7:49 PM

#

@half sky thanks for reporting it. just fixed it

half sky Apr 26, 2025, 6:31 PM

#

It would be great to have some thematic prompt competitions on P2L. For example, specific rounds focused on mathematics etc. This could push prompt engineering even further and create a lot of interesting challenges for the community!

half sky Apr 27, 2025, 6:35 PM

#

@mint magnet
Suggestion for a new LM Arena functionality:

Upon receiving the user's input, the platform would first filter and select the two best-performing models according to the current P2L scores for that prompt. Then, it would present the two anonymized outputs to the user, who would pick the better response (similar to the "Arena battle"). After the selection, the identities of the two models would be revealed.
This mechanism would allow testing whether the P2L scoring effectively reflects an absolute ranking among models by verifying if the highest-scoring model is consistently preferred by users.

2.2 Prompt-to-Regression (p. 6)
"Thus, the raw coefficient value of a model speaks to its absolute quality, as opposed to its comparative quality against other LLMs as in the BT model."
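
A minimal sketch of the selection step suggested above, assuming the per-prompt P2L scores are available as a plain mapping from model name to score. The function names (`pick_top_two`, `anonymize`, `record_vote`), the model names, and the score values are all hypothetical and are not part of LM Arena or the P2L code:

```python
import random


def pick_top_two(scores):
    """Return the two highest-scoring model names for one prompt."""
    if len(scores) < 2:
        raise ValueError("need at least two models to run a battle")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], ranked[1]


def anonymize(pair, rng=random):
    """Shuffle the pair so display order does not leak the P2L ranking."""
    shuffled = list(pair)
    rng.shuffle(shuffled)
    return {"A": shuffled[0], "B": shuffled[1]}


def record_vote(slots, choice, top_model, tally):
    """Reveal the chosen model and count whether the P2L top pick won."""
    winner = slots[choice]
    key = "top_preferred" if winner == top_model else "runner_up_preferred"
    tally[key] += 1
    return winner


# Hypothetical per-prompt scores, for illustration only.
scores = {"model-x": 1.32, "model-y": 0.87, "model-z": 1.10}
top, second = pick_top_two(scores)
slots = anonymize((top, second))
tally = {"top_preferred": 0, "runner_up_preferred": 0}
winner = record_vote(slots, "A", top, tally)  # the user picked slot "A"
```

Over many votes, the share of `top_preferred` in the tally would indicate how often users actually prefer the model that P2L ranked first for that prompt.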

plain lake May 6, 2025, 1:00 PM

#

@mint magnet

Are there intentions to update the p2l models?

vivid stoneBOT Sep 3, 2025, 2:57 PM

#

:warning: Channel locked

Site outage, will turn back on when resolved.

vivid stoneBOT Sep 3, 2025, 4:01 PM

#

:success: Channel unlocked

Welcome back :ablobwave:

vivid stoneBOT May 12, 2026, 2:53 PM

#

:warning: Channel locked