#prompt-to-leaderboard


strong musk
#

3

onyx vault
#

Hello! Why aren't all the top models in P2L yet?

visual mural
fast elbow
#

Do you need to train a new P2L model every time you want to add new LLMs to the model list? I feel like this is quite a limitation of the P2L models seeing how often models get updated.

#

I guess it wouldn't be as big of an issue if the smaller size P2L models work just as well.

signal depot
#

i'll put this here (though if i understand the system correctly, it would also seem relevant to the leaderboards channel).. was looking around the P2L 'Explorer' tab and observed that the categorisation process seems kinda messy / suboptimal

#

e.g. the 'Puzzles' category is an assortment of all kinds of prompts.. some are for sure 'puzzles', but many aren't imo.. e.g. some subcategories are just straight up mathematics (like Algebraic Equation Solving is purely about solving equations, nothing to do with lateral thinking etc), while others are really precision/comprehension tasks (counting letters in a word), which ofc LLMs find challenging, but still aren't really 'puzzles'

#

meanwhile some subcategories sound appropriately nested under Puzzles, but looking at the example prompts, they aren't so much puzzles as just random questions [see 'Truth and Deception Challenges' and 'Greetings and Communications']

sweet sage
#

oddly the p2l explorer is very close to the normal explorer but not identical

signal depot
#

perhaps most striking though is that the largest subcategory in Puzzles (with 749 prompts) is Medical and Health Inquiries... There is also a subcategory called Myopia Risk and Prevention in Adolescents (131 prompts)... needless to say, medical questions aren't puzzles at all... (I assume this reflects the fact that most LLMs are reluctant or refuse to answer or give advice about medical questions, leading to weirdly phrased questions trying to circumvent these guardrails)

vale yacht
signal depot
#

i would hope so! 😅

#

otherwise they're employing 10yos to do it or something ha

#

"Creative Writing" - many of the subcategories clearly don't fit well (most seem just like general questions or information / recommendation requests)

signal depot
#

lol yeah there's a few interesting ones in there like that ahah

#

based on the example prompts, Creative Writing may as well be renamed Erotica lol

vale yacht
signal depot
# vale yacht 🤣 that single category is probably just prompts from the same guy

i reckon it's the same for many of the smaller subcategories, e.g. the "Myopia Risk and Prevention in Adolescents" subcategory (I suspect almost all the prompts are from the same highly concerned parent - who can't / doesn't want to pay to use a recently released SOTA model but is desperate to get its 'opinion' on this question...)

#

the erotica stuff - fair play / whatever

#

the medical questions though.. kinda sad tbh.. like people dealing with actual medical issues or seeking genuine medical advice should not be turning to LLMs (let alone spamming the arena hoping to get a response from o1 or something – or just hoping to find a response that they 'like' or confirms their own beliefs)

vale yacht
shadow bison
#

why is grok not on the prompt to leaderboard

molten sedge
#

Too new, so the P2L model was not trained on its data

#

But they might update that model later, idk

shadow bison
#

It was on there when it was called chocolate

lunar inletBOT
#
1 Warning for runo000 (1267562494004564010)
Moderator: cherry

Respectful discourse and off topic NSFW - <t:1743192605:R>

storm drift
#

P2L chat doesn't work. Is it only me, or is this a known issue?

mint magnet
#

@storm drift thanks for reporting. just fixed it

half sky
#

P2L chat is not working. Maintenance?

mint magnet
#

@half sky thanks for reporting it. just fixed it

half sky
#

It would be great to have some thematic prompt competitions on P2L. For example, specific rounds focused on mathematics etc. This could push prompt engineering even further and create a lot of interesting challenges for the community!

half sky
#

@mint magnet
Suggestion for a new LM Arena functionality:

Upon receiving the user's input, the platform would first filter and select the two best-performing models according to the current P2L scores for that prompt. Then, it would present the two anonymized outputs to the user, who would pick the better response (similar to the "Arena battle"). After the selection, the identities of the two models would be revealed.
This mechanism would allow testing whether the P2L scoring effectively reflects an absolute ranking among models by verifying if the highest-scoring model is consistently preferred by users.

2.2 Prompt-to-Regression (p. 6)
"Thus, the raw coefficient value of a model speaks to its absolute quality, as opposed to its comparative quality against other LLMs as in the BT model."
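A rough sketch of how the check above could work, not how LM Arena or P2L actually implement anything. All function names and the example scores are hypothetical. It picks the two highest-scoring models for a prompt, then measures how often users preferred the higher-scored one. The small `bt_win_prob` helper illustrates the quoted point: in a plain Bradley-Terry model only the gap between two scores matters, so the raw values by themselves say nothing about absolute quality.

```python
import math
from typing import Dict, List, Tuple


def top_two(scores: Dict[str, float]) -> Tuple[str, str]:
    """Return the two model names with the highest scores, best first."""
    if len(scores) < 2:
        raise ValueError("need at least two models")
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[0], ranked[1]


def bt_win_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that A beats B (depends only on the gap)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def top_preference_rate(votes: List[Tuple[str, str, str]]) -> float:
    """Fraction of battles where users picked the higher-scored model.

    Each vote is (higher_scored_model, lower_scored_model, winner).
    """
    if not votes:
        raise ValueError("no votes")
    wins = sum(1 for top, _, winner in votes if winner == top)
    return wins / len(votes)


# Hypothetical per-prompt scores; real P2L outputs would replace these.
example_scores = {"model-a": 1.2, "model-b": 0.4, "model-c": 0.9}
first, second = top_two(example_scores)  # the pair shown anonymously to the user
```

If the preference rate stays close to what `bt_win_prob` predicts for the score gap, that would be some evidence the per-prompt ranking matches user preference; it would not by itself confirm the "absolute quality" reading of the coefficients.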

plain lake
#

@mint magnet

Are there plans to update the P2L models?

brittle sedgeBOT
vivid stoneBOT
#
:warning: Channel locked

Site outage, will turn back on when resolved.

vivid stoneBOT
#
:success: Channel unlocked

Welcome back :ablobwave:

vivid stoneBOT
#
:warning: Channel locked