#prompt-to-leaderboard
Hello! Why aren't all the top models in P2L now?
💵 ❌
Do you need to train a new P2L model every time you want to add new LLMs to the model list? I feel like this is quite a limitation of the P2L models seeing how often models get updated.
I guess it wouldn't be as big of an issue if the smaller size P2L models work just as well.
i'll put this here (though if i understand the system correctly, it would also seem relevant to the leaderboards channel).. was looking around the P2L 'Explorer' tab and observed that the categorisation process seems kinda messy / suboptimal
e.g. the 'Puzzles' category is an assortment of all kinds of prompts.. some are for sure 'puzzles', but many aren't imo.. e.g. some subcategories are just straight up mathematics (like Algebraic Equation Solving is purely about solving equations, nothing to do with lateral thinking etc), while others are really precision/comprehension tasks (counting letters in a word), which ofc LLMs find challenging, but still aren't really 'puzzles'
meanwhile some subcategories sound appropriately nested under Puzzles, but looking at the example prompts, they aren't so much puzzles as just random questions [see 'Truth and Deception Challenges' and 'Greetings and Communications']
oddly the p2l explorer is very close to the normal explorer but not identical
perhaps most striking though is that the largest subcategory in Puzzles (with 749 prompts) is Medical and Health Inquiries... There is also a subcategory called Myopia Risk and Prevention in Adolescents (131 prompts)... needless to say, medical questions aren't puzzles at all... (I assume this reflects the fact that most LLMs are reluctant or refuse to answer or give advice about medical questions, leading to weirdly phrased questions trying to circumvent these guardrails)
i think its automatically done by an llm
i would hope so! 😅
otherwise they're employing 10yos to do it or something ha
"Creative Writing" - many of the subcategories clearly don't fit well (most seem just like general questions or information / recommendation requests)
theres an erp category 🤣
lol yeah there's a few interesting ones in there like that ahah
based on the example prompts, Creative Writing may as well be renamed Erotica lol
🤣 that single category is probably just prompts from the same guy
i reckon it's the same for many of the smaller subcategories, e.g. the "Myopia Risk and Prevention in Adolescents" subcategory (I suspect almost all the prompts are from the same highly concerned parent - who can't / doesn't want to pay to use a recently released SOTA model but is desperate to get its 'opinion' on this question...)
the erotica stuff - fair play / whatever
the medical questions though.. kinda sad tbh.. like people dealing with actual medical issues or seeking genuine medical advice should not be turning to LLMs (let alone spamming the arena hoping to get a response from o1 or something – or just hoping to find a response that they 'like' or confirms their own beliefs)
ya i find reading that stuff just feels like im invading their privacy
why is grok not on the prompt to leaderboard
Too new, so the P2L model wasn't trained on its data
But they might update that model later, idk
It was on there when it was called chocolate
Respectful discourse and off topic NSFW - <t:1743192605:R>
P2L chat isn't working. Is it just me, or is this a known issue?
@storm drift thanks for reporting. just fixed it
P2L chat is not working. Maintenance?
@half sky thanks for reporting it. just fixed it
It would be great to have some thematic prompt competitions on P2L. For example, specific rounds focused on mathematics etc. This could push prompt engineering even further and create a lot of interesting challenges for the community!
@mint magnet
Suggestion for a new LM Arena functionality:
Upon receiving the user's input, the platform would first filter and select the two best-performing models according to the current P2L scores for that prompt. Then, it would present the two anonymized outputs to the user, who would pick the better response (similar to the "Arena battle"). After the selection, the identities of the two models would be revealed.
This mechanism would allow testing whether the P2L scoring effectively reflects an absolute ranking among models by verifying if the highest-scoring model is consistently preferred by users.
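A rough sketch of how the suggested flow could look, purely as an illustration. All names here (`pick_top_two`, `make_battle`, `agreement_rate`, and the shape of the score dictionary) are hypothetical and not part of any LM Arena or P2L API; it assumes per-prompt P2L scores are already available as a plain model-to-score mapping.

```python
import random

def pick_top_two(p2l_scores):
    """Return the two model names with the highest per-prompt scores.

    p2l_scores: dict mapping model name -> score for one prompt
    (assumed input format, not a real P2L output schema).
    """
    if len(p2l_scores) < 2:
        raise ValueError("need at least two models")
    ranked = sorted(p2l_scores, key=p2l_scores.get, reverse=True)
    return ranked[0], ranked[1]

def make_battle(p2l_scores, rng=random):
    """Build an anonymized battle between the two top-scored models.

    Slot order is shuffled so the user cannot infer which side
    has the higher predicted score.
    """
    top, runner_up = pick_top_two(p2l_scores)
    slots = [top, runner_up]
    rng.shuffle(slots)
    return {"A": slots[0], "B": slots[1], "predicted_winner": top}

def agreement_rate(battles, votes):
    """Fraction of battles where the user's vote ("A" or "B")
    matched the model with the highest predicted score."""
    if not battles:
        raise ValueError("no battles recorded")
    hits = sum(1 for b, v in zip(battles, votes) if b[v] == b["predicted_winner"])
    return hits / len(battles)
```

If the agreement rate stayed close to 0.5 over many battles, that would suggest the per-prompt scores don't separate the top two models in a way users can perceive; a rate well above 0.5 would support the ranking.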
2.2 Prompt-to-Regression (p. 6)
"Thus, the raw coefficient value of a model speaks to its absolute quality, as opposed to its comparative quality against other LLMs as in the BT model."
@mint magnet
Are there intentions to update the p2l models?
Site outage, will turn back on when resolved.
Welcome back :ablobwave: