#Challenge LLMs & help build an authoritative LLM leaderboard!
Great to see Kaggle partner with LMSYS
Thanks Anthony! Are you familiar with LMSYS from elsewhere?
Yeah, I've played with the chatbot arena before and have used their vicuna models
Very nice! If you don't mind sharing, what were your thoughts on the arena?
Nice! I am curious, would you allow your own fine-tuned models onto the arena or side-by-side, or is it only for the foundation models and open-source community competitors?
Hi Kinjal, what does the Top P parameter mean?
Very nice experience. In my case, the models being compared were Model A: deluxe-chat-v1 and Model B: vicuna-33b. In general, Model A performed very well, while Model B kept making "mistakes".
Top P can definitely be confusing. It's a parameter you can set for large language model inference that helps you balance diversity of word choice against sticking to high-likelihood words. If you set a higher P, you will tend to get more diverse output from the LLM.
The way it works is by taking the smallest set of tokens whose cumulative probability mass reaches or exceeds P, then sampling only from that set. Consider tokens with probabilities [0.4, 0.3, 0.2, 0.1]. If you set P to anything <= 0.4, it would only sample the token with probability 0.4. If you set P = 0.8, it will sample from the tokens with 0.4, 0.3, and 0.2, because 0.4 + 0.3 = 0.7 falls short of 0.8, while 0.4 + 0.3 + 0.2 = 0.9 exceeds it.
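In case code helps, here's a rough Python sketch of that selection step. The function names `top_p_filter` and `sample_top_p` are made up for illustration and don't come from any particular library; real inference stacks usually work on logits and may handle the boundary case slightly differently.

```python
import random

def top_p_filter(probs, p):
    """Return indices of the smallest set of tokens whose cumulative
    probability reaches or exceeds p (hypothetical helper)."""
    # Rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # enough probability mass collected
    return kept

def sample_top_p(probs, p, rng=random):
    """Sample one token index from the top-p set (hypothetical helper)."""
    kept = top_p_filter(probs, p)
    # random.choices renormalizes the weights over the kept tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

probs = [0.4, 0.3, 0.2, 0.1]
print(top_p_filter(probs, 0.4))  # only the most likely token is kept
print(top_p_filter(probs, 0.8))  # the three most likely tokens are kept
```

With P = 0.8 the last token (probability 0.1) can never be sampled, which is how Top P trims the unlikely tail.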
Here's a video explanation if it helps: https://www.youtube.com/watch?v=nfqZwC_h388
Note: Temperature is another parameter you can use to adjust sampling. It's often recommended to use either Temperature or Top P but not both.
For now, I don't think there's a plan to allow individuals to add their own models to the public arena. With the current design it wouldn't really support O(10^3+) models. Were you interested in doing so to compare how your own finetuned model benchmarks against others? Would love to hear more!
Thanks Nate, very good info
At some point, yes. I would love for people to rate how understandable a translation of a jargon-filled radiology report is and, independently, how medically accurate it is.
I thought it was nice to test different models and their performance. It was also helpful to figure out which model could perform the task I wanted best
Got it. That makes sense: the arena's conceptual framework, a (double-)blind rating system, would be valuable here. But because your use case is fairly specific, it might benefit from its own arena rather than a generic, all-encompassing one. That's something for us to consider in the future: letting the community easily spin up their own arena for a particular task or benchmark. Maybe as a Community Competition?