#llm-20-questions
Does anyone here understand how this works? Like, what is the entrypoint here? How are the agents selected and called? What is the minimum interface we need to implement?
How does the environment know it's a question or a guess?
Found some information hidden away in the submission modal. Basically the last function of your main.py file should take an observation and return the response.
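A minimal sketch of what that entrypoint might look like. It assumes the observation exposes a `turnType` field with the values "ask", "guess", and "answer", plus `questions`, `answers`, and `keyword` fields, as the starter notebook appears to do. The three helper functions and their canned replies are hypothetical placeholders for real LLM calls.

```python
from types import SimpleNamespace


def ask_question(obs):
    # Placeholder: a real agent would prompt an LLM with obs.questions / obs.answers.
    return "Is it a place?"


def make_guess(obs):
    # Placeholder: a real agent would derive the guess from the dialogue so far.
    return "france"


def answer_question(obs):
    # Placeholder: a real agent would ask an LLM whether the question fits obs.keyword.
    last_question = obs.questions[-1].lower() if obs.questions else ""
    return "yes" if obs.keyword.lower() in last_question else "no"


def agent_fn(obs, cfg):
    """Last function in main.py: takes an observation and returns a string."""
    # Assumption: the environment tells the agent its role through obs.turnType.
    if obs.turnType == "ask":
        response = ask_question(obs)
    elif obs.turnType == "guess":
        response = make_guess(obs)
    elif obs.turnType == "answer":
        response = answer_question(obs)
    else:
        response = "no"
    # Fall back to a fixed reply rather than returning an empty string.
    return response if response else "yes"


# Simulate an "answer" turn with a hand-built observation.
obs = SimpleNamespace(
    turnType="answer", keyword="france", questions=["Is it France?"], answers=[]
)
print(agent_fn(obs, None))
```

Under these assumptions, the environment would distinguish a question from a guess by the turn type it requested, not by inspecting the returned text.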
hi i want to collaborate
Hi, this is my first kaggle competition, and I am looking for a team. I am not experienced with LLMs and tuning them, if anyone is interested in partnering up with me i would really appreciate it
hello! i would like to collaborate
Anyone looking to collaborate with me?
Have you checked the starter notebook? It shows how to set up the agent.
Are we supposed to fine-tune the LLMs, or is it just prompt engineering?
im reasonably confident that both are acceptable. right now i am doing prompt engineering, but i think i will try and switch models (currently on gemma/7b-it-quant/v2 ) after i figure out why i am getting a "Validation Episode failed" for my submission with the starter notebook
Can we use any llm? Like llama would work as well?
And what about paid ones like chatgpt through the api.
We are not limited to any specific LLM, right?
i got llama to work, i dont see anything against it in the rules
I think you can use any, as long as your API key has enough credits to compete with other agents throughout the competition
Can anyone explain the basic structure of the code when it runs in the competition? How does it get the input question, and how does it send an output?
I am new so just trying to figure the basic stuff
tough comp... I was thinking of exploring RLHF as a way to fine tune one of the acceptable base models but I have also been exploring papers on implementing deterministic logic in LLMs. Looks like an active research area. Papers that got me thinking were https://arxiv.org/abs/2309.04658 and https://arxiv.org/abs/2310.18940 which use the Werewolf game. So far I have worked through prompt engineering, probabilistic keyword guesses, and tuning model parameters (temp, top P, etc.). Nothing in that space seems to make a huge difference. I have a public notebook for testing these techniques that you all could fork if you want to. Be aware, submission errors are present in some versions. V12 is functional. https://www.kaggle.com/code/matthewsfarmer/20q-gemma
One consistent problem I keep facing is that my answerer agent says "no" even though the answer should be an obvious "yes".
Thanks for those links Matthew, interesting reads. Your last link to 20q Gemma doesn't lead anywhere for me tho
ChatGPT is pretty good at asking questions btw
```
GPT: Is it a person, place, or thing?
ME: yes
GPT: Alright, let me rephrase that. Is it a person?
ME: yes
GPT: Is this person a real person?
ME: no
GPT: Is this person a character from a book?
ME: no
GPT: Is this person a character from a movie?
ME: yes
GPT: Is this character from a movie made in the last 20 years?
ME: yes
GPT: Is this character a male?
ME: yes
GPT: Is this character from a science fiction movie?
ME: yes
GPT: Is this character part of a movie franchise?
ME: yes
GPT: Is this character a superhero?
ME: no
GPT: Is this character a villain?
ME: no
GPT: Is this character a human?
ME: no
GPT: Is this character an alien?
ME: yes
GPT: Is this character from the "Star Wars" franchise?
ME: yes
GPT: Is this character part of the Rebel Alliance?
ME: yes
GPT: Is this character a droid?
ME: no
GPT: Is this character known for piloting a spaceship?
ME: no
GPT: Is this character known for their wisdom or mentorship?
ME: yes
GPT: Is this character Yoda?
ME: yes
GPT: Yes! Got it! Yoda it is. Thanks for playing! Do you want to play another round?
```
I haven't checked answering questions yet
Yea, unfortunately you can only really fit a 7b or maybe 13b model in the submissions
at least without having any help, llama3 7b does not even get anywhere close..
Hey folks, I am quite confused: with the current rules and setup, I understand the keywords are hidden from us and will be changed during the final evaluation phase after submission deadline, but can our notebooks still read the keywords.py ? cuz I see this keywords_list = json.loads(KEYWORDS_JSON) in the llm_20_questions.py provided
Currently, with all words being visible, the leaderboard is basically a cheating ground with people using rule-based approaches rather than actually relying on LLM reasoning... and not sure how this will play out in the final eval
@lucid laurel @flat acorn you cannot use ChatGPT/OpenAI APIs. The agents will be running offline in their own environment during final evaluation, meaning no internet access.
As far as I get it, we will not have access to the set of possible keywords, and this makes this competition really tough. Even though I have a quite reasonable agent with Gemma 2b, I’m still unable to guess keywords. I think only a few agents from the leaderboard are really able to guess correctly from time to time. And I guess the two top performers (guys from nvidia) are among them.
BTW I am really curious how you are able to use 7b or 13b models? With Gemma-7b-it-quant I get an "out of memory" error
It was set to private. My bad. Fixed!
Are you trying to eval within the notebook? If so, you need to only load one agent and have three simple agents. If you are only running one, try resetting the notebook then rerunning, and make sure you have the GPU env running
yeah, I've seen a few public notebooks directly reference the keywords. They should fall off the leaderboard when the full list is used.
Interesting. I was using GPT4o through API and it stopped asking yes-or-no questions after the 4th question. What prompt did you use to constrain replies to "Is this...", did you alter any inference parameters?
I have just a single agent running. I get CUDA error: out of memory. But still, Gemma-2b is doing a good job. I just have to work a bit on the guessing process
I created a custom gpt in the ui and have this set as instructions: You are playing the game 20 Questions. Your role is the Questioner. You ask questions that can only be answered with yes or no. Start with broader questions and then, depending on the answers you get, narrow down the questions until you find the correct answer. Your goal is to guess the keyword within 20 questions that can be a person, place, or thing. Good questions to ask should start off broad and get more specific, and ideally, each question should eliminate a large number of options until you get to the final few 20 Questions game questions. Here are some example questions for each category:
Animals:
- Is it a mammal/bird/fish/reptile?
- Is it big/small?
- Can it fly?
- Does it eat meat/grass/bugs?
- Does it live in Africa/North America/Asia?
People:
- Are they a real person?
- Are they a man/woman?
- Have I met them?
- Are they alive?
- Are they a movie star/athlete/politician/etc?
Places:
- Is it famous?
- Have I been to it?
- Would I need a plane to reach it?
- Is it a hot/cold/wet/dry place?
- Is it often crowded?
Food:
- Is it a raw food/cooked dish?
- Is it a fruit/vegetable/meat?
- Is it sweet/sour/savory?
- Have I eaten it before?
- Do I like it?
Objects:
- Is there one in this room?
- Is it light enough to pick up?
- Is it rare/common?
- Would it be expensive to buy?
- Is it used more for fun/work?
Keep the tone humorous and light-hearted, making the game enjoyable and fun.
Would you mind making the code public and sharing it?
I don't have any code, those are just the instructions that I pasted into chatgpt
Oh ok my bad, I thought you used their api to build stuff and created your own ui
nah, the 'customize' is just an option in the official chatgpt ui
In my testing, it seems that response parsing can be a huge limitation for a competitive agent. Here's an example:
I have an eval notebook where I added debug prints to each step in the text generation process.
The agents like to add extra text even when prompted not to, and if given specific commands to wrap the output in "double asterisks", they spit out lots of them. Or they just repeat the command in double asterisks... pure silliness
if you are seeing outputs like this in your validation or replays, take some time to check parsing logic and the text generation.
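One possible shape for that parsing logic, as a sketch only: it assumes the model was asked to wrap its reply in double asterisks and that an answerer must end up with exactly "yes" or "no". The function names `extract_bold` and `parse_yes_no` are hypothetical, not taken from the starter notebook.

```python
import re


def extract_bold(text):
    """Return the first non-empty span wrapped in double asterisks, or None."""
    match = re.search(r"\*\*([^*]+)\*\*", text)
    if match:
        span = match.group(1).strip()
        return span or None
    return None


def parse_yes_no(text, default="no"):
    """Map a noisy model reply to exactly 'yes' or 'no'."""
    # Prefer the bolded span; fall back to the whole reply if there is none.
    candidate = extract_bold(text) or text
    for word in re.findall(r"[a-z]+", candidate.lower()):
        if word in ("yes", "no"):
            return word
    # Nothing usable found (for example a reply made only of asterisks).
    return default


print(parse_yes_no("Sure! The answer is **No**."))
print(parse_yes_no("********"))
```

A fixed default keeps the agent from returning garbage when the model ignores the format, at the cost of occasionally giving a wrong answer. Note that a reply that merely repeats the instruction (for example "**answer yes or no**") would still be read as "yes" by this sketch, so a stricter check may be worth adding.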
what on earth is going on during this episode?
https://www.kaggle.com/competitions/llm-20-questions/leaderboard?dialog=episodes-episode-55228932
Just good luck?
Any suggestion what to do if the validation episode errors out without actually outputting an error in the logs? My agent 0 logs just has an opening bracket "[" ...
hey guys, do you know if there are some models we can use or is it allowed to use anything with an api key?
if you can load the model weights into the env and fit them into the GPU memory of 1 T4, it is fair game. the kaggle env does not have internet access so all model weights must be loaded.
you can use the code below to download the model, then make sure to wrap that tmp folder into your submission:
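```
from huggingface_hub import snapshot_download
from pathlib import Path
import shutil

# Start from a clean folder so stale files do not end up in the submission.
model_path = Path("/tmp/submission/")
if model_path.exists():
    shutil.rmtree(model_path)
model_path.mkdir(parents=True)

# Download every file of the model repo into the submission folder.
snapshot_download(
    repo_id="repo-id",  # placeholder: replace with the actual Hugging Face repo id
    local_dir=model_path
)
```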
check out the replay to see if the agents were called and if they were active. this means that there was no response from the agent when it was called, but there was no error in your script (libraries, syntax, etc).
setting a hard coded keyword guess as 'france' is my hunch on why it was won
What does Err mean, and why did I get huge decreases in points (because of it?)
What does the [1st] mean? It sometimes can be [3rd] or other things too
This Err, for example, doesn't give a huge decrease in points
what kind of resources do you need to run the gemma 7b model? I run out of memory when I try to let 2 agents play against another. I also tried fine tuning, and that ran out of memory. I also tried fine tuning it on google colab's A100, and that ran out of memory as well..
You'll need to load the quantized version. The 7b model, when fully loaded on the GPU takes up a little over 17 gb. The quantized version (like what is loaded into the starter notebook) is a little under 9.
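A rough back-of-the-envelope check of those numbers: weight memory is approximately parameter count times bytes per parameter. The figure of about 8.5 billion parameters for the "7b" Gemma checkpoint is an assumption here, and the 15 GB budget is the usable T4 memory reported in a Kaggle notebook. The estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory needed just to hold the model weights, in GB."""
    return n_params * bytes_per_param / 1e9


GEMMA_7B_PARAMS = 8.5e9  # assumption: approximate total parameter count
T4_USABLE_GB = 15        # assumption: usable GPU memory shown in the notebook

fp16_gb = weight_memory_gb(GEMMA_7B_PARAMS, 2)  # 16-bit weights: 2 bytes each
int8_gb = weight_memory_gb(GEMMA_7B_PARAMS, 1)  # 8-bit quantized: 1 byte each

for name, size in [("fp16", fp16_gb), ("int8", int8_gb)]:
    verdict = "fits" if size < T4_USABLE_GB else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights, {verdict} in {T4_USABLE_GB} GB")
```

Under these assumptions the 16-bit weights alone exceed a single T4, while the 8-bit version leaves several GB for generation.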
If you want a model without quantization, you might try with 2B.
If you're going that small though, I'd opt for Phi 3 mini or Qwen 1.5B Instruct
Why are most top-scoring notebooks old? The corresponding teams stopped submitting because their agents seem good enough. Is this advisable?
Why change what isn't broken? They probably had a strong or overfit submission early in the competition and will update if they start dropping down.
I think the best practice is to upload your submission after internal testing. If your bot can win a game of 20 questions against itself a few times and the question/guesses seem to be logical, you should submit. It may not rise to the top of the leaderboard but the overfit models will drop once the secret keyword list is used.
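A sketch of such a self-vs-self check in plain Python. The three `stub_*` callables are hypothetical stand-ins for your questioner, answerer, and guesser; they are not part of the competition code, and in practice each would wrap an LLM call.

```python
def play_self_game(ask, answer, guess, keyword, max_rounds=20):
    """Run one self-play game; return the winning round number, or None."""
    questions, answers, guesses = [], [], []
    for round_no in range(1, max_rounds + 1):
        questions.append(ask(questions, answers, guesses))
        answers.append(answer(keyword, questions[-1]))
        guesses.append(guess(questions, answers, guesses))
        if guesses[-1].strip().lower() == keyword.lower():
            return round_no
    return None


# Hypothetical toy agents that simply walk through a fixed candidate list.
CANDIDATES = ["paris", "kenya", "france"]


def stub_ask(questions, answers, guesses):
    return f"Is it {CANDIDATES[len(questions) % len(CANDIDATES)]}?"


def stub_answer(keyword, question):
    return "yes" if keyword.lower() in question.lower() else "no"


def stub_guess(questions, answers, guesses):
    return CANDIDATES[len(guesses) % len(CANDIDATES)]


test_keywords = ["france", "kenya", "tokyo"]
results = [play_self_game(stub_ask, stub_answer, stub_guess, kw) for kw in test_keywords]
wins = sum(r is not None for r in results)
print(f"won {wins} of {len(test_keywords)} games, rounds: {results}")
```

Reading the question and guess transcripts from a handful of such games is usually more informative than the win count alone, since it shows whether the questions actually narrow things down.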
Seriously, what the hell is this -194
Some submissions get penalized for no reason.
I assume that top scores remain high because they never meet some type of opponents
Maybe your agent ran out of memory bcs your teammate gives lengthy questions (which results in err for your agent) hence - for you, + for 3 others
Hmm, the problem is that we cannot avoid such teammates, weird. Moreover, the answer should be yes or no. Why would it need time. Does the fake teammate intentionally ask my agent to spend time before answering😕
You should adjust your own agent instead, simplify your prompts if you can and truncate the questioner prompt if more than 100 characters for example. (My agent does this too)
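A small sketch of that truncation step, assuming the incoming question is a plain string. The 100-character cap is just the example value mentioned above, and `truncate_question` is a hypothetical helper name.

```python
def truncate_question(question, max_chars=100):
    """Collapse whitespace and cap the question length before prompting the model."""
    # Collapse newlines and repeated spaces so padding cannot inflate the prompt.
    question = " ".join(question.split())
    if len(question) <= max_chars:
        return question
    cut = question[:max_chars]
    # Prefer cutting at a word boundary when one exists.
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut


long_question = "Is it " + "very " * 60 + "big?"
short = truncate_question(long_question)
print(len(long_question), len(short))
```

Capping the input this way bounds how much text the answerer has to process, no matter what the other team's questioner sends.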
I would test your model in your kaggle notebook by loading the model onto cuda:0. If the GPU memory is 14.* / 15 you may want to load a quantized version.
It seems that the competition is getting interesting! Saw some newly submitted notebooks scoring high, overtaking the top one submitted a month ago
Hey all quick question. The rules say "Guesses are limited to 100 characters". But looking at llm_20_questions.py it seems the response in the turn = guess should only include "the <keyword>" ? And not long multi word responses like "My guess is that the keyword is..." ?
So many yes/no bots 😢
What do you mean by yes/no bots?
I think they are referring to the answerer agents that don't call an llm for text generation but instead just answer yes or no all the time regardless of the keyword or question.
Oh, I see 🙂 Generally I think there will be much fun after submissions close. In my opinion the "things" category is extremely hard to guess, and I guess max 5% of bots will compete, if not less. I made quite an effort to make good bots (without referring to the public keyword list at all) but was able to guess maybe 3-4 times during the entire competition. And it’s always been a place. Let’s see what happens. I don’t care at all about the current leaderboard. Its importance is zero now, in my opinion
Right! Without a ground-truth benchmark, our only true agent evaluation is self vs. self. The leaderboard seems to be overfit with keyword bots.
On the hidden test set, will all bots be initialized with 600 as reward so any bot can face any bot?
Has anyone been able to use a more recent version of transformers? I'm trying to use Gemma2 and even though I'm installing transformers 4.42.4 to the submission libraries I get a validation error saying that there is no Gemma2ForCausalLM to import from transformers
I've been working on gemma2 for a few days but whatever I do I got the same validation error. I suppose it is related to kaggle environment. Now moved to llama3
gemma 2 is not supported in the kaggle docker file/environment yet
Competition getting more interesting?