#llm-prompt-recovery
Hey guys, I'm new to all this. Can someone please explain to me the pre reqs or the objectives for this competition?
what i am noticing from the ai field these days is that a lot of it is very accessible (esp prompt engineering)
so there aren't many prereqs
what's really interesting about this competition is that we generate the data and also test it
which is very, very cool!
some people will play around with langchain, llama index, whatever, and bring in different kinds of tools; no doubt synthetic data (and prompt generation) will be the key thing here
I see. Thank you so much for your insight.
langchain was my first thought when i saw the competition pop up
i think for exploratory work langchain is bad, but nothing is really all that good
i need to build an llm repl but i am lazy and busy doing other things lol
DSPy maybe? i hear good things
actually, i should just email chris potts lol
I've been doing an independent study at my university on big data analytics using LLMs, and langchain has offered many helpful tools for that
langchain does the prescribed job but it’s too complex (it can be very much improved)
shared sentiment among many llm hackers i’m afraid
Hi! This is my first competition apart from the playground ones!
Simple question: Should I keep using vscode or switch to a kaggle kernel?
Hey! This is my first kaggle competition, would love to team up.
hello, this is the first time I'm participating in a kaggle competition. I have a question that I suspect is fairly trivial: I'm working from the competition's starter notebook (https://www.kaggle.com/code/wlifferth/starter-notebook-generating-more-data-with-gemma), but my submission is blocked due to "Your Notebook uses non-versioned datasets [/kaggle/meta-kaggle] (see Dataset Settings)."
I'm having trouble finding how to change the settings so that I'm working with a versioned dataset. What am I missing here?
hi rishi, I'm also looking for a team-mate, if you are interested please drop a message
I've never done a kaggle competition before. Someone above mentioned DSPy.
Is it possible to even use something like that with internet disabled for the notebook? Not sure how the pip install would work with no internet
You should put the .whl file inside a kaggle dataset and then install the packages from there
I'm confused. I read the rules for the competition, and as far as I know there is nothing stopping me from fine-tuning some arbitrary model on any kind of data, then uploading it as an external model and submitting it. Do I really not need to document the process of obtaining that model? Are there really no limitations on the compute resources used to obtain that model?
you can download the packages with internet enabled, save them and then disable internet, something similar should also work for loading HF models
see:
https://www.kaggle.com/code/samuelepino/pip-installing-packages-with-no-internet & https://www.kaggle.com/code/samuelepino/pip-downloading-packages-to-your-local-machine/notebook?scriptVersionId=29576961
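A minimal sketch of the offline install described above, assuming the wheels were first downloaded with internet enabled (for example with `pip download <package> -d wheels/`) and uploaded as a Kaggle dataset. The dataset path `/kaggle/input/my-wheels` and the package name `dspy-ai` are placeholders for illustration, not something confirmed in this thread; `--no-index` and `--find-links` are standard pip options.

```python
import subprocess
import sys
from pathlib import Path


def offline_pip_command(package: str, wheel_dir: str) -> list[str]:
    """Build a pip command that installs `package` only from local wheels."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                 # never contact PyPI
        f"--find-links={wheel_dir}",  # look for .whl files here instead
        package,
    ]


def install_offline(package: str, wheel_dir: str) -> None:
    """Run the install; fail early if the wheel directory is missing."""
    if not Path(wheel_dir).is_dir():
        raise FileNotFoundError(f"wheel directory not found: {wheel_dir}")
    subprocess.run(offline_pip_command(package, wheel_dir), check=True)


# Hypothetical usage inside a notebook with internet disabled:
# install_offline("dspy-ai", "/kaggle/input/my-wheels")
```

The wheel directory has to contain the package's dependencies too, otherwise pip will fail because it cannot fall back to PyPI.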
Hello everyone, this is my first time participating in a kaggle competition. If possible, I'd like to team up☺️
Hello! I'm having trouble getting a score for my submission:
I used all the ids in sample_submission.csv, put the predicted prompts in the rewrite_prompt column and dumped this dataframe into submission.csv. The log shows my code ran fine, but I still get "Notebook threw exceptions" after I submit, so I think the submission.csv is not compatible with the scorer. Any suggestions on why that's the case?
I was getting the same error and this fixed it - https://www.kaggle.com/competitions/llm-prompt-recovery/discussion/482008
Recover the prompt used to transform a given text
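The fix from the linked discussion isn't quoted in this channel, so the following is only a sanity-check sketch for the file layout described above: one row per `id` and a non-empty `rewrite_prompt`. The fallback prompt and the helper names are hypothetical, and the assumption that blank predictions are worth guarding against is mine, not the thread's.

```python
import csv

DEFAULT_PROMPT = "Rewrite this text."  # hypothetical fallback, not from the thread


def write_submission(ids, prompts, path="submission.csv"):
    """Write one (id, rewrite_prompt) row per id, replacing blank predictions."""
    if len(ids) != len(prompts):
        raise ValueError("need exactly one prompt per id")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "rewrite_prompt"])
        for row_id, prompt in zip(ids, prompts):
            text = (prompt or "").strip()
            writer.writerow([row_id, text or DEFAULT_PROMPT])


def read_submission(path="submission.csv"):
    """Read the file back as a list of dicts to check what was written."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Reading the file back with `read_submission` before submitting is a cheap way to confirm the header and row count match sample_submission.csv.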
Can anyone explain the setup used to generate the rewritten text? Is it exactly the same as https://www.kaggle.com/code/wlifferth/starter-notebook-generating-more-data-with-gemma ?
has anyone validated whether off-internet files are compatible with this competition?
can anyone recommend papers or other work related to the competition task?
is train.csv also just a single line when running a submission? And if so, am I correct in understanding that there isn't really any signal to train other than data we generate ourselves?
I could not find a way to load a hugging face model during submission scoring. I import a model from hugging face, save to the working directory and turn the internet off. Then I fine-tune this model and save the fine-tuned version also to the working directory.
For submission, I comment out the fine-tuning steps, and I want to load the fine-tuned model from the working directory. But the working directory is missing in the code submission phase.
I tried the trick of loading the notebook data as an input in the left pane of the edit window. With this trick, "Commit and Run All" works, but code submission for scoring still fails.
What is the correct way of doing this?
Hello everyone!
Do any of the 7b models (gemma 7b, mistral 7b / quantised versions) fit within the 9-hour submission limit for execution?
I'm a beginner, can a 3060ti graphics card meet the hardware requirements for this competition?
What are you guys using for the original text corpus? And how about the prompts?
Hi, have you guys run inference using an LLM like Gemma or Mistral? In my case, my submission (using Gemma) has been running for 3 hours and still isn't close to finishing. Since the public test set is only 15% of the total test set and inference is limited to 9 hours, I only meet the requirement if my submission finishes in 90 minutes. I guess using an LLM is not the right direction, or are there configs I missed that could make the runtime faster?
What about your runtime?
I first predicted the rewrite prompt with Mistral, which took over 9 hours. I track run time in my script: after 8.5 hours, all remaining rows get the mean prompt, so my submission does not throw an exception. After that, I trained a T5 model starting from the pretrained 't5-small' and predicted the rewrite prompt with that model. It takes 2 hours to finish.
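The run-time guard described above could look roughly like this. It is a sketch, not the poster's actual script: `predict_fn`, `fallback` and the injectable `clock` are hypothetical names, and the 8.5 hour cutoff is passed in as `budget_seconds`.

```python
import time


def predict_with_budget(texts, predict_fn, fallback, budget_seconds,
                        clock=time.monotonic):
    """Call predict_fn per text until the budget is spent, then use the fallback."""
    start = clock()
    outputs = []
    for text in texts:
        if clock() - start >= budget_seconds:
            outputs.append(fallback)          # out of time: cheap constant answer
        else:
            outputs.append(predict_fn(text))  # still within budget: real prediction
    return outputs


# Hypothetical usage with the cutoff mentioned in the thread:
# prompts = predict_with_budget(texts, run_mistral, MEAN_PROMPT, 8.5 * 3600)
```

The check happens before each row, so a single very slow prediction can still overshoot the budget; leaving some slack below the 9 hour limit covers that.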
Wow, smart! Thanks for sharing.
I used allenai/c4 from hugging face, but make sure you only use it as a streaming dataset since it is huge! I also filtered it down to inputs with fewer than 500 characters, as this seems to be the range of texts in the public leaderboard.
For prompts, I used chatgpt and gemini.
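A rough sketch of the C4 streaming and length filter mentioned above. The dataset id `allenai/c4` with the `en` config, the `text` field and the sample size are assumptions about what the poster used; `sample_c4` needs the Hugging Face `datasets` package and internet access, while `short_texts` is plain Python.

```python
from itertools import islice


def short_texts(records, max_chars=500, text_key="text"):
    """Yield non-empty texts shorter than max_chars from an iterable of dicts."""
    for record in records:
        text = record.get(text_key, "")
        if 0 < len(text) < max_chars:
            yield text


def sample_c4(n=1000, max_chars=500):
    """Stream C4 lazily and keep the first n short texts (needs internet)."""
    from datasets import load_dataset  # Hugging Face `datasets`, imported lazily

    # streaming=True iterates over the data without downloading the full corpus
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    return list(islice(short_texts(stream, max_chars), n))
```

Because both steps are lazy, only as many records are fetched as it takes to collect `n` short texts.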
Hi there everyone! I have a question about how we deal with synthetic data generation. For fine-tuning there are a couple of already generated llm prompt recovery datasets available, but what methods do you follow to generate a considerable number of rows? I'm just starting out with that, so I was curious how it is going for everyone. Is it as simple as prompting an llm to give a set of such examples, or something beyond that?
Hi, is there any information available about the number of parameters in the Gemma model used for the rewrite in this competition?
Hey, anyone want to team up?
why isn't TPU allowed for submission?
I am looking for some teammates for this challenge. Note that the person should be familiar with the basics of LLMs.
Would prefer someone having at least a Masters degree in Artificial Intelligence or a significant amount of experience with NLP
PS: I'm a PhD student in AI
Edit: Already joined a team
CHAI wants to support the open-source LLM community.
Anyone who has trained an LLM can apply (we will give out at least 20 prizes, use it for anything you'd like, ideally training LLMs 😊)
Apply here: https://chai-research.typeform.com/chaiprize
Criteria: 1 huggingface repo that you own, 1 liner explaining what your model is (optional)
Can I generate data and train externally, then use my trained model to submit results?
Sure, upload the model to kaggle models and load it from there.
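A small sketch of the "load it from there" step, assuming the uploaded model shows up somewhere under `/kaggle/input` (where Kaggle mounts notebook inputs) and that its folder contains a `config.json`, as Hugging Face checkpoints usually do. The helper name and the marker file are illustrative choices, not something stated in this channel.

```python
from pathlib import Path


def find_model_dir(root="/kaggle/input", marker="config.json"):
    """Return the first directory under root that contains `marker`, else None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    for candidate in sorted(root_path.rglob(marker)):
        return candidate.parent
    return None


# Hypothetical usage with transformers, offline:
# model_dir = find_model_dir()
# model = AutoModelForSeq2SeqLM.from_pretrained(str(model_dir), local_files_only=True)
```

Unlike `/kaggle/working`, attached inputs are still present when the notebook is rerun for scoring, which is the reason to load from there.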
i have the same doubt... do we have to train on our own data? also, why is test.csv just 1 row too?
any way to attach a public notebook to the competition so it shows up in "code"?
solved: need to attach the competition as an input...
Hi, I meet your requirements: I have a Master's degree in AI. I'd like to be part of your team.
Thanks
Hey I already got a team
cuz they want to limit cost of inference? or tpu comes with its complications with optimizations
Could someone please clarify whether the "rewritten_text" was post-processed to remove expressions like "Sure, here is xxx: " in the final test set?
Please search the discussion forums before asking the question…
https://www.kaggle.com/competitions/llm-prompt-recovery/discussion/483374
I'd like to join this competition and is there a team open?
do you know stuff about transformer arch?
No
umm do you know coding?
Ofc
I feel you would have a better time searching for teams if you explain what you can do for the joining team or respond to a team getting members
cool
That makes a lot of sense, I'll be doing that from now on. Great tip btw
hey also , are you looking to win or looking to learn/upskill?
Upskill and only upskill
that's nice
Currently sitting at .65 on LB (SFT on same dataset is 0.59). My dataset is just the public dataset, but my unsupervised RL training method yields a good improvement. Anyone manage to get above 0.63 using SFT through a well curated dataset? Maybe we can combine our techniques and get a good outcome?
Can you clarify? Are you saying your SFT on the LB dataset is 0.59 and with other methods you get 0.65? But you get higher improvement from SFT on one of the datasets people have made available?
I use the public 70K+ (original_text, rewritten_text, prompt) dataset and do some filtering. I apply SFT to this dataset and get an LB score of 0.59. When I apply my alternative training on the same dataset I get 0.65. This indicates two things to me:
- My dataset is low quality since 0.59 isn't a great score
- My alternative training method is superior to SFT
@woeful pendant
sure
Does the full dataset (private+public) contain around 1400 texts or only the public dataset?
dunno
so it's come to an end