#llm-prompt-recovery | Kaggle | Page 1

deep moth Feb 28, 2024, 1:08 AM

#

Hey, can anyone explain the workflow with a graph or flow chart? Count me in, let's work.

#

DM @deep moth

fervent briar Feb 28, 2024, 5:10 AM

#

Hey guys, I'm new to all this. Can someone please explain the prereqs and the objectives for this competition?

heady robin Feb 28, 2024, 11:08 AM

#

fervent briar Hey guys, I'm new to all this. Can someone please explain to me the pre reqs or ...

what i am noticing from the ai field these days is that a lot of it is very accessible (esp prompt engineering)

#

so there aren't many prereqs

#

what's really interesting about this competition is that we generate the data and also test it

#

which is very, very cool!

#

some people will play around with langchain, llama index, whatever, and bring in different kinds of tools; no doubt synthetic data (and prompt generation) will be the key thing here

fervent briar Feb 28, 2024, 1:40 PM

#

I see. Thank you so much for your insight.

sleek scarab Feb 28, 2024, 2:29 PM

#

heady robin some people will play around with langchain, llama index, whatever, and bring in...

langchain was my first thought when i saw the competition pop up

heady robin Feb 28, 2024, 2:30 PM

#

sleek scarab langchain was my first thought when i saw the competition pop up

i think for exploratory work langchain is bad, but nothing is really all that good

#

i need to build an llm repl but i am lazy and busy doing other things lol

#

DSPy maybe? i hear good things

#

https://github.com/stanfordnlp/dspy

DSPy: The framework for programming—not prompting—foundation models - stanfordnlp/dspy

#

actually, i should just email chris potts lol

sleek scarab Feb 28, 2024, 2:31 PM

#

heady robin i think for exploratory work langchain is bad, but nothing is really all that go...

I've been doing an independent study at my university on big data analytics using LLMs, and langchain has offered many helpful tools for that

heady robin Feb 28, 2024, 2:32 PM

#

langchain does the prescribed job but it’s too complex (it can be very much improved)

#

shared sentiment among many llm hackers i’m afraid

dark ember Feb 29, 2024, 7:39 PM

#

Hi! This is my first competition other than the playground ones!
Simple question: should I keep using VS Code or switch to a Kaggle kernel?

teal zinc Mar 1, 2024, 4:59 AM

#

Hey! This is my first kaggle competition, would love to team up.

exotic scaffold Mar 1, 2024, 10:36 AM

#

hello, this is the first time I'm participating in a kaggle competition. I have a question that I suspect is fairly trivial: I'm working from the starter notebook for the competition (https://www.kaggle.com/code/wlifferth/starter-notebook-generating-more-data-with-gemma), but my submission is blocked due to "Your Notebook uses non-versioned datasets [/kaggle/meta-kaggle] (see Dataset Settings)."

I'm having trouble finding how to change the settings so that I'm working with a versioned dataset. What am I missing here?

shell halo Mar 1, 2024, 2:08 PM

#

teal zinc Hey! This is my first kaggle competition would love to team up.

hi rishi, I'm also looking for a team-mate, if you are interested please drop a message

potent tendon Mar 2, 2024, 9:28 PM

#

I've never done a kaggle competition before. Someone above mentioned DSPy.

Is it possible to even use something like that with internet disabled for the notebook? Not sure how the pip install would work with no internet

surreal kraken Mar 3, 2024, 2:38 AM

#

You should put the .whl file inside a kaggle dataset and then install the packages from there
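
A minimal sketch of the wheel-based install described above, not a confirmed recipe for this competition. It assumes the wheels (including the wheels of every dependency) were uploaded to a Kaggle dataset mounted at the hypothetical path `/kaggle/input/my-wheels`, and that the package is published as `dspy-ai`, which is also an assumption. The `--no-index` and `--find-links` flags are standard pip options.

```python
import subprocess
import sys
from pathlib import Path


def offline_pip_command(package, wheel_dir):
    """Build a pip command that installs only from local wheel files."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                 # never contact PyPI
        f"--find-links={wheel_dir}",  # resolve packages from this folder instead
        package,
    ]


# Hypothetical dataset path: replace with wherever your wheels are attached.
WHEEL_DIR = "/kaggle/input/my-wheels"

# "dspy-ai" is an assumed package name, used only for illustration.
cmd = offline_pip_command("dspy-ai", WHEEL_DIR)
print(" ".join(cmd))

# Only attempt the install when the wheel folder actually exists.
if Path(WHEEL_DIR).is_dir():
    subprocess.run(cmd, check=True)
```

The wheels themselves can be collected beforehand in an internet-enabled session with `pip download <package> -d <folder>`.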

young hinge Mar 3, 2024, 4:47 AM

#

I had the same problem

#

Here's what I'm facing now:

main ember Mar 3, 2024, 3:46 PM

#

im confused, i read the rules for the competition. As far as I know there is nothing stopping me from finetuning some arbitrary model on any kind of data and then uploading it as an external model and submitting it. Do I really not need to document the process of obtaining that model? Are there really no limitations on the compute resources used to obtain that model?

main ember Mar 3, 2024, 4:16 PM

#

potent tendon I've never done a kaggle competition before. Someone above mentioned DSPy. Is ...

you can download the packages with internet enabled, save them and then disable internet, something similar should also work for loading HF models

see:
https://www.kaggle.com/code/samuelepino/pip-installing-packages-with-no-internet & https://www.kaggle.com/code/samuelepino/pip-downloading-packages-to-your-local-machine/notebook?scriptVersionId=29576961

pip-installing packages with no internet

Explore and run machine learning code with Kaggle Notebooks | Using data from [Private Datasource]

pip-downloading packages to your local machine

Explore and run machine learning code with Kaggle Notebooks | Using data from No attached data sources

olive schooner Mar 4, 2024, 6:04 PM

#

Hello everyone, this is my first time participating in kaggle competitions. If possible, I want to team up ☺️

river cove Mar 5, 2024, 10:30 PM

#

Hello! I'm having trouble getting a score for my submission:
I used all the ids in sample_submission.csv, put the predicted prompts in the rewrite_prompt column, and dumped this dataframe into submission.csv. The log shows my code ran fine, but I still get "Notebook threw exceptions" after I submitted, so I think the submission.csv is not compatible with the scorer. Any suggestions on why that's the case?
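
For reference, a small sketch of writing a submission file with the two columns mentioned above (`id` and `rewrite_prompt`). It is not a diagnosis of the error in this message. The helper name, the demo ids, and the default prompt are all made up for illustration; the one idea it shows is reading the ids at run time and guaranteeing that every id gets a prompt.

```python
import csv
import tempfile
from pathlib import Path


def write_submission(sample_path, out_path, predictions, default):
    """Write one row per id in the sample file, using `default` for missing ids."""
    with open(sample_path, newline="", encoding="utf-8") as f:
        ids = [row["id"] for row in csv.DictReader(f)]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "rewrite_prompt"])
        writer.writeheader()
        for row_id in ids:
            prompt = predictions.get(row_id, default)
            writer.writerow({"id": row_id, "rewrite_prompt": prompt})
    return len(ids)


# Demo with a throwaway sample file; the ids and prompts are invented.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "sample_submission.csv"
sample.write_text("id,rewrite_prompt\na1,placeholder\nb2,placeholder\n", encoding="utf-8")
out = workdir / "submission.csv"

n_rows = write_submission(
    sample, out,
    predictions={"a1": "Rewrite this as a poem."},
    default="Improve this text.",
)
print(n_rows, "rows written to", out)
```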

calm crypt Mar 6, 2024, 6:33 PM

#

river cove Hello! I'm having trouble getting a score for my submission: I used all the id i...

I was getting the same error and this fixed it - https://www.kaggle.com/competitions/llm-prompt-recovery/discussion/482008

LLM Prompt Recovery

Recover the prompt used to transform a given text

normal wasp Mar 7, 2024, 8:54 AM

#

Can anyone explain what setup generated the rewritten text? Is it exactly the same as https://www.kaggle.com/code/wlifferth/starter-notebook-generating-more-data-with-gemma ?

Starter Notebook: Generating More Data With Gemma

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

sweet sandal Mar 7, 2024, 10:10 PM

#

main ember you can download the packages with internet enabled, save them and then disable ...

has anyone validated whether off-internet files are compatible with this competition?

rotund wadi Mar 8, 2024, 1:12 PM

#

can anyone recommend papers or other work related to the competition task?

woeful pendant Mar 8, 2024, 8:01 PM

#

is train.csv also just a single line when running a submission? And if so, am I correct in understanding that there isn't really any signal to train other than data we generate ourselves?

sweet sandal Mar 8, 2024, 10:38 PM

#

I could not find a way to load a hugging face model during submission scoring. I import a model from hugging face, save to the working directory and turn the internet off. Then I fine-tune this model and save the fine-tuned version also to the working directory.

For submission, I comment out the fine-tuning steps, and I want to load the fine-tuned model from the working directory. But the working directory is missing in the code submission phase.

I tried the trick that loads the notebook data as an input in the left pane of the edit window. With this trick, "Commit and Run All" works, but code submission for scoring still fails.

What is the correct way of doing this?

eager dragon Mar 9, 2024, 6:00 PM

#

Hello everyone!
Do any of the 7b models (gemma 7b, mistral 7b / quantised versions) fit within the 9-hour submission limit for execution?

noble flint Mar 10, 2024, 8:45 AM

#

I'm a beginner, can a 3060ti graphics card meet the hardware requirements for this competition?

fleet cobalt Mar 10, 2024, 6:40 PM

#

What are you guys using for the original text corpus? And how about the prompts?

heavy heron Mar 11, 2024, 7:46 AM

#

Hi, have you guys run inference using an LLM like Gemma or Mistral? In my case, my submission (using Gemma) has been running for 3 hours and still isn't finished. Since the public test set is only 15% of the total test set and inference is limited to 9 hours, I only meet the requirement if my submission finishes in 90 minutes. I guess using an LLM is not the right direction, or are there configs I missed that could make the runtime faster?

#

What about your runtime?

versed crater Mar 11, 2024, 7:30 PM

#

heavy heron What about your runtime?

I first predicted the rewrite prompt with Mistral, and it took over 9 hours. So I tracked the run time in my script: after 8.5 hours, all remaining rows get the mean prompt, so my submission does not throw an exception. After that, I trained a T5 model from the pretrained 't5-small' and predicted the rewrite prompt with that model. It takes 2 hours to finish.
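
A rough sketch of the time-budget idea in this message: keep predicting while there is time left, then fill every remaining row with a fixed fallback. The function name, the `clock` parameter, and the fallback string are illustrative assumptions, not the poster's actual script.

```python
import time


def predict_with_deadline(rows, predict, fallback, budget_seconds, clock=time.monotonic):
    """Call `predict` per row until the budget is spent, then emit `fallback`."""
    start = clock()
    outputs = []
    for row in rows:
        if clock() - start < budget_seconds:
            outputs.append(predict(row))   # still inside the time budget
        else:
            outputs.append(fallback)       # out of time: cheap constant answer
    return outputs


# Demo with a fake clock that advances one "second" per call,
# so the behaviour can be checked without waiting.
ticks = iter(range(100))
result = predict_with_deadline(
    rows=["a", "b", "c", "d"],
    predict=str.upper,               # stand-in for a slow model call
    fallback="mean prompt",          # placeholder for a generic prompt
    budget_seconds=3,
    clock=lambda: next(ticks),
)
print(result)
```

In a real notebook the budget would be something like `8.5 * 3600` and `clock` would stay at its default. Note that the check happens before each row, so one slow prediction can still overrun the budget.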

heavy heron Mar 12, 2024, 1:42 AM

#

versed crater I used to predict rewrite prompt with Mistral. It takes me over 9 hours. I track...

Wow, smart! Thanks for sharing.

odd stirrup Mar 12, 2024, 7:06 AM

#

fleet cobalt What are you guys using for the original text corpus? And how about the prompts?

I used allenai/c4 from Hugging Face, but make sure you only use it as a streaming dataset since it is huge! I also filtered it down to inputs with fewer than 500 characters, as this seems to be the range of texts in the public leaderboard.

For prompts, I used chatgpt and gemini.
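
A possible sketch of the streaming-and-filtering step described above. It assumes the dataset is `allenai/c4` on the Hugging Face Hub with the `en` configuration and a `text` field per record; the helper names and the `limit` of 1000 are made up. The filter works on any iterable of dicts, so it can be tried on a small in-memory list first.

```python
from itertools import islice


def short_texts(records, max_chars=500, limit=1000):
    """Lazily yield up to `limit` texts shorter than `max_chars` characters."""
    kept = (r["text"] for r in records if len(r["text"]) < max_chars)
    return islice(kept, limit)


def stream_c4():
    """Open English C4 as a stream (needs the `datasets` package and internet)."""
    from datasets import load_dataset  # lazy import so the filter runs without it
    return load_dataset("allenai/c4", "en", split="train", streaming=True)


# In-memory stand-in so nothing has to be downloaded to try the filter.
demo = [{"text": "short one"}, {"text": "x" * 600}, {"text": "another short one"}]
filtered = list(short_texts(demo, max_chars=500, limit=2))
print(filtered)
```

With internet enabled, `short_texts(stream_c4())` would pull records one at a time instead of downloading the whole corpus.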

smoky star Mar 13, 2024, 2:31 AM

#

Hi there everyone! I have a question about how we deal with synthetic data generation. For fine-tuning there are a couple of already generated LLM prompt recovery datasets available, but what methods do you follow to generate a considerable number of rows? I'm just starting out with that, so I was curious how everyone approaches it. Is it as simple as prompting an LLM to give a set of such examples, or something beyond that?

autumn vector Mar 13, 2024, 9:56 AM

#

Hi, is there any information available about the number of parameters in the GEMMA model used for the rewrite in this competition?

frosty monolith Mar 14, 2024, 5:58 AM

#

Hey, anyone want to team up?

hidden wharf Mar 18, 2024, 6:13 PM

#

why isn't TPU allowed for submission?

stray vigil Mar 21, 2024, 5:07 PM

#

I am looking for some teammates for this challenge. Note that the person should be familiar with the basics of LLMs.

Would prefer someone having at least a Masters degree in Artificial Intelligence or a significant amount of experience with NLP

PS: I'm a PhD student in AI

Edit: Already joined a team

covert mist Mar 22, 2024, 10:28 PM

#

CHAI wants to support the open-source LLM community.

Anyone who has trained an LLM can apply (we will give out at least 20 prizes, use it for anything you'd like, ideally training LLMs 😊)

Apply here: https://chai-research.typeform.com/chaiprize

Criteria: 1 huggingface repo that you own, 1 liner explaining what your model is (optional)

Chai Prize

Complete and win 3 days unlimited messages!

last cipher Mar 24, 2024, 2:23 AM

#

Can I generate data and train externally, then use my trained model to submit results?

dense prism Mar 25, 2024, 1:11 PM

#

last cipher Can i generate data and train externally and use my trained model to submit resu...

Sure, upload the model to Kaggle Models and load it from there.
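
A hedged sketch of what "load it from there" could look like once a model is attached to the notebook. It assumes attached inputs are mounted under `/kaggle/input`, that the checkpoint was saved in Hugging Face format (so it contains a `config.json`), and that it is a seq2seq model such as T5; for a decoder-only model, `AutoModelForCausalLM` would be the class to swap in. The helper names are hypothetical.

```python
from pathlib import Path


def find_model_dir(root="/kaggle/input"):
    """Return the first folder under `root` that holds a config.json, else None."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    for config in sorted(root_path.rglob("config.json")):
        return config.parent
    return None


def load_model(model_dir):
    """Load a saved checkpoint from disk without touching the network."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # lazy import
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir, local_files_only=True)
    return tokenizer, model


model_dir = find_model_dir()
print(model_dir if model_dir else "no attached model found")
```

Reading from the attached input folder rather than the working directory is the design choice here.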

exotic scaffold Mar 25, 2024, 3:36 PM

#

woeful pendant is train.csv also just a single line when running a submission? And if so, am I ...

i have the same doubt... do we have to train on our own data? also, why is test.csv just 1 row too?

rain imp Mar 25, 2024, 7:28 PM

#

any way to attach a public notebook to the competition so it shows up in "code"?

rain imp Mar 25, 2024, 8:54 PM

#

rain imp any way to attach a public notebook to the competition so it shows up in "code"?

solved: need to attach the competition as an input...

ember raven Mar 27, 2024, 5:47 AM

#

stray vigil I am looking for some teammates for this challenge. Note that the person should ...

Hi, I meet your requirements: I have a Master's degree in AI. I'd like to be part of your team.

Thanks

stray vigil Mar 27, 2024, 5:57 AM

#

Hey I already got a team

knotty radish Apr 1, 2024, 11:22 AM

#

hidden wharf why isn't TPU allowed for submission?

cuz they want to limit cost of inference? or tpu comes with its complications with optimizations

neat oasis Apr 5, 2024, 10:26 AM

#

Could someone please confirm whether the "rewritten_text" was post-processed to remove expressions like "Sure, here is xxx: " in the final test set?
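
To make the question concrete, here is a small sketch of the kind of post-processing being asked about: stripping a leading "Sure, here is ...:" preamble from generated text. Whether the organizers did anything like this is not stated here; the regular expression and function name are purely illustrative.

```python
import re

# Matches a leading "Sure, here is ...:" / "Sure! Here's ...:" style preamble.
PREAMBLE = re.compile(
    r"^\s*sure[,!.]?\s+here(?:'s| is| are)\b[^:\n]*:\s*",
    re.IGNORECASE,
)


def strip_preamble(text):
    """Remove one chatty preamble from the start of `text`, if present."""
    return PREAMBLE.sub("", text, count=1)


cleaned = strip_preamble("Sure, here is the rewritten text: Hello world")
print(cleaned)
```

Text that does not start with such a preamble is returned unchanged.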

heady raptor Apr 6, 2024, 3:50 PM

#

Please search the discussion forums before asking the question…

https://www.kaggle.com/competitions/llm-prompt-recovery/discussion/483374

exotic scaffold Apr 6, 2024, 3:56 PM

#

I'd like to join this competition and is there a team open?

knotty radish Apr 6, 2024, 4:45 PM

#

exotic scaffold I'd like to join this competition and is there a team open?

do you know stuff about transformer arch?

exotic scaffold Apr 6, 2024, 4:51 PM

#

knotty radish do you know stuff about transformer arch?

No

knotty radish Apr 6, 2024, 4:51 PM

#

exotic scaffold No

umm do you know coding?

exotic scaffold Apr 6, 2024, 4:51 PM

#

knotty radish umm do you know coding?

Ofc

knotty radish Apr 6, 2024, 4:52 PM

#

I feel you would have a better time searching for teams if you explain what you can do for the joining team or respond to a team getting members

knotty radish Apr 6, 2024, 4:52 PM

#

exotic scaffold Ofc

cool

exotic scaffold Apr 6, 2024, 4:53 PM

#

knotty radish I feel you would have a better time searching for teams if you explain what you ...

That makes a lot of sense, I'll be doing that from now on. Great tip btw

knotty radish Apr 6, 2024, 6:19 PM

#

exotic scaffold That makes a lot of sense, I'll be doing that from now on. Great tip btw

hey also , are you looking to win or looking to learn/upskill?

exotic scaffold Apr 6, 2024, 6:23 PM

#

knotty radish hey also , are you looking to win or looking to learn/upskill?

Upskill and only upskill

knotty radish Apr 6, 2024, 7:03 PM

#

that's nice

lime locust Apr 9, 2024, 5:24 AM

#

Currently sitting at .65 on LB (SFT on same dataset is 0.59). My dataset is just the public dataset, but my unsupervised RL training method yields a good improvement. Anyone manage to get above 0.63 using SFT through a well curated dataset? Maybe we can combine our techniques and get a good outcome?

woeful pendant Apr 9, 2024, 4:49 PM

#

lime locust Currently sitting at .65 on LB (SFT on same dataset is 0.59). My dataset is just...

Can you clarify? Are you saying your SFT on the LB dataset is 0.59 and with other methods you get 0.65? But you get higher improvement from SFT on one of the datasets people have made available?

lime locust Apr 9, 2024, 4:52 PM

#

I use the public 70K+ (original_text, rewritten_text, prompt) dataset and do some filtering. I apply SFT to this dataset and get an LB score of 0.59. When I apply my alternative training on the same dataset I get 0.65. This indicates two things to me:

My dataset is low quality since 0.59 isn't a great score
My alternative training method is superior to SFT

@woeful pendant

knotty radish Apr 11, 2024, 5:10 AM

#

lime locust Currently sitting at .65 on LB (SFT on same dataset is 0.59). My dataset is just...

sure

hidden wharf Apr 11, 2024, 6:02 AM

#

Does the full dataset (private+public) contain around 1400 texts or only the public dataset?

knotty radish Apr 14, 2024, 8:17 AM

#

hidden wharf Does the full dataset (private+public) contain around 1400 texts or only the pub...

dunno

knotty radish Apr 17, 2024, 6:43 PM

#

so it's come to an end