All,
I am currently fine-tuning on a dataset of 15,000+ high-quality, human-written items in order to generate similar instances. Fine-tuning Mistral-22B and Mistral-7B (loss levels out at about 1.1) gives examples ranging from good (needing some minor edits) to really good (needing no edits).
I would like to run the model on a Raspberry Pi 5 for kicks and giggles. To get an acceptable run time per token, the model needs to be 2B parameters or smaller (Gemma-2B, Llama-3.2-1B, etc.). However, the output of these models under similar conditions (loss flattens out at about 1.6) is just garbage and gibberish.
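To put the two plateau values on a more intuitive scale, here is a quick sketch that converts them to perplexity. It assumes the reported loss is the mean per-token cross-entropy in nats (an assumption on my part, though it is the usual default), and the `perplexity` helper is just an illustrative name. Each model has its own tokenizer, so the numbers are only roughly comparable across models.

```python
import math


def perplexity(mean_ce_loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_ce_loss_nats)


# Plateau values reported above; assumed to be mean per-token loss in nats.
plateaus = {"Mistral-7B": 1.1, "~2B models": 1.6}

for name, loss in plateaus.items():
    print(f"{name}: loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")
# Mistral-7B: loss 1.1 -> perplexity 3.00
# ~2B models: loss 1.6 -> perplexity 4.95
```

Under that assumption, the small models are effectively choosing among about five equally likely tokens at each step, versus about three for the 7B model.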
Is the solution here just more epochs? Running DPO, using a segment of the dataset alongside outputs generated by the model? Adjusting some of the fine-tuning parameters? Or should I just bite the bullet, use Mistral-7B, and cope with the poor generation time?
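To make the DPO idea concrete, this is a minimal sketch of how the preference pairs could be built: the human-written item is the preferred ("chosen") response and the small model's output for the same prompt is the "rejected" one. The function name and the sample strings are hypothetical, and the `prompt`/`chosen`/`rejected` field names are an assumption based on the layout commonly used for preference datasets, so adjust them to whatever the training library expects.

```python
def build_preference_pairs(prompts, human_items, model_outputs):
    """Pair each human-written item (chosen) with the model's output (rejected)."""
    if not (len(prompts) == len(human_items) == len(model_outputs)):
        raise ValueError("prompts, human_items and model_outputs must align")

    pairs = []
    for prompt, chosen, rejected in zip(prompts, human_items, model_outputs):
        # A pair carries no preference signal if both sides are identical.
        if chosen.strip() == rejected.strip():
            continue
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical placeholder data, only to show the shape of the result.
pairs = build_preference_pairs(
    prompts=["Write an item about topic A.", "Write an item about topic B."],
    human_items=["Human-written item A.", "Human-written item B."],
    model_outputs=["asdf qwer gibberish", "Human-written item B."],
)
print(len(pairs))  # 1 (the second pair is dropped because both sides match)
```

Only a segment of the 15,000+ items would need to go through this; the rest can stay reserved for the supervised fine-tuning stage.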
I wouldn't be too torn up if I couldn't get this to run as pictured, but your help would be appreciated.