Hi, I'm playing around with the Kaggle project 'llm-classification-finetuning' as an excuse to learn about Unsloth AI. After finishing all the setup, when I'm finally ready to train the model, the following code
trainer_stats = trainer.train()
gives me this error:
KeyError Traceback (most recent call last)
/tmp/ipython-input-773422404.py in <cell line: 0>()
----> 1 trainer_stats = trainer.train()
8 frames
/tmp/ipython-input-3320623149.py in torch_call(self, examples)
22 # But we don't actually want this because number_token_ids[2] could be something like 27, which is now undefined in the new lm_head. So we map it to the new lm_head index.
23 # if this line gives you a keyerror then increase max_seq_length
---> 24 batch["labels"][i, last_token_idx] = reverse_map[ batch["labels"][i, last_token_idx].item() ]
25
26
KeyError: 5052
I've repeatedly tried increasing max_seq_length, up to 12,000, rerunning all of the code each time. Running token_counts = [len(tokenizer.encode(x)) for x in train_df.text] shows a maximum sequence length of ~1700 tokens, so 12,000 should be far more than enough, which suggests that something else is going on here. Any suggestions?
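The traceback itself says that the token id found at last_token_idx (5052) is not a key in reverse_map. A minimal diagnostic sketch for narrowing that down is below. It assumes that the collator treats -100 as the ignored label value, that last_token_idx is the position of the last non-ignored label, and that reverse_map is built from number_token_ids as in the toy data. The helper find_unmapped_last_tokens and all of the sample values are hypothetical, not part of Unsloth or the notebook.

```python
from collections import Counter


def find_unmapped_last_tokens(label_rows, reverse_map, ignore_index=-100):
    """Return (row_index, token_id) for every row whose last non-ignored
    label is missing from reverse_map. token_id is None when the whole
    row is ignored (no label token survived at all)."""
    offenders = []
    for i, row in enumerate(label_rows):
        kept = [t for t in row if t != ignore_index]
        if not kept:
            offenders.append((i, None))
            continue
        last = kept[-1]
        if last not in reverse_map:
            offenders.append((i, last))
    return offenders


# Toy stand-ins (assumptions): three class tokens and three label rows.
number_token_ids = [15, 16, 17]
reverse_map = {tok: idx for idx, tok in enumerate(number_token_ids)}
label_rows = [
    [-100, -100, 15],      # fine: last label is a class token
    [-100, 16, 5052],      # last label is some other token
    [-100, -100, -100],    # nothing left to label
]

offenders = find_unmapped_last_tokens(label_rows, reverse_map)
print(offenders)
# Count which unexpected ids occur most often.
print(Counter(tok for _, tok in offenders))
```

With real data, label_rows would come from the labels the collator produces for each example, and decoding the offending ids with the tokenizer could show whether the last token is, for instance, an end-of-sequence token, trailing whitespace, or a label string outside the expected classes.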
Thanks in advance 🙂