I'm trying to run the Llama 3 8B fine-tuning notebook (https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp) with inputs longer than 8K tokens. I'm getting an error at the inference step:
Token indices sequence length is longer than the specified maximum sequence length for this model (10106 > 8192). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Unsloth: Input IDs of length 10106 > the model's max sequence length of 8192.
We shall truncate it ourselves. It's imperative if you correct this issue first.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-33-6210034c4068> in <cell line: 12>()
10 ], return_tensors = "pt").to("cuda")
11
---> 12 outputs = model.generate(**inputs, max_new_tokens = 2056, use_cache = True)
13 tokenizer.batch_decode(outputs)
49 frames
/usr/local/lib/python3.10/dist-packages/unsloth/models/llama.py in LlamaModel_fast_forward(self, input_ids, causal_mask, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, *args, **kwargs)
581 inputs_embeds.requires_grad_(False)
582 pass
--> 583 inputs_embeds *= attention_mask.unsqueeze(0).transpose(0, 1).transpose(1, 2)
584 if inputs_requires_grad: inputs_embeds.requires_grad_(True)
585 pass
RuntimeError: The size of tensor a (8192) must match the size of tensor b (10106) at non-singleton dimension 1
It seems to be complaining that the input is too long. Given that training on longer contexts worked fine, I'm not sure why inference fails. Any guidance would be appreciated.
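
For reference, here is the length arithmetic as I understand it, written as a minimal plain-Python sketch. It assumes the limit really is the 8192 reported in the warning above and that the prompt plus `max_new_tokens` must fit inside it. `fit_prompt` and its parameters are hypothetical names of my own, not part of Unsloth or transformers, and keeping the most recent tokens is just one possible truncation choice.

```python
def fit_prompt(input_ids, max_seq_length=8192, max_new_tokens=0):
    """Trim a prompt so that prompt + generated tokens fit in the context window.

    Hypothetical helper (not an Unsloth or transformers API).
    Returns (trimmed_ids, n_dropped); the oldest tokens are dropped first.
    """
    # Tokens available for the prompt once room for generation is reserved.
    budget = max_seq_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Number of prompt tokens that do not fit in the budget.
    overflow = max(0, len(input_ids) - budget)
    return list(input_ids[overflow:]), overflow


# Numbers taken from the log above: 10106 prompt tokens, 8192 limit, 2056 new tokens.
ids = list(range(10106))
trimmed, dropped = fit_prompt(ids, max_seq_length=8192, max_new_tokens=2056)
print(len(trimmed), dropped)  # 6136 3970
```

With these numbers, 3970 tokens of the prompt would have to be dropped for generation to stay inside the window, which is not what I want for my use case.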