Running this notebook: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing with Gemma 2 fails with the error below. I have only tested the 2B model, both the instruct and base variants.
Exception in thread Thread-12 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1704, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/unsloth/models/llama.py", line 919, in _CausalLM_fast_forward
    outputs = fast_forward_inference(
  File "/usr/local/lib/python3.10/dist-packages/unsloth/models/gemma2.py", line 396, in Gemma2Model_fast_forward_inference
    seq_len = past_key_values[0][0].shape[-2]
TypeError: 'HybridCache' object is not subscriptable
The error occurs in the last step of the notebook (inference). The same notebook works with Llama 3.2 3B.
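
For what it's worth, the failing line in `gemma2.py` seems to assume the legacy cache layout (a tuple of per-layer `(key, value)` tensors), while `generate` is passing a `HybridCache` object for Gemma 2. Below is a rough sketch of how the sequence length could be read from either form. It assumes the cache object exposes `get_seq_length()`, as the `transformers` `Cache` classes do; `cached_seq_len`, `FakeTensor`, and `FakeCache` are hypothetical names used only for illustration, and this is not the actual Unsloth code or a tested fix.

```python
def cached_seq_len(past_key_values):
    """Return the number of cached positions, or 0 when there is no cache."""
    if past_key_values is None:
        return 0
    # Assumption: cache objects such as HybridCache provide get_seq_length().
    if hasattr(past_key_values, "get_seq_length"):
        return int(past_key_values.get_seq_length())
    # Legacy layout: tuple of (key, value) per layer, with the key shaped
    # (..., seq_len, head_dim), which is what the failing line expects.
    return past_key_values[0][0].shape[-2]


# Tiny stand-ins so the sketch runs without torch or transformers installed.
class FakeTensor:
    def __init__(self, shape):
        self.shape = shape


class FakeCache:
    def __init__(self, seq_len):
        self._seq_len = seq_len

    def get_seq_length(self, layer_idx=0):
        return self._seq_len


legacy_cache = ((FakeTensor((1, 8, 5, 64)), FakeTensor((1, 8, 5, 64))),)
print(cached_seq_len(legacy_cache))   # legacy tuple path
print(cached_seq_len(FakeCache(7)))   # cache-object path
```

Indexing `FakeCache(7)[0]` raises the same kind of `TypeError` as in the traceback, since the object does not define `__getitem__`.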