The issue at https://github.com/unslothai/unsloth/issues/2939 describes a bug with batch inference for Gemma-2 models in Unsloth. With a batch size greater than 1, padded inputs cause the model to generate empty or incorrect outputs, while single-sample inference works as expected. The problem appears to be related to how padding tokens and attention masks are handled in the Gemma-2 implementation. As a workaround, batching only prompts of the same length, so that no padding is needed, yields correct results. The Unsloth team has acknowledged the issue and investigation is ongoing, but there is no official fix yet.
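The same-length workaround can be sketched roughly as follows. This is an illustration only, not code from the issue: `group_by_token_length` and `toy_tokenize` are hypothetical names, and the whitespace tokenizer is a stand-in for the model's real tokenizer. The idea is to bucket prompts by token count so that every batch sent to the model contains sequences of equal length and therefore needs no padding.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def group_by_token_length(
    prompts: Sequence[str],
    tokenize: Callable[[str], List[int]],
    max_batch_size: int = 8,
) -> List[List[Tuple[int, str]]]:
    """Split prompts into batches whose members share one token count.

    Each item is (original_index, prompt) so that outputs can be put
    back into the original order after generation.
    """
    buckets: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for index, prompt in enumerate(prompts):
        buckets[len(tokenize(prompt))].append((index, prompt))

    batches: List[List[Tuple[int, str]]] = []
    for length in sorted(buckets):
        items = buckets[length]
        # Cap each same-length bucket at max_batch_size items per batch.
        for start in range(0, len(items), max_batch_size):
            batches.append(items[start:start + max_batch_size])
    return batches


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in: one placeholder token id per whitespace-separated word.
    return list(range(len(text.split())))


prompts = ["Hello there", "What is Gemma", "Hi again", "Tell me more", "One two"]
batches = group_by_token_length(prompts, toy_tokenize, max_batch_size=2)
for batch in batches:
    print([index for index, _ in batch])
```

In real use, `tokenize` would wrap the model's own tokenizer, and the length should be measured on the final input (after any chat template and special tokens are applied), since that is what determines whether padding is needed. Prompts with a unique length end up in a batch of one, which matches the single-sample case the issue reports as working.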
For more details and code samples, see the full discussion at Unsloth GitHub Issue #2939. Would you like a step-by-step breakdown or guidance on possible workarounds?