#Fine-tuned Qwen2.5-7B model loops infinitely in Ollama but works fine with transformers


tender kettle (BOT)

The most likely cause is a mismatch between the prompt format used during fine-tuning (Alpaca) and the template in your Ollama Modelfile. When the template or stop tokens in Ollama differ from what the model saw in training, the model may never emit a sequence that Ollama treats as the end of a turn, so generation loops or repeats. Inference in Transformers can still work fine, presumably because your script there reuses the training prompt format and EOS token. Unsloth's documentation and user reports stress that the same chat template must be used for training and inference, especially when exporting to GGUF for Ollama; otherwise you get endless generations or gibberish.

Qwen2.5's native chat format is ChatML, not Alpaca, so the most reliable fix is to use ChatML both for fine-tuning and in the Modelfile, and to set stop tokens that match the template's end-of-turn tokens. See Unsloth's troubleshooting guidance for examples of correct templates and stop tokens for Qwen2.5 in Ollama and llama.cpp.
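
To make the mismatch concrete, here is a minimal sketch that renders the same training example in both formats. The Alpaca text below is the common no-input layout and is an assumption (the exact prompt you trained with is not shown in this thread); the render helper and the sample question are purely illustrative.

```python
# Illustrative only: two ways to render the same training example.
# ALPACA is the widely used no-input Alpaca layout (assumed, not taken from this thread).
ALPACA = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

# ChatML wraps every turn in <|im_start|>role ... <|im_end|>.
CHATML = (
    "<|im_start|>user\n{instruction}<|im_end|>\n"
    "<|im_start|>assistant\n{response}<|im_end|>\n"
)


def render(template: str, instruction: str, response: str) -> str:
    """Fill a prompt template with one instruction/response pair."""
    return template.format(instruction=instruction, response=response)


alpaca_text = render(ALPACA, "Name a prime number.", "7")
chatml_text = render(CHATML, "Name a prime number.", "7")

# A model trained on alpaca_text never sees <|im_end|> in its targets,
# so a ChatML stop token has nothing to match at inference time.
for name, text in [("alpaca", alpaca_text), ("chatml", chatml_text)]:
    print(name, "contains <|im_end|>:", "<|im_end|>" in text)
```

If the Modelfile wraps the prompt in ChatML while the weights learned the Alpaca layout, both the prompt the model receives and the end marker Ollama waits for differ from training.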

Would you like a step-by-step guide on how to set up the correct template and stop tokens for Qwen2.5 in Ollama?

regal arch
tender kettle (BOT)

Step-by-step to fix infinite looping for Qwen2.5-7B in Ollama:

  1. Use the ChatML Template for Qwen2.5
    During both fine-tuning and inference, Qwen2.5 expects the ChatML format, not Alpaca. The template should look like:

    <|im_start|>system
    {SYSTEM}<|im_end|>
    <|im_start|>user
    {INPUT}<|im_end|>
    <|im_start|>assistant
    {OUTPUT}<|im_end|>
    

    In your Modelfile, set:

    TEMPLATE """{{ if .System }}<|im_start|>system
    {{ .System }}<|im_end|>
    {{ end }}{{ if .Prompt }}<|im_start|>user
    {{ .Prompt }}<|im_end|>
    {{ end }}<|im_start|>assistant
    {{ .Response }}<|im_end|>
    """
    
  2. Set Correct Stop Tokens
    For ChatML, use:

    PARAMETER stop "<|im_end|>"
    PARAMETER stop "<|im_start|>"
    

    Remove unrelated or Alpaca-specific stop tokens.

  3. Ensure Consistency

    • Fine-tune with the same ChatML template.
    • When running in Ollama, use the above Modelfile.
    • If you used a system prompt during training, set it in the Modelfile with SYSTEM.
  4. Example Modelfile

    FROM ./unsloth.Q8_0.gguf
    
    TEMPLATE """{{ if .System }}<|im_start|>system
    {{ .System }}<|im_end|>
    {{ end }}{{ if .Prompt }}<|im_start|>user
    {{ .Prompt }}<|im_end|>
    {{ end }}<|im_start|>assistant
    {{ .Response }}<|im_end|>
    """
    
    PARAMETER stop "<|im_end|>"
    PARAMETER stop "<|im_start|>"
    PARAMETER temperature 1.0
    PARAMETER top_p 0.95
    SYSTEM "You are a helpful assistant."
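
For the fine-tuning side of step 3, the sketch below shows one way to convert Alpaca-style records into ChatML text that lines up with the Modelfile above. The field names (instruction, input, output) follow the usual Alpaca dataset layout and are assumptions, as are the helper names and the sample record. In a real pipeline you would more likely call the tokenizer's apply_chat_template method from Hugging Face Transformers; this hand-rolled version only makes the format visible.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages as ChatML, one block per message."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def alpaca_record_to_messages(
    record: Dict[str, str],
    system: str = "You are a helpful assistant.",
) -> List[Dict[str, str]]:
    """Map an Alpaca-style record (assumed keys) to chat messages."""
    user = record["instruction"]
    if record.get("input"):
        # Fold the optional Alpaca input field into the user turn.
        user += "\n\n" + record["input"]
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["output"]},
    ]


# Hypothetical sample record, for illustration only.
record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
text = to_chatml(alpaca_record_to_messages(record))
print(text)
```

The system string here matches the SYSTEM line in the example Modelfile, so training and inference would see the same first block.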
    
regal arch
tender kettle (BOT)

Llama 3.1 and Mistral-v0.3 most likely work because the templates and stop tokens Ollama applies to them happen to line up with what your Alpaca-tuned checkpoints emit, so generation still ends on a token Ollama recognizes. Qwen2.5 is less forgiving: its ChatML template ends each turn with <|im_end|>, and a checkpoint fine-tuned on Alpaca prompts may not reliably produce that token. With mismatched templates or stop tokens, the result in Ollama is infinite loops or repeated outputs, as described in Unsloth's troubleshooting guide and user reports.
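
The role of stop tokens can be sketched with a few lines of Python. This is not Ollama's implementation, only an illustration of the idea that output is cut at the first stop string it contains; the raw text is a made-up example of an Alpaca-tuned model continuing into a new turn.

```python
from typing import Iterable, Optional, Tuple


def truncate_at_stop(text: str, stops: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Cut text at the earliest occurrence of any stop string.

    Returns the kept text and the stop string that matched (None if none did).
    """
    best_idx, best_stop = len(text), None
    for stop in stops:
        idx = text.find(stop)
        if idx != -1 and idx < best_idx:
            best_idx, best_stop = idx, stop
    return text[:best_idx], best_stop


# Hypothetical raw output: the model answers, then invents the next turn.
raw = (
    "Paris.\n\n### Instruction:\nWhat is the capital of Spain?\n\n"
    "### Response:\nMadrid."
)

chatml_stops = ["<|im_end|>", "<|im_start|>"]
alpaca_stops = ["### Instruction:"]

kept_chatml, hit_chatml = truncate_at_stop(raw, chatml_stops)
kept_alpaca, hit_alpaca = truncate_at_stop(raw, alpaca_stops)

print("ChatML stops matched:", hit_chatml)   # no match, output keeps running
print("Alpaca stop matched:", hit_alpaca)
print("Kept with Alpaca stop:", repr(kept_alpaca))
```

With only ChatML stop strings configured, nothing in Alpaca-style output matches, so generation continues until a length limit is hit; this is the looping behavior described above.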

Would you like more technical detail on the template/token differences between these models?

regal arch
tender kettle (BOT)

Other small language models (SLMs) that tend to behave like Llama and Mistral with respect to templates and stop tokens include Gemma, DeepSeek, and Phi-3/4. In practice they often tolerate several prompt formats (Alpaca, ChatML, or Vicuna) in Ollama or llama.cpp without looping, and appear less strict than Qwen2.5. That tolerance is not guaranteed, though; the safe rule is still to run inference with the same template and stop tokens used for fine-tuning.

Would you like a comparison table or more details on template compatibility for these models?
