My inference code is: '''import unsloth
from transformers import TextStreamer
local_model_path = r"F:\anaconda\install\lora_model10"
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = local_model_path, # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
alpaca_prompt = """
Instruction:
{}
Input:
{}
Response:
{}""" # alpaca_prompt = You MUST copy from above!
while True:
    question = input("enter your question:")
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            question, # instruction
            "", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
    text_streamer = TextStreamer(tokenizer)
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)''' The training data is: '''[
  {
    "instruction": "张小双的职业是什么?",
    "input": "",
    "output": "张小双是一名工程师。"
  },
  {
    "instruction": "What is Zhang Xiaoshuang's job?",
    "input": "",
    "output": "Zhang Xiaoshuang is an engineer."
  },
  {
    "instruction": "张小双今年多大了?",
    "input": "",
    "output": "张小双今年28岁。"
  }
]''' The inference result is: '''nstruction:,
张小双的职业是什么?
Input:,
Response:,
F:\anaconda\install\envs\new_env\Lib\site-packages\unsloth\models\llama.py:481: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
A = scaled_dot_product_attention(Q, K, V, attn_mask = attention_mask, is_causal = False)
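我的名字是李明。<|endoftext|>''' Can anybody help me work out why? For the question "张小双的职业是什么?" ("What is Zhang Xiaoshuang's job?") the model answers "我的名字是李明。" ("My name is Li Ming.") instead of the trained answer "张小双是一名工程师。" ("Zhang Xiaoshuang is an engineer."). My computer has an 8 GB RTX 4060.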
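
One thing that may be worth ruling out is a mismatch between the prompt built at inference time and the text the model saw during training. The sketch below is only a check, not a diagnosis. It assumes the training script formatted each record with the same alpaca_prompt template and appended the EOS token; the training script is not shown in this post, so both points are assumptions. The helper names (build_training_text, build_inference_prompt, prompt_matches_training) are hypothetical, and the EOS string is taken from the inference output above.

```python
# Template copied from the inference script in this post.
ALPACA_PROMPT = """
Instruction:
{}
Input:
{}
Response:
{}"""

# Assumption: the training script appended the tokenizer's EOS token.
# The string itself is taken from the inference output shown above.
EOS_TOKEN = "<|endoftext|>"


def build_training_text(record, template=ALPACA_PROMPT, eos=EOS_TOKEN):
    """Format one record the way the training script is assumed to have done."""
    return template.format(
        record["instruction"], record["input"], record["output"]
    ) + eos


def build_inference_prompt(question, template=ALPACA_PROMPT):
    """Format a question as the inference loop does: empty input, empty output."""
    return template.format(question, "", "")


def prompt_matches_training(record, train_template=ALPACA_PROMPT,
                            infer_template=ALPACA_PROMPT):
    """Return True if the inference prompt is an exact prefix of the training text."""
    prompt = build_inference_prompt(record["instruction"], infer_template)
    return build_training_text(record, train_template).startswith(prompt)


# First record from the training data in this post.
record = {
    "instruction": "张小双的职业是什么?",
    "input": "",
    "output": "张小双是一名工程师。",
}

print(prompt_matches_training(record))
print(repr(build_inference_prompt(record["instruction"])))
```

To use it, paste the template from the actual training script as train_template. If the function returns False (for example because the training template had extra header text or different line breaks), the two prompts differ and the inference template should be copied from training exactly. If it returns True, the prompt format is probably not the cause and other things, such as whether the folder contains the trained adapter or how many steps it was trained for, would need checking.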