#Chatml for Qwen Model


hazy ocean

Hello, I have a short question. I took a look at the wiki and the Colabs and want to try converting my dataset to ChatML.

I have 3 columns (instruction, input and output).
My code currently is:

        chatML_format = """
        <|im_start|>user
        {}
        <|im_end|>
        
        <|im_start|>assistant
        {}
        
        input: {}
        """

        def chatml_format(examples):
            instructions = examples["instruction"]
            inputs = examples["input"]
            outputs = examples["output"]
            texts = []

            for instruction, input, output in zip(instructions, inputs, outputs):
                text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
                texts.append(text)
            return {"text": texts, }

        print("Load DataSet")
        texts = dataset.map(chatml_format, batched=True)

But I have no idea how to bring "input" into this. Does anyone have an idea?

royal heron

The input field is extra context for the instruction, so it belongs inside the user turn, not after the assistant turn. Two other things stand out in your code: the assistant block is never closed with <|im_end|>, and the loop calls alpaca_prompt.format(...) even though your template is named chatML_format.

Here's an updated version of your code that places input inside the user turn:

chatML_format = """<|im_start|>user
{}

{}<|im_end|>
<|im_start|>assistant
{}<|im_end|>
"""

def chatml_format(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []

    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Placeholders are filled in order: instruction, input, output
        text = chatML_format.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

print("Load DataSet")
texts = dataset.map(chatml_format, batched=True)

Explanation:

  • The user turn now holds the instruction followed by the input. The assistant turn holds only the output, which is the text the model should learn to produce.
  • Both turns are closed with <|im_end|>, and the loop formats chatML_format instead of the undefined alpaca_prompt.
  • Inside chatml_format, the loop variable is named input_text instead of input so that it does not shadow Python's built-in input().
  • If some rows have an empty input, the user turn will end with a blank line before <|im_end|>. Strip it or handle the empty case separately.
  • If your tokenizer's EOS token is already <|im_end|>, drop either the final <|im_end|> in the template or the + EOS_TOKEN so the token is not emitted twice.
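One way to handle rows whose input column is empty is to build the user message first and only then fill the template. The following is a minimal sketch, not the only way to do it. The names CHATML_TEMPLATE, build_user_turn and chatml_format_batch are hypothetical, and the eos_token parameter is an assumption that stands in for whatever EOS_TOKEN is in your setup.

```python
# Sketch only: helper names are hypothetical; column names come from the question.
CHATML_TEMPLATE = (
    "<|im_start|>user\n{user}<|im_end|>\n"
    "<|im_start|>assistant\n{assistant}<|im_end|>\n"
)


def build_user_turn(instruction, input_text):
    """Join the instruction and the optional input into one user message."""
    instruction = instruction.strip()
    input_text = (input_text or "").strip()
    if not input_text:
        # No extra context: the user message is just the instruction.
        return instruction
    return f"{instruction}\n\n{input_text}"


def chatml_format_batch(examples, eos_token=""):
    """Batched formatter: takes a dict of lists, returns {"text": [...]}."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        user = build_user_turn(instruction, input_text)
        text = CHATML_TEMPLATE.format(user=user, assistant=output.strip())
        texts.append(text + eos_token)  # eos_token is an assumed parameter
    return {"text": texts}
```

With the datasets library this could be called as dataset.map(chatml_format_batch, batched=True, fn_kwargs={"eos_token": EOS_TOKEN}), assuming EOS_TOKEN is defined as in your script.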

Make sure that EOS_TOKEN is defined before you call dataset.map, for example as tokenizer.eos_token, so that every training example ends with the end-of-sequence token.
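An alternative to a hand-written format string is to turn each row into a list of role/content messages and render that list. The sketch below does the rendering by hand so that it runs without any library. row_to_messages and render_chatml are hypothetical names, and the rendering assumes the plain ChatML layout with no system message. Hugging Face tokenizers that ship a chat template can render the same kind of list with tokenizer.apply_chat_template(messages, tokenize=False), although the exact output depends on the model's own template.

```python
# Sketch only: function names are hypothetical, not part of any library.
def row_to_messages(instruction, input_text, output):
    """Convert one (instruction, input, output) row into chat messages."""
    user = instruction if not input_text else f"{instruction}\n\n{input_text}"
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": output},
    ]


def render_chatml(messages):
    """Render messages as plain ChatML text, one block per message."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
```

Keeping the data as messages makes it easier to switch later to the tokenizer's own template without touching the dataset columns.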