Training llama 3.1 8b model on odd data | Unsloth AI | Page 1

opaque delta Jul 21, 2025, 7:33 PM

I'm planning on training Llama 3.1 8B on a dataset of 2 billion Discord messages, but they're in an odd format. What can I do to get these into a form that can be trained on?

austere cobalt BOT Jul 21, 2025, 7:33 PM

To train Llama 3.1 8B on your Discord messages, you need to preprocess the data into a format compatible with Unsloth's training scripts. For conversational models, structure your data as a list of message dictionaries, each with a "role" ("user", "assistant", or "system") and "content" fields. Then, use Unsloth's chat template utilities (e.g., tokenizer.apply_chat_template) to convert these into the correct prompt format for training. See this notebook example and discussion on formatting.

If your data is not already in this structure, write a script to parse and reformat it. Ensure each conversation is a list of alternating user/assistant messages, and remove or standardize any Discord-specific artifacts. For more, see Unsloth's dataset formatting guide.

Would you like a step-by-step code example for reformatting and preparing your dataset?

Sources:

https://github.com/unslothai/unsloth/discussions/2183
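
A minimal sketch of the parse-and-reformat step described above, using only the standard library. The record fields `author` and `content`, the helper names, and the choice to treat one author as the "assistant" are assumptions for illustration; the thread does not describe the real dataset schema, so adjust them to match it.

```python
import re
from typing import Dict, List

# Discord-specific markup: user/role/channel mentions and custom emoji.
MENTION_RE = re.compile(r"<[@#][!&]?\d+>")
EMOJI_RE = re.compile(r"<a?:\w+:\d+>")
URL_RE = re.compile(r"https?://\S+")


def clean_message(text: str) -> str:
    """Strip mentions, custom emoji, and URLs, then collapse whitespace."""
    text = MENTION_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    text = URL_RE.sub("", text)
    return " ".join(text.split())


def to_conversation(records: List[Dict[str, str]], assistant_author: str) -> List[Dict[str, str]]:
    """Turn raw records (hypothetical `author`/`content` fields) into role/content turns.

    One author is mapped to "assistant" and everyone else to "user".
    Consecutive messages with the same role are merged so roles alternate.
    """
    conversation: List[Dict[str, str]] = []
    for record in records:
        content = clean_message(record["content"])
        if not content:
            continue  # message was only markup (e.g. a lone emoji)
        role = "assistant" if record["author"] == assistant_author else "user"
        if conversation and conversation[-1]["role"] == role:
            conversation[-1]["content"] += "\n" + content
        else:
            conversation.append({"role": role, "content": content})
    # Trim so the conversation starts with a user turn and ends with an assistant turn.
    while conversation and conversation[0]["role"] != "user":
        conversation.pop(0)
    while conversation and conversation[-1]["role"] != "assistant":
        conversation.pop()
    return conversation
```

Each resulting list is in the "role"/"content" shape that `tokenizer.apply_chat_template` expects. Which author should play the assistant is a modeling decision this sketch leaves to the caller.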

opaque delta Jul 21, 2025, 7:34 PM

what

https://huggingface.co/datasets/SaisExperiments/Discord-Unveiled-Compressed

this is what I'm training on

@austere cobalt how would I reformat this data specifically? I only have users, not assistant or system

austere cobalt BOT Jul 21, 2025, 7:35 PM

You should convert your dataset so each conversation is a list of message dictionaries with "role" and "content" keys, even if you only have "user" messages. For Llama-style chat training, you need alternating "user" and "assistant" roles. If you only have user messages, you can pair each user message with a placeholder assistant response (e.g., an empty string or a generic reply) to create a valid format. See Unsloth's dataset formatting guide.

Example code to reformat your data:

# Example: Convert a list of user messages to user/assistant pairs
user_messages = [...]  # Your list of Discord messages

conversations = []
for msg in user_messages:
    conversations.append([
        {"role": "user", "content": msg},
        {"role": "assistant", "content": ""}  # Placeholder response
    ])

Would you like more detail or a script tailored to the exact structure of your dataset?

Sources:

Unsloth's dataset formatting guide

opaque delta Jul 21, 2025, 7:36 PM

@austere cobalt would that not mess up the AI at all if I just have empty data?

austere cobalt BOT Jul 21, 2025, 7:36 PM

Yes, training with empty assistant responses will likely harm model quality: the model learns to output nothing or low-quality responses. For best results, you need both user and assistant turns with meaningful content. If you only have user messages, consider generating assistant responses using an existing LLM, or use your data for continued pretraining (language modeling) rather than instruction/chat fine-tuning. See Unsloth dataset guide and discussion on continued pretraining.

Would you like guidance on generating assistant responses or on continued pretraining?
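
For the continued-pretraining route, one possible preparation step is to pack messages into plain-text documents with no roles at all. The sketch below assumes hypothetical `author` and `content` fields, a `text` output field, and a character budget per document; none of these come from the thread. In real use the end-of-sequence string would be taken from the tokenizer (`tokenizer.eos_token`) rather than hard-coded.

```python
from typing import Dict, Iterable, List


def build_pretraining_records(
    messages: Iterable[Dict[str, str]],
    eos_token: str,
    max_chars: int = 2000,
) -> List[Dict[str, str]]:
    """Pack "author: content" lines into plain-text documents.

    Each document holds consecutive messages up to roughly `max_chars`
    characters and ends with `eos_token`, so the model sees where a
    document stops.
    """
    records: List[Dict[str, str]] = []
    lines: List[str] = []
    size = 0
    for message in messages:
        line = f'{message["author"]}: {message["content"].strip()}'
        # Start a new document when adding this line would exceed the budget.
        if lines and size + len(line) + 1 > max_chars:
            records.append({"text": "\n".join(lines) + eos_token})
            lines, size = [], 0
        lines.append(line)
        size += len(line) + 1  # +1 for the newline separator
    if lines:
        records.append({"text": "\n".join(lines) + eos_token})
    return records
```

A character budget is only a rough stand-in for a token budget. With 2 billion messages it would be safer to stream the data and process it in shards than to hold one list in memory.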

fickle warren Jul 21, 2025, 8:28 PM

opaque delta I'm planning on training llama 3.1 8b on a dataset of 2 billion discord message...

depends, what do you want to do with it?
