Here is the next part of the program, continuing after "return tokenized":
if __name__ == '__main__':
    # Use the GPU when one is available.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the pretrained GPT-2 tokenizer and language model.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    config = GPT2Config.from_pretrained("gpt2", output_hidden_states=False)
    model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)
    model.to(device)

    dataset = ChatDataset("path/to/your/dataset.txt", tokenizer, MAX_SEQ_LEN)
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    # The scheduler needs the real step count; with -1 the learning rate
    # drops to zero as soon as warmup ends.
    total_steps = len(dataloader) * EPOCHS
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=WARM_UP_STEPS,
        num_training_steps=total_steps,
    )

    for epoch in range(EPOCHS):
        model.train()
        for batch_num, batch in enumerate(dataloader):
            tokens = batch['input_ids'].to(device)
            mask = batch['attention_mask'].to(device)
            # Exclude padded positions from the loss (-100 is ignored).
            labels = tokens.masked_fill(mask == 0, -100)
            optimizer.zero_grad()
            outputs = model(tokens, attention_mask=mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            scheduler.step()
        print(f"Epoch {epoch+1} completed.")
The code loads the pretrained GPT-2 tokenizer and model, builds the dataset and DataLoader, and sets up the AdamW optimizer and a linear warmup scheduler. It then trains for EPOCHS epochs, running one optimizer step and one scheduler step per batch. Replace "path/to/your/dataset.txt" with the actual path to your dataset file.
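For reference, the linear warmup scheduler scales the base learning rate by a factor that ramps from 0 to 1 over the warmup steps and then decays linearly to 0 at the last training step. Below is a minimal pure-Python sketch of that factor. It assumes the total step count is the number of batches per epoch times the number of epochs. The function name and the example numbers are illustrative and not part of the program.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay phase: fall linearly from 1 to 0 at total_steps, never below 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical numbers: 100 batches per epoch, 3 epochs, 30 warmup steps.
batches_per_epoch, epochs, warmup_steps = 100, 3, 30
total_steps = batches_per_epoch * epochs

for step in (0, 15, 30, 165, 300):
    print(step, linear_warmup_factor(step, warmup_steps, total_steps))
```

If the total step count is zero or negative, the factor is 0 for every step after warmup, so training effectively stops updating the model at that point. That is why the scheduler should be given the real number of optimizer steps.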