Hello. I have the following task: given a string such as "This is a nice string", I have to tokenize it and compute the log probability of each token. I assume this means that for, e.g., "nice", I need to compute log p(nice | This is a).
This is the code I have written to achieve this:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
text = "This is a nice string."
tokens = tokenizer.encode(text, return_tensors="pt").to(device)  # inputs must be on the same device as the model
tokenized_sequence = tokenizer.convert_ids_to_tokens(tokens[0])
print(f"Tokenized sequence: {tokenized_sequence} \n")
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(tokens)
logits = outputs.logits # shape: [1, sequence_length, vocab_size]
log_probs = torch.log_softmax(logits, dim=-1)
for idx, token in enumerate(tokenized_sequence):
    token_id = tokenizer.convert_tokens_to_ids(token)
    log_prob = log_probs[0, idx, token_id].item()
    print(f"token: {token}, log prob: {log_prob}")
There is something that I'm not sure about. In the line log_prob = log_probs[0, idx, token_id].item(), should I use log_prob = log_probs[0, idx-1, token_id].item() instead? In other words, do the logits at position i give the prediction for the token at position i, using tokens 0,...,i-1 as context, or do they predict the token at position i+1, using all tokens up to and including the i-th one? In the latter case, the shift idx -> idx-1 is necessary, and the initial token should also be dealt with separately (do I just skip it, or assign it an indeterminate value?)
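To make the second interpretation concrete, here is a minimal sketch of what the shifted indexing would look like. It assumes that the logits at position i score the token at position i+1 (as far as I can tell, this matches how causal LM heads such as GPT2LMHeadModel shift the labels when computing their loss). It uses plain Python lists instead of tensors so the indexing is explicit; the helper names log_softmax and shifted_token_log_probs and the toy numbers are made up for illustration, and giving the first token None is just one possible convention.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def shifted_token_log_probs(logits, token_ids):
    """Assumes logits[i] scores the token at position i + 1.

    Returns one entry per token. The first token has no left context,
    so it gets None rather than a number (an arbitrary convention).
    """
    if len(logits) != len(token_ids):
        raise ValueError("need one row of logits per token")
    result = [None]
    for i in range(1, len(token_ids)):
        row = log_softmax(logits[i - 1])  # prediction made *before* seeing token i
        result.append(row[token_ids[i]])
    return result


# Toy vocabulary of 3 ids and a 3-token sequence (made-up numbers).
toy_logits = [
    [0.0, 0.0, 0.0],  # after token 0: uniform over the vocabulary
    [2.0, 0.0, 0.0],  # after tokens 0..1: favors id 0
    [0.0, 5.0, 0.0],  # after tokens 0..2: scores a token past the end, unused
]
toy_ids = [2, 1, 0]
print(shifted_token_log_probs(toy_logits, toy_ids))
```

With the tensors from my code above, the same alignment could presumably be written as log_probs[0, :-1].gather(-1, tokens[0, 1:].unsqueeze(-1)).squeeze(-1), which yields one value per token except the first. Note also that simply writing idx-1 inside the loop would not fail at idx = 0: the index -1 silently selects the last position, so the first token needs explicit handling either way.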