Understanding how logits work in GPT-2 | Learn AI Together | Page 1

Hello. I have the following task: given a string such as "This is a nice string", I have to tokenize it and compute the log-probability of each token. I assume this means that, for example, for "nice" I need to compute log p(nice | This is a).

This is the code I have written to achieve this:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)
model.eval()

text = "This is a nice string."
tokens = tokenizer.encode(text, return_tensors="pt").to(device)  # inputs must be on the same device as the model
tokenized_sequence = tokenizer.convert_ids_to_tokens(tokens[0])
print(f"Tokenized sequence: {tokenized_sequence} \n")

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(tokens)
logits = outputs.logits  # shape: [1, sequence_length, vocab_size]

log_probs = torch.log_softmax(logits, dim=-1)  # normalize over the vocabulary

for idx, token in enumerate(tokenized_sequence):
    token_id = tokenizer.convert_tokens_to_ids(token)  
    log_prob = log_probs[0, idx, token_id].item()      
    print(f"token: {token},  log prob: {log_prob}")

There is something I'm not sure about. In the line log_prob = log_probs[0, idx, token_id].item(), should I use log_prob = log_probs[0, idx-1, token_id].item() instead? In other words, do the logits at position i give the prediction for the token at position i, using tokens 0,...,i-1 as context, or do they predict the token at position i+1, using all tokens up to and including the i-th one? In the latter case, the shift idx -> idx-1 is necessary, and the initial token also has to be dealt with separately (do I just skip it, or assign it an indeterminate value?).
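For concreteness, here is a minimal sketch of the shifted variant. It assumes the second reading: for a causal LM such as GPT2LMHeadModel, the logits at position i score the token at position i+1. As far as I can tell, that matches how the library shifts the labels internally when it computes the loss. The helper name shifted_token_log_probs and the toy tensor sizes are my own, and the check uses random logits rather than the real model so it runs without a download. Note that a plain idx-1 would give -1 for the first token, which silently indexes the last position, so the first token is simply left out here.

```python
import torch


def shifted_token_log_probs(logits, input_ids):
    """Return log p(token_i | tokens_<i) for i = 1..n-1.

    logits:    [batch, seq_len, vocab_size], as returned by a causal LM
    input_ids: [batch, seq_len]
    Result:    [batch, seq_len - 1]; the first token has no entry
               because nothing precedes it.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Row i scores the token at position i + 1, so drop the last row
    # of predictions and the first target token.
    preds = log_probs[:, :-1, :]
    targets = input_ids[:, 1:]
    # Pick, at every position, the log-prob of the token that actually follows.
    return preds.gather(-1, targets.unsqueeze(-1)).squeeze(-1)


# Toy check with random logits (hypothetical sizes, no model needed).
torch.manual_seed(0)
toy_logits = torch.randn(1, 5, 11)
toy_ids = torch.randint(0, 11, (1, 5))
toy_scores = shifted_token_log_probs(toy_logits, toy_ids)
print(toy_scores.shape)  # torch.Size([1, 4])
```

With the real model this would be called as shifted_token_log_probs(outputs.logits, tokens), and the k-th score would belong to tokenized_sequence[k+1]. If a value for the first token is needed as well, one option I have seen is to prepend tokenizer.bos_token (for GPT-2 this is <|endoftext|>) to the text before encoding, so that the first real token also has a context. Otherwise it can just be skipped.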
