#code says to try increasing max_seq_length but that fixes nothing

sleek leaf

Hi, I'm playing around with Kaggle project 'llm-classification-finetuning' just as an excuse to try and learn about Unsloth AI. Right when I'm finally ready to train the model (after doing all the setup), the following code

trainer_stats = trainer.train()

gives me an error

KeyError                                  Traceback (most recent call last)
/tmp/ipython-input-773422404.py in <cell line: 0>()
----> 1 trainer_stats = trainer.train()

8 frames
/tmp/ipython-input-3320623149.py in torch_call(self, examples)
     22             # But we don't actually want this because number_token_ids[2] could be something like 27, which is now undefined in the new lm_head. So we map it to the new lm_head index.
     23             # if this line gives you a keyerror then increase max_seq_length
---> 24             batch["labels"][i, last_token_idx] = reverse_map[ batch["labels"][i, last_token_idx].item() ]
     25 
     26 

KeyError: 5052

I've repeatedly tried increasing max_seq_length up to 12,000 (and rerunning the whole code). The code token_counts = [len(tokenizer.encode(x)) for x in train_df.text] gives me a maximum sequence length of ~1700 so 12,000 feels like way more than enough and suggests that something else is going on here. Suggestions?

Thanks in advance 🙂

fervent rock (BOT)

The KeyError is caused by a mismatch between the label token IDs and your reverse_map dictionary, not by the max_seq_length setting. This usually happens if your custom classification head or label mapping does not cover all possible token IDs present in your data, so when the code tries to map a label token (like 5052) that's not in reverse_map, it fails. Increasing max_seq_length won't fix this, as the error is about label mapping, not sequence truncation or length.

To resolve this, ensure that your reverse_map includes every token ID that can appear as a label in your dataset, or preprocess your data so that only supported label tokens are present.
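
As a rough illustration of that check, here is a minimal sketch that lists which label token IDs are missing from the mapping. It assumes reverse_map is a plain dict from original token IDs to new head indices; the helper name find_unmapped and all IDs except 5052 are made up for the example.

```python
# Hypothetical mapping: original vocabulary token ID -> index in the trimmed lm_head.
# These IDs are invented for illustration; yours will differ.
reverse_map = {32: 0, 33: 1, 34: 2}


def find_unmapped(last_label_ids, reverse_map):
    """Return the sorted, de-duplicated label token IDs that have no entry in reverse_map."""
    return sorted({token_id for token_id in last_label_ids if token_id not in reverse_map})


# Token IDs read at last_token_idx for each example in a batch (made-up values).
last_label_ids = [32, 34, 5052, 33, 5052]

unmapped = find_unmapped(last_label_ids, reverse_map)
print(unmapped)  # prints [5052]
```

Decoding any ID this reports with your tokenizer should show which string actually sits at the label position.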

Would you like a step-by-step guide on how to adjust your label mapping or preprocess your data?

sleek leaf

@fervent rock Yes I would like a step-by-step guide on adjusting my label mapping.

fervent rock (BOT)

Certainly! Here’s a step-by-step guide to ensure your label mapping covers all possible label token IDs and avoids the KeyError:

  1. Identify All Unique Label Tokens
    Extract all unique label tokens from your dataset. For classification, these are usually the class labels (e.g., "positive", "negative", etc.).

  2. Tokenize Labels and Build Mapping
    Use your tokenizer to convert each label string into its corresponding token ID. Build a mapping from token ID to class index (and vice versa).

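    labels = ["positive", "negative"]  # replace with your actual class labels
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    # First token ID of each label string (assumes each label encodes to a single token)
    label_token_ids = {label: tokenizer.encode(label, add_special_tokens=False)[0] for label in labels}
    # Token ID -> class index, as described above (not token ID -> label string)
    reverse_map = {token_id: label2id[label] for label, token_id in label_token_ids.items()}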
    
  3. Check for Unmapped Tokens
    Before training, check if any label token in your dataset is not present in reverse_map. If so, update your mapping or clean your data.

    for example in train_dataset:
        label_token = tokenizer.encode(example["label"], add_special_tokens=False)[0]
        if label_token not in reverse_map:
            print(f"Unmapped label token: {label_token} for label {example['label']}")
    
  4. Update Training Code
    When assigning labels during training, use your mapping to ensure all label tokens are valid.

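    batch["labels"][i, last_token_idx] = reverse_map.get(batch["labels"][i, last_token_idx].item(), default_value)
    # default_value must be defined before this line: either a valid class index,
    # or -100 (PyTorch's default ignore_index) so the loss skips that position.
    # A fallback hides the mismatch, so log unmapped IDs while you debug.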
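
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: build_reverse_map and remap_label are hypothetical helpers, not part of Unsloth or transformers, and the toy encoder stands in for something like lambda s: tokenizer.encode(s, add_special_tokens=False). Unlike the .get fallback, this version fails loudly with a descriptive message.

```python
def build_reverse_map(labels, encode):
    """Map each label's single token ID to its class index.

    `encode` is any callable that returns a list of token IDs for a string.
    """
    reverse_map = {}
    for class_idx, label in enumerate(labels):
        token_ids = encode(label)
        if len(token_ids) != 1:
            raise ValueError(
                f"Label {label!r} encodes to {len(token_ids)} tokens; expected exactly 1"
            )
        token_id = token_ids[0]
        if token_id in reverse_map:
            raise ValueError(f"Label {label!r} shares token ID {token_id} with another label")
        reverse_map[token_id] = class_idx
    return reverse_map


def remap_label(token_id, reverse_map):
    """Look up a label token ID, raising a descriptive KeyError if it is unmapped."""
    if token_id not in reverse_map:
        raise KeyError(
            f"label token ID {token_id} is not in reverse_map (known IDs: {sorted(reverse_map)})"
        )
    return reverse_map[token_id]


# Toy stand-in for a tokenizer; the vocabulary and IDs are invented for illustration.
toy_vocab = {"A": 10, "B": 11, "tie": 12}


def toy_encode(text):
    return [toy_vocab[text]]


reverse_map = build_reverse_map(["A", "B", "tie"], toy_encode)
print(reverse_map)
print(remap_label(12, reverse_map))
```

If remap_label raises during training, the ID in the message is the one to decode and inspect, since it shows what the collator found at the label position.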