#i suddenly ran into this error in the STT notebook


twin cypressBOT
#

The error KeyError: 'input_features' in your STT (speech-to-text) notebook occurs because the data collator expects each feature to have an "input_features" key, but at least one item in your dataset does not contain it. This usually means your dataset preprocessing step did not correctly generate or include "input_features" for all examples, or the formatting function was not applied to the dataset before training.

To fix this, ensure you preprocess your dataset with a function that extracts audio features and creates the "input_features" and "labels" keys for each example, as shown in the official Unsloth Whisper notebook. See the relevant code and explanation in Kaggle-Whisper.py and Kaggle-Whisper.ipynb. Would you like a step-by-step guide on how to apply this preprocessing?
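Before training, it can help to confirm that every preprocessed example really carries both keys. The following is a minimal sketch using plain dictionaries as stand-ins for real examples; `find_incomplete_examples` and `toy_examples` are hypothetical names, not part of Unsloth or the official notebook.

```python
REQUIRED_KEYS = ("input_features", "labels")

def find_incomplete_examples(examples, required=REQUIRED_KEYS):
    """Return (index, missing_keys) for every example lacking a required key."""
    problems = []
    for index, example in enumerate(examples):
        missing = [key for key in required if key not in example]
        if missing:
            problems.append((index, missing))
    return problems

# Toy stand-ins (assumption): real examples hold feature arrays and token ids.
toy_examples = [
    {"input_features": [[0.0]], "labels": [1, 2]},
    {"audio": "clip.wav", "text": "hello"},  # never went through preprocessing
]
report = find_incomplete_examples(toy_examples)
print(report)
```

An empty report means every example has both keys; any entry points at the index that would trigger the KeyError in the collator.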


forest urchin
#

got it

forest urchin
#

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 15,728,640 of 1,559,219,200 (1.01% trained)
Unsloth: Not an error, but WhisperForConditionalGeneration does not accept num_items_in_batch.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient

TypeError Traceback (most recent call last)
/tmp/ipython-input-773422404.py in <cell line: 0>()
----> 1 trainer_stats = trainer.train()

13 frames
/usr/local/lib/python3.12/dist-packages/peft/tuners/tuners_utils.py in forward(self, *args, **kwargs)
220
221 def forward(self, *args: Any, **kwargs: Any):
--> 222 return self.model.forward(*args, **kwargs)
223
224 def _pre_injection_hook(self, model: nn.Module, config: PeftConfig, adapter_name: str) -> None:

TypeError: WhisperForConditionalGeneration.forward() got an unexpected keyword argument 'input_ids'


forest urchin
#

@twin cypress

twin cypressBOT
#

Your error TypeError: WhisperForConditionalGeneration.forward() got an unexpected keyword argument 'input_ids' typically means the data collator or training loop is passing "input_ids" to Whisper, but Whisper expects "input_features" for audio input, not "input_ids" (which is for text models). This usually happens if you use a text data collator or the dataset is not formatted for speech.

To resolve, ensure you use a data collator and dataset format that provide "input_features" (not "input_ids") as input, as shown in the official Unsloth Whisper notebook (Kaggle-Whisper.py). Would you like a code example for the correct data collator and dataset formatting?
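One way to see this kind of mismatch ahead of time is to compare the batch keys against the parameters a forward function declares. This is a rough sketch using only the standard library; `unexpected_kwargs` and `speech_forward` are hypothetical, and the stand-in signature only mimics a speech model rather than reproducing Whisper's real one.

```python
import inspect

def unexpected_kwargs(forward_fn, batch_keys):
    """List the batch keys that forward_fn's signature would reject."""
    params = inspect.signature(forward_fn).parameters
    # A **kwargs parameter accepts any name, so nothing can be rejected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [key for key in batch_keys if key not in params]

# Hypothetical stand-in whose signature mimics a speech model's forward().
def speech_forward(input_features=None, labels=None, attention_mask=None):
    return None

bad = unexpected_kwargs(speech_forward, ["input_ids", "labels"])
good = unexpected_kwargs(speech_forward, ["input_features", "labels"])
print(bad, good)
```

In a notebook the check would need the underlying model's own forward method, since a wrapper that declares `**kwargs` (as the PEFT forward in the traceback does) reports nothing.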


forest urchin
#

I'm running the same official notebook, but my dataset isn't in HF format, so I had to modify it a bit @twin cypress

twin cypressBOT
#

If your dataset isn't in HuggingFace (HF) format, you must convert it to a HuggingFace Dataset and ensure each example contains the required keys ("input_features" and "labels") for Whisper. Use Dataset.from_list or Dataset.from_dict to wrap your data, and apply a preprocessing function to extract audio features and create these keys, as shown in the official notebook (Kaggle-Whisper.py).

If you skip this conversion or the preprocessing step, you will encounter errors like the ones above. Would you like a code snippet showing how to convert a list of dicts to a HuggingFace Dataset and preprocess it for Whisper?


forest urchin
#

yes please @twin cypress

twin cypressBOT
#

Here's how to convert a list of dicts to a HuggingFace Dataset and preprocess it for Whisper in Unsloth:

from datasets import Dataset, Audio
import tqdm

# Suppose your data is a list of dicts: [{"text": ..., "audio": ...}, ...]
# Wrap it as a HuggingFace Dataset
dataset = Dataset.from_list(your_data_list)
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Preprocessing function (adapted from Unsloth's official notebook)
def formatting_prompts_func(example):
    audio_arrays = example['audio']['array']
    sampling_rate = example["audio"]["sampling_rate"]
    features = tokenizer.feature_extractor(audio_arrays, sampling_rate=sampling_rate)
    tokenized_text = tokenizer.tokenizer(example["text"])
    return {
        "input_features": features.input_features[0],
        "labels": tokenized_text.input_ids,
    }

# Apply preprocessing
processed_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset, desc='Preprocessing')]

This ensures each example has "input_features" and "labels" as required by Unsloth Whisper training (Kaggle-Whisper.py). Let me know if you need more detail.


forest urchin
#

TypeError: WhisperForConditionalGeneration.forward() got an unexpected keyword argument 'input_ids'
still hitting me with the same error @twin cypress

twin cypressBOT
#

This error means your training loop or data collator is still passing "input_ids" to WhisperForConditionalGeneration, but Whisper expects "input_features" as input, not "input_ids". You must ensure your data collator and dataset only provide "input_features" (audio features) and "labels" (tokenized text), not "input_ids" (Kaggle-Whisper.py).

Double-check your data collator and preprocessing function. If you use HuggingFace's default data collator, it will expect "input_ids"; instead, use a custom collator like the one in the official notebook, which pads "input_features" and "labels". Would you like sample data collator code?


forest urchin
#

yes @twin cypress

twin cypressBOT
#

Here is a sample data collator for Whisper in Unsloth, adapted from the official notebook. This ensures only "input_features" and "labels" are passed to the model:

from dataclasses import dataclass
from typing import Any, Dict, List, Union
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels

        return batch

Use this collator in your Seq2SeqTrainer as shown in Kaggle-Whisper.py. This will prevent "input_ids" from being passed to WhisperForConditionalGeneration.
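The label handling in that collator can be illustrated without torch. This is a simplified pure-Python sketch of the same idea (pad with -100, then drop a leading BOS token shared by every row); `pad_and_mask_labels` is a hypothetical helper, the token ids are arbitrary toy values, and every label list is assumed to be non-empty.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss ignores this target value by default

def pad_and_mask_labels(label_lists, bos_token_id):
    """Pad label rows to equal length with IGNORE_INDEX, then drop a shared leading BOS."""
    max_len = max(len(labels) for labels in label_lists)
    padded = [labels + [IGNORE_INDEX] * (max_len - len(labels)) for labels in label_lists]
    # Mirror the collator: strip the first column only if every row starts with BOS.
    if all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded

# Toy token ids (assumption): 50258 plays the role of the BOS token here.
toy_labels = [[50258, 7, 8, 9], [50258, 7]]
batch_labels = pad_and_mask_labels(toy_labels, bos_token_id=50258)
print(batch_labels)
```

The padded positions become -100 so they do not contribute to the loss, which is the same effect the `masked_fill` call has on the tensor version.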


forest urchin
#

processed_train_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset, desc='Train preprocessing')]

--- Test dataset ---

dataset = Dataset.from_list(test_data_list)
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

processed_test_dataset = [formatting_prompts_func(example) for example in tqdm.tqdm(dataset, desc='Test preprocessing')]
@twin cypress

twin cypressBOT
#

Your code for preprocessing is correct for non-HF datasets: you convert your list of dicts to a HuggingFace Dataset, cast the audio column, and apply the formatting function to produce a list of dicts with "input_features" and "labels" (Kaggle-Whisper.py). This is the expected approach for Unsloth Whisper fine-tuning.

Just ensure that when you pass processed_train_dataset and processed_test_dataset to the trainer, you use a custom data collator (not the default one) that expects "input_features" and "labels", as shown previously. Would you like a full example of integrating this with the trainer?


forest urchin
#

yes @twin cypress

twin cypressBOT
#

Here is a full example of integrating your processed datasets with the trainer and the correct data collator for Unsloth Whisper:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from unsloth import is_bf16_supported

# Use the custom data collator from earlier
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=tokenizer)

trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_test_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer.feature_extractor,
    args=Seq2SeqTrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=1e-4,
        logging_steps=1,
        optim="adamw_8bit",
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        weight_decay=0.01,
        remove_unused_columns=False,
        lr_scheduler_type="linear",
        label_names=['labels'],
        eval_steps=5,
        eval_strategy="steps",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

trainer_stats = trainer.train()

This setup matches the official Unsloth Whisper notebook and ensures only "input_features" and "labels" are passed to the model (Kaggle-Whisper.py). Let me know if you need more help.


forest urchin
#

@twin cypress still same error

twin cypressBOT
#

If you still get TypeError: WhisperForConditionalGeneration.forward() got an unexpected keyword argument 'input_ids', one thing to rule out is that your processed_train_dataset and processed_test_dataset are plain Python lists rather than HuggingFace Datasets. The list-based preprocessing shown earlier is a valid approach, so this may not be the cause, but converting removes one variable.

To fix this, convert your processed lists back to HuggingFace Datasets before passing them to the trainer:

from datasets import Dataset

train_dataset = Dataset.from_list(processed_train_dataset)
eval_dataset = Dataset.from_list(processed_test_dataset)

Then use train_dataset and eval_dataset in your Seq2SeqTrainer. This may resolve the error (Kaggle-Whisper.py). Would you like a full code block with this change?


forest urchin
#

I did check the data:
dict_keys(['input_features', 'labels'])
dict_keys(['input_features', 'labels'])
torch.Size([2, 128, 3000])
torch.Size([2, 28])
but it still hits me with the same error

#

@twin cypress

twin cypressBOT
#

Your data structure is correct (dict_keys(['input_features', 'labels'])), but if you are still getting TypeError: WhisperForConditionalGeneration.forward() got an unexpected keyword argument 'input_ids', it is likely that the data collator or the trainer is still passing "input_ids" instead of "input_features" to the model. This can happen if you use HuggingFace's default data collator or if the trainer is not properly configured for speech models.

Make sure:

  • You use a custom data collator (like DataCollatorSpeechSeq2SeqWithPadding) that only returns "input_features" and "labels".
  • You pass the datasets as HuggingFace Dataset objects, not lists.
  • You do NOT set the tokenizer argument in Seq2SeqTrainer to a text tokenizer; for Whisper, use the processor or set tokenizer=None.

If you want a minimal working example, let me know.
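Since the printed batch keys already look right, another debugging angle is to record which keyword names actually reach a forward call. The sketch below wraps any callable and logs the names it receives; `record_kwarg_names` and `toy_forward` are hypothetical, and the toy function only stands in for a model's forward method.

```python
import functools

def record_kwarg_names(fn, log):
    """Wrap fn so each call appends the sorted keyword names it received to log."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.append(sorted(kwargs))
        return fn(*args, **kwargs)
    return wrapper

# Toy stand-in (assumption): a real forward() would run the model instead.
def toy_forward(**kwargs):
    return len(kwargs)

seen = []
traced = record_kwarg_names(toy_forward, seen)
result = traced(input_features=[0.0], labels=[1])
print(seen, result)
```

If "input_ids" shows up in the log even though the collator output lacks it, the key is presumably being added somewhere between the collator and the model rather than by the dataset.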


forest urchin
#

yes @twin cypress

twin cypressBOT
#

Here is a minimal working example for Unsloth Whisper fine-tuning with a custom dataset:

from datasets import Dataset
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from unsloth import is_bf16_supported

# Assume processed_train_dataset and processed_test_dataset are lists of dicts with 'input_features' and 'labels'
train_dataset = Dataset.from_list(processed_train_dataset)
eval_dataset = Dataset.from_list(processed_test_dataset)

# Custom data collator (from previous messages)
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=tokenizer)

trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=None,  # Important: do NOT set to a text tokenizer for Whisper
    args=Seq2SeqTrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=1e-4,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        remove_unused_columns=False,
        output_dir="outputs",
        report_to="none",
    ),
)

trainer_stats = trainer.train()

This setup ensures only "input_features" and "labels" are passed, and avoids the "input_ids" error (Kaggle-Whisper.py). Let me know if you need further troubleshooting.


forest urchin
#

/tmp/ipython-input-530297811.py:4: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Seq2SeqTrainer.__init__. Use processing_class instead.
trainer = Seq2SeqTrainer(
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 15,728,640 of 1,559,219,200 (1.01% trained)

TypeError Traceback (most recent call last)
/tmp/ipython-input-530297811.py in <cell line: 0>()
21 )
22
---> 23 trainer_stats = trainer.train()

46 frames
/usr/local/lib/python3.12/dist-packages/peft/tuners/tuners_utils.py in forward(self, *args, **kwargs)
220
221 def forward(self, *args: Any, **kwargs: Any):
--> 222 return self.model.forward(*args, **kwargs)
223
224 def _pre_injection_hook(self, model: nn.Module, config: PeftConfig, adapter_name: str) -> None:

TypeError: WhisperForConditionalGeneration.forward() got an unexpected keyword argument 'input_ids'
@twin cypress

twin cypressBOT
#

This error occurs because the Seq2SeqTrainer or your data collator is passing "input_ids" to WhisperForConditionalGeneration, but Whisper expects "input_features" for audio input, not "input_ids" (which is for text models). The warning about tokenizer being deprecated also suggests you should use the processing_class argument instead (Kaggle-Whisper.py).

Solution:

  • Use a custom data collator that only returns "input_features" and "labels".
  • In Seq2SeqTrainer, set processing_class=tokenizer (where tokenizer is your WhisperProcessor), and do not set the tokenizer argument.
  • Ensure your dataset is a HuggingFace Dataset with only "input_features" and "labels".

Would you like a code snippet showing the correct trainer setup?
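Because the argument name depends on the installed version, one option is to pick it by inspecting the trainer's constructor. This is only a sketch: `processor_kwarg` is a hypothetical helper, and the two classes below are stand-ins for newer and older trainer signatures, not the real Seq2SeqTrainer.

```python
import inspect

def processor_kwarg(trainer_cls, processor):
    """Return {"processing_class": processor} if supported, else {"tokenizer": processor}."""
    params = inspect.signature(trainer_cls.__init__).parameters
    name = "processing_class" if "processing_class" in params else "tokenizer"
    return {name: processor}

# Hypothetical stand-ins for a newer and an older trainer signature.
class NewStyleTrainer:
    def __init__(self, model=None, processing_class=None):
        self.processing_class = processing_class

class OldStyleTrainer:
    def __init__(self, model=None, tokenizer=None):
        self.tokenizer = tokenizer

new_kwargs = processor_kwarg(NewStyleTrainer, "processor")
old_kwargs = processor_kwarg(OldStyleTrainer, "processor")
print(new_kwargs, old_kwargs)
```

The returned dictionary could then be unpacked into the trainer call with `**`, which avoids the FutureWarning on versions that accept processing_class.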


forest urchin
#

yes @twin cypress