#Fine-tune video in Qwen3

12 messages · Page 1 of 1 (latest)

civic sphinx
#

How fine tune by video on Qwen3

sharp hamletBOT
#

Unsloth supports fine-tuning Qwen3-VL (the vision-language variant) with video inputs, but official example notebooks for video are not yet released. The process involves preparing your dataset with video-text pairs, using the FastVisionModel class, and ensuring your data collator and processor can handle video frames. However, current Unsloth and transformers implementations may require custom modifications to vision_utils.py to properly process video data, as some users report errors when passing video inputs directly. See the discussion and code examples in the linked issues for workarounds and community patches. GGUF export for Qwen3-VL is not supported yet, so deployment is limited to transformers/vLLM for now.
For more detail or step-by-step code, would you like a breakdown?

Sources:

civic sphinx
#

@sharp hamlet finetune video on qwen3-vl

sharp hamletBOT
#

Unsloth supports fine-tuning Qwen3-VL with video inputs, but there is no official video fine-tuning notebook yet. You must format your dataset with video-text pairs and may need to modify vision_utils.py to handle video frames, as direct video support is still being improved. Community discussions confirm it works with custom code, but errors can occur if the processor expects images and receives empty lists for videos.
Would you like a step-by-step code example or more detail?

Sources:

civic sphinx
#

@sharp hamlet i fine tune with img teach the new knowledge. The model finetuned will have knowledge with video input ??. Train with image and test with video

sharp hamletBOT
#

No, if you fine-tune Qwen3-VL with only images, the model will not reliably generalize to video inputs. Video inputs require specific handling (e.g., multiple frames, temporal context) that is not learned from image-only training. To enable video understanding, you must include video data during fine-tuning.
Would you like more detail on dataset preparation or model configuration for video?

Sources:

civic sphinx
#

@sharp hamlet but i want teach some detail but in video have many detail how i focus the knowledge i want teach ??

sharp hamletBOT
#

It seems likely that to focus the model on specific details in videos during fine-tuning, you should annotate or structure your training data so that prompts/questions and target answers highlight only the aspects you want the model to learn. This way, the model associates those video segments or frames with the desired knowledge, rather than all possible details.
Would you like more detail on how to structure such a dataset or examples of annotation strategies?

Sources:

rocky trail
#

@sharp hamlet Is pyAV necessary to finetune video in Qwen3?

sharp hamletBOT
#

No, pyAV is not required to fine-tune video with Qwen3-VL in Unsloth. The recommended approach is to use the decord library for video frame extraction, as seen in user discussions and code examples. There is no mention of pyAV as a dependency in the official Unsloth documentation or community guides for video fine-tuning with Qwen3-VL models—decord is the library referenced for handling video inputs and frame sampling during preprocessing and data collation for vision-language models in Unsloth workflows.

Would you like more detail on how to set up video fine-tuning with decord and Unsloth?

Sources:

rocky trail
#

Yes, how would one use decord @sharp hamlet for video finetuning qwen3-vl? I am getting a pyav error.

sharp hamletBOT
#

To use decord for video finetuning with Qwen3-VL in Unsloth, ensure your data collator and video processing functions use decord for frame extraction. The Unsloth community recommends patching or extending the data collator to process videos with decord, bypassing pyAV. If you get a pyAV error, it's likely because the default video loader is falling back to pyAV—make sure decord is installed and explicitly used in your video fetching logic.

Here's a simplified example of a custom fetch_video function using decord, and a custom data collator:

import decord
from PIL import Image
import tempfile
import requests
import os

def fetch_video(ele):
    video_source = ele.get("video") or ele.get("video_url")
    if isinstance(video_source, str) and video_source.startswith(("http://", "https://")):
        response = requests.get(video_source, stream=True)
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
        temp_file.write(response.content)
        video_path = temp_file.name
        temp_file.close()
    else:
        video_path = video_source

    vr = decord.VideoReader(video_path)
    total_frame_num = len(vr)
    frame_indices = list(range(0, total_frame_num, max(1, total_frame_num // 8)))  # sample 8 frames
    frames = vr.get_batch(frame_indices).asnumpy()
    pil_frames = [Image.fromarray(frame) for frame in frames]
    if os.path.exists(video_path) and video_path.startswith(tempfile.gettempdir()):
        os.unlink(video_path)
    return pil_frames