#Fine-tuning takes very long on H200 and 1B params
this is my code
```python
%%capture
%uv pip uninstall unsloth unsloth_zoo trl transformers -y
import re
import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
%uv pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
%uv pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
%uv pip install --no-deps unsloth
%uv pip install transformers==4.56.2
%uv pip install --no-deps trl==0.22.2
%uv pip install wandb
```
```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "google/gemma-3-1b-it",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
```
output
```
:sloth: Unsloth: Will patch your computer to enable 2x faster free finetuning.
:sloth: Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.4: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA H200. Num GPUs = 1. Max memory: 139.801 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu129. CUDA: 9.0. CUDA Toolkit: 12.9. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
```
```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
```
```python
import os
from datasets import load_dataset  # needed for load_dataset below

# Load datasets from HuggingFace Hub
train_ds = load_dataset("Reubencf/konkani-gemma-train")
validate_ds = load_dataset("Reubencf/konkani-gemma-validate")

# Get the train split from each dataset - use the correct variable names!
train_dataset = train_ds["train"]
eval_dataset = validate_ds["train"]

os.environ["WANDB_API_KEY"] = ".................................................................."
```
```python
from trl import SFTTrainer, SFTConfig  # needed for the trainer and its config

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True, # Can make training 5x faster for short sequences.
    args = SFTConfig(
        output_dir="./konkani-llama3.1-8b-instruct",
        hub_model_id="Reubencf/konkani-llama3.1-8b-instruct", # Optional: custom repo name
        push_to_hub=True, # Upload after each save
        hub_strategy="checkpoint",
        num_train_epochs=2,
        save_steps=500,
        eval_steps=500,
        group_by_length = False,
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 8, # Effective batch size = 8 * 8 = 64 sequences
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        weight_decay=0.01,
        max_grad_norm=1.0,
        seed = 3407,
        bf16=True,
        dataloader_num_workers = 4,
        dataloader_pin_memory = True,
        report_to = "none", # Use TrackIO/WandB etc
        # Eval & logging
        logging_steps=100,
        logging_first_step = True,
        eval_strategy="steps", # Evaluate every eval_steps steps
        save_strategy="steps", # Save every save_steps steps
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
)
```
```python
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```
```
2.102 GB of memory reserved.
```
```python
trainer_stats = trainer.train()
```
please provide a sample dataset instead of Reubencf/konkani-gemma-train, as that's gated/locked. you don't need to share your dataset, just make a minimal reproducible one of similar length
i can give u access hold up
whats your hf username
done
hold on
14/2884 [01:54<5:31:05, 6.92s/it]
You are literally running 64 batch size on 4k tokens
I think I told you this like 5 times
You throttle the compute, that's the limit of your GPU
You also have packing enabled on top of that
Your GPU can only process so many tokens per second, it can't go higher
Regardless of how much VRAM you use
yes but reducing it doesn't change much
will check
it shows higher time
ah ok will check hold on
which GPU are u using by the way ?
h200
why is your hub_model_id "konkani-llama3.1-8b-instruct" when you train and load gemma 3 btw?
oh i didnt change it
and why does your dataset include chat template tokens?
{"text": "<start_of_turn>user\nYou are a knowle
.<end_of_turn>"}
im directly converting it to chat template
and providing it to the model
wait really ?
I suspect that could be messing with your stuff
I'm not sure but I see that as an issue at least, even if it's not performance related
but if you load it as raw text I'm not sure
but regardless that's not how you should do it
your dataset should not contain template tokens
hold on.
wait so it's wrong to directly convert it to those templates and feed it? that means i have to give raw text and let unsloth convert it. let me try that and see
sec
anything else ?
Hold on 🙂
Does your dataset contain 1 turn each or multi turn?
Yeah it seems that's the time it will take for your training
It's not an issue as far as I can see
the performance
Even if you hit 100% utilization, it doesn't matter what batch size you use, you still iterate over 2 epochs
and you have 92K entries
that takes 5-6 hours or so
That it/s figure is misleading
and what about the chat template part
the time is correct
To be safe you should keep raw text and format it in code, in case you change models etc
you don't want to have template tokens within the dataset
ok cause i made dataset with template for each model lol
but yeah ill stick to raw text
and then format it
in the code
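A minimal sketch of the "keep raw text, format it in the code" approach from the messages above. The thread doesn't show the raw schema, so the `prompt` and `response` column names are hypothetical, and `tokenizer` is assumed to be the object returned by `FastLanguageModel.from_pretrained`, which exposes the standard Hugging Face `apply_chat_template` method. Stripping a leading `<bos>` is also an assumption worth checking against your own setup.

```python
def to_text(example, tokenizer):
    """Build the training string from raw fields using the model's own chat template."""
    # `prompt` / `response` are hypothetical column names; adapt them to your dataset.
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,               # return a string, not token ids
        add_generation_prompt=False,  # full conversation, no trailing model header
    )
    # Assumption: the template emits a leading <bos> and the trainer adds its own
    # when tokenizing, so drop it here to avoid a doubled token.
    return {"text": text.removeprefix("<bos>")}


# Usage (not executed here), assuming `train_dataset` holds the raw columns:
# train_dataset = train_dataset.map(lambda ex: to_text(ex, tokenizer))
```

Because the template comes from the tokenizer, switching to a different base model only changes which tokenizer is passed in; the stored dataset stays the same.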
but otherwise everything is fine right ?
yeah
normal
just clean the data and run it properly, even if it works with hacky tokens
dont do that
that's it
thank you very much @thorny perch
gl hf!
close this thread if you consider it finished btw
or mark as solved rather
1143/92238 [04:29<6:12:22, 4.08it/s]
that's on 2 batch size
for insight, same perf
batch size 2 and grad 1 ?
yeah just to test
the compute wont go faster than 100% of its capability regardless
so run with whatever batch size doesn't throttle you for some reason
if its 5-6 hours that's normal
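To make the "same perf" point concrete, here is a rough back-of-the-envelope check using only the two progress bars posted in this thread. It treats one "it" as one optimizer step, which is an assumption about how the progress bar counts; the helper names are made up for illustration.

```python
def sequences_per_second(per_device_batch, grad_accum, seconds_per_step):
    """Packed sequences consumed per wall-clock second."""
    return per_device_batch * grad_accum / seconds_per_step


def total_hours(total_steps, seconds_per_step):
    """Projected wall-clock time for the whole run, in hours."""
    return total_steps * seconds_per_step / 3600


# Run 1 from the thread: batch 8 x grad_accum 8, 6.92 s/it, 2884 steps.
big_rate = sequences_per_second(8, 8, 6.92)
big_hours = total_hours(2884, 6.92)

# Run 2 from the thread: batch 2 x grad_accum 1, 4.08 it/s, 92238 steps.
small_rate = sequences_per_second(2, 1, 1 / 4.08)
small_hours = total_hours(92238, 1 / 4.08)

print(f"batch 64: {big_rate:.2f} seq/s, {big_hours:.1f} h total")
print(f"batch 2:  {small_rate:.2f} seq/s, {small_hours:.1f} h total")
```

Both configurations land at roughly 8 to 9 sequences per second and around 5.5 to 6.3 hours in total, so the step counter changes by a factor of 32 while the actual throughput barely moves. If packing fills each sequence close to `max_seq_length = 4096`, that is on the order of 33k to 38k tokens per second either way.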
im worried because later on i will be trying higher params thats why
If you go higher params and VRAM doesn't fit, you can lower device batch size and turn up grad_accum to avoid hitting vram
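The trade-off described here can be sketched as a small helper that keeps the effective batch size fixed while the per-device batch shrinks. The function name is hypothetical; the only real knobs are `per_device_train_batch_size` and `gradient_accumulation_steps` from the config above.

```python
def grad_accum_for(effective_batch, per_device_batch, num_gpus=1):
    """Gradient accumulation steps needed to keep the effective batch size fixed."""
    per_step = per_device_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of per-device batch * GPUs")
    return effective_batch // per_step


# Keep the 64-sequence effective batch from this thread on a single GPU.
for per_device in (8, 4, 2, 1):
    print(per_device, grad_accum_for(64, per_device))
```

Lower per-device batch sizes reduce peak activation memory, while the optimizer still sees gradients averaged over the same 64 sequences per update.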
all good seems fine
I suggest you train on 5-10K entries max and see how your parameters perform, maybe even fewer, 1-5K entries
once you decide on parameters you scale up the dataset entries
it's not productive to do parameter tuning on 5-6 hours runs
30min max.
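One way to carve out such a pilot subset, sketched under the assumption that `train_dataset` is a Hugging Face `datasets.Dataset` (which provides `shuffle(seed=...)` and `select(indices)`). The helper name and the default of 5000 are illustrative only.

```python
def pilot_subset(dataset, n=5000, seed=3407):
    """Return a random slice of at most n examples for quick hyperparameter runs."""
    n = min(n, len(dataset))  # never ask for more rows than exist
    return dataset.shuffle(seed=seed).select(range(n))


# Usage (not executed here):
# small_train = pilot_subset(train_dataset, n=5000)
# trainer = SFTTrainer(..., train_dataset=small_train, ...)
```

Fixing the seed keeps the subset identical between runs, so differences in eval loss come from the hyperparameters rather than from a different sample.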
well i did the performance was bad
even 1 hour depending on case
since im fine-tuning a low resource language
yeah in that case go higher on rank and enable rslora
cant do that since im hosting it via serverless lora inference
run with for example 256 rank or 128, enable rslora, and set alpha to 16 or 32
64 is max
oh right
thats the catch
you can still enable rslora
set alpha to approx 16
try that
once
see if that helps
gl!
Remove my access from your HF btw
64 rank and alpha 16 ? why not 128 ?
rslora scales alpha differently mathematically
with a rank of 64, alpha of 16 or even 8 is good
or somewhere between
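The reason alpha has to change: standard LoRA multiplies the adapter update by `alpha / r`, while rank-stabilized LoRA (`use_rslora = True`) uses `alpha / sqrt(r)`. A small sketch of that arithmetic, with a hypothetical helper name:

```python
import math


def lora_scale(alpha, r, use_rslora=False):
    """Multiplier applied to the LoRA update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


print(lora_scale(128, 64))                   # current config: standard LoRA
print(lora_scale(128, 64, use_rslora=True))  # same alpha with rsLoRA switched on
print(lora_scale(16, 64, use_rslora=True))   # suggested alpha = 16
print(lora_scale(8, 64, use_rslora=True))    # suggested alpha = 8
```

At rank 64 the current setup (alpha 128, no rsLoRA) scales the update by 2. Turning on rsLoRA without touching alpha would raise that to 16, whereas alpha 16 gives 2 again and alpha 8 gives 1, which is why the suggested range sits between 8 and 16.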
cause im worried the output won't come out good
ok
remember change alpha accordingly
ok thanks