#finetune
- makeup
all within one LoRA
@hollow spruce so would that be a separate img folder for each, or would you keep each concept as recurring common token in the captions?
but how exactly do you develop each concept so that you can visualise the group of images that contain that concept feature ?
my personal rule of thumb is - 100 images tagged with the same concept to work properly, when I train them together into one finetune style lora
for instance in Everydream, you can use yaml args in a folder to inject a caption token during training, so each folder would group the images for that concept
I tried character cosplay. But when I use the same prompt to reproduce the image after training, the cosplay tag usually messes things up. Should we remove the cosplay tag from training?
i guess in your case you are doing it manually on Danbooru tag editor
I use hydrus network. (free)
a bit of effort to learn - but scales well for big manually tagged datasets
yea, i've heard about hydra, but isnt that more for anime?
it works once you enable clip training, and have a big enough dataset
the moment I hit 2k images in total, things started to work out
whereas i use instagram and photo websites to scrape
at 4k images right now, it's pretty damn good
Oh, nice to know
but if you do just the cosplay training, and use keep n tokens = 1, and shuffle rest, then it should work with around 150 images, and 400 images for "close-to-perfect" results
(I tried it with a Nier Automata 2B cosplay) <- no clip training, since '2b cosplay' was already understood, just not producing the right images most of the time
it learned a total of around 10 concepts, with the '2b cosplay' working roughly 4/5 times to produce an image I consider good enough to post
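For anyone unsure what "keep n tokens = 1 + shuffle" does to a caption, here is a minimal sketch of the idea in plain Python. It is not kohya's actual code, and the example tags in the usage note are made up for illustration.

```python
import random
from typing import Optional


def shuffle_caption(caption: str, keep_tokens: int = 1,
                    rng: Optional[random.Random] = None) -> str:
    """Keep the first `keep_tokens` comma-separated tags in place, shuffle the rest."""
    rng = rng or random.Random()
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(tail)  # only the tags after the kept prefix move around
    return ", ".join(head + tail)
```

Calling `shuffle_caption("2b cosplay, white hair, blindfold, black dress", keep_tokens=1)` always returns a caption that starts with `2b cosplay`, with the other three tags in a random order each time.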
I see, thanks
Guys do you have a tip for a good tutorial on training loras for SDXL with google colab?
Is this for fine tune checkpoint?
lora
while normal loras stay similar in training, with sdxl we now have options for bigger and much more complex loras, which approach finetune level of improvements - across hundreds of concepts
Hello, everyone. I'm using scale_weight_norms = 1 but my average key norm doesn't stop going up and eventually approaches 1 and my keys scaled become very high after the 4th epoch. Any way I can fix this?
Ok, gonna digest this. So, need to start plotting working concepts like a fine tune, got to be organized to know the concepts learned in prompting
My main use case for this is to attempt mixing characters... like multiplying concepts, like two characters generalized. Does that make sense? That's sort of been my desire for a while. Will give it a go this week. Any recommendations besides the ones you've given me already?
If you have one concept with 50 images, and one with 100, would you do 2 repeats on the set with 50? Or on the 100? I struggle with that
at least that is the theory behind it
Yah the kohya_ss page says its original intent was to allow you to match up with #reg images, but like you stated there is probably good reason to use with multiple concepts even if not using regs
I can't vouch for how well it works, since whenever I did that, it didn't really pay off, as I didn't have enough images to make it work (20 images repeated 5 times, really aren't enough to fully teach a concept - so not worth trying to make this work if you're teaching multiple concepts into one lora - better to just make 2 loras if the difference is that big)
on the other hand - for concepts where I have 200 images, and another with 700 images, I don't bother repeating the smaller one, as they'll still get learned at roughly the same rate
(the bigger dataset gets learned slower, but more flexibly - balancing everything out)
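If you do want to balance with repeats, the usual arithmetic is to put the repeats on the smaller set so each concept is seen roughly equally often per epoch. A rough sketch, assuming the goal is simply "images x repeats is about the same for every concept" (the concept names are placeholders):

```python
def balanced_repeats(image_counts: dict) -> dict:
    """Pick a repeat count per concept so images * repeats is roughly equal."""
    largest = max(image_counts.values())
    # The biggest set gets 1 repeat; smaller sets are repeated to catch up.
    return {name: max(1, round(largest / count))
            for name, count in image_counts.items()}
```

For the 50 vs 100 case this gives 2 repeats on the 50-image set and 1 on the 100-image set, which in kohya's folder naming would be something like `2_conceptA` and `1_conceptB`.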
sub 100 images 🤷‍♂️ hard to tell without a lot of testing
I usually find it easier to just increase my dataset, than messing with repeat settings
Yah what I saw was one burns out the whole model before the other is trained
wonder what the documentation is on about then.
I'm getting the vibes that adaptive algos aren't for fine tuning small datasets but rather for custom larger models
Getting way better with adamW8bit on my Lora
Guess I'll stick to constant/cosine schedulers for Lora's
And might go back to adaptive for say a large diverse dataset for a custom model
Is it crucial to ensure that the num_train_images evenly divides the train_batch_size?
It won't stop the training, but idk if it has an effect on the output.
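A quick way to see what an uneven division means in practice. This is a sketch that assumes the trainer keeps the final partial batch instead of dropping it (with aspect ratio bucketing, each bucket can end up with its own partial batch, so real numbers may differ):

```python
def steps_per_epoch(num_images: int, repeats: int, batch_size: int):
    """Return (steps, size_of_last_batch) for one epoch."""
    samples = num_images * repeats
    steps = -(-samples // batch_size)      # ceiling division
    remainder = samples % batch_size
    last_batch = remainder if remainder else batch_size
    return steps, last_batch
```

With 230 images, 10 repeats and batch size 8 that is 288 steps, and the last step only sees 4 samples instead of 8.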
when does stablecode come out to use
Do you separate each concept's images into different sub folders? I like the idea of making a huge lora with multiple concepts. I might try this too. Are you planning to merge it with a checkpoint in the end or leave it as a lora?
hmmm I wonder what learning rate I should use for thousands of images
Not sure if this is the right place for this, but I got sdxl lora running at 1024x1024 on a 16gb gpu. My loss is crap, but I will experiment more. I'm happy to share my settings if people want to contribute and make suggestions.
depends. If all the concepts you're training fall under some greater class, then you don't need different folders.
In my case I currently have two folders: 1_girl, 1_woman <- since one of the goals of my lora is to give these two words very specific age brackets
but even if I put everything into 1_woman, it would still work
However, additional folders are great if you're lacking enough images for a specific concept! My "tracer cosplay" concept felt undertrained, so I added an additional folder 1_tracer cosplay, exported all those tagged images there once more, and now it learned it just how I wanted it (keep in mind, those images are now essentially duplicated - since they're already a subset of the woman/girl images. But this time they got loaded into kohya with the class prompt "tracer cosplay" - so it worked out better)
in short, once you have around 2k images, you can use various methods to train your lora, and they will all work 'good enough'. The magic happens when you make all the important concepts work with weights of the lora set to 1. and with no need for (tracer cosplay:1.4) or anything like that
ah thanks, yeah that makes sense. I guess part of the fun really is to experiment yourself to get that perfect balance.
My idea at the moment is to make a lora which focuses on jdrama/movies, so that I can get the style of my favourite different dramas and characters all in one lora!
What are best implementations available for the following methods to fine-tune the sdxl?
-- full fine tuning
-- Dreambooth + LoRA
-- Dreambooth
What base model do you all use for training a lora with a person? just the base 1.5? (not jumping into sdxl just yet)
can anyone point me in the right direction to train some text embeddings/inversion i have the sdxl1.0_0.9vae versions, thanks
that's a whole essay's worth of topics, which each have multiple answers which range from short to very very very complex. best to scroll up and read the chat here, as multiple of these questions have been answered in various degrees of complexity
thanks
kohya_ss has scripts for that. Works exactly the same as lora training
Aitrepreneur out here this week telling his audience that 99% of loras are made wrong because people are using rare tokens. I'm out here looking on civit and 1/20 users publish with rare tokens. maybe 1/20. probably less.
i think he just made that 99% figure up
The guy in the video (not me) said the rare tokens such as a random ewfew word like that has less effect than choosing a person the SDXL already knows that looks similar to the person you want to train, for example using the name Jessica Alba to train someone who has a similar look. He believes that training with a name SDXL already knows will make the training faster/better.
It's going to work faster yeah. But given the right settings and time the other way will also work perfectly.
help pls
can anyone give me some advice on lora training ? i found some guides but they're a bit confusing maybe i could ask a couple questions, like how do i make the captions seems to be different ways to apply, can i just make a folder with images
Starting out, I would think your best option is to use the gui https://github.com/bmaltais/kohya_ss - In Lora > Tools > Deprecated you can fill in the details, click prepare training data and then "copy info to folders tab" which will handle the folder creation. Theres then a utilities/captioning/blip captioning to create the core captions. So then you have the folders in the right format, and the captions created with ease. You likely will want to make some manual edit to the captions to make them better, but you have a core framework to go off that way.
(screenshot taken from https://youtu.be/sBFGitIvD2A) but as you say there are lots of guides. It's still pretty new and I think there are going to be even more amazing things to come, especially around multiple concepts.
In this tutorial, you will learn how to install Automatic1111 Web UI for SDXL. How to use LoRAs with Automatic1111 SD Web UI. How to install Kohya SS GUI scripts to do Stable Diffusion training. How to train LoRAs on SDXL model with least amount of VRAM using settings. All of the details, tips and tricks of Kohya trainings. How to do x/y/z plot ...
why does my loss stay around 0.125 when using adafactor? is that normal for variable learning rate ?
the loss has nothing to do with the optimizer or the learning rate. The learning rate is applied on the gradient that is computed from the loss
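To make that concrete, here is a toy one-weight example (plain SGD on a squared error, nothing SD-specific, all names made up). The loss at a given step is computed before the update, so the learning rate doesn't appear in it; it only decides how far the weight moves, which then affects later losses indirectly.

```python
def mse_loss(w: float, x: float, y: float) -> float:
    """Squared error of a one-weight model: prediction is w * x."""
    return (w * x - y) ** 2


def mse_grad(w: float, x: float, y: float) -> float:
    """Derivative of the loss above with respect to w."""
    return 2.0 * x * (w * x - y)


def sgd_step(w: float, grad: float, lr: float) -> float:
    """The only place the learning rate shows up: scaling the gradient."""
    return w - lr * grad
```

Starting from `w = 0.0` with `x = y = 1.0`, the loss is 1.0 no matter what the learning rate is; `lr = 0.1` moves the weight to 0.2 and `lr = 0.5` moves it to 1.0. An adaptive optimizer like Adafactor chooses that step size itself, but the loss formula stays the same.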
Did anyone try LoRA-FA?
I have tried this LoRA-FA. It basically achieved better results at half the epochs and it is memory efficient. I was able to do batch size 10 at around 21/24 GB VRAM on a 3090.
Interesting, seems like worth a try then
How did you get it, I can't see it on my kohya
dev2 branch
aaaah cheers!
Anyone tried Salesforce/blip2-opt-6.7b-coco or something similar to auto-caption images? I think I'm starting to have a baseline for training.
What are your thoughts on finding the right scheduler for a given dataset? Are you one to play with runs on wandb and compare sequentially the effect of applying constant vs cosine with startups vs cosine annealing etc. maybe seeing the different feedback from each LR on strengths or weaknesses of various LR gamma helps to identify where convergence is highest? Do you have graphological conversations? I've noticed this is a hot debate amongst the big model fine tuners, less so for the dreamboothers?
I had to go through every single image caption personally anyway. It provides a good baseline caption most of the time though.
That's probably the right way to go about it. I steered it quite hard on the dataset I had.
My approach is tagged with wd14 and remove unnecessary tags
I feed it a data structure with all the questions and then I combine it to the caption that I then write to file. questions = { "p_style": "Choose the mood of the photo: Desolate, Tense, Lonely, Stark, Quiet, Dark.", "p_subject": "Describe it in detail.", "p_mood": "Describe the mood in the image.", "p_colors": "Describe the colors.", "p_framing": "Choose the framing of the subject: Close-up, Medium, Wide.", "p_setting": "Describe the general setting or environment in a few words.", "p_lighting": "What is the lighting like? E.g.: Natural, Low, Soft, Harsh.", "p_angle": "What angle is the picture taken from? E.g.: Straight on, Low, High.", "p_dof": "What depth of field is used? E.g.: shallow depth of field, deep depth of field." }
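Roughly how a question dict like that can be turned into a caption file. This is only a sketch: `ask` is a hypothetical stand-in for whatever VQA call you use (BLIP-2 or similar), and joining the answers with commas is an assumption, not necessarily what the script above does.

```python
from pathlib import Path
from typing import Callable, Dict

# Three of the questions from the message above, kept short for the example.
QUESTIONS: Dict[str, str] = {
    "p_subject": "Describe it in detail.",
    "p_mood": "Describe the mood in the image.",
    "p_framing": "Choose the framing of the subject: Close-up, Medium, Wide.",
}


def build_caption(image_path: str, questions: Dict[str, str],
                  ask: Callable[[str, str], str]) -> str:
    """Ask every question about one image and join the answers into a caption."""
    answers = [ask(image_path, q).strip().rstrip(".") for q in questions.values()]
    return ", ".join(a for a in answers if a)  # skip empty answers


def write_caption(image_path: str, caption: str) -> Path:
    """Write the caption next to the image as <name>.txt (the usual trainer layout)."""
    target = Path(image_path).with_suffix(".txt")
    target.write_text(caption, encoding="utf-8")
    return target
```

You would still want to read through the generated captions by hand, since the model's answers are only a baseline.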
It should work better for fine tuning. It is unnecessary for lora
Replying to bookmark this and modify it for my purposes.
funnily enough, I don't use the graphs at all.
I usually just rigorously test my checkpoints, as the loss values can appear perfect, yet upon real life testing - it turns out only 3/5 concepts were learned.
Tracking Loss is great to ensure that your settings don't have some massive error, or that the model didn't implode on itself. But other than that, it's not really worth it for me, as I have some loras that work great, despite less than ideal loss values.
I usually stick with AdamW or Adamw8bit + constant. For anything that isn't faces or anatomy - this will give so close to 'perfect' results, that it wasn't worth trying with other settings for me.
While I'm confident, that cosine with restarts, or cosine with annealing restarts can be used to get even a bit higher quality, on harder concepts such as faces/anatomy - getting that setting just right for your specific dataset + tagging is hard enough that I find it hard to recommend.
Prodigy can be good - as you can use it as a set & forget scheduler. One training setting fits all - literally. It won't give you perfect results, but it gets you 80% of the way, 80% of the time. Regardless of how complex your dataset is
I've found with prodigy that min snr gamma has quite a pronounced effect. Setting it low for complex stuff and high for simple stuff helps quite a bit.
so I've seen it used successfully with genuinely insane datasets of like 30k images
now if only min snr didn't kill contrast 🥲
Eh? Min snr gamma only smooths the loss so the optimiser or LR scheduler have a better chance of optimising the thing. I saw that in the code, but didn't check if it does anything to the image before it's fed into the model. Otherwise with a model as complex as SD it's sure to go crazy loss-wise.
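For reference, the Min-SNR weighting as I understand it is just a per-timestep multiplier on the loss, min(SNR, gamma) / SNR. It doesn't touch the image itself. The sketch below assumes epsilon (noise) prediction; v-prediction uses a slightly different denominator.

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight for one timestep under Min-SNR-gamma (epsilon prediction)."""
    return min(snr, gamma) / snr
```

Timesteps with SNR below gamma (very noisy) keep weight 1.0, while low-noise timesteps get scaled down, e.g. `min_snr_weight(10.0, 5.0)` is 0.5. A lower gamma clamps more of the timesteps.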
I shall remember to check that.
it's a ?bug?. I mean I'm not sure if its a bug since its more about how the base sdxl model was trained itself, than anything else. But if you finetune sdxl enough, especially using bright images, you'll notice that the finetune always tries to converge on 50% grey. First backgrounds turning greyer, then blacks turning less black, whites turning less white.
Some settings can speed up that effect - the biggest offenders I've found are offset noise + min snr
Interesting. I shall check what they do in the code in more detail. Noise offset moved some things between the up and down pass of the Lora so that the scale was okay from what I saw, but I didn't look in too much detail.
since sdxl base model was trained with offset noise (and not just one consistent value either, but rather varying levels of offset noise, at varying intervals)
our running theory is that we can't exactly match that offset noise - hence this is the effect we get
it's even worse on full finetuning
@hollow spruce I hope you don't mind me asking, does the loss of your LoRA always go down? Mine fluctuates a bit between 0.68 and 0.72 for example.
but it's nothing about the code - its 100% implemented correctly. It's just what the finetune is learning.
(SAI themselves don't experience this though, fyi - so their finetuning workflow, using custom scripts is immune to this issue)
Oh not saying it's implemented wrong, just saying I probably don't understand it fully or have seen it enough yet.
There are so many models to pick. I'd like to test openflamingo/OpenFlamingo-9B-vitl-mpt7b too. Wish I could clone myself.
yep. number go down - unless I reach dangerous levels of overfitting.
I usually hover around 0.1 loss in the beginning
Means it's something in my settings probably. I have a big enough dataset I'm confident I don't overfit.
But I still have to prune and caption it a bit more.
keep in mind, invisible things can also overfit XD I've trained more than one "noise" lora by now, on accident
Yeah. If the background is too similar or the subjects are always wearing the same clothes or stuff like that.
Do you have a script you use by any chance?
yeah - the more concepts you train at the same time, the less you can rely on loss values, as they represent your training as a whole - but often it's just one thing going super wrong, which needs to be adjusted for the next training attempt
By feeding back the thing or by adjusting the captions I assume?
I'm trying to do faces + anatomy + clothing in one single lora. all of them get learned at completely different rates though, so this is what my dataset distribution looks like XD
clothing,full body : anatomy : faces : teeth
1 : 6 : 6 : 30
that way, nothing overfits, and everything gets trained equally
Super helpful. Thanks.
I would really really recommend training faces + anything else via 2 different loras though. saves you soooo much trouble
for me, I can't do that since I'm doing a finetune style lora with (currently) 4k images, which is just a literal finetune like experience, rather than teaching a specific concept.
Do you crop as well?
And upscale?
Not at this moment, it's not releasable.
using all high quality source images, with 0 upscale needed was the biggest improvement so far. Really helps with keeping fine detail consistent across all images generated.
My current dataset is manually edited - so all images are cropped to my ideal standards (2:3, 4:5, 1:1, 16:9, etc.)
but when you're training just a single concept (emphasis on concept, not style), then 1:1 crop will give the same results as using completely mixed buckets.
If you only use one non 1:1 resolution, like for example 2:3, then expect your lora to perform better when generating images at that exact aspect ratio, and somewhat worse at all other aspect ratios.
If you're doing a style lora, then just use the images in various aspect ratios, and maybe crop a few so you can also have some 1:1 examples in there - in case there were none
fyi, this can also be misused to essentially make an aspect ratio lora, to give more consistent results with 1:4 or 4:1 aspect ratios, which usually only work around half the time.
if all your source images are 2048:512
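To get a feel for which training resolution an aspect ratio ends up in, here is a simplified bucket calculation. It assumes a 1024x1024 pixel budget and sides snapped down to multiples of 64; kohya's real bucketing has more options (min/max resolution, no-upscale, etc.), so treat this as an approximation.

```python
import math


def bucket_for_aspect(width_ratio: int, height_ratio: int,
                      max_area: int = 1024 * 1024, step: int = 64):
    """Largest (width, height) in multiples of `step` that fits the area budget."""
    ideal_width = math.sqrt(max_area * width_ratio / height_ratio)
    width = max(step, int(ideal_width // step) * step)      # snap width down
    height = max(step, (max_area // width) // step * step)  # fit height under budget
    return width, height
```

With these assumptions 2:3 lands on 832x1216, 16:9 on 1344x768, and 4:1 sources like 2048:512 stay at 2048x512.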
Base vs LoRA. Not sure what to think about it.
I love seeing different approaches by different camps. Thanks for sharing your experience
@hollow spruce what do you use AdamW for? Usually more the 8bit variant? I'm using an A100 on runpod, but wondering on what scale are the benefits of the added precision if needed
Hi Guys, im trying to train bodyparts "Fingers" so i made a library of like 300 images of Hands and Fingers. im creating a lora from that which generally works ok but my issue is that as soon as i apply the lora the resulting images get very narrow
is there a tag / way i can use to basically mark it as bodyparts and avoid that
I can't believe it, can it turn out this good on the first try? Here are the images with prompts. https://imgur.com/a/XPXrkKg
Loss from tensorboard. I picked the last epoch.
Open flamingo is best captioner I have tried so far
can u send the json file?
Very impressive!
anyone had size mismatch errors with training sdxl ?
you trying to merge with base?
I was told AdamW + full bf16 training is better than AdamW8bit. They need similar vram, so I've just been going with it 🤷‍♂️ if nothing else, it doesn't make the training worse.
(would probably be worth running a few tests to compare them with the same dataset + settings, and see if there's a difference)
Do you know if this is still true?
This option enables the full bfloat16 training (includes gradients). This option is useful to reduce the GPU memory usage. However, bitsandbytes==0.35 doesn't seem to support this. Please use a newer version of bitsandbytes or another optimizer. I cannot find bitsandbytes>0.35.0 that works correctly on Windows.
no im just trying to train it with kohya
Which json are you looking for? I got plenty to choose from.
are you training a face?
--full_bf16 does not bomb on me with --sdpa. Use that instead of --xformers.
cool that just looks like that street photography i was going for
I'm running another dataset now that I've made today, 225 images total.
With kohya_ss using sdxl_train_network.py. I took some inspiration from Caith's configs but removed some for the defaults and changed some based on this: https://hoshikat-hatenablog-com.translate.goog/entry/2023/05/26/223229?_x_tr_sl=sv&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp
In the previous article, I explained how to set up "kohya_ss", a WebUI environment for additional training of Stable Diffusion models. This time I'll roughly explain how LoRA works, and then go through the LoRA training settings in kohya_ss. Note: this article is very long! It only explains what each setting means. It does not cover how to prepare training images, how to caption them, or how to run the training; I plan to cover running the training in a separate article. Understanding how LoRA works. What is a "model"? LoRA adds a small neural network …
This series is good too, it's three parts. https://medium.com/@dreamsarereal/understanding-lora-training-part-1-learning-rate-schedulers-network-dimension-and-alpha-c88a8658beb7
A guide for intermediate level kohya-ss scripts users looking to take their training to the next level.
was about to say kohya gui now includes a compatible bitsandbytes XD but wth?
so. yeah.
It works?
why does it work 🤣
Seems to work yes and uses like 3.5G less VRAM than without, but I don't use xformers. Might be why it's working.
The one u used for that training? I guess it was movie style training?
The dataset was prepared from a movie yes.
I think he just wants the json file to have an idea what settings, learning rate etc you used lol
Up to you though!
I see, no json sorry. Custom script but here's probably the important stuff. That did not turn out good. I can see what I can come up with. Here: https://gist.github.com/twri/1166fd65f30cea4c53d0c16ae0ee4f26
But there's probably better settings so take it with a big grain of salt
Cheers love, very much appreciated, I hope that person is happy now.
Thanks, how many total images and repeats?
230 images, 10 repeats but I don't think it matters since I have no reg images. Just one folder.
Doing another run on another dataset with about the same number of images and I can't understand why I get about the same loss? Only thing I can think of is the captioning is similar format.
Ok, not the same but similar curve.
Yea loss curves tend to flatten out relatively quickly on average I've noticed. You should probably check on your training data. And maybe accumulate the gradients so you can simulate a larger batch size. Up the learning rate too then. Then use cosine with restarts (like 2 or 3 cycles). That seemed to work for me when I trained my last lora. Also just use repeat 1 and up the number of epochs, so you can save intermediate results.
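If the "cosine with restarts" shape isn't obvious: the learning rate decays along a cosine curve and then jumps back to the starting value at the beginning of each cycle. A minimal sketch without warmup (function and parameter names are my own, not a specific library's API):

```python
import math


def cosine_with_restarts(step: int, total_steps: int,
                         base_lr: float, num_cycles: int = 3) -> float:
    """Learning rate at `step` for a cosine schedule with hard restarts."""
    if step >= total_steps:
        return 0.0
    # Position inside the current cycle, from 0.0 (just restarted) to just under 1.0.
    cycle_progress = (num_cycles * step / total_steps) % 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_progress))
```

With 300 total steps and 3 cycles, the rate starts at `base_lr`, is down to half of it at step 50, and snaps back to `base_lr` at steps 100 and 200.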
Is ~5.7s/it about normal for batch size 8 lora training on a 4090?
hey yall - having some issues with a lora im currently making right now. I'm trying to train on the bberny belt/skirt by diesel, and was wondering if my image selection is a little poor.
would appreciate if yall have any feedback on image selection and labeling
https://github.com/matthew2k/bberny_lora
I am by no means good at lora training, but I noticed that I got the best results when I could easily describe everything in the scene I didn't want to be trained.
For instance, Image 15 in your set, I have no idea what the top is or how to describe it and I suspect the ai would assume that's part of what you want from your training data (also the skirts being covered by the top, which will hinder training). Image 18, in comparison, is probably a solid image as the ai knows what a Shiny Jacket is and a person standing in a white photo studio.
**As for the caption, for image 18:**
mini skirt, a fashion photo of a woman standing, wearing a silver metallic jacket, high heels, holding a purse, black hair, brown skin, white photo studio background
This gives a keyword the ai already knows (mini skirt), a simple description of the subject, a description of each part of the subject you dont want the lora to remember, a description of the background.
**For Image 19:**
mini skirt, close up photo of a torso, brown skin, white background
Should be enough to get the point across. In both of these cases, I believe it is important to specify skin tone or else the LORA will err towards what ever skin tone is most prevalent in your training data.
Image 24 looks too compressed. Image 27-30 as well. Since SDXL training is 1kx1k, any visible compression will leak into the LORA. I've found even high res photos downscaled to 1080p do better than still frames of 4k movies in terms of compression/sharpness. It looks like, in your outputs, the image compression/resolution is baking into the LORA. Also a single bad image can screw up a training. Err towards fewer images of higher quality rather than more of lesser quality.
super helpful! i think this confirms what i was thinking, which is that image selection rather than parameter tuning seems to be the bigger culprit.
as for labeling, is trying to label literally everything thats not ur Lora the best practice?
also i use this site to see how labeling is sort of interpreted by sd1.5 https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images?_search=distorted&_sort=rowid
Each tag in your caption is another tag being trained. You just want to make sure the ai understands that the tag you want, belt skirt, is only referring to the belt skirt she is wearing and not the other parts. That is much easier to do with simple, easy to comprehend, images. That way you can use less words to describe more.
But its not really describing what you dont want, just describing each feature of the image
I trained two test loras on the peace sign hand. The current issue is that the nails are usually drawn wrong and the thumb gets connected to the ring finger. Where could I find high quality training images? I think it should perform better if the training set gets improved. @hollow spruce Do you have any suggestions?
Hi guys, i was wondering if its possible to train a lora from an existing .safetensors file? Ive been looking all over but cant find a clear answer or way to do it.. From what i see i can do it from some webuis but i cant find how to do it in python. Do any of you maybe know of a guide or something? When i google it i get results where you train from base and end up with a safetensor instead of start with one.
The LECO concept might be what you're looking for. It uses an existing model and a prompt to train a lora.
The original LECO git doesn't support SDXL yet. Someone modified it and published it on civitai. He was also trying to extend it for fancier training, like generating with model A and training on model B, etc.
Looks like that does what i want, but it seems to be mainly focussed on erasing concepts?
I dont really see an example for training on images. only prompts to remove from the model.
The original paper is about erasing concepts. But people modified it to train a lora using a prompt and a model
The idea is to use the prompt to generate "images" from the model and train on them
Do you have a github link or something for me of one of these modifications? I'm currently looking at https://github.com/p1atdev/LECO but this might not be the right one?
That's the one which doesn't support sdxl yet
what do you mean with "existing safetensor" file?
As in a civitai checkpoint for example
safetensor is just a file format. Do you mean: train a lora from an existing lora? Or training a lora from another sdxl checkpoint?
but what is the issue then? In kohya_ss you give the safetensor file of the model as parameter
--pretrained_model_name_or_path="/path/to/sdxl/model.safetensors"
There is no issue, im asking for a guide on how to do it in python. I have not heard of kohya_ss.
ah, okay. I mean you can use diffusers if you want to write the code yourself
i have tried but i dont think this accepts .safetensors format?
otherwise kohya sd-scripts is a nice collection of python scripts for lora training
For general lora training, just using kohya ss is enough
awesome, i will have a look.
as said, safetensor is just a format. But diffusers usually wants their checkpoints in their own format, but there are conversion scripts I think
i'm asuming that would be .ckpt format?
.ckpt or .safetensors doesn't matter. It's how the model is stored within these formats
models are usually stored as python dictionaries of key -> tensor. .ckpt is just a pickle of these, .safetensors is a more restricted serialization routine
when i tried putting a .safetensors file in diffusers i got an error that the string i input wasnt a folder or something like this
it didnt accept the .safetensors etc.
the problem is that diffusers has different key names than auto111/kohya/sai
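If you want to see which key names a .safetensors file actually uses without loading any weights, the header is easy to read with the standard library: the file starts with an 8-byte little-endian length, followed by that many bytes of JSON. A small sketch (the tensor name in the demo is made up):

```python
import io
import json
import struct


def read_safetensors_keys(fileobj) -> dict:
    """Return {tensor_name: (dtype, shape)} from an open .safetensors file."""
    (header_len,) = struct.unpack("<Q", fileobj.read(8))  # 8-byte header length
    header = json.loads(fileobj.read(header_len).decode("utf-8"))
    return {name: (info["dtype"], info["shape"])
            for name, info in header.items()
            if name != "__metadata__"}  # optional free-form metadata entry


# Tiny in-memory file with one made-up tensor, just to exercise the parser.
_header = json.dumps(
    {"demo.weight": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
).encode("utf-8")
_demo_file = io.BytesIO(struct.pack("<Q", len(_header)) + _header + b"\x00" * 4)
demo_keys = read_safetensors_keys(_demo_file)
```

On a real checkpoint you would call it as `read_safetensors_keys(open(path, "rb"))` and compare the printed names against what your loader expects.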
i will for sure have a look at kohya
i wanted to use diffusers but it got really confusing as it just refused to accept the format.
otherwise try conversion scripts, like here: https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py
might have been doing something wrong, but also couldn't find a guide on it at all
ah thats great, i will have a look
I just don't know if they already work for sdxl
it's fine if it doesn't, at least then i can try to get the hang of it and start learning
thanks a lot both of you
kohya script train network keeps crashing after trying to load the network in UNet2DConditionModel print
Missing key(s) in state_dict: "down_blocks.0.attentions.0.norm.weight",
any suggestion?
ah yes. anatomy training XD
https://unsplash.com/s/photos/peace-sign
there's a good amount there, if you don't already have them in your dataset
fun fact. I'm also doing anatomy training right now
plz send help. RTX4090 isn't fast enough 🥲
there's also flickr, for more "human" photos, that don't look like photoshoots, but you'll have to filter through A LOT of bad images first XD
https://flickr.com/search/?text=peace sign
Thanks for sharing. I gathered my training set from pexels. Most of them are jpg and I got the jpg artifacts burned into the lora. 🤣
rip ❤️
one of my trigger words also learned that phone compression from the selfie camera on phones 🥲
too many selfies in my dataset 🤣
I think the lora learned the general shape of the pose but failed in detail. Do you have advice for that?
either a big enough dataset, or go all in on overfitting, and train for about 300~600% too long.
(also training rate scheduler)
probably a valid reason to switch to cosine. Though I have no advice at what rate to start at
How big is enough? My dataset only has 26 for now.
if they are similar enough, then 30~50 for overfitting. 100~200 should allow you to get away with only a small amount of overfitting
200~500 for "flexible". <- but that's not really needed, as you probably always want a peace sign if you load your lora. so just do it via overfitting.
general rule of thumb, is 10x the dataset for a flexible lora, and 100x for full finetuning
I currently have 200~250 images per body part I'm training x_x which is why it's training forever
I would try 50 first, then continue gathering more.
I just used Kohya to make my first LoRA and it was surprisingly easy. Made a lot of first time mistakes but it still turned out super functional. Used way too large of a data set, like 2000 images. I don't need that much >.>
Hello, I am using stable diffusion xl and I am trying to create my avatar with lora training via kohya_ss, and I am getting the attached error. Help me, thanks
what does your folder layout look like inside of /avatar/img/
it looks like kohya is unable to find your training images, they should be setup as:
/[project]/img/[repeats]_[keyword]/[images]
for example:
/avatar/img/1_man/image_01.png
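A small checker for that layout can save a failed run. The sketch below assumes the leading number in the folder name is the repeat count (which is how kohya reads it, as far as I know) and that images use the common extensions listed; adjust both if your setup differs.

```python
import re
from pathlib import Path

FOLDER_PATTERN = re.compile(r"^(\d+)_(.+)$")
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def parse_concept_folder(name: str):
    """Split '1_man' into (1, 'man'); return None if the name doesn't match."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def scan_img_dir(img_dir: str) -> dict:
    """Map each keyword to (repeats, image_count) for valid subfolders of img_dir."""
    found = {}
    for sub in sorted(Path(img_dir).iterdir()):
        parsed = parse_concept_folder(sub.name) if sub.is_dir() else None
        if parsed is None:
            continue  # not a '<number>_<keyword>' folder, the trainer would skip it
        repeats, keyword = parsed
        count = sum(1 for p in sub.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)
        found[keyword] = (repeats, count)
    return found
```

If `scan_img_dir("/avatar/img")` comes back empty, or with a count of 0, the trainer most likely won't find any images either.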
hey there! curious if anyone has had issues with distortions with training lora's on people. here's an example of a lora i trained recently (V* shaking hands with natalie portman) but there's some pretty significant facial distortions.
I used kohya ss with ~15 images, celeb token, no captions, no regularization. could this be due to lack of regularization?
Does anyone have guides for fixing teeth if one doesn't have close-ups of the subject's teeth? Something like an embedding?
bitsandbytes says no gpu support on amd anyone faced this?
I don't have a good answer to this but eyes in profile are always tricky to get right even on the base model.
I would say its just too low resolution. SD is not good for small faces. Upscale the image and redraw the faces
Yep, few passes of base on it after upscale can fix it.
40hours later on my rtx4090. 1.4k image dataset - 2x 40epoch runs with constant (5e-4) vs cosine with restarts (1e-3 + restart every 5 epochs)
constant with significant warmup won hands down. The LoRA is working well enough after 12 hours. This is essentially a finetune when it comes to body anatomy. But realistically it would have to run for about 60 more hours until it achieves perfection 🥲
cosine with restarts was my attempt to speed this up, but after 12h of training, it's already worse than the other lora, rather than better. While it might just converge at a later point, it definitely defeats the point of saving time 🤷‍♂️
no dropout, no offset noise, no min snr gamma - since I didn't want to damage the base sdxl capabilities.
Results? Near perfect with constant. Model works almost identical to base, backgrounds aren't influenced at all, but everything about anatomy is now working on the first attempt. I'd show results but for nsfw reasons this obviously isn't an option XD
I'll probably have to move to an A100 stack once I combine this with my master dataset 🥲
But yeah, if anyone wants to train hands/feet/different body shapes/nsfw/skin detail - feel free to hit me up in a dm. I now have well working settings, which don't rely on good captions - but oh god is it training slow as hell.
You need a rocm version, let me find a link
That's the one I use
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ DEBUG INFORMATION +++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Running a quick check that:
+ library is importable
+ CUDA function is callable
SUCCESS!
Installation was successful!
that will show you if it's installed correctly
Have you shared your training script anywhere?
epochs / warmup are pretty specific to the dataset size
basically the bigger the dataset, the longer it needs to run. 1400 images would be between 40~200 epochs. which takes 12~60 hours on a RTX4090
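For a rough sense of scale, the epoch counts above can be turned into optimizer steps. This is only a back-of-the-envelope sketch: it assumes one repeat per image, no gradient accumulation, and the batch size of 12 mentioned for the 4090. Bucketing can change the exact count. `total_steps` is a made-up helper name.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size):
    """Optimizer steps for a run, ignoring gradient accumulation and bucketing."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# 1400 images, 1 repeat (assumed), batch size 12
for epochs in (40, 200):
    print(epochs, "epochs ->", total_steps(1400, 1, epochs, 12), "steps")
# 40 epochs -> 4680 steps
# 200 epochs -> 23400 steps
```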
batch size needs to be adjusted to whatever you can run. (12 is the absolute max on a 4090)
really not the fastest option, not by far, but it works well for anatomy specifically
I was more so asking about the code for the training loop you use. Do you write your own or use one that's available?
ah, yeah. I use standard kohya gui - as I'm still writing a guide
Ah, I see
hard to make a good guide with all custom code XD
Hmm, perhaps 🤷‍♂️
A well annotated notebook is something the community would benefit from I guess. There are many training scripts out there right now, and the one that was recently merged into diffusers to train SDXL for txt2img has major warnings that results aren't good, and would require heavy hyperparam search. Most scripts out there are fairly similar, but differ on many small (but maybe important) details. Like offset, min-snr, terminal snr, etc.
It just feels in general like there's too little data being shared around successful training runs.
checks out. diffusers training is hard as hell
simpletuner, while still hard, is definitely your best bet
does require coding knowledge though
(for full finetuning that actually works)
A6000 required or better
The author of SimpleTuner helped and did a thorough review of the PR that introduced that training script; there definitely were many suggestions from him that weren't implemented in the final script that got merged.
In my own script I've tried combining strategies from diffusers, simpletuner and sd-scripts, but without good public data on successful training runs, it requires so much wasted time to search the param space. I truly wish model makers would be more open to sharing data. But alas they view their models as IP, rather than wanting to partake in research, they will sit on their "secret sauce".
training via kohya is essentially fully working though. my best lora so far has achieved what would in 1.5 have been called a full finetune
And forgive my ignorance, but when you say kohya, you simply mean a GUI frontend over the logic in sd-scripts?
I'm mainly using the gui frontend for the simplicity of sharing settings, but the launch cli command is the same
Okay, just wasn't sure if the GUI did additional things. I've never checked it out.. but the code for sd-scripts.. I've read one too many times by now..
the gui does have the advantage of having all the new 'working' parts from the dev branches of kohya-ss integrated
so you don't need to mess with the dev branches yourself
What's been your experience with training diverse aspect ratios (clipped to the ratios mentioned in the SDXL paper, however)?
other than that, nothing really
working perfectly, with the exception of having your whole dataset consist of only a single bucket size that isn't 1:1
so as long as you have multiple mixed aspect ratio buckets, it works even better than standard 1:1 ratios
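To check whether a dataset would end up in a single non-square bucket, something like the sketch below could be run over the image sizes first. The bucket list is an illustrative set of roughly one-megapixel resolutions and the nearest-aspect-ratio rule is a simplification; neither is taken from kohya's actual bucketing code. All function names here are hypothetical.

```python
import math
from collections import Counter

# Illustrative ~1-megapixel buckets (an assumption, not kohya's exact list).
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152),
           (1216, 832), (832, 1216), (1344, 768), (768, 1344)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


def bucket_usage(sizes, buckets=BUCKETS):
    """Count how many images fall into each bucket."""
    return Counter(nearest_bucket(w, h, buckets) for w, h in sizes)


def single_non_square_bucket(usage):
    """True for the problem case described above: one bucket only, and it is not 1:1."""
    return len(usage) == 1 and all(w != h for w, h in usage)
```

For example, a dataset made up only of 3:2 landscape photos would land entirely in `(1216, 832)` here and be flagged, while adding a few square or portrait images would not.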
What's been your experience on bf16 vs f16, had issues with bnb AdamW8?
had enough people recommend me to switch from AdamW8bit to AdamW + full bf16
I've seen no negative impact from this. Vram is roughly the same, so I've just stuck with it.
Would need to run identical tests though, to see if the additional accuracy actually improves things or not
not really a priority for me though, as there's no downside of running full bf16 for now
Have you tried running full UNET training with batch size one on a 4090?
did various 1 batch tests, with additional accuracies, including one run with gradient checkpointing off XD
yeah. while results were pretty different from what I was expecting, I can't say that a single one of them was actually better. just different?
offset noise I've had to stop using though, as it was making my backgrounds greyer. hence why most of my shared settings have 0 offset noise
What's your stance on the viability of only training loras as opposed to full unet training for the further development of the SDXL ecosystem? (text encoders too, for that matter)
more complicated of a matter
clip training is great, but nothing like SD1.5
any old tutorial/knowledge is no longer valid when it comes to sdxl clip training
Likely not the bottleneck anytime soon; but the lack of full unet trainings has me concerned
but when done right, its really really good
full finetune is ... yeah. resource heavy
even my rtx4090 isn't really good enough
batch size 1, even with GA, isn't a true solution
Can't say I'm happy with the setup on a 4090, either. Which is a bummer
A6000 is now working, if you go the diffusers route
but as I dont have one XD I cant really speak much about it
instead I'm just seeing how far I can take lora training
While it's going to be slow, I think a setup that at least achieves good results on a 4090 would benefit the community, as it would increase the amount of unet trainings available
Would distribute the effort across more people
🤷‍♂️ genuinely not sure, if I just look at all the loras that are currently publicly available
the bad lora echo chamber is real
There's always going to be noisy signal, but unless there's good tooling available, you can't hope to filter out decent signal from all the noise to begin with
true that
I'd get into it if it runs on 24gb vram
but at this rate, chances are higher I'll throw my full dataset at the owner of simpletuner once its complete XD
currently at 6k images. final will be around 50k images. (all manually edited/cropped/captioned)
Well, with batch size one it runs for only unet, so it's wicked slow. But slow and working is better than nothing at all
I'm not sure if running it for 1000hours is even a 'true' option 🤣
I've tried trainings with datasets ranging from 5k images to 20k images, batch size one. And results have been very mixed depending on what techniques I incorporate. Sampling data on training results is a ... very slow process.
at that point its cheaper to rent a runpod A100 stack, just to offset the electricity costs
Yes, obviously it's the way to go. I'm just trying to optimise for all the people out there that simply will never run a training if it means renting on cloud, but if it means leaving their computer on during the night for two weeks, they'll give it a go.
while there are ways to cheat with training time - if you use them, are you even benefitting from full unet training? Cause then I genuinely have to ask if a lora wouldn't be both more efficient and better
That depends on what we're talking about specifically when we say "cheat'
higher learning rates, or adaptive optimizers, which scale up the learning rate for you
Yeah, and this is what I want to find more data on : D
It's hard to argue these things without any collected evidence
I once tried a fine-tune with 10k images. It took 5x24 hours on my 3090 and I predicted it would need two more rounds (5x24x2) to get things done. Then I gave up on the full fine-tune and used a smaller dataset to fine-tune the first output.
I usually compare my loras to random dreamshaper images that have been shared XD
Have you tried doing FID?
if I can beat or match 3/5 images, using the seed 2, then I consider my lora as 'good enough' XD
fid?
There were some indications that FID might not be a stellar metric in the SDXL paper
nop. have never used FID yet. but that was an interesting read x_x
fid is not suitable for one-topic lora training. It measures distributions over whole datasets
Right, we were talking about full unet training runs
doesn't matter unless you train the unet on a complete dataset of many different subjects
but fid doesn't really measure aesthetics anyways
Anyone into the following
I am a sock manufacturer.
I am looking to have AI create image files from images supplied by customers, making use of AI image creativity.
The image files are used to transmit data to a machine to trigger its functions.
The data signals sent to the machine are represented by the RGB colors in the file. Machine capability is limited, so the RGB colors that can appear in the file must be limited. Currently, AI image generators use shading, gradients, etc. when creating images. You also cannot specify the image size, more specifically the image size in pixels.
Example
168 pixels wide 400 pixels height.
168 represents the 168 needles that are in the cylinder of the machine.
400 represents how many courses are in the sock. or how many times the cylinder has rotated picking up different colored yarn at its yarn intake points.
RGB colors in the file are used by technicians to designate fixed yarn takeup points on the machine.
Transmitting the data to the machine is not what I am looking for. I am just looking to create images
Current reasons that AI images on all AI platforms cannot be used with textile-making machinery: 1. unable to mandate the size of the file in pixels. 2. unable to mandate the number of allowable RGB colors in the file.
wrong channel. this belongs in #💬｜general-chat
definitely doable, but you need a full (custom) app or at least script in the middle to handle the issues you're facing
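As a rough idea of what the "script in the middle" could do, here is a sketch that forces any generated image onto a fixed 168 x 400 grid and snaps every pixel to the nearest colour in a small allowed palette. It works on plain nested lists of RGB tuples so it has no dependencies; the function names and the nearest-neighbour approach are assumptions for illustration, and a real pipeline would likely read and write image files with an imaging library.

```python
def nearest_color(pixel, palette):
    """Return the palette colour with the smallest squared RGB distance to pixel."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))


def to_knit_grid(pixels, palette, width=168, height=400):
    """Resize (nearest neighbour) to width x height and restrict colours to palette.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    width:  number of needles in the cylinder (168 in the example above).
    height: number of courses in the sock (400 in the example above).
    """
    src_h = len(pixels)
    src_w = len(pixels[0])
    grid = []
    for y in range(height):
        src_row = pixels[y * src_h // height]
        grid.append([nearest_color(src_row[x * src_w // width], palette)
                     for x in range(width)])
    return grid
```

The output contains exactly `height` rows of `width` pixels and no colour outside the palette, which addresses the two limitations listed above; shading and gradients are flattened into the nearest allowed yarn colour.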
Is it normal to see images that are closer to the reg images than training images in the earlier epochs of training a lora?
Is captioning really important if I am training a subject that is scifi and doesn't look anything like real life?
Do you think that the clip vision model or something like it could be used to "caption" images during training?
Is this red and green noise a common occurrence? 🤨
What's the current state of fine-tuning sdxl with 4090 - Lora in a few hours, full fine-tune not possible?
yeah turns out i needed a rocm version of torch and the script reqs installed the nvidia ones that's why it was giving errors and not showing gpu
so at least i can get to the train steps now but it keeps getting killed with SIGKILL 9 for some reason
only got to like 70 steps last time and it took like half an hour, not sure if this is normal speed or if the gpu isn't being used. it didn't make the fans spin like a jet about to take off
another debugging week i guess not sure if this is worth it
run rocm-smi, it'll tell you the gpu utilization
and run that command I posted above to make sure your bitsandbytes is good
how do you install the rocm version? it just says pip install bitsandbytes
my version errors out
also i dont have rocm-smi?
that wont work, the pip will install the cuda version of it, you have to git pull and compile it yourself
https://github.com/Titaniumtown/bitsandbytes-rocm/blob/patch-2/compile_from_source.md whats the right cuda target ?
it says needs nvcc but thats nvidia right
python dependencies are a pain to work with lol
ignore that, read the readme
remember this was patched to work with rocm
git clone https://git.ecker.tech/mrq/bitsandbytes-rocm
make hip
CUDA_VERSION=gfx1030 python setup.py install # assumes you're using a 6XXX series card
python3 -m bitsandbytes # to validate it works
did you edit the makefile to point to your rocm location?
make sure you have all the rocm packages
Is this 'done'? Will it just get worse from now? Or should I let it keep going?
never really decided based on a graph of loss, I always look at the quality of the sample images to figure out if it's getting better or worse
honestly, if you were putting in a prompt and getting an image out, how would you decide if it sucked? not a graph surely
good point, I have been testing the files as they save. The file with this spike hasn't saved yet
It's still climbing
yeah it's done. the output from the latest save is complete garbage, totally scrambled
good to have my gpu back after 3 days of training
Hey, what are your experiences for using triggers while training an own lora for a person with SDXL? Because I've seen some tutorials where it states random triggers like "sks" are better, but at the same time some other tutorials mention that the persons name is just fine and random triggers aren't working that well
What does the loss/epoch look like?
when the last file saved, it was about 10.8. but the output was useless
If you wouldn't mind pasting the graph
I closed the session. How do you get the graph up?
tensorboard --logdir path
What does 16 or earlier look like?
Hello, respected ones. I'm having some issues while training a LoRA in my cmd; it seems my LoRA doesn't get prepared. I shared some screenshots, please check them. Thanks.
16 is a bit overdone, 17 is good
Try even earlier
I tested all the saves
Great!
I'm glad I got something out of it, since it was running for 3 days
I'm in the process of trying to see what impact the parameters and dataset have on the result. That's why I'm curious.
Ran during the night but haven't had a chance to look at the results yet.
I know I have to adjust the dataset though. But it's fun to see that a specific feature gets into the LoRA.
I think it's really good to have a good stable source when learning so you have the possibility to change the dataset and iterate.
This was the first photographic lora I trained with over 9000 images. I've done a few style loras with a few hundred images to train on that turned out pretty good.
Has anyone else encountered these horizontal artefacts, and/or found a solution to eliminate them?
yes. I'm very sure these are the VAE artifacts when using the broken VAE from the initial release of SDXL 1.0
Yes
you can just use a separate VAE like the official one (which is actually the 0.9 VAE) https://huggingface.co/stabilityai/sdxl-vae/resolve/main/sdxl_vae.safetensors
or you can also patch your model with a python script as far as I know
Like the .safetensors file is outdated?
it's not in your training data
if you can't choose a separate VAE for training use this model for merges and fine-tuning: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0_0.9vae.safetensors
it's the updated version by Stability AI with the working VAE that does not produce artifacts
these artifacts only happen when an image is decoded - so only when the RGB image is created. the encoding is not being influenced by it as far as I know. so your training / fine-tuning is fine
Oh right, okay
So nothing wrong with the checkpoint files, just to be clear? I just need to use a different VAE for the decode step?
Oh, but this is a replacement checkpoint file?
yes, there is. but not in the checkpoint data itself. it's the decoding part (VAE) of the model which translates your latent space data to an RGB output image. but this can be fixed
yes. if you use sd_xl_base_1.0_0.9vae.safetensors there are no artifacts
Great. Thanks for explaining that!
Sure. You're welcome. I'm not even training, but I have troubleshot that a couple of times already for people. I use a separate VAE at all times. You can use it in a1111 and ComfyUI without a problem.
But if your model uses the SDXL 1.0 VAE and you distribute the model to Civit or any service and they are not aware of it, people will make images with lots of bad artifacting - which isn't good.
Try to overfit an image into a lora.
Experiment setting:
Lora Type: LoRA-FA
Dim 128, Alpha 1
Learning rate 0.01
Text encoder rate 0
Repeat_600, batch 10, epoch 10
Image 1: loss graph
Image 2: training set
Image 3: reproduce image
The final loss is around 0.00367
Using the final epoch to reproduce the image with only the class token in the positive prompt and no negative prompt. The reproduced image is slightly darker than the original.
Is the ideal graph shape for loss when training to have it slowly drop in a linear way?
Hello Community
Could anyone suggest how to train a finetuned model of Indian Hindu gods?
depends on scheduler and optimizer
Ok. I thought that loss would naturally drop over time as it gets to be a better and better fit with the training data, but my current training run isn't showing that
with that LR in the end your model will not be able to produce anything else than your image
it will completely overfit even if you dont even call the concept
Ah so that means LR is only an indication of overfitting?
It is generated by lora strength 1.0.
Maybe it is because the training didn't train the text encoder? It seems it only affected the class token to reproduce the image
ah ok in this case its a lil different
Hello Community!
How can we fine-tune on Indian Hindu gods? Could anyone suggest how to achieve this?
Thank you
Is there any document or video that discusses the various different types of LoRA that people experiment with in SD ?
To reply to myself, I found some english translations of the sd-scripts docs https://github.com/darkstorm2150/sd-scripts#links-to-usage-documentation
this site goes into a fair amount of detail
https://hoshikat.hatenablog.com/entry/2023/05/26/223229#LoRA-type
use google or bing translate for it
not too much info on the types themselves, but a lot of info about everything that the types enable you to do - so in theory very helpful
Ok I'll see if I can get some understanding from them. Whenever I dig into this stuff I end up with a thousand tabs
that site should cover about 80% of what you need to know about all the different settings of training (and when you need a non LoRA training type)
I'm still persevering with your recommended LoRA training setup, and it's fascinating to see everyone's choices as more and more tutorials/videos get published
last 20% are trial and error, as sdxl is quite different from SD1.5 in terms of anecdotal information, which there is a lot of online
yeah, I am constantly looking to see if I've accidentally stumbled into an older v1.5 guide
I had my first training session crap out this evening with a NaN for the loss, and black sample images. I'm glad I caught that before heading to bed
there is a model that has 0.9vae in the filename. use that
It's the 1.0 model checkpoint, with the 0.9 vae baked in
in case you're wondering. the 1.0 vae causes this
I started getting black sample outputs during XL lora training, but the loss isn't NaN and the checkpoints work as expected. Guessing it's something to do with switching mixed precision to fp16 from bf16 and disabling full bf16 training. I did those things hoping for better quality, should it help or should i stick to bf16?
I got black sample images on my last lora training, but the saves were still working when I tested them in the workflow
same, i guess I'll just leave it as is. samples were never useful anyways
I find it odd that in some tutorials people seem to be adjusting Network Alpha like it's the same as Network Rank. From what I understand, it's more like a multiplier on the values stored in the LoRA?
It's not like Rank and Alpha are the x and y dimensions of some tensor. Maybe I have it wrong?
Ah no, I'm not wrong, this is from Caith's recommended link from yesterday.
@hollow spruce Hey, I'd appreciate a hand with prompting and workflows for trying my freshly trained LoRA. I've tried a basic Comfy workflow, but as the strength of the LoRA goes up, it pulls the resulting image straight into looking like one of the training images, regardless of prompt. I don't know if this is an issue with my low image count while training, or if it's overfitting.
Looking at the checkpoint images made during training, the likeness converges in a nice linear way across the 300 epochs, and it looks correct after 250, so maybe I am not using positive and negative prompts correctly? Or style prompts? Because Comfy is so much of a blank slate, I don't know if I'm just approaching this in too simplistic a way.
I had the same problem tonight. It resolved itself the next time, with one difference: I added the argument --no_half_vae
By the way, would you mind sharing your training command? Mine is not satisfying for a face training.
Can you please illustrate or explain "the likeness converges in a nice linear way", on my side I get "over baked" images from the start to the end, I don't know what must be changed. In the end my LORA is over baked, I get acceptable results only using a 0.1 weight.
Here is my current command, just in case anybody of you find something strange. On a 3090, target is SDXL.
.\accelerate launch
--num_cpu_threads_per_process=2 "C:\Users\daf\automatic\kohya_ss\sdxl_train_network.py"
--enable_bucket
--min_bucket_reso=512
--max_bucket_reso=2048
--pretrained_model_name_or_path="C:/Users/daf/automatic/models/Stable-diffusion/sdvn6Realxl_detailface.safetensors"
--train_data_dir="blah"
--resolution="1024,1024"
--output_dir="blah"
--logging_dir="blah"
--network_alpha="1"
--training_comment=trigger=blah_0.4
--save_model_as=safetensors
--network_module=networks.lora
--network_args rank_dropout="0.15" module_dropout="0.15"
--text_encoder_lr=0.0005
--unet_lr=0.0005
--network_dim=24
--output_name="blah_v0_4"
--lr_scheduler_num_cycles="70"
--scale_weight_norms="1"
--no_half_vae
--network_dropout="0.2"
--full_fp16
--learning_rate="0.0005"
--lr_scheduler="constant_with_warmup"
--lr_warmup_steps="166"
--train_batch_size="2"
--max_train_steps="3325"
--save_every_n_epochs="10"
--mixed_precision="fp16"
--save_precision="fp16"
--seed="1234"
--caption_extension=".txt"
--cache_latents
--cache_latents_to_disk
--optimizer_type="AdamW8bit"
--max_train_epochs=70
--max_data_loader_n_workers="0"
--bucket_reso_steps=64
--mem_eff_attn
--gradient_checkpointing
--xformers
--bucket_no_upscale
--noise_offset=0.0
--sample_sampler=k_dpm_2_a
--sample_prompts="blah\prompt.txt"
--sample_every_n_epochs="4"
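As a sanity check on the numbers in that command, the step budget can be read backwards. This is a back-of-the-envelope sketch, assuming the step count was derived as images x repeats / batch size x epochs with no gradient accumulation; `samples_per_epoch` is a made-up helper name.

```python
def samples_per_epoch(max_train_steps, batch_size, epochs):
    """Image-repeats seen per epoch implied by a step budget (no grad accumulation)."""
    return max_train_steps * batch_size / epochs


# Values taken from the command above
print(samples_per_epoch(3325, 2, 70))  # 95.0 image-repeats per epoch
print(round(166 / 3325, 3))            # 0.05, i.e. warmup is about 5% of all steps
```

Under those assumptions each epoch covers about 95 image-repeats, and the 166 warmup steps amount to roughly 5% of the run.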
I can't really show you as I'm training on personal photographs of friends, but I'm generating images every 5 epochs from 300 in total. The sample images slowly converge from 0-250 after which I'm seeing a recognisable likeness to the training images.
Can you share your training command please? On my side my subject is already recognizable at the first epoch preview, there is few difference between the first preview and the last one.
I'm following Caith's original post here #🔧｜finetune message
Other question, how do Windows users display their training log in a TensorBoard? I saw that Google Colab provides a board but I don't know how to load my log files there.
If you run the GUI in kohya_ss, there's a button in the webUI to launch tensorboard and point it to the log folder
All right. Thank you for the link, I'll follow this guide. Regarding the webUI, I'll have to find out why I get python errors when using the "Train Model" button from kohya_ss UI. I spent hours trying to understand what was wrong. I don't have the console right now but the main problem was a bad interpretation of the generated training command, for instance the console said "resolution is mandatory" while it was correctly specified in the UI. I reinstalled the Bmaltais kohya_ss UI twice with no improvement. Thank you again.
Ah, last question (I hope), what should I use as inference model for my woman face training? I didn't understand if the base sdxl is better than specific models from civitai. Another tutorial maker wrote somewhere that the base SDXL model was too wide for that.
anybody know how to pass model metadata such as model output name and group name on kohya to wandb using args? do i need to go into the WandBtracker class to set the parser?
I cannot think of a reason why you'd ever want to train a LoRA on top of a different base instead of SDXL base. That way you keep it most compatible with every other workflow
if i want to set different text and unet learning rates, what should i be inputting in the red box? seems like you have to put a value of some kind in there.
It doesnāt matter, put anything. It will be overridden by the ones below.
sweet, thanks
does Memory efficient Attention, Gradient checkpointing, Xformers, or Full bf16 noticeably lower quality at all?
It's really hard to figure out which parameters were made for low vram cards vs. which ones are standard optimizations that everyone should use (like xformers for inference)
I'm going crazy with kohya's UI. I just loaded @hollow spruce Json, adjusted the directories, and bam, same shit again. I have reinstalled kohya three times to be sure. The python version is the right one, I don't know what to do. After these screenshots, I shortened the path to the files (in case it would be the cause) but same problem again.
is DPM++ SDE Karras not available when training?
If i'm training with a mix of these resolutions, do i need to enable bucketing? or is bucketing only necessary for resizing?
Increasing batch size absolutely wrecks fine details on faces. Time to try GA instead
i've read this 5 times and i still don't understand. i think im gonna train dim:128, alpha:1 and then dim:128, alpha 128 and see what the difference is.
@sonic mantle I think they divide the rank by the alpha to determine a maximum strength of learning. Either its the max value that the lora stores, or a multiplier to the learning rate. Either way having it higher just makes your lora 'weaker' so to speak. divided by a larger number
Thanks, that makes more sense. It's strange because I saw some examples on this blogpost and the lower alpha number always looked worse, but maybe that's because the initial dataset was so limited that the strength of training on a limited dataset worsened the outcome.
this is the blogpost btw: https://medium.com/@dreamsarereal/understanding-lora-training-part-1-learning-rate-schedulers-network-dimension-and-alpha-c88a8658beb7
A guide for intermediate level kohya-ss scripts users looking to take their training to the next level.
I read this:
For example, if Alpha is 16 and Rank is 32, the weight usage intensity will be 16/32 = 0.5, which means that the learning rate will only be half as effective as the Learning Rate setting.
From https://hoshikat.hatenablog.com/entry/2023/05/26/223229#LoRA-type translated to English
so a rank of 128 and alpha of 1 is 1/128 = 0.0078?
that doesn't sound like an optimal setting then
no, and if you do not specify an alpha specifically, it defaults to 1, so I don't know. Looking through the code to see if I can spot anything
Oh sorry that's not true:
if network_alpha is None:
network_alpha = network_dim
so it looks like it defaults to the dimension size, thus making the training modifier 1 which will do nothing
im starting to think dim 128, alpha 128 was the better option then
yeah, that would effectively do nothing, so you could safely play with other settings
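The arithmetic from the last few messages, as a small sketch. It assumes, as the translated quote and the snippet above suggest, that the LoRA update is scaled by alpha divided by rank and that a missing alpha falls back to the rank. `lora_scale` is a made-up name for illustration, not a function in sd-scripts.

```python
def lora_scale(network_dim, network_alpha=None):
    """Scale factor applied to the LoRA update: alpha / rank."""
    if network_alpha is None:
        network_alpha = network_dim  # same fallback as the snippet quoted above
    return network_alpha / network_dim


print(lora_scale(32, 16))   # 0.5
print(lora_scale(128, 1))   # 0.0078125
print(lora_scale(128))      # 1.0 (alpha defaults to the rank)
```

So with dim 128 and alpha 1 the update is scaled down by a factor of 128, which is usually compensated for with a much higher learning rate, while dim 128 and alpha 128 leaves it unscaled.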
lowering learning rate from 4e-4 to 3e-4 is having no effect at all on loss. same with 1e-4. interesting that people use these graphs to determine "overfitting" yet i can clearly over or underfit this training without changing the graph whatsoever.
misinfo abounds
the larger your lora, the more strongly it affects your model and the faster it learns. Reducing the learning rate counteracts this. alpha is a way of automatically adjusting the learning rate based on your lora size
I would keep alpha low to give the model more time to adapt and learn 🤷‍♂️ but that also depends on your training data
Hello, thanks to your help I could cook a working face LoRA tonight!
Now it's time to fine tune the details. I am noticing that the face is correctly used in close-up portraits or portraits, but as soon as the character is full body or half body, the face is not used at all. My 50 training images are almost all tightly framed around the face, but the caption text doesn't mention close-up or anything regarding the framing. Do I need to change something: add images, change captions?
you should have mid range images in your training data
changing caption alone won't help
having 10 images of your face that all have same angle and distance is usually useless
rather use less images but from different perspectives
Hi, I found that the BNK CLIP text encoder (the suspended node) has a tensor problem if the prompt is too long and difficult. So I tested how I could replace it with another one. I found another SDXL-compatible node that accepts long prompts, but as a point of interest I tried encoding the L and G prompts with separate non-SDXL encoders, and then used nodes to concat/average/combine the encoders' outputs. Combining them is the worst, but all of them are useful. I think I like the average node best because of its strength settings. Is it the right way to replace the SDXL CLIP encoders with 2 separate 1.5-compatible ones? Reviews or opinions welcome. (image contains workflow data):
By the way, I was saying that SDXL didn't generate nice pics using my new trained LORA using 2048px close up faces. I am currently generating images at 1.25 scale (1440x1120), they are all nice using my LORA. Interesting.
it is not upscale, but native resolution
which parameter is failing when your Lora data doesn't integrate well with the rest of the model? I trained a Lora on mostly closeup shots of keira knightley's face. But if i prompt something like standing on a ship in pirate clothes it will only generate a closeup of the face.
Does this mean the network rank was too low?
I believe i had it set at 128. what seems like a good target?
thanks so much.
and increase only if quality is not good enough
This is exactly what I complained about this morning. And @stiff dust told me that I had to insert mid range pics in my set. He didn't mention the rank.
as someone who has run 7 training sessions in the last 2 days, i know that feel.
it's different. If the model is not able to generalize (e.g. change clothing of the character) it is overfitted. In this case reduce rank and/or learning rate
I just added an interesting thing, if you increase the resolution of your generation, you should be able to set more distance with the face (see my example above)
do you have any insights on varying the unet LR from the text LR?
but the model will never be good at showing the character from an angle or perspective it was never trained on
to be honest, I wouldn't train the text encoder at all
or just train it for one epoch but not more
text encoder overfits much faster than unet
what is the text encoder? the auto caption tool?
In my case I used it, then I cleaned 50% of the data to be compliant with Caith's tutorial
it's the rate at which the model learns the relationship between your images and your captions
ok
but if you're just lora training someone's face there probably isn't much for it to learn
not really. SD was never trained with the text encoder
oh damn i didn't know that
the text encoder is frozen
training it makes sense if your captions contain something that is unknown to the text encoder
I was wondering if I can improve my model by adding new pics with emotions : sad, smile, pensive etc.
a name would be a good example. It can make sense to train the text encoder on a new name it doesn't know yet. But it overfits very quickly
haven't tried that yet for sdxl
I will
I'm really pleased to see that SDXL generates perfect images when setting +25% resolution on some checkpoints.
Some others don't appreciate
Quick question : is it possible and easy to install and plug a standalone comfyui? The one embedded in sd.next is broken.
not more difficult than installing sd.next
it has no builtin venv, though
I recently got my best quality and flexibility when using a very low LR. Captions (wd14 tagging with human characteristics pruned) reduced the artifacts on clothing SIGNIFICANTLY, but they also hurt likeness quite a bit and training didn't converge. I dropped the captions and got my best likeness and quality ever, but flexibility is lower again, with artifacts on clothing.
Still better than anything I was able to achieve when using celebrity name token. And scales across different ppl and img counts
Could you briefly convey your understanding of network alpha? I've heard differing opinions and never anything conclusive about what it actually does, nor have I ever seen experiments with useful results for different alphas. Is it useful at all?
Lora vs lora-fa.
In lora-fa epoch 10, if I put peace sign in G and L, it provides more likeness to original
What does something like "2e-1" refer to when someone speaks of making a lora?
lora-fa vs locon: locon is superior for styles, which is what I train.
all settings were exactly the same, I just switched between fa and locon.
it's strange that clothing captions hurt likeness... maybe you can try to make your name longer. Longer captions give the network more flexibility. So if you think that "photo of Peter wearing a red shirt and blue socks and green pants" works better than "photo of Peter" then you might try something like "Photo of Peter Widdlediddlebuggs" (add some last name with more tokens)
erm... that's just scientific notation. 2e-1 is 0.2.
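For anyone else who trips over that notation, here is a tiny Python check of how learning rates written this way read as plain decimals. The three values are just examples of the kind quoted in this channel, not recommendations.

```python
# "AeB" means A * 10**B, so 2e-1 is 2 * 10**-1 = 0.2.
examples = ["2e-1", "1e-4", "5e-05"]

for text in examples:
    value = float(text)  # Python parses scientific notation directly
    print(f"{text} = {value:.5f}")
```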
no, just keep it at 1.
The alpha is a scaling factor on the strength of your lora. alpha=1 means your lora is multiplied by 1/dim. An alpha equal to dim means your network is multiplied by 1 (thus, nothing happens).
The reason for alpha is that a network with high dim learns faster and gains more strength in a shorter time. When people experiment with Lora they often change just a single parameter between their runs. So they change dim from, say, 32 to 128 and find that after epoch 10 the image looks much better with the higher dim. However, dim 32 might look equally good if you trained it until epoch 20. It just looks worse because it trains slower.
alpha counters this effect to some degree by reducing the speed of training for high-dim loras. So using alpha=1 just means your loras with different dims are more comparable
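A minimal sketch of that scaling, assuming the alpha / dim convention described above. `lora_scale` is a made-up helper name for illustration, not a function from any training script.

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Multiplier applied to the LoRA update under the alpha / dim convention."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    return alpha / dim


for dim in (32, 128):
    # alpha=1 shrinks the update more as dim grows; alpha=dim leaves it unscaled.
    print(dim, lora_scale(1, dim), lora_scale(dim, dim))
```

With alpha fixed at 1, going from dim 32 to dim 128 makes the multiplier four times smaller, which is the "slows down high-dim loras" effect mentioned above.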
How should one think about the network rank? Complex dataset, larger network?
yeah, I would say so
Note that in large language models people tend to use Loras of rank 1 ^^ So you can learn a lot even with low ranks
In general Lora is based on a compression technique, so lower rank means higher compression means more compression artefacts. I found that the unet is a bit more sensitive to these artefacts. So using a too low rank lora for the unet is usually a bad idea. I would at least use rank 8 or 12 for the unet, maybe even higher. But you can try. You will see if artefacts appear.
For the text encoder you can use rank 1 or 2 and that's already enough. However, for some reason most scripts don't offer an easy way to use different ranks for the unet and the text encoder
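To put rough numbers on the "lower rank means higher compression" point, here is a sketch that counts parameters for one LoRA pair against the full weight matrix it adapts. The 1280x1280 layer size is a hypothetical example chosen for illustration, and the helper names are made up.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: down (rank x in) plus up (out x rank)."""
    return rank * (in_features + out_features)


def full_param_count(in_features: int, out_features: int) -> int:
    """Parameters in the full weight matrix the LoRA pair approximates."""
    return in_features * out_features


# Hypothetical 1280x1280 linear layer, used only as an example size.
full = full_param_count(1280, 1280)
for rank in (1, 8, 64, 256):
    lora = lora_param_count(1280, 1280, rank)
    print(f"rank {rank}: {lora} params, {lora / full:.2%} of the full matrix")
```

The parameter count (and so the file size) grows linearly with rank, which is why a rank 1 or 2 text encoder adds almost nothing to the file.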
Other than getting a large lora, what are the disadvantages of going for a too high network rank?
I heard from @hollow spruce. Too high network rank would damage the model.
yeah, you want the lora to be parameter efficient, i.e. to change the base model as little as possible
I'm not sure if it's really just the rank or more a combination of rank, alpha and learning rate, but you just don't want your Lora to overfit and "damage" the model
"damage" means: usually you train your model on a large variety of images. Training it only on a single subject damages the model, because it starts forgetting what it had learned before
Olivia showed an example on Twitter of 256 dim damaging the background of an image vs. 24 dim.
But other factors like LR could've affected that difference too. that was hardly a scientific test.
that's what I meant
just from a mathematical viewpoint I would say it's a combination of LR and dim that causes the effect
yes. maybe scheduler too
so a too high dim should not necessarily be a problem if you keep your learning rate low enough. But why would you want to do that?
better eyes
I don't think that you need high dim Lora for that
idk my config won't make perfectly circular and symmetrical eyes at any LR or epoch except 1e-4 and lower
If you decrease the network rank do you also increase the learning rate? Or is that only true if you fiddle with the alpha?
for perfect iris I found it helpful to add very few cropped super high res images to the training data
but in general problems like asymmetric eyes and so on are problems of SDXL itself
if you fix that with your lora then it's probably just because your lora is memorizing your training images
which is usually a sign of overfitting
idk, if I swap the token out for a random name or just man, eyes are nearly perfect
in theory, the better strategy would be to train on a large variety of photos of different people to teach SDXL to make perfect eyes and THEN train on your face
I feel SDXL is undertrained in many aspects, like eyes
Yes waiting on a good fine-tune that does that.
Wyvernmix has given me the most circular eyes but they're way too big (anime weighting)
hm, that's strange. My experience so far is that a lora on my face is sometimes making things wrong, but it is doing so at the same rate as SDXL is doing anatomy or eyes wrong on other people
There seems to be so many parameters affecting the lr one way or the other so it's hard to adjust accordingly.
Unfortunately for my use-case I have to assume the worst case scenario of 12 low quality images. I can't control what goes in, but need to guarantee it comes out good. which is easy until a high quality dataset is neutered by my "safe" config
yeah. damage was probably the wrong word to use.
Basically when you have a huge 256dim lora and don't train it on a metric ton of images, you'll quickly see things like details being forgotten/replaced. First the backgrounds stop having any detail and literally fade into these weird contrastless messes. Then colors become flatter as gradients stop showing up, and this continues until most "detail work" stops existing.
See how... undetailed SEcourses' images look?
Like you have created a way too big network in front of the base model
I wonder how much one should decrease the lr if you go lower on the network.
less about the learning rate, more a matter of how much new information you're actually inputting. how many new images are you working with?
like for 4k images, dim32/dim64 give essentially the same results for me.
so I find it hard to see genuine usecases where you "need" 128/256dim loras
I believe it but I need to see it A/B on the same seed and more levels than max and min, and whether details of the face are improving or not when close-up (surely 64 is nothing like 256). I'll do this test soon enough
cuz SEcourses outputs/prompts are nothing like mine
if you move away from photorealistic and into pure artwork, then high dims might work to some extent, since you don't care about the photo capabilities being forgotten
hmm the last time I went low I lost skin details and it felt more artistic
he has enough source images + regularization images, and has a face that is easy enough to train
do rarer ethnicities, and it gets a lot harder to replicate his results
I have a photo dataset, and I cropped the faces and saved them as another folder in the dataset. Would that help to increase the likeness?
my first test is always a bald Arab (our ceo). he was impossible to train on 2.1, XL is already infinitely better but still doesn't like his face
full body images almost never help with likeness so I'd say yes
do you caption him as Arabic?
nope
that'd be counter productive
(captioning a black person as black makes them white in outputs)
Hmm, then how will the model know what nationality you want to train for an example?
you could mention it in the prompt
use a proper name
or even use celeb with same ethnicity as token
my own name "Kai" for example is a typical German name, but in SDXL it's strongly associated with Japan. That's why I use the name "Christian" instead when training on my face
Is this specific to LoRA training only, or does it apply to other types of training too?
but if there are any Christians weighted heavily in XL dataset then you still have a problem. hence random token for stability, but hard to scale
all the same
a name like "Christian" is too common to be a problem
How do they train it for different nationalities? Just an example could be whatever concept.
I would go for a combination of a common first name and an uncommon surname
I find it unnatural to train it on names.
the nationality should come from the input images, not the captioning or token. it should already know if the person is Asian based on how they look in training
what dal wanna say is that the model should learn to associate your appearance (e.g. skin color) with your name
Sorry for asking so many stupid questions. I just want to understand the basics good enough.
to specify what about the images you want trained into the token. captions remind it what normal things are in the image so not all of it is trained
so you should only mention things you DONT want the token to remember about the images
Ok, so then if I want to adjust the DoF, film grain, colours and such I should probably just caption what's happening in the image?
Yes caption everything except for those effects
Like "a dry, barren landscape with a fence and hills in the background"
But XL TE training seems to be a fickle bitch and I have found captions to hurt face training if not utterly perfect
no idea for styles
fence might be in the foreground but...
What I've found is that the LoRA seems to be activating stuff that are in the same age of the items in the dataset even though it's not in the dataset.
For example cars get picked from the correct age even though it's not the same make, color and such.
a caption like "modern car" should fix that. specifying the "age" in captions should keep it out of training
For style I like it though
do you train text encoder?
as said, text encoder overfits extremely fast. If you train it, train it for very short time
the unet should be less sensitive to these things
The nationalities came from the images in the SAI training dataset, which were captioned with the nationalities. I think that is why the nationalities are so biased.
Yeah, I've done that. I will try with --network_train_unet_only next.
have you tried lowering LR on only the TE? one thing I plan to test soon
"african"
I still feel racist for tagging every black person as african in my master dataset, but it's the best working word for me so far
(Caucasian|African|Asian|Indian|etc...) <- every person in my dataset has one of these tags. Due to the quantity of total images + clip training, it works really well. But for smaller datasets this would obviously be counterproductive
yeah, sure, but that won't be enough
I usually train the text encoder first with dim=1 for only one or two epochs with a low learning rate and then continue training with unet only and a higher dim. But that's a bit complicated as kohya does not have command-line options for that
Not sure if kohya_ss is able to train text only for an epoch.
but I think there is a commandline option that allows you to stop text encoder training after certain amount of steps
freezing the text encoder. I always used that method for DB since the paper that tested it with Kramer. way better than full TE training and way better than without. never thought it would perform the same on Lora.
I'm pretty sure Kohya does have that setting
I'm searching for it in the options but can't find it.
How do you build your master dataset? Do you have a schedule or planned process for continuing to scale up your dataset? I'd really appreciate it if you could share.
I think it was a setting and was dropped due to problems
I usually spend a weekend to increase it by 1~3k images. manually edited & cropped, then manually tagged
once I hit around 10k it would probably make sense to train my own blip from it
I knew it was there! right in kohya
simply increasing it isn't hard - as I can just download photoshoots. But getting good diverse images, that look nothing alike is the high effort part. Usually via flickr where I filter out 90% of all images
Only in the GUI, it seems.
currently I'm still at 50% truly random images that I chose from flickr. 50% from photoshoots, for that super high detail
Thanks for sharing
yeah, I also tried.
yeah, kohya is sometimes a bit inflexible... I made a lot of changes to the code myself. For subject training it is usually sufficient to only train the cross attention layers. Text encoder training can be helpful if it's done with low dim (e.g. rank 1) and for a short time. Together it's totally possible to make a Lora with filesize <50mb
It's a bit confusing but there's kohya-ss and kohya_ss. I guess the first is the most upstream since kohya_ss seems to be merging with kohya-ss.
bmaltais/kohya_ss is the GUI/web version. kohya-ss/sd-scripts is the original command-line version that the GUI uses
kohya-ss/sd-scripts is the original source of training scripts
I use the scripts in kohya_ss for SDXL
And those do not seem to have an equivalent in kohya-ss/sd-scripts no?
kohya-ss/sd-scripts is the original implementation
will it handle sdxl or are modifications needed?
you have to checkout the sdxl branch
Thanks!
I'm oom-ing with pretty default settings (e.g. preset SDXL - adafactor 1.0) with a 4090 with kohya_ss. Is there any basic SDXL dataset and settings I can use to figure out whether the problem is with my GPU or something else?
in this case I CPU Ram OOM-ed after 174 steps (32gb), with other settings I GPU Ram OOM
What's your batch size?
1
And you also set these? cache_latents_to_disk gradient_checkpointing xformers
You can search for memory in here: https://hoshikat-hatenablog-com.translate.goog/entry/2023/05/26/223229?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=wapp
yeah I have those. Currently working with some of the other random presets
tho on all of the ones that run I only get 2.5it/s with a 4090, while others seem to report that and more with weaker GPUs, so it still feels like something is wrong
but at least I can test from here if it works for a full epoch and adjust
I think bf16 will help too
I would definitely use bf16. I would also try AdamW instead of Adafactor
yeh I use bf16 but havent tried adam
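Pulling the memory-related suggestions from this exchange together, here is a sketch that assembles them into one launch command. It assumes the sd-scripts flag names match the option names mentioned above (the same spelling with a leading `--`); the model and dataset paths are placeholders, and `build_low_memory_args` is a made-up helper, so check the flags against your installed version before relying on it.

```python
import shlex
from typing import List


def build_low_memory_args(model_path: str, data_dir: str) -> List[str]:
    """Assemble a launch command with the memory-saving options discussed above."""
    return [
        "accelerate", "launch", "./sdxl_train_network.py",
        f"--pretrained_model_name_or_path={model_path}",
        f"--train_data_dir={data_dir}",
        "--train_batch_size=1",
        "--cache_latents",
        "--cache_latents_to_disk",   # keep cached latents on disk instead of in RAM
        "--gradient_checkpointing",  # trade compute for lower VRAM use
        "--xformers",
        "--mixed_precision=bf16",
        "--optimizer_type=AdamW",    # the suggested swap away from Adafactor
    ]


# Placeholder paths, purely for illustration.
args = build_low_memory_args("/path/to/sd_xl_base_1.0.safetensors", "/path/to/train/img")
print(shlex.join(args))
```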
managed to run 4 epochs and the lora does seem to be more or less working
Hi I'm looking for some example sets images and captions for SDXL finetuning (not LoRa), just for starting off learning and testing. Does anyone know if there are any sets out there, paid or otherwise? Anyone know if SD have any sets available that they used as part of the SDXL training?
This will be my first finetune I try on my 3090 using the kohya GUI -> finetune.
This is the config I plan to use, I think I've gotten most of the settings right.
Hi - I'm training an SDXL LoRa to replicate the style of a lithograph artist. However, after 2400 steps I got this as a result. It's got the right vibe, but the actual "style" of the photograph didn't seem to transfer. Instead, a very soft, painterly style transferred over. I started to see the consistent style forming at 1200 steps, but it always stayed "oil/painterly" and never inherited the pencil/etched look. Would this be indicative that I need more steps / training? Or perhaps I've overtrained? This is being done with Kohya GUI, and 24 sample images at 100 repeats.
Try with fewer repeats and more epochs. Did XL have any idea about this concept, or did something different come out each time you tried? That last part will tell you if a lora will do it (btw, I have never had luck with lora for styles; it is better to go with locon, as it trains 1 layer deeper, which makes it better for styles).
It probably knows about lithographs, but I didn't keyword anything, I only used the unique keyword. Interestingly, it veered away from lithographs and towards the painterly look, probably because it knows a lot more about painterly looks than it does about lithographs.
I'll try your suggestions, thanks!
XL is a new beast. Do not use unique keywords.
Oh, really?
I was having issues until I saw a vid that said that but my findings were already showing to not use uniques
With two TEs they fight each other so only train the unet
Should I just generate captions for each image instead?
And for style training, should my captions be more related to the content of the image or the styles that most closely match it?
I gotta tell you about XL. You know what captions are all about, right? The TE. I tested this, and caption or no caption, captions are worthless unless you train the TE. Once I did that, the results were instantly different depending on whether I used captions or not. I no longer screw with captions UNLESS I am daring to screw with the TE.
I feel it is important to mention that repeats and epochs are interchangeable when it comes to total step count, but
if you are using regularization, use high repeats on the img folder
if you use 1 repeat, only a few of your reg images will be used (the same number as your training images). if you've got a big reg folder from the SEcourses or Aitrepreneur patreons or elsewhere, use the maximum number of repeats you can before (training imgs * repeats > reg imgs). you want a unique reg image per training image repeat, so you want at least as many reg images as (img count * repeats).
repeat_1 may be the easiest way to calculate max steps but it basically nullifies your regularization
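The rule of thumb above can be sketched as a small calculation. It assumes, as described, that one reg image is consumed per training image repeat; the function names and the 18 / 600 counts are illustrative only.

```python
def max_repeats(num_train_images: int, num_reg_images: int) -> int:
    """Largest repeat count with train images * repeats <= reg images (at least 1)."""
    if num_train_images <= 0:
        raise ValueError("need at least one training image")
    return max(1, num_reg_images // num_train_images)


def reg_images_used(num_train_images: int, repeats: int, num_reg_images: int) -> int:
    """Unique reg images consumed, assuming one per training image repeat."""
    return min(num_train_images * repeats, num_reg_images)


# Illustrative counts: 18 training images and a 600-image reg folder.
print(max_repeats(18, 600))          # highest repeat count before reg images run out
print(reg_images_used(18, 1, 600))   # with 1 repeat only 18 reg images are touched
print(reg_images_used(18, 20, 600))  # with 20 repeats, 360 of the 600 are used
```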
agreed.
Though a lot of model makers have ditched repeats for more epochs as they say it makes better, more refined, models.
I hate dealing with reg images it is a pain in the ass.
How do you think about batch size?
I mainly do styles and use BS 8. On my 4090 that is the sweet spot for the lowest training time. I understand that for people BS 1 is said to be best, but that's still up for debate.
I wonder what Ejektaflex's class token should be if he should be skipping captioning or can you skip that too somehow?
my token is a known style to XL. For instance, my released locon has the keyword Cartoon. The fun one I just did as I learn how to do people was segal (for steven segal).
there will be a v2 of segal as I am not satisfied with it but I lacked images mainly
for a class token tbh I don't use them as I don't regularize a style as I want it to over take everything
Can you have multi word tokens?
did you have only "steven" or "steven man"?
I wonder if some stuff are way more trained in the base model and hard to change or if it doesn't matter much as long as you have a match and don't have to train the text encoder.
do you mind sharing your json config for lora training @restive bridge ?
I won't have the config on me for a couple days but it's something like
18 imgs, 1e-4, 20 repeats, 5 epochs, batch size 3, adamw, constant w/ warmup, "ohwx man", bf16, dim 64, 600 real reg photos, no captions.
That's just the current point in a perpetually evolving recipe. It's not something I'd recommend to anyone. It worked well on just one test, and I have no idea how it performs at scale yet. It's not efficient either: on a 24gb gpu there's a lot of vram headroom, but raising the batch size seemed to hurt quality, or maybe that's because I didn't (and can't) evenly divide the image count by the batch size, which is the correct way.
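For reference, here is how the step count for a recipe like that works out, as a rough sketch. It leaves regularization images out of the count (they may add to the real number of steps a trainer reports), and the helper names are made up.

```python
import math


def total_steps(images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Optimizer steps: ceil(images * repeats / batch_size) per epoch, times epochs."""
    steps_per_epoch = math.ceil(images * repeats / batch_size)
    return steps_per_epoch * epochs


def divides_evenly(images: int, batch_size: int) -> bool:
    """True if the image count splits into full batches with nothing left over."""
    return images % batch_size == 0


# The recipe above: 18 images, 20 repeats, 5 epochs, batch size 3.
print(total_steps(18, 20, 5, 3))
# Same total with repeats and epochs traded off: 100 repeats, 1 epoch.
print(total_steps(18, 100, 1, 3))
# Batch sizes that split 18 images evenly, out of 1..8.
print([b for b in range(1, 9) if divides_evenly(18, b)])
```

Both calls give the same total, which is the "repeats and epochs are interchangeable for step count" point made earlier in the channel.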
Thanks for sharing. it's just that it seems you're also trying to find the best configuration for a limited number of images, while most here are talking about the best config with a high number of images.
that's true. I've used all the configs people share here and always have to adjust it a lot to work better on limited data
What workflow are you using to confirm whether the lora is well trained? the samples during training are one thing, but whether the lora is under/overtrained needs to have more styled prompts applied
lately I start with a basic photoshoot prompt and check for likeness first. if it passes I try a vintage prompt. if over fitted it will often fail to put the person in black and white, in which case I roll back an epoch til it works right. if likeness is still there at that point I move to a heavily stylized prompt that forces an intricate outfit and environment. it will either pull likeness away, or have a ton of artifacts everywhere, or by some miracle will work good. at which point I'd try a couple more prompts and train again on different images which rarely ends well and the process restarts
anyone got any tips for automatically capturing still frames from videos with minimal motion blur?
That topic belongs in an animation channel
Not necessarily, he probably wants to capture stills to train
Personally I just capture the stills manually while watching, since I want to capture exactly what I want.
So I can't help much.
yeah training an analog film lora requires stills from analog films
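On the original question about picking frames with minimal motion blur: one common heuristic is to score each frame by the variance of its Laplacian and keep the highest-scoring ones. Below is a pure-Python sketch of that scoring on grayscale frames given as nested lists. Decoding the video into frames (with ffmpeg, OpenCV or similar) is left out, and the function names are made up for illustration.

```python
from statistics import pvariance
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # one grayscale frame as rows of pixel values


def laplacian_variance(gray: Gray) -> float:
    """Variance of a 4-neighbour Laplacian; higher usually means a sharper frame."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return float(pvariance(responses)) if responses else 0.0


def sharpest_frames(frames: Sequence[Gray], keep: int) -> List[int]:
    """Indices of the `keep` sharpest frames, returned in original order."""
    scores = [laplacian_variance(frame) for frame in frames]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy data: a flat (featureless) frame versus a high-contrast checkerboard.
flat = [[0] * 4 for _ in range(4)]
checker = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]
print(sharpest_frames([flat, checker, flat], keep=1))  # prints [1]
```

A score like this only ranks frames relative to each other within one video, so picking the top few per scene is likely more useful than a fixed threshold.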
Is there an advantage to using both captions and tags when fine-tuning the SDXL base model, or just one or the other?
anyone know the cause of this error on kohya_ss: HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/workspace/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors
'. Use repo_type argument if needed.
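One thing that stands out in that error is that the quoted path ends with a line break before the closing quote. If that break is really in the model path field (and not just an artifact of pasting here), the string will not match a local file, and the loader may then fall back to treating it as a Hub repo id, which would explain the message. That is a guess from the error text, not a confirmed diagnosis. A small sketch for checking a path before launching; both helper names are made up:

```python
from pathlib import Path


def normalize_model_path(raw: str) -> str:
    """Drop leading/trailing whitespace, including stray newlines."""
    return raw.strip()


def check_model_file(raw: str) -> Path:
    """Return a Path to the checkpoint, or raise if it is not a local file."""
    path = Path(normalize_model_path(raw))
    if not path.is_file():
        raise FileNotFoundError(f"not a local file: {path}")
    return path


# The path as it appears in the error, including the trailing line break.
pasted = "/workspace/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors\n"
print(repr(pasted))
print(repr(normalize_model_path(pasted)))
```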
Hello! My understanding is that SDXL was trained on images with a variety of aspect ratios. However, most of the Python training scripts I've seen involve reshaping the image to a set resolution across the train/val splits. What's the best resource for training an SD model on varying aspect ratios?
You using .from_pretrained() in the HF library? Looks like it's not a fan of the path you provided. I would try:
- ensure the path itself is correct. Are there any typos? Currently it seems that workspace is under your system's root directory, is that true?
- specify the path to the parent folder, not to the .safetensors file itself
Yea I'm using a safetensors model locally, not sure why this hf library is coming up
What code are you using to load your model? The HF library is very commonly used to load/run models
I'm running kohya_ss on runpod. The model in question is local to the volume @gilded kindle
Its the source model tab
Under LoRA training section
here's the full error
and the cli command: accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="/workspace/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors
" --train_data_dir="/workspace/S0r4_train/img" --resolution="512,650" --output_dir="/workspace/stable-diffusion-webui/models/Lora" --logging_dir="/workspace/S0r4_train/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-05 --unet_lr=0.0001 --network_dim=128 --output_name="S0r4_10-01_p1" --lr_scheduler_num_cycles="12" --no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="6960" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="1" --clip_skip=2 --keep_tokens="1" --bucket_reso_steps=64 --mem_eff_attn --shuffle_caption --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.05 --wandb_api_key="9328358809ad058d08c0f5e53cfc7f91f3d661b4" --sample_sampler=euler_a --sample_prompts="/workspace/stable-diffusion-webui/models/Lora/sample/prompt.txt" --sample_every_n_steps="270"
Hello, I'm trying to use Kohya to train a model of mine so I can switch over to using SDXL. Apparently you need to use an XL base model or something for SDXL, or so I'm told by my friend.
Anyway, I have a 6GB card, but when I try to run Kohya it just stops because CUDA runs out of memory. Is there any way in the files I can tell it my max GPU size is 6GB?
So, gang, anyone have any thoughts on how I would train a SDXL LoRa on what a keytar is? I've done a few test runs, and thus far I've not been able to get it to really get the hang of it at all. At the moment I'm using ~30 images (a few of just keytars alone, the rest of people playing them), and no regularization images (not sure what one would even use for those), and captions along the lines of a woman on a stage playing a keytar, keyboard (instrument) and the results aren't any better than asking stock SDXL for a picture of someone playing a keytar, and might actually be worse. I've trained people before, but not objects like this. Anyone have any pointers on doing this kind of thing?
How do you deal with a size mismatch between the checkpoint/safetensor's file and the "current model"?
Please help. This obstacle is truly frustrating.
I find that the model I finetuned from base SD 1.5 works well for txt2img, but when I use it for img2img with ControlNet it doesn't work well, e.g. it generates blurry faces when I use softedge. Some models on Civitai can generate nice faces even with ControlNets like openpose or softedge. My model seems to be more seriously affected by ControlNet. Does anyone know why this happens?
Sorted
Thank you so much for this. It's really useful to keep for reference. Interestingly, someone (Robert Jene) suggested that alpha scales the power of the captions, so the training listens more to the captions than to the CLIP encoder. I'm not sure I remember this entirely correctly, but I gather this is about how the text encoder is affected, which sort of makes sense because when the alpha is high, learning of concepts seems to go up but it also seems to damage the underlying model's knowledge, giving bad anatomy or badly rendered colors (in the case of photos, for instance). I don't know if this is all wrong. Does it sound like there's anything accurate in this description?
no, that sounds wrong. alpha has nothing to do with captions, you could train without captions and alpha would still have the same meaning.
It's really just how strongly the lora is applied to the model. higher alpha means stronger lora. However, the lora also gets stronger the longer you train it. So lower alpha means you have to train it longer to reach the same effect
Ok that makes sense. I have a question that is somewhat related. I've noticed when testing a LoRA in webui that using some of the tokens from the training captions will suddenly reveal a very overfitted LoRA. So I imagine a good test for new LoRAs is to also run a prompt using some of the original captions to see if overfitting happens. Would this be a correct way to test for overfitting?
yes. I would always use trigger words anyways, as this is the fastest way to train loras (except when you use caption dropout)
No, I don't mean solely the trigger word. I mean certain descriptive words in the caption... like a leather jacket and an accessory, which upon use in the prompt immediately make the gen look like one of the dataset images. Whereas the trigger on its own generalises the likeness of the character
yeah, if a word that occurs only in part of the training images has a strong effect on the lora, then this is a clear sign of overfitting
In the cloneofsimo Lora repo there was a feature to add new tokens to the tokenizer and introduce new embeddings for them, optionally initialized from existing ones, as part of LoRA training. Are there any reasons this type of approach has fallen out of favour in other training scripts/repos/workflows? One can achieve this with, for example, sd-scripts too, as there is TI. But I'm just wondering if there's a good reason why it's not more often suggested as part of training style and character LoRAs.
We're seeing with SDXL that it is fairly common for people to only train the unet, and given that, it would seem that performing a TI before training a LoRA should be beneficial to unet training.
6gb is probably not enough for training. You can use cloud compute to do it
Anybody here going to CogX Festival '23 London? https://stabilityaicogx.splashthat.com/
could someone help me with img2img upscaling? I've been trying to use this controlnet ultimate sd upscaling method and the results look like what I see when I look into my trash can.
I agree, it's super annoying that there is no easy way to bundle TI and Lora.
I wish I could set different learning rates between training images and regularization images. I just discovered that its the regularization giving me clothing/environment artifacts. But dropping the reg wrecks the face quality.
Is that bad?
I was even getting images that looked more like the reg images than the training images
such a bummer. at this point I'm forced to choose between likeness or flexibility. if only captions worked better
how much did you experiment with having less/more regdata and the ratio of training data to reg data (e.g. via repeats)
Has no one else encountered this issue?
I thought there is a regularization strength parameter
I tried the same amount of reg as training, twice as many, and 600, with 1-20 repeats on imgs. 600 (one unique reg per img repeat) gave the best faces.
where is that?
I assume he is referring to https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters#prior-loss-weight
wow, can't believe I never realized what that setting meant. That should fix some problems. Thank you both
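For anyone else who skipped over that setting: the idea behind a prior loss weight is that the loss from regularization images gets multiplied by a factor before it is averaged with the loss from the training images. Here is a toy sketch of that weighting with plain numbers; it illustrates the concept and is an assumption about the general approach, not code taken from kohya's scripts.

```python
from typing import Sequence


def weighted_loss(
    losses: Sequence[float],
    is_reg: Sequence[bool],
    prior_loss_weight: float = 1.0,
) -> float:
    """Mean per-sample loss, with regularization samples scaled by prior_loss_weight."""
    if len(losses) != len(is_reg) or not losses:
        raise ValueError("need one is_reg flag per loss, and at least one sample")
    weights = [prior_loss_weight if reg else 1.0 for reg in is_reg]
    return sum(loss * w for loss, w in zip(losses, weights)) / len(losses)


# One training sample and one reg sample with equal raw loss.
print(weighted_loss([1.0, 1.0], [False, True], prior_loss_weight=1.0))
print(weighted_loss([1.0, 1.0], [False, True], prior_loss_weight=0.5))
```

Lowering the weight makes the reg images pull less on each update, which is roughly the "different learning rate for reg images" wished for above.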
I'm looking forward to hearing about your results!
by the way, did you find the reg images from SEcourses or Aitrepreneur more useful?
SEcourses if the goal is just faces. Aitrepreneur's had way more dynamic poses, with more focus on actions than faces. I never directly compared results tho
Does anyone know about the cropped training in the SDXL report? What is the benefit of this type of training? I am planning to create a tool that crops the training set based on a user prompt to create a subset for LoRA training
Am training using the prodigy learning rate scheduler. Anyone understand it? It seems that learning rate changes over time depending on the ratio of norm and key norm. But I cant find what 'key norm' means and how it is different from norm.
It would be best to use ControlNET Tiles along with Ultimate SD Upscale to get coherent, good quality upscales.
This comment I made on Reddit might help you get better upscales:
Hello community!
Could you please share how to fine-tune on different kinds of images, not just a single type of image as in DreamBooth?
Ex: the training images for a model would show different items (shirt, pants, shoes, watch, etc.), each with its respective prompt for training
Could you please let me know how I can do this?
Thank you
Hey all, still trying to wrap my head around regularization images. What exactly should they be? Are they meant to be a good example of the class or the literal model output of that class?
IE I'm training a "Steve face" Lora, my regularization images would be "Man face"
Do I use a dataset of high quality examples for the "Man face" class? Or should I hop into A1111 and crank out 800 literal "Man face" images with no negative prompts, presumably using the checkpoint I am training with?
I'm getting even more confused by conflicting information online about the value / necessity of even having regularization images. There is a lot of misinformation going around
take your captions, remove the instance tag, generate with them. Make sure they have a caption file similar to the image dataset minus the instance tag.
do not fiddle with them to get "good results"
so lets say image 1 is - steve, riding bike, blue shirt. the regularization would be man, riding bike, blue shirt?
it would be
riding bike, blue shirt
Adding man is a good idea, but it should be a part of the regular dataset caption as well, or whatever the man face is on
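A rough sketch of the caption step described above: take each training caption, drop the instance tag, and use what's left as the prompt/caption for a regularization image. The function name and the comma-separated tag format are assumptions for illustration, not part of any trainer.

```python
def make_reg_caption(caption: str, instance_tag: str) -> str:
    """Return the caption with the instance tag removed (comma-separated tags assumed)."""
    tags = [t.strip() for t in caption.split(",")]
    kept = [t for t in tags if t and t.lower() != instance_tag.lower()]
    return ", ".join(kept)


# Example from the conversation: the instance tag is "steve".
print(make_reg_caption("steve, riding bike, blue shirt", "steve"))
# riding bike, blue shirt
```

The resulting string would be used both to generate the reg image and as its caption file, without tweaking it for "better" results.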
gotcha
regularization images help to prevent overfitting and limit what the model adds. You don't have to add them, but they can help if you're having overfit issues. I know of an example of a lora with and without reg images, but it's not sfw so can't share here. I can say that the regularization images helped make it more flexible though.
Thanks so much! Is there a good place to learn these concepts from the engineering side? Much of the current youtube content feels a bit like rando content creators producing "guides" when they don't know what they are doing
I don't know any good guides or sites. I learned through discord. You could search here if you want to, but unfortunately there isn't a ton of info on regularization images since few people use them.
I also recommend using the sampler that you'll be training on for the reg images, which is usually DDIM.
oh yeah, forgot to mention this, but regularization images DO NOT have to match the name of the image dataset. They are not paired. You only have to make sure the reg caption file matches the reg image name.
and make sure they're the same resolution as what you'll be training on
@hollow spruce Hello, I did an experiment with anatomy subsets for LoRA training. I used GroundingDINO to crop the original training images into face and hand subsets. I hand-picked the good crops, added [face focus, hand focus] to each subset, and kept all the wd14 captions. The main dataset captions were generated by wd1.4, with a prefix added, then human-reviewed. It somewhat improved hands in the generations, but I still think it is undertrained. All datasets are set to 3 repeats, and I am trying to increase the subset repeats to 10 to see if it helps.
new lora vs old lora. The main dataset is the same. The old one was trained multiple times with different settings. The new one was trained once. Both took around 16 training hours on a 3090.
The old lora's text encoder seems to be broken.
The new one seems to be undertrained
thanks for sharing. what training settings did you use?
I can share that later when I'm at the computer. But basically unet training only, 3000 steps, 4e-4 and a keyword for the dataset. Tried to train for a specific film stock.
I think one can see tendencies for tiny bit of overfitting in the higher rank. Look at the missing tents in the 128. Prompt included campsite.
Could we train embedding for SDXL?
so is it normal that lora training takes over certain haircolors and styles and i cant get rid of them?
i cut out all faces without hair and train the lora with only faces lets see if it works
It is usually because your captions leave out the hair colors and styles, so the lora learns them as part of the character
you mean that i dont have it in my prompt?
or what you mean with caption
Usually, you would have a .txt file which is a pair of your image. 1.png should have 1.txt. Inside the text, it should have prompts to describe the paired image. The "caption" means the .txt file
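To make the pairing concrete, here is a small sketch that checks a dataset folder for images without a caption file and caption files without an image. The extension list and the `.txt` default are assumptions; adjust them to whatever your trainer is configured to read.

```python
from pathlib import Path

# Assumed set of image extensions; extend as needed.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def find_unpaired(folder, caption_ext=".txt"):
    """Return (images_without_caption, captions_without_image) as sorted lists of file stems."""
    files = list(Path(folder).iterdir())
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem for p in files if p.suffix.lower() == caption_ext}
    return sorted(images - captions), sorted(captions - images)


# Usage (hypothetical path):
#   missing, orphans = find_unpaired("train/3_woman")
#   print("images without caption:", missing)
#   print("captions without image:", orphans)
```

The same check works on a reg image folder, since reg captions only need to match the reg image names.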
ah yeah i know that cause ive done it already. But you mean it happens because i didnt describe the haircolor and hairstyle on the text files properly?
anyways i trained it with only the cutout face without hairs and it worked very good
i first cut out the faces with paint.net lasso tool
If you want to have the flexibility to change the haircolor and hairstyle with the image which isn't cutout. You should mention them in the caption.
then i removed background with the a1111 extension (couldve maybe saved them as png to avoid that step) then in the text files i only described the face and now it works much better
Multiple ways to achieve the same result. It is SD. It has no right and wrong.
I think with my pics the best decision is to cut out, cause my pics contained a complex haircolor and hairstyle and it seemed like describing it didnt work
i dont think the text files were the problem for me
i described them pretty good
unless separating the prompt in the text caption with commas has an effect
My summed up theory: If you want to train a lora model just for a face of a person, cutting out only the face/head without the hairs and body is the best/fastest method
If you use reg images, it is fine
I didnt use them yet
You might experiment to test your theory
@latent charm hey have you played with anymore captioning scripts?
I have tried that. But for my lora training, the only useful one is wd14
I made a preprocessing tool which would crop images from dataset as a subset. After that, I would use wd14 to caption the images. WIP
Cropping in to hands and faces? But then you need to upscale with high quality like topaz or latent upscale with SD no? To get proper detail quality of the focus?
I do all my focus crops manually straight out of topaz crop/upscale
I didn't do any upscale for the cropped image yet. But I deleted too small images.
I use GroundingDINO to auto-select the focus region from a provided prompt like face or hand, and keep the detections scoring > 0.5. After that I review the cropped images and remove some.
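A minimal sketch of that filtering step, assuming the detector output has already been converted to plain dicts with a `box` (x0, y0, x1, y1 in pixels) and a `score`. That dict layout and the `min_side` size cutoff are made up for illustration; this is not GroundingDINO's own API.

```python
def keep_crops(detections, min_score=0.5, min_side=256):
    """Keep detections above the score threshold whose crop is not too small."""
    kept = []
    for det in detections:
        x0, y0, x1, y1 = det["box"]
        width, height = x1 - x0, y1 - y0
        if det["score"] > min_score and min(width, height) >= min_side:
            kept.append(det)
    return kept


# Hypothetical detections for one image.
detections = [
    {"label": "face", "box": (100, 80, 612, 592), "score": 0.83},  # kept
    {"label": "hand", "box": (700, 400, 790, 500), "score": 0.71},  # too small
    {"label": "hand", "box": (50, 600, 400, 1000), "score": 0.42},  # low score
]
print([d["label"] for d in keep_crops(detections)])
# ['face']
```

Manual review of the surviving crops still comes afterwards, as described above.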
what's the best way to finetune with ~50k images?
Sharing some selected results with this config.
The LoRA was trained in two cycles.
The first one trained for 16 hours on a 3090.
The second one trained for 8 hours, and I reduced the text encoder learning rate to half of the original, which is 0.000025.
The dataset contains 3 folders, 3_face, 3_hand, 3_woman. Total around 750 images.
The woman dataset contains original selected photo of the person.
The face and hand datasets were auto-cropped from the woman dataset with my preprocessing tool.
After that, I removed small images and complicated images (especially in the hand dataset).
I tagged them with wd14 and deleted wrong tags, keeping all other recognized tags.
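For reference, the folder names above follow the `<repeats>_<name>` pattern, so `3_face` means each image in that folder is repeated 3 times per epoch. A small sketch of how the per-epoch sample count falls out of that; the per-folder image counts below are invented (only the ~750 total comes from the message).

```python
def parse_folder(name: str):
    """Split a '<repeats>_<concept>' folder name into (repeats, concept)."""
    repeats, _, concept = name.partition("_")
    return int(repeats), concept


def samples_per_epoch(folders: dict) -> int:
    """folders maps folder name -> number of images in that folder."""
    return sum(parse_folder(name)[0] * count for name, count in folders.items())


# Hypothetical split of the ~750 images across the three folders.
folders = {"3_face": 250, "3_hand": 250, "3_woman": 250}
print(samples_per_epoch(folders))
# 2250

# Raising only the subset repeats to 10, as considered earlier in the thread:
print(samples_per_epoch({"10_face": 250, "10_hand": 250, "3_woman": 250}))
# 5750
```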
set LoRA network weights to your lora and continue training
You still use the same model. You need to set the LoRA network weights to resume from the previous LoRA training.
I don't know what you mean by "low pixel". If you mean blurriness, noise, or any other unwanted effect that appears in your generations, it might be related to your dataset
Image resolution isn't related to the lora
So i trained lora with a character and i described the hairstyle in every text caption but still it sticks to the hairstyle from the character until i go down to like 0.6 weight
whats a good way to fix that?
Do you have different hairstyles or only one hairstyle?
only one
You might try to add more hairstyles into dataset. It seems this hairstyle is overfitted to the lora
i would have to manipulate the pics then because the character i use only has 1 hairstyle yet, well one other hairstyle i could add
or ill just use inpaint for other hairstyles
i thought maybe theres a way to make the AI mix the hairstyle of the lora with the rest of the dataset
3000 steps (55 epochs), 220 images, constant AdamW, unet only, lr 4e-4, batch size 4.
ref, 8, 6, 32, 64, 128
gen strength model: 0.6
gen strength clip: 0.9
a woman with eyes that have seen too much, enveloped in the twilight of a dense pine forest, with remnants of a long-abandoned campsite
Looking at only these I think network of 8 does the trick.
hello, I have a question about training a custom model with dreambooth
not sure if this is the right channel tbh
but lets go
My question is about instance prompt and how unique each word should be
Say, if Ive named my instance prompt as "photo of humanoid2dside person", would the model understand the string "humanoid2dside" as completely new and unique argument during generation?
By no means the model should understand the prompt "photo of humanoid2dside person" as "photo of humanoid 2d side person"
Im not entirely sure how stable diffusion recognizes each word contained in the prompt during generation.
But for my use case, a prompt with "humanoid2dside" should not be the same thing as a prompt with "humanoid 2d side"
hmmmmm
thanks, I was not aware of this
so that means with the string "humanoid2dside", stable diffusion would identify the tokens "human", "oid", "2" and "dside"?
I see
probably I should use a more unique name then
I will try some strings on the tokenizer to see what works
thanks man
That's fine once you start to train the text encoder. It will learn your token, humanoid2dside, for your images
but once trained properly, is it guaranteed to always consider humanoid2dside as a unique token?
You don't need a "unique" long string for your training, you just need something that won't cause conflicts. That should be enough.
I will try it out to see if it works then
having a means to at least know if my token is being identified properly is already a step forward
thank you
I haven't been clear enough. I'm not training for characters.
that makes sense
Should probably try dreambooth next to see what I might end up with.
Is there a way to "negatively caption" things during training? I'm training a style, and every once in a while, real images bleed into the results. I can negative prompt photograph during generation, but I'd like to stop this from happening during the training phase so that it's not a burden for people who use the LoRa, if possible.
I know that I could use regularization images, but since I'm training a generic artistic style, the time needed to create and generate a wide diversity of regularization images seems rather excessive. I was hoping that there would be an easier way of training out specific elements from the resulting style.
There is no way to "negatively caption" during training. Your caption is paired with the image. You should add a tag describing what you want to separate out, for example "photograph". Once photograph is in your captions, you can add photograph to the negative prompt at generation time.
How much estimated VRAM will u need to do full finetune of SDXL?
Doing a LoRa on 24gb 3090 with Kohya_ss took about 22 hours with batch size 1, repeats 1 and epochs 90
for lora you need not more than 12gb. With a 3090 you can do batch size 10 without problems
training should be fast, but of course that also depends on your number of training images. But if you have so many images that it takes 22 hours I would increase batch size
I had 80 images
my goal is later to do full fine tune of a lot of different concepts
up to 5000-10000 images
will I need more Vram?
I got really good results with training unet-only on rare tokens.
you don't need 22hours for 90 pictures...
ill send the json I used
@stiff dust
number of repeats on image folder was only 1
all images were 1080x1350
looks right. I would try a higher batch size. I have a 3090, too, and it definitely doesn't need that much time
hm okay, ill try again and see how it goes
Are you using another program like comfyui or webui together while training?
I don't know how to work this and it uses jax, but this might help you with vram requirements for a finetune if you have a few more 3090s
https://github.com/lodestone-rock/SDXL-sharding/tree/main
im trying to train basically a character with lora. because i only have 8gb of vram im using Google Colab. my training images are all realistic images and if i generate some images with that lora i only get this realistic style, even though i use e.g. a more drawn, anime style checkpoint. what could cause this kind of behavior? i had about 30 images with 12 repeats and 10 epochs (i tested all epochs, same result). also how would i go about it if i want multiple of these characters in one shot? what tags should i use when training? like "2MY_CHARACTER" and "1MY_CHARACTER"? or whats the best way to do it so i can use it later in my prompt
use "a photo of [your character]" in the captions while training, and replace "photo" with another style at generation time
so i dont have to put in the quantity of my characters which are in the images? in some photos the characteristics of my character are seen on 2 entities. so i dont have to put "a photo of 2 (my character)"?
also could you elaborate on "replace the photo to other style"? im not sure what exactly you mean?
- use the word "photo" in your captions
- use a rare name for your character (not: john hammerfall, better: john hmrzufl)
- only train unet, never train Text encoder
when you put photo in your captions, the lora learns that these are photos of your character. After training finishes, when you generate images with your lora, you can prompt for a style other than photo.
I've put the term realistic in but I will try it with the tag photo as well, thank you
seems currently A6000 is minimum requirement to do full finetune of SDXL
Looking at SimpleTuner from bghira
Can i add random photos with hairstyles and another prefix in the caption to my lora training images to be able to mix more hairstyles and colors to my character?
Funfact you dont need more than 8GB Vram to train on your pc, you just need to enable two settings in the parameters and it will work with 8GB, but Colab or runpod is faster
It should be able to, if it's trained properly
I enabled the text encoder in training. When I train the same lora multiple times, how can I tell when the text encoder has been trained enough and stop the te training?
Did It again with 8 batch and took 2h 30 min
thanks
wouldnt using a common token like "john" as a rare name be a problem, since the base model might have been trained with other images identified as "john"?
the original Dreambooth paper suggested to use "sks person". However, "sks" has a meaning and is not that rare, so using something like "tdjvr" might make more sense. Instead of person you can simply use a real name. It transports more information (john -> male, western culture) and is more natural to prompt.
is there any documentation about these DW openpose arguments?
It is finetune channel. You might ask in SDXL
ok
the "stop text encoder training" parameter doesnt work anyways:( but I'd assume as always the only way to find the right freeze point is experiment and see
im training a sdxl lora with 163 images using Kohya at 1024,1024. It says it is going to take 5 hours? Is that normal? GPU RTX 4080 (16 GB), i9-13900K, 32 GB RAM, M.2 drives
Here is the configuration
{
"LoRA_type": "Standard",
"adaptive_noise_scale": 0,
"additional_parameters": "",
"block_alphas": "",
"block_dims": "",
"block_lr_zero_threshold": "",
"bucket_no_upscale": true,
"bucket_reso_steps": 64,
"cache_latents": true,
"cache_latents_to_disk": true,
"caption_dropout_every_n_epochs": 0.0,
"caption_dropout_rate": 0,
"caption_extension": "",
"clip_skip": "1",
"color_aug": false,
"conv_alpha": 1,
"conv_block_alphas": "",
"conv_block_dims": "",
"conv_dim": 1,
"decompose_both": false,
"dim_from_weights": false,
"down_lr_weight": "",
"enable_bucket": true,
"epoch": 10,
"factor": -1,
"flip_aug": false,
"full_bf16": false,
"full_fp16": false,
"gradient_accumulation_steps": "1",
"gradient_checkpointing": true,
"keep_tokens": "0",
"learning_rate": 0.0004,
"logging_dir": "<>/KOHYA/LoraPics/finalenvoy\log",
"lora_network_weights": "",
"lr_scheduler": "constant",
"lr_scheduler_args": "",
"lr_scheduler_num_cycles": "",
"lr_scheduler_power": "",
"lr_warmup": 0,
"max_bucket_reso": 2048,
"max_data_loader_n_workers": "0",
"max_resolution": "1024,1024",
"max_timestep": 1000,
"max_token_length": "75",
"max_train_epochs": "",
"max_train_steps": "",
"mem_eff_attn": true,
"mid_lr_weight": "",
"min_bucket_reso": 256,
"min_snr_gamma": 0,
"min_timestep": 0,
"mixed_precision": "bf16",
"model_list": "custom",
"module_dropout": 0,
"multires_noise_discount": 0,
"multires_noise_iterations": 0,
"network_alpha": 1,
"network_dim": 1,
"network_dropout": 0,
"no_token_padding": false,
"noise_offset": 0,
"noise_offset_type": "Original",
"num_cpu_threads_per_process": 2,
"optimizer": "Adafactor",
"optimizer_args": "scale_parameter=False relative_step=False warmup_init=False",
"output_dir": "<>/KOHYA/LoraPics/finalenvoy\model",
"output_name": "warforged_chk_pt",
"persistent_data_loader_workers": false,
"pretrained_model_name_or_path": "<>/stable-diffusion-webui - dream/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.safetensors",
"prior_loss_weight": 1.0,
"random_crop": false,
"rank_dropout": 0,
"reg_data_dir": "",
"resume": "",
"sample_every_n_epochs": 0,
"sample_every_n_steps": 0,
"sample_prompts": "",
"sample_sampler": "k_dpm_2_a",
"save_every_n_epochs": 1,
"save_every_n_steps": 0,
"save_last_n_steps": 0,
"save_last_n_steps_state": 0,
"save_model_as": "safetensors",
"save_precision": "bf16",
"save_state": false,
"scale_v_pred_loss_like_noise_pred": false,
"scale_weight_norms": 0,
"sdxl": true,
"sdxl_cache_text_encoder_outputs": false,
"sdxl_no_half_vae": true,
"seed": "",
"shuffle_caption": false,
"stop_text_encoder_training": 0,
"text_encoder_lr": 0.0004,
"train_batch_size": 5,
"train_data_dir": "<>/KOHYA/LoraPics/finalenvoy\img",
"train_on_input": true,
"training_comment": "",
"unet_lr": 0.0004,
"unit": 1,
"up_lr_weight": "",
"use_cp": false,
"use_wandb": false,
"v2": false,
"v_parameterization": false,
"v_pred_like_loss": 0,
"vae_batch_size": 0,
"wandb_api_key": "",
"weighted_captions": false,
"xformers": "sdpa"
}
The time depends on the number of epochs and the steps per epoch, which come from images * repeats + reg images, divided by the batch size. Your card is 16gb, so you should consider a smaller batch size, and probably 1 with only 163 images
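A back-of-the-envelope version of that calculation, as a sketch. It follows the formula as stated in the message (images * repeats + reg images per epoch, divided by batch size); the actual trainer may count reg images differently, and the repeat count of 1 in the example is an assumption since the folder name isn't shown.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size, num_reg_images=0):
    """Estimate optimizer steps: ceil(samples per epoch / batch size) * epochs."""
    samples_per_epoch = num_images * repeats + num_reg_images
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return steps_per_epoch * epochs


def estimated_hours(steps, seconds_per_step):
    """Convert a step count and a measured seconds-per-step into hours."""
    return steps * seconds_per_step / 3600


# 163 images, 10 epochs, batch size 5 (from the config), repeats assumed to be 1.
steps = total_steps(num_images=163, repeats=1, epochs=10, batch_size=5)
print(steps)
# 330
```

Plugging the seconds-per-step shown in the trainer's progress bar into `estimated_hours` gives a quick sanity check on whether a multi-hour estimate comes from the step count or from slow individual steps.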
does it actually matter if some input training images are rotated?
or is it simply a "display setting" of the image for the OS and doesn't matter anyway?
Because I train the lora multiple times, I could set the te learning rate to e.g. 0.00005 in the first training and set it to 0 in the second training.
anyone fine-tuning BLIP2?
Is separating the prompts in the caption files with commas better?
what do you mean? Training images should look like you want them in the output. Modifications like flipping can help training, though
yep, in my opinion it makes more sense anyways to train text encoder and unet separately. So text encoder first, then unet.
However, training text encoder is difficult and in most cases I found training unet only is better for generalization
I usually get better result when training character concept with telr.
I used a script for automatic cropping to 1:1, but afterwards some images were flipped. So I was wondering if it affects the training
flipping is rather good. I would just avoid it if there are features that are side-specific (e.g. a mole that should be always on the left side). Otherwise it will rather improve training
it might depend on what you want to achieve. I found that text encoder training adapts very fast to the images, BUT the resulting model becomes very inflexible when it comes to drawing the character in a different art style (e.g. from photo to anime or from painting to comic) or when you want to draw the character from a different angle, with different clothing and so on. Text encoder training often ends up giving me images that are too similar to the training images in terms of composition
in my case it's about faces. you think it would even improve the result? that's interesting, how come?
I assume it has something to do with the pooling. It seems that by text encoder training your trigger words become too dominant and the remaining prompt will be ignored very often
I think it's related to how you prepare your captions
more variety. If you have enough images (say, more than 10) it might not be important. But if you have very few images, the unet tends to produce artefacts. By flipping you increase the number of images artificially (flipped images are not entirely new, but at least they are slightly different)
10k steps in 15
https://dreamlook.ai/ anybody used it before?
I always use manual captions. If you have some trick to improve on that, I'm glad to hear it. Text encoder overfitting is an annoying issue for me
I use photo as a prefix and manually add a camera angle tag to each caption so the lora learns to map camera angles to images. I use wd14 for auto captioning, then manually remove and add tags as intended, for more control over the lora.
hm, nah, adding tags that describe things on the images you don't want to overfit - I'm already doing that
I also use half unet lr for telr to train it slowly
maybe it works better for your data. I can only say I did several subject trainings. I always evaluate by using simple prompts ["photo of xyz"] and unconventional prompts ["xyz as astronaut", "charcoal drawing of xyz", "egyptian hieroglyphics depict xyz"].
I always found that training Textual Inversion or training Text Encoder will improve very fast for the simple prompts but won't be able to do the unconventional prompts. Unet-only training on rare name tokens is the only strategy so far that excels also on the unconventional prompts.
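The evaluation idea above can be turned into a tiny prompt generator so every checkpoint gets tested on the same set. This is only a sketch; the templates are the ones quoted in the message, and the function name is made up.

```python
# Templates taken from the examples in the conversation.
SIMPLE_TEMPLATES = ["photo of {subject}"]
UNCONVENTIONAL_TEMPLATES = [
    "{subject} as astronaut",
    "charcoal drawing of {subject}",
    "egyptian hieroglyphics depict {subject}",
]


def build_eval_prompts(subject: str) -> dict:
    """Return simple and unconventional test prompts for one trigger word or name."""
    return {
        "simple": [t.format(subject=subject) for t in SIMPLE_TEMPLATES],
        "unconventional": [t.format(subject=subject) for t in UNCONVENTIONAL_TEMPLATES],
    }


prompts = build_eval_prompts("xyz")
print(prompts["simple"])
# ['photo of xyz']
print(prompts["unconventional"][1])
# charcoal drawing of xyz
```

A model that only does well on the "simple" group but falls apart on the "unconventional" group is showing the overfitting pattern described here.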
when I have very few training data (e.g. only 1-2 images) THEN I use text encoder only training.
hmmm, I train with te using 2000 images and train it multiple times. I'll try ur evaluation method and see how it goes
I mean, if I train on photos of my face, then I have enough images. Training the text encoder then totally allows me to draw the image in different angles and so on
but it quickly overfits on the photo style
But I think even if the text encoder is overfitted, you could easily reduce its strength in comfyui, or use an earlier te from a previous Lora
like letting my face be drawn as comic or charcoal drawing won't work anymore
that's what I want to test
also not all styles overfit similarly fast. Like "anime" style usually stays quite robust
yeah, but why would I do that? Then I can just skip the text encoder initially
I think it is because anime is quite undertrained in the base model
I did A LOT of tests with training my face xD Using rare tokens was BY FAR what worked best
and with rare token I mean something like "Christian gjhsar"
combination of name + some random characters
you mean rare token with no te training right?
yes
interesting
in fact, you only have to train the cross attention
this shrinks down the lora file size to few megabytes ^^
training self attention sometimes slightly improves image quality, but 99% of the training has to be done in the cross attention
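A sketch of what "only train the cross attention" could look like when picking LoRA target modules by name. It assumes the common UNet naming where `attn1` is self-attention and `attn2` is cross-attention inside each transformer block; the module names listed are illustrative, so check them against your own model before relying on this.

```python
def is_cross_attention(module_name: str) -> bool:
    """Assumption: 'attn2' marks cross-attention layers, 'attn1' self-attention."""
    return "attn2" in module_name


def select_lora_targets(module_names):
    """Keep only the modules a cross-attention-only LoRA would adapt."""
    return [name for name in module_names if is_cross_attention(name)]


# Illustrative module names from one transformer block.
modules = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v",
    "down_blocks.0.attentions.0.transformer_blocks.0.ff.net.0.proj",
]
for name in select_lora_targets(modules):
    print(name)
```

Skipping the self-attention and feed-forward layers is what shrinks the resulting file.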
I tried
- textual inversion first, then unet training
- textual inversion first, then text encoder, then unet training
- text encoder, then unet on celebrity names
- unet only on celebrity names
- text encoder, then unet on rare tokens
- unet only on rare tokens
The last one had best generalization capability
(it also took most time for training. unet on rare tokens need ~10 times more training steps to adapt to the training images than the other methods. But results just looked best by far)
Thanks for sharing. I'll run some experiments with it.
I tried rare token unet only before but it seems very hard to converge
might be due to wrong lr
yeah, it takes forever ^^
I mean, you have to combine rare token with a token that describes the character
I use a first name for that
similar to Dreambooths "sks person" I use then "John duqgzsa"
(cause sks is not really rare)
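A quick sketch for producing names in the "first name + random characters" style described above. The suffix length of 7 is just taken from the look of "duqgzsa", and random letters are not guaranteed to be rare for the model, so it is still worth running the result through the tokenizer as discussed earlier.

```python
import random
import string


def rare_name(first_name: str, length: int = 7, seed=None) -> str:
    """Combine a descriptive first name with a random lowercase suffix."""
    rng = random.Random(seed)  # pass a seed to get a reproducible suffix
    suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{first_name} {suffix}"


print(rare_name("John"))  # e.g. "John" followed by 7 random letters
```

The first name carries the class information (male, cultural context), while the suffix is what the unet has to learn from scratch.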
you're not providing the class?
but then it trains much slower than other training variants. That is definitely the case. I just say the results are also much better than for other training variants
no need for that. A first name IS the class
in the case above: "John" clearly describes a male character. No need to use "male character" additionally
if your name is very misleading (like you are a woman with the name "Alex", or your name has some other meaning, like your name is "Dick" ^^°)
then you should "rename" yourself
like my name is "Kai", which is a typical German name, but the name also appears in other cultures and in Stable Diffusion it is strongly associated with Japanese culture. That's why I renamed myself to "Christian" when training on my images
ye that was my first thought, what about for example "Kim", could be both
but I did get it right, you actually have the same results as if using the token "ohwx man"?
or do you say you had even better results with "Christian duqgzsa"
How do you know if itās a rare token or not?
The offset lora provided by SAI is trained on just "contrast" by the look of the metadata.
what type of captioning should i use for a clothing style lora?
dunno, I didn't try "ohwx man". I found using first name and last name way more natural
random characters
Have you tried celebrity names plus random characters?
no. The problem with celebrity names was that they are not really better than textual inversion but sometimes blend over (e.g. I once trained a DnD character on Hayden Christensen and sometimes the DnD character holds a light sabre instead of a sword xD)
I have to say: I don't care about training time. It might be different if you do that for business on a regular basis. For me, training 2 hours on a subject is totally okay if the results are really good afterwards.
So whats better, using multiple tokens or just one token and describing all the elements of the picture to add them later in the prompt manually?
how about your dataset size and lr? 2 hours seems pretty fast
will the things from the training images mentioned in the caption still be included in the training dataset?
I think for my face I had around ~50 images in total
It sounds interesting but yes, I also noticed celebrity remaining effect after training.
for my current training dataset, 2000/50 = 40, and 40 * 2 = 80 hrs. I trained with te for around 30 hrs. hmmm
haven't finished yet.
and planned to stop training unet
My focus is on maximum likeness to the original pictures, which might be a little bit different from your goal
yeah, my issue with overfitting goes in a different direction
I have this sample to change the style with te lora
in a few day ago. Dataset is 700 images. Training spent around 20 arounds.
I think it could add more oil painting to adjust the style
all images in the dataset are photos, with no reg images
It's supposed to output images like this.
If you train unet only how does it learn a random token?
the unet cross attention learns to associate tokens with the latent pixels in your image. That's all you need
if you have the name "fgzhw" then it is tokenized (e.g. into ["fg", "zh", "w</w>"]) and the tokens are associated with your face
if you train the text encoder, the name tokens are "distributed" and amplified through your caption, which then ends up in an overfitting effect
and if it's not a face, like an image taken with let say kodak vision2, any idea how to best caption that?
I've gone with just that, seems to be working but there might be better ways.
either you caption it with a unique rare token trigger word, or you simply describe what is on the image
former makes sense if there is something uniquely new that cannot be described
(like your face)
Tried to caption the image but I don't think the results were great when you do unet only. captioning the film stock turned out better.
I'm not sure if this is comparable. I said: text encoder training overfits on style. You seem to train on a style
When I trained both unet and text encoder with captioned images it turned out good but it overfitted the characters, clothing and environment a bit too much.
Maybe I should have a lower lr for the text encoder when I'm training both?
You might try to add more token to describe the image as detailed as possible
I currently use half lr for te compared to unet
My first try, which trained both unet and text encoders, was captioned something like this: cinematic film still of the road is empty, desolate, calm and serene, blue and yellow, close-up, lonely, barren, empty, natural, low, soft, straight on, shallow depth of field . vignette, highly detailed, high budget, bokeh, cinemascope, moody, epic, gorgeous, film grain, grainy
I've also tried something like this for the same image a road, desert, mountains, day, landscape
that didn't turn out as good.
I've also tried just kodak vision2 for all images and unet only. It seem to have the least impact on anything else except the color palette and film grain and such.
dunno... I would train unet only, honestly.
it's also the way SDXL was trained itself
Maybe I should try to increase the number of steps in training when I do unet only. I've run 3000 steps on 220 images.
If I do a rare token like kai is telling us, I'm not sure if I should just try a random token since these are not images of "Christian" nor "John"
you train for a style. I wouldn't use any special token here
and should i skip kodak vision2?
except something like "kodak vision2"
I do 0.0004 for learning rate. Could probably experiment with that too and see if it picks up the style better or worse.
I could try 6000 steps and a batch size of 4 as I've done previously and see what happens. It would probably take 2 hours with that amount of steps.
you can also try to play around with the --adaptive_noise_scale parameter
setting it to, say 0.05, might speed up training for some concepts
This is what I use atm. https://gist.github.com/twri/3b4fdc6adc6a81e6dbd9ea5256997f11
Adaptive noise scale:
Used in combination with the Noise offset option. Specifying a number here will further adjust the amount of additional noise specified by Noise offset to be amplified or attenuated. The amount of amplification (or attenuation) is automatically adjusted depending on how noisy the image is currently. Values range from -1 to 1, with positive values increasing the amount of added noise and negative values decreasing the amount of added noise.
--clip_skip 1
does that make sense for SDXL?
Not sure, I should probably remove it since you ask.
sorry, I meant --min_snr_gamma=5, not --adaptive_noise_scale
Min SNR gamma
`In LoRA learning, learning is performed by putting noise of various strengths on the training image (details about this are omitted), but depending on the strength of the noise, learning may or may not be stable as it moves closer to or farther from the learning target, and Min SNR gamma was introduced to compensate for that. Especially when learning images with little noise on them, it may deviate greatly from the target, so this setting tries to suppress that jump.
I won't go into details because it's confusing, but you can set this value from 0 to 20, and the default is 0.
According to the paper that proposed this method, the optimal value is 5.
I don't know how effective it is, but if you're unsatisfied with the learning results, try different values.`
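To make the quoted description a bit more concrete, here is a sketch of the Min-SNR weighting as I understand it for noise-prediction losses: each timestep's loss is scaled by min(SNR, gamma) / SNR, so low-noise (high-SNR) timesteps get down-weighted. The `alpha_bar` values are illustrative, and the trainer's actual implementation may differ in detail.

```python
def snr_from_alpha_bar(alpha_bar: float) -> float:
    """Signal-to-noise ratio of a timestep with cumulative alpha product alpha_bar."""
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight for noise-prediction training: min(SNR, gamma) / SNR."""
    return min(snr, gamma) / snr


# Illustrative timesteps: nearly clean image -> very noisy image.
for alpha_bar in (0.99, 0.9, 0.5, 0.1):
    snr = snr_from_alpha_bar(alpha_bar)
    print(f"alpha_bar={alpha_bar:.2f}  snr={snr:8.3f}  weight={min_snr_weight(snr):.3f}")
```

With gamma = 5, any timestep whose SNR is at or below 5 keeps weight 1, and only the low-noise timesteps are scaled down, which matches the "images with little noise" remark in the quoted text.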
I guess you could say my images are noisy, if grain is noise.