#🔧|finetune
SD tries to predict the noise in your image. The less noise you have, the harder the task and the higher the loss. min_snr_gamma stabilizes this a bit by making sure that steps with very little noise are not overweighted during training
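A minimal sketch of the weighting described above, assuming the usual Min-SNR formulation for noise prediction (loss weight = min(SNR, gamma) / SNR). The helper names are made up for illustration and are not sd-scripts functions.

```python
def snr(alpha_bar: float) -> float:
    """Signal-to-noise ratio of a timestep with cumulative alpha `alpha_bar`."""
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight(alpha_bar: float, gamma: float = 5.0) -> float:
    """Loss weight min(SNR, gamma) / SNR for noise (epsilon) prediction."""
    s = snr(alpha_bar)
    return min(s, gamma) / s


# alpha_bar close to 1 means almost no noise was added (high SNR).
for alpha_bar in (0.999, 0.9, 0.5, 0.1):
    print(f"alpha_bar={alpha_bar}: SNR={snr(alpha_bar):8.2f}  "
          f"weight={min_snr_weight(alpha_bar):.4f}")
```

Timesteps with an SNR below gamma keep weight 1, and nearly noise-free timesteps are scaled down, which is the stabilizing effect mentioned in the message.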
I think I'll try to remove clip_skip and add min_snr_gamma=5 first and keep steps 3000 and lr 0.0004
I would say, intuitively, it is more about: do you want to train on high details or on overall composition. min_snr_gamma should be used when you train on composition. You should not use it if you train on details. For example, when you train on faces, you don't want min_snr_gamma, because tiny details like skin structure are super important for you
Yeah, then I think your recommendation is spot on.
I've seen examples when they have set gradient_accumulation_steps to 4. It seems to affect learning rate too? I've gone with the defaults of 1.
nah, wouldn't use that
make your batch size as large as possible, that's more efficient
I could go 8 too, does that affect the learning in any way other than it's faster?
dunno. I would always go as high as possible.
There are other people saying lower batch size is better. I don't believe that. But I never did experiments on that.
That's why it's 4. 😄
it's not faster, though. High batch size makes it rather slower
1 is slow
it's only faster if you also increase learning rate. I usually use batch size 10 but still a low learning rate of 4e-4
only because it trains longer. But how long you train the model is your decision anyways
I feel there's more overhead when loading the images for each step the lower batch size you go
It spends more time loading than training if you understand what I mean.
hm... yeah okay, I never had that many images that they would not fit into my RAM
I guess you do without --cache_latents_to_disk then.
and with what optimizer?
AdamW
number of steps?
as long as necessary
just record validation images and stop training if you are happy with the results or nothing happens anymore
I'll go with this then: https://gist.github.com/twri/0df8c1df30a9ed83be2d92261159445e
I will see what I can fit in memory
batch size 2 is the largest I can go without caching to disk.
loss is lower now
@stiff dust I've read somebody claims that SDXL was trained with --noise_offset=0.0357, do you know if there's any truth to that?
it's the best working noise offset in my experience
but Joe Penna said they trained sdxl several times with different parameters, so there is not a single right noise offset 🤷♂️
Anyone know if there is a comfyui workflow for testing LoRAs in an xy plot fashion?
What is this 1girl token about?
Could it be just a random token?
the prompt to produce a girl in the image
usually followed by solo somewhere
Hmm i never had problems getting a girl so idk why i would need to include that prompt
It is part of the preset for wd14 captioning
Solo and 1girl
maybe the engine understands it as one word to save prompt space?
Even if i dont describe any woman i get a girl because the class prompt i chose is woman
I don’t even need to include my instance prompt, the lora thing with <> around is enough
It always worked for me
@stiff dust After testing my lora: the text encoder doesn't need to be trained to produce good enough results. Might want to find out how we should properly train the text encoder
Hi, just curious, when you say validation images, do you mean sample prompts without the trigger word to see when it starts to overfit the class token?
do you usually do (blip) captioning just for your training data or for the reg, too?
and do you add the class prompt as a prefix, too or only the main prompt
no, I use the trigger words in unusual styles or combinations. e.g., "charcoal drawing of xyz", "watercolor painting of xyz", "comic strip with xyz" and so on
Do you use loss graphs at all? What’s your philosophy on it? Or just prompt sampling/cat plot?
no, I don't think the loss gives you much valuable feedback
quite often loss gets higher in the beginning although the image improves
also the loss depends heavily on the sampled timesteps. If you want to interpret the loss you have to use proper validation data with fixed timesteps.
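One way to read "proper validation data with fixed timesteps" is to fix, once, which timestep and which noise seed every validation image gets, so the validation loss is comparable between checkpoints. A minimal sketch under that assumption; the function name, the 1000 training timesteps and the tuple layout are illustrative choices, not part of any trainer.

```python
import random


def fixed_validation_plan(num_images: int, timesteps_per_image: int = 4,
                          num_train_timesteps: int = 1000, seed: int = 0):
    """Return the same (image_index, timestep, noise_seed) triples on every call."""
    rng = random.Random(seed)  # private RNG, so training randomness is untouched
    plan = []
    for image_index in range(num_images):
        for _ in range(timesteps_per_image):
            timestep = rng.randrange(num_train_timesteps)
            noise_seed = rng.randrange(2**31)
            plan.append((image_index, timestep, noise_seed))
    return plan


plan = fixed_validation_plan(num_images=3)
print(len(plan), plan[:2])
```

Evaluating the loss on exactly these triples after every epoch removes the timestep sampling noise that makes the raw training loss hard to interpret.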
Can one transfer lora training parameters if you switch to full training? Is there anything that has to change?
Should you still train unet only, for instance?
Do you mean, look at loss graph next to validation samples at regular timestep intervals? Is that what you mean?
Any models out there that can generate keywords for my images?
Like light description
And be fairly accurate
openflamingo is the best I've tried.
Especially if you give it a few images with captions for your dataset
min_snr_gamma with any optimizer other than Prodigy tends to do that.
does the order of the prompt in the caption matter?
hmm yeah ill rework my captions and see if it changes
has anyone done Kohya_ss full finetuning training? I am stuck on the metadata preparation. Do I run the python script in a Jupyter python kernel in the merge_captions_to_metadata.py directory? See doc https://github.com/bmaltais/kohya_ss/blob/master/fine_tune_README.md#preprocessing-caption-and-tag-information. Not sure, as this was not necessary during Lora training..
do i need regularization images for SDXL Lora training?
Only used kohya-ss (sd-scripts) to do dreambooth training. I'm using toml-files for my dataset for both LoRA and dreambooth.
I train on different AR so it's much easier with a dreambooth style dataset.
Still researching parameters for dreambooth so I have only had it run for a few epochs.
someone mentioned dreambooth without trigger, but i dont think its the same
I don't know what's the difference between fine-tune and dreambooth to be honest. I don't specify dreambooth anywhere but my dataset is dreambooth style with captions for each image.
Thanks. Will test again locally
Though I’d imagine there’s a train_network difference in the code
dreambooth usually means using reg images.
no, reg images were suggested in the Dreambooth paper, but they were only used for training on faces
Dreambooth means you use a rare token as trigger word (the paper suggested the token "sks") to fine-tune the model on a new subject
however, the term Dreambooth is used very differently. Sometimes it refers to full-finetuning (in contrast to Lora), sometimes it refers to the style of the caption ("photo of a sks person")
I would always recommend to write custom captions when using kohya 🤷♂️ you have most flexibility with that and you ensure that nothing strange happens
@stiff dust Is there a reason a full fine tune uses lower learning rate than a lora?
yes. A lora trains a matrix factorization and not the original matrix.
so basically you multiply two numbers to obtain the weight change. Multiplying two small numbers gives you an even smaller number (e.g., 1e-3 x 1e-3 is 1e-6)
so you need a much larger change in the two numbers to obtain a noticeable change in the result
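A toy illustration of that argument, assuming a single scalar weight that is either trained directly or written as the product of two small factors. This only mirrors the intuition in the messages above; real LoRA layers use matrices, an alpha scaling factor and a different initialization, so the numbers are not meant to match any trainer.

```python
def full_step(grad_w: float, lr: float) -> float:
    """Change in w after one plain gradient step on w itself."""
    return -lr * grad_w


def factorized_step(a: float, b: float, grad_w: float, lr: float) -> float:
    """Change in w = a * b after one plain gradient step on a and b."""
    # Chain rule: dL/da = grad_w * b and dL/db = grad_w * a.
    new_a = a - lr * grad_w * b
    new_b = b - lr * grad_w * a
    return new_a * new_b - a * b


lr, grad_w = 1e-4, 1.0
print("direct update:    ", full_step(grad_w, lr))
print("factorized update:", factorized_step(1e-3, 1e-3, grad_w, lr))
```

With the same learning rate the factorized weight moves by several orders of magnitude less, which is why the factors need a much larger learning rate to produce a comparable change.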
Would that also mean longer training when doing a full run given same dataset is used? I know I'm trying to generalize a bit too much but still. 😄
There are some fine tuned models claiming 200k steps, not sure how big the datasets are though.
200k and let's say 50 epoch is only 4k images.
But if it's learning faster then it could be more images I guess.
not if they used batch size > 1.
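The back-of-the-envelope arithmetic from the last few messages, as a small sketch. It assumes every step consumes one full batch and ignores image repeats and gradient accumulation; the batch size of 8 is only an example value.

```python
def images_per_epoch(total_steps: int, epochs: int, batch_size: int = 1) -> float:
    """Images seen per epoch when every step consumes `batch_size` images."""
    return total_steps * batch_size / epochs


print(images_per_epoch(200_000, 50))                # batch size 1
print(images_per_epoch(200_000, 50, batch_size=8))  # same step count, batch size 8
```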
But I'm very sure that most fine tuned models are trained on very small datasets (rather 100 images than 4000)
If that's the case it must be a very very slow process
Can't find much about fine tunes on SDXL. Do you have links?
Many of the fine tunes are not much more than LoRA merges too.
Wish we could get our hands on some parts of the dataset for SDXL.
finetuning SDXL is expensive and time-consuming
hopefully optimizations happen or GPUs become cheaper. I'm hopeful for sharding.
hi, I am looking to train my own SDXL lora and there is an auto-captioning script that I found in a guide
call .\venv\Scripts\activate.bat
python.exe "finetune/make_captions.py" --batch_size="1" --num_beams="1" --top_p="0.9" --max_length="75" --min_length="5" --beam_search --caption_extension=".txt" "D:/!PhotosForAI/billie/billie-1024" --caption_weights="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth"
but this looks like it's calling out to an online service to generate the captions, is that correct?
or is it just going to download the file and use it?
You have to check the code. My guess is that it downloads the weights.
yes, it only downloads the weights.
hi, does every lora have its own floating-point precision (16 or 32 bit)?
if so, is there any conflict when merging two loras with different precisions? and how do I determine a lora's precision? thank you very much
either the script you use for merging is dealing with the conversion, or it's not and you get an exception. But you won't get an incorrect or corrupt file back.
Anyone have good json file for training LoRA character or object?
My gotos at the moment are: for quick, dont want to learn just make it happen
https://education.civitai.com/sdxl-1-0-training-overview/
https://www.reddit.com/r/StableDiffusion/comments/16a2ixm/comment/jz51npm/?utm_source=share&utm_medium=web2x&context=3
Okay i want to learn a bit (caith)
<#🔧|finetune message>
<#🔧|finetune message>
and Ai Characters
https://civitai.com/articles/1771/how-to-create-near-perfect-character-and-style-loras-for-sdxl-10-important-update-19082023
All have shared jsons.
Thank you 🙏👍
How many concepts can fit into a LoRA? Thinking if full finetune is even necessary. Going to train on about 15 different products/objects in different categories and for each category there is going to be at least 3-5 variations of form design & color. Dataset will probably be at around 10k images
Haven't tested 10k. But I am training with 5k dataset. It is decent.
Does it train new objects besides humans?
My training mainly focus on human.
ok
How many steps / epochs do you need for that dataset?
Set with 100 epochs and 89900 steps with 20 bs using A100. It is an experiment. I am not sure which would be the best. 70% now
The learning rate is 1e-4.
"Additional parameters: –max_grad_norm=0" Why does it say unrecognized when starting training?
Will be interesting to know what the results will be when you finish. Have you implemented multi aspect training? I will try it soon I hope. Writing tools for this now.
Yes, multiple aspect ratio with buckets
But not cropped images as in the SDXL paper?
is training with batch size 1 better ?
Seems like SAI feeds same image cropped in different AR into training.
Page 5 and 6 in paper.
I have cropped images from the original images with different captions
The original dataset is around 2600. 2400 is anatomy cropped from original images
I wonder if feeding pre-generated cropped images would do about the same as their Fourier-embedded crop conditioning does, or if that needs to be implemented properly in the trainer.
@stiff dust Do you have any knowledge about this?
Might be
kohya sd-scripts uses the correct cropping conditioning only if it crops the images itself
if you provide already cropped images, then this information is not available to kohya
my own solution for that was to write the cropping parameters into the meta tags of the png file and made some code additions to kohya to read these tags and add the conditioning
I'm not entirely sure, though, how helpful cropping is
I used it when training on my own face: next to the normal 1024x1024 images I also added a few extremely high resolution images (like 4096x4096), cropped them into 1024x1024 blocks, and added them to the training data. It seemed that doing that too often ends up overfitting on close-up faces (even the cropping conditioning doesn't prevent that). When only around 1%-5% of my training data is cropped, it seemed to help improve details (e.g. skin details), but I'm not entirely sure if the improvement is really due to the cropped images.
I have 1,840 face cropped images / 5827 total images. It does tend to make close-up images but it is still able to make full shots
oh, yes, it can. I'm just saying it more often makes close-up images without "close-up" being specified
the reason I came up with cropped images was rather that I hoped it helps SDXL to render images in high resolution without making the typical placement errors (two noses and so on). But it didn't really work so well, upscaling is still the better strategy
cropped the character but it is interesting to see the image has two bottom navigation bar.🤣
But shouldn’t kohya tell what resolution the image is when training?
I’m cropping and resizing based upon the closest resolutions that SDXL was trained on and put them in to a folder for each resolution. But this could be improved to resize and crop to other resolutions if there are pixels enough if there’s any point hence my question.
My feelings are that some fine tunes have regressed in rendering resolutions that base handles just fine. My thought was that this might be able to improve if you train on same image but to different resolutions.
resolution, yes, but not cropping
These images are selected results and I think the anatomy training might introduce other issues.
For example, the following image lost the relation between the legs and upper body. Failed-diffusions~
Nice now my character sticks with a cleft chin😭
what the fuck
why
im so fed up with this shit
they all have this type of chin now
completely random my training images didnt even contain those chins
oh yeah, I know that very well: artefacts that appear in the images although they were not part of the training images. I had the same issues with SD 1.5 and 2.1, though
funny thing is this is the 5th version of my lora
and the one before didnt have this (a tiny bit but not that strong)
What other data is provided other than the cropped image to training? Source size? Source image?
i have over 10 pics with thighhighs in my training set and still they appear when i put them into negative prompts... and all the time
Not sure how it works but it seems like when training it associates with other dimensions in the model that is related to the newly learnt data but not in the training data.
It looks like it's nipples from cats or something that has been associated with the ears
maybe the chins are related to that too
it can't, cause it does not have this information. If you let kohya ss do the resizing and cropping itself, then it provides this information. Otherwise, not.
First of all I want to thank you for all invaluable information that you provide! My question was what data you provided with the cropped image. By any chance do you have a repo for the numerous changes you made to kohya?
what is the reason for saving/using "training states" in the context of dreambooth, if we can actually resume training also by using the last created checkpoint as source model?
it is even possible to save such states for LoRA, even though we can simply put the path to the last saved LoRA to resume in kohya ss gui
I've always looked at it as a redundant save feature, personally
it also stores the optimizer state and, thus, the current momentum of the optimizer
using training states you can stop and resume training any time without any drawbacks
loading from a checkpoint means the optimizer needs some warmup phase and will probably rather harm the model for the first few steps until it improves the model again
this is not a big issue if you do stop in the middle and resume again. But if you plan to stop and continue several times and close to each other, saving states is definitely better
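A toy sketch of why the optimizer state matters when resuming. It uses plain SGD with momentum on f(w) = w**2 as a stand-in (the chat mentions AdamW, which keeps more state, but the effect is of the same kind); all values are illustrative.

```python
def sgd_momentum(w, velocity, steps, lr=0.1, beta=0.9):
    """Minimize f(w) = w**2 with SGD plus momentum; returns (w, velocity)."""
    for _ in range(steps):
        grad = 2.0 * w
        velocity = beta * velocity + grad
        w = w - lr * velocity
    return w, velocity


# Uninterrupted run of 20 steps.
w_full, _ = sgd_momentum(5.0, 0.0, 20)

# Stop after 10 steps, then resume WITH the saved optimizer state.
w_mid, v_mid = sgd_momentum(5.0, 0.0, 10)
w_state, _ = sgd_momentum(w_mid, v_mid, 10)

# Resume from the weights only: the velocity restarts at zero.
w_ckpt, _ = sgd_momentum(w_mid, 0.0, 10)

print("uninterrupted:       ", w_full)
print("resumed with state:  ", w_state)
print("resumed weights only:", w_ckpt)
```

Resuming with the saved velocity reproduces the uninterrupted run, while resuming from the weights alone follows a different trajectory until the momentum has built up again.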
no, it's on my local git. Also many of the changes are quite specific. If you need something specific I can just send you the code
nice insights, thanks for the clarification
Nice, good points
is there a place to learn about the basics of the LORA training process? (eg. the loss calculation, gradient descent, etc.)
it's the same as for fine-tuning
Hello!!
I compared 5 popular face upscalers.
Base image 640x640.
Scaled to 960x960.
Results:
4x-UltraSharp details: 3/10 noise: 2/10 light: 5/10
4x_NMKD-Siax_200k details: 7/10 noise: 4/10 light: 6/10
4x_foolhardy_Remacri details: 8/10 noise: 4/10 light: 7/10
4x_NickelbackFS_72000_G details: 4/10 noise: 6/10 light: 7/10
R-ESRGAN 4x+ details: 6/10 noise: 3/10 light: 4/10
Please test it yourself and verify my results
I'm horrified how everyone says 4x-UltraSharp is good. It is terrible for realistic graphics... maybe decent for some anime or pixel perfect art
small update:
for 3x scale 4x_NMKD-Siax_200k details are insane, like 9/10
I also always use Siax, but haven't evaluated it yet, so thanks for the info
do lora questions go here or somewhere else
specifically an anime lora but still
i guess ill put it here
it seems others have talked about it here before
first time trying to make a lora, im using colab and NAI for anime style
am I being too picky? like I feel like it's weird compared to anime ones on civitai. granted, my character is an OC, but theyre at least kindof anime girl looking (and androgynous) so idk.
I use adaptive optimizers, batch size of 6, output samples are good quality and very accurate. but when I apply it to a model the quality (not literal) ranges from broken, to kind of accurate, to pretty good.
tldr; is this normal? should I just pick one and go for it? are loras known to depend on the model you're applying them to?
ill get an example
heres a plot, idk if the "styilization" is overfitting or just my dataset. i put style tags in the images, and some weights are a good balance but its either not accurate enough or fried
this is on a custom model, but its basicaly just dreamshaper with anime composition/clip. makes it very stylized and a bit messy
a lora should work with weight= 1. Using higher weights is rather hacky, using lower weights is rather a sign that the lora was overfitted
a lora should work with other models in most cases. However, if the other model is bad (e.g., totally overfitted and broken), then it might not work. Dreamshaper XL should work, though
I would not be too worried if the anatomy is sometimes wrong (too many legs or fingers). These things happen in the base model without any lora, too. Try different seeds and check if it happens too frequently
Hello!
Do you guys have any pro tips for fine tuning skin texture on 2k image?
I can only think of img2img with Ultimate SD Upscale with 4x_NMKD-Siax_200k with some denoise.
Extras with upscaler is really bad for skin even with the best possible upscaler.
I want advanced opinions on this topic.
Maybe it is just actually impossible as of 2023... Should I really expect any sort of AI to perfectly reproduce human skin?
in general loras work best on the model they were trained against, but you will have varying results against other models with the same base (ie. 1.5 base, sdxl base). as for the sample inference during training vs the actual live inference in auto1111 or comfy, that kind of just depends on all the settings and negative prompts etc
basically, dont expect super consistent behavior with so many variables in play
think of a lora like a formula that says when I use these tokens I will apply some weights to bias toward training parameters, but those weights will vary depending on the other weights in play
Ahh okay cool, thanks
I won't worry too much then, i have epochs that work at weight 1 and apply generally "this is that character" to the image. plus all the prompting
is there a page that documents the inner workings of the LORA learning algorithms in details? i'm trying to learn more about the LORA training process.
there is no special learning algorithm
lora is like fine-tuning
the main difference to fine-tuning (sometimes called Dreambooth) is
- you freeze the original model M and train the difference model D with M+D = finetuned, instead
- your difference model D is factorized. You can think of that as a lossy compression algorithm, similar to how you compress images as jpg. It makes the lora smaller than the original model
- usually you don't train all matrices in the original model but only the most important ones (that's why there are so many subtypes of Lora. Most of the time Lora is just training the Transformers. Lycoris is also training the Resnet and so on)
but loras are not much different from normal fine-tuning
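To put a number on the "lossy compression" point, here is a small sketch that counts parameters for a full difference matrix D versus a rank-r factorization D = B @ A. The 1280 x 1280 shape is a hypothetical layer size chosen for illustration.

```python
def full_params(rows: int, cols: int) -> int:
    """Parameters in a full weight-difference matrix D of shape rows x cols."""
    return rows * cols


def lora_params(rows: int, cols: int, rank: int) -> int:
    """Parameters when D is factorized as B @ A (B: rows x rank, A: rank x cols)."""
    return rank * (rows + cols)


rows, cols = 1280, 1280  # hypothetical layer size
for rank in (4, 32, 128):
    ratio = lora_params(rows, cols, rank) / full_params(rows, cols)
    print(f"rank={rank}: {lora_params(rows, cols, rank)} params "
          f"({ratio:.1%} of the full matrix)")
```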
is there a place that goes into how normal fine-tuning works? i still need to learn about the very basic like the loss calculation, SGD, backpropagation and such
uhm, this is the same for any kind of neural network, so lookup any textbook about neural networks
SD is using a simple l2 loss (mean squared error) on the noise prediction
(squared difference between predicted noise per pixel and real noise per pixel)
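As a minimal sketch of that loss, with the noise tensors flattened into plain lists of numbers (a real trainer works on latent tensors, and the values here are made up):

```python
def mse_noise_loss(predicted, target):
    """Mean squared error between two equally long flat lists of noise values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


true_noise = [0.5, -1.0, 0.25, 0.0]       # noise that was added to the image
predicted_noise = [0.0, -1.0, 0.75, 0.0]  # what the network predicted
print(mse_noise_loss(predicted_noise, true_noise))
```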
i've been trying to learn about this with GPT4, but the issue is I can't put any of it in the context of Stable Diffusion LORA training
No idea if it fits here, but what can i finetune/add to prevent third upscale resample to have 2 mouths and quite vastly different mouth/lips? The extra hand on the neck is a quick fix, it's just the extra mouth i can't seem to get rid of. Neither with denoise nor cfg
I don't know what you mean 😅 there is really nothing special on lora training
maybe tiled upscaling? But that does not exist yet for SDXL. You can also try using control net (e.g. line art control net)
does it make a difference if I train a SDXL LoRA or a full fine tuned model with, let's say "medium quality", pictures? Is it possible that LoRA works better for images with lower quality? or should the dreambooth training always yield better results
in theory, if you set the lora rank to max then lora and dreambooth should be the same. So no difference. In practice, most Lora implementations do not train all weight matrices but only the important ones
in fact, if you want to train on subjects it's totally sufficient to only train the cross attention
but most lora implementations train the complete transformer
anyways, that could be a reason why Loras are sometimes better than Dreambooth. They overfit less, because they don't train all the weights that are not really important for the training
you could do the same with dreambooth, though. You always can decide freely which weights in your network should be trained
thank you, it is as interesting as ever to read your detailed answers.
Bit off topic but try this upscaler, it's my favorite that doesn't ruin details. https://github.com/Phhofm/models/tree/main/4xLexicaHAT
The only two changes I know about that you have made to kohya are stopping text encoder training after a certain number of steps (or epochs?) and the crop conditioning. You might have made other neat changes that we still don't know about.
I always train EITHER text encoder OR unet which is already possible in sd-scripts. Also a simple trick to perfectly control what should be trained on is to set "--save-after-n-steps=1", save the lora safetensors, then open it in python and remove all matrices you don't want to train on.
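A rough sketch of the "open it in python and remove matrices" trick. It assumes the LoRA file uses kohya-style key prefixes such as "lora_unet_" and "lora_te_", which is an assumption to verify against your own file, and the helper names are hypothetical. The file helper needs the safetensors and torch packages and does not carry over the file's metadata.

```python
def keep_matching(state: dict, prefixes: tuple) -> dict:
    """Keep only entries whose key starts with one of `prefixes`."""
    return {key: value for key, value in state.items() if key.startswith(prefixes)}


def prune_lora_file(src: str, dst: str, prefixes: tuple) -> None:
    """Load a LoRA .safetensors file, drop unwanted tensors, save the rest."""
    # Imported lazily so keep_matching works without these packages installed.
    from safetensors.torch import load_file, save_file
    save_file(keep_matching(load_file(src), prefixes), dst)


# Dry run on a fake state dict (key names are assumed, values are placeholders).
fake_state = {
    "lora_unet_block.lora_down.weight": 1,
    "lora_unet_block.lora_up.weight": 2,
    "lora_te_layer.lora_down.weight": 3,
}
print(sorted(keep_matching(fake_state, ("lora_unet_",))))
```

Printing the keys of a real file first is the safest way to decide which prefixes to keep.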
for cropping you would have to store the cropping information somewhere. I used the PNG meta tag for that, but this is something you would have to do in python yourself. I just added a few lines to the BaseDataset.__getitem__ method to read out this cropping information
Yeah, I have done metadata writing to png on other purposes already
If you have a diff I'd be more than happy!
############ kaidu
if "size" in img.info:
    ww, hh = img.info["size"].split(",")
    original_size = (int(ww), int(hh))
if "crop" in img.info:
    ww, hh = img.info["crop"].split(",")
    crop_ltrb = (int(ww), int(hh))
############
# augmentation
aug = self.aug_helper.get_augmentor(subset.color_aug)
if aug is not None:
    img = aug(image=img)["image"]
if flipped:
    img = img[:, ::-1, :].copy()  # copy to avoid negative stride problem
latents = None
image = self.image_transforms(img)  # becomes a torch.Tensor in the range -1.0 to 1.0
that's the part in library/train_util.py
only the 6 lines in the ##kaidu block are relevant
should be around line number 1100, in the __getitem__ method
it reads the source size and crop coordinates from the "size" and "crop" parameters in your PNG info. Numbers are given comma separated (e.g., "crop=40,20")
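For the writing side, a minimal sketch of how such "size" and "crop" tags could be produced and stored with Pillow. The helper names are hypothetical, and the order of the two crop numbers (left, top) is an assumption based on the variable name crop_ltrb in the snippet above.

```python
def make_crop_tags(original_size, crop_left_top):
    """Build the "size" and "crop" text entries as comma-separated integers."""
    return {
        "size": f"{original_size[0]},{original_size[1]}",
        "crop": f"{crop_left_top[0]},{crop_left_top[1]}",
    }


def read_crop_tags(info):
    """Inverse of make_crop_tags; `info` is a dict such as Pillow's `img.info`."""
    def pair(value):
        first, second = value.split(",")
        return (int(first), int(second))
    size = pair(info["size"]) if "size" in info else None
    crop = pair(info["crop"]) if "crop" in info else None
    return size, crop


def save_png_with_tags(img, path, tags):
    """Save a Pillow image as PNG with `tags` stored as text chunks."""
    from PIL.PngImagePlugin import PngInfo  # lazy import: needs Pillow
    meta = PngInfo()
    for key, value in tags.items():
        meta.add_text(key, value)
    img.save(path, pnginfo=meta)


tags = make_crop_tags(original_size=(4096, 4096), crop_left_top=(40, 20))
print(tags, read_crop_tags(tags))
```

After `save_png_with_tags(crop_img, "crop_0001.png", tags)` (file name is an example), opening the file with Pillow exposes the same strings in `img.info`.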
Thanks! I found the functions in that file when I checked
Tiled helps more for lack of video memory to split up the generation on smaller squares than one large, no?
CAuse i'm attempting SD 1.5 gen that is natively 620x620 was it, then upscale and resample at 2x, then another 2x, so 2400x2400
does this blown out look come from overtraining?
this is before hires fix so it seems fine
are CLIP/deepbooru still the go-tos for captioning images in a dataset? or has something better come out?
There are many captioning tools, like wd14, ML-Danbooru, blip, blip2, openflamingo, etc. The problem is that your captions should fit your purpose. Even when using a captioning tool, you still need to manually delete or add tags.
how do you train a lora that only affects the composition of the image, and doesn't change the colors, lighting, etc? e.g. like a hands or eyes lora
probably correct tagging?
also a variety dataset
probably easiest to use 3d models or real images
Is there any good Mac software to create/edit captions? Also a guide for captioning images?
(I train on RunPod, but I'd like to caption locally)
taggui
Desktop application for quickly tagging images (jhc13/taggui on GitHub).
Thanks for the suggestion. That isn't on macOS but I found https://github.com/toshiaki1729/dataset-tag-editor-standalone
A related question, is there a guide to captioning images for best results? Specifically for a person LoRA.
hey @restive bridge just being curious how your progress is so far? 🙂 I've recently tried dreambooth training but got mixed results... at least not that much improvement over trainings I did with LoRA
Its python 100% so should probably work on mac too
It even claims cross platform
My 5k-image lora training is done. It includes multiple persons, multiple outfits, anatomy focus, nsfw, etc. I found some issues during testing. First, element mixing. For example, outfit A has a ribbon element and outfit B also has a ribbon element. Because the 'ribbon' tag was learned from both, using the outfit A prompt to reconstruct the image might produce outfit B's ribbon.
Second, elements on the fly. This occurs in the second half of the epochs. It might count as overfitting? Fingers, arms or elements of the outfit get torn apart. Wrong composition of elements, or extra elements from the outfits.
Third, anatomy hand training. It has the most element mixing issues in testing. That might be due to my lazy captioning. I used 'hand' as a general tag for images of different hand poses. I think it mixed up different hand poses, and the front and back side of the hand. I will test the anatomy training again with more accurate captions.
do i need a classprompt ?
if i make a lora for a clothing style, what classprompt should i use
What is strongest automatic captioning model atm?
Thanks, it worked, just had to do a little magic to get GPU captioning working properly, but I don't need it anyway.
Generally with a class you want to try and leverage what the base model already knows as much as possible, ex shirt, jacket, pants...
resizing a couple of images and wondering if anyone knows how to resize by the shortest side, ideally in python. the python image.thumbnail method and bulkresizephotos.com go by the longest side.
yes tyvm
Not sure if thumbnail is the best way to resize
resize is probably better but IIRC you have to set the size for height and width and I'm working with a lot of varying resolutions
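A minimal sketch for the shortest-side question: compute the target size yourself and hand it to a resize call. The function name is made up; the computed tuple can be passed to Pillow's `Image.resize`.

```python
def size_for_shortest_side(width: int, height: int, target: int) -> tuple:
    """Scale (width, height) so the shorter side equals `target`, keeping aspect ratio."""
    scale = target / min(width, height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


print(size_for_shortest_side(3000, 2000, 1024))  # landscape input
print(size_for_shortest_side(800, 1200, 1024))   # portrait input
```

Because the size is derived per image, this works for a folder of varying resolutions without setting width and height by hand.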
If it's a dataset for training I've used a list of resolutions and cropped and resized to the one closest to the target and put them in a separate folder by resolution.
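A sketch of that bucket approach, not the actual script shared in the chat. It assumes a commonly quoted subset of the SDXL training resolutions (the paper lists more) and picks the bucket with the closest aspect ratio, then computes a resize-to-cover size and a centre crop box.

```python
import math

# Assumption: a commonly quoted subset of the SDXL training resolutions.
SDXL_BUCKETS = [
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216),
    (1344, 768), (768, 1344), (1536, 640), (640, 1536),
]


def closest_bucket(width, height, buckets=SDXL_BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the image's (log scale)."""
    ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - ratio))


def cover_and_crop_box(width, height, bucket):
    """Return (resized_size, crop_box) for a resize-to-cover plus centre crop."""
    bw, bh = bucket
    scale = max(bw / width, bh / height)
    rw, rh = max(bw, round(width * scale)), max(bh, round(height * scale))
    left, top = (rw - bw) // 2, (rh - bh) // 2
    return (rw, rh), (left, top, left + bw, top + bh)


bucket = closest_bucket(3000, 2000)
print(bucket, cover_and_crop_box(3000, 2000, bucket))
```

The crop box uses the (left, top, right, bottom) convention, so it can be passed to Pillow's `Image.crop` after resizing.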
in python?
Yea
👉 👈 wouldn't happen to have a copy of that as well, would you?
Doesn't have a target dir parameter but I guess you could add that, it puts it into directory called processed
that's perfect, tyvm you saved me a lot time
Try GPT-4 😉
does using xformers while training do anything to a lora
like in a bad way
no
Hey Guys,
I'm wanting to train a model on a specific pixar-like character. I've generated 20 images of the same(ish) character and I want to be able to prompt that character via dreambooth or something similar. Any tips for how I'd be able to do that?
caption the images (either use the real name of the character, or a custom name without meaning, e.g., "Monica Tdezk").
then train a lora on that
using the kohya/sd-scripts (or kohya-ss) library
you can either train a pure text encoder lora with low dim (e.g. dim=2)
or you train the unet (with a bit higher dim, e.g. dim=8 or dim=12)
I often found unet training slower but more flexible, but you can just try. Text encoder training is usually fast, you get good results after a few minutes
Regularization image question. Trying to train Lora models in Kohya for a buddy of mine of his kids to make them into superheroes. What would you recommend for regularization images?
you don't necessarily need them. Just try without
Is it possible to train a LoRA with 8 gigs of VRAM?
question: what do dreambooth/ti/lora/finetune training do with the loss from all the images in a batch? do they use them to find a derivative?
I put together a couple of scripts with the help of ChatGPT to help me manage my growing SD training datasets: https://github.com/boomerchan/sd_training_scripts
resize_bulk.py is by far the most useful tool as it allows you to crop and resize based on:
- a given height and/or width
- the original SDXL training resolutions
- the shortest side or the longest side
And it doesn't modify your original image(s).
glhf
also thanks to twri for reminding me GPT can do Python
There are even more resolutions available that SDXL was trained on if you want to include them all. It's in the SDXL paper.
oh? I was going off of this. picked it up somewhere when SDXL first released
I'll look up the paper
Yes, that's why I only used them in my resize-tool too but there are more.
Will result in less cropping I guess and the model should be able to handle it.
tagging tool could be useful but there's also options available in kohya-ss to caption_suffix and caption_prefix in later versions depending on use case,
I try to keep the unique stuff in the caption file for each image and the general stuff in the config (toml) for the dataset for flexibility.
you take the average of the loss and compute the derivative of the average (which is itself the average of the derivatives).
You can think of each image as being handled separately, and then you merge all the updates via averaging
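A small numeric check of that statement on a toy one-parameter model (prediction w * x, squared error). The data and helper names are made up; the point is only that the derivative of the averaged loss equals the average of the per-example derivatives.

```python
def grad_single(w: float, x: float, y: float) -> float:
    """d/dw of the squared error (w*x - y)**2 for one example."""
    return 2.0 * (w * x - y) * x


def grad_of_mean_loss(w, batch, eps=1e-6):
    """Numerical derivative of the batch-averaged loss (central difference)."""
    def mean_loss(v):
        return sum((v * x - y) ** 2 for x, y in batch) / len(batch)
    return (mean_loss(w + eps) - mean_loss(w - eps)) / (2 * eps)


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]  # (input, target) pairs
w = 0.5
mean_of_grads = sum(grad_single(w, x, y) for x, y in batch) / len(batch)
print(mean_of_grads, grad_of_mean_loss(w, batch))
```

Both numbers agree up to numerical error, so a batch never blends the images themselves, only the per-image weight updates.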
Do we have a definitive answer on the whole use a celebrity name while training or don't? Seems to be a another one of those contentious mystery topics - like regularization images.
I don't
You could create funny things that mix a famous character's features into your training target.
this means that it needs several averages across different epochs to get a derivative right? i'm not very good at math so i'm not sure
not sure what you mean...
the derivative is a weight change. For each parameter you get a number saying how to change the parameter. When you have a batch of 10 images, you would obtain 10 of these numbers and take the average of them
I get better results with random names instead of celebrity names
I wanted to experiment today with training a lora so I was wondering for people which tends to work better if anyone knows, locon or loha lycoris?
I would simply use Lora
anything else is probably not necessary. Maybe it helps in rare cases for style training
does anyone have a good comfyUI workflow for evaluating LoRAs?
Also is it possible to set a setting in kohya so that it starts saving epochs after 40 epochs etc?
that would be --save_every_n_epochs="1"
that's from Caith, it is a very basic workflow to load the base model and a lora
Thanks
I didn't mean "save every n epochs" but "start saving every n epochs after x epochs"
ah, so skipping the first n epochs before starting to save. I'm not aware of that option, even though I would also find it helpful
Comfyui xyplot workflow using efficient nodes
Sure, but why not set it to save after every 40 epochs, and only run 40. Save the state, and then resume with save every 1
Ah, so is that why character training with high batch count results in a worse likeness? Because it's kind of taking an average of multiple images together.
I'm very sure that is a myth
it does not take the average of the images but the average of the weight changes
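A toy way to see the difference, again assuming a made-up linear model with squared-error loss rather than anything from SD: the average of the two per-image weight changes is not what you would get from training on a single blended image.

```python
import numpy as np

w = np.array([0.5, -1.0])                 # toy weights
x = np.array([[1.0, 0.0], [0.0, 1.0]])    # two toy "images"
y = np.array([1.0, 1.0])                  # their targets


def grad(w, xi, yi):
    # Gradient of (w.xi - yi)^2 with respect to w.
    return 2.0 * (xi @ w - yi) * xi


# What batching does: one gradient per image, then average the gradients.
mean_of_grads = (grad(w, x[0], y[0]) + grad(w, x[1], y[1])) / 2

# What the "blended image" story would imply: a gradient on the averaged image.
grad_on_blend = grad(w, x.mean(axis=0), y.mean())

print(mean_of_grads)
print(grad_on_blend)
```

The two vectors differ, so a batch of 2 is not the same thing as showing the network one mixed-up image.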
there is no rational reason why character training should work better with lower batch count
well there's some rationalization, if you leave the learning rate and steps unchanged, they will certainly not yield the same results, so I suppose it's better to say, that when accounting for those variables, results should be similar
and then what happens if you use an auto-adjusting optimizer, and plotting loss, there are things you have to account for when using batch size
I'm not saying that the results should be the same. Of course you have to adjust learning rate and step count, as you do anyway. But many guides claim that you should use batch size 1, because the network would get confused when it sees multiple images at the same time and then learns some blended image and stuff like that. THIS is total bullshit. You can achieve similar or even better results with a high batch size.
Question about captioning. I've heard it said that you should caption what you don't want the model to remember. So, in other words, if training a person you know, who you want to put in different locations and clothing, then you should caption the details of the background and clothing. However, when watching captioning tutorials, people always start their captions off by saying the gender of the person (i.e. "a man" or "a woman"). Am I wrong, or does that go against the "caption what you don't want" philosophy? Because, if that were true, then captioning the sex of the person should make their gender fluid in the model. Would love some insight into this.
more just a general guide. Basically when you caption, especially if you have multiple images with 1 outfit it should know exactly what to do when the prompt is brought up (but only). I have things trained that are "genderfluid", but the face and body still apply to any prompt so I rarely see change. if you don't caption that and use the Lora/mini model, it might freely assume and you can never really "get what you want" if it's not trained specifically enough to do it (if you know how to prompt for an image similar to training, it's easier than guessing). for styles this doesn't matter because they're not specific but for people I'd think its important
generally it also varies by dataset. not really something that is for sure going to work, but should. ai can be weird
so there's identifying, and there's describing. your scene contains a woman, but if you say a woman with blonde hair, that's describing the woman. And by doing so, you basically tell the AI about the hair and it won't try to learn it. This is very handy if you want her to have red hair sometimes. read this, I think it's a very well written exploration of captions from a reddit post: https://www.reddit.com/r/StableDiffusion/comments/118spz6/captioning_datasets_for_training_purposes/
similarly with clothing, there was a guy in one of the discord rooms that had trained a character but always got the same outfit. his issue was not describing the outfit
I'll say this, to save you some grief: don't go overboard describing everything. If your dataset is diverse, the AI won't learn it easily. What I mean by that is, if all your images are in diverse settings, you don't necessarily have to say "in a bedroom with a side table containing a lamp with a green shade"... that's just silly, unless that lamp appears in several images
most of my captions are like... no more than 10 words
This is fantastic. I've been struggling to find good info on this. Thank you kindly for the tips and link
Hey there, I'm wondering if anyone knows how to train a stable diffusion model with a different language? Like Greek, spanish, japanese, etc?
I have an RTX 3090 and I can't do SDXL dreambooth training
The VRAM is full and it takes another 12 GB from shared RAM
I did SD 1.5 training with dreambooth just fine, 3000 steps in 10 minutes
Any help will be appreciated 👍
easiest way tbh would be to put a translator in front of SD, if you want it to run off native non-latin languages it would take a different text encoder and probably significant retraining or starting from scratch
the tokenizer/text encoder are really only set up for English
I think I like the caption output from this setup the best
it creates a fairly long caption, with repeating but slightly different descriptions of the same thing, but avoids mentioning artists for the most part
e.g. 'two women talking on a couch in an office, sitting on a couch, sitting on couch, calmly conversing 8k, sitting on the couch, sitting on a sofa, sitting in a lounge, giving an interview, on a couch'
it has a weird behaviour though when it reads a word, it adds a bunch of captions that are related to that word, so make sure to check the captions for text or signs
sdxl_train.py: error: unrecognized arguments: --network_train_unet_only
I get this error when adding this parameter (network_train_unet_only) in advanced tab in dreambooth
what am I doing wrong here? training is consuming more than 24 GB of VRAM for sdxl
the parameter does not exist for dreambooth training
you can use --stop_text_encoder_training=-1 instead - it should have the same effect
(in general: all parameters starting with --network are for loras)
Hello, what do I need to change in Sd so that when I boot it up the neg prompt box has a "standard text prompt that I want"?
I got this error when used it sdxl_train.py: error: unrecognized arguments: --stop_text_encoder_training=-1
I tried with no parameters but it also consumes shared vram
what should I do to make it consume less vram?
okay, that's yet another class
but for sdxl_train.py you don't have to specify anything, the text encoder is not trained by default
okay
you can try --cache_text_encoder_outputs --cache_latents though, maybe that helps with vram
I did try that but same vram consumption
Is it currently possible to use loras with sdxl img2img? While there is an existing inherited method for this, I'm having the same issues described here:
https://discuss.huggingface.co/t/how-to-use-lora-with-sdxl-img2img/55295
I am trying to apply a lora to the SDXL refiner img2img pipeline. I’ve tried multiple sdxl loras that work with the base model and pipeline but when i try them with StableDiffusionXLImg2ImgPipeline and the refiner model it errors (I have set low_cpu_mem_usage=False and ignore_mismatched_sizes=True to no avail) StableDiffusionXLImg2ImgPipeline h...
anyone have an sdxl kohya config for a person/face? I know I'll have to edit all the paths etc, I just want a complete training config to work from
use loras with sdxl img2img? Yes. Apply lora to SDXL refiner? no
You could just start with preset
This is what I'm trying to do:
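import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
# Load the SDXL refiner as an img2img pipeline in half precision.
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
# prj_path is the folder holding my trained LoRA weights (defined elsewhere).
pipe.load_lora_weights(prj_path, weight_name="pytorch_lora_weights.safetensors")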
The tensor shapes don't align, as described by other folks with the same error in the link I posted.
The answer is no because there is no lora trained base on refiner and base is different than refiner.
You could use lora with base but not refiner
Thanks so much—that totally makes sense. Now that I'm running it with the base, I'm experiencing results that I didn't expect. I can sample from the lora-weighted base using the DiffusionPipeline and see images that resemble my training data. However, when I use the same lora weights with Img2Img, I'm not seeing images that resemble the training data, even when I bump up the strength. Any ideas about what might be causing that?
there is no 0th timestep, so even if you set 100% denoise you start at step 1. If you have 20 steps in total, then 1/20 of the input image is still conserved. The early timesteps usually determine the composition of the image. So even with 100% denoise strength the rough shape of the image as well as the colors and brightness might be taken over from the input image. You can increase the number of steps to negate that (but of course, when you do img2img you usually want that the input image is bleeding into the resulting image)
anyways, the Lora is applied in img2img the same way as in text2img, so if your images look a bit different, it's just because your input image affects the outcome no matter what your denoise strength is
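For reference, a rough sketch of how a denoise strength is commonly turned into "how many of the sampling steps actually run" in img2img. This is my understanding of the usual arithmetic, not a copy of any particular pipeline, and the function name is made up. It only covers the step count; it does not model how much of the input image survives at the noisiest step.

```python
def img2img_steps(num_inference_steps, strength):
    # Number of denoising steps that will actually be executed.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    # Index in the step schedule where sampling starts (0 = noisiest step).
    start_index = max(num_inference_steps - steps_to_run, 0)
    return start_index, steps_to_run


for strength in (0.25, 0.5, 1.0):
    start, run = img2img_steps(20, strength)
    print(f"strength={strength}: skip {start} steps, run {run} steps")
```

So with 20 steps, strength 0.5 starts halfway down the schedule, and only strength 1.0 begins at the noisiest step.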
That all makes sense. However, when I do the same thing with 1.5, I see an unmistakable relationship with the fine-tuning data that increases dramatically as I bump the strength up. In that case, by the time I hit, say, 0.7 for the strength, the resulting image is very clearly highly conditioned by the fine-tuning data. But for some reason, this setup is behaving entirely differently. I'm sure there's something wrong with my workflow somewhere...
do you use trigger words for your lora?
yes
oh, I haven't looked at your code
you are using the refiner model
loras trained on base do not work for refiner
you would have to train a separate lora on the refiner
refiner and base have different architectures and are not compatible with each other. Honestly, I would just skip the refiner. Base is already good enough, sometimes even better than refiner.
Actually, I was using the refiner, but thanks to a helpful comment, I am now using the base, not the refiner (and getting the results I described to you)
then I have no clue 🤷♂️ there is no reason why the lora should behave differently in img2img than in text2img
I'm going to gather up the workflow and see if I can get the relevant parts of the code concise, thanks for your input 🙂
Hello! I feel like I'm going insane—need a sanity check.
I'm training a LoRA using kohya-dreambooth method in Colab. It literally worked six hours ago. I collected images, I used the notebook to caption them, I trained LoRA, everything worked swell. I try again now, it just plainly doesn't work. The training proceeds without errors, but the end result does not even try to capture the character. The activation word is not even recognized as a signal to try and make a character. Where should I look? What might I be doing wrong?
typo?
Nope, I checked for typos and I tried two different datasets under two different names. Both fail.
I actually came up with a solution on my own. Not sure why I haven't tried this earlier. I changed the optimizer from Adam8bit to just Adam and upped the learning rate 4x. As an ML person, I should have figured out sooner that it's just a learning rate issue. Everything works now. It's not perfect, but I'll play with LR some more to get the best result.
Question, I am trying to tune Stability.AI to produce an "app icon" image... Instead it produces many icons... What information path am I looking for in order to "tune" this, or change it into a consistent image? I see here the parameters of the stabilityai python library I am using but... It does not seem to result in what I am looking for. Images attached. Not sure this is even the right place to ask, but pointing me in the right direction would be very useful!
train a textual inversion or a text encoder lora on app icons
training data shouldn't be the problem as there are plenty of icons freely available
what happens if you reduce your height and width to approx the size of an icon?
may be a bad idea, but its a thought
ok I can look into this... That is doable.
Tried that actually, a valid idea, but no, still gives inconsistent results.
got this working now, thanks for your help!
Hey there, I'm new in ML and SD so I've got some questions. I learned how to fine-tune an SD model using Dreambooth on a specific person. Now I want the person to wear some specific clothes. I learned on the internet that I can train a textual inversion embedding and combine these two solutions to get the result. But after training an embedding, generated images of any person wearing the clothes have very low quality, the face and body is deformed, very rarely I can get anything close to more or less realistic person. It's needless to say, that when I try to generate an image of a fine-tuned person wearing the embedding clothes, it results in something awful.
Where it could go wrong? Maybe I should learn more about textual inversion and train it better?
You might find something from this training on emojis. I think I’ve seen the dataset on huggingface. https://replicate.com/fofr/sdxl-emoji
what do your training images of the clothing look like?
You can't control what aspects textual inversion is learning. If you are unlucky, it is not just learning color and shape of your clothing but also low quality of the image
besides that, it is always helpful to add enhancing tags to your prompt (ultrahigh quality, 30mm photo, raw photo, product photo, ...)
hi! does stability.ai provide API for finetuning image model?
I just downloaded photos of a suit from an online shop. Just a woman in a suit with white background. The images are professional, have good quality and I'd say that it should be easy to learn on them.
BTW, maybe I just said it vaguely, but the generated pictures are ok, the clothing is close enough to the images, but the person and their face is deformed.
then it's probably just a resolution problem. Take a look at SD upscaling
SD has problems getting fine details right if they are low resolution
don't we use upscalers on already generated images? so if the generated images have bad details it'll just upscale them?
Ok, I'll try to simplify the problem:
- I generate a photo of a person - everything is fine
- I connect my embedding of clothing and generate a photo of a person wearing the clothing - clothing is ok, person face and body are deformed
maybe you show an example of the generated image?
if the face of the generated person is deformed then you can usually fix that by upscaling
based on Realistic Vision 5.1
almost the same prompt just without the embedding token
yeah, okay, there is something strange with the embedding. Maybe it's overtrained
could also be that realistic vision does not work that well with embeddings
used embeddings from civitai, not perfect but way better
just noticed that it learned 2 poses from the photos, maybe overtrained indeed
maybe you could help me with captions for photos? I used these photos with just "photo of a woman wearing [keyword], white background". Maybe I should describe all the details?
--learning_rate 0.0004 has to be redundant if you do --unet_lr 0.0004 --network_train_unet_only no?
Right but where's the code for this? It does not do me any good if it is not for development purposes!
I can look at how this works but it seems like it is using some sort of custom diffuser... I guess I would be looking for information on how to create these.
I think I have found the right path of information.
I doubt that there's anything special about that LoRA except the dataset. I just can't find the dataset now. 😦
Not even sure what a lora is at this point! Was just told to do this stuff by my job and here I am lol.
RnD weird.
But if I remember correctly it was just a plain simple raw dump of the Apple emojis.
So do you know what software they are using to train the data? I am using openAI, Pinecone (and milvus for local), and langchain
do you know the Stable Diffusion equivalent?
Those are for text -> text, this would be for text -> image
Found this image. I will use it as a guide.
https://i.imgur.com/J8xXLLy.png
Most are using kohya-ss or kohya_ss, depending on whether you want a GUI or not for training a LoRA.
Thank you!
Non GUI: https://github.com/kohya-ss/sd-scripts/tree/sdxl
GUI which uses code base from above: https://github.com/bmaltais/kohya_ss
But you could probably also use replicate.com and their API to train it without having to have the hardware.
Hey,
How should I approach training Lora for specific style of outfits? I was experimenting with object-like captioning but the results were underwhelming.
Anyone know if it's practical to train a LoRA using irl images if I plan on using an anime SD model?
Specifically, I'm looking to train a LoRA for an irl dog
I don't even know what "irl" means, but if you refer to "real life photos" or something then yes, that should work
when I train on photos of my face I can use the same model to create anime images of me
what training settings are you currently using for your own LoRA?
hm, a lot of custom stuff. But best results so far were with rare tokens, learning rate ~5e-4 unet only training, batch size 10, default noise offset
and what optimizer? 🙂
thx, I will give it a shot. It's still pretty wild out there with regards to best training settings and also contradictory information. I lately also had quite good results with only 4DIM 😄
after so many tries I think there is no best setting
it depends on your training dataset and what you want to achieve
but yes, most models out there use WAY too high dims
@stiff dust I'm trying this concept of tagging. Do you think one should still use Jackie Chan person if you are not using instance and class but captions for each image? Like close-up shot of Jackie Chan person holding chopsticks or should one go with close-up shot of Jackie Chan holding chopsticks? https://arxiv.org/abs/2306.00926
Exquisite demand exists for customizing the pretrained large text-to-image model, e.g., Stable Diffusion, to generate innovative concepts, such as the users themselves. However, the newly-added concept from previous customization methods often shows weaker combination abilities than the original ones even given several images during t...
When trying this training I get a feeling I get best results if I also use Jackie Chan person when generating, but it's somewhat inconclusive.
I realise I didn't really form my comment as a question so I'll try again: Does anyone have any tips on how to consistently get braces in stable diffusion? Is there some models that's better (or even capable) of it than others? Even when I tried to train an embedding on a girl with braces in every shot it turned out without braces as result... I'm out of ideas lol!
Have you seen this https://civitai.com/models/46690/braces-concept ?
Yeah, I tried that one but it affects the base model way too much. Trying to understand why braces are so hard to get right, it shouldn't be harder than jewellery or something but it is!
Maybe just inpaint the teeth?
That is a good idea. Didnt really consider that, might be easier!
The funny part is that it seems SD understands what I want, since the girl always shows a lot of teeth when I add braces in the prompt, it just doesn't draw the braces... maybe it's trained on those fancy invisible braces!
@normal ember Hey! I got pretty far, now the real work begins. Got this set up, and got the data set im going to use set up as well. Not sure what to do next!
maybe it's a resolution problem (as so often). Try to upscale the image and only repaint the teeth
I have constant issues with photographic LoRAs "unflattening" my 2D illustration base model outputs -
Has anyone experimented with doing a (for lack of a better word) "2-pass" training flow roughly similar to:
- train the lora on photos
- img2img the original photos at some appropriate denoising rate using illustration prompts and illustration base model
- generate additional images with the lora + illustration model
- use these "flattened" 2D source images to train a second lora
Just want to make sure this isn't a known dead-end before I spend much time on it (or if there are any tweaks to the above that would make sense)
Usually people don't use images generated by the same model to train that model. The "errors" in the generated images would get learned by the model and make it collapse.
But you could try
Does anyone know how kohya (gui or ss) selects images for a batch?
For example, if I'm training on 5 images - with the same resolution - and a batch size of 4, presumably the first batch will be images 1,2,3,4, but what about the second batch? Would it be images 5,1,2,3 or images 5,5,5,5 or something more obscure?
Is there a way to get kohya to log which images are actually being used in a batch? That would help 🙂
@real citrus Sent you a PM!
bit of a technical question - I've been fine-tuning SDXL (not dreambooth, not lora). It's slowly getting somewhere, but I always seem to have garbled outputs when I zoom in closely. I recently discovered that I'm training on the first base model, which had a sub-optimal VAE. Not 100% sure that's the problem, but this is why I'm wondering if the VAE is involved at all when fine-tuning? If not, I can just use the updated one at inference time and need to find the cause elsewhere.
Just use Kohya-ss as I think the issue is that your training script is too old @mental anchor
Does anyone here have experience fine-tuning a lora for inpainting?
Working on this rn. Afaik there is no way to train a separate inpainting Lora at the moment.
https://github.com/kohya-ss/sd-scripts/issues/502
How much of the tertiary model is merged into the final result in the A1111 WebUI?
The formatting has changed, and many tutorials and videos are outdated, any easy to understand updated guide?
Discard weights with matching name? How can I use that?
https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#checkpoint-merger
The webui wiki is outdated too so no idea where to gather this info
i have a pretty weird question. bear with me for a sec.
say i have a dataset of a single image, with a prompt called "PROMPT A". I then put it in my training script of my choice put it through a single iteration of Dreambooth training, during which it finds the output to have a loss of 0.345.
Then I modify that training script to not require any image in my dataset whatsoever, but only a prompt. I then put my original model through a single iteration of Dreambooth training, which uses "PROMPT A" to generate an image (that would have been used for loss calculation), after which I manually input a loss of 0.345 (exactly the same as the previous example) and do backpropagation with it.
my question is: all else being equal (the class name, the instance name, etc.), would the resulting model from both examples be identical?
Hi, I'm trying to create an embedding / textual inversion for the style of an artist, I'm using kohya and the learning doesn't seem to work very well. Can someone help me ?
I'm using this tutorial, but it's not about a style it's about a specific person, so I'm not sure if I should change some settings - I did, but I don't know if it's good. And considering that it takes more than a day to complete an epoch of training, testing blindly does not really work in my case.
https://civitai.com/articles/618/tutorial-kohya-ss-dreambooth-ti-textual-inversion-embedding-creation
Does anyone here train on a 4070 Ti with 12GB? And how long does it take for you to finetune a model
Full model or LoRA?
From what I know you need like 16G to just load the model
Possibly some tricks here https://twitter.com/kohya_tech/status/1672826710432284673
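For SDXL LoRA training, if you cache the Text Encoder outputs, it looks like you can go up to C3Lier (LoCon) with rank (dim) 128 at batch size 1 on 12GB of VRAM (with a low rank there is a fair amount of headroom). Without caching, it seems to need 16GB.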
Hello , everyone I'm getting this error while generating images:
Runtime : m1 and m2 have same dtype
i have a lora but it only really works at 1.4, not 1. just adam8bit. what do i increase to push it earlier? lr: 0.0003, weight decay 0.1, cosine with 5 restarts. testing and changing individual settings barely does anything so its probably a combination of both
its got 5000 steps so that shouldnt be the problem...
It’s a SDXL Lora so I suppose a full model. But I am quite new to teaching LoRA
16gb for training is quite insane ngl😅
I mean
It’s working now
But at this rate it will be done in 2 days
Ok, 12G for LoRA might be enough but you probably need to pull all the tricks there are to not go over 12G and spill into RAM.
You got any tips on where to find these tricks?
train unet only, gradient checkpointing, xformers, cache latents to disk, small batch size, not too large network dim
possibly use the adafactor optimizer instead of adamw but I'm not fully sure, maybe someone else knows better
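As a starting point, the tricks above could be collected into one command line roughly like this. It's only a sketch: the script name and flag spellings are what I believe sd-scripts uses, the network dim of 16 is an arbitrary example value, and dataset/model paths are left out, so check everything against `--help` for your installed version.

```python
import shlex

# Assumed sd-scripts LoRA entry point for SDXL; verify against your checkout.
args = [
    "accelerate", "launch", "sdxl_train_network.py",
    "--network_train_unet_only",        # train unet only
    "--gradient_checkpointing",         # trade compute for memory
    "--xformers",                       # memory-efficient attention
    "--cache_latents",
    "--cache_latents_to_disk",          # cache latents to disk
    "--cache_text_encoder_outputs",     # only makes sense with unet-only training
    "--train_batch_size=1",             # small batch size
    "--network_dim=16",                 # example value for a "not too large" dim
]

command = shlex.join(args)
print(command)
```

The optimizer is deliberately left out here since nobody in the thread was sure about it.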
check reddit too, I'm sure there are many people who want to train a LoRA on 12G
I will check, thanks for the tips!
afaik it just prunes the model
Who is the best LoRa creator here? I want to create children’s books and have LoRas for multiple characters to be able to create books at scale.
Anyone have experience doing something like this?
Could someone help me complete a render that was too closely zoomed in and ControlNet refused to work with the Checkpoint (ZavyChromax v12) I created it with? Disclaimer: It's NSFW but not overtly so.
Any idea why my LoRA came out really bad if I use raw images from my mobile phone with resolution 3024x4032? Kohya didn't log any errors. Trained with training resolution 1024x1024 and the results were really bad.
But after cropping those to 1024x1024 and training with same parameters, the results were excellent.
I usually train with random cropped images and never had issues, it must be the high resolution? Dunno, it's super strange for me.
Question about Lora training:
A friend of mine was suggesting to train different spots.. "in between" so you can find which one is the best and not under trained and over trained.
just wondering if anyone has worked out how to train a person who has a lot of tattoos properly?
random cropped images is why
I don't really know anyone that uses it anymore
since bucketing exists now
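For anyone wondering what bucketing does with a 3024x4032 phone photo, here is a generic sketch of aspect-ratio bucketing. It is not kohya's actual implementation; the 64 px step and the 512 to 2048 side limits are assumptions I picked for illustration.

```python
def make_buckets(max_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    # Enumerate (width, height) pairs whose area stays at or below max_area,
    # with both sides rounded down to a multiple of `step`.
    buckets = set()
    for w in range(min_side, max_side + 1, step):
        h = min(max_side, (max_area // w) // step * step)
        if h >= min_side:
            buckets.add((w, h))
            buckets.add((h, w))   # also add the rotated bucket
    return sorted(buckets)


def pick_bucket(width, height, buckets):
    # Choose the bucket whose aspect ratio is closest to the image's.
    aspect = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))


buckets = make_buckets()
print(pick_bucket(3024, 4032, buckets))   # 3:4 portrait photo
```

The idea is that the whole photo gets scaled down to a bucket close to its own shape, instead of a random 1024x1024 window being cut out of a much larger image.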
I want to make a 16 latent channels vae for testing how could I start?
but I have issues with uncropped raw images from my mobile phone with resolution 3024x4032... that's why I'm confused. There is only little resemblance. But if I crop them to 1024x1024 I get decent results
Hi, I'm trying to make a Lora that generates variations of cute animals faces? I've managed to train a Lora to generate the same watercolor texture and style, but the animals faces are always the same, always the same bunny from sdxl base or other models. Is it possible to get this kind of variation just with Loras? Or would I have to train a checkpoint for this?
I would like to get variations like these ones
Different animals faces
@stiff dust Tested this?
`--v_pred_like_loss ratio option is added. This option adds the loss like v-prediction loss in SDXL training. 0.1 means that the loss is added 10% of the v-prediction loss. The default value is None (disabled).
In v-prediction, the loss is higher in the early timesteps (near the noise). This option can be used to increase the loss in the early timesteps.`
no, but it sounds strange. If you want to increase the loss in earlier timesteps, either use min_snr_gamma or use min and max timesteps. v-pred is a quite different objective and you usually need hundreds of thousands of training steps to adapt a model to v-pred
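For context, this is roughly what the min_snr_gamma weighting looks like for a noise-prediction loss, as I understand it from the Min-SNR paper. It's a sketch with made-up SNR values; the trainer's own code may differ in details.

```python
import numpy as np


def min_snr_weight(snr, gamma=5.0):
    # Per-timestep loss weight for noise prediction: min(SNR, gamma) / SNR.
    # Very noisy timesteps (low SNR) keep weight 1, while almost-clean
    # timesteps (high SNR) are scaled down.
    snr = np.asarray(snr, dtype=float)
    return np.minimum(snr, gamma) / snr


snr = np.array([0.01, 1.0, 5.0, 100.0])   # made-up signal-to-noise ratios
weights = min_snr_weight(snr)
print(weights)
```

With gamma=5, everything at or below SNR 5 is untouched and the SNR 100 timestep only counts 5% as much, which is how the noisy early timesteps end up with relatively more influence.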
I found that the text encoder is very quickly overfitting on your training data. If you have enough sample images, try to train the unet only.
Thank you! I'll try to train unet only and see the results
Ok, maybe it's just for full training. It does not say.
we just published this article about using the new ChatGPT multi modal to help improve and accelerate captioning. Sharing this to this great community https://civitai.com/articles/2436
When you can use the API it will get even more useful. Not sure about the cost though.
good that I haven't done git pull since many weeks ^^°
I train unet only anyway 😄
hey can anyone help me with a question?
this one
no, that doesn't make any sense at all
in general: SD inference and SD training are two very different things. What you do when SD generates images is VERY different from what you do when you train SD
Do you know any alternatives?
can you explain in more details?
you mean during training SD doesn't generate an image so that it can calculate an L2 loss with the ground truth image in the dataset?
yes, during training you don't generate images, you denoise images
or better: you use an image, add noise to it, then predict the noise on the image (the difference between the noise you added and the noise predicted is your l2 loss)
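In toy numpy form, that training step looks something like the sketch below. The "unet" is replaced by stand-in functions, the noise level `sigma` and the simple additive noising are placeholders, and there are no latents or prompts, so treat it purely as an illustration of where the L2 loss comes from.

```python
import numpy as np

rng = np.random.default_rng(0)


def training_loss(image, predict_noise, sigma):
    # 1. draw gaussian noise, 2. add it to the image,
    # 3. ask the model which noise was added, 4. compare with the truth.
    noise = rng.normal(size=image.shape)
    noisy = image + sigma * noise
    predicted = predict_noise(noisy)
    return float(np.mean((predicted - noise) ** 2))


image = rng.uniform(-1.0, 1.0, size=(4, 8, 8))   # toy stand-in for an image
sigma = 0.5


def lazy_model(noisy):
    # Stand-in that always answers "there is no noise".
    return np.zeros_like(noisy)


def oracle_model(noisy):
    # Cheating stand-in that knows the clean image, so it recovers the noise.
    return (noisy - image) / sigma


print(training_loss(image, lazy_model, sigma))
print(training_loss(image, oracle_model, sigma))
```

No image is ever generated here: the loss is only about how well the added noise was guessed.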
if you generate an image "from scratch" (at inference) things are different. You start with pure noise and then denoise it. The problem itself is ill-posed here. If I use a pure noise image and ask for predicting the noise, the answer is trivial (everything is noise). You can't really use any meaningful loss here
the reason it still works at inference is that SD does not know that the image is pure noise and tries to find any familiar patterns in the noise. Similar to how humans can look into the sky and see faces in the clouds
also things like CFG are inference-only, they don't exist at training time
also training is always only one step while in inference you use many steps to generate the image
and many other things. It's just that training and inference are two different things in SD. It's not like in most other machine learning problems, where inference and training are more or less the same. Here, you basically solve two different tasks.
@stiff dust Thanks for the explanation. I keep looking for some 3blue1brown style deepdive on the training process but there's none.
About the noise adding process, is it adding noise until the whole image is completely noise? Or just a little bit of it?
And what exactly is the "predicted noise"? Is that yet another separate process?
Or maybe you're saying we denoise the image similar to img2img? And you calculate the difference only on the pixels changed by the noise adding and denoising process?
Prompt for Steampunk Batman in Victorian London (Inspired by Dishonored)
Visualize Batman in a steampunk attire set against a Victorian-era London street:
Batman's Attire: Dark leather combined with bronze elements. Instead of the traditional Bat-logo, imagine a bat-shaped gear centered on his chest. His utility belt would be a series of leather pouches adorned with copper rivets and dangling chains. His eyes would glow behind amber-colored aviator goggles.
Victorian London Street: Wet cobblestones from a recent rain, with a gentle mist rising from the sewers. Gas-lit lampposts casting a golden glow, creating dancing shadows. Tall red-bricked buildings with smoking chimneys lining the street, and Victorians in period attire casting furtive glances at the Dark Knight.
Background Elements: A blimp displaying a large bat insignia lights up the night sky of London, serving as a steampunk Bat-signal.
Capture the atmosphere and tones reminiscent of the game "Dishonored", blending the old with the futuristic in a unique manner.
you basically mix the original image with a random noise image. The 0th timestep would be 100% noise, 0% image, but you skip that step. You train by randomly drawing a number between 1 and 999. If you draw, for example, the number 200, then you use 20% image and 80% noise, at least for a linear schedule. In practice, other scheduling schemes are in use. What the unet predicts is the noise image (quite unintuitive, as you are interested in the image, not the noise, but you get the image after subtracting the noise)
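Here is that mixing written out as a small sketch, using the numbering from the message above (a bigger number means more image) and the simplified linear schedule. Real trainers use other schedules, and libraries such as diffusers count timesteps the other way round (a higher timestep means more noise), so the function and its numbering are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def mix(image, noise, t, num_steps=1000):
    # Linear toy schedule: t / num_steps of the result is image, the rest noise.
    image_fraction = t / num_steps
    noisy = image_fraction * image + (1.0 - image_fraction) * noise
    return noisy, image_fraction


image = np.ones((8, 8))                    # toy stand-in for an image
noise = rng.normal(size=image.shape)       # plain gaussian noise

t = int(rng.integers(1, 1000))             # random draw from 1..999
noisy, frac = mix(image, noise, 200)       # the example draw of 200
print(t, frac)
```

For the draw of 200, `frac` is 0.2, i.e. 20% image and 80% noise, and the unet's job is to guess `noise` from `noisy`.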
ok, but how do you generate the noise? just the random algo that gives you the starting noise during inference?
yes, just gaussian random noise
@stiff dust when you say the unet predicts the noise image, do you mean it generates a noise image using a prompt?
no. It gets a mixture of noise and image as input and has to predict the noise part of the image. The prompt is an additional input/conditioning
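A minimal sketch of the mixing described above, assuming the simplified linear blend and toy four-pixel "images". The function names and the `noise_fraction` parameter are made up for illustration; how a timestep maps to a noise fraction depends on the scheduler, so the sketch takes the fraction directly. Real schedulers use different mixing coefficients.

```python
import random

def mix_with_noise(image, noise, noise_fraction):
    """Linearly blend an image with noise (the simplified 'linear' picture)."""
    keep = 1.0 - noise_fraction
    return [keep * p + noise_fraction * n for p, n in zip(image, noise)]

def recover_image(noisy, predicted_noise, noise_fraction):
    """Undo the blend once the noise part is known (or predicted)."""
    keep = 1.0 - noise_fraction
    return [(x - noise_fraction * n) / keep for x, n in zip(noisy, predicted_noise)]

rng = random.Random(0)
image = [0.2, 0.5, 0.9, 0.1]                     # toy "image": four pixel values
noise = [rng.gauss(0.0, 1.0) for _ in image]     # gaussian random noise
noisy = mix_with_noise(image, noise, noise_fraction=0.8)   # what the unet sees

# The unet's training target is `noise`. A perfect noise prediction
# gives the original image back after subtracting it.
restored = recover_image(noisy, noise, noise_fraction=0.8)
```

The prompt does not appear here on purpose: it is an extra input to the unet, not part of the mixing.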
when the unet predicts which part of an image is noise, is it simply saying which part of the image it thinks "looks wrong"?
or is it something more complicated?
You can just train a Lora using base model and use it for inpainting.
I have already done that but the results were not good. #🤝|tech-support message
hey guys, can someone recommend a windows tool to auto crop faces?
I never used it myself, but this one seemed popular amongst contacts I had. https://github.com/leblancfg/autocrop
thx
What does it mean if I fine-tune the model without captions?
Wouldn’t it train as responding to empty prompt?
It will train on everything in the images, i.e. the characters, backgrounds, and even the shadows, as it has no idea what exactly it should replicate or what the loss is trying to capture
I think yes, but would it improve the overall quality of the selected images via training?
I tried this but had an issue with this type of training. If tag a exists in image A and also in image B, tag a learns from both images and mixes their looks.
I mean yeah, but tag a should correspond to something. For example, if tag a is 'forest environment' then it should correspond to the forests in both image A and B
Using tag a on an image without a forest will of course give weird results. For example, if you use tag a on an actual forest but then also on a desert, the loss optimiser will force a mix to match both images
Something like that.
I use wd1.4 tagger
Some tags exist in multiple images, which makes the mixing issue even worse
You gotta do some manual tag checking for that tho- or change the training setting so it doesn't take your tags seriously
I also used wd1.4 and ngl it's wayyy easier to use it but I am still gonna stick with blip captions- They are just better for sdxl
I used around 120 images to train a LoRa with myself, but it usually only learns my ears (distinctively sticks out) and hair, but my face details usually get lost. (And many times it has artifacts.)
I'd love to use it both with graphic/real ckpts, not sure if possible.
My base model is RealisticVision 2.0, as that got me the "best" results so far.
Ngl I feel like it's a GIGO problem.
What kind of dataset should I provide? I know it has to be "varied", but for example when I include too many "different expression, but looking away" that pose gets overtrained(?), and appears too frequently (even when tagged).
I read a few guides, but the dataset part is always very vague, I'd be thankful for some examples.
(I always crop to 1:1, and remove the bg)
You could use 10 face focus images to get your face lora
now that you say "face lora", I realised that the full-body shots are useless (and probably hurting my results) as I can always tell SD what body to draw (so added flexibility)
thanks, I'll retry like that
should I do diff. expressions, angles?
yes, you could train with different angle face images
hi, first of all: what you describe might be a side effect of how CFG works in inference. During training, there is no CFG. But when you generate images, you always use CFG (often a default value of 7 or even higher). CFG is often described as "how strongly your prompt influences your image". But you can also think of it as an enhancer. When you generate a face of yourself, the CFG takes what makes your face differ from the average face and adds this to the image. So the CFG exaggerates your facial features. Anything that makes your face special will be increased by the CFG. So when you have the feeling that your images have strange artefacts, the first thing to try is always to decrease the CFG value. This is particularly useful when you want photorealism. Try CFG 4 and check if the images look better
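A small sketch of the "enhancer" view, using the usual classifier-free guidance formula on toy numbers. The function name and the values are made up for illustration; in a real pipeline the two inputs are the unet's predictions without and with the prompt.

```python
def apply_cfg(noise_uncond, noise_cond, cfg_scale):
    """Classifier-free guidance: start from the prediction without a prompt
    and move along the direction that the prompt adds, cfg_scale times."""
    return [u + cfg_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

# Toy numbers: the "average face" prediction vs. the prediction with your prompt.
uncond = [0.0, 0.0, 0.0]
cond = [0.1, -0.2, 0.0]      # the small differences that make the face yours

at_scale_1 = apply_cfg(uncond, cond, 1.0)   # identical to cond, nothing exaggerated
at_scale_4 = apply_cfg(uncond, cond, 4.0)   # differences amplified 4x
at_scale_7 = apply_cfg(uncond, cond, 7.0)   # differences amplified 7x
```

Values that do not differ between the two predictions stay untouched, which is why lowering the scale mainly tones down the distinctive features.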
besides that, I don't think that full body shots are useless.
However, you are using SD 1.5, right? I don't have much experience with 1.5 training, but I found that it is quite vulnerable if you train it on too much variety (or too many images in general). I have to say I always struggled training my face on 1.5, but the best results I got in 1.5 were with very few, high quality images. I got MUCH better results when training on SDXL, though, and in SDXL I could use as many images as I wanted and the results got better rather than worse
Not related to this but I've tried many lr_schedulers and optimizers. I seem to have better control over overcooking when using a cosine scheduler with some warmup along with AdamW.
Also, an alpha of about half of dim seems to be working nicely.
warmup is 100 steps which is about 6-7% of my total steps
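For reference, a rough sketch of what a cosine schedule with linear warmup looks like. The total of 1500 steps and the base learning rate of 4e-4 are assumptions chosen so that 100 warmup steps come out at about 6-7%; the function name is made up and this is not any trainer's actual code.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumed numbers: 100 warmup steps out of 1500 total (about 6.7%), base lr 4e-4.
schedule = [lr_at_step(s, 4e-4, 100, 1500) for s in range(1501)]
```

The peak is reached right after warmup and the rate then falls smoothly, so the late steps make only small changes.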
Prodigy worked, but the movement between epochs was too big, so it was hard to select a good sample.
I've a lot more testing to do, but lowering CFG to 4 gave instantly better results. Thanks for the tip!
hey! I have a somewhat strange question, I think.
So I'm wondering is it possible to train a Lora in parts?
(train it one time on the first part of a data set then train it in another run on the second part and so on)
Would this be possible? I'm asking as my Google Colab time is limited and I have a large data set.
Absolutely yes, I do that all the time. It's like cooking really. The trick is to figure out which concepts train fast, or from which point you want to resume... Maybe take some concepts out so you can focus on others, it gets crazy, but it's doable
Or even if it's just one concept, save in step increments, and use the last safetensors file to start the next batch
Oo that's amazing news 😁, thank you! I'm going to try it!
What do you think the training recipe for those "world morph" models on civitai is? As in data set and training
how big and how much variation in dataset for "world morph"?
There could be metadata in the model that could give clues. Some remove it, but quite commonly it's still present.
https://github.com/by321/safetensors_util
Using this?
Yeah, that would probably work. Simple file viewer would do too.
It's the __metadata__ entry you want to have a look at.
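If you'd rather not install a tool, a short sketch like this reads that entry directly. It relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name, the demo file and the `ss_network_dim` key with its value are made up for illustration, and real files may contain different keys or no metadata at all.

```python
import json
import os
import struct
import tempfile

def read_safetensors_metadata(path):
    """Return the __metadata__ dict from a .safetensors header (or {})."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))    # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})

# Demo on a tiny hand-made file (header only, no tensors) so the sketch runs anywhere.
demo_header = json.dumps({"__metadata__": {"ss_network_dim": "32"}}).encode("utf-8")
demo_path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<Q", len(demo_header)) + demo_header)

metadata = read_safetensors_metadata(demo_path)
```

Pointing the function at a downloaded LoRA file instead of the demo file should show whatever training settings the author left in.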
Thank you
@stiff dust Have I understood the code correctly that the network_alpha acts as a brake on how much the weights can change for each training step?
if self.lora_layer.network_alpha is not None: w_up = w_up * self.lora_layer.network_alpha / self.lora_layer.rank
If so, is there a logic to increasing the brake if and when you increase the network_dim?
because its a matrix factorization. The weight change is w_down @ w_up (@ = matrix multiplication). So a single weight is changed by the dot product between a row in w_down and a column in w_up.
the length of these vectors is the rank of the lora
if you have rank 1 then you just multiply two numbers
if you have rank 100 then you multiply 100 numbers and sum them up
now during training each single weight parameter is changed in each step, but the change cannot be arbitrarily high (due to the learning rate)
but changing 2 x 100 numbers and multiply them and sum them up gives you a 100 times larger change than just changing 2 x 1 number
so in each training step you make a small update on the lora. But a lora of rank 100 has a 100 times stronger effect than a lora of rank 1, so you divide the result by 100 to make both comparable
otherwise you would have to decrease your learning rate whenever you increase the rank
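The argument can be checked with toy numbers. In this sketch every entry of the two vectors has the same size, which is an assumption made purely to keep the arithmetic obvious; the function name is made up. Scaling the product by alpha / rank is equivalent to scaling w_up as in the quoted line.

```python
def weight_change(rank, value, alpha=None):
    """Change of one weight: dot product of a w_down row and a w_up column,
    both of length `rank`, optionally scaled by alpha / rank."""
    row = [value] * rank        # toy vectors: every entry has the same size
    col = [value] * rank
    delta = sum(r * c for r, c in zip(row, col))
    if alpha is not None:
        delta = delta * alpha / rank
    return delta

unscaled_r1 = weight_change(rank=1, value=0.1)
unscaled_r100 = weight_change(rank=100, value=0.1)           # about 100x larger
scaled_r1 = weight_change(rank=1, value=0.1, alpha=1)
scaled_r100 = weight_change(rank=100, value=0.1, alpha=1)    # comparable again
```

With the scaling in place, the same learning rate gives changes of similar size at rank 1 and rank 100.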
Makes sense! How large are these vectors in the base model?
And thanks for an excellent reply as always!
Or maybe the dot product is always stored?
this is just how matrix multiplication works. The matrices in the original model are different depending on the layer and so on, but usually they are quite big (~1000 rows and columns)
What I find odd is that when training a LoRA with alpha 1 vs something higher you don't necessarily get overtrained in the same way as a too high learning rate or too many training steps would do.
It just seems like the LoRA loses some flexibility in regards to how it can be mixed with other prompts if it's trained with a higher alpha.
Like the signal is stronger in the LoRA than in the base model, so it gets preference over the model. But I guess that makes sense given what you explained earlier.
Let's say you trained on a photo dataset and try to generate an image in an anime style. With a higher alpha the base model seems to get less priority and the data in the LoRA gets stronger, which results in an image that's more of a photo or completely a photo, but it could have some traits of the anime style from the base model, like the character becoming Asian instead of looking like something from the photo dataset.
I guess that has to do with the weight vectors having higher values, since we didn't reduce w_up as much when the alpha is higher, which overpowers the base model.
I never noticed such an effect of the alpha. Should check that myself
Hey, hope I may ask another question about incremental training. Do you divide your data set into parts, or will the checkpoint save how far it progressed?
@normal ember Do you have any results from giving the crop coordinates in the metadata? I want to improve my fine-tune with hand anatomy training but I'm not sure whether the coordinates would help or not.
I have not taken it any further yet, I've had much else to test first.
It would be neat to get tools that could replicate the training SAI has done when training the base model
they released their training tool
But I haven't really looked into it
Whatever became of LoRA-FA? anyone using it? advantages?
From what I have heard, it just didn't have any advantage over traditional LoRA techniques. If you were to do comparisons, there would be some differences, but enough to be worth it? Not at all
The only thing saved is the weights. If you stop and start again it'll start over as far as recursing through the image folders
Sorry, I'm not sure what you mean. So it starts over from the start or from where the checkpoint stopped?
I may be wrong, but the weights are updated continuously in small steps rather than in a single full-sized update, and each further update is nearly independent of the previously loaded weights. So once you have the weights at any stage, they should be usable as they are or can be updated further at any point
That is, every update starts from the already updated weights
(I am still learning about this so I can be wrong)
Hello everyone. I'm trying to train a LoRA on a person, but I can't seem to get the facial features right. Is there a way to "focus more on the face" when training the LoRA? Or provide some kind of "weights to the image pixels" when doing the training?
I'll say it another way, basically everything it has learned will be preserved, ie the weights. It sounded like you were asking about the image dataset. It doesn't bookmark what image was being compared
For that reason, I normally save at intervals and have multiple saves, when it starts to go off course just make adjustments and start from the last save that was going well
Thank you for your response, from what I can tell you're totally right! And that's why I want to avoid overtraining them on my data, so I'll try to avoid letting them train over and over on the same images. Thank you for your help 😁
Thank you! Then I'll need to divide my data set into several chunks I guess
Thank you all for your help ☺️, really appreciate it!
hi folks - is it possible to inference a single image using two LoRa characters?
Sure, why not. You would just load both loras and use both keywords/names. You probably would have to adjust the strength of each lora for best effect.
the concern would be that they merge into a single blended character instead of creating separate characters
Yeah it is somewhat likely to do that. Probably most loras are trained primarily with images containing only one figure in the image. But you'll generally get some images where it mixed the characters, some images with two of the same character, and some with actually two separate characters like you want. Prompting to be clear that there are two people can help.
wonder if we could do it by 1) inferencing LoRa character_A 2) outpainting with LoRa B
That could work. You could also get an image with two characters and use img2img with a mask to replace one of them.
How do I get SD to do more vibrant colours? It always darkens the pictures at the end:
For anyone wondering, one cannot do this in Kohya but can in OneTrainer.
I guess this channel is mostly about LoRA's but does anyone here have experience training ControlNet? (and is there another channel where I should be asking?) I am attempting to train a ControlNet and I keep getting these weird high frequency details that I don't want. For example this cat. I am not sure if I just need to train the model more or this is a signal that I have already overbaked it or what. Training loss has been going down very slowly but the effect remains the same.
Any advice would be very welcome
This is the loss in case anyone is interested. Batch size is 160 and the training set has 125,280 images
What model is better to use for custom LoRa training in pixel style? SD 1.4, 1.5 or XL? And also why? Should I use the cpp fork of the repo for better performance because my specs are not that good.
And also how do I do 16x16?
There is a discord for controlNET training. Dont remember the name
What is loss? What does this mean?
Oh really? If you can remember anything that could help me find it, that would be great. I'll try google
When you train a model, the way it works is by defining a loss function that the optimizer can minimize. Think of it like a metric for how well the model does and the optimizer uses it to improve the model. In this case, the loss is mean square error, so the average of all the (model_output - target) ^ 2 in the batch
If the loss goes down, it's going the right way
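As a tiny worked example of that formula, with made-up numbers standing in for the model output and the target (the function name is hypothetical too):

```python
def mse_loss(model_output, target):
    """Mean of (model_output - target) ** 2 over all values in the batch."""
    squared = [(o - t) ** 2 for o, t in zip(model_output, target)]
    return sum(squared) / len(squared)

target = [0.5, -1.0, 0.25, 0.0]         # e.g. the true noise values
bad_guess = [0.0, 0.0, 0.0, 0.0]
better_guess = [0.4, -0.8, 0.2, 0.0]

loss_bad = mse_loss(bad_guess, target)
loss_better = mse_loss(better_guess, target)
```

The closer guess gets the smaller loss, and a perfect guess would give exactly zero.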
@normal ember @stiff dust Would training on the empty token (no caption) affect the whole bias of the model? Or would any training affect the whole model? My friend claimed that training without captions would change the whole model style and I don't understand. In my understanding, training without captions would train the empty token, but how would it affect the whole model?
on inference you do cfg (classifier free guidance) which means you run the unet once with and once without caption
so training on empty caption retrains the prior distribution of the model
(what it thinks images look like without knowing a caption)
Oh, thanks a lot. That answered my question. I felt the empty caption would affect the result but didn't know how.
I'm having a really hard time performing a LoRA training on SDXL using a friend's face as input. I have ~25 images, and I'm seeing his likeness in the resulting checkpoints, but I have a SUPER hard time performing any kind of styling with his likeness, like I did with the original Dreambooth workflow on SD1.5. It always wants to pull the resulting image back towards the training images, and when I use earlier checkpoints, his likeness is lost. I'm currently not using regularisation images either, as I just want the LoRA to make images of the tuned person.
I got best results (regarding styles) with:
- using rare tokens (e.g. "photo of chris thsgc")
- train unet only (this is very important as the text encoder is very sensitive to overfitting)
in general it's hard to get a checkpoint that gives you perfect photorealism AND perfect generalization
but the results I got are still thousand times better than what I achieved with SD 1.5
I’ve found it very sensitive to both too high and too low learning rates. A higher alpha has given me much better results too. Everything without reg images.
Can’t verify it yet but I feel like I get better results with 10 repeats instead of 10x epochs.
A high learning rate looks good on the loss graph but the results are normally worse.
I’ve trained both rare tokens and not. Both possible but I agree that it’s easier to get overfitting when it’s something well trained. Alpha helped when trying to train something known.
I never touch Alpha, and have Rank/Dimension set to 128. I also train the unet only. Hmmm
I don't really have a good idea about what alpha does. If the value is 1, I think it defaults to the dimension size?
Also, is it strictly necessary to use the class prompt when triggering a lora trained on a specific prompt? If I train with the unique token "ohxw" and the class of "man", do I need to use "ohxw man" ? or just the unique word
use same style of captions you use for training. I actually prefer to use manually written captions instead of the automated ones, so you see better how you have to prompt it
also try lower rank. Very likely you will get similar results with rank 16 or 24
yeah I don't use a class prompt in the caption, but it is encoded in the directory name of the training images, for the kohya_ss interface
it's a scaling factor. To put it very simply: a higher alpha means the lora learns faster. I still haven't tried whether high alphas are really better. It sounds counterintuitive to me, but might be possible
yes, that somehow is then encoded in the prompt. It's probably something like "photo of [token] [class]" or so. In this case you should also prompt it like that
ok
is there some way to visualise the block weights of the lora, to see which blocks were trained? kind of like a feature importance graph
just take the norm of the matrices. This is usually a direct readout of the importance
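A rough sketch of that readout, assuming you already have the LoRA matrices per block as plain nested lists. The block names and numbers below are invented, the product follows the w_down @ w_up ordering used earlier in this chat, and any alpha / rank scaling is left out because it is the same factor for every block of a given LoRA.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frobenius_norm(m):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))

# Hypothetical LoRA: block name -> (w_down, w_up); names and numbers are made up.
lora = {
    "block_a": ([[0.5], [0.5]], [[0.4, 0.4]]),      # clearly trained
    "block_b": ([[0.01], [0.0]], [[0.02, 0.0]]),    # barely changed
}

importance = {name: frobenius_norm(matmul(down, up)) for name, (down, up) in lora.items()}
ranking = sorted(importance, key=importance.get, reverse=True)
```

Plotting `importance` as a bar chart would give something close to the feature importance graph asked about.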
Would be neat that you could trace the network from a prompt
I've made experiments on doing both captioning with class and without, and then prompting with and without. I came to the conclusion that it would be better to just caption as you prompt like kaibioinfo said. It made no big difference though. But could be that I mainly use known captions and they are already associated with class in the model.
How to teach the bot a new artstyle like madhubani and mural
Anyone know how these guys are getting 10+ it/s on AMD?
SD WebUI Benchmark Data; Author: Vladimir Mandic
The best I can do is 7it/s with a 7900xt
Hello, I want to ask: What will be the best way or approach to train a stable-diffusion model on my pre and post-image dataset? For example, consider a dataset with 'pre-images' featuring a man without a beard and 'post-images' depicting a man with a beard after one month. I want to develop a project where, given the pre-image without a beard as the input in an image-to-image task, and with a prompt specifying a given time, such as 'at 2 months,' the output should contain post-images of a man with a similar face but with a beard after 2 months.
So my questions are:
- is it even possible with SD, and if yes, what should be my best approach to train the model?
- if SD should not be my choice, what should be my alternative approach?
Any help would be really appreciated.
you could measure the cross attention - but that's already more complicated
that sounds like a task for control net
I guess GANs are a bit better for your task than diffusion models. But it should be possible in SD
but you usually need a lot of training data for training a control net. How large is your pre- and post-image dataset?
Anyone have any idea why training textual inversion gets such crazy results when not on base 1.5 model?
Try #🤝|tech-support this is mainly for discussing tuning / training of models.
pretty sure lorai offers a few free credits to create a quick lora (currently broken), are there any others that offer a free one before signing up?
This seems neat to possibly be able to train the model with consumer hardware. https://github.com/bghira/SimpleTuner/releases/tag/v0.7.0
Linux or Windows? Are you running rocm 5.6? Think 7900 needs that
Buying the newest card usually means waiting for drivers to catch up
They're on Linux, but I'm on WSL2 so I don't think it's the same
I've 30 train, 10 validation and 10 test images..pre-post in each set..
if I'm working with e.g. 1024x1024 regularization images and my training data set has other resolutions (= bucketing enabled), is that a problem? Would the results be better if I cropped the training data to 1024 beforehand? I've read conflicting statements
Hi! I want to train the img2img lora model to see a particular sofa model in other rooms. I did a tutorial before using SD XL base 1.09, but the results were not good. I upscaled 14 images and deleted their backgrounds, leaving only the seats. Do you have any suggestions? Sample data;
"ohwx, a couch with a blue cushion, modern sectional sofa with a reclining mechanism, basic background" here is my example prompt
does anyone have any experience with creating their own Danbooru tagger model?
i want to create my own tagger model so that it can generate my own tags
No, but sounds like a lot of work to reinvent the wheel. You could just skip using a tagger and use the tag editor built into webui, you can do any tags on any subset of photos. I've seen others out there also. Not trying to discourage, just know how much work that'll probably be
but i have like 20k images i need to tag
Usually people use wd14 for tagging anime because it uses danbooru tags. If your dataset is anime, just using wd14 is good enough. Otherwise, a lot of research has been using custom LLMs as taggers to provide descriptive prompts for model training, for example GPT-4V or open source multimodal LLMs
You could try OpenFlamingo, but I have not tested it on tagging. The way it works is that you provide it with a few images and captions and it will use those as a reference for how the captioning should look
GPT-4V is great but no API yet
Something recent that's supposed to reduce noise for captioning
I haven't used it, but it might help
hey everyone, question: when making a LoRA of, let's say, a specific type of haircut, should I also use tags like 1girl, black shirt etc.? Or should I only describe what I want, for example m0hawk, green hair? My worry is that if I tag too much stuff, using my LoRA will also pollute the model with data I don't want, like faces, backgrounds etc. Am I thinking too much?
I think the idea is that if you describe the image as well as possible, the model will be "polluted" as little as possible. I.e., if the image shows a girl and you tag it as girl, then you won't change the model too much. If you tag it as person, then the model will be changed more.
I see, my concern was that if I add girl to my LoRA and then use the LoRA, it will render a different girl even though I just want to render the hair.
But maybe I'm approaching the issue the wrong way and instead I should use something like control net and inpainting etc to swap the hair on the face I want.
Because I noticed that the LoRA I made with pictures of haircuts my wife made, if I add it to my prompts, starts to change things like the pose or even the girl shown after a certain threshold.
Question about Lora captioning
Say I'm training a specific "species" of wyvern(aka i'm training a specific and niche type of fantasy creature)
When I BLIP-caption my training set images, it's almost always "a dragon with a point head on a black background"
Is it better to remove the dragon and fix it to be "<my keyword> head on a black background...<more descriptions here>" or should I leave that and just do "<my keyword> a dragon head on a black background"? Does letting SD know it's similar to a dragon make it better?
I'm not sure what the model understands better
I’d say it depends on how you want to be able to prompt it at inference. If you train it to associate with dragon, you obviously can’t also get a wyvern and a dragon in the same image.
If this type wyvern has a specific name I’d use that.
so i should caption more like "closeup of rathalos(<-keyword) head flying through the sky among the clouds" and just drop any reference to dragon or wyvern?
If it’s obvious that it is flying when close-up of the head
If not, I’d skip the flying.
You could also associate rathalos with wyvern if you want to train several types of named wyverns that share similarities.
The wyverns from Monster Hunter tend to share a very similar body plan so if this Lora doesn't end up being trash I might try that
I was using the class as dragon for right now cause I figured since the SDXL model already knows what a dragon is and rathalos is dragonish it might help with the learning. At least that's what the guides I read suggested. I'll try again without the dragon captioning
New to this so i'm just doing this to learn, thanks for the advice
Is this the guy? Here's the prompt used: a majestic wyvern named rathalos soars through the clouds on wings that span over 10 feet. his scales shimmer with every movement, and as he lets out a piercing roar, lightning bolts strike around him. in the distance, a group of hunters can be seen attempting to bring down this formidable creature.
Oh you can post images here...I'll get a sample image from my training set
If you prompted that image using "rathalos" in sdxl, it's not that far away from Rathalos IRL
Never tried training with a solid background and how it would react, somebody else in here might know.
I'm worried about some of the backgrounds in my training set. I was looking for high res screenshots from the game but couldn't find stuff over 4 MP
Was wondering if I should tag my backgrounds or not, but I have heard mixed things on tagging backgrounds
And you can't capture them yourself from the game?
I can, it was more a laziness thing since I don't have the game installed but I could do that.
Maybe I could also generate a background using sd and just paste ^that guy onto it
Getting him from different angles, medium shot, close-ups and full might help too.
I have a couple different angles in my set now but a lot of the images are like renders or statues
https://1drv.ms/f/s!Aujm3Wog6vazpXlHjqUlkwhzICG_?e=6gasQV
Worth a try to caption if it's a statue or from a game too if you want flexibility that is.
And from there, just experiment and compare the results.
Speak of the devil, it just finished. Though an error on the 9th epoch made it so I only got 8, but here's the 6th epoch
Not terrible. Well kinda terrible but not as bad as I was expecting
prompt: a dragon flying through the sky, flame in it's mouth,lora:RATHALOS_RATHALOS-000006:1
Is the above image underfitted or overfitted? Not sure how to improve it. My first thought is it's underfitted. I also think some of those training images need to be replaced by better in-game screenshots
I would say try again with a better prompt.
Hey all! What's the best repository for training a controlnet on SDXL?
Haven't tried it, but have you tried following this tutorial https://huggingface.co/blog/train-your-controlnet
Scripts here https://github.com/huggingface/diffusers/tree/main/examples/controlnet
@jade hornet Thank you I've seen that - it's not specific for SDXL - that example is for SD1.5 right?
Presumably, but there's an sdxl script in that folder
Just wondering if you tried it
@latent charm Tried bakllava-1?
?
https://github.com/THUDM/CogVLM CogVLM seems great. But it requires 24GB*2 to inference
I am fine-tuning SDXL with captions generated by a VLM; the captions are selected and modified by me. Let's see what the result will be. The dataset is 2400 text-image pairs using tags, generated captions and empty captions, with 10 repeats.
It's a multimodal variant of LLaVA, but instead of base Llama 2 it uses Mistral 7B, which in itself is a great model for its size.
It's very GPT-4V like
It's the first model I've tried that can be controlled consistently to the result you want.
Really, sounds great. I heard mistral 7b is pretty good too.
I've written some code that automates the captioning
I'm using llama.cpp as it has a built in rest-api and is fast and lightweight, then some python code that sends the instruction and the images.
After I had done the captioning, I found that calculating the CLIP score helps to find the unwanted captions; too short or too high might mean a bad caption.
https://arxiv.org/pdf/2310.20550.pdf CapsFusion
Also I just saw this.
and then a full tune of the model?
They didn't release the weights
I'm thinking about your captioned data you are training
hey guys has anyone tested training models vs loras for people? Does one produce better results over the other?
using dreambooth in kohya
rent an a100 40gb running now
AdamW? What LR?
constant, PagedAdamW8bit, lr 2e-6, noise offset 0.1
I heard fine-tuning is better but I didn't compare it myself
I'm running on my 4090 now adamw8bit, some offset noise (0.0357) and 3e-6
but you are probably running a batch size larger than my 1 😄
batch size 20
What I don't have a clue about is how many epochs it will take
I have some room yes, using 18G VRAM atm.
I've run 4 times as long as a LoRA would take me. It's moving incredibly slowly, but I guess that's normal since the LR is way lower.
I am going the large batch size, large repeat route. It seems to give me better results
I do 10 repeats
I have 3 datasets with 10 repeats
How many epochs do you plan to run? I save state so I can resume.
running with max 20 epochs
I just rented 3 days of A100 to see the progress, and then have to decide whether to continue the training or not
it uses 300 hours in total
that's 480 000 steps if I calculated it correctly, 24000 batches
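The arithmetic, spelled out with the numbers mentioned in this conversation (2400 images, 10 repeats, max 20 epochs, batch size 20), assuming those are the numbers behind the calculation:

```python
images = 2400        # text-image pairs in the dataset
repeats = 10
epochs = 20
batch_size = 20

samples_seen = images * repeats * epochs     # steps if each image were its own batch
batches = samples_seen // batch_size         # optimizer steps at batch size 20
```

So the 480 000 counts individual images shown to the model, while the 24 000 is the number of actual training steps.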
runpod?
Another platform
I trained my LoRA on the base SD1.5 model. Later I merged that LoRA into another SD1.5 model and retrained it. Now the newly created LoRA doesn't work on other models apart from the ones I retrained. Any help on this?
Standalone Kosmos-2 auto captions https://github.com/lrzjason/kosmos-auto-captions
python autoCaptionsKosmos.py --input_dir /path/to/input --output_dir /path/to/output --clip_failed_dir /path/to/clip_failed
--clip_failed_dir is optional. Just need to enter input_dir and output_dir
That looks cool, but it already failed at describing the image on the GitHub page.
Just a random image with description. Not cherry pick LOL.
It still hallucinates like other LLMs
I can't find the part where the repo owner says that it's not cherry-picked. Anyway, the caption is littered with assumptions and things that are not actually happening within the image. That's a bit sub-optimal. I would like to see other examples and whether it does that more often.
🤣 oh, sorry. I couldn't connect lrzjason with XiaoZhi
That's fine
Ok, that's more fitting to the image. I guess there is still a lot of cleaning up to do, but it would be interesting to run a training with those captions.
I am already training with 2400 images from other llm generated captions. It is 19% now.
kosmos-2 would give captions with more subject composition
I think it was the highly descriptive captions what makes Dalle-3 king at the moment. Time to reproduce that.
Yeah, I am encouraged by Dall E 3 and pixart-alpha
What's the current state of pix art-alpha? Are the weights released?
Still no. But they released the LLaVA-captioning inference code.
Just checked
Interesting. You could use pixart-alpha scripts to caption your dataset.
It uses LLaVA-Lightning-MPT-7B-preview
Thats indeed interesting.
anyone here knows how this works?
https://colab.research.google.com/github/hollowstrawberry/kohya-colab/blob/main/Lora_Trainer.ipynb
i want to create a lora of an artstyle, but don't know how to set it up, please help!
such things can happen if the model you merge your lora into had some special noise training routine (e.g., noise offset or pyramid noise)
hello everyone im trying to train my model for guns but what tag do i use
because when i use a custom tag i get nothing
Anyone here tried training a model on a house/property?
I've been kicking around this idea of taking photographs of my childhood home (exterior) and the surrounding 1 acre of grounds (trees, pond, barn, etc), and making a LoRA out of it. I'm just not sure how best to do it... specifically:
a) If I should only train it on, for example, just the house, or if I could include photos that focus on other features of the property, so long as they're all connected (trees, pond, barn, etc)?
b) How to caption the images? For example, should I approach it like I'm training an object, as if the 'object' is the entire property?
hi, i'm just starting with automatic1111 and in my canvas zoom i'm missing this pen/paintbrush option. What should i do to get it? (i was trying to update)
ofc u can
take as many pictures as you can and describe everything, and preferably use a base model that is already trained on architecture; there are probably a lot of those already
if you train on kohya it will ask for a keyword and class. something like 01_childhoodhome house would do it. house is the class and childhoodhome is the keyword
Those are great tips. Thank you kindly!
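To make the folder naming above concrete, here is a small sketch of how a name like `01_childhoodhome house` breaks down, assuming the usual kohya convention `<repeats>_<keyword> <class>`. The helper name `parse_kohya_folder` is made up for illustration and is not part of kohya.

```python
import re


def parse_kohya_folder(name: str):
    """Split '<repeats>_<keyword> <class>' into (repeats, keyword, class)."""
    match = re.fullmatch(r"(\d+)_(\S+)(?:\s+(.+))?", name)
    if match is None:
        raise ValueError(f"not a '<repeats>_<keyword> <class>' folder name: {name!r}")
    repeats, keyword, class_name = match.groups()
    # The leading number is how often each image is repeated per epoch.
    return int(repeats), keyword, class_name


print(parse_kohya_folder("01_childhoodhome house"))
```

So `01` means each image is seen once per epoch; a folder named `10_childhoodhome house` would repeat every image ten times.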
Hi, i've got a style lora and i think it's time for a 2nd ver.
what is the method here?
re-train the lora or use the old one and just bring the old data in?
is it possible to finetune SDXL models
when i interrupt my render i get the style i want, it usually happens when i interrupt at 80%, what setting do i have to change for it to be consistent and automatically save to my folder?
I would also like to know. A lot of the images look better before the final step.
right now i just interrupt and save manually but i'd love to just pump images out without checking what % its at
Hi! I'm looking for repositories of open datasets that people use to finetune text-to-image models. One such repository I found is https://huggingface.co/datasets?task_categories=task_categories:text-to-image. Are there any other ones?
clip skip 2?
already on 2
hey guys, which tagger are you using atm?
i am particularly looking for one for faces, expressions, poses and so on
one that generates tokens
gpt-4-vision-preview is probably the best way to do it now.
can i deploy that locally?
can anyone explain to me why there is a difference between merging model A and B and Merging Model B and A?
Can't we train expressions with a lora, especially eyes and mouth? It's so horrible
Would it be possible to train a version of SD that is good at making emojis in a specific style, like android or ios?
Or could this be achieved by simply using input images or prompts without a separate model?
Yes, that has been done: https://replicate.com/fofr/sdxl-emoji
thank you for the link! I'll try to get that running locally on my dockerized SD instance. Generally with image file prompts, is there a way to essentially say "this but recolor/redraw in the same style and position", like changing the color of a flag emoji or swapping what is in the hand of a human emoji? Or maybe this should just be handled by the prompt/negative prompt? @normal ember
The model will generally learn the style. Emojis are probably a good start since they are vector-based and can be scaled to the correct resolution and converted to pixels. You also have good descriptions of them that you can use for building the captions.
just saying hi, I'm not ded XD
here's some updates with the cool new toy, juggernaut + my lora based on my master dataset
jugger / jugger + lora
prompts & settings are taken from civitai (unmodified), the ones used to represent the model. so they're cherry picked to favor juggernaut
nsfw + general anatomy works better as well, but I can't show the progress on that here
I like how one of the cover images for the model literally uses "150 mm" (headshots), "full body", "sitting" in its tags :"D
my lora really didn't like that original composition of just a headshot when these tags are added. But its only loaded at 0.7 strength, so oh well
||complex 3d render ultra detailed of a beautiful porcelain profile woman android face, cyborg, robotic parts, 150 mm, beautiful studio soft light, rim light, vibrant details, luxurious cyberpunk, lace, hyperrealistic, anatomical, facial muscles, cable electric wires, microchip, elegant, beautiful background, octane render, H. R. Giger style, 8k, best quality, masterpiece, illustration, an extremely delicate and beautiful, extremely detailed ,CG ,unity ,wallpaper, (realistic, photo-realistic:1.37),Amazing, finely detail, masterpiece,best quality,official art, extremely detailed CG unity 8k wallpaper, absurdres, incredibly absurdres, robot, silver halmet, full body, sitting||
not sure what to think of this one :/ it doesn't show off juggers abilities. why would they list this as one of the main cover images? not even the people tags are followed in any way
will redo it with one of the people I trained in my datasets
Portrait Photo a portrait, hyperdetailed photography, by Elizabeth Polunin, red haired young woman, Gianna Michaels, brooklyn, looking straight to camera, sweaty, olya bossak, nepal, very accurate photo, suspiria
redone with shirogane instead of the people they mentioned
Portrait Photo a portrait, hyperdetailed photography, by Elizabeth Polunin, red haired young woman, shirogane-sama, brooklyn, looking straight to camera, very accurate photo, suspiria
Anybody have a good auto cropping tool that sees subject to crop tighter in, while also using a given aspect ratio? With computer vision? Cropping my dataset and just want to speed it up!!
You might find some tools in this repo https://github.com/longpeng2008/awesome-image-cropping
Checking. Nice thanks
Actually don’t see any simple autocropper for portraits of humans
All very general
The official implementation for A2-RL: Aesthetics Aware Rinforcement Learning for Automatic Image Cropping - GitHub - wuhuikai/TF-A2RL: The official implementation for A2-RL: Aesthetics Aware Rinfo...
Might be this one
yikes. something is broken
(venv) D:\TF-A2RL>python A2RL.py --help
Traceback (most recent call last):
File "D:\TF-A2RL\A2RL.py", line 14, in <module>
with open('vfn_rl.pkl', 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'vfn_rl.pkl'
after installing all the modules and running python A2RL.py --help
Long time no see. This is my latest fine-tune, using the same prompt. It still needs a bit of adjustment before release.
will try to crop one with a crop command now
I haven't tried that yet.
yeah, on to the next one. hard to find one that works
I also want to find a good tool to crop to various aspect ratios.
I found this, but it has no code implementation. https://arxiv.org/pdf/1911.10492.pdf
need something that is ready to use
testing this one then giving up
nothing works at the moment
another possible one is https://github.com/xuebinqin/U-2-Net
with this code:
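`import cv2
import numpy as np
import u2net  # hypothetical wrapper module, not an API from the U-2-Net repo

# Load the pre-trained U2-NET model
model = u2net.U2NetModel()
# Load the image you want to crop
image = cv2.imread("image.jpg")
# Segment the image and remove the background
# (assumed to return the image with background pixels set to 0)
segmented_image = model.predict(image)
# Bounding box of the remaining, non-background pixels
ys, xs = np.nonzero(segmented_image.any(axis=2))
y1, y2, x1, x2 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
# Crop the image to the subject
cropped_image = segmented_image[y1:y2, x1:x2]
# Save the cropped image
cv2.imwrite("cropped_image.jpg", cropped_image)
`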
U2-net seems a bit more advanced
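Whatever tool finds the subject (a saliency mask, a face detector, and so on), turning its bounding box into a crop with a given aspect ratio is plain arithmetic. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` in pixels; the function name `fit_box_to_aspect` is made up here and not taken from any of the linked repos.

```python
def fit_box_to_aspect(box, aspect, img_w, img_h):
    """Grow (x1, y1, x2, y2) until width / height == aspect, staying inside the image."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Grow the too-short side so the subject stays inside the crop when there is room.
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect
    # If the grown box is larger than the image, shrink it uniformly.
    scale = min(1.0, img_w / w, img_h / h)
    w, h = w * scale, h * scale
    # Centre on the subject, then shift back inside the image borders.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    nx1 = min(max(cx - w / 2, 0), img_w - w)
    ny1 = min(max(cy - h / 2, 0), img_h - h)
    return round(nx1), round(ny1), round(nx1 + w), round(ny1 + h)


# A tall subject box turned into a square crop inside a 1000x1000 image.
print(fit_box_to_aspect((100, 50, 200, 250), 1.0, 1000, 1000))
```

Adding a margin around the subject before calling this gives looser crops; the tight box is only the starting point.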
@latent charm can you give this one a try?
I'll try the first one
smart cropper is totally off topic lol
ok, this seems to work quite well
https://www.youtube.com/watch?v=Fbuyu35TkE4
One of the most important aspects of Stable Diffusion training is the preparation of training images. In this tutorial video I will show you how to fully automatically preprocess training images with perfect zoom, crop and resize. These scripts will hugely improve your training success and accuracy.
Has anyone had better results in training by lowering the text encoder learning rate or freezing the text encoder?
yes, if you have bad captioning
basically the worse it is, the less you should be doing TE training. also some words are off limits unless you're dealing in the 4k image range
I use either a super low TE, or UNET only training if my keywords are lazy, or I'm using a single word to describe them
inverse is also true. my best tagged dataset, where every tag occurs around 50~1000 times across 5k images, the TE training alone did more good than the actual training did.
Will this setting affect anything if I'm training dreambooth with no captions?
The issue I have is that after it finally starts getting my subjects right, the images end up a little "overcooked", probably from being influenced by my reg images some of which were made with loras for good detail. At the same time the gens stop "listening to my prompts" and just end up rendering my subject however it wants
I've already been told to lower my prior loss weight which seems to have great results so far but I'm wondering if the UNET/text encoder training also has anything to do with it
do you guys have any tips if my general dataset is learned well but the faces are bad?
i tag my dataset by hand + DeepDanbooru tags
i dont see too much issue with the tags but yet for some reason faces get messed up quite a lot
however the rest is learned fine
im on dreambooth btw not lora
@digital dune what i recommend is that you run DeepDanbooru tagging on your images and use at least 0.03-0.05 of noise. also, for me, using random crop instead of center crop in combination with flip and color augmentation worked well
i recommend much lower learning rates with those settings and more epochs
with dreambooth i mostly use 5e-7 or so as lr
also make sure to enable shuffle captions
i recommend training with kohya ss
and add more weight to the prompts that are hard to generate (in my experience not more than 5-10 times the base weight)
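For anyone unsure what "shuffle captions" does: the comma-separated tags are reordered every time an image is seen, so the model does not learn to depend on tag position. A rough sketch of the idea, assuming the first `keep_tokens` tags (for example the subject keyword) stay in place; this is an illustration, not kohya's actual code.

```python
import random


def shuffle_caption(caption: str, keep_tokens: int = 1, rng=random) -> str:
    """Shuffle comma-separated tags, leaving the first keep_tokens tags fixed."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # in-place shuffle of everything after the fixed prefix
    return ", ".join(fixed + rest)


caption = "xyz, dog, big ears, grass, sunny day"
for _ in range(3):
    print(shuffle_caption(caption))
```

Each call prints the same tags in a different order, always starting with `xyz`.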
I have very contradictory results with text encoder training. Sometimes it overfits MUCH more than the unet, sometimes it gives the model more flexibility :/ So far I can say:
- TE training is often prone to overtraining on the image composition. As you wrote, sometimes it starts ignoring your prompt and puts everything into the same composition as the training image
- TE training sometimes totally overfits on style (makes everything anime or everything photographic), but not always. I have not found a pattern here :/
- all overfitting effects seem to get less severe when I train with low batch size (e.g. batch size 1). I have no explanation for it, but that observation was made by other people, too
- low learning rate is not always good. My feeling is that low learning rate + many epochs is much worse than high learning rate + few epochs
but what is good captioning for subject training? I'm not a big fan of the dreambooth method (tagging everything with "photo of xyz"), however, even better captions like "photo of xyz hiking, sunny day, mountains in background" are not "good" captions I guess. In the end what we want with subject training is that the text encoder associates "xyz" with the person in the image.
I've also noticed that not captioning "photo of" but only the subject itself makes it more flexible to use in different styles. If you do it the other way around, more often than not you get a photo even if you want anime or something else.
Takes forever to train with BS1, if it would not be for that I would probably almost always use that. BS of 4 seems like a good compromise.
@stiff dust What ratio between the rates for unet vs te have you found works best? Same or different rates, so to say.
I always train them after each other. So I first train text encoder as short as possible. I stop training as soon as the subject in the image looks roughly as I want it and long before I see overfitting effects. Then I restart from that using unet only training
Hello guys, I'm going to do a full finetune of SDXL to learn the aesthetic. I'm wondering whether there is a need to finetune the text encoders of SDXL as well, or whether I just need to finetune the UNet. Some people told me they always disable training of the SDXL text encoders when doing a full finetune. Many thanks
With --save_state? or just start from the generated safetensors?
you don't need save-state for that. Just from the generated safetensors
Any rough hints on the ratio between te and unet, if you've seen any pattern?
(running first test now)
and have you tested --debiased_estimation_loss?
what's that option?
paper: https://arxiv.org/pdf/2310.08442.pdf
it just changes the SNR weighting of the loss, like min-snr-gamma
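Both options rescale the per-timestep loss by a function of the signal-to-noise ratio (high SNR means very little noise was added). A small sketch of the two weightings as I understand them: min-SNR for noise prediction uses min(SNR, gamma) / SNR, and debiased estimation uses 1 / sqrt(SNR). The `snr_cap` value is an illustrative safeguard against extreme SNR values, not something taken from this chat.

```python
import math


def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Min-SNR weighting for noise prediction: min(SNR, gamma) / SNR."""
    return min(snr, gamma) / snr


def debiased_weight(snr: float, snr_cap: float = 1000.0) -> float:
    """Debiased-estimation weighting: 1 / sqrt(SNR), with SNR capped (assumed cap)."""
    return 1.0 / math.sqrt(min(snr, snr_cap))


# Low SNR = heavily noised step, high SNR = almost clean image.
for snr in (0.01, 1.0, 5.0, 100.0):
    print(f"snr={snr:>6}: min_snr={min_snr_weight(snr):.3f}  debiased={debiased_weight(snr):.3f}")
```

Both push down the weight of nearly clean timesteps; min-SNR leaves everything below gamma untouched, while the debiased weight also boosts the very noisy steps.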
subject (person or new outfit with new clothing/accessories) training is painful no matter how you do it thanks to sdxl :/
"good captioning + practices" to basically get the best result that sdxl + current training tools can offer is a lot of effort.
In return you get a lora where 5/5 images represent your subject between "good enough" and "utterly perfect"
^ my take on shirogane -> 400ish images + manual tagging + regularization images
Personally I'd rather do 20% of that work, to achieve a 'good enough' result, where I invest the same time into gathering the dataset, but keep my captioning down to keyword + automated
In return I get a lora where 3/5 images are good enough. 1/10 is "utterly perfect"
^my take on 2B cosplay -> 400ish images + keyword + automated tagging / no regularization images
(all of this without relying on absurd dimension settings, to hide any dataset issues)
also in regards to composition, all my (manually tagged) loras follow wording/composition better than default sdxl, so my manual tagging is definitely working
<subject name>, girl, race, type of photo | optional: unique things about this image - like is it a cosplay? is in front of a window?
important about the optional keywords, as well as the mandatory ones, is to have a regularization set, that is ALSO all tagged in the same style, with the same keywords appearing as well
in doing so, <subject name> actually gets trained on the unique details that make this person, this person.
by having different races in my reg set, it learns what features are racial, which are unique to this person. (also retrains some racial details, which are wrong in default sdxl)
type of photo is important, since it absorbs the background/colors/general composition of the image, rather than letting it drift into the subject name. <- but this only works cause my reg set is ALL tagged in the exact same way by hand
for anyone else reading this: there are a lot of ways of doing this, and they are all correct
this is just the best option for me, based on my datasets and interests, and long-term goals with training sdxl
The lower batch size theory is not a theory but a fact
I cant remember where I saw the documentation but it is definitely true
People often recommend never doing more than batch size 2, the REAL question is... how much worse does it go from BS2 to BS3? Or from BS1 to BS2? I wish there was a wiki for all of this but I guess since this is all relatively new technology we are all stuck on the trial-and-error phase.
Btw. Tysm for that info. It is SO hard to find help on these topics sometimes.
yes, it's just totally unclear to me why this is the case 🤷‍♂️ I haven't seen such a phenomenon in any other machine learning problem
Does it mean low bs would be great to learn the fine details?
I usually use high bs which is great for the shape but not much for the fine details.
I think, but dont quote me on this, that you can mitigate the effects of high BS if you set the gradient accumulation steps to match
It was a long time ago since I tried it, but I remember having good results going BS8 and gradient acc steps to 8 as well. It is technically slower than GAS1, but it did work way back when I tested it. Might be worth a shot
yes, gradient accumulation is the slow and less vram hungry variant of batch size. But it does not work "better". If high batch size won't work for you, gradient accumulation won't work either.
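To illustrate the point above: with a mean loss, averaging the gradients of several small batches gives the same gradient as one big batch, so the effective batch size is batch size times accumulation steps (BS8 with 8 accumulation steps behaves like batch 64). A toy sketch with a one-parameter model and made-up data; real training adds effects this ignores, such as noise sampling and mixed precision.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) pairs, purely for illustration.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
        (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]
w = 0.5

# One optimizer step with batch size 8.
full = grad_mse(w, data)

# Batch size 2 with 4 accumulation steps: average the micro-batch gradients.
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

print(full, accumulated)
```

The two numbers match up to floating-point rounding, which is why accumulation saves VRAM but should not be expected to train "better" than the equivalent batch size.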
@stiff dust when i train with dreambooth i never add my own prefix
i just tag the images naturally and create my own model. it works really well. i know the demos tell you to use a unique prefix but IMO its bullshit. it really makes no sense at all as long as you really finetune a model
if you do it properly you are not breaking your model by using dog instead of xyz dog
it will just make the dog the way you trained it: your dog
and regarding batch size. all our batch sizes are low xD even 5-6 is low. it makes almost no difference compared to 1 or 2. i recommend lower LR with higher batch sizes and more epochs
the whole idea of using "xyz dog" is to not let the model turn every dog into your dog. Also, it can help training if you add some rare tokens that are not yet used for anything
yes i know but that kind of kills the concept of training a model IMO
sure this isnt just adafactor or the adaptive rates messing things up? cause on constant I never get this
yes, but that is not the point. Lower batch size just works better for whatever reason. Using lower learning rate with higher batch size is also super strange - you do it the other way around usually
i recommend to use cosine with 10% warmup
what do you mean with constant? constant learning rate schedule? Yes, I always use that
I'm still doing constant with batch 8 or 10, and have no issues with overfitting. (do note that my smallest datasets are 300 images / and all are genuinely different from one another)
yeah, I do 5% warmup
it might depend on what you are training. It's just: training subjects works better on super low batch size. I experienced that now for many different subjects and training setups.
thats fine. just to tackle the early overfitting
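For reference, "cosine with 10% warmup" means the learning rate ramps up linearly over the first 10% of steps and then follows half a cosine wave down to zero. A minimal sketch of that shape; the function name `lr_at` and the step counts are made up for illustration, and trainers differ in small details such as whether warmup starts at zero.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float, warmup_frac: float = 0.10) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, base_lr=1e-4))
```

Passing `warmup_frac=0.05` gives the 5% variant; a constant schedule would simply return `base_lr` after warmup.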
subjects, faces, expressions. basically almost everything except styles
i mostly use learning rates between 1e-8 to 1e-6 and 200-300 epochs
with random crop, flip augmentation and at least 0.03 of noise
while my shadowheart lora worked out great, I guess I'll retry it with batch 1. see if there's an improvement
also shuffle caption is very important
for loras make sure to train the UNET like 10 times more than the text encoder
and for my cases lower learning rates with more epochs really do the trick
for loras that are like for faces i sometimes go down to 1e-8 for the text encoder even
this depends on a lot of factors, and wont help without the context of your tagging style / dataset size / training settings
all that depends on a lot of factors.
I have the feeling there is no right and only way of training. Seems to be different for every dataset and every problem you deal with.
in my experience it all stands and falls with the tagging
the better the tagging the better the result
atm my issue is that there are really not that many taggers that can produce tokens
i am using DeepDanbooru but that is very much anime-based
deepdanbooru also does double concept words. basically, if you use two or more words to describe the same thing, you eventually mess up both of them :/
I always use natural prompts, but I should give tagging a chance
tagging is really helpful
solved a lot of issues in my models
but its really hard work
what convinces me is that tagging makes it easier to control if images are consistently prompted and if same tags appear in regularization images
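That kind of consistency check is easy to script once captions are comma-separated tags. A small sketch, assuming the captions are already loaded as strings; the helper names and the example captions are made up for illustration.

```python
from collections import Counter


def tag_counts(captions):
    """Count in how many captions each tag appears (once per image)."""
    counts = Counter()
    for caption in captions:
        counts.update({t.strip().lower() for t in caption.split(",") if t.strip()})
    return counts


def missing_in_reg(train_captions, reg_captions, ignore=()):
    """Tags used in the training set that never appear in the regularization set."""
    train, reg = tag_counts(train_captions), tag_counts(reg_captions)
    return sorted(t for t in train if t not in reg and t not in ignore)


train = ["shirogane, girl, cosplay", "shirogane, girl, window"]
reg = ["girl, window", "girl, portrait photo"]
print(tag_counts(train).most_common())
print(missing_in_reg(train, reg, ignore={"shirogane"}))
```

The subject keyword is passed in `ignore` because it is supposed to be absent from the regularization images; anything else in the output is a tag the reg set does not cover.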
i tagged most of my images by hand
automated always results in meh to good enough results. manual tagging gets me there all the way, but oh god it takes long. and I dont even want anyone to learn the horror that is hydrus network, just for the sake of fast manual tagging x_x
with 5-20 tags
and then use DeepDanbooru to enhance that
i even made a program for that
that can weight concepts and respect the tags in them
but manual tagging is still much faster than manual prompting
depends on the amount of images
my biggest models have like 50k images
really no fun to tag that by hand
with so many images you can automate that I guess
not really as the taggers all wont tag what you are looking for
like if you want a dog with big ears
no tagger will tag that for you
or small tail
and so on
they will tag dog,dark fur for you tho
writing your own python script, to tag based on your custom word flavor chains is your best option
you might be able to train one on the clip embeddings
i tried to train the DeepDanbooru tagger to learn my tags
that failed hard xD
also it only runs in cpu for some reason here
took 8 days just to not work xD
rip
but yeah. training your own classifier wasnt very optimized, last I checked
what rank are you guys using for lora?
so unless you rent a A100 cluster, you can forget about it XD
im on 128 atm but it generates a very inflicting model in some cases
A100 cluster for sure bro xDDD
8~32 for normal loras
64~128 for my master dataset (currently at 6k images manually tagged -> goal is around 30k)
256 to make a point of why not to use 256 XD
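As a rough feel for what those ranks mean: LoRA adds two small matrices per adapted layer, so the extra weights grow linearly with rank. A back-of-the-envelope sketch; the layer width of 1280 is only an illustrative number, not a claim about any specific SDXL layer.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Extra weights LoRA adds to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


d = 1280  # illustrative layer width (assumption)
full = d * d
for rank in (8, 32, 128, 256):
    extra = lora_params(d, d, rank)
    print(f"rank {rank:>3}: {extra:>7} extra weights ({extra / full:.1%} of the full layer)")
```

At rank 256 the adapter for such a layer is already a large fraction of the layer itself, which fits the experience that high ranks leave more room to disturb the base model.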
im happy i can afford 4x 4090
i havent even started with sdxl yet
because that would probably take half a year to train on that
my largest model takes 3 weeks on 4x 4090
with 1.5
at 128 every small error will kill the general capabilities of sdxl x_x
not that the lora isnt working, just that standard sdxl loses its composition & detail afterwards, unless you train to reinforce it again inside your lora
im a bit scared of sdxl yet xD
still struggling with 1.5
but the sdxl results look amazing
worst part is that the tutorials have a lot of critical false info included 😭
basically, rank 32 is the highest you can go without damaging sdxl. anything higher than that, and you need to account for the downsides in your lora, to retrain those
have you trained dreambooth on sdxl?
(not that that matters if you dont care about sdxl general capabilities)
artwork / anime / remaking your source images in slight variations XD
