#🔧|finetune


stiff dust
#

it's not about noise, but about how to weight different timesteps in the diffusion model

#

SD tries to predict the noise in your image. The less noise you have, the harder the task and the higher the loss. With min_snr_gamma you stabilize this a bit, so that timesteps with a very low amount of noise are not overweighted during training
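A minimal sketch of the weighting described above, assuming the usual Min-SNR formulation for noise prediction (weight = min(SNR, gamma) / SNR) and the scaled-linear beta schedule commonly used with SD. The function names and the schedule constants are assumptions for illustration, not taken from the chat or from any trainer's code.

```python
import math

def snr_schedule(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Return SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a scaled-linear beta schedule (assumed)."""
    snrs = []
    alpha_bar = 1.0
    for i in range(num_steps):
        frac = i / (num_steps - 1)
        # scaled-linear: interpolate in sqrt(beta) space, then square
        beta = (math.sqrt(beta_start) + frac * (math.sqrt(beta_end) - math.sqrt(beta_start))) ** 2
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs

def min_snr_weight(snr, gamma=5.0):
    """Loss weight for noise prediction: low-noise (high SNR) timesteps get clamped."""
    return min(snr, gamma) / snr

snrs = snr_schedule()
weights = [min_snr_weight(s) for s in snrs]
for t in (0, 100, 500, 999):
    print(t, round(snrs[t], 4), round(weights[t], 4))
```

Timesteps with very little noise have a huge SNR, so their weight drops far below 1, while noisy timesteps keep weight 1.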

normal ember
#

I think I'll try to remove clip_skip and add min_snr_gamma=5 first and keep steps 3000 and lr 0.0004

stiff dust
#

I would say, intuitively, it is more about: do you want to train on high details or on overall composition. min_snr_gamma should be used when you train on composition. You should not use it if you train on details. For example, when you train on faces, you don't want min_snr_gamma, because tiny details like skin structure are super important for you

normal ember
#

Yeah, then I think your recommendation is spot on.

#

I've seen examples when they have set gradient_accumulation_steps to 4. It seems to affect learning rate too? I've gone with the defaults of 1.

stiff dust
#

nah, wouldn't use that

#

make your batch size as large as possible, that's more efficient

normal ember
#

I could go 8 too, does that affect the learning in any way other than it's faster?

stiff dust
#

dunno. I would always go as high as possible.
There are other people saying lower batch size is better. I don't believe that. But I never did experiments on that.

normal ember
#

That's why it's 4. 😄

stiff dust
#

it's not faster, though. High batch size makes it rather slower

normal ember
#

1 is slow

stiff dust
#

it's only faster if you also increase learning rate. I usually use batch size 10 but still a low learning rate of 4e-4

stiff dust
normal ember
#

I feel there's more overhead when loading the images for each step the lower batch size you go

#

It spends more time loading than training if you understand what I mean.

stiff dust
#

hm... yeah okay, I never had that many images that they would not fit into my RAM

normal ember
#

I guess you do without --cache_latents_to_disk then.

stiff dust
#

yes

#

I keep the latents in RAM

normal ember
#

and with what optimizer?

stiff dust
#

AdamW

normal ember
#

number of steps?

stiff dust
#

as long as necessary

#

just record validation images and stop training if you are happy with the results or nothing happens anymore

normal ember
#

I will see what I can fit in memory

#

batch size 2 is the largest I can go without caching to disk.

#

loss is lower now

normal ember
#

@stiff dust I've read somebody claims that SDXL was trained with --noise_offset=0.0357, do you know if there's any truth to that?

stiff dust
#

it's the best working noise offset in my experience

#

but Joe Penna said they trained sdxl several times with different parameters, so there is not a single right noise offset 🤷‍♂️
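For readers unfamiliar with the parameter, here is a rough sketch of what a noise offset does as it is commonly implemented: every (sample, channel) plane of the training noise gets one extra random shift scaled by the offset. It uses plain Python lists instead of tensors, and `offset_noise` is a hypothetical helper, not a function from kohya.

```python
import random

def offset_noise(batch, channels, height, width, noise_offset=0.0357, seed=0):
    """Gaussian noise where each (sample, channel) plane is shifted by one shared random value."""
    rng = random.Random(seed)
    noise = []
    for _ in range(batch):
        sample = []
        for _ in range(channels):
            # one shift per plane, so the mean brightness of the plane can change
            shift = noise_offset * rng.gauss(0.0, 1.0)
            plane = [[rng.gauss(0.0, 1.0) + shift for _ in range(width)] for _ in range(height)]
            sample.append(plane)
        noise.append(sample)
    return noise

noise = offset_noise(batch=2, channels=4, height=8, width=8)
print(len(noise), len(noise[0]), len(noise[0][0]), len(noise[0][0][0]))
```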

covert pagoda
#

Anyone know if there is a comfyui workflow for testing LoRAs in an xy plot fashion?

stone garden
#

What is this 1girl token about?

normal ember
#

Could it be just a random token?

rain scarab
#

usually followed by solo somewhere

stone garden
#

It is part of the preset for wd14 captioning

#

Solo and 1girl

rain scarab
#

maybe the engine understands it as one word to save prompt space?

stone garden
#

Even if i dont describe any woman i get a girl because the class prompt i chose is woman

#

I don’t even need to include my instance prompt, the lora thing with <> around is enough

#

It always worked for me

latent charm
#

@stiff dust After testing my lora: the text encoder doesn't need to be trained to produce a good enough result. Might want to find out how we should properly train the text encoder

covert pagoda
quiet eagle
#

do you usually do (blip) captioning just for your training data or for the reg, too?

#

and do you add the class prompt as a prefix, too or only the main prompt

stiff dust
covert pagoda
#

Oh I see, kind of a simple genius idea

#

Thanks

covert pagoda
stiff dust
#

no, I don't think the loss gives you much valuable feedback

#

quite often loss gets higher in the beginning although the image improves

#

also the loss depends heavily on the sampled timesteps. If you want to interpret the loss you have to use proper validation data with fixed timesteps.
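A sketch of what a validation loss with fixed timesteps could look like, under stated assumptions: `predict_noise` stands in for the real model, `alpha_bar` is a toy schedule rather than the real scheduler, and latents are flat lists. The point is only that the timesteps and the noise seed are fixed, so two evaluations are comparable.

```python
import math
import random

FIXED_TIMESTEPS = (100, 300, 500, 700, 900)  # hypothetical choice

def alpha_bar(t, num_steps=1000):
    """Toy cosine-like schedule, a stand-in for the real scheduler (assumption)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2

def validation_loss(predict_noise, latents, seed=1234):
    """Mean squared error of the noise prediction over fixed timesteps and fixed noise."""
    rng = random.Random(seed)  # same noise on every call
    total, count = 0.0, 0
    for latent in latents:
        for t in FIXED_TIMESTEPS:
            noise = [rng.gauss(0.0, 1.0) for _ in latent]
            a = alpha_bar(t)
            noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(latent, noise)]
            pred = predict_noise(noisy, t)
            total += sum((p - n) ** 2 for p, n in zip(pred, noise)) / len(noise)
            count += 1
    return total / count

def zero_model(noisy, t):
    """Dummy model that always predicts zero noise."""
    return [0.0 for _ in noisy]

val_latents = [[0.1, -0.2, 0.3, 0.4], [0.0, 0.5, -0.5, 1.0]]
print(validation_loss(zero_model, val_latents))
```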

normal ember
#

Can one transfer lora training parameters if you switch to full training? Is there anything that has to change?

#

Should you still train unet only, for instance?

covert pagoda
fathom vault
#

Any models out there that can generate keywords for my images?

#

Like light description

#

And be fairly accurate

normal ember
#

Especially if you give it a few images with captions for your dataset

stone garden
#

min_snr_gamma with any optimizer other than Prodigy tends to do that.

stone garden
#

does the order of the prompt in the caption matter?

stone garden
#

hmm yeah ill rework my captions and see if it changes

covert pagoda
#

does anyone have experience with Kohya_ss full fine-tuning? I am stuck on the metadata preparation. Do I run the Python script in a Jupyter Python kernel in the merge_captions_to_metadata.py directory? See doc https://github.com/bmaltais/kohya_ss/blob/master/fine_tune_README.md#preprocessing-caption-and-tag-information. Not sure, as this was not necessary during LoRA training.

stone garden
#

do i need regularization images for SDXL Lora training?

normal ember
#

I train on different AR so it's much easier with a dreambooth style dataset.

#

Still researching parameters for dreambooth so I have only had it run for a few epochs.

covert pagoda
#

someone mentioned dreambooth without trigger, but i dont think its the same

normal ember
#

I don't know what's the difference between fine-tune and dreambooth to be honest. I don't specify dreambooth anywhere but my dataset is dreambooth style with captions for each image.

covert pagoda
#

Though I’d imagine there’s a train_network difference in the code

latent charm
stiff dust
#

Dreambooth means you use a rare token as trigger word (the paper suggested the token "sks") to fine-tune the model on a new subject

#

however, the term Dreambooth is used very differently. Sometimes it refers to full-finetuning (in contrast to Lora), sometimes it refers to the style of the caption ("photo of a sks person")

#

I would always recommend to write custom captions when using kohya 🤷‍♂️ you have most flexibility with that and you ensure that nothing strange happens

normal ember
#

@stiff dust Is there a reason a full fine tune uses lower learning rate than a lora?

stiff dust
#

yes. A lora trains a matrix factorization and not the original matrix.

#

so basically you multiply two numbers to obtain the weight change. Multiplying two small numbers gives you an even smaller number (e.g., 1e-3 x 1e-3 is 1e-6)

#

so you need a much larger change in the two numbers to obtain a noticeable change in the result
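The argument above can be sketched in a few lines. This assumes the standard LoRA form, where the weight change is (alpha / rank) * B @ A with a `d_out x rank` matrix B and a `rank x d_in` matrix A. The helper names are made up for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_delta(B, A, alpha, rank):
    """Weight change of a LoRA layer: (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]

# rank 1, both factors hold entries of size 1e-3
B = [[1e-3], [1e-3]]           # d_out x rank
A = [[1e-3, 1e-3, 1e-3]]       # rank x d_in
delta = lora_delta(B, A, alpha=1.0, rank=1)
print(delta[0][0])             # on the order of 1e-6, much smaller than either factor

# doubling both factors quadruples the weight change
B2 = [[2e-3], [2e-3]]
A2 = [[2e-3, 2e-3, 2e-3]]
delta2 = lora_delta(B2, A2, alpha=1.0, rank=1)
print(delta2[0][0] / delta[0][0])
```

Because the effective weight change is a product of the two trained factors, each factor has to move further than a directly trained weight would, which fits the higher learning rates typically used for LoRA.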

normal ember
#

Would that also mean longer training when doing a full run given same dataset is used? I know I'm trying to generalize a bit too much but still. 😄

#

There are some fine tuned models claiming 200k steps, not sure how big the datasets are though.

#

200k and let's say 50 epochs is only 4k images.

#

But if it's learning faster then it could be more images I guess.

stiff dust
#

not if they used batch size > 1.

But I'm very sure that most fine tuned models are trained on very small datasets (rather 100 images than 4000)
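The step arithmetic from the last few messages, as a small sketch. It assumes one optimizer step per batch, no gradient accumulation, and no image repeats. `images_per_epoch` is a hypothetical helper.

```python
def images_per_epoch(total_steps, epochs, batch_size=1):
    """Number of images seen per epoch for a given step count and batch size."""
    return total_steps * batch_size // epochs

print(images_per_epoch(200_000, 50))                 # batch size 1
print(images_per_epoch(200_000, 50, batch_size=10))  # batch size 10
```

With batch size 1 the 200k steps correspond to 4k images, but with a larger batch size the same step count covers proportionally more images.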

normal ember
#

If that's the case it must be a very very slow process

#

Can't find much about fine tunes on SDXL. Do you have links?

#

Many of the fine tunes are not much more than LoRA merges too.

#

Wish we could get our hands on some parts of the dataset for SDXL.

gentle flame
#

finetuning SDXL is expensive and time-consuming

#

hopefully optimizations happen or GPUs become cheaper. I'm hopeful for sharding.

hazy herald
#

hi, I am looking to train my own SDXL lora and there is an auto-captioning script that I found in a guide

#
call .\venv\Scripts\activate.bat
python.exe "finetune/make_captions.py" --batch_size="1" --num_beams="1" --top_p="0.9" --max_length="75" --min_length="5" --beam_search --caption_extension=".txt" "D:/!PhotosForAI/billie/billie-1024" --caption_weights="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth"
#

but this looks like it's calling out to an online service to generate the captions, is that correct?

#

or is it just going to download the file and use it?

normal ember
stiff dust
#

yes, it only downloads the weights.

fair helm
#

hi, does every lora have its own floating-point precision (16 or 32)?
if so, is there any conflict when merging two loras with different precisions? and how do I determine a lora's precision? thank you very much

stiff dust
#

either the script you use for merging is dealing with the conversion, or it's not and you get an exception. But you won't get an incorrect or corrupt file back.

sonic narwhal
#

Anyone have good json file for training LoRA character or object?

stone garden
#

man

#

2nd time accidentally training with dreambooth instead of Lora

sonic narwhal
#

How many concepts can fit into a LoRA? Thinking if full finetune is even necessary. Going to train on about 15 different products/objects in different categories and for each category there is going to be atleast 3-5 variations of form design & color. Dataset will probably be at around 10k images

latent charm
#

Haven't tested 10k. But I am training with 5k dataset. It is decent.

sonic narwhal
#

5k LoRA training?

#

How many different concepts are u training

latent charm
#

many

#

It contains multiple characters, anatomy, different angles

sonic narwhal
#

Does it train new objects besides humans?

latent charm
#

My training mainly focus on human.

sonic narwhal
#

ok

normal ember
latent charm
#

The learning rate is 1e-4.

sonic narwhal
#

"Additional parameters: –max_grad_norm=0" Why does it say unrecognized when starting training?

normal ember
latent charm
normal ember
#

But not cropped images as in the SDXL paper?

stone garden
#

is training with batch size 1 better ?

normal ember
#

Seems like SAI feeds same image cropped in different AR into training.

#

Page 5 and 6 in paper.

latent charm
#

The original dataset is around 2600. 2400 is anatomy cropped from original images

normal ember
#

I wonder if feeding pre-generated cropped images would do about the same as their Fourier-embedded crop conditioning parameter, or if that needs to be implemented properly in the trainer.

#

@stiff dust Do you have any knowledge about this?

latent charm
#

Might be

stiff dust
#

kohya sd-scripts uses the correct cropping conditioning only if it crops the images itself

#

if you provide already cropped images, then this information is not available to kohya

#

my own solution for that was to write the cropping parameters into the meta tags of the png file and made some code additions to kohya to read these tags and add the conditioning

#

I'm not entirely sure, though, how helpful cropping is

#

I used it when training my own face: Next to the normal 1024x1024 images I also added a few extremely high resolution images (like 4096x4096), cropped them into 1024x1024 blocks, and added them to the training data. It seemed that doing that too often ends up overfitting on close-up faces (even the cropping conditioning doesn't prevent that). When only around 1%-5% of my training data is cropped, it seemed to help improve details (e.g. skin details), but I'm not entirely sure if the improvement is really due to the cropped images.

latent charm
#

I have 1,840 face-cropped images out of 5,827 total images. It does tend to make close-up images, but it is still able to make full shots

stiff dust
#

oh, yes, it can. I just say it happens more often that it makes images in close-up without specifying "close-up"

#

the reason I came up with cropped images was rather that I hoped it helps SDXL to render images in high resolution without making the typical placement errors (two noses and so on). But it didn't really work so well, upscaling is still the better strategy

latent charm
#

cropped the character but it is interesting to see the image has two bottom navigation bar.🤣

normal ember
#

I’m cropping and resizing based upon the closest resolutions that SDXL was trained on and put them in to a folder for each resolution. But this could be improved to resize and crop to other resolutions if there are pixels enough if there’s any point hence my question.

#

My feelings are that some fine tunes have regressed in rendering resolutions that base handles just fine. My thought was that this might be able to improve if you train on same image but to different resolutions.

stiff dust
latent charm
#

These images are selected results, and I think the anatomy training might introduce other issues.

#

For example, the following image lost the relation between the legs and upper body. Failed-diffusions~

stone garden
#

Nice now my character sticks with a cleft chin😭

stone garden
#

what the fuck

#

why

#

im so fed up with this shit

#

they all have this type of chin now

#

completely random my training images didnt even contain those chins

stiff dust
#

oh yeah, I know that very well: artefacts that appear in the images although they were not part of the training images. I had the same issues with SD 1.5 and 2.1, though

stone garden
#

and the one before didnt have this (a tiny bit but not that strong)

normal ember
stone garden
#

i have over 10 pics with thighhighs in my training set and still they appear when i put them into negative prompts... and all the time

normal ember
#

Not sure how it works but it seems like when training it associates with other dimensions in the model that is related to the newly learnt data but not in the training data.

#

It looks like it's nipples from cats or something that has been associated with the ears

#

maybe the chins are related to that too

stiff dust
normal ember
opal jacinth
#

what is the reason for saving/using "training states" in the context of dreambooth, if we can actually resume training also by using the last created checkpoint as source model?

opal jacinth
jade hornet
stiff dust
#

it also stores the optimizer state and, thus, the current momentum of the optimizer

#

using training states you can stop and resume training any time without any drawbacks

#

loading from a checkpoint means the optimizer needs some warmup phase and will probably rather harm the model for the first few steps until it improves the model again

#

this is not a big issue if you stop once in the middle and resume again. But if you plan to stop and continue several times in close succession, saving states is definitely better
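A toy illustration of why the optimizer state matters when resuming. It uses plain SGD with momentum on f(x) = x**2 rather than AdamW, so it is only a sketch of the idea, not of any trainer's actual resume logic.

```python
def sgd_momentum_steps(x, velocity, steps, lr=0.1, momentum=0.9):
    """Minimize f(x) = x**2 with SGD + momentum and return the final (x, velocity)."""
    for _ in range(steps):
        grad = 2.0 * x
        velocity = momentum * velocity + grad
        x = x - lr * velocity
    return x, velocity

# uninterrupted run of 20 steps
x_full, _ = sgd_momentum_steps(5.0, 0.0, steps=20)

# stop after 10 steps, then resume in two different ways
x_mid, v_mid = sgd_momentum_steps(5.0, 0.0, steps=10)
x_state, _ = sgd_momentum_steps(x_mid, v_mid, steps=10)  # weights + optimizer state
x_ckpt, _ = sgd_momentum_steps(x_mid, 0.0, steps=10)     # weights only, momentum reset

print(x_full, x_state, x_ckpt)
```

Resuming with the saved velocity reproduces the uninterrupted run, while resuming from the weights alone follows a different trajectory.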

stiff dust
opal jacinth
astral island
#

is there a place to learn about the basics of the LORA training process? (eg. the loss calculation, gradient descent, etc.)

stiff dust
#

it's the same as for fine-tuning

valid stream
#

Hello!!
I compared 4 popular face upscalers.
Base image 640x640.
Scaled to 960x960.
Results:

4x-UltraSharp details: 3/10 noise: 2/10 light: 5/10
4x_NMKD-Siax_200k details: 7/10 noise: 4/10 light: 6/10
4x_foolhardy_Remacri details: 8/10 noise: 4/10 light: 7/10
4x_NickelbackFS_72000_G details: 4/10 noise: 6/10 light: 7/10
R-ESRGAN 4x+ details: 6/10 noise: 3/10 light: 4/10

#

Please test it yourself and verify my results

#

I'm horrified how everyone says 4x-UltraSharp is good. It is terrible for realistic graphics... maybe decent for some anime or pixel perfect art

#

small update:
for 3x scale 4x_NMKD-Siax_200k details are insane, like 9/10

stiff dust
#

I also always use Siax, but haven't evaluated it yet, so thanks for the info

stone garden
#

do lora questions go here or somewhere else

#

specifically an anime lora but still

#

i guess ill put it here

#

it seems others have talked about it here before

#

first time trying to make a lora, im using colab and NAI for anime style
am I being too picky? like I feel like it's weird compared to anime ones on civitai. granted, my character is an OC, but theyre at least kindof anime girl looking (and androgynous) so idk.
I use adaptive optimizers, batch size of 6, output samples are good quality and very accurate. but when I apply it to a model the quality (not literal) ranges from broken, to kind of accurate, to pretty good.

tldr; is this normal? should I just pick one and go for it? are loras known to depend on the model you're applying them to?
ill get an example
heres a plot, idk if the "styilization" is overfitting or just my dataset. i put style tags in the images, and some weights are a good balance but its either not accurate enough or fried

#

this is on a custom model, but its basicaly just dreamshaper with anime composition/clip. makes it very stylized and a bit messy

stiff dust
#

a lora should work with weight= 1. Using higher weights is rather hacky, using lower weights is rather a sign that the lora was overfitted

#

a lora should work with other models in most cases. However, if the other model is bad (e.g., totally overfitted and broken), then it might not work. Dreamshaper XL should work, though

#

I would not be too worried if the anatomy is sometimes wrong (too many legs or fingers). These things happen in the base model without any lora, too. Try different seeds and check if it happens too frequently

valid stream
#

Hello!
Do you guys have any pro tips for fine tuning skin texture on 2k image?

I can only think of img2img with Ultimate SD Upscale with 4x_NMKD-Siax_200k with some denoise.
Extras with upscaler is really bad for skin even with the best possible upscaler.
I want advanced opinions on this topic.
Maybe it is just actually impossible as for 2023... Should I really expect from any sort of AI to perfectly reproduce human skin?

jade hornet
#

basically, dont expect super consistent behavior with so many variables in play

#

think of a lora like a formula that says when I use these tokens I will apply some weights to bias toward training parameters, but those weights will vary depending on the other weights in play

stone garden
#

Ahh okay cool, thanks
I won't worry too much then, i have epochs that work at weight 1 and apply generally "this is that character" to the image. plus all the prompting

astral island
#

is there a page that documents the inner workings of the LORA learning algorithms in details? i'm trying to learn more about the LORA training process.

stiff dust
#

there is no special learning algorithm

#

lora is like fine-tuning

#

the main difference to fine-tuning (sometimes called Dreambooth) is

  • you freeze the original model M and train the difference model D with M+D = finetuned, instead
  • your difference model D is factorized. You can think of that as a lossy compression algorithm, similar to how you compress images as jpg. It makes the lora smaller than the original model
  • usually you don't train all matrices in the original model but only the most important ones (that's why there are so many subtypes of Lora. Most of the time Lora is just training the Transformers. Lycoris is also training the Resnet and so on)
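To make the "compression" point in the list above concrete, here is a small parameter count, assuming a single `d_out x d_in` weight matrix replaced by a rank-`r` factorization. The 1280x1280 size and rank 8 are example numbers, not values from the chat.

```python
def full_params(d_out, d_in):
    """Parameters in the full weight matrix (what a full fine-tune updates)."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors: (d_out x rank) plus (rank x d_in)."""
    return rank * (d_out + d_in)

full = full_params(1280, 1280)
lora = lora_params(1280, 1280, rank=8)
print(full, lora, full // lora)
```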
#

but loras are not much different from normal fine-tuning

astral island
#

is there a place that goes into how normal fine-tuning works? i still need to learn about the very basic like the loss calculation, SGD, backpropagation and such

stiff dust
#

uhm, this is the same for any kind of neural network, so lookup any textbook about neural networks

#

SD is using a simple l2 loss (mean squared error) on the noise prediction

#

(squared difference between predicted noise per pixel and real noise per pixel)
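The loss described above, written out as a minimal sketch on flat lists. In practice SD trains in latent space, so "per pixel" means per latent element. `mse_loss` is an illustrative helper.

```python
def mse_loss(predicted, target):
    """Mean squared error between predicted noise and the noise that was actually added."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

predicted_noise = [0.5, -1.0, 2.0]
real_noise = [0.0, -1.0, 1.0]
print(mse_loss(predicted_noise, real_noise))
```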

astral island
#

i've been trying to learn about this with GPT4, but the issue is I can't put any of it in the context of Stable Diffusion LORA training

ocean dune
#

No idea if it fits here, but what can i finetune/add to prevent third upscale resample to have 2 mouths and quite vastly different mouth/lips? The extra hand on the neck is a quick fix, it's just the extra mouth i can't seem to get rid of. Neither with denoise nor cfg

stiff dust
stiff dust
opal jacinth
#

does it make a difference if I train a SDXL LoRA or a full fine tuned model with, let's say "medium quality", pictures? Is it possible that LoRA works better for images with lower quality? or should the dreambooth training always yield better results

stiff dust
#

in theory, if you set the lora rank to max then lora and dreambooth should be the same. So no difference. In practice, most Lora implementations do not train all weight matrices but only the important ones

#

in fact, if you want to train on subjects it's totally sufficient to only train the cross attention

#

but most lora implementations train the complete transformer

#

anyways, that could be a reason why Loras are sometimes better than Dreambooth. They overfit less, because they don't train all the weights that are not really important for the training

#

you could do the same with dreambooth, though. You always can decide freely which weights in your network should be trained

opal jacinth
#

thank you, it is as interesting as ever to read your detailed answers.

normal ember
normal ember
stiff dust
#

for cropping you would have to store the cropping information somewhere. I used the PNG meta tag for that, but this is something you would have to do in python yourself. I just added a few lines to the BaseDataset.__getitem__ method to read out this cropping information

normal ember
#

Yeah, I have done metadata writing to png on other purposes already

normal ember
stiff dust
#
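############ kaidu
if "size" in img.info:  # source size stored in the PNG info as "width,height"
    ww, hh = img.info["size"].split(",")
    original_size = (int(ww), int(hh))
if "crop" in img.info:  # crop coordinates stored in the PNG info, e.g. "40,20"
    ww, hh = img.info["crop"].split(",")
    crop_ltrb = (int(ww), int(hh))
############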

# augmentation
aug = self.aug_helper.get_augmentor(subset.color_aug)
if aug is not None:
    img = aug(image=img)["image"]

if flipped:
    img = img[:, ::-1, :].copy()  # copy to avoid negative stride problem

latents = None
image = self.image_transforms(img)  # becomes a torch.Tensor in the range -1.0 to 1.0
#

that's the part in library/train_util.py

#

only the 6 lines in the ##kaidu block are relevant

#

should be around line number 1100, in the __getitem__ method

#

it reads the source size and crop coordinates from the "size" and "crop" parameters in your PNG info. Numbers are given comma separated (e.g., "crop=40,20")
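A hedged sketch of both sides of this scheme: writing the two values into a PNG and parsing them back. The key names "size" and "crop" come from the message above. The helper names, the left/top ordering of the crop pair, and the default values are assumptions. The writer needs Pillow, which is imported only inside that function.

```python
def parse_pair(text):
    """Parse a comma separated pair such as "40,20" into a tuple of ints."""
    first, second = text.split(",")
    return int(first), int(second)

def read_crop_info(info, default_size):
    """Return (original_size, crop_left_top) from a PNG info dict such as img.info."""
    original_size = parse_pair(info["size"]) if "size" in info else default_size
    crop_left_top = parse_pair(info["crop"]) if "crop" in info else (0, 0)
    return original_size, crop_left_top

def write_crop_info(src_path, dst_path, original_size, crop_left_top):
    """Save a copy of a PNG with "size" and "crop" text chunks (requires Pillow)."""
    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    meta = PngInfo()
    meta.add_text("size", f"{original_size[0]},{original_size[1]}")
    meta.add_text("crop", f"{crop_left_top[0]},{crop_left_top[1]}")
    with Image.open(src_path) as img:
        img.save(dst_path, pnginfo=meta)

print(read_crop_info({"size": "4096,4096", "crop": "40,20"}, default_size=(1024, 1024)))
```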

normal ember
#

Thanks! I found the functions in that file when I checked

ocean dune
#

Cause I'm attempting an SD 1.5 gen that is natively 620x620, I think, then upscale and resample at 2x, then another 2x, so 2400x2400

stone garden
#

does this blown out look come from overtraining?

#

this is before hires fix so it seems fine

stone garden
#

are CLIP/deepbooru still the go-tos for captioning images in a dataset? or has something better come out?

latent charm
#

There are many caption tools, like wd14, ML-Danbooru, blip, blip2, openflamingo, etc. The problem is that your captions should fit your purpose. Even when using a caption tool, you still need to manually delete or add tags.

ruby pond
#

how do you train a lora that only affects the composition of the image, and doesn't change the colors, lighting, etc? e.g. like a hands or eyes lora

stone garden
#

probably correct tagging?

#

also a variety dataset

#

probably easiest to use 3d models or real images

stone garden
#

Is there any good Mac software to create/edit captions? Also a guide for captioning images?

#

(I train on RunPod, but I'd like to caption locally)

normal ember
#

taggui

stone garden
opal jacinth
#

hey @restive bridge just being curious how you progress is so far? 🙂 I've recently tried dreambooth training but got mixed results... at least not that much improvement over trainings I did with LoRA

normal ember
#

It even claims cross platform

latent charm
#

My 5k image lora training is done. It includes multiple persons, multiple outfits, anatomy focus, nsfw, etc. Some issues were found during testing. First, element mixing. For example, outfit A has a ribbon element and outfit B also has a ribbon element. Because the 'ribbon' tag was learned in both datasets, when using the outfit A prompt to reconstruct the image, the outfit B 'ribbon' might appear.

#

Second, elements on the fly. It occurs in the second half of the epochs. Might it be identified as overfitting? Fingers, arms or elements of the outfit tear apart. Wrong composition of elements or extra elements from the outfits.

#

Third, anatomy hand training. It has the most element mixing issues during the test. It might be due to my lazy captioning. I used 'hand' as a general tag for different hand pose images. I think it mixed different hand poses, and it mixed the front side and the back side of the hand. I will test the anatomy training again with more accurate captions.

stone garden
#

do i need a classprompt ?

#

if i make a lora for a clothing style, what classprompt should i use

sonic narwhal
#

What is strongest automatic captioning model atm?

stone garden
jade hornet
stone garden
#

resizing a couple of images and wondering if anyone knows how to resize by the shortest side, ideally in python. the python image.thumbnail method and bulkresizephotos.com go by the longest side.

stone garden
#

yes tyvm

normal ember
#

Not sure if thumbnail is the best way to resize

stone garden
#

resize is probably better but IIRC you have to set the size for height and width and I'm working with a lot of varying resolutions
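One way to handle varying resolutions is to compute the target size from the shortest side first and pass the result to a resize call. The sketch below only does the size arithmetic. `size_for_shortest_side` is a hypothetical helper, not part of Pillow.

```python
def size_for_shortest_side(width, height, target):
    """Scale (width, height) so that the shortest side equals target, keeping the aspect ratio."""
    scale = target / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

print(size_for_shortest_side(4000, 3000, 1024))
print(size_for_shortest_side(3000, 4000, 1024))
```

The returned tuple can then be handed to Pillow's `Image.resize`, which expects an explicit `(width, height)` size.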

normal ember
#

If it's a dataset for training I've used a list of resolutions and cropped and resized to the one closest to the target and put them in a separate folder by resolution.
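The approach described above could look roughly like this. The resolution list is an illustrative subset of commonly cited SDXL training resolutions, not the full list from the paper, and both helpers are hypothetical. Only the geometry is computed; the actual resize and crop would be done with an image library.

```python
import math

# illustrative subset of SDXL training resolutions (assumption)
SDXL_RESOLUTIONS = [
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216),
    (1344, 768), (768, 1344), (1536, 640), (640, 1536),
]

def closest_resolution(width, height, resolutions=SDXL_RESOLUTIONS):
    """Pick the target resolution whose aspect ratio is closest (in log space) to the image."""
    aspect = math.log(width / height)
    return min(resolutions, key=lambda wh: abs(math.log(wh[0] / wh[1]) - aspect))

def resize_and_crop_box(width, height, target):
    """Return (resized_size, crop_box) for a cover resize followed by a center crop."""
    tw, th = target
    scale = max(tw / width, th / height)
    rw, rh = max(tw, round(width * scale)), max(th, round(height * scale))
    left, top = (rw - tw) // 2, (rh - th) // 2
    return (rw, rh), (left, top, left + tw, top + th)

target = closest_resolution(4000, 3000)
print(target, resize_and_crop_box(4000, 3000, target))
```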

stone garden
#

in python?

normal ember
#

Yea

stone garden
#

👉 👈 wouldn't happen to have a copy of that as well, would you?

normal ember
#

Doesn't have a target dir parameter but I guess you could add that, it puts it into directory called processed

stone garden
#

that's perfect, tyvm you saved me a lot time

normal ember
#

Try GPT-4 😉

stone garden
#

ofc

stone garden
#

does using xformers while training do anything to a lora

stone garden
#

like in a bad way

stiff dust
slim talon
#

Hey Guys,

I'm wanting to train a model on a specific pixar-like character. I've generated 20 images of the same(ish) character and I want to be able to prompt that character via dreambooth or something similar. Any tips for how I'd be able to do that?

stiff dust
#

caption the images (either use the real name of the character, or a custom name without meaning, e.g., "Monica Tdezk").

#

then train a lora on that

#

using the kohya/sd-scripts (or kohya-ss) library

#

you can either train a pure text encoder lora with low dim (e.g. dim=2)

#

or you train the unet (with a bit higher dim, e.g. dim=8 or dim=12)

#

I often found unet training slower but more flexible, but you can just try. Text encoder training is usually fast, you get good results after a few minutes

hardy storm
#

Regularization image question. Trying to train Lora models in Kohya for a buddy of mine of his kids to make them into superheroes. What would you recommend for regularization images?

stiff dust
#

you don't necessarily need them. Just try without

fervent bison
#

Is it possible to train a LoRA with 8 gigs of VRAM?

astral island
#

question: what does dreambooth/ti/lora/finetune training do with the loss from all the images in a batch? do they use them to find a derivative?

stone garden
# stone garden resizing a couple of images and wondering if anyone knows how to resize by the s...

I put together a couple of scripts with the help of ChatGPT to help me manage my growing SD training datasets: https://github.com/boomerchan/sd_training_scripts
resize_bulk.py is by far the most useful tool as it allows you to crop and resize based on:

  • a given height and/or width
  • the original SDXL training resolutions
  • the shortest side or the longest side
    And it doesn't modify your original image(s).
    glhf love

#

also thanks to twri for reminding me GPT can do Python

normal ember
stone garden
#

oh? I was going off of this. picked it up somewhere when SDXL first released

#

I'll look up the paper

normal ember
#

Yes, that's why I only used them in my resize-tool too but there are more.

#

Will result in less cropping I guess and the model should be able to handle it.

normal ember
#

a tagging tool could be useful, but there are also options available in kohya-ss for caption_suffix and caption_prefix in later versions, depending on the use case.

#

I try to keep the unique stuff in the caption file for each image and the general stuff in the config (toml) for the dataset for flexiblity.

stiff dust
hardy storm
#

Do we have a definitive answer on the whole use a celebrity name while training or don't? Seems to be a another one of those contentious mystery topics - like regularization images.

latent charm
#

I don't

#

You could create funny things that mix a famous character's features into your training target.

astral island
stiff dust
#

the derivative is a weight change. For each parameter you get a number saying how to change the parameter. When you have a batch of 10 images, you would obtain 10 of these numbers and take the average of them
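A tiny worked example of the averaging described above, using a one-parameter linear model with squared error instead of a diffusion model. All names are illustrative.

```python
def grad_single(w, x, y):
    """Derivative of (w * x - y) ** 2 with respect to w for one example."""
    return 2.0 * (w * x - y) * x

def batch_gradient(w, batch):
    """Average of the per-example gradients, one number per parameter."""
    grads = [grad_single(w, x, y) for x, y in batch]
    return sum(grads) / len(grads)

def sgd_step(w, batch, lr):
    """One gradient descent step using the averaged batch gradient."""
    return w - lr * batch_gradient(w, batch)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs
print(batch_gradient(0.0, batch))
print(sgd_step(0.0, batch, lr=0.01))
```

The three per-example gradients here are -4, -16 and -36; the update uses their average, so no "blended image" is ever formed.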

stiff dust
ripe sleet
#

I wanted to experiment today with training a lora so I was wondering for people which tends to work better if anyone knows, locon or loha lycoris?

stiff dust
#

I would simply use Lora

#

anything else is probably not necessary. Maybe it helps in rare cases for style training

sonic narwhal
#

does anyone have a good comfyUI workflow for evaluating LoRAs?

sonic narwhal
#

Also is it possible to set a setting in kohya so that it starts saving epochs after 40 epochs etc?

opal jacinth
opal jacinth
sonic narwhal
opal jacinth
latent charm
jade hornet
real citrus
stiff dust
#

it does not take the average of the images but the average of the weight changes

#

there is no rational reason why character training should work better with lower batch count

jade hornet
#

well there's some rationalization, if you leave the learning rate and steps unchanged, they will certainly not yield the same results, so I suppose it's better to say, that when accounting for those variables, results should be similar

#

and then what happens if you use an auto-adjusting optimizer, and plotting loss, there are things you have to account for when using batch size

stiff dust
#

I'm not talking about that the results should be the same. Of course you have to adjust learning rates and step count, as you do anyways. But many guides claim that you should use batch size 1, because the network would get confused when it sees multiple images at the same time and then learns some blended image and stuff like that. THIS is totally bullshit. You can achieve similar or even better results with high batch size.

hardy storm
#

Question about captioning. I've heard it said that you should caption what you don't want the model to remember. So, in other words, if training a person you know, who you want to put in different locations and clothing, then you should caption the details of the background and clothing. However, when watching captioning tutorials, people always start their captions off by saying the gender of the person (i.e. "a man" or "a woman"). Am I wrong, or does that go against the "caption what you don't want" philosophy? Because, if that were true, then captioning the sex of the person should make their gender fluid in the model. Would love some insight into this.

stone garden
#

more just a general guide. Basically when you caption, especially if you have multiple images with 1 outfit it should know exactly what to do when the prompt is brought up (but only). I have things trained that are "genderfluid", but the face and body still apply to any prompt so I rarely see change. if you don't caption that and use the Lora/mini model, it might freely assume and you can never really "get what you want" if it's not trained specifically enough to do it (if you know how to prompt for an image similar to training, it's easier than guessing). for styles this doesn't matter because they're not specific but for people I'd think its important

#

generally it also varies by dataset. not really something that is for sure going to work, but should. ai can be weird

jade hornet
# hardy storm Question about captioning. I've heard it said that you should caption what you d...

so there's identifying, and there's describing. your scene contains a woman, but if you say a woman with blonde hair, that's describing the woman. And by doing so, you basically tell the AI about the hair and it won't try to learn it. This is very handy if you want her to have red hair sometimes. read this, I think it's a very well written exploration of captions from a reddit post: https://www.reddit.com/r/StableDiffusion/comments/118spz6/captioning_datasets_for_training_purposes/

#

similarly with clothing, there was a guy in one of the discord rooms that had trained a character but always got the same outfit. his issue was not describing the outfit

#

I'll say this, to save you some grief: don't go overboard describing everything. If your dataset is diverse, the AI won't learn it easily. What I mean by that is, if all your images are in diverse settings, you don't necessarily have to say in a bedroom with a side table containing a lamp with a green shade...that's just silly. unless that lamp appears in several images

#

most of my captions are like... no more than 10 words

hardy storm
neon kelp
#

Hey there, I'm wondering if anyone knows how to train a stable diffusion model with a different language? Like Greek, spanish, japanese, etc?

wooden badger
#

I have an RTX 3090 and I can't do SDXL dreambooth training
VRAM usage is maxed out and it takes another 12 GB from shared RAM
I did SD 1.5 training with dreambooth just fine, 3000 steps in 10 minutes
Any help will be appreciated 👍

hot breach
#

the tokenizer/text encoder are really only set up for English

ruby pond
#

I think I like the caption output from this setup the best

#

it creates a fairly long caption, with repeating but slightly different descriptions of the same thing, but avoids mentioning artists for the most part

#

e.g. 'two women talking on a couch in an office, sitting on a couch, sitting on couch, calmly conversing 8k, sitting on the couch, sitting on a sofa, sitting in a lounge, giving an interview, on a couch'

#

it has a weird behaviour though when it reads a word, it adds a bunch of captions that are related to that word, so make sure to check the captions for text or signs

minor heart
#

sdxl_train.py: error: unrecognized arguments: --network_train_unet_only

minor heart
#

what am I doing wrong here? training is consuming more than 24 GB of VRAM for SDXL

stiff dust
#

the parameter does not exist for dreambooth training

#

you can use --stop_text_encoder_training=-1 instead - it should have the same effect

#

(in general: all parameters starting with --network are for loras)

full moat
#

Hello, what do I need to change in Sd so that when I boot it up the neg prompt box has a "standard text prompt that I want"?

minor heart
minor heart
#

what should I do to make it consume less VRAM?

stiff dust
#

okay, that's yet another class

#

but for sdxl_train.py you don't have to specify anything, the text encoder is not trained by default

stiff dust
#

you can try --cache_text_encoder_outputs --cache_latents though, maybe that helps with vram

minor heart
elder zealot
#

Is it currently possible to use loras with sdxl img2img? While there is an existing inherited method for this, I'm having the same issues described here:
https://discuss.huggingface.co/t/how-to-use-lora-with-sdxl-img2img/55295

real echo
#

anyone have an sdxl kohya config for a person/face? I know I'll have to edit all the paths etc, I just want a complete training config to work from

latent charm
latent charm
elder zealot
# latent charm use loras with sdxl img2img? Yes. Apply lora to SDXL refiner? no

This is what I'm trying to do:
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.load_lora_weights(prj_path, weight_name="pytorch_lora_weights.safetensors")

The tensor shapes don't align, as described by other folks with the same error in the link I posted.

latent charm
#

The answer is no, because there is no LoRA trained on the refiner, and the base is different from the refiner.

#

You could use lora with base but not refiner

elder zealot
#

Thanks so much—that totally makes sense. Now that I'm running it with the base, I'm experiencing results that I didn't expect. I can sample from the lora-weighted base using the DiffusionPipeline and see images that resemble my training data. However, when I use the same lora weights with Img2Img, I'm not seeing images that resemble the training data, even when I bump up the strength. Any ideas about what might be causing that?

stiff dust
#

there is no 0th timestep, so even if you set 100% denoise you start at step 1. If you have 20 steps in total, then 1/20 of the input image is still conserved. The early timesteps usually determine the composition of the image. So even with 100% denoise strength the rough shape of the image as well as the colors and brightness might be taken over from the input image. You can increase the number of steps to negate that (but of course, when you do img2img you usually want that the input image is bleeding into the resulting image)
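A rough sketch of how denoising strength usually translates into the number of sampling steps that actually run. The function name is made up for illustration, and the rounding rule is an assumption modelled on how common img2img pipelines behave. It only counts steps; how much of the input image survives the initial noising depends on the noise schedule and is not modelled here.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Return (first_step_index, steps_actually_run) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Number of scheduler steps that will be executed.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    # Steps at the noisy start of the schedule that are skipped entirely.
    first_step = max(num_inference_steps - steps_to_run, 0)
    return first_step, steps_to_run


# With 20 steps and strength 0.5, the first 10 steps are skipped and 10 are run.
first_step, steps_to_run = img2img_steps(20, 0.5)
```

The skipped steps are the early, composition-defining ones, which is why the input's rough layout and colours tend to carry over at moderate strength.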

#

anyways, the LoRA is applied in img2img the same way as in text2img, so if your images look a bit different, that's just because your input image affects the outcome no matter what your denoise strength is

elder zealot
# stiff dust there is no 0th timestep, so even if you set 100% denoise you start at step 1. I...

That all makes sense. However, when I do the same thing with 1.5, I see an unmistakable relationship with the fine-tuning data that increases dramatically as I bump the strength up. In that case, by the time I hit, say, 0.7 for the strength, the resulting image is very clearly highly conditioned by the fine-tuning data. But for some reason, this setup is behaving entirely differently. I'm sure there's something wrong with my workflow somewhere...

stiff dust
#

do you use trigger words for your lora?

elder zealot
stiff dust
#

oh, I haven't looked at your code

#

you are using the refiner model

#

loras trained on base do not work for refiner

#

you would have to train a separate lora on the refiner

#

refiner and base have different architectures and are not compatible with each other. Honestly, I would just skip the refiner. Base is already good enough, sometimes even better than refiner.

elder zealot
#

Actually, I was using the refiner, but thanks to a helpful comment, I am now using the base, not the refiner (and getting the results I described to you)

stiff dust
#

then I have no clue 🤷‍♂️ there is no reason why the lora should behave differently in img2img than in text2img

elder zealot
zenith vine
#

Hello! I feel like I'm going insane—need a sanity check.
I'm training a LoRA using kohya-dreambooth method in Colab. It literally worked six hours ago. I collected images, I used the notebook to caption them, I trained LoRA, everything worked swell. I try again now, it just plainly doesn't work. The training proceeds without errors, but the end result does not even try to capture the character. The activation word is not even recognized as a signal to try and make a character. Where should I look? What might I be doing wrong?

real echo
#

typo?

zenith vine
# real echo typo?

Nope, I checked for typos and I tried two different datasets under two different names. Both fail.

zenith vine
final delta
#

Question, I am trying to tune Stability.AI to produce an "app icon" image... Instead it produces many icons... What information path am I looking for in order to "tune" this, or change it into a consistent image? I see here the parameters of the stabilityai python library I am using but... It does not seem to result in what I am looking for. Images attached. Not sure this is even the right place to ask, but pointing me in the right direction would be very useful!

stiff dust
#

train a textual inversion or a text encoder lora on app icons

#

training data shouldn't be the problem as there are plenty of icons freely available

lone karma
#

may be a bad idea, but its a thought

final delta
final delta
elder zealot
rain crag
#

Hey there, I'm new in ML and SD so I've got some questions. I learned how to fine-tune an SD model using Dreambooth on a specific person. Now I want the person to wear some specific clothes. I learned on the internet that I can train a textual inversion embedding and combine these two solutions to get the result. But after training an embedding, generated images of any person wearing the clothes have very low quality, the face and body is deformed, very rarely I can get anything close to more or less realistic person. It's needless to say, that when I try to generate an image of a fine-tuned person wearing the embedding clothes, it results in something awful.

Where it could go wrong? Maybe I should learn more about textual inversion and train it better?

normal ember
stiff dust
#

besides that, it is always helpful to add enhancing tags to your prompt (ultrahigh quality, 30mm photo, raw photo, product photo, ...)

opaque rain
#

hi! does stability.ai provide API for finetuning image model?

rain crag
stiff dust
#

then it's probably just a resolution problem. Take a look at SD upscaling

#

SD has problems getting fine details right if they are low resolution

rain crag
#

Ok, I'll try to simplify the problem:

  • I generate a photo of a person - everything is fine
  • I connect my embedding of clothing and generate a photo of a person wearing the clothing - clothing is ok, person face and body are deformed
stiff dust
#

maybe you show an example of the generated image?

#

if the face of the generated person is deformed then you can usually fix that by upscaling

rain crag
#

almost the same prompt just without the embedding token

stiff dust
#

yeah, okay, there is something strange with the embedding. Maybe it's overtrained

#

could also be that realistic vision does not work that well with embeddings

rain crag
#

used embeddings from civitai, not perfect but way better

#

just noticed that it learned 2 poses from the photos, maybe overtrained indeed

#

maybe you could help me with captions for photos? I used these photos with just "photo of a woman wearing [keyword], white background". Maybe I should describe all the details?

normal ember
#

--learning_rate 0.0004 has to be redundant if you do --unet_lr 0.0004 --network_train_unet_only no?

final delta
#

I can look at how this works but it seems like it is using some sort of custom diffuser... I guess I would be looking for information on how to create these.

#

I think I have found the right path of information.

normal ember
#

I doubt that there's anything special about that LoRA except the dataset. I just can't find the dataset now. 😦

final delta
#

Not even sure what a lora is at this point! Was just told to do this stuff by my job and here I am lol.

#

RnD weird.

normal ember
#

But if I remember correctly it was just a plain simple raw dump of the Apple emojis.

final delta
#

So do you know what software they are using to train the data? I am using openAI, Pinecone (and milvus for local), and langchain

#

do you know the Stable Diffusion equivalent?

#

Those are for text -> text, this would be for text -> image

normal ember
#

Most are using kohya-ss or kohya_ss, depending on whether you want a GUI or not, for training a LoRA.

normal ember
#

But you could probably also use replicate.com and their API to train it without having to have the hardware.

woven delta
#

Hey,
How should I approach training Lora for specific style of outfits? I was experimenting with object-like captioning but the results were underwhelming.

paper field
#

Anyone know if it's practical to train a LoRA using irl images if I plan on using an anime SD model?

#

Specifically, I'm looking to train a LoRA for an irl dog

stiff dust
#

I don't even know what "irl" means, but if you refer to "real life photos" or something then yes, that should work

#

when I train on photos of my face I can use the same model to create anime images of me

opal jacinth
stiff dust
#

hm, a lot of custom stuff. But best results so far were with rare tokens, learning rate ~5e-4 unet only training, batch size 10, default noise offset

stiff dust
#

AdamW

#

I don't see any reason using something else

opal jacinth
#

thx, I will give it a shot. It's still pretty wild out there with regards to best training settings and also contradictory information. I lately also had quite good results with only 4DIM 😄

stiff dust
#

after so many tries I think there is no best setting

#

it depends on your training dataset and what you want to achieve

#

but yes, most models out there use WAY too high dims

normal ember
#

@stiff dust I'm trying this concept of tagging. Do you think one should still use Jackie Chan person if you are not using instance and class but captions for each image? Like close-up shot of Jackie Chan person holding chopsticks or should one go with close-up shot of Jackie Chan holding chopsticks? https://arxiv.org/abs/2306.00926

#

When trying this training I get a feeling I get best results if I also use Jackie Chan person when generating, but it's somewhat inconclusive.

manic loom
#

I realise I didn't really form my comment as a question so I'll try again: Does anyone have any tips on how to consistently get braces in stable diffusion? Is there some models that's better (or even capable) of it than others? Even when I tried to train an embedding on a girl with braces in every shot it turned out without braces as result... I'm out of ideas lol!

woven delta
manic loom
#

Yeah, I tried that one but it affects the base model way too much. Trying to understand why braces are so hard to get right, it shouldn't be harder than jewellery or something but it is!

woven delta
manic loom
#

That is a good idea. Didnt really consider that, might be easier!

#

The funny part is that SD seems to understand what I want, since the girl always shows a lot of teeth when I add braces in the prompt, it just doesn't draw the braces... maybe it's trained on those fancy invisible braces!

final delta
#

@normal ember Hey! I got pretty far, now the real work begins. Got this set up, and got the data set im going to use set up as well. Not sure what to do next!

stiff dust
gloomy sierra
#

I have constant issues with photographic LoRAs "unflattening" my 2D illustration base model outputs -

Has anyone experimented with doing a (for lack of a better word) "2-pass" training flow roughly similar to:

  1. train the lora on photos
  2. img2img the original photos at some appropriate denoising rate using illustration prompts and illustration base model
  3. generate additional images with the lora + illustration model
  4. use these "flattened" 2D source images to train a second lora

Just want to make sure this isn't a known dead-end before I spend much time on it (or if there are any tweaks to the above that would make sense)

latent charm
#

Usually people don't use images generated by the same model to train that model. The "errors" in the generated images would get learned by the model and make it collapse.

#

But you could try

real citrus
#

Does anyone know how kohya (gui or ss) selects images for a batch?
For example, if I'm training on 5 images - with the same resolution - and a batch size of 4 presumably the first batch will be of images 1,2,3,4 but what about the second batch? Would it be images 5,1,2,3 or images 5,5,5,5 or something more obscure?

Is there a way to get kohya to log what images are actually being used in a batch? That would help 🙂

final delta
#

@real citrus Sent you a PM!

broken hemlock
#

bit of a technical question - I've been fine-tuning SDXL (not dreambooth, not lora). It's slowly getting somewhere, but I always seem to have garbled outputs when I zoom in closely. I recently discovered that I'm training on the first base model, which had a sub-optimal VAE. Not 100% sure that's the problem, but this is why I'm wondering if the VAE is involved at all when fine-tuning? If not, I can just use the updated one at inference time and need to find the cause elsewhere.

ocean cape
#

Just use Kohya-ss as I think the issue is that your training script is too old @mental anchor

edgy wharf
#

Does anyone here have experience fine-tuning a LoRA for inpainting?

woven delta
ripe scaffold
#

How much of the tertiary model is merged into the final result in the A1111 WebUI?

#

The formatting has changed, and many tutorials and videos are outdated, any easy to understand updated guide?

#

Discard weights with matching name? How can I use that?

astral island
#

i have a pretty weird question. bear with me for a sec.

#

say I have a dataset of a single image, with a prompt called "PROMPT A". I then put it into the training script of my choice and run a single iteration of Dreambooth training, during which it finds the output to have a loss of 0.345.

#

Then I modify that training script to not require any image in my dataset whatsoever, but only a prompt. I then put my original model through a single iteration of Dreambooth training, which uses "PROMPT A" to generate an image (that would have been used for loss calculation), after which I manually input a loss of 0.345 (exactly the same as the previous example) and do backpropagation with it.

#

my question is: all else being equal (the class name, the instance name, etc.), would the resulting model from both examples be identical?

lethal lily
#

Hi, I'm trying to create an embedding / textual inversion for the style of an artist. I'm using kohya and the learning doesn't seem to work very well. Can someone help me?
I'm using this tutorial, but it's not about a style it's about a specific person, so I'm not sure if I should change some settings - I did, but I don't know if it's good. And considering that it takes more than a day to complete an epoch of training, testing blindly does not really work in my case.
https://civitai.com/articles/618/tutorial-kohya-ss-dreambooth-ti-textual-inversion-embedding-creation

viral geyser
#

Does anyone here train on a 4070 Ti with 12GB? And how long does it take for you to finetune a model

viral geyser
#

Like I really don't get it

#

it takes 31 seconds for 1 step

#

Is 12GB that bad?

normal ember
#

From what I know you need like 16G to just load the model

sullen locust
#

Hello everyone, I'm getting this error while generating images:
Runtime : m1 and m2 have same dtype

stone garden
#

I have a LoRA but it only really works at weight 1.4, not 1. just adam8bit. what do I increase to make it kick in earlier? lr: 0.0003, weight decay 0.1, cosine with 5 restarts. testing and changing individual settings barely does anything, so it's probably a combination of several

#

it's got 5000 steps so that shouldn't be the problem...

viral geyser
viral geyser
#

I mean

#

It’s working now

#

But at this rate it will be done in 2 days

normal ember
viral geyser
normal ember
#

train unet only, gradient checkpointing, xformers, cache latents to disk, small batch size, not too large network dim

#

possibly use the adafactor optimizer instead of adamw, but I'm not fully sure, maybe someone else knows better

#

check reddit too, I'm sure there are many who want to train a LoRA on 12G

viral geyser
#

I will check, thanks for the tips!

gentle flame
weary field
#

Who is the best LoRa creator here? I want to create children’s books and have LoRas for multiple characters to be able to create books at scale.

Anyone have experience doing something like this?

oak coral
#

Could someone help me complete a render that was too closely zoomed in and ControlNet refused to work with the Checkpoint (ZavyChromax v12) I created it with? Disclaimer: It's NSFW but not overtly so.

opal jacinth
#

Any idea why my LoRA came out really bad if I use raw images from my mobile phone with resolution 3024x4032? Kohya didn't log any errors. Trained with training resolution 1024x1024 and the results were really bad.

But after cropping those to 1024x1024 and training with same parameters, the results were excellent.

I usually train with random cropped images and never had issues, it must be the high resolution? Dunno, it's super strange for me.

bitter merlin
#

Question about Lora training:
A friend of mine was suggesting to train different spots.. "in between" so you can find which one is the best and not under trained and over trained.

hollow valley
#

just wondering if anyone has worked out how to train a person who has a lot of tattoos properly?

gentle flame
#

I don't really know anyone that uses it anymore

#

since bucketing exists now

latent charm
#

I want to make a VAE with 16 latent channels for testing. How could I start?

opal jacinth
# gentle flame random cropped images is why

but I have issues with uncropped raw images from my mobile phone with resolution 3024x4032... that's why I'm confused. There is only little resemblance. But if I crop them to 1024x1024 I get decent results

stark tinsel
#

Hi, I'm trying to make a Lora that generates variations of cute animals faces? I've managed to train a Lora to generate the same watercolor texture and style, but the animals faces are always the same, always the same bunny from sdxl base or other models. Is it possible to get this kind of variation just with Loras? Or would I have to train a checkpoint for this?

#

I would like to get variations like these ones
Different animals faces

normal ember
#

@stiff dust Tested this?
`--v_pred_like_loss ratio option is added. This option adds the loss like v-prediction loss in SDXL training. 0.1 means that the loss is added 10% of the v-prediction loss. The default value is None (disabled).

In v-prediction, the loss is higher in the early timesteps (near the noise). This option can be used to increase the loss in the early timesteps.`

stiff dust
#

no, but it sounds strange. If you want to increase the loss in earlier timesteps, either use min_snr_gamma or use min and max timesteps. v-pred is a quite different objective and you usually need hundreds of thousands of training steps to adapt a model to v-pred
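For reference, a minimal sketch of the per-timestep loss weight that min_snr_gamma is commonly described as applying for noise-prediction training: min(SNR, gamma) / SNR. The function name is hypothetical and this is an illustration of the idea, not the trainer's actual code. gamma = 5 is the value mentioned earlier in the chat.

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight for one timestep, given its signal-to-noise ratio."""
    if snr <= 0.0:
        raise ValueError("snr must be positive")
    # Timesteps with little noise (high SNR) are capped at gamma and so
    # get down-weighted; noisy timesteps (SNR <= gamma) keep weight 1.
    return min(snr, gamma) / snr


# Compare a noisy timestep, one at the cap, and two nearly clean ones.
weights = {snr: min_snr_weight(snr) for snr in (0.5, 5.0, 50.0, 500.0)}
```

So relative to plain training, the nearly clean timesteps contribute less and the noisy, composition-defining ones contribute comparatively more.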

stiff dust
stark tinsel
normal ember
wanton yoke
normal ember
normal ember
stiff dust
#

good that I haven't done a git pull for many weeks ^^°

normal ember
#

I train unet only anyway 😄

astral island
#

hey can anyone help me with a question?

stiff dust
#

in general: SD inference and SD training are two very different things. What you do when SD generates images is VERY different from what you do when you train SD

astral island
#

you mean during training SD doesn't generate an image so that it can calculate an L2 loss with the ground truth image in the dataset?

stiff dust
#

yes, during training you don't generate images, you denoise images

#

or better: you use an image, add noise to it, then predict the noise on the image (the difference between the noise you added and the noise predicted is your l2 loss)
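A toy sketch of that single training step, using plain Python lists instead of tensors. The linear mix and the stand-in "model" are simplifying assumptions for illustration; real trainers use a proper noise schedule and a U-Net.

```python
import random


def l2_loss(added_noise, predicted_noise):
    """Mean squared error between the added noise and the predicted noise."""
    total = sum((a - p) ** 2 for a, p in zip(added_noise, predicted_noise))
    return total / len(added_noise)


def training_step(image, predict_noise, noise_frac, rng):
    # 1. Sample the noise that will be mixed into the clean image.
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    # 2. Build the noisy input (simplified linear mix, not a real schedule).
    noisy = [(1.0 - noise_frac) * x + noise_frac * n for x, n in zip(image, noise)]
    # 3. The model only sees the noisy input and guesses the noise.
    predicted = predict_noise(noisy, noise_frac)
    # 4. The loss compares the guess with the noise that was actually added.
    return l2_loss(noise, predicted)


rng = random.Random(0)
image = [0.2, 0.5, 0.9, 0.1]  # stand-in for a latent image


def zero_model(noisy, noise_frac):
    """Hypothetical untrained model that always predicts 'no noise'."""
    return [0.0] * len(noisy)


loss = training_step(image, zero_model, 0.5, rng)
```

No image is generated anywhere in this loop: there is one forward pass on one noisy input, and the loss is computed on the noise, not on a finished picture.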

#

if you generate an image "from scratch" (at inference) things are different. You start with pure noise and then denoise it. The problem itself is ill-posed here. If I use a pure noise image and ask for predicting the noise, the answer is trivial (everything is noise). You can't really use any meaningful loss here

#

the reason it still works at inference is that SD does not know that the image is pure noise and tries to find any familiar patterns in the noise. Similar to how humans can look into the sky and see faces in the clouds

#

also things like CFG are inference-only, they don't exist at training time

#

also training is always only one step while in inference you use many steps to generate the image

#

and many other things. It's just that training and inference are two different things in SD. It's not like in most other machine learning problems, where inference and training are more or less the same. Here, you basically solve two different tasks.

astral island
#

@stiff dust Thanks for the explanation. I keep looking for some 3blue1brown style deepdive on the training process but there's none.

#

About the noise adding process, is it adding noise until the whole image is completely noise? Or just a little bit of it?

#

And what exactly is the "predicted noise"? Is that yet another separate process?

#

Or maybe you're saying we denoise the image similar to img2img? And you calculate the difference only on the pixels changed by the noise adding and denoising process?

brittle ridge
#

Prompt for Steampunk Batman in Victorian London (Inspired by Dishonored)

Visualize Batman in a steampunk attire set against a Victorian-era London street:

Batman's Attire: Dark leather combined with bronze elements. Instead of the traditional Bat-logo, imagine a bat-shaped gear centered on his chest. His utility belt would be a series of leather pouches adorned with copper rivets and dangling chains. His eyes would glow behind amber-colored aviator goggles.

Victorian London Street: Wet cobblestones from a recent rain, with a gentle mist rising from the sewers. Gas-lit lampposts casting a golden glow, creating dancing shadows. Tall red-bricked buildings with smoking chimneys lining the street, and Victorians in period attire casting furtive glances at the Dark Knight.

Background Elements: A blimp displaying a large bat insignia lights up the night sky of London, serving as a steampunk Bat-signal.

Capture the atmosphere and tones reminiscent of the game "Dishonored", blending the old with the futuristic in a unique manner.

stiff dust
# astral island About the noise adding process, is it adding noise until the whole image is comp...

you basically mix the original image with a random noise image. The 0th timestep would be 100% noise, 0% image, but you skip that step. You train by randomly drawing a number between 1 and 999. If you draw, for example, the number 200, then you use 20% image and 80% noise. At least for a linear schedule. In practice, other scheduling schemes are in use. What the unet predicts is the noise image (quite unintuitive, as you are interested in the image, not the noise, but you get the image after subtracting the noise)
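The linear mixing described above can be sketched like this. It follows the numbering used in this message (a larger drawn number keeps more of the image); note that many implementations index timesteps the other way round, with higher meaning noisier, and use non-linear schedules. All names are illustrative.

```python
import random

NUM_TIMESTEPS = 1000


def image_fraction(t: int) -> float:
    """Share of the original image kept for drawn number t (linear schedule)."""
    return t / NUM_TIMESTEPS


def mix(image, noise, t):
    """Blend image and noise according to the drawn number t."""
    keep = image_fraction(t)
    return [keep * x + (1.0 - keep) * n for x, n in zip(image, noise)]


def sample_timestep(rng):
    # 0 (pure noise) is skipped, so draw uniformly from 1..999.
    return rng.randint(1, NUM_TIMESTEPS - 1)


rng = random.Random(0)
t = sample_timestep(rng)
image = [0.2, 0.5, 0.9]
noise = [rng.gauss(0.0, 1.0) for _ in image]
noisy = mix(image, noise, t)
```

Because a different number is drawn for every training sample, the model sees every noise level over the course of training, from almost pure noise to almost clean images.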

astral island
astral island
#

@stiff dust when you say the unet predicts the noise image, do you mean it generates a noise image using a prompt?

stiff dust
astral island
#

or is it something more complicated?

woven delta
edgy wharf
tall condor
#

hey guys, can someone recommend a windows tool to auto crop faces?

unique cloak
tall condor
#

thx

latent charm
#

What does it mean if I fine-tune the model without captions?

normal ember
#

Wouldn’t it train as responding to empty prompt?

stone garden
#

It will train on everything the images are, i.e, taking everything in the image so it will train based on the characters, backgrounds, and even the shadows or stuff as it will really have no idea on what exactly to replicate or what the loss is trying to find out

latent charm
latent charm
stone garden
#

Using tag a on an image without a forest will of course result in it being weird, for example if you use tag a on an actual forest but then on a desert, it will force a mix by the loss optimiser as to match both images

latent charm
#

Something like that.

#

I use wd1.4 tagger

#

Some tags exist in multiple images, which makes the mixing issue even worse

stone garden
#

You gotta do some manual tag checking for that tho- or change the training setting so it doesn't take your tags seriously

#

I also used wd1.4 and ngl it's wayyy easier to use it but I am still gonna stick with blip captions- They are just better for sdxl

stone garden
#

I used around 120 images to train a LoRa with myself, but it usually only learns my ears (distinctively sticks out) and hair, but my face details usually get lost. (And many times it has artifacts.)

I'd love to use it both with graphic/real ckpts, not sure if possible.

My base model is RealisticVision 2.0, as that got me the "best" results so far.

Ngl I feel like it's a GIGO problem.

What kind of dataset should I provide? I know it has to be "varied", but for example when I include too many "different expression, but looking away" that pose gets overtrained(?), and appears too frequently (even when tagged).

I read a few guides, but the dataset part is always very vague, I'd be thankful for some examples.

(I always crop to 1:1, and remove the bg)

latent charm
#

You could use 10 face focus images to get your face lora

stone garden
stone garden
latent charm
#

yes, you could train with different angle face images

stiff dust
# stone garden I used around 120 images to train a LoRa with myself, but it usually only learns...

hi, first of all: what you describe might be a side effect of how CFG works at inference. During training, there is no CFG. But when you generate images, you always use CFG (often with a default value of 7 or even higher). CFG is often described as "how strongly your prompt influences your image". But you can also think of it as an enhancer. When you generate your own face, what CFG does is take whatever distinguishes your face from the average face and add more of it to the image. So CFG exaggerates your facial features. Anything that makes your face special will be amplified by CFG. So when you have the feeling that your images have strange artefacts, the first thing to try is always to decrease the CFG value. This is particularly useful when you want photorealism. Try CFG 4 and check if the images look better
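The "enhancer" view corresponds to the usual classifier-free guidance formula: take the prediction without the prompt, then move away from it along the difference to the prompted prediction, scaled by the CFG value. A minimal sketch with plain lists standing in for the two noise predictions; the function name is made up for illustration.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # prediction for the empty prompt ("average face")
cond = [0.5, 1.0]    # prediction for the actual prompt

plain = apply_cfg(uncond, cond, 1.0)   # scale 1: just the prompted prediction
strong = apply_cfg(uncond, cond, 4.0)  # the differing component is amplified 4x
```

Only the component where the two predictions differ is amplified, which is why distinctive features get exaggerated at high CFG while everything else stays put.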

#

besides that, I don't think that full body shots are useless.

#

However, you are using SD 1.5, right? I don't have much experience with 1.5 training, but I found that it is quite fragile if you train it on too much variety (or too many images in general). I have to say I always struggled training my face on 1.5, but the best results I got in 1.5 were with very few, high-quality images. I got MUCH better results when training on SDXL, though, and in SDXL I could use as many images as I wanted and results got better rather than worse

normal ember
#

Not related to this, but I've tried many lr_schedulers and optimizers. I seem to have better control over overcooking when using a cosine scheduler with some warmup along with AdamW.

#

Also, an alpha of about half the dim seems to be working nicely.

#

warmup is 100 steps which is about 6-7% of my total steps
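A small sketch of a cosine schedule with linear warmup, to make the shape concrete. The total of 1500 steps is an assumption (100 warmup steps is about 6.7% of that), 4e-4 is the learning rate mentioned earlier in the chat, and the function is illustrative rather than the trainer's own scheduler.

```python
import math


def lr_at(step, total_steps=1500, warmup_steps=100, base_lr=4e-4):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule every 100 steps.
schedule = [lr_at(s) for s in range(0, 1501, 100)]
```

The slow tail of the cosine gives several late epochs that differ only slightly, which makes it easier to pick a checkpoint just before it overcooks.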

#

Prodigy worked, but the movement between epochs was too big, so it was hard to select a good sample.

stone garden
latent tiger
#

hey! I have a somewhat strange question, I think.
So I'm wondering, is it possible to train a LoRA in parts?
(train it one time on the first part of a dataset, then train it in another run on the second part, and so on)
Would this be possible? I'm asking because my Google Colab time is limited and I have a large dataset.

jade hornet
jade hornet
latent tiger
sonic narwhal
#

What do you think the training recipe for those "world morph" models on civitai is? As in data set and training

#

how big and how much variation in dataset for "world morph"?

normal ember
sonic narwhal
normal ember
#

It's the __metadata__ entry you want to have a look at.
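Assuming this refers to the header of a .safetensors LoRA file, a minimal sketch for reading that entry with only the standard library. It relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name and the example key are illustrative.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the __metadata__ dict from a .safetensors file (empty if absent)."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


# Usage (hypothetical file name):
# meta = read_safetensors_metadata("some_lora.safetensors")
# print(meta.get("ss_network_dim"))
```

Which keys are present depends on the tool that wrote the file; trainers that record their settings typically store them as strings.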

sonic narwhal
#

Thank you

normal ember
#

@stiff dust Have I understood the code correctly that network_alpha acts as a brake on how much the weights can change in each training step?

if self.lora_layer.network_alpha is not None: w_up = w_up * self.lora_layer.network_alpha / self.lora_layer.rank

#

If so is there a logic to increase the brake if and when you increase the network_dim

stiff dust
#

because it's a matrix factorization. The weight change is w_up @ w_down (@ = matrix multiplication). So a single weight is changed by the dot product between a row in w_up and a column in w_down.

#

the length of these vectors is the rank of the lora

#

if you have rank 1 then you just multiply two numbers

#

if you have rank 100 then you multiply 100 numbers and sum them up

#

now during training each single weight parameter is changed in each step, but the change cannot be arbitrary high (due to the learning rate)

#

but changing 2 x 100 numbers, multiplying them and summing them up gives you a 100 times larger change than just changing 2 x 1 numbers

#

so in each training step you make a small update on the lora. But a lora of rank 100 has a 100 times stronger effect than a lora of rank 1, so you divide the result by 100 to make both comparable

#

otherwise you would have to decrease your learning rate whenever you increase the rank
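
The rank argument can be checked on a toy example. This is a sketch under simplifying assumptions: every entry of both factors has the same small value `eps`, and `lora_delta` is a made-up helper, not the trainer's code.

```python
import numpy as np

def lora_delta(rank, alpha=None, eps=0.01, dim=4):
    """Weight change of a toy LoRA whose factor entries all equal eps."""
    w_up = np.full((dim, rank), eps)    # shape (out, rank)
    w_down = np.full((rank, dim), eps)  # shape (rank, in)
    if alpha is not None:
        w_up = w_up * alpha / rank      # same scaling as the quoted line
    return w_up @ w_down

print(lora_delta(rank=1)[0, 0])             # a single product eps * eps
print(lora_delta(rank=100)[0, 0])           # sum of 100 such products
print(lora_delta(rank=100, alpha=1)[0, 0])  # divided by rank again
```

With alpha fixed, the rank-100 change ends up on the same scale as the rank-1 change, so the learning rate does not need to be retuned when the rank changes.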

normal ember
#

Makes sense! How large are these vectors in the base model?

#

And thanks for an excellent reply as always!

#

Or maybe the dot product is always stored?

stiff dust
#

this is just how matrix multiplication works. The matrices in the original model are different depending on the layer and so on, but usually they are quite big (~1000 rows and columns)

normal ember
#

What I find odd is that when training a LoRA with alpha 1 vs something higher you don't necessarily get overtrained in the same way as a too high learning rate or too many training steps would do.

#

It just seems like the LoRA loses some flexibility in regards to how it can be mixed with other prompts if it's trained with a higher alpha.

#

Like the signal is stronger in the LoRA than in the base model, so it gets preference over the model. But I guess that makes sense given what you explained earlier.

#

Let's say you trained on a photo dataset and try to generate an image in an anime style. With a higher alpha the base model seems to get less priority and the data in the LoRA gets stronger, which results in an image that's more of a photo or completely a photo, but it can keep some traits of the anime style from the base model, like the character becoming Asian instead of looking like the people in the photo dataset.

#

I guess that's because the weight vectors have higher values, since we didn't reduce w_up as much when the alpha is higher, which overpowers the base model.

stiff dust
#

I never noticed such an effect of the alpha. Should check that myself

latent tiger
latent charm
#

@normal ember Do you have any results from giving the crop coordinates in the metadata? I want to improve my fine-tune with hand anatomy training, but I'm not sure whether the coordinates would help or not.

normal ember
#

It would be neat to get tools that could replicate the training SAI has done when training the base model

latent charm
#

they released their training tool

#

But I haven't really looked into it

restive bridge
#

What ever became of Lora-Fa? anyone using it? advantages?

stone garden
jade hornet
latent tiger
stone garden
#

That is, they start over every time for each update, using the updated weights.

#

(I am still learning about this, so I could be wrong)

stone garden
#

Hello everyone. I'm trying to train a LoRA on a person, but I can't seem to get the facial features right. Is there a way to "focus more on the face" when training the LoRA? Or provide some kind of "weights to the image pixels" when doing the training?

jade hornet
#

For that reason, I normally save at intervals and have multiple saves, when it starts to go off course just make adjustments and start from the last save that was going well

latent tiger
latent tiger
#

Thank you all for your help ☺️, really appreciate it!

foggy cradle
#

hi folks - is it possible to inference a single image using two LoRa characters?

queen matrix
#

Sure, why not. You would just load both LoRAs and use both keywords/names. You would probably have to adjust the strength of each LoRA for best effect.

foggy cradle
#

the concern would be that they merge into a single blended character instead of creating separate characters

queen matrix
#

Yeah it is somewhat likely to do that. Probably most loras are trained primarily with images containing only one figure in the image. But you'll generally get some images where it mixed the characters, some images with two of the same character, and some with actually two separate characters like you want. Prompting to be clear that there are two people can help.

foggy cradle
#

wonder if we could do it by 1) inferencing LoRa character_A 2) outpainting with LoRa B

queen matrix
#

That could work. You could also get an image with two characters and use img2img with a mask to replace one of them.

carmine zinc
#

How do I get SD to do more vibrant colours? It always darkens the pictures at the end:

stone garden
quaint viper
#

I guess this channel is mostly about LoRA's but does anyone here have experience training ControlNet? (and is there another channel where I should be asking?) I am attempting to train a ControlNet and I keep getting these weird high frequency details that I don't want. For example this cat. I am not sure if I just need to train the model more or this is a signal that I have already overbaked it or what. Training loss has been going down very slowly but the effect remains the same.

#

Any advice would be very welcome

quaint viper
#

This is the loss in case anyone is interested. Batch size is 160 and the training set has 125,280 images

wary wedge
#

What model is better to use for custom LoRa training in pixel style? SD 1.4, 1.5 or XL? And also why? Should I use the cpp fork of the repo for better performance because my specs are not that good.

#

And also how do I do 16x16

sonic narwhal
delicate fractal
quaint viper
quaint viper
# delicate fractal What is loss what does this mean ?

When you train a model, the way it works is by defining a loss function that the optimizer can minimize. Think of it like a metric for how well the model does and the optimizer uses it to improve the model. In this case, the loss is mean square error, so the average of all the (model_output - target) ^ 2 in the batch
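
As a tiny worked example of that formula (a NumPy sketch with made-up numbers; `mse_loss` is just an illustrative helper):

```python
import numpy as np

def mse_loss(model_output, target):
    """Mean of (model_output - target) ** 2 over the whole batch."""
    return float(np.mean((model_output - target) ** 2))

target = np.array([[1.0, -1.0], [0.5, 0.0]])
perfect = target.copy()  # a model that predicts the target exactly
off = target + 0.5       # a model that is off by 0.5 everywhere

print(mse_loss(perfect, target))  # 0.0
print(mse_loss(off, target))      # 0.25
```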

#

If the loss goes down, it's going the right way

latent charm
#

@normal ember @stiff dust Does training on the empty token (no caption) affect the whole bias of the model? Or does any training affect the whole model? My friend claimed that training without captions changes the whole model's style, and I don't understand why. In my understanding, training without a caption trains the empty token, but how would that affect the whole model?

stiff dust
#

on inference you do cfg (classifier free guidance) which means you run the unet once with and once without caption

#

so training on empty caption retrains the prior distribution of the model

#

(what it thinks images look like without knowing a caption)

latent charm
#

Oh, thanks a lot. That answers my question. I felt the empty caption would affect the result but didn't know how.

versed crescent
#

I'm having a really hard time performing a LoRA training on SDXL using a friend's face as input. I have ~25 images, and I'm seeing his likeness in the resulting checkpoints, but I have a SUPER hard time performing any kind of styling with his likeness, like I did with the original Dreambooth workflow on SD1.5. It always wants to pull the resulting image back towards the training images, and when I use earlier checkpoints, his likeness is lost. I'm currently not using regularisation images either, as I just want the LoRA to make images of the tuned person.

stiff dust
#

I got best results (regarding styles) with:

  • using rare tokens (e.g. "photo of chris thsgc")
  • train unet only (this is very important as the text encoder is very sensitive to overfitting)
#

in general it's hard to get a checkpoint that gives you perfect photorealism AND perfect generalization

#

but the results I got are still thousand times better than what I achieved with SD 1.5

normal ember
#

I’ve found it very sensitive to both too high and too low learning rates. A higher alpha has given me much better results too. Everything without reg images.

#

Can’t verify it yet but I feel like I get better results with 10 repeats instead of 10x epochs.

#

A high learning rate looks good on the loss graph, but the results are normally worse.

#

I’ve trained both rare tokens and not. Both possible but I agree that it’s easier to get overfitting when it’s something well trained. Alpha helped when trying to train something known.

versed crescent
#

I never touch Alpha, and have Rank/Dimension set to 128. I also train the unet only. Hmmm

#

I don't really have a good idea about what alpha does. If the value is 1, I think it defaults to the dimension size?

versed crescent
#

Also, is it strictly necessary to use the class prompt when triggering a lora trained on a specific prompt? If I train with the unique token "ohxw" and the class of "man", do I need to use "ohxw man" ? or just the unique word

stiff dust
stiff dust
versed crescent
#

yeah I don't use a class prompt in the caption, but it is encoded in the directory name of the training images, for the kohya_ss interface

stiff dust
stiff dust
versed crescent
#

ok

ruby pond
#

is there some way to visualise the block weights of the lora, to see which blocks were trained? kind of like a feature importance graph

stiff dust
#

just take the norm of the matrices. This is usually a direct readout of the importance
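
A rough sketch of that readout, assuming the LoRA weights are already loaded into a dict of arrays and use key names ending in `.lora_down.weight` / `.lora_up.weight` (an assumption about the file layout; the toy `state_dict` below is made up, and the alpha scaling is ignored):

```python
import numpy as np

def lora_block_norms(state_dict):
    """Frobenius norm of up @ down for every LoRA module, largest first."""
    suffix = ".lora_down.weight"
    norms = {}
    for key, down in state_dict.items():
        if not key.endswith(suffix):
            continue
        prefix = key[: -len(suffix)]
        up = state_dict[prefix + ".lora_up.weight"]
        # Flatten conv kernels to 2-D so the matrix product is defined.
        delta = up.reshape(up.shape[0], -1) @ down.reshape(down.shape[0], -1)
        norms[prefix] = float(np.linalg.norm(delta))
    return dict(sorted(norms.items(), key=lambda kv: kv[1], reverse=True))

# Toy example: block "a" was changed by training, block "b" was not.
state_dict = {
    "a.lora_down.weight": np.ones((2, 3)),
    "a.lora_up.weight": np.ones((4, 2)),
    "b.lora_down.weight": np.zeros((2, 3)),
    "b.lora_up.weight": np.ones((4, 2)),
}
for name, norm in lora_block_norms(state_dict).items():
    print(name, norm)
```

Plotting these values per block as a bar chart would give something close to the feature importance graph asked about.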

normal ember
#

Would be neat that you could trace the network from a prompt

normal ember
wanton rampart
#

How to teach the bot a new artstyle like madhubani and mural

stone garden
#

Anyone know how these guys are getting 10+ it/s on AMD?

#

The best I can do is 7it/s with a 7900xt

karmic flame
#

Hello, I want to ask: What will be the best way or approach to train a stable-diffusion model on my pre and post-image dataset? For example, consider a dataset with 'pre-images' featuring a man without a beard and 'post-images' depicting a man with a beard after one month. I want to develop a project where, given the pre-image without a beard as the input in an image-to-image task, and with a prompt specifying a given time, such as 'at 2 months,' the output should contain post-images of a man with a similar face but with a beard after 2 months.

So my questions are,

  1. is it even possible with SD, if yes, what should be my best approach to train the model?

  2. if SD should not be my choice, what should be my alternative approach?

Any help would be really appreciated.

stiff dust
stiff dust
#

I guess GANs are a bit better for your task than diffusion models. But it should be possible in SD

#

but you usually need a lot of training data for training a control net. How large is your pre- and post-image dataset?

tropic moon
#

Anyone have any idea why training textual inversion gets such crazy results when not on base 1.5 model?

normal ember
small eagle
#

pretty sure lorai offers a few free credits to create a quick lora (currently broken), are there any others that offer a free one before signing up?

normal ember
jade hornet
#

Buying the newest card usually means waiting for drivers to catch up

stone garden
#

They're on Linux, but I'm on WSL2, so I don't think it's the same

karmic flame
opal jacinth
#

if I'm working with e.g. 1024x1024 regularization images and my training dataset has other resolutions (= bucketing enabled), is that a problem? Would the results be better if I cropped the training data to 1024 beforehand? I've read conflicting statements

edgy wharf
#

Hi! I want to train the img2img lora model to see a particular sofa model in other rooms. I did a tutorial before using SD XL base 1.09, but the results were not good. I upscaled 14 images and deleted their backgrounds, leaving only the seats. Do you have any suggestions? Sample data;

#

"ohwx, a couch with a blue cushion, modern sectional sofa with a reclining mechanism, basic background" here is my example prompt

tall condor
#

does anyone have any experience with creating their own Danbooru tagger model?

#

i want to create my own tagger model so that it can generate my own tags

jade hornet
tall condor
#

but i have like 20k images i need to tag

latent charm
#

Usually people use wd14 for tagging anime because it uses Danbooru tags. If your dataset is anime, wd14 is good enough. Otherwise, a lot of research has been using custom LLMs as taggers to provide descriptive prompts for model training, for example GPT-4V or open-source multimodal LLMs

normal ember
#

You could try OpenFlamingo, but I have not tested it on tagging. The way it works is that you provide it with a few images and captions, and it will use those as a reference for what the captioning should look like

#

GPT-4V is great but no API yet

gentle flame
#

Something recent that's supposed to reduce noise for captioning

#

I haven't used it, but it might help

modest igloo
#

hey everyone, question: when making a LoRA of, let's say, a specific type of haircut, should I also use tags like 1girl, black shirt etc.? Or should I only describe what I want, for example m0hawk, green hair? My worry is that if I tag too much stuff, using my LoRA will also pollute the model with data I don't want, like faces, backgrounds etc. Am I overthinking this?

stiff dust
modest igloo
# stiff dust I think the idea is that if you describe the image as best as possible the model...

I see. My concern was that if I add girl to my LoRA and then use the LoRA, it will render a different girl even though I just want to render the hair.

But maybe I'm approaching the issue the wrong way and should instead use something like ControlNet and inpainting to swap the hair on the face I want.

Because I noticed that with the LoRA I made from pictures of haircuts my wife did, if I add it to my prompts, beyond a certain threshold it starts to change things like the pose or even the girl shown.

tired plank
#

Question about Lora captioning

Say I'm training a specific "species" of wyvern(aka i'm training a specific and niche type of fantasy creature)

When I BLIP-caption my training set images it's almost always "a dragon with a point head on a black background"

Is it better to remove the dragon and fix it to be "<my keyword> head on a black background...<more descriptions here>" or should I leave that and just do "<my keyword> a dragon head on a black background"? Does letting SD know it's similar to a dragon make it better?

#

I'm not sure what the model understands better

normal ember
#

I’d say depends on how you want to be able to prompt it at inference. If you train it to associate with dragon you can’t obviously also get a wyvern and a dragon in the same image.

#

If this type wyvern has a specific name I’d use that.

tired plank
#

so i should caption more like "closeup of rathalos(<-keyword) head flying through the sky among the clouds" and just drop any reference to dragon or wyvern?

normal ember
#

If it’s obvious that it is flying when close-up of the head

#

If not, I’d skip the flying.

#

You could also associate rathalos with wyvern if you want to train several types of named wyverns that share similarities.

tired plank
#

The wyverns from Monster Hunter tend to share a very similar body plan so if this Lora doesn't end up being trash I might try that

#

I was using dragon as the class for now because I figured since the SDXL model already knows what a dragon is and rathalos is dragonish, it might help with the learning. At least that's what the guides I read suggested. I'll try again without the dragon captioning

#

New to this so i'm just doing this to learn, thanks for the advice

normal ember
#

Is this the guy? Here's the prompt used: a majestic wyvern named rathalos soars through the clouds on wings that span over 10 feet. his scales shimmer with every movement, and as he lets out a piercing roar, lightning bolts strike around him. in the distance, a group of hunters can be seen attempting to bring down this formidable creature.

tired plank
#

Oh you can post images here...I'll get a sample image from my training set

#

If you prompted that image using "rathalos" in sdxl, it's not that far away from Rathalos IRL

normal ember
#

Never tried training with a solid background and how it would react, somebody else in here might know.

tired plank
#

I'm worried about some of the backgrounds in my training set. I was looking for high-res screenshots of the game but couldn't find anything over 4 MP

#

Was wondering if I should tag my backgrounds or not, but I have heard mixed things on tagging backgrounds

normal ember
#

And you can't capture them yourself from the game?

tired plank
#

I can, it was more a laziness thing since I don't have the game installed, but I could do that.

#

Maybe I could also generate a background using sd and just paste ^that guy onto it

normal ember
#

Getting him from different angles, medium shot, close-ups and full might help too.

tired plank
normal ember
#

Worth a try to caption if it's a statue or from a game too if you want flexibility that is.

#

And from there, just experiment and compare the results.

tired plank
#

Speak of the devil, it just finished. Though an error on the 9th epoch made it so I only got 8, but here's the 6th epoch

#

Not terrible. Well kinda terrible but not as bad as I was expecting
prompt: a dragon flying through the sky, flame in it's mouth,lora:RATHALOS_RATHALOS-000006:1

tired plank
#

Is the above image underfitted or overfitted? Not sure how to improve it. My first thought is it's underfitted. I also think some of those training images need to be replaced by better in-game screenshots

stiff dust
#

I would say try again with a better prompt.

tired plank
#

I tried a chibi version

#

Not bad, even after losing his arm in the war

smoky flame
#

Hey all! What's the best repository for training a controlnet on SDXL?

jade hornet
# smoky flame Hey all! What's the best repository for training a controlnet on SDXL?
GitHub

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch - huggingface/diffusers

smoky flame
#

@jade hornet Thank you I've seen that - it's not specific for SDXL - that example is for SD1.5 right?

jade hornet
#

Presumably, but there's an sdxl script in that folder

#

Just wondering if you tried it

normal ember
#

@latent charm Tried bakllava-1?

latent charm
#

I am fine-tuning SDXL with captions generated by a VLM; the captions are selected and modified by me. Let's see what the result will be. The dataset is 2400 text-image pairs using tags, generated captions and empty captions, with 10 repeats.

normal ember
# latent charm ?

It's a multimodal variant of llava, but instead of using base llama2 it uses mistral 7b, which in itself is a great model for its size.

#

It's very GPT-V like

#

It's the first model I've tried that can be controlled consistently to the result you want.

latent charm
#

Really, sounds great. I heard mistral 7b is pretty good too.

normal ember
#

I've written some code that automates the captioning

#

I'm using llama.cpp as it has a built in rest-api and is fast and lightweight, then some python code that sends the instruction and the images.

latent charm
#

After I had done the captioning, I found that calculating the CLIP score helps to find the unwanted captions; a score that is too low or too high might indicate a bad caption.

#

Also I just see this.

normal ember
#

and then a full tune of the model?

latent charm
#

They didn't release the weights

normal ember
#

I'm thinking about your captioned data you are training

rapid meteor
#

hey guys has anyone tested training models vs loras for people? Does one produce better results over the other?

latent charm
#

I rented an A100 40GB, it's running now

normal ember
#

AdamW? What LR?

latent charm
#

constant scheduler, PagedAdamW8bit, lr 2e-6, noise offset 0.1

latent charm
normal ember
#

I'm running on my 4090 now adamw8bit, some offset noise (0.0357) and 3e-6
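
Assuming "noise offset" here means the usual offset-noise trick (a random per-channel shift added to the training noise), it can be sketched like this. `noise_with_offset` is a made-up helper using NumPy; the shape and seed are arbitrary, and the offsets are the two values mentioned above.

```python
import numpy as np

def noise_with_offset(shape, offset, rng):
    """Gaussian noise plus a per-sample, per-channel constant shift."""
    b, c, h, w = shape
    noise = rng.standard_normal(shape)
    shift = rng.standard_normal((b, c, 1, 1))  # one value per channel
    return noise + offset * shift

rng = np.random.default_rng(0)
for offset in (0.0357, 0.1):
    sample = noise_with_offset((2, 4, 8, 8), offset, rng)
    # The per-channel means drift further from 0 as the offset grows.
    print(offset, sample.mean(axis=(2, 3)).round(3))
```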

#

but you are probably running a batch size larger than my 1 😄

latent charm
#

batch size 20

normal ember
#

What I don't have a clue about is how many epochs it will take

latent charm
#

using PagedAdamW8bit you could increase bs to 4 with fp16 in 24GB VRAM

#

or even 5

normal ember
#

I have some room yes, using 18G VRAM atm.

#

I've run 4 times as long as a LoRA would take me. It's moving incredibly slowly, but I guess that's normal since the LR is way lower.

latent charm
#

I am going the large bs, large repeat route. It seems to give me better results

normal ember
#

I do 10 repeat

latent charm
#

I have 3 dataset with repeat 10

normal ember
#

How many epochs do you plan to run? I save state so I can resume.

latent charm
#

running with max 20 epoch

#

I just rented 3 days of A100 to see the progress, and then I have to decide whether to continue the training or not

#

it totally uses 300 hours

normal ember
#

that's 480 000 steps if I calculated it correctly, 24000 batches
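
That estimate can be reproduced with a few lines, assuming a single dataset of 2400 images, 10 repeats, 20 epochs and batch size 20 (the figures quoted in this conversation; `training_steps` is a made-up helper):

```python
import math

def training_steps(num_images, repeats, epochs, batch_size):
    """Return (samples seen, optimizer steps) for a simple training run."""
    samples_per_epoch = num_images * repeats
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return samples_per_epoch * epochs, steps_per_epoch * epochs

print(training_steps(2400, 10, 20, 20))  # (480000, 24000)
```

So 480 000 is the number of images seen and 24 000 the number of batches. The sketch does not try to account for several datasets with their own repeat counts.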

latent charm
#

around 70000 steps

#

with 20 bs

normal ember
#

runpod?

latent charm
#

Another platform

spring arch
#

I trained my LoRA on the base SD 1.5 model. Later I merged that LoRA into another SD 1.5 model and retrained it. Now the newly created LoRA doesn't work on models other than the ones I retrained on. Any help on this?

latent charm
coral canopy
latent charm
#

It still hallucinates like other LLMs

coral canopy
latent charm
#

I am the repo owner

#

I selected another example after you pointed it out. LOL

coral canopy
#

🤣 oh, sorry. I couldn't connect lrzjason with XiaoZhi

latent charm
#

That's fine

coral canopy
#

Ok, that's more fitting to the image. I guess there is still a lot of cleaning up to do, but it would be interesting to run a training with those captions.

latent charm
#

I am already training with 2400 images from other llm generated captions. It is 19% now.

#

kosmos-2 would give captions with more subject composition

coral canopy
#

I think it's the highly descriptive captions that make Dalle-3 king at the moment. Time to reproduce that.

latent charm
#

Yeah, I am encouraged by Dall E 3 and pixart-alpha

coral canopy
#

What's the current state of pix art-alpha? Are the weights released?

latent charm
#

Still no. But they released the LLaVA-captioning inference code.

#

Just checked

#

Interesting. You could use pixart-alpha scripts to caption your dataset.

#

It uses LLaVA-Lightning-MPT-7B-preview

coral canopy
#

That's indeed interesting.

mossy condor
stiff dust
plain cloak
#

hello everyone im trying to train my model for guns but what tag do i use

#

because when i use a custom tag i get nothing

hardy storm
#

Anyone here tried training a model on a house/property?

I've been kicking around this idea of taking photographs of my childhood home (exterior) and the surrounding 1 acre of grounds (trees, pond, barn, etc), and making a LoRA out of it. I'm just not sure how best to do it... specifically:

a) If I should only train it on, for example, just the house, or if I could include photos that focus on other features of the property, so long as they're all connected (trees, pond, barn, etc)?
b) How to caption the images? For example, should I approach it like I'm training an object, as if the 'object' is the entire property?

blazing turret
#

hi i just starting with automatic 1111 and in my canvas zoom im missing this pen/painbrush option. What should i do to get is (was trying to update agony )

digital dune
#

take as many pictures and describe everything, and preferably use a base model of one that is already trained on architecture, which there are probably already a lot of

#

if you train on kohya it will ask for a keyword and class. something like 01_childhoodhome house would do it. house is the class and childhoodhome is the keyword
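
As far as I understand the kohya folder convention, the name reads `<repeats>_<keyword> <class>`, so the leading number is the repeat count. A small sketch of how such a name splits up (`parse_kohya_folder` is a made-up helper, not part of kohya):

```python
import re

def parse_kohya_folder(name):
    """Split '<repeats>_<keyword> <class>' into its three parts."""
    m = re.fullmatch(r"(\d+)_(\S+)(?:\s+(.+))?", name)
    if m is None:
        raise ValueError(f"not a '<repeats>_<keyword> <class>' name: {name!r}")
    repeats, keyword, class_name = m.groups()
    return int(repeats), keyword, class_name

print(parse_kohya_folder("01_childhoodhome house"))  # (1, 'childhoodhome', 'house')
```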

hardy storm
pure plume
#

Hi, i've got a style lora and i think it's time for a 2nd ver.
what is the method here?
re-train the lora or use the old one and just bring the old data in?

plain cloak
#

is it possible to finetune SDXL models

oblique adder
#

when i interrupt my render i get the style i want, it usually happens when i interrupt at 80%, what setting do i have to change for it to be consistent and automatically save to my folder?

ocean fractal
oblique adder
#

right now i just interrupt and save manually but i'd love to just pump images out without checking what % its at

grand depot
oblique adder
tall condor
#

hey guys, which tagger are you using atm?

#

I am particularly looking for one for faces, expressions, poses and so on

#

one that generates tokens

normal ember
tall condor
#

can i deploy that locally?

normal ember
#

No

#

It's probably something like 150-200 lines of code, OpenAIs API and some money.

tall condor
#

can anyone explain to me why there is a difference between merging model A and B and Merging Model B and A?

spring arch
#

Can't we train expressions for a LoRA, especially eyes and mouth? It's so horrible

coral quail
#

Would it be possible to train a version of SD that is good at making emojis in a specific style, like android or ios?

#

Or could this be achieved by simply using input images or prompts without a separate model?

coral quail
#

thank you for the link! I'll try to get that running locally on my dockerized SD instance. Generally with image file prompts, is there a way to essentially say "this but recolor/redraw in the same style and position", like changing the color of a flag emoji or swapping what is in the hand of a human emoji? Or maybe this should just be handled by the prompt/negative prompt? @normal ember

normal ember
hollow spruce
#

just saying hi, I'm not ded XD
here's some updates with the cool new toy, juggernaut + my lora based on my master dataset

#

jugger / jugger + lora

#

prompts & settings are taken from civitai (unmodified), the ones used to represent the model. so they're cherry picked to favor juggernaut

#

nsfw + general anatomy works better as well, but I can't show the progress on that here

#

I like how one of the cover images for the model literally uses "150 mm" (headshots), "full body", "sitting" in its tags :"D
my lora really didn't like that original composition of just a headshot when these tags are added. But its only loaded at 0.7 strength, so oh well
||complex 3d render ultra detailed of a beautiful porcelain profile woman android face, cyborg, robotic parts, 150 mm, beautiful studio soft light, rim light, vibrant details, luxurious cyberpunk, lace, hyperrealistic, anatomical, facial muscles, cable electric wires, microchip, elegant, beautiful background, octane render, H. R. Giger style, 8k, best quality, masterpiece, illustration, an extremely delicate and beautiful, extremely detailed ,CG ,unity ,wallpaper, (realistic, photo-realistic:1.37),Amazing, finely detail, masterpiece,best quality,official art, extremely detailed CG unity 8k wallpaper, absurdres, incredibly absurdres, robot, silver halmet, full body, sitting||

#

not sure what to think of this one :/ it doesn't show off juggers abilities. why would they list this as one of the main cover images? not even the people tags are followed in any way
will redo it with one of the people I trained in my datasets
Portrait Photo a portrait, hyperdetailed photography, by Elizabeth Polunin, red haired young woman, Gianna Michaels, brooklyn, looking straight to camera, sweaty, olya bossak, nepal, very accurate photo, suspiria

#

redone with shirogane instead of the people they mentioned
Portrait Photo a portrait, hyperdetailed photography, by Elizabeth Polunin, red haired young woman, shirogane-sama, brooklyn, looking straight to camera, very accurate photo, suspiria

covert pagoda
#

Anybody have a good auto cropping tool that sees subject to crop tighter in, while also using a given aspect ratio? With computer vision? Cropping my dataset and just want to speed it up!!

latent charm
covert pagoda
#

All very general

latent charm
#

Might be this one

covert pagoda
#

Cool I’ll give it a try. At least it’s not just a white paper 🫠

#

Thx

covert pagoda
#

(venv) D:\TF-A2RL>python A2RL.py --help
Traceback (most recent call last):
  File "D:\TF-A2RL\A2RL.py", line 14, in <module>
    with open('vfn_rl.pkl', 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'vfn_rl.pkl'

#

after installing all the modules and running python A2RL.py --help

latent charm
covert pagoda
#

will try to crop one with a crop command now

latent charm
covert pagoda
#

yea, on to the next one. hard to find one that works

latent charm
#

I also want to find a good tool to crop to various ratios.

covert pagoda
#

need something that is ready to use

#

testing this one then giving up

#

nothing works at the moment

#

with this code:
`import u2net
import cv2

# Load the pre-trained U2-NET model
model = u2net.U2NetModel()

# Load the image you want to crop
image = cv2.imread("image.jpg")

# Segment the image and remove the background
segmented_image = model.predict(image)

# Crop the image to the desired size and aspect ratio
cropped_image = segmented_image[y1:y2, x1:x2]

# Save the cropped image
cv2.imwrite("cropped_image.jpg", cropped_image)
`
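
One piece the snippet above never defines is where `x1, y1, x2, y2` come from. A possible sketch, assuming a foreground mask is already available from some segmentation model (U2-Net or anything else): take the mask's bounding box and grow it to the target aspect ratio. `crop_box` and its `margin` parameter are made up for illustration; this is plain NumPy, not part of any of the tools mentioned.

```python
import numpy as np

def crop_box(mask, aspect, margin=0.05):
    """Box with width/height == aspect around the mask's foreground.

    mask: 2-D bool array (True = subject). Returns (left, top, right, bottom),
    clamped to the image, so the ratio can be off near the borders.
    """
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask has no foreground pixels")
    h_img, w_img = mask.shape
    x1, x2 = xs.min(), xs.max() + 1
    y1, y2 = ys.min(), ys.max() + 1
    w = (x2 - x1) * (1 + 2 * margin)
    h = (y2 - y1) * (1 + 2 * margin)
    # Grow the shorter side until width / height matches the target.
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    left = int(round(max(0, cx - w / 2)))
    top = int(round(max(0, cy - h / 2)))
    right = int(round(min(w_img, cx + w / 2)))
    bottom = int(round(min(h_img, cy + h / 2)))
    return left, top, right, bottom

# Toy mask: a 20 px tall, 10 px wide subject in a 100x100 image.
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 45:55] = True
print(crop_box(mask, aspect=1.0, margin=0.0))  # (40, 40, 60, 60)
```

The crop itself is then `image[top:bottom, left:right]` on the original image rather than on the segmented one.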

#

U2-net seems a bit more advanced

#

@latent charm can you give this one a try?

#

I ll try the first one

latent charm
#

I have used u2-net to remove background before

#

But I didn't use it for cropping

covert pagoda
#

smart cropper is totally off topic lol

covert pagoda
#

ok, this seems to work quite well
https://www.youtube.com/watch?v=Fbuyu35TkE4

One of the most important aspect of Stable Diffusion training is the preparation of training images. In this tutorial video I will show you how to fully automatically preprocess training images with perfect zoom, crop and resize. These scripts will hugely improve your training success and accuracy.

Scripts Download Link ⤵️
https://www.patreon...

digital dune
#

Anyone has had better results in training by lowering/freezing the text encoder learning rate?

hollow spruce
#

inverse is also true. my best tagged dataset, where every tag occurs around 50~1000 times across 5k images, the TE training alone did more good than the actual training did.

digital dune
#

Will this setting affect anything if I'm training dreambooth with no captions?

#

The issue I have is that after it finally starts getting my subjects right, the images end up a little "overcooked", probably from being influenced by my reg images some of which were made with loras for good detail. At the same time the gens stop "listening to my prompts" and just end up rendering my subject however it wants

#

I've already been told to lower my prior loss weight which seems to have great results so far but I'm wondering if the UNET/text encoder training also has anything to do with it

tall condor
#

do you guys have any tips if my general dataset is learned well but the faces are bad?

#

I tag my dataset by hand + DeepDanbooru tags

#

i dont see too much issue with the tags but yet for some reason faces get messed up quite a lot

#

however the rest is learned fine

#

im on dreambooth btw not lora

#

@digital dune what I recommend is that you run DeepDanbooru tagging on your images and use at least 0.03-0.05 of noise. Also, for me, using random crop instead of center crop in combination with flip and color augmentation worked well

#

I recommend much lower learning rates with those settings, and more epochs

#

with dreambooth i mostly use 5e-7 or so as lr

#

also make sure to enable shuffle captions
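
Roughly what caption shuffling does, as a minimal sketch: the comma-separated tags are reordered on every pass so the model does not latch onto tag position. kohya exposes this as `--shuffle_caption` together with `--keep_tokens`; the function below is a hypothetical stand-in written for illustration, not kohya's code.

```python
import random


def shuffle_caption(caption, keep_tokens=1, rng=random):
    """Shuffle comma-separated tags, leaving the first keep_tokens in place."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head = tags[:keep_tokens]
    tail = tags[keep_tokens:]  # slicing already gives a new list
    rng.shuffle(tail)
    return ", ".join(head + tail)


caption = "xyz dog, big ears, dark fur, sitting, outdoors"
rng = random.Random(0)  # fixed seed only so the example is repeatable
for _ in range(3):
    # The first tag stays put, the rest change order between calls.
    print(shuffle_caption(caption, keep_tokens=1, rng=rng))
```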

#

i recommend training with kohya ss

#

and add more weight to the prompts that are hard to generate (in my experience not more than 5-10 times the base weight)

stiff dust
# digital dune The issue I have is that after it finally starts getting my subjects right, the ...

I've had very contradictory results with text encoder training. Sometimes it overfits MUCH more than the unet, sometimes it gives the model more flexibility :/ So far I can say:

  • TE training is often prone to overtraining on the image composition. As you wrote, sometimes it starts ignoring your prompt and puts everything into the same composition as the training image
  • TE training sometimes totally overfits on style (makes everything anime or everything photographic), but not always. I have not found a pattern here :/
  • all overfitting effects seem to get less severe when I train with low batch size (e.g. batch size 1). I have no explanation for it, but that observation was made by other people, too
  • low learning rate is not always good. My feeling is that low learning rate + many epochs is much worse than high learning rate + few epochs
normal ember
#

I've also noticed that leaving "photo of" out of the caption and captioning only the subject itself makes it more flexible to use in different styles. If you do it the other way around, more often than not you get a photo even if you want anime or something else.

#

Takes forever to train with BS1; if it weren't for that I would probably almost always use it. BS of 4 seems like a good compromise.

#

@stiff dust What ratio between the learning rates for unet and te have you found works best? Same rate or different, so to say.

stiff dust
#

I always train them one after the other. So I first train the text encoder for as short a time as possible. I stop training as soon as the subject in the image looks roughly as I want it, long before I see overfitting effects. Then I restart from that using unet-only training

vapid mango
#

Hello guys, I'm going to do a full finetune of SDXL to learn an aesthetic. I'm wondering whether I need to finetune the text encoders of SDXL as well, or just the UNet? Some people told me they always disable text encoder training when doing a full finetune of SDXL. Many thanks

normal ember
stiff dust
#

you don't need save-state for that. Just start from the generated safetensors

normal ember
#

(running first test now)

#

and have you tested --debiased_estimation_loss?

stiff dust
#

what's that option?

hollow spruce
# stiff dust but what is good captioning for subject training? I'm not a big fan of the dream...

subject (person or new outfit with new clothing/accessories) training is painful no matter how you do it thanks to sdxl :/
"good captioning + practices" to basically get the best result that sdxl + current training tools can offer is a lot of effort.
In return you get a lora where 5/5 images represent your subject between "good enough" and "utterly perfect"
^ my take on shirogane -> 400ish images + manual tagging + regularization images

Personally I'd rather do 20% of that work, to achieve a 'good enough' result, where I invest the same time into gathering the dataset, but keep my captioning down to keyword + automated
In return I get a lora where 3/5 images are good enough. 1/10 is "utterly perfect"
^my take on 2B cosplay -> 400ish images + keyword + automated tagging / no regularization images

(all of this without relying on absurd dimension settings, to hide any dataset issues)
also in regards to composition, all my (manually tagged) loras follow wording/composition better than default sdxl, so my manual tagging is definitely working

#

<subject name>, girl, race, type of photo | optional: unique things about this image - like is it a cosplay? is in front of a window?
important about the optional keywords, as well as the mandatory ones, is to have a regularization set, that is ALSO all tagged in the same style, with the same keywords appearing as well

in doing so, <subject name> actually gets trained on the unique details that make this person, this person.
by having different races in my reg set, it learns what features are racial, which are unique to this person. (also retrains some racial details, which are wrong in default sdxl)
type of photo is important, since it absorbs the background/colors/general composition of the image, rather than letting it drift into the subject name. <- but this only works cause my reg set is ALL tagged in the exact same way by hand
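
A minimal sketch of the caption template above, so training and regularization captions come out in the same shape. `build_caption` and all example values are hypothetical, made up for illustration; the only thing taken from the description is the field order (subject name, girl, race, type of photo, then optional extras) and the idea that reg images use the same keywords minus the subject name.

```python
def build_caption(subject, gender, race, photo_type, extras=()):
    """Join the mandatory fields and optional extras into one tag caption.

    Pass subject=None for regularization images so they share every
    keyword with the training captions except the subject name.
    """
    parts = [subject, gender, race, photo_type, *extras]
    return ", ".join(p.strip() for p in parts if p and p.strip())


# Training image: subject name first, then the shared keywords.
print(build_caption("jane_doe", "girl", "european", "portrait photo",
                    ["in front of a window"]))
# jane_doe, girl, european, portrait photo, in front of a window

# Regularization image: same style, same keywords, no subject name.
print(build_caption(None, "girl", "european", "portrait photo",
                    ["in front of a window"]))
# girl, european, portrait photo, in front of a window
```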

hollow spruce
#

for anyone else reading this: there are a lot of ways of doing this, and they are all correct
this is just the best option for me, based on my datasets and interests, and long-term goals with training sdxl

digital dune
#

I cant remember where I saw the documentation but it is definitely true

#

People often recommend never doing more than batch size 2, the REAL question is... how much worse does it go from BS2 to BS3? Or from BS1 to BS2? I wish there was a wiki for all of this but I guess since this is all relatively new technology we are all stuck on the trial-and-error phase.

#

Btw. Tysm for that info. It is SO hard to find help on these topics sometimes.

stiff dust
latent charm
#

Does it mean low bs would be great to learn the fine details?

#

I usually use high bs which is great for the shape but not much for the fine details.

digital dune
#

It was a long time ago since I tried it, but I remember having good results going BS8 and gradient acc steps to 8 as well. It is technically slower than GAS1, but it did work way back when I tested it. Might be worth a shot
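
For reference, batch size and gradient accumulation multiply: the optimizer only steps after `grad_accum_steps` batches, so BS8 with GAS8 behaves like a batch of 64. A minimal sketch of the arithmetic (the function names and the 400-image dataset are assumptions for illustration):

```python
import math


def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """Number of samples that contribute to one optimizer step."""
    return batch_size * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(num_images, batch_size, grad_accum_steps, num_gpus=1):
    """Weight updates per epoch; the last partial group is rounded up."""
    per_step = effective_batch_size(batch_size, grad_accum_steps, num_gpus)
    return math.ceil(num_images / per_step)


print(effective_batch_size(8, 8))            # 64
print(optimizer_steps_per_epoch(400, 8, 8))  # 7 updates per epoch
print(optimizer_steps_per_epoch(400, 8, 1))  # 50 updates per epoch
```

So with the same number of epochs, GAS8 gives far fewer weight updates than GAS1, which is one reason people often adjust learning rate or epochs alongside it.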

stiff dust
tall condor
#

@stiff dust when i train with dreambooth i never add my own prefix

#

i just tag the images naturally and create my own model. it works really well. i know the demos tell you to use a unique prefix but IMO its bullshit. it really makes no sense at all as long as you really finetune a model

#

if you do it properly you are not breaking your model by using dog instead of xyz dog

#

it will just make "dog" look the way you trained it, i.e. like your dog

#

and regarding batch size. all our batch sizes are low xD even 5-6 is low. it makes almost no difference compared to 1 or 2. i recommend lower LR with higher batch sizes and more epochs

stiff dust
tall condor
#

yes i know but that kind of kills the concept of training a model IMO

hollow spruce
stiff dust
tall condor
#

i recommend to use cosine with 10% warmup

stiff dust
hollow spruce
#

I'm still doing constant with batch 8 or 10, and have no issues with overfitting. (do note that my smallest datasets are 300 images, and all the images are genuinely different from one another)

tall condor
#

you can use constant with warmup

#

but you should use a small warmup phase

hollow spruce
#

yeah, I do 5% warmup
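
A minimal sketch of the two schedules being discussed: linear warmup followed by either cosine decay or a constant rate. This is a plain-Python illustration of the usual shape, not the scheduler code of any particular trainer; `lr_at_step`, the 1000-step run and the 4e-4 base rate are assumptions for the example.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1, schedule="cosine"):
    """Learning rate at a given step for warmup + cosine or warmup + constant."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    if schedule == "constant":
        return base_lr
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 50, 100, 550, 1000):
    cosine = lr_at_step(step, total, 4e-4, warmup_ratio=0.10, schedule="cosine")
    const = lr_at_step(step, total, 4e-4, warmup_ratio=0.05, schedule="constant")
    print(f"step {step:4d}  cosine+10% warmup: {cosine:.2e}  constant+5% warmup: {const:.2e}")
```

Either way the first few percent of steps run at a reduced rate, which is the part aimed at early overfitting.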

stiff dust
#

it might depend on what you are training. It's just: training subjects works better on super low batch size. I experienced that now for many different subjects and training setups.

tall condor
#

thats fine. just to tackle the early overfitting

#

subjects, faces, expressions. basically almost everything except styles

#

i mostly use learning rates between 1e-8 and 1e-6 and 200-300 epochs

#

with random crop, flip augmentation and at least 0.03 of noise

hollow spruce
tall condor
#

also shuffle caption is very important

#

for loras make sure to train the UNET like 10 times more than the text encoder

#

and for my cases lower learning rates with more epochs really do the trick

#

for loras that are for faces i sometimes even go down to 1e-8 for the text encoder

hollow spruce
tall condor
#

with 200 epochs
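
One way to read "train the UNET like 10 times more than the text encoder" is as a learning-rate ratio; that reading is an assumption, not something stated above. kohya's LoRA trainer has separate `--unet_lr` and `--text_encoder_lr` options, and a minimal sketch of picking the pair could look like this (`lora_learning_rates` and the 1e-7 value are hypothetical):

```python
def lora_learning_rates(unet_lr, te_ratio=0.1):
    """Derive the text encoder LR as a fraction of the UNet LR.

    te_ratio=0.1 means the UNet learns 10x faster than the text encoder.
    """
    return {"unet_lr": unet_lr, "text_encoder_lr": unet_lr * te_ratio}


rates = lora_learning_rates(1e-7)  # assumed UNet LR, gives a TE LR around 1e-8
for name, value in rates.items():
    print(f"--{name} {value:.1e}")
```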

#

yes you are absolutely right

stiff dust
#

all that depends on a lot of factors.
I have the feeling there is no single right way of training. It seems to be different for every dataset and every problem you deal with.

tall condor
#

in my experience it all stands and falls with the tagging

#

the better the tagging the better the result

#

atm my issue is that there are really not that many taggers that can produce tokens

#

i am using DeepDanbooru but that is very much anime-based

hollow spruce
#

deepdanbooru also does double concept words. basically, if you use two or more words to describe the same thing, you eventually mess up both of them :/

stiff dust
#

I always use natural prompts, but I should give tagging a chance

tall condor
#

tagging is really helpful

#

solved a lot of issues in my models

#

but its really hard work

stiff dust
#

what convinces me is that tagging makes it easier to check whether images are consistently prompted and whether the same tags appear in the regularization images
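
That kind of check is easy to script once captions are comma-separated tags. A minimal sketch, assuming captions are already loaded as strings; `tag_counts`, `tags_missing_from_reg` and the example captions are hypothetical:

```python
from collections import Counter


def tag_counts(captions):
    """Count in how many captions each tag appears."""
    counts = Counter()
    for caption in captions:
        # A set, so a tag repeated inside one caption is counted once.
        counts.update({t.strip() for t in caption.split(",") if t.strip()})
    return counts


def tags_missing_from_reg(train_captions, reg_captions, ignore=()):
    """Tags used in training captions that never show up in the reg set."""
    train = tag_counts(train_captions)
    reg = tag_counts(reg_captions)
    return sorted(t for t in train if t not in reg and t not in ignore)


train = ["jane_doe, girl, portrait photo", "jane_doe, girl, cosplay, window"]
reg = ["girl, portrait photo", "girl, window"]

print(tag_counts(train).most_common(3))
# The subject name is expected to be absent from the reg set, so ignore it.
print(tags_missing_from_reg(train, reg, ignore={"jane_doe"}))  # ['cosplay']
```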

tall condor
#

i tagged most of my images by hand

hollow spruce
#

automated always gives meh to good-enough results. manual tagging gets me there all the way, but oh god it takes long. and I dont even want anyone to learn the horror that is hydrus network, just for the sake of fast manual tagging x_x

tall condor
#

with 5-20 tags

#

and then use DeepDanbooru to enhance that

#

i even made a program for that

#

that can weight concepts and respect the tags in them

stiff dust
#

but manual tagging is still much faster than manual prompting

tall condor
#

depends on the amount of images

#

my biggest models have like 50k images

#

really no fun to tag that by hand

stiff dust
#

with so many images you can automate that I guess

tall condor
#

not really, as none of the taggers will tag what you are looking for

#

like if you want a dog with big ears

#

no tagger will tag that for you

#

or small tail

#

and so on

#

they will tag dog,dark fur for you tho

hollow spruce
#

writing your own python script to tag based on your custom word flavor chains is your best option

stiff dust
#

you might be able to train one on the clip embeddings

tall condor
#

i tried to train the DeepDanbooru tagger to learn my tags

#

that failed hard xD

#

also it only runs on cpu for some reason here

#

took 8 days just to not work xD

hollow spruce
#

but yeah. training your own classifier wasnt very optimized, last I checked

tall condor
#

what rank are you guys using for lora?

hollow spruce
#

so unless you rent an A100 cluster, you can forget about it XD

stiff dust
#

16-32

#

for text encoder rather 4

tall condor
#

im on 128 atm but it generates a very inflicting model in some cases

#

A100 cluster for sure bro xDDD

hollow spruce
#

8~32 for normal loras
64~128 for my master dataset (currently at 6k images manually tagged -> goal is around 30k)
256 to make a point of why not to use 256 XD

tall condor
#

im happy i can afford 4x 4090

#

i havent even started with sdxl yet

#

because that would probably train for half a year on that

#

my largest model takes 3 weeks on 4x 4090

#

with 1.5

hollow spruce
tall condor
#

i once converted a 128 model to a rank 4 model

#

almost no difference

hollow spruce
#

not that the lora isnt working, just that standard sdxl loses its composition & detail afterwards, unless you train to reinforce it again inside your lora

tall condor
#

im still a bit scared of sdxl xD

#

still struggling with 1.5

#

but the sdxl results look amazing

hollow spruce
#

worst part is that the tutorials have a lot of critical false info included 😭

tall condor
#

but thats the same for 1.5

#

so much trial and error

hollow spruce
#

basically, rank 32 is the highest you can go without damaging sdxl. anything higher than that, and you need to account for the downsides in your lora, to retrain those

tall condor
#

have you trained dreambooth on sdxl?

hollow spruce
#

(not that that matters if you dont care about sdxl general capabilities)
artwork / anime / remaking your source images in slight variations XD