#🔧|finetune

tall condor
#

faces are really tricky, for me front shots work very well but as soon as it's a side view and stuff it gets really complicated

#

you should add a prompt that requests things like "laying in the grass" or side view to evaluate the faces better

#

or "spinning around, side view"

wise seal
#

Hi, I need some help with lora.
I noticed that there are some loras trained on 2.x models that are compatible with the standard auto1111 syntax <lora:filename:multiplier>.
How do you train them? Is there any way to create them with kohya_ss?
The reason is that I use openoutpaint a lot and it is not compatible with kohya's additional networks extension, the loras can only be activated by the standard syntax

tall condor
#

did you install the extension "LyCORIS"

#

which lora type did you create in kohya_ss?

#

@wise seal i think if you create a "Standard" Lora model you should be able to just load it

#

if you create a LyCoris Model you need that extension in auto 1111

#

@surreal lagoon are you still running with noise offset 0.02?

surreal lagoon
#

no

#

i don't think i'll bother using that anymore

tall condor
#

ok

surreal lagoon
#

man, i love this style

steady heath
#

Already in ED? 🙏

hot breach
#

yes

#

there are some coordination issues with inference apps though

steady heath
#

Does it work with 2.x models?

hot breach
#

and its not playing nice with SD1.x for now because it doesn't use v_prediction

steady heath
#

I see

hot breach
#

I'd say it basically only works on SD2.x 768 models that already use v_prediction?

steady heath
#

I see

hot breach
#

it might be able to finetune SD1.5 with v_prediction long enough that it accepts the change though

#

someone said SD2.1-768 was based on 512, which would've been epsilon, then switched

steady heath
#

There was something about v-pred models having triple the loss or something? Which kohya said he fixed, afaik

hot breach
#

their loss is higher during training but its not comparable

#

its not something i'd worry so much about
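
One way to see why the two loss values are not directly comparable: the network regresses to a different target in each case. Below is a minimal sketch for a single scalar value, assuming the usual variance-preserving noising `x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps`. The function name and the scalar setup are illustrative only and are not taken from any trainer mentioned here.

```python
import math


def training_targets(x0: float, eps: float, alpha_bar: float) -> dict:
    """Return the noisy sample and both regression targets for one scalar value.

    x0        : clean (latent) value
    eps       : Gaussian noise sample
    alpha_bar : cumulative signal fraction at the chosen timestep, in [0, 1]
    """
    a = math.sqrt(alpha_bar)        # signal scale
    s = math.sqrt(1.0 - alpha_bar)  # noise scale
    x_t = a * x0 + s * eps          # noisy input the network sees
    return {
        "x_t": x_t,
        "epsilon_target": eps,         # epsilon-prediction regresses to the noise
        "v_target": a * eps - s * x0,  # v-prediction regresses to a mix of noise and signal
    }
```

Because the v target blends noise and signal with timestep-dependent weights, an MSE against it lives on a different scale than an MSE against the noise alone.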

steady heath
#

I wasn't quite sure sd1.5 LRs would work well with 2.x models since id imagined they would require a higher LR

hot breach
#

SD2.1 768 trains well with customized optimizer on the text encoder

steady heath
#

lemme yoink me config for the optimizer and i wanna ask if there's anything you'd change

#

oh wait a sec theres 2

#

SD21 and regular optimizer

#

does ED auto select if it notices a 2.1 V model?

hot breach
#

TE and unet get separate optimizer instances now in ED2 so you can use different LR, different LR schedule, completely different type of optimizer, etc

#

TE has some layer freezing setup, it seems a value like -2 or -6, which just unfreezes the last 2 or 6 layers, works well for SD2.1

#

TE is very sensitive in SD2.x, needs a light touch so to speak

steady heath
#

I see

#

I wasn't quite sure what the freezing did so i just put false iirc

hot breach
#

we're still trying to figure out ideal settings, but freeze embeddings true, layers -6, and freeze final layer norm false seems to work well
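
To make the negative layer values concrete, here is a small sketch of how such a setting could map to trainable layers. The semantics are an assumption based only on the description above (-2 or -6 leaves the last 2 or 6 layers unfrozen); the positive-value behaviour, the function name, and the layer count of 23 in the usage note are hypothetical and not taken from the ED2 source.

```python
from typing import List


def trainable_layers(num_layers: int, freeze_front_n_layers: int) -> List[int]:
    """Indices of text-encoder layers that stay trainable.

    Assumed semantics (not confirmed against any trainer):
      n >= 0 freezes the first n layers,
      n <  0 freezes everything except the last |n| layers.
    """
    if freeze_front_n_layers < 0:
        first_trainable = max(num_layers + freeze_front_n_layers, 0)
    else:
        first_trainable = min(freeze_front_n_layers, num_layers)
    return list(range(first_trainable, num_layers))
```

With a hypothetical 23-layer text encoder, `trainable_layers(23, -2)` leaves layers 21 and 22 trainable, and `trainable_layers(23, -6)` leaves layers 17 through 22.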

steady heath
#

ill keep that in mind

#

also thanks for implementing the snr freq 🤝

#

I shall try training again asap

hot breach
#

```
        "freeze_embeddings": true,
        "freeze_front_n_layers": -2,
        "freeze_final_layer_norm": false
    }
```
-2 seems to do well for small scale stuff like training one or a few characters?

steady heath
#

I see

chrome breach
#

Anyone tried freezing text_encoder??

#

I have no idea how it affects the model quality... But would love to know some results about it

rugged cobalt
#

Do you have any ideas on how to best train the appearance of a specific piece of furniture? I've tried different LORAs, but the furniture isn't being represented correctly.

chrome breach
#

Haven't tried LORAs myself... So cant suggest u on that

#

U might wanna try dreambooth training if you only want model to learn one kind piece of furniture

#

That might work for u

surreal lagoon
#

basically the data you use is almost more important than which layers you freeze.

midnight tusk
#

Hello, how to train images without using lora with 8gb VRAM? thanks

#

Can anyone help? I'm about to generate images like this but the results are not so good.

#

when I'm generating images, it doesn't really look good. I don't need extra objects or lines that are not recognizable.

#

I want it look like a human-made. Without flaws or blemishes, smudges, and so on...

surreal lagoon
#

make sure you use DeepSpeed and CPU offload, and then you'll be running that training job for like 3 weeks.

midnight tusk
#

3 weeks?

surreal lagoon
#

the less video ram you have, the more the system has to spend time moving data around between CPU and GPU

#

it's a substantial loss in performance

midnight tusk
#

so, there's no other way or option?

#

How about DeepSpeed and CPU offload, what are those?

surreal lagoon
#

i'm on an 80GB GPU and i'm training for 3 days just to have it run in a way that i don't actually destroy the model while training it. when it learns too quickly with too little context, it over-corrects and destroys the coherence of the text encoder.

#

if i had an 8GB GPU, to train as well as i am on an 80GB GPU, it would take probably 3 weeks to get where i've gotten in 3 days.

#

when you successfully use less memory while training without extending the training runtime, you're giving the model less context on each iteration. and this works okay for short training sessions on a diverse, and well-captioned set of data. but if you want the model to learn everything from your training data, you have to give it more steps. and that results in more incoherence, due to the reduced context

#

it doesn't actually cost that much to rent time on an 80GB GPU. like $3 an hour, and you can get quite far in 24 hours. less than $100 to fine-tune a model. depends on how much data you have. the more it needs to learn, the more expensive it becomes.

midnight tusk
#

Thanks, do you use deepspeed?

surreal lagoon
#

no

#

i use large GPUs so i can avoid stuff like that, as it influences training in ways i don't understand and don't wish to spend time figuring out

midnight tusk
#

So, in conclusion, I won't be able to train without using Lora?

surreal lagoon
#

you can do textual inversion, lora, possibly others. but yeah, training the full model incl text encoder is going to be difficult-to-impossible on a small GPU.

#

this is pretty important to accept, so that you don't waste time trying to make it work

#

i would say 24GB is probably the point to start with for fine-tuning

midnight tusk
#

So I won't be able to get a perfect result with these?

surreal lagoon
#

the point of dreambooth is to cleanly integrate a subject into a model, and if you can't tune the text encoder properly with a large enough context, it loses the ability to do most things other than the subject you trained into it

#

i can show you the results of some experiments! they're not pretty.

midnight tusk
#

May I see, please? Thank you

surreal lagoon
#

this is supposed to be a tuxedo cat

#

we call this "catastrophic forgetting" and i'm not trying to make a cat pun, sorry

#

this is supposed to be a politician whose name i don't wish to say, but it doesn't matter anyway, because it's just noisy garbage

midnight tusk
#

Yah

surreal lagoon
#

pretty neat results but very useless for text guided diffusion.

midnight tusk
#

In my case, I want to generate a character just like a human-made drawing

surreal lagoon
#

same model, but prompted with a class it still understands

#

you end up needing to provide like thousands and thousands of super high quality images to the training on a low VRAM system. and it takes longer because of that, while still potentially destroying the text encoder

#

you just need to make a LoRA. why do you not want to make one? they can be used on top of any compatible SD model.

#

the benefits far outweigh the downsides. @north valley please chime in on this

north valley
#

I have been peenged

#

never been in this channel before

surreal lagoon
#

it's what you were just talkin about

north valley
surreal lagoon
#

LoRA > Dreambooth

#

and go

north valley
#

Oh yes, LoRA's are massively beneficial

surreal lagoon
#

apparently you can merge the LoRA into a base model's unet but you said before when i'd asked you what that would do, "it would be a waste of time"?

north valley
#

they can be trained in literal single digit minutes on any half decent GPU, they can provide dreambooth quality results with far less work, can be weighted individually and in tandem, across any model you want, assuming its adjacent enough

north valley
#

I have been looking for ways to finetune by merging LoRA's into models for a good moment now

surreal lagoon
#

we were working on your commissioned network

#

so, forever-ago in AI time

midnight tusk
#

Lora is for low vram right?

north valley
#

hmm... Maybe I misunderstood what you asked, I have been wanting to combine LoRA's into a model

surreal lagoon
#

LoRA isn't just for low VRAM, no

north valley
surreal lagoon
#

it's a low overhead network that alters the weights of the base model's unet / text encoder, mainly in the attention layers

#

because they take like 5-15 minutes to train you can quickly see what you're doing wrong and adjust hyperparameters and see results in the same hour

midnight tusk
#

I'm sorry. I don't really understand these things in Stable Diffusion.

north valley
#

basically, its a little tweak you put ontop of the model, which is much smaller, faster, and works with any model thats at least decently adjacent

surreal lagoon
#

it's fine, a lot of what i say won't make sense now but it might later

#

when Sytan says "adjacent" he means the training state of the model should be similar enough that the LoRA still behaves as expected when layered on top.

north valley
#

You can achieve DB quality results with a decent LoRA, in much less time, with a much smaller data set, and it has the benefit of being usable on different models

midnight tusk
#

But, can lora give good images? or maybe the problem is the model I've been training

surreal lagoon
#

LoRA is most likely higher quality than a Dreambooth with the same dataset.

north valley
#

so if you train a real person into a realism model, putting that ontop of an anime model should be able to translate them into that style really good

north valley
#

let me get some of my examples really fast

surreal lagoon
#

LoRAs feel like cheat mode.

north valley
#

this was one of my first LoRA's

#

this is what this character actually looks like in the show he is from

#

and now granted, this LoRA is not that good compared to what I can do now, but you will get the point

#

I made that in about 20 minutes with only 7 images of him

midnight tusk
#

I wanted to generate more images of a robot with this kind of design. I want it to look like a human-made drawing

north valley
#

fast forward to LoRA's that I did a much better job with, and you get results like these

surreal lagoon
#

for comparison, i've been trying to fine-tune a base model for a month and a half with more than 30,000 high quality images and it's cost me about $3200. i'd be done already if i cared to use a LoRA. but i'm in it for the challenge, not the results.

north valley
#

As you can see, LoRA's can get fantastic results, and even these are honestly a little lackluster compared to what I can do now, with even less, in less time 😅

#

LoRA's are arguably better for specific things, like a specific style, or a specific character/concept

#

for example, I trained a LoRA on Funko Pops, with just 10 images, and about 7 minutes of training

surreal lagoon
#

rely on other people for fine-tuning a base model and just make LoRAs and you'll be way happier.

#

@north valley have you seen LoHa yet?

midnight tusk
#

Yah, thanks to you all. maybe I'll have to do more research first. Thank you very much. Appreciated your insights and advices.

north valley
#

Here, you can see some of the results I got from an extremely small data set, with extremely fast training

#

only took 10 images, and about 7 minutes training to create a LoRA that can do this style across countless models reliably

midnight tusk
#

Can you give me datasets for the images I sent?

surreal lagoon
#

Sytan, GPT4 explained to me how LoRA benefits from regularization images by "helping it learn representations that are core to the base model's unet and text encoder, resulting in more coherent results, as the weights merge at runtime more effectively."

north valley
surreal lagoon
#

same, and i'm a developer who doesn't even sell this shit, but i'm in it for the knowledge and learning

#

i totally appreciate that someone who is calculating the return on investment wouldn't find the time to look into it

surreal lagoon
north valley
#

All good, I have a lot of experience with many facets of LoRA's, always glad to help haha

#

especially for when its between DB and LoRA's

#

99/100, LoRA is the way to go

surreal lagoon
#

yep, take it from the one who has destroyed like 150 models in a few months here..

#

by the way, my current fix is working great. you just need to use a batch size of 150!

#

large batch size = very gradual learning, with more coherence

#

seems like having many repeats over training data with a large batch size has a much reduced impact on eg. overfitting

#

apocalyptic wasteland for example, i just love this. it looks to me like a real photo from an exploring channel on youtube

north valley
#

that looks dope :>

surreal lagoon
#

thanks, nat geo!

#

same prompt without the midjourney keyword

#

why does midjourney word make some photos so real looking? no idea. mysteries of life

north valley
#

hmm, interesting

#

MJ is likely a huge network of many models and LoRA's that are all triggered off of prompt nuance, as we have previously suggested

surreal lagoon
#

and here's the stupid reason i consider this training run a success. it kept the mediocre version of Robin Williams from the base 2.1 model in-tact. it even fixed some of his artifacts

#

remember my whole initial goal of this was to add some of the flavour of MJ without looking like MJ, while retaining most of the base 2.1

#

the first half is really easy. the second half, not so much

#

i was comparing my model against OpenJourney last night. i used to love that model so much, lmao it was my go-to. it is total garbage to me now

#

i'm curious how 1.5 goes once i finish this 2.1 model up, so that's my next step is to switch to base 1.5 and train it again on v_prediction loss and terminal SNR, while paying more attention to the results and trying to tune things so it works better. two days ago i tried a brief 1.5 training session on an A100-80G over 24hrs and it cost $72 and resulted in pretty poor images compared to my 2.1 attempts

#

i've likely tuned hyperparameters too far in favour of 2.1, so i'll have to find the switches to flip defaults for, and make sure that's possible to do with a single --option-at-runtime

#

oh, i added 10,000 images of hands to my training data last night, too. that has been in there for a couple checkpoints at least, by now

#

couldn't do faces of children reliably before. fixed now

wise seal
surreal lagoon
#

can you link me the extension? i can check how it loads them for you

gentle osprey
#

Great write up on Dreambooth training

#

Way more detail than most guides

surreal lagoon
# gentle osprey Great write up on Dreambooth training

there's some issues with it.

However, when changing LR, there is a problem that when generating with high CFG values, images contain distortions, that is, elements on them begin to break, and the lower lr, the lower the CFG value at which this begins to develop. This is most likely due to the deviation during training of the LR value from the base value of LR during the initial training of the model, since the same problem is observed when the frame resolution is increased, only there it is expressed differently, but the principle is similar.

for example, this isn't true.

#

"this is most likely" precedes statements that are pulled out of thin air as a guess.

gentle osprey
#

honestly, haven't found a guide that doesn't have some inaccuracies. there's sooooooooooo much contradictory information flying around

surreal lagoon
#

yes. all of the research referenced here is just a momentary stepping stone of understanding on our way to the next truth

#

prior loss preservation is such a basic concept that is implemented poorly by everyone

#

using a single token for it, ugh. captioned regularization images is where it's at

surreal lagoon
#

it says that gradient checkpointing doesn't help image quality, but the fact that you can increase the batch size via that, does help substantially. people focus on how long it'll take to train, too much.

#

$3.18/hr plus storage costs

#

i ran on an 8x A100 80G system for a few days just to see what happens

#

that was like $40/hr

#

my takeaway is that the quality boost from many GPUs is great but too expensive for me to justify, so i simply emulate it with a single 80G GPU now, and resort to gradient accumulations to boost it to the batch size i had on the 8x A100 system.
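
A rough sketch of the gradient accumulation idea described above: run several small micro-batches, average their gradients, and only then take an optimizer step. This assumes a mean-reduced loss and equal-sized micro-batches. The toy one-parameter least-squares model and all function names are illustrative and do not come from any specific trainer.

```python
def accumulation_steps(target_batch: int, micro_batch: int, num_gpus: int = 1) -> int:
    """Micro-batches to accumulate per optimizer step to reach target_batch (rounded up)."""
    per_pass = micro_batch * num_gpus
    return -(-target_batch // per_pass)  # ceiling division


def mean_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list, micro_batch: int) -> float:
    """Average of per-micro-batch mean gradients.

    Matches the full-batch gradient when every micro-batch has the same size.
    """
    grads = [
        mean_grad(w, xs[i:i + micro_batch], ys[i:i + micro_batch])
        for i in range(0, len(xs), micro_batch)
    ]
    return sum(grads) / len(grads)
```

For example, reaching an effective batch of 150 with a hypothetical micro-batch of 6 on one GPU would need `accumulation_steps(150, 6)` accumulation steps per update. The trade-off is wall-clock time: each optimizer step now costs that many forward and backward passes.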

#

i did buy a 4090

sonic narwhal
tall condor
#

only 6 hours? xD

#

mine take like a week

jaunty wadi
#

By the way, are there any DAdapt users or Lycoris model creators I could have a discussion with, I've been trying to create a Lycoris that emphasizes larger outlines/cel-shading and It can work, but I feel there are things I can do to innately optimize it, I just need to know what I should do training/captioning wise in a few different questions areas.

For instance,

  1. When creating said lycoris or lora in the process of making a style; are you supposed to include certain stylistic keywords such as thick outline, white outline, black outline, celshading, or is that best left pruned from tags? If you keep the style tags, where would they be placed start middle end?

  2. What kind of prompting should someone utilize in testing said style Lora? (For reference, I can get my lycoris to work quite well with some prompts including outlines or celshading, but on its own it doesn't unless weight is applied extensively: 1.25, 1.5.)

  3. For DAdapt I've seen some really contrary info regarding both whether to use Dadapt or Dadaptadam, and also what dims and alphas should be used.

Any insight is appreciated!

I've put some examples of what it can cook at just basic 512x512, but struggles with consistently maintaining outlines (especially if not in prompt).

💀 edit: by the way it can output more than women just tested with such as a consistent output 💀

jaunty grove
# sonic narwhal Bro my lora training on kohya_ss takes 6 hours on rtx 3070 anything im doing wro...

I started with AI art, and Stable Diffusion about 3 weeks ago, read time, and still learning.

With kohya_ss and a 4080, my Lora training for people/celebs is taking around 15 mins for say 1700 steps.

I've done about 8 trainings now, all but one are working amazingly well. I'm astounded and amazed with myself how well they've turned out.

The results are spot on, and the ai art is looking like the person in the real pics I trained on.

I don't know how fast a 3070 should be, but training for me on a 4080 with around 22-36 images, 25-30 repeats, around 10 epochs, batch size 5-7 and a learning rate of 0.00005 is taking 15-20 mins.

I try to aim for total steps between 1500-2500 using the formula:

steps = (images * repeats * epochs) / batch size
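
The formula above as a small helper, for checking a planned run before starting it. Rounding up to a whole step is an assumption made here; a given trainer may round per epoch instead, so treat the result as an estimate.

```python
def total_steps(images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Estimate optimizer steps: (images * repeats * epochs) / batch_size, rounded up."""
    samples_seen = images * repeats * epochs
    return -(-samples_seen // batch_size)  # ceiling division
```

With values inside the ranges quoted above, 30 images, 25 repeats, 10 epochs and batch size 5 give `total_steps(30, 25, 10, 5)`, which is 1500 and sits at the low end of the 1500-2500 target.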

jaunty wadi
# sonic narwhal Bro my lora training on kohya_ss takes 6 hours on rtx 3070 anything im doing wro...

I'll say it certainly depends on settings. I come from the perspective of a 3080ti: if I accidentally were to add a large amount of epochs while, say, running dadapt, then it would take that long. Meanwhile, if I were to run it on AdamW8bit, it might be a fair margin easier/faster to get the lora baking, but for instance the model I'm asking for assistance with took like 2 hours for 4 epochs because it's running DAdaptation/DAdaptAdam and a high network alpha/dim, so my answer is it depends lol

jaunty wadi
jaunty grove
#

I've made 8 character / person Lora's in recent days, all with amazing results, bar one, which was just a shade off of perfect.

I'm looking to train a style Lora next, are there any tips, anything different I need to do in terms of number of images to learn the style.

I assume I just need a variety of images in the style, and maybe more than the 22-36 character images I've been using for character training?

Do I need to aim for any increase in epochs, repeats, lower the LR etc?

Thx

jaunty grove
jaunty wadi
# jaunty grove I've made 8 character / person Lora's in recent days, all with amazing results, ...

I know I've been talking about questions with styles lora, but I'll try to give out some info I've discovered in my research (to be honest a lot of sources contradict each other)

use kohya_ss

captioning requires a different kind of pruning: instead of character traits, some either don't bother or will comb out style details (aka artists' names, line size, shading style, character names). this pruning is one I'm trying to figure out currently

I find Lycoris to be far better at capturing styles, different parameters but the results have captured it better imo,

I hear people use at least 100-150+ for styles, sometimes far more,

epochs/repeats/LR are all really dependent on what you've got thus far, depending on your system you can run DADAPT and it tends to provide better output but takes more system requirements, otherwise yeah the classic 5e-5 with adamw8bit,

jaunty grove
# jaunty wadi I know I've been talking about questions with styles lora, but I'll try to give ...

Thx Pappas,

I'll definitely be using Kohya, as I am already for my character Lora's.

The caption pruning makes sense, given you don't really want the Lora to worry about character specifics.

I'll have to experiment, and see if it's better pruned or with no captions at all.

Image wise I can easily get 150+ as its a video game style I want to train.

Lycoris I saw some on Civitai, but wasn't sure about them, and saw I needed an extension for support.

I might kick off my first ever style training tomorrow, and experiment.

jaunty wadi
#

the only thing I'll say that could even be deemed annoying about it is less documentation / civitai autoupdater doesn't update it automatically

#

otherwise execution wise it inputs the same as lora except it'd be lyco: instead

jaunty grove
#

Thx. Yeah captioning makes sense to me. It's how the model learns the association between the text and uk the image, be it character or style

I've a few extensions installed in Automatic1111, so one more won't hurt lol

Time to experiment over the weekend 😊

jaunty wadi
#

execution wise it should be really similar to loras even with training, most guides I've seen treat it identically; loras are better for characters, lyco is better style wise. with lyco I've seen a fair bit of people lower the dim/alpha, but i mean if you run dadapt it doesn't matter as much

#

one thing I'm still struggling with on these style loras/locons is whether I should retain art-style prompts in the tags: if the style is really defined by thick outlines, do I retain that, or try to let it absorb without tags? in the model's current rendition it can add said outlines either by adding 1.5 weight or by adding the tag "outline"

jaunty grove
#

What does the dim/alpha do?

One thing I've found when experimenting with Lora's, is they can be weighted too high by default and I had to make them a 0.7 or 0.6 attention for them to look ok.

I resolved that issue eventually, but believe my original Lora's were over fitted, would that be right given me having to lower the attention.

Yeah I guess in some ways the training should be just learning the style, without you needing to specify 'outlines'. If the style has outlines, then I'd kinda expect the model to just learn it as part of the overall aesthetic

jaunty wadi
jaunty grove
north valley
#

my LoRA's took like 10 minutes on a 3060ti. Likely you are doing a TON of images, with wayyy too many steps

#

if you are following like... anything from AItrepreneurs video on making LoRA's, then that certainly explains something. I do not recommend listening to anything from his LoRA video

#

its very bad information, and you will get very bad results from it

#

this was after like 5 days of trying his stuff to get a Na'vi LoRA

#

and these were about 30 minutes after I learned how to actually make LoRA's lol

sonic narwhal
tall condor
#

any way to run more than a batch of 1 with rtx4090 when training on v2 models?

#

also is there any drawback of using memory efficient attention?

stiff dust
# jaunty grove What does the dim/alpha do? One thing I've found when experimenting with Lora's...

dim is the rank of the matrix factorization. You can think of Lora as "compressed training", like you train your model but you compress the changes you made to the model to keep it small in memory. The rank is then more or less the compression strength. Higher rank = less compression = the model is more able to finetune. If your rank is as big as the matrices you change, Lora becomes more or less equivalent to Dreambooth.
alpha is a scaling factor you multiply your lora weights with. It is divided by the dim internally, so you have to set alpha equal to the dim parameter to get an effective scaling factor of 1. The idea of alpha is that Lora needs a lower learning rate the higher the rank is (I mean, people train lora with lr of 1e-4 to 1e-5 while Dreambooth for example is rather 1e-6 to 1e-7. If you used full rank lora you would also have to use very small learning rates). So instead of manually changing the learning rate whenever you try a different dim parameter, they scale down the lora when you increase the dim and provide an alpha parameter so that you can control this downweighting
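
A small sketch of the two quantities described above, assuming the common convention that the low-rank update is multiplied by `alpha / dim`. The helper names and the 768x768 layer size in the usage note are illustrative, not taken from any specific trainer.

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Effective multiplier applied to the low-rank update (alpha divided by rank)."""
    return alpha / dim


def lora_param_count(d_out: int, d_in: int, dim: int) -> int:
    """Parameters in the two low-rank factors: (d_out x dim) plus (dim x d_in)."""
    return dim * (d_out + d_in)


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in the full weight matrix that the LoRA update approximates."""
    return d_out * d_in
```

Setting alpha equal to dim gives a scale of 1, while `lora_scale(16, 32)` halves the update. For a hypothetical 768x768 layer, rank 8 stores 12,288 values versus 589,824 for the full matrix, which is the "compression" being described.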

stiff dust
#

@jaunty grove have you tried LORA on photorealism? I still struggle a little bit with that (not particularly with LORA, I'm also not happy with dreambooth results).

main breach
#

ok so hear me out: how cool would it be if you could make a mask for LoRA training or SD finetuning or dreambooth or whatever, where for each training image you could have various captions for the masked area. I've found especially in LoRA training, the ai sometimes cannot tell what's a piece of clothing or what is in the background, and features get baked in despite being very well described in the caption.

stiff dust
#

oh, that's possible without problems

#

I implemented that myself for my lora training - however, I used it for training on subject without the background
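
A minimal sketch of what a masked training loss can look like: the squared error is only counted where the mask is non-zero, so background pixels do not contribute to the gradient. This is a generic illustration on nested lists, not the implementation mentioned above; the function name and the list-based layout are assumptions.

```python
def masked_mse(pred: list, target: list, mask: list) -> float:
    """Mean squared error over positions where mask is non-zero.

    pred, target and mask are 2D nested lists of the same shape.
    mask values act as per-position weights (1 keeps a position, 0 ignores it).
    """
    total = 0.0
    weight = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * (p - t) ** 2
            weight += m
    # An all-zero mask would otherwise divide by zero.
    return total / weight if weight else 0.0
```

In a latent diffusion trainer the mask would also need to be resized to the latent resolution before being applied, which this sketch leaves out.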

tall condor
#

why cant you just crop the image accordingly and caption the cropped parts?

stiff dust
#

I don't think that works as good as using a mask

tall condor
#

anyone here using dreambooth with kohya ss to train 2.1 models?

#

for some reason i can not do more than batch of 1

tall condor
#

is there any drawback of using memory efficient attention?

jaunty grove
gentle osprey
#

Anyone got a top tier Lora training guide?

stiff dust
#

maybe it's just the quality of my input photos 🤷‍♂️ they are quite jpeg-ish. When I train the unet for too long it starts to adapt to the grainyness and not to the facial features

#

do you use textual inversion first, or do you use "sks person" style captions?

surreal lagoon
#

i wonder if it's even possible to fix all of the faces always

#

@stiff dust so this is after tuning only the last 2 layers, do you think maybe i need to tune 3 deep?

stiff dust
#

as said, I found the text encoder training rather helpful. I froze everything except the last 6 layers because this was recommended everywhere

#

but overfitting on the texture happened first in the unet

surreal lagoon
#

well i didn't have texture overfitting when i froze more than 6 layers, but i did when i did what you suggested

#

this is at 5400 steps which would be pretty toasty otherwise by now

#

i was considering freezing the unet as an experiment until it worked out so well to use massive gradient accumulations on just 2 layers of the TE

stiff dust
#

I don't really know what you are referring to.
I just said that for subject training I observed overfitting often rather in the unet than in the text encoder.

#

when you train lora you can switch off each layer after training to check how much it contributes to the produced image

surreal lagoon
#

that alters the output in a Schrodinger type way

#

the math changes a lot, it's not like photoshop layers where you can disable one and see what it does more obviously

stiff dust
#

there I found that

  • text encoder did most of the work
  • cross attention improved results slightly, but it also introduced overfitting on the jpeg-ness of the input
  • self attention did nothing
stiff dust
surreal lagoon
#

model merging is done in many different ways, and often averages the weights together or does a 50/50 "one from here, one from there" (or other ratio)
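
The weighted-average style of merge can be sketched as below, assuming both checkpoints share the same keys and shapes. Weights are shown as flat lists of floats to keep the example dependency-free; the function name and the `ratio` parameter are illustrative.

```python
def merge_weighted(a: dict, b: dict, ratio: float = 0.5) -> dict:
    """Blend two checkpoints key by key: (1 - ratio) * a + ratio * b.

    a, b  : mappings from parameter name to a flat list of floats
    ratio : 0.0 returns a unchanged, 1.0 returns b, 0.5 is a plain average
    """
    if a.keys() != b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    return {
        name: [(1.0 - ratio) * x + ratio * y for x, y in zip(a[name], b[name])]
        for name in a
    }
```

The "one from here, one from there" variant would instead pick whole tensors from either model per key rather than blending their values.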

stiff dust
#

of course the longer you train, the more interdependencies between the layer changes are introduced. But for subject training you only do a few epochs

surreal lagoon
#

are you talking about making the lora a new layer on the base model?

stiff dust
#

the lora is changing the base model

#

it's a delta you add to the weights

#

and of course you can just switch off the lora for any layer

surreal lagoon
#

in automatic1111 it is merged into the base model weights at runtime but that's not how Diffusers is going to be doing it

stiff dust
#

in Dreambooth/Fine-tuning the equivalent would be to set some layers back to their original weights after training

stiff dust
#

but honestly, that's just implementation detail. It doesn't matter if you add lora on your weight matrix or on the input

#

math is the same. It's just a question if you want to change the loras frequently
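
The equivalence claimed here can be checked on a tiny example: merging the low-rank delta into the weight matrix gives the same output as keeping it as a side path on the input. This is a sketch with plain nested lists; `W`, `B`, `A` and the helper names are illustrative, and the `scale` argument stands in for whatever multiplier the loader applies.

```python
def matmul(a: list, b: list) -> list:
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_scaled(a: list, b: list, scale: float = 1.0) -> list:
    """Elementwise a + scale * b for two same-shaped matrices."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def forward_merged(W, B, A, x, scale):
    """Bake the delta into the weights first: (W + scale * B A) x."""
    return matmul(add_scaled(W, matmul(B, A), scale), x)


def forward_side_path(W, B, A, x, scale):
    """Keep the LoRA separate and add its output: W x + scale * B (A x)."""
    return add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
```

Passing `scale=0` to the side-path version returns plain `W x`, which is the "switch off the lora for any layer" case. Keeping the side path makes it cheap to swap LoRAs, while merging avoids the extra matmuls at inference.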

surreal lagoon
#

aye

#

the dolphins are now birds

#

it never made sense that they were dolphins but i loved it

tall condor
#

whenever i train with kohya on 2.1 models as base my results are completely broken, i see color lines and there are no pictures actually existing

#

any idea why

stiff dust
#

probably a bug in kohya

#

maybe they use the wrong scheduler or epsilon prediction instead of v prediction

tall condor
#

anyone here run into that issue before?

#

v2 require v_parameterization?

sonic narwhal
tall condor
#

maybe thats the issue

#

getting better now but the result still looks very broken, anything else special when training 2.1 rather than 1.5?

stiff dust
#

768 is native resolution, DDPM is the default sampler I think.

#

learning rate should be less than in 1.5

surreal lagoon
tall condor
#

dreambooth

#

the images it generates are really bad

warm agate
#

Which dataset was used to train SDXL.

surreal lagoon
#

their own custom one

hot breach
#

the 512 ones do not IIRC

surreal lagoon
#

@stiff dust i put the unet from 2.1 back on top of my burned 7650 step checkpoint, and using the fine-tuned text encoder only fixed some texture issues

#

so maybe i'll fully freeze the unet during training

stiff dust
#

its so weird, cause its the total opposite of what people always suggest

#

"freeze the textencoder or everything will overfit!"

surreal lagoon
#

😅

#

it way cleaned the images up

stiff dust
#

but I found that with super simple textual inversion I can already learn different styles and concepts

#

so it seems that 2.1 unet is already very powerful and just needs the correct text tuning to get things right

surreal lagoon
#

yeah i agree that's a great approach. i just wanted a good base model to do that on

stiff dust
#

I hope you find a solution without freezing the unet, though ;D

#

maybe training it with lower learning rate?

#

or maybe you really need something like EMA on the unet to get good results

#

anyways, I go to bed, good night

surreal lagoon
#

burned unet

#

text encoder at 7650 steps vs unet at 4200 steps ^

surreal lagoon
#

how do i fine-tune the VAE?

vast dome
#

guys I am training 1024 max pixel on 4090, and I am getting 1.00 it/s and 2 batch size

#

is this normal?

nocturne vale
#

I'm about to train my first style LoRA out of an animated series from YouTube. Is there anything I'll have to think about in terms of aspect ratio and resolution of the dataset?

stiff dust
#

beside that its normal variational autoencoder training I guess

median ocean
#

Hello, i would like to ask, if I want to train 1000 ~ 2000 datasets with captions using Dreambooth, may I know what is the most recommended parameters? (epoch, optimizer, scheduler, learning rate, mixed precision, warmup steps, text encoder, weight)
my goal is to create a style checkpoint similar to MJ.

#

so far the best setting for me is still Lion, bf16 precision, 1e-7, constant with warmup. but would like to know if there are other recommendations 😄

surreal lagoon
#

theres no info

clever kayak
#

Anyone had success using Lora from your own service?

twilit cradle
#

Hey everyone! Can someone point me to a good source (or yourself if you know) where I can find info on the relationship between number of repeats, epochs and dataset images for LoRA training in Kohya? Like when should I do more repeats vs more epochs or vice versa, and what are the trade-offs?

hot breach
#

people have tuned the VAE using that code more or less, or like dreambooth forks of it, you'd need to look at the data loader class (fullopenimagestrain) and see what it is loading, many of the dataloaders in there are sort of one-offs for specific datasets and do some translation

#

it's also possible those yamls are just the last thing that was checked in when they published, and they may have used other datasets

#

🤷‍♂️

reef agate
#

hey so im trying to use controlnet to convert a photo of 2 people into a cartoon version of themselves, however my generations always mess up the face of one of the 2 people like this

#

so i tried photoshopping the guy's face and using img2img and controlnet to fix his face but then it doesn't really blend in with the environment around it

#

pls help

reef agate
#

which is

surreal lagoon
#

i just compile the model.

#

i get about 180 it per sec on a 4090

#

5800x3d

brittle ore
vast dome
#

hello guys, how do I use regularization images? I couldn't find any resources online. Any guidance or instructions on how to use regularization images would be appreciated

velvet grove
#

Hi, apologies if this has already been asked and answered but I couldn't find anything in the FAQ and I'm not really sure how to search for my question.

My question is - Would it work to train a stable diffusion model on different training data for a character's face and body and would I then be able to diffuse images of the character combining the relevant face and body?

Context - I'm developing a Visual Novel which uses art assets generated using AI. There are lots of benefits obviously but one of the main challenges is getting consistent results for the same character.

I've therefore chosen my favourite SD model and am going to be training a new model with that as a base to merge. However the training data I've collected for the character is going to be split between the face and the body of each character I'm training the model on.

And after 100s of hours of work on this I started to get a bad feeling that I'm being an idiot and that maybe this wouldn't work. Any advice would be greatly appreciated.

stiff dust
#

I would say that it should work. Use textual inversion on the face to train token1 and on the body to train token2 and then use both tokens in the prompt

velvet grove
surreal lagoon
#

@stiff dust my unet has cleared up upon continuing to train and train is still trash, i mis-read the filename

clear solstice
#

hi, please what is the best approach to merge multiple models?

livid otter
#

Links for a good guide on training photorealistic LoRA with Kohya SS? 🙏🏻

livid otter
turbid ledge
#

Everything is fine except the ✋

livid otter
turbid ledge
unborn wind
livid otter
surreal lagoon
#

Hands. it has 10k images in it

#

just google "10,000 images of hands dataset"

blissful vine
# livid otter Links for a good guide on training photorealistic LoRA with Kohya SS? 🙏🏻

This LORA + Checkpoint Model Training Guide explains the full process to you. Learn how to select the best images. How to key word tag the Images for Lora and Checkpoint Training. How may steps and Epochs to use in Training. How to Merge Models to get better results.

fierce egret
#

Do you guys know of any tools similar to stable tuner that give you a graphical interface to help you fine tune a model?

surreal lagoon
#

what would the interface do?

daring plume
#

@waxen solar what kind of Data do you plan to use for your model?

waxen solar
daring plume
# waxen solar as in the training set? i wanted to train a model on the character of simon from...

Get some frames of the Character and cut him out. Afterwards look at this and try to create a LORA. should be pretty simple.

https://civitai.com/models/22530/guide-make-your-own-loras-easy-and-free

You don't need to download anything, this is a guide with online tools. Click "Show more" below. 🏭 Preamble Even if you don't know where to start o...

waxen solar
#

i have around 30 images in my training dataset

#

but i'm not quite sure what to use for the class dataset as it is an animated character?

surreal lagoon
#

wow, unfreezing another layer of OpenCLIP after 2 weeks of fine-tuning is like punching it in the face with knowledge

#

it's amazing

ancient mural
#

Hey, is anyone aware of an extension that will convert mp4 to jpg images?

#

I want to resize them and use them to train. Hoping I can do it locally

stiff dust
#

using more clip layers than recommended?
repeatedly freezing and unfreezing parts during training?

surreal lagoon
#

you can try the checkpoints from HF as ptx0/pseudo-real and ptx0/pseudo-real-beta

#

the former is 4.2k steps and the latter is 17.6k steps and at 15.6k i thawed another layer of the TE

#

the 17.6k ckpt with the 4.2k step unet

stiff dust
#

yeah, the question is: was it a wrong decision freezing so many text layers at the beginning?

surreal lagoon
#

i'm starting to see improvements in fine details like small faces

#

nope

#

allows major improvements to happen gradually and then i can kick the model in the face by opening layer 21

#

it's starting to diverge a lot, but that's the goal now

#

i'm considering putting together a very high quality 2048x2048 dataset for some more unet training to bring that back in line with the new text encoder

stiff dust
#

I wonder if the unet sometimes does not keep up with the te and so it's good to freeze the te from time to time and unfreeze it after training the unet

surreal lagoon
#

i've tried that before. it seems to work better the other way around

stiff dust
#

like in the early days of deep learning when people had to train layer for layer 😅

stiff dust
surreal lagoon
#

to train the text encoder with an old unet

#

in the -beta repo i've updated the TE but not the unet

#

it was working really well and producing very clean images around 4200 steps of fine-tuning, so i decided to keep that for a while and focus on the text encoder's representations

#

the thing is, i have kept training the unet and my inference validations do all of the combinations

#

i can see with the new TE with old unet, new unet with old TE, and then, both fully trained components together

#

the unet is so weird. it clearly influences the composition in ways i do not understand. it can introduce deformity. and it introduces more detail. but not necessarily coherently. this detail can manifest as the artifacting

#
  1. both the TE and unet
  2. the unet
  3. the TE
#

imo it makes sense to periodically bring the unet weights up and see if it can remain clear, and if it starts to 'dirty up', go back to a ckpt where it was clean again and freeze it

stone garden
#

anyone know how automatic vae works? Like I'd like to use the orangemixvae only for that while using the nai vae for everything else

#

I'd like it to be done automatically

vast dome
#

jru

#

anyone using dreambooth? I am getting some weird result patterns, all my samples have this weird cut

#

is this because of dynamic image normalization

vast dome
#

or is it because I am training it on 512x512

vast dome
#

hello

smoky umbra
#

Hey there! I'm having a bit of trouble with image generation. Whenever I try to generate an image, it just turns out completely black. I've tried a few different solutions, but nothing seems to be working. Do you happen to have any ideas on what might be causing this issue? Any help would be greatly appreciated

surreal lagoon
gentle osprey
#

For people looking to get in the weeds

surreal lagoon
#

@undone fable is the unet responsible for learning resolutions or is the text encoder? or both?

undone fable
surreal lagoon
#

thank you 🙂

#

does it make sense to f/t the openclip on it to some extent?

#

also if i were to try and freeze the gradients on some piece of the unet to prevent the texture crisp, which would you suggest

surreal lagoon
#

damn, loss is as low as i've ever seen it when training with multiple high-res aspect ratios

#

0.165 on average

tulip dagger
#

do you know how i can fix this inpainting mistake?

#

What the heck is even this!?

gentle osprey
#

Your denoising strength is way too high

stiff dust
#

use the inpainting control net.

#

without inpainting model or control net you can only use low denoising strength

#

also you should use "original" instead of "latent noise" if you just want to make smaller changes

surreal lagoon
#

114/4025 [22:36<34:44:08, 31.97s/it, loss=0.146, lr=4.07e-9]

#

i like these loss values here

tired atlas
#

Howdy all...does anyone have any tutorials/resources on how to get the best resemblance results for training a model? I know the process, but most tutorials are entry-level and just gloss over the quality of the data set like, "Just get a collection of 10-30 images of the person from different angles, different light, different focal lengths." Does anyone have a resource that dives into that a little deeper? I want to dramatically improve my results.

surreal lagoon
#

it is true, guides toward what a good and bad dataset look like are slim

#

JPEG style artifacts are a huge deal. any kind of image noise gets "focused on" it seems, by the convolutional neural network, aka "the unet"

#

too similar of images that are a few pixels off, eg. think a couple random crops of the same image where the face shows up offset. this could cause blurring/double vision. so you do truly want your images to be "varied", more than "similar".

#

aspect ratio bucketing is also a big deal
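
A rough sketch of what aspect ratio bucketing means in practice: each image is assigned to the training resolution whose aspect ratio is closest, so a batch only ever contains one shape. The bucket list and file names here are invented for the example, not taken from any particular trainer.

```python
import math
from collections import defaultdict

# Hypothetical bucket resolutions (width, height).
BUCKETS = [(1024, 1024), (1216, 832), (832, 1216), (1344, 768), (768, 1344)]

def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest on a log scale."""
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

def group_by_bucket(image_sizes, buckets):
    """Map bucket -> list of image names that should be resized into it."""
    groups = defaultdict(list)
    for name, (w, h) in image_sizes.items():
        groups[nearest_bucket(w, h, buckets)].append(name)
    return dict(groups)

images = {"a.jpg": (1920, 1080), "b.jpg": (1000, 1000), "c.jpg": (1080, 1920)}
groups = group_by_bucket(images, BUCKETS)
for bucket, names in groups.items():
    print(bucket, names)
```

Batches are then drawn from one bucket at a time, which avoids cropping everything to a square.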

surreal lagoon
#

1920x1080

#

batch size 150

surreal lagoon
#

221.09s/it, loss=0.123

#

worth it lmao

unborn wind
#

Another difference I'm finding, however, is how fast the model becomes over-trained with it selected. What used to take 22 epochs now gets overtrained in 8.

surreal lagoon
#

that's 2.1-v

#

it's always v-prediction loss

unborn wind
#

do you mean by selecting it, it's doing it twice?

surreal lagoon
#

i have no idea what that option does 😅

#

what are your loss values like without this?

#

mine are typically .3-.5 with spikes up to .7-.9, and i do not use prior preservation loss

unborn wind
#

usually .3-.4

surreal lagoon
#

interesting that a lower loss causes more burning for you. it is the other way around

unborn wind
#
xrg

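Stable Diffusion v1 models predict the noise that was added to the image, but some v2 models predict something called velocity. The two use different loss functions, so their loss values cannot be compared. Empirically, v_prediction models seem to show a loss about 3 times larger, but let's confirm this mathematically. On the noised image: given an original image and a noise sample, the image with noise added at a given timestep is expressed by a formula. The original image is the latent variable output by the VAE encoder, so it follows a normal distribution with mean 0 and variance 1. The noise is, by implementation, drawn from a normal distribution with mean 0 and variance 1 to begin with. To keep things simple, a shorthand is introduced. Then the image…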

surreal lagoon
#

loss is typically fed into the backwards pass

                accelerator.backward(loss)
#

oh, well that sounds interesting

#

"improvement of details" is fucking vague

#

i will say that not every research paper has valid conclusions for their proposed reasoning

unborn wind
#

yup lol. i'm messing around with it. dunno if it's good or not due to how fast my models get over-trained before I feel it has sufficiently learned the concepts.

surreal lagoon
#

yeah that sucks and maybe you can go nuts and download LAION Aesthetics locally for like 20,000 images and give that a try for training

#

see if the dataset size helps at all with this early burning issue

#

alternatively i would request that you put your batch size up as high as you can and use a lot of gradient accumulations. think like, batch size 5 and 30 gradient accumulations

unborn wind
#

I trained a LoHa on a dataset of about 700 images and it burnt out at 8 epochs. I then cut it in half and trained again and it seems overtrained at 6 epochs.

surreal lagoon
#

you can increase your batch size, and it will effectively lower your learning rate

#

but it does so in a more coherent way that is less likely to get stuck in local minima

#

so it will take longer to train, and you can have more repeats before problems appear

#

you might need to still adjust your LR

unborn wind
#

I left them at the default in Kohya

surreal lagoon
#

i honestly have no idea what LoHA is and how you train it

#

if you're seeing "burnt" models it's because the unet is being fucked

#

the unet can be fucking wild, man lmao

#

step 100 of training

#

step 125

unborn wind
surreal lagoon
#

that's just the unet being trained

#

i'm quite liking this amount of learning it does every 25 steps tbh

#

5e-7 is my constant scheduler i'm using to train the unet

#

i would recommend you try that with the batch size changes i mentioned

#

your batch_size itself likely can't exceed 4 due to GPU memory, but whatever number you have to multiply your workable batch_size by to get to 150, you put that as your gradient accumulations. eg. 30 for batch_size 5
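
The arithmetic spelled out as a tiny sketch; the helper names are made up, and 150 is just the target mentioned above:

```python
import math

def accumulation_steps(target_batch, per_device_batch, num_devices=1):
    """Smallest number of accumulation steps that reaches the target batch."""
    return math.ceil(target_batch / (per_device_batch * num_devices))

def effective_batch(per_device_batch, steps, num_devices=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch * steps * num_devices

steps = accumulation_steps(150, 5)
print(steps, effective_batch(5, steps))

# A batch size that does not divide 150 overshoots slightly.
steps_for_4 = accumulation_steps(150, 4)
print(steps_for_4, effective_batch(4, steps_for_4))
```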

unborn wind
#

ok I'll give it a try

surreal lagoon
#

i wonder if tuning the text encoder and unet together is just really damaging in general

#

it's not done a whole lot in production teams, eg. stable diffusion 2.1 was made on a frozen text encoder

#

i've had really good results tuning the TE on its own, or the unet on its own, but not both together

unborn wind
#

is there a way in kohya to do one but not the other? would you set lr of one of them to 0?

surreal lagoon
#

both combined -> just the TE -> just the unet

#

well i'm doing aspect bucketing now too

#

the text encoder is being put through total hell with that

#

i just don't understand why combining them makes the image all poopy

#

@stiff dust any idea ?

#

both -> text encoder -> unet

#

the unet is picking up widescreen details but the text encoder is not

unborn wind
#

Is it possible to just train one or another in kohya?

surreal lagoon
#

no clue you would have to look at their source code

#

or open an issue to ask

#

dem fingers tho

#

text encoder = two dudes lmao

#

unet = one dude

#

he beat him, took his guitar, and stole the show and our hearts

#

i wish the contrast issue weren't there, but, one thing at a time

#

faces 👁️👄 👁️

#

@undone fable what was the thing with your contrast issue recently? seeing the same thing here with the zero scaled SNR

surreal lagoon
#

seem to have mostly eliminated the proliferation issue here 🙂 that's pretty great

#

that one image at the bottom still doing it but that seems to be a cursed seed

sturdy rune
surreal lagoon
#

that's really fast

#

i get 229 seconds per iteration or so

#

the high it/sec was when i was training 512x512 on 1.5

#

it was also with a batch size of 1, which isn't good for training

#

if you're getting 4.9it/sec and you're doing a general finetune you probably need to slow it down with a higher batch size, which sounds stupid, but it works

sturdy rune
#

Oh lower is better?

#

On a side note, can anyone point me to a LoRA guide or video that DOESN'T just teach you how to overfit a LoRA... There are several popular ones I've seen on YouTube where all they're doing is teaching you how to make an extremely overfit LoRA of yourself... While those were good to get my feet wet, I'm trying to train concepts like break dancing, tipping hats, etc... And following those guides, my results either end up ugly or just completely overfit. I'm fine having to make multiple attempts to get a good LoRA... But I'm currently just stuck in an endless loop, because there's so much conflicting information

surreal lagoon
#

idk about LoRAs but i finally broke this damn family up with fine-tuning on high res

#

there used to be a mom

#

and like 6 other kids

coral summit
#

could someone help me fix this hand?

raven shore
#

Have you tried inpainting the hand?

mental cliff
#

Hey guys ! hand again. I've tried it all and cannot fix the hands on this one. Any idea how to proceed.
I just photobashed the hand from my 3D base and tried to use the depth map but it's having a hard time showing hands that follow my map.

#

idk why but it shows everything but proper hands

surreal lagoon
#

this isn't the "help me fix hands" channel

dull snow
#

Shes about to snap

fallow pier
#

I've created a custom model and reduced down to two checkpoints I can't decide on. I was wondering, if I merge both the models will I get something in-between? or will it just mess them up?

stiff dust
#

just try it. Usually, merging improves model quality

#

messing up is rather unlikely. It's often surprisingly robust
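
A minimal sketch of what a plain weighted merge does, assuming both checkpoints share the same parameter names and shapes. The toy dictionaries stand in for real state dicts and the function name is hypothetical:

```python
import numpy as np

def merge_checkpoints(state_a, state_b, weight_b=0.5):
    """Linear interpolation, key by key: (1 - weight_b) * A + weight_b * B."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        key: (1.0 - weight_b) * state_a[key] + weight_b * state_b[key]
        for key in state_a
    }

# Toy stand-ins for two fine-tuned checkpoints.
ckpt_a = {"layer.weight": np.zeros((2, 2)), "layer.bias": np.zeros(2)}
ckpt_b = {"layer.weight": np.ones((2, 2)), "layer.bias": np.full(2, 2.0)}

merged = merge_checkpoints(ckpt_a, ckpt_b, weight_b=0.25)
print(merged["layer.weight"])
print(merged["layer.bias"])
```

With `weight_b=0.5` the result sits halfway between the two checkpoints in weight space, which is the "something in-between" case asked about.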

vernal dock
#

Hello! Has anyone been able to train clothing with patterns or drawings to generate it 1:1 (for example training a LORA of a dress (object only) and using it in generations so that models can wear it?

#

Thanks!

surreal lagoon
#

@stiff dust so i'm trying to train on frozen text encoder considering i've seen others have good success with it and i'm getting a disappointing amount of noise in the images

#

the images look fine at a glance but zooming in, ugh

stiff dust
#

could be worse. But yeah, I had so far better experiences with training the text encoder than training the unet

surreal lagoon
#

but i need to train the unet to fix the proliferation issues

#

i tried training the text encoder and the unet with different rates and it still doesn't seem to work very well and i think it's because the text encoder's weights are shifting while the unet is trying to match its representations

#

so my current theory is, going back to base 2.1-v to tune the unet only, on larger aspect ratios, and then, bring the text encoder around a bit and hopefully clear the image up

#

that's what this image above is, 450 steps into that

#

the contrast changes from the noise schedule are probably still getting worked out too

#

the left is the original ckpt and the right is now

stiff dust
#

yeah, I also train them consecutively, but I haven't evaluated how much that helps. I'm training the text encoder first, then the unet

surreal lagoon
#

when i did that i noticed coherence issues starting to arise in some of my test prompts. i have one asking for a mountain bike on a mountain road and the bike disappeared

#

it's still there at a different resolution but it's odd

#

it's not necessarily a show-stopper but it makes me wonder what will happen if i continue, and things just kinda got worse across other prompts

stiff dust
#

nah, I found this happening all the time without meaning anything

surreal lagoon
#

when you use laion data, are you using the text field as a caption?

stiff dust
#

sometimes an image just flips between two different outcomes even with the smallest change on the weights

surreal lagoon
#

i am wondering if that's hurting me

stiff dust
#

if you train further the bike might come back 🤷‍♂️

surreal lagoon
#

here's an example of a ckpt that made me want to give up vs some improvements in it that i'm like " pikaOMG it's not over yet"

#

the original isn't so great tho either

#

i'm not using snr gamma or offset noise on this run because it was causing issues before, a little over halfway through training

#

but maybe i need a little bit of it
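
For reference, a sketch of offset noise as it is usually described: ordinary Gaussian noise plus a small random shift that is constant across each channel of each sample. The function name is hypothetical and the strength value is only an example.

```python
import numpy as np

def offset_noise(shape, strength, rng):
    """Gaussian noise plus a per-sample, per-channel constant offset."""
    batch, channels, height, width = shape
    base = rng.normal(size=(batch, channels, height, width))
    # One extra random value per (sample, channel), broadcast over H and W.
    per_channel_shift = rng.normal(size=(batch, channels, 1, 1))
    return base + strength * per_channel_shift

rng = np.random.default_rng(0)
noise = offset_noise((2, 4, 64, 64), strength=0.02, rng=rng)
print(noise.shape)
```

The shift lets the model learn to move the overall brightness of an image, which plain per-pixel noise barely touches.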

#

this noise in the water is what i was able to eliminate before with a bit of concurrent unet/text encoder tuning

hot breach
surreal lagoon
#

SAI used offset noise with epsilon prediction instead for SDXL

hot breach
#

I saw their loss.py didn't include zero term, just l1, l2 and lpips for the vae, but I swore someone said they used zero terminal

surreal lagoon
#

Joe Penna did but i think he was just messing with people

hot breach
#

they may have done something else in noise schedulers

#

lol

surreal lagoon
#

they're using standard schedulers like DDIM

#

nothing changed there

#

i slapped the text encoder from pseudo-journey-v2 (15.6k steps fine-tuned) and the noise is gone in most images i try

hot breach
surreal lagoon
#

they might have tried it and then abandoned it

#

the huggingface guy Patrick said he didn't think terminal SNR was so groundbreaking

#

at me own expense, i quickly tuned a model about 30k steps at batch size 150 for them to get the fixes working inside diffusers otherwise the work was going to stall

#

that model knows what darkness is but it takes things too damn far

#

i found lower FID scores corresponded with higher cfg rescaling around 0.7 but the aesthetics scorer had the highest score around rescaling at 0.3, so i split the difference and settled on 9.2 guidance and 0.3 rescale, as it has the best intersection point between the two scores

#

the aesthetics score is highest when there's no cfg rescaling done but to my actual human eyes that shit is blown out and hyper contrasted
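
A rough NumPy sketch of what the rescale setting does, assuming the usual formulation: the guided prediction is scaled so its standard deviation matches the text-conditioned prediction, then blended with the unscaled result. The function name and toy arrays are hypothetical; 9.2 and 0.3 are the values mentioned above.

```python
import numpy as np

def guided_prediction(uncond, cond, guidance_scale, rescale):
    """Classifier-free guidance followed by std-matching rescale."""
    cfg = uncond + guidance_scale * (cond - uncond)
    axes = tuple(range(1, cond.ndim))            # all non-batch axes
    std_cond = cond.std(axis=axes, keepdims=True)
    std_cfg = cfg.std(axis=axes, keepdims=True)
    rescaled = cfg * (std_cond / std_cfg)        # undo the guidance blow-up
    return rescale * rescaled + (1.0 - rescale) * cfg

rng = np.random.default_rng(0)
uncond = rng.normal(size=(2, 4, 8, 8))   # toy unconditional prediction
cond = rng.normal(size=(2, 4, 8, 8))     # toy text-conditioned prediction

out = guided_prediction(uncond, cond, guidance_scale=9.2, rescale=0.3)
print(out.std(), cond.std())
```

At `rescale=0.0` this is plain CFG; at `1.0` the output has exactly the spread of the conditioned prediction.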

#

@stiff dust i hoped VAE tiling would improve faces, does it not work that way?

#

or the representation going in i guess is still very small

stiff dust
#

just from the name I thought vae tiling is just splitting an image and decoding the parts independently 🤷‍♂️ for convolution it doesn't matter so much, but attention gets expensive if you process a large image at once

surreal lagoon
#

ohhh... hmmmmm

#

it actually seems to look better 😐

#

ah i like this a lot now actually

surreal lagoon
exotic musk
#

what tips do u guys have for deciding a good network dim and alpha based on the dataset?

narrow kraken
#

@exotic musk there is a paper that was released for this

exotic musk
#

is there a link u can give me?

narrow kraken
#

type   network_dim  network_alpha  conv_dim  conv_alpha
LoRA   32           16             -         -
LoCon  16           8              8         1
LoHa   8            4              4         1

#

i'll look for the article

exotic musk
#

but, i am trying to figure out how to decide and switch up the networks based on my dataset size

narrow kraken
#

i know, you want a correlation between the 2

#

it's really hard to determine since there is no real baseline to compare against

#

i'll search for that paper and get back to you

exotic musk
#

ok

#

thank you appreciate it.

acoustic tapir
#

Hi! Could anyone please help me a bit with training? I was trying to train on my own images and tried dreambooth as well as embeddings, but they all seem to use another existing model and the results seem pretty off. Is there another recommended way to do it? Thank you!

cobalt nova
#

Wasn't better.

#

Made some darker images, but messed up a lot of stuff.

#

Will give it another go when we have some time.

surreal lagoon
#

but that wasn't with v-pred?

#

it might just be something that people choose to fine-tune in. i would love to research that, i did apply for access to the weights.

#

i did see mcmonkey say that offset noise is being used and i assume snr gamma, in which case the terminal SNR stuff seems to 'fight' with it and it makes the darks splotchy

#

and 'too' dark

surreal lagoon
#

i can get a white background now

stiff dust
surreal lagoon
#
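import open_clip
import torch
from PIL import Image

# Load the CoCa captioning model and its matching image preprocessing.
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k",
)

# Preprocess one image and add a batch dimension.
im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

# Generate caption tokens without tracking gradients.
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode the tokens and strip the start/end markers.
print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))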

think how well does this work?

#

is it better than clip_interrogator?

#

@stiff dust might be a dumb question but can you use a textual inversion on the text encoder during training?

stiff dust
#

yes, if the embedding layer is not frozen it happens automatically

#

or do you mean "using" an already trained TI? But same answer

surreal lagoon
#

i think i will put multi-checkpoint comparison into my discord bot somehow

#

currently i can switch between models and do a single batch for one but i could just have a config for which models to compare and then return one image from each, and to steal an idea from sdxl training, have buttons to tell it which is the best, and keep score for each model

surreal lagoon
# stiff dust yes, if the embedding layer is not frozen it happens automatically

this stuff is still a struggle for me to understand even as far as i've gotten with it in practical terms. i am beginning to understand after reading the CLIP paper by OpenAI that the text embeds are essentially a condensed representation of a string-image pair in latent space, right? it does make sense to me that the academics behind this stuff would be confused why we want to fine-tune it. but i am not sure what we're changing or benefitting when we fine-tune it. is it, to ensure it more closely follows prompts?

#

the text encoder seems to be important for image clarity, but i'm not sure how it's achieving that. eg. a textual inversion's goal is to find the most optimal vector that produces the least noisy result

stiff dust
#

I would say TI is quite intuitive. You introduce a new token, not known to the text encoder, and set its weights such that it represents/describes what you see in the images

surreal lagoon
#

am i mixing up embeds and inversions?

stiff dust
#

dunno, embeds is a rather generic term

surreal lagoon
#

i'm thinking of the negative prompt embeds you see like 'nfixer'

stiff dust
#

yes, those are usually TI

surreal lagoon
#

i basically want to make that somehow the default weights of the text encoder rather than have to prompt for it

stiff dust
#

that tuning the text encoder is so powerful is surprising, but I would not think of the text encoder as a single text-image embedding. It is a transformer, so EACH word in the text is given to the unet. The text encoder contextualizes the words, such that they somehow better align to the image space. So I assume that the sentence "anime movie of a castle floating in the sky" is contextualized in a way that the word "castle" is not just connected to images of castles, but to images of floating castles and anime movies. Thus the unet "knows" how to relate that word to the pixels in the image

#

now you can either train the unet to better connect words to the image, or the text encoder, such that the contextualization better fits to what the unet already learned

surreal lagoon
#

yeah i thought of that too. eg. "this word is likely to appear with these other words"

#

eg. that's how you get a human with hands without asking for a hand

stiff dust
surreal lagoon
#

oh i definitely don't drop captions randomly though i did read that helps with CFG

stiff dust
#

I mean if you do not provide a negative prompt, the "" prompt is used

surreal lagoon
#

current tuning results

#

i've got clear images, low residual noise and true blacks

#

multi-aspect support

#

but i just want all of the images this clear Sad

surreal lagoon
#

by the way it seems to do better "small faces" at higher resolutions now that i've trained it on 1766x1024 and 1024x1766 and 1024x1024 (plus two other aspects i forget)

#

so i can get a crowd of people with better faces because each face is actually larger

surreal lagoon
#

so i've got some conditions in my training dataloader where it is possible their captions get completely reduced to "", just an empty string, which i was kind of happy about, because i just assumed it would help with CFG

#

but it was unintentional and just kind of sporadic. it would be nice to have a minimum dropout threshold

#

@hot breach what kind of improvements have you all seen from consistently applied caption dropout

#

i don't do image flipping either as i don't want to damage text coherence

hot breach
#

conditional dropout should in theory improve CFG scale response, you can get funny results if you use super high values like 0.5 or 50%

surreal lagoon
#

i don't have any values to consider so for me it'll just be a % i'm implementing here

hot breach
#

the publication from the SD release said they used 10%
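
A minimal sketch of consistently applied caption dropout, assuming it is done per sample in the dataloader; the function name is hypothetical and 0.10 is the figure quoted above:

```python
import random

def maybe_drop_caption(caption, dropout_prob, rng):
    """Return "" with probability dropout_prob, otherwise the caption."""
    return "" if rng.random() < dropout_prob else caption

rng = random.Random(0)
captions = ["a photo of a cat"] * 10_000
dropped = sum(maybe_drop_caption(c, 0.10, rng) == "" for c in captions)
print(dropped / len(captions))
```

The empty caption is the same "" prompt used for the unconditional branch at inference, which is why training on it is expected to help CFG.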

surreal lagoon
#

yeah that's where i remember reading this

#

hmmmm okay so not using enough caption dropout is reported to result in an overreliance on prompts and basically overfitting TO prompts, memorizing their vocabulary and damaging the ability to generalise on unseen captions

hot breach
#

I think I usually use 0.02 to 0.05 the most, higher values can sometimes have an effect of forcing a style into generations without asking for the style, you can get some interesting results, not sure its super useful to use super high values but you might want to try it out just so you can see how it differs, on a small scale

surreal lagoon
#

ok interesting so i should try training the lord of the rings dataset without any captions at all then

hot breach
#

I expect if you do that for long enough, you'll see lord of the rings characters pop up in generations despite not asking for it, particularly at lower cfg scale settings

surreal lagoon
#

that's kind of the goal with that model 😄

#

it's just a toy because i messed with it in a small scale and managed to get a downhill mountain bike sitting against a mountain in one of the lord of the rings style forests and i was like POGGERS

#

also when i prompt for wizards now it brings up Gandalf KEKL

#

friggen Coco knows about hobbits and wizards as well as blip does

#

i guess we're into a new era, something's happening KEKL

#

Trust the Process ™️

raven shore
#

Is there any reliable way to upscale a busy image with a lot of small buildings in the distance?

dark egret
surreal lagoon
raven shore
surreal lagoon
#

np

surreal lagoon
#

@hot breach have you noticed the breakdown of terminal SNR capabilities for darkness at some point?

hot breach
#

no

#

it seems stable long term, unlike offset noise

surreal lagoon
#

this is about the best i can get for solid black background and it only really works at the 1:1 aspect ratio the model was trained on

#

left side is terminal SNR and right is offset noise at 1280x720

hot breach
#

maybe at some certain scale it could, beyond what I've tested but I've done many 10k steps on a 25k image dataset of just random assortments of things and not noticed issues

surreal lagoon
#

hm, interesting

#

and you just grab the trained betas from DDIM and pass into DDPM?

#

for training noise scheduler

hot breach
#

my test data is not very concentrated on bright or dark in particular

#

I use the algorithm that was in the paper to create trained_betas and pass that in to the noise schedule as trained_betas which diffusers accepts

#

DDIMScheduler.from_pretrained(model_root_folder, subfolder="scheduler", trained_betas=trained_betas)
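
For anyone wondering what building `trained_betas` could look like, here is a minimal sketch of the zero terminal SNR rescaling as I understand it from the paper. It is not hot breach's actual script. The `scaled_linear` defaults (1000 steps, 0.00085 to 0.012) are an assumption about the SD schedule, and the helper names are made up for illustration.

```python
import math

def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed SD-style 'scaled_linear' schedule: linear in sqrt(beta)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def enforce_zero_terminal_snr(betas):
    """Rescale betas so the final timestep has alpha_bar == 0, i.e. SNR == 0."""
    # sqrt of the cumulative product of alphas (alpha = 1 - beta)
    alphas_bar_sqrt = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar_sqrt.append(math.sqrt(running))

    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Shift so the last value becomes zero, then scale so the first value is kept.
    scale = first / (first - last)
    alphas_bar = [((s - last) * scale) ** 2 for s in alphas_bar_sqrt]

    # Turn cumulative alphas back into per-step alphas, then into betas.
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]

trained_betas = enforce_zero_terminal_snr(scaled_linear_betas())
```

The resulting list is the kind of value that would be passed as `trained_betas=` in the call above. The last beta comes out as exactly 1.0, which is what makes the final timestep pure noise.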

hot breach
#

samples from fairly early on, it can do both light and dark great

#

well discord unfortunately cuts the previews, click and look at the left sample (cfg 7)

#

"cg render of a tree ent on a few small branches with leaves on its body, with a black background" this was not even 1 epoch in

#

many more epochs in still stable on bright and dark images

#

high contrast, faces still good, etc

#

logo is potato quality for lack of data (just a random sample I grabbed from middle of training a bunch of random data), but it nails the contrast at least and colors look good to at least illustrate the zero terminal snr works well

surreal lagoon
#

try other resolutions

#

i was trying prompt solid black background from CFG of 3 to 9 and rescale of 0 to 0.7 and the closest i get is at like 9 CFG and 0 rescale

#

in a square resolution. in a widescreen one i get, washed out uniform look

#

i can get images with bright scenes, no problem

#

the model still makes great images at a square res

#

this is what i get in widescreen for a nighttime prompt

#

hm i guess it's not great in square either lol

surreal lagoon
#

50 steps with offset noise and it's so much better

surreal lagoon
#

@undone fable i needed a bit of offset noise and perturbation to bring darkness in line, but the results with CFG rescale are interesting

#

left is cfg rescale at 0.7 and right is 0.0

#

the cfg rescale change to "prevent washed-out images" seems to also prevent images from becoming too dark
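
A rough sketch of why that might happen, assuming the CFG rescale from the same paper as terminal SNR: the guided prediction is scaled back to the standard deviation of the text-conditioned prediction, then blended with the plain CFG result. The function name and the flat-list inputs are just for illustration, real code works on tensors per sample.

```python
import statistics

def cfg_with_rescale(pred_uncond, pred_text, guidance_scale=7.5, rescale=0.7):
    """Classifier-free guidance followed by std rescaling (illustrative only)."""
    # Plain CFG: push the prediction away from the unconditional one.
    cfg = [u + guidance_scale * (t - u) for u, t in zip(pred_uncond, pred_text)]

    # Scale the guided prediction so its std matches the text-conditioned prediction.
    std_text = statistics.pstdev(pred_text)
    std_cfg = statistics.pstdev(cfg)
    rescaled = [c * std_text / std_cfg for c in cfg]

    # rescale=0.0 returns plain CFG, rescale=1.0 returns the fully rescaled version.
    return [rescale * r + (1.0 - rescale) * c for r, c in zip(rescaled, cfg)]
```

Since the rescaled term pulls the overall contrast back toward the conditional prediction, it would limit extremes in both directions, which fits the observation that very dark images get harder at 0.7.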

surreal lagoon
#

1024x1536 native gens 😄

surreal lagoon
surreal lagoon
stone garden
#

Hello everyone,
I'm looking for a small fine-tuned stable diffusion model, that output max 256x256 images, you know any?

surreal lagoon
#

deepfloyd kinda?

final matrix
#

i compared my model's photorealism vs. that of deliberate, dreamshaper, and realistic vision
mine is arguably worse, but it also doesn't have any noise fix applied, and i would argue my model is more diverse regarding the output, e.g. the clothing, faces, backgrounds, etc

#

the grid doesnt show the negative prompt but it was:
anime, cartoon, digital art, cgi, render, 3d, drawing, sketch, instagram, pastel, dada, zombie, ugly, surreal, text, watermark, abstract, old, fat, jpeg, black and white, vintage, amateur, film grain, evil, damaged, concept, unfinished, model, cover, clay, figure, toy, pixelated, bad, inexperienced, illogical, random, oversaturated, overexposed, rough, fake, unrealistic, sloppy, artificial, low budget, unprofessional, cropped, out of frame, low-quality, poorly drawn, deformed, bad proportions, malformed, imperfect, unnatural, extra, rushed, weird

viscid condor
#

I was trying to train a Lora but some photos in the dataset are 1080x1920 and others are different sizes. I wanted to keep the full images in the dataset without resizing or cropping out parts as they were important for the lora training to be what i want to create. Is this possible or would everything need to be 512x512

surreal lagoon
#

you can condition them so the smaller side is 512 and preserve aspect ratio
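
A minimal sketch of that resize rule, assuming you also want both sides snapped down to a multiple of 8 so the VAE latents come out as whole numbers. The function name and the `multiple` parameter are made up here, they are not from any particular trainer.

```python
def resize_keep_aspect(width, height, short_side=512, multiple=8):
    """Scale so the shorter side equals short_side, keeping the aspect ratio."""
    shorter = min(width, height)
    # Integer arithmetic avoids floating point surprises on exact ratios.
    new_w = width * short_side // shorter
    new_h = height * short_side // shorter
    # Snap both sides down to the nearest multiple (assumed latent-friendly size).
    return (new_w // multiple * multiple, new_h // multiple * multiple)

print(resize_keep_aspect(1080, 1920))  # (512, 904)
```

Whether the trainer then keeps the full 512x904 image or crops it depends on its aspect ratio bucketing settings, so that part still needs checking in whatever tool is used.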

viscid condor
#

is this through collab lora training?

surreal lagoon
#

@undone fable you're right about the 1920x1080 not holding up even if it "(kinda) learns how to do it"

#

it'll generate dupes about 20% of the time and the rest of the time it doesn't produce a duplicate you get a weirdo like that

#

but this model works way better with hires fix than one that i trained only on square 768

undone fable
surreal lagoon
#

by the former part you mean when you make a 512x512 image and it's all artifacted?

#

yeah as far as i can tell that's because the attn layers learn to expect a large res image with details that can't be expressed

exotic musk
#

is 2e-05 too low of a learning rate for a 108 image dataset? i did 200 steps and it was super undertrained

surreal lagoon
#

that might be too little info, can you elaborate on what style of training and what you're training on

#

it sounds like you're doing a LoRA?

exotic musk
#

oh yea

#

lora my bad

#

and 1.5

exotic musk
surreal lagoon
#

man that's hard to look at

#

i think you need to do like 800 steps but it depends on your batch size. if it's very high, it'll need a higher LR. if you're using regularization / class images (prior preservation loss technique) it'll also take longer to train
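
To put numbers on that, a quick back-of-the-envelope helper for how many times the model sees each image. Batch size 1 is an assumption, since it was not stated.

```python
def dataset_passes(steps, batch_size, num_images):
    """How many times each image is seen on average during training."""
    return steps * batch_size / num_images

# 108 images, assumed batch size 1
print(round(dataset_passes(200, 1, 108), 2))  # 1.85
print(round(dataset_passes(800, 1, 108), 2))  # 7.41
```

So 200 steps is not even two passes over 108 images, which lines up with the result looking undertrained.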

#

just a warning that if your images are all blurred and pixelated like that one you're going to have a bad time

#

there's other parameters for Lora i'm not versed in, like rank etc

exotic musk
#

this was helpful

surreal lagoon
#

i mostly do general finetuning. i have not messed with lora. be sure to update us with your progress!

exotic musk
surreal lagoon
#

ratioooo think what is this

#

oh

#

12.5%

exotic musk
#

ok ty

final geyser
#

inpainted his head but idk of the heads size matches

surreal lagoon
#

try textual inversions for your dresses and subjects. you can train them in not-very-long, and they result in very clean and noise-free images, especially on 2.1

#

textual inversions are going to become the tool of choice moving forward, the SAI devs have stated that TIs in SD 2.1 are as powerful as LoRAs were in 1.5, and TIs in SDXL are as powerful as 2.1 LoRAs, etc

#

everything shifts up in power and capability as the number of parameters in the model increases

#

SDXL will likely not require massive fine-tunes to be amazing, just a few tiny textual inversions.

solid axle
surreal lagoon
#

@stiff dust can we (you and i) somehow make a new VAE for SD 2.1?

#

it is the main issue now that my noise issues are resolved (this model looks smooth like DALL-E 2)

crimson meteor
#

Hiring a stable diffusion developer,

I'm currently launching a startup that uses stable diffusion as a base for image generation, So far I have trained my own models and LoRA's but I need someone who's more advanced in model training. for example, someone who can help come up with workflows to get specific results using features like controlnet, or automate higher quality captioning, and squeeze more quality than my average dreambooth style trainings.

I'm looking forward to starting with a few paid gigs and then we can agree on a long-term arrangement.
feel free to shoot me a DM if interested.

stone garden
stone garden
#

Hello everyone, I was running into a NaN loss when creating a Lora/DyLora. I've checked all the usual suspects: No nans in input data for the given step; reasonable captions present, etc. Does anyone have any tips to resolve the NaN loss? I've tried kohya-ss/sd-scripts as well as cloneofsimo/lora and both create this issue. I was on an AMD GPU so I thought it might be related but I debugged deep and it seems to be overflows happening somewhere.

#

In fact, the unet produces all NaNs for non-nan inputs.
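
For context on how an overflow turns into NaN, here is a small stdlib-only illustration. It mimics half precision with `struct`'s `e` format and saturates to infinity on overflow the way fp16 tensors do. This only illustrates the mechanism, it is not a diagnosis of this particular run.

```python
import math
import struct

def to_fp16(value):
    """Round a Python float to half precision, saturating to inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct raises here; fp16 tensor math would silently produce inf instead.
        return math.copysign(math.inf, value)

activation = to_fp16(60000.0)      # still representable (fp16 max is 65504)
doubled = to_fp16(activation * 2)  # overflows to inf
difference = doubled - doubled     # inf - inf is NaN, and NaN then spreads everywhere
```

Once a single intermediate value hits inf, any subtraction or normalization that touches it can produce NaN, even though every input was finite.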

surreal lagoon
#

on amd use fp32

stone garden
#

I tried that too -- It NaNs out, only at a later stage. 😦 I also rented a paperspace machine with Nvidia, it still NaNs out there as well with fp16.

surreal lagoon
#

make sure you arent using snr gamma

stone garden
#

So, set it to zero? Thank you so much for the guidance, btw.

surreal lagoon
#

yeah snr gamma seems to be broken these days tbh

#

it causes NaNs here every time but i'm on v_prediction models like 2.0-v and 2.1-v
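
One possible explanation, not confirmed by anyone in this thread: the usual min-SNR weight divides by the SNR, and with a zero terminal SNR schedule the last timestep has SNR 0, so the weight becomes 0/0. A sketch of both weight forms as I understand them, with gamma 5.0 as an assumed default:

```python
import math

def min_snr_weight(snr, gamma=5.0, v_prediction=False):
    """Min-SNR loss weight for one timestep (illustrative, scalar version)."""
    clipped = min(snr, gamma)
    if v_prediction:
        # v-prediction form divides by (snr + 1), which is safe at snr == 0.
        return clipped / (snr + 1.0)
    if snr == 0.0:
        # Tensor libraries evaluate 0/0 as NaN instead of raising an error.
        return math.nan
    # Epsilon-prediction form divides by snr directly.
    return clipped / snr
```

If a trainer applies the epsilon form to a v_prediction model that was trained with terminal SNR, a single batch that samples the last timestep would be enough to poison the loss.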

exotic musk
surreal lagoon
#

try a textual inversion as well

#

you might need both

exotic musk
#

How do i train a textual inversion?

#

Sorry

#

I have never tried it

exotic musk
surreal lagoon
#

unfortunately i've not got time to

stone garden
whole crypt
#

Any tips on fine tuning a subject with glasses and a beard?
I have had great luck with training on women subjects:

  • no classification images
  • 30~ clear varied images
  • 50~ steps (around 1600 total steps)
  • text files with classifications and tokens (xyz a woman wearing a tank top, standing outside)
  • lion optimizer
    Using these settings (plus a couple of others) i can train a female subject almost every time.
    However, male subjects with glasses and a beard give me bad results. Deformed faces, terrible eyes, etc.
    Any tips?
tall condor
#

you will need much more steps

#

also make sure that your captions are correct

gentle osprey
#

Anyone mess around with training GANs? Playing around with the idea of using a GAN to produce SD training data but wasn't sure how easy it was, if even possible, to teach a GAN concepts like a specific face.

surreal lagoon
#

GANs are definitely used to generate or improve training data. the CelebA-HQ dataset comes with a processing script that takes each downloaded celeb picture and upscales it using RealESRGAN

surreal lagoon
#

something i've discovered now is that if you train the text encoder on its own on a set of captioned data and then, separately train the unet on top of that frozen text encoder, the results are really powerful

#

training them together sucks ass and should not be done

#

when you train them concurrently, the text encoder will learn to represent the relations between concepts before the unet has a chance to learn how to represent them faithfully

#

seems that the unet picks the concepts up a lot sooner with the altered procedure

hot breach
#

messed with open-flamingo a bit the last day, I made a script to run locally in ED2 so you can bulk caption with it:
python caption_fl.py --data_root input --min_new_tokens 20 --max_new_tokens 35 --num_beams 3 --model "openflamingo/OpenFlamingo-9B-vitl-mpt7b" --temperature 1.0
it will take hints from the example image/caption pairs you supply (2 seems to be enough) and then caption your images, seems to be accurate and know a lot of proper names

#

it uses example image/caption pairs to prime the model (here, the first two), then you provide the novel image for captioning (third one here), here their demo uses the humus and sign already captioned, then I gave it the image of cloud on a ladder and it captioned it for me, without me telling it who cloud strife is at all

tall condor
#

cant wait for kohya to support the new sdxl 0.9

surreal lagoon
#

it already does

#

i don't think it supports the new pieces of training but no one seems to care about the aesthetics score input lol

surreal lagoon
#

@hot breach teaching 2.0-v terminal snr and widescreen. there's a bit of oscillation to it figuring out the noise schedule where it seems to get it early and then really starts to pick it up

#

it's supposed to be just the glowing TV in a dark room

meager tangle
#

Hi, i hope i'm in the right channel to ask this: i want to add lesser-known celebrities to the absolutereality model, can i achieve that with dreambooth?

torpid mason
#

hey guys, may I know if there is any channel here where I could find regularization images for LoRa training? (looking for realistic women)

quick onyx
quick onyx
bright solar
#

hello 🙂

what can cause loss to be > 0.4?

training lora with kohya on SD-2.1 model with pytorch 2.0

sleek fossil
#

When you're finetuning a model to a specific persons face via Dreambooth training, should the class be included on both the instance and class sections? (& if so, should the class also always be included when prompting with the unique token?)

Example:
Instance: "a photo of XYZ123 person"
Class: "a photo of a person"

I typically see some version of the above in the majority of tutorials/guides, but then in the same content - I'll see the prompting examples they provide after completion exclude the "person" part of the unique ID.

"a highly realistic photo of XYZ123,….." vs. "a highly realistic photo of XYZ123 person,….."

I recently read that the exclusion is a common mistake, but with how fast things change and how much conflicting information I see across the resources - I feel like it'd be a better idea to get the answer here.

whole crypt
jade hinge
#

Here best Realism fine tuning

surreal lagoon
quick onyx
quick onyx
#

I'm using Booru, I want a word to be the focus which ones tags would be best used?

#

for instance i want to use a logo with only words on it, and attach that (word/logo) onto coffee cups, t-shirts, signs, and other products

sonic narwhal
#

Openflamingo works really well for captioning

gentle osprey
quick onyx
quick onyx
quick onyx
quick onyx
#

@quiet moat Is there anything in the pipeline to improve words and text in images?

vast dome
#

guys

#

what is "dreambooth dynamic image normalization"

#

what does it do?

#

does it resize my images?

dusty pawn
#

im making a lora about diamondhead from ben 10

#

hes a very complicated character for the ai

#

so far i made a lora but

#

it looks

#

well.........

#

these are the best ones

#

that look like him

#

and arnt just green diamonds

#

what do u advise me to change?

#

i have 144 images

#

all from the original show

#

no fan art

#

i have some promotional images of him in different artstyles so idk about that

#

i used the google collab one by hollow strawberry

quick onyx
dusty pawn
#

ive been told it has something to do with the learning process

gentle osprey
#

That being said, I would look into using ControlNet rather than training LoRAs

quick onyx
quick onyx
#

I’m trying to understand this if anyone feels like a breakdown or even a link

gentle osprey
# quick onyx Wait what, Can you explain why?

honestly, if you're new to this, the answer to that isn't going to make sense and isn't going to provide you with any useful information that you can apply to what you're working on

#

like if you're at the "how do I train a LoRA stage", the answer to "why is deepfloyd better at text than stable diffusion" isn't something that's worth you exploring unless you already have a good understanding of how the whole diffusion process works

quick onyx
#

Thanks!

regal harbor
#

which algo are you using for captions? What's the best one, especially to tag details like dirt on skin, water droplets, clothes, lighting, noise

regal harbor
surreal lagoon
#

BLIP2 is just a way of using these text embeddings

dusty pawn
#

brother

surreal lagoon
#

the text encoder provides embeddings

dusty pawn
#

i need advice

regal harbor
surreal lagoon
#

it does

dusty pawn
#

does my role make me invisible

surreal lagoon
#

@dusty pawn you haven't asked a question yet.

regal harbor
#

I want to tag my dataset accurately, and there are too many images to do it all manually

dusty pawn
#

im making a lora

surreal lagoon
#

oh, from days back

dusty pawn
#

yes

surreal lagoon
#

well i don't train LoRAs so, unfortunately i can't answer

dusty pawn
#

damn

#

well thanks anyways

surreal lagoon
#

it looks complicated, the goal

#

i would recommend trying a textual inversion

#

they can be much simpler

dusty pawn
#

this is the best one so far

regal harbor
#

so I guess the question is BLIP2 vs. ViT-bigG-14?

which one will create tags/captions which include objects, clothes, lighting, and info about image quality (blurry, grainy, noisy, jpeg etc.), because I don't want these elements to be 'baked in' to the Lora

dusty pawn
surreal lagoon
#

also not a thing i do, but i understand how it works. as such i don't know which tools support it or how to prepare a dataset for it

regal harbor
surreal lagoon
#

you'll have to test

fallen kestrel
#

Hello! I would like to create a model based on my own characters and signature style. I have completed a crude model successfully with Dreambooth, but I have read in a few places that it may be replaced by EveryDream? Specifically because I would like to train multiple characters, I've read that it has advantages over Dreambooth for that purpose. Can anyone shed some light on this? Or provide any resources where I could learn more?

torpid sinew
#

If you are training unet only is it able to learn new tokens? I havent really noticed my activation token making a difference in my prompts

steep arch
#

does anyone have any scripts/tools they use to automatically check a prompt or lora on different models, different loras, or with a scripted change/iteration in the prompt?

frozen citrus
uneven ermine
regal harbor
#

we can use weights in tags? like

(jpeg:1.5)

vast dome
#

I am trying to dreambooth with 4000 images, I am running out of vram even when 24GB 4090

#

I can do fine with 2000 images

#

r u guys experiencing this

thin mantle
#

I think the voting should be like midjourney did. It made it easy to rate a lot of images. Rating in discord is inconvenient, therefore fewer people will vote

lost idol
#

@pseudo tulip Hi! Wonder if you could implement the adaptive offset noise function to the kohya colab?

surreal lagoon
neat oxide
#

anyone know what this means?

wintry girder
#

When training a LoRA, how do you get it to keep an eye patch on the correct eye in every generation? Any tips?

jade hornet
#

Better off just controlling it during generation using inpaint sketch or controlnet. Or make sure your training set has that specific orientation/pose

wintry girder
#

Has anyone had any luck making a symmetrical body part asymmetrical in a consistent way? E.g. a robot arm that's always on the left arm?

restive bridge
#

aye guys where can i find some kohya training configs for XL?

hollow spruce
#

based on testing since then - these settings still yield the best results

lost idol
rugged narwhal
#

How do you train or something similar with amd gpus?

glacial comet
#

anyone had any luck with Dreambooth and SDXL? I have trained a few in the last day or so using a face like usual and it seems like it works as far as generating images but it will pretty much only generate images almost exactly like the training images. Prompts barely change anything at all

tulip fern
#

I would try editing the image as best as I can and then merging it with i2i, though it usually helps to give both parts of the image the same style

remote crow
#

hello is there a simple tutorial on finetuning for beginners? i want to create a minecraft texture pack model but i dont think a lora will consistently be able to create 16x16 pixel art

jade hinge
#

First Ever SDXL Training With Kohya LoRA - Stable Diffusion XL Training Will Replace Older Models : https://youtu.be/AY6DMBCIZ3A

How to install #Kohya SS GUI trainer and do #LoRA training with Stable Diffusion XL (#SDXL) this is the video you are looking for. I have shown how to install Kohya from scratch. The best parameters to do LoRA training with SDXL. How to use Kohya SDXL LoRAs with ComfyUI. How to do checkpoint comparison with SDXL LoRAs and many more cool stuff.

regal harbor
#

I got so many noisy pics. I can denoise them, but then they start to look anime. Any way to get the best of both worlds?

jade hornet
regal harbor
jade hornet
#

Maybe you're just overtraining, and the noise in the photo isn't the issue, hard to say

oblique hamlet
#

a bird

kind elk
#

Has anyone gotten it down pat to get a product into scenes with a LoRA? I have seen cars and a few things. But I have yet to see ... say for instance .... a microwave or a certain type of dress with 100% accuracy. In theory it should be quite possible

#

(for photoreal)

#

Planning to render a 3d Object at multiple angles and lighting styles and hard train a lora into it. It should be quite possible. Just haven't seen it in the wild. Curious if anyone has seen

autumn obsidian
#

Hi guys what would you suggest for finetuning on 100+ objects? or even 1000+?

blazing scarab
hidden flame
#

Would clip captioning be good enough for attempting to train multiple styles into a model? or would using multiple keywords/concepts as well work better?

abstract merlin
#

are there any resources for generating regularization images for SDXL? I'm finding it impossible to run kohya_ss full-finetune/dreambooth training on my 24GB card, so I'm preparing a jupyter notebook to run on a rented host. I'd like to be able to run the training with regularization images, but I'm falling short on tools that can use SDXL to build that imageset. Maybe I'll just have to write a script that calls the SDXL inference script in kohya and build it that way?

#

I've got Comfy running too, but I recall regularization images should use the prompts from the captions that'll be used for the training images, if I'm not mistaken. In that case, just typing in the class prompt and generating a ton of images with SDXL in comfyui wouldn't quite be the same, right?

fierce nova
#

I've seen this happen if you are using noise offsets

#

you can also counter this a bit by using a noise offset inverse of what you were using before

abstract merlin
#

Has anyone else run into this error while running kohya training?
File "/content/kohya-trainer/library/train_util.py", line 1190, in __getitem__
example["latents"] = torch.stack(latents_list) if latents_list[0] is not None else None
RuntimeError: stack expects each tensor to be equal size, but got [4, 104, 128] at entry 0 and [4, 72, 128] at entry 1
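
The message itself says two cached latents with different shapes ended up in the same batch (104x128 and 72x128 latents, which would be 832x1024 and 576x1024 images at the usual 8x VAE factor). The general fix is to only batch items that share a shape. A minimal sketch of that idea follows. It is not kohya's bucketing code, and the names are hypothetical.

```python
from collections import defaultdict

def batches_by_shape(items, batch_size):
    """Group (name, shape) pairs so every batch contains a single latent shape."""
    buckets = defaultdict(list)
    for name, shape in items:
        buckets[shape].append(name)

    batches = []
    for shape, names in buckets.items():
        # Split each bucket into chunks of at most batch_size.
        for start in range(0, len(names), batch_size):
            batches.append((shape, names[start:start + batch_size]))
    return batches

items = [
    ("a.png", (4, 104, 128)),
    ("b.png", (4, 72, 128)),
    ("c.png", (4, 104, 128)),
]
batches = batches_by_shape(items, batch_size=2)
```

In practice this usually means making sure aspect ratio bucketing is enabled, or dropping to batch size 1, or regenerating cached latents after changing resolution settings. Which of those applies here is a guess.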

torpid ledge
#

Anyone has done a successful TI of SDXL? If so could you please share the settings you used?

peak comet
olive elbow
#

Does anyone have experience training a model for product photography that can provide enough detail that it can accurately reproduce the information on a label? Is it possible?

warm fog
#

anyone doing dreambooth or loras on apple silicon with mps? It seems mixed precision support is an issue everywhere. I have lot of RAM though, so thats not a limitation but training is slooow. I’ve tried to do a dreambooth on sdxl using https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md

fossil nova
#

what captioning method is recommended for SDXL? 🙂

fossil nova
#

Anyone knows why my lora is not working for SD XL? 🙂
RuntimeError: The size of tensor a (768) must match the size of tensor b (640) at non-singleton dimension 1

torpid ledge
arctic solar
candid ledge
#

Anyone able to train LoRAs on 12gb ?

torpid ledge
#

Dreambooth is always better than just TI but for any kind of training, training the text embeddings always help tremendously

junior moss
open merlin
#

I'm using Kohya_ss to finetune SDXL for the first time. But I get stuck because of a runtime error where NaN is detected in the latents. What can I do about this?

#

I also struggle to install triton, which might be the problem. Any tips on how to get that working?

raw wraith
#

if you're on windows you don't need to use triton, it's incompatible

open merlin
fossil nova
unkempt wigeon
#

Hey. New here, but have been tinkering with SD for a while now. My technical knowledge on it is very limited, so forgive me if I ask something dumb here. I'm looking to render images in blender to act as training data for a LoRA, but I'm not entirely sure what I should be aiming to render out. I'm aiming to work on a set of clothes, and a second one for a particular character. Should I make face and head shots a focus for characters? Is there anything I should try to avoid doing?
Anything at all really or a link to any sort of guide on this topic would be much appreciated.

young crater
# unkempt wigeon Hey. New here, but have been tinkering with SD for a while now. My technical kno...

For the character, I recommend using 25+ headshots primarily of just the head and a few that are more zoomed out (I got solid results with 1 or two images that were upper body and the rest shoulders and head).
Since you are doing a cg character, you can even swap out HDRIs and fiddle with the face controls / general pose a bit for each image. Try to get a few angle shots as well.

Avoid shots where something is overlapping the face.
Allegedly dramatic lighting is bad too, but I included a few in my training data while still getting solid results.

unkempt wigeon
#

Would it be wise to include some facial expression changes per shot, or would a static expression help keep it focused on the face details themselves?

young crater
unkempt wigeon
#

Any particular minimum number of shots you'd recommend?

tough flame
#

Are we just plugging SDXL into regular DreamBooth? for example, training on LastBen's dreambooth Colab works?

unkempt wigeon
#

Thank you for your suggestions, they're much appreciated! Oh, I was also going to ask - last time I tried LoRAs, I ran into an issue of them having a very grainy low detail look, like it was painted in large brush strokes on a very noisy and rough canvas. Is this the "overcooking" that I've seen googling around, and are there any changes I should make to my training parameters/data set in order to alleviate it?

young crater
unkempt wigeon
#

Oh, I've not seen this guide, thank you. I haven't touched SDXL yet as I'm still a little too attached to what's familiar, ha

#

oh, so wait my training images don't have to be 512x512? Should I be rendering them higher?

young crater
stone garden
#

how much slower fine tuning sdxl compared to sd1.5? approximately on A100 80gb for example

#

only base without refiner

storm juniper
#

I have a question about how SDXL uses the CLIP models, as the paper is a little hard to follow.

Does SDXL concatenate the outputs of the two CLIP models into a 2048-dimensional vector, as seems to be implied by the table, or is it doing something more complex, as implied by this paragraph (which I'm finding hard to follow):

"we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the
penultimate text encoder outputs along the channel-axis [1]. Besides using cross-attention layers to
condition the model on the text-input, we follow [30] and additionally condition the model on the
pooled text embedding from the OpenCLIP model. These changes result in a model size of 2.6B
parameters in the UNet, see Tab. 1. The text encoders have a total size of 817M parameters"

The reason for asking is I'm wondering if I can avoid finetuning by doing manipulations in the CLIP embedding space
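
My reading of that paragraph, as a shape-only sketch: the per-token outputs of the two encoders are joined along the channel axis (768 + 1280 = 2048) and fed to cross-attention, while the pooled OpenCLIP vector is a separate 1280-wide input. The widths and the 77-token length are my understanding of the two encoders, not something stated in the quote, and plain lists stand in for tensors.

```python
SEQ_LEN = 77           # assumed CLIP token length
CLIP_L_DIM = 768       # assumed hidden width of CLIP ViT-L text encoder
OPENCLIP_G_DIM = 1280  # assumed hidden width of OpenCLIP ViT-bigG text encoder

def concat_channels(tokens_a, tokens_b):
    """Join two [seq_len][dim] sequences token by token along the channel axis."""
    return [a + b for a, b in zip(tokens_a, tokens_b)]

# Placeholder values; only the shapes matter here.
clip_l_tokens = [[0.0] * CLIP_L_DIM for _ in range(SEQ_LEN)]
openclip_g_tokens = [[1.0] * OPENCLIP_G_DIM for _ in range(SEQ_LEN)]

context = concat_channels(clip_l_tokens, openclip_g_tokens)  # cross-attention input
pooled = [1.0] * OPENCLIP_G_DIM                               # extra pooled conditioning

print(len(context), len(context[0]), len(pooled))  # 77 2048 1280
```

So any manipulation in embedding space would have to deal with two things: a 77x2048 token sequence whose first 768 and last 1280 channels come from different models, plus the separate pooled vector.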
blissful dragon
#

In the kohya colab, one of the preparation sections said that

All 175 images have captions
No tags found for any of the 175 images
So what's the difference between a caption and a tag? Thanks

real sundial
#

anyone know what i should do with a dataset of 4k images

#

it's too small for a model but too big for

#

anything else really

remote vapor
#

Im wondering how i can reduce the size of my output LORA training with Kohya sdxl 1.0
All the lora checkpoints come out at around 1.3 gig and id like to compress them a bit

dull bramble
#

fine tuning goes wrong 😮

real sundial
#

rip

tall condor
#

i have created a lora with multiple concepts. if i write a prompt that addresses this lora, how are the sub concepts addressed? just via the same prompt?

#

also is it possible to merge a lora directly into a model?

brave bronze
#

are there any tutorials for locally fine tuning using SDXL? How much VRAM do you need? (I got 96 GB)

tall condor
#

you have 96 on multiple cards i guess, right?

brave bronze
tall condor
#

so from what i have seen you can run around a batch of 1 on 24gb but only with some tweaks, so i would expect 96 to handle around 3-4

tall condor
#

how are lora weights applied to a model when i add it to the prompt? is adding a lora to a model having the same result as training the model directly?
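
As far as I understand the general LoRA formulation (this is not specific to any UI), each targeted layer gets a low-rank update added to its weight: W' = W + multiplier * (alpha / rank) * up @ down. Merging a LoRA into a model is the same sum written back into the checkpoint. A tiny pure-Python sketch with made-up helper names:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, down, up, alpha, multiplier=1.0):
    """Return weight + multiplier * (alpha / rank) * (up @ down)."""
    rank = len(down)                # down is [rank][in_features]
    scale = multiplier * alpha / rank
    delta = matmul(up, down)        # up is [out_features][rank]
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]    # rank 1
up = [[1.0], [0.5]]
merged = merge_lora(weight, down, up, alpha=1.0, multiplier=1.0)
```

The update only covers the layers the LoRA was trained on and is limited to rank-sized changes, so it is not identical to a full finetune of the model, even though the merged checkpoint behaves like a normal model at inference.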

dull bramble
stone garden
dull bramble
wicked perch
#

how do i fix the fingers and feet of my characters?

ivory pine
wicked perch
#

and the cfg scale?

ivory pine
#

I use 20+ steps, usually 20-50 depending on the model

#

and cfg I like to keep between 7-10, usually just 7.5

wicked perch
#

alr, ill use that in the inpainting process

young dust
#

have you messed with cfg scale scheduling and mimic cfg scale?

regal harbor
#

would you finetune datasets organizing photos by folders? e.g. "front facing", "from behind", "side view", "laying down", "2 people". Maybe if makes sense to train those concepts separately, merge them, then train over them at a low LR?

open merlin
# brave bronze are there any tutorials for locally fine tuning using SDXL? How much VRAM do you...

You need network alpha on 1, very important otherwise you overtrain. Network dim can be 264 but you can go higher as you have more ram. I'm lately having a lot of success using ~100+ total steps per image. So that includes steps per image multiplied with epochs. Learning rate using cosine (which goes to 0 from the starting) and adafactor. Starting at 0.0004 was ok, now I'm trying 0.001. also regularisation like dropout seems to work really well. Finally the weights and biases API is really useful, highly recommend it.
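
To make the numbers above concrete, a small sketch of the step budget and a cosine decay to zero. The 100 steps per image and the 0.0004 starting rate come from the message above. The function names and the batch size of 1 are assumptions, and real schedulers may add warmup.

```python
import math

def total_steps(num_images, steps_per_image=100, batch_size=1):
    """Total optimizer steps so each image is seen about steps_per_image times."""
    return math.ceil(num_images * steps_per_image / batch_size)

def cosine_lr(step, total, base_lr=4e-4):
    """Cosine decay from base_lr at step 0 down to 0 at the final step."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total))

steps = total_steps(num_images=30)
schedule = [cosine_lr(s, steps) for s in (0, steps // 2, steps)]
```

With 30 images that gives 3000 steps, starting at 0.0004, passing through roughly half of that at the midpoint, and ending at zero.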

brave bronze
open merlin
#

I use kohya ss, then put the base xl model as the one to train

#

I can make a quick word document with links and send it if you want

brave bronze
#

that would be super helpful, I'm sure to more people than just me

open merlin
#

Here is a first draft. Maybe it makes sense to make this a web page or something, if anyone wants to help we can crowdsource the best settings. These are just the ones I figured out the last few days.

tepid sundial
#

Anyone have information or links to reading on SD training with non-DDPM noise scheduler?

#

I haven't used KohyaSS, but I've been reading the code for sd-scripts, and the training script doesn't make assumptions about image size; so as long as the dataset loader doesn't either, it should work.

median ocean
#

hi guys, does anyone know where to download or own any repository of good regularization images of real life female / male pictures? i plan to train using SDXL and good reg images could improve the quality much better

torn mica
#

Working on my first lora. I couldn't find a set of regularization images that were good enough, so I've been building my own. I have 135 decent 1024x1024 images (and about 31 training images). Is 135 enough for regularization?

#

I'd skim some youtube tutorial, but I get the sense they all, at best, have a link to an image reg dataset that doesn't fit my usecase well enough.

#

Rather than building their own (to find out how many is enough to get by)

#

searched the discord server for "how many regularization"

looks like I might be ready to progress

open merlin
torn mica
torn mica
jade hornet
# median ocean hi guys, does anyone know where to download or own any repository of good regula...

Just generate them with the model. Why do I say this? You want the AI to learn what's different or special about your subject vs the 'class'. If you have regularization images, it will compare your subject images against them to discern what's different and learn that. Therefore, you want your subject images to be what stands out, if that makes sense. For that reason, I wouldn't even spend a ton of time curating them. However, if your subject has a particular body type, or you want the model to draw it better, you could train multiple subjects into the lora, like a body type and a likeness. Think of it like a game: you say "what is a woman", and it spits out an example, and you say no, no... and point it to a fine tune example

spring sun
#

Any suggestions to train a lora on sprite sheets, that usually has lower dimensions (96x128)?
I would like to train a lora for sprite sheets on SDXL.

Will I be able to produce 96x128 on XL?
should I scale everything up to train and generate, then scale down?

median ocean
#

hi guys, another questions, does anyone ever tried to train Lora using the SDXL refiner? can i know what GPU did you guys use? RTX 3090 or A6000?

dusky aurora
#

does anyone have experience in upscaling into very high resolutions ?

hollow spruce
#

For SDXL LoRA Training

kohya gui, main branch, and use this config file
for anyone wanting to make loras

epoch and max epoch need to be adjusted. 40 for normal loras, 80 for very complex loras (big dataset/faces/anatomy).
(also obviously adjust batch size, to whatever your card can handle)
repeat on dataset folders = 1

There'll be a comprehensive guide eventually on what all the settings do - but for now: that config will work as long as your dataset isn't too bad

expect this to run around 7 min per 10 epochs on a normal card. 16gb vram guaranteed - 12gb vram should work if your overhead is low enough. <- lower batch size to whatever your hardware can handle. 12gb vram should be able to handle batch 1~2
24gb vram can handle up to 8~12
if you have 24gb vram card -> run at batch 3 (only a minute slower) -> then you can continue using comfy during training, to test your checkpoints and see if you're happy and can stop training early.

also, just in case anyone's curious about file size. the full details & nuance of a face fit in dim 1 - we just use 8 cause we want the 8:1 ratio of dim to alpha.
setting it higher usually won't give better results, as dim 8 is big enough that I've fit the concepts of 100 complete dresses + faces + anatomy into it. So unless you're doing a full finetune level of lora with more than 5k images, dim 8 will be good enough
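
To put the dim/alpha numbers in concrete terms, here's a rough sketch. It assumes the usual LoRA convention where the learned update is multiplied by alpha / dim, and that a LoRA on a single linear layer adds dim * (in_features + out_features) weights. The 1280x1280 layer size is only an illustrative assumption, not something taken from the config.

```python
def lora_scale(dim: int, alpha: float) -> float:
    """Multiplier applied to the LoRA update, assuming scale = alpha / dim."""
    return alpha / dim


def lora_param_count(dim: int, in_features: int, out_features: int) -> int:
    """Extra weights a rank-`dim` LoRA adds to one linear layer.

    The down projection is (dim x in_features), the up projection is
    (out_features x dim).
    """
    return dim * (in_features + out_features)


if __name__ == "__main__":
    # dim 8 with alpha 1 is the 8:1 ratio mentioned above.
    print(lora_scale(8, 1))
    # Hypothetical 1280 -> 1280 layer, just to show how size grows with dim.
    for dim in (1, 8, 32):
        print(dim, lora_param_count(dim, 1280, 1280))
```

File size grows linearly with dim, which is why going above 8 mostly costs disk space unless the dataset really needs the capacity.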

#

For captioning, all old 1.5 rules still apply.
Dataset size of 50~100 is recommended to avoid the typical training pitfalls. 10~30 works if you know what you're doing. less than 10 can work if you really know what you're doing, and have your captioning down to a science.
Datasets above 100 definitely improve the model, on the condition that you can keep up proper captioning (better to have 50 well-captioned images than 500 badly captioned ones). Quick and/or automated methods of captioning for sdxl will be in the guide.

in an ideal world you'll have a .txt file for each image with tagging like this:
<trigger word>, caption, caption, caption, caption, caption, caption, <background description>
(seriously, don't forget background descriptions unless you want your trigger word to also affect the background)
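
A minimal sketch of writing captions in that layout, one .txt next to each image. The function names, the trigger word `mychar` and the example tags are all made up for illustration; this is not part of kohya.

```python
from pathlib import Path


def build_caption(trigger: str, tags: list[str], background: str) -> str:
    """Join the pieces as: <trigger word>, caption, ..., <background description>."""
    parts = [trigger, *tags, background]
    # Drop empty entries so a missing tag doesn't leave a stray comma.
    return ", ".join(p.strip() for p in parts if p.strip())


def write_caption(image_path: str, trigger: str, tags: list[str], background: str) -> Path:
    """Write the caption to a .txt file with the same base name as the image."""
    txt_path = Path(image_path).with_suffix(".txt")
    txt_path.write_text(build_caption(trigger, tags, background), encoding="utf-8")
    return txt_path


if __name__ == "__main__":
    print(build_caption("mychar", ["red dress", "smiling"], "standing in a park"))
```

Keeping the background as its own final entry makes it easy to spot images where it was forgotten.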

open merlin
open merlin
#

Also why not use full training with bf16? It frees up a lot of VRAM. I could sneak in an increase in batch size by enabling it.

#

What tradeoff do you make between number of epochs and steps per image?

hollow spruce
#

unet only is a complicated topic, that can't be properly explained in a few sentences. I'll be mentioning that in full in the guide - as well as everything you need to change about dataset + settings if you do plan on training the open clip layer.
In short, without much explanation: sdxl now has 2 clip models, and they are set up in a complicated way. If you train clip, then in 99% of cases you are actually damaging the whole sdxl model while doing so. If you are making waifudiffusionXL, then that is fine, as you don't really care about the abilities the clip used to have - but in our case of normal LoRA training, we really, really don't want to cause damage to the model. Especially when it takes longer to train and often gives worse results. The only downside of not training clip is that your trigger word should be close to what you're training, else you'll need twice the epochs to achieve good results.

open merlin
#

Thank you, look forward to your guide!

hollow spruce
fierce nova
#

has anyone tried doing layer-wise training on sdxl?

hollow spruce
# open merlin Nice to see you have different settings. Why do you use --network_train_unet_onl...

dropout & cosine can be used to improve your training - but they aren't foolproof. Essentially I want the json to be usable by anyone who's training for the first time, and still achieve good results.
If you understand the changes that schedulers make - then you should eventually move over to cosine with restarts, once you have a grasp on when the model starts to deteriorate. But do keep in mind that this is no longer SD1.5, and a lot of knowledge about the precision settings that was applicable to training before, is now no longer correct.
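
For anyone wondering what "cosine with restarts" does to the learning rate, here's a rough sketch of the curve shape. It assumes no warmup and hard restarts back to the full rate; the base rate of 1e-4, the 300 steps and the 3 cycles are placeholder values, and real trainer implementations may differ in details.

```python
import math


def cosine_with_restarts(step: int, total_steps: int, base_lr: float, num_cycles: int = 3) -> float:
    """Learning rate at `step` for a cosine schedule with hard restarts.

    Each cycle decays from base_lr towards 0 along half a cosine wave,
    then jumps back to base_lr at the start of the next cycle.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Position inside the current cycle, in [0, 1). Integer math avoids drift.
    cycle_pos = (step * num_cycles % total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))


if __name__ == "__main__":
    for s in (0, 50, 99, 100, 150, 299):
        print(s, cosine_with_restarts(s, total_steps=300, base_lr=1e-4))
```

The restarts are why it helps to already know roughly where the model starts to deteriorate: each jump back to the full rate lands somewhere on that timeline.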

hollow spruce
fierce nova
#

i use bf16 exclusively and it's fine, just cast it to fp16 when you are done
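
A small stdlib-only sketch of what that cast does to individual values. bf16 keeps float32's exponent range with fewer mantissa bits, while fp16 has more mantissa bits but a much smaller range (max about 65504). The bf16 emulation here truncates instead of rounding, which is a simplification, and the helper names are made up.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the float32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x: float) -> float:
    """Round to fp16; values outside the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


if __name__ == "__main__":
    for value in (1.0, 0.1, 70000.0):
        b = to_bf16(value)
        print(value, "-> bf16", b, "-> fp16", to_fp16(b))
```

Typical weight magnitudes are small and survive the cast unchanged; only values beyond the fp16 range would turn into inf.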

hollow spruce
# open merlin What tradeoff do you make between number of epochs and steps per image?

?
not sure if you're referring to repeat vs epochs, or generally what happens once you hit high epochs
steps = image count * repeat * epochs / batch size
just a matter of where the math happens

if your steps go high enough, you will either start to overfit the model, or have it break down completely. Due to the 8:1 dim-to-alpha ratio this doesn't happen till fairly late though.
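
The step math can be sketched like this. It assumes steps per epoch are rounded up to whole batches; aspect-ratio bucketing can shift the exact count a little, so treat the result as an estimate. The image counts in the example are arbitrary.

```python
import math


def total_steps(image_count: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Estimate optimizer steps: images * repeats per epoch, grouped into batches."""
    steps_per_epoch = math.ceil(image_count * repeats / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # 50 images, repeat 1: the 40 vs 80 epoch settings at batch size 1.
    print(total_steps(50, 1, 40), total_steps(50, 1, 80))
    # Same dataset at batch size 4: far fewer steps for the same epochs.
    print(total_steps(50, 1, 40, batch_size=4))
```

Repeats and epochs multiply into the same total, which is the "where the math happens" point: 10 repeats x 4 epochs and 1 repeat x 40 epochs show the model each image the same number of times.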

midnight vale
#

Does KohyaSS support training images of any aspect ratio? I want to train an SD XL model with images with a broad range of aspect ratios

hollow spruce
sonic narwhal
#

Training an SDXL lora in kohya_ss and it is incredibly slow compared to previous runs. I'm on an RTX 3090. Anyone else experience this?

open merlin
#

How many seconds per iteration are you running?

hollow spruce
dull bramble
#

why is it that when I train a lora, the sample images look absolutely disgusting

#

but when I try the same lora on automatic1111 webui, it still doesn't look good but it's way better

#

(this is supposed to be ramlethal from guilty gear strive)

midnight vale
hollow spruce
hollow spruce
dull bramble
warm agate
#

@hollow spruce do you know any LLM finetuners? I want to finetune a model to generate SD prompts

normal pike
#

Hey guys. How much VRAM do I need to finetune an SDXL model? Would a 4090 24gb be enough?

dull bramble
normal pike
robust skiff
#

Hello!
Which video card is better: GIGABYTE GeForce RTX 3060 Ti EAGLE OC (LHR) 8G or GIGABYTE GeForce RTX 3060 GAMING OC 12G?

open merlin
#

more ram is more better

normal pike
#

3060 ti is barely better than the 3060, but since it has 4gb less VRAM, the 3060 wins by miles

sonic narwhal
#

Now I'm using different settings and the estimated time is 36 hours for 6240 steps

#

did a 1.5 Lora in 18 minutes in between these

young crater
sonic narwhal
#

39

young crater
#

Caith's config is assuming 1 sample (repeat) per image, and 40-80 epochs

#

so it should be 1900-3000 steps

#

far less assuming a batch size over one

#

my last 28 image lora training was 320 steps following caith's settings

sonic narwhal
#

1 sample per image? Is that a setting in kohya?

young crater
#

the dataset folder name should be 1_[name] [class] (the leading number is the repeat count)

sonic narwhal
#

ok

#

Is training lora on SDXL supposed to create .npz files in the dataset folder?

young crater