#🔧|finetune
you should add a prompt that requests things like "laying in the grass" or a side view, to evaluate the faces better
or "spinning around, side view"
Hi, I need some help with lora.
I noticed that there are some loras trained on 2.x models that are compatible with the standard auto1111 syntax <lora:filename:multiplier>.
How do you train them? Is there any way to create them with kohya_ss?
The reason is that I use openoutpaint a lot and it is not compatible with kohya's additional networks extension, the loras can only be activated by the standard syntax
did you install the extension "LyCORIS"
which lora type did you create in kohya_ss?
@wise seal i think if you create a "Standard" Lora model you should be able to just load it
if you create a LyCoris Model you need that extension in auto 1111
@surreal lagoon are you still running with noise offset 0.02?
ok
Already in ED? 🙏
Does it work with 2.x models?
and its not playing nice with SD1.x for now because it doesn't use v_prediction
I see
I'd say it basically only works on SD2.x 768 models that already use v_prediction?
I see
it might be able to finetune SD1.5 with v_prediction long enough that it accepts the change though
someone said SD2.1-768 was based on 512, which would've been epsilon and then switched
There was something about v-pred models having triple the loss or something? Which kohya said he fixed, afaik
their loss is higher during training but its not comparable
its not something i'd worry so much about
I wasn't quite sure sd1.5 LRs would work well with 2.x models since id imagined they would require a higher LR
SD2.1 768 trains well with customized optimizer on the text encoder
lemme yoink my config for the optimizer, and i wanna ask if there's anything you'd change
oh wait a sec theres 2
SD21 and regular optimizer
does ED auto select if it notices a 2.1 V model?
TE and unet get separate optimizer instances now in ED2, so you can use a different LR, a different LR schedule, a completely different type of optimizer, etc
TE has some layer freezing setup; it seems a value like -2 or -6, which just unfreezes the last 2 or 6 layers, works well for SD2.1
TE is very sensitive in SD2.x, needs a light touch so to speak
we're still trying to figure out ideal settings, but freeze embeddings true, layers -6, and freeze final layer norm false seems to work well
ill keep that in mind
also thanks for implementing the snr freq 🤝
I shall try training again asap
```
{
    "freeze_embeddings": true,
    "freeze_front_n_layers": -2,
    "freeze_final_layer_norm": false
}
```
-2 seems to do well for small scale stuff like training one or a few characters?
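A rough sketch of how a negative `freeze_front_n_layers` could be read, assuming (as described above) that -2 or -6 means "leave only the last 2 or 6 layers unfrozen". The function name and the 23-layer count are illustrative assumptions, not EveryDream2's actual code.

```python
def trainable_layer_indices(num_layers: int, freeze_front_n_layers: int) -> list[int]:
    """Return the indices of text encoder layers that stay trainable.

    Assumed semantics (hypothetical, based on the chat description):
      n >= 0 -> freeze the first n layers
      n <  0 -> freeze everything except the last |n| layers
    """
    if freeze_front_n_layers >= 0:
        first_trainable = min(freeze_front_n_layers, num_layers)
    else:
        first_trainable = max(num_layers + freeze_front_n_layers, 0)
    return list(range(first_trainable, num_layers))


# 23 is only an example layer count, not a claim about any specific model.
print(trainable_layer_indices(23, -2))  # last 2 layers stay trainable
print(trainable_layer_indices(23, -6))  # last 6 layers stay trainable
```

With these assumed semantics, -6 touches three times as many layers as -2, which fits the "light touch" advice for the SD2.x text encoder.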
I see
Anyone tried freezing text_encoder??
I have no idea how it affects the model quality... But would love to know some results about it
Do you have any ideas on how to best train the appearance of a specific piece of furniture? I've tried different LORAs, but the furniture isn't being represented correctly.
Haven't tried LORAs myself... So cant suggest u on that
U might wanna try dreambooth training if you only want model to learn one kind piece of furniture
That might work for u
-6 is too much for large scale. and -2 isnt enough for general fine tune without balanced concepts. it forgets stuff
basically the data you use is almost more important than which layers you freeze.
Hello, how to train images without using lora with 8gb VRAM? thanks
Can anyone help? I'm about to generate images like this but the results are not so good.
when I'm generating images, it doesn't really look good. I don't need extra objects or lines that is not recognizable.
I want it look like a human-made. Without flaws or blemishes, smudges, and so on...
a large number of gradient accumulations, most likely.
make sure you use DeepSpeed and CPU offload, and then you'll be running that training job for like 3 weeks.
3 weeks?
the less video ram you have, the more the system has to spend time moving data around between CPU and GPU
it's a substantial loss in performance
so, there's no other way or option?
How about DeepSpeed and CPU offload, what are those?
i'm on an 80GB GPU and i'm training for 3 days just to have it run in a way that i don't actually destroy the model while training it. when it learns too quickly with too little context, it over-corrects and destroys the coherence of the text encoder.
if i had an 8GB GPU, to train as well as i am on an 80GB GPU, it would take probably 3 weeks to get where i've gotten in 3 days.
when you successfully use less memory while training without extending the training runtime, you're giving the model less context on each iteration. and this works okay for short training sessions on a diverse, and well-captioned set of data. but if you want the model to learn everything from your training data, you have to give it more steps. and that results in more incoherence, due to the reduced context
it doesn't actually cost that much to rent time on an 80GB GPU. like $3 an hour, and you can get quite far in 24 hours. less than $100 to fine-tune a model. depends on how much data you have. the more it needs to learn, the more expensive it becomes.
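The rental arithmetic above can be checked in a couple of lines. This is just a back-of-the-envelope sketch; `rental_cost` is a made-up helper, and the $3/hr rate and 24 hours are the figures quoted in the message, with storage and other fees ignored.

```python
def rental_cost(hourly_rate_usd: float, hours: float) -> float:
    """GPU rental cost, ignoring storage and any other fees."""
    return hourly_rate_usd * hours


# Roughly $3/hr for 24 hours, as quoted above.
print(rental_cost(3.0, 24))  # 72.0, i.e. under $100
```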
Thanks, do you use deepspeed?
no
i use large GPUs so i can avoid stuff like that, as it influences training in ways i don't understand and don't wish to spend time figuring out
So, in conclusion, I won't be able to train without using Lora?
you can do textual inversion, lora, possibly others. but yeah, training the full model incl text encoder is going to be difficult-to-impossible on a small GPU.
this is pretty important to accept, so that you don't waste time trying to make it work
i would say 24GB is probably the point to start with for fine-tuning
So I won't be able to get a perfect result with these?
the point of dreambooth is to cleanly integrate a subject into a model, and if you can't tune the text encoder properly with a large enough context, it loses the ability to do most things other than the subject you trained into it
i can show you the results of some experiments! they're not pretty.
May I see, please? Thank you
this is supposed to be a tuxedo cat
we call this "catastrophic forgetting" and i'm not trying to make a cat pun, sorry
this is supposed to be a politician whose name i don't wish to say, but it doesn't matter anyway, because it's just noisy garbage
Yah
pretty neat results but very useless for text guided diffusion.
In my case, I want to generate a character just like a human-made drawing
same model, but prompted with a class it still understands
you end up needing to provide like thousands and thousands of super high quality images to the training on a low VRAM system. and it takes longer because of that, while still potentially destroying the text encoder
you just need to make a LoRA. why do you not want to make one? they can be used on top of any compatible SD model.
the benefits far outweigh the downsides. @north valley please chime in on this
it's what you were just talkin about

Oh yes, LoRA's are massively beneficial
apparently you can merge the LoRA into a base model's unet but you said before when i'd asked you what that would do, "it would be a waste of time"?
they can be trained in literal single digit minutes on any half decent GPU, they can provide dreambooth quality results with far less work, can be weighted individually and in tandem, across any model you want, assuming its adjacent enough
I honestly don't remember that, personally
I have been looking for ways to finetune by merging LoRA's into models for a good moment now
Lora is for low vram right?
hmm... Maybe I misunderstood what you asked, I have been wanting to combine LoRA's into a model
LoRA isn't just for low VRAM, no
its for any VRAM, but it does have the benefit of being much faster and easier to train
it's a low overhead network that alters the weights of the base model's unet / text encoder via cross-attention
because they take like 5-15 minutes to train you can quickly see what you're doing wrong and adjust hyperparameters and see results in the same hour
I'm sorry. I don't really understand these things in Stable Diffusion.
basically, its a little tweak you put ontop of the model, which is much smaller, faster, and works with any model thats at least decently adjacent
it's fine, a lot of what i say won't make sense now but it might later
when Sytan says "adjacent" he means the training state of the model should be similar enough that the LoRA still behaves as expected when layered on top.
You can achieve DB quality results with a decent LoRA', in much less time, with a much smaller data set, and it has the benefit of being usable on different models
But, can lora give good images? or maybe the problem is the model I've been training
LoRA is most likely higher quality than a Dreambooth with the same dataset.
so if you train a real person into a realism model, putting that ontop of an anime model should be able to translate them into that style really good
absolutely, I make money off of selling LoRA's
let me get some of my examples really fast
LoRAs feel like cheat mode.
this was one of my first LoRA's
this is what this character actually looks like in the show he is from
and now granted, this LoRA is not that good compared to what I can do now, but you will get the point
I made that in about 20 minutes with only 7 images of him
I wanted to generate more images of a robot with this kind of design. I want it to look like a human-made drawing
fast forward to LoRA's that I did a much better job with, and you get results like these
for comparison, i've been trying to fine-tune a base model for a month and a half with more than 30,000 high quality images and it's cost me about $3200. i'd be done already if i cared to use a LoRA. but i'm in it for the challenge, not the results.
As you can see, LoRA's can get fantastic results, and even these are honestly a little lackluster compared to what I can do now, with even less, in less time 😅
LoRA's are arguably better for specific things, like a specific style, or a specific character/concept
for example, I trained a LoRA on Funko Pops, with just 10 images, and about 7 minutes of training
rely on other people for fine-tuning a base model and just make LoRAs and you'll be way happier.
@north valley have you seen LoHa yet?
Yah, thanks to you all. maybe I'll have to do more research first. Thank you very much. Appreciated your insights and advices.
Here, you can see some of the results I got from an extremely small data set, with extremely fast training
only took 10 images, and about 7 minutes training to create a LoRA that can do this style across countless models reliably
Can you give me datasets for the images I sent?
Sytan, GPT4 explained to me how LoRA benefits from regularization images by "helping it learn representations that are core to the base model's unet and text encoder, resulting in more coherent results, as the weights merge at runtime more effectively."
yes, however there are soooo many LoRA adaptations, that I genuinely can't find the interest in learning every minute difference lol
same, and i'm a developer who doesn't even sell this shit, but i'm in it for the knowledge and learning
i totally appreciate that someone who is calculating the return on investment wouldn't find the time to look into it
thanks for explaining by the way, an example image is worth a million words lmao
All good, I have a lot of experience with many facets of LoRA's, always glad to help haha
especially for when its between DB and LoRA's
99/100, LoRA is the way to go
yep, take it from the one who has destroyed like 150 models in a few months here..
by the way, my current fix is working great. you just need to use a batch size of 150!
large batch size = very gradual learning, with more coherence
seems like having many repeats over training data with a large batch size has a much reduced impact on eg. overfitting
apocalyptic wasteland for example, i just love this. it looks to me like a real photo from an exploring channel on youtube
that looks dope :>
thanks, nat geo!
same prompt without the midjourney keyword
why does midjourney word make some photos so real looking? no idea. mysteries of life
hmm, interesting
MJ is likely a huge network of many models and LoRA's that are all triggered off of prompt nuance, as we have previously suggested
and here's the stupid reason i consider this training run a success. it kept the mediocre version of Robin Williams from the base 2.1 model in-tact. it even fixed some of his artifacts
remember my whole initial goal of this was to add some of the flavour of MJ without looking like MJ, while retaining most of the base 2.1
the first half is really easy. the second half, not so much
i was comparing my model against OpenJourney last night. i used to love that model so much, lmao it was my go-to. it is total garbage to me now
i'm curious how 1.5 goes once i finish this 2.1 model up, so that's my next step is to switch to base 1.5 and train it again on v_prediction loss and terminal SNR, while paying more attention to the results and trying to tune things so it works better. two days ago i tried a brief 1.5 training session on an A100-80G over 24hrs and it cost $72 and resulted in pretty poor images compared to my 2.1 attempts
i've likely tuned hyperparameters too far in favour of 2.1, so i'll have to find the switches to flip defaults for, and make sure that's possible to do with a single --option-at-runtime
oh, i added 10,000 images of hands to my training data last night, too. that has been in there for a couple checkpoints at least, by now
couldn't do faces of children reliably before. fixed now
Thank you! That works fine with 1.5 loras but 2.x needs a separate extension to activate them. I saw however some loras on civitai that doesn't need the extension, I just don't know how they were created, or maybe there's some detail I'm missing.
can you link me the extension? i can check how it loads them for you
Great write up on Dreambooth training
Way more detail than most guides
there's some issues with it.
However, when changing LR, there is a problem that when generating with high CFG values, images contain distortions, that is, elements on them begin to break, and the lower lr, the lower the CFG value at which this begins to develop. This is most likely due to the deviation during training of the LR value from the base value of LR during the initial training of the model, since the same problem is observed when the frame resolution is increased, only there it is expressed differently, but the principle is similar.
for example, this isn't true.
"this is most likely" precedes statements that are pulled out of thin air as a guess.
honestly, haven't found a guide that doesn't have some inaccuracies. there's sooooooooooo much contradictory information flying around
yes. all of the research referenced here is just a momentary stepping stone of understanding on our way to the next truth
prior loss preservation is such a basic concept that is implemented poorly by everyone
using a single token for it, ugh. captioned regularization images is where it's at
it says that gradient checkpointing doesn't help image quality, but the fact that you can increase the batch size via that, does help substantially. people focus on how long it'll take to train, too much.
$3.18/hr plus storage costs
i ran on an 8x A100 80G system for a few days just to see what happens
that was like $40/hr
my takeaway is that the quality boost from many GPUs is great but too expensive for me to justify, so i simply emulate it with a single 80G GPU now, and resort to gradient accumulations to boost it to the batch size i had on the 8x A100 system.
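A minimal sketch of the bookkeeping behind emulating a multi-GPU batch with gradient accumulation on one GPU. The helper name and the example batch sizes are assumptions for illustration; the actual per-GPU batch sizes used above were not stated.

```python
import math


def accumulation_steps(target_batch: int, micro_batch: int, num_gpus: int = 1) -> int:
    """Number of gradient accumulation steps needed so that
    micro_batch * num_gpus * steps reaches at least target_batch."""
    samples_per_step = micro_batch * num_gpus
    return math.ceil(target_batch / samples_per_step)


# Hypothetical numbers: 8 GPUs at 16 images each give an effective batch of 128.
# One GPU holding 16 images at a time needs 8 accumulation steps to match it.
print(accumulation_steps(target_batch=128, micro_batch=16, num_gpus=1))
```

The optimizer only steps once per accumulated batch, so the run takes longer in wall-clock time, which is the trade-off described above.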
i did buy a 4090
Bro my lora training on kohya_ss takes 6 hours on rtx 3070 anything im doing wrong?
Preach, I've been struggling with that, especially regarding Dadapt and networkdim/alpha + learning rates
By the way, are there any DAdapt users or Lycoris model creators I could have a discussion with, I've been trying to create a Lycoris that emphasizes larger outlines/cel-shading and It can work, but I feel there are things I can do to innately optimize it, I just need to know what I should do training/captioning wise in a few different questions areas.
For instance,
- When creating said lycoris or lora in the process of making a style; are you supposed to include certain stylistic keywords such as thick outline, white outline, black outline, celshading, or is that best left pruned from tags? If you keep the style tags, where would they be placed: start, middle, or end?
- What kind of prompting should someone utilize in testing said style Lora? (For reference, I can get my lycoris to work quite well with some prompts including outlines or celshading, but on its own it doesn't unless weight is applied extensively: 1.25, 1.5.)
- For DAdapt I've seen some really contrary info regarding both whether to use Dadapt or Dadaptadam, and also what dims and alphas should be used.
Any insight is appreciated!
I've put some examples of what it can cook at just basic 512x512, but struggles with consistently maintaining outlines (especially if not in prompt).
💀 edit: by the way it can output more than women just tested with such as a consistent output 💀
I started with AI art, and Stable Diffusion about 3 weeks ago, read time, and still learning.
With kohya_ss and a 4080, my Lora training for people/celebs is taking around 15 mins for say 1700 steps.
I've done about 8 trainings now, all but one are working amazingly well. I'm astounded and amazed with myself how well they've turned out.
The results are spot on, and the ai art is looking like the person in the real pics I trained on.
I don't know how fast a 3070 should be, but training for me on a 4080 with around 22-36 images, 25-30 repeats, around 10 epochs, batch size 5-7 and a learning rate of 0.00005 is taking 15-20 mins.
I try to aim for total steps between 1500-2500 using the formula:
steps = (images * repeats * epochs) / batch size
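The formula above as a small helper, for checking whether a run lands in the 1500-2500 step range. This is only a sketch: the function name is made up, and rounding up when the division is not exact is an assumption, not necessarily how kohya_ss counts steps.

```python
import math


def total_steps(images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """steps = (images * repeats * epochs) / batch size, rounded up."""
    return math.ceil(images * repeats * epochs / batch_size)


# Example values in the ranges mentioned above: 30 images, 25 repeats,
# 10 epochs, batch size 5.
print(total_steps(images=30, repeats=25, epochs=10, batch_size=5))  # 1500
```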
I'll say it certainly depends on settings. I come from the perspective of a 3080ti: if I accidentally were to add a large amount of epochs while, say, running dadapt, then it would take that long. Meanwhile, if I were to run it on Adamw8bit, it might be a fair margin easier/faster to get the lora baking. But, for instance, the model I'm asking for assistance with took like 2 hours for 4 epochs because it's running DAdaptation/DAdaptADAM and a high network alpha/dim, so my answer is it depends lol
in regard to my own message ive been looking through training data on civitai for some other styles and its like 60% use a trigger word and 40% don't with pretty insane levels of repeats, they be repeating 100 images like 15 times 💀, I'd be curious as to others captioning/training styles
I've made 8 character / person Lora's in recent days, all with amazing results, bar one, which was just a shade off of perfect.
I'm looking to train a style Lora next, are there any tips, anything different I need to do in terms of number of images to learn the style.
I assume I just need a variety of images in the style, and maybe more than the 22-36 character images I've been using for character training?
Do I need to aim for any increase in epochs, repeats, lower the LR etc?
Thx
My Lora's use a trigger word that is the same name as the Lora. Repeats are about 20-25 ish on maybe 30 images, batch size 5, epochs 10.
Lr is 0.00005
I try and keep total steps between 1500 and 3000
Captions on the images are using wd14 tagger, and no regularisation images (yet)
I know I've been talking about questions with style loras, but I'll try to give out some info I've discovered in my research (to be honest a lot of sources contradict each other)
use kohya_ss
captioning requires a different kind of pruning, instead of character traits, some either don't bother or will comb out style details (aka artists names, line size, shading style, character names) this pruning is one im trying to figure currently
I find Lycoris to be far better at capturing styles, different parameters but the results have captured it better imo,
I hear people use at least 100-150+ for styles, sometimes far more,
epochs/repeats/LR are all really dependent on what you've got thus far, depending on your system you can run DADAPT and it tends to provide better output but takes more system requirements, otherwise yeah the classic 5e-5 with adamw8bit,
Thx Pappas,
I'll definitely be using Kohya, as I am already for my character Lora's.
The caption pruning makes sense, given you don't really want the Lora to worry about character specifics.
I'll have to experiment, and see if it's better pruned or with no captions at all.
Image wise I can easily get 150+ as its a video game style I want to train.
Lycoris I saw some on Civitai, but wasn't sure about them, and saw I needed an extension for support.
I might kick off my first ever style training tomorrow, and experiment.
from what I've seen 80% of people use captions and those end up better, so I'd stick with it IMO, but thats up 2 u, it is an extension for support ,but its not a complicated extension fortunately, plug + play in that regard
the only thing I'll say that could even be deemed annoying about it is that there's less documentation / the civitai autoupdater doesn't update it automatically
otherwise execution wise it inputs the same as lora, except it'd be lyco: instead
Thx. Yeah captioning makes sense to me. It's how the model learns the association between the text and the image, be it character or style
I've a few extensions installed in Automatic1111, so one more won't hurt lol
Time to experiment over the weekend 😊
execution wise should be really similar to loras even with training, most guides ive seen treat it identical, loras better for characters, lyco better style wise, lyco I've seen a fair bit of people lower the dim/alpha, but i mean if you run dadapt it doesnt matter as much
one thing im still struggling to see with these style loras/locons is whether I should retain art-style prompts in the tags, like if the style is really defined by thick outlines, do I retain that, or try to let it absorb without tags; in the model's current rendition it can add said outlines with either adding 1.5 weight or adding the tag "outline"
What does the dim/alpha do?
One thing I've found when experimenting with Lora's, is they can be weighted too high by default and I had to make them a 0.7 or 0.6 attention for them to look ok.
I resolved that issue eventually, but believe my original Lora's were over fitted, would that be right given me having to lower the attention.
Yeah I guess in some ways the training should be just learning the style, without you needing to specify 'outlines'. If the style has outlines, then I'd kinda expect the model to just learn it as part of the overall aesthetic
yeah I thought that last part would apply, but perhaps I haven't applied enough weight, dim and alpha are defined as "for lora weight scaling" with DIM always being higher or equal to alpha, "https://rentry.org/59xed3" this being one of the better guides, discussing said settings
I've already got 15 or so tabs open in my browser, as I've a backlog of reading to do on various SD and Automatic1111 topics. I'll be adding that link to the list 😊
Thx
uhhh, yeah, likely a ton
my LoRA's took like 10 minutes on a 3060ti. Likely you are doing a TON of images, with wayyy too many steps
if you are following like... anything from AItrepreneurs video on making LoRA's, then that certainly explains something. I do not recommend listening to anything from his LoRA video
its very bad information, and you will get very bad results from it
this was after like 5 days of trying his stuff to get a Na'vi LoRA
and these were about 30 minutes after I learned how to actually make LoRA's lol
yeah I used around 100 images and I am following some stuff from aitrepreneurs video. What is the "actual way" of making a lora that you have learned?
any way to run more than a batch of 1 with rtx4090 when training on v2 models?
also is there any drawback of using memory efficient attention?
dim is the rank of the matrix factorization. You can think of Lora as "compressed training", like you train your model but you compress the changes you made to the model to keep it small in memory. The rank is then more or less the compression strength. Higher rank = less compression = the model is more able to finetune. If your rank is as big as the matrices you change, Lora becomes more or less equivalent to Dreambooth.
alpha is a scaling factor you multiply your lora weights with. It is divided by the dim internally, so you have to set alpha equal to the dim parameter to get an effective scaling factor of 1. The idea of alpha is that Lora needs a lower learning rate the higher the rank is (people train lora with an lr of 1e-4 to 1e-5, while Dreambooth, for example, is rather 1e-6 to 1e-7; if you used a full-rank lora you would also have to use very small learning rates). So instead of manually changing the learning rate whenever you try a different dim parameter, they scale down the lora when you increase the dim and provide an alpha parameter so that you can control this downweighting
many people found that downweighting the lora at inference time improves results. My theory is that downweighting the lora has a similar effect like EMA or like merging models (which also improve quality at inference time)
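The dim/alpha description above can be sketched in plain Python: the weight change is a product of two small matrices, scaled by alpha / dim. The names `up`, `down`, `lora_delta` and `lora_param_count` are illustrative assumptions, not kohya_ss identifiers.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(up, down, alpha):
    """Low-rank weight change: (alpha / dim) * up @ down.

    up   has shape (out_features, dim)
    down has shape (dim, in_features)
    """
    dim = len(down)          # the rank of the factorization
    scale = alpha / dim      # alpha == dim gives an effective scale of 1
    return [[scale * v for v in row] for row in matmul(up, down)]


def lora_param_count(in_features, out_features, dim):
    """Parameters stored by the LoRA instead of a full in*out matrix."""
    return dim * (in_features + out_features)


up = [[1.0], [2.0]]          # 2 x 1
down = [[3.0, 4.0]]          # 1 x 2, so dim = 1
print(lora_delta(up, down, alpha=1.0))   # scale 1
print(lora_delta(up, down, alpha=0.5))   # same direction, half the strength
print(lora_param_count(768, 768, 8), "vs", 768 * 768)
```

Raising dim adds parameters (less "compression"), while alpha only rescales the result, which is why halving alpha acts much like lowering the learning rate or the inference-time weight.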
@jaunty grove have you tried LoRA on photorealism? I still struggle a little bit with that (not particularly with LoRA, I'm also not happy with dreambooth results).
ok so hear me out: how cool would it be if you could make a mask for LoRA training or SD finetuning or dreambooth or whatever, where for each training image you could have various captions for the masked area. I've found especially in LoRA training, the ai sometimes cannot tell what's a piece of clothing or what is in the background, and features get baked in despite being very well described in the caption.
oh, that's possible without problems
I implemented that myself for my lora training - however, I used it for training on subject without the background
why cant you just crop the image accordingly and caption the cropped parts?
I don't think that works as good as using a mask
anyone here using dreambooth with kohya ss to train 2.1 models?
for some reason i can not do more than batch of 1
is there any drawback of using memory efficient attention?
All 8 of the Lora's I've trained so far, have been photo realistic, and of real people. All have turned out amazingly well, bar one which is ok, but just not quite as good as the other 7.
I didn't do anything special. 15-35 images, 10-13 epochs, 20-28 repeats, and a batch size of 5.
I aim to get total steps between 1500-3000. Learning rate is 0.00005
Anyone got a top tier Lora training guide?
maybe it's just the quality of my input photos 🤷♂️ they are quite jpeg-ish. When I train the unet for too long it starts to adapt to the grainyness and not to the facial features
do you use textual inversion first, or do you use "sks person" style captions?
i wonder if it's even possible to fix all of the faces always
@stiff dust so this is after tuning only the last 2 layers, do you think maybe i need to tune 3 deep?
as said, I found the text encoder training rather helpful. I froze everything except the last 6 layers because this was recommended everywhere
but overfitting on the texture happened first in the unet
well i didn't have texture overfitting when i froze more than 6 layers, but i did when i did what you suggested
this is at 5400 steps which would be pretty toasty otherwise by now
i was considering freezing the unet as an experiment until it worked out so well to use massive gradient accumulations on just 2 layers of the TE
I don't really know what you are referring to.
I just said that for subject training I observed overfitting often rather in the unet than in the text encoder.
when you train lora you can switch off each layer after training to check how much it contributes to the produced image
that alters the output in a Schrodinger type way
the math changes a lot, it's not like photoshop layers where you can disable one and see what it does more obviously
there I found that
- text encoder did most of the work
- cross attention improved results slightly, but it also introduced overfitting on the jpeg-ness of the input
- self attention did nothing
of course you can. That's why lora and model merging works
model merging is done in many different ways, and often averages the weights together or does a 50/50 "one from here, one from there" (or other ratio)
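A minimal sketch of the "average the weights together" style of merge, assuming both models share the same parameter names and shapes. Weights are plain lists here for illustration; real checkpoints hold tensors, and `merge_weights` is a hypothetical helper, not any tool's API.

```python
def merge_weights(model_a: dict, model_b: dict, ratio: float = 0.5) -> dict:
    """Weighted average per parameter: (1 - ratio) * a + ratio * b."""
    if model_a.keys() != model_b.keys():
        raise ValueError("models must have the same parameter names")
    return {
        name: [(1 - ratio) * a + ratio * b
               for a, b in zip(model_a[name], model_b[name])]
        for name in model_a
    }


a = {"layer.weight": [1.0, 2.0]}
b = {"layer.weight": [3.0, 6.0]}
print(merge_weights(a, b, ratio=0.5))  # halfway between the two models
```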
of course, the longer you train, the more interdependencies between the layer changes are introduced. But for subject training you only do a few epochs
are you talking about making the lora a new layer on the base model?
the lora is changing the base model
it's a delta you add to the weights
and of course you can just switch off the lora for any layer
in automatic1111 it is merged into the base model weights at runtime but that's not how Diffusers is going to be doing it
in Dreambooth/Fine-tuning the equivalent would be to set some layers back to their original weights after training
I implemented my own lora and its doing it at runtime
but honestly, that's just implementation detail. It doesn't matter if you add lora on your weight matrix or on the input
math is the same. It's just a question if you want to change the loras frequently
aye
the dolphins are now birds 
it never made sense that they were dolphins but i loved it
whenever i train with kohya on 2.1 models as base my results are completely broken, i see color lines and there are no actual pictures
any idea why
probably a bug in kohya
maybe they use the wrong scheduler or epsilon prediction instead of v prediction
yes
maybe thats the issue
getting better now but the result still looks very broken, anything else special when training 2.1 rather than 1.5?
768 is native resolution, DDPM is the default sampler I think.
learning rate should be less than in 1.5
for dreambooth, lora, Textual inversion, or general fine-tune?
Which dataset was used to train SDXL.
their own custom one
the 768 models use v_pred
the 512 ones do not IIRC
@stiff dust i put the unet from 2.1 back on top of ma burned 7650 step checkpoint and using the fine-tuned text encoder only, fixed some texture issues
so maybe i'll fully freeze the unet during training
its so weird, cause its the total opposite of what people always suggest
"freeze the textencoder or everything will overfit!"
but I found that with super simple textual inversion I can already learn different styles and concepts
so it seems that 2.1 unet is already very powerful and just needs the correct text tuning to get things right
yeah i agree that's a great approach. i just wanted a good base model to do that on
I hope you find a solution without freezing the unet, though ;D
maybe training it with lower learning rate?
or maybe you really need something like EMA on the unet to get good results
anyways, I go to bed, good night
how do i fine-tune the VAE?
guys I am training 1024 max pixel on 4090, and I am getting 1.00 it/s and 2 batch size
is this normal?
I'm about to train my first style LoRA out of an animated series from YouTube. Is there anything I'll have to think about in terms of aspect ratio and resolution of the dataset?
the trick is to only finetune the decoder, so that the finetuned vae is still compatible with the unet
beside that its normal variational autoencoder training I guess
Hello, i would like to ask, if I want to train 1000 ~ 2000 datasets with captions using Dreambooth, may I know what is the most recommended parameters? (epoch, optimizer, scheduler, learning rate, mixed precision, warmup steps, text encoder, weight)
my goal is to create a style checkpoint similar to MJ.
so far the best setting for me is still Lion, bf16 precision, 1e-7, constant with warmup. but would like to know if there are other recommendations 😄
theres no info
Anyone had success using Lora from your own service?
Hey everyone! Can someone point me to a good source (or yourself if you know) where I can find info on the relationship between number of repeats, epochs and dataset images for LoRA training in Kohya? Like when should I do more repeats vs more epochs or vice versa, and what are the trade-offs?
the yaml has hints, it shows you the loss function setup, most of what is needed is there, the target classes have more information if you go dive into the code
people have tuned the VAE using that code more or less, or like dreambooth forks of it, you'd need to look at the data loader class (fullopenimagestrain) and see what it is loading, many of the dataloaders in there are sort of one-offs for specific datasets and do some translation
it's also possible those yamls are just the last thing that was checked in when they published, and they may have used other datasets
🤷♂️
hey so im trying to use controlnet to convert a photo of 2 people into a cartoon version of themselves however my generations always mess up the face of one of the 2 people like this
so i tried photoshopping the guys face and using img2img and controlnet to fix his face but then it doesnt really blend in with the environment around it
pls help
This is not the help channel
which is
Try inpainting?
hello guys how do I use regularization images? I couldn't find any resources online. Any guidance or instructions on how to use regularization images would be appreciated
Hi, apologies if this has already been asked and answered but I couldn't find anything in the FAQ and I'm not really sure how to search for my question.
My question is - Would it work to train a stable diffusion model on different training data for a character's face and body and would I then be able to diffuse images of the character combining the relevant face and body?
Context - I'm developing a Visual Novel which uses art assets generated using AI. There are lots of benefits obviously but one of the main challenges is getting consistent results for the same character.
I've therefore chosen my favourite SD model and am going to be training a new model with that as a base to merge. However the training data I've collected for the character is going to be split between the face and the body of each character I'm training the model on.
And after 100s of hours of work on this I started to get a bad feeling that I'm being an idiot and that maybe this wouldn't work. Any advice would be greatly appreciated.
I would say that it should work. Use textual inversion on the face to train token1 and on the body to train token2 and then use both tokens in the prompt
That's a relief. That was my plan. Thank you so much for the advice.

@stiff dust my unet has cleared up upon continuing to train and train is still trash, i mis-read the filename
hi, please what is the best approach to merge multiple models?
Links for a good guide on training photorealistic LoRA with Kohya SS? 🙏🏻
Did you find any ?
Nope
I just use all default settings except the model, which I changed to sd1.5
Everything is fine except the ✋
I guess hands are not strictly LoRA dependent right? If the base model sucks at hands your lora will suck at hands too. Just asking
Yeah you are right, the model plays a huge part too
Check out the blessup extension. It only allows fixing of brightness and contrast. I'd love to know if there are any other tools out there for vae editing.
use the hands dataset
For training? Which one specifically?
This LORA + Checkpoint Model Training Guide explains the full process to you. Learn how to select the best images. How to keyword tag the images for LoRA and checkpoint training. How many steps and epochs to use in training. How to merge models to get better results.
Do you guys know of any tools similar to stable tuner that give you a graphical interface to help you fine tune a model?
what would the interface do?
@waxen solar what kind of Data do you plan to use for your model?
as in the training set? i wanted to train a model on the character of Simon from Gurren Lagann haha
Get some frames of the Character and cut him out. Afterwards look at this and try to create a LORA. should be pretty simple.
https://civitai.com/models/22530/guide-make-your-own-loras-easy-and-free
hm i see, thank you for the link! what would you recommend to use as a source checkpoint/class images if i was trying to train a checkpoint model?
i have around 30 images in my training dataset
but i'm not quite sure what to use for the class dataset as it is an animated character?
wow, unfreezing another layer of OpenCLIP after 2 weeks of fine-tuning is like punching it in the face with knowledge
it's amazing
Hey, is anyone aware of an extension that will convert mp4 to jpg images?
I want to resize them and use them to train. Hoping I can do it locally
haha, so I'm curious what your conclusions look like in the end
using more clip layers than recommended?
repeatedly freezing and unfreezing parts during training?
you can try the checkpoints from HF as ptx0/pseudo-real and ptx0/pseudo-real-beta
the former is 4.2k steps and the latter is 17.6k steps and at 15.6k i thawed another layer of the TE
the 17.6k ckpt with the 4.2k step unet
yeah, the question is: was it a wrong decision freezing so many text layers at the beginning?
i'm starting to see improvements in fine details like small faces
nope
allows major improvements to happen gradually and then i can kick the model in the face by opening layer 21
it's starting to diverge a lot, but that's the goal now
i'm considering putting together a very high quality 2048x2048 dataset for some more unet training to bring that back in line with the new text encoder
I wonder if the unet sometimes does not keep up with the te and so it's good to freeze the te from time to time and unfreeze it after training the unet
i've tried that before. it seems to work better the other way around
like in the early days of deep learning when people had to train layer for layer 😅
what is the other way around?
to train the text encoder with an old unet
in the -beta repo i've updated the TE but not the unet
it was working really well and producing very clean images around 4200 steps of fine-tuning, so i decided to keep that for a while and focus on the text encoder's representations
the thing is, i have kept training the unet and my inference validations do all of the combinations
i can see with the new TE with old unet, new unet with old TE, and then, both fully trained components together
the unet is so weird. it clearly influences the composition in ways i do not understand. it can introduce deformity. and it introduces more detail. but not necessarily coherently. this detail can manifest as the artifacting
- both the TE and unet
- the unet
- the TE
imo it makes sense to periodically bring the unet weights up and see if it can remain clear, and if it starts to 'dirty up', go back to a ckpt where it was clean again and freeze it
anyone know how automatic vae works? Like I'd like to use the orangemixvae only for that while using the nai vae for everything else
I'd like it to be done automatically
jru
anyone using dreambooth? I am getting some weird result patterns, all my samples have this weird cut
is this because of dynamic image normalization
or is it because I am training it on 512x512
hello
Hey there! I'm having a bit of trouble with image generation. Whenever I try to generate an image, it just turns out completely black. I've tried a few different solutions, but nothing seems to be working. Do you happen to have any ideas on what might be causing this issue? Any help would be greatly appreciated
Wrong channel
for anyone who was curious what the aesthetics scores look like in practice when they see "LAION subset with an aesthetic score >6"
For people looking to get in the weeds
@undone fable is the unet responsible for learning resolutions or is the text encoder? or both?
The attention layers in the unet do the heavy lifting in regards to resolutions and aspects
thank you 🙂
does it make sense to f/t the openclip on it to some extent?
also if i were to try and freeze the gradients on some piece of the unet to prevent the texture crisp, which would you suggest
damn, loss is as low as i've ever seen it when training with multiple high-res aspect ratios
0.165 on average
do you know how i can fix this inpainting mistake?
What the heck is even this!?
Your denoising strength is way too high
use the inpainting control net.
without inpainting model or control net you can only use low denoising strength
also you should use "original" instead of "latent noise" if you just want to make smaller changes
114/4025 [22:36<34:44:08, 31.97s/it, loss=0.146, lr=4.07e-9]
i like these loss values here
Howdy all...does anyone have any tutorials/resources on how to get the best resemblance results for training a model? I know the process, but most tutorials are entry-level and just gloss over the quality of the data set like, "Just get a collection of 10-30 images of the person from different angles, different light, different focal lengths." Does anyone have a resource that dives into that a little deeper? I want to dramatically improve my results.
it is true, guides toward what a good and bad dataset look like are slim
JPEG style artifacts are a huge deal. any kind of image noise gets "focused on" it seems, by the convolutional neural network, aka "the unet"
too similar of images that are a few pixels off, eg. think a couple random crops of the same image where the face shows up offset. this could cause blurring/double vision. so you do truly want your images to be "varied", more than "similar".
aspect ratio bucketing is also a big deal
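For anyone wondering what bucketing actually does, here is a rough sketch, not taken from any particular trainer. It builds a set of (width, height) buckets on a 64-pixel grid under a fixed pixel budget and assigns each image to the bucket with the closest aspect ratio. The function names, the 512x512 area budget, and the 64-pixel step are all assumptions for illustration.

```python
import math

def make_buckets(max_area=512 * 512, step=64, min_side=256, max_side=1024):
    """Enumerate (width, height) pairs on a `step` grid whose area fits in max_area."""
    buckets = set()
    for width in range(min_side, max_side + 1, step):
        # Largest height on the grid that keeps width * height within the budget.
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # also keep the rotated orientation
    return sorted(buckets)

def assign_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))

buckets = make_buckets()
print(assign_bucket(1920, 1080, buckets))  # a landscape bucket
print(assign_bucket(1000, 1000, buckets))  # the square bucket
```

Images in the same bucket are then resized to that bucket's resolution and batched together, so no batch mixes shapes and nothing has to be cropped to a square.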
31.97s/it?! How?
Are you using Scale v prediction loss? I'm getting similar loss rates with it. If you're not, you might be able to get something even better.
Another difference I'm finding, however, is how fast the model becomes over-trained with it selected. What used to take 22 epochs now gets overtrained in 8.
do you mean by selecting it, it's doing it twice?
i have no idea what that option does 😅
what are your loss values like without this?
mine are typically .3-.5 with spikes up to .7-.9, and i do not use prior preservation loss
usually .3-.4
interesting that a lower loss causes more burning for you. it is the other way around
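Stable Diffusion v1 models predict the noise that was added to an image, but some v2 models predict something called velocity. The two use different loss functions, so their loss values cannot be compared. Empirically, v_prediction models seem to end up with a loss about 3 times larger, but let's check this mathematically. About the noised image: given the original image and the noise, the image with noise added at a given timestep is expressed by a formula. The original image here is the latent variable output by the VAE encoder, so it follows a normal distribution with mean 0 and variance 1. The noise is, by implementation, normally distributed with mean 0 and variance 1 in the first place. To keep things simple, let… Then the image…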
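One way to see why the two loss numbers are not comparable, as a hedged sketch rather than the derivation from the linked article: assume the usual velocity definition v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0, take a single scalar latent, and express the same prediction mistake once as an epsilon error and once as a velocity error. The function names and the numbers are made up for illustration.

```python
import math

def v_target(x0, eps, alpha_bar):
    """Velocity target: sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0

def eps_pred_to_v_pred(x_t, eps_hat, alpha_bar):
    """Re-express an epsilon prediction as the velocity prediction it implies."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_hat = (x_t - s * eps_hat) / a  # image estimate implied by eps_hat
    return a * eps_hat - s * x0_hat

x0, eps, alpha_bar = 0.3, -1.2, 0.25
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps  # noised latent
eps_hat = eps + 0.1  # one prediction mistake, measured in two parameterizations

eps_loss = (eps_hat - eps) ** 2
v_loss = (eps_pred_to_v_pred(x_t, eps_hat, alpha_bar) - v_target(x0, eps, alpha_bar)) ** 2
print(eps_loss, v_loss, v_loss / eps_loss)  # ratio is 1 / alpha_bar
```

Under these assumptions the identical mistake scores 1 / alpha_bar times higher as a velocity error, and that factor changes with the timestep, so a raw v_prediction loss of 0.15 and an epsilon loss of 0.15 do not mean the same thing.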
loss is typically fed into the backwards pass
accelerator.backward(loss)
oh, well that sounds interesting
"improvement of details" is fucking vague
i will say that not every research paper has valid conclusions for their proposed reasoning
yup lol. i'm messing around with it. dunno if it's good or not due to how fast my models get over-trained before I feel it has sufficiently learned the concepts.
yeah that sucks and maybe you can go nuts and download LAION Aesthetics locally for like 20,000 images and give that a try for training
see if the dataset size helps at all with this early burning issue
alternatively i would request that you put your batch size up as high as you can and use a lot of gradient accumulations. think like, batch size 5 and 30 gradient accumulations
I trained a LoHa on a dataset of about 700 images and it burnt out at 8 epochs. I then cut it in half and trained again and it seems overtrained at 6 epochs.
you can increase your batch size, and it will effectively lower your learning rate
but it does so in a more coherent way that is less likely to get stuck in local minima
so it will take longer to train, and you can have more repeats before problems appear
you might need to still adjust your LR
I left them at the default in Kohya
i honestly have no idea what LoHA is and how you train it
if you're seeing "burnt" models it's because the unet is being fucked
the unet can be fucking wild, man lmao
step 100 of training
step 125
that's just the unet being trained
i'm quite liking this amount of learning it does every 25 steps tbh
5e-7 is my constant scheduler i'm using to train the unet
i would recommend you try that with the batch size changes i mentioned
your batch_size itself likely can't exceed 4 due to GPU memory, but whatever number you have to multiply your usable batch_size by to get to 150, put that as your gradient accumulations. eg. 30 for batch_size 5
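The arithmetic behind that suggestion, as a small sketch. The target of 150 comes from the message above; the helper name is made up, and rounding up when the batch size does not divide the target evenly is an assumption.

```python
import math

def accumulation_steps(target_effective_batch, batch_size):
    """Micro-batches to accumulate so batch_size * steps reaches the target."""
    return math.ceil(target_effective_batch / batch_size)

for bs in (1, 2, 4, 5):
    steps = accumulation_steps(150, bs)
    print(f"batch_size={bs}: accumulate {steps} -> effective batch {bs * steps}")
```

With batch_size 4 this rounds up to 38 accumulation steps (effective batch 152), since 150 is not a multiple of 4.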
ok I'll give it a try
i wonder if tuning the text encoder and unet together is just really damaging in general
it's not done a whole lot in production teams, eg. stable diffusion 2.1 was made on a frozen text encoder
i've had really good results tuning the TE on its own, or the unet on its own, but not both together
is there a way in kohya to do one but not the other? would you set lr of one of them to 0?
both combined -> just the TE -> just the unet
well i'm doing aspect bucketing now too
the text encoder is being put through total hell with that
i just don't understand why combining them makes the image all poopy
@stiff dust any idea ?
both -> text encoder -> unet
the unet is picking up widescreen details but the text encoder is not
Is it possible to just train one or another in kohya?
no clue you would have to look at their source code
or open an issue to ask
dem fingers tho
text encoder = two dudes lmao
unet = one dude
he beat him, took his guitar, and stole the show and our hearts
i wish the contrast issue weren't there, but, one thing at a time
faces 👁️👄 👁️
@undone fable what was the thing with your contrast issue recently? seeing the same thing here with the zero scaled SNR
seem to have mostly eliminated the proliferation issue here 🙂 that's pretty great
that one image at the bottom still doing it but that seems to be a cursed seed
Can you elaborate on this? I have a 4090 and I'm getting like 4.9it/s when training
that's really fast
i get 229 seconds per iteration or so
the high it/sec was when i was training 512x512 on 1.5
it was also with a batch size of 1, which isn't good for training
if you're getting 4.9it/sec and you're doing a general finetune you probably need to slow it down with a higher batch size, which sounds stupid, but it works
Oh lower is better?
On a side note, can anyone point me to a LoRA guide or video that DOESN'T just teach you how to overfit a LoRA... There are several popular ones I've seen on YouTube where all they're doing is teaching you how to make an extremely overfit lora of yourself... While those were good to get my feet wet, I'm trying to train concepts like break dancing, tipping hats, etc... And following those guides either ends ugly or just completely overfit. I'm fine having to make multiple attempts to get a good LoRA... But I'm currently just stuck in an endless loop, because there's so much conflicting information
idk about LoRAs but i finally broke this damn family up with fine-tuning on high res
there used to be a mom
and like 6 other kids
could someone help me fix this hand?
Have you tried inpainting the hand?
Hey guys ! hand again. I've tried it all and cannot fix the hands on this one. Any idea how to proceed.
I just photobashed the hand from my 3D base and tried to use the depth map but its having a hard time showing hands following my map.
idk why but it shows everything but proper hands
this isn't the "help me fix hands" channel
I've created a custom model and reduced down to two checkpoints I can't decide on. I was wondering, if I merge both the models will I get something in-between? or will it just mess them up?
just try it. Usually, merging improves model quality
messing up is rather unlikely. It's often surprisingly robust
Hello! Has anyone been able to train clothing with patterns or drawings to generate it 1:1 (for example training a LORA of a dress (object only) and using it in generations so that models can wear it?
Thanks!
@stiff dust so i'm trying to train on frozen text encoder considering i've seen others have good success with it and i'm getting a disappointing amount of noise in the images
the images look fine at a glance but zooming in, ugh
could be worse. But yeah, I had so far better experiences with training the text encoder than training the unet
but i need to train the unet to fix the proliferation issues
i tried training the text encoder and the unet with different rates and it still doesn't seem to work very well and i think it's because the text encoder's weights are shifting while the unet is trying to match its representations
so my current theory is, going back to base 2.1-v to tune the unet only, on larger aspect ratios, and then, bring the text encoder around a bit and hopefully clear the image up
that's what this image above is, 450 steps into that
the contrast changes from the noise schedule are probably still getting worked out too
the left is the original ckpt and the right is now
yeah, I also train them consecutively, but I haven't evaluated how much that helps. Im training text encoder first, then unet
when i did that i noticed coherence issues starting to arise in some of my test prompts. i have one asking for a mountain bike on a mountain road and the bike disappeared
it's still there at a different resolution but it's odd
it's not necessarily a show-stopper but it makes me wonder what will happen if i continue, and things just kinda got worse across other prompts
nah, I found this happening all the time without meaning anything
when you use laion data, are you using the text field as a caption?
sometimes an image just flips between two different outcomes even with the smallest change on the weights
i am wondering if that's hurting me
if you train further the bike might come back 🤷♂️
here's an example of a ckpt that made me want to give up vs some improvements in it that i'm like "it's not over yet"
the original isn't so great tho either
i'm not using snr gamma or offset noise on this run because it was causing issues before, a little over halfway through training
but maybe i need a little bit of it
this noise in the water is what i was able to eliminate before with a bit of concurrent unet/text encoder tuning
I think min snr and offset are essentially dead. all the "diffusion is flawed" stuff, which includes zero terminal snr with cfg rescaling, is the way to go. zero terminal works right out of the box on sd2.1 and doesn't cause divergence, and it works fairly well even if cfg rescale isn't applied at inference, though it certainly helps
SAI used offset noise with epsilon prediction instead for SDXL
I saw their loss.py didn't include zero term, just l1, l2 and lpips for the vae, but I swore someone said they used zero terminal
Joe Penna did but i think he was just messing with people
they're using standard schedulers like DDIM
nothing changed there
i slapped the text encoder from pseudo-journey-v2 (15.6k steps fine-tuned) and the noise is gone in most images i try
@cobalt nova 🤨 
they might have tried it and then abandoned it
the huggingface guy Patrick said he didn't think terminal SNR was so groundbreaking
at my own expense, i quickly tuned a model about 30k steps at batch size 150 for them to get the fixes working inside diffusers, otherwise the work was going to stall
that model knows what darkness is but it takes things too damn far
i found lower FID scores corresponded with higher cfg rescaling around 0.7 but the aesthetics scorer had the highest score around rescaling at 0.3 so i split the difference and settled on 9.2 guidance and 0.3 rescale, as it has the best intersection point between the two scores
the aesthetics score is highest when there's no cfg rescaling done but to my actual human eyes that shit is blown out and hyper contrasted
@stiff dust i hoped VAE tiling would improve faces, does it not work that way?
or the representation going in i guess is still very small
just from the name I thought vae tiling is just splitting an image and decoding the parts independently 🤷♂️ for convolution it doesn't matter so much, but attention gets expensive if you process a large image at once
ohhh... hmmmmm
it actually seems to look better 😐
ah i like this a lot now actually
what tips do u guys have for deciding a good network dim and alpha based on the dataset?
@exotic musk there is a paper that was released for this
is there a link u can give me?
type    network_dim  network_alpha  conv_dim  conv_alpha
LoRA    32           16             -         -
LoCon   16           8              8         1
LoHa    8            4              4         1
i'll look for the article
but, i am trying to figure out how to decide and switch up the networks based on my dataset size
i know, you want a correlation between the 2
it's really hard to determine since there is no real baseline to compare against
i'll search for that paper and get back to you
Hi! Could anyone please help me a bit with training? I was trying to train on my own images and tried dreambooth as well as embeddings, but they all seem to build on another existing model and the results seem pretty off. Are there other recommended ways to do it? thank you!
We tried it.
Wasn't better.
Made some darker images, but messed up a lot of stuff.
Will give it another go when we have some time.
but that wasn't with v-pred?
it might just be something that people choose to fine-tune in. i would love to research that, i did apply for access to the weights.
i did see mcmonkey say that offset noise is being used and i assume snr gamma, in which case the terminal SNR stuff seems to 'fight' with it and it makes the darks splotchy
and 'too' dark
i can get a white background now
I'm also not super happy with my results so far, but I found:
- textual inversion is extremely helpful
- even better is a lora on the text encoder
- training the unet is necessary to achieve real photorealism, but should be done carefully. For drawing anime pictures of yourself, textual Inversion or text encoder is usually enough
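import open_clip
import torch
from PIL import Image

# Load the CoCa captioning model and its matching image preprocessing transform
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)

# Preprocess one image and add a batch dimension
im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

# Generate caption tokens without tracking gradients, under mixed precision
with torch.no_grad(), torch.cuda.amp.autocast():
    generated = model.generate(im)

# Decode the tokens and strip the special start/end markers
print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))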
how well does this work?
is it better than clip_interrogator?
@stiff dust might be a dumb question but can you use a textual inversion on the text encoder during training?
yes, if the embedding layer is not frozen it happens automatically
or do you mean "using" an already trained TI? But same answer
Guys is there any tutorial to train LoRA on Google Colab
i think i will put multi-checkpoint comparison into my discord bot somehow
currently i can switch between models and do a single batch for one but i could just have a config for which models to compare and then return one image from each, and to steal an idea from sdxl training, have buttons to tell it which is the best, and keep score for each model
this stuff is still a struggle for me to understand even as far as i've gotten with it in practical terms. i am beginning to understand after reading the CLIP paper by OpenAI that the text embeds are essentially a condensed representation of a string-image pair in latent space, right? it does make sense to me that the academics behind this stuff would be confused why we want to fine-tune it. but i am not sure what we're changing or benefitting when we fine-tune it. is it, to ensure it more closely follows prompts?
the text encoder seems to be important for image clarity, but i'm not sure how it's achieving that. eg. a textual inversion's goal is to find the most optimal vector that produces the least noisy result
I would say TI is quite intuitive. You introduce a new token, not known to the text encoder, and set its weights such that it represents/describes what you see in the images
am i mixing up embeds and inversions?
dunno, embeds is a rather generic term
i'm thinking of the negative prompt embeds you see like 'nfixer'
yes, those are usually TI
i basically want to make that somehow the default weights of the text encoder rather than have to prompt for it
that tuning the text encoder is so powerful is surprising, but I would not think of the text encoder as a single text-image embedding. It is a transformer, so EACH word in the text is given to the unet. The text encoder contextualizes the words, such that they somehow better align to the image space. So I assume that the sentence "anime movie of a castle floating in the sky" is contextualized in a way that the word "castle" is not just connected to images of castles, but to images of floating castles and anime movies. Thus the unet "knows" how to relate that word to the pixels in the image
now you can either train the unet to better connect words to the image, or the text encoder, such that the contextualization better fits to what the unet already learned
yeah i thought of that too. eg. "this word is likely to appear with these other words"
eg. that's how you get a human with hands without asking for a hand
maybe train a prior of high quality images without a caption in 5% of the cases
oh i definitely don't drop captions randomly though i did read that helps with CFG
I mean if you do not provide a negative prompt, the "" prompt is used
current tuning results
i've got clear images, low residual noise and true blacks
multi-aspect support
but i just want all of the images this clear 
by the way it seems to do better "small faces" at higher resolutions now that i've trained it on 1766x1024 and 1024x1766 and 1024x1024 (plus two other aspects i forget)
so i can get a crowd of people with better faces because each face is actually larger
i'm supposed to have negative captions for images too? 😮
so i've got some conditions in my training dataloader where it is possible their captions get completely reduced to "", just an empty string, which i was kind of happy about, because i just assumed it would help with CFG
but it was unintentional and just kind of sporadic. it would be nice to have a minimum dropout threshold
@hot breach what kind of improvements have you all seen from consistently applied caption dropout
i don't do image flipping either as i don't want to damage text coherence
conditional dropout should in theory improve CFG scale response, you can get funny results if you use super high values like 0.5 or 50%
i don't have any values to consider so for me it'll just be a % i'm implementing here
the publication from the SD release said they used 10%
yeah that's where i remember reading this
hmmmm okay so not using enough caption dropout is reported to result in an overreliance on prompts and basically overfitting TO prompts, memorizing their vocabulary and damaging the ability to generalise on unseen captions
I think I usually use 0.02 to 0.05 the most, higher values can sometimes have an effect of forcing a style into generations without asking for the style, you can get some interesting results, not sure its super useful to use super high values but you might want to try it out just so you can see how it differs, on a small scale
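A minimal sketch of what caption dropout means in a dataloader, assuming it is implemented as "replace the caption with the empty string with some probability". The function name is hypothetical, and the 10% rate is only the figure quoted above from the SD release.

```python
import random

def maybe_drop_caption(caption, dropout_prob, rng):
    """Return "" with probability dropout_prob, otherwise the caption unchanged."""
    return "" if rng.random() < dropout_prob else caption

rng = random.Random(0)  # seeded so the run is repeatable
captions = ["a wizard in a forest"] * 10_000
dropped = sum(maybe_drop_caption(c, 0.10, rng) == "" for c in captions)
print(f"{dropped / len(captions):.1%} of captions replaced with the empty prompt")
```

Training on the empty prompt some of the time is what gives the model an unconditional prediction for CFG to push away from, which is why the rate affects how the model responds to guidance scale.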
ok interesting so i should try training the lord of the rings dataset without any captions at all then
I expect if you do that for long enough, you'll see lord of the rings characters pop up in generations despite not asking for it, particularly at lower cfg scale settings
that's kind of the goal with that model 😄
it's just a toy because i messed with it in a small scale and managed to get a downhill mountain bike sitting against a mountain in one of the lord of the rings style forests and i was like 
also when i prompt for wizards now it brings up Gandalf 
friggen Coco knows about hobbits and wizards as well as blip does

i guess we're into a new era, something's happening 
Trust the Process ™️
Is there any reliable way to upscale a busy image with a lot of small buildings in the distance?
I've had mixed success in these cases (at least using this model https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler)
this is the wrong channel for questions like that. but you should ask about controlnet in #1072238304042438758 or #💬|general-chat or #🏞|general-with-images
Sorry about that, my mistake! Thanks
np
@hot breach have you noticed the breakdown of terminal SNR capabilities for darkness at some point?
this is about the best i can get for solid black background and it only really works at the 1:1 aspect ratio the model was trained on
left side is terminal SNR and right is offset noise at 1280x720
maybe at some certain scale it could, beyond what I've tested but I've done many 10k steps on a 25k image dataset of just random assortments of things and not noticed issues
hm, interesting
and you just grab the trained betas from DDIM and pass into DDPM?
for training noise scheduler
my test data is not very concentrated on bright or dark in particular
I use the algorithm that was in the paper to create trained_betas and pass that in to the noise schedule as trained_betas which diffusers accepts
DDIMScheduler.from_pretrained(model_root_folder, subfolder="scheduler", trained_betas=trained_betas)
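For reference, a plain-Python sketch of the beta rescaling as I understand it from the "Common Diffusion Noise Schedules and Sample Steps are Flawed" paper: shift sqrt(alpha_bar) so the last timestep is exactly zero, then scale so the first timestep is unchanged. The starting schedule (scaled-linear, 0.00085 to 0.012, 1000 steps) is assumed to match the SD config, and the function names are made up.

```python
import math

def scaled_linear_betas(n=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed SD-style schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]

def rescale_zero_terminal_snr(betas):
    """Rescale betas so the final timestep has zero signal-to-noise ratio."""
    # Cumulative product of alphas, stored as square roots.
    sqrt_ab, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        sqrt_ab.append(math.sqrt(prod))
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is 0, then scale so the first value is unchanged.
    sqrt_ab = [(v - last) * first / (first - last) for v in sqrt_ab]
    ab = [v * v for v in sqrt_ab]
    # Convert the cumulative products back into per-step alphas, then betas.
    alphas = [ab[0]] + [ab[i] / ab[i - 1] for i in range(1, len(ab))]
    return [1.0 - a for a in alphas]

trained_betas = rescale_zero_terminal_snr(scaled_linear_betas())
print(trained_betas[0], trained_betas[-1])  # first beta unchanged, last beta is 1.0
```

The resulting list is the kind of value that would go into the trained_betas argument in the line above; whether it needs converting to an array first depends on the diffusers version.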
samples from fairly early on, it can do both light and dark great
well discord unfortunately cuts the previews, click and look at the left sample (cfg 7)
"cg render of a tree ent on a few small branches with leaves on its body, with a black background" this was not even 1 epoch in
many more epochs in still stable on bright and dark images
high contrast, faces still good, etc
logo is potato quality for lack of data (just a random sample I grabbed from middle of training a bunch of random data), but it nails the contrast at least and colors look good to at least illustrate the zero terminal snr works well
try other resolutions
i was trying prompt solid black background from CFG of 3 to 9 and rescale of 0 to 0.7 and the closest i get is at like 9 CFG and 0 rescale
in a square resolution. in a widescreen one i get, washed out uniform look
i can get images with bright scenes, no problem
the model still makes great images at a square res
this is what i get in widescreen for a nighttime prompt
hm i guess it's not great in square either lol
50 steps with offset noise and it's so much better
@undone fable i needed a bit of offset noise and perturbation to bring darkness in line, but the results with CFG rescale are interesting
left is cfg rescale at 0.7 and right is 0.0
the cfg rescale change to "prevent washed-out images" seems to also prevent images from becoming too dark
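A sketch of what the rescale knob does, following my reading of the same paper and written on flat lists instead of tensors. After normal classifier-free guidance, the result is scaled so its standard deviation matches the conditional prediction, then blended back with the unscaled result by the rescale factor. The names and toy numbers are assumptions.

```python
import statistics

def cfg_with_rescale(cond, uncond, guidance_scale, rescale):
    """Classifier-free guidance followed by std-matching rescale (0 = plain CFG)."""
    # Standard CFG: push away from the unconditional prediction.
    cfg = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    # Match the spread of the guided output to the conditional prediction.
    factor = statistics.pstdev(cond) / statistics.pstdev(cfg)
    rescaled = [x * factor for x in cfg]
    # Blend between the rescaled and the original guided output.
    return [rescale * r + (1.0 - rescale) * x for r, x in zip(rescaled, cfg)]

cond = [1.0, -1.0, 0.5, -0.5]
uncond = [0.2, -0.1, 0.0, 0.1]
for phi in (0.0, 0.3, 0.7):
    out = cfg_with_rescale(cond, uncond, 9.2, phi)
    print(phi, round(statistics.pstdev(out), 3))
```

Higher rescale values pull the output's spread back toward the conditional prediction, which would fit the observation that it tames both blown-out highlights and overly dark results.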
1024x1536 native gens 😄
Hello everyone,
I'm looking for a small fine-tuned stable diffusion model, that output max 256x256 images, you know any?
deepfloyd kinda?
i compared my model's photorealism vs. that of deliberate, dreamshaper, and realistic vision
mine is arguably worse, but it also doesn't have any noise fix applied, and i would argue my model is more diverse regarding the output, e.g. the clothing, faces, backgrounds, etc
the grid doesnt show the negative prompt but it was:
anime, cartoon, digital art, cgi, render, 3d, drawing, sketch, instagram, pastel, dada, zombie, ugly, surreal, text, watermark, abstract, old, fat, jpeg, black and white, vintage, amateur, film grain, evil, damaged, concept, unfinished, model, cover, clay, figure, toy, pixelated, bad, inexperienced, illogical, random, oversaturated, overexposed, rough, fake, unrealistic, sloppy, artificial, low budget, unprofessional, cropped, out of frame, low-quality, poorly drawn, deformed, bad proportions, malformed, imperfect, unnatural, extra, rushed, weird
I was trying to train a Lora but some photos in the dataset are 1080x1920 and others are different sizes. I wanted to keep the full images in the dataset without resizing or cropping out parts as they were important for the lora training to be what i want to create. Is this possible or would everything need to be 512x512
you can condition them so the smaller side is 512 and preserve aspect ratio
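rough sketch of the size math, in case it helps. the rounding to a multiple of 64 is my assumption (a common choice so the sizes divide cleanly into latents), and the function name is made up:

```python
def fit_short_side(width, height, short_side=512, multiple=64):
    """Scale so the smaller side equals short_side, keeping aspect ratio.

    Both sides are rounded to the nearest `multiple` (an assumption here,
    not a requirement stated anywhere above).
    """
    scale = short_side / min(width, height)
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

# A 1080x1920 portrait photo keeps its shape instead of being cropped square.
print(fit_short_side(1080, 1920))
```

so a 1080x1920 photo ends up around 512x896 rather than being squashed or cropped to 512x512.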
is this through colab lora training?
@undone fable you're right about the 1920x1080 not holding up even if it "(kinda) learns how to do it"
it'll generate dupes about 20% of the time and the rest of the time it doesn't produce a duplicate you get a weirdo like that
but this model works way better with hires fix than one that i trained only on square 768
yea at a certain point the lowest level feature size as well as the attn blocks alone are going to have a bad time
by the former part you mean when you make a 512x512 image and it's all artifacted?
yeah as far as i can tell that's because the attn layers learn to expect a large res image with details that can't be expressed
is 2e-05 too low of a learning rate for a 108 image dataset? i did 200 steps and it was super undertrained
that might be too little info, can you elaborate on what style of training and what you're training on
it sounds like you're doing a LoRA?
i am trying to make a model of an artist i like in prosthetic here for example one of the images
man that's hard to look at
i think you need to do like 800 steps but it depends on your batch size. if it's very high, it'll need a higher LR. if you're using regularization / class images (prior preservation loss technique) it'll also take longer to train
just a warning that if your images are all blurred and pixelated like that one you're going to have a bad time
there's other parameters for Lora i'm not versed in, like rank etc
i mostly do general finetuning. i have not messed with lora. be sure to update us with your progress!
got it what ratio for checkpoints do u recommend because it could overtrain before 800
inpainted his head but idk if the head's size matches
try textual inversions for your dresses and subjects. you can train them in not-very-long, and they result in very clean and noise-free images, especially on 2.1
textual inversions are going to become the tool of choice moving forward, the SAI devs have stated that TIs in SD 2.1 are as powerful as LoRAs were in 1.5, and TIs in SDXL are as powerful as 2.1 LoRAs, etc
everything shifts up in power and capability as the number of parameters in the model increases
SDXL will likely not require massive fine-tunes to be amazing, just a few tiny textual inversions.
Oh, great, thanks a lot, I haven't investigated this yet. Before I do some research and test - shall I just drop the idea to train on 'style' or instance/class for that particular case? I just wanted to make sure I am not trying to reinvent the wheel 🙂
@stiff dust can we (you and i) somehow make a new VAE for SD 2.1?
it is the main issue now that my noise issues are resolved (this model looks smooth like DALL-E 2)
Hiring a stable diffusion developer,
I'm currently launching a startup that uses stable diffusion as a base for image generation, So far I have trained my own models and LoRA's but I need someone who's more advanced in model training. for example, someone who can help come up with workflows to get specific results using features like controlnet, or automate higher quality captioning, and squeeze more quality than my average dreambooth style trainings.
I'm looking forward to starting with a few paid gigs and then we can agree on a long-term arrangement.
feel free to shoot me a DM if interested.
I'm not nearly there yet; but a post in #1011228667659178055 might help.
Hello everyone, I was running into a NaN loss when creating a Lora/DyLora. I've checked all the usual suspects: No nans in input data for the given step; reasonable captions present, etc. Does anyone have any tips to resolve the NaN loss? I've tried kohya-ss/sd-scripts as well as cloneofsimo/lora and both create this issue. I was on an AMD GPU so I thought it might be related but I debugged deep and it seems to be overflows happening somewhere.
In fact, the unet produces all NaNs for non-nan inputs.
on amd use fp32
I tried that too -- It NaNs out, only at a later stage. 😦 I also rented a paperspace machine with Nvidia, it still NaNs out there as well with fp16.
make sure you arent using snr gamma
So, set it to zero? Thank you so much for the guidance, btw.
yeah snr gamma seems to be broken these days tbh
it causes NaNs here every time but i'm on v_prediction models like 2.0-v and 2.1-v
At 800 steps it didn’t learn the 4 eyes still
Can you help me or show me how to train a textual inversion
unfortunately i've not got time to
Yeah I'm training both U-Net and TI as I need to inject a few keywords.
Any tips on fine tuning a subject with glasses and a beard?
I have had great luck with training on women subjects:
- no classification images
- 30~ clear varied images
- 50~ steps per image (around 1600 total steps)
- text files with classifications and tokens (xyz a woman wearing a tank top, standing outside)
- lion optimizer
Using these settings (plus a couple of others) i can train a female subject almost every time.
However, male subjects with glasses and a beard give me bad results. Deformed faces, terrible eyes, etc.
Any tips?
Anyone mess around with training GANs? Playing around with the idea of using a GAN to produce SD training data but wasn't sure how easy it was, if even possible, to teach a GAN concepts like a specific face.
GANs are definitely used to generate or improve training data. the CelebA-HQ dataset comes with a processing script that takes each downloaded celeb picture and upscales it using RealESRGAN
something i've discovered now is that if you train the text encoder on its own on a set of captioned data and then, separately train the unet on top of that frozen text encoder, the results are really powerful
training them together sucks ass and should not be done
when you train them concurrently, the text encoder will learn to represent the relations between concepts before the unet has a chance to learn how to represent them faithfully
seems that the unet picks the concepts up a lot sooner with the altered procedure
messed with open-flamingo a bit the last day, I made a script to run locally in ED2 so you can bulk caption with it:
python caption_fl.py --data_root input --min_new_tokens 20 --max_new_tokens 35 --num_beams 3 --model "openflamingo/OpenFlamingo-9B-vitl-mpt7b" --temperature 1.0
it will take hints from the example image/caption pairs you supply (2 seems to be enough) and then caption your images, seems to be accurate and know a lot of proper names
the authors of open-flamingo itself have a demo up on HF space you can use to try it out without loading anything: https://huggingface.co/spaces/openflamingo/OpenFlamingo
it uses example image/caption pairs to prime the model (here, the first two), then you provide the novel image for captioning (third one here), here their demo uses the humus and sign already captioned, then I gave it the image of cloud on a ladder and it captioned it for me, without me telling it who cloud strife is at all
cant wait for kohya to support the new sdxl 0.9
it already does
i don't think it supports the new pieces of training but no one seems to care about the aesthetics score input lol
@hot breach teaching 2.0-v terminal snr and widescreen. there's a bit of oscillation to it figuring out the noise schedule where it seems to get it early and then really starts to pick it up
it's supposed to be just the glowing TV in a dark room
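for reference, this is roughly what "terminal snr" rescaling does to the beta schedule, going by my reading of the "Common Diffusion Noise Schedules and Sample Steps are Flawed" paper. plain python lists instead of tensors, and the function name is just for illustration:

```python
import math

def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final timestep has zero SNR."""
    # Cumulative product of alphas, then its square root.
    alphas_bar = []
    running = 1.0
    for b in betas:
        running *= 1.0 - b
        alphas_bar.append(running)
    sqrt_ab = [math.sqrt(x) for x in alphas_bar]

    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is exactly zero, then scale so the first
    # value is unchanged.
    sqrt_ab = [(s - last) * first / (first - last) for s in sqrt_ab]

    # Convert back to per-step betas.
    alphas_bar = [s * s for s in sqrt_ab]
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]
```

the last beta comes out as 1.0, meaning the final timestep is pure noise, which is what lets the model produce properly dark (or bright) images instead of always landing on medium brightness.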
Hi, i hope i'm in the right channel to ask this: i want to add a lesser-known celebrity to the absolutereality model, can i achieve that with dreambooth?
hey guys, may I know if there is any channel here where I could find regularization images for LoRa training? (looking for realistic women)
FFHQ and imdb-wiki perhaps?
LAION-Faces
Are there any good guides on how to build lora?
When you say clear images do they need to be a certain resolution?
hello 🙂
what can cause loss to be > 0.4?
training lora with kohya on SD-2.1 model with pytorch 2.0
When you're finetuning a model to a specific persons face via Dreambooth training, should the class be included on both the instance and class sections? (& if so, should the class also always be included when prompting with the unique token?)
Example:
Instance: "a photo of XYZ123 person"
**Class:** "a photo of a person"
I typically see some version of the above in the majority of tutorials/guides, but then in the same content - I'll see the prompting examples they provide after completion exclude the "person" part of the unique ID.
"a highly realistic photo of XYZ123,….." **vs.** "a highly realistic photo of XYZ123 person,….."
I recently read that the exclusion is a common mistake, but with how fast things change and how much conflicting information I see across the resources - I feel like it'd be a better idea to get the answer here.
They are going to be resized to 512x512, so bigger than that. Otherwise they just need to be sharp, and not blurry.
Here best Realism fine tuning
please don't spam here. you spam on huggingface forum too. just stop spamming
So smaller resolution photos I shouldn’t bother with? I’m attempting to do letters/words/logo. It’s for logo which is just words. I want to use it on products like a coffee packaging.
How would you go about arranging photos for this?
I'm using Booru tags. I want a word to be the focus, which tags would be best to use?
for instance i want to use a logo with only words on it, and attach that (word/logo) onto coffee cups, t-shirts, signs, and other products
Openflamingo works really well for captioning
I'm pretty sure StableDiffusion is terrible with words. Think deepfloyd is more geared towards what you're trying to do.
I’ve seen the monster energy drink logo on stable diffusion. I’ll look into the one you’re talking about, thanks
Once i downloaded it the issue disappeared
Is that an add-on I can plug into stable diffusion or a completely separate ai?
only holding (still morphing characters in monsters)
It's not perfect but it looks decent
@quiet moat Is there anything in the pipeline to improve words and text in images?
guys
what is "dreambooth dynamic image normalization"
what does it do?
does it resize my images?
Separate AI
im making a lora about diamondhead from ben 10
hes a very complicated character for the ai
so far i made a lora but
it looks
well.........
these are the best ones
that look like him
and arnt just green diamonds
what do u advise me to change?
i have 144 images
all from the original show
no fan art
i have some promotional images of him in different artstyles so idk about that
i used the google colab one by hollow strawberry
Wonder why they made a separate one it’s from the same company?!?
That answer is probably more technical than it's worth getting into
That being said, I would look into using ControlNet rather than training LoRAs
Wait what, Can you explain why?
I’m not following that one does art as well not only words, what would be the major difference? If it’s turning noise into a picture
I’m trying to understand this if anyone feels like a breakdown or even a link
honestly, if you're new to this, the answer to that isn't going to make sense and isn't going to provide you with any useful information that you can apply to what you're working on
like if you're at the "how do I train a LoRA stage", the answer to "why is deepfloyd better at text than stable diffusion" isn't something that's worth you exploring unless you already have a good understanding of how the whole diffusion process works
but you can start here for foundational stuff: https://arxiv.org/pdf/2205.11487.pdf
Thanks!
which algo are you using for captions? What's the best one, especially to tag details like dirt on skin, water droplets, clothes, lighting, noise
use ViT-bigG-14
any idea how it compares to BLIP2?
BLIP2 is just a way of using these text embeddings
brother
the text encoder provides embeddings
i need advice
oh, I thought BLIP2 could create captions / tags for images
it does
does my role make me invisible
@dusty pawn you haven't asked a question yet.
i did
I want to tag my dataset accurately, and there are too many images to do it all manually
oh, from days back
yes
well i don't train LoRAs so, unfortunately i can't answer
it looks complicated, the goal
i would recommend trying a textual inversion
they can be much simpler
this is the best one so far
so I guess the question is BLIP2 vs. ViT-bigG-14?
which one will create tags/captions which include objects, clothes, lighting, and info about image quality (blurry, grainy, noisy, jpeg etc.), because I don't want these elements to be 'baked in' to the Lora
how do i do that
also not a thing i do, but i understand how it works. as such i don't know which tools support it or how to prepare a dataset for it
I'm assuming 'best' is best? Or maybe caption?
you'll have to test
Hello! I would like to create a model based on my own characters and signature style. I have completed a crude model successfully with Dreambooth, but I have read in a few places that it may be replaced by EveryDream? Specifically because I would like to train multiple characters, I've read that it has advantages over Dreambooth for that purpose. Can anyone shed some light on this? Or provide any resources where I could learn more?
If you are training unet only is it able to learn new tokens? I havent really noticed my activation token making a difference in my prompts
does anyone have any scripts/tools the use to automatically check a prompt or lora on different models, different loras, or with a scripted change/iteration in the prompt?
Hope someone can correct me if this is incorrect, but the U-net actually gets a text representation from the diffusion model's text encoder, so the U-net isn't actually seeing individual tokens like the text model does. Fine-tuning the U-net with new tokens can lead to some gains if new tokens have been introduced or the text encoder has been changed in any way
you mean the "x/y/z plot" script?
we can use weights in tags? like
(jpeg:1.5)
I am trying to dreambooth with 4000 images, I am running out of vram even with a 24GB 4090
I can do fine with 2000 images
r u guys experiencing this
I think the voting should be done the way midjourney did it. It made it easy to rate a lot of images. Rating in discord is inconvenient, therefore fewer people will vote
@pseudo tulip Hi! Wonder if you could implement the adaptive offset noise function to the kohya colab?
https://github.com/huggingface/diffusers/pull/4041
single-file ckpt files got a major boost in Diffusers today.
When training a LoRA, how do you get it to keep an eye patch on the correct eye in every generation? Any tips?
Better off just controlling it during generation using inpaint sketch or controlnet. Or make sure your training set has that specific orientation/pose
Thanks for the reply. Not really feasible to inpaint with Deforum gens though. My training set always has the eyepatch on the correct side and I'm not making flipped copies
Has anyone had any luck making a symmetrical body part asymmetrical in a consistent way? E.g. a robot arm that's always on the left arm?
aye guys where can i find some kohya training configs for XL?
and use this program for an easy ui (uses original kohya ss in background)
https://github.com/derrian-distro/LoRA_Easy_Training_Scripts/tree/SDXL
based on testing since then - these settings still yield the best results
@pseudo tulip would you reply? Do you even reply on the GitHub for the kohya colab?
How do you train or something similar with amd gpus?
anyone had any luck with Dreambooth and SDXL? I have trained a few in the last day or so using a face like usual and it seems like it works as far as generating images but it will pretty much only generate images almost exactly like the training images. Prompts barely change anything at all
I would try editing the image as best as I can and then merging it with i2i, though it usually helps to give both parts of the image the same style
hello is there a simple tutorial on finetuning for beginners? i want to create a minecraft texture pack model but i don't think a lora will consistently be able to create 16x16 pixel art
First Ever SDXL Training With Kohya LoRA - Stable Diffusion XL Training Will Replace Older Models : https://youtu.be/AY6DMBCIZ3A
How to install #Kohya SS GUI trainer and do #LoRA training with Stable Diffusion XL (#SDXL) this is the video you are looking for. I have shown how to install Kohya from scratch. The best parameters to do LoRA training with SDXL. How to use Kohya SDXL LoRAs with ComfyUI. How to do checkpoint comparison with SDXL LoRAs and many more cool stuff.
...
I got so many noisy pics. I can denoise them, but then they start to look anime. Any way to get the best of both worlds?
By that you mean low quality? You can try adding that to the captions, "low quality photo, grainy"... See if doing that helps prevent it from learning that
I did that, but it didn't work out. Hence I'm working on denoising now. The photos look a bit cartoonish after denoise... hopefully it doesn't learn that.
It seems to be that any photos have some noise/jpeg, even if I go on a stock image website, there are almost no 'perfect' photos... so how are others not having this problem?
Maybe you're just overtraining, and the noise in the photo isn't the issue, hard to say
a bird
Has anyone gotten it down pat to get a product into scenes with a LoRA? I have seen cars and a few things. But I have yet to see ... say for instance .... a microwave or a certain type of dress with 100% accuracy. In theory it should be quite possible
(for photoreal)
Planning to render a 3d Object at multiple angles and lighting styles and hard train a lora into it. It should be quite possible. Just haven't seen it in the wild. Curious if anyone has seen
Hi guys what would you suggest for finetuning on 100+ objects? or even 1000+?
Do you mean you want the model to be able to learn 1000+ objects or you have 1000+ images of an object? If it's the former, you're probably better off with textual inversion embeddings and for the latter, you can do whatever you want (1000 is a lot of images for finetuning!): LoRA, Dreambooth or Textual Inversion embedding
Would clip captioning be good enough for attempting to train multiple styles into a model? or would using multiple keywords/concepts as well work better?
are there any resources for generating regularization images for SDXL? I'm finding it impossible to run kohya_ss full-finetune/dreambooth training on my 24GB card, so I'm preparing a jupyter notebook to run on a rented host. I'd like to be able to run the training with regularization images, but I'm falling short on tools that can use SDXL to build that imageset. Maybe I'll just have to write a script that calls the SDXL inference script in kohya and build it that way?
I've got Comfy running too, but I recall regularization images should use the prompts from the captions that'll be used for the training images, if I'm not mistaken. In that case, just typing in the class prompt and generating a ton of images with SDXL in comfyui wouldn't quite be the same, right?
not sure about sdxl but the best outputs I've had from 1.5/2.1 dreambooth was when I used canny controlnet + prompt for the regularization images vs just prompt alone.
I've seen this happen if you are using noise offsets
you can also counter this a bit by using a noise offset inverse of what you were using before
Has anyone else run into this error while running kohya training?
  File "/content/kohya-trainer/library/train_util.py", line 1190, in __getitem__
    example["latents"] = torch.stack(latents_list) if latents_list[0] is not None else None
RuntimeError: stack expects each tensor to be equal size, but got [4, 104, 128] at entry 0 and [4, 72, 128] at entry 1
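that error reads like two images with different resolutions landing in the same batch (a [4, 104, 128] latent would be 1024x832 pixels and [4, 72, 128] would be 1024x576, assuming the usual 8x VAE downscale). the general idea of bucketing is to only batch images that share a size. a minimal sketch of that idea, not kohya's actual code, with made-up names:

```python
from collections import defaultdict

def batch_by_resolution(items, batch_size):
    """Group (name, (width, height)) pairs into batches of a single size.

    Every batch returned contains only images with identical dimensions,
    so their latents can be stacked into one tensor.
    """
    buckets = defaultdict(list)
    for name, size in items:
        buckets[size].append(name)

    batches = []
    for size, names in buckets.items():
        for i in range(0, len(names), batch_size):
            batches.append((size, names[i:i + batch_size]))
    return batches
```

worth checking that bucketing is actually enabled in your config, or dropping to batch size 1 to confirm that mixed sizes are the cause.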
Anyone has done a successful TI of SDXL? If so could you please share the settings you used?
I recently spent some time to break down HyperDreambooth and all the diffusion training methods. If you've ever wanted to know how LoRA or Dreambooth actually works, check this out: https://open.substack.com/pub/aibits/p/paper-break-hyperdreambooth?r=2lknwd&utm_campaign=post&utm_medium=web
Does anyone have experience training a model for product photography that can provide enough detail that it can accurately reproduce the information on a label? Is it possible?
great stuff, subscribed for more
a pity there is no opensource implementation yet.
anyone doing dreambooth or loras on apple silicon with mps? It seems mixed precision support is an issue everywhere. I have lot of RAM though, so thats not a limitation but training is slooow. I’ve tried to do a dreambooth on sdxl using https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md
Thank you! 🙏🏾
what captioning method is recommended for SDXL? 🙂
Why TI and not dreambooth?
Anyone knows why my lora is not working for SD XL? 🙂
RuntimeError: The size of tensor a (768) must match the size of tensor b (640) at non-singleton dimension 1
I’m using a custom module to train subjects that best work with TI
Do you find TI better than Dreambooth when it comes to the resemblance of faces to original images?
Anyone able to train LoRAs on 12gb ?
Dreambooth is always better than just TI but for any kind of training, training the text embeddings always help tremendously
I guess you're using a LoRA made for SD 1.5. SD 1.5 and SDXL are different neural networks, so the LoRAs are incompatible.
I'm using Kohya_ss to finetune sd xl for the first time. But i get stuck because of a runtime error where NaN is detected in the latents. What can i do about this?
I also struggle to install triton, which might be the problem. Any tips on how to get that working?
are you on windows?
if you're on windows you don't need to use triton, it's incompatible
Yes, thanks. Then it's something else. Ill just wait for some more tutorials to be released then
i was just stupid and tired in the evening and downloaded the refiner and not the base i was training it on 🙂 . All working as expected.
Hey. New here, but have been tinkering with SD for a while now. My technical knowledge on it is very limited, so forgive me if I make a bit of a dumb here. I'm looking to render images in blender to act as training data for a LoRA, but I'm not entirely sure what I should be aiming to render out. I aiming to work on a set of clothes, and a second one for a particular character. Should I make face and head shots a focus for characters? Is there anything I should try to avoid doing?
Anything at all really or a link to any sort of guide on this topic would be much appreciated.
For the character, I recommend using 25+ headshots primarily of just the head and a few that are more zoomed out (I got solid results with 1 or two images that were upper body and the rest shoulders and head).
Since you are doing a cg character, you can even swap out HDRIs and fiddle with the face controls / general pose a bit for each image. Try to get a few angle shots as well.
Avoid shots where something is overlapping the face.
Allegedly dramatic lighting is bad too, but I included a few in my training data while still getting solid results.
Would it be wise to include some facial expression changes per shot, or would a static expression help keep it focused on the face details themselves?
If your characters default expression is unique, it's probably best to use that in most images. (Just include each unique expression, like a disney character reference sheet)
But if they have a regular resting face, then I would recommend facial expression changes, yes.
Any particular minimum number of shots you'd recommend?
Are we just plugging SDXL into regular DreamBooth? for example, training on LastBen's dreambooth Colab works?
25-40
Thank you for your suggestions, they're much appreciated! Oh, I was also going to ask - last time I tried LoRAs, I ran into an issue of them having a very grainy low detail look, like it was painted in large brush strokes on a very noisy and rough canvas. Is this the "overcooking" that I've seen googling around, and are there any changes I should make to my training parameters/data set in order to alleviate it?
For 1.5: https://civitai.com/articles/391/tutorial-dreambooth-lora-training-using-kohyass
This guide has served me well. The only way I differ is by cropping all my shots to a square 512x512 with: https://www.presize.io/
For SDXL: I have no idea. I can't get it to run faster than 10s/it 😦
Oh, I've not seen this guide, thank you. I haven't touched SDXL yet as I'm still a little too attached to what's familiar, ha
oh, so wait my training images don't have to be 512x512? Should I be rendering them higher?
You dont have to have 512x512 images, but I have found it performs better when you do.
For rendering, it'll probably save the most time to just render them at 512x512 with just what you need, then there's no need for presize.io
how much slower fine tuning sdxl compared to sd1.5? approximately on A100 80gb for example
only base without refiner
I have a question about how the SDXL uses the CLIP model as the paper is a little hard to follow.
Does SDXL concatenate the output of the two CLIP models into a 2048 vector, as seems to be implied by the table, or is it doing something more complex, as implied by this paragraph (which I'm finding hard to follow):
"we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel-axis [1]. Besides using cross-attention layers to condition the model on the text-input, we follow [30] and additionally condition the model on the pooled text embedding from the OpenCLIP model. These changes result in a model size of 2.6B parameters in the UNet, see Tab. 1. The text encoders have a total size of 817M parameters"
The reason for asking is I'm wondering if I can avoid finetuning by doing manipulations in the CLIP embedding space
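my reading of that paragraph is that it is a per-token concatenation: each of the 77 token positions gets the CLIP ViT-L hidden state (768 wide) and the OpenCLIP ViT-bigG hidden state (1280 wide) glued together, giving 2048 channels per token for cross-attention. a shape-only sketch with dummy numbers (nested lists instead of tensors, names are made up):

```python
SEQ_LEN = 77              # tokens per prompt
CLIP_L_DIM = 768          # CLIP ViT-L text hidden size
OPENCLIP_BIGG_DIM = 1280  # OpenCLIP ViT-bigG text hidden size

def concat_channels(a, b):
    """Concatenate two [seq_len][dim] arrays along the channel axis."""
    assert len(a) == len(b), "both encoders must see the same token count"
    return [tok_a + tok_b for tok_a, tok_b in zip(a, b)]

# Dummy penultimate-layer outputs standing in for the real encoders.
clip_l_out = [[0.0] * CLIP_L_DIM for _ in range(SEQ_LEN)]
bigg_out = [[1.0] * OPENCLIP_BIGG_DIM for _ in range(SEQ_LEN)]

context = concat_channels(clip_l_out, bigg_out)
print(len(context), len(context[0]))
```

the pooled OpenCLIP embedding mentioned in the second sentence would be a separate, single 1280-wide vector per prompt that conditions the model in addition to this sequence, so any embedding-space manipulation would probably need to touch both.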
In the kohya colab, one of the preparation sections said that
All 175 images have captions
No tags found for any of the 175 images
So what's the difference between a caption and a tag? Thanks
anyone know what i should do with a dataset of 4k images
it's too small for a model but too big for
anything else really
Im wondering how i can reduce the size of my output LORA training with Kohya sdxl 1.0
All the lora checkpoints come out at around 1.3 gig and id like to compress them a bit
fine tuning goes wrong 😮
rip
i have created a lora with multiple concepts, if i create a text input that adresses this lora how are the sub concepts address? just via the same prompt?
also is it possible to merge a lora directly into a model?
are there any tutorials for locally fine tuning using SDXL? How much VRAM do you need? (I got 96 GB)
you have 96 on multiple cards i guess right?
2xA6000
so from what i have seen you can run around a batch of 1 on 24gb but only with some tweaks, so i would expect 96 to handle around 3-4
how are lora weights applied to a model when i add it to the prompt? is adding a lora to a model having the same result as training the model directly?
Hello, very nice anime, I like it very much.
how do i fix the fingers and feet of my characters?
use negative prompts like: bad hands, bad feet, etc. That did it for me most of the time, although I get the occasional bad ones
do you know how many sampling steps would be generally good?
and the cfg scale?
I use 20+ steps, usually 20-50 depending on the model
and cfg I like to keep between 7-10, usually just 7.5
alr, ill use that in the inpainting process
have you messed with cfg scale scheduling and mimic cfg scale?
would you finetune datasets organizing photos by folders? e.g. "front facing", "from behind", "side view", "laying down", "2 people". Maybe it makes sense to train those concepts separately, merge them, then train over them at a low LR?
You need network alpha on 1, very important otherwise you overtrain. Network dim can be 264 but you can go higher as you have more ram. I'm lately having a lot of success using ~100+ total steps per image. So that includes steps per image multiplied with epochs. Learning rate using cosine (which goes to 0 from the starting) and adafactor. Starting at 0.0004 was ok, now I'm trying 0.001. also regularisation like dropout seems to work really well. Finally the weights and biases API is really useful, highly recommend it.
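to make the arithmetic in that concrete, here's a small sketch of the "~100 steps per image" budget and a cosine schedule that decays to zero. the function names and the example numbers (31 images, batch 2) are just for illustration, not anyone's actual config:

```python
import math

def total_steps(num_images, steps_per_image, batch_size=1):
    """Optimizer steps needed so each image is seen steps_per_image times.

    steps_per_image here means repeats multiplied by epochs.
    """
    return math.ceil(num_images * steps_per_image / batch_size)

def cosine_lr(step, total, base_lr):
    """Cosine decay from base_lr at step 0 down to 0 at the final step."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total))

steps = total_steps(num_images=31, steps_per_image=100, batch_size=2)
for s in (0, steps // 2, steps):
    print(s, cosine_lr(s, steps, base_lr=4e-4))
```

the point of the schedule is that the last chunk of training runs at a very small learning rate, which is gentler than stopping at full LR.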
appreciate the tips, but at the moment I'm not even sure where to start. Do I just use the same diffusers library that can be used to train SD1.5?
I use kohya ss, then put the base xl model as the one to train
I can make a quick word document with links and send it if you want
that would be super helpful, I'm sure to more people than just me
Here is a first draft. Maybe it makes sense to make this a web page or something, if anyone wants to help we can crowdsource the best settings. These are just the ones I figured out the last few days.
Anyone have information or links to reading on SD training with non-DDPM noise scheduler?
I haven't used KohyaSS, but I've been reading the code for sd-scripts, and the training script doesn't make assumptions about image size; so as long as the dataset loader doesn't either, it should work.
hi guys, does anyone know where to download or own any repository of good regularization images of real life female / male pictures? i plan to train using SDXL and good reg images could improve the quality much better
Working on my first lora. I couldn't find a set of regularization images that were good enough, so I've been building my own. I have 135 decent 1024x1024 images (and about 31 training images). Is 135 enough for regularization?
I'd skim some youtube tutorial, but I get the sense they all, at best, have a link to an image reg dataset that doesn't fit my usecase well enough.
Rather than building their own (to find out how many is enough to get by)
searched the discord server for "how many regularization"
looks like I might be ready to progress
135 is more than enough. Depending on your video card your compute time will be massive. I'm talking days if you have a multiple of that in training images..
- Computer time was pretty long but each step was taking like 5 min instead of 1.5s like the tutorial. I changed the bfloat to uncapped experimental and it’s running 10k steps while I’m here at the gym. 1.2k finished by the time I showered.
Almost forgot, thank you for the answer
Just generate them with the model. Why do I say this? You want the AI to learn what's different or special about your subject vs the 'class'. If you have regularization images, it will compare your subject images vs that to discern what's different and learn that. Therefore, you want your subject images to be what stands out, if that makes sense. For that reason, I wouldn't even spend a ton of time curating them. However, if your subject has a particular body type, or you want the model to draw it better, you could train multiple subjects into the lora, like a body type, and a likeness. Think of it like a game, you say what is a woman, and it spits out an example, and you say no, no... And point it to a fine tune example
Any suggestions to train a lora on sprite sheets, that usually has lower dimensions (96x128)?
I would like to train a lora for sprite sheets on SDXL.
Will I be able to produce 96x128 on XL?
should I scale everything up to train and generate, then scale down?
i got your point, thanks for the help 😄
hi guys, another question, has anyone ever tried to train a Lora using the SDXL refiner? can i know what GPU you guys used? RTX 3090 or A6000?
does anyone have experience in upscaling into very high resolutions ?
For SDXL LoRA Training
kohya gui, main branch, and use this config file
for anyone wanting to make loras
epoch and max epoch need to be adjusted. 40 for normal loras, 80 for very complex loras (big dataset/faces/anatomy).
(also obviously adjust batch size, to whatever your card can handle)
repeat on dataset folders = 1
There'll be a comprehensive guide eventually on what all the settings do - but for now: that config will work as long as your dataset isn't too bad
expect this to run around 7 min per 10 epochs on a normal card. 16gb vram guaranteed - 12gb vram should work if your overhead is low enough. <- lower batch size to whatever your hardware can handle. 12gb vram should be able to handle batch 1~2
24gb vram can handle up to 8~12
if you have 24gb vram card -> run at batch 3 (only a minute slower) -> then you can continue using comfy during training, to test your checkpoints and see if you're happy and can stop training early.
also, just in case anyone's curious about file size. the full details & nuance of a face fit in dim 1 - we just use 8 cause we want the 8:1 ratio of dim to alpha.
setting it higher usually won't give better results, as dim 8 is big enough that I've fit the concepts of 100 complete dresses + faces + anatomy into it. So unless you're doing a full finetune level of lora with more than 5k images, dim 8 will be good enough
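For anyone wondering what the 8:1 ratio does in practice: in the usual LoRA formulation the learned update gets multiplied by alpha / dim, so dim 8 with alpha 1 scales it down to 0.125. Rough sketch of that arithmetic below. The function name is made up for illustration, and it assumes the trainer follows the standard alpha / dim scaling.

```python
def lora_scale(network_dim: int, network_alpha: float) -> float:
    """Multiplier applied to the LoRA update, assuming standard alpha / dim scaling."""
    if network_dim <= 0:
        raise ValueError("network_dim must be positive")
    return network_alpha / network_dim


# dim 8 with alpha 1 (the 8:1 ratio from the config above)
print(lora_scale(8, 1))
# for comparison: alpha equal to dim means no down-scaling at all
print(lora_scale(8, 8))
```

A smaller multiplier means each step moves the effective weights less, which probably explains why that ratio tolerates a fairly high learning rate.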
For captioning, all old 1.5 rules still apply.
Dataset size of 50~100 is recommended to avoid the typical training pitfalls. 10~30 works if you know what you're doing. less than 10 can work if you really know what you're doing, and have your captioning down to a science.
Datasets above 100 definitely improve the model, on the condition that you can keep up proper captioning. (better to have 50 well-captioned images than 500 badly captioned ones) - quick and/or automated methods of captioning for sdxl will be in the guide
in an ideal world you'll have a .txt file for each image with tagging like this:
<trigger word>, caption, caption, caption, caption, caption, caption, <background description>
(seriously, don't forget background descriptions unless you want your trigger word to also affect the background)
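If you want to script that caption layout, here is a minimal sketch. The helper names and the example tags are hypothetical; it only assumes the convention described above (one .txt next to each image, trigger word first, background description last).

```python
from pathlib import Path


def build_caption(trigger: str, tags: list[str], background: str) -> str:
    """Join trigger word, tags, and background description into one caption line."""
    parts = [trigger, *tags, background]
    # Drop empty entries so we never emit ", ," in the caption
    return ", ".join(p.strip() for p in parts if p.strip())


def write_caption(image_path: str, trigger: str, tags: list[str], background: str) -> Path:
    """Write the caption to a .txt file with the same stem as the image."""
    caption_path = Path(image_path).with_suffix(".txt")
    caption_path.write_text(build_caption(trigger, tags, background), encoding="utf-8")
    return caption_path


# Example with made-up values; prints the caption without touching the disk
print(build_caption("mychar", ["red dress", "smiling"], "forest background"))
```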
I just wrote up my findings from training LoRAs the past few days: https://medium.com/@berd0stad/training-sd-xl-1-0-loras-22cc1daa20b
Have been training a few LoRAs for Stable Diffusion XL 1.0 these past days. Here is a short start-up guide with a step-by-step process.
Nice to see you have different settings. Why do you use --network_train_unet_only? And constant instead of cosine learning rate, especially with such a high LR. Is a network dim of 8 enough? I use 264 and it makes the file sizes huge. Also why not use any of the dropout functions? I find they really help generalise the models.
Also why not use full training with bf16? It frees up a lot of VRAM. I could sneak in an increase in batch size by enabling it.
What tradeoff do you make between number of epochs and steps per image?
I mean I literally said in the message that I trained 100 concepts into dim 8 (42mb LoRA) - yes dim 8 is enough as long as you stay under a 5000 image dataset. If you go above that, then these settings won't really help you much, as you're already aware of what you're doing
unet only is a complicated topic, that can't be properly explained in a few sentences. I'll be mentioning that in full in the guide - as well as everything you need to change about dataset + settings if you do plan on training the open clip layer.
In short without much explanation: sdxl now has 2 clip models, and they are set up in a complicated way. If you train clip, then in 99% of cases you are actually damaging the whole sdxl model while doing so. If you are making waifudiffusionXL, then that is fine, as you don't really care about the abilities the clip used to have - but in our case of normal LoRA training, we really really don't want to cause damage to the model. Especially when it takes longer to train, to often receive worse training data. The only downside of not training clip is that your trigger word should be close to what you're training, else you'll need twice the epochs to achieve good results.
Thank you, look forward to your guide!
but also to justify it a bit without proper proof, here is a screenshot from the readme of kohya himself
(can be found here -> https://github.com/kohya-ss/sd-scripts/tree/sdxl )
dropout & cosine can be used to improve your training - but they aren't fool proof. Essentially I want the json to be able to be used by anyone who's training for the first time, and achieve good results.
If you understand the changes that schedulers make - then you should eventually move over to cosine with restarts, once you have a grasp on when the model starts to deteriorate. But do keep in mind that this is no longer SD1.5, and a lot of knowledge about the precision settings that was applicable to training before, is now no longer correct.
I saw a lot of warnings about compatibility whenever it was mentioned by the various UI devs for kohya - so I've held off on trying it, since I can only test on my 4090.
i use bf16 exclusively and its fine, just cast it to fp16 when you are done
?
not sure if you're referring to repeat vs epochs, or generally what happens once you hit high epochs
steps = image count * repeat * epochs
just a matter of where the math happens
if your steps go high enough, you will either start to overfit the model, or have it break down completely. Due to 8/1 ratio this doesn't happen till fairly late though.
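Quick sketch of that step math, including what a batch size above 1 does to it. The function name and example numbers are made up, and it assumes one optimizer step per batch; with aspect ratio bucketing the real count can come out slightly higher because each bucket is batched separately.

```python
import math


def total_steps(image_count: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """steps = image count * repeat * epochs, divided by batch size (rounded up per epoch)."""
    steps_per_epoch = math.ceil(image_count * repeats / batch_size)
    return steps_per_epoch * epochs


# 50 images, repeat 1, 40 epochs
print(total_steps(50, 1, 40))                # batch size 1
print(total_steps(50, 1, 40, batch_size=4))  # same run at batch size 4
# repeats and epochs are interchangeable in the total
print(total_steps(50, 4, 10) == total_steps(50, 1, 40))
```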
Does KohyaSS support training images of any aspect ratio? I want to train an SD XL model with images with a broad range of aspect ratios
yep. #🔧|finetune message does it with as little vram as possible. if you have 24gb vram, you can get more aspect ratio buckets by switching the bucket size (resolution step) from 128 to 64.
(i'd still recommend leaving it at 128 though, as the benefit is minor, but vram issues can occur if you have too many aspect ratios)
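To get a feel for why a step of 64 gives more buckets than 128, here is a simplified enumeration. This is a sketch and not kohya's exact bucketing code; the function and its defaults (1024x1024 area budget, sides between 256 and 2048) are assumptions for illustration.

```python
def bucket_resolutions(max_area: int = 1024 * 1024, step: int = 128,
                       min_side: int = 256, max_side: int = 2048) -> list[tuple[int, int]]:
    """List (width, height) buckets whose sides are multiples of `step`
    and whose area stays within `max_area`."""
    buckets = set()
    width = min_side
    while width <= max_side:
        # Tallest height (multiple of step) that keeps width * height in budget
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # mirrored orientation
        width += step
    return sorted(buckets)


coarse = bucket_resolutions(step=128)
fine = bucket_resolutions(step=64)
print(len(coarse), len(fine))  # the finer step yields more buckets
```

More buckets means images get cropped or resized less to fit, but each distinct resolution is another shape to hold in memory, which fits the vram warning above.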
Training an SDXL lora in kohya_ss and it is incredibly slow compared to previous runs. I'm on an RTX 3090. Has anyone else experienced this?
How many seconds per iteration are you running?
use this. shouldn't take more than 30 min to train a complete lora, with room to spare.
#🔧|finetune message
why is it that when I train a lora, the sample images look absolutely disgusting
but when I try the same lora on automatic1111 webui, it still doesn't look good but it's way better
(this is supposed to be ramlethal from guilty gear strive)
Wait doesn't "additional_parameters": "--network_train_unet_only", stop accompanying caption files from affecting training?
because kohya T.T
if you have enough vram, you can do training + comfyui at the same time, to test your checkpoints. other than that, no good solution at the moment
nope. it means that the clip model won't be affected, but the unet model still relies on captions to shift weights
thank you! What I am doing is generating the sample images with the auto1111 webui in full cpu mode during training; it takes about 20 minutes to generate a 1024x1024 image with 20 steps
@hollow spruce do you know any LLM finetuners? i want to finetune a model to generate SD prompts
Hey guys. How much VRAM do I need to finetune SDXL model? Would a 4090 24gb be enough?
I am fine tuning with 16Gb VRAM so yes
Nice, good to know. Thanks for the answer!
Hello!
Which video card is better: GIGABYTE GeForce RTX 3060 Ti EAGLE OC (LHR) 8G or GIGABYTE GeForce RTX 3060 GAMING OC 12G?
more ram is more better
Only go for non-ti versions if they have more vram
3060 ti is barely better than the 3060, but since it has 4gb less VRAM, the 3060 wins by miles
on 5000 steps estimated time is almost 300 hours
Now I'm using different settings and the estimated time is 36 hours for 6240 steps
did a 1.5 Lora in 18 minutes in between these
how many images do you have?
39
Caith's is assuming 1 sample per image, and 40-80 epochs
so it should be roughly 1560-3120 steps (39 images x 40-80 epochs)
far less assuming a batch size over one
my last 28 image lora training was 320 steps following caith's settings
1 sample per image? Is that a setting in kohya?
on your /img/ folder, you have the number before the lora name
it should be 1_[name] [class]
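If you want to double check what the trainer will read from a folder name, here is a small sketch that splits it the way described above (repeat count before the underscore, then name and class). The function name and the example folder are hypothetical.

```python
import re

# "<repeats>_<name> <class>", e.g. "1_mychar woman"
FOLDER_PATTERN = re.compile(r"^(\d+)_(.+)$")


def parse_dataset_folder(folder_name: str) -> tuple[int, str, str]:
    """Return (repeats, name, class) from a dataset folder name."""
    match = FOLDER_PATTERN.match(folder_name)
    if match is None:
        raise ValueError(f"expected '<repeats>_<name> <class>', got {folder_name!r}")
    repeats = int(match.group(1))
    # Everything after the first space is treated as the class (may be empty)
    name, _, class_name = match.group(2).partition(" ")
    return repeats, name, class_name


print(parse_dataset_folder("1_mychar woman"))
```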
yes, those are your latent caches