#🔧｜finetune


ruby pond
#

not dreambooth, but I've been running fine tune for sdxl with batch size 8 fine on my 4090

restive bridge
ruby pond
restive bridge
#

hmm I'm also on adafactor. I think if you turned off all optimizations you'd be limited to 2 batches as well. it takes 23.7gb. I dont think there's any vram difference between DB and finetune with the same settings

elder hull
#

Same training procedure and database + VAE , different models.

Original NAI (VAE: vae-ft-mse-840000-ema-pruned) toshaka_v1_nai (subject) toshaka_v1_nai (subject_filewords)
vs Nothing v2.3 (VAE: vae-ft-mse-840000-ema-pruned) toshaka_v1_nothingv23 (subject_filewords)

I would have assumed the curse of recursion hit but...
If I use the TI trained relative to NAI to produce images with Nothing v2.3, it works just fine, so the concept should be representable within Nothing v2.3 (given that inference is just like training without the gradient update)

gentle flame
#

Perturbed noise messes with zsnr stuff. I know someone finetuning with it, making a vpred zsnr model. The zsnr node from comfy stopped being scuffed when perturbed noise was removed.

#

Just putting that out there for anyone that plans on messing with Perturbed noise.
It also makes outputs noisy btw. Known because the problems with noise disappeared after removing it completely.

grizzled jungle
#

I'm getting results like this when I trained a lora. I checked, the model works just fine without the lora in place. These are my sdscripts prodigy settings. Clip skip is 2 since I'm training on a model that's good for clip skip 2. Anything that's wrong with these settings?

dusky urchin
dusky urchin
# restive bridge Since I'm using ground truth for reg I'm pretty sure I'm literally just finetuni...

but I just cant tell if batch 2 is worse for faces or not
i haven't investigated faces very much, but imo if you are impatient, these choices matter, and if you are patient, they matter a lot less. since i think you are the patient type, is there a specific issue you are trying to address? it's not very insightful to say that "your dataset is probably too small," but maybe you can directly gather more images and captions.

restive bridge
stiff dust
#

gradients are averaged all the time. I have no clue why fine-tuning sometimes works better with lower batch size, but this idea of the dnn getting confused if it sees two images at the same time is just wrong
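
To make the "gradients are averaged" point concrete, here is a minimal sketch in plain Python. The toy model (`y_hat = w * x` with squared error) and the sample values are assumptions chosen only for illustration; nothing here comes from an actual SD trainer. It compares the mean of the per-sample gradients with a finite-difference gradient of the batch-mean loss.

```python
# Toy model (assumption): y_hat = w * x with a squared-error loss.

def per_sample_grad(w, x, y):
    """Analytic d/dw of (w*x - y)^2 for a single sample."""
    return 2.0 * (w * x - y) * x

def batch_mean_loss(w, batch):
    """Mean squared error over the whole batch."""
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def batch_grad_numeric(w, batch, eps=1e-6):
    """Central finite-difference gradient of the batch-mean loss."""
    upper = batch_mean_loss(w + eps, batch)
    lower = batch_mean_loss(w - eps, batch)
    return (upper - lower) / (2 * eps)

# Hypothetical data: four (x, y) pairs standing in for a batch of 4.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.5

mean_of_grads = sum(per_sample_grad(w, x, y) for x, y in batch) / len(batch)
grad_of_mean = batch_grad_numeric(w, batch)
print(mean_of_grads, grad_of_mean)
```

The two numbers agree up to floating-point error: the batch update is the average of what each sample would have asked for on its own.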

sharp prawn
#

I think I was supposed to post this here: Say I generate a person with a black t-shirt in txt2image, I bring it to img2img and I want to apply my own person png design on that shirt, how would I go about doing that? Do I use control net?

dusky urchin
# stiff dust gradients are averaged all the time. I have no clue why fine-tuning sometimes wo...

lower batch sizes usually converge to a lower loss in backpropagation https://stats.stackexchange.com/questions/316464/how-does-batch-size-affect-convergence-of-sgd-and-why

#

i think you know that but it has been my experience, for at least the decade i've been using accelerated backpropagation, that lower batch sizes take longer but converge to a lower loss

#

fine tuning conditional unets in particular: no clue.

dusky urchin
#

i haven't had the bandwidth yet to experiment with (1) pretraining a LoRA (2) creating contrastive validation sets, but my expectation is that both of those things will be a big improvement on dreambooth style regularization

stiff dust
#

no, I never heard of this effect and never experienced it myself. Also, none of the answers in the blog post is convincing. Yes, larger batches have more stable gradients which could increase overfitting. But this happens for very large batches, not for batch size of 4 or 8.

pure crypt
#

Hello guys! I recently started my journey with SD and Loras. I have already made some models using real people (like the ones you can find on youtube tutorials), but this time I have the chance to create a unique one thanks to a model that is willing to help me. What would the "perfect" img folder be like? I mean, for my past models I used varied pictures of women with different clothes and backgrounds...this time, I can actually use a studio to get the pictures. Any advice?
Sorry if this isn't the right section for this question...

jade hornet
#

If you take 30 photos of a model in a studio, you're likely to get that kind of output in your inference. The advice is the same as always, variation of portraits, medium shot, full shot... Try to vary the background or the training will pick that up

pure crypt
jade hornet
#

Caption helps, but it'll still come through if you overtrain

dusky urchin
# pure crypt hmm, I see, I will keep that in mind, thanks a lot! And now that you mention th...

you are at the start of a long journey. you should maybe start by trying to train a lora into a mixamo character. you are not going to get this right on your first two tries.

is there a way to make the LORA training NOT pick up the background info that much? other than variation of course.
for every concept you want to have a hope and a prayer that CLIP learns, you will need a contrast. this basically means at least 8 images per concept: the positive and negative of the concept for each of training, validation and test, and a regularization for the positive and negative.

so for example, if you want to be able to show the concept of any background versus a solid background:

white background w/ person, contentful A background w/ person
blue background w/ person, contentful B background w/ person
green background w/ person, contentful C background w/ person
regularization: red background w/ different person, contentful background w/ different person
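
The "at least 8 images per concept" count above can be enumerated as a quick sketch. The split and polarity names below are my own labels for the scheme described in this message, not part of any training tool.

```python
from itertools import product

# Labels are assumptions that mirror the scheme described above.
SPLITS = ("train", "validation", "test")
POLARITIES = ("positive", "negative")

def contrast_slots(concept):
    """List the image slots one concept needs: a positive and a negative
    for each split, plus a positive and a negative regularization image."""
    slots = [(concept, split, pol) for split, pol in product(SPLITS, POLARITIES)]
    slots += [(concept, "regularization", pol) for pol in POLARITIES]
    return slots

slots = contrast_slots("solid background")
for slot in slots:
    print(slot)
print(len(slots))  # 3 splits * 2 polarities + 2 regularization = 8
```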

then you actually have to use validation and test...
okay you're reading all this and thinking: wait a minute, nobody writes about this online. basically people are running datasets that are way too small, and they are usually overtraining or undertraining, ad-hoc testing some random prompts they wanted. it's up to you. if you like the results for the generality (or lack thereof) that you are trying to achieve, using community methods, you know, nothing else matters.

pure crypt
# dusky urchin you are at the start of a long journey. you should maybe start by trying to trai...

Very interesting, thanks a lot! I did a training test with 4 different backgrounds and I can see SD always including some of its elements. For example, in one of the backgrounds we had a wooden wall, so it's trying to put wood in every "indoors" prompt. I will have to make a dataset in many places then. Regarding the regularization, that is pretty useful I will see how I can make use of that.
Lastly and extra question: the same will apply to her clothing right? For example, if most of the real shots are taken in one set of clothing, it will try to make that part of the results too right? Or is clothing easier to manipulate?

dusky urchin
dusky urchin
pure crypt
dusky urchin
# pure crypt aah, could you give me an example please? I more or less understand what you mea...

the more examples of different clothing the model will see, the better your fine tuning will generalize to clothes that didn't appear in your dataset, especially in the details.

let's say your model is wearing a specific brand of Leota dress in all the shots.

rare_token is a woman wearing a leota dress

will make all dresses and the brand leota associated with generalizable features about your actress. as you start to overtrain, these features will appear more and more in test queries. you can use regularization to prevent the dress from being associated with rare_token: provide regularization images captioned leota dress of the exact dress being worn by different people.

rare_token is a woman

will have no impact on dresses. you can use regularization to prevent all women from looking like your actress.

at this stage, even without regularization, your fine tuning's ability to generalize will correspond to how much it is under/overtrained. it will still be perfectly capable of generalizing, and it will: your actress will be able to wear other clothes, even if you only had data of one outfit. why? because somewhere in the history of denoising steps that would create your image's dress, there is a common ancestor of noisy image between your image and another image with a different dress.

#

the captions are used to make the text encoder be able to express something specific in your image, and to provide more coherent contrasts between different ways your image interacts with concepts. so having your actress wear multiple outfits will make clothes fit much better on her generally.

pure crypt
# dusky urchin the more examples of different clothing the model will see, the better your fine...

Thanks a lot for the explanation! it's a bit clearer but I still need to study the documentation a lot more. This brings me to one more question: hypothetically, what would happen if my actress were (almost) naked? Could a lora work under these circumstances?

Oh, and one more that came to my mind while typing the first: what would be the best tool for captioning, other than manually doing it?

dusky urchin
pure crypt
#

ok, thanks a lot for real!

frigid pier
#

Haaaalp, how do I fix one part from an image? In my case a belly, it has weird anatomy in some parts and a belly button is missing.

SOLVED: combination of low values of steps and cfg scale with middle noising value was able to do it. I’m guessing, when doing small areas inpaint, smaller values work the best. At least with JuggernautXL checkpoint.

#

Tried with inpaint but the results suck. Denoising from 0.3-1.0, bad results. Cfg scale 3-8 also, and combination of these two. Sampling rate, different masking etc same thing.

#

Maybe I just patch it in Photoshop and do inpainting again to fuse things together

blissful vine
#

Can anyone recommend some good upscalers for use in a1111? I want something that smooths the images like an anime upscaler, but not as much. I want to lose the texture effects that many images have.

brisk elbow
#

Is there a good resource for personalization fine-tuning I should start with?

icy silo
#

can i change the vae of an already generated image? if yes, how? I dont want the image to change at all

latent charm
#

You need to regenerate the image and use another vae

icy silo
#

so that wont work

latent charm
#

I don't know what the purpose of using another vae is if you already do a lot of inpainting. But you could load the image and pass it to another vae in comfyui

icy silo
#

i guess its time to switch after all, im still on a1111, everybody has been recommending comfy

#

i hope the vram req aint higher?

stiff dust
amber bay
#

Hi, do you know any fully transparent diffusion model on hugging face or other ? (-> a model where we exactly know which data were used for the training?).

stiff dust
#

I think SD 1.5 was just using LAION, wasn't it?

icy silo
stiff dust
#

the vae is just a compression. It wont make your image more colorful or anything. The only thing that can happen is that when you pass the image through the vae it loses some quality and colors. However, you cannot get that back afterwards. It's like when you compress an image as jpg and you see jpg artifacts. You cannot remove them easily afterwards. You would have to do a whole img2img pass.

icy silo
#

but img2img is changing the intricate details even on low denoising strength

#

you got any tutorial on that?

stiff dust
#

to be honest, if the image looks washed out it might be the vae but in most cases it's somewhat different

#

if you do a lot of inpaintings with blurred mask this leads to a washed out effect

grizzled jungle
#

Can anyone help me with a lora training thing? I'm kinda at a loss and I need help.

#

Basically, I'm trying to train a lora on a character, and it seems like the results I'm getting are... subpar compared to the capabilities of the model I trained it on... First four are with the lora, the next four are without the lora... Main model is VividOrangeMix, the metadata should be in the pics, and these are my current settings for training said LORA.

#

I hope someone can help...

cold cliff
#

Hey. I've been trying to find out how to train the detection models for ADetailer, but I cannot find any documentation on it. Are there guides out there for building a dataset and training, and can it be done with a consumer GPU?

Use case: whole head detection for non-human characters like Mass Effect asari where face detection works but the errors I'd like to correct are often made in the details around the face.

dusky urchin
dusky urchin
#

adetailer itself uses the bounding box to create the mask

#

the mediapipe models make a tighter mask, but i am not sure how significant it is

#

you can certainly try to refine the output of mediapipe, but it's not designed to be trained with more images

#

you don't have the gradient weights, and it's not architected around a workflow of "pretraining" versus "training"

#

you could write code to use insightface instead

cold cliff
#

I meant training a custom model to get the bounding box around the whole head. I've seen models on CivitAI for eyes, but no doc on how they trained it.

dusky urchin
#

can you send me a link?

cold cliff
dusky urchin
#

i think you are misunderstanding what the model is. there aren't any amateurs in the civitai community training eye detection systems. they simply adapted something else that already exists.

#

that file is basically a serialized python blob that configures one of the pre-existing models to match eyes

#

@cold cliff does that make sense?

cold cliff
#

I got it. I wanted to know if it was something mortals could train, and that these models aren't trained by the amateur community answers my question.

dusky urchin
#

it will probably be easier to just expand the bounding box

#

a little bit
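
A minimal sketch of "just expand the bounding box a little bit", in plain Python. It does not use ADetailer's or ultralytics' API; the `(x1, y1, x2, y2)` box format, the `expand_box` helper, and the 25% margin are all assumptions for illustration.

```python
def expand_box(box, image_size, margin=0.25):
    """Grow an (x1, y1, x2, y2) box by `margin` of its width and height
    on each side, clamped to the image bounds."""
    x1, y1, x2, y2 = box
    width, height = image_size
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return (
        max(0, round(x1 - dx)),
        max(0, round(y1 - dy)),
        min(width, round(x2 + dx)),
        min(height, round(y2 + dy)),
    )

# Hypothetical face detection on a 1024x1024 image.
face_box = (400, 200, 600, 440)
head_box = expand_box(face_box, image_size=(1024, 1024), margin=0.25)
print(head_box)  # (350, 140, 650, 500)
```

For a whole-head mask you would probably want a larger margin on top than on the sides, but that is a tuning choice and not something stated in this thread.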

#

that specific eyes detection file is yolov8 aka ultralytics

cold cliff
#

Or just not be lazy and use img2img inpaint, I guess. 😛

grizzled jungle
dusky urchin
#

like i don't know this character or what it is trying to learn

grizzled jungle
dusky urchin
#

an example from your dataset

#

just 1

grizzled jungle
#

This is technically one of them, except the one in the dataset isn’t transparent.

#

What do you think?

dusky urchin
#

do you have images with backgrounds (you don't have to share it)? and how many total images do you have?

grizzled jungle
dusky urchin
#

your best bet is to be patient and wait until there is more art of the character though.

#

stuff like your choices of optimizer and such, the net impact is that training goes faster (or slower). if you're patient, your result is going to reflect your dataset one way or another

grizzled jungle
#

Well, I managed to generate a pretty good Lora with 13 images once, I just lost the config for it and am trying to make things work.

dusky urchin
#

hmm

#

i'm saying from a scientific, hard facts point of view, what i am saying is true. you can use basically any configuration, as long as you are patient. does that make sense?

#

there isn't a lost config idea that will make this "work."

dusky urchin
#

yeah

#

i mean it's tough with 12 images

grizzled jungle
#

Yeah…

grizzled jungle
dusky urchin
#

yeah i think your config looks good. like you did everything right

#

you can generate more content from those 12 images, to help it generalize stuff like backgrounds, but it's not going to get a pixel-perfect recreation of the artistic concepts without a lot more data. in my experience, for characters that are art-directed like anime and video games, you need something like 100-1,000 unique instances to get the art direction right every time, and closer to 1,000-10,000 to get the exact representations right in the style they were presented in.

#

so the art direction to me means the character's silhouette, proportions, wardrobe & jewelry, palette

#

to get the face right perfectly you need a lot more representations. i don't think people in the community are releasing stuff that actually looks like the characters they think they do, because they are biased by what their own eyes focus on when comparing dataset images to generated outputs

#

something like spiderman probably appears in sdxl's aesthetic laion2b handcrafted database 10,000-100,000 times, at least, and in different styles

#

another way to think about this is, your training set is really
O(number of artistic scenes k * (number of concepts you want to generalize CHOOSE number of simultaneous usages of those concepts)). if you are okay with having exactly 1 additional thing you want to generalize in addition to your character, such as "character on a different background" or "character wearing a blue hat" but never "character wearing a blue hat on a different background", you need 10-100 scenes and each one has to illustrate len(background, hat, ..., other concepts)^2 contrasts, so = O(k * n C 2) ~ O(k * n^2). the real limitation is creating contrasts, not gathering enough different poses, which is just 1 concept.
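
The estimate above can be turned into a small calculator. This is only a sketch of the arithmetic as I read it (scenes times "concepts choose simultaneous usages"); the function name and the example numbers are assumptions, and the result is an order-of-magnitude guide rather than a rule.

```python
from math import comb

def contrast_images(scenes, concepts, simultaneous):
    """Rough image count: k scenes * C(n concepts, r used at once)."""
    return scenes * comb(concepts, simultaneous)

# Hypothetical numbers: 10 scenes, 5 concepts (background, hat, ...).
print(contrast_images(10, 5, 1))   # one concept varied at a time: 50
print(contrast_images(10, 5, 2))   # pairs of concepts combined: 100
print(contrast_images(100, 10, 2)) # 100 scenes, 10 concepts, pairs: 4500
```

Since C(n, 2) = n(n-1)/2, the pairwise case grows roughly with n^2, which is where the O(k * n^2) figure comes from.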

#

am i making sense?

#

this stuff requires way way way more data than the community thinks it does

grizzled jungle
#

It’s just a shame because I genuinely like this character.

grizzled jungle
latent charm
#

You could generate more similar images using style aligned and ip adapter to create the diversity. Also, you could add the character to different background.

fast ocean
#

For anyone who's trained an SDXL Lora on a person (realistic not anime), what Net Dim and Alpha do you use? I'm struggling to find the right combo. Struggling with all the settings, actually. It seems to be quite a bit harder to find the right settings for training SDXL Loras than it was for 1.5.

stiff dust
#

dim 16-48 for unet, text encoder can be much less

#

I keep alpha=1, but you can try higher if you want

fast ocean
# stiff dust I keep alpha=1, but you can try higher if you want

Honestly the biggest issue I have is that when I train an SDXL lora on the Base XL model, it only works with that model when loaded into A1111. For example, if I tried to use the lora with Juggernaut XL or Realism Engine, it loses most of its resemblance to that person. I used to have the same issue with 1.5 loras until I started training with Photon instead of the base 1.5 model, but with XL I don't really know what would be the equivalent of that.

stiff dust
#

this is rather a problem of Juggernaut than sdxl

#

juggernaut is just highly overtrained

#

however, training on Juggernaut instead of base is also extremely difficult, because juggernaut uses a lot of strange training settings such as pyramid noise. I never succeeded in training on Juggernaut and, therefore, stopped using the model at all

#

currently I'm mostly using Dreamshaper XL which is not overfitted (all loras work normally on it) while still being good in all styles and photorealism. I bet there are other good models, too, which are not severely overfitted

jade hornet
#

Agree with most of that, but would add a comment. In general I would simply say that certain trained models have different strengths than others, even the base. I find that juggernaut handles multiple subjects really well as an example. You can get an idea of whether that's the case by how certain loras perform with the various different models, how they draw anatomy varies, how they render detail varies, etc. So I wouldn't necessarily discourage training against some of these models as a rule.

fast ocean
#

I considered it but since it's a turbo model i thought perhaps it wouldn't be the best thing to train on.

stiff dust
#

no, I train on sdxl base

#

I said that loras trained on base are transferable to most models, including dreamshaper xl

#

if they don't work well on particular models (like Juggernaut XL or RealVision) then this might be an indication that this model is overfitted

fast ocean
stiff dust
#

no, I said that Juggernaut is overtrained and that's why it doesn't work with your lora

fast ocean
#

oh, ok. But my lora doesn't work with ANY model besides base SDXL.

stiff dust
#

people overtrain their models and then they merge them with other overtrained models and at some point nothing works anymore xD

stiff dust
fast ocean
#

yeah got no resemblance with dreamshaper. Honestly I don't like the turbo models. But I've tried all the most popular models and nope, doesn't work. I had the same issue with 1.5 until I started training on a non-base model.

stiff dust
#

hm, that's strange. Do you use any weird training parameters like pyramid noise or a too high or too low offset noise?

fast ocean
#

I haven't seen those settings in Kohya SS so I don't think so. I'm not the most experienced person when it comes to training loras.

dusky urchin
fast ocean
dusky urchin
# fast ocean I would like to make a Lora model based on a real person and be able to make rea...

“what’s the application”
like what kind of realistic looking photos? what is the person doing? or is it unconstrained? like what is the idea? for example, "i want to create a #vanlife instagram except with this personality's likeness" means, okay making it look like instagram is 90% of the way there, but i don't know if juggernaut will ever be able to generate someone sitting on top of a van correctly, juggernaut did put the lady on the van pretty plausibly! or maybe a specific image that you can think of that you would like to recreate this person's likeness in

#

when you say can't get it to work properly - it depends mostly on your patience. how long have you run a training, on how many images?

#

to diagnose your fine tuning, take a look at tensorboard.

fast ocean
# dusky urchin > “what’s the application” like what kind of realistic looking photos? what is t...

Oh, ok, I understand. I think just being able to make generic instagram style pics would suffice for me. With SD 1.5 I would make ones with the person camping, standing in urban/suburban areas, backyard photos, pics inside restaurants etc just casual candid style photos. I wouldn’t have even bothered with SDXL had I not been intrigued by the fact that it seems easier to pose people differently with SDXL than with 1.5. And as far as the part about not getting it to work properly, I’ve tried everything from 2000 steps to more than 10000, tried training with a very small photo dataset and a large one, etc doesn’t seem to change the fact that the pics come out strange.

dusky urchin
#

and how many photos is a small dataset versus a large one?

#

you are able to generate images correctly with other community LoRAs, right?

fast ocean
#

Yes I haven’t had any issues with other loras. Small would be anywhere from 10 photos to 20, large would be from 80 to 100. 2000 steps would probably take me about 45 mins to an hour at the most. 10,000 probably 3 to 4 hours.

dusky urchin
#

okay

#

so for my application, replicating a likeness for flexible creative cinema-style scenes, i use 1,000-10,000 images for a "basic" level of flexibility. A LoRA based training for me is at least 50,000 steps (aka 50 epochs), which is about 16h on my fastest configuration and hardware.
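
For comparing step counts between setups, here is a sketch of the usual steps/epochs arithmetic. It assumes one optimizer step per batch, no gradient accumulation, and an optional per-image repeat count; the function name is mine and real trainers may count steps differently.

```python
from math import ceil

def total_steps(num_images, epochs, batch_size=1, repeats=1):
    """Optimizer steps for a run, assuming one step per batch."""
    steps_per_epoch = ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs

# 1,000 images at batch size 1 for 50 epochs matches the figure above.
print(total_steps(1000, epochs=50))               # 50000
# The same 50 epochs over 100 images at batch size 2 is far fewer steps.
print(total_steps(100, epochs=50, batch_size=2))  # 2500
```

This is why a raw step count such as "2000 steps" says little on its own: it means very different numbers of passes over 10 images than over 100.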

#

this is what i mean by patience. i think you are probably off by an order of magnitude for the amount of patience you need, even for a LoRA training, on commodity hardware. you should be ready to wait close to a week for 10,000 images.

#

or. you can choose a configuration that happens to work very well for faces, and lean into the fact that people recognize celebrities better than ordinary people, so they're going to be much more forgiving when you don't get the right appearance.

#

then you can expect things to go faster. but every misconfiguration can be solved by patience

#

how did you caption 10,000 images?

#

or is this with no text encoder training?

fast ocean
#

Oh I think you misunderstood me. I didn’t use 10,000 images

dusky urchin
#

oh i'm sorry

#

100 images

fast ocean
#

Yes. I manually captioned the 100 images myself.

dusky urchin
#

so you should still be spending in the 10s of hours if you are not sure if your configuration is correct

dusky urchin
#

hmm. well maybe a better question is, are you using prodigy as your optimizer?

fast ocean
#

Adafactor.

dusky urchin
#

all problems with adafactor can be solved by patience.

#

if you are impatient, try prodigy

fast ocean
#

I have patience, it’s just that I’ve probably done about 50 tests in the last week and seeing no improvement. It’s just confusing 🙂

dusky urchin
#

it's complicated, but the community guides make it sound like 30 minutes is enough time to train something

#

in my experience, it almost never is

#

however, you might have some other issues. usually the text encoder learning rate is too high for sdxl. this is an issue that prodigy can deal with for you

#

prodigy is an optimizer that was overfit, in a sense, for training facial likenesses on instagram style generations

fast ocean
#

Oh I know it takes many hours. I’ve basically been running steady tests for the last 8 days straight lol. So if I try prodigy, what learning rate, unet learning rate and text encoder learning rate should I use?

dusky urchin
#

you don't make those choices with prodigy

#

can you give me an example of a caption you authored?

fast ocean
#

So with prodigy I should set them all to 1?

dusky urchin
#

i actually never deal with the configuration files of kohya directly, i use the objects, but i think the documentation says exactly what it should be. my guess is it does not matter, given how prodigy works

#

i am surprised you are getting no improvement whatsoever

#

so maybe there are some other flaws

#

8 days is a lot for a casual or enthusiast application too. what is this really for?

#

i am supportive but it would help me understand your expectations

#

short of showing the images themselves, which i assume this isn't an anime character, it's a real person

fast ocean
#

the weird thing is that often the samples look quite good as it's training the model, and then when i load it into a1111, they look poor. Oh I'm just on holidays and recently got a new computer so I've been playing with it a lot, that's all, lol. It's just for fun and I like learning.

dusky urchin
#

hmm

#

what are you loading into A1111? if you are doing a LoRA training, you will have a file that is ~150-300MB, ending in .safetensors, with _NumEpochs suffixed to it, until it is done

#

how are you visualizing samples? you mean you configured it to generate something, for which prompts?

#

another POV is you should be using ComfyUI

fast ocean
#

yes, I get safetensor files that I load into the Lora folder of A1111. I'm using the sample feature of Kohya SS, where you enter a prompt and it generates a photo every set number of steps. unfortunately comfyUI is way too far out of my comfort zone as a newbie to AI generating.

dusky urchin
fast ocean
#

Often the samples look good but not great. And then in A1111, they often look blurred, squished, etc. maybe it’s actually an issue with the way I’m generating my pics? I know I’m doing something wrong I’m just trying to figure out what lol.

normal ember
#

At what size do you generate?

fast ocean
normal ember
#

Should be good. If you had been at less than a megapixel, you would have issues like the ones you describe.
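
A quick way to check the megapixel rule of thumb for a given SDXL generation size, as a sketch. The 1.0 MP threshold is taken from the message above; the helper names and the example sizes are assumptions.

```python
def megapixels(width, height):
    """Image area in megapixels (millions of pixels)."""
    return width * height / 1_000_000

def is_below_one_megapixel(width, height):
    """True if the size falls under the ~1 MP rule of thumb."""
    return megapixels(width, height) < 1.0

for size in [(1024, 1024), (832, 1216), (768, 768), (512, 512)]:
    print(size, megapixels(*size), is_below_one_megapixel(*size))
```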

dusky urchin
#

anyway i think you will figure this out

#

i gotta go

normal ember
#

Text encoder overfits quickly so it's probably a good idea to only train the unet to begin with and see where that gets you.

fast ocean
#

Oh I can train with only one or the other?

dusky urchin
#

if the sampled results look fine, i think there's an issue with how you are using the web ui

#

there isn't really a fine tuning bug here anymore

fast ocean
#

I just tried that "Pixar" Lora you shared above using the same prompts in one of their images, and it did not work. So Maybe it is a problem with my A1111? Also, I didn't say all the samples looked fine, most of them are demented, lol. So I think it's probably a mix of both issues

#

i appreciate your help, i'll try a few more things and see if anything helps.

normal ember
#

Don't give up! 🙂

dusky urchin
fast ocean
#

I don’t think so. I used the exact settings from the prompt they used on the Lora page you gave me. I’m not using any SD 1.5 vae or models, I used the SDXL base model to generate the pic

#

I thought the same thing at first.

fast ocean
fast ocean
# dusky urchin it sounds like you are probably mixed up about which checkpoints you are using w...

Just wanted to report back and let you know I discovered the solution to both problems: my SDXL photos were coming out so bad in A1111 because I had my CFG turned down to 1 (don’t even know how I missed that). As far as my Lora training goes, I went back to my original settings which worked fairly well, and I turned off “don’t upscale bucket resolution” which for some reason immensely helped me! My images are now coming out normal and my Lora looks really good. Thanks everyone for your help 🙂
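
On why CFG 1 looks so bad: a sketch of the standard classifier-free guidance formula in plain Python. The three-element lists stand in for the model's noise predictions and are made-up values; this is not A1111's actual code.

```python
def guided_prediction(uncond, cond, cfg_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

# Hypothetical noise predictions without and with the prompt.
uncond = [0.10, -0.20, 0.30]
cond = [0.40, 0.10, -0.10]

at_scale_1 = guided_prediction(uncond, cond, 1.0)
at_scale_7 = guided_prediction(uncond, cond, 7.0)
print(at_scale_1)  # same as `cond`, up to floating-point error
print(at_scale_7)  # pushed further along the (cond - uncond) direction
```

At scale 1 the formula collapses to the plain conditional prediction, so the prompt gets no extra emphasis and the unconditional (negative prompt) branch cancels out entirely.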

patent moat
#

Hi ! How can i have the rights to use the bot ?

pure prairie
#

hi

dusky urchin
#

@stiff dust i am exploring full fine tunes on multiple machines, do you have any experience or opinions about this?

stiff dust
#

never worked on full finetunes as loras work already fine. Pseudoterminalx works a lot with full finetunes

stone garden
#

how can i create an image here?

zenith delta
#

Hello everyone!

I've trained a whole bunch of LORA's and Checkpoints on real persons/styles/objects/concepts previously with an overall great success, but there was one thing that I was never able to achieve properly.

How can I train a real person, for the purpose of anime generation? I've tried with both SD 1.5 and SDXL and achieved only a semi-success with SD1.5! Has anyone tried something similar before?

For my semi-successful attempt, I trained the NAI model with 25 of my images, removing the background from images that had complex backgrounds. The images featured myself wearing various outfits, and had images in 2 different light settings (frontlit and backlit). I've also made sure to avoid overtraining by adding 10 full body images, whereas I only use upper body and face close ups for realistic trainings. Overall the images were of high quality, and performed really well when I used the same dataset for a realistic training. I've used booru style captioning, captioning an activation phrase for myself, and then not captioning any details about myself, but only my pose and outfits. During the training, I used a Network Rank of 64, and Alpha of 32. I've used Adafactor optimizer, and my learning rate for the Text Encoder and UNET was both 0,0001.

I was able to generate images featuring myself using this LORA, but the LORA would completely change the style of the base model, usually for the worse. It generally changed the style from a 2D drawing to either a 3D illustration, or in worse cases a 3D Model. The likeness however, was usually great!

Looking forward to hearing your opinions, and tips!

stiff dust
#

In general that is not a problem. Use validation images with anime or cartoon prompts and check how they behave during training. If you overtrain, it does happen that the images turn into photos. But if you stop early enough this shouldn't happen. Check if your training photos are all captioned with "photo of" or similar tags. I would also reduce the rank from 64 to a smaller rank. In particular for the text encoder, you can use a MUCH smaller rank (like 6 or 8). I found text encoder training very vulnerable to style overfitting (=everything turns into photo), but I also found it hard to train on unet only. Maybe try to stop text encoder training early enough and continue training with unet only. In general it's hard to get a model that is equally good in both photorealism and anime, so better you focus on an anime/cartoon only model

zenith delta
#

I see. The best LORA in my case was the Epoch2 model for the style, but the likeness was off on that one. I could use the LORAs up to epoch4 depending on the complexity of the prompt, with more complex prompts working better with the higher epochs.

stiff dust
#

yeah, prompt matters a lot

#

try not just "anime of [token]" but rather something like "anime illustration of [token] by makoto shinkai and studio ghibli"

#

(makoto shinkai is a strong prompt modifier, it turns everything quite reliably into anime, although quite realistic drawn anime)

#

if I want more less-realistic anime I also always add an anime lora additionally to my face lora

#

but in general I found it easier to train my face for drawn styles like anime or cartoon than for photorealism

zenith delta
#

I see

stiff dust
#

if your results get better with longer prompts that could also be an indication that your text encoder is overfitted on your character prompt

#

I always trained text encoder and unet separately and with different dimensions (text encoder only low dimensions like rank 4 ), but I cannot say how much that helped

zenith delta
#

I generally used prompts like: (1man, masterpiece, best quality, high quality:1.4), brk(my token), and then booru style prompts.

#

This is for SD1.5 though

#

I have never separated the Unet and TE training before. Can it be done on Kohya?

#

I also thought your network alpha was supposed to be lower than your network rank. In the case I lower the rank to 8, should I have the alpha at 4?

stiff dust
#

yes, it should. You can also keep alpha at 1

#

makes training a bit slower, but that might be rather a good thing
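
A minimal sketch of the rank/alpha arithmetic behind that, assuming the usual LoRA convention (which kohya's scripts follow, as far as I know) that the learned update is multiplied by alpha / rank. The function name is made up for illustration.

```python
def lora_scale(network_alpha: float, network_dim: int) -> float:
    """Multiplier applied to the LoRA update, assuming scale = alpha / rank."""
    if network_dim <= 0:
        raise ValueError("network_dim must be positive")
    return network_alpha / network_dim


# Compare the settings discussed in the chat: (rank, alpha)
for dim, alpha in [(64, 32), (8, 4), (8, 1)]:
    print(f"rank={dim:>2} alpha={alpha:>2} -> scale={lora_scale(alpha, dim):.3f}")
```

Under that assumption, rank 8 with alpha 4 keeps the same 0.5 multiplier as rank 64 with alpha 32, while alpha 1 drops it to 0.125, so the same learning rate moves the weights less per step.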

zenith delta
#

I see! I'll test that out tonight!

#

I'll keep the rank at 8 and alpha at 1. I'll use the same dataset and the same lr.

#

Shall I lower the text encoder lr from 0.0001 to 0.00009?

#

But I think using Adafactor needs me to use the same LR for both TE and Unet

stiff dust
#

for training unet/te separately: I think newer kohya versions have some stuff like stop TE training after some epochs and so on. But what I usually do is:

  • run a training with TE only (--network_train_text_encoder_only) and with very low rank (--network_dim=4)
  • train a few epochs, validate how the images look and whether they change style. Stop very early! (Like when the images start turning into a photo at epoch 8, don't use epoch 7, rather use epoch 5)
  • start a new training with unet only (--network_train_unet_only) and higher network dim (--network_dim=16), save after one step and immediately cancel training as soon as the output file is written out
  • next you merge the two output files (the text encoder one and the unet one). Now you have one output file that contains both
  • now you start training again with this output file (--network_weights=myfile) and unet only (--network_train_unet_only)
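
The merge step above can be sketched roughly like this. It is only an illustration, not the tool stiff dust actually used: it assumes the LoRA files are flat key-to-tensor mappings and that text-encoder and U-Net keys start with `lora_te` and `lora_unet` respectively (an assumption about SD1.5-style kohya files; check the keys in your own files). The function name and the file names in the comment are hypothetical.

```python
from typing import Any, Dict

TE_PREFIX = "lora_te"      # assumed prefix of text-encoder LoRA keys
UNET_PREFIX = "lora_unet"  # assumed prefix of U-Net LoRA keys


def combine_te_and_unet(te_weights: Dict[str, Any],
                        unet_weights: Dict[str, Any]) -> Dict[str, Any]:
    """Keep text-encoder keys from one LoRA and U-Net keys from the other."""
    te_part = {k: v for k, v in te_weights.items() if k.startswith(TE_PREFIX)}
    unet_part = {k: v for k, v in unet_weights.items() if k.startswith(UNET_PREFIX)}
    if not te_part:
        raise ValueError("no text-encoder keys found in the first file")
    if not unet_part:
        raise ValueError("no U-Net keys found in the second file")
    return {**te_part, **unet_part}


# With real files one could load and save via the safetensors package, e.g.:
#   from safetensors.torch import load_file, save_file
#   merged = combine_te_and_unet(load_file("te_only.safetensors"),
#                                load_file("unet_only.safetensors"))
#   save_file(merged, "combined.safetensors")
```

This drops any file metadata, and whether the trainer accepts a file with different ranks for the two parts is something to verify before relying on it.
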
stiff dust
#

as said, the workflow with separate trainings for unet and text encoder is quite complicated and I don't know if it's necessary at all

#

but my experience so far was that the text encoder is extremely vulnerable to overfitting, and such a workflow allows you to observe and validate the text encoder during training

zenith delta
#

Aha

#

That makes quite a lot of sense

#

I'll bring the kohya ui up now to see if it has a checkbox for only unet or only te training

#

It seems not; I guess I'll use the script to start the training then

stiff dust
#

that would surprise me. I think the UI should have those options - they are quite old

zenith delta
#

It has a setting for Stop Text Encoder training

zenith delta
#

The results are much better when I train with the images of my girlfriend btw, is this due to there being more female images in anime models?

normal ember
#

@stiff dust I've found eyes overfitting way quicker than anything else. Do you have any workaround for that?

dusky urchin
#

like do you have a mix of photographic and anime images of your person to train on? it's okay if the answer is no

#

regularization should help, but it works better if you already have a large dataset of images

final patio
#

I want to finetune SD on my face and have 8 GB of VRAM. Would a LoRA be the best way to do so? Are there any up to date, ideally simple guides on how to do this?

shut relic
#

Isn't regularization supposed to tell the trainer that the images are NOT supposed to look like this

dusky urchin
#

before i say anything more, do you program? do you have some experience with probability as an idea in math?

shut relic
#

Yes for like 7 years xd

#

I'm a software engineer

#

Doctor pangloss is typing up a storm

dusky urchin
#

let's start with what the dreambooth paper actually says about regularization:

Encouraging diversity with prior-preservation loss.
Naive fine-tuning can result in overfitting to input image context and subject appearance (e.g. pose). PPL acts as a regularizer that alleviates overfitting and encourages diversity, allowing for more pose variability and appearance diversity.

#

there is a lot of jargon explaining what this concretely is:

Prior Preservation Loss Ablation

We fine-tune Imagen on 15 subjects from our dataset, with and without our proposed prior preservation loss (PPL). The prior preservation loss seeks to combat language drift and preserve the prior. We compute a prior preservation metric (PRES) by computing the average pairwise DINO embeddings between generated images of random subjects of the prior class and real images of our specific subject. The higher this metric, the more similar random subjects of the class are to our specific subject, indicating collapse of the prior. We report results in Table 3 and observe that PPL substantially counteracts language drift and helps retain the ability to generate diverse images of the prior class. Additionally, we compute a diversity metric (DIV) using the average LPIPS [73] cosine similarity between generated images of same subject with same prompt. We observe that our model trained with PPL achieves higher diversity (with slightly diminished subject fidelity), which can also be observed qualitatively in Figure 6, where our model trained with PPL overfits less to the environment of the reference images and can generate the dog in more diverse poses and articulations.

shut relic
#

Do you have something better formatted

dusky urchin
#

lol well... listen it's not super informative. the key thing is that regularization should really be called "prior preservation loss"

#

once you have this, here's a good explanation from hugging face:

Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions.

shut relic
#

hmm

dusky urchin
#

so what is really concretely happening? it comes down to understanding the role of a text encoder, why it's called a "conditional" unet, why the word "conditioning" is used, and why "classifier-free guidance" is used

#

basically a bunch of things about probability. it's actually essential to what "denoising" is, and what you are actually training

shut relic
#

I know that the text encoder decides which vector to give each word

#

And that textual inversion works by figuring out a magic vector that gives you the image you want

dusky urchin
#

suppose you knew what all these things meant. prior preservation loss is a way to ensure that conditioned denoising on [X, Y, Z] where you are giving examples of [Z] does not make the original [X, Y] less likely.

shut relic
#

Isn't denoising just the process of transforming the pixels into something else, hopefully an arrangement you like?

dusky urchin
#

in other words, when you write a caption and provide an example image for:
taylor swift at a football game with many football players
you will want to create a regularization caption and image
a football game with many football players
which can even be using an image that the generation process itself creates - that's what the dreambooth paper does, and that's where the images from the regularization github repos that the community uses come from
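
In loss terms, the instance pair and the regularization pair become two terms that are summed. A toy sketch, assuming a plain mean-squared error and using flat lists of floats as stand-ins for the model's prediction and its target; `prior_weight` plays the role of the weighting factor on the prior term, and all names here are illustrative rather than taken from any trainer.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def loss_with_prior_preservation(instance_pred: Sequence[float],
                                 instance_target: Sequence[float],
                                 prior_pred: Sequence[float],
                                 prior_target: Sequence[float],
                                 prior_weight: float = 1.0) -> float:
    """Instance loss ("taylor swift at a football game ...") plus a weighted
    loss on the regularization example ("a football game ...")."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(prior_pred, prior_target)
    return instance_loss + prior_weight * prior_loss
```

The second term penalizes the model for drifting on the caption without the new subject, which is what keeps the original [X, Y] from becoming less likely.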

dusky urchin
#

so the thing you are training is adding a bunch of noise to the VAE[your training image], making that your output, and then adding slightly more random noise, making that your input, and then backpropagating

#

okay - so you can see how if you are generally making all noisier images less noisy in the direction of your training images, you will make taylor swift appear "everywhere"

shut relic
#

Eh my issues may be because of other mistakes i made

dusky urchin
#

does that make sense?

shut relic
#

Wait i gotta read that again

dusky urchin
#

when people say "overfitting" this is what they actually mean

#

you have to also think about what is happening in the beginning of sampling - aka, when you set your ksampler in comfyui steps to 50, what is happening at steps 0-5 versus steps 45-50

shut relic
#

That's a good question

#

Why does it not just do it in one step 4head

dusky urchin
#

this also touches on what the meaning of your sampler choice is - why dpm... 3 is "Better" (really preferred for certain outputs) compared to euler A

dusky urchin
#

all of these things are related. the underlying reason it's so hard to understand regularization is that it directly relates to the arcane details of image generation

shut relic
#

My issues were more that i was (to hold with the same analogy) training for taylor swift, and all my images were of her at football games. Without regularization, my Lora thought that taylor swift is football studios, and with regularization images of a bunch of football studios, my Lora model thought that taylor swift is football studios, faster. Basically, 'overfitting' faster. But do note that in this test case i did not use any captioning

dusky urchin
#

here's an analogy i've been auditioning:

diffuser:
set your grid size to 1x1.
visit each grid square. roll dice. if it's a 1,2, or 3, do nothing. if it's 4,5, or 6, use the dice to consult a big book that maps dice rolls and your grid point to a color you should paint in your grid square.

increase your grid size to 2x2.
visit each grid square. roll dice, if it's a 1,2, or 3, do nothing. if it's a 4, 5 or 6, use the dice to consult a big book that maps dice rolls and your grid point to a color you should paint in your grid square.

....

#

training is determining the contents of that big book.
diffuser and conditioning: let's introduce conditioning:

draw a picture of a "cat"

increase your grid size to NxN.
visit each grid square. roll dice, if it's a 1,2, or 3, consult your book as usual, which maps dice rolls and your grid point to a color you should paint. if it's a 4, 5 or 6, reroll with dice weighted for cat.

#

CFG: is the number on the dice you switch between unweighted and weighted dice
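
Outside the dice analogy, classifier-free guidance is usually written as a blend of two predictions from the same model, one made with the prompt and one without. A minimal sketch of that standard formula, with flat lists of floats standing in for the noise predictions (the function name is illustrative):

```python
from typing import List, Sequence


def classifier_free_guidance(uncond: Sequence[float],
                             cond: Sequence[float],
                             scale: float) -> List[float]:
    """Push the unconditional prediction toward the conditional one:
    guided = uncond + scale * (cond - uncond)."""
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

A scale of 0 ignores the prompt, 1 uses the prompted prediction as is, and larger values exaggerate the difference between the two.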

#

control net is a different kind of dice weighting

#

this is a diffuser with conditioning

#

okay, now let's introduce captionless training: you're modifying the contents of the book to paint the grid squares more frequently towards your data set.

#

that's it

#

now let's introduce captioned (i.e. text encoder) training: you're also modifying the weights of all the dice in your caption.

shut relic
#

Something's wrong with your order of messages

#

Oh NVM

dusky urchin
#

so when you have cat and hat in a caption, you can devise a training method where you don't want to change the weights of hat "as much" as you change the weights of cat.

#

that is what prior preservation loss quite literally is

shut relic
#

Okay, so what is the way to eliminate bias in datasets

dusky urchin
#

this way you can use images of your cat in a hat for training the look of your specific cat, AND make sure to generalize your cat wearing other stuff - because hat's conditioning, when using CLIP, contributes to other conditioning relating to things on top of people's heads

#

the only way to eliminate bias in datasets is to make them larger

shut relic
#

E.g. many models are biased on white women

dusky urchin
#

ultimately the goal is to preserve the generalization power of pretraining aka having this nice, generalized checkpoint of weights

dusky urchin
shut relic
#

What if I ONLY have images of taylor swift in football studios
Is there no way to tell the trainer "this i don't want"

dusky urchin
#

PPL is a "shortcut" that was designed just for dreambooth. the achievement was using just a few images to get a diffuser to learn a detailed look and feel of something

dusky urchin
shut relic
#

Argh

zenith delta
#

Eyo you guys basically started a topic from my question

#

I was able to get much more stylized results by lowering the network rank, and using 100 reg images.

#

The problem is now that the model is kinda overfitting on the reg images itself.

shut relic
#

EXACTLY

zenith delta
#

I also stopped the text encoder at 30%

#

It works perfectly on me this time, a male

#

But my girlfriend's features get mixed up

shut relic
#

clearly regularization is far more than "this i don't want"

dusky urchin
#

the community focuses a LOT on getting an exact human face that generalizes to many instagram-style casual full body pose photography.

#

dreambooth was not designed for this. the best thing to do, if that is your goal, is to just wait.

zenith delta
#

I'll make another attempt, stopping the text encoder at 50%

dusky urchin
#

IP Adapter and similar is dealing with this issue directly

zenith delta
#

Oh no I'm not trying to get realistic results.

#

I'm trying to see how I can turn a male and a female real person to anime style.

dusky urchin
#

i guess what i am saying is that IP adapter can also do that

zenith delta
dusky urchin
#

dreambooth is too general, it doesn't deal with the salient issues of "perceptual biases when looking at faces of humans"

#

i think it is challenging to use, but if your goal is to make stylized human faces, ipadapter is the framework of today and the future

shut relic
#

When training loras, say i have 50 tags on each image, and then prepend my keyword to the tags. Would that drown my keyword, so I should reduce the other tags, or will it recognize the keyword fine

dusky urchin
#

dreambooth is a training shortcut, is another way to think about it

#

just like lora is

#

LoRA is a cheaper-to-train "full fine tune"

dusky urchin
shut relic
#

But it does matter, for say 100 images

dusky urchin
#

i think the hardest thing is to accept that many community members have "loras" that "look like someone" essentially by accident

#

100 images is too little data "in general," but if you are generating images for a celebrity who isn't a TV personality but also appears "a lot" in the base checkpoint in general, people's perceptual biases will help in accepting the likeness.

#

do you see?

#

there's a reason the dreambooth paper does not do people and does dogs and shit

shut relic
#

Does having more images make it understand the concept in less steps

dusky urchin
#

well having more images definitely increases steps 🙂

shut relic
#

Or is it just for diversity

shut relic
dusky urchin
#

i can't speak for all community guides

shut relic
#

So divide 1000 by count to get repeat

dusky urchin
#

they are "darwin fine tuned"

shut relic
#

Darwin fine tuned?

#

Natural selection?

dusky urchin
#

the ones you hear about that go viral accidentally worked for accidental configuration for accidentally common use cases

#

yes.

shut relic
#

Bruh

#

Hold on

dusky urchin
#

lol

#

i mean what can i say? you are following the guides to the T, i'm sure, and getting crap results

#

and you're probably wondering why

shut relic
#
dusky urchin
#

yeah exactly

#

this thing exactly

shut relic
#

This is my holy grail

#

It has worked best

#

And I've tried others

dusky urchin
#

it either works or it doesn't

#

it doesn't provide guidance really on getting any closer

#

because it can't create more training images for you

#

and it can't convince you to be more patient - indeed, it does the opposite! because the community is very impatient

#

something that says "train for a lower learning rate, on more steps, on more images" would not be very popular even if it were good

shut relic
#

It does provide guidance, it says what you can do to improve results

dusky urchin
#

well

#

like i said you've tried it

#

there's a lot of emphasis on configuration choices. if you followed my strategy (wait) you would use prodigy

shut relic
#

Yeah but my results are not good enough and I don't really understand them

#

I use prodigy

dusky urchin
#

prodigy is a very smart thing, and it is also darwinian-fine-tuned on training images of celebrities

#

it didn't exist 3 months ago or whatever

shut relic
#

Which the guide suggested

#

Does putting caption files next to regularization images do anything

dusky urchin
#

let me put it this way - the fact that prodigy exists means that a lot of so-called best practices parameter selection didn't matter

shut relic
#

Or is it purely psychological

dusky urchin
#

hmm

zenith delta
#

Hmm, I need you guys opinion on the likeness

#

Is it allowed to share a generation and then a real photograph?

#

Completely sfw of course

shut relic
#

Yeah why not

#

Unless it's illegal

zenith delta
#

It is completely legal

#

But I pmed you anyway

dusky urchin
# shut relic Or is it purely psychological

actual dreambooth and the objects inside kohya's scripts, which are essentially formulaic, do use captions with regularization aka dreambooth training aka fine tuning conditioning together with the unet

#

i don't know what happens when you modify a config.toml, or what you put in which directories, because i don't use kohya's scripts that way

#

let's step back a bit though. your results will be crap if your dataset is crap

#

end of story

shut relic
#

What makes a dataset good

dusky urchin
#

is your goal to flexibly present a non-celebrity person in casual instagram style generations?

shut relic
#

no

dusky urchin
#

what is your goal

#

ā˜ļø gotchya

#

jk

shut relic
#

you did

dusky urchin
#

lol

shut relic
#

Difficult question to answer

dusky urchin
#

okay well let me talk about a real goal of mine, that's actually super hard

#

so maybe that's more interesting

#

one thing i am trying to do is introduce the concept of a place to diffusers. you should be able to express with words a subplace of a place - for example, "behind the big ben", and correctly get a shot from behind the big ben.

#

that's an easy one, right? the thing is fucking symmetrical

#

how about "from bathroom in long hall in cs_office"

shut relic
#

Ok

#

That's a good example

dusky urchin
#

someone who has played counterstrike, incredibly, can visualize exactly everywhere, in all the POVs, from inside cs_office

#

and the words are more than sufficient enough to put a very narrow possible set of valid POVs

#

and you can close your eyes and get coherence between these places and shots without motion

shut relic
#

Is the lora supposed to to place a person behind the big ben, or in any arbitrary place near the big ben

dusky urchin
#

so the dataset will probably need 10,000-100,000 images of POV shots of cs_office. for regularization, i would show it a vast amount of other locations, but NOT offices

dusky urchin
#

and that's why 30 images of the "big ben" would be sufficient. anyway, it already has a ton of big bens in the dataset

#

the way a dreambooth fine tuning would help is to make all towers you generate look like the big ben

zenith delta
#

@shut relic eyo

#

Can you check pm?

shut relic
#

you didn't pm me

#

Oh you did

#

It's just barely visible

dusky urchin
#

it could also help with generating unaesthetic images of the big ben, especially ones that albumentations doesn't do

#

for example, dall-e3, which is trained on incredible amounts of synthetic data, can't do negative space due to its aesthetic synthetic data

#

it CAN do white space when you ask it to make creative art

#

this is to say that everything is dependent on your training data

#

i should say that it struggles with making white space or similar concepts when you ask for it

#

they created a huge amount of synthetic data for localization concepts, and you'll see that you'll create a lot of flawed situations if you take it even slightly outside the bounds of the synthetic data

zenith delta
#

Hey, continuing from yesterday's talk

#

Reg images are definitely not "what you don't want to generate"

#

Had like 6 different attempts since yesterday

#

And the best result is when I generated images that looked somewhat like myself using an anime model for the reg images

#

For context I'm trying to train a lora on my likeness, for anime generation purposes

dusky urchin
# zenith delta For context I'm trying to train a lora on my likeness, for anime generation purp...

regularization images are a strategy from dreambooth to improve generalization in conditioning. use regularization images whose captions are unrelated to your target training concept (like your likeness) and describe things you do not explicitly want associated with your concept.

this is confusing because the community goal is almost always "make my likeness in single character anime card portrait illustrations" or "make this person appear in instagram style casual photographs"

let's say your goal is to train the likeness of some person who doesn't appear in the training set at all. Let's say this person is Michelle Williams. you want to generate casual instagram images of this person.

let's say you have basically 3 photos of michelle williams wearing different outfits.

all of your captions should be "michelle williams, a woman wearing" followed by a bit of mundane details about the outfits.

you should not use regularization at all!

you actually do want all women who are ever generated by stable diffusion to look like michelle williams, because your goal is to generate instagram casual single person portraits.

you don't want regularization for the outfits because wherever those outfits appear on other women, you want them to "morph" into whatever meager appearance / fit you have for the three michelle williams examples.

does this make sense?

#

so @zenith delta in your concrete example you should not use regularization at all, and you should most definitely not use the regularization github repo which will really achieve the opposite of what you want to achieve

#

since you will always be rendering your likeness as anime, you want all likenesses to look like you. if you want someone else's likeness, you would disable your lora.

shut relic
shut relic
stiff dust
shut relic
grizzled wraith
#

Fine

zenith delta
#

But what I did was; used the LORA I trained on my partners likeness with low weight to generate anime style images that looked "somewhat" like her.

#

Then I repeated the training, using those images as the reg images. I also changed a few parameters, and voila! It worked wonders!

#

I'll now do the same thing on my own likeness! If anyone is interested in the details feel free to pm me!

dusky urchin
#

you can use a community anime checkpoint instead of an anime lora - they are the same thing in terms of what the process is, but the lora is a way to save money and time, not a way to get better results

zenith delta
#

IP Adapter doesn't work well for me in most cases 😦 I guess I'm not that good with ControlNet

zenith delta
stiff dust
dusky urchin
#

i think the user meant that there's a LORA that does "anime style"

#

and the regularization images weren't used at all, and it just trained more

#

@zenith delta if you had actually correctly used the regularization images in the manner you described you would get terrible results

proper shell
#

i keep on having the image mess up with extra body parts and nipples, i don't know how to fix it

jade junco
#

Hi, I've been trying out this epicrealism_pureevolutionv5-inpainting inpainting model to generate realistic backgrounds for people and objects, however, I was also wondering if I could finetune this to make it work better for certain things like countertops. Does anybody have any advice on how to go about doing this?

zenith delta
#

What I wanted to do was to train a LORA with a real person's likeness that would be able to generate the trained person, using a community anime checkpoint, without changing the style.

#

If I trained a LORA on a real person's likeness without changing any of the parameters, the model would quickly start to override the style of the checkpoint I was using.

#

So then, I tried using regularization images. I was training on AnyLora, and therefore, needed anime styled images for my reg's. I went ahead and generated the images and made another training.

#

This LORA, was able to generate the images without changing the style of the checkpoint I was using, but the likeness took a massive hit. This LORA was also trained with a really low rank/ and 1 alpha.

#

I've tried multiple different settings at this point, but none really helped me.

#

Yesterday I tried generating some images with the LORA I trained on the likeness of my girlfriend, without any reg, and with realistic LORA training parameters (high rank, high res). The style was a bit off, but by using really low weights, about 0.1-0.2, I was able to get some images that looked like my girlfriend and were also in the style I wanted.

#

I generated a whole regularization dataset this way, and then trained another LORA, once again on AnyLORA checkpoint. This time however; I increased the network rank from 32 to 128, as if I was training for a realistic model. I was relying on the reg images to not overtrain.

#

This final LORA; turned out almost perfect! It generates perfectly stylized images, and pretty high likeness for the anime style. I'll improve this LORA by training another LORA with the same images; but cropped 512x768. Then I'll merge the 768x768 LORA with the 512x768 LORA to increase the flexibility.

#

The reg images had an effect for sure. Because when I used the same images/same parametres for the training without them; the LORA was changing the style already at Epoch4-5. The LORA I trained with the reg images works perfectly between Epoch18-20!

#

I do not want to share the images with this LORA, as it is trained on my girlfriend's likeness. But I'll share the images when I make the same training on my own likeness.

mighty rock
#

I'm looking to fine-tune SDXL with **LoRA**. I'm wondering if I should go with TPU or GPU for this task. Can anyone tell me which one would give me better performance and be more cost-efficient?

stiff dust
#

hm, super weird. To be honest, I never had a problem training on a face for different styles. So all these workarounds you did shouldn't be necessary. I would train on the base model (not on an anime model) and then apply your lora on the anime model. Ranks of 128 are way too high in my opinion. I would try lower ranks, but I mostly have experience with SDXL. Also, you have to be careful with text encoder training as it can overfit very quickly. Maybe the reason why your style is taking over is that you train the text encoder too much. Try training only the unet (or use textual inversion in combination with unet training)

#

but anyways, if you are happy with your results that's also fine

dusky urchin
stone garden
#

is dreambooth still the go-to for SD 1.5 finetuning?

#

checkpoint finetuning, not LoRA

latent charm
#

Do I need to create captions for regularization images?

stiff dust
#

yes

muted scaffold
#

yes

dusky urchin
#

why is pixart-a trained on so few images

latent charm
#

less cost, might be

sacred grail
#

I made a customGPT in chatGPT that gives good descriptions of a good length. Try it out, if you have any suggestions of what can be better let me know 🙂

https://chat.openai.com/g/g-EQFzMkKHZ-image-descriptor

You can upload multiple images at once, you don't need to add any text. It will just give you a zip file with the descriptions in the txt files 🙂
I think if we try to make a big dataset out of this to make a better captioner we could get SD to the level of dalle3
(You need chatGPTplus to use this) let me know if you want to help compile a good dataset.
You can caption up to 400 images every 3 hours with this due to chatgpt limits

lusty stag
#

How can I train a model with few images

sacred grail
# lusty stag Can you make it on Hugging Face

No, unfortunately this is a closed model because it's on openai. If we gather enough captions from this we could finetune an open source CogVLM model, but that's not something I would be able to do alone, as gathering the captions would take a long time (needs a very large dataset, 400k or probably more) and finetuning the model would cost too much money for me

hot breach
#

cogvlm is outstanding as it is

warm agate
#

But cogvlm is kinda GPU intensive

hot breach
#

yes, but its sort of one of those one-time costs

warm agate
hot breach
#

training millions is as well

thorny gazelle
#

question: why are a lot of my samples in orientations that are not reflective of my dataset? all of them are sideview shots but im getting front view in the samples...

warm agate
#

Cuz captioning each image on a 3090/4090 would take around 10sec

hot breach
#

1 beam 4bit is ~4s on an Ada card, maybe 6-7 on a 3090?

#

but yeah it's not fast

warm agate
hot breach
#

not much you can do with 1 million images and just one 3090
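
Rough arithmetic for that, using the per-image times mentioned above (about 4 s and about 10 s) and assuming a single GPU captioning one image at a time with no other overhead. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def days_to_caption(num_images: int, seconds_per_image: float) -> float:
    """Wall-clock days to caption a dataset sequentially on one GPU."""
    return num_images * seconds_per_image / SECONDS_PER_DAY


for seconds in (4, 10):
    days = days_to_caption(1_000_000, seconds)
    print(f"{seconds:>2} s/image -> {days:.1f} days for 1M images")
```

At 4 s per image a million captions take about 46 days of continuous running, and at 10 s per image about 116 days.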

warm agate
#

Can I DM?

thorny gazelle
#

is this over training?

thorny gazelle
#

are these good dataset images? 512

#

I have 54 images so far

#

they are all sideview and ive tried a set without skids and it gave me images with skids and a front view...

tame vortex
#

maybe incorrect labeling then ?

#

Also there are probably already too much stuff linked to "helicopter" in the model already. Maybe try labelling your helicopters as MLGCopter or something like that.

#

just a thought

hot breach
#

things like helicopters and airplanes are pretty rough to train, maybe due to all the appendages they have, but trying to just do the side shots is smart, if your screenshot is truly representing that fact

thorny gazelle
#

ok it was in fact my keywords getting influenced by 1.5 models

#

also how do you use this graph? when it flattens out what does that mean when training?

#

using constant

#

does it mean that after 30 steps its not learning???

hot breach
#

learning rate is an input not an output
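
To make that concrete: if the graph is the learning-rate plot and the scheduler is a constant one with a short warmup (the 30-step warmup here is only a guess based on the question), the curve is expected to rise and then stay flat. A minimal sketch with an illustrative function name, not any trainer's actual scheduler:

```python
def constant_with_warmup(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate at a given step: linear ramp to base_lr, then constant."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


for step in (0, 15, 30, 500):
    print(step, constant_with_warmup(step, base_lr=1e-4, warmup_steps=30))
```

The flat part only says the step size is no longer changing. Whether the model is still learning shows up in the loss and in the sample images, not in this curve.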

#

my guess is you'll simply be challenged to get good helicopters, but I'd probably worry more about your data than anything, both the actual images and the captions you use. Add more data and experiment with how you label them.

#

you are training SD1.5?

#

lora or fine tune?

thorny gazelle
#

im doing a model, right now im using a small dataset with 5img just to play around with keywords. im using dreambooth with 1.5 as a base

#

so far im getting better results

hot breach
#

better results compared to what?

thorny gazelle
#

better results with the unique keywords.

#

more sideviews. this is a 50 epoch model

hollow spruce
tiny heron
#

Is it possible to train a style with an alpha channel?

true flint
#

Does anybody know why it could be, that its like stable diffusion is changing seed every some frames?

dusky urchin
#

You would need a lot of data, computational resources and patience

dusky urchin
thorny gazelle
#

my goal was to create sideview concept art for helicopters

#

Ive gotten good results after changing keywords to be more unique that didnt call on the base 1.5 model which would add dross I didnt want

dusky urchin
thorny gazelle
#

im not seeing the error, could you explain further why this is a poor dataset?

dusky urchin
#

and something creative and zany

#

things like the motion blurred rotor do not achieve your goal

#

and you're training on a lot of motion blurred rotors and stuff that's on green

tiny heron
thorny gazelle
#

yeah I can understand the blur being bad. I can disable blur on the renders but most IRL images are done while in flight, on the ground the rotors sag and ruin the fuselage, the concept art can be done by hand in blender or gimp but the point of this concept art dataset is just to create ideas

#

Id also want to try blueprint 3views or other similar vector drawings

dusky urchin
#

training on 50 images will make it generalize less compared to the base model, not more

#

and be less creative

thorny gazelle
#

engine placement, fuselage shape, cockpits, landing gear, tail boom, rotor system

dusky urchin
#

much rougher right?

#

so like very creative

#

something more crazy like this?

thorny gazelle
#

mmm kind of, Im looking at it more as a tool, so recursive in use where you choose a few rough designs and expand on them with different features in a photo editor, picking out features and exchanging them, maybe later if need be focus more on detail. but having detailed panel lines is not a huge concern for me, more realistic in nature and less outlandish

dusky urchin
#

more realistic in nature and less outlandish
okay i think you gotta firm up your brief lol

#

it sounds like you don't know what you want

#

you want your exact helicopter on green background renderings, except "better"

thorny gazelle
#

hmm I want it to take those photos and mash a totally new design that looks like it would have been manufactured

dusky urchin
#

you want photorealistic side view shots of helicopters that you plan to dissect and kitbash

#

into a new helicopter side view photorealistic shot, but fundamentally a pretty conventional one

#

a photocollage

#

it's too bad because i really like the concept art helicopter pipeline i just made

thorny gazelle
#

not against other perspective shots, but I think that focusing on orthographic views is more controlled

dusky urchin
#

with maybe more variety in the helicopter body plan

#

@thorny gazelle does that seem right?

#

it's just hard to reconcile creative and something you can kitbash, but anyway, i think you should try to achieve this some other way. a lora will not help you. the most challenging thing for it to learn is "side view"

thorny gazelle
#

yeah more like what im after, im also not making lora. just a 1.5 model since dreambooth hates lora or I havent found a way to make it work... I need to look into kohya ss when I get more time. ive made my own unique keywords that are unfamiliar to the base 1.5, so far its given me sideview shots since its all it knows, sideview is also the only keyword that is not unique

dusky urchin
thorny gazelle
#

such as?

dusky urchin
#

noise in clip gives you creative concepts. noise in latents gives you creative silhouettes

thorny gazelle
#

i havent gone too far into the rabbit hole, what am I supposed to use this json file in?

dusky urchin
thorny gazelle
#

oh ok, I haven't installed comfy yet, just auto1111. thanks for your time btw

sullen nebula
#

Hello, does anyone know how to become an authorized user for finetuning on Stability ais API?

hardy geyser
#

anyone has a tutorial for finetuning xl with hundreds/thousands of images and multiple prompts?

bold ether
#

I want to show my girlfriend the power of ai because she doesn't believe it can be that good. so she challenged me to make realistic pictures of her that are nsfw, but she doesn't want me to use her nude pictures to train the ai for privacy reasons

dusky urchin
#

and have you ever used python

bold ether
#

Can I get help?

dusky urchin
bold ether
open summit
#

^ can you two continue this convo in DM? We try not to promote/chat on any NSFW on the server

bold ether
#

okay

shy basalt
#

Can I get some recommended settings for training a subject LoRA in Kohya? I am having a hell of a time with this.

#

I think I burned my cookies.

bold ether
dusky urchin
#

before you jump into training

shy basalt
#

Yes. For a few years.

dusky urchin
#

hmm

#

what are you trying to do?

shy basalt
#

Train a lora to give me camera-headed people. I've harvested a ton of images from Dall-E because it understands the prompt fairly well, photoshopped a few, and gently reprocessed and blended everything in juggv6.

#

Got a few over 100 images, learning rate 5e-5

#

40 epochs, but I'll just stop it once it starts working.

dusky urchin
#

okay, do you want to turn every person into a camera headed person?

shy basalt
#

Yep.

#

I'm not using regularisation images. Could that be a factor?

dusky urchin
#

regularization images of people would be the opposite of what you want

#

because it looks like most of the flaws are in the camera itself

#

so you don't want it to learn dall-e3's flawed representation of cameras

#

you should also consider a full fine tune instead of a lora. if you have patience and a 24gb+ card

shy basalt
#

I'm not sure what that would entail. I am a slow learner, but I do have that big GPU energy.

bold ether
#

Is anyone able to help me in a DM or guide me somehow? I don't know what I am doing

dusky urchin
#

you should read a lot of coco-style captions

#

so you can see what CLIP was actually trained on

hardy geyser
dusky urchin
#

i don't mean loading a workflow that someone else authored. i mean like, completing a brief

hardy geyser
#

never used comfy

dusky urchin
#

hmm

#

i think this is going to be a stretch

#

you have to learn what all the parameters actually mean, and what is going on

#

if you wanna do something innovative

#

if you can find a guide for exactly what you want, great

stiff dust
ornate ruin
bold ether
stiff dust
#

Psychiatry? Jail?
I don't care if you want to generate porn, but don't abuse other people by putting their face into porn.

bold ether
#

She literally asked for it

dusky urchin
hollow spruce
# shy basalt Got a few over 100 images, learning rate 5e-5

a bit late to reply. but if you up your dataset to roughly 400 images, then you can just brute force the lora. at 400 images, the settings start to become a bit less important. just use adamw + 1e-4 unet lr, 5e-5 te lr + batch 7 + min snr 5 + offset noise 0.1 + dim 32 alpha 1
for fully automated tagging use cogvlm via this app:
https://github.com/jhc13/taggui

dont forget to add a triggerword to the whole thing. something like "camhead" which doesn't have any highly biased words in it

save every 5 epochs. epoch 45~60 will be your target epochs for finished lora
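
For anyone who wants those settings as a command line, here is a rough sketch. It assumes kohya's sd-scripts (`sdxl_train_network.py`) and its usual flag names, and reads the learning rates as 1e-4 (unet) and 5e-5 (text encoder). The paths are made-up placeholders, bf16 and 1024,1024 come from advice later in this channel, and 60 epochs is just the top of the suggested range. Check the flags against `--help` on your own install before trusting it.

```python
import shlex


def lora_args(dataset_dir: str, base_model: str, output_dir: str) -> list[str]:
    """Build an argument list for kohya sd-scripts' sdxl_train_network.py.

    All paths are hypothetical placeholders. Flag names are assumed to
    match the installed sd-scripts version.
    """
    settings = {
        "pretrained_model_name_or_path": base_model,
        "train_data_dir": dataset_dir,
        "output_dir": output_dir,
        "network_module": "networks.lora",
        "optimizer_type": "AdamW",
        "unet_lr": "1e-4",
        "text_encoder_lr": "5e-5",
        "train_batch_size": "7",
        "min_snr_gamma": "5",
        "noise_offset": "0.1",
        "network_dim": "32",
        "network_alpha": "1",
        "save_every_n_epochs": "5",
        "max_train_epochs": "60",   # assumption: upper end of the 45~60 target
        "mixed_precision": "bf16",  # assumption: from the "switch to bf16" advice
        "resolution": "1024,1024",
    }
    args = ["accelerate", "launch", "sdxl_train_network.py"]
    for key, value in settings.items():
        args += [f"--{key}", value]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command; nothing is launched here.
    print(shlex.join(lora_args("./camhead_dataset", "./sd_xl_base.safetensors", "./out")))
```

Keeping the settings in one dict makes it easy to change a single value (say, batch size on a smaller card) and rerun with everything else held constant.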

hollow spruce
#

ah, my bad. didnt see that fruit already replied earlier

livid rapids
#

Does anyone know how to disable RAM usage when VRAM is maxed on NVIDIA cards? I remember reading something about that being an option on the latest driver update but can't find it.

lethal osprey
#

Hello. I am new to the stable diffusion world and I tried making a lora of an art style but nothing I do seems to make it work. I used about 234 images from an artist and the style doesn't come through.

lethal osprey
#

I am using Anime Art Diffusion XL checkpoint and this is what I get
I am trying to get the art style of My700 with the lora I trained

jade hornet
hollow spruce
# jade hornet genuinely curious why you recommend adamw over an adaptive optimizer.

the short answer: to avoid overfitting, and achieve the best lora I can possibly make

For context, all of this refers only to sdxl.

the long answer: this will get a bit technical...

we're gonna separate this into 3 groups.
• AdamW + offshoots like AdamW8bit
• adafactor + similar adaptive ones like dadapt
• prodigy

Resources
• AdamW + constant does the math directly, and correctly. not nearly as much approximation going on. This has the downside of using more gpu, and basically sets a barrier of entry of 16gb vram, and can be used most efficiently with 24gb vram
• Adafactor does a lot of approximation, hence requiring less gpu. With every vram saving technique applied that exists you can get the barrier of entry down to 8gb vram, if you're willing to only do style loras. or 10gb vram for any kind of lora.
• Prodigy gets complicated, since you can get the vram requirements down, but doing so essentially moots the point of using prodigy in the first place. If your goal is to avoid overfitting, then prodigy has a barrier of entry of 24gb vram. If you use the methods that require less vram, then you're better off switching to adamW, since you'll get significantly better results with little more effort. Ideally you want a shit ton of resources if you wanna use prodigy efficiently (like 40 or 80gb vram)

Conceptual Complexity
• AdamW - If you're teaching one single concept, then you wont have any downsides with AdamW. If you're teaching multiple concepts (26 in case of my dnd lora) then you're best off with AdamW thanks to its consistency.
• Adafactor - if you want a quick and dirty lora, then adafactor will work just fine. If you care about the nuances of overfitting, then you'll quickly hit an upper ceiling, especially once you deal with more and more concepts.
• Prodigy - Can equal the quality of AdamW, without any of the knowledge required to make it work. The downside? An inhuman amount of resources used. If you use it via low vram requirements, then you're just turning it into an adafactor alternative, at which point, you could just use adafactor to begin with.

Actual Results
Theory crafting is all well and good, but results speak louder.
So I've trained the same lora cross testing just about every setting in kohya. I've done all my big loras in prodigy, adafactor, AdamW & AdamW8bit
Going simply by results, AdamW wins every time. Prodigy also works consistently and I have nothing bad to say about it, though I've stopped using it since I cant exactly tell people to get more vram, while with adamW my techniques at least work on 16gb vram environments.
Adafactor loses every time, and I only ever recommend it if you're running on 10~12gb vram environments

Misinformation
So most tutorials recommend adafactor, which can be traced back to the first tutorials during sdxl 0.9 release when SECourses made his youtube videos and declared his methods as "the best" and that he tried all the settings and these work best. When in fact he tried only a few settings on a single dataset of himself. Due to a fundamental misunderstanding of how network rank works, he arrived at the conclusion that adafactor works + net rank of 256. Both of which are the worst possible options, in general, but give the illusion of working. But from there, it formed a culture of using adafactor since misinformation spreads fast. Nowadays it's getting better, but people are still using adafactor without knowing what differentiates it from the rest, and when to use it vs when not to use it.

#

#✨|sdxl message
^ DND lora.
26 core concepts, which have been completely retaught
around 100 minor concepts, which have been merely influenced (like hands always having 5 fingers, pupils being round, etc... )

I attached a list of all the major concepts ("Indian" was taught as well, but doesnt show up in that list)
green wasn't taught, that's just for statistics, so I can verify I'm not accidentally teaching a bias towards any gender

#

has a roughly 80% success rate. meaning if you generate 5 images, from seeds 1~5, then 4 of those will be "actually useable", and would qualify to be printed and actually used in a game of DND. (All the images that I linked from general chat are from Seed 1, just to make a point)
All of this, on top of the default sdxl base. No checkpoints or other loras applied.

Its also not working by overfitting, as once you add a new concept it wasnt trained on, it still works. And as a few friends already tested it, it works with subject loras, to translate any subject into a "dnd portrait" and keep up the style.

dusky urchin
#

@shy basalt everything that @hollow spruce said aligns with my choice of parameters almost exactly. especially

at 400 images, the settings start to become a bit less important.
the better the dataset and the longer your patience, the less the settings matter.

dusky urchin
#

i am surprised you haven't tried full fine tunes

#

if you have the patience, like a week, and the extra data needed to regularize

jade hornet
viral jackal
#

hey why is the training script for Cascade just idling with no errors? been bashing my head here for a few hours trying to figure out whats going on.

dusky urchin
#

did you try attaching a debugger?

viral jackal
#

and i have 8x3090s

dusky urchin
#

and stepping through

viral jackal
#

it does not move past 1 step

dusky urchin
#

it can be maybe LoRA fine tuned on a single card

viral jackal
#

it just idles

dusky urchin
#

can you maybe provide some more context

viral jackal
#

sdxl can be tuned on a single 24gb card

#

the script runs and then just idles at step 1 for hours

#

on 8x3090s

dusky urchin
#

additional cards do not make a model that requires X amount of VRAM to be trained trainable

#

you should know this

viral jackal
#

i know

dusky urchin
#

so

viral jackal
#

they said it can be done on a card with 24gb

#

of vram

#

its not that

dusky urchin
#

can you provide some more context for what you are trying to do? i am telling you it probably cannot be fine tuned on a single card in 24GB

#

hmm

#

well what version of torch are you using?

dusky urchin
viral jackal
#

--find-links https://download.pytorch.org/whl/torch_stable.html
accelerate>=0.25.0
torch==2.1.2+cu118
torchvision==0.16.2+cu118
transformers>=4.30.0
numpy>=1.23.5
kornia>=0.7.0
insightface>=0.7.3
opencv-python>=4.8.1.78
tqdm>=4.66.1
matplotlib>=3.7.4
webdataset>=0.2.79
wandb>=0.16.2
munch>=4.0.0
onnxruntime>=1.16.3
einops>=0.7.0
onnx2torch>=1.5.13
warmup-scheduler @ git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git
torchtools @ git+https://github.com/pabloppp/pytorch-tools

#

featuring enhancements for fine-tuning and training efficiency with a focus on further eliminating hardware barriers.

dusky urchin
#

it doesn't say anything about 24GB cards

#

are you trying to run train_c_lora.py?

viral jackal
#

nope

dusky urchin
#

i don't think a full fine tune will work on 24GB VRAM at 16bit. the regular RAM usage is suggestive that it requires a 48GB card

#

can you try the lora fine tuning instead?

#

wuerstchen stage C is designed to be trained on an A100 80GB
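
A back-of-the-envelope check on the VRAM claim, as a sketch only. It counts weights, gradients and the two AdamW state tensors per parameter, and ignores activations, the text encoder and framework overhead, so real usage will be higher. The 3.6e9 parameter count is an assumption for the large Stage C; swap in whatever your checkpoint actually has.

```python
def training_vram_gib(n_params: float, weight_bytes: int, grad_bytes: int,
                      optim_bytes_per_state: int, optim_states: int = 2) -> float:
    """Lower bound on VRAM for a full fine tune, ignoring activations.

    AdamW keeps two state tensors (first and second moment) per parameter,
    hence optim_states defaults to 2.
    """
    per_param = weight_bytes + grad_bytes + optim_states * optim_bytes_per_state
    return n_params * per_param / 2**30


STAGE_C_PARAMS = 3.6e9  # assumption: parameter count of the large Stage C

# 16-bit weights, gradients and optimizer state: 8 bytes per parameter.
all_16bit = training_vram_gib(STAGE_C_PARAMS, 2, 2, 2)
# fp32 weights, gradients and optimizer state: 16 bytes per parameter.
all_32bit = training_vram_gib(STAGE_C_PARAMS, 4, 4, 4)

print(f"16-bit everything: {all_16bit:.1f} GiB")
print(f"fp32 everything:   {all_32bit:.1f} GiB")
```

Under these assumptions the all-16-bit case already lands above 24 GiB before a single activation is stored, which is in line with the doubt about a full fine tune on a 24GB card. A LoRA only keeps gradients and optimizer state for the small adapter weights, which is why it is the more plausible route there.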

dusky urchin
viral jackal
#

lmao i run out of memory for the lora

#

train_c? or tran_c?

dusky urchin
#

"don't play games with me" as the president says

#

i think you are perhaps biting off more than you can chew

viral jackal
#

? i specifically have hardware to train image models

#

so the model cannot be trained nor a lora with a 24gb card?

dusky urchin
#

you have one of the worst possible setups for training image models

viral jackal
#

i can train a 240B llm on the hardware

#

yeah i know

dusky urchin
#

so...

viral jackal
#

i was working within a budget

dusky urchin
#

your deficits are in the programming department. the train_c script expects the job to be run against SLURM

viral jackal
#

so it flat out cannot be trained on anything else than h100s or a100s?

dusky urchin
#

so does train_c_lora

#

i don't know. i think you are "just" running the script without comprehending what it does

#

it's likely you can do a lora fine tuning on a 24GB card if you configure training for bf16 and use adamw (which is what it says)

viral jackal
#

i am merely asking for some friendly advice and you're coming in here out of the gate lecturing me about my intelligence and making jabs at me?

#

really?

dusky urchin
#

i'm sorry i do not want to sap your excitement

#

if you buy $4,000-$8,000 of GPUs it's worth comprehending all these issues

#

i think you should try using the train_c_lora script with only one CUDA device visible, and then configure your training for 16bit (16bit backward pass, 16bit optimizer state/gradients)

#

i haven't used hugging face's config from the file for a long time so i can't tell you exactly what to put in there
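
A minimal sketch of the "only one CUDA device visible" part, assuming you launch the script as a child process from Python. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable; the script and config paths in the usage comment are placeholders, not the repo's real layout.

```python
import os
import subprocess
import sys


def single_gpu_env(device_index: int = 0) -> dict[str, str]:
    """Copy the current environment but expose only one GPU to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


def launch(script: str, config: str, device_index: int = 0) -> int:
    """Run a training script as a child process that sees a single GPU."""
    cmd = [sys.executable, script, config]
    return subprocess.run(cmd, env=single_gpu_env(device_index)).returncode


# Hypothetical usage; both paths are placeholders:
# launch("train_c_lora.py", "path/to/config.yaml", device_index=0)
```

The variable has to be set before the process initialises CUDA, which is why it goes into the child's environment instead of being changed halfway through the script.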

viral jackal
#

okay the memory was not the issue even when it loads up to 15 to 14 on the card it merely idles

#

not doing anything

dusky urchin
#

you will have to step through in the debugger to see what is going on

dusky urchin
#

do you know which line in train_c_lora you are observing a hang?
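
If attaching a debugger is awkward, one low-effort way to find the hanging line is Python's built-in `faulthandler`. This is a sketch, not something taken from the Cascade scripts: you would call `watch_for_hang()` near the top of the training script yourself, and the 300 second default is an arbitrary choice.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s: float = 300.0, repeat: bool = True, file=None) -> None:
    """Dump every thread's stack trace if the process is still running after timeout_s.

    With repeat=True the dump is written again every timeout_s seconds,
    so a stuck job keeps reporting where it is sitting.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=target)


def stop_watching() -> None:
    """Cancel the pending dump once the slow section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

If every dump shows the same frame (for example a collective call waiting on other ranks, or a dataloader worker), that is the line to look at first.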

normal ember
lethal osprey
#

Unfortunately it is NSFW

buoyant violet
#

Hey peeps, not sure if anyone might be interested, but I co-founded a start up that provides cheap and easy to use GPUs for AI training (https://www.tromero.ai/) I would be over the moon if anyone decides to check it out (first couple of hours of compute are free)!

slender lagoon
#

anyone tried training a lora yet? how much vram did you end up needing on the 3b?

lethal osprey
#

Does anyone know how I can get absolute black skin? I am trying to make a bat character but it keeps having her with dark brown skin, I am looking for straight up black

#

I also need help getting pure black eyes

slender lagoon
# dusky urchin 24GB

I am on 24 and it was still ooming regardless of batch size, what was your config?

#

sorry by 3b I meant the new cascade model btw

#

not sure that was clear

slender lagoon
dusky urchin
dusky urchin
slender lagoon
small shadow
#

Hey everyone I have a quick question.
I'm pretty new to Stable Diffusion, but I'm planning to start training my first lora. I know the basics when it comes to training a face, but does the same apply if I'm wanting to train a specific body type? I already have great results for the face I want with CyberRealistic in Automatic 1111, but im not completely satisfied with the rest of the body. Any tips/tricks/advice, or maybe even a tutorial link would be greatly appreciated!

unreal trench
#

Hi! Could you help me with 2 questions? Suppose LoRA/LyCoris is downloaded from the internet. How can you determine the hyperparameters (especially the decomposition rank) it was trained with, and how exactly does it affect (since LyCoris can build many approaches) the diffusion model?

What metrics can be used to determine which of the LoRA adaptations to stable diffusion most accurately conveys the subject/person in the case of photorealism?
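
On the first question: a `.safetensors` file starts with an 8-byte little-endian header length followed by a JSON header, and trainers can store free-form strings under `__metadata__`. The sketch below reads that with the standard library only. The `ss_*` key names are what kohya-style trainers usually write, as far as I know, so treat them as an assumption; uploaders can also strip the metadata, in which case you get an empty dict.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the free-form __metadata__ dict from a .safetensors header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def lora_hyperparameters(path: str) -> dict:
    """Pick out common training keys (assumed kohya-style names) if present."""
    meta = read_safetensors_metadata(path)
    keys = ("ss_network_module", "ss_network_dim", "ss_network_alpha",
            "ss_unet_lr", "ss_text_encoder_lr", "ss_optimizer")
    return {k: meta[k] for k in keys if k in meta}
```

If the metadata is missing, the rank can still be inferred from the tensor shapes listed in the same JSON header, since a LoRA down-projection has the rank as one of its dimensions.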

torn warren
#

Hi. I want to train my model, but not on people, but on furry art of one artist. Tell me, maybe someone was engaged in training a model not only on portraits of people? I have a couple questions to ask.

normal ember
normal ember
torn warren
#

Well, I have two questions:

  1. Is it possible to train a model on different art with different characters(and poses), but so that they end up with the same style and don't turn out ugly?
  2. If so, what should be followed in preparing sources, etc.? If not, how can you achieve a similar result?
normal ember
torn warren
#

Does the description need to be limited to a couple of words or it is better to make it more detailed?

normal ember
#

I like to see the captioning as letting the model know the separation of the objects in the image.

#

Let's say you only caption "a man". Anything in the image will be associated with that caption. If you further describe the image with clothing the model will associate the clothing with the captioning to some extent and could make it easier to change the clothing during the inference.

#

Let's say it's foggy and you don't want to associate the fog with "a man": you caption the fog too.

#

If that makes any sense.

livid silo
#

Ooo. I think I got it. All this I prescribe in text files that SD itself creates when analyzing images?

normal ember
#

Most trainers do, like kohya or OneTrainer

livid silo
#

Oh, wow. That looks cool. I'll definitely try to train my model one of these days. Thank you so much!

small shadow
#

Just to chime in on this conversation. When training a model are you tagging your dataset images with words that you DO NOT want to see generated in your desired output, or are you tagging the things that you would like to see?

normal ember
#

Well, if you caption something you don't want to see, like you've probably seen people recommend, it will work, but it's not like the model won't learn those things. Since your dataset probably contains more things you want it to learn than things you don't, it will learn the wanted things faster and you won't notice the other things change.

serene jacinth
#

Hi everyone! i need some advice from the community. i need to custom train SD with a dataset of images that also has a segmentation map and a depth map for each image.

  • Are those 2 extra maps relevant / will they help SD produce better results on my dataset?
  • Do you know if there is an existing training system i can use (one that includes a seg map and depth map loader), ie: LoRA, Dreambooth, etc.?

PS the training dataset is not human/face, it's an object

lethal osprey
#

Why did my renders go from the first image to the second one? I didn't change anything.

#

It starts out great and then just breaks around the 70% mark. The first image was taken at about 50% while the second was when it finished

charred scarab
#

I'm trying to use Dreambooth to train Loras of specific people with a set of 6 photos which I understand is enough but it doesn't ever seem to work. Auto1111 doesn't run on this system at all, I use SDNext if that matters

small shadow
#

Does anyone know how to use the masking feature in OneTrainer?
I can't seem to figure out how to use it.

normal ember
serene jacinth
normal ember
serene jacinth
#

Thank you!

hardy bridge
#

spaceship

hollow spruce
hollow spruce
hollow spruce
# torn warren Hi. I want to train my model, but not on people, but on furry art of one artist....

here's a mini guide for sdxl
A.) get a large enough dataset (100/400/1000 = good,very good,perfect)
B.) write captions (I'll attach sample images with captions)
C.) get a checkpoint that is compatible with furry art -> Pony Diffusion V6 XL (the only one currently in sdxl)
D.) get derrian or kohya, then use settings that are proven to work -> adamw + 1e-4 unet lr, 5e-5 te lr + batch 7 + min snr 5 + offset noise 0.1 + dim 32 alpha 1 | save every 5 epochs. epoch 45~60 will be your target epochs for finished lora
D.2) train directly on that custom checkpoint
E.) test your lora first while using similar prompts to the ones used in your training data, then try using different prompts. both should work, with the first one being almost always perfect.

#

rivet, artwork, standing, hearts in background, drawing of a game character pointing at herself
rivet, full body, standing, 3d render, outdoors, standing on a platform
rivet, artwork, full body, standing, grey background, cartoon network, a cartoon network drawing of a game character

so when you do captions, follow this style:

<trigger word>, character (basically who or what is in this image, like a name or 'woman','man' if you dont know), pose, details, details, details, caption

if you wanna be lazy, you can fully automate your captions via cogvlm in this app:
https://github.com/jhc13/taggui
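
If you caption by hand, a tiny helper keeps every file in that same order. This is only a sketch: the function names are made up, and it assumes your trainer reads a `.txt` file with the same name as the image.

```python
from pathlib import Path


def build_caption(trigger: str, character: str, pose: str,
                  details: list[str], description: str) -> str:
    """Join the fields in the order: trigger, character, pose, details, caption."""
    parts = [trigger, character, pose, *details, description]
    # Skip fields left empty so the caption has no stray commas.
    return ", ".join(p.strip() for p in parts if p and p.strip())


def write_caption(image_path: str, caption: str) -> Path:
    """Write the caption to a .txt file next to the image and return its path."""
    txt_path = Path(image_path).with_suffix(".txt")
    txt_path.write_text(caption, encoding="utf-8")
    return txt_path
```

For example, `build_caption("camhead", "woman", "standing", ["grey background"], "a woman with a camera for a head")` gives one comma-separated line in the same shape as the sample captions.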

#

lora results, for context, that this method works

hollow spruce
# lethal osprey How do I reopen kohya?

if you trained via kohya, it usually saves a config file where your logging folder is. (or output folder) check in those folders. odds are high the settings you used are saved there

lethal osprey
#

This is the only thing in my log

hollow spruce
lethal osprey
hollow spruce
lethal osprey
#

Ah okay, I was able to stumble my way though most of it, thank you.

#

Is this how I am supposed to do the pony diffusion?

lethal osprey
hollow spruce
lethal osprey
#

where is that at?

hollow spruce
#

Max Resolution is also 1024,1024

hollow spruce
#

Everywhere it says fp16, switch to bf16

lethal osprey
#

Okay, also does it matter that all the pictures I have are scaled down to 512?

hollow spruce
#

How much vram do you have?

lethal osprey
#

no clue, how do I check it?

hollow spruce
#

What graphic card do you have?

lethal osprey
#

3060ti

#

do all images need to be at 1024x1024 or does the width not matter?

hollow spruce
lethal osprey
#

crap

hollow spruce
#

That guide will get everything working for you. It's a bit detailed, but many of the things mentioned are optional

#

If you stick to sd1.5 based models, then 512px is working, and everything will train fast and easy for you ^^

lethal osprey
hollow spruce
#

If you wanna work with sdxl, there's ways to make it work. 3060ti can make it work. But sd1.5 based checkpoints will be much easier to work with in your case

lethal osprey
#

okay, now what if I turned off the SDXL and stay with the pony diffusion?

lethal osprey
wet mauve
lethal osprey
#

If it is the one in settings>stable diffusion
the lowest it goes is 1

lethal osprey
#

I was wondering if there is a way to be able to use stable diffusion from my computer on my phone?

hollow spruce
torn warren
hollow spruce
torn warren
hollow spruce
next tapir
#

Do full finetunes/dreambooth trainings always look like noise at their first sample generation? I was under the impression that, similarly to a LoRa, you'd start with the initial weights being set to the base model so the initial sample images would be similar to the base model. Otherwise, any idea what I'd be doing wrong? This doesn't happen when LoRa training for SDXL/1.5, nor does it happen for Cascade finetuning. It just happens for SDXL/1.5 finetunes for me

dusky urchin
#

maybe stage A and B in fp16 will work fine

#

with stage C in fp32, it's possible to do a full fine tune in 48gb

slender lagoon
jade hornet
next tapir
jade hornet
#

technically every image looks like that during inference, it starts with noise, that's nothing unusual

dusky urchin
#

it only makes sense with ampere though to do that

#

with 2xA5000 or 2x3090s nvlinked in tcc

slender lagoon
dusky urchin
#

i mean we have definitely, successfully trained a LoRA on top of stage C. it's just not very good. it worked, but not as well as SDXL

agile inlet
#

I can use google colab to train a lora with 15 images. It works well and produces an 18mb lora. I'd like to run kohya_ss locally because colab is always out of free GPU time. I have kohya running, but when I train, the file is half the size, only 9mb. When I use the lora it does not change the image at all, doesn't work.
Here are the local kohya settings. All else is the default.
Source model: custom: /dataset/models/cyberrealistic_v41BackToBasics.safetensors
save as: safetensors
Folders: Not using regularization photos
Parameters - Basic: Epochs: 10, Save every N epoch: 2
Parameters - Advanced:, CrossAttention: xformers (also tried setting this to none)
Flip augmentation: checked (I also check this on google colab)

#

What things can I do to see what is going wrong. I'm using the same dataset with both.

hollow spruce
agile inlet
#

I agree totally. I also wonder if the google collab has some regularization images. I think I should be able to capture the commands from both, would that show the differences?

#

Although to run it in google colab I have to wait for GPU to be available or run in CPU mode. Maybe I could grab the command then cancel it.

agile inlet
hollow spruce
agile inlet
#

I see it now, good grief.

#

So I need to figure out what settings google colab is using for that?

#

Looks like colab spits out a config file into google drive.
unet_lr = 0.0005
text_encoder_lr = 0.0001

#

I'll give that shot. I suspect there must be more differences than that.
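
One way to stop guessing is to load both configs into dicts and diff them key by key. A small sketch: the colab values are the two from the config above, while the local values and the `epochs` key are invented placeholders just to show the output shape.

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    A key that is missing on one side shows up with None on that side.
    """
    missing = object()
    out = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, missing), b.get(key, missing)
        if va != vb:
            out[key] = (None if va is missing else va,
                        None if vb is missing else vb)
    return out


colab = {"unet_lr": 0.0005, "text_encoder_lr": 0.0001, "epochs": 10}
local = {"unet_lr": 0.0001, "text_encoder_lr": 0.00005, "epochs": 10}  # placeholder values

for key, (colab_value, local_value) in diff_settings(colab, local).items():
    print(f"{key}: colab={colab_value} local={local_value}")
```

Since the local file is half the size of the colab one, the network dim / rank entry is worth checking first once both configs are loaded, as the rank largely determines how big a lora file is.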

charred scarab
#

I'm trying to train models of specific faces with SDNext and Dreambooth, it doesn't seem to ever work

vestal arrow
#

Hello everyone,

I'm planning to train a model checkpoint to generate models from images of mannequins wearing clothes. I want the models to look realistic, and the background images should resemble real-life scenes such as streets, parks, beaches, fashion shops, etc. I've been using some models for a while, but this is my first time training a model checkpoint. Could anyone with experience in training checkpoint models share some advice?

Here are a few specific questions I have:

  • Should I use SDXL or SD1.5?
  • Is it advisable to base the model on an inpaint model? (I've tested some inpaint models, but the generated background images didn't look as good as those from regular models.)
  • How should I set up the parameters?
    I'm planning to use kohya_ss for this project. Thank you very much for your help.
hollow spruce
# vestal arrow Hello everyone, I'm planning to train a model checkpoint to generate models fro...

break your project down into smaller pieces. then get each piece working. finally, combine all your trainings into 1 to make the final product.

In case of SD1.5, making a model is just fine. probably ideal for you, since you can just keep scaling up the dataset to make it better over time. only downside is training time - but that should stay relatively harmless as long as you have 24gb vram or more.
downside? even with upscaling, you'll probably hit a 'reliable' limit of 768px. with a max of 1k

In case of SDXL, you can achieve this via a lora. (or to be more exact, genuine full finetuning is on a level where you should plan to have 10k$ ready for it. you'll save money by hiring someone that can already do this)
so for an sdxl lora, you break it down into about 3 individual loras that get combined:
Lora 1) train on faces that meet your clients/or business' preference (basically get the ethnicity bias you want to represent), then merge with rundiffusion checkpoint. this will be your base
Lora 2) get a dataset with images of fashion photoshoots at various locations (streets, parks, beaches, fashion shops, all the ones important for you). train that into one trigger word. this triggerword will be useable even for scenes not trained, like coffee shop, etc...
Lora 3) make a regularization dataset, filled with 50% mannequins, all wearing different clothing pieces. then 50% real people wearing any fashion clothing (make sure the people are all unique, else you'll get bias towards a person)
take a bunch of photos of your current outfit, train a 1 hour lora with your regularization data added in. that lora can then be used alongside your base and location lora, to generate the images you want.

If you get a new outfit, you only need to remake Lora 3, which is easy since the regularization data & training settings stay the same

#

if you dont get new outfits, then you can just combine the datasets from loras 1~3. add the regularization data as a normal dataset. train one single lora from all of them. <- also works, but you shouldn't do this until they work individually

#

if you need to supply only a 'checkpoint'. then just make the loras individually, and merge them in at the end

hollow spruce
dusky urchin
#

lol

#

i don't need a speech

#

spare me

#

i think it's actually that it was trained on just 700 images

#

so my attention went to zero

hollow spruce
#

it worked in comfyui, when I was using it about a year ago? (moved to a1111 since its whats used by the majority, and easier to explain)
Basically commas ended tokens early

hollow spruce
dusky urchin
#

commas are their own token

#

i don't think any of that stuff really matters

#

clip isn't strong enough for that to matter

#

i mean that rundiffusion's checkpoint isn't adding much

hollow spruce
#

they overfit specific facial features

dusky urchin
#

it's more that when you use a small dataset, it is more likely that sdxl already is very close to producing the results you want, and you could achieve the same results with "just" a prompt

#

it's why we can have working celebrity loras, but hardly anyone succeeds in training people that sdxl has never seen before with less than 1,000 images

#

whether lora or full fine tune doesn't matter

agile inlet
#

I tell you this will be the end of my hair. I can build 100 different loras, different image sets, in kohya_ss and they will do Nothing At All. Might as well not even be included in the prompt. Use the same images on google colab and it works fine. I have matched up every single setting to be the same, doesn't matter. Use the default settings, doesn't matter.

#

Only thing I haven't tried is SageMaker to see if it's any different

hollow spruce
hollow spruce
agile inlet
#

I've tried with the generated txt files that google colab makes and by generating them in kohya. Same text files in both.

hazy herald
#

how critical is it to have exactly 1024x1024 images for training SDXL lora? Is it good enough if you're close enough like within 10% of that?

#

and how important is aspect ratio

agile inlet
# hazy herald and how important is aspect ratio

I read posts of people that say they don't bother to crop/square or resize their training photos and get great results. What if you tried one training with 10% square and one without changing any aspect, and see which produce better results?

#

When I'm testing a lora I like to try the 7/8/9/10th generated lora and try it with 0.7/0.8/0.9/1.0. See which combination is the best looking.
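A small sketch of building that test grid, assuming the loras are loaded with the a1111 `<lora:name:weight>` prompt tag and that the per-epoch files are named like `mydog-000007` (the name `mydog`, the prompt, and the function are placeholders, adjust to whatever your trainer actually saved).

```python
from itertools import product

def lora_test_prompts(base_prompt, lora_name, epochs, weights):
    """Build one prompt per (epoch, weight) combination."""
    prompts = []
    for epoch, weight in product(epochs, weights):
        # Assumed file naming: <name>-<6-digit epoch>
        tag = f"<lora:{lora_name}-{epoch:06d}:{weight}>"
        prompts.append(f"{base_prompt} {tag}")
    return prompts

prompts = lora_test_prompts(
    "photo of a dog",
    "mydog",
    epochs=[7, 8, 9, 10],
    weights=[0.7, 0.8, 0.9, 1.0],
)
for p in prompts:
    print(p)
```

Keep the seed fixed across all 16 so you're only comparing the lora and the weight.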

#

(I'm super noob at this)

polar smelt
#

Hey guys, if I finetune a model which is under license xyz.
Is it still the same model with same licensing?

finite marsh
hollow spruce
# hazy herald and how important is aspect ratio

depends entirely on what you want to achieve.
Training on only one aspect ratio, makes inferencing on all other aspect ratios a bit worse. But in return, you dont get any bias trained into an aspect ratio.
(Like that portrait photos in base sdxl look better if you generate an image with a 2:3 aspect ratio)

as for smaller images. it can work. Make sure to have upscaling turned on. in most cases its not too much of an issue. If you mess up, or your dataset is really bad, then you may end up accidentally training on the noise within your images, and generate something like my "youtube compression noise" lora when I just wanted to copy the style of a music video 😭
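For reference, a minimal sketch of how aspect ratio bucketing can work, so mixed aspect ratios don't need cropping to square. This is a simplified illustration, not the exact algorithm of any trainer: it assumes buckets are multiples of 64 px with an area no larger than 1024x1024, and all names and limits here are made up.

```python
def make_buckets(max_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    """List (width, height) buckets whose area stays within max_area."""
    buckets = set()
    width = min_side
    while width <= max_side:
        # Largest height (multiple of step) that keeps the area within budget
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # mirrored orientation
        width += step
    return sorted(buckets)

def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))

buckets = make_buckets()
square = nearest_bucket(1000, 1000, buckets)    # slightly-off square photo
portrait = nearest_bucket(2000, 3000, buckets)  # 2:3 portrait photo
```

Images then get resized (and lightly cropped) to their bucket, and each batch only contains images from one bucket.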

hollow spruce
hazy herald
#

@hollow spruce @agile inlet thanks for the tips. 🙂 I'm going to try having my images be approximately square but some will be a bit rectangular, nothing too crazy though. Sounds like it won't be a huge issue for my use cases

stiff dust
dusky urchin
dusky urchin
#

to achieve this

#

it would be generalizing deepface to "deephead," then "deephead adapter (controlnet)"

#

maybe SD is particularly weak for head orientations. sora figured it out, but sora is trained on synthetic data to specifically address the problem

safe tiger
#

is there an api endpoint for creating/listing finetunes?

dusky urchin
#

for my dataset, which is for games IP, it seems like it needs a lot more data for stage C to get non-face details like wardrobe correct, but i'm more confident that this is something that is my fault and not the model's fault. it delivers superior silhouettes, proportions, adherence to face details and flexibility.

slender lagoon
dusky urchin
#

it's a very, very good model

#

it's not as good as IF though...

#

the direct pixel space models are the best for people like me with lots of fine tuning data

#

the real latent models are really good if you don't want to fine tune subjects but instead want creative generation

#

SD fake latent is a compromise between both and i wonder if SD3 is fake latent, real latent, or neither

frigid pier
#

Got my first LoRA trained, it looks somewhat distorted / deformed haha.

frigid pier
#

Reading through the sd training info on github, there's "Regularization data or Class data (pictures of diverse other things)", so does that mean the regularization data should be different from the dataset I use / my concept?

frigid pier
hollow spruce
# frigid pier Reading through the sd training info on github, there's "Regularization data or ...

terminology is important here, since "should be different from the dataset I use / my concept" will mean different things to different people

also relevant if you're training SD1.5 based or SDXL

here's a small breakdown for training a specific subject (lets say "Obama")
Training Dataset of obama: 20 images (or 200 for sdxl)
Regularization Dataset: 5~10 times the size of your dataset. Here we would want about 50% random black people, 50% other ethnicities.
If you have access to a good regularization dataset that someone made, then great. If you're training on synthetic (self-generated), then just generate them yourself. (real photos will make the final lora better, but you have to ask yourself if the effort is worth the result) <- if you plan to train a lot of lora, then just gather your own regularization dataset, or ask around for a good one. If its a one time thing, then the increase in quality is rarely worth it.

Dont forget to use repeat 5~10, so that all the images in your regularization dataset get used. else it will just reuse the same 20 images, despite more being available. <- lots of threads talk about this online, if you google around for a bit.

#

Do you actually need regularization?
not if your dataset is good/big enough!

but occasionally you're stuck with a mere 5~15 images, and have absolutely no way to increase those. So regularization has its place at all levels of training

frigid pier
#

Thanks @hollow spruce for explaining it deeper, really appreciate it!

#

Just finished training my second test lora with a regularization set, only got 1 repeat for those and might have gone wrong there, not having enough steps.

median sun
#

do regularization images make sense for a pixel art style? If yes what kind of images should I use?

frigid pier
#

My second attempt did not go too well:

With the first test I did not use any regularization images at all, did only one epoch with 30 steps and the Lora responded as expected, even if the quality wasn't there. I trained the first Lora on SDXL base 1.0 vaefix model.

My second attempt with 4 epochs and 30 steps (in total 10560 steps, with 44 images), I used regularization set of 150 images. Now the Lora is super unresponsive and only works (not well) with the model I trained it on. I trained the second Lora on limitlessvisionxl model.
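Rough sketch of where the 10560 comes from, assuming "30 steps" here means 30 repeats per image, batch size 1, and that the trainer doubles the per-epoch count when regularization images are enabled (one reg image per training image). The function is just for the arithmetic, not part of any tool.

```python
def total_steps(train_images, repeats, epochs, batch_size=1, uses_reg=False):
    """Estimate total optimizer steps for a run."""
    samples_per_epoch = train_images * repeats
    if uses_reg:
        # Assumption: each training sample is paired with one reg sample
        samples_per_epoch *= 2
    steps_per_epoch = -(-samples_per_epoch // batch_size)  # ceiling division
    return steps_per_epoch * epochs

second_run = total_steps(44, 30, 4, uses_reg=True)
print(second_run)
```

With these numbers that is 44 x 30 x 2 x 4, which matches the total reported above.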

#

Anything I could try with my next test? How do I get my LoRA smaller, adjusting the bucket size?

#

6 gig LoRA is far from good haha

stiff dust
#

oh damned, you have to set the rank

#

or dim, however its named

#

--network_dim=8 or --network_dim=16

#

if you train a 6GB Lora then its no surprise that it won't work well

#

I found it often easier to train on SDXL Base and then just apply the Lora to whatever model I want. If the model you use is not too overfitted it can deal with base loras without problems

frigid pier
#

Thanks, will check the settings for that. @stiff dust

#

These two, right?

stiff dust
#

only rank, yes

frigid pier
#

Ok, thanks

#

I think someone suggested 128/256 on YouTube, I guess that explains the size haha

stiff dust
#

yeah, but that's already way too high

#

use something between 8 and 48

#

try lower first and only increase if it helps
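To see why rank drives file size: each adapted layer stores a down matrix (rank x in) and an up matrix (out x rank), so the parameter count grows linearly with rank. A minimal sketch, with a made-up list of layer shapes (not the real SDXL layer list) and fp16 storage assumed:

```python
def lora_param_count(layer_shapes, rank):
    """Parameters added by a LoRA: rank * (in + out) per adapted layer."""
    return sum(rank * (in_features + out_features)
               for in_features, out_features in layer_shapes)

def lora_size_mb(layer_shapes, rank, bytes_per_param=2):
    """Approximate file size in MB, assuming fp16 (2 bytes per parameter)."""
    return lora_param_count(layer_shapes, rank) * bytes_per_param / 1e6

# Hypothetical model: 500 adapted linear layers of 1280 -> 1280
layers = [(1280, 1280)] * 500

for rank in (8, 16, 48, 128, 256):
    print(rank, round(lora_size_mb(layers, rank), 1), "MB")
```

Going from rank 8 to rank 128 makes the file 16x bigger without any guarantee that it learns the concept better.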

frigid pier
#

Thanks, I'll experiment with it

jade hornet
# frigid pier Reading through the sd training info on github, there's "Regularization data or ...

generally you use the model you're training against to generate pictures of whatever class describes your subject (man, woman, dog), assuming it's subject training vs style training. the purpose of the reg images is to maintain the integrity of what the model knew about the class by injecting those back in. I would recommend someone try without them first to get an idea of training progress before adding in that variable. most of the issues in training come from either bad training images or bad captions
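As a sketch of how reg images enter the loss: the usual dreambooth-style setup adds the loss on the reg (class) images to the loss on your training images, scaled by a prior loss weight. The numpy version below is only an illustration of that weighting with made-up names, not the code any trainer actually runs.

```python
import numpy as np

def prior_preservation_loss(pred, target, is_reg, prior_loss_weight=1.0):
    """Mean squared error per sample, with reg samples scaled by a weight."""
    per_sample = ((pred - target) ** 2).reshape(len(pred), -1).mean(axis=1)
    # Training images keep weight 1.0, reg images get prior_loss_weight
    weights = np.where(is_reg, prior_loss_weight, 1.0)
    return float((per_sample * weights).mean())

# Toy batch: sample 0 is a training image, sample 1 is a reg image
pred = np.zeros((2, 4))
target = np.stack([np.ones(4), 2 * np.ones(4)])
is_reg = np.array([False, True])

with_reg = prior_preservation_loss(pred, target, is_reg, prior_loss_weight=1.0)
without_reg = prior_preservation_loss(pred, target, is_reg, prior_loss_weight=0.0)
```

Setting the weight to 0 is effectively the same as training without reg images, which is a cheap way to compare both cases.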

frigid pier
#

Do I need to pay attention to learning rate, text Encoder Learning Rate and Unet Learning Rate?

neat fox
#

Maxing out around 1.3 it/s with a 4090 and 5950x CPU... Seems the GPU isn't running at full load...

One of the cores for the CPU does however appear to be hitting 100% pretty often... Not using official monitors here just hwinfo, but wondering if this really is a CPU limit or not
I'm willing to build another PC to house the 4090 if it'll double my training speed by going from <50% load to near 100 on the 4090
I'm on windows 11, 64gb ddr4 ram @ 3600mhz
This is with OneTrainer which I've found to be a bit faster than kohya which has me at 1.1 it/s max

#

for comparison, here's what i see when generating images... each spike is during diffusion, each dip is when the system is completely idle

neat fox
#

Yeah tried disabling SMT just in case and to clear things up

#

Cores 0&1 are maxed out

#

Looks like a CPU bottleneck

bronze igloo
#

anyone know what the difference between the training in the original controlnet training vs the huggingface diffusers one is?

hazy herald
#

does anyone have favorite upscalers for training images?

vestal arrow
frigid pier
#

4x_NMKD-Siax_200k is one of my current favourites I've been using almost exclusively lately.

jade hornet
#

I like lollypop

median sun
#

how many epochs do you guys usually use for your loras?

dusky urchin
#

if you care about performance in training, you should not be using windows.

dusky urchin
dusky urchin
# frigid pier Reading through the sd training info on github, there's "Regularization data or ...

same for you too. here is a super concrete example of what the meaning of regularization is:

let's say you are trying to train a picture of your dog, and for some reason, you only have 3 photographs of your dog.

in one of the three photographs, there is a ball. in another of your three photographs, there is a plant. so 33% of your dataset is also training balls, and 33% of your dataset is also training plants.

do you want plants and balls to be generated in your images "1/3" of the time whenever you are asking for your dog? no.

so your regularization dataset could be many things:

  • use a variety of images of other dogs, which have a diversity of random crap in the foreground. pros: you will not see balls and plants in your images. cons: you actually do want every dog to look like your dog, so this will actually increase training time / reduce performance of your dreambooth fine tuning.
  • use a variety of images of adjacent concepts, such as shots of other animals. pros: you will see fewer balls and plants in your images. cons: you actually do want every portrait of an animal to be a portrait of your dog, so this will actually increase training time / reduce performance of your dreambooth fine tuning.

okay... it should be clear there's no obvious choice for regularization images. let's look at some alternative solutions:

  • get more, better pictures of your dog.
  • caption better.
  • photoshop out the foreground elements like plants and balls.
  • moderate your expectations.
  • cure the disease of impatience.
dusky urchin
#

so you pretty much never want to use them :/

#

that's how it goes

#

what are you trying to do?

frigid pier
#

I gathered some info from an article I came across somewhere (I have a link somewhere) about reg images, and those can be handy when the LoRA is bleeding too much into the checkpoint currently in use. For example if creating red hair with a LoRA and it makes all the people not only look red haired as wanted but also look similar to each other, i.e. it's bleeding the trained data into the checkpoint, wiping away some of the data present in the main checkpoint and replacing it with what the LoRA was trained with. Not just the hair alone but other characteristics as well, such as body type, hair style, clothes etc. @dusky urchin

#

Not 100% sure that's the case though, haven't tested it successfully yet

dusky urchin
frigid pier
#

I will test at least for sure

#

and see how that goes

dusky urchin
#

what is the goal?

frigid pier
#

I currently have some bleeding occurring, to fix that

dusky urchin
#

SDXL already knows how to make people with red hair...

#

so what are you trying to do?

frigid pier
#

The red hair was just an example. I'm not doing anything serious, training my dog and my own face

dusky urchin
#

okay

neat fox
frigid pier
# dusky urchin okay

I just want to learn the process and understand the AI better. Maybe in the future train some actual LoRAs idk.

frigid pier
# dusky urchin SDXL already knows how to make people with red hair...

Yes, exactly, but if what I'm describing is the case (not sure yet as it's just one article I've read about it), that means when training some specific kind of red hair, let's say with polka dots, the trained LoRA bleeds into the checkpoint currently used and soon all red haired ppl are looking the same, not just with polka dots but also same hair style, similar face etc

#

I've definitely noticed this kind of behaviour with some LoRAs, not just my test loras alone, where a LoRA completely changes the appearance into something different from what it originally was, for example body type. You create a plump girl with a Nike shoe lora and the originally plump girl gets transformed into a skinny girl.

median sun
#

I'm trying to create 16x16px textures

#

I'll try to up the img rep count

signal dust
#

Quick question guys, hope this is the right place.
I've been getting back into stable diffusion and I wanted to finetune a model with my face. I've done this in the past but now I have better pictures and would like to try again.
What is the newest working google-colab I could use for this?

dusky urchin
#

do not use it for a commercial purpose though

#

there isn't any legitimate secret sauce to anything you see. ordinary community members don't have the capital or knowledge to generate synthetic data.

median sun
#

I'm using my own art tho

#

also I'm just gonna use it for ideas anyway

#

and resuming from custom models doesn't work for me it crashes kohya

#

setting to 100 reps did help a lot tho:

#

it's still not usable but hopefully if I resume from the training data it'll get better and if not then I'll just try again next year

tropic moon
#

Anyone know what text encoder CLIP model Cascade uses? Seems like it might be a fine tuned half precision variant of CLIP Big G?

Strange that there's so little information about it. Have looked through docs and release statement.

dusky urchin
dusky urchin
median sun
jade hornet
#

kohya_ss shouldnt crash with a custom model, there's something wrong with your install, try doing the same training with kohya_ss on vast.ai or runpod

faint citrus
#

hey dumb basic questions for Kohya

there's dreambooth and finetune methods for doing training on checkpoints, which is better?

is there a good up to date guide to finetuning checkpoints for multiple concepts anyone can recommend?

#

or presets/settings people generally agree are decently good

hollow spruce
#

and to load it in under models, and not any of the two continue options XD

#

or, in case you ARE continuing a lora, to not load it as model, but instead continue from weights, while picking sdxl base or whichever base you want under models

median sun
#

I've just trained with the base sdxl model it actually worked:

#

the textures aren't good but it at least does the 16x16px grid most of the time

#

I assume that it isn't good at such low res. esp with a low amount of images to train with. So I'm gonna train it some more after I've drawn more textures

faint citrus
#

trying to do a kohya dreambooth XL training but its attempting to pull SDXL vae from huggingface but its down for maintenance so getting an error

faint citrus
burnt raptor
sonic narwhal
#

should you caption reg data when training lora in Kohya?

slender lagoon
burnt raptor
burnt raptor
#

Hi guys, I got a problem with training stable cascade lora on the 1B version: During training I get intermediate output where the model generates a grid of images. There are 5 in total, the first one is the ground truth and the other ones are generated. However, the last 2 suddenly show something totally different from what I'm finetuning. I'm currently training lora for some shirts and the first 2 images look promising but the last 2 are just strange. Anyone knows what's happening?

onyx carbon
#

What the FUCK

jade hornet
#

excellent question

dusky urchin
# burnt raptor hey, I have the same problem. Did he share his code already?

no still working on it. our fork of kohya which is installable - https://github.com/hiddenswitch/sd-scripts - fixes some issues, but the basic answer is that you need more than 24GB to fine tune with pivotal tuning aka adding new vocab / tokens. i think this is because pivotal tuning requires more gradients than the LoRA layer for clip text implies; or, it's possible that train_c_lora is misconfigured


stiff dust
#

I implemented pivotal tuning in kohya_ss. It takes 11GB VRAM with batch size 1.

#

In general Pivotal training takes not more vram than training text encoder and unet loras

#

results are... nah. Pivotal training works as bad in Stable Cascade as it does in SDXL. I switched to text encoder training long time ago as it just works better than textual inversion

neat fox
stiff dust
#

dunno, I think it is way overhyped to be honest ^^°°°

#

pivotal training came up a year ago as some alternative to dreambooth. Instead of using this strange sks token, learn a token via textual inversion and do dreambooth on that

#

the reason nobody ever used it was because afterwards everyone started using Loras and they worked better anyways

#

but if somebody has good experiences with pivotal training feel free to tell me about it

dusky urchin
#

maybe that doesn't make training take 10x as long, but 2-3x as long is still bad

#

i am just teasing it's good to hear it works

#

and that it gives bad results

stiff dust
#

there is no offloading to cpu

#

of course there is gradient checkpointing

thorny gazelle
#

you can train in non square images?

stiff dust
#

yes

gentle flame
#

better tome basically

stiff dust
#

it's just a heuristic to speed up SD 1.5 a bit, cause in SD 1.5 they have attention in the first and last layer of the unet (which is somewhat crazy and ineffective). SDXL is not benefitting much from that

sonic narwhal
#

Training in kohya. Why did it only create a NPZ file for 40 of my regularisation data images when I have 472? btw 40 is the same amount as images that I have in my training data img folder

hollow spruce
# sonic narwhal Training in kohya. Why did it only create a NPZ file for 40 of my regularisation...

thats the specific reason why repeats were created.
use 10 repeat on your training data, then 400 reg data images will be used (10x40 = 400)

(or just add the regularization data, as normal training data. then 100% of it is used.) <- while the math behind it changes, the result is on par or occasionally better. depends on your dataset.
I'd say try adding it as normal training data once. then you know for future runs if that dataset is viable to be used like that or not
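Quick sketch of that math, assuming the trainer uses at most one reg image per training sample (so training images x repeats caps how many reg images get picked up). Function names are made up for illustration.

```python
import math

def reg_images_used(train_images, repeats, reg_images):
    """How many reg images get used under the one-per-training-sample assumption."""
    return min(reg_images, train_images * repeats)

def repeats_to_use_all(train_images, reg_images):
    """Smallest repeat count that covers the whole reg set."""
    return math.ceil(reg_images / train_images)

used_now = reg_images_used(40, 1, 472)       # 1 repeat on 40 training images
used_with_10 = reg_images_used(40, 10, 472)  # 10 repeats
needed = repeats_to_use_all(40, 472)         # repeats to cover all 472
print(used_now, used_with_10, needed)
```

So 10 repeats covers 400 of the 472, and you'd need a couple more repeats to touch every reg image.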

sonic narwhal
hollow spruce
tropic iron
#

Hey all! I have a bunch of questions about training a LoRA on a specific style

#

Starting with detail and resolution, I guess. I understand my current checkpoint is based on sd 1.5, and that has native resolution of 512x512. When I try to make an image at that resolution though, I get low resolution images. 768x768 or even 1024x1024 produces drastically better results

#

Can I make images at 1024x and scale them down, while still preserving the detail during training?

pliant drift
#

what you can do is a second set of your images with closeup cropping instead of sizing the whole thing down. like, cropping just the face or another key focal point in the image

tropic iron
#

Okay, cool

#

How about switching from 1.5 to sdxl? Recommended?

#

I just did a cursory glance at the specs, and sdxl certainly seems better

stiff dust
#

I always found training on SDXL much better and easier than training on 1.5. But I also know people saying it is more difficult to train sdxl than SD 1.5. So I would say just try it out

#

for sdxl I would say: train on sdxl base and then use the lora on some other/better model

tropic iron
#

Lemme see if I can wrap my mind around this

#

Lets say I use my current workflow - producing high resolution 1024x1024 images using a model that's based on 1.5

#

Then I train THAT on the base SDXL model