#finetune
Since I'm using ground truth for reg I'm pretty sure I'm literally just finetuning at this point. But dang must be nice, I do have xformers and other optimizations off and full bf16 though for top quality, so it is very very heavy compared to what it could be. but I just cant tell if batch 2 is worse for faces or not
seems to be just adafactor that can manage the higher batch sizes. I tried prodigy and it couldn't even do batch size 1 for fine tune
hmm I'm also on adafactor. I think if you turned off all optimizations you'd be limited to 2 batches as well. it takes 23.7gb. I dont think there's any vram difference between DB and finetune with the same settings
Same training procedure and database + VAE , different models.
Original NAI (VAE: vae-ft-mse-840000-ema-pruned) toshaka_v1_nai (subject) toshaka_v1_nai (subject_filewords)
vs Nothing v2.3 (VAE: vae-ft-mse-840000-ema-pruned) toshaka_v1_nothingv23 (subject_filewords)
I would have assumed the curse of recursion hit but...
If I use the TI trained relative to NAI to produce images with Nothing v2.3, it works just fine, so the concept should be representable within Nothing v2.3 (and given that inference is just like training without the gradient update)
Perturbed noise messes with zsnr stuff. I know someone finetuning with it, making a vpred zsnr model. The zsnr node from comfy stopped being scuffed when perturbed noise was removed.
Just putting that out there for anyone that plans on messing with Perturbed noise.
It also makes outputs noisy btw. Known because the problems with noise disappeared after removing it completely.
I'm getting results like this when I trained a lora. I checked, the model works just fine without the lora in place. These are my sdscripts prodigy settings. Clip skip is 2 since I'm training on a model that's good for clip skip 2. Anything that's wrong with these settings?
on 24GB and adafactor, i observe the same limitation as you. I can go from 2 to 3 by enabling gradient checkpointing. you are using loras for your fine tuning, correct?
but I just cant tell if batch 2 is worse for faces or not
i haven't investigated faces very much, but imo if you are impatient, these choices matter, and if you are patient, they matter a lot less. since i think you are the patient type, is there a specific issue you are trying to address? it's not very insightful to say that "your dataset is probably too small," but maybe you can directly gather more images and captions.
my thinking that a higher batch might improve quality is if it's kind of averaging out each pair of images which sounds like it could help with likeness consistency. if the images have enough variety, they might seem like different people to the AI but if it has to average 2 together each time, it should learn the face more consistently in theory.. I think?
gradients are averaged all the time. I have no clue why fine-tuning sometimes works better with lower batch size, but this idea of the dnn getting confused if it sees two images at the same time is just wrong
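To make the "gradients are averaged" point concrete, here is a minimal sketch on a toy linear model (not a diffusion UNet; the model, loss and shapes are assumptions picked only for illustration). It checks that the gradient of the batch-mean loss equals the mean of the per-sample gradients, so a batch of 2 does not blend the two images, it averages their update directions.

```python
import numpy as np

def per_sample_grad(w, x, y):
    # Gradient of the squared error 0.5 * (w @ x - y)**2 with respect to w.
    return (w @ x - y) * x

def batch_grad(w, X, Y):
    # Gradient of the mean of the per-sample squared errors over the batch.
    residuals = X @ w - Y
    return X.T @ residuals / len(Y)

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X = rng.normal(size=(2, 3))  # a toy "batch size 2"
Y = rng.normal(size=2)

mean_of_singles = np.mean(
    [per_sample_grad(w, x, y) for x, y in zip(X, Y)], axis=0
)
print(np.allclose(batch_grad(w, X, Y), mean_of_singles))
```

Whether a smaller or larger batch gives better likeness in practice is a separate, empirical question that this sketch does not settle.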
I think I was supposed to post this here: Say I generate a person with a black t-shirt in txt2image, I bring it to img2img and I want to apply my own person png design on that shirt, how would I go about doing that? Do I use control net?
lower batch sizes usually converge to a lower loss in backpropagation https://stats.stackexchange.com/questions/316464/how-does-batch-size-affect-convergence-of-sgd-and-why
i think you know that but it has been my experience, for at least the decade i've been using accelerated backpropagation, that lower batch sizes take longer but converge to a lower loss
fine tuning conditional unets in particular: no clue.
higher batch will make it train faster. it comes back to whether you are patient or impatient
i haven't had the bandwidth yet to experiment with (1) pretraining a LoRA (2) creating contrastive validation sets, but my expectation is that both of those things will be a big improvement on dreambooth style regularization
no, I never heard of this effect and never experienced it myself. Also, none of the answers in the blog post is convincing. Yes, larger batches have more stable gradients which could increase overfitting. But this happens for very large batches, not for batch size of 4 or 8.
Hello guys! I recently started my journey with SD and Loras. I have already made some models using real people (like the ones you can find on youtube tutorials), but this time I have the chance to create a unique one thanks to a model that is willing to help me. What would the "perfect" img folder be like? I mean, for my past models I used varied pictures of women with different clothes and backgrounds...this time, I can actually use a studio to get the pictures. Any advice?
Sorry if this isn't the right section for this question...
If you take 30 photos of a model in a studio, you're likely to get that kind of output in your inference. The advice is the same as always, variation of portraits, medium shot, full shot... Try to vary the background or the training will pick that up
hmm, I see, I will keep that in mind, thanks a lot!
And now that you mention the background, is there a way to make the LORA training NOT pick up the background info that much? other than variation of course.
Caption helps, but it'll still come through if you overtrain
you are at the start of a long journey. you should maybe start by trying to train a lora into a mixamo character. you are not going to get this right on your first two tries.
is there a way to make the LORA training NOT pick up the background info that much? other than variation of course.
for every concept you want to have a hope and a prayer that CLIP learns, you will need a contrast. this basically means at least 8 images per concept: the positive and negative of the concept for each of training, validation and test, and a regularization for the positive and negative.
so for example, if you want to be able to show the concept of any background versus a solid background:
white background w/ person, contentful A background w/ person
blue background w/ person, contentful B background w/ person
green background w/ person, contentful C background w/ person
regularization: red background w/ different person, contentful background w/ different person
then you actually have to use validation and test...
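The "at least 8 images per concept" count above can be enumerated explicitly. This is only a bookkeeping sketch: `contrast_slots` is a hypothetical helper, not part of any training tool, and the split names are assumptions taken from the wording of the message.

```python
from itertools import product

def contrast_slots(concept):
    # Hypothetical helper: list the image slots described for one concept.
    # Positive and negative examples for each of train / validation / test...
    splits = ["train", "validation", "test"]
    polarities = ["positive", "negative"]
    slots = [(concept, split, pol) for split, pol in product(splits, polarities)]
    # ...plus one positive and one negative regularization image.
    slots += [(concept, "regularization", pol) for pol in polarities]
    return slots

slots = contrast_slots("solid background")
for slot in slots:
    print(slot)
print(len(slots))
```

The count is 3 splits times 2 polarities plus 2 regularization images, which gives the 8 mentioned above.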
okay you're reading all this and thinking: wait a minute, nobody writes about this online. basically people are running datasets that are way too small, and they are usually overtraining or undertraining, ad-hoc testing some random prompts they wanted. it's up to you. if you like the results for the generality (or lack thereof) that you are trying to achieve, using community methods, you know, nothing else matters.
Very interesting, thanks a lot! I did a training test with 4 different backgrounds and I can see SD always including some of its elements. For example, in one of the backgrounds we had a wooden wall, so it's trying to put wood in every "indoors" prompt. I will have to make a dataset in many places then. Regarding the regularization, that is pretty useful I will see how I can make use of that.
Lastly and extra question: the same will apply to her clothing right? For example, if most of the real shots are taken in one set of clothing, it will try to make that part of the results too right? Or is clothing easier to manipulate?
all of this depends on how you caption. 30 images are not enough for the correct advice, "caption everything you see, including the mundane." the community will usually advise the opposite. for example, without a contrast, yes, the word "indoors" will become associated with the wood in the background.
it will try to make that part of the results too right? Or is clothing easier to manipulate?
you will need images of every different kind of clothing to robustly allow stable diffusion to transfer knowledge about other clothes onto how it would work in the model, and then you'd have to caption it properly
aah, could you give me an example please? I more or less understand what you mean here but I would like a more detailed explanation. Thanks a lot btw for answering!
the more examples of different clothing the model will see, the better your fine tuning will generalize to clothes that didn't appear in your dataset, especially in the details.
let's say your model is wearing a specific brand of Leota dress in all the shots.
rare_token is a woman wearing a leota dress
will make all dresses and the brand leota associated with generalizable features about your actress. as you start to overtrain, these features will appear more and more in test queries. you can use regularization to prevent the dress from being associated with rare_token: provide regularization images captioned leota dress of the exact dress being worn by different people.
rare_token is a woman
will have no impact on dresses. you can use regularization to prevent all women from looking like your actress.
at this stage, even without regularization, your fine tuning's ability to generalize will correspond to how much it is under/overtrained. it will still be perfectly capable of generalizing, and it will: your actress will be able to wear other clothes, even if you only had data of one outfit. why? because somewhere in the history of denoising steps that would create your image's dress, there is a common ancestor of noisy image between your image and another image with a different dress.
the captions are used to make the text encoder be able to express something specific in your image, and to provide more coherent contrasts between different ways your image interacts with concepts. so having your actress wear multiple outfits will make clothes fit much better on her generally.
Thanks a lot for the explanation! it's a bit clearer but I still need to study the documentation a lot more. This brings me to one more question: hypothetically, what would happen if my actress were (almost) naked? Could a lora work under these circumstances?
Oh, and one more that came to my mind while typing the first: what would be the best tool for captioning, other than manually doing it?
not sure, i think you can start experimenting with pre-existing stuff
ok, thanks a lot for real!
Haaaalp, how do I fix one part of an image? In my case a belly, it has weird anatomy in some parts and a belly button is missing.
SOLVED: a combination of low step count and low CFG scale with a mid-range denoising value was able to do it. I'm guessing that when inpainting small areas, smaller values work best. At least with the JuggernautXL checkpoint.
Tried with inpaint but the results suck. Denoising from 0.3-1.0, bad results. Cfg scale 3-8 also, and combination of these two. Sampling rate, different masking etc same thing.
Maybe I just patch it in Photoshop and do inpainting again to fuse things together
Can anyone recommend some good upscalers for use in a1111? I want something that smooths the images like an anime upscaler, but not as much. I want to lose the texture effects that many images have.
Is there a good resource for personalization fine-tuning I should start with?
can i change the vae of an already generated image? if yes, how? I dont want the image to change at all
no. prompt->clip->unet->vae->image.
You need to regenerate the image and use another vae
it was made with a lot of inpainting and I got lucky too
so that wont work
I don't know what the purpose to use another vae if you already do a lot of inpainting. But you could load the image and pass it to another vae in comfyui
i guess its time to switch afterall, im still on a1111 everybody been recomending comfy
i hope the vram req aint higher?
as XiaoZhi said: changing the vae afterwards makes no sense. What do you want to achieve?
Hi, do you know any fully transparent diffusion model on hugging face or other ? (-> a model where we exactly know which data were used for the training?).
I think SD 1.5 was just using LAION, wasn't it?
what's the objective?
make the image pop off more, more vibrant colors
the vae is just a compression. It won't make your image more colorful or anything. The only thing that can happen is that when you pass the image through the vae it loses some quality and colors. However, you cannot get that back afterwards. It's like when you compress an image as jpg and you see jpg artifacts. You cannot remove them easily afterwards. You would have to do a whole img2img pass.
i see
but img2img is changing the intricate details even on low denoising strength
you got any tutorial on that?
to be honest, if the image looks washed out it might be the vae but in most cases it's somewhat different
if you do a lot of inpaintings with blurred mask this leads to a washed out effect
Can anyone help me with a lora training thing? I'm kinda at a loss and I need help.
Basically, I'm trying to train a lora on a character, and it seems like the results I'm getting are... subpar compared to the capabilities of the model I trained it on... First four are with the lora, the next four are without the lora... Main model is VividOrangeMix, the metadata should be in the pics, and these are my current settings for training said LORA.
I hope someone can help...
Hey. I've been trying to find out how to train the detection models for ADetailer, but I cannot find any documentation on it. Are there guides out there for building a dataset and training, and can it be done with a consumer GPU?
Use case: whole head detection for non-human characters like Mass Effect asari where face detection works but the errors I'd like to correct are often made in the details around the face.
i don't understand - what is the source image? what is the character?
i don't think the detection models play a role in your problem. they return bounding boxes.
adetailer itself uses the bounding box to create the mask
the mediapipe models make a tighter mask, but i am not sure how significant it is
you can certainly try to refine the output of mediapipe, but it's not designed to be trained with more images
you don't have the gradient weights, and it's not architected around a workflow of "pretraining" versus "training"
you could write code to use insightface instead
I meant training a custom model to get the bounding box around the whole head. I've seen models on CivitAI for eyes, but no doc on how they trained it.
can you send me a link?
This model detects eyes so you can add more detail to the eyes. Can be used for: Adding details to eyes Enhance character specific eyes (Hu Tao, AI...
i think you are misunderstanding what the model is. there aren't any amateurs in the civitai community training eye detection systems. they simply adapted something else that already exists.
that file is basically a serialized python blob that configures one of the pre-existing models to match eyes
@cold cliff does that make sense?
I got it. I wanted to know if it was something mortals could train, and that these models aren't trained by the amateur community answers my question.
it will probably be easier to just expand the bounding box
a little bit
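A minimal sketch of "just expand the bounding box", assuming boxes come as `(x1, y1, x2, y2)` pixel coordinates. `expand_bbox` is a hypothetical standalone helper, not an ADetailer or ultralytics function; it grows the box around its center and clamps it to the image.

```python
def expand_bbox(box, scale, image_size):
    # box is (x1, y1, x2, y2) in pixels; scale > 1 grows it around its center.
    x1, y1, x2, y2 = box
    width, height = image_size
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * scale / 2
    half_h = (y2 - y1) * scale / 2
    # Clamp to the image so the mask never leaves the canvas.
    return (
        max(0, round(cx - half_w)),
        max(0, round(cy - half_h)),
        min(width, round(cx + half_w)),
        min(height, round(cy + half_h)),
    )

# A face box grown by 50% to take in more of the head.
print(expand_bbox((100, 100, 200, 200), 1.5, (512, 512)))
```

For a whole-head mask on non-human characters, the scale would need tuning per character, since the face box is not always centered on the head.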
that specific eyes detection file is yolov8 aka ultralytics
Or just not be lazy and use img2img inpaint, I guess.
There is no source image. The first four represent the Lora. And the second batch of four images represent the same prompt without the Lora.
yeah i know, so what is your example source image?
like i don't know this character or what it is trying to learn
You mean… the images I trained on?
yes
an example from your dataset
just 1
This is technically one of them, except the one in the dataset isn't transparent.
What do you think?
do you have images with backgrounds (you don't have to share it)? and how many total images do you have?
12 images in all, she's a relatively new character.
12 images isn't going to generalize that well. it really depends on how flexible of an asset you want. you might have better results experimenting with controlnets, ip adapter and attention masks, to recreate the character's look exactly.
your best bet is to be patient and wait until there is more art of the character though.
stuff like your choices of optimizer and such, the net impact is that training goes faster (or slower). if you're patient, your result is going to reflect your dataset one way or another
Well, I managed to generate a pretty good Lora with 13 images once, I just lost the config for it and am trying to make things work.
hmm
i'm saying from a scientific, hard facts point of view, what i am saying is true. you can use basically any configuration, as long as you are patient. does that make sense?
there isn't a lost config idea that will make this "work."
Yes, it does.
Yeah…
So there isn't a definitive config to account for that, in other words?
yeah i think your config looks good. like you did everything right
you can generate more content from those 12 images, to help it generalize stuff like backgrounds, but it's not going to get a pixel-perfect recreation of the artistic concepts without a lot more data. in my experience, for characters that are art-directed like anime and video games, you need something like 100-1,000 unique instances to get the art direction right every time, and closer to 1,000-10,000 to get the exact representations right in the style they were presented in.
so the art direction to me means the character's silhouette, proportions, wardrobe & jewelry, palette
to get the face right perfectly you need a lot more representations. i don't think people in the community are releasing stuff that actually looks like the characters they think they do, because they are biased by what their own eyes focus on when comparing dataset images to generated outputs
something like spiderman probably appears in sdxl's aesthetic laion2b handcrafted database 10,000-100,000 times, at least, and in different styles
another way to think about this is, your training set is really
O(number of artistic scenes k * (number of concepts you want to generalize n CHOOSE number of simultaneous usages of those concepts)). if you are okay with having exactly 1 additional thing you want to generalize in addition to your character, such as "character on a different background" or "character wearing a blue hat" but never "character wearing a blue hat on a different background", you need 10-100 scenes and each one has to illustrate len(background, hat, ..., other concepts)^2 contrasts, so the total is O(k * (n C 2)) ~ O(k * n^2). the real limitation is creating contrasts, not gathering enough different poses, which is just 1 concept.
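As a rough worked example of that estimate, the sketch below just evaluates k * C(n, r). The function name and the sample numbers are assumptions for illustration, and the result is an order-of-magnitude guide, not a rule.

```python
from math import comb

def contrast_count(scenes, concepts, simultaneous):
    # scenes = k, concepts = n, simultaneous = how many concepts combine at once.
    return scenes * comb(concepts, simultaneous)

# 5 concepts (background, hat, ...), used 2 at a time, over 10 or 100 scenes.
print(contrast_count(10, 5, 2))
print(contrast_count(100, 5, 2))
# Allowing 3 simultaneous concepts out of 10 grows the count quickly.
print(contrast_count(100, 10, 3))
```

With 5 concepts taken 2 at a time, C(5, 2) = 10, so 10 to 100 scenes already imply roughly 100 to 1,000 contrasting images.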
am i making sense?
this stuff requires way way way more data than the community thinks it does
I understand.
It's just a shame because I genuinely like this character.
I just can't use any more images because there really aren't any others that work, you know?
You could generate more similar images using style aligned and ip adapter to create the diversity. Also, you could add the character to different background.
For anyone who's trained an SDXL Lora on a person (realistic not anime), what Net Dim and Alpha do you use? I'm struggling to find the right combo. Struggling with all the settings, actually. It seems to be quite a bit harder to find the right settings for training SDXL Loras than it was for 1.5.
dim 16-48 for unet, text encoder can be much less
I keep alpha=1, but you can try higher if you want
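For context on how dim and alpha interact: in the usual LoRA convention (the original LoRA formulation, and to my knowledge kohya's sd-scripts as well), the low-rank update is multiplied by alpha / dim. The sketch below only computes that factor; `lora_scale` is a hypothetical helper name.

```python
def lora_scale(alpha, dim):
    # Usual LoRA convention: the low-rank update is multiplied by alpha / dim.
    return alpha / dim

# alpha=1 with the suggested unet dims 16-48
for dim in (16, 32, 48):
    print(dim, lora_scale(1, dim))

# A rank 64 / alpha 32 setup, for comparison
print(64, lora_scale(32, 64))
```

So with alpha=1 the update is scaled down more as dim grows, which is one reason a learning rate that works for one dim/alpha pair may not transfer to another.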
Honestly the biggest issue I have is that when I train an SDXL lora on the Base XL model, it only works with that model when loaded into A1111. For example, if I tried to use the lora with Juggernaut XL or Realism Engine, it loses most of its resemblance to that person. I used to have the same issue with 1.5 loras until I started training with Photon instead of the base 1.5 model, but with XL I don't really know what would be the equivalent of that.
this is rather a problem of Juggernaut than sdxl
juggernaut is just highly overtrained
however, training on Juggernaut instead of base is also extremely difficult, because juggernaut uses a lot of strange training settings such as pyramid noise. I never succeeded in training on Juggernaut and, therefore, stopped using the model at all
currently I'm mostly using Dreamshaper XL which is not overfitted (all loras work normally on it) while still being good in all styles and photorealism. I bet there are other good models, too, which are not severely overfitted
Agree with most of that, but would add a comment. In general I would simply say that certain trained models have different strengths than others, even the base. I find that juggernaut handles multiple subjects really well as an example. You can get an idea of whether that's the case by how certain loras perform with the various different models, how they draw anatomy varies, how they render detail varies, etc. So I wouldn't necessarily discourage training against some of these models as a rule.
Do you mean that you train your loras using Dreamshaper XL?
I considered it but since it's a turbo model i thought perhaps it wouldn't be the best thing to train on.
no, I train on sdxl base
I said that loras trained on base are transferable to most models, including dreamshaper xl
if they don't work well on particular models (like Juggernaut XL or RealVision) then this might be an indication that this model is overfitted
oh ok, gotcha. I just wish there was a model I could train on that would work as consistently well as Photon did for me with 1.5. You think that my SDXL base model trained lora showing almost no resemblance to the person with any model except for the base model is an indication of overtraining?
no, I said that Juggernaut is overtrained and that's why it doesn't work with your lora
oh, ok. But my lora doesn't work with ANY model besides base SDXL.
people overtrain their models and then they merge them with other overtrained models and at some point nothing works anymore xD
did you try Dreamshaper? Or SDXL Turbo?
yeah got no resemblance with dreamshaper. Honestly I don't like the turbo models. But I've tried all the most popular models and nope, doesn't work. I had the same issue with 1.5 until I started training on a non-base model.
hm, that's strange. Do you use any weird training parameters like pyramid noise or a too high or too low offset noise?
I haven't seen those settings in Kohya SS so I don't think so. I'm not the most experienced person when it comes to training loras.
what's your objective? you want someone's likeness to flexibly appear in sdxl generations? how flexible? and what's the application?
I would like to make a Lora model based on a real person and be able to make realistic looking photos. It doesn't have to be super flexible, at this point I'd be happy just to have it work with any model. Can't even get it to work properly with base model SDXL. I'm not sure what you mean by "what's the application".
"what's the application"
like what kind of realistic looking photos? what is the person doing? or is it unconstrained? like what is the idea? for example, "i want to create a #vanlife instagram except with this personality's likeness" means, okay making it look like instagram is 90% of the way there, but i don't know if juggernaut will ever be able to generate someone sitting on top of a van correctly. juggernaut did put the lady on the van pretty plausibly! or maybe a specific image that you can think of that you would like to recreate this person's likeness in
when you say can't get it to work properly - it depends mostly on your patience. how long have you run a training, on how many images?
to diagnose your fine tuning, take a look at tensorboard.
Oh, ok, I understand. I think just being able to make generic instagram style pics would suffice for me. With SD 1.5 I would make ones with the person camping, standing in urban/suburban areas, backyard photos, pics inside restaurants etc just casual candid style photos. I wouldn't have even bothered with SDXL had I not been intrigued by the fact that it seems easier to pose people differently with SDXL than with 1.5. And as far as the part about not getting it to work properly, I've tried everything from 2000 steps to more than 10000, tried training with a very small photo dataset and a large one, etc doesn't seem to change the fact that the pics come out strange.
just to confirm, how much time does 2,000 steps or 10,000 steps take for you?
and how many photos is a small dataset versus a large one?
you are able to generate images correctly with other community LoRAs, right?
Yes I haven't had any issues with other loras. Small would be anywhere from 10 photos to 20, large would be from 80 to 100. 2000 steps would probably take me about 45 mins to an hour at the most. 10,000 probably 3 to 4 hours.
okay
so for my application, replicating a likeness for flexible creative cinema-style scenes, i use 1,000-10,000 images for a "basic" level of flexibility. A LoRA based training for me is at least 50,000 steps (aka 50 epochs), which is about 16h on my fastest configuration and hardware.
this is what i mean by patience. i think you are probably off by an order of magnitude for the amount of patience you need, even for a LoRA training, on commodity hardware. you should be ready to wait close to a week for 10,000 images.
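The numbers in these two messages are consistent with each other, which a quick calculation shows. This sketch assumes batch size 1, one repeat per image, and the same 50 epochs for the larger dataset; those are assumptions, as are the helper names.

```python
import math

def total_steps(num_images, epochs, batch_size=1):
    # One step per batch, so steps per epoch = ceil(images / batch size).
    return math.ceil(num_images / batch_size) * epochs

def hours_at_rate(steps, ref_steps=50_000, ref_hours=16):
    # Scale linearly from the quoted "50,000 steps is about 16h".
    return steps * ref_hours / ref_steps

small = total_steps(1_000, 50)    # 1,000 images, 50 epochs
large = total_steps(10_000, 50)   # 10,000 images, same epoch count (assumed)
print(small, hours_at_rate(small))
print(large, hours_at_rate(large), hours_at_rate(large) / 24)
```

Under those assumptions, 1,000 images give the quoted 50,000 steps, and 10,000 images give 500,000 steps, about 160 hours, or a bit under a week.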
or. you can choose configuration that happens to work very well for faces, and lean into the fact that people recognize celebrities better than ordinary people, so they're going to be much more forgivable when you don't get the right appearance.
then you can expect things to go faster. but every misconfiguration can be solved by patience
how did you caption 10,000 images?
or is this with no text encoder training?
Oh I think you misunderstood me. I didn't use 10,000 images
Yes. I manually captioned the 100 images myself.
so you should still be spending in the 10s of hours if you are not sure if your configuration is correct
Not sure what you mean.
hmm. well maybe a better question is, are you using prodigy as your optimizer?
Adafactor.
all problems with adafactor can be solved by patience.
if you are impatient, try prodigy
I have patience, it's just that I've probably done about 50 tests in the last week and seeing no improvement. It's just confusing
it's complicated, but the community guides make it sound like 30 minutes is enough time to train something
in my experience, it almost never is
however, you might have some other issues. usually the text encoder learning rate is too high for sdxl. this is an issue that prodigy can deal with for you
prodigy is an optimizer that was overfit, in a sense, for training facial likenesses on instagram style generations
Oh I know it takes many hours. I've basically been running steady tests for the last 8 days straight lol. So if I try prodigy, what learning rate, unet learning rate and text encoder learning rate should I use?
you don't make those choices with prodigy
can you give me an example of a caption you authored?
So with prodigy I should set them all to 1?
i actually never deal with the configuration files of kohya directly, i use the objects, but i think the documentation says exactly what it should be. my guess is it does not matter, given how prodigy works
i am surprised you are getting no improvement whatsoever
so maybe there are some other flaws
8 days is a lot for a casual or enthusiast application too. what is this really for?
i am supportive but it would help me understand your expectations
short of showing the images themselves, which i assume this isn't an anime character, it's a real person
the weird thing is that often the samples look quite good as it's training the model, and then when i load it into a1111, they look poor. Oh I'm just on holidays and recently got a new computer so I've been playing with it a lot, that's all, lol. It's just for fun and I like learning.
hmm
what are you loading into A1111? if you are doing a LoRA training, you will have a file that is ~150-300MB, ending in .safetensors, with _NumEpochs suffixed to it, until it is done
how are you visualizing samples? you mean you configured it to generate something, for which prompts?
another POV is you should be using ComfyUI
yes, I get safetensor files that I load into the Lora folder of A1111. I'm using the sample feature of Kohya SS, where you enter a prompt and it generates a photo every set number of steps. unfortunately comfyUI is way too far out of my comfort zone as a newbie to AI generating.
okay, i mean if the samples it generates are fine, something else is messed up. you will be able to figure out comfyui, it isn't as daunting as it seems
Often the samples look good but not great. And then in A1111, they often look blurred, squished, etc. maybe it's actually an issue with the way I'm generating my pics? I know I'm doing something wrong I'm just trying to figure out what lol.
At what size do you generate?
1024 by 1024
Should be good. If you would have been less than a megapixel then you would have issue like you describe.
have you tried using an sdxl lora from the community?
try this - https://civitai.com/models/188525/pixar-style-sdxl - and copy one of the prompts & settings of a reference image
≥30 steps recommended. Note the difference in color between the 20 vs 30 step examples in the gallery. The strength = 1 used for all example genera...
anyway i think you will figure this out
i gotta go
Text encoder overfits quickly so it's probably a good idea to only train the unet to begin with and see where that gets you.
Oh I can train with only one or the other?
if the sampled results look fine, i think there's an issue with how you are using the web ui
there isn't really a fine tuning bug here anymore
I just tried that "Pixar" Lora you shared above using the same prompts in one of their images, and it did not work. So Maybe it is a problem with my A1111? Also, I didn't say all the samples looked fine, most of them are demented, lol. So I think it's probably a mix of both issues
i appreciate your help, i'll try a few more things and see if anything helps.
Don't give up!
it sounds like you are probably mixed up about which checkpoints you are using where, and why. you are probably mixing sd 1.5 and sdxl settings
I don't think so. I used the exact settings from the prompt they used on the Lora page you gave me. I'm not using any SD 1.5 vae or models, I used the SDXL base model to generate the pic
I thought the same thing at first.
Thank you, I feel like giving up but once I start a project I find it hard to stop til I get it right so I will continue on, lol.
Just wanted to report back and let you know I discovered the solution to both problems: my SDXL photos were coming out so bad in A1111 because I had my CFG turned down to 1 (don't even know how I missed that). As far as my Lora training goes, I went back to my original settings which worked fairly well, and I turned off "don't upscale bucket resolution" which for some reason immensely helped me! My images are now coming out normal and my Lora looks really good. Thanks everyone for your help
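For anyone wondering why CFG 1 matters: in standard classifier-free guidance the sampler combines an unconditional and a conditional noise prediction, and at a scale of 1 the combination collapses to the conditional prediction alone, so nothing gets amplified. A minimal numpy sketch of that formula follows; the arrays are made-up stand-ins for UNet outputs, and `cfg_combine` is a hypothetical name rather than A1111's internal code.

```python
import numpy as np

def cfg_combine(noise_uncond, noise_cond, cfg_scale):
    # Standard classifier-free guidance: push the prediction away from the
    # unconditional output, in the direction of the conditional one.
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)

uncond = np.array([0.10, -0.20, 0.05])  # stand-in for the empty/negative prompt
cond = np.array([0.40, 0.30, -0.10])    # stand-in for the actual prompt

print(cfg_combine(uncond, cond, 1.0))   # same as cond: no guidance applied
print(cfg_combine(uncond, cond, 7.0))   # prompt direction amplified
```

That this was the cause of the blurred, squished outputs is the poster's own finding; the sketch only shows what the setting does to the math.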
Hi ! How can i have the rights to use the bot ?
hi
@stiff dust i am exploring full fine tunes on multiple machines, do you have any experience or opinions about this?
never worked on full finetunes as loras already work fine. Pseudoterminalx works a lot with full finetunes
how can i create an image here?
Hello everyone!
I've trained a whole bunch of LORA's and Checkpoints on real persons/styles/objects/concepts previously with an overall great success, but there was one thing that I was never able to achieve properly.
How can I train a real person, for the purpose of anime generation? I've tried with both SD 1.5 and SDXL and achieved only a semi-success with SD1.5! Has anyone tried something similar before?
For my semi-successful attempt, I trained the NAI model with 25 of my images, removing the background from images that had complex backgrounds. The images featured myself wearing various outfits, and had images in 2 different light settings -frontlit and backlit-, I've also made sure to avoid overtraining by adding 10 full body images, whereas I only use upper body and face close ups for realistic trainings. Overall the images were of high quality, and performed really well when I used the same dataset for a realistic training. I've used booru style captioning, captioning an activation phrase for myself, and then not captioning any details about myself, but only my pose and outfits. During the training, I used a Network Rank of 64, and Alpha of 32. I've used the Adafactor optimizer, and my learning rate for the Text Encoder and UNET was 0.0001 for both.
I was able to generate images featuring myself using this LORA, but the LORA would completely change the style of the base model, usually for the worse. It generally changed the style from a 2D drawing to either a 3D illustration, or in worse cases a 3D Model. The likeness however, was usually great!
Looking forward to hearing your opinions, and tips!
In general that is not a problem. Use validation images with anime or cartoon prompts and check how they behave during training. If you overtrain, it does happen that the images turn into photos. But if you stop early enough this shouldn't happen. Check if your training photos are all captioned with "photo of" or similar tags. I would also reduce the rank from 64 to a smaller rank. In particular for the text encoder, you can use a MUCH smaller rank (like 6 or 8). I found text encoder training very vulnerable to style overfitting (=everything turns into a photo), but I also found it hard to train on unet only. Maybe try to stop text encoder training early enough and continue training with unet only. In general it's hard to get a model that is equally good at both photorealism and anime, so you'd better focus on an anime/cartoon only model
I see. The best LORA in my case was the Epoch2 model for the style; but the likeness was off on that one. I could use the LORA's up to epoch4 depending on the complexity of the prompt; with more complex prompts working better with the higher epochs.
yeah, prompt matters a lot
try not just "anime of [token]" but rather something like "anime illustration of [token] by makoto shinkai and studio ghibli"
(makoto shinkai is a strong prompt modifier, it turns everything quite reliably into anime, although quite realistic drawn anime)
if I want more less-realistic anime I also always add an anime lora additionally to my face lora
but in general I found it easier train my face for drawn styles like anime or cartoon than for photorealism
I see
if your results get better with longer prompts that could also be an indication that your text encoder is overfitted on your character prompt
I always trained text encoder and unet separately and with different dimensions (text encoder only low dimensions like rank 4 ), but I cannot say how much that helped
I generally used prompts like; (1man, masterpiece, best quality, high quality:1,4), brk(my token), and then booru style prompts.
This is for SD1.5 though
I have never seperated the Unet and TE training before. Can it be done on Kohya?
I also thought your network alpha was supposed to be lower than your network rank. In the case I lower the rank to 8, should I have the alpha at 4?
yes, it should. You can also keep alpha at 1
makes training a bit slower, but that might be rather a good thing
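(a quick sketch of why a lower alpha slows things down, assuming the usual LoRA convention where the low-rank update is multiplied by alpha / rank; the function name is just made up for illustration)

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update, assuming the alpha / rank convention."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


# The three settings discussed above: the original run, and the two rank-8 options.
for rank, alpha in [(64, 32), (8, 4), (8, 1)]:
    print(f"rank={rank:>2} alpha={alpha:>2} -> scale={lora_scale(alpha, rank):.3f}")
```

so rank 8 / alpha 4 keeps the same multiplier as rank 64 / alpha 32, while alpha 1 shrinks it to a quarter of that, which at the same lr behaves roughly like a smaller effective step.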
I see! I'll test that out tonight!
I'll keep the rank at 8 and alpha at 1. I'll use the same dataset and the same lr.
Shall I lower the text encoder lr from 0,0001 to 0,00009?
But I think using Adafactor needs me to use the same LR for both TE and Unet
for training unet/te separately: I think newer kohya versions have some stuff like stop TE training after some epochs and so on. But what I usually do is:
- make a training with TE only (--network_train_text_encoder_only) and with very low rank (--network_dim=4)
- train a few epochs, validate how the images look and whether they change style. Stop very early! (Like when the images start turning into a photo at epoch 8, don't use epoch 7, rather use epoch 5)
- start a new training with unet only (--network_train_unet_only) and higher network dim (--network_dim=16), save after one step and immediately cancel training as soon as the output file is written out
- next you merge the two output files (the text encoder one and the unet one). Now you have one output file that contains both
- now you start training again with this output file (--network_weights=myfile) and unet only (--network_train_unet_only)
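(the merge step could look roughly like this. It's only a sketch on plain dicts: reading and writing the actual LoRA files is left out, and the "lora_te" / "lora_unet" key prefixes are an assumption about how the trainer names its modules, so check the keys in your own files first)

```python
from typing import Any, Dict

TE_PREFIX = "lora_te"      # assumed prefix of text-encoder LoRA keys
UNET_PREFIX = "lora_unet"  # assumed prefix of unet LoRA keys


def merge_te_and_unet(te_weights: Dict[str, Any], unet_weights: Dict[str, Any]) -> Dict[str, Any]:
    """Take text-encoder keys from one file and unet keys from the other."""
    merged: Dict[str, Any] = {}
    for key, value in te_weights.items():
        if key.startswith(TE_PREFIX):
            merged[key] = value
    for key, value in unet_weights.items():
        if key.startswith(UNET_PREFIX):
            merged[key] = value
    return merged


# Toy stand-ins for the two loaded state dicts (real values would be tensors).
te_file = {"lora_te_layer0.lora_down.weight": "te-rank-4"}
unet_file = {"lora_unet_block0.lora_down.weight": "unet-rank-16"}
print(sorted(merge_te_and_unet(te_file, unet_file)))
```

anything in the text encoder file that isn't a text encoder key gets dropped on purpose, so a stray unet entry there can't overwrite the fresh unet weights.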
I found lower learning rates not effective. Even with extremely low learning rates the text encoder can overfit, so you can also keep it at 1e-4. I use AdamW, though.
as said, the workflow with separate trainings for unet and text encoder is quite complicated and I don't know if it's necessary at all
but my experience so far was that the text encoder is extremely vulnerable to overfitting, and such a workflow allows you to observe and validate the text encoder during training
Aha
That makes quite a lot of sense
I'll bring the kohya ui up now to see if it has a checkbox for only unet or only te training
It seems not; I guess I'll use the script to start the training then
that would surprise me. I think the UI should have that options - they are quite old
It has a setting for Stop Text Encoder training
But there is also an additional parameters tab, where I should be able to use the arguments you've provided
The results are much better when I train with the images of my girlfriend btw, is this due to there being more female images in anime models?
@stiff dust I've found eyes overfitting way quicker than anything else. Do you have any workaround for that?
does your dataset have drawn images of whom you're trying to train?
like do you have a mix of photographic and anime images of your person to train on? it's okay if the answer is no
regularization should help, but it works better if you already have a large dataset of images
I want to finetune SD on my face and have 8 GB of VRAM. Would a LoRA be the best way to do so? Are there any up to date, ideally simple guides on how to do this?
What does regularization actually do, from my experience it just makes the lora over fit on the regularization images
Isn't regularization supposed to tell the trainer that the images are NOT supposed to look like this
regularization is really poorly explained online, so you have to have a lot of patience with this
before i say anything more, do you program? do you have some experience with probability as an idea in math?
Yes for like 7 years xd
I'm a software engineer
Doctor pangloss is typing up a storm
let's start with what the dreambooth paper actually says about regularization:
Encouraging diversity with prior-preservation loss. Naive fine-tuning can result in overfitting to input image context and subject appearance (e.g. pose). PPL acts as a regularizer that alleviates overfitting and encourages diversity, allowing for more pose variability and appearance diversity.
there is a lot of jargon explaining what this concretely is:
Prior Preservation Loss Ablation
We fine-tune Imagen on 15 subjects from our dataset, with and without our proposed prior preservation loss (PPL). The prior preservation loss seeks to combat language drift and preserve the prior. We compute a prior preservation metric (PRES) by computing the average pairwise DINO embeddings between generated images of random subjects of the prior class and real images of our specific subject. The higher this metric, the more similar random subjects of the class are to our specific subject, indicating collapse of the prior. We report results in Table 3 and observe that PPL substantially counteracts language drift and helps retain the ability to generate diverse images of the prior class. Additionally, we compute a diversity metric (DIV) using the average LPIPS [73] cosine similarity between generated images of same subject with same prompt. We observe that our model trained with PPL achieves higher diversity (with slightly diminished subject fidelity), which can also be observed qualitatively in Figure 6, where our model trained with PPL overfits less to the environment of the reference images and can generate the dog in more diverse poses and articulations.
Do you have something better formatted
lol well... listen it's not super informative. the key thing is that regularization should really be called "prior preservation loss"
once you have this, here's a good explanation from hugging face:
Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions.
hmm
so what is really concretely happening? it comes down to understanding the role of a text encoder, why it's called a "conditional" unet, why the word "conditioning" is used, and why "classifier-free guidance" is used
basically a bunch of things about probability. it's actually essential to what "denoising" is, and what you are actually training
I know that the text encoder decides which vector to give each word
And that textual inversion works by figuring out a magic vector that gives you the image you want
suppose you knew what all these things meant. prior preservation loss is a way to ensure that conditioned denoising on [X, Y, Z] where you are giving examples of [Z] does not make the original [X, Y] less likely.
Isn't denoising just the process of transforming the pixels into something else, hopefully an arrangement you like?
in other words, when you write a caption and provide an example image for:
taylor swift at a football game with many football players
you will want to create a regularization caption and image
a football game with many football players
which can even be an image that the generation process itself creates - that's what the dreambooth paper does, and that's where the images from the regularization github repos that the community uses come from
denoising is the forward pass of the neural network. it takes a slightly noisier image in latent space, and then makes it slightly less noisy in latent space.
so the thing you are training is adding a bunch of noise to the VAE[your training image], making that your output, and then adding slightly more random noise, making that your input, and then backpropagating
okay - so you can see how if you are generally making all noisier images less noisy in the direction of your training images, you will make taylor swift appear "everywhere"
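(a toy sketch of the loss being described, using the common noise-prediction formulation: noise a latent, ask the model to predict that noise, and add a second, weighted term for the regularization pair. Everything here is illustrative: the latents are short lists of floats instead of VAE outputs, `predict_noise` is a stand-in for the unet, and `signal_rate` stands in for a random timestep)

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def add_noise(latent, noise, signal_rate):
    """Blend a clean latent with noise; signal_rate near 1.0 means almost clean."""
    noise_rate = (1.0 - signal_rate ** 2) ** 0.5
    return [signal_rate * x + noise_rate * n for x, n in zip(latent, noise)]


def dreambooth_loss(predict_noise, instance, prior, prior_weight=1.0, rng=None):
    """Instance loss plus prior_weight times the prior-preservation loss."""
    rng = rng or random.Random()
    total = 0.0
    for (latent, caption), weight in ((instance, 1.0), (prior, prior_weight)):
        noise = [rng.gauss(0.0, 1.0) for _ in latent]
        signal_rate = rng.uniform(0.05, 0.95)  # stands in for a random timestep
        noisy = add_noise(latent, noise, signal_rate)
        total += weight * mse(predict_noise(noisy, caption), noise)
    return total


def untrained_model(noisy_latent, caption):
    """Hypothetical model that always predicts zero noise."""
    return [0.0] * len(noisy_latent)


instance = ([0.2, -0.1, 0.4, 0.0], "taylor swift at a football game with many football players")
prior = ([0.1, 0.3, -0.2, 0.5], "a football game with many football players")
print(dreambooth_loss(untrained_model, instance, prior, rng=random.Random(0)))
```

with prior_weight=0 only the taylor swift pair pulls on the weights; the second term is what keeps "a football game with many football players" on its own from drifting toward her.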
Eh my issues may be because of other mistakes i made
does that make sense?
Wait i gotta read that again
when people say "overfitting" this is what they actually mean
you have to also think about what is happening in the beginning of sampling - aka, when you set your ksampler in comfyui steps to 50, what is happening at steps 0-5 versus steps 45-50
this also touches on what the meaning of your sampler choice is - why dpm... 3 is "Better" (really preferred for certain outputs) compared to euler A
that's what sdxl turbo does
all of these things are related. the underlying reason it's so hard to understand regularization is that it directly relates to the arcane details of image generation
My issues were more that i was (to hold with the same analogy) training for taylor swift, and all my images were of her at football games. Without regularization, my Lora thought that taylor swift is football studios, and with regularization images of a bunch of football studios, my Lora model thought that taylor swift is football studios, faster. Basically, 'overfitting' faster. But do note that in this test case i did not use any captioning
here's an analogy i've been auditioning:
diffuser:
set your grid size to 1x1.
visit each grid square. roll dice. if it's a 1,2, or 3, do nothing. if it's 4,5, or 6, use the dice to consult a big book that maps dice rolls and your grid point to a color you should paint in your grid square.
increase your grid size to 2x2.
visit each grid square. roll dice, if it's a 1,2, or 3, do nothing. if it's a 4, 5 or 6, use the dice to consult a big book that maps dice rolls and your grid point to a color you should paint in your grid square.
....
training is determining the contents of that big book.
diffuser and conditioning: let's introduce conditioning:
draw a picture of a "cat"
increase your grid size to NxN.
visit each grid square. roll dice, if it's a 1,2, or 3, consult your book as usual, which maps dice rolls and your grid point to a color you should paint. if it's a 4, 5 or 6, reroll with dice weighted for cat.
CFG: is the number on the dice you switch between unweighted and weighted dice
control net is a different kind of dice weighting
this is a diffuser with conditioning
okay, now let's introduce captionless training: you're modifying the contents of the book to paint the grid squares more frequently towards your data set.
that's it
now let's introduce captioned (i.e. text encoder) training: you're also modifying the weights of all the dice in your caption.
so when you have cat and hat in a caption, you can devise a training method where you don't want to change the weights of hat "as much" as you change the weights of cat.
that is what prior preservation loss quite literally is
Okay, so what is the way to eliminate bias in datasets
this way you can use images of your cat in a hat for training the look of your specific cat, AND make sure to generalize your cat wearing other stuff - because hat's conditioning, when using CLIP, contributes to other conditioning relating to things on top of people's heads
the only way to eliminate bias in datasets is to make them larger
E.g. many models are biased on white women
ultimately the goal is to preserve the generalization power of pretraining aka having this nice, generalized checkpoint of weights
a simple strategy is to provide contrastive examples directly in your data. this is different than dreambooth
What if I ONLY have images of taylor swift in football studios
Is there no way to tell the trainer "this i don't want"
PPL is a "shortcut" that was designed just for dreambooth. the achievement was using just a few images to get a diffuser to learn a detailed look and feel of something
that is kind of* what regularization is
Argh
Eyo you guys basically started a topic from my question
I was able to get much more stylized results by lowering the network rank, and using 100 reg images.
The problem is now that the model is kinda overfitting on the reg images itself.
EXACTLY
I also stopped the text encoder at 30%
It works perfectly on me this time, a male
But my girlfriends features get mixed up
clearly regularization is far more than "this i don't want"
the community focuses a LOT on getting an exact human face that generalizes to many instagram-style casual full body pose photography.
dreambooth was not designed for this. the best thing to do, if that is your goal, is to just wait.
I'll make another attempt, stopping the text encoder at 50%
IP Adapter and similar is dealing with this issue directly
Oh no I'm not trying to get realistic results.
I'm trying to see how I can turn a male and a female real person to anime style.
i guess what i am saying is that IP adapter can also do that
It sucked in stylizing images in my case
dreambooth is too general, it doesn't deal with the salient issues of "perceptual biases when looking at faces of humans"
i think it is challenging to use, but if your goal is to make stylized human faces, ipadapter is the framework of today and the future
When training loras, say i have 50 tags on each image, and then prepend my keyword to the tags. Would that drown my keyword, so I should reduce the other tags, or will it recognize the keyword fine
dreambooth is a training shortcut, is another way to think about it
just like lora is
LoRA is a cheaper-to-train "full fine tune"
with enough data, it matters less
But it does matter, for say 100 images
i think the hardest thing is to accept that many community members have "loras" that "look like someone" essentially by accident
100 images is too little data "in general," but if you are generating images for a celebrity who isn't a TV personality but also appears "a lot" in the base checkpoint in general, people's perceptual biases will help in accepting the likeness.
do you see?
there's a reason the dreambooth paper does not do people and does dogs and shit
Does having more images make it understand the concept in less steps
well having more images definitely increases steps
Or is it just for diversity
Not necessarily, since guides say to use 1000 steps no matter what
i can't speak for all community guides
So divide 1000 by count to get repeat
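(that rule of thumb as arithmetic. A sketch only: it assumes the trainer counts steps as images x repeats / batch size per epoch and ignores things like regularization images, which some trainers count on top; the function names are made up)

```python
import math


def repeats_for_target(image_count, target_steps=1000, epochs=1, batch_size=1):
    """Smallest repeat count that reaches at least target_steps."""
    return max(1, math.ceil(target_steps * batch_size / (image_count * epochs)))


def total_steps(image_count, repeats, epochs=1, batch_size=1):
    """Optimizer steps for a run, assuming one step per batch."""
    return math.ceil(image_count * repeats / batch_size) * epochs


for count in (25, 54, 100):
    repeats = repeats_for_target(count)
    print(count, "images ->", repeats, "repeats ->", total_steps(count, repeats), "steps")
```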
they are "darwin fine tuned"
the ones you hear about that go viral accidentally worked for accidental configuration for accidentally common use cases
yes.
lol
i mean what can i say? you are following the guides to the T, i'm sure, and getting crap results
and you're probably wondering why
it either works or it doesn't
it doesn't provide guidance really on getting any closer
because it can't create more training images for you
and it can't convince you to be more patient - indeed, it does the opposite! because the community is very impatient
something that says "train for a lower learning rate, on more steps, on more images" would not be very popular even if it were good
It does provide guidance, it says what you can do to improve results
well
like i said you've tried it
there's a lot of emphasis on configuration choices. if you followed my strategy (wait) you would use prodigy
Yeah but my results are not good enough and I don't really understand them
I use prodigy
prodigy is a very smart thing, and it is also darwinian-fine-tuned on training images of celebrities
it didn't exist 3 months ago or whatever
Which the guide suggested
Does putting caption files next to regularization images do anything
let me put it this way - the fact that prodigy exists means that a lot of so-called best practices parameter selection didn't matter
Or is it purely psychological
hmm
Hmm, I need you guys opinion on the likeness
Is it allowed to share a generation and then a real photograph?
Completely sfw of course
actual dreambooth and the objects inside kohya's scripts, which are essentially formulaic, do use captions with regularization aka dreambooth training aka fine tuning conditioning together with the unet
i don't know what happens when you modify a config.toml, or what you put in which directories, because i don't use kohya's scripts that way
let's step back a bit though. your results will be crap if your dataset is crap
end of story
What makes a dataset good
is your goal to flexibly present a non-celebrity person in casual instagram style generations?
no
you did
lol
Difficult question to answer
okay well let me talk about a real goal of mine, that's actually super hard
so maybe that's more interesting
one thing i am trying to do is introduce the concept of a place to diffusers. you should be able to express with words a subplace of a place - for example, "behind the big ben", and correctly get a shot from behind the big ben.
that's an easy one, right? the thing is fucking symmetrical
how about "from bathroom in long hall in cs_office"
someone who has played counterstrike, incredibly, can visualize exactly everywhere, in all the POVs, from inside cs_office
and the words are more than sufficient enough to put a very narrow possible set of valid POVs
and you can close your eyes and get coherence between these places and shots without motion
Is the lora supposed to to place a person behind the big ben, or in any arbitrary place near the big ben
so the dataset will probably need 10,000-100,000 images of POV shots of cs_office. for regularization, i would show it a vast amount of other locations, but NOT offices
i think a lora for monuments is actually really easy, since they are already iconic
and that's why 30 images of the "big ben" would be sufficient. anyway, it already has a ton of big bens in the dataset
the way a dreambooth fine tuning would help is to make all towers you generate look like the big ben
it could also help with generating unaesthetic images of the big ben, especially ones that albumentations doesn't do
for example, dall-e3, which is trained on incredible amounts of synthetic data, can't do negative space due to its aesthetic synthetic data
it CAN do white space when you ask it to make creative art
this is to say that everything is dependent on your training data
i should say that it struggles with making white space or similar concepts when you ask for it
they created a huge amount of synthetic data for localization concepts, and you'll see that you'll create a lot of flawed situations if you take it even slightly outside the bounds of the synthetic data
Hey continuing from yesterdays talk
Reg images are definitely not "what you don't want to generate"
Had like 6 different attempts since yesterday
And the best result is when I generated images that looked somewhat like myself using an anime model for the reg images
For context I'm trying to train a lora on my likeness, for anime generation purposes
regularization images are a strategy from dreambooth to improve generalization in conditioning. use a regularization image containing text prompts that are unrelated to your target training concept (like your likeness) and which you do not explicitly want associated with your concept.
this is confusing because the community goal is almost always "make my likeness in single character anime card portrait illustrations" or "make this person appear in instagram style casual photographs"
let's say your goal is to train the likeness of some person who doesn't appear in the training set at all. Let's say this person is Michelle Williams. you want to generate casual instagram images of this person.
let's say you have basically 3 photos of michelle williams wearing different outfits.
all of your captions should be "michelle williams, a woman wearing" followed by a bit of mundane details about the outfits.
you should not use regularization at all!
you actually do want all women who are ever generated by stable diffusion to look like michelle williams, because your goal is to generate instagram casual single person portraits.
you don't want regularization for the outfits because wherever those outfits appear on other women, you want them to "morph" into whatever meager appearance / fit you have for the three michelle williams examples.
does this make sense?
so @zenith delta in your concrete example you should not use regularization at all, and you should most definitely not use the regularization github repo which will really achieve the opposite of what you want to achieve
since you will always be rendering your likeness as anime, you want all likenesses to look like you. if you want someone else's likeness, you would disable your lora.
How did using sd 1.5 as base go
I wonder if there's a modified variant of lora that calculates loss to maximize similarity to the training data and minimize similarity to negative data
yes, concept slider loras are implemented that way
could you point me towards a resource i could learn from?
aha i found this https://github.com/rohitgandikota/sliders
Fine
@stiff dust It came down to exactly this! In my case, whenever I trained without any reg images, the style would go from anime to 3d model. So reg images were a must.
But what I did was: I used the LORA I trained on my partner's likeness with low weight to generate anime style images that looked "somewhat" like her.
Then I repeated the training, using those images as the reg images. I also changed a few parameters, and voila! It worked wonders!
I'll now do the same thing on my own likeness! If anyone is interested in the details feel free to pm me!
why not use IP adapter though? it's purpose built for this use case
you can use a community anime checkpoint instead of an anime lora - they are the same thing in terms of what the process is, but the lora is a way to save money and time, not a way to get better results
IP Adapter doesn't work well for me in most cases. I guess I'm not that good with ControlNet
Im doing exactly that. I've trained the LORA's on my partners and my own likeness and use community models to make generations.
interesting. It might be that training on close-but-different reg images helps cfg, as the way cfg works is to enhance the difference between conditioned (your character) and unconditioned (general human faces). So the closer the prior is to the real data, the better cfg might work. That's also why a tiny bit of caption dropout might improve results.
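(for reference, the standard classifier-free guidance combination, plus a caption dropout helper, as a small sketch. The predictions are short lists of floats standing in for the unet's outputs, and both function names are made up for illustration)

```python
import random


def cfg_combine(uncond, cond, guidance_scale):
    """Push the prediction away from unconditioned and toward conditioned."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def maybe_drop_caption(caption, dropout_rate, rng=random):
    """With probability dropout_rate, train on an empty caption instead."""
    return "" if rng.random() < dropout_rate else caption


uncond = [0.0, 1.0]  # prediction for the empty prompt
cond = [1.0, 1.0]    # prediction for the character prompt
for scale in (0.0, 1.0, 7.5):
    print(scale, cfg_combine(uncond, cond, scale))
```

at scale 1 the result is just the conditioned prediction, and higher scales amplify only the part where the two predictions differ. Caption dropout is what gives the model an unconditioned prediction to compare against in the first place.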
Definitely something to research deeper!
i'm not sure. it sounds like it didn't have any effect at all, and something else changed. it doesn't make any sense
i think the user meant that there's a LORA that does "anime style"
and the regularization images weren't used at all, and it just trained more
@zenith delta if you had actually correctly used the regularization images in the manner you described you would get terrible results
i keep on having the image mess up with extra body parts and nipples, i don't know how to fix it
Hi, I've been trying out this epicrealism_pureevolutionv5-inpainting inpainting model to generate realistic backgrounds for people and objects, however, I was also wondering if I could finetune this to make it work better for certain things like countertops. Does anybody have any advice on how to go about doing this?
Oh no. Let me explain with a little more detail.
What I wanted to do was to train a LORA with a real persons likeness; that would be able to generate the trained person, using an community anime checkpoint, without changing the style.
If I trained a LORA on a real person's likeness without changing any of the parameters, the model would quickly start to override the style of the checkpoint I was using.
So then, I tried using regularization images. I was training on AnyLora, and therefore, needed anime styled images for my reg's. I went ahead and generated the images and made another training.
This LORA, was able to generate the images without changing the style of the checkpoint I was using, but the likeness took a massive hit. This LORA was also trained with a really low rank/ and 1 alpha.
I've tried multiple different settings at this point, but none really helped me.
Yesterday I tried generating some images with the LORA I trained on the likeness of my girlfriend, without any reg, and with realistic LORA training parameters (high rank, high res). The style was a bit off, but by using really low weights of about 0.1-0.2, I was able to get some images that looked like my girlfriend and were also in the style I wanted.
I generated a whole regularization dataset this way, and then trained another LORA, once again on AnyLORA checkpoint. This time however; I increased the network rank from 32 to 128, as if I was training for a realistic model. I was relying on the reg images to not overtrain.
This final LORA; turned out almost perfect! It generates perfectly stylized images, and pretty high likeness for the anime style. I'll improve this LORA by training another LORA with the same images; but cropped 512x768. Then I'll merge the 768x768 LORA with the 512x768 LORA to increase the flexibility.
The reg images had an effect for sure. Because when I used the same images/same parametres for the training without them; the LORA was changing the style already at Epoch4-5. The LORA I trained with the reg images works perfectly between Epoch18-20!
I do not want to share the images with this LORA; as it is trained on my girlfriends likeness. But I'll share the images when I make the same training on my own likeness.
I'm looking to fine-tune SDXL with Lora. I'm wondering if I should go with TPU or GPU for this task. Can anyone tell me which one would give me better performance and be more cost-efficient?
hm, super weird. To be honest, I never had a problem training on face for different styles. So all this workarounds you did shouldn't be necessary. I would train on base model (not on anime model) and then apply your lora on the anime model. Ranks of 128 are way too high in my opinion. I would try lower ranks, but I have mostly experience with SDXL. Also, you have to be careful with text encoder training as it can overfit very quickly. Maybe the reason why your style is taking over is because you train the text encoder too much. Try training only the unet (or use textual inversion on combination with unet training)
but anyways, if you are happy with your results that's also fine
even with the least optimized configuration, such as by using a huggingface space GPU accelerated with a drag-and-drop-to-lora script with prodigy, it would cost like $4 for 10000 steps. i wouldn't stress too much about the kopecks you'd save here or there
is dreambooth still the go-to for SD 1.5 finetuning?
checkpoint finetuning, not LoRA
Do I need to create caption for regulation images?
yes
yes
why is pixart-a trained on so few images
less cost, might be
I made a customGPT in chatGPT that gives good descriptions of a good length. Try it out, if you have any suggestions of what can be better let me know
https://chat.openai.com/g/g-EQFzMkKHZ-image-descriptor
You can upload multiple images at once, you don't need to add any text. It will just give you a zip file with the descriptions in the txt files
I think if we try to make a big dataset out of this to make a better captioner we could get SD to the level of dalle3
(You need chatGPTplus to use this) let me know if you want to help compile a good dataset.
You can caption up to 400 images every 3hours with this due to chatgpt limits
Can you make it on Hugging Face
How can I train a model with few images
No, unfortunately this is a closed model because it's on OpenAI. If we gather enough captions from this we could finetune an open source CogVLM model, but that's not something I would be able to do alone: gathering the captions would take a long time (it needs a very large dataset, 400k, probably more) and finetuning the model would cost too much money for me
cogvlm is outstanding as it is
Yup it's really good, better than ShareGPT4V
But cogvlm is kinda GPU intensive
yes, but its sort of one of those one-time costs
Yea true, but captioning millions is kinda tough
training millions is as well
question: why are a lot of my samples in orientations that are not reflective of my database? all of them are sideview shots but im getting front view in the samples...
It would be pretty costly though
Cuz captioning each image on a 3090/4090 would take around 10sec
Captioning 1 million would take around 4months
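(the arithmetic behind that estimate, plus the same calculation for the chatGPT route at 400 images per 3 hours mentioned earlier. The per-image time and rate limit are the numbers quoted in this chat, not measurements; the function names are made up)

```python
SECONDS_PER_DAY = 24 * 60 * 60


def captioning_days(image_count, seconds_per_image):
    """Days of nonstop captioning on a single GPU."""
    return image_count * seconds_per_image / SECONDS_PER_DAY


def rate_limited_days(image_count, images_per_window, window_hours):
    """Days needed when a service only allows a fixed number of images per time window."""
    return image_count / images_per_window * window_hours / 24


print(f"1M images at 10 s each: {captioning_days(1_000_000, 10):.0f} days")
print(f"400k images at 400 per 3 h: {rate_limited_days(400_000, 400, 3):.0f} days")
```

so a single card at 10 seconds per image needs about 116 days for a million images, just under 4 months, and the rate-limited route needs about 125 days for 400k captions.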
not much you can do with 1 million images and just one 3090
Yea, 1million images aren't really useful
Can I DM?
is this over training?
are these good dataset images? 512
I have 54 images so far
they are all sideview and ive tried a set without skids and it gave me images with skids and a front view...
maybe incorrect labeling then ?
Also there are probably already too much stuff linked to "helicopter" in the model already. Maybe try labelling your helicopters as MLGCopter or something like that.
just a thought
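(a small sketch of that relabelling idea, working on caption strings. "MLGCopter" is the example token from the message above; the function name and the simple plural handling are made up for illustration)

```python
import re


def relabel(caption, common_word, unique_token):
    """Replace a word the base model already knows with a unique trigger token."""
    pattern = rf"\b{re.escape(common_word)}s?\b"  # also catches a simple plural
    return re.sub(pattern, unique_token, caption, flags=re.IGNORECASE)


captions = [
    "a helicopter on a green background, side view",
    "two Helicopters in flight, side view",
]
for caption in captions:
    print(relabel(caption, "helicopter", "MLGCopter"))
```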
things like helicopters and airplanes are pretty rough to train, maybe due to all the appendages they have, but trying to just do the side shots is smart, if your screenshot is truly representing that fact
ok it was in fact my keywords getting influenced by 1.5 models
also how do you use this graph? when it flattens out what does that mean when training?
using constant
does it mean that after 30 steps its not learning???
learning rate is an input not an output
my guess is you'll simply be challenged to get good helicopters, but I'd probably worry more about your data than anything, both the actual images and the captions you use. Add more data and experiment with how you label them.
you are training SD1.5?
lora or fine tune?
im doing a model, right now im using a small dataset with 5img just to play around with keywords. im using dreambooth with 1.5 as a base
so far im getting better results
better results compared to what?
better results with the unique keywords.
more sideviews. this is a 50 epoch model
https://github.com/jhc13/taggui
(its been recommended here once before)
Can vouch for this. good tagging tool + does cogvlm and other vlms really well on windows
Is it possible to train a style with an alpha channel?
Does anybody know why it could be, that its like stable diffusion is changing seed every some frames?
What is your objective?
You would need a lot of data, computational resources and patience
What is your goal?
my goal was to create sideview concept art for helicopters
Ive gotten good results after changing keywords to be more unique that didnt call on the base 1.5 model which would add dross I didnt want
But these are photographs and renders of helicopters
im not seeing the error, could you explain further why this is a poor dataset?
concept art usually means silhouettes, sketches and such
and something creative and zany
things like the motion blurred rotor do not achieve your goal
and you're training on a lot of motion blurred rotors and stuff that's on green
img2img, I wanna convert images to that style
yeah I can understand the blur being bad. I can disable blur on the renders but most IRL images are done while in flight, on the ground the rotors sag and ruin the fuselage, the concept art can be done by hand in blender or gimp but the point of this concept art dataset is just to create ideas
Id also want to try blueprint 3views or other similar vector drawings
what precisely are you trying to train for? like what do you want it to learn?
training on 50 images will make it generalize less compared to the base model, not more
and be less creative
engine placement, fuselage shape, cockpits, landing gear, tail boom, rotor system
mmm kind of, Im looking at it more as a tool, so recursive in use: you choose a few rough designs and expand on them with different features in a photo editor, picking out features and exchanging them, and maybe later if need be focus more on detail. but having detailed panel lines is not a huge concern for me, more realistic in nature and less outlandish
more realistic in nature and less outlandish
okay i think you gotta firm up your brief lol
it sounds like you don't know what you want
you want your exact helicopter on green background renderings, except "better"
hmm I want it to take those photos and mash a totally new design that looks like it would have been manufactured
okay, so you want the opposite of concept art
you want photorealistic side view shots of helicopters that you plan to dissect and kitbash
into a new helicopter side view photorealistic shot, but fundamentally a pretty conventional one
a photocollage
it's too bad because i really like the concept art helicopter pipeline i just made
not against other perspective shots, but I think that focusing on orthographic views is more controlled
you basically want this
with maybe more variety in the helicopter body plan
@thorny gazelle does that seem right?
it's just hard to reconcile creative and something you can kitbash, but anyway, i think you should try to achieve this some other way. a lora will not help you. the most challenging thing for it to learn is "side view"
yeah more like what im after, im also not making lora. just a 1.5 model since dreambooth hates lora or I havent found a way to make it work... I need to look into kohya ss when I get more time. ive made my own unique keywords that is unfamiliar with the base 1.5, so far its given me sideview shots since its all it knows, sideview is also the only keyword that is not unique
it's too bad discord strips metadata
such as?
but here's all you need to reproduce the workflow
noise in clip gives you creative concepts. noise in latents gives you creative silhouettes
i havent gone too far into the rabbit hole, what am I supposed to use this json file in?
oh ok, I haven't installed comfy yet, just auto1111. thanks for your time btw
Hello, does anyone know how to become an authorized user for finetuning on Stability ais API?
anyone has a tutorial for finetuning xl with hundreds/thousands of images and multiple prompts?
I want to show my girlfriend the power of ai because she doesn't believe it can be that good. so she challenged me to make realistic pictures of her that are nsfw, but she doesn't want me to use her nude pictures to train the ai for privacy reasons
do you have the hardware or budget to do that
and have you ever used python
Can I get help?
i think there are a lot of guides to help you do what you want to do
But where? I am very new to stable diffusion
^ can you two continue this convo in DM? We try not to promote/chat on any NSFW on the server
okay
Can I get some recommended settings for training a subject LoRA in Kohya? I am having a hell of a time with this.
I think I burned my cookies.
It says I can't dm you
have you had any success generating images well?
before you jump into training
Yes. For a few years.
Train a lora to give me camera-headed people. I've harvested a ton of images from Dall-E because it understands the prompt fairly well, photoshopped a few, and gently reprocessed and blended everything in juggv6.
Got a few over 100 images, learning rate 5e-5
40 epochs, but I'll just stop it once it starts working.
okay, do you want to turn every person into a camera headed person?
you can try doing regularization images of actual cameras
regularization images of people would be the opposite of what you want
because it looks like most of the flaws are in the camera itself
so you don't want it to learn dall-e3's flawed representation of cameras
you should also consider a full fine tune instead of a lora. if you have patience and a 24gb+ card
I'm not sure what that would entail. I am a slow learner, but I do have that big GPU energy.
Is anyone able to help me in a DM or guide me somehow? I don't know what I am doing
i think you should start with some regularization images of real photographs of cameras. you should also improve all your prompts to include mundane details
you should read a lot of coco-style captions
so you can see what CLIP was actually trained on
sure, i can code
have you successfully generated complex images using something like comfyui?
i don't mean loading a workflow that someone else authored. i mean like, completing a brief
never used comfy
hmm
i think this is going to be a stretch
you have to learn what all the parameters actually mean, and what is going on
if you wanna do something innovative
if you can find a guide for exactly what you want, great
wtf, dude. This story is so fucked up. This should not be a place where people learn how to make fake porn -_-
discord hasn't been stripping metadata from images for almost a year btw.
Hey, we are into what we are into. If there is a better place please let me know
Psychiatry? Jail?
I don't care if you want to generate porn, but don't abuse other people by putting their face into porn.
She literally asked for it
maybe just augment your dataset with a lot of real images of cameras
a bit late to reply. but if you up your dataset to roughly 400 images, then you can just brute force the lora. at 400 images, the settings start to become a bit less important. just use adamw + 1e-4 unet lr, 5e-5 te lr + batch 7 + min snr 5 + offset noise 0.1 + dim 32 alpha 1
for fully automated tagging use cogvlm via this app:
https://github.com/jhc13/taggui
dont forget to add a triggerword to the whole thing. something like "camhead" which doesn't have any highly biased words in it
save every 5 epochs. epoch 45~60 will be your target epochs for finished lora
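For reference, the settings above could be written as a kohya sd-scripts argument list roughly like this. It is a sketch, not a tested recipe: the flag names are the ones sd-scripts uses as far as I know, the learning rates are taken as 1e-4 (U-Net) and 5e-5 (text encoder), and the 60-epoch cap is an assumption taken from the upper end of the suggested range.

```python
from typing import Dict, List

# Settings from the messages above, keyed by the sd-scripts flag they are assumed to map to.
settings: Dict[str, object] = {
    "optimizer_type": "AdamW",
    "unet_lr": 1e-4,
    "text_encoder_lr": 5e-5,
    "train_batch_size": 7,
    "min_snr_gamma": 5,
    "noise_offset": 0.1,
    "network_dim": 32,
    "network_alpha": 1,
    "save_every_n_epochs": 5,
    "max_train_epochs": 60,  # assumption: upper end of the 45~60 target range
}


def to_cli_args(cfg: Dict[str, object]) -> List[str]:
    """Flatten a settings dict into ['--flag', 'value', ...] pairs."""
    args: List[str] = []
    for flag, value in cfg.items():
        args += [f"--{flag}", str(value)]
    return args


print(" ".join(to_cli_args(settings)))
```

Model path, dataset folders and resolution are left out on purpose, since the chat does not specify them.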
sdxl doesnt do nudity, since it was censored. Talks of how to avoid this censoring aren't allowed on this server. There are guides, but this discord isn't the place to ask about it due to this being the official workplace for stability.ai staff.
ah, my bad. didnt see that fruit already replied earlier
Does anyone know how to disable RAM usage when VRAM is maxed on NVIDIA cards? I remember reading something about that being an option on the latest driver update but can't find it.
Thank you
Hello. I am new to the stable diffusion world and I tried making a lora of an art style but nothing I do seems to make it work. I used about 234 images from an artist and the style doesn't come through.
I am using Anime Art Diffusion XL checkpoint and this is what I get
I am trying to get the art style of My700 with the lora I trained
genuinely curious why you recommend adamw over an adaptive optimizer.
the short answer: to avoid overfitting, and achieve the best lora I can possibly make
For context, all of this refers only to sdxl.
the long answer: this will get a bit technical...
we're gonna separate this into 3 groups.
• AdamW + offshoots like AdamW8bit
• adafactor + similar adaptive ones like dadapt
• prodigy
Resources
• AdamW + constant does the math directly, and correctly. not nearly as much approximation going on. This has the downside of using more gpu, and basically sets a barrier of entry of 16gb vram, and can be used most efficiently with 24gb vram
• Adafactor does a lot of approximation, hence requiring less gpu. With every vram saving technique applied that exists you can get the barrier of entry down to 8gb vram, if you're willing to only do style loras. or 10gb vram for any kind of lora.
• Prodigy gets complicated, since you can get the vram requirements down, but doing so essentially moots the point of using prodigy in the first place. If your goal is to avoid overfitting, then prodigy has a barrier of entry of 24gb vram. If you use the methods that require less vram, then you're better off switching to adamW, since you'll get significantly better results with little more effort. Ideally you want a shit ton of resources if you wanna use prodigy efficiently (like 40 or 80gb vram)
Conceptual Complexity
• AdamW - If you're teaching one single concept, then you wont have any downsides with AdamW. If you're teaching multiple concepts (26 in case of my dnd lora) then you're best off with AdamW thanks to its consistency.
• Adafactor - if you want a quick and dirty lora, then adafactor will work just fine. If you care about the nuances of overfitting, then you'll quickly hit an upper ceiling, especially once you deal with more and more concepts.
• Prodigy - Can equal the quality of AdamW, without any of the knowledge required to make it work. The downside? An inhumane amount of resources used. If you use it via low vram requirements, then you're just turning it into an adafactor alternative, at which point, you could just use adafactor to begin with.
Actual Results
Theory crafting is all well and good, but results speak louder.
So I've trained the same lora cross testing just about every setting in kohya. I've done all my big loras in prodigy, adafactor, AdamW & AdamW8bit
Going simply by results, AdamW wins every time. Prodigy also works consistently and I have nothing bad to say about it, though I've stopped using it since I cant exactly tell people to get more vram, while with adamW my techniques at least work on 16gb vram environments.
Adafactor loses every time, and I only ever recommend it if you're running on 10~12gb vram environments
Misinformation
So most tutorials recommend adafactor, which can be traced back to the first tutorials during sdxl 0.9 release when SECourses made his youtube videos and declared his methods as "the best" and that he tried all the settings and these work best. When in fact he tried only a few settings on a single dataset of himself. Due to a fundamental misunderstanding of how network rank works, he arrived at the conclusion that adafactor works + net rank of 256. Both of which are the worst possible options, in general, but give the illusion of working. But from there, it formed a culture of using adafactor since misinformation spreads fast. Nowadays it's getting better, but people are still using adafactor without knowing what differentiates it from the rest, and when to use it vs when not to use it.
#sdxl message
^ DND lora.
26 core concepts, which have been completely retaught
around 100 minor concepts, which have been merely influenced (like hands always having 5 fingers, pupils being round, etc... )
I attached a list of all the major concepts ("Indian" was taught as well, but doesnt show up in that list)
green wasn't taught, that's just for statistics, so I can verify I'm not accidentally teaching a bias towards any gender
has a roughly 80% success rate. meaning if you generate 5 images, from seeds 1~5, then 4 of those will be "actually useable", and would qualify to be printed and actually used in a game of DND. (All the images that I linked from general chat are from Seed 1, just to make a point)
All of this, on top of the default sdxl base. No checkpoints or other loras applied.
Its also not working by overfitting, as once you add a new concept it wasnt trained on, it still works. And as a few friends already tested it, it works with subject loras, to translate any subject into a "dnd portrait" and keep up the style.
For full transparency, I attached the training settings, which can be used with derrian distros lora training (just a different gui for kohya backend)
@shy basalt everything that @hollow spruce said aligns with my choice of parameters almost exactly. especially
at 400 images, the settings start to become a bit less important.
the better the dataset and the longer your patience, the less the settings matter.
Due to a fundamental misunderstanding of how network rank works, he arrived at the conclusion that adafactor works + net rank of 256. Both of which are the worst possible options, in general, but give the illusion of working.
i also agree with this
i am surprised you haven't tried full fine tunes
if you have the patience, like a week, and the extra data needed to regularize
Thanks for taking the time for the detailed answer. I'll do more testing
hey why is the training script for Cascade just idling with no errors? been bashing my head here for a few hours trying to figure out whats going on.
why do you figure it can be trained on a 24GB card?
did you try attaching a debugger?
they explicitly state that it can be
and i have 8x3090s
and stepping through
it does not move past 1 step
it can be maybe LoRA fine tuned on a single card
it just idles
can you maybe provide some more context
sdxl can be tuned on a single 24gb card
the script runs and then just idles at step 1 for hours
on 8x3090s
additional cards do not make a model that requires X amount of VRAM to be trained trainable
you should know this
i know
so
can you provide some more context for what you are trying to do? i am telling you it probably cannot be fine tuned on a single card in 24GB
hmm
well what version of torch are you using?
can you show me where?
--find-links https://download.pytorch.org/whl/torch_stable.html
accelerate>=0.25.0
torch==2.1.2+cu118
torchvision==0.16.2+cu118
transformers>=4.30.0
numpy>=1.23.5
kornia>=0.7.0
insightface>=0.7.3
opencv-python>=4.8.1.78
tqdm>=4.66.1
matplotlib>=3.7.4
webdataset>=0.2.79
wandb>=0.16.2
munch>=4.0.0
onnxruntime>=1.16.3
einops>=0.7.0
onnx2torch>=1.5.13
warmup-scheduler @ git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git
torchtools @ git+https://github.com/pabloppp/pytorch-tools
featuring enhancements for fine-tuning and training efficiency with a focus on further eliminating hardware barriers.
nope
i don't think a full fine tune will work on 24GB VRAM at 16bit. the regular RAM usage is suggestive that it requires a 48GB card
can you try the lora fine tuning instead?
wuerstchen stage C is designed to be trained on an A100 80GB
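A back-of-the-envelope way to see why 24GB is tight for a full fine tune at 16-bit: count bytes per parameter for weights, gradients and optimizer state. The sketch below assumes an Adam-style optimizer with two moments per parameter and a parameter count of 3.6 billion for the large Stage C model; that count is an assumption, not something stated in the chat, and activations are ignored entirely.

```python
def training_state_gb(num_params: float, weight_bytes: int, grad_bytes: int, optim_bytes: int) -> float:
    """GB (1e9 bytes) for weights, gradients and optimizer state; activations excluded."""
    return num_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9


STAGE_C_PARAMS = 3.6e9  # assumption: large Stage C variant, not stated in the chat

# 16-bit weights (2 bytes), 16-bit gradients (2 bytes), two 16-bit Adam moments (4 bytes).
all_16bit = training_state_gb(STAGE_C_PARAMS, 2, 2, 4)
print(f"16-bit training state: {all_16bit:.1f} GB before activations")
```

Under these assumptions the state alone already exceeds 24GB, which fits the suggestion to try LoRA fine tuning, where only a small set of added weights needs gradients and optimizer state.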
are you trying to run train_c.py?
"don't play games with me" as the president says
i think you are perhaps biting off more than you can chew
? i specifically have hardware to train image models
so the model cannot be trained nor a lora with a 24gb card?
you have one of the worst possible setups for training image models
so...
i was working within a budget
your deficits are in the programming department. the train_c script expects the job to be run against SLURM
so it flat out cannot be trained on anything else than h100s or a100s?
so does train_c_lora
i don't know. i think you are "just" running the script without comprehending what it does
it's likely you can do a lora fine tuning on a 24GB card if you configure training for bf16 and use adamw (which is what it says)
i am merely asking for some friendly advice and you're coming in here out of the gate lecturing me about my intelligence and making jabs at me?
really?
i'm sorry i do not want to sap your excitement
if you buy $4,000-$8,000 of GPUs it's worth comprehending all these issues
i think you should try using the train_c_lora script with only one CUDA device visible, and then configure your training for 16bit (16bit backward pass, 16bit optimizer state/gradients)
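One way to expose only a single CUDA device to the training process is to set `CUDA_VISIBLE_DEVICES` in the environment it is launched with. A minimal sketch; the script path and config file name are placeholders, not the repo's actual layout:

```python
import os
import sys


def single_gpu_env(device_index: int = 0) -> dict:
    """Copy of the current environment with only one CUDA device exposed."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


# Hypothetical invocation: both paths below are placeholders.
command = [sys.executable, "train/train_c_lora.py", "configs/my_lora.yaml"]
env = single_gpu_env(0)
print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
print("would run:", " ".join(command))
# To actually launch it: subprocess.run(command, env=env, check=True)
```

The variable has to be set before the process initializes CUDA, which is why it goes into the launch environment rather than being changed midway through the script.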
i haven't used hugging face's config from the file for a long time so i can't tell you exactly what to put in there
okay the memory was not the issue even when it loads up to 15 to 14 on the card it merely idles
not doing anything
you will have to step through in the debugger to see what is going on
i don't know if observing just the side effects of the VRAM usage is going to be informative
do you know which line in train_c_lora you are observing a hang?
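If attaching a debugger is awkward, Python's standard `faulthandler` module can report where a stuck process is sitting. A minimal sketch, assuming the two calls are added by hand near the top of the training script; the helper names and the log path are made up for illustration:

```python
import faulthandler


def watch_for_hangs(log_path: str, timeout_seconds: float = 300.0):
    """Write every thread's traceback to log_path each time the timeout elapses."""
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(timeout_seconds, repeat=True, file=log_file)
    return log_file  # keep this open for as long as the watchdog is active


def stop_watching(log_file) -> None:
    """Cancel the pending dump and close the log."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Usage inside the script being diagnosed:
#   log = watch_for_hangs("hang_traceback.log", timeout_seconds=300)
#   ... run training ...
#   stop_watching(log)
```

If the job idles at step 1, the log should show the file and line each thread is blocked on. That is usually enough to tell a data-loader stall from a distributed-setup wait.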
Is anyone able to help me?
screenshot of thumbnails of your dataset could help you get better answers
Unfortunately it is NSFW
Hey peeps, not sure if anyone might be interested, but I co-founded a start up that provides cheap and easy to use GPUs for AI training (https://www.tromero.ai/) I would be over the moon if anyone decides to check it out (first couple of hours of compute are free)!
anyone tried training a lora yet? how much vram did you end up needing on the 3b?
24GB
Does anyone know how I can get absolute black skin? I am trying to make a bat character but it keeps having her with dark brown skin, I am looking for straight up black
I also need help getting pure black eyes
I am on 24 and it was still ooming regardless of batch size, what was your config?
sorry by 3b I meant the new cascade model btw
not sure that was clear
google noise offset or look up the lora for it
i wrote the training code from scratch, because the one in the stable cascade repo is buggy
have you published it?
not yet, i'm still waiting to see some results
fair enough, if you do publish it, please ping me, because indeed the one available OOMs way past 24 gigs
Hey everyone I have a quick question.
I'm pretty new to Stable Diffusion, but I'm planning to start training my first lora. I know the basics when it comes to training a face, but does the same apply if I'm wanting to train a specific body type? I already have great results for the face I want with CyberRealistic in Automatic 1111, but im not completely satisfied with the rest of the body. Any tips/tricks/advice, or maybe even a tutorial link would be greatly appreciated!
Hi! Could you help me with 2 questions? Suppose LoRA/LyCoris is downloaded from the internet. How can you determine the hyperparameters (especially the decomposition rank) it was trained with, and how exactly does it affect (since LyCoris can build many approaches) the diffusion model?
What metrics can be used to determine which of the LoRA adaptations to stable diffusion most accurately conveys the subject/person in the case of photorealism?
Hi. I want to train my model, but not on people, but on furry art of one artist. Tell me, maybe someone was engaged in training a model not only on portraits of people? I have a couple questions to ask.
Try to see if this can parse the metadata. https://lora-inspector.rocker.boo/
Inspect Kohya-SS LoRA files locally on your computer without any dependencies
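The same kind of inspection can be sketched with the Python standard library, since a `.safetensors` file starts with an 8-byte little-endian header length followed by a JSON header. The `ss_*` key names below are the ones kohya-style trainers are assumed to write; a file trained elsewhere, or stripped of metadata, may have none of them.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the `__metadata__` dict from a .safetensors header without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def training_summary(metadata: dict) -> dict:
    """Pick out a few training hyperparameters if the trainer recorded them."""
    # Assumed key names; any of them may be absent.
    keys = ["ss_network_dim", "ss_network_alpha", "ss_network_module", "ss_learning_rate"]
    return {k: metadata[k] for k in keys if k in metadata}
```

Calling `training_summary(read_safetensors_metadata("some_lora.safetensors"))` on a downloaded file (the file name is a placeholder) would show the recorded rank and alpha when they are present.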
I'm sure there are many in here that have done just that. My advice is to just ask the question.
Well, I have two questions:
- Is it possible to train a model on different art with different characters(and poses), but so that they end up with the same style and don't turn out ugly?
- If so, what should be followed in preparing sources, etc.? If not, how can you achieve a similar result?
Yes that is possible. You want to caption as much as possible that is not the style. I.E when you later prompt for a character it will know what the rest should look like. It's probably a good idea to include some kind of keyword for the style too.
I hope I understood you correctly. That is, when training the model I need to add a prompt to each image, which will have the keyword (name) of the style and a brief description of the image/character?
Does the description need to be limited to a couple of words or it is better to make it more detailed?
I like to see the captioning as letting the model know the separation of the objects in the image.
Let's say you only caption "a man". Anything in the image will be associated with that caption. If you further describe the image with clothing the model will associate the clothing with the captioning to some extent and could make it easier to change the clothing during the inference.
Let's say it's foggy and you don't want to associate the fog with "a man": you caption the fog too.
If that makes any sense.
Ooo. I think I got it. All this I prescribe in text files that SD itself creates when analyzing images?
Most trainers do, like kohya or OneTrainer
This I trained with dark gothic fantasy art, <caption>
Welcome to the dark gothic fantasy LoRA for SDXL, a specialized module designed to enhance your creative journey with deep, immersive gothic and fa...
Oh, wow. That looks cool. I`ll definitely try to train my model one of these days. Thank you so much!
Just to chime in on this conversation. When training a model are you tagging your dataset images with words that you DO NOT want to see generated in your desired output, or are you tagging the things that you would like to see?
Well, captioning something you don't want to see, like you've probably seen people recommend, will work, but it's not like the model won't learn those things. Since your dataset probably contains more things you want it to learn than things you don't, it will learn those faster and you won't notice the other things change.
Hi everyone! i need some advice from the community. i need to custom train SD with a dataset of images that also has segmentation maps and depth maps.
- Are those 2 extra maps relevant, and will they help SD produce better results from my dataset?
- Do you know if there is an existing training system i can use (one that includes seg map and depth map loaders), e.g. LoRA, Dreambooth, etc.?
PS the training dataset is not human/face, it's an object
Why did my renders go from the first image to the second one? I didn't change anything.
It starts out great and then just breaks around the 70% mark. The first image was taken at about 50% while the second was when it finished
I'm trying to use Dreambooth to train Loras of specific people with a set of 6 photos which I understand is enough but it doesn't ever seem to work. Auto1111 doesn't run on this system at all, I use SDNext if that matters
Does anyone know how to use the masking feature in OneTrainer?
I can't seem to figure out how to use it.
I can recommend their Discord channel. Searching in there will get you on the right path.
Can you share the link to the discord?
It's blocked by the server
https://github.com/Nerogar/OneTrainer but link is here
Thank you!
spaceship
post your training settings. (print command in kohya / .toml file in derrian)
Lots of examples with full metadata added. (no additional loras/extensions needed) xformers disabled so that it can be recreated 1:1 This lora is p...
How do I reopen kohya?
here's a mini guide for sdxl
A.) get a large enough dataset (100/400/1000 = good,very good,perfect)
B.) write captions (I'll attach sample images with captions)
C.) get a checkpoint that is compatible with furry art -> Pony Diffusion V6 XL (the only one currently in sdxl)
D.) get derrian or kohya, then use settings that are proven to work -> adamw + 1e-4 unet lr, 5e-5 te lr + batch 7 + min snr 5 + offset noise 0.1 + dim 32 alpha 1 | save every 5 epochs. epoch 45~60 will be your target epochs for finished lora
D.2) train directly on that custom checkpoint
E.) test your lora first while using similar prompts to the ones used in your training data, then try using different prompts. both should work, with the first one being almost always perfect.
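To get a feel for what the guide's numbers mean in optimizer steps, here is a small sketch. It assumes one repeat per image and ignores bucketing, both of which change the real step count in kohya-based trainers.

```python
import math


def steps_per_epoch(num_images: int, batch_size: int, repeats: int = 1) -> int:
    """Optimizer steps in one epoch, rounding the last partial batch up."""
    return math.ceil(num_images * repeats / batch_size)


def checkpoint_epochs(save_every: int, first: int, last: int) -> list:
    """Saved epochs that fall inside the suggested target window."""
    return [e for e in range(save_every, last + 1, save_every) if e >= first]


for n in (100, 400, 1000):
    per_epoch = steps_per_epoch(n, batch_size=7)
    print(n, "images:", per_epoch, "steps/epoch,", per_epoch * 60, "steps by epoch 60")
print("checkpoints to compare:", checkpoint_epochs(5, 45, 60))
```

Saving every 5 epochs leaves four candidates in the 45~60 window (45, 50, 55, 60) to compare side by side in step E.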
rivet, artwork, standing, hearts in background, drawing of a game character pointing at herself
rivet, full body, standing, 3d render, outdoors, standing on a platform
rivet, artwork, full body, standing, grey background, cartoon network, a cartoon network drawing of a game character
so when you do captions, follow this style:
<trigger word>, character (basically who or what is in this image, like a name or 'woman','man' if you dont know), pose, details, details, details, caption
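The caption pattern can be sketched as a small helper that joins the pieces in that order. The function and its parameter names are made up for illustration; the example reproduces the second sample caption above.

```python
def build_caption(trigger, subject=None, tags=(), description=None):
    """Join caption parts as: trigger word, subject, tags..., free-text caption."""
    parts = [trigger]
    if subject and subject != trigger:
        parts.append(subject)  # skip the subject when it is the trigger word itself
    parts.extend(tags)
    if description:
        parts.append(description)
    return ", ".join(p.strip() for p in parts if p and p.strip())


caption = build_caption(
    "rivet",
    tags=["full body", "standing", "3d render", "outdoors"],
    description="standing on a platform",
)
print(caption)
```

Each caption would then typically be saved as a text file next to its image, with the exact extension depending on how the trainer is configured.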
if you wanna be lazy, you can fully automate your captions via cogvlm in this app:
https://github.com/jhc13/taggui
lora results, for context, that this method works
if you trained via kohya, it usually saves a config file where your logging folder is. (or output folder) check in those folder. odds are high the settings you used are saved there
This is the only thing in my log
just checked. my700 does a lot of furry art, so the guide I literally just posted is basically perfect for your goal as well XD
this one: #finetune message
Sorry to bother you so late but I can't figure out these parts
5e-5 te lr
Text encoder learning rate of 0.00005 (5e-5)
Unet learning rate of 0.0001 (1e-4)
Ah okay, I was able to stumble my way though most of it, thank you.
Is this how I am supposed to do the pony diffusion?
This is everything I have. Is this correct?
Yep
You need to enable sdxl
where is that at?
Max Resolution is also 1024,1024
Where you selected the model
Everywhere it says fp16, switch to bf16
Okay, also does it matter that all the pictures I have are scaled down to 512?
Sdxl works at 1024px scale, so it needs images at that size
How much vram do you have?
no clue, how do I check it?
What graphic card do you have?
Oh damn x_x that's not enough to run these settings
crap
That guide will get everything working for you. It's a bit detailed, but many of the things mentioned are optional
If you stick to sd1.5 based models, then 512px is working, and everything will train fast and easy for you ^^
What would happen if I run as is? I am fine with it taking longer
I think it would take a year
If you wanna work with sdxl, there's ways to make it work. 3060ti can make it work. But sd1.5 based checkpoints will be much easier to work with in your case
okay, now what if I turned off the SDXL and stay with the pony diffusion?
Also I tried using the pony diffusion as a checkpoint in stable diffusion and nothing but blobs come out
I really like details on this LoRA.
Did you set clip skip to -2?
I dont know how to do that
If it is the one in settings>stable diffusion
the lowest it goes is 1
I was wondering if there is a way to be able to use stable diffusion from my computer on my phone?
Best to start reading the guide, as any tips you'll get in this channel will assume a base understanding of the tools and most common settings. Guide answers pretty much most questions
O, wow
OOoo, thank you. I`ll try it(but I think the biggest problem is finding so many images... ). But if i will have some unusual problems/errors, may I ask you?
Sure! As long as you have 16 or 24gb vram I can help
I have only 8gb khe khe...
You can still train SD1.5 loras then
https://rentry.org/59xed3
This guide should help
Oke
Do full finetunes/dreambooth trainings always look like noise at their first sample generation? I was under the impression that, similarly to a LoRa, you'd start with the initial weights being set to the base model so the initial sample images would be similar to the base model. Otherwise, any idea what I'd be doing wrong? This doesn't happen when LoRa training for SDXL/1.5, nor does it happen for Cascade finetuning. It just happens for SDXL/1.5 finetunes for me
we're still figuring it out. i don't think you can train stage C in less than fp32, it performs too poorly.
maybe stage A and B in fp16 will work fine
with stage C in fp32, it's possible to do a full fine tune with 48gb
single gpu with 48gb or 2x24?
are you seeing 'nan' in the loss output? you said initial sample, but you can literally sample whenever so that's not very clear
No, eventually the image changes from what you see above into an identifiable result. It just takes a bit, depending on LR and such. It just takes a few hundred steps before something normal emerges.
technically every image looks like that during inference, it starts with noise, that's nothing unusual
2x24
it only makes sense with ampere though to do that
with 2xA5000 or 2x3090s nvlinked in tcc
oh nvlink, unfortunate, I have 4090+3090
i mean we have definitely, successfully trained a LoRA on top of stage C. it's just not very good. it worked, but not as well as SDXL
I can use google colab to train a lora with 15 images. It works well and produces an 18mb lora. I'd like to run kohya_ss locally because colab is always out of free GPU time. I have kohya running, but when I train, the file is half the size, only 9mb. When I use the lora it does not change the image at all; it doesn't work.
Here are the local kohya settings. All else is the default.
Source model: custom: /dataset/models/cyberrealistic_v41BackToBasics.safetensors
save as: safetensors
Folders: Not using regularization photos
Parameters - Basic: Epochs: 10, Save every N epoch: 2
Parameters - Advanced:, CrossAttention: xformers (also tried setting this to none)
Flip augmentation: checked (I also check this on google colab)
What things can I do to see what is going wrong. I'm using the same dataset with both.
size of lora, is determined by network rank setting.
(Higher = bigger)
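The relationship between rank and file size can be sketched per layer: a LoRA pair for a linear layer stores a down matrix (rank x in) and an up matrix (out x rank), so the parameter count grows linearly with rank. The layer width 768 and the ranks 8 and 16 below are illustrative assumptions; the chat does not say which ranks were used.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: down (rank x in) plus up (out x rank)."""
    return rank * (in_features + out_features)


small = lora_params(768, 768, rank=8)
large = lora_params(768, 768, rank=16)
print(small, large, large / small)
```

Doubling the rank doubles the stored weights, so a 9mb file next to an 18mb one would be consistent with the local run using half the rank of the colab run, though that alone would not explain a lora that has no effect.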
also, odds are high the colab had some settings on by default, since there's no reason you'd ever want to not use them. in local kohya, you can choose to not use settings, that you absolutely should use. so your issue may be there
I agree totally. I also wonder if the google colab has some regularization images. I think I should be able to capture the commands from both, would that show the differences?
Although to run it in google colab I have to wait for GPU to be available or run in CPU mode. Maybe I could grab the command then cancel it.
I followed this guide, and the person does not use any regularization images. https://medium.com/@dminhk/3-easy-steps-lora-training-using-koyhas-gui-on-amazon-sagemaker-notebook-573b151b4add Guide is for sagemaker but that shouldn't matter.
I don't see settings in kohya for a network rank, are they called something different? I have seen those settings in automatic1111 with dreambooth.
you are on the right tab, right?
you're not accidentally trying a full finetune XD
I see it now, good grief.
So I need to figure out what settings google colab is using for that?
Looks like colab spits out a config file into google drive.
unet_lr = 0.0005
text_encoder_lr = 0.0001
I'll give that shot. I suspect there must be more differences than that.
I'm trying to train models of specific faces with SDNext and Dreambooth, it doesn't seem to ever work
Hello everyone,
I'm planning to train a model checkpoint to generate models from images of mannequins wearing clothes. I want the models to look realistic, and the background images should resemble real-life scenes such as streets, parks, beaches, fashion shops, etc. I've been using some models for a while, but this is my first time training a model checkpoint. Could anyone with experience in training checkpoint models share some advice?
Here are a few specific questions I have:
- Should I use SDXL or SD1.5?
- Is it advisable to base the model on an inpaint model? (I've tested some inpaint models, but the generated background images didn't look as good as those from regular models.)
- How should I set up the parameters?
I'm planning to use kohya_ss for this project. Thank you very much for your help.
break your project down into smaller pieces. then get each piece working. finally, combine all your trainings into 1 to make the final product.
In case of SD1.5, making a model is just fine. probably ideal for you, since you can just keep scaling up the dataset to make it better over time. only downside is training time - but that should stay relatively harmless as long as you have 24gb vram or more.
downside? even with upscaling, you'll probably hit a 'reliable' limit of 768px. with a max of 1k
In case of SDXL, you can achieve this via a lora. (or to be more exact, genuine full finetuning is on a level where you should plan to have 10k$ ready for it. you'll save money by hiring someone that can already do this)
so for an sdxl lora, you break it down into about 3 individual loras that get combined:
Lora 1) train on faces that meet your clients/or business' preference (basically get the ethnicity bias you want to represent), then merge with rundiffusion checkpoint. this will be your base
Lora 2) get a dataset with images of fashion photoshoots at various locations (streets, parks, beaches, fashion shops, all the ones important for you). train that into one trigger word. this triggerword will be useable even for scenes not trained, like coffee shop, etc...
Lora 3) make a regularization dataset, filled with 50% mannequins, all wearing different clothing pieces. then 50% real people wearing any fashion clothing (make sure the people are all unique, else you'll get bias towards a person)
take a bunch of photos of your current outfit, train a 1 hour lora with your regularization data added in. that lora can then be used alongside your base and location lora, to generate the images you want.
If you get a new outfit, you only need to remake Lora 3, which is easy since the regularization data & training settings stay the same
if you dont get new outfits, then you can just combine the datasets from loras 1~3. add the regularization data as a normal dataset. train one single lora from all of them. <- also works, but you shouldn't do this until they work individually
if you need to supply only a 'checkpoint'. then just make the loras individually, and merge them in at the end
for the more basic things like settings, refer to this guide so you can get a base understanding: https://rentry.org/59xed3
lol
i don't need a speech
spare me
i think it's actually that it was trained on just 700 images
so my attention went to zero
it worked in comfyui, when I was using it about a year ago? (moved to a1111 since its whats used by the majority, and easier to explain)
Basically commas ended tokens early
I wasn't writing 50% of the time XD
commas are their own token
i don't think any of that stuff really matters
clip isn't strong enough for that to matter
i mean that rundiffusion's checkpoint isn't adding much
ahhhh. yeah
they overfit specific facial features
it's more that when you use a small dataset, it is more likely that sdxl already is very close to producing the results you want, and you could achieve the same results with "just" a prompt
it's why we can have working celebrity loras, but hardly anyone succeeds in training people that sdxl has never seen before with less than 1,000 images
whether lora or full fine tune doesn't matter
I tell you this will be the end of my hair. I can build 100 different loras, different image sets, in kohya_ss and they will do Nothing At All. Might as well not even be included in the prompt. Use the same images on google colab and it works fine. I have matched up every single setting to be the same, doesn't matter. Use the default settings, doesn't matter.
Only thing I haven't tried is SageMaker to see if it's any different
then you have some dumb error x_x
friend of mine once managed to not include ".txt" for caption extension. so something like that may be glitching you out
wanna print your command, and paste it here?
This is one I tried earlier. I was matching each single command line argument to the google colab. I don't think it needs to be the same. I've tried with and without regularization images. Makes no difference.
Config: https://pastebin.com/3NyT9dBm
I've tried with the generated txt files that google colab makes and by generating them in kohya. Same text files in both.
how critical is it to have exactly 1024x1024 images for training SDXL lora? Is it good enough if you're close enough like within 10% of that?
and how important is aspect ratio
I read posts of people that say they don't bother to crop/square or resize their training photos and get great results. What if you tried one training with 10% square and one without changing any aspect, and see which produce better results?
When I'm testing a lora I like to try the 7th/8th/9th/10th generated lora at weights 0.7/0.8/0.9/1.0 and see which combination looks best.
(I'm super noob at this)
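A rough way to set up that epoch-by-weight comparison is to generate one prompt per combination. This is only a sketch: it assumes A1111-style `<lora:file:weight>` prompt tags and per-epoch files named like `my_lora-000007`. The name `my_lora` and the base prompt are made-up placeholders.

```python
from itertools import product


def lora_test_prompts(base_prompt, lora_name, epochs, weights):
    """Build one prompt per (epoch, weight) pair."""
    prompts = []
    for epoch, weight in product(epochs, weights):
        # Assumed file naming: <name>-000007 for the epoch-7 save.
        tag = f"<lora:{lora_name}-{epoch:06d}:{weight}>"
        prompts.append(f"{base_prompt} {tag}")
    return prompts


# Hypothetical names, 4 epochs x 4 weights = 16 prompts to compare.
prompts = lora_test_prompts(
    "photo of a woman", "my_lora", [7, 8, 9, 10], [0.7, 0.8, 0.9, 1.0]
)
for p in prompts:
    print(p)
```

Keeping the seed fixed across all 16 generations makes the comparison fairer.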
Hey guys, if I finetune a model which is under license xyz.
Is it still the same model with same licensing?
Any ideas on how we can improve masking (https://platform.stability.ai/docs/api-reference#tag/v1generation/operation/masking) It seems to generate a completely different image every time and ignores the masking.
depends entirely on what you want to achieve.
Training on only one aspect ratio, makes inferencing on all other aspect ratios a bit worse. But in return, you dont get any bias trained into an aspect ratio.
(Like that portrait photos in base sdxl look better if you generate an image with a 2:3 aspect ratio)
as for smaller images. it can work. Make sure to have upscaling turned on. in most cases it's not too much of an issue. If you mess up, or your dataset is really bad, then you may end up accidentally training on the noise within your images, and generate something like my "youtube compression noise" lora when I just wanted to copy the style of a music video
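For anyone wondering how non-square images can be handled at all: trainers commonly use aspect ratio bucketing, where each image is assigned to the training resolution closest to its own shape. The sketch below is a generic illustration of that idea, not kohya's actual implementation. The step of 64, the side limits, and the 1024x1024 pixel budget are assumptions.

```python
def make_buckets(max_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    """List (width, height) buckets whose area stays within max_area."""
    buckets = set()
    for w in range(min_side, max_side + 1, step):
        # Largest height (a multiple of step) that keeps w*h in budget.
        h = min(max_side, (max_area // w) // step * step)
        if h >= min_side:
            buckets.add((w, h))
            buckets.add((h, w))  # also add the transposed orientation
    return sorted(buckets)


def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


buckets = make_buckets()
square = nearest_bucket(900, 900, buckets)     # slightly small square image
portrait = nearest_bucket(800, 1200, buckets)  # 2:3 portrait image
print(square, portrait)
```

An image that is within 10% of 1024x1024 lands in the square bucket and just gets resized a little, which fits the "close enough is fine" advice above.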
yes and no. depends on the model and specific license. The last generative image model with a sort of open license was SDXL1.0. go read that license for more info.
but it's not really a topic for finetuning per se, so best ask your question in #1010934719455707218 if you have a real life usecase, else #off-topic if you're just asking to get a general feeling for it
awesome thank you!
@hollow spruce @agile inlet thanks for the tips. I'm going to try having my images be approximately square but some will be a bit rectangular, nothing too crazy though. Sounds like it won't be a huge issue for my use cases
what is your objective?
I trained a working model of a friend with 6 images in SDXL.... oO
hmm... listen there is a scientific way to measure face resemblance. it would be a major discovery to transfer likeness into flexible scenes. if someone had a way to do it, it would be a great paper
looking into this further maybe it has just been kiboshed by the big AI labs because all the pieces are lying around
to achieve this
it would be generalizing deepface to "deephead," then "deephead adapter (controlnet)"
maybe SD is particularly weak for head orientations. sora figured it out, but sora is trained on synthetic data to specifically address the problem
is there an api endpoint for creating/listing finetunes?
i have succeeded in a stable cascade lora fine tune and it performs well
for my dataset, which is for games IP, it seems like it needs a lot more data for stage C to get non-face details like wardrobe correct, but i'm more confident that this is something that is my fault and not the model's fault. it delivers superior silhouettes, proportions, adherence to face details and flexibility.
thanks for the update! funny enough now SD3 is around the corner, not sure if cascade is worth it still
it's a very, very good model
it's not as good as IF though...
the direct pixel space models are the best for people like me with lots of fine tuning data
the real latent models are really good if you don't want to fine tune subjects but instead want creative generation
SD fake latent is a compromise between both and i wonder if SD3 is fake latent, real latent, or neither
Got my first LoRA trained, it looks somewhat distorted / deformed haha.
Reading through the sd training info on github, there's "Regularization data or Class data (pictures of diverse other things)", so does that mean the regularization data should be different from the dataset I use / my concept?
This doesn't fully make sense, it uses odd language in some places: https://github.com/Guizmus/sd-training-intro
terminology is important here, since "should be different from the dataset I use / my concept" will mean different things to different people
also relevant if you're training SD1.5 based or SDXL
here's a small breakdown for training a specific subject (lets say "Obama")
Training Dataset of obama: 20 images (or 200 for sdxl)
Regularization Dataset: 5~10 times the size of your dataset. Here we would want about 50% random black people, 50% other ethnicities.
If you have access to a good regularization dataset that someone made, then great. If you're training on synthetic (self-generated), then just generate them yourself. (real photos will make the final lora better, but you have to ask yourself if the effort is worth the result) <- if you plan to train a lot of lora, then just gather your own regularization dataset, or ask around for a good one. If its a one time thing, then the increase in quality is rarely worth it.
Dont forget to use repeat 5~10, so that all the images in your regularization dataset get used. else it will just reuse the same 20 images, despite more being available. <- lots of threads talk about this online, if you google around for a bit.
Do you actually need regularization?
not if your dataset is good/big enough!
but occasionally you're stuck with a mere 5~15 images, and have absolutely no way to increase those. So regularization has its place at all levels of training
Thanks @hollow spruce for explaining it deeper, really appreciate it!
Just finished training my second test lora with a regularization set, only had 1 repeat for those and might have gone wrong there by not having enough steps.
do regularization images make sense for a pixel art style? If yes what kind of images should I use?
My second attempt did not go too well:
With the first test I did not use any regularization images at all, did only one epoch with 30 steps and the Lora responded as expected, even if the quality wasn't there. I trained the first Lora on SDXL base 1.0 vaefix model.
My second attempt with 4 epochs and 30 steps (in total 10560 steps, with 44 images), I used a regularization set of 150 images. Now the Lora is super unresponsive and only works (not well) with the model I trained it on. I trained the second Lora on limitlessvisionxl model.
Anything I could try with my next test? How do I get my LoRA smaller, adjusting the bucket size?
6 gig LoRA is far from good haha
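For what it's worth, the 10560 figure is consistent with 44 images x 30 repeats x 4 epochs, doubled. The sketch below assumes batch size 1, that the "30 steps" are per-image repeats, and that the trainer doubles the per-epoch sample count when regularization images are present. None of that is confirmed by the config here, so treat it as a sanity check only.

```python
def total_steps(num_images, repeats, epochs, batch_size=1, uses_reg_images=False):
    """Estimate optimizer steps for a repeat-based training setup."""
    # Each epoch sees every training image `repeats` times.
    samples_per_epoch = num_images * repeats
    # Assumption: each training sample is paired with one reg sample,
    # which doubles the number of samples per epoch.
    if uses_reg_images:
        samples_per_epoch *= 2
    steps_per_epoch = -(-samples_per_epoch // batch_size)  # ceiling division
    return steps_per_epoch * epochs


with_reg = total_steps(44, 30, 4, uses_reg_images=True)
without_reg = total_steps(44, 30, 4)
print(with_reg, without_reg)
```

Under those assumptions, the second run trained for twice as many steps as the same settings would have without the reg set.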
oh damned, you have to set the rank
or dim, however its named
--network_dim=8 or --network_dim=16
if you train a 6GB Lora then its no surprise that it won't work well
I found it often easier to train on SDXL Base and then just apply the Lora to whatever model I want. If the model you use is not too overfitted it can deal with base loras without problems
only rank, yes
Ok, thanks
I think someone suggested 128/256 on YouTube, I guess that explains the size haha
yeah, but that's already way too high
use something between 8 and 48
try lower first and only increase if it helps
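To see why the rank drives file size: a LoRA on a linear layer stores two small matrices, so its parameter count grows linearly with rank. The sketch below uses made-up layer shapes and counts only linear layers at 2 bytes per parameter, so the absolute numbers are illustrative. The scaling with rank is the point.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: down (rank x in) + up (out x rank)."""
    return rank * (in_features + out_features)


def lora_size_mb(layer_shapes, rank, bytes_per_param=2):
    """Rough file size in MiB for a LoRA over the given linear layers."""
    total = sum(lora_params(i, o, rank) for i, o in layer_shapes)
    return total * bytes_per_param / 1024**2


# Hypothetical model: 500 adapted layers of shape 1280 x 1280.
layers = [(1280, 1280)] * 500
for rank in (8, 16, 48, 128, 256):
    print(rank, round(lora_size_mb(layers, rank), 1), "MiB")
```

Going from rank 8 to rank 128 makes the file 16 times larger under this model, which is why trying low ranks first is cheap.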
Thanks, I'll experiment with it
generally you use the model you're training against to generate pictures of whatever class describes your subject (man, woman, dog), assuming it's subject training vs style training. the purpose of the reg images is to maintain the integrity of what the model knew about the class by injecting those back in. I would recommend someone try without them first to get an idea of training progress before adding in that variable. most of the issues in training come from either bad training images or bad captions
I see, thanks for explaining!
Do I need to pay attention to learning rate, text Encoder Learning Rate and Unet Learning Rate?
Maxing out around 1.3 it/s with a 4090 and 5950x CPU... Seems the GPU isn't running at full load...
One of the cores for the CPU does however appear to be hitting 100% pretty often... Not using official monitors here just hwinfo, but wondering if this really is a CPU limit or not
I'm willing to build another PC to house the 4090 if it'll double my training speed by going from <50% load to near 100 on the 4090
I'm on windows 11, 64gb ddr4 ram @ 3600mhz
This is with OneTrainer which I've found to be a bit faster than kohya which has me at 1.1 it/s max
for comparison, here's what i see when generating images... each spike is during diffusion, each dip is when the system is completely idle
Yeah tried disabling SMT just in case and to clear things up
Cores 0&1 are maxed out
Looks like a CPU bottleneck
anyone know what the difference between the training in the original controlnet training vs the huggingface diffusers one is?
does anyone have favorite upscalers for training images?
Hi @hollow spruce it can be late, but thanks for your help
Do you mean like when building your dataset, you want to upscale some images to use in the set?
4x_NMKD-Siax_200k is one of my current favourites I've been using almost exclusively lately.
I like lollypop
Yes, and thanks for the tip!
how many epochs do you guys usually use for your loras?
can you talk a little bit more about what you are trying to do? the graphs are kind of meaningless, you will have to do real profiling, like with nsight, to know what is going on. you can use nvidia-smi for a correct, instantaneous GPU usage (really "occupancy") measurement.
if you care about performance in training, you should not be using windows.
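A small sketch of the nvidia-smi suggestion above, for logging utilization from a script instead of reading graphs. It uses the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi. The helper names are made up, and the parser assumes every field comes back as a plain integer.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse 'util, used, total' CSV lines into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        rows.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA GPU)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Example with canned output, so this runs without a GPU:
sample = parse_gpu_csv("47, 23100, 24564\n")
print(sample)
```

Calling `query_gpus()` in a loop during training gives a utilization log, but it still won't say why the GPU is waiting. That part needs real profiling.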
no. regularization images make sense in the context of where they were invented, for dreambooth, when you are trying to learn a concept from a few examples. to create a pixel art lora, you will be using thousands of pixel art images, hopefully of diverse subjects, and you will always want all things pictured inside of them to be represented as pixel art.
same for you too. here is a super concrete example of what the meaning of regularization is:
let's say you are trying to train a picture of your dog, and for some reason, you only have 3 photographs of your dog.
in one of the three photographs, there is a ball. in another of your three photographs, there is a plant. so 33% of your dataset is also training balls, and 33% of your dataset is also training plants.
do you want plants and balls to be generated in your images "1/3" of the time whenever you are asking for your dog? no.
so your regularization dataset could be many things:
- use a variety of images of other dogs, which have a diversity of random crap in the foreground. pros: you will not see balls and plants in your images. cons: you actually do want every dog to look like your dog, so this will actually increase training time / reduce performance of your dreambooth fine tuning.
- use a variety of images of adjacent concepts, such as shots of other animals. pros: you will see fewer balls and plants in your images. cons: you actually do want every portrait of an animal to be a portrait of your dog, so this will actually increase training time / reduce performance of your dreambooth fine tuning.
okay... it should be clear there's no obvious choice for regularization images. let's look at some alternative solutions:
- get more, better pictures of your dog.
- caption better.
- photoshop out the foreground elements like plants and balls.
- moderate your expectations.
- cure the disease of impatience.
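The "1/3 of the dataset is balls" point can be checked on a real dataset by counting how often each tag shows up across captions. This is a minimal sketch assuming comma-separated captions. The three captions are invented to mirror the dog example.

```python
from collections import Counter


def tag_frequencies(captions):
    """Return the fraction of captions each tag appears in."""
    counts = Counter()
    for caption in captions:
        # Use a set so a tag repeated within one caption counts once.
        tags = {t.strip().lower() for t in caption.split(",") if t.strip()}
        counts.update(tags)
    n = len(captions)
    return {tag: count / n for tag, count in counts.items()}


# Hypothetical captions for the three dog photos.
captions = ["mydog, ball", "mydog, plant", "mydog"]
freqs = tag_frequencies(captions)
for tag, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {share:.0%}")
```

Any non-subject tag that appears in a large share of a tiny dataset is a candidate for cropping out, better captioning, or more varied photos.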
Perfect, thank you!
so you pretty much never want to use them :/
that's how it goes
what are you trying to do?
I picked up some info from an article I came across somewhere (I have a link somewhere) about reg images, and those can be handy when the LoRA is bleeding too much into the checkpoint currently in use. For example if you're creating red hair with a LoRA and it makes all the people not only look red haired as wanted but also look similar to each other, i.e. it's bleeding the trained data into the checkpoint, wiping away some of the data present in the main checkpoint and replacing it with what the LoRA was trained with. Not just the hair alone but other characteristics as well, such as body type, hair style, clothes etc. @dusky urchin
Not 100% sure that's the case though, haven't tested it successfully yet
but what are you trying to make?
what is the goal?
I currently have some bleeding occurring, to fix that
SDXL already knows how to make people with red hair...
so what are you trying to do?
The red hair was just an example. I'm not doing anything serious, training my dog and my own face
okay
Unfortunately right now nothing I can do about that, need windows on here for computational stuff for work
I just want to learn the process and understand the AI better. Maybe in the future train some actual LoRAs idk.
Yes, exactly, but if what I'm describing is the case (not sure yet as it's just one article I've read about it), that means when training some specific kind of red hair, let's say with polka dots, the trained LoRA bleeds into the checkpoint currently used and soon all red haired ppl are looking the same, not just with polka dots but also same hair style, similar face etc
I've definitely noticed this kind of behaviour with some LoRAs, not just my test loras alone, where a LoRA completely changes the appearance into something different from what it originally was, for example body type. You create a plump girl and with a Nike shoe lora the originally plump girl gets transformed into a skinny girl.
oh thousands? maybe that is why my loras always looks like this:
I'm trying to create 16x16px textures
I'll try to up the img rep count
Quick question guys, hope this is the right place.
I've been getting back into stable diffusion and I wanted to finetune a model with my face. I've done this in the past but now I have better pictures and would like to try again.
What is the newest working google-colab I could use for this?
the best thing to do is to resume from the pixel art XL lora on civitai
do not use it for a commercial purpose though
there isn't any legitimate secret sauce to anything you see. ordinary community members don't have the capital or knowledge to generate synthetic data.
I'm using my own art tho
also I'm just gonna use it for ideas anyway
and resuming from custom models doesn't work for me it crashes kohya
setting to 100 reps did help a lot tho:
it's still not usable but hopefully if I resume from the training data it'll get better and if not then I'll just try again next year
Anyone know what text encoder CLIP model Cascade uses? Seems like it might be a fine tuned half precision variant of CLIP Big G?
Strange that there's so little information about it. Have looked through docs and release statement.
you should do what i say.
it uses openai/clip-vit-large-patch14 for vision* and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k for text
I can't tho I've tried as a test and kohya crashes when I use any custom model
kohya_ss shouldnt crash with a custom model, there's something wrong with your install, try doing the same training with kohya_ss on vast.ai or runpod
hey dumb basic questions for Kohya
there's dreambooth and finetune methods for doing training on checkpoints, which is better?
is there a good up to date guide to finetuning checkpoints for multiple concepts anyone can recommend?
or presets/settings people generally agree are decently good
dont forget to enable all the correct matching settings. like in case of sdxl models, to enable sdxl. and vram optimizations
and to load it in under models, and not any of the two continue options XD
or, in case you ARE continuing a lora, to not load it as model, but instead continue from weights, while picking sdxl base or whichever base you want under models
I've just trained with the base sdxl model it actually worked:
the textures aren't good but it at least does the 16x16px grid most of the time
I assume that it isn't good at such low res. esp with a low amount of images to train with. So I'm gonna train it some more after I've drawn more textures
trying to do a kohya dreambooth XL training but it's attempting to pull the SDXL vae from huggingface, which is down for maintenance, so getting an error
hey, I have the same problem. Did he share his code already?
should you caption reg data when training lora in Kohya?
dont think so
alright, thx
Hi guys, I got a problem with training stable cascade lora on the 1B version: During training I get intermediate output where the model generates a grid of images. There are 5 in total, the first one is the ground truth and the other ones are generated. However, the last 2 suddenly show something totally different from what I'm finetuning. I'm currently training lora for some shirts and the first 2 images look promising but the last 2 are just strange. Anyone knows what's happening?
What the FUCK
excellent question
no still working on it. our fork of kohya which is installable - https://github.com/hiddenswitch/sd-scripts - fixes some issues, but the basic answer is that you need more than 24GB to fine tune with pivotal tuning aka adding new vocab / tokens. i think this is because pivotal tuning requires more gradients than the LoRA layer for clip text implies; or, it's possible that train_c_lora is misconfigured
I implemented pivotal tuning in kohya_ss. It takes 11GB VRAM with batch size 1.
In general Pivotal training takes not more vram than training text encoder and unet loras
results are... nah. Pivotal training works as bad in Stable Cascade as it does in SDXL. I switched to text encoder training long time ago as it just works better than textual inversion
haven't tried it yet... heard it was amazing somewhere, but only heard about it once really
dunno, I think it is way overhyped to be honest ^^°°°
pivotal training came up a year ago as some alternative to dreambooth. Instead of using this strange sks token, learn a token via textual inversion and do dreambooth on that
the reason nobody ever used it was because afterwards everyone started using Loras and they worked better anyways
but if somebody has good experiences with pivotal training feel free to tell me about it
well i mean you can do anything in 11GB of VRAM by unloading everything and reloading everything
maybe that doesn't make training take 10x as long, but 2-3x as long is still bad
i am just teasing it's good to hear it works
and that it gives bad results
you can train in non square images?
yes
token downsampling seems pretty good. Someone added a pr for it in kohya. Don't know if it's as good for finetuning, but it's still something worth looking into.
https://github.com/kohya-ss/sd-scripts/pull/1151
better tome basically
it's just a heuristic to speed up SD 1.5 a bit, cause in SD 1.5 they have attention in the first and last layer of the unet (which is somewhat crazy and ineffective). SDXL is not benefitting much from that
Training in kohya. Why did it only create a NPZ file for 40 of my regularisation data images when I have 472? btw 40 is the same amount as images that I have in my training data img folder
thats the specific reason why repeats were created.
use 10 repeat on your training data, then 400 reg data images will be used (10x40 = 400)
(or just add the regularization data, as normal training data. then 100% of it is used.) <- while the math behind it changes, the result is on par or occasionally better. depends on your dataset.
I'd say try adding it as normal training data once. then you know for future runs if that dataset is viable to be used like that or not
Okay, thank you
Should you have 10 repeats on reg folder as well?
nope. that would result in this: 10x40:10 = 40
so you always want repeat 1 for regularization
(fun fact. if reg data folder is too small, it auto repeats anyway! XD)
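The repeat math above, written out. This is a simplified model of the behaviour described in this thread (one reg sample drawn per training sample), not kohya's actual code, and the function name is made up.

```python
def reg_images_used(num_train, train_repeats, num_reg, reg_repeats=1):
    """Estimate how many distinct reg images get used (and cached)."""
    # Training samples per epoch decide how many reg samples are drawn.
    train_samples = num_train * train_repeats
    # Repeating reg images means fewer distinct ones fill the same slots.
    distinct_needed = -(-train_samples // reg_repeats)  # ceiling division
    return min(num_reg, distinct_needed)


print(reg_images_used(40, 1, 472))       # repeat 1 on training data
print(reg_images_used(40, 10, 472))      # repeat 10 on training data
print(reg_images_used(40, 10, 472, 10))  # repeat 10 on both folders
```

With 40 training images at repeat 1, only 40 of the 472 reg images are touched, which matches the 40 NPZ files seen in the question.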
Hey all! I have a bunch of questions about training a LoRA on a specific style
Starting with detail and resolution, I guess. I understand my current checkpoint is based on sd 1.5, and that has native resolution of 512x512. When I try to make an image at that resolution though, I get low resolution images. 768x768 or even 1024x1024 produces drastically better results
Can I make images at 1024x and scale them down, while still preserving the detail during training?
scaling down is inherently a destructive process. the details are in the pixels and the pixels are being removed.
that being said, the fine details don't really matter too much. SD15 is going to wash them out and smooth things over either way, as its base attention is 512.
what you can do is a second set of your images with closeup cropping instead of sizing the whole thing down. like, cropping just the face or another key focal point in the image
Okay, cool
How about switching from 1.5 to sdxl? Recommended?
I just did a cursory glance at the specs, and sdxl certainly seems better