#🔧｜finetune

hollow spruce
#

kai has. I'm still with LoRA as I've not hit any limits yet. just a lot of effort x_x

tall condor
#

maybe kai can give some tips on dreambooth on sdxl

#

and if you have larger datasets it's all a lot of effort

#

one thing tho. i have learned that if you use random crop you can train 5 models close to overfit and if you merge them later into one model the results get really neat

#

so for me the key to success was merging multiple models into one final
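
A minimal sketch of the kind of merge described above, assuming every checkpoint shares the same parameter names and that a plain weighted average is what is meant (the chat does not say which merge method was used). `merge_state_dicts` and its arguments are hypothetical names; the values can be plain floats or tensors, since only `*` and `+` are used.

```python
def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of several checkpoints that share the same keys."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    if weights is None:
        weights = [1.0] * len(state_dicts)  # equal weighting by default
    if len(weights) != len(state_dicts):
        raise ValueError("one weight per state dict is required")

    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("all models must have identical parameter names")

    total = float(sum(weights))
    merged = {}
    for key in keys:
        # Normalise the weights so they always sum to 1.
        merged[key] = sum((w / total) * sd[key] for w, sd in zip(weights, state_dicts))
    return merged
```

Passing unequal weights, for example `weights=[1, 3]`, leans the result towards the second (newer) model, which is one way to "weight in the newer" checkpoint when merging a model that was advanced over time.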

#

the difference compared to the single models was so big you won't believe it

#

also if you advance your model i like to add the new resulting model on top of the old or vice versa

#

especially for faces that makes a lot of difference

#

this way you can also add new stuff to your old models

hollow spruce
tall condor
#

trust me it works great

#

i recommend to start merging based on a very low learning rate model

#

it will give a very flexible result

#

the model you start your merge with makes huge difference

hollow spruce
#

now I also need 4x 4090 XD

#

I'm working with a mere 1 of them

tall condor
#

xDD

#

its a pain isnt it?

#

i started with one

#

then 2

#

and now 4

#

and i wish i had that a100 cluster you referred to

#

but if you have a progressive model that you advanced over time

#

try to merge it

#

the results are much better sometimes as long as you weight in the newer with enough weight

hollow spruce
#

but this might be a good time to ask... does it actually double your generating speed? also can you actually run them in multi gpu mode, or do you just run 2 instances of your training XD

tall condor
#

no it does not double

#

it doubles if you dont use random crop

stiff dust
#

but you can merge loras, too

tall condor
#

yes

#

merged loras are amazing too

stiff dust
#

in principle there is nothing you can do with Dreambooth that you can't do with lora

tall condor
#

so you lose about 20% speed with multiple gpus because of the synchronization

#

but it's still much faster

hollow spruce
tall condor
#

it would. but if you run 200 or 300 epochs with 2-10x repetition just imagine how many variants you will get

hollow spruce
tall condor
#

for me random crop together with not downscaling buckets really did a lot for the quality

#

also flip augmentation

#

with those the models tend to overfit much later

#

and the resulting model produced way fewer closeups

#

its like 3x slower tho

stiff dust
#

hm, I don't like flip augmentation.

#

regarding cropping: this problem does not exist anymore in sdxl

tall condor
#

flip augmentation actually does not "flip". it only flips by a few percent

stiff dust
#

I know, the thing is

#

faces are not symmetric

tall condor
#

maybe i need to try that

stiff dust
#

if the subject has certain facial features on the left side I don't want to flip them

tall condor
#

i do have my issues with the faces

#

maybe removing the flip augmentation improves that

#

need to try

#

@stiff dust so you have not tried dreambooth on sdxl yet?

hollow spruce
tall condor
#

no

hollow spruce
#

never questioned that one due to the wording x_x

tall condor
#

actually yea

#

but from my experience its just a vertical clip

#

never see any artifacts from horizontal flip

digital dune
#

welp this is also why I can usually get nowhere in finding optimal parameters

#

everyone has their own opinion on what to do, what not to do

#

just when I thought I found the perfect params, someone tells me why I shouldnt use them

#

guess it all boils down to trial and error in the end. There just never is a consensus on most things around here.

hollow spruce
# digital dune just when I thought I found the perfect params, someone tells me why I shouldnt ...

settings work along context.
• dataset size matters
• how your lora is intended to be used. what is your target success rate? (like out of 10 images generated, how many out of those need to be good enough to share with others, for you to consider it a success)
• what you're training (style/face of a person/clothing/a pose/general composition/etc...)
• how much vram you have (since above 16, under 16gb, under 12, under 10 you need entirely different workflows to make it work on your system)
• how you do your tagging
• how small are the details that you need to capture/making sure your dataset doesn't have noise, since you dont want to train fine details and accidentally make a "youtube compression noise" lora by accident
• does your training include concepts that aren't known by sdxl (like nudity, facial accessories, new clothing articles, proprietary or new subjects that dont exist in public datasets)
• are you training eyes? (yeah. training eyes messes up more than you'd think. its a long conversation)

tall condor
#

yes its alot like that

hollow spruce
oblique adder
#

how can i make it so adetailer doesnt entirely change my render? i only want the eyes to be better but i dont know what setting makes it so different

#

left is adetailer, right is how i want it to be

digital dune
#

its way hyped up

#

nothing beats good ol cnet and inpainting

oblique adder
#

can i make the eyes render better that way?

raw dirge
#

no the eyes shape depends on the model

#

only way to change eye shape is either use another model or use an artist style lora

oblique adder
#

its less about the shape but rather about the clearness (i guess?), i want it to look like this render

raw dirge
#

it depends on the resolution and model. some models add a lot of details but there are also models that are very flat and clean without much detail

digital dune
#

have you tried inpainting?

raw dirge
#

you could also use a lora style for that

oblique adder
digital dune
#

nah dont use adetailer g

#

like I said, its unnecessary

oblique adder
#

oh, i just saw inpainting under adetailer assumed it was that one

#

or at least, when i open the tab inpainting appears

raw dirge
#

on civitai search for style and filter by lora/lycoris to see all of them theres a lot for 1.5 (more than 1k)

oblique adder
#

okay, thanks i'll give it a go

normal ember
# tall condor and now 4

What does your rig look like to be able to fit 4x 4090s? I'm thinking about getting a rig at work.

#

I've looked at specs at vast.ai for 4x 4090s. But interested in your setup.

#

I guess you would need a large PSU or even multiple.

digital dune
#

How exactly does kohya decide how many reg images to use per subject image? Is it the total subject images*repetitions divided by total reg images*repetitions? Im getting an issue with my images becoming a little too toasty during training and I'm ready to blame my reg images but can't narrow down what exactly kohya is doing with my regs.

#

This is a problem that I've had since the dawn of time and I remember being able to solve it by simply adding more reg images per subject image. Otherwise, I have to lower the CFG of my gens all the way down to 2 to avoid them looking overcooked af.

#

This wasn't a problem in A1111 dreambooth because you can explicitly say how many regs*subject you want but I cant go back to that since it has too many bugs. Halp plz.

latent charm
#

How does the reg dataset work in fine tuning?

digital dune
#

as well as your class images ofc

#

when it's only a small influence, it's perfect for when you have super-detailed images on your reg images so they blend with your class images, thus your output images are super highly detailed

#

but if the influence is too much, the outputs are overfitted to the reg images and everyone starts looking like wrinkly grandmas. Finding the sweetspot is hard and I'm finally convinced it's impossible with current technology to train a model with more than 10 subjects that is decent.

hollow spruce
#

any additional reg images in the folder just stay unused

#

in case of not enough reg images, it starts to repeat them until that equation checks out
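
The equation referred to above is not shown in this log. A rough sketch, assuming the rule is that kohya matches the number of reg images per epoch to subject images × repeats (surplus reg images are ignored, a short reg folder is cycled); `reg_usage` is a hypothetical helper, not part of kohya.

```python
import math


def reg_usage(subject_images, repeats, reg_images):
    """Return (reg_slots, passes_over_reg_folder, unused_reg_images) per epoch.

    Assumption: one reg image is paired with each subject image * repeat.
    """
    if reg_images <= 0:
        raise ValueError("need at least one reg image")
    reg_slots = subject_images * repeats             # reg images consumed per epoch
    passes = math.ceil(reg_slots / reg_images)       # >1 means the folder is repeated
    unused = max(0, reg_images - reg_slots)          # surplus images never get used
    return reg_slots, passes, unused
```

Under that assumption, 20 subject images at 10 repeats with an 11,000-image reg folder would use 200 reg images and leave 10,800 untouched.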

digital dune
#

So if I'm training 4 subjects, one has 400 class images, the other 200, the other 50, 25. Wouldn't you train them as 1_A person, 2_B person, 8_C person and 16_D person?
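
The arithmetic behind that folder naming can be sketched as follows, assuming the goal is simply to give every subject roughly the same number of image × repeat steps per epoch. `balanced_folder_names` is a hypothetical helper; the `<repeats>_<name> <class>` pattern is the folder convention used in the question.

```python
def balanced_folder_names(image_counts, class_word="person"):
    """Pick repeats so each subject contributes about as many steps as the largest one."""
    largest = max(image_counts.values())
    names = {}
    for name, count in image_counts.items():
        repeats = max(1, round(largest / count))  # never drop below 1 repeat
        names[name] = f"{repeats}_{name} {class_word}"
    return names
```

For the counts in the question (400, 200, 50, 25) this reproduces 1, 2, 8 and 16 repeats, i.e. 400 effective images per subject.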

hollow spruce
digital dune
#

yeah I aint touching xl in a year or two

#

but would that mean that one person gets 1 reggie, the other 2 reggies, the other 8 reggies etc. so they all get an uneven number of reggies?

#

I just want a solid 20 per subject goddangit. Why does kohya make things hard.

hollow spruce
#

rule of thumb is treat everything with a tenfold amount of images. so when you go for sdxl, just remember to up your dataset by 10, to achieve a 1.5 equivalent result. (but obviously at the higher 1024px)

#

and once you get into 4k image territory, then standard rules of finetuning change. for the better even, but its a mess, since it depends on so many factors

hollow spruce
#

1 word/folder class type of reg images always had their own issues, so I stopped doing that :/

#

it basically messed up the text encoder, and only changed how the model was overfit, instead of protecting it from overfitting

#

not an issue if you're doing like a pose for women, or a face expression for men - since you're not gonna generate cats/dogs/cars etc., but it does matter if you're trying to do more general type of lora/full finetune without damaging the core sd model

digital dune
#

you mean the reg image filename? or you use captions for the reggie?

#

I have 11,000 reg images in my folder so that GPT autotagger better come out soon 💀

#

Thankfully this is only an issue when training multi-subject models, so I'll have to bite the bullet and train a lot of small models and just buy a new ssd I guess. Really curious to see if tagging all those reggies would work 🤔

hollow spruce
hollow spruce
#

I'm currently at 6k manually edited images + manual tags 🤣

digital dune
#

The manually edited images I believe, the tags for reggies I do not

#

And I thought I was no-lifing this shit 💀

#

I mean honestly, a generic reggie dreambooth with a tagged lora is already an insanely accurate combo

hollow spruce
#

here's my woman subset 🤣

digital dune
#

do you use boorudatasettagmanager or do you recommend something else?

hollow spruce
#

assuming you're mad enough, then hydrus network is not only the best option, but also the only. Which makes me sad. Cause hydrus is terrible. But sorta required T.T

digital dune
#

I've decided that the optimal path is untagged dreambooth + tagged lora. Anything else would only be good for huge multi-subject training but I sure as hell am not gonna test that theory just to have it blow up in my face with all that time tagging regs

hollow spruce
#

in fact I think that's how it came to be

#

the manual tagging was an afterthought XD

hollow spruce
#

but at that point, might as well full finetune

digital dune
#

Do you by any chance have a model that was trained with tagged reggies vs. untagged reggies and all other parameters the same that you can share on civitai? Even if it's weeb stuff I really only want to see what kind of effect tagging has on the reggies which up until today I have never heard of

hollow spruce
digital dune
#

I wanna see

#

so you think it won't make a dif tagging them on 1.5 anyways right

hollow spruce
#

like by far

#

you can get away with so much more in 1.5

digital dune
#

my plan is to master all my subjects in 1.5 training, then if I wanna move them to XL, just use the best gens from 1.5 directly upscaled to 1024 and train those

hollow spruce
#

I'd recommend you to use vit-big-g to autotag all your reg images with captions only

digital dune
#

so that's also why I plan on waiting out XL

hollow spruce
#

that way you get essentially the same result - but its 100% automated

digital dune
#

Ty for the tip. I never really used anything besides boorutag

hollow spruce
#

in webui, install Interrogator extension
then use the batch option with Vit-bigG enabled and set to caption to autotag all your reg images

#

it will give you natural language prompts that are about 80~90% accurate for all your images

#

warning, Vit-bigG requires a 3090/4090. not sure how much lower you can go

#

If you are low on resources, you can use Vit-H to caption them instead

#

not nearly as good - but good enough since its just for the reg images

digital dune
#

seriously, thank you so much for the tip. I'm excited at the prospect of this leveling up my model game.

hollow spruce
harsh shoal
slow garden
#

Hi, I'm not sure where is best to ask this, so I will try here.

My grandparents passed away this year and I would like to create a special gift for my mother. I have scanned old photos of them and I would like to try and upscale / restore them- but I'm unsure on what the best way to do it is. I tried using upscale with codeformer however it seemed to change the look of them too much that I could tell it wasn't my grandma. I also don't know how to remove the artifacts from the scanning, such as speckles and white dots.

#

If anyone could help point me in the right direction for me that would be very helpful. I'm hoping I can get it all done for Christmas

stone garden
hollow spruce
supple pollen
#

hello! Im new to SD and im having trouble getting the juggernautXL model to work. I just get images like this, does anyone know a fix?

#

heres the image details beautiful lady, (freckles), big smile, green eyes, long hair, dark makeup, hyperdetailed photography, soft light, head and shoulders portrait, cover, random hair color, supermodel face lora:add_detail:1
Negative prompt: badhandv4, (worst quality, low quality, normal quality, lowres, low details, oversaturated, undersaturated, overexposed, underexposed, grayscale, bw, bad photo, bad photography, bad art:1.4), (watermark, signature, text font, username, error, logo, words, letters, digits, autograph, trademark, name:1.2), (blur, blurry, grainy), morbid, ugly, asymmetrical, mutated malformed, mutilated, poorly lit, bad shadow, draft, cropped, out of frame, cut off, censored, jpeg artifacts, out of focus, glitch, duplicate, (airbrushed, cartoon, anime, semi-realistic, cgi, render, blender, digital art, manga, amateur:1.3), (3D ,3D Game, 3D Game Scene, 3D Character:1.1), (bad hands, bad anatomy, bad body, bad face, bad teeth, bad arms, bad legs, deformities:1.3)
Steps: 40, Sampler: DPM++ 3M SDE Exponential, CFG scale: 5, Seed: 3685031502, Size: 512x512, Model hash: 1fe6c7ec54, Model: juggernautXL_version6Rundiffusion, VAE hash: c6a580b13a, VAE: vae-ft-mse-840000-ema-pruned.ckpt, Denoising strength: 0.7, Token merging ratio: 0.2, Token merging ratio hr: 0.2, Hires upscale: 2, Hires steps: 15, Hires upscaler: Latent, Lora hashes: "add_detail: 7c6bad76eb54", Version: v1.6.0-2-g4afaaf8a

latent charm
#

don't use 1.5 vae on sdxl model

supple pollen
#

i dont understand lol

#

explain to me like im a small child

#

i think i understand now

latent charm
#

Also, you might use 1024x1024 to get better result rather than 512x512

supple pollen
#

i turned off vae and still the same

#

i do hires upscale to 1024

latent charm
#

SDXL doesn't use the same VAE that SD 1.5 uses. That is why your image gets the weird effect. SDXL is designed to generate images at 1024x1024 without hires fix.

supple pollen
#

it worked!

#

new result

#

thankyou for the help

tame otter
#

kohya got a bunch of updates and is using a newer bitsandbytes now. so full bf16 training? whats that about?

digital dune
#

I tried ur autotagger and for some reason it tagged all my women as either "araf, araffed or arafed" and as far as I can search that isn't a real word

#

did it do that to yours? Should I replace that word?

hearty bane
#

I was watching my picture generate and got this first image. I was like wow, what a nice city, it'll be perfect, i'll take a screenshot of that. But two seconds later i see this: a lower quality image without what i prompted in the background. So my question, about something i've noticed several times: why are my pictures sometimes better when they are not finished?

errant nest
#

standard is 0.7 i think but i usually use 0.35

hearty bane
#

Idk i never really changed it

errant nest
#

if its at 0.7 it changes the look a lot

tame otter
#

anyone know of a tool to edit safetensor metadata?

tame otter
#

i found mention of one coming soon in the kohya-ss gui github page, but no sign of it
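
Until such a tool shows up, the metadata can be edited by hand. A minimal stdlib-only sketch, assuming the standard safetensors layout (8-byte little-endian header length, a JSON header with an optional `__metadata__` string map, then the raw tensor bytes). `read_header` and `rewrite_metadata` are hypothetical helper names, and `dst` must be a different file from `src`.

```python
import json
import shutil
import struct


def read_header(path):
    """Return the parsed JSON header of a .safetensors file."""
    with open(path, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(length).decode("utf-8"))


def rewrite_metadata(src, dst, updates):
    """Copy src to dst with `updates` merged into the __metadata__ map."""
    with open(src, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(length).decode("utf-8"))

        metadata = dict(header.get("__metadata__") or {})
        # Metadata keys and values are plain strings in this format.
        metadata.update({str(k): str(v) for k, v in updates.items()})
        header["__metadata__"] = metadata

        new_header = json.dumps(header, separators=(",", ":")).encode("utf-8")
        new_header += b" " * (-len(new_header) % 8)  # pad to 8-byte alignment

        with open(dst, "wb") as out:
            out.write(struct.pack("<Q", len(new_header)))
            out.write(new_header)
            # Tensor offsets are relative to the end of the header,
            # so the data section can be copied through unchanged.
            shutil.copyfileobj(f, out)
```

Writing to a new file rather than editing in place keeps the original intact if anything goes wrong; check the copy loads before replacing it.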

worthy hazel
digital dune
#

or if u got the vram cojones for it, train a dreambooth on it instead

dusky urchin
#

is it possible to train a lora on top of SDXL that

  • trains on a high quality dataset of captioned images
  • that have a mix of new terms, like more than one distinct character
  • and correctly generalizes on new captions that contain those characters

let's say my dataset contains pictures of two kinds of pokemon that CLIP has never seen before, Pokemon_X and Pokemon_Y. let's say they are never pictured together in this dataset. in order to correctly generate imagery for a photo of Pokemon_X and Pokemon_Y together which requires generalization, is it sufficient to (1) properly tokenize the captions to use an uninitialized pre-existing token / a new token (2) train the text encoder simultaneously with the unet? i am a sophisticated user.

worthy hazel
stiff dust
#

actually, I tried it with two separate Loras and it works. So if you have a Lora on subject A and a lora on subject B and you use both of them together you can show them in one image (e.g. shaking hands).
However, it did not work equally well for all my loras. For some Loras it starts mixing up concepts, for others it could separate the subjects well

#

but if it works with separated loras, it should work even better within a single lora

sonic narwhal
#

Anyone got a python script for using GPT Vision API for captioning?

#

That takes all images in input folder, creates txt for them and puts generated description in correct txt

opal jacinth
#

Didn't try it out yet, but saved it for later

sonic narwhal
#

thank you. Ill test it today

sullen locust
#

Hello sd users , i am getting this error while generating images

sonic narwhal
opal jacinth
opal jacinth
sonic narwhal
#

"can you give me a description of this image content. Separate each sentence with commas and dont use dots"

normal ember
#

Only issue with the script is that it will not handle rate limiting or any other error for that matter. But it can be improved.
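
One way the missing error handling could look, as a sketch only. This is not the script shared above: `caption_folder` is a hypothetical helper, and `describe` stands in for whatever function actually calls the vision API and returns a caption string. Failed calls are retried with exponential backoff, and images that already have a `.txt` are skipped so an interrupted run can be resumed.

```python
import time
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_folder(folder, describe, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Write <image>.txt next to each image, using describe(path) -> str."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    written = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = image.with_suffix(".txt")
        if target.exists():
            continue  # already captioned, makes reruns cheap
        for attempt in range(max_retries):
            try:
                caption = describe(image)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        target.write_text(caption.strip() + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Catching every exception is deliberately crude; a real script would probably retry only on rate-limit or network errors and fail immediately on everything else.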

digital dune
#

why. are. my. fucking. prodigy. loras. so. fucking. random

#

kohya can suck a phat dick I aint even bullshittin

#

sometimes theyre overtrained at 200 steps, sometimes it takes 5000 steps, sometimes they won't train properly at all

#

do yall recommend shuffling tags and turning on weighted captions?

#

nvm I figured it out. I had some extreme closeups in the dataset that were tripping the lora up despite me tagging them with double weights that these are extreme closeups.

#

what does appear to be random is how many reps to do in the folder. Some train beautifully with 01, some only with 20+ despite being the same number of steps. I can't find any rhyme or reason to this other than trial and error

tall condor
#

hey guys, do you recommend windows or linux for kohyass and Finetuning?

#

im setting up a new device and im wondering what the advantages of either are

tall condor
#

also i recommend random crop as well as adding a noise offset of at least 0.03-0.05

#

regarding repeats i recommend to balance out the model with no more than 15 reps. i usually do no less than 3 and no more than 10 to 15

#

epochs really do the trick for me

#

also note that some concepts are much harder to learn than others. this depends a lot on the base model you train on. if the base model already knows the concept and you are finetuning it, it takes way fewer steps than if it is something the base model has never seen or heard of

digital dune
#

do you weigh your captions using (()) or :x.x?

#

my autotagger uses escape slashes before the parentheses and I'm wondering if that's affecting anything

tall condor
#

i think the format needs to be correct

#

anyone know if for finetuning sdxl in kohya ss i still need to use the sdxl branch or can i use the master branch by now?

normal ember
#

it's been merged into master yes

dusky urchin
tall condor
#

if you want to have multiple characters in one lora you probably have to train it as one lora and not as 2

tall condor
#

anyone using kohyass with multiple gpus on windows?

dusky urchin
#

however i do not use the scripts directly, i use the code as a package and operate it directly from a proper python package

#

we authored types and documented all the args

#

however it apparently hasn't been uploaded yet

tall condor
#

anyone using kohya ss gui with windows and multiple gpus? for some reason windows only uses the first gpu

#

however on linux it uses them all

vernal dock
#

Hello! Anyone offers freelance services to help train models?

slate marsh
#

I've been trying to train an embedding recently, with textual inversion, both with a1111 and kohya. In both case it failed. No idea why. I get an embedding at the end, but it doesn't look like the subject, especially after I switch from sd-v1-5 model to a custom model.
anyone can help me? (actual good links that explain it properly, for example)

digital dune
#

as long as it fits in a single 3090

tall condor
#

i solved the issue with multi gpu on windows, however i had to change the backend from nccl to gloo - anyone know if there are any drawbacks from that

restive bridge
#

anyone get OOM trying to extract Lora from an XL Dreambooth? seems others are doing it fine but I cant even with 48gb ram and 24gb vram

stone garden
#

Are there any completely free online ways to train models? Google colab is now paid 😦

stone garden
#

Alternatively, how can I train custom models with a 30 to 100 img dataset on comfyui?

latent charm
#

You don't train on comfyui

stone garden
#

oh ok

#

Is it possible to use dreambooth on 6gb vram and about 100 training images?

digital dune
#

Im gonna say no

#

You dont train on comfy, you train on either kohya or old school A1111

#

You can train a lora though, should be good enough for most purposes.

#

If you do want a dreambooth though, I can train one for you, no charge. This is one of the rare cases where I'm intentionally backing off of training just because I've been experimenting so much it's making my head spin.

stone garden
digital dune
#

I would recommend training that on an animated model of your choice

#

this is the current training file I use and it's pretty neato mosquitarooni so far

#

just replace the folders with the ones with ur images and I suggest anywhere from 10-60 repetitions on the subject folder

haughty burrow
#

Hi, my Lora train process stopped at epoch 10/10 98% because the Google Collab runtime stopped, what do I need to do with this existing epoch so that it becomes final Lora?

digital dune
haughty burrow
sonic narwhal
#

Why did the lora that I trained on a portrait generate rotated portraits at inference

#

like this

#

None of my training data is rotated

stiff dust
#

looks like a bug, lol

#

in SD they always specify images in (height, width) while in basically any graphics tool it is (width, height)

#

thus it might happen that scripts accidentally swap both values

torn dew
#

I am just wondering. If I want to train say, an urban techwear style hoodie, but the patterns and designs aren't very varied in the images of it I have with it being worn, is it possible to include photos of different jackets or ones that aren't being worn to get it to take some of those pattern / design variations without it ruining the concept?

old quiver
#

some trainers might rotate the data for more variety, though it's usually something you need to turn on yourself. kohya shouldn't do that automatically, though another trainer might

sonic narwhal
#

I did in kohya

old quiver
dusky urchin
#

is anyone familiar with an approach for LORA training with a text encoder training that

  • adds a new token to CLIP, thereby changing its shape, initializing the embedding for that token appropriately
  • correctly accounts for the change in CLIP's shape, and correctly does NOT change the interaction between clip and the conditional UNET, since that shape should NOT change
  • "somehow" freezes the weights in pretrained clip, and only "trains" the new vocab token as it appears in the dataset as part of the broader LORA training approach
  • achieves this training with the same complexity as LORA training, because such a vocab increase "only" increases the rank of CLIP weights "by 1" in "a few places"
    the goal is to not rely on community-discovered "rare token" since nobody has done this in a scientifically rigorous way, and i plan to train a lot of new distinct characters / concepts
#

OR is someone aware of a scientifically rigorous discovery of "rare tokens" (meaning in the context of SDXL, "low impact" on the conditional unet when used)

stiff dust
#

what you say is basically "textual inversion"

#

however, I assume you want something different, like training the complete text encoder on a new token without changing it for other tokens

hot breach
# sonic narwhal

your training software is not respecting exif orientation. this is common if you are using photos off your phone. Phones commonly record images in the same orientation as the sensor on the camera itself, then just put a tag in the EXIF info to tell the applications that open it to rotate it on load (based on your phone's inertial sensor, i.e. gravity). by default the PIL package won't fix this on loading and you have to add my_image = ImageOps.exif_transpose(my_image) somewhere in the code to fix it
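
A minimal sketch of that fix, assuming Pillow is installed and that the images are loaded through a helper you control (`load_upright` is a hypothetical name, not something from kohya). `ImageOps.exif_transpose` applies the EXIF orientation tag and returns an upright copy.

```python
from PIL import Image, ImageOps


def load_upright(path):
    """Open an image and apply its EXIF orientation tag, if it has one."""
    with Image.open(path) as img:
        # Returns a transposed copy when an orientation tag is present,
        # so the result stays usable after the file is closed.
        return ImageOps.exif_transpose(img)
```

An alternative that needs no code changes is to re-save the dataset once with the rotation baked in, so every trainer sees the same pixels regardless of whether it reads EXIF.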

hot breach
# dusky urchin is anyone familiar with an approach for LORA training with a text encoder traini...

have you tried simply using the actual names for things? Like first+last name for people or characters. I agree, finding more rare tokens, then having some cheat sheet or lookup table to relate actual names to some rare token is a completely garbage workflow, and also its never been needed at least for fine tuning where you can and should be using natural language and real words/names for things

dusky urchin
dusky urchin
hot breach
#

even if you find a bunch of rare tokens, constantly having a reference sheet for them or having to code around that is just a pain

dusky urchin
hot breach
#

give the text encoder some credit, it's not a 1:1 translation engine of tokens to embedding outputs

dusky urchin
#

like it would be transparent

#

that's true, but it has been our experience that even the community's rare tokens, which were not rigorously discovered, work a lot better for loras with text encoder training enabled

#

i'm excited to share more on the work.

hot breach
#

I've never been convinced it is worth seeking rare tokens vs using natural language and letting the text encoder do its job

dusky urchin
hot breach
#

though I focus a lot more on fine tuning

dusky urchin
#

but that approach does not tackle character X is holding hands with character Y

#

@swift loom

what you say is basically "textual inversion"

dusky urchin
dusky urchin
#

is it as simple as, i use the rare token in the image's caption, just like i would with a LORA?

random minnow
#

Anyone know the ideal parameters for Lora training at 12gb vram?

digital dune
#

is triton good for anything?

stiff dust
#

in textual inversion you only train a single token which is newly introduced into the tokenizer (and, thus, is maximally rare ;)). So you don't need rare tokens when you use textual inversion. However, it has several disadvantages (which is why it is not used so often anymore):

  • training is much slower than Lora. Computationally, you don't save that much time training only an embedding instead of a lora. So time per step is similar. But you need many more steps to train a textual inversion. In Lora you train many matrices at once, so each step makes bigger changes to the model, while in textual inversion you only change ~2k parameters.
  • results are similar or worse than TE lora
  • most of the downsides of TE loras also happen in textual inversion

You could think that textual inversion is less destructive, as it only changes a single token and not the rest of the caption. But that's not really the case: even if you don't train CLIP, it will propagate your token to the complete caption. During training textual inversion, the embedding is learnt such that it seems to get as much attention as possible. Textual inversions often tend to dominate the caption, such that you lose prompt understanding.

In SD 2.1 I used textual inversion a lot. In SDXL it never worked so well for me and I found TE loras always better (and even they are very tricky). Quite often, only training unet is the better option :/
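
To put a number on the "~2k parameters" point: a back-of-the-envelope sketch, assuming one learned vector per text encoder and SDXL's two encoders with embedding widths 768 and 1280. The LoRA layer shapes in the usage note are purely illustrative, not the real SDXL layer list, and both function names are hypothetical.

```python
def textual_inversion_params(embedding_dims, vectors_per_token=1):
    """Trainable values for one new token: one vector per text encoder."""
    return vectors_per_token * sum(embedding_dims)


def lora_params(layer_shapes, rank):
    """Trainable values for LoRA: an (in x r) and an (r x out) matrix per layer."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
```

`textual_inversion_params((768, 1280))` gives 2048, while a rank-8 LoRA over just ten hypothetical 1280×1280 layers, `lora_params([(1280, 1280)] * 10, rank=8)`, already has 204,800 trainable values, which is why each LoRA step can move the model so much further.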

night estuary
#

changing setting sd_model_checkpoint to ohwxBusB_step_2000.ckpt [136f107cjn]: AttributeError
Traceback (most recent call last):
File "D:\SD\stable-diffusion-webui\modules\options.py", line 140, in set
option.onchange()
File "D:\SD\stable-diffusion-webui\modules\call_queue.py", line 13, in f
res = func(*args, **kwargs)
File "D:\SD\stable-diffusion-webui\modules\initialize_util.py", line 170, in <lambda>
shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
File "D:\SD\stable-diffusion-webui\modules\sd_models.py", line 741, in reload_model_weights
state_dict = get_checkpoint_state_dict(checkpoint_info, timer)
File "D:\SD\stable-diffusion-webui\modules\sd_models.py", line 315, in get_checkpoint_state_dict
res = read_state_dict(checkpoint_info.filename)
File "D:\SD\stable-diffusion-webui\modules\sd_models.py", line 301, in read_state_dict
sd = get_state_dict_from_checkpoint(pl_sd)
File "D:\SD\stable-diffusion-webui\modules\sd_models.py", line 245, in get_state_dict_from_checkpoint
pl_sd = pl_sd.pop("state_dict", pl_sd)
AttributeError: 'NoneType' object has no attribute 'pop'

I've trained 4 different models with fast-dreambooth colab repo but when it comes to select it from A1111 this error occurs. Whats wrong?

digital dune
#

that pop error is hella ancient it feels like I havent seen it in years lol

remote storm
#

I'm trying to train a model for a face on my own machine(16 GB VRAM). Would anyone recommend a model I can train relatively quickly? Also, what mechanism are people using to train SD models locally these days? Automatic1111?

ancient eagle
#

I'm training Loras on sdxl of animals, and in general it's working great. But for some animals, it randomly ignores their color with many prompts. Even simple prompts. Anyone have any idea how you might address this?

dusky urchin
#

we're polishing an ETL pipeline to make this reproducible. dagster & dask are pretty great with preprocessing, and our packagized kohya scripts are also really handy

stiff dust
stiff dust
sonic narwhal
#

Is it a bad idea to train a LoRA on ~1000 images?

stiff dust
#

no, why?

dusky urchin
#

that and all the default LR settings in kohya are trash 👀

stiff dust
#

what do you mean with splits?

dusky urchin
#

clearly for a lora only on that one character, no problem

#

but if you depict the character in different wardrobe for example, and you bothered to caption it, tough cookie, you have to design the training, validation and test sets

stiff dust
#

I never use test and validation sets... 😅

dusky urchin
#

so part of our pipeline is to report the distribution of tokens into sets
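
The pipeline itself is not shown here, so this is only a rough sketch of what such a report could look like. It assumes captions are comma-separated tags and treats each tag as a "token" (not CLIP tokens); `split_captions`, `token_distribution` and `missing_from_train` are hypothetical helper names.

```python
import random
from collections import Counter


def split_captions(captions, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle captions and cut them into train / validation / test lists."""
    items = list(captions)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * fractions[0])
    n_val = int(len(items) * fractions[1])
    return {
        "train": items[:n_train],
        "validation": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


def token_distribution(splits):
    """Count how often each comma-separated tag appears in each split."""
    return {
        name: Counter(tag.strip() for cap in caps for tag in cap.split(",") if tag.strip())
        for name, caps in splits.items()
    }


def missing_from_train(distribution, train_name="train"):
    """Tags that show up in another split but never in the training set."""
    seen = set(distribution.get(train_name, {}))
    return {
        name: set(counts) - seen
        for name, counts in distribution.items()
        if name != train_name and set(counts) - seen
    }
```

A tag that only appears in the validation or test split (say, one wardrobe variant) cannot be learned from the training set, which is the kind of imbalance a distribution report like this is meant to surface.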

#

lol

stiff dust
#

the loss is not a good metric anyways

#

I always use validation prompts and compare the results visually

dusky urchin
#

i'll be honest i'm not sure yet how impactful all this stuff is. i will report soon

wintry girder
#

I needed to improve my workflow for captioning datasets, so I made this GUI: https://github.com/MakingMadness/caption-fletcher/tree/main

I hope someone finds it useful.

Only tested on linux, so would appreciate feedback from windows users.

GitHub

A simple caption editor for image datasets. Automatically create captions, and hone them to be right on target!

#

Here's a screenshot 🙂

latent charm
#

I had done something similar. Haha

#

Something like this

silent saddle
# sonic narwhal Is it a bad idea to train a LoRA on ~1000 images?

300 is a good number. Above that it's overkill. I see many loras trained on 1000-1500 images with really bad or poor results, and many good loras with 100-200. What I mean is: pick the best cherries (images). Good quality, low noise, in focus, different poses, different distances, lighting, etc. Study the training parameters by repeating the process. There is no bulletproof recipe, because each dataset needs its own parameter adjustments. Try a small dataset first for faster training time.

wintry girder
latent charm
wintry girder
hollow spruce
# sonic narwhal Is it a bad idea to train a LoRA on ~1000 images?

quality > quantity
but assuming they are all good + well captioned, then go for it. more is definitely better - but it's a question of whether you can keep up the quality over that many images.
basically, the fewer images, the more you rely on overfitting. the more images you have, the more flexible your lora becomes.

it's a complex topic since it changes depending on what you're training.
One important tip is to fail fast. I'd suggest getting a 'barely working' lora with 50 images, so that you can run different settings/variations fast. See which parts overfit faster than others. Then fix those as best you can.
then double the dataset each time.

you don't need to do this each time you train a lora - because the lessons learned stay the same. the next time you can start with a more complex setup right out of the box.

#

fyi: my biggest lora is currently at 6k (manually tagged) images. so I say this from experience

valid sequoia
#

Anyone know what the most likely cause would be for a lora that doesn't like negative prompts? Just finished my 5th training on SDXL and everything seems to work quite well for most of the testing but if I add any token to the negative prompt it completely overbakes, like more than bumping the CFG to 10+

#

I noticed the lora epochs themselves seem to burn easier the longer it trained, cfg 11 for epoch 2 versus cfg 6 for epoch 10 as an example, but even epoch 1 burns if I use a negative prompt at all

valid sequoia
#

Just ran another training with 10 repeats instead of 20 and lowered learning rate from 0.0003 to 0.0001 and it did move the "finished" epoch from ~4 to 9 but the burning is still happening 😒

#

Makes me wonder if it's the model(s) i'm testing on

#

or I'm missing something else

valid sequoia
#

After a bit of checking the models I have it seems like there's a common problem with negative prompt causing the quality to tank, and one of them even requires clip skip -2. That's really annoying

stiff dust
#

do you train text encoder?

valid sequoia
#

I've tried both at this point

#

one video said not to, that lora is basically useless on most models

#

I think I may have finally figured it out. My lora training wasn't actually the problem, it was the models I was testing with (trained on base). Specifically, the AlbedoBaseXL model says "Leaving the negative prompt blank is the way to achieve the nice quality". It should say you can't use the negative prompt over CFG ~3 with a lora or it fries everything completely. At least the ones I've tried, not just my own. Multiple other models I was testing are based, likely heavily, on AlbedoBaseXL so the problem was widespread in my collection

stiff dust
#

then albedobasexl was probably trained with the text encoder, too

#

I agree that training unet only is not always an option. Training text encoder improves results often, but might also often have undesired side effects

valid sequoia
#

I ran a massive x/y with 3 epochs (5, 7, 10) from each of the 7 loras I trained, against every XL model I have. Found all but 2 were burning when there was a negative prompt

#

did that for CFG 5,7,9
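
For anyone wanting to set up the same kind of x/y comparison, here is a small Python sketch that just enumerates the combinations. The LoRA and checkpoint names are placeholders (the real list of XL models isn't given above); only the counts of 7 loras, 3 epochs and 3 CFG values come from the messages.

```python
from itertools import product

# Placeholder names; only the counts mirror the messages above.
loras = [f"lora_{i}" for i in range(1, 8)]   # 7 trained LoRAs
epochs = [5, 7, 10]
cfg_scales = [5, 7, 9]
checkpoints = ["model_a", "model_b"]          # stand-in for "every XL model I have"


def build_grid(loras, epochs, cfg_scales, checkpoints):
    """Return one job dict per combination, in a stable order."""
    return [
        {"checkpoint": ckpt, "lora": lora, "epoch": ep, "cfg": cfg}
        for ckpt, lora, ep, cfg in product(checkpoints, loras, epochs, cfg_scales)
    ]


jobs = build_grid(loras, epochs, cfg_scales, checkpoints)
print(len(jobs), "images per prompt and seed")
```

Keeping prompt and seed fixed across all jobs is what makes the grid comparable; the job list only varies the things being tested.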

#

Then I tested all the same loras on a 'good' model with CFG scales 2-15 and found the model only burned on a couple of the loras

#

Most of them were the "finished" epoch 10 versions, multiple epoch 5-7 were fine

#

Turns out the best training I have was actually epoch 5 of the first lora I trained, but the eyes 'glow' if I specify an eye color, which is what started all this.. some negative prompting can clean that up though I guess

stiff dust
#

to be fair: CFG is a hack. If you specify any CFG value over 1 the model is not doing what it was trained on
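
To make the "CFG is a hack" point concrete, this is the usual classifier-free guidance combination written out in plain Python on toy numbers. Real samplers do this on noise-prediction tensors; the three-element lists here are only for illustration.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, -1.0]   # prediction for the empty / negative prompt
cond = [1.0, 1.0, 0.0]      # prediction for the positive prompt

print(cfg_combine(uncond, cond, 1.0))   # scale 1: exactly the conditional prediction
print(cfg_combine(uncond, cond, 7.0))   # scale 7: extrapolated past it
```

At scale 1 the result is just the conditional prediction, which is what the model learned to produce. Anything above 1 pushes further away from the unconditional branch than the model ever did in training. In A1111-style UIs the negative prompt takes the place of the unconditional branch, which is one plausible reason a negative prompt changes how quickly things burn.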

dire orchid
#

hello guys, I'm new to finetuning sd and ran into a small problem

#

I get this error when trying to train with dreambooth: died with <Signals.SIGKILL: 9>.

#

is this a memory error? My dataset has 4k images, and my cpu ram is around 13 GB
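
A quick back-of-the-envelope check in Python. SIGKILL (signal 9) with no Python traceback is commonly the operating system's out-of-memory killer. The sketch assumes the training script keeps every image decoded in RAM as 8-bit RGB, and the resolutions are assumptions too, since neither is stated above.

```python
def decoded_size_gib(num_images, width, height, channels=3, bytes_per_value=1):
    """RAM needed to hold all images fully decoded, in GiB."""
    total_bytes = num_images * width * height * channels * bytes_per_value
    return total_bytes / 1024**3


print(round(decoded_size_gib(4000, 1024, 1024), 2))   # assumed 1024x1024 images
print(round(decoded_size_gib(4000, 512, 512), 2))     # assumed 512x512 images
```

Under the 1024x1024 assumption the decoded images alone come close to 13 GB before the model and the Python process are counted, so running out of CPU RAM is a plausible explanation.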

#

I can share my Colab if needed, I'm using a V100 rn, but can go for an A100 if needed

torn dew
#

If I am merging pruned models into my unpruned model, will it still be decent for lora training in the future? Or should I really only be merging unpruned if my goal is to use it for training?

gentle flame
#

Does anybody know if finetunes using perturbed noise or min snr gamma produce noisy outputs? I think one of the two is causing issues, but I'm not sure which (think it might be the former, but I want to know if anybody knows for sure)

#

example

opal jacinth
#

Which option do I have to adjust for training if the face comes out well in the aDetailer with the prompt "photo of ohwx woman" and not without?

#

so basically a prompt with "photo of ohwx woman" yields very good resemblance with the training data

#

but "portrait of (ohwx woman) as a viking, ...." has no resemblance without aDetailer, only if I heavily weight (ohwx woman:1.7)

#

I assume the training of TE 1+2 are out of balance? As I train unet + te1 + te2

stiff dust
dire orchid
#

Yeah I upgraded and it worked

silent saddle
torn dew
#

There is no benefit to training on a tune or merge? Even if I want to use that lora with said merge pretty much primarily?

silent saddle
#

Many discord friends too. To avoid any unexpected effect.

silent saddle
#

Training on base 1.5 gives you a "healthy" lora.

#

You won't get any bonus, but you won't get any flaws either.

torn dew
#

That does make sense. I will actually try both to compare. 🙂 But I guess I still am wondering if merging pruned and unpruned models has any issues. For training on em afterwards. Logic says it shouldn't

silent saddle
#

It is good because it will be easier to correlate parameter adjustments with results. The point is: if there is something strange in my lora, is it because of my parameters, or is the checkpoint to blame?

silent saddle
torn dew
#

Awesome. Thank you for your advice.

silent saddle
#

I had only 3 cases where I was unable to merge. Rare.

torn dew
#

I did a lot of MBW on my merges. It took forever to figure out what worked properly. I was worried I would have problems training on it lol.

silent saddle
torn dew
silent saddle
#

For up to 3 merges with "healthy" checkpoints, a normal merge is ok.

torn dew
#

I just notice normal merging just sorta smears everything together. Especially if I wanna mix realism and anime together.

#

That is why things look like plastic all the time.

silent saddle
silent saddle
torn dew
#

That is something I wanted to know. Is there a way to fix the text encoder?

#

Would I BASE merge a good one?

silent saddle
# torn dew That is something I wanted to know. Is there a way to fix the text encoder?

Man... I still haven't found a good way to fix a damaged TE. Don't import the base TE into your merge. It will make it dumber. What I do with some success is: make a set of prompts and create the images (all with the same parameters) with some checkpoints and with your damaged model. Then check which checkpoint matches your model most closely. Export the TE from it and import it into yours.

torn dew
#

That makes a lot of sense.

silent saddle
#

It's not good enough, but it really helps

#

Best practice with MBW: never mix the TE.

torn dew
#

AFAIK I never touch base.

silent saddle
#

Set 0 or 1.
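
One way to read "never mix the TE, set 0 or 1", sketched in Python on toy scalar weights. This assumes the advice means that the text-encoder weight in a weighted-sum merge should be exactly 0 or 1, so the TE is copied whole from one parent. `cond_stage_model.` is the text-encoder key prefix in SD 1.x checkpoints; the function itself is hypothetical and not taken from any merge extension.

```python
TE_PREFIX = "cond_stage_model."  # text-encoder keys in SD 1.x checkpoints


def merge_state_dicts(a, b, alpha, te_alpha=0.0):
    """Weighted sum (1 - w) * a + w * b over keys present in both dicts.

    Text-encoder keys use te_alpha, which must be 0 or 1 so the text
    encoder is copied whole from one parent and never blended.
    """
    if te_alpha not in (0.0, 1.0):
        raise ValueError("te_alpha must be 0 or 1")
    merged = {}
    for key in a.keys() & b.keys():
        w = te_alpha if key.startswith(TE_PREFIX) else alpha
        merged[key] = (1.0 - w) * a[key] + w * b[key]
    return merged


# Toy scalar "weights" standing in for tensors.
model_a = {"cond_stage_model.w": 1.0, "model.diffusion_model.w": 2.0}
model_b = {"cond_stage_model.w": 5.0, "model.diffusion_model.w": 4.0}
print(merge_state_dicts(model_a, model_b, alpha=0.5, te_alpha=0.0))
```

With `te_alpha=0.0` the text encoder stays that of model A while the UNet weights are averaged; `te_alpha=1.0` takes model B's text encoder instead.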

torn dew
#

The problem seems to stem when you touch models that have a lot of loras merged in, or furry models since they don't use danbooru tags.

#

It seems to be what I notice fractures the TE.

silent saddle
#

Checkpoints are a zoo right now. XL is still healthy. But almost all 1.5 ones are "contaminated".

torn dew
#

Very true. But I know some crazy person will release a tool or paper to fix all the contamination.

#

It's more of a when than an if.

silent saddle
#

Example: a realistic model with perfect anime eyes injected. It is a very common problem

torn dew
#

Yes. 100%.

#

But that is because people just do too many alpha merges.

#

I really do think Elemental Merges might be able to fix that. But with 600 parameters or whatever it is per layer, it's too hard to do by hand.

silent saddle
silent saddle
torn dew
#

cough RevAnimated cough

#

Completely breaks eyes.

silent saddle
#

Eyes, warped face, non symmetric body, long arms or fingers, old face but having young body, etc etc etc

#

But it is fun, and we learn a lot trying to figure things out and fix them

torn dew
silent saddle
#

In a few more months, migrating to XL will make more sense.

torn dew
#

XL just feels very restrictive right now.

#

1.5 is so flexible.

silent saddle
#

But im still in love with 1.5

torn dew
#

Same.

silent saddle
#

This model I've been working on for a long time can generate cool images with cfg 2 and 6 steps in 1 sec. Meanwhile there is a lot of buzz about turbo xl.

#

And i prefer the non perfect (non crystal clear) mood of 1.5

dusky urchin
#

i am using the image metadata and Adobe Bridge / similar applications

latent charm
timid grail
#

Hello everyone,
I'm having a real hard time understanding Lora training. I followed a tutorial to the letter, but I always get the same error.
Can someone please help ?

#

I'm using kohya_ss 22.2.2

#

Nevermind... Just found the solution, it seems

dawn oriole
#

Anyone has any idea how I could use CLIP models with the embedding size (77, 512) instead of (77, 768) for guidance in stable diffusion? Is it possible to do it?

dusky urchin
dusky urchin
#

does anyone know concretely which tokenizer Stability used in training SDXL's CLIPs? in the vocabulary, 2,496/34,483 of the vocab terms have word terminators, which are never produced by e.g. comfyui

stiff dust
#

of course they are

#

the word terminators are implicitly there whenever you have a word ending with a blank or nonalphabetic character
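
A toy Python illustration of where the word terminator goes. CLIP's BPE vocabulary marks the last piece of every word with `</w>`. The splitting regex below is a simplification and not CLIP's actual pattern, and the BPE merge step is left out entirely.

```python
import re

END_OF_WORD = "</w>"  # marker used in CLIP's BPE vocabulary


def pre_tokenize(text):
    """Split text into words and punctuation, appending the end-of-word marker.

    This only shows where the marker goes; the real tokenizer then applies
    BPE merges to each unit and looks the pieces up in vocab.json.
    """
    units = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    return [unit + END_OF_WORD for unit in units]


print(pre_tokenize("A photo of a cat, sitting"))
```

So the terminator never has to be typed in the prompt: it is added to every word before lookup, and vocabulary entries without it are only used for pieces in the middle of a word.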

tender hearth
#

Hey folks! I've just fine tuned my first model following this tutorial. Model is saved on my Google Drive but now I'm wondering how can I run the inference on it? I don't have Colab Pro, so can't use AUTOMATIC1111 Stable Diffusion WebUI

dawn oriole
stiff dust
#

regarding token count: in theory you can change the token count from 77 to something else, but results can be unpredictable

dusky urchin
#

i have a feeling there are more

#

the large amount of mojibake in vocab.json is not reassuring

wheat rock
#

Hello, I am developing a dreambooth model using 1 training image.
And I am not sure what the proper hyperparameters are for training the lora model.
I am currently using realistic_vision_5.1_no_vae as the base model.
And the lora rank is 4.
Please help me.
🙏

dire orchid
#

anyone knows what's currently the best vae for sdxl?

valid sequoia
dire orchid
#

gotcha

#

and the openai decoder thing only works with normal sd, right?

waxen pine
#

regarding text encoders in sdxl:
was sdxl trained with both encoders getting the same prompt?

#

is there any official statement on how they really relate to each other?

#

there are some tests what might be the best way to use them, but I never found anything official

#

regarding training unet only vs textencoder:
if you feed the unet with image / caption pairs, the unet will have some knowledge about the captions?
the textencoder has its somewhat limited vocabulary file, but it can also concat single letters to build unknown words?
-> then you only have to provide enough samples to learn a new concept?
why would you ever want to train the text encoder then? (maybe to associate an image with a caption faster?)

stiff dust
stiff dust
waxen pine
#

but was this also the case for the initial sdxl training?

stiff dust
#

there were always complaints about CLIP-G being less effective for certain artists and styles

#

so they just used both

waxen pine
#

aaaah i see 😄 thanks!

i did some tests with separate captions for each encoder during finetuning, and it turned out pretty well

stiff dust
#

using different prompts for CLIP-L and CLIP-G is very problematic. I wrote about that multiple times

waxen pine
#

so i was wondering

stiff dust
#

it works and sometimes might make the results better, but in most cases it will make things worse

waxen pine
#

i have the assumption that CLIP L works very well for a style tag and G for the rest

stiff dust
#

people have assumed that since the beginning, because it fits this idea of clip-g being better at understanding and clip-l being better at styles

#

but there is not a single study that proves that

waxen pine
#

but if sdxl was trained on both at the same time ... then i think separating them doesn't help..

stiff dust
#

whenever somebody really made a big case study with dozens of prompts and compared them it was always better to use both with same caption

#

but you can always make examples where some strategy works and the other not

waxen pine
#

i see, you have a point there

stiff dust
#

the reason why I don't like separate captions is the following:

#

the way clip+unet works is that your sentence consists of tokens (in most simple case these are words) and each token is connected with parts of your image (like each latent pixel in the image gets a probability to belong to a certain token)

#

this way, a face in your image might be connected to the token "trump" and during diffusion process the pixels are changed in a way to look more like trump

#

when you have two clip encoders, the embeddings of the tokens are just concatenated

#

so say you have a caption "trump as pope" and a second caption "hyperrealistic photography"

#

then what sdxl gets as input is the token ["trump"+"hyperrealistic"], the second token is ["as"+"photography"] and the third token is ["pope"+""]

#

this doesn't make any sense

waxen pine
#

i think "when you have two clip encoders, the embeddings of the tokens are just concatenated" is the crucial point.
is this internal to sdxl? because in comfy for example you can feed the captions separately

stiff dust
#

doesn't matter. They are concatenated as soon as they enter the unet

waxen pine
#

damn 😭

#

xD

stiff dust
#

concatenated means: the first token in CLIP-L is combined with the first token in CLIP-G
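
The position-wise pairing can be sketched in plain Python. The feature sizes 768 (CLIP-L) and 1280 (CLIP-G) are the ones SDXL uses, giving 2048 features per position. The sketch leaves out the start token and the fact that real encoder outputs are contextual (each position already carries some information about the whole prompt), so treat it as a picture of the layout only.

```python
CLIP_L_DIM = 768    # feature size of CLIP-L token embeddings
CLIP_G_DIM = 1280   # feature size of CLIP-G (OpenCLIP bigG) token embeddings
SEQ_LEN = 77


def pair_by_position(tokens_l, tokens_g, pad=""):
    """Show which CLIP-L token shares a position with which CLIP-G token."""
    n = max(len(tokens_l), len(tokens_g))
    padded_l = tokens_l + [pad] * (n - len(tokens_l))
    padded_g = tokens_g + [pad] * (n - len(tokens_g))
    return list(zip(padded_l, padded_g))


def concat_features(emb_l, emb_g):
    """Concatenate two (seq_len, dim) nested lists along the feature axis."""
    return [row_l + row_g for row_l, row_g in zip(emb_l, emb_g)]


print(pair_by_position(["trump", "as", "pope"], ["hyperrealistic", "photography"]))

# Dummy all-zero embeddings, just to show the resulting shape.
emb_l = [[0.0] * CLIP_L_DIM for _ in range(SEQ_LEN)]
emb_g = [[0.0] * CLIP_G_DIM for _ in range(SEQ_LEN)]
context = concat_features(emb_l, emb_g)
print(len(context), len(context[0]))
```

The sequence length stays at 77 and only the feature axis grows, which is why two different captions end up sharing positions instead of sitting side by side.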

waxen pine
#

yeah I understand that, but it's somehow a little bit sad xD

#

hmm thanks for all your thoughts

stiff dust
#

don't get me wrong: there are many weird things that surprisingly work even though they don't make sense

waxen pine
#

is there any paper or something like that how sdxl works under the hood?

stiff dust
#

I don't say that it's forbidden to use different captions

waxen pine
#

yeah, there are so many different parameters which influence the whole training process

stiff dust
#

but I would never ever encourage people to do that without evaluating for themselves what works better

waxen pine
#

that's one reason why I don't like the concept of shuffling captions, because it even adds more randomness to a random thing xD

stiff dust
#

unfortunately, the claim that you should use linguistic prompts for CLIP-G and style prompts for CLIP-L is spread and repeated very often without people ever questioning or evaluating it. Whenever such a discussion pops up, somebody comes with a single example where it makes things so much better and acts as if this were a proof

stiff dust
#

you don't want the model start relying on the exact order of the content in the prompt

waxen pine
#

hmm that makes sense

stiff dust
#

in general: if you train many epochs you might want to add some randomization to the training process to avoid that the model starts memorizing things

#

this can be random cropping, flipping (dangerous), caption shuffling or dropout
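
A minimal Python sketch of the caption-side options (shuffling and dropout). It assumes comma-separated tag captions and keeps the first tag fixed so a trigger word stays in front; the function and parameter names are made up here and only loosely mirror what trainers expose.

```python
import random


def augment_caption(caption, keep_tokens=1, tag_dropout=0.0, rng=random):
    """Shuffle comma-separated tags, keeping the first `keep_tokens` in place.

    Each tag after the kept ones is dropped with probability `tag_dropout`.
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rest = [t for t in rest if rng.random() >= tag_dropout]
    rng.shuffle(rest)
    return ", ".join(fixed + rest)


rng = random.Random(0)  # seeded only so the demo is repeatable
for _ in range(3):
    print(augment_caption("ohwx woman, viking, portrait, snow", rng=rng))
```

Each epoch then sees the same image with a differently ordered (and possibly shortened) caption, which makes it harder for the model to latch onto the exact wording.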

stiff dust
waxen pine
#

ahhh 🤩

stiff dust
#

so if you give the text encoder a word that is tokenized into single letters, it will still learn that the combination of these letters make up this word

waxen pine
#

ok ok that also makes sense

#

but would use more tokens

stiff dust
#

yes

#

which is then also bad for SD, because the unet also has to use all these tokens

#

(which doesn't matter that much as the unet is using a fixed length of 77 tokens anyways)

waxen pine
#

after some weeks of debugging kohya's scripts I'm wondering at every line: "why is this like this xD what happens if i change this to that xD"

#

i also don't understand why we need this super complex multi layer text encoder when the model was trained on alt tags

stiff dust
#

because of that

#

if you have a french word it will not be contained in the dictionary

#

so the text encoder has to learn that this combination of tokens belongs to a single word

waxen pine
#

i made some tests using the different layers as input (clip skip) but I'm no smarter than before

stiff dust
#

which will take many layers

waxen pine
#

i see 🤔

stiff dust
#

but yes, alt tags are horrible. But this is the reason why CLIP is used, as CLIP is very robust to shitty captions

#

but it's also the reason why the text encoder is so extremely stupid in many cases. The alt tags are just too bad to learn something really useful

waxen pine
#

i imagine something like this: "i have a really highly crafted vector embedding by clip which should be compared to an alt tag vector embedding"
how can they even be close to each other xD?

stiff dust
#

CLIP-L has only 12 layers (and SDXL reads the output of the second-to-last one). That's actually a very small transformer given the amount of data it is trained on

waxen pine
#

G has 32 or something like that i think

stiff dust
#

yes. I'm sceptical about that xD

#

like it definitely has a better text understanding

#

but it's probably wasted on alt tags

waxen pine
#

jap xD

#

i don't understand why, instead of using alt text, clip or blip wasn't used to caption the images beforehand?

stiff dust
#

I mean, CLIP is older than BLIP

#

also BLIP makes a lot of errors

#

even newer models like llava or cogvlm make a lot of errors

waxen pine
#

maybe as an addition to the alt text

stiff dust
#

llava is sometimes not even able to distinguish a monochrome photo from a colorful photo

waxen pine
#

hmmm

stiff dust
#

but I agree that this is the way to go

#

Dall-E 3 and Pixart Alpha did that

#

they made automatic captions and trained on them

#

Pixart Alpha is almost as good as SDXL, but was trained on 1/100 of the data

waxen pine
#

i have my eyes on pixart too xD

#

but currently so deep into sdxl so i want to learn everything about it first xD

#

can i ask another question xD? you have some serious knowledge

#

have you tried training with zsnr fix?

#

i read sdxl already implemented it

#

and kohya supports it

#

but my results are not getting really better with it

stiff dust
#

haha, you should ask Pseudoterminalx about that

waxen pine
#

the betas look so much different than the originals

#

and my math is worse haha

stiff dust
#

I think zsnr fix in kohya is something different than zsnr

waxen pine
#

at least kohya uses the little script which is used in the official paper

#

fix beta: zero terminal SNR

print(f"fix noise scheduler betas: https://arxiv.org/abs/2305.08891")

...

#

it's absurd that a few beta values can change the result that much 😅
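
For reference, the beta rescaling from the linked paper (arXiv 2305.08891) can be written in a few lines of plain Python. This is a re-implementation from the paper's description, not a copy of kohya's code; the `scaled_linear_betas` helper and its 0.00085 to 0.012 range are the usual Stable Diffusion schedule and are included only to have something to rescale.

```python
import math


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the last step has zero SNR (alpha_bar = 0)."""
    alphas_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar.append(running)
    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    # Shift so the last value is 0, then scale so the first value is unchanged.
    sqrt_ab = [(s - last) * first / (first - last) for s in sqrt_ab]
    new_ab = [s * s for s in sqrt_ab]
    # Convert the cumulative products back into per-step alphas and betas.
    new_alphas = [new_ab[0]] + [new_ab[i] / new_ab[i - 1] for i in range(1, len(new_ab))]
    return [1.0 - a for a in new_alphas]


def scaled_linear_betas(n=1000, start=0.00085, end=0.012):
    """Stable Diffusion's usual schedule: linear in sqrt(beta)."""
    lo, hi = math.sqrt(start), math.sqrt(end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]


betas = scaled_linear_betas()
fixed = rescale_zero_terminal_snr(betas)
print(betas[-1], fixed[-1])  # the last rescaled beta is 1.0: pure noise at the final step
```

The early betas barely move, but the last one jumps to 1. That single step is the difference between "almost noise" and "exactly noise" at the end of the schedule, which is why such a small-looking change matters.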

stiff dust
#

as I understand zsnr you cannot use it for epsilon prediction

waxen pine
#

with v-pred I always get grey images

#

during training and even in comfy

#

don't know how to get this to work

stiff dust
#

you cannot easily change the prediction type

waxen pine
#

in kohya you can enable it with v_parameterization but he also writes not to use it with sdxl

stiff dust
#

if a model is trained on epsilon, you cannot let it predict v
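
A small Python sketch of the two training targets on a single scalar sample, using the standard definitions: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise, and v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0. The numbers are arbitrary.

```python
import math


def training_targets(x0, noise, alpha_bar):
    """Return (x_t, eps_target, v_target) for one scalar sample."""
    a = math.sqrt(alpha_bar)
    s = math.sqrt(1.0 - alpha_bar)
    x_t = a * x0 + s * noise
    v = a * noise - s * x0
    return x_t, noise, v


x0, noise = 0.8, -0.3
for alpha_bar in (0.9, 0.5, 0.0):
    x_t, eps, v = training_targets(x0, noise, alpha_bar)
    print(f"alpha_bar={alpha_bar}: x_t={x_t:.3f} eps={eps:.3f} v={v:.3f}")
```

The two targets are different quantities at every step, so weights trained to output epsilon cannot simply be read as v. And at alpha_bar = 0 (zero terminal SNR) the input is exactly the noise, so the epsilon target is just a copy of the input and says nothing about the image, while v there equals minus the image. That is the usual argument for pairing zero terminal SNR with v-prediction.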

waxen pine
#

ahhhhhhhhh

stiff dust
#

you have to completely retrain the model for hundreds of thousands or even millions of steps

waxen pine
#

😭 again xD

stiff dust
#

pseudoterminalx is training an SDXL variant with v-pred

waxen pine
#

ok, so i don't have to investigate this route any further

#

thanks 🄰

waxen pine
#

thanks so much for your knowledge :D!!

waxen pine
#

woah 🤩

#

instant tryout now 😂

frozen island
#

Im planning on trying to train my first lora on a comic book artist's style, are there things I should avoid for the training dataset, such as text?

dusky urchin
#

and it's almost certainly not what stability uses for its training

#

like the tokenization is... let's say weak oftentimes

#

it's just not using the vocabulary correctly

dusky urchin
#

@stiff dust i think one of the reasons lora training with text encoding is going so poorly on sdxl is because the tokenization is broken. what people think are rare tokens are not really. stability, laion, whoever, almost certainly precomputed tokenization for openclip on its corpus, which to do so performantly means authoring BPE in something like the polars DSL. they probably didn't use any of the ordinary unoptimized python code at all.

silent saddle
#

I saw somewhere they do only one pass through part (picked randomly) of laion db.

gray pine
#

Image captioning for LoRa

Hi there,
I find image captioning a bit confusing as tutorials often have different approaches. My main problem is that they usually explain them on the example of subjects or styles. What I want to train is more of a concept. I want to train a LoRa that focuses on everything balloon related such as inflation and manipulation/handling by characters. I found that most models can’t produce any good results for this except for characters holding floating helium balloons.
So my questions are:

  • How should I caption the images, what to describe, what not to describe? Sentences or single terms or both?
  • Should I use a prefix? And if so should it be a term that the model is familiar with?
  • Should I use a unique identifier?
  • what Settings should I use for a concept like this? (Using Kohya SS)
  • should I use as many images as possible (I have about 300 rn) or only use a few?

Any other tips and hints are welcome.

I’m new to this so I hope I don’t ask stupid questions.

Thank you guys

silent saddle
# gray pine Image captioning for LoRa Hi there, I find image captioning a bit confusing as ...

Hey man, I don't know how to answer some of the questions for your case (it looks more complex than what I usually have experience with). On my loras I caption the things that are not related to my object of interest. It doesn't need to be detailed: "a table with fruits" is ok, "a table with orange, grapes and bananas" is not necessary. Use a trigger the model is not familiar with. You can test it beforehand to see what it generates. I prefer something with consonants and numbers together, like h4th4w4y or the shorter h4w4y. I like to train on the sd1.5 base model. Minimal influence from the model is better (my opinion). Settings are hard, there is no bulletproof recipe. Start with a few images (the 20-30 best quality ones) for faster training while you adjust your parameters. When satisfied, add 30 more, adjust again, add 30, etc. For your case I think more is better (300-500). For simple loras 30-50 is enough with a good dataset. Sometimes the 20 best images already do the magic.

gray pine
# silent saddle Hey man, I don't know how to answer some questions for your case (looks more com...

Hi! Thanks for your advice. It is definitely helpful and I will try some of these steps. I’m not even sure if a Lora is the right choice or if other training methods would be more fitting. What makes it difficult is that I don’t really teach a single object but make an existing object more defined, in this case balloons. So I’m afraid if I only describe everything else the model can’t distinguish enough. One idea I had was to make balloon related terms for captions. Like „small balloon“, „uninflated balloon“, „fully inflated balloon“. Also actions like „inflating balloon“ „holding balloon“ etc. I’m not sure if this is the right approach, especially if I use pictures with multiple balloons in them, how should the model know what specific balloon I am describing.

silent saddle
dire orchid
#

Assuming that I have a Lora that I already trained with dreambooth and saved on HF, how do I continue training this lora from where I left off?

#

Any clear guides on that? Maybe use the HF link as the checkpoint path or smth?

hollow spruce
# gray pine Image captioning for LoRa Hi there, I find image captioning a bit confusing as ...

so in a nutshell, there are multiple approaches to this, that will all work, but have different downsides/side-effects
learning the different methods is obviously trial and error.

The easiest way to get it working, imho, would be to take the 50 best images of your set, and start with those. Basically get a lora working 'well enough' with only those images. (the full 300+ will be used later)

make sure that signatures are removed. You can crop to 1:1 to avoid certain pitfalls, but this is entirely optional. Ideally you want bucketing to work, since that also makes your lora better in the long run -but that's not the focus of this question.
once you have them, caption them using only caption style words.
"<trigger word>, <subject>, <action>, <location>" <- really depends on your dataset, but keep it simple, and be careful when captioning to not miss any/apply when not relevant. Add anything specific, that you want to be able to prompt for.
"balloon_style, woman, sitting, indoor" <- one potential image caption

(why not natural language style captions? <- is a long conversation. but with such small datasets, using captions will almost always get you better results)

trigger word always first. shuffle turned on. keep n tokens set to 1 (so that trigger word stays front)
min snr set to 5. (read up on min-snr if you're brave enough. but leave it on, as it helps train new subjects a lot faster) -> https://arxiv.org/abs/2303.09556
settings REALLY depend on your hardware. I could help you on a 3090/4090 setup, since that's what I have.
stuff that stays same:
• Leave offset noise at 0 <- can and usually will cause issues
• LR between 0.001 ~ 0.0005 + constant scheduler + adamw8bit <- for your first tests run at 0.001. While not ideal, it will still show you very very fast if you have issues somewhere else in your lora, that you need to fix first
• Network dim set to 32 <- 8 or 16 also work, and will make the final model more flexible, but will be harder to get right if you're not used to making loras. Do not go above 64 unless you know what that means, as 64~256 will cause significant damage to base sdxl, unless you know how to fix that.
• use all the performance gains you need to make it work (depends on hardware -> stuff like xformers, bitsandbytes, full bf16, etc... - there are other guides on performance gains)
• you can either try a lower LR, and get it working just right, or a higher LR and then use the final lora at like 0.5 strength. (getting it right means more flexibility - but takes like 7 hours to train VS training for 10~25 minutes, but working with getting results via overfitting to some extent)
• enable text encoder training <- while definitely not always recommended, for this specific scenario, if you roughly follow my captioning example, will work really well. Just don't use any words that pertain to anatomy (mostly nsfw stuff), cause that will break your entire training very very quickly

when you're done, run an "infinity grid" to compare results from without vs with lora, and to test different epochs, and to test different cfg + step values + different prompts
(infinity grid can be found in A1111 extensions, or stableswarmui has native support for it)

when you're happy with your lora, remake it with all images you have available, keep all settings the same except learning rate, which you should tone down to around 0.0005, and then generate your truly final final™ LoRA (expect this to run between 5~15 hours on 24gb vram)
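
For the "min snr set to 5" step in the recipe above, this is the weighting from the linked paper (arXiv 2303.09556) for epsilon prediction, written as a small Python function. It takes alpha_bar (the cumulative product of 1 - beta at a timestep) as input; how a given trainer wires this into its loss is not shown here.

```python
def min_snr_weight(alpha_bar, gamma=5.0):
    """Loss weight min(SNR, gamma) / SNR for epsilon prediction."""
    snr = alpha_bar / (1.0 - alpha_bar)
    return min(snr, gamma) / snr


for alpha_bar in (0.99, 0.9, 0.5, 0.1):
    print(alpha_bar, round(min_snr_weight(alpha_bar), 4))
```

Timesteps with little noise (high SNR) get their loss scaled down, while noisy timesteps keep weight 1. The intent is to stop the easy, nearly clean steps from dominating training.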

silent saddle
#

@hollow spruce Good point about text encoder. ✨

dire orchid
#

hey guys, I'm fairly new to finetuning sdxl (tho I have some experience finetuning LLMs), can someone guide me to where I should look at to get up to speed with everyone?

#

I'm trying to train sdxl with a dalle 3 style captions (have around 5k images), but I feel like I'm missing a lot of knowledge that messes up my attempts

valid sequoia
dire orchid
valid sequoia
#

In most cases you're better off with maybe 50 images, I train faces with 10-20 usually

#

but training SD, especially SDXL, is a lot of trial and error. Start with a small set and see if it works at all and then adjust from there

dire orchid
#

but like I'm trying to make it learn complex composition and who is where in the image, would a small dataset really give it that understanding?

#

stuff like this guy is in the right and this guy is in the left

valid sequoia
#

SDXL in particular is good with natural language prompts, but you don't want your captions to be too long or overly descriptive

#

I'm not sure training a lora is the best option for things like composition

dire orchid
#

so full finetuning?

latent charm
#

It is quite complicated. I think some model on civitai already used dalle3 like descriptive training.

valid sequoia
#

controlnet would be way easier and likely more successful

dire orchid
#

any link please?

#

would love to talk to the guy and see what he learned

latent charm
#

I am not sure which. I think pixelwave might be one of them.

#

The author of pixelwave is in this server. You might ask him in sdxl or here

dire orchid
#

huh, interesting

#

ah can I ping?

#

@ruby pond Hey man, I'm fairly new to sdxl finetuning and would really love to learn. I tried to finetune sdxl with a dataset captioned by cogvlm (similar to chatgpt4vision), and I had to hack into the dreambooth lora code quite a bit to make it accept long prompts (longer than the native 77 tokens). Would love to learn how you did it

hollow spruce
#

@dire orchid
Try to make a text encoder only LoRA as a proof of concept. I can guarantee that that works up to a certain degree.

Assuming you wanna go all the way with your plan, you'll need around 30~50k images with descriptions. Keep in mind they need to be under about 77 tokens, since SDXL is not dalle 3 and can't keep up with such long captions.
If you pull off such a dataset, then your next step is either a cheap finetune, where you finetune on an A6000 or A100 with however big of a batch size you can get into it + keep accumulation high.
If you have money to spare or a company behind you, then get an 8-stack of A100s or even more, and do either a full finetune or a multi step finetune (first TE -> freeze TE -> then UNET)

#

in either case, you'll need a good dataset with prompts that arent biased towards any specific subject, since you'll be hardcoding that bias into the TE

dire orchid
#

ah, okay, that helps a lot actually

#

Im not sure of my hack and Im sure it fucks up a lot of things

dire orchid
hollow spruce
dire orchid
#

40 gb vram

hollow spruce
#

ah neat

#

then you're more than good to go.

use kohya-ss -> enable TE training -> disable unet training
do a few test runs to see how high you can push your batch size. the bigger the better (cause it adds context to how the prompts relate/differ from one another)
then let it run for 5 epochs. (save each epoch) for LR feel free to stay around 0.0008 <- way too high, but this is just for testing it
on an a100 this should be damn fast. far less than an hour.

download your safetensor files. run locally and compare to base. you should see significant changes around epoch 5.

run a few tests like that to eliminate any possible errors in your settings.
when you're happy, run it with LR of 0.0001 to get your first final version.

as your dataset grows, you can slowly increase the network dim, and decrease the LR even further <-- will require a lot of testing to get it just right

dire orchid
#

okay, right now I have a super biased dataset that always starts with "This image is", quirk of cogvlm

#

I can easily change it with a simple LLM script
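
A minimal sketch of that caption cleanup, using a plain regex instead of an LLM. The list of openers in the pattern is an assumption and should be extended after looking at the actual cogvlm output; `strip_caption_prefix` is a hypothetical helper name.

```python
import re

# Assumed boilerplate openers; extend this after inspecting your own captions.
PREFIX_PATTERN = re.compile(
    r"^\s*(this|the)\s+(image|picture|photo)\s+(is|shows|depicts|features)\s+",
    re.IGNORECASE,
)

def strip_caption_prefix(caption: str) -> str:
    """Remove one leading 'This image is ...' style opener and trim whitespace."""
    cleaned = PREFIX_PATTERN.sub("", caption.strip(), count=1)
    return cleaned.strip()

example_caption = strip_caption_prefix("This image is a close-up of a red fox in snow.")
print(example_caption)
```

Captions that do not start with one of the assumed openers pass through unchanged, so the function can be run over the whole dataset.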

hollow spruce
#

for starting, network dim 32 ~ 64 will probably be ideal

#

yeah cogvlm be like that XD

#

I usually limit its space, which gets rid of it. Probably not ideal though, so 🤷‍♂️

dire orchid
hollow spruce
dire orchid
#

amazing

#

do I need kohya or does dreambooth allow for training just the text encoder?

#

also looking into the dreambooth code, I'm not sure where it saves the trained encoder

hollow spruce
#

unless I missed something, kohya was the only one to have proper sdxl LoRA support
(with exception of diffusers - if you wanna go that route)

dire orchid
#

ah gotcha

#

then kohya it is

#

since I no longer need to do some random compel stuff to extend the max length, this should be easy

hollow spruce
#

ah, you gonna try to extend the token limit and average it?

dire orchid
#

I was doing that

#

no longer gonna do that

hollow spruce
#

did it work? xD cause it shouldn't... but I really wanted to try it anyway...

dire orchid
#

ehh, it did train

#

Im not sure if it trained well

#

I can share the code and you take a look at it

hollow spruce
#

I mean the resulting file?
(kohya has native support for that - you can extend up to 225 tokens)

#

so it's not a question of if you can, just if you should 🤣

dire orchid
#

ahh

#

wat the hell

#

I spent a week trying to put that into dreambooth

hollow spruce
latent charm
#

all llm generated captions have hallucinations. You might want to filter or fix your dataset before training.

dire orchid
#

true, I should look through it manually

dire orchid
#

the ones I see are big for no reason

hollow spruce
dire orchid
#

ah rip

dire orchid
#

or do u just code it up in python

hollow spruce
#

use the LoRA tab though

dire orchid
#

not sure what to click/change

gentle flame
#

If you have a noisy model and are training with perturbed noise, then consider reducing it.

dire orchid
#

I'm seeing this and considering just dropping the learning rate to 0 for unet, but that seems extra stupid

dire orchid
#

I will go through all of it later

hollow spruce
#

@stiff dust do you know the param argument to disable unet training? (so it can be used for Text encoder only)

dire orchid
#

maybe just this?

#

something like --train-unet = False

hollow spruce
#

I cry everytime I want to find a specific kohya setting 🤣

#

--network_train_text_encoder_only

#

dear god that took way too long to find

#

and yep, it goes into that box
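
A minimal sketch of the text-encoder-only test run described above, written as a Python helper that only assembles the command line and does not launch anything. The script name `sdxl_train_network.py` and the flag names are those of kohya's sd-scripts as best recalled here, so check them against `--help` on your installed version. The model path, folder names and batch size are placeholders, and `build_te_only_command` is a hypothetical helper.

```python
import shlex

def build_te_only_command(model_path, data_dir, out_dir, batch_size=8,
                          lr=8e-4, network_dim=32, epochs=5):
    """Return the argument list for a text-encoder-only LoRA test run."""
    return [
        "accelerate", "launch", "sdxl_train_network.py",
        f"--pretrained_model_name_or_path={model_path}",
        f"--train_data_dir={data_dir}",
        f"--output_dir={out_dir}",
        "--network_module=networks.lora",
        f"--network_dim={network_dim}",          # 32 ~ 64 suggested as a start
        "--network_train_text_encoder_only",     # skips UNet training
        f"--train_batch_size={batch_size}",      # push as high as VRAM allows
        f"--learning_rate={lr}",                 # 0.0008 is for quick tests only
        f"--max_train_epochs={epochs}",
        "--save_every_n_epochs=1",               # keep every epoch for comparison
        "--max_token_length=225",                # extended caption length
    ]

# Placeholder paths, not taken from the conversation.
cmd = build_te_only_command("sd_xl_base_1.0.safetensors", "train_img", "output")
print(shlex.join(cmd))
```

For the first "real" run the same helper could be called with `lr=1e-4`, matching the advice above.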

#

@dire orchid found it

#

also you should hide that link again ^^

dire orchid
dire orchid
#

Did you get that from parse args?

#

I should have done that work sorry

ruby pond
dire orchid
#

ah, nice

#

you use kohya ss?

ruby pond
#

Yeah

dire orchid
#

hmm, I wanted to ask a question about that, can you pass kohya an hf dataset that it can parse on its own?

#

rather than download the images and the captions and send the local folder

ruby pond
#

I haven't tried that, only done training on local data

dire orchid
#

ah gotcha

#

so, how do you make sure that the images and caption pairs don't get randomly scrambled?

#

That's kinda what Im afraid of here

ruby pond
#

Scrambled?

dire orchid
#

as in a caption is linked with the wrong image

#

they're two folders right?

ruby pond
#

Each image has a txt file with a matching name in the same folder

dire orchid
#

ah

#

so that's how the link is done, gotcha

#

imma just use id then for the name

#

and do you just specify the image folder here and thats it?

#

or is there like a caption thing somewhere else where I need to pass the same folder

#

sorry for asking too much btw

ruby pond
#

That folder should have a folder or multiple folders that are named starting with a number which is the number of repeats per epoch, e.g. 1_photo

#

The name isn't used unless you don't provide captions for each image

dire orchid
#

so, something like 1_1.jpeg and 1_1.txt could work?

#

for like the first image for example

#

this way I keep the names unique

ruby pond
#

Then select the train_img folder

dire orchid
#

so these subfolders are for classes?

#

or something like that?

#

or do they just hold image text pairs?

#

I'm not sure what's their use

ruby pond
# dire orchid I'm not sure what's their use

It uses the folder name if the images aren't captioned. Otherwise just for organising or to balance how many times the images are seen because the number is how many repeats per epoch
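
A minimal sketch of a sanity check for the layout described above: a `train_img`-style folder containing `<repeats>_<name>` subfolders such as `1_photo`, where every image has a `.txt` caption with the same base name. The function name, the extension list and the demo files are assumptions for illustration.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of image types

def find_unpaired_images(train_dir):
    """Return images in '<repeats>_<name>' subfolders lacking a same-named .txt."""
    missing = []
    for sub in sorted(Path(train_dir).iterdir()):
        # Only look at subfolders whose name starts with a repeat count.
        if not sub.is_dir() or not sub.name.split("_", 1)[0].isdigit():
            continue
        for img in sorted(sub.iterdir()):
            if img.suffix.lower() in IMAGE_EXTS and not img.with_suffix(".txt").exists():
                missing.append(img)
    return missing

# Demo on a throwaway folder: one captioned image, one without a caption.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "1_photo"
    folder.mkdir()
    (folder / "0001.jpg").touch()
    (folder / "0001.txt").write_text("a red fox in snow")
    (folder / "0002.jpg").touch()
    unpaired = [p.name for p in find_unpaired_images(root)]

print(unpaired)
```

Because the pairing is purely by file name, using a unique id as the base name (as suggested above) is enough to keep images and captions from getting mixed up.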

dire orchid
#

I'm thinking no repeat 5 epochs, unet plus text encoder training

#

over 5k images currently

ruby pond
#

Depends. If it's more of a general training on lots of photos, then 1 repeat 30 epochs. If training on a handful of images for a specific thing then I might do 10 repeats for 10 epochs
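
A rough sketch of how those repeat and epoch choices translate into optimizer steps. The batch size of 4 and the 20-image "handful" are assumptions, and the real trainer may count slightly differently (for example with aspect-ratio bucketing), so treat the numbers as estimates.

```python
import math

def total_steps(num_images, repeats, epochs, batch_size=1, grad_accum=1):
    """Estimate optimizer steps: images * repeats per epoch, split into batches."""
    steps_per_epoch = math.ceil(num_images * repeats / (batch_size * grad_accum))
    return steps_per_epoch * epochs

# General training: 5k images, 1 repeat, 30 epochs (batch size 4 assumed).
general = total_steps(5000, repeats=1, epochs=30, batch_size=4)
# Specific subject: a handful of images (20 assumed), 10 repeats, 10 epochs.
subject = total_steps(20, repeats=10, epochs=10, batch_size=4)

print(general, subject)
```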

dire orchid
#

ah gotcha

#

trying to do general

#

that will prolly take a lot of compute then

ruby pond
#

I usually just run overnight while I'm sleeping 🙂

dire orchid
#

yeah that make sense lol

#

also, wanted to ask, did you train text encoder too?

#

seems like it would help with model understanding too

ruby pond
dire orchid
#

ah gotcha

#

I trained my text encoder and it seemed to help

#

at least compositionwise

#

@ruby pond in my evaluation it mixed up right and left half as often as the base model and the run without text encoder training

#

sample size was 20 images

#

but then again, I trained it using dreambooth and I hacked into that code a lot

#

so not sure if that was better or worse than the kohya implementation, at least for this case

ruby pond
dusky urchin
dusky urchin
dusky urchin
sacred grail
#

what's the best auto-captioner out now? any progress on that lately?

lavish stream
#

Now training 4500/1000 steps with around 170 images. Using Adafactor with linear LR scheduler, using 3e-5 as the LR. Is the quality supposed to get better? I thought linear is going down from that initial value?

#

I'm running the Kohya-trainer-XL script, not sure if this is the latest, found via a direct Google link. It is not linked on Linaqruf's Github.

#

Relative_step has to be False, but doesn't this break the Adafactor LR adjustment? With True, the script says that warmup must be 0, but even if I set it that way, training does not run. The Adafactor scheduler never works, I'm getting the same error. I thought the Adafactor scheduler would be the best with the Adafactor optimiser?

#

Speaking of the linear scheduler, where does it head for the LR? Absolute zero?
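
Assuming the trainer uses the usual linear-with-warmup definition (the one Hugging Face's `get_linear_schedule_with_warmup` implements), the LR falls in a straight line from the initial value to exactly zero at the final step. A small standalone sketch of that rule, with 10,000 total steps chosen only as an example:

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Example values: LR 3e-5 as in the question, 10,000 steps assumed.
lr_start = linear_lr(0, 10_000, 3e-5)
lr_mid = linear_lr(5_000, 10_000, 3e-5)
lr_end = linear_lr(10_000, 10_000, 3e-5)

print(lr_start, lr_mid, lr_end)
```

So late steps make much smaller updates than early ones, which is the behaviour the "linear vs constant" discussion further down is about.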

latent charm
#

Planned to train a lora with Pick a pic v2 dataset. 1k image pairs proof of concept testing. Next step is to train with 10k image pairs.

#

Original, 200 steps lora with 1k image pairs

rotund cargo
#

Greetings, fellow creators. I want to ask you if any of you know how to maintain consistency in a face without using ReActor or any other faceswappers out there. Should I just train a checkpoint for my model, try specific prompts (I'd rather not), any other ideas? I appreciate your help.

grave carbon
#

cat

stiff dust
stiff dust
regal harbor
#

everyone says to use constant, but it seems more logical to me to use linear... wouldn't it prevent over training by doing smaller steps near the end?
I also don't really understand what 'loss' means

lone zenith
#

During training an AI is asked to produce something and that is compared against what the AI is being trained to produce, and the difference is the loss. During training, the model is slowly tweaked to minimize the loss.
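
As a toy illustration of that idea: in Stable Diffusion training the model typically predicts the noise that was added to an image, and the loss is the mean squared error between the predicted and the actual noise. The numbers below are made up and the vectors are tiny stand-ins for real latents.

```python
def mse_loss(predicted, target):
    """Mean squared error between two equal-length lists of numbers."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

true_noise = [0.5, -1.0, 0.25, 0.0]   # noise that was actually added
good_guess = [0.4, -0.9, 0.30, 0.1]   # close prediction, small loss
bad_guess = [-0.5, 1.0, 0.0, 1.0]     # poor prediction, large loss

good_loss = mse_loss(good_guess, true_noise)
bad_loss = mse_loss(bad_guess, true_noise)
print(good_loss, bad_loss)
```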

icy valley
#

Hi what s the best way to train sdxl model?

stone garden
#

I am trying to figure out lora training and I have 10 epochs that I am running through, about half of them seem to ignore the character I trained the lora on and fulfil everything else in the prompt, I get the same image from 5 epochs that has nothing to do with the lora I want. Does that mean those epochs were overtrained or is there something in the way I am testing it?

stone garden
# icy valley Hi what s the best way to train sdxl model?

I am a newbie to all this, but three ways I am aware of: 1) downloading and using Kohya SS, 2) using a Google Colab of Kohya SS, 3) using the Civitai system. The first two didn't work for me, so I am using the third, but you have to pay for it.

stiff dust
coral creek
#

hey everyone, I'm working on fine-tuning a stable diffusion 1.5 model. was curious if anyone has a good way of seeing if the model is improving?

the problem I have is that because the noise + train timesteps are random, the loss is very "jumpy" and doesn't give much signal. does anyone have experience with metrics that help indicate it is learning? or does the fact that it is jumpy mean the network is not learning?

normal ember
#

A/B testing is the way most do. Stability seems to have taken it a bit further with ELO scores.
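
Alongside A/B testing of sample images, one common way to get a little more signal out of a jumpy loss curve is to smooth it, for example with an exponential moving average. This is a generic sketch, not something the posters above describe using; `beta` and the fake loss values are assumptions.

```python
def ema_smooth(values, beta=0.98):
    """Exponential moving average with bias correction for the first steps."""
    smoothed, avg = [], 0.0
    for i, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        smoothed.append(avg / (1 - beta ** i))  # undo the initial bias toward 0
    return smoothed

# Fake loss that jumps between 0.1 and 0.9 on alternating steps.
raw_loss = [0.1, 0.9] * 50
smooth_loss = ema_smooth(raw_loss)

raw_spread = max(raw_loss[50:]) - min(raw_loss[50:])
smooth_spread = max(smooth_loss[50:]) - min(smooth_loss[50:])
print(raw_spread, smooth_spread)
```

A smoothed curve that trends down slowly is a weak hint that training is progressing, but it does not replace looking at generated images.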

coral creek
lavish stream
#

It's weird how almost none of the optimiser/scheduler settings even run in the original Kohya Colab script, and also Prodigy is missing :/ The GUI version is much better in this sense.

latent charm
dusky urchin
#

does anyone have experience resuming training on a LoRA that was distributed on civitai? aka no gradients or related data, just weights. i understand that kohya supports resuming training, but on its own little data bundle. my expectation is that this has to be recreated with undefined data for the training snapshot, but with trained weights

dusky urchin
boreal elk
#

Does anyone know how to make the transitions between batches smoother in comfyui? I'm doing 150 batches and you can see the cut between them at 150,300,450 frames and so on. Is there a node that saves the last 5 frames from the previous batch so it will use that map for the new batch? If so what is the name and where does it go? Thanks! 🙏 (unless that node increases render time, then does anyone know how to make it smooth with davinci resolve?)

stiff dust
stiff dust
steel tide
#

I've been creating SDXL LORAs of cartoon characters which for the most part work great at replicating the characters and style - However I'd like them to be more flexible so that I can change the overall style to something that it wasn't trained on as currently they're following the training images a bit too closely. Should I add some sort of captioning for the style?

bob belcher, wearing a white shirt, mouth open, sat in a bank

rancid tartan
#

I'm trying to train a LoRA of a specific building. I'm testing the LoRA now and it can consistently generate the general shapes of the building. However, very often the proportions are off and/or some parts of the building go on the wrong place or are just missing. And the building repeats everywhere. The building is a bit complicated, but I wonder if there is a way to make it more consistent

robust urchin
#

Am I wrong, or is it that the more description you give in your .txt when you train your lora, the more freedom you get to change elements?

#

For example I couldn't get a desert background until I described all my backgrounds in my database

#

It seems like if you don't describe it in your database, your lora will generate it automatically. But if you do, it will give you the choice to modify it or not.

jade hornet
robust urchin
# jade hornet This is true, you describe anything you want to be about to change later. If yo...

Oh ok thank you. I try to describe clothing now. In order to see if I can change them easily.

Also, the gender is easier to change when you say it is a man. Instead of nothing.

And, I'm not sure, but it seems a lower learning rate and fewer repeats may affect how easily things can be changed.

For example, with a higher learning rate of 0.0004 + 10 repeats, the lora becomes overtrained quickly, around epoch 20-25.

At 0.0001 or 0.00005 with 2 repeats, the lora stays flexible even after 100 epochs.

#

But sdxl lora are impressive and catch easier what you want. For example the spear the jaffa is holding. I put more Ma'tok staff images in my database and described it like a spear.

And it starts to look more like the thing from the show than a regular spear:

#

Same for the tissue strips attached to the helmet. They started to be generated when I described them in my .txt.

potent rain
#

Hi all! Where do you put / find your finetuned SD models? I'm looking for a good existing anime-centric finetune of 1.5 . I figure there must be a dozen options for this, but can't seem to find any.

stray mulch
#

Hello, I need help with Inpaint and hands.. How can I solve this issue?

#

How do I set it up to fix the hands? :/

#

I am new with intune

robust urchin
#

Even in describing the clothes, it is really hard to change them. Like I cannot get my jaffa in a suit with a tie...

#

But the Matok shaft is awesome now...

robust urchin
#

and I can control the eye color....

#

oh yeah....

#

start understanding how to train my lora...

sacred grail
robust urchin
#

But it doesn't seem to change it...

robust urchin
#

I train with armor. Not easy to modify the clothes but I start getting something

#

I remove the head and use that in my data base

feral orbit
#

Hey all, pretty new to the space so sorry if this is a stupid question. I just made a dreambooth finetune using epicphotogasm, but all the images coming out of it are a bit fried. I used 15 images to train it. Also I've run this same prompt like 10x, and just about every single one has the same pose, facial expression, angle, etc. Does this mean it's overfitted? Any pointers on what I'm likely doing wrong here, or where I should look to get my answers? I've been googling for a while and haven't quite found anything clear yet.

robust urchin
#

not perfect but much better than my last tries

#

they look intimidating now...

#

god bless sdxl lora...

robust urchin
#

And I keep flexibility for gender and background....

#

ok I will try few last things and start the ultimate lora about stargate !!!

#

damn... it really works....

dusky urchin
robust urchin
#

Also, I'm not sure yet, but I think it is important to remove the background when we can... SDXL Lora is really powerful and if you have a desert background, it will learn it at 100% and replace the random desert background from the sdxl model with the one from the lora.

#

they should call SDXL Lora, CTRL + C, CTRL + V...

#

weird because previous Loras from 1.5 and 2.1 needed a background. Otherwise, you got only empty background everywhere....

robust urchin
#

I think I have the perfect parameters now...

stone garden
#

What is fine tuning?

robust urchin
#

And I found a guide on captioning. The guy uses almost everything I use, but in more detail. I didn't know if I was making a mistake in using danbooru tags. But it seems good.

stone garden
robust urchin
#

checkpoints are general models

#

lora are some kind of style or you could call that flavor to the general model

#

for example, if you think the base Sdxl model is not able to generate a correct baby yoda. You will train it with a data base to have the one you want:

#

But most of Lora trained on civitai are useless. People are just bad at prompt engineering. Therefore, they create a lora. Since you can get a good baby yoda from the base model without training if you use the correct prompt.

#

But in my case, I am trying to create a stargate universe lora. And sdxl does not know that, probably because the sdxl devs do not know what stargate is, or because of copyright issues. Therefore, I have to create a database and train a lora in order to get that:

#

In short, you just try to make your model learn something it doesn't know. And that is a lora, embedding, hypernetwork, fine tuning...

topaz needle
swift kite
#

/img 3d

jade hornet
# topaz needle I am finding this trouble too with warhammer, are there any guides or tutorials ...

So many. You really need 2-3 tutorials. One that focuses on proper captioning - Google for a reddit post named "captioning datasets for training purposes". Another that will explain the settings of your chosen trainer, kohya scripts or kohya_ss wrapper, everydream 2, etc. and optionally one to show how to use cloud compute, unless you happen to have a top of the line card with lots of VRAM. Here's one that combines some of those https://aituts.com/sdxl-lora/

This guide will cover training an SDXL LoRA. It's meant to get you to a high-quality LoRA that you can use with SDXL models as fast as possible. "Fast" is relative of course. Gathering a high quality training dataset will take quite a bit of time. The actual model training will also take time, but ... Read more

robust urchin
#

And you can train your lora locally or on colab

jade hornet
#

Colab pro, vast.ai,runpod. Colab free not recommended

robust urchin
#

Locally = use your own graphics card
Colab = use a cloud

jade hornet
#

Unless you want crappy Lora, and piss off Google at the same time

robust urchin
#

Colab free works great for me. I opened 10 accounts and switch from one to another.

Only problem is my training must last less than 4h or I get kicked out...

#

But it is ok. Instead of creating one big lora of 400 images. I just split it in 20 images

jade hornet
robust urchin
#

Yep sdxl requires a lot of GPU ram... But Im not here to sell anything or create the ultimate sdxl lora. It is just to experiment and see some sexy jaffa

stiff dust
#

there are so many tutorials and each claims to be the ultimate guide. Don't take them too seriously. If you have enough vram (12 gb is good) you should rather experiment yourself

robust urchin
#

what is frustrating is I need a lora around 40 epoch to get good details and flexibility. Also, I need to keep batch size at 3-4 and repetition around 5. If I reduce my database I lose control too. So I need at least 20 images.
So to train my lora it is something between 3h and 4h...

#

I got disconnected after 2h30 for 2 days....

#

last one stopped at 35 epoch, I was so close to get something really good...

#

when I think about all these guys with their 4090 just making nudes... Give me the same thing and I flood the world with real ai art !

robust urchin
#

some stuffs dont work but damn I was close...

rare holly
#

I want to fix the collar and name tag in this picture.

#

I'm trying to use inpaint sketch, but it's changing everything for the worse.

#

What am I doing wrong here?

stiff dust
#

you can try better prompt. Add "detailed, collar" into negative prompt

rare holly
#

I was able to add those with inpaint sketch

#

I'm just not getting consistent results with inpaint sketch probably because I don't understand it very well

stiff dust
#

inpaint Sketch is identical to just changing the image with any graphics program before doing inpainting

stiff dust
#

inpaint sketch is the same as normal inpainting. But if you draw a red dot on the image and inpaint it it is more likely it gets something red

red hornet
#

With ControlNet Inpainting, how is the behavior different with and without an input image in the ControlNet panel?

topaz needle
#

Is it okay if I only install SDXL through civitAI as a checkpoint?

#

Or do I have to do anything else

#

I am soft in the head

robust urchin
#

faster and same quality

#

Me Im blocked on SDXL + Controlnet...

#

I try to use my Lora with controlnet but it doesn't seem to work in Comfyui...

dusky urchin
wheat rock
#

I am trying to train dreambooth lora sdxl with diffusers

#

But doesn't work well for me.

#

I am using this code.

#

When I change the image directory, it does work

sacred grail
#

Could it be possible to merge a specific prompt and seed into another model?

#

Im wondering the same, it could be really sick to do a bot that has all custom models and then asks which is the best image and merges the best prompt+seed combo into the base model

latent charm
stiff dust
#

in general, the first and last layers encode fine details and the middle layer encodes composition

#

there is no general formula. Use validation images and train until the images look good. Often it's better to slightly overtrain the lora.

wise dust
#

I asked that almost a year ago 😭

stiff dust
#

lol, sry, my discord is really buggy at the moment 😂

robust urchin
robust urchin
#

it finally works....

robust urchin
robust urchin
dusky urchin
#

i'm experimenting with some effects like adding and removing shadows. is there a way i can query clip for captions whose embedding is -shadow? or is there a way to train with negative conditioning in the text encoder?

south narwhal
#

hey guys, can anyone help? I keep getting ValueError: '/data/DuaLipa/DL 3.jpg' is not in list when I'm trying to train model in dreambooth, anyone know why?

astral island
#

when training dreambooth or lora, is there any mechanism to try to steer the model towards only modifying the relevant parts?
for example, when I have a dataset that consists of images of "lamborghini super car driving on a mountain road", is there any mechanism that helps the training algorithm to only modify the neurons responsible for "lamborghini","super car","driving" and "mountain road"?

full remnant
#

Wanna train a model for JAV-style images (SDXL finetuned on movies snapshots and posters), too bad there is no suitable dataset for that. 😢

dusky urchin
# astral island when training dreambooth or lora, is there any mechanism to try to steer the mod...

text encoders will learn contrastive examples best. so you'd need images of
a ford pinto car driving on a mountain road (lamborghini contrast)
a Lamborghini motorcycle driving on a mountain road (car contrast)
a lamborghini super car parked on a mountain road (driving contrast. observe this may be indistinguishable from driving, which should illuminate for you why driving is hard to depict: it is not something you can see)
...

stiff dust
# dusky urchin i'm experimenting with some effects like adding and removing shadows. is there a...
  • you can interpolate embeddings, e.g. take the two captions "a tree near a house" and "a tree near a house with shadow" and then for each token compute the difference between both captions. You can then add the difference multiplied with some scaling to increase or decrease the shadow
  • there are lora sliders that train the unet to increase or decrease a certain term, see https://github.com/rohitgandikota/sliders/tree/main
GitHub

Concept Sliders for Precise Control of Diffusion Models - GitHub - rohitgandikota/sliders: Concept Sliders for Precise Control of Diffusion Models
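
A toy sketch of the first suggestion above (shifting a prompt embedding along the difference between two captions). Real CLIP embeddings would be tensors of shape tokens x dims, padded to the same length for both captions; here small nested lists stand in for them, and all values are made up.

```python
def shift_embedding(base, with_concept, scale):
    """Move `base` along the (with_concept - base) direction, token by token."""
    return [
        [b + scale * (w - b) for b, w in zip(base_tok, concept_tok)]
        for base_tok, concept_tok in zip(base, with_concept)
    ]

# Stand-ins for the embeddings of "a tree near a house" and
# "a tree near a house with shadow" (2 tokens x 2 dims, invented numbers).
emb_plain = [[0.0, 1.0], [0.5, 0.5]]
emb_shadow = [[0.5, 1.0], [0.5, 0.0]]

more_shadow = shift_embedding(emb_plain, emb_shadow, scale=1.0)
less_shadow = shift_embedding(emb_plain, emb_shadow, scale=-1.0)
print(more_shadow, less_shadow)
```

A scale of 0 returns the original embedding, positive values push toward "with shadow" and negative values push away from it. Whether the result actually removes shadows in the image is something to test, not a given.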

dusky urchin
# stiff dust - you can interpolate embeddings, e.g. take the two captions "a tree near a hous...

i guess shadows are complex because there's "shadow" in the sense of an explicit image of a shadow, and shadow in the sense of the intrinsic rules of lighting. i am trying to express that the image is missing the intrinsic rule of lighting that causes shadow, and that we want it back. slider lora sounds promising - to me this sounds like colorizing a black and white photograph, like something that needs a distinct approach to fix.

dusky urchin
#

is there any community with deepfloyd loras?

lament zodiac
#

I am fairly new to training stable diffusion. I have trained a couple of lora models for a specific use case, and now I want to do it at large scale with a big dataset.
I used kohya ss gui to train the loras. Now I am using a dataset with 260000 images to train a lora, but I am unable to start it.

I am using runpod with an rtx 4090 gpu, batch size: 1

Any idea how to do training or finetuning on a large dataset?

stiff dust
#

what's your dim and optimizer?

lament zodiac
#

optimizer :- Adam W
no idea about dim

stiff dust
#

AdamW should be fine...

dim/rank is one of the most important parameters, you should check that. With so many images you can probably use high rank, but should be still below 64

latent charm
jade hornet
#

Why not just start a GitHub project vs polling a discord channel for interest, seems inefficient

#

Plus most people would prefer that link vs DM for more information

latent charm
gray pine
#

Hey Guys,
I want to do some training and not sure if should use LoRa, Dreambooth, Textual Inversion or something completely different. Let's do it on the example of big feet. So most models do know what feet are but can't really distinguish between sizes. My approach would be to gather many images of people with big feet, use the WaifuDiffusion Tags to describe them, i.e. 1girl, sitting, …., feet, big feet. Then I would train a LoRa with those images and the descriptions. I'm not sure if this is the best approach since I don't want to train every aspect of the pictures but just have the training focus on the "big feet" aspect. Any advice? Thx

latent charm
#

To control a specific concept, you might try concept sliders to train a lora

wind garden
#

Hello, can anyone help me?

#

I am trying to create sdxl lora model based on 6 product images

#

Every image has white background

#

I wanna keep the original color but after I finish training, it struggles with color

#

It sometimes generate random color product even if it capture the exact shape of the product

#

can anyone help me?

jade hornet
wind garden
#

These are captions I am using.
{"file_name": "9535401033758.jpg", "prompt": "a photo of TOK, chanel quilted flap bag"}
{"file_name": "14590ch_chanel_classic_double_flap_medium_front_strap.jpg", "prompt": "a photo of TOK, chanel classic flap bag in black"}
{"file_name": "9536196280350.jpg", "prompt": "a photo of TOK, chanel black classic flap bag"}
{"file_name": "9535399264286.jpg", "prompt": "a photo of TOK, chanel black quilted flap bag"}
{"file_name": "9535401164830.jpg", "prompt": "a photo of TOK, chanel classic flap bag in black quilted leather"}
{"file_name": "9535397691422.jpg", "prompt": "a photo of TOK, chanel quilted flap bag with gold chain"}
{"file_name": "9535401099294.jpg", "prompt": "a photo of TOK, chanel black quilted flap bag with gold chain"}

#

These are the product images

latent charm
#

remove "black" from the captions, that way it learns black as part of the bag
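
A minimal sketch of that caption edit, applied to metadata lines in the `{"file_name": ..., "prompt": ...}` format shown above. The color list and the helper names are assumptions, and whether dropping the color actually fixes the problem is debated further down in the conversation.

```python
import json
import re

COLOR_WORDS = ("black",)  # assumed list; add other colors the subject should own

def drop_color_words(prompt, colors=COLOR_WORDS):
    """Remove color words; a trailing 'in <color>' phrase is removed whole."""
    names = "|".join(re.escape(c) for c in colors)
    pattern = rf"\s+in\s+(?:{names})\b(?=\s*(?:,|$))|\b(?:{names})\b\s*"
    cleaned = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

def clean_jsonl_line(line):
    """Parse one metadata line, clean its prompt, and serialize it again."""
    record = json.loads(line)
    record["prompt"] = drop_color_words(record["prompt"])
    return json.dumps(record)

sample = '{"file_name": "9536196280350.jpg", "prompt": "a photo of TOK, chanel black classic flap bag"}'
cleaned_line = clean_jsonl_line(sample)
print(cleaned_line)
```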

dusky urchin
wind garden
#

Yeah.
How can I improve captions?

stiff dust
#

can you show an example image where the color went wrong?

wind garden
wind garden
#

This is the example image

stiff dust
#

wow, that's weird 😅

#

colors should be easy to learn...
do you use noise offset and if so how much?

wind garden
#

Just use default

stiff dust
#

maybe the word "gold" is somewhat bleeding over the image. What happens if you generate the image with very low CFG (e.g. cfg 2-3)?

wind garden
#

In case of cfg=2 in inference, still go wrong sometimes.

stiff dust
#

maybe the color is "too easy" such that it does not learn it. I don't think it's a problem with the caption. You can try using a higher noise offset and/or training with min-snr-gamma

wind garden
#

I train the model with following parameter set.
accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="./in" \
  --instance_prompt="a photo of a TOK" \
  --validation_prompt="a photo of a TOK, a beautiful young woman holding black chanel bag" \
  --output_dir="out" \
  --caption_column="prompt" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --train_batch_size=1 \
  --repeats=1 \
  --optimizer="prodigy" \
  --learning_rate=1.0 \
  --text_encoder_lr=1.0 \
  --max_train_steps=720 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --train_text_encoder_ti \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --rank=32 \
  --checkpointing_steps=2000 \
  --seed="0"

#

And also tried with this set.
accelerate launch train_dreambooth_lora_sdxl_advanced.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="./in" \
  --instance_prompt="a photo of a TOK" \
  --validation_prompt="a photo of a TOK, a beautiful young woman holding chanel bag" \
  --output_dir="out" \
  --caption_column="prompt" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --train_batch_size=3 \
  --repeats=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=1.0 \
  --text_encoder_lr=1.0 \
  --adam_beta2=0.99 \
  --optimizer="prodigy" \
  --train_text_encoder_ti \
  --train_text_encoder_ti_frac=0.5 \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --rank=32 \
  --max_train_steps=1000 \
  --checkpointing_steps=2000 \
  --seed="0"

#

But neither works well.

normal ember
#

They (huggingface) combine some new stuff in that training method along with existing like prodigy optimizer. They are trying to find something that doesn't require as much adjustments of the hyperparameters.

#

Not had much luck with adaptive optimizers myself.

#

I've seen suggestions that it would be better to stick with fp16 instead of bf16 too.

regal harbor
jade hornet
# wind garden I tried that but it didn't work

his advice was good, if you caption the color of your subject, in this case a purse, you will get other colors in the inference. Only caption the things you dont want the training to learn. You can caption distance, some of your images are "close-up", some are "full shot", some are "medium shot". You can caption orientation, some of yours are "profile", some are "frontal". Doing that should allow you to ideally try to pull those out later. In general, just provide a token to be able to reference it later, "xyzpdq" ...or pick something else less cryptic, maybe include a designer "xyzdesigner" and a class "purse". In your training shuffle captions put your special token first, your class "purse" 2nd, and then for the rest just caption what you dont want..like maybe the white background? unless you want that in your results. Use "keep tokens" of 2, assuming you're using kohya, which keeps those special tokens prioritized
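
A small sketch of the caption layout described above: special token first, class second, everything else shuffled on each pass while the first two tags stay in place. This only imitates the idea of "shuffle captions" with "keep tokens" of 2 on comma-separated tags; it is not kohya's actual code, and the example caption is made up.

```python
import random

def shuffle_caption(caption, keep_tokens=2, rng=random):
    """Shuffle comma-separated tags, leaving the first `keep_tokens` in place."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)
    return ", ".join(fixed + rest)

# "xyzpdq" is the special token, "purse" the class; the rest is what we
# do NOT want baked into the subject.
caption = "xyzpdq, purse, white background, close-up, frontal"
shuffled = shuffle_caption(caption, keep_tokens=2, rng=random.Random(0))
print(shuffled)
```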

#

one thing I'll say here about having all images against a white background, unless that's desired, even with captions that will be hard. you're honestly better off doing some hack jobs to put the purse in some diverse backgrounds...or maybe you can do some creative things with controlnet. like generate a lora with the images you have, but then use that lora, in addition to controlnet to create a new image dataset as a source for your next generation of lora that's far more diverse.

#

you can do all black background vs all white, then in controlnet generate some depth images, canny, etc, and with those you'll have more power and flexibility to put the products in any kind of background you want

dusky urchin
# wind garden Yeah. How can I improve captions?

you aren't going to make progress with the assets you have. it will not generalize to coloring the bag differently. you will have to modify the bag in photoshop to generate more data; or, the best option is they mail you the variety of bags they want you to incorporate, and you photograph and caption them yourself. otherwise, most of your choices for the arguments look flawed to me.

dusky urchin
# wind garden

is the expectation that it looks flawlessly like the chanel purse? i think the media managers for this kind of thing are pixel peepers

dusky urchin
#

let's say I want to train a lora A against SDXL + another Lora B (lora B is frozen) aka instead of my weights looking like

SDXL (frozen) . LoRA A (backpropagation changes it in training)
i want SDXL . Lora B = SDXL'
and now i want to train
SDXL' (frozen) . LoRA A (backpropagation changes it in training)

is this possible?
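
one way to read the setup above, as a toy sketch only: fold LoRA B into the base weights once, treat the result as the frozen SDXL', and attach a fresh LoRA A on top. Pure-Python 2x2 matrices stand in for a real layer; `matmul`, `merge_lora`, the shapes and the scale are all assumptions for illustration, not any trainer's API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(weight, lora_up, lora_down, scale=1.0):
    """Return W' = W + scale * (up @ down) without modifying W."""
    delta = matmul(lora_up, lora_down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy "SDXL" layer and a rank-1 LoRA B (both frozen).
W = [[1.0, 0.0], [0.0, 1.0]]
B_up = [[1.0], [2.0]]     # shape (2, 1)
B_down = [[0.5, 0.0]]     # shape (1, 2)

# SDXL . LoRA B = SDXL' (computed once, then kept frozen)
W_prime = merge_lora(W, B_up, B_down)

# Fresh LoRA A with one factor zero-initialized, so training starts
# exactly at SDXL'; only A_up / A_down would receive gradient updates.
A_up = [[0.0], [0.0]]
A_down = [[0.3, -0.2]]
effective = merge_lora(W_prime, A_up, A_down)
print(W_prime, effective)
```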

wind garden
#

I used this code to replace the background.
But this code is not working well, and the final result was not good either

wind garden
# jade hornet his advice was good, if you caption the color of your subject, in this case a pu...

One question about captioning: if I put the color in the caption, is it possible to capture the exact color of the original product after training?
For example, the exact color of the product is black brown.
But I use black in the captions.
After I train the model, I simply use a prompt like this:
"a black chanel handbag on the table"
The question is, can the color word "black" in the prompt capture the exact color of the original bag (black brown)? Or will it generate a black bag?

jade hornet
#

you're telling it that the bag is black, so it will ignore the color entirely...I wouldn't describe the bag at all. if you want it to learn it as it was presented in your images, then let it learn it without trying to tell it what it sees

dusky urchin
jade hornet
#

uncertainty noted

dusky urchin
jade hornet
#

well there is subject training and style training, "it" refers to the subject

dusky urchin
#

there are guides online that say things like subject and style training

jade hornet
#

I thought it was pretty obvious the bag was the subject in the conversation

dusky urchin
#

but all of that stuff is folksy wisdom

#

i mean some of it can be useful, but it isn't robust or generalizable. it's usually 1 person's experience doing like 10 things and reading 100 reddit posts. and maybe the 1 thing of the 10 that guy does works, and he writes about it

#

writing a caption "black chanel bag" in my opinion doesn't make the text encoder in CLIP ignore, or focus on, anything, in a generalized sense. but looking at the dataset, it would be more accurate to say that without an example of a white bag, a text encoder LoRA isn't going to be able to tell whether "black" goes with the shape of the bag or with the color of the bag

jade hornet
#

you casting shade over all the guides out there isn't really constructive in this case. I guess take what I said at face value, or dismiss it with all the other folklore then

dusky urchin
#

@wind garden so you better break out photoshop or get more images

#

so you are right that the user should remove "black" from the caption, but only because there are no examples of white bags. with the current dataset, black would become associated with every aspect of the image, which is undesirable. it's the Contrastive in CLIP that is missing.

jade hornet
#

if he had a white bag, and he captioned white bag, I'm saying that would be a bad idea if he wanted it to be white, because by captioning it thus, you essentially turn the subject's color into a variable

dusky urchin
#

i think the user also misunderstands the purpose of LoRA. it isn't a general subject or style learning approach that works with small datasets, it's computational trickery to make diffusion model fine tuning require way less resources

dusky urchin
jade hornet
#

you can certainly still prompt for a white bag, but as you just pointed out, prompts can be flawed and using a color like that in a prompt may and almost certainly will bleed into other items in the image

dusky urchin
#

it's dumb luck that it works well with styles and subjects undertrained with small datasets

#

i'm sure tencent is working on the exact application the user needs right now.

jade hornet
#

so all I am suggesting is to let the training discern the color, don't try and force it with captioning

#

this is something I know to be true: if you have 10 images of someone with blonde hair, you'll always get blonde hair in your inference, unless you caption the hair

dusky urchin
#

does there exist an idea of pretraining for loras? i.e., i would always resume lora training from a pretrained lora checkpoint, with its gradients and everything?

latent charm
#

what benefit does it have? train faster or better result?

dusky urchin
# latent charm what benefit does it have? train faster or better result?

i would assume better results. a pretraining for example might take SDXL's pre-existing dataset, and train a lora against it, which ought to produce no visible difference in the results. but such an "initialized" lora will have good gradients and weights for the unet & text encoder, even a learning rate, that may result in improvements when you resume that lora with your training set

latent charm
#

You might do an experiment and share the results. I think if you are going to train several similar loras that have the same style but different characters, it might be helpful. But in general, everyone trains with a different dataset and pushes the lora close to that dataset. A general pretrain might pull in a contradictory direction.

autumn jay
#

Hi everybody, I'm trying to make a lora, but after many tries the result isn't good. My character is always malformed. I have read and listened to many tutorials, but it seems none of them was made by a person who clearly understands the concept. Would somebody answer my questions? About captions, images... I have missed something, but what?

jade hornet
#

Or just generate a full checkpoint in the first place

#

Checkpoints are big so you'll want to keep it on Hugging Face or gdrive

wind garden
#

I wanna generate a background for a product image.
What is the best way to generate a background for an object?

regal harbor
autumn jay
#

I had a question about photo quality. I need to resize photos to 512*512 for SD training. I download high quality photos of my character from the internet, but when I resize them (even if I crop them first to keep only the portrait) I lose information (which is logical, because I go from 1024 to 512 for example, which is half the resolution per side). Is there a way to keep extremely high quality when I zoom in on the 512 picture?

regal harbor
autumn jay
#

ok, so I can crop the image to get the part that I want to train, no matter the size (1024*1024, 2048*2048, ...), and kohya resizes it to 512. With good quality? because I used a program to crop all my images, but the result is lower quality than when I use gimp and do it by hand
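
a small sketch of the arithmetic behind the crop-then-resize question, in plain Python. The function names and the example sizes are assumptions picked for illustration; the actual resampling is left to whatever tool does the resize.

```python
def center_crop_box(width, height):
    """Largest centered square as (left, top, right, bottom)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

def pixel_fraction(src_side, dst_side):
    """Fraction of pixels left when a square goes from src_side to dst_side."""
    return (dst_side * dst_side) / (src_side * src_side)

# Hypothetical 1536x1024 photo: crop the centered 1024x1024 square first.
box = center_crop_box(1536, 1024)

# Halving each side (1024 -> 512) keeps one quarter of the pixels,
# so some softness at high zoom is expected whatever tool is used.
kept = pixel_fraction(1024, 512)
print(box, kept)
```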

old birch
#

Hey guys, I need help with how to add certain objects or specific body parts to an image. Let's say I have a portrait of a man, and I want to add a military hat and badges to him. So I inpaint the forehead and hair area for the hat, and the chest area for the badges, but the results are poor. The color and style don't match well and it looks bad. The badges aren't even the type I wanted.
So my question is, is there any way to use my own resources (google images) to add them and keep the consistency? I've read that lora works well for characters, but I'm not sure if it works well with objects and faces

#

Not sure if new model, lora or embedding would do the best job for that

grand stag
#

Loras should work well enough, yeah, as they limit the training a bit more to what the lora is trained on instead of the whole model

old birch
#

So I just need to train a lora with the object I want and it should be fine?

dusky urchin
old birch
#

I just like sharp and high res stuff, but at least that looks realistic or in the context

#

for instance, this image from the user kuromi, how would you add a military hat to the main subject?

dusky urchin
#

in this particular instance, there's no chance a diffusion model will correctly add a hat that is cropped by the top of the frame. you would need to outpaint some margin at the top of the image, then inpaint a hat, then crop it again.
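
the outpaint, inpaint, crop sequence above can be sketched as canvas bookkeeping in plain Python. A 2D list stands in for the image, `None` marks pixels a model would still have to fill, and `pad_top`, `outpaint_mask` and `crop_top` are hypothetical helpers; no diffusion model is called here.

```python
def pad_top(image, margin, fill=None):
    """Return a new image with `margin` empty rows added above the original."""
    width = len(image[0])
    return [[fill] * width for _ in range(margin)] + [row[:] for row in image]

def outpaint_mask(height, width, margin):
    """1 for the new top margin (to be outpainted), 0 for original pixels."""
    return [[1 if r < margin else 0 for _ in range(width)]
            for r in range(height)]

def crop_top(image, rows):
    """Drop `rows` rows from the top, e.g. to return to the original framing."""
    return [row[:] for row in image[rows:]]

portrait = [[1, 2, 3], [4, 5, 6]]            # toy 2x3 "image"
margin = 2                                    # assumed headroom for the hat
padded = pad_top(portrait, margin)            # step 1: room to outpaint
mask = outpaint_mask(len(padded), 3, margin)  # region the model must fill
# step 2 (not shown): outpaint the margin, then inpaint the hat across it
restored = crop_top(padded, margin)           # step 3: crop again
print(padded, mask, restored)
```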

#

there are some InstructPix2Pix-style solutions, but none that are really easy for the community to use or robust enough for whatever problem you throw at them

old birch
#

okay i get it

#

So better to generate similar image directly, instead of adding stuff right?

dusky urchin
stiff dust
#

you can always add the hat yourself with a graphics program (e.g. cut the hat from one of the soldiers, increase its size and copy it onto the guy) and then do inpainting afterwards to turn the poorly copy-pasted hat into a hat that fits the image

lilac egret
#

Hello, can anyone help me fix this? When I use roop or Reactor, this happens to the face. I tried changing Restore Face Visibility and CodeFormer Weight (Fidelity), but the same thing happens. And when I use ip adapter to change the face, it does not recognize the face.

dusky urchin
tame otter
#

you can draw masks directly in a few UIs: Fooocus, InvokeAI, automatic1111, vladmandic

#

i'm not sure if comfy or swarmui have anything setup for live masking

#

there are also great extensions like segment anything or inpaint anything, for automatic1111, which make creating masks a breeze

stiff dust
#

my point was not so much about drawing the mask (yes, that's possible in comfyui and any other ui I know), but about making a rough sketch of what you want

#

if you want to inpaint a sword into the hand of a character, it's often easier to just copy&paste a sword into the hands and then inpaint on that
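
a toy sketch of that copy-paste-then-inpaint idea in plain Python: paste a patch onto the image, then build a mask slightly larger than the patch so the inpainting pass can blend the edges. 2D lists stand in for images, and `paste`, `box_mask` and the `grow` margin are assumptions for illustration, not any UI's API.

```python
def paste(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left)."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out

def box_mask(height, width, top, left, patch_h, patch_w, grow=1):
    """1 inside the pasted box grown by `grow` pixels, 0 elsewhere."""
    r0, r1 = max(0, top - grow), min(height, top + patch_h + grow)
    c0, c1 = max(0, left - grow), min(width, left + patch_w + grow)
    return [[1 if r0 <= r < r1 and c0 <= c < c1 else 0
             for c in range(width)] for r in range(height)]

canvas = [[0] * 6 for _ in range(5)]   # toy 5x6 "image"
sword = [[9, 9], [9, 9]]               # toy 2x2 pasted object
bashed = paste(canvas, sword, top=2, left=2)
mask = box_mask(5, 6, top=2, left=2, patch_h=2, patch_w=2, grow=1)
# `bashed` plus `mask` would then go to an inpainting pass (not shown).
print(bashed, mask)
```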

tame otter
#

that technique was used tons in the past, before diffusion. it's tried and true. good call. i've heard it called photobashing a lot

frigid pier
#

Hi guys, new here. 🙂

I've created a couple of images of people posing in different locations. The images otherwise look quite nice, but they lack detail and are somewhat soft. I've tried numerous ways to enhance the images: several upscalers, img2img to get more detail, ADetailer, LoRAs, ControlNet, combinations of these, different denoising strengths, sampling steps and methods, tweaked prompts, etc. Still the results are far from good. Any ideas what to try next?

I'm on a Mac Studio Max M1, A1111

dusky urchin
frigid pier
restive bridge
#

has anyone tried a batch size of 2 when training XL with dreambooth? the SECourses config is batch 1 but I can fit another one in 24gb. can't really tell what it's doing to quality since I haven't tested both on the same person yet.