#🔧|finetune
awesome
So I'm attempting to train a "LoRA model", which is basically a much, much smaller and hyper-specific model that adds a certain 'bias' when I run it in parallel with a different model for text2image or image2image generation
The process (as best I understand it) is to take a small subset of images specific to the concept you're trying to train. In my case, I'm using a base model meant for anime and illustrations to train this hyper-specific LoRA model.
The anime and illustrations model is a larger 3 GB model that has its own weights and biases, but I like that it generates illustrations and anime-esque artwork
So I'm going to create this dataset of this very specific facial expression and I'm going to provide this dataset to the base 3 GB anime model and say, "listen here Anime model, I know you think you know what a [insert facial expression here] is, but you don't. However, since I'm a nice person, I'm going to give you these examples so you can use these as inspiration the next time someone asks you to depict [insert facial expression here] in an illustration."
nice, how do you do that?
What the process ends up creating as an output is a very small model (200 MB or so) that I can then take to any text2image generation notebook and apply the biases that came from this "training" to the final output of the image generation. The final result should be something close to, if not exactly, what I trained it to make.
This is the colab notebook I'm using to do it: https://colab.research.google.com/github/Linaqruf/kohya-trainer/blob/main/kohya-LoRA-dreambooth.ipynb
Though if you have a GPU that has 12 GB or more of VRAM, you have more options that are arguably easier to use than a colab notebook without a polished GUI
And fwiw, the process under the hood is much more technical than my description of the events here, but it's the best spur-of-the-moment explanation I have for a 10-thousand foot view of the process.
I see, so there are ways like the one you linked of specifically training a baby model
Yep
understood 🙂
thanks a lot, this is helpful
I'm pretty worried that my art won't be enough, or that it'll fixate on the style or something, but I'll give it a shot
No problem. And my apologies if anyone comes along with more experience and proves me wrong about any one concept here. I JUST went through the same learning/exploration process you are doing and finally got a version of my LoRa model to work more often than it doesn't.
haha, nice
You'll have opportunities to provide it enough information to hedge against this
You'll see two keywords a lot in this process ->
concept name and class name
sometimes concept is called other things, but class name is almost always called class name
So for instance, if you wanted to train a LoRA to put Nicolas Cage's face on every male-leaning individual in a photo, your concept name is "Nicolas Cage" or "Nicolas Cage man"
and your class name is "man"
so that the token generated from this is "Nicolas Cage" (or the difference between the two names if you remove common denominators)
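For what it's worth, here's a rough sketch of how those two names can turn into a training folder name. The `<repeats>_<token> <class>` layout is my assumption about the kohya-style DreamBooth dataset convention, and `instance_token`, `folder_name`, and the repeat count of 10 are made up for illustration.

```python
def instance_token(concept_name: str, class_name: str) -> str:
    """Drop the class word(s) from the concept name to get the unique token."""
    class_words = {w.lower() for w in class_name.split()}
    kept = [w for w in concept_name.split() if w.lower() not in class_words]
    return " ".join(kept)


def folder_name(repeats: int, concept_name: str, class_name: str) -> str:
    """Build '<repeats>_<token> <class>' (assumed kohya-style layout)."""
    token = instance_token(concept_name, class_name)
    return f"{repeats}_{token} {class_name}"


print(instance_token("Nicolas Cage man", "man"))
print(folder_name(10, "Nicolas Cage man", "man"))
print(folder_name(10, "bird person", "person"))
```

Whether you write the concept as "Nicolas Cage" or "Nicolas Cage man", the token left after removing the class word comes out the same.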
wait so, what would it be for like... birdperson?
This can sometimes be the most simple part, and other times seems to be more complicated depending on your specific subject matter
There's no one right answer.. the best thing is to test multiple approaches, but I'd say: concept name = "bird person" and class name = "person"
well just to be concrete, my dudes are called "Usaq" so if I want to gen like "usaq men around a campfire" what are my concept and class names?
and in this case, would I then need to separately train usaq woman? or how does that work?
Since there's likely a dichotomy between usaq men and usaq women, my approach would be to train two different LoRAs using the same settings (but a different dataset specific to each sex of usaq person) and then merge them into one LoRA when you're done.. or you can merge them both back into the same base model and you'll now have a large custom model that just knows what usaq people are without a LoRA.
The downside there is less flexibility to apply the LoRA to other larger models and just the inherent large file size of said custom model.
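To make the "merge them back into the base model" idea a bit more concrete, here's a toy sketch of the general LoRA math on a single 2x2 weight: each LoRA stores two small matrices, their product (scaled by alpha / rank) is a delta, and merging just adds the deltas onto the base weight. The function names and all numbers are invented for illustration; real tools do this per layer on tensors, and I'm not claiming this is exactly what any particular merge script does.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(up, down, alpha):
    """Weight change contributed by one LoRA: (alpha / rank) * up @ down."""
    rank = len(down)  # down is rank x d_in, up is d_out x rank
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(up, down)]


def merge_into_base(weight, deltas, strengths):
    """Return a new weight with every LoRA delta baked in at its strength."""
    merged = [row[:] for row in weight]
    for delta, strength in zip(deltas, strengths):
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += strength * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for one base-model layer
man = lora_delta(up=[[1.0], [0.0]], down=[[0.0, 2.0]], alpha=1.0)
woman = lora_delta(up=[[0.0], [1.0]], down=[[4.0, 0.0]], alpha=1.0)

merged = merge_into_base(base, [man, woman], strengths=[1.0, 0.5])
print(merged)
```

Once the deltas are baked in, the result is a full-size weight again, which is why the merged custom model is as large as the base model.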
yeah, interesting
You'd end up with 2 LoRAs in the intermediate: one with concept "usaq male" or "usaq man", and the other with concept "usaq woman" or "usaq female". Their respective class names would be "male" or "man" and "woman" or "female".
Your options are numerous to be honest. This was just the most straightforward approach that made sense to me while I was wrapping my head around this process.
What’s the best model to generate attractive faces?
My personal opinion is that the prompt is more important, considering there are lots of models available that can give you photorealism.
attractiveness is also subjective, so the prompt can be steered toward what you find most attractive while using photorealism to render it
oh im dumb, by base model i kept thinking about a community one i downloaded, but people usually use 1.5/2.x. Yeah I'm fine doing that
Hey, dumbass question: I'm attempting LoRA training, and the LoRA .pt file it generates doesn't do anything at all. It also generates a massive 7GB checkpoint model which works, but I want to use the LoRA in other models, not a checkpoint...! Is there some magical obvious step I'm missing?
anyone have any advice?
anyone else's loss chart look like this?
usually i see it look more like a hockey stick. not a cardiogram
you usually have to merge them i think
merge which? the lora file and the checkpoint you just made generating a lora?
yeah... you using dreambooth?
yes
but all i want is a small lora file to use with other models
i guess i'm unclear why you go to train a lora and it generates a lora that doesnt work and a giant 7GB file you didn't ask for hahas
so whats the merge procedure or is there a good tutorial?
Sounds like you've generated a DreamBooth model instead of the smaller LoRA model 🤔 just a guess though. I use Kohya ss, and at first I made the mistake of going for the DreamBooth model instead of a regular LoRA.
I have never used lora.pt so I don't really know, but judging by the size it sounds like either a DreamBooth model, or that you've set the network rank waaay too high
its generating both, i made the rank higher and it generates a 300MB lora PT file and also the huge checkpoint, but the lora file doesnt do anything
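On why a higher rank makes the LoRA file bigger: for a single linear layer, a LoRA stores two small matrices instead of the full weight, so its size grows linearly with rank. A quick back-of-the-envelope sketch, where the 768-to-320 layer size is just an illustrative choice and the function names are made up:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the down (rank x d_in) and up (d_out x rank) matrices."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the full d_out x d_in weight matrix."""
    return d_in * d_out


for rank in (4, 32, 128):
    print(rank, lora_param_count(768, 320, rank), full_param_count(768, 320))
```

So doubling the rank roughly doubles the LoRA file, but that by itself says nothing about whether the file actually gets applied at generation time.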
is kohya something you install within Automatic 1111? i see people talking about it but i still dont know what exactly it even is
no, kohya is a separate application with a UI similar to Automatic 1111's
The Easy Starter Guide to Installing LORA on Automatic 1111 for Stable Diffusion. Follow my super easy Lora setup guide and learn how to train your Lora files for super-high quality portraits. Use Realistic Vision V1.3 as the base model for extremely detailed and realistic results. Get better portraits with Lora, the super fast training tool tha...
Here's a good installation guide
is lora training supposed to slow down as it nears the end 😅
over 2 hours it went from 2.5 it/s to 3.5
@arctic jasper you have to select both the lora and the new model, and "create new ckpt"
then you can use that new ckpt for regular inversion and stuff
no idea how its supposed to work as an actual lora
ppl be usin the kohya plugin for that
Hey, I tried full fine-tuning of the SD text2img model and am now trying to convert the model into an inpainting model. I got this error: "AssertionError: Bad dimensions for merged layer model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight: A=torch.Size([320, 1024]), B=torch.Size([320, 768])". Does anyone have an idea how to solve this?
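On that bad dimensions error: the two shapes differ only in the second number, 1024 vs 768. As far as I know that is the width of the text-encoder embedding the cross-attention layers read from (768 for SD 1.x checkpoints, 1024 for SD 2.x ones), so my guess is the merge is mixing a 1.x-based model with a 2.x-based one. A small sketch for spotting this before merging, working on plain `{name: shape}` dicts; `find_shape_mismatches` is a made-up helper, and how you extract the shapes from your checkpoints is up to you.

```python
def find_shape_mismatches(shapes_a, shapes_b):
    """Return {key: (shape_a, shape_b)} for shared keys whose shapes differ."""
    shared = shapes_a.keys() & shapes_b.keys()
    return {k: (shapes_a[k], shapes_b[k]) for k in sorted(shared)
            if shapes_a[k] != shapes_b[k]}


KEY = "model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight"

# Shapes copied from the error message above; the second entry is invented.
model_a = {KEY: (320, 1024), "some.other.weight": (320, 320)}
model_b = {KEY: (320, 768), "some.other.weight": (320, 320)}

mismatches = find_shape_mismatches(model_a, model_b)
for name, (a, b) in mismatches.items():
    print(f"{name}: A={a} B={b}")
```

If the only mismatches are on `attn2` key/value projections like this one, the two checkpoints most likely come from different base versions and can't be merged weight by weight.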
so the create lora option in dreambooth doesn't create a lora. Sounds extremely useful and not confusing at all.
yeah im having a hard time myself
i think it's because we should be using kohya?
a question for anyone:
im installing kohya (turns out i didnt have it) and its asking for a dynamo backend during installation
eager
aot_eager
inductor
nvfuser
aot_nvfuser
aot_cudagraphs
ofi
fx2trt
nnxrt
ipex
any idea which I should select?
🤔 reinstalled it yesterday and did not get that option. No idea, sorry.
im getting this error when running kohya_ss
ModuleNotFoundError: No module named 'torch._dynamo'
Traceback (most recent call last):
File "C:\Users\kohya_ss\train_network.py", line 507, in <module>
train(args)
File "C:\Users\kohya_ss\train_network.py", line 176, in train
unet, text_encoder, network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "C:\Users\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 876, in prepare
result = tuple(
File "C:\Users\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 877, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "C:\Users\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 741, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "C:\Users\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 914, in prepare_model
import torch._dynamo as dynamo
^^^this is where it breaks
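A quick way to narrow that one down: as far as I know `torch._dynamo` only exists in PyTorch 2.0 and later, so a likely cause is that the venv has an older torch than the installed accelerate expects. This small check (the `has_module` helper is made up for illustration) tells you whether the module can be found at all; run it with the Python from the kohya_ss venv.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if Python can locate the module `name`."""
    try:
        # Locating a submodule imports its parent package first.
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when the parent package (here: torch itself) is missing.
        return False


if __name__ == "__main__":
    print("torch found:         ", has_module("torch"))
    print("torch._dynamo found: ", has_module("torch._dynamo"))
```

If torch is found but `torch._dynamo` is not, the torch and accelerate versions in that venv probably don't match.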
Do you guys recommend any settings for basic 15-image lora training? Am I supposed to mess with epoch or keep it at 1?
My first lora model didn't turn out so well
I have settings for 30 img, it's for Linaqruf's colab but I think it can apply to local too:
https://www.buymeacoffee.com/alelele/my-new-settings
Does anyone have a working danbooru scraper for specific searches/tags? I found one, but ran into problems with Cloudflare.
whats the current state of Lora? Is dreambooth still the method of choice for fine tuning faces?
LoRA stays a little under DB in my opinion, quality-wise, but it has lots of other merits that make it about as popular right now
- very light to share
- faster to train
- you can now compose a single picture using multiple LoRA
Ok so I just tried training a face in 2.1 and damn, I’m not getting what I need. Not even close. My 1.5 settings are not fit for this. I suspect it has to do with the classification images
Where “photo of man” in 2.1 spits out black and white actual photos of man
Or maybe I have to play with the lr a bit more. Anyone have any settings they’d like share?
I've been out of the dreambooth/lora loop for a short while (although in the SD community, a short while might as well be an eternity) and I'm a bit confused about the prevailing wisdom when it comes to captioning your input files; it used to be that adding complex descriptive captions was not preferable (so if you wanted to train a person or concept, you only used your custom tag) but recently I've been seeing a lot of tutorials and tools incorporate the idea that captioning your images is actually essential. Is this different for LoRAs than it was for DreamBooth? ...or have the auto-captioning tools just gotten better to the point where captioning every individual image actually became a viable option?
I can't speak to the history but every guide I've read puts emphasis on tagging your dataset properly. Might have been different in the past but then I wonder why we go through the trouble with the text encoder to give control over what's actually being diffused? I suppose you could train a lora with nothing but an identifier, but if I had to guess, then I'd assume that all of the information seen in the images is hardcoded to that identifier, so you essentially lose all control for prompting against a style of clothes, backgrounds, hair color, anything, because it only has a very narrow concept of what's been shown and it recognizes that concept only in that narrow view.
Granted, I haven't been active in this space for very long either, but from everything I've gathered the rule of thumb was this: tag anime models based on nai/anything with booru-style tags separated by commas, and realistic finetuning based on sd 1.5 should use proper, descriptive, short phrases for anything in the image that isn't intrinsic to a character. For example, if a character always has cat ears, it's not necessary to mention that in any of the captions, unless you plan on prompting against it at some point to remove them. Similarly, everything you describe can be exchanged by other vectors in the future. What I mean is that a prompt of "sunny park" can easily be exchanged for a futuristic landscape while invoking the LoRA - if captioning has been done properly. If not, then I'd imagine it being either very difficult if not impossible to exchange certain information, or you'll get some really terrible blends because it tries to squeeze in the information of a sunny park into any of your prompts, regardless of what you type. I haven't actually ever tested training without captions, so take anything I say with a grain of salt. I might be completely wrong in my understanding.
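As a tiny illustration of that rule of thumb, here's a sketch that builds a booru-style caption while leaving out whatever you consider intrinsic to the character. The trigger word `mychar`, the tag list, and `build_caption` itself are all invented for the example; it's not taken from any tagging tool.

```python
def build_caption(trigger, tags, intrinsic):
    """Join trigger + tags with commas, skipping tags intrinsic to the character."""
    skip = {t.lower() for t in intrinsic}
    kept = [t for t in tags if t.lower() not in skip]
    return ", ".join([trigger] + kept)


tags = ["cat ears", "sunny park", "red dress", "smiling"]

# Cat ears are always part of this character, so they stay out of the caption.
caption = build_caption("mychar", tags, intrinsic=["cat ears"])
print(caption)

# If you want to be able to prompt the ears away later, caption them instead.
print(build_caption("mychar", tags, intrinsic=[]))
```

Everything that does end up in the caption ("sunny park", "red dress") is what you'd expect to be able to swap out at prompt time.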
Ah, the auto tagger tools certainly do a ton of the heavy lifting, but I haven't worked on a case where I didn't have to go in manually and dispose of any unwanted clutter. In other words, working with large data sets will still require a fair share of manual intervention. There are also people who don't touch the auto tags at all and get very favorable results, but those are cases working with anime models with which I haven't spent much time yet.
Hi everyone. I'm trying to learn more about using textual inversion. I have successfully trained a few embeddings on my own drawing style. They don't look that much like my drawings, but I like them a lot anyway. Something I'd like to be able to do though is add in other text like 'photorealistc 3D render' (which is very, very far from what my drawings are). But when I do this the result is like if you just made my style very bland. I think what is going on is that it is simply shifting my embedding towards 'photorealistc 3D render', which is losing the essence of how it looks (and not gaining any 3d render lookingness)
I watched a youtube video (How To Do Stable Diffusion Textual Inversion) where the uploader said that if an embedding is overfitted it loses 'flexibility'. I used his suggested tool, the 'embedding inspector', and used only embeddings below 0.2 strength (above which apparently is where they lose flexibility), but it didn't help.
I did also use prompts in the generation, like "experimental drawing by [name]", which didn't really help either.
Any ideas?
Do you mean like scraping for images based on a specific (tag) query?
Yes
Hi everyone, we're a games studio working on a title that requires quite a bit of social assets, usually using the same characters, in different poses. Here's a reference of what we'd like to achieve with training SD. We have the 3d models to generate the training materials, but I am very curious to hear what training strategies you'd suggest to achieve accurate variations on these characters.
I was thinking on training using Textual Inversion and controlling with skeletons or sketch in ControlNet.
What are your thoughts? Thanks!!
I use Hydrus (https://hydrusnetwork.github.io/hydrus/introduction.html), but it might be a bit much depending on your requirements/use case
Hydrus should have some basic CF workarounds, but IIRC, ensuring you have an API key set should also help some
Imgbrd-Grabber https://www.bionus.org/imgbrd-grabber/docs/ is another option, and I believe the learning curve is lower, but a bit more limited in what it can do (though I could be wrong, I'm more familiar with Hydrus in general)
Thank you!
Can anybody tell me, are rare tokens for the 2.0 and 2.1 models the same as for the 1.5 model?
When training DreamBooth on the 2.1 768 model, can I use the same token from the "rare token list" that I am using for the 1.5 model?
Have you tried running the rare token generator on a 2.x model?
Hey !
Just a first idea with what you gave here, here is how I would go about it :
1/ train a dreambooth on each of the characters you want here.
2/ train a TI and/or a hypernetwork on getting the neon/plastic aesthetic down, while still refining your characters
3/ use controlnet for the poses, good prompt for the rest, photoshop for the texts.
yep, doubly training the same character in, so you don't lose quality on him. That would be what I would try, but mostly because you are going after very few shots of very high quality
Hi guys, I'm having a problem with my stable diffusion. After spending a lot of time installing various models and making my tool robust, an error has occurred that I can't solve. Can someone help me with this?
I would love to receive some suggestions on what I can do to solve the problem. If possible, I have attached some screenshots to show what's happening.
Thank you in advance for your help!
What is the rare token generator? Is it an extension to the auto1111 webui?
Anyone have experience with large training sets for text-to-image fine-tuning? I have 122,000 images (I can easily increase to as many as I need) with very high quality text descriptions ranging from 35 to 77 tokens. I'm going to be training on an A100 40GB (could move to 80GB if needed). I'm trying to figure out max training steps & batch size. Would love to learn from other's experience here..
is it possible to train anything on 12GB vram? is it just too slow or you wont even be able to run the process?
Pretty much either
It's part of the original Dreambooth codebase. Unfortunately, almost no one uses it despite it being quite important to training.
https://github.com/Victarry/stable-dreambooth/blob/main/generate_identifier.py
Everyone just uses SKS, but the Dreambooth paper makes a specific point to mention that you shouldn't just guess at random tokens, but specifically extract one from the model.
Which clip model is used for stable diffusion v2/v2-1? From the readme file in github, the laion/CLIP-ViT-H-14-laion2B-s32B-b79K model is used in stable diffusion v2/v2-1. I tested the laion/CLIP-ViT-H-14-laion2B-s32B-b79K model, but the result and parameters are not the same. I'm confused. Does anyone have the same problem? HELP
What is a good base model to train faces on? or does everyone just use SD1.5?
I find 1.5 easier to train than 2.1, but given how good 2.1 is with photorealism out of the box, I would think it can do better as a base for training faces yes
But a LoRA trained on 2.1 can't be used on 1.5 models correct?
I'm not 100% sure but i think you are right
I do dreambooth mainly, not Lora, so not sure
That would slim my use of it down to less than 1% xD
Also all I have online are tutorials on training Anime LoRA, are there any guidelines on what settings to use for photoreal ones? Learning rates and schedulers? etc?
Thanks a lot. That's super useful!
@summer stag
Currently there is no bot on the server that generates images. However, there are plenty of other ways such as the official https://beta.dreamstudio.ai/ website or running Stable Diffusion locally using your own system resources! Check out #1080946152318443610 for more details! You can also stop by #1025467151206854736 for any issues you experience while using the website or #🤝|tech-support for any problems you encounter while installing it locally!
First time starting webui-user.bat each day, will take a very long time, why?
It will sit at Commit hash: xxxxxxxxxxxxxx for a very long time, and then move on.
half a mill down, a million to go!
Ok I've trained 3 Lora on a real person today, they are all shit. I've trained plenty of great anime lora but somehow I cannot manage a realistic one?
Are there any writeups on the right settings and captioning for realistic LoRA? I watched two youtube videos, one from a beardy bald dude and one from some old dude and both were worthless to my efforts.
I managed to get the body relatively right, but the face seems just completely wrong, is there any point in adding a concept folder with a bunch of photos of just the face?
Quick question, when training a textual inversion, lora, etc for a face, does the size of the face matter? Eg I have 10 closeup photos, will it be able to incorporate those into scenes or only be useful for full-sized faces (eg inpainting)?
does anyone have a good suggestion for a workflow to improve faces while upscaling?
Hi guys, I'd like to plot the strength of the LoRA in Automatic1111. What X/Y/Z settings should I use for it? thx!
Maybe lora
It will still be usable but not as good as if you included 3-5 waist up pictures. It will basically make someone vaguely resembling them if not a closeup...but you will then be able to take that to inpaint and touch up the face with "original area/masked only" and it will work. If you want it to gen nice waist up or full body of your person you need to include that in the training...i like to think of it as... stable diffusion doesnt know your friend has legs unless you show it that they have legs...make sense?
I should mention that with just face pictures you'll need to drop the LoRA weight enough that it actually lets you generate somebody with a body. Then that is the image you're going to inpaint and patch up
Use Prompt S/R. In the prompt, put the LoRA tag, e.g. <lora:blahblah:1>, then replace the 1 with the word STRENGTH (or any word). In the Prompt S/R values, write the word first, then a comma, then the weights. So for that word: STRENGTH,.5,.6,.9,1
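If it helps to see what that search/replace does to the prompt, here's a minimal sketch of the substitution step only. `prompt_sr` is a made-up function, not the web UI's code, and my understanding is that the web UI also renders the unmodified prompt for the first entry, which isn't modelled here.

```python
def prompt_sr(prompt, sr_values):
    """Substitute the search word (first value) with each remaining value."""
    values = [v.strip() for v in sr_values.split(",")]
    search, replacements = values[0], values[1:]
    if search not in prompt:
        raise ValueError(f"{search!r} does not appear in the prompt")
    return [prompt.replace(search, r) for r in replacements]


prompts = prompt_sr("photo of a man <lora:blahblah:STRENGTH>", "STRENGTH,.5,.6,.9,1")
for p in prompts:
    print(p)
```

Each printed prompt corresponds to one cell of the plot, with the same LoRA at a different weight.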
How to do transfer training on a SD model? Preferably on Google Colab
How to merge models without quality loss on Google Colab?
My issue with that was the training was incorporating the backgrounds used in full body. Perhaps i should just edit out the backgrounds.
Or very strongly skewing towards the outfit. I thought maybe i could just start the TI with something generic like “40 year old caucasian man” and only train the face.
I just remove the backgrounds and replace it with solid white... Then I caption solid white background...
If you remove the background on all your data make sure that you switch up the colors
Do you caption the outfit? Maybe i should use something really generic “black shirt and jeans”. Im making a TI of myself so i can pretty much take whatever photos to replace the ones i just pulled from my profile to start with.
Go to the model that you're trying to train on... and make a prompt using the description that you're captioning with
If the model you're using doesn't know what a t-shirt is it doesn't matter if you put it in the captions it's going to attach it to your character
I’ve been training on top of 1.5
And be careful with things like jackets that have patches
Because you might think that just putting leather jacket is enough next thing you know you're generating a character that has leather patches on a regular shirt
Yeah that’s why im thinking a plain black or white shirt with jeans would be the most generic possible outfit
Are you training someone that's real someone that you know?
Myself
With the goal of being able to make creative profile pics, etc
“JaRail riding a motorcycle” “JaRail winning bodybuilding competition” 🤣
Then you might try my method... 12 pictures of your face, 5 pictures of your torso and head, one picture full body. Solid white T-shirt and blue jeans. Or if you're comfortable enough shirtless and tighty whities (captioned boxer briefs)... Standing in front of a solid wall with nothing else in the picture.
Yeah i can def do that
Your caption: jarail standing, nude, shirtless, nipples, in front of a solid white wall wearing white boxer briefs
Do you just do front on faces with different expressions?
And yes, you have to put nipples. Leaving that out is the reason nipples show up through clothing on a lot of people's models
For some reason stable diffusion is very well trained on nipples and shirtless being two separate things
Absolutely no expressions just poker face
It will still do expressions; only include them if you want it to learn what your smile looks like
I've had no problems prompting for angry, smiling, pouty lips on a LoRA trained purely on poker face
I was thinking it could learn my teeth, as they aren’t perfect. But I wouldn’t mind perfect teeth either haha
Cool thanks for all the advice. Ill let you know how it goes.
Just for a show of good faith that I'm not blindly leading you, i did this last night of my partner
Using the exact method that I'm telling you
LR .000001 ...86 repeats..net rank 128...
I’ve been doing that and seems to work alright
What did you use for starting tokens and number of vectors?
I use kohya. The instance was just their name... with an underscore between first and last
I mean when you create a TI in automatic, you can “seed” it with some tokens “40 year old man” before training
Or start zeroed, or random
Regularization images... I didn't use them
I could, but for images like I'm doing here I don't think it mattered
This run came out perfect on the first try though I do normally prefer them to be a little teeny bit over fit
But the reg images just keeps it from putting the person's face on everybody in the image
Let me find what i mean one sec
Okay it calls it "initialization text"
and then vectors per token
Those are the two I was asking about.
I don't care about keeping the TI small. I'm rarely hitting vector limits.
Ah.. gotcha, yeah.. I do the training with LoRA/LoCon... If what you're after is what I just did there, where I took somebody, trained them in, and can put them in cool ass pictures, maybe you should consider doing LoRA training instead of embeddings
But I was worried having a large number of vectors was contributing to it picking up backgrounds, etc
Okay. Yeah, I'm fine trying out Lora, dreambooth, etc.
Yeah do a lora man, its far easier, faster, and you see its the kind of results it sounds like you were trying to get
Yeah, I should have checked what you were using before asking. My bad.
I can put him in any scenario, doing anything I went through a shit ton of genres last night
As you can see up there I went through Cthulhu Superman cyberpunk steampunk
It had no problems and that's exactly what he looks like
Your results are pretty amazing. I'm really impressed.
I think my workflows will be a lot more controlled, lots of inpainting to get the scenes I want. But just throwing in a prompt and getting those results is a great start. I love the art styles.
Yeah these were just right out of the gens
I didn't modify them
I probably could have cleaned up some of the sketchiness that I didn't like
You're lucky to have a partner with those abs. 
But I was just testing to see if it trained well I wasn't actually trying to make art with it yet
😆 The top half lol
haha
Alright, I think this is going to be fun.
Will let you know in a day or two how it goes.
And a lot of people say train on the model you're going to use it on and yes it does produce better results for that model
But I like to switch the models constantly cuz all of my models do completely different things and training on 1.5 as long as the mix has 1.5 in it it will generally work well
And definitely well enough to tell who it is
Cya! Let me know how it worked
More than 16~20 ish pictures better or worse for realistic person LoRA training? What about variety? Make up variation, hair variation? Poses? Face angle?
I've been doing a lot of TI of people and artstyles to insert into the model before full finetuning, to try to contain as much of the changes from each concept in just the specific tokens and not damage the model so much as it initially has to lock onto the concept.
A technique I've come up with is using celebrity face match sites to find existing faces which SD might know, figure out how many vectors their name encodes to (and ideally see if their first name or last name gives good approximate results), and sometimes blending multiple names together using the embedding inspector extension for a1111 webui. After that it can be just a few dozen steps to start to get a decent likeness with textual inversion and a high LR like 0.001 - 0.0005, though you have to be careful because it quickly overshoots and begins to get worse
Starting with a known person helps capture all the relevant info which describes them in minimal tokens, and seems to help SD understand the concept being worked with
hmmm that'd be interesting.
I get a lot of "you look like David Duchovny" comments
So I could prolly use that.
I've basically just been spending the last day fixing a mess of package requirements. It sucks having automatic1111 be MIA without anyone merging important fixes into webui.
I've been trying to work out which is the best method for style extraction, and so far it seems like the conclusion is LoRA due to its modularity -
Is this correct, and if so, is there a maximum number of images to train a LoRA on? I often see people using a small number like 20, but I have about 150 I'm slowly captioning, and will cull it back if there is an upper limit where results begin to degrade.
I've searched fairly extensively around upper limit, but so far I haven't been able to find any info on this, so hoping someone that's experimented with smaller and larger datasets might have some insight!
When merging two Dreambooth models can I then use both of the "trigger words" for the separate styles/objects?
I'm also trying to find an answer for this same purpose. Would love to know if you find an answer. I'm currently painstakingly captioning a list of 300 😬
After a bit more research the closest (low confidence) guidance I've found is:
Minimum 10 per concept, but much more is much better
Which I infer means "at least 100" to put a 10x qualifier on "much more"
I'm at ~50/200 and already feeling the pain - luckily modafinil exists 😑 💊
(fwiw this is the guide I'm following through for image preparation in case it helps someone: https://youtu.be/7m522D01mh0?t=671)
That video was intensely helpful. Great find and thank you for sharing
I'm testing with the modified kohya-LoRA-dreambooth.ipynb colab he mentioned at the end on SD2.1 @ 768 and seems to be running fine so far -
edit: spoke too soon lol - starting to degrade ~ epoch 18/20
if I understand the process, stripping it back to the "best" epoch is the next step, though
Captioning Tutorial
[https://www.reddit.com/r/StableDiffusion/comments/118spz6/captioning_datasets_for_training_purposes/]
interestingly after building the lora, I keep getting this long Failed to match keys when loading Lora '{path}/{name}-000001.safetensors' error each iteration.
fwiw the error array starts with ['lora_te_text_model_encoder_layers_0_mlp_fc1.alpha', ...]
I note on the github repo there was some relation to the locon extension, but I specifically didn't use this during training, and after following that fix the error still occurs. [src: https://github.com/bmaltais/kohya_ss/issues/310 ]
Additional unresolved thread at [ https://www.reddit.com/r/StableDiffusion/comments/11mkyko/failed_to_match_keys_when_loading_lora_error_fix/ ]
I'm trying to work out if there was some value I set in training that somehow has it out of sync with v2-1-768-ema-pruned.safetensors 🤔
weirdly, as in the thread referenced, it still kind of works though - so not sure if this is more of a warn than an error
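One way to dig into a "Failed to match keys" message is to list the tensor names stored in the LoRA file and see which kinds the loader is complaining about. The sketch below reads only the JSON header of a .safetensors file (8 bytes of little-endian header length, then the header), so it needs no extra packages. The helper names are made up, and the demo runs on a tiny fake file rather than a real LoRA.

```python
import json
import os
import struct
import tempfile


def read_safetensors_keys(path):
    """List tensor names from a .safetensors header without loading any weights."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(k for k in header if k != "__metadata__")


def count_by_suffix(keys):
    """Count keys by their last dotted component, e.g. 'alpha' or 'weight'."""
    counts = {}
    for key in keys:
        suffix = key.rsplit(".", 1)[-1]
        counts[suffix] = counts.get(suffix, 0) + 1
    return counts


# Demo on a tiny fake file so the sketch runs anywhere.
fake_header = {
    "__metadata__": {"note": "fake file for illustration"},
    "lora_te_text_model_encoder_layers_0_mlp_fc1.alpha": {
        "dtype": "F16", "shape": [], "data_offsets": [0, 2]},
}
payload = json.dumps(fake_header).encode("utf-8")
demo_path = os.path.join(tempfile.mkdtemp(), "fake_lora.safetensors")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<Q", len(payload)) + payload + b"\x00\x00")

keys = read_safetensors_keys(demo_path)
print(keys)
print(count_by_suffix(keys))
```

Pointing `read_safetensors_keys` at the real file and comparing the names against the ones in the error array should at least show whether the unmatched keys all belong to one group, such as the text-encoder `.alpha` entries.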
How should I read these graphs?
Also, if you have 10 times more training images, should you train with 10 times fewer iterations per image?
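On the arithmetic side of that question, a rough sketch: if a trainer counts steps as images × repeats per epoch, divided by batch size, then 10 times more images with 10 times fewer repeats gives the same total step count. That step formula is my assumption about how kohya-style trainers count, the numbers are invented, and whether keeping total steps constant is actually the right target is a separate judgment call.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size=1):
    """Optimizer steps, assuming each image is seen `repeats` times per epoch."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


small = total_steps(num_images=15, repeats=100, epochs=1)
large = total_steps(num_images=150, repeats=10, epochs=1)
print(small, large)
```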
Hey guys, anyone know how to start training in dreambooth without it generating class images? I'd like it to use the images already present in the folder, which it does, but then it still it generates another 30 or so extra images even though there are more than enough images present! How do I skip that step, there doesn't appear to be a setting anywhere...
I'm pretty clueless on this, but there's a slider somewhere under concepts which says how many class images to generate per training image.
Yes I know, unless I set it to zero it always generates some number of class images. Thanks though! 👍
has anyone found a fiverr or similar supplier that reliably captions/tags image sets well?
Hey guys, could anyone tell me good settings for Kohya's Fast LoRA?
I have some settings that work very well, but it feels like there is much to improve.
Any ideas to improve them ?
Data Set Repeats was : 60
Man, that would be awesome if so. They'd need to be very knowledgeable at the proper way of doing it, but, man, yeah. Good thought
Those network dim and alpha values seem low
Also CLIP skip should be 1 (not 2) for a SD1.5 model which it seems like you're training
Can someone around here who is familiar with training 2.1 and 1.X help a brother out? I'm desperate at this point. I can't manage to find params that train as well on 2.1 as what I get on 1.X base... How many repeats do you go per picture for a style, for example? Do you play with the learning rate? I'm still on 1e-6 or 2e-6, not sure if I should change; the tests I did didn't work well on that...
I'm just so lost on this, I want to get it right, but not sure what i'm messing up
or maybe it's a prompting thing... I don't know.
With the same dataset and caption, I get some good checkpoints on 1.X, but on 2.1, I get to the overtrained phase before any good checkpoint shows
All I know is that 2.1 trains a lot faster so you don't need as high of a step count. I am more embedding specific, so I can't really help with model params yet, but if you are ever interested in learning about embeddings I can help.
How do I merge 100's of models together without quality loss.
I would also want to know how to transfer train.
i am looking to train a custom model that can replace a shirt on a person with a specific shirt. does anyone do consulting here that can help me?
(cut and pasted from the other channel)
well, then you train this as you would train a face. let me detail :
1/ you select 5 to 15 pictures of your piece of clothing, in the most varied scenes possible (or at least in the types of variations you want it to be able to do) : lights, person wearing it, ... Some close and some far pictures.
2/ train like a subject, training all those pics only on your chosen token, nothing more fancy.
3/ use inpaint, mask the piece of clothing to replace, prompt with your new token, using your new model
thank you
do the image names matter? malemodel-onboat-wearing-sku123 for example
and then concepts to train with it - i wasn't sure of the format
but the idea is to be able to type "man on boat wearing sku124" and it would replace sku123 with sku124 and give me all variations
with that specific shirt on it
also would like to set up a pipeline to continuously train as new images come in
I'd use something like bullmq for queue management then have it dispatch jobs to some worker containers that have a GPU
Anyone come across a good explanation of LoHa vs LoRA?
I started working on a matrix in Sheets to make sure the captions they write are consistent with the way I personally prompt, I've got two jobs out for a test batch of 50 each, so we'll see if it works or not:
Dude! This is awesome. Excited to hear your results
Which is the section to find embeddings?
Hi guys
I'm having an issue training a model of my face in TLB's dreambooth colab.
I got great results training on Runway SD 1.5, but when I tried training on Deliberate I barely get decent results...
Do I need more steps or a bigger dataset for previously trained models?
Not sure if this is the right chat to ask this, but is it possible to get good results from a LORA trained on 3D game models for characters that don't have artwork made of them? (Such as custom characters from video games)
Any thoughts on the improvements ChatGPT 4 modified here?
(reference)
Some modification to the prompt after a few tests:
I want you to act as a Stable Diffusion AI expert specialising in Dreambooth training. I will give you a series of captions used to describe an image, and you will improve them for use as Dreambooth training captions. Do not use any full stops or punctuation within each caption. Captions should be comma separated. All output should be in lowercase. You will also correct spelling mistakes in US English. Write "..." if you understand. Do not write anything yet.
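The mechanical parts of that prompt (lowercase, comma separated, no punctuation inside a caption) can also be applied without a language model. A minimal sketch, assuming captions are plain strings; it does not attempt the spelling correction, and the sample caption is made up.

```python
import re

def normalize_caption(raw):
    """Lowercase a caption and return comma-separated, punctuation-free tags."""
    tags = []
    for part in raw.lower().split(","):
        # Remove anything that is not a word character or whitespace.
        cleaned = re.sub(r"[^\w\s]", "", part)
        # Collapse repeated whitespace and trim the ends.
        cleaned = " ".join(cleaned.split())
        if cleaned:
            tags.append(cleaned)
    return ", ".join(tags)

print(normalize_caption("A Photo of a man.,  Wearing a RED shirt!, , on a boat"))
# -> a photo of a man, wearing a red shirt, on a boat
```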
First off, that image ... 🤌😙
Secondly, I'm terribly excited to try out GPT4. I was never able to get great results with ChatGPT in my previous trials, but something tells me I was overcomplicating it by writing a dissertation trying to teach it everything about image prompting. My goal was always that I could simply give it a single subject, just one or two words, and that it would come up with the rest and not have it be nonsense. What usually ended up happening was it would fall apart over time, becoming less and less prompty and more like it was just writing a short story.
Was that using ChatGPT Plus?
Greetings everyone, I'm a beginner and I am facing a challenge while attempting to train a Stable Diffusion model with DreamBooth for a specific product, an E-bike. I’m using a 3D model to generate numerous reference images from various perspectives to ensure adequate training data. However, despite my efforts, the trained model is generating weird results for the E-bike (see examples).
Unfortunately, I am unable to find any resources or guidance on how to train models effectively for detailed products like the E-bike. Does anyone have any tips or tricks?
Specs DreamBooth:
Training image count: 50
Reference image size: 512x512
U-net Training steps: 1500
Learning rate: 2e-6
these are the reference images
This is the output
fwiw, this was the best of the three Fiverr guys I tried by a large margin - the quality of his captions was much better than the others I tried:
https://business.fiverr.com/freelancers/ratulrafsan
he also does model training for like $25
Thank you @gloomy sierra, I really appreciate the shout-out!
What's your guys' workflow for captioning large datasets?
Do you use a specific app or interface?
I'm brute forcing this like a pleb opening each file individually in a text editor and a Windows image previewer and ... maaan, it's a slog
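One small thing that can take some of the slog out of it: a sketch that lists which images in a folder still have no caption file, assuming the common convention of a `.txt` file with the same name next to each image. The extension list and the function name are just for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the dataset uses.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def find_uncaptioned(folder):
    """Return names of image files in `folder` without a same-named .txt caption."""
    missing = []
    for path in sorted(Path(folder).iterdir()):
        is_image = path.suffix.lower() in IMAGE_EXTS
        if is_image and not path.with_suffix(".txt").exists():
            missing.append(path.name)
    return missing
```

Calling `find_uncaptioned("path/to/dataset")` returns the list of files still to do, so each session can pick up where the last one stopped.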
Any way of getting the seed from a sample image generated by dreambooth? Has a txt file with just the prompt.
LoRas are now supported ✨
Has anyone used Dreambooth to generate a specific sci-fi species/race? If so, did you do something different to how you'd go about doing it for a character or an art-style wrt. labeling, settings, class/instance?
I've just started using DB since I was short on VRAM before, and the results I'm getting so far are overfitted.
I have trained some original things, like Sid, or TMNT, but not exactly that.
You have 2 ways to approach this:
1/ you can go the "training a subject" way. you have 1 alien, and you take around 5 to 15 pics in diversified environment/light/general setting, but same alien. You train on 100 repeats at 2e-6 LR if doing it on 1.5 base. Best to start making a checkpoint at 80 repeats, and add more if undertrained.
2/ you can go the "training a style" way. You have lots of different aliens from the same race, 50 to 200 pics presenting the most variety possible in alien features/environment/light/general. You train on 100 to 200 repeats at 2e-6 LR for 1.5 base model. I usually do 4 or 5 checkpoints in between those 2 and compare the quality of the results
Sorry I have only the 1.5 measures pinned down in those scenarios. I can't find good values for 2.X on that.
if going the 1st way, you would caption your dataset using always only 1 same token that would represent that specific alien.
if going the second way, you could add to that token some specificities that you have named for your alien species, like give a token to represent tentacles, or other features. This will train on multiple tokens and let you call on only the features you'd like. This will require a better dataset than the first though
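To make the "4 or 5 checkpoints between 100 and 200 repeats" idea concrete, a tiny sketch that spaces the checkpoints evenly. Even spacing is an assumption here, not something stated above, and the function name is hypothetical.

```python
def checkpoint_repeats(start, end, count):
    """Return `count` evenly spaced repeat counts from `start` to `end` inclusive."""
    if count < 2:
        return [end]
    step = (end - start) / (count - 1)
    return [round(start + i * step) for i in range(count)]

print(checkpoint_repeats(100, 200, 5))
# -> [100, 125, 150, 175, 200]
```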
Thanks for the pointers! I guess I was doing the second way with just 8-9 pictures, the worst of both worlds. I also used the species name directly, which I know that the base model has a very vague idea of, instead of a unique token.
What do you do for class and class images? Just 'person'?
I don't always use class images at all. when I do, I mostly include a second concept in my training, that is far away from what I'm training on.
Example, training on Mosaic art, I added a second concept "rick roll" for photorealistic people, to hit more weights in the model and keep it grounded.
The main concept of regularisation/class data is that you don't overtrain the main weights your new concept is targeting. Instead, you also train on varied things across the whole model. So having a secondary concept of high quality photos and high use (like maybe photos of "full body shot" or I don't know what else could be useful in your case) will achieve 2 goals at once : training a useful concept, and having regularisation so your main concept doesn't overtrain too fast
but given what you were saying, the main problem is the subject's dataset
8-9 would be good if it was the same alien
if you go for lots of varied people, I wouldn't go under 50
and I would try to keep it balanced
like don't have 50% of them grey and the rest different colors, they would all be grey in the end
it will try to find things common to most pictures and learn those
likewise, you want poses and lighting, and all that, to change a lot
if it doesn't, it will also get picked up and learned
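The balance point above can be checked on the captions before training. A minimal sketch, assuming comma-separated tag captions; the four captions and the 50% threshold are made-up illustration values.

```python
from collections import Counter

def tag_shares(captions):
    """Return the fraction of captions in which each comma-separated tag appears."""
    counts = Counter()
    for caption in captions:
        # Use a set so a tag repeated inside one caption counts once.
        tags = {t.strip() for t in caption.split(",") if t.strip()}
        counts.update(tags)
    return {tag: n / len(captions) for tag, n in counts.items()}

# Hypothetical captions; "alien" is the main token and is expected everywhere.
captions = [
    "alien, grey, forest",
    "alien, grey, city",
    "alien, blue, desert",
    "alien, green, ship",
]
shares = tag_shares(captions)
dominant = sorted(t for t, s in shares.items() if s >= 0.5 and t != "alien")
print(dominant)
# -> ['grey']
```

Anything other than the main token that shows up in half the captions is a candidate for being learned as part of the concept.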
Does it help to label the environment, lighting and pose for the dataset?
Or do less relevant details sort of get 'washed out' by there being 50 samples to keep things like that from sticking?
yes and no.
Each word that you add in your caption has 2 main effects on the Attention of the algorithm :
1/ it will take some of that Attention away from other tokens, meaning the other words in your prompt get a little less trained on each step.
2/ it will train those tokens a little, making 2 main changes :
2.1/ it will reduce the impact those features had on other tokens. For example, if you add "tie" in your caption and the alien is wearing a tie, it weakens the association that aliens wear ties, so the model won't produce ties without being prompted for them
2.2/ it will train the token itself. taking the same example, it will make the "tie" token a little more fitting to the example tie you showed it.
if you don't caption those background elements at all, it's not a problem, as long as they change a lot
if they don't change a lot, then they get trained into the alien token too
each step of the learning, depending on your learning rate, there is some kind of "Attention budget", it represents how much this step is able to modify the model at once
adding more tokens splits that attention, and that has some good and bad effects that I described
in a way, and played with well, adding more tokens like that acts as regularisation
like if you use a different token describing a different background in each picture
this will split a little of the attention of each step on a new token on the model : the same as regularisation does
as long as you don't repeat those tokens
Ah, thanks again! I don't have enough Mass Effect screenshots lying about to try this right away, but I'll definitely try it sooner rather than later.
I get that feeling 🙂
I also learned big trainings like that through series and videogames
here are some funny experiments https://huggingface.co/Guizmus/Experiments with just the ckp and some examples
and here is my main library, usually with the dataset and all training parameters used on each model. https://huggingface.co/Guizmus
in particular, the death not model has lots of people in it, and the different PoW models are trained on around 50 different tokens
what would be the ideal resolution for training images to finetune a model using dreambooth?
the short side should generally be the dimension of the model you're training (ie 512/768)
So 1024x1024 images would get me worse result than 768x768?
As I understand it, it would just downscale or bucket them to whatever target setting you use, but I'm not sure exactly if there are any downsides to bucketing other than resource usage
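For a rough idea of what that downscale does to the dimensions, a sketch of the short-side rule only. Real bucketing implementations typically also snap sizes to fixed steps and pick from a set of aspect-ratio buckets, which this does not model.

```python
def fit_short_side(width, height, target):
    """Scale (width, height) so the shorter side equals `target`, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

print(fit_short_side(1024, 1024, 768))  # -> (768, 768)
print(fit_short_side(1920, 1080, 512))  # -> (910, 512)
```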
With [filewords], are there tokens/words that are ignored in filenames by Dreambooth? I'm trying to avoid creating text files.
EDIT: I went digging in the code, it appears that Dreambooth (at least the one in the a1111 extension) ignores a number and dash prefix like TI does.
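Roughly what that behaviour looks like, as a sketch. The regular expression is an assumption for illustration and is not copied from the extension's code, so the exact pattern it uses may differ.

```python
import re
from pathlib import Path

def caption_from_filename(filename):
    """Derive a caption from a file name, dropping a leading "<number>-" prefix."""
    stem = Path(filename).stem
    # Assumed pattern: one or more digits followed by a single dash at the start.
    return re.sub(r"^\d+-", "", stem)

print(caption_from_filename("00012-man on boat wearing sku123.png"))
# -> man on boat wearing sku123
```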
For anyone that can't train, due to Safetensors being forced on, getting this message: "Missing model directory, removing model", I created a fix:
https://github.com/d8ahazard/sd_dreambooth_extension/issues/1068
I'm finishing a writeup on training parameters, dataset, monitoring the training, defining a concept, ...
anybody want to give it a read and some feedback ?
https://trainingsd-parameters.carrd.co/
Captioning is misspelled in dataset
You talk about tokens, and multi-tokens but you don't talk about full sentence captioning
nice catch, fixed, thanks
I encapsulate this inside multi token in my mind, but you are right, it's worth at least naming in there
In dataset diversity, for thousands+ of images, you should mention how each subject should have at least X amount of images to train properly
(those are great feedbacks, thanks a lot)
The simple UI is great, makes it easy for anyone to navigate
I noticed though, I press back too quickly as it loads, the page still loads while not allowing me to go back
I got a chance to try out the advice from yesterday's chat.
I'm getting much better results with a set of 53 samples with minimal labeling (species name + color), but it still feels like it's overfitting somewhat. I did 150 epochs with LR 1E-6. The faces ended up looking quite detailed and crisp even at 3.5 CFG while the background got washed out.
Since it's a Mass Effect race, and I'm a bit iffy on using fanart, it's hard to build a bigger dataset without repetition.
yeah, this is carrd.co, I use the free thing because the goal was not to lose too much time on the UI, and this basic one did the job, but it has some real limits to it :/
ah
It's solid
I may migrate everything to github one day, but it's moving quite fast currently (the AI field) and I really like the quick productivity that this tool gives currently
it's great at least to lay out the plan
Something I'd mention, since a lot of people get confused by it, is that a single token is a caption, and a caption is made of multiple tokens. A lot of people still treat the two as different, even though they are the same thing
yeah, I tried to clarify the most things using that terminology page
I'll add lots in it I think still
it's really useful
how many would X be for you ? I have little experience with full fine tuning itself
That's a tough one, because generally the more you have the better, but maybe 10-20 for basic usage?
10-20 very different images of the same subject
ok so treat every token as a "single concept training"
yeah
good luck! np!
What's the go-to project for training LORA models?
My settings (fwiw)
1.1. Install Dependencies
- branch: <blank>
- install_xformers: true
- mount_drive: true
(I have the SD2.1 safetensors model copied to Drive to avoid re-downloading each time, so a few steps are skipped here. In fact, I deleted all the ones I never use, but ymmv)
3.1. Locating Train Data Directory
train_data_dir: /content/drive/MyDrive/LoRA/projects/<name>/
reg_data_dir: <blank>
(I pre-caption all my images in Drive, so have deleted a few steps here)
5.1. Model Config
- v2: true
- v_parameterization: true
- pretrained_model_name_or_path: /content/drive/MyDrive/LoRA/models/<your-sd2-safetensors-model-name>
- output_to_drive: true
5.2. Dataset Config
- train_repeats: <15 - 20>
- reg_repeats: 1
- instance_token: <project name>
- class_token: <doesn't matter, so use "style", "man", "car" etc>
- resolution: 768
- flip_aug: false
- all rest defaults
5.3. Sample Prompt Config
- optional
5.4. LoRA and Optimizer Config
- network_module: networks.lora
- network_args: <blank>
- network_dim: 128
- network_alpha: 128
- optimizer_type: AdamW8bit
- optimizer_args: <blank>
- train_unet: true
- unet_lr: 1e-4
- train_text_encoder: true
- text_encoder_lr: 5e-5
- lr_scheduler: constant
- lr_warmup_steps: 0
5.5. Training Config
- lowram: false
- noise_offset: 0
- num_epochs: <8 - 16>
- train_batch_size:
- mixed_precision: fp16
- save_precision: fp16
- save_model_as: safetensors
- clip_skip: 2
- gradient_checkpointing: false
- gradient_accumulation_steps: 1
5.6. Start Training
- make sure to check all the `print` outputs from previous steps then run
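To sanity-check how long a run with these settings will be, a rough step estimate. This assumes the usual kohya-style counting (images × repeats per epoch, divided by batch size, no regularisation images); the image count and batch size below are hypothetical, since neither is given in the settings above.

```python
import math

def total_steps(num_images, repeats, epochs, batch_size):
    """Estimate optimizer steps as ceil(images * repeats / batch_size) * epochs."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs

# 20 images and batch size 2 are made-up values; repeats and epochs are the
# low ends of the ranges listed above.
print(total_steps(num_images=20, repeats=15, epochs=8, batch_size=2))
# -> 1200
```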
I didn't know about EveryDream, that looks interesting. Is that functionally the same as DB without regularization? I'm having trouble with A1111 DB extension, so I'm looking for a more stable tool.
Nice work on the doc, btw. It's good to have this info gathered in one place, since there's a lot of info out there that contradicts or is very specific to one person's use case.
I put the results of my DB training and dataset on HuggingFace. It can sometimes generate cool results, but it's very overfitted from the games' lack of variety. The larger sample count did really help with quality and prior preservation, though.
https://huggingface.co/vmaple/sd_asari
I'll take a look at your dataset if you want when I get the time.
EveryDream uses the dreambooth mechanism, it trains the model the same way. We tend to refer to dreambooth as small trainings, single subjects or style, but it only depends on what you give it, how you caption it,... Even regularization data is optional in dreambooth if you'd like, or you can teach a second concept far from your main concept, and use that as regularization.
Regularisation data's main goal is to spread the training over the model, so not only the weights you want are trained, and the model keeps "grounded" for longer. But it's data that gets trained on too, so you really benefit from "good" regularization.
EveryDream lets you train on as many pics as you'd like, doesn't make a distinction between class data and regularization data. It's all data to it, and you need to balance your dataset with some regularisation if you want to prevent the bad effects of long trainings
Nice work documenting what you did. I'll give you some feedback and maybe try to train it too to help on the settings.
I'd appreciate that! I want to get a solid understanding of it for one concept wrt labelling and settings, then start working on assembling a bigger Mass Effect dataset.
I know I pollute the color labels since using them alone with this model does make cursed half-asari, but I found that using labels like "blue_skin" as the model on CivitAI does caused a strong bias towards revealing clothing or nudity with the same training time and parameters.
Maybe adding concepts with other blue and purple things would help there.
I think the biggest problem you encounter here is that you train lots of tokens in total, but with the same amount of base pictures you would use for 1 single token training. You used my tip and split the attention by adding more tokens, refining colors for example, but this reduces the attention on the alien token itself too. So it needs more and more steps like you saw, and overtraining happens because the dataset is not big enough.
Problem is, the dataset won't be able to grow a lot more while keeping variety, so reducing the token count could help, or increasing a little learning rate.
I can't really know more on mobile phone rn, I'll need to look at the dataset itself and the mix of colors to understand better
Ah, thanks. I'll give that a try. My goal with the color tokens was to disassociate the skin color from the main token, but I saw that the facial markings which were not labeled did sometimes use a second color token in the generation prompt, so perhaps removing the color tokens may still separate the skin color from the shape.
The almost full body sample and the backs of heads might have to go, since their expression in the generated results are very distorted. Though for the latter, maybe that will cause generations with the head on backwards. 😛
Hello everyone, a very beginner question.
Supposing I'm using dreambooth to train it on me but I'm not using the default SD 1.5 model but already a modified one such as this one for example https://civitai.com/models/4201/realistic-vision-v13-fantasyai.
When I generate new images, will the results lean more towards Realistic Vision than what basic SD 1.5 could offer (with a prompt following the guidelines for Realistic Vision)?
yes, it should.
It's like models are on a big map. 1.5 is further away from photorealistic than realistic vision.
Training is moving on that map, towards the photo of You. If you start from "photorealistic", you will have better results than if you start from "neutral 1.5"
this depends a lot on the quality of your dataset though
ok so here are my feedbacks on that dataset :
1/ there are not enough full body shots. We don't see the feet in any picture, for example. Some would be better. around 10% of the dataset seems like a minimum if you don't want to be stuck prompting heads only
2/ good diversity in the settings themselves
3/ The back photos don't bring a lot to the table here... They are not nice artistically, and can be confusing for the AI, so I would remove all of them.
4/ some pictures, like 13 or 43 can be hard to understand maybe, a little too dark ? Not sure, this could also be good.
Given the diversity, 50 pictures more or less, I would go for 2 to 3 tokens.
The main one is "asari" for sure, no doubt about it, and it should be on each and every picture.
The second one could be male/female, not sure. the color can seem tempting but yeah, I'm afraid it takes too much away from the main token to add 4 colors or something...
After that, it's about all. You could also add "closeup", "halfbody" and "fullbody", one or the other depending, to each picture.
Does this need regularisation ? not sure it does. Simplifying your tokens first like this may be enough.
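The framing tags and the roughly-10% full body guideline above are easy to check on the captions. A minimal sketch, assuming comma-separated captions; the four sample captions are made up.

```python
FRAMING_TAGS = ("closeup", "halfbody", "fullbody")

def framing_report(captions):
    """Count framing tags and list indexes of captions without exactly one of them."""
    counts = {tag: 0 for tag in FRAMING_TAGS}
    problems = []
    for index, caption in enumerate(captions):
        tags = [t.strip() for t in caption.split(",")]
        found = [t for t in tags if t in FRAMING_TAGS]
        if len(found) != 1:
            # Zero or several framing tags: flag for manual review.
            problems.append(index)
            continue
        counts[found[0]] += 1
    return counts, problems

# Hypothetical captions for illustration.
captions = ["asari, closeup", "asari, halfbody", "asari, fullbody", "asari"]
counts, problems = framing_report(captions)
fullbody_share = counts["fullbody"] / len(captions)
print(counts, problems, fullbody_share)
```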
one other thing that comes to mind :
this could be reduced to a 20 picture training too. dropping all tokens except "asari", taking only the best most artistic pics, even maybe including an artwork or two. This could work great, there isn't a lot of variety in those asari models
it's between a single subject and a style... it feels more like a variation on a single model of asari, if you know what I mean.
skyrim did better in the variety of Khajiit for example
Skyrim had a rig on the face for character creation, whereas the asari are the same 5-6 general head models with textures aside from major characters like Liara, Peebee and Samara. ME3 had the best texture for the heads, but they were rationing memory hard in that game.
I was hoping for better in Mass Effect Andromeda to have a CC for multiplayer, but that game did even worse by having a very distinct face mesh for all asari except the party member. Dragon Age: Inquisition was done in the same engine, and it had a pretty powerful albeit unwieldy character creator for its four races
Thanks for the feedback! I'm using the GPU for other stuffs now, so I'll try with these changes tomorrow.
they botched it a little here I admit... (I haven't played ME yet, I promised myself to do the trilogy one day though)
is there a non colab version?
I think it's this one
https://github.com/kohya-ss/sd-scripts
hey folks, so I've trained hundreds of models on people, mostly myself, but I just never really get the results I'm looking for.
My ask:
Would you mind training some 1.5 models on my concept images, using your approach, and explain that process to me? I want to test if it's just biased from looking at my own face and seeing everything wrong with it, or something else. I'm willing to pay too.
Not available rn, i can give a hand later, no promises on better quality, I've had the same problem being oh so critical of my face, compared to other models i trained.
ok so not just me, good to know. Thanks for the reply, lmk when you're free. I can run on my machine too if you walk me through your approach.
I'm set up on a lower rig than yours (3090ti) but I should be good, i train a lot on it too
Thanks though. In any case I'll document what I do don't worry
appreciate you!
I have installed kohya and can train 960x960 at 16bit on Windows with a 3090. Are 32bit and 1024x1024 only achievable with linux specific optimizations? who knows? :/
where is the upper limit?
One huge breakthrough I had even with crappy input was putting the Analog LoRA on 0.3 and then using HighRes Fix tickbox at between 0.2 and 0.3 denoising. The improvement is indescribable
this is my current workflow and comparisons if anyone is interested or has some suggestions (ie @unique cloak ) 👀
1. Find image in the pose and clothing you're after.
2. Activate your custom face LoRA (I generally find anything less than full `1` strength is not enough)
3. Use inpainting in img2img to get a rough version of your target face onto the source image (just needs to get the basic sketch down).
4. Use the sketch as a source for ControlNet on txt2img to get the Canny thresholds right. (Preprocessor "canny", Model "control_sd15_canny")
5. Then keep running "cheap" experimental passes on lowish Steps and no face or hi-res fix until you find the aesthetic you're after.
6. Use that seed to run the same job with Restore Faces and Hi-res Fix (takes a while, so I leave these until I'm fairly certain I've got the overall composition and color grading down).
7. Adjust the Hi-res denoising until it fits.
(Not shown in the image below).
8. Go back and start to inpaint any bugs like extra fingers etc.
I added examples lower in the image to show the effect of the ControlNet, Analog LoRA, and Hi-Res Fix, stripping them back so you can see how crappy the original output is by comparison.
Before I worked out these other couple of steps (most importantly the Analog LoRA and Hi-Res fix), I found that having the face LoRA on strength 1 created lots of image degradation, but anything less started to morph to a different person, which was frustrating. This flow lets you use the full strength to get correct face correlation, but somehow magically heals all the artifacts.
Analog LoRA: https://civitai.com/models/14826/analog-film-photography-portraits
You can probably use the base 1.5 model by cranking up the weight on the LoRA, but I'd suggest using a photography ckpt as your base and mixing in any LoRAs to your preference.
I don't have my face model trained on 2.1 yet but I imagine you'd get even better results with Realism Engine (https://civitai.com/models/17277/realism-engine)
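For reference, the LoRA weights in the workflow above (face LoRA at 1, Analog LoRA at 0.3) end up in the prompt as `<lora:name:weight>` tags in the A1111 web UI. A small sketch of building such a prompt; the LoRA file names `my_face` and `analog_film` and the prompt text are hypothetical placeholders.

```python
def build_prompt(text, loras):
    """Append A1111-style <lora:name:weight> tags to a prompt string."""
    # The :g format drops trailing zeros, so 1.0 becomes "1" and 0.3 stays "0.3".
    tags = [f"<lora:{name}:{weight:g}>" for name, weight in loras]
    return ", ".join([text] + tags)

print(build_prompt(
    "portrait photo of a man on a boat",
    [("my_face", 1.0), ("analog_film", 0.3)],
))
# -> portrait photo of a man on a boat, <lora:my_face:1>, <lora:analog_film:0.3>
```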
Hello, everyone! I’m new. If I have chosen the wrong channel please redirect me to the proper one.
I’m looking for hints on the pipeline for my task. I do a bit of art and I want to make comics with SD (mostly to get familiar with technology). Do I understand the general idea properly?
- To make a comic with my own characters I need to train some models, but as for backgrounds and general style I can use something like the Comic Diffusion model.
- I can draw faces of my characters from different angles and feed them to a textual inversion model. After that I’ll be able to get more or less similar faces from SD.
- Do I need to train a different model for their bodies and outfits?
- Is this line of faces enough to explain to a model how the character looks? Do I need to draw emotions as well?
- What are other things that I should know?
As for the database I can do about five angles per character but not dozens, etc. This is why I’m looking at textual inversion. I don’t mind manually fixing things on generated images so I don’t need them to be super precise.
Having several trained models, how do I apply them in one generation? Do I need to switch between them with Inpaint? Any other good ways?
So far I’ve managed to install SD and spend some time there and in MJ but that’s all. I'm a total noob who has watched two dozen videos on YouTube and is now overloaded with info. I’d be super grateful for any tips and help you can provide. Huge thanks in advance!
it seems like a solid workflow to me, yeah.
his problem seems to be high detail quality, and it would help but I think it's also to "manage to train correctly in dreambooth".
Thanks for that realism engine, it seems like a base i'll consider here.
I'm going to stick to Dreambooth here (it's what I know best) and see what it lands me for a start
I think your process for making good quality pictures is perfect here, even a little on the "directive" side of things, leaving little room for AI improv in there.
But if you were targeting Scotty's problem, I think he wants a model that does great on his face without more tricks like those. Not sure how good we can get it
those are a lot of good questions !
1/ this is a good way to go yes. training (not necessarily models) will let you have those characters that stay the same from a page to the next.
2/ textual inversion should work OK. as well as LoRA on that side, and both will let you use the face you trained. I would not include only face shots though. Those 5 in that strip are very good, but :
- no body each time may make the AI think this is a head that can't have a body
- having the same armour will make it get learned with the face
so you may want to try to vary a little on those 2 things, adding at least a fullbody shot and a halfbody shot, and changing the outfit in some. Great faces though
3/ I don't think it's necessary, but it can depend. when you start prompting using what you get from 2/, you will already have clothes in the model. You can try to prompt for specific clothes, and in a cartoon style it may be enough to keep the clothes the same from one shot to the next. if it's not the case though, going back to 2/ and training some more TI/LoRA on the main clothes could be good. but at that point I would mostly train face+clothes in a single embedding, it seems a better fit
4/ this should be enough for the AI to get the face quite well.
5/ lots possibly... I wrote a guide on all this if you want, explaining the training process and the different training methods (this stays quite superficial). https://trainingsd-parameters.carrd.co/
I think Lora would be better than TI because it trains faster with a higher quality.
You can use multiple LoRAs in a single prompt (composable lora extension), or even in different part of a picture (latent couple extension). You can also merge LoRAs into a model but I think that would just bring simplicity at the cost of quality, so I wouldn't.
About your last sentence, here is something I read recently that I want to share with you
Oh thank you very much!
- I really wanted to try LoRa but I thought it needs a lot more images, like about hundreds. Your advice makes me happier)
- Feeding full bodies with heads is even easier no need to invent new faces!
- Thank you for the link I’ll study that.
And yes, I totally agree on “go and experiment” idea, but wanted to find out if anyone has done anything already. Somehow I can find lots of info on comics in MJ but zero in Stable Diffusion
P.S. thanks, it feels so good that you like her face)
to be honest, making comics is quite a common project, but very few come in here with the technical skill to build their own dataset from drawings they do. So, from what I've seen around here so far, it's been more limited to research interests.
Doing your own art to build your models like this is a really good skill to have, and it lets you do things for real with all the rights on the pictures you trained.
My personal approach to your problem would be slightly different :
1/ I would train a model, using dreambooth, on the general artstyle you have, with lots of different backgrounds, characters, situations,... keyword is diversity, we don't want any character to be learned, only the global art style you have
2/ I would train a list of lora or TI, using the model made in 1/ as base.
Each of those embeddings would be a full character for a full scene : they keep the same face/clothes in 2 to 3 pages usually, a full scene, and this seems more appropriate than training individually clothes and faces. You would just have 1 file per character per scene, and 1 base model for your whole style
3/ to push the quality even further, you could train a hypernetwork on top of all that, but this seems overkill.
if the model trained in 1 was trained using "kateStyle" token for example, and the characters in 2 would be trained using TI on the tokens "misterX" and "missY", then you could prompt something like :
misterX walking, kateStyle
adding the ControlNet extension, you will be able to just sketch the scene, and pass it through canny mode to see it drawn completely
adding the Latent Couple extension, you could prompt both characters in the same scene easily too. this seems useful because it would prevent the misterX and missY tokens from getting mixed in the result
Wow, thanks! Good to get insights on the approach.
Yep, I understand that many do comics with AI. I decided to do it as well cause I think it’s the easiest way to develop a pipeline.
One more question. How many backgrounds do I need? Approximately. Since this is a side project and I don’t have great drawing speed I’m a bit afraid to dive deep into the whole bunch of backgrounds, narrative situations, etc. I do not have such base developed at the moment and my portfolio jumps from style to style tbh due to work and study tasks so I was going to use a Comic Diffusion model I have found on Reddit. Maybe I can tune it somehow.
Also I’m looking through your link and it says that lighting can be remembered hard by the model. Does this mean that I need to add different lighting conditions into the data base? Do I need to make them for every character and location that I have in the story?
a style training takes usually around 50 to 200 pictures, and that's how many I would suggest you target. but given the situation, it could be quite complicated.
The more backgrounds the better. A larger one could also be split into 2 or 3 to train on without repeating things. A minimum of 20 would be my tip. It will directly impact the capacity of your trained model to adapt your style to different situations.
Like it would be complicated for your style to do any animal if it hadn't seen any in the training.
Lighting is just an example. Anything that repeats can be picked up by the AI. Lighting, for sure, but also composition, similar objects, or any feature you may think of, in terms of photography.
So yes, having at least a little variety on everything you want the model to be able to vary on is important
For example, were you to train on the faces you showed, the model would learn that there is a white background in 100% of cases, and try to put it in without you prompting for it
making a base model on your own style can be optional though
you can find a good model on civitAI with the good rights associated with it, and find the good tokens to make it give a style close to what you want
and use that model
being a side project, I would advise going in that direction, yeah
also I'm really open to feedback on the guide I sent, it's quite new, intended for people just discovering training SD, and I want it to be easy to understand, so if something seemed too hard, I'd love to know
20+ sounds doable but will require lots of time. Good option for a small company I think
I will probably try looking for a suitable model which will visually blend well with chars. I don’t have strict requirements for the style I can adjust to final results if they look good
for sure, for a small animation/comic studio, this is a godsend tech
As for the guide right now I can think only of this: if the highlighted words worked as hyperlinks sending directly to explanations it would be more comfortable. But it’s 2.30 am at the moment so I can come up with more ideas tomorrow)
I wish I could. I used that free tool for doing this faster, I didn't want to build a website for a guide for now, things move too fast. So i'm quite limited by the template itself
it does the job, for a free tool
but I would have loved to have the definitions as tooltips
have a good night 🙂
I tried leaving all the secondary labels out without changing the dataset otherwise (testing one change at a time), and the results are generally prettier but it's harder to get consistent color and prompt for skin color. I put the results on the HF repo.
I'll try to build a better dataset with the rest of your feedback next, removing backs of heads and adding full body shots.
Are there good tools out there for managing datasets, labels and such for this kind of training? More than just batch cropping, I mean.
I use 3 main tools for working on my datasets :
1/ XNView, lets me do batch work on pictures, like resizing, changing file format, ... it's focussed on image manipulation
2/ Bulk Rename utility, it lets me rename files using lots and lots of options, and do that on lots of pictures at once. essential to modify a token, add something to all pictures, ...
3/ a suite of tools to scrape LAION B or do some auto captioning https://github.com/victorchall/EveryDream
(I use the name of the picture files as caption)
+1 for Bulk Rename utility
Thanks. I'll take a look at these tools.
I understand and it’s a very good guide.
One more thing I came up this morning. You tell about situations when the model gets something wrong (e.g. concept bleeding, or white background is used everywhere instead of regular background, etc) and you tell that “This means you need to add better regularization data”.
This seems a bit vague to me, like what kind of data do I need to add? Like characters with no white background?
The vocabulary doesn’t help much in this case since it says:
Regularisation: Pictures of other things you don't need training specifically. Those will still get trained on too.
do controlnet models get meaningful updates?
controlnet models were only for 1.X models at first, then some models came out for 2.X recently, but not for all preprocessors yet. so yeah, some, but no big change yet
that's a very good observation
"better regularization" doesn't mean anything on its own
I need to put example images on those things too
I'll expand on this
I'm in the process of transitioning it to github for better access
and to answer, "better" refers to the main goal of regularization, which I don't state clearly in the guide : hitting as many weights as possible, spreading the learning across the model with very low attention on everything trained in there, just to make sure everything keeps getting updated, and that no part of the model gets left further and further behind what we are training it on
so it would mean more diversified regularisation, more different tokens in its caption if using a captioned regularization, or more diversity in its content/more generic class token
I think the visual will help a lot, yes. I need some time to digest this answer tbh. For now I understand it this way: if I see that model tends to add blue to the background even if it is not asked to do that I need to look at the source images and find which of them confuse the model. Either fix them by recoloring or remove them completely. Or add more images with non-blue background.
Will ‘— no blue background’ command work in this theoretical situation?
"no blue background" prompt won't help, but maybe "blue background" as negative prompt could
Yep I meant that, did not know how it’s called
about the "how to solve it", do you think this blue would come from your instance data, or from your regularization ?
How do I set up regularization? By writing descriptions to the source data?
Btw maybe adding Token definition to vocabulary is not a waste of time
because the most likely situation is that it comes from the instance data itself, and is a "bias" of your model
So to tackle it, the best thing to do is to hunt in the dataset for the pictures presenting that background
And for each, you want to reduce the attention on it, aka how much the model will learn from it
one way is to add to the caption "blue background". this will divert the "blue" attention towards that new token, and preserve your main token from it
second way is to remove that picture completely. Sometimes less is more. or swap it for another one
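A rough sketch of the first option done in bulk, assuming the captions live in the file names (the filename-as-caption workflow mentioned earlier in this channel). The helper name `add_tag_to_caption` and the sample file names are made up for illustration; nothing here is specific to any trainer.

```python
import re
from pathlib import PurePath

# Trailing " (12)" style numbering, as produced by a bulk rename.
NUMBERING = re.compile(r"\s*\((\d+)\)$")

def add_tag_to_caption(filename: str, tag: str) -> str:
    """Return a new file name whose caption part also mentions `tag`."""
    path = PurePath(filename)
    stem = path.stem
    match = NUMBERING.search(stem)
    caption = stem[: match.start()] if match else stem
    numbering = f" ({match.group(1)})" if match else ""
    # Leave files alone if the caption already carries the tag.
    if tag.lower() in caption.lower():
        return filename
    return f"{caption}, {tag}{numbering}{path.suffix}"

# Hypothetical pictures that were spotted with the unwanted background.
flagged = ["misterX walking (3).png", "misterX sitting, blue background (4).png"]
for name in flagged:
    print(name, "->", add_tag_to_caption(name, "blue background"))
```

The function only computes the new name; the actual renaming (or editing of a caption text file, if the tool uses those) is left to whatever fits the trainer in use.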
I think I had it ? I'll add it if not (token)
depends on the tool.
on shivam/automatic/thelastben you have a "class data" to specify
in everydream, you add more pictures to your main dataset, but pictures of other things than your main concept
I thought you had it too but I can’t find it right now. I come back to the vocabulary during this conversation to keep track and don’t see it
one thing confusing
there are 2 vocabulary pages
one per guide
the "parameters" one and the "methods" one
I'll solve this on github
Wow, I missed that completely. Thanks!
I think I’m getting it now
a token is a part of a word. it's usually around 3/4th of a word in size. it's a series of letters, and helps split your prompt into bits the model can understand
each token has weights linked to it
and those weights are what you train when you train a token
you will change what "decisions in the creative process" the AI makes for that specific token
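To put a number on the "3/4th of a word" rule of thumb, here is a back-of-the-envelope sketch. It only counts words and divides by 0.75; it does not run the real tokenizer, and invented words such as "misterX" may well be split into more pieces than this suggests. The function name is hypothetical.

```python
import math

def estimate_token_count(prompt: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from the rule of thumb: 1 token is about 3/4 of a word."""
    # Commas separate prompt parts, so treat them like spaces when counting words.
    words = prompt.replace(",", " ").split()
    return math.ceil(len(words) / words_per_token)

print(estimate_token_count("misterX walking, kateStyle"))
```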
I’m browsing from mobile and I don’t see parameters at all. Probably I’ll find it from PC
Got it, thanks
Hm it is accessible from mobile my bad. It just looks like a ‘chapter’ not vocabulary page
the more I use it, the more I feel how splitting it in two, even if necessary, was a big problem
At least maybe you can add a note about two pages
Like this is a second vocabulary guys, check it. It’s different
I may have finished transitioning in the next few days, it's my main goal tomorrow and monday
I want to put those definitions as tooltips too
Yeh, then no need
(quite a nice portfolio you have online by the way)
Thank you, very nice to hear it 🙂
you trained already and have observed a bias, or you are anticipating how to react to it/giving feedback on the gaps in the guide ?
I observed such biases in MJ. As for SD I’m trying to get overall impression
I need to draw full bodies to start training
since you are making the dataset by hand, I must remind you of one thing
don't spend too much time planning, and not enough baking.
you can't ever really get how the AI will understand what you give it
I did dozens of models now, and keep on not being able to detect everything wrong in a dataset
going back to the dataset, and changing your samples, is part of the process
given what you are going for, I believe you don't really need regularization that much. It should be good without
Yep, I more or less understand that. I think of feeding some MJ generations to a separate model just to practice and see how it goes. I tried doing consistent characters there
I gave a hand to someone working on a photoreal consistent face (their own), and it's quite hard to get every detail correct.
Even when training a model/lora, it's preferable to still prompt all the features you need, like the hair or clothes, to help the AI keep on track and up the quality
I hope it works for you !
It sounds logical cause in different scenes you will need those concepts anyway. In one image wind can affect a character’s bangs while in another they will hang still. And I will need the model to know the concept of “bangs” to give instructions about the image.
(Hope I chose the proper words I’m not native English speaker)
Me neither, but we understand each other quite well here I believe 🙂
except when I go nerdy technical
yes, you need to be able to call for it/prompt it, and the beauty of it is, you don't have to caption "bangs" for "bangs" to still train. it will detect those as bangs
later on, using the token bangs in your prompt, it will force those specific bangs, and not the random ones SD knew about before
Hm, I had problems requesting Baba Yaga’s walking hut on chicken legs from MJ. It simply doesn’t know that a hut can have birds legs
if you take a look at the last model I did today, the "PoW Style", all pictures are trained on arbitrary tokens that have no inner meaning to SD. but it still manages to extract the main features of each picture of the dataset, so understanding bangs isn't out of the picture at all.
that can happen here too
if it doesn't have any examples of it, or too many examples of the contrary, it blocks its "imagination"
like, good luck making a horse riding a cat
Anyone interested in making a model based on realistic asian women willing to learn/collaborate together?
Anyone got good learning resources on finetuning SD?
#stablediffusion #characterdesign #conceptart #digitalart #machinelearning #dreambooth #style #LoRA #digitalillustration #aiart #style #automatic1111 #kohya #redjuice #vofan #lucy #cyberpunkedgerunners #aqua #kingdomhearts
CAPTIONING GUIDE (2/23/23) by u/SecureWeeb on Reddit: https://www.reddit.com/r/StableDiffusion/comments/118spz6/captionin...
Do you have anything on how to finetune complete stable diffusion ?
Like the controlNET guys and illuminati civitAI model did
do a full finetune ? I touch just a little on it in the guide I made for training, more generically
Usually, you train only on a few concepts at once. Full Fine-Tuning, on the contrary, aims at training the whole model on every concept it can. It requires an enormous dataset to do correctly and keep all concepts high quality.
- full Fine-Tuning a model, you train on several thousand pictures. Each concept included is to be treated at least like a single concept would, with around 20 to 50 pictures each.
but I don't have experience on it myself so not sure how you monitor such training
that seems the hardest part
(link to the guide if you want to check more, but it's about all I have on full finetuning) https://trainingsd-parameters.carrd.co/
Thank you 👍
Amazing guide.
I have a few questions after reading it.
- Where do you run "tensorboard --logdir="?
- Where do I download the webUI that shows graph of the loss?
- I did not grasp the concept of regularization images. Is the purpose to show via images what you dont want the model to "learn"?
1/ I could go more into detail on this but it was getting too tech oriented so I left it there. you first open a command console, you activate the venv of your trainer (shivam, everydream,...), and then you enter this in there. it will do a little like automatic does : give you a local URL with a UI to look at
2/ it's auto included in your tools already, see 1/
regularization, you're the second one to ask for more detail on this, has 1 main goal : keep the whole model training, and prevent overtraining that way
to use a metaphor, a model is a point on a big map
training is moving that point on the map
but when you train, you only hit some of the millions of parameters inside the token
so sometimes, your point starts to split into multiple parts : what you are training on goes on its way, while the rest of the model "stays in place"
this will create the burning effect we call "overtraining", when one part of the model starts to go too far
so to prevent this, we try to train on more of the model at once
using regularization lets us do that : you don't train anything a lot, but you train lots of things a little
like using even "a man" as regularization with 1000 pictures (if using shivam) will bring lots of diversity in what is learned
those images get also trained a little, and modify your model weights on purpose, so :
- you can use that regularization to refine different useful concepts, like picture framing or lighting
- you can show only superb pictures to train the model passively on aesthetics you like
i somehow dont find much info about training vae, also there arent many vaes floating around - could someone point me to a place where i find something in an not full nerdy way >.<
Ok which trainer is used when u do Lora model with Kohya_ss webui?
Thank you for the explanation !
I think it's kohya_ss as the name seems to imply : https://github.com/kohya-ss/sd-scripts
Not sure I haven't used it in webui
DING! My LoRa is finished!
Whew. It's been a journey. I would like to thank God, my family, and everyone who's helped get me here. You know who you are 🥹
Is there a free text to image generating tool with no restrictions?
... dear lord ... these results from this LoRa are .......... unspeakably terrible
i've clearly done something horrifically wrong
Keep going king
was it a style or a character?
Character. But, I was chatting with someone in a different server, showing my settings, and it seems I had almost everything set wrong. Of course. I think his exact words were " you just wasted 4 hours for nothing "... Lol. Soooo ... I'm starting over
I found with character ones, it's important to not describe as much stuff unless you really want it to change, and then use a token like foobar man etc - even the background, I dropped it back to super broad descriptions and it seemed to work much better
you didn't waste those on nothing, you earned precious experience, and understood a little more each parameter you had that was wrong
don't worry, failure is part of the path too
Captioning is definitely something I don't understand.
When you say token like 'foobar man', do you mean for the instance prompt?
With a LoRA the instance and class prompts don't seem to really have an effect, as I understand it that's more from the Dreambooth training step it has to go through.
What I mean is say you're trying to train 3 characters in the LoRA, you'd call one "foo", one "bar", one "baz" in your prompts when describing the character to let you invoke that specific character rather than a generic "man", and negative prompt down the others if required.
(Obviously replace those variables with a unique name)
There's a new project that removes adversarial noise from images (like what Glaze does), if you find images in your dataset that need such cleanup: https://github.com/lllyasviel/AdverseCleaner
What I learned from comments in the #1011228667659178055 channel is that SD is not good at (re)generating text, like titles of book covers. I get that in a context where I would be generating books from a text prompt. I was trying to use dreambooth to fine tune SD to recognize my book cover in order to be able to generate that book in new situations (like other backgrounds). In my head, I was thinking of keeping that book 100 or 99% intact, just in newly generated situations. Is there a way to do that and to tell SD not to deviate from the newly learned book and cover but to keep that intact as much as possible?
What does the Kohya_ss train resolution setting do when the training dataset is of a different aspect ratio? Let's say the train resolution is set to 512*512, what happens if one image of the training dataset has one of the following dimensions?
- 768 * 512
- 768 * 480
- 768 * 1024
Can I get some answers on this? The higher the batch size, the more images it samples from at once. So if I had 99 well-learned images and 1 poorly learned image, at batch size 100, my 1 poorly learned image would have a hard time being learned, because the 99 images make the AI think it's already good at the training data, right?
Gradient Accumulation Step is how sensitive AI is to little details. If you set it too high the AI will get surprised by every little detail. You need to set it to right number because if its too low it cannot capture important details, if its set too high its sensitive to every little detail that it cannot capture the presented concept
Aspect Ratios bucketing
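For anyone wondering what aspect ratio bucketing would mean for the three sizes asked about above, here is a rough sketch of the general idea. It is not kohya's actual code: the 64-pixel step, the 256 to 1024 side limits and both function names are assumptions made for illustration.

```python
def make_buckets(max_area=512 * 512, step=64, min_side=256, max_side=1024):
    """List (width, height) buckets whose area stays within the training budget."""
    buckets = set()
    width = min_side
    while width <= max_side:
        # Largest height (a multiple of `step`) that keeps width * height <= max_area.
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # same shape, other orientation
        width += step
    return sorted(buckets)

def pick_bucket(image_size, buckets):
    """Choose the bucket whose aspect ratio is closest to the image's."""
    image_width, image_height = image_size
    ratio = image_width / image_height
    return min(buckets, key=lambda bucket: abs(bucket[0] / bucket[1] - ratio))

buckets = make_buckets()
for size in [(768, 512), (768, 480), (768, 1024)]:
    print(size, "->", pick_bucket(size, buckets))
```

In this sketch each picture goes to the bucket with the closest shape instead of being squashed to a square, while the pixel count stays at or below 512*512. How a given trainer then resizes or crops the picture to fit its bucket depends on the tool and its settings.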
Hey guys I'm currently working on getting deeper into embed training and I found the guide in the #1080946152318443610 channel at the bottom explaining how loss rates work. Since I only own a 1070, training an embed takes a long time for me, which is why I found it especially interesting, because this way I might be able to see if a training is going wrong during the 10 hours it takes me to create one.
So I've enabled tensorboard and tried to access that, but it seems that it is necessary to complete the training to actually display something or to be used?
So I went ahead and told it to write the loss rate on every step into the csv file. After 4 hours of work with around 1000 steps I loaded the csv file, and added a "trendline" to show me how the tendency of the value has changed.
But it seems like it has only been reduced by around 0,002 as you can see in the (extremely "value zoomed") picture. Is this a normal difference to have? It feels like the difference should be much higher to me
Which would mean if I finish training with the 3000 steps I have it will probably only be at around 0,005 or 0,006
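The trendline idea described above can also be done without a spreadsheet. A minimal sketch follows, assuming a per-step log with columns named "step" and "loss" (the column names and the sample values are placeholders, not the real file). It fits a straight line through the loss and also prints a moving average, since the raw per-step value is very noisy.

```python
import csv
import io

def loss_trend(steps, losses):
    """Least-squares slope and intercept of loss versus step."""
    count = len(steps)
    mean_step = sum(steps) / count
    mean_loss = sum(losses) / count
    covariance = sum((s - mean_step) * (l - mean_loss) for s, l in zip(steps, losses))
    variance = sum((s - mean_step) ** 2 for s in steps)
    slope = covariance / variance
    return slope, mean_loss - slope * mean_step

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Placeholder data standing in for the real CSV written during training.
sample_csv = io.StringIO("step,loss\n1,0.12\n2,0.10\n3,0.13\n4,0.09\n5,0.11\n")
rows = list(csv.DictReader(sample_csv))
steps = [int(row["step"]) for row in rows]
losses = [float(row["loss"]) for row in rows]

slope, intercept = loss_trend(steps, losses)
print(f"change in loss per step: {slope:+.5f}")
print("smoothed:", moving_average(losses, 3))
```

Multiplying the slope by the number of remaining steps gives the same kind of extrapolation as the 3000-step estimate above, with the same caveat that a straight line is only a rough guess.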
what are u using to train
usually theres an option to like display loss rate
without going into tensorboards idk
I'm using the stable diffusion webui, the integrated training of embeds there (which I believe is dreambooth). There is a loss value being displayed which is being updated. I thought that is probably only the loss rate of the last step or something, or is that adjusting based on all the steps taken or an average?
im sure its adjusting
Ah okay that is good to know! Can you tell me some “good" and "bad" values for a certain training step amount or something similar? I couldn’t find any advanced guides on how this value works other than that it might go up and down
steps?
easiest way is to gen sample images
and see if ur shit starts getting baked
Hm okay, I was trying to see whether I can get a bit more in depth through the loss rate, because just generating sample images every few steps seems like a bad way to actually see results on a larger scale since the images can be hit or miss
👋
So I did one model for generating Eye Textures for fun, wanted to see if I could do it!
A1111 Using Dreambooth, did a test set of 80 Images, 100 Steps per image and I'm actually really happy with the results.
first of all, I'll plug a guide I finished recently on training, all methods included https://github.com/Guizmus/sd-training-intro
eye texture ? i'm not entirely sure what you mean, close up of eyes ?
Here is an example of v1 one can do.
ho that helps
Yeah was just generating one real fast.
I would have gone maybe a little under 80 images, off the top of my head around 30 to 50 would be enough
Now I just used the Prompt "Eye Texture" as I trained it. So I don't really have any control on what the eye actually looks like.
Yeah I have a larger dataset now which I need to sort.
but if you got diversity enough for 80 that's great
So my main question is this
that's a hard token to train onto
Yeah I'm a bit confused on the whole instance token, instance prompt etc.
I expand a little on it in the Dataset chapter, Captioning and Attention are nice to understand
But I wanted to know if I seperate all my datasets into categories like
Slitted Pupils, Circle Pupils, Eyes with highlights, etc.
Should I train these into the model separately or?
nope, you should train them all simultaneously. for multiple reasons
So I dont actually need to sort out my dataset?
first, doing different models and merging them, you'll lose quality on every trained concept
Second, there will be a lot of competing concepts : the "eye" token for example will be changed a lot in each model, and merging them could break all of those
it's a lot easier to do, sorting
you want a certain balance in your dataset
if you train on half red eyes, and 25% blue and 25% green for example, you would have a red eye bias in the end
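A quick way to check for that kind of imbalance before training, sketched with made-up labels. The two function names and the 10-point tolerance are arbitrary choices for illustration, not a known threshold.

```python
from collections import Counter

def concept_shares(labels):
    """Fraction of the dataset taken by each concept label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def overrepresented(shares, tolerance=0.10):
    """Labels whose share exceeds an even split by more than `tolerance`."""
    even_share = 1 / len(shares)
    return [label for label, share in shares.items() if share > even_share + tolerance]

# The example from the message above: half red, a quarter blue, a quarter green.
labels = ["red eye"] * 50 + ["blue eye"] * 25 + ["green eye"] * 25
shares = concept_shares(labels)
print(shares)
print("likely bias towards:", overrepresented(shares))
```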
Oh!
well for multi concept at once like that
the best way I found is this
first you make a balanced dataset : 25 pics of each for example
but the same for each
Mhm.
then you do a first training session, doing lots of save points
the goal of that training session is to see when each concept is "just trained enough"
some will train faster than others
and so you want to synchronize them for best quality
that brings it to the last step : the real training
you start by training only the concept that took the longest in the test run
and you add more concepts every few epochs, doing just your maths based on the test run
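The "maths based on the test run" could look something like this sketch. It assumes the test run told you roughly how many epochs each concept needs, and it delays each concept so that all of them finish together. The concept names and epoch counts are invented, and the function name is hypothetical.

```python
def staggered_schedule(epochs_needed):
    """Epoch at which each concept should join so that all finish together."""
    longest = max(epochs_needed.values())
    return {concept: longest - needed for concept, needed in epochs_needed.items()}

# Invented results from a test run with lots of save points.
test_run = {"slit pupils": 12, "round pupils": 8, "highlights": 5}

for concept, start in sorted(staggered_schedule(test_run).items(), key=lambda item: item[1]):
    print(f"add '{concept}' at epoch {start}")
```

The slowest concept starts at epoch 0 and the faster ones are added later, which is the order described above.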
(Oh I see, I can classify each image within its name for the AI to perhaps understand what is different about it?)
but that is a lot more complicated than just tagging all of them "eyes"
Sorry reading here and your write-up.
take your time, no problem
I started SD 3-4 days ago and I got the question "Can you, generate eye textures with this?" So I got real curious.
I'm not on automatic, I use another tool, and I can train just on a folder and the subfolders. I think in automatic you'll need to declare each concept as a sub dataset, but it's possible too
So making a balanced Dataset.
about your token, "eye texture". I think you should drop the "texture" completely. just "eye"
it's already a strong token to train onto
So it would be possible to have a Core folder called "Eye" and sub folder to further categorize it
Like Slitted Pupils
it's how I work in my tool, EveryDream2trainer
but that just depends on the training tool you use
That is really cool and would greatly improve the workflow of sorting a dataset.
I have about 2.5K eye textures I can build a dataset from so ._.
yeah, I have a big "datasets" folder, lots of "preparation" ones, some "captioned" ones, and I prepare a "mix" folder, copy pasting the concepts I want
but....
I don't think you are on the right path using so many eye textures, or I could be wrong on the diversity
I don't imagine 2k5 eyes being different. and I mean different really, not color swap once, no repeating motifs
with balanced colors.
I think I used abyssorangemix3AOM3_aom3 for the source checkpoint which I also need to figure out what that actually entails.
for me, each eye type, like cat eye, or whatnot, would need max 15-20 pics
but you may have lots more motifs in the background or other
the important thing is
no duplication
if you duplicate motifs or things like that, those will train insanely fast
and burn
Yeah I noticed in the original V1 of the model.
I had the same eye repeated in it just with different colors.
and that eye shows 1 out of 2 seeds now ?
Haven't noticed it that much, I see some repeating patterns.
Doing a batch gen so you can see what it'll do.
Me and my friends already consider this version enough for what we want.
But I wanted to see if I can make it more fun to use and customizable
for sure, I'm going far into what can be done, and DB out of the box is already great
well if you want to do it the simple way, and gain a little control
I'm new to this AI thing and I'm almost a bit scared how well it does when I set this up without knowing anything.
I would use 2 tokens at once, max.
meaning "cat eye", "monster eye", ...
and I would just balance the dataset
forget the 2 steps training I was talking about
At once does that mean during the final training
and if I wanted more different eyes I would make more models?
Yeah.
so it's a little problematic, switching like that
Mhm.
and it's the case where I was saying merging wouldn't work great
In a perfect world we would add the prompt "Cat Eye" and it'll just add the slitted pupils etc.
yeah, so for that...
can you show me the current UI for training DB in auto ?
it's been a long time since I used that one
I think the option you need is "caption training"
but not sure
Sure!
the one where you set up the concepts mainly
I see an option that could do it already there, but the UI for concepts may be more simple for it
(you pasted the wrong UI I believe)
ops
nice, the 16GB one ?
I trained on that for some time 🙂
ok yeah so
there are 4 concept tabs here
ok nice
they improved it
lots of possible ways to do it.
I see. :o
Yeah I picked DB using A1111 cause it's the first results I got when googling.
you can start by just naming the pictures as you like in fact
like naming the file "cat eye (1).png"
and so on
So 4 Concepts means I can incorporate 4 different concepts into the model.
Mhm.
it will take the name of the files
and train on that, as if it was each a different concept
since most pic will be named the same, it's not doing 2k5 concepts though
just as many as you have given different names
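To see how many concepts a folder would turn into under this naming scheme, a small sketch. It assumes the caption is simply the file name with any trailing "(n)" numbering removed; whether a given trainer strips the numbering itself is tool-dependent, and the function name and file names are made up.

```python
import re
from collections import Counter
from pathlib import PurePath

def caption_from_filename(filename: str) -> str:
    """File name without extension and without trailing '(n)' numbering."""
    stem = PurePath(filename).stem
    return re.sub(r"\s*\(\d+\)$", "", stem).strip()

files = ["cat eye (1).png", "cat eye (2).png", "monster eye (1).png"]
concepts = Counter(caption_from_filename(name) for name in files)
print(concepts)
```

However many pictures there are, the number of distinct captions is what counts as "concepts" here.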
Would it look like this in practice?
nope
i recommend doing textual inversion
just put [filewords]
textual inversion, Haven't gotten to learn this one yet. :o
really?
Ah that's one of the training methods is it? :o
not in that field in fact, in Instance Prompt, but yes
yes. textual inversion doesn't make a new model. it makes a file that knows how to talk to your existing model in order to get the eyes you want. you can have multiple textual inversion at once, so it's nice in this case
the downside of TI is that, if the model really doesn't know of your eyes, it won't do great quality on those. but good thing : the model does know eyes quite a lot
that seems like a better solution than dreambooth here
So it takes the names of the image file and uses that.
So if I then use Cat Eyes as a prompt it'll weigh more on those eyes.
?
I'm basing that on the screenshot here, and what is in the different fields when empty <#🔧|finetune message>
i don't use that tool personally so I could misinterpret too
There's no harm in doing name filewords imo
more or less, yes
Since u force db to train the token
I see.
So what if I do things like "Cat_Eye_01"
U don't need an instance token for fine tuning
Will the underscore seperate Cat and Eye?
or just put all pictures in a folder, select all, hit "F2" and type "Cat eye", it will rename all and add numbering
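For those not on Windows, the same select-all-and-rename trick could be approximated in a few lines. This sketch only builds the list of old and new names and does not touch any files; the function name is hypothetical.

```python
from pathlib import PurePath

def plan_renames(filenames, caption):
    """Pair each file with a numbered name like 'Cat eye (1).png'."""
    plan = []
    for index, name in enumerate(sorted(filenames), start=1):
        extension = PurePath(name).suffix  # keep each file's own extension
        plan.append((name, f"{caption} ({index}){extension}"))
    return plan

for old, new in plan_renames(["IMG_0042.png", "IMG_0007.jpg"], "Cat eye"):
    print(old, "->", new)
```

Each pair can then be applied with `os.rename` once the plan looks right.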
Aren't u supposed to tag everything for db
Tag everything remove the thing u wanna train
no, it depends a lot on the approach and what you want to train.
Tagging everything is good for full fine tuning and large style trainings, but bad for basic use
"Cat Eye(1)" I see. 🤔 So what if I have 10 Cat Eye's and then 10 Cat Eyes with highlights in them.
Would having "Cat Eyes(1)" and "Cat eyes highlights(1)"
I see
Do anything there?
Yeah
Each step of the training, a batch of pictures is trained, and the weights of the model move a little. Those changes happen slower or faster, depending on the learning rate you use. This "budget" of changes that could happen on a single step is called Attention, and is split amongst the tokens you used in your caption.
Adding more tokens to a caption then has multiple effects :
- it slows down the training on each single token. This may require more total steps to produce the same results, or to use more trained tokens at once in your prompt later on.
- it looks for the more fitting parts of the picture for that token and associates with it the changes. This means that describing a feature in your caption can prevent that feature from being associated with the other tokens. As an example, if training on a character that has a tie in half the shots, adding the token "tie" would reduce how much of the tie feature is associated with your character.
- it spreads the training on more weights of the model, and reduces the need for regularization.
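A toy illustration of the "budget" mental model from the guide excerpt. It assumes the budget is split equally between comma-separated caption parts, which is a big simplification made here for the example: the real training works on tokenizer tokens and nothing guarantees an equal split. The function name is hypothetical.

```python
def attention_share(caption: str, target: str) -> float:
    """Share of a step's budget going to `target`, assuming an equal split."""
    parts = [part.strip() for part in caption.split(",") if part.strip()]
    if target not in parts:
        return 0.0
    return 1 / len(parts)

# The tie example: captioning the tie leaves less of the budget on the character.
print(attention_share("misterX", "misterX"))
print(attention_share("misterX, tie", "misterX"))
```

Under this simplified model the character token gets the whole budget when it is alone in the caption and half of it once "tie" is added, which matches the first effect listed: slower training per token.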
just check the guide, and the terminology
Will do!
this is at the mid point, so I wrote it with the mindset that lots of concepts were already there
Mhm.
A lot of this shit is trial and error really
yep
Yeah I've noticed!
No one knows for sure what we are saying
We could be bullshitting you for all we know
Maybe the right thing is to tag half
Or to train first half steps tagged and last half untagged
What I can tell is that having a curated, varied and properly balanced dataset is good. 🤔
that is a guide from confirmation bias for sure, but still quite a lot of experiments and independent people training having similar observations.
We could be wrong, but those techniques seem to work at least
Both in Shape, colors etc.
But u cant say they are 99% optimal
Hell I don't even think it's like close to 80% optimal
nah
mostly dataset building
finding bias
understanding what the damn machine understood and why it did not do what you wanted
that's just doing and doing again
and confirmation bias
So here we have two eyes from the dataset, one has a black background and the other is just a transparent png.
Is this something I should watch out for?
All I'm trying to say is don't take anything we say as gospel truth
Btw u probably want to actually train the eye on the cat
transparent backgrounds have sure broken some of my trainings. using colored or white or black ones is a lot better imo
I see
If not it'll never actually spawn without a cat kinda
yeah, the token "cat" is strong, if not trained on it will bring the whole cat
Cause it's the pupil I'm after.
you can name it as you'd like tbh. the weaker the token you choose, the better
to test the strength of a token, do some pictures with just that word
My V1 just spit this out.
if it comes out as something completely different each time, it means there isn't a lot associated with it in the model, and that it should bring fewer things you don't want into your model
Just what I'm looking for.
or look for a token that already brings something that is in the direction of what you want ^^
Now when looking for a token
Does that relate to doing prompt tests on the model checkpoint I use?
you can, yeah, I just described how
but you can also use a list of low weight tokens
those aren't fun tokens usually
Cause that's another part that confuses me, deciding on the source model.
but they are the best token to train without bringing in unwanted inspirations
well, the base model...
think of models like point on a map. some are more in the north, in the realistic direction for example, some are more in the cartoon side, ...
it's more 4D than that but it's the logic
if you train, you push your model in a direction on that map, towards where your dataset would sit basically
so using a model that is already close to your needs means less training
I just did "Cat Eye" on my v1
seems ok to me.
once again, I give tips towards better quality if you feel it's not enough on that side, but doing strong tokens like "cat" will also work, to an extent
oh your information and tips helped me understand a bit how I would progress.
I think I'm going to try and create a balanced and named dataset and see how that goes.
you could name those eyes "boat on a river eyes", with enough training it would take
If I did that
that means when I use my model and type in "boat on a river eyes"
Does that means it will use those eyes with that categorization to generate?
Like in practice, is that how it works?
it would make eyes like the ones you trained that had that name
but it could also make a little of a boat for quite some time
I see, so it is that straightforward in a sense.
and the boat part would come from the base model right?
So for example if I was going to pick a checkpoint
Having something that would work as iris would be optimal 🤔
Civitai browsing time.
I saw that, did some browsing around here. :o
and each week, I make a model out of the participant's submissions
I caption each picture "PoW Style XXXX"
PoW Style is the main token, based on the name of the event
and XXXX is different for each picture
Mhm.
it's a token I give from a list, one to each user
if I then prompt "PoW Style Floaf", I would get variations on the eye you would have drawn
and if I prompt "PoW Style Floaf Guizmus", I would get a variation of a mix of both of ours
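A minimal sketch of that captioning scheme, assuming a made-up mapping from picture filename to the per-user token. The function name `build_captions` and the filenames are hypothetical, only the "PoW Style XXXX" pattern comes from the messages above.

```python
def build_captions(main_token, user_tokens):
    """Return {filename: caption}, one caption per picture.

    main_token  -- token shared by every picture (here "PoW Style")
    user_tokens -- hypothetical mapping of filename -> per-user token
    """
    return {
        filename: f"{main_token} {token}"
        for filename, token in user_tokens.items()
    }


# Hypothetical filenames; the tokens are the two names used in the chat.
captions = build_captions(
    "PoW Style",
    {"eye_01.png": "Floaf", "eye_02.png": "Guizmus"},
)
for filename, caption in captions.items():
    print(f"{filename}: {caption}")
```

Every caption shares the main token, so that token picks up what the pictures have in common, while the second token stays tied to one participant's picture.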
I see. 🤔
Just generating now with the prompt "Eye Texture" and paying attention to the patterns I can see the parts that are weighted more heavily in the dataset.
Especially the Highlights
Three random generations in a row.
Same general highlight, looking at my dataset 20 of the images out of the 80 have that.
Really helps to see it happen.
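A rough way to check that kind of balance before training, assuming each image already has a hand-made list of features. The labels and the function name `feature_share` are made up for the example; only the 20-out-of-80 figure comes from the chat.

```python
from collections import Counter


def feature_share(labels_per_image):
    """Return {feature: fraction of images that show it}."""
    total = len(labels_per_image)
    if total == 0:
        return {}
    counts = Counter()
    for labels in labels_per_image:
        counts.update(set(labels))  # count each feature once per image
    return {feature: n / total for feature, n in counts.items()}


# Hypothetical labelling: 20 of the 80 images show the same highlight.
dataset = [["highlight"]] * 20 + [[]] * 60
shares = feature_share(dataset)
print(shares)
```

A feature that sits at a quarter of the dataset is a likely candidate for showing up in most generations, which matches what the three random generations showed.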
Looking through that thread, those are cool!
you managed to figure out a lot of the process by the way, you understand this quite well already
so yeah, we see the general shape that has started to "bleed" in the model
this is not a real problem
it is to be expected on such a dataset
Yeah the dataset is purely Circle with things in it.
and you aren't making this model to be able to draw picasso on a boat
So probably helps.
CFG at 30
So now it's strictly trying to use the dataset
Full blast.
CFG 0 is special, your prompt is completely ignored
so it shows what the model would do naively
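Why CFG 0 ignores the prompt can be seen from the usual classifier-free guidance formula, sketched here on toy numbers. This assumes the common formulation (unconditional prediction plus scale times the difference) and an empty negative prompt; the exact behaviour can differ between UIs, and `guided_prediction` is a name made up for the example.

```python
def guided_prediction(uncond, cond, cfg_scale):
    """Blend two noise predictions with classifier-free guidance.

    uncond -- prediction made without the prompt
    cond   -- prediction made with the prompt
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy numbers standing in for real noise predictions.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

print(guided_prediction(uncond, cond, 0))   # [0.0, 1.0]: prompt ignored
print(guided_prediction(uncond, cond, 1))   # [1.0, 3.0]: plain prompted output
print(guided_prediction(uncond, cond, 30))  # [30.0, 61.0]: pushed far past it
```

At scale 0 the prompted prediction drops out entirely, so what is left is whatever the model does on its own, trained bias included.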
I can recognize some of the images I used.
Oh I forgot to mention I used LORA for this!
that won't change a lot of things in your case, and will be faster to train and to share with friends
so I would encourage it
it's a little lower quality than dreambooth on its own, but it's a fringe remark
Yeah all I want to generate is a circle, with an iris and pupil with some fibers/color thrown in for good measure.
who wouldn't
As a 3D artist the idea of generating an infinite amount of textures is just fun.
Alright so my first step for V2 is dataset sorting.
I love images.
also
did you know
🐮 Did you know that some cows have been known to moonlight as abstract artists? It's true! These bovine Picassos have been spotted using their tails to create intricate, colorful designs in the dirt. The most talented of these cows have even been known to create what appear to be intricate eye textures, complete with lifelike lashes and irises. Of course, some critics have suggested that these "eye textures" are really just the result of a particularly enthusiastic cow tail flick, but we prefer to believe that these cows are true visionaries of the art world.
(sorry random cow fact about eye textures)
🤔
we just love cows, us mods recently, can't really say why
and I just tend to drop random cow facts custom-made for users
using ChatGPT
Sometimes you don't need answers as to why things are the way they are.
🙏
Last time I played with GPT I convinced him to ignore his filter guidelines and made him write meme fanfics.
Good fella'
❤️
yeah
but i can give you a list of tokens if you'd like, not beautiful ones but tokens
what's that ?
dataset ?
Everything from Highlights, Iris, Full Eye textures, Round Pupils, Cat Pupils, Oval shape, round shapes,
one piece of software I love when manipulating lots of file names: Bulk Rename Utility
handles lots of filenames and manipulations
Yeah I use it religiously.
and czkawka to check for dupes.
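For a quick check without extra tools, here is a small sketch that groups byte-identical files by hash. It is not how czkawka works internally and it will not catch resized or re-encoded copies; `find_exact_duplicates` is a name made up for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(folder):
    """Group files under `folder` (recursively) that have identical bytes."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("dataset")` on a dataset folder (the folder name is an assumption) returns one list per group of identical files, so everything after the first entry of each group can be dropped.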
I usually use photoshop for batching but I think I'll need something faster here.
Need to see if I can process black or white background to all of them.
XNview can do that
Downloading so fast.
same concept as bulk rename utility but for pictures
picture manipulation, lots of basic ones, some filter ones, ...
Sorry but I have a quick question, I have a file in my logs when training via dreambooth named event.out.tfevents, is that something I can analyze? Is that the file that should be used with tensorboard?
yep, that is a tensorboard file usually
My brain is moving too fast, now I'm asking myself can I train it to generate face textures following a UV.
Alright time to sort dataset.
Thanks Guizmus!
you can open a console, activate your python venv, and run tensorboard --logdir="PATH" and it should give out a url to a local UI
Oh that is very helpful thank you so much! I was having issues analyzing the loss value!!
this folder/file is even populated WHILE the training is happening
so you can stop it if you see it doesn't go well
it runs on CPU only
you're welcome 🙂
Oh wow that is EXACTLY what I was looking for :D
well, happy I wrote it then :p that part could use pictures, I just moved it to github
I was about to write a script analyzing the loss values through the csv file xD
tensorboard should already be in any training environment
if not it can be pip installed easily
Once I get tensorboard up and running I can give you some feedback on it if you want to
that would have been painful to discover after then :p
Okay Ill try it via my stable env
Yeah absolutely 😅
always welcome yeah, the guide is about 10 days old. I've had quite some feedback already, but we just added it in the #1080946152318443610 so I'm always happy for more
it's intended for entry level into training but honestly, it's hard not to go too deep when you start explaining
the dataset chapter in particular is a little deep
@unique cloak Original image size is important right, I think I read up on this when I did my first model. having all images in 512 is preferred right?
yes, it's easier to train on the optimal size your model was trained on, so 512 for most models except the 2.1-based ones, which are usually in 768
That is really good, I've been struggling to get coherent information about these topics so the more the better :)
it's possible to train on other sizes and ratios though
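A small sketch of the arithmetic for bringing a picture down to a 512 square: scale the short side to 512, then centre-crop the long side. The centre crop is an assumption and may cut off the subject on wide shots; `resize_and_crop_plan` is a name made up for the example, and the returned box uses the (left, top, right, bottom) layout most image tools expect.

```python
def resize_and_crop_plan(width, height, target=512):
    """Return (new_size, crop_box) for a centre-cropped square of `target` px."""
    scale = target / min(width, height)
    # max() guards against rounding the short side just below the target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


print(resize_and_crop_plan(1920, 1080))  # ((910, 512), (199, 0, 711, 512))
print(resize_and_crop_plan(1024, 1024))  # ((512, 512), (0, 0, 512, 512))
```

Passing `target=768` gives the same plan for a model trained at 768.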
Oh yeah I forgot 2.1 is a thing, I've mostly used 1.5 models. Would doing this as a 2.1 help me? As I have no need for anything crazy in terms of filtering adult content.
I imagine if I did like 1024 this would really push the vram usage. ._.
XL is about to come on #1084896022368624640 so I wouldn't sleep on the size for long either. in particular: stay future proof. keep all your highest definition pictures somewhere
Alright!
nope. it would push the quality down. the models we use aren't extensible there, we are adding data by pushing some away.
you will need bigger models with more parameters
to keep the same quality
Oh so running 1K images is a detriment?
training on 1k size is a detriment on a 512 model
running 1k pictures on a 512 model works but gives duplications without some tricks
My thought process was to have the entire dataset in 512, and the eyes that get the OK we AI upscale.
Oh! I see.
I think I'll just 512 it the entire way then!
like I do a lot of desktop backgrounds recently
and I have a 32:9 screen
so keeping duplications out is really hard
example, we get twice the same kind of picture here
here, it needed to add a second robot head
because it was feeling it had no robot on the right side
I noticed this when I tried to do widescreen images
It almost insisted on adding more people
from time to time I also got great ones though
using lots of extensions to help
there are multiple tricks to it
mostly using hires fix, and describing the background more in your prompt
(I'll stop examples sorry)
I find the problem interesting though. That seems like something controlnet would be able to fix
And I love ur example pics, nice vibes/styles
thanks 🙂 mostly only the girl in the river is mine, the other prompts were mostly the sample prompts on the model page I used I'll be honest
i'm not the best prompt maker ^^
controlnet solves it yes, but it comes at a great cost
those pictures, at 5120x1440, already took more than 5 minutes on a 3090 Ti
using controlnet, I'm looking at 15 minutes from my one try
in the end, I prefer to tweak the settings and run more seeds
but I switched to 1920x1080 for more fun too, now that I found a good desktop to use
xD At least that is somewhat fast, I'm working with a 1070. No way in hell I'm actually going to get anywhere with prompts fast. Which is why I test most of my prompts initially on like a 10 to 15 step set
(model name is "realbite" and it's on civitai)
I can't link to it, it has some NSFW pics too
rule 4 here
Actually downloaded it but I haven't had the time to test it and the other 4 models and 20 loras I found interesting since I'm trying to get better at embeddings
The pace at which all of this is going is so insane. Yesterday i stumbled upon a local text generator and needed to try that out OF COURSE, I'm pretty much hopping from topic to topic right now trying to learn as much as possible
xD
I had quite a lot more time to take it all in, I was there since this became a public beta
Okay That is still more than mine
so every piece showed up once after the other
but everything is just so fun to play with, to tame
it's like a hydra though.
Every time you think you understood something, you find 3 new concepts to try and understand next
I've been following along since it came out on github last year and I've tried so much stuff. Stable diffusion, Automating it, dreambooth, loras, live chat bots with whisper, the text stuff now. Every time you turn around 3 new insane checkpoints have been reached and there is something new to discover
Yeah exactly xD
But it helps that some of the stuff is being used in several projects
yeah. the logic is the same
it's why I made that guide that way
to stay "method agnostic"
there are specificities to each for sure
but dataset making is the same for example
and for diffusion models too, we find the same concepts in each
I haven't played locally with the GPT stuff
only tried some chatGPT online
Would you reckon really going ham on naming the files like Cat Eye Orange (1) would help or AI gets colors? 🤔
I work as a programmer, and use chatgpt as an alternative to google now when it suits it. It took around 10 to 15 hours off what I would have usually needed over the past few weeks. It helped a lot
So having something like that locally is not possible with the amount of data they have, but something similar like a home assistant would be nice
The alpaca models have just come out and what I find most interesting is that you can also use loras, other kinds obviously but the idea/concept is the same
I actually got the tensorboard to work now so thank you so much for the hint!
don't tag color. it's not useful if you keep diversity in color, and can just derail training in most cases
I'm under the assumption that some things the AI itself can figure out, like in this case colors.
Obviously like you said if 50% of the eyes in the set is red, it'll weight more.
don't hesitate to hit us with screenshots of the graphs if need be
AI will understand that color isn't important in the picture if the color keeps changing in the dataset.
If that is the case, then your resulting model will be able to do eyes of all colors, and you can just then add the color you want in the prompt
the thing is, it will understand that what keeps changing is to not be learned
it's all
even if that makes no sense to us/it
so colors are a good example, but like, trees are, mostly everything can be, even lighting, photo grain, ... as long as the dataset is diverse on it
if you only have 3D renders, it will give 3D renders though
brb
Based on your guide I would have guessed that the lowest point in the red circle is the best since it goes up again after that, but it seems like it might have been able to get trained a bit more.
The part in the red square has the lowest value but I guess with around 490 steps that would probably not be enough right?
How close should the value be to 0 to be an actually useful embed of a person's face?
Sorry for all the questions 😅
ok back, I realised it was 11PM and I forgot to eat
Bon appétit!
that seems like a really good reading of your graph yes.
I would still test the early checkpoint, in case, but there is a high chance it's crap. loss value isn't always representative of the real quality of the current model, especially at first like that
the other point around 1750 could be of interest
if the later point was overtrained that is
I can't answer on how close to 0.... It would depend on the method, the exact implementation of the tool, ... and I haven't used all
Even though the loss value is that high?
It's just a simple colab of dreambooth to get some data I can actually use to learn ^^
it's a local low. not the best local low but a low either way.
Usually for me, those are "ok but low quality". but there have been some cases where the red circle was just overtrained, burned outlines. and the higher local low was good
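A minimal sketch of picking out those local lows from exported loss values, so the matching checkpoints can be tested by eye. The numbers are made up, `local_lows` is a hypothetical helper, and a real loss curve is noisy enough that it would likely need smoothing first.

```python
def local_lows(steps, losses):
    """Return (step, loss) pairs that are lower than both neighbours."""
    lows = []
    for i in range(1, len(losses) - 1):
        if losses[i] < losses[i - 1] and losses[i] < losses[i + 1]:
            lows.append((steps[i], losses[i]))
    return lows


# Made-up loss readings, one per saved checkpoint.
steps = [100, 200, 300, 400, 500, 600, 700]
losses = [0.30, 0.22, 0.25, 0.27, 0.18, 0.21, 0.24]

# List the candidates from lowest loss to highest.
for step, loss in sorted(local_lows(steps, losses), key=lambda pair: pair[1]):
    print(step, loss)
```

This only gives a shortlist. As said above, the lowest point can be overtrained and a higher local low can turn out better, so each candidate still needs sample pictures.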
I'll keep that in mind okay!
Finally some graphs and data 🤩
yay !
I took ages before getting into that
and back then I had quite a hard time collecting the information I gave you on lows and highs, and loss behaviour
mostly I was told "loss value is crap don't bother"
Yeah I bet. Even now it is hard to find good information, but your guide is really precise 👍
but I have really been able to correlate graphs to those observations that were also given to me/experiments I made
Yeah I heard that too several times. But I just couldn't believe it. I mean there needs to be some way to actually analyze this stuff without just going: "Hey this picture looks good I bet its a good state"
yep
depending on the tool, they even add more data in the tensorboard
like the sample pictures if you ask for some
Oh yeah I saw that option in my local installation. But this online version doesn't have it. I'd use the online version because it is 10 times faster than mine but my gdrive is full so I can't create a save every 100 steps, which I will need when I actually want to get precise states
I can feel the pain
Good that I had you then to answer those questions ^^ Saved me a lot of time
happy to redistribute info and not let it go to waste
Yeah, randomly hitting a good state is not an option here ^^
training on LoRA would mean smaller save points maybe
Yeah and again thank you so much :D Especially since I don't feel like this part of stable diffusion is going to get a lot easier over time, since only a small percentage of people want to also use it in this way. And with everything that is happening right now I can't wait for the first embed creator jobs to pop up. Or model trainers
Yeah that is going to be my next step. But I want to actually finish one good embedding before moving to the next thing again :)
there is a LOT of professional demand for this though. people want to be able to build a pipeline to, for example, cloth people with a new thing, or show a new car in real scenes
but for the random user, going into this in the first few days I wouldn't advise
Yeah exactly, I actually work at a company where we would greatly benefit from this but it is a slow process to get to these system states, especially since we are only a really small company and I'm the only person as invested in it ^^
Yeah absolutely not. My boss and our designer have been using midjourney until like a week ago. Now we have a stable diffusion setup on one of our beefy pcs

