#finetune
then resume training
it might spike the VRAM about 0.5 GB extra for a little, so keep that in mind
I mean that I can't get the sample images to look anything like mine. The person is only slightly recognizable
another thing to try is your initialization text. If it's describing your character, then it might work better
little girl
and I like the method where the text files (in your dataset folder) are only describing the scene, the stuff you don't want the AI to remember. I'm still testing if that works better or not, but it seems to be pretty good!
Here is the sequence
It's the beginning
Then it will start looking worse. Not even close to the intended person.
I set it to 5.
if your vector count is like 1, then this is to be expected
yeah 5 might be far too low
could try 16
Tried 8 and 10. Same result
And you're using [filewords]?
yes
Two methods for filewords: there's the "describe everything" method and the "describe everything you DON'T want the AI to learn" method
(though I'm still figuring out which works best atm, the second seems to be working a little better)
I just used BLIP. The captions look normal
how many images?
tried 8-30.
you generally don't need too many, just high quality. Almost all being close-ups from various angles, lighting, backgrounds, clothing and then maybe a few upperbody
anyways, I'm off to bed
I know. I did deepfakes before. All the images are completely different in everything (distance, clothes, background, expression etc)
For example here
#1026789870884102226 message
a guy tested 18 settings with TI. What they all have in common is that the result really looks like the source, no matter what settings he used. The difference is subtle, but overall it's the same face.
His settings are described and I just followed them, but I'm not getting any similar result at all.
Hello, I was trying to find a guide on how to fine-tune a model using images with different aspect ratios, but all I get on Google is dreambooth and textual inversion. Is there a good guide on how to actually do it?
Maybe something like that?
https://github.com/devilismyfriend/StableTuner
Does anyone know a colab for it?
Does anyone have an idea how to create an SD 2.1 512x512 model with
https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast-DreamBooth.ipynb?authuser=8#scrollTo=1-9QbkfAVYYU
I can create it
but when I try to use it in Automatic 1111
I get an error like "dimensions or tensors don't match"?
not everyone or most, and you're still using a subject/IP you don't own
so i disagree there
you disagree and think it's wrong to create a likeness of a subject you don't own the rights to, in the context of fan art?
if that's the case, then I simply disagree from both a moral and a legal standpoint (as long as it's fair use / you're not profiting off it)
as far as I know, you can profit under "fair use." Tho I haven't checked whether that's really true :P
it's muddy. I think if it's seen as a "substitution" and isn't a parody, and the IP owner doesn't permit it, then they could be issued something like a cease and desist or a demand for payment (or something like that)
if it doesn't fall under fair use
as long as it's transformative, then it's okay I'd say. And I don't think the IP owner has any right other than doing something if it isn't fair use
ye ye
And, as a side note, Infringing happens all the time in the art communities, after all 
not saying it's right or wrong, but that it happens a lot
you're using the wrong yaml file
thanks
[Epoch 30: 1/1]loss: 0.3550469: 0%| | 30/100000 [00:46<41:55:48, 1.51s/it]
no i think its ok
i am saying fanartists who are against ai art are hypocritical
people ask me "how can it be that you're paying hundreds of euros for training a single model?"
me:
(i test a lot)
ooo fair, totally agree to that extent
I also test a lot
just spamming TI to learn what everything does
How many images did you train on with these settings?
7 images. And then i did a different face with also 7 images, but that time I went up to 8000 steps with good results
that's not much, i'll give it a try, thanks
how do i train my own models with dreambooth? i looked at an online guide on dreambooth but it was outdated, i have no idea what i'm doing
Here's a good guide! This information is still relevant https://github.com/nitrosocke/dreambooth-training-guide even if some things are a little outdated
hi, are the embeddings here #1047197565365538826 generated with textual inversion?
yosh
there are a couple in there though that are generated via hypernetworks (auto1111's repo)
okay thanks, if i want to have photorealistic images eg of houses, no styles, no cartoon etc, what would be more suitable, textual inversion or hypernetworks?
Hi everyone
I was wondering if is it possible to fine tune the depth model, or a way to use it with 2.1
is it possible to fine-tune multiple classes at once? like if I wanted to do a model with me and my partner captured, or multiple pets
Have you guys found a good way to dreambooth on SD 2.1 768? Is there any update on joe pena's repo (it seems to be on 1.4)? I didn't get good results with automatic or last ben, but maybe I'm doing it wrong. Any good resources or tutorials?
i tried finetuning with dreambooth on SD 2.1 at 768 resolution but didn't get it to run properly (like a completely brown image, no object or details). I only got it to work at 512 resolution on SD 2-1-base
i think its possible. The ShivamShrirao notebook allows you to specify several concepts
any tutorial?
i have the same problem, i am on the ShivamShrirao notebook, how did you solve it? it works for sd 2.1-base with 512 resolution though
Yep, I got these errors when my pictures weren't at the right dimensions
so it doesnt resize/crop the images to the correct size during training?
I dont know about this notebook
First time training an embedding, but how long does embedding training usually take? 100k steps, embedding learning rate 0.005, 8 images. It says ~7 hours but dreambooth took me 10-20 minutes. Just wondering if I'm doing it right
I too have a lot of questions about the parameters for creating embeddings in auto1111; everything I have made has been very poor
Has anybody trained using LoRA? i cant find a single tutorial about it
How many images is too much for an embedding? I want to imitate the style of a game's background, so I could potentially take a whole lot.
hi, has anyone managed to train an embedding in the webui for SD 1.5?
getting a weird error, it stops right after start and says it finished training at 0 steps
welp, it just did the same for me :(
TypeError: can only concatenate list (not "NoneType") to list
Edit: solved, I mistook what model I had switched to
mmm the "finished training at 0 steps" happened when the training template didn't contain [name]. I think there was another thing that caused it, but i forgot
Using the same embedding name as something previous, i think?
to make sure the new created embedding is unique
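For the missing [name] case above, here is a rough sketch of how the `[name]` and `[filewords]` placeholders in a template line get filled in for each image. This is a simplified stand-in and not the webui's actual code; `expand_template` and the example strings are made up for illustration.

```python
def expand_template(template_line: str, name: str, filewords: str) -> str:
    """Fill the [name] and [filewords] placeholders of one template line.

    Hypothetical helper: it refuses templates without [name], mirroring the
    failure mode described above (training has nothing to attach to).
    """
    if "[name]" not in template_line:
        raise ValueError("template line has no [name] placeholder")
    return template_line.replace("[name]", name).replace("[filewords]", filewords)


# Example: the caption text would come from the .txt file next to the image.
prompt = expand_template(
    "a photo of [name], [filewords]",
    name="my-embedding",
    filewords="standing in a park, overcast light",
)
print(prompt)
```

Checking your template file for `[name]` like this before starting a run is cheaper than finding out at step 0.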
this is the first time I try. I also tried some of the standard templates to make sure I didn't break something and same error
never downloaded embeddings either
Have people gotten TI training for SD 2.1 working on Automatic1111 without using --no-half yet?
you could get good results waaay earlier. I think the 100k is just a large number so you can leave it running for however long, and just interrupt when the pictures start resembling the subject, kind of thing
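To put numbers on that: a small sketch of the arithmetic, assuming the "~7 hours for 100k steps" estimate from the question and a made-up early stopping point of 5,000 steps. `eta_hours` is a hypothetical helper, not anything from the webui.

```python
def eta_hours(total_steps: int, seconds_per_step: float, done_steps: int = 0) -> float:
    """Remaining training time in hours at a constant step rate."""
    return (total_steps - done_steps) * seconds_per_step / 3600


# "~7 hours for 100k steps" implies roughly this many seconds per step.
seconds_per_step = 7 * 3600 / 100_000

# If the samples already look right after 5,000 steps (an assumed number),
# interrupting there costs only a fraction of the full estimate.
print(f"{seconds_per_step:.3f} s/step")
print(f"{eta_hours(5_000, seconds_per_step):.2f} h to reach 5,000 steps")
print(f"{eta_hours(100_000, seconds_per_step):.2f} h to reach 100,000 steps")
```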
do you happen to have screenshots of the "create embedding", the embedding folder, "Train" and the console?
might help 
could I pm you those?
it's got my name in the file paths
Hello Guys,
Will removing the background from the images improve the training?
it can be the opposite, actually
varied backgrounds being better than just one
I suppose unless you want a specific background, in which case having the same one could be better 
like a bright pink for easy future background removal
I might try that 
My Stable Diffusion v1.5 model trained on the artstyle of "Darkest Dungeon" is now officially released!
Find all information about the model, plus example generations and prompts as well as some information on the training and underlying dataset of the model, here: https://huggingface.co/ai-characters/Darkest-Diffusion
Any estimate on how long this will take? It was my first time clicking the Train Embedding button (auto1111)
Hi all, I'm trying to learn how to use Textual Inversion, and when I load the colab I get this error
Does anyone know how I can fix this?
Or can someone recommend another online space where I can train a Textual Inversion model?
The answer was ~1 hour
Are there any good guides for TI training on SD 2.x? My settings from 1.5 don't seem to work very well
Dude. You should not be posting pictures of your child here
https://huggingface.co/jkcarney/romcom-diffusion-1.0 - released my first Stable Diffusion fine tuned model publicly! It's trained to generate images in the style of 1990s romcoms (like American Pie, 10 Things I Hate About You, etc). I fondly called it "RomCom Diffusion"!
does using images with transparency backgrounds mess with the training?
so i don't know if this is happening for anyone else: but when i train dreambooth models with my face (and my friends' faces too), the closeups photos generated look great but the fullbody photos are just horrible. I already turned on "restore faces" and i have included full body shots in the instance images. is this happening for anyone else?
can you train an already trained model?
yes, its called fine tuning
There is very good Midjourney artwork. Instead of taking artists' work, it is possible to take those images and train on them. What is the issue with that?
Stable Diffusion is bad at capturing faces in wide-angle images, and inference also produces distorted faces. You can try inpainting
i did this test a while back to see how faces perform through the VAE as they become smaller. As the face becomes smaller, the resolution in the latent just becomes too small to encode the details and it just turns into a blob
this was for 512x512
that also means if you use images during training where the face is too small you are basically teaching the model that your subject looks like a blob
i see. thanks!
does this mean if we make the ultimate picture big enough (e.g. 2048*2048) it might be possible to do full body shots while preserving the face?
ye, what matters in this case is the resolution of the actual face part of the image, so the larger the image resolution, the smaller the face can be relative to it
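A back-of-the-envelope sketch of that point. Stable Diffusion's VAE downsamples by a factor of 8 per side, so the face ends up with 1/8 of its pixel width in latent cells. The 10% face width is just an assumed example for a full-body shot, and `face_latent_size` is a made-up helper.

```python
VAE_DOWNSCALE = 8  # SD's VAE maps each 8x8 pixel patch to one latent cell


def face_latent_size(image_px: int, face_fraction: float) -> float:
    """Width of the face in latent cells for a square image.

    face_fraction is the share of the image width the face takes up
    (an assumption you measure or estimate yourself).
    """
    return image_px * face_fraction / VAE_DOWNSCALE


# Same framing (face = 10% of the width) at two output resolutions.
for image_px in (512, 2048):
    cells = face_latent_size(image_px, 0.10)
    print(f"{image_px}px image: face is about {cells:.1f} latent cells wide")
```

Only a handful of latent cells across the face leaves very little room for detail, which matches the "blob" observation; at 2048 the same framing gets four times as many cells per side.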
Bad idea, it thinks that being without a background is normal
So you would always get the char without any background (screwed up lots of embeddings)
has anyone tried this repo? https://github.com/devilismyfriend/StableTuner
I used the fast dreambooth colab to train one face before and it turned out great, but I'm wondering: is it possible to train many people's faces and many art styles in one trained model?
I tried the dreambooth colab, the hugging face repo and automatic 1111 on SD 2.1 768; none of them gave the same quality I had previously with the same source images (resized to 512) on joe pena's repo with the 1.5 model.
Is there any colab for that? I'd like to try.
which one?
The best one
joe pena's? I tried it on runpod, it's not free but really cheap, and you can have a 24 GB GPU
Dreambooth is Google's new AI and it allows you to train a stable diffusion model with your own pictures with better results than textual inversion. Dreambooth is originally based on Imagen text-to-image model and this technology makes it possible for you to insert any character (yourself, your friends, your family), object or animal you want in...
How many images and steps did you train with, and how long did it take?
12 images, and it ran for an hour I think. I trained it a few weeks ago, don't fully remember
I trained 40 images for 3000 steps and it took 1 hour
Now I want to train many people's faces and many art styles in one training run, but I'm not sure if it will work.
with this you can train more than 1 person: https://github.com/TheLastBen/fast-stable-diffusion
if you're training 2 people then the minimum steps should be 6000 (3000 for each person)
I believe I use that one.
i guess that you can then merge a human trained model with an artstyle model
How? Copy the ckpt link into the custom ckpt download link?
yes i do that all the time
Really? How did you do that? How do I rename the images?
let's say you got john and sarah, then the images should be john(1).jpg, john(2).jpg....john (20).jpg, and sarah's would be sarah(1).jpg, sarah(2).jpg...sarah (20).jpg,
oh don't use generic names like john and sarah btw
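If you don't want to rename by hand, here's a minimal sketch of that naming scheme. Assumptions on my part: only `.jpg` files are handled, the `token (n).jpg` spelling with a space is used (the message above shows it both with and without the space, so check what your colab expects), and `rename_instance_images` is a made-up helper name.

```python
from pathlib import Path


def rename_instance_images(folder: Path, token: str) -> list[Path]:
    """Rename every .jpg in `folder` to '<token> (<n>).jpg', numbered from 1."""
    images = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".jpg")

    # Pass 1: move everything to temporary names, so a final name can never
    # overwrite an image that has not been processed yet.
    staged = []
    for index, path in enumerate(images, start=1):
        temp = path.with_name(f".rename-tmp-{index}.jpg")
        path.rename(temp)
        staged.append(temp)

    # Pass 2: give each staged file its final '<token> (<n>).jpg' name.
    renamed = []
    for index, temp in enumerate(staged, start=1):
        target = temp.with_name(f"{token} ({index}).jpg")
        temp.rename(target)
        renamed.append(target)
    return renamed
```

Run it once per person with a different, non-generic token each time, e.g. `rename_instance_images(Path("dataset/person_a"), "zwxjohn")` (both the folder and the token are placeholders).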
How about style?
are you talking about the concept images or?
Like anime or cartoon
Not really, I'm just wondering.
i never trained new art styles before i'm sorry
It's okay. Maybe I will use textual embedding for applying a new style.
guys has anyone run into this error when running dreambooth?
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
size mismatch for model.diffusion_model.input_blocks.1.1.proj_in.weight: copying a param with shape torch.Size([320, 320, 1, 1]) from checkpoint, the shape in current model is torch.Size([320, 320]).
Hello Guys,
Is there a prompt or something to remove these kinds of pixelated pictures?
Neg prompt : jpeg artifacts, pixelated, jagged edges
Maybe
is there a way to fix an overtrained model? or just delete it and start over again?
You can add the token further in the prompt, you could turn the CFG down 
starting over might save you the prompting headaches though 
training the instance image token name without adding photos?
@unborn wind A question if you can answer. How do you use styles.txt file while training ?
v2.1 using dreambooth
Hey guys! I've been playing with dreambooth for a week, and I start to have decent results.
Do you have any good resources or tips on how to correctly use prior preservation loss? For example, is it a good idea to have a classifier prompt that's a lot longer than the prompt to train on?
Would really appreciate someone helping me unblock on next steps.. I have a highly curated data set of about 12k images with highly accurate descriptions. These are product photographs where the technical specifications have been translated to natural language. So there is industry specific language that I'm trying to capture in the model, hopefully it can pick up on if there is a certain type of widget in the product versus ones that have a different type of one..
My question is, what is the best way to train the model to learn a category of products such as bicycles? I have 1-6 images per product.
Vlad Kushkin - 19/12/2022 14:58
let's say you got john and sarah, then the images should be john(1).jpg, john(2).jpg....john (20).jpg, and sarah's would be sarah(1).jpg, sarah(2).jpg...sarah (20).jpg,
oh don't use generic names like john and sarah btw
i think if you train "a photo of an object" as the instance prompt, with images named according to what you want (bicycles) in your dataset directory, you will have the desired result at the end.
needs testing first.
please, I would like to know your settings, because since the merge I can't get good results with dreambooth anymore :/
The merge? What are you talking about? I'm using Automatic1111, with prior preservation loss, on SD v1.5
Can the name have spaces? Like "roman gladius sword (XXX)", "spanish rapier sword (XXX)", "conquistador sword (XXX)", "15th century japanese katana (XXX)"?
Or how do I train SD for such things if it's not possible with Dreambooth? Something like how TI uses prompts from txt files, but training the full model with a ckpt at the end.
thx for letting me know :), this is what i get launching my old auto1111; there was a merge it seems :/ i have a more recent repository where the behaviour is not the same
Got a question for tagging on a HN training
Let's say an image I used had the character in armor, and the tags say that, but I don't care about the armor, just the facial features of said character. Should I keep the tag or get rid of it?
Keep the tag
Is there any general purpose trainer like EveryDream but capable of running on 16Gb GPU?
The DreamBooth extension for Automatic1111 can be used for both caption training and dreambooth style training
and they have a lot of the optimization settings that make it run on lower VRAM usage
Anyone familiar with hypernetwork training? Generating images fine in auto1111, but black squares on the hypernetwork sample generations.
does anyone know a style transfer for SD that works like the one from Midjourney?
with 2 image files
what is the best way to train the 2.1 model right now?
By caption training you mean TI? It doesn't work for faces. At least I couldn't get good results from it. The Dreambooth implementation doesn't have captions. That's why I asked about a general purpose trainer that trains the whole model.
And the DB extension for Automatic1111 doesn't work for me in colab. I'm getting errors after installing it. It just fails to start and doesn't appear as a tab.
The DreamBooth implementation added [filewords]. Filewords is just another word for "captions".
https://github.com/d8ahazard/sd_dreambooth_extension#filewords
@clear lion
And yeah, it trains the model 
Oh, it's not working for you in colab?
hmm
might be better to just use the direct colab version instead
just use the [filewords] type of training, instead of the classic DB style, for caption training
TY. I used another one TheLastBens version
yeah, that one seems to work differently
How many steps should I pick for a character dreambooth?
Heard it's 800-1200 ,but that sounds too little right?
You honestly don't need that many most of the time
800-1200 is about right 
some might do like 2k but
How do I make this work? Should I rename all the prepared concept images with the same name as the instance images?
It's from TheLastBen fast dreambooth colab.
Hi everyone, I'm trying to fine-tune an SD model with dreambooth to generate cartoon animal drawings. The finetuned model does a great job even with some animals that are not in the training dataset, like octopus or deer, but somehow has a hard time generating good-looking cartoon snakes or crocodiles. I tried adding 100 regularization images consisting of different cartoon animal images I found online, but still no luck. Any ideas or hints?
Name or the resolution of the concept images don't matter.
Is that for giving more variety to the pose of trained character human or only the environment around it?
It depends on your concept images. If it's just characters, then it helps with how the model imagines the character. Try it out with different concept images
So it can be landscape or anything?
Hey. Anyone with experience in using images with transparency in training? I've found some examples of this, but they mention a problem of alpha halos around the trained things.
I'm trying to make a sort of a "things in cages" embedding and DB. To test it out.
What's the best way to train a subject into v2.1 right now?
don't know why you would use something else than v1.4/v1.5 tbh
I tried 2.0 and 2.1 and got very bad results compared to the former
I have had great success with Fast Ben's without text encoder training. Then again, I've just been finetuning tokens. Like giving a more accurate definition for "Welding" and such terms.
Adding new tokens with text encoder training is an art form of its own, but overwriting you can do with just Unet training
Also TI is very powerful if you do the init text correctly and have a good dataset, especially for styles
It is highly recommended that you check the LAION database for similar images and check the awful SEO/clickbait titles for good terms to describe your thing.
A shirt is not just a shirt in the dataset it is: "High Quality Cotton Mens Boys Fashionable Trendy Organic Breathable Soft Smooth OEM China Women Lady Girl Bodywear"
Thanks to SEO/Clickbait level google manipulation from Amazon/aliexpress/baba/indiamart/ebay/wish.... etc
Oh yeah, I mean specific people training and such, last Ben gave me very mixed results with default options and 15 images.
What sort of settings were you using? Because you need to be careful with training of the text encoder. Basically keep it as low as you can, just enough for your unique token to get into the model. Otherwise it starts to affect other things in the model.
Also you can train UNET with that unique thing, then fetch it with an embedding.
But this required you to train both.
Basically the defaults as of right now and the subject was very distorted.
Yeah there is the annoying thing. The defaults aren't good, but like "most likely to be good". Every subject and dataset needs manual adjustment
And if it isn't working out, you have to consider your dataset
I noticed, Shiv's was giving me better results overall at least back in v1.5 so it may be harder to help up a 2.1 model at least as of right now with good accuracy.
Well... Yeah, since 2.x is totally different to 1.5. 1.5 might have been easier to train, but it couldn't do 1024x1024 without hiresfix or other such tricks.
Also you can train 2.1 with 1024x1024 if you REALLY want details
All the issues I have had with training 2.1 I have put down to me treating it as if it is 1.5, which it is not.
so how would you recommend approaching 2.1 as of right now with ben's colab for human subjects?
Is there any way to gauge if my dreambooth model has overfitted?
Some of the samples generated are very close to actual training data
that kinda hints at it, or not having enough training images 
or possibly depending on the prompts you're using to test
You can save it with lower CFG, but that only saves you for so long
Anyone know how I can extract key frames from video within a specific start and end time? My search for specific training images have thus far only turned up videos with the image content I need
-ss 00:00:10 -t 00:00:01
supposedly.
-ss = start here
-t = run for this long
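A sketch of the full command built around those two flags, wrapped in Python so it can be looped over many clips. `-ss` and `-t` are from the messages above; `-skip_frame nokey` (decode keyframes only) and `-vsync vfr` (don't duplicate frames to fill gaps) are my additions, and newer ffmpeg builds prefer `-fps_mode vfr` over `-vsync`. File names and the function names are placeholders.

```python
import subprocess


def keyframe_command(video: str, start: str, duration: str, out_pattern: str) -> list[str]:
    """Build an ffmpeg argv that dumps keyframes from one time window."""
    return [
        "ffmpeg",
        "-ss", start,            # start here (seeks in the input)
        "-t", duration,          # run for this long
        "-skip_frame", "nokey",  # assumption: only decode keyframes
        "-i", video,
        "-vsync", "vfr",         # assumption: one output image per decoded frame
        out_pattern,             # e.g. "out%05d.jpg"
    ]


def extract_keyframes(video: str, start: str, duration: str, out_pattern: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(keyframe_command(video, start, duration, out_pattern), check=True)


# Only prints the command; call extract_keyframes(...) to actually run it.
print(" ".join(keyframe_command("episode01.mkv", "00:00:10", "00:00:01", "out%05d.jpg")))
```

A 1-second window may contain no keyframe at all depending on how the video was encoded, so a longer `-t` is probably safer.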
https://github.com/cloneofsimo/lora
Does anyone know about LORA?
Ever wanted to have a go at training on your own GPU, but you've only got 6GB of VRAM? Well, LORA Dreambooth for Stable Diffusion may just be the thing for you! Faster! Smaller! More Better!
:)
- Links! *
LORA - https://github.com/cloneofsimo/lora
Dreambooth Extension - https://github.com/d8ahazard/sd_dreambooth_extension#installation
Stable ...
good introduction to it 
it's using the extension version though, so it's a tiny bit outdated
but they go through the various settings and give a better idea of options
Whats the best anime model as base model?
Hi, I'm trying to preprocess images with settings like these:
My problem is that I run out of RAM (not VRAM)
And I have no idea why
I got 32GB total
here's my memory usage, I just started the webui at the marked part
I have heard that even 8GB should be just fine for TI embedding learning but my 11GB card throws nothing but errors
Tried to allocate 6.33 GiB (GPU 0; 11.00 GiB total capacity; 5.64 GiB already allocated; 3.44 GiB free; 6.00 GiB reserved in total by PyTorch
with this enabled
Is 8GB VRAM enough to train on 768x768 images?
I'm using the v2.1 checkpoint and it's 768 already
It is working now but incredibly slow
I can't even train 512x512 images on 11GB without errors
are there any good services for embedding training? I used OpenArt for model training but I would prefer training embeddings for more modular usage
It isn't easy, but step one is to use as little text encoder training as you can. Anything that influences how OpenCLIP works too much will mess the whole model up.
So if you can, consider finding a nonsense token that already exists, and just train the UNET on that.
HOWEVER Ben does have a beta colab which allows descriptions
Go to the github of Fast Ben's and then switch the branch to captions.
The reason 1.5 was "easy to train" was because the CLIP was absolute fucking mess. There was lots of margins to organise within it.
OpenCLIP that Stability trained is more "clean", so there is less nonsense that you can take use of. But because it is so clean, messing it up bit too much will lead to a disaster
I like to imagine it as if welding thin stainless steel. Correct settings and technique and it is just a joy and great show of skill. Slightly bad settings or technique and you are going to be fucking miserable, but you can still get something done,
The fucking annoying bit is that because it takes so long to get results, iteration is really hard.
But that is just the nature of ML work, isn't it. Like when I do embeddings, I leave it and come back 1hr later. I can just about do 2-3 iterations a day. But with enough notes and analytical thinking you'll get it
Like for example I spent like... 5 days fiddling around on Auto's extra tab, just to find the best balance of upscalers to pump up pictures for further correction in Photoshop. Both source images for training and outputs.
My current setup for training images is R-ESRGAN General 4xV3, then Upscaler 2 set to Nearest, with a 0.5 mix of them. Then I take them to Photoshop for colour and cropping, and also use Photoshop's neural photo correction filter.
After this I add just a bit of blur and gaussian noise, scale the whole dataset to the desired reso (768x768), and drop it into the colab.
When training anything, it is critical that you remove EVERYTHING unrelated and undesired from the pictures.
Unless you are training specific person where details matter, then every picture doesn't need to be absolutely perfect and in focus. As long as all the general elements are spottable.
Also! Here is a good thing I have noticed! @ripe sleet Have pictures of your subject at different scales. As in a close-up, then further away. On a neutral background preferably. If you don't have such pictures, just make a few in Photoshop by taking your subject and sizing it smaller onto the picture.
This helps at least Fast-Bens to realise that "This thing can also be further away!"
And lower quality seems to allow it to realise it can be in different quality.
This I learned by accident in trial and error
I'm trying to figure out how to do LoRA training on the Huggingface space. It says it has been trained, but I don't know if it's able to use it inside the space
anybody know of a way to batch remove watermarks and subtitles from images (video is also ok)?
ok heres an example
https://cdn.discordapp.com/attachments/1044638177140412446/1055863929827491930/out00010.jpg
i need the logo top left and the subtitles down below removed
but for many images
the old north korean animation has no watermark and no subtitles
i love it
i kid you not the old north korean animation looks so much better
this just proves again that 2d > 3d animation smh
Not something that SD is really suited for. This is more a computer vision task: detection, then erasing. Photoshop can actually do this natively with content-aware fill and batch. If the watermark is always in the same place, you can actually just stack the frames and batch them through.
i just went with batch cropping now
Is it preferable to train 2.x embeddings on ema or non-ema models?
what is the best scheduler for training textual inversion?
the colab notebook at sd_textual_inversion_training uses DDPMScheduler. is it any better to switch to EulerAncestralDiscreteScheduler?

non-ema is supposedly better and ema is better for inference, but your results may vary depending
Also not sure why they didn't release a "full" (ema+non-ema) and not sure why they separated it like that but... i digress
my north korean animation model has started training. Any claims that this is just another distraction for me to keep delaying the 2.0 Korra model are a lie!
600 training pictures in a... well quality.
two series, both in old and new artstyle.
I downloaded about 25GB of data from archive.org (there were dvds of the series and someone bought them and digitized and uploaded the video material, albeit in not so good quality and very low resolution) and then from the first episode (old artstyle) and last episode (new artstyle) of the two series I extracted frames
However, captions are kept very simple and formulaic because I didn't want to put so much effort into such a joke model. also I still have to finish korra v2.0.
I don't know if these simple captions aren't perhaps too simple and the training won't deliver good results in the end because of that, but we'll see.
the model will of course be called "greatest-diffusion".
I have noticed that as well, generally the idea is to keep things you want seen more steady across all samples while making anything you dont want less consistent, backgrounds ideally should be different from image to image to make the model focus on the subject as the "common thing" to keep in mind so to speak.
does anybody have any thoughts about whether it would be ideal to train individual tokens for each variant of an object such as clothing, or train one token for the common object and then modifier tokens which are paired with it. Such as design1 + high_neck_shirt, design2 + high_neck_shirt, design3 + high_neck_shirt, etc, or do each one as its own token?
If you have a bit of scripting/coding skill - there is https://github.com/mindee/doctr - script it to find the text, creating a bounding box at the location, and then script inpaint to fill it
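The middle step of that idea (boxes to inpaint mask) could look roughly like this. It assumes the OCR step hands back boxes as relative `((xmin, ymin), (xmax, ymax))` coordinates in the 0..1 range, which is my understanding of docTR's geometry output but worth verifying. The OCR call and the inpainting itself are left out, and `boxes_to_mask` is a made-up helper.

```python
import math


def boxes_to_mask(width: int, height: int, boxes, pad: int = 2) -> list[list[int]]:
    """Build a binary mask (255 = inpaint here) from relative text boxes.

    boxes: iterable of ((xmin, ymin), (xmax, ymax)) with values in 0..1
    (assumed format). pad grows each box by a few pixels to catch
    anti-aliased edges of the text.
    """
    mask = [[0] * width for _ in range(height)]
    for (xmin, ymin), (xmax, ymax) in boxes:
        x0 = max(0, math.floor(xmin * width) - pad)
        y0 = max(0, math.floor(ymin * height) - pad)
        x1 = min(width, math.ceil(xmax * width) + pad)
        y1 = min(height, math.ceil(ymax * height) + pad)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 255
    return mask


# Example: a logo in the top-left corner and a subtitle strip near the bottom.
example_boxes = [((0.02, 0.02), (0.20, 0.10)), ((0.10, 0.85), (0.90, 0.95))]
mask = boxes_to_mask(64, 64, example_boxes)
print(sum(cell == 255 for row in mask for cell in row), "pixels marked for inpainting")
```

The mask can then be saved as a grayscale image and fed to whichever inpainting tool you script.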
polysemy is when the same word means different things - best to train each concept individually
with its own token
otherwise the concepts can bleed into each other with a model such as CLIP
BERT the neural network can separate out the various meanings more easily
In this case it would essentially be the same thing, but with modifiers, such as a material or color
But yeah I suspect separate tokens would be better
well in that case you can train a red silk blouse
and the color would go in red, the silk material property would get mapped to silk, and the shape information would be mapped to blouse
it wouldn't provide any benefit to do red_silk_blouse likely - unless you have different shades of red
and different silk textures
that need to be kept distinct
It's just that then if you want other tokens to sometimes modify the shared concept it gets harder, e.g. you might have an extra long neck token which combines with the high neck shirt token sometimes, so having one token which is modified by variation tokens (such as materials and colors) seems better in that case
shape modifiers are more difficult - and should likely be trained separately
long_sleeve_shirt and short_sleeve_shirt
should be different concepts
(it can learn long and short, but they are such varied modifiers)
(that it is easy to confuse the model)
or long_sleeve shirt and short_sleeve shirt
also depends on how specific you want the concepts - if you are doing it for a fashion catalog
from a specific company - i'd train each separately
Yeah that would be a fairly simple case I think, but then if you've got say a costume helmet, which is also sometimes damaged in some scenes, it gets less clear whether you'd want to train one costume token and then also a damaged token which can sometimes precede it, to be used in combination. Then you might also have a raised visor token, rather than trying to train each concept separately with so much crossover of the core concept
since the preexisting concepts might be different enough
also realize that most preexisting concepts in CLIP have text they are pointing to also
The current plan is to train each token in textual inversion first, then insert into the model
depends on the fidelity you want also - do you want your specific damage or the concept of damage in general?
to minimize destruction to the unet as the finetuning happens
if you want it damaged to stick to your concept, then it needs to be trained on your token more specifically
if you want to use the concept of damaged, then you might want to first train the undamaged image, and test to see how effective it is
with just prompting
Probably going to take a lot of experimentation to know for sure, but yeah both approaches seem valid to me in theory
actually the 4 elements model somebody made for Legend of Korra might answer my question. They used different tokens for the hairstyles and outfits which could be varied along with a token for the character, and a token for the art style.
So they acted like modifiers to the same core 'concept', which was the Korra character, but were all learned together https://huggingface.co/ai-characters/4elements-diffusion
If you are training a TI embedding, you can just have the AI ignore the background COMPLETELY by just masking the irrelevant out and replacing it with plain colour. Then describing it in the filewords like "Black background". As long as it isn't vague at all, the AI can 100% ignore it.
However I recommend choosing a colour that isn't in any way relevant to the subject. As in, a white shirt should be on black or a primary colour.
Black shoes on white... etc.
oo that's a good idea
To generate images from a data set of 700k images, how many epochs should I run and also should I start from a pretrained_version or start from random weights?
Alright, the captions were too simple. The model works okayish (depends on the prompt) but could work a lot better, too. So back to the drawing board. Model thus won't be released today 😦
in the meantime i am training a model based on this post: https://www.reddit.com/r/StableDiffusion/comments/zttpbf/would_love_to_see_you_guys_recreate_my_artstyle/
Cool post and I bet he'll gain followers and some exposure for inviting people to be inspired by his work. More than he would otherwise.
Would you say that embeddings are more worthwhile to train when you want to bring a specific real-life human subject into the scene with 2.1?
500 repeats and counting: I am just unable to reproduce the likeness. Crazy.
You can't do specific human subject with TI. Unless you think the model has data of that specific human in it.
To get the likeness of a specific person you need a Dreambooth implementation. HOWEVER you will get the best results and editability with DB first, then using an embedding to reinforce it. But in this case, as far as I know, you must ensure that the DB model is in the sweet spot region where the Unet has the images in it, but they're not dominating it.
Training methods are not exclusive, and you shouldn't think of them as such. They should be thought of as complementary. Especially with 2.1.
Example: You can add a great variety of... I don't know, types of Diapers (I'm still on my quest to make the perfect Drump as an angry toddler in nappies without using DB - why? Because why not, no one has given me a better challenge yet). However if you want to get a specific nappy from the set, you can TI train to get that specific one.
So in a sense you can just Unet train things and not touch the encoder, then TI those things out of it. This way you can keep the model integrity as high as possible without messing up the text encoder or Unet, but still get more specific results.
Docker image for fine-tuning stable diffusion - https://hub.docker.com/r/wolfgangmeyers/sd_finetune
Hmm how would you test for that? Also could you explain the difference between text encoder and unet and what it means to train one over the other?
Why can't I train an embedding on colab? It always gives me a bunch of error lines.
I think I give up.
This is the best I got after a day of experimentation. SD seems to have great difficulty with that artist's artstyle.
how can I initiate image generation with the wanted prompt?
What do you mean by it...I think its looking pretty good so far! Maybe you should try some negative prompting to get rid of the mistakes?
of course you could. There are many many various human faces in the dataset. The model doesn't need to have the specific face for it to work, it could combine many different faces.
The exception being if the face/style is wildly different than what the model has. That, of course, likely wouldn't work well.
Hi, sometimes I'm pretty happy with the flexibility of my model, but not completely with the likeness. If I push the training further, I lose flexibility with artists in the prompt but gain likeness. Would it make any sense to resume a training that has good flexibility, without the text encoder, just to gain likeness?
i tried and no it looks neither good nor close to the artstyle
I am also interested in training a more lineart style embedding, but it seems to be kind of complicated since the whole model is based on a lot of photorealistic images...
So if you go like Fast Bens, which is the simplest configuration of DB I know of - and reason I generally use it - there are 3 things you can train. Unet, Text-encoder, Text-encoder-concept. Unet is the one which learns the images visually, it is the visual part of the model. Text-encoder is the one you can adjust the language of the model.
So if you wanted to just like reinforce the language of the model, like show it pictures of a potato so it would learn to name that picture "a potato", then you'd only train the text encoder. You can use this to redefine vague things like the difference between two very similar things. I don't know... White T-shirt and white button shirt. You'd show it White T-shirts and train the text encoder with the term "Shirt".
The context training is there to make sure that you are less likely to fuck up the words relating to the custom concept you are trying to train.
But you shouldn't mess around with text encoder much, since training it, you can force every token in the model to find only the thing you trained and nothing else.
what is the best way to do styles in dreambooth? i find tons of YT videos and written guides how to train people but nothing on how to train styles
Does fine-tuning make sense if you don't have captions? I am not familiar with all the options, but I am thinking of something like using clip to generate captions and then using those is one. Does this approach work well? Is there a better option?
Are there descriptions for each sampler?
after I train a 2.1 custom model, should I change config on the yaml file or just duplicate from another model?
Depends on how you define it, I suppose. Most people would separate caption training (fine-tuning) and DreamBooth
finetuning being that it's training on each of the tokens and DreamBooth more so training how to produce your Instance Token, is how I understand it 
Right, I mean actually fine tuning on a style using thousands of images. Not sure if dreambooth is right for this.
I am training with Lora but I think there's too many steps?
I put 150 per image but it seems it's trying to do way more
It says you have 378 examples so 150 per image is 56700
Examples? My input images are 18
Although I have that number of reg images I believe
Which shouldn't have anything to do with it, right?
hmm it seems it trained with my regularization images. I don't know why it would do that though,
I think it was because my regularization images were inside a folder which is inside my instances images.... I moved the folder to another directory now and trying again.
yeah because I have 360 reg images.. and 18 input
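For anyone hitting the same thing, here is a rough sketch of the arithmetic, assuming the trainer counts every image it finds under the instance folder (including a nested regularization folder) as an example. The function name `total_steps` is made up for illustration and is not part of any trainer.

```python
def total_steps(num_examples: int, steps_per_image: int) -> int:
    """Total training steps when a fixed number of steps is run per example."""
    return num_examples * steps_per_image

instance_images = 18
reg_images = 360        # accidentally nested inside the instance folder
steps_per_image = 150

intended = total_steps(instance_images, steps_per_image)
actual = total_steps(instance_images + reg_images, steps_per_image)
print(intended)  # 2700
print(actual)    # 56700
```

So moving the reg folder out of the instance folder should bring the count back down to the intended 2700.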
does anyone use dreambooth on 1111 on runpod? I can run a training, but if I cancel it, I'm not able to start a second one. it doesn't do anything and I have to restart everything.
I tried to combine a number of complex concepts in one image today and it was a disaster. So I tried having Superman wear the Infinity Gauntlet, while holding Mjölnir and wearing four Lantern Corps rings. How could the model be finetuned to give a coherent output for something like that?
This sounds like an inpainting task
Not sure if this is the right place for this - I've just picked up playing with this again after not using it for a couple months, I'm using the Automatic1111 github UI trying to train a textual inversion embedding for a friend. But all the results using the trained embedding just end up looking exactly like slightly fudged versions of the input images. I didn't have this problem on any of the past embeddings - I'm using the 1.4 model (not sure what version my original attempts in October were using) has something changed with how to get decent results out of the training process?
when training in dreambooth extension , can you add your captions to the image instead of creating text files ... for example the first image that I would be training on would be called " photo of (randomname) wearing an orange hoodie" and image two would be " photo of (randomname) wearing a black shirt in front of tall bushes "
anyone have issue with training without prior on shiro db ?
i'm wondering if the lastben dreambooth hasn't changed "too much"
an 800 step style with around 200 instances already overfits
They increased the learning rate, didn't they?
I remember a learning rate in the order of 1e-6. But now the default seems to be 2e-5.
Before it was 2e-6
yeah tried to go back to 1e-6 but got a float error x)
on default shiro at 800 steps it reflects a lot more what i want
do you know if for concepts i create a new token for it or i chose an already existing one ?
like aravcreature or simply creature ?
i'll try both
So I've watched a few dreambooth tutorials for 1111. And everyone seems to be getting 7-20min training times. But I am getting about 2+ hours for mine.
Im running on a 3090
Any ideas why my training time is so high?
14k steps ?
never put my hands on dreambooth for automatic but if that's the same steps we are talking about it's way too much
No need to rename the file, that's what he said.
Your actual step count is too high.
i see thanks, so with concepts we can't really differentiate / use a different keyword to trigger it ?
Yes we can't
like training a style A in concept A and style B in concept B ?
if you use captions on each image you can label them further to indicate as many "things" per image as you want, i.e. both the character name and a style, ex "a painting of bob smith by claude monet" or "bob smith in a cyberpunk outfit riding a motorcycle"
dunno if auto works like that, but traditional "dreambooth" typically uses a pretty simple label per "concept" instead of full labels/captions per image so you can't capture all that
for scenes yea i almost understood, but i'm trying to train two different styles but i'm maybe wrong on how i should do it
its like training a all monet painting and all tristan eaton ones, not merged but each separate triggers
I think if you're training an art style you don't need to use concept images.
How do I lower my actual steps?
maybe thats the embedding / hypernetwork if i want separate, not sure if we can do it
I never train locally but I think you should reduce the number of batches per epoch
yeah embedding is nice for style but only if training it locally or use auto1111, if use huggingface colab it's bad.
I've tried many times on colab and still failed
Hi, I am learning and experimenting with a fine tuned model called infinitum diffusion based on stable diffusion 1.0
Still in process.
https://twitter.com/Felipe3DArtist/status/1607412835700678657
when you're creating a hugging face page for your embed, how do you add images to the readme? only ever used Civitai to share up till now
Hi! Can someone please help me?
I have 780 images (photos, already cropped 512x512) in a certain style that I want Stable Diffusion to remember and generate similar images.
I've used the Automatic1111 webUI and the dreambooth extension to train my model two times and both times I couldn't get any results that look like the images I trained the model with.
I've been trying to follow tutorials on youtube, but there was no info on my specific setup.
I'm getting really tired with it, my last training lasted for 30+ hours using training wizard (object/style). I'd be really happy and grateful if anyone could assist me or give directions.
Not sure if this helps but I recommend starting small and following existing guidance. Find a video or tutorial that walks you through training a single subject from 10 images using a class word. Play around with that until you get results you like. Then try moving up to 50 and seeing how it changes the outcome. When you are feeling comfortable consider training by labelling each image using BLIP or something. I haven't seen a single effective guide that got me great results the first time.
Any one get dreambooth to work with SDv2 on automatic1111?
does anybody know if we could theoretically merge models just with the parts which are activated by various tokens? Gradient descent seems to work out which parts are responsible in the model, and quite a lot of the model seems to get left alone more or less, so in theory if we had say a finetuned style model and a finetuned object model, could we extract the activated parts from each and add them to a base model? Maybe averaging the diff of each where they both alter the same parameters
I've merged the dreambooth training with other models before to pretty good success. I just did a "New model" (A) + (DreamBooth - Whatever base model), with Add Difference of 1
not sure what would happen if you did the style + object though 
would be super interesting to find out
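A minimal sketch of the "Add Difference" idea described above, i.e. A + (B - C) * multiplier applied per parameter. This is not the webui's actual merge code: real checkpoints hold tensors, while plain lists of floats stand in for them here, and the function name and toy weights are made up for illustration.

```python
def add_difference(a, b, c, multiplier=1.0):
    """Return A + (B - C) * multiplier for every parameter found in all three models."""
    merged = {}
    for name, a_vals in a.items():
        if name in b and name in c:
            merged[name] = [
                av + (bv - cv) * multiplier
                for av, bv, cv in zip(a_vals, b[name], c[name])
            ]
        else:
            merged[name] = list(a_vals)  # nothing to add for this parameter
    return merged

new_model = {"w": [1.0, 2.0]}    # (A) the model receiving the trained concept
dreambooth = {"w": [0.5, 3.0]}   # (B) the DreamBooth-trained model
base = {"w": [0.5, 1.0]}         # (C) the base model DreamBooth started from

merged_demo = add_difference(new_model, dreambooth, base, multiplier=1.0)
print(merged_demo)  # {'w': [1.0, 4.0]}
```

Parameters the DreamBooth run left untouched have B - C = 0, so they pass through from A unchanged, which is roughly the "only the activated parts" behaviour asked about.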
what is the difference between diffusers textual inversion
and automatic1111 textual inversion?
in automatic1111 i saw a shape of (8, 1024), whereas for diffusers it's just (1024)
For dreambooth training on an art style, should I preprocess the training and regularization images with captions? Should those captions include the style token like "newstyle-12" -> eg a caption like "painting of a landscape, in the style of newstyle-12".
And for the reg images, having captions without the "newstyle-12" word in it?
Guys, have a general question about model training.
What's the difference between dreambooth, finetuning and hypernetworks?
If I understand correctly they all use the same main.py file underneath but with different settings for training. So can we say they're basically the same?
The biggest difference seems to be that dreambooth uses a custom prompt and an optional class, while finetuning can be done without one, in which case it just changes the output for all prompts. Also you can make captions for each training image in ft, but this was already implemented in lastben dreambooth. So I think there are fewer and fewer differences between all these methods.
Is it possible to train SD on a particular pose?
Or what about a prop? Ex: Train SD on an AK-47 and be able to generate images of people holding it properly
having problems training a hypernetwork, just getting black squares as the output, training data is deffo fine (works fine for TI) - does anything specific need to be set to create a 2.1 hypernetwork?
I have a problem with finetuning sd2.1 using dreambooth with automatic1111: i am getting bad results, and the base model gets wrecked. it can't draw anything, and all prompts start giving almost exclusively greyscale images
any idea what could be wrong ?
love to see it
only 3000 steps with 28 images in the dataset
painge
If I wanted to train Goku per se would I probably need to train two separate Embeddings for Base and Super Saiyan?
Hello, I need some help with LR schedulers in automatic1111. I tried polynomial, but I don't really understand how the LR decreases over time and how to set a minimum learning rate. Same for cosine: how can I tell the training not to go under a learning rate of 1e-6, for example? Likewise, how does the scale learning rate option work?
If you have any good resources, please share.
thanks
In auto1111 I use something similar to this for training embeddings:
.0025:50, .0002:200, 1e-4:400, 1e-5:600, 1e-6:1000, 1e-7:2000, 5e-8:3000,1e-8
it does .0025 for 50 steps, .0002 until the 200th step, 1e-4 until the 400th, etc....
Might be the same syntax for whatever part of the app you're in for your training learning rate.
hypernetwork, finetuning, TI ,and dreambooth numbers will be VERY different
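To make the `rate:step` syntax above concrete, here is a small sketch of how such a string could be read. It is an illustration only, not the webui's own parser: the function names are made up, and the assumption that a rate applies up to and including its listed step is mine.

```python
def parse_schedule(text):
    """Parse 'rate:step, rate:step, ..., rate' into (rate, last_step) pairs.

    A trailing entry without ':step' applies until training ends (last_step is None).
    """
    schedule = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            rate, step = part.split(":")
            schedule.append((float(rate), int(step)))
        else:
            schedule.append((float(part), None))
    return schedule

def lr_at(schedule, step):
    """Return the learning rate in effect at a given step."""
    for rate, last_step in schedule:
        if last_step is None or step <= last_step:
            return rate
    return schedule[-1][0]  # past the last listed step: keep the final rate

sched = parse_schedule(
    ".0025:50, .0002:200, 1e-4:400, 1e-5:600, 1e-6:1000, 1e-7:2000, 5e-8:3000,1e-8"
)
print(lr_at(sched, 10))    # 0.0025
print(lr_at(sched, 150))   # 0.0002
print(lr_at(sched, 5000))  # 1e-08
```

Read this way, the last entry acts as a floor: once the listed steps run out, the rate stays at that final value.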
Would be great if someone could answer my little question. I've been trying to train an art style as an embedding. I heard that when it comes to style it is more suitable to train a hypernetwork instead, but I found it very difficult to get a good result when training a hypernetwork. I want to ask if it's possible and, if it is, what should I type in the initialization text? Would also be great if someone who is experienced in training embeddings could answer this.
taro
/
Do you have good videos on training and fine tuning?
Anyone have any tips on training a model with a class of objects rather than a singular one? For example, to fine-tune for the generation of character concepts, or monsters/creatures, not just one character.
Despite using hundreds of images my results have been quite mediocre so far.
I've used learning rates of 1e-6, 2e-6, have used steps anywhere between 100-250 per image, have tried with and without classification images. Am training the text encoder.
You don't use steps per instance image?
Could anyone help me with training? I have thousands of images, a lot of time, but only an RTX 3060 Ti with 8GB VRAM
Hi guys! Is there any guide somewhere on how to train an embedding of a face on automatic1111? I can't get good results 😦
https://bennycheung.github.io/stable-diffusion-training-for-embeddings
Have you seen this? Probably the most detailed guide out there today
Thanks!! I'll have a look!!
https://www.reddit.com/r/StableDiffusion/comments/zxkukk/detailed_guide_on_training_embeddings_on_a/
Thanks!! there are many details that i've learned from that link!! Thank you!!!!
No problem
keep in mind, anyone that's reading this, that these results / settings don't work well for anime pictures. 1 to 2 vectors with various modes just won't cut it. Also they don't necessarily need more pictures to work well. You could have less than 20 and still get good results with like a vector of like 16.
trending on artstation, iirc, tends to result in a lot of cropped pictures because "trending on" has thousands of pictures of t-shirts in the LAION dataset
See?
don't do it 
hi all - just wondering - is the depth checkpoint not great with faces or is it more due to the fact my original image doesn't have enough definition for it to understand? I keep getting blursed faces that look like bad captchas from 2008
I have been trying to train an embedding (textual inversion) for a style, and the sample images that it outputs during training look perfect. But when I start trying to use the embedding on original prompts, it doesn't look good at all. Is that a limitation of embeddings, or am I likely just doing something incorrectly?
Does Dreambooth training require to preprocess the training set with captions?
Some guides suggest preprocessing the training set like it's required for textual inversion but others do not mention that.
You can try moving the embedding to further in your prompt and adjusting the CFG slider
Can help a lot depending
For the best results? Yes. Though technically you can do it without captioning the images
For early dreambooth training, people were doing it without captioning and it still seemed to work well in some instances 
I also ran a dreambooth training for an art style without captioning the training set and the results were decent, but maybe it can improve with captions.
Having the captions as txt files next to the images in the same folder should work?
thanks a lot
Yep!
Just make sure the name of the txt file matches the image
image1.png image1.txt
perfect, thanks. So basically I can use the same training set that I am using for textual inversion
yep!
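A quick way to sanity-check the naming rule above (image1.png next to image1.txt) before starting a run. This is just a sketch: the helper name `missing_captions` and the list of image extensions are assumptions, and the demo runs against a throwaway temp folder rather than a real dataset.

```python
from pathlib import Path
import tempfile

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed set of image extensions

def missing_captions(folder):
    """Return sorted names of images with no same-named .txt file beside them."""
    folder = Path(folder)
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.suffix.lower() in IMAGE_EXTS and not p.with_suffix(".txt").exists()
    )

# Small self-contained demo using a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "image1.png").touch()
    (tmp / "image1.txt").write_text("a painting of a landscape")
    (tmp / "image2.png").touch()  # no caption file for this one
    demo_result = missing_captions(tmp)
    print(demo_result)  # ['image2.png']
```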
mmm
For textual inversion, I had better luck captioning everything EXCEPT the subject itself (basically, leaving out the details I want the model to learn)
For dreambooth you can do that too, but it also seems to work well if you include ALL the information in there, as well
but how would you do that when training a style? since there is no single subject
mmm
I'm not exactly sure! I've only been heavily researching subject training, personally
one sec
One of the best dreambooth model creators (imo) wrote a guide on styles, here it is:
https://github.com/nitrosocke/dreambooth-training-guide
It's a bit outdated, but the information remains largely relevant
thanks! I will check that out
@split acorn one more question, do you know if the filename matters? since captions are in the extra txt files it should not matter right?
mmm, most repos have it so that if you have a txt file with the same name, that takes priority, so the filename wouldn't matter. Some repos have it so that if you don't have a txt file, then it'll use the filename as the caption instead
I wouldn't use any special characters in the filename
just keep it simple if you're using a txt file
Ok I see, do you know how the automatic 1111 dreambooth extension handles it? (I can also check the code but if you already know it saves a lot of time haha)
it uses the txt method, not sure if it falls back to image captions if there is no txt
I don't think it does tho
Anyone ever try fixing an overtrained model by merging it with a different model with SD2.x?
alternative thinking ahead is to try to get multiple checkpoints as you train, so if the last one is overtrained you can go back to an earlier file
I was messing around and merged it with redshift and the results are turning out surprisingly way better than I expected.
Maybe try highres fix, or lower the cfg scale?
for those purists that want to use rare tokens that don't get split by the tokenizer, I put this list together, it's not every combo that works, it's most of them though
granted some will be rarer than others
just test by using them alone in a prompt if the results are super random, it's a rare token
Using model merges and/or lowering the CFG and/or either not activating the instance token or including the instance token towards the end of the prompt can help
And another technique is, generate with a high CFG (it'll have the overfit glow) and then img2img with a low cfg
(for subjects). Would be neat to try for styles, not sure though.
I have to ask, how did you do this? Please don't answer with one at a time :P
it was one at a ti... jkjk lol
I got chatgpt to write a python script that made every possible combo and fed that into a tokenizer web app then threw that result into a text file and got chatgpt to make another python script to extract every contiguous three letter combo in the result and put that into another text file
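For reference, the first half of that pipeline (making every combo) is only a few lines. This is a sketch and not the script that was actually used: it assumes lowercase three-letter strings, and the tokenizer check is left as a pluggable function because no real tokenizer is bundled here. The lambda in the demo is a placeholder, not a real test of rarity.

```python
from itertools import product
from string import ascii_lowercase

def three_letter_combos():
    """Yield every lowercase three-letter string, 'aaa' through 'zzz'."""
    for letters in product(ascii_lowercase, repeat=3):
        yield "".join(letters)

def keep_unsplit(candidates, is_single_token):
    """Keep candidates that the supplied tokenizer check reports as one token."""
    return [c for c in candidates if is_single_token(c)]

combos = list(three_letter_combos())
print(len(combos))  # 17576
print(combos[:3])   # ['aaa', 'aab', 'aac']

# Placeholder check for demonstration only; a real run would ask the
# model's tokenizer whether the string encodes to exactly one token.
demo = keep_unsplit(["sks", "zzz"], lambda s: s == "sks")
print(demo)  # ['sks']
```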
wow, that was a very good idea, great idea indeed! :D
I can't code, but it's nice to have ideas like this and implement them
I'd say you did something better, you found a way to make someone else do it for you! :D
or something else
perhaps right the first time? 
I want to train concepts of
1 - emotions (laughing, crying, afraid, angry)
2 - train the model to position better (playing sports, fighting/wrestling)
should I train 2 completely different models and merge later? Or could I even train both at the same time (using captions)... so a single training set has a picture of a closeup angry face, a picture of 3 people fighting, a picture of someone doing a backflip, a picture of someone crying while lifting weights etc...
I feel like training it all at once, with natural language that describes all elements of the image I'd like to train makes the most sense?
I want to train SD so I can apply different costumes and accurate ice skates for my characters. What's the best approach to do that? Dreambooth or textual inversion?
I keep getting this error when I try to generate a picture from a model that I merged, any ideas?
Traceback (most recent call last):
  File "D:\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 284, in run_predict
    output = await app.blocks.process_api(
  File "D:\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 983, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 930, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
  File "D:\stable-diffusion-webui\venv\lib\site-packages\gradio\components.py", line 3308, in postprocess
    file = processing_utils.save_pil_to_file(img, dir=self.temp_dir)
  File "D:\stable-diffusion-webui\modules\ui_tempdir.py", line 18, in save_pil_to_file
    shared.demo.temp_file_sets[0] = shared.demo.temp_file_sets[0] | {os.path.abspath(already_saved_as)}
AttributeError: 'Blocks' object has no attribute 'temp_file_sets'
dreambooth if you have 8gb VRAM GPU and want solid results, textual inversion if you want it fast
DB is a lot faster, from what I've experienced, but does require more vram
Can I do it on 6gb?
TI takes 2 to 2.5 GB of VRAM on top of the normal VRAM usage of SD
For 512 so
Maybe if you did something smaller like 256 x 256?
Do all my images have to conform to the same image dimensions?
It would out of memory me if I wasn't careful with 8 GB (for 512 TI training)
When I did 256 training, I made all my images that size, yeah
Has anyone found a solution to training an embed on people who are not skinny but not fat either?
Both myself and a mate are not fat but not ideal weight either, but the embed training seems to blow it out of proportion. Doesn't happen in dreambooth training though.
If anyone knows a solution, please ping me
is it better to train an embedding of a person with a variety of pictures from my cell photos? like, at the beach, in a restaurant, etc... vs just getting the subject to stand in front of a white wall and I take 20-30 photos in a perfect pristine environment with no other stuff in the image?
im trying to get embeddings for my kids to put them in funny stable diffusion pics
i have 100's of photos of them from my cell phone, but you know... at birthday parties, playing in the yard
or i could have them stand in front of a white wall and take a bunch of new ones
Better to use different images: different backgrounds & clothing. Also take closeup shots, a few mid shots & body shots (optionally). If you use similar photos (like 20 photos in a perfect pristine environment with no other stuff in the image) you'll get an overfitted model
Question
Are styles better for TI or HN?
HN, imo
I really like TI for subjects though 
I still need to do more style research, so take that with a grain of salt
(I'm currently focused on subjects)
Do you have any good basic settings for TI with Automatic1111? I'm following this guide and am getting really bad results. https://www.reddit.com/r/StableDiffusion/comments/zxkukk/detailed_guide_on_training_embeddings_on_a/
I see. Can you share your setting on the Train tab?
Yeah, though just keep in mind this study isn't finished and the examples near the bottom work better than at the top. No example images are added yet though
But it should give you an idea
@tropic quail
https://docs.google.com/spreadsheets/d/1rGy5Jb63LdFMfzqN_7Y-X6E5bnsYlqRt7zD61tCcGNs/edit?usp=sharing
Sheet1
Name,Initialization text,Vectors per token,Learning Rate,Batch size,Gradient Acc.,Template,Steps,Shuffle Tags,Drop Out Tags,Latent Sampling Method,VRAM[2],Time,it/s,Likeness,Quality,Notes
alicat_v1,1girl,8,0.005,1,1,[name], [filewords] (A),10000,No,0,deterministic,2.1,0:39:03,?,0,0,(A),De...
As per before, it's not done yet
just some examples
For Initialization text, describing as much of your subject as possible seemed to give much better results
at least the parts of the subject that you never want changing
Im here
yea
1 vector just didn't work and 16 vectors worked well. Batch Size didn't necessarily increase quality, but it did drastically increase VRAM. Batch Size on DreamBooth makes a larger difference. Not sure why there's a disconnect between the two, and still needs more testing 
I'd never use "once" for the latent sampling method
Bold is used to determine what setting is being tested
though I still need example pictures or it's not as helpful and why it's not done 
Yeah I can do 2500 DB steps in approx 7 mins and 3000 TI steps takes nearly an hour with my 4090
it shouldn't take that long
10k steps should take less than 45 min (with GA steps of 1)
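Putting the numbers quoted above side by side as iterations per second. This is plain arithmetic on the figures from the chat; "nearly an hour" is treated as a full 60 minutes, which is an assumption, and the function name is made up.

```python
def its_per_second(steps, minutes):
    """Average iterations per second for a run of the given length."""
    return steps / (minutes * 60)

db_rate = round(its_per_second(2500, 7), 2)          # 2500 DB steps in ~7 min
ti_rate = round(its_per_second(3000, 60), 2)         # 3000 TI steps in ~1 hour
expected_ti = round(its_per_second(10000, 45), 2)    # 10k steps in 45 min

print(db_rate)      # 5.95
print(ti_rate)      # 0.83
print(expected_ti)  # 3.7
```

So the TI run is going at well under a quarter of the speed the 45 minute figure implies.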
I think you might have the slow iteration issue that others have been experiencing with the 4090s
there's a fix though, one sec
You can try this:
https://www.reddit.com/r/StableDiffusion/comments/y71q5k/4090_cudnn_performancespeed_fix_automatic1111/
might help!
Thanks, installed that the other day. It helped a ton in DB
@split acorn Could you help me with Finetuning models on DB + Lora?
I am already training one but still dont know that many details
I have no experience with Lora, but I have a lot with DB and subjects
what was the question? or
Okay so
About overfitting
does having 2K images, a lot of time, and a low learning rate + high steps per image help avoid it?
ok
Having higher step per image increases the likelihood of it happening
I see
low learning rate is a good way to avoid it though, more images isn't necessarily better
This was my results from the test run:
Ahh, I set the batch to 5 and gradient to 2. The guide I read indicated that it was best to have the batch * grad = number of images
haha
Gradient Accumulation is there to avoid VRAM restrictions
in exchange for longer training time
if you can get away with having it all on batch, that would be better than using gradient at all
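A small sketch of the trade-off described above, assuming the usual meaning of gradient accumulation (several smaller batches are summed before each optimizer update). The function names are made up for illustration.

```python
import math

def effective_batch(batch_size, grad_accum_steps):
    """Number of images contributing to each optimizer update."""
    return batch_size * grad_accum_steps

def updates_per_epoch(num_images, batch_size, grad_accum_steps):
    """Optimizer updates needed to see every image once (the last may be partial)."""
    return math.ceil(num_images / effective_batch(batch_size, grad_accum_steps))

print(effective_batch(5, 2))        # 10
print(effective_batch(10, 1))       # 10
print(updates_per_epoch(10, 5, 2))  # 1
```

Batch 5 with gradient 2 and batch 10 with gradient 1 both update on 10 images at a time. The first needs two forward/backward passes per update but less VRAM, the second does it in one pass if the card can hold it.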
@split acorn So what regularization images would be good to use when training things like this (tanks, vehicles, realistic things)
and any datasets u suggest?
That's what my training screen looks like. Nightmare fuel
lmaooo yikers
Go to available and press load
1 tank?
for example?
So drop gradient to 1 and set the batch to as much as my 4090 can manage?
yeah
well depending on how many images you're training on
10 images
then it wouldn't make sense to go more than 10
gotcha
not sure on the best amount though, but 
mmmm
Thanks @split acorn I'll re-run this after this one finishes.
you could do like 100 per image if you wanted. If you're fine with every tank looking like your tank then regular DB methods are fine without regularization
you could also do the finetuning method and avoid using regularization altogether
sounds good 
Just make sure when you're generating the class images, that you're using the model you're training on and you're using the class/class prompt as the prompt to generate them
also not sure where you're getting that LR schedule
You should be able to get good results with just like 0.0005
the other one I tried was "5e-03:200, 5e-04:500, 5e-05:800, 5e-06:1000, 5e-07" but I didn't like the results as much
okay
obviously the LR would depend on what you're training, how many images, what model you're using, etc. So feel free to adjust as needed and experiment 
Then on the list, find the one that says dreambooth, press install
Also from the same guide. Since it is suspect, I'll try yours instead
Or just a different guide haha
okay
Go back to the "installed" tab and press "Apply and Restart UI"
after you installed it
does embedding (not Dreambooth) require regularization?
Nope
huh ok, then I have no idea why my output images suck shit as if it had a seizure
it do be looking like this
like has anyone experienced this issue before?
Could try lowering the cfg
Yeah this is really slow 1.19it/sec
Or putting the embedding further into the prompt
oh this is the output image during training btw, so it is screwed up deep
oh
I wonder why DB is so much faster for me.
okay, done
i dont see it for some reason in the tabs though
Alright
Yes, because the image preview is based on either a prompt from your training or the prompt from the txt2img tab. You have to interrupt training to change the prompt if you're using the txt2img option though
You might want to turn off your WebUI entirely (cmd) and restart it
Yeah DB does some checks at start up
So you might be able to test the preview with a different prompt or lower cfg
oh ok i guess i will try that
@split acorn For your testing so far, how many steps before you start seeing a likeness? On this re-run, 500 steps looks much better. I realized I had set the model in A1111 to something other than what I wanted to train on so that may be part of the reason for the Cronenberg sample pic.
Not sure how to answer that, tbh
done
I will just wait it out. I wish I could figure out why the it/sec are so slow
go to dreambooth tab
Yeahhh, that would be nightmare fuel for TI, because it already tends to be slow
okay
okay so now we start the fun stuff
First of all
do you have your images prepared?
(as in, cropped up to 1:1 aspect ratio, and has accompanying text files for the prompt)
@split acorn One question, could you explain me everything from Instance Token to Class prompts and filewords n stuff?
i know them basically but dont know how they affect each other
Instance token just being a unique token that you use for the subject/style you're training on. Usually people recommend using a rare token (like sks). To test if a token is actually rare, just prompt it and then see if they all look random.
Class prompt being everything but your instance token, to help it learn what the subject/style actually is.
Filewords just meaning having a text file or filename that contains a prompt that describes what the picture looks like. One method being to describe everything and the other method to just describe everything EXCEPT the subject/style.
So a class prompt could be
Photo of person
Instance token being olis
Class token being person
Instance prompt being
Photo of olis person
Or
Instance token, [filewords]
And [filewords] as the class prompt
Is how I understand it
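(Side note: a minimal sketch of how the prompts above fit together, assuming [filewords] is simply swapped for the caption read from each image's text file. `build_prompt` and the sample caption are made up for illustration.)

```python
def build_prompt(template, caption):
    """Replace the [filewords] placeholder with the image's caption text."""
    return template.replace("[filewords]", caption.strip())


instance_token = "olis"   # the rare token from the example above
class_token = "person"

# Fixed prompts, no captions involved.
class_prompt = f"photo of {class_token}"
instance_prompt = f"photo of {instance_token} {class_token}"
print(class_prompt)
print(instance_prompt)

# Caption-driven prompts: the instance prompt adds the token in front,
# the class prompt is just the caption on its own.
caption = "standing on a beach, smiling, sunset lighting"  # hypothetical caption
print(build_prompt(f"{instance_token}, [filewords]", caption))
print(build_prompt("[filewords]", caption))
```

Either way, the only difference between the instance prompt and the class prompt is the instance token.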
hello
Go to extensions tab, press available sub tab, press load
just wanted to mention that
nice
search for "dreambooth" and press install
got it already!
Silly question for you @split acorn if you are still there or if anyone else knows. I trained a TI subject and the likeness is good, but it was trained against the standard 1.5 model. If I wanted to add it to a custom model as part of the prompt will I be able to get the same likeness or similar as long as the original model was also 1.5?
on the source checkpoint, select the model you want to finetune your new model on
am i training an embedding or model?
model
yeah but i am not experienced on that
models are better imo
ok yeah lets continue
wait hold on, what could i do for the training part tho, if the training part is the problem? (Like the image I showed you)
ill use elysium as the source
alright
select the scheduler as "euler", do not know if it affects or not, but if it does, euler is best
done
and then your food is all heated up! :D
LMAO
YAY!!!! :D
It's to test the embedding. Sometimes it just doesn't like the prompt or CFG, but the training itself is actually fine
For anime, I like Euler A
but Euler seemed okay too
done
Well next off I guess I will put: 1girl, red hair, blue eyes, long hair, very long hair, beach, cafe, drinking inside the Stable Diffusion oven and see what it does
DDIM for realistic, seems to work well
alright so
IDK man, the training output has been consistently like that
now we prepare your images
idk if that's normal or smth, but when I use that embedding, it makes those kinds of images
which is very annoying
and then drop it as you pull it out, and hoping no one saw you, you scoop it all back and whistle innocently.
Just like grandma just do make them!
or are they already prepared?
Cropped up
txt files containing prompts
@frank urchin
considering I am generating an anime girl, that can be taken out of context. Big holup moment
no i can do that gimme a sec
i crop them to 512x512 right?
well uh, what is elysium trained on?
768 or 512
512x512 (pretty sure)
alright 512 then
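(Side note: a minimal sketch of the 1:1 crop step, assuming a plain center crop is acceptable for the image. It only computes the crop box; with a library such as Pillow the box could be passed to `Image.crop(box)` and the result resized to 512x512. The helper name is made up.)

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# A landscape image loses its left and right edges.
print(center_square_box(1920, 1080))  # (420, 0, 1500, 1080)

# A portrait image loses its top and bottom edges.
print(center_square_box(768, 1024))   # (0, 128, 768, 896)
```

For close-ups where the face is off-center, cropping by hand is probably still the safer choice.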
you do it through the train>preprocess images tab
there are 3 options with prompts
how many images you have?
im trying to just base it off a character
you can ofc,
ok lovely
so how many images are we talking about approx
is 20 enough?
20 images are often more than enough. If it's for one character that is
oh then is like 10 enough?
yes
yes
if you have 10, it is best to manually describe each of them, using as much detail as possible, in a text file with the same name as the image
(if you have 2048 images like me you have to automate it with some clever tricks lmao)
that is a lot
i consider it low lol
i more like to train big stuff
working on learning image scraping and stuff
i dont even really need to make a model
just for fun
i like learning random things
depends on the style of the vehicles, then I believe there are quite a lot of models which can draw them :)
Have you tried fine-tuning?
isnt that training models?
Seems to be nicer for bigger datasets and training various things like tank types
Nah
oo
can images be jpeg?
yeah
EveryDream is an example
perfect
@split acorn how do i do it oo
Mmm one sec
vision models?
ok got my images finally
sorry bout that
np
just make 10 txt files
mhm
both are very easy and are good for getting an idea for caption training instead of DreamBooth style training
oh and then am i writing in the txt what the prompt would be?
like to describe the character basically?
and then rename the txt file to same as the image
There are two methods if you're doing TI or DreamBooth, that I've had good luck/experience with:
Method 1) Describes all aspects of character & background
Method 2) Describes all aspects OTHER than the character
if image is image.jpg text file should be image.txt
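(Side note: a minimal sketch for checking that every image has a caption file with the same name, e.g. image.jpg next to image.txt. The function names and the list of extensions are assumptions for illustration.)

```python
from pathlib import Path

# Assumed set of image types; extend as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def caption_path_for(image_path):
    """image.jpg -> image.txt, in the same folder."""
    return Path(image_path).with_suffix(".txt")


def images_missing_captions(dataset_dir):
    """List the images in dataset_dir that have no matching .txt file."""
    missing = []
    for path in sorted(Path(dataset_dir).iterdir()):
        is_image = path.suffix.lower() in IMAGE_EXTENSIONS
        if is_image and not caption_path_for(path).exists():
            missing.append(path)
    return missing
```

Calling `images_missing_captions("path/to/dataset")` before training gives a quick list of the images that still need a text file.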
this look good?
For TI you should use filewords to describe every part of the image that you don't want it to learn
@vale egret that's method 2, yosh
now im very confused
Textual Inversion
The point of TI is for it to learn the commonalities between your training images. If you're training a female character X and you put girl in the filewords, you're saying "this is what X looks like as a girl"
ahhh
yep
If you remove girl from filewords, it will learn that X is always a girl
Going from method 1 to method 2 just changes the flexibility and how you prompt to get the results you're looking for
both can work though
Just method 2 seems to give more consistent results
Or like
it's much easier to prompt
and less forcing haha
Method 1 using auto captioning always gave me crappy results
yeahhhh auto captioning doesn't seem to work well enough
like it's a good base start
but for smaller datasets especially, I think it's important to check them
take out the irrelevant parts and add in the missing parts
you get WAY better results that way
bad data in bad data out
Like for characters make sure to describe the environment, pose, background, any specific clothing and lighting
this is so tedious i really hope this works
OH a neat technique is if you plan on using a simple background for ALL your pictures, if they're all the same, make sure to include that background description in the filewords and then you can help by including that in your negative prompt
seems to work really well
thoooo, with high CFG you run into issues
but
it's neat!
ngl ive only ever messed with CFG scale once
pictures as in the examples im giving it to train on or the pics it will generate?
oh wait that made no sense im just stupid
it has a HUGE effect on embeddings and dreambooth. Like when you're prompting for pictures after you've made it.
Who here is the master of training a model on a person's face? I followed all the steps in the instructions, but they don't cover LoRA. I made a checkpoint of my wife's face and the pics I generate look nothing like her.
or if you know a channel/person who i can ask about it that would be amazing. i'm just new and still trying to understand this whole thing lol i'm not gonna bore you just need some directions in a right direction.
I'm sure there are guides online you can follow
An embedding may look like trash at 7.5, but if you pop it up to like 11, all of a sudden it looks really good!
Or vice versa, at 7.5 it looks like trash, but at 3 it looks really good
but that's more so for saving a model/embedding that was like under or overtrained 
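(Side note: a minimal sketch of why the CFG number matters so much. It assumes the usual classifier-free guidance mix, where the sampler pushes the unconditional prediction toward the prompt-conditioned one by the CFG scale. The two-number "predictions" are toy values, not real model output.)

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), per element."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
cond = [1.0, 3.0]    # toy prediction with the prompt (and the embedding in it)

for scale in (1.0, 3.0, 7.5):
    print(scale, apply_cfg(uncond, cond, scale))
```

At scale 1 the result is just the conditioned prediction; higher values exaggerate whatever the prompt contributes, which would fit an overtrained embedding looking better at a lower CFG and an undertrained one at a higher CFG.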
Weird, how does changing cfg compare to reweighting?
reweighting?
(:0.8)
Like dreambooth or caption training?
oh
gottcha
yep!
that's another method to help
it acts similarly to CFG
from my experience
that was long
please dont make fun of my obscure character
please tell me that looks correct
though... I will say weighting it doesn't do much if you include the instance token/embedding name as an entry in the beginning of the prompt
and I've had much better results adding it after like 5 words instead
Yes, it was completely ignoring some of my prompt until i moved that part to before the embedding. That was really confusing to fix
yeah
SHE SO CUTE
WHO SHE
Also, do we have accurate data on how many training images you need if you crank up the embedding vector count to something like 20 or 25?
it does
You can train with a vector count of like 16 with like 10 images. I'm not sure if people have researched that heavily yet
ok awesome
there's some information where "you need more images for a higher vector count!" but I have zero idea where that comes from
whats next!
because I've had good luck with higher vector count with not that many images. Just depending on the character and model, etc, but
The idea is to avoid overfitting
yeah, but I think it's more complicated than just that
more vectors and few images could make it obsess over the specifics of those images
Idk what the actual data is on what numbers cause that though, which is why I'm asking
ahh yeah
my embedding had 16 vectors on 20 images and it really took over the entire scene, but I had stille make it for me as I couldn't so I can't really say it's a normal thing or not :P
16 vectors worked really well with one sec, checking how many pictures
13 pictures
but that was anime style with an anime model
I imagine the vector count thing with more pictures is more so for self photos and still keeping flexibility
It's a general rule of AI that the number of training instances needs to be more than the number of parameters and I don't see why it would be any different for embeddings
but I'd be super interested in examples 
So 13 images seems a bit low for 16 vectors
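(Side note: a minimal sketch of the parameter count behind this argument, assuming an SD 1.x model, where each embedding vector has 768 numbers. SD 2.x uses 1024 instead. This only counts trainable values; it does not settle how many images are actually needed.)

```python
EMBEDDING_DIM_SD1 = 768  # width of one token embedding in the SD 1.x text encoder


def embedding_parameter_count(num_vectors, dim=EMBEDDING_DIM_SD1):
    """Number of trainable values in a textual inversion embedding."""
    return num_vectors * dim


for vectors in (8, 16, 32):
    print(vectors, embedding_parameter_count(vectors))
```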
You'd think that, but it worked way better than 8
and 32 was way too much
I think it's heavily model/subject dependent
If it's easy for the model to make vs hard?
32 vectors 100 images character embedding. Someone needs to try this
mm mm
I'll try that later today
not that specifically
but testing the relationship between vector count and images, up to like 13 images
(because I'm lazy)
are you still here
i am oof
I've used 25-35 images for my embeddings
sorry got distracted
oh np!!
yes
open the dreambooth tab
yep
A big believer of "sometimes less is more" and "quality over quantity"
so I try to keep them on the smaller side
but 25-35 is still super reasonable depending
A fun guide on the topic that includes the number they use:
https://github.com/nitrosocke/dreambooth-training-guide
Nitrosocke being known for their high quality style dreambooths too 
though it's a bit outdated, but still largely relevant
10-100 sample images
now press the parameters sub tab
i saw someone else doing this
and i dont have that tab?
did i do something wrong
Training steps per image: 1000
then go all the way down to learning rate
now this will be the settings i personally think are good, based on my understanding of the thing
okok
This for learning rate: 0.000001
This for Lora Unet: 0.0002
This for lora text encoder: 0.00002
after, go to image processing
set the resolution to 512 if not set
As a quick note, I personally do 100 steps per image and cap out around 300 (though this also super depends on other factors)
hm
1000 seems like a lot
tru
you can reduce to 200~
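(Side note: a minimal sketch of what these numbers add up to, assuming the total step count is simply the number of training images times "training steps per image". That multiplication is an assumption about how the setting is applied, and the helper name is made up.)

```python
def total_training_steps(num_images, steps_per_image):
    """Assumed total: every image is trained on steps_per_image times."""
    return num_images * steps_per_image


# 10 images, as discussed earlier, at the three values mentioned here.
for steps_per_image in (1000, 200, 100):
    print(steps_per_image, total_training_steps(10, steps_per_image))
```

Under that assumption, 10 images at 1000 steps each is 10,000 steps, against 2,000 at 200 and 1,000 at 100.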
sorry
bit sleepy
after u have done that
untick center crop and apply horizontal flip
ok ill do that
then at the most bottom, open the advanced dropdown menu
Embeddings I understand, but dreambooth is black magic.
It makes models that simultaneously generalize very well and somehow get a 1-track mind
yep
ai art is black magic to me so you're waaaay ahead of me!
right SHIT
got that
are you asking me?
yes
yeah i do
ok
good
set mixed precision to fp16
set memory attention to xformers
tick both dont cache latents and train text encoder
Tbh if there was a much smaller version of SD specifically for fine tuning, we might not even have needed dreambooth in the first place
wait so it should look like this then right?
also i dont see any train text encoder
send full screenshot
yep
Dataset directory: path to ur directory
done
mhm
type girl/character/person one of them (or what word you think describes the cwute pwincess the best uwu~)
it needs to be 1 word
to class token
mb
yes
if you're using an anime model, use 1girl
yeah
for the instance prompt: [filewords]
I do class token, [filewords]
hm
i mean you know better than me
(textual inversion)
So when doing my instance token, I went with hta
because it's both a rare token and Any3 didn't know what that was
im so confused what yall are talking about
Sorry, mmm you can take over, I just would include the instance token in the instance prompt
so:
hta, [filewords]
and hta being your instance token, whatever that is
is the instance token the thing thats basically telling the prompt to generate my character?
im new 
instance token is how you tell Stable Diffusion "hey, this is the character I want"
ok got it
so could i write the name of the character?
"princess tutu" or would that be stupid
You could! Though, keep in mind, if you do then it might mix it up with someone else with the same character name
people generally recommend a rare token. This means that the model doesn't really know what it is. That way, when you train with it, the ONLY info it knows is the info you're feeding it
i dont think theres any other characters named princess tutu?
