#🔧|finetune
has anyone gotten fine tuning working on a 12gb card?
i can get it running, but i have to reduce the size to 256 😭
Is there any guidance on how to optimize the rate that stable diffusion generates images? I'd like to know what configuration options make the most difference.
After that I want to know what hardware setup is best for running it. I haven't found any articles on that, figured this might be a good place to ask that.
I'm pretty much a complete noob with it other than setting up the server on my computer and using it via a plugin in Krita
so I came across this:
https://colab.research.google.com/drive/1jYZIbYSmEGWClWuVYFyUR8nUKwW-c91x
Innnnnnnnteresting, haven't seen or heard of that notebook before
I'm trying to train a model to emulate my own illustration style. Textual inversion isn't really cutting it and dreambooth seems to be geared towards people and objects. I came across this pokemon model and I'm wondering if this method of fine-tuning is better to emulate styles. Thoughts? I'm a dummy when it comes to coding btw.
https://github.com/LambdaLabsML/examples/tree/main/stable-diffusion-finetuning
Hardware requirements on that are prohibitive unless you're willing to pay for 6+ hours of finetuning on 2xA6000s, based on the example in that repo
how much is that ballpark?
You're probably gonna have to do the pricing research yourself since there are a ton of different hosted GPU services out there. But if you read the link you shared you might have an idea, since it does mention how much that 6 hours did cost the author.
$10 to train a new model isn't a lot...although their method completely trashed the original model, I think. There's no reason more people couldn't do the same, if not for the technical knowledge barrier.
Whether $10 is a lot honestly differs from person to person and we shouldn't assume, but I'd say if someone's not familiar with coding or willing to get in the guts to troubleshoot, that $10 is almost certainly going to be burned away, six hours of frustrating wait and problems to fix
Hey, is anyone able to continue training a model with the original CompVis main.py script with the --resume parameter? For me it seems to go well at first - the proper model is loaded from the checkpoint and the configs are merged as expected but once it gets to trainer.fit() ... it just returns, doesn't train for a single step and the script ends
is it different from dreambooth?
I think the so-called "Dreambooth" repos are still worth a try for style training. There are configuration files for that purpose. It doesn't look like many people have tried that though so you might have to do some experiments. But still easier to get going than the other finetuning repos.
Interesting. Gonna look into those configuration files.
there's a personalized_style.py that should be there, and some slightly different CLI params I think. it's just that everyone is doing objects instead of styles, so it's not as well tested
is there a website where i can just submit some images and get it finetuned w/ dreambooth?
dreambooth vs finetuning?
Not yet... But I think that would be a smart business idea for someone... It'll probably exist very soon! It could be done for like $10 or so I think.
You can do it for free on colabs already
And it's almost as easy as drag, drop, and fire if you rent a GPU from runpod or vast.ai to do it semi-locally.
It's true that I've mainly seen DB used for people and objects, but I wouldn't discount its ability to learn a style instead. If TI is iffy on people and objects and great at styles, and DB is great at people and objects, I would be surprised if it wasn't great at styles.
$10 would be price gouging, IMO. It takes like an hour on a Colab using Tesla T4, which rent for $0.29/hr from Google.
A cloud service that offers "true" training, like how we got the anime, furry, and pony finetunes, would be very interesting, and could charge in the hundreds to thousands of dollars depending on dataset size...
i sent a friend req, would love to chat more
anyone around to provide guidance on how to train a batch of pictures using the google colab portal? im getting an error message that says ## Convert weights to ckpt to use in web UIs like AUTOMATIC1111.
which colab are you using?
DreamBooth_Stable_Diffusion.ipynb
can you link the colab?
just noticed there wasn't a green tick beside that play button. just ran it
does the tick have to be green on all the play icons?
and then run the cell
should be
ok... looks like i skipped a few should be steps
sent a friend req
does anybody know if "a character" is a valid training subject?
if anybody is interested in using dreambooth to finetune people, i've found this notebook to work well
Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion (tweaks focused on training faces) - GitHub - kanewallmann/Dreambooth-Stable-Diffusion: Implementation of Dream...
Is anyone aware of a finetune on just 64x64 images? I'm looking to generate low resolution images, and in my experience going far below 512x512 yields awful results.
gonna attempt to implement dreamfusion3d
i imagine finetuning to lower resolutions shouldn't be too hard compared to other finetuning projects
There's a new tab in the Automatic1111 WebUI for Textual Inversion! According to the documentation it needs 12+ GB VRAM, but it seemed to work in about 8 GB for me. Probably worth giving it a go and seeing what you can add to stable diffusion! Do let me know if it works on your 8GB card ;)
This experiment is very insightful
would love to chat more, have some relevant exp
send me a friend req?
before I dive in to setting up an environment I'm trying to understand if I should be trying to set up dream booth or textual inversion. my understanding is that textual inversion can use multiple trained concepts at once but dreambooth can't?
also I'm running local on a 3080 so not sure if I can run dreambooth at all
I hope this is the right channel
I'm messing around with Textual Inversion, used a bunch of town art from Heroes of Might and Magic 3, first I used styles.txt for prompt template, then I've used a custom (styles_filewords.txt, but [name] and [filewords] swapped) and with less steps
I'm not sure what I'm doing wrong, my first try (with steps 3000, 8500 and the homm3-style-old with 11000 steps) looks better, but doesn't resemble the prompt. My second try looks worse, but closer to the prompt with the final embedding (3000 steps homm3)
the prompt was simple: london, 8k, highly detailed, homm3-500 with Prompt S/R replacing homm3-500 with homm3-500, homm3-1000, homm3, homm3-style-3000, homm3-style-8500, homm3-style-old
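For anyone following along, the Prompt S/R comparison above boils down to a plain search-and-replace over one base prompt. A rough sketch in Python, using the embedding names from the message; the helper name `prompt_sr` is made up for illustration and is not part of the webui:

```python
def prompt_sr(base_prompt: str, search: str, replacements: list[str]) -> list[str]:
    """Return one prompt per replacement, swapping `search` for each entry."""
    if search not in base_prompt:
        raise ValueError(f"{search!r} does not appear in the prompt")
    return [base_prompt.replace(search, r) for r in replacements]


base = "london, 8k, highly detailed, homm3-500"
embeddings = [
    "homm3-500", "homm3-1000", "homm3",
    "homm3-style-3000", "homm3-style-8500", "homm3-style-old",
]

# One prompt per embedding checkpoint, everything else held constant.
prompts = prompt_sr(base, "homm3-500", embeddings)
for p in prompts:
    print(p)
```

Keeping the seed and every other setting fixed while only the embedding name changes is what makes the grid comparable.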
I really wish to see more of this. Textual inversion is more accessible and I hope people find more ways to make it succeed.
You are right, we need more of these, I'll try to do some too. I wanted to do a where is waldo one, and I'm looking for any other ideas too if you'd like !
Please 👍👍
Something like the effect of initialization text is something to be explored.
I've tried using one token and long initialization text. When I used it without training, the result of the token is very different than when I put the prompt alone.
Did you experiment a lot on the number of tokens ? I'm not sure where the ratio quality per token starts to be bad in it...
Initialization text, I haven't experimented with, no. It's still quite new to explore to me, I've just skimmed over it for now
No not much. I've just experimented with cartoon characters, and the results were... As horrible as possible 🥲
I judge it like with java memory for minecraft. Half the amount of tokens.
Depending on the (aesthetic) score, the more you pile on, the less will be recognized.
Example 1:
Subject, effect, whatever else, Peter Max - PM is 1661933.16.00
Example 2:
Subject, effect, whatever else, Jordan grimmer - JG is 749135.43.00
It's a big pain in the peach, to balance out artists - I honestly have no idea if I should consider evening out the scores to match high-scored artists.
The difference is 91.279.773 in score.
That would be between
Bartholomeus Strobel 911728.56.00
and
Emil Fuchs 913456.35.00
-> Eggplant, shine, by jordan grimmer, by Emil fuchs, by peter max
-> Eggplant, shine, by jordan grimmer, by Bartholomeus Strobel, by Emil fuchs, by peter max
Additionally, by including Bartholomeus, I start to experience getting a lot more frames with my outputs - Perhaps there's an unintended high-limit with scores?
But, that science project, requires me to have more resources available - so I don't have to think about cost-performance stuff
ok let's do some tests on that waldo 🙂
tried using the huggingface textual inversion colab to get the style of first pic, ended up getting second pic instead, feels bad
problem is it would take at least another 3 hour training session to figure out if changing some settings can fix the results or not
now I remember why I wasn't able to get into machine learning...
that was on how many learning steps ?
3000, I left all the parameters at the notebook default
honestly it wouldn't surprise me if the style is too weird for TI to even be able to locate it
I mean the style probably fails the filtering of the SD training data
I'm still in the dark on TI... it requires testing to really get right, but that takes so much time
progress on the waldo training
getting closer
definitely looks like a waldo
although I have a bad feeling that even as the overall picture looks waldo-like, the individual figures in it might stay indistinct blobs forever
it could, especially since this is only TI, but the goal is not to make more "where is waldo", but to extract an embedding for "waldo style"
I'd like that intricate style with lots of characters in lots of other styles and subjects
I'm training it as a style there, not a subject
I see, interesting idea
Gave this another try, now with img2img, used League of Legends' Summoners Rift as a base, the prompt was "a forest in mountains with 3 dirt roads, 8k, highly detailed, homm3-500" and these settings
and the results are confusing for me
1000 steps are so good, while the final 3000 steps look overcooked, my previous attempt also looks horrible with 3000, 8500 and 11k steps
Any suggestions how should I restart training it?
is there a browser version of webui?
for my frens that don't want to have to download anything
i want to train on 35 photos, what settings should i choose and how long should i expect it to take?
should i run anything over 15,000 steps?
I've found 12,200 steps and 3 vector tokens works pretty good for faces. haven't tried training any styles yet
Here are some samples of my friend. I don't have the original training images on this comp, but the first set of faces is pretty much the same as the training set. Trained with 22 closeups of the face, 12200 steps, init word=man, vectors per token=3, learning rate of 0.005.
I would advise caution against posting anything child related on any public area. Even if it's your own child.
ESPECIALLY IF ITS YOUR OWN CHILD.
Thanks, make sense
Where can I look for details on training in general rather than asking a bunch of questions? Checked pins, but don't see anything
Stable Diffusion. It has a lot of things going for it, but how do some of these things compare? Dreambooth SD? Textual Inversion? Dreambooth Diffusers? Which is better? What does better even mean anyway? All these questions and more are answered right here, right now...
After a long night of futile research, I could use some help. I use the SD Automatic1111 UI on my PC. I have a single .pt textual inversion file. How do I get it to work on my version of Automatic1111?
did you try looking at the wiki for his repo?
Put the pt file in embeddings folder, then you can use it with the file name eg. file is some-random-name, then “a photo of some-random-name” will use the embedding
I have looked through it more than once. I see a lot about CREATING a textual inversion, but nothing about simply adding an already created pt file
Thanks Assassin, I am using the latest version, but there is no "embeddings" folder that I can find. Should I just create one? Then, if the pt is dogsinspace.pt, I would prompt "a photo of dogsinspace looking out the window"?
Then create it manually
And yes, it works like that
But I think I had one when I’ve downloaded from automatic1111’s webui, because there is an empty txt with a name something like “place embedding here”
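The steps above (create the embeddings folder if it is missing, drop the .pt file in, then use the file name in the prompt) can be sketched as a few lines of Python. The function name `install_embedding` and the demo paths are hypothetical; only the `embeddings` folder name and the file-name-as-token behaviour come from the messages above.

```python
import shutil
import tempfile
from pathlib import Path


def install_embedding(pt_file: Path, webui_root: Path) -> str:
    """Copy a .pt embedding into <webui_root>/embeddings and return its prompt token."""
    embeddings_dir = webui_root / "embeddings"
    embeddings_dir.mkdir(parents=True, exist_ok=True)  # create it manually if missing
    shutil.copy2(pt_file, embeddings_dir / pt_file.name)
    return pt_file.stem  # file name without .pt is what goes in the prompt


# Demo with throwaway paths; point these at your real webui folder and .pt file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "stable-diffusion-webui"  # hypothetical install location
    pt = Path(tmp) / "dogsinspace.pt"
    pt.write_bytes(b"")  # empty stand-in for a real embedding file
    token = install_embedding(pt, root)
    installed = (root / "embeddings" / "dogsinspace.pt").exists()
    print(f"a photo of {token} looking out the window")
```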
I’m messing around with Papa Franku and I don’t know what’s wrong, but with 2 different trainings both look overcooked to me
Filthy-frank-6 used only 1500 steps and 6 vector tokens, the others used only 3, and I think I stopped somewhere between 5000-8000
I did a waldo style, on 30 tokens and trained it a little too much. the results are fun though
Why did you choose 30 tokens? Is it better for styles if there are more tokens?
because I'm navigating blind and didn't try that many tokens yet, I don't know what I'm doing here, just testing to see if it would land better results, using about half of the available tokens
It’s good to hear that I’m not the only one 😂
This looks fun, gonna download this!
this is neat, i can imagine these hung up on a wall somewhere
fun to look at
TI has quite some fun to show me yet I think, I'm just trying it. Especially TI on a style, not a subject
i haven't been able to replicate styles too well yet. let us know how your experiments go!
i was trying to teach it how to draw specific swimsuit
it learns how to draw them well, could be better, could be worse
but in the process it includes the most horrible faces i've ever seen, like enormous cheekbones and lips, also shaved head, something like neanderthals
and i cannot remove this face even with specific prompting. needless to say there is nothing like that in the initial imageset
i tried everything: different vector sizes, initialization texts, different template files and captions, include only images with/without faces
the only thing that helps is lowering the learning rate, but that is not enough
most disturbing is that i get the most pleasant results at 200-500 steps of training. it's not perfect, it's still thinking about the initial prompt and not the pictures i included, but it's nice to look at. and then it only gets worse and worse and worse
rick and morty, 10 tokens, 13500 steps
@mental hatch
it's still struggling
I'll send a zip with all checkpoints here when I reach 15k
It just has issues with R&M
Man SD is not getting on board with burp Rick and Morty at all
Is that textual inversion?
/dream apple
Nevermind, just saw those files
yep it is
@high nest
Welcome ! There is no bot currently to generate your images on discord. You may want to start by taking a look at the #1014939219904450590 channel. You can access Stable diffusion in different ways : 1️⃣ the official website, https://beta.dreamstudio.ai/. The easiest and fastest way to access Stable diffusion with 200 free credits. For any question on it, you can find help in the #1025467151206854736 channel. 2️⃣ Installing Stable diffusion on your computer. There are numerous projects that let you do that, and you will find help in the #🤝|tech-support channel. 3️⃣ Running Stable diffusion in the cloud, through rented GPU services, using notebooks. You can find lots of them shared and discussed over in the #1011228442399883294 channel.
hmm, this certainly looks interesting
well after lot of trying seems like I found a good spot for face learning with textual inversion, 6 tokens and up to 20000 steps was the answer, with 1 token above 15000 steps it was bringing random results, with 6 its giving me realistic faces and still kinda flexible to work on styles and stuff
Hi friends! I just published my introduction to Training guide on my site, it's meant to be a quick way to get started with training a person into a Stable Diffusion compatible model with a cloud GPU. Hope you enjoy! https://stablediffusionguides.carrd.co/#training-p
finally finetuned a model with dreambooth, kinda painful
would anybody be interested in a website that lets you do dreambooth finetuning?
Yes
guys what are the effects of increasing the learning rate?
lower rate means more accurate? or just slower?
lower rate means the training will be slower, yes
it controls the rate at which the model learns
I'd suggest not to increase it much if you're doing dreambooth finetuning as it will most likely make your model worse
you could increase it slightly if you're training on more images than recommended
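A toy illustration of the learning-rate trade-off described above. This is plain gradient descent on f(x) = x², not Stable Diffusion training, and the specific rates are arbitrary; it only shows the general shape of the problem: too low is slow, too high overshoots and gets worse.

```python
def descend(learning_rate: float, steps: int = 20, x: float = 1.0) -> float:
    """Run `steps` gradient-descent updates on f(x) = x**2 and return the final x."""
    for _ in range(steps):
        gradient = 2 * x              # derivative of x**2
        x -= learning_rate * gradient
    return x


# The optimum is x = 0, so a smaller |x| after 20 steps means faster learning.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: distance from optimum after 20 steps = {abs(descend(lr)):.4f}")
```

The lowest rate is still far from the optimum after 20 steps, the middle one is close, and the highest one moves away from it on every step.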
its actually for textual inversion, I'm using around 20-25 images, is that good amount?
I'm trying to get the result better, its for face learning, there's one training that did great results but its kinda rigid on styles so I've cropped images at face bounds but then things got worse, trying to extract best of it isn't easy
I'm not sure how good textual inversion is for that, I've only trained it on styles and not faces or objects.
I think you could try increasing the learning rate a bit
it did learn my face pretty well
that's pretty cool
that's pretty accurate
yes, the quality is very decent
the images above got generated without inpaint so I think TI was very good, but it seems it's only efficient at learning my ugly face 😂
well discord stopped uploading the images 🤔
is there some way to finetune an embedded model?
for example, I got a training with very nice results but with an incorrect shape on the chin
https://drive.google.com/file/d/1FqEKUaUV5hlvhQQZrgp_z5KWHz-FWFje/view?usp=sharing 4 character FF7R model with Cloud Strife, Tifa Lockhart, Barret Wallace, and Aerith Gainsborough
new version of what I posted the other day on reddit
I just found the [face:.... face:0.5] parameter, are there more like that?
How did you get a 4 character model without them all merging together?
use per image prompts on the training images instead of a global class word
How do you do that?
https://github.com/kanewallmann/Dreambooth-Stable-Diffusion this repo enables it, uses the filename of the image
Ah ok and that works well then? It doesn't merge features at all?
Would you say if you're doing 2 characters you should do double the steps?
Also I wish I'd seen that earlier lol before I created this monstrosity.
that model was ~7500 steps and >500 training images
Ok I've been doing 4000 for 30 images for 1 character
can get more than one character in frame, still sometimes mixes their features but that's a global issue with SD
I believe by adding more multi-character training images it can learn to separate them better though
Ok, but I'm guessing if you use one of their subject words it just looks like them?
And you don't get, say, 1 person's outfit colours on another's
When you do this do you name the image something like "subject a and subject b"?
I'm using blip to interrogate the images, and then fine tuning the prompts from there
adding my character names in place of "man" or "people" and such, tweaking things like duplicates or fixing incorrect prompt words
Ok cool, thanks. I'll give it a go.
there's a bit of style transfer here, barret's outfit sort of looks like cloud's more than his own, but it sorta knows they are separate characters
my theory is adding multi-character training images helps attention work better, this is much better than a previous attempt and I only went from like 35 group photos to about 55
You think you need a lot of training images for it to work well then
I guess you could also use this to get multiple outfits for 1 character under different subject names as well
yes, I've spent many hours assembling the training set but working on automating, blip and txt2mask can probably help a lot in automating this
yes I tried adding many outfits into the training data as well, it actually helps with style transfer, it's easier now to get characters to wear other outfits from base SD by including many outfits in the training set, it can attend to the face vs. body I think when it sees a character name associated with different outfits
and also vice versa, putting base SD faces on characters
ok cheers I'm running through it now. Looks like it's not had the optimisations put into it like the Dreambooth-SD-Optimised has unfortunately, as it's running at 1.40s/it instead of the 1.1s/it you get on that.
yeah its a lot of hours of training
Also got a bit confused as it seems to expect you to have the images in a directory called images, so I was pointing it to where the files were and was getting a "Not A directory" error. But I think I've got it going now.
Well I must have, it wouldn't start if it couldn't find any training or reg images
I just have them all in 1 folder, but the files are named accordingly like it says on the Github page
So
SubjectA_001
SubjectB_001
SubjectC_001
And so on
yes
it only seems to generate one pair
Ok, would be nice to have all 3 so I can see that it's working properly.
it also does some batch size hunting at the beginning I think because my sets are mismatched
Probably not that hard to implement, but I'd have to find where it is in the code
I'm not positive how it chooses what to generate, I guess whatever prompt/image it is on when the training image logger triggers
yeah
Yeah I would assume that too. It must just look at the last training subject and class word and use that
probably could use some better collation
I suppose if it's doing that you could just feed the script the different names you've put in and tell it to do all of them at the interval
my training images almost all have entirely unique names
ex "cloud strife turned away from the camera with a buster sword on his back"
cloud strife standing on a sidewalk, night, streel lights_ (15).webp
Oh I've not gone that far with this test. Just using simple names for each character
tifa lockhart in a purple dress holding her right hand up standing in front of a shelf with liquor bottles on it's shelves.webp
a lot of stuff I get out of blip
yeah just a suggestion, I think it will help with attention in training, and also maintaining the existing model knowledge
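Since that repo is described above as using the image file name as the per-image prompt, here is a rough sketch of how a file name could be turned into a caption. This is NOT the repo's actual code; `caption_from_filename` is a hypothetical helper, and the rule of stripping a trailing counter like `_001` or `_ (15)` is an assumption based on the file names posted above.

```python
import re
from pathlib import Path

# Trailing counters such as "_001" or "_ (15)" that distinguish duplicate captions.
_COUNTER = re.compile(r"[\s_]*(\(\d+\)|_\d+)$")


def caption_from_filename(filename: str) -> str:
    """Drop the extension and any trailing counter, leaving the caption text."""
    stem = Path(filename).stem
    return _COUNTER.sub("", stem).strip()


examples = [
    "SubjectA_001.png",
    "cloud strife standing on a sidewalk, night, streel lights_ (15).webp",
    "tifa lockhart in a purple dress holding her right hand up.webp",
]
for name in examples:
    print(repr(caption_from_filename(name)))
```

Running something like this over a training folder before a long run is a cheap way to eyeball which prompts the images will actually be trained with.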
Ok, well I'm just running this for a couple hours before I sleep to see what it's like.
That SD-Optimized one also provides more images, like it shows you what phrase it was using to get the image. This one unfortunately doesn't.
So I'm just going to have to assume it's working
you can tweak the v1-finetuning.yaml and change the imagelogger interval
```
image_logger:
  target: main.ImageLogger
  params:
    batch_frequency: 400  # <---
```
oh
it got to 1000 steps and died
```
Traceback (most recent call last):
  File "main.py", line 832, in <module>
    trainer.test(model, data)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1128, in _run
    verify_loop_configurations(self)
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 42, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 186, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.
```
you have to pass the number of max steps or set it >1000
```
parser.add_argument(
    "--max_training_steps",
    type=int,
    required=False,
    default=12000,
    help="Number of iterations to run")
```
yeah it is overridden by the cmd line arg
I just changed it in code there on the arg definition
but you can just add it to your arg list too, either works
I believe max_steps in the finetune.yaml is also respected
so whichever comes first
I changed it in v1-finetune_unfrozen.yaml
And it didn't listen to that
I'll use the command line
the command line arg is inserting an interrupt thus the error message, the one in the yaml I believe doesn't throw an error but they behave the same, and both are active limits
it should still dump the ckpt either way
correct, as designed, you should set both limits higher
Ok, you didn't need to set it in the command line on the other repo I've been using
It just looked at the file
correct, kane added that for whatever reason I guess
you could just go into the argparse code and set the default value to 99999 or whatever if you want
above I set mine to 12000
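If both limits really are active as described above (that is the belief stated in the chat, not something verified here), the run stops at whichever is smaller. A small sketch of that logic: the `--max_training_steps` argument mirrors the definition quoted above, while `effective_step_limit` and the YAML value of 6000 are made up for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_training_steps",
    type=int,
    required=False,
    default=12000,
    help="Number of iterations to run",
)


def effective_step_limit(cli_steps: int, yaml_max_steps: int) -> int:
    """With two active limits, training ends at whichever is reached first."""
    return min(cli_steps, yaml_max_steps)


# Simulate a run where the command line limit is the low one.
args = parser.parse_args(["--max_training_steps", "1000"])
yaml_max_steps = 6000  # hypothetical max_steps value from the finetune yaml
limit = effective_step_limit(args.max_training_steps, yaml_max_steps)
print(f"training would stop after {limit} steps")
```

So raising only one of the two does nothing if the other is still the lower number.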
Yay, I'm glad the model will be seen by further people
she rides dressage and uses the labyrinth music for her routine, I sent her a photo of jareth on a horse 😆
Ok so I only had time to run through it 3000 steps tonight to test it, but it's worked reasonably well
1 of the characters doesn't turn into a blob like doing back to back training steps
There's still a lot of style transfer, but I think more training images and more steps will most likely help that
yeah include a few with both characters in it, with a proper prompt, I used ~55 images out of 550 as group photos, even that seems to have helped a lot, I think a bit more may be better
It seems like the class has bled into it a little too much as well
On a normal model, putting in the class doesn't really make a massive difference. On this, it starts making them look like the regularisation images
Maybe I used too many reg images
I'll have to give it another go tomorrow and leave it a lot more steps
But it's at least partially working at low steps
One of the models of the 3 is worse than the other
But it did have slightly less training images
It was Darkness that had less images and she doesn't come up much at all
I guess it could just be, low steps didn't have time to train as well with the lower images.
When training SD, what do I need to do to have it start from a previous checkpoint? I'm seeing references to --finetune_from and --resume and --resume_from_checkpoint
Which is it lol. I tried --finetune_from since that was referenced on the pokemon diffusion but it took me two epochs before I realized it was trying to start from scratch
textual inversion shenanigans with Yves Tanguy paintings as the input for training
if anyone wants to try
if it works like that.. heard you can just drop it in an embeddings folder on the automatic1111 fork but i'm not sure if just works plug and play like that
oh damn guess it does
Ran a few tests at different gpu configurations to see what made sense for training speed vs cost, maybe it will be useful to others
1xa6000 ($0.63 /hr)
80 in 2 minutes
4xa6000 ($2.40 /hr)
200 in 2 minutes
costs ~300% more and only 150% faster
4xa100 ($3.20 /hr)
252 in 2 minutes
costs 33% more and only 26% faster
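The numbers above can be turned into throughput per dollar, which makes the comparison a bit easier to read. A quick sketch; it assumes the "N in 2 minutes" figures are training steps (the post doesn't say which unit) and that throughput stays constant over an hour.

```python
# (price in USD per hour, units completed in 2 minutes) as posted above
configs = {
    "1xA6000": (0.63, 80),
    "4xA6000": (2.40, 200),
    "4xA100": (3.20, 252),
}


def units_per_dollar(price_per_hour: float, units_in_two_minutes: int) -> float:
    """Extrapolate the 2-minute sample to an hour and divide by the hourly price."""
    units_per_hour = units_in_two_minutes * 30  # 30 two-minute windows per hour
    return units_per_hour / price_per_hour


results = {name: units_per_dollar(*cfg) for name, cfg in configs.items()}
for name, value in results.items():
    print(f"{name}: {value:.0f} units per dollar")
```

On these figures the single A6000 gets the most work done per dollar; the multi-GPU setups only make sense if wall-clock time matters more than cost.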
thanks ! yes, it's plug and play like that. I could even continue training the thing
Do you care about BLIP's performance/accuracy or it's fine whatever the result is?
I tried it once for one type of images but I was afraid using it to caption my training data since the descriptions were very vague and not very accurate. Would that make tuning worse in your opinion?
For example, I skimmed some of LambdaLab's Pokemon images and BLIP did decent job IMHO so I concluded an accurate caption would benefit the fine tuning
Do you continue from your own checkpoint or from a downloaded one?
- for an external `ckpt` I found it most reliable to add a `ckpt_path` parameter in the config file (with absolute path to the ckpt file). I mean in the `params:` section of the `model`
- for my own tuning and resuming I had to use `--resume` to pass the exact logs directory (something like `logs/2022-10-07...`)
@hot breach That must have been quite the challenge. What do you think was your average (image,caption) output per minute ?
How do you assemble a training set? Are there places with good libs for that? I would be interested to finetune on some fantasy/sci-fi image gallery
a lot of it has been brute force data prep, but I'm working on workflow improvements
I'm usually just grabbing segments, working on some automation, it does weird things like "such and such in a sci fi fi fi fi fi fi fi" or "with a gun in their hand and a gun in their other hand" when there is no gun (probably due to black gloves) so I remove that stuff
I didn't fully blip prompt the entire set either, maybe like 30%, too time consuming for now but a lot of this can be automated
what do the parenthesis stand for? like what's the dif between, highly detailed, (highly detailed) and ((highly detailed))
Thank you for the feedback!
It sounds encouraging for smaller datasets (like < 1k samples)
prompt weighting
Is there a list of model weights and what they have been trained on? (I´m using Artroom stable diffusion, if that makes a difference)
Emphasis: use (text) to make model pay more attention to text and [text] to make it pay less attention
nice thanks both! @hot breach @rough hamlet
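To put numbers on the emphasis syntax: in the AUTOMATIC1111 webui each pair of ( ) multiplies attention by 1.1 and each pair of [ ] divides it by 1.1. That factor is the webui's documented convention, not something stated in this chat, and other front ends may handle brackets differently. A small sketch with a hypothetical helper, `emphasis_weight`:

```python
def emphasis_weight(fragment: str, factor: float = 1.1) -> float:
    """Return the attention multiplier implied by wrapping ( ) or [ ] pairs."""
    weight = 1.0
    text = fragment.strip()
    while len(text) >= 2:
        if text[0] == "(" and text[-1] == ")":
            weight *= factor   # one more level of emphasis
        elif text[0] == "[" and text[-1] == "]":
            weight /= factor   # one more level of de-emphasis
        else:
            break
        text = text[1:-1].strip()
    return weight


for fragment in ("highly detailed", "(highly detailed)", "((highly detailed))", "[highly detailed]"):
    print(f"{fragment}: x{emphasis_weight(fragment):.3f}")
```

So the three variants asked about differ only in how strongly the phrase is weighted, not in what it means.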
you might want to change the classifier if you're going to do that lol, currently its "Yves Tanguy Painting" instead of something like "*" which if im not mistaken takes up a lot of tokens lol
I try to keep my identifiers a single word, preferably not recognised already too much by the AI, or at least not recognised as something else (don't use a nickname for your face for example, I was called a monk a long time and I used monk as part of the identifier, learning went wild)
but with TI, if you use the identifier corresponding to an embedding, it's not the number of words in your identifier that matters, it's how many vectors per token you chose when you created the embedding.
like, waldo I shared earlier is 30 tokens on its own I think
it just depends how you create your embedding
I would choose a word like "yvestanguystyle"
and put about 8 tokens in to see how it works already
there u go then thanks
does anyone know why my loss while training with dreambooth doesn't go down? tested at 5e-5 and 5e-6 on 20 instance pics, 200 regularization, at 800, 1600 and 3200 steps
made with a textual inversion watercolor training
is it possible to finetune on sizes that aren't powers of 2? i want to use data from drawception which has a fixed size of 600x500
guys, what's best method to train a whole person with textual inversion? I'm running now a training but now I've separated some good quality images from head / face and body and I'm using subject_filewords so it use the filename to guide it, is this a good approach?
Hey there!
Quick question: I would like to start training a model in Automatic1111 and share the model with friends. Any advice? My questions are:
1- Should I train on my RTX 3080 or spin up an AWS instance instead if I want to do mass training (cost not really an issue, I have a pool of credits on AWS)
2- Can a model trained in Automatic1111 output a .ckpt that I put back in the Stable Diffusion folder?
3- Any training I could test to learn? (Ex: a sample folder with X images that I could follow)
4- Anyone have some time to walk me through the process? Will gladly pay for the training 🙂
I'm also learning, but the process is pretty straight forward on the web ui
I have a 3080ti too, with around 20-25 images it does approx 10000 steps per hour (may take less time)
it does generate a few .pt files
that go inside the embeddings and textual inversion folders
In the main folder, I have models > Stable-diffusion. This is where I put the models I find, for example: robo-diffusion-v1.ckpt
result from training I did
Hmm that's a different kind of learning from textual inversion, I haven't worked with ckpt yet but textual inversion is giving me some good results with some work
Wow, thanks for all the info. Let me know If I understand:
- Name: Whatever I want
- Initialization text: Description of the prompt for that image set?
- Number of tokens : 75?
- Source: Path to my folder with all the images
- Destination: Where I want the processed images
6: I click Flip and Caption?
with 6 tokens on textual inversion and only 8000 steps I could get some very nice results even mixing the style with real photos
This will create a nice 512x512 folder I can feed to the embedding part?
I am still working my way through the TI process, but my overall gist is (re: auto1111 TI interface, I should say):
- yes, use filewords, but maybe make a new subject_filewords.txt file for humans, because when you use the default, the phrasing tends to be a bit wonky... you don't have a good photo of "a jane_doe", y'know? 🙂
- I use more initialization text than I think may be standard: say, "person female woman actress". For some reason, "actress" really made a difference. I used to use very few words, but then I'd have the effect of "clear face, blurry everything else", which I suspect was SD saying "I know that jane_doe is this face, but I dunno WTF the rest of her looks like"... the extra words allow it to borrow from existing concepts, if that makes sense.
- I use 2 vectors per token max. Any more and I can't use the character the way I want (different costumes, styles etc).
- learning rate at 0.001 for about 3000-5000 steps. I actually usually save out 2 versions, one for "I need this to look like the character exactly" and one for "let's have some fun and be flexible".
Thus far, it's working pretty well. I refine it constantly, but I've made a handful of characters this way, and those are the general parameters I use.
A: Embedding: I guess this will have something once I did the first part?
B: Learning Rate: 0.015?
C: Dataset Directory: Any path I want?
D: Log - Ok
E: Prompt Word: uh?
F: Max Steps: 15 000?
G: Save image and save copy: disable or 500?
If you guys are running that training repo that lets you do multiple characters I've noticed that it doesn't have the speed upgrades.
https://github.com/kanewallmann/Dreambooth-Stable-Diffusion
You can get the speed upgrades from
https://github.com/gammagec/Dreambooth-SD-optimized
Just replace the files in the kanewallmann repo with the ones from Dreambooth-SD-optimized. Increases training speed to 1.06s/it on a 3090, instead of around 1.5 you get currently.
Files to replace:
/ldm/modules/attention.py
/ldm/modules/diffusionmodules/model.py
On Linux I also had issues with an older version of Pytorch being used causing it to use more memory and therefore run out.
To fix this edit the Environment.yaml so it uses:
- pytorch=1.11.0
- torchvision=0.12.0
- yes, you put the name you want, so that when you use it in a prompt it will use the info it learnt
- keep as is
- tokens I use 4-12, more tokens means it can get more details, but your prompt is limited to 75 tokens which means you will get more and more rigid results as you increase to max of 75 tokens
- source path is the folder with images (prefer to crop images to subject you want to learn)
- this is where the images will be output after you press the process button, and it is the path you will enter as the dataset directory
- I don't use flip or caption. I tried with caption and my results got worse, so I didn't really try much with it. I don't use flip because faces are asymmetric
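Quick sketch of the token trade-off from the list above: if the prompt is capped at 75 tokens and each vector of the embedding takes one of those slots (that one-slot-per-vector part is my assumption), then a bigger embedding leaves less room for the rest of the prompt. The function name is made up.

```python
PROMPT_TOKEN_LIMIT = 75  # prompt limit quoted in the chat


def remaining_prompt_tokens(embedding_vectors, limit=PROMPT_TOKEN_LIMIT):
    """Tokens left for the rest of the prompt once the embedding is in it.

    Assumes each embedding vector occupies exactly one prompt token slot.
    """
    if not 0 <= embedding_vectors <= limit:
        raise ValueError("embedding_vectors must be between 0 and the limit")
    return limit - embedding_vectors


for vectors in (4, 12, 75):
    left = remaining_prompt_tokens(vectors)
    print(f"{vectors} vectors leaves {left} tokens for the prompt")
```

At 75 vectors nothing is left for the prompt, which is the "more and more rigid" end of the scale.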
When I think about it, I imagine that you need at least several seconds on average to correct or modify each (image|caption) pair. Let's say 5s, which means you produce 12 pairs a minute. That's 720 an hour of really concentrated work, given that you have a streamlined workflow.
@silent spear gave some good tips, I'm still learning too. For the learning rate I use the default of 0.005; I have increased it but didn't notice any changes, might try decreasing it as he said. For steps, I've noticed that going above 5000 does not make a huge difference, but I always let it run up to 10-15k and then delete the higher-step files from the folder if I notice a degradation above a certain step count
yeah I spent a lot more than an hour on it lol
How many pairs did you produce? The order of magnitude, I mean.
Does the optimized version of Dreambooth work on a 3080?
I was screenshotting the game, then I'd go back resize/crop, then fix prompts
Oh, one thing that I think is very important (for everyone, regardless): never feed TI an image that isn't "right". If you give it a photo of your subject that looks different or weird or non-standard, it will latch on to that sucker and reproduce all the wonky qualities you hate in every image you create from then on.
That's very true, and it's very good at picking up bad details lol
a face I was training had a small shadow on the middle of the lower lip, and it made the algorithm think that the lower lip was split...
it got good once I removed that single image from the dataset
If you mean the Dreambooth-SD-Optimised one, no, it uses 23.3GB of VRAM.
If you mean the diffusers one, then sort of. It can work on just under 10GB VRAM. However, running Windows probably means you'd run out, so you'd probably have to run it in a Linux console for it to work.
Cool. Guess I'll wait until I have a 4090.
You can rent 3090s for like $0.40 an hour if you really want to give it a go
It's all good. 4090 releases next week and the price of a 3090 has fallen dramatically.
Where?
Vast.ai or runpod are the ones I know of
I'm using Vast.ai as it has a $5 minimum credit amount and Runpod has $10
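Back-of-envelope sketch for the rental numbers above. The $0.40/hour rate and the $5 / $10 minimum credits are the figures from the chat; applying the same hourly rate to both services and the 10 hour example run are my own assumptions, so check current prices before relying on any of this.

```python
HOURLY_RATE_USD = 0.40  # 3090 rental rate quoted in the chat


def rental_cost(hours, rate=HOURLY_RATE_USD):
    """Cost in USD of renting for `hours` at `rate` dollars per hour."""
    return round(hours * rate, 2)


def hours_covered(credit, rate=HOURLY_RATE_USD):
    """How many rental hours a given credit buys at `rate`."""
    return round(credit / rate, 2)


# Hypothetical 10 hour training run.
print("10 h run costs about $", rental_cost(10))
# Minimum credits mentioned in the chat, both at the same assumed rate.
print("$5 minimum credit buys", hours_covered(5), "hours")
print("$10 minimum credit buys", hours_covered(10), "hours")
```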
Nice thanks
i'll take a look
People who train and output a .ckpt (ex: the one who published robo-diffusion-v1.ckpt) - are they using another software?
The main tools output ckpt files. It's only the diffusers packs that don't
I've never used the diffusers one as I've heard it's worse quality
Can anyone explain to me what a token is exactly? Is it the training trying to cook information into a single word? I'm trying your params @silent spear, and with 2 tokens and 0.001 learning rate it's doing reasonably well, but 5k steps wasn't enough. I'm letting it train some more; at 6k it's giving some nice results, but I'll let it go up to 10k
what I'm noticing is that its learning less data, the character suit is less detailed, let me grab some examples of it
This is Wraith from Apex Legends, with 2500 steps / 6 tokens / 0.005 learning rate
this is with 2500 steps / 2 tokens / 0.001 learning rate
2750 steps / 2 tokens / 0.001 learning rate, when comparing to first image its much less detailed
and this one is the real character from the game
will now do a test with 12 and 24 tokens to see how this reflects the quality to have some parameters, but I believe I will get back to average of 4-6 tokens as they seem to give decent output
2500 steps / 12 tokens / 0.001 learning rate, yeah not very good tbh
3750 steps / 12 tokens / 0.001 learning rate, starting to get interesting. It learns more detail on the clothing, but the face seems better on the low-token one. Will run this up to 10k and see if there's a good improvement
here's some comparison
2 tokens (top) vs 12 tokens (bottom)
same prompt and parameters on both images, it's easy to see that more tokens create a stronger bias towards the source images
Thanks for this, worked really well
https://cdn.discordapp.com/attachments/1010577750077210726/1028065177473650789/unknown.png
can this run on windows with a 12gb card?
Trained for 9k steps. I used some files from the Dreambooth-SD-Optimized repo to speed it up
This might be a naive question, but I successfully was able to fine tune SD using a set of images I provided. If I wanted to train that same model on another set of images, can I do that? Can someone link me to an explanation? For instance, if I trained it on <cat-toy> images and now I want to add <dog-toy> images, how can I make that happen?
You can, but it doesn't work very well, it merges the results together. You're better off using the repo I was talking about above and training them all at the same time.
You could also try training both as different embeds and then referencing them using the composable diffusion operators ( https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/ )
like <cat-toy> AND <dog-toy>
There's a repo that's able to finetune dreambooth with 8GB of VRAM, have you guys tried it?
if you are talking about the diffusers dreambooth with deepspeed, I made that PR. It works fine for me on 8GB VRAM, some other people have tested it too successfully
it's also not written anywhere yet, but replacing the Adam optimizer with Deepspeed version of Adam gives very substantial speed up
Didn't work for me
How did you get it to work? I'm on a 3080 10GB with 64GB ram and it's throwing oom
I'm new here... can anyone direct me to where to learn better prompts? I keep getting images with blurred faces and extra limbs
Crazy that it had at least a degree of success anyway, did you find the quality significantly worse or its mostly the same?
Lexica
I don't think I did anything special, the instructions are included. Make sure to set batch size 1, gradient checkpointing and mixed_precision=fp16
Are there some sacrifices on quality for the 8gb repo?
no, but it requires a lot of CPU RAM
And is slower
If you could share your Adam code @stiff dust I'd like to try it again
Interesting this allows the tech to be a lot more accessible by just adding more system ram
from deepspeed.ops.adam import DeepSpeedCPUAdam
...
optimizer_class = DeepSpeedCPUAdam
Thanks!
it should work without that change too but using DeepSpeed Adam gives about 2x speedup
also, that requires a CUDA toolchain that is the same version as the one pytorch was built with
I had troubles with it and built pytorch from source to make it work
Has anyone tried finetuning the decoder?
Yo.....what she holding?
Learning rate changes the step size for each cycle (how much it's allowed to adjust the network weights). Higher values make it faster and make it easier to get past local minima (values that the ANN thinks are good), but they decrease stability and increase the risk that you'll blast past the actually optimal values.
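Toy sketch of that step-size effect, on plain gradient descent over f(x) = x^2 (minimum at x = 0). This is nothing like the SD training loop, just the smallest example I could think of; the function name and the three learning rates are picked for illustration only.

```python
def gradient_descent(lr, start=1.0, steps=20):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x       # derivative of x**2
        x = x - lr * grad  # the learning rate scales how far each step moves
    return x


for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: x after 20 steps = {gradient_descent(lr):.4f}")
```

With the small rate it creeps toward 0 slowly, with 0.1 it gets close quickly, and with 1.1 every step overshoots the minimum by more than the last, so the value grows instead of shrinking.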
does anyone have a sample prompt template for training specific people?
subject.txt and subject_filewords.txt have a bit strange combinations
what code are you running?
I'm using automatic1111's webui
so I am trying to run the Dreambooth on a runpod https://github.com/JoePenna/Dreambooth-Stable-Diffusion I have a Runpod with 24gb VRAM which I thought should be enough based on info I found... but I am still faced with RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 23.68 GiB total capacity; 18.33 GiB already allocated; 39.56 MiB free; 18.41 GiB reserved in total by PyTorch) at step 1... any ideas?
When running TI, is there any issue with using a longer initialization text than the number of tokens for an embedding?
I am running Arki's RunPod for training. When it's completed, do I download the last.ckpt in the logs (checkpoint) to put it in automatic1111?
Interesting, yesterday I was playing with values, I did tests up to 1.0 of learning rate, I've noticed that it was overshooting and on X step it was looking close but on Y step it was completely different, after some testing I think between 0.005 and 0.001 are good values
appropriate values will depend on both the dataset you're training on and what's already in the model. It's really good at some types of images and in those cases you can probably use really high values.
hmmm, what I've noticed is that it seems better if I throw very different images of same subject instead of lot of close images but only different angles
what prompts were u using?
pretty much yes
I believe the colab one does some pruning first to make it smaller, so make sure you read the instructions
Is it normal that I get an 11 GB file out of my 4 GB SD model?
And do you advise 2k, 6k or 15k training steps?
Yes, you can prune it down to 2GB or so. Not sure how the colab is set up for that though. I have a script to do it manually.
I've had good results training between 3k and 6k for single character models
Just make sure your training images are of good quality and you have a decent amount of regularization images
I use around 20-30 training images and 300 regularization images.
Can using too many regularization images be bad?
Yeah, seems to cause issues with the class
I use the pod ark has set, do you know the path to the prune script?
Got 73 training images, about 300 regularization images from the regular pack
No idea, I have my own that I don't remember where I got it from.
I'm trying out textual inversion in the automatic1111 webui on an AWS server. I used 5 training images, all head shots from different angles and backgrounds, cropped to 512x512. I edited the subject.txt file to match the photos, along the lines of "a close-up photo of a smiling [name] looking forward". I trained for 6,000 epochs, which was less than an hour. Is there something similar to what JoePenna describes on his dreambooth repo readme as "If you trained with joepenna under the class person, the model should only know your face as: joepenna person". I don't see a way to indicate the class in the webUI version. Is that needed in some way, or am I good to go with just using my token? TIA!
thanks for the help, i got a lot of answers 🙂
can anyone help me with runpod and automatic1111... the version of automatic1111 that the template starts is old... I am wondering how do I get the new version...? I did a git pull, but not sure how to restart everything... starting the webui-user.sh doesn't seem to do anything
I have downloaded my custom dreambooth model and want to use it in the runpod environment now...
I have used it locally successfully, but I want to run it with the powerhouse GPU I rent with the runpod service 😄
Before today it was just git pull and pip install -r requirements_versions.txt
But there were some breaking changes for that today
Yes, you have to connect to jupyter and update it
git pull
git checkout 4999eb2ef9b30e8c42ca7e4a94d4bbffe4d1f015
In a new terminal in the webui dir
Then pip install the requirements again and then restart the pod with the little pen icon
You have to checkout a commit from earlier today.
hm suddenly getting Sizes of tensors must match except in dimension 0. Expected size 82 but got size 77 for tensor number 1 in the list. even after restarting and even with default settings
huh okay I shortened the prompt which seems to have fixed it
so wondering, can you use good outputs of dreambooth creations as training for a 2nd round of dreambooth training?
base_learning_rate : any advice, i have 1.0e-06
I've seen people using lower value, like 5e-7 but the base 1e-6 has been serving me well on the mega model for ff7r so far
Isn't the class the 'Initialization text'? I've seen reference to people training sigh...'wet t-shirt' on 't-shirt' so that'd make sense.
In diffusers dreambooth especially, the class token is the most important thing to choose after the training images and the number of steps. A well-matched class can nicely fill in the gaps that the training data didn't cover
Armor class token for me ended up resulting mostly like this
And warrior class token is able to create something like this
Im having some ugly outlines that I can’t get rid of … with img2img . Any ideas. ? 🥹
Quick question about fine tuning: I’ve just finished creating a CKPT via dreambooth, which can run in SD. But it seems as though the training has influenced everything from that ckpt file. Every single person now looks like the person I fed into the prompt. Is it possible to isolate the CKPT file to only come when explicitly prompted, as with Textual Inversions?
Anybody know if I can train an additional person onto an existing dreambooth ckpt, i.e. put the 2gb model into the directory and train new faces and a new prompt onto it? It's a pain to swap between ckpt files when making images with different faces. Hope this makes sense.
This sounds like you used person as the class word and didn't have enough regularisation images. Or you did too many steps and it's bled into the class. Or both.
You can do it, but it won't be clean. It will most likely merge the two subjects together in some ways
if you have any checkpoints from partway thru training you could try those 🤔 or you could try merging the vanilla sd model with it
Hmm was afraid of that. Same thing happened when I tried to merge. Is this a feature that could ever be developed? Would be great to have one model trained to what I need
You can do it but you need to train them all at once
Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion (tweaks focused on training faces) - GitHub - kanewallmann/Dreambooth-Stable-Diffusion: Implementation of Dream...
That repo has the functionality
Trained it with 3 characters and it worked well
From speaking with someone else: you can apparently include training images where the characters appear together, which makes group shots more likely to work, rather than the usual result where you just get multiple clones
That's ace thanks! I'll give it a try
I had around 400 regularisation images, and did 2k steps
Question about Dreambooth style training and regularization image.
When I train a new style, what do I put in the training and regularization images?
Training images: various class images in the specific style
Regularization images: photographs of various classes, e.g. cat, car, dog, person, etc…
Is it ok?
Do you think that there is a way to finetune the model to give only good images to our liking (using a simple classification good versus bad of images output from normal model for retraining)?
Is it possible to combine AItemplate and Xformers to get the benefits of both?
Hi guys,
Which enhancer works best for anime faces? Like the GFPGAN but for anime o:
anime specific SRgns or real-esrgan
might help a bit at least
they are super resolution GANs, but typically you can downsize then upsize to have it fix up the styling some
was able to get some concepts of the city and style for midgar into latest multi-model for FF7R, along with a 5th character
I've been experimenting with that in a way. I created a TI style called "me-likey" based on 300 images I thought looked good, then just added it as a style influence the way I would any other style. Sadly, it probably takes more discipline and diversity than I have patience for... the definition of "good" clearly has some commonalities I didn't notice (unconscious bias) so the outputs were all kinda samey and eventually painfully repetitive. I'd probably have to start with a much bigger, much more random base to avoid that. But it's definitely possible, at least.
Thanks for the tip, this helps me understand class better!
Cool. I wish I had a GPU to test on 10000 input images
does anybody here have experience with fine tuning? like actual finetuning? like this https://github.com/LambdaLabsML/examples/tree/main/stable-diffusion-finetuning
Are there any good rules of thumb for picking a TI step count? Is there a mathematical formula or something?
Hey, im looking for some help or direction on how to get started training my own model. if anyone could point me in the right direction that would be great
photoshop em out
Can you provide any specifics? I recently (finally) got a model trained from scratch on a proprietary dataset at my company, I might be able to offer some advice if you like.
I'm thinking of writing up my experiences and troubleshooting into a blog post sometime soon too
Anyone able to use the Shivam Shrirao Dreambooth colab notebook on the premium google GPUs? It only seems to work on standard GPUs for me.
I have seen a lot of people use their own faces for a model and I was wondering how they did that and if I'm even capable of doing it on my own machine
i have an rtx 3070 and a r7 5800x
it's called DreamBooth and I don't believe it would run on either of those, you can rent a runpod
you can use textual inversion
47 votes and 14 comments so far on Reddit
this was a badass reddit thread
after looking at all the results/timings, I don't really see a reason to use any of the dpm or heun samplers
For textual inversion is it better to have around 3-5 images or is it better to have more images of the object/style?
And if I were to increase the number of images, would I need to increase the training steps?
from my newbie tests I found that it's better to have different images of the subject being trained, with different lighting and backgrounds. Face closeups worked worse for me than shots of the whole face from further away, but you must be careful as it may learn other stuff around the subject too, like wallpaper or text (like the text on my chair). From my experiments, fewer images with more variation, where the subject you want to train stays stable, work better than lots of pictures of the same thing with small changes
3-5 images was enough for some stuff, others I trained with 10-15 images... others with 20-25 images... it really depends on whether you're getting the results you want in the output, but more images did not always mean more quality in my tests
steps are also something weird. Sometimes I got a well-trained result with 2000-3000 steps for one face, for another I had to run like 6000-8000 steps, and in a few situations I had it running up to 30k steps. Above 10-15k I didn't notice a huge difference, but that's for what I've been training; maybe something that's not a person's face could take advantage of more steps
rtx 3070 is 10gb or smth, right? There was an 8gb version of DB I saw a while ago, though I haven’t tried it myself
It uses deepspeed
doesn't work for me and a bunch of other folks
seems like it OOM when pinning the memory for the optimizer even if I have plenty more free memory
yes its 8gb, but ill look into that to see if there are any updates then
Anyone trained subjects that have very limited training images (rare character or object)
Oof that sucks
Is anybody making a horror/body horror kind of model, because that would be epic
Hey guys!! I am running Dreambooth on a 3060 and can't tell if it's doing 'good enough' or not. It very clearly has to do with settings - but I am getting a high loss this time. Should I nuke it and try something else? My first try I got like .04 or something like that - but I was extremely unsatisfied with the results, so I deleted the model and tried again
ahh! thats a great idea too! I saw a cool one with Studio Ghibli art style - was that you?
right now experimenting with my face, and it seems like it's working! doesn't feel 'exact' so I think I'll just have to fine tune it as I go along. My eyes were really bad the first time I tried it, and so were the teeth, but when I ran it again, it got better.
Thanks for the response! I was a little confused about that!
how did you get that "loss" data to show up? i'm only seeing this;
```
Total progress: 100%|████████| 30/30 [00:35<00:00, 1.19s/it]
100%|████████| 60/60 [01:13<00:00, 1.23s/it]
Total progress: 100%|████████| 60/60 [01:12<00:00, 1.22s/it]
```
I saw that when my output folder had info in it already. What I had done - was nuke everything in the output folder, and then ran it again
3060 - 165 photos - only took 19 minutes. I was surprised.
Please do!
some info on how I'm producing the FF7R multi-character + style model: https://gist.github.com/victorchall/67bc53472f86641aef1ebee1e154f5d1 best read if you already know about dreambooth and how to run it locally on a 24GB card (joe/xaiver/kane etc repos without using notebook)
https://github.com/huggingface/diffusers/commit/66a5279a9422962b1cff3ad0e5747e8903ae067b wow ok forget my link above... this may be massive, attention bounding box capability
Hey for those who might be interested, I modified the Dream Booth notebook, mostly so that all available parameters of the training function can be used. I also reorganized a bit because of that, and added the possibility to just put a path to a gdrive folder where your images are from the start. Oh and an auto-disconnect at the end so you can let it run and it will disconnect on its own without having to wait to be kicked out.
https://colab.research.google.com/github/Zinston/colab_notebooks/blob/main/DreamBooth_Stable_Diffusion_(advanced_settings).ipynb
new FF7R mega model, 1400+ training images, 13k+ steps, info/pictures here: https://old.reddit.com/r/sdforall/comments/y1ojm6/ff7r_mega_model_v4_style_characters_and_more/iryj3n6/
several styles and a bunch of characters all in one
what setting am i missing? i just want more sky
created an issue on the diffusers repo for the 8gb version if anyone wants to chime in their experience
Anyone know if Dreambooth with over 2000 samples makes a difference? Tried 2000 after doing 800 and it was night and day.
Also - probably obvious - but restore faces has the unfortunate side effect of removing some features.
what's a good number to set the "image creation progress every N" to?
in the web UI
Found this in a post previously, anyone tried it? Allows fine tuning but isn’t dreambooth and looks like you can have captions with the images etc
I’ve run out of compute units for the month trying to get Textual Inversion working on Hugging Face’s colab…
Or a hypernetworks finetune colab? I’d love to try an alternative to TI and compare.
Is joepenna’s DB version still up to date, or are there better models for finetuning now?
out painting has been meh, at best, for me as well. I think its not quite up to the task yet.
i'm trying to achieve the style of Heroes of Might and Magic 3 with Textual Inversion, I've used these 11 images, maybe the descriptions are not the best
I've used 8 tokens and the initialization text was illustration style
I've used only this line in my custom prompt template:
[filewords], in style of [name]
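Sketch of what a template line like that expands to for one training image, as I understand it: [filewords] gets the caption for the image and [name] gets the embedding name. The function name and the example caption are made up, and this is a plain string substitution, not the webui's actual code.

```python
def fill_template(template, filewords, name):
    """Expand one prompt-template line for a single training image.

    Assumption: [filewords] stands for the image's caption and [name]
    for the embedding name; both are substituted as plain text.
    """
    return template.replace("[filewords]", filewords).replace("[name]", name)


template = "[filewords], in style of [name]"
# Hypothetical caption; the embedding name is the one from the chat.
print(fill_template(template, "a castle on a hill", "sksv3tk8styleinitcustomtemp"))
```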
After 5500 steps, loss is at 0.28 👀
what am I doing wrong?
should I rename my sample images?
the name of the embedding is homm3-v2-tk8-illustrationstyleinit-customtemp, is it a problem that I use homm3?
if I generate an image with the prompt homm3 it seems like SD has some knowledge of it, can it influence the result?
ok, i'll try it again
new descriptions
a new name for embedding: sksv3tk8styleinitcustomtemp
initialization text is style
same prompt template
Using auto1111, how do I know I'm generating images using the embed I created with textual inversion?
I'm pleased to announce that the auto1111 repo now supports generating embeddings as shareable .png images:
Which can be dropped into your embeddings folder and loaded in just the same way a .pt can.
Uses the custom prompt from "Preview prompt" so you can choose what image is used as a representation of your work.
ran it again till 5500 steps with the settings above, got a bit better results, but still doesn't look good
a castle, in style of sksv3tk8styleinitcustomtemp
with 5500, 4500, 3500, 2500, 1500 and 500 steps
same seed with simply a castle
what am i doing wrong? :/
doesn't look like an unreasonable style from the images.
dilute the idea of 'castle' a bit in your prompt?
I don't know, this is with London, sksv3tk8styleinitcustomtemp
None of them look like London and I don't know where these characters come from
I don't know, you can't really tell from a single image
I try different styles eg. comic
if it won't be anything like a comic, it's overtrained I think
but I might be wrong
ok ill try
can't tell without seeing your training images really
usually with overtraining you get "cooked" faces, like sunburnt or too much contrast. You can sometimes get it to look good again by lowering the CFG scale, but it's best to use a copy of your CKPT from earlier steps if you start to see that
I know some repos/colabs don't let you control it, but some let you keep copies of the CKPT every so many steps, or you can start training again on an existing trained CKPT just to add some steps
like the features are too defined
ignoring the common issues with eyes and teeth, thats just SD
yea I used my own GPU I have all the checkpoints
Feels like it's just a tiiiiiny bit overtrained. Though the "render as a comic" trick is a good way to tell.
I did 100k steps on 800 images
oh
so I just left it at that, its textual inversion not dreambooth
it took 7 hours
on an ampere a5000
ok, yeah I'm using "dreambooth" and the most I've ever done is like 15k steps with 1400 images which is maybe 8-10 hours on a 3090
not quite comparable
can u share the guide u followed
I tried to use dreambooth first but couldnt find proper docs
and was getting VRAM issues
I have 24GB
https://gist.github.com/victorchall/67bc53472f86641aef1ebee1e154f5d1 this is my (short) guide for using kanewallmann's repo with captions for every image
thanks
I'm doing multi-character + style training all in one go
you can name your training images like "[whatever name] in a white dress holding her left hand up to her face.png" and such
example filename in training set "a food truck in the slums distrct of midgar city with people standing around_2.png"
"a close up of barret wallace in a brown collared vest and a necklace around his neck with a concerned look on his face_ (31).webp"
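Sketch of how a filename like those could be turned back into a caption: drop the extension, then drop a trailing counter such as `_2` or `_ (31)`. The counter rule and the function name are my guess from the two example filenames, not something taken from the repo.

```python
import os
import re


def caption_from_filename(filename):
    """Derive a caption from a training image filename.

    Assumption: a trailing "_<number>" or "_ (<number>)" only exists to
    keep filenames unique and is not part of the caption.
    """
    stem, _ext = os.path.splitext(filename)
    return re.sub(r"_\s*\(?\d+\)?$", "", stem).strip()


# The two example filenames from the chat, typos included.
print(caption_from_filename(
    "a food truck in the slums distrct of midgar city with people standing around_2.png"
))
print(caption_from_filename(
    "a close up of barret wallace in a brown collared vest and a necklace "
    "around his neck with a concerned look on his face_ (31).webp"
))
```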
My tolerance for accuracy is a bit wonky, but generally speaking ~3000-9000 is pretty good on TI, I find. But with 800 images, that changes the math a bit. Best way to think of it, I find, is to imagine someone locked you in a room with a pile of images and said "tell me what all these images have in common" ... the longer you worked at it, the more delirious you'd get, and you'd start seeing patterns that weren't really there, and eventually you'd become convinced that this "subject" was all about the mole on her right cheek. You want to reduce the number of steps accordingly, to avoid the trainer losing its mind.
(this is also why the comic style test is useful, because if you've overdone it, it won't be able to fathom that person existing in any form except a photograph, because it's convinced the "photograph" aspect is vital to the definition)
actually my prompt is dry acrilic painting but all the images are photographs
so I must have overtrained a lot then
I am still fighting with the multi-element stuff in Dreambooth. Getting there, slowly.
try "a photo of [whatever name], 50mm f5.6" or something like that
without the brackets
Sometimes the boundary between overtrained and not-overtrained is really slight. I have two versions of most of my models, so I can swap when the "strict" version doesn't work. For instance, one of my characters absolutely cannot wear shirts with collars using the "strict" model, but drop to the "flex" version and it has no problem at all. The difference between strict and flex is 500 steps. 3500 to 4000.
I guess yea you'd just have to upscale them to 512x512 for training I think
I've trained it on ~300 transparent PNGs and it didn't have any issues relating to the transparency, at least. I just couldn't make it generate anything useful from the training. Might've been my source, though. It wasn't the most cohesive set.
Is it a nice practice to mirror the images to get more images for training?
I mean is it a good idea?
I haven't had a problem with it myself, but apparently mirroring can cause problems if your subject NEEDS asymmetry, like one side is uniquely different than the other, and needs to stay that way. Again, I've never had the issue myself, but I've read that it can be an issue.
What are the top ESRGAN models everyone is using for (drawing, illustration and photos)?
Okay, a serious attempt that wasn't a Rick Astley embedding, ulrikbadass's strong outline style work:
This embedding-image tech is one of my favorite things ever. Here is a quick image using my current WIP.
Nice, spookily similar style chosen.
Oh, I used your embedding style for that. Just wanted to see how nicely it would play with my character embedding.
Ahh, that makes much more sense! 🧠
I think transparency is a problem, after 7000 steps, everything has a black background
Doesn't seem like a problem with transparency, it'll be squashing it to RGB, and most PNG encoders set the RGB to constant in 100% transparent regions.
Just data as presented
If they're doing that, you might get weird colour edges where the image transitions to 100% transparency and the colours cut off, though.
whats the recommended amount of steps? it doesnt seem to change GPU peg at all.
changed the diffusers dreambooth to accept multiple subjects in one training session for anyone interested
https://youtu.be/7m__xadX0z0
Hey guys, I followed this tutorial to train the model on custom people, however I'm struggling to use them correctly.
It's as if I was very limited in creativity once I include my person in the prompt.
Is that something to expect based on the quality and diversity of the training images ?
My training pictures are mainly portraits and not very diverse in terms of composition (chest and above)
Also I'm wondering if you can overfit the model ? Is 4000 steps too much ? I guess the amount of steps to use depends on the amount of training pictures ?
Dreambooth is Google’s new AI and it allows you to train a stable diffusion model with your own pictures with better results than textual inversion. Dreambooth is originally based on Imagen text-to-image model and this technology makes it possible for you to insert any character (yourself, your friends, your family), object or animal you want in...
added image and checkpoint logging like in the non-diffusers repos
I have no idea, I’m just trying everything I can 😂
Sensible.
So far I settled with 5k. 800 to 2k made soooo much difference. Then I realized I could just keep raising it.
If you cannot style it, it's probably overfitted... try fewer steps!
So - I ran 2 people in Dreambooth. When I ran the second one, it seems that it overwrote the first. Is there a way to... you know... not?
how do you generate an embedding as a png?
thanks ;) will try
There's an option in auto1111, it's on by default, it creates the embeddings in a folder called image_embeddeds next to the regular embeddings.
oh i see it now
What is best for training the model to recognize a certain pose? Textual inversion, dreambooth or training a hypernetwork?
I have 50-100 images of the same pose, but it's not the same person in any of them. Would that be something that could be trained?
Probably. On that note, does anyone have a reference for a list of all of the classes? Like person, dog, etc.
at least with imagen there was another channel you could use and one guy on LAION used a pose estimator to create that extra channel's worth of data, so his models would always do really well with posing, I'm not sure if anything like that would be possible with SD but I'm not a good enough programmer to know for sure.
Preprocessing with "Use BLIP for caption" used to name the images with the captions; now the captions are placed in a txt file. Does anyone know why?
I used transparent PNGs for training and It was a bad idea: the transparent areas become opaque and the invisible part is trained as part of the image getting weird results when using the resulting model for generating new images... 😦
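One way to avoid surprises like the above is to flatten the alpha channel yourself before training, so you decide what the transparent areas become. This is only a sketch of plain alpha compositing on RGBA tuples (0-255 per channel), assuming a white background; it is not what any particular trainer does internally.

```python
def flatten_pixel(rgba, background=(255, 255, 255)):
    """Alpha-composite one RGBA pixel (0-255 channels) over an opaque background."""
    r, g, b, a = rgba
    alpha = a / 255
    return tuple(
        round(fg * alpha + bg * (1 - alpha))
        for fg, bg in zip((r, g, b), background)
    )


def flatten_image(pixels, background=(255, 255, 255)):
    """Flatten every pixel of an image given as rows of RGBA tuples."""
    return [[flatten_pixel(p, background) for p in row] for row in pixels]


# A fully transparent pixel often still carries hidden RGB values (black here).
image = [[(0, 0, 0, 0), (200, 30, 30, 255)]]
print(flatten_image(image))
```

If the alpha channel is simply dropped instead, the hidden RGB values of transparent pixels are what the model sees, which is one plausible explanation for the black backgrounds mentioned earlier.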
Hello, I am trying to finetune SD with Textual inversion and I am getting poor results
In short, the training and val losses oscillate during training, and the images generated in logs/images don't show any improvement even after 3h of training
I am using the InvokeAI colab notebook (https://github.com/invoke-ai/InvokeAI), which uses the hyperparameters of the original paper, and I have 5 images in my dataset, as it recommends
Here are images in the training set :
And here is a sample of the results :
Argh, sorry about that. I'm wondering if the transparency actually played more of an issue in my tests too. Maybe I was misunderstanding where the problems were coming from :/
I guess we will have to wait for another model that takes into account the alpha channel as well as RGB
I haven't used the InvokeAI training myself, but I find that with styles (especially more fantastical styles, or with fantastical subjects) you need more source images than normal training. At 5 images, it will be picking up the gist of the style (which it seems to be doing well) but has no idea how to apply it more broadly. Depending on your tolerance for pain, I'd try adding images in batches of 5 to see what happens. There'll be a sweet spot in there somewhere.
Interesting, thanks
Also if you know an alternative way to finetune SD with textual inversion I would be glad to check it
I've switched over to using Automatic1111's version lately. The settings are a bit easier to manage, and it gives fancy PNG-based embeddings, which are 900 kinds of awesome.
I used to run a more customized version based on some colabs (a few weeks ago) but Auto just moves so fast that anything else felt like I was missing out on the future. Though as a warning: you may wake up some days and discover everything has turned upside-down and nothing works anymore. Just wait a few minutes and there'll probably be a new version you can git pull.
The Arcane Vi model is out, Let me know what you think and hope you like it!
Download link here:
https://t.co/Fad09PI8Z8
#digitalart #arcanefanart #LeagueOfLegendsFanArt
What do you suggest for the number of vectors per token for a style transfer training ?
I would have tried myself before asking if I wasn't renting a GPU for that 😅
I think for style training you want to be around the 10ish range (someone can correct me if I'm wrong). Then again, your specific style is both visual and conceptual, so maybe a bit higher would work too. All I can say for sure is that I once trained a style on 50 images at 30 vectors/token and all it would produce was very strange garbage that looked similar to my source, but in a truly demented way. Heh. Not sure that helps 🙂
If anybody wants to try experimenting with textual inversion which limits embeddings to the range of weights seen in the original embeddings, I wrote some changes to Automatic's code with his help. You can replace modules\textual_inversion\textual_inversion.py with this, and you can play with the power of the effect on line 265. The original author of the textual inversion paper said that it should in theory help to retain editability of an embedding, and make it play nicer with other prompts
The changes are just the function determine_embedding_distribution, and where it's called to get the floor/ceiling, and then where they are used
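For anyone who wants the general idea without opening the file: a rough sketch of limiting a trained embedding to the range seen in the original embeddings. This is NOT the code from the modified textual_inversion.py. The function names, the per-dimension floor/ceiling, and the `strength` blend are all assumptions made for illustration.

```python
def embedding_range(original_vectors):
    """Per-dimension (floor, ceiling) over a list of equal-length vectors."""
    dims = list(zip(*original_vectors))
    return [min(d) for d in dims], [max(d) for d in dims]


def clamp_embedding(vector, floor, ceiling, strength=1.0):
    """Pull out-of-range values back toward [floor, ceiling].

    strength=1.0 clamps fully, 0.0 leaves the vector untouched
    (a stand-in for the adjustable "power" of the effect; assumed behaviour).
    """
    result = []
    for value, lo, hi in zip(vector, floor, ceiling):
        clamped = min(max(value, lo), hi)
        result.append(value + strength * (clamped - value))
    return result


originals = [[-0.5, 0.1, 0.3], [0.2, -0.4, 0.6]]
floor, ceiling = embedding_range(originals)
print(clamp_embedding([1.0, 0.0, -2.0], floor, ceiling))
print(clamp_embedding([1.0, 0.0, -2.0], floor, ceiling, strength=0.5))
```

Values already inside the range pass through unchanged, so only the outliers that training pushed far away from normal token weights get pulled back.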
What went wrong with this embedding? Too many steps?
Feels like it, yeah. How many steps vs source images?
13 source images of my dog and 6k steps
1k steps the animal didn't quite look like my dog, it looked very generic.
if you look in dataset.py you can see it's looking at these text files now
text_filename = os.path.splitext(path)[0] + ".txt"
Hmm, yeah, the sweet spot is probably in the 3-4k range, I suspect. I'm finding there's a spot pretty early on (1.5-2k) where things start to look decent, and by 3k they look solidly recognizable... and then it goes downhill fast once you pass 5k. But then no two trainings seem to be exactly alike, so it's hard to pin down an absolute truth to all this. At least not yet. I'm still working out a process 🙂
@ashen perch if it can find the text file, it uses the words in there, otherwise it splits them from the filename (the old way)
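The lookup described above can be sketched roughly like this. Only the `text_filename` line comes from dataset.py as quoted; the function name and the exact rule for splitting the filename into words are assumptions for illustration.

```python
import os
import re
import tempfile


def caption_for(image_path):
    """Prefer a sidecar .txt caption; otherwise fall back to the filename."""
    text_filename = os.path.splitext(image_path)[0] + ".txt"
    if os.path.exists(text_filename):
        with open(text_filename, "r", encoding="utf8") as f:
            return f.read().strip()
    # Fallback (assumed rule for this sketch): split the bare filename into words.
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return " ".join(re.findall(r"[A-Za-z0-9']+", stem))


with tempfile.TemporaryDirectory() as folder:
    with_txt = os.path.join(folder, "0001.png")
    with open(os.path.join(folder, "0001.txt"), "w", encoding="utf8") as f:
        f.write("a castle on a hill\n")
    without_txt = os.path.join(folder, "a-stone_mausoleum.png")

    from_txt = caption_for(with_txt)
    from_filename = caption_for(without_txt)

print(from_txt)
print(from_filename)
```

So renaming the images by hand still works, but a .txt file next to the image wins whenever one exists.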
I already renamed them manually 😄
I might do something wrong 😦
I took some screenshots from Heroes 3 HD, and cropped some parts into separate images, rescaled them into 512x512 and named them
there are 40 images total, +40 mirrored
I think the descriptions might be the problem
at 1600 steps, it produced results like this (I saved an image every 100 steps)
at 2500
at 5k
and it became worse, I ran for 23500 steps 😄
blip is kinda stupid... like 90% of the pictures of me that I ran thru it says that im holding a pizza or a cellphone or a remote and wearing a backpack.
is there a better option than BLIP?
In your opinion, is it better to have as many images as possible from the same artist, even if the style (mostly color palettes) diverges slightly,
or should I stick to the ones I really want SD to learn ?
try the new deepdanbooru
Please help me what I’m doing wrong
I used 30 tokens and * as initialization text
I am as ignorant as you unfortunately
We can just wait for people to live their lives before they answer 😛
I’ve been trying to make it work for a week and I got about 0 help from here 😅 I don’t know if the file names are wrong or the images or the initialization text
In my limited experience, the initialization text can make a huge difference
your file names look fine
And what should I enter if I wanna train styles?
For me entering the actual style I wanted to learn gave better results
"Sci-fi" or "space-opera"
I may be very wrong though
(and may confuse initialization text, embedding name and initializer words 😅)
so i am trying to finetune a model using this pokemon diffusion tutorial https://lambdalabs.com/blog/how-to-fine-tune-stable-diffusion-how-we-made-the-text-to-pokemon-model-at-lambda/ and this repo https://github.com/justinpinkney/stable-diffusion and I keep running into this error:
```
Found nested key 'state_dict' in checkpoint, loading this instead
/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/test_tube.py:105: LightningDeprecationWarning: The TestTubeLogger is deprecated since v1.5 and will be removed in v1.7. We recommend switching to the pytorch_lightning.loggers.TensorBoardLogger as an alternative.
  rank_zero_deprecation(
Monitoring val/loss as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': 'logs/2022-10-13T08-13-33_pokemon/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': None, 'save_top_k': -1, 'every_n_train_steps': 2000}}
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
/venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py:20: LightningDeprecationWarning: The pl.plugins.training_type.ddp.DDPPlugin is deprecated in v1.6 and will be removed in v1.8. Use pl.strategies.ddp.DDPStrategy instead.
  rank_zero_deprecation(
/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:297: LightningDeprecationWarning: Passing Trainer(accelerator='ddp') has been deprecated in v1.5 and will be removed in v1.7. Use Trainer(strategy='ddp') instead.
  rank_zero_deprecation(
/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:317: LightningDeprecationWarning: Passing <pytorch_lightning.plugins.training_type.ddp.DDPPlugin object at 0x7fd1e0c3f790> strategy to the plugins flag in Trainer has been deprecated in v1.5 and will be removed in v1.7. Use Trainer(strategy=<pytorch_lightning.plugins.training_type.ddp.DDPPlugin object at 0x7fd1e0c3f790>) instead.
  rank_zero_deprecation(
/venv/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:92: PossibleUserWarning: max_epochs was not set. Setting it to 1000 epochs. To train without an epoch limit, set max_epochs=-1.
  rank_zero_warn(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Traceback (most recent call last):
  File "main.py", line 846, in <module>
    data.prepare_data()
  File "/workspace/stable-diffusion/main.py", line 211, in prepare_data
    instantiate_from_config(data_cfg)
  File "/workspace/stable-diffusion/ldm/util.py", line 79, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/workspace/stable-diffusion/ldm/util.py", line 87, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
  File "/venv/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ldm.data.local'
```
any ideas on how to fix that?
@midnight knot my blip captions were also garbage and i can confirm that the deepdanbooru were much much better and more accurate.
They're somewhat better in my case. not great either tho.
you can also change the tags (like 1girl to girl or brown_hair to brown hair) by using notepad++: press ctrl + shift + f (Find in Files) and select the folder with your captions
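If you'd rather script the same tag cleanup than do it in an editor, here is a small sketch. The function name and the `renames` mapping are made up for this example; it assumes the captions are comma-separated tags.

```python
def normalize_tags(caption, renames=None):
    """Rewrite comma-separated booru-style tags into plainer words."""
    renames = renames or {}
    cleaned = []
    for tag in caption.split(","):
        tag = tag.strip()
        if not tag:
            continue
        tag = renames.get(tag, tag)            # explicit renames first, e.g. 1girl -> girl
        cleaned.append(tag.replace("_", " "))  # then underscores become spaces
    return ", ".join(cleaned)


print(normalize_tags("1girl, brown_hair, looking_at_viewer", {"1girl": "girl"}))
```

To apply it to a folder you would read each caption .txt, pass the text through `normalize_tags`, and write it back; keep a backup of the originals first.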
yeah
i gotta figure out if im doing things correctly. tried training hypernetwork but it didnt work at all
i see i am the only one with this error here. i searched it in the search box and nothing came up.
so that means i must have missed something in the tutorial. but the tutorial is so lightweight idk.
try newer python
like which one
but also i don't think that's going to fix the error because clearly a file seems to be missing i think?
The issue might be the stable diffusion version
I didn't look in the details though
you might be better off looking for a more popular repo.
i am already following a waifudiffusion tutorial but i still wanna know what's wrong with this one lol
ill try 3.10.6 then
It kinda depends, but I can put it this way: I loaded up a ton of images from an artist who does painting and pencil art (nicely shaded, but noticeably pencil). The results were insanely messy, because it seemed to be trying to reconcile the two visual effects at the same time (weirdly, eyeballs all came out very pencil-drawn, while the rest of the face was painted). So if it's mostly the color palettes in play, you might be able to mix and match, but in general I will load an artist's different styles as unique embeddings, just to keep them distinct.
(and, as with everything SD, don't show it anything you don't want it to learn from, because it will invariably obsess over the ONE image you didn't really want 🙂)
Got tired of gimp: https://github.com/dfaker/quick-ti-cropper
quick cropping utility to grab 1:1 ratio sections from a folder of images
This one is gonna be tricky 🙂
So, as a kinda foundational concept, assume the AI is incredibly stupid and needs a lot of hand-holding. Give it a picture of a mausoleum and it will say "WTF is that?" Tell it it's a mausoleum and it will learn that all mausoleums look like that one image. Ask it to draw a mausoleum and it will spit out exactly what you showed it. That's why we use multiple images for each subject, to help it learn enough about a mausoleum that it can connect certain dots and make up its own stuff.
For TI, I like to give it some help by feeding some initialization text like "building" so it can blend its general knowledge of buildings with the image of the mausoleum, which lets it fill in gaps more easily.
BUT: you are actually feeding it multiple things at once. You've got a style + subjects. So you're basically saying "here's a picture of a castle in the middle of a forest" and the AI is learning and saying "OK, I am ready for another castle in the middle of a—" and then you give it a picture of a medieval building with a flag on the roof. And the AI is saying "WTF I don't see what any of these have in common..." and starts grasping at straws to find commonalities.
Now again, if the initialization text says "building" then it might have a bit more of a foundation (ha!) to build on, but in my experience it's gonna struggle either way. The longer you train it, the more desperate it will be to find ANYTHING that connects the source images, and you'll start to get really freaky images. I have become overly fond of the concept of locking the AI in a room with a set of photos and telling it the only way it can get out is if it figures out what they all have in common... the longer it's in there, the more delirious it's gonna be 🙂
(tbc)
I actually suggest not following the tutorial fully. They used Blip for their training set labels and they are terrible. It can produce pokemon every time, but it really doesn't take advantage of a fine-tuning.
It might have as well used a single token.
When I train styles, I like to give it very plain subjects that I know it will understand. Standard-issue humans, trees, bridges etc. Things I know for sure it will understand without much hassle. That way, it will be looking at the style, not the subject. Anything "unique" or "different" will send it down the wrong path.
The tricky part with your set (and this is out of your control, I know) is that you've got cool architecture and angles and overall CONCEPTUAL style, as well as the visual style. So the AI is going to be struggling to understand what it is you want it to learn. Especially in things like the mountains (which may not match any mountains the AI recognizes off the top of its head) or the graveyard (which is busy and probably hard for it to pick out individual features from). Imagine it's locked in a room and looking at that graveyard pic and you're saying "what do you notice about this photo of a graveyard?!" and it's scanning its memory trying to reconcile what it knows about graveyards with what you've given it. It's not going to focus on the art style, it's going to obsess on the wrong things.
All of which is to say: I would trim back your set to only include images that are fairly clear and distinct, where there is clearly a building that it might recognize as a building, with very little excess noise around it (so avoid shots where the BG has colors that match the foreground, or it may not recognize the object) and see how that works.
Then, once you get your style locked, you can use that style to generate new sub-classes of, say, architecture. So you can say "draw me a mausoleum in the style of style-123" and it will hopefully give you variations on a mausoleum that match the style you've built.
any of you know ways to convert a ton of images from webp to jpg or png very fast?
yes we already talked about how shitty blip captions are and how deepdanbooru ones are better. thats not the issue i am having.
Irfanview can do it but it needs a plugin to open webp
thats fine
ill try it thx
oh sure, I just wanted to make sure you weren't wasting your time and resources doing something suboptimal.
also FFMPEG can convert webps
So you can write a batch script.
"ffmpeg -i image.webp image.png"
on another server just now:
"i use irfan view"
"go learn ffmpeg cmd line tools"
I’ll give it a try when I get home, should I take single buildings?
And should I set Style as initialization text?
Yeah, I would stick to single buildings where possible, and add maaaaaybe "painting style" as the initialization text. Just so it knows what ballpark it's playing in.
irfanview was easy, just had to install 2 exes
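For the ffmpeg route mentioned above, a batch script can be as small as this sketch. It only reuses the plain `ffmpeg -i image.webp image.png` form quoted earlier; the function names are made up, and actually running it assumes ffmpeg is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def conversion_commands(paths, target_ext=".png"):
    """Build one `ffmpeg -i in.webp out.png` command per .webp path."""
    commands = []
    for path in paths:
        src = Path(path)
        if src.suffix.lower() != ".webp":
            continue  # leave everything that is not webp alone
        commands.append(["ffmpeg", "-i", str(src), str(src.with_suffix(target_ext))])
    return commands


def convert_folder(folder, target_ext=".png"):
    """Convert every .webp in `folder`; raises if ffmpeg reports an error."""
    for cmd in conversion_commands(sorted(Path(folder).glob("*.webp")), target_ext):
        subprocess.run(cmd, check=True)


print(conversion_commands(["image.webp", "photo.jpg"]))
```

Calling `convert_folder("my_images")` would then run one ffmpeg process per file; pass `target_ext=".jpg"` for JPEG output instead.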
I know this is assigning a personality where none exists, but still: I was testing a 3k step TI embedding just now: "photo of a woman played by b153, short hair" and it generates an image where the woman is too far away to accurately gauge the face. I keep trying, keep getting distant shots, almost like it's so uncertain about the face that it's AVOIDING drawing it.
"photo of a woman played by b153, head and shoulders, close up, short hair" --- and it generates image after image of the woman turned away from the camera so I can't see the face.
😐
"photo of a woman played by b153, head and shoulders, close up, short hair, (facing camera:1.3)"
...
Shot of the woman with long hair swept across her face.
Hello guys, so after some experimentation, I think a hypernetwork can't be used alone, but it's a good complement for textual inversion. Here's a test I did: 3k steps of textual inversion at 0.002 learning rate, then 3k steps of hypernetwork at 0.000002 learning rate; the last image is the control image
Not Bad gg
i've read in a youtube comment that hypernetworks work for styles and poses, not for objects
yeah, it can actually improve the subjects as you can see on her face/clothes after hypernetwork, but alone I think it's not able to do a good job like textual inversion
the downside is that, since it's more effective on style, most of the images produced with the hypernetwork give her a doll-like face
but its producing some very nice results this way
couldn't test yet, but I assume you can crop multiple times the same image? (to get face, upper body, whole body)
yeah, click as many times as you like for repeated crops.
so good 🤩
Simplest thing that could possibly work 🧠
do you know of a tool to extract images from multiple videos ?
lets say one image every sec
I have an ffmpeg script to do just that
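Not the script mentioned above, just a sketch of the same idea: ffmpeg's `fps` video filter set to 1 writes one frame per second. The function names and the output filename pattern are assumptions; running it needs ffmpeg on your PATH.

```python
import subprocess
from pathlib import Path


def frame_command(video, out_dir, fps=1):
    """ffmpeg command that saves `fps` frames per second of `video` as PNGs."""
    video = Path(video)
    pattern = Path(out_dir) / f"{video.stem}_%05d.png"  # e.g. clip_00001.png
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}", str(pattern)]


def extract_frames(videos, out_dir, fps=1):
    """Run the command for each video, writing all frames into one folder."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for video in videos:
        subprocess.run(frame_command(video, out_dir, fps), check=True)


print(frame_command("clip.mp4", "frames"))
```

Prefixing the frames with the video's name keeps several videos from overwriting each other in the same output folder.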
So you're basically saying "here's a picture of a castle in the middle of a forest" and the AI is learning and saying "OK, I am ready for another castle in the middle of a—" and then you give it a picture of a medieval building with a flag on the roof.
Very interesting
So if I have multiple characters in the same style art doing different things (shooting, looking through a microscope...) I shouldn't give those details in the prompts (like BLIP does) ?
I should just put "a cartoon character in the style of X" ?
That's where you run into the very difficult question of "what will the AI get hung up on?" because if it's reasonably clear what's happening, the extra detail in the prompts seems to help it focus on the art, but for instance "looking through the microscope" has a decent chance of causing you trouble, because it may not know what a microscope is (with any certainty, anyway) and skew further away from the goal while trying to figure it out.
So yeah, I typically stay with "a cartoon character in the style of X" and hope for the best. But then I also don't usually have a fantastically diverse set of source images to begin with, so that probably hurts/helps too 🙂
Thanks a lot for those very useful insights 🙏
I run into walls so others don't have to 😄
getframes.py added to that same repo.
It was just laying around my old project junk.
It would be nice if there was a way to identify the images where training loss is clearly rising 🤔
Great idea, a top 10 source image vs loss table?
Gave it a quick go, quite variable as it depends on image vs prompt
Why would you try tagging everyone? You monster
Interesting, where do you have to edit the code to print that?
Maybe a mean loss per image would be more relevant
textual_inversion.py
set up a dict entryLoss = {} before the line for i, entry in pbar: then push entries inside the loop after the loss is calculated, with entryLoss[entry.filename] = loss.item()
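A standalone sketch of that bookkeeping, extended to the mean loss per image suggested above. The `LossTracker` class is made up for illustration; in the real training loop you would call `record(entry.filename, loss.item())` where the dict assignment goes.

```python
from collections import defaultdict


class LossTracker:
    """Collect every loss seen per source image and rank images by mean loss."""

    def __init__(self):
        self._losses = defaultdict(list)

    def record(self, filename, loss):
        self._losses[filename].append(float(loss))

    def mean_losses(self):
        return {name: sum(vals) / len(vals) for name, vals in self._losses.items()}

    def worst(self, n=10):
        """Top-n (filename, mean loss) pairs, highest mean loss first."""
        means = self.mean_losses()
        return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]


tracker = LossTracker()
# Fake (filename, loss) pairs standing in for real training steps.
for filename, loss in [("a.png", 0.10), ("b.png", 0.40), ("a.png", 0.30), ("c.png", 0.05)]:
    tracker.record(filename, loss)
print(tracker.worst(2))
```

Keeping a list per file instead of overwriting the last value smooths out the step-to-step noise, which matters because the loss also depends on the prompt and the random timestep.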
yeah eventually maybe
I'm giving it a try, I still generate an image every 100 steps, this is the result at 300 steps, why did it create people? prompt was a homestead with thatched roof, in style of homm3v10tk10buildstyle, I used only 10 tokens now and the initialization text was building style
these are my sample images
your dataset looks very heterogeneous from my understanding of entropie's messages
I would keep only the castles
everything on this picture except the last one
for anyone doing training of any kind, i believe this tool to be invaluable compared to other methods of image cropping: https://github.com/sneerz-1/squarize-images-update
On my side, here are my current results
The training set
All images were captioned by "a cartoon man/woman"
1000th training step :
8000th training step:
21000th:
Attempt to generate something with the fine tuned model with the prompt "a woman in the style of mush characters" :
(a training set sample to compare more closely)
what was your initialization text? did you use textual inversion?
initialization text was "sci-fi character" with 10 tokens and i am using textual inversion
I feel like I am close to getting very good results but it lacks something
Using more precise prompts gives very poor results, so, as the textual inversion paper suggests, I think there are too many images in the training set
poke @silent spear 👉 👈
hey guys, can we run dreambooth on windows without installing Linux?
you have to at least have WSL 2 - which is a linux subsystem, but you run it on windows
So there is no way around linux yet?
And 2nd question, what if I run colab on my pc with jupyter?
WSL is not a large download, and not invasive at all.
I can't answer that one, I don't use colab
I see, thanks m8, going to check colab and if it didn't work, I will try the WSL.
you could do that yep, if you find a notebook which works (or want to craft it yourself)
Ah, cool, thanks!
In automatic 1111 how do styles work?
elaborate.
There’s a create style button. What does it do, and how do you use it?
Any of you guys know of a colab that uses deepspeed so it will work on 8GB VRAM?
The one I found ask for 14GB : /
you can pay to use GPUs with more VRAM 🤷
Sure, but I was thinking of using it locally on my GTX 1070.
So, anyone know what this does?
When training embeddings, does the model you train on matter? For example, if I use waifu diffusion to train an embedding, will that same embedding be as accurate as one trained on the regular model?
either look at the wiki of auto or search for it here, you may find more info about it.
I'd also suggest looking at the code to see what the button does
very interesting, thanks for sharing!
Does anyone know if it's possible to resume training a Dreambooth model from where you left off? Or do I have to start from the beginning on a fresh SD 1.4 model?
I can't find this answer anywhere, sorry if it's been asked
I've never trained the model but it should be possible, since you're already resuming from where the 1.4 model left off. I think if you prune the model to shrink the size it might remove what's needed to resume training though
I've resumed on 2gb models several times, no issues with the python cli from xavier and forks
Is there a way to delete an embedding? Do I just have to revert to the basic SD model?
just delete/move the embedding file, or don't use its special term.
I found the embedding file but wasn't sure if deleting it would break anything. I guess if it does it's easy enough to start over.
no that's safe, don't worry.
put a write up along with the checkpoint for my ff7r mega model here for those interested: https://huggingface.co/panopstor/ff7r-stable-diffusion
onward to adding back laion data now
Should the total step count in TI be proportional to the amount or detail of input images?
wow
actually it worked
what you suggested
i mean this
not perfect, after 10k steps it still hasn't quite caught what I wanted
but I chose render style as initialization text, the prompt was a render of [filewords], [name] style and the preview prompt was Big ben, homm3v11tk30renderstyle style
and it gave me Big Ben!!!
first time it actually gave me what I entered in the prompt
😄
or not exactly 🤔
preview images are good
but if I try anything with homm3v11tk30renderstyle style, it gives me a random old guy
e.g Budapest, homm3v11tk30renderstyle style
Well this prompt doesn't seem precise enough to me
Try castle, building, mausoleum etc.
What's the difference between a hypernetwork and an embed, in a practical sense?
After searching a lot, I see that the question is asked a lot but never answered
I want to finetune SD on 50x50 images
Do I need to upscale them to 512x512 ? If yes, what is the best way to do it ?
Can someone explain hypernetworks and/or textual inversion and/or link to a website or article that explains it
@hot breach as a huge FF7 fan, this is awesome. Quick question, did you do anything different in the training? Or did you feel like better labels and image groups did the trick? I am working on fine tuning a single model with multiple people that are not in group photos and am trying to find the best approach
I'm an amateur at this, but my understanding of Textual Inversion is this: the language model translates your prompt into a vector that is used to guide the unet from the random noise towards an image. TI is the act of fine tuning that vector for specific concepts that the language model might not have a word for. Basically, TI is telling the network "when I say 'asadayo', I'm talking about images that look like this"
I do not understand hyper networks and haven't experimented with them yet.
I would also love if someone who actually understands this stuff could chime in. I have a feeling I'm missing something about the mechanics of TI
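To make the description of Textual Inversion above a bit more concrete, here is a toy sketch in plain Python. It is NOT Stable Diffusion: a fixed 2x2 matrix stands in for the frozen model and a short vector stands in for the embedding of a new word. The only point it illustrates is that training changes the embedding vector while the model weights are never touched.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def loss_and_grad(frozen_weights, embedding, target):
    """Squared error of the frozen model's output, and its gradient w.r.t. the embedding only."""
    error = [o - t for o, t in zip(matvec(frozen_weights, embedding), target)]
    loss = sum(e * e for e in error)
    grad = [
        2 * sum(frozen_weights[i][j] * error[i] for i in range(len(error)))
        for j in range(len(embedding))
    ]
    return loss, grad


def train_embedding(frozen_weights, target, init, steps=200, lr=0.05):
    """Gradient descent that updates the embedding and nothing else."""
    embedding = list(init)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(frozen_weights, embedding, target)
        history.append(loss)
        embedding = [v - lr * g for v, g in zip(embedding, grad)]
    return embedding, history


frozen_weights = [[1.0, 0.5], [0.0, 1.0]]  # stands in for the untouched model
target = [2.0, 1.0]                        # stands in for "images that look like this"
embedding, history = train_embedding(frozen_weights, target, init=[0.0, 0.0])
print(history[0], history[-1])
```

In the real thing the loss is the usual diffusion noise-prediction loss and the vector lives in the text encoder's token embedding table, but the division of labour is the same: the model stays frozen, the new word's vector moves.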
So, if I have understood clearly, TI/embeddings are instruction to tell the model "what" a word is and hypernetwork are instruction to tell the model how to recreate a particular style? what about VAE?
captioning images and having group photos are both big deals, I've trained the model with progressively more and more data now about 10 times
if 2 particular characters have a group photo, the likelihood you can get a good image of them together at inference time is a lot higher
I can do a pretty good job getting Cloud/Barret/Tifa in one image because there are a couple examples of all three of them in one image in the training set, plus a ton of training images for all of them individually (140+ each now). Mixing aerith/jessie together is much harder: no images of those two together (not even sure they meet in the game?), and jessie has a slightly smaller training set herself (~90), which may affect that
worth noting jessie still looks spot on by herself with just the 90 solo images and a few with her and cloud/biggs/wedge
wedge/biggs have much smaller sets and don't look very good, I have more captures ready but trying to avoid just training again until I get laion data introduced back to replace regularization
is there any resource or chart for explaining the differences in the "Samplers"
#🤝|tech-support message and next post as well
do any of you know of a tool that will grab every frame from a video that contains a specified character
it should be possible to use ai to do this somewhat reliably and it would be an excellent source of images for textual inversion
only aware of speech models for "diarization", you'd need to have a model you could first train to recognize individual characters, certainly possible just don't know if something like that exists
homme is french for guy so that might be the problem
@stone garden @gusty thicket @hollow surge #🔧|finetune message
well, I'll read in more detail, download, test, and share some feedback to you.
But already, thanks a lot for this detailed method, you are the MVP in the finetune game
one interesting thing to test is the main characters vs. side characters, it will give you an idea of how the weight of training data impacts quality, i.e. "biggs ff7r" has much less data than "cloud strife" and looks worse for it, and "rufus shinra" has a tiny training sample
"red xiii" is also still pretty awful, barely 10 images out of 1400
how did waifu diffusion train a waifu model? is that dreambooth? textual inversion? or something else?
I don't think they used dreambooth
they trained only on anime content afaik, and did not attempt to "protect" the existing model, so it can no longer produce "a photo of tom cruise" etc
I actually haven't tested, I don't have WD ckpt
it actually can, it's just more anime styled
yeah
but yes it totally messed with every generation
they stomped on the model so to speak, reformed it to be very specialized
I believe next step for fine tuning is introducing laion data back and dropping regularization
I'm working on that as we speak
once regularization is dropped, it's not really "dreambooth" anymore
no clue what that means but sounds good
regularization is that weird thing where you have to generate like 300 class images, right?
yes
it's used to keep the model from overtraining on the training images, to "protect" the model so it doesn't forget how to draw the stuff it knew before you tried to train new things
so the FF7R model can still do "a photo of tom cruise" and he doesn't look like a Final Fantasy video game character for instance
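A rough sketch of the regularization idea described above, assuming the usual DreamBooth-style setup where the loss on your training images is added to a weighted loss on the generated class images. The function names and the `prior_weight` parameter are made up for illustration, and plain lists stand in for the tensors a real training script would use.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def loss_with_regularization(instance_pred, instance_target,
                             class_pred, class_target, prior_weight=1.0):
    """Loss on the new training images plus a weighted loss on class images.

    The second term is the "protection": it penalizes the model for
    drifting away from what it already produced for the generic class.
    prior_weight is a hypothetical knob for how strong that pull is.
    """
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_weight * prior_loss
```

Dropping regularization, as discussed above, would correspond to setting `prior_weight` to 0 in this sketch, which leaves only the training-image term.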
that makes sense
he will if you prompt "tom cruise standing on the rooftops of midgar city slums district" however, as the style transfer happens when you start prompting "midgar city slums district" and so forth
the gist or the huggingface links have links to imgur with examples of that
yeah it's on the gist if you scroll down, links there to imgur
❗❗ THE NUMBER ONE MISTAKE PEOPLE MAKE ❗❗
Prompting with just your token. ie "joepenna" instead of "joepenna person"
is this true?
i haven't been using the class name and i thought my results were excellent. i'll have to try adding the class
I've loaded some models from other people and I've found person may not be required
it may learn enough on just "joepenna" to work, but keep in mind there are a variety of techniques being used
several of us have not been using "sks" this or that, or "person" for class_word on the older repos at all
I trained "john carmack" just like that, without "person" anywhere at all and it worked perfectly fine
richservo has a giant list of models trained this way as well, no "person" at the end, just the name
there's also no particular reason to train without spaces in names and such that I can tell, though really long names do use up tokens later when you want to prompt I guess
btw, i just noticed automatic1111 ui seems to have no tokens limit!? that's crazy, didn't expect that to happen so soon / ever
This was very recent. I think I noticed it like last night on accident?
I am installing webui and my PC is lagging so much, I can't move anything, is it normal?
yeah dont touch anything until its installed.
Ok, how much time did it take?
Depends on the disk/bandwidth speed on your first time since it's downloading SD and creating a new environment. For me it took between 4-6 mins. But after the first time it takes 30-60 seconds
how do I reopen it after the installation?
Does anyone know What’s the vram requirement for the textual inversion training in webui?
Ok so I turned on live preview for sample steps. How does this help me fix an issue I see during the render? For example, an incomplete weapon missing a hilt... I'm not sure how this benefits me.
For long upscales it lets you interrupt and restart after tweaking some parameters. Also, if you are inpainting and you see something you don't like, you can interrupt instead of having to wait the extra minute or so for the image to finish.
You can't force SD to fix something like missing body parts, incomplete weapons, etc., but you can at least identify that it's not coming out how you want it and interrupt early to tweak settings and try again
How good is hypernetwork? Anywhere I can see results?
It's mainly for styles, I didn't achieve good results with face learning, but it's a helper for subject improvement
without hypernetwork
with hypernetwork
control
Does that mean we can use multiple textual inversions at once?
I used the same image library to train a ckpt last night and a hypernetwork, and discovered it's very helpful with a trained model or normal SD model
My most important discovery was I had to turn hypernetwork strength to 0.5 for it to work!
I was getting rubbish results nothing like my reference material until I did that
Are people combining ckpt, hypernetwork and textual inversion with good results, or is that considered overkill, or is one of them considered redundant at this point?
Good news, for those who love to mess around with hyper networks or embedding folder, you don’t need to restart the whole program to have the dropdown listing your new pt files. With my change (just merged to master) just simply click the refresh button, you will see the new pt files in the drop-down
Hey, is there a playbook for style transfer learning ? 🤔
A great convenience!
hypernetwork is great for styles , it does take some time to get it right
I trained a hypernetwork using some Mob psycho anime screenshots. Here are some comparisons
26k steps
started with e5 then e6 then back to e5 for the last 10k steps
same params , just changed the HN for each one
Did you use different characters in your training set ?
Has anyone tried finetuning the full SD model (not TI or DB) using the script in the diffusers repo?
It looks very straightforward to me but I don't know if it works on colab or if the results are any decent as it says the script still is experimental.
what I've noticed is that hypernetworks allow bigger image sizes than TI
I can train up to 2048x2048 on my 12gb card
So after a bit of testing, I figured out that hypernetwork is actually able to learn faces too, but it works differently from TI
That's me (photo)
that's a test with hypernetwork, 2048x2048, 42 images, 0.000025 learning rate, 1500 steps, images have been labelled with BLIP
from my tests during training I can say that a lower learning rate wasn't giving me the desired output: I was training 15-20k steps and still got nothing. So I increased the learning rate, and at 0.00005 it went crazy (noise texture) above 1200 steps, so I've reduced the learning rate by half and it seems to work so far
yes, like 4. I did notice it's more consistent with guys, so next time I'll add more girls to the data
@silk crystal just in case you are interested, here are the images and metadata that I used for my HN, keep in mind that I used the [redacted] model
another comparison using the HN trained and different strength
did some comparison too, that's what I said about it going crazy: with 0.000025 and above 2500 steps it starts to get really bad, but the sweet spot is close to it at 2000 steps, so kind of a delicate balance I would say
will do more testing with different learning rates, to see how it behaves at a lower rate
I have 800 pics of a person, I want to do textual inversion, any idea what are the best settings to have it learn a face/person but without overtraining? last time I used all default settings in automatic1111's UI (100k steps) and it learned the photo style so I couldnt then use it to do any different style like oil painting or anime drawing.
I just use TI.
I used this as a base https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2670
had to tweak some stuff and start over and over
but it's a good start to get an idea
Also is it better to have it learn on the full pics or should I extract the faces only and train it on that?
mine were between 5e-5 and 5e-6, switching them when it got too stale without making any relevant changes, I was checking every 200 steps
mm in my case I was trying to train the style , so I'm not sure about that
yea I saw that guide, its for styles only I think
stuff around the subject may affect it. I haven't tried with such a big amount of photos, but I use between 10-40 photos and I usually crop to the upper part of the body so it focuses more on the face. I've noticed that sometimes it learns stuff around me like the wallpaper; my chair had text on it, so it was adding that text (or something like it) to the generated images too
I never tried 100k steps, I think it may be overkill. In my case for TI I use between 10-20k steps, sometimes it gets good even at lower steps like 5-7k
something that also plays a big role is the token count: a higher token count takes away control, so I would keep it between 1-10 tokens only, higher makes it difficult to apply styles
thanks for the info, yeah 100k seemed overkill but it was the default value on the webui for some reason, and I chose 16 tokens, will train again for about 20k and maybe choose 8 tokens
you can experiment with running at a certain LR for X steps, then continue with more steps at a lower LR, assuming whatever code you're using allows you to continue training on your own bin/pt/ckpt
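The "run at one LR, then continue at a lower one" idea can be sketched as a simple lookup table. This is only an illustration: `stepped_lr` is a hypothetical helper, and the step boundaries are invented (the two rates are the 5e-5 and 5e-6 mentioned earlier in the chat). How you actually feed the value into training depends on the tool you use.

```python
def stepped_lr(step, schedule):
    """Return the learning rate to use at a given training step.

    schedule is a list of (last_step, lr) pairs sorted by last_step.
    Steps past the final boundary keep using the last learning rate.
    """
    for last_step, lr in schedule:
        if step <= last_step:
            return lr
    return schedule[-1][1]


# Hypothetical example: higher rate for the first 3000 steps, then lower.
schedule = [(3000, 5e-5), (10000, 5e-6)]
```

With this table, `stepped_lr(200, schedule)` gives the higher rate and `stepped_lr(5000, schedule)` gives the lower one, so checking samples every few hundred steps and editing the boundaries is cheap.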
I think Im just gonna try dreambooth instead
since it seems like the superior option
just gotta figure out how to run it on my 24GB VRAM without it crashing lol
would be nice if automatic adds dreambooth to the webui
yeah unfortunately I have only 12gb 😦 so can't say about it
so a bit more of testing, hypernetwork with 0.00001 learning rate and from 0 to 6k steps
nice pixel art model ! https://publicprompts.art/pixel-art-v1-dreambooth-model/
neat!
does it interpret different parts of the name of the embedding? i'll try to rename it and train again
do you mean my test prompt or the initialization text?
test prompt
a castle, homm3v11tk10render style gives me this result
well i was wrong then
hi, if i'm using dreambooth training eg with token=jmp909 class=man so jmp909 man is there a way to give more weight to the token to steer it towards picking up my likeness? eg a photo of a [man:jmp909 man:0.3] on a beach holding an icecream or whatever? or a photo of a (jmp909 man:1.3) on a........
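For reference, the two prompt forms in that question do different things in the automatic1111 webui: `(text:1.3)` scales the attention given to the text, while `[a:b:0.3]` is prompt editing, which switches from `a` to `b` partway through sampling. A tiny, hypothetical helper for building the attention form (the function name is made up, and whether extra weight actually improves likeness is something you'd have to test):

```python
def weighted(text, weight):
    """Wrap prompt text in the webui's (text:weight) attention syntax."""
    return f"({text}:{weight})"


# Token and class from the question above, used purely as an example.
prompt = f"a photo of a {weighted('jmp909 man', 1.3)} on a beach holding an icecream"
```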
But btw, I got way better results with this repo : https://github.com/JoePenna/Dreambooth-Stable-Diffusion @ashen perch
Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) by way of Textual Inversion (https://arxiv.org/abs/2208.01618) for Stable Diffusion (https://arxiv.org/abs/2112.10752). Tweaks focuse...
But you will need to gather a bunch of images of buildings for regularization for your specific problem
my card only has 8gb of vram so dreambooth is off for me, maybe through colab
I even thought about taking pics ingame like this and cropping out different parts
What's the difference between a hypernetwork and an embed, in a practical sense?
I'm thinking that no one actually knows, is that accurate?
you could try cropping such a screenshot twice, once on the left and once on the right, but getting SD to generate that afterwards in a wide aspect may be difficult
if the game has a square aspect it might be better
if you can use caption training you could add "with a toolbar on the right" only to the right side crops, maybe it would learn the difference, just an idea
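A minimal sketch of the left/right cropping and caption idea above, assuming a landscape screenshot. Both function names are hypothetical. The boxes use the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, but nothing here needs Pillow to run.

```python
def left_right_square_crops(width, height):
    """Return two square crop boxes: one at the left edge, one at the right.

    Each box is (left, upper, right, lower) in pixels.
    """
    if width < height:
        raise ValueError("expected a landscape (wide) image")
    left_box = (0, 0, height, height)
    right_box = (width - height, 0, width, height)
    return left_box, right_box


def caption_for(base_caption, side):
    """Add the toolbar note only to captions of right-side crops."""
    if side == "right":
        return base_caption + ", with a toolbar on the right"
    return base_caption
```

For a 1920x1080 screenshot the two boxes overlap in the middle, which is fine for training data; only the right crop gets the extra caption text.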
something something hyperparameters good for styles not for subject something something
I'm assuming you're talking about hypernetworks in particular? If so, does that make your answer something like:
"Hypernetworks are better than embeds for styles, but embeds are better for subjects"?
yes exactly
Great, thanks. Is this from experience, or do you have a source?
based on the context I inferred by reading what other people who don't understand it either said
Gotcha
embeds allow you to create a word for some subject, while hypernetwork seems to learn the overall information
Does the hypernetwork not use a word also?
nope, if I write a prompt like "a man with mustache" with my hypernet any man will be like my face
Hmmm... So every generation is thereafter contaminated by the hypernetwork?
if hypernetwork is enabled, yes
on automatic webui you can control strength and choose which one
Oh... Where do I find that in auto1111?
I'm still testing it. For my face learning it seems to be doing a very good job now. I've taken some more photos of myself: I was using 40, now I'm at 80 photos. I changed my shirt and took some photos with and without glasses at different places in my house, want to see if it stops showing my jacket in all generations lol
ah sorry
misunderstood your question
Maybe it appears after you create a hypernetwork? I can't actually seem to create one because it says "Error" 😂
I have this problem sometimes too, you need to close/open it
often happens after training a TI
Gotcha
the strength you can control on settings
or do like myself
on settings > quicksettings list use "sd_model_checkpoint, sd_hypernetwork, sd_hypernetwork_strength"
this adds the options to top of page so its faster to change them
I don't have quicksettings, maybe I have to update
hm yeah I think its some recent addition
When I'm done with my current work I'll update and have a look, thanks for the info 🙂
I'm getting very happy with results on hypernet now with 80 photos its producing less biased content
What application are you thinking about for it?
my idea is to make it learn the faces so I can make different styles of them and maybe create shirts, cups, etc using those
I see, and is there a reason you're choosing hypernetworks over embeds for that?
I'm actually experimenting with it. On my first try I wasn't getting anything good out of hypernetworks, but after some tries I'm starting to get good outputs, though I still need to test with other faces
Cool. And have you already experimented with embeds?
from comparison I think the HN is producing more accurate face than embeddings
yeah, with it I had to generate several times to get something good out of it
Ok interesting
lol I'm getting really good results with hypernetwork
what I can say is that it's producing more natural images compared to TI
@viral jay what’s your training settings? (Steps, images etc). Thanks
6000 steps, 0.00001 learning rate, 73 photos (had 80 but I removed some out-of-focus photos), all of them tagged with BLIP
Thanks