#🔧|finetune
so what i need to do is to specify my car better and more clear right?
using something like "sks car" you basically tell the model that this is just one specific car
and the model will still be able to draw, let's say, comic images of your car
but in theory if i leave the instance prompt empty i can retrain car like that right?
what is confusing me tho is the link between instance and model, lets say my instance prompt is xyz and what i want to train is "red car" so i build the advanced captions like "xyz red car driving on the street" and "image of xyz red car on the motorway"
and if i dont use xyz in the advanced captions will the model still be able to match my car?
where does my own instance stop? if i have xyz red car and xyz headlights of car on the motorway how does the system know my model is xyz red car and not xyz headlights or just xyz red
it's magic 🤷♂️
very likely it will associate the words xyz and car together in a sentence
yea i figured there is a lot of magic here 🙂
but that depends a bit on your training captions. If you always write "xyz car" then it will more likely get stuck to that
that's why I prefer to randomize training captions as much as possible
so in reality if i want to create my xyz car model i need to have a base model that really is the type of car i want as xyz car and then specialize it with advanced captions but always pointing back to xyz car?
there is a checkbox in kohya_ss "randomize captions" maybe i try that
when you say randomize you mean like really randomize or just redefining so that it still makes sense
it will probably split the sentence by "," and shuffle
what is the "," doing when it is tokenized
I mean with randomize to avoid fixed patterns
is it like a tag separator?
like not caption everything with "xyz car, from front, on street". Also sometimes use "front view of car xyz on street" and so on
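A minimal sketch of the "split by comma and shuffle" idea mentioned above. This is not kohya_ss's actual code; the function name and the `keep_tokens` parameter are illustrative assumptions. It keeps the first tag (the "xyz car" name) in place and shuffles the rest, so the model sees no fixed tag order.

```python
import random
from typing import Optional


def shuffle_caption(caption: str, keep_tokens: int = 1,
                    rng: Optional[random.Random] = None) -> str:
    """Split a caption on commas and shuffle the tags, keeping the first few in place."""
    rng = rng or random.Random()
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # in-place shuffle of the tags that are allowed to move
    return ", ".join(fixed + rest)


caption = "xyz car, from front, on street, red paint"
rng = random.Random(0)  # seeded only to make the demo repeatable
for _ in range(3):
    print(shuffle_caption(caption, keep_tokens=1, rng=rng))
```

Shuffling only reorders tags. Rewording the sentence ("front view of car xyz on street") still has to be done by hand or with several caption variants per image.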
I'm not sure if I understand you correctly. Think of xyz as a name. Like you would want SD to draw "batmans car"
and don't overthink it 🤷♂️ just try and learn from errors xD in the end every training dataset behaves a bit different anyways
yea but each cycle takes like 24 hours of training so trial and error is really not that funny xD
so the more i understand the better i can prepare xD
then first try to train for a few hours only and check how far you get
i already did that and this is why those questions come up 🙂
training each image 10 steps in 10 epochs is the same result as training each image 100 steps in 1 epoch?
it didn't work out?
its going there but i still have issues with overfitting or underfitting
oh, and one very important thing: use a good prompt for testing
i used a learning rate of 5e-7 and for some cases its still overfitting and some are underfitting and that is kind of frustrating
like I found that using "xyz car" as prompt I get shitty images all the time, and as soon as I get good images it already overfits
yes i kind of have the same issue, how did you solve that?
so first experiment what is a good prompt that gives you a nice picture of any car and then use the same prompt for your car
ah ok i see
like these crazy "photography of a red car, masterpiece, perfect angle, blablablabla"
i was hoping i can make a very robust model that is not that specific and thus i used 5e-7
so that it is slowly learning the concept
and then replace "red car" by "xyz car"
is the target to have as many different input images with an as detailed caption as possible or can i expect that at some point the system just learns my car and can map it on other cars?
and how can i tackle the issue that mainly the model is overfitting but yet some specifics that i have added pictures with captions from are underfitting in the same model
🙂 so many questions - i really like that stuff
it should... I mean, sometimes it can do that with 5 images already 🤷♂️ So I think 1000 images are more than enough
for the overfitting and underfitting - shall i just increase the number of steps for the concepts that are underfitted or shall i just keep training more with lower learning rate?
I think your learning rate is as low as possible 😅
so if it is now overfitting i use too many steps of a particular concept?
but yes, you might try to add more examples of the images which still fail to the training data
i was thinking of reducing the per image step sizes by /10 and then train 10 epochs while saving each epoch to see what is going on but im not sure if training 100 steps in the same epoch is the same as 10x10 epochs
and experiment with captions and prompts. Like it's super hard to get the model to the point where it does everything right just with "xyz car" as prompt. If something doesn't work, try to describe it in the prompt, maybe that helps already
I don't know - depends on the implementation. Maybe just specify the epoch value and use a low one like 2-3
so basically what you are saying is that the input prompts with proper image selection is the key right?
not only for training, also for inference
so what kohya is doing is that it allows for each folder to define how many times each image in that folder is trained
if you try if your model works and it gives you bad results, try to improve the prompt
and currently i train each image in each folder between 100 and 300 times or so
i was wondering if i can reduce that to 10 to 30 and just train 10 epochs
that's smart
so i would have 30*1000 steps per epoch
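A quick sketch of the arithmetic behind the "10 repeats x 10 epochs vs 100 repeats x 1 epoch" question. The helper name is hypothetical, and the 1000-image dataset is the figure used in this conversation. It only counts optimizer steps. As noted above, whether the two runs give the same result depends on the implementation (shuffling, LR schedule, when checkpoints are saved).

```python
import math


def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Optimizer steps when every image is seen `repeats` times per epoch."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


one_long_epoch = total_steps(num_images=1000, repeats=100, epochs=1)
ten_short_epochs = total_steps(num_images=1000, repeats=10, epochs=10)
per_epoch_at_30 = total_steps(num_images=1000, repeats=30, epochs=1)

print(one_long_epoch, ten_short_epochs, per_epoch_at_30)
```

The step counts match, so the practical gain from the 10-epoch variant is a saved checkpoint after each epoch, which lets you see where overfitting starts.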
just use one epoch and train each directory, e.g. 5 times
if i train only 5 times each image the result is absolute crap xD
i found it to start working from 40 times +
40 x 1000 images? Really?
okay, you use an extremely low learning rate, maybe that's the reason
I think you should use larger learning rate and experiment a bit with less steps
i found if i go to 1e-6 it is overfitting way too much even at 40 times per image
yes, so use less steps;)
but less steps the images get really blurry and stuff
well anyways you gave me a whole lot of input thank you so much!
hm, if your caption is good it should produce good images from the start that more and more look like your training images
yeah, I would say try first with higher learning rates and less steps to experiment a bit and find a good training setup
i have a very wild mix of captions, also with alot of details, like headlights, spoilers and so on
and i want the model to be able to also understand those parts
thank you very much for your time sir
i will do some tests and report back with the results! 🙂
sure, good luck
have a nice day
I wish to train for a style for the first time, where should i start?
I mean i know i should build a dataset, but is it like 20 pics, 50, 100, 200?
how do I tag such a thing?
in the style of <whatever>?
how to decide if it's TI, Lora, a model?
try TI first, then do lora, use detailed captions. "in the style of <whatever>" is fine, but not important, you can also go for "by <whatever>".
anybody has a link describing how the text tokenizer works for training
also is there a way to extend a model rather than retraining it for something specific?
and if i finetune what is the best model to base on? currently im working with 1.5 and some i found on the internet that have 50% ema mixed in
hi
hi 🙂
you won't be able to train SDXL yet, as it is not released for download.
As for "can you", to be honest, almost no chance SDXL could be trained on this. You can't train any model I think on 4GB VRAM currently, the minimum I see is 8GB for LoRAs training
even just running SDXL has a big chance to require a super fancy GPU
you can train some models in google colab though, using notebooks
it's not just taking long. it's another process and tool to train, and won't run at all on CPU or on 4GB VRAM 😢
with that card, your best bet is to use google colab for it from time to time, where a small training like 15 pics will be done in 20 minutes
loui i will remind you when you make a model with 1000+ images to train on sdxl xDD
and it will take half a year lol
In my opinion all models have big issues with vikings. It seems to be that there is just no good training data for vikings. Like they all tend to give them weird horns and stuff
you might try different words or cultures that "feel similar", like "celtic" instead of "viking"
if i train a model what is the best model to base on?
also is it safe to use a pruned model?
also shall i use ema or nonema for training
v2-1_768-ema-pruned.safetensors or v2-1_768-nonema-pruned.safetensors
or v1-5-pruned-emaonly.ckpt / v1-5-pruned.ckpt
always use ema
and yes, you can use pruned models. They just have some parts removed you don't need anyways
usually, loss=NaN means it's going badly yes. but it can also just bug out. this is very early in your training. unless you used a very high learning rate, this seems like a bug in display.
I would stop it and test though personally
Any suggested rates for 2.1??
Dreambooth ? I use 6e-7 personally, with a polynomial scheduler
a lot lower than yours there
kohya_ss gui
not sure then, especially if LoRA
yeah, I'm not sure how they implement it. I imagine those were the default values ?
0.0001 was default
so 1e-4
I'll try yours and up the epoch count
worth a try on 1e-4 already
but I do recommend using polynomial LR yeah
it makes LR reduce slowly over time
this is subjective, but from the tests I did on it, I find it gives better quality in smaller details
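A minimal sketch of what a polynomial learning-rate schedule does, assuming the common form that decays from a start value to an end value. The 6e-7 start comes from the message above. The total step count, `lr_end` and `power` are illustrative assumptions, and the formula is not taken from kohya_ss or any specific trainer.

```python
def polynomial_lr(step: int, total_steps: int, lr_start: float = 6e-7,
                  lr_end: float = 0.0, power: float = 1.0) -> float:
    """Learning rate at `step` for a polynomial decay from lr_start to lr_end."""
    progress = min(step, total_steps) / total_steps  # fraction of training done, capped at 1
    return (lr_start - lr_end) * (1.0 - progress) ** power + lr_end


total = 1000  # hypothetical run length
for step in (0, 250, 500, 750, 1000):
    print(step, polynomial_lr(step, total))
```

With `power=1.0` this is a straight line down to `lr_end`. Larger powers drop faster early and flatten out near the end.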
NaN is usually a problem of your precision
if you train with fp16 you need special techniques like mixed precision and gradient scaling
had mixed precision = "fp16" in there already
and gradient checkpointing
am trying again with bf16 precision now
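A small sketch of why gradient scaling is paired with fp16, using only the standard library to emulate half precision. The scale of 2**16 is a hypothetical fixed value chosen for illustration; real trainers usually adjust the scale dynamically. The sketch shows a tiny gradient vanishing in fp16 and surviving once it is scaled before the cast.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (fp16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 2.0 ** 16  # hypothetical fixed scale, for illustration only

tiny_grad = 1e-8
unscaled = to_fp16(tiny_grad)             # too small for fp16, flushes to zero
scaled = to_fp16(tiny_grad * LOSS_SCALE)  # large enough to be stored in fp16
recovered = scaled / LOSS_SCALE           # unscale again in full precision

print("without scaling:", unscaled)
print("with scaling:   ", recovered)
```

This covers the underflow side only. Values that are too large overflow to infinity in fp16, which is one way a NaN loss can appear, and bf16 avoids much of that thanks to its wider exponent range.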
Thank you @stiff dust !
That's a batch size of 4 there at 768x768 on an A4000
just under a picture per second. not that bad !
indeed, and it'll get a bit quicker as it goes on, not that it bothers me, I'm heading to bed and will collect the results tomorrow 🙂
hi guys, im finetuning with dreambooth on RTX4090 but for some reason i can only do batch of 1
any idea what im doing wrong
as soon as i use batch of 2 my vram runs out
ResidentChiefNZ: i had the same issue yesterday
may i ask how you solved it? for me it worked switching to batch of 1 and no memory efficient attention
but i would really like to use batches of 2
does switching from fp16 to bf16 help at all on the memory?
you can't train XL for now, even on DS. 2.1 can be taught specific things if you need to.
For a professional pipeline, this seems like a completely valid way to go still, and for quite some time
Same for SD 1.5, it has such a big user base, there are very specialized models doing wonders on those subjects.
I love SD XL, and I really like that you spread that love too, but let's not antagonize other users either on what they use if they want to use it.
1.5 is far from an abandonware, 2.1 is not for old fart with the results it can give, let's water it down and enjoy what we each like, if it's cool with you
Using XL doesn't train it, no. it does help for sure on giving feedback on different things, like the tokens you use the most, and stats of use or even just all the great art you've been sharing since you came around, but other ways like pickApick, where human feedback is given back to the machine, are also great ways to help on it
The comparison you are proposing is not directly possible though. A prompt is tuned for the model it's targeted on, using the tokens that "resonate" the most with it. The good question there would be, for a given wanted result, what quality can you achieve by tinkering with the prompt. But this becomes an unfair question, putting XL at a disadvantage since it stays generalist, and not specialized on a given result
Here is an example where we compared this with TwoDukes, one of those currently finetuning SDXL, if you really want :
SD 1.X results : #🍥|anime message
SDXL results : #🍥|anime message
we'll get more "on point" results from the models trained on the specific prompt, even if SDXL does really really great on it too, but importing more from styles that weren't prompted (like 3D in this example)
It's gotta be kept in mind that SDXL isn't out yet, it's just the beta, and that beta ROCKS, I'm all with you, and I have great hopes for it
but to give another comparison I just did
prompt is "a realistic picture, professional portrait of a cat octopus creature wearing a suit, unmythical creatures"
and this second example is on a 1.5 trained model
The checkpoints on Civitai based off 1.5 beat pretty much everything I've ever seen
As in the community trained/merged checkpoints found on CivitAI.com
You need to get the right one for the right job though
FYI if i switch to bf16 instead of fp16 i can use batches of 2
can i expect changes in the result based on this change?
sorry dude, but you seem to have no understanding how ai works 😅
the difference between bf16 and fp16 is trivial, the fp16 has more bits dedicated to the fraction side, so is likely to be more accurate than bf16, but the difference is going to be so negligible that it won't be noticed in the end result
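For reference, a sketch of the two 16-bit formats using only the standard library. fp16 has 5 exponent bits and 10 fraction bits, while bf16 has 8 exponent bits and 7 fraction bits. The bf16 helper here truncates a float32 instead of rounding to nearest, which is a simplifying assumption, so treat the printed values as approximate.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (1 sign, 5 exponent, 10 fraction bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")  # beyond fp16's largest finite value (65504)


def to_bf16(x: float) -> float:
    """Round-trip through bfloat16 by zeroing the low 16 bits of a float32 (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (0.1, 70000.0):
    print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

So fp16 is the more precise of the two for values inside its range, as said above, while bf16 trades fraction bits for a much larger range, which matters when values grow large during training.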
oh sorry, I was referring to louis strange monologue above
dude, your comparison means nothing. You can make photos with 1.5 with much better quality than vanilla SDXL, just use the right model and the right prompt
just look at example images from:
https://civitai.com/models/3666/protogen-x34-photorealism-official-release
https://civitai.com/models/1116/rpg
https://civitai.com/models/4823/deliberate
It's all 1.5
Don't bite.. he'll keep posting for next 3 days....
haha, right. Have fun with him ;D
nah, I gave up 😛
I'm just warning you before you make the same mistake I made!
a lot of people went on that argumentative road already.
I think XL is promising, but not the joker that beats everything either.
But let's just agree to disagree and all enjoy our tools, is what I landed on there :p
depends. Many people like 1.5 more
well
I would say 2.1 often makes more and better details. However, if you use a good custom model that doesn't matter so much
1.5 on the other hand is better in drawing humans, however, 2.1 is not as bad in that as many people say. It just cannot draw nude people
ah i see
some people say 2.1 is a bit more overfitted and less versatile. Hard to say if that is true, though
but I would say that, indeed, 2.1 gives me more of the same look (e.g. its hard drawing people full body, it always wants them close-up, while 1.5 seems a little bit more versatile)
yeah, as said, in the end nobody really uses the base models, but custom models like dreamshaper, deliberate and so on. They are also much better in details and faces
what model do you recommend to finetune on?
dreamshaper looks interresting
when i finetune do i use the vae version?
i still dont really understand that part
I finetune on the base models, as I know they still have their ema data, whereas most of the models have been pruned and may not have the training data
Fools errand - a) quality is subjective and subject to bias; b) some of the fun of AI art is the "pull the handle" pokie machine roll it and see if you get a good one or not; and c) SD2.1 has well past moved on being just one model - there are dozens of finetunes out there (thousands for 1.5 models)
That said - I pressed generate once - got this
We aren't saying that SD XL is bad; nor that one should not support Stability as they are awesome - we are just saying that there is far more to AI art, and we need the resources of all involved to make this the best it can be...
SD2.1 was trained on 768x768...
And besides.. that wasn't what you asked for...
well each of the individual images in that plot is 640x960
dude.. I'm going to politely tell you where you can take your rule...
you've been beating this drum for 4 days! it's time to let it go
I think they last said they feel it's time to let it go, not sure they are looking for more comparison, nor that you are stupid there.
From what I read, it feels like you are both of a different opinion and failed to convince the other or hear them out.
I do feel, like I explained yesterday, that your rules don't make sense.
Trying to see what's the best model either for that matter, since the possibilities of training are different, and they can't be integrated in the same pipelines because of it.
yes seriously. You start from a random noise, so to be fair on number of pictures, 1 pic only doesn't make sense. it's the average aesthetic score of hundreds of gens that you would need to compare. but it's far from the only point that doesn't make sense in comparing them : parameter counts, different text encoder, finetune to prompts, loras, hypernetworks, ...
There are numerous difference that make comparing them a statistical nightmare if you want to be fair
If you want to be practical, each has a different target and use on the market currently, the use cases they are intended for, or at least used for, diverge completely
Because of it, each will be the best in a different category
To keep your analogy, you are comparing 2 different sports golden medals at the Olympics
try to use reasoning maybe to answer me if you want to debate. this isn't far fetched, it's the practical world people coming for help on this server describe. I'm not attacking you there, you don't need to "screw" everything.
Anyway, since XL can't be finetuned at all currently, let's at least diverge this to #🏞|general-with-images and not clog this channel.
finetuning is not modifying the prompt.
#🔧|finetune is the channel for dreambooth, lora, textual inversion, controlnet, and other kinds of finetuning techniques
Prompting work is essential on any model, but you are mixing up the terminology. we mostly work on the good prompting techniques, and ways to build coherent prompts, in #📝|prompting-help
still not finetuning, please stop this contest as you've been asked multiple times now. nobody is asking for fair rules in this, you can keep having fun with it if you want, but move this either to #✨|sdxl or to #🏞|general-with-images .
This channel still isn't the place for what you are doing.
I think it's still not cool to keep using the wrong channel intentionally there no
I didn't subscribe to the rules of your comparison
you did accept the rules of this server though. I'm explaining to you that you are in the wrong channel and need to change channel to respect other users, as I'm here right now to help those rules get applied to everyone.
So keep ignoring me and posting in finetune to prove your point and get timed out, I'm not sure what you are looking for there
I have 10k credits currently, and used around 2k credits on SDXL currently.
I know what I'm talking about also on finetuning, having been on this server for months now, and being a moderator on it for around 6 now. So, the theoretical place to talk about a given subject, and find people willing to talk about that subject, I know quite well. And I'm telling you, this is not fine tuning you are doing.
It's prompting, and yes it's essential. As well as working on your sampler, scheduler, steps and all other available settings there.
I'm saying this one last time. stop sharing non fine tuning on this channel. I'll time you out, I can only warn you. I am giving you every chances to please not come to this end and move to #✨|sdxl or #🏞|general-with-images
it's tuning the output, not finetuning the model, as described by just the terminology itself in the diffusers library that is the base of everything SD related
https://huggingface.co/docs/transformers/training
I admire your patience 🕊
but thanks
I was going to ask something related to LoRA but I got hooked by this surreal conversation, I even forgot what I was going to say
nope, don't start this again lol
still not the good channel, and still not a drama server
Did anyone have a working script for merging sd2.1 lora INTO a model?
left is the training data, right is what i get after training a lora from it. are there any lora parameters i should adjust to get more accurate results? i already tried the basic steps, cfg, and tried making the aspect ratios match
idk if i would have to use one of those img2img tools or inpainting tools, or switch to something like textual inversion?
also could it be that AI just struggles on guns like how it generates hands? idk anymore
Oi
I'm training dreambooth on generating a character, though the results come out really, uhh, weird
I've got a dataset of about 70 images
I'm not sure if my config is correct but my settings go like this
instance token: a photo of (name) person
class token: a photo of a beautiful woman
instance prompt: a photo of (name), and then a whole bunch of variables like high quality and all (which give me photorealistic results when I use it in txt2img)
class prompt: a photo of a beautiful woman, then the same parameters
classification image negative prompt is regular stuff like bad hands, bad quality, whatnot (again this stuff gives me photorealistic results in txt2img)
sample prompts is a txt file with prompts corresponding to my dataset images
sample image prompt is blank (not sure what to fill in there)
sample negative prompt is the same negative prompts like before
I did set up my class images in this way: each image is named like 001 002 003 etc, and in a separate folder I have a bunch of txt files also named that, so for example txt file 001 has the prompt corresponding to image 001
But I'm not sure how to make dreambooth read in the correct txt file, so what I did was put all those prompts into one txt file which I now read in alone
hello there JoJoCa 🙂 Happy to see you post your problem.
You are doing a lot of things right here, but some are double work for nothing, and some small errors that hammer your quality
Thanks for the reply, yeah I probably did weird stuff, pretty new to SD
first of all, if you are training on a single character, usually, 10 to 20 pictures is a good target. Anything above is a danger getting bigger and bigger, because things start to repeat in the pictures and you don't want that : you want variety
So first step I would take, is select the 15 best pictures of that dataset, with varied clothes, lighting, background, pose, and framing (close up, full body, ...)
then, if you are using instance prompt and class prompt like you are, then there is no need for a caption file next to the pictures, this is double the work for nothing.
It looks inside the files when you check the option to do so. if not, it takes the class/instance prompt to train
next, the "instance token" is not what you are using, it's just name
as for class token, in this case, I would use woman
it's a "single token" you want in those
the prompts in class and instance prompt will help build on it, but those are the main concepts you are targeting : your new token name and the class token woman
sample prompt seems good, but it's just a control measure anyway, it's something that shows you during training, what would the result be currently if you were to run that "sample prompt" on the model
so it lets you test how good the model performs
usually, only 2 or 3 prompts are enough
like
portrait picture of name
drawing of name, very detailed, half body shot
full body shot of name
(ask if I said something that is not understandable)
That makes sense thank you, I dont really understand the part where you said it looks inside the files
the captions files you created, the .txt files. Those aren't used at all if you don't check the corresponding checkbox in the UI (not sure of its name)
If you do then, instance prompt becomes unused, it instead takes what's in the txt file linked to each picture
There is a last method available, it's using [filewords] as instance prompt. this will make it so the name of your picture will be used. I personally use that : I write the caption of each picture directly in the filename, like "painting of a cat.png"
The whole goal of those 3 things (caption files, instance prompt, and [filewords]=>filenames) is the same : provide a prompt to train each picture on
The instance prompt is intended to be used when you want the same prompt for each picture. In your case, it's what I would recommend, and I would use a very simple instance prompt : name, nothing else. (with your name of course 😉 )
The caption file and the [filewords] method have the same goal : letting you have a different caption per picture. This can be very potent, especially on bigger trainings, but I found it to be overhyped and more complicated to use correctly, so not the best for single subject training
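The three ways of giving each picture a prompt, as described above, could be sketched like this. `resolve_caption` and `use_caption_files` are hypothetical names for illustration, not the Dreambooth extension's actual code or option names.

```python
import tempfile
from pathlib import Path


def resolve_caption(image: Path, instance_prompt: str,
                    use_caption_files: bool = False) -> str:
    """Pick the training prompt for one image (hypothetical helper)."""
    sidecar = image.with_suffix(".txt")  # caption file next to the picture
    if use_caption_files and sidecar.exists():
        return sidecar.read_text(encoding="utf-8").strip()
    if instance_prompt == "[filewords]":
        return image.stem  # "painting of a cat.png" -> "painting of a cat"
    return instance_prompt  # same prompt for every picture


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "001.png"
    image.touch()
    image.with_suffix(".txt").write_text("a photo of name, smiling", encoding="utf-8")
    from_file = resolve_caption(image, "name", use_caption_files=True)
    from_prompt = resolve_caption(image, "name")

from_filename = resolve_caption(Path("painting of a cat.png"), "[filewords]")
print(from_file, "|", from_prompt, "|", from_filename)
```

For a single subject with a shared prompt, only the last branch is ever reached, which is why the caption files are unnecessary in that setup.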
I see, thanks, but if I make my images a simple caption, how do I make it still use the other part (with all the variables like high quality, 4k, ultra res, ....)
well, the good thing is : the base model you are training already knows those parts, you don't need to train those
by providing 15 very varied photos of you, and just saying to Dreambooth "learn this : it's 'JoJoCa'", dreambooth will learn each pic as you, and try to find the common part, the part that is in each and that could fit into that "JoJoCa" token.
It will discard what is changing automatically, not learn your wall behind you if it's a different background each time for example, same for your clothes if you have different clothes.
And it will put everything it find common in those pictures, inside that single token, JoJoCa.
So when you later use that model, you can prompt "JoJoCa", and get already a "mostly valid but bad" picture of yourself
Then you add to your prompt all the other tokens, the 4k, the realistic, ... and you get the results you wanted
I see, thanks a lot, that makes it clear
Last question (for now 🗿 ), whats the difference between instance prompt and class prompt (I'm now using the image caption for instance)
instance prompt is the token it's going to use, and the class is the token that it would replace, is how I understood it
i.e instance of Emma Watson and class of woman -> the prompt "masterpiece, a woman" would be trained to be equal to "masterpiece, Emma Watson"
class is supposed to represent something larger than your instance. just "woman" in your case, or even "person"
this is called "prior preservation" or "regularization data", the class itself, and is completely optional by the way
It's what helps the model remember what a random woman is, and not replace every woman with your face too fast in the model
wonder where I got my info from then :S
you are also right
the regularisation data is trained too
it's a "second concept"
trained at the same time as your main concept
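A toy sketch of how the "second concept" enters training, assuming the usual prior-preservation setup where the class images add a weighted term to the loss. Plain lists stand in for the model's predictions and targets, and `prior_weight` is an illustrative parameter name, not a setting quoted from any tool.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target, class_pred, class_target,
                    prior_weight: float = 1.0) -> float:
    """Instance loss plus a weighted loss on the class (regularization) images."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_weight * prior_loss


with_prior = dreambooth_loss([1.0], [0.0], [2.0], [0.0], prior_weight=0.5)
without_prior = dreambooth_loss([1.0], [0.0], [2.0], [0.0], prior_weight=0.0)
print(with_prior, without_prior)
```

Setting the weight to zero corresponds to skipping regularization altogether, which is the "completely optional" case mentioned above.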
oh sweet lol, last thing I want to do is spread bad info!
it's complicated, but yeah, it was all right from what I understand of it too
Thanks a lot, lets hope it works well now 🙂
Hmm weird thing, I have class images per instance image set to 5 but its only generating 45 images (I have 25 instance images)
you already had some of it generated in a previous try maybe ? not sure, not the tool I use
oh yeah I did, my bad, well any way to make it go faster since its doing 8700 images and the eta is 17 hours 🗿
16gb vram btw
Hello, interesting conversation. Can I ask you guys a quick question. Do the classification images need to be square?
8700 pics Oo for regularization on 25 base pics ? That's insanely high ! I would do maybe 1000, tops
it depends on the tool you use for training. some do accept multi ratio, some still require square pictures
it's in the wiki usually to be sure
I honestly have no clue how to set that so its default
it's the dreambooth extension for Automatic you are using, right ?
yeah
yesterday when I still tried with 60 images it put it on like 20k so now its resuming from that I suppose
Thanks you Guizmus! It’s with dreambooth with A1111
given your context, go into the Settings and set "Total Number of Class/Reg Images" to 0
aight thanks, do I need to restart for that or does it update automatically
I think you'll need to stop it and just restart it, no need to delete/recreate the model though, it hasn't started training
I'm looking at the doc of it right now (https://github.com/d8ahazard/sd_dreambooth_extension) but I don't see any mention of ratio, it seems it requires square pics ? Not 100% sure but it seems so
ok I misunderstood
what's that 8700 pictures you were talking about ?
total steps ?
yeah that is far too many in my opinion
how much Batch Size are you using ?
2500 steps on batch size 1 should be enough for the settings I can see in your screenshot
batch size 29
I used the performance wizard and it put it on that automatically
ho ok lol so then 8700 is just insanely high
batch size 29 means your total dataset is trained each step
my recommendation here would be to run 100 steps
max
to me, the 500 you already did is too many
such dataset is trained in 20 minutes or so
Yeah I thought it was a bit much having 18 hours for a dataset, but I cant find where to put the max steps
oh is it the training steps per image?
18 hours is 50% more time than I took to train my mega model on 750 pictures
yeah definitely bad settings on my end then 🗿
the base recommendation, but it depends on so many things, like the batch size, the gradient accumulation step, the learning rate, ... is to train around 100 times each picture
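The "around 100 times each picture" rule of thumb can be turned into a step count like this. The helper name is hypothetical. The 25-image case matches the 2500 steps at batch size 1 suggested above, and the 29-image case is only an illustration of a batch that covers the whole dataset in one step.

```python
import math


def steps_for_repeats(num_images: int, target_repeats: int = 100,
                      batch_size: int = 1, grad_accum: int = 1) -> int:
    """Steps needed so that each image is seen about `target_repeats` times."""
    images_per_step = batch_size * grad_accum
    return math.ceil(num_images * target_repeats / images_per_step)


batch_of_one = steps_for_repeats(num_images=25, batch_size=1)
whole_dataset_batch = steps_for_repeats(num_images=29, batch_size=29)
print(batch_of_one, whole_dataset_batch)
```

Seen this way, thousands of steps at a batch size that covers the whole dataset means thousands of passes over every picture, far beyond the 100 suggested here.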
I have a guide that comments on lots of things like this
it's not specific to Automatic1111 dreambooth though, all training methods are included, it focuses on the theory behind it more, and helps fix your errors by understanding how it works
Damn that looks really interesting, thanks for making that
lol the entire character looks perfect aside from the face so far
you'll have "fix faces" to help a little when prompting for real
true, I'll let it do its thing
Thank you Guizmus !! Will check
alright 68% and it says training finished, I'll test how it looks with a prompt now
Hey not sure if I can post about this on here but I’m keen to pay $500 for someone who will spend say 3 hours with me walking me through fine tuning with Everydream2 — kinda like a learning session
Am doing a lot of things trial and error so could do with some accelerated learning session
doesn't seem to fix it, but I realised a big part of the problem is that I think im doing something wrong on my end, I copied a prompt from the civitai website and what gives them this result gives me the second
You need to pick up a vae
Yeah I realised as well, used a VAE and now it looks perfect
Just found the multiply.txt feature in everydream, simulating bigger or smaller datasets :
Adding a multiply.txt applies the factor inside it to the current folder
That means that, if I put 0.25 inside a 200 pictures folder, 50 pictures would be selected at random in there each epoch.
the thing works also with numbers higher than one if needed be, but this is just what I needed to continue working on my mega model without retraining the whole 750 first pictures each time : I can just use this old dataset as regularisation, adding the good multiply.txt to balance it in size with the new dataset
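A rough sketch of the selection described above, to make the numbers concrete. This is not EveryDream's code. In particular, the handling of multipliers above one (whole copies plus a random share for the fractional part) is an assumption.

```python
import random
from typing import List, Optional, Sequence


def pick_for_epoch(files: Sequence[str], multiplier: float,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Choose which files from one folder take part in the current epoch."""
    rng = rng or random.Random()
    whole = int(multiplier)             # full copies of the folder
    fraction = multiplier - whole       # remaining share, sampled at random
    picked = list(files) * whole
    picked += rng.sample(list(files), round(len(files) * fraction))
    return picked


folder = [f"img_{i:03d}.png" for i in range(200)]
subset = pick_for_epoch(folder, 0.25, rng=random.Random(0))
print(len(subset))
```

Because the sample is redrawn every epoch, a large old dataset keeps contributing a different small slice each time, which is what makes it usable as regularisation.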
Anybody had this problem? When training in dreambooth A1111, it generates classification images even though I have a folder with classification images in it.
for some reason my trained model is producing tints and sometimes super high contrast
is there a way i can avoid that?
yea this is very useful for regularization images, you can toss 10k images in and use like, multiply.txt with 0.03 or something to pick just a few each epoch
tweak to avoid catastrophic forgetting, etc
if it uses offset noise it probably is using too much, might see if it can be adjusted
i think a lot of the offset noise has been implemented as 10% because the original blog post that introduced the idea used 10%, but it's too much sometimes
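For anyone wondering what the 10% actually refers to: offset noise adds one extra random value per channel, shared by every pixel in that channel, on top of the usual per-pixel Gaussian noise. Below is a minimal stdlib sketch of that idea. Real trainers do this on tensors; `offset_noise` and its arguments are made-up names for illustration.

```python
import random

def offset_noise(channels, pixels, strength, rng):
    """Gaussian noise plus one shared random offset per channel.

    Returns a list of `channels` lists, each holding `pixels` floats.
    strength=0.1 corresponds to the "10%" setting; 0 disables the offset.
    """
    noise = []
    for _ in range(channels):
        shift = strength * rng.gauss(0.0, 1.0)   # same shift for the whole channel
        noise.append([rng.gauss(0.0, 1.0) + shift for _ in range(pixels)])
    return noise

# Hypothetical 4-channel 64x64 latent.
latent_noise = offset_noise(channels=4, pixels=64 * 64, strength=0.1, rng=random.Random(0))
```

The shared shift is what lets the model learn to change overall brightness, and too large a `strength` is a plausible source of the tints and harsh contrast mentioned above.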
what would be the setting for the noise?
Can anyone point me in a direction for gaining a deeper understanding of how to train for concepts? I've only had success doing faces via textual inversion but I want to go way deeper than that. I want to be able to prompt images of characters at a party actually interacting with each other, or working on a car/skydiving/skateboarding/anything. I don't quite grasp why I can generate a million images of a trippy giant mushroom and have them all come out different from each other and look amazing, but it's impossible to make even one good image of a crowd of people on a dance floor. What would go into making such a thing possible? Is it something viable for one person with a 4090 to accomplish or is this the sort of thing that would require hundreds of thousands of images and a mining rig running nonstop for an unholy amount of time?
so i spend the last 5 days capturing 6k images
if the result sucks its all your fault xDDD
*captioning
does SD understand perspective? like angles on an object? when training is there a proper way to specify angles of view? like looking from top down, looking from front?
also shall i use fp16 or bf16?
fp16 unless it doesn't work, iirc
Hey, someone mentioned it's possible to train on top of a model with images in the dataset from the first training, plus and minus a bunch of images, without overtraining on the consistent images. Is there such a thing?
Anyone trying to finetune DeepFloyd IF?
Not sure theres much point - its a 64x64 model with 2 upscales.. more excited for sdxl release
So what, its still state of the art for realistic, so if you train it to be anime style it should be able to do that too
Ok, I need some help. I am training a new version of a model on top of the original. I can tell that it is training based off the sample images. But when I try to test these models, the outputs are basically the exact same thing as the base model. Does anyone know what is happening and how I could salvage this? It's the third time I have tried training and I really don't want to go again since it takes like 4-5 hours each.
There is an example
Btw this is a 2.1 768 model
I'm going to try one more thing.
before (base SD 1.5, a painting of flowers in a vase on a table):
after:
My lora REALLY likes tulips
I finally got kohya to work after much reinstalling everything and this is the first lora file I've made that doesn't instantly make things MUCH worse, lol
any tips how to make the black lines smoother?
hi guys, im having massive issues with yellow tints in my finetuned model, anyone know how to tackle it?
also i have another question: i have a model that has like 100 different concepts in it, some concepts have 5 images and some have 500 images. now what is happening in my model is that the ones that have like 500 images perform well but its almost impossible to generate the concepts that have only 5 images. i ran it with 50 epochs with multiples of 10 (so every image is learned at least 500 times) - what is the best way to properly weight those concepts?
so i need someone to tell me how many unet steps i need to train a character with 95 images in dreambooth and also text encoder steps... by default for 10 images it was 1500 unet and 350 text encoder so i multiplied x 9 since i have 90 images...
Its a community support forum - there is no “@ Support”, you are relying on others passing their knowledge on
As for the step count - as with a lot of things with SD its trial and error - you may find with one dataset 1500 steps is plenty even for 90 images, and another might need 25000 steps depending on the look you want to go for…
its not like you can expect ppl to wait here and answer you pal. just be happy if you get a reply once in a while!
aww thats so sad bc the ones in reddit just dont answer :C
well i tried with 1500 and i got poop. 3000 , poop. 9000 looking better
i will try with 15000
👀
ended up writing a program that helps with the weighting of concepts 🙂
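The program itself was not shared, so the following is only a guess at what such a weighting helper might look like. `balanced_repeats`, the `power` softening knob and the `max_repeats` cap are all hypothetical. The cap is there because replaying 5 images a hundred times per epoch is a likely route to overfitting them.

```python
import math

def balanced_repeats(image_counts, target, power=1.0, max_repeats=50):
    """Suggest a per-concept repeat count.

    power=1.0 fully equalises exposure; power=0.5 only partly closes the gap.
    max_repeats caps how often a tiny concept is replayed per epoch.
    """
    repeats = {}
    for concept, n in image_counts.items():
        raw = (target / n) ** power
        repeats[concept] = max(1, min(max_repeats, math.ceil(raw)))
    return repeats

counts = {"rare_concept": 5, "common_concept": 500}     # made-up concept names
print(balanced_repeats(counts, target=500))             # full balancing, capped
print(balanced_repeats(counts, target=500, power=0.5))  # softer balancing
```

Whether full or softened balancing works better is something to test per dataset; nothing here is a recommendation from the thread.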
I dunno if anyone's answered this question for you, but I managed to successfully merge a 2.1 lora into a 2.1 model. In Kohya you can go to the "Merge Lycoris" tab and plug in the lora as if it were a lycoris.
keywords work without calling the lora
The newest version of my Lora is doing great because this time instead of repeating all training images the same amount of times, I have a "tier system" where the best images have duplicates. I only trained for 4 epochs and I'm getting good results because I'm training on the good images 100 times and a bunch of "meh" ones 5-20 times, which is weighting the concept higher
This is also working for concepts I don't have a lot of images of but I want to be strong, so I put them in the high repeat folder
I don't really know what I'm doing yet, but I'm using Kohya as a front end to train LORA and in the images subfolders you name things number_foldername like so and number is how many times your script repeats the folder
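To make the number_foldername convention concrete, here is a small sketch that reads the repeat count off each folder name and adds up how many images one epoch would show. The tier names and image counts are invented for illustration, and this is not Kohya's own parsing code.

```python
def parse_repeat_folder(name):
    """Split 'number_foldername' into (repeats, concept name)."""
    repeats, _, concept = name.partition("_")
    return int(repeats), concept

def images_seen_per_epoch(folders):
    """folders maps a folder name to the number of images inside it."""
    total = 0
    for name, n_images in folders.items():
        repeats, _ = parse_repeat_folder(name)
        total += repeats * n_images
    return total

# Hypothetical tiers: a few great images repeated often, many "meh" ones rarely.
tiers = {"100_best flowers": 10, "20_good flowers": 40, "5_meh flowers": 200}
print(images_seen_per_epoch(tiers))  # 10*100 + 40*20 + 200*5 = 2800
```

In this made-up split the 10 best images account for over a third of what the model sees, which is the "tier system" effect described above.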
I learned this on this sketchy 4chan lora training guide under "How to set up the directory" or something like that
https://rentry.org/lora_train
There's no inappropriate images in that document but 4channy things make me nervous so watch out
tried making my flower lora better today
turning it up to see which colors really come through, I think I'm gonna try looking for more cool tones to add "high" in the data set
I'm collecting before and after every time I add to this LORA so that I know what's working and what isn't
i have like 250 concepts that i want to train in a model. can i do that using lora or should i just use dreambooth? all my concepts are captioned in folders atm
Lora for each and merge them together?
im going for dreambooth not for testing, 112hours 1.1mil steps
if it doesnt get me where i want to get ill go to lora next or try to fix the weights
i still don't understand the difference between lora and dreambooth tbh. with kohya_ss, as far as i understand, the textual model is also trained, so what does lora do differently other than the output being kind of a diff?
The basic summary is A lora is a set of instructions on how to edit a base model - dreambooth is a whole model
im on 112 of about 700 hand editing, annotating, and sorting my dataset
there is basically no difference, just a different way of storing the model. The only limitation of lora is that you set a rank beforehand that limits how fine your model can differ from the base model
except the lora is sub 200mb, and the full model is 10x that minimum
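Some quick arithmetic on why rank controls both file size and how far the LoRA can move the model: for one weight matrix, LoRA stores two thin matrices instead of a full-size update. The layer width below is an assumption picked for illustration, not a claim about any particular SD layer.

```python
def full_params(d_out, d_in):
    """Parameters in a full weight matrix (what a full fine-tune changes)."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """LoRA stores two thin matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)

d = 768  # assumed layer width, for illustration only
for rank in (4, 32, 128):
    ratio = lora_params(d, d, rank) / full_params(d, d)
    print(f"rank {rank:>3}: {ratio:.1%} of the full layer")
```

A higher rank means a bigger file and a more expressive update; a low rank is smaller but can only represent a coarse change to the base weights.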
what im concerned with the lora is that my input dataset is cress dependent
*cross
meaning that the content is linked via tags, if i separate the concepts and make each of them a different lora im not sure if they can cross match at the end
how are you sorting and annotating? in folders?
how big is the difference if i use something like wd14 on my images and have much more detailed tags rather than concepts?
I know there is a big difference between captioned and uncaptioned.. have only ever hand captioned stuff myself though
you don't necessarily have to separate them in different loras
just train them all together
any difference between everydream and stabletuner?
Trying to pick one for full fine tuning
if you hand caption how detailed did you make your captions?
Also wondering should you add miniconda to PATH in order to get stabletuner?
ed2 is pretty much the only one that is still actively developed and enhanced
is there a way to get the Triggerwords of a lora (safetensor) file? Cause over time I accumulated quite a lot of loras and either the site have been deleted or i can't find it anymore...
this is the issue with using weird keywords for everything, you have hundreds of files and can't use them without a magic dictionary
I'm using BLIP captioning to generate text files for each image, but then editing all these txt files in vscode because I can pull up the image easy and also crop it and stuff with an image editing plugin
then I'm just throwing them all into folders that are how many repeats each image gets during training
may i ask some examples i just want to understand how extensive your captions are
Keep in mind that I don't know what I'm doing 😅 , but I try to keep them as simple as possible. When I was detailed and had several commas and tried to describe the whole image well, my LORA made the outputs worse.
a man with a flower crown on his head
It doesn't matter if the man has a suit on or a t shirt or he's shirtless, or what he looks like, I just try to keep it simple and pertaining to the subject of my LORA, which is decorating things with flowers. The man is interchangeable.
a painting of a black cat surrounded by flowers
a woman wearing a dress covered in flowers in a garden, depth of field
many flowers in a garden
floral illustration of roses
Less seems to be more
Has anyone had success training a lora for an action as opposed to an object? Like a jump or a punch
Im guessing by some of the specific loras on civitai thats absolutely possible
Hello guys,
I want to train a model on landscape photography with a dataset over 100k images
can anyone please help me with training embeddings? ive tried 3 times now and every time it looks nothing like the original reference images, not even close
im trying to make an embedding of an original anime character and it's not getting anything right, not even the color of the hair. the thing about prompts is i am really confused about that part: I did process the images to make the prompts for me, but the tutorial i was watching said to only keep the prompts that aren't integral to the character, so if it always has long hair remove the long hair part, and if it's wearing a helmet remove the helmet part from the prompt, which i did. settings are:
Embedding learning rate: 0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005
prompt template: custom_subject_filewords.txt
max steps: 3000
save image to log directory N steps: 50
save a copy of embedding to log directory every N steps: 50
save images with embedding in PNG chunks: checked
shuffle tags by ',' when creating prompts: checked
drop out tags when creating prompts: 0.1
sampling method: deterministic
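The learning rate line in those settings is a stepped schedule: each `rate:step` pair means "use this rate until that step", and a final bare rate applies to whatever is left. A minimal sketch of how it reads follows. `parse_lr_schedule` and `lr_at` are made-up helpers, and the exact boundary handling (`<=` versus `<`) is an assumption.

```python
def parse_lr_schedule(spec):
    """Turn 'rate:step, rate:step, ..., rate' into [(rate, last_step or None)]."""
    schedule = []
    for part in spec.split(","):
        rate, _, until = part.strip().partition(":")
        schedule.append((float(rate), int(until) if until else None))
    return schedule

def lr_at(schedule, step):
    """Learning rate in effect at a given step."""
    for rate, until in schedule:
        if until is None or step <= until:
            return rate
    return schedule[-1][0]

spec = "0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005"
sched = parse_lr_schedule(spec)
print(lr_at(sched, 5), lr_at(sched, 100), lr_at(sched, 2999))
```

One thing this makes visible: the last entry, 0.0005, only applies after step 3000, so with max steps set to 3000 it is effectively never used.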
model im using isnt the 1.5 but ive tried using that one and it got even worse results
i can dm the results and reference images
hello, I know how to drop a training image into the SD bot, but I do not know what prompt to use to attach it so I can then do image2image
how much RAM do you typically need for Lora training. I see mentions of needing just 6gb on a 8gb gpus from some and using ~30gb on Colab e.g. in this recent thread https://www.reddit.com/r/Oobabooga/comments/13ait7z/lora_training_runs_out_of_memory_on_saving/
I do have an 8gb gpu, and i do have colab pro (tho it seems pretty nerfed now and even the A100 shows like 16gb of vram) just curious if it's worth it even trying to set it up on my 8gb gpu or if it will be subpar
6gb for image training works ok, but oobabooga is a text2text platform and the requirements might be very different as those models are much larger
The bmaltais/kohya_ss fork works fine on a 1060, if a tad slow
Have u tried a lora instead?
well idk if making a lora is good for specific characters
that does sound like something super cool, however labeling the image i imagine to be a nightmare
look at kohya_ss, there are even youtube videos on lora and dreambooth with it. i found it quite easy to get started with
i dont wanna use dreambooth because thats only for 1 model
kohya supports lora also, also it has a nice web ui and helps creating the folder structure and so on
its probably your easiest starting point
also it takes care of the dependencies and so on by itself
are you using caption shuffling?
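For reference, caption shuffling as discussed in this channel boils down to splitting the caption on "," and reordering the pieces, optionally dropping some at random. The sketch below is illustrative only. `shuffle_caption`, `keep_first` and `dropout` are hypothetical names, not the options of any specific trainer.

```python
import random

def shuffle_caption(caption, rng, keep_first=1, dropout=0.0):
    """Shuffle comma-separated tags, keeping the first `keep_first` in place."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tags[:keep_first], tags[keep_first:]
    tail = [t for t in tail if rng.random() >= dropout]  # randomly drop tags
    rng.shuffle(tail)
    return ", ".join(head + tail)

rng = random.Random(0)
for _ in range(3):
    print(shuffle_caption("xyz car, from front, on street, daytime", rng))
```

Keeping the subject tag fixed at the front while the rest moves around is one way to stop the model from tying the concept to a fixed word order.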
i wanted to train embeddings
you can make a dreambooth then extract the LoRA
i've had a lot more success making a dreambooth first since it doesn't require captioning
what do you mean by infinite loras?
well how many loras can you use for a single image ?
they all together have to = 1 in value no?
don't think there's a hard limit, just that they can interact in weird ways
and no, don't need to add up to 1
well what if i want multiple people i cant put 2 loras with different people
i'd just use inpainting for that
seems way too much work to do for every single image until i get something that looks good also inpainting never looks good at least not when ive tried
all i want is some help with training TI embeddings .-.
looked at your original statement, how are you doing your captions
if you mean the txt file captions for the images then i generate them with the processing tool and then i remove everything that is integral to the characters design
from the captions thats what ive been told to do
yeah, that's basically the gist of it, thought you might be over describing the object you're trying to embed
all i describe is the background sometimes pose and clothes
and when i train it even after 3000 steps its not even close not even the color of the hair is right
mind posting a training image and your caption?
ye sure should i send you it in the dms ?
sure
is there any way to interpret loss rate in correlation with learning rate
im using a cosine with warmup and my loss is stabilizing to a constant after 20 epochs even though the learning rate keeps decreasing. is there a way to interpret this?
also for some reason my speed went from 2.6 it/s to 3.1 it/s, why is that?
gpus work faster when tortured hard over some time? xDDD
Yeah it's gonna take a bit of time, but it can be a perfectly trained model.
Can I DM?
there is no correlation. Sure, the final gradient is proportional to loss and learning rate, but both numbers don't have anything to do with each other.
Learning rate is determined by a learning rate scheduler. It decreases if you want it to decrease
loss in SD does not tell you much
it mostly depends on the sampled noise and time step, so you have to average loss over many steps or large batch sizes to see anything and even then the change in image quality is not necessarily going with a decrease of loss
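If you do want to look at loss anyway, averaging over many steps is the usual trick. Here is a small sketch with an exponential moving average on synthetic numbers. The fake loss curve is invented to mimic "slow trend buried in noise" and is not real training data.

```python
import random

def ema(values, beta=0.98):
    """Exponential moving average with bias correction."""
    avg, out = 0.0, []
    for i, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        out.append(avg / (1 - beta ** i))
    return out

rng = random.Random(0)
# Fake per-step losses: a slow downward trend buried in heavy noise.
raw = [0.15 - 0.00002 * s + rng.gauss(0, 0.05) for s in range(2000)]
smooth = ema(raw)
print(f"raw spread:    {max(raw) - min(raw):.3f}")
print(f"smooth spread: {max(smooth[200:]) - min(smooth[200:]):.3f}")
```

Even a smoothed curve only tells you whether training is behaving, not whether the images look better, so sample images remain the real check.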
Think of the model as a giant music studio with 895 million channels, training the model is just moving the sliders to get the desired result, learning rate tells you how far you can adjust each slider at each step
Deliberate is around 2 gb, so how with how many images did they train the model?
The size of the model does not correlate to the amount of images it was trained on
I want to train a model with 100k, how much size would it take?
There's still 895 million parameters; the only difference would be if there were additional information, which there shouldn't be. The trained weights will be there, but depending on whether it's saved as fp32 or fp16/bf16 it will come out to either 3.95gb or 1.99gb for an sd1.5 base trained model, or 2.43gb for sd2.1
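The point that file size follows parameter count and precision, not dataset size, can be checked with simple arithmetic. In this sketch the 895 million figure is the one quoted in the chat. A full checkpoint may hold more weights than that, so the printed numbers are not expected to match real file sizes exactly.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def checkpoint_size_gb(n_params, dtype):
    """Approximate weight size in GB (10**9 bytes), ignoring file metadata."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

n_params = 895_000_000  # figure quoted in the chat; a full checkpoint may hold more
for dtype in ("fp32", "fp16", "bf16"):
    print(dtype, round(checkpoint_size_gb(n_params, dtype), 2), "GB")
```

Note that the number of training images appears nowhere in the formula: training on 100 or 100k images changes the values of the weights, not how many there are.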
Oh, so even the captions play an important role?
@cold wyvern Can you also collaborate with us on the project?
Going back to the analogy above, the captions help the machine determine which sliders need to be moved and how far
Oh ok
The cliffs notes??
0 is usually fine - I think 0.05 is what the original scientific paper recommended when they introduced it, most people go for 0.1 or 0 I think
so i am trying to make a lora and its stuck at 0% and wont move
anyone know whats up?
Question, say I generate an image of a person and they are wearing a shirt I really like.
However the shirt is patterned(flora, Hawaiian)and is at a not centered angle. It’s also drawn in a particular style and not photorealistic.
How would you generate a bunch of test images of this shirt from different angles and close-ups, given this one image?
So I have a good training set for a Lora, I was wondering if I had to crop it out and put it in image to image with low denoising but I feel like that pattern will change even with low denoise
actually what I really like is to use alpha layer to mask out the part of the image that is irrelevant (e.g. just keep the shirt). Then you might train with a single image. Textual Inversion in auto111 Supports this for example
the longer you train the less you need
for LoRA extraction with Kohya SS what should the network dimension (rank) and conv dimension (rank) be set to?
There's probably a wiki somewhere that says something along the lines of
Don't ask how many
(prompt, image)pairs are required for fine-tuning a stable diffusion model, because every fine-tuning task is a bit different.
On that note, has anybody fine-tuned a stable diffusion model to generate photorealistic images within a specific domain using a custom dataset? What was the size of your dataset? I need to generate a training set of (image, text) pairs for fine-tuning a CLIP model, where each image is a photorealistic scene and each text snippet is a description of what's in the scene (types of objects, and their spatial relation to one another).
I know "as many training samples as possible" is a totally valid answer, but I've only got enough motivation to manually label one reasonably small training set. That being said, I suspect the CLIP model will require significantly more samples than the SD model, and so I'm tempted to use a fine-tuned SD model to bootstrap a training set for fine-tuning my CLIP model. Has anyone done this before?
I am planning to train a model.
What do you guys suggest?
Lora or Dreambooth?
What's the difference?
@warm agate - I go for Lora as I wanted to be able to take that info and merge it into different models and see what came out. If you are trying to get a single model out, then dreambooth I think
I dont know if theres a technical reason to choose one over the other
Why does dreambooth take more VRAM?
Because it goes through the images more times?
Same reason a lora is smaller than a whole model i think - one is generating a set of instructions on how to edit the model, the other is editing the full model on the fly
Oh so Lora is used for partial training like, Face, drawing style etc
Whereas, dreambooth is used to train image style, costumes?
Am I right?
No,you can do style loras
Possibly, they did change the version of torch
Oh ok.
So in my case should I choose Dreambooth?
Try both tbh
I think it would take a lot of time
@cold wyvern Can you help me with image scraping from r/EarthPorn?
Cant sorry, but I do like that idea and wish you luck!
👍
@cold wyvern Do we even need to add description for each image if we want to train.
For example, if a human is wearing a red dress so we add description of Man wearing a red dress on a sunny day?
Again, as with most things stable diffusion related, it's “try it and see”. I have had some success with uncaptioned, and some where the captions were absolutely necessary
Oh ok, will check
hi guys, in my model some concepts are overfitting while some concepts are almost impossible to create. is there any proper way of handling that other than trial and error?
you can try alpha channel masking to weight down regions in the image it should not put too much effort on. I don't know, though, which scripts support that.
what is alpha channel masking?
also note that i have a set of around 7000 images
also another question i have is: for some reason my model is mixing up concepts that should not be mixed. its like a car and a train, when i create something it creates like a cartrain. how can i stop that? also is it possible to have a car and a train generated in one image somehow?
I would have thought captioning would have fixed that?
you can open your images in a graphic editor and make parts of the images transparent.
SD ignores transparency, but some training methods use the transparency as weighting to emphasize which part of the image the model should train on.
Of course, for 2000 images you don't want to do that manually
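What "use the transparency as weighting" can mean in practice: the per-pixel error is multiplied by the alpha value before averaging, so fully transparent regions contribute nothing to the loss. This is a toy sketch on flat lists. `masked_mse` is a made-up helper, and real trainers apply the same idea to tensors in their own way.

```python
def masked_mse(pred, target, alpha):
    """Mean squared error where each pixel is weighted by its alpha (0..1)."""
    weighted = sum(a * (p - t) ** 2 for p, t, a in zip(pred, target, alpha))
    total = sum(alpha)
    return weighted / total if total else 0.0

pred   = [0.2, 0.9, 0.4, 0.0]
target = [0.0, 1.0, 0.4, 1.0]
alpha  = [1.0, 1.0, 1.0, 0.0]   # last pixel is transparent, so its error is ignored
print(masked_mse(pred, target, alpha))
```

The last pixel has by far the largest error, yet it does not affect the result because its alpha is 0.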
that's, unfortunately, a known limitation of SD. It has problems generating multiple concepts in one image. There are multiple ways to go around that
you can create two images separately, copy paste them into one and run img2img
you can use composable diffusion - it sometimes get these things right
I think there is even a plugin for composable diffusion in auto111 that allows you to make regions in your image with separate prompts
Regularization images should help prevent classes from merging
Any Lora colabs that still work? Please @ me if you know.
composable diffusion - any more info on that?
i have a model im training and as far as i understand regularization images only work if you train 1 concept for 1 model, i have like 600 xD
Are you fine-tuning existing classes? If so regularization should work. Like teaching it 50 different cars and 50 different planes and 50 different ducks could all be done with their own regularization images.
no, you can use regularization images also for multiple concepts
Honestly what even are reg images, or class images? I never understood what to use those for
Though I have never trained subjects so I guess they don't matter for me?
just for preventing overfitting. If you show images of rabbits drawn in a particular style, you also show, for example, photographs of rabbits so that the model does not forget what rabbits without that style look like
if i'm training a face: any reason to use a learning rate other than 0.000001 and a learning rate scheduler of constant/linear?
is there a way to scrape instagram accounts for images?
jdownloader probably does that
simple question: will an embedding trained using textual inversion on sd 1.5 work with any other sd 1.5 based models, for example the base model of Deliberate https://civitai.com/models/4823/deliberate is 1.5, so should/would my embedding work with Deliberate?
a textual inversion will "work" with any other model made with the same base; but results may vary 🙂
@brazen oriole
gotcha, thanks. just kinda wondering if i should be training with the standard model or the variant mostly, since id like to be able to use the embedding with different base models (of the same version)
use the model you will use the embedding in for best results
I have tried fine-tuning dreambooth using 100 images with 1500 regularization images for 10K steps... Interestingly the output i get from it on 5 inference prompts is exactly the same as what the SD1.5 gives on those prompts...
It is as if fine-tuning never happened... Anyone faced same issue??
guess there is something going wrong. The output is never the same, even after very few steps of training....
You should track progress with some validation images
Yeah cool I'll have some validation images and see
Also, I am not sure if the data has to be specifically of one subject only or we can keep it like a mixture?
Hello fellow humans,
I was trying to use Dreambooth to create lora from photos of specific girl with glasses.
And I am getting mixed results and idk if it is just bad training or if the glasses are the problem.
Is it generally better to train on photos without those, or can SD handle that just fine?
Thanks in advance,
feel free to DM (or just @ me, but i dont want to spam there) if u are experienced in person training. Ive got few more simple questions. And any help is well appreciated.
Hi, I'm running a startup that heavily uses SD. we are looking to expand the team with a freelancer or 2. Searching for somebody that has tons of experience in finetuning SD 1.5 & 2.1. Shoot me a DM if you think that could be you. happy to explain more there
It can handle it just fine. I'd train a Dreambooth first and extract the LoRA later.
sd does well with glasses from what i saw, its probably an issue with over / underfitting
may i ask your workflow and maybe a sample image of what is not working
can someone explain how to make regularisation images if i finetune with multiple concepts?
also for regularisation images what should be in them? the current model is not able to create what i finetune for. so how would i create those images?
The way I did it: have your base model spit out like 1500 images of the class you're fine-tuning
So like if you're training a woman's face and your instance prompt is "a photo of kdbwh85 woman face" then you have your base model spit out 1500 images with the prompt "a photo of a woman face"
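A small sketch of how the instance-prompt-to-class-prompt step could be scripted when there are many concepts: drop the rare token from each instance prompt, then group identical class prompts so each class only needs one set of regularization images. `class_prompt`, `reg_plan` and the extra example subjects are hypothetical, and the result is the prompt without the token (so it does not add the article "a" as in the example above).

```python
def class_prompt(instance_prompt, token):
    """Drop the rare instance token to get the generic class prompt."""
    words = [w for w in instance_prompt.split() if w != token]
    return " ".join(words)

def reg_plan(concepts, images_per_class):
    """Map each distinct class prompt to how many regularization images to make."""
    prompts = {class_prompt(p, tok) for p, tok in concepts}
    return {p: images_per_class for p in sorted(prompts)}

concepts = [
    ("a photo of kdbwh85 woman face", "kdbwh85"),
    ("a photo of zxq12 woman face", "zxq12"),     # hypothetical second subject
    ("a photo of sks car", "sks"),                # hypothetical third subject
]
print(reg_plan(concepts, images_per_class=1500))
```

Three subjects collapse into two classes here, which hints at why hundreds of concepts do not necessarily need hundreds of regularization sets.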
but i dont fine tune a single class
i have like 300-600 classes
thats why i am asking how i would take care of the regularisation image. i understand how it works for 1 single class
but as far as i understand dreambooth can only have 1 regularization folder
also my issue is that the class is new so i wouldn't know how to create regularization images
its just something the current model cant create
Not sure what your set up looks like, but for each concept I'm trying to train I can specify a different regularization folder
are you working with kohyass?
Nah, Dreambooth extension for A1111
But also not trying to train 300-600 concepts at a time lol
concepts is like 900 xD
The concepts that are bleeding together, are those the new concepts?
but i can probably cut the regularization down to 300 to 600 different classes
And is it all of them? Or just certain ones.
what do you mean with certain one or all together
You said that classes were kind of merging together
Is it all the classes you're training or just some of them
all of them
i mean the results im getting without regularisation aren't too bad, its just that some concepts are overfitting and some are hard to create at all
which is because for some concepts i have 100 images and for some i have 3
what im doing now i weight them down so that the concept with 100 gets as many runs as the concept with 3
but it seems that this is what is causing the overfitting
Maybe separate them into different batches
because those concepts with 3 images are run 30 times in 1 epoch while the others are run once
Like train all the high image concepts together
what do you mean by batches?
Then do a separate training for the low image concepts
Have no idea if that would work, but might be worth testing
you mean like training each individually and then merging them?
Yeah
tbh the level of work required with this is really not an option
Or train the model in steps
its more likely to define a regularization for each "base concept"
what do you mean by train the model in steps?
Like train it with all the high image concepts
Then train the fine tuned output with the low image concepts
i see
i have like 1300 class folders
and like 7000 images
so my average is like 5 images
however maybe what i can do is reduce the classes down to like 30 or 40
Hahaha damn
with the same images
and train a base model with them
and then specialize it with the full set
so basically create 30 classes with 7000 images
train the model as base
Yeah, I'd definitely experiment with how much you're feeding it at any given time
and then run the actual training
i was just hoping that weighting alone will help
also anyone know if shuffling captions makes a big difference?
and also what LR Scheduler are you guys using?
I use 0.000001, constant but I had the same question earlier
1e-6 right?
for me im using cosine with warmup
it appears the warmup is quite important
i have tested with constant before without warmup and it really didnt turn out too good
it appears everybody is using a scheduler that is reducing the LR at the end but i dont really understand the benefit yet
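For comparison with a constant 1e-6, here is what a cosine schedule with warmup looks like as a function of the step. This is a generic sketch of the common formula, not the exact scheduler of any particular trainer, and the step counts are arbitrary.

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_with_warmup(step, 1000, 100, 1e-6):.2e}")
```

The usual argument for the decay at the end is that smaller updates late in training let the weights settle instead of bouncing around, while the warmup avoids large, poorly aimed updates at the very start. Treat that as the general rationale rather than something measured in this thread.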
I'm new to AI and I'd like to do a LoRA character training. I found this tutorial that seems super nice and easy https://imgur.com/a/mrTteIt#TjsDxqp but it's using Google Colab and I don't, so I struggle to follow the actual training part
Could anyone explain it to me or have a good link to what I'm trying to learn please? This is the only tutorial that seemed easy and I struggle with whatever else I found

Thank youu I'll check it out 
I hope its not too hard to understand
Not sure where to ask, does the WebUI have a function similar to Remix (MidJourney)?
Webui doesnt but Nerdy Rodent had a video a while back on a SD based image merge
what are you guys using to latent caption images? DeepDanbooru works well for some but not so well for other images. anyone used WD14 before? is it any better
I figured you would use deepbooru for stuff trained on danbooru(anythingv3/novelai) , but I've only ever tried to make two loras so I don't have the testing data to say
Related to that...I just made a lora for a character and while they got the hair and face kinda right, they completely got the skin tone wrong despite me not including the skin tone in the captioning. Character is much lighter skin than I wanted.
You are supposed to not include attributes in the captioning text files that are attributed to your character right?
Also if you are training on a character, is it better to leave the 1girl/1guy tags or remove them in your captioning files? I figured you would remove those since being 1girl/1guy is an attribute of any character
I have a collection of faces of different characters I'm trying to train on. Would it make sense to, say, add woman and man to the description? For example should I say Ryu man and Chun li woman? Will that give better results in general?
from my experience the better you caption your pictures the better your result is
i suggest you run something like DeepDanbooru on your images and then kind of optimize the captions.
i have just tested a model without detailed tags (just the concept) and when i added the detailed tags the result was much better than before
does anyone know if training a dreambooth on a model that's already dreamboothed would affect quality in any way? Say if I want to train a new subject on a model where I already added subjects
shouldn't affect quality, but as always, the proof is in the pudding 🙂
cuz I do know that the reg images affect the entire model
so maybe if you train it too much it will start overfitting into ur reggies? so many questions, so little time to test them all
@cold wyvern Is it important to caption images before training model?
Its not necessary but highly recommended
do you know any tool which can auto-caption images, as I am trying to train a large dataset? It would be difficult to caption each image.
I dont find Deepdanbooru for landscape photography
Both the dreambooth extension and kohya have tools for captioning
Dreambooth has inbuilt captioning and will auto-caption when we input our images?
Nope - theres a tools tab and you can select the image folder in there and it will caption the images in the specified folder
Ok
What's Kohya?
does dreambooth have a gui?
dreambooth extension is for a1111 - kohya is https://github.com/bmaltais/kohya_ss
The blip captioning in kohya is pretty shit though
Is there any other that works better?
i’ve been getting some good results finetuning SD2.1 by freezing all layers of the text encoder except for the last handful (2-6 out of 24). it seems to do a great job of preventing overfitting and catastrophic forgetting even at relatively high learning rates. doing this makes SD2.1 training feel about as “easy” as SD1.5 training (it’s still tricky, it’s just no longer a nightmare). there’s a branch on EveryDream2trainer if anyone is interested in trying it out.
guys what does "batch count" do? I don't think it increases the speed of training because its/s remains the same regardless of the batch count during Dreambooth
it is number of images that it processes before it updates the main model right?
wouldn't 1 batch count be better in terms of quality because it updates itself with every new sample? Though it might increase the training time because of GPU-CPU transfer
its the number of images that it processes simultaneously, decreases the number of steps and has a slight performance increase
performance increase is speed it takes to complete the whole training right?
what I want to know is its impact on the final quality of the model
I get it that it reduces time it takes to train because it reduces the number of bus transfer between cpu-gpu
however i don't understand whether it can improve/deform quality of the final model
during training the point is rather that multiple images are trained together in one update step. So your gradient update is an average across the images in one batch
this makes the gradient less noisy and unstable (in SD the gradient is often extremely noisy due to the stochastic nature of the noise sampling)
as far as I know, when training diffusion models people tend to use extremely large batch sizes to make training more stable. However, on consumer hardware you cannot do this, so people tend to use small batch size and extremely low learning rate instead
so yeah, the quality of the model is probably better with a larger batch size for the same number of steps. But of course, training time also increases a lot.
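A toy sketch of the averaging argument above, assuming each per-image gradient is the "true" gradient plus independent Gaussian noise (a simplification; the numbers `true_grad` and `noise_std` are made up for illustration). It only shows that the spread of the averaged gradient shrinks as the batch grows.

```python
import random
import statistics

def noisy_gradient(rng, true_grad=1.0, noise_std=5.0):
    """One per-image gradient: the true value plus heavy sampling noise (toy model)."""
    return true_grad + rng.gauss(0.0, noise_std)

def batch_gradient(rng, batch_size):
    """Average of `batch_size` per-image gradients, as used for one update step."""
    return sum(noisy_gradient(rng) for _ in range(batch_size)) / batch_size

def gradient_spread(batch_size, trials=2000, seed=0):
    """Standard deviation of the batch-averaged gradient over many simulated steps."""
    rng = random.Random(seed)
    samples = [batch_gradient(rng, batch_size) for _ in range(trials)]
    return statistics.stdev(samples)

spread_bs1 = gradient_spread(1)
spread_bs8 = gradient_spread(8)
print(f"batch size 1: spread ~ {spread_bs1:.2f}")
print(f"batch size 8: spread ~ {spread_bs8:.2f}")
```

With independent noise the spread should fall roughly with the square root of the batch size, which is why a larger batch gives a steadier update.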
its very simple, just use DeepDanbooru
it will create tags for the images as txt file
But the results aren't that reliable for landscape photography
also the info i read so far is basically the opposite, ppl recommend batch sizes of 2 or less.
every paper I read used batch size of 20 or more
as said: on consumer hardware you cannot use large batch size, so people have to use super small ones
i think it also matters much if you have alot of images of the same type or very few images
this is only for anime, right? For everything else you would have to use a CLIP interrogator or a BLIP model
if you have only a few images per concept then i think mixing them up will really give bad results. but if you have 100 of one it will not matter
i use it for everything. the results are not bad
not really. Even if you have only one image using batching makes sense, because it stabilizes the noise sampling
like this example of querty
i find the result pretty fair
never tested it on anime really. i just didn't find anything better yet
I would say CLIP interrogator is WAY better
do you have a link?
I just used the webui by vladic. It has it builtin
but guess you can also download it as separate extension for auto111
thanks i will test that
i saw huge improvements in the model when adding more detailed tagging so i guess the better the tags are the better the result
so basically you say to use as high a batch count as possible? but if you say the change in the model is the average of the batch and i only have 1 image of a type doesn't that kind of kill the idea unless you run like 1000 epochs with very low learning rate?
no, because you always sample a timestep. Let's say you sample a timestep at 1% then the image is completely noise and the model only learns rough shapes. If you sample at 99% then the image is almost perfect and the model just learns the fine details and textures
thus, what the model learns is completely different
what do you mean with sample at timestep?
when you generate an image you create a random noise image and then step by step denoise it. You can watch this process in the webui
when you do training, you do not start from pure noise. Instead you draw a random time step. Lets say you draw the time step at 50%, then you add as much noise to the image as it would look like after doing 50% of the steps. The model then only removes as much noise as a single step
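A minimal sketch of the "draw a random timestep, then add that much noise" step described above. It assumes a plain linear beta schedule with 1000 steps and works on a short list of numbers instead of a real image or latent; the actual schedule and tensor shapes in a given trainer may differ.

```python
import math
import random

NUM_TIMESTEPS = 1000  # assumed schedule length

def alpha_bar_schedule(num_steps=NUM_TIMESTEPS, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def make_training_example(image, alpha_bars, rng):
    """Pick a random timestep and noise the image to that level.

    Returns (noisy_image, noise, t); the model would be trained to predict `noise`.
    """
    t = rng.randrange(len(alpha_bars))
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noisy = [signal_scale * x + noise_scale * n for x, n in zip(image, noise)]
    return noisy, noise, t

rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
toy_image = [0.5, -0.2, 0.9, 0.0]  # stand-in for image (or latent) values
noisy, noise, t = make_training_example(toy_image, alpha_bars, rng)
print(f"timestep {t}: signal kept = {math.sqrt(alpha_bars[t]):.3f}")
```

Because `t` is redrawn every time, the same image yields a different training example on each pass, which is the point being made about single-image datasets.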
i just dont understand what this has to do with the batch size
if you mix up the training result of lets say 6 images that are all completely different rather than doing one at a time, i would expect the result to be completely different
isnt it?
no, it makes it better
if you train 6 different images at the same time, each of them will have a very different gradient
averaging them is good, cause it makes the gradient more stable
in the end you want the model to train on ALL concepts, thus, you do not want it to overfit on a single image
it is only good if you train for one single concept
but if you would want to train a frog and a cat in one concept
i dont see how that would be beneficial
unless you want a catfrog as result
no?
the opposite is the case
if you train ONLY on a cat, then this will override the models ability to do something else. Like it will forget how to draw anything except a cat
if you train a model hundred thousands of steps only on cat images, it will then only be able to generate cat images
the same happens with frog images
so training on 100,000 cat images basically destroys the model.
but lets say you train on 100,000 cat and 100,000 frog images. Now, order is important. If you train first on the cat images the model is already destroyed
but if you train on one cat then a frog then a cat and so on, then the model will never forget what a frog and what a cat is, because you "remember it every step"
that's also the reason why we use regularizer images when training
this example does not directly have to do with batch size. I just want to demonstrate why it is a good thing to have variety during training
batch size increases variety
what you are saying makes a lot of sense. but i still think you need to compensate the bigger batch size with lower learning rate and higher number of runs if you are mixing concepts
yes, higher batch size means you need more epochs
learning rate, however, can rather be increased
ah yes because you are learning a mix
which is less critical than learning too much of 1 image
makes sense
I mean, in general you could think of batch size as a purely performance thing. Increasing batch size by 10 times means you can increase learning rate by 10 times. Training is faster because it can be parallelized more easily
but in reality, batch size also stabilizes training. Too high a batch size makes the gradient too stable, too low a batch size makes the gradient too unstable. Somewhere in the middle is the sweet spot
what batch size you recommend for "normal ppl"? you think 6 is too high
but in deep learning on images, memory requirements are so insanely high that we never even reach this sweet spot. We can only use very low batch sizes. So I would say, take the batch size that fits in your memory
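The "10 times the batch size, 10 times the learning rate" rule of thumb from the discussion above, written out as a tiny helper. This is only the linear scaling heuristic as stated in the chat, not a guarantee; the function name and the example values are illustrative.

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: learning rate grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size

# Example: a learning rate tuned at batch size 1, moved to batch size 10.
print(scaled_learning_rate(1e-6, 1, 10))
```

In practice it is probably safer to treat the result as a starting point and check sample outputs, since the stabilizing effect of the batch size is not captured by this formula.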
as for the regularization images: maybe you can help me with that. if i want to finetune a concept for lets say a swimming cat, can i still use a regular cat as regularization?
or do i need to have regularization images of swimming cats
and then if i want to train for a swimming cat in black fur. do i need cats with black fur as regularization or can i still use the "regular cat"
I mean, you can do all these things. If all your images show swimming cats then also showing regular cats during training is a good idea to prevent overfitting
but concepts like "swimming cat" can usually be trained purely on textual inversion
or just by finetuning the text encoder
then you don't need regularization images at all I would say
i am getting quite good results for most concepts, but i do see quite some overfitting for other concepts, especially the ones that have very few images
Any good guides on this
If I train a textual inversion embedding on a new concept, is it possible to train a better LORA using that specific embedding? Can custom embeddings in an interface like automatic1111 affect the training process or are they ignored?
an embedding isn't part of a model, it's a matrix that you feed a model, so no they wouldn't have an effect on training a model unless you used images containing that embedding
when generating regularization images, do i want a wide distribution of sampling steps?
same question for CFG scale
Thanks for the clarification. One more question? So hypothetically if the images contained the custom embedding, and the embedding is loaded into the embeddings folder (within Automatic1111 or comparable webui), would it append the custom vector representation to the corpus of tokens when training a LoRA for an additional boost to fine tuning? Or would there need to be missing functionality added on to achieve this in the form of an extension?
It would treat it like any other image you used to train LoRA.
Meaning the token would get preprocessed as one one of the input vectors or dropped while retaining only the tokens that are part of the base model?
There'd be no token. If you're feeding it images created with an embedding it will treat it like any other image and have no knowledge of the token. At that point it's just pixels. Like the token isn't something the model knows so there's no association between what you feed it and the token.
You could specify that token if you wanted the model to associate those images with that token, but it would be an approximation of the original.
I downloaded instaloader using these 3
-m pip install instaloader
pip3 install instaloader
pip install instaloader
But when I run a script that has
import instaloader
it returns with "ModuleNotFoundError: No module named 'instaloader'" Why is this and how to fix?
Did you do a runtime restart after the pip install?
"python -m pip install instaloader" I did this today this morning and it fixed it
@weary locust In principle you can use embeddings for LORA and it totally makes sense to do so. Whether the input embedding is used within the LORA training depends on the implementation you use - would have to check the source code.
note that every good implementation of LORA or dreambooth is doing textual inversion as first step anyways. So before even starting to train a LORA, a textual inversion is usually trained first
Can you please share the link
actually this only fixed it on the laptop but problem remains on pc
How would I do this?
import os
os.kill(os.getpid(), 9)
Input something like that into the script I want to run?
anyone can recommend an interrogator like CLIP that works standalone with scripts
rather than a python module
DeepDanbooru works ok but the tags are sometimes really far off
Hey! I have a total noob question. I have a bunch of custom shapes - I want to have a finetuned SD model that will understand the design style of the shapes and generate a custom shape for any scene I describe - for example : if i give a prompt - chilling at a beach - the model should maybe output a custom shape created out of a picture of sand or a custom shape created out of a picture of sea etc.
can anyone please guide me on how I can create a model like this ?
is there a script that uses captions instead of just a single instance prompt?
i can load BLIP and interrogate, as i've got 80G
No, restarting a kernel is done when you use ipynb (either on local system or colab).
When you run a script that cant happen.
Is it that you are using the library in a virtual env?? but installed the library outside of that env??
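A small diagnostic sketch for the environment question above: it prints which interpreter is actually running the script and whether that interpreter can see the module. The helper name `describe_module` is made up for illustration; `sys.executable` and `importlib.util.find_spec` are standard library.

```python
import importlib.util
import sys

def describe_module(name):
    """Report which interpreter is running and whether it can see `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }

info = describe_module("instaloader")
print(info)
if not info["found"]:
    # Install with the same interpreter that runs the script.
    print(f'try: "{info["interpreter"]}" -m pip install instaloader')
```

If the printed interpreter path differs from the one `pip` installs into, that would explain why the import works on one machine and not on another.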
Hello: Lost my upscalers recently could not find upscaler named R-ESRGAN 4x+, using None as a fallback
Look up how to create a style LoRA.
no im not using virtual environment but ill try that when I get home.
Are you guys using any tools for text file batch editing and management?
Etc now after scraping instagram I want to batch edit all text files remove certain parts of the text, replace hashtags with commas etc..
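One way to do this kind of batch edit is a short script. The sketch below assumes the captions sit in `.txt` files in one folder and that "replace hashtags with commas" means turning `#tag` into a comma-separated tag; the function names are illustrative, and it is worth running it on a copy of the folder first because it rewrites files in place.

```python
import re
from pathlib import Path

def clean_caption(text):
    """Turn '#tag #other' style hashtags into comma-separated tags and tidy whitespace."""
    text = re.sub(r"\s*#(\w+)", r", \1", text)   # '#sunset' -> ', sunset'
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"(\s*,\s*)+", ", ", text)     # merge repeated commas
    return text.strip(" ,")

def clean_caption_folder(folder):
    """Rewrite every .txt file in `folder` in place; returns the number of files touched."""
    count = 0
    for path in sorted(Path(folder).glob("*.txt")):
        cleaned = clean_caption(path.read_text(encoding="utf-8"))
        path.write_text(cleaned, encoding="utf-8")
        count += 1
    return count

print(clean_caption("Golden hour at the coast #sunset #beach  #nofilter"))
# Golden hour at the coast, sunset, beach, nofilter
```

Removing other unwanted parts of the text would be one more `re.sub` line inside `clean_caption`.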
hey, my LoRA's are all a bit too strong and look baked on 1 strength, but usually work perfectly on like 0.6 to 0.8. Are there parameters I can adjust to make them a bit weaker to work on 1.0? (I crop them, pre process them with booru tags, remove tags i want my model to be associated with, add a triggerword to every file and then train it with usually 100 iterations each)
U can make it output multiple versions on different points of training. Then do a test gen with each model and select the one that is not overtrained
yes i am
does 1 mean its only one model at the end cuz thats default?
these are my usual settings
ah yea then it will be only one model. Maybe its possible to set the save every N to a number below 1?
well it lets me input 0.1, lemme run it and see
while i wait, will my models improve if i use lycoris instead of lora on the same settings?
i shouldn't have picked one with 2.5k steps
ok 0.1 also only gave me 1 model
hm ok
trying 10 now
Somebody having problems with kohya dreambooth google colab notebook?
seems not working anymore
did you use it a lot recently?
i dont use colab but ive read that it starts to lock you out for a certain time if you use it too much, and that time gets longer the more you use it
did you try a different colab?
Hey dont know if this the right channel to ask but do you usually prefer to use batch count or use seed variation strength ?
Does anyone have any advice in terms of settings? I am honestly looking to fine tune it a bit as I feel like things are a bit too deformed sometimes
I can give no good assistance on this but I have a question regarding these elements. If you assign a seed number to a batch of images, why do they still generate differently? Aren’t they sharing the same seed data?
seed is just for initializing the random generator
noise in each image is still different, same way as not every pixel in the noise image is the same
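A toy illustration of the seed explanation above, using Python's `random` module as a stand-in for the real noise generator. Actual web UIs may derive per-image seeds differently; this only shows why one seed still gives different noise per image while the whole batch stays reproducible.

```python
import random

def noise_batch(seed, batch_size, pixels=4):
    """Draw `batch_size` noise 'images' one after another from a single seeded generator."""
    rng = random.Random(seed)
    return [
        [round(rng.gauss(0.0, 1.0), 3) for _ in range(pixels)]
        for _ in range(batch_size)
    ]

first_run = noise_batch(seed=1234, batch_size=3)
second_run = noise_batch(seed=1234, batch_size=3)
print(first_run[0] != first_run[1])  # images within a batch differ
print(first_run == second_run)       # the whole batch is reproducible
```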
for dreambooth training, what batch size can i use with a 4090
New to Training, in Dreambooth do I need a text file corresponding to each image?
Or is there a better training tool?
only if you're using LoRA
every training tool has its quirks
so better is hard to quantify
Can I use Dreambooth to create a model from scratch? If so do I need the txt files for that?
For context my first attempt used a source model, its hard to tell the difference in the end results
like with no base model? i'd imagine that's something you could only do if you had a server farm at your disposal
the base models are trained on absolutely massive data sets
So if I understand correctly, when I use a base Model I'm essentially creating a LoRA?
Wow, I've got a lot of learning to do 😲
hahaha, oh yeah
there's a masssssssssssssiiiiiiiiivveee learning curve
you can get started relatively quickly
like took me about two weeks of tinkering to get decent results
but the ceiling of what you can learn is really high and always going up
Unfortunately I started on MidJourney, now I am trying to get the same quality of results
midjourney is definitely a more user friendly product, but less versatile/customizable
and $$$
hahah that too
Plus the NSFW filters, I type in something harmless and end up having to appeal it 🙄
Okay, Ill keep reading, thank you!
is there any way to find out what is causing my model to overfit?
what is happening is that my model is learning strange patterns and applying them
is there any mechanism to avoid those patterns?
i wonder if you could use those weird patterns to create a negative embedding
or a LoRA and then add a negative weight to it
as for regularisation images: what happens if they are from a completely different base model like dreamshaper instead of 1.5? does it matter? also how close do regularisation images have to fit the actual concept i want to train? how far can they be off?
how would i filter that pattern and create a negative embedding? also i would prefer to avoid them in the first place
pretty sure you want to use regularization images generated by the model you're training on
what happens if i dont? 1.5 is not even getting close to the concepts i want to train on so its gonna be hard to create regularisation images
you'd take a bunch of images with that pattern and create an embedding then add that embedding as a negative prompt
it's definitely a kludgy solution though
im using kohya ss which is running images multiple times based on weights i apply
and i guess that i just run certain images too often, but the pattern is so unclear that its hard for me to identify which images are causing them
yeah i'm honestly not sure how you'd approach that
what exactly does dreamstudio do with the regularisation images?
is it like a filter or is it like a correction?
dreamstudio or dreambooth?
dreambooth sorry
regularization images are there to keep your training data from having too much of an effect on the base classes. like if you're trying to teach your model a specific instance of a car, regularization will ensure that all cars don't start looking like your car
do you guys use classification image negative prompts for dreambooth training?
anybody used regularization images from different models before?
anyone know how to avoid strange patterns when finetuning? its like mixing multiple concepts so that the result is neither one nor the other
also is there any interpretation of loss?
is there a setting for kohya ss lora training to give me a model after every n% of being done so that i can pick the one that is not overtrained?
i think you can set that in advanced options
i know there is for dream and as far as i saw all the settings are identical so there must be
hi guys. what is the point in reducing the Learning Rate in the later stage of the training? LR Scheduler cosine for example? i see that its pretty standard but i dont understand what is the benefit
does it like allow to run longer and pick up more details or so?
or can i just use constant with warmup and train until the model performs best
I've heard you want to use best possible images so doesnt matter from where they are
Had similar doubt... Anyone has some idea on how will real class images work(not generated from SD but rather actual real pics manually used)??
actually, I would use real class images if available
in particular for people, as they don't have strange artefacts and deformed hands
I would say its less important that the regularization images come from SD itself. Its more important that they have high diversity and you use every reg image only once or few times
I see...
U got any ideas on how to NOT let the model overfit on the training data? (with regularization images made from SD itself by the usual way)
@stiff dust do you know how well the regularization image have to fit the concept i want to train? or can they be very general?
I think the idea is to give them almost the same caption as your training images. However, the best is probably to use random LAION subsets - so they can also be very general
anyone know if kohya ss is bucketing regularisation images?
kaibioinfo do you have any comment on my question regarding learning rate?
so it appears that lower learning rate (5e-7) in combination with constant scheduler (with warmup) does much better than 1e-6 with cosine
may i ask what settings you guys are using?
in automatic1111's web ui does anyone know how to make batch with masks do "only mask" instead of "whole image"? is there an argument I could use to make only mask the default?
I guess I found a way because text2mask uses the inpaint settings instead of the batch settings when masking and generating from batch. I would like to use the masks generated with the batch though so I could use the xyz script instead of the txt2mask.
Idk I guess the batch is using inpaint settings too. I just tested with a generated mask. Idk, the last time I tried it would always use whole image regardless of the setting in the inpaint panel.
maybe it's because this time I selected it in inpaint upload instead of regular inpaint? Is that correct???
how many regularisation images should i provide per concept?
i read online it shall be 200 per training image is that correct?
When you use "only mask", what do you have the padding set at?
default, I've checked, inpaint upload settings are used for batch stuff and txt2mask.
I've run into a new issue now though. I can't use only mask with batch because when I do it never changes the mask to the next one along with the corresponding images, I think the masks work normal in batch when it is generated based on whole image but when I use only mask it keeps the first mask and goes through all the different images with the same first mask. They are all named the same as their corresponding images.
Hello everyone, after training on 1.5 for a while I’m trying 2.1. Does anybody know if there are specific parameters to take into account that would be different from 1.5?
ideally one per epoch
one per epoch per training image
but if you have less its not that bad
ok so if i run 100 epochs i shall have 100 x number of images in concept x number of concepts right
so 70k images xDD
700k sorry xD
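The "one per epoch per training image" rule from the exchange above, as arithmetic. The dataset split below (700 images per concept, 10 concepts) is a hypothetical breakdown chosen only so the total matches the 700k figure; the chat does not give the real numbers.

```python
def regularization_images_needed(epochs, images_per_concept, num_concepts):
    """'One per epoch per training image': one fresh reg image for every training-image pass."""
    return epochs * images_per_concept * num_concepts

# Hypothetical dataset: 100 epochs, 700 images per concept, 10 concepts.
print(regularization_images_needed(100, 700, 10))  # 700000
```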
Inpaint/Sketch: I've noticed the masked area doesn't reset when I add a new image. The paint color is removed but the mask is still there. Anyone else have this issue or a different inpaint/sketch tool?
Also after a few uses during the session it begins to alter areas never masked
2.1 trains faster than 1.5, so probably a lower learning rate, train the text encoder completely just to improve that, and save the model more often for testing when its done to minimize the risk of overtraining and having to restart.
has anyone tried altering the sort of the images pulled in by the dreambooth script, to sort by atime in reverse order eg. the oldest images / least-touched images go first?
I use random shuffle the entire data set every epoch in ED2
main gain may just be putting different images together in batches every epoch more than the order matters
the original dreambooth repos based on xavier's repo only have batch size of 1 or 2 maybe on 24gb so it may be a bit moot
if you have so many images and concepts you probably don't need regularization images.
^ this
i'm using about 122,000 images right now and it's just... amazing. they're well-tagged and varied
What are regularization images?
they are also called 'class images'
two techniques, dreambooth uses SD generated images mixed in with training, or if fine tuning you can mix in your own, you can use multiply.txt in ED2 to load those images less frequently compared to training images
the purpose either way is to avoid overfitting to your training data, i.e. "remind the model what it already knows" so it doesn't forget, in very hand-wavy terms
So multiply.txt in ED2 would use our selected images in between some random images to train it better so that the model is not just limited to the trained images?
Its the same in DB?
dreambooth has a particular way it pairs the training images and regularization images, it pairs them up every step
ED2 does not, its random shuffle of all the data, no distinction is made inside the software on what "regularization" even means
Oh if the batch size is 10, then it uses 5 desired images and 5 regular images?
in dreambooth, batch size 1 would mean 1 training image and 1 regularization images are in 1 batch/step
Oh ok makes sense
Thats the only diff b/w ED2 and DB?
in ED2 its random selection, everything is just shuffled together, you can just sort of simulate the dreambooth thing though
the typical setup for DB has a lot of 'repeats' on your training data and fewer repeats on the class data
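A rough sketch of the two batching styles described above: Dreambooth-style pairing of training and regularization images in every step, versus shuffling everything together. The function names and the string placeholders are illustrative; neither tool's real data loader is shown here.

```python
import random

def dreambooth_style_batches(train_items, reg_items, batch_size, rng):
    """Each step pairs `batch_size` training images with `batch_size` regularization images."""
    train = rng.sample(train_items, len(train_items))
    reg = rng.sample(reg_items, len(reg_items))
    steps = min(len(train), len(reg)) // batch_size
    return [
        train[i * batch_size:(i + 1) * batch_size] + reg[i * batch_size:(i + 1) * batch_size]
        for i in range(steps)
    ]

def shuffled_batches(all_items, batch_size, rng):
    """Everything is shuffled together; no distinction between training and regularization."""
    items = rng.sample(all_items, len(all_items))
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

rng = random.Random(0)
train = [f"train_{i}" for i in range(4)]
reg = [f"reg_{i}" for i in range(4)]
paired = dreambooth_style_batches(train, reg, 1, rng)
mixed = shuffled_batches(train + reg, 2, rng)
for batch in paired:
    print(batch)
```

In the shuffled version a batch can end up with only training images or only regularization images, which is the difference being pointed out.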
Where are regularization images extracted from?
ED2 was built for general fine tuning, dreambooth is a specific technique, so ED2 is more general training and fine tuning
What is class data?
they're generated by the checkpoint you're training from.
in dreambooth the typical technique is regularization images are generated from SD itself, they are inference outputs
Oh ok
they usually look like total garbage but it depends on the checkpoint
So they use the base SD's images as their input for regularization images
I'm not a fan myself, when you can source Laion or ffhq or coco and use those instead if you even need regularization at all
that's highly likely to burn the model if you don't use reggies from the checkpoint, btw
at least this is the case for 2.1
it possibly isn't for 1.5. i haven't checked
not much happens if you just train for one character
In my case, I want to train on Landscape photography, so which one do you suggest?
ah see i've been trying to do generalised fine-tunes and so class data tends to harm my results
partial freezing of text encoder and using a separate lower LR seem to help with training sd2.1 a lot
whats the diff b/w training and fine-tuning?
i just use polynomial learning rate and a high warmup run
training and fine-tuning are the same but generally when a distinction IS made, training is understood to be from scratch and fine-tuning is providing specific concepts to a pretrained model to bring those weights up and make it more likely that type of output is produced.
in this context nothing, fine tuning usually means training after an initial training session though, which since we're starting with the supplied checkpoints from SAI, etc then its all fine tuning
What's SAI?
StabilityAI
stability ai
jinx
oh ok, so as we are training after SD's initial data, everything is considered as fine-tuning?
Ok
there's a group that re-trained SD 1.5 on more than 2.9 million images with thoroughly tagged captions and i personally have trouble declaring that 'fine-tuning' considering the extent of catastrophic loss from the original SD 1.5 model but that IS fine-tuning.
yeah fine tuning is a pretty generic term, I would say dreambooth is a specific technique inside fine tuning for example
yeah. dreambooth is a subset
"fine tuning" doesn't mean you didn't do a bad job though lol
for sure, yes
and LoRAs generally are a world of their own, with a lot of similarity to Dreamboothing but different training data setup, different learning rate, different impact on each "delta from zero" for each hyperparameter you change
you can do some pretty amazing things with just a few thousand images though, train entire fictional worlds worth of characters and scenery and stuff, people underestimate how much "room" is in the model to learn
well the model has a lot of garbage connections
Can you explain with an example?
example of dreambooth or of fine tuning?
Well, I would prefer both
@warm agate a general fine tune will have thousands and thousands of ideally, well-captioned data. this results in a "generalization" of your improvements across all of the tags you had in your training data. this can be MONUMENTAL.
dreambooth is trying to insert a single subject into a model so it can be referenced by a single keyword. in other words, add yourself into your favourite model so you can become a subject in its fantasies.
"fine tuning" is training an already trained/started model with labeled data (i.e. captioned images), that's the most generic version
dreambooth does the same thing, but generally the labels are a fixed word like "xyzbob" or "xyzbob person" and regularization images are also mixed in with just some generic label like "person"
the point of either is to make the model learn something it doesn't know, can be anything that relates text to a 2D image really, like styles, camera angles, characters, etc
so isn't dreambooth like LORA?
Does fine-tune also extract all elements of images and return an image with an element from each image image into a single image?
LORA is its own thing, its a trick to try to make training more efficient by training and patching a much smaller submodule, but it isn't actually updating the core model weights at all
So finetune allows SD to better compile the available resources better?
yeah, expanding on that last point, you can use Dreambooth to "fix" the model's understanding of a "concept". example: SD 2.1 cannot make aliens.
solution: provide Dreambooth,
- the instance prompt "aliens" and class prompt "person"
- about 500-3000 training images of different aliens
- about 15,000 class images
- use a VERY low learning rate, and a LARGE number of steps
and that will overload the 'aliens' keyword with your concept from the training data, usually replacing the astronaut it places under 'alien' by default.
What are class images?
I think fine tuning scales better, and using all real images produces better results as well
here
class images = regularization data
you have a 'subject' and a 'class' in Dreambooth training. if your subject is Lara Croft, your class is woman
you pick your class, you could use "person" too but yeah, the idea is the class is some sort of super-class of your trained thing
if you're trying to improve the anatomy of humans, providing 'hands' as your subject, your class would be 'human anatomy'
and good luck with that
if you are training your pet dog Chewy, your class would probably be "dog" etc
For example, I input 100 images of 'Forests', 100 images of 'beaches', 100 images of 'sunset' and 100 images of 'camels'.
So with an prompt like An aerial view of a beach during sunset with a dense forest located near the beach, camels approaching the beach through the forest
would Finetuning be better at these kinds of tasks?
@hot breach @surreal lagoon
yeah you'd want to provide all of that in a single dataset and put what you would prompt for each image as its caption
there's many different training tools and so i can't really provide guidance on how you'd use those captions for training, but i name my files by their prompts, with _ in place of spaces and then in my training code, i replace those with spaces and do a bit more cleanup on it
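A minimal sketch of the filename-as-caption idea described above: underscores become spaces, with a little whitespace cleanup. The exact cleanup the poster does is not shown in the chat, so the function below is an assumption about what "a bit more cleanup" might look like.

```python
import re
from pathlib import Path

def caption_from_filename(path):
    """Use the file stem as the caption: underscores become spaces, extra whitespace is trimmed."""
    stem = Path(path).stem
    caption = stem.replace("_", " ")
    return re.sub(r"\s+", " ", caption).strip()

print(caption_from_filename("dataset/aerial_view_of_a_beach_during_sunset.jpg"))
# aerial view of a beach during sunset
```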
dreambooth works fine for training in one person with like 10-30 images, but if you want to train 8 characters and a bunch of screenshots from your favorite TV series all at once, or reform the entire model to be some special style, I don't think dreambooth technique is very helpful
nah if you do it right you can train a movie into a single keyword and avoid style bleed that you'd see with general fine-tune. it all depends what you're after.
i did this with The Hobbit, and "lotr style" would make everything look very, uh, Peter Jacksony
but that's when i realised The LOTR movies actually have a terrible style to them
🤣
i thought the training just didn't work but i went back and looked at the movie and was like holy cow, it really does look like that
maybe i can pre-process them with img2img to make them brighter and more vibrant but the movie is dull and grainy and even just straight-up blurry and it makes all of the images you apply the LOTR Style keyword to, appear "decayed"
city skyline = vibrant, colourful, alive
city skyline lotr style = Aleppo, Syria
i want to try A Scanner Darkly next
If we train a model with a dataset of 1 million or maybe more as it's easy to get the dataset images of humans, will the faces become way better?
yes and you don't need millions. I think most models out there are trained on very few images. You see this, because they tend to generate the same faces over and over
so more is better, but a thousand is probably more than enough
also there is a limitation. The problem SD has with generating faces of people farther away in the picture (e.g. full body shots) is a limitation of the model rendering in low resolution. You probably won't be able to fix that (except if you train on very high resolution, which takes an insane amount of time and memory)
I am looking for closeups mostly, but I don't think it's hard to find full body shots. I think the number of selfies are far more than full body images.
Why only high res for full body?
Which method do you suggest to train numerous faces?
because SD computes in 8 times lower resolution. So if you compute a 512x512 picture the internal latent resolution is 64x64. Now if the face in the original image was 64x64 pixels in size, its internal size is 8x8. This is too few pixels to get the details right
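The resolution arithmetic above, spelled out. The factor of 8 is taken from the message itself; the helper name is illustrative.

```python
DOWNSCALE = 8  # pixel-to-latent factor stated in the discussion above

def latent_size(pixels, downscale=DOWNSCALE):
    """Side length in latent space for a given side length in pixels."""
    return pixels // downscale

print(latent_size(512))  # 64  (whole image: 512x512 -> 64x64)
print(latent_size(64))   # 8   (a 64x64 pixel face -> 8x8)
```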
Oh, so we have to try like 4096x4096 images?
But I don't think we can get such high res images
no, you can't. Use upscaling and inpaint, or tiled diffusion. There are many techniques to get higher details and fix artefacts in images
So upscale all the dataset images to 4096x4096 and then train?
no, train normally on the native resolution or close to native
when you generate images you can use upscaling to make images larger, then img2img to add details.
Oh ok, so inpainting will use the miniature faces using the model?
weird discovery, I dreamboothed my favorite model and the whole model ended up looking even better....
even when prompting stuff that was not dreamboothed. I think it's cuz of my reg images?
I would say that's normal if the images you use are aesthetically better than the random stuff the model generates otherwise
hi guys, how many times shall i run regularisation images? my training images run like 30 times per epoch per image
is 1 run gonna be too little?
I use 1500 total reggies and split them up by my instance img ct
so if I have 100 instance images, that's 15 reggies a pop
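The split described above is just an integer division. A minimal sketch, assuming a fixed reg budget that is divided evenly (the function name is hypothetical):

```python
# Sketch only: hypothetical helper illustrating the "1500 reggies split by
# instance image count" rule of thumb from the message above.
def reg_images_per_instance(total_reg: int, num_instance: int) -> int:
    """Evenly split a fixed budget of regularization images per instance image."""
    if num_instance <= 0:
        raise ValueError("need at least one instance image")
    return total_reg // num_instance


print(reg_images_per_instance(1500, 100))
```

With 1500 reg images and 100 instance images this gives 15 per instance image, matching the numbers above.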
usually 1 time is enough 😳
ideally, you do not train more than once or a few times on the same regularization image
Has anyone had great success with blip2 or clip captioning? I am trying to find a project preferably with a guide to run either of these for human pictures.
i did The Hobbit movie and everything looked broken, old, rotting, decayed 🤣
didn't expect that. i was careful with the input data
my current training is generating about 224,000 regularization images 😩
about 140,000 remain 
god bless the A6000
What does sub 60 images mean?
I use the 6.7b version of blip2 and the results are as mediocre as the blip1 version.
What do you use instead?
I use blip2. Don't know any better that is as practical to use
Oh, so it's the best but not very good.
I'm having trouble finding a blip2 project that works on runpod
sounds like the text encoder is overfitting on the data
Use captionr
Can you give me more information?
Is it possible to run blip2 on something like vast.ai or google colab? I don't have the pc to run blip2 locally
I made a LoRA with Dreambooth but every image it generates is miscolored or has a blue tint to everything. Any ideas?
I observed such artefacts when training the unet with too low a rank
the default for LoRA is 4, which is fine for the text encoder but too low for the unet
use a larger rank (e.g. 16 or higher)
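For a rough feel of what the rank changes: LoRA adds two small matrices per adapted layer, so the extra trainable parameters grow linearly with rank. A sketch with a made-up 320x320 layer (the size is an assumption for illustration, not a claim about any specific unet layer):

```python
# Sketch only: counts the extra trainable parameters LoRA adds to one linear
# layer. The 320x320 layer size below is a hypothetical example.
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to the frozen weight."""
    return rank * (d_in + d_out)


d_in = d_out = 320
full = d_in * d_out  # parameters in the original weight matrix
for rank in (4, 16):
    extra = lora_extra_params(d_in, d_out, rank)
    print(f"rank {rank}: {extra} trainable params ({extra / full:.1%} of the full layer)")
```

This only shows how much capacity each rank gives the adapter; it doesn't by itself explain the color tint.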
Damn baller
Did you try captionr?
it has generated 14,000 reg images so far
141,212 to go
No, if I'm not mistaken captionr = blip2
You asked about blip2
https://laion.ai/blog/large-openclip/
@serene flicker LAION says that bfloat16 helps to train the text encoder better
Oh
mixed-precision training
as long as the dataset is not too huge, I prefer training on float32 to avoid these issues altogether
100% i agree
what qualifies as too large in your eyes? i am using a6000, 4090, and a100 80G cards for training
i assume each has a different threshold
Is it not the same thing in essence?
I train on my own consumer gpu which is a 3090ti with 24gb. If training finishes over night or during a work day, it doesn't matter so much if it takes 5 hours or 10 hours. So I prefer training with 32bit instead of training over night with 16bit and then seeing that something didn't work
for textual inversion I use 16bit, though, as it is really slow otherwise
lora training, in contrast, is often surprisingly fast even with 32bit
yeah i saw that when helping Sytan figure his overly baked output out
Hey guys what are you using to train dreambooth? I used StableTuner because i could use the shuffle after epoch on windows but the install is broken. Does anyone have a better alternative? I have a 3090 and a 4090 and have had quite a decent experience training models from 20-50k images. I'm really looking for dreambooth not fine tuning as EDT is good enough for that
you can simulate dreambooth in everydream2, the bonus is it is actually maintained
Thank you
has anyone tried to keep a translation list of common terms and swap the words out randomly when training english datasets so that the encoder is introduced to new languages? eg. say you have a varied dataset of landscapes, subjects, objects, and you want cats to also be gatto, you could change out cat for gatto randomly when you encounter it
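A minimal sketch of that idea, applied to captions before they go to the trainer. Everything here (the `TRANSLATIONS` table, `swap_terms`, the 50% default) is made up for illustration and not taken from any training script:

```python
import random
import re

# Hypothetical translation table; keys must be lowercase.
TRANSLATIONS = {"cat": "gatto", "dog": "cane"}


def swap_terms(caption, translations, prob=0.5, rng=None):
    """Randomly replace whole-word matches of known terms with their translation."""
    if not translations:
        return caption
    rng = rng or random
    # \b keeps this to whole words, so "cat" matches but "cats" or "category" do not.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, translations)) + r")\b", re.IGNORECASE
    )

    def replace(match):
        word = match.group(0)
        # Each occurrence is swapped independently with probability `prob`.
        return translations[word.lower()] if rng.random() < prob else word

    return pattern.sub(replace, caption)


print(swap_terms("a cat sleeping next to a dog", TRANSLATIONS, prob=0.5))
```

You would run this each time a caption is loaded, so the same image sees both the English and the translated word over the course of training. Plurals and inflected forms would need their own entries.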
wouldn't it be easier to make a new token with the same embedding as e.g. cats?
@stiff dust is ED2 better than Dreambooth for landscape photography?
What does batch size mean in training?
it's all just different scripts implementing fine-tuning
but I would say ED2 is most sophisticated
batch size should be set as high as possible without getting out-of-memory errors
Which one do you suggest for Landscape photography training?