#🔧|finetune
usually you can't use higher values than, say, 4
How to check the highest possible, just through trial and error?
yes
just use ED2
don't overthink it. In the end, every training dataset behaves a bit differently anyway
ok
@stiff dust How to check VRAM usage?
Can we upload images of different resolutions?
well on my multi a100 system i can really only tolerate bs 7 and 64 gradient accumulations
it gets too slow and training takes 10x longer
I don't believe in gradient accumulations. Why should it help?
i was just trying different things to get a more gradual and cohesive training session
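A toy illustration of what gradient accumulation computes, assuming a mean-reduced loss and equal-sized micro-batches: averaging the micro-batch gradients gives the same gradient as one large batch, so the optimizer sees a bigger effective batch without the extra VRAM. The one-parameter model and all names here are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equal-sized micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    grads = [
        grad_mse(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in starts
    ]
    return sum(grads) / len(grads)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)

# Batch size 7 with 64 accumulation steps, as mentioned above,
# means 448 samples per optimizer step on each device.
effective_batch = 7 * 64
```

The cost is wall-clock time: each optimizer step now needs 64 forward and backward passes.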
i have been observing pretty great results now that i've frozen the text encoder by decaying its learning rate to zero as the unet learning rate rises
4k steps to train the text encoder and then another 6.5k are nearly done now, with just the unet
do you think i'll have good luck if i do one epoch of final training with the TE unfrozen after the unet training completes?
Hi everyone, I was just wondering what is the most recommended way to upscale images 4x while retaining realistic texture on the skin and face, so the result stays crisp and not too smooth. Thank you 😇
I had the opposite: freezing unet (or train it with very low learning rate) improved image quality
odd
i do use a super low learning rate. it starts at 1e-10 and rises to 9e-8
my loss averages 0.244 with my current training method, likely due to ADAM 8bit optimizer spikes, but this is so much lower than the .546 avg i could get with the 'official guidance' for fine-tuning 2.1
i was wondering how i was breaking everything for a few days before i realised that almost not training it at all on each iteration has the best results
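A minimal sketch of the schedule described above, assuming a simple linear crossover: the text encoder learning rate decays to zero (which effectively freezes it) while the unet learning rate rises from 1e-10 to 9e-8. The linear shape, the function names and the text encoder peak value are assumptions, not the poster's exact settings.

```python
def text_encoder_lr(step, peak_lr, decay_steps):
    """Linearly decay from peak_lr to 0 over decay_steps, then stay at 0."""
    if step >= decay_steps:
        return 0.0
    return peak_lr * (1.0 - step / decay_steps)


def unet_lr(step, start_lr, end_lr, warmup_steps):
    """Linearly rise from start_lr to end_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return end_lr
    return start_lr + (end_lr - start_lr) * step / warmup_steps


# 4000 text encoder steps as in the chat; the 1e-8 peak is a made-up value.
for step in (0, 2000, 4000, 10500):
    te = text_encoder_lr(step, peak_lr=1e-8, decay_steps=4000)
    un = unet_lr(step, start_lr=1e-10, end_lr=9e-8, warmup_steps=4000)
    print(f"step {step}: text encoder lr {te:.2e}, unet lr {un:.2e}")
```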
freezing the text encoder fixed the 'memory loss' problem i was having where for instance it would forget how to make a gecko since geckos weren't in my training data and i couldn't figure out a way to generate reg images that would represent them and also everything else i wanted to preserve
example: the term leopard gecko started out lizard-like, pretty close to a real one. but then it starts to add whiskers and fur and then an actual leopard's head on a gecko body before finally it's just a leopard. the next checkpoint, the leopard was gone, and it was a "leopard tank", eg. a vehicle used in war, with little soldiers standing around, and smoke in the background
my model i published was from about 2,000 steps before this loss became so noticeable in my test matrix of prompts. but for stuff outside my test matrix, it's apparent there was still some loss. it's just an acceptable amount, and things i generally didn't care to preserve eg. celebrities
from another conversation, qwerty only has 74 images and was trying to set batch size to 35, this will likely cause issues, wouldn't suggest setting batch size more than like 10% of your total image count
high batch size is even good if you have only a single image
as you sample the image at different noise time steps
the issue becomes with runt batches and aspect bucketing
if you have 30 images and a batch size of 25, you end up with two steps, one of 25 images and another 5, these remainders become an issue
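The remainder problem is easy to check before training. A small sketch (the helper name is hypothetical) that lists the batch sizes one pass over the data would produce:

```python
def batch_sizes(num_images, batch_size, drop_last=False):
    """Return the size of every batch in one pass over the dataset."""
    full, remainder = divmod(num_images, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # the "runt" batch
    return sizes


print(batch_sizes(30, 25))  # the example above
print(batch_sizes(74, 35))  # 74 images at batch size 35
```

With aspect bucketing the same arithmetic applies per bucket, so small buckets produce runt batches even more often.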
so my first try with regularisation images was a complete fail
all my model is producing is regularisation images
and there is basically nothing left from any of my concepts, what did i do wrong?
i have 240 regularisation images per concept that has like 10 images and i run those 10 images like 20-30 times per epoch, so in theory i have around 1 regularisation image per image i run per epoch
any recommendation?
do you try different seeds?
I found that even with super small learning rate the image can swap between different outcomes very easily, but this has nothing to do with the model itself. If you repeat the same prompt with different seeds you might see different outcomes
like one seed produces a gecko, another more of a leopard, and a third one a tank. When training your model, it might switch between these three interpretations of the image, but it would also switch if you just use a different seed
but in general the text encoder is surprisingly the workhorse. Training it mostly determines the outcome of the image, while the unet takes way more time to train and easily overfits on texture instead of the structure or shape
well without regularisation images all works kinda ok except the overfitting
same seed
the gecko never came back at any seed and eventually that model stopped working very well
running text encoder training at a lower learning rate helps the overfitting or subject substantially, you can get away with quite a bit without doing dreambooth style regularization
damian did a bunch of testing and put a PR in so you can even just choose to train the final X layers of the text encoder, which seems to be really good for SD2.x with the newer 24 layer encoder, makes it train more like SD1.x
yes, I also always freeze the first 16 layers of text encoder
also I train Lora on low rank for the textencoder
PR for what?
i'm just using the train_dreambooth.py script with slight modifications. can those improvements be backported to that script?
Thank you! Appreciate the info 👍🏻👍🏻
@finite creek see above about freezing the layers of 2.1's text encoder to better train it without catastrophic loss in lower layers
my understanding is that when OpenCLIP was trained, LAION went stage by stage gradually freezing subsequent layers of the text encoder in order to preserve the foundational features it learnt
so we kind of have to do the same thing to avoid disrupting the connections and structure of those layers too
Hey there. So, I've got an issue. The issue is, that doesn't matter how I train the lora, it does everything pretty decently, BUT the eyes of the character... Doesn't matter how much I try. What could I do?
(my dataset has 34 images both close-up of the character's face, body and eyes too, unet is 1e-4 and text LR is 5e-5)
Thanks a lot! Very interesting. How do you freeze the layers? I see through your previous posts that you train the text encoder and Unet separately. Could you tell me how?
certainly. i'm using the normal train_dreambooth.py script, modified to use the filename of an image as its prompt (with some added cleanup etc)
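A rough sketch of turning a filename into a prompt. The poster's actual cleanup rules are not shown, so the rules below (drop the extension, turn underscores and hyphens into spaces, collapse whitespace) are assumptions, and the example path is made up.

```python
import re
from pathlib import Path


def filename_to_prompt(path):
    """Derive a caption from an image filename."""
    stem = Path(path).stem                 # drop directory and extension
    text = re.sub(r"[_\-]+", " ", stem)    # underscores/hyphens become spaces
    return re.sub(r"\s+", " ", text).strip()


print(filename_to_prompt("photos/leopard_gecko__on_a_rock.png"))
```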
that script has the option to train (or not) the text encoder. the process goes something like:
- analyze training data, retrieving all keywords and count their frequency
- prepare a subset of training data that contains the least commonly used keywords, as they're least likely to be known by the encoder
- at this time you can also remove any outlier data or look at the most commonly used keywords and either modify or remove them to ensure you're populating the segments you actually want to
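The data preparation steps above can be sketched roughly like this, assuming captions are plain strings and a "keyword" is just a whitespace-separated word. Both function names and the threshold are hypothetical.

```python
from collections import Counter


def keyword_counts(captions):
    """Count in how many captions each keyword appears."""
    counts = Counter()
    for caption in captions:
        counts.update(set(caption.lower().split()))
    return counts


def rarest_subset(captions, max_count):
    """Keep captions with at least one keyword seen at most max_count times."""
    counts = keyword_counts(captions)
    return [
        c for c in captions
        if any(counts[word] <= max_count for word in set(c.lower().split()))
    ]


captions = ["a photo of a gecko", "a photo of a cat", "a photo of a cat sleeping"]
print(keyword_counts(captions).most_common(3))  # most used keywords, to review
print(rarest_subset(captions, max_count=1))     # captions with rare keywords
```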
once your data is ready,
- an initial training run at a supremely low learning rate using polynomial learning rate scheduler on the text encoder and unet, simultaneously, for a certain number of steps. you likely want to make a ckpt every 50-100 steps when you're in the "toy model" phase just to see how your test prompt output is changing.
- i'm not sure whether prior preservation is useful here. if you're doing Dreambooth for a single subject, probably it is mandatory. if you're doing a general fine-tune, it seems to be incredibly harmful.
- select/cherrypick from your checkpoints for the one that has the most pleasing results. this can be quite subjective. it helps to have a wide array of prompts generated from each checkpoint in a way that you can compare them easily. you want to select a checkpoint that didn't change the output much, but be sure to check the results of prompts containing your pre-training keywords, so that you can more easily see the early changes that training is applying.
Honestly if it's changing too much between each ckpt, your learning rate might be too high.
Once you've got the text encoder trained,
- use save_pretrained on the pipeline for that checkpoint in an inference script, to save it as a complete model
- begin another training run, this time on your full subset of data, and your full step count, and no --train-text-encoder option
- this can have a much higher learning rate, but since i had a large number of images to process, i kept it low
- you can use save_pretrained on the checkpoint that is most appealing to you, from these results. I save checkpoint every 1000 steps when training the unet alone, but if your LR is higher than mine, you might need every 500.
- once you have that complete model saved again, you can go back to the text encoder training step, this time, on your full subset of data.
disclaimer: this is my process i'm doing currently and not what i think a lot of other people are doing. if you can, at all, use the new dreambooth code instead, that will use separate learning rates for TE vs unet, since they benefit from that. additionally, the new code has the ability to actually freeze the more important layers of the text encoder so that it is harder to damage.
hi guys, im still having issues with the number of regularisation images, how many regularisation images per concept shall i have?
Ig around 10-15 should work... Have only read this online, not verified myself which number should work best
Ig trying out with 5, 10, 15 and then evaluating each of the resulting models will be helpful
5-15 per image?
Yes
im using kohya ss, so my images are run between 10 and 40 times per epoch
does that mean that i also need to get 5-15*10-40 reg images?
or shall i still stick with 5-15 per image?
if args.with_prior_preservation:
    # Chunk the noise and model_pred into two parts and compute the loss on each part separately.
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)
    # Compute instance loss
    loss = F.mse_loss(
        model_pred.float(), target.float(), reduction="mean"
    )
    # Compute prior loss
    prior_loss = F.mse_loss(
        model_pred_prior.float(), target_prior.float(), reduction="mean"
    )
    # Add the prior loss to the instance loss.
    loss = loss + args.prior_loss_weight * prior_loss
else:
    loss = F.mse_loss(
        model_pred.float(), target.float(), reduction="mean"
    )
so this is code for prior preservation and i kind of see what it's doing, but, why is it doing that?
it makes the loss value appear much higher than it is without prior preservation, and i see now how the weight is applied to the prior loss and explains why the loss is lower with it being less taken into consideration.
but how does this actually direct the process or change its result?
it's just training on the regularization images and the training images
you don't need any special loss for that. You could also just put the regularization images to your training data.
however, the idea of regularization images is that they are only seen once in training (ideally). So you cannot overfit on reg images as they are trained only for one epoch
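To make the quoted loss concrete, here is a plain-Python version on a flat list of numbers, assuming (as the chunking in the snippet implies) that the first half of each batch holds the training images and the second half the regularization images. Names and values are illustrative only.

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def prior_preservation_loss(pred, target, prior_loss_weight):
    """Return (total, instance_loss, prior_loss) for one mixed batch."""
    half = len(pred) // 2
    instance_loss = mse(pred[:half], target[:half])   # training images
    prior_loss = mse(pred[half:], target[half:])      # regularization images
    total = instance_loss + prior_loss_weight * prior_loss
    return total, instance_loss, prior_loss


pred = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 1.0, 1.0, 1.0]
print(prior_preservation_loss(pred, target, prior_loss_weight=1.0))
print(prior_preservation_loss(pred, target, prior_loss_weight=0.5))
```

The reported loss is a sum of two errors, which is why it looks higher than a run without prior preservation. The backward pass then reduces both errors at once, with the weight deciding how much the regularization half counts.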
so they are not applied every epoch?
oh
so the backwards pass uses the loss value to determine how much error to resolve
so why does SD 2.1 just start out with insane loss values on the regularization data when i feed only those through?
hi guys, i see a major difference between running images more than once rather than running multiple epochs, anyone know why?
also anyone tried the difference between random crop and center crop?
depends on your source material, how much source material, how it's tagged, etc.
i like the partly-frozen TE
my source material is all captioned, however i have a lot of concepts that are mixed
the source material is very low sometimes per concept - sometimes maybe 5 pics only
sometimes 100
not sure, my best results were with about 3000 images so far
does it make sense to train half the time with random crop and half the time with center crop or so?
are you just doing style transfer
no its not only style, its objects with details
then you want to manually crop your images
that's not very many and it will be easy
but i have like 7k images xDD
idk, i don't see the point of just trying to train like 5 images of something and hundreds of others. it likely won't learn the lesser-frequent concepts
well with kohya_ss what you can do is define per concept how often the images are repeated, however that also causes issues for me like creating strange pattern
and i still havent found a way to tackle this
I've experimented with duplication and it can help but obviously something with 100 images is going to come out better than the one with 5 just run 20 times more each
duplicating the rare examples can help a bit, just don't try to fully equalize, the one with all the duplicates will overfit
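One way to duplicate rare concepts without fully equalizing is to dampen the repeat count. The square-root rule below is an assumption chosen for illustration, not something recommended in the chat, and the helper name is hypothetical.

```python
def damped_repeats(image_counts, power=0.5):
    """Repeats per concept. power=0 means no duplication,
    power=1 fully equalizes every concept to the largest one."""
    largest = max(image_counts.values())
    return {
        name: max(1, round((largest / count) ** power))
        for name, count in image_counts.items()
    }


counts = {"rare": 5, "common": 100}
print(damped_repeats(counts))             # partial boost for the rare concept
print(damped_repeats(counts, power=1.0))  # full equalization, risks overfitting
```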
Hello, I'm wondering if anyone can point me in the right direction. I want to remove speech bubbles from images of comic panels without giving prompts. Dataset in the thousands. I think I can train something like meta's new SAM to segment, YOLO to ID, then SD to inpaint? Is that the SOTA? Can SD tools help in the ID part at all?
i have figured out how to freeze certain layers of the text encoder and the results are superior to the approach i described before
# Excerpt from a modified train_dreambooth.py; args and text_encoder_cls come from that script.
# Load scheduler and models
noise_scheduler = DDPMScheduler.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="scheduler"
)
text_encoder = text_encoder_cls.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="text_encoder",
    revision=args.revision,
)
# Freeze the text encoder layers in the range [first_frozen_layer, 21).
first_frozen_layer = 0
last_frozen_layer = 0
total_count = 0
for name, param in text_encoder.named_parameters():
    total_count += 1
    pieces = name.split(".")
    # Only names like "text_model.encoder.layers.<n>..." carry a layer index.
    if len(pieces) < 4 or pieces[1] != "encoder" or pieces[2] != "layers":
        print(f"Ignoring non-encoder layer: {name}")
        continue
    print(f"Pieces: {pieces}")
    current_layer = int(pieces[3])
    if (
        current_layer >= first_frozen_layer and current_layer < 21
    ):  # choose whatever you like to freeze, here
        last_frozen_layer = current_layer
        if hasattr(param, "requires_grad"):
            param.requires_grad = False
            print(f"Froze layer: {name}")
        else:
            print(f"Ignoring layer that does not mark as gradient capable: {name}")
this has me training just the last 2 layers
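The name parsing can be pulled into small helpers and checked without loading a model. This is a sketch that assumes parameter names of the form `text_model.encoder.layers.<n>...`; the helper names are hypothetical.

```python
def encoder_layer_index(param_name):
    """Return the transformer layer index in a parameter name, or None."""
    pieces = param_name.split(".")
    if len(pieces) > 3 and pieces[1] == "encoder" and pieces[2] == "layers":
        return int(pieces[3])
    return None


def should_freeze(param_name, first_trainable_layer):
    """Freeze every encoder layer below first_trainable_layer."""
    index = encoder_layer_index(param_name)
    return index is not None and index < first_trainable_layer


# In a training script one would then set, for each (name, param) pair:
#     param.requires_grad = not should_freeze(name, 21)
print(should_freeze("text_model.encoder.layers.20.self_attn.q_proj.weight", 21))
print(should_freeze("text_model.encoder.layers.21.self_attn.q_proj.weight", 21))
```

As in the pasted script, parameters outside the encoder layers (embeddings, final layer norm) are left untouched by this rule.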
ED2 allows you to freeze first/last n layers in text encoder, it seems pretty good for SD2.x models with the newer openclip, though for SD1.5 I just set the learning rate of the text encoder lower than for unet, seems to work well, like 1/5th to 1/2 or so, or use cosine schedule on text encoder only
here only training last 6 layers, using different LR for text encoder
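A sketch of the "lower text encoder LR with a cosine schedule" idea mentioned above. The unet learning rate and step count are made-up values; only the 1/5 ratio comes from the chat.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


unet_base_lr = 1e-6                    # hypothetical value
text_encoder_base_lr = unet_base_lr / 5  # "1/5th to 1/2 or so" of the unet LR

for step in (0, 500, 1000):
    te = cosine_lr(step, total_steps=1000, base_lr=text_encoder_base_lr)
    print(f"step {step}: unet lr {unet_base_lr:.1e}, text encoder lr {te:.1e}")
```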
Hi folks (again) ! I really need help here. I installed the plugin to my photoshop - but it says that "Failed to load stable art"
I did all that was in the instruction
I have few versions of photoshop. And the plugin doesn't work in any of them.
will a higher or a lower prior loss value cause more overfitting?
as far as i read a lower value will cause more overfitting but from the source code it appears a higher will
can someone confirm that a higher prior loss value will cause the model to overfit less
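Going only by the formula in the script quoted earlier (`loss = loss + args.prior_loss_weight * prior_loss`), a higher weight gives the regularization images a larger share of the total loss, which pulls the model harder toward the class prior. Reading that as "less overfitting on the training images" is an interpretation, not something verified in this chat. A small sketch with a hypothetical helper:

```python
def prior_share(instance_loss, prior_loss, prior_loss_weight):
    """Fraction of the total loss contributed by the weighted prior term."""
    weighted = prior_loss_weight * prior_loss
    return weighted / (instance_loss + weighted)


for weight in (0.0, 0.5, 1.0, 2.0):
    print(weight, round(prior_share(0.2, 0.2, weight), 3))
```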
can anyone introduce me to the settings of training an embedding
forget it
hi people how many max steps do you use for faces? is 3000 too much or too low??
it's taking me 4 or 5 hours for each embedding, it's actually maddening.
3k steps for a face seems like a lot
i finally have enough sunlight for today (for now) to do some training on my 4090 here 😄
Thank you, do you use ED2? Or some other Soft?
Thanks a lot for the detailed explanation. A lot to unpack and learn there!
Thanks for the info Freon 👍🏻
i have taken the Diffusers script and modified it a lot
Nice
Could you share the link to the Diffusers Script?
i'm seeing that it might be best to freeze the outer layers eg. first n and last n, and train the middle for styles etc?
Cheers 👍🏻
I think that follows for unet, people see that with unet merges with weighted layers, but not experimented with TE
i should freeze some of the unet too?
I've not tried, im sure if you selectively froze stuff in general you may see interesting behaviors
kohya did experiments with merging two unets with different weights per layer between A and B models, it produced different results, but all his examples were anime and harder for me to judge, certainly interesting differences by using different layer weights
i tried to merge models but those seem to be for 1.5 and the weights aren't named the same now
hey
I'm trying to create a data set to train my lora which is pretty much just to create this animation
i have 2 options. one input all data sets i have from the internet, which is pretty much all cartoonish art.
two try to create it from WebUI and then use it as a data set
would doing the first option cause results to turn into cartoon art? as I want to create semirealistic art with it
Anyone know how I can get txt2img to generate red hair with blonde tips? I've tried all manner of prompts and weights but it always chooses one or the other. I've got multicolored hair before, but this is my first time wanting a specific coloration.
so with Terminal SNR and training on 2.1 i can get loss down to 0.11
@hot breach have you experimented with the alternate noise schedules? or @stiff dust
I just added the zero terminal snr thing but need to run experiments, I've messed with offset noise quite a bit
it isn't working very well for me
offset noise requires smaller amounts the longer you train, and it is not very stable since it has to be modified based on length of training, or turned on only for some portion of training
20k steps you can try 0.01 or 0.02, the original blog post suggested 0.10 but that only works well for training hundreds or a few thousand steps
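For reference, offset noise as commonly described adds one random shift per sample and channel on top of the usual per-pixel Gaussian noise. In PyTorch this is usually written along the lines of `torch.randn_like(latents) + offset * torch.randn(b, c, 1, 1)`. The pure-Python sketch below shows the same idea on nested lists; the function name and shapes are illustrative.

```python
import random


def offset_noise(batch, channels, pixels, offset, rng):
    """Gaussian noise plus one shared random shift per (sample, channel)."""
    noise = []
    for _ in range(batch):
        sample = []
        for _ in range(channels):
            # One value shared by every pixel of this channel.
            shift = offset * rng.gauss(0.0, 1.0)
            sample.append([rng.gauss(0.0, 1.0) + shift for _ in range(pixels)])
        noise.append(sample)
    return noise


# 0.02 is one of the values suggested above for long (20k step) runs.
noise = offset_noise(batch=2, channels=4, pixels=16, offset=0.02, rng=random.Random(0))
```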
ah
what is the min dataset size you recommend for 3000 steps? 20, 30 ?
pseudoterminalx: i am having similar issues, your learning rate is too high and you are using too many steps
try a learning rate of 5e-7 or 8e-7
what causes the error "AttributeError: 'FreeTypeFont' object has no attribute 'read'" in stable diffusion training each time it has to save a png?
i am at 1e-7
and 100 steps
how is that too much
4600 steps in...
should i just keep going?
i can kind of see the output improving. but also feels super broken
dying rn
like im having to totally re-teach it how predictions work with a new algorithm
pseudoterminalx can i see some of your input data
and how many images are you training on?
are you running images multiple times in an epoch?
i'm switching the model to Terminal SNR, so
did you tag the images properly
of course
can i see some input samples?
i have good results without Terminal SNR, though they're not exactly what i want the model to do, usually..
ok thats quite a big variation of input types, can i see some captions with the corresponding image?
they're the filenames for each
pizza_au_fromage_et__la_confiture_confiture_dabricot_fromage_comt_la_pizza_est_saupoudre_de_persil_cinmatique_hyper_dtaille_dtai.png
are you captioning in non-english?
it's french
it's a mix of languages
i think there's some Hindi, Russian, Japanese
i dont believe that it works if its not english. also just to make sure, you are having the caption in a .caption file and it is being used right?
i have a custom dreambooth script
the filename itself is the tag/caption
i like this approach better than what anyone else does
first of all i believe you need to normalize the data to be all the same language, english
i don't believe that to be the case at all
i dont think you can train in non english
like i said i have good results without Terminal SNR.
the OpenCLIP model already understands other languages, sir 😛
my efforts are to improve the model across the board incl more language comprehension
the problem is that the text model training will be screwed up if you dont and its as far as i understand the most important part
if the base model you train on is in english you are basically introducing new "words" to the model and it will probably only link those words to your images
so if your cheese is now "fromage" it will only know your one "fromage" image as "fromage" but not cheese
at least thats how i understand that
which means that it will not be able to recognize your "fromage" as cheese and also not recognize cheese as "fromage"
i believe what you want to do is to train in english and then convert your input prompt from any language you enter to english before generating an image
man it's really improving still. i am going to leave it running
@tall condor i don't think you fully understand what's happening here
this is stabilityai/stable-diffusion-2-1 output with my current settings and that same prompt
i'm fine-tuning the model to use terminal SNR
sorry i dont know anything about that. i thought you were doing regular training
it's okay 👍🏽 usually i am but last night i started down the rabbit hole of implementing a research paper
i wish there were more people that have done this specific transition
i have no idea if what i'm seeing is correct
what exactly is the difference
it improves the contrast balance of the image
the typical noise schedule of SD means that the overall average colour grade of the image is gray
offset noise was a workaround to help with this issue but apparently it's a hackish fix and terminal SNR is "the right way"
4500 steps
so its basically a replacement for lowering the applied noise?
5700 steps
i don't know if i'm doing this right but i'm okay with what's happening now at least.
would be interesting to see where its going
as i understand it, this is a way of allowing the model to determine where it wants to end up before it gets there, so that it can remove noise more effectively. and my understanding is likely incorrect
the images are still having very high contrast tho
i wonder if that will be lowered to the end
they were washed out before
the contrast is a fix in progress as far as i can tell
you know what's crazy, the researcher leading this team only graduated from university about a year ago
that said, he obtained a masters' degree in CS so, a bit more than what i did 
cool
what i really hate on training is that there is so little description on impact of certain settings
for me its mostly trial and error
yeah people hold these cards too closely to their chest
Big facts.
on the off chance anyone else has done this in here, did you freeze the text encoder partly, fully, or not at all?
anyone played with color augmentation yet?
well i'm going to restart this with a fix pulled from ED2
i was definitely at the very least, doing inference wrong
way better
that from the same model before or did you restart?
this is the restarted training
what did you change this time?
scheduler and config for it
this time i kept the SD2.1 scheduler config and overloaded values into the betas
previously, i just used the default scheduler config with overloads
my understanding now is that SD2.1's config is pretty different from the way the schedulers are used out of the box
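For context, "overloading values into the betas" for terminal SNR usually means rescaling the beta schedule so the final timestep is pure noise. The sketch below follows the rescaling as I understand it from the zero terminal SNR paper being discussed. The scaled-linear schedule values are what I believe SD 2.1 ships with and should be treated as an assumption, as should the function names.

```python
import math


def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Scaled-linear schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def rescale_zero_terminal_snr(betas):
    """Rescale betas so the last step has zero signal-to-noise ratio."""
    # sqrt of the cumulative product of alphas = 1 - beta
    alphas_bar_sqrt = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar_sqrt.append(math.sqrt(running))
    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Shift so the last value is exactly 0, scale so the first is unchanged.
    shifted = [(a - last) * first / (first - last) for a in alphas_bar_sqrt]
    alphas_bar = [a * a for a in shifted]
    # Convert cumulative products back into per-step betas.
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]


betas = scaled_linear_betas()
new_betas = rescale_zero_terminal_snr(betas)
print(betas[-1], new_betas[-1])  # the final beta moves from ~0.012 to 1.0
```

The rest of the scheduler config (for example the prediction type) has to stay as the base model expects, which matches the observation above that keeping the SD2.1 scheduler config and only swapping the betas worked better.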
looks much better now but is it doing what you expect it to do?
i didn't expect it to work this well, so i'm not sure how to answer that
it still has photoreal issues but i can fix those
maybe its not doing anything at all? xD
nope, the image quality is +++ compared to baseline
cool
the contrast is changing a lot
ok so bikes look super good
i'm going to nuke my learning rate because it's too high
any guide for fine tuning or faq or pro tips?
Could you please share the script if it is fine with you?
Yes I am
interesting. Should try the scheduler fix, too
SD 2.1 is using v prediction, which is very different from the 1.5 noise scheduler
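A scalar sketch of the v-prediction target, using the standard definition v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0. The function name is hypothetical and real training applies this per latent element.

```python
import math


def v_target(x0, noise, alpha_bar):
    """Velocity target for a clean value x0 and its added noise."""
    return math.sqrt(alpha_bar) * noise - math.sqrt(1.0 - alpha_bar) * x0


print(v_target(x0=2.0, noise=3.0, alpha_bar=1.0))  # no noise added: target is the noise
print(v_target(x0=2.0, noise=3.0, alpha_bar=0.0))  # pure noise: target is -x0
```

At alpha_bar = 0 the input is pure noise, so an epsilon target would carry no information about the image, while the v target still does. That is one reason terminal SNR is paired with v prediction.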
yeah i know that now because i've fixed the scheduler config and now terminal SNR works out of the box
this is baseline without any fine-tuning
nothing special about the prompt and there's no negatives
btw, ignore the nsfw prompts in my inference script, they're only there so i can stop training if i begin to introduce any by accident
they do really weird stuff
like paint the swiss alps, a cabin in the woods. because there's zero concept of those in 2.1
i'm using photos as class data but i've got it feeding the class data's filename in as its prompt instead of a single token, and i've used BLIP to label them
your results are really interesting. it would be really cool if you could contribute them to the community
i recommend to start with kohya_ss it has a web UI that is really a good starting point
mate, i publish everything on github
i'm not going to go rush to implement stuff in applications i never use, that's up to you
i saw. thanks!
so i have 2 same images, one time with orange one time with blue tint, i would expect sd to equalize them out to be a neutral tint but for some reason the blue one is very strongly dominating, why is that?
likely because of how the denoising process works
but i don't know if anyone knows exactly why that would happen, are you using any special techniques like offset noise or SNR fixes?
Could be how the colors are coded in the first place: in RGB, blue might be around 0,0,255 while orange might be around 125,125,0, so “put together” blue is stronger than the red and yellow
Hello. I'm trying to train sd with a lora with this art style, but so far I'm not getting good results. I think it's partly because my English is not very good and I'm not describing the image properly in the text file. Can you help me with the caption text of this image for example? It's a bullfighter's jacket, I guess I should describe as well as possible the ornaments, the embroidery, the brooches...
watercolor
well, training has completed 
first test prompt outside of my sanity check prompt list
I am trying to train a model to use this specific mask. If you pause at any given frame most of them look like the mask I'm trying to fine tune but each frame scrambles the colors/features. I tried retraining with images of a single mask but it still scrambles the color... any thoughts on getting consistency? Not 100% sure if it's a training issue or a settings issue
can you share more details about your training setup?
this dataset you see here was trained on about 25 pictures of a variety of these masks (no 2 are exactly the same). I can't remember the exact settings but it was a standard how to vid on youtube, I believe training steps were 1600 at 2e-6.
I then trained it on a single mask but used 50 photos same settings, this one was muddled and looked too much like a generic mask.
I then used a dataset of a single mask at 25 photos which seems slightly more consistent however it requires a lot more fine tuning in the settings.
My approach here was to treat it like a face and give it a modest dataset at 1600steps, 2e-6 which is what most people say faces should be trained at
I'm using dreambooth on google colab. My GPU is too slow to use it locally
i would suggest training a Lora
dreambooth is kind of iffy and is a heavier process, and a Lora will sit on top of any compatible model
thank you! I will try Lora
@surreal lagoon to caption a face which all traits/features should we mention?
I would always first do textual inversion to learn the facial features / assign them a token. Then you can use that token for captioning
caption can be "photo of <mytoken>". Better is to randomize it a bit, "<mytoken>, photography", "a photo of <mytoken> made by smartphone camera" and so on
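The caption randomization can be as simple as picking a template per training sample. A minimal sketch using the three example captions above; the helper name is hypothetical.

```python
import random

TEMPLATES = [
    "photo of {token}",
    "{token}, photography",
    "a photo of {token} made by smartphone camera",
]


def random_caption(token, rng):
    """Pick one caption template at random and fill in the learned token."""
    return rng.choice(TEMPLATES).format(token=token)


rng = random.Random(0)
for _ in range(3):
    print(random_caption("<mytoken>", rng))
```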
Textual inversion of the same character or just random faces?
same character. I thought that was the question
No I just wanted to train a model with multiple faces so we can improve the face textures and make them a bit realistic
i allowed BLIP to do all of mine
to me that led to improvement across general keyword use, eg. a woman or a man will come back by default as more complete-looking
i like using BLIP for training captions because it essentially has the text encoder tell me where it wants those images to be
like "oh, i recognise those features! we have these keywords for them." and then i provide high quality images for those keywords and it learns to do them better
Heya, i have 1,3k images and I want to train Lora/Model (dont know yet) which one is more suitable, looking for some advice and guides on where to start:)
I use Blip, too, but still find its captions way too short.
you can do it with a higher temperature and return multiple possibilities and merge them
🙂
is the vae fine-tuned in the script from diffusers?
all of the images are for the same concept or is it a mix of concepts
Oh kk
Yeah they are too short, it's not the best to describe a scene
Anyone tried this out? https://github.com/tjennings/Coreco_LLaVA
it’s a collection of random items, which share the same cartoon draw style
is there a way to run dreambooth fully locally?
yeah, of course, you should have 8gb vram at least, better more
i have 12
tried to use colab but it throws disconnects from time to time ruining everything
Im looking at making a "pixel perfect" 32x32 pixelart model, do I need to finetune everything into it (knights, dragons, zombies, swords, skeletons, trees, bushes, etc.) or is there a way to do this in 1 go or does it need to be done 1 at a time?
this is easier to handle if you have checkpoints being written every N steps and keep only X number of checkpoints
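A sketch of the "keep only X checkpoints" part, assuming checkpoints are saved as directories named `checkpoint-<step>` inside the output folder. That naming and the helper name are assumptions; adjust to whatever your script writes.

```python
import shutil
from pathlib import Path


def prune_checkpoints(output_dir, keep):
    """Delete the oldest checkpoint-<step> directories, keeping the newest `keep`."""
    checkpoints = sorted(
        (
            p for p in Path(output_dir).glob("checkpoint-*")
            if p.is_dir() and p.name.split("-")[-1].isdigit()
        ),
        key=lambda p: int(p.name.split("-")[-1]),  # sort by step number
    )
    removed = checkpoints[:-keep] if keep > 0 else checkpoints
    for path in removed:
        shutil.rmtree(path)
    return [p.name for p in removed]
```

Calling this right after each save means a Colab disconnect costs at most N steps of work, and the drive does not fill up.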
tbh i am not sure how to use that system, i just put settings and press "play" want to build it on local for personal grind:D, but not sure how
if you clone the diffusers repo into your google drive on colab (GPT can help you with that) you can find the examples directory and in there is some dreambooth, fine-tuning, etc scripts and you can pick whichever you want, and follow the directions on the huggingface hub tutorials for it - that's how i've done it anyway, i'm sure there's different approaches
How good are your models with this approach
like this
i fine-tune on 2.1 only because i love its compatibility with higher resolutions and SNR fixes, as well as the penultimate clip handling
it's not as "easy" though, it takes a fair bit more understanding of what you're trying to accomplish
i just want to give it a word like " mouse" and get a mouse in that image style:D
that sounds like you want a LoRA
or textual inversion
this idk what is:D
like a positive/negative prompt, on steroids
can i feed it with my own images to make it "understand" what i want
What do you think i shud use for this use case: i have 1000+ images of one specific meme in diff styles. I want to create a model that lets u input any text and get that version of the meme (a batman version, spongebob version etc.)
Exactly yea
Like this would be trump pepe and the other drooling pepe i alr have captions but have been struggling with getting good results finetuning
Obv my dataset is not pepe but a diff meme but similar idea
i think you want a LoRA
does anyone know if there is a way in auto 1111 to download all dependencies at once?
@hot ether you may want to take a look at kohya_ss
Ok thats what i was thinking thx
i'm about to go to sleep but ok
it only works at 512 or 768 square
the aspect bucket stuff is really poorly implemented, manually crop and centre everything
downsampling high res images to the right size will artifact the image too
better to crop instead
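For a large folder the centred crop can be computed instead of done by hand. A small sketch that returns a crop box in (left, top, right, bottom) order, the same order Pillow's `Image.crop` takes; the helper name and the choice to reject undersized images are assumptions.

```python
def center_crop_box(width, height, size=768):
    """Box for a size x size crop from the image centre, without resizing."""
    if width < size or height < size:
        raise ValueError("image is smaller than the crop size")
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)


print(center_crop_box(1024, 768))         # 768 square from a 1024x768 image
print(center_crop_box(2000, 1500, 512))   # 512 square from a larger image
```

A centred box will miss off-centre subjects, which is the argument for cropping manually when the subject matters.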
Isn't it impossible to manually crop like 2k images
nope not impossible
i will answer here. i think using a llm to caption images is a waste of resources and money
you already have a Clip encoder that can caption, and it can be fine tuned on new captions
unless you are training an encoder from scratch, i see no point
What does finetuning captions mean?
it means you are making BLIP work better
openclip was trained on like 2 or 3 billion image caption pairs
i doubt theres a lot that can be improved on that with a llm
Ok, so how can we do that.
B'cuz currently blip just generates small captions which I don't think can add a lot of details
I want to train a model based on landscape photography, so it generates mini captions.
Many elements are skipped
Lemme show you the output
why do you need the details captioned at all
you are tuning its current vocabulary
it will be fine and you can use shorter prompts to get good results then
if you want new keywords added then you will have to add them to each caption yourself
its tedious but not the end of the world. you arent training on 10k or 100k images, so
I have a dataset of 100k images
I am away from my pc
I'll share the images and their captions generated by ED2 captioning(blip2)
ok you seem to be taking on a project you arent prepared for at all
I have already generated the captions
I haven't trained any model before
What are the things that I have to check for?
start small
observe what different parameters change for an end result
your 100k images will take more than a week on a single gpu to train effectively
imagine finishing that and realising it wont work
i started with 300 images which werent enough and 3k images were okay but starts to take long enough that the text encoder could be damaged. the most recent try was 30k images and it took a lot of work to figure out
100k images to me would require a much smarter approach to training the text encoder
smarter doesnt mean faster, wastes some compute to optimize training but layers can converge faster
so in a roundabout way, it is faster but fewer iterations per second
Do you mean text encoder as model?
i mean the text encoder
What's a text encoder?
time to google
Oh ok,
Got it it converts text into vector
it does most of the work
How does it get damaged?
It fails to convert unrecognised text?
Ok
@surreal lagoon which training method is best suitable for landscape photography?
And which one for faces?
these all feel like questions that google can easily answer and im not trying to be rude but it feels slightly rude to keep asking someone stuff that is so easily discovered. like, i am not a search engine, ya know?
i will say good luck training on the word city or downtown because new york times square is overfitted something fierce and likely cant be fixed
Is it possible to train lora's with an m1 apple? i tried googling but it doesn't seem supported yet by most GUI i seen.
@surreal lagoon my model is producing strange patterns after a while of training while other things keep training well. is there a way to find out what causes those patterns? for example it looks like skin is being ripped off and stuff. it feels like a single image is becoming too dominant or so. any idea how to tackle that?
are you training the text encoder? if so, are you freezing any of it?
shapes, textures and patterns are pretty strong features in the lower-to-mid layers of the text encoder
what do you mean when you say freezing?
do you think it makes sense to stop the text encoder training at the point where i start seeing patterns?
How can I create images with a specific person's face in it? I know I can try to get the description of the image. But can I train the model to know a person's face?
ok so I guess dreambooth can do this but I only have 6GB of vram
so, not possible?
Heyyyy
So I did this tute:
https://youtu.be/3uzCNrQao3o
Took me almost 6 hours to get through it
In the end, results were AWFUL
How to install famous Kohya SS LoRA GUI on RunPod IO pods and do training on cloud seamlessly as in your PC. Then use Automatic1111 Web UI to generate images with your trained LoRA files. Everything is explained step by step and amazing resource GitHub file is provided with necessary commands. If you want to use Kohya's Stable Diffusion trainers...
Is this still the best or is there a better one? I've heard aitrepreneur makes good ones?
Also wondering if I had the wrong python version when I did training ......
it really might depend on your images that you're using
And then wondering if that wrong python version is resulting in the loras not getting loaded properly in a1111
try with like 15 images and no class data and about 4000 steps at a learning rate of 1e-4
I have previously used the exact same images (13 of them) with Shivam's Dreambooth colab thingy, got great results
Yeahh ... .. I dunno anything now 😄 my info is all from November, forever in the AI world
ahh? yeah see the terms are all new and I really don't know what is what
dbooth was destructive etc and lora is supposedly not?
But how do people make their civitai stuff? 😛
So basically I wanna add some faces to a civitai model ......
What's your personal favorite way of doing that?
dreambooth doesn't have to be destructive
freeze about half the text encoder
do as few steps as it requires to actually get the results you seek
you want to have a few validation prompts like different celebs or 'a random european man' kind of thing to ensure you are not breaking it
i usually check these:
"woman": "a woman, hanging out on the beach",
"man": "a man playing guitar in a park",
"child": "a child flying a kite on a sunny day",
"alien": "an alien exploring the Mars surface",
"robot": "a robot serving coffee in a cafe",
"knight": "a knight protecting a castle",
"menn": "a group of men",
"bicycle": "a bicycle, on a mountainside, on a sunny day",
"cosmic": "cosmic entity, sitting in an impossible position, quantum reality, colours",
"wizard": "a mage wizard, bearded and gray hair, blue star hat with wand and mystical haze",
"wizarddd": "digital art, fantasy, portrait of an old wizard, detailed",
"macro": "a dramatic city-scape at sunset or sunrise",
"micro": "RNA and other molecular machinery of life",
"gecko": "a leopard gecko stalking a cricket"
And you enter these into what or where 😄 an a1111 extension? A separate thing?
the inference.py script in my simpletuner repo

in fact my dreambooth script there is the least destructive i know of but it's more like a fine-tuning script now...
OK ........ that's not as straightforward/obvious as I was hoping 😄
But perhaps an avenue I shall explore
For now, I am neglecting my child and must get back to daddy duty 😛
Hi, I'm neglecting again ... "my"? On your githubby thing? 😉 Can you share the link?
what is the recommended resolution in ppp for training? is 72 enough?
cool thing on LORA is: you can train it and then afterwards disable certain layers to check what happened
for example, I found that when training on my face it was the unet that made trouble (my photos are rather low quality android photos and the unet very fast learned the grainyness of the photos). So scaling down the unet and relying more on the text encoder fixed that for me
similarly, you can disable the first k layers of the text encoder and check how that affects your results
you can do all that afterwards and find out what went wrong in your training. Next, you retrain the lora or dreambooth but this time with removing the layers that caused harm
@valid coral if you don't want to write your own code, there is a tool called EveryDream2 that can do DreamBooth and that is parameter tuned
like it already freezes the first 16 layers of text encoders in Sd 2.1 and so on
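A rough sketch of what "freezing the first k layers" of the text encoder can look like in code. The attribute path `text_model.encoder.layers` is an assumption that matches the Hugging Face `CLIPTextModel` layout; the function name `freeze_first_layers` is made up for illustration and is not taken from EveryDream2 or any other trainer mentioned here.

```python
def freeze_first_layers(text_encoder, k):
    """Disable gradients for the first k transformer blocks.

    Assumes text_encoder.text_model.encoder.layers is a list of blocks,
    each exposing .parameters() (as torch.nn.Module does).
    Returns the number of parameter tensors that were frozen.
    """
    frozen = 0
    for layer in text_encoder.text_model.encoder.layers[:k]:
        for param in layer.parameters():
            # A frozen parameter receives no gradient, so the optimizer
            # leaves it untouched.
            param.requires_grad = False
            frozen += 1
    return frozen
```

Calling it with `k` set to roughly half the layer count would correspond to the "freeze about half the text encoder" advice above; the later layers stay trainable.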
I have that ... .. installed ... ... I've been slowly going through the extensive how-to documentation .... then I get distracted by other sparkly objects 😆
Now I'm trying to fix a broken venv in my a1111
and ...
yeah
venv fixed. Seems the dbooth extension for a1111 has been broken for a while. Saw a couple fixes on various forums that didn't seem to solve my issue. SO FORGET THAT IDEA
Now back to seeing if I can get dbooth to train in kohya_ss.
If not ... EveryDream Trainer 2.0 💪
OK! I finally got all my ducks in a row and hit the Train button for Dreambooth in kohya_ss.
Shortly after, I got a memory error 😄
(I have 12 GB)
So that's an official "no" to running Dbooth on my machine? There's no config to edit or slower way to do it?
"try setting max_split_size_mb to avoid fragmentation" -- not an option?
🙏
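For reference, the `max_split_size_mb` hint in that error message refers to a PyTorch allocator setting that is passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable. A minimal sketch of setting it from Python follows; the value 128 is only an example, not a recommendation, and whether it is enough to fit Dreambooth into 12 GB is not something this snippet can promise.

```python
import os

# Must be set before torch initialises CUDA, so do it at the very top of
# the launch script (or export it in the shell before starting the trainer).
# 128 is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only reduces fragmentation; it does not lower the amount of memory the training run needs overall.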
I got this error, where should I change the padding_side parameter?
https://github.com/Vision-CAIR/MiniGPT-4
This is a heck of a place to ask that 😄
ik, but there isn't a server for them so asked here
Not the worst place. There are people here that might know. But are they awake right now ....
Not sure
Are you familiar with Text Generation WebUI? There's a discord, I've gotten some great support with related stuff there...
I just tried sharing the link but it was blocked 😛
Easy enough to google it
Indeed!
HUR HURRRR, EveryDream2 can't use safetensors models ... .. . .. . .. . .. . ..
so I guess that's the end of that one
...unless I use THIS perhaps??
https://github.com/diStyApps/Safe-and-Stable-Ckpt2Safetensors-Conversion-Tool-GUI
Just looks like a timeout error .... tried installing too many github things? Happened to me once. Go to the main github website in a browser and see if it has you on timeout...
also not sure what the "jllllll" thing is about
maybe, because I got timeout even on gallery-dl
How can we fix it?
When you go here, no errors?
https://github.com/oobabooga/text-generation-webui
And you're using the one-click installer?
Is it better to train on a celebrity with a smaller set of images (40) or larger set of images (160)? How do you decide how much is too much? I heard that 100 steps per image is correct. Is that true?
Yep
I'm trying to figure out how to train for specific faces. My graphics card only has 6GB vram so I can't use Dreambooth so I am trying to use Kohya to create a lora but everything I try with Kohya causes errors and nothing works.
What should I do?
@valid coral Same error when I ran pip install -r requirements.txt through the normal installation meth
Hi All,
Noob to AI Art here, having started generating with early access to the Leonardo AI cloud platform 2 weeks ago where I generated a fair few images, using different models, and even trained a couple of models as well (albeit through a very simple user friendly GUI)
I swiftly moved on to installing Automatic1111, and SD, and have been doing some great local generations, and upscales using different models and LOra's from Civitai etc.
Today I installed Kohya_SS GUI, after much frustration as the install doesn't "just work" when running Setup.bat, or at least didn't for me.
I set about doing my first training in the Kohya GUI, and provided a set of 130 images, and used WD1.4 to tag them. Tags look good enough to me, so I went ahead with the training, and got my first character/person Lora out of it. It kind of worked, and it's had an effect but it's not strong enough.
I just don't know what I'm doing with the Repeat, Batch Size, and Epoch settings 😦 I ran that first Lora on defaults:
Repeats: 40
Batch Size: 1
Epoch: 1
Optimizer: AdamW8bit
Text Encoder Learning Rate: 5e-5
UNet Learning Rate: 5e-5
I have no idea how this correlates to the number of steps, how often I should output a sample image etc
I understand an epoch is a complete pass over the data set, where each image is trained "repeat" times as part of the epoch. What's a good number of steps to aim for to train a model? How does the number of repeat / epochs affect things
Any advice appreciated, or pointers to some good resources.
Thx
You'll be much better off asking at their Discord, friend:
https://github.com/oobabooga/text-generation-webui/discussions/600
Sure definitely
I think I just answered some of my own questions lol:
I wanted to know how image repeat, epochs etc related to steps, and then saw this in the Kohya output:
That tells me what I want to know, but what I don't know is, is 26K steps good? Is there any guidance around this, and also the batch size.
I know what batch size is when doing generations. Is it the same here? So if I set batch size to 4, I'd end up with 26K x 4 steps, and the idea being that each pass over an image, per epoch, will generate 4 images to train on
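A rough sketch of how images, repeats, epochs and batch size usually combine into a step count. It assumes the trainer counts one step per batch and ignores gradient accumulation; that is an assumption about the accounting, not something read out of the Kohya source, and `total_steps` is a made-up helper name.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size):
    """Estimate optimizer steps, assuming one step per batch."""
    samples_per_epoch = num_images * repeats
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return steps_per_epoch * epochs


# The defaults quoted above: 130 images, 40 repeats, 1 epoch.
print(total_steps(130, 40, 1, 1))  # 5200
print(total_steps(130, 40, 1, 4))  # 1300
```

Under that assumption a batch size of 4 does not multiply the step count by 4. Each step simply looks at 4 training images at once, so the same number of image passes takes a quarter of the steps.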
I'm running a training now, and have set it to output an image every 100 steps. That's gonna be a crap ton of sample images as it trains
@valid coral did you have any luck with training? I'm also trying to figure out how to do it
Well, I took out a few images that I thought would pollute the data set because in one the person was wearing a hat, and a couple of others they had weird colour contacts in.
I tried with batch size 4, epoch 10, repeats 30.
Took an hour, and I could see it was getting closer to representing the person, but wasn't quite there. I was watching the sample pics being generated.
I'm doing another run, this time batch 4, epoch 20, repeats 40 which is likely to take 3 hours.
My learning rate for both is 5e-4.
I'm new to it all, both AI art with SD, and training. Fingers crossed I'll get something good this time.
I'm doing lots of reading to try and understand everything
Still making my way through the EveryDream2 tutorial...
https://github.com/victorchall/EveryDream2trainer
It won't allow me to train a .safetensors model from civitai, but at this point I'm desperate enough to get ANY kind of result 😉
The Dreambooth extension is broken with a recent update of A1111 ... and the lora I trained with kohya_ss was also totally unusable (generated errors).
mama mia, has anyone been able to get this to work?
Of course! Just nobody that's here right now 😄
And I just hit the part in the tutorial video where the guy shows it running -- it's using 22 GB of VRAM ... so that's the end of EveryDream2trainer for meeee!
For the record, I did TONS of training, but it was last November ... and the whole world has changed since then. Now I'm trying with 2.1 models
Ya I think I might need to do the training on something that's cloud hosted. Then once I have the lora I'll be good?
if I'm doing cloud hosting maybe I should use dreambooth? does dreambooth produce lora's too?
My first lora generated from kohya today worked fine in the Automatic1111 gui, just wasn't close enough to the look I wanted.
I didn't do anything special first time around, except to select the rpgv4 checkpoint as my base model for training.
The rest of the training settings were default mostly
I'm on kohya, running on a 4080 16GB vram.
No idea what my training run in kohya used, but I had a batch size of 4, and a training res of 512 * 768.
Took an hour to do around 9300 steps, and produced a working lora
Yeah I'm gonna rewatch the kohya_ss tutorial video and do exactly what the guy did, instead of branching off with my own model...
If you're only doing lora, that's much more VRAM friendly. You can use kohya_ss
I think for me it never ran over 6 GB...
it won't work for me, whenever I try to do anything I get errors. When I try to caption images it says I'm missing "cudart64_110.dll"
then it says " Ignore above cudart dlerror if you do not have a GPU set up on your machine."
but I do have a GPU
I have a laptop with integrated graphics and a GTX 2060. So maybe its accidentally using the integrated one?
anyone know if it a problem that i have buckets with only 1 image?
also if i increase the bucket size does that mean that my images are also handled in bigger chunks or is the only difference the grouping?
I have the solution for that, as I was installing kohya for the first time today and had that error.
What I did was copy that file from the Automatic1111 directory, to the root folder of kohya.
I then got errors about my xformers version, and pytorch version.
To fix this from a python prompt, and in the kohya venv root dir, I ran the following to install latest xformers, and pytorch+cu118 (kohya wasn't running at this time)
pip install -U xformers
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
After that, it amazingly works.
Why the setup.bat of kohya gui just doesn't work is a mystery.
Took me a while to figure out, as whilst I am a dev, I know nothing of python.
If you need the cudart dll let me know
if i dont "upscale bucket resolution" (dont downscale my images to fit the max size) in combination with random crop does that mean that for each time the image is picked for learning a random 512x512 section of the image is selected for learning but without scaling the image down?
THAT WORKED
400 steps into trying to burn a model lol
if its only 400 steps xD
2.1 can burn like ice
all my 2.1 trials failed hard so i stick with 1.5 for now xD

Glad to help out. I really struggled this morning to get Kohya Gui to work, had too many issues, but glad I got it sorted eventually, and also that I could spread word of how to get it working for others 🙂
My own lora works nicely (though I need to re-train to get a closer look to the person I trained on), but what I'm finding is that when combining with other lora's, the prompt weights seem off. it's like my lora's weight is too heavy, and I have to really up the prompt weight of the other lora
maybe your model is overfitting too much
yea i see no ring xD
xD
look at the blurry, soft focus
come back when it can ride in mordor - then we talk!
please have gandalf ride that thingy
the ballrog?
no the bicycle

@surreal lagoon can you answer this: if i dont "upscale bucket resolution" (dont downscale my images to fit the max size) in combination with random crop does that mean that for each time the image is picked for learning a random 512x512 section of the image is selected for learning but without scaling the image down?
that gandog?
gandog the gray
i'd have to go look at the source code
you know my feelings on this bucketing nonsense
yea i know but if the behavior is as i describe i can actually even see a benefit
because i would learn way more details on a model this way
btw my patterns are gone after latent upscaling all the images
i guess it can only break if the images are too big and the sections are too small
but i see your point
probably creating those details beforehand and then captioning them properly would be the best option
baseline's understanding of a leopard gecko man
leopard gecko
make a cacco? xD cat gecko?

also its way more gecko than man
but still nice result
"woman": "a woman, hanging out on the beach",
"man": "a man playing guitar in a park",
"child": "a child flying a kite on a sunny day",
"alien": "an alien exploring the Mars surface",
"robot": "a robot serving coffee in a cafe",
"knight": "a knight protecting a castle",
"menn": "a group of men",
"bicycle": "a bicycle, on a mountainside, on a sunny day",
"cosmic": "cosmic entity, sitting in an impossible position, quantum reality, colours",
"wizard": "a mage wizard, bearded and gray hair, blue star hat with wand and mystical haze",
"wizarddd": "digital art, fantasy, portrait of an old wizard, detailed",
"macro": "a dramatic city-scape at sunset or sunrise",
"micro": "RNA and other molecular machinery of life",
"gecko": "a leopard gecko stalking a cricket"
very much and clear details
it's the gecko prompt
yeah when you train 2.1 on terminal SNR it goes amazingly clear and crisp
SAI needs to update it with this baked in
my main issue with my models atm is that the training picks up the colors and patterns sometimes more than the objects. is there a way to tackle that?
yeah freeze the text encoder more
also can i see your "cosmic"
my next try will be to stop the text encoder at 50% and see where it takes me
100->400
looks cool
and lol for "menn"
i stop mine at 25pct
ill copy that xD
it was my idea
and if it sucks it was yours
some of my concepts have very few pics (maybe 2-3) - do you think the batch size of 6 will kill those concepts?
im quite happy that i could get rid of those patterns as they were a real PITA
i broke it. my cosmic prompt
how?
successfully regressed Stable Diffusion's capabilities by like 3 years 
i'm testing training earlier layers for style transfer
some of my concepts have very few pics (maybe 2-3) - do you think the batch size of 6 will kill those concepts?
no idea
I was trying to see if I could successfully create a lora with kohya so I just made one with 2 images. It's still taking a while, downloading a bunch of stuff it seems, and I gotta take my laptop with me to the airport in like 10 min. If I cancel the operation will it screw up my kohya installation?
likely yes
pseudoterminalx: does it make sense to train a 1.5 model with images greater than 512px?
Did you choose the right stuff in kohya?
When training a lora I didn't, and don't, have it do downloads of any of that stuff.
Once setup, it just worked. Are you using the dreambooth lora tab?
probably not
broke it again, this is supposed to be a wizard
i mean, kinda looks like a wizzard xD
I didn’t change any of the defaults. Maybe it downloaded stuff because it was the first time. And ya the dream booth Lora tab
100 steps earlier tho
What are you trying to make?
i've extracted frames from The Hobbit and i'm transferring its style
experimenting with different text encoder layers training
so far it seems like the 16th layer is too early still
too many fundamentals you can break
Very interesting. Where did you learn about layers?
by breaking them
I've a 900Mbps connection, so maybe I didn't notice the downloads lol, as I was doing other stuff at the time as well.
Hope you get a lora created. Any problems let me know, and I'll see if I can help.
I'm new to kohya as well, as of this morning. I'm on my second training now.
It's going to take 3 hrs total on my 4080. Around 24K steps to train in total.
Same one I did earlier, that came out pretty good. But it wasn't quite close enough, so I'm re-running with more epochs and repeats
@surreal lagoon Are you using kohya?
How do you go about selecting the layer to train?
i'm using a script i'm developing as i go
i love this one
looks so damn bootleg
Thanks @jaunty grove I’ll definitely give you a ping if I run into more trouble. I’m trying to figure out if I can create a Lora with someone’s face and use it in combination with the studio ghibli Lora to make ghibli themed profile pics
@surreal lagoon do you have a background in data science?
lol i just keep breaking it god damn that's supposed to be a bicycle

same ckpt
sigh
good
Looks solid to me!
I actually love this, I would make it an album cover
this training session is going much better.
i've obviously bumped the weights of "people" up immensely. a scene that just describes a robot serving coffee, now serves it to a person
this one, in other training sessions with different dataset, would result in an animated wizard
are vast.ai gpu prices cheaper than runpod?
As its just 1 cent/h and on run pod its $1.4/h
that's so cheap. why
you should ask vast.ai
i would honestly do performance tests. it might be overshared
might be
Its 4x3090
It shows VRAM of 24GB
so does it have 24gb vram per gpu or collectively 24gb vram?
hmm
24 per
that's priced linearly above the 48gb system too
damn so i can run like a cluster over there for what i pay now 
so we get 96gb vram
seems like it
yeah
good question
SO they can acquire the customers?
well let me know if you find out lmao
the woman hanging out at the beach has always been a difficult prompt for 2.1 for some reason but now she even has all the right number of fingers. they're just the wrong ones
a40 for just 0.165
hm that shoulder tho
the wrist is pulled towards her, can we fix that with inpaint?
i think it's not getting better
not sure
@warm agate the price increases when you allocate more than 10gb of storage
oh, then it would definitely increase as every AI model is something like 20gb+
yep it is, as i got an option to host
so the datacenter prices are still good
makes longterm prices much higher tho, about 2x the price for 60gb of space
How to filter this?
We have a reliability score, so we can easily estimate
@surreal lagoon do you know how we can add minigpt4 into text-generation-webui?
nope.
@valid coral how to add minigpt4 into text-generation-webui, I have asked in their discord, but they seem unresponsive.
i have added minigpt into the pipeline, but how to work with the model?
not really the right channel for any of that
hmm ok
Indeed, this is an image generation place.
Though Stability did produce an LLM of their own... why isn't there a channel for it?
🤷♂️
I've only played with the text generation webui for maybe 20 minutes myself, I'm not a good person to ask for support unfortunately. I was only suggesting their Discord as a potential place to get support for playing with LLMs.
Hmm ok
@valid coral Can you debug this
Debug? It looks fine to me...
But no, not really, I have probably 3% knowledge when it comes to Python.
I see you were getting a bunch of replies on the other Discord, but you were in the "dev" channel and not a support channel.
Asking all over the Interwebs is only gonna get you banz0red 🙂
So .... yes it's frustrating and confusing, but that's where we're at, this stuff is very much in its infancy.
Cuda extension is not installed. oh ok
ok
@surreal lagoon: what can i expect from the UNet Training after freezing the TE? what shall i focus on to see if training is still improving
i would expect it to start taking on the textures of your images more than their contents
at least for sd2.1 it can kind of improve the model to keep training the unet but if your source material isn't truly high quality it just ruins it
the text encoder is most worthwhile to train
it's also the hardest to 😦
currently i'm testing offset noise for the first time and i'm just not expecting to see stuff like this in my results. is that what it does at first before swinging back and making more sense?
well
tuning with multiple GPUs apparently you can't freeze the text encoder during training, at least, not the way i've done it
for me when i reduce the noise offset to 0.02 the results are getting way better
also it got rid of a lot of weird stuff for me
what value did you pick?
GPT4 is telling me i can't have offset noise trained in and also freeze the text encoder so early 😄
it says it is not going to work
0.1, is that too high 😄
oh whoa, GPT4 was right LMAO
god damn it
i hate it when the robot is right
0.02
try that
works quite well even for dark pics
also you need to increase your epochs with low noise
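For anyone wondering what the offset value actually changes: offset noise, as it is commonly described, adds one extra random constant per channel on top of the usual per-pixel Gaussian noise, scaled by the offset (0.1 vs 0.02 above). Below is a dependency-free sketch of that idea using toy sizes; in a real trainer this would be a tensor operation on the latent noise, and `offset_noise` is a hypothetical helper name.

```python
import random


def offset_noise(channels, height, width, offset, rng):
    """Gaussian noise plus one random constant shift per channel."""
    noise = []
    for _ in range(channels):
        # One extra draw per channel, shared by every pixel in that channel.
        # This is what lets training move the overall brightness of an image.
        shift = offset * rng.gauss(0.0, 1.0)
        channel = [[rng.gauss(0.0, 1.0) + shift for _ in range(width)]
                   for _ in range(height)]
        noise.append(channel)
    return noise


# Same seed, different offsets: the two results differ only by a constant
# per channel, and that constant scales with the offset value.
plain = offset_noise(4, 8, 8, 0.0, random.Random(0))
shifted = offset_noise(4, 8, 8, 0.02, random.Random(0))
print(shifted[0][0][0] - plain[0][0][0])
```

A larger offset means a larger per-channel shift, which is one plausible reason 0.1 behaved so differently from 0.02 in the runs discussed here.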
love how thorough the notes from LAION are on OpenCLIP
H/14 with big batch size works well, but unstable, and very hard to recover
Planned for 256 * 135M epochs from 2B-en
Spike at epoch 122. Tried a lot of stuff to recover
Only one thing worked: decreasing lr fast for 8 epoch, got 74% that way
Batch size 79k, starting lr 5e-4
1 week to train + many days to try to figure it out
800gpus
Doing 8 epoch with batch size 158k gave 75.4%
Finished up to 256 at batch size 79k in bfloat16 and reached 78.0%
so trying to fine-tune 2.1 on a huge cluster of GPUs is harder than training on a small group
i hadn't seen this page before now but it's fun how my results mesh with theirs
kinda wish they'd started from scratch once they figured it out again
seems to be doing a lot better with that low of a value
i had the same experience
especially for very dark and very bright. and for me after adding that low noise value the contrast got much better in general
omg omg omg
it's happeeennnningggg
ALL schedulers in diffusers will do zero terminal SNR now
oh Max is the one who did all the work, i just trained a model that allowed them to test it
@surreal lagoon can you friend/pm me i have 1 question regarding hardware
i hate doin that tbh i get a huge friends list full of people i don't know lmao
np. can you recommend any AI workstation with 2-4 4090?
what hardware setup are you on?
2 to 4 of them? 😮
threadripper, to start with
dual power supplies.. it's a lot
like, my 5800X3D and the ASUS X570p (AM4) are capable of having two GPUs but the 4090 uses three slots
things i have seen so far have capabilities for 2x 4090
so get one 😛
i already have 2 workstations with 1 each but it's soooo darn slow
i am a bit concerned that high batch sizes mess up my concepts
you want your entire dataset absorbed in a single shot if possible
that is the best, but, no one can do that
that's the only reason we batch stuff
so, the higher the better
i see
it just still doesnt make sense to me if i mix up the learning result of 2 different things in one update that it can still learn both the things you know
and mixing 6 of them IMO cant make it any better
just makes no sense in my head
i now have a pretty good National Geographic dataset
cool
i found a way to add another rtx4090 to one of the workstations xD
@surreal lagoon do you have any other useful tips for training larger datasets? i still have some issues where the details in the pictures are not picked up very well
nay
also any tips on how i can prolong the training and improve the dataset in the long run?
so far 100-200 epochs are gettting me somewhere but some concepts are still very badly generating
slower LR
well
you might need to up the learning rate and freeze more layers for 1 epoch
and then, go back to old settings
or unfreeze more layers
it's a game about tricking the model into a new space that is in the direction you want and then slowing down and refining it
if i reduce the LR the concepts mostly wont create at all
that's why the polynomial learning rate has a really high learning rate for a number of warm up steps
so it can move the model into a new zone that it needs to clean up the output of
then the learning rate decays and slows down
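A small sketch of one common shape for a polynomial schedule with warmup: ramp linearly up to a peak learning rate over the warmup steps, then decay polynomially towards an end value. Whether the trainer discussed here does exactly this is an assumption, and the function name and all numbers below are made up for illustration.

```python
def poly_lr(step, peak_lr, end_lr, warmup_steps, total_steps, power=1.0):
    """Linear warmup to peak_lr, then polynomial decay to end_lr."""
    if step < warmup_steps:
        # Warmup phase: climb from 0 to the peak.
        return peak_lr * step / max(1, warmup_steps)
    if step >= total_steps:
        return end_lr
    # Fraction of the decay phase that is still ahead of us (1.0 -> 0.0).
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (peak_lr - end_lr) * remaining ** power + end_lr


for step in (0, 250, 500, 5000, 10000):
    print(step, poly_lr(step, peak_lr=1e-6, end_lr=1e-8,
                        warmup_steps=500, total_steps=10000))
```

With `power=1.0` the decay is a straight line; higher powers drop faster early on and flatten out near the end. A constant scheduler, by contrast, never slows down for the refinement phase.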
im currently using constant scheduler
but i tried cosine and tbh i didnt see much difference
Hello! anyone using Runpod? I want to know if it worth the use
@tall condor ever seen this pixelation?
hm it's not always there
/imagine, living room, parquet, large French window, sunset, country-style furniture, sofa, fireplace
Meta key: Title values: ["RHODES"]
Meta key: Creator values: ["Cushman, Charles W., 1896-1972"]
Meta key: Date modified values: ["02\/03\/2022"]
Meta key: Subject values: ["Towers","Spires","Seas","Forts & fortifications","Coastlines","Buildings","Clouds","Waterfronts","Islands","Rhodes (Greece : Island)"]
Meta key: Roll Number values: ["4-65"]
Meta key: Date Created values: ["1965-04-04"]
Meta key: Source values: ["P13980"]
Meta key: Holding Location values: ["Bloomington - University Archives<br \/>Wells Library E460<br \/>1320 E 10th St.<br \/>Bloomington, IN 47405<br \/>Contact at <a href=\"mailto:archives@indiana.edu\">archives@indiana.edu<\/a>, <a href=\"tel:812-855-1127\">812-855-1127<\/a>"]
Meta key: Alternate ID values: ["465.37"]
Meta key: Campus values: ["IU Bloomington"]
Meta key: City values: ["Rhodes"]
Meta key: State/Province values: ["Aegean Islands"]
Meta key: Country values: ["Greece"]
Meta key: Genre values: ["Seascapes","Cityscape photographs"]
Meta key: Call Number values: ["P13980"]
Meta key: Frame Number values: ["37"]
Meta key: County values: ["Sporades"]
Meta key: Persistent URL values: ["http:\/\/purl.dlib.indiana.edu\/iudl\/archives\/cushman\/P13980"]
Meta key: Cushman Identifier values: ["P13980"]
generates the caption:
Generated title for image: rhodes towers spires seas forts fortifications coastlines buildings clouds waterfronts islands (greece island) aegean seascapes cityscape photographs sporades county
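A guess at how a caption like that can be assembled from archive metadata: walk a chosen list of keys in order, lowercase everything, strip punctuation, and skip any word that has already been used. This is not the poster's script. The function name and the choice of keys are assumptions, and unlike the output above this sketch also drops the parentheses.

```python
import re


def caption_from_metadata(meta, keys):
    """Join metadata values into one caption, dropping repeated words."""
    seen = set()
    words = []
    for key in keys:
        for value in meta.get(key, []):
            for token in value.lower().split():
                # Keep letters and digits only, so "&", ":" and brackets vanish.
                bare = re.sub(r"[^a-z0-9]", "", token)
                if not bare or bare in seen:
                    continue
                seen.add(bare)
                words.append(bare)
    return " ".join(words)


meta = {
    "Title": ["RHODES"],
    "Subject": ["Towers", "Forts & fortifications", "Rhodes (Greece : Island)"],
    "City": ["Rhodes"],
    "Country": ["Greece"],
}
print(caption_from_metadata(meta, ["Title", "Subject", "City", "Country"]))
```

Leaving keys such as "Roll Number" or "Holding Location" out of the list is what keeps archive bookkeeping out of the caption.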
I would like to run some dreambooth training with ShivamShrirao repo, two questions:
- Is it mandatory to provide a class_data_dir folder?
- Is there somewhere I could find good quality regularization photos (for men and women)?
gabinino: i recommend you start without regularisation
and see where it takes you. as far as i understand you can not just take any regularisation images, they need to be made with the model you train on
I have discovered a workflow that has never been explored before, which allows for studio-quality realism beyond expectations using Stable Diffusion DreamBooth / LoRA training. To achieve this workflow, it required an exceptionally high-quality dataset of classification / regularization images. Additionally, I developed a script capable of autom...
can be used for fine tuning
Thanks!
Hey Snubber,
I now know why you were getting all those downloads in Kohya when you kicked off training that first time. It's because you had the default model path in "Source model", that causes Kohya to go off and download all the checkpoints from the runwayML git.
Probably a bit late now, but what you need to do, and what i did, is press the paper icon button and open an existing checkpoint/safetensor model e.g. SD1.5, or RPGv4, or any other model that you likely have got installed into Automatic1111
I just set the path to a model that is in my Automatic1111 directory, and then Kohya just uses that and doesn't download anything
Okay cool! Good to know thanks for the info @jaunty grove !!
Did you get any Lora's trained in the end? I've done two Lora's for two different people. I had to try a few times with different learning rates, epochs, repeat etc, as I've read that for a person around 1500-3000 steps are enough.
I found with too low a learning rate for the unet, and main learning rate, it just didn't take. Too high, and it ended up just looking really bad. I'm still trying to figure it all out. Through trial and error I got my two people Loras to work, but the weighting seems really "heavy". By that I mean my Lora's default weight compared to other keywords is too high. I've no idea how to change the weightings that get baked into the Lora as part of the learning 😦
I haven't tried, been busy, but hopefully will get around to it tonight. That's good to know. Also I'm pretty sure you can put weights on your loras. At least I know that you can do it in comparison to the weights of other loras
civitai isn't loading for me but if you check out the example in here https://civitai.com/models/6526/studio-ghibli-style-lora
One example they use a zelda lora and the studio ghibli lora to make an image of zelda in ghibli style. And you can see they have different weights for the different loras
here it finally loaded
Yeah I can put weights on when I use it, but the default weighting feels too heavy, so in use I end up putting a low weight on it to bring it down a bit e.g. <lora:mylora:0.6>
ohh ok, interesting
I'm sure it's me missing something. Let me know if you find the same when you get a chance to try 🙂
will do
honestly, you don't need accelerator if you have only a single gpu
I observed that, too. I guess it's the same reason why people merge models and get better results. Setting your lora to 0.6 weight is essentially merging your lora with the base model in a 6:4 ratio
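A small sketch of why a 0.6 lora weight behaves like a 6:4 merge, assuming the lora is applied additively as base + weight * delta (the usual formulation; individual implementations may add extra scaling). The numbers and function names here are made up for illustration.

```python
def apply_lora(base, delta, weight):
    """Return base + weight * delta, element by element."""
    return [b + weight * d for b, d in zip(base, delta)]


def interpolate(a, b, t):
    """Linear blend of two weight lists: t * a + (1 - t) * b."""
    return [t * x + (1 - t) * y for x, y in zip(a, b)]


# Made-up numbers standing in for a few model weights.
base = [0.50, -1.20, 0.30]
delta = [0.10, 0.40, -0.20]  # what the lora adds at weight 1.0

lora_at_06 = apply_lora(base, delta, 0.6)

# "Merged" view: 60% of the fully applied lora model plus 40% of the base model.
full_strength = apply_lora(base, delta, 1.0)
merged_6_to_4 = interpolate(full_strength, base, 0.6)

print(lora_at_06)
print(merged_6_to_4)
```

Both lists agree up to floating point rounding, so lowering the weight at inference time is the same as blending the lora-modified model back toward the base model.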
can you share your learning rate, number of steps, and number of images, as well as the rank?
so it appears running with 2 GPUs is not halving the time for training, anyone know why?
also it shows like it is running for 400 epochs even though i specify 2
anyone know why
it's great though, it handles compiling the unet.
that's just not how it works
you just get to run a larger batch size but everything is limited by the main system doing the training
actually i think that the time is halving but for some reason if i specify 200 epochs it runs 200 epochs per gpu
so it will run 400 epochs
same for batch. i specify 6 batch but it does 12
however 1 epoch does run much faster
but what i'm not sure of is if i can just say 100 epochs if i want 200
but i thought higher batch size means more stable learning?
thus i could even increase the learning rate
or did i misunderstand that part
no, thats correct
i wonder what paralell mechanism is used by dreambooth
but thats a one liner xD
if i merge 2 models somehow the result is different if i merge A+B and B+A - size of the first is 5.8 GB, size of the second is 7.8 GB
anyone know why?
is the merge not combining both models into one and apply a weight if they both have the same key?
not during training
dunno, I thought it's also just unet.compile(). But to be honest: I didn't notice any performance improvement from compiling (besides waiting minutes until the compiler's done)
when you compile it and try and do certain operations it has to be recompiled and it breaks if you try and recompile it when it's already done
hey so I'm not sure if this belongs here or in #📝|prompting-help let me know if this isn't the right channel for this please :)
I'm trying to take a seamless tiling image and upscale it with Controlnet 1.11 tiling resampler, following this guide: https://stable-diffusion-art.com/controlnet-upscale/
Environment Info:
- A1111 webui
- dreamshaper model
- Controlnet v1.1 sd15_tile model
- Ultimate SD Upscale script
I'm getting it to generate nice upscale details, but it's not seamless at the edges of the image. It's generating abrupt lines where it repeats even though I am selecting the tiling setting under the img2img settings at the top.
**Pictures of settings attached:**
- img2img settings: https://cdn.discordapp.com/attachments/273241020077047810/1115441000010240070/firefox_2023-06-05_18-26-34.png
- Controlnet settings: https://cdn.discordapp.com/attachments/273241020077047810/1115441000463204433/firefox_2023-06-05_18-27-03.png
- Ultimate SD Upscaler script settings: https://cdn.discordapp.com/attachments/273241020077047810/1115441000861679656/firefox_2023-06-05_18-28-36.png
I'm using the website: https://www.pycheung.com/checker/ to check the images tile seamlessly, heres a comparison of the input and output images:
Images in the tiling checker
- Example 1: input image tiling: https://cdn.discordapp.com/attachments/273241020077047810/1115441002388410448/image.png
- Example 1: output image doesn't tile: https://cdn.discordapp.com/attachments/273241020077047810/1115441001721511986/image.png
- Example 2: input image tiling: https://cdn.discordapp.com/attachments/273241020077047810/1115441001113341962/firefox_2023-06-05_18-37-56.jpg
- Example 2: output image doesn't tile: https://cdn.discordapp.com/attachments/273241020077047810/1115441001369174046/firefox_2023-06-05_18-37-29.jpg
Does anyone know how to fix this?
I'm gonna post example 1 again so it embeds these two but not everything else, I don't want to clog up the channel lol
Sorry if this is the wrong place if so I'll delete:
Looking for guidance on where to go to learn to train "styles" as embeddings or loras or whatever where it won't affect the content much but will affect colors and lighting to make it match a style independent of content.
Also want to learn to train specific concepts better. I have been trying to make monuments into enormous fishtanks with mixed results. Is there a best practices here? I've trained hypernetworks and textual inversions with mixed success
Currently I am training by masking out the alpha on everything except my subject. That works ok but I don't have enough control. And I can't make the monuments (such as the leaning tower of pisa) into fishtanks well.
hello! checkout nitrosocke's github
im way too high but i want to help, can you distil it for me
you're saying the seams are always detected?
lol thanks, same tbh
So I made a seamlessly tiling image in txt2img. I'm trying to upscale that and keep the tiling.
Its always generating a bit at the edges and doesn't tile seamlessly after the upscale
you can see what I mean by using these two images in https://www.pycheung.com/checker/
you can try inpainting the seams maybe
unfortunately i don't think they will ever be truly invisible. the way it works is by inpainting them already
hmm I can try inpainting but I doubt it'll work, when I was inpainting images before without upscaling it was messing up the seams
I've gotten Topaz Photo AI to upscale them and keep the seamless tiling effect, but it doesn't keep generating the image with fine details like SD Controlnet does, it just kind of makes the low res image sharper and smoother looking
yeah, unfortunately inpainting the seams made it worse
its like the controlnet is ignoring the tiling setting at the top
i've managed to reintroduce the idea of smoking 😄
they're super smoky smokers now
i've upped the resolution of all my validations to see if i've managed to train out the model's tendency toward duplicate subjects, and voila, 1152x768
When running hires fix under latent, it always ruins hands that were previously perfect for me, I've been trying different settings but can't quite seem to get it. Anyone know the issue? Also, anyone able to explain the difference between the different Latent upscalers like Latent Nearest and Latent Nearest Exact? I've tried googling all these things to no avail, so I resort to bugging people here.
So it looks like the stable diffusion upscaler doesn't actually support seamless tiling images, but I might have found a workaround
- generate your texture with the tileable setting turned on (result: 512x512 image)
- tile the resulting image 2 by 2; meaning 2 tiles in X and Y direction = 4 tiles in total (result: 1024x1024 image)
- upscale the 2-by-2 tiled image as much as you like (result: for example 4096x4096 for 4x upscaling)
- crop the center part of the upscaled image (result in this case a 2048x2048 image)
- check that the center crop is seamlessly tileable - which it usually is...
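The tile, upscale, crop steps above could be sketched roughly like this. It assumes Pillow is installed for the image part, and `upscale_fn` is a hypothetical stand-in for whatever upscaler you actually use (ControlNet tile, Ultimate SD Upscale, etc.); only the crop geometry is pure arithmetic.

```python
def center_crop_box(width, height, scale):
    """Crop box (left, top, right, bottom) for the middle tile-sized region
    of a 2x2 tiled image that has been upscaled by `scale`."""
    out_w, out_h = width * scale, height * scale      # size of one upscaled tile
    left = (2 * out_w - out_w) // 2                   # half a tile in from the edge
    top = (2 * out_h - out_h) // 2
    return (left, top, left + out_w, top + out_h)


def tile_upscale_crop(tile, scale, upscale_fn):
    """Tile 2x2, upscale the sheet, then keep the centre crop.

    `tile` is a PIL image; `upscale_fn(image, scale)` is a placeholder
    for your own upscaler and must return a PIL image.
    """
    from PIL import Image  # imported here so the geometry helper has no dependencies

    w, h = tile.size
    sheet = Image.new(tile.mode, (2 * w, 2 * h))
    for x in (0, w):
        for y in (0, h):
            sheet.paste(tile, (x, y))
    upscaled = upscale_fn(sheet, scale)
    return upscaled.crop(center_crop_box(w, h, scale))


# For the 512x512 tile and 4x upscale from the steps above:
print(center_crop_box(512, 512, 4))
```

For a quick dry run without Stable Diffusion, something like `lambda im, s: im.resize((im.width * s, im.height * s))` can be passed as `upscale_fn`. The idea is that the seams the upscaler gets wrong end up at the outer border of the 2x2 sheet, which is thrown away by the crop.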
have you tried mixture diffusers
Not even sure what that is, so no.
Yeah I have it trained on the Arc de Triomphe and on fishtanks with textual inversions, but how do I make it do the Arc de Triomphe made out of a fishtank?
@jaunty grove first attempt at creating a lora based on me. It's pretty cursed. I did 1500 steps. I can increase the steps but are there any other settings I should try changing? (some people mentioned epochs, what does that do?)
When using the Tiling option (in auto11), does it change the Lora layers too, or only the base models?
I want to train a Lora that is optimized for Tiling and I don't know if I should enable Tiling (changing Conv2d layers' padding to circular) just for the base model or also for the lora network during training
Good first attempt. Can't say mine was much better lol.
So from all the reading I've been doing around 1500 - 3000 steps total is good for training on a person. My two that I've done that have worked out ok, and present the person in the AI art accurately were around 1900 steps total I think.
So there are a few elements that you need to consider when working out the total steps:
Total Steps = (Number of Images * Image Repeats * Epochs) / Batch Size
Number of images = Self explanatory
Image Repeats = Number of times the source image is shown to the model
Batch Size = Number of source images taken as a batch, which are presented to the model for training
Epoch = One full training pass over the whole set (Number of Images * Image Repeats, fed to the model in batches of Batch Size)
Think of each epoch as a training run: the more epochs you do, the more you're reinforcing the model by going over it all again and again. I've seen recommendations that around 10 epochs is good for training.
For learning rates, there are 3 of them. LR Rate, Text Encoder Learning Rate, and UNet Learning Rate. I've literally no idea of the specifics of these, but I do know that the larger the value the faster the learning rate (which isn't good), the lower the value the learning rate slows, but make it too slow and it doesn't learn a lot. It's a huge balancing act, and I'm doing it through trial and error.
@tall vault I set all 3 learning rates to the same value. If you put in something like 0.005 that will be too big a number, the learning rate is too fast, and it won't learn. I used 0.000005 at one point, and that was crap, it didn't learn a lot. Eventually I used 0.00005, and that worked out great, but I do feel it will depend on your images.
So for my first character that I trained I had 122 images. (I had more, but dropped them because the person had face paint on, and weird contacts, which was messing it up.) So 122 images total
To get around 1500-3000 steps total, I used the following:
(122 images * 15 repeats * 10 epochs) / batch size 6 = 3050 steps total
3050 was near enough for me. I was happy with that so then I set about training with the learning rates mentioned above. Once I tried 0.00005 it was good for me, for that particular training data set.
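The step arithmetic above as a small helper, in case you want to play with the numbers before kicking off a run. Rounding each epoch up to a whole number of batches is an assumption on my part (trainers differ in how they handle a partial last batch); the function name is made up.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size):
    """Total Steps = (Number of Images * Image Repeats * Epochs) / Batch Size.

    Assumption: a partial final batch in an epoch still counts as one step.
    """
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# The 122-image example from above: 15 repeats, 10 epochs, batch size 6.
print(total_steps(122, 15, 10, 6))
```

To land in the 1500-3000 range with a different image count, change repeats or epochs and rerun it.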
Another tip for person training is to make sure you have images of the person from different angles, wearing different clothes, different expressions etc, or the model will get overfitted (basically baked in), and rendering with your lora will mean your AI person always has the same kind of pose etc. Under-fitting is where it hardly looks like the character you trained.
Apparently you can get away with 10 - 15 images to train on a person. I used 122 for that first person. Second person I used 19 images, and it worked ok (though not as good as the first one)
OKay awesome! thank you so much for all the info! I am going to give it another shot tonight
Also I've been reading up on, and watching lots of videos on AI upscaling techniques. The best one yet, uses a combo of the ControlNet, and Ultimate SD Upscaler extensions to Automatic1111
Here's some examples where I took an original 512 x 768 render, fixed up the eyes via inpainting, and took the image in 2x scaling increments to a whopping 8192 x 12288. So four lots of 2x upscalings to get there :-).
The small pic, and the first pixelated face are the original 512 x 768 image (eyes fixed up via inpainting), followed by ones with increasing detail.
i recommend to add a noise offset of 0.02 to your lora and use a LR Scheduler with 10% warmup (constant with warmup or cosine)
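For anyone wondering what a "10% warmup" schedule actually does to the learning rate, here is a rough sketch of the multiplier applied to the base LR at each step. This is a generic illustration written from scratch, not the code of any particular trainer; the function name and defaults are made up.

```python
import math


def lr_multiplier(step, total_steps, warmup_frac=0.10, kind="cosine"):
    """Factor to multiply the base learning rate by at a given step.

    Linear ramp from 0 to 1 during warmup, then either hold at 1
    ("constant") or follow a half cosine down to 0 ("cosine").
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    if kind == "constant":
        return 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 0.00005   # the value that worked earlier in this thread
steps = 3050        # total steps from the example above
for s in (0, 150, 305, 1500, 3050):
    print(s, base_lr * lr_multiplier(s, steps))
```

The warmup part just keeps the first few hundred steps from hitting the model with the full learning rate.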
hi, do you use the inpainting control net? It makes a HUGE difference, in particular, if you inpaint with high noise strength
I had trouble for a long time, too. I found that the most important thing is to use a very low CFG when you want photorealism
like it is super easy to train on your face and generate nice anime images of you. But making it photorealistic is difficult. use a CFG of 3 or 4
I'm trying to use a studio ghibli style LoRa to create stylized images of me or anyone
do you think this is even possible?
yes, I found that works straight away
even textual inversion is often good enough for that
the funny thing is that I also got extremely photorealism anime portraits of myself without problems (like they fit my face in super high details) but as soon as I want photorealism things get hard
reducing CFG to an extremely low value helps a lot, though. Like I got almost good photorealism with that
how did you get it get your face details? training a lora?
or just textual inversion?
for best details you need a lora
but for an anime character (where you don't need all wrinkles and other details ;)) a textual inversion is enough
I have to say I found it easier to train on SD 2.1 than on SD 1.5 (in contrast to what most people say)
ok cool, any tips on settings/steps for anime?
hmm so at batch size 18 i'm seeing 2.23 seconds per iter and at batch size 6 i'm seeing 1.2 seconds per iter. how much faster is BS=18?
a train leaves los angeles at <x> miles per hour, ...
using a batch size of 18 processes samples approximately 1.614 times faster than using a batch size of 6 (8.07 divided by 5 equals 1.614)
For batch size 18: 2.41 hours * $3.18/hour = approximately $7.67
For batch size 6: 3.89 hours * $3.18/hour = approximately $12.37
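The throughput comparison above spelled out, so the 8.07 and 5 are not magic numbers: samples per second is just batch size divided by seconds per iteration. The helper names are made up; the hours and the $3.18/hour rate are taken from the messages above.

```python
def samples_per_second(batch_size, seconds_per_iter):
    """Throughput in training samples processed per second."""
    return batch_size / seconds_per_iter


def cost(hours, dollars_per_hour):
    """Total rental cost for a run of the given length."""
    return hours * dollars_per_hour


fast = samples_per_second(18, 2.23)   # batch size 18 at 2.23 s/iter
slow = samples_per_second(6, 1.2)     # batch size 6 at 1.2 s/iter
speedup = fast / slow

print(f"BS=18: {fast:.2f} samples/s, BS=6: {slow:.2f} samples/s, speedup {speedup:.3f}x")
print(f"cost at BS=18: ${cost(2.41, 3.18):.2f}, at BS=6: ${cost(3.89, 3.18):.2f}")
```

Each iteration is slower at the bigger batch size, but it covers three times as many images, which is where the net speedup comes from.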
still waiting for the model to be able to make a white background
some random samples from a training I ran last night using zero terminal SNR, it gets very close to white/black backgrounds
black and white backgrounds
settings, this was just sort of a big random dump of training data I have lying around, 13k images
doesn't help me much because i'm using diffusers implementation 😛
this is diffusers
I think there are some things that still need to happen in auto1111 or whatever since it mostly uses ldm code and patches, some issues with getting the extensions that supposedly do the CFG rescaling to look right
well you're only training the last 2 layers of the TE
damn, you ran 20 epochs of training? how many samples?
you're using EveryDream2, not diffusers
https://github.com/victorchall/EveryDream2trainer/commit/81b7b00df736894be0cd8a053656e062690a7cde
odd change they made 3 days ago with no comment why
i wish it would be better at faces already, how many do i have to show it
definitely understands the overall concept
i guess it's actually doing much better than in the image i showed. i'm using offset noise as well as terminal SNR
this is about 5000 steps earlier in training. it did not want to do a white background at all
i'm assuming that it's going to take a while to fully train all of this new noise schedule i'm applying
@surreal lagoon can you please explain what generated images mean from this?
does it mean they were artificially generated using the training images fed into the algo?
they are completely different from the ones that were used to train it?
https://github.com/NVlabs/stylegan
i don't know
ok
it's still diffusers, its a bunch of augmentation on top to allow multiple optimizers be used, layer freezing, multiaspect, etc
given what I'm seeing here I think offset is not required with zero terminal snr, I think that was the one of the points of the paper as well, offset noise is not very stable over time
that training was on 13k images, fairly random assortment of things
offset noise helps it converge more quickly
might have to stop applying it at some point but it does help still, even with trailing and rescaled zero SNR betas
at least to me, it looks like a stable version of offset noise. with offset noise you need different amounts of it based on how long you train, like offsetnoise*0.1: the 0.1 is too much if you train more than a few thousand steps, and the model will turn into splotchy figures on black etc
if you hand tune offset noise down its more stable for longer periods, but not entirely stable
well i trained at 30k steps without offset noise and the terminal SNR stuff didn't help anywhere near as much as both together did
the trained_betas is easier to pass into the from_pretrained and takes care of the alphas and betas
diffusers handles it more elegantly than changing both betas and alphas manually, just easier, but you need a schedule/timestep curve to run through the code snippet from the paper to "correct" it, so step 1 is load normally, then load again with the corrected schedule and discard the temporary scheduler instance
pipeline.scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    rescale_betas_zero_snr=True,
    guidance_rescale=0.3,
    timestep_scaling="trailing"
)
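For reference, the "correct the schedule" step mentioned above can be sketched in plain Python. This follows the zero terminal SNR rescaling as I understand it from the paper (shift sqrt(alpha_bar) so the last timestep is exactly 0, rescale so the first is unchanged, then convert back to betas). The starting schedule is an assumption: the scaled-linear betas with 0.00085 and 0.012 over 1000 steps that Stable Diffusion configs normally ship with. Function names are made up.

```python
import math


def scaled_linear_betas(n=1000, start=0.00085, end=0.012):
    """Assumed starting schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(start), math.sqrt(end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]


def rescale_betas_zero_snr(betas):
    """Rescale betas so the final timestep has zero signal (SNR = 0)."""
    # Cumulative product of alphas = alpha_bar at each timestep.
    alphas_bar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alphas_bar.append(running)

    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]

    # Shift so the last value is 0, scale so the first value is unchanged.
    sqrt_ab = [(s - last) * first / (first - last) for s in sqrt_ab]

    # Convert sqrt(alpha_bar) back to per-step betas.
    alphas_bar = [s * s for s in sqrt_ab]
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]


corrected = rescale_betas_zero_snr(scaled_linear_betas())
print(corrected[0], corrected[-1])
```

The resulting list is the kind of thing that would be handed to the scheduler as `trained_betas`, per the messages above. The last beta comes out as exactly 1, which is what "zero terminal SNR" means in practice: the final timestep is pure noise.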
I'm a few subversions behind, not sure that was in
there are a few of us hacking on it regardless, its sort of a backdoor feature until we sort it all out and document
it's still a WIP pull request i've merged on my fork and have been testing
ah cool
this guy is also working on it: https://github.com/huggingface/diffusers/compare/main...AMorporkian:diffusers:main
at least so far I think zero term works, several of us getting great samples but I think there may be issues in the auto1111 whatevers that handle inference side, but works like a charm on diffusers since the trained_betas actually get saved right in the scheduler_config.json, so works on invoke, sdgrate, samples from actual trainer, etc
perfection
d-adaptation adam also seems to be working well for some people but unfortunately not very efficient right now
should have AdamA in soon
well the offset noise has done what i've wanted so now i've removed it, at 12k steps
let's hope he comes down to earth and improves on the next ckpt
as for faces it needs to see the same face at least 500-1000 times for it to reproduce it
my concepts that run very little times suck very hard with the faces
I did try that as well, it still showed the seams very clearly. Thanks for the suggestion though
is there any tool that can convert regular text captions into tokens/tags?
tokens and tags are different things, what is it you're actually trying to accomplish?
well if i use wd14 or any other clip captioner i do not get tokens
like car, red, open window
its more like "a red car with open windows"
and i am wondering if there is a tool that can convert that into tokens
those are tokens
people get caught up on this stuff and think that something little is going to solve their problem when it's not even close to being the issue
like, what problem are you trying to solve with that
How much quality/accuracy is lost if you merge multiple models together?
textcap models like blip write sentences and phrases, but some caption utilities do something they call "clip flavors" which is just trying to figure out if your image is visually close to a bunch of words in a dictionary, i.e. tags,
some of caption utilities do both, use blip to create "a man standing in a park" then clip flavors would add something like "claude monet, daytime, oil on canvas, outdoor"
@hot breach ok so offset noise breaks things a lot more now that terminal SNR is in there
without it, training goes better
offset noise is unstable, ztnr should be stable
getting things to play nice in auto1111 may be a challenge, diffusers handles sharing the data about the updated beta schedule better since it can be explicitly shared in the scheduler_config.json
that's ztnr only, no offset noise
i don't use automatic
once i hit 10k steps of training though, the thing starts screwing up
10k -> 12k -> 14k
above is 30k or so steps at batch 15
i'm at batch 12
also grad accum 6 so effective batch size close to 100
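Quick sanity check on "effective batch size": it is usually taken as per-device batch size times gradient accumulation steps times number of GPUs (data parallel). A tiny sketch with a made-up helper name:

```python
def effective_batch_size(per_device_batch, grad_accum_steps=1, num_gpus=1):
    """Samples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


# Batch 15 with 6 gradient accumulation steps, as described above.
print(effective_batch_size(15, grad_accum_steps=6))

# The earlier 2-GPU case: batch 6 per GPU shows up as 12.
print(effective_batch_size(6, num_gpus=2))
```

No gradient accumulation corresponds to `grad_accum_steps=1`, so the effective batch is just the per-device batch times the GPU count.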
oh wow
that's a lot higher than mine, i have zero gradient accumulations in use
i'll restart from 10k steps with a higher batch and gradient size since i'm not interested in speeding through this training
unet LR 3.5e-6 constant, TE only unfreezing last 2 layers with lr 2e-6 cosine schedule
some of those settings are somewhat haphazard as I experiment but they're not far off
i accidentally unfroze my whole text encoder for a few hundred steps once
it was about halfway into 30k steps
works with brighter stuff too
i've unfrozen a couple more layers of the TE to see whether this helps bring the weights up or whether it makes it worse 
we'll see, i guess
my assumption is that it makes it worse
Okay maybe not the greatest example, but it should suffice. This is what I was referring to with these. (excuse the cat example, just wanted a really obvious choice its not trained on cats) These are tested with ((masterpiece)), outline, cel_shading cat, <lyco:CelShading-000001:0.75>,1 and ((masterpiece)), cat, <lyco:CelShading-000001:0.75> (with xyz plot on epochs/weight), for style loras, I've always heard it should be avoided if possible to make sure that you don't have to enter anything into the prompt style-wise. How could I avoid this? Would I prune all variants of "outline" or "cel_shading" from my training data?
so damn close
How do i inpaint a fist in this pose, with the palm facing towards the viewer?
https://i.imgur.com/60pjsKM.png
i cannot seem to do it, i keep getting fists pointed the opposite way, with the back of the hand facing the viewer, even when inpainting over an image like this.
i've tried using variations on "palm facing viewer" and having knuckles in the negative prompt. But all i'm getting is either high quality inverted fists, or garbled flesh spaghetti
i have even tried inverting the colors of the fist in photoshop, and it made no difference
wrong channel @woeful goblet
how the heck did you get the discord number 0001
Wow dude... looks amazing ✌️✨
thanks, 2.1 is a workhorse
Mind sharing more details about how u did this fine-tuning??
I am currently trying to get my fine tuned model to give at least some level of realistic faces... but well, uk 1.5😂
Ig i'll try those configuration settings on 1.5... see how things go with that
Try using a lower cfg
Damn
does LORA add new info, or does it just tune your prompt to get the best result, like embeddings do
LORA is more or less same as dreambooth. However, it depends a bit on the implementation you use
is dreambooth the same as embeddings 😛
okay 😜
Dreambooth finetunes the complete model.
Lora finetunes large parts of the model, depending on the used implementation.
so it's not just the embedding but the text encoder and the unet
fine-tuning progress going super well this time
still quite a lot of contrast tho
that's my prompt asking for it
for the end of the training run i've added more faces to the dataset. that collection worked well before but if i used it for too long it started picking up watermarks
hoping that resolves this munchkin face issue
it did on a separate training run 🤞🏽
can we expect one of these models, which you invest so many gpu hours in, online and downloadable soon? ;D

