#finetune
oh, i mean the batch size and count.
got a little side-tracked.
Cause I was testing out the anythingv3 model to see if I can get highly detailed eyes like the ones from others that inspired me
(Not mine)
But nothing of the sort yet, just a horrible-quality mess of eyes...
(Mine)
thats a resolution issue
youll never get great faces or eyes on a 512x512 image
that's why there is a whole channel here called upscaling
the ones that inspire you have probably been upscaled at least once if not twice
I don't start to see faces look really good until 1024x1024 or higher
That's the thing, I used Highres fix to increase the resolution. I even used the img2img sections to upscale the images: ESRGAN_4x and also R-ESRGAN 4x+ Anime6B.
But it got worse...
highres fix does not add detail
it just resizes the same image
take your image from text to image, send it to img2img
under script select SD upscale,
set your denoising strength between .2 and .3
since its img2img and you wont lose your original feel free to play with denoising and try a few times
down there at the bottom where it says script none
you drag an image up top
or send it from the txt2img buttons
click where it says none and there is one called upscale
that will actually add details not just change the resolution
SD upscale?
yes
no
when you do the SD upscale it already says what scale factor
and its 2 by default
so its going to make a 1024by1024
Well the Image is 512x768 actually.
thats fine
it will be 1024x1536 or whatever
with enough resources you could then plug your 1024x1536 thing into the image to image and run SD upscale again and get a 2048x3072 image etc
it usually falls apart on me when I try to do that though
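A quick sketch of the size arithmetic being discussed, assuming each SD upscale pass simply multiplies both sides by the scale factor (2 by default, as mentioned above). The function name is made up for illustration and is not part of the webui.

```python
def sd_upscale_size(width, height, scale=2, passes=1):
    """Return (width, height) after `passes` rounds of upscaling by `scale`.

    Hypothetical helper: it only does the multiplication, it does not
    call the webui or touch any image.
    """
    for _ in range(passes):
        width, height = width * scale, height * scale
    return width, height


print(sd_upscale_size(512, 768))            # (1024, 1536)
print(sd_upscale_size(512, 768, passes=2))  # (2048, 3072)
```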
Should I do this many times to find the right pair of detailed eyes, like in img2img and Inpaint?
Along with the batch count and size?
I don't usually batch the upscales as they are so heavy
but you might try making sure your prompt has information about eyes in it
and for anime generally its the danbooru tags they are trained on
so you can go to danbooru and search for some images with eye tags you like
Oh, one thing. Does SD upscale work the same for txt2img?
I dont think you can do it on txt2img because what is it denoising then
unless it generates an image and then denoise on that image
Ah, ok then. It's taking its time upscaling right now...
Im generating batch img2img for another video
It's done upscaling
And it just upscaled into multiple images, like 7 images in one!
and I mean it in a serious fashion.
did you change any of the settings
it usually tile maps with some overlap but stitches them together
You mean the tiling one or the 64 one?
I didn't use the Tiling but I did use the SD upscale tile maps around 64.
I just left it there.
Anyhow, I should hit the hay. Thanks for the help/information man, I'll try to keep learning around the webui and stuff!
And you keep doing what you're doing!
Anyone got insight into the distinction of concepts in dreambooth? Like are all 3 datasets smashed together into the same latent space? Or are they trained sequentially, in that order? I ask because It seems like certain scenarios have different training results depending on the order of samples trained, and it'd be nice to treat this as a queue/sequencer.
on the 1.5 webpage it says the v1-5-pruned is "suitable for fine-tuning", this is the larger 8 gig model. Does that mean it's better for textual inversion? or when they say fine tuning do they mean, like training the model on additional iterations using the original 1.5 source training data
is the 1.5-pruned model better at creating textual inversions than the 1.5-pruned-emaonly model?
Did you manage to fix the problem? I try to train with google colab but got the same error
hi guys! I'm working on a way to quickly train different people's faces to replace on images already created... what is the best way to do so, hypernetwork or embeddings?
what is the difference?
Also, whatever I do I can't train an embedding on SD 2.1, I keep getting black pictures... what can be the problem?
How can I create such a matrix with different sampling methods?
at the bottom of your webui
click where it says script and there is one called X Y plot
pick sampler for one of your plots and then type like Euler a, Euler, DDIM, Heun etc
its both case and spelling sensitive
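Since the sampler list is case and spelling sensitive, a small check like this can catch typos before you paste the list into the X/Y plot field. It is only a sketch: `KNOWN_SAMPLERS` holds just the names mentioned above, not the full list your webui version offers, and the function name is hypothetical.

```python
# Only the sampler names mentioned in the chat; extend this with the exact
# names shown in your own sampler dropdown (assumption: they must match exactly).
KNOWN_SAMPLERS = ("Euler a", "Euler", "DDIM", "Heun")


def parse_sampler_axis(text, known=KNOWN_SAMPLERS):
    """Split a comma-separated sampler list and report names that do not match exactly."""
    names = [part.strip() for part in text.split(",") if part.strip()]
    unknown = [name for name in names if name not in known]
    return names, unknown


names, unknown = parse_sampler_axis("Euler a, euler, DDIM, Huen")
print(names)    # ['Euler a', 'euler', 'DDIM', 'Huen']
print(unknown)  # ['euler', 'Huen']
```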
Great, Thanks!
is there any way i can change the settings of stable diffusion in the UI to make it use my RAM to work more efficiently?
Any ideas why? Same thing on colab, runpod, my local PC... does embedding training work on that model?
you need a yaml file to use the SD 2.1 model i know that
I assume you have generated images with SD 2.1 successfully then?
can anyone help me learn how to upscale without this sort of result?
or am i stuck with skin teeth?
I have no idea what I'm doing. I was experimenting with SD Ultimate Upscaler with these settings
havent used that one
whats your base image look like
a lot of times for me the images fall apart when I go from 1024x1024 (or 1024x1500 or whatever) to 2k by 2k
because the "scale" the model is trained on doesn't think of things being that large and ends up making extra heads and stuff
so I usually only do 1 2x upscale using the SD upscale script I don't have that ultimate one
Do you have the normal sd upscale script
yep
guess i should try it? recommend any settings?
Is it possible to extract only a random subset of key frames from a video with ffmpeg?
I am using: ffmpeg -skip_frame nokey -i video.mp4 -vsync 0 -ss 00:03:56 -t 00:00:36 -f image2 frames/f1-%06d.png
but that is crazy slow for something like a short film or movie
Ive had really good luck with default settings, denoise .3-.4 and LDSR model
txt2img
the 2x upscale LDSR default settings
the clips, the crispness of the button/light on the back of the arm etc
see what happens when I try to upscale again lol
oy
worked this time
denoise .4 notice things like the green thing on the shoulder became like a badge etc
If I wanted to train on a bunch of images that are naturally and very distinctly not square, like 16:9 game screenshots (where capturing that UI appears in the corners is important), would it be better to add white padding above/below to get to 1:1, or cut it into two slightly overlapping squares?
Or, hell, maybe I'll just train it at 16:9 and report back, just to see
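To make the two options concrete, here is a sketch of the arithmetic for a landscape screenshot. It only computes padding amounts and crop boxes; both function names are made up for illustration, and which option trains better is still the open question above.

```python
def pad_to_square(width, height):
    """Padding (left, top, right, bottom) needed to make the image 1:1."""
    side = max(width, height)
    pad_x, pad_y = side - width, side - height
    return (pad_x // 2, pad_y // 2, pad_x - pad_x // 2, pad_y - pad_y // 2)


def two_square_crops(width, height):
    """Two full-height square crop boxes (left, top, right, bottom) plus their overlap in pixels.

    Assumes a landscape image no wider than 2:1, otherwise two squares
    cannot cover the whole width.
    """
    if width < height or width > 2 * height:
        raise ValueError("expected a landscape image no wider than 2:1")
    side = height
    left_box = (0, 0, side, side)
    right_box = (width - side, 0, width, side)
    overlap = 2 * side - width
    return left_box, right_box, overlap


print(pad_to_square(1920, 1080))     # (0, 420, 0, 420)
print(two_square_crops(1920, 1080))  # ((0, 0, 1080, 1080), (840, 0, 1920, 1080), 240)
```

Note that for 16:9 the two squares overlap by 240 px out of 1080, so the overlap is moderate rather than slight; both crops keep one pair of UI corners each.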
AI seems really bad at text
so UI seems like it would be sort of a terrible thing to train
I think you'd be better off having the UI be layers you add over the generated images?
Oh for sure, I'm cool with the text being absolutely junk, I'm just interested in the UI art, layout patterns, etc. So if I asked it to show me a mock-up of a fantasy character selection screen it'll get it in the ballpark, with some AI garbled text where the name and stats are, and that would be enough
I've got it doing some interesting things with text2img already just with an embedding created from a few 1:1 screenshots from one game and then upscaling, generating at 904x512 with sd1.5
Wracking my brain since yesterday on why my tuning wasn't working. It'd basically ignore half my samples. Gave up, screwed around in image2image. Dropped one of my samples in, and ... it failed? Had a bad TIFF header or something. Half my samples did too. The half that were the result of my hasty upscaling.
So at least I've found a solution, but man
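One cheap way to catch this kind of problem before training is to check that each file's first bytes match its extension. This is only a sketch: the signature table covers just PNG, JPEG and TIFF, the function name is hypothetical, and a matching header does not prove the rest of the file is intact.

```python
import os

# Well-known file signatures ("magic bytes") for a few common formats.
MAGIC = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".tif": (b"II*\x00", b"MM\x00*"),   # little-endian and big-endian TIFF
    ".tiff": (b"II*\x00", b"MM\x00*"),
}


def header_matches_extension(path):
    """Return True/False if the header matches the extension, None if the extension is not in MAGIC."""
    ext = os.path.splitext(path)[1].lower()
    signatures = MAGIC.get(ext)
    if signatures is None:
        return None
    with open(path, "rb") as handle:
        head = handle.read(8)
    return any(head.startswith(sig) for sig in signatures)


# Usage (the folder name is an assumption):
# for name in sorted(os.listdir("training_images")):
#     print(name, header_matches_extension(os.path.join("training_images", name)))
```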
training tip: when training pixel art, be sure to scale it up to fit in 512x512, making sure you dont lose the quality in the process. keeping it small (like 60x60 or even a clean divisible 64) leads to blurring (which a lot of software tends to do with really small images)
yeah upscale using nearest neighbor
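For reference, nearest neighbor with a whole-number factor just repeats every pixel, which is why it stays crisp. Below is a dependency-free sketch on a plain 2D list of pixel values; both function names are made up, and in practice you would use your image editor's or library's nearest-neighbor resize (Pillow's NEAREST filter, for example) instead.

```python
def largest_integer_scale(width, height, target=512):
    """Largest whole-number factor that keeps the image within target x target."""
    return max(1, target // max(width, height))


def upscale_nearest(pixels, factor):
    """Nearest-neighbor upscale of a 2D list: each pixel becomes a factor x factor block."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


print(largest_integer_scale(60, 60))  # 8  (60 * 8 = 480, pad the rest up to 512)
print(largest_integer_scale(64, 64))  # 8  (64 * 8 = 512 exactly)
print(upscale_nearest([[1, 2], [3, 4]], 2))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```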
Can anybody tell me the difference between v1-5-pruned-emaonly and v1-5-pruned?
I read the second is better for training, is this true?
Is there a benefit to using xformers on a 24GB setup? I turned it on, and seeing some unpredictable results.
emaonly tends to be better for generation and ema+non-ema (full weights) is typically better for training
I'm trying to make an embedding that makes a character like the one in the first picture, but after training they don't come out properly and I get these results, any advice?
Thank you Alicat!!
Should I use CLIP or deepbooru for making an embedding or hypernetwork? Is there any case when one is superior to the other?
Also, how many images would you say are necessary for a good embedding/hypernetwork? Should I go with 140 CLIP annotated images or something like 25 but with manually corrected annotations?
More questions:
- When I launch the training for an embedding or hypernet automatic1111 is training it using the model that has been selected right?
- Is it possible to train (using a 512*512 model if the above is true) an embedding in 256*256 or even 128*128 without encountering issues?
Are the other ones examples of your embedding? or more examples of what you want?
generally since anything anime uses danbooru youll be better off also with deepbooru tags, remember to change to subject filewords if its a character
for training you want to use a generic model like the base SD 1.5 or WD 1.3 etc don't use more limited custom models while training, however do not trust the auto generated images while training, set your checkpoint saving to 200-500 and then put all those embeddings into your embedding folder and test each one on your custom model after training
also I am not really sure about training things for smaller than 512 by 512 images most of the models are trained off 512 by 512 tho
The images are a bunch of portraits of different people and I want to generate more portraits, should I still use subject filewords for this?
Alright, I generate an image and an embedding every 250, the preview are kind of bad even though I've used my own prompt but I'll test some of them anyway
so are you wanting to learn to make like a yearbook page as a style?
or to learn how to make portraits
Actually it's for a game I love, I'd like to make more portrait for pilots in Starsector
They look like this
I have 140 of them from the game, deepbooru generated weird annotations so I went with CLIP but I was wondering if I should reduce that number to a few dozen only and make the annotations myself (or at least modify them)
preferably on a white or black background
and you should try to make them 512x512 even if in the end you will downscale
What I want is making more characters like this, keeping the style (I guess that answer my earlier question about subject/style)
the captions are more about what other things they might describe, like if you want to call your embedding and also specify brown hair
Right and if you are ingesting those right now you are confusing the model
That's what I did, resized them to 512 before training but I trained on dreamlike diffusion, I haven't had time to test the result though, I stopped at ~5000 steps
ah! interesting
- make them on white or black not transparent backgrounds
alright, I'll run another training on sd 1.5 then, thank you!
set it to 4000 steps, learning rate default .005 save image every X to 200-500
ignore what you see in the training outputs
Is this the right one?
when its done go to textual inversion/image embeddings and copy those to your embeddings folder
dunno I have this one
waifu diffusion for whatever reason is also a good training model
I never really use those two for image generation but they work well for training
I have a different hash, will check
You are making each head one image or sets of 2 heads?
I think you want to use style_filewords for what you want
like Starsector_style
1 head per image, just basic portraits
will do
put whatever you want to use to activate it in the init text on the create embeddings tab
its often the same as the name for me
emb_star_sector
or whatever
your textual inversion directories will spit out a bunch
copy these into the main embeddings folder, swap to your custom model and try a prompt like "a portrait of an asian general, emb_star_sector-200"
Yup, I already made some embeddings but did them naively before
then you can keep changing the number to try the different amounts of learning
usually for me, when training to 4000 steps, the 4000-step one is not my "best" one
you can also use prompt s/r from the x/y plot, and maybe another setting, to really test which embedding is the best out of them all.
somewhere in the 2k-3k range will be best
good idea
noted
yeah could totally matrix a prompt
I usually also X/Y checkpoints to try my embedding on a bunch of models
"Number of vectors per token", I've seen 8 and 10 used, is that a decent value?
I use 8
great
It's the "number of keywords" you want your embedding to reference sorta? as far as I understand
So if you were just making like a "neon" style it doesn't really need to know other words
but if you are describing a portrait of a person you expect more adjectives
alright, sounds right to use 8 then, glasses, helmet, jacket, not too many variations on that in my training set (nor in the results I want to get)
well if you don't care about specifying them anyway when you do a batch prompt its not that important
I made this one. could be better but its hard in the sense NA miatas have popup headlights so it gets confused
lol, even at 200 steps it already looks way better than what I had before
(that's from the preview but still)
yeah so generally it should be better than it is in the preview in the end
great!
I've also read in a guide on rentry.org that I should use either deterministic or random here (random being slower), do you agree?
I've used both once and deterministic I can't really tell what it does
Although it just came to me its probably like which reference image it uses?
like does it go through each once, or similar ones, or random?
Oh ok, I checked and it's about VAE apparently (I unloaded mine)
Ensure you have "Choose latent sampling method" set to deterministic or random (random could be better but incurs a performance penalty, deterministic will suffice), do NOT have it set to "Once" or the VAE will not be correctly sampled which could lead to hypernetwork death when training over many iterations
Its also talking about hypernetwork and not textual inversion?
not sure about how that works honestly
erf, you are right
I'm still unsure what the difference between those two is in practice
HerbertWest 3 points 21 days ago
I'm also interested in the answer to this. Mostly because I've tried training with all methods and my best results are still with "once," which everyone else has said sucks. It seems to work the best for me, which is bizarre. Maybe it has to do with my dataset in some way.
PervertoEco[S] 1 point 19 days ago
Thank you, can confirm.
"Once" gives better results, with greater subject similitude between generations.
"All tutorials recommend "Deterministic" over "Once" or "Random", but do they actually do?"
ah! given how quick 4k steps are I'll try both
yeah let me know if you see any noticeable difference
I tried adding about 10 more images specifically of helmets to my space marine embedding and it still struggles
I found this on github: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4680
ahah, I tried space marines too, one thing I have noticed is that the guns tend to be held in hand like a warhammer (handle held vertically in the closed fist) rather than like guns/blasters should be, I think this word (warhammer) confuses it.
I noticed a lot of autocaptions will use the word warhammers but I don't use that
It also tends to produce images of a warhammer game board, I managed to get rid of that but it's still an issue sometimes
well thats why I was training an embedding
and only used images that are not of figures
those are old generations, I'm better at it nowadays
yeah looks more like a knight than a space marine
almost everything is pretty good obviously not chapter colors except the damn helmets lol
Yeah it's really hard to get details right with AI
yeah
I'll try this when I'm done https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4680#issuecomment-1320629832
yeah same, I've never seen anyone talk about it and I'm curious what training so quickly can do
sorry to bring this back but I'm wondering if I could train without any caption?
I mean it does it automatically with the preprocess tab so why would you
I also don't know the answer however
Sometimes the captions are very wrong, adding stuff that's "incompatible" with the pictures
like this one for example, CLIP gave me "a digital painting of a man in a suit and helmet with a helmet on his head and a gun in his hand"
Given they are all portraits and I don't want any gun there (and no portrait in the training set has one) maybe it would be better to just leave out the captions, at least that's what I'm wondering about
With 30 images, gradient accumulation at 20, 4 tokens and 0.1 learning rate the training has slowed dramatically.
It went from 4it/s to 10s/it
after 90, 110 and 100 steps, ahem:
I'll try again with less steps, I forgot to enable the preview before
Gradient accumulation steps just allows for higher batch size without the increase of vram but at the cost of speed/ram
For the webui dreambooth extension, it's Batch Size x GA = Equivalent batch size
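The relationship stated above, written out as a tiny sketch. The function name is made up for illustration; the example values are arbitrary except for the gradient accumulation of 20 mentioned earlier, where a batch size of 1 is an assumption.

```python
def equivalent_batch_size(batch_size, grad_accum_steps):
    """Batch Size x GA = equivalent batch size (images seen per weight update)."""
    return batch_size * grad_accum_steps


print(equivalent_batch_size(1, 20))  # 20  (GA of 20 with an assumed batch size of 1)
print(equivalent_batch_size(2, 4))   # 8
```

Only `batch_size` images sit in VRAM at a time; the other factor is paid for in time, which fits the slowdown reported above.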
oh ok, I had no idea, I'm just playing with the settings a bit randomly now
The selected image is pretty good, I'll try inference with this one
Strangely I'm not seeing "embedding" in the list for X/Y plots
This last experiment didn't work well it seems, I'm not getting anything close to a portrait most of the time
Because you use Prompt S/R (Search and replace) to replace your embedding token with all the variations
That makes sense, in case you want to use multiple embeddings, thank you
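A sketch of what Prompt S/R does with the saved embedding checkpoints, assuming the behaviour described above: the first entry is the text to search for and each entry in the list replaces it in turn. The helper names are hypothetical; the `emb_star_sector-200` naming follows the example earlier in the chat.

```python
def checkpoint_names(base, save_every, max_steps):
    """Names like 'emb_star_sector-200', one per saved checkpoint."""
    return [f"{base}-{step}" for step in range(save_every, max_steps + 1, save_every)]


def prompt_search_replace(prompt, values):
    """Return one prompt per value; values[0] is the search text and must appear in the prompt."""
    search = values[0]
    if search not in prompt:
        raise ValueError(f"{search!r} not found in prompt")
    return [prompt.replace(search, value) for value in values]


values = checkpoint_names("emb_star_sector", 200, 600)
for line in prompt_search_replace("a portrait of an asian general, emb_star_sector-200", values):
    print(line)
# a portrait of an asian general, emb_star_sector-200
# a portrait of an asian general, emb_star_sector-400
# a portrait of an asian general, emb_star_sector-600
```

In the webui itself you would paste the comma-separated names into the X/Y plot value field with the type set to Prompt S/R.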
so the only thing gradient does is the same as batching?
or I should say its a lower vram option than straight batching?
It allows some freedom to have higher batch sizes without the need for that extra VRAM, yep
Though, I'm not sure how much RAM it eats up 
ah I don't care too much about a larger batch size just for faster training
well usually my VRAM is full and my RAM is at like 50%
And bigger batch size isn't necessarily better and all that 
and here I was hoping for a secret technique that would make more details (like the stupid helmets) get learned into my embedding
I even tried training with some lower learning rates wondering if lower rate would mean more "detail"
Right right
yeah, lower learning rate could just mean it doesn't learn anything
There was a really cool visual with learning rates, I'll see if I can find it
What kind of batch size can I use with 12GB? Is that the same memory consumption as the inference batch size?
About learning rates there is this article which includes some good x/y plots
The second and third pictures are the failed examples after training
hey newbie question, I found some LoRA embeddings in .safetensors format. Using google colab, I loaded it in exactly like a .ckpt model but it just doesn't work as intended. The thing is, how can I properly use LoRA, preferably on google colab? TYSM!!!
I updated Dreambooth in A1111, I cant train anymore. Get the following message: Returning result: Exception training model: ''CheckpointInfo' object is not subscriptable'.
Any ideas?
I can't really answer your question on how to load them. But there appears to be different formats for saving and loading LoRa, one is the one based on the lora github repository. The other I think comes from A1111. They are different formats, saved and loaded differently.
Here you can find reference LoRa that are guaranteed to work with the latest version of lora_diffusion library: https://github.com/cloneofsimo/lora/tree/master/example_loras
are you including full body in your prompt? do you have enough images of the specific outfit? Are you trying to make a style or subject IE is that a person or a style
have you manually captioned any of your images?
I've been trying to train a LoRa (using the pti scripts in cloneofsimo/lora). But I'm facing an issue when I try to train SD 2.1, the textual inversion works fine, but the model training always results in nan loss. Been trying to tune the parameters available in the CLI, but to no avail. Anybody faced anything similar?
@split onyx ^ ?
Hi.
Oh.
Have you been using 512 512 images?
Oh hold on, ive not yet trained with SD2.1 since v0.1.0
So it might not work. I'll work on it!
I can train the LoRa just fine using 768x768 using SD 2, and can train using 512x512 using SD 2.1 Base
Appears so, no idea why
I'll run some more tests, but I'll compile a matrix of what works and not
I can indeed train 768 with sd2.0. Getting a loss. Using this train script https://gist.github.com/functorism/3428c024f6f8fe73041d9e405885fa53
I am trying to get the character in the first picture. I have about 30 images
Is she always in that outfit?
For example looking at Danbooru your prompt should have lace-trimmed dress, and detached sleeves or wrist cuffs in it
I think you need to work on your prompt I can get this without even having an embedding
@carmine zinc
"a photo of a young girl wearing a red white and blue (lace-trimmed dress), (bustle:1.3), (wrist cuffs), hair bow, blue thighhighs, solo focus, looking at viewer, hand up, masterpiece, best quality, ultra detailed, (full body), (anime:1.1), cowboy shot, "
I wrote this prompt in the last couple minutes and have not even really refined it or added negative prompts
try putting the above with your embedding
No she isn't always wearing that but the hairstyle is always the same
the pigtail is one of the characteristic traits
thats not a pigtail
looks like a version of a high ponytail
have you searched danbooru for the character?
and looked at the tags used to describe her?
ok so its hayasaki mei from idoly pride
most of these images she is wearing a school girl outfit so makes sense the embedding learned that
for a very specific model that you just want to copy this is what dreambooth is relatively good for
if you really want to train an embedding for her I would suggest 40-50 images, not blurry, 512by512 use empty canvas to make them 512 wide since she is likely mostly not in square images
and put her on a plain white or black background
since she is an anime character I would use subject_filewords and deepbooru tags
and train it using the wd_v1-3-float32 or similar waifu diffusion checkpoint
has anyone had any experience with negative prompt training? I'm using some fork that says it supports neg prompts using flags but it seems to suck ass
I tried using stable tuner also but it just eats up all the memory and dies immediately
Is there some resource to track fine tuned models that are trained on large datasets? Say if someone fine tunes sd1.5 on a large dataset in a bespoke manner, I'd like to identify that as opposed to someone that has done more simple dreambooth or lora training?
It seems quite rare for people to publish what datasets they have finetuned with.
Very common in the LoRA community, but yeah, others not so much
I'm especially interested in finding efforts of finetuning that's been started at a base of say sd1.5 but have been trained with a decent and public dataset. Seems hard to find via searching, perhaps because it doesn't exist.
danbooru has a little bit more than anime but yeah a main focus for sure
I am interested in fine tuning for illustrations (think corporate modern clipart), so anime is not too far out. But I'm worried that datasets like danbooru will skew the generation too much into risky territory.
Enough fine tuning might make that a non-issue however. So a fine tune on that dataset that has fine tuned the entire model, might perform very well for illustrations. I might have to test.
Risky data in, risky data out. Clean data in, clean data out
You can filter it out
just know that if you train on a specific model and it's geared towards NSFW, then it won't matter 
Corporate modern clipart might work best on 2.1
which has an INCREDIBLY strict NSFW filter on it
That's probably a very good starting point, yes - 2.1
Are there any models or finetunes that are good for getting alien skintones? Green, red, blue etc?
I haven't had an issue just saying like "blue skin" in the prompt
I find emphasis is needed, but inevitably it makes EVERYTHING blue...
for embeddings, is it better "once" or "deterministic" for the sampling method ?
general opinion is deterministic or random, but some prefer once
best to test both
you should not need more than 4000 steps or you are doing something wrong so training should be relatively quick
if your embedding is off by a lot its probably your source images
thanks for infos !
i'm wondering if i should write the caption as close as possible to what it visually is
like A rendering of a skull head, octane render, sharp focus etc
or only A skull head
making sure it's not blurry and has a clean white or black background will have a much larger effect than the caption
Steps should be proportional to number of images, so might not be a bad thing to up it if you're trying to cover a larger set of images.
I've never needed more than 4k even with 60-90 images but ok
Deterministic or random 
The original paper used "random"
does a hypernetwork need the same base model for generation as embeddings?
Nope
But you will run into issues mixing ones trained on 2.0 models with ones trained on 1.5 models
I don't think they're compatible, as far as I'm aware
Also some may not work as well with different models
yeah i guess so thanks
hello
Hello everyone!
Do captions for dreambooth models for example serve the same purpose as in TI? What I've learned is that captioning for embeddings basically separates the style you want to train from the subjects in the images- but I'm making my first tiny steps into model training right now so this is all very new
Yep!
It can work either way and of course your results may vary, but it seems to work better, in some cases
I've noticed that after the checkpoint I set (5%), my VRAM usage drops by 8 GB and stays this way, would I be able to increase the batches, y'all reckon? Even if the first epoch gets throttled, it should hopefully resume its speed after the first checkpoint, right? (as long as the VRAM usage drops again)
are vectors related to what I write in captions for embeddings? A more descriptive caption = more vectors?
More described captions are more flexible but are also harder to pull out without having to use a lot of captions
Gotcha, thanks!
Not in the literal sense, no- vectors are pretty much the amount of information you're storing on some subject, captioning doesn't influence that
https://docs.google.com/document/d/1JvlM0phnok4pghVBAMsMq_-Z18_ip_GXvHYE0mITdFE/edit# This document by @simple ivy might help you
Textual Inversion - Training embeddings for Stable Diffusion 2.0+ in Automatic1111 UI By Adam Desrosiers feel free to leave a comment for correcting any points or clarifying something that is still unclear I'll tell you what I know - but that ain't much. And I'm not convinced anyone knows that m...
As Joppe indicates in his comment, the amount of vectors required can loosely correlate to how complex your initialization text is
Huh, is that true? I suppose it'd be easy enough to test
I thought it was just related to the embedding itself
Yeah, is related to how complex the embedding is itself
It is related to the embedding itself, but you can kind of very loosely use it as a rule of thumb- if you can describe the concept semi-accurately in 3 words, I don't think you would need a lot of vectors
but if the concept is complicated, very detailed and precise, lots of things to describe- more vectors
I just think it's misleading, because people might think the vector token count relates to the tokens in the initialization text
although I definitely see the point and I do think it's generally a good rule of thumb
fair enough, might need to be phrased a bit differently
Gradient Accumulation Steps should be a number that, multiplied by your batch, equals the total number of training images.
I think that works pretty well! But I wouldn't say "should"; rest seems good
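If you want to try that rule of thumb, this sketch lists the batch size / gradient accumulation pairs whose product equals the number of training images. The function name and the `max_batch_size` cap are assumptions for illustration (the cap stands in for whatever your VRAM allows); treat the output as a starting point, not a requirement.

```python
def batch_and_accumulation_pairs(num_images, max_batch_size):
    """All (batch_size, grad_accum_steps) pairs with batch_size * steps == num_images."""
    return [
        (batch, num_images // batch)
        for batch in range(1, max_batch_size + 1)
        if num_images % batch == 0
    ]


print(batch_and_accumulation_pairs(30, 6))
# [(1, 30), (2, 15), (3, 10), (5, 6), (6, 5)]
```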
That's a lot of solid info in there
[filewords], [name]? or [name], [filewords]? There's more weight to the first token in the prompts
Should is a bit of a strong word there lol, although people are often looking for concrete info on values they should use instead of speculation
LOL true that, but with any machine learning, they're going to be sad when reality hits em
because there's no one fits all for anything 
BUT
rules of thumb or guidelines are fine
like "generally these settings work" kinda thing
I tend to have good results with [filewords], [name] so far- though my embeddings are usually supplementative instead of an actual strong subject
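To see what the two orderings actually produce, here is a sketch that fills in a training template by plain text substitution. It assumes the `[name]` and `[filewords]` placeholders are replaced with the embedding name and the image's caption; the function name and the example caption are made up.

```python
def render_template(template, name, filewords):
    """Fill a one-line training template by substituting [name] and [filewords]."""
    return template.replace("[name]", name).replace("[filewords]", filewords)


caption = "a man in a helmet and jacket"  # hypothetical caption for one image
print(render_template("[filewords], [name]", "emb_star_sector", caption))
# a man in a helmet and jacket, emb_star_sector
print(render_template("[name], [filewords]", "emb_star_sector", caption))
# emb_star_sector, a man in a helmet and jacket
```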
Do you do tag shuffling?
generally, not really- though I'm but a simple man who doesn't touch things if the current way works fine 
gotcha gotcha 
I think tag shuffling works really well for more anime stuff? because it works with danbooru tags and those are tag shuffle friendly
yeah that makes a lot of sense- the captions I use are manually enhanced BLIP captions, more natural language like- so I feel like shuffling that won't do it any good
interesting! And is a good idea to try [filewords], [name] when I don't want as strong an effect 
also I'm not into anime so booru tags won't work well for my stuff 
Yeahhh, I don't imagine it would, but I haven't tried it yet, so I can't say for sure
It might though
isn't the tag shuffling based on commas?
Yeah, but the idea would be to separate concepts using natural language in the caption itself
because there's a LARGE effect on whatever is closest to the start of the prompt
so by shuffling it, you better distribute it
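A minimal sketch of comma-based tag shuffling as described here, so no single tag always sits at the start of the caption. The `keep_first` option is my own addition for illustration (to pin a leading tag in place) and is not claimed to match any particular trainer's setting.

```python
import random


def shuffle_tags(caption, keep_first=0, rng=None):
    """Shuffle comma-separated tags, optionally keeping the first `keep_first` in place."""
    rng = rng or random
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    head, tail = tags[:keep_first], tags[keep_first:]
    rng.shuffle(tail)
    return ", ".join(head + tail)


rng = random.Random(0)  # seeded only so repeated runs give the same order
print(shuffle_tags("1girl, hair bow, wrist cuffs, lace-trimmed dress", keep_first=1, rng=rng))
# A natural-language caption with no commas is a single "tag" and comes back unchanged:
print(shuffle_tags("a digital painting of a man in a suit and helmet"))
```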
Makes a lot of sense- might play around with it in the near future when I have the time to train something again
Yeah, Auto1111 webui training is pretty slow, 10k steps at 40 minutes (w/ a 3090) is 
Usually the captions are just long sentences with maybe one comma for me
If you try that method, you could try with more commas, like separating concepts. I don't have a whole lot of knowledge with non-anime so I'm not entirely sure
.....3060 and making 2.1 embeds, yikes
Like, basically you'd look for is when you do a prompt and it produces good results while including commas, or produces results that are very similar to your own pictures (accurate). Then you'll be golden
if commas break things, then it probably won't work that well but
I saw some tutorials where the initialization text should be unique, it's only for subjects unknown to sd no ?
mmm I wouldn't say unique
good point- I use tons of commas in prompts but not in my captions
Describing the character traits or what you ALWAYS want your subject to be related to, seems to work really well
And then captions is what you DON'T want your embedding to learn, seems to work well
initialization text is pretty much where you want your concept or subject to start training
yep
so its the same for styles, if i want to train a new pointillism style the init would be pointillism and not "test5478" right ?
yes
pointillism art style or something
god thanks i got misled pretty badly haha
there's a ton of speculation out there
The "unique" token is for the instance token
unique meaning rare 
as in, the model doesn't know what that token is
When I started I would just use the embed name again, but at some point I switched to using it like a starting point and the difference is immense if you look at the generated training images
and for hypernetwork, there is no init how does it work ?
and the ones with the initialization text describing the character turned out a LOT better
i see thanks
my knowledge on HNs is pretty much zero for now so I can't help you there unfortunately
I haven't really touched HN
Your customization comes mostly from the hypernetwork layer structure, activation function, and layer weights initialization
deeper being better for subjects, iirc
and wider being better for styles?
I don't recall 
deep meaning like 1,3,1
Guide ^
can anyone offer some advice on dreambooth captions? i'm using TLB-FastDreambooth and i'm just struggling to get the output i'm looking for. i'm using a keymash token (ohwx), naming my images ohwx-1.png, ohwx-2.png, etc. (total of 12 images). the dataset is properly varied. some medium, some wide, some close. different backgrounds, expressions, looks, etc. captions are descriptive, but don't include the ohwx token. (i tried including the token in the caption and instead it was getting setting and pose but not her face). example caption: "a photo of a woman in a dress standing in a room with a door and a clock on the wall behind her". i'm not using regularization images based on TLB's advice to skip the step if training a face. i set a text encoder step value, but it doesn't run ever (i assume because i'm not doing regularization). i've tried 15k steps with a dataset of 82 images and it was generating her face well but was locked into selfies because too many of the 82 were selfies. am i missing something?
15k steps is a LOT for a subject with the idea of learning a face
Has anyone gotten an embedding training to look just as good as a dreambooth training for a specific subject, not an artstyle? I am currently training mine but thinking it's a waste of time because I don't think it looks as good as dream booth.
82 images is also a lot for just face learning
Not to mention to actually make it look like the subject it takes 10 times longer since each step takes a lot more
you could tone it down to like 5-30 and see better results, tbh
Most of them should be closeups and you have the right idea with having them varied 
Not using reggies is bad advice, so right off the bat I suggest you look to someone else for guidance
and yep you can just skip regularization if you're fine with the model only generating that face
Nooo Reggies make everything better
You don't need them if you don't care about generating other things
i'm fine with the model only doing her face
its a model just of my wife, so no biggie
Of course, you want her to only do her face, that's the point of training a subject
That might be "better" but for someone starting out, I really don't think they need it
With reggies, not only do you get more flexibility, but you also point the model in the right direction. For example, high-quality reggies increase your chances of getting a high-quality image if that's what you're going for
Midjourney-style reggies will push the style towards a Midjourney-style generation, etc. I honestly don't see the point of not using reggies
Do you have any examples, that's super interesting 
well, if this train i've got going doesn't produce anything usable i'll go back to reggies
Like regularization is typically just random images (of the class) generated from the model you want to train on, so it's interesting to see people not doing that and it having a positive effect on the outputs
So comparison examples would be super interesting to see
Gimme a mo, I cannot touch my computer right now because I'm doing an embedding training and even opening a JPEG will crash the training because my VRAM is pushed to the max
Try both
And see for yourself
In fact, I'm going to be generating my own set of Reggies based on the new protogen model, since I already have a bunch of them for other models
I will upload it to hugging face and you can download it from there and use them, and you can see the difference yourself and post results for research
Though I will firmly say that it's more user friendly to not worry about them to start off with. Like I'm confident you can get usable results without them. They might not be as high quality, but it is less work and less variables to worry about, imo
but if it makes it easier to get usable results, I suppose I would just be wrong 
are the accumulation steps equal to my total number of pictures if the batch size is 1?
I'm all for high quality though
it really doesn't matter?
I think it means gradient accumulation steps
i would like to do batch 2 but i have 47 pictures x)
You can get away with either, it'll just change how long it takes to train
Which you should leave at one unless you have any reason to change it, which is usually VRAM limitations
and LR might need to be adjusted
what gpu do you have?
Batch is VRAM and GA doesn't care about VRAM
rtx 5000 (16)
so you keep stuff on batch if you can and then GA if you have VRAM limitations
Yeah, so for example if you can't fit 40 images in the batch, then you can do 20 images in the batch and two for gradient accumulation
right
To try to imitate the same result. I don't remember exactly how gradient accumulation works, but it is inferior to using the full batch
and sometimes less is more
my guy, you can do like a batch size of 15 or something if you lose two images
It has to carryover data to the next epoch, or something like that when you do gradient accumulation, instead of passing the entire data at once so there is a negative effect if used unnecessarily
batch size of 1 would be a waste of time I think- as in, you have the vram
if I cut a few images so the math works out, I can set the batch size higher

yeah, pretty much
You could argue Batch 1 is easier to train?
even if it might take longer to do so?
so 44 images with a batch 4 i set the gradient to 11 ?
Or maybe, easier to not overtrain, is the better way to word that
is an option, yes
I would not agree though- as in- overtraining won't be a difference of one image
I'm more so referring to the learning rate
I personally do 1 epoch every step so it's easier to keep track of progress
You're speeding up the learning process, right?
Which makes it easier to overshoot or overlearn
or something like that
Something like that- although I feel like the lr isn't something that will be accurate anyways on a first try
You ideally want to train 60 batch size, 1GA
yeah
But if you can't fit 60, you can then try 30 BS, 2 GA
right right
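The batch size and gradient accumulation numbers being traded above (60 x 1, 30 x 2, 4 x 11) all come down to the same arithmetic. A minimal sketch, assuming the trainer multiplies the two to get the effective batch and keeps the last partial batch; the function name is made up for illustration and is not any trainer's API:

```python
import math


def updates_per_epoch(num_images, batch_size, grad_accum_steps):
    """Return (effective batch size, optimizer updates per pass over the data)."""
    effective_batch = batch_size * grad_accum_steps
    # Assumption: the last, smaller batch is kept rather than dropped.
    return effective_batch, math.ceil(num_images / effective_batch)


# The combinations mentioned in the conversation.
for images, batch, accum in [(60, 60, 1), (60, 30, 2), (44, 4, 11), (47, 1, 1)]:
    effective, updates = updates_per_epoch(images, batch, accum)
    print(f"{images} images, batch {batch}, GA {accum}: "
          f"effective batch {effective}, {updates} update(s) per epoch")
```

With 44 images, batch 4 and GA 11, the whole dataset lands in a single update per epoch, which is the "1 epoch every step" setup mentioned above.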
I don't know if that's true, is it?
GA is just a matter of time, right?
like just taking longer but same results
But since VRAM is worth its weight in gold and it is scarce then you have to deal with it
I remember hearing that it does cause a loss of accuracy but I need to research it. These topics are super complicated unless you have a degree in machine learning which I want but can't afford.
Because all you hear are super technical answers
This was an interesting read, but yeah, kinda complicated:
https://kozodoi.me/python/deep%20learning/pytorch/tutorial/2021/02/19/gradient-accumulation.html
So when someone says yeah, it causes a loss of accuracy, usually the reason behind it is explained with a bunch of more jargon so it's hard to fit the entire idea in your mind
I'm a take a read
Also, easy enough to test out and compare
I should do that at some point
mmm there's another one, one sec
I also liked this one, also more user-friendly:
https://towardsdatascience.com/gradient-accumulation-overcoming-memory-constraints-in-deep-learning-36d411252d01
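The "30 BS, 2 GA instead of 60 BS" question can be checked on a toy problem. This is a minimal sketch in plain Python, assuming a loss that is averaged over samples and equal-sized micro-batches; the data and the one-parameter model are made up for illustration:

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy data, invented for this example: 60 (x, y) pairs.
data = [(float(i), 3.0 * i + 1.0) for i in range(60)]
w = 0.5

# One full batch of 60.
full_grad = mse_grad(w, data)

# Two micro-batches of 30; gradients are averaged before a single update.
micro_batches = [data[:30], data[30:]]
accum_grad = sum(mse_grad(w, mb) for mb in micro_batches) / len(micro_batches)

print(full_grad, accum_grad)
```

In this setup the two gradients agree up to floating-point rounding. Real trainers can still differ slightly, for example through batch-dependent layers, mixed precision, or how an uneven last batch is handled, so comparing both settings on your own dataset is still worthwhile.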
Has anybody tried a full tuning with Everydream or similar?
Yep!
I'd argue it's one of the easiest ways to train
since you're just learning on the captions
Why would you use it as opposed to dream booth ?
depending on what you're looking for, I suppose?
If you want to learn multiple things, finetuning/caption training works well
or if you don't want to worry about rare tokens or
just being overall easier 
I think it's easier to not overfit?
it doesn't have the same level of overfitting that DB can have
Let's say I want to train a model for a very specific purpose: designing a car rim. But I don't want it to use the generic wheels that are in the model. I'd like to train it with a specific dataset so the results are more interesting. Would it be better to go with full tuning?
I'm not sure what one works better for something like that
mmm, unfortunately the community doesn't share a whole lot of info on how they trained specific things
some people do but yeah 
mm there are a couple good example starting points
one sec
Thanks Alicat!
A really solid model AND has a lot of info on how it was made
you could do that for rims
I think 
oh oops, that's an embedding
Appreciate it. So in this case do you think they used res images?
nope! Sorry I'll grab a different one
It seems they made a model from it
Using Dreambooth,
I see some using a single keyword in Prompt instance (zkz, ohwx,...),
others on the other hand use an additional class word (e.g. ohwx woman)
while some use a whole prompt (e.g. Photo of a beautiful ohwx woman, award winning photography).
What is the right entry among these 3 methods?
I don't understand how this works.
Instance token is like ohwx. Class token is "woman".
Instance prompt is like "Photo of ohwx woman"
Class prompt is "Photo of woman"
When prompting you would use "ohwx woman"
for the strongest effect
is one way to do it
So "Photo of a beautiful ohwx woman, award winning photography" should work well for an example prompt when using the model
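The token/class/prompt relationship described above can be written out as a tiny helper. This is only a sketch of the naming convention from the conversation; the function and the template string are hypothetical and not part of any Dreambooth UI:

```python
def dreambooth_prompts(instance_token, class_token, template="Photo of {}"):
    """Build the instance and class prompts from a rare token and a class word."""
    instance_prompt = template.format(f"{instance_token} {class_token}")
    class_prompt = template.format(class_token)
    return instance_prompt, class_prompt


instance_prompt, class_prompt = dreambooth_prompts("ohwx", "woman")
print(instance_prompt)  # Photo of ohwx woman
print(class_prompt)     # Photo of woman
```

The only difference between the two prompts is the rare token, which is what ties the training images to "ohwx" while the class prompt keeps describing a generic "woman".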
I saw many videos where they just leave the filewords entries blank and it works. That's what puzzles me here.
What was the entry called?
[filewords] can go in like Instance Prompt or like Class Prompt
or do you mean, the text files were blank?
Ooohh
I often see this parameter left blank in "tutorial" videos
Yeah if you leave those blank, you would just be doing caption training
also known as finetuning
and then it's ONLY learning on the filewords
is how I understand it
which tutorial are you watching? there are a couple solid ones out there now
I see that it is still a guessing game for many people
https://www.youtube.com/watch?v=Bdl-jWR3Ukc
https://www.youtube.com/watch?v=XBn3K1L_TAI
https://www.youtube.com/watch?v=HahKXY7AQ8c
https://www.youtube.com/watch?v=9Nu5tUl2zQw
To name a few
I am explaining from scratch to very advanced level how to use #Automatic1111 Web UI and D8ahazard #DreamBooth extension to teach new subjects, e.g. your face into a model. Moreover, I am showing how to inject your taught face into a completely new model e.g. Protogen x3.4 to produce awesome quality images without wasting too much time on findin...
After installing the stable diffusion webui (https://youtu.be/cL_ZYdkIqBU), Mo goes over how to train an AI model to generate portraits of your face using dreambooth.
Download Automatic1111's WebUI:
https://github.com/AUTOMATIC1111/stable-diffusion-webui
Installing...
Dreambooth local training has finally been implemented into Automatic 1111's Stable Diffusion repository, meaning that you can now use this amazing Google's AI technology to train a stable diffusion model with your own images. You can train a character, an object, a style, or anything you want! There is also a new option that allows you to use D...
DreamBooth for Automatic 1111 is very easy to install with this guide. With DreamBooth for Automatic 1111 you can train yourself or any other subject. Use your own trained Model to create images in your styles or of yourself. The DreamBooth training in for Automatic 1111 takes only around 30-40 minutes with a good GPU.
Fuck this bald bitch
Go team Aitrepreneur!!
@halcyon linden Team Aitrepreneur FTW. Kay do you have your own discord? If not, make one
For me, the one from SE Courses is what gave the best result, although hardly usable at the moment.
mm
Yeah you can do that, the difference between the Class prompt and Instance prompt is the token you're training on
essentially
Also a huge fan of SE Courses
Thank you.
The descriptions of the fields are often haphazard (basically just what is listed on the GitHub of the A1111 Dreambooth extension).
Yeahhh
plenty of magical numbers too
Entagma's is the clearest of all (you can feel the professional teacher behind it), but the video is a bit dated now; things have evolved quickly on this side.
It's still the new dreambooth UI too, and the video is solid! 
oh hey that's me 
Hi Stile Willem, great model. Could you tell us how it was made ?
It's an embedding, actually- I put up the dataset as someone wanted to make a model out of it which they did
I'm only just getting into making models
Oh ok, yeah was wondering how to make a full tuned model. Looking around my deduction is that you need more input images and perhaps no reg images... somebody correct me if I'm wrong.
oh hey, it you
yeah is a good embedding
Hi, I'm about to make a TI of a specific body type, should I go for style or object?
in the init text, if I do "inittext1, inittext2, inittext3", is the training spread across the three or does it use the whole sequence?
embedding
How do I enable/disable bucketing on AI Dreambooth training? I can't seem to find the option anywhere
depending on the repo, it can be automatic
Resolution being 512, it does:
```
bucket 1: resolution (256, 896), count: 0
bucket 2: resolution (256, 960), count: 0
bucket 3: resolution (256, 1024), count: 0
bucket 4: resolution (320, 704), count: 0
bucket 5: resolution (320, 768), count: 0
bucket 6: resolution (384, 640), count: 10
bucket 7: resolution (448, 576), count: 10
bucket 8: resolution (512, 512), count: 90
bucket 9: resolution (576, 448), count: 0
bucket 10: resolution (640, 384), count: 10
bucket 11: resolution (704, 320), count: 0
bucket 12: resolution (768, 320), count: 0
bucket 13: resolution (832, 256), count: 0
bucket 14: resolution (896, 256), count: 0
bucket 15: resolution (960, 256), count: 0
bucket 16: resolution (1024, 256), count: 0
```
for example, automatically (at least that's how kohya's works; I'm pretty sure the DB extension does that too as of 2-3 weeks ago)
"disabling" it would just be using only 512x512 images
then no bucketing would happen
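For anyone wondering where a bucket list like the one above comes from, here is a rough sketch of aspect-ratio bucketing. It is an assumption-laden illustration, not kohya's or the DB extension's actual code: the 64-pixel step, the 256 to 1024 size range, and the nearest-aspect-ratio rule are all guesses chosen to match the resolutions in that log.

```python
def make_buckets(max_area=512 * 512, min_size=256, max_size=1024, step=64):
    """Enumerate (width, height) buckets whose area stays within max_area."""
    buckets = set()
    width = min_size
    while width <= max_size:
        # Largest height that is a multiple of `step` and keeps the area in budget.
        height = min(max_size, (max_area // width) // step * step)
        if height >= min_size:
            buckets.add((width, height))
            buckets.add((height, width))  # mirrored orientation
        width += step
    return sorted(buckets)


def pick_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    aspect = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))


buckets = make_buckets()
for size in [(512, 512), (512, 768), (1920, 1080)]:
    print(size, "->", pick_bucket(*size, buckets))
```

Under these assumptions, a dataset of only square images always lands in (512, 512), which is the "no bucketing would happen" case.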
when were you thinking you'd have the reggies posted?
I'll do them today
Since I'm in the process of comparing embed vs db
how many reggies would you recommend i use for face training a woman?
rule of thumb is 10 reggies per sample img
kk
I'm wondering because I fed it very different aspect ratio images and they all get bucketed to 512
And I just pulled yesterday, and the console does show that bucketing exists
I'm just not sure how to enable it
odd! yeah not sure
You would never want to feed anything other than 512
Never rely on dreambooth resizing
is what I mean
or birme, that thing compresses like a mf
Ideally I wouldn't, I just want to try it out
Do you have any comparisons? With the auto resizing from dreambooth versus manually? I'd love to see those 
like a 1 to 1 comparison with the only diff being that?
Intuitively I feel like that'd be right, but
I usually do it all in imagemagick myself at 95 quality
I can do it
I swapped to photoshop for that exact reason
Im probably training a dreambooth model soon, I have soooo many models on queue
I've been using Kohya and they auto bucket, but if that seriously hurts quality, I'll just manually resize them
Sadly few programs have respect for quality
Also this might be superstition
But if they are outputting anything other than PNG there is always a loss, even if you can't see it
Jpg loss is less than 1% at that size by the way
If a png is 2mb and the jpg is 300kb, where did that 1.7mb go? It's data you can't see with the naked eye, but a computer can probably see it.
nope
My png's are about 10 times bigger than jpgs
Size doesn't mean more quality
Just uncompressed
My tiffs are 10 times bigger than PNGs, should we be using those instead? Probably but we aren't using machine learning to detect stars at the edge of the observable universe
We just making picshures
tiff can be lossless or lossy
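The "smaller file vs. lost data" point can be illustrated without any image library. PNG stores pixel data with DEFLATE, which Python's standard `zlib` module also implements, so a minimal sketch of a lossless round trip looks like this (the pixel data is made up; this is not a real PNG encoder):

```python
import zlib

# Made-up 64x64 grayscale "image": a smooth diagonal gradient.
raw = bytes((x + y) % 256 for y in range(64) for x in range(64))

packed = zlib.compress(raw, 9)
restored = zlib.decompress(packed)

print(len(raw), "bytes raw,", len(packed), "bytes compressed")
print("identical after round trip:", restored == raw)
```

The compressed data is smaller, yet every byte comes back unchanged. A lossy format such as JPEG gets its extra savings by discarding information, so file size alone does not say how much quality was kept.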
I mean, ideally you would, but these models are running on 512 anyways, 768 for 2.x
Or 1024 if ur a maniac like me. You wouldn't believe some results that I've gotten. But you would have to hit the lottery to get something coherent. INSANE detail but low coherence
anyone know why TLB's fast-dreambooth leaves instance prompt blank?
wow, I just tried my embedding on my own machine with the same "name" of the model and it obviously didn't work
looks like the sha is different
Quick question regarding DreamBooth with SD v2.1 (sorry if this is the wrong place to ask).
I'm comparing different pre-trained models (SD v1.4, v1.5, and v2.1) in the DreamBooth method and notice that v2.1 has a greater loss during fine tuning (normally just under 0.5) compared to v1.4 and v1.5 (normally just under 0.2), but v2.1 still produces high quality images for my particular application. I know v2.1 is trained (at least in part) using a different loss function, but that shouldn't affect the loss here as I'm using the same training script regardless of model version. Any idea where the difference is coming from? Thanks
Can anyone who has experience in Hypernetwork training help me to train a style from a specific artist?
I have the captions in the dataset directory but couldn't find how it is related with the training since the prompts are coming from prompt template file.
Remake the past
ssshhh don't tell anyone this. I make a lot of money reviewing people's datasets for AI and just telling them they shouldn't use 20% JPEG compression to save space or they can't train anything
People pay for advice? Dayum
It comes from SD 2.x being shit for dreambooth, in simple terms. No one trains those models cuz they flat out suck for db for reasons no one really seems to care about.
yeah man that's what a consultant is
1.5 is still the alpha chad master race of models
Oof. Wait until they hear about chatGPT
I swear chatGPT is my therapist, consultant, tech support, and I shit you not, once realistic female androids come out - wife. Just plug that shit into an android and voila. The perfect human companion.
Interesting, thanks. In my application v2.1 leads to much higher FID but then better performance for downstream tasks, which is weird. Is there any particular reason why it's considered worse than v1.5? And is there documented evidence, or is it just general opinion?
no celebs, no nsfw, a pain to train on db, no artists...
If all u do is draw cute pictures of bunnies though I think it might be better
just cuz of 768. Once we get a 1.5 1024 model tho, the world will implode on itself.
Cool, thank you!
Well there's still celeb content, but the names aren't attached to them to the same extent anymore.
Human content was HUGELY nerfed due to the insanely strict filter on 2.0. They attempted to fix it by continuing to train on it by introducing a moderately strict filter, but it could only do so much.

It's good for landscapes and non-humans though!
Question: what could I be doing wrong? I'm using the a1111 DB to train 2.1 512, but this is the garbage I'm getting no matter the sampler or the settings. I feel like I did everything right that I could be doing right, and 1.5 training with exactly the same settings, samples etcetera worked like a charm
Is there something about training 2.x that is so fundamentally different?
Or should I not use a1111's DB for this but rather EveryDream or the main diffuser based repo
looks like you don't have a correct yaml loaded for the generations
I had weird colors like these when I lowered the image resolution way way down in Stable Diffusion 2.1, idk why.
That might just be it- I'm an idiot, somehow I assumed it would just copy the yaml from the source
yeah 2.x is weird when it comes to low res
Yeah I know. No wonder why I have to stay on 512x512 all the time. 384x384 seems to work correctly though.
Nevermind, that was not it
It had a yaml already, and even with the original SD2 yaml the problem persists
yeah that's the model I used for training
I might just be unlucky I guess haha, or might have done something wrong- though I have no idea what
haha the "what" has so many variables in there indeed
wondering why in my embeddings samples i asked a simple "microwave in mystyle" and i got a volcano
all my txt captions are empty
we don't even look for the obvious first, it can be "everything" x)

hmmm no, that's weird, the captions are there
100 steps at 0.0025 lr
An intricate microwave, art by xxxx
same dataset worked fine on 1.5 don't really understand
hmm maybe the 2.1 embeddings need to be trained on 768 ?
768 resolution dataset ?
hey guys, I want to train a style with dreambooth. I tried with a small dataset of about 20 images and the results were great, but when I trained with 200 the results weren't that good. I searched around and found EveryDream, but it seems my GPU is not enough for it. So I was thinking: if I train the model with dreambooth in parts, 20 images at a time, and merge them into one model, how would that turn out?
for 2.1 "which" models i should use for train embeddings ?
got really nonsense results even at very low steps with the 2.1-768 nonema
here's a tip for fine tuning
use the config file v1-finetune_style.yaml or v1-finetune.yaml or v1-m1-finetune.yaml
Hey Crew!
So I am finally getting into fine tuning. Looking for some dataset setup advice. To begin with I am using fast-dreambooth.
I want to train on a specific vehicle, in this case a pickup truck.
There is a new model of the pickup truck coming out and there are limited photos of it.
The photos I have are just of one colour. However, the new model is only slightly different (has a longer chassis).
If I wanted to give dreambooth more context, is it okay to give it photos of the old models of all different colours, but perhaps label them differently, like pickup-long, and pickup-short.
Penny for your thoughts?
i say just try it
you can also start collecting 512x512 pictures of what you want to train on and make your own model
try learning how to code up a web scraper, and use something like Google's Custom Search API
this will allow you to automate the process of gathering the images
also prob want to add a script to resize and convert images to .png
I'll post some examples/scripts I have
APIKEYS n such are fake, I did not post my keys
example of using ffmpeg to resize and convert an image:
ffmpeg -i input.jpg -vf scale=320:240 output_320x240.png
Note: the scale filter can also automatically calculate a dimension while preserving the aspect ratio: scale=320:-1, or scale=-1:240
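Since the goal here is a folder of 512x512 PNGs, the ffmpeg call above can be wrapped in a small Python loop. A hedged sketch: it assumes ffmpeg is on the PATH and that the sources are `.jpg` files, and the function names and folder layout are made up for illustration. It scales the short side to 512 and center-crops, rather than stretching the image.

```python
import subprocess
from pathlib import Path


def ffmpeg_square_cmd(src, dst, size=512):
    """Build an ffmpeg command: scale so the short side is `size`, then center-crop."""
    vf = (f"scale={size}:{size}:force_original_aspect_ratio=increase,"
          f"crop={size}:{size}")
    return ["ffmpeg", "-y", "-i", str(src), "-vf", vf, str(dst)]


def convert_folder(src_dir, dst_dir, size=512):
    """Convert every .jpg in src_dir to a square .png in dst_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.jpg")):
        dst = dst_dir / (src.stem + ".png")
        subprocess.run(ffmpeg_square_cmd(src, dst, size), check=True)


print(" ".join(ffmpeg_square_cmd("input.jpg", "output_512.png")))
```

Calling `convert_folder("raw", "dataset")` would process a whole folder; cropping by hand is still the safer choice when the subject is not centered.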
Cool, thanks. I am a python coder actually. The tough thing is the pickup truck I am talking about isn't released yet, so there are only a few "sightings" of it in the community. I only have about 10 images.
So I am going to pepper in some images of the front and back of an older model which those parts of the vehicle look the same.
Now just put all that together, and add some nerdy programmer magic and booyakasha
tbh, I hate python
but I have to use it so, I use it.
the output of that google script is json, so you'll just prob need to add a regexp to filter out image links and pipe that to something like wget keeping it simple
Thanks!
try adding this python snippet to the google search script
testfile.retrieve(url.replace('"',''), "tmp/images/full/fish")
or
print('\n'.join(jsonpath.jsonpath(parsed_input, "$..URLs[?('unica' in @)]")))
or
import urllib.request, json
with urllib.request.urlopen("http://maps.googleapis.com/maps/api/geocode/json?address=google") as url:
    data = json.loads(url.read().decode())
    print(data)
just ideas of what one could add to it, idk, try shit, see what works n what fails
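To make the "filter out image links and download them" step concrete, here is a small standard-library sketch. It assumes the search script's JSON has an `items` list whose entries carry a `link` field; that layout, the sample data, and the function names are assumptions for illustration, so adjust the keys to whatever your script actually prints.

```python
import json
import urllib.request
from pathlib import Path


def image_links(search_response):
    """Collect the `link` field of every entry under `items` (assumed layout)."""
    return [item["link"] for item in search_response.get("items", []) if "link" in item]


def download_all(links, out_dir):
    """Save each link as image_0000, image_0001, ... inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, link in enumerate(links):
        urllib.request.urlretrieve(link, str(out_dir / f"image_{index:04d}"))


# Made-up response for illustration; a real one would come from the search script.
sample = json.loads(
    '{"items": [{"link": "https://example.com/a.jpg"},'
    ' {"title": "entry without a link"},'
    ' {"link": "https://example.com/b.png"}]}'
)
print(image_links(sample))
```

`download_all(image_links(sample), "tmp/images/full")` would then fetch the files; it is left uncalled here because the example URLs are placeholders.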
yw
once you figure this part out, all you'll need to do is just change what it searches for online, wait for it to finish and have a folder populated with all the 512x512 images you need to train a model with
Good thinking
grep -Poza '(?:\G(?!^)",|"groups":\s*\[)\s*"\K[^"]+'
- `-P`: use the PCRE engine to parse the pattern
- `-o`: output matches found
- `-z`: slurp the whole file, treat the file as a whole single string
- `-a`: treat the file as a text file (it [should be used](https://stackoverflow.com/questions/152708/how-can-i-search-for-a-multiline-pattern-in-a-file#comment44086821_152755) because the `-z` switch may trigger grep binary data behaviour that changes the return values)
Pattern:
- `(?:\G(?!^)",|"groups":\s*\[)`: either the end of the previous match (`\G(?!^)`) and then a `",` substring, or (`|`) the literal text `"groups":`, 0+ whitespaces (`\s*`) and a `[` char (`\[`)
- `\s*"`: 0+ whitespaces and a `"` char
- `\K`: match reset operator, discarding the whole text matched so far
- `[^"]+`: 1+ chars other than `"`
As you see, this expression finds `"groups": ["`, omits that text, and matches each value inside `"`s only after that text.
@hoary stone
example pulling IP addresses out of a file with grep
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' file.txt
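For those staying in Python, the same extraction can be done with the standard `re` module. A minimal sketch using the same pattern as the grep call above; like that pattern, it does not check that each number is at most 255, and the sample text is made up:

```python
import re

# Same shape as the grep pattern: four dot-separated groups of 1 to 3 digits.
IP_PATTERN = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")


def find_ips(text):
    """Return every IPv4-looking substring, in order of appearance."""
    return IP_PATTERN.findall(text)


sample_log = "request from 192.168.0.1 forwarded to 10.0.0.254 on port 8080"
print(find_ips(sample_log))  # ['192.168.0.1', '10.0.0.254']
```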
Just use your imagination, and write it out in code, (or find someone that already has online, lol)
or since the google script outputs in json, just pipe it into curl using a GET request
hmmmm so many ways
This enables you to be the creator of your own models rather than submitting to an authority figure (someone else) that you have to wait on to solve your problem for you (windows users, -- cough -- lol)
Haha, wow man. You didn't have go that far! Thanks for your input, much appreciated.
Haha, got ya.
it was no problem
i just get carried away sometimes
and there may be others reading this and taking notes
i was writing this for everyone, tho you were the focal point
hmm just love how turned my embedding for 2.1
Does anyone know how to add negative prompts to the openjourney model?
I don't have a specific negative_prompt field and the usual [] or ::-1 isn't working at all in the regular prompt
hey question can i allocate more ram to SD becasue it looks like mine is hard locked at half
also is there like a guide to the UI
like what everything does and setting n shit
someone was telling me you can use LORA to compare two models and extract the difference as a pt but I can't find anything about this in Google. can anyone steer me in the right direction?
Dreambooth is a Google AI technique that allows you to train a stable diffusion model using your own pictures. This Imagen-based technology makes it possible for you to insert any subject you want into a stable diffusion model. I made a similar video in November showing you how to use the Dreambooth extension in automatic1111 but since then a lo...
Updated guide that fixes the current broken extension. Finally
Mixing trained embeddings AND dreambooth of the subject seems to give very good results.
My embedding model is terrible, though, even when blurring and noising background.
My Dreambooth model is "okay" alone, but without embedding I can hardly get the subject at all when introducing style and scenery.
I guess my dataset could be better
What would this look like in a prompt?
Are you talking about an embedding of one subject AND the dreambooth of the subject? That legit sounds dope af
I JUST got my embedding to work and look like my subject after 27,000 steps of training.
Dreambooth has always worked like a charm for me, if you are using the current dreambooth model, it will not work, you have to use the models provided in that video
Aitrepreneur explains it well here: https://youtu.be/usgqmQ0Mq7g?t=1327
oo shit I havent finished watching the video lmao
I got stuck halfway trying to fix his old install which didnt work right off the bat for me
I am legit gonna become his patron, the guy's tuts are better than anyone else's out there
SE Courses helped me learn the ropes though:
Textual inversion :
https://www.youtube.com/watch?v=dNOpWt-epdQ
& Dreambooth :
https://www.youtube.com/watch?v=Bdl-jWR3Ukc
In this video, I am explaining almost every aspect of Stable Diffusion Textual Inversion (TI) / Text Embeddings. I am demonstrating a live example of how to train a person face with all of the best settings including technical details.
They both have slightly different settings, nothing extraordinary though.
lol, thanks for the thumb down
a pleasure
I hate those guys, him and olivio
Ppl always come here cuz their tuts dont work
Team Aitrepreneur ftw
They both use the X/Y plot script badly, though.
Prompt S/R should not be used like
token:keyword (and then values = keyword, 1.0, 1.5, 2.0, ...), which makes the first column of data useless,
but rather
token:1.0 (and then values = token:1.0, token:1.5, token:2.0, ...), which saves you one full column of data
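The reason the first column is wasted becomes clear once the search/replace is written out. A minimal sketch of how Prompt S/R is understood to behave in the web UI (the first value is the search string, and every value, including the first, is substituted for it); the function name, prompts, and placeholder word are made up for illustration:

```python
def prompt_sr(prompt, values):
    """Return one prompt per value, replacing the first value with each value in turn."""
    search = values[0]
    return [prompt.replace(search, value) for value in values]


# Placeholder keyword: the first column renders the literal word "STRENGTH".
wasteful = prompt_sr("portrait, (mystyle:STRENGTH)", ["STRENGTH", "1.0", "1.5", "2.0"])

# Real starting value: every column is a usable image.
efficient = prompt_sr("portrait, (mystyle:1.0)", ["mystyle:1.0", "mystyle:1.5", "mystyle:2.0"])

for column in wasteful + efficient:
    print(column)
```

The first list needs four columns to show three strengths, because its first prompt still contains the placeholder; the second list shows the same three strengths in three columns.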
Has anyone ever tried training different aspect ratios?
has anyone managed to successfully run tensorboard on either colab free or kaggle free ?
Yeah, that dreambooth video is solid
haven't seen the TI one, but I'd imagine it was solid, as well
Would anyone have any advice on an approach for training a style in the vein of the "lego city adventures" tv series? At the moment I am attempting it via a hypernetwork with about 200 images from the show. At about 10k iterations (still early I guess) I'm getting concerned that perhaps it's not learning very well. I've manually edited all the classifications so that they say "lego man" or "lego woman" if there is a character in the shot. And prefixed the text classifications with "a 3D render of...".
Do you also have completely grotesque faces with Textual inversion? Even after 2000 steps and a strength lower than 0.2?
Are there any Dreambooth projects capable of training on datasets using non-square images?
Not to my knowledge,
but it would change little (or even potentially degrade the result) and especially increase the use of VRam significantly
to train an embedding, for example on different kinds of dragons (ice, fire, black, undead etc.), is it better to set it to character or style?
Is there any way to upscale the image on a weak graphics card? I want this, but I get: RuntimeError: Not enough memory, use lower resolution (max approx. 960x960). Need: 2.0GB free, Have:0.5GB free
use the sd upscaler script in the img2img tab
I'm too stupid to use this
EveryDream2 trainer now on free/low tier Colab: https://colab.research.google.com/github/victorchall/EveryDream2trainer/blob/main/Train_Colab.ipynb
Also running on <12GB
Hello, can anyone tell me which finetuning technique is better if we want to train our model on a large dataset: dreambooth, LoRA, textual inversion, or EveryDream2trainer?
I would assume dreambooth probably?
happy to hear other suggestions if you want to give any
I haven't done much training myself, and I don't know what that last option on your list is, but Dreambooth is probably your best option if you want to train on a large dataset.
if you have a really large dataset you probably want a general case fine tuner like everydream, the trick is labeling the data though
dreambooth is good for training a face or a couple of things but tends not to scale well, not sure if LORA is really meant to capture many different things, but if its all one style or person maybe it works well and sometimes TI can as well
Actually we want to train a model on food images and generate with it
interesting
actually, at the moment we are planning to go with dreambooth, but we're not sure whether it is the best fit for us or not!
I don't even know what regularization class you'd use for food but I'm personally not that keen on the regularization process anyway
Thanks for all the hard work 
Hi. Kinda noob when it comes to finetuning / training models. There's a character I like but he's so obscure / unpopular there's almost no fan-art / official art of him lol.
I want to make a Dreambooth model with about 20 images of him. Which would be the best Google Colab repository / settings for this? I already tried with TheLastBen's fast Colab and got acceptable results on my first try, but it seems to be overtraining the model on my following attempts. Any help would be appreciated. Thanks in advance!
hey all. i'm looking to train a SD model on a particular art style (not specific subject). should i be using dreambooth or textual inversion?
I recommend LoRA 
are the new lora updates integrated into automatic1111 yet?
don't think so but kohya's extension is about as fast as built in loras now and much easier to use since you don't have to touch the prompt box
speaking of Lora, how are results lately for training (real) faces, compared to DB? my goal is photorealism and style-ability
I heard the AUTO1111 dreambooth extension got broken recently, has it been fixed yet?
@hot breach is it possible to train the model on colab, sorry to bother you
yes
link is there, hopefully the descriptions on the cells are enough to get you going
just need the low-tier GPU instance (16GB T4) and its mostly setup in a way that should work out of the box
Can someone please explain what this is, and if it's possible to make an embedding out of it for use with 1111?
https://huggingface.co/sd-concepts-library/sherhook-painting-v2
is there a way to keep SD from cutting object in parts like this:
Tiling is on but I'd rather have the big snowflake in the middle of the texture
Any of you know a software with which I can batch-remove watermarks? Like say load up a thousand images into the program, then switch with arrow keys between them and adjust the watermark removal mask individually for each image?
I have found programs where I can edit an image that way individually, but I have to load each image individually and then save before moving on to the next one which takes a lot of extra time. It would be a lot faster if I could load up all images simultaneously, then just switch between them via arrow keys or whatever, and adjust the mask for each image, and then save them all at once.
Kind of like how https://www.birme.net/ works.
I am willing to pay a small sum for a program that can do this but only in the paid version.
does anyone know a formula for how many repeats i should use for x number of images when training a lora?
or at least some examples of how many repeats for how many images people have used successfully
When it comes to fine tuning a model on different concepts with 2.1 and DreamBooth, can you train the model on different steps for each concept or does it have to be universal for all data?
And they have zero clue what they are talking about and they just throw in 100,000 for epoch
They just race to get a video on youtube for the ads and they even try to advertise their videos on github and huggingface wikis. That to me means they don't know jack shit lol
has anybody looked into merging specific parts of the unet from different model sources to capture different elements such as composition, lighting, texturing, etc, which the unet presumably focuses on at different scales? Some current mix-models have apparently gotten much better results doing that, with this extension being built for it: https://github.com/bbc-mc/sdweb-merge-block-weighted-gui
The orange mixs model https://huggingface.co/WarriorMama777/OrangeMixs creator says:
2.Added IN deep layers (IN06-11) to the layer merging from the realistic model (BasilMix).
It is said that the IN deep layer (IN06-11) is the layer that determines composition, etc., but perhaps light, reflections, skin texture, etc., may also be involved. It is like "Global Illumination," "Normal Map," and "Ambient Occlusion" in 3DCG.
Any reasons to use '512-base-ema.ckpt' vs 'v2-1_512-ema-pruned.ckpt' as a VAE? What does pruning really do, if anyone could explain please?
do you want pruned for generation, and full for when youre using it to do training?
Hello, can anyone help me understand the diagram of textual inversion and how textual inversion works?
How would I describe uncommon scenes when training? I want to train images for zombies (which are not recognizable as such, they are part of my game and might still look like normal people) penetrating other people with their hands. This is totally new, no model knows this. How do I write prompts that describe the scene? Should I write "A 40-year-old man puts his hand into another mans chest"?
You need to update pip, then it should work.
Excuse me, if I may ask. This is first time I trained Dreambooth in Colab. So far is okay, but could I convert it to LORA?
Uh sorry. I found LORA in different colab
Does anyone know if training 2 separate subjects in dreambooth and merging the models would work just as fine as training 2 subjects in a single model?
I ask because multiple subject training appears to be broken in my dreambooth commit.
Let's say I have 20-30 pieces of artwork like this sample (a concept drawing by the artist Maniani), and I want to use them as a dataset to train a model, maybe for both style and characters together. Is that possible?
I have tried embeddings and hypernetworks in Stable Diffusion (I haven't tested dreambooth yet due to hardware limitations, although LoRA was released recently), but the results are not very good.
With hypernetworks, the style can basically be trained: when I use txt2img to generate images, the art style of the results is acceptable. But it cannot generate a new pose or action that does not appear in the training images; most likely it is a combination of limbs, bodies and legs extracted from different training images. When I prompt for objects that do not exist in the training images, it mostly generates ambiguous shapes for those keywords.
In addition, when I use img2img, the results are also not ideal. My goal is for the trained hypernetwork to turn my input image (rough sketches, like line art filled with basic color without shading) into the desired style with details.
I don't know if dreambooth training would give me better results, so I will try it after I upgrade my graphics card.
So, whether embeddings, hypernetworks or dreambooth are used, I would like to know if I can train on more specific items, like only "eyes" or "mouth", or just facial expressions, or train a "pose" or "action". I am not sure if this kind of training is possible.
I hope these questions are not too much, and that someone is kind enough to point me in the right direction so I can find a good solution.
Does anyone have tips on training a character model on TheLastBen's Dreambooth colab? I keep either overfitting or getting weird generations :/
I trained a stable diffusion v1.5 model on ~100 hand-picked 768X768 images targeted to get the best close-up portrait pictures, as well as amazing-looking landscapes. Resolutions 1024x768px, 768x1024px, and 768x768 work best for this model.
HuggingFace: https://huggingface.co/Kaludi/ARTificialJourney-v1.0-768
@silk hare A short tutorial on how you made this model would be much appreciated! Keep up the good work!
Wouldn't a dreambooth model only affect those that use the same prompts you used for training?
For every other image it would still be 512, I know because I tried one with 1024
@split onyx Would you say this example is up to date? https://github.com/cloneofsimo/lora/blob/master/training_scripts/use_face_conditioning_example.sh
can anyone teach me how to load a github model to finetune it in thelastben dreambooth?
i dont know how to train this in a new character
eimiss/EimisAnimeDiffusion_1.0v is the correct path
thanks! am i putting it in the right space?
yes in the huggingface path
@ me if you need help
thank you so much friend it works
didnt know what link was
good luck then i hope it gives good results
If I have an image thats 512x1000 and I train it with Center Crop disabled, does it squish the image?
No, it will probably just end up looking like crap
Not to mention, it can crash the training. I accidentally had a few smaller pictures in my subject folder.
By the way, if anyone is having issues with dreambooth, giving the CPU allocator issue, I found out the reason.
Its because Aitrepreneur recommended you toggle on an option in the settings to transfer some stuff from VRAM to RAM. Even with 32 GB of RAM this was causing me to crash my training with generating samples.
Turn that shit off and problem solved. Took me weeks to figure out. Needless to say, I am officially no longer going to be his Patreon.
wondering if anyone has had any success training an image classifier. Looking for some advice on transfer learning, need to classify 40-50k images of an animated tv series
You should be able to finetune a CLIP fairly easily, is that not what you're looking for?
This guy here claims to be able to finetune faces for avatars in 3.5 minutes using only 12 images, 600 steps w/ dreambooth: https://www.reddit.com/r/StableDiffusion/comments/10a9tku/comment/j5p0o33/?context=3
any idea how he's doing it?
LoRA and LEAP can be around that fast
I imagine they're borrowing code from LoRA, would be my guess
where they're only training weights and not the full model
Any idea how many class regularization images i need for a model based on 300 images?
Are there any good resources for fine-tuning stable diffusion inpainting?
would like to know this as well if any one could help^
Easiest way is to follow these instructions:
https://www.reddit.com/r/StableDiffusion/comments/zyi24j/how_to_turn_any_model_into_an_inpainting_model/
@brisk swan
1) Is it more beneficial to use a character's name rather than describing their gender in the image prompt text files generated by BLIP for training?
2) For the text file prompts for the individual photos being trained, can I type one prompt that is vague and then a second prompt that is a copy of the first with more detail? For example, "woman wearing green leotard, red hat, posed with back to camera giving thumbs up," and then, "cammy white doing victory pose". Will that help it associate characters and outfits with names?
3) Do I need to use underscores to combine words like the BLIP and DeepBooru results do, or can I type any number of things without underscores anywhere in the text file, separated by commas? Must the main prompt be first and keywords last?
4) Should I use the BLIP results even if they are inaccurate, or am I doing the proper thing by reviewing and correcting them?
5) Will changing any images in the database botch the end results or can it be used strategically?
6) Same question as before but in regard to the training prompt. Do I need to keep the prompt I train it on the same through the entire session?
... continued
7) Checking the results of my training can be done by dragging the .pt file from my hypernetwork into a directory called embeddings and then calling the file name in the prompt, without the file extension, in txt2img and img2img, correct? I have not started training, still reviewing my BLIP results for a lot of images.
8) Is over 500 images too much? Or am I gonna have excellent results after running my computer for 36 hours?
9) Should I use hypernetwork or the other options? And I can use the .pt from hypernetwork in embeddings the same as a file from textual inversion, right?
I plan to eventually finish the documentation, I just want to quickly know these things so I can get started training though. Thank you!
For anime models:
1) Generally recommended to not use names. DB you would train on the rare instance token. Finetuning you would train on the various tags/captions and the character result would be implicit with the use of some of those tags. Including gender (e.g. 1girl) is recommended. For DB, that's what people use for the class token.
3) Depends on the repo. Many don't use underscores at all. I think the only one that did was the earlier WD models. You could get away with no commas at all, but "tag shuffling" can help training and that looks for things separated by commas. "Keep token" allows for the first X tokens to be kept when shuffling. This is important since the beginning tokens are weighted heavier than the ending tokens. Typically you'd 'keep' the rare token.
4) Don't use inaccurate tags, is a good rule of thumb. This just comes down to time/cost. For smaller datasets it's easy to correct, but for larger ones it might just not be worth it.
5) Probably depends on the repo and training method. People typically don't recommend it as it can mess with the learning, but I have no idea if there's a specific way to utilize that, since most people avoid doing that.
6) Same as 5. People generally don't since it can interfere with training. I've done it before and it broke the model I was training.
8) A good rule of thumb is to fail smarter / fail softer. You should really be starting with a small dataset and getting the hang of it before doing a large one, but I can't stop you. I can say that you can have really good results with under a tenth of that size.
For questions 9 and 2 I don't understand the questions
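For anyone wondering what "tag shuffling" with "keep token" looks like in practice, here is a minimal sketch. It assumes the "tokens" being kept are comma-separated tags (not tokenizer tokens), and the function name `shuffle_caption` and its parameters are made up for illustration, not taken from any particular trainer.

```python
import random

def shuffle_caption(caption, keep_tokens=1, rng=None):
    """Shuffle comma-separated tags, leaving the first `keep_tokens` tags in place."""
    rng = rng or random.Random()
    # Split on commas and drop empty entries / stray whitespace.
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # only the tail is reordered
    return ", ".join(kept + rest)

# Hypothetical caption: the rare token and the class tag stay at the front.
caption = "sks, 1girl, green leotard, red hat, thumbs up"
print(shuffle_caption(caption, keep_tokens=2, rng=random.Random(0)))
```

The idea is that the rare token always sits at the start of the caption, while the remaining tags show up in a different order each time the image is seen.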
i highly recommend making a 20 image subset of your training data and doing multiple training runs on that dataset trying out different strategies. using everydream 2 on a vast.ai or runpod 3090 a dataset of that size should train in just a few minutes
you should be able to learn enough doing that to know how to structure your larger dataset and prompts more generally
okay that seemed to answer all my questions for now. Where can I learn about tokens? I was about to start looking into textual inversion, can't find too much documentation though other than what was on the AUTOMATIC1111 git wiki tab.
i can answer 2), somewhat. CLIP is very weird, it won't behave like you'd uh, expect. it bizarrely seems to "know" things that are surprising, and then it is very stupid about other things. i told it that a bunch of photos of a VW Polo car were photos of "an sks car" and when i prompted "sks car" after training i'd get a VW Polo 100% of the time, but when i prompted "sks" i got a gun and "car" gave me some generic car
I get surprised by both its accuracies and inaccuracies, im trying to use tags that i see being recommended through BLIP
you probably want "cammy white wearing a green leotard and a red hat, posed with her back to camera, doing a victory pose with a thumbs up". i haven't done much testing with training humans but i'd expect you'd get training data attaching itself in useful ways to all of those terms.
i'm not sure how/if BLIP's text style relates to CLIP's. in fact if you're using OpenAI CLIP (SD1.5) nobody outside of OpenAI actually knows what their training data looks like
basically full english sentences work better than tag lists, the training process does (somehow, unreasonably) break up the prompt into chunks that actually make semantic sense when you try and prompt them later
don't think about it too much, just describe the picture
and then test what you're doing
and based on the test, try doing another pass of "don't think about it too much, just describe the picture"
I'm worried about having to use the same prompt to train. Does it matter how vague or specific the prompt is? like should i have a more narrow goal or is woman posing for picture good enough? Should I go ahead and get rid of the single word tags or keep them in the training txt files because they help? ive been spending a lot of time correcting the generated prompts but trying to keep them resembling what I generally get.
Im changing to a smaller database, im gonna have to get more of a feel for this as I go on, thanks for your help!
an insight that might help: when you train SD you're sort of doing img2img generations
every step is an img2img. the trainer takes your input image, does a one-step img2img on it with a random strength %, compares the generated image with the input image, and then pulls the model in the direction of your image based on the difference.
so anything you can put in a prompt that would make an img2img generation better will help the training, with the caveat that any difference between what that img2img would produce and your image will alter the terms in the prompt you use to do the img2img
so if you train "cammy white wearing a green leotard" it will be easier later to prompt "cammy white wearing a yellow business suit", however at the same time if you prompt "boris johnson wearing a green leotard" the green leotard he's wearing is going to look more like cammy's
oh, i'm assuming dreambooth training - if you're just doing a TI then only the TI term itself will get trained, but you should still caption the other details because it will help
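The "every step is an img2img" picture above can be sketched as a toy training step. This is a heavy simplification and every name in it is hypothetical: the "image" is a short list of numbers, and the "model" is just a learned per-pixel guess of that image. Also note that in the common noise-prediction formulation the loss is taken between the predicted noise and the noise that was actually added, rather than between two images directly, which is what this sketch does.

```python
import math
import random

def training_step(guess, image, rng, lr=0.05):
    """One toy step: noise the image, predict the noise, nudge `guess`."""
    strength = rng.uniform(0.05, 0.95)  # random "img2img strength"
    keep, mix = math.sqrt(1.0 - strength), math.sqrt(strength)
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [keep * x + mix * n for x, n in zip(image, noise)]

    # Toy model: infer the noise from the noisy pixels and its current guess.
    predicted = [(y - keep * g) / mix for y, g in zip(noisy, guess)]
    errors = [p - n for p, n in zip(predicted, noise)]
    loss = sum(e * e for e in errors) / len(errors)

    # Gradient of the mean squared error with respect to each guess value.
    scale = -2.0 * keep / (mix * len(errors))
    new_guess = [g - lr * scale * e for g, e in zip(guess, errors)]
    return new_guess, loss

rng = random.Random(0)
image = [0.2, 0.8, 0.5, 0.1]   # stand-in for a training image
guess = [0.0] * len(image)     # stand-in for the model's weights
for _ in range(200):
    guess, loss = training_step(guess, image, rng)
print([round(g, 3) for g in guess])
```

Each step pulls `guess` a little toward `image`, with the size of the pull depending on the randomly drawn strength. A real trainer does the same kind of update on the network weights, conditioned on the caption.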
Yo
I have made the "Dreambooth" extension to work in the webui on google colab
I think I'll share it soon
Really? Haha, that was exactly what i came in here for a minute ago, to see if someone else also had problems getting it to work
Yea
After many tries
and I'm not even a coder :)))
But I used my logical thinking
Haha, know the feeling. Was doing the same thing all night last night. Went to bed at 5AM reluctantly, had to get to work at 7, so had to admit defeat on that task. Just sat down again to get back into it ^_^
Now I'm looking into making this to work in google colab https://github.com/Klace/stable-diffusion-webui-instruct-pix2pix
Nice!
Editing images with only one prompt
I have a new model i need to train, then i was planning to check that one out as well
The proof
I even tried it with some cat training
It works
but I canceled, because the training was taking 30-40 mins
im not getting lastben's fast-dreambooth to work, so i was hoping i could get DB to run in the webui instead and hopefully it will accept all the captions for my model.
Sweet!
Found a way to make some extensions to work on google colab webui
Was it hard to solve?
like "prompt generator", "model converter" etc.
Idk. It took some time
knowing that I'm not a coder
Haha, me neither, but with logic and google on your side anything is solvable
Im gonna try fast-db once again. I have chores to do, and i was hoping to have a model training in the meantime. #has put everything on hold at home while trying to be able to train a model first, for the last two days
From what I have tried, the extension is better
I mean it really showed images like the one from the training set
And I didn't even change the settings
I used the default ones
Yeah, also heard that. But noticed yesterday that the extension wasnt working anymore in google colab
Great work!
Thanks!
Feels good to know that i wasnt the only one with the problem, and also that it actually IS solvable
im planning on using a hypernetwork, I have not learned many other options yet.
Hypernetworks are the least precise and the worst option
Use textual inversion, dreambooth or lora training
I've been working mostly with human models. Now im trying to go deeper, more into also catching body patterns, facial expressions, postures, body sizes, every trait that resembles that specific person. Glad i have a girlfriend that bears with me on this
Just been using dreambooth though. Tried some hypernetwork, but didnt get a good result at all. And haven't even tried textual inversion yet. Just read about LoRA today though, so was thinking of trying to figure that one out.
From what I heard, dreambooth is the best atm
followed by textual inversion/lora
Promptgen working
happened to me too
like 1 hour ago
i had to write captions manually
yeah, i had done that also. but in the files: 2967 images, and i had edited them all by hand. Reeeeally dont feel like doing it in the browser also..
#grumpy
Well.. dont have time to fix that issue now. Just have to run the training without my captions and have a test-run.
there.. now its at least running. Training time 7 h, 48 min.. without the captions.. wish me luck!
yeah, impossible to write manually but if i recall correctly in the gdrive theres a folder inside fast-dreambooth that says "captions"
also a zip file
Hmm... damnit. now i have to try once more
perhaps i can just manually place the captions in the session folder, and edit the captions.zip manually, and it could work
yeah i was thinking that
it should be the same, better than training without captions
It looked like dreambooth was confusing the txt files with the png files. But this way they really get separated. Might work!
Done, lets see if DB gets along with it!
Wooh!
I think it actually worked!
we'll see in 8 hours then
well.. right now, with the captions it says 45 h and 48 min
Guess i have time to do the dishes now at least
I keep getting this error when I try starting up the learning
storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage).storage().untyped()
RuntimeError: PytorchStreamReader failed locating file data/0: file not found
i remade the pt file
it solved it!
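A `PytorchStreamReader failed locating file` error like the one above may point at an incomplete or corrupted file, which would fit with remaking the .pt file fixing it. A rough sanity check, assuming the file was written in the zip-based container format that recent `torch.save` versions use by default, needs only the standard library. The function name, the `data/` record check and the file name `my_hypernetwork.pt` are illustrative assumptions.

```python
import zipfile

def check_checkpoint_archive(path):
    """Return (ok, detail) for a zip-format checkpoint file."""
    if not zipfile.is_zipfile(path):
        return False, "not a readable zip archive (missing, truncated, or legacy format)"
    with zipfile.ZipFile(path) as archive:
        bad = archive.testzip()  # name of the first member with a bad CRC, or None
        if bad is not None:
            return False, f"corrupt member: {bad}"
        names = archive.namelist()
    # Assumption: tensor storages live under a data/ folder inside the archive.
    if not any("/data/" in n or n.startswith("data/") for n in names):
        return False, "no data/ records found"
    return True, f"{len(names)} records"

# Hypothetical file name; a missing file is simply reported as not readable.
print(check_checkpoint_archive("my_hypernetwork.pt"))
```

If the check fails, re-saving or re-downloading the file is probably quicker than trying to repair it.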
question, how do i change the res of generated classification images in the DB extension? im trying to make a model using 2.1
That is what im asking about yes
is dreambooth really the best? it seems subjective. a lot of people are saying everydream v2 is better for fine tuning. i have only done a few hypernetworks, and a couple of everydream models. so havent yet tried dreambooth.
hmm.. just heard about everydream today actually, but from what i saw now it seems to be worth a try later on though! Especially when working with larger image sets
yeah, my last attempt using it was with about 300 images and was "ok".. so trying to get a dataset classified now with about 30 - 40k.
i had some really poor results yesterday with a hypernetwork and the character "admiral ackbar" from star wars.. so might try again with dreambooth.
Oooh.. nice! That would be interesting to see the result from when you've tried
Damn.. 5.13 AM today again, really need to turn my day rhythm back to normal again. Good night!
sure thing.
pretty bad - but will give dreambooth a shot later today instead..
if your results are bad it's probably more down to bad captioning and/or incorrect learning rate and/or too much training. i've been getting excellent results with <20 images and <150 steps with batch size 4. don't waste time collecting and captioning more images until you can get usable results out of 15 images.
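To put those numbers in perspective, here is the arithmetic for how often each image gets seen. It assumes every step consumes one full batch, with no repeats and no gradient accumulation. The helper name is made up.

```python
def passes_over_dataset(num_images, steps, batch_size):
    """Average number of times each image is seen during training."""
    return steps * batch_size / num_images

print(passes_over_dataset(20, 150, 4))  # 30.0
print(passes_over_dataset(15, 150, 4))  # 40.0
```

So 150 steps at batch size 4 is 600 samples, which is already 30 to 40 passes over a dataset that small.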
That was 11 images. Limited resources. I've had good results with hypernetworks before actually. But the reference I was allowed to use was quite poor.
Shouldn't just assume my captions are rubbish
The learning rate was .00005
But I've used hypernetworks enough to see when it's not learning the way i want it to. Hence I thought I'd give dreambooth a try.
The top image above was def over training - after 30 steps. The bottom was after 10k
So Obv I was testing dif amounts of training also
I can't show the reference I was using because of an NDA - but I figured there wouldn't be an issue with these results given how off they were
idk about hypernetworks or TI but that's a very high learning rate for everydream2, at least 10-15 times higher than what's recommended. does your trainer have a proper validation split? if it doesn't you're going to have a hard time finding the correct LR
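For context, a "validation split" here means holding back a few images that are never trained on, and measuring the loss on those to judge a learning rate or a saved checkpoint. A minimal sketch, with hypothetical helper names and made-up loss numbers purely for illustration:

```python
import random

def split_dataset(paths, val_fraction=0.15, seed=0):
    """Hold out a fixed, reproducible fraction of the images for validation."""
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

def best_checkpoint(val_losses):
    """val_losses maps training step -> validation loss; pick the lowest."""
    return min(val_losses, key=val_losses.get)

train, val = split_dataset([f"img_{i:02d}.png" for i in range(20)])
print(len(train), len(val))

# Made-up numbers: validation loss falls, then rises once overtraining starts.
print(best_checkpoint({100: 0.21, 200: 0.17, 300: 0.19, 400: 0.26}))
```

Training loss alone keeps going down while the model overfits, which is why the held-out images are needed to see where things turn.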
It isn't everydream2
can you post some sample captions? not assuming they're bad they just might not be the best for what you're trying to train
.00005 learning rate is fine, then when it starts to overtrain you lower the rate and continue, that's standard practice
The main issue is the reference. They are of the actor from rotj in his suit from when they were making the film
you're trying to train his face? are your captions describing the background and what he's doing?
If it was just his face that wouldn't be a problem. Need the full body.
And then the goal is to put him in the stance with his arms and hips also from the film
I don't need the background so I've left it pretty vague
that's a mistake
Also, for the record he is captured on a white background
SD needs you to tell it the background so it knows what to ignore
When they were doing the costume tests for the film
I've said on a white background
ok i can't help without sample captions and you seem petulant/resistant to my advice and experience, goodbye
Lol. Im not at my computer so no I can't provide a caption. No need to be so irritable on a forum.
I also clearly described what the background was given it was white from the photos they took off set.
yet you are getting bad results. ergo something must be wrong. you don't know what, i suggest some things that might be a problem, such as too high learning rate. you shrug it off as "nah that's what everyone uses". idk what to say mate
Dude, I said it was the reference. And said I was not going to use hypernetworks next time. Maybe you didn't read the context above the images. You also said you don't know hypernetworks so maybe you don't really know?
just try a lower LR and see what happens.
Did you not read the part where I said I lower it once it looks like it's overtraining? When I am giving dreambooth a go I will be using a totally different learning rate
try lowering it from the beginning
once it looks like it's overtraining it is already overtrained
that might be your problem
I save out the hypernetwork every 100 iterations. So I go back to before it was overtraining
But this is irrelevant - since I won't be using hypernetworks on it
Next time
yes but if you start with a too high LR the network has already baked bad data into itself
Ok. Well if i go back to hypernetworks next time - if dreambooth doesn't work well - then I will start with a lower rate
But I'm fairly sure the results will be better with dreambooth
sure, if you pick the right LR and have a good captioning strategy and don't overtrain
Yes, with those things in mind..
these systems are all the same under the hood
different degrees of flexibility, that's all
Yeah, well above (before the images) I was merely saying I was looking forward to trying it out in dreambooth for comparison. I've had good results w hypernetworks in the past, and good results with every dream