#✨|sdxl
you have 77 tokens, each token consists of the encoding from CLIP-L and CLIP-G. So if your prompts are "a dog" and "national geographics" then you get two tokens, one "a"+"national" and one "dog+geographics". These tokens are then used in cross attention
tokens or embeddings?
@visual glade I've been sketching out an easier-to-use API for comfyui. Something that's like the prompt parameter, but where nodes can be named instead of numbered, and where node inputs are named instead of indexed. I was going to just put that in my own code and use it myself (all Rust), but is this something you'd be interested in getting a PR for?
i doubt national is one token 😛
node inputs are named in the api though?
just to give an example: if you make one prompt as "a dog" and another is "a cat" then you have one token that is "cat+dog". So each pixel in the image has to be assigned to "cat+dog", it cannot be assigned ONLY to dog or ONLY to cat
Hm? Hang on.
however you want to call it. I talk about the set of vectors that are used in cross attention
is overwhelmed by all the conversations at once
I added an option to export in api format in the latest version, to see it enable the dev options in the settings
"22": {
"inputs": {
"add_noise": "enable",
"noise_seed": __SEED__,
"steps": __STEPS_TOTAL__,
"cfg": __BASE_CFG__,
"sampler_name": "dpmpp_2m_sde_gpu",
"scheduler": "karras",
"start_at_step": 0,
"end_at_step": __FIRST_PASS_END_AT_STEP__,
"return_with_leftover_noise": "enable",
"model": [
"10",
0
],
"positive": [
"75",
0
],
"negative": [
"82",
0
],
What is the "0" in positive?
they're not even doing pixels, it's latent space. I believe that what is happening is the multiple tokens are turned into encodings with cross attention before they are combined for the image guidance
nope. And yes, not pixels but latent pixels, which is almost the same just bigger
Does anyone have an idea why some ints would be incompatible with each other in comfyui?
There's so many... Terms being stated.
Anyone happen to know a guide on the technology of diffusion, specifically written for dunces such as i
So that is to say, your CLIP-L and CLIP-G prompts both have full encodings and at that point they can simply be added together to form a final vector
or vectors
There's no such thing as "pushing people" here. There are just different tools, some fit certain jobs better than others. Sometimes fundamentally, sometimes because it's more developed or mature. People are free to choose though.
The only kind of pressure you can possibly apply involves A1111 and Vlad. The better your software gets, the more important it is for them to stay on your level. But that's a very good kind of pressure, if you ask me! The community surely benefits from that.
yes. I mean, concatenated, not added.
primitive nodes need improvement
no matter how many disclaimers i put on my code repos that it's highly tuned for my specific needs. people insist on trying to use it
so each token in your input prompt becomes a vector. The k-th token in the CLIP-L prompt is concatenated with the k-th token in the CLIP-G prompt
Do you know any tricks to solve it? Like editing the json file maybe
each token in a prompt does definitely NOT become a vector
(and with the pooled prompt from CLIP-G)
or these AIs wouldn't work at all
no
because it always uses 77 tokens, so they are filled with blanks
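A minimal sketch of the fixed-length padding being described here, assuming a 77-slot context. The token ids and the `PAD_ID` value are made-up placeholders, not the real CLIP vocabulary, and `pad_to_context` is a hypothetical helper.

```python
CONTEXT_LEN = 77  # fixed context length mentioned in the chat
PAD_ID = 0        # hypothetical "blank" id; real encoders use their own value

def pad_to_context(token_ids, context_len=CONTEXT_LEN, pad_id=PAD_ID):
    """Truncate or right-pad a list of token ids to exactly context_len."""
    clipped = list(token_ids[:context_len])
    return clipped + [pad_id] * (context_len - len(clipped))

prompt_ids = [11, 22, 33]  # made-up ids for a three-token prompt
padded = pad_to_context(prompt_ids)
print(len(padded), padded[:5])
```

Every prompt ends up the same length, so the blanks are always part of what the encoder sees.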
that's the index of the output on the previous node, that's defined as a tuple in the node def so it should be fine
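A rough sketch of how the exported inputs could be read, assuming a two-item list like `["75", 0]` always means "output 0 of node 75" and anything else is a literal value. The helper names are hypothetical, and the `steps` value is a made-up number standing in for the placeholder in the pasted fragment.

```python
def is_link(value):
    """A link is a two-item list: [source node id (str), output index (int)]."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], str)
        and isinstance(value[1], int)
    )

def describe_inputs(node):
    """Split a node's inputs into literal values and links to other nodes."""
    literals, links = {}, {}
    for name, value in node["inputs"].items():
        if is_link(value):
            links[name] = {"node": value[0], "output_index": value[1]}
        else:
            literals[name] = value
    return literals, links

node_22 = {
    "inputs": {
        "steps": 30,            # made-up value, the original uses a placeholder
        "scheduler": "karras",
        "model": ["10", 0],
        "positive": ["75", 0],
        "negative": ["82", 0],
    }
}

literals, links = describe_inputs(node_22)
print(links["positive"])
```

A naming layer like the one proposed earlier could sit on top of this by mapping output names to these indices.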
I expect it won't change (often), but that's what I meant I wanted to name.
maybe some of the vectors become 0 after cross attention?
It's named on the UI, after all.
you only have to pad if they're different lengths
nope
at least they can't be vectors with any strength
yes, their attention can be 0
it's named but some nodes might have the same name for multiple outputs
but I'm sure the "blank" tokens are still used somehow
you could. But it's not done
ive always wondered about token concatenation, how does it work in regards to not using concatenation? say if i prompt a scene with 75tokens. then i prompt with those same tokens but add 100 more to describe it more finely. Does the second one have less weight on the original 75 tokens or something? or does concatenation work flawlessly and can just simply overcome the 75 token limit and also add the extra tokens?
you will get different results when using no padding. However, it would be more performance efficient without padding for sure
LLMs don't work in such a way that they consider a catalog to be a cat and a log
Honestly I feel like that shouldn't be allowed...
hm? I never claimed that
hi dude, I was just messing around with comfyui, it's really powerful ✨
do you have a discord / community somewhere?
yes it's on matrix: https://app.element.io/#/room/%23comfyui_space%3Amatrix.org
actually I have no clue how the >75 token prompt extension in auto1111 works. I don't like to look into their messy code
thanks 👍
there's some cases where a node can return 2 of the same types of objects in that case it makes sense for the name to be the same
Only if they're completely equivalent, I think.
Is the core idea you're trying to convey that you basically have a prompt 'a cat licking a dishwasher' and the CLIP-L and G are creating encodings like 'a a cat cat licking licking a a dishwasher dishwasher' and cross attention is applied afterwards?
no
your prompts are
[a, cat, is, licking, a, dishwasher] CLIP-G
[a, cat, is, licking, a, dishwasher] CLIP-L
and then your vectors are [a++a], [cat++cat], [licking++licking], and so on
but this does not mean you concatenate words or something
you just concatenate their vector embeddings
basically for SDXL they couldn't decide which text embedding works best, CLIP-L or CLIP-G, so they just used both
can anyone tell me what ToBasicPipe is? i imported a workflow and it showed me that node is missing and i can't seem to find where to install it
For prompt extensions, would it make any sense to just... average them?
except the output vector embeddings don't match that much with the words
the cross attention gets one token where "cat" is encoded once by CLIP-L and once by CLIP-G
yes, probably
that's why it doesn't matter
if no one here can then i'd ask whoever you got the workflow from
encoder1: "a dog" --> [1, 2, 3]
encoder2: "a dog" --> [4, 5, 6]
encoder1: "a cat" --> [7, 8, 9]
encoder2: "a cat" --> [10, 11, 12]
[
[[1, 2, 3], [7, 8, 9]], # embeddings from encoder 1
[[4, 5, 6], [10, 11, 12]], # embeddings from encoder 2
]
[
[1, 2, 3, 4, 5, 6], # embeddings for "a dog"
[7, 8, 9, 10, 11, 12], # embeddings for "a cat"
]
an example of how it looks.
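The same toy numbers as a runnable sketch. It assumes each encoder returns one vector per position and that the k-th vectors are joined along the feature axis; `concat_encoders` is a hypothetical helper, and the three-number vectors are just the illustration values from the example.

```python
# Embeddings from the example, indexed [encoder][prompt].
stacked = [
    [[1, 2, 3], [7, 8, 9]],       # encoder 1: "a dog", "a cat"
    [[4, 5, 6], [10, 11, 12]],    # encoder 2: "a dog", "a cat"
]

def concat_encoders(per_encoder):
    """Join the k-th vector of every encoder along the feature axis."""
    return [
        [x for vec in vectors for x in vec]
        for vectors in zip(*per_encoder)
    ]

combined = concat_encoders(stacked)
print(combined)
```

The number of vectors stays the same, only each vector gets wider. In SDXL the two widths are, as far as I know, 768 and 1280, giving 2048 per position.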
Then you would not need the tokens at all
Bug report: it doesn't work.
Bet you get those too.
no because i'm a genius that disables the issue tracker
you only need the tokens because of language
Lmao
human language
I doubt so
if you trace back which pixels attend to which tokens you clearly see that the tokens still keep their meaning
in the sentence "a cat is licking a dishwasher" you will still see that the latent pixels in the image that belong to a cat are more strongly associated with the word "cat"
you could connect the latents to the tokens in a way, but the tokens themselves aren't relevant
a rose is still a rose
maybe we just talk about different things, cause I don't use the word token precisely
I talk about the vectors in the cross attention
each vector is connected to a token in the original sentence
just because the vectors are concatenated doesn't mean they lose their meaning, i think is what kai is trying to say. but that's precisely why it works to have "a dog" in one prompt and "national geographic" in the other.
and of course, some words will consist of more than one token, then their vectors are probably very similar to each other
yes, exactly
I think it works because you can mess up with SDXL in soo many ways and it still works
and yes, sometimes messing up makes it even better
I just say that it is strange because you technically align both prompts with each other token by token, although this alignment has no meaning
I guess I'm trying to understand what kai is saying means wrt the 2 clips
yeah that's why i said it's not THE way, it's just Different, and that allows you to access a wider subset of the data distribution than just doing it the same way every time you prompt the models.
my refiner doesnt run automatically in comfyui, is there a button I am missing?
for example I wonder if the following works similarly well:
That is a good explanation, I thought maybe they kept CLIP-L because of the knowledge loss witnessed with OpenCLIP in 2.x
https://github.com/SytanSD/Sytan-SDXL-ComfyUI
use this for reference, to diagnose any problems
OpenCLIP didn't have much knowledge loss, if anything it knows more and is more precise than CLIP-L
CLIP-L: "a cat is licking a dishwasher BLANK BLANK"
CLIP-G: "BLANK BLANK BLANK BLANK BLANK BLANK national geographics"
anyways, I guess there is still a lot to experiment
Even if it's vectors and not words, do they really empty fill the vector space like that?
I just wanted to say that it's not so obvious that you can use two different prompts and, therefore, I don't find it shocking when auto1111 has not implemented that yet
yes
it's extremely inefficient, but I think you cannot simply drop that
that's how unconditional guidance works
it ensures you still get unseen prompt features aiui
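For anyone following along, a small sketch of the usual classifier-free guidance mix, where the unconditional prediction comes from the blank prompt. The vectors are made-up numbers and `cfg_combine` is a hypothetical name, so read it as an illustration of the formula rather than anyone's actual sampler code.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: start from the unconditional (blank prompt)
    prediction and move toward the conditional one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond_pred = [0.0, 1.0, 2.0]   # made-up noise prediction for the blank prompt
cond_pred = [1.0, 1.0, 0.0]     # made-up noise prediction for the real prompt

same_as_cond = cfg_combine(uncond_pred, cond_pred, 1.0)   # scale 1: no extra push
pushed = cfg_combine(uncond_pred, cond_pred, 7.5)         # larger scale: stronger push
print(same_as_cond, pushed)
```

With scale 1 the result is just the conditional prediction. Larger scales extrapolate past it, which is why the blank-prompt branch matters.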
ComfyUI SDXL 0.9
cause due to the transformer layers you not only change the text tokens but also the blank tokens
SDXL loves forehead adornments
like the embeddings for BLANK might still contain knowledge about "a cat is licking a dishwasher"
when you misalign the timesteps of the two models, it does that reliably
I asked for it though lol
oh, i was just stating something randomly
so I guess when you would remove all blank tokens, the model loses expressive power
™ terms and conditions apply
forehead jewels
misaligned timesteps are badass
it kind of "cracks" the image apart
you can see these thick black sharp lines form on faces quite often
@visual glade one of the best aspects of AUTOMATIC1111 getting SDXL support is that it seems to be putting a rush on resolving issues that cropped up with SD 2.0 support there, and were never resolved
like models not loading the correct VAE or having hidden errors that just silently fail to load the model, fallback to the prev model etc
idk why --no-half-vae isn't the default now, the only model that works with that is 1.5
probably because that's their solution to the high vram usage of the vae
So is that to say if you could diffuse on just the tokens inputted you'd basically get cat, licking, dishwasher related output only, twisted and stretched to fill every latent?
it has VAE tiling and slicing but they're not enabled by default either kek
...tempting
I'm pretty sure it doesn't actually and it's an extension
the fewer tokens you have, the less there is to "pay attention to" and Weird Stuff Happens
Could it be since the models are trained with tons of empty tokens, that 'bulk knowledge' goes into 'empty token space'
that is what caption dropout aims to do
the zero or one vectors are tweaked by the inputs that have no captions to them - the empty caption is replaced with all zeroes or all ones, depending on the encoder, since they use different tokenizers
this i think is how a lot of models end up improving their negative latent space so that you don't need negative prompts
you can run the text encoder on very high quality data with caption dropout around 5-10% and it definitely stops needing negative prompts. but without a very diverse set of captioned images, you will start losing knowledge that exists in this 'empty space'
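A minimal sketch of caption dropout as described, assuming the dropped caption is simply replaced by an empty string before encoding. The function name and the 10% rate are illustrative, not taken from any particular trainer.

```python
import random

def apply_caption_dropout(captions, rate, rng):
    """Replace each caption with an empty string with probability `rate`."""
    return ["" if rng.random() < rate else caption for caption in captions]

rng = random.Random(0)  # seeded so the sketch is repeatable
captions = ["a cat licking a dishwasher"] * 1000
dropped = apply_caption_dropout(captions, rate=0.1, rng=rng)
n_empty = sum(1 for c in dropped if c == "")
print(n_empty, "of", len(dropped), "captions emptied")
```

The images behind the emptied captions are what end up shaping the blank-prompt output.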
the answer is yes XD
there's even a second dishwasher there!XD
Glad to see there's no non-cat non-dishwasher in the latent space
it's just that they didn't ask for a coherent cat, so, that didn't happen
when you don't use classifier free guidance, your output is pretty well-pinned to the prompt
if you don't ask for something, it doesn't happen
unseen prompt features? fuck 'em. never needed em 😄
Bruh that’s a washing machine not a dishwasher😂
dish, washer
that's a hybrid clothes dishwasher
Is that really what happened here?
have you not seen the way my family eats? a hybrid dish/clothes-washer would be awesome
I don't know the exact tokenization but it wouldn't surprise me if it were something like that; lots of washer associations out there that don't have to do with dishwashers...
and is that because of insufficient self-attention?
trying to get a powerful dishwasher and the outputs now make a lot more sense to me. it always seems to be a powerwasher version of a dishwasher
Incoherent cat would suffice!
Cross Attention, probably 🤣
oh my god lmao please try and get the self-attention guidance pipeline working with SDXL
what I like to do for fun with negative prompts... generate with no prompt, and then add a negative based on what the unconditional output looks like
SAG scaling never worked for 1.5 or 2.0 or 2.1 but DAMN IT things can be different
oh so is clip not doing that at all when encoding?
read the SAG paper
screen actors guild?
doth thee question that ai gods understanding of the dishwasher?
yeeeesss, i managed to do this after 20 attempts lol, someone on reddit said it was impossible
thanks ok
impossible in a reasonable time limit on a 1660 maybe
each image takes 4 minutes, so that's about an hour and 20 minutes for 20 attempts
haha. they are so non-responsive to issues. i have already opened some. that code, does not work.
SAG? More like SAD!
I had no idea that there wasn't a self-attention component to all these diffusion impl. I realize that self-attention wouldn't be perfect, but thought it was doing some.
It was really cool when it worked
SAG only works on v1.4 with non-square resolutions, on v1.5 with square resolutions, and nowhere else
this image was created with the bot, but i have a 2070 super 8gb vram and with 5 steps base and 15 on refiner it's 15 sec
i got tired of keeping track of which models do/don't work with it.
the guy on reddit said that sdxl wasn't capable of creating that image
the 1660 has no Fp16 support so it runs in single-precision mode (fp32) and uses more VRAM, which hits the 6GB VRAM limit sooner, which puts more burden on memory bandwidth/latency due to the extreme number of page transfers
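Back-of-envelope version of the VRAM point, counting weights only. The parameter count is a hypothetical round number rather than a measured model size, and activations are ignored.

```python
def weight_bytes(n_params, bytes_per_param):
    """Memory needed just to hold the weights, ignoring activations."""
    return n_params * bytes_per_param

N_PARAMS = 1_000_000_000  # hypothetical round number, not a real model size
GIB = 1024 ** 3

fp16_gib = weight_bytes(N_PARAMS, 2) / GIB   # 2 bytes per weight
fp32_gib = weight_bytes(N_PARAMS, 4) / GIB   # 4 bytes per weight
print(f"fp16: {fp16_gib:.2f} GiB, fp32: {fp32_gib:.2f} GiB")
```

Whatever the real count is, fp32 doubles the weight footprint compared to fp16.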
that sucks
so a single image takes 4 minutes on a 1660
yeah it sucks but it's better than what LLaMA would do on that hardware, which is, nothing.
Does anyone know, does even encoding do self-attention?
encoding ONLY
in these impl
"sir mitten, a 1 year old kitten, taking on the adventure of prevailing over its archnemesis the dishwasher "sir brrrrrrs a lot""
Self-Attention Guidance (SAG) is an advanced method that uses a model's own attention maps to improve the generated images. You can think of an attention map as a heat map that shows the parts of the image that the model is currently focusing on. By blurring only these parts of the image, the model is better able to focus on the most important features of the image, which leads to better results.
fine-tuned sdxl will be the same as midjourney 5.2 and sometimes maybe a bit better
SDXL fine-tuned or not isn't dependent on the developers to develop new features like with MidJourney. People seem to celebrate new features in MidJourney that have been in SD for a year.
finetuned sdxl will be expensive as hell, as you prob dont want to run it 24gb vram - essentially locking you out of prosumer options, now that the A6000 is also more expensive...
OK so what I'm asking is this, if I typed 'a cat licking a dishwasher' I could maybe type 'ing cat a washer dish lick a' and the only difference would be the vectors having a different order, there's no attention applied prior to sampling?
SD is more esoteric
(I understand it's not equivalent)
will i be able to run finetuned sdxl in 8gb vram?
for inference? with the refiner and the base? maybe. with just the base? absolutely
some huge LoRAs might end up increasing VRAM requirements depending on how they're handled during runtime
🙏
instructions unclear. wet cat stuck in dish
Yeah but that's because the model was trained with the tokens in a non-fucked order 🙂
So that's where the non-equivalence comes in
stable doodle is really good
I always assumed the encodings would be wholly different
because of LLM self-attention
but now I see it's not used that way
I guess
it's raw encoding only...
at least I can confirm that teaching a lora with a lack of order in words works - under the condition that you add commas (and turn on shuffle!)
OK whelp
ahhh finally got a LoRA to kind of work, maybe I overbaked it? 1e-5 LR and 90 epochs
oh hell yeah you overbaked the living fuck out of that
that's ... surprising lol
OK well now I'm sad we don't have self-attention at any part
what do you suggest?
pic?
@sudden cliff same lol it was pretty good for 1.4
hi
@inner ruin i was going to suggest you ping Caith 😛
OK now I'm really wondering how DeepFloyd IF did as well as it did. Did that have SAG?
Because if DF IF / imagen work as well as they do without SAG, that is very interesting
it only works kind of though. The moment I change the prompt too much it just defaults back to random girl
it's weird cause I used to get faces really well in 2.1 and 1.5
while there are better options, overfitting is a lot easier for now
so literally do what I'm doing? 1e-5 and 90 epochs? lol
base -> undertrained on faces
refiner -> hella overtrained on faces
it averages out. but you want to plug your lora into the middle of that ideally, which doesnt exist
v1.0 prob fixes that, and then training faces should become easy
yeah I can't train on the refiner right?
is your suggestion to just... wait? lol
ok DF IF DOES use SAG
The problems with MJ don't even start with advanced tools. It's mediocre when it comes to following the prompt, either due to the models they run or due to all the sneaky additions they make under the hood. SD doesn't do that.
Midjourney has an LLM prompt expander running in the background
nop.
so easiest solution I found for now, is overtrain the face only - base model wont break for A LONG time, so you shouldn't have issues there, then around 600% it makes nice faces
SD can have it too. The difference is, you can control that here, but can't control it there.
oh ok, which parameters did you find work best? and how many training images?
nop to refiner
yes to wait - not worth producing a good workflow, for something that fixes itself in 4 days
yeah exactly. They were very smart though because 1.5 was so hard to prompt, so the LLM layer made their product more accessible
1e-3
8/1 rank/alpha
training images - 30 to eliminate any possible problems. more images = better results, up to around ~150 at which point it just takes longer
if you have less than 30, it can work, just try a bit around, and see if you can avoid training the background as well XD
under 10, make sure your captions are good, and always caption the background!
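To show what the 8/1 rank/alpha setting does numerically, here is a toy sketch. It assumes the trainer scales the low-rank update by alpha / rank, which is the common LoRA convention. The matrices are made-up all-ones values and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(down, up, rank, alpha):
    """Low-rank weight update: (alpha / rank) * up @ down."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(up, down)]

rank, alpha = 8, 1                         # the 8/1 rank/alpha setting
down = [[1.0] * 4 for _ in range(rank)]    # toy (rank x in_features) matrix
up = [[1.0] * rank for _ in range(3)]      # toy (out_features x rank) matrix

delta = lora_delta(down, up, rank, alpha)
print(alpha / rank, delta[0])
```

Under that assumption, alpha 1 with rank 8 multiplies the update by 0.125, which may be part of why a learning rate as high as 1e-3 is tolerable.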
Final actual question for anyone that knows:
Is the lack of SAG why MJ, SD, SDXL, Dall-E2 cannot do 'a boy with red hair and a girl with blue hair' (extrapolate with various associations) reliably (and DF IF can)?
that makes sense, like regular 1.5 LoRAs
I find this surprising though, if the goal is to overbake. Do you just run it for like 200 epochs or something?
there is self attention. In the text encoder and also in the unet
yep.
my 2B lora, where I trained a face as well for it - should have been done around 50 epochs, but I let it run to 350 to 'fix' the face. remaining model didn't suffer any damage though, clothing even improved and didn't get overbaked
OK well everyone just got done telling me there is not
In a transformer text encoder (as in GPT models), self-attention is used to capture dependencies between all words in a given text regardless of their position. For each word, it computes an attention score for all other words to determine their relative importance. The word embeddings are then weighted according to these scores to produce the final output. This mechanism allows the model to focus on relevant parts of the input sequence when generating each word in the output sequence.
The U-Net architecture is typically used in tasks such as image segmentation, where the model needs to output a pixel-wise classification of an input image. The self-attention guidance in a U-Net-like model isn't used in the same way as in a transformer model. Instead, it is used to better incorporate global context and guide the generation process in diffusion models. This guidance helps to improve the image generation quality by allowing the model to attend to different parts of the image at different stages of the generation process.
they're different and not the same form of SAG.
OK this is what I was asking
terminalx talked about SAG (Self Attention Guidance) which has nothing to do with self attention
That's why I kept saying self-attention not SAG (which i didn't even know)
i'm stupid and use the wrong words sometimes
OK so the encoding does use self-attention, I am satisfied then
I was horrified about that mainly
(I was thinking that other than how it was trained, token order didn't matter AT ALL)
basically anything that isn't face, body, anatomy, eyewear - is super easy to train. but those few things require workarounds to get working. at least faces will change to easy to train as well on v1.0
will it change because they're merging the models / ditching the refiner?
but the transformer layer does calculate without focusing on the order of the tokens. it's just that when words are in different order they can tokenize differently.
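A toy sketch of the order point, assuming a single attention head with identity projections and no positional information. In that stripped-down setting, shuffling the inputs just shuffles the outputs. The real CLIP text encoder adds position embeddings on top, so there the order does change the result.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head attention with identity projections:
    each output is a softmax-weighted mix of all input vectors."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # made-up embeddings
perm = [2, 0, 1]                                 # new position -> old position
shuffled = [tokens[i] for i in perm]

original_out = self_attention(tokens)
shuffled_out = self_attention(shuffled)

# Compare each shuffled output with the original output it came from.
matches = all(
    abs(a - b) < 1e-9
    for new_pos, old_pos in enumerate(perm)
    for a, b in zip(shuffled_out[new_pos], original_out[old_pos])
)
print(matches)
```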
Yep, that's all that matters
well see that's why I'm surprised that dishwasher and dish, washer wouldn't be more different
I think the text encoders were never trained for such subtle things
maybe it's because the CLIP isn't that huge
i still want self-attention guidance for SDXL 
like image captions are usually quite bad and general
dude the CLIP is enormous
what do you want from it LOL
But if you had something specific in mind, you had to dilute the prompt with tons of synonyms, weak supporting tokens or outright bogus input so the LLM doesn't add too much on top of the meaningful prompt. It was a good solution for inexperienced users, not so much for someone who can actually prompt 1.5 well enough. And it seems they moved in the same general direction SDXL is moving, because I heard current versions also benefit from natural prompting which is closer to an actual sentence instead of 1.5 notation.
yes (faces on base wont be undertrained anymore)
eyewear remains to be seen... might stay problematic, might be fixed as well
anatomy/body parts will stay hard
you rarely have captions like "a photo of a girl with blond hair and a boy with brown hair"
Well why is dish and washer not encoded more differently from dishwasher
interesting! How do you know this, btw?
if you caption with T5-Flan, you do 😄
not sure how accurate they are but it was Good Enough for Me
also, CLIP is trained to create a pooled embedding. You don't care about the single words in the caption, but you want to compare a complete image against a complete caption
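A small sketch of that "whole image against whole caption" comparison, assuming one pooled vector per image and per caption and cosine similarity as the score. All the vectors are made-up numbers.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

image_embedding = [0.9, 0.1, 0.2]                     # made-up pooled image vector
captions = {
    "a cat licking a dishwasher": [0.8, 0.2, 0.1],    # made-up pooled text vectors
    "a car for a shoe": [0.1, 0.9, 0.3],
}

best = max(captions, key=lambda c: cosine_similarity(image_embedding, captions[c]))
print(best)
```

Nothing in this score looks at individual words, which is the point being made about what the training objective rewards.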
For a moment I thought you said your LoRA model was 2GB 😄
bot is running v1.0 base only. faces are no longer under/overtrained on it. just ideal.
eyewear is still biased towards glasses in bot XD (meaning no blindfolds, or cosplay accessories)
I read some youtube comment about how people aren't appreciating what is possible with SDXL and something about you'll be able to make cars for shoes. And I thought really? Oh yes really.
so it is very likely that the transformed word embeddings in CLIP in the last layers carry a lot of information about the complete image. That's why clip skip worked so well in SD
still doesn't explain it tho. Because if CLIP can differentiate between 'dishwasher' and 'washer' and 'dish' in encoding, then there shouldn't have been a cross-over in training nor diffusion
I hope for a "Make stuff in the face work correctly" finetune.
43mb 🦾
I can understand if the self-attention isn't absolute, like 90%
If that's the only reason, then I can remain sane
ok so it's just based on the empirical observation of bot 1... let's see! IDK how they're gonna release the model on Tuesday!
just gonna be painful to dataset - but easy to finetune
all you need is 200 images per concept they didn't add... which arent that many tbh. I just got unlucky with 2B
dish<-washer(.9) => encoding that is a little bit 'washer' and more the concept of 'dishwasher'
don't get that. The last layer of CLIP contains the pooled embedding. However, nothing stops the model from letting the layer before the last layer already contain pooled embeddings
if confirmed true I can shut up
for the loss function clip is trained on it would be totally fine if in the last layers all words have exactly the same embedding
it would be just a waste of parameters
but it would mean that if you use the embedding from these layers you lose the individual meaning of your words
Are you more saying that 'washer' can be carried forward with full strength sometimes or is always carried forward with full strength despite 'dishwasher' also shaping its own strength?
of course this is not the case. As said, if you look at the cross attention maps in SD you see that it can differentiate between different words in the sentence. It's still that sometimes words get mixed up a bit, and for "a woman with blond hair and a boy with brown hair" the vector for woman contains both blond and brown hair information
I guess you could do this right now in 0.9 -> fine-tune with 200 or so faces and then make LoRAs on that model
offtopic - but you can also throw a photo of a dishwasher into a Vit-H model, and get back prompts biased towards it. Often there are supporting words that eliminate all false bias.
so in this case, what is the 'words get mixed up a bit'?
its a dumb solution, but it works painfully well :/
interesting
that makes sense tho
I'm not sure what you mean. CLIP consists of many layers. In each layer you have attention where you mix in information from other words. In the last layer your "sentence end token" has to contain the information about the complete sentence. Depending on which layer in CLIP you look at your words might contain more or less context
Making stuff look right that the base failed at seems to be a lot more complicated than just new concepts.
like the sentence "girl with blond hair and boy with brown hair". In the first layer each word is isolated from the others. The more layers you go through, the more context is transferred to the words, such that "girl" is associated with "blond" and with "hair". In the last layer, the complete sentence has to be associated. So this means that it's very likely that in the last layers every word is associated with every word in some way
I think that's basically 'confirming' what I was saying about the attention. It's as you say that the 'dishwasher' concept isn't absolute in the coding, and that attention between say the word 'cat' and 'washer' also has some strength and exists as context
OK yep I'm understanding then
which contains information about the complete sentence
in SD the last layer is removed and the layer before is used, where HOPEFULLY the words still have their individual meaning
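A tiny sketch of picking the layer before the last one, assuming the encoder hands back one list of per-token vectors per layer. `pick_layer` and the numbers are hypothetical stand-ins, not a real encoder output.

```python
def pick_layer(hidden_states, skip=1):
    """Return the per-token states `skip` layers before the final one.
    skip=0 is the last layer, skip=1 the layer before it."""
    if not 0 <= skip < len(hidden_states):
        raise ValueError("skip must index an existing layer")
    return hidden_states[-1 - skip]

# Made-up stand-ins: one list of per-token vectors per encoder layer.
hidden_states = [
    [[0.1], [0.2]],   # layer 1
    [[0.3], [0.4]],   # layer 2 (the layer before the last)
    [[0.5], [0.5]],   # layer 3 (last; tokens look alike here)
]

penultimate = pick_layer(hidden_states)
print(penultimate)
```

The made-up last layer has identical token vectors on purpose, to mirror the worry that individual word meaning can wash out there.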
OK tbh this WHOLE conversation though kai, I thought you were saying that there are NO associations in ANY layer
but it is still very likely that a little bit of attention is leaked in each word
sorry, I'm probably bad in explaining 😅
It's fine because I am just more familiar with the LLM stage
forgot what it was, but essentially I wanted a very specific chinese flower dress, and obviously it couldn't make even remotely close. was gonna train a lora. then I threw it into vit-h, it gave me back an artist name XD put the artist name into the prompt as well. works 100% how i wanted it, and only produces the right dress. wtf right?
turns out there's a photographer who does nothing but photograph people in that type of dress. the weight on his name is stronger than the real name of the dress XD
that's very cool and a very good idea. I have the same exact issue with not knowing the words for a specific style of fictional plane
Fun fact: the bot censors the prompt, so when you ask SDXL to generate "cucumbers on a dish", it generates this #1100170365604483202 message
there is CLIP Interrogator
Yeah I've seen that sort of thing being discussed, but I'm mostly running locally
it works really nice for these cases
So for my job I actually created a multimodal captioning software
that is focused on accuracy
it way outperforms even KOSMOS-2 etc
I do both
Vit-L does not do justice for sdxl, just a heads up. Vit big g, or Vit-H if you can
CLIP interrogator has a lot of different models you can run
I mean, diffusers has them all ;D
jealousy intensifies
Thanks all for the discussion, confirmed the bits I suspected and did actually understand but also learned a lot of things that I didn't know about at all or didn't understand
And I'm grateful that my whole world isn't shattered
fwiw I get better LoRAs with kohya than with diffusers lol
where can i prompt the sdxl 1.0?
"Today, with collaborators at @Google , we're excited to announce 🥳🥳HyperDreamBooth🥳 🥳! It's like DreamBooth, but smaller, faster and better. 25x faster. Think of 30 minutes vs. 14 hours for 100 models. And works on a single image!
(Thread 👇)
webpage: hyperdreambooth.github.io"
Seen on twitter
A new dreambooth
Locally, in a week 😜
Or in the #1100170312106127410 through #1101178553900478464 channels, any of those, SDXL Beta Bot section.
So is it a lora or a hypernetwork
Results dont look that amazing, I wonder what model they used
I guess Im curious what the difference is between that and traditional hypernetworks
Not sure why they glossed over it
Technically 4 days 😁
it's not about the model but about the technique
hey guys, I know it doesn't 100% belong here but I guess it could be related to SDXL as well,
if I'm training a lora for the openjourney v4 model, should i train the lora on the model itself or on the 1.5 base model?
I don't think it's about SD at all
the only right answer here is "depends", and there are no one line answers for either way
in most cases, you're better off training on the model you intend to use the lora with
no hard rules, of course
oh, I'm wrong, they applied it to SD
anyways. I don't think that it is so interesting either. It is very similar to an older paper by google which was doing the same just with "rank-1 lora" instead of what they call "lightweight dreambooth"
it might be interesting for applications and cloud services that want to create personalized images on the fly for their users
porting it to sdxl is not the issue - rather the theory behind its speed up is probably no longer applicable to sdxl. due to the larger model, we no longer have to worry about so many of the issues of training on 1.5.
hell, I trained the same dataset on sdxl in 6 different ways to see which work, some completely wrong for the hell of it. and they all worked
but anyone here could just wait a few minutes longer and train a, probably much better, model using Lora
yes, I also have the feeling SDXL is easier to train than the previous versions 😄
if its speed you want, 2e-3 is the fastest you can go to achieve good results. While it can't be overfitted too much, that is rarely what you want to do to begin with - and then training is a speedrun
They said they used Stable Diffusion, but they didn't specify the version. Chances are it's either 1.5 or 2.1.
Cause it already has so much knowledge to begin with probably
in the end their model is similar to controlnet in the sense that it uses a pre-trained network for faces. It's not exactly like controlnet, and I guess it's because the results with a controlnet were not good enough. But the point is you have to train a model that is able to finetune a model for face images
which means it works ONLY for faces which makes it kinda boring imo
Ohh yeah maybe that will be useful for like phone apps to personalize AI filters and stuff
Probably exciting for some startup out there lol
It might be easier in terms of know how (idk tho, didn't even try), but it should be harder for the hardware since the model is much bigger, and possibly might need further tuning for the refiner.
But not for me
maybe also game development where you get a personal avatar based on a photo and stuff like that
Ohh yeah like you can put your pic in and it will generate a bunch of images personalized for you
yes and no. Its bigger, but that also means it learns faster.
Yeah my Loras learned super quick but it takes a lot of vram
It might take less compute, but will require more VRAM. That's harder to achieve with consumer grade hardware.
in that case I'd steal the microsoft solution, of applying it to a 3d face. while sounding barbaric, the results are pretty damn good, now that a bit of ai optimization was added
jim carrey as shrek?
is it just me or is the comparison image in that hyperdreambooth a bit misleading?
Google's research hardly ever goes anywhere until someone else picks it up, and their original ideas for dreambooth we now understand are pretty destructive and shitty
I guess it is all single-image datasets comparisons. Any comparison made behind closed doors will always be misleading.
it's not hard to improve on their original research paper
I take it back. if you're ok with this, then 4e-3 is your limit XD
But yeah I don't feel like the outputs are that good.
But it might have some uses still.
thata you
face reveal
casually pretends lora doesn't exist
though it definitely works a lot better in niche applications. just not generalized
though I also question their prompts, since you can't just compare "A Pixar character of a [V] face" when that prompt was never intended to work on the default model... while there IS a prompt that does work.
i don't understand that test grid at all
that thing belongs in the Facebook group of scientific charts that look like shitposts
Yeah without prompts it is kind of worthless, and where is normal LoRA...
just finished reading. I feel bamboozled. They just made a new variant LoRA and gave it a fancier name...
i think they want a line down the middle separating the men output from women output? looks like a god damn continuum where they gradually shift the weights
I didn't read too much of the paper as I don't understand the fine-details too well. But if you're improving on LoRA you should compare this to the other LoRA variants.
basically it's a 1/0.5 lora XD
It is like me developing a new screw and makes comparisons to bolts and nails but not other screws.
welcome to the wild world of machine learning research where the comparisons don't mean anything and the demonstrations don't matter
that was the previous paper
@visual glade how do these work? https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/ac2d47ff4c00b041cae3d882c2832662c2c64935
in this paper they say directly that they do lora, but they use a hypernetwork to predict initial weights
which they doubled down on...
also, how the fuck does midjourney's bot send its partially denoised outputs to the discord message
😭
also they use a random vector factorization before doing the lora to further shrink down the number of parameters
i can't update an old message with a new embed
but end result is still a 120 KB lora, right?
yes
but from this view you could say Lora is the same as dreambooth
when you add the lora to a model you get a normal model back
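A minimal sketch of the "you get a normal model back" point, assuming the usual LoRA convention where the update is a low-rank product scaled by alpha/rank. The function names and toy matrices are made up for illustration and are not the paper's method; it just shows that the merged weight has the same shape as the original, so nothing downstream has to know a LoRA was involved.

```python
# Illustrative only: tiny pure-Python matrices, not real model weights.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_down, lora_up, alpha):
    """Return weight + (alpha / rank) * (lora_up @ lora_down).

    weight:    out x in
    lora_down: rank x in
    lora_up:   out x rank
    """
    rank = len(lora_down)
    scale = alpha / rank
    delta = matmul(lora_up, lora_down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# A 2x3 layer with a rank-1 update: 2 + 3 = 5 trained numbers instead of 6.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_up = [[1.0], [2.0]]          # out x rank
lora_down = [[0.5, 0.0, -0.5]]    # rank x in
merged = merge_lora(weight, lora_down=lora_down, lora_up=lora_up, alpha=1.0)
```

On a toy layer the saving is tiny, but for a large layer with a small rank the two factors hold far fewer numbers than the full matrix, which is how the file stays small.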
LiDBx2 would be the proper name XD
lmao, my training has gone back to not working again. I don't understand lmao
if you read the "eating our lunch" memo the person goes a way to explaining why that's the case.
oh buddy i love that memo lmao
yeah i'm well aware of how groups like SAI and Google all burn capital just because they have it. i have always done everything i do, with much less than these groups spend. but i'm not working at their level i'm sure 😁
btw, i found out where all the stability cluster is going
LoRA is probably the best thing Microsoft ever did
Vit models go brrrrrrr on Stability Cluster
oh, looked at wandb logs?
but it must utterly suck to be wrapped up in your own red tape and having to form an orderly queue on an idea while people outside your window are all running at it from every angle like wacky races.
its kind of nasa versus the redbull flugtag and inexplicably the flugtag is competing!
this scene is an incredible metaphor for so much time wasting resource expenditure we have in life
why doesnt google release an image gen I wonder
https://ai.googleblog.com/2023/03/scaling-vision-transformers-to-22.html
probably copy-pasting some more Google research
seem happy enough for adobe to work on that stuff honestly.
Why do even google research these things if no one gets to do anything with it.
it's a fast and shitty way of converting latents to RGB
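A rough sketch of what a cheap latents-to-RGB preview can look like, assuming the simplest possible approach: one fixed linear projection applied to each 4-channel latent pixel. The matrix values are placeholders, not the coefficients any real UI ships, and this is not necessarily what the linked commit does.

```python
# Hypothetical 4x3 projection: one row per latent channel, columns are R, G, B.
# Placeholder numbers for illustration only.
LATENT_TO_RGB = [
    [0.30, 0.20, 0.20],
    [0.20, 0.30, 0.10],
    [-0.10, 0.10, 0.30],
    [-0.20, -0.30, -0.40],
]


def latent_pixel_to_rgb(latent):
    """Map one 4-channel latent pixel to an 8-bit (r, g, b) tuple."""
    rgb = []
    for c in range(3):
        value = sum(latent[k] * LATENT_TO_RGB[k][c] for k in range(4))
        # Assume the projected value lands roughly in [-1, 1]; rescale to [0, 255].
        scaled = (value + 1.0) / 2.0 * 255.0
        rgb.append(int(max(0.0, min(255.0, round(scaled)))))
    return tuple(rgb)


def latents_to_preview(latents):
    """latents: H x W list of 4-channel pixels -> H x W list of RGB tuples."""
    return [[latent_pixel_to_rgb(px) for px in row] for row in latents]


preview = latents_to_preview([[[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]])
```

The preview comes out at latent resolution, so it is small and blurry, but it skips the VAE decode entirely, which is why it is cheap enough to run every few steps.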
nice
Does ComfyUI have support for that?
yeah i wanna do what MJ did for image gen and show the images as they generate
i have a progress bar but that's boring
Look at how Automatic1111 is doing live-preview.
do you mean preview every x steps?
Fast and shitty, that should be Auto1111's motto
Although it's not that fast either
if you want preview in comfy it's: --preview-method auto
you should leak some documents with that idea in it and he'd probably implement your motto then
except a1111 is not fast, both diffusers and comfyui beat it in speed
so it's actually the slowest
it's fast AND shitty which means it's the best at being the worst. it's like how 1+1=3.. for large values of 1
this a bit.... weird.
What is happening

wow
Ah, I thought there was a node for it, as I want to replace the normal conversion with the faster one since I've a bit of an issue with the vae atm.
I don't have an iGPU so it doesn't, it just starts trying to use a little System Memory
But I don't understand why it's suddenly started doing this, when I trained one the other day and it did work.
So why now is it fucked
did you update?
updated, didn't work, downgraded worked. Went to sleep, ran again, didn't work.
if you haven't updated, my guess is there's some state file it's picking up on from the last run
well upgrading and downgrading is ... uh... well, did you look at the code changes before doing it to verify it'd be okay?
When I say downgrade, I checked out an old commit
nvidia driver update?
Nah on the old driver still. I tested updating the driver again and that just eats fuck loads of RAM and tries to crash my PC
have you tried turning it off and on again?
Many times
Let me check to make sure it's not some corrupt cached latent or something, I'm going to clear the whole training folder.
that's what i meant when i said some saved state
I thought you meant model state
cached latents or aspect buckets can i guess do that too
I'll see if this works
well it sounds truly frustrating, i hope you figure it out, because maybe it's the same issue for all
what should be faster than that?
It's just bizarre
Goes from 1024x1024 batch 2 working
to 512x512 batch 1 not lmao
I wonder if it's doing something fucky with some cached latents and trying to load them multiple times or something
I don't understand what this is in reference to
https://ai.meta.com/blog/generative-ai-text-images-cm3leon/ just saw this pop up
the VAE approximation in auto1111 is super fast and a great idea
noice
I want to see a model or method where you can get multiple subjects in on the first go with no merging
thats my challenge to all the eggheads
Does not look healthy
Well when it was working it was getting to 9.8GB out of 10GB and then would stay there and be fine
You were getting 1024x batch size 2 on 10GB?
I was yes, dunno how but I was.
I suspect somewhere some settings on the Kohya scripts are getting fucked up
And it's not doing what it's telling me it's doing
Dang I need to switch to linux, I can't get anywhere near that
I mean it doesn't work at all anymore
I would OOM with a resolution of 1x1 lmao
I think I'm going to delete this whole thing and start again
Because something is obviously broken
How much VRAM?
12GB
Hmm, maybe I'll try that
https://github.com/derrian-distro/LoRA_Easy_Training_Scripts/tree/SDXL
It was this one yeah?
Yeah it's based on Kohya, just that it installed correctly for me instead of the actual kohya
It also has some nice QoL features
I hate windows sometimes
It won't let me delete an empty folder because it's "In Use"
how... by what lmao
What you're on windows?
yes
How did you get such good results
an open folder or background process could be using it. i usually just reboot 🙃
I would love to train at full res
How would an open folder be using an open folder XD
forget the exact thing i use since it's been a while, but googling "windows unlock in use" should get you some tools for it
I have a tool, but it does it on files and not folders
Because an empty folder shouldn't ever be locked lol
if you have the empty folder open, viewing the empty contents
Do you have a command line open in that folder?
or if you are inside the folder in a terminal
It's fine, I'm just getting irritated because this thing randomly stopped working
Are you sure it was working in the first place?
100%, it made a working LoRA
What if you were accidentally resizing to 512 or something
Well it OOMs at 512 so...
But my tests worked, not brilliantly, but they absolutely worked.
Hmm
Im interested now I want to get it working on my machine too lol
Good luck
i think i saw this char somewhere
anime with bikes
I dunno what the character is, it was a style lora, as you can see from the watermarks it baked in lmao
im also training a lora on wdxl rn but for some reason it wasnt learning anything
could you yoink me your parameters?
I mean mines not working at the moment, so don't think you want them lol
😂
Just re-installed all Kohyas scripts and still doesn't work
Exact same training settings
So dumb
how many steps and images btw
parti vs sdxl (no cherrypicking)
{\"resolution\": [768, 1152], \"count\": 2}, \"1\": {\"resolution\": [768, 1216], \"count\": 1}, \"2\": {\"resolution\": [832, 1088], \"count\": 2}, \"3\": {\"resolution\": [832, 1152], \"count\": 4}, \"4\": {\"resolution\": [832, 1216], \"count\": 4}, \"5\": {\"resolution\": [896, 1024], \"count\": 1}, \"6\": {\"resolution\": [1024, 896], \"count\": 1}}, \"mean_img_ar_error\": 0.010579165855970867}"
This is the info from another test one I trained, so it was using the correct 1024x1024 with bucketing
Thats from inside the safetensors file
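For anyone who wants to pull that info out themselves, here is a minimal sketch that reads the header of a .safetensors file with only the standard library. The file layout (8-byte little-endian length, then a JSON header with an optional `__metadata__` map) is the documented format; the `ss_bucket_info` key and the top-level `buckets` field are assumptions about what the trainer writes, so check the keys in your own file.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the free-form string metadata stored in a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def bucket_summary(metadata, key="ss_bucket_info"):
    """Decode the nested JSON string and list (width, height, count) per bucket.

    The key name and the "buckets" field are assumptions; adjust to your file.
    """
    info = json.loads(metadata[key])
    return [(b["resolution"][0], b["resolution"][1], b["count"])
            for b in info["buckets"].values()]
```

Metadata values are stored as strings, which is why the bucket info shows up as escaped JSON and needs a second `json.loads`.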
So I'm so confused as to how that somehow took less than 10GB VRAM but now it takes more than 10GB VRAM to try with 512x512
is that what you're using? How did you install?
No I'm using https://github.com/bmaltais/kohya_ss
But I might try that one next and see if it's any better.
oh nice, I'm using the kohya trainer colab https://github.com/Linaqruf/kohya-trainer/blob/main/kohya-LoRA-trainer-XL.ipynb
should be close enough to kohya ss
If the Collab works I might just use that. Although it's annoying having to upload all the images
I mean it runs, I get a good lora if I prompt the exact thing I wrote on the caption
but if I change anything it just forgets the face
it's dumb af
Do you know why it wants admin to do the install process?
I see no reason for it to require it
reimagining Dragon Fruit
i tried derrien's for the first time recently and it didn't need admin
I just ran it, it tries to change the powershell restriction policy and then a UAC Prompt for Admin
@visual glade oh good it's MIT licensed so you can take it over 
Call PowerShell -NoProfile -ExecutionPolicy Bypass -Command "& {Start-Process PowerShell -ArgumentList 'Set-ExecutionPolicy Unrestricted -Force' -Verb RunAs}"
It doesn't need to do this
i used the bat, hrmm
Yeah so did I
the Bat loads up that
The bat triggers Installer.py
Which does this
```python
import os
import subprocess

# Try the primary policy-change script first, then fall back to the backup.
try:
    subprocess.check_call(f"{os.path.join('installables', 'change_execution_policy.bat')}")
except subprocess.SubprocessError:
    try:
        subprocess.check_call(f"{os.path.join('installables', 'change_execution_policy_backup.bat')}")
    except subprocess.SubprocessError as e:
        print(f"Failed to change the execution policy with error:\n {e}")
```
Which then runs another BAT that runs that code to change the policy
so why didn't i get a UAC prompt?
Have you disabled it?
no
Or were you already running the bat in an admin command window
also no
Not sure, but it did it for me
And I'm not installing stuff that asks for admin for no reason
Actually it might be because your powershell policy is already on bypass, so it didn't need to change it.
that's weird pencils 🙂 the one in the middle is blue and black 🙂
oh good call
But I don't see any powershell scripts that would need that changing to run
Think I'll just wait for 1.0 and improved tools before I train anything
I need food
not so unusual tbh.
🙂
I mean, that's a good thing, the model is not forgetting everything. I think you have to train on trigger words and ideally also train text encoder
for dreambooth, yeah training the text encoder is pretty strong, have to do it very carefully
i need brain. i'm using bmaltais's repo. was wondering why i couldn't find an installer.py, except one in the venv
I mean I did train on a trigger word and I leave the trigger but add "dancing" and then the whole thing doesn't work
hm, thats weird. I had no problems training on a subject.
I did first train text encoder for a few epochs and then, much longer, the unet
Yeah that one's fine and doesn't do that.
I did textual inversion first, but I'm pretty sure you can skip that if you have a good trigger word
what was the script/LR/epochs/# of input images?
it
LR 5e-4
it's funny because it still gets the demographic data (like white man with long hair) but it loses face details
i do have derrian's installed and used it previously, but that was several months ago, and i vaguely remember admin issues. anyway, doesn't matter now i guess
Steps in total? I don't know anymore. Guess ~400 for text encoder and ~800 for unet. Actually, I just train until I see severe overfitting. Then I use the last model and a model a few epochs before
I would try with training text encoder. This is really powerful
gotcha! I'll try doing that in diffusers cause it lets me customize text encoder and unet
anatomy why...
cursed seed, forbidden token, or both
in this case: try text encoder first and only train the OpenCLIP - it is totally sufficient. Then train unet afterwards. That was how I did it. However, I haven't done experiments with other settings yet
leg status?
I have to say, SDXL is doing anatomy often surprisingly well. I see very rarely wrong number of fingers
diffusers > *
it still sometimes messes up composition. But it feels like they trained it well on legs and fingers
brb reducing RAM use in my bot by 3GB 😄
Which do you guys think is better? I like V1 more.
I like the second pic
my prompt?
Fingers seem to be very hard to get even close to looking decent. Will 1.0 be better?
Sometimes there's the correct number of fingers (five) but there's a normal finger instead of a thumb. 😄
yes that still happens a lot. But I'm happy that the number is correct lol
a little but not super massive, 0.9 is pretty indicative of what you're getting
there is a decent quality bump in 1.0 though
there are specific models for inpainting hands afaik
also negative textual inversion embeddings you can use to reduce extra fingies & stuff, nothing's really perfect though
Eyes from a steep angle (profile shot) are hard too.
But about the same level as some good 1.5 models, I feel.
In Comfy, I guess if you batch it doesn't save the seed for each image in the batch generated?
I understand that it's pretty hard for a model to predict how hands can be shaped, they are very handy tools after all.
Should be saved as noise_seed, at least my batches are.
I was told comfyUI was faster, but its very slow.
When I press on queue... Its takes a lot of time before it starts to generate
I have 16GB RAM and 12GB VRAM
Definitely doesn't in my pipeline anyway, identical information in each image in the batch
I have it in all modules that use the seed in the flow.
If you clear the UI and drag the image back does it show the seed?
Make sure you saved the UI first.
did you try to make more than 1 image? 😂 it has to load the model before making the first image m8
I mean looking at the raw data in the output it's identical so there's no way
wonder how I could fix that
Compared to what? SD 1.5 on A1111?
What you think about my upscale result? Too sharp?
yeah, this workflow is very slow.
Like you have to load the model for every image...
And it takes like forever
in fact, you have to load 2 models for every image.
Looks great, how did you do it?
It's much faster if you run ddim as Sytan's flow has.
Yeah @vast narwhal, looks terrible, could you share your workflow?
this is three steps: hires fix + 2x ultimate upscale using juggernaut with controlnet tile, but i need more tests
Not all SDXL 😢
I wonder if they'll give us a controlnet tile for SDXL
It might fix the detail loss
I will do. First i need a good realistic image workflow to test with a different thing beside the dog
no you dont. you load it before the first image and then its loaded and wont need to load again unless you change the model. there's something wrong if it keeps reloading the same model.
Wow. This is fantastic. What was your process?
id like to know how you did that aswell if thats sdxl
im assuming you used a 1.5 model for upscaling
or 2.1 or whatever
He did, he said he used Juggernaut
ahh okay. i need to try that out.
Do you have the 12gb vram available? Or is stuff in background using vram, and comfy needs to unload every time?
Juggernaut xl will be in the works soon and oml will that be good
I've basically finished all my lora tests. if anyone has a dataset they want me to test, feel free to ping me
a stunning portrait of StabilityAI deepfrying the VAE
"yep, looks done?"
"better give it another hour"
@uneven dove
my favorite response to the comfy inside A1111 extension XD
Is this just control net for SDXL? #📣|announcements message
If you read the blog it tells you it's a T2I-Adapter
This guy is just pouring lava from his hands, ouch
or you can just use bf16
that's not as much vmem savings
both are 16 bit though so it should be the same?
it's not, the dynamic range is higher in bf16
also slower
bf16 is great but only because it is convenient in terms of development costs, same as tf32 for fp32-sensitive applications
I clearly did not 😉
Thx
How long do you lot think we will have to wait until we get an anime finetuned model of SDXL, upon release?
Really looking forward to exploring XL's better understanding of context in prompts tho.
use Vit-H on an anime image of your choice, take that prompt, feed into sdxl, get new anime images. profit.
but also waifuxl for easy use
Vit-H? Not heard of that before. Is it short for a tool or something?
Interrogator -> Vit-L model
And this is awesome! Fantastic news to hear that we will be getting a XL version
Oh right! Ok. Nice! Thanks. Haha
But every image needs to load both models, so how can there be something wrong if it tries to load a model?
its supposed to stay in vram, not be removed again
removal should only happen if you lack enough vram
oh so 12 must not be enough I guess
it is, based on people here
however stuff in background may be taking some of it
photoshop or similar apps
or other uis
i just use --highvram
does this force it? in that case nice! 👍
as you see here, I am not using any vram demanding app
for what?
keeps the models loaded on the gpu so it doesn't load them every time
yeah but it's still 16 bits because they only put 7 bits on the fractions part
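A small standard-library sketch of the range versus precision trade-off being argued here. bf16 is emulated by keeping the top 16 bits of a float32, using truncation for simplicity (real hardware normally rounds); fp16 uses Python's built-in half-precision `struct` format. The helper names are made up for the example.

```python
import struct


def to_bf16(x):
    """Round-trip a float through bfloat16 by keeping the top 16 bits of float32.

    Truncation is used for simplicity; real hardware normally rounds to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x):
    """Round-trip a float through IEEE half precision.

    Raises OverflowError if the value is too large for fp16.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Precision: fp16 keeps 10 fraction bits, bf16 only 7.
fine_step = 1.0 + 2.0 ** -8
# Range: fp16 tops out at 65504, bf16 shares float32's 8-bit exponent.
big = 1.0e10
```

So both are 16 bits wide, but `fine_step` survives fp16 and collapses to 1.0 in bf16, while `big` fits in bf16 and overflows fp16. That overflow side is the kind of failure that shows up as black images from an fp16 VAE.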
Where should I add that?
as long as the hardware supports it speed should be the same as fp16
here?
yeah
ok will try
it ain't
not on a 4090, an A100, or an A6000
maybe on a TPU it works better... hm
ok this time it worked, as in it made something, but it does not keep it coherent at all
ComfyUI is loading the models at 1st generation, if you come from A1111 it loads on app loading, so ComfyUI is faster to start but 1st gen is slower.
I think it's loading in the Text Encoders that takes time
As it takes that extra time every time you change the prompt
feel like prompting on sdxl is pretty hard, i only get medium good results always, sometimes blurry, grainy, missing details
I've noticed it takes an extra 30~40 seconds to start generating when using a lora as well, even on subsequent generations
yeah loras are not currently handled in a very memory efficient way so if you only have 16GB ram it's going to be slow but I'm fixing that
dont understand why my results always get so blurry
Congratulations to the stability.ai team, you have done a very good job with this model
it's better than google's models
Nice 😄
hey guys, did we ever get official information regarding 1 vs 2 positive prompts and clip_g clip_l?
cause i tested both ways and i'm still not sure what is best
and similarly, ascore seems to have 0 effect at all if I change the int value of both positive and negative
This is the full SDXL result. I just don't think it's better
ascore is just for refiner
Which of these looks better?
I've done some tests but not enough. Right now I'm using the same prompt for CLIP_G and CLIP_L - that gives the most coherent result what the prompt says. But I started to try different concepts. Main prompt part in CLIP_G and style words in CLIP_L. But I'm not really sure.
something in between these 2 @eternal fog , either blurry or overly sharp to me
but it's really detailed tho
1
me neither. It definitely changes the style of the image, but using different prompts for clip l and g made it worse in most of my cases, while in sytans workflow its supposed to improve quality on photorealism 🤷♂️
Why do you think 1?
ok thanks guys, so its still mostly speculation with no real consensus
2, it has more details in the clothes and face, 1 looks too soft
yes, and as we had a very long discussion a few hours ago: from a theoretical standpoint it's awkward using different prompts
yeah, i feel UIs won't adapt to have 2 prompts just for sdxl
TPUs have limits to memory bandwidth, too. I haven't tried it on any of ours, but I would be surprised if it was any different.
I think that's not an issue. I'm pretty sure they would if it really helps
ok what about these two?
textures look more real, it doesnt feel "fake" or plasticky. if you look around the area of breasts it starts looking artifacted and burned
what is the impact of ascore?
why not? a1111 already added another prompt box for hi-res fix alone, not hard to do the same for sdxl
doesnt comfy already have it
Yeah I think I've gone a bit hard on the sharpness, but I'm trying to remove that soft look you get from doing an img2img upscale.
did you ASK for the facial lines?
it looks like misaligned timesteps
She's a "Demon", so that's why it's done that
hmm
fair enough but when there's more noise than it knows what to do with, it does that kind of facial lining
I'll experiment a bit more I think I'm getting somewhere though
it's only for the refiner and is supposed to make the image more aesthetically pleasing while less following the prompt
try 'hairless demon' 😄
a more severe example
Instead of generating then upscaling then doing the img2img pass with refiner.
I'm generating then going straight to the img2img refiner pass, THEN Upscaling and then doing another img2img refiner pass. It seems to keep detail a lot better and only takes a few seconds longer.
2
Let me try 2 more without facial lines this time.
less misaligned makes it into some kind of excusable crayon lines. after all, he is a jester. but it looks odd
its the same question as ai auto processing of mobile phone photos.
more real, but worse / definitely fake, but fits aesthetics more
here's the effect you get when you randomly add noise during denoising
Yeah I want a balance in between, the left is too soft in my opinion, but the right is too sharp.
ok ok, thanks
in fact i think the random noise added during inference is possibly the best example of the face cracking in an 'artistically acceptable way'
realistic ? i'm okay with it ... hands? no freaking way 🙂
i am talking about this
@eternal fog just merge both 😄
what is the standard ascore for positive and negative?
5/1
I think 6 and 2
its definitely 5 and 1
refiner steps too high
they ain't great values
These two are a bit closer, although it's buggered up the eyes on one of them
tbh this effect kinda reminds me of the various customizable masks from payday2. google them if you havent played. the artist for that game would probably love it lol
1 by a long shot
yep. you can even write it a story and get back a good result
simplify them
I think 2 pops out a lot more, it's not as smooth. But I do think it's too sharp. Time to play with more values.
those two look like skyrim 6 🙂
is there a new token limit?
but you can also just copy an old prompt and it will usually work
no the model is loaded into ram. its only taking up vram during generation
nope
oversaturation probably
masterpiece, trending on artstation, they make real people look like vector graphics
you need to remove a lot of that crap
for realistic, what do you use?
nothing
please share some prompts
just say what you want
pure luck, i would say 🙂
a stunning portrait of a 1985 adult in leggings
be careful with negative prompts that look innocuous, i just figured out that 'blurry' was making my photos paintings, then i added 'painting' and now everyone looks wrinkled
did that, all im saying is without any of those, it looks better 🙂
negative prompting can have a lot of strong effects that are hard to predict
I checked and it's 2.5 and 6 😝
ಠ_ಠ
"a photo of jim the plumber working hard on pipes as he ponders the world and its meaning"
using bot v1.0
how did you upscale using sdxl?
wow, 100% believably jim
yeah, you usually don't need negative prompts. Avoid them
use them only if you really need them
usually for excluding overfitted subjects
not like 2.1 where we by default used complex negative prompts
I think I might need negative prompts.
2.1 just needs like one neg embed lol
i have massive success with tiny negatives on 2.1
"a photo of jim the plumber working hard on pipes as he ponders the world and its meaning"
using mimizukari setup. no style/no negative
same, just needs a bit more guidance through positive prompts
😂
but if you ask for certain anime stuff you sometimes just get crap back that looks like fan arts
I tried some Dragonball Z prompt and get crappy images back. In this cases you have to improve your positive prompt, not the negative one
e.g. add artist names that describe the image style you want
ran 1 w some color corrections... idk u tell me
https://www.midlibrary.io/ has a tonne of useful artist names that work with SDXL as well. Using photographers will usually give you decent quality photos
I've had a great deal of luck using GPT-4 for first pass prompt engineering; 80% of the time it produces great pictures, although not always what I want.
Mind you, 10% of the time it outputs what I posted above.
question: in comfyUI there's a KSampler (Advanced) node, that has a start_at_step and end_at_step ... what are those for ?
if you want to change the model during sampling for example
or other situations where you want to stop the denoising process, do something else with the latents, and continue
that's freaking crazy 🙂
e.g. stop in between and continue with the refiner model
or change the prompt or model in between
You guys seen this? Personally I've not had issues but apparently this saves a little VRAM
https://huggingface.co/madebyollin/sdxl-vae-fp16-fix
...now if you give the right prompt to GPT-4, it produces this.
Depends on the node you're using. but everything needs to be from XL, including prompt box and the cliptextencoder
So if I had 20 steps, but the end at step is 12, does it stop at 12 when it was scheduled for 20? If so. What’s the difference of putting the steps at 12 vs 20?
if you want comfy to use the VAEs in fp16 mode use: --fp16-vae or --bf16-vae for bf16 mode
I think so. If you say 20 steps and stop at 10 then it stops at 50% denoising
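A minimal sketch of how the step fields could be filled in for a two-pass run, using the same input names that appear in the exported API-format workflow earlier in the channel. The helper name and the refiner-side values are assumptions for illustration, not an official recipe.

```python
def split_sampling(total_steps, handoff_step):
    """Build the step-related inputs for a two-pass (base then refiner) run.

    Only step fields are set here; model, prompts, seed, cfg and sampler
    are left out.
    """
    if not 0 < handoff_step < total_steps:
        raise ValueError("handoff_step must fall strictly between 0 and total_steps")
    base = {
        "add_noise": "enable",
        "steps": total_steps,
        "start_at_step": 0,
        "end_at_step": handoff_step,
        # Keep the remaining noise so the second pass has something to denoise.
        "return_with_leftover_noise": "enable",
    }
    refiner = {
        # Assumption: the second pass continues the same schedule without fresh noise.
        "add_noise": "disable",
        "steps": total_steps,
        "start_at_step": handoff_step,
        "end_at_step": total_steps,
        "return_with_leftover_noise": "disable",
    }
    return base, refiner


base_inputs, refiner_inputs = split_sampling(total_steps=20, handoff_step=12)
steps_run_by_base = base_inputs["end_at_step"] - base_inputs["start_at_step"]
```

As far as I understand it, the difference from simply setting steps to 12 is that the noise schedule is still laid out over all 20 steps, so stopping at 12 hands over a latent that is still partly noisy, whereas a 12-step run denoises all the way on a coarser schedule.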
What other command line options can you pass to main.py?
--help will show all of them
Gotcha ya that makes sense
nope, works with the regular text encoder aswell
Too simple!
ok, I'll take a look. See if it's any faster, or less memory intensive
Well it didn't seem to make too much difference with that vae. If anything I think it made it use more VRAM.
It does in fact fix black images, though, when you run it at fp16
prompt: i installed Clippy today to show my grandkids how we used to talk to a paperclip and they said "grandpa's sunsetting again" and i was rushed to the doctors who adjusted my medication and insist i don't have grandchildren and that i need to stop going off my meds
the narrative here, it is exquisite
I tried the --highvram but now...
So without it every image needs to load both models constantly
and with it it runs out of memory
on 12gbVram ?
each model is pretty big
whats the prompt if I may ask? Looks amazing
With only 12 GB VRAM you shouldn't run refiner and base with highvram
yeah don't do highvram with SDXL if you only have 12GB, both unets on the gpu take that amount of memory
Still experimenting. At the moment...
Given input such as "A picture of a boat", generate a creative description such as "Digital painting of a boat on the stormy ocean", deferring to user input when convenient. Also output a style, selecting relevant artists and stylistic choices that go well with the prompt. Series/character names don't work, so describe the scene or character instead. Always include artists. While the prompt should be regular english, the style should be comma-separated keywords.
Respond using JSON, in the format {"prompt": "{prompt}", "style": "{style}", "aspect_ratio": "{e.g. 4:3}"}
Which, with this request:
Machikado Mazoku.
Produced this:
Anime-style digital artwork depicting a young girl with horns and a spiky tail, surrounded by a mysterious aura in suburban scenery --style Contemporary, Manga, Modern, Magic Realism, Hayao Miyazaki, Yoshitoshi Abe --ar 16:9
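A small sketch of how a JSON reply in that format could be flattened into the single `--style ... --ar ...` line shown above. How the bot actually does it isn't stated, so the function name, the `1:1` fallback, and the validation are assumptions.

```python
import json


def build_prompt(raw_response):
    """Turn the model's JSON reply into one '--style ... --ar ...' prompt line."""
    data = json.loads(raw_response)
    prompt = data["prompt"].strip()
    style = data.get("style", "").strip()
    # Assumed fallback when the model omits the aspect ratio.
    aspect_ratio = data.get("aspect_ratio", "1:1").strip()
    width, height = (int(part) for part in aspect_ratio.split(":"))
    if width <= 0 or height <= 0:
        raise ValueError(f"bad aspect ratio: {aspect_ratio!r}")
    parts = [prompt]
    if style:
        parts.append(f"--style {style}")
    parts.append(f"--ar {width}:{height}")
    return " ".join(parts)
```

Parsing the ratio into integers before re-emitting it means a malformed reply fails loudly instead of being passed through to the image generator.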
Highvram is useful if you're only doing the base, but not with both.
With 12GBvram I need to wait for each model to load (twice an image) ?
It shouldn't be that slow really to load both.
last year they were talking about 24 fps and now 6 months later we need to wait 5 minutes for an image.
yeah it takes like 500 seconds per image
diffusion is fast but the model takes too long to load
And if you create a batch of latent images then you've got less model switches to worry about
I use a 1080 Ti 11GB, the only time I notice the SDXL models are slow(ish) to load is when I initially start the server and load for the first time, after that it's almost instant even if I switch models and use a 1.5/2.1 workflow
man its so funny seeing people claiming their upscaling setup works great then showing absolutely horrendous results
TBH & IMHO upscaling is overrated unless you're going for commercial contracts or HQ print outputs.
If like me you're simply generating for pleasure or simply to create wallpapers then even just a basic upscale is "just fine" IMHO
nah man some people just like generating full body and semi-full body images with clear faces
nothing more to it
A well-matched 1.5 model works fine for upscaling. But SDXL is so flexible, there isn't any single model that'll work.
also idk why you wouldnt just like to look at an image with more details than 1024... your take doesn't really make sense to me at all actually
a lot of the quality improvements are just people showing off that they CAN do it
doesn't need prompt comprehension worth a damn
12GB isn't high vram; i think you'd run out of memory trying to run even batch size 1 with both models loaded
yeah sure, im just talking about the ones saying the refiner works great for upscaling 💩
It does wat mate?
my 3070 8gb takes 2.5 minutes to generate 4 images with refiner - something in your setup is wrong in my opinion. Btw: what drivers do you have? I had huge issues with speed in SD with drivers I updated a couple of weeks ago - needed to revert
wat?
does this VAE go into the refiner too?
"What", but flatter.
3090 gens an image in 17 or so seconds; it's dependent on your hardware
I am running a batch size of 1
I removed the highVram already
Ufff. No bf16 support, right?
if your 7x upscaling workflow doesnt take at least 7 minutes to complete a single image, you're not taking your image generation seriously and should probably just give up
I think it is working faster now
It seems it loaded both models now
maybe the fp16 vae helped
try 536.40 - reverting helped me, as I was taking 2-3 minutes per image with the updated drivers instead of 15 secs
maybe. It's working well now
Cool, Maybe they fixed the bug with latest update ^^
Is that with base + refiner?
an OG image and a crude 4x upscale from it
Some people like to use ai image generation for fun to see what it spits out. It’s fun. Upscaling to get the best quality is fun too. I find this technology amazing.
I get that. I, for example, prefer to play with this than do a crossword or knitting but... 🙂
Crossword and knitting sound boring lol.
Pretty sure they use the same vaes, so yes you could use it with the refiner.
I use ai for all my dnd stuff XD when I do prints I need it to be crisp
to not break immersion
trying to do both at the same time can be fun though lol
You can always just upscale in batch over night or something.
I wish I had a better use case for ai art, I know some people would pay for a high quality image of something specific but I wouldn’t even know how to sell that service😅
you could do prints for yourself - for the wall. your top 5 gens get a proper A3 art print at a photo shop
might not make money - but at least rewards your hobby c:
thats cool but thats a x4 upscale model and not sdxl right
kinda wish i would get some consistent results, i always fail with sdxl
is sdxl free?
It’s just feeding the sdxl image into a simple 4x pixel based upscale
Workflow is left in there
thanks @boreal bough
It wrote my friend's name pretty well
and the image is nice as well with very simple prompt
u know where i can find prompts from sdxl?
hmmm I'm starting to believe this is good model
especially with neg prompts?
Negatives aren't needed, really.
For positives... it understands English much better than 1.5. Stick to simple language without prepositions, and it'll work fine.
Well, pronouns. Prepositions you can try to use.
mh ok, dunno i always tend to get blurry or extremely grainy results
Negatives do some things. I've been putting a few in, like deformed, blurry, and this is the sort of difference you can get between using negatives and none
Friday night hype leggo
i dunno my results are just always blurry or extremely grainy
i cant get result like yours
What sort of steps, samplers and cfg are you using
Don't change samplers and noise schedules between the base and refiner
And DDIM should be using DDIM Uniform
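A sketch of what "same sampler and schedule on both passes" looks like in the API-format inputs. Field names come from the exported node posted earlier in the channel; the step counts and the helper `sampler_inputs` are illustrative assumptions, not a recommended setup.

```python
def sampler_inputs(sampler_name: str, scheduler: str, steps: int, base_end: int):
    """Build matching advanced-sampler inputs for the base and refiner passes.

    Hypothetical helper: both passes share sampler_name, scheduler and the
    total step count; the refiner picks up where the base stopped.
    """
    shared = {
        "sampler_name": sampler_name,
        "scheduler": scheduler,
        "steps": steps,
    }
    base = {
        **shared,
        "add_noise": "enable",
        "start_at_step": 0,
        "end_at_step": base_end,
        # Hand the partly denoised latent over to the refiner.
        "return_with_leftover_noise": "enable",
    }
    refiner = {
        **shared,
        # The latent already carries the leftover noise from the base pass.
        "add_noise": "disable",
        "start_at_step": base_end,
        "end_at_step": steps,
        "return_with_leftover_noise": "disable",
    }
    return base, refiner


base, refiner = sampler_inputs("dpmpp_2m_sde_gpu", "karras", steps=30, base_end=24)
print(base["sampler_name"] == refiner["sampler_name"])
```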
oh ok someone suggested its way better using diff samplers on base and refiner
Why?
A. get a good setup
B. either prompt SDXL properly (trial and error) / or write a sentence in natural language that is around 10~15 words long, no commas / Use Interrogator with ViT-H to get prompts from existing images
C. no negatives, unless you know what you want them to do
D. generate an image to make sure you didn't include a word that messes everything up (rare, but can happen)
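The rule of thumb in B (roughly 10~15 words, no commas) is easy to check before queueing anything. A minimal sketch; `check_prompt` is a made-up name and the thresholds are just the numbers from the list above, not anything SDXL enforces.

```python
def check_prompt(prompt: str, min_words: int = 10, max_words: int = 15) -> list[str]:
    """Return a list of rule-of-thumb problems found in a prompt (empty if none)."""
    problems = []

    # Word count by whitespace splitting; punctuation stays attached to words.
    word_count = len(prompt.split())
    if not min_words <= word_count <= max_words:
        problems.append(f"{word_count} words (aim for {min_words}-{max_words})")

    if "," in prompt:
        problems.append("contains commas")

    return problems


print(check_prompt("a cat sitting in a kitchen sink on a sunny morning by the window"))
print(check_prompt("cat, sink"))
```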
When I've tried to use it with others it's done strange things
I feel like that’s way more steps than needed lol
I find prompting in SDXL better and worse than in 1.5 at the same time XD
Better is consistency - worse is consistency.
I mean - I would like to have more randomness in the output - often the same prompt gives 90%+ the same results.
Which is a plus, as it seems to be working as intended, but it's also a negative if you've found a good style and need to experiment with each prompt just to get a different image from the previous one.
They are very similar.
Another issue is color - if I add "white background" usually it dramatically adds white as a whole to the scene/artwork.
What issues do you have and how do you overcome them?
Overfitting/training, known issue, hope 1.x fixes it
If you want consistency then you can make a LoRA out of it.
longer prompts often solve this. the shorter, the less variation on many heavily weighted words
im probably not happy with the overall result of sdxl, i think urs is super blurry too, 1024x1024 on 1.5 look so much cleaner, and crisp imo
People will fine-tune SDXL so results will improve.
That wasn't an attempt at anything good, just an example of what negative prompts can do
Remember how poor 1.5 was and how well fine-tunes work now..
i dunno everything i tried got kinda messed up with a lot of grain or blurry
"a man named void, 30 years old, is unhappy with the results of sdxl as he sits in front of his computer"
no style, no negative. first attempt.
😂
No negative, just his posts 😉
second image generated, seed+1
Edward Snowden?
trust me i look that depressed every day haha
even if SDXL was absolutely irreproachably perfect in every way. void: "oh well, I bet ill go blind soon and not be able to see it."
use this setup (from the lovely mimizukari), or the one from sytan
https://github.com/SytanSD/Sytan-SDXL-ComfyUI
replace your phone wallpaper every 4 days
SDXL improved a lot with fantasy but on the other hand - in some areas is overfitted as hell. Still - lot of improvement overall
Thanks for the advice, it really helped my prompt.
no style, no negative. first attempt. Cat sitting in a kitchen sink
replace your phone wallpaper every few hours, but its the same wallpaper but the person has a subtly different facial expression.
hahahaha
you can tell that's an AI generation though. you hid the cat's hands so people couldnt tell.
oldest trick in the book. (the book being around 18 months old)
Yeah I hid that it didn't have 5 fingers.
