#✨|sdxl
Yeah, the dev of it said that, but now he says to use some others over his Prodigy because of how large XL is. Sadly, we don't have access to the ones he suggests.
worst case gpu just self combusts 🤣
i decided on prodigy after watching this
https://www.youtube.com/watch?v=QpWacUWeqbE&ab_channel=kasukanra
#sdxl #ComfyUI #LoRA #runpod #dadaptation #prodigy #stablediffusion #style #styletraining
This is an SDXL 1.0 training log for art style. However, the workflow is also interchangeable with SD1.5. I document my thought process, experiments, mistakes, and analysis of quantitative and qualitative results. Hopefully, this can be a good starting gui...
okay this problem I'm too dumb to fix
well it didnt even compile before so progress?
Yeah, been subbed to him for a long while now. His is all anime while mine is realism so a lot just never translated well.
https://civitai.com/models/167991/loving-vincent
and i trained this with prodigy
that's cool
Trying to add TensorRT as an extension in A1111 - getting this error - AssertionError: extension access disabled because of command line flags - found the answer - DO NOT USE --share and/or --listen in the .bat file when installing TensorRT!
Oh, I remember you in here using that and van Gogh-ing everything, lol
lolol yes and the cats
I never have much luck training loras on any SD version but DB just works for me
catshrek, lol
does db even run with XL?
yes
The only option is to extract to Lora. Lycoris says it would be easy to make it extract to locon, which is what I want, but it doesn't do it
lora extraction only makes sense if you train the same layers as in lora
Oh, it makes a whole lot of sense and actually does give better results from the people I talk to. Takes longer to do is the reason people say they don't do it.
for me that is the only way any of my loras have been made
it does not make sense if you train db on all layers and then only extract a subset of them
freeze everything except the attention layers when you want to extract lora
for locon you would also train the conv layers in the resnet
in general extracting loras afterwards can be more parameter efficient, as you can set the lora rank dynamically. But it has to be done right
yes, and that's why I have been trying to get that as an extraction option, but they are just not doing it, saying Kohya can add it and that it is easy. Well, if the dev says it is easy who are any of us to say it isn't? All I know is for XL it hasn't been done.
it is easy, just a few lines of python code
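A minimal sketch of what those "few lines of python" could look like, assuming the extraction is a truncated SVD of the difference between a fine-tuned layer and the base layer. `extract_lora`, the `energy` threshold, and the toy matrices are hypothetical names for illustration; this is not Kohya's or LyCORIS's actual code. The rank is picked per layer from the singular values, which is the "set the lora rank dynamically" point above.

```python
import numpy as np

def extract_lora(w_base, w_tuned, energy=0.95, max_rank=64):
    """Approximate (w_tuned - w_base) with a low-rank product up @ down."""
    delta = w_tuned - w_base                      # what fine-tuning changed
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Dynamic rank: smallest rank whose singular values keep `energy`
    # of the squared spectrum, capped at max_rank.
    # (Assumes delta is not all zeros, otherwise the division fails.)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    rank = int(min(max_rank, len(s), np.searchsorted(cum, energy) + 1))
    up = u[:, :rank] * s[:rank]                   # shape (out, rank)
    down = vt[:rank, :]                           # shape (rank, in)
    return up, down, rank

rng = np.random.default_rng(0)
w_base = rng.normal(size=(32, 48))
# Toy stand-in for fine-tuning: add an update that is exactly rank 4.
true_delta = rng.normal(size=(32, 4)) @ rng.normal(size=(4, 48))
w_tuned = w_base + true_delta

up, down, rank = extract_lora(w_base, w_tuned, energy=0.9999)
err = np.linalg.norm(true_delta - up @ down) / np.linalg.norm(true_delta)
print(rank, err)
```

If the fine-tune touched layers that the LoRA format does not cover, their deltas are simply dropped by a scheme like this, which is the objection raised above about training all layers and extracting only a subset.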
exactly what he said too
Kohya just hasn't done it for some odd reason
I prefer Locon over lora
probably XL had an idea about it
what's up with emma watson?
im not using "animation, disney, pixar, 3d, animated, cgi" or any styling words at all. just subject line and token. XL knows how to do pixar kinda but i make sure to keep the concept separate from that data
btw, I am training using your info but I had to make it 2 epochs.
my generation grew up with her in our favorite movies, hard not to love her
546
I had far more luck with locon in 2.1 than a lora so let's see if this trains after a bit
I hate how the images on civit are so small now. Someone mentioned that to me last month and yeah, a pita now.
far too small to really investigate
good, now to test
caption this
lately i just eliminate the most problematic checkpoints in the XY plot until the last 2-3 and then choose by preference
ones with the most artifacts
oh, I never see those
in that first one you sent, the guy's torso is glitching in half
what I see is skeletons so that is too similar to the original
that first one all by itself IS base XL
the one on the left by itself is base XL
no lora
I am going to train for 1500 more steps
1092
what hardware and settings to train SDXL do you have guys?
yeah, this is a failure
im surprised at this flexibility considering its only trained on one movie and no styling prompts are needed at all. no regularization either
looks like 300 is it but all of the faces are screwed
The first one is 300 more steps over the last highest
It's XL I live with janky jacked up trainings.
Do we know of any wire extension for comfy that would allow me the user to actually put a pin on a part of a wire that connects the nodes?
I want to just orientate the wire to my will.
Nope
Nope. We really need something to handle the mess of wires
The only thing we have is the straight line one right?
the built-in ones, and it really messes with me when I am building. I am used to something way different, where if I grab the node the wire will follow, so if I slide right the noodle/wire will go left and I can see where it goes. Right now it is 100% crap. Another issue is that just clicking the wire doesn't tell you where it is connected most times.
ah okay
ComfyUI SDXL Sytan's w/flow DynavisionXL model
It's up!
Is it normal for images to not look that great with SDXL compared to 1.5 upscaled? Here are my settings: absurdres, high quality, top quality, a colorful parrot flying in a mangrove jungle
Negative prompt: easynegative, worst quality
Steps: 40, Sampler: DPM++ 2S a Karras, CFG scale: 7, Seed: 1338516404, Size: 1344x768, Model hash: e6bb9ea85b, Model: sd_xl_base_1.0_0.9vae, Clip skip: 2, ENSD: 31337, Refiner: sd_xl_refiner_1.0 [7440042bbd], Refiner switch at: 0.8, Version: v1.6.0. I am running lowvram (due to 6GB) no half vae and xformers arguments
@glass notch your prompt
I did not use anything but the positive prompt from what you posted
It's clear that it's easier to prompt concepts but from your own example it does seem that the quality I got is to be expected. Thanks for doing the test
Freyja
Freyja after acquiring Brisingamen, yet unable to find joy again
I never trusted a banking agency ever since Sberbank did what they did
Those MFs made Kandinsky
That's fucking wacky as hell
It is, it also came out of nowhere.. they just decided to pour money into training a pixel diffusion model
And an LM right?
Though that model had image blending capabilities somehow
Unsure, they're sketch af
I'm also not sure if Kandinsky can blend images due to being pixel diffusion or maybe because it uses ViT-14 or something like that
not much came out from it from what I can tell.
Yeah, I hate them. They made a model that can blend images and never explained what causes it to be able to do so
Oh that's it. "Image encoder: ViT-bigG-14-laion2B-39B-b160k"
So it can have image input due to having both a text encoder and an image encoder?
because they are trained on clip image embeddings - it's as easy as that
SDXL has that, too, by the way. But they always train on text and clip embedding at the same time. Thus, providing only images but no text is probably less effective in SDXL than in Kandinsky
I'm not so interested in image blending, so I haven't tested if image blend in SDXL works as well as in Kandinsky
looking for masking tool for auto, asking for a friend o0
I use chatgpt, but I also like local Vicuna 13B
is it a Lora?
prompt?
No, SDXL won't have image input unless you use it with IPAdapter
Or ControlNet?
Blending is definitely not a capability without IPAdapter, many people have tested and come to that conclusion
@rustic garnet are you referring to the CLiP_vision as an image encoder? Because that's not really a part of SDXL.. SDXL just has 2 text encoders, no image encoder according to HF
With using CLiP_vision, SDXL can have image input, I only tested this with 2 inputs for blending and it wasn't coherent at all.. it was able to blend the 2 images only one out of like 30 times
Haven't tested that method with 1 input, so idk how does that behave
clip vision is part of clip which is part of sdxl
it doesn't work though
yes, as said, sdxl is not trained on clip pooled only input. That's why it's not as good for that. But this is a problem of the training not the model itself
the outputs with image input using CLiP vision have nothing to do with the input
you could fine-tune sdxl on that - but honestly, ipadapter is the better solution anyways
spelling needs a bit of work, but the cat pumpkins are nice
IPAdapter causes the quality to degrade by a lot most of the time, I have plenty of experience with that thing
it can blend images, but again, nothing like the quality you normally get out of SDXL
PixArt Alpha uses a dalle3-like structure and it might be open-sourced soon.
when showcasing they said "can be as good as SDXL and even midjourney" which is a huge red flag since SDXL itself already is better than midjourney
Hair is clumpy? I think it is FreeU?
does FreeU even improve anything on SDXL? I never tested it
FreeU does help in some cases but not a general solution
It does, but very finicky, better eyes, face and hands, and lots of detail...
idk, the images I generate already have exceptional quality imo, I'll try it, but I doubt it'll improve anything
kinda different though, dalle and sdxl have a unet, this one has a transformer structure, the similarity with dalle is the t5 text encoder and automatic captioning to generate better captions. but all in all pixart seems more innovative than dalle-3, unless the dalle paper leaves out lots of details, it seemed kind of ordinary to me, just dalle++, more training, better data, and go with it, even sdxl did with two mixed clip encoders, aesthetic score, size in training data, might be because llms are openai's core (and that makes it all the weirder they kept unet vs transformers)
Suggest a cool concept to train a lora on?
if you were Italian I would like Dylan Dog comics
Hackerman SD XL 1.0: https://civitai.com/models/170538?modelVersionId=191617
Chalkify SD XL 1.0: https://civitai.com/models/170557?modelVersionId=191642
Introducing Hackerman SD XL 1.0, the LoRa that transcends ordinary transformations. Join the digital revolution, as Hackerman SD XL 1.0 unleashes t...
There's been a few recent updates to both ComfyUI and the IPAdapter Nodes and it's sorted out the memory efficiency a lot.
I used to constantly go over the VRAM limit with 10GB and get slowdowns, now it's so much better. I can even run multiple controlnets with it and get no slowdowns.
also AITemplate is about to be officially supported in ComfyUI
The custom node broke for me a bit ago, so I just stopped trying
modules and code from the previous AITemplate repo for ComfyUI have been salvaged and are slowly being implemented into a new one made by Comfy and FizzleDorf
which AIT nodes are you using now, the legacy set or the newest release in the manager?
The old ones
With pure txt2img I'm still using a commit from a month ago, that's the fastest on my 4070ti
I can't get the new ones working at all 🥲
What about him?
Yeah, that's the old one
Both the new one and the old one use the modules I made though 😛
oh I did get the new one working now after all, had to compile aitemplate manually... but the git patch still doesn't apply so not sure what's that about
Comfy said the patch will be included in upcoming ComfyUI versions by default
I assume that's already happened then(?)
unet has transformers, too 😉
I'm not so sure if using transformers only is really an advantage...
Nope, not yet afaik
It will eventually though
well the patch just says
error: patch failed: comfy/ldm/modules/attention.py:91
error: comfy/ldm/modules/attention.py: patch does not apply
error: patch failed: comfy/ldm/modules/diffusionmodules/openaimodel.py:370
error: comfy/ldm/modules/diffusionmodules/openaimodel.py: patch does not apply
And that'll be what everyone will probably use until exDiffusion comes around
so I thought it has nothing to change in there, but I guess it's probably because it was made for some previous commit
It's not compatible with any commit, Comfy needs to do some stuff on his end for that
And he will, eventually
right
It should be as fast as the old commits are, so that's probably what everyone will use for a while
Well, until people figure out how to make optimized kernels instead of engines for diffusion
Much like happened with LLaMa
yeah umm I still don't see that happening, openai released their dall-e paper though which is nice
shows that a good model should probably be trained on something better than laion captions for starters
How does making optimized kernels have anything to do with OpenAI?
no just thinking that maybe let's get a solid model or something before making it super fast
much like everybody started doing stuff with llms only after llama was released
no
SDXL isn't a solid model? I know it only does text input, but it pretty much masters it
compared to dall-e 3, it could certainly use some work
(inb4 somebody posts 'but look at this cool image, dalle can't do it', but then dall-e draws 3 people with exact specified shirt colors)
I often compare SDXL to DALL-E 3, SDXL is usually way better with the quality of the images. The reason DALL-E 3 has a better understanding of language is due to using a better encoder
it's still t5 (how many params remains a mystery), their captions are just better
It's not just the captions, the model itself has some kind of LLM to handle inputs
if you look at the paper they sort of mostly explain that, basically user = dumb, so they process the prompts to match the ones the model was trained on
user's prompt -> gpt4 -> descriptive prompt (which matches the format the model was trained on)
The model is also a pixel diffusion model, which is losing points in my book
Well you kind of have to read tokens.json to really understand what to prompt it with yeah.
It is latent diffusion model
DALL-E 3? Nope, you can easily tell it's pixel diffusion by looking at the graininess of smaller details
thanks, got confused because the pixart paper claimed to be new in having transformers instead of unet. Whether it's better or worse remains to be seen.
if you make your prompt too long you can see the vae bleeding through...
Well, if DALL-E 3 is latent diffusion it's definitely not a good one at that; the most common symptom of pixel diffusion is graininess - which DALL-E 3 has
What irks me about these new textencoder heavy models (pixart now but also deepfloyd/imagegen) is the text encoder is larger than the latent model.
you can offload/quantize the encoder so it's really not that huge of a deal
Speaking of, does DALL-E 3 have an image encoder?
that was my first thought as well, but since it's not done yet, maybe it's not so simple
Highly descriptive captions are the key to improve prompt following.
I don't know about pixart but imagen encoder can be quantized with bitsandbytes (the results are terrible but that's a separate matter I guess)
The main weakness of SDXL is the lack of an image encoder, so I'm assuming future versions of SD will also have image conditioning capabilities?
I don't see image encoder mentioned in Dalle3 paper.
Damn, this means DALL-E 3 won't have image input without the assistance of something like IPAdapter
But Dalle3 is highly connected to GPT4V which is able to accept image input
main weakness i'd say is prompt following, image input is nice to have too of course
is it though?
That's just CLiP being a bottleneck
Maybe clip's a bottle neck
GPT4V handles user input and makes the prompt for Dalle3
maybe it's a strength too for styles/aesthetics
dalle-3 goes out of its way to follow weird/contradicting prompts for me leading to ugly images it seems
Gluing together DALL-E 3 and gpt 4 is just a method of censoring the model without choking the training
sdxl's understanding of prompts is beautifully abstract in a way
I'm still stumped on the idea of DALL-E 3 being a latent diffusion, it has such a pixel diffusion look to it
Anyways, I think SDXL will have a better understanding of language if the encoder won't be CLiP, and switching from CLiP to something else will also open the opportunity to also have an image encoder
argh, no, it doesn't
this has really NOTHING to do with the text encoder
in fact, using CLIP is the best thing you can do if you want to easily use input images
if you want a model that can take images as input you have to train it so that it also accepts images as input (at least in a certain % of the cases)
lotta new diffusers use t5 which is apache licensed
SAI haven't done this, probably they thought image inputs are not important anyways
the only ones I've seen do that is PixArt
and DeepFloyd
DeepFloyd is pixel diffusion, making it somewhat irrelevant
dalle 3 uses t5 it seems
so T5 is the way, huh?
does that prevent a model from also having image input? because Dall-e 3 doesn't have image input it seems
pixart is hilarious. the text encoder is 10x the size of the actual model
No idea. I just briefly skimmed the paper to confirm the T5 thing
I imagine you could just use a non-zero latent same as other diffusion models
that encoder seems to do a better job than CLiP though
has pixart uploaded their models yet? I can only find links to t5 and vae, and the hf space is dead
I think due to encoders being more simple than the UNETs it can be quantized, no?
yea I just find it funny that it's the complete backwards approach to SD with a small encoder and a phat unet
haven't yet. Might be soon
they have a python script to run their pixart alpha model on GitHub as of like 2 days ago but I couldn't make it work
also no safetensors format or diffusers pipeline so sus
one red flag about PixArt, they say it's good at photorealism, but wtf about all the other stuff?
yes. they use diffusion transformers instead of a unet. I'm just saying, the unet is ALSO a transformer. A unet is basically a combination of transformers with convolutional residual networks. The transformers in the unet work on the latent pixel space (thus, they are expensive) and the convolution is necessary to get the spatial relationship between the pixels right. Diffusion transformers instead split the image into large blocks and then use transformers on these blocks instead of on the latent pixels. This is cheaper. The question, though, is whether this is also better. I somehow doubt that.
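Rough arithmetic for the "this is cheaper" claim, as a sketch only. The 8x VAE downscale and the 2x2 patch size are assumptions for illustration, and the helper names are hypothetical. A real unet also runs its attention at downsampled feature maps, so this is not an exact accounting of either architecture, just the scaling argument.

```python
def attention_tokens(latent_h, latent_w, patch=1):
    """Number of tokens self-attention sees for a latent of this size."""
    return (latent_h // patch) * (latent_w // patch)

def relative_attention_cost(tokens):
    """Self-attention compares every token with every other one: O(n^2)."""
    return tokens * tokens

VAE_FACTOR = 8   # assumption: latent is 1/8 of the pixel resolution
PATCH = 2        # assumption: 2x2 latent patches per token

h = w = 1024 // VAE_FACTOR                    # 128x128 latent for 1024x1024
per_pixel = attention_tokens(h, w)            # one token per latent pixel
patched = attention_tokens(h, w, PATCH)       # one token per patch
ratio = relative_attention_cost(per_pixel) // relative_attention_cost(patched)
print(per_pixel, patched, ratio)
```

Under these assumptions, grouping latent pixels into patches cuts the token count by the patch area and the pairwise attention cost by the square of that.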
once they actually make a safetensors version and it gets a diffusers pipeline it'll be easy to compare
it's licensed under AGPL which is cool
if their example images are to be believed it's good at a lot of styles
It probably is if sorted correctly. Group abstraction is very useful.
i never believe examples
using something like t5 seems excessive to me, a text to image model shouldn't need capabilities of something like t5 (it can do text to text, eg translation, that's crazy if there's proper captioning in one language only) it just needs to understand things like how tokens are related, how x inside z is interpreted
could be an nvidia "the 4060 gets twice the performance as the 3060" example
as mentioned earlier: a good text encoder doesn't help you if your captions are bad. The good thing about CLIP is that it is extremely robust even on bad captions. The key is to use good captions with a good text encoder, and that is only possible if you improve the captions using some powerful llm and multimodal models (like blip or llava)
maybe one of you guys will get it working
https://github.com/PixArt-alpha/PixArt-alpha
but I agree, now that we have these good multimodal models we could replace clip
but yea kinda funny. The mid size T5 with a tiny diffusion model
I just looked at the examples, they also have the graininess I was talking about
pixart is trained on almost nothing though, can't help wonder what happens if more images are fed to it
sometimes small highly curated datasets can do better than a billion bits of trash
yeah, maybe if we'll get an SDXL2 that's trained with T5 it won't have that graininess to it?
how curated their set actually is though idk
It was trained in multiple stages
also it uses the SD 1.5 vae which is funny
i mean if it's not broke dont fix it I guess
I feel like dall-e has a really good vae compared to any sd model, it's evident when trying to feed the images into sd - like half the details are lost
Dall-e suffers from the graininess syndrome it seems
idk im a bit sus of pixart. They constantly bring up carbon emissions and hardly ever mention inference performance/quality
all the models have some grain to them, idk why you keep saying that
when I feed images in to SD VAE I have to zoom in to see any difference...
Don't hate the player hate the game.
SDXL doesn't have nearly as much graininess
what does that even mean in this context
for any model, using it is being able to tell whether it's good, cherry picked images say nothing, deepfloyd seemed promising, but i never managed to get anything remotely decent out of it that wasn't similar to "[subject] holding a sign with the text "wtfbbq this is next-level"
also deepfloyd OOM's on my 24gig card...
it follows prompts decently well but the end result looks like hot garbage
so maybe it could be possible to train something like SDXL with T5 and quantize the encoder? that seems like a logical way to go
why quantize the encoder
because it's 6B
yeah, I agree 😦
T5 is small enough to run on 8G cards isnt it?
4.3B*
just swap the encoder and unet from ram->vram
in my opinion quantizing the encoder would totally make sense. As Aliquip mentioned: the T5 model is way too heavy for the image caption problem anyways
quantizing hurts performance so quickly it wouldnt make sense tbh
maybe 8bit would work?
at least 8bit quantization wouldn't hurt much I guess
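Back-of-the-envelope numbers for the quantization question, assuming the 4.3B parameter figure quoted above and counting weights only (no activations, no quantization overhead such as scales). `weights_gib` is a hypothetical helper, not part of any library.

```python
def weights_gib(n_params, bits_per_weight):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 4.3e9  # assumption: encoder size as quoted in the chat

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weights_gib(N_PARAMS, bits):.1f} GiB")
```

Each halving of the bit width halves the footprint, so an 8-bit encoder of this size needs roughly half the memory of an fp16 one. Whether the output quality survives is a separate question that this arithmetic does not answer.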
so a model that has 8bit T5 and a UNET like SDXL would be a good idea?
actually try it in Oobabooga or something
I mean, SAI tried T5 for SDXL and they decided for CLIP
could also be that all the 30B models are older which doesnt exactly help
2 clips lol
the 2 clip thing is kinda weird to me
so I guess that T5 might give better text understanding but this doesn't mean the images look better
wonder if it was to make it more compatible with 1.5 style prompting?
2 clips actually good
it totally is. I guess they didn't care and just chose what worked best
they tested this all in the bots
so im guessing the new and old clip working together acted to sorta boost people constantly 1.5 prompting the bots
T5 probably needs a totally different prompt structure so people had shit results
I assume that CLIP-L is better at certain artists and styles which were removed in CLIP-G training data 🙈
this is also possible...
but yeah. prompting style also affects the result. People are dumb compared with an LLM
I wonder why. T5 seems like a logical choice from what I'm reading
this is a huge factor for the bot, doubt all was decided based on the bot, but already the bot sometimes feels sooo bad to me when old prompts fail
why?
CLIP is a multimodal model
makes sense for a text to image model
T5 is a pure text model that has never seen any image and was never trained on image captions
i suppose that'd mean if they went T5, things like unclip/clipvision wouldnt work
I guess you could always interrogate with clip then just feed the text into T5?
(and yes, I totally agree that training on a text corpus makes sense to get a model that is better in text understanding. I just say that CLIP is not totally stupid)
wouldnt be ideal though
you'll lose the subjects in that process
to achieve image input you'd need the conditioning to have the images
regardless of what they do next with text encoding I just hope they ditch the refiner lol
I cant say I've used it once in the last month
same, the refiner seems useless for the most part
just extra params
hmm, refiner....
It is kinda interesting. You could use GPT4 to read the image, create prompt and feed it into sdxl to get the required result
main problem is the refiner butchers high frequency details
like it can make some structures look better but the image almost looks lower res as a result
you will lose the subject in the image input like that
no
often the refiner is a skip, then i use Fooocus with all the defaults, and am amazed how nice the results are
in my opinion, Figure 3 in the PixArt alpha paper shows nicely why their method might work so well
if you feed it an image of let's say: a dog, then use the output as a prompt- it WON'T be the same dog
It wouldn't be same via vae
yea when it gets a diffusers pipeline it might be worth trying out in SD.Next
though it WOULD be the same via image conditioning
so smaller dataset of "refined captions"
yeah, they should refine LAION
they have
and as they write: if your captions are well aligned, you need less data to train
yea kinda why you can train a Lora with just like 100 or so hand-captioned images
so, a better dataset will produce a model with better language understanding without depending on CLiP?
is that what the Dall-e 3 paper is about? I haven't read it yet
that's the pixart paper
Ah I meant kaibo's screenshot
I was talking about this
i'm hoping captions is 90% of what makes models better 🙂
dalle3 and pixart used a similar approach for captioning
I think dalle3 is doing better on this
according to this paper the understanding of language can be substantially improved by training on highly descriptive generated image captions.
pixart is a tiny (ish) model by comparison though. it might be a better approach for local inference
hence why it's also AGPL licensed
I never had much luck the few times I have tried to prompt Pixart.
a high quality model built specifically for easy training and local inference licensed under AGPL is a winning combo if it actually turns out to not be garbage
but for now we wait for a diffusers pipeline
did they? I heard that multiple times, but never read about that anywhere
LAION-LLaVa is the refined dataset
fair tbh. the sd 1.5 tuning ecosystem is so big cause they're all just undoing laion-isms
also waifus but beyond that its other datasets
They evaluate those datasets in table 1 and choose to use SAM-LLaVa
yes, but they created the captions for LAION and SAM themselves
what I meant: did LAION ever come up with the idea of refining their captions? I think pseudo said that once, but I never found evidence for that
oh, I understand what you mean. I don't know anything about LAION itself
it seems that T5 isn't limiting PixArt from having image input, they were able to make ControlNets for it
the text encoder has nothing to do with that...
the params of the model didn't say anything about an image encoder though
the control net is its own image encoder
a control net is a separate network
that takes the control net image as input
it won't be able to blend images it seems
not if they haven't trained for it
controlnets but no diffusers pipeline smh
the code uses diffusers and follows a similar structure/api
doesn't seem hard to incorporate into diffusers
at least make a safetensors file
they seem to really do their best to blend into the existing generative ai eco system
yea they just made the HF page two days ago so im not expecting miracles
but diffusers is listed on the "todo" so 100% just wait for that instead of trying to hack their inference code into a UI
hypothetically it should just work™️ on sd.next then
I compared some of their showcased results with SDXL, SDXL seems to be better in most cases
so better language understanding or not, SDXL still takes the cake
though I don't doubt SDXL would be even better if it was trained in the way they trained that model
so maybe we'll get an "SDXL2" or even an "SD3" that will have better language understanding
wonder how pixart does with dark latents
its interesting how XL still has this split ground perspective problem. idk how it can master reflections but cant keep the ground level. happens to me constantly
what's the largest SDXL images you guys have made?... I'm trying a 3440x1440 right now
yeah... 3440x1440 is a no go
I've looked into it, it should be possible to have a model that uses T5 as a text encoder, while having something different as an image encoder
PixArt doesn't have an image encoder due to trying to be as efficient as possible in training, but they theoretically could
This might be true of DALL-E 3 as well
In the case of DALL-E 3, the language understanding doesn't come from the text encoder though, they explained it was the dataset they trained it on according to the paper
programmer
did someone mention a few days ago, a website where you can upload a dozen or so images and it will make a lora for you?
with a 2-pass approach I've done 3840x2160.
just one pass it's like 1080p tops
two pass?
what the automatic1111 UI calls "high res fix"
you mean when you make a lower res image then use AI to upscale and add detail?
or just upscale and enhance?
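A small sketch of how the first pass of such a two-pass run could be sized, assuming a roughly one-megapixel budget for the base image and dimensions snapped to multiples of 64. Both numbers are assumptions for illustration and `first_pass_size` is a hypothetical helper, not an option in any UI.

```python
import math

def first_pass_size(target_w, target_h, budget=1024 * 1024, multiple=64):
    """Pick a low-res size with about the target aspect ratio and pixel budget."""
    scale = math.sqrt(budget / (target_w * target_h))

    def snap(v):
        # Round to the nearest allowed multiple, never below one multiple.
        return max(multiple, round(v * scale / multiple) * multiple)

    return snap(target_w), snap(target_h)

base_w, base_h = first_pass_size(3840, 2160)
upscale = 3840 / base_w   # factor the second pass has to cover
print(base_w, base_h, round(upscale, 3))
```

The second pass then upscales the base image by that factor and runs img2img over it to add detail at the final resolution.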
yeah, I'm mostly just stressing my hardware out to see what it'll do
what's your GPU?.. I was getting about 7s/it for 3440x1440
7900 XTX
i generated a 3440x1440 image but it failed at the decode phase
use tiled
vae decode tiled?
I'm going to attempt a 50-step 3440x1440 with tiled decoding and see if it works, or if a red text box pops up and cusses me out again
jeeze try it out on like 5 steps first make sure it decodes
boom.. 50 steps 3440x1440
12 minutes, roflmao
there is so much wrong with it too like how planets and even a sun are sitting on the ground, but that wasn't really the point
after the first tile decode and it builds the kernels subsequent ones are faster
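A toy sketch of why tiled decoding gets around the failure at the decode phase: the latent is decoded in overlapping tiles so only one tile has to fit in memory at a time. `toy_decode` is a stand-in nearest-neighbour upsampler, not a real VAE, and the tile size, overlap, and 8x scale are assumptions. Real implementations blend the overlaps with feathered weights because a real decoder is not exact across tile borders; plain averaging is used here to keep the sketch short.

```python
import numpy as np

SCALE = 8  # assumption: latent-to-pixel upscaling factor

def toy_decode(latent):
    """Stand-in for a VAE decoder: nearest-neighbour 8x upsample."""
    return latent.repeat(SCALE, axis=0).repeat(SCALE, axis=1)

def decode_tiled(latent, tile=64, overlap=8):
    """Decode overlapping latent tiles and average them where they overlap."""
    h, w = latent.shape
    out = np.zeros((h * SCALE, w * SCALE), dtype=np.float64)
    weight = np.zeros_like(out)
    step = tile - overlap  # overlap must be smaller than tile
    for y in range(0, h, step):
        for x in range(0, w, step):
            piece = toy_decode(latent[y:y + tile, x:x + tile])
            ph, pw = piece.shape
            ys, xs = y * SCALE, x * SCALE
            out[ys:ys + ph, xs:xs + pw] += piece
            weight[ys:ys + ph, xs:xs + pw] += 1.0
    return out / weight

rng = np.random.default_rng(0)
lat = rng.normal(size=(180, 430))   # latent size for a 3440x1440 image
full = toy_decode(lat)
tiled = decode_tiled(lat)
print(tiled.shape, np.allclose(tiled, full))
```

With this toy decoder the tiled result matches the full decode, since each output pixel depends on a single latent pixel. A real VAE has a receptive field that crosses tile borders, which is why overlap and blending matter there.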
Got a prompt question: when prompting, everything that I get looks extremely new, almost like a 3D render for some things. How I change that is to use "old" and "dirty", which works 80% of the time, but then I get something like an animal that I want to look like a natural-looking animal, and "dirty" makes it dirty. Suggestions on prompts to fix this?
photograph of
cinematic photo of
portrait shot of
digital photo of
movie still of
made an image gallery from the ComfyUI output folder, no more searching for an image from 2 weeks ago
mobile friendly 🙂
I meant the whole image took 12 minutes doing that high of a resolution. But that resolution was mostly just a proof of concept, I'll never actually have a use for images of that resolution and if I did, I'd do it in a more organized manner, starting off probably with 20 step SD1.5 stuff I could upscale and feed through img2img after I have a good starting point
1.5 is slower than XL after 768x768
yeah, but you can start out with a smaller image, and use it for img2img when you create one that has the basic look that you want, and you can upscale before plugging that into img2img and make bigger images based off that smaller one.. the benefit of doing the smaller ones is you can generate a lot of them, often several at a time, and reach a starting point faster to build from
if I do a 1920x1080 image, and there are imperfections, it requires a lot of subsequent effort with inpainting to correct, and all of that may be avoided by starting out with an image already close to what you want
bonus image
prompt and model?
base XL no refiner
prompt uhhhh
if you use comfyui you can just drag the image onto the canvas and see the workflow
grainy deep ocean footage of a an monstrous tentacle woman in the abyssal depths below
base xl is just SDXL
like the normal model
base version
I found the one I was talking about that is dead for a long time now, and the one I was mentioning I left their discord when they said it was dead. I had their icon in my mind and just found it. https://github.com/devilismyfriend/StableTuner I never heard of simple tuner before.
Just happy my Comfy is working again...
Is that Rutger Hauer from Blade Runner in the background?
The quality is amazing 👍🏻
How do I make pics?
This could be a superhero whose power is to replace and repair public water utilities
Shit, lol
nice
pretty good conan
Chalk Lora from @stone fossil
Did you use the same prompt? If you wanted the prompt and model I used it is here: https://civitai.com/images/3082642?period=AllTime&sort=Newest&view=categories&modelVersionId=190677&modelId=169671&postId=713401
I took a look at DALL-E 3's paper, they do indeed use T5
they've given out access to their paper? i thought anyone who is allowed to see it has to sign an NDA
It was leaked and after they noticed it was leaked it popped up on their website to keep their cool
https://cdn.openai.com/papers/dall-e-3.pdf oh they've put it out finally.
Yep
Uses T5
time to pour some extra strong coffee
yeah i figured they did. it's pretty good
Why didn't SDXL use T5?
/shrug
targeting home gpus probably
T5 can be quantized, that's no excuse
would a t5 trained model work with a 3070?
Yes, easily. An 8-bit T5 won't hurt it
also i think they like openclip because they have the license to it. t5 is a restrictive license isn't it?
It is too big
It can easily be quantized with minimal loss
I'm almost sure OpenAI aren't using full precision on T5
Even 8bit would be enough to make it run on about the same scale as CLiP
i'm sure if it were easy we'd see more researchers doing stuff with it. there's probably big caveats. t5 has been out for a long while
it's very impressive too
people don't just ignore that. there's gotta be a reason why
we do see that though
Kandinsky never used it either, right?
Kandinsky uses ViT_14 or something like that
At that time, people were used to CLIP-L-style prompts. I think with prompts like that, T5 might not give good results.
lots of other models coming out but i only see google and other big proprietary ai companies using it. must be a lot of licensing issues tied to it
It really is the simple things that hook me.
people slipped right into prompting dalle 3 so easily. dalle 2 too. it's clearly a better layer for prompting
DALL-E 3 prompts very easily and it uses T5
prompt don't go directly into dalle 3
you can punch sdxl style prompts into dalle3 and it'll do fine
has better comprehension too
user prompt -> gpt -> descriptive prompt -> dalle 3 T5
That's if you're using it through chatgpt. there are other interfaces
the GPT part is just to censor the model, it isn't improving anything
Dalle 3 was trained with descriptive prompts. They use gpt to generate that style of prompt to get the maximum capability out of the model
using gpt to rewrite a prompt won't make it better comprehension. you can't just throw gpt4 at sdxl and get dalle results
t5 is the core reason why dalle prompting is so good
Actually, you could get better result using gpt4 refined prompt
yep, I wonder if future SD versions will use T5
sure, but that's subjective. prompt comprehension won't improve though.
you can actually score prompt comprehension through a variety of methods
prompt comprehension doesn't come only from T5
it's also from the dataset, but I'd say SDXL's dataset is way more than enough to make a model better than DALL-E 3 when trained with T5
maybe SAI didn't use T5 because quantizing wasn't a thing when they began training SDXL?
because if quantized properly; T5 can have close performance to CLiP
doesnt DFIF use T5 already?
they're pretty familiar with T5 so I assume the reasons for going clip on XL was more than just the extra gig of vram
it was also a pixel diffusion model though, and was extremely performance heavy, so IDK if they already quantized T5 there
it doesn't diffuse latents, it diffuses pixels. that's kinda dumb as of now
so no, no VAE
I don't know, it's either they couldn't quantize T5 or had their own reason not to use T5. because evidently, T5 is the better text encoder
the double-upscaling sucks ass. pixel could maybe work with a better method for that
though maybe using unedited images directly instead of the VAE makes it noisier
i did a dozen generations with df and decided it wasn't feasible as a tool
we are already past that point, latent diffusion is the superior method as of now. but that has nothing to do with the fact that they had no reason not to use T5
maybe performance, but we can fix that via quantization with minimal loss as of now
apparently T5 by itself uses > 12GB
also LLaMa, but that's full precision
if you quantize T5 down to 4bit it's going to destroy the quality
and 8bit would still be like 7 gigs
which is as much as the entire XL pipeline right now
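For a rough sense of the sizes being argued about here, weight memory is roughly parameter count times bytes per parameter. A minimal sketch, assuming the 4.3B parameter figure quoted in this chat for the T5 encoder (an approximation, not a measured value) and counting only raw weights, not activations or quantization overhead:

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: parameter count as quoted in the chat, not verified here.
T5_ENCODER_PARAMS = 4.3e9

for bits in (32, 16, 8, 4):
    size = weight_size_gb(T5_ENCODER_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{size:.1f} GB")
```

Under those assumptions, 32-bit weights land around 17 GB, which is in the same ballpark as the ~16GB checkpoint mentioned later in this chat, and 8-bit would be roughly a quarter of that. Whether quality survives at a given bit width is a separate question this arithmetic doesn't answer.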
not if quantized properly, look at LLaMa for instance; even when 4bit, it can do stuff just as good as full precision
I've literally made my own 4bit quants and they suck ass
even 6bit can be sketch
if you compare with what the full fp16 model does it totally destroys the outputs
you didn't do it properly then.. it wouldn't make sense that it barely affects LLaMa and destroys T5
or maybe degradation rate is higher with 4.3B models?
tf you mean I didnt do it properly. have you actually used higher than 4bit on meaningful models?
if you only ever use 4bit gptq then they seem nice until you try the same model at 8bit+
my figuring is if it was as good as hype says it is, it would've caught on by now. i used the 4bit llama too and it descended into garbage after two prompts. couldn't make it work at all.
24gb doesn't seem to be enough for a llm that doesn't dissolve into gibberish with the slightest bit of context
if you have ooba I have a bunch of EXL2 quants on my huggingface including a 4bit https://huggingface.co/Beinsezii/MythoMax-L2-13b-EXL2
There's absolutely 0 comparison between 4 and 8 bit
maybe i'm doing it wrong, okay, but if it were simple to deploy and usable, people would be using it. That's how i see things.
I did try 8bit, it was barely any different than 4bit
people use 4bit quantized LLaMa all the time
i hear that okay and i see them using it, but i feel like they're struggling to use it and are pretending it's all good
comes to a certain point where i might as well just write things myself
it works very nicely though, if inferred properly; people rarely complain
i mean what's there to complain about?
the alternative is using cpu offload with transformers and getting 0.2 it/s
um, there's a HIGH bar to use LLM. a level of technical know how. people complain constantly about that. there's a huge sea of newbs wishing they could use LLMs of any kind and they're all rabbling
LLaMa speed isn't measured by it/s
so having 4bit fit entirely in vram and running @ 5 it/s is a godsend to them
tokens whatever
same thing
it's hard to find good advice for ooba booga because of how many newbs are trying to make their own e gf
maybe you filter out all the noise but i assure you there are complaints
so you're saying SAI doesn't use T5 due to performance, eh?
no. i think it's more about compatibility and ease of deployment
if you really want T5 you can always use DFIF
that includes performance, but also helping people implement niche software libraries
look, a model that uses T5 is going to release soon, we'll see how that does
pixart?
yeah, that uses T5
they already have an inference script on their github
go try it
I'm just waiting for the diffusers pipeline
so it'll work in sd.next
i disregarded pixart when i first heard about it, because they were bragging about how few carbon emissions it cost to train. i think all of that is just dumb poppycock nonsense. minimizing the carbon footprint of a single project isn't going to do jack all. we need to plant trees.
Anyone bragging about their carbon footprint is a scammer, like recycling companies are, so i tend to lose trust when i see it
it might be good, but they're using carbon footprint to hype it, so i don't think it has any legs
optimistically, if it's as easy to tune as they say it is and runs well locally then it could be a success if the images arent garbage.
- AGPL license nice.
but the fact that they bring up carbon emissions 5x as often as inference quality is sus
its out now? oooo. worth a look, but i'm a giant cynic about carbon footprint obsessed tech projects
oh no no weights yet
and Inference requires at least 23GB of GPU memory.
Something pretty for a change... 😜
sdxl always struggles on teeth really bad i've noticed. worse than hands ever were. Cool image though. life and death
What do you mean with "struggles on teeth really bad"?
no canines. all incisors. sdxl LOVES incisors
100%
Cheers...
Cheers
this is odd. T5 has 4.3B params, how does it take that much VRAM?
the entire SDXL model has more params than that and it doesn't take as much
different architecture
look on pixart's HF. the t5 encoder is literally like 16GB of just weights
yeah, and they didn't even release the UNET yet, the HF has just the text encoder and the VAE
single artwork for famous song gotta have teeth - obie trice
so considering one of the main points of XL was fitting on 8gig cards I think t5 was automatically off the table
fair enough I guess. so the reason T5 wasn't used is the encoder itself being as big as the UNET
bigger
SDXL UNET is over 6B, a little smaller
even if you halved the size with a quant without destroying the inference quality t5 would still be bigger than XL
bigger in bits, not in parameters, which aren't comparable across different architectures.
in terms of gigabytes
yeah, I see
some pentiums reached 4ghz. they are not faster cpus than modern i3's
both clips combined are like one gig I think
how does df's t5 implementation work? i was running it on my pc and it didn't need 24gb
maybe an "SDXXL" targeting 16 or 24gb minimum would work
dfif used > 24 gigs for a single image when I last tested
unless there's improvements in diffusers now
maybe the two step helped me out. i only got 16gb
made a dozen or so images when it dropped public.
was slow though
maybe load t5 first, encode, unload t5, diffuse?
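That load, encode, unload, diffuse idea can be sketched like this. It's only an illustration of the ordering: `fake_load_encoder` and `fake_diffuse` are hypothetical stand-ins, not real T5 or diffusion calls, and none of this is taken from DeepFloyd or any actual pipeline.

```python
import gc
from typing import Callable, List

Embedding = List[float]
Encoder = Callable[[str], Embedding]

def encode_then_unload(load_encoder: Callable[[], Encoder],
                       prompts: List[str]) -> List[Embedding]:
    """Load the text encoder, embed every prompt, then drop the encoder."""
    encoder = load_encoder()
    embeddings = [encoder(p) for p in prompts]
    del encoder   # drop our reference so the weights can be freed
    gc.collect()  # with PyTorch on a GPU you would likely also call torch.cuda.empty_cache()
    return embeddings

def fake_load_encoder() -> Encoder:
    # Hypothetical stand-in for loading a large text encoder such as T5.
    return lambda prompt: [float(len(prompt))]

def fake_diffuse(embedding: Embedding) -> str:
    # Hypothetical stand-in for the diffusion step, which only needs the embedding.
    return f"image conditioned on {embedding}"

embeddings = encode_then_unload(fake_load_encoder, ["a fluffy cat", "a dark temple"])
images = [fake_diffuse(e) for e in embeddings]
print(images)
```

The point is that the encoder and the diffusion model never have to sit in memory at the same time, at the cost of reloading the encoder whenever the prompt changes.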
I don't see that happening, the idea is to make a solid model that can run on most stuff.. it seems like T5 is automatically defeating that purpose
when nvidia releases the 5080 it'll be good right? that'll come with 30gb right??? /padme
the 3080 had 10 gigs and the 4080 was gonna have 12 before they renamed it
if they'll continue the path they're going in they are going to cuck the VRAMs bandwidth again
if you just want vram and nothing else you can get an A770 16gb for like $300
7900 XTX 24gb for $950. outperforms a 3090 when on equal playing field
they literally used 128bit VRAM on almost the entire 4000 series, that's stupid
idk if they'll do that to 5000 series
gets you banned in counterstrike though (not that i play i just think it's hilarious and typical amd driver moment)
play better games, ez
i grow weary of amd. was using them for a long while. 4080 is my first nvidia gpu tbh
my XTX absolutely demolishes VR games
unironically, the 4080 and above are the only 4000 series cards they didn't fully cuck
running everything at 150% of my headset's resolution
yeh when it works it works. Linux drivers were so superior when i used amd. gained 15fps in a lot of games when i had my vega64
i haven't bought a budget card for a long while. enthusiast and prosumer options only
with Proton on Linux I can play Devil May Cry 5 @ 8k Ultra and still get 80fps
yeah proton and other options, lutris, all that, it's so good
you can get some nice fps boosts on windows too if you use dxvk wrappers on old games
there are trade offs yeh
for cyberpunk I still switch to windows
mesa right now is like 1/5th the speed
apparently if you compile it from git it's "up to" like 1/2 speed
cyberpunk's new patch is 🤌
still have to play that I need to finish other games first
SDXL's quality is certainly good, but it doesn't follow the prompt as much as T5 models do
maybe SAI'll make something that replaces CLiP? T5 isn't the solution for what they're going for
agreed. luckily with sdxl we have so many other options to control the end product other than prompts. it's a boon.
they'll probably make a new openclip model. or somehow tie one of their LLM's into openclip
what i love about stability is they are all in on researching this stuff to run on consumer hardware instead of corporate hardware
dfif in shambles
credit to others who are contributing to that effort too of course, but sai seem to be leading the pack here
dfif was always just a research model and i think they got a ton of good results from it
yeah, I bet they'll figure out eventually to make SD have a better language understanding
I think they should make a dfif2 tbh
could be their "sdxxl" for absolute highest quality at cost of all your vram
what was the prompt? I kinda wanna try
a fluffy cat on their back, playing with a computer mouse as if it was a real live mouse
a fluffy cat pouncing on a computer mouse like it was a real live mouse
the way it understands prompts is phenomenal for real
it just sucks that it needs corporate datacenter level computation
you can see in his eyes, he wants to eat
I get this. I'd say the quality of SDXL is better but the prompt following isn't there
yeah the quality of renders compares very well. especially if you're a skilled operator
honestly SDXL's quality is better imo, but idk. this is just my opinion
It's a lot different that's for sure
But once you get the hang of how it works you can definitely control it a little better
Yeah often, i prompt knowing that sdxl isn't going to get the prompt very well. i'm just throwing stuff out there to sort of nudge it towards what i want
sd1.5 did that even more so. prompt salads i use extensively on that side
Like I made this render with SDXL and there's 30 different anime in it, and I think it nailed just about all of them, through prompting alone #🎥|animation message
also the speed of SDXL kinda makes up for it, even more so when using AIT
try with something harder like a "1girl praying inside a dark temple with a golden buddha statue with 16 arms in background" this one took me like 140 images to get on sdxl
takes me less than 30s to generate an entire batch like this
I don't use things like 1girl on SDXL
that's kinda exactly what I mean
that works grat on 1.5 but certainly not XL
great
1girl is a booru porn board tag. you won't get that in sdxl until someone trains it in
it's on sd15 because novel ai did all that expensive work and everyone stole it
first try on dalle
quality isn't there though, but it followed it perfectly
wonder if you could expand the scene with outpainting, idk if sd would understand the details and add them properly
microsoft likely found a sanitized booru tag dataset to train with. likely are investing millions into data set building
4chan did that, right? people hacked NAI and leaked them
someone leaked em. i don't know who. everyone who used it after that stole it
sd15 is a poisoned well. a lot of garbage happened in its early development before the popularity kicked up
if mona lisa were a hot valley girl
Bing
Bing can make some great images at times, as Dalle3 can, but they both are so grainy if blown up to anything beyond quick viewing size
Sdxl
Wanted to give a little update on my realism LoRA progress. Here are some new examples of what it looks like now
Top left is mine, top right is RealisticVisionXLV2, bottom left is Realism Engine, and bottom right is Real Stock Photo
Current dataset is only 90 images and not trained too well. Working on the 500 images version with very meticulous tagging. Also experimenting with some new papers in training with the goal to get much higher fidelity, and much better brightness control
It's being trained to mimic the look of properly color graded professionally photographed portraits and various other image subjects
"hypertile" in recent comfyui commit 🤔
yea it's broken
supposed to help 1.5 models scale like XL does I believe
but it always errs out
ah
it tiles the first unet attention or something
so it doesn't blow up at high res
so it should make a sorta more linear XL performance curve instead of just super exponential
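Why tiling would flatten the curve: self-attention compares every token with every other token, so its cost grows with the square of the token count (quadratic rather than exponential). A back-of-the-envelope sketch, assuming a plain square grid of latent tokens split into equal tiles; this is an illustration of the idea only, not hypertile's actual code:

```python
def attention_pairs(height_tokens: int, width_tokens: int, tiles_per_side: int = 1) -> int:
    """Count token-to-token comparisons when attention runs per tile.

    Assumes the grid divides evenly into tiles_per_side x tiles_per_side tiles.
    """
    tile_h = height_tokens // tiles_per_side
    tile_w = width_tokens // tiles_per_side
    tokens_per_tile = tile_h * tile_w
    n_tiles = tiles_per_side ** 2
    return n_tiles * tokens_per_tile ** 2

base = attention_pairs(64, 64)             # smaller token grid, untiled
big_full = attention_pairs(128, 128)       # 4x the tokens, untiled
big_tiled = attention_pairs(128, 128, 2)   # 4x the tokens, split into 2x2 tiles

print(big_full // base)   # growth without tiling
print(big_tiled // base)  # growth with tiling
```

With four times the tokens, untiled attention does sixteen times the work in this model, while the 2x2 tiled version does four times, which is the "more linear" behaviour described above.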
Could be cool, does say still testing, so maybe it'll get perfected soon-ish
and in the future it'll maybe expand to other areas of the model
Great job! Your images feature more pronounced shadow details compared to the stock photo (and the other generated ones as well).
Stock photo is a SDXL realism fine tune
I specifically hand sourced and tagged the data set images for the training. My specific goal with this project is to increase the realism, diversity, and most importantly, the lighting / camera dynamic range of SDXL when doing photographic images
I can share more information about it tomorrow, I have a horrifically bad headache at the moment, and I'm off to sleep
Here's a sample from my latest LoRA.
Has very nice fine details. My LoRA wasn't trained fully enough to really pick up on fine details
Yours has a bit more of a painterly look mixed in with the realism which is a nice aesthetic
I've mixed in some post process grain and LUT adjustments but it's subtle.
Hi all, Do I use sdxl? I have Sd 1.6
Depends on the model you are using. If the file is about 6GB and has XL in the name, then most probably
She has 2 different skin colors and 3 legs, but the details are quite nice 👍
Yeah, legs got bit messed up on the gen. Skin I see what you say but not sure if it's a normal tan or not 😄
3 legs == 2 vagonyas
Normally when you have as light skin as she has I'd say it's normal to be white on the non-sun side or do you see anything else that I have missed?
her left arm is tanned like she hangs it out the window while driving
compared to her right arm and first two legs
cookie monster always scared me
do you see it?
Who in their right mind has not commented this download out of the Automatic1111 code? smh
it only does that if your models folder is empty
Hi, newbie here. quick question with regards to the image dimension when generating with sdxl checkpoint in Sd webui. Do I keep it at 512 x 768 and then upscale it by x2, or just generate it at 1024 x 1536 without upscaling?
the A1111 webui is kinda outdated
not nearly as efficient as ComfyUI
No shit. I am a hardcore comfyui fanboy. Just want to check out BLIP2 captioning.
But loading this antique checkpoint....
1024x1536 will probably work okay-ish on XL you'll just have to seedhunt a bit more.
512x768 is actually too small for XL
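One way to pick a size for a given aspect ratio: keep the total pixel count near 1024x1024, which is what the SDXL base model was mainly trained around. A small sketch; snapping both sides to multiples of 64 is an assumption made here for convenience, and the results are not claimed to match any official list of training buckets.

```python
import math

def near_megapixel_size(aspect_w: int, aspect_h: int,
                        target_pixels: int = 1024 * 1024,
                        multiple: int = 64) -> tuple:
    """Return (width, height) close to target_pixels for the given aspect ratio."""
    ratio = aspect_w / aspect_h
    width = math.sqrt(target_pixels * ratio)
    height = width / ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)

for aspect in [(1, 1), (2, 3), (16, 9)]:
    w, h = near_megapixel_size(*aspect)
    print(f"{aspect[0]}:{aspect[1]} -> {w}x{h} ({w * h / 1e6:.2f} MP)")
```

For a 2:3 portrait this lands well above 512x768 and somewhat below 1024x1536, so generating there and upscaling afterwards is a reasonable middle ground.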
Thank you
also I think SAI are working on a new encoder, I looked at things mentioned in the SDXL release and Emad said something about a future SD3.0 being entirely different
emad speculates on a lot of things. we'll only see when the time comes
Made a test training a lora on text. Getting interesting results so far.
"I served with your father in the cookie wars"
hi guys! Any model recommendation to generate landscapes?
XL model I mean
with good prompts SDXL base model is awesome
final stylization
SDXL 1.0 base is still very good guys.
@vital ermine How many Loras do you have in the works?
rad
2 and one I want to start work on in a week or so
rad
About to release this one if this training works right but buckets are not playing nice
Sorry, was a thumbnail
Sdxl -> sd1.5 dreambooth -> Pika -> After Effects
what's the minimum spec for the XL refiner?
Does anyone know a proper tutorial to download SDXL, I've tried so many and everytime get some cmd error
try invokeai, that's quite user friendly
command line terminal work isn't always intuitive. The tutorials might be telling you commands that work on their system, but not yours.
There's another app that is easier to manage this stuff with. Stability Matrix. Package manager for a lot of different UI's for you to try out. https://github.com/LykosAI/StabilityMatrix/releases/download/v2.5.5/StabilityMatrix.exe
@steady grove Will it allow me to generate AI images from prompt
matrix won't. it's an installer for various UI's like automatic1111, sd.next, Fooocus
one of those will do prompts to images if you've got the hardware for it
can you provide your prompt ?
Pos: vhs camcorder footage of bladerunner Japanese town
Neg: black and white (cartoon), 3d, render, low res, low resolution, ((text)), ((watermark)), ((logo)), tongue out, ugly, masculine, vibrant, .com, ((tanlines)), (( ososedki.com))
thank you
yuppers
I just released a small update for Searge-SDXL to version 4.3 on CivitAI and on Github that adds support for FreeU v2 in addition to FreeU v1. It also adds some more FreeU presets.
@noble shoal how does your LoRA work for making things out of text? My research group has a couple people researching text performance for SDXL, and one person who is doubting how good SDXL could ever be for text
I'd love to see what else your text LoRA can do, or even play around with it if you'd be so kind
From what I have seen so far, I'm quite impressed to say the least
Thank you. Well, my one man research group has carefully captioned 98 photos with text in it. I included {ObjectInPicture} with the text "{Text shown in the image}" on it in every caption. I might be hallucinating but i think overall text coherency improved. One or two words get usually nailed instantly. I managed up to 6 word sentences in my tests. So yeah, i guess if the dataset is captioned good enough, SDXL has no problems with text.
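The caption pattern described above is easy to generate consistently. A tiny sketch; the function name and the example object and text are made up for illustration, and only the template itself comes from the message:

```python
def text_caption(object_in_picture: str, text_shown: str) -> str:
    """Build a caption following the '{object} with the text "{text}" on it' pattern."""
    return f'{object_in_picture} with the text "{text_shown}" on it'

# Hypothetical example entry, not from the actual 98-image dataset.
print(text_caption("a wooden sign", "OPEN"))
```

Keeping the quoted text in the same position in every caption is presumably what lets the model tie the rendered letters to the quoted string.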
It's quite incredible how fast SDXL seems to pick up on concepts with around 90 images of properly tagged data
My realism LoRA is 90 images (working on a new much better 500 image version), and it makes a monumental difference compared to even the best realism finetunes out there
Mine is top left in all three
It's trained specifically for much better lighting, foreground/focus/background separation, and overall DSLR dynamic range compression
It's also trained to work with painfully simple prompts
"a portrait photograph of a black woman with blonde hair wearing a green suit at dusk in front of a shop"
Unfortunately, I'm only on my phone right now so I don't have any more examples, but I've probably tested at least 80 comparisons
Oh, maybe this makes also a difference, but i am unsure. My training images have only a resolution of max. 768x512. This allows me also to create images in this resolution and then upscale them.
thats a very interesting approach
my training images are 4k-8k+ lol
It doesn't matter much right now, but the final version of my LoRA should be able to handle absurd detail levels
8k? Wow, i don't think that this is really needed. Weirdly enough, my text lora is spitting out some very nice people too.
for the type of training I am gonna be doing, 4k is gonna be necessary. For now, they are all being downsampled to 1024x1024 and its equivalent aspect ratios
for OpenPose ControlNet SDXL ive tried a few openpose models none seem to work, no errors, the image just never comes close to it. do i need that 5gb open pose model? i tried the smaller one and no luck..
openpose controlnets are harder to train and only know the poses in their data set. that's a tricky one
thanks 😄 so im guessing that means its not quite good enough yet?
or maybe a simpler pose