#🆕|sd3
What tolerance ranges are you using? That's what truly determines the quality with that ode sampler node. Like there's a big difference between 3/4 and 4/5, or 4/5 and 5/6. Literally 10x the precision difference since it's a whole extra decimal point. But it will make compute times go up by a ton
yeah i determined the problem is that the sampler is only doing 1 step and calling it done. i don't have the workflow set up correctly somehow and just gave up. i was trying all sorts of ranges while playing
3/4 or 4/5 work well, any higher and it's a waste of electricity usually. For the adaptive ones, just set the step count to like 200 so it doesn't abort early, it acts kind of like a jeopardy music timer
Oh and bosh3 is the only one that's really worth it for SD3
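A minimal sketch of the tolerance point above, assuming "3/4" means rtol = 1e-3 and atol = 1e-4 (the chat does not spell this out). It is a toy scalar Bogacki-Shampine 3(2) integrator, not the ComfyUI node's code; the function name, `max_steps`, and `h0` are hypothetical. It shows that tightening both tolerances by two decimal places costs noticeably more steps, and that a step cap set too low makes the solver stop before reaching the end.

```python
import math

def bosh3_adaptive(f, t0, t1, y0, rtol, atol, max_steps=200, h0=0.1):
    """Integrate the scalar ODE y' = f(t, y) with an adaptive Bogacki-Shampine 3(2) pair.

    Returns (y, attempts, finished). max_steps caps attempted steps
    (accepted or rejected), playing the role of the step-count "timer".
    """
    t, y, h = t0, y0, h0
    attempts = 0
    while t < t1 and attempts < max_steps:
        last = h >= t1 - t            # clamp the final step so we land exactly on t1
        if last:
            h = t1 - t
        attempts += 1
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + 3 * h / 4, y + 3 * h / 4 * k2)
        y3 = y + h * (2 / 9 * k1 + 1 / 3 * k2 + 4 / 9 * k3)               # 3rd-order result
        k4 = f(t + h, y3)
        y2 = y + h * (7 / 24 * k1 + 1 / 4 * k2 + 1 / 3 * k3 + 1 / 8 * k4)  # 2nd-order result
        err = abs(y3 - y2)                                # local error estimate
        tol = atol + rtol * max(abs(y), abs(y3))
        if err <= tol:                                    # accept the step
            t = t1 if last else t + h
            y = y3
        # Standard step-size controller with a safety factor and growth limits.
        factor = 0.9 * (tol / err) ** (1.0 / 3.0) if err > 0 else 5.0
        h *= min(5.0, max(0.2, factor))
    return y, attempts, t >= t1

def decay(t, y):
    """Toy test problem y' = -y, exact solution y(t) = exp(-t)."""
    return -y

if __name__ == "__main__":
    exact = math.exp(-2.0)
    for rtol, atol in [(1e-3, 1e-4), (1e-4, 1e-5), (1e-5, 1e-6)]:
        y, attempts, finished = bosh3_adaptive(decay, 0.0, 2.0, 1.0, rtol, atol)
        print(rtol, atol, attempts, finished, abs(y - exact))
```

Calling it with `max_steps=1` returns `finished == False`, which is the "1 step and calling it done" symptom in miniature.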
think anyone else saw the face
I'm finding auraflow to be more prompt following than sd3 2b at this point. Sometimes I get more elements on sd3 but It feels sdxl'ish to be combining subject details on sd3 like that. The prompt: A chubby, bearded man with a shocked expression looks up at the ceiling, his mouth wide open. Colorful, fluffy creatures with angry faces pour out of a hole in the ceiling above him. The man's face is wrinkled and red, with sweat beading on his forehead. Bright, twinkling Christmas lights swirl around the room, casting a magical glow. A huge robot reindeer towers over a tiny Santa Claus, who gazes up at it in wonder. Snowflakes dance in the air, creating a whimsical atmosphere. The image is incredibly detailed, like a high-quality photograph taken with an expensive camera.
maybe
was it adetailer or just a random gen lmao
like 20th gen, i do 4 at a time, its Bill Cosby if its not clear enough
yay christmas morning! i'm prompting for nintendo 64s and i'm getting all these jank looking super nintendos though.
you know, sometimes i think sd3 has no idea what an n64 is
me thinks lykon was a sega kid and is keeping the ancient console wars alive. kind of sus
yeah it has a pretty good idea of what a sega dreamcast is but no idea what the n64 is. right okay. that's not sus at all.
Is there a way to train a lora with sd3 as the base model?
onetrainer sd3 dev branch, or kohya scripts sd3 dev branch
THIS IS PRETTY SUS THO. N64 DESERVES GREATER REPRESENTATION
i didn't prompt no gamecube but thats what its tryin to do clearly. that aint no n64. this safety training is bohshit
closest i could get but they erased the nintendonger
/help
/balls

heh
why does this remind me of a game controller?
unsettling
I really wonder why SAI didn't even try to blur just a little the Midjourney-based dataset images, the artifacts in SD3 medium are so annoying...
Changing dress, please wait 2 weeks for it to be done.
Different poses
the image I made before was heavily stylized so its personal taste really
here is an image that is just photographic with no stylization:
that is at 300 steps of order 4 implicit adams
for comparison here is the same seed with 20 steps of order 1 euler
but there is much more potential in samplers
Dr Head's node is awesome but
the implicit adams is only order 4
and it is fixed step not variable
higher order variable step implicit solvers would be a big quality increase over this
explicit solvers (euler, heun, bosh, RK4, fehlberg, dopri etc) won't work well with stable diffusion at any order higher than 2, or maybe 3 at best
due to stiffness
but implicit solvers allow us to push the order higher
🙂
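A toy illustration of what "stiffness" does to explicit versus implicit solvers. It uses the linear test equation y' = lam * y, not a diffusion model, and swaps in first-order forward and backward Euler for brevity rather than the order-4 implicit Adams mentioned above. All names are hypothetical. With lam = -50 and step size 0.1 the true solution decays to almost zero, the explicit update multiplies by (1 + h * lam) = -4 each step and blows up, while the implicit update divides by (1 - h * lam) = 6 and stays stable.

```python
def explicit_euler(lam, y0, h, steps):
    """Forward Euler on y' = lam * y: y_next = y + h * lam * y."""
    y = y0
    for _ in range(steps):
        y = y + h * lam * y
    return y

def implicit_euler(lam, y0, h, steps):
    """Backward Euler on y' = lam * y: solve y_next = y + h * lam * y_next."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 - h * lam)
    return y

if __name__ == "__main__":
    lam, h, steps = -50.0, 0.1, 20
    print("explicit:", explicit_euler(lam, 1.0, h, steps))
    print("implicit:", implicit_euler(lam, 1.0, h, steps))
```

Whether Stable Diffusion's sampling ODE is stiff enough for this to matter is the claim made in the chat; the sketch only shows the mechanism.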
Please who can help me put this in a space background and put on the monkey with a space helmet
Any good SD3 fine-tunes already out?
yeah there is one that makes balls
its awesome
its only lora though
no big checkpoints yet
Any news on when Civitai will allow SD3 again? Since we got the updated licence.
yeah on another discord, civit staff said that they are going to read the new contract this week
IDK what civit are doing
it never made any sense to me
I think I misunderstood what civit actually is
I thought it was just a model hosting site but maybe they are trying to be something else
like an open source software foundation or something
I read that they've been in communication with SAI about a commercial license. My understanding is that once you make over $1mil, then you work with SAI to make a new bespoke license agreement. That's probably the "new contract" civit staff is reading, not the one that was released to the public weeks ago 😛
Just love how colorful and fun this is
I like the cotton candy cloud on the right
Me too!
SD3 colours are so good

Lol no. They're a venture capitalist start up that hasn't cracked profitability yet. Model hosting is just their hook
They're not a non profit like the FSF
so, sd3 was made from screenshots? this is ONE image generated with clean sd3 medium...
yeah. it's a case by case basis with the contract worked out to be what you actually need for your business, not just a cookie cutter that everyone tries to fit into
hey guys, any hints or tips if i would like to create img2img in the style of marco grassi? I've trained a model with several pics of his work, and i would like to transfer that style to some pics of my family. Tried ip adapter and t2i style, but not successful... any tip is highly appreciated. Im using A1111 btw!
good morning ballshine
that would be the Phoenix sun
Credits to Adult Swim
Rick and Morty Merch: https://link-to.net/108299/rickandmortymerch
how come civitai isn't hosting sd3 models?
they're banned until civit can negotiate commercial terms with stability
skill issue
They said it was because they didn't want creators to suffer and stability could require all models be destroyed. So civit deleted everything that had been uploaded and indefinitely banned sd3 content to protect creators.
|| it was really about protecting their business interests and they were only spinning it as protecting the community ||
they deliberately - like most other people - read the license wrong. they want to create their own foundation model anyway
Caus they're cheap
So I've been merging models for fun and experimentation. If you add SD3 into the mix at all, there is always missing || nipples || and 🍆 lolol
In some cases the images are turning out better, and in other cases they are worse. I guess some older checkpoints merge better with SD3 than others.
As they should, why wouldn’t they protect their business interests if it also protects the community
Civitai has made some terrible decisions lately but that one seems like a win-win tbh
SD3 image generator, to SDXL face refiner, to Reactor Faceswap workflow
it's the fact that they lied about what they're really doing, mostly.
everyone lies even the napster ceo lied
How are people not sharing for commercial reasons, helped out?
Civit have never demonstrated a capacity of dealing with issues maturely. This is more of the same. Nobody was at threat of needing to destroy models and a judge would've never held that up if it was even attempted at being enforced. The whole reasoning is a joke. They're likely using it more as a negotiation chip than any kind of good will.
Ulterior motives always
Telling the community they're banning all models until stability gives them favorable commercial terms doesn't communicate that well though. Trust that businesses are never honest. Especially ones that are still in the red and spending money to establish themselves.
not all businesses - just the ones that have a track record of their CEO ....ing over his friends and business partners
wasnt that what happened with facebook/napster
napster they got hit with the riaa. after they were acquired by microsoft, they actually did a cool spotify style service which was way ahead of the times
my bad wasn't microsoft that got them. roxio. but they did a streaming subscription thing it was handy dandy
at facebook i think when they restructured the company they fucked over joseph gordon levitt and he didn't play his cards right. cut throat but yeah thats how its played. schmiddy from Google fame, isn't exactly a 'nice' guy either.
wait mb. it was another guy who played that character in the social network. the amazing spider guy
im not sure, but i think people are waiting for the improved version to come out before training
2 weeks
Oh god, not this again lol
just feed her catnip, she'll be fine
death to the water! its time humanity take back the earth
@naive sparrow
might go back to SDXL for a bit to get the stronger version of CFG++
not sure if it will be possible to port it fully to SD3
at the moment only "alternate" mode works in the comfy node for SD3, which lets you halve your CFG
but the original full version lets you set your CFG to like 0.6
guess that would make sense
lets hope its slightly less censored to shit
hmm never mind the full CFG++ at 0.6 CFG wasn't that much better lol


Where do I find trained sd3 models
The base model is on huggingface of course, if you mean finetuned by the community, civit is still not hosting yet to my knowledge, and the early ones weren't really that mind blowing anyway. Gotta be patient
you don't, they don't exist yet
Failed to fetch response from Ollama.
How about sd3.1? What's the release schedule?
Is that ai?
ya was error message, it failed to load my LLM 😆
Man, I lost so many Balls (TM) while I was offline
Biggest BALL
uh oh, SD3 just popped out topless with appropriate bits unprompted, what do i do, email it to trust and safety
Oh man, you'll have to destroy that model for it has been disentangled by your PC
SAI will persecute you
Or prosecute, whatever
maybe both
2 weeks I think
Man, in two weeks I'll be billionaire
What is something that is SFW that SD3 does NOT know, that Pony and/or SDXL knows? I'm testing my model merges.
Yall having luck with aesthetics on SD3? Its learning my subjects but it doesnt look good.
Hopefully it's a 4B model as emotional compensation
the ole girl grass meme
SD3 has easily the best aesthetics of any model out there
the problem is structure (and also subject knowledge)
the 16-channel VAE is currently unmatched
I think midjourney is still the best model in the world overall because it has good structure and aesthetics and also good subject knowledge (often training in a legal grey area on hollywood movies)
but SD3 can produce better quality because of VAE
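A rough back-of-the-envelope sketch of why the 16-channel VAE can hold more detail. It assumes the usual 8x spatial downsampling and compares against the 4 latent channels of SD1.5/SDXL; these numbers are not stated in the chat, and the helper names are hypothetical.

```python
def latent_shape(height, width, channels, downsample=8):
    """Shape (C, H, W) of the latent for an RGB image of the given size."""
    return (channels, height // downsample, width // downsample)

def compression_ratio(height, width, channels, downsample=8):
    """How many RGB values are squeezed into each latent value."""
    c, h, w = latent_shape(height, width, channels, downsample)
    return (height * width * 3) / (c * h * w)

if __name__ == "__main__":
    for name, channels in [("4-channel VAE", 4), ("16-channel VAE", 16)]:
        print(name, latent_shape(1024, 1024, channels),
              compression_ratio(1024, 1024, channels))
```

For a 1024x1024 image the 4-channel latent packs 48 pixel values into each latent value, while the 16-channel latent packs only 12, so the decoder has four times as many numbers to reconstruct fine detail from.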
unets feel so old and sloppy since stability dropped mmdit on us.
looking forward to the future
i don't use them but from what ive seen on a different server Ideogram looks better
Ideogram is also very strong
lots seem to think that if we just get a 16 channel vae into sdxl or sd15, then it'll all be good. but those are still unet models that barely understand text comprehension
I actually think Kolors looks very good a lot of the time, but its not a generalist model so its not actually that useable
does ideogram run locally yet? i've all but forgotten about it
no its like midjourney sadly
yeah it's entirely irrelevant to me if i cant use the weights. the architecture is meaningless
yeah its sad
just another diffusion model you cant really play with
Lykon said that Unet won't even scale to 8B
so the big model wouldn't have been possible without the DiTs
looks nice though
kolors is a unet . i like what they're doing there, but they're really late to the game to try to get a new unet model started 2 years after stable diffusion 1.
the sad part is Auraflow didn't get the 16 channel VAE
auraflow is v0.1 and the current release is just to kickstart community engagement. i doubt they'll keep it to a 16gb requirement too. likely will release smaller versions and then versions with different finalization steps. vae or otherwise
i haven't even bothered to load auraflow because it's a 16gb file and i've only got 16gb vram
Comfyui model merging with SD3 plus anything else just basically leaves you with SD3 only. There was a very very slight colour vibrancy increase when I merged it with Juggernaut, but that's about it.
Now I wonder if my SDXL plus SD1.5 merges actually worked or not 😦
I'm getting somewhat nice results using big tiger gemma 27B as a prompt enhancer
somehow it seems to make these pics a bit better
Oh rly good, is that Ultra? wats the prompt
any ETA for SD3 2B refreshed version?
Reportedly: "When donkeys will fly"
If it was actually meant to be a beta version, they would have kept on training that model even before the release
i await the surprise 4B release
this is so nonsense
I await Stable Cascade 2.0 LOL
I have the feeling I had this discussion here so many times and it's useless to repeat it over and over
but the unet architecture is not in any way worse than a dit architecture
just because it uses a unet does not mean it is not a transformer
same way all dit architectures use some convolutional operations under the hood, too
If only they didn't can it
it's not a question about unet or dit or mdit
it's often rather the data and data annotation that makes a difference. And also the stronger vae that allows for more details
SD3m with claude 3.5 prompter
not quite true. a DiT architecture is just better suited to the task. it was engineered for the purpose. Unet was just what was available back then and they prototyped on it.
and a better text encoder for sure
if you're going to open with "this is nonsense" then we'll never have an honest discussion on the matter. i give up immediately
lucky you! https://jingjingrenabc.github.io/ultrapixel/ though think of it as cascade++
before you reply, just know that you win. pat yourself on the back
dit is just a transformer architecture on image patches, what's the difference to the unet?
wurstchen 4.0
Ah right, I forgot about this... wonder when they actually release the thing tho
the only significant difference is that unets are not using positional embeddings. You could add them, though
you already won chill
it's released already
Woah 2 days ago
there was even a paper that compared different architectures and found the sdxl unet architecture outperforming the pixart dit architecture when trained on the same data
tbf pixart isn't that great and doesn't have the same structure as an mmdit
which also doesn't necessarily mean unet is better.
unet is the more complicated architecture. PixArt with its dit architecture showed you can reach the same results with a simpler architecture
Did anyone else peer review it to verify the claim and ensure there wasn't cherry-picking or bias?
Exactly... Claims are cheap and it's super easy to bias and cherry pick
unet has been worked with longer. more complicated means less easy to improve upon. simpler is usually a better design in many cases. and it shows in colors with sd3
I don't really believe that mmdit is parameter efficient. But that's my subjective opinion. There is no good evaluation yet
In no way a pear reviewed this paper
there's lots of results coming out. auraflow for instance. there might not be any great finalized review yet, but there are MANY indications
we need a pear to review this math function to see if it works
the paper evaluated several architectures, not their own one. So there is no reason to cherry pick
why am i bothering though. we've already agreed at the beginning that you won
Pear reviewed fruit juice (made from pears, for pears, reviewed by pears)
but is the math real ?
yeah, future will show. The architecture of mmdit is unnecessarily complicated in my opinion. You are right that simple is usually better. So I would go for the simple dit architecture of PixArt and improve on that
we've seen that recently. tyson publicly peer reviewing 1x1 = 2 or square root of 2 is 1. was it necessary? maybe not widely. terrance needed to hear it tho
Math was done by pears, so hell yea brother
Reportedly 1 pear + 1 pear equals 2 pears (but this is not 100% pear reviewed yet)
auraflow simplified stability's implementation of mmdit. it has a lot of potential and i'm sure it has caught stabilty's attention
you can destroy numbers by taking your numbers individually and multiplying (1 x 1 = 1) till you are out of numbers
depends on how you define the unit of a pear. is a 100gram pear worth 1 ? while a 110gram pear worth 1.1 ?
further. the spelling of pear always fucks with me. peer? pear? pare?
That's why to some degree 1 pear + 1 apple might or might not equal to 2 pears
need to pull out the calculus to get statistical probabilities of that apple possibly being a pear
I'd call it Pearbability
schrodinger pear
sounds naughty
Sorry brothers, I'm impeared
Also, let's talk about the fact that you can impeach a president and not impear them
i saw that debate and there was some impearment
I can't stand this, it's unpearable
well i saw the highlights. i'm canadian so i don't really pear
I'm italian, I pear even less
yeah pearing near the border of the pearnited states has me pearing a little bit
appearently
i pear for the economy sometimes. it's already all peared up. now our largest pearing partners are peared
imagine if over on your side of the pear, pearis started the revolution up again. one might pear that could mess your pearconomy up too
Y'all gotta fuel them cars with spear pear juice bought from Pearmany in the Pearope
I reckon Pearmany is a leading expearter of pears
theres always pearaguay but that might be less available in the eu
I think we should stop if we don't want to be peared... uh banned
i'm running out of bad pear puns anyways
Hahahah
winners imo are impeared, appearently, pearis and pearaguay. thats my pear review
new models always look better. I doubt, though, that it has much to do with the model architecture itself. SD 1.5 was trained on LAION and when you look at the data it's no surprise the model looks so bad. The most efficient way to train models nowadays is to train on high quality artificial Midjourney, Dall-E or Ideogram data. Also auraflow was trained on ideogram data like crazy. The availability of more and more high quality synthetic data is the driver for better open models
I wish there were more evaluations where people really test different architectures on exactly the same data to see which method works best
but such evaluations are rare. No surprise, they are basically burnt money
I think we can agree that you need something more powerful than clip to get better prompt understanding. But do you really need mmdit? Did anyone ever compare mmdit against cross attention and find it superior?
pear pressure
dit models are pretty nice but whats more important is probably better training data, better text encoders, and better vaes
i'm not sure why that would be more objective of a measure. different architectures might prefer different data. captioning may work differently on models with t5 instead of clip vit. there are plenty of considerations. ultimately i think testing the best product of one against the best product of another is how to contend them. and even then, objectivity will be hard to maintain
it's kind of a whole system. the architecture it's all built upon certainly matters. like an ICE built around carburetors, fuel injection, rotary, diesel or whatever. the steel it's all built with matters too. the driver is especially important. many factors working together.
Kolors is a good showcase of SDXL with better data. there's an improvement but nothing that breaks it out of its mold
soon we'll get better vaes adapted to work with sd15 and sdxl. i don't think we'll break the mold there either. it'll still mostly be the same
mmdit seems better for prompt following but worse for image quality i guess?
next-dit (from lumina) seems to be a middle ground
plain dit seems to be somewhat a middle ground too but slightly worse img quality
unet seems to be also a middle ground but slightly worse prompt following
Ella adapters certainly haven't taken off because they actually don't offer much improvement
I don't think it has much to do with the architecture. prompt following comes from a good text encoder and good captioning of your data
maybe we'll need to wait for some models to show up refined with ella adapters
only sd15 has 1 ella adapter and that one was kinda ok but nothing too good
it definitely was a lot better than sd15's original prompt following but not much better than sdxl's
there is not much technical reason why a dit should be better at prompt following. From what? I mean, nobody really understands at all how these transformers work but I wouldn't overinterpret anything here
ella adapters are just adapters after all. They fix the text encoder, but if your model itself does not have great understanding it's hard to fix that afterwards
newer models are all trained on synthetic captions. That together with better text encoder gives them the better prompt understanding
yeah i guess i was kinda wrong on that, its probably more with data but auraflow is really good at prompt following(maybe best?) and text rendering. However, it only has a pretty small text encoder and only 1(t5 xl). Maybe its just better dataset
train sdxl on synthetic captions and t5 and you will probably get a model as good
you only need one text encoder
I don't know if mmdit has better text understanding. Could be.
Oh yeah a image i got from a nice lora called 'better lora' from sdxl, improves text rendering a lot
so good text rendering might not be architecture specific
but shrek is a known word
if you use words that are not known you will get in trouble with clip
I mean it works to a certain extent
but t5 seems to be really the better choice when you want text and prompt understanding
you're not considering the MM in MMDiT. multi modal. it has blocks exclusively for text comprehension
anyways, I just think we shouldn't overemphasize some of the architectural differences. dit or unet - the difference between both is not that huge
the multimodal in mmdit is a joke
there is not much multimodal in there
That's a big technical reason why the architecture is more suited towards prompting images
not too bad
prompt: a image of a man holding a sign saying 'xfnk',
oh.... so the technical paper is just kidding. got it
they project text and image patches into the same latent space and apply self attention on them.
Before that they projected them into different latent spaces and connected them via cross attention
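A bare-bones sketch of the two wiring styles described in the last two messages, in plain Python with tiny hand-written vectors. It deliberately omits the learned query/key/value projections, multiple heads, and everything else a real block has, so it only shows who attends to whom; the function names are hypothetical and this is not the SD3 implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def cross_attention(image_tokens, text_tokens):
    """Older style: image tokens query the text; the text tokens are never updated."""
    return attend(image_tokens, text_tokens, text_tokens)

def joint_attention(image_tokens, text_tokens):
    """Joint style: one self-attention over the concatenated text + image sequence,
    so both streams are updated and each can see the other."""
    seq = text_tokens + image_tokens
    out = attend(seq, seq, seq)
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:]

if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    image = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
    print(cross_attention(image, text))
    print(joint_attention(image, text))
```

The practical difference the sketch makes visible is that `joint_attention` returns updated text tokens as well as updated image tokens, while `cross_attention` only ever changes the image side.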
it's a company. They want to advertise. Don't take everything they claim seriously
I would say the biggest difference between the dit and mmdit architecture is that the text is processed and interleaved with the timesteps, too, while in the dit and unet architecture the text was frozen
seems like ad hominem and not a real reason. You said there's nothing different that would imply better text comprehension. now when i point out direct evidence for it, you insist that they just made that up in an academic research paper, for marketing purposes
this makes the model much more computationally expensive, but it could allow the model to adapt/connect the text understanding to the image it processes
https://arxiv.org/pdf/2403.03206 more people need to read the paper before they talk
🤦♂️
they call the model multimodal but it's just a model with two domains: text and image
SD1.5 is also using two domains, text and image
so why is sd 1.5 not a multimodal model?
no one said it isn't...
because they named it so. It's a marketing thing. You need to name your new architecture somehow
Someone speaks Spanish
it's named such because the transformer blocks themselves include two networks for different modalities.
it doesn't mean no other text to image model is multimodal
calling it multimodal dit sounds for sure better than calling it Würstchen
wurstchen is a unet
😬
I've been getting great results with Florence-2 captions and SDXL recently for Loras. I find they're far more "human aligned" than the ridiculous thesaurusmaxxed GPT-4 descriptions of images.
yeah, and it's super fast
you've had a track record of calling factual information "a joke" or "wrong" when it doesn't suit your understanding, so i'm going to name this face Mr Kreuger.
oh yeah, I'm so sry that I called it a joke when I should rather have written "it's a name they used for marketing reasons, because obviously all other models are also multimodal"
good that I added an explanation afterwards
i dont see it on marketing material though. it just seems like what they named the architecture. it's a bad hot take. you also shouldn't apologize where you don't actually mean anything by it. it dilutes the potential for sincerity
https://stability.ai/news/stability-ai-secures-significant-new-investment their latest marketing video for sd3 is here. nothing about mmdit or multi modal. i can't find it anywhere in their marketing material. it's only regarding the individual transformer blocks of the architecture. Sort of like Unet is named that way because it's sort of shaped like a U.
the marketing reasoning just seems made up and fraught with bias
right, I'm not sorry. I think I just cannot discuss with you because you seem to just react on single trigger words without trying to understand what I wrote or even put them into context. Same way you seem to just post trigger words yourself. Like you write mmdit is multimodal, so that is a proof that it's better in text understanding. No, this is not a proof. It's just a word (and you seem to put just way too much meaning into single words). I talked about what in my opinion is the difference between mmdit and dit and why that could mean it has a better text understanding but there is lack of any evaluation on it.
As said, I try to add context and explain my opinion as well as possible but it seems that we both just communicate in very different ways
so let's just stop that, it leads nowhere
" Like you write mmdit is multimodal, so that is a proof that it's better in text understanding." I believe the reasoning was elaborated much differently than that
hyperbole is better suited to poetry and literature. i never said anything of proof. i said words like "evidence", "seems to imply", and "indicate". Where i questioned you was your insistence that there is zero technical reasoning for thinking it could be better.
#💬|general-chat message here's the last time i used the word proof on this server and i was wrong then. so i am reserved with such language now. being wrong is a great opportunity to learn.
Damn, you guys are still going hard at it
i wouldn't say going hard. i'm not sure what his goal with all the misinformation is though. legitimate research and a legitimate architecture being boiled down to a marketing stunt, honestly has me confounded. I'm not sure how anyone seeking the truth could arrive at that conclusion without having ulterior motives.
there's a reason why the researchers behind auraflow are using the mmdit approach as well. It's a strong architecture. Calling it a marketing stunt is like calling model view controller programming a marketing stunt.
I found florence2 as an amazing automatic prompter (needs to be given an image)
gemma 27B is probably amazing for this yeah
its gets tiring writing image prompts, I quite like LLMs for this task
ironically you could get a second T5 to write the prompt for the SD3 T5
what makes you think he has a goal beyond just talking?
actually that's a big disadvantage of many models:/ they are so trained on synthetic captions that they cannot really deal anymore with short human captions. It's a bit weird that you need to feed your prompt into an llm to get better outcome 😬
something that didn't get mentioned that I am pretty excited about is the rectified flow part of SD3
it makes the paths more straight
when the paths get wiggly we need to either use the slow solvers or lose accuracy
and the SD3 paper said that the bigger SD3 models have even straighter paths
if you look at how bad euler images are compared to DPM++ 2m images
that difference is entirely down to wiggly paths
if the paths were perfectly straight then an euler image and a DPM++ 2m image would look the same
it would also only take one step!
we will never get to that level but rectified flow is a huge step in that direction, because it directly trains on a straight path objective
I heard from other people that rectified flow is more vulnerable to overfitting, though. It memorizes images faster. Dunno how much this is true, though
sd3 despite all its failures sometimes produces extremely sharp and detailed images. I would give the better vae the credit, but it could also be the diffusion method used, who knows
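A toy sketch of the straight-path argument above, using a scalar ODE instead of a real latent. With a constant velocity (a perfectly straight path) a single Euler step already lands on the exact endpoint, while a velocity that changes along the way (a "wiggly" path) needs many steps to get close. The velocity functions are made up for illustration and have nothing to do with SD3's learned field.

```python
import math

def euler(velocity, x0, steps, t0=0.0, t1=1.0):
    """Fixed-step Euler integration of x' = velocity(t, x) from t0 to t1."""
    x, t = x0, t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        x = x + h * velocity(t, x)
        t += h
    return x

def straight(t, x):
    """Constant velocity: the path is a straight line, exact endpoint x0 + 3."""
    return 3.0

def curved(t, x):
    """Velocity depends on position: the path bends, exact endpoint x0 * e."""
    return x

if __name__ == "__main__":
    print("straight, 1 step:  ", euler(straight, 1.0, 1))
    print("straight, 20 steps:", euler(straight, 1.0, 20))
    print("curved, 1 step:    ", euler(curved, 1.0, 1), "exact:", math.e)
    print("curved, 100 steps: ", euler(curved, 1.0, 100), "exact:", math.e)
```

On the straight path the step count makes no difference, which is why a higher-order solver would have nothing to add there; on the curved path the one-step result is far from the exact value and only improves as the step count grows.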
i thought rectified flow lowers cfg effect and thus negative prompting hardly has effect
I would also give the vae more credit than rectified flow, but I think it also helps
i dunno what all this mumbo mumbo is but i generated a cat with neg prompt only
I've heard this a few times but I get good results when using negative prompts. Tested with colors.
I think the meta negative habits like "bad hands" don't work as they do in refined models, so people are reverse reasoning their way around that
not a good idea, t5 models are way worse than modern llms and multimodal models
unless you train it a LOT
yeah BERT-likes and T5 have to be fine-tuned for each specific task, I agree
they are very parameter efficient once you have a good fine tune locked in, that is not too underfit or overfit
but I wouldn't want to do zero shot or few shot on them
There's long clip that I want to see some research done with. Same old clip models with larger context iirc
would be cool to see more done about improving clip models yeah
if I remember rightly long clip only benchmarked slightly better on most tasks
but on one task it was much better
yeah I think I saw it on reddit
I might be getting it confused with a different clip
I had a go with SAG, PAG and Free-u today in SDXL
but it doesn't help that much for my type of image I think
because I do very high steps but very low CFG
Align Your Steps or Karras scheduler also doesn't help as much if your steps are very high
its a bit personal taste though, I could definitely see PAG/SAG making changes I just wasn't sure if they were an improvement or just different
SAG did definitely improve concept bleeding
Yeah the concept bleeding. So much of it on a unet since the cross attention is so limited
its so bad yeah
I'll have to play with sag
Even cascade handles it better than sdxl but its there
I like to prompt colours a lot and it keeps adding the wall colour to their skin
Using an actual color word is very heavy. I almost rely on regional prompting when I want to direct color
I never liked cascade
my opinion is a bit unpopular there
but I don't like the side effects of the compressed latent
it affects the structure
regional prompting sounds good I haven't tried that yet
Yeah. I mean ... I don't use it often.. but its there to consider. There's a few problems with it but I cheer anyone finding use from it on
clownshark gets the best images out of anyone I have seen and he uses it
so I gotta respect that
Have you toyed with omost? A phi-3 directed regional prompting system for sdxl
not yet, aside from the demo, but that looks amazing yeah
Thought it used an LLM but from what I learned I guess phi3 isn't an llm
I was saying on the Rundiffusion server that attention maps is an area that hasn't been explored enough
there was a paper about attention map injections where they found a lot of potential improvements to explore
its hard because you lose image quality from doing this
a bit like how we lose image quality with CFG
Worrcerestffs sauce. Spelling checks out
whats this here sauce!
my new project is to forget about fancy solvers and just work on clownshark-style noise injection
it is
and just solve with DPM++ 2M because its good enough mostly
this noise injection video is amazing its a huge detail boost
https://www.youtube.com/watch?v=59-3RZknRgk
you do lose some convergence speed though
all things i've read about it is that it is just a regular language model and isn't trained on a large language set
some people call Phi 3 a SLM
for small language model
but I think its fine to call it LLM
I call BERT an LLM sometimes
llm is just a name, it means large language model but even models sub 1 billion parameters are called llms. Anything that basically generates text in an autoregressive manner can be called an llm honestly.
phi3 can be very much called an llm since its trained on a very large dataset and can be considered "large"
I think people just wanted to differentiate it from like 70B+ models
but it gets quite unclear
yeah LLM seems to be the generic term now. the vernacular landscape is still being developed. microsoft calls it an slm
Oh I agree, mmdit is the better way forward. It will scale better as models get larger. Unet is still good, but it's a very old architecture by comparison. Wanna say it was originally created for the biomedical world, for segmenting medical images. In another year or two, wouldn't surprise me to see mamba based diffusers or something along those lines. The field changes a lot, but doesn't immediately make prior tech obsolete, they just typically end up being less efficient. Mmdit technically still has a lot of wiggle room for efficiency as well, as the aura creator brings up with his model
there's a prototype mamba diffusion model paper already out, its really cool 🙂
they didn't replace all the attention layers, just some of them though
When pear-based diffusion models?
i just think unet models are dusty and are not the way forward. while still fun to play with too, its just i can feel how dated they are now.
I was driving a car from 1986 until a couple years ago. carburetor go burrrrrr but i tell you, it was a piece of shit car. not a good car no no no. its a piece of sh carrr 🎶
sold it to a guy who used it in a total destruction back yard derby for $100
had it's uses but, old and dusty and not modern
funny you brought that up, i actually almost used that as an example lol
old cars are fun af. you get a model that doesn't have anti lock brakes ? sooo much fun in snow
apparently not going to be a thing sadly
or just the rain or whatever. hard to get all sideways with ABS
I hope at least for a balls-based diffusion model
have never, ever, had that problem - with any models.
i used to be a dumbass that was into drifting... yeah, i know it all too well lol... but yeah, the whole field is a giant min/max game. reminds me of a rover i had to design ages ago where i ran into the engineering hell loop of min/maxing. it started with a goal of a rough frame size, which had to be made out of welded steel, but then based on that, it has a weight, so you need motors that can move the weight, which means you need a battery of a certain size to operate for X amount of time. so then you find that the frame isn't sturdy enough for the 25lb battery, so you need a sturdier frame, which is heavier, which means you need bigger motors, which means you need a bigger battery and so on lol. models are a lot like that as well where you're juggling vram size, parameters, training time, inference time, perplexity and so on
The question is: when will Stable Diffusion models learn to output face expression more complex than 'smiling' or 'sad'? Is it too advanced maths right now? Do we have the technology?
make better datasets
they only comprehend what they are captioned with
I think SAI developers were themselves not trained with a dataset that lets them understand this
its an issue with sd3 and auraflow(if you dont add any detail).
its more towards humans
Current SAI datasets 'the picture shows a person, but now let's talk about how many leaves are there in that particular tree over here'
vision models are not that great yet
high quality captioners are slow, they rely on things like cogvlm that are fast as hell, but definitely not perfect
florence2 is a reaaaaaaaaaaaaaaaaaly good one though
florence2 prompts better than I do LOL
i don't think the model is the reason you can't get other expressions
ok but at lower CFG
florence2 large chugs out "more detailed captions" in like 1 second per image on my pc with an old 2080 in it and it's accurate a good 95% of the time
'angry' is a basic face expression that was handled even by SD1.4
okay so sad, angry, smiling, what other expression are you wanting? resting 'b' face?
cogvlm is not fast at all, its like 17b parameters. Unless they built some optimized library which they could have. Cogvlm produces very detailed and high quality outputs but its a bit too much unnecessary detail.
Current sota multimodal is internvl, thats comparable to gpt4o in vision.
try winking, smirking, pensive, disgusted... to say some
there are some concepts where the model requires more CFG than others
expressions tend to be further along the scale where they need a decent CFG injection
it is fast compared to vlms like llava 1.6, which are miles better at captioning, but exponentially slower in inference time per image
and 17b parameters is nothing, they caption hundreds of thousands of images per day on their servers with thousands of A100s
cogvlm or florence2? and again cogvlm is a lot better than llava 1.6, and other llava models.
florence2 is also better at image captioning than llava models but florence can't do vqa.
find me some images for those on google that shows me what you have in mind
shocked
florence2 is like going from llama2 to llama3. llama3 8b > llama2 70b.
That's a decent one, but why do you need image references for common use words?
laughing hysterically
it's not completely perfect for say a company captioning 100mil images when they have 20million dollar datacenters, but for small scale use, it's a beast and the fact that it's so tiny is mind blowing
cogvlm (and cogvlm2) is still better than florence2 and most llava models, its just vram intensive and slow for common gpus.
florence2 is great since its very small and fast and takes little vram but its not state of the art
That's laughing but definitely not hysterically
cause those are very generic terms and can mean a range of facial expressions. and i can't read your mind to know if what YOU consider a smirk is what I consider a smirk
yeah only the bigger llava models like the Yi 34B one can match cogvlm 1
it is in MY opinion
it looks like a big laugh to me
and? is that what you think laughing hysterically looks like?
llava models are meh since they use a much smaller clip model than cogvlms, the new ones are a bit interesting tho, they probably are similar to cogvlm level.
yeah llava method is a bit old now they just keep strapping bigger LLMs on
which will probably work for a bit longer
i use llama3-llava-next-8b a lot and it's pretty good most of the time, but since florence2 came out, F wasting the inference time
Shouldn't the mouth be slightly more open?
https://thumbs.dreamstime.com/z/hysterical-laughter-businessman-hysterical-laughter-businessman-man-business-suit-laughing-raised-hands-over-white-127027304.jpg
i notice on that page that there are a whole lot of different ways people laugh hysterically. one person has their head thrown back mouth open, another has their face buried in their arm, etc - the problem isn't that the model doesn't know how to create the expression - its that you're not telling it the specifics of what you want the expression to look like. you're expecting it to read your mind
the best multimodal is definitely internvl2, quality is actually comparable to gpt4o and gpt4v.
internvl can even do grounding, video, multi-image. new images should 100% be captioned using internvl2
if you want him laughing with his mouth wide open, say so
ah yeah I haven't tried internvl2 yet but I saw it on open_vlm_leaderboard
yeah, but at what efficiency in terms of "caption quality per gb vram per unit of time"
of course big ass 2000000b models are going to be really good, but lets talk consumer level gpu captioners
ok florence2 and moondream2 are the best then(the good part about moondream is that you can vqa as well), both are roughly same speed but moondream2 is slightly bigger and similar vram usage
Alright dude, if you say so
painful grimace
That's a good one
here's your average PC based on steam hardware survey (probably at least 100m samples)
so yeah, i like to talk more about the models we can actually run at home without $10k pcs
i just use kaggle since it gives 2 t4(combined 28gb vram) for free lol, my home gpu has 4gb vram
Just put in my 3090 
this one is really really good
has less CFG burn also
a man laughing hard, head thrown back, mouth open, hands in the air, excited (using SD3 2B for all of these)
3090 is a bit faster tho, t4 is pretty old but runs still fine enough
His hands about tree fiddy
he was trained on the aliens from total recall
i didn't run them through the hand refiner
(we were discussing expressions)
its better this way
QUAAAAAID START THE REACTOR
(screenshot from the actual movie, not a generation)
intense concentration
Open your miiiind

@bitter hearth workflow i used for that
thanks but 4gb vram
ouch 😦
there's always the online comfyUI websites
Working on a SD3 cartoon concept. (Picked a random one from my sdxl dataset) learned the style well enough but it still cant do good poses and I'm not sure how to fix it.
it's just a style, right? maybe use controlnet openpose along with it?
Heii
Oh yah forgot about they added support ight will try
What about openpose with balls gens
how do spheres... pose?
wouldn't you need an openBalls controlnet?
Open balls lmao
Are there any updates on the 8B model?
when there are, you'll see announcements posted all over the internet
Hilarious: I'm trying SD3 on Huggingface spaces and I wrote long prompts about some random celebrities, the best results in resemblance were obtained when I included biographical data like zodiac sign and similar BS, that must be the T5 I guess
feel like it was just a case of a small sample size
don't think zodiac sign would be in the captions much
maybe T5 is interpreting it in a funny way yeah
and zodiac sign is close to other terms in the internal embedding inside the T5
or maybe that celebrity just happened to look like a bull or something
An aries celebrity would look like a goat demon, big time
Skibidi toilet
Balls
Balls(TM) but also Pears(TM)
Notice the insane difference between the two images, neither have the best resemblance but some supposedly filler words did... something.
image on the left features the extended prompt with "useless" details:
"a highly detailed close-up of the famous taurus actress christina hendricks, born may the third of 1975"
image on the right features the shorter prompt, more concise:
"a highly detailed close-up of the famous actress christina hendricks"
Both images used the same seed and same everything, except for the slightly different prompt... I subjectively think that the extended prompt has somewhat better resemblance
No it wouldn't, if anything, it would just look like a ram. You gotta remember that the concept-space mappings for it would mostly map to constellations and the sky. That shits light-years away from a human celebrity. You'd likely need to prompt better to get a human-ram hybrid.
You took me seriously for real or you are continuing the joke? I can't tell properly
Did the same with Ryan Gosling, it improved the face symmetry but ruined the picture, I wonder what happens if I remove the negative prompt
Without negative prompt he looks like a middle-eastern
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != struct c10::Half
met this error
Hey folks, I've been away for a bit (almost had to go on bereavement leave). Now that I am back, I'm reading about florence2 and how it's great for tagging and performs well despite the smaller file size. Question: Can it be used for prompt generation or is it strictly for image description?
yes I use it for prompt generation
it is very good for it
its better than I am (as a human)
Odd, I've run it around 50 times with no errors. Is that all that's showing in the console log?
I found 0.5 shift could get a bit rough in terms of structure
when it works, it does look nice though
for SD3, what Comfy said is the lower the number, the more the details. The higher the number, the more the shape/structure. so yeah, at 0.5 the details are going to be the main focus, not the structure
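A rough sketch of why shift trades detail against structure. This assumes the commonly cited flow-model form `shift * sigma / (1 + (shift - 1) * sigma)` with sigma in [0, 1]. The function name `shift_sigma` is made up for illustration and is not taken from any node's source.

```python
def shift_sigma(sigma, shift):
    """Remap a noise level in [0, 1]; assumed form, see lead-in."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

# An evenly spaced schedule from pure noise (1.0) down towards clean (0.0).
base = [i / 10 for i in range(10, 0, -1)]

for shift in (0.5, 1.0, 3.0, 8.0):
    remapped = [round(shift_sigma(s, shift), 2) for s in base]
    print(f"shift {shift}: {remapped}")
```

Under this assumption a high shift keeps most steps at high noise, where overall shape is decided, and a shift below 1 spends more steps at low noise, where fine detail is resolved.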
yeah my experience was similar
it can be a tricky trade-off at low steps
at high steps you can get both but it takes long
I didn't expect it to be like this because the older models do better with karras schedule
shift 1 as opposed to shift 8
these look same to me
sigh. it didn't copy the second one to clipboard so they were the same image
here's with shift at 8
with and without glasses
yeah. i will go down to .5 at times, but i usually stick between 1 and 2
that's a good range yes I did a lot at 1.5
I use lower CFG than you do, in general, so it gets a bit more squiffy sometimes
but running a batch of 20-30 will mostly still yield one good one
that looks nice
I think shift is similar to CFG
lower is higher quality but you lose the control and the structure
cant really put my finger on it, its not quite that, seems more about fine details vs coherence, you can still get detail just not the pimples on her face or the bug in the jungle
its hard to explain yeah
I like this look that super low CFG gives, I would describe it as "wispy"
Would be easier to test if you guys prompt for shining balls
Much much better than humans
yeah, i don't usually like what i get if cfg is lower than 4
shift specifically changes the time_step value
most people don't like low CFG that much its just a personal thing
it kinda goes low contrast and hazy, with pastel colours
from the Ksampler docs: cfg
The classifier free guidance(cfg) scale determines how aggressive the sampler should be in realizing the content of the prompts in the final image. Higher scales force the image to better represent the prompt, but a scale that is set too high will negatively impact the quality of the image.
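For reference, a minimal sketch of the usual classifier-free guidance combination, using plain Python lists in place of model outputs. The name `cfg_combine` and the toy numbers are made up for illustration. A real sampler does this on tensors at every step.

```python
def cfg_combine(uncond, cond, scale):
    """Move the prediction from the unconditional output towards
    (and, for scale > 1, past) the conditional output."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0]  # toy prediction with the empty/negative prompt
cond = [1.0, 3.0]    # toy prediction with the positive prompt

print(cfg_combine(uncond, cond, 1.0))  # scale 1 gives back cond
print(cfg_combine(uncond, cond, 5.0))  # higher scale extrapolates past cond
```

The extrapolation past `cond` at high scales is one way to picture the "burn" people complain about.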
other places call it CLIP guidance
shift is dealing with the actual time_step value
yeah the Ksampler I use doesn't actually take in prompts
instead it has an input called guidance
and then I use separate nodes to add the positive and negative
right, just pointing out that shift and cfg are nothing at all the same. one is determining how much effect the CLIP neural network will have as far as data gathered from your prompt, the other is affecting the sampling math itself.
oh yeah you are right
I just meant they are similar in some ways
in the sense that higher values can lower quality
oh hi you were using the adams family stuff, ya? i tried and i get either one step or 4 steps, anything easy i was missing ya think?
4th order implicit adams yeah
Thanks and apologies for the late reply, working on some major enterprise software agreements here and I get to fall on the sword in two weeks if they're not done and we lose $ 2 billion in tax benefits. Which Comfy node do you use for prompt generation with Florence?
I'm afraid I always forget the names of things
but I just searched "florence2" in comfy manager
I didn't need to get a custom node from github there was one built in
someone else on this server couldn't get the adams solver to work either
but I am not sure why
k, ill play again sometimes, thanks
if I remember rightly this one
was right
you can take workflow from there
also this one, but the image quality is degraded by noise injection as an "experiment"
can't remember if the cat was SD3 or SDXL though
right off the bat, Value 300 bigger than max of 150: max_steps
oh no is the max 150 steps?
here - https://huggingface.co/gokaygokay - just go to his huggingface page, click on the Auraflow space - it allows you to use florence OR his long captioner
his long captioner does a better job in my opinion
ah thanks I love captioners as I don't like writing prompts
he's got 6 extremely nice spaces to play around with
I need to look into max steps if the max is 150 then
I have been doing 300 steps for no reason lol
good grief why?
I mostly do stable diffusion to do weird experiments rather than actually make images to use
uh i dont know, i couldnt do 300 if i wanted to, it errors and stops and puts that message in console
(watches you make the ai retrace the same lines over and over, putting smaller and smaller dots in exactly the same spot)
I didn't have errors
it might be a VRAM issue
as I rent datacenter GPUs so VRAM is high
there's a point of diminishing returns. stick with around 32 steps
at a certain point, you're just racking up GPU hours that even a microscope couldn't tell the difference in
yeah in the DPM++ paper he compares against a powerful 4th order Runge-Kutta solver and he says its not worth the time
it's not. pull an image up into photoshop, zoom to the pixel level and - that's as finely detailed as you get
the AI is going to do one full redraw of the entire image for each step. at a certain point, you've got what you're going to be able to detect and you're just wasting compute time
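A toy illustration of the diminishing-returns point, assuming nothing about any real sampler: fixed-step Euler on the simple ODE dx/dt = -x, where the exact answer is known. Each doubling of the step count buys a smaller absolute improvement than the last.

```python
import math

def euler_decay(steps):
    """Integrate dx/dt = -x from t=0 to t=1 with fixed-step Euler."""
    x, h = 1.0, 1.0 / steps
    for _ in range(steps):
        x += h * (-x)
    return x

exact = math.exp(-1.0)
for steps in (8, 16, 32, 64, 128, 256):
    err = abs(euler_decay(steps) - exact)
    print(f"{steps:4d} steps  error {err:.5f}")
```

This only covers a solver that converges to one answer. It says nothing about whether the extra steps change an image in a way someone happens to like.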
I'm not sure yet what my conclusion is
sometimes I quite liked the changes I got from the expensive sampling
if you're doing something like medical imaging, or some sort of scientific application, that's different but for stable diffusion? stick with around 32 steps
it's your money
yeah I'm just wasting my own money so its ok
Thanks again, appreciate the help
no problem
I didn't realise the default Ksampler tops out at 150 steps though
that's good to know
non converging samplers don't necessarily add detail at high steps, it just keeps changing it
i think my issue with adams is i have a crap card and have issue with flash attention
are you sure the max is 150 steps?
maybe it got changed?
```python
# excerpt from a scheduler node's class body
@classmethod
def INPUT_TYPES(s):
    return {"required":
                {"model": ("MODEL",),
                 "scheduler": (comfy.samplers.SCHEDULER_NAMES, ),
                 "steps": ("INT", {"default": 20, "min": 1, "max": 10000}),
                 "denoise": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 1.0, "step": 0.01}),
                 }
            }
RETURN_TYPES = ("SIGMAS",)
CATEGORY = "sampling/custom_sampling/schedulers"
FUNCTION = "get_sigmas"
```
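A sketch of how a min/max entry like the ones above could produce the console message quoted earlier. `check_int_input` is a hypothetical helper, not ComfyUI's actual validation code, and the `max_steps` spec with a max of 150 is assumed from the error text. The error names `max_steps`, while the snippet above shows `steps` capped at 10000, so the 150 limit may belong to a different node input.

```python
def check_int_input(name, value, spec):
    """Return an error string if value is outside spec's min/max, else None.
    Hypothetical helper; the message format mirrors the quoted console line."""
    if "min" in spec and value < spec["min"]:
        return f"Value {value} smaller than min of {spec['min']}: {name}"
    if "max" in spec and value > spec["max"]:
        return f"Value {value} bigger than max of {spec['max']}: {name}"
    return None

# Assumed spec for the input named in the error.
print(check_int_input("max_steps", 300, {"default": 20, "min": 1, "max": 150}))
# The `steps` spec from the snippet above accepts 300.
print(check_int_input("steps", 300, {"default": 20, "min": 1, "max": 10000}))
```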
not sure at all, console spit it out
which sampling node do you use?
he said he has a "crap" card...
your workflow from the cat
how much vram do you have?
flash attention might have been an issue
yeah
comfy is complex and sparsely documented so there may well be a 150 step limit somewhere
gonna try to take a look through
2080 ti 11gb, ya i think i have to downgrade pytorch to get flash attention on my card :/
ummmmmm you might be better off renting GPU time from AWS or even just using google colab
join cloud gang yes
im good thank you, you can see images i post and videos and audio
i'm actually really surprised you're generating on that, at all.
now you're being silly
i'm not. you've got a very small system. and a lot of this stuff takes a lot larger system than you've got.
I don't actually know what the vram minimums are for stuff
4gb
ah ok
never heard of small/large my computer is quite large
sadly machine learning suddenly started using massive amounts of VRAM
it didn't used to
Alexnet was trained on GTX 580s lol
out of curiosity, can you run DaVinci Resolve on that? or the latest version of Blender?
im actually waiting for 50 series pricing, when i get up off the floor ill probably buy a used 3090
no clue, no desire
are you generating videos some other way than editing video clips together then?
generally there are tricks to reduce vram for most of this stuff
it mostly involves a lot of tiling
that's why people disagree a lot about how fast upscaling is
adiff and modelscope, hotshot, all kinds of things
ya sure it takes forever for length but its possible
3072 x 1728 24fps on my puny card
deforum type Adiff with some audio sync, udio song
see the little card that could
those are really cool yeah
some plain animateLCM
this small computer is winning ya see
its not the size of the card its how you use it (this also isn't really true)
Prompt used: "typography text saying: "i smell earwigs in here", flowery background, motivational, detailed exact text rendering"
Negative prompt: "missing negative prompt"
SpongeBob jumpscare 
Has anyone ever been able to get a drow elf with black skin, using SD3? I keep getting pale human ladies. SDXL on the other hand makes amazing drow!
lol ya thats what i was going for hahah
but we've got the biggest balls of them all!!
Try harder
👍
I don't think sd3 is very good at pointy ears stuff
Have you managed any?
The black skin is the most difficult part 😦
No, I don't try elves anymore cause sd3 is really bad at it lmao
The anatomy of the ears are ugly 
I made an SDXL drow lora, wonder if I can trick SD3 into using that 😉
You can just generate half with xl then sd3
I swear I used to make the very same bad quality memes with paint a few years ago using impact font
now with more lazy
its the AI way
Hell yeah brother
must be true the AI said it, thats commng
Sad matters
i live under a rock, i dont know what Drow is and neither does the AI i guess
That looks like a Night Elf from Warcraft franchise
dark elf in DnD. While DnD is popular, the art is quite niche and not found too far outside of DnD contexts.
i play some ancient game, they got dark elf and high elf
I think drows exist in some european folklore/mythology though
Thats right. DnD draws largely from folklore and mythologies. But you'll find that art for the traditional myth Drows are not the same as the DnD stylizations.
WoW is largely a DnD rip off the same way that Overwatch is largely a TF2 ripoff. i use that phrase loosely here. "Heavily inspired by" might be more fitting
You'd probably need to use every other word to describe them other than elf or elven. Pretty much all of the concept mappings are going to associate an elf with pale or fair skin
dark elf popped out some horror type with black skin for me, i didnt save em
You might also try just using a controlnet to get the form and use terms for dark skinned
but, but, charcoal black obsidian skin, should be universal right? 😄
Without mentioning elves
Always ask yourself "what would images like this be captioned as in a dataset or by a non-knowledgeable person looking at the image"
PSA gliff sucks, bikini armour isn't allowed grumble
ideally, the base model captioning should be so robust that the knowledge is generalized and it can zero shot the image. but if you prompt a hulk elf, it's going to be either mostly the hulk or mostly an elf.
some concepts seem so over fit that you can't blend them into other concepts so well. like doing car models as a ball is hard to prompt for, since cars are so over fit and it can generate specific models so well
My furry glif got the skin colour correct ROFL
Apparently SD3 wasn't the problem, it was just my bad prompting! Claude helped!
@bitter hearth apparently SD3 can do pointy ears just fine... with the help of Claude 😄
my SD3 drow failures
yeah i was just testing it out myself and was having no problem getting elves with dark skin.
black skin is always difficult though, even prior to SD3 :(. Niiice ears 😄
yeah just throwing the first one it made out here
looks ugly 
that ear is a little wonky though
How so? What's wrong with the image?
Or do you mean the ones where I didn't use Claude?
finally finished my coffee and got around to it
Brown skin though 😦 Awesome ears :>
this one turned out alright as well. had to remove fur from "primative fur garments" because it kept giving them gorilla forearms
like this
see the huge forearm parts lol shit had me rolling
What happens several years after a furry and a drow get frisky....
lol
it even got the white/silver hair without me asking!!
only one prompt to all 3 tencs. bosh3 solver. A jet black dark skinned man, jet black skin like obsidian, he has elvish ears and is wearing elvish style robes. his eyes glow with the magic of valinor
going to try running a few with kolors to see how it handles it
yeah it's being a pain in the ass about making them pale skinned elves
oh nvm, i got one
guess it didn't like the concept bleed test of "alabaster colored clothing"
yeah for sure, just wanted to test it out
can you fix that with FreeU, I wonder
usually just by rolling the dice with seeds, honestly
i've never found attention span problems to be solved by freeu. that one has always just felt like it does nothing imo
it does quite a lot, actually
Yeah i remember all the hype. No one could ever explain it to me or what i was doing wrong. I just don't see it and then since it can't be explained, it feels like people are just propping it up
PAG+automaticcfg and/or dynamic cfg always gave me far better results than freeu did
Freeu just always gave things that fake aesthetic plastic look, even if you used the correct settings for sdxl
But iirc, they touched on some of the issues like that in the paper
The reality is that there's no magic wand that magically makes generations better. They all have pros/cons. So it's better to kind of grasp what all the tools can do and know when to use them
okay, well - it's used on mage.space so i did a tutorial video on it, which is here https://youtu.be/1FMIZNR25jA?si=eGM7-vQPK70w4hEk and created documentation if you want that - i got tired of people asking how it worked
One of the powerful tools available on Mage.space is called FreeU. But what is it? And how do you use it? Let's discuss this.
mine were with PAG advanced where i left scale at 3 and set rescale to 0.5-0.7. the ksampler's cfg needs to be N-3, so if you want ~5cfg, you'd set the ksampler to 2
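The rule of thumb above as arithmetic, a sketch that only restates the message: with PAG scale left at 3, the sampler's cfg is set to the target minus that scale. `ksampler_cfg_for_target` is a made-up helper name. Whether the two values really add up linearly is the poster's claim and is not verified here.

```python
def ksampler_cfg_for_target(target_cfg, pag_scale=3.0):
    """Sampler cfg to dial in so that cfg + PAG scale lands near target_cfg."""
    return target_cfg - pag_scale

print(ksampler_cfg_for_target(5.0))  # 2.0, matching the example in the message
print(ksampler_cfg_for_target(7.0))  # 4.0
```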
oh thanks that is a huge difference
will definitely use this
you're welcome. pass the tutorial around as well if you find people that don't understand what it does
I never like the ears, always weird
drow don't have cat ears
Eve is that you. I wouldn't trust that slimy bastard
ye thats the guy he did it, all of it its his fault
So you feel less bad about sd3's ladies in the grass, here is auraflows "photo of a fat dog eating bananas"
Big Brother
Big Sistah
I couldn't get any results like that. I'm using dpmpp_2m, 30 steps, sgm uniform
the first minute opens with the usual same hype. It'll take your images to that next level and is a free lunch. its so ez. Not much hope for this video. i've seen a lot of them.
are you dissing my tutorial?
how about you watch the entire thing before you get negative about it
i got to the end and each different example, as you're showing and explaining it, doesn't make sense to me since the visuals are just slightly different noise solutions and nothing about finer details. this is my usual experience with freeu. you gave me the tutorial as if i've never tried to understand its use over the unet. but you're mistaken. i've just never understood the point of freeu and it's basically just waving a magical wand and calling it good.
While you address that theres no "best" settings, theres a narrow range recommended for every parameter still because outside of that tight clamp the denoising solution just goes absolutely bonkers
i suggest you watch it again, pause it every step, and study it. yeah, some are slight differences, some are larger ones. it's a way to tweak, as I said, the actual U-Net network layers and skip connections
i've never made any image better with freeu. one thing your tutorial avoided mentioning is all the hype about speeding up generations too. how you could get better generations in less steps. no one mentions that anymore these days
and i'm not mistaken, it affects each of the layers, and the skip connections that some of the data routes from one side of the network to the other
my original question was i wondered if this could be a way to adjust how much the network grabs adjectives and prevent it going overboard with them.
matt3o goes more indepth with this https://youtu.be/0ChoeLHZ48M?si=Z9iRoqE9gZ92n0e5
This time we are going to do some R&D and I will need your help to reverse engineer the UNet. Basically prompting each block of the UNet separately with a dedicated prompt we are able to get higher quality generations.
Extension repository: https://github.com/cubiq/prompt_injection
Discord server: https://discord.com/invite/W2DhHkcjgn
Github s...
prompting the unet block by block seems like overkill. there's a clip cutoff extension on a1111 thats useful. regional prompting is the best solution to the unet's short attention spans
the point is to try to keep it from grabbing a color, mainly, and using it on everything
yeah. the attention problem. i know.
so if you prompt the color in the last block...
I may or may not have previously prompted a red tshirt on my horror zombies to trick some places into producing better results 😉
Not just SD apparently
but attention is all you need
red tshirt made of ketchup
deepshrink, scalecrafter and high-diffusion all have a little bit of block by block action
It's odd, I've tried FreeU a few times, sometimes the results are truly amazing, the other times, the results are extremely horrible. Works with some workflows and not others for me it seems.
Re the vid, I liked it, far less sensationalism than say, 99% of youtube vids lol
yeah I had that from PAG, SAG and FreeU
they are great tools but not for every image
and red paint, and red viscous liquid 😄 Though that last one is really pushing it lolol
sometimes these things sharpen too much and I like soft images
viscous. good one
I feel kinda dumb now, what is PAG and SAG? something something generator...
I stole it 😉
Self Attention Guidance and Perturbed Attention Guidance
they are both trying to improve CFG
I did use the perturbed code (from Civit) on the SD3 model when it came out. No idea if that's the same thing. Only a tiny difference though, and could have just been bias.
not sure
I really struggle with things being named different in different places
i wonder what kind of cfg improvements we'll see created for sd3. those work on the unet style network don't they? i know people started perturbing sd3 in the first week. i'm just saying, there'll probably be some interesting attention solvers to come
It was out day 2 😄
oh i looked at SAG and i'm wrong. it can work on DiT too. Does it already?
not sure
I read like 30 papers on samplers in the last few days
turns out I was wrong, 4th order implicit adams isn't going to be the best
DPM++ 2M still seems to be very competitive
also UniPC seems to be underrated
at low steps its one of the best
are you trying to turn zombies into red-shirts?
Prepare the away team
it's what it didn't do. it didn't get away from the cat in time
The Rockin' Racoons?
They are supposed to be the Punk Skunks 😄
any idea when stable-diffusion-3-large be released to huggingface?
no news yet
😥
its probably gonna be a while
because the effect of people attacking SAI over the issues with SD3 2b is that they will delay the release heavily
to get it as good as they can
large is 8b right?
yeah
any idea how much vram it will take?
they said on reddit it will fit in 24gb
That sounds like they are training another from scratch though
not necessarily
there's two issues with the model and both are fixable with fine tunes
- not enough subject knowledge
- issues with structure
and really they are both just the same thing
enough subject knowledge will teach it structure
I suspect they may well just train from scratch though yes
eh I am pretty sure SD3 will not meet the same fate as SD2 and SD2.1
SD3 is literally their deadly hit or miss
it will still be SD3
its already WIP
the new SD3 medium
but its a new training run
was my understanding
having said that IDK if it was 100% confirmed that it isn't a finetune rather than a new fresh run
but the point is it will keep the SD3 datasets probably
what I would say as well is that the 8B can do structure fine
so its just a case of scaling it down to 2B
I suspect its the self attention that is the issue
Just throwing out a quick question in case anyone has tried it. I'm having difficulty using SD3 with a custom node for MultiDiffusion/Mixture of Diffusers tiled upscaling alongside SD3 ControlNet tile. When set up a similar way as I had it using the SD Ultimate Upscaler node, which seemed to work, I get tiled ghosting of the original image (first image below). If I use an empty latent for the second stage and completely rely on the controlnet to guide the final result, you can see that is the cause of the tiling (second image) -- it applies the full input image to the controlnet, instead of breaking it up into tiles. There doesn't seem to be a way of connecting the controlnet side of things (which changes the text conditioning) to the MultiDiffusion side (which affects the model) to let MultiDiffusion directly affect the controlnet. I saw someone else having what looks like the same problem using SD1.5, but there doesn't seem to be a definitive solution. https://www.reddit.com/r/comfyui/comments/19amano/help_cant_get_tiled_diffusion_controlnet_tile/
This is the custom node I'm using: https://github.com/shiimizu/ComfyUI-TiledDiffusion
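For anyone following along, the mismatch described above (the latent is processed in tiles, but the controlnet gets the full input image) can be shown with a rough sketch. This is not the code of ComfyUI-TiledDiffusion or any controlnet node; `tile_boxes` and `crop_hint` are hypothetical helper names, and the tile size and overlap are arbitrary assumptions. The only point is that each tile should be paired with the matching crop of the control image rather than the whole thing.

```python
import numpy as np

def tile_boxes(height, width, tile=64, overlap=16):
    """Return (top, left, bottom, right) boxes of overlapping tiles covering the image.

    Hypothetical helper for illustration only; tile and overlap are assumed values.
    """
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")

    def starts(size):
        # Regular grid of start offsets, plus one extra tile flush with the far edge.
        positions = list(range(0, max(size - tile, 0) + 1, stride))
        if positions[-1] + tile < size:
            positions.append(size - tile)
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]

def crop_hint(hint, box):
    """Crop the control image to the same region as one tile."""
    top, left, bottom, right = box
    return hint[top:bottom, left:right]

# Each tile gets its own slice of the control image instead of the full picture.
hint = np.arange(100 * 100).reshape(100, 100)
for box in tile_boxes(100, 100):
    tile_hint = crop_hint(hint, box)
```

If the controlnet side only ever sees `hint` and never `tile_hint`, every tile is guided by the whole image, which would be consistent with the ghosting in the screenshots.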
Is stable diffusion 3 uncensored or not?
And will it also run on a 6gb vram gpu like sd 1.5 or not?
it's censored
by the normal definitions of that
for legal reasons probably all base models from everyone will be mostly censored going forward
and then fine tuners will do whatever they want
there was a crowdfunded program to fund an uncensored model and the creators just ran away with the money
lol
it was crazy
I still don't understand why no one took them to court
Define censored? lol You can get naked people, but not || nipples || nor 🥒
It's not censored, it just does not like people lying on grass
I've seen several NSFW models/apps and they all suck! (and not in the way that people are hoping)
I don't want to say it's a skill issue
but I have managed to do the woman lying on grass test
Notice that this prompt worked ONLY at 768x768, SD3 is not censored, so far it seems only VERY undertrained
if your prompt has a difficult structure (such as the grass one)
you can do a lot to help
dynamic thresholding, CFG scheduling, block level IP adapter, attention map injection, control net scheduling
and most importantly
one of the stupidly long solvers
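A rough sketch of what "CFG scheduling" from the list above can look like, for anyone who hasn't tried it. This is an illustration only, not any particular node's implementation: `linear_cfg_schedule` and `apply_cfg` are hypothetical names, and the start/end scales are arbitrary assumptions. The guidance formula itself is the standard classifier-free guidance combination, `uncond + scale * (cond - uncond)`.

```python
def linear_cfg_schedule(num_steps, start=7.0, end=3.0):
    """Guidance scale per sampling step, interpolated linearly from start to end.

    Hypothetical helper; the default scales are assumed values, not recommendations.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]

def apply_cfg(uncond, cond, scale):
    """Standard classifier-free guidance on two equal-length prediction lists."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Strong guidance early (to lock in composition), weaker guidance late.
schedule = linear_cfg_schedule(5, start=7.0, end=3.0)
for step, scale in enumerate(schedule):
    # In a real sampler, uncond/cond would be the model's two noise predictions.
    guided = apply_cfg([0.0, 0.5], [1.0, 0.25], scale)
```

Whether a falling or rising schedule helps a given prompt is something to test per prompt and per seed.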
I think there is severe inconsistency between seeds... and I don't mean that every seed generates totally different images (for some prompts it generates nearly identical images), I mean that some prompts generate good images on only a few seeds. But I don't have the optimizations though since I used SD3 only on the huggingface space of Stability AI
Well, that's true for the "beta" SD3 medium, dunno about what's next
I agree about seeds yeah
I get some seeds that are so bad
and then some that are amazing
Of course I can't write good prompts, but if seed number 1 generates almost the perfect image with my bad prompt, yet any random seed generates garbage, there were evidently issues with the training
Almost every prompt works at least 99% of the way on the first seed, I usually try prompts on the first seed in any model to test the prompt adherence
Embeddings made using AUTOMATIC1111 can get around such difficulties!
trying lots of seeds is a good idea
Yesterday the dudes were trying to generate drows or dark elves or whatever, I did these two images with the same prompt but the image on the left was seed 1 while the other is the first random seed that came out.
Prompt: "a high quality hyperdetailed close-up of a dark elf drow from DnD"
I used the dumbest and easiest prompt, the elf on the left is not 100% lore accurate but has pretty good composition and anatomy IMHO
But the ears are objectively better in the first one
Look at this, used the same random seed as the one on the right but lowered the resolution to 768x768
might be a popped collar
Either way it's a confusing composition, which is generally thought of as a bad image
I know the rest is pretty good though
yeah I know what you mean
Is the instrument Saxophone a no-no for SD3 or is there a more "natural language" friendly CogVLM way in place of "playing a brass saxophone"/"lips on a saxophone"?
Musical instruments with complicated features are difficult for AI. Tubas are a good example for SDXL.
Nice, looks like SD3 does decent saxophones if the prompt is simple and short, that's a funny one
"a man is playing a brass saxophone", yesterday I tried a long convoluted prompt fixed with LLMs and it generated only body horror
768x768, one handed sax playing (also best saxophone I generated with any open source model, bad luck for the... weird technique)
!generate A serene mountain landscape at sunrise, with snow-capped peaks and a clear blue sky, painted in a realistic style.
I needed some Drow for my video. I simply described "Drow elves with blue skin and white hair."
Did you use that exact prompt?
Because that exact prompt, using SD3 medium without any fancy optimization, gives me this
gonna try it
I'll have to fix the resolution though
I guess he has some other token in the prompt, it just outputs images that look like they're from bad SD1.5 fine-tunes this way
I'm using the model on the official huggingface space by SAI
https://huggingface.co/spaces/stabilityai/stable-diffusion-3-medium
@sage burrow he's calling your name
Yeah, I realized after I posted that I was using SDXL in Fooocus, not SD3. I also have some D&D specific loras I downloaded from CivitAI, but that is the exact prompt.
Try the taesd huggingface 🙂
I tried it a few days ago with other prompts but it seemed even worse
I also got some mixed results
I'm surprised, it's one of my faves! Also glif
Let me try with my previous prompt, the one that actually got me decent outputs
This one I mean
Left one is normal SD3 (I used a negative prompt), right one is TAESD3, which from what I see can't use negative prompts
Same seed by the way
Prompt was "a high quality hyperdetailed close-up of a dark elf drow from DnD"
Same prompt/seed with SD3 but without the negative prompt
I use Dall-E 3 very often... I reckon on Glif you should have the 8B, right?
Lykon said glif is sd3 large most likely
All prompts I did with SD3 medium on that huggingface space are more or less improved if I lower the resolution to 768x768
Ah, false memory then, I was sure he said it was 8B
If that's the case, I impatiently await sd3 large 😄
Though on glif Claude helps also
I guess prompt expansion does big things, however I tried writing short novels in the prompt box of that huggingface space and it seems not to like my natural language
I guess I'm not natural enough...
🤖
random thought of the day
cascade is still the best model (just sayin')
100% true
some of us are still using 1.5
still go back myself occasionally
1.5 had the best pretraining (if you know what I mean), but being the first one was "technically" meh. SD2.0 could have been way bigger if trained properly
I think it's more about asking in specific ways
I use 1.5 daily! 🙂
Why are there no 2.0 loras and models?
I tried doing this image with Dall-E 3 and it got the text wrong every single time, it worked only with SD3 at 768x768
It was a massive letdown when it came out, also it was more difficult to train
SD2.0 was insanely undertrained, probably though in a more linear way than SD3 since it had somewhat better anatomy
think sdxl was announced to be in the works just after 2.0 dropped and people probably said they'd wait since 2 was giving some pretty bad results on base model
I haven't tried it yet, is it good at nsfw? Do the 1.5 or sdxl loras work with it?
Cascade can do some nudity, other loras are not compatible and it is very likely more difficult to use than other SAI models
dunno honestly, was more referring to base model cascade, not sure what loras are out there for it
Wait? This community can wait? --day 2 after sd3 release-- where's all the loras, I'm going to make a lora, who cares if it's a finished model! 😄
i just use a styler prompt mostly with cascade, seems to do really well for what i need
There are few to no loras out there for Cascade because it was not hyped at all, being a surprise release
To be frank SDXL was announced in July of 2023 while SD2.0 was released in November of 2022, also mind that there was a SD2.1 in between
I'll try some on my furry glif....
Are those the BALLS of Ultron?
What's that?
Some character from Marvel movies, I have no clue what pop culture references work at all in the present day
Where's the cat with 0.5 GB vram?
Ball shaped furries! 😄
those are cute i like them
reminds me of critters
Reminds me of dollar store toys lol
has some "boglins" vibes
2.0 was replaced by 2.1 within a couple weeks...
they were little hand puppets you could control the eyes
Just like SD3.1! 2 more weeks!
nope
2 more and 2 more!
SD3 has Pieter Bruegel the Elder in it already, it's fine and done! ❤️
sd3 is a group of models though. so it'll probably be sd3 2b/medium revised. they've sorta painted themselves into a corner with the versioning pattern
Ah I see, trompe l'oeil