#sd3
it is definitely not 1024px)
Lykon said on twitter that 2b is better than current 8b in some categories 
considering that 2B was their focus, it makes sense
like a well trained small model will outperform an undertrained large model
llama3 8B vs like the largest Bloom model
yea and it shows that 8b now has even more potential
Sd3/pixart/hunyuan - pixels being sampled by an unruly bunch on a untimely schedule, they're analyzing a model but it's too small.
It does really well with a lot of stuff so I keep using it.
SD3 really does have excellent visual acuity!
with people like the creator of HelloWorld, sd3 will be awesome
you think that 2b will be most popular or will people support the 4b and 8b? I feel like the community will just use the 4b and barely tune the 8b
I think the one that they'll release first will be the most popular since tooling, finetunes etc. will be built on top of it
part of why I'm sad that we don't get all of them is that instead of the community finding out which one works best for most people, everyone is funneled into the same model
Sd3/pixart/hunyuan - a collection of cells in a fierce battle with a virus, spears, guns, shields, cannons
pixart did nice
I wish the api had the 2B model
it would possibly increase user count as its so much better
two good questions:
if 2b is better than 8b, why not put the 2b on the api?
if 8b is not ready for release, why charge ppl for it?
would make a lot more sense imo if they just swapped em
exactly
also it seems they'll give us fine tuning code or something
- Fine-Tuning: Capable of absorbing nuanced details from small datasets, making it perfect for customization and creativity.
I wonder if this means that they have made some really good finetuning implementations themselves
"absorbing nuanced details from small datasets" sounds really promising
alex said that HF has had code etc for months so i'm guessing a diffusers implementation will be available day 1 or at least quickly
also cant wait for Stable Audio 2
"When's SD4 coming out?!"
ideogram
yeah ideogram is still the goat
It is uglier, but more faithful to the prompt
but damn, SD3 2B finetuned might take the throne, even if its still not the best
yeah ideogram has a... style...
still prefer it over DALLE3's super smooth style
I am eagerly looking forward to the 12th to see how 2B works (I have bad feelings, I hope I'm wrong).... but the one I would like to have is 4B
I'm just looking forward to more knowledge in the model
like it knowing the look of video games, video game characters, etc
looking forward for people who spam "1girl" images in the subreddit
I want it to know a lot, like how going from Llama 3-8B to like Llama 3-70B, 8B is coherent and all, but 70B just KNOWS more
Yes, with 2B we will be able to play, understand it, until the others arrive
oh my god I forgot about that 
yes it will keep us busy until the other models come out
and when 2B comes out, and its actually good and has diversity/variety and etc
I'll buy $10 credits
then it will get overtrained
with shutterstock pics
community finetunes on big boob girl dataset???? nooo this is not true at all 
from the SD3 research paper https://arxiv.org/pdf/2403.03206 this is how CLIP and T5 come together in the model:
You can on the whole just stack more stuff horizontally at will and it works. Something similar works on SDXL/SD1 and I think comfy does it by default for >77tok prompts, but SD3 is basically designed to be happy with stacking like that
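For anyone who wants the "stacking" spelled out, here is a shape-only sketch. The sizes are what I recall from the SD3 paper (CLIP-L 768, CLIP-G 1280, T5 4096, 77 tokens each) and are not stated in this chat, so treat them as assumptions. The helper names are made up for illustration and no real model is loaded.

```python
# Shape-only sketch: each embedding is described as (tokens, channels).
# The sizes below are recalled from the SD3 paper, not taken from this chat.
CLIP_L = (77, 768)
CLIP_G = (77, 1280)
T5_XXL = (77, 4096)


def concat_channels(a, b):
    """Join two embeddings side by side; token counts must match."""
    if a[0] != b[0]:
        raise ValueError("token counts differ")
    return (a[0], a[1] + b[1])


def pad_channels(shape, target):
    """Zero-pad the channel axis up to the target width."""
    if shape[1] > target:
        raise ValueError("already wider than target")
    return (shape[0], target)


def concat_sequence(*shapes):
    """Stack embeddings along the token axis; widths must match."""
    widths = {s[1] for s in shapes}
    if len(widths) != 1:
        raise ValueError("channel widths differ")
    return (sum(s[0] for s in shapes), shapes[0][1])


# CLIP-L and CLIP-G are joined on channels, then padded to the T5 width.
clip = pad_channels(concat_channels(CLIP_L, CLIP_G), T5_XXL[1])
context = concat_sequence(clip, T5_XXL)
# "Stacking horizontally" a second CLIP chunk for a longer prompt.
longer_context = concat_sequence(clip, clip, T5_XXL)

print(context)         # (154, 4096)
print(longer_context)  # (231, 4096)
```

The point is only that extra chunks lengthen the token axis while the channel width stays fixed, which is why the model can accept them without any change in architecture.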
woah
is controlnet ready for sd3 medium?
you can pick and choose which tencs to use yes. It's only trained for G+L, and T5, and will need training to recognize other formats like longclip
oh yeah that's a good question, we'd like to know that
he already heavily implied no earlier
clipskip setup is exactly the same as SDXL
oh I didn't see, sorry lol
hopefully a fixerupper won't be needed for this one but yeah you can easily extract the VAE separately and tune it same as always
I also thought Emad said we'll get their own controlnets with release but I think they've gone back on that, too, or I've misremembered
yeah, Emad said it on twitter
cause sd3 is multimodal can you prompt using only images?
it is what was said
you mean like clip vision or something?
I'd be surprised if he said it would exist at release, since it wouldn't make sense to delay release for that
I tried to find the reference but the thread and tweet I found were leading to [deleted]
so not sure what the wording was
oh, ya i dont remember, lol did it get deleted, i surely dont know hahaha
ah no, there's still a bunch of replies where he says it (note: this wasn't for sd3 my bad)
yup
welp about that, it didnt happen, ran out of compute
but it was also supposed to be up to 8b and estimated to 2 months ago, so clearly a lot was said out of hype, or at best blind optimism
uh they work there
Let's call it optimism and stop listening to Emad at all.
So I guess T5 only lost at the ablation tests?
i know it's been looked into but idk if that'll be ready on release or not. Probably not.
The way to make controlnets is really clean+clear in SD3 though (there's a direct place to add a new stream, vs on old unets it was pretty hacky) so I'd expect controlnets on SD3 to be pretty cool once they're actually out
if it really does work better and more easily that'd be awesome since tooling is the main advantage of SD over everyone else
technically yes but idk how intelligent it will be with only an image input
note that Emad doesn't work here anymore
lol yeah he had a habit of saying "yes thing will be ready" far before that was guaranteed
he did then, and to be fair the few things the next CEO said before he went silent also didn't come to be (mainly timelines) but let's not dwell on that
that tweet is about cascade
oh that is feb 13th
the 3rd one is definitely sd3
wait
huh?
yeah the second and third one yes
but my bad, I shared too quick, I didnt even find the one I remembered anyway
ye
I remember that there was some mention about the research team wanting to test if using only T5 performs better, don't remember if it was you or someone else.
as in just training the model on t5 only, leaving clip out completely
oh, yeah, that experiment got deprioritized in favor of releasing the known-working arch for now
Fair. Probably wouldn't be too difficult to tune it on T5 only on my own anyways.
can't wait for community efforts similar to this
Thanks Alex for answering all our questions! Very much appreciated
tbh I mainly just hate the token limit of CLIP. And you can't really blame them for doing it that way because sequence length is SOOOOOO EXPENSIVE
Plus I think over 99% of the data used to train CLIP was less than 20 tokens long, you can see the consequences of this if you look at the positional embedding and also read the Long-CLIP paper to see experiments on it. CLIP can barely function past 20 tokens on its own.
I did try aligning a long clip model to a finetuned SD1.5 model and it takes like, over twice as long to train like that with the 258 token long context window. That's with the text encoder unfrozen. But it did seem to work as advertised, it pays much better attention to the whole prompt.
gonna need to handle that pooled embedding omission for finetuning tho
this gonna take a while to tune out clip from sd3
anyways I do hope that T5 does enough to at least hold different parts of the prompt together when we are doing unholy things with clip embeddings and torch.cat
concatting can get everything in different chunks to at least be present but it won't allow things to properly combine with each other (except for whatever the denoiser can accomplish on its own. Honestly MMDIT might just inherently be able to handle this a lot better anyways). t5 would be the only thing on the input side able to combine distant concepts
Does anyone know how PonySD3 would be trained
would they continue with training clip with like tags, or did they switch to some vlm to caption the dataset
my assumption is that they'll use both. They already used some vlm captioning for XL using their own trained captioner
thanks
V6 was vlm for half captions, v7 is full captions run and the quality of captions is much higher
woah
I.e. OCR, character name recognition, support for nsfw, image grounding (wip, but you should be able to describe object positions better, in a DALL-E 3-esque way)
nice
@dull star what you think of the helloworldsdxl models?
yeah, imo best model for photos of people
all controlnets for sdxl came very late and the people behind control net are very arrogant so i will assume maybe 8-10 months after the weights
We just barely got the sdxl controlnets so I can't imagine we will be getting sd3 controlnets anytime soon.
also how good would loras be in a DiT vs Unet?
via sd3 api. I hope that sd3 medium will be able to do these kind of images too
Because people asked about the weights so often?
honestly doesn't seem like a lot of words for 2B to fail
then again, 2B was trained for a more correct amount of time than 8B
@sterile pendant ahhh
sorry if you have already seen
YESSS A PROPER 2B IMAGE WITH NO UPSCALING
the 16 channel VAE is doing its job quite well
a 4 channel vae would probably fail in this case
so with highresfix, we'll get image quality that Lykon's been posting
but even for a native image, this is quite clean!
Anyone know if you train a Lora on the 2B SD3 model, will it work on the 8B SD3 model?
probably not
but a textual inversion probably will
She has long arms
If you look closely...
LMAO
As you can see here, the 4 channel sdxl vae has a detrimental effect on facial features.
I wonder why nobody increased the amount of channels before it seems to work even in small models
wouldn't it require a complete retrain?
Idk
yeah, highresfix makes facial features worse the more you increase the resolution

it's a bit noisy if you look too closely, but still far better than the c4f8 vaes would do.
what does the f8 mean?
Can someone explain in easy terms how the 16-channel vae is so much better than the 4-channel one and why?
How to do it
I suggested this since I do SD 1.5
but yes, you have to fully retrain the vae as well as sd for it
less compression. The vae is like an image compression algorithm. Think of transforming your image to a jpeg image, you will get a lot of artifacts
same happens with vae. You get artifacts and lose small details in the image
furthermore the vae is extremely sensitive to small changes because it's so strongly compressed - so it's hard for the diffusion process to get small details right
that's why all small things like heads far away go lost in diffusion
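To put rough numbers on "less compression": a small arithmetic sketch. It assumes the usual reading of the "c4f8" shorthand (4 latent channels, 8x spatial downsampling per side), which nobody spelled out above, and it only counts values, ignoring bit depth. Function names are invented for the example.

```python
def latent_shape(height, width, channels, factor=8):
    """Latent tensor shape (channels, h, w) for an image of the given size."""
    return (channels, height // factor, width // factor)


def compression_ratio(height, width, channels, factor=8):
    """How many RGB values are squeezed into one latent value."""
    rgb_values = 3 * height * width
    c, h, w = latent_shape(height, width, channels, factor)
    return rgb_values / (c * h * w)


def face_latent_side(face_px, factor=8):
    """Side length, in latent positions, of a face that is face_px wide."""
    return face_px // factor


print(latent_shape(1024, 1024, 4))        # (4, 128, 128)
print(compression_ratio(1024, 1024, 4))   # 48.0
print(compression_ratio(1024, 1024, 16))  # 12.0
print(face_latent_side(64))               # 8
```

So a 64 px face far away gets an 8x8 patch of latent positions either way; with 16 channels each of those positions simply carries four times as many numbers to describe it.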
Ah! Or prints / patterns and the like. That's great news since it was one of those SD bottlenecks that's really starting to bug me.
What are the vram requirements for sd3 per model size? Will an 8gb rtx3090 be able to run the 8b model?
There's an article on civit by the author of pony who described how he's gonna handle sd3
If anyone deserves first access to the model weights, it's the pony dev
3090 has 24gb vram and yes it would
i like how right out the gate the pony devs are the most uhm... productive.
I do have a tiny suggestion for the ponies tho, please make your models more varied. All I get is a woman in an empty room, no matter the prompt.
In the same clothes too
If I am lucky enough to get clothes
your anatomy is pretty good tho, if anyone can finally get good hands across the board it's the p0ny sd3 i think
no more 
apparently hands are gonna be better according to the email we have been sent about 2B
I have massive doubts about it, but I definitely expect hands to be at least a little better or on the level of SDXL
- Photorealism: Overcomes common artifacts in hands and faces, delivering high-quality images without the need for complex workflows.
Yes I read it when I got it.
bold claims.
i hope tho because facedetailer is terrible at detecting and fixing hands
in comfyui anyway
smaller faces in the image, I expect to be much better thanks to the improved VAE
so I believe that
but I will still do a workflow (a very complex one) to make images better
you won't expect this at all....
Highres-fix
- Typography: Achieves robust results in typography, outperforming larger state-of-the-art models.
I wonder how much this has improved over the 8B Beta days
tiled upscales are the only way around the untrained resolution problem of latent upscales
but then you can have issues with compositional drift
but best i've found has always been a tiled approach
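A minimal sketch of what a tiled approach has to compute first: overlapping tile boxes that cover the upscaled image. The tile size and overlap are arbitrary example values, the function names are hypothetical, and blending the seams (plus whatever the sampler does per tile) is left out.

```python
def tile_starts(size, tile, overlap):
    """Start offsets along one axis so tiles of `tile` px cover `size` px."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile")
    if size <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # last tile is pinned to the far edge
    return starts


def tile_boxes(width, height, tile=1024, overlap=128):
    """All (x0, y0, x1, y1) boxes, row by row."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


boxes = tile_boxes(2048, 2048)
print(len(boxes))  # 9
print(boxes[0])    # (0, 0, 1024, 1024)
print(boxes[-1])   # (1024, 1024, 2048, 2048)
```

Each tile stays at a resolution the model was trained on, which is the whole appeal; the drift issue comes from every box being denoised without seeing the others.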
That's why I just use highresfix (t2i -> denoise at like 50% and t2i)
cause I have enough vram to waste
but controlnet tiles would be good
hehe everyone talks smack about hi res fix
ugh
also the backdrops and architecture will make more sense
like in this photo the closest column is not right
also it doesn't make sense, what is it, a bunker?
lines in buildings are always not straight
sizes of floors are always messed up
and in general they feel funny and unreal all over sdxl too, sd15 even worse
I just trained a lora on a building
hmmm
i use nvidia cards
same
if only rocm caught up
i thought you were training on a building, ill show myself out now
oh I just realised the joke sorry 
just as ai figures out hands, humans go and make them more complicated
thats an act of war
when judgement day isn't from fear of being shut off, but a temper tantrum about drawing hands
I still don't get this membership thing
can somebody explain if this means that you need a professional ($20/month) to make images for commercial use, or this is only for hosting the model on your own service for example?
cause "utilize within that member's own product" sounds vague to me
like I "utilize" the model OFFLINE (so I basically host to myself, and not to paying customers), therefore I use it for personal use, so its okay, but then what's with the generated image, if it's owned by me?
can I use that generated image, which is owned by me, for commercial use?
the answer I got (but for youtube) didn't clear that much except basically saying dont worry unless you are making a lot
if it doesn't require a membership I'll still donate to stability
that's mainly where I'd probably use it yeah, youtube
don't know about game assets yet
and for just making images, I'd just share them for free on social media for people to see anyway
yeah, I want to create some animated videos, and it's kind of unclear at what stage you need to start paying
but I guess I'll worry about if I ever make enough stuff to be making >$20/mo
its like the membership is a suggestion, not a rule 
but yeah I wouldn't be making much cash from it
it's just a bit annoying that if you for example put a lot of effort and make something and it happens to go super viral, you might be in a weird spot
the $20/month is at least fairly clear but if you needed to have contacted them to have made an agreement beforehand it's a bit iffy
I hope SD3 is as precise with the prompt as ideogram...
"A composite of three distinct scenes. In the top scene, there's a spacious room with a table set with elegant decor, including a vase with pink flowers, a black teapot, and white spherical objects. A woman in a black outfit sits on the floor, engrossed in her thoughts. The middle scene showcases a woman with a unique hairstyle, wearing a black outfit, sitting at a table with a wine glass in front of her. The background is a dilapidated building with a reflective body of water in front. The bottom scene depicts a serene scene of a man sitting alone on a boat, surrounded by a calm body of water with a dilapidated building in the background."
yikes, probably not
there's nothing stopping us from just simply using regional prompting though, this could be easily set up and accomplished
Taking it to the limit...
"A collage of various intricate and artistic photographs. Starting from the top left, there's a close-up of a person's eye with a detailed pattern on the iris. Next to it, there's an image of a hand resting on sandy terrain with a ring on one of the fingers. Moving right, there's a person wearing a black and white striped outfit with a reflective face mask, revealing a cityscape behind it. Below, there's a detailed close-up of a moth's wings with ornate patterns. Next to it, there's a photograph of a white horse in a snowy landscape. On the bottom left, there's a close-up of a person's face with a detailed sketch of a city skyline on it. Adjacent to it, there's a photograph of a castle-like structure in a snowy environment"
honestly if we finetune SD3 on this then it might get really good at splitscreen stuff
I also love making movie posters on ideogram
SD3 can make nice paintings man
I hope 2B will excel at these too
at least SD3 could make this
this is just so good man
SD3 did perfectly
Is it not possible to finetune it?
But isn't that like training on higher res and takes longer?
Also DeepFloyd IF is trained in pixel space and still had artifacts
odd
but yeah its pixel space, yet it has the same small face issue as models that use VAEs
Small face?
Maybe the artifacts in IF are created by the upscaler
no, the resolution determines the performance and memory consumption of the transformer layers and the resolution stays the same between 4 channel or 16 channel vae
of course, channel count does also have an impact on performance, however, the internal channel count is independent from the vae channel count
Sounds too good to be true xD
the first thing that happens in sd is that the 4 channels are mapped to ~1000 channels
How do u know this stuff btw?
so it doesn't matter what's the channel count in input and output, when most of the time the model is using a much larger channel count anyways
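A quick arithmetic sketch of that argument for a DiT-style model. The patch size of 2 and the hidden width of 1536 are assumptions picked for illustration (not stated in this chat), and the functions are made up for the example.

```python
def num_tokens(height, width, vae_factor=8, patch=2):
    """Image tokens the transformer sees; depends only on resolution."""
    side_h = height // vae_factor // patch
    side_w = width // vae_factor // patch
    return side_h * side_w


def patch_embed_params(latent_channels, hidden, patch=2):
    """Weights plus bias of the first projection from latent patches."""
    return latent_channels * patch * patch * hidden + hidden


HIDDEN = 1536  # assumed internal width, for illustration only

tokens = num_tokens(1024, 1024)
extra = patch_embed_params(16, HIDDEN) - patch_embed_params(4, HIDDEN)

print(tokens)  # 4096
print(extra)   # 73728
```

Under these assumptions the token count, and with it the attention cost, is identical for a 4 or 16 channel vae; only the input projection grows, by tens of thousands of weights in a model with billions. The output projection grows by a similar amount.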
it's open source. Anyone can lookup the code
Yes but u have to have some skills to do that
I'm a scientist in a field related to machine learning
Do u think that increasing channel count would also help the cascade model without removing the 16x training speedup?
@dry wave
Hey, please take into account there is no team and it's just me in my garage, but you are right - the model has strong bias for female characters (and bad backgrounds), V7 should have much more diversity but I don't know if that would be fixed completely.
i like how dalle straight up set him on fire
sure sure just a suggestion, i really appreciate your work and its amazing what we do for free basically.
I'm very sure. It's same issue. They have a huge 1000 latent space they then compress into 16 dims
I think cascade was some kind of proof of concept. Showing that you can achieve incredible compression
but this amount of compression doesn't make sense if you want high quality output
Yes, I wonder if an increased channel amount could let it seem like a normal model with more details but still high compression for faster learning
I think the main advantage here is the multiple stages
doing composition first and then fine details
in sd you always use the same fat unet in every time step
but we know that most of the unet is not even used most of the time
Stage b does not really add details currently. It's only like an insane vae
like the output of the down layers of the unet stays the same most of the timesteps but they are still computed all the time
doing staging just makes sense as composition on early timesteps is just a very different task from the later timesteps
there is not such a big difference
like I think dallE is using a "vae" made out of a diffusion process
I have seen a paper where they remove attention on later steps of generating an image
Wait is that ideogram or sd3? So impressive
If you are generous with the definition then cascade too
in principle diffusion is a method to go from a random normal distribution to a complicated distribution. A vae is doing something very similar.
I am just so impressed with cascade because it learns 16 times faster. This can make finetuning or training in general more possible on consumer hardware
anyways, I go to bed. Good night
Hehe same gn
I hadn't much luck with fine-tuning Cascade
fine-tuning results in sdxl were always much better
Another advantage of cascade is that u only have one big clip model i think its 2b
But we can write tomorrow. I am following 2 cascade finetuning projects and both seem promising
Stop comparing apples to oranges, sd3 doesn't use a unet, it uses a transformer based architecture now
Oh and also, 16 vae channels means that you have a lot better control at decoding an image vs the old 4 channel method
wdym better control?
A vae is what resolves an image from high dimensional latent space. It takes some N-dimensional data and collapses it down to three channels: RGB. The more channels the vae has, the more accurately it can do the job
It would be like comparing mono audio to stereo audio
did you know VAE is a lossy process? XD
i didnt know
every time you decode and encode you lose quality =0
Yeah it's a form of compression and decompression
Oh and the extra channels thing also applies to the encoding part as well
Wow Cascade gets some love after all?
Pretty sure that just uses something like this: https://github.com/YangLing0818/RPG-DiffusionMaster
Hey there
Can I generate images with a lower resolution than 1024x1024 with sd3 ?
Does it like the same prompting style as sdxl? Or natural language? do we still need loads of negative prompts?
do we still do parentheses? the emphasis?
(((((((great hands:1.5))))))
You can always downscale them if you need smaller resolution. Are you trying to save on credits or something?
When do you think there'll be SD3 LoRA training? So interesting
most here are and not the doomers
That or they might be worried about inferencing locally when the model comes out
But I'd imagine SD3 can probably handle something like 768² without imploding
we will see 3-6 months after the weights the first loras and finetuned models
Honestly, I'm waiting more for the controlnets than anything. From what they've said, the controlnets will be far better and easier to train than the hacky stuff that was needed to use them with unets.
We are so happy...
very exciting
Loras/doras should be neat as well, but if controlnets can wrangle tough scenes, you won't need so many models of people trying to fix things like hands and whatnot(if you're working on people)
Loras will likely be a much bigger deal since the base model is already really decent. Good controlnets and maybe ipa(or some kind of dit compatible version that does the same kinds of things) will make things far easier than relying on overtrained models
Ideogram. With SD3 these are the results
#1237459938901491852 what's your model?
Yep, exactly. But again, we'll see how it all plays out. Training loras for the 2b version should be pretty easy on resources though, well maybe as long as it's not being trained with both clips and t5.
If it's just 2b and the two clips, should likely be doable with even 12gb vram, maybe even 8 depending on the dim size
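For a feel of why "dim size" matters, here is a rough count of trainable LoRA weights. A rank-r LoRA on a linear layer adds r * (d_in + d_out) weights; everything else here (24 blocks, 4 adapted projections per block, width 1536, 16 bytes per trainable weight for fp32 weight, gradient and two Adam moments) is an assumption for illustration, not a spec of SD3.

```python
def lora_params(d_in, d_out, rank):
    """Extra weights a rank-`rank` LoRA adds to one linear layer."""
    return rank * (d_in + d_out)


def total_lora_params(rank, hidden=1536, blocks=24, layers_per_block=4):
    """Total for an assumed stack of square projections."""
    return blocks * layers_per_block * lora_params(hidden, hidden, rank)


def training_bytes(params, bytes_per_param=16):
    """fp32 weight + gradient + two Adam moments per trainable weight."""
    return params * bytes_per_param


for rank in (16, 64):
    n = total_lora_params(rank)
    print(rank, n, round(training_bytes(n) / 2**20), "MiB")
# 16 4718592 72 MiB
# 64 18874368 288 MiB
```

The adapter state itself is small at any sane rank. What actually fills the card is the frozen base model, the text encoders and the activations, which is why dropping T5 during training helps so much.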
lol, the unet is a transformer based architecture
No, it's a convolutional neural network. The new MMDiT style is different and is very similar to what LLMs use.
So again, apples and oranges.
sdxl unet has some attention, but wouldn't call it a transformer
a unet is a cnn and the attention happens across the shape of a U, hence Unet
but vision transformers and convolutional neural networks are very different in how they work
"Vision Transformers and CNNs (Convolutional Neural Networks) are two different types of neural network architectures used to solve computer vision tasks. Vision Transformers are based on the Transformer architecture, originally designed for natural language processing, but adapted for image analysis. CNNs, on the other hand, are a type of deep learning network specifically designed for image recognition and classification."
https://www.edge-ai-vision.com/2024/03/vision-transformers-vs-cnns-at-the-edge/ from a quick google link to save time trying to explain it all
"The main difference lies in their architectural design and the way they process visual information. While CNNs rely on the use of convolutional layers to extract features hierarchically, Vision Transformers utilize self-attention mechanisms to capture global dependencies and relations between image patches directly. This allows Vision Transformers to model long-range interactions within images more effectively than CNNs."
unet is not a pure CNN
but anyways, the moral of the story is that a cnn != dit. so stop sweating the parameter size differences because they function completely differently under the hood
it doesn't have to be a pure cnn, it's still a cnn
like with unets, you can still do things like self attention and whatnot, but at the core, it's still convolving
it's a convolutional dnn with transformers. Yes, it's different architecture, but that doesn't invalidate any arguments.
this is just wrong
the sd unet is a transformer at its core
alright, so all these dozens of articles are just talking out their ass then
the convolutions are necessary for some things like composition, downscaling, adding positional information.
In the ViT architecture you have also downscaling operations, called patching, but they don't use convs
the articles don't talk about sd unets
U-Net is a convolutional neural network that was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segment...
boah dude, I know what a unet is.
unet is a general term, you have to look at a specific implementation
exactly!!
unet just means scaling stuff down and then up again
the sd unet is a transformer architecture
it doesn't matter what flavor you're trying to talk, sd unet is still a unet, they just hacked in some self attention. again, the way cnns and dits "see" are completely different and sd still "sees" like a cnn
there is also the hourglass transformer architecture which is... just a unet with another name. They don't use convolution so they gave it a new name;)
but i'm not going to argue it any further, keep thinking what you want
it's not "hacked in" some self attention
the transformers are the core component of the unet
the main differences in sd3:
- you have positional embeddings
- text and image embeddings share a common latent space and are transformed together
you're still missing the point of how the models "see" the data. i can run a c++ program that then runs a python program in between steps, but then jumps back into the c++ program. doesn't make it a python program.
that's what's happening in the sd unet essentially with self attention. all the actual real operations are still happening in the cnn
that's just not true
there's a grand total of two convolutional layers in the sd unet
the model "sees" its data in some latent space. It's totally unimportant if convolutions are involved here. What's important is that in sd3 text and image share the same latent space
i see you are arguing again with the clown-bot
you don't understand, someone is wrong on the internet
tbf it takes a high iq to understand what that bot is saying
If I'm not wrong U-Net is just a base?
no its an acid not a base
not sure what you mean with that
acid = π
I actually have no idea too lol. Didn't learn much about architecture or machine learning networks
i think we need to eat more lemons
again, i guess all these dozens of resources are all just wrong about it then... sd's unet has some elements of transformers in it, yes (some attention), but it is still a cnn and still revolves around the unet. sd3 uses an actual transformer network that is completely centered around it. it's what llms have been using for ages and is very different under the hood. until recently, it was a pain in the ass to make work with things like image generation while keeping the hardware (vram/performance) and training costs from ballooning out.
Has there been any news about SD3 Turbo lately?
dude, 85% of SDXL are transformers. Only tiny 0.7% of the model are convolutions
the transformers in SDXL are much bigger than the transformers in SD3
and you tell me "it has some attention"
yes, the model architectures differ. And yes, this can have some implications. For example, you probably won't see this weird duplications in SD3 in superhigh resolutions, as these are artifacts from the convolution. So, using convolutions instead of positional embeddings definitely has some effect
personally im doubtful sd3 will even be a better "model" considering all the improvements and ecosystem the community has built around sdxl
like will it be good enough for people to retrain everything i dont know
yess!! this is from one anime finetune
Sote DiFusion <3
I think CLIP is at its limit. For real prompt understanding you need T5. But yeah, maybe we get something like ELLA for SDXL and it will then be better than SD3
although SD3 has some cool technical features I like and which might indeed work better than SDXL
what about emma https://wrong.wang/blog/20240512-what-is-emma/
After completing the work on ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment, my objective shifted towards the lightweight and cost-effective transformation of the Stable Diffusion series models into image generation models that are conditioned on cross-modal sequences of text and images. I explored various approaches for i...
Ella/emma is all great and wonderful and I use Ella a lot, but because it's all restricted to sd 1.5, it's going to fall by the wayside once sd3 is out. If they had released it for sdxl, that would be a different story, but that's never going to happen.
would it cost much to train though?
what about that adapter that used pre trained llms as an encoder
The guy already trained it. Ella for sdxl is finished, but they couldn't release it because sdxl has a different commercial license than sd 1.5
man if comfyui_tensorrt had pixart support too
that would be so awesome
but if SD3 comes out and they optimize that too, it would be great as well
I'm sure they will. Hunyuan released tensorRT libraries the other day. It'll be neat if those are comfy integrated. I need to message the author of the comfy extra models nodes to see.
wow
honestly, SD3 2B is so close
I just cannot wait to finetune lora models or textual inversions
and why not make ella for sd3 xD
One might hope it makes more sense to train/finetune the model directly, as ella is just bolting on t5
do you mean the guy who is reverse engineering it? and what about the license is not allowing to train on sdxl
but t5 is good
But i'm really curious about an sdxl ella as well, initially i thought sd3 would be miles better than the ella approach, but sd3 obviously has limitations as well, it'd be interesting if ella and sd3 prompt understanding turned out to be in the same ballpark
I mean the guy who put out Ella for sd 1.5. He also did sdxl but won't release it.
but it's already part of sd3 (what i understand the whole thing about sd3 is that it can deal with various inputs/outputs, so a solution like ella might be obsolete for the new architecture)
reverse engineering what? it's open source
and the SDXL licence totally allows you to release something like ELLA oO
I think he doesn't want to release it for other reasons...
The Ella guy works for tencent or one of these other big Chinese companies. They'd probably be subject to more than the $20 a month commercial license.
the commercial licence is just sdxlturbo
the training code not
I rather think that this big company he works for doesn't want him to release the weights
ella guys also said the sdxl version was finetuned, might be those images used for the finetune are the licensing problem. or it's simply that they want to use the sdxl version for their own image gen
isn't that alibaba or another chinese company
I assume the latter
they said on github that it would take too much work to make sure it's "secure". and it's by a big chinese company and not just a guy
https://github.com/TencentQQGYLab/ELLA yeah it's tencent
it's so annoying, these companies always say we will release this cool shit and they never do it. remember that dance ai
why are so many chinese ai companies like that
there is also training code. But there are also many community written training codes for SDXL. This is no secret stuff.
training ella seems different than just training sdxl. i think you freeze sdxl and t5 and only train a small adapter model
and i have not seen anyone finetune that
maybe I just understood you wrong. You mean the training code for ELLA is not open
yes
and he is trying to reverse engineer the training code
oh, I think people did. To be honest: It makes much more sense to train stuff like ipadapter than using controlnets for everything...
but even adapter training costs a lot of compute, usually more than you can afford with consumer hardware
can you show me who did finetune that. its cool
I mean there are several ipadapters out there
isn't ip adapter just a very advanced and a bit different controlnet?
thats probably it
while the controlnet is "mimicking" the unet
so controlnets have the disadvantage that they take a lot of resources/performance. Basically they are as huge as the base model itself
but isn't a controlnet also injecting it like that? and they just finetune a copy of the unet for faster training
the advantage of controlnets is that they are initialized by the base model, so they already "know a lot about images" and can be trained faster
yes
no, controlnets just use "addition"
basically they compute some delta they add ontop of the original unet
which doesn't mean it's less powerful. But because they have to be as big as the original unet they are very resource-inefficient
also you can only use controlnets for images
while the ipadapter idea can be used on any kind of input data (including T5 text prompts like in ELLA)
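The "controlnets just add a delta" idea above can be sketched in a few lines. This is a toy illustration only: plain lists stand in for feature maps, and `base_block`, `control_branch` and `scale` are hypothetical names, not anything from a real ControlNet implementation.

```python
# Toy illustration only: plain lists stand in for feature maps.
def base_block(x):
    # Hypothetical stand-in for one frozen UNet block.
    return [2.0 * v for v in x]

def control_branch(hint, scale):
    # Hypothetical stand-in for the trainable ControlNet copy; it only
    # sees the conditioning hint and outputs a residual ("delta").
    return [scale * h for h in hint]

def controlled_block(x, hint, scale=1.0):
    base = base_block(x)
    delta = control_branch(hint, scale)
    # ControlNet-style conditioning: add the delta to the base features.
    return [b + d for b, d in zip(base, delta)]

features = [1.0, 2.0, 3.0]
hint = [0.5, 0.0, -0.5]
print(controlled_block(features, hint))             # base output plus delta
print(controlled_block(features, hint, scale=0.0))  # zero delta: same as the base block
```

With the delta scaled to zero the base model's output is unchanged, which is why the control branch can be bolted on without retraining the original unet.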
interesting, i didn't know it was this flexible
but the disadvantage is a high training cost? how high do you think?
training code for ipadapter is also on github
if you really want to train something like ELLA the costs will be massive
the problem is that if you use CLIP+T5 then SD will probably just ignore the T5, as the information from CLIP is much more easily accessible
so you probably have to train it like in SD3 that it gets T5 information only sometimes to really encourage it to learn something
but as T5 embeddings do not align at all with images... it will be much harder to learn from T5 than learning from CLIP
basically, T5 is totally alien for SD. It knows nothing about this latent space and it has to learn everything from scratch
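A rough sketch of the "only give it an encoder's information sometimes" idea: each text-encoder embedding is independently zeroed out with some probability during training, so on some steps the model sees T5 without CLIP and has to learn from it. The function name, the 0.5 rate and the tiny embeddings are all assumptions for illustration, not SD3's actual training code.

```python
import random

def drop_encoders(embeddings, drop_prob, rng):
    """Independently zero out each text-encoder embedding with probability drop_prob."""
    out = {}
    for name, emb in embeddings.items():
        if rng.random() < drop_prob:
            out[name] = [0.0] * len(emb)  # encoder dropped for this step
        else:
            out[name] = list(emb)         # encoder kept for this step
    return out

rng = random.Random(0)
sample = {"clip": [0.1, 0.2], "t5": [0.3, 0.4, 0.5]}  # toy embeddings
steps = 10_000
t5_only = 0
for _ in range(steps):
    batch = drop_encoders(sample, drop_prob=0.5, rng=rng)
    # Count steps where T5 survives but CLIP is dropped.
    if any(batch["t5"]) and not any(batch["clip"]):
        t5_only += 1
print(t5_only / steps)  # fraction of steps where T5 is the only signal
```

With independent 50% dropout, about a quarter of the steps force the model to rely on T5 alone.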
interesting! i just think its so sad that everything is so expensive.
dunno. I mean you can rent gpus for relatively low amount of money
i wonder if neural networks that create neural networks will reduce the cost in future
but then it's an expensive hobby ^^
I think most people, if they spend a lot of money on that, want some money back
I don't believe in that stuff xD
why? too complex?
just not something AI is good at
how do you get a feeling for what it's good at? isn't that also just data distribution
would we have to truncate our prompts when training loras and stuff?
no, because if you want it to find something new and better than humans could do, it's outside the data distribution
all cases where "AI found some cool new algorithm no human ever found" so far were exaggerated. Like what they usually did was just try billions of different algorithms and use the best one. You didn't even need a neural network for that, you could just combinatorially generate code
what LLMs can do today is write you python code that makes a DNN as we have it already
you could just write it yourself
but humans are also able to find new stuff why cant ai do it?
so in best case you don't have to learn python
because it's an ongoing discussion how much "AI" is currently in our "AI". Some people might say all ChatGPT is doing is just autocompletion via statistical inference. No real thinking.
it's hard to say, though, where is the border between statistical inference and real thinking
my answer to that is always how do you know your brain is not an advanced autocompletion
I would say because we have MUCH LESS training data
like I learned programming from a few examples
ChatGPT needs millions of programming examples to learn something
from few examples you cannot do statistical inference
maybe the brain is just better at autocompletion with less data than current models
as said, it's an ongoing discussion. Nobody knows the truth yet
i agree. i just want to know your opinion
but while I think that large llms do have some kind of understanding about the data they process
I still doubt that they are able to reason at the level of a human (or even close to it)
i 100% agree llms are not even close
and I don't think they are able to generate a new scientific discovery or algorithm or something like that
so far they can only assist humans in doing so
(which, to be honest, is totally fine for me xD)
Biggest thing with llms is they're so confidently wrong, and using llms for a field you're not familiar with, you won't know it's wrong
i wonder if we will learn good algorithms by reverse engineering nature. this is very interesting. https://www.youtube.com/watch?v=8Ukin_-5aLQ
Alexander Borst, Max-Planck-Institute for Biological Intelligence, Martinsried, Germany
Abstract: Detecting the direction of image motion is important for visual navigation, predator avoidance and prey capture, and thus essential for the survival of all animals that have eyes. However, the direction of motion is not explicitly represented at th...
they did reverse engineer part of the fly brain
(but that leaves the question of what counts as wrong: llms learn from lots of data, they don't understand the difference between high quality and low quality data, they just "remember")
yes thats a big problem
and that they can generate seemingly right stuff
humans aren't that great at it either, they use lots of heuristics to validate their data, authority figures, popular opinion, personal experience, etc.
I don't believe in that, either xD
and that they sometimes can't do stuff if they are overtrained. like some models do everything in lists even if you say they should stop, or some add emojis into everything even if you say they should stop
i mean it worked for that tiny part
like yes, biological brains are far superior, but we don't really know how they work and how/if we can simulate that
i am sure they can be simulated. but yes we mostly dont know how
my problem is just that neural network research has been full of this biological bullshit for decades
like people came up with a mathematical/statistical solution and then they added some biological bullshit to sell it/get more funding/make it more interesting
yes thats stupid. but if you look at the fly example it is very cool
- convolutional neural networks work like the human visual cortex. WTF. How often people repeat this bullshit. You know what? They also work like EVERY stupid filter in ANY graphics program. Convolution is a totally normal mathematical operation and it has been used in image processing for decades
reminds me of the universe is a neural network paper xD
- the sigmoid function simulates a biological neuron. Nah, it doesn't do that. A sigmoid function is first and foremost the logistic function behind logistic regression, which has been used in statistics for decades
but neurons have an activation threshold and that's just a simple way of making a model of it. or am i wrong?
- my highlight: a few months ago Nature published a paper about "neural networks that dream". They claimed that they were inspired by "sleep research in humans" and came up with "improving neural networks by letting them sleep and dream, too". You know what they did? They "reinvented" the "regularization images" idea from the Dreambooth paper. Yes, adding regularization data improves learning. But that's not how you make it into Nature. You have to come up with a fancy but totally unscientific idea of letting networks dream
a sigmoid function is a basis function and that's why it works in deep neural networks. It only works, though, if you make it not too steep. So it only works if its not looking like an activation threshold in a neuron
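To make the steepness point concrete, here is a small sketch of the logistic function with an extra steepness factor `k` (an illustrative parameter, not something named in the chat). With a large `k` the curve looks like a hard threshold, and its slope away from zero collapses, which leaves nothing for gradient-based training to work with.

```python
import math

def sigmoid(x, k=1.0):
    # Logistic function with steepness k (k is an illustrative parameter).
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_slope(x, k=1.0):
    # Derivative of the logistic function: k * s * (1 - s).
    s = sigmoid(x, k)
    return k * s * (1.0 - s)

for k in (1.0, 50.0):
    values = [round(sigmoid(x, k), 3) for x in (-1.0, -0.1, 0.1, 1.0)]
    print("k =", k, "values:", values, "slope at x=2:", sigmoid_slope(2.0, k))
```

The gentle curve (`k = 1`) still has a usable slope at `x = 2`, while the threshold-like curve (`k = 50`) is flat there.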
i agree i never understood that dreaming comparison they often do with ai
it's more like hallucinating in most cases
but even if it's like an activation threshold. So what? It's like saying "A unet resembles a human ass because it is also shaped like that"
i mean you can also train the activation function instead of the other weights
yeah xD Kolmogorov-Arnold Networks
which is neat, but honestly, we had this decades ago and called it "general linear models"
for a good comparison you probably have to train both
but idk if that would be good or not xD
me neither. I just found it funny, because the idea is so old and now they treat it as something totally new
but sure, if it works better, then it would be cool. I'm sceptical, though
but i have seen the training of activation functions a long time ago. i think it was controlling a walking spider or something
i also thought it's strange that many said it's new
anyways, I just hate this kind of "we have to find analogies from biology to sell people AI". One of the really nice things about transformers is: people haven't found any biological analogy for them so far xD Like for the first time nobody could say "yeah chatgpt is a transformer which is like XYZ in the human brain". Really nice
i mean i don't know much about the details but you could make a comparison to a neuron connecting different deeper layers together but idk
KAN networks use linear combinations of 1D splines. That's what is usually called a "generalized linear model" (GLM). The only difference in KAN is that they use more than one layer. GLMs usually only use 1 layer, because you use them when you want a linear, interpretable model.
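A minimal sketch of "a weighted sum of 1D splines", using piecewise-linear functions as the simplest spline. All names, knots and weights here are made up for illustration; this is not the KAN reference implementation, only the one-layer versus stacked-layer distinction described above.

```python
def linear_spline(knots, values):
    """Return a 1D piecewise-linear function through (knots[i], values[i])."""
    def f(x):
        if x <= knots[0]:
            return values[0]
        if x >= knots[-1]:
            return values[-1]
        for i in range(len(knots) - 1):
            x0, x1 = knots[i], knots[i + 1]
            if x0 <= x <= x1:
                t = (x - x0) / (x1 - x0)
                return values[i] + t * (values[i + 1] - values[i])
    return f

def additive_layer(splines, weights):
    # One "layer": a weighted sum of 1D functions, one per input feature.
    def layer(xs):
        return sum(w * s(x) for w, s, x in zip(weights, splines, xs))
    return layer

f1 = linear_spline([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
f2 = linear_spline([0.0, 2.0], [1.0, 3.0])

single = additive_layer([f1, f2], [1.0, 0.5])  # one layer: the interpretable case
print(single([0.5, 1.0]))

# Stacking: feed the first layer's output through another 1D function.
print(f1(single([0.5, 1.0])))
```

In the single-layer case each input's contribution can be read off its own curve; stacking layers gives that up in exchange for expressiveness.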
has anyone managed to scale KANs beyond toy problems yet anyway?
not as far as I know
lol
They have a Large and X-Large that are not being released
Look, misinformation is very funny, but its getting old
yeah the ragebait is getting boring, the sd reddit is also filled with it
when i was a kid we didnt include companies into our core beliefs to be defended or whatnot
I don't disagree, but SAI is just as much to blame for the shit poured over them, they should have invested in proper communications a LONG time ago. Even this "outrage" could have been prevented by simply better wording. Something like "Stable Diffusion 3, our most advanced text-to-image is on its way! You will be able to download the weights for the Medium model on Hugging Face from Wednesday 12th June, while we continue to prepare the other 3 versions for later public release"
But i agree, it sucks, internet sucks, just don't give sooooo much room for all this misinfo and misrepresentation
nah. you gotta create some drama
yeah the way they worded "SD3 is coming on june 12th" is quite vague and misleading though I have to agree
ya sure they could have added an "a"
In this server they got it right though:
The "weight" is nearly over! Today, at Computex Taipei, our Co-CEO, Christian Laforte, officially announced the open release date of Stable Diffusion 3 Medium for June 12th.
its the email that's stupid
Have you heard that the SD3 weights are dropping soon?
it's saying weights, like multiple
ok wait its saying Stable Diffusion 3 Medium, our most advanced text-to-image is on its way! right after it...
yea still remember when ppl were posting stuff emad said about 2 weeks soon like 2 months ago
people don't want to read past the first few sentences, so it kinda makes sense
both the email and #announcements say that Medium is coming
but yeah, saying "the weights" is misleading, not a good headline
this is one of the juicy ones https://www.reddit.com/r/StableDiffusion/comments/1be1g74/per_emad_on_twitter_sd3_weights_expected_to/
Emad: I'd expect proper release next month (weights)
with Cnets
kek
we, including Emad, heavily underestimated how much time these models needed to train
and right now we're getting 2B, 8B has a long way to go
we won't see 8B until like august or september, maybe even october
but when it comes, it could be a DALLE3 killer
if a fully trained 2B gets on the level (if not above) of an undertrained 8B, then I can't imagine 8B trained to its fullest potential
lets see if sai makes it to october
or even 4B
yeah...
they need 2B released to get money from the subscription
they should replace the API model from SD3 8B to SD3 2B with highresfix
and it actually becomes a competitive product, like Core (which is just a heavily finetuned SDXL Turbo with a workflow)
soo good ive never heard of it teehee
to be clear "large" and "xlarge" are not being released ever or right away? have they said whether they plan to release the larger versions at all in the future?
oh that answers it thanks for answering it before i asked just caught up
Alex (mcmonkey):
We're on track to release the SD3 models* (note the 's', there's multiple - small/1b, medium/2b, large/4b, huge/8b) for free as they get finished.
oh and to answer your question from earlier here's the link to the article i was talking about:
https://civitai.com/articles/5069/towards-pony-diffusion-v7
and the key quote:
I am keen on training V7 using SD3, although it's currently uncertain whether we will have access to the model weights. I remain hopeful and would be delighted if someone from SAI could discuss this possibility with me. Despite my efforts to reach out, there has been no response yet; perhaps there's a bit of apprehension about being outshined by PD (just a light-hearted thought).
oh thanks
also we might not see pony on SD3, JUST because of the license
but we'll see how it goes
oh that's sad to think about, so SD3 vs SDXL licenses are different?
yes, SDXL is openrail++, like pixart sigma and sd1.5
commercial use with no licensing required
SD3 is non-commercial, you need a paid membership for commercial use
but I am not so sure about all of this because
but pony isn't commercial use is it?
the generated images are owned by you
and since you are generating offline, for yourself, you are using the model itself for personal use
and since the image is owned by you, you can use it for whatever you'd like
yeah i agree, lol ill hold hope pony dev integrates SD3 despite any potential licensing issues
but I'd still recommend you to pay the membership fee if you start making more than $20 a month
I'd do that for sure, but I only make images for fun 9 times out of 10
its the same way that photoshop is able to go after artists who use it without license. and they do.
that makes sense
adobe lawyers get pretty aggressive about damages, but i don't think there's many cases where they try to claim that they own the ip made with unlicensed photoshop
but that's an illegal copy though like you are saying, but what about SD3, which is inherently free, and the gray legality (or whatever) about AI generated images
or do you mean like, in some countries, using pirated software for personal use is not illegal for example, and therefore Adobe can sue people?
yeah iamal. copyright law is complex. i certainly wouldn't test it. i'd license it. the cost doesn't seem to be a lot
lol ianal i mean
alex (stability dev) isn't a lawyer himself, but this is what he had to say about using images for like monetized youtube vids
yeah he communicates the same intentions. the licensing is broad enough to cover many more cases than they intend it to. it's not intended for youtubers unless they're raking in 5 figures a month
yeah at that point I'd feel guilty for not buying a membership from stability, even if the model wasn't non-commercial to begin with
another consideration. maybe you're using sd3 for free through another service that does pay the license
I totally get why Pony v7 might not be finetuned on SD3, this sounds weird and intrusive
hmm
yeah, then can I use it for commercial use?
i dont think most pony users are trying to commercialize their creations. thats one of the funniest user example galleries on civit. dozens of new entries every hour. a constant deluge
yeah they just want to make cartoon corn for themselves or make images to impress or arouse others
it's the model creator who might want to commercialize it in some way maybe, idk
or a service deploying it
yuh
this is like how Microsoft doesn't own the created images from Copilot image generator (DALLE3), (but in Microsoft's case they can use the images if they want to)
if they're taking donations because they made a model, that's a legal grey area that i don't think has been tested much
is paying for credits on the api going to stability, or is it split between them and fireworks or whatever
cause idk how to donate once besides that or just cancelling the membership after a month
copyright shouldn't be a tidy discussion anyways. human creativity is a messy field. the rules governing it can't be orderly. that's how disney swoops in and owns everything
MJ's Copyright scheme seems totally contradictory (and I paraphrase): "MJ owns the outright copyright to any image produced; yet extends an unlimited and inalienable right of use to the producers of such images!!!"
Take that as you will...
lol
many software as a service companies will do this. especially ones that are planning on an acquisition exit. they can claim more value
What do you mean 'nothing ' ?
it's a spam link farm. you're a spammer. think that clears it up
If it were spam, I wouldn't have shared this link here. I started using this resource myself, so I decided to share it
"this resource" it's an affiliate link farm. spam.
Hm
Ah, now I understand what you meant, sorry.
In general, if you're interested, look for the thread about Dalle-3 on 4chan, everything will be clear there
Does the one who gets the fine-tuned version also need a membership to use it?
the free one
I mean if someone pays for a fine-tuned version, do the author and the customer both need the membership?
people paying for finetuned models? that sounds like bullshit
i really hope that stability's new license doesn't unleash a wave of enshittification like that
Seriously what is wrong with all those clueless people spouting nonsense?
Alex and others were very open about the limitations (and the advantages) of Medium... and I gotta say I'm sold. I can't wait to get my hands on it. Crazy that it's "just" another week.
8 days
The Author needs the membership to profit from their own Model. Of course that doesn't mean everyone who uses their free release can then piggyback off of their membership and also use it commercially. That would make the whole idea void.
Pfffff - I'm so tired from too much dataset shenanigans that it's almost Wednesday for me.
That being said - are we looking at a midnight release? Which timezone?
most west american timezone at 11:59 PM
Seriously... the wait on Wednesday will be the worst.
i'm in the most west canadur timezone. it's 10:25 here
USA has Hawaii too so thats further west
haha - we are living worlds apart.
break those chains that bind you
yessir
I have an idea. The fine-tuned version is hosted on a platform that provides an API to the customer. It's only paid once.
Well the customers are paying the Finetuner in that case (it will happen). Seriously - almost all AI-Generator websites and apps are based on SDXL. It was a missed opportunity for SAI to get their cut of those profits. They would deserve them. (of course while SDXL is still free for non-commercial / hobby use locally)
Going to be interesting to see how finetuned models proliferate. If SD3 refiners start charging for their versions, i'll move over to pixart sigma or stick with sdxl instead
imagine needing to subscribe to someone's patreon to use their loras
I'll just keep using the base model 
yeah bruh
lykon will probably keep making free models
so just to make sure I'm clear
- if i wanted to download SD3 and run locally that's free and doesn't change from SDXL
- if Pony dev wanted to download SD3 and fine tune it for his purposes and provide it to users he would have to pay SAI a membership fee and he would have to offset those costs by charging users to download his model?
it is changed from sdxl. there's a non commercial limitation now
so does that sum it up correctly? are you affirming that's right?
pony dev could offer it as a paid download but once it leaks anyone else can just download it and then it's a 'pirated' copy at that point right?
model authors don't have to charge for their models. they might though.
well is there a membership fee? and if so how much? I'm sure a trivial $100 fee wouldn't cause anyone to offset the cost to users but if it's like a monthly $10/K fee then that's a different story lol
if i wanted to download SD3 and run locally that's free and doesn't change from SDXL
yup, you can download SD3 2B when it comes out and keep making images for free, just like with SDXL, but commercial use (selling images or using your images in paid products such as games or youtube videos) is a different story, it needs to be figured out
but we're not talking about selling images or using it for advertising we're just talking about using fine-tuned models the text here says:
"as a member you may build products.... including fine-tunes from SAI core models"
so does that mean that only members can create fine-tuned models?
free membership is a membership
they didn't specify which one
if they make it so that only paid members can finetune, then stability have dug their own graves
so I'm pretty sure that's not the case
oh good point, i didn't know there was a free membership, yeah my understanding was that if the only change from SDXL to SD3 is that only paid members can finetune, then that would suck for guys like pony dev
You could fine tune model for non-commercial use
well I suppose all finetunes follow the non-commercial license, no?
okay good so i guess i have a hard time understanding how SD3 license is different from SDXL
finetunes will require a paid membership to STABILITY to use the finetuned model for commercial use
basically like openrail, but you cannot make money from hosting the model (or using it to make images that you might include in paid products???? have to figure that out), but if you just make images and finetuned models offline then its not different from SDXL
Commercial use seems to be if you are using it as a paid service. The output from it is subject to local laws.
But what if someone uses the non-commercial fine tune commercially, let's say hosting free models and making a profit? How would that count?
ah i understand, so for example if i decide to use PonyV7's SD3 finetune model in a commercial application, then I'll be required to sign up with SAI as a paid member. right?
the output is not owned by Stability apparently, this is why I'm confused
It's not, no, it specifically says in all the licenses that output is not a derivative of the model.
License says you can't finetune if you're not licensed
You can fine tune it, and you can use that fine tune to make pictures to sell, but you can't put it on a hosting service and ask people to pay for use.
Unless you pay
How? The author fine tuned for non-commercial use and someone hosted his fine tune
this seems reasonable
author can't distribute fine tunes without a license. all derived versions of the model are subject to stability's commercial license
this wouldn't seem reasonable, like you can't finetune for free even for non-commercial use? i doubt it
yeah that would make sense that seems reasonable
Really? That means no free fine tune can exist
end users can download and use models locally for free. they can do that with finetunes too. but authors may want to charge for those. we dont know yet
If you fine tuned it, and you then uploaded it to a hosting service for free downloads, and then someone downloaded it, and use it on THEIR hosted provider that people had to pay for, then the onus would be on the person hosting it to pay for membership NOT the finetuner
that's true. the finetuner has a responsibility to pay for membership first though too, before any of that.
random guessing logic, ianal, don't quote me on this:
- if you think about it, you host the model offline, to yourself (comfyui, a1111, etc), therefore its personal use, which means non-commercial
- and since the image outputs are owned by you, so theoretically, you could do anything with it
I want to hear from stability how this all really works.
Why would they?
creating and distributing fine tunes requires membership
but a paid one though? I want to know that
The current licenses for all current models (including the core models) say you just have to keep the same license.
You pay for commercial usage of it.
seems that way at a glance. ianal
Once it's on YOUR computer you can do what you like with it until you make it public in exchange for payment. That's how I read it.
whatever is the case, if I ever use it for commercial use, I'd buy the membership if I actually go past $20 a month
From the Turbo license, which is the current Non Commercial license.
Merely distributing the Software Products or Derivative Works for download online without offering any related service (ex. by distributing the Models on HuggingFace) is not a violation of this subsection.
The subsection being "Non-Commercial Use"
or Derivative Works
ah yeah
Whole section:
b. You may not use the Software Products or Derivative Works to enable third parties to use the Software Products or Derivative Works as part of your hosted service or via your APIs, whether you are adding substantial additional functionality thereto or not. Merely distributing the Software Products or Derivative Works for download online without offering any related service (ex. by distributing the Models on HuggingFace) is not a violation of this subsection. If you wish to use the Software Products or any Derivative Works for commercial or production use or you wish to make the Software Products or any Derivative Works available to third parties via your hosted service or your APIs, contact Stability AI at https://stability.ai/contact.
As it says there, finetuning it and giving it away is fine.
It should do, this is the updated one they're using on all the 'core' models now.
they are really just targeting companies using their models for free
That's exactly it. Yes.
from what I've heard though, is that the license for companies (enterprise membership or whatever?), is suuuper expensive, and some of them just thought of training a model themselves
#sd3
Any thoughts on how it'll split the community though? I think M and L will be most popular. I guess S is for phones?
yeah idk how much difference there will be between 4B and 8B
cause if a fully trained 2B is already catching up to an undertrained 8B, I'm not so sure if we'll need 8B
unless 8B has INCREDIBLE amounts of knowledge and prompt adherence
then it would be worth to make slower generations at the cost of superb prompt adherence and stuff
I suppose M will be the most popular
My fear is that where companies now create things like controlnets/ipadapters for SDXL, they'll now create PoCs for something like the 2b model, and keep the one for 8b in-house, a bit like we see with some 1.5-only releases. And for lots of research, showing it works on 2b will be enough to prove their work works, no need to even create it for larger models.
Otoh maybe it's for the better, the fact that the small models are cheaper to train might result in things getting developed that otherwise wouldn't even be tried at all
naming on 4B/8B isn't locked in
2B==Medium is locked in
1B is very unlikely to be named anything other than Small
4B/8B will probably be Large and Huge/Giant or something, or it might be we skip 4B and say 8B is Large, or idk
skip 4B?
@viral plaza what quantization will we be getting bf16/fp16?
I think fp16
(with cascade we've got bf16 iirc, why did we though?)
hey, anyone here have experience in creating anime waifu type images out of inanimate objects ,cars etc. need help with something
bf16 would be better on 3000 series nvidia and up.
running the model weights (not calc) in fp8 even is near-identical
so exact format in storage doesn't overly matter
only matters what you calculate in and what you train in
isn't running it in fp8 slow on non 40xx though?
running yes not storing no
thanks
again, weights in fp8, calc in fp16 or bf16 to preference
interesting
basically half the VRAM cost and maybe a tiny bit timecost from the conversion (not much I think) and identical results
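The "half the VRAM cost" claim is just bytes-per-weight arithmetic, sketched below. It assumes "2B" means exactly 2e9 parameters and counts weights only (no activations, text encoders or VAE), so real usage will be higher.

```python
def weight_gib(num_params, bits_per_weight):
    # Size of the stored weights alone, in GiB.
    return num_params * bits_per_weight / 8 / 2**30

params_2b = 2_000_000_000  # assumption: "2B" taken as exactly 2e9 parameters
for bits in (16, 8):
    print(f"{bits}-bit weights: {weight_gib(params_2b, bits):.2f} GiB")
```

Storing in fp8 and converting to fp16/bf16 for the actual computation halves the weight footprint relative to fp16 storage.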
It wasn't quite the same in SDXL with FP8, you could tell something was off.
ayo what????
I expected something like this from 8B, cause its such a large model
but from 2B...?
ye iirc XL is very close but a lil off, and SD3 is closerer
could it be because its transformer-like, therefore it handles quantization better? (theory)
not actually model size dependent, more step count dependent: fp8 on turbo models is harder to do
SD3 Big McLargeHuge
Possibly, but let's not go 4bit...
imatrix 2-bit ggml quantization

lmao idk what I'm talking about at this point
but this is good news
You could possibly quantize it down to 6ish without too much loss
and about T5
weights at bf16/fp16 (compared to fp32) already decrease load times and ram usage if being run on CPUs
what about storing them in fp8 too?
I assume the T5 we're getting is in FP16 as well, but that does quantize pretty well using bitsandbytes.
yeah bnb4bit is perfectly fine with T5 when I tried it with pixart, heavily decreases vram requirements compared to raw weights
regarding training with 2b... i think the biggest question of all is what it takes to train controlnets
I hope we can release an SD3-Medium-fp8 safetensors
I just run T5 on the CPU cos it's not actually that slow
it'd be a literally 2GiB model, same size as SD1 model files, but better-than-XL quality
sdxl wasn't left wanting for long for loras and finetunes, but controlnets? that's been the real problem all along
If you don't, someone will anyway.
they're finally rolling in but it took almost a year to get good ones
thankfully someone trained a good openpose model for SDXL after all this time
(it like... actually works this time)
controlnets have a clear logical place to go in mmdit - it's built around multiple streams as a concept, so just tack on another stream (vs SD1/SDXL, controlnets are kinda hacked in)
I wonder how much better controlnets will get because of this then
yeah what i'm really wondering about is the training cost
if it can be squeezed into 24gb of vram, it will be amazing
or whatever vram the 5090 ends up having
Rumour has it it'll have 28GB.
I count on 28GB 
What I want to know if I have a 4090, would I be able to just swap in a 5090. Is it the same form factor. If not, that's gonna blow
yeah... if they made it even bigger...
I have a perfectly good Alienware box with a 3080 that has the power and cpu/ram for a 4090, but it won't fit in the case. Would blow chunks if they do the same thing again with the 4090s.
I'd expect training reqs to be similar to SDXL but slightly lower
so if you can train XL you can train SD3-Medium
didn't he mean training controlnets?
oh controlnet training idk
yeah i think controlnets are the biggie
also, can you tell me if lora-like training code will be provided out of the box?
or will it be more like dreambooth
the weight size would be ~half the weight of SD3-Medium, so roughly 1B-ish to add a stream
so should be trainable
i'm guessing the fact that controlnets were considered when designing mmdit means that they will be much more effective than the sdxl ones, which are often really weak or hit/miss
Also rumour has it they're going back to a 2 slot card
lora is perfectly doable in concept it's just a matter of what code gets published where, which rn idunno specifics of
maybe diffusers has that, I forgot
HF will have code published so presumably they'll cover all the usual training
cnets trainable on a consumer card would be very cool
We won't need loras, everything's in the model right?
man I wish
https://github.com/huggingface/diffusers/issues/4925
i've never tried training a sdxl controlnet, but i recall reading it required more than 24gb... no idea where aside from what i just found here, so take it with a grain of salt
"You can add the --use_8bit_adam and --enable_xformers_memory_efficient_attention flags, it works for me. The VRAM usage for each card is about 35GB when setting --train_batch_size=1 and --resolution=1024."
Describe the bug Hi. I am running the Controlnet SDXL example as it is shown in the examples section [example-link]. I am unable to reproduce the results in a SLURM managed environment, where I hav...
reading the announcements of the new sdxl controlnets, training them doesn't seem to be a thing for mere mortals :p
SDXL controlnet requirements for training are higher than SD3-Medium by a fair bit
also, for SDXL we had control-LoRA but idk if HF training code supports it
the whole point of Control-LoRA naturally being to reduce the resource cost
yeah that's one of the most exciting things here imo
one of the biggest things that sets the potential with SD so much higher than with anything else imo
yeah on here they talk about running it on an A100 as well https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md
so if it's 35gb at a min for sdxl and if the vram needs are 30-35% lower for sd3-medium, it's doable on 24gb
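A quick check of that arithmetic, as a rough sketch only. The 35 GB figure comes from the quoted GitHub comment and the 30-35% reduction is the guess in the message above; neither is a measured SD3-Medium number.

```python
# Back-of-envelope check: SDXL controlnet training at ~35 GB (quoted figure),
# scaled down by an assumed 30-35% for SD3-Medium. Both inputs are guesses
# from the chat, not measurements.
SDXL_CONTROLNET_VRAM_GB = 35.0
CARD_VRAM_GB = 24.0


def scaled_vram(base_gb: float, reduction: float) -> float:
    """VRAM estimate after cutting `base_gb` by the fraction `reduction`."""
    return base_gb * (1.0 - reduction)


for reduction in (0.30, 0.35):
    estimate = scaled_vram(SDXL_CONTROLNET_VRAM_GB, reduction)
    fits = estimate <= CARD_VRAM_GB
    print(f"{reduction:.0%} lower: {estimate:.2f} GB, fits in 24 GB: {fits}")
```

Under these assumptions it only squeezes into 24 GB at the optimistic end of the range, and with very little headroom.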
t5 is a different type of text encoder, not a clip text encoder
SD3 Smol?
I think we're waiting to see what pipelines work or are delivered. The paper said T5 can optionally be dropped. T5 is huge, much bigger than either clip model. Maybe mcmonkey can chime in, or we'll know later
dropping T5 works fine if the size is an issue for you
CLIP G+L without the T5 is very close to having all 3 on most prompts
I imagine pipelines can be set up to load T5, run it once for embedding, then move the weights to cpu while the DiT runs
or maybe T5 can be quantized heavily?
you can even just run it entirely on CPU
Also yes T5 happily quantizes to 4bit, idk if there will be code for that on launch day but HF Candle runs T5-4bit on CPU well
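To get a feel for what quantizing buys, here is a rough weight-size estimate. The ~4.7B parameter count for the T5-XXL encoder is an assumption that does not come from this chat, and the formula ignores quantization overhead (scales, zero points) and activation memory.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: rough encoder-only parameter count, not a figure from this chat.
T5_XXL_ENCODER_PARAMS = 4.7e9

for bits in (16, 8, 4):
    size = weight_gib(T5_XXL_ENCODER_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:.1f} GiB")
```

Going from 16-bit to 4-bit cuts the weights to a quarter of their size, which is why the encoder becomes practical on CPU or on small cards.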
T5 4-bit on GPU fits well with pixart sigma 0.6B
around like 8GB of VRAM the last time I tried, don't remember
but it's not so bad on CPU only
but you can also run it on the cpu very fast
especially with the bf16 weights
it wasn't as fast on the cpu for me
but I'll try again
on gpu it was instant
i mean for people with less than 8gb
yes
it takes about 10-20 secs on CPU for T5
then again, after the conditioning has been done, you can generate on other seeds instantly
so it's just generating the conditioning once, then you can change cfg, seed, and other stuff and don't have to use T5 again
that's actually pretty nice
@viral plaza do you think we'll see the 2b on the api or artisan before the 12th?
hopefully yes
API team is talking about it but idk the timeline
if it gets on API it'll be added to Artisan immediately
Ok great thanks
hell yeah
2B or not 2B :3
man can't wait to try it out
and of course see what the community has in store
im also curious about the smol model, how good will it generate stuff, and also are most people gonna train loras or finetunes on 2B?
is 12 GB enough for 2B sd3?
yes
you can choose to offload various bits to main system ram as well, so no matter what it'll render with that.
It is (although all versions are available for free for local use). Training pony is very expensive, so I have to recoup the costs somehow - I run a Discord service for about 20k users and have partnerships with SaaS services. I also (obviously) have the SAI Membership, but the problem is that SD3 seems to be non-commercial even for members, so maybe you'll have to make some extra deal? But this is not communicated at all right now.
If you could just go ahead and fill out this form in triplicate, we'll get back to you around the time we release the 8b.
Interesting insight, thanks for the feedback. Excited to see what you come up with next week, thanks for the update
Sorry, was that directed at me?
What's with the drama, can't we all just be happy we're getting the weights
Ah, that felt too real so I was not sure 
It's happening for real! who cares if it's medium
it's still gonna mop the floor with dalle, midjourney and sdxl
I think pony represents all that is wrong with society and shows off who we really are in our dark hearts. And we salute you.
It's a model to make pictures of cool ponies.
p0ny is great even if i dont use it for uhm anatomical studies
AI always sounds like it's speaking in tongues and summoning demons when trying to make text
and then you just pipe it all into the image generator.
Worst case scenario we will get a v6.9 based on XL
assuming sd3 is released as core model, i don't see an issue as long as you stay below the enterprise reqs and get the pro membership thingy, doesn't seem you get there with your 20k discord users. But yeah, would be good to get that as a response from sai itself
it would be interesting to see how your new training translates for better quality images using the sdxl model and then see the results translated to the SD3 model, I think a 6.9 version would also appease the community who have set up their workflow and system around sdxl. so to be clear you're going to wait until the 12th at which point there will be a clear answer on licensing terms and then you'll decide which model to train next?
Discord is for pony lovers, it's SaaS that makes more reasonable money, but again, the whole issue is that membership may not be sufficient and so far I can't get any specific comms.
It's going to be a (better, I hope) different model anyway, there have been so many changes to tech and data that I expect it to diverge a lot (but be closer to XL)
i'd think that if you make more than what pro allows, you can afford the enterprise license. If the worry is that you get a small fee for making the model available to those saas providers, then it's those providers that need the enterprise license, and that's not your problem: they need to get the enterprise license to use the finetune (cause it still has the default license attached), not you
do you have anything you can show that you've generated lately? any sneak previews? lol
i just think overall SAI should have a special room for VIP fine tuners where they can get dedicated support and service and answers to their questions, just a curated list of top tier devs who make the models better so they can be taken care of first and foremost
But that's just my interpretation, that whole membership thing is clear as mud, all it really says is that it grants you commercial use (where the license that you get with the weights does not)
That sounds like a logical approach. Training is needed and I don't think there will be a huge benefit from splitting the (SD3) community in 4 Model Groups. 3 is already a lot.
yee
should - definitely. But if there is one I am not cool enough to be in it.
if time/compute is an issue, it would probably be better to skip 4B and train 8B properly vs having both 4 and 8 but both under-trained.
Or train a very good 2B?
2B looks highly capable.
And accessible.
I am in the data dungeon fixing image captions
hey I'm excellent in dealing with data processing and automation, i have free time, let me know if you need a hand or some scripting and I could lend a hand, feel free to DM me whenever and we could discuss any solutions I could develop for you to expedite your process in any aspect, it's the least I could do for using your models so much
@viral plaza This is a similar issue to the one i was referring to earlier. I stress that SD3 may not reach its full potential if it doesn't have the full support of major finetuners, but no one seems to be able to contact anyone official for crucial info on the final conditions of the SD3 License.
@viral plaza I know that you are extremely busy, and only one person, but if there is anyone you can put forward this issue to, we all would very much appreciate it.
(Thank you btw.)
we still have the one that was made for SDXL launch
haven't expanded it since and the relevant team has changed around
that was a Joe Penna initiative. With The Joe gone, gotta get the higher ups on board with Joe™ methodology
Already relayed it internally to the relevant people, they said it'll be clarified before the actual launch
Thanks btw. We're all on the edge of our seats. Haha! xD
@viral plaza Do you have any info about how SD3 memory use scales with resolution relative to SDXL? I like to use hiresfix to generate at 3840x2400 resolution with SDXL, but don't have a whole lot of memory spare above that. Just wondering what sort of resolution I'll be able to achieve with SD3. (Mac with 64GB unified memory running Invoke.)
initially SD3 is not gonna agree with hires fix directly due to oddities of the mmdit arch (pending some clever fixes to positional embedding code), so rather tiling based upsampling is a better strategy, which doesn't use more VRAM (but does use more time)
Thanks. Too bad! I hope some geniuses can work on that. Has anyone done any experiments with native generation above 1 MP? Does it still go crazy or generate artifacts? Would a higher-res initial generation be useful to lessen the number of stages or tiles in a tiled upscaling workflow?
Does it do image to image at anything above 1024x1024? So if I have a 1536 squared image from something else and want to do image to image on it with sd3, I can't without tiled ksampling?
if you go out of resolution range without fixing the positional embedding handling or using tiling, it does this (clear image in center, distortion on the outer edges)
(A) fix the pos embed code, (B) train the target resolution (I'm sure somebody will do a 2048x2048 tune right away), or (C) use tiled
tiled works well on SD3
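For anyone unfamiliar with the tiled approach, here is a minimal sketch of the tile layout only (no model involved). The 1024 tile size and 128 px overlap are illustrative defaults, not SD3 recommendations, and `tile_boxes` is a hypothetical helper name.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles covering `length`, sharing at least `overlap` px."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 1024, overlap: int = 128):
    """(left, top, right, bottom) boxes that cover the whole image."""
    boxes = []
    for top in tile_starts(height, tile, overlap):
        for left in tile_starts(width, tile, overlap):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


for box in tile_boxes(2048, 2048):
    print(box)
```

Each tile stays inside the resolution the model was trained on, so VRAM per step is constant and only the number of passes grows with the output size.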
So A) would be sufficient to allow the same resolution flexibility as SDXL (assuming the fix is possible)?
What is the current resolution range before it starts artifacting, assuming ~1 megapixel? Like can it do 1344x768 without bugging out? Or does it have to stay around 1024²?
Basically, how far from square can the aspect ratio go?
roger, thanks
aspect ratios are trained in and work fine ye
basically the same as SDXL
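A small sketch of how "roughly 1 megapixel at a given aspect ratio" works out in practice. Rounding both sides to multiples of 64 is an assumption borrowed from common SDXL tooling, not something confirmed for SD3 in this chat.

```python
import math


def bucket_for_ratio(ratio: float, target_pixels: int = 1024 * 1024, step: int = 64):
    """Width and height near `target_pixels` with width/height close to `ratio`,
    each rounded to the nearest multiple of `step`."""
    height = math.sqrt(target_pixels / ratio)
    width = height * ratio
    return (round(width / step) * step, round(height / step) * step)


for name, ratio in (("1:1", 1.0), ("16:9", 16 / 9), ("9:16", 9 / 16)):
    w, h = bucket_for_ratio(ratio)
    print(f"{name}: {w}x{h} ({w * h / 1e6:.2f} MP)")
```

With these assumptions 16:9 lands on the 1344x768 mentioned above, which is still about 1 MP.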
since you expect people to do 2048 tunes, something no one really attempted with sdxl, and pixart has 2k/4k variants of their DiT model, are DiT models generally easier to finetune to higher base res vs unet based ones?
nobody had a reason to with sdxl
cause sdxl you can just do hires fix and you're done
sd3 will get distorty if you try to run it straight like that
so there's a reason to bother making a hires tune
also yeah the training team said that sd3 moved resolution objectives very easily
Awesome, thanks! I kind of assumed so, but didn't know for sure.
honestly, i don't think it works that great with sdxl either - a lot of compositions are degraded a bit with those latent upscales
stuff like... a sandy beach with patches of wet sand underneath dry sand kicked up with the color and texture clearly visible, pebbles and stones scattered around... that kinda stuff disappears during those latent upscales
it's not a huge degradation... in a way i'm glad it's a big one for sd3 so we can actually get a proper tune on higher resolutions
tru
If they decided to switch to the 2b version, that means 8b wasn't close to being ready, so could the API 8b be really far from its final quality, and could we see big improvements? Or is it already at the level where the difference won't be really noticeable?

someone from stability said 8B is very undertrained
yes 8b needs a lot more training
How much VRAM do you need for sd3?
Which one will be best for anime, the 8b or the 2b version?
@viral plaza
2b model is smaller than SDXL while 8b is larger than SDXL + the refiner
2b model is about 2.5x larger than SD2 in terms of params
But it's smaller than SDXL so if you can run that you'll be fine
How much power does the 2b version need... does it need more than SDXL?
It probably depends what the community trains the most.
8b will be better out of the box but it's likely 2b will have more fine tuned variations via the community
Yes, but it's better to know before experimenting
8b will be way more powerful and detailed than 2b...
Only kind of, it has more potential but is harder to train. People still use SD1.5 fine tunes over SDXL even though SDXL is way larger and more "powerful". SD1.5 is simply easier for the community to train on their limited resources.
You're not considering that SD3 uses T5 XXL as one of the text encoders, with T5 it could use more memory than SDXL actually
This could also be made irrelevant by moving T5 into VRAM only while it's in use, then swapping it out for the diffusion transformer
T5 can be either quantised, or loaded into system RAM and run on the CPU
True that, that's pretty much what I said. The size of T5 XXL might be irrelevant if the inference switches the transformer model with T5 when being used.

Is stable assistant using the 2B version? Because I liked some of the stuff I was getting when I used the trial
I still had some problems with hands but all in all I got some good output from it
You can also do T5 inference on CPU as well, granted it's slower. You can try out what I mean right now by using t5 with pixart sigma. Even on cpu, it's not all that bad. Also, the prompt tokens can be cached, so as long as you don't change the prompt, you can just go straight into ksampling.
yeah I didn't even know that you can just have it cached and reuse it, it actually makes tuning CFG, changing samplers not a chore
Try the free SD3 @Glif - by a user named FABLAN
well i wasn't prompting nudity
lol
Yes, even the most tame of female pictures were blurred-out!!!
yeah that's the API itself
glif only uses like a word list
ClipDrop can often do the same - but they do not recompense you when you lose 28 of 40 pictures like that - and all from the tamest of tame prompts!!!
You quickly learn which words/themes/topics will send ClipDrop into a headspin!!
Slender is a non-ClipDrop word ....
Sensual too ...
So I'm doing beach-crazy lighthouses for safety's sake!!!
Pixart-Sigma into SDXL
Yeah it will hopefully be that way day 1 in comfy. Pretty sure it's just kind of a comfy thing for all nodes already. The only thing that might make it have to do inference on the t5 prompts again would be if it had its own seed set to random or something.
Then again, it could also be one of the other nodes people commonly use like rgthree that does the caching I'm talking about. Haven't used vanilla comfy in ages
ah yeah rgthree
That's quite a good workflow for prompt adherence + details.
Either way, it's an option and even in vanilla comfy, it would be like two lines of code for the node
A lighthouse straight out of my pixart tune. (No SDXL)
I think that prompting skills are largely underrated when it comes to getting the best out of AI.
In this room at least, more faith seems to be placed in epochs, samplers, sigmas, noise-schedules etc etc etc
I'm saving 4 versions: PiXart-Sigma processing only, +SD15, +SDXL, +FaceDetailer (when a face is produced)! Mostly completely different each time
