#sd3
Sigma is a bit of an oddball though, it uses very long captions (300 tokens) rather than the 77 people are used to, so to get the best from PixArt you need to type in a large prompt or expand it with an LLM
SD3 still has that 77 token limit though right?
Yes, my best PixArt-Sigma output comes when I use Jan.ai to prepare a "natural language prompt" as input
T5 is better when natural language is input ...
I've been using zephyr7b in comfy
In fact T5 was conceived to use natural language
Probably shouldn't discuss pixart too much in the SD3 channel of SAI though. Feels weird.
But as too few people can get their hands on SD3 - we have to talk about something ...
interesting, Pixart (DiT) had the entire image noisy/distorted when doing a higher res version on any model
at least, if I recall correctly
I'm @ClipDrop SD3 now ... anybody got an interesting prompt for me to try?!
OK, gone to Openai - Daily Theme for a prompt
Photo of Criminal in a ski mask making a phone call in front of a store. There is caption on the bottom of the image: "It's time to Counter the Strike...". There is a red arrow pointing towards the caption. The red arrow is from a Red circle which has an image of Halo Master Chief in it.
ah im late
A moody and atmospheric fight scene between stick figures rendered in a realistic digital painting style. The scene is set in a dark alley, illuminated by a single streetlamp casting dramatic shadows. Two stick figures are engaged in a fierce battle, with one delivering a powerful punch knocking the other backwards. The realistic digital painting technique gives the stick figures texture and depth, making them appear almost human-like. The atmospheric lighting and detailed background enhance the emotive intensity of the scene
I've tried that one ... was OK
Pixart
A futuristic terrarium with a unique and captivating design. Inside the terrarium, miniature bioluminescent plants and flowers emit a gentle, glowing light, creating an ethereal and otherworldly atmosphere. The plants are arranged in a way that mimics a mystical, enchanted forest. Tiny, intricately detailed fairy houses and pathways are nestled among the foliage, adding a whimsical element. The terrarium's glass is etched with delicate patterns that reflect and enhance the inner glow, making the entire piece a captivating focal point
Create a square visual representation inspired by the DIY spirit and featuring prominent use of safety pins, styled as a monochrome street snap photograph from the punk era. The image should focus on a close-up of a single person's face dressed in typical punk fashion, caught in a candid moment as if walking through the streets and looking back over their shoulder. The photograph should emphasize sharp contrasts and capture the gritty, raw aesthetic of punk culture. The person should have a brightly colored mohawk or spiked hair, appearing self-cut and dyed to emphasize the DIY ethos, though rendered in black and white. They should be sticking out their tongue and raising all fingers towards the viewer, conveying a rebellious and defiant attitude. The person should wear dark eye makeup, with black eyeliner and shadow. Accessories such as safety pins used effectively but sparingly as earrings, body piercings, and prominent decorations on their clothing should be visible. The person's eyes should have a nihilistic, vacant expression, emphasizing a sense of emptiness or detachment. The background should feature a gritty urban scene with graffiti-covered walls, capturing elements of street art that align with the punk aesthetic. The monochrome style should highlight the rebellious spirit of punk with intense, sharp contrasts and a raw, edgy feel. The photograph should have a candid, spontaneous feel as if the person was unexpectedly captured while walking through the city and looking back. Add a grainy texture to the photograph to enhance the raw, unpolished aesthetic.
This punk prompt is a monster!!!
SD3@ClipDrop
SD3@ClipDrop prompt = A chubby Shiba Inu named XiaoShuai, standing on two legs with its belly exposed, not holding any items. XiaoShuai is wearing a cute human outfit that includes a brightly colored shirt and shorts. The Shiba Inu has a warm, tan coat typical of its breed, with expressive eyes and a playful expression. Scene in warm tones, realistic style, capturing the essence of a cozy home atmosphere
kinda. rn the SD3-Medium model mostly just makes it work at 512, idk if the release version will be the same or not
well not perfectly, but, like, relative to what you'd expect for borking the res entirely
Is this the 2b model of sd3?
So it does closeups, but with "reasonable" framing
No, that's the 8b beta on the api
amazing picture
What SD3 does have is visual acuity - poor limbs, fingers and faces (at times) - yet with an enhanced visual acuity.
If Stable Assistant sticks around and stuff, I'm gonna keep it I think. It's an awesome tool to finetune what you are trying to get. For example I had it optimize prompts. It'll be great once it's fully featured. So once weights are dropped and it isn't necessary to burn through credits really quickly and to use it as a tool on the side, it'll be worth it.
I can never understand why Stable Assistant gives you 150 prompts/150 pictures a month (for $10); and ClipDrop gives you 300 prompts/1200 pictures a month (for the same $10)?
SA needs the money and knows people will jump on SAI quicker. Cause of the new tech and brand name
Yeah true
Boltning Hyper
@viral plaza do you know if it's just the standard T5 encoder that everything elses T5 uses or is this specific to SD3?
publicly available google/t5-xxl
Phew, I don't have to download it agaaaaaain.
https://huggingface.co/mcmonkey/google_t5-v1_1-xxl_encoderonly the encoderonly file loads more fasterer tho
I've already got one of those.
nice
Also Pixart-Sigma
Pixart uses the same one at FP16
Same
Can this be used inside PiXart-Sigma at all?
I'll let you know if it loads...after it downloads
Yes. I'm using an fp16 xxl t5 right now.
Are you using the encoder only version though?
Yep
How? I get AttributeError: 'NoneType' object has no attribute 'get'
Loading T5 from 'F:\ComfyUI_windows_portable\ComfyUI\models\t5\t5-xxl-encoder'
!!! Exception during processing!!! 'NoneType' object has no attribute 'get'
The clue was still there though, thanks!
Changed path_type from folder to file...now works
https://huggingface.co/city96/t5-v1_1-xxl-encoder-bf16 if you want the BF version (which may be more accurate than FP16)
I seem to have all of them!
No wonder I'm running out of disk space
I have mt5-xl too
15GB
You can probably find a smaller xl as well.
Does that work as well?
Not tried
I wonder if the outputs are compatible.
Trying bf16 now ...
bf16 works - not made a noticeable difference - it might if I was doing photorealistic though (I will try that later!)
Trying XXL ... it's working ... again, no discernable difference?!
Now you see why nobody uses llms at high precisions. You can look up perplexity vs quantization to see more about it.
I remember in the GPT-J days I wanted to run LLMs, but the best we could do is load-in-8bit lmao
BF16 has the same dynamic range as FP32 (just fewer mantissa bits); FP16 loses precision in the sorts of ranges DiTs have.
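As a rough illustration of how the two 16-bit formats differ, here's a stdlib-only sketch. It assumes bf16 can be approximated by keeping the top 16 bits of a float32 (truncation rather than round-to-nearest), and the helper names are made up for this example. It isn't how any DiT actually casts its weights.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Approximate bfloat16 by truncating a float32 to its top 16 bits (assumption)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

big = 70000.0   # above the fp16 maximum of 65504
small = 1.001   # needs more mantissa bits than bf16 keeps

print("fp16:", to_fp16(big), to_fp16(small))
print("bf16:", to_bf16(big), to_bf16(small))
```

So fp16 keeps finer steps near 1.0 but overflows on large values, while bf16 stays finite across the fp32 range at the cost of coarser steps.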
I'm running a lot of art which does not need a great deal of artistic precision - do you mean precision in detail; or in prompt coherence?
Mathematical precision.
I come to all this primarily as an artist ... glad to know that mathematical precision exists!!
the city96 guy has a 3 gig version of the t5 for hunyuan, and I did a side by side comparison. Details on the face, lines on the clothing, various stuff like that are suddenly not as sharp or don't make sense. They're not major, but it's definitely noticeable.
Isn't the T5 just an LLM that helps improve the prompt? I'm surprised it has any impact on the details in the image.
no that t5 is the thing actually doing the encoding for the image model.
it's converting the words to numbers.
Which is great. I tried it in TagGui and it's pretty neat. ๐
hunyuan has a separate prompt expander model that they also include if we want to use it, but we have better stuff like gpt4 and llama3 for that.
OH SHIT STABLE AUDIO HERE?
non commercial license as expected
cant wait for comfyui implementation
hope like hell we someday get the full weights for the full version
i'd go wild with that
@viral plaza On another note - could we theoretically load CogVLM instead of T5 XXL? At least in TagGui it gives me better results. Too bad it's quite VRAM hungry and CogVLM2 is Linux only...
Let's hope we get a good UI for that soon! (looks like audio-webui isn't actively maintained anymore)
honestly comfyui would be fine for me
Ooh... RIGHT! I didn't consider that... absolutely.
I bet we'll see some addon implementation a few days later
maybe around the time SD3 2B comes out
is this the one we will need to download when sd3 releases?
no, you should download the bf16 one
smaller filesize, faster loadtime, less memory used up on CPU Ram
ah nice, i like the bf16 version anyway :3
wait a sec, the link to the model that Alex sent is the same filesize... maybe its fp16 as well
oh its from him, mcmonkey
mcmonkey d luffy
calm down Musk
good one
When using the api to generate images I get finish_reason: 'CONTENT_FILTERED',
seed: 2950283743
}
It blurs the image. Any way to say that it should not filter content ?
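For anyone scripting against the API, a minimal sketch of detecting this case. It assumes the response has already been parsed into a dict carrying the `finish_reason` and `seed` fields shown in the snippet above; the helper name and everything else about the payload shape are hypothetical.

```python
def was_filtered(result: dict) -> bool:
    """Return True when the API reports the image was content-filtered."""
    return result.get("finish_reason") == "CONTENT_FILTERED"

# Example payload shaped like the snippet in the message above.
result = {"finish_reason": "CONTENT_FILTERED", "seed": 2950283743}

if was_filtered(result):
    print(f"seed {result['seed']} was filtered; the returned image is blurred")
```

This only detects the filter so a script can skip or retry with a different prompt; it doesn't turn the filter off.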
Nope. Have to wait for the 2b on June 12 for that.
Where did you get that information ?
Look here. #announcements message
Thank you very much! So are we sure that we will be able to turn off that filter?
The diffusion gods have revealed their divine will unto us
The local version won't have filters. It'll just be capable of something or it won't. There won't be any blurred images.
Do you know what format the image-to-image mode has to be ?
I tried to send an url and tried to send the image as base64
both times I get a bad request back. I also changed the mode to image-to-image but it doesn't seem to work
what am i looking at lol
kek
1dog
that looks like a picture from an actual commercial LOL
"Inside of you, there is 2wolves"
ideogram
Noice.
I am very impressed by ideogram. I took your image and used describe, then generated the resulting prompt... it has a lot of potential
Although sometimes it makes real crap
now im thirsty
So then....

The patient may not survive...
They will be fine.
some people saying text is gimmick but I have a hell of a fun with it and it is good
yeah it's pretty good for a base model
I don't really understand how do glif.app host sd3. It takes 8s for gen while someone from SAI (or it was in paper) said it takes 40s or so on 4090 with 50steps. Do they have something more powerful? But why for free then, how do they profit?
At first I wasn't sure that it is real sd3 but it is not similar to any other services
and it is yet undertrained
i can fix her (in photoshop...yea...)
Yup she looks healthy.
with sdxl refiner
I haven't tried it, but could it be fake? Lately, I see a lot of https://huggingface.co/spaces/markmagic/Stable-Diffusion-3
// I see that it no longer works; it was working yesterday
I also thought about that, but what is it then? It is definitely not Dalle3, Ideogram looks very different too, SDXL could never achieve such prompt following...
At the beginning I got fooled by one "free sd3" site. Decided to check the metadata and found "Ideogram" there) But that doesn't seem to be the case here
Well could just be using the sd3 api
Considering the source code was literally "exec(os.environ.get("CODE"))" they were probably using the initial signup free credits on the SAI API
I think some people are using Ideogram. It's easy to make calls to Ideogram via the endpoint URL. I do it sometimes by inserting the bearer token, and it works well
yea it could, but it just too generous, I didn't even hit limit once. Some of these sd3 glifs has about 30k plays while, in addition having gpt4\claude prompt rebuilding, or sdxl refining
I made some comparisons but couldn't find obvious similarities between Ideogram from their website and glif
its like hl2 in russia came out before hl2 in america came out
shady characters pushing SD3 in the backrooms and alleys
Sd3/sdxl refined
first thing to try when weights drop
Sd3/hunyuan
i just want to try dumb contradicting stuff like "luminescent Rose madder Bright lilac huge crooked (fur:1.9) (alien:0.2) (Titanoboa:0.5) (Meerkat:0.4) amethyst eyes. full body. Boreal Forest, puddle, centered, sunlight, focused, uniform light background, photo-realistic" and T5 is probably too smart for that :p (though it doesn't even look half bad)
I never got this actually. Can T5 also have prompt weighting?
no idea, i'd think why not but it prob needs a new implementation
Refined, it's really neat
Wow these came out great
purple furry snakes in the shape of "SD3" are wrapped around the arm of will smith who is taming a lion
grin, asking a bit much. one day, imagegens will happily oblige, i hope
Other than the text portion, pixart did it more accurately.
sigma slow
this is with 512 sigma
If you think sigma is slow, wait till sd3, apparently it's also quite slow. I believe lykon said 30 seconds per image on a 4090? 50 steps.
isn't that 8B?
and also 50 steps too
In early, unoptimized inference tests on consumer hardware our largest SD3 model with 8B parameters fits into the 24GB VRAM of a RTX 4090 and takes 34 seconds to generate an image of resolution 1024x1024 when using 50 sampling steps.
50 steps yeah. 30 seconds for that isn't so bad. thats 1.5 it/s at megapixel resolutions. not sure what the problem is
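The throughput figure is just steps divided by wall-clock time. A quick check, using the 34 seconds from the quoted statement and the rounded 30 seconds used in chat (the function name is only for illustration):

```python
def iterations_per_second(steps: int, seconds: float) -> float:
    """Average sampling throughput for one image."""
    return steps / seconds

quoted = iterations_per_second(50, 34)   # figure from the quoted SD3 statement
rounded = iterations_per_second(50, 30)  # the "30 seconds" rounding used in chat

print(f"{quoted:.2f} it/s at 34 s, {rounded:.2f} it/s at 30 s")
```

Either way it lands at roughly 1.5 it/s, which matches the estimate above.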
not even any optimizations done by the community yet
well for 8B that's pretty good, but for 2B, a 4090 doing 1.5it/s isn't that good imo
I hope MODE13 meant 8B
and yes, optimizations
a lot of the time may come from t5 layer too. faster without it and many won't want to use it
the way a lot of civit users prompt, t5 won't be necessary
yeah lmao
if i can get under a minute 1024/1024 30 step ish i can live
1girl, big boob, intricate won't need T5 for sure
now piling on loras controlnets ipa adapters will be a different story
i've been using Omost which takes like a minute to make the prompt code in the first place lol
i'll be fine
what about "2 girls, one with A cup boobs and one with C-cup, intricate"?
there's gonna be a lot of debate over the usefulness of t5. many will declare it to be another sdxl refiner layer since they won't see immediate gains
another prediction i have. controlnet won't work as well on larger sd3
how so
same problem with sdxl. more parameters means less influence
interesting
then again, wouldn't the MultiModal nature of it make controlnets just superior?
multiple input streams and such or whatever
rather than some hacky solution
i'm not saying i have an educated prediction
and I don't even know what input streams are
there was a thread on the sdxl channel today where andreac75 who makes a lot of top notch loras is just deleting all their stuff off civit because it's all just porn now. They didn't want to see all the heinous stuff people were making with their stuff anymore.
and the porn side of the community has this notion that if you even slightly don't want to see that, you're very obviously a fundamentalist Christian oppressor
ermn error
sd3 hasn't even released yet...i think.
ik
arguably the more important question: deep floyd stage 3 wen?
I wonder how long after SD3 2B releases will we be able to start training LORAs and finetuning?
developer API has been out...
https://stability.ai/news/stable-diffusion-3-api
if it's very easy, why is stability not officially doing that?
Do you want it to take even longer to release the weights?
well if "very easy" means it would just take 1 day longer, then I think that would be worth delaying it 1 day for, yeah
Of course. I think Stability would just do it if they thought it were THAT easy.
That's just going to be a comfy node update later. The weights don't need to wait for that
Easy probably means "you don't have to go to special gyrations to get the loss to decrease, but you still need to train with a lot of compute and data".
And rather than spending all that compute on training the model on other resolutions, I'd prefer that the issue with the positional encoding get fixed so that the existing model is more flexible.
The downside of tiled upscaling is that it takes many times longer than standard upscale image and denoise at 0.5.
It's probably why they're not doing it on the api and instead upscaling sd3 with sdxl turbo
I'm quite sure @viral plaza in the message I replied to was saying that SD3 can easily be trained for higher resolutions, so no, that cannot just be a comfy node, it needs training
All of these models are 1024 res trained.
It would take a massively larger amount of time to train the base models of millions of images at higher res. It would take massively more money to buy that gpu time, which a near bankrupt company can't do.
So instead, they're delegating that to the community, which makes sense.
I believe that they have the best of intentions at this point. The reality though, is that we may never see 8b or even 4b if someone doesn't come along and fund them somehow.
So we should just get 2b going and do what we can with that until/if we get surprised with something better down the road.
Nah, cnets were kind of a hack job to implement into unet based architectures. They should be exponentially better with DiT once people start rolling them out.
Stupid black bars

"action movie" in the prompt, really likes to add fire to things
looks more like a video game to me
I have unreal engine in the prompt
What you said makes all the sense :p
ohh
/help
The green reaper..
thats a pickaxe not a scythe..
i heard its pretty bad at scythes too
later when we get to see it finetuned
thats a scythe now
at least Friday is close
it's not bad at the scythe itself it seems, but same problem with guns, the holding part
yep
Looks like even those 4Gb of VRAM are overheating...
I wonder if we will be able to rig up a multigpu setup for sd3 in comfy with any form of ease. I'd use my second weaker GPU for t5 inference and my main one for sd3, that way I wouldn't have to swap models constantly or do t5 inference on the CPU...
I know that in the python code, it's usually as simple as saying cuda0, cuda1 or CPU, so it might be pretty easy to slap into a node
I just don't know if comfy would natively like it with the smart memory management stuff
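A minimal sketch of that device-picking idea, in plain Python so it runs without a GPU. The device strings follow PyTorch's `cuda:N` convention; in real code the GPU count would come from `torch.cuda.device_count()` and each model would be moved with `.to(device)`. The function name and the "diffusion"/"t5" keys are hypothetical, and this says nothing about how ComfyUI's memory management would react.

```python
def assign_devices(gpu_count: int) -> dict:
    """Pick a device string for each model given how many GPUs are visible."""
    if gpu_count >= 2:
        # Main GPU runs the diffusion model, the weaker second GPU runs T5.
        return {"diffusion": "cuda:0", "t5": "cuda:1"}
    if gpu_count == 1:
        # Single GPU: keep T5 on the CPU to avoid swapping models in VRAM.
        return {"diffusion": "cuda:0", "t5": "cpu"}
    return {"diffusion": "cpu", "t5": "cpu"}

print(assign_devices(2))
```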
Is there alternative?
||not hf pls||
What's wrong with hf
idk it is just not that convenient if you don't know what exactly you need
could be skill issue
What is likely going to be the minimum requirement for Stable Diffusion 3 Medium ?
4GB
to be fair in 2024 almost everyone has at least 4GB
Technical things being easy doesn't mean business things are easy.
Like first -- Our technical team could release a thousand models... but how do we organize marketing around them all, how do we get safety testing on them all, how do we get legal signoff on them all, how do we get licensing organized, etc. etc.
But also: if we decide to spend three days training 2B-2048, that's three days not spent training the 8B. (Opportunity cost: every action no matter how easy comes at the cost of something else you could've done with that same time/effort).
in short: something being technically easy doesn't mean a company like Stability can or should do it internally.
In fact, that's part of the point and value of open releases: Stability doesn't have to do everything, we just open release the models and what info we can, and there's ten thousand other people who will each obsess into any given subsection of work and make that
I'm pretty sure there's a way to software-patch the pos embeddings and make them friendly to other scales. Not 100% confident but similar things have been done before (see eg RoPE scaling for LLMs)
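For reference, the simplest form of that trick in LLMs is linear position interpolation: positions beyond the trained range are squeezed back into it before the embedding is computed. A toy sketch, assuming a 1D position index and made-up sizes; this is not SD3's positional-embedding code, and whether the same approach carries over is exactly the open question.

```python
def interpolate_positions(new_len: int, trained_len: int) -> list[float]:
    """Squeeze new_len positions into the range the model was trained on."""
    scale = trained_len / new_len if new_len > trained_len else 1.0
    return [i * scale for i in range(new_len)]

# Hypothetical sizes: trained on 64 positions per side, sampling at 128.
positions = interpolate_positions(128, 64)

print(positions[:4], positions[-1])
```

Every position stays below the trained length, so the model never sees an index it wasn't trained on, at the cost of fractional positions it has to generalize to.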
no just wait until launch day and we'll have clear proper ways to get any required files
oh yeah lol wasn't gonna upload an fp32 file in the year of our quantization 2024
imo we should be using a 4bit quant, but, need proper software support first
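Rough file-size arithmetic behind the quantization point. The parameter count here is an assumption (about 4.7 billion for the T5-XXL encoder alone, an approximate figure), and the estimate ignores quantization overhead such as scales and zero points.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}

def weights_size_gb(params: float, dtype: str) -> float:
    """Estimate raw weight storage in gigabytes for a given precision."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

ENCODER_PARAMS = 4.7e9  # assumed approximate size of the T5-XXL encoder

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weights_size_gb(ENCODER_PARAMS, dtype):.1f} GB")
```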
CogVLM is an LLM that takes image inputs, designed to generate text descriptions of those images
T5-XXL is a text encoder-decoder generic pairing, in which SD3 uses only the encoder, to convert prompts to inputs.
These models are not interchangeable, they're not even in the same category
I hope it gets figured out cause I don't know about how good it will upscale images (tiled) with depth of field in them
yeah
for example i use a 4bit T5 for https://github.com/mcmonkeyprojects/translate-tool and it works great
there seem to be quite a lot of variant models for T5 though
there's an fp16 version by theunlikely
(T5 is an architecture intended to be finetuned, there's tons of variants out there, it's kinda silly that we use a base model of T5 for sd3 tbh)
there's an fp16 encoderonly T5-XXL intended for SD3 here https://huggingface.co/mcmonkey/google_t5-v1_1-xxl_encoderonly/tree/main
(which is on HF because the SGM training code uses HF so i uploaded that to dodge the long slow load on the full fat fp32 pair file)
Loading fp32 T5, was always the slowest part when using deepfloyd and pixart lol
"... obsessing into subsections ... groan, if only!!!"
Pixart has a couple of smaller versions in fp16 and bf16. Drops the size to around 8-10gb and speeds loading up a lot. You can find links to them on the extra models comfy plugin page
Though the absolute easiest way to deal with the slow loads is to just get a cheap $40 nvme in the 2-5 gb/s range and models load in seconds
The one in the link is only 10GB
huh i dont understand, then how do yall caption sd3 datasets?
oh cogvlm i see
it's possible those get finished soon after and released or something, or community takes over, idk, we'll see
it's pretty weird that some people remember Stability saying it would release ControlNet models on day one
CogVLM is an AI model that generates captions, not a dataset. The captions that don't come from CogVLM, are just whatever captions came with the source images
ight captioning with my own brain then
if you're doing your own dataset, yeah just write your own captions
yeah, we were talking about the fp32 versions from before though. it was like 20gb.
if you're doing a megascale dataset (millions or billions of images or whatever), the collection tools to build datasets like that usually can copy out original source caption data
for instance im planning to finetune BTD6 Lora on SD3, well i have to make my own cogvlm finetune..
its like 400+ images iirc
I expect the HuggingFace team will publish finetuning info on day 1, followed later by the various community finetuning software vendors (kohya, onetrainer, etc)
i have one question

How to get membership or early access?
Do you know what the system prompt and prompt was that was used in CogVLM? I've had some good and some bad results from using it.
I got CogVLM2 running yesterday and my first prompt got responses that copied the whole question before answering. Made it absolutely useless for captioning. ^^
I rewrote the prompt and then it was way better.
Only persisting problem so far (which I can understand is unimaginably hard for an AI to understand) is that right and left hand / arm are swapped when a person is facing the camera.
I'll have to manually check all captions because sometimes it gets it right, most times not.
Most vlms are good to go out of the box, since that's literally their sole purpose and what they were trained to do. But if you're trying to have a specific format, you have to do a little bit of sys prompting. I spent a few days a while back making one for llava 1.6 Mistral and it was around 1000 tokens long with examples in it. You can also use RAG as well, but the info still bloats up the context size. I know llava 1.6 ate up like ~2000 tokens for the image information alone, so I didn't use RAG for it when I only had a 4096 context size(no sliding window or other optimizations)
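The context budget in that setup works out like this, using the approximate numbers from the message (4096 context, ~2000 tokens for the image, ~1000 for the system prompt). The function name is only for illustration.

```python
def remaining_tokens(context: int, image_tokens: int, system_prompt_tokens: int) -> int:
    """Tokens left for the question, any retrieved text, and the caption itself."""
    return context - image_tokens - system_prompt_tokens

left = remaining_tokens(context=4096, image_tokens=2000, system_prompt_tokens=1000)
print(left)  # 1096
```

That leaves about a quarter of the window, which is why adding RAG material on top gets tight.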
The most annoying thing is it starting every caption with "In this image" or "The image contains" or variations of those, which I then have to take out. Also I had one vlm that just kept on about handbags for some reason (it was a Microsoft one).
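Taking those openers out can be scripted. A small sketch, assuming the captions are plain strings; the list of opener phrases is a guess at the common variations and would need extending for a real dataset.

```python
import re

# Boilerplate openers to drop; the verb list is an assumption, extend as needed.
OPENERS = re.compile(
    r"^\s*(?:in this image,?|the image (?:contains|shows|depicts|features))\s*",
    re.IGNORECASE,
)

def strip_opener(caption: str) -> str:
    """Drop a boilerplate opener and re-capitalize what is left."""
    cleaned = OPENERS.sub("", caption, count=1)
    return cleaned[:1].upper() + cleaned[1:]

print(strip_opener("In this image, a tabby cat sits on a fence."))
print(strip_opener("The image contains two stick figures fighting."))
```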
Anyone has suggestions for an image to text/prompt model? I am trying to get a prompt from an image to generate similar images
an easy way to avoid that is by using a predefined ouput-prefix
question: Write a short caption for this image.
response: This image shows [autocomplete from here]
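A sketch of the output-prefix trick in code. It only builds the prompt string and re-attaches the prefix afterwards; no model is called, and `completion` stands in for whatever the VLM would generate. The function names are hypothetical.

```python
def build_prompt(question: str, prefix: str) -> str:
    """Append a fixed response prefix so the model continues from it."""
    return f"question: {question}\nresponse: {prefix}"

def finish_caption(prefix: str, completion: str) -> str:
    """Join the forced prefix with whatever the model generated after it."""
    return f"{prefix} {completion.strip()}"

prefix = "This image shows"
prompt = build_prompt("Write a short caption for this image.", prefix)

# Placeholder for the model's continuation; not real model output.
completion = " a tabby cat taking a selfie with an ox on a farm."
print(finish_caption(prefix, completion))
```

Because the opener is fixed and known, it can be stripped reliably afterwards instead of guessing at variations.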
The other trick you can use, granted it's double the time, is to use a second pass with llama3 instruct(or any good recent instruct model) to clean up the captions and format them better
Or use taggerui and discourage from caption:
They are starting to make some llama3 based vlms now, so that should be awesome. Think there's a llava next llama3 someone is working on
1 week...
But up until now, the best one I've used that was the most reliable was llava 1.6 Mistral. Granted, that's for consumer level hardware and easily runs on 8gb vram. Mistral was a beast until llama3 came out(again talking about smaller models you can run on a normal PC)
here's an example of a long description version i made a while back. it was part of a two pass system where this would be fed into a prompt generator llm. the system prompt for this is ~700 tokens
memberships program is @ https://stability.ai/membership
i don't think there's sd3 early access
i don't know the answer to that sorry
i have some matter if you can come to dm
please
should use CogVLM2 anyway instead of the original, would be better and prompting it will be different

i don't trust llava at all for architecture reasons -- it just has images fed into the context token space of the LLM, vs cogvlm has active transformer layers dedicated to the image
have you tried llava 1.6 mistral? it's pretty decent
(for it's size i should add, being able to run on an 8gb and all)
obviously, for larger scope and better hardware, i'm sure there are better alternatives
i'm playing around with a different version called llama3 llava next 7b right now and it's pretty good so far. the next versions of llava are based on 1.6. i think the one you linked is based on 1.5
That doesn't work as intended most of the time.
In my tests CogVLM always performed better than Llava1.6.
Btw @viral plaza is there a time on the 12th we should look forward to?
A Midnight release? Late evening in the West US?
Thaaat's a good question, will ask internally
What image captioner did you get cogvlm2 to work with?
I went on a lengthy adventure yesterday (thanks to a lot of help from Llama-3 70B & an OT community member) to get the repo of TagGui running on WSL. I'm very very happy that it works now. The next time a repo asks for Triton or bust - I know what to do.
Omost asks for Triton - and when you install it (the only version available!) - it tells you it is the wrong one?!
On another note entirely - I've been trying Stability Audio (via Pinokio) - and it's quite some fun!!!
the local mode?
i wish i would know how to fine tune it xD
Pinokio puts it all into a local VENV yes
did you know that the model got leaked like a week before it released
how much GPU does it need?
i think 7gb vram
that ComfyUI node said 8GB VRAM
Here is my first groove ... an arpeggio of a C9 chord https://drive.google.com/file/d/1lG2fArbCe2AauplHH2BhEAsUaKF3mKih/view?usp=drive_link
too
Pinokio uses GRadio
i don't have access, just post the audio in discord
I have an 8Gb RTX 2070 and it works OK, about 45 seconds (100 iterations) to get a 47 second mp3 file
sd3 "Spectrogram"
RGB gaming catheter
Is my audio interesting enough for you to continue your interest in Stability Audio at all?!
yes i want to finetune the model but i don't know how
i already installed it
I'm sorry if this has been asked or answered already, but will we have to truncate prompts ourselves if we make our own Loras or Finetunes?
i hope sd3 is able to make this
awww that sucks
poor calf is out of focus
Yeah when it's local I can just tell it to render 30 of them, with tons of various llm variations of it. Not gonna do that against a paid api.
Hah
Can you get it with the cat snuggling the oxen though and taking a pic of both of them
damn, they don't want to cooperate
huh?
is this ideogram?
oxen is sitting next to a tabbie cat who is taking a selfie of both of them. Background is a farm.
Dalle
oh
looks pretty good
we could overfit Loras to make something like this
We'll see what 2B looks like
or fully trained 8B
Festivalman do you think that skipping 4B is a good idea?
It makes sense to me, you will have 2B, which is what the majority of people will be able to run, 800M for low end users, then 8B for enthusiasts or small businesses if the Enterprise membership isn't too expensive
4B might stick out like a sore thumb and would delay the other models
tabbie cat that is snuggling with an oxen while taking a selfie of both of them. Background is a farm.
Dalle/sd3
I think sd3 did a good job, it's technically the selfie pic itself
Every size of sd3 they release will segment the user base that much more. I'm on the side of releasing as few as possible.
if anyone wants to challenge sd3, try something like this (Dalle) (saw the pic on reddit a long time ago, sadly sd3 really doesn't shine with abstract concepts like "mustache from her hair")
thinking about it, that image is extra evil as it also has hands (this is sd3, closest i got, but that mustache, is so much more real :p)
Yeah I'm getting the same thing
Did a good job with "anime" though. It really looks good
Sharper than deepbluev4 that I've been using I think
ok the sd3 is well done
is this supposed to be mewtwo?
nah thats mewfour
That's fine though. I think people will see that fine tunes aren't going to be as needed and that loras will be more of the go-to. Eventually someone will make an xadapter to share the loras between editions.
Yeah but you can't share Lora against different base models. These different size models will literally have different training against different images.
octopussy SD3
Often feels like SD 1.5, lots of trial and error. But when the high roll hits it's really good. 6 more days

yeah 2B will be probably more consistent, especially with Finetunes
Hmmm... currently curating all the CoGVLM2 captions... which are nice, but still utterly filled with "suggestions" and "seems to be"'s "possibly" "appears to be" and other nonsense...
--> and now I'm wondering if that's actually stupid cause those might still be in the SD3 captions as well.
if those appear at the end like I remember them being, then those are probably truncated anyway...
or they had a good system prompt for it anyway
I do hate those assumption parts of the captions though
coping for the best
8B is heavily undertrained
but it is coping for sure
we can't expect images to suddenly look 100x better
i'd sacrifice consistency for variety anyday. Variety is amazing on SD3
Oh I mean like anatomical consistency or whatever
it will have variety
they talked about the model overfitting, causing differing seeds to become useless
so I think we're going to be okay
maybe the Turbo model(s) will have that though
and finetunes
it can have both, I'm just sayin what is more important to me. Had lots of fun with API already
same
8B is really good at paintings, so I hope 2B Base will be good at it too
same dataset after all, no?
should be similar enough
Playground 2.5 is an example of almost no variation across seeds.
btw @coral sable idk if you have seen this, but this is a TRULY raw image from 2B, no upscaling (From Lykon)
this is how the face looks upclose
remember, no highresfix or other tricks
bigger dataset, same tech. Wouldn't call that the same dataset
I see
yea saw that, really good for a base model
I wonder if the 16 channel VAE is just that much better for smaller details such as eyes
what's unfortunate, is that like with Pixart, highresfix has an issue
resolutions higher will create very very noisy artifacts
I suppose by the time we get it, they'll fix the pos embed code, or the community will within a week
if Tiled upscaling is okay, then I'll just do that then for the time being
highresfix/adetailer still needed, details drop significantly on anything that isn't closeup. But that's to be expected as consumer hardware is what SD3 is targeting
yeahh
I wonder if there will be a launch event on the 12th like there was with SDXL
what was the event again?
I mean just discord call with bunch of people
ah
Stability guys presenting new tech
oh so like the center stage type of call?
yea
I can't wait to remake Tekken intros as Live Action movie stills
I wanna see how good img2img is with such an intelligent AI
oh heck I could have tried that with pixart already lmao
I keep forgetting that it can do img2img too
you guys have any plans on what to use SD3 for that you couldn't or wouldn't bother to make with previous models (too much controlnet or regional prompting involved)?
I'm hoping it's a lot easier to do images with multiple people in it without their features (hairstyle clothes etc) blending together with the other person's.
Is there any new update on pixart models
only 2K model is the biggest model released so far
there's lora training code
its in diffusers
nothing much to be honest that I know of that is "new"
yeah the new VAE is majorly beneficial for small details
nice
I'm told there is not a specific time we can give in advance, the actual release might be a process over a few hours (model weights, code, etc.) rather than necessarily all go at once
no event planned
Thanks for asking. Usual routine then (while I'm probably all giddy).
Which is why I brought up: https://showlab.github.io/X-Adapter/
I'm sure someone will make something similar for the various sd3 models. Who knows, maybe by the end of the year you'll even see one that allows you to use sdxl loras with sd3.
Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
They won't be perfect 1:1s though, but should be close enough if it's done right. Maybe like 90% accurate
I wonder if comfy and your UI (stableswarm) releases the update earliest
nice
yes nvidia I will keep buying your 24GB high end cards for 3 gagillion dollars 
🤣
yeah with 0.3 denoise
lol
lol - I have an 8gig 3070ti, it's just never enough sometimes. (Not doing ai just yet)
T5 bf16 on CPU with SD3 2B fp16 will be good for you

it takes about 10-15 seconds to T5 to process the prompt, but once you are done and you don't change the prompt
you will be able to change anything else (CFG, Sampler, resolution, etc) and get instantly to generating the picture without having to regenerate it
so the future is bright
the future is now, thanks to science
or just don't use T5, we don't know how much worse it's going to be without T5
according to the research paper, the difference is not that huge
yeah but they also claimed stuff like DALLE3 level prompt adherence and stuff so...
yea i guess will see :3
but I do believe that images without Text might be decent without T5
if its as smart as Pixart-Sigma without T5 then it'll be good
cause its still a massive improvement over SDXL
Can you break this down for a noob?
pixart has a 300 token context window. sd3 is limited by clip length
yeah its funny
its kinda already broken down :3 long story short, you dont have to run everything on gpu
it's really the benefit of the t5 stuff over clip
SD3 2B, which is the model that will get released on june 12th, will run just fine on 8GB of vram
you can use SD3 2B with T5 (a text generation model, which can also be used as an encoder for image generators), which increases its text capabilities (and possibly accuracy to the prompt)
but its expensive, it takes up a lot of VRAM
I'm a hobby photographer, and I am just stepping into wanting to create AI generated landscapes for my fantasy world I'm starting to write.
ah
If you are looking for mostly portraits of people or subjects, or simple landscapes, current models are fine for that.
will we ever get sd3 large
if you want to make detailed scenes with meaning, then pixart and SD3 2B will be the perfect fit
when tho
800M, 2B, 8B
but for now, 2B is the most optimally trained one
like a month after medium
No no… like deep sci-fi/fantasy stuff. Really describe a scene and generate.
2B with all the toys will keep you occupied enough until the other models :3
ah yeah, then SD3 2B or Pixart will let you describe your scenes in detail and give you the best result you can achieve offline.
Pixart is completely free, SD3 2B needs licensing for commercial use (we are awaiting more info about this, but if you are just making stuff for your own enjoyment, ipso facto, personal use, then this does not affect you).
4D chess
can you describe an example scene?
Well, probably eventually will need commercial usage.
I'm honestly creatively brain dead atm. Been a long day
sounds good, here ya go
Not sure if that's a building generating energy. Or a weapon that landed and started embedding itself into the moon to destroy it…
I like SD3 for paintings too
Here is the image that you requested:
I've toyed with also training a model for a specific side project idea. But I have no idea if you start either full compositions, or individual objects by name.
I love this!!!
thanks
if all else fails, I'll still gladly use SD3 for paintings
basically just get a chat going with llama3, and keep changing elements until it does what you want.
I don't even know what this is!
its a text generator like chatgpt
or rather, an AI chatbot
you can ask for suggestions for prompts and etc
yeah, it's llama3 language model, running locally on ollama, which has the open-webui front end so it looks like chatgpt.
I need to know where to go to get all this stuff
llama generates images now?
oh yea but not for commercial use*
so using that combination of tools and a comfyui backend, you can have the chat interface generate images based off llama3 responses just by clicking the picture icon on the response.
so you can have a chat back and forth, generating images along the way
add this, change this.
you've just changed my whole world man. thank you this was really insightful. where did you learn all this?
- self portrait, "allegory of the cave" depicts @solid violet led to the light by @low stone
i do this stuff for work and spend way too many hours at home doing it.
Will SD3 be usable in Fooocus?
eventually
man this ultra on the api for sd3 is 8 cents per image now. I've thrown a good amount of money at it, but that's Dall-E money already. I think I'll wait for the 2b at this point
This is my fav so far
where 2 bees
In the imagination
is this lama thing worth it?
According to Terrence Howard 1 Bee+1Bee=3 lol
There are many competitions on Discord - and there are some very good prompts available - I borrow many from the OpenAI Daily Theme ...
(One Bee + One Bee) or not (One Bee + One Bee) - that is the question?!
Eric, the half-a-bee?
Using Llama 2 Chat 7B Q4 - this is its result - Against a dark, starry backdrop, a sleek, metallic structure rises from the lunar surface, its gleaming facade reflecting the pale light of a distant sun. The building's fluid lines blur the distinction between form and function, as if it's alive, pulsing with an otherworldly energy. As the moon's gravity pulls at its base, the structure's edges begin to distort, revealing a glowing core that seems to be the source of the energy.
I mean it depends how it's implemented. You don't even need to run both on the same device or at the same time. You could generate an embedding, save it, load the next model and pass the embedding to it if you wanted to be silly. You can also just prune, quantize etc.
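The "generate an embedding, save it, load the next model" route could look roughly like this. It is a sketch under loud assumptions: `encode_and_save` and `load_for_diffusion` are hypothetical helpers, the "embedding" is a placeholder list of numbers rather than real encoder output, and JSON is used only to keep the example dependency-free.

```python
import json
import os
import tempfile

def encode_and_save(prompt, path):
    """Stage 1: run the (placeholder) text encoder and write the result to disk."""
    # Placeholder "embedding": scaled character codes of the first 16 characters.
    # A real text encoder would produce a tensor here.
    embedding = [ord(ch) / 255.0 for ch in prompt[:16]]
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"prompt": prompt, "embedding": embedding}, f)

def load_for_diffusion(path):
    """Stage 2: a later run, or another device, loads the saved embedding."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["embedding"]

# The two stages only share a file, so they never need to be in memory together.
path = os.path.join(tempfile.gettempdir(), "prompt_embedding.json")
encode_and_save("a lighthouse at dusk", path)
embedding = load_for_diffusion(path)
print(len(embedding), "values loaded")
```

Since the second stage only needs the file, the text encoder can be unloaded (or run on a different machine) before the image model is loaded.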
Me next Wednesday if SD3M turns out to be what is promised.
Every time I work on a dataset or train a LoRA atm... I can't wait to play around with that sweet, sweet 16-channel VAE.
+tax = $3500
this one even has two fans
M
Will the model released on June 12th have the possibility to input an image or will be just text to image?
Quick question: did you use the SD3 API or a locally running ComfyUI with the API exposed?
You will be able to download and use it for anything you could with sd1.5 or sdxl
are you even answering the question
yeah i think festivalman hooked up the SD3 API to Ollama / open sourced LLM webui
Sorry for answering the question
A living room with a large window, hardwood floors, and a fireplace. In the center of the room, a stylish couple is examining different sofa options, including a modern gray sectional, a tufted leather chesterfield, and a mid-century inspired loveseat. The room is filled with natural light and the couple appears to be deep in discussion, considering the size, color, and style of each sofa to determine the perfect fit for their space.
cool stuff
yt thumbnail generator xD
I use both. It all ends up at their api in the same way.
Specifically with Ollama/open-webui, that's going to their supported sdxl json which gets sent via api to a local comfy. As of this minute, it doesn't support sd3 api for the image functionality. It only supports dalle / a1111 / comfy, with their prebuilt workflows and then your prompt or the llm response
I am 'Omosting' ...
Sounds hot.
Aye it is pretty dope.
Tested it a while, I made these:
Would they have been harder to make without Omost?
Yes.
On top, in front, etc. Otherwise too hard.
Also things like bleed between 2 objects.
Omost is regional prompting on steroids for lazy people.
I'll try some 'positional' stuff
So... You can download sd3 now?
No.
Oof, thanks soul
Trust me, literally everyone will know when it's available. You won't miss it. It'll be posted everywhere.
Are these sd3 or omost? Impressive if omost
These are omost.
Omost - here is my 'positional' prompt - a giraffe is wearing a top hat. On the hat sits a green frog, eating a red apple and reading a blue book. A snail sits next to the frog. A butterfly sits on the book
Ultimately very very poor positional takeup at all.
Yet the Omost text does specify all of this prompt ... it just doesn't deliver?!?!?!?!?
Hi
I think this is too much.
Keep it simple?
As I say, Omost picked up on all this and laid it out neatly section by section ...
I dunno if we'll be able to use sd3 instead of sdxl in omost but if we will, that's gonna be a huge improvement
does it say anything about changing the model on the github page?
If Omost is this disappointing though ...
you cant really say "disappointing"
cus it's using sdxl
we all know how sdxl performs
we can see a clear improvement there
Nerdy Rodent did an Omost Video on YT - he opens the code and drills down to the model name ...
Someone said it was RealVis SDXL?
I think so but you can just swap the SD XL model.
You just need to convert it to diffusers and route it via config.
Tell me how?
When I'm Omosting, I swap-out Torch 2.0.0-cu118 (Xformers in ComfyUI) for 2.3.0-cu121
... and then back again when using ComfyUI
DM-ed you some stuff to avoid spam here.
Could u send me too
Give me a prompt which will work in Omost, svp?
sd3
Where can I try sd3?
FABLAN@Glif
Stability api
@brisk cipher https://glif.app/@fab1an/glifs/clv55nhjy0003yr19n09wz0np if you want a completely raw one where you can do paintings and other artstyles, and even write negative prompts, I recommend this
gives you the most control
Omost - Peppa and Cthulhu
From SD3@GlifyGlif
Omost is not remotely SD3
Absolutely, at the moment the only thing (in my opinion) that competes with understanding the prompt and with good quality is Ideogram
Example
Is that SD3?
ideogram
It's not too far off then. SD3. I mean we are also comparing an undertrained SD3 model with a fully trained Ideogram. So I'd say that's pretty good.
In my opinion, i think SD3 did better with the Horse looking at the TV monitor prompt.
The burger on the screen looks more coherent on SD3.
I'm not saying it's better than SD3, especially considering that SD3 will have more training, and with controlnet, etc. It's just that, as of today, it's the only thing I see that's similar and free
Oh for sure yh.
But I also see that it is more obedient to the prompt "slightly plump". I'm not sure if 2B is capable of being that precise.
I am seeing a significant difference between CORE and ULTRA in the API. The first image is coming from CORE, the second is the result of ULTRA, same prompt and same seed. Ultra seems to ignore the details of the prompt. Is anyone here experiencing the same issue? In general everything done with SD3 in the API resembles the quality of SD-1.6
Prompt: Picture of a Anime-like real life woman with intricate face paint baroque style, Japanese facial features, realistic skin texture, photographic, photo-realistic
(ideogram + sdxl refiner)
if core fits your preferences, use core. Core is generally best stable setup and Ultra is best experimental/beta setup, rn Ultra is using experimental SD3 models and doesn't have the xl heavy finetuning that core has
If you don't mind, would you explain how to swap the model for Omost? I would appreciate it.
Nvm, someone made a fork to allow you to select a model https://github.com/runnitai/Omost/tree/main
SD3 VS My SDXL Merge
I know portraits are not going to be SD3's strength until people finetune the art out of it.
I cannot wait to upscale image with SD3 locally though.
that's pretty good for a base model
and it doesn't have that typical finetune look to it
girl
what prompts did you use for the hyena
Here is the image you requested.
is this what the fellow kids are into these days
lol why a ghost
You didn't specify
🤣
lol
sd3 request: man crying because there is no food in the fridge
Asking for a friend?
not interested
2nd is better, though he should be facing the fridge
also
there are still FOODS!!
should've been an empty fridge, i forgot
touché, that's true, well it is not that smart
ough
Does Omost not have good prompt coherence?
Went for depressed instead of crying.
Just trying Anytest Controlnet ...
Originals made using Portrait Master. ComfyUI+Anytest Controlnet - prompt = beautiful leopard, sunny glade, hat, cinematic lighting. [NightVisionXL, dpmpp_2s_ancestral, karras, 30 iterations]
Originals made using Portrait Master. ComfyUI+Anytest Controlnet - prompt = beautiful peacock, sunny glade, hat, cinematic lighting. [NightVisionXL, dpmpp_2s_ancestral, karras, 30 iterations]
Originals made using Portrait Master. ComfyUI+Anytest Controlnet - prompt = beautiful fox, sunny glade, hat, cinematic lighting. [NightVisionXL, dpmpp_2s_ancestral, karras, 30 iterations]
What's the best generator for producing text?
Any usable on a phone? (Android) Not near my computer anytime soon lol
Nothing on the phone works.
@noble coyote Thanks pal
Just trying on my Android phone ... Ideogram
nice
I logged in to Ideogram (free-to-use) via my Google account, and have generated these 3 images via my Galaxy A9 phone - SD3 will have to be very (very) good to get to this standard of text!
It may actually get these correct, it's just that the layout might look too 2D or pasted on. We'll see eventually.
A staff member in the office is wearing yellow clothes with three conspicuous letters: PPT.
ddf
If I gave you a 4031x2687 image of squirrels to re-run with whatever prompt that was, could you convert it?
Yeah -- I'm deff looking to eventually use stuff for commercial purposes. Doing some research for a project right now. Need to learn how to do stuff in general, and then start working on building a model for the plans we have.
A close-up of a hyena's face. The hyena appears to be in a contemplative or playful pose, with its front paws placed near its chin, as if it's holding its face. The photograph is in black and white, emphasizing the texture of the hyena's fur and the intricate details of its facial features. The background is blurred, putting the focus entirely on the hyena.
One of the big benefits of ideogram is their magic prompter. It's crazy good at making great scenes with little input.
When I saw this I was like ... Oh look, it's a descendant of Frog from Chrono Trigger.
if you use it and not make money, you can use it with no difference at all
it doesn't affect people like us
its only for the people who want to make money using the model who might have a bit of trouble
it'll get sorted out at release.
The Goat needs 6 legs to be accurate.
what?? faces are clearly the strength of sd3, there's a ton of variety of faces, all sd 1.5 and sdxl faces look the same, when finetuned it will be mj6 quality for sure
I have never met anyone with eyes like this...
But an upscale with SD should help, just too expensive on the API right now.
this is a super undertrained base model, you're comparing it to an extremely finetuned model with a workflow and highres fix, of course it's not gonna be as detailed
while all the people on sdxl look exactly the same
result of overtraining
No, they were both 1024x1024 raw outputs.
and merging
but still you chose the worst result that sd3 could have given you
Looks kinda hot tho.
I'm not saying it is bad I really like it.
i've had far better results with better prompting, gpt4o really helps a lot in most cases
This is what SDXL looks like after my workflow (6K zoom in), I'm sure SD3 will look great as well.
it's crispier higher res yes, but the skin looks like plastic and eyes soulless, like doll, and i'm sure every image is almost the same, no variation, almost all sdxl models look like that
I can do less smooth skin, a lot of people just prefer that look.
also the people are looking in different directions, sdxl people always look straight and the eyes are focused looking at the camera in a soulless way
It uses a chatbot to plan out where the given object will be in an image and then uses normal SDXL with some regional prompting extension
It has nothing to do with how SD3 architecture works
Often my SDXL 'subjects' give no clue that "there is a camera at all!" Until I use Face Detailer; then every subject is looking straight-to-camera!
We're also using an undertrained version of SD3 currently too!
me too man, imagine what things companies like leonardo.ai will do
I wonder how long after the release of 2B that it'll be supported in A1111? I can use Comfy but it's not ideal
Paintings seem pretty damn good to me.
https://glif.app/@Jib/runs/wy2hokyaknzijufrcujo3l4d
what are the requirements for sd3
How about doing SDXL-> SD3 or SD3->SDXL->SD3?
i have 8gb vram 32gb ram
I was fooling around with denoise yesterday. The biggest issue with sdxl is multi subject refining. It just wants to make everyone the same thing. I managed to upscale by having the various refining stages with various denoise levels, but it was super finicky and needed different denoise numbers per image, so in the end it wasn't useful long term. I'm really hoping that refining with sd3 will solve this because it understands multi subject.
For example.
Ella is really good at complex multi subject. But it just makes the sdxl upscale's job really hard
yet have you tried the way mentioned? ๐
That would be 16 cents per image at this point. This new api pricing is too expensive now. I can't see myself using it anymore.
I get 1000 images a month from ideogram for $20 and I can img2img with sdxl or sd3 2b when it comes out
yes, I see, I just thought you might have done it already
Sd3 is dalle price now, but we all know that for most things, it's not as good (the censored api sd3). Obviously local sd3 is a different story.
which one is Ultra
That should be enough to run the 2B model that will be released on Wednesday, but not the huge 8B model that's coming eventually. 2B seems to be plenty good enough for most uses!
yeah considering Ultra is a whole SD3 workflow, it's less acceptable that it's DALLE price
i hope we could run at least the 4B model on 8gb vram
they might skip 4B
yeah, the 4B doesn't seem like it has that much of a use case, stuck between the top quality 8B and the 2B that's balanced between performance and output quality
I suspect most finetunes and LoRAs will be based on either 2B or 8B
what about upscaling with ella + kohya deep shrink?
I'm upscaling with ella for that first increase from native 1.5 res up to 1.5x that, but then going sdxl after that. the problem is that sd 1.5 checkpoints aren't very good compared to sdxl which are amazing at this point.
so it'll keep multi-subject until high res, but compared to sdxl refinement if it works, is night and day quality.
hope there will be a way to quantize 8b
, because I feel like 4b would be a perfect option for 12gb cards
oh thanks
I found sdxl hyper really good for upscaling with only 4-8 steps
that's what my ella workflow looks like.
first 2 on left are ella native, last 2 are sdxl.
so as you can see, the sd 1.5 checkpoints just don't have the content in them. they're very lacking in training outside porn.
yeah i'm using sdxl hyper now all over the place. it's amazing
does it know Bob Ross style? :3
why?
or 8gb cards too
they promised to release all versions
2b is too low and 8b too high, i think 4b is what majority of people would use
Store at FP8, load in FP16, less space used and iirc, less VRAM too maybe
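Back-of-the-envelope math for the weight sizes being discussed, as a sketch. It assumes round parameter counts (2e9 and 8e9) and counts weights only, so the text encoders, the VAE, and activations all come on top. The bytes per parameter are the standard ones: 4 for fp32, 2 for fp16/bf16, 1 for fp8. `weight_size_gib` is a hypothetical helper name.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def weight_size_gib(num_params, dtype):
    """Size of the raw weights in GiB (weights only, no activations or encoders)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)

# Assumed round parameter counts for the "2B" and "8B" models.
for name, params in [("2B", 2e9), ("8B", 8e9)]:
    for dtype in ("fp16", "fp8"):
        print(f"{name} {dtype}: {weight_size_gib(params, dtype):.1f} GiB")
```

If fp8 weights are upcast to fp16 when loaded, the VRAM figure to look at is the fp16 one, and the fp8 number then only describes the space used on disk.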
already too many choices to split the community
It will give you Boss Ross if you like (take that Dalle.3).
yea sdxl is nice, can't imagine how good upscaling can be with 8b 
there is no 4B and community split talk never seems to affect anything
i mean sure... :3
hope it gonna work well
however, 2b could be really enough with controlnets and other stuff
maybe with some optimizations? Even sdxl (2.6+something) takes about 8gb on my pc
it runs perfectly on my 8gb card, really fast also 12 seconds an image
oh, I will run out of ram for 8b 
even loading sdxl peaks roughly at 14gb
is that the standard model with 20 samples?
i think it was 15-20 steps i dont remember
1024px, 20sp, sdxl = 19s on 3060
not bad
then it is really nice
I didn't expect
yeah, on comfy tho
is the artisan bot using sd3 release candidate already?
those hands are transcendental
she has 2 rings but ill let that slide
i guess that could happen
can it protect your chicken from Dokken?
very epic
elden ring boss
very sharp details
edge of technology
cutting edge
edging
hello sd3 make me lego ninjago characters
a tornado of library books, weird supernatural ghosts flying around sliming the library
slimy ghosts
Has there been any word on if any finetuners got early access?
does anyone have a ComfyUI workflow that's compatible with SD3? I realize it's not released yet but I'm guessing for those who do have early access there's already a workflow developed for it? I just want to make sure I get the nodes installed and ready for launch day
I wonder how long after release of the model that we'll be able to train and finetune using Kohya? I'm definitely excited about the possibilities with SD3 when it comes to finetuning
