#🆕|sd3
I don't know how long kohya will take to get it up and running with the UI, but I'm sure there will be some update to accelerate and the other Python libraries that will support it, and when that happens you'll be able to do it manually with a CLI script (all kohya's interface really does is give you a convenient UI to build the command, then run it in the command prompt)
Unfortunately I have no idea what that means. I don’t know much about Python itself.
SD3 loras = 2 weeks after the weights
SD3 finetunes = 1 month after the weights
but good finetunes and merges need 3 months minimum after the weights.
kohya sd-scripts is a cli script.
you don't need updates to accelerate (you don't even need accelerate if you use a single gpu). You need updates to the training script
Where can I see the results of 2B?
"the villain, a villainous villain making villain stuff, in a dark alleyway"
very villain
2 weeks? 3 months? Why so long? Fine tuners will already have datasets ready to go.
I know it is, I already said it's just a UI to make things easier to build the command. There's even a button to print the command in the command window so that you can use it without the interface if you desire. I said accelerate because that's literally what gets called when you run kohya, it sorts the rest out. But yeah, the other libraries, like I mentioned, will need to be updated before anything can propagate downstream from it.
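To make the "the UI just builds a command and runs it" point concrete, here is a minimal sketch in Python. The script name `train_script.py` and every flag name below are hypothetical placeholders, not the real sd-scripts options; the only real piece is that `accelerate launch` is the launcher placed in front of a training script.

```python
import shlex

def build_command(script, options, use_accelerate=True):
    """Turn a dict of settings (what a UI form would collect) into an argv list."""
    argv = ["accelerate", "launch", script] if use_accelerate else ["python", script]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")          # boolean switch, no value
        elif value is False or value is None:
            continue                           # disabled options are left out
        else:
            argv.extend([f"--{name}", str(value)])
    return argv

# Hypothetical settings; these flag names are placeholders for illustration only.
options = {
    "model_path": "sd3.safetensors",
    "learning_rate": 1e-4,
    "cache_latents": True,
    "use_xformers": False,
}
argv = build_command("train_script.py", options)
print(shlex.join(argv))  # the same string a "print command" button would show
```

Passing `use_accelerate=False` gives the plain `python train_script.py ...` form, which is the single-GPU case mentioned above.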
Should only take a few days to weeks before you can train though.
yes, and that's wrong
kohya_ss is the python script that does the training. It's not a UI
there are several UIs on top of kohya_ss, that is what you are referring to
accelerate had nothing to do with that. It's just a wrapper script that simplifies things like multi gpu training
See what I said about other python libraries...
And accelerate can then call those other libraries. But yeah, I should have worded things a little clearer. Accelerate is like saying diffusers in the sense of it simplifies all the platform/hardware stuff and simplifies the calling of other libraries. Can you manually call those the hard way, sure. The training scripts will be updated pretty quickly, but won't work their ways into UIs until things like accelerate are updated, since you'll have people trying to train on a variety of PC environments
yes and that is still wrong lol
you don't have to update accelerate
accelerate is really just a "simple" wrapper that lets you switch between different precision types and gpu setups.
and you're 100% positive that it will be able to handle switching to the requirements that the DiT architecture has?
doing that manually is not more effort or complicated than using accelerate. The idea of accelerate is that you can easily turn off these things without changing the code
yes
DiT is just a transformer architecture running on pytorch
and we had this discussion already but: DiT is using transformers, sdxl is also using transformers. You call the same pytorch kernels under the hood
i suppose we'll see, just figured there'd potentially be compatibility issues with SD3 due to all the components that will have to be shuffled around like if you're trying to train sd3+encoders+t5 at the same time. i imagine they will add in some kind of control over where to keep the various models, like having the T5 stay in ram instead of vram and so on. but yeah, at the core, they are all going to just call pytorch at the end of the day
yes, but accelerate is not doing these kind of things
i'll read through a bit more, but it definitely looks like it handles where shit sits
from a wrapper point of view
But anyways, again, other libraries are going to have to be updated, along with training scripts, etc etc, before you'll see UIs implement them. Your average user isn't going to want to go into command prompts with a scalpel to manually train. 99% of people are going to rely on things like kohya and onetrainer
So it will likely take days to weeks for all the stars to align to be able to just do it through a UI
You can freeze the embeddings for the text encoders and the image latents before training to save vram. All the stuff for training DiTs is in pytorch already otherwise pixart and Hunyuan couldn't train.
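A rough sketch of that pre-computing idea in plain Python: run the frozen encoder once per caption, keep only the outputs, and let the training loop read from the cache. `fake_text_encoder` is a hypothetical stand-in (a real encoder returns tensors), so this only shows the control flow, not an actual SD3 pipeline.

```python
calls = {"encoder": 0}

def fake_text_encoder(caption):
    """Stand-in for a frozen text encoder (hypothetical; a real one returns a tensor)."""
    calls["encoder"] += 1
    return [float(len(word)) for word in caption.split()]

def precompute(captions, encoder):
    """Run the frozen encoder once per unique caption and keep only the outputs."""
    return {caption: encoder(caption) for caption in set(captions)}

captions = ["a red ball", "a green cube", "a red ball"]
cache = precompute(captions, fake_text_encoder)

# After this point the encoder could be unloaded; training only reads the cache.
for epoch in range(3):
    for caption in captions:
        embedding = cache[caption]  # dictionary lookup, no encoder forward pass

print(calls["encoder"])  # number of encoder runs, independent of the epoch count
```

The same pattern applies to image latents: encode each training image once with the VAE and store the result.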
that's true, so there's already a leg up there
stuff like cpu offloading, pre-computing and so on has to be done by the library itself (diffusers or kohya_ss)
but you are right for sure that it will take some weeks until everything runs smoothly for the end-user
Yeah not everyone but the enthusiasts and early adopters will want to fiddle with cli scripts.
right, again, the main point is that people want to be able to do this in UIs like kohya and onetrainer, so all the various libraries used in those are going to need to be updated. there's a reason why linux only has a 2.3% (counting steamdecks) usage rate for home PCs, people hate CLI bullshit (even though modern versions don't make you do cli stuff really anymore, people still have the bad taste in their mouths from when they tried linux ages ago lol)
We're just seeing new people starting to train pixart now that OneTrainer added support.
i saw that they finally added support for it. personally, i like onetrainer more than kohya
WSL does kinda help with that though if you truly wanted to train a lora the day SD3 comes out (assuming they release the LoRA code same day)
even if they don't release it day one, i'm sure people would try to frankenstein something up anyways
if you have fine-tuning code going to lora is trivial
I'm sure. The code will be there in diffusers, just gotta glue it together.
it's not some new architecture or anything, llms have been using it for many years now, so a lot of the functions are already there. obviously, you just have to get the steps in the right order for the layout of how SD3's specific flavor works though
Wonder what happens if you give T5 and clip completely different prompts entirely that contradict each other when inferencing. 🤔
Similar things have already been done with SDXL
Anyway I assume clip would dominate
So after following the conversation, from my understanding, even tho people already have early access to the weights, none of those people have knowledge or ability to integrate or update the UIs
comfyanonymous :
Seems like another missed opportunity by SAI, not giving early access to devs who integrate stuff and fine tuners, seems like a strong case for that VIP room
Bro literally SAI staff
So we’ll have to rely on CLI scripts to generate anything after release? I realize someone said to expect workflow for comfy quickly after release but it seems prudent to just have given access to devs who would be making this work so it can be refined and ready for launch
Lykon, who also got access to SD3 Medium, said he will train a new Dreamshaper based on that model
I’m a big fan of dreamshaper, is lykon the guy who made it?
Yes, he just said that in Civitai Discord server
I believe comfy will be updated same day as launch to support SD3 (it's literally what they are using to test and generate the images you see on X and discord)
SAI literally use Comfy to generate SD3 images yeah
Maybe one in a hundred thousand to million users train stuff. It's really not as high of a priority as you'd think. We're just in an echo chamber of enthusiasts. Plus, I'm sure SAI just can't wait to see all the degenerate stuff people will immediately train; that further gives AI a bad rap and further pressures governments to want to crack down on generative AI even harder...
Gradio is so easy to use that updating the UI to train SD3 should be quite quick once you've got a CLI script to train it.
So comfy supports SD3 internally and they’re just going to release the weights along with the software update I’m guessing?
I’m not concerned or interested with training stuff just wanna do basic level stuff like use it with comfy lol but I do see your point
Yes
But personally I think using config files is easier in the long run than dealing with any UI. The only benefit the UI really serves is reasonable presets for training.
Well, I don't think it matters that much for high quality finetune creators
This is what you get without regulations
Everyone (at least here) hates regulation, but it must be there, for legal purposes
So you prefer editing a file and running a command over using a UI?
You do not want people to easily generate porn with your face on it and share it online
funny that they use that excuse to regulate ai models but ai military models used to bomb other countries are not regulated
AI Military Model? Care to share one of them please
Yep, and so long as people are non-consensually making vulgar loras of people, along with other endless amounts of illegal content, governments will feel more and more pressure to regulate and enact laws
If you cannot show it then it is just scaremongering
if you live on land within the planet earth, you are subject to the rules of who ever owns that land. they make the rules, end of story. military offense/defense is critical in keeping a nation going. making waifus and training ||drippy butthole || loras are not.
(i was legit horrified to learn that it wasn't just a meme, that these were actual loras people trained...)
oh yea bombing people with ai drones is better than some nude b
look, i get it, you have a toy and you don't want people telling you how you can and can't play with it. but you're comparing two completely different things
not really, we are comparing government regulation of ai and why they aren't touching ai military models
reading comprehension 101
oh i get what you're trying to say, but what i'm saying is that you're too immature and self-centered to realize that they aren't even remotely in the same league
this isn't checkers, it's chess
national safety is exponentially more important to a country than you playing with ai waifus
government wants to regulate ai but doesn't touch ai military models, there i kept it clean and simple for you, seems like english is not your forte
if they truly cared about security they would regulate both
it's actually my third language, yes. the governments want AI for the military for national security. it's an arms race... protecting your country from other countries that want to harm your country. AI will make a larger difference in physical and virtual warfare, than going from the sword to a drone strike.
no, they will regulate the one to prevent internal stability issues that would arise from people making fake content that could cause economic/political issues within a country. like deepfaking a politician doing or saying something
I mean that's already been possible for years and years with Photoshop premiere, etc.
not easily
this reads like one of those evangelical clowns in america telling me why killin ppl is better and safer than sex
and not doable by 99% of people that can type simple words out
i'm an atheist, but reality is reality. if you want to keep a country running, you need to stay a step ahead of your enemies and rivals
No, but they don't really matter in the big scheme of things, instability comes from large actors who have enough money to be consistent
@storm saffron i need to create images where should i go
I dunno.
yes the ai drone that killed the humanitarian workers from belgium in gaza helped to keep my country safe
do you know how many civilian casualties there have been, historically, in wars throughout the past century?
it absolutely sucks and ideally should be zero
but modern warfare has cut it down astronomically
Anyone know ??
thats true,to increase that number we should train more models to automatically target ppl based on an algorithm
perfect argument for ai robots in wars
which is why im all for it in militaries. let the bots duke it out with the other bots and leave the people out of it. it would just be the modern version of siege warfare, minus wasting human lives
yea if your family dies because of a misplaced airstrike, its just collateral damage bro,man up
the point is, that number will be far lower vs historical statistics
And to target you and your family?
Yeah, sarcasm 100%
so yes, governments are going to continue to strengthen AI usage in their militaries and yes, they are also going to crack down on what citizens are allowed to do with AI usage. you can cry into the wind about it, but it is how it is
We went from SD3 to whole AI military regulation and usage
well there was someone cryin over waifus so it always goes down that way
yeah, it was a dumb false equivalency fallacy thing that someone brought up, like usual
can you do the same image but monochrome fuzz pets and instead of kids replace it with men in black suits? lol i dont know if you're taking requests or not but that one came out really nicely, looks great
i would better see the mice facing to the side instead
no way
Ribbed, for HER pleasure.
i'm curious, is the SD3 prompt comprehension better than pixart sigma that has a 20GB T5 model ? Has anyone compared them both ? I'm currently using sigma with sdxl refiner and it can understand things like concepts, emotions and feelings and represent them visually like a real artist. Curious if SD3 text comprehension can do that
The issue isn't so much the size of the t5, it's how well it maps to the image generation model. Think of t5 like chatgpt and how well it can handle and understand conversational input. Since it can better understand semantics like a red ball on a green cube on a blue table, it can then do a better job (in theory) of pairing those colors to those shapes when you go to generate it. That 20gb is it in fp32, there are also fp16/bf16 versions of it that are around 10gb. Pixart sigma is awesome, but is pretty small and not all that trained yet (in comparison to say sdxl finetune level image quality). SD3's t5 will likely be on par with pixart's, but the image creation half will blow it out of the water since it has a far larger dataset and more training to map the concepts
base image quality in sigma is shit, that's why i use sdxl refiner
Also keep in mind that SD3 has three input encoders: the same clips from sdxl and t5. I'd imagine you can run it with any combination of the three, maybe even pure T5, but I can't remember if they've verified that or not
yeah sigma is meant to be more of a research paper kind of model, they don't have the resources to fully "train" train it all the way
It costs a lot to make quality base models
but sigma is great for hashing a scene out to then be resampled with sdxl. it has kinda been the poorman's sd3 while we wait for sd3 to come out
how many gigabytes do you guys expect SD3 2B model to be? looking for a ballpark estimate
Are there any open source image generation models other than sdxl 1.0 and pixart?
hunyuan-dit, lumina-t2i, Kandinsky 3.0
Hunyuan has been great. Has different composition than pixart so it adds variety to use both
Thank you so much
exactly, however i find that with the poorman's sd3, i prefer the images this workflow does than midjourney. Midjourney does not do what you want most of the time
also i'm using 20 gb model since i can use it in cpu. The half precision one only works on gpu with comfyui
and i need gpu room for sdxl refiner
i only have 6gb vram
Yes, that makes it easier to test out different parameters when training.
Was kinda just hoping it would spit out a chicken since they are living descendants of them lol
(well all birds are living descendants of theropods)
yes xD
another result xD
what the hell
Lol thats a dualtops.
FBI catfish?
xD yes
the thinnest smartphone that exists android
the thinnest smartphone that exists android 1999
New china model (Lumina-Next-SFT) https://github.com/Alpha-VLLM/Lumina-T2X
If these devs keep this progress up this lumina thing might start to be actually good
looks impressive
sd3
WTF ... this is awesome!
i could see that, where you need to batch things so you quickly script a thing that'll batch a few different scenarios to examine the results. sounds about right?
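Something like this is what a quick batching script could look like. It is only a sketch: the setting names and values are hypothetical, and it just expands a base config into one config per combination, which you would then write out to files or feed to whatever training command you use.

```python
import itertools

def expand_grid(base, sweep):
    """Yield one config dict per combination of the swept values."""
    names = list(sweep)
    for values in itertools.product(*(sweep[name] for name in names)):
        config = dict(base)               # copy so the base config is never modified
        config.update(zip(names, values))
        yield config

base = {"epochs": 10, "resolution": 1024}                       # hypothetical fixed settings
sweep = {"learning_rate": [1e-4, 5e-5], "rank": [8, 16, 32]}    # hypothetical values to compare

configs = list(expand_grid(base, sweep))
for i, config in enumerate(configs):
    print(f"run_{i:02d}", config)
```

Two learning rates times three ranks gives six runs, each differing from the base in only the swept keys, so results are easy to compare side by side.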
Anyone here using Hunyuan dit if so can you please share some examples
That's the new finetuned version, right?
I need to reinstall it to mess around with it again
cool
can you use it in comfyui ?
this starts to look good
If nobody has answered... 4GB
thanks i've been super curious about that, so the 2 billion parameter model will be around 4 gigabytes and the fine tuned models are expected to be 6-7 kind of like SDXL right?
Nope, 4. SDXL was around 6 due to also including CLIP and VAE in the safetensors file, if they also include the VAE and CLIP in it then it'll end up around 5.5GB (ish)
SDXL was 2.6B and ended up at 5GB just for the UNET
Without VAE and CLIP
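The arithmetic behind those numbers is just parameter count times bytes per parameter. A small sketch, assuming the weights are stored in 16-bit (2 bytes per parameter) and ignoring file metadata; the 2B and 2.6B figures are the ones quoted above.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weights_gb(params_billion, dtype="fp16"):
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(weights_gb(2.0))          # 2B model stored in fp16
print(weights_gb(2.6))          # SDXL UNet figure quoted above, in fp16
print(weights_gb(2.0, "fp32"))  # the same 2B model at full precision
```

Bundling the text encoders and VAE into the same file adds their parameters on top, which is where the extra size comes from.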
Lumina-Next-SFT prompt: "This image is a captivating digital artwork that portrays a surreal scene set in a misty, swamp-like environment. The dominant color palette is a deep, neon purple, which bathes the entire scene in an otherworldly glow. The atmosphere is thick with fog, obscuring the details of the trees and vegetation that surround the scene. The water in the foreground reflects the eerie light, creating a mirror-like effect that adds to the sense of depth and mystery.
In the center of the image, there is a figure seated on a motorcycle, poised as if ready to depart. The rider is clad in a dark, form-fitting outfit that merges with the shadows, and their helmet has a reflective visor that mirrors the intense purple hue of the surroundings. The motorcycle itself is sleek and modern, with a design that suggests speed and agility. The rider's posture exudes a sense of anticipation, as if they are waiting for the perfect moment to make their move.
Lightning bolts pierce through the dense fog, creating a stark contrast with the otherwise monochromatic scene. These bolts of electricity add a dynamic element to the image, suggesting a storm or a supernatural event. The lightning's jagged lines and the way they illuminate the mist create a dramatic and intense atmosphere, enhancing the overall sense of suspense and wonder.
The image is a blend of natural and surreal elements, creating a dreamlike quality that invites viewers to immerse themselves in its mysterious world. The neon purple color scheme and the ethereal lighting contribute to a feeling of otherworldliness, making the scene both captivating and unsettling."
at least 8 GPUs are required for full fine-tuning of the Lumina-T2X 5B
Ouch.
i think there is a 2b model?
at least it says 2b in the web gui
There's a bunch of models and it's not that clear how to actually use any!
yes they should have separated the models into different github repos
but i guess it's for more stars or something
no requirements.txt either for pip
What's SFT?
i think that means its finetuned to look better and its not a base model
Not just the base model safetensor's version?
"Lumina-Next-SFT is a 2B Next-DiT model with Gemma-2B serving as the text encoder, enhanced through high-quality supervised fine-tuning (SFT)."
xD
What was your prompt?
i think it understands natural language better so my prompt was not optimal: "photograph of ghost special force agent, adorned in all-black human anthropomorphic furrsona fish in fursuiter at a con highly detailed, the interplanetary from "2001 a space odyssey"
no
this one is more like a fish
"photograph of ghost special force agent, furrsona detailed, the interplanetary"
Anthropomorphic, ✅
I have seen some disagreement about sd3 75 tokens due to clip or 512 due to T5 a while ago, is this still concern or it is not a problem?
it's still a problem
OK now it's gone more furry since I changed furrsona to fursona
we are not sure how it will behave
these models don't like misspellings
Comfy needs a spellchecker node
someone mentioned "clip stacking" or something like that, could it be a solution?
I guess it's going for Ghost Recon?
Considering it's a 600M param model that I badly finetuned, it's doing alright. 😄
i am not sure.
yes that's smaller than 1.5
it might allow 512 tokens
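If "clip stacking" refers to the prompt-chunking trick that some SDXL-era UIs use (split a long prompt into pieces of up to 75 tokens and encode each piece separately), the splitting step looks roughly like this. Whether SD3 tooling will work this way is not established here, and whitespace splitting is only a stand-in for a real CLIP tokenizer.

```python
def chunk_tokens(tokens, limit=75):
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]

# Whitespace splitting is only a stand-in; a real CLIP tokenizer produces different tokens.
prompt = " ".join(f"word{i}" for i in range(200))
tokens = prompt.split()

chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

Each chunk would be encoded on its own and the resulting embeddings concatenated, which is why a 512-token T5 window would make this workaround less necessary.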
Well DiTs behave more like models 1.5 - 2x the size of their UNet counterparts of the same parameter count.
what do you mean
I mean they behave more like models bigger than the equivalent UNET model.
oh
Hard to explain. A 2B parameter DiT is more like a 3-4B parameter UNET in capability
vs sd3, but sd3 was only able to fit a small part of the prompt
daim
i think one problem of the lumina model is that they use sdxl vae
That could be an issue, especially with small faces and text.
Same problem with PixArt
IDK if I can be bothered to get Lumina working with SD3 just a few days away
xD if i would be home i would install it xD but you can also just use the web demo
lumina vs sd3 (but sd3 is like 8b and lumina is like 2b ) The image depicts an alien-like creature with a large, elongated head and dark, almond-shaped eyes. The skin appears textured and rough, reminiscent of reptilian or amphibian skin. The creature is sitting in a body of water, partially submerged, with its legs and lower torso hidden below the surface. The background is foggy, adding a sense of mystery, and features tall, thin reeds and barren trees, creating a marsh-like or swamp environment. The overall atmosphere is eerie and otherworldly, with muted colors and low light enhancing the creature's unsettling appearance.
thanks for the technical answer i appreciate the insight
I like the composition and light of the 1st one.
yes
but the second one looks more like a photo
but i never said photo in the prompt
PixArt. (600M param)
this looks also interesting
"The image shows a cozy, eclectic room with a vibrant, colorful ambiance. The ceiling is draped with multiple tapestries featuring intricate designs, including mandala patterns and depictions of plants and celestial motifs. The lighting is soft and atmospheric, with various sources contributing to the overall mood: Ceiling Lighting: There are red, pink, and purple lights that illuminate the tapestries, highlighting their patterns and adding a warm glow. String Lights: Multi-colored string lights are draped around the room, adding to the festive and relaxed atmosphere. Television: A flat-screen TV on the wall displays a scene from the animated show "The Simpsons"
cool
Lumina seems pretty undertrained or not at all finetuned.
yes this looks undertrained
impressive
PixArt constantly impresses me for its size
did Stability improve fursuits?! "The image shows a person dressed in a colorful fursuit against a plain pink background. The fursuit features a large, stylized head with prominent, pointed ears that are white on the inside and purple on the outside. The head is covered in bright red fur with a white stripe across the face and purple fur around the muzzle. The eyes are large, black, and oval-shaped. The person is wearing a black shirt with long sleeves and a purple skirt. The sleeves have blue and purple striped arm warmers extending to their paws, which are covered in purple fur. The lower part of the fursuit includes purple fur-covered legs and feet. The overall style is vibrant, playful, and highly stylized, typical of fursuits often seen in the furry community."
maybe the lumina model wants different prompting or something
I guess they would have trained it on fursuits to please a certain demographic?
It's definitely NOT a fursuit. It's a weird combo between the two. Hmmm... Try "Mascot Costume"
idk but Stability models are normally very shit at fursuits. i know that xD
Mascot costume is not as bad, but still not good. 😄
Note to self: if I retrain my finetune of pixart, add more furries.
the only one good at furries is OpenAI, here are my tests xD
My dreamMODE model for SDXL is pretty good at furries.
also fursuits?
I'll try your prompt in it
the prompt is just made with chat gpt so it might not be a good one
It'll probably miss out half of it due to being SDXL
xD
yes it looks better
that's dreamMODE CosXL on civit if you want it. 😄
i just asked chat gpt to generate the image with the prompt and it made this
Why so much fire?
I have no idea I tested your prompt.
That seems like a lot of fire for it not being in the prompt. 😄
Ok ok I admit, I might have spiced it up a little.
did you just add "on fire"
✅
Lumina, Ideogram, sd3 "The image is a top-down view of a photograph placed on a wooden surface. The photograph appears to be in black and white and has a vintage quality. The edges are clean and straight. The subject of the photograph is a person standing and wearing a long dress with a high collar and buttons down the front, with their hands clasped together in front. The person is wearing a mask or headpiece resembling a goat's skull with large, curved horns. The background behind the subject is dark, suggesting the use of a flash which highlights the subject and makes them stand out against the darkness. The photograph itself is not dirty or damaged. Below the photograph, handwritten text reads: "Fear is weakness.""
I mean there are some sub-sets of furry culture to also try prompts for, but that might get removed. 😄
Tell me more.
at the moment? no, not that I'm aware of.
The information will be used very wisely. 😄
As u know me. 😉
Research is left to the reader. 😉
I do thank you kindly for the nice bat minion tho. 😄
gimme prompt for these 2

lmao the explosion became a christmas tree
can the SD3 API generate stuff that's like licensed/copyrighted? Like if you ask for it for super mario riding a skateboard on a sidewalk will it do that or will that throw an error?
oh wow so like fully unlocked lol cool thanks for the share
The Monster was:
In a foggy mountain pass, a repulsive creature emerges, but its form is indistinct and partially obscured by the dense mist. The winding road disappears into the thick fog, flanked by towering pines that create a tunnel-like effect. The fog muffles sound, creating an eerie silence and a sense of mystery and solitude. The creature's grotesque appearance is only vaguely visible through the haze, blending seamlessly with the eerie atmosphere of the journey through the unknown
The Cat:
cinematic photo film still of a little cat, (close shot), snow treading, rain dripping, fog filling, .shallow depth of field, vignette, highly detailed, high budget, bokeh, cinemascope, moody, epic, gorgeous, film grain, grainy, cinematic photorealistic, 8k uhd natural lighting, raw, rich, intricate details, key visual, atmospheric lighting, 35mm photograph, film, bokeh, professional, 4k, highly detailed
So they just released a new version of Lumina-Next-SFT which is kind of impressive and more prompt following. I prodded city96 to see if we can get comfy support for it.
@vapid radish am I doing it right 
Would you rather fight the monster-sized cat or the cat-sized monster?

bs has nothing to do with furries and it also exists in every other group of people
I'm confused. Anyway let's move on.
Yes I love that stuff like pop culture characters and etc weren't omitted, and gore/blood neither!
And yet a bunch of artists were opted out and it doesn't change the quality of the model that much
And fine-tuned models are yet to come
Side by side comparison: left is SD 3 Ultra raw, right is the Upscaled + Eye-Corrected version
Prompt: anime art, 1girl, kemomimi, harpy, curvy, [white|black] hair, looking at viewer, [yellow|aqua] eyes, hood, fur coat, underbust, corset, night, moonlight, castle, sitting, squatting, Large_wings, hands between legs, from below, rooftop, wide_shot
I prefer the left on all occasions as far as general image composition but the right has better eyes, if i could take the eyes from the right and the composition from the left that'd be the winner, feels like upscaled is making the image overall worse
The upscale toned down the colors a bit, and some parts of the lighting made it worse, I agree. But the final quality is still superior to the original.
Yes, I could do that. Correct the eyes without the upscale. Very easy to do. 👍
so you have access to the sd3 weights and you're doing this in comfy or something?
I'm using the official Ultra API colab for the SD3 generation. And the post in Fooocus
soo cool, pls whats the prompt for these
a fat man 40 years drinking beer funny picture
nice
can anyone tell me what version of the T5 model SD3 uses?
Llama3 prompt (it's so creative) and SD3. June 12th is gonna be great: A grotesque Trump mask made from human skulls and twisted metal hangs upside down from a rusty chain suspended above a trash-strewn alleyway illuminated by flickering fluorescent lights casting eerie shadows on crumbling brick walls.
after that we only need to wait 2 weeks for 8b 👍
whats 8b?
8B is a unit in NieR:Automata, and one of the targets in the "YoRHa Betrayers" quest. 🤖
great, cant wait
@desert garnet
since when did you become the sd3 ceo
2 weeks ago 
dude pls whats prompt for this
are these made with ideogram?? 😲
Yes 🙂
These are with SDXL, but I used national geographic documentary style shot of a living avian theropod
No it's a real photo.
Have developers of tools like a1111, fooocus or comfyui received the sd3 models in advance or will we have to wait a bit for integration?
Or can we use the model directly?
I am pretty sure comfy has them and is ready for the "launch". I doubt that A1111 and Fooocus will support SD3 from the start.
Oh and swarm obviously too.
ah too bad, in any case I can't wait to see the workflows that it will be possible to do in comfyui
Request: Can anyone try to push sd3 to its maximum capacity for realism in ai images
living avian theropods
You can use artisan in the other channel for that
news reporter with a microphone standing in front of a chaotic scene where a hideous gigantic monster is destroying buildings and stores
what is your prompt?
"Sonic the Hedgehog does Sears Catalog?!?!?" 😄
SDXL, not SD3
i dont think thats the prompt
who pinged?
nice! thank you
New system available ♥ - https://www.youtube.com/watch?v=-mTT49TDxIE
New system for audioreactively generative geometries, intervened with various SD configs.
You can access this new TD patch and SD configs (3), plus many more systems, experiments, and tutorials, through: https://linktr.ee/uisato
#touchdesigner #stablediffusion #generativeart
0:00 - AI intervened - 1
0:07 - AI Intervened - 2
0:15 - AI Int...
SD3@ClipDrop

Too Daze alreddie
12th june in 2 days tho
just like Naruto says in the dub: "Believe it!"
What if it was a dyslexic mixup in d/m/y vs m/d/y and they actually meant December 6th?
so close 🙏
its so good that we have a date
and 8B will come later and will be released as well
is it known that its comfyui ready?
what fr?
it'll be like sdxl I guess so yea
as far as i understood, it will be day 1 ready on comfyui
or rather, available day 1
ya just lots of things have been said, wondering if reconfirmed
maybe ask comfy directly :3
internally they are using comfy. so yes, it's ready day 1
same for stableswarm, if you wanna generate grids
exactly, comfyui and stableswarm are day 1 things
anything else is probably gonna take a few days if not weeks, don't know
we should also know that controlnets are not going to be day 1 things, but we might expect 1.5 quality if not better due to the MM part of MMDiT
it's just that it may take more time for people to do research about it first
as long as regional prompting is day 1 and they figure out a pos embed fix so that we can get proper highresfix, I'm perfectly satisfied for the time being
i don't care about controlnets right away, i'll be happy enough with t2i and i2i for now 🙂
oh btw they might've already started training 8B
😮
I dream about memes finetune
then again, these must've used Ultra's workflow with like highresfix and everything
if finetuning is made easy, that's LITERALLY the first thing I'm doing
I can't wait
and since its 2B, we can actually train it 
hiresfix can be done only with sdxl rn?
no, with any Unet model to my knowledge
but with DiT models such as pixart and now SD3, highresfix is broken
I will try too!
for Pixart, the entire image is noisy and distorted, and with SD3, the area outside the base resolution part is a blurry mess
tiled upscale breaks bokeh and blur for me 
if you go out of resolution range without fixing the positional embedding handling or using tiling, it does this (clear image in center, distortion on the outer edges)
from Alex (mcmonkey)
this is exactly why I am worried about them saying "just use tiled upscale"
then again, lykon's images must've been with upscaling, and they look wonderful
maybe low denoising is okay? 🤷♂️
I really hope its just a case of fixing a part of the code and not a limitation with DiT
maybe sd3 just smarter)
also tiled controlnet might be more effective
no its definitely upscaling
but the VAE is superior though
guys
this is an image without highresfix and the woman's face doesn't look like a mess
the eyes look perfect
do i create images here or can i invite bot to my dms
I actually don't know, but you can use #1237459938901491852 channels of course
check out #artisan-faq
thankyouuu
will it be slower?
no idea actually
hope not 4 times slower 
I hope its mostly vram difference and it can be solved with a tiled VAE
but I hope its not 4 times slower
it wont let me type anything in that channel
these are 8B pictures
Meh. That tease right before the 2B release is in bad taste.
honestly doesn't look any better than 2B right now
Has there been any update on what time on Wednesday we should expect the model to be released?
don't recall
just simply expect the worst and you won't be disappointed
I would wager midday to night in the US
I'd be surprised if they released it around greenwich midday time
as long as it comes out at some point on wednesday as promised I'm happy lol.
thanks 🙂
it cannot possibly be delayed, the model is trained well now
oh yeah, wasn't expecting it to be! Might've worded that wrong. I was just curious what time to expect it 🙂
GUYS
YESSSS
YESS THANK YOU COMFY
excellent!!!!!
we can leave prompts empty to try it all out
hopefully someone will make a workflow cause I can use comfy but I don't know exactly how to connect everything together properly (especially now that it got more complicated with SD3)
How many gb takes sd3 to memory including all components ?
cliploader cliptextencoder not a very good name for t5
yeah idk why its still cliptextencodesd3
when there's a T5 as well
but I suppose its for familiarity
https://comfyanonymous.github.io/ComfyUI_examples/ not now, but later it will probably be here
Do we know how big the SD3 model file is going to be yet? I did just order another 4TB SSD just in case 😀
2B isn't going to be big, especially since its going to be at bf16 or fp16
it's T5 that's gonna be massive of course
which is 10 GB if you haven't installed it
why T5 is so huge 💀
its a large LLM model (large large language model kek)
even if we are only using the encoder part of it
Don't jinx it now. 😉
I mean if they train it more, I don't mind
okay maybe I do, cause I kinda want to try the model offline now 😔
Pretty neat, but like early SDXL prompting where we experimented with G or L only or custom prompts for both... it'll probably turn out not that useful.
lolyeah had that conversation earlier
I think T5 will make a difference
swarm too
The noodles don't scare me no more. I have become one with the noodle.
oh absolutely 🔥
probably on launch day yeah
if you use Swarm it autogenerates workflows for you
great for getting set up and still lets you take the workflow itself and muck with it after at will
Alex linked a T5XXL version that is only the Encoder. It's smaller and apparently loads faster.
SD3-Medium-fp16 is a 4GiB file, it relies on a separate 1.3 GiB CLIP-G, 0.24 GiB CLIP-L, and optionally a 9.5GiB T5-XXL.
When downloading finetunes expect to usually only download variants of the 4GiB file, not the textencs
Hi Alex. 🙂
Also it's possible to store the model in FP8, making it only a 2GiB file when you do that
yeah its the encoder part only at like fp16 so its around 10GB
what about q4? 
it loads WAAAY faster than fp32
I don't think we have the software tooling ready for launch day for T5-4bit
hypothetically if we did it should work fine and would be a ~2.5GiB file
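The file sizes quoted above all follow from bits per weight, so they can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the function name is made up for illustration, the input sizes are the ones quoted in the chat, and real quantized files carry some extra overhead (scales, metadata), which is presumably why the 4-bit T5 guess above is rounded up to ~2.5GiB.

```python
def rescale_size_gib(size_gib: float, from_bits: int, to_bits: int) -> float:
    """Estimate a weights file size after changing the bits stored per parameter.

    Ignores metadata and quantization overhead, so treat the result as a lower bound.
    """
    return size_gib * to_bits / from_bits


# Sizes quoted in the chat (GiB, fp16 unless noted).
SD3_MEDIUM_FP16 = 4.0
CLIP_G = 1.3
CLIP_L = 0.24
T5_XXL_FP16 = 9.5

sd3_fp8 = rescale_size_gib(SD3_MEDIUM_FP16, 16, 8)   # 2.0
t5_4bit = rescale_size_gib(T5_XXL_FP16, 16, 4)       # 2.375
full_download = SD3_MEDIUM_FP16 + CLIP_G + CLIP_L + T5_XXL_FP16
clips_only = SD3_MEDIUM_FP16 + CLIP_G + CLIP_L

print(f"SD3 Medium at fp8:     {sd3_fp8:.2f} GiB")
print(f"T5-XXL at 4-bit:       {t5_4bit:.2f} GiB")
print(f"Everything at fp16:    {full_download:.2f} GiB")
print(f"Skipping T5 entirely:  {clips_only:.2f} GiB")
```

Skipping T5 cuts the download by well over half, which matches the "just don't use T5 at launch" suggestion.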
I wonder if a ggml implementation of encoder models such as T5 would help 🤔 (I'm a broken record)
however you can just not use T5 at all as an option, imo probably the best launch-day option
yeah t5 works well in ggml 4bit, just need a convenient way to shove that into comfy

there's llama-cpp-python pip package or whatever its called
That's interesting... for SDXL (and v-1.5) training the TENC alongside the Unet makes all the difference. You think that's unnecessary now?
thank you for the info ♥️
Basically the difference between mere Dreambooth and Finetunes / multi-concept-LoRA's
Sigma Shift, if you're familiar with sigmas in SD it literally just offsets those values, see here https://github.com/comfyanonymous/ComfyUI/commit/8c4a9befa7261b6fc78407ace90a57d21bfe631e#diff-6c3064a93127b01542c5772a797c9d356b876fc9940ec14951f95ff8ea270656R172-R204
or for a simplified version of the same code:
Taggui can apparently run Salesforce/blip2-flan-t5-xxl in 4bit. Maybe that helps?
no clue what im reading
it's quite possible that training the textstream will replace the power of training the tenc.
probably still training the tenc is a powerful tool, but it's less valuable with the streams, and also harder with the 3 tenc setup dealio, so the tradeoff made the most sense to just not include tencs
I've been hearing about Sigmas more and more for a few weeks, but apart from "something something detail" I don't really get them. So changing the Shift will have a direct impact on the output and should be experimented with?
and for reference when I say this, I'm basically personally the reason SDXL has tencs in the model lol, others wanted the tenc separated for XL but i fought to include it because training it is so worthwhile and the tencs are only 1 out of the 7 gigs of space the model takes anyway
balance is different for SD3 so I can't really argue it on this one
bnb is so weird and jank with all the limitations like this. ggml and exllama just do things so much better
So of course they would have to be loaded alongside SD3 2B to be trained with it (excluding T5 since that would need more than consumer hardware)?
Textstream == the new DiT architecture?
I saw in the terminal that apparently it's an immovable brick due to bitsandbytes, so could it be that you cannot offload it to RAM after you have used it?
It's ~1.5GB for L & G now. Are they exactly the same Clip's SDXL used or custom ones?
Uh so the short of it is:
- in early diffusion models, we had timesteps 1-1000 exactly
- in the modern era of diffusion, we now have dynamic steps (eg 20 steps or 50 steps or etc) and that works by converting to approx timesteps (so eg for 50 steps, you multiply your step by 20, ie 1000/50, to get a timestep value) and adding the sigma value to represent the timesteps in a way the model can now actively process
- the shift effectively curves the timestep space, so it can spend more time in the early (structural) steps or more time in the later (detail) steps
If you've ever used my Dynamic Thresholding toolkit in auto/comfy/swarm, the CFG Scheduler feature is very similar to sigma shift (albeit of course using the CFG rather than sigmas to push this preference schedule)
in the case of SD3 I think sigmas are basically just linear by default until you apply the shift (vs other models had more of an algorithm to em)
(the sigma goes through an embedder to turn into latent magic inside of the model)
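A rough sketch of the explanation above, for anyone who prefers code. This is not the snippet referenced earlier in the chat. It assumes the shift has the form `shift * t / (1 + (shift - 1) * t)` applied to a linear value `t` in [0, 1], which is one reading of the linked ComfyUI commit. All function names here are made up for illustration.

```python
def step_to_timestep(step: int, num_steps: int, train_timesteps: int = 1000) -> float:
    """Map a sampler step index onto the 0..train_timesteps range used in training."""
    return step * train_timesteps / num_steps


def shift_sigma(t: float, shift: float) -> float:
    """Curve a linear sigma t in [0, 1]; shift == 1 leaves it unchanged."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def shifted_schedule(num_steps: int, shift: float) -> list:
    """Sigmas from 1.0 (pure noise) down to 0.0 (clean), after applying the shift."""
    return [shift_sigma(1.0 - i / num_steps, shift) for i in range(num_steps + 1)]


if __name__ == "__main__":
    print(step_to_timestep(1, 50))  # one step out of 50 spans 20 training timesteps
    for shift in (0.5, 1.0, 3.0):
        rounded = [round(s, 3) for s in shifted_schedule(10, shift)]
        print(f"shift={shift}: {rounded}")
```

Under this assumed formula the endpoints stay fixed at 0 and 1 and only the spacing in between changes. With shift 3, the linear midpoint 0.5 maps to 0.75, while shift 1 leaves it at 0.5.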
So Sigma Shift 0.5 -> more time for structure Vs Shift 2 -> more time for detail?
TextEncs don't have to be loaded, you can precalculate embeddings separately. Better for VRAM that way. Kohya and other trainers support this out of the box iirc
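A minimal sketch of the precalculation idea, not how Kohya or any other trainer actually implements it. The `fake_text_encoder` is a stand-in so the example runs without downloading anything. In a real setup it would be the CLIP/T5 forward pass, run once, after which the encoders can be unloaded.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Encode each caption once up front, then serve embeddings from memory."""

    def __init__(self, encode_fn: Callable[[str], List[float]]):
        self._encode = encode_fn
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0

    @staticmethod
    def _key(caption: str) -> str:
        return hashlib.sha256(caption.encode("utf-8")).hexdigest()

    def get(self, caption: str) -> List[float]:
        key = self._key(caption)
        if key not in self._store:
            self._store[key] = self._encode(caption)
            self.encoder_calls += 1
        return self._store[key]


def fake_text_encoder(caption: str) -> List[float]:
    """Hypothetical stand-in for a real text encoder: returns a tiny fake embedding."""
    return [float(len(caption)), float(caption.count(" ") + 1)]


captions = ["a photo of my cat", "a painting by Monet", "a photo of my cat"]

cache = EmbeddingCache(fake_text_encoder)
for caption in captions:          # precompute pass, before training starts
    cache.get(caption)
calls_after_precompute = cache.encoder_calls

for _epoch in range(3):           # "training" only reads cached embeddings
    for caption in captions:
        cache.get(caption)
```

After the precompute pass the encoder is never called again, which is the part that saves VRAM during training.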
the textstream is part of the model, it can't be separated. Presumably trainers will let you select which parts of the SD3 model you want to freeze or unfreeze
Like we could (later down the road once it's fixed) go 0.5 Shift for the initial image and Shift 2 for the HiresFix / Ultimate Upscale / whatever?
the "mm-" in "mm-dit" is "multimodal", referring to the multiple streams - in SD3 base there's text & image streams, but hypothetically if you go to add eg a controlnet, you'd just add a controlnet stream
SD3 is, for all official purposes, CLIP-G + CLIP-L + T5-XXL. While I think in practice everyone's gonna just yeet t5 out a window and focus clips-only, the official reference publication stuff is all including T5
https://huggingface.co/Alpha-VLLM/Lumina-Next-T2I Uses RoPE scaling to generate at higher resolutions. Any idea if SD3 will also be capable of this?
this was 5 days ago, idk if something changed
but we can do tiled upscaling for the time being
this graph shows blue = no shift
red = shift 3
blue = shift 0.5
red as you can see runs through timesteps much faster early on in structural phase, then spends more time slowly on the later detail steps
SD3's default reference recommended shift is 3
it makes sense to play in the range of 1.5 to 3
you probably wouldn't ever do below 1
nice, this is a good demonstration
not at launch, and probably won't be the exact same code (I think RoPE doesn't apply here iirc, the posembeds are weirder) but most likely someone will find a way
("RoPE" is "rotary positional embeds" and "RoPE Scaling" is a technique to scale RoPE, SD3 has its own pos embedding logic)
I still have to wrap my head around the fact that very early day embeddings are back on the menu... xD
TI embeddings have been awesome the whole time
they are incredible on XL
i wish people would use them more / write good tools for them
it's theoretically possible to train a small TI in a few seconds (some people even published code that does this before, then deleted it off github argh)
LoRAs are more flexible obviously but for single-concept training TIs are hard to beat for speed/quality/usability
also aren't they going to work across all 3 (or 4) model sizes?
since they use the same clip?
the only time I remember using TIs for fun were in SD2 days
idk what happened, but stylistic TIs were really good in quality
we had like midjourney TIs and Greg Rutkowski TIs
they were nice and super small in filesize thanks to TI's nature
would making a TI for SD3 have nearly the same procedure as making a TI for SDXL?
we'll see in 2 days 😁
eh but I'm inexperienced in training 😔
I'll have to wait for random rentry blogs lmao
all I did is use google colab and then later some super braindead easy GUI to train Loras for SDXL
I think I was decent at training until I tried training a lora / checkpoint with more than 30 images
My model broke
damn
https://github.com/Nerogar/OneTrainer okay this has a super easy installation method and it has an SDXL Embedding preset
It is super easy to use
and convenient
oh shit.. waddup 😮
ah, CEST has 1 day left
I found TIs amazing in SD 2.1 but in SDXL they didn't work well for me
also I had the feeling that TIs overfit much worse than unet training
Yes. For that matter an SDXL TI trained today will work on SD3 when it's out
that's not how timezones work but lol
excellent, thanks
damn son.. give me :3
@raven fern lol
hype is unreal, last day (or 2) of waiting


@raven fern
❤️
all according to keikaku
When it drops on huggingface I should be able to just grab the model and toss it in A1111 right?
Nope
There'll be comfy support at launch. Not A1111
noooooo
Comfy is written by stability ai. The other guys are their own separate devs so they'll have to work on integrating afterwards.
Aaa I can't wait to try it
132k lines changed 😮
it will be on huggingface right ?
Ooooh very exciting. Didn't realize they pulled it into main branch already
you can use it in Swarm as well on day 1, which has a friendly interface like auto1111 does
ye
most of that is just the tokens file for t5 lol
yep lol everything goes straight to master for comfy
Sshhh don't say that, it's all code that you guys toiled away on. Just go with it. 🙂
8B started training months ago but yes it's regained focus for the training team now that 2B is about ready
excellent!
(scrolling through backlog sorry if this already got an answer) T5-XXL, encoder only, as found eg here https://huggingface.co/mcmonkey/google_t5-v1_1-xxl_encoderonly/tree/main
updated and rdy ^^
if you update swarm you can get a valid SD3 workflow out of it too lol
Is there a new ksampler as well or still using the same one?
Ah
Not sure if I can run Swarm, I got a 2080TI with 8GB of ram and all the other ones work well but I heard swarm uses more resources to run, is that true?
same ksamplers as normal
Swarm uses less resources than auto does lol
it's way more efficient
what about schedulers?
can we use all samplers and schedulers?
how does the flow matching influence it
You generally want Euler+Normal, but any Sampler that isn't Ancestral or Stochastic (ie SDE) should work
ancestral/stochastic are incompatible with flow
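To make the "Euler works with flow" point concrete, here is a toy sketch of a plain Euler sampler under a flow-style formulation. It assumes the convention `x_sigma = (1 - sigma) * clean + sigma * noise` with the model predicting the velocity `noise - clean`. The "model" is a hypothetical oracle that already knows the answer, so this only demonstrates the update rule, not SD3 itself.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_sample(x: Vector, sigmas: List[float],
                      velocity_fn: Callable[[Vector, float], Vector]) -> Vector:
    """Deterministic Euler integration from sigmas[0] down to sigmas[-1]."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        # Move along the predicted velocity by the (negative) change in sigma.
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy data: the "image" we want to recover and the starting noise.
clean = [1.0, -2.0, 0.5]
noise = [0.3, 0.7, -1.1]


def oracle_velocity(x: Vector, sigma: float) -> Vector:
    """Hypothetical perfect model: the true velocity is constant along the path."""
    return [n - c for n, c in zip(noise, clean)]


num_steps = 20
sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
result = euler_flow_sample(list(noise), sigmas, oracle_velocity)
print([round(r, 6) for r in result])
```

Note that nothing in the loop injects fresh noise between steps. Ancestral and SDE samplers do, which is the behaviour flagged above as incompatible.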
That's a good question. Is there any benefit for the perturbed and align your steps nodes with sd3?
go with the flow 🙂
we'll see, about that..........
AYS won't work out of the box, but can be retrained for SD3
so nvidia has to release something new for sd3?
it's gonna be cpu by default?
alternate guidance options like SAG and PAG will probably just flat not work as-is and need new research to make things like em
yea some new research
and freeu neither, cause its specifically for unets
for AYS they published a unique list of sigma values to use per model arch, so they'd need to rerun it and get a list specific to SD3
yea basically anything unet related wont work
oh yeah freeu is entirely out
as is cacheysamply :(
my precious baby, killed to death twice in a row
right
i personally like the Inspire pack ksampler for ays, cause it's already built in, dont have to use a lot of nodes for AYS
at least we can experiment with stuff on day one, like messing around with text encoders, textual inversion embeddings, tiled upscaling
what about regional sampling?
ok I'm convinced to try SwarmUI, isn't it on Pinokio, cant find
I should try my stone man shooting money from his hands regional prompt on sd3. Seeing as it understands placement keywords, it can probably just do it natively
wait, i just noticed comfy did a small fix to cosxl edit models, does this mean they fixed the artifacts? im gonna check 😮
is API call the only way to test SD3 right now?
sweet thx
you're welcome
man greatness awaits 🙂
SwarmKSampler (built in with swarm) has it too
nice
swarm is super easy to install natively, you don't really need an install manager like some other uis need https://github.com/Stability-AI/StableSwarmUI?tab=readme-ov-file#installing-on-windows
I think I was able to directly achieve what I did with regional prompting with this one. One of the things that's so impressive about sd3 so far is the variation available between seeds. It's not just the same image from a slightly different view each time. Every one is very different which is great.
@viral plaza do you know when we will get an edit model for sd3, like cosxl edit?
was just a bugfix
don't need an inpaint model really
here see for example i did an inpaint with sd3 medium release candidate directly
my lil terminator kitty looks awesome
i set the creativity a bit too high so there's some discoloration on the edges but not too bad unless you're looking really really close
this isn't some SD3 magic btw this is just Swarm's inpaint code working well lol
can do the same with XL
iiiddunno. Hopefully somebody finds a better way to make one than just ip2p. If not... well somebody's gonna make an ip2p on sd3 at some point
wow that works well for a base model
like yeah its 0.9 strength, but still
can of course also just do the mask better to prevent the discoloration
kek
nice kitty
oh actually turn off Mask Shrink Grow and this looks really good
how come a1111 users didn't switch to stableswarm yet, it even has these convenient features similar to it
... anyway tldr point is, if you use a good UI like Swarm and fiddle settings a bit, you don't really need an inpaint model
main reason most people aren't using Swarm is just cause they don't know about swarm or don't really comprehend what swarm is
most people once they try swarm never go back to anything else
its not 1:1 to a1111, so therefore it must be bad 
comfy users it's a 100% no-brainer to use swarm, auto webui there's differences to learn but ... like, most of the differences are improvements so lol
the one pain point for auto users is if you have old auto extensions you really like - you can usually find a comfy equivalent, but if you don't like the noodles, it's awkward
waiting installation to finish :c
swarm is generally less reliant on extensions though, a lot of stuff is built in
you only need extensions when you're getting really really crazy
meanwhile octopussy v2 (SD3)
stable swarm seems very well built, il try it one day
would this work too? i tried with pixart and the output was identical https://huggingface.co/city96/t5-v1_1-xxl-encoder-bf16
just saw yours is roughly the same size
that looks like it's the same thing just bf16 instead of fp16, so yeah
do you know why the team chose xxl over xl? xl seems to be more popular just from searching for fine tunes of xxl
they literally just grabbed the biggest one they could find im pretty sure lol
bigger is better 😄
Prepare the consistency nodes for the arrival of SD3.
the only thing left to conquer - consistency.
we have it in lighting btw
😄
we just need shape/texture consistency now
1 day left
dont predict the arrival of the messiah
well in this case we can
can sd3 make it 2 more days from now though? can it time shift?
guys we gotta think of something to make time go faster so it's 2 days from now
no but effectively-time-travel has been proposed before
("anticausal" = model that's trained to solve in the reverse of the direction of time)
using improbability fields? thonk
i was thinking MM-DeLorean
this is heavy
like .. if u move fast enough u time travel to the future right? we all just gotta get on a spaceship that can travel at near light speed
naw. that'll just age us infinitely before we get back to the same moment
if i eat that will it be 2 days from now?
maybe from a food induced coma
whos ready for unexpected delay on release day like last time 😆
And then all the selfish people that will complain about their free toy being late
obviously if you delay something right as people expect it to come out
I think I figured it out. so like time is a measurement of light moving, so if we can move at light speed, it'd be like pressing pause on a video game but then we could just skip to any point in time instantly so to get to SD3 launch all we gotta do is go fast like sonic
damn selfish ppl who pay for api access
That already have access to it and whose credits are still going to be in their accounts...
is this the first model that hasn't leaked?
we truly live in a society
But for real, people will complain about anything. Even when it does come out on time, people will complain that they can't run it on their 2gb vram gpu and 8gb ram laptop from 10 years ago
Or that it "suck"
Because no bobs n vegine
thats valid tho
too many bobs and vegana,ban model pls
To who? 90% of the complainers are just people from countries where prawn is banned, like India and China.
They just happen to come from the two largest countries on the planet
lol you think they gen images cause they have no access to porn? 😆
It's illegal in both those countries.
didnt u know? ppl in china and india have an iron mask on their head with tiny wires coming out of it
so im asking if they are using ai as a workaround?
we still got 2 days for somebody to be stupid about it
like the person that saw "stable audio open" uploaded to HF and stole it and reuploaded a "leak" with the word "open" removed
(i really wonder what the goal was. Was it, like, clout chasing? or... why tho?)
do they even have a set of closed beta testers like sdxl?
nope, only a few business-tier partners this time
vs XL for example was sent out to anyone with a .edu address
so why not anyone with .com address? ;p
any sensible training parameters to recommend for training a lora?
that'd be a @lavish osprey question
Is there a free way to use sd3?
when the weights are released you can run it on your own computer for free
Thanks
Fabian@GLIF
Cause auto1111 lets me remotely share and access the website on any network
my guesstimate is an 8gb gpu and 32gb of ram should be okay
since T5 can be loaded in CPU ram
pixart is a 0.6B model, requires 1.5 GB of VRAM
so roughly 5gb for the mmdit
loading clip g and clip l uses 1.8gb of vram
swarm does too!
comfy only uses 3.6gb to run sdxl
does that make the 2B model excluding the clip models less than 1.8gb?
Awesome! Thank you
Does it generate a link you can access on mobile on another network like A1111 does?
SDXL minus CLIP is ~5.5GiB, SD3-Medium (2B) minus textenc is ~4GiB
mostly yep, check the docs i linked, use cloudflared option probably
mobile support is a liiil wonky atm
it works but not a great experience tbh
is comfy offloading the clip to cpu maybe?
Sick
if you're using only 3.6GiB of VRAM usage then yes comfy is offloading things for you
so might be able to run on my 6gb laptop card then :>
yeah probably

do you know if it's any quicker for our friends with apple computers than sdxl?
hahaha
@viral plaza Does Swarm's Inpaint put the whole image through the Vae (degrading the image over time) or is it using a stitch-approach to mitigate that?
there's an Advanced param under Init Image that controls that, defaults to enabled (ie stitches the image back to prevent VAE damage)
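The stitch idea can be shown in a few lines: only masked pixels come from the regenerated (VAE round-tripped) image, everything else is copied from the original so it never degrades. This is a simplified sketch with plain nested lists and a soft mask in [0, 1], not Swarm's actual code.

```python
from typing import List

Image = List[List[float]]


def stitch_inpaint(original: Image, generated: Image, mask: Image) -> Image:
    """Blend per pixel: mask 1.0 takes the generated value, mask 0.0 keeps the original."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


original = [[0.10, 0.20, 0.30],
            [0.40, 0.50, 0.60]]
# Pretend the VAE round trip nudged every pixel slightly.
generated = [[0.12, 0.90, 0.33],
             [0.41, 0.80, 0.58]]
# Only the middle column was masked for inpainting.
mask = [[0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0]]

stitched = stitch_inpaint(original, generated, mask)
print(stitched)
```

Unmasked pixels come back bit-identical to the original, so repeated inpaint passes do not accumulate VAE damage outside the mask.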
Perfect.
I think I'll try to get used to Swarm more with SD3. (as I can still use my overconvoluted crazy Noodle workflows in that tab)
does swarm have tiled upscale?
Kinda like KoboldCpp + SillyTavern.
Where can I use SD3?
tomorrow. OK?
ComfyUI just updated to support SD3
the Midnight is coming

uhhhh... actually are they going to just do the license announcement along with open weight
yknow not built in currently in a convenient way but i oughtta add that before sd3 launch eh
(can do in comfy tab of course)
Is SD3 available for download now? I thought I could experience it on Discord.
nah, the Discord one is kinda like Midjourney, which needs a subscription
But I couldn't find the SD3 bot in Discord.
Artisan. I suggest you wait until tomorrow when the announcement is here
okk
Sd3 on hugging face ?
added support for https://github.com/BlenderNeko/ComfyUI_TiledKSampler - if you install that, under Refiner you'll get Refiner Do Tiling checkbox. It's not perfect but if your RefinerControlPercentage isn't too high it does its job well
yeah
Ha cool, i hope it will run correctly because stable cascade has a lot of issues with the diffusers lib
Should've told you I made a tiled sampler 😛 It adds the tiles together using a feathered mask technique which is pretty seamless. https://github.com/bash-j/mikey_nodes/blob/main/mikey_nodes.py#L3015
ooo
any chance you'd be willing to PR the bare minimum version of it into swarm? (I'm intentionally trying to keep swarms dependencies at bare minimum, no large packs except when absolutely necessary)
I can try, where would it go? I have a function that splits the image into tiles, then a function that stitches the tiles back together
as a parameter to, or variant of, SwarmKSampler: https://github.com/Stability-AI/StableSwarmUI/blob/master/src/BuiltinExtensions/ComfyUIBackend/ExtraNodes/SwarmComfyCommon/SwarmKSampler.py#L115
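For readers wondering what "split into tiles, then stitch back with a feathered mask" means in practice, here is a simplified one-dimensional sketch (a single row of pixels). It is not the code from either linked repo. Tile size, overlap, and the linear ramp are arbitrary choices for illustration.

```python
from typing import List, Tuple


def split_tiles(row: List[float], tile: int, overlap: int) -> List[Tuple[int, List[float]]]:
    """Cut a row into overlapping tiles, returned as (start_index, values) pairs."""
    stride = tile - overlap
    starts = list(range(0, max(len(row) - tile, 0) + 1, stride))
    if starts[-1] + tile < len(row):
        starts.append(len(row) - tile)  # extra tile so the right edge is covered
    return [(s, row[s:s + tile]) for s in starts]


def feather_weights(length: int, overlap: int) -> List[float]:
    """Weights that ramp up from each tile edge and reach 1.0 past the overlap."""
    weights = []
    for i in range(length):
        distance_to_edge = min(i, length - 1 - i)
        weights.append(min(1.0, (distance_to_edge + 1) / (overlap + 1)))
    return weights


def stitch_tiles(tiles: List[Tuple[int, List[float]]], length: int, overlap: int) -> List[float]:
    """Weighted average of all tiles; overlapping regions cross-fade instead of seaming."""
    acc = [0.0] * length
    weight_sum = [0.0] * length
    for start, values in tiles:
        for i, (v, w) in enumerate(zip(values, feather_weights(len(values), overlap))):
            acc[start + i] += v * w
            weight_sum[start + i] += w
    return [a / w for a, w in zip(acc, weight_sum)]


row = [float(i) for i in range(20)]
tiles = split_tiles(row, tile=8, overlap=4)
# A real tiled sampler would denoise each tile here before stitching.
restored = stitch_tiles(tiles, len(row), overlap=4)
```

Dividing by the summed weights means untouched tiles reproduce the input exactly, so any visible change comes only from what the sampler did inside each tile.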
@viral plaza do you know if SD3 will work with tensorRT (not sure anyone would have tried it yet)?
it will yes, nvidia will publish stuff for it
bro Swarm has everything, you were right. I'm rdy for SD3
has cat 
lots of cat
params for lora training? It would greatly depend on the type of lora, the dataset size, the goal, etc
at the moment we didn't test any style or character lora (well, we briefly did but then we had to move out to other stuff).
We did lots of aesthetic or "fix" low rank training
So I read thru the history to try to understand. But I still have a few questions. Would appreciate if someone could fill me in 😄
SAI is releasing SD3 Medium, but there could possibly be a small and a big? Also what is the impact of the 2B vs the 800M vs the large? Is the impact more on the model's ability to translate text into images and understand what it's trying to create? Or will it have an impact on image quality too? (my understanding is that it's a bit of both because it will have fewer latents to draw from, right?)
Finally what is SD3 Ultra? is it a comfy UI workflow when using the SD3 Api? Or is it the large model?
Thanks to anyone who takes the time to answer!
It's style aligned well enough to allow most types of lora training imo
parameter count impacts all kinds of things, from general knowledge to inference capabilities (ie the ability to create new concepts from the ones you know)
that being said, 2B mmdit is likely all a human can ever need.
SD3, same prompt(lots of cat)
thank you for clarifying!
What about what SD3 Ultra is? Is it a workflow? or the larger model?
it's only "Ultra" in theory, not "SD3 Ultra"
but what is it? a model? or a workflow to make the most of the output from the model currently on the API
what's in Ultra is a trade secret, you can think about Core and Ultra in terms of msg, which is salt on crack, while base models are just salt.
I see, so Ultra is basically the model that is available via API is that it?
or is that the one available via stable assistant?
ultra is not a model
there is no model named ultra
so what is ultra then? a workflow?
ultra refers to an api endpoint, what's behind that is not for me to say
or was this a meme/joke that i'm not getting xD
I see! i get it now
thanks!
enjoy some ai generated pizza

this is the one on fireworks right?
that's evil, I'm hungry
no, Ultra is provided directly by us, not Fireworks
then enjoy a dog on a ball on a surfing board
@viral plaza give me the yellow name so I can start banning people who put pineapple on cheese and tomato sauce

got it! i'm looking at the api now. This one has Sketch & Canny as controlnets right?
elaborate on "this one"
tested some MJ prompts with SD3
I don't remember if Structure has canny to be honest
For legal reasons I would like to clarify that Stability AI does not compare its services to crack, Lykon's choice of phrasing is entirely his own
for not-pissing-off-reddit reasons I'd like to clarify this is also Lykon's own phrasing gasjlasg
I so hope that this is true.
I can't wait to try Lora & FineTuning...
Can't wait how different the Textstream (& Imagestream?) will train compared to the XL Unet+Tenc.
And I wonder how long it'll be until Kohya & OneTrainer add manual Clip-G/L training for concepts.
And dedicated simple Embedding training.
you don't really need to finetune TEs for concepts
historically NAI used SD1.4 TE without any change
and Pony finetuning TEs kind of destroyed them, to the point it lacks basic knowledge like "a bank"
the first 5-6 months of SDXL lora finetunes didn't touch the text encoders and they mostly work very well
even with made up words as activation
(as a matter of fact, I'd suggest to keep the tes untouched and just use them for preprocessing)
say I have 50 photos of my cat and I want to train a LoRA, where should I start? Would it be any different to training a LoRA on 100 paintings by Monet?
Is that because the text encoders will just output tokens for whatever you put in there anyway? Even if it's completely nonsensical?
And it has problem with colors and color bleed.
I have high hopes that SD3 got rid of most color bleed.
I'll definitely try that.
For SDXL at least training the Unet only was never enough for new concepts (at least during my tests) - adding the Tenc always got it into the right direction.
each token gets an embedding, yes
that's how the original dreambooth paper worked: they used a "non-sensical" input token and trained on that
instead of using one token you can also just use a bunch of tokens (like your name)
the rarer (and more nonsensical) the token is, the harder it is for the model to learn from it, but there is also much less overfitting and damage then
it worked also for sdxl, but training the unet takes much more time than training the te
tokens are the input, not the output, but your intuition is correct.
^ this is mostly correct. The model will produce an embedding of your prompt depending on the tokens. So there are still going to be vectors that the Unet/DiT can "catch" to understand new concepts
with SD3 you also have more stuff you should check when training, like "did I ruin text understanding" or "did I create conflicts among the various text encoders"
If the SD3 2B is released tomorrow, will I be able to use the inpainting or upscaling features directly in ComfyUI?
A quick dumb question is there a fixed time for the release like 12PM PDT or something?
Another thing - I started pruning my LoRA's per Weights - for some concepts that was highly effective and countered the detrimental effects of my LoRA's on the base Models.
Will that be possible / necessary with the new architecture, or is it structured differently?
what do you mean with "per weights"?
but loras work technically the same everywhere. They are not specific for diffusion methods
You can load LoRA weights individually (can't remember what the Comfy Node was called exactly).
But aren't they like a "UNET jank" for XL?
each matrix has its own pair of lora matrices. You can remove them individually, yes
I don't know what you mean with unet jank.
But they are not limited to unet anyways. You can have text encoder loras, too

It's funny that if you mess up the prompt with SD3 it really tries to follow it, with hilarious results.
"vampire with fangs, wearing a black cape and black with red lining inside shoes"
its a goomba
its wednesday 12th on new zealand already,where SD3 ?
two weeks =]

True. But it was just established we should try not training the Tenc first.
We'll see I guess.
Someone else wrote "jank" as in LoRA's interact with Unet's in a strange way.
I don't know what you mean 😅
if you train the TE then you should train it first
or you train it together with unet
but training unet first and then te sounds wrong
regarding lora and unet: I have no clue what this "strange" interaction should be
in the end a Lora is doing nothing strange, it's just a weight update
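The "it's just a weight update" point, sketched in plain Python. Each targeted weight matrix gets its own low-rank pair (`down`, `up`), the update is `W + (alpha / rank) * up @ down`, and dropping a pair simply leaves that matrix untouched. The `alpha / rank` scaling is a common convention assumed here, and the layer names are invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight: Matrix, down: Matrix, up: Matrix, alpha: float) -> Matrix:
    """Return weight + (alpha / rank) * (up @ down); rank is the inner dimension."""
    rank = len(down)
    scale = alpha / rank
    delta = matmul(up, down)  # (out x rank) @ (rank x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def merge_lora(weights: Dict[str, Matrix],
               lora: Dict[str, Tuple[Matrix, Matrix, float]],
               skip: Iterable[str] = ()) -> Dict[str, Matrix]:
    """Apply every LoRA pair except the ones listed in skip (per-layer pruning)."""
    skipped = set(skip)
    merged = dict(weights)
    for name, (down, up, alpha) in lora.items():
        if name in weights and name not in skipped:
            merged[name] = apply_lora(weights[name], down, up, alpha)
    return merged


# Two hypothetical layers, each with its own rank-1 LoRA pair.
weights = {"attn.q": [[1.0, 0.0], [0.0, 1.0]],
           "attn.k": [[2.0, 0.0], [0.0, 2.0]]}
lora = {"attn.q": ([[1.0, 2.0]], [[0.5], [1.0]], 1.0),
        "attn.k": ([[1.0, 1.0]], [[1.0], [1.0]], 1.0)}

full = merge_lora(weights, lora)
pruned = merge_lora(weights, lora, skip=["attn.k"])
print(full["attn.q"], pruned["attn.k"])
```

Nothing here depends on whether the matrix lives in a Unet, a DiT, or a text encoder, which is why the same trick carries over between architectures.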
Generic phone selfies look so good YESSSSSS, I hope CCTV and other styles work as well as 8B, this is good news
you can just take a selfie yourself
why gen them
Yeah. Thanks for this m8!
I was making sure i could run it, because i was having troubles running Pixart with the same encoder.
It turned out in the end that i was just doing something wrong with the workflow, because it works great off of the CPU now.
Thx again for the reply. 🤗
So, just want to mention we're around 1 day away, and we still don't know how the licensing for SD3 works despite being told it would be explained before SD3 launches
Generic selfies aren't exactly difficult.
this one looks 300x more convincing to me
generic ceiling light, no depth of field, just a dude smiling
There, badly lit, mildly out of focus, terrible selfie
I'd prefer to see examples of things that can't be done already in SDXL.
Tell me one thing SDXL can't do
Kek
Weapons are annoying indeed
Don't think this basic version available can do any better
It can do it, sometimes, a bit better
6 dogs standing next to each other,each one is of a different breed and has a different fur color
Or a woman talking to another woman in the street, one has blonde hair and is wearing a red dress, the other is wearing a purple dress and a baseball cap
I mean it was fairly ambiguous.. 😄
more specific (for SD3 purposes)
a woman talking to another woman in the street, one has blonde hair and is wearing a red dress and a fedora, the other has brown hair is wearing a purple dress and a baseball cap
SDXL has mixed hair colours now, and only baseball caps
but this doesn't look anywhere as realistic as the one I posted.
this is likely the most realistic one you posted but still looks like "good cgi" to me and not a real photo


