#💬|general-chat
you know how you walk into a very basic grocery store and it’s got aisles of redundant boxes, all with very clever marketing designs yet comprised of the same refined substances, etc.? it’s not like a higher end grocery, where there are more varieties of produce, more varieties of food in general…that’ll be the trade-off between the various parameter builds of SD3 that get released
in terms of aesthetics
(theory)
The model needs to be trained for both image and text conditioning to be able to do image encoding without controlnets or IP-Adapters, but the SD3 arch specified in the paper uses a CLIP model and a T5 model simultaneously for text encoding, and has a separate CLIP model for image encoding
Parameter builds as in different parameter counts?
yes
let's wish for some news to arrive this week
yeah i’m just being a pessimist, i love being wrong because it means progress was made where i didn’t think there would be
Well it's usually the case that with fewer parameters comes worse performance
yeah—just trying to imagine how that will translate visually
i think it will lead to a lot of syntactic homogenization
(which is…regrettably…also in the best interests of high profile marketers, etc.)
And we'll get a 1gb SSM model outperforming everything else
(not fact checked)
that would be sweet
I'm pretty sure people will start to make quantized versions of the biggest SD3 model, similarly to what was done with Llama 2, so that most people end up using the same model
I don't think there was any use of mamba architecture for image generation yet
Though quantizing diffusion models isn't nearly as simple as LLMs
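For anyone unsure what "quantizing" means in the messages above, here is a minimal sketch of symmetric absmax int8 quantization on a plain list of weights. It is only an illustration: the names `quantize_int8` and `dequantize` are made up for this example, and real LLM or diffusion quantizers work per-channel or per-block on tensors and are a lot more involved.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole list (assumption: per-tensor scaling, the simplest case).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.003]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print("int8 values:", q)
print("largest round-trip error:", max(errors))
```

The small weight (0.003) is where the rounding error hurts most relative to its size, which is one reason lower bit widths lose fine detail.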
when you consider the differences between a quantized text model, low and high bit
quality-wise
what stands out?
because what stands out in a text model will translate visually
Community will probably make a Turbo sdxl3 in less than a month
I'm not sure it works that way. Tokens and embeddings work way differently in LLMs
- the way they generate
not sure what you mean by simultaneously, this sounds to me like you are saying it's necessary component, and yet from everything i gathered from the research paper and elsewhere, it seems T5 is an optional component
Omg what's with these new accounts today
this is true!
how about tech support?
oh maybe you meant for training? i misread, cause i know for inference, t5 is optional
It isn't a necessary component, as the model can still go through the diffusion process with a singular text cond, but, it helps the model be more coherent and understand prompts better. Similarly to how a negative prompt can increase output quality but isn't a requirement
yea
it would be cool if the architecture is modular in a way that we can plug any variant of t5 we want and not just the one they are using
I still can't get over the fact that we came full circle back to GANs with turbo models
How is that even still diffusion
i personally dont want speed, i want quality :3
I want quality and low VRAM usage
Only 6gb 1060
It could be, SD3 won't be the first open source diffusion model to have a transformer model instead of a UNET and use T5, and that model could use T5 XXL and a smaller variant of T5
my god 1060 is so old right? when was that released
No clue lol
context is what makes a field of red a rothko painting
well for example, the flan t5 pruned is only like 2.5GB, so very compact and still works nice
If they use a pretrained one and the features are similar for both I'm sure it'd work
I remember a model called Pixart being able to use different variations of T5
well i tried pixart sigma, but cant get it to work with flan t5, so rip
Sigma recently came out, it is quite impressive
there’s an llm2vec model now that can take mistral7b and turn it into a text encoder lol
Cool, it's similar to SD3 arch wise, right? Transformer diffusion model with T5 text encoder
but wait, the first one was alpha right? so they skipped pixart beta, or pixart gamma? straight to sigma? lol
Encoded features have to be similar to the training ones for it to work
So not all encoders will work
i understand, but as a proof of concept i think that’s a fascinating direction
it also suggests we might see techniques applied inversely to diffusion models so that they are more adaptable to this framework
I was wondering whether I should take clip encoder from sdxl and fine tune it on some stuff but it was too much work
i think they’re really just gunning for PixArt Omega: PixArt Forever
PixArt Infinity
i’ll premiere it on my fake youtube game comedy show series called “What’s With Those Latents?”
Pretty much yes. T5 loads into system RAM, so VRAM is only required for the model itself
It is quite coherent
im still waiting for them to release some version of stable audio, musicgen is nice, but very limited in some aspects
That's probably also what stuff like ComfyUI will do for SD3 to run at a usable speed with typical GPUs, and we'll have optimizations like TensorRT to have it run faster without needing to lose quality
I wonder if some other big diffusion player will join the 'market' in the near future
Hopefully opensourced too
Pretty sure SAI said SD3 is their final diffusion model
it seems that the most proprietary and advanced platforms are being engineered to exist in a very narrow market, i.e. multimedia professionals who already have the equivalent of seed money from their hollywood success
Well, that's what was said last time I checked in here
it’s their final T2I model yes
why would they make it final? they dont want money anymore?
Not too long after playground 2.5v released claiming to have better color space, stability ai gave us CosXL. So I'll say yeah
they have set themselves up for a collision course with many different industries that are already reeling from other AI-related developments
cosxl is very cool, im using it to edit some pics, very cool results, but sometimes the output is a bit blurry it seems
Well then they better come up with a whole other generative model type
stable confusion 🙂
Stable disfunction
unstable corruption
and sd3 will have cosxl or editing capabilities too, so that's gonna be awesome as well
Stoic scattering
sd3 might never have a public release
no it will
SAI and the others will probably start working on multimodals now that we pretty much have a model for most stuff (3D, image, audio, etc..)
a lot has been changing at SAI recently, they might just abandon public model weight releases
yea multimodal is the way it's going these days
Then we wont get anything good for a while
or maybe
is gonna release some greatness in the future
I like how after a month we can start seeing a change in development direction and call it 'these days'
haha
i hope so but it might be hard for him to get funding now
and I doubt he can just buy H100s with his own pocket
I'm still waiting for the Mixture of mambas cosine scheduler adversarial contrastive slicing diffusion
Some paper names these days seem really absurd
yo you saying the word mixture, like the mixture of experts in LLM, maybe we will have mixture of diffusions (MoD) where one expert is good at hands (assuming you want to generate something with hands), another expert is with face, etc :3 lol, prob not...
While it'd seem intuitive, MoE don't actually seem to divide the task the way humans would
It seems rather random, but improves the speed, so we just go with it
on the downside... MoE models are usually huge in size... so imagine something similar for image generators, i dont think it would be great for even folk with 24gb vram cards LOL
A 70b model compares to a 6x14b model in size
yea
But for MoE only 14b params should be used at once
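The size-versus-compute point above can be put as rough arithmetic. This is a back-of-the-envelope sketch only: the 6 x 14B layout is the hypothetical from the chat, `moe_param_counts` is a made-up helper, and it assumes one expert is active per token and ignores the shared (non-expert) layers that real MoE models also carry.

```python
def moe_param_counts(n_experts, params_per_expert_b, active_experts=1):
    """Return (total, active) parameter counts in billions for a simplified MoE.

    Assumption: every parameter lives inside an expert, so shared attention
    and embedding weights are ignored.
    """
    total = n_experts * params_per_expert_b
    active = active_experts * params_per_expert_b
    return total, active


total_b, active_b = moe_param_counts(n_experts=6, params_per_expert_b=14)
print(f"stored: {total_b}B params, used per token: {active_b}B params")
```

So the speed is closer to a 14B model, but all the expert weights still have to sit in memory, which is the VRAM problem mentioned a few messages up.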
So who knows, if no other new architecture comes along maybe we'll get MoD
idk... i mean considering we have all these cool toys in 2024, and at the speed tech is moving... my goodness, imagine what we will have in just couple years from now.. but hopefully nvidia starts releasing cards with at least 32GB vram... cmon bruh
i mean we dont have the final vram confirmation for 5090 right? it's all rumours i think
Hopefully we'll focus on squeezing the most power out of every param, lowering the VRAM required
I ain't spending 2K for a new GPU
ye the other thing is for the image generators to somehow use less vram, using some new algos
But yeah the next couple of years, especially if no law against open source small company models will be passed, is going to be crazy
one day, we will be able to make super mario 64 by clicking just one button 
Maybe AI based game engines? Ai physics, rendering etc
Yeah, also all nerfs and what nots
And we already use DLSS and fsr to improve performance
next we need a neural network trained on level design, and it can generate a level with 3d assets for you, and you prompt just what style or whatever you wanna see in the level :3
Although it's more like an add-on on top of current rendering methods
What a world we live in
Too late to explore the earth, too early to live on mars, but just in time to experience ai technology exploding
instead of image 2 image, you have game style 2 game style, so you input mario 64 and it will give you something similar design wise, 3d platformer :3 , or maybe you just textually prompt: in the style of mario 64
We'll get brainwaves to image
lol
Especially with the new stability ai research
one day we will have ai brain surgery: you prompt: fix brain and it fixes the brain 🙂
Man I wish I was a part of this major ai development
welp, time for me to shave... and im lazy af
Just use stable razor
That's gonna be stability ai merch rebrand after they drop ai
So what do the lucky ones have that layman peasants lack that they were chosen to test this out?
Stable coin is what they have
which sampling method should dreamshapers use?
I can live without fkn incentive. I can't wait for AI to destroy the whole money thing. It never worked.
If all you get out of bed for is to make money might as well not get up. I welcome our AI overlord and they can have all our jobs and economy.
What
the ultra tall super empty skinny towers of NY tell the story
We are moving towards the Star Trek future folks
no more money
stable coin lol?
wtf is wrong with you lol
youre high af
Nah he just testing the prerelease Stable Schizophrenia XL
she
loool
makes sense looooool
stable coin? she is using dodge coin obviously
That's why she doesn't have access
dodge coin is very unstable
buy? how bold of you to assume i have money... 😦
She doesn't know about stable mining...
Well no one has money.
stable miner is the new minecraft game
Those who have it its mostly tied up in empty glass towers and other legalities.
Stonks and "Art" and off shore areas. Etc.
So the top 0.01% is busy stressing over whatever they have so even they can't enjoy it. Really.
Of course the party line says otherwise.
Everyone is happy on Facebook.
even the oil barons smell their end and are scaling back their absurd megaprojects.
there will be blood
Well there is
you could argue we are in the middle of WW3
no one labeled it as such tho
history will label it in 20 years
uhuh
il use my stable fork
it's made of diamonds
mm
stable mining??
what lol
stable fork?
everyone high af rn lol
there is no spoon
^^
@karmic cedar your about me says you are machine learning researcher? you publish papers? 😮 or just student/learning type
just a learning type! 😛
cool
yea mind, neural network, it all makes sense now :3
yeah it all runs together :_D
i love that people from different backgrounds can relate to AI based on its logic and structure
im working on a neural network from scratch, so im also learning i guess
wow easy there cowboy... when i said from scratch and learning... it's really from scratch... and learning.. as in... i just started 🙂
so technically im learning the inner workings of neural nets, like simple concepts and moving forward, no idea what it will turn
out in the end. i ultimately want to create some sort of domain-to-domain network, so like maybe GAN or idk.. like you input something
and it outputs something within the same domain, so either image to image or music to music or idk.. but not like text to image, then
again depending where this journey takes me, i might try that too, but for now want to learn mostly image to image stuff, cause i want
to combine it with my other research field, with digital signal processing and combine it 🙂
Anyone knowing the launch date of SD3?
april 26 
yea, sure
i should probably stop saying this date cause people might start believing it and then when it turns out it's not, im gonna get like 100 pings
too late, I got my hopes up
haha
on the other hand, i hope it's not april 26, cause then people will think im working for SAI lol
I've tried 3 times to make GANs and failed every time lol
yea not easy stuff :3
Only thing that somewhat worked from my models was SR
Well obviously you don't work wink
GANs are one of the oldest models out there and I couldn't even make one work lol
i think GANs started in 2014 if i recall
Well yeah, but with current rate of progress that's archaic technology
And easy to write from scratch using torch/tensorflow
oh im not using torch or anything, im literally doing it from scratch, limiting myself only to numpy for example and some plotting i guess, i really want to learn the inner workings
I think I wrote an evolutionary neural net from scratch in C# a while ago
nice
But I'd not dare to even try implementing SGD with just numpy
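For what it's worth, plain SGD is smaller than it sounds. Below is a minimal sketch that fits a straight line with per-sample stochastic gradient descent. It uses plain Python lists rather than numpy so it has no dependencies; the function name `sgd_linear_fit`, the learning rate, and the toy data are all choices made for this illustration, not anything from the chat.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.02, epochs=500, seed=0):
    """Fit y = w*x + b by per-sample SGD on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                # visit samples in a new random order each epoch
        for i in order:
            err = (w * xs[i] + b) - ys[i]  # prediction error for this one sample
            w -= lr * 2 * err * xs[i]      # d(err^2)/dw = 2 * err * x
            b -= lr * 2 * err              # d(err^2)/db = 2 * err
    return w, b


# Toy data generated from y = 2x + 1, so the fit should land near w=2, b=1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print("w =", round(w, 3), "b =", round(b, 3))
```

A neural net trained from scratch does the same loop, except the two gradient lines are replaced by backpropagation through every layer.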
Yeah except every source for anything ai related gives me some crazy equations with characters I've never seen lol
well i got help from wiki math when it comes to symbols im not familiar with, i know most of them
Well good luck with that but I'll stick to joining torch blocks together
And even with that I barely understand anything from the last few years
i spent a lot of time reading technical research papers and sometimes they provide pseudo code within the paper and that helps piece the whole thing together and then i implemented it myself, for some of the projects (non ai projects), so you can learn, but some stuff is a bit too convoluted and the paper wont help you, unless they maybe decide to also release some code on github to learn from, but yea...
Yeah I try to read every bigger paper that comes out and some older ones I find interesting
But outside of the general mechanics they describe in text, most of the pseudo code or math equations don't make me understand it better
that's why i decided to start from scratch or just math (numpy) cause i really wanna grasp it, so i then have total understanding and control of what im doing and i know what im doing... as opposed to.. here are these legos... and make something... but how did they make the lego itself :3
and of course i take notes and comment my code a lot
Well I understand those most basic basics, but I doubt I could code them without further reading
im actually super crazy when it comes to commenting code haha, i spend like paragraphs just on one line of code sometimes, cause i need to remind myself what this is doing and how it can be used if you alter it or whatnot
or if i write a custom function
I make a comment every 200 lines if not more
Unless I specify the tensor sizes inside model parts
Or to segment my code
i remember during one project, i was stuck implementing a research paper, and i remembered i did something very similar and even commented it on another project and that saved me... and i completed the project
cause i had to understand the logic
im also working on a 3d game engine, got the renderer part (but not complete), the physics (but not everything), and now doing animations, but that is a pain in the thingy
I never made an engine but I did make some game-ish projects, but that is way easier
im a programmer, so i like to try all sorts of projects 🙂
I do some programming for fun but wouldn't call myself a programmer
hey
very real
are you a bot too? cause they always say hey, or hello or something haha
Hello World!
nah just tryna figure out if there was a verification or not
ok if you are replying, you are not a bot :3
just couldnt run stabilityai/stable-video-diffusion-img2vid-xt
ah
maybe i find answers here
the xt version takes a lot of vram if i recall
Hello MNIST!
i think you mean img2vid tho? and not img2img?
img2vid
4070 super how much vram is that? i dont know all the models
Works fine here
Hopefully not more than 25 🙂
Model can't handle more ...
yea
well i have no idea
are you definitely using your gpu? sounds like you might be using cpu/downloading some models from the workflow for the first time if it's hours
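import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
print("-------------------- START -------------------")
# Load fp16 weights; the default is fp32, which needs far more memory
pipeline = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipeline.to("cuda")  # without this the pipeline stays on the CPU and can take hours
print("Pipeline loaded")
image_path = "image.png"
image = load_image(image_path).resize((1024, 576))  # resolution the XT model was trained at
print("Image opened")
# decode_chunk_size trades VRAM for speed; .frames[0] is the list of PIL frames for the first video
frames = pipeline(image, decode_chunk_size=8).frames[0]
print("Image passed to model")
# The pipeline output has no .save(); write the frames out with export_to_video
export_to_video(frames, "output_video.mp4", fps=7)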
here is the code
no idea
I'm guessing that downloads the models from that link, which can take ages currently
first time just running it without ui
show the output not the code
is your pipeline using cuda?
and check in task manager or nvidia-smi or whatever if the gpu is being used
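If you would rather check from Python than from task manager, something like the sketch below works. It only shells out to `nvidia-smi` (so it assumes an NVIDIA card with the driver tools on PATH) and the helper name `gpu_usage` is made up for this example. Run it while the pipeline is generating; if used memory barely moves, the model is probably on the CPU.

```python
import shutil
import subprocess


def gpu_usage():
    """Return one 'name, used, total' string per NVIDIA GPU, or None if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        out = subprocess.run(
            [exe, "--query-gpu=name,memory.used,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        # Driver missing, tool crashed, or it took too long: treat as "no info".
        return None
    return [line.strip() for line in out.splitlines() if line.strip()]


usage = gpu_usage()
print(usage if usage is not None else "nvidia-smi not available on this machine")
```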
ok i will
Maybe he is into Anime?
did u use a ui
Stable Diffusion Forge has SVD included and is pretty easy
i recommend you use comfy, it's very well optimized for memory and image2vid works nice in there
or forge i guess
Comfy is better if you don't need to learn everything 🙂
nah you dont need to learn a lot :3
can you guys provide me a doc or a link? that would help a lot
actually im tryna implement ai to my app so i need a proper api
you can use comfy and forge in an api too
thats why i tried it on python
just copy model and that's it ...
after you install comfy, you follow this for example: https://comfyanonymous.github.io/ComfyUI_examples/video/
for api services which one do you guys suggest
i never used api services personally, they are very limited
oh why
Same here
if your workflow is simple then yeh the way you are doing it with diffusers is best for API, if it's more complex then you can turn comfy into API
im gonna stay with comfy forever ❤️
so you just want the easiest approach, in which case comfy is not for you
but all the examples on internet are about UIs
yep i guess so
just a simple python code to run this model
i could not find any
the only problem with the easiest approach is that it's most likely not gonna give you the best results, for example people literally have comfy workflows specifically tuned for svd to work and give nice results, which is not just the svd part, so yea.. you prob wont get the best results
well good luck and always ask if you need help
thxx

Nicer to have some serious questions 🙂
as opposed to not serious questions

what up doc
SD3 when? 😄
come on,.. that is serious haha
Got fresh brewed tea ... that's serious ^^
tea for two
i mean the question is serious, but the answers are not... for example april 26 🙂
Nobody here can give the answer so how serious can be the question?
well to be fair, not everyone is up to date with news perhaps, so technically speaking, they could have missed some actual news from SAI in which they did announce an actual date, and so the people asking sd3 when kinda makes sense in that regard
but yea most are just meming at this point... cause sd3 is a myth :3 lol
But even if somebody working for SD would be here ... I don't think he would know ...
i can only assume the lead dude knows, cause he gives the final order for release, the devs dont know when it will be released, they are just devs working on it
And let’s not forget who the devs report to
But the lead dude has other problems than chatting here 🙂
exactly
But he can DM me 😄
we are just the peanut gallery for the lead dude, he can't possibly be bothered by us down here
I still offer to watch the killswitch ^^
"look at these peasants waiting for sd3", he said calmly
look at these people thinking they are owed timelines and status updates
right
Maybe we should start a crowdfunding for earlier release? 🙂


im actually curious what Emad is gonna cook
him joining microsoft is poetic for the most part
i hope he works towards his own dreams while facilitating his role there
Spaghetti Monsters ...
i mean what if that was a clever name for a new diffusion model
since diffusion models are technically entangled spaghetti
@honest mica hey just have a question for you. i know you trained the CCTV loras, and that is technically a concept i guess? i want to try to train a concept lora, so wondering what are the best practices when training for a concept rather than just some object or thing? do concepts need something special? either parameter wise, amount of picture wise, or captions wise, or idk.. any tips? :3
what is the best model for converting a pixelated image to a realistic one?
SUPIR
unless you mean like an actual pixel style image, in which case, it can be almost any realistic model, cause you are doing image to image.
but if by pixelated you mean low degraded quality image to restored version, then SUPIR should help with that
and if the realistic model doesnt work, you can force it further with a lora like realistic slider, and put the slider to max strength and it should convert it
i did this to convert some anime pics to real and vice versa
magic image refiner could also do the trick
is that available in comfy?
i believe it is a comfy workflow—let me try to source a link for ya
thx
hi, does anybody have a reference to a good image to video comfyui workflow?
so wait im confused... cog? but can it be used in comfy or no?
i’m gonna say no
i’m looking more closely at it
sorry for the confusion
it is a pretty good controlnet sandwich though
this one? https://huggingface.co/camenduru/SUPIR
i recommend the Kijai version https://github.com/kijai/ComfyUI-SUPIR but yea
yea Wizard LM 2 released
crazy performance looks like
i think i wasnt clear on what i wanted
i want a model that converts pixel art to realistic
that model looks like just an upscaler
it looks like they made a small mistake and released it a bit too early https://new.reddit.com/r/LocalLLaMA/comments/1c586rm/wizardlm2_was_deleted_because_they_forgot_to_test/
lol, so i guess the first version out there is less censored
i think they were under the impression you had a pixelated image. in your case i would recommend a good img2img controlnet-based solution
I enjoy Fooocus for its ease of use and consistency
yea cause i wasnt sure what you meant by the word pixelated
it's ok 🙂
all good here
ye so you can use almost any realistic model, combine with optional lora like realistic slider, and then use image 2 image with strong denoise (cause it needs to convert it, so maybe 0.50 or above) and perhaps controlnet canny or whatever, depending if you care about the details
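To make the "strong denoise" advice a bit more concrete: in img2img the denoise strength decides how much of the step schedule actually runs on your input image. The sketch below follows the convention used by diffusers-style img2img pipelines as far as I know; ComfyUI and other UIs may round or schedule differently, and `img2img_steps` is just a name invented for this example.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (steps skipped, steps actually run) for a given img2img strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Higher strength = more noise added to the input = more steps to denoise it.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


for strength in (0.25, 0.5, 0.75, 1.0):
    skipped, run = img2img_steps(30, strength)
    print(f"strength {strength}: skips {skipped} steps, runs {run}")
```

At low strength only the last few steps run, so the pixel-art blockiness survives; around 0.5 and up the model has enough steps to repaint it as something realistic, and controlnet is what keeps the composition from drifting.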
you could find a really complex workflow and then go Doc Brown on a DeLorean in ComfyUI
sometimes even simple workflows can work too, no need to go extra crazy 🙂
Lora merging still broken in Forge 😛
but adetailer, perturbed attention guidance, regional prompter, and controlnet all working awesome
Prompt Engineer
I tried that Connect Net, got some picture of someone in some pose, it was like 50% chance of the person to be in the pose given lol

connect net or control net?
What's prompt engineer?
Someone who writes prompts XD
Yes
you need to go to university to become a prompt engineer
I've heard some shit takes in my time, and that is one of them - a half hour on youtube and a couple of good links and you can be writing great prompts in no time
hell even just asking the bot on civitai for a prompt gives you a half decent starting point to work from
i hope you didnt take what i said actually seriously... lol, i meant it as a joke considering people call it "prompt engineering" and engineering is usually related to university studies :3
the /s was dropped
it's all good fam
Virtual assistant available here ✌🏻
virtual? im only interested in actual real assistants, sorry
when is DeepMind going to start playing Vampire Survivors and can I watch the stream plz
wow Cascade does sausages really well
is that because it’s a german architecture?
based on what i understood, Cascade is basically Wurschten v3, or however you spell that in german and it seems that word means sausage, so i guess it makes sense? lol
haha
I feel dumb, but browsing the cascade sub, I'm looking for conversations and information, but it's mostly images. Is this an image generation bot channel thing like mid journey? I'm not seeing any indication in the rules channels.
speaking of midjourney what’s with all the mid finetunes claiming to match v6 performance lately?
small spoiler: they do not
and it's gotta be a top master's program
for sure
it's because cascade is dead. it's been non-stop discussion for the few months since it came out, but ultimately most of what's being generated now is being put through an sdxl refiner because they know nothing new will ever come from cascade ever again. all the people who made it left the company, and nobody is making any finetunes because sd3 was announced within a week later.
yep. unfortunate.
i was using it for its superior clipvision ability, but now that ipadapter for sdxl has been revamped, that's now way better than cascade was.
the really big issue with cascade is stage B
if you don't use enough steps, you have leftover noise. but if you have too many steps, you have oversampling noise of some kind.
generally, the number of steps that lay between those two areas is zero
würstchen = tiny sausage
can you imagine how great the new IPA would be with cascade's CV?
too bad that's not a reality
Apply for a job, put it on your CV

I mean a good beer and generating images is always a good time 😛
why not generate images of Beer?
🍻
hmm quick question, anyone know where .pt files go?
well.. where to find it
dropped it in the embedding file but cant seem to find out where to use it
first make sure you have the right kind of model loaded. ie. a 1.5 model with an embedding trained for 1.5. if you have something like 2.1 or sdxl it wont be compatible. secondly, look for a tab called "textual inversion" on the main page of automatic1111, and you should see the embedding there
the .pt files go under stable-diffusion-webui/embeddings
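In automatic1111 an embedding is triggered by typing its file name (without the extension) in the prompt, so a quick way to see which words are available is to list that folder. A small sketch, assuming the default `stable-diffusion-webui/embeddings` location mentioned above; the helper name `embedding_tokens` is made up for this example, so adjust the path to your install.

```python
from pathlib import Path


def embedding_tokens(embeddings_dir):
    """List the prompt words implied by the embedding files in a folder."""
    exts = {".pt", ".safetensors"}
    folder = Path(embeddings_dir)
    if not folder.is_dir():
        return []
    # The file name minus its extension is what goes into the prompt.
    return sorted(p.stem for p in folder.iterdir() if p.suffix.lower() in exts)


# Path is an assumption; point it at your own webui folder.
print(embedding_tokens("stable-diffusion-webui/embeddings"))
```

If the list is empty, the files are in the wrong folder; if the names show up here but not in the UI, it is usually the model-version mismatch described above.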
hmm yeah i have it all in embedding but idk how to use them
you click on it and it puts a tag in your prompt
so they think my sausage is tiny eh?
like i dont see anything for the embedding
in a cute way
I’m going to finetune a lora of just action heroes busting through doors with guns and call it Embed This
do it
nah
but uh if anyone knows how to use the embedding lmk
"General chat about all things Stable!"
“Democracy”
Hi lodis. Is it though? I think you got confused with concept and style. Anyway, I just caption the whole image and put my keyword in front of the caption. Like CCTVfootage, {caption}. Amount of images I would recommend something in the range of 35 - 250. Parameters I could send you later, because I am not home at the moment.
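The "keyword in front of the caption" step is easy to script. Here is a minimal sketch, assuming the common layout of one `.txt` caption file next to each image (that layout and the helper name `prefix_captions` are assumptions for this example, not something stated above). It is safe to run twice because captions that already start with the keyword are skipped.

```python
import tempfile
from pathlib import Path


def prefix_captions(dataset_dir, trigger):
    """Prepend 'trigger, ' to every .txt caption that does not already start with it."""
    changed = 0
    for txt in sorted(Path(dataset_dir).glob("*.txt")):
        caption = txt.read_text(encoding="utf-8").strip()
        if caption.startswith(trigger):
            continue  # already tagged, leave it alone
        new_caption = f"{trigger}, {caption}" if caption else trigger
        txt.write_text(new_caption, encoding="utf-8")
        changed += 1
    return changed


# Demo on a throwaway folder so nothing real gets modified.
with tempfile.TemporaryDirectory() as demo:
    Path(demo, "0001.txt").write_text("a parking lot at night, grainy", encoding="utf-8")
    prefix_captions(demo, "CCTVfootage")
    print(Path(demo, "0001.txt").read_text(encoding="utf-8"))
```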
Has anyone been able to make pixel art by using an image as its style?
Thank you very much 😄
np 🙂
Just to double check it goes in the Lora folder right?
yup
err
yes
that is a lora
had to check real quick cuz there's also a checkpoint that's kinda similar
Doesn't seem to be showing up in the lora tab. I am putting it in D:\sd.webui\webui\models\Lora
might need to hit the refresh arrow
damn lora how u tune so fine
lora? barely knew her
I just had to go in settings lol for some reason it doesnt show up on default
restart it then
yep It shows up now 👍
ah cool, thx for the info, and yea send the params if you can ❤️
do you think in the future we won't hire clothing models?

huh
Are there people that are considered to be experts in use of AI now? Not only image related but also other stuff? I have an interesting idea and was wondering if you guys could point me to 'experts' 😄
Like basically all the creative stuff like image generation, LoRAs, music, animation (Most important is image generation but you get what I mean)
I mean theres levels of experts - there are those that are aware of the wide variety of tools, there are those that know how to put those into a workflow to combine em, and then theres those that know how to build em
Sure, I'm looking for people that are good at setting up the technical side of it. I feel like I got a good creative mind and a good feel for what works and want to get really good at it but the setting up process just kills me sometimes. Like I'd spend so much time on setting things up but it feels like it's almost endless
So a tech support expert - youll find those over in #🤝|tech-support
most of the stuff doesn't even have proper explanation and/or even if you get to that point, there are always minor issues and it's so hard to understand why
which model would you use for inpainting faces?
yeah the documentation situation as a whole is horrendous
most of the time i've spent learning this shit was wasted combing through bad or nonexistent or wrong documentation
e.g. I have a class picture of my students. One parent has requested their son's face be removed (e.g. a celebrity kid and they don't want that shared). I could inpaint and change his ID without wrecking the picture
I feel like this is super underrated in this field and would make someone bank if they manage to actually simplify things
hahahahah yeah I'd follow something super closely, literally do things 1:1 but then its missing important information or is straight up inconsistent or doesn't work. Then I'll try someone else but I already used stuff from someone else so then it just becomes a clown fiesta
I have so many random basic questions all the time, I don't even know where to start 😄
But then if I ask something simple and/or have an issue that's clearly not even my fault, a lot of people I talk to would be condescending
and I also really want to set things up but dont wanna bother someone 24/7 so idk
very frustrating dynamic
Baby Iron Man
YaY! My 5k ASUS coupon arrived 🙂
ICBINP XL
I might be biased though
😮
but use that, dpm++ 2m karras, 3cfg, 832x1216, 30 steps, no hires fix, and pag scale of 0.6, adaptive scale 0.6
This used to be easy, but i guess it changed. What syntax am i supposed to use to generate an image now? I tried # stable-cascade and it doesn't respond
none - #1047610792226340935
why do i see other images that look recently generated?
because they would have been made with a different generator and copied in
I'm working on a workflow for AI architectural renderings. So far my results are very promising and already very useful in my practice for design choices. My first goal was to have control over the materials in specific elements in the view, also avoiding prompt bleeding, and I have achieved it with the use of ipadapter+attention masking plus regional prompting/regional sampler over color masked exports.
My next goal is harder and has to do with being able to generate multiple renderings from different points of view while keeping consistency in the materials. Using the same references in ipadapter helps but the material is not exactly the same, details appear at different places etc. Geometry consistency is obvs achieved with controlnets.
For this I have considered these strategies:
- Using the og color masks to cut the parts in the first generated image and using those cutouts as references for ipadapter in the next views, with the plus model and a high strength. The problem is that, for small elements, these cutouts can be quite small or far from square ratio. This could be solved with an upscaling but seems too inefficient. Also this wouldn't help with the position of specific details, only with the general material color and texture.
- Given that I have the og geometry, somehow transform the parts of the first image and place them in the correct spatial location in the next view, and use this as a latent in an img2img setting (but it would have a lot of "non filled parts").
- my fav: Considering AnimateDiff, this already tries to solve a problem of temporal consistency. I could export a video orbiting around the space, generate a full video maybe at low steps or low res (for efficiency) and then only choose the frames that are interesting to me to continue denoising and upscaling. I like this idea but it also seems inefficient. I wonder if there is a way to "hack" the motion information from the module to use it directly without generating every frame in the middle. Also, having access to the geometry, maybe I could export the accurate motion vectors directly without relying on preprocessing.
I'm relatively new to SD and therefore I'm sure there are a lot of other ways to tackle this problem of point of view consistency. I'm really looking forward to hearing about your ideas.
thx in advance 🙂
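(A rough sketch of the second strategy above, warping the first render into the next view using known geometry. Everything here is an assumption for illustration: a simple pinhole camera with focal length `f` and principal point `(cx, cy)`, a per-pixel depth map exported from the 3D model, and a rigid transform `R`, `t` from the first camera to the second. It is pure Python on tiny lists, so it shows the idea rather than a usable implementation.)

```python
def reproject(image, depth, f, cx, cy, R, t):
    """Forward-warp `image` into a second camera using per-pixel depth.

    image, depth: 2D lists of the same size (rows of pixels / depths).
    R: 3x3 rotation (list of rows), t: translation, mapping points from the
    source camera frame into the destination camera frame.
    Pixels that receive no source pixel stay None ("non filled parts").
    """
    h, w = len(image), len(image[0])
    out = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            d = depth[v][u]
            # Back-project the pixel to a 3D point in the source camera frame.
            p = ((u - cx) / f * d, (v - cy) / f * d, d)
            # Move the point into the destination camera frame: q = R p + t.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # behind the destination camera
            # Project into the destination image.
            u2 = round(f * q[0] / q[2] + cx)
            v2 = round(f * q[1] / q[2] + cy)
            # Keep the nearest surface when several pixels land on one target.
            if 0 <= u2 < w and 0 <= v2 < h and q[2] < zbuf[v2][u2]:
                zbuf[v2][u2] = q[2]
                out[v2][u2] = image[v][u]
    return out


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
row = [["a", "b", "c", "d"]]
flat_depth = [[2.0, 2.0, 2.0, 2.0]]

# Shifting the scene by 0.2 units in x moves every pixel one column right.
print(reproject(row, flat_depth, 10.0, 0.0, 0.0, identity, (0.2, 0.0, 0.0)))
# [[None, 'a', 'b', 'c']]
```

The `None` entries are exactly the unfilled regions mentioned in the message; in an img2img or inpainting setup they would be the areas left for the model to fill.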
idk if it will help you, but this still works: https://huggingface.co/spaces/multimodalart/stable-cascade
From my experience at the moment they are these:
(1) REALVIS-XL V.4
(2) NEWREALITY-XL_ALLINONE V.2.1
(3) LEOSAMSHELLOWORLD_XL V.5
currently these 3 models are excellent for very photorealistic images
😮
I got real vis, I should try those others out
not just for portraits, they are excellent because they have many varieties of faces and postures. Obviously it depends a lot on the prompt and the samplers you use (I recommend DPM++ 2M SDE Karras, or DPM++ 3M SDE Exponential)
lol, I got like 6 models already downloaded for some reason XD
hi
Gone for now ...
I like these: fullyRealXL, cinematicredmond, juggernautXL
forget the bots, no way they're coming back
Good morning, everyone! How are we all today?
Genshin Impact
Well thank you for that information! I'll stop with the cascade and just use xl, so as not to waste time on an abandoned resource.
While we are waiting for SD3... is there any way to use Pixart-Sigma with ComfyUI, Swarm or Forge?
https://github.com/city96/ComfyUI_ExtraModels yeah this page will get you going with comfy. i don't know about forge with it.
its not even that good is it
IDK - I just want to see for myself.
300 token limit and color / prompt adherence sounds good.
Hope the reason SD3 is taking time isn’t because they’re retraining without any copyrighted data.
comedian eh
why do u assume they trained with copyrighted data in the first place
I have to be able to generate pictures of Mario taking a shit 🚽
That’s my benchmark
it’s behind, development-wise. but it’s not a Stability model, so already it has that going for it. lol
thats a bad thing why?
you think they will stop after that? why
Stability has declared SD3 to be their last T2I model.
it has a lot to do with Emad leaving, but it could also be some other factors involved as well.
yeah well I wouldnt know enough about that
lots of politics involved there
I dont know if people will adopt sd3 anyway
the inference times seem really bad
maybe with turbo
I’m certain they will, which is part of the complexity of the issue for sure.
but it probably costs a ton more to train
cloud GPUs will step in on the timing part.
those will begin to take on more commercial representation as more investments are made
but prices will rise
the same logic Apple uses to deduce that consumers are okay with having a monthly plan for their smartphone as opposed to owning it outright is going to be the same logic used for cloud GPU, etc.
personally I dont think diffusion transformers are very good right now.. I dont see the same kind of generalisation LLMs have in these DiT models
Is forge better than comfy regarding memory management and avoiding out of memory errors? I like comfy and have no desire to go back to an A1111 style UI, unless memory management improvements are significant.
i think forge is memory king atm iirc
people believe diffusers can go that much further just with better attention?
that’s also why i say structure
with more controlnets for different types of data
etc.
or rather, not data but syntactic detail
Am I stubborn for liking the comfy interface above all other considerations?
nah not at all 😛
it’s a cool interface.
it represents the actual workflows that these models all use and that’s like operating a steam engine almost 😛
I like it but it gets spaghetti
^
I have no need to generate images, but it's fun, and comfy keeps me interested. I suppose I do need to try forge though.
do we even have models that can identify features that are small/obscure?
not really—not to my standards at least!
I cant think of a good example
just like, does this piece of clothing make sense, should this thing look like this, etc
text encoding precision is also a key factor…obviously
text encoding is sort of like how the number of pixels in a raster window are defined
in my simplistic view lol
its counter intuitive to me because I assume smaller prompts would work better for some reason
consider how much better SD 1.5 images tend to look when they use ELLA
less stuff for the model to get wrong
thats true
I just thought DiTs would be able to recognise their own mistakes more often but that doesnt seem like the case
perhaps there’s more potential for them to down the line, but the code doesn’t seem to support that function as much as it’s theorized at the moment.
yeah
we have an instinctive tendency to approach models holistically, which is good, but we’ve managed to make older stuff shine more just by building in new functionality to preexisting architectures. this is going to continue to be a powerful thing since the sky’s the limit with creativity and AI.
and how it gets extended. it’s like digital putty.
IMO
I agree
thats why I was on the fence about sd3 being "good enough" when 1.5 and sdxl are still getting better every day
i’m just getting caffeinated this morning so i’m already on my AI soapbox
I think it’ll be a really nice, polished sports car of a model. But we’ve got Honda Civics already that have plenty of potential for mileage. That’s how I see it. 😛
like Sora—that’s a lamborghini for sure.
we get a sports car when we need an off roader
lol
oh….we don’t get the sports cars
hollywood gets those lol
j/k
not j/k
what’s really going to be interesting is when smartphones and other devices start to carry localized LLMs. for example the current iphone can run Mistral 7B
and others
When those types of models begin driving other functions of the device, that’ll be a game changer. Even local image diffusion will be a thing, the likes of Apple could even have their own proprietary diffusion algorithms baked into a future release
Sd3 api released
😮
Open weights with stability membership soon according to twitter
badass
oooo someone’s got an SVD multiview project going https://github.com/king159/svd-mv
tweet or something? how
By chance does that cost money? and is that the only method to get stable diffusion 3?
no
we'll get people invited to Stable Assistant where you can use SD3
and also we'll get the models themselves in the future
(model files + code)
So now we just play the waiting game?
like we have been all this time tbh
lol
but this means we're finally closer
they've been promising API "soon" for 3 weeks

I guess its better than nothing.
yup
but all that matters is that we WILL get the models that we can use offline and etc
even if its like 2-3 weeks away
I cant stand being impatient, it feels like ive been waiting for ever.
I wonder why people are so impatient, yes a few weeks is a lot of time in the AI world, but sometimes it feels like the only reason to live for some people is to complain about SD3 not being released already
i can confirm this is true we also are depressed.
see
now I do admit that announcing it so early was a stupid mistake
its been almost 2 months since they announced it
like thats the worst possible way to hype something up
I hope they create a #sd3 channel soon
Am I reading that announcement right? Sounds like paid membership will be required for SD3 model weights even for personal/noncommercial usage?
didnt membership used to be free under 1m users, and $20 above that? Now it seems to be free for 0 users, and $20 for <1m? https://stability.ai/membership
Okay good, it's not overly clear in the announcement
well that stinks.
so you have to pay for membership even if you have 10 viewers and get 2 cents from youtube ads?
We aim to make the model weights available for self-hosting with a Stability AI Membership in the near future. kinda sus
will they be open sourced by chance?
the code yes, the models will have that license where its noncommercial
but everything offline available
dang i was hoping to make a comic or something with the new models...
if you dont make revenue, like its just for free you're fine
I mean I would not have the heart to make these images for money
I do these for fun
Why not?
Id like to make some videos where I can get at least some ad revenue if anything goes viral, but kind of a nonstarter if you have to pay every month even at first when you are making 3 cents
from my tests, sd3 is laughably bad 😢 could they have some pre pre alpha in the api... that's hard to believe as well
well unlike these ai comic artists, I barely put in effort, and even with all the opted out artists there are a bunch of artists or studios that had their style left in so that would also make me feel guilty
it can't even get text without errors while that was supposed to be the big thing
if you put in a lot of effort and draw over it a lot and stuff then sure, I get it, you'd want some revenue for it cause you actually put in effort
@charred mesa same, I love seeing these models as open and free as possible as it fuels research and innovation. Like supir, omg that is incredible, especially for restoring old photos.
Im not even sure if i could make money with stable diffusion.
I hope it's worded that way to advertise the memberships, but it's confusing for people waiting for the open non-commercial weights
There's just so much cool stuff that wouldn't exist if SD wasn't open
Hi guys
Are there any steps you can suggest to change expressions in a video using stable diffusion (not faceswap to a different person)? Like, my video has a bored face and I want to turn it to singing.
can anyone tell?
inpaint a different expression
in video?
can you please tell me more how to do it, if you can please
idk what you use for your video, but most of the approaches have an initial image you can provide
temporal video editing hasn’t really become a thing…yet
it’s getting there
but not quite
So just to make sure sd3 is free thru api at the moment but cant be used to profit off of? is that right?
it’s being made available via the developer API, yea
Ok i just wanted to make sure, and said api is free or is there a limit/paywall?
Interesting...
they must be having some interesting internal conversations
possibly?
So you will also need a membership to use the models on your machine? (comfyui/invoke/a1111)
the tweet makes it sound like that but I'd love for it to be clarified before it turns into a shitstorm
Would definitely be nice for stability to clarify this.
membership is free lol
the only time you need a paid membership is for commercial usage and its $20 a month
^
dark your pfp looks like netero
So not really free
Lol, i guess
well if you thought you could use a million dollar model commercially for free thats basically stealing
I think $20 a month is a good compromise
yes
It is if you do more than that in "revenue", not if you maybe make a couple bucks once in a while. Anyway, I get the point
I still think there should be like a threshold that makes it "commercial use" kind of like Unity3d does
If you do less than X money, it's free
yeah I wish the $20 a month was at least only for if you do make more than $20 (or a bit more even)
though I guess they are unlikely to come at you if you are making $10 but still
are you sure, also 20$ isn't much.
ok gimme 20$
Subscriptions tend to pile up...
eh, if you are just someone trying to make tiktoks or youtube videos, and you pay for a year trying to make it $240 while you make $0.34 back, on top of all the other stuff you use it's not great
i cant tell if thats bad or good
especially since you also need a bunch more hardware, electricity etc. to generate compared to paying slightly more for sora or whatever
I remember times when you could buy stuff and simply own it
gpt costs 20 bucks a month and you get more features and isn't such ai cheaper?
There's this, MJ, openai... But ok, not really their problem
gpt costs a lot more to run, like you can never run it at home at that speed, so being same price while gpt4 has no hardware you need to purchase and runs on their hardware is a good comparison of how much better of a deal gpt4 is
I wish i could run sdxl but my hardware stinks.
what are the sd 3 costs via api? can someone let me know real quick
where can i find this at? if i might ask? or like how does it work?
A1111 now supports fp8 mode
what if i use forge?
so if you download new version it should be right there in options under optimization
no idea
never used
I guess ill research it then.
forge generally has lower requirements than a111 so definitely try it over a111 if that's your issue
4 cents per image
I use forge im just not sure how i can run sdxl on low end hardware. ive been using sd 1.5....
you probably cant but 500 means their server is crashing from the request, so either it's a problem on their end or you are sending broken data - ill-formatted or something
Wouldn't that be a 4XX error though? (like 400 BAD REQUEST)
who said
oh
20x is crazy
scaling laws my ass
sounds like issue on their end
no idea, didnt test it out
so exactly what I guessed and they've even told you what it is
youve put 'your account' instead of a token, presumably copying the documentation directly instead of actually registering, getting a token and putting it where it tells you
When can we expect the weights?
oh lol
my guess is by the end of the month
I mean the model is essentially not released for as long as it's on the API.
my guess is next month
are you sure you didnt close/unclose some bracket or quote in your request
nope, it really is an error on their end... that mess is the response
I think we should focus on SDXL Ella implementation and getting text gen working in SDXL.
If SD3 is going to be this heavy, Stability is not considering local usecase
hm fair enough then
the implementation is there someone just needs to train it
500 generally means internal server error
there's multiple versions of sd3, some smaller than sdxl
A few are already working on it.
I have SD1.5 Ella > IPAdapter SDXL running and it is great.
most servers arent that well configured, and can return a 500 in a ton of cases
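(A small sketch of the status code convention being argued about here, using only the standard library. By convention 4xx points at the request and 5xx at the server, but as noted above a badly configured server can answer 500 to a request that was simply malformed or had a placeholder where the token should be. The `blame` helper is a made-up name for illustration.)

```python
from http import HTTPStatus


def blame(status_code):
    """Rough guess at which side an HTTP status code points to, by convention."""
    if 400 <= status_code < 500:
        return "client"
    if 500 <= status_code < 600:
        return "server"
    return "neither"


for code in (400, 401, 500):
    print(code, HTTPStatus(code).phrase, "->", blame(code))
```

A missing or placeholder token would ideally come back as 401 rather than 500, so a 500 on its own does not prove the request was fine.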
not to take part in the current discussion, but as I wanted to try SD3 so made a quick node to use the API, you get 25 free credits anyway to try for the curious: https://github.com/kijai/ComfyUI-KJNodes/commit/22cf8d89968a47ce26be919f750f2311159145d1
good to know, what news on lavi bridge?
no news from what I've seen
I fixed the comfy native ELLA node btw
it never worked properly before today 😛
I read the arXiv paper but had not seen the github. I am really interested in checking this out, especially as the Llava models get a boost from WizardLM and Llama 3
I think I resorted to another deployment, I will wrap back around and see if I can get yours working.
yeah because the ella models are trained on t5 arent they? bigger is always better imagine using llama 3 would be great
I have made a wrapper node for LaVi bridge as well, it's far worse than ELLA and there's no SDXL model for it either :/
I made a PR for the ComfyUI_ELLA about it: https://github.com/ExponentialML/ComfyUI_ELLA/pull/25
Ella 1.5 > Composition Adapter > SDXL
And then additional images for style control is really ace.
Unbelievably controllable on SDXL, and the CLIP seems to work better when composition weighting is involved.
I stopped messing with Stable diffusion for a few months and came back and now everyone's talking about ponies, what's going on
Pony is the go to anime Checkpoint right now if im not mistaken
What makes it so much better?
a terrifying location
Only nice people in there 
Hi everyone, I've been wondering whether it's possible to generate the exact same person in different pictures, like some Tom (who is not a real person or a celebrity). For example I want to create a picture where Tom is cooking or walking a dog.
How do I do it? Do I need to describe Tom in the prompt or use some special seed?
There's a guy in the images room showing off the preview, fwiw
is it better than nai v3 yet?
Im not the right Person to ask
right XD
I don't like the wording of the announcement
A few ways: you can create a pose collage, split it and use that to train a lora. There are also a variety of face swap tools, such as reactor, controlnet ip-adapter, and instant id
It will take effort, but it can be done
Is it possible to get it from default model, without training lora?
like any model from civitai
Yeah, with the controlnet method, but Lora will be more flexible
(Masterpiece), (Best quality), (Ultra HD), (Super detail), (Whole body :1.2), 1 girl, Chibi, cute, smile, flowers, outdoors, holding the camera, sitting on the roof looking out into the distance, with mountains in the background, amber, warm yellow, sunset, artistic sense, Quadratic style, white clothes,
Eww, 1.5 prompt
tbf that's more natural to me, because that's how I search in google or whatever, and not writing prose with a bunch of obviously useless filler words
i never heard "quadratic style" before
well I got to try the SD3 api
and its almost over
only 25 credits, 4 credits per SD3 Turbo image
and thats not a bundle of 4 images or anything
just ONE image
so I guess I'll either wait for stable assistant or weights lolll
they need to make money before making it open weights, i guess it makes sense, but yea...
So.. as just a normal user that wants to try SD3, any easy way I can use this API?
Is there like a simple website I can go try this on?
so he said few weeks from now huh... that kinda passes my estimate of April 26 it seems :3
so maybe May 10
api only
so you have to figure it out
you could try this https://github.com/kijai/ComfyUI-KJNodes/
Nothing I can just input my API key into and go off to the races?
so he says today (always API first, then a few weeks later weights), a few = 2+ so likely early may, possibly mid may
my new estimate is May 10 
yeah with all your massive 25 credits
4 credits per one SD3 Turbo image
and 6.5 credits per one SD3 image
I have 567 credits
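(Quick arithmetic on the numbers quoted above: 25 free credits, 4 credits per SD3 Turbo image, 6.5 credits per SD3 image. A minimal sketch; the prices are the ones stated in chat and may not match the current API pricing.)

```python
import math

# Credit costs per image as quoted in the chat, not checked against the API docs.
COST_PER_IMAGE = {"SD3 Turbo": 4, "SD3": 6.5}


def images_affordable(credits, cost_per_image):
    """Whole number of images that fit in a credit balance."""
    return math.floor(credits / cost_per_image)


for balance in (25, 567):
    for model, cost in COST_PER_IMAGE.items():
        print(balance, "credits ->", images_affordable(balance, cost), model, "images")
```

With the free 25 credits that works out to 6 Turbo images or 3 full SD3 images.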
so how does the turbo version compare quality wise to base sd3?
nice
its pretty nice
could NOT test it cause I dont feel like spending 10$
understandable
so sd3 will require a membership for commercial use?
yes
technically required for non-commercial use too but it's a free membership in that case
thats only for the api
that's not what the post says 🤷 "In keeping with our commitment to open generative AI, we aim to make the model weights available for self-hosting with a Stability AI Membership in the near future."
as far as I know sdxl also requires membership for commercial use
they cannot be this stupid
but isnt that openrail++
you sure you don't mean sdxl turbo
well, the https://stability.ai/membership page has been like this since at least sdxl, and it says core models (which includes sdxl) are free for non-commercial, $20 for commercial same as now
nothing has changed there
also when you download most of their models through huggingface (I think including sdxl but idk anymore) you agree to the same non-commercial clause
I think it's just so people sign up and they can get user numbers to show to potential investors, and track engagement etc.
exactly, sdxl is fine, 0.9 and turbo not
that's stupid cause on huggingface sdxl still has openrail and sdxl turbo has sai-nc-community
well, whatever, I'll check sd3 licence at launch, if it's bad I'll just stay on sdxl or other open rail models that come out
I guess Ill worry about it later, I do hope this helps them keep existing at least
because currently the chances of ever getting sd4 are kind of grim
trying to do this.. I'm stupid and can't figure it out. I cloned into custom_nodes and installed dependencies, but I don't see a SD3 node :/
yeah its gonna be sai-nc-community
10 times out of 10
guaranteed! 👍
I guess at least if someone trains using the code and architecture from scratch, but not their weights they can make it fully open
the nodepack is in the Manager too which is probably easiest to install, it should work with the steps you described though
we have pixart-sigma which is openrail++, but that's 0.6B
similar prompt adherence, no text capabilities at all and somewhat cooked images (ESPECIALLY FOR PHOTOS)
Let me know when someone makes a UI for the SD3 API
will sd3 not allow fine-tunes to be uploaded to e.g. hf and civitai because of its license?
naaah
I figured it out, beginner mistake I've done before. Installed dependencies on the system python environment
it will
ah yeah, classic
hm
its just that the finetuned models will have to be licensed the same
:/
at least we now know why there was drama with people leaving stability ai
those were researchers
ok...?

cause emad left because of some openness reasons or whatever
I forgot the exact reason
and that is exactly what i refer to?
I mean, they've been too open anyway, it was impossible for it to last - it was basically burning VC money to give us free shit and that can only last so long before they run out of people giving them money
look up enshittification
sort of
its the definition of it
no, the definition of it is closer to milking users to increase profits
pretty sure civit and HF will pay for a commercial membership so they can do inference services on SD3
this is literally burning VC money for the users
drhead do you think Textual Inversion will make a comeback
you can train on 24GB and it may work for all 4 models
a dev said that SDXL Textual Inversions may work on SD3
hopefully meta will join the diffusers game too and scare all commercial solutions like they did with llama2 and are gonna do again with llama3 next month
that'd be sick
textual inversion is already decent if you know what it does and what its limitations are, though I would imagine it might be difficult to make one work on multiple text encoders, and god knows how T5 will handle it
llama hasnt scared commercial solutions, it's been really cool for people but hasnt scared openai or anthropic one bit
stuff based on llama did, tho
look up mistrals latest open model
yeah T5 might interfere
hmm
the top open source currently is from cohere, cmd r+
but still none of it scares them as it can never really quite catch up
tho yeh surprisingly close
Meta has Emu, they haven't released their image model 😢
if only there would be another openrail++ t2i besides Pixart-Sigma
pixart too cooked for me
let me guess, it somehow derives from llama as well?
but it has great potential
yea potential is there
if you go back the chain to its root
like, as far as I know textual inversion is one of the safer ways to do "quality" alignment. when i was trying to make soyjak faces on one of the furry models, I noticed that all of the preference alignment loras I looked at made the outputs into just glossy sparkly generic stuff, and the negative embedding for boring images made higher quality soyjaks which is what I wanted
I dont believe it does
also I didnt know Command R+ is rated higher than some gpt4 versions lol
i wanted to try Emu Edit, but it's not open weights 😦
yes it's the first one that got higher than early gpt4s
Isn't CosXL similar? I wasn't super impressed with that model
both emad (former stability ""dev"") and Lykon are saying that weights will come
yea similar but cosxl has some blurriness problem idk or it kinda deforms the output
So to clarify, once you are done finalizing the architecture, will the model be released where people can download it for free for personal use?
Maybe even before, it's not up to me to decide.
Its artifacts remind me of SD1.5
but cosxl technically is the "test" before sd3 edit model i guess, so hopefully the sd3 edit model will be better
Even cosxl base was mid, I was hoping for different clip coherency.
Turns out IpAdapter Composition does way more to help things than a model switch did.
^
then again... i didnt try to combo cosxl edit with sdxl refiner.. maybe it fixes the output
oh I was hoping for more than a blurry finetune with just contrast being increased
I'm not concerned one bit about the non-commercial license tbh, I already am used to releasing my finetunes as non-commercial. I make those models for people to use, not for people to throw on some expensive cloud service or for people to put into some paywalled low-effort merge.
I was getting very large oddities, I don't think the refiner would clean some of the mess i was getting.
I also never use the SDXL refiner ever, so... lol
^ same
yea refiner i only used at the very start of sdxl, then finetunes came and it was kinda pointless
not for people to throw on some expensive cloud service or for people to put into some paywalled low-effort merge
based
Re: Command-R #🏞|general-with-images message
I used the refiner in like the first 2 weeks of sdxl, and I think maybe once in january
yeah
I just used hires multipass from the get go. I only played with the refiner right at release
they probably should have made the refiner go over the last 300-400 timesteps instead
A bunch of the finetunes integrated refiner training stuff for a while too. I don't really pay too much attention once I get things working
the 2nd wave of finetunes were already exclusively saying to run without refiner
exactly
yes because they were too lazy to change the one line of code necessary to train it and invest an additional 20% compute in the training run
literally all you have to do is change the timestep selection from sampling [0, 1000) to [0, 200)
and since you're only training 1/5th of the timesteps the model should pick up on things about 5x as fast since it will only ever need to learn high frequency details
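(A sketch of the "one line" being described: restricting the training timestep sampler from [0, 1000) to [0, 200). This assumes the usual convention where low timestep indices are the low-noise end; the function and constant names are invented for illustration and are not taken from any particular training script.)

```python
import random

NUM_TRAIN_TIMESTEPS = 1000   # full range used for the base model
REFINER_TIMESTEPS = 200      # low-noise range the refiner is trained on


def sample_timesteps(batch_size, upper, rng):
    """Draw one integer timestep per sample, uniformly from [0, upper)."""
    return [rng.randrange(0, upper) for _ in range(batch_size)]


rng = random.Random(0)
base_batch = sample_timesteps(8, NUM_TRAIN_TIMESTEPS, rng)
refiner_batch = sample_timesteps(8, REFINER_TIMESTEPS, rng)

# Share of the full timestep range covered by the refiner.
fraction = REFINER_TIMESTEPS / NUM_TRAIN_TIMESTEPS

print(base_batch)
print(refiner_batch)
print(fraction)
```

Since each refiner timestep is drawn with probability 1/200 instead of 1/1000, every one of them is seen about five times as often per training step, which is the basis for the "5x" estimate above.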
It's also just tedious to manage/publish 2 files rather than one 🤷♂️ and help confused users who applied them in the wrong order and so on
plus the refiner model is also differently structured:
"attention_head_dim": [
  6,
  12,
  24,
  24
],
"block_out_channels": [
  384,
  768,
  1536,
  1536
],
"transformer_layers_per_block": 4,
"up_block_types": [
  "UpBlock2D",
  "CrossAttnUpBlock2D",
  "CrossAttnUpBlock2D",
  "UpBlock2D"
],
vs the base:
"attention_head_dim": [
  5,
  10,
  20
],
"block_out_channels": [
  320,
  640,
  1280
],
"transformer_layers_per_block": [
  1,
  2,
  10
],
"up_block_types": [
  "CrossAttnUpBlock2D",
  "CrossAttnUpBlock2D",
  "UpBlock2D"
],
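(To make the difference easier to see, a small sketch that loads the two config excerpts above as dicts and summarizes them. Only the keys quoted in chat are included, so this is not the full config of either model. Treating a single integer `transformer_layers_per_block` as "the same value at every level" is an assumption made here for comparison.)

```python
refiner = {
    "attention_head_dim": [6, 12, 24, 24],
    "block_out_channels": [384, 768, 1536, 1536],
    "transformer_layers_per_block": 4,
    "up_block_types": ["UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"],
}
base = {
    "attention_head_dim": [5, 10, 20],
    "block_out_channels": [320, 640, 1280],
    "transformer_layers_per_block": [1, 2, 10],
    "up_block_types": ["CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"],
}


def summarize(cfg):
    """Collect a few comparable numbers from a config excerpt."""
    channels = cfg["block_out_channels"]
    layers = cfg["transformer_layers_per_block"]
    if isinstance(layers, int):
        # Assumption: a bare integer applies to every resolution level.
        layers = [layers] * len(channels)
    return {
        "levels": len(channels),
        "channels_per_head_dim_entry": [
            c // h for c, h in zip(channels, cfg["attention_head_dim"])
        ],
        "transformer_layers": layers,
        "cross_attn_up_blocks": cfg["up_block_types"].count("CrossAttnUpBlock2D"),
    }


print("refiner:", summarize(refiner))
print("base:   ", summarize(base))
```

The refiner has four resolution levels to the base's three and wider channels, while the ratio of channels to the `attention_head_dim` entry is 64 at every level in both.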
idk, there were some using it and some not using it, and those that didnt use it ended up rising to the top of civitai downloads, presumably for a reason
Same reason cascade is not popular, people do not like to load a million things to do one thing.
I feel like most workflows require you to load a ton of things and people deal with it
i like cascade more for the image remixing part 🙂
if it was enough better people would've lived with it
But each piece of the puzzle feels independent when crafting a flow. Having a segmented model is just awkward.
It also doesn't help that both ComfyUI and A1111 had incorrect implementations of the refiner. A1111 was switching over based on sampling step until I fixed it, and ComfyUI doesn't have an easy way to switch based on timestep that is built in.
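(A toy sketch of why switching on sampling step and switching on timestep are not the same thing. The two schedules below are invented for illustration and are not the schedules any sampler actually uses; the threshold of 200 is the refiner range mentioned earlier in the chat.)

```python
def switch_by_step_fraction(num_steps, fraction):
    """Index of the first refiner step when switching on step count alone."""
    return round(num_steps * fraction)


def switch_by_timestep(timesteps, threshold):
    """Index of the first step whose timestep is below `threshold`."""
    for i, t in enumerate(timesteps):
        if t < threshold:
            return i
    return len(timesteps)


n = 10
# Evenly spaced timesteps, high noise to low noise.
uniform = list(range(900, -1, -100))
# Hypothetical schedule that spends more of its steps at low timesteps.
nonuniform = [999 * (n - i) ** 2 // n ** 2 for i in range(n)]

print(switch_by_step_fraction(n, 0.8))      # 8
print(switch_by_timestep(uniform, 200))     # 8
print(switch_by_timestep(nonuniform, 200))  # 6
```

With evenly spaced timesteps the two rules agree, but on the non-uniform schedule a step-based switch at 80% hands over two steps later than the timestep-based one would.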
Look at civit rollout of cascade models, how was that ever going to make sense for sharing finetunes? lol
the a111 was really off yeh true
I think Diffusers is the only one that implemented it correctly

