#🆕|sd3
It does create extremely good loras fortunately 😉
looks good, it captures that three-dimensional, made-of-something look
I wonder which surgery is tomorrow
I do respect your taste, but is there a fine tuned model for more realistic human
Live-action version of Crayon Shin-chan
sure. use Duchaiten's pony models. pony no score is probably your best bet
an effort was made
@rapid pivot @amin_06894 it has! Did you get your vram yet? lololol
if only Santa were real
Is it just me or does Hunyuan Video do better hands than most still image creators?
I just use Mage lol
also sdxl on my own system
cool
it is so hard to learn SD
Thank you for using comcom analytics.
"comcom analytics" supports all community managers (moderators and server owners) by stats, visualization, and analytics.
If you have any questions, feel free to ask us!
Your dashboard
Help
Support server
Other languages
en: help
ja: help Japanese
"Couple holding hands on rural hilltop, watching apocalyptic sky filled with violent aurora borealis and magnetic storms, burning city in background, windswept landscape, dramatic lighting, 8K, photorealistic, cinematic framing"
doh
"MGM Grand Las Vegas on 11 hectares land area designed by Veldon Simpson, capturing the entire edifice in a single shot from a distance of 100 meters, through a soft-focus lens, bathed in warm Sunlight, modern architecture, rtx lighting, cloudy sky"
help
Posting my full findings soon, along with the relevant code additions for ai-toolkit and kohya_ss, but I am fairly certain I’ve discovered mass-scale misalignment of the text encoders across the most popular training tools. Here are some before/after tests from multiple character and style LoRAs with the exact same settings aside from the added parameters to ensure proper alignment of text encoders with the u-net. I know this is a large claim with huge implications. That said, I would not be sharing if I did not 100% believe this to be true.
will this affect flux
Yes. As a matter of fact, the bottom two rows on the first image are examples of improved training stability with Flux Dev.
While the other images highlight the more drastic improvements to SD3.5 Large training as a whole.
Going to test 3.5 medium and Schnell next but I need to finish documenting and get this fix out to the community today.
okay thanks
the text encoders aren't even in sync with themselves
Very true. While StabilityAI did document each of the three text encoders in its config when the model was first posted, it seems to have gone overlooked by the creators of these training scripts.
trust me, it wasn't overlooked. It's far more likely an issue of training an encoder is expensive and they just used what was there
You would be shocked. I am not training the text encoders at all. I am defining its parameters for proper alignment between the text encoders and the u-net. There is no noticeable difference in compute resources. Style LoRA training starts to take at lower steps and there are clear improvements with far less deformed features and better color depth.
From my tests this seems to be a universal misalignment issue. The results hold across various character and style LoRAs at different ranks, double-checked with both ai-toolkit and kohya_ss as well as 3.5L and Flux Dev.
part of it might just be ai-toolkit and kohya_ss - have you done the same tests with luca's dreambooth trainer?
he has a dreambooth for flux, and he has one for sd 3.5 large
I have never heard of luca's dreambooth trainer, and googling did not provide any concrete results. Is that the name of the github author?
oh boy do i have a bunch of things for you to explore :) https://replicate.com/lucataco this is his main repo on replicate. he's got all sorts of stuff there, including both of his dreambooth trainers
just scroll all the way to the bottom and slowly scroll back up, he's got tons of stuff
click on anything, and then look across the top, you'll find a link to it on his github repo
The full code doesn't seem to be shown and runs through a paywalled api.
While it is possible that this or any other individual user is taking the extra effort to define these parameters, I do think this is a known issue, and if it is a known issue that some are keeping secret behind paywalls,
that fundamentally goes against my personal views on the technology as a whole.
just click on the github icon on the top of the page and go to his github repo
the entire purpose of this is to make sure whether the issue is the trainers - ai-toolkit and kohya_ss - or if it's something else.
Yes, I understand. Sorry for the confusion I mostly avoid commercial services such as replicate and was not aware it was also posted on github. I will run some tests but I do not see it taken into account in the code itself.
dreambooth is stability.AI's trainer - so you should be able to see if it's the 3rd party scripts, or if it's something else.
Fully aware of dreambooth. Just haven't run it since SD1.5. Thank you, will look into whether this was implemented and, if not, make the required adjustments and re-run some tests. Still going to publish my findings so far today.
i've been using luca's implementations of it for the loras i train, it works well - so it should be a good way to test what you're seeing with the other scripts
dreambooth was actually published by Google. They've been reluctant to publish open source research since then because of what was done with the techniques they demonstrated
the idea is for him to run his tests on something other than just the two scripts that seem to have issues, and to determine whether it's those trainers - or if something else is going on - for that, dreambooth is a good option
I don't think reference code for it was ever published by stability ai either. It was just a community member that published code to github. Huggingface has been the biggest source for reference code afaik
Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) by way of Textual Inversion (https://arxiv.org/abs/2208.01618) for Stable Diffusion (https://arxiv.org/abs/2112.10752). Tweaks focuse...
few points on your google-fu "gotcha" attempt.
- is not reference code
- is not maintained by stability ai
- was published before Penna was hired by Stability
- Joe don't work here no mo.
Hope i don't get banned for disagreeing with you on something you're wrong about.
Dreambooth is probably not even what the guy should use. LoRA would be a better approach for their needs. But i'm not commenting towards that. I just correct things when I know better
all i did was give you another repo to look at, where someone else had done some implementation, in case you might have been interested. cripes you have a bad attitude. and if you look at what he's doing, he's testing trainers.
history informs my interactions with you. In the past, while you were moderator of /r/stablediffusion, you kicked me from that server after similar disagreements here, on another server. You remember that don't you?
If i have a bad attitude maybe you should ask yourself "what have I done to this person?"
you earned everything that's ever happened to you - and probably a whole lot more.
don't start. i'm done talking to you.
oh okay. So you're sociopathic and can't recognize that others disagreed with your arbitrary kicks, leading you to no longer be a moderator of that server or subreddit. but alright. got it.
assume whatever you like.
not an assumption. conversation with the admins of that server and subreddit confirmed why you were removed. because of kicking me all those times for no reason.
there were valid reasons. however, assume whatever you like.
and just keep on posting
"valid reasons" being we had disagreed about something mundane on here and i usually back it up with facts.
when i asked the other mod there why i had been kicked repeatedly from that other server he looked into it and found no valid reason. I imagine when he asked you, you told him some sing song story and they disagreed with your validation. I doubt you told him "he was arguing with me on another discord server".
depends on which 'admin' you're referring to - no, there was no discussion
@viral plaza pinging you sorry. ... but... i mean... seriously.
https://discord.com/channels/1031106063837184021/1308975746529890344
This topic was the last time i got kicked from the /r/stable server. None of the kicks were ever explained to me. I only found out because i noticed the server icon was no longer in my list.
I dont like being gaslit like this didn't happen so i have to address it.
Another community i'm part of had a member talk to sandcheezy about you which was illuminating as well.
always nice to know people are spreading lies behind other people's backs. you do realize that alex rarely pays attention to any discord but his own? he's somewhat busy, you should try pinging him there.
I did. Thats why you got removed as moderator from their community. ❤️
lol, sd3.5 large lora easily training on 12gb gpu
under 8gb vram at 1024 with offloading 0.5
7.70s/it which is almost 3 times faster than flux
but it might converge slower, or I don't have correct settings yet
I am training at lr 0.002 💀
Since crystal wants to gaslight you about it here I'll go ahead and post for you and anyone else that cares:
Crystalwizard not only silently kicked nuuideas from the r/sd discord repeatedly, but deleted the logs of having done so from our internal mod logs. I only discovered this when nuuideas asked about it and I dug through the discord audit logs and managed to find this out. I asked about it at a time when crystal was otherwise active, they didn't reply within the span of about a day, and I spoke with the other mods and we mutually agreed to remove crystal from the team, as not only was this far from the first issue with their activity as a moderator, but also the fact that they were deleting data from mod logs indicated that past reported incidents we had no proof of were potentially true as well. After removal, Crystal left not only r/SD but other discords as well, without ever saying anything at least to me. I think they spoke to cheeze at one point after?
(Also to be clear, no, as best I can tell, crystal had no valid reason to kick nuuideas at all, they just had a disagreement on some random technical point and crystal would rather exert authority than let themself lose an argument on the internet or something)
that what happens when u add crazy ppl as mods
doesn't surprise me and probably nobody else who read messages from crystalwizard 😂
name change incoming, the truth is spreading
He mistreated so many people including me.
don't you know its assault if you disagree or have factual information
Thank you for taking time to address this. It seems a little insane that he's in full on denial mode.
happy 2025. 🥳
I wish discord would literally just make a person completely invisible for you when you block them. They did send me a survey asking if I like the block system on here and I suggested it. Hopefully this is a much needed change they will implement soon. I think most people would prefer if blocks worked this way.
He probably has the record tho on being blocked by most people. :))
by far
i've always had issues blocking people. can't follow conversations. him specifically tends to be very sycophantic , so many will engage with him when he has his behavior facade up. I just give people notes. I wish the notes would show next to someone's name though. Or could give people custom colors.
He is strange... sometimes he acts almost normal but then has this other side. There is something going on with him for sure.
It's rude to talk about someone in the 3rd person when they are right there. I just wanted to point that out. Not a criticism. 😉
probably unrelated but where is 4GB VRAM cat!???? XD
(for the new guys who don't know, there used to be an active user here named "Cat with 4GB Vram (send help)")
@bitter hearth allo
What's everyone's go to flux model?
base FP8 with turbo lora
shuttle 3
this i can answer for you tho
he lost his account 😢
he goes by @rapid pivot now.
awww

You're alive. ^_^
sad about your other account then.

I guess the 4GB cat didn't get their VRAM. :/
why u lost your acc? 
oh hai steven segal
hai there
Probably the same reason that i got kicked repeatedly from other servers. He argued some technical stuff with someone who has an army of sockpuppets ready to report them
still safe here aint no mfking redditor can touch me
i quit reddit long ago. also quit x too. not sure where to find good info feeds anymore. I still use those sites but i dropped the accounts
i just use 4chan,better guides there
i've never had any idea how to browse 4chan. its like in no order at all
yeah IDK how to navigate 4chan
apparently there is a lot of diffusion stuff on there
but I am not sure if it is good advice or not
yeah i have nothing against it. i just dont know how to digest info there
Same as the other 5

Phone number cancelled, not enough patience to go through recovery and blah blah
but why 5 phone numbers cancelled
can AUTOMATIC1111 use SD3.5 gguf?
well not all accounts were lost phone numbers
I don't pay for mobile data, sometimes I want one for emergencies or whatever 
so they get cancelled eventually
if discord gets stuck in login for me for whatever reason I just literally create another account its not a thing I care that much about lmao
im lazy 
photo realistic, a pretty woman (with dark red bob hair wearing a black suit with a dark red tie and a long black coat with long black palazzo pants with vertical red stripes) a determined look, standing in a white room, dynamic shadows, volumetric light, long exposure, sun rays, cinematic view, bokeh effect, fashion advertisement, Dior
Photo via Kodak portra 160,young pretty girl, 20 years old. She has blonde hair, blue eyes, pale skin. Split into four images, Shot of different angeles, white background --style raw --v 6.0
nice shadows
We need an easy simple, non-technical local Lora training like a FluxGym (Is there one?) and we need DreamShaper, Juggernaut, EpicRealism and Realvis versions of SD 3.5 or better in 2025 🙏🏽
100 steps of lion
RealVis is on the way!
https://huggingface.co/SG161222/RealVis_Medium_2.0b
has anyone tried it?
https://github.com/lehduong/OneDiffusion
i have tried it
its ok
is there a tiled controlnet for sd 3.5m?
I wanted to try it so much but there is no easy way to use it locally 
That runs with only 8gb GPU 😉
You can install and run your own civitai locally?! Unfortunately the lora training aspect won't work though. https://github.com/civitai/civitai
i did test it locally
its not worth it. omnigen is better
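A 3D-style image with red as the base tone; the greeting on it reads "Great luck and great prosperity" (大吉大利), surrounded by decorative icons: gold ingots, red envelopes, fireworks, plum blossoms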
Which one do you prefer?
2 cause i like kodak tones more than fuji tones. totally personal preference though. in the west we're all about the warmer colors
left looks like one of those disposable cameras from the 80s! Or you ran out of printer ink 😄
We are all art critics! 😄
Does anyone know any good sources for documentation for ComfyUI and Custom Node creation that I can feed into something like ChatGPT or a local LLM that will allow me to talk to it and have it help me either create my own custom nodes or at least help me put together better workflows for specific use cases?
comfy_ollama
if anyone here that still has respect for crystalwizard then here's some interesting reading for you to see on the comfy org discord server. #1319770970868945057 message
catch it before he deletes it
It says I don't have access to the link
I guess I don't have access to that channel for some reason
i thought discord would let you join servers through a link like that. it's the public comfy org server. i can't post the link here because they block discord invite links
DM'd you it
Thanks
So what is he getting at. I guess could I get a TL;DR of how this convo started. It seems like Comfy is now monitoring created nodes?
They have a public registry of nodes now. Anyone can apply to have a node on it. By default, manager will only install nodes that are registered
it's a step in the right direction. but he's flipping out because he has to change a few configuration files to install a custom node
you can still git clone directly into /custom_nodes/ folder
personally i find it very exposing of his absolute lack of expertise and professional decorum
He definitely has a strong opinion. I can see both sides of that coin.
It is a good direction for sure. but maybe there should be a time delay for some legacy nodes, or at least an audit and conversion timeframe for older pre-existing nodes.
This has been coming for a while. The registry wasn't just launched today empty
it's just fully deployed now
I wish I knew enough to be able to make my own custom nodes. I would take a shot at getting on the registry and see how it goes.
I think its a good thing.
I always complain about the security risks of having 5 dozen custom node packs installed, but what i truly hate most about it is the dependency problems. Nodes overwriting each other's dependencies in the virtual environment is an ongoing problem. This could help to alleviate that issue among others
It can also lead to a standard library. Something i herald often.
Ya this could be great for conflicting nodes for sure
in the past, there was already a list. nodes that were recognized by the manager. but now it's public and anyone can submit to it
much more standardized and tied to secure practices
the registry was in response to malware yeah
but it will also help with the dependency issues
Is it public? I can't find it.
www.comfy.org has a link but i'll dm you the invite @proven pecan
https://en.wikipedia.org/wiki/Quantum_mind
could use traditional search much easier. LLMs don't need to replace every single task. traditional software still excels in most arenas
The quantum mind or quantum consciousness is a group of hypotheses proposing that local physical laws and interactions from classical mechanics or connections between neurons alone cannot explain consciousness, positing instead that quantum-mechanical phenomena, such as entanglement and superposition that cause nonlocalized quantum effects, inte...
it's funny to me that people are heralding all the capabilities of gpt, like counting the letters in strawberry now, when a simple string operation could do that already for a half century
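For reference, the kind of string operation being alluded to is a one-liner. A minimal Python sketch (the word and letter are just the usual example, not anything specific from this chat):

```python
# Count occurrences of a letter with a plain string method; no model needed.
word = "strawberry"
letter = "r"
count = word.count(letter)
print(f"{letter!r} appears {count} times in {word!r}")
```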
LLMs are certainly a break through. I don't believe we're anywhere near AGI though. They may become more generalized in their use, but general intelligence they are not. It's all theranos level hype
thats the person on here that induced me to misgender Comfyanon
seems to thrive on faux expertise and flexing
what is it 
I dm'd you the link to it chaos. everyone should join comfy org discord server. there's shakers and movers there
also, mandatory linking needed now. https://www.youtube.com/watch?v=ZG_k5CSYKhg
Faith No More - "Epic" (Official Music Video) from the album 'The Real Thing' (1989)
🔔 Subscribe to UPROXX Indie Mixtape and ring the bell to turn on notifications: https://uproxx.it/mrln2hd
✅ Subscribe to the newsletter for weekly music recommendations in your inbox: http://indiemixtape.com
🎧 Stream the official Topsify playlist: https://lnk...
Full show notes: https://www.latent.space/p/comfyui
Happy new year friends! Thanks for all the love on the Latent Space Live and 100th Episode End of Year recap. Your support has boosted us 30 places in the Podcast charts, and that always helps us book great guests and organize more industry events for you! We don't say this enough but thank yo...
ai video im scared 
52 minutes of interview 
I have watched so many interviews these past few months why you do this to me 
The various llm can give vid summaries or cliff notes 😉
Those are amazing!
flux tried 😄
SD3 large gave it a try as well (don't count the fingers!)
Hello everyone
He was replaced.
glass rail samples
anyone knows flux finetune that makes it more unique? I want to get rid of this style as it become too generic
https://github.com/welltop-cn/ComfyUI-TeaCache this thing works on rtx3060 - 1.9x speedup
Contribute to welltop-cn/ComfyUI-TeaCache development by creating an account on GitHub.
Pff that's marketing too, in reality it's like this 🤡
A group of 8 realistic cats taking a selfie together. The cats have human-like expressions, they are all standing close together in a friendly pose, resembling a group photo. The background shows a blurred indoor setting with other people. The lighting is natural with soft shadows, creating depth and realism
Sd3.5 large, medium, turbo
you still stuck with just 4gig vram?
there is a medium turbo too on huggingface
made by tensor art
4GB of VRAM is fine you can run flux.1-lite-8B-alpha-Q3_K_S.gguf in headless mode
its 3.74GB so it will fit
only leaves .26gb for generation though. better off using an sd15 refine then
is not an issue
so long as you are running headless
your screen goes black while image is generating
and it works ok
LOL IDK if people would like this advice though, they might not like their screen going black
dog
Screens off? what are you, some kinda luddite hippy communist?
at least its not a blue screen (of death)
Cloud services or mage are so affordable now, don't need more than 4gb vram at home 😄 @rapid pivot
ye pretty much
He's a local cat, doesn't hang out in the cloud.
ah okay that's fine
Now try running that on a 2018 amd card
oh no

Both glif and mage.space are pretty awesome
I saved this comment and now I'm wondering what is left of it?
Haven't had time to experiment more and change things but I have shared my ai-toolkit configs and have had many others say they have seen improvements with the changes. I saw ai-toolkit was updated last week but I havent touched anything since I was getting such great results before.
I have been very busy with both work and getting my Project Odyssey 2 video finished before the deadline but I have uploaded an absolutely amazing 3.5 L negative detail LoRA that outshines everything I had done before (with no changes to the dataset or other settings) so I am convinced there is something there. Wish I had more time to dive in but have published my "best guess" at the time to the cause as an article on Civitai.
This starts as a SD3.5 base render and each frame is a decrease in lora strength by 0.01 (since its a negative lora) The video pingpongs back down to a loop.
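The frame schedule described above can be sketched roughly as follows. This only generates the per-frame LoRA strength values; the function name, the default end strength of -1.0, and the choice to skip the endpoints on the way back are assumptions for illustration, not details from the message.

```python
def pingpong_strengths(min_strength=-1.0, step=0.01):
    """Per-frame LoRA strengths: 0 down to min_strength, then back up toward 0."""
    n = round(abs(min_strength) / step)  # number of downward steps
    # Frame 0 is the plain base render (strength 0.0); each frame lowers it by `step`.
    down = [round(-i * step, 2) for i in range(n + 1)]
    # Return trip, skipping both endpoints so a looping video repeats no frame.
    up = down[-2:0:-1]
    return down + up

strengths = pingpong_strengths()
print(len(strengths), strengths[:3], strengths[-3:])
```

Each value would then be passed as the LoRA strength for one rendered frame with the seed and prompt held fixed.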
how do you make these?
Ok, thanks.
I will be posting a full article explaining the process and the various improvements I have found with what I call negative reinforcement training, and sharing my dataset, etc. I am currently spreading myself a little thin. The idea is to train on what you don't want and use the end lora with a negative strength value to force things to a conceptual opposite latent vector value.
I first found this by accident when training a SD2.1 textual embedding to try to make images I could feed into "point-e" by OpenAI to make 3d point clouds. But I had not gotten results as stable as this with 3.5 until the recent changes I have been discussing.
okay thanks
I was wondering what the conceptual opposite of high detail like this looks like 🤔
is it like something extremely smooth or blurry?
My first dataset was the seed images used for COCO CLIP R-Precision evaluations you can find it at the bottom of OpenAI's point-e github page.
The images resemble something between early CAD and 2000s video game renders with simple flat colors and minimal details. I find at times some pixel art or anime loras can also have somewhat of this effect to some degree and always recommend doing a negative test with loras you find in the wild because you never know.
okay thanks this makes sense yeah
funnily enough negative schnell lora does this a bit
I have been testing with the de-distilled versions of Dev and Schnell and have had some limited results so far but want to get them to a better place before sharing.
lol I love this dataset
its funny that it worked
yeah it could work on de-distilled
The Dev one I recently did works but only from 0 to -0.25 and then things just get crazy in the images. Ill prob share it soon but I got to get back to my PO2 todays my last day before getting back to the office Monday.
I'm printing and framing on my wall
maybe this would make a good negative lora for flux
I just put this in the ollama chat... I think this is an interesting conversation starter....
I kind of had a runaway thought.... hear me out.
So they say AI on regular computing with LLMs and such are way different than Quantum Computing, that is 100% true, however. Why not give the best AI access to Quantum computing data like RAG or a knowledgebase and see if AI can help advance Quantum Computing.
Its probably already been thought of but, I haven't heard anyone mention it so I figured I would put it out into the ether.
Any takers?
looks like nvidia's already on it https://developer.nvidia.com/blog/enabling-quantum-computing-with-ai/
keywords?
Hello, I have an issue with diffusion models on a new computer
it's with an RTX 4090, when testing it with flux-dev, it seems to take forever to generate an image, several long minutes
what do you think I might be missing?
"Bury the light"
"Stained, brutal calamity"
"several species of small furry animals grooving together in a cave with a Pict"
wow these are incredible, which model is this?
Obviously SD3.5L
okay nice
yes you
Patience 😉
for Flux Dev 1024x1024
RTX 4090 should be at least 3.6it/s for FP8 and 4.5it/s for SVDQuant
this is using sage attention and torch.compile for the FP8
with the Flux Turbo lora, that uses 8 steps, you should be getting an image in around 2 seconds
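The "around 2 seconds" figure follows directly from the quoted speeds. A quick sanity check in Python, assuming sampling time is simply steps divided by iterations per second (this ignores model loading, text encoding and VAE decode, so real wall-clock time will be somewhat higher):

```python
def seconds_per_image(steps, its_per_second):
    """Sampling time only: number of steps divided by iterations per second."""
    return steps / its_per_second

fp8 = seconds_per_image(8, 3.6)    # 8-step Turbo LoRA at the quoted FP8 speed
svdq = seconds_per_image(8, 4.5)   # same, at the quoted SVDQuant speed
print(f"FP8: {fp8:.2f} s, SVDQuant: {svdq:.2f} s")  # FP8: 2.22 s, SVDQuant: 1.78 s
```

Speeds reported the other way round (seconds per iteration) convert by taking the reciprocal, so 1.62 s/it is about 0.62 it/s.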
if you are getting slower than that then something is up and needs fixing
moss man
he comes from here
in that case, the keywords sent to the llm: alternate species, monsterification,hybrid,, mist and fire creature:, nature, anime ->
```
{
"T5": "In a mystical forest under a twilight sky, a hybrid creature emerges from swirling mist and flickering flames. Part beast, part ethereal being, it combines elements of various natural forms—sharp, clawed limbs intertwined with delicate, flowing plant-like appendages. Its body is enveloped in a cloak of smoke that dances like living flames, casting an otherworldly glow. The creature's eyes glow intensely, reflecting the fiery and misty elements around it. The scene is vibrant yet eerie, with deep greens and fiery oranges blending seamlessly, capturing the essence of both nature and monstrous transformation. The anime style renders this creature with exaggerated, fluid movements and expressive features, emphasizing its hybrid and fantastical nature.",
"CLIPG": "hybrid creature, mist, fire, anime style, forest, glowing eyes, plant-animal fusion",
"CLIPL": "A vibrant, anime-styled hybrid creature blending plant and beast features, enveloped in mist and flames, glowing eyes, amidst a mystical forest.",
"ARTSTYLE": "Anime",
"NEGATIVE": "photorealistic, mundane textures, dull colors"
}
```
i'm actually using all different parts of the clip
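A rough sketch of how an LLM reply in that shape could be split into per-encoder prompts. The key names (T5, CLIPG, CLIPL, ARTSTYLE, NEGATIVE) come from the message above; the function name, the output field names and the shortened example strings are hypothetical, and how the pieces get wired into the actual text encoders depends on your own workflow.

```python
import json

# Keys the LLM is expected to return; ARTSTYLE is treated as optional here.
REQUIRED_KEYS = ("T5", "CLIPG", "CLIPL", "NEGATIVE")

def split_prompts(raw: str) -> dict:
    """Parse the LLM's JSON reply and return one prompt per text encoder."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise KeyError(f"LLM reply is missing keys: {missing}")
    return {
        "t5": data["T5"],            # long natural-language description
        "clip_g": data["CLIPG"],     # short tag-style prompt
        "clip_l": data["CLIPL"],     # one-sentence summary
        "negative": data["NEGATIVE"],
        "art_style": data.get("ARTSTYLE", ""),
    }

# Shortened, made-up reply in the same shape as the one posted above.
example = json.dumps({
    "T5": "A hybrid creature emerges from swirling mist and flame.",
    "CLIPG": "hybrid creature, mist, fire",
    "CLIPL": "An anime-styled hybrid creature in a mystical forest.",
    "ARTSTYLE": "Anime",
    "NEGATIVE": "photorealistic, dull colors",
})
prompts = split_prompts(example)
print(prompts["clip_g"])
```

Failing loudly on a missing key is a deliberate choice, since LLMs do occasionally drop a field or return malformed JSON.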
thursdays
First output from Cosmos
Cosmos is video model? Looks wicked tho
its a "world" model, but yeah it creates videos. seems neat
frozen bolt. not very world accurate but i LOVE the effect
Another one
A bit janky
yeah but it's free jank so yippee kiyay
Yeah idk why they named it a world model when its architecture is just for video gen. Hunyuan is clearly much better in t2v but Cosmos is faster and control is pretty nice for sure, you can have a 9-frame context and you can input multiple images in middle/end/beginning.
Hello, everyone. Any fresh updates from the SAI team? I’ve noticed that SD 3.5 turned out to be a largely underwhelming model, with very little community activity on Civitai.
personally, i don't think it's underwhelming, but with the release of flux a few weeks earlier, people lost interest really fast
I'm old enough to have actually heard that particular Pink Floyd song.
i haven't heard it either, it was a prompt someone suggested to test on my llm setup
It's the title of a VERY old Pink Floyd song.
ahh good Syd Barrett Floyd
thank you for the idea, gonna try some Astronomy Domine prompts
3.5 absynth finetune, not bad
What is the best way to create photorealistic images with SD3.5? My experiments so far are giving me plasticky/cartoony photos. Any ideas would be most appreciated. (Particularly with the Turbo model.)
Some photorealistic Lora is the best choice probably, I don’t really like their turbo model, flux schnell seems better in terms of realism/overall and sd3.5 turbo lacks detail too imo.
This is what I got with sd3.5 turbo: “Polaroid, amateur photograph, a woman”
Not too shabby but you can clearly see hair is weird and white borders.
Above is 4steps, this is Schnell with 1step only and looks much better(although white borders still there)
turbo seems to be smooth but normal version is just on another level of realism
computer virus
is it better than hunyuan tho?
i would not know, i never make vid
Just asking coz hunyuan is pretty tremendous for a local model so is it worth looking at other models for now...
if hunyuan gets image to vid thats like game changing
where can i train sd3.5 Loras ....you cant do it in KohySS so far as i know?
Cosmos VAE is a lot better yeah
computer worm
OneTrainer can do it - very low requirements. However, SimpleTuner should be better as the creator is still messing with sd3.5m
Hunyuan is miles better at text to video in quality. Cosmos does have a few benefits like 7b one is a bit faster, and vae is more efficient like Neon said. Also has image to video which is very useful.
Quality-wise, hunyuan is much better and comparable to closed source stuff, while cosmos is considerably behind at img2vid, text2vid. But it’s pretty controllable at least.
It is the best open img2vid so far right now
I'm not saying the Cosmos VAE is more efficient I'm saying its higher quality
Still holding my breath waiting!! 😄
Civitai. About $2 per.
Can it be used with other models?
sadly using a VAE on a different model never rly works
it would have to be retrained
a long sword, simple color, no one, game icon, 2D animation style, white background
R1 just kicked Marco's a$$ 😍 it's insanely good
I'd have to get the 7b r1 for fair comparison tho..
I did and it's doing a lot better.
yeah, i suppose it does. but marco o1 wasn't bad for it size at all
Yeah, I've been using it since it was released, and the CoT is really good.
rofl xD nice 4th of july
Good stuff 👌. Sd35 is probably the most creative model since pixart
it's the smartest by a far shot
Yeah. Flux becomes limited very fast unfortunately
Hi, I haven't tried flux for a while, are there some good light models for 12 GB VRAM? I got OOM when running Flux
I run Flux Dev on 12GB without issue.
This encouraged me to try it again, but it still stops my youtube video from time to time
use quantized version of model and encoder to save vram and ram
q4 is the lowest you can go, I think. And it looks fine still
I am running it at 12gb too, also able to use controlnet
svdq can give ~3x speed improvements with the quality of q4 but it has the worst comfy integration. I only managed to make it work outside comfy
on the other hand, there is TeaCache, which can give a 2x improvement for flux dev in comfy; however, there won't be noticeable improvements at lower steps
uhm ok! that's why I asked about improvement in the models for 12 GB
so flux dev quantized?
And I didn't know encoders were quantized as well
I was using the ones that came out when it was released
I don't entirely understand the question.
The Flux we have is distilled from the full model, but it can be further quantized in a smart way to reduce memory requirements without too big of a loss.
also, there is an 8b parameters variant of flux but I am not sure if it's worth using over sd3.5l for now
I mean which models, like these ones? https://civitai.com/models/647237/flux1-dev-gguf-q2k-q3ks-q4q41q4ks-q5q51q5ks-q6k-q8
yes
If you don't want to suffer heavy quality loss on the quantized flux models you can use Flux dev with wavespeed and xformers on comfy for a smooth 1.62s/it
its what i do and i use 12gb 3060
xformers needs to be installed separately?
1.62s/it sounds super good
can do --xformers in command line arg but i installed manually
but wavespeed node is what makes it fast
with a 3060 you won't be able to use the compile+ node, only the block cache
Thanks!!
I will try it
np
I use it on Ubuntu
but ty
and I have AMD card
AMD card specs
RX 6700
But what about this?
Some realistic 1step gens with Flux Schnell, no loras or anything.
what happened to the witcher man
no one played his games anymore, had to retire x.x
Winnie the chinese president lora
So you guys know about tiananmen square regardless of this hush vee pee enn thing right?
google it lee!
You almost did it
Give it another try
😉
Trump's got you
It's now or never
Ceausescu PCR
I grew up in Romania under that fker. If we can do it so can you.

Alo Alo Beijing.
gguf does support lora
Why no tibetan flag emoji? 😭
someone is trying to fine tune sd3.5m to make it look like Illustrious : https://huggingface.co/AngelBottomless/Illustrious-sd3.5m-fails
i've never seen these finetunes on DiT models
next Pony is DiT
on auraflow
things will change a lot when GB200 NVL72 comes out
its gonna unlock quite a lot of new abilities in terms of training
the biggest issue with training is that there is clearly a benefit from large batch sizes in terms of training quality
but to use very large batch sizes at high speed is not easy- the issue is the communication between the GPUs
GB200 NVL72 goes a long way towards fixing that because it puts 72 big Nvidia machines in one pod
I think the issue with anime models is dataset quality though rather than compute, at the moment
but how you build a high quality anime dataset out to the tens of millions of images I do not know
its 1000x easier with photographic stuff because to some extent all photographs above a minimum quality level are useable, whereas anime has to be very specific styles and content
have there been many changes in the last 3 months? i didn't really follow the news
i can't seem to find improved pictures in any of the channels here sadly
@bitter hearth I didn't know that, thank you for the explanation. I'm really looking forward to seeing these finetunes come to DiT models, because the prompt understanding they offer will be a game changer. I don't really like the tag system, I prefer to use natural language
yeah that would be the big advantage of a DiT anime model, the prompt following
Janus-Pro-7B is nice
bear in mind it's 384x384 and is not one of the fast types of autoregressive
So far, I don’t understand the hype around this model. But I’ll keep an eye on it to see what it might turn into.
fairly sure the hype is just over the brand name deepseek and not because people actually want a 384x384 autoregressive model
Sorry for the dumb question but this cheaper and efficient training method used by deepseek can help img2img models?
there is no new efficient training method
training on artificial data can help speed up training - PixArt is already doing that. I would still prefer large scale tuning though
Yeah kinda, this is a very, very cheap MoE diffusion model by Sony. It’s not bad for the size, but everyone still uses something like flux Schnell/hyper for speed
This is not a MoE architecture, but it also trains very fast (21x faster) and is better than normal DiTs at similar sizes. It’s more of a demo than an actual usable model but has a lot of potential: https://github.com/hustvl/LightningDiT
the svdQuant project seems abandoned, that's sad
I know they do a lot of engineering and optimization like training on fp8. It's not a new method, though.
you can optimize every training setup and everyone is doing that already, although some teams are definitely better in this than others
they had an update 6 days ago
not so stunned - most of the people that have been talking about it have also been saying that they're not impressed
Not stunned by superior quality. Stunned that they released it
It wasn't on anyone's radar
it was a surprise yeah
there's a new Lumina too
https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0
Now that's a nice surprise too 🥳
I think this might be really good yeah
It feels a bit like auraflow while trying a few prompts in their gradio space, in that it at times seems to be rather rigid in following prompts. fun 🙂
that was likely their previous model
yes the space is using Lumina-Next-SFT
The one linked from https://github.com/Alpha-VLLM/Lumina-Image-2.0 (http://47.100.29.251:10010/) should be the new one?!
it doesn't seem that impressive on its own in terms of quality. It could be nice to finetune though. Lumina is interesting, but it seems to struggle with complex poses from what I've seen like sd3 2b. It might need more parameters to learn properly.
yeah 2B is rough for DiT
Sana 1.5 just dropped at 4.5B or so
might have potential
Sana is different
it depends on the channel count for the vae and compression
there was some math I saw that's a bit over my head. The main takeaway I got from it is that a higher channel vae is harder to train with so it needs more parameters. Sana is also higher compression so it doesn't need as high of a parameter count.
the DOF ratio between the input vector dim size vs model dim matters a lot here
16(64) channel vae needs DOF larger than 32x at least
so you need at least 2048 hidden dim
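The arithmetic behind the 2048 figure can be written out. This is a sketch of one reading of the message: "16(64)" is taken to mean a 16-channel latent that is patchified 2x2, so each token carries 16 * 2 * 2 = 64 input values, and the "DOF ratio" is hidden dim divided by that per-token input size. The function names are made up for illustration.

```python
def token_input_dim(latent_channels: int, patch_size: int) -> int:
    """Values per token after patchifying the latent (channels * patch area)."""
    return latent_channels * patch_size * patch_size


def min_hidden_dim(latent_channels: int, patch_size: int, dof_ratio: int) -> int:
    """Smallest hidden dim that reaches the requested hidden/input ratio."""
    return token_input_dim(latent_channels, patch_size) * dof_ratio


if __name__ == "__main__":
    # 16-channel VAE, 2x2 patches, 32x ratio as quoted in the chat.
    print(token_input_dim(16, 2))     # 64
    print(min_hidden_dim(16, 2, 32))  # 2048
    # Same ratio for a 4-channel VAE, for comparison.
    print(min_hidden_dim(4, 2, 32))   # 512
```

Read this way, the same 32x rule would ask a 4-channel latent for only a quarter of the width, which matches the point that higher-channel VAEs push models wider.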
better VAEs are harder to train with yeah
its harder to get the DiT training to converge
dice?
Please draw a picture with the whole screen full of smiley-faced oranges and apples
look at the timing - it was released as a repost after the tiktok ban, along with a massive amount of hype, targeted right at where it hit - the stock market
not that it did much in the long run but it sure did damage for a few hours
I could tell when people hadn't tried the demo, for Janus-Pro-7B
the only fake hype was openai lying about inflated costs to train LLMs just to milk investors. that's why the stocks went down, Sam Altman's bs finally caught up to him
I think Deepseek themselves did not hype Janus-Pro-7B yeah
I think some journalists found it and wanted to make an exciting article
the Janus-Pro-7B paper makes it clear several times that its not going to be amazing image quality and that its just a base for future models
its only 384x384 after all
if you want nice autoregressive image model that is out now you can use Infinity, Switti or the CoT version of Show-O
i liked the image classification or whatever you call it on the 1b, but dont really need it
yeah the image understanding was what it was really about
it improved that a lot, for that class of model
these are the other models I mentioned if anyone wants to try them https://huggingface.co/FoundationVision/Infinity/tree/main https://huggingface.co/yresearch/Switti https://huggingface.co/ZiyuG/Image-Generation-CoT
they look good already, in their current form
Hello, I'm having a hard time running Stable Diffusion Large on an A4500 with 20GB VRAM, it always runs out of VRAM. If I use fp8 it is runnable, but how do I run it without quantization? I heard some people are able to run it on even smaller VRAM.
shopping for groceries in 2026
scam link
Anyone has info about Stability, how are they?
pretty good
would i say
the best you can do is to add ''Detailed environment'' or ''Detailed background environment'' to the prompt, that should give you a high quality image every time
They're stable.
Hope they're diffusion too
is getting replaced by the james cameron emoji
absolutely beautiful

Apple, engraving in the style of Dürer
I forgot where to put bbox files in comfyui. Can anyone help me with that?
ComfyUI/models/ultralytics/bbox
https://www.deepl.com/en/translator this is an excellent translator
Really happy with my latest 3.5M fine-tune!
please generate a picture showing a house
please generate 4 pictures with a Tiger
generate 4 pictures to Halloumi cheese
please generate an image of apple
The lumina model is okay.
The point is that it's Apache-2, but unlike auraflow it's much smaller
hi, how does it compare to sd3.5m in speed and quality?
please Generate a living white shrimp with the cephalothorax organs clearly visible
being smaller is also an advantage because you can train it more easily
i can't generate more images because i am no longer on an ai computer, i am just on a steamdeck now
yeah I hope we'll get a painting lora or something
read the information in #artisan-faq <--- that channel
Roughly similar quality to sd3.5m base model(not a finetuned one like absynth though). Cool thing is that you can add system prompts though.
I heard that it is a lot slower than sd3.5m, is this true?
yeah it's decently slower than sd3.5m right now but I believe that's just an optimization issue.
SD3.5L and Lumina 2.0
I'd love a bigger lumina 2.0. Its style and details fall a bit short, but its prompt following (not apparent in such simple prompts) is really next level. As it is now, I think it'll just be an interesting curiosity.
It doesn't even have to be as big as 3.5L. 3B could be pretty nice. I'm pretty sure sd3 and flux are kinda inefficient since they use fused transformers. I know someone who wants to finetune lumina 2.0 as well.
they might add extra parameters to lumina and train like that
thanks. Seems like comfy support already there 
Flux. LOL
Generate a image with a blue bird, that has three claws on its wings, it is flying in the sky.
"cinematic wide shot, 21:9 aspect ratio, 1920s Jinan Railway Station, steampunk atmosphere, baroque architecture with Chinese elements, crowds in Republican-era clothing, steam locomotive emitting smoke, golden hour lighting with volumetric rays, Kodak Ektachrome film simulation, intricate historical details, hyperrealistic textures"
did stability ever release the creative upscaler?
don't remember hearing anything about that releasing
daim
"A beautiful and enchanting humanoid nine-tailed fox spirit from ancient Chinese mythology, blending human elegance with mystical fox traits. She has long, flowing silver-white hair with golden highlights, and her eyes are sharp, intelligent, and glowing with a magical aura. Her nine luxurious tails fan out behind her, shimmering with ethereal energy. Her face is delicate and serene, with a hint of otherworldly charm. She wears a traditional Chinese robe adorned with intricate patterns, standing gracefully in a misty, ancient landscape surrounded by bamboo forests, flowing rivers, and distant mountains. The atmosphere is dreamlike, with soft moonlight illuminating the scene, evoking a sense of mystery and fantasy."
You need to read the information in #artisan-faq
The back of a woman on a cliff.
uhm ok...
A glowing portal to another dimension, starry sky background, sci-fi style, dimensional gate, city night view outside the gate, fantasy world inside the gate, 4K HD, cyberpunk style, glowing particles, futuristic, --v 5 --ar 1:1
Whatever happened to SD 3.5 medium controlnet?
oh they came out
you can use them now if you want
they made a turbo lora as well
Original SD3 (left) vs SD3.5 Medium (right), on the "Juggernaut XL Model Card Lady Prompt". 30 steps / DPM++ 2M SGM Uniform / CFG 4.5 / same seed for both. Eyes are focused oddly in the original SD3 one but the overall image is aesthetically way closer to what I'd want
It's a somewhat unfortunate recurring trend I've found after using both for a while now
SD 3.5 Medium is definitely compositionally way more coherent for photographic gens (moreso for non-closeup full body stuff) but it trends far more towards a sort of fake airbrushed look aesthetically than the original SD3 did
A model with OG SD3 aesthetics but 3.5 Medium coherence would be essentially perfect lol
SD3M was more photographic yeah
Prompt:(minimalist logo design), (granular texture), (fading gradient), (data visualization elements), muted color palette,
clean background, geometric shapes, symbolic metaphor, (calm and rational mood), high detail, 4k, Negative Prompt:complex patterns, 3D render, glossy effects, neon colors, handwritten fonts, chaotic composition
Sd 3.5 medium is more artsy
where are medium controlnets !?
sdsad
prompt:A hyper-realistic portrait of a young man with delicate facial features, holding a cup of coffee in a cozy café. His hands are elegantly positioned, with natural-looking fingers and realistic skin texture. A newspaper with the headline "AI Revolution" is visible on the table beside him, with sharp and readable text. The café background has warm lighting and a blurred effect for depth.
Here’s your image
PROMPT: Create hyper-realistic background image designed for use in a video. The scene features professional-grade lighting with a warm, inviting atmosphere. A sleek modern desk is positioned to the right, camera left angle, complementing the overall aesthetic. The background has a blurred yellow neon effect, adding depth and cinematic appeal. The composition is clean, with no people present, ensuring a seamless integration into video production.
prompt : an image showcasing how there is no image generation bot active in this channel
sd3.5-large
#artisan-1 and
looks good. My own results with sd3 large so far are rather disappointing. What's the trick to make it look so good?
sd3-l or sd3.5-l?
3.5L, sorry
SD 3.5 large
what all are you doing with it so far? i give each of the encoders their own prompt, written to what they use best - and a fairly simple workflow.
you're welcome to play around with it if you want to
prompt : beans
I already do that. So far the results are just inferior to Flux, though. Last time I tried to make a dnd character with it and while Flux gave diverse results sd3.5l gave me the exact same face all the time (exact opposite of what people say)
also, I thought the big advantage of sd3.5l would be negative prompts - but it has issues with them
flux does fantastic: women, dogs, animecat girls, and fantasy. if that's what you want, it'll do it. you want something else? you're not going to get it to do what you want without a huge battle and probably resorting to breaking it
it's also good with complex prompts
but I'm still experimenting. Haven't found the sweet spot for sd3.5 yet
it worked better in the official demo on huggingface than in Comfy and I am not sure why
Replace the background of these shoes so that they are sitting on a wooden floor
prompt : beans
prompt : no generation robot available here, check out #artisan-faq
I didn't like the subject matter but I liked the technicals a lot
this is rly impressive
the way that green guy flies up at the start is good
its hard to make the video models do movement directly towards camera
circle0624
A long-haired girl leaning on a mailbox, standing on a busy 1940s Shanghai street, with a few pedestrians walking and vendors setting up stalls on both sides, grayscale, high resolution, slightly blurred background.
Are you making assets for a game? Because a lot of these look like they would fit right in with a puzzle game.
bear
got examples?
First try. Amazing.
The physics lighting is amazing. Very UFO.
Tried to make a fox playing with a butterfly, but forgot to change my prompt. "A plasma orb UFO full of lightning hovers slowly above a rustic barn at night. The light from the plasma orb UFO illuminates the scene with a silvery glow. The plasma orb UFO flies away in a flash."
Poor guy. Zapped by a butterfly ufo.
lol this is amazing
1
A close-up of a glowing, fiery Sun with bright orange and yellow flames swirling on its surface. Solar flares shooting out, creating a mesmerizing effect. Space in the background with small distant planets visible.
A serene and introspective scene of a young adult sitting cross-legged on a cozy bed in a softly lit room. The person is holding a leather-bound notebook in one hand and a pen in the other, deeply focused on writing. Their expression is thoughtful, with a slight smile, as if recalling vivid dreams. The room is warm and inviting, with soft morning light streaming through sheer curtains. A cup of steaming tea sits on a nightstand nearby, and a few books are scattered on the bed. The atmosphere is peaceful and reflective, emphasizing the act of self-discovery and mindfulness. The art style is realistic with soft, dreamy lighting, capturing the quiet beauty of the moment.
Intricate dragon and phoenix embracing a candle flame, traditional Chinese ink painting style, gold and crimson colors, flowing ribbon with company name
I've assumed that was the case mostly
let me tell you though, from the perspective of someone who has actually very very very extensively tried to train Loras, originally for SD3 a bit and now more recently for SD 3.5 Medium (and still is)
it's not, in fact, "easy to train" in any way shape or form relative to SDXL (or Kolors)
"easy to train" would mean I could mindlessly use the exact same UNET LR 1.0 / TE LR 1.0 / Cosine Scheduler / Prodigy optimizer settings for literally any dataset and they would be 100% guaranteed to produce desirable results every single time without fail no matter what (as is the case for all UNET based models)
and it'd also mean that the extremely annoying exploding gradient thing wouldn't be a problem that existed at all (as it also wasn't in any way for UNET-based models)
TLDR Basically from the perspective of an enduser / "finetuner", DiT as a general architecture seems rather flawed in all honesty, as in practice you only notice the numerous blatant downsides (can't do normal hi-res-fix in the way people have come to expect, is limited to very mediocre sampler / scheduler combos, and so on and so on and so on), you do not notice any of the upsides of the architecture that (supposedly) exist
SD 3.5 - not SD 3
SD3-2b-medium was released as an unfinished beta - it is missing a lot of the fine tuning that normally goes into a model, and was only released to assure the community that SAI is still interested in being open source
SD 3.5 has all the fine tuning, and we worked very hard to make sure it was very easy to train
I think you missed my overall point
e.g. left / first is stock Flux Dev, middle / second is stock Kolors, right / third is Kolors with a photo Lora I trained (on the same seed)
the Flux output is arguably significantly less sensible composition wise, and certainly the ONLY place it has any kind of real advantage is in being rendered with a 16-channel VAE
a hypothetical Kolors with a 16-channel VAE would make all versions of Flux and all versions of SD 3 / SD 3.5 look like absolute jokes comparatively speaking in terms of output-quality-to-overall-ease-of-use-and-resource requirements
a company that makes a model that in practical terms functions EXACTLY like SDXL, but with a 16-channel VAE and a better text encoder
WILL make a kravillion dollars day one
is all I'm saying
NAIv4 apparently uses a 16-channel VAE, the same one FLUX does
naw, people are in ruts. if they want something that's exactly like SDXL... they'll just use SDXL
(additionally you don't want to see the SD 3.5 Medium outputs for this prompt because it's almost completely impossible to not have her fingers be melty noise weirdness)
I like the photographic realism of all versions of SD3, comparatively to Flux
but the weird, weird noise issues it has even in 3.5 are just super annoying
but again it's not really about SD 3.5 in particular
it's about the supposed advantages of DiT as an actual architecture not actually being visible in any way in any model that anyone has ever released
in practical terms, at least
re-read what I said, I guess, I don't think you got my overall point
which is that ALL DiT models have perceivable downsides relative to UNET models
but no real perceivable upsides
and then less so my point was that (as DiT models go) SD 3 / 3.5 (both) are "explodier" than others
by a lot
of course they do. but you said "a model that in practical terms functions EXACTLY like SDXL, but with a 16-channel VAE and a better text encoder, WILL make a kravillion dollars day one" and i'm saying that people are in ruts. those that like sdxl will just use sdxl. and everyone is already getting into a rut with the other shiny toys. by the time someone does that, and no one's likely to now, no one will even look at it
I agree the community is full of "dragon chasers" who constantly wait for the "next new thing" while basically barely trying new things that are actually released
People are still making models and custom nodes specifically for SD1.x/2.x architecture. People will always look at different models.
i'm just saying like, from a practical third-party training perspective and inference perspective
absolutely no extant DiT model actually has any architecture-specific advantages that are visible regardless of what advantage they might have on paper
in particular the supposed better support for multi resolution seems like absolute fiction in practice
because the whole image just getting ugly artifacts even if the composition might be perfect, when going outside the trained resolution range, is far less easy to "fix" than just re-rolling a seed if you get like an extra foot or something, and just generally way less preferable as an outcome
and also because you could already just go ahead and train UNET models at whatever res you wanted, even if it was beyond their original training res
basically what I meant was, assuming people WOULD actually use it, hypothetically, the practical manner in which SDXL functioned from an inference and training perspective was completely perfect
so the "perfect model" would in theory be one that was not in any way different in those regards
but just had a better VAE and stronger text encoder
the officially supported samplers for DiT are a notable pain point too
absolutely nobody would ever use Euler SGM Uniform or DPM++ 2M SGM Uniform if they didn't have to
because they're just really not very good in comparison to e.g DPM++ SDE Normal or DPM++ 3M SDE Exponential or what have you
ComfyUI hacks to make the Ancestral ones work help a lot in that regard but it's still not perfect
so that just again seems like a design flaw
try bongsampling
SD35 medium
I'll look at it
how's speed?
Anything in the multistep menu (res_2m, 3m, deis etc) runs at the same speed as euler
Res_2s is the same speed as the old dpmpp_sde and is pretty special with bongmath on
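The speed comparison above comes down to how many model calls a sampler makes per step. The sketch below assumes, and this is an interpretation rather than something stated in the chat, that the number in names like `res_2s` or `res_3s` is the number of stages per step and that each stage costs one model call, while multistep samplers reuse earlier outputs and cost one call per step like Euler.

```python
def model_calls(steps: int, stages_per_step: int = 1) -> int:
    """Total model evaluations for a run, assuming one call per stage."""
    return steps * stages_per_step


if __name__ == "__main__":
    steps = 30
    print("euler / multistep (res_2m, res_3m, deis):", model_calls(steps))
    print("res_2s (assumed 2 stages):", model_calls(steps, 2))
    print("res_3s (assumed 3 stages):", model_calls(steps, 3))
```

Under that assumption a 2-stage sampler costs about twice as much per step as Euler, which would be why it is described as matching the old dpmpp_sde.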
one other thing I should mention (again, as one of the few people I think who has actually very painstakingly trained the same datasets over and over and over again on both the original SD3 and SD 3.5 Medium just as I'm the sort of person who actually enjoys fiddling with this sort of thing)
they have BIG issues picking up the likeness of single subjects with any remotely obvious training settings
which is what most people are going to check first
it's not impossible to get good results, but you're literally basically limited to the CAME optimizer (AdamW and Prodigy seem to be total dead ends for single subjects for reasons that aren't really clear to me at the moment)
and also training as Dora instead of Lora (at a low "factor", no higher than 2 - 4) with 64 Dim / 32 Alpha is pretty much a necessity in particular for photographic datasets as far as single subjects go
and lastly due to the annoyingly-rigid-and-not-actually-better-in-any-visible-way manner that resolution works in DiT models, to avoid artifacting you kinda have to (I'm referencing SD 3.5 Medium again specifically, here) train at a BASE resolution of 1440x1440 with images that are all equal to or higher than that resolution in the first place and bucketing enabled to sort them properly
simply to avoid severe degradation of base model knowledge
figuring out literally all of that by myself by training the same lora about a zillion times over was the only way I was eventually able to get this pretty accurate Sydney Sweeney likeness, for example
the overwhelming majority of people will never go to the lengths I did, they will just immediately throw a model in the garbage if it doesn't perfectly and predictably learn the likeness of XYZ single-subject with very obvious default settings with absolutely no potential for "exploding gradient" whatsoever (as is the case for all UNET based models)
so that's again what I really meant by "easy to train", no DiT model comes anywhere remotely close in that regard (not even Flux, because degradation of base model knowledge is still a huge issue there and you also typically need about 2x as many steps as any UNET model did to get good results)
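For the "64 Dim / 32 Alpha" setting mentioned above, it may help to see what alpha does. In the usual LoRA formulation the low-rank update is multiplied by alpha / dim before being added to the base weight, so 64 / 32 halves the update. The sketch below shows plain LoRA only, in pure Python on tiny matrices. DoRA additionally splits the weight into magnitude and direction, which is not modeled here, and the function names are illustrative rather than taken from any trainer.

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Scaling applied to the low-rank update in the usual LoRA formulation."""
    return alpha / dim


def apply_lora(weight, down, up, alpha):
    """Return weight + (alpha / dim) * (up @ down) for small list-of-list matrices.

    weight: out x in, down: dim x in, up: out x dim.
    """
    dim = len(down)
    scale = lora_scale(alpha, dim)
    merged = []
    for i, row in enumerate(weight):
        new_row = []
        for j, value in enumerate(row):
            delta = sum(up[i][k] * down[k][j] for k in range(dim))
            new_row.append(value + scale * delta)
        merged.append(new_row)
    return merged


if __name__ == "__main__":
    print(lora_scale(32, 64))  # 0.5
    base = [[1.0, 0.0], [0.0, 1.0]]
    down = [[1.0, 1.0]]    # dim = 1
    up = [[2.0], [0.0]]
    print(apply_lora(base, down, up, alpha=0.5))
```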
nice, i'll have a look
cool gens btw 
Currently have RK-Sampler, so I'll give this a shot too. Thanks for the link.
guide image
output (WF embedded)
they will take longer as you go from stuff like 2s to 3s to 4s, but the quality will sometimes go up spectacularly
with medium i like using stuff like res_3s and res_5s
adding a bongmath implicit step will make it take longer but can also really improve things
I totally agree with most of what you are saying
giant 16 channel Kolors would be amazing
its easier said than done though
a lot of the recent models are designed around what is easy to train rather than what is good to use once it is finished
and so models are being made to handle less and less variance over time
this does allow them to train easier and get bigger, but then when you use them the sampling is more restricted
👍
/generate
y'all are blaming the wrong thing. It's a problem with the 16-channel VAE. There are actually downsides to using it, with the most notable being that it bloats the needed parameter count for the model to learn effectively. It's why Flux is so big. It's why sd3 medium learns so slowly and has mangled anatomy.
lumina 2.0 also use flux 16 channel vae and it is 2.6B
and it also learns slowly. It's not as bad as sd3 though since it uses more efficient arch
Also more efficient than Flux, but better arch isn't enough to make up for Lumina's size
I heard Lumina team is going to release a bigger version though eventually
The efficient arch being that it doesn't use fused transformers, which sd3 and flux do use btw
Flux also wastes 3b on encoding timestep embedding
VAE channel count has nothing to do with model size
you only have it in the input and output, that doesn't matter
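The "only in the input and output" point can be checked with a parameter count for the two layers that actually see the latent channels. This is a sketch with assumed numbers: a hidden size of 2048 and 2x2 patches are picked purely for illustration, and the layers are modeled as plain linear projections with bias.

```python
def io_projection_params(latent_channels: int, patch_size: int, hidden_dim: int) -> int:
    """Parameters in the patch-embedding and final output projections."""
    token_dim = latent_channels * patch_size * patch_size
    embed = token_dim * hidden_dim + hidden_dim  # input projection + bias
    unembed = hidden_dim * token_dim + token_dim  # output projection + bias
    return embed + unembed


if __name__ == "__main__":
    hidden = 2048  # hypothetical width
    four_ch = io_projection_params(4, 2, hidden)
    sixteen_ch = io_projection_params(16, 2, hidden)
    print(four_ch, sixteen_ch, sixteen_ch - four_ch)
    # Extra parameters as a share of a hypothetical 2B-parameter model.
    print(f"{(sixteen_ch - four_ch) / 2e9:.4%}")
```

This only covers the direct parameter cost. It does not settle the other claim in the thread, that a higher-channel latent is harder to learn and so benefits from a wider or deeper model.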
it might be true that training on a larger vae takes more time, though, as it preserves more fine details which are often hard to learn. But I don't think that this is the reason why models take more time to train
I mean, the main reason why Flux is taking so much time to train is probably that it is not a CFG model
gonna try these now
would you say CFG for the SDE-alike ones kinda "scales" in the same way it originally did? Like for example I would usually run DPM++ SDE GPU Normal at around CFG 5.0
or DPM++ 3M SDE GPU Exponential at around CFG 4.0
started with CFG 5.0 with 2S
latent preview looks pretty good so far
I'd say it's actually slightly faster than an SDXL gen with DPM++ SDE at the same resolution / step count FWIW
using 3.5 Medium
at least on my machine
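For anyone unsure what the CFG numbers in this exchange control: classifier-free guidance runs the model twice per step, with and without the prompt, and extrapolates from the unconditional prediction toward the conditional one. A minimal sketch of that combination step follows, using plain Python lists in place of tensors.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 1.0]
    print(cfg_combine(uncond, cond, 1.0))  # scale 1 returns the conditional prediction
    print(cfg_combine(uncond, cond, 5.0))  # higher scale pushes further along (cond - uncond)
```

Because this needs two model calls per step, a model run without real CFG (as Flux dev usually is) behaves differently at both inference and training time, which is the distinction drawn a few messages earlier.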
I guess the gist of my point was again it doesn't really seem in practice like any existing newer model is "better" specifically because of being DiT and having XYZ more parameters than any given older UNET model
the improved text encoders and higher quality VAEs seem to do the overwhelming majority of the heavy lifting
and then there's various factors that come off as straight-up regressions in practice with DiT
like the whole "the image just immediately begins to artifact randomly when you go outside the training range" thing
so if your max is just 2MP like on SD 3.5 Medium you kind of have to train loras at that resolution to begin with just to have at least some hi-res-fix headroom when coming up from a generation at a more standard lower res (because the artifacting problem doesn't happen in reverse, e.g. it can scale down fine seemingly, just not up)
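To make the "1440x1440 base with bucketing" advice concrete, here is a sketch of aspect-ratio buckets that all stay within the pixel budget of the base resolution (1440 * 1440 is about 2.07 MP, in line with the 2MP ceiling mentioned above). It is a generic illustration, not the algorithm of any particular trainer, and the step size and ratio limit are arbitrary choices.

```python
def make_buckets(base: int = 1440, step: int = 32, max_ratio: float = 2.0):
    """Build (width, height) buckets whose area never exceeds base * base.

    base should be a multiple of step so every side stays aligned.
    """
    budget = base * base
    buckets = set()
    width = base
    while True:
        # Largest step-aligned height that keeps width * height within budget.
        height = (budget // width) // step * step
        if height == 0 or width / height > max_ratio:
            break
        buckets.add((width, height))
        buckets.add((height, width))  # portrait counterpart
        width += step
    return sorted(buckets)


def nearest_bucket(img_w: int, img_h: int, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = img_w / img_h
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


if __name__ == "__main__":
    buckets = make_buckets()
    print(len(buckets), "buckets")
    print(nearest_bucket(3000, 2000, buckets))
```

Source images smaller than their bucket would have to be upscaled, which is presumably why the advice is to use only images at or above the base resolution.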
BongSample 2S definitely very nice
is there any particular one you recommend using for half-strength-denoise hi-res-fix passes? E.G. typically I would tend to do DPM++ 2M Simple at 0.5 denoise strength and the same CFG of like 5.0, for hi-res-fix on an image generated with DPM++ SDE GPU Normal
multistep 2M seems to work actually for hi-res-fix, sort of the same way
similar result (arguably better even) for the second pass with it, but a bit faster
yeah multistep is pretty good for when you want the image to stay similar
you might like some of the guide stuff too, that can help with that
i made a little summary of some of the new functionality here
includes ultracascade too so you might see a couple red nodes on the left side (https://github.com/ClownsharkBatwing/UltraCascade)
tricky because to use strong VAE well likely requires stronger model
DiT scales better with data and compute, and DiT has better self-attention
some of the issues in this conversation were more to do with rectified flow loss, and its possible to have DiTs that don't have that
e.g. Pixart Sigma or Flag DiT
nice, i'll have a look, thanks
i'm assuming "steps_to_run" isn't important right, for the beta sampler, just leave it at -1?
I don't think I understand the point of the SANA VAE, it's much slower and resource intensive than the Flux / SD3 ones but also worse in quality to my eye
its worse quality but its a lot faster and less resource intensive
it's not though, it's like a 1.5GB file
it uses more memory and the encodes / decodes are a lot slower
i'm talking about like
in ComfyUI
just the actual like, "write image to file" decode, when the image is done
was WAY slower with Sana when I tried it
than the same kind of decode with the 16-channel SD3 or Flux VAE is
and the initial load is a bit longer too of course because like I said the physical file is much bigger than the SD3 / Flux VAE files
oh I see
yeah if you don't include the diffusion time or the diffusion model vram then Sana is slower and more vram-heavy
to just decode a latent with the vae
yeah that's what I meant
the inference was pretty fast
but the VAE by itself in a vacuum is very slow
there are some niche areas of machine learning where you only use a VAE
so yeah for those it could be worse
Just say "image classifier" next time.
LOL
I read that some people use VAE encode/decode to store images
is a cool idea
although if you were gonna do that, the greater compression ratio of Sana might be good
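A quick back-of-the-envelope for the compression point. This sketch just counts values (pixels times channels versus latent elements), not bytes on disk. The 8x downsample with 16 channels for the SD3 / Flux style VAE and the 32x downsample with 32 channels for Sana's autoencoder are my assumptions about those models, and `latent_compression` is a made-up helper name.

```python
def latent_compression(height, width, downsample, latent_channels, image_channels=3):
    """Ratio of raw image values to latent values for a given VAE layout."""
    pixel_values = height * width * image_channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values / latent_values


# Assumed layouts, not read from any model config:
sd3_like = latent_compression(1024, 1024, downsample=8, latent_channels=16)
sana_like = latent_compression(1024, 1024, downsample=32, latent_channels=32)

print(f"8x / 16ch VAE:  {sd3_like:.0f}x fewer values")
print(f"32x / 32ch VAE: {sana_like:.0f}x fewer values")
```

Under those assumptions the Sana-style latent is 8 times smaller again than the 16-channel one, which is the trade being discussed: smaller latents to store, but a harder reconstruction job for the decoder.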
I'd have to recheck, but it didn't seem like the overall experience was much faster than SD 3.5 Medium
possibly just due to Gemma
maybe with a GGUF it'd be better
Gemma is faster than T5 though
whatever I had to use when I tried was not faster to load / encode in ComfyUI than the Q8_0 quant of T5 is
it could be a node code issue
not sure really
oh a Q8_0 quant of T5 encoder is indeed slightly smaller than Gemma encoder apparently
again it could be that the "ExtraModels" code for "Gemma Loader" is just slow in some way relative to the City96 GGUF loader also
I don't know
its normal for Q8_0 T5 to be smaller than Gemma
its just that you are comparing a quant version to an unquant version
which is not a fair comparison
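To put rough numbers on the quant vs unquant comparison: GGUF Q8_0 stores weights in blocks of 32, each block being one fp16 scale plus 32 int8 values, so about 8.5 bits per weight. The parameter counts below (about 4.7B for the T5-XXL encoder, about 2.6B for the Gemma encoder) are ballpark assumptions on my part, and real files will differ a bit because some tensors stay unquantized.

```python
Q8_0_BLOCK_WEIGHTS = 32        # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 2 + 32      # one fp16 scale + 32 int8 values


def q8_0_bytes(n_params):
    """Approximate size if every weight were stored as Q8_0."""
    blocks = -(-n_params // Q8_0_BLOCK_WEIGHTS)  # ceiling division
    return blocks * Q8_0_BLOCK_BYTES


def fp16_bytes(n_params):
    """Size at 16 bits per weight (fp16 or bf16)."""
    return n_params * 2


# Ballpark parameter counts, assumed for illustration only:
T5_XXL_ENCODER_PARAMS = 4_700_000_000
GEMMA_ENCODER_PARAMS = 2_600_000_000

print(f"T5 encoder, Q8_0:  {q8_0_bytes(T5_XXL_ENCODER_PARAMS) / 1e9:.2f} GB")
print(f"T5 encoder, 16bit: {fp16_bytes(T5_XXL_ENCODER_PARAMS) / 1e9:.2f} GB")
print(f"Gemma, 16bit:      {fp16_bytes(GEMMA_ENCODER_PARAMS) / 1e9:.2f} GB")
```

So a Q8_0 T5 landing slightly under a 16-bit Gemma is about what you'd expect from the arithmetic alone, even though the T5 encoder is the bigger model.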
yeah that's why I said maybe a quant would help, initially
but there wasn't one that worked with that whole ExtraModels node system I don't think
I had a look and there are two other implementations of Sana
the official one, and one by the SVDQuant team, both are in Diffusers though
in other news it seems like Clownshark stuff makes Camera Lady prompt work a lot better, so far 👀
in SD 3.5 Medium that is
normally her fingers kinda just melt
SLG helps but it's usually too high contrast for this one for some reason
so getting a good result without it is pretty cool
came out great
this is why I'm hesitant to give up on SD 3.5 Med and keep fiddling with it though lol
it's technically capable of like perfect photo gens
they're just tricky to get out of it
No, just no. The requirements are too high
Resolution  | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON + Text Encoder 4bit
512 * 512   | 33GB                         | 20GB                        | 13GB
1280 * 720  | 35GB                         | 20GB                        | 13GB
1024 * 1024 | 35GB                         | 20GB                        | 13GB
1920 * 1280 | 39GB                         | 20GB                        | 14GB
2048 * 2048 | 43GB                         | 21GB                        | 14GB
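If you want to check those figures against your own card, here's a small lookup sketch. The numbers are copied from the table above. The mode labels, the `VRAM_GB` dict and `modes_that_fit` are names I made up, and I'm assuming the last column means CPU offload on plus a 4-bit text encoder.

```python
# (width, height) -> GB needed for (offload off, offload on, offload on + 4-bit text encoder)
VRAM_GB = {
    (512, 512): (33, 20, 13),
    (1280, 720): (35, 20, 13),
    (1024, 1024): (35, 20, 13),
    (1920, 1280): (39, 20, 14),
    (2048, 2048): (43, 21, 14),
}

MODES = ("offload off", "offload on", "offload on + 4-bit text encoder")


def modes_that_fit(resolution, vram_gb):
    """Return the listed configurations whose quoted VRAM fits in vram_gb."""
    needed = VRAM_GB[resolution]
    return [mode for mode, gb in zip(MODES, needed) if gb <= vram_gb]


print(modes_that_fit((1024, 1024), 24))
print(modes_that_fit((2048, 2048), 12))
```

By these numbers a 12GB card doesn't fit any listed configuration even at 512 * 512, which is pretty much the complaint.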
why do these new models keep comparing themselves to SD3 Medium
instead of 3.5 Medium
seems kinda sus
They also don't compare VRAM usage at specific resolution sizes. That's my issue with most of them.
what was the prompt for this one
Cinematic movie still of a majestic dragon resting in a dense, misty forest. The dragon’s scales glisten under soft, diffused light filtering through towering ancient trees. Mist swirls around its massive wings, and glowing embers float in the air, hinting at its fiery breath. The scene is captured in a dramatic wide-angle shot with rich cinematic lighting, deep shadows, and a shallow depth of field, evoking a sense of awe and realism. Ultra-detailed textures, realistic foliage, and a filmic color palette enhance the immersive atmosphere.
but I pressed prompt enhance button as well
you can calculate VRAM estimates btw
they won't be 100% accurate but
VRAM use is highly linked to the parameter count
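A minimal sketch of that kind of estimate, covering weights only: parameter count times bytes per parameter. Activations, the text encoders and the VAE all come on top of this, so treat it as a floor rather than a prediction. The 2.5B figure in the example is just an illustrative input, and `weights_gib` is a made-up helper name.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gib(n_params, dtype="fp16"):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


# Illustrative parameter count, not read from any model card:
n_params = 2_500_000_000
for dtype in ("fp32", "fp16", "fp8"):
    print(f"{dtype}: {weights_gib(n_params, dtype):.2f} GiB")
```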
I got this for a quick 25-step gen with SD 3.5 Medium, just using Euler Ancestral Beta at CFG 6.5, no fancy Clownshark stuff
its nice yeah
it's not really a great prompt to test I don't think, neither your gen nor mine is really "better" than this one I just did with Base SDXL and two loras lol
I know I don't rly know why you wanted it lol
just thought I'd try it I guess
try camera lady on it:
"a close-up photograph of a young woman holding a vintage camera in front of her face. She is looking directly at the viewer with a serious expression on her face, as if she is taking a photo. The camera is silver in color and has a large lens attached to it. The woman has long dark hair and is wearing a black top. The background is blurred, so the focus is on the camera and the woman's face. The lighting is soft and natural, highlighting her features."
I think Kolors "wins" actually
as is not uncommon I find lol
the demo broke for me sadly
it just keeps going 120 seconds plus
it's a common bug with Gradio demos
Kolors is underrated.
yeah
great lora results from it too
whoever that guy was that said it "needed" Chinese prompting was just wrong
I captioned this Lora only in English (as I don't know how to read, write, or speak Chinese lol) with JoyCaption and it worked great:
https://civitai.com/models/1204546/zoots-flux-pro-ultrafier-for-kolors
I trained this model with all English captioning. https://civitai.com/models/602580/kolors-openkolors-v24-multiple-style-general-kolors-model It is able to produce decent Chinese understanding with fine-tuning in English. The embedding space between Chinese and English is different. Even if you mess up the English caption, it still produces good Chinese prompt adherence. (I found that when I trained with mismatched color prompts in English.)
yeah that's what I figured
it'd have to be kind of separate like that
my Lora also doesn't have a text encoder component at all anyways since I wasn't going to try to train ChatGLM obviously, it's just the UNET part
which seems to be all you need
I actually don't think a ComfyUI node that could load a Kolors Lora with some sort of ChatGLM part even exists lol
nor am I sure you even could train a Lora like that
now that I think about it
I would still recommend chinese prompting but I don't think you need it yeah
wow OpenKolors looks great, thanks for this
Personally, I would say it is much better than the official one in most cases.
yeah for general use its better
I really love the base Kolors style but base Kolors is artistic
OpenKolors looks better for photography
I really should release my Kolors photo Lora, I keep meaning to lol
it's pretty good IMO
a photograph of a woman standing at a formal event. She is a young woman with a light skin tone, striking green eyes, and a slender, athletic build. She has blonde hair styled in loose, wavy layers that fall just below her shoulders, with a subtle, elegant updo at the back. She has a natural, glowing complexion and wears a sophisticated makeup look featuring a nude lipstick, subtle blush, and well-defined eyebrows. She is dressed in a luxurious, off-the-shoulder, white satin gown with a deep V-neckline, accentuating her cleavage. The gown is made of a silky, shimmering fabric that catches the light, giving it a radiant appearance. She wears an elaborate, multi-layered necklace adorned with large, sparkling diamonds that cascade down her chest, complementing the neckline of her dress. The necklace is paired with matching diamond earrings. The background is a vibrant red wall with abstract geometric patterns in shades of red and white, creating a bold, modern aesthetic.
base Kolors is first one, Lora @ 0.7 strength is second one
same seed and everything
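For context on what the strength number does: a LoRA is a low-rank update to a layer's weights, and the strength just scales that update before it's added to the base weights. Here's a toy sketch on one tiny matrix. The names, the optional `alpha / rank` scaling and the shapes are illustrative assumptions; real loaders work per layer on the actual checkpoint tensors.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_down, lora_up, strength, alpha=None):
    """Return weight + strength * scale * (lora_up @ lora_down).

    lora_down has shape (rank, in), lora_up has shape (out, rank).
    scale is alpha / rank when alpha is given, else 1.0 (assumed convention).
    """
    rank = len(lora_down)
    scale = alpha / rank if alpha is not None else 1.0
    delta = matmul(lora_up, lora_down)
    return [
        [w + strength * scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]      # rank 1, 2 inputs
up = [[1.0], [0.0]]      # 2 outputs, rank 1

print(apply_lora(base, down, up, strength=0.0))  # base weights unchanged
print(apply_lora(base, down, up, strength=0.7))  # 70% of the learned update
```

At 0.7 you get the base model nudged 70% of the way toward what the Lora learned, which is why a same-seed comparison still keeps roughly the same composition.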
its more realistic yeah
gets the prompt way more accurately overall too
this is with Lora but the prompt translated to Chinese
which interestingly I guess kinda still works
even though the Lora is not captioned in Chinese
not quite as good though, it misses e.g. She wears an elaborate, multi-layered necklace adorned with large, sparkling diamonds that cascade down her chest, complementing the neckline of her dress
that helps as well yeah if the prompt adherence improves
I mean I wasn't going to bother Google Translating all of my made-with-Joy-Caption-and-then-manually-edited captions lol
since I wouldn't know if the Chinese caption result was even good
so seemed to make more sense just to do it in English and improve the (already pretty good overall) English support in the model
it might take a lot of compute though for the english side to catch up
is the issue
I mean this 1000-image Lora was enough to make it like, very significantly better than pretty much any realistic SDXL model in terms of prompt adherence
I was even able to teach it ~2 full on NSFW concepts within that 1000 images lol
the results for photographic stuff specifically aren't really much if any better on stock Kolors when translated to Chinese anyways, when I've tested that before
ah okay I had read the chinese side was better but maybe it is not
it seems to be more important for sort of smaller things
like I guess Kolors is supposed to be able to do some text output but as far as I can tell it works much much better in Chinese
so things like that
another one lol
a high-resolution photograph featuring a young woman of Asian descent with a radiant smile, standing in front of a classic white Porsche sports car. She is wearing a shimmering silver bikini top that accentuates her medium-sized breasts and white, frayed denim shorts that are unbuttoned, revealing her toned abdomen. Her long, wavy black hair cascades over her shoulders, and she accessorizes with a delicate necklace, a wristwatch, and several bracelets on her left wrist. In the foreground, she is pointing a black handgun directly at the camera, creating a sense of excitement and boldness. The background features a clear blue sky, tall palm trees, and a desert landscape, suggesting a warm, sunny location, possibly California or Arizona. The car's sleek, glossy surface reflects the bright sunlight, adding to the vividness of the scene. The overall mood is playful and adventurous, with the woman exuding confidence and a sense of fun. The image captures a moment of high energy and boldness, blending elements of fashion, adventure, and a classic car aesthetic.
It is the first open source model able to produce Chinese characters. It is not very good at generating English characters.
this one came out even better lol
Kolors is like the Flux of UNET models for hands, generally
it doesn't mess them up very often
even without this lora
I bet it could learn English text generation pretty well though since the text encoder is already a lot better
maybe I'll try a Lora for that sometime too
even on SDXL it sort of works to just have a bunch of images where the text is all accurately described, in a Lora, to teach it to generate at least short stuff
In general I agree with you. I always said the unet architecture is not as bad as many people think, and I had a lot of discussions with people who didn't understand that a dit architecture is just simpler but not fundamentally different from unet. I do think that Flux sometimes has an amazing understanding and logic in its generation, so maybe there is something in the mmdit, though. This becomes apparent if you let it generate multi-part images ("give me a technical sketch of a building on the left and the very same building as photography on the right").
Regarding resolution: SD 3.5 just has a very shitty resolution handling. I wouldn't say that resolution is a general problem, it's just SD.
but I think the problem is: we never have fair benchmarks where different architectures are compared on exactly the same training data. So it's never clear if model A is better because of its architecture or its training data or its parameter size
the 1k version of Sana removed the positional embeddings and it went ok
positional embeddings seem to be the main unet dit difference
they added them back in for the 2k to 4k sana though
yes. But I think positional embeddings make total sense. Why let the unet learn them itself?
I agree, I don't use Sana, I wish Sana worked because I care mostly about speed
but I can't get adequate quality out of it
there are some resolution flexibility advantages from not having pos embeds
but I don't think that matters much because Flux goes to 8k without tiling (e.g. in the CLEAR paper)
there was some fast 3x3 conv model on arxiv last year but its like SD 1.5 quality at best
actually, resolution is the biggest disadvantage of conv
like you can usually increase resolution in pure transformer models without everything falling apart (see Pixart for example)
while convolution models get a lot of artifacts when increasing resolution (double heads, elongated necks and so on)
How do I use this? Good morning xd
when it goes well yeah but then you get models like SD3.5L that are way less resolution flexible than SDXL for example