#🆕|sd3
in the end you have to train all architectures on different resolutions. There won't be a solution where the model can extrapolate to any resolution
but convolutional architectures have the problem that their receptive field is fixed to a specific size. So they won't be able to generalize to much higher resolution without you having to change the architecture
on the other hand: who cares? I think 1-2 megapixels are more than enough and having flexibility in the aspect ratio is way more important
I wasn't sure if it had been confirmed or not what caused some of the issues SD3.5 has like that
I just don't like the fact that SD 3 / 3.5 cannot do like half-denoise-strength hi res fix at resolutions higher than their training
In the way that both SD 1.5 and SDXL could
I wouldn't really care past that too much
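For anyone unsure what "half-denoise-strength hi res fix" means in numbers, here is a rough sketch. The function name, the default scale of 1.5 and the rounding to multiples of 8 are assumptions made for illustration, not the settings of any particular UI. The step count follows a common img2img convention in which a strength of s only runs the last s fraction of the schedule.

```python
def hires_fix_plan(base_w, base_h, scale=1.5, steps=30, strength=0.5, multiple=8):
    """Plan the second (img2img) pass of a two-pass hi-res fix.

    Hypothetical helper: all defaults are illustrative assumptions.
    """
    # Round the upscaled size down to a multiple the latent grid can represent.
    target_w = int(base_w * scale) // multiple * multiple
    target_h = int(base_h * scale) // multiple * multiple
    # At strength s, only the last s fraction of the schedule is denoised.
    second_pass_steps = min(int(steps * strength), steps)
    return {
        "size": (target_w, target_h),
        "second_pass_steps": second_pass_steps,
        "megapixels": round(target_w * target_h / 1e6, 2),
    }


plan = hires_fix_plan(1024, 1024)
print(plan)
```

With a 1024x1024 base image the second pass lands at 1536x1536, which is already above the resolution the base model was trained at. That is the situation the complaint above is about.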
I sometimes see strange attention issues as well, like very small objects or collapsed structure
I could imagine the reason is not just the architecture but also the training
if you train on super high resolutions you can either
a.) downsize the image
b.) crop the image
I think earlier SD versions used strategy b.). They randomly cropped the image
so they learned how to denoise extremely zoomed in cropped tiles of a high resolution image
later SD variants then rather used strategy a.). They only cropped the image as far as needed to fit their aspect bucketing and otherwise used downsizing
this might explain why SD and SDXL are weirdly good at tiled upscaling
the reason: if you train on cropped images, you lose the alignment between text and image (e.g. if your prompt is "image of a man with sun glasses" and the cropped image is just some part of the street)
also sometimes the models created cropped images (e.g. headless people) which was a result of the cropped training
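The two preprocessing strategies discussed above can be compared with a small sketch. All function names and the example sizes are made up for illustration, and this is not the actual data pipeline of any SD version. It only computes how much of the source image the model gets to see in each case.

```python
import random


def random_crop_box(src_w, src_h, train=512, rng=random):
    """Crop strategy: take a train x train tile at native resolution.

    Assumes the source is at least train pixels on each side.
    """
    left = rng.randint(0, src_w - train)
    top = rng.randint(0, src_h - train)
    return (left, top, left + train, top + train)


def crop_coverage(src_w, src_h, train=512):
    """Fraction of the source image that a single random crop contains."""
    return (train * train) / (src_w * src_h)


def downscale_then_trim(src_w, src_h, bucket_w, bucket_h):
    """Downsize strategy: resize until the image covers the bucket, then trim the overhang."""
    scale = max(bucket_w / src_w, bucket_h / src_h)
    resized = (round(src_w * scale), round(src_h * scale))
    kept_fraction = (bucket_w * bucket_h) / (resized[0] * resized[1])
    return resized, kept_fraction


# A 4096x4096 photo: one 512 crop shows under 2% of it, so the caption rarely matches.
print(crop_coverage(4096, 4096))
# A 6000x4000 photo into a hypothetical 1216x832 bucket: almost everything survives.
print(downscale_then_trim(6000, 4000, 1216, 832))
```

The crop keeps native-resolution texture but throws away most of the scene, which fits both the text-alignment problem and the tiled-upscaling strength mentioned above.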
its probably better not to have crops in the base model yeah
I noticed flux doesn't actually have close up textures in its data
like if you try close up of trees or rocks
SD and SDXL know it but flux doesn't know what to do
maybe having a diffusion model without text that is used for upscaling? Dunno. All the upscaling models so far are very small and trained on smaller datasets
not sure what is best for upscaling
every week about 20 different arxiv papers all claim SOTA
which clearly means most of the SOTA claims are not right
and they cherry pick the comparisons so they make the other methods look bad
As far as I can tell the combination of like, a small ESRGAN / DAT / whatever model of your choice to do the actual low-to-high-res upscale and then a diffusion model to properly denoise the result with prompt context and everything is still the "best" way to upscale in general
there are dedicated upscaling models these days that are stronger
they lose the ability to generate images "normally" though
SD3 is using relative positioning where the (0,0) coordinates are always in the center of the image.
Flux is using absolute positioning where the (0,0) coordinates are in the top left corner of the image.
In theory the SD3 thing makes sense, but it seems not to work. As the Flux devs are the same people who developed SD3, I'm pretty sure they changed to the simpler positioning scheme for a good reason.
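To make the two coordinate conventions concrete, here is a toy sketch. It only builds the (row, column) ids as described in the messages above. The real models turn such ids into embeddings, and neither function is taken from the SD3 or Flux code. The grid sizes are arbitrary examples.

```python
def top_left_positions(h, w):
    """Ids counted from the top-left corner (the scheme described above for Flux)."""
    return [[(y, x) for x in range(w)] for y in range(h)]


def centered_positions(h, w):
    """Ids counted from the image center (the scheme described above for SD3)."""
    return [[(y - h // 2, x - w // 2) for x in range(w)] for y in range(h)]


def id_range(grid):
    """Smallest and largest row id and column id that appear in the grid."""
    ys = [p[0] for row in grid for p in row]
    xs = [p[1] for row in grid for p in row]
    return (min(ys), max(ys)), (min(xs), max(xs))


# Growing a hypothetical 64x64 training grid to 96x96 at inference time:
print(id_range(top_left_positions(64, 64)), id_range(top_left_positions(96, 96)))
print(id_range(centered_positions(64, 64)), id_range(centered_positions(96, 96)))
```

With top-left ids the corner patch keeps the same coordinate at every size and the unseen ids appear only at the bottom and right. With centered ids the middle patch stays fixed and unseen ids appear on all four sides. The sketch does not say which behaves better in practice.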
I never found using ESRGAN helpful at all. Using diffusion seems to be the only way to upscale :/
I see thanks, I wasn't aware about the relative/absolute positioning thing
Really? What do you use for the actual upsizing? The result of a trained upscale model going into the secondary hi res denoise pass is always better than the result of any traditional method like lanczos or whatever I find, that's what I meant by combining actual upscale models with diffusion ones for the overall process
You've never messed around with Kohya's Hi Res Fix or HyperTile and it shows. There's definitely ways you can extrapolate out to ANY resolution. The only thing that matters is how much VRAM you have.
Using the toys, I know for a fact SD 1.5 can gen to at least 1920x1080 (or vice versa), and that SDXL can gen up to at least 3840x2160 (or vice versa)
Talking about pure generation, no upscaling required.
I don't get to test it much because I'm only on 8GB of VRAM though
there are these kinds of hacks where you subsample the latents in the unet
a normal unet cannot do that
not if its not trained on these high resolutions
the problem with these hacks is: you don't really need them. You could just do the normal upscaling workflow (lowres generation, upscaling, img2img)
I could do that, but I will always prefer a one model solution without upscaling. Just to see what things are capable of.
yes, but its not capable of that
all these "solutions" are also doing something similar like downshrinking the latent representation of the image internally
convolution has a fixed receptive field. It just cannot generate images in arbitrary resolutions
Extrapolating out is easy, the only time I've ever had issues is when you try to get an image that's smaller in resolution than the training data. the Model goes nuts.
it cannot extrapolate xD That's what I say
like it can "extrapolate" in the sense that the image has same size and is just expanded infinitely. Like outpainting
but it cannot make the resolution arbitrary high. So to speak: there is a maximum size a "human head" can have in the unet architecture. It cannot make it larger
so making an image which is a super high resolution close-up of a face will fail
you can, however, do a lowres image first, scale it up, do img2img IF the model was trained on high resolution textures
That's up to prompting really. If you use "big head mode" (old school video game cheat), you should be able to get it.
I talk about technical limitations
Also make sure you're adding "close up portrait"
If I had kept the image, I could show you the model I'm working on for SD1.5 that does exactly what you're talking about.
the SD unet has no positional embedding. The "latent pixels" in your image don't really know "where in the image" they are. This is the responsibility of the convolutional layers: they lay out the image composition by "telling" the pixels where they are (how far from the left corner of the image, how far from the bottom corner and so on) and where they are relative to each other ("these two pixels are neighbours")
but the receptive field of convolutions is limited. There is a "maximum range" you can exchange information this way. So if your image is very large, then the pixels in the middle of the image do not know how far away they are from the border. They do not know where in the image they are. They still can exchange information via attention, but as they don't have any absolute positioning information this is really difficult. One pixel on the left side of the image does know that another pixel on the right side is "not close", but it does not know if this pixel is above or below, left or right of the other pixel.
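The "maximum range" argument can be checked with the standard receptive-field recurrence for stacked convolutions. The layer list below is a made-up toy encoder, not the actual SD UNet, and attention layers are ignored on purpose since the point above is that they carry no positional information.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack: three blocks of two 3x3 convs plus a stride-2 downsample,
# then two more 3x3 convs at the lowest resolution.
toy_stack = [(3, 1), (3, 1), (3, 2)] * 3 + [(3, 1), (3, 1)]
print(receptive_field(toy_stack))
```

The result does not depend on the image size. A field that spans a 64-wide latent edge to edge covers only a fraction of a 256-wide one, so pixels in the middle of a large image can no longer "see" the border.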
yes with reasonable quality there is a pretty low limit with SD and SDXL
they work very well tiled but not in one big tile
I actually made an SD 1.5 model that for the most part fully supports XL-equivalent resolutions natively
Just by training it like that
https://civitai.com/models/490451/zootvision-eta
None of these are upscaled, all genned at the resolution you see them without deep shrink or anything like that
these sorts of numbers, below 1560x1560 are fine yeah
I never tried to go that high yeah lol
It does have a lot of like 1536x640 / 640x1536 buckets though from a bunch of ultra wide / tall landscape photos I swiped from Wikimedia Commons at one point
yeah this level is fine, in fact its probably better to use SD 1.5 this way, around the 1 megapixel level
these are native 2k gens (non upscaled) using flux or rather flex trained on a 2k dataset. I couldn't get these results on any of the SD models
SD 3.5M, being the only multi-res capable model, completely breaks apart during training with multi res, especially when using resolutions higher than 1.5k
how many images were in your dataset
I think things have to go step by step
get the model to work well at 1.5k before thinking about 2k
at least from what I have seen/read, it could be possible to fix SD3.5's 1.5k generation with something on the scale of a few million images or so
yea and thats the problem, i used around 600 2k images to make it work in flux
not millions
and to make matters even worse, i didn't even do a full finetune, it's only a high rank locon
the positional understanding in SD is just very very bad even in medium
Do you happen to have some prompts for testing? I once came across a list of "impossible prompts" (reflection in a mirror) but I can no longer find it.
Positional prompts would certainly help in my tests
I'm not at my PC currently, but i mean positional understanding at higher resolutions. SD models will duplicate and stretch like crazy at 2k
Meaning if you ask for something in the center, you will have it 4x instead of one large image
or if you ask for something in the corner at higher resolutions it will do weird stuff
The base Flux models do a lot better in that regard which means their concept of positioning is already better and not as tied to resolutions
with careful settings flux can do 3k
regarding SD3.5, I do think it's plausible it will improve its 1.5k ability substantially
but yeah my understanding is that the ballpark is million+ images, it may be less than that though
The other huge issue with sd3.5 is the extremely low context size for the TE
training for more than 154 tokens already makes the model behave weird, but with over 256 it gets very bad
yeah the attention masking wasn't done so what this means is the model has issues if the text encoder token count goes above a certain amount
this is less concerning because you can just limit prompt tokens
you can, but this limits your outputs and you can't do detailed multi frame images which Flux can do
I think in terms of expectations it might be better off giving up on boosting it to flux level
and instead just try to improve the model a bit within its current performance bracket
Yea i just hope the next SD model meets the expectations
same I'm essentially hoping for SD4 at this point
the thing about flux is it has this insanely strong self attention
and then flux fill boosts it even further
if you are willing to sometimes re-roll seeds then you can use flux fill as your main model instead of flux dev and just outpaint everything
I'm using a de-distilled model of flux schnell as my main. This prevents most of the issues flux dev has, especially in finetuning
yeah that's good
The self attention of flux makes it difficult to train in anything it doesn't know yet and you also break its guidance embedding
I only use flux checkpoints that have de-distilled checkpoints as part of them
the lighting gets so much nicer with CFG
yeah I don't like base flux at all I find it unuseable
but then with photography checkpoints and 3 or so strong photography loras I like it
these are outputs from the locon i trained of flex (the de-distilled model). The lighting and flux like style went away basically instantly at the first few steps
they are from very early in training
yea im gonna soon merge the locon to the base model and release it haha
default flux blur can be slightly strange
everything looks like out of a game engine
yeah for sure its the unrealengine look
not actual unrealengine but what models seem to think unrealengine is LOL
cos actual actual unrealengine can look better than flux base 😄
yea, this was done intentionally for some reason. Their guidance embedding is meant to always default to this kind of look. With some heavy prompt engineering you can get photographic styles but it is very hard
and it is even harder to train that out of base flux
thats also why person loras work so well, because flux ignores most of the poses, stylistic elements and so on
I looked around on arxiv for other models that had guidance embeddings like that
but I couldn't really find any
well that's distillation for you. Flux pro also doesn't use it
the guidance is fun sometimes because what you can do is set it to 1.4 lol
guidance 1.4 flux is a wild time
yea its like midjourneys chaos mode
I never used midjourney cos I couldn't get their discord to work
im just using MJ to rip dataset images haha
lol
I thought about using flux ultra for that
but instead I might just filter HF datasets
Im using MJ because it basically says do whatever the f you want with the images. I think flux has some kind of no-use-for-training clause or so
particularly if you are doing reward learning / reinforcement learning type fine tuning
you can re-use existing big datasets because you are using them in a different way to before
yeah flux probably does
im doing that, i also do self reinforced learning. Since my current model is capable of 2k image output with nice quality, i can easily use the images for the 1k dataset and remove worse images
im also using the 512 res for concept transferring
okay nice
even if the images are bad, flex is very good at keeping sh... quality in the 512 resolutions haha
there are more image quality assessment things around now so
filtering can be better
flux dev hides some weird data in 384x384
poses do improve in that res though for some reason
the problem is getting images like that to become realistic looking is very difficult. I had to do lots of trickery to make that work in 1.5k
and since i now have the output i can reuse it for self reinforced learning
just implemented regional conditioning with SD35
pretty much doubles the token limit too
absence of planned sd3.5m controlnets fueling my cope that SAI dropped 3.5 and in best case cooking 4.0 with the newer architecture and better coherence to regain its crown... or at least 3.6 to address trainability
yeah SD 3.5 Med tops out at around 1440x1440
got a pretty good result for Gun Lady with Clownshark sampler and SLG, on SD 3.5 Med
SD 3.5 does cars particularly well I've noticed
I tried reprompting these with JoyCaption on the current version of my own Flux photo Lora (older / smaller dataset than the Kolors one has, was still decent though)
not bad I think
didn't expect the composition to be as similar as it was either TBH
I guess that just comes down to the text encoder patterns though
CogView kinda mid on the lady one here
rail to nowhere lol
it looks like a distilled model despite apparently not being one also
same as Lumina 2.0
for some reason
this SD 3.5 Medium one should be nice
doing it direct 1024x1536
I can see she has the right number of fingers already so we should be good lol
Clownshark saves the day again
Kolors Lora ones here
pretty good
just up against the limitations of the VAE as always sadly
I've noticed that this problem doesn't exist if you ONLY load and prompt T5, not either of the clips
it never explodes with only T5
so it has to be some weirdness with the CLIPS I guess
SD 3.5 Medium really likes prompts that are exactly the average length of regular Florence-2 Large "more detailed caption" mode
that's what this is
a portrait of a young woman with pink hair. She is sitting on a couch with a cityscape in the background. The woman is wearing a black leather outfit with a gold necklace and earrings. She has a pair of sunglasses on her head and is looking off to the side with a serious expression on her face. The lighting is red and blue, creating a futuristic and edgy vibe. The overall mood of the image is dark and mysterious.
yea sd3.5 can be made into a usable state, but it is just so tedious... 1 thing that is really noticeable and annoying with flux no matter how you try to tune it. Fur and clothing like wool and so on never looks really right, for some reason it always puts this smoothness on it
hard to put into words, but it always looks a little off
it always tries to have the patterns of it too perfect
I mean it looks like all distilled models do basically
3.5 Large Turbo on the left / first, 3.5 Medium in the middle / second, Flux Dev on the right / third
Aesthetically Dev is just like, a slightly lesser amount of visible distillation than Large Turbo I'd say
but you can still tell it is
whereas Medium is obviously not a distilled model
this one's Lumina 2.0 lol
it's weird looking for a non-distilled model
more like actual plastic than Flux
lumina probably comes from being low param count if you're using the 2b version and probably very undertrained
/ dream Painting of 2 fairies looking at the camera and smiling,Asian girl's face, full body photo, white wings
SD 3.5 Medium vs CogView 4
both 1152x1536
SD still winning in "groundedness" as far as realism IMO
I dunno why everyone else is apparently allergic to making a model that just looks normal for that kind of thing
got this out of cogview, similar result to you
I don't get why she has Flux Chin lol
that was mostly a product of distillation as far as I could tell
but it's not distilled
maybe it was just trained on too many Flux gens
I guess the timeframe would work for that to be possible
more likely to be DPO I think
the different categories of distillation act pretty differently
some are very subtle
like high step PCM
a photograph of a woman with long, wavy dark hair sitting at a wooden table in a dimly lit coffee shop, holding a teacup in her hands. She is wearing a light blue ribbed long-sleeved shirt, and her expression is calm and contemplative. The background is blurred, but it appears to be a cozy cafe with warm, inviting lighting. The image has a high-quality, cinematic feel, with a focus on the woman's contemplative expression and the warm tones of the setting.
with Kolors photo lora
Is there any guide on how to begin?
its a bit better yeah
yeah i've heard of people using them like that
I forget the name of the paper but it is something like "inpainting with video priors" it said video models learn stuff like laws of physics better
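/generate Modern city park, soft sunlight, lawns and trees, skyscrapers in the background. A white man (30 years old, short light-brown hair, light blue shirt, gray casual trousers, smiling) and a Black man (30 years old, curly black hair, dark green knit sweater, khaki trousers, cheerful smile) stand on a bridge, their shadows blending together, with pigeons and a chessboard in the background, illustration style, low-saturation tones.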
what was prompt?
Today's API service is not stable, always prompting timeout
Build an exhibition hall for IQOS electronic cigarettes, with an artistic and high-end style.
A tall fantasy art panel divided into four vertical sections, each showing the same stylized tree in a different season:
- Left section: winter theme with vibrant icy-blue leaves, snowflakes, and a dark starry sky
- Second section: spring theme with bright green leaves, soft glow, and gentle sparkles
- Third section: summer theme with warm golden-orange leaves, light glow, and shimmering atmosphere
- Right section: autumn theme with fiery red leaves, falling foliage, and a darker star-filled sky
Each section seamlessly transitions in color and mood, leaves softly glowing and drifting,
intricate detail, ultra-detailed, fantasy lighting, digital painting, trending on ArtStation, 8k resolution
Realistic style surreal visual scene
hello everyone!
I keep changing this absentmindedly instead of denoise for hi-res-fix with Clownsampler lmao
cause it starts at 0.5 I guess
When will the service be restored?
is there anything else in the incredibly gigantic list of sampler options worth checking out for general use cases lol? or a breakdown of what it all even is anywhere?
do you have rgthree installed?
it looks like this if you have rgthree, and turn on the setting for nesting folders
it really should be part of base comfyui imo
cuz yea otherwise it is a HUGE list lol
worth checking out is hard to say, it all depends on how fast your model is and your patience i guess lol
and what's best for what, it's hard to say... which is why i added so many lol
res_8s can be insanely good with sd35m, obviously will be a lot slower but medium is pretty fast soooo... can be viable
pretty much the big thing to know from a user perspective is the multistep ones will be fast, and the higher you go with the "s" number the slower it will be, but probably better
i'll try that one then thanks
do you recommend still using SLG with all / any of these, also? seems to work fine still, just wasn't sure if it was meant for it as much
these were both the same seed / prompt / settings etc on SD 3.5 Medium with your res_3s and "ModelSamplingAdvanced" in exponential mode
only difference is first was no SLG, second was with
SLG version background is way more coherent, especially the buildings, I think
and it seems to not have the yellowy high-contrast kinda look that SLG usually brings, when I use it with your exponential samplers
so that's a bonus too
not having the conditioning thingies there makes a huge difference for reasons I don't understand, even with an empty negative prompt
so that's like the overall best combo of settings I've found
yeah, a blank prompt gets encoded to something different than all zeros
the default scale 3.0 is WAY too high it seems for SLG lol, the colors are super wonky, 2.0 is way more reasonable
and the slightly lower default end percent of 0.015 is also a bit worse for whatever reason, at least in my experience
Exponential Clownsampling makes some of my SD 3.5 Medium likeness Lora experiments come out way better than I had "ranked" them at lol
I may need to re-evaluate like everything
I use these two a lot for testing lora training on new models just cause the models will very often struggle to reproduce the various somewhat unique aspects of how they look
in comparison to other celebs who don't look quite as distinct
holy crap
a miniature model of a castle on top of a large mug. The mug is made of stone and has two handles on either side. The base of the mug is covered in moss and rocks, and there is a small waterfall cascading down from the top. The waterfall is surrounded by greenery and there are two small figures standing on the rocks. On the left side of the base, there is an oak tree with yellow leaves. The castle is made up of multiple towers and turrets, and it is lit up with orange and yellow lights. The background is a bookshelf filled with books and other decorative items. The overall mood of the image is magical and whimsical.
Clownsample res_3s (right / second) absolutely steamrolled the DPM++ 2M SGM Uniform output (left / first) on this one lol
with SD 3.5 Medium
like it directly made the prompt adherence better
as far as the background
SDE can help a ton with getting style out of loras, and likeness, in my experience
it gives the model lots of chances to make little corrections and find its way to a better output
yeah it was always the obvious best choice for essentially all UNET models I always found
for any use case that wasn't like, fully 2D (anime or what have you)
for that Euler Ancestral always seemed better
yeah, it def helps
everyone kinda gave up and never did anything to get SDE working with rectified flow for whatever reason
but that's what the "eta" parameter does
if that's at 0.0, it's not SDE, if it's > 0.0 it's SDE
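As a rough illustration of how an eta knob switches a sampler between deterministic and stochastic behaviour, here is the noise split used by k-diffusion-style ancestral samplers. This is the older sigma-based formulation. It is an assumption that it conveys the same idea, and it is not the rectified-flow noise handling of the sampler discussed above, which the chat does not spell out.

```python
import math


def ancestral_split(sigma_from, sigma_to, eta=1.0):
    """Split a step from sigma_from down to sigma_to into a deterministic part
    (step down to sigma_down) and fresh noise to add back (sigma_up).

    eta = 0.0 gives sigma_up = 0, i.e. a plain ODE step with no new noise.
    """
    sigma_up = min(
        sigma_to,
        eta * math.sqrt(sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2),
    )
    sigma_down = math.sqrt(sigma_to**2 - sigma_up**2)
    return sigma_down, sigma_up


print(ancestral_split(10.0, 5.0, eta=0.0))  # no noise injected
print(ancestral_split(10.0, 5.0, eta=1.0))  # part of the step is replaced by new noise
```

In both cases sigma_down squared plus sigma_up squared equals sigma_to squared, so the total noise level after the step is the same. Only the share that is freshly sampled changes, which is where the "lots of chances to make little corrections" come from.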
cant remember if i gave you this but it has an explanation on most stuff, to some degree at least lol
yeah you did, it helped for sure 
Vibrant energy waves pulsating across the cosmos, different frequencies manifesting as different celestial objects.
how did you get the amazing reflection on the ground?
By using stoiqo, a sd35L fine-tune
https://civitai.com/models/161068/stoiqo-newreality-flux-sd35-sdxl-sd15 it's the pre-alpha on here, it says it's sd35 medium but that's incorrect, it's large
https://github.com/ClownsharkBatwing/RES4LYF also using this which really helps get the best out of a model
Here is the image you requested.
@spark grove spammer - you might want to block steam links
Phoebe, the most massive irregular satellite of Saturn.
Here is the image you requested.
tree
hello
Seeing your stuff again makes me wish I could run comfy on my ipad.
imagine an online store based on dark theme ui/ux design selling shampoo, conditioner, texture powder
Here is the image you requested.
modern living room
Modern living room includes: TV wall, table and sofa, wall hanging, door frame, decorative lights
dog big smart
a girl in forest
Nice Nose Flute! 🙂
/imagin prompt:Generate an attractive FB homepage image for wholesale customization of shoes, clothing, and bags for e-commerce, so that customers can know what product you are making as soon as they come in:: --aspect 16:9 --version 5.2 --quality .5 --stylize
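1970s-80s, Yancheng, nostalgic, cultural and creative shop, cozy, period feel, old photos, retro posters, hand-painted murals, wooden counter, shelves, cultural and creative products, old objects, green plants, warm yellow lighting, old-fashioned record player, Teresa Teng, old-fashioned bicycle, sewing machine, red-crowned crane, milu deer, paper cutting, embroidery, straw weaving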
is this for image generation? help pls
Try Artisan
https://arxiv.org/pdf/2503.10618v1
Although increasing the channel capacity of the VAE generally improves image reconstruction quality, it can inflate the KL divergence, hindering
subsequent diffusion training.
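One simple way to see why more latent channels can push the KL term up: for a diagonal Gaussian posterior the KL to a standard normal is a sum over latent dimensions. The sketch below uses made-up per-dimension statistics and only shows that summation. It is not the analysis from the linked paper.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over all latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


# Same (hypothetical) per-dimension statistics, different channel counts.
kl_4 = kl_to_standard_normal([0.5] * 4, [-1.0] * 4)
kl_16 = kl_to_standard_normal([0.5] * 16, [-1.0] * 16)
print(kl_4, kl_16)
```

At fixed per-dimension statistics the total grows linearly with the number of dimensions, so a 16-channel latent carries four times the KL of a 4-channel one unless each channel is regularised harder.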
could you post your workflow for that.... my creations in comfyui with SD3.5 are just bad
here's my basic sd 3.5 workflow - each encoder is separate so you can put in a different prompt for each one. they each understand the same tokens differently, and function differently - so prompting them to their strengths will give you the best results. And the prompt i used for the image was just "wild and crazy, surreal, untamed piano"
@honest yarrow perfect english - no, oops, it repeated a word that wasn't in the prompt - and this is the most advanced generative AI model right now
what prompt did you use
you can see i did not tell it to repeat the word pie
none of the AI image generators are good at text. they're getting better, but they're still not good. and the only text they are the least bit good at is English. if you want one that is good at any other language, you're going to have to spend the time to research, learn how to create data sets, get a data center, and train it
and then work with it over and over and over till you get it working
well at least It did type it cursively right. I will check SD 3.5 for Arabic now
since flux and SD3.5 are the same architecture, i doubt it'll do any better than flux
also It is only 8B
i know. flux is stuffed full of stuff it didn't need and then dpo was run on it to ensure that the things people are most likely to want to generate come out nice - and mask all the issues it has
it would also be around 8b if they hadn't done that
recently there was a Rombach lecture on youtube and they showed pre DPO flux sample
I have mixed feelings about DPO because it works very well for certain models
particularly SPO for SD 1.5
I think the weaker the model the more it helps, so it was more needed for SD 1.5 sized models than for big 10B+ ones
it was designed for LLMs - it reads well on the paper - i don't like what it does in practice
don't spam the channels
it has a high risk of overfitting compared to some other methods
there are some more modern similar methods to DPO that address that a bit
but its still a risk
Hi dudes. Happy wednesday.
Started tinkering with SD 3.5 medium recently. Very little information on the net about styles. Has anyone found a key to it yet? The artist styles seem to be present, but hard to use in complex prompts, as they dissipate very quickly as the prompt length increases.
Modern living room includes: TV wall, table and sofa, wall hanging, door frame, decorative lights
for medium keep the prompt short and consider using it as a refiner for sd 3.5 large. try prompts like this "pen and ink line drawing of love birds on a branch"
"Peaceful landscape with lots of dogs."
specifically no more than exactly 75 tokens, you can check with certain comfy nodes
you will sometimes see 77 written for the clip L and clip G size but apparently it ends up being 75
would leave a few more just in case so maybe 70
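The 77 versus 75 discrepancy is most likely just the start-of-text and end-of-text tokens that CLIP's 77-slot context has to hold. A tiny sketch of the budget arithmetic follows. The helper names and the default margin of 5 are assumptions taken from the "maybe 70" suggestion above, and the token count itself still has to come from a real tokenizer.

```python
CLIP_CONTEXT = 77    # context length of the CLIP text encoders
SPECIAL_TOKENS = 2   # start-of-text and end-of-text occupy two slots


def prompt_budget(margin=5):
    """Prompt tokens available after special tokens and a safety margin."""
    return CLIP_CONTEXT - SPECIAL_TOKENS - margin


def fits(token_count, margin=5):
    """token_count is the prompt length in tokens, excluding special tokens."""
    return token_count <= prompt_budget(margin)


print(prompt_budget(margin=0), prompt_budget())
```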
((cartoonish style), (chibi fantasy)),
main elements:
smiling sun character with straw hat (anthropomorphic sun),
wheat fairy holding scythe (wood-element spirit),
dynamic composition with wind-blown wheat waves (fire-like dynamism),
color palette:
orange sun (Bing fire),
emerald wheat (Yi wood),
light gray clouds (metal element weakened),
avoid deep blue or silver (avoid water and metal)),
text overlay: "庚午匠心" in bold calligraphy (fire-element seal)
I think this may be the halfmoon model on here
I'm finding little info about them beyond what's on their website. They're pretty recent.
there's honestly no reason to use closed these days with flux, stepfun and deepseek
Did you give it enough steps? The bars on the stairs are wavy instead of vertical.
its closed source, there is no step option
That is awesome.
trying to animate with sora but its shit
Have you tried wan yet? It's nuts.
don't have good gpu
I'm on a 3060TI with only 8GB of VRAM. I can run wan. =/
I'm using the exact files mentioned here though. https://comfyanonymous.github.io/ComfyUI_examples/wan/
you can put the rest of the layers in cpu ram, right?
because i have 6gb, but i can run flux, if its the same size
I'm using ComfyUI so it offloads to RAM what I'm not using.
hello
I really like this one. "I can fix her" mentality going on...
dynamic angle, black and white color scheme, monochrome, an artistic depiction of an alluring demon girl with demon wings, surrounded by flames, holding a huge huge fantasy magical sword, green flames, the scene is depicted with a feel of melancholy and angry engrained in the composition, long black hair, medium elf ears and golden fiery glowing eyes, high quality, realistic artist render, digital painting, realistic artist illustration, incredibly absurdres, intricate details, incredibly detailed, perfect lighting, HDR, volumetric lighting, year 2024, high contrast
limited colors looks good
I've heard of having skeletons in your closet, but a school locker?
anthropology student
Or they took the Introductory Class "How to Get Away with Murder"
The Advanced Class is taught by Michael Scofield and it's "How to Break Out of a Maximum Security Prison"
they failed it then - cause they didn't hide the evidence very well
Aesthetic wallpaper inspired by Oriental Five Elements, gold and wood fusion, soft green leaves with intricate designs, shimmering golden branches, interplay of emerald and gold, tension of balance, delicate mist, hidden golden glimmers, minimalist abstract, ideal for mobile, 4k
generate a girl
👧
Surrealistic illustration of a human face fused with mechanical and organic elements, featuring a steampunk and biomechanical aesthetic. The face includes gears, circuits, veins, and exposed anatomical structures, combined with antique clocks, alchemical symbols, and scientific anatomical details of the brain. The background consists of aged parchment pages filled with handwritten script and scientific diagrams. The artistic style is highly detailed, resembling ink drawing with watercolor shading in gold, red, and blue tones. Dramatic lighting with a worn, vintage paper effect.
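On a wooden tabletop, scattered: Shin Ramyun packets / soft-boiled eggs / sliced luncheon meat / chopped scallions / cheese slices / Korean chili paste, warm light, overhead shot, fresh Japanese-style filter, emphasizing the color contrast of the ingredients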
A hybrid AI approach known as hybrid autoregressive transformer can generate realistic images with the same or better quality than state-of-the-art diffusion models, but that runs about nine times faster and uses fewer computational resources. The new tool uses an autoregressive model to quickly capture the big picture and then a small diffusion...
And for a direct link to the Arxiv (which is much better than a crappy article) https://arxiv.org/pdf/2410.10812
anie
generate an animated spider man cartoon holding a cricket bat
Tiny low quality images, a joke
create an image of a dog
❤️ Check out Weights & Biases and sign up for a free demo here: https://wandb.me/papers
4o Image Generation: https://openai.com/index/introducing-4o-image-generation/
Apple terminal: https://www.apple.com/mac/lumon-terminal-pro/
📝 My paper on simulations that look almost like reality is available for free here:
https://rdcu.be/cWPfD
O...
I generated an image with the prompt: ‘Convert the image to a hand-drawn style, keeping all original content unchanged, including the person with twin tails in a floral dress, sitting in a car with a seatbelt, the car interior background, the phone lock screen with the time “13:16”, text “向上滑动以解锁”, notification “微信 4个须知”, and status bar with signal, Wi-Fi, and battery icons, using a soft black-and-white sketch style with clear details.’
An ad for a new iconic tracksuit collection from a brand called CASA X MENA photography, colorful modern, art by , Greg Manchess, in the style of , Artstation Pinterest, accent lighting, tracksuit ad, highly detailed intricate details Unreal Engine fine details HDR hyper realistic sharp trending on artstation
hi, can I use Chinese?
what
Looks like it is a mad race for big releases. After OpenAI's new model (Dall-E 4?), Ideogram just released their own new spectacular v3.0 and Midjourney is getting ready to release v7 soon
Meet Ideogram 3.0 — stunning realism, creative designs, and consistent styles, all in one powerful text to image AI. Now available to all Ideogram users for free.
Ideogram 3.0 introduces Style Reference. Creators can upload up to three reference images to guide the style of their generations. This enables creators to quickly specify aesthetics...
Really the Golden Age of AIs
redo this low quality flash card to be ready for print and remove the texts
Dall-E 4? Did I miss this? Dall-E is my fave!
See the 2-minute video above. They may not have called it this, but it is nevertheless their new image generating model
that's for Ideogram's.
above that
LOL... ah... thanks 🙂
🙂
I had a chance to try it, but not much before hitting the limit. It's now limited so much that it is essentially unusable.
Apparently even paid accounts are heavily rate limited.
yeah OpenAI always starts by making the biggest, most inefficient model they can and then distills it. Quality-wise it's a really big step up from normal t2i models, even closed-source ones. Imagen/Ideogram/Recraft might have slightly better aesthetics but the prompt following is pretty insane. some imgs from it
Now though, the model already seems considerably worse, so it looks like they are already distilling it. Some open source variant would be nice.
because that's what they always do. they release something as an early release sort of thing - they get people to use it, talk about how great it is. once they have enough of that to show their real target customers so they can sell it, they crack down on what the general public can actually do with it.
Give me a picture that everyone can like, with a theme of a soul singer named Na Yin Handsome Boy. The background is an immersive and dreamy color sensation, with the character appropriately reduced in size, in the style of a cyberpunk 3D anime
amazing images
unique style
/image_dream BB King
create a zeppelin with pov angle in the amazon forest passing through smoke fog
if that was true when you said this it's not now
Reve is quite good IMO
it refuses significantly fewer prompts than any other API-only generator I've ever used, also
Fusion of future aesthetics and natural themes, flowing glass texture, refraction effect
Fantasy warrior
Reve has nice colours and cinematic theme
better fine tune than most of civit
it cannot do hands
faces are ok a certain % of the time
structure sometimes messes up as well
kinda a medium quality release
I find they're not as good as like, say Flux
but they're also not "bad" by any means really IMO
like they're usually fine
even the worst examples of them are more like, kinda melty looking
or just a missing / extra digit
Reve has better overall prompt adherence than Flux also IMO, by a noticeable amount
it doesn't just go like "nah, I'm gonna do this instead" like Flux tends to sometimes
this is Reve on that "Gun Lady" prompt I was doing before, for example
it occasionally slightly screws up the gun hand, but not that often, and not in like a ridiculously huge way
gets it right only like a smidgen less often than Flux
but overall looks better always for that sort of gen than stock Flux Dev IMO
I didn't try it until a couple of days ago
i think when they first released it there might have been some issue with the backend
i think someone else also said the images were initially coming out like lower res overall than intended and stuff too
but as it is now it's about as good as my pics up there for basically anything usually, with really good prompt adherence
it also is just willing to do more things than other API-only generators, by quite a bit
they have a VLM prompt enhancer button that will very rarely just like output "I can't enhance this."
but you can turn that option off to retain your original prompt anyways
and then beyond that they do the typical blurring thing, but it only seems to apply to gens that had resulted in like full-on explicit NSFW
it doesn't care about any amount of like stomach or whatever as some of them do lol
ah that's unlucky
I actually have no idea why people pay money to use censored models when there are like $0.50 H100s everywhere
with optimised workflow I made 10,000 Flux dev images for $0.50
most people can't or won't optimise that hard but they can get a decent fraction of that number
yeah but that's still Flux
there's an argument to be made for these other models that are increasingly more easygoing about what they'll allow
and just like, overall good / better than any free version of Flux, particularly for complex typography and such, especially if your use case is moreso commercial / business related and all that
Reve you don't even have to pay for technically, they give you daily free credits, similarly to Ideogram
only Midjourney has absolutely no free option of any kind, these days
in terms of refusals though OpenAI stands alone, and I don't really get why
they're the only ones who still actively block ALL copyrighted characters from anything known to the model, for example
Meta's Imagine doesn't do this
Google Imagen doesn't do this
Reve doesn't do this
Ideogram doesn't do this
and so on
they're just shooting themselves in the foot relative to every single one of their competitors
like, 4o literally will not do a high-quality illustration of Bart Simpson from "The Simpsons".
every other model I mentioned will, without hesitation
it's because OpenAI is an AGI company
and the other stuff is secondary
whereas for most companies that sell AI products, the product is primary
yeah but I don't see why that makes them unusually obsessed with pretending to care about copyright, versus any of their competitors, specifically for image gen lol
i see no upside to that for them
dalle 3 was earlier in time than most
its aged so well that it still gets compared
but it is pretty old now
I mean its now their third best image model as well since they now allow image making using sora
and the GPT4o thing
I hope they publish the tech specs on it at some point
for like the whole pipeline
I've always been curious about how it really compares to newer models as far as all that stuff
yeah for sure
something about its prompt following is still the best to this day
not in every aspect cos flux etc can hold more objects
but it was very responsive
to an extent
the context is definitely not like, THAT long compared to a number of newer models
although we don't really know how much fuckery they do with your prompt on the backend
I'd also like to know just how good it actually was at photographic gens if they turn off all the stupid bullshit filtering they do that makes every image look like it's trying to imitate the overdone implementation of ambient occlusion from Far Cry 3
yeah i didn't think so
i never thought it was much better in a broad aesthetic sense than like, base SDXL
the images were never very "high quality"
it could just do sort of more interesting things
yeah that's right
yeah although I have a different view to most on vaes
I follow the lightningdit paper's idea that our vaes are too good and we need worse ones
cos worse vaes are easier to train your diffusion model with
worse in what sense though
like a different sense from a direct comparison between the SDXL and SD3 / 3.5 vae?
if not I don't really think that quality contrast could ever be worth it
the only way around that on XL was training on ridiculously high res images with zero JPEG artifacts
and even then it wasn't as good as just having 16 channels
worse as in a smaller vae, fewer channels and/or depth
there will be a size you can upscale to, to even out detail differences
here's another example of where Reve is significantly stronger than like, any Flux beyond Flux Pro Ultra, though
it can do stuff like Bart Simpson standing next to Garfield standing next to Goku quite accurately and consistently
Flux can kinda-sorta do that
but not nearly as often
and at least one of the characters will often look strange
Flux can't really resolve the three of them to a "common" style that makes sense the same way, I guess is the gist of it
maybe 16 channels are too much, but 4 channels are definitely too few
I don't understand why they went directly from 4 to 16 instead of doing something in between. On the other hand, they might just have done some evaluations and found 16 channels the best
they use different variables now than just channel count
broadly its just a trade-off between reconstruction and generation
they always did
you usually have a lambda parameter that controls the KL strength. If you trained a VAE with normal KL strength, your reconstruction error would be too large
ah okay nice
I looked a bit into variational inference and KL strength comes up there too lol
in the original publications they keep it always at 1
but for many applications that is just a bit too much
I think if you distribution match super super hard then it can be too inflexible
with KL divergence in general
its nice to have a bit of a looser fit
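To make the KL-strength point above concrete, here is a minimal sketch of a KL-weighted VAE loss. It assumes a diagonal-Gaussian encoder and a squared-error reconstruction term, neither of which is stated in the chat, and `kl_weight` is just an illustrative name for the lambda parameter. It is not taken from any SD codebase.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl_per_dim = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)
    return kl_per_dim.sum(axis=-1).mean()

def vae_loss(x, x_recon, mu, logvar, kl_weight=1.0):
    """Reconstruction error plus a weighted KL term.

    kl_weight = 1.0 is the textbook setting; smaller values trade a looser
    fit to the prior for better reconstructions (assumed behaviour, as
    described in the discussion above).
    """
    recon = ((x - x_recon) ** 2).sum(axis=-1).mean()
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)
```

With `kl_weight=0` this collapses to a plain autoencoder loss, which is the extreme end of the reconstruction-vs-generation trade-off mentioned above.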
My recollection is that the original SD3 paper had a study experimenting with varying numbers of VAE channels. I think they found that 32 channels improved their metrics further, but they decided on 16 for some reason. I don’t remember why.
Symbol: A stylized, simplified representation of the Indian peepal leaf, symbolizing knowledge and growth. The leaf can be designed with subtle, interconnected nodes or lines, representing the connection between ideas and research.
Generate a logo for a company called 'anveshana' as described.
Color Scheme: A palette of blues and greens, conveying trust, growth, and harmony. Blues can represent intellectual pursuits, while greens signify growth and innovation.
Typography: A clean, modern sans-serif font with the word "Anveshana" written in a flowing manner, suggesting continuity and exploration.
Meaning: The peepal leaf symbolizes the sacred tree under which knowledge is shared, while the interconnected nodes highlight the collaborative nature of research. The color scheme reinforces the themes of intellectual growth and harmony.
Vegetative electron microscopy
Almost one year...
if you click SD 3.5 Large as well then there were a few more
Shakker.ai had some more as well
what do you think, is sd4 possible?
after sd3.5l they planned to release sd3.5m controlnets too but there are still none, maybe they dropped sd3.5 and moved to other projects?
not sure actually, why the sd3.5 controlnets never came
tensor.art released some in the end
as well as a fresh set of distils
but not sure why SAI didn't do the first party ones
Huh, I did not even know about that
They even did a 5M finetune of medium!
diversity of sd3.5 is astonishing, if only not the coherency problems...
maybe this is just how it works? you have either a unique model with bad coherency or an overtuned model with great coherency?
I didn't even know about that SD 3.5 Bokeh model
but yea looks like they did a 5m image finetune
there are more factors, but there is a bit of a trade-off between quality and diversity yeah
dunno. I find diversity in Flux larger than in SDXL for example
Flux having low diversity is a myth yeah
particularly at low guidance numbers
its a much larger neural network than SDXL so it can be expected for the larger network to have a better trade-off
I think the trade-off is more for comparing different versions of the same model, e.g. finetunes/distils/CFG levels, more than it is for comparing different models
Try selecting the SD 3.5 Large category too
More definitely exist than that
Not sure what the "SD 3.5 with no suffix word" category is for TBH
Medium is getting a bit more love though
Two separate anime finetunes for it now
Both looking pretty promising when I tried them
RealVis dude also has a WIP Medium finetune on huggingface
TensorArt has a decent number I think too, that aren't anywhere else
They even had a bunch for the original SD3
The actual difference between SD 3.0 and SD 3.5 Medium continues to throw me curveballs also lol
3.0 really is legit objectively better sometimes
Most notably the "everything goes grey and melty" thing that happens with 3.5 Medium when the prompt is overly long actually didn't / doesn't happen nearly as much in 3.0
This is an "all settings same" comparison, 3.0 on left, 3.5 Medium on right
For a super long prompt:
'''a photograph showcasing an intricately crafted glass teapot, featuring a detailed, miniature scene inside. The teapot is made of clear glass with ornate, golden details on its lid and base, giving it an elegant, antique appearance. Inside the teapot, a serene seascape is meticulously painted, depicting a turbulent ocean with white, foamy waves crashing against rocks. A majestic, wooden sailing ship with two tall masts and white sails is navigating through the turbulent sea. The ship is depicted in warm, earthy tones of brown and white, standing out against the cool blues and whites of the ocean. The sea is rendered in realistic detail, with waves crashing against the glass, creating a sense of movement and depth. The rocks in the foreground are textured and detailed, adding to the immersive miniature scene. The scene is illuminated by a warm, golden light, possibly from the flame of a candle or a lamp, visible in the background. This light source casts a soft glow, enhancing the golden accents on the teapot and adding warmth to the cool blue tones of the sea. The background features a blurred, cozy indoor setting with a wooden table and a single, large, orange candle flame casting a warm, inviting ambiance.'''
3.0 looks like most models do for this prompt
3.5 Medium though is like, trying but melting in the process
So I dunno what's going on there lol
@devout schooner Well, this is my sd3.5 medium version of your prompt. So it must be ...
This is with no CLIP L or CLIP G text
What sampler settings? Also what was the seed if you have it
My images were generated with workflows that were literally identical except for the model swap BTW
just the default comfy ones for 3.5 / 3.0
I'm using Draw Things (mac app) not comfy, which uses a bit different jargon so not sure if this is helpful.
Sampler DPM++ 2M Trailing (matches SGM_uniform)
random seeds 4082523719, 3144246774
3.0 medium was always much stronger yeah
and 3.0 large for that matter
there is something up with certain implementations of SD 3.5 (both M and L) because when I use it in the official Huggingface demo I get much better results than when I use it in ComfyUI
and nobody notified comfy about this?
these days I tend to either use pure pytorch/JAX or C++/Rust kernels (when I can) so it didn't really matter that much to me either way
hey, sry to bother you but you seem very knowledgeable
so theoretically higher order ODE solvers should converge in fewer steps, right? then why can, say, dpm++2m generate a nice image in ~20 steps, while something like ipndm needs >30, otherwise there are very visible artifacts?
Testing HiDream, first result that really blew me away. Just beautiful. "Three antique fantasy potion glass bottles with labels in cursive font are sitting on a rustic wooden bench. The first bottle contains blue liquid and has the label "Mana". The second bottle contains red liquid and has the label "Health". The third bottle contains green liquid and has the label "Stamina". The warm lighting refracts through the liquid in splashes of beautiful color, casting raytraced caustic colors on the table below."
However, that's not cursive but calligraphy, and stamina is misspelled. There's no image2image in comfy at the moment, so I can't refine an image with a second pass.
A glass cannon. It took a little more prodding than expected to get a cannon though. The model tends to ignore unexpected words maybe?
No text reflection (no model I've used can do this yet, I'm just waiting for the day).
Gave me the title, art, and text I asked for, with a slight mistake in the text. (This was a 1-shot, usually in Flux I would do quite a few rolls.)
This skin and hair are very believably wet! Is it the best I've seen? Maybe.
there are different types of ODE solvers
if you are looking just within the category of explicit runge kutta solvers (like DPM++2m), higher order solvers can converge in a smaller number of steps
but ipndm is not an explicit runge kutta its a different category
It can't do many-numbered dice pip prompts. Not AGI here anyway. This is basically a slightly more capable version of Flux.
I mean, it has more parameters 🤷♂️ I would like to see how Flux would perform when replacing T5 with a more powerful text encoder
from the architecture I found HiDream very disappointing and wasteful
oh
so for a fair comparison, I'd look at say dpm++2m and dpm++3m?
Actually it has banding in all these images, like Flux gives at res >= 2048. I'm not sure HiDream is even usable at all because of that. I'm really hoping that's just the Comfy node.
what resolution do you use? I think HiDream has a max resolution of 1024x1024
So far I've only tried the resolutions HiDream used in their python scripts in their official repo, although I plan to test large image gen later for things like duplication. I've tried Euler and UniPC, although the Comfy versions might not be the same as the versions they're using in their repo. (Part of why I'm holding out hope the banding will go away.)
Wow that is without a doubt the best prompt adherence I've seen so far. This is a 1-shot.
From left to right: An old man, a little girl, and an old woman are sitting on a park bench. The old man on the left is Chinese with gray hair and a green jacket and he is asleep with his eyes closed. The little girl in the middle is Russian with black hair and she is laughing happily and wearing a yellow sundress. The old woman on the right is Native American and has faded red hair and is wearing t-shirt and jeans, and is looking down at the smartphone she is texting on in her hands. The scene is brightly lit outdoors.
I think the girl might have come out a little more Chinese than Russian though. But Mongolian sort of blends between the two, so it's not too wrong.
these are both multistep which complicates things
if you want easy comparison then compare euler to heun
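For anyone who wants to try the Euler vs Heun comparison outside of Comfy, a rough sketch on a toy ODE (dy/dt = -y, chosen here only because its exact solution is known; it has nothing to do with a diffusion model) could look like this:

```python
import math

def euler(f, y0, t0, t1, n):
    """First-order explicit Euler: one function evaluation per step."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t += h
    return y

def heun(f, y0, t0, t1, n):
    """Second-order Heun: an Euler predictor plus a trapezoidal corrector,
    so two function evaluations per step."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h, y + h * k1)
        y = y + 0.5 * h * (k1 + k2)
        t += h
    return y

def decay(t, y):
    return -y  # toy ODE dy/dt = -y, exact solution y(t) = exp(-t)

exact = math.exp(-1.0)
for n in (10, 20, 40):
    err_euler = abs(euler(decay, 1.0, 0.0, 1.0, n) - exact)
    err_heun = abs(heun(decay, 1.0, 0.0, 1.0, n) - exact)
    print(n, err_euler, err_heun)
```

Doubling the step count should roughly halve the Euler error and roughly quarter the Heun error. Keep in mind Heun costs two model calls per step, so a fair comparison is at equal function evaluations, not equal steps.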
Huh, even modding the script, 4k resolution actually fails with an error. I can't even attempt it. 😕 Never seen that before.
I guess I'm trying 2048*2048.
you've gotta learn ODE solving outside of comfy/diffusers though
like get a copy of Julia or Matlab instead
diffrax is ok as well
by the standards of the computational mathematics community, the code in the AI community is fairly error-prone
so it's better to learn the math separately
on the other hand computational mathematics libraries tend to be less optimised in terms of things like CUDA kernels so there are pros and cons
I see
that's weird, most of the time things will let you generate at 4k anyway; the image just comes out bad
Yeah it's not OOM either.
I remember that in the codebase they check for too large resolutions and reject them. You might have to remove that.
Doesn't matter, 2048 fails spectacularly. Completely unusable above expected resolution (and already banding there, so just forget it). 325 seconds on my PC.
From HiDream's python script:
yeah, you have to change the max resolution parameter in the script to generate larger images
but I assume they put it in there for a reason 😅
As I said, I did that, and got a tensor mismatch error. This might be some weird architecture, or a problem with the comfy node script. I'm sure Comfy will get native support ASAP considering this outranks Flux on that leaderboard everyone uses.
no, the tensor mismatch comes from that
you changed the wrong part
its the max resolution parameter
(or the max_seq variable respectively)
Well, it was also completely unusable at 2048, so I'm not going to pursue it any further.
accidental horror/comedy lol
TBH this still looks better than the way SDXL will error if the resolution is too high
base flux was not particularly great either above 1560x1560
there are fine tunes that take flux to 2560x2560 but they have some de-distill in them
Yeah Flux too. I think the best tile controlnet around is still for SDXL but I haven't kept up with it. Mostly been trying to get better video gens lately. Been a while since I tried a new image gen.
the SOTA for mirrors is an SD 1.5 or SD 2.1 finetune lol
can't remember which
they made an entire foundation model just for mirrors
SDXL tile controlnet is excellent, I like SD 1.5's one best though
https://github.com/Kosinkadink/ComfyUI-Advanced-ControlNet with the softweight node from here
as far as I know it lowers the strength per block
I'm doing a last reflection test, and then I need to go.
Good, but not exactly what I asked for. Gotta split.
my favourite controlnet of all is SD 1.5 with XDoG scribble
okay bye
fur details were good
if Hi-dream can deliver higher small details than flux and then be the same in other areas I would still take that trade TBH
please do examples with dogs and not with little girls. That's weird...
I didn't ask for a little girl though. Could've said "woman" instead of "girl". Meh. Deleted.
The dog's lighting and shadows should be reflected in the mirror, and they're not.
I just say the image is a bit borderline
I started running NSFW classifiers in order to force outputs to be cloud-friendly
I don't agree or care though. And I'm so gonna be late. Gotta run for real.
ok bye
on Vast.ai I always assume the docker container is being watched by the host
so I mostly make 1950s city images lol
Does anyone know if HiDream functions on MPS?
not sure, it's always tricky with Apple because their version of PyTorch is missing a ton of functions
Comfy Manager still doesn't list any custom node for it, and I'm hesitant to install directly from a github.
Yeah, I'm running the nightly torch builds, but understand that, in their infinite wisdom, they chose to have thousands of unique operators.
since the registry update I stopped using manager I would actually call installing directly preferable
IMO Apple should have improved the OpenVino or Vulkan ecosystems instead of making their own thing
Yeah, well, I just saw people reporting difficulties and possibly getting their comfy install nuked due to it. Like needing to install Flash Attention (which, as far as I know, doesn't function on MPS).
OpenVino in particular has been cooking rly hard lately
ah yeah ok I do know that the default Hi-dream workflow requires flash-attention 2
cos I was installing flash-attention 2 on a server the other day for that reason
if Apple doesn't support that at the moment then that's gonna be an issue potentially
More like Flash Attention doesn't support MPS. A majority of AI stuff is built purely with nVidia in mind.
yeah
I've been looking at making a distributed Intel CPU inference engine and its tricky with lack of support
Alright. Maybe I'll try to be patient for a while and see how things play out, rather than getting jealous of people using the new hotness.
I'm assuming there won't be any tiled diffusion solutions that work with HiDream for quite a while anyway. I'm not really satisfied with 1MP generations.
I could cobble a workflow together using Flux for the upscale, but I'd end up chugging the VM too hard.
if you can get openvino working on mac I've been working on a tiled image editing thing for openvino lol
its sort of a joke but it really does have tile counts up to the low millions
I found out that the Python PIL package stops working if your image goes above 300k or so because it assumes that the image is malware (a decompression bomb)
Each tile being 1MP?
LOL in that test each tile was 2 pixels wide and 2 pixels tall
Heh, okay.
but yeah I want each tile to be the size of a proper diffusion image so 512x512, 1024x1024 or 1536x1536
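A rough sketch of how a tile grid at those sizes could be laid out, in plain Python. The function name, the default overlap, and the choice to clamp the last tile to the image edge are all assumptions for illustration, not how the OpenVINO tool mentioned above works.

```python
import math

def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover the image with
    overlapping tiles. The last tile in each row/column is shifted back so
    it ends exactly at the image edge instead of hanging over it."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be >= 0 and smaller than the tile size")
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]  # image fits in a single tile along this axis
        count = math.ceil((size - tile) / stride) + 1
        return [min(i * stride, size - tile) for i in range(count)]

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]

boxes = tile_boxes(3072, 3072, tile=1024, overlap=128)
print(len(boxes), boxes[0], boxes[-1])
```

The overlap is what lets you blend seams afterwards; with `overlap=0` the same 3072x3072 image splits into a plain 3x3 grid.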
My tiled workflow for Flux is working well enough that I've thought of going higher (currently 3x scale for ~9MP), but there are some blockers. The second stage is sensitive to the level of detail (and the structure of those details) in the input image it is provided, so I need to do a model upscale with 4xUltraSharp. Otherwise, the 2nd stage result will just be blurry. I don't need to invent a very expensive bicubic scaler. Anyway, I've never seen a node that can do a tiled model upscale in Comfy. If I give the model upscale node a 9MP input image and it scales 4x, I'm going to have some major memory issues. I'd also expect more image consistency/hallucination issues when the ratio between the size of the target image and the tiles increases. I get that even at 9MP when the image has large areas of low detail (like a foggy, overcast scene with few foreground objects).
9MP is a pretty good size, I think above that size its diminishing returns
since most people have 4k screens I generally would use 4k as the minimum
these nodes work if I remember rightly https://github.com/kinfolk0117/ComfyUI_SimpleTiles
with Flux though I tend to use SD 1.5 as the upscaler
Flux itself adds less details
I'm mostly OK with Flux's details. I find it does well with natural details. The castle is only so-so.
this looks rly good for flux yeah
definitely above average for flux img
the castle is a good example of where flux upscaling goes a bit weird- SD 1.5 would for sure have also boosted the castle detail
it feels like flux picks certain objects to not improve lol
I suspect it is another case of sensitivity to the upscaling model, but don't have proof. I've tried a lot of upscalers but keep coming back to 4xUltraSharp. It just seems to work particularly well with Flux in getting those details. But it definitely has weaknesses and sometimes doesn't generate enough pixel-level detail for Flux to work with.
if you can do some pixel-space noise injection that can help
as well as noisy sampler
problem with noisy sampler is you then tend to need more like 60+ steps
which is rough for an upscale pass
I haven't tried overlaying noise. Flux already seems to like slightly noisy output, so I don't particularly want to encourage that.
yea it can be tricky to not have the noise stay in the image
Not sure what you mean by a noisy sampler. I've settled on bosh3, but it's hard for me to tell if there is an "optimal", let alone what it is.
you have the option of doing a third pass to clean up noise with SD 1.5 etc
I meant ancestral or SDE
bosh3 is nice though
I'm kind of a model purist and am resistant to going back to SD1.5 😆 .
I've heard people claim they found ways of getting ancestral and SDE samplers working with Flux and SD3, but don't know how they accomplished it. I've never found it to work with a normal workflow.
I also kind of dislike ancestral samplers because they don't converge, so you have no clue where to stop in pushing the number of steps.
I have a natural tendency to min/maxing, so ancestral drives me crazy.
to get SDE working with Flux and SD3 its just a matter of making sure the variance adheres to the variance of the VP SDE, essentially
but it can be tricky in practice to convert from math into code sometimes because different papers use different notation systems
you tend to need more like 60+ steps for SDE so if you had less than that then that is why it didn't work well
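One way to see what "keeping the variance consistent" means in code is the variance split used by k-diffusion-style ancestral samplers. This is only a sketch in the EDM-style sigma parameterisation, not the VP SDE mentioned above and not the rectified-flow formulation Flux and SD3 use, so treat it as an illustration of the bookkeeping rather than a recipe for those models. The function name and the `eta` argument are illustrative.

```python
import math

def ancestral_step(sigma_from, sigma_to, eta=1.0):
    """Split a step from sigma_from down to sigma_to into a deterministic
    part (denoise to sigma_down) and a stochastic part (add fresh noise of
    scale sigma_up), such that sigma_down**2 + sigma_up**2 == sigma_to**2."""
    if eta == 0.0:
        return sigma_to, 0.0  # no added noise: plain deterministic step
    sigma_up = min(
        sigma_to,
        eta * math.sqrt(sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2),
    )
    sigma_down = math.sqrt(sigma_to**2 - sigma_up**2)
    return sigma_down, sigma_up

sigma_down, sigma_up = ancestral_step(10.0, 5.0)
print(sigma_down, sigma_up, math.sqrt(sigma_down**2 + sigma_up**2))
```

Because fresh noise is injected at every step, the result keeps changing as you add steps, which is the "doesn't converge" behaviour complained about earlier.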
I don't know what you mean by "VP".
IDK if it's worth getting into the details but it goes back to an old paper called Song 2020
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distrib...
Not sure I'll be able to parse the paper. I only ever studied DEs at a surface level, and most of the AI-related papers require knowledge of previous papers to understand.
ye it's not needed to go into that level of detail necessarily
for the most part you can just pick from existing implementations of stuff
Yeah, but if I need to somehow "match variances" by playing around in Comfy and not having any insight into the math of what it's doing... lol
ye this is what I was saying earlier you've gotta learn the math outside of systems like comfy or diffusers
and then if needed you can bring what you learnt back in
I found SD 1.5 often adds too much detail at super high resolution
if every little spot in your image is super sharp and detailed it looks weird, too
but yeah, flux often looks a bit blurry when upscaling. I wonder if anybody tried fine-tuning flux on cropped ultra high resolution images
Euler Ancestral and the DPM2 Ancestral ones both "just work" out of the box in a normal KSampler in stock Comfy with the Beta and Normal schedulers, currently
For SD3 / 3.5
Presumably also Flux
It's just the aesthetics on certain prompts I find
Like anatomy generally is definitely way worse in 3.0, they did improve that a bunch in 3.5 Med
But it seems 3.0 just had a very very different dataset than 3.5 Med or something
yeah I never did people in 3.0 just landscape and sci fi
I used it a ton until flux release day
3.0 was much more photorealistic than 3.5
I only jumped to Flux once photorealistic loras/checkpoints arrived
the first being RealvisSchnell
followed by a bunch that had some de-distill in them
I never really used regular Flux so to speak
I did figure out how to fix the teapot boat prompt on 3.5 Medium BTW
I'm not sure now that it necessarily has anything to do with prompt length (or at least not always), I think it's more so just the dataset vs 3.0's
2d, 3d, cgi, render, smoke, fog, haze, mist, cartoon, anime, painting, drawing, sketch, illustration, traditional media, watercolor, airbrushed
in the negative gave me this on 3.5 Medium
it's still a bit oil painting esque for the boat area for my tastes, relative to 3.0, but way way more normal looking than before
good to know that broadly negating stuff like that does actually work
give me a colorful desk
its definitely better
it still has this scratchy details effect that I struggle to get rid of
its as if it needs perturbed attention guidance or something to clean it up
with negatives you can boost them a bit by delaying the negative for some steps, sometimes up to like 30-40% of the steps
its different for every prompt so it takes some experimentation
essentially negatives seem to work better once the thing you are trying to change has just briefly appeared in the image
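A minimal sketch of the delayed-negative idea, assuming standard classifier-free guidance where the negative prompt replaces the empty-prompt baseline. The function and argument names are hypothetical, and this is not how any particular Comfy node implements it.

```python
import numpy as np

def guided_prediction(cond, neg, uncond, step, total_steps, scale=5.0, neg_delay=0.3):
    """Classifier-free guidance with a delayed negative prompt.

    cond, neg and uncond are the model outputs for the positive prompt,
    the negative prompt and the empty prompt. For the first neg_delay
    fraction of steps the empty prompt is used as the baseline; after that
    the negative prompt takes over.
    """
    use_negative = step >= int(neg_delay * total_steps)
    baseline = neg if use_negative else uncond
    return baseline + scale * (cond - baseline)

# Toy values standing in for model outputs.
cond, neg, uncond = np.array([1.0]), np.array([0.5]), np.array([0.0])
early = guided_prediction(cond, neg, uncond, step=0, total_steps=10, neg_delay=0.5)
late = guided_prediction(cond, neg, uncond, step=5, total_steps=10, neg_delay=0.5)
print(early, late)
```

Since the best delay differs per prompt, `neg_delay` is the knob to sweep, somewhere between 0 and roughly 0.4 going by the range quoted above.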
its swings and roundabouts cos some of the details are excellent like at the base here
in the diffusion models there is a clear trade-off between big and small details (this is what FreeU is about)
SD3.5 won't have the same FreeU mechanics but maybe there is a similar trade-off
FreeU is specific to the UNet - which has skip connections between layers that can be tweaked. SD 3.5 uses MMDiT - which is a totally different architecture. however dango did do some tweaking in this workflow
yeah, but MMDiT doesn't have those :(

The only issue with this image is that the AI made the rocks and water in the bottom distort, giving the impression it is painted outside
honestly these boat attempts I posted here were with DPM++ 2M SGM Uniform and I think no Skip Layer Guidance
in general I find that using Skip Layer Guidance along with ClownsharkBatwing's RES4LYF samplers produces WAY better results
Euler Ancestral also "just works" in stock Comfy for SD 3.5
and doesn't have nearly as much of that grainy look, particularly with the Normal scheduler
relative to Euler
TLDR as I've said before a big problem with almost all these newer models is that the default samplers recommended are nearly always super mediocre ones that nobody would ever use if they didn't have to
it's like very close still to exploding into melty everything-is-very-greyness though
which is a distinct problem of SD 3.5, both Large and Medium
it wasn't so much of a thing in the original 3.0
might be a captioning problem or something
it seems like there's excessive bleed of extremely painterly traditional media data into basically all gens unless you negate it
or something like that
that's the best theory i have
like if any significant number of the captions just said like "a man beside a tree"
instead of "a painting of a man beside a tree"
or "a photo of a man beside a tree"
then that'd be the problem
if there was a lot of art data without any particular categorization
i think
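If someone had the captions on hand, the theory above could be checked with something like this. It is a sketch only: nobody here has the 3.5 Medium captions, the word list is an assumption, and `untagged_fraction` is an illustrative name. It just measures how many captions never say what medium the image is.

```python
import re
from typing import Iterable

# Assumed list of words that mark the medium of an image; extend as needed.
MEDIUM_WORDS = {
    "photo", "photograph", "painting", "drawing", "sketch",
    "illustration", "render", "watercolor",
}


def has_medium_tag(caption: str) -> bool:
    """True if the caption names a medium such as 'photo' or 'painting'."""
    words = re.findall(r"[a-z]+", caption.lower())
    return any(word in MEDIUM_WORDS for word in words)


def untagged_fraction(captions: Iterable[str]) -> float:
    """Share of captions that do not mention any medium at all."""
    captions = list(captions)
    if not captions:
        return 0.0
    untagged = sum(1 for caption in captions if not has_medium_tag(caption))
    return untagged / len(captions)


sample = [
    "a man beside a tree",
    "a painting of a man beside a tree",
    "a photo of a man beside a tree",
]
print(untagged_fraction(sample))  # one of the three captions has no medium word
```

A high untagged share among art images specifically would be the suspicious case, since those are the ones that could bleed a painterly look into plain prompts.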
I don't know what you mean, but aside from that oddity it looked great
this is the same prompt on the original SD 3.0 Medium, with a particular seed and the increment of that seed
it looks just, normal by basically any metric
as I'm pretty sure most people would expect
this is the same seed and same increment, on SD 3.5 Medium
note how the entire image is distinctly hazy and grey in 3.5
and the line resolution for small details is just worse
and this is WITH the negative prompt I mentioned before (for both the 3.0 versions and 3.5 versions)
I sincerely doubt this was intentional
it looks objectively worse
I definitely think the 3.5 is better overall
the model is yes
way better anatomy and such
but the grey haze bleeding into EVERYTHING is incredibly annoying
positive:
a photograph showcasing an intricately crafted glass teapot, featuring a detailed, miniature scene inside. The teapot is made of clear glass with ornate, golden details on its lid and base, giving it an elegant, antique appearance. Inside the teapot, a serene seascape is meticulously painted, depicting a turbulent ocean with white, foamy waves crashing against rocks. A majestic, wooden sailing ship with two tall masts and white sails is navigating through the turbulent sea. The ship is depicted in warm, earthy tones of brown and white, standing out against the cool blues and whites of the ocean. The sea is rendered in realistic detail, with waves crashing against the glass, creating a sense of movement and depth. The rocks in the foreground are textured and detailed, adding to the immersive miniature scene. The scene is illuminated by a warm, golden light, possibly from the flame of a candle or a lamp, visible in the background. This light source casts a soft glow, enhancing the golden accents on the teapot and adding warmth to the cool blue tones of the sea. The background features a blurred, cozy indoor setting with a wooden table and a single, large, orange candle flame casting a warm, inviting ambiance.
negative (used for both, although it's only really necessary or at least helpful with 3.5 Medium, 3.0 Medium doesn't need or benefit from it):
2d, 3d, cgi, render, smoke, fog, haze, mist, cartoon, anime, painting, drawing, sketch, illustration, traditional media, watercolor, airbrushed
Sampler was DPM++ 2M SGM Uniform (no fancy RES4LYF stuff for the sake of the examples), CFG 5.5, 25 steps
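For context on what the CFG value and the negative are doing at each step: the standard classifier-free guidance combination starts from the negative-prompt prediction and pushes away from it toward the positive one. Minimal sketch on plain Python lists; `cfg_combine` is an illustrative name, and real samplers do this on latent tensors rather than lists.

```python
from typing import List


def cfg_combine(pos_pred: List[float], neg_pred: List[float], cfg_scale: float) -> List[float]:
    """Classifier-free guidance: neg + scale * (pos - neg), element by element."""
    if len(pos_pred) != len(neg_pred):
        raise ValueError("predictions must have the same length")
    return [n + cfg_scale * (p - n) for p, n in zip(pos_pred, neg_pred)]


# Toy numbers only; where the two predictions agree, the scale changes nothing.
print(cfg_combine([1.0, 2.0], [0.0, 2.0], cfg_scale=5.5))  # [5.5, 2.0]
```

This is why the content of the negative matters: whatever the negative prediction leans toward (haze, painterly texture) is what gets pushed away, scaled by the CFG value.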
without that negative to push away the haziness, 3.5 Medium produces absolute garbage like this, with everything else the same:
whereas 3.0 Medium always looks normal / proper and doesn't have the greyness issue at all
Normal how? The lighting is all wrong
normal as in it does not literally look like the entire room is filled with smoke or fog lol
and as in the lines aren't nearly as much of an utter mess
instead of soft candle light it is high contrast with bright colors
the opposite of the prompt
if the people who made the model were actually that aesthetically blind then that would explain everything I guess
but like
THIS is a fine take of soft candle light from 3.0
no
the entire image is supposed to be the result of candle light
not just the candle
3.0 is all wrong
I mean we clearly disagree but this is semantics
this is a VERY real problem that SD 3.5 Medium has but 3.0 didn't
3.5 Medium VERY regularly produces images with a ridiculous, excessive grey haze across the entire image, in cases where you could not possibly argue it makes sense
and terrible resolution of lines for small details
unless you use negatives and better samplers
3.0 Medium had a lot of issues but it didn't have ones like that
this is not semantics. 3.0 looks like a room with electric lights
even putting the haze aside
the teapot looks like absolute butt
in this no negative 3.5 Medium version
it really looks like it desperately wants to make it an oil painting
and not photorealistic
no way those reflections on the glass are from candle light
they come from bright electric lighting
I mean I don't think this conversation is going anywhere useful
this is a blurry, hazy mess that looks like a painting when it should not, any way you cut it:
the nitpicks about lighting are not relevant
no?
yes
if you think that looks "good" this conversation is as pointless as i thought
literally nobody wants that output from that prompt
i promise you
then I guess the prompt is irrelevant too
nothing in the prompt says "literally add extreme fog EVERYWHERE, be sure that the lines are horribly resolved, make everything as blurry and foggy as possible"
which was the end result
that is the only issue I care about here
I don't know why you're nitpicking the other stuff
since it says in detail it is supposed to be low soft light from candles
that does not look like candlelight
it's a problem that 3.5 Medium has even for prompts that don't even mention ANYTHING about light
which impacts everything
I will give you numerous examples if you want
it looks like butt
nobody wants "realistic" gens to come out like that
I assure you
and part of this IS definitely caused just by too long prompts
but it's not entirely
as 3.0 Medium was simply not as impacted by it
it's almost certainly related to poorly captioned art data somewhere in the 3.5 Medium dataset, I think
partially at least
I have seen many people say variations of "why the hell is it so grey?" about 3.5 Large and Medium
never seen any opposite opinion expressed until now
I shouldn't have to fight to get 3.5 Medium to produce non-2d-or-painterly-in-any-way outputs
is the overall point
and indeed you didn't really have to do that with 3.0
despite the other flaws it had
I find this image to be far more realistic looking than the 3.0 counter samples you shared.
that was the one I gave as a better example aided by the negative yeah
but the colors across the entire image are still way too dull and grey for what I'd want, in a way that doesn't really make sense
and the lid / bottom of the teapot as well as the boat are very clearly wanting to be paintings instead of realistic based on how messy and poorly resolved the lines are
The colors reflect the lighting
Here is your prompt without all the insistence of candles
what's the exact prompt for this version?
a photograph showcasing an intricately crafted glass teapot, featuring a detailed, miniature scene inside. The teapot is made of clear glass with ornate, golden details on its lid and base, giving it an elegant, antique appearance. Inside the teapot, a serene seascape is meticulously painted, depicting a turbulent ocean with white, foamy waves crashing against rocks. A majestic, wooden sailing ship with two tall masts and white sails is navigating through the turbulent sea. The ship is depicted in warm, earthy tones of brown and white, standing out against the cool blues and whites of the ocean. The sea is rendered in realistic detail, with waves crashing against the glass, creating a sense of movement and depth. The rocks in the foreground are textured and detailed, adding to the immersive miniature scene.
same samplers
yeah it's definitely an improvement
trying it on SD 3.0 too though still gives like, noticeably cleaner / crisply resolved lines throughout
and I still find the 3.5 Medium version to be overly grey and dull-looking
I think one of the people who work for SAI has even said that the 3.5 Medium dataset was more art focused too, so my suspicions about rogue captions are probably at least semi-accurate