#SDNext WebUI on Intel ARC

fickle plume
#

Arc has some issues with VM support, might be it on windows

restive parcel
#

It wasn't working in docker so i tried just running in windows and it's still not working, so idk what's happened. I took a half year hiatus from my computer so idk what's changed in that time

fickle plume
#

When you ran it in windows, did you run with --use-ipex

restive parcel
#

aye

#

fresh git clone, webui.bat --use-ipex, no errors, open up webui, select an old model I had > "Queue cannot be constructed..."
ok, odd. let's try with a fresh civitai model from the sidebar, can't possibly mess that up. Downloads fine, select model, loads up, cool. Hit generate > "Engine did not start..." error and the webui crashes

#

tried it a few times and didn't get any farther

fickle plume
#

@restive parcel what gpu, what driver version

restive parcel
#

a770, current driver is 32.0.101.6458 due to instabilities present in latest

#

had to revert

restive parcel
#

even farther back?

#

I guess i could, though that'll have to be for another day

dull shore
#

32.0.101.6557_101.6262 from January works alright on an a750, but the most recent driver certainly not

dull shore
#

6647 works alright

dreamy tree
#

6632 works fine

#

but I'm having difficulty using sd.next, as most youtube sd tutorials are on a1111, which has different settings and layout from sd.next, and sdnext wiki has only the most basic doc

keen marsh
#

you can try the sdnext discord if you need help learning the program.

teal hinge
#

this happened the first time i tried to open it

#

i downloaded a different driver then it fixed it

#

but now its happening again after not using it for a while

#

and also a slight memory upgrade if that matters

fickle plume
#

and keep it. it will probably vanish from intel's website after the next driver release

teal hinge
#

sry forgot to mention it is an arc a750

fickle plume
#

that's an alchemist gpu yes, so get 6314

teal hinge
#

i see

#

thank you

fickle plume
# teal hinge why

because not all the old drivers are available (in an easy to find way?), something like only the last 10 or so are

teal hinge
#

ok

#

will do

#

also

#

i am supposed to use --use-ipex right?

teal hinge
#

ok

teal hinge
#

after installing the new drivers, would it be necessary to delete the venv folder to make a new one?

fickle plume
#

no need to reinstall anything in sdnext

#

just try with this older driver, should work fine

teal hinge
#

it worked!! thank u

#

how do u figure out which drivers work and which dont? just by trying out each of them one at a time?

worldly igloo
#

Hey all, i finally got SDNext to install on my pc, but im having an issue where when i launch it, the only "error" it gives, is that torch is running in cpu only mode, i am unsure how to fix this. I have tried reinstalling torch, using the command argument --use-ipex, also have tried to force the id to my gpu.

When i use the ipex command, it just won't start properly and gives me the error "torch not compiled with XPU enabled"

Any help would be appreciated.

grave condor
#

which version of python are you on?

#

if you are on a really old one, it might not find pre built wheels for torch+xpu

proper cradle
#

You should have been using --use-ipex from the start

keen kelp
# restive parcel aye

I'm getting the same error now. I've tried plenty of different drivers, and made clean SDNext install.

Using A770. SDNext used to work fine a week ago, unsure why it's acting up.

#

The full error being: "Queue cannot be constructed with the given context and device since the device is neither a member of the context nor a descendant of its member"

restive parcel
#

i just ended up using Comfy off the backend of AI Playground, since it seemed to work for some reason?

#

never got SDNext to run

keen kelp
#

I managed to fix it! The issue seemed to be the iGPU being enabled and causing conflicts. I had it disabled before but something must have enabled it again, so I didn't consider checking it.
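
Before disabling the iGPU, it can help to check which devices PyTorch actually sees. A minimal sketch, assuming a PyTorch build with XPU support (`torch.xpu.is_available`, `torch.xpu.device_count`, `torch.xpu.get_device_name`); `pick_discrete` is a hypothetical helper, and matching on the substring "A770" is an assumption about how the card reports its name.

```python
from typing import List, Optional


def list_xpu_devices() -> List[str]:
    """Return the names of all XPU devices PyTorch can see (empty if none)."""
    try:
        import torch  # imported lazily so pick_discrete works without torch
    except ImportError:
        return []
    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        return []
    return [torch.xpu.get_device_name(i) for i in range(torch.xpu.device_count())]


def pick_discrete(names: List[str], marker: str = "A770") -> Optional[int]:
    """Hypothetical helper: index of the first device whose name contains `marker`."""
    for index, name in enumerate(names):
        if marker.lower() in name.lower():
            return index
    return None
```

If `list_xpu_devices()` returns two entries on a single-GPU machine, the iGPU is most likely the second one being picked up.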

vapid sierra
#

@fickle plume I will continue talking here.

#

There is no error, I think the Graphics Drivers (or something related to it), crashes.

I am currently trying to do a fresh re-install. Let me try that first and I will let you know what occurs.

#

I don't have access to the stack trace rn either, I believe.

fickle plume
#

so it just exits when loading diffusers?

#

how do you know that happens specifically when loading diffusers

#

and what driver version

#

and what GPU

vapid sierra
#

32.0.101.6647, Intel Arc A770 16GB

fickle plume
#

Install 6314, try again

vapid sierra
#

Also, do you have any resources for how to make Flux1 work on this GPU?

fickle plume
#

dunno about sdnext, I use comfyui mostly now
it should be possible in sdnext

vapid sierra
#

Oh boy, I am not allowed to post my error.

#

"XPU out of memory. Tried to allocate 2.67 GiB. GPU 0 has a total capacity of 15.56 GiB. Of the allocated memory 6.66 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. Please use empty_cache to release all unoccupied cached memory."

#

This was specifically when I was trying to upscale.

vapid sierra
#

Thank you for the assistance, I will let you know if I encounter anything else.

nimble cliff
#

Just made the switch to SdNext from A1111 after switching to intel GPU. It's been anything but pleasant.

Can anyone guide me on how to optimise it better?

  1. I can't use img2img from images as sdnext will just crash. I have yet to try lowering the res for the image to see if it works.

  2. Sometimes generating an image will just take a really long time only for Sdnext to crash. Sometimes when I see that the progress is really slow, I closed sdnext and reopen the webui and it's fast again.

I'm on windows 10, 11th gen intel processor, b580 gpu and 32 gb ram.

I got medvram in the argument as without it, sdnext will push my vram and ram usage to the max.

nimble cliff
#

Yes I am

keen marsh
# nimble cliff Yes I am

It's been awhile since I used sdnext, are you using any of the attention optimization settings? Also what size images are you using?

nimble cliff
#

I can generate 512x512 and 1024x1024 just fine, but when I want to go higher than 1024 most of the time it'll take very long and oftentimes it's just crashed after awhile

#

I tried to install forge or fooocus but both seems to have issue installing haha

keen marsh
#

How much higher are you trying to go? Its probably spilling over into system ram which will cause slowdown.

#

Larger res will always be slower

nimble cliff
#

I usually do 768x768 in A1111 before I double it... No reason specifically. So I can just work with a lower res. The thing that bugged me is that img2img doesn't seems to work without it crashing haha. I'm gonna try to lower the res of the image first before I load it up to see if it'll work

nimble cliff
#

Even though now I got a higher vram? Was using a 3050 with 8gb ram, and med vram was working just fine... Then again that's an nvidia card. Will give it a try later

keen marsh
#

Nvidia offloads vram to ram by default in the drivers, so basically runs lowvram unless turned off

#

I used to run 768*768 no problem in sd next though, are you using any of the attention optimizations? I forget which was the best in sd next, but they should help with vram optimization and speed. You have to choose them in the options in sdnext

nimble cliff
#

I enabled tiled VAE to 96 I think and it helped out a lot. Even able to go upwards to 1920x1080. Not sure what attention optimization is

keen marsh
#

I haven't used it in over a year so maybe it's changed, I think the cuda hijacks maybe enable it by default for intel now not sure. You can also try ultimate-sd-upscale or another tiled upscaler to get higher resolution images.

teal hinge
#

is there anyway to fix the "No XPU deviced found" thing without rolling back the drivers?

teal hinge
#

i dont know

#

i just wanna know tho

keen marsh
#

You should probably be more specific about whatever error you're getting. Drivers shouldn't cause that, so it's likely something wrong with your environment.

teal hinge
#

you were right im sorry i just needed to disable igpu in device manager

dreamy loom
#

If you manually install torch 2.8 you don't have to disable the igpu. But you must use FP16 or you will get NaN output when decoding VAE.

proper cradle
#

NaN with BF16 is very strange tho, usually it is the reverse

proper cradle
#

A1111 doesn't support Clip Skip for SDXL, so any change you make is ignored on A1111.
SDNext supports setting Clip Skip for SDXL and you will have a bad time when you actually use Clip Skip with SDXL.

#

Pony will simply break and output static colors with non-default Clip Skip for example.

#

Other SDXL models still work to a degree.

proper cradle
#

HiDream on A770 with qint4:

#

Full model with CFG gets 10 s/it

#

VRAM usage is 12 GB with model offload

proper cradle
#

Flux Dev with qint4 + model offload gets 3.75 s/it and uses 8GB VRAM for comparison

teal hinge
#

hi guys

#

what does AssertHandler::printMessage mean?

fickle plume
#

The whole stack trace

proper cradle
#

Depends on the offload threshold

#

0.6 threshold with 16 GB VRAM GPU means use 10 GB of VRAM for the model weights

#

It uses 15-16 GB total with these settings

normal quail
#

ouch, so difficult to run on B580

proper cradle
#

You can reduce the threshold but 6 GB will probably be the minimum

#

As the compute needs 5-6GB

proper cradle
#

Just reduce the threshold

#

B580 has 12 GB
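
The threshold arithmetic from this exchange can be written out as a small sketch. The function names are hypothetical and this is not SDNext's own code; it only reproduces the numbers quoted above (0.6 of 16 GB is roughly 10 GB for weights, and compute needing about 6 GB).

```python
from typing import Tuple


def offload_budget(vram_gb: float, threshold: float) -> Tuple[float, float]:
    """Split VRAM into (weights budget, remainder left for compute) in GB."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    weights_gb = vram_gb * threshold
    return weights_gb, vram_gb - weights_gb


def max_threshold(vram_gb: float, compute_gb: float) -> float:
    """Largest threshold that still leaves `compute_gb` free for compute."""
    return max(0.0, (vram_gb - compute_gb) / vram_gb)


# A770 16 GB at 0.6: about 9.6 GB for weights, the rest for compute.
print(offload_budget(16, 0.6))
# B580 12 GB, assuming compute needs about 6 GB as stated above.
print(max_threshold(12, 6))
```

Under those assumptions a 12 GB card would want a threshold of about 0.5 or lower.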

normal quail
#

yes

normal quail
#

thats actually not that much of time

#

had to generate 120 images and prediction algos etc

proper cradle
#

24 fps * 4 = 96 frames

normal quail
#

ah my apologies i'm blind

#

😭

#

then approx 10s per image, seems normal according to benchmarks i saw online

#

im only wondering about vram usage for things like flux and sdxl

proper cradle
#

Balanced offload is pretty fast

#

VRAM doesn't matter that much unless you have less than 8 GB

#

System RAM is more limiting with these models

normal quail
#

i hope its enough.... ddr5 is expensive

proper cradle
#

64 GB is enough for most

#

Only issue will be if you try to run HiDream at full BF16 as the model weights alone are 64 GB at BF16

#

Tho you really should use int8 quantization with HiDream anyway

#

int8 will fit in 64 GB

#

Some quant tests for HiDream:

HiDream Full
All BF16 / BF16 Transformer + NNCF INT8 TE / All NNCF INT8 / All Quanto INT4

proper cradle
#

New NNCF Quant modes:
Flux Dev, transformer only quant

BF16 / INT8_SYM

#

INT8 (the original method with old nncf) / INT4 / INT4_SYM

#

With SDXL:

BF16 / INT8_SYM without Conv quant / INT8_SYM with Conv quant / INT8 (the original method with old nncf)

#

INT4 / INT4_SYM without Conv quant / INT4_SYM with Conv quant

#

On the fly Lora usage is also supported.
with SDXL + LoRa:

BF16 / INT8_SYM without Conv quant / INT8_SYM with Conv quant / INT8 (the original method with old nncf)

#

LoRa starts to lose its effect with INT4 quants

INT4 / INT4_SYM without Conv quant / INT4_SYM with Conv quant / BF16 without LoRa

#

NNCF changelog:

  • NNCF update to 2.16.0
    major refactoring of NNCF quantization code
    new quant types: INT8_SYM (new default), INT4 and INT4_SYM
    quantization support for the convolutional layers on unet models with sym methods
    pre-load quantization support
    LoRA support
#

the new INT8_SYM method has basically the same quality as 16 bit

#

pre-load quantization support is to support quanting the model while the original 16 bit model is being read from the disk so no out of ram will happen

proper cradle
#

Yes

#

FramePack extension works in SDNext dev branch rn

dreamy tree
#

is there a tutorial page on using it in sdnext?

dreamy tree
#

AttributeError: 'MemUsageMonitor' object has no attribute 'summary'

proper cradle
#

also were you on dev branch?

#

anyway, dev branch merged to main now, it should be able to run on the main branch if you update

dreamy tree
#

i updated to latest dev, so it's working now. can I use only init image to generate?

#

i've updated to latest dev so it runs now. but it's been 50min , i'm still only 36% doing 4sec 20fps...

dreamy tree
#

GPU's being used, vram's filled, but speed's a joke been 1.5hr, now at 50%

proper cradle
#

what are your offload and quant settings?

#

you won't be able to run the full model at 16 bit with the default offload settings without going oom

#

set balanced offload high threshold to 0.5 or 0.6

#

Enable transformer, video, llm and te from nncf quantization settings

proper cradle
#

New NNCF update might make it need to reduce the high threshold to 0.5

dreamy tree
#

video did come out after 2hrs...

proper cradle
#

Offload

dreamy tree
#

wonderful, how to have your nice UI?

proper cradle
#

UI settings, UI type -> Modern

#

Then do a full restart and also clear the caches from your web browser

dreamy tree
#

with the NNCF.. I ran a 8sec video before heading to bed, which took ...3.5hrs

#

Video 20250429-075742-libx264-f145.mp4 | Codec libx264 | Size 512x768x145 | FPS 20
sample 12303.43 vae 147.78 offload 56.95 move 23.26 vision 19.69 encode 12.46 prompt 11.81 save 1.68 gc 0.87 preview 0.55 | GPU 14852 MB 93% | RAM 31.25 GB 49%
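
To see where the time went, the stats line above can be summed up. A small sketch, assuming the numbers before the first `|` are seconds per stage (that unit is an assumption, though it agrees with the reported 3.5 hours); `parse_timings` is a hypothetical helper.

```python
def parse_timings(line: str) -> dict:
    """Parse 'name seconds name seconds ...' up to the first '|' into a dict."""
    tokens = line.split("|")[0].split()
    return {name: float(value) for name, value in zip(tokens[0::2], tokens[1::2])}


log_line = (
    "sample 12303.43 vae 147.78 offload 56.95 move 23.26 vision 19.69 "
    "encode 12.46 prompt 11.81 save 1.68 gc 0.87 preview 0.55 "
    "| GPU 14852 MB 93% | RAM 31.25 GB 49%"
)
timings = parse_timings(log_line)
total_seconds = sum(timings.values())
sample_share = timings["sample"] / total_seconds
print(f"total {total_seconds / 3600:.2f} h, sampling {sample_share:.1%}")
```

Nearly all of the run is the sampling stage, so VAE and offload settings are not the bottleneck here.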

proper cradle
#

What is your PyTorch version?

#

New update uses 2.7

#

I was using 2.8 nightly

#

2.6 is slow

dreamy tree
#

my modern looks different

dreamy tree
#

you're right i'm still on 2.6 nightly, i'll upgrade it right now.

#

been using comfy since 2.6, cuz it's tough to learn sdnext without someone like you, there's very little tutorial out there, most of them on comfy

#

thank you so much

#

I did do clear all cache and then restart server. i'll try again after download 2.8

#

i went into browser setting to clear images and files, but i suppose also cookies and other site data ?

proper cradle
#

Also just checking, did you set modern as the UI type or did you change the theme?
Because they are different things.

dreamy tree
#

i've been meaning to ask, did you make ipex_to_cuda?

proper cradle
#

Yes

dreamy tree
#

how come it's being used in comfy, but i don't see it mentioned in sdnext?

proper cradle
#

ipex_to_cuda is from SDNext

#

I originally wrote it in SDNext

dreamy tree
#

so it's still integrated in sdnext under the hood

proper cradle
#

Yes

dreamy tree
#

after clear cookies

proper cradle
# dreamy tree

remove anything that starts with ~ in the site packages folder
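
A small sketch for finding those entries. Folders starting with `~` are usually left behind when a pip install or uninstall gets interrupted. The helper names are hypothetical, and the script only lists the entries so you can check them before deleting anything.

```python
from pathlib import Path
import sysconfig


def find_leftover_dirs(site_packages: Path) -> list:
    """List entries in `site_packages` whose names start with '~'."""
    return sorted(p for p in site_packages.iterdir() if p.name.startswith("~"))


def default_site_packages() -> Path:
    """Site-packages of the interpreter running this script (run it inside the venv)."""
    return Path(sysconfig.get_paths()["purelib"])
```

Run it with the venv's Python, for example `print(find_leftover_dirs(default_site_packages()))`, then delete the listed folders by hand.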

proper cradle
#

Show the user interface settings

#

Try setting the theme back to default

#

Then restart and do CTRL + F5

#

CTRL+ F5 should force actual refresh

dreamy tree
#

This sycl kernel dll sounds serious,it's been there since a month ago, why is it causing problem now

#

I restart and do ctrl+f5 still

#

not sure what to do here

dreamy tree
#

much better with 2.8nightly, 12sec video in 1.5hr !!
but how come you can get it done with your A770 much faster?

dreamy tree
#

turned out that the modern extension was not enabled, one down!

#

do i need to check convolutional layers on nncf?

dreamy tree
#

i'm confused with modernUI tho, with old UI, I can do faceswap by start with Image to Image, script->face->faceswap->input image->generate, now with modernUI, after doing the same in scripts tab, hitting generate at the bottom doesn't do anything

dreamy tree
#

Does webui --lowvram invalidate the balanced offloading settings? do you run webui without --lowvram? i thought with A770 lowvram's a must...

proper cradle
#

Which will probably make it run on a potato but very slowly

proper cradle
#

You can re-enable text2img tabs from settings

#

It was called something like Hide txt2img tabs in User interface settings

dreamy tree
#

So just webui --use-ipex for A770 ?

proper cradle
#

Yes

#

You might want to enable VAE tiling if you run out of VRAM on the VAE step

dreamy tree
#

Goodness... but comfyUI guys says to use --lowvram with a770 on comfy

proper cradle
#

ComfyUI doesn't support balanced offload

dreamy tree
#

Gotcha

proper cradle
#

It is offload all or nothing on comfy

#

We do partial offloads

dreamy tree
#

I tried to find flux1 qint and saw ur hf, but didn't see any checkpoint qint file?

proper cradle
#

It is a diffusers model

#

Also available in the reference models in SDNext

#

You can just click on it and it will download & load it for you

dreamy tree
#

Enlightening

proper cradle
#

Tho i recommend using the original model and quantizing it on the fly instead

#

We support quantizing any model on the fly

dreamy tree
#

You mean with nice on, I can load flux.1 dev directly in A770?

#

With NNCF

proper cradle
#

Yep

#

Just set a quant mode

#

We also support quantizing while the model loads with transformers models (like flux) so you won't run out of system RAM

#

That's called "pre" load mode

#

post load mode will load everything into RAM first, then quantize

#

SDXL and UNet models are only supported with post load mode
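
Why "pre" mode avoids running out of system RAM can be shown with a toy model of peak memory. This is only a sketch: the layer sizes and the `ratio` (quantized size divided by 16-bit size) are made-up inputs, temporary buffers are ignored, and none of this is SDNext's loader code.

```python
from typing import Sequence


def peak_ram_pre_load(layer_gb: Sequence[float], ratio: float) -> float:
    """Quantize each layer as soon as it is read; only one 16-bit layer is live."""
    held = 0.0  # quantized layers already kept in RAM
    peak = 0.0
    for size in layer_gb:
        peak = max(peak, held + size)  # the 16-bit layer is briefly resident
        held += size * ratio           # then replaced by its quantized copy
    return peak


def peak_ram_post_load(layer_gb: Sequence[float], ratio: float) -> float:
    """Read every layer at 16 bit first, then quantize layer by layer."""
    held = sum(layer_gb)
    peak = held
    for size in layer_gb:
        held -= size * (1.0 - ratio)  # each layer shrinks as it is quantized
        peak = max(peak, held)
    return peak


layers = [8.0] * 8  # hypothetical split of a 64 GB 16-bit model
print(peak_ram_pre_load(layers, 0.5), peak_ram_post_load(layers, 0.5))
```

With these made-up numbers post load peaks at the full 64 GB, while pre load stays well below it.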

dreamy tree
#

You guys are so good, really wish you can take over comfys backend

#

Yes I read thru the wiki that's how I realized about lowvram

dreamy tree
#

sigh, I remembered wrong, haven't been using --lowvram with sdnext

#

what else could be causing my super slow framepack?...

proper cradle
#

what is your resolution and frame rate?

#

12 seconds is very long

#

15 mins was for 4 seconds, 24 fps, 480p

#

12 seconds, 30 fps, 512x768 will take 1.2 hours just from multiplying 15 minutes with frame and resolution increase
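
That estimate can be reproduced with a short sketch. It assumes run time scales linearly with frame count and pixels per frame, and that "480p" means 640x480; both are assumptions, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    seconds: float
    fps: float
    width: int
    height: int

    @property
    def frames(self) -> float:
        return self.seconds * self.fps

    @property
    def pixels(self) -> int:
        return self.width * self.height


def estimate_minutes(base_minutes: float, base: Clip, target: Clip) -> float:
    """Scale a measured run time linearly with frame count and pixels per frame."""
    return base_minutes * (target.frames / base.frames) * (target.pixels / base.pixels)


baseline = Clip(seconds=4, fps=24, width=640, height=480)  # "480p" taken as 640x480 (assumption)
target = Clip(seconds=12, fps=30, width=512, height=768)
print(estimate_minutes(15, baseline, target) / 60)  # estimated hours
```

The frame count goes from 96 to 360 (3.75x) and the pixel count rises by 1.28x, which together turn 15 minutes into about 1.2 hours.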

dreamy tree
#

I see. what does lowering balanced low watermark from 25 to 0 do exactly? and how about high watermark from 75 to 60?

#

Video 20250429-230714-libx264-f73.mp4 | Codec libx264 | Size 384x576x73 | FPS 24
sample 568.83 vae 50.10 offload 37.98 move 13.00 vision 10.95 prompt 10.78 encode 9.78 save 0.65 gc 0.64 preview 0.40 | GPU 12868 MB 81% | RAM 30.97 GB 48%

dreamy tree
#

high watermark i'm guessing from doc that if vram's being utilized more than high watermark, the rest of model gets sent to ram? but doc says nothing about low watermark

proper cradle
#

aka model weights will use 75% of your vram with 0.75

#

low is used for when to offload models

#

if vram usage is smaller than low watermark, it won't offload

#

clip models stays in vram with 0.2

#

llama text encoder uses a lot of vram for compute so setting 0.7 for the high will likely oom

#

tho it will be fine if windows can use the shared memory

dreamy tree
#

so changing low from 0.2 to 0 makes sure clip gets offload to ram?

proper cradle
#

Yes
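
The low watermark rule as described in this exchange, written as a sketch. This is one reading of the chat, not SDNext's implementation; `should_offload` is a hypothetical name and the 1 GB CLIP figure in the example is made up.

```python
def should_offload(used_gb: float, total_gb: float, low_watermark: float) -> bool:
    """Offloading is skipped while VRAM usage is below the low watermark."""
    return used_gb / total_gb >= low_watermark


# Hypothetical: a CLIP model holding 1 GB on a 16 GB card.
print(should_offload(1.0, 16.0, 0.2))  # below 20% usage, stays in VRAM
print(should_offload(1.0, 16.0, 0.0))  # low watermark 0, always offloaded
```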

dreamy tree
#

Curious if your ipex_to_cuda directs all torch.cuda to torch.xpu, can all projects using torch that only have cuda code work with xpu?

proper cradle
#

tho anything that isn't pytorch won't be redirected

dreamy tree
#

possible to do the same with onnx?

proper cradle
#

We don't use ComfyUI in SDNext

keen kelp
#

FramePack was working awesomely yesterday, then I updated the extension and now I get this when starting the server

#

'ImportError: cannot import name 'ui_video_vlm' from 'modules' (unknown location)'

proper cradle
#

That's a new feature in dev branch

#

Either downgrade the extension or switch to dev branch of SDNext

keen kelp
#

Ah thanks, I don't have much experience with github but I managed to downgrade after some reading. Framepack is seriously impressive!

proper cradle
#

SDNext updated now

dreamy tree
#

I tried to do faceswap on video using control tab, sdnext did see 577 frames, but all it does is faceswap on first frame for 577 times

keen kelp
#

are there examples of Loras that work with framepack? I've tried 3 different hunyuan loras from civitai and they all give load errors

proper cradle
#

You might want to create an issue on github for these.

#

INT4 quants now run 75% faster out of the box or 3.5 times faster with torch.compile compared to before.
INT8 quants now run 30% faster out of the box or 2 times faster with torch.compile compared to before.

#

torch.compile is only used for the decompression, rest of the model is untouched

#

Flux.Dev now runs at 2.4 s/it with INT8_SYM quant or 2.5 s/it with INT4 quant

proper cradle
#

NNCF in SDNext beats Bitsandbytes now : )

dreamy tree
#

when trying it on Juggernaut v9 I have
no NNCF 1.65 it/s
NNCF 1.72 it/s
NNCF decompress 1.79 it/s
NNCF decompress+matmul 1.7 it/s
above numbers are on 2nd generation after changing/apply setting, not sure why the first gen right after changing setting has much better speed

proper cradle
#

Those settings require full restart

dreamy tree
#

btw, with this dev version, if I toggle on/off one NNCF setting, screen automatically scroll to top that I have to scroll down manually to NNCF setting area, and repeat this for every setting toggle

#

can I load gguf models in sdnext? I saved them in Unet folder, but they don't show up in base model dropdowns

proper cradle
#

NNCF settings are on top of the settings menu, just below bitsandbytes

#

Are you up to date?

dreamy tree
#

i was using show all pages, won't have this problem if going directly to quant settings

proper cradle
#

Renamed NNCF to SDNQ in dev branch.
I have re-implemented and optimized enough code to not use any imports from NNCF and the modified code is not really NNCF anymore.

proper cradle
#

HiDream now runs at 6 s/it with uint4 quant on A770

fickle plume
#

that's better than before but to be quite honest imo that's still far from reasonably usable. something like teacache/first block cache would likely help, and would probably be even better than comfyui given the speedup without it

#

generally a ~2-3x speedup for flux, i'd expect the same for hidream

fickle plume
#

that's better

dreamy tree
#

I simply git pull repo, I didn't install any hidream modules...

proper cradle
#

did you install your own diffusers?

#

your diffusers version is outdated

dreamy tree
#

I install torch xpu nightly first, and then requirements.txt and then webui.bat

#

isn't diffusers latest 0.33.1 on pypi

dreamy tree
#

I see launch.py loads 0.34.0.dev0 , can't requirements.txt reflect that ?

proper cradle
#

installing requirements.txt is not supported

dreamy tree
#

video on sdnq quant should be pre or post?

proper cradle
#

pre

#

SD 1.5 / SDXL are post only

#

Transformer based models support pre

proper cradle
#

Managed to achieve 2.3 it/s with INT8 matmul and 2.1 it/s with FP8 matmul on FLUX with an RTX 4090 (on Runpod) without using any custom kernels or CUDA

INT8 has basically the same quality as the full 16 bit model

#

It would be nice to see int8 or fp8 mm support in PyTorch for Intel

#

IPEX 2.7 sort of has int8 mm support but not really, IPEX 2.7 is just too slow with everything

proper cradle
#

SDXL with SDNQ, everything except the VAE is quantized

BF16 / INT8 weights + INT8 MatMul / INT6 weights + INT8 MatMul / UINT4 weights

fickle plume
#

@rustic saffron Explain your sdnext issues. What model, did you run with --use-ipex, and so on, show a screenshot of the bad image with the generation settings visible

rustic saffron
# fickle plume @rustic saffron Explain your sdnext issues. What model, did you run with -...

KIOKIS V20 hyper model, and once i opened sdnext again, it just gave an error on the webpage and cmd just died without error, so like the background timed out trying to load it. as for settings, i just ran the default ones it gave me, no idea what any of them even do so didnt change them in case i break something. not like they'd give me something broken out the gate, it should be set up already and have options for some skilled people to micromanage and tweak it, at least how i see it.

rustic saffron
#

right so, got it installed, and it does work, with the juggernautXL_v8Rundiffusion, but kIOKISSDXLHyper_v20 just errors out, no error info, just webui says it, and cmd is press enter to continue, after the loading 1 model line. no other info there. so suppose it doesnt support kIOKISSDXLHyper_v20 for some reason maybe, which would suck, its a real top tier 13gb one.

proper cradle
#

how much ram do you have?

#

SDXL models are 6.5 GB

#

That 13 GB model is wasting 6.5 GB of space for no reason, find a FP16 version

rustic saffron
#

32gb ddr4, vram is 12gb b580
and i ran that model for ages on my laptop, 16gb ddr4 8gb 3070 laptop gpu. it gets really amazing results. also i dont know what fp16 is.

proper cradle
#

that model is stored in 32 bits

#

but your GPU converts it to 16 bits to run it

#

It is 2x bigger for no reason

#

Find a 6.5 GB version of it, or you can use the models page to convert it yourself
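
The size difference is just bits per weight. A quick sketch of that arithmetic; the table and function name are hypothetical, and real quantized files are slightly larger because of stored scales and metadata.

```python
BITS = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "int4": 4}


def converted_size_gb(size_gb: float, src: str, dst: str) -> float:
    """Approximate checkpoint size after storing the same weights in `dst` precision."""
    return size_gb * BITS[dst] / BITS[src]


print(converted_size_gb(13, "fp32", "fp16"))  # the 13 GB SDXL checkpoint at 16 bit
```

The same arithmetic explains why a model that takes 64 GB at BF16 fits in 64 GB of RAM once it is quantized to int8.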

rustic saffron
#

im not smart enough to convert it, ill just use a different model. the 6.5gb ones seem to work, and honestly been good 20 hours trying to get any sdxl to run, so i'll take the w here.

rustic saffron
#

realistic, dont do anime stuff, did test kiokis 3d model works, a little like 2.something gb one. so can use that for 3d stylized stuff.

proper cradle
#

SDXL

4 bits / 2 bits / 1 bit

fickle plume
#

damn, impressively good at just 4 bits

restive parcel
#

oddly coherent at 1 xD

keen marsh
#

4bit always seems like the cutoff point for coherence

proper cradle
#

Merged to main branch

proper cradle
#

Benchmarks:

hollow spire
#

I keep getting this runtime warning when using ipex on my A580 when using txt2img, especially when faces are mentioned in the prompt. it outputs a black image
i've tried:

  • disabling ipex optimizations
  • ipex force attention slice 1
  • fp16fix VAE model
  • lower resolutions
  • full precision
  • reinstalling
  • changing safetensor models
  • VAE tiling

it doesn't seem to be a memory issue, i also got 32gb of ram and it doesn't fill up
i've been searching this issue for some time and trying all the fixes i could find but nothing really helped
the only 'stable' way i found was to use openvino but that doesn't really help when most of the things except the steps fall back to the cpu and slows the generation a lot
i'm very new to this so idk what to do anymore

proper cradle
#

This is the second time i am seeing this issue and that GPU was also an A580
Normally this happens on unsupported iGPUs with BF16

#

Can you try manually setting the dtype to FP16?

#

A580 might be unsupported by ipex / pytorch too

keen marsh
#

I have an a750^^

fickle plume
#

ah

#

hmm

hollow spire
#

gpu usage is also way more consistent, so yeah I guess bf16 doesnt work

hollow spire
#

hmmm after quite a lot of attempts, it's a bit inconsistent/odd. some prompts work well, some output the same error, some output a garbled image. i'm rather confused

#

i guess i'll give comfyui a shot

fickle plume
#

i doubt it'll be much different

keen marsh
#

I remember people using a580's fine must be some updates to either pytorch/ipex or the windows driver.

hollow spire
#

the comfyui install with your script works flawlessly, and i mean completely flawlessly

#

thank you for adding whatever magic to it

fickle plume
#

But are the images sane though

hollow spire
#

yes!

#

and the time is like 2s/it , sometimes 1.5-1.8

proper cradle
#

Can you try setting IPEX_FORCE_ATTENTION_SLICE to 1

set IPEX_FORCE_ATTENTION_SLICE=1
.\webui.bat --use-ipex
#

This is the only difference between comfy and sdnext

#

I disabled dyn atten in sdnext in favor of flash atten with pytorch 2.7

#

SDNQ Quantization matrix with SDXL
(image is downscaled and webp compressed)

hollow spire
#

update: I enabled that thing again and --lowvram and it's working, somehow, idk how, but it is

#

to be fair, I did make a bug report about something similar for blender, where the renderer would just crash when the vram was almost full, so it might be related, funnily enough

proper cradle
#

Balanced offload might be broken on a580

#

Model offload might work but will have higher vram usage

hollow spire
#

yeah, im also getting worse iterations/s compared to using --lowvram

keen marsh
#

Flash attention works on intel now? Or is it just a flag for compatibility

proper cradle
#

PyTorch has built-in flash attention with all 3 GPU vendors

#

Intel got support with PyTorch 2.7

#

AMD got support with PyTorch 2.5

#

Nvidia got support with PyTorch 2.0

#

Torch sdpa uses flash, memory efficient, and math attention

#

Highest priority is flash atten, then memory efficient atten, if nothing works, then math atten
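
The priority order can be pictured as a simple fallback chain. This is a plain-Python illustration only, not PyTorch's actual dispatcher. In real code you just call `torch.nn.functional.scaled_dot_product_attention` and PyTorch picks the backend itself. `pick_backend` and the `available` dict are hypothetical.

```python
from typing import Dict, Sequence

PRIORITY = ("flash", "memory_efficient", "math")


def pick_backend(available: Dict[str, bool], priority: Sequence[str] = PRIORITY) -> str:
    """Return the first backend in priority order that reports itself usable."""
    for name in priority:
        if available.get(name, False):
            return name
    raise RuntimeError("no attention backend available")


# Flash unavailable (for example an unsupported dtype or shape): fall through.
print(pick_backend({"flash": False, "memory_efficient": True, "math": True}))
```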

keen marsh
#

So it's still best to use sdpa?

proper cradle
#

yes

hollow spire
# proper cradle Model offload might work but will have higher vram usage

model offload seems to be working just fine for the generation part but Adetailer breaks, was working under --lowvram
edit: i got adetailer to "work" by using --lowvram and model offload but it's either replacing the face with garble or a mix of garble and the actual character's body on top of its face

  • i'm still getting some occasional garbled generations so ig sequential might be the only one that works stable
    I hope it's fixable somehow, i was getting near 1 it/s on model offload but 3.2s/it will have to do for now
proper cradle
#

Use the built in detailer

#

adetailer is outdated

hollow spire
#

I would but it's worse quality-wise

#

example: eyes
detailer left
adetailer right

same yolo model, same settings

proper cradle
#

yolo model has no effect on the image

#

both adetailer and built-in detailer are just inpainting

#

did you set the denoise / detailer strength and the detailer steps to be the same on both?

hollow spire
#

yes!

#

i feel like the built-in detailer doesn't follow the prompt at all, even when i tried to manually give it the prompt

#

since the eyes dont match

sly whale
#

Question, has the "Unable to allocate more than 4gb" thing been fixed for the A series cards?

proper cradle
#

It still exists but there is nothing that needs a single 4 gb allocation anymore

#

Only the attention needed +4 gb but we now have flash attention, memory efficient attention and my dynamic attention that all uses much less memory and doesn't need +4 gb at all anymore

proper cradle
#

Flash attention and memory efficient is much faster than old attention so you don't really need any workaround.

But if you still need a workaround so that you can allocate a single 4gb block for whatever reason, the workaround is this:

export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
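
`export` is Linux shell syntax. On Windows cmd the equivalent is `set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1`. A platform-neutral sketch that builds the environment in Python; `relaxed_alloc_env` is a hypothetical helper, and the launch command is left to you.

```python
import os
from typing import Dict, Mapping, Optional


def relaxed_alloc_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 in it."""
    env = dict(os.environ if base is None else base)
    env["UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS"] = "1"
    return env
```

Pass the result as `env=` to `subprocess.run` together with whatever command you normally use to start the webui.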
sly whale
#

Right-o thanks

umbral summit
#

Can somebody help me setup the sdnext for my A750? It's just crashing with no output

#

I'm new to this

umbral summit
#

Is there any video or anything which tells me how this works

umbral summit
#

I followed this

#

The weird stuff was just the top left of the picture being generated and it took like 1hr

#

rest was all grey

fickle plume
#

Show your settings

#

your results

fickle plume
#

You waited 1 hour for an image?

umbral summit
#

I will just try to show you the settings

umbral summit
#

I knew my settings were wrong but there's just no tutorial anywhere which was useful

#

they used controlnet in the vid ill try to install that

fickle plume
#

There's a gallery, and you can also just open sdnext's folder and go into the outputs folder and find them there as well.

#

Don't crop

umbral summit
#

you see

umbral summit
#

40 times

#

Now if I generate it just takes like a few secs

fickle plume
#

that image is 732x906

umbral summit
#

Forget about that image help me setting up this

#

I check enable here and apply changes, it turns off itself

fickle plume
#

And restart sdnext

umbral summit
#

it shuts down and restarts the webui, but the extension doesnt load

umbral summit
#

output does look good now but not how I want it

fickle plume
#

sdnext should have controlnet built in, and most likely in the composite tab

umbral summit
#

control?

fickle plume
#

This is why I keep asking you to not crop.

#

If you crop every time, I can't help you

umbral summit
#

oh you mean the screenshot

#

I thought in the resize settings because you've asked for the deformed result

fickle plume
#

You're using an sd 1.5 model. get the tile controlnet for sd 1.5, use the original unscaled image as input and upscale the image.

umbral summit
#

I found the controlnet in the control tab under control elements

#

but doesnt have all the stuff in the video

proper cradle
#

Finally managed to achieve performance increase using INT8 matmul on Flux with Intel ARC A770:

2.5 s/it -> 2.0 s/it

fickle plume
#

new pytorch version fixed it? or something else?

proper cradle
#

intel wants the exact opposite strided weights compared to amd and nvidia

#

intel wants contiguous, meanwhile amd and nvidia want non-contiguous

#

amd is fine with either but performs better with non-contiguous

#

nvidia dies with contiguous

#

intel dies with non-contiguous
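
For anyone unsure what contiguous means here, a plain-Python sketch of strides. The helper names are hypothetical and the check is simplified: unlike PyTorch's `is_contiguous()`, it does not special-case size-1 dimensions. In PyTorch terms, `w.t()` gives a non-contiguous view and `.contiguous()` copies it into packed layout.

```python
from typing import Sequence, Tuple


def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a densely packed (contiguous) row-major array."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True when `strides` match the packed row-major layout for `shape`."""
    return tuple(strides) == row_major_strides(shape)


def transpose_2d(shape: Sequence[int], strides: Sequence[int]):
    """Transposed view: swap shape and strides, no data is moved."""
    return (shape[1], shape[0]), (strides[1], strides[0])


shape = (4, 3)
strides = row_major_strides(shape)          # packed layout
t_shape, t_strides = transpose_2d(shape, strides)
print(is_contiguous(shape, strides), is_contiguous(t_shape, t_strides))
```

The same weight data can therefore be handed to the matmul in either layout, which is the choice being discussed above.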

#

Also Flux is exactly at the breaking point of INT8 on A770
Anything smaller than Flux will run slower

#

flux has mnk dim of 3072

#

sdxl is 1280

fickle plume
#

seems qwen edit is also 3072? sad. well, a minor boost is still good i guess

#

ah, staring at your graph closer i guess it's not actually that bad

#

neat

sullen sun
#

What is "M" on this graph?

#

(Terrible labeling lol)