#ComfyUI for Intel Arc using IPEX


earnest grotto
#

@reef ivy do you want to explain your steps for getting compile working on windows?

somber trellis
main spear
#

I can see now what I need: I have to run it in WSL with Intel's Triton backend

#

This is just the universe saying "Get an NVIDIA GPU......."

formal tusk
#

I just use a symlink

earnest grotto
#

there is no point making symlinks when comfy already supports extra model locations by default

#

if you can learn to make a symlink, you can edit a yaml file (extra_model_paths.yaml) and add many locations much more easily

main spear
#

My problem right now is that when I set up WSL it doesn't detect my GPU, even though I have all the requirements

earnest grotto
#

does clinfo show your gpu

#

you installed with my script?

earnest grotto
main spear
#

From what I've been reading and what Grok and ChatGPT have told me, torch.compile doesn't work on native Windows. I have your script working on native Windows; I didn't try installing it in WSL2 though

#

The problem is clinfo doesn't read my gpu

earnest grotto
#

grok and chatgpt probably told you outdated info. compile wasn't a thing on windows in general (intel or nvidia) till recently, and i highly doubt they'd give you anything accurate about intel

earnest grotto
#

there are a bunch of extra things you probably will need to install on wsl

#

i have not touched wsl in quite a while

main spear
#

I couldn't find the compute runtime or Level Zero for Windows; it was all for Linux and Ubuntu

earnest grotto
#

yes, you're using wsl.

main spear
#

I meant to get compile working on native

#

I have oneAPI installed as well, because I thought those came in that package, but they don't

earnest grotto
#

search aaron's messages in this thread regarding compile on windows

main spear
#

#1193952640225267802 message

#

Here?

earnest grotto
#

yes

#

if you're using my installer, 2.8 is nightly/experimental, dunno which i called it

#

you can run the script again from the same location as before and pick a different pytorch version. it will install the different version, faster than installing it anew

reef ivy
main spear
#

@reef ivy How did you get Wan videos to finish in under 20 minutes? I switched to LTX because it's faster, but Wan has the quality I want. I tried a few different workflows and they all took an hour, sometimes longer

#

oh it didn't show all the code

#

```
  File "C:\Users\chris\OneDrive\Documents\ComfyUI\Comfy_Intel\cenv\lib\site-packages\triton\runtime\build.py", line 74, in _build
    raise RuntimeError("Failed to find C++ compiler. Please specify via CXX environment variable.")
torch._inductor.exc.InductorError: RuntimeError: Failed to find C++ compiler. Please specify via CXX environment variable.

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

print("Execution finished")
Execution finished
```

reef ivy
#

Check out the FusionX models too. I haven't tried them personally, but they have CausVid and all that stuff merged in. This is for 14B btw.

main spear
#

I think I got this working now: I just added a call to oneAPI's setvars.bat in activate.bat

somber trellis
#

@earnest grotto

#

IndexTTS post that was in Chinese

reef ivy
#

If it only needs the call, then they added the needed files to the driver (which is what I thought, but I wasn't sure since I had already done it lol)

earnest grotto
earnest grotto
#

I wonder if I should look into how comfy does inference and try to shove it into that. only the transformer, DINOv2, SigLIP and the IP-Adapter are still done by diffusers; everything else is native
and for the better: flan gave me a prettier result, at least for this seed. regular t5 was having finger issues and the bg was a bit worse

somber trellis
#

I wonder how versatile it is

earnest grotto
somber trellis
#

Actually lemme set it up my end

earnest grotto
#

pretty sure all the -ID things are for faces specifically

reef ivy
#

Has anybody tried VACE or Phantom for Wan 2.1 with a single frame? One of them might work decently for consistent characters.

main spear
#

With the 1.3B VACE t2v I can get pretty good quality in about 10-20 min; the 14B VACE takes about 40 min to 1 hr, but the LoRAs I use work with VACE even though they don't specifically say so on Civitai. I also use it in conjunction with CausVid. I've been trying to get it to work with TeaCache and torch.compile, but sometimes it works and sometimes it doesn't, and when it does work it doesn't really shorten anything. I think that's partly because CausVid already shortens it as much as it can be, but I'm still tinkering with it

reef ivy
wicked fulcrum
main spear
earnest grotto
#

What do you mean by doesn't work: broken result or errors?
IIRC, teacache and causvid might not mix

#

Also there's magcache which should be better than teacache

#

(Not that it would mix if it doesn't, but it'd be better in cases where teacache is effective. I haven't tried magcache)

main spear
#

I'll try to recreate the error it gives me. I think it's a memory problem, but idk. I asked Grok to see if it was fixable, but that ended up causing other errors, so I reinstalled Comfy and haven't used Wan 2.1 since. I have 48GB of RAM and I'm using the Arc B580 with 12GB of VRAM; I can see it uses the VRAM for most processes, and occasionally it'll use the CPU for other things like the positive and negative prompts

main spear
#

Okay, maybe it was a compatibility issue with nightly, I installed the stable version in your installer and the issue isn't appearing anymore

reef ivy
#

Some nightly builds can randomly break, since it's being actively developed every day

#

2.7 stable should support torch.compile now iirc

main spear
#

specifically when using the WanVaceToVideo node

#

which was weird because I didn't even have it connected to that node

reef ivy
#

Native nodes? Or the wanwrapper? For wan wrapper you have to do a minor code edit.

reef ivy
# main spear The wanwrapper

#1193952640225267802 message follow those instructions if you are on A series, battlemage shouldn't need it but maybe try it anyway. If you are on battlemage probably post your errors here and maybe one of us can help.

#

Also, you can try the latest nightly build and it might be fixed now if it was working before with the wrapper.

somber trellis
wicked fulcrum
#

WAN 2.1 Vace 14B GGUF Q3 - 512x512, 30 steps. Takes 35-40 minutes on A770

#

This is using PyTorch 2.7 without IPEX and the AI Playground install. Wondering if this is in line with what people are experiencing for times: 70.46 s/it. Video gen is pretty good considering the lower res, but it's slow. Any thoughts?
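A quick sanity check on those numbers: sampling time is roughly s/it × steps. This is plain arithmetic; it does not model loading, text encoding or VAE decode, and `overhead_s` is a hypothetical knob for that extra time.

```python
def sampling_minutes(sec_per_it: float, steps: int, overhead_s: float = 0.0) -> float:
    """Rough wall-clock estimate: per-step time x steps, plus a fixed overhead."""
    return (sec_per_it * steps + overhead_s) / 60.0


# 70.46 s/it at 30 steps (Wan 2.1 VACE 14B Q3, 512x512, A770, as reported above)
print(f"{sampling_minutes(70.46, 30):.1f} min")  # -> 35.2 min
```

So sampling alone accounts for about 35 minutes, consistent with the 35-40 minutes reported; lowering s/it or the step count is where the speedups discussed in this thread come from.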

earnest grotto
#

I don't do videos, but there are a lot of speedups you can employ. Notably:
causvid - can be a lora kinda like lcm extracted from the finetune, or a finetune
teacache/magcache - those also apply to flux and other purely image models, decent speedups, magcache is a little newer and might not be as supported yet
self-forcing - basically cfg+step distillation, like flux schnell
I have this link for a self-forcing lora. No idea how effective it is
https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors

#

teacache slightly lowers quality (and might break if used with the other speedups). magcache supposedly retains much more quality with similar speedup

wicked fulcrum
#

Using teacache now and not sure it is speeding it up. Need to test to be sure

reef ivy
#

TeaCache will speed it up, and so will torch.compile. TeaCache works by skipping steps, so it speeds up as it goes; it usually starts after the third step
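A toy sketch of the caching idea, under a simplified rule: accumulate how much the model input changes from step to step, and only run the full model once that accumulated change crosses a threshold, reusing the cached result otherwise. The real TeaCache measures a timestep-modulated input and rescales the distance with a fitted polynomial; `teacache_schedule`, `threshold` and `warmup` here are hypothetical names, not the node's API.

```python
def rel_l1(a, b):
    """Relative L1 distance of vector a from vector b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b)
    return num / den if den else float("inf")


def teacache_schedule(inputs, threshold=0.1, warmup=3):
    """For each step, True = run the full model, False = reuse the cached output.

    Simplified rule: accumulate the step-to-step relative change of the input and
    recompute only once the accumulated change reaches `threshold`.
    """
    decisions = []
    accumulated = 0.0
    previous = None
    for step, current in enumerate(inputs):
        if step < warmup or previous is None:
            run_model = True  # always compute the first few steps
        else:
            accumulated += rel_l1(current, previous)
            run_model = accumulated >= threshold
        if run_model:
            accumulated = 0.0  # a fresh full pass resets the budget
        decisions.append(run_model)
        previous = current
    return decisions


# Toy 1-D "inputs" that change a lot early on and very little later
steps = [[1.0], [0.5], [0.3], [0.29], [0.28], [0.27], [0.2]]
print(teacache_schedule(steps, threshold=0.1))
# -> [True, True, True, False, False, True, True]
```

Raising `threshold` skips more steps, trading quality for speed.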

#

Also there is the merged fusion model that can run at fewer steps with even better quality (have not tried it myself yet though)

rustic sonnet
somber trellis
#

These lower GGUF quants for these models tend to actually be slower.

#

Q8 is always the way to go for image/video models.

#

also use teacache

#

In other news.

#

flux kontext dev released

rustic sonnet
#

So if I want to use ComfyUI on a Meteor Lake iGPU, what would the steps be to install it?

#

With the 4GB issue mitigation i mean

wicked fulcrum
somber trellis
wicked fulcrum
# somber trellis With reserve vram properly set, this actually doesn't seem to be the case. In fa...

Hmm, interesting. That's not been my experience, but I've been focused elsewhere the last couple of months. At least with drivers from earlier this year, my experience was that when models are at the edge of GPU memory and shared memory kicks in, the copying interrupts GPU compute and can slow down inference 10x... with reserve vram set.
But I'll try models that use more VRAM and revisit my assumptions. Thanks for the tip

somber trellis
#

Originally on 2.3.1 I used a --reserve-vram of 4.0, but as the drivers and comfyui itself changed I've had to increase it.

#

It was 4. Then 6.

#

Now it's 8.

#

Okay. Wow.

earnest grotto
earnest grotto
wicked fulcrum
somber trellis
rustic sonnet
#

185H

#

Got 32GB ram

somber trellis
#

so i just bigbrained something

#

combining both redux and flux kontext is pretty powerful

earnest grotto
#

are you asking because you want to try kontext?

somber trellis
#

im testing it out right now

wicked fulcrum
#

@rustic sonnet As Vik said, either follow his instructions to install ComfyUI or install AI Playground. MTL-H should work with most models. The 4GB memory allocation limit is an Alchemist thing, and A-series GPUs like the Arc in MTL-H have it. SDXL, Flux, LTXV and WAN 2.1 should all be good

Note that AI Playground provides ComfyUI as a backend that AI Playground launches. With AI Playground running, you can open ComfyUI in a browser by going to localhost:49000.

You can change the version of ComfyUI in the AI Playground backend manager. I believe by default it installs 3.3; you can update to the latest, which is 3.42. Doing this can break some AI Playground features, e.g. LTX-Video image-to-video requires you to add a field value to its workflow

Any time you reinstall ComfyUI or AI Playground, it wipes out ComfyUI, so back up or move the ComfyUI models directory before reinstalling through AI Playground

earnest grotto
somber trellis
#

bruh

#

Flux Kontext Dev + Flux Redux = This

#

The prompt

#

was literally just

#

3D Hyperrealism

somber trellis
rustic sonnet
earnest grotto
#

hmm
i expect a single image to take 5 minutes with that... or more?

#

teacache probably doesn't support it

#

yet

somber trellis
#

in fact it's already been quantized to q8

#

and works with most of the same nodes normal flux dev would

#

including redux

#

just like normal flux you can use dynamic thresholding

#

allowing you to use negative prompts with flux at the cost of speed

wicked fulcrum
somber trellis
#

Definitely works

#

seems whatever transition node i used didnt line em up proper

somber trellis
#

oh i shoulda mentioned i did it in cfg 2 instead of 2.5

#

its why it looks more grainy and faded

earnest grotto
somber trellis
#

sometimes it gives weird odd-integered resolutions

earnest grotto
#

that's probably why your images shift a bit

somber trellis
#

I probably should use the upscale to closest SDXL resolution node

#

that would instantly fix the problem

earnest grotto
#

comfy will automatically pad if it's not perfectly divisible by 8

somber trellis
#

i bet they could've made a gordon much closer to the artwork if they'd just put more work into his model

earnest grotto
#

this can sometimes result in grey lines with sdxl. with flux, i dunno what the vae decodes that to but probably also greyish or pinkish
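To see how much filler a given size needs, here is a small sketch assuming padding to the next multiple of 8 as described above (8 being the VAE downscale factor for SD 1.5, SDXL and Flux). `padded_size` and `snapped_size` are hypothetical helpers; resizing to a multiple of 8 yourself beforehand avoids the filler entirely.

```python
def padded_size(width: int, height: int, multiple: int = 8):
    """Round each side up to the next multiple; return new size and padding added."""
    def round_up(n: int) -> int:
        return -(-n // multiple) * multiple  # ceiling division without floats

    new_w, new_h = round_up(width), round_up(height)
    return (new_w, new_h), (new_w - width, new_h - height)


def snapped_size(width: int, height: int, multiple: int = 8):
    """Largest size at or below the input that needs no padding at all."""
    return (width // multiple * multiple, height // multiple * multiple)


print(padded_size(1023, 769))   # -> ((1024, 776), (1, 7))
print(snapped_size(1023, 769))  # -> (1016, 768)
```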

somber trellis
#

this repo has a resize node which automatically upscales/downsamples images to the closest sdxl resolution
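The idea behind such a node can be sketched as: pick the SDXL resolution whose aspect ratio is closest to the input's, then resize to it. The list below is the commonly cited set of ~1 MP SDXL resolutions; treat it as an assumption, and note this is not that repo's code.

```python
import math

# Commonly cited ~1 megapixel SDXL resolutions (assumed list, width x height)
SDXL_RESOLUTIONS = [
    (1024, 1024),
    (1152, 896), (896, 1152),
    (1216, 832), (832, 1216),
    (1344, 768), (768, 1344),
    (1536, 640), (640, 1536),
]


def closest_sdxl_resolution(width: int, height: int):
    """Return the listed resolution with the closest aspect ratio (compared in log space)."""
    target = math.log(width / height)
    return min(SDXL_RESOLUTIONS, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


print(closest_sdxl_resolution(1920, 1080))  # -> (1344, 768)
print(closest_sdxl_resolution(3000, 2000))  # -> (1216, 832)
```

Comparing aspect ratios in log space treats 2:1 and 1:2 symmetrically.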

#

also these nodes are good for memory control

somber trellis
#

its quite a bit slower

#

but you get far more control

reef ivy
#

Oof, a little too much for me. I guess for editing it would be worth it though

somber trellis
#

simply allows you to use cfgs higher than 1

#

and for normal cfg models it can allow you to go above normal cfg limits

#

effectively operating as an anti-burn node

reef ivy
#

Do the Hyper LoRAs work? The ones for lower steps, or whatever they're called

somber trellis
#

On Flux Kontext? No clue.

reef ivy
#

With flux i get images pretty quickly with minimal quality loss using those turbo loras etc.

somber trellis
#

Which turbo lora though

#

The Flux Alpha Turbo Lora?

#

Or Hyper Flux?

reef ivy
#

I honestly forget, I tried both and one was better than the other. Will have to check later

#

Been a minute since i used it

somber trellis
#

much better

tough wharf
#

Is it just me? After updating the driver on my A770 16GB from 6734 to 6881, all my T2I generations turned to shit.
HOWEVER, video generation has sped up massively:
6881: around 8s/it - 10s/it.
6734: around 80s/it - 200s/it
(note: Im using the new Self Forcing model from Kijai)
Wth happened (im not complaining tho)

earnest grotto
#

you were running out of vram

#

you are not running out of vram now

#

keep track of your vram usage, use smaller models if you're running out

#

most programs use vram. they use a little, but if you have a lot of chromium tabs open those can start to add up to ~1gb

#

discord also uses something like 400mb I think?

tough wharf
earnest grotto
#

yes

#

you are not running out now

#

you probably were before

tough wharf
#

Very nice, Kudos to Intel for this update 😭 🙏

earnest grotto
#

show what's up with the images

tough wharf
earnest grotto
#

Once shared GPU memory goes above 0, yeah, you're running out

tough wharf
#

This is now with Euler A sampler, it fixed the artifacting issue

#

I havent changed the settings on this workflow apart from the sampler

earnest grotto
#

use the tiled ksampler custom node, after upscaling with some esrgan model first; 4x-ultrasharp2 is recent and good

tough wharf
#

This one?

earnest grotto
#

either of them

#

whichever one

tough wharf
#

Ill make a sample workflow

earnest grotto
#

your image -> upscale image using model (the model being 4x-ultrasharp2) -> tiled vae encode -> tiled ksampler

#

wonder if kontext also benefits from flan t5

tough wharf
#

I feel like my GPU is not being fully utilized?

earnest grotto
#

Compute dropdown

#

What ARE these goggles man

earnest grotto
tough wharf
earnest grotto
#

sorry, i was a bit vague

tough wharf
earnest grotto
#

depends on many things

#

i don't want to oom right now so i've opted for q5 which slows it down

#

actually what model even

tough wharf
#

Oh it seems to be utilizing it then

earnest grotto
#

i'm doing flux kontext rn

tough wharf
tough wharf
#

It took 15mins ICANT

earnest grotto
# tough wharf LMAO

You essentially just made a brand new image at an oversized resolution, and that's what happens when you do that (with sdxl/1.5).
Lower the denoise, e.g. to 0.5, and the steps too, since you won't need as many. Your steps are already kinda low though; generally 28 is good, so i'd go for 14
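Why fewer steps are fine at lower denoise: as far as I understand ComfyUI's KSampler, with denoise below 1 it builds a schedule for int(steps / denoise) steps and only runs the last `steps` of them. A sketch of that bookkeeping (`denoise_window` is a hypothetical helper, not a ComfyUI function):

```python
def denoise_window(steps: int, denoise: float):
    """Return (length of the full schedule, indices of the steps actually run)."""
    if denoise <= 0.0:
        return 0, []                  # nothing to denoise
    if denoise >= 0.9999:
        total = steps                 # full denoise: run the whole schedule
    else:
        total = int(steps / denoise)  # stretch the schedule...
    start = total - steps             # ...and skip its noisy beginning
    return total, list(range(start, total))


total, run = denoise_window(14, 0.5)
print(total, run[0], run[-1], len(run))  # -> 28 14 27 14
```

So steps=14 at denoise 0.5 still runs 14 steps: the second half of a 28-step schedule, the same noise levels as the tail of a normal 28-step run.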

#

different tiling strategies are faster or slower but have drawbacks

#

random is slowest but generally looks the best

#

you can also bump up the tile sizes to 1024x1024 since sdxl

#

ah, i didn't see there's nothing before it

#

you need to generate some image, use the tiled ksampler to upscale it, not to generate it outright
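For a feel of why upscaling by a large factor gets slow: each tile is its own sampler pass, and the tile count grows with the square of the upscale factor. A rough count, assuming square tiles with a fixed overlap (the Tiled KSampler node's own tiling strategies lay tiles out differently; `tile_grid` and `overlap` are hypothetical here):

```python
import math


def tile_grid(width: int, height: int, tile: int = 1024, overlap: int = 64):
    """Count overlapping tiles needed to cover an image (cols, rows, total)."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")

    def count(n: int) -> int:
        if n <= tile:
            return 1
        stride = tile - overlap
        # k tiles cover k * stride + overlap pixels; solve for the smallest k
        return math.ceil((n - overlap) / stride)

    cols, rows = count(width), count(height)
    return cols, rows, cols * rows


for side in (1024, 2048, 4096):  # SDXL base size, then 2x and 4x upscales
    print(side, tile_grid(side, side))
```

With 64 px overlap that is 1, 9 and 25 sampler passes respectively, which is why a 4x upscale followed by tiled sampling can take many times longer than a 2x one.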

earnest grotto
tough wharf
earnest grotto
#

yes

tough wharf
#

But image generation still seems to have really taken a hit after the update to the 6881 driver.
This is a workflow I used for quick test images before I updated my driver; it usually finishes in under 1 min

#

this took 200s

earnest grotto
# tough wharf I heard Flux is mainly used for realistic images Im not particularly interested ...

Ideally, a good model will be able to do any task, and won't be "realistic" or "anime". Sadly we're not there yet. Any base model will skew towards being able to only do realistic stuff competently. With loras you can kinda get an actually decent style but it still lacks a lot of other knowledge you'd be getting from an anime finetune.
Inpainting models however are a massive step up in understanding, due to the context of the rest of the image being inpainted. Flux fill can inpaint pretty well.
And evidently kontext is also not too bad
Base SDXL is worse than flux.
Expect a Lumina 2 finetune by onoma (illustrious) soon™ and sd3.5 large by cagliostro (animagine) "q1 to q2 this year" (april to september)

earnest grotto
tough wharf
tough wharf
earnest grotto
#

many ways to get it faster or slower

#

i am using a q5 quant because i don't want to oom. that makes it slower than q8 or fp8

#

you can just do it without ooming but doing larger images+cfg starts eating vram

#

flux can do larger or smaller images without breaking, unlike SDXL, but only up to a certain point. since flux came out I've seen at least 1 paper with a method that gets models to generalize to other resolutions even better than rotary embeddings, so that's something you can expect from future models

#

I'm not using teacache, yet, since i just wanted to see if it works

#

i get 10s/it

tough wharf
earnest grotto
#

usually teacache is a ~2-3x speedup

tough wharf
#

I tried a workflow back then using TeaCache with Wan2.1-14B Q5_K_M. It took around an hour, I think, and then it crashed because of OOM

#

I think I will settle for self-forcing models for now, since they're fast,
but the downside is they're 1.3B, so there aren't many LoRAs I can play with

#

I hope they release a wan14b self forcing soon

earnest grotto
#

5.45s/it with q8

#

should try q4

#

with q8 and teacache, 2.6s/it

#

but on the brink of ooming
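Putting those readings next to the earlier "~2-3x" estimate. This is plain arithmetic on the numbers reported here; `steps=20` is an arbitrary example.

```python
def speedup(before_s_per_it: float, after_s_per_it: float) -> float:
    """How many times faster the second configuration is."""
    return before_s_per_it / after_s_per_it


def seconds_saved(before_s_per_it: float, after_s_per_it: float, steps: int) -> float:
    """Sampling time saved over a run of `steps` iterations."""
    return (before_s_per_it - after_s_per_it) * steps


# Q8 Kontext without and with TeaCache, as reported above
print(f"{speedup(5.45, 2.6):.2f}x")                                  # -> 2.10x
print(f"{seconds_saved(5.45, 2.6, 20):.0f} s saved over 20 steps")  # -> 57 s saved over 20 steps
```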

earnest grotto
earnest grotto
#

On more testing, q4 is a bit too fried

reef ivy
#

Seems arc is slower when the model is not swapping to system ram, especially with gguf models

#

Or try Kijai's nodes with my little code edit and use block swap

#

#1193952640225267802 message

#

Also can't use gguf with kijai but 16gb should be enough for fp8 models

tough wharf
#

Like I said, after the update to 6881 it seems to have fixed the issue of video generation slowing down or just going OOM

tough wharf
# tough wharf this took 200s

What I'm having a problem with now is image generation. If you scroll back a little, I mentioned that the upscale nodes have gotten slow (5s/it)
Idk how to explain it any further than that
It's like that with my I2I and T2I workflows since the update to the 6881 drivers

#

tho, Im not complaining since I want to generate videos too with my A770 😅

#

idk If I can share the videos I generated here, its a bit nsfw

tough wharf
reef ivy
#

Yeah, not the place for nsfw stuff. I will check out comfy again soon and see what has changed.

earnest grotto
#

Seems CFG is needed to make anime goggles

somber trellis
#

i hope you know who egad is

somber trellis
#

LLMs are generally fine at 6_K, any lower and they start to get that lobotomized feel

#

I don't know if 6_K is any good on visual models

#

I think it's best if we just wait for a low-perplexity, low-bit quantization method like BitNet b1.58, but without the need for a full architecture re-train

#

that will revolutionize our capabilities

#

low-bit quants like 1.58-bit will allow us to run much larger models, which in turn lowers perplexity, since perplexity drops as parameter count increases

tough wharf
# reef ivy Try --reserve-vram 7

Just tried this; it seems to have fixed the issue of USDU being slow, wth
I also tried it with V2V self forcing (I haven't noticed anything different with the --reserve-vram parameter)

#

Thank you Vik, Aaron
Im back on the game now

earnest grotto
queen remnant
#

any time i try to update comfyui either from comfyui manager or with the script it does this and never updates

#

i haven't made any manual changes myself and i've actually had to delete and reinstall it to get it to update in the past

earnest grotto
#

type in git stash and press enter

#

In a command prompt in comfyui's folder
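The same fix scripted, as a sketch: `git stash` shelves the local changes that block the update (they can be restored later with `git stash pop`), then a pull fetches the new ComfyUI. The `comfy_dir` path is a placeholder, and pulling `origin master` is an assumption about your remote and branch names. With `dry_run=True` it only returns the commands.

```python
import subprocess
from pathlib import Path


def update_commands(remote: str = "origin", branch: str = "master"):
    """Git commands to shelve local edits, then update the checkout."""
    return [
        ["git", "stash"],                 # move local modifications out of the way
        ["git", "pull", remote, branch],  # fetch and merge the latest upstream commits
    ]


def update_comfyui(comfy_dir: str, dry_run: bool = True):
    """Run the update commands inside the ComfyUI folder (or just list them)."""
    cmds = update_commands()
    if not dry_run:
        for cmd in cmds:
            # check=True raises CalledProcessError if git reports a failure
            subprocess.run(cmd, cwd=Path(comfy_dir), check=True)
    return cmds


print(update_comfyui(r"C:\path\to\ComfyUI"))  # placeholder path; dry run only
```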

#

I'll see why the script struggles to do that later

valid escarp
#

any good youtube vids or courses to get started learning comfyui?

queen remnant
#

i just deleted the file it was whining about and that seemed to get it to update, but i'm guessing the comfy intel install script makes some changes to that file

earnest grotto
#

I don't like most youtube videos on this topic.

valid escarp
#

thank you

earnest grotto
#

Also normal flux dev loras might also work

earnest grotto
#

wonder why bfl didn't train it with flan

earnest grotto
queen remnant
#

strange

#

i could try re-installing git to see if that fixes it

#

idr when i installed it

earnest grotto
#

what does the script say

#

how do you know it doesn't work

earnest grotto
# somber trellis I don't know if 6_K is any good on visual models

Down to Q4 has been fine for me with regular flux. aaron did some tests too. q3 starts to have noticeable artifacts and q2 is completely broken. Visual artifacts from too much quantization usually seem to show up as blurriness, broken edges and a lack of texture
The 2 images I posted were with Q5
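A rough way to see what each quant level costs in memory: size ≈ parameters × bits per weight / 8. Q8_0, Q5_0 and Q4_0 store blocks of 32 weights plus a 16-bit scale (Q5_0 also stores the extra high bits), which works out to 8.5, 5.5 and 4.5 bits per weight; K-quants like Q4_K_M mix precisions, so treat them as approximate. The 12B figure is Flux.1 [dev]'s published parameter count; everything else is arithmetic, not measured file sizes.

```python
# Bits per weight: bf16 is 16; GGUF legacy quants store 32-weight blocks plus a
# 16-bit scale (Q5_0 also stores 32 extra high bits), hence the fractional values.
BITS_PER_WEIGHT = {"bf16": 16.0, "Q8_0": 8.5, "Q5_0": 5.5, "Q4_0": 4.5}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"Flux.1 dev (12B) {name}: ~{weights_gb(12, bpw):.2f} GB")
```

That gives roughly 24, 12.75, 8.25 and 6.75 GB. Text encoders, the VAE and activations come on top of the weights, which is why a model that "fits" on paper can still spill into shared memory.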

reef ivy
#

In all my testing so far (LLMs, video and image models), Q4 is the cutoff for good quality. It has a slight degradation that usually isn't noticeable without a comparison, but once you go down to Q3 it gets burnt. However, the Fusion Wan model may have decent quality at Q3, but I have not tested it myself.
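To illustrate why quality falls off quickly at low bit counts, here is a simplified experiment: symmetric absmax quantization of random Gaussian "weights" in blocks of 32. The 8-bit case roughly matches how Q8_0 works; the lower bit counts are simplified symmetric variants, not the actual Q5_0/Q4_0 or K-quant layouts. Each bit removed roughly halves the number of levels, so the error roughly doubles.

```python
import math
import random


def quantize_block(block, bits):
    """Symmetric absmax quantization of one block (bits >= 2), returned dequantized."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    amax = max(abs(v) for v in block)
    if amax == 0:
        return [0.0] * len(block)
    scale = amax / levels  # one scale per block
    return [max(-levels, min(levels, round(v / scale))) * scale for v in block]


def rms_error(weights, bits, block_size=32):
    """Root-mean-square error after quantizing in blocks of `block_size`."""
    total = 0.0
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        restored = quantize_block(block, bits)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return math.sqrt(total / len(weights))


rng = random.Random(0)  # fixed seed so the run is reproducible
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
for bits in (8, 6, 5, 4, 3, 2):
    print(f"{bits}-bit: RMS error {rms_error(weights, bits):.2e}")
```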

earnest grotto
#

for flux kontext specifically i was having some issues with q4

#

i also haven't tried the 2 sd 3.5s and their quants

earnest grotto
#

Now on windows, Q8 kontext again. Without reserve-vram, it spills out ~1.5gb into shared memory and ends up with ~19s/it

#

with --reserve-vram 15, I'm at 4.7gb of vram and 7.45s/it

#

According to task manager. this seems pretty wrong but oh well

#

--reserve-vram 7, 11.4 gb used and 6.46s/it

#

--reserve-vram 3, 15.6gb used and approaching those linux speeds at 5.6s/it

#

also stumbled on a lucky seed where the goggles finally look like goggles without needing cfg

#

@reef ivy Do you wanna test a few different --reserve-vram values and note the flux quant you're using and the speed and vram usage you get? (+how much shared memory is used if it is)

#

seems to me more like it helped you not run out of vram, and with that, speed improved

#

I suspect reserve-vram 11 will give me about 8gb usage, let's find out

#

8.4, about there. 19-x it is
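The measurements above, lined up against the "19 - x" rule of thumb. The numbers are the Task Manager readings reported in this thread for Q8 Kontext on one Windows machine; the 19 GB ceiling is specific to that setup and workflow, not a general constant. `fitted_ceiling` just averages `used + reserve`, which is the least-squares fit for a constant ceiling.

```python
# --reserve-vram value (GB) -> dedicated VRAM reported by Task Manager (GB)
MEASURED = {15: 4.7, 11: 8.4, 7: 11.4, 3: 15.6}


def estimated_vram_gb(reserve_gb: float, ceiling_gb: float = 19.0) -> float:
    """Rule of thumb from the thread: usage ~= ceiling - reserve."""
    return ceiling_gb - reserve_gb


def fitted_ceiling(measured: dict) -> float:
    """Best constant ceiling in the least-squares sense: mean of (used + reserve)."""
    return sum(used + reserve for reserve, used in measured.items()) / len(measured)


for reserve, used in sorted(MEASURED.items()):
    est = estimated_vram_gb(reserve)
    print(f"reserve {reserve:>2}: measured {used:4.1f} GB, estimate {est:4.1f} GB ({used - est:+.1f})")

print(f"fitted ceiling: {fitted_ceiling(MEASURED):.2f} GB")
```

The rule stays within about 0.7 GB of every reading, and the fitted ceiling comes out at about 19 GB.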

reef ivy
#

I may take some time to test it, but probably not soon. I will say the latest drivers and PyTorch have changed things a lot; I guess they are working on how memory is managed. Honestly, if we could get a native block swap it would be great for Intel users

somber trellis
earnest grotto
#

Updated to latest driver, not seeing any change in either vram usage or speed

somber trellis
#

I actually found a repo that implements some level of bitsandbytes support on top of XPU

#

but it does seem to still work on the current nightlies

#

tested it via the repo

#

i dont know if it will be beneficial however

#

I looked further into the bitsandbytes issue

#

and theyre working on a multi-support backend (pytorch custom operator integration)

#

minimal requirements are promising

#

bnb 4-bit quantization and dequant as the minimum requirements for this

#

nvm i found the multi backend refactor

#

this might actually have full-on support as an experimental backend

#

i like how i bring this up now

#

and i search this discord channel and find 10 other people mentioning the same link

#

I got it to build from source just fine

#

torch 2.8.0 nightly

#

built bitsandbytes-0.47.0.dev0

#

oh hey i dont get the cuda121.dll error anymore

rose fern
reef ivy
rose fern
earnest grotto
#

Hmm... Having mixed results anime-ifying images with kontext

earnest grotto
#

It loves to remove large HUD elements sometimes 🤔

somber trellis
#

Since you're getting additional context with it via a clip vision model

#

With a node like this, you can control the strength of it too to prevent it from overwriting everything else.

#

kratos with the wojak lora worked pretty well

#

i mean it works pretty well in general (thats me)

earnest grotto
#

a bit too detailed imo

somber trellis
earnest grotto
#

and not enough wacky faces

somber trellis
#

and super simple wojaks

#

i mean i can go full DURR

earnest grotto
#

i know but, it's detailed in the wrong way

#

doesn't feel right

somber trellis
#

scarface

#

I think it did great on walter

#

but it kept his coat and hat the same

queen remnant
# queen remnant any time i try to update comfyui either from comfyui manager or with the script ...

interesting i got an OOM error and saw it was fixed in the next release, but this time trying to update through comfyui-manager gives me a completely different error:

```
...
RuntimeError: Native API failed. Native API returns: 38 (UR_RESULT_ERROR_OUT_OF_HOST_MEMORY)

Prompt executed in 00:13:14
[ComfyUI-Manager] Failed to checkout 'master' branch.
repo_path=F:\AI-NVMe\Comfy_Intel\ComfyUI
Available branches:
        master
ComfyUI update failed

[ComfyUI-Manager] Queued works are completed.
{'update-comfyui': 1}

After restarting ComfyUI, please refresh the browser.
```

sorry for interrupting lol

earnest grotto
somber trellis
#

RuntimeError: Native API failed. Native API returns: 38 (UR_RESULT_ERROR_OUT_OF_HOST_MEMORY)

earnest grotto
#

If you were not looking at your vram, then you ran out

queen remnant
#

yea

earnest grotto
#

Don't run out

somber trellis
#

im still running the 2.8 nightlies

#

because it's still faster

#

than 2.7.10 ipex

#

anything under --reserve-vram 8 for me causes problems

#

for really big loads

earnest grotto
#

Use --reserve-vram x to use less vram, x being some number. my estimate: 19-x GB of actual vram usage with a simple kontext workflow, down to about 4.7gb of vram at 15. obviously slower, but not by much. stick with q8

queen remnant
#

hence why i was trying to update from 0.3.42 to 0.3.43, i was trying to get kontext to work and it ate huge chunks of memory every iteration until it eventually ran out

somber trellis
#

is the most stable out of all of them for anything you wanna nitpick with

earnest grotto
#

kills sdxl performance

somber trellis
#

and the performance loss is around 40% compared to --reserve-vram 6.0

somber trellis
#

for some reason as well btw

#

2.7.10 really doesnt like big workloads

#

even with --reserve-vram 8 it ends up stuttering my PC at high vram usage

#

idk about 2.3.1 since i havent touched it in forever

queen remnant
#

this was on top of all of my vram being used on my a770 😭

somber trellis
#

kontext dev?

earnest grotto
#

e.g. you probably don't need the t5 at fp16/bf16. or can use a q8 gguf of it

somber trellis
#

i use the q8 t5

#

i also use the text-enhanced clip_L

queen remnant
#

just an example workflow with the model loader swapped out with a unet loader for a q4_k_m gguf

earnest grotto
#

welp... close some browser tabs i guess

#

there are smaller quants of the t5

#

imo q4 kontext gets a bit fried

queen remnant
#

im certain its an issue with comfyui itself, i was just having trouble updating is all

#

ran git pull origin master and that seemed to do something so i'll update here if its fixed after running the script again

earnest grotto
#

running out of ram is a buy more sticks, close other programs or use smaller models issue

earnest grotto
#

I don't actually mind that it removed the binos

#

I do mind that it looks fried

#

Some different prompting (russian soldiers in a desert, blablabla) got it to be a bit less fried

#

am testing colorization rn. i think they didn't train it to colorize with colored dots which is pretty sad

#

mmm... driver crashed

somber trellis
#

💥

earnest grotto
#

If the gif compressed it too hard, this is the original

queen remnant
#

hes so handsome

earnest grotto
#

Various prompts and these dots, vs colors in prompt

earnest grotto
#

really getting the feeling that this needs more training, loras.

somber trellis
earnest grotto
#

are you making the images yellow on purpose or is this just an ironic moment for bfl

#

90% sure kontext's license should also have that little clause about not training on outputs...

somber trellis
#

I don't see the yellowing you speak of.

earnest grotto
earnest grotto
earnest grotto
earnest grotto
somber trellis
earnest grotto
#

i've played skyrim, i know, but it's yellower

somber trellis
#

well that image had no LUTs or filmgrain applied to it

earnest grotto
#

they're all slightly yellow or brown

somber trellis
#

it was a direct kontext dev output

earnest grotto
#

i'm not saying you did anything, i'm implying bfl trained on chatgpt outputs

somber trellis
#

prompt biases

earnest grotto
#

you could probably just make the negative be "make the image yellow" and I'd expect that to get rid of it

#

but it's a peculiar sight nonetheless

earnest grotto
#

@fleet cape What do you mean by "optimized"

somber trellis
#

im going back to linux again

#

bluescreen galore on windows as of recent

late glen
#

What version of torch are you all using? I can't seem to get any version of Flux Kontext, including quants, working on my A770. I've been using 2.7.1 + IPEX; is this not advised?

earnest grotto
late glen
somber trellis
earnest grotto
#

both windows and linux

somber trellis
#

hmm

#

It errored out on endeavourOS (ArchLinux) for me

earnest grotto
#

with what

somber trellis
#

i assume it works with normal windows and debian

#

Like why the hell is that bot able to remove pastes here

#

in a support and questions channel

earnest grotto
#

dm it

somber trellis
#

nvm

#

missing prerequisites for the script lmao

#

works now

earnest grotto
#

i disabled 2.3 for linux because it was broken for me and i didn't figure out the issue at the time. in case you're wondering why it's not there

somber trellis
#

even with proton-ge, mordhau gets less than half the framerate it does on windows 11

#

and that is one of the games i play most

earnest grotto
#

yeah that do be the linux gaming performance with more demanding games

#

with arc

somber trellis
#

windows really do be the only choice for games of that caliber huh

#

i am sad

#

id post sad seal gif but tenor doesnt work in here so

#

i do this every 6 months excited about trying linux again

#

only for something really stupid to break like openssl

#

that even pacman-static is like "hell nah i aint touchin this"

earnest grotto
#

i am even sadder about kontext

#

undertrained, over-lobotomized, or idk what they did but i'm losing hope. an ad for their training service? an ad for pro?

#

it's like altering/removing text and watermarks is the only thing it can do well

#

i haven't tried virtual try-on/swapping clothes/etc., but given I saw there's a lora for that on civit, that gives me the feeling it's bad at that too

somber trellis
#

Its not the greatest model no

#

its a flux that is worse at text to image but is better at image to image and not even by much

#

we already have closed-source models that outpace it

#

but ye 🤷‍♂️

earnest grotto
# somber trellis its a flux that is worse at text to image but is better at image to image and no...

It's that but also not merely that. They pretty clearly trained a model that competes with 4o on 4o outputs, all while still having this nice little tidbit in their license: You may not use the Output to train, fine-tune or distill a model that is competitive with the FLUX.1 [dev] Model or the FLUX.1 Kontext [dev] Model
If the model is undertrained, them (fal, but still) having a lora trainer service ready on day one, not even training code, is even worse. We're supposed to train it ourselves to a decent state, and at that, who knows, maybe they'll get to make a better dataset from what the community finetunes it/makes loras with? For their whole spiel about NSFW in the license, I got an almost NSFW output because it's so bad at anime-ifying a furry rabbit person it made that into a regular skin person, not that I care about NSFW but evidently they do.

#

In the meantime, there was a new 3B model that popped up, Ovis-U1, claiming to be able to describe images, do text to image, and do edits
Tried their HF demo with images which IMO kontext anime-ified the best, and... I found out afterwards that it uses the sdxl vae. sad.

#

More yellow. People just can't stop training on 4o

#

I'm still hopeful that maybe there's something I'm missing with kontext but man...

#

I should also try cosxl again, though IIRC it was fairly bad

earnest grotto
#

oh, i should try ovis with kontext's big fails like that hl2 playground screenshot

#

ah, and ostris seems to have local training for it

somber trellis
#

changed up my kontext workflow a bit

#

changed versions of redux to the higher quality reflux redux model

#

initial load image upscale chain

#

dynamic thresholding and teacache for both higher quality and faster processing

#

disabled fluxkontextimagescale node

#

recommended by reddit because it was actually causing worse outputs

#

just gotta limit it to appropriate image sizes under a megapixel
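For the "under a megapixel" rule, a small helper can pick a size before the image goes into Kontext. This is a sketch under stated assumptions: "1MP" is taken as 1024x1024 = 1,048,576 pixels, and both sides are snapped down to multiples of 16, on the assumption that Flux-family models want dimensions that survive the 8x VAE downscale plus 2x2 patching. The helper name is hypothetical; this is not what the Kontext scale node does internally.

```python
import math


def fit_under_megapixel(width: int, height: int,
                        max_pixels: int = 1024 * 1024,
                        multiple: int = 16) -> tuple:
    """Return a (width, height) with roughly the same aspect ratio that fits
    within max_pixels, each side rounded down to a multiple of `multiple`.

    Hypothetical helper; 1024*1024 and 16 are assumptions, not Comfy defaults.
    """
    # Never upscale: a scale factor of 1.0 keeps small images as they are.
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    # Rounding down keeps the result at or below the pixel budget.
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    return new_w, new_h


w, h = fit_under_megapixel(1920, 1080)
print(w, h, w * h)
```

Rounding down rather than to the nearest multiple trades a few pixels for a guarantee that the result never exceeds the budget.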

#

also have two saving options, one that utilizes a LUT and filmgrain and one for clean output

#

also set cfg to 2, guidance to 2.

somber trellis
earnest grotto
#

Something feels uncanny about the lighting but this is definitely better than the overly yellow/brown results

earnest grotto
#

@teal monolith @rustic sonnet Still alive?

somber trellis
rustic sonnet
earnest grotto
rustic sonnet
#

I used Q4 GGUF

earnest grotto
#

Hmm

rustic sonnet
#

How much VRAM does Q4 use for you?

earnest grotto
#

I can make Q8 use 4.7GB. But I also have 48gb of regular RAM to spare

#

Right now, 34gb of RAM + 10.6gb VRAM, with lots of chromium tabs, discord, Noita, buncha things open

#

I'll try Q4 in a bit

rustic sonnet
earnest grotto
#

7gb ram with everything closed. 11.3gb vram and 16.6gb ram with q4_k_m

reef ivy
#

With regular flux, 8gb vram and 32gb ram I have been able to run it.

rustic sonnet
#

Well I've got only 18GB Ram usable (including vram)

earnest grotto
#

what happened to the other 14

reef ivy
#

Yeah, probably not enough. I think my system can reserve 24gb when adding ram to vram

rustic sonnet
#

Well I guess 19GB usable

#

When it hits 20GB the system starts lagging so much

#

And hitting swap memory more

reef ivy
#

Did you set that manually? I get 24gb shared video memory with 32gb ram and 8gb vram

rustic sonnet
#

I didn't set it manually

reef ivy
#

Strange, what gpu do you have?

rustic sonnet
#

I'm trying to run it on my MTL-H iGPU

civic charm
#

Flux Dev with SDNQ UINT4 fully fits into A770's 16 GB VRAM on SDNext with offload mode = none / no offload / everything is on the GPU

#

Only issue is the VAE decode

#

reducing the vae tile size to 512 works
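Lowering the tile size helps because tiled VAE decoding only has to hold one tile's activations at a time, so peak memory during decode scales roughly with tile area. The sketch below only computes overlapping tile boxes for an image of a given size; real tiled decoders (e.g. ComfyUI's VAE Decode (Tiled) node, which exposes tile_size and overlap) work on latents and blend the overlapping seams, so treat this as an illustration of the idea rather than any tool's implementation. Function names and the 64px overlap are assumptions.

```python
def tile_starts(length: int, tile: int, stride: int) -> list:
    """Start offsets along one axis so that tiles of size `tile` cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64) -> list:
    """Return (x, y, w, h) boxes of overlapping tiles covering a width x height image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap
    tw, th = min(tile, width), min(tile, height)
    return [(x, y, tw, th)
            for y in tile_starts(height, tile, stride)
            for x in tile_starts(width, tile, stride)]


boxes = tile_boxes(1024, 1024, tile=512, overlap=64)
print(len(boxes), boxes[:3])
```

Halving the tile side from 1024 to 512 means each decode call sees a quarter of the pixels, at the cost of more calls and some duplicated work in the overlaps.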

reef ivy
earnest grotto
#

my vram usage on windows has also been slightly higher. i could do q8 kontext on linux and not run out, barely 15gb, but same thing on windows and i run out

#

though i also noticed sdxl bumped up to ~2.3it/s 🤔

#

need a bit more testing, that was 832x1216

rustic sonnet
earnest grotto
#

You should be able to get that down to at least 7-8gb, unless there's something you absolutely want open

#

it can get down to ~2-4gb afaik but that needs some debloating

#

and idk, maybe you do use cortana

earnest grotto
reef ivy
#

intel arc windows memory management is terrible. Only --reserve-vram helps or block swap in custom nodes.

somber trellis
#

I mean, this is pretty cool.

#

The ability for it to take multiple images or a spreadsheet as context

earnest grotto
#

It can do much more interesting interactions but you also need to beat it over the head and get lucky, evidently

somber trellis
#

i think using ic-light with it might be a good idea though

earnest grotto
#

Ideally you wouldn't even need ic-light

#

You'd just be able to relight with kontext itself

somber trellis
#

That would require a second pass, though, wouldn't it?

#

Ic-light v1 being sdxl based might be a faster choice

earnest grotto
somber trellis
#

the time it takes to re-run kontext again on my hardware

#

i could just use

#

a flux schnell lora

#

i only say this because it takes 13s/it to run kontext with all the stuff i am using with it.

earnest grotto
#

I get about 1:15 for a 1MP image with teacache. There doesn't seem to be much point in more steps, and cfg doesn't save it when it refuses to work
I think it would take me about as long to load sdxl and gen with it as to gen with the already-loaded kontext

somber trellis
#

flux is inherently designed to only use cfg1 and requires its fluxguidance nodes to handle it

#

it's the same reason negative prompts don't work on it either

#

but yeah

#

only problem is you're doubling inference time with the thresholding node

#

the benefit, however, is that you can now use a negative prompt with flux and also go above 1 CFG.

earnest grotto
somber trellis
#

other than that the model has been editing for me fine

#

i was initially optimistic that this model would be better than base flux

#

but it doesn't seem like it is; it just seems like the same quality level, but far better tuned for image-to-image

earnest grotto
#

Anime-ifying an image. No CFG, 2.0 CFG "make the image brighter", 2.0 CFG "make the image darker". No thresholding or anything else
CFG is just interpolating 2 predicted noises (linear combination but basically that). You don't need any special nodes to do that. Dynamic thresholding might only be better for you because the images are more contrasty.
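The CFG arithmetic described above is one line per element. A minimal sketch with plain Python lists standing in for the two predicted noise tensors (the function name is illustrative): at scale 1.0 the result is just the conditional prediction, so the negative pass can be skipped entirely, which is why anything above cfg 1 roughly doubles the cost per step; above 1.0 the result is pushed past the positive prediction, away from the negative one.

```python
def cfg_combine(cond: list, uncond: list, scale: float) -> list:
    """Classifier-free guidance on flat lists of predicted noise values.

    result = uncond + scale * (cond - uncond)
    scale == 1.0 reduces to `cond`; scale > 1.0 extrapolates away from `uncond`.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same shape")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


positive = [0.2, -0.1, 0.4]   # toy "positive prompt" prediction
negative = [0.1, 0.0, 0.1]    # toy "negative prompt" prediction
print(cfg_combine(positive, negative, 2.0))
```

Whatever goes into the negative slot simply defines the direction being pushed away from, which matches the "make the image darker" negative producing a brighter image.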

somber trellis
earnest grotto
#

When it refuses, there is no trickery you can do with its prediction to make it do what you want

earnest grotto
somber trellis
#

So you have a normal ksampler workflow with those negative prompts in it

#

and it gave you the same outcome as if you would put it in positive?

#

(also this image didn't work properly on this seed)

#

Because I'm not using a samplecustomadvanced node workflow

earnest grotto
#

And the insanely bright image had a negative prompt of "make the image darker"

somber trellis
#

But I thought the point of using DynamicThresholding was because Flux is a distilled model.

earnest grotto
#

It can help, I'm not saying to not use it

#

It will most likely help with higher cfg. I've been sticking with low values

somber trellis
#

I might start doing the opposite and just go for an all-performance workflow

#

flux alpha 8-step lora and all

earnest grotto
#

I do not think your generic negative prompt warrants cfg

#

It's also kinda nonsense for kontext. You'd want it to do the opposite edit, don't just throw words at it, though it does kinda work when you throw words at it

somber trellis
#

the prompt doesn't work at all without it for normal flux, i don't know about kontext

#

also i used redux so i don't need to manually prompt the image, since it uses a clip vision encoder

#

of course, adding a prompt in as well would probably give me better results

earnest grotto
#

my cases where cfg makes it conform better to the positive have been kinda slim. usually i want more conformity when it's failing to edit, but then no cfg will save it

#

it's so wrong i can see it on the literal 1st step

#

"Place the girl in the left image together with the ones in the right image, while maintaining the composition and their look." (and many other variations of this prompt tried, incl. "image #1/2", getting left/right wrong, as well as differently colored backgrounds or different order)

#

I tried being specific with hair and eye color instead of "her", "girl", etc.

#

I guess I haven't tried too many seeds, only 2-3 or so

#

I'm gonna try thresholding with colorization to see if that's any better, might help there more 🤔

somber trellis
#

it just seems like certain seed values don't give the model the noise it wants

#

and it just completely botches the job, giving you a near identical output to the input

#

or a blurry mess

earnest grotto
#

the 2nd image is the 1st step

#

what its prediction then looks like

#

the blur is normal

#

but you can basically tell that the end result is going to be unchanged or borderline unchanged

#

ostris had some peculiar artifacts when training a lora for it (that went away after more training). so I wonder if loras will save it

somber trellis
#

lmao

earnest grotto
#

(Don't get your hopes too high, this trainer has no block swap, we'll be waiting for kohya)

somber trellis
earnest grotto
earnest grotto
#

all its anime outputs are oddly bright, like how the realistic ones are oddly yellowish

#

i wonder what dataset that brightness came from

#

Here's an example: 1 step out of 20 with juggernaut xl v9, and all 20 steps, both at 1 cfg

#

The "negative" (empty) prompt prediction is much blurrier and greyer -> when subtracting, it essentially gives more contrast/color to the image, but we subtract enough so that doesn't destroy it -> the model can make a better prediction and you get a generally better result
kontext, being distilled, is going to have much brighter predictions, so subtracting them can burn the image much more easily, which is why you normally want to use thresholding I guess. but it's only more easily; it's not guaranteed it'll burn them
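One concrete form of thresholding is the percentile-based "dynamic thresholding" from the Imagen paper, applied to the model's predicted clean image: take a high percentile of the absolute values, and if it exceeds 1, clip to that value and rescale back into [-1, 1]. As far as I know the ComfyUI dynamic thresholding custom node uses its own, more elaborate variant (with a separate "mimic" CFG scale), so this sketch only shows the general idea on a flat list, using a simple nearest-rank percentile.

```python
def percentile_threshold(x0: list, p: float = 0.995) -> list:
    """Imagen-style dynamic thresholding of a predicted clean sample.

    s = p-th percentile of |x0| (nearest-rank), floored at 1.0; values are
    clipped to [-s, s] and divided by s, so the output always lies in [-1, 1].
    """
    magnitudes = sorted(abs(v) for v in x0)
    idx = min(len(magnitudes) - 1, int(p * len(magnitudes)))
    s = max(1.0, magnitudes[idx])
    return [max(-s, min(s, v)) / s for v in x0]


burnt = [10.0, 0.3, -7.0, 1.5, -0.2]   # toy over-saturated prediction
print(percentile_threshold(burnt))
```

Values already inside [-1, 1] pass through unchanged, so the thresholding only kicks in when high CFG pushes the prediction out of range.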

somber trellis
earnest grotto
#

Took a gander at what more kontext loras people have trained, and things are looking very very promising

#

just a funny thing i found as i was browsing

#

but there are good, not-so-memey loras too

earnest grotto
#

Omnigen 2 is pretty good

#

omnigen 2 is 4b

#

both omnigen 2 and ovis u1 are apache 2

#

ovis seems to be getting more attention despite being worse in general. i can see that sdxl vae smoothing people's images out

#

sadly i expect both will be ignored like lumina 2

somber trellis
earnest grotto
#

ovis is sdxl vae and omnigen 2 is some 16ch vae, dunno if flux or sd3 or other

#

i didn't prompt the text in the text box for any of them

somber trellis
#

kontext only won in this case

#

because it's a bigger model

#

a finetune of ovis i bet would compete

earnest grotto
somber trellis
#

If they make a 12b model

earnest grotto
#

However IMO omnigen 2 currently just beats it

somber trellis
#

like flux

earnest grotto
#

Given the chroma dev's findings, flux doesn't even need to be 12b due to wasting ~3b params on nearly pointless stuff

#

IDK why ovis is getting more attention than omnigen when both are apache 2

somber trellis
#

ovis higher guidance gets quite close to the base image

#

at least that's what your comparison shows

earnest grotto
#

yes. thing is, the default guidance was too unrelated to the image, so I included multiple pics of me progressively cranking it up. they're multiple because it also gets fried

somber trellis
#

however, the pillow and blanket in the back are overcontrasted, with weird black splotches

earnest grotto
#

I don't want to give the wrong impression that it's fried by default. in this case, it might be that it's undertrained on anime, however at the same time they had ghibli style as one of their prompts, so...

somber trellis
#

🤷‍♂️

#

and kontext, though it got the closest, was just a lucky gen, right?

earnest grotto
#

Here's the default guidance (this kinda worked for non-anime images)
And a bit higher than the last image. absolutely fried

earnest grotto
#

You've seen the unlucky HL2 image that did not want to get animeified at all

somber trellis
#

yep

#

ive got a few wojak failures

#

lmao

earnest grotto
#

However the lucky gen shows there's room to easily train it better

somber trellis
#

kontext is still most capable but if a model of similar size comes out with an apache license

#

that will change

#

lets see if ovis makes a bigger model

#

i myself am still kinda sad that there still aren't any open-source models that can voice clone as well as indextts, but with emotion

#

chatterbox sucks at maintaining dialects/accents, dia is still not good

#

openaudio s1 mini (FishTTS) is also meh

earnest grotto
#

I don't want bigger models

#

slower for everything, harder to train, for questionable benefits

#

there are ways to make the models better without ballooning parameter counts

#

These are just 2 examples.

#

Both of these claim training speedups AND higher quality (FID isn't exactly quality but that's good)

#

I don't know every single paper out there. As you know, SD1.5 and SDXL struggle with changing resolutions, and Flux is better. This is due to an architectural improvement (rotary positional embeddings) rather than parameter counts. In that vein, I had seen a paper that touted even better generalization to higher resolution, but I forgot what it was
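Rotary positional embeddings encode position by rotating pairs of feature dimensions through an angle proportional to the token's position, so the attention score between a query and a key depends only on their relative offset. A 1-D sketch in plain Python; image models like Flux apply the same idea along several positional axes, and the base of 10000 is the common default from the original RoPE formulation, assumed here.

```python
import math


def rope(vec: list, pos: int, base: float = 10000.0) -> list:
    """Rotate consecutive pairs (vec[2k], vec[2k+1]) by pos * base**(-2k/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # i == 2k, so this is base**(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


q, k = [0.3, -1.2, 0.8, 0.5], [1.0, 0.4, -0.6, 0.9]
# Same relative offset (2) at different absolute positions gives the same score.
print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 10), rope(k, 8)))
```

Because only the offset matters, nothing in the model is tied to a fixed grid of absolute positions, which is one reason such models tolerate resolution changes better than ones with learned absolute embeddings.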

#

mm, another comes to mind, i think there was a paper about doing the diffusion at a lower and progressively increasing resolution, too

reef ivy
#

Apparently the wan wrapper got gguf support. Probably won't be able to test it for who knows how long, though.

wicked fulcrum
reef ivy
#

I am seeing people using the 14b wan model to make images now and claiming its better than flux. Not sure exactly what lora or special models they are using though if any.

reef ivy
#

Would need to check it out, think it's also much faster as well.

#

Gonna try the gguf with block swap now that kijai added support to the wrapper. Might even make gguf faster or at least more consistent

#

Gonna be a while before I can try though likely

somber trellis
#

Looks like IndexTTS2 is coming out soon.

earnest grotto
craggy hinge
#

Hi all, are there any 3D generation options on IPEX or in other ARC-compatible libs?

earnest grotto
#

Since you talk about IPEX, you probably want to install Comfy using my script: #1193952640225267802 message
It does a few things, like adding Disty's hijacks to Comfy so that random nodes that hardcode references to "cuda" instead of using Comfy's device getter still work
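The idea behind such a hijack can be sketched as a wrapper that rewrites hardcoded "cuda" device strings before they reach the real function. This is a hypothetical illustration, not Disty's code: redirect_cuda and to_device are made-up names, and the actual hijacks are more involved than rewriting string arguments.

```python
import functools


def _swap_device(value, target: str):
    """Map "cuda" / "cuda:N" strings to the target backend; leave anything else alone."""
    if isinstance(value, str) and (value == "cuda" or value.startswith("cuda:")):
        return target + value[len("cuda"):]
    return value


def redirect_cuda(fn, target: str = "xpu"):
    """Wrap `fn` so hardcoded "cuda" device arguments are rewritten to `target`."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        args = tuple(_swap_device(a, target) for a in args)
        kwargs = {k: _swap_device(v, target) for k, v in kwargs.items()}
        return fn(*args, **kwargs)
    return wrapper


def to_device(device):   # hypothetical stand-in for something a custom node calls
    return device


to_device = redirect_cuda(to_device)
print(to_device("cuda"), to_device(device="cuda:1"), to_device("cpu"))
```

The benefit of patching at this level is that the custom node itself never has to be edited; it keeps asking for "cuda" and quietly gets the Intel device.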

earnest grotto
#

Recently they added a remesher as well, but I'm not liking the topology it produces, and the 2.5 model's outputs are already kinda broken. This doesn't save much time over just using the model's output as a quick base

#

Although it is promising

craggy hinge
earnest grotto
#

Yeah I don't think current local 3d model gen is good enough yet

#

But hopefully soon

somber trellis
#

driver 6972 seems to give garbled outputs when using a basic flux gguf workflow in comfy

#

Nevermind. This might not be a driver issue, as 6913 which worked for me before is now doing the same thing.

formal tusk
#

I've had some weirdness for the last week-ish across the board

ripe pivot
#

updating to 0.2.2 somehow makes it this slow. Which one was I supposed to update to, nightly or stable?

#

that's very slow for imagegen

earnest grotto
ripe pivot
#

is it somehow running on my cpu or what?

earnest grotto
# ripe pivot

Please show everything ComfyUI shows in the console

earnest grotto
ripe pivot
earnest grotto
ripe pivot
earnest grotto
# ripe pivot

Is this screenshot with the ksampler or some random custom node

ripe pivot
#

ksampler, then again it works fine on previous version

#

upgrading to 0.2.2 changed things

earnest grotto
#

ok, i'll check that out in a few hours

earnest grotto
#

it's possible you just hit a broken nightly and when you tried again there's a new nightly with that bug fixed

#

use the stable pytorch

ripe pivot
earnest grotto
#

The first sample is always going to take a bit

ripe pivot
#

The 1st step takes 2 minutes; before the update, in that amount of time it was already at the detailer inpaint step

earnest grotto
#

Show the workflow
Which shortcut are you launching with

ripe pivot
#

regular one

earnest grotto
#

you are using a random custom node

#

there is nothing regular about that workflow

ripe pivot
#

I mean, it got stuck on the ksampler before any other nodes. I'd understand if it got stuck on the custom nodes

earnest grotto
#

I see what the issue is, but in the future I need you to tell me when something is or isn't a custom node

ripe pivot
#

tbf I don't understand which one is the default or which one the custom

#

on reforge the detailer is kind of provided by default iirc, so I assumed it's just a default tool

earnest grotto
#

If something is a custom node, the label on the top right shows where it comes from

ripe pivot
#

so every nodes that has label on the right are custom right?

earnest grotto
#

Ok, script updated

#

Should be fixed now

earnest grotto
#

besides that, comfyui does not have face detection by default.

#

also most dedicated face detection models are bad for anime

#

they will often fail to detect anime faces

#

and you don't even need it that much for anime anyway: anime faces don't change that much with resolution. What breaks more often with anime is small eyes, not whole faces, and doing the equivalent of hires fix is enough 99% of the time

earnest grotto
#

and your last tag isn't a thing

ripe pivot
#

That detection works fine. Doing hires fix in one workflow makes genning take longer, and I use the detailer mainly to reduce inconsistency in the pupils, especially when the character is quite niche

ripe pivot
ripe pivot
ripe pivot
#

does disabling iGPU matter?

earnest grotto
#

no

earnest grotto
earnest grotto
#

Did you install fresh? Also, DM me the model_management.py file (inside the comfy folder)

ripe pivot
#

I just ran the script; since it downloaded everything, I presume that's a fresh install.

earnest grotto
#

hmm, everything looks fine now

earnest grotto
earnest grotto
#

more up

ripe pivot
earnest grotto
#

you don't need comfyui-manager

#

you can save those 18 seconds of loading

#

comfy has a built in manager now

#

ah, ok, I think I see the new issue

ripe pivot
#

does it? then again, I don't know how to uninstall those

earnest grotto
#

ok, fixed. you can redownload the script but i haven't changed the version number

ripe pivot
#

Thanks it works fine now, the decode is slower than before but I assume that's the intel driver update ruining something again

formal tusk
#

Anyone working with the new Wan2.2 yet?

reef ivy
#

Nope, wanna try the 5b but everyone seems still focused on 14b. Hope to get to poke around with stuff again soon

earnest grotto
#

you mean 24b?

reef ivy
#

I think it's still a 14b model, right?

somber trellis
#

@reef ivy Yes.

earnest grotto
#

@reef ivy No. It's a 27B model that has 14B active at a time (what the A is for). So you will need more regular RAM at least.

somber trellis
#

That's why it's A14B.

#

14 billion activated parameters during inference.

#

It will take the same processing power to run as a 14b model, but require more ram like vik said

#

@earnest grotto I just looked at the gguf quants for a14b wan2.2

#

For some reason it's 500 megabytes smaller than wan 2.1.

#

Hm. It's an MoE model with 2 experts.

#

Wan2.2 introduces Mixture-of-Experts (MoE) architecture into the video generation diffusion model. MoE has been widely validated in large language models as an efficient approach to increase total model parameters while keeping inference cost nearly unchanged. In Wan2.2, the A14B model series adopts a two-expert design tailored to the denoising process of diffusion models: a high-noise expert for the early stages, focusing on overall layout; and a low-noise expert for the later stages, refining video details. Each expert model has about 14B parameters, resulting in a total of 27B parameters but only 14B active parameters per step, keeping inference computation and GPU memory nearly unchanged.

#

I'm a moron I just understood how it works

#

It requires both the lownoise and highnoise models to function. If that's the case, then the actual model size combining the two is around 30.8 gigabytes for the Q8 GGUF versions of the models. It however only needs one loaded at a time to function, essentially making it the same resource requirements as Wan2.1, excluding storage.
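The ~30.8 GB figure is easy to sanity-check. GGUF's Q8_0 format stores weights in blocks of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights, or 8.5 bits per weight. The sketch below takes "about 14B" per expert at face value, which is an assumption: the exact parameter counts differ a bit, and GGUF files usually keep some tensors at higher precision.

```python
# GGUF Q8_0: blocks of 32 int8 weights + one fp16 scale = 34 bytes per 32 weights.
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32  # 8.5


def q8_0_size_gb(params_billion: float) -> float:
    """Approximate Q8_0 file size in GB (10**9 bytes), ignoring metadata and
    any tensors kept at higher precision."""
    return params_billion * 1e9 * Q8_0_BITS_PER_WEIGHT / 8 / 1e9


per_expert = q8_0_size_gb(14)   # "about 14B" per expert, taken at face value
print(per_expert, 2 * per_expert)
```

That works out to about 14.9 GB per expert and about 29.8 GB for the pair, in the same ballpark as the quoted size. Since only one expert has to be loaded at a time, the weights that must be resident at once are roughly one expert's worth, which also explains why Q8 is tight on a 16 GB A770.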

#

It's a separated MoE model.

reef ivy
coarse whale
#

Question: I installed ComfyUI through @earnest grotto's script some months ago, and it still works great, but I was wondering, should I update the install eventually? The manager, for example, says that there are more recent versions of ComfyUI. Can I update through there, or should I do it through the script somehow?

Or is it fine using an older version?

earnest grotto
#

download a newer version of the script and run it from the same location you last ran it

coarse whale
#

I don't need to delete anything?

earnest grotto
#

generally, comfyui updates add native support for newer models (e.g. kontext or possibly wan2.2)

earnest grotto
coarse whale
#

Great, thank you so much!

earnest grotto
#

if you don't need newer models you don't need to update

earnest grotto
#

torch compile decided to work on linux and i got a pretty nice boost for lumina 2, from 1.6s/it -> ~0.95s/it

formal tusk
somber trellis
formal tusk
somber trellis
#

ignore the preview, it's broken

#

It loads the high-noise model first, inferences, sends the latents from the first ksampler to the second where the low-noise model gets loaded.
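The two-sampler handoff amounts to splitting one step range at a boundary and passing the same latent along. A minimal sketch in plain Python, with toy callables standing in for the high-noise and low-noise experts. In ComfyUI this is usually done with two KSampler (Advanced) nodes whose start_at_step/end_at_step ranges meet at the switch step; the 10/10 split of 20 steps below is an illustrative assumption, not a recommended value.

```python
from typing import Callable

# A "step" takes the current latent and the step index and returns the new latent.
Step = Callable[[float, int], float]


def two_expert_denoise(latent: float, total_steps: int, switch_step: int,
                       high_noise: Step, low_noise: Step) -> float:
    """Run steps [0, switch_step) with the high-noise expert, then
    [switch_step, total_steps) with the low-noise expert on the same latent."""
    if not 0 <= switch_step <= total_steps:
        raise ValueError("switch_step must lie within [0, total_steps]")
    for step in range(switch_step):
        latent = high_noise(latent, step)
    # Only now does the second model need to be in memory, so the first can be unloaded.
    for step in range(switch_step, total_steps):
        latent = low_noise(latent, step)
    return latent


calls = []


def toy_high(x, step):
    calls.append(("high", step))
    return x * 0.5


def toy_low(x, step):
    calls.append(("low", step))
    return x * 0.9


result = two_expert_denoise(1.0, total_steps=20, switch_step=10,
                            high_noise=toy_high, low_noise=toy_low)
print(calls[9], calls[10], result)
```

The point of the split is that each expert only ever runs on its own part of the schedule, so per-step compute stays that of a single 14B model.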

formal tusk
#

Thank you

somber trellis
#

I might try to use the q4 a14b models though.

#

Q8 is just a bit too big on the A770.

reef ivy
#

You can use gguf in the wrapper now; you just need to do that one edit i posted a while back to get it to work on intel (unless something changed, I haven't updated yet). Block swap seems more reliable than just using the reserve vram command

#

Wrapper usually makes everything a bit easier to use tbh.

coarse whale
#

What could this be for?
"Native API failed. Native API returns: 20 (UR_RESULT_ERROR_DEVICE_LOST)"

somber trellis
earnest grotto
#

You probably ran out of vram

#

Or, drivers can be a bit iffy on windows still.

#

If you haven't restarted your pc in a while, time to do so

#

win+ctrl+shift+b won't save you

#

I've mostly only had issues when kontexting for a while. I guess I used to have issues after about 200-1000 sdxl images but i kinda stopped doing that so i dunno if that's still a thing

earnest grotto
somber trellis
somber trellis
coarse whale
craggy hinge
#

Hello again, everyone. Please tell me, is there any way to use two ARC 770s?

late glen
reef ivy
#

I wonder how the 48gb b60 will work

somber trellis
#

74 seconds per step on 640x480x74 wan 2.2 A14B, total 20 steps. Takes 25+ minutes to generate.

somber trellis
#

41 seconds per step, 640x480x81, total of 6 steps. Took less than eight minutes.

#

Using the lightx2v T2V 14B Wan 2.1 lora at 2 strength.
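The per-step numbers translate directly into sampling time. A quick check (sampling only: model loading, expert swapping and VAE decode are not included, which presumably accounts for the rest of the wall-clock time; the two runs also differ slightly in frame count, 74 vs 81):

```python
def sampling_minutes(seconds_per_step: float, steps: int) -> float:
    """Pure sampling time in minutes; excludes loading, model swaps and VAE decode."""
    return seconds_per_step * steps / 60


baseline = sampling_minutes(74, 20)   # 20 steps, no speed-up LoRA
lora = sampling_minutes(41, 6)        # 6 steps with the lightx2v LoRA
print(round(baseline, 1), round(lora, 1), round(baseline / lora, 1))
```

So 74 s x 20 steps is about 24.7 minutes of sampling versus about 4.1 minutes for 41 s x 6 steps, roughly a 6x reduction, even though each LoRA step is only about 1.8x faster.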

#

https://huggingface.co/QuantStack/FLUX.1-Krea-dev-GGUF/tree/main

#

https://bfl.ai/announcements/flux-1-krea-dev

#

Supposedly the purpose of this model is to overcome the "AI Look"

somber trellis
cursive hull
#

how can i use GGUF? In Comfy, the GGUF loader is not popping up. Using Arc A750

reef ivy
#

I think its a custom node you need to get from the manager

boreal maple
#

Hi, I'm looking for help installing ComfyUI on my laptop with an Intel Arc GPU, but all I'm getting is errors and the CPU-only version

#

🎯 Summary: Why It Doesn't Work (Yet)
| 🔧 Component | ❌ Problem |
|---|---|
| PyTorch (on Windows) | GPU (XPU) backend still experimental or missing |
| IPEX (Intel Extension for PyTorch) | Current public builds mostly support CPU only |
| ComfyUI | Designed for CUDA backend, no out-of-the-box support for Intel GPU |
| Your GPU (Arc) | Based on Xe HPG, not yet fully integrated into PyTorch workflows |
| TorchDirectML / OpenVINO | Works partially for inference, but not supported by ComfyUI pipeline |
🕯️ The Major Roadblock in One Line:

Intel Arc GPUs lack stable, official PyTorch/XPU backend support on Windows, and ComfyUI doesn't yet support Intel's alternate GPU paths (SYCL, OpenVINO, DirectML).

✅ What Is Working Right Now?

Your CPU can run PyTorch + ComfyUI reliably.

You can install optimized CPU builds (e.g., torch==2.3.0) using Intel’s IPEX.

You can do inference, just a little slower.
reef ivy
#

Chatgpt is wrong as well, xpu is built into latest pytorch.

tough wharf
#

I wonder if some of yall are playing with wan2.2 on Arc cards
How's the speed and output?

reef ivy
#

Haven't had time, haven't even been able to mess with the fusionx and all the speed up lora's and models for wan2.1.

earnest grotto
#

@final mirage Run comfy with --reserve-vram 8, and say what happens

final mirage
#

I'm having issues installing the IPEX support as described here

https://github.com/comfyanonymous/ComfyUI

Running this command (as described in the docs):

pip install torch==2.3.1.post0+cxx11.abi torchvision==0.18.1.post0+cxx11.abi torchaudio==2.3.1.post0+cxx11.abi intel-extension-for-pytorch==2.3.110.post0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/

I get the following error:

ERROR: No matching distribution found for torch==2.3.1.post0+cxx11.abi

Any ideas what could be wrong?

reef ivy
#

Seems like 5b wan2.2 might be worse than 1.3b 2.1. I think it is a t2v and i2v in one? Might be why it's worse, trying to do both (could be wrong, I've only been able to browse discord for examples)

final mirage
#

So I managed to install the latest IPEX, and now I'm getting the following

RuntimeError: XPU out of memory. Tried to allocate 115.99 GiB (GPU 0; 7.75 GiB total capacity; 6.05 GiB already allocated; 6.39 GiB reserved in total by PyTorch)

What's going on here? 115.99gb? I'm trying to run a gguf model under 7 gb with a ltx2v Lora. Before IPEX i could run it without running out of memory. Now I'm getting the above after IPEX installation.

Any ideas what could be causing the explosion of memory allocation?

reef ivy
#

You will probably need to show the entire error. This is a workflow you previously used on pytorch I assume?

final mirage
reef ivy
#

there can be issues when installing ipex on top of pytorch or vice versa, so it's best to always make a clean environment if you didn't. If you updated anything, it could be an issue with comfy or the nodes themselves. also make sure you have --reserve-vram on the command line with whatever amount works best

coarse whale
#

Question: png saves from comfyui look a bit washed out/desaturated compared to how they look in the web preview. Why is that? The thumbnail in Windows looks like the web interface, but when opened with the Windows viewer or photoshop it's washed out.

coarse whale
#

this is weird even copy pasting from photoshop to discord the colors change

#

maybe i have some problems with the color profiles in Windows

#

every time I copy-paste from snipping tool to some application it changes the colors, I don't think it's comfyui's fault

#

It was the wrong color profile set up on Windows, sorry!

nocturne fjord
reef ivy
#

so far nobody has been able to fix the face issues with loras, they just get destroyed with 5b model apparently. Might be worth doing a second pass with the 1.3b maybe?

final mirage
earnest grotto
reef ivy
#

If it's not a custom one, just link where you got it. I assume it is a downloaded one you are using, since you are doing comparisons

earnest grotto
#

Don't link it or anything like that. Send or show the exact workflow you ran.

#

oops, ping

#

oh well

somber trellis
#

Ding ding ding.

somber trellis
somber trellis
somber trellis
#

Kinda wish there were ways to make it as fast as possible on arc.

#

rn it takes 5 minutes with the q8 gguf

earnest grotto
#

things can most likely be faster with int8 or int4, just those aren't available in comfy

#

also, a better alternative to teacache popped up recently with supposedly almost zero degradation; however, iirc, unlike teacache and such, it was not training-free. not sure, actually

#

apparently it works both with no cfg and decently high cfg? interesting

#

i wonder if sdnext has support for it with disty's quantization

civic charm
#

this is for the balanced offload to work

#

qwen image doesn't really want to go below 6 bits

#

It also needs dynamic atten enabled, either through compute settings or via the ipex force attention env var

#

intel's flash attention just fails to run with qwen-image

somber trellis
#

flash attention would be a nice speed boost

glossy gyro
#

Is there a relevant docker image for ComfyUI IPEX?

earnest grotto
#

are you asking because you want simple setup for yourself or because you have 10 computers with arc and want to deploy to all of them

#

also, ipex by itself is not very relevant anymore. in fact, I think it's getting discontinued?

#

#intel-arc message

#

You can ask vipitis what he was talking about. to me, it just sounds plausible enough

glossy gyro
earnest grotto
#

regular pytorch

#

#1193952640225267802 message

#

Here's a script to install comfyui for you.

#

You need conda and git installed and that's it.

#

And well, working graphics drivers of course

#

On linux, that entails working clinfo at least

glossy gyro
earnest grotto
glossy gyro
glossy gyro
reef ivy
#

Has there been anything notable in the pytorch-xpu development? Last big thing was triton/torch.compile. I heard flash attention was coming but apparently it's not actually making anything faster afaik.

civic charm
#

Flash atten is already here with PyTorch 2.7 and made it much faster

#

But base PyTorch performance is still awful compared to ipex 2.3

#

So flash attention wasn't enough of a speedup to close the gap

#

Also installing IPEX on top of PyTorch 2.6 / 2.7 / 2.8 halves your performance for some reason

#

And using non-blocking even once makes things slow down to a halt on Intel and corrupts your data

#

non-blocking is supposed to make things faster, not slower

#

This wasn't an issue with IPEX 2.3

reef ivy
#

So this is just slowdown for pytorch in general or intel/xpu/ipex specific slowdown?

civic charm
#

Intel specific

#

PyTorch slowdown is fixed on all others with PyTorch 2.6

#

And 2.7 runs much faster than everything before it on AMD

earnest grotto
#

*without the TE, 12 rank, lotsa random 1MP resolutions, 1 batch size, and with cache clearing before and after backward() because for some reason it both speeds it up AND reduces vram usage AND makes it so I don't randomly crash with a pseudo-OOM issue

#

that crashing was also happening when I tried to train ace-step too

#

I'll try inference later but, I'm pretty sure my 2.8/2.9 performance is basically identical to 2.3+ipex

lunar thicket
earnest grotto
#

when training, a lumina 2 lora specifically, and on linux

#

for me

#

IIRC disty had some issues with SDXL lora training performance too. For me personally, I've had SDXL lora training performance go all over the place for seemingly no reason. Fresh boot, 5-6s/it, reboot, 5 again, reboot again, finally the expected 2.3s/it

#

I'll poke inference in comfy later but generally, I'm pretty sure my 2.3+ipex and 2.8/2.9 performance was the same. haven't tried 2.8+ipex for inference

#

on windows my training speeds were about 25% slower?

lunar thicket
#

This is why I was wondering about AI Playground moving off IPEX in the other channel. Feel like I am seeing mixed messages on IPEX vs native pytorch. But maybe it's not an issue for that workload

reef ivy
#

2.8 ipex seems pretty new, might not have been out back then. Might only be linux and training as well.

#

Quick question for anyone who would know, does ipex eventually get upstreamed to pytorch or are they separate?

civic charm
#

We might get a 2.9 release but 2.8 really is the last IPEX release.

#

It is pure PyTorch from now on.

lunar thicket
#

Ahh well, that settles that 👍

bold owl
civic charm
#

Isn't ollama c++ / llamacpp based?

#

IPEX is for PyTorch

earnest grotto
#

inference speeds do look pretty bad though at a glance

#

2.15-2.2s/it for lumina 2 with 2.8+ipex, vs 1.6-1.7s/it with 2.9

#

1.8-1.85it/s for sdxl with 2.8+ipex vs 2.20-2.25it/s with 2.8

#

so yeah, substantially slower
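The figures above mix s/it (Lumina 2) and it/s (SDXL). A small sketch that converts both to seconds per iteration and compares them, taking the midpoint of each quoted range; using midpoints is an assumption, not a measured average.

```python
def midpoint(lo: float, hi: float) -> float:
    return (lo + hi) / 2


def slowdown(ipex_s_per_it: float, native_s_per_it: float) -> float:
    """How many times longer each iteration takes with 2.8+ipex."""
    return ipex_s_per_it / native_s_per_it


# Lumina 2 is already quoted in seconds per iteration.
lumina_ipex = midpoint(2.15, 2.2)   # 2.175 s/it with 2.8+ipex
lumina_native = midpoint(1.6, 1.7)  # 1.65 s/it with 2.9

# SDXL is quoted in iterations per second, so invert it.
sdxl_ipex = 1 / midpoint(1.8, 1.85)     # ~0.548 s/it with 2.8+ipex
sdxl_native = 1 / midpoint(2.20, 2.25)  # ~0.449 s/it with 2.8

print(f"Lumina 2: {slowdown(lumina_ipex, lumina_native):.2f}x the time per step with ipex")
print(f"SDXL:     {slowdown(sdxl_ipex, sdxl_native):.2f}x the time per step with ipex")
```

This works out to roughly 1.32x for Lumina 2 and 1.22x for SDXL.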

earnest grotto
#

well, nonetheless, no ipex for go either

lunar thicket
earnest grotto
#

It can get down to 1 s/it with compile, probably even faster with other things like INT8 if Comfy supported that the way SD.Next does.
Something's wrong with compile, though: if I keep changing resolutions, or... the prompt? It adds 100 ms every time and gets slower and slower; I've even reached 3 s/it. Weird stuff.
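One plausible (unconfirmed) factor is that compiled graphs are specialized on input shapes, so each new resolution triggers a fresh compile. Below is a toy, pure-Python model of that behavior; `ShapeSpecializedCache` and `fake_compile` are hypothetical and are not how `torch.compile` is implemented. It illustrates why a resolution change costs a recompile, but does not by itself explain the gradual per-step slowdown. In real PyTorch, `torch.compile(model, dynamic=True)` requests shape-generic graphs instead, which may or may not help here.

```python
from typing import Callable, Dict, Tuple

Shape = Tuple[int, ...]


class ShapeSpecializedCache:
    """Toy stand-in for a compiler that specializes on input shape.

    Every unseen shape costs one "compile"; repeated shapes hit the cache.
    """

    def __init__(self, fn: Callable[[Shape], str]):
        self.fn = fn
        self.compiled: Dict[Shape, str] = {}
        self.compile_count = 0

    def __call__(self, shape: Shape) -> str:
        if shape not in self.compiled:
            self.compile_count += 1  # new shape -> new specialized graph
            self.compiled[shape] = self.fn(shape)
        return self.compiled[shape]


def fake_compile(shape: Shape) -> str:
    return f"kernel specialized for {shape}"


cache = ShapeSpecializedCache(fake_compile)
# Latent shapes (batch, channels, height, width); values are made up.
for shape in [(1, 4, 128, 128), (1, 4, 128, 128), (1, 4, 96, 160), (1, 4, 128, 128)]:
    cache(shape)
print(cache.compile_count)  # 2
```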

bold owl
earnest grotto
#

Intel is discontinuing IPEX with the goal of putting its optimizations (and other features) directly into PyTorch.
ipex-llm will continue to exist in some form, I'm sure. It's also weird: I'm not sure exactly how they integrate it into Ollama/llama.cpp, and it seems you need custom Intel builds to use it? 🤔

#

ipex-llm apparently existed before IPEX as BigDL, still based on PyTorch among other things? So it could become BigDL again, who knows

civic charm
#

i still don't understand why it was renamed to ipex-llm

#

It doesn't really have much to do with ipex

somber trellis
#

IPEX versions on Windows always seemed to be significantly slower and less usable than the normal PyTorch nightlies for XPU

#

I can't run wan 2.2 properly on 2.8.1+ipex but I can with the 2.8/2.9 nightly

#

there are exceptions though

#

index tts runs better with the ipex version than the non-ipex version

somber trellis
#

the current nightly's torchaudio isn't working for some reason, so I'm on the stable 2.8 release

nocturne fjord
earnest grotto
#

With block swapping, most likely.
What I'm more concerned about is whether it's actually supported in Comfy and how you'll feed it inputs, especially given it will be nowhere near real time

nocturne fjord
earnest grotto
#

This is not going to be anywhere near realtime on any consumer hardware

#

let alone an intel gpu

#

The model is tested on a machine with 8GPUs.
Minimum: The minimum GPU memory required is 24GB but very slow.
Recommended: We recommend using a GPU with 80GB of memory for better generation quality.

#

They most likely recommend 80GB not because you need 80GB but because those 8 GPUs are 8 H100s

somber trellis
#

@earnest grotto Isn't there another world model that was released and distilled for consumer GPUs?
I need to find it.

#

Found it.

#

Doubt it will run well on our hardware though.

#

not gamecraft tho

earnest grotto
#

people will optimize it

#

it just won't be fast enough, at least on intel

#

i'm sure there's always a guy with a 5090 around the corner

reef ivy
#

I wonder if intel could ever get something like sage attention?

earnest grotto
#

Technically possible for regular SageAttention.
SageAttention 2 and 3 use FP4/INT4 hardware, and the speedup comes from that, which I don't think current Intel GPUs have? So on that basis, those won't be around for the current gen. Maybe Celestial, though; I'd expect it to have 4-bit hardware.
Realistically...? As things are right now, I don't see it happening

earnest grotto
#

about 343 seconds per

#

incl. 4-step lora

reef ivy
#

I think only the 50 series has 4-bit support, but I believe older NVIDIA cards still get the SageAttention 2 speedup. Could be INT4, but I think Intel can do that? Probably wrong about that, though

#

Wan seems on par with, or really close to, paid models. For anime it seems good from what I've seen, though it might need LoRAs or finetunes; not sure

earnest grotto
#

it needs loras because I do not intend on waiting for 1500 seconds instead

#

(25 minutes)

earnest grotto
civic charm
#

A770 does have INT8 and INT4 too

#

But INT8 via oneDNN / MKLDNN quantized matmul (what PyTorch uses) runs 2x slower than 16-bit for some reason

#

oneDNN is for the GPU and MKLDNN is for the CPU, but the behavior is exactly the same on both

#

CPU runs INT8 2x slower than FP32 with MKLDNN
GPU runs INT8 2x slower than BF16 / FP16 with oneDNN

#

The A770 is supposed to run INT8 2x faster than 16-bit, not 2x slower
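For context on what an INT8 quantized matmul computes: a minimal pure-Python sketch of symmetric per-tensor quantization, an integer matmul with wide accumulation, and a float rescale. This is the textbook scheme, assumed here for illustration. oneDNN's actual kernels (per-channel scales, zero points, fused ops) are more involved, and the sketch says nothing about why they run slower on the A770.

```python
from typing import List, Tuple

Matrix = List[List[float]]
IntMatrix = List[List[int]]


def quantize(m: Matrix) -> Tuple[IntMatrix, float]:
    """Symmetric per-tensor INT8 quantization: x ~= q * scale, q in [-127, 127]."""
    max_abs = max(abs(x) for row in m for x in row) or 1.0
    scale = max_abs / 127
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in m]
    return q, scale


def int8_matmul(a: Matrix, b: Matrix) -> Matrix:
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    n, k, p = len(qa), len(qb), len(qb[0])
    out = []
    for i in range(n):
        row = []
        for j in range(p):
            # Integer products summed in a wide accumulator (int32 on real hardware).
            acc = sum(qa[i][t] * qb[t][j] for t in range(k))
            row.append(acc * sa * sb)  # dequantize back to float
        out.append(row)
    return out


def float_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference full-precision matmul for comparison."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


a = [[0.5, -1.2], [2.0, 0.3]]
b = [[1.0, 0.25], [-0.7, 1.5]]
approx = int8_matmul(a, b)
exact = float_matmul(a, b)
max_err = max(abs(x - y) for r1, r2 in zip(approx, exact) for x, y in zip(r1, r2))
print(f"max abs error: {max_err:.4f}")
```

On hardware with fast INT8 paths, the integer multiply-accumulate is where the expected 2x over 16-bit would come from.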

somber trellis
#

on Arc, I've been getting better results using the older lightx2v LoRA at strength 3 on high noise and strength 1 on low noise

#

horror btw ^

#

these gens are 8 steps, however, taking 3 minutes 36 seconds per inference, 7 minutes 12 seconds total for inference alone, not including CLIP text encoding or prompting (if you're using an LLM like I am)

somber trellis
#

one lora is used

#

the old wan 2.1 lightx2v lora

#

strength 3 on high noise, strength 1 on low noise
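A sketch of that two-strength setup as a ComfyUI API-format (prompt JSON) fragment, assuming Wan 2.2's high-noise and low-noise models are loaded by separate nodes and each passes through a `LoraLoaderModelOnly` node. The node IDs, upstream model references, and LoRA filename are hypothetical placeholders; only the strengths (3 and 1) come from the messages above.

```python
import json


def lora_node(model_ref: list, lora_name: str, strength: float) -> dict:
    """One LoraLoaderModelOnly node in ComfyUI's API (prompt) format."""
    return {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {
            "model": model_ref,  # [source node id, output index]
            "lora_name": lora_name,
            "strength_model": strength,
        },
    }


# Hypothetical filename for the old Wan 2.1 lightx2v LoRA.
LIGHTX2V = "wan2.1_lightx2v_lora.safetensors"

# Nodes "1" and "2" are assumed to be the high-noise and low-noise model loaders.
workflow_fragment = {
    "10": lora_node(["1", 0], LIGHTX2V, 3.0),  # high noise: strength 3
    "11": lora_node(["2", 0], LIGHTX2V, 1.0),  # low noise: strength 1
}

print(json.dumps(workflow_fragment, indent=2))
```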

somber trellis
upbeat crow
# somber trellis

please keep it to 2 at a time; automod doesn't like it if it's too much. Sorry it took this long to lift the timeout

somber trellis
#

👍

#

It already warned me and I ignored it

#

lmao

#

It's really quite a good local model

reef ivy
#

Wan is amazing; can't wait to try out all the new stuff. I haven't been able to mess around with it since before the FusionX finetune was released months ago.

earnest grotto
#

I increase resolution from 400k pixels to 500k pixels and time more than doubles from roughly 5 minutes to 11+ minutes. yeesh.
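A rough back-of-envelope for why that jump is surprising, assuming token count grows linearly with pixel count and self-attention cost grows with the square of token count (other layers grow roughly linearly). Even the quadratic term predicts well under 2x, so something beyond raw compute scaling, not identified here, is likely involved.

```python
pixels_before = 400_000
pixels_after = 500_000
minutes_before = 5.0   # "roughly 5 minutes"
minutes_after = 11.0   # "11+ minutes", so this is a lower bound

pixel_ratio = pixels_after / pixels_before  # token count scales with pixels
linear_estimate = pixel_ratio               # layers whose cost scales with tokens
quadratic_estimate = pixel_ratio ** 2       # self-attention: cost ~ tokens^2
observed_ratio = minutes_after / minutes_before

print(f"pixels:    {pixel_ratio:.2f}x")
print(f"linear:    {linear_estimate:.2f}x")
print(f"quadratic: {quadratic_estimate:.4f}x")
print(f"observed: >={observed_ratio:.1f}x")
```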