TensorRT for AUTO1111's webui | 東方Project AI

sweet pilot May 29, 2023, 9:55 AM

#

https://github.com/AUTOMATIC1111/stable-diffusion-webui-tensorrt

Read the readme carefully before attempting.

You might need to install the CUDA Toolkit, make an NVIDIA account to download TensorRT, and get the Microsoft C++ build tools.

What this is

This extension gives you a fairly straightforward method of converting the unet of your usual ckpt/safetensors models into a TensorRT (TRT) format.
What this means is that once you properly compile the model, which will take from 15 minutes to an hour or so, you may get massive speed boosts, with several caveats.

Caveats

Firstly, hires fix won't work out of the box.
The TRT has a specific max shape (which factors in max token count, resolution and batch size), and the biggest you can get out of it is 512x1024, 1024x512 or 768x768. These are still usable in img2img: the first two work with the Ultimate SD Upscale script at tile size 512, and 768x768 works with multidiffusion (tile width/height 96/96).
Secondly, you need to compile the model (once), which takes some time, and each compile is tied to specific settings like the ones I mentioned. I could see potentially having 3 TRT models per main model.
Thirdly, the TRT doesn't support LoRA or ControlNet.
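
To make the "768x768 with tile width/height 96/96" pairing less magic: multidiffusion tile sizes are given in latent units, and the Stable Diffusion VAE downsamples each side by a factor of 8. A minimal sketch of the conversion (the helper names are made up for illustration, they are not from the extension):

```python
# Stable Diffusion's VAE shrinks each image side by a factor of 8,
# so latent-space tile sizes are pixel sizes divided by 8.
VAE_SCALE = 8


def latent_size(pixels: int) -> int:
    """Convert a pixel dimension to its latent-space size."""
    if pixels % VAE_SCALE != 0:
        raise ValueError(f"{pixels} is not a multiple of {VAE_SCALE}")
    return pixels // VAE_SCALE


def pixel_size(latent: int) -> int:
    """Convert a latent-space dimension back to pixels."""
    return latent * VAE_SCALE


print(latent_size(768))  # multidiffusion tile size for a 768 px TRT profile
print(latent_size(512))  # latent size of a 512 px side
print(pixel_size(96))    # pixels covered by a 96-wide latent tile
```

So a 96/96 latent tile is exactly a 768x768 pixel region, which is why it lines up with the 768x768 max shape.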

Who is this for then

If you generate a lot of smaller images and then choose one to upscale, this will be a speed boost.
If you use multidiffusion, this should also be a speed boost.

You can easily activate or deactivate the TRT unet, especially if you put sd_unet in the quicksettings. Swapping it on and off takes a few seconds for me, and I'm definitely glad to have it as an option.

My settings

Two different settings that I've tried so far (will do a landscape preset soon):
- Setting for using the model for img2img multidiff - 512x768 width, 512x768 height, 1 batch size, 75 token count (multidiff with a small prompt is fine).
- Setting for 512x768 - 512x576 width, 512x768 height, 2 batch size, 300 token count.
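
A rough way to reason about whether a generation request fits one of these builds. This is only a sketch: it assumes the "AxB width" notation above means a min-max range, and `TrtProfile` and its field names are hypothetical, not something the extension exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrtProfile:
    """Hypothetical description of the shape range a TRT unet was built for."""
    min_width: int
    max_width: int
    min_height: int
    max_height: int
    max_batch: int
    max_tokens: int

    def fits(self, width: int, height: int, batch: int = 1, tokens: int = 75) -> bool:
        """Return True if the request stays inside every limit of the profile."""
        return (
            self.min_width <= width <= self.max_width
            and self.min_height <= height <= self.max_height
            and 1 <= batch <= self.max_batch
            and 0 < tokens <= self.max_tokens
        )


# The two settings listed above, read as min-max ranges (an assumption).
MULTIDIFF = TrtProfile(512, 768, 512, 768, max_batch=1, max_tokens=75)
PORTRAIT = TrtProfile(512, 576, 512, 768, max_batch=2, max_tokens=300)

print(MULTIDIFF.fits(768, 768))                      # True
print(PORTRAIT.fits(512, 768, batch=2, tokens=150))  # True
print(PORTRAIT.fits(768, 512))                       # False: landscape is out of range
```

The last line is the reason a separate landscape preset is needed.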

sweet pilot May 29, 2023, 10:16 AM

#

If you're running into pycuda wheel build errors and it's complaining that "Microsoft Visual C++ 14.0 or greater is required", get this:
https://visualstudio.microsoft.com/vs/older-downloads/, install the C++ build tools and ensure the latest versions of "MSVC v142 - VS 2019 C++ x64/x86 build tools" and the Windows 10/11 SDK are checked.
I also had to install the CUDA Toolkit from https://developer.nvidia.com/cuda-downloads.
Make sure to restart the shell, or better yet the PC, after installing these.

If it's still complaining:
- look for "x64 Native Tools Command Prompt for VS 2019" in Windows search
- open that terminal (it should tell you at the top that the environment has been initialized)
- navigate to your stable-diffusion-webui directory from the terminal
- activate the venv (venv\Scripts\activate)
- run pip install pycuda

It should work. If it still doesn't, I have no clue.
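
If you want to confirm the install landed in the right environment, here is a small check you could run with the webui venv active. It only looks the module up without importing it; `has_module` is a made-up helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this inside the webui venv; False means pip installed pycuda
    # somewhere else (or the wheel build failed).
    print("pycuda found:", has_module("pycuda"))
```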

#

Also do let me know if something here is wrong or you've had a different experience, I just got it running for myself, I'm not an expert.

icy sluice May 29, 2023, 10:56 PM

#

compatibility

turing GPUs work, no issues (make sure to use FP16/half floats), though not much of a boost (+18% for me)
ONNX models are sharable, TRT are not*
*TRT sharing is possible, but only for ampere+ (30xx)
- todo: how
if you're not on ampere, the VRAM minimum is iffy; i got it working with 6GB vram, but it took 40 min to convert ONNX to TRT

zealous flax May 30, 2023, 12:33 AM

#

Following ao's instructions, pycuda compiled on my end.
But conversion to ONNX fails w only 4GB vram. I might try cpu torch to convert to ONNX, but then there's the ONNX -> TRT conversion...

icy sluice May 30, 2023, 12:47 AM

#

zealous flax Following ao's instructions, pycuda compiled on my end. But conversion to ONNX f...

shit wait i think i can help you with the ONNX part

#

yeah since ONNX is just like another model arch, not like TRT which is GPU specific

zealous flax May 30, 2023, 12:53 AM

#

how much vram did it eat during onnx->trt

icy sluice May 30, 2023, 12:54 AM

#

less than safetensors -> ONNX

#

i cannot write hold on

#

it maxed my 6GB of vram for the ONNX step

#

took a loong while for TRT, didnt max out

sweet pilot May 30, 2023, 12:58 AM

#

I think it was consistentish 3.6?ish gb vram for onnx->trt

icy sluice May 30, 2023, 12:59 AM

#

oh nice

sweet pilot May 30, 2023, 1:00 AM

#

it might work on the 4gb if you just absolutely yeet everything, close browsers and stuff

#

and run the command in a terminal

icy sluice May 30, 2023, 1:00 AM

#

then yeah 4GB is minimum

icy sluice May 30, 2023, 1:01 AM

#

sweet pilot it might work on the 4gb if you just absolutely yeet everything, close browsers ...

also ao how much vram do you have?

sweet pilot May 30, 2023, 1:01 AM

#

12

icy sluice May 30, 2023, 1:01 AM

#

did it max out while converting ONNX to TRT?

sweet pilot May 30, 2023, 1:02 AM

#

I /don't think so/ but I am not certain

#

I dont usually look at usages unless I see visible stutters ehehe

icy sluice May 30, 2023, 1:02 AM

#

ah

steep notch May 30, 2023, 1:03 AM

#

What settings did you use to compare gen speed?

icy sluice May 30, 2023, 1:04 AM

#

steep notch What settings did you use to compare gen speed?

like gen settings?

steep notch May 30, 2023, 1:04 AM

#

yep

icy sluice May 30, 2023, 1:04 AM

#

i used DPM++ 3M karras, 20 steps, 512x512

sweet pilot May 30, 2023, 1:05 AM

#

I tested a few, on 512x512 2m karras 25 steps? it was over 50% improvement

#

I'd test now but I am training things

steep notch May 30, 2023, 1:06 AM

#

ill test it rq, lemme get it installed and try a TRT

zealous flax May 30, 2023, 1:07 AM

#

i offload stuff to IGP anyways

#

so should be fine

steep notch May 30, 2023, 1:23 AM

#

sweet pilot I tested a few, on 512x512 2m karras 25 steps? it was over 50% improvement

eh, is the speed bump worth the hassle? 20 images in 26s using these settings as is.

zealous flax May 30, 2023, 1:28 AM

#

for t2i not really. for i2i it is

#

esp since multidiff doesnt really need a prompt to work, so token limit doesnt matter

icy sluice May 30, 2023, 1:30 AM

#

steep notch eh, is the speed bump worth the hassle? 20 images in 26s using these settings as...

if you have terrible hardware [turing, pascal+] it is noticeably good

zealous flax May 30, 2023, 1:43 AM

#

tensorrt tab not avail in cpu mode NotLikeKogasa

sweet pilot May 30, 2023, 1:43 AM

#

steep notch eh, is the speed bump worth the hassle? 20 images in 26s using these settings as...

it's worth it if you're also doing large amounts of t2i

zealous flax May 30, 2023, 1:43 AM

#

welp i guess i could just modify the onnx conversion code

icy sluice May 30, 2023, 1:44 AM

#

zealous flax welp i guess i could just modify the onnx conversion code

reading it, you'll need to load it on CPU

sweet pilot May 30, 2023, 1:45 AM

#

but also well if you follow the instructions then it's like an hour to get it running, unless your GPU is very potato like realreal's

#

then if you're only saving a second per gen you break even after 3.6k images (an hour of setup is 3600 seconds), if we're talking about PURE IMAGE GENNING TO THE MAX
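
The 3.6k figure is just setup time divided by time saved per image. A tiny sketch so you can plug in your own numbers (the function name is made up):

```python
def break_even_images(setup_seconds: float, saved_per_image: float) -> float:
    """Number of images after which the setup time has paid for itself."""
    if saved_per_image <= 0:
        raise ValueError("TRT has to actually save time per image")
    return setup_seconds / saved_per_image


# One hour of setup, one second saved per generation.
print(break_even_images(60 * 60, 1.0))  # 3600.0
```

If TRT saves you more than a second per image, the break-even point drops proportionally.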

#

but the multidiff is also quite a lot faster

#

it's not for everyone though, for sure

icy sluice May 30, 2023, 1:47 AM

#

or i could just extract the ONNX layer

#

since its not GPU specific

zealous flax May 30, 2023, 1:48 AM

#

uggh this is annoying

zealous flax May 30, 2023, 2:34 AM

#

alright im trying to manually call export_onnx.py

#

from modules import sd_hijack, sd_unet
ModuleNotFoundError: No module named 'modules'

sweet pilot May 30, 2023, 2:36 AM

#

kekw

#

wrong dir?

zealous flax May 30, 2023, 2:39 AM

#

yea i tried a bunch of methods but putting it in root of a1111 fixed it

#

cannot import name 'model_hijack' from partially initialized module 'modules.sd_hijack' (most likely due to a circular import)
or did it

sweet pilot May 30, 2023, 2:41 AM

#

are u in da venv

zealous flax May 30, 2023, 2:42 AM

#

ye

sweet pilot May 30, 2023, 2:42 AM

#

hmmmm

zealous flax May 30, 2023, 2:42 AM

#

though self made :v

sweet pilot May 30, 2023, 2:42 AM

#

why not webui's venv

zealous flax May 30, 2023, 2:42 AM

#

just did install -r requirements.txt on this one

sweet pilot May 30, 2023, 2:42 AM

#

OhICannotSee

zealous flax May 30, 2023, 2:42 AM

#

sweet pilot why not webui's venv

i forgot how NotLikeKogasa

sweet pilot May 30, 2023, 2:43 AM

#

venv\scripts\activate

#

or forward slashes on unix obv

zealous flax May 30, 2023, 2:43 AM

#

oh there wasnt a venv

sweet pilot May 30, 2023, 2:43 AM

#

Thooking

#

there's no venv in your webui root?

zealous flax May 30, 2023, 2:44 AM

#

oddly no, even after dev pull

sweet pilot May 30, 2023, 2:44 AM

#

did you run it at least once?

zealous flax May 30, 2023, 2:44 AM

#

oh yea

#

kekW

#

i never did

sweet pilot May 30, 2023, 2:44 AM

#

oh

#

yeah the webui.sh or whatever's have the venv setups

zealous flax May 30, 2023, 2:45 AM

#

ill try again in a bit then

#

but maybe ill just mess w the import order

#

but i dont think i dare

zealous flax May 30, 2023, 5:40 AM

#

got it to convert w cpu

- you need to use torch cpu and launch with --skip-torch-cuda-test --no-half --precision full
- you need to temporarily remove the cuda imports from trt.py in /scripts
- in export_onnx.py you need to replace devices.device with "cpu" and devices.dtype with torch.float, and remove "with devices.autocast():"
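
For reference, the launch flags from the message above assembled into one command line. This is only a sketch: it assumes the webui is started through `launch.py` in its root directory with plain `python`, and plenty of setups put these flags in COMMANDLINE_ARGS instead.

```python
import shlex

# Flags quoted in the message above for running the webui on CPU-only torch.
CPU_EXPORT_FLAGS = ["--skip-torch-cuda-test", "--no-half", "--precision", "full"]


def launch_command(python: str = "python", script: str = "launch.py") -> list[str]:
    """Build the argument list; the interpreter and script name are assumptions."""
    return [python, script, *CPU_EXPORT_FLAGS]


print(shlex.join(launch_command()))
```

The edits to trt.py and export_onnx.py still have to be made by hand; this only covers the launch step.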

#

now for the dreaded onnx to trt

#

which i will do when i get home

zealous flax May 30, 2023, 6:02 AM

#

Could not initialize cublas. Please check CUDA installation.
[05/30/2023-13:01:56] [E] Error[1]: [wrapper.cpp::nvinfer1::rt::CublasWrapper::CublasWrapper::94] Error Code 1: Cublas (Could not initialize cublas. Please check CUDA installation.)

#

seems like it's due to a cuda version mismatch (11.8 for a1111, 12.1 for TRT and the toolkit on mine)
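
A quick way to spot this kind of mismatch is to compare the CUDA major versions of the two sides. The sketch below takes plain version strings so it runs anywhere; in practice the first one could come from `torch.version.cuda` in the webui venv. Whether a major-version mismatch is really what breaks cublas here is the poster's guess, not something this check proves.

```python
def cuda_major(version: str) -> int:
    """Extract the major version from a string like '11.8'."""
    return int(version.strip().split(".")[0])


def same_cuda_major(a: str, b: str) -> bool:
    """True if both version strings share the same CUDA major version."""
    return cuda_major(a) == cuda_major(b)


# Versions reported in the message above: torch build vs. TRT/toolkit.
print(same_cuda_major("11.8", "12.1"))  # False
```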

wispy hedge May 30, 2023, 6:22 AM

#

Y'all gonna be uploading TRTs on civitai?

zealous flax May 30, 2023, 6:27 AM

#

its pointless

zealous flax May 30, 2023, 12:19 PM

#

--useManagedMemory
this seems to help 4GB cards to compile trt (you still need to use cpu for the onnx conversion), will update
edit: it doesnt, its an arg for inferencing

icy sluice May 30, 2023, 10:00 PM

#

zealous flax its pointless

colab t4 maybe has a use

sweet pilot May 31, 2023, 4:10 AM

#

(3060) Test without trt, second with trt (ofc it errors at higher batch sizes)

#

force enabling xformers doesn't really have a performance boost

#

so for some real-er numbers, upscaling this (1920x1024) 3 times (so to 5760x3072) with animesharp, mixture of diffusers with tiles set to 96, tiled vae set to tile size 1536 and TRT (DPM++ 2m karras, 25 steps)
took, as reported by the extension itself, Time taken: 4m 14.39s

#

result

#

Now, to take advantage of the fact that I can set a higher latent tile batch size, I'll change the settings up a bit and rerun without trt

#

I set larger tiles (128) and batch size 4. One possible slowdown is that this run actually loaded a lora which I forgot to turn off, but that shouldn't be a big impact (the previous one also loaded it but it can't use it, so idk if it's even relevant)
Time taken: 6m 13.45s

#

so 4m14s vs 6m13s

#

pretty significant
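
Putting a number on that, using only the two reported times. Keep in mind the runs used different tile and batch settings, so this is a rough comparison rather than a clean benchmark.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert an 'Xm Ys' reading into seconds."""
    return minutes * 60 + seconds


trt = to_seconds(4, 14.39)     # with TRT, tiles 96
no_trt = to_seconds(6, 13.45)  # without TRT, tiles 128, batch size 4

speedup = no_trt / trt      # how many times faster the TRT run was
saved = 1 - trt / no_trt    # fraction of wall time saved

print(f"{speedup:.2f}x faster, {saved:.0%} less time")  # 1.47x faster, 32% less time
```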
