#TensorRT for AUTO1111's webui


sweet pilot
#

https://github.com/AUTOMATIC1111/stable-diffusion-webui-tensorrt

Read the readme carefully before attempting.

You might need to install the CUDA Toolkit, make an NVIDIA account to download TensorRT, and get the Microsoft C++ build tools.

What this is

This extension gives you a fairly straightforward method of converting the unet of your usual ckpt/safetensors models into TensorRT (TRT) format.
What it means is that once you properly compile the model, which takes anywhere from 15 minutes to an hour or so, you may get massive speed boosts, with several caveats.

Caveats

Firstly, hires fix won't work out of the box.
The TRT has a fixed max shape (a combination of max token count, resolution and batch size), and the biggest you can get out of it is 512x1024, 1024x512 or 768x768. These are still usable in img2img, however: the first two work with the ultimate SD upscaler script at tile size 512, and the 768x768 one works with multidiffusion (tile width/height 96/96).
Secondly, you need to compile the model (once), which takes some time, and each compile is tied to specific settings as mentioned above. I could see potentially having 3 TRT models per main model.
Thirdly, the TRT doesn't support LoRA or ControlNet.
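
One way to picture the max-shape caveat: a compiled TRT unet only accepts requests that stay inside every limit it was built with. The sketch below is a rough illustration, not the extension's code. `TrtProfile` and `fits` are made-up names, and the numbers are example values rather than extension defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrtProfile:
    """Hypothetical record of the limits a TRT unet was compiled with."""
    max_width: int
    max_height: int
    max_batch_size: int
    max_token_count: int


def fits(profile: TrtProfile, width: int, height: int,
         batch_size: int = 1, token_count: int = 75) -> bool:
    """Return True only if the request stays inside every compiled limit."""
    return (
        width <= profile.max_width
        and height <= profile.max_height
        and batch_size <= profile.max_batch_size
        and token_count <= profile.max_token_count
    )


# Example profile for portrait images (illustrative values only).
portrait = TrtProfile(max_width=512, max_height=768,
                      max_batch_size=2, max_token_count=300)

print(fits(portrait, 512, 768, batch_size=2))  # a plain 512x768 request
print(fits(portrait, 1024, 1536))              # the same image after a 2x hires pass
```

The second request is the hires fix situation: the upscaled pass asks for a resolution the compiled shape does not cover, so it would have to fall back to the regular unet.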

Who is this for then

If you generate a lot of smaller images and then choose one to upscale, this will be a speed boost.
If you use multidiffusion, this should also be a speed boost.

You can easily activate or deactivate the TRT unet, especially if you put sd_unet in the quicksettings. Swapping it on and off from there takes a few seconds for me, and I'm definitely glad to have it as an option.

My settings

  • Two different settings that I've tried so far, will do landscape preset soon:
    • Setting for using the model for img2img multidiff - 512x768 width, 512x768 height, 1 batch size, 75 token count (multidiff with small prompt is fine).
    • Setting for 512x768 - 512x576 width, 512x768 height, 2 batch size, 300 token count.
sweet pilot
#

If you're running into pycuda wheel build errors and it's complaining that Microsoft Visual C++ 14.0 or greater is required, get this:
https://visualstudio.microsoft.com/vs/older-downloads/
Install the C++ build tools and ensure the latest versions of MSVC v142 - VS 2019 C++ x64/x86 build tools and Windows 10/11 SDK are checked.
I also had to install the CUDA Toolkit from https://developer.nvidia.com/cuda-downloads.
Make sure to restart the shell or, better yet, the PC after installing these.

  • If it's still complaining:
    • look for "x64 Native Tools Command Prompt for VS 2019" in Windows search
    • open that terminal (it should tell you at the top that the environment has been initialized)
    • navigate to your stable-diffusion-webui directory from the terminal
    • activate venv (venv\Scripts\activate)
    • run pip install pycuda

It should work. If it still doesn't, I have no clue.
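
If you want to double-check that the install landed in the right place, something like this can be run with the venv's Python. It's only a sketch: `module_available` is a helper name made up here, and it only confirms that pycuda can be found by the interpreter, not that it actually works with your GPU.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the current interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # With the venv active, sys.prefix should point inside the webui's venv folder.
    print("interpreter prefix:", sys.prefix)
    print("pycuda importable:", module_available("pycuda"))
```

If the prefix points somewhere outside the webui folder, the venv probably wasn't activated in that terminal.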

#

Also do let me know if something here is wrong or you've had a different experience, I just got it running for myself, I'm not an expert.

icy sluice
#

compatibility

  • Turing GPUs work, no issues (make sure to use FP16/half floats), though not much of a boost (+18% for me)
  • ONNX models are shareable, TRT are not*
  • TRT sharing is possible, but only for Ampere+ (30xx)
    • todo: how
  • if you're not on Ampere, the VRAM minimum is iffy; I got it with 6GB VRAM, but it took 40 min to convert ONNX to TRT
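
For the Ampere+ note above, a rough way to tell which side of the line a card falls on is its CUDA compute capability: Turing cards report 7.5 and 30xx cards report 8.6. This is only a sketch, assuming "Ampere+" means compute capability 8.0 and up. `is_ampere_or_newer` is a made-up helper, and in a torch environment the tuple would come from `torch.cuda.get_device_capability()`.

```python
from typing import Tuple


def is_ampere_or_newer(capability: Tuple[int, int]) -> bool:
    """Assumption: 'Ampere+' is taken to mean compute capability major >= 8."""
    major, _minor = capability
    return major >= 8


# Hard-coded examples so the sketch runs without a GPU.
turing_card = (7, 5)   # e.g. 20xx / 16xx series
ampere_card = (8, 6)   # e.g. 30xx series

print("Turing is Ampere+:", is_ampere_or_newer(turing_card))
print("30xx is Ampere+:", is_ampere_or_newer(ampere_card))
```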
zealous flax
#

Following ao's instructions, pycuda compiled on my end.
But conversion to ONNX fails w only 4GB vram. I might try cpu torch to convert to ONNX, but then there's the ONNX to TRT conversion...

icy sluice
#

yeah since ONNX is just like another model arch, not like TRT which is GPU specific

zealous flax
#

how much vram did it eat during onnx->trt

icy sluice
#

less than safetensors -> ONNX

#

i cannot write hold on

#

it maxed my 6GB of vram for the ONNX step

#

took a loong while for TRT, didnt max out

sweet pilot
#

I think it was consistentish 3.6?ish gb vram for onnx->trt

icy sluice
#

oh nice

sweet pilot
#

it might work on the 4gb if you just absolutely yeet everything, close browsers and stuff

#

and run the command in a terminal

icy sluice
#

then yeah 4GB is minimum

sweet pilot
#

12

icy sluice
#

did it max out while converting ONNX to TRT?

sweet pilot
#

I /don't think so/ but I am not certain

#

I dont usually look at usages unless I see visible stutters ehehe

icy sluice
#

ah

steep notch
#

What settings did you use to compare gen speed?

steep notch
#

yep

icy sluice
#

i used DPM++ 3M karras, 20 steps, 512x512

sweet pilot
#

I tested a few, on 512x512 2m karras 25 steps? it was over 50% improvement

#

I'd test now but I am training things

steep notch
#

ill test it rq, lemme get it installed and try a TRT

zealous flax
#

i offload stuff to IGP anyways

#

so should be fine

zealous flax
#

for t2i not really. for i2i it is

#

esp since multidiff doesnt really need a prompt to work, so token limit doesnt matter

zealous flax
#

tensorrt tab not avail in cpu mode NotLikeKogasa

zealous flax
#

welp i guess i could just modify the onnx conversion code

sweet pilot
#

but also well if you follow the instructions then it's like an hour to get it running, unless your GPU is very potato like realreal's

#

then if you're only saving a second per gen you break even after 3.6k images, if we're talking about PURE IMAGE GENNING TO THE MAX
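
The break-even math here is just setup time divided by time saved per image. A tiny sketch, assuming roughly an hour of setup and one second saved per gen as in the message above (`break_even_images` is a name made up for illustration):

```python
import math


def break_even_images(setup_seconds: float, seconds_saved_per_image: float) -> int:
    """Number of images after which the one-off setup time has been paid back."""
    if seconds_saved_per_image <= 0:
        raise ValueError("there is no break-even point unless each image gets faster")
    return math.ceil(setup_seconds / seconds_saved_per_image)


one_hour = 60 * 60  # about an hour to get everything running
print(break_even_images(one_hour, 1.0))
```

With an hour of setup and a one second saving this lands on the 3.6k images mentioned above; a bigger per-image saving brings the number down proportionally.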

#

but the multidiff is also quite a lot faster

#

it's not for everyone though, for sure

icy sluice
#

or i could just extract the ONNX layer

#

since its not GPU specific

zealous flax
#

uggh this is annoying

zealous flax
#

alright im trying to manually call export_onnx.py

#

from modules import sd_hijack, sd_unet
ModuleNotFoundError: No module named 'modules'

sweet pilot
#

wrong dir?

zealous flax
#

yea i tried a bunch of methods but putting it in root of a1111 fixed it

#

cannot import name 'model_hijack' from partially initialized module 'modules.sd_hijack' (most likely due to a circular import)
or did it

sweet pilot
#

are u in da venv

zealous flax
#

ye

zealous flax
#

though self made :v

sweet pilot
#

why not webui's venv

zealous flax
#

just did install -r requirements.txt on this one

sweet pilot
#

venv\scripts\activate

#

or forward slashes on unix obv

zealous flax
#

oh there wasnt a venv

sweet pilot
#

there's no venv in your webui root?

zealous flax
#

oddly no, even after dev pull

sweet pilot
#

did you run it at least once?

zealous flax
#

oh yea

#

kekW

#

i never did

sweet pilot
#

oh

#

yeah the webui.sh or whatever's have the venv setups

zealous flax
#

ill try again in a bit then

#

but maybe ill just mess w the import order

#

but i dont think i dare

zealous flax
#

got it to convert w cpu

  • you need to use torch cpu and launch with --skip-torch-cuda-test --no-half --precision full
  • you need to remove the cuda imports temporarily from trt.py in /scripts
  • in export_onnx.py you need to replace devices.device with "cpu" and devices.dtype with torch.float, and remove "with devices.autocast():"
#

now for the dreaded onnx to trt

#

which i will do when i get home

zealous flax
#

Could not initialize cublas. Please check CUDA installation.
[05/30/2023-13:01:56] [E] Error[1]: [wrapper.cpp::nvinfer1::rt::CublasWrapper::CublasWrapper::94] Error Code 1: Cublas (Could not initialize cublas. Please check CUDA installation.)

#

seems like it's due to a cuda version mismatch (11.8 for a1111, 12.1 for TRT and the toolkit on mine)
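
A quick way to spot this kind of mismatch is to compare the two version strings side by side. This is a sketch under assumptions: `cuda_major` and `same_cuda_major` are made-up helpers, and treating "same major version" as the thing that matters is a guess based on the 11.8 vs 12.1 case here, not a documented rule. In a torch environment the first string would come from `torch.version.cuda`.

```python
def cuda_major(version: str) -> int:
    """Extract the major number from a CUDA version string such as '11.8'."""
    return int(version.strip().split(".")[0])


def same_cuda_major(a: str, b: str) -> bool:
    """Assumption: two CUDA installs are treated as matching if their major versions agree."""
    return cuda_major(a) == cuda_major(b)


webui_cuda = "11.8"    # CUDA version the webui's torch build uses (from the message above)
toolkit_cuda = "12.1"  # CUDA version of the installed toolkit / TRT

print("versions match:", same_cuda_major(webui_cuda, toolkit_cuda))
```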

wispy hedge
#

Y'all gonna be uploading TRTs on civitai?

zealous flax
#

its pointless

zealous flax
#

--useManagedMemory
this seems to help 4GB cards to compile trt (you still need to use cpu for the onnx conversion), will update
edit: it doesnt, its an arg for inferencing

sweet pilot
#

(3060) First test without trt, second with trt (ofc it errors at higher batch sizes)

#

force enabling xformers doesn't really have a performance boost

#

so for some real-er numbers: upscaling this (1920x1024) 3 times (so to 5760x3072) with animesharp, mixture of diffusers with tiles set to 96, tiled vae set to tile size 1536 and TRT (DPM++ 2m karras, 25 steps)
took, as reported by the extension itself, Time taken: 4m 14.39s

#

result

#

Now to take advantage of the fact that I can set a higher latent tile batch size, I'll change the settings up a bit and rerun without trt

#

I set larger tiles (128) and batch size 4, one possible slowdown is that this one actually loaded lora which I forgot to turn off, but that shouldn't be a big impact (the previous one also loaded it but it cant use it so idk if its even relevant)
Time taken: 6m 13.45s

#

so 4m14s vs 6m13s

#

pretty significant
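
To put a number on "pretty significant", the two reported times can be converted to seconds and compared. Just a sketch of the arithmetic using the figures above; keep in mind the two runs used different tile sizes and batch sizes, so it isn't a strict like-for-like comparison.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a 'Xm Y.YYs' style timing into plain seconds."""
    return minutes * 60 + seconds


with_trt = to_seconds(4, 14.39)     # Time taken: 4m 14.39s
without_trt = to_seconds(6, 13.45)  # Time taken: 6m 13.45s

speedup = without_trt / with_trt         # how many times faster the TRT run was
time_saved = 1 - with_trt / without_trt  # fraction of wall time saved

print(f"with TRT: {with_trt:.2f}s, without: {without_trt:.2f}s")
print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
```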