#Zluda keeps freezing while compiling

gray violet
#

New forum for Noobies

proven glacier
#

okay, let's try a different method

#

unzip it into C:\

#

so you have C:\Applio-3.2.9

#

and run

#

this should install applio using python 3.11

#

download these into Applio folder

#

one sec, checking myself

#

you need to nuke C:\users\user\miniconda3 folder before you run install

proven glacier
#

just delete it

#

or rename to .old

gray violet
#

check i did!

proven glacier
#

it will download miniconda installer, then install it

#

then the rest

gray violet
#

It's installing torch

#

Oke it's finished installing

proven glacier
#

yes

gray violet
#

i did

proven glacier
#

then open cmd in Applio's folder

#

and run

#

env\python -m pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl

gray violet
#

do i need to put the downloaded files in the env folder or in the root?

proven glacier
#

env\python -m pip install torchaudio-2.7.0a0+52638ef-cp311-cp311-win_amd64.whl

#

into C:\Applio-3.2.9

gray violet
proven glacier
#

how about env\python

gray violet
proven glacier
#

delete env

#

re-run run-install.bat

gray violet
#

finished

#

now again terminal?

proven glacier
#

yes

#

verify python is correct one

gray violet
#

again same error

proven glacier
#

env\python

#

not python

#

exit() to exit from it
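A quick way to confirm which interpreter you're actually in, as a minimal sketch: save it in the Applio folder and run it with env\python (not plain python). The helper name `is_expected_python` is made up for illustration; 3.11 is the version mentioned for the install earlier in this thread.

```python
import sys


def is_expected_python(version_info, expected=(3, 11)):
    """Return True when the (major, minor) pair matches the expected version."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    # sys.executable should point inside the Applio env folder
    print("interpreter:", sys.executable)
    print("version    :", sys.version.split()[0])
    print("is 3.11    :", is_expected_python(sys.version_info))
```

If the interpreter path is not inside the env folder, you started the system python instead of env\python.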

gray violet
proven glacier
#

did you save the updated run-install from above into this applio folder?

gray violet
#

yes

#

this is in the bat, it says 3.10

proven glacier
#

ah, should be 3.11

#

sorry, nuke env again, change to 3.11, re-run
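Presumably the 3.10 vs 3.11 mix-up matters because the wheels here are tagged cp311, and pip refuses a wheel whose tag does not match the interpreter. A rough sketch of that check; the function names are hypothetical, and the parsing only relies on wheel filenames ending in python tag, abi tag, platform tag:

```python
import sys


def wheel_python_tag(filename):
    """Extract the Python tag (e.g. 'cp311') from a wheel filename."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    # Wheel names end with -<python tag>-<abi tag>-<platform tag>
    parts = stem.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError(f"not a wheel filename: {filename!r}")
    return parts[1]


def interpreter_tag(version_info=sys.version_info):
    """Build the CPython tag for the running (or given) interpreter."""
    return f"cp{version_info[0]}{version_info[1]}"


if __name__ == "__main__":
    wheel = "torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl"
    print(wheel_python_tag(wheel), "vs", interpreter_tag())
```

Run it with env\python; if the two tags differ, the env was created with the wrong Python version.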

gray violet
proven glacier
#

okay, not the pip install

gray violet
#

it works

#

even with pip

proven glacier
#

torchaudio too

gray violet
#

Nope error

proven glacier
#

?

gray violet
#

ow..

#

this is the error at the end of: env\python -m pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl

proven glacier
#

I mean run second install for torchaudio

gray violet
proven glacier
#

okay, run-applio.bat now

gray violet
proven glacier
#

not a problem

#

once it opens, go to training tab and see advanced settings

gray violet
proven glacier
#

okay, give it a try

#

change the value to 0

gray violet
#

is it okay to get that many warnings?

#

the rest is working xd

proven glacier
#

that's just librosa warnings, it can be updated to silence them

#

env\python -m pip install librosa==0.11.0

#

question is whether the training works

gray violet
#

we are gonna find out now

#

nope

proven glacier
#

hm

proven glacier
#

can you post this as a text?

gray violet
#

i started again

#

now i get this

#

To create a public link, set share=True in launch().
Using HiFi-GAN vocoder
Using HiFi-GAN vocoder
Process Process-2:
Traceback (most recent call last):
File "C:\Applio-3.2.9\env\Lib\multiprocessing\process.py", line 314, in _bootstrap
self.run()
File "C:\Applio-3.2.9\env\Lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\Applio-3.2.9\rvc\train\train.py", line 452, in run
net_g = DDP(net_g, device_ids=[device_id])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\parallel\distributed.py", line 837, in init
_sync_module_states(
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\distributed\utils.py", line 311, in _sync_module_states
_sync_params_and_buffers(process_group, module_states, broadcast_bucket_size, src)
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\distributed\utils.py", line 322, in _sync_params_and_buffers
dist._broadcast_coalesced(
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

proven glacier
#

you need to use 0 device

gray violet
#

"D:/jam/TheRock/ml-libs/MIOpen/src/ocl/convolutionocl.cpp:275" I don't have a D:/jam...?

proven glacier
#

that's normal

#

could you set env variable

#

MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=1

#

and restart applio from a new window?

gray violet
#

where?

proven glacier
#

user variables is fine

gray violet
#

at path or a total new one?

proven glacier
#

new entry

#

name: MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD

#

value: 1

gray violet
proven glacier
#

yeah

gray violet
proven glacier
#

okay, few more

#

MIOPEN_ENABLE_LOGGING=1
MIOPEN_ENABLE_LOGGING_CMD=1
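If flipping these in the Windows dialog gets tedious, the same three flags can be set per launch instead. A minimal sketch, assuming the variables are simply read from the process environment; the helper name `miopen_debug_env` is made up:

```python
import os

# The three variables named in this thread
MIOPEN_FLAGS = (
    "MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD",
    "MIOPEN_ENABLE_LOGGING",
    "MIOPEN_ENABLE_LOGGING_CMD",
)


def miopen_debug_env(enabled, base=None):
    """Return a copy of the environment with all three flags set to 1 or 0."""
    env = dict(os.environ if base is None else base)
    value = "1" if enabled else "0"
    for name in MIOPEN_FLAGS:
        env[name] = value
    return env


if __name__ == "__main__":
    env = miopen_debug_env(True)
    for name in MIOPEN_FLAGS:
        print(f"{name}={env[name]}")
```

The returned dict can be passed as `env=` to `subprocess.run` when starting run-applio.bat, so each test run gets a fresh process with known flag values.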

gray violet
proven glacier
#

it may require some extra work on dev side, seems that they've made some shortcuts in this torch build

#

go ahead and run training again

gray violet
proven glacier
#

i need the whole thing as text

gray violet
#

its busy

#

it's compiling

proven glacier
#

no it is not

gray violet
#

only I see everything haha

proven glacier
#

it just logs every operation

gray violet
#

the bar at 37% was the same as what I saw in the beginning

proven glacier
#

unselect the text, you're freezing it

#

esc

gray violet
#

i did

proven glacier
#

well?

gray violet
#

still busy

proven glacier
#

that can't be right

gray violet
#

this is what it did till now

proven glacier
#

so now it suddenly works???

gray violet
#

can't even select all of it tho

proven glacier
#

46%|█████████████████████████████████████▌ | 57/123 [01:57<02:15, 2.05s/it]

#

it is working

#

not failing

gray violet
#

yeah but somewhere it was already at 118... 98%

proven glacier
#

can you change all those env values to 0

#

and restart from new window?

gray violet
#

should i stop this one?

proven glacier
#

yes

gray violet
#

which one

proven glacier
#

all 3

gray violet
proven glacier
#

what the fuk

#

let's set MIOPEN_ENABLE_LOGGING_CMD to 1

#

and re-try

gray violet
proven glacier
#

let's see if it progresses to 1/

gray violet
#

what do i need to do :)?

proven glacier
#

watch the log, if it gets past to next step it is really weird

gray violet
#

It doesn't do anything now (see the end of the txt)... it's the same error

proven glacier
#

so it only managed one pass

#

set to 1

#

and try again

gray violet
#

same error

proven glacier
#

are you running them in a new window every time?

gray violet
#

I close the cmd and the Applio tab in Chrome

proven glacier
#

okay

gray violet
#

I changed all of the variables to 1 and now it's logging again

#

without error after a few lines

proven glacier
#

and keeps going past 1st?

gray violet
#

this is till now

proven glacier
#

are you using a different batch size?

#

7 instead of 6?

gray violet
#

but with the other errors it was 6 as well

proven glacier
#

set all variables to 0, use batch 6

#

you used a different batch here too

#

figure out which batch sizes work

gray violet
#

5 gave same error

proven glacier
#

okay, but if you set all 3 to 1 and use batch 6 again?

gray violet
#

then it starts logging like crazy

#

without any error

proven glacier
#

but does not break...

gray violet
#

till then (when i needed to stop) no

gray violet
proven glacier
#

with all 3 flags to 1?

#

go up, 8, 12, 16

gray violet
#

nope still 0

#

10 on 0 gave same error

proven glacier
#

no, set all flags to 1

gray violet
#

oke!

proven glacier
#

then try different batches

gray violet
#

8 is logging as well

proven glacier
#

I assume if you try 7 it fails

gray violet
#

Should we try?

proven glacier
#

yes

gray violet
#

its logging

proven glacier
#

okay

#

okay, let's set variables to 0

#

then open rvc/train/train.py, find line torch.backends.cudnn.benchmark = True

#

and set it to False

gray violet
#

batch 6 or 7?

proven glacier
#

any

gray violet
#

error

proven glacier
#

try a few times

#

same batch

gray violet
#

i did 6 now, so 6 again?

proven glacier
#

i mean re-try a few times

#

I wanna see if it just randomly fails

gray violet
#

second = error
Third = error
fourth = error
fifth = error
Sixth = error
seventh = error

#

eighth = error
Ninth = error

proven glacier
#

okay

#

in train.py, after the line you changed, add

#

torch.backends.cudnn.enabled = False

#

with the same spacing
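"Same spacing" matters because Python is indentation-sensitive: the new line has to sit at exactly the indent of the line above it. A hedged sketch of the edit done on the file's text; the helper name is hypothetical, and it only assumes train.py contains the benchmark line quoted above:

```python
def disable_cudnn_lines(source):
    """Flip cudnn.benchmark to False and add cudnn.enabled = False below it."""
    target = "torch.backends.cudnn.benchmark = True"
    out = []
    for line in source.splitlines():
        if line.strip() == target:
            # Reuse the original line's leading whitespace for both lines
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}torch.backends.cudnn.benchmark = False")
            out.append(f"{indent}torch.backends.cudnn.enabled = False")
        else:
            out.append(line)
    # Keep a trailing newline if the input had one
    return "\n".join(out) + ("\n" if source.endswith("\n") else "")


if __name__ == "__main__":
    sample = "def run():\n    torch.backends.cudnn.benchmark = True\n"
    print(disable_cudnn_lines(sample))
```

Editing by hand is just as good; the point is only that both lines end up with identical leading whitespace. As far as I know, on ROCm builds of torch these cudnn switches are what controls the MIOpen path.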

#

restart and try to run training again

gray violet
#

mmh no error yet.........

proven glacier
#

how big is the set?

gray violet
#

data? 35 min

proven glacier
#

minutes

#

ok

gray violet
#

35

proven glacier
#

that's really slow

#

it should be much faster on 7800 XT

gray violet
proven glacier
#

that's better

gray violet
proven glacier
#

you need to clean up eval folder

#

from all old attempts

#

keep the last file

gray violet
proven glacier
#

scalars tab shows the charts you need

gray violet
#

and then what is the thing to look for when you overtrain or?

proven glacier
#

expand loss_avg_50

#

and close 'grad'

gray violet
proven glacier
#

you can click this to load more data, then blue square under each chart to resize

#

let me know if it fails again

#

I've created a ticket for torch devs, they may ask for some more data

gray violet
#

for now it's busy

#

and how do I know when to stop? or like, overtraining?

proven glacier
#

g_total gonna converge

gray violet
#

And how can I see it's overtraining?

proven glacier
#

you can start testing models when it flattens

#

watch for fm loss

#

if norm_g does this

#

the model is wack

#

that's too high
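"When it flattens" can be eyeballed on the chart, but here is a rough sketch of the same idea in code. It assumes loss_avg_50 is a 50-step moving average (a guess from the name); the 1% tolerance and both function names are invented for illustration and are not anything Applio does:

```python
def moving_average(values, window=50):
    """Simple moving average; returns one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out.append(total / window)
    return out


def has_flattened(values, window=50, tolerance=0.01):
    """True when the average moved less than `tolerance` over the last window."""
    avg = moving_average(values, window)
    if len(avg) < window + 1:
        return False  # not enough data to compare two separate windows
    previous, latest = avg[-1 - window], avg[-1]
    if previous == 0:
        return latest == 0
    return abs(latest - previous) / abs(previous) < tolerance
```

This only says the curve stopped moving; whether a checkpoint sounds good still needs listening tests.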

gray violet
#

can I make an index file in the middle of training? (forgot to make it before it started)

proven glacier
#

yes

proven glacier
#

@gray violet lmk when you're online, i have one test

gray violet
#

I am in like half an hour!

#

At the bar now drinking beer

gray violet
#

Sorry took me a little longer! I’m online now

proven glacier
#

what does the training/advanced settings show?

gray violet
#

The last time we got it working, we didn't use ZLUDA

proven glacier
#

could you do me a favor, save this in Applio's folder, then open cmd and run env\python bench.py

#

i wanna see the speed

#

it just runs 1000 loops of each operation
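For anyone curious what "1000 loops of each operation" looks like, a minimal sketch of such a timing helper. This is not the actual bench.py (it was shared as an attachment); the `sync` parameter is an assumption, meant for something like `torch.cuda.synchronize` so queued GPU work is counted:

```python
import time


def benchmark_op(op, x, loops=1000, sync=None):
    """Time `loops` calls of op(x) and return the total in seconds."""
    op(x)  # warm-up call so one-time setup is not counted
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(loops):
        _ = op(x)
    if sync is not None:
        sync()  # wait for any asynchronous work before stopping the clock
    return time.perf_counter() - start


if __name__ == "__main__":
    # Plain-Python stand-in for a layer, just to show the call shape
    seconds = benchmark_op(lambda v: v * 2, 21)
    print(f"double : {seconds:.4f}s")
```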

gray violet
#

C:\Applio-3.2.9>env\python bench.py
Using cuda
torch.float32
linear : 0.0335s
conv1d 192x192x1 : 0.0565s
conv1d 192x768x3 : 0.2310s
conv1d 768x768x1 : 0.2482s
up_0 : 0.4821s
up_1 : 0.5306s
up_2 : 0.3540s
up_3 : 0.2720s
dn_0 : 0.1580s
dn_1 : 0.1630s
dn_2 : 0.1595s
dn_3 : 0.0740s
res1a : 0.2180s
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 1228800, provided ptr: 0000000000000000 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 1228800, provided ptr: 0000000000000000 size: 0
MIOpen Error: D:/jam/TheRock/ml-libs/MIOpen/src/ocl/convolutionocl.cpp:275: No suitable algorithm was found to execute the required convolution
Traceback (most recent call last):
File "C:\Applio-3.2.9\bench.py", line 75, in <module>
t = benchmark_op(layer.to(dtype), x.to(dtype))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\bench.py", line 23, in benchmark_op
_ = op(x)
^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\conv.py", line 375, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\conv.py", line 370, in _conv_forward
return F.conv1d(
^^^^^^^^^
RuntimeError: miopenStatusUnknownError

proven glacier
#

weird

gray violet
#

No never heard of wheels

proven glacier
#

env\python -m pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl

#

whl = wheel

gray violet
#

we did

proven glacier
#

okay, wanna try a newer build?

proven glacier
#

yeah, there's newer build

gray violet
#

we can 🙂

#

i'm happy to help!

#

Just tell me what to do!

proven glacier
#

env\python -m pip install --upgrade --index-url https://d2awnip2yjpvqn.cloudfront.net/v2/gfx110X-dgpu torchaudio

#

then try the bench again

gray violet
#

is this oke?

proven glacier
#

okay, torchaudio then next

gray violet
#

that one was oke

proven glacier
#

okay, try env\python bench.py now

gray violet
proven glacier
#

could you edit env\lib\site-packages\torch\__init__.py

#

find 0x0001 and change it to 0x0000 then save the file and try the bench again

gray violet
#

this one?

proven glacier
#

yes

gray violet
#

same error

proven glacier
#

did it show a popup?

#

it should've shown a window with an actual error

gray violet
#

no only this

proven glacier
#

dir c:\windows\system32\libomp140.x86_64.dll

#

maybe this file is missing
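The same check as the dir command, sketched in Python in case that is easier to run with env\python. The helper name `dll_present` is hypothetical; the default folder is the standard System32 location:

```python
from pathlib import Path


def dll_present(name, folder=r"C:\Windows\System32"):
    """Return True if a file called `name` exists in `folder`."""
    return (Path(folder) / name).is_file()


if __name__ == "__main__":
    name = "libomp140.x86_64.dll"
    print(name, "found" if dll_present(name) else "missing")
```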

gray violet
#

yeah, don't see that one

proven glacier
#

i mean run that command

#

should show this

gray violet
proven glacier
#

okay, that file should go to windows\system32

#

once you copy it over run the bench again

gray violet
proven glacier
#

are those env variables still set?

gray violet
#

yeah, didn't change after it worked

proven glacier
#

performance-wise the numbers seem fine... except this failure

#

okay, I'm gonna make a ticket for the devs and see what they say

#

this is really weird

gray violet
#

Oke! Happy to help 🙂

proven glacier
#

the same wheels and bench work on 6800 just fine

#

sorry, I should've asked you to make a separate environment

#

if you need to run applio again this may work.. or not

#

you can revert back to the previous wheels