#Zluda keeps freezing while compiling
okay, let's try a different method
unzip it into C:\
so you have C:\Applio-3.2.9
drop this inside
and run
this should install applio using python 3.11
download these into Applio folder
one sec, checking myself
you need to nuke C:\users\user\miniconda3 folder before you run install
how to?
Then put these in the folder?
yes
i did
then open cmd in Applio's folder
and run
env\python -m pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl
do i need to put the downloaded files in the env folder or in the root?
env\python -m pip install torchaudio-2.7.0a0+52638ef-cp311-cp311-win_amd64.whl
into C:\Applio-3.2.9
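Side note on the wheel filenames above: the `cp311` part is the Python tag, which is why the install has to use Python 3.11. A minimal sketch of reading those tags, assuming the standard wheel filename layout; `parse_wheel_name` and `interpreter_tag` are helper names made up for this note, not part of pip or Applio:

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into its standard fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # Five fields normally; six when an optional build tag is present.
    if len(parts) == 6:
        name, version, _build, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_tag():
    """CPython tag of the running interpreter, e.g. 'cp311' for Python 3.11."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


wheel = parse_wheel_name("torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl")
print(wheel["python"], "wheel; this interpreter is", interpreter_tag())
```

Running it with `env\python` should report the same tag on both sides; if the tags differ, pip would be expected to refuse the wheel.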
how about env\python
did you save the updated run-install from above into this applio folder?
Don't worry, happens to the best! Already happy that you're helping me 🙂
okay, now the pip install
torchaudio too
Nope error
?
this is the error at the end of: env\python -m pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl
I mean run second install for torchaudio
okay, run-applio.bat now
that's just librosa warnings, it can be updated to silence them
env\python -m pip install librosa==0.11.0
question is whether the training works
hm
can you post this as a text?
i started again
now i get this
To create a public link, set share=True in launch().
Using HiFi-GAN vocoder
Using HiFi-GAN vocoder
Process Process-2:
Traceback (most recent call last):
File "C:\Applio-3.2.9\env\Lib\multiprocessing\process.py", line 314, in _bootstrap
self.run()
File "C:\Applio-3.2.9\env\Lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\Applio-3.2.9\rvc\train\train.py", line 452, in run
net_g = DDP(net_g, device_ids=[device_id])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\parallel\distributed.py", line 837, in __init__
_sync_module_states(
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\distributed\utils.py", line 311, in _sync_module_states
_sync_params_and_buffers(process_group, module_states, broadcast_bucket_size, src)
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\distributed\utils.py", line 322, in _sync_params_and_buffers
dist._broadcast_coalesced(
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
you need to use device 0
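To see which index belongs to which GPU, something like the sketch below can help. It assumes the ROCm torch build exposes the GPU through the usual `torch.cuda` calls (the bench output later in this thread prints "Using cuda", which suggests it does); `list_gpu_devices` is a made-up helper name, and it returns an empty list when torch is missing or no GPU is visible:

```python
def list_gpu_devices():
    """Return (index, name) pairs for the GPUs torch can see, or [] if none."""
    try:
        import torch
    except ImportError:
        return []  # torch is not installed in this interpreter
    if not torch.cuda.is_available():
        return []  # torch is installed but reports no usable GPU
    return [
        (index, torch.cuda.get_device_name(index))
        for index in range(torch.cuda.device_count())
    ]


for index, name in list_gpu_devices():
    print(index, name)
```

Save it in Applio's folder and run it with `env\python` so it uses the same torch install as the training.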
"D:/jam/TheRock/ml-libs/MIOpen/src/ocl/convolutionocl.cpp:275" i don't have a D:/jam...?
that's normal
could you set env variable
MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=1
and restart applio from a new window?
user variables is fine
at path or a total new one?
yeah
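Why the new window matters: a process copies its environment when it starts, so a cmd window that was already open will not pick up a variable set afterwards. A small sketch of that behaviour, using only the variable named above; `child_sees_flag` is a made-up helper for this note:

```python
import os
import subprocess
import sys

FLAG = "MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD"


def child_sees_flag(value):
    """Start a child Python process with FLAG set and return what it reads."""
    env = dict(os.environ)  # copy the current environment
    env[FLAG] = value       # the child inherits this copy at start-up
    code = "import os; print(os.environ.get({!r}, 'unset'))".format(FLAG)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print("this process:", os.environ.get(FLAG, "unset"))
print("child process:", child_sees_flag("1"))
```

If "this process" prints `unset` in the window Applio is started from, the variable has not reached that window yet.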
it may require some extra work on dev side, seems that they've made some shortcuts in this torch build
go ahead and run training again
i need the whole thing as text
no it is not
only i see everything haha
it just logs every operation
the bar with 37% was the same as what i saw in the beginning
i did
well?
still busy
that can't be right
this is what it did until now
so now it suddenly works???
46%|█████████████████████████████████████▌ | 57/123 [01:57<02:15, 2.05s/it]
it is working
not failing
should i stop this one?
yes
all 3
let's see if it progresses to 1/
what do i need to do :)?
watch the log, if it gets past to the next step that's really weird
It doesn't do anything now (see the end of the txt)... it's the same error
same error
are you running them in a new window every time?
i close the cmd and the applio tab in chrome
okay
i changed all of the variables to 1 and now it's logging again
without error after a few lines
and keeps going past 1st?
I used 6 each time, only once it slid to 7
but with the other errors it was 6 as well
set all variables to 0, use batch 6
you used a different batch here too
figure out which batch sizes work
okay, but if you set all 3 to 1 and use batch 6 again?
but does not break...
till then (when i needed to stop) no
4 gave same error
no, set all flags to 1
oke!
then try different batches
I assume if you try 7 it fails
Should we try?
yes
okay
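The "try different batches" step can be kept organised with a small loop that records which sizes fail. This is only a sketch: `sweep_batch_sizes` and `fake_try_batch` are made-up names, and the fake function fails on odd sizes purely for demonstration. It does not reflect the results in this thread, and a real run would still go through Applio's training tab.

```python
def sweep_batch_sizes(try_batch, batch_sizes):
    """Call try_batch(size) for each size; map size -> None or the error text."""
    results = {}
    for size in batch_sizes:
        try:
            try_batch(size)
            results[size] = None  # no exception: this size worked
        except RuntimeError as err:
            results[size] = str(err)  # keep the message for the report
    return results


def fake_try_batch(size):
    # Hypothetical stand-in for one training attempt.
    if size % 2 == 1:
        raise RuntimeError("simulated failure at batch size {}".format(size))


for size, error in sweep_batch_sizes(fake_try_batch, [4, 5, 6, 7]).items():
    print(size, "ok" if error is None else error)
```

Writing the outcome down per batch size (together with the flag values used) makes it easier to spot a pattern than retrying from memory.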
okay, let's set variables to 0
then open rvc/train/train.py, find line torch.backends.cudnn.benchmark = True
and set it to False
any
i did 6 now, so 6 again?
second = error
Third = error
fourth = error
fifth = error
Sixth = error
seventh = error
eighth = error
Ninth = error
okay
in train.py, after the line you changed, add
torch.backends.cudnn.enabled = False
with the same spacing
restart and try to run training again
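The two train.py edits described above can also be expressed as a text patch that keeps the original indentation. A minimal sketch: `patch_cudnn_flags` is a made-up helper, and `sample` is an invented stand-in because the real contents of rvc/train/train.py are not shown in this thread. As far as I know, on ROCm builds these `cudnn` flags control MIOpen, which would explain why they matter here.

```python
def patch_cudnn_flags(source):
    """Set cudnn.benchmark to False and add cudnn.enabled = False after it."""
    target = "torch.backends.cudnn.benchmark = True"
    patched = []
    for line in source.splitlines():
        if line.strip() == target:
            # Reuse the leading whitespace so the new line has the same spacing.
            indent = line[: len(line) - len(line.lstrip())]
            patched.append(indent + "torch.backends.cudnn.benchmark = False")
            patched.append(indent + "torch.backends.cudnn.enabled = False")
        else:
            patched.append(line)
    result = "\n".join(patched)
    if source.endswith("\n"):
        result += "\n"  # preserve a trailing newline
    return result


# Invented sample text, not the real train.py.
sample = "def run():\n    torch.backends.cudnn.benchmark = True\n    train()\n"
print(patch_cudnn_flags(sample))
```

Doing the edit by hand in an editor works just as well; the point is only that both lines sit at the same indentation level.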
how big is the set?
35
that's better
scalars tab shows the charts you need
you can click this to load more data, then blue square under each chart to resize
let me know if it fails again
i've created a ticket for torch devs, they may ask for some more data
Sure happy to help!
for now it's busy
and when do i know to stop? or like when it's overtraining?
g_total gonna converge
And how can i see it's overtraining?
you can start testing models when it flattens
watch for fm loss
if norm_g does this
the model is wack
that's too high
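"When it flattens" can be made a bit more concrete by comparing the average of the most recent values with the average of the values just before them. A rough sketch only: `has_flattened`, the window of 5 and the 1% tolerance are all assumptions for illustration, not values from Applio or from this thread, and the two lists are invented numbers.

```python
def has_flattened(values, window=5, tolerance=0.01):
    """True when the mean of the last `window` values is within `tolerance`
    (relative) of the mean of the `window` values before them."""
    if len(values) < 2 * window:
        return False  # not enough points to compare two windows
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    if previous == 0:
        return recent == 0
    return abs(recent - previous) / abs(previous) <= tolerance


# Invented example curves, one still falling and one that has levelled off.
falling = [30.0, 28.0, 26.0, 24.0, 22.0, 20.0, 18.0, 16.0, 14.0, 12.0]
flat = [20.0, 20.1, 19.9, 20.0, 20.1, 20.0, 19.9, 20.1, 20.0, 20.0]
print(has_flattened(falling), has_flattened(flat))
```

The charts in the scalars tab remain the main thing to look at; a check like this is just a way to put a number on "flat".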
can i make an index file in the middle of training? (forgot to make it before it started)
yes
@gray violet lmk when you're online, i have one test
Sorry took me a little longer! I’m online now
remind me, did we install Zluda last time?
what does the training/advanced settings show?
The last time we got it working, we didn't use Zluda
could you do me a favor, save this in Applio's folder, then open cmd and run env\python bench.py
i wanna see the speed
it just runs 1000 loops of each operation
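The bench.py file itself is not included in this thread, so the following is only a guess at the general shape of such a timing loop. The name `benchmark_op` and the `_ = op(x)` call appear in the traceback further down; the `loops` and `sync` parameters and the toy operation are assumptions added for this sketch.

```python
import time


def benchmark_op(op, x, loops=1000, sync=None):
    """Run op(x) `loops` times and return the elapsed time in seconds."""
    op(x)  # warm-up call so one-time setup cost is not measured
    if sync is not None:
        sync()  # wait for pending work before starting the clock
    start = time.perf_counter()
    for _ in range(loops):
        _ = op(x)
    if sync is not None:
        sync()  # wait for queued work before stopping the clock
    return time.perf_counter() - start


# Toy CPU operation so the sketch runs without torch installed.
elapsed = benchmark_op(lambda values: [v * 2 for v in values], list(range(100)))
print("toy op : {:.4f}s".format(elapsed))
```

With torch on a GPU, `sync=torch.cuda.synchronize` would be the natural thing to pass, since GPU kernels are launched asynchronously and the timer would otherwise stop too early.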
C:\Applio-3.2.9>env\python bench.py
Using cuda
torch.float32
linear : 0.0335s
conv1d 192x192x1 : 0.0565s
conv1d 192x768x3 : 0.2310s
conv1d 768x768x1 : 0.2482s
up_0 : 0.4821s
up_1 : 0.5306s
up_2 : 0.3540s
up_3 : 0.2720s
dn_0 : 0.1580s
dn_1 : 0.1630s
dn_2 : 0.1595s
dn_3 : 0.0740s
res1a : 0.2180s
MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>, workspace required: 1228800, provided ptr: 0000000000000000 size: 0
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>, workspace required: 1228800, provided ptr: 0000000000000000 size: 0
MIOpen Error: D:/jam/TheRock/ml-libs/MIOpen/src/ocl/convolutionocl.cpp:275: No suitable algorithm was found to execute the required convolution
Traceback (most recent call last):
File "C:\Applio-3.2.9\bench.py", line 75, in <module>
t = benchmark_op(layer.to(dtype), x.to(dtype))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\bench.py", line 23, in benchmark_op
_ = op(x)
^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\conv.py", line 375, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Applio-3.2.9\env\Lib\site-packages\torch\nn\modules\conv.py", line 370, in _conv_forward
return F.conv1d(
^^^^^^^^^
RuntimeError: miopenStatusUnknownError
weird
did we use wheels from scott or from https://d25kgig7rdsyks.cloudfront.net/v2/gfx110X-dgpu/ ?
No never heard of wheels
env\python -m pip install torch-2.7.0a0+rocm_git3f903c3-cp311-cp311-win_amd64.whl
whl = wheel
we did
okay, wanna try a newer build?
we installed these
yeah, there's newer build
env\python -m pip install --upgrade --index-url https://d2awnip2yjpvqn.cloudfront.net/v2/gfx110X-dgpu torchaudio
then try the bench again
okay, torchaudio then next
okay, try env\python bench.py now
could you edit env\lib\site-packages\torch\__init__.py
find 0x0001 and change it to 0x0000 then save the file and try the bench again
this one?
yes
same error
okay, that file should go to windows\system32
once you copy it over run the bench again
are those env variables still set?
yeah didn't change after it worked
performance-wise the numbers seem fine... except this failure
okay, I'm gonna make a ticket for the devs and see what they say
this is really weird
Oke! Happy to help 🙂