I can't get CUDA to get working correctly | Linus Tech Tips | Page 1

vestal current Apr 21, 2023, 9:32 AM

#

Hello!!
I was trying to train an ML model on my RTX 3070 Ti. It was working at first, but I accidentally started another process, which started on the same thread as the previous process and caused the CUDA kernel to get corrupted. Because of this the computer crashed and restarted. After the reboot I tried to run the model training again and was confronted with an error stating:

```
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
I thought this error was related to an unexpected access to shared memory and analyzed the process with Nsight Systems, but didn't find anything valuable.
So I updated the GPU drivers, reinstalled CUDA and gave it another try, but this time the CUDA kernel abruptly stops (that is what I think is happening), which causes the program to exit in the middle of execution without any error logs or any errors on the console.

I am not sure what to do or what exactly is happening.
Please help me out with this.
Cheers (:
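
The hint in the error message can be tried without touching the training code. Below is a minimal sketch, assuming the training is started from a command line such as `python train.py` (a hypothetical script name; the function name is also made up for illustration). It runs the command in a child process with `CUDA_LAUNCH_BLOCKING=1` set, so kernel errors should be reported at the call that caused them, and it returns the exit code so that a silent crash is still visible to the parent.

```python
import os
import subprocess


def run_with_blocking_launches(cmd):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 set and return its exit code."""
    env = dict(os.environ)
    # Makes CUDA kernel launches synchronous, as the error message suggests.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    completed = subprocess.run(cmd, env=env)
    # On Linux a negative code means the child was killed by a signal.
    return completed.returncode
```

Hypothetical usage: `run_with_blocking_launches(["python", "train.py"])`. A non-zero result with nothing printed on the console would at least confirm that the process is dying rather than finishing early.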

vernal shale Apr 21, 2023, 10:24 AM

#

vestal current Hello!! I was trying to train a ML model with my RTX 3070TI , it was working at ...

This looks more like the program you're using to train that model is broken or incompatible with the default gaming drivers.

vestal current Apr 21, 2023, 1:10 PM

#

Well, I thought the same, but it is working with the same GPU I chose from Colab.

#

When I checked the memory access using Nsight, I found that the same memory address was accessed at the same time by two processes. Maybe that is the issue?

rose furnace Apr 21, 2023, 1:12 PM

#

vestal current when I checked the memory access using Nsight I found that the same memory addre...

That is not good, whoa. That will cause thread issues and memory misreads. If two processes start from the same application, they should each have their own memory space, with an application offset. The Commodore 64 used to suffer from memory misreads, because they used the same lines of code in memory.

terse locust Apr 21, 2023, 1:15 PM

#

So two processes are accessing the same block of memory in the GPU? Are you allocating the memory being used correctly?
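
Since the trouble started when a second training process was launched by accident, one option is to make the training script refuse to start twice. Below is a minimal sketch, assuming a lock file path of your own choosing (the class names and `train.lock` are hypothetical) and nothing about how the original CLI is written. It relies on exclusive file creation, which fails if the file already exists. Note that two separate processes normally each get their own CUDA context and virtual address space, so seeing the same address value in both does not by itself mean they touched the same physical memory.

```python
import os


class AlreadyRunning(RuntimeError):
    """Raised when the lock file already exists."""


class SingleInstance:
    """Context manager that holds an exclusive lock file while training runs."""

    def __init__(self, path):
        self.path = path
        self.fd = None

    def __enter__(self):
        try:
            # O_EXCL makes creation fail if the file is already there.
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise AlreadyRunning(f"lock file {self.path} exists") from None
        # Record the owner's PID so a stale lock can be traced by hand.
        os.write(self.fd, str(os.getpid()).encode())
        return self

    def __exit__(self, exc_type, exc, tb):
        os.close(self.fd)
        os.remove(self.path)
        return False
```

Hypothetical usage: wrap the training loop in `with SingleInstance("train.lock"):`. If the machine crashes hard, as described above, the lock file is left behind and has to be deleted manually before the next run.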

vestal current Apr 21, 2023, 1:30 PM

#

terse locust So two processes are accessing the same block of memory in the GPU? Are you allo...

Yes, I am doing that.

#

Maybe the GPU monitoring software is interfering with the process.

#

My PC is an Alienware, and it has Alienware software installed that controls the lighting and allows overclocking of the GPU. Maybe that is the cause of this issue?
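
One way to check whether something else holds a compute context on the card is to ask `nvidia-smi` for its list of compute processes. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are made up for illustration. A tool that only reads temperatures and clocks will typically not appear in this list, so an empty result does not rule the vendor software out.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(text):
    """Turn nvidia-smi CSV lines into (pid, name, used_memory) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(",")
        if len(parts) < 3 or not parts[0].strip().isdigit():
            continue  # skip blank or unexpected lines
        pid = int(parts[0].strip())
        used_memory = parts[-1].strip()
        # Rejoin the middle in case the process path itself contains a comma.
        name = ",".join(parts[1:-1]).strip()
        rows.append((pid, name, used_memory))
    return rows


def list_compute_apps():
    """Return processes with a compute context, or [] if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    return parse_compute_apps(result.stdout)
```

Running `list_compute_apps()` while training is active, and again with the vendor software closed, would show whether an unexpected second process is using the GPU for compute.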

rose furnace Apr 21, 2023, 4:12 PM

#

Well, software should keep to its own allocated part of memory 🤔

#

At least in RAM, since it has an offset based on the memory allocated.

vestal current Apr 21, 2023, 5:47 PM

#

I have no idea how the software allocates memory.

#

Technically it should only be monitoring the temps and clocks of the GPU, unless it is trying to do something else!?

rose furnace Apr 21, 2023, 5:55 PM

#

What software are you using to train?

vestal current Apr 22, 2023, 9:16 AM

#

rose furnace What software are you using to train?

Oh, I actually wrote a CLI application to train my CV models.

rose furnace Apr 22, 2023, 9:26 AM

#

A CLI in which language?

#

Because if you wrote it yourself, it might be running in unsafe memory space.

rose furnace Apr 22, 2023, 10:27 AM

#

@vestal current which language did you write it in? 🤔
