Hello!!
I was trying to train an ML model on my RTX 3070 Ti. It was working at first, but I accidentally started another process, which started on the same thread as the previous one and caused the CUDA kernel to get corrupted. Because of this the computer crashed and restarted. After the reboot I tried to run the training again and was confronted with this error:
```
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
I thought this error was related to an unexpected access to shared memory, so I analyzed the process with Nsight Systems, but didn't find anything useful.
So I updated the GPU drivers, reinstalled CUDA, and tried again. This time the CUDA kernel seems to stop abruptly (that is what I think is happening), and the program exits in the middle of execution without any error logs or any errors on the console.
I am not sure what to do or what exactly is happening.
Please help me out with this.
Cheers (:
That will cause thread issues and memory misreads. If two processes start in the same application, they should each have their own memory space, with an application offset. The Commodore 64 used to suffer from memory misreads because programs used the same lines of code in memory.
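For the silent exit described in the question, it may help to launch the training from a small wrapper that shows how the process actually ended. Below is a minimal sketch, not a definitive fix. It sets `CUDA_LAUNCH_BLOCKING=1` (the variable the error message itself suggests) and `PYTHONFAULTHANDLER=1` (a standard Python variable that dumps a traceback on fatal signals), then reports the exit code. The helper names `run_with_diagnostics` and `describe_exit` and the script name `train.py` are made up for illustration.

```python
import os
import signal
import subprocess
import sys


def run_with_diagnostics(cmd):
    """Run cmd with synchronous CUDA launches enabled; return its exit code."""
    env = dict(os.environ)
    # Suggested by the CUDA error message: report kernel errors at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask Python to print a traceback if the interpreter dies from a fatal signal.
    env["PYTHONFAULTHANDLER"] = "1"
    return subprocess.run(cmd, env=env).returncode


def describe_exit(returncode):
    """Turn an exit code into a short human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"exited with error code {returncode}"


if __name__ == "__main__":
    # Hypothetical usage: python wrapper.py python train.py
    if len(sys.argv) > 1:
        code = run_with_diagnostics(sys.argv[1:])
        print(describe_exit(code), file=sys.stderr)
```

If the wrapper reports a signal (or, on Windows, a large non-zero code), the process was killed or crashed in native code rather than exiting through a Python exception, which would explain why nothing shows up on the console.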