#I can't get CUDA working correctly


vestal current
#

Hello!!
I was trying to train an ML model on my RTX 3070 Ti. It was working at first, but then I accidentally started another process, which ran on the same thread as the previous process and caused the CUDA kernel to get corrupted. Because of this the computer crashed and restarted. After the reboot I tried to run the model training again and was confronted with an error stating:

```
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
I thought this error was related to an unexpected access to shared memory and analyzed the process with Nsight Systems, but didn't find anything valuable.
So I updated the GPU drivers, reinstalled CUDA, and tried again, but this time the CUDA kernel abruptly stops (that is what I think is happening), which causes the program to exit in the middle of execution without any error logs or any errors on the console.

I am not sure what to do or what exactly is happening.
Please help me out with this.
Cheers (:
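
The error text above suggests passing `CUDA_LAUNCH_BLOCKING=1`, and the later symptom is a silent exit. A minimal sketch of one way to follow up on both, assuming the training is started from a Python script: launch it as a child process with that variable set, then look at the exit code. The helper names `run_blocking` and `describe_exit` are hypothetical and not part of any library.

```python
import os
import subprocess
import sys


def describe_exit(returncode):
    """Turn a child process return code into a short human-readable hint."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # POSIX: a negative code means the child was killed by that signal number.
        return f"killed by signal {-returncode}"
    if returncode >= 0xC0000000:
        # Windows: NTSTATUS-style codes such as 0xC0000005 (access violation).
        return f"crashed with status 0x{returncode:08X}"
    return f"exited with error code {returncode}"


def run_blocking(cmd):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 set only for the child process."""
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    result = subprocess.run(cmd, env=env)
    return result.returncode
```

For example, `print(describe_exit(run_blocking([sys.executable, "train.py"])))`, where `train.py` stands in for the actual training script. A nonzero code with no traceback would at least hint that the process was terminated from outside Python rather than finishing on its own.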
vernal shale
vestal current
#

Well, I thought the same, but it is working with the same GPU I chose from Colab.

#

When I checked the memory access using Nsight, I found that the same memory address was accessed at the same time by two processes. Maybe that is the issue?

rose furnace
terse locust
#

So two processes are accessing the same block of memory in the GPU? Are you allocating the memory being used correctly?
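
One way to check which processes are actually using the GPU is to query `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the PATH; the function names are hypothetical. On some setups the memory column is reported as `[N/A]`, so it is kept as a plain string.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows from nvidia-smi CSV output."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from both ends so commas inside the process name survive.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        apps.append({"pid": int(pid), "name": name.strip(), "used_memory": used.strip()})
    return apps


def list_gpu_compute_apps():
    """Ask nvidia-smi which processes currently hold a compute context."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling `list_gpu_compute_apps()` just before training starts shows whether a leftover training process, or some other tool, is still holding the GPU.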

vestal current
#

Maybe the GPU monitoring software is interfering with the process.

#

I have this software from Alienware installed on my PC (it is an Alienware machine), which controls the lighting and allows overclocking the GPU. Maybe that is the cause of this issue?

rose furnace
#

Well, software should keep to its own allocated part of memory 🤔

#

At least in RAM, since it has an offset based on the memory allocated.

vestal current
#

I have no idea how the software allocates memory.

#

Technically it should only be monitoring the temps and clocks of the GPU, unless it is trying to do something else!?

rose furnace
#

What software are you using to train?

vestal current
rose furnace
#

CLI in which language?

#

Because if you wrote it yourself, it might be in unsafe memory space.

rose furnace
#

@vestal current which language do you write in? 🤔