Is there a way to automatically restart ML containers when they error out? I tried wrapping the code in main.py in a try/except, but that didn't help. The container responds with a 502 to new ML tasks, yet the ping endpoint continues to pong.
It seems to slowly eat more and more RAM when loading the GPU.
Currently, the containers hang until they are manually stopped and started. I have an external script watching them, but it would be nice if I could build that into the container itself.
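To illustrate what moving the watcher inside the container might look like, here is a minimal sketch, not a tested solution. It assumes that main.py is the container's main process and that the container runs with a Docker restart policy such as `--restart on-failure`, so that exiting with a non-zero code triggers a restart. The names `task_endpoint_ok` and `watch`, the thresholds, and any URL passed in are hypothetical. Since the ping endpoint keeps answering even when tasks fail, the check would have to target whatever route actually returns the 502.

```python
import os
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def task_endpoint_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the given URL answers without a 5xx status.

    The URL is an assumption: it should be the route that starts
    returning 502, not the ping route (which stays healthy).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still means the server is responding; 5xx counts as failure.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def watch(
    check: Callable[[], bool],
    max_failures: int = 3,
    interval: float = 30.0,
    on_dead: Callable[[], None] = lambda: os._exit(1),
    sleep: Callable[[float], None] = time.sleep,
    max_checks: Optional[int] = None,
) -> int:
    """Call check() repeatedly; after max_failures consecutive failures,
    call on_dead(). By default on_dead() kills the whole process so the
    container exits and the restart policy can bring it back.

    Returns the number of consecutive failures when the loop stops.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check():
            failures = 0  # any success resets the counter
        else:
            failures += 1
            if failures >= max_failures:
                on_dead()
                return failures
        sleep(interval)
    return failures
```

In main.py this could be started as a background thread, for example `threading.Thread(target=watch, args=(lambda: task_endpoint_ok(url),), daemon=True).start()`, where `url` is a placeholder for the real task route. Requiring several consecutive failures is meant to avoid restarting on a single slow response while the GPU is busy.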
I don't need a working solution, but a pointer to what would need to be modified (if this is possible at all) would be appreciated.
The first screenshot was taken when I started typing; the second was taken when I finished.