#Container killed 137

summer hemlock

Hi, I'm trying to run ai-toolkit on serverless with a custom Docker container. However, when it gets to the training step (with Chroma, for example), the container just crashes.

I'm running it on:

```
2026-04-08T14:32:53.436682549Z Python: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]
2026-04-08T14:32:53.436690651Z PyTorch: 2.9.1+cu128
2026-04-08T14:32:53.748010427Z CUDA available: True
2026-04-08T14:32:53.748050935Z CUDA version: 12.8
2026-04-08T14:32:53.770048705Z GPU: NVIDIA H100 80GB HBM3
2026-04-08T14:32:53.770074686Z GPU memory: 79.2 GB
2026-04-08T14:32:53.770083067Z === END SYSTEM INFO ===
```
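The system info above reports GPU memory but not the RAM available to the container, which may differ between a pod and a serverless worker. As a minimal sketch, assuming a Linux container with the standard cgroup memory files mounted, something like the following could be added to the startup diagnostics. The function names (`parse_limit`, `container_memory_limit_bytes`, `format_limit`) are hypothetical and not part of ai-toolkit.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations for the memory limit (assumption: one of them is mounted).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_limit(raw: str) -> Optional[int]:
    """Parse the contents of a cgroup limit file; cgroup v2 writes 'max' for no limit."""
    raw = raw.strip()
    if raw == "max":
        return None
    return int(raw)


def container_memory_limit_bytes() -> Optional[int]:
    """Return the container memory limit in bytes, or None if unlimited or unreadable."""
    for path in CGROUP_LIMIT_FILES:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


def format_limit(limit: Optional[int]) -> str:
    """Render the limit in GiB for a log line."""
    if limit is None:
        return "unlimited or unknown"
    return f"{limit / 1024**3:.1f} GiB"


if __name__ == "__main__":
    print(f"Container memory limit: {format_limit(container_memory_limit_bytes())}")
```

Note that cgroup v1 reports "no limit" as a very large integer rather than `max`, so an implausibly large value there should be read as unlimited.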

The container dies instantly when I get to this point:
`2026-04-08T14:33:28.471171993Z Using latest Chroma version: v50`

This also happens with local models such as `chroma.safetensors`.
There it crashes at this point:
`2026-04-08T14:10:31.310133929Z Loading transformer
2026-04-08T14:10:31.930558318Z Double Blocks: 19
2026-04-08T14:10:31.930578639Z Single Blocks: 38`

I've tried using:
- big (100GB+) network storage for everything
- big local runner storage
- different GPUs

But the container crashes every time with:
```
v1.4 Pulling from ...lora-train
Digest: sha256:5263e565d5eee6cb66809a9021c6601378dd556854bba6df8b4ee3be47aabb6e
Status: Image is up to date for ...lora-train:v1.4
worker is ready
error creating container: container: create: container create: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.51/containers/create?name=tohz9yldlnoisd-0": context deadline exceeded
create container ...lora-train:v1.4
error creating container: container: create: container create: exit status 1
create container ...lora-train:v1.4
error creating container: container: create: container create: exit status 1
start container for ...lora-train:v1.4: begin
stop container 2c2666c6479295653bc5d5363f139fa755616aab8356484cad86f0ac93a93712
WARN: container is unhealthy: exit code 137: (docker kill?)
```
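
Exit code 137 in the last log line follows the common convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL). That signal can come from the kernel's out-of-memory killer or from an explicit kill (as the log's own "docker kill?" hint suggests), so the code alone does not say which. A small sketch for decoding such codes, with `describe_exit_code` as a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

On Linux this prints `killed by signal 9 (SIGKILL)`.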

I've tried running the same thing inside a pod with similar specs and the same config, datasets, etc., and it works there. So I guess it has something to do with the serverless part?

The custom Docker image is a thin wrapper on top of the `ostris/aitoolkit` image that just executes training jobs. Any ideas what could cause this?