#Yep Vito. It was a little bit of a


eager oxide
#

@calm spoke I am actually a little unsure if it is a memory issue or not. It started exactly after we upgraded from 0.10.2 -> 0.11.5 (we took a big step, we may roll back and do incremental upgrades until we see the issue being introduced)

error=\"map[error:process \\\"apt-get update\\\" did not complete successfully: exit code: 137 kind:*buildkit.ExecError stack:<nil>]\"

It has plenty of memory available on the node, so if it was killed it must've been inside dagger or buildkit.

What would you like from me to help narrow this down? We may have to step through the versions from 0.10.2 until we find the exact cause.

I am also a bit curious why nobody else sees this issue. It happens when run in Kubernetes, with a fairly recent runc and libcontainer. But we also run quite a few builds (probably 400-500 a day), and right now about 50 of them fail because of this issue. A retry on the same engine can get a successful build through, so it isn't because the engine has a bad cache or something.

#

It should also be mentioned that the 137 is everywhere in the log, both in CI and when run locally, as buildkit seemingly fails to release containers after a run. We initially thought that was part of the issue, but it has been occurring for months with seemingly no effect.
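
For reference, an exit code above 128 conventionally means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9. A minimal sketch of that decoding, assuming a Linux host; the helper name `describe_exit_code` is made up for illustration:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code (hypothetical helper)."""
    if code > 128:
        signum = code - 128  # shells report 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL only says that something killed the process. It does not say whether that was the OOM killer or the runtime tearing a container down.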

calm spoke
#

hmm, well if it is the oom killer kicking in you should see something in kernel logs iirc

eager oxide
#

I'll give it a go. 🕵️ I am guessing it is just the kernel logs on the node. Any other extra logs of interest?

calm spoke
#

not that I know of! maybe @daring shard has more ideas

daring shard
#

@eager oxide if you describe the killed pod do you have anything on the reason field?
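
A rough sketch of checking that reason field programmatically, assuming the pod has been dumped with `kubectl get pod <name> -o json` and loaded as a dict. The helper name `terminated_reasons` is hypothetical; the fields it reads (`status.containerStatuses[].state` / `lastState.terminated`) are part of the Kubernetes pod status:

```python
from typing import Any


def terminated_reasons(pod: dict[str, Any]) -> list[tuple[str, str, int]]:
    """Collect (container, reason, exitCode) for terminated containers."""
    results = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        # Check both the current and the previous container state
        for state_key in ("state", "lastState"):
            term = (cs.get(state_key) or {}).get("terminated")
            if term:
                results.append(
                    (cs.get("name", "?"), term.get("reason", ""), term.get("exitCode", -1))
                )
    return results
```

A container killed for exceeding its memory limit would normally show the reason `OOMKilled` here. An empty result means Kubernetes itself did not record any container termination for the pod.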

eager oxide
#

That is the wild thing. We get 147 internally, but no pods were restarted. So no Kubernetes pods were OOMed themselves. Btw, the engine itself keeps running and serves other jobs while a job on it is seemingly OOMed.

daring shard
#

ok, in that case I'd look for something like Memory cgroup out of memory either in dmesg or journalctl
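
As a sketch, the search could be scripted over saved output from `dmesg` or `journalctl -k`. The marker strings below are an assumption about typical kernel wording, which varies between kernel versions, and `find_oom_lines` is a made-up helper name:

```python
# Substrings that commonly appear in kernel OOM reports (assumed, not exhaustive)
OOM_MARKERS = (
    "memory cgroup out of memory",
    "out of memory: killed process",
    "invoked oom-killer",
    "oom-kill:",
)


def find_oom_lines(log_text: str) -> list[str]:
    """Return the log lines that look like OOM-killer activity."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()  # match case-insensitively
        if any(marker in lowered for marker in OOM_MARKERS):
            hits.append(line)
    return hits
```

An empty result is only weak evidence: the kernel ring buffer may have rotated, or the logs may be from a different node than the one that ran the build.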

#

the reason why the engine pod is not being restarted probably is because the engine is not killed

#

what's being killed is what the engine launches

eager oxide
#

Yep I'll see if I can get node access tomorrow and see what I can find

daring shard
#

which is under the same cgroup hierarchy that the engine is in
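
If the node uses cgroup v2, another place to look is the `memory.events` file of the engine's cgroup. Its counters are hierarchical, so OOM kills in child cgroups should also show up there. A minimal sketch under that assumption; the function names are hypothetical, and the cgroup directory has to be supplied by whoever has node access:

```python
from pathlib import Path


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.events content ("key value" per line)."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_oom_kills(cgroup_dir: str) -> int:
    """Read the oom_kill counter for a cgroup directory (cgroup v2 only)."""
    text = (Path(cgroup_dir) / "memory.events").read_text()
    return parse_memory_events(text).get("oom_kill", 0)
```

A non-zero `oom_kill` that increases around the time of a failed build would point at the OOM killer. A counter that stays at zero would suggest the SIGKILL comes from somewhere else.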

eager oxide
#

Yep, it makes sense. Definitely threw us for a loop, as the engine itself also logs a 147 SIGKILL every time a service is supposed to be stopped. Happens locally as well. Which kind of hid this issue.

daring shard
#

👍

#

let us know if you still can't find anything so we can keep looking 🙏

eager oxide
#

@slender flicker I saw in the recent commits you were also battling a few 137 errors. I mistyped above btw, we're also seeing quite a few 137 on 0.11.5

#

I've had a look in the kernel logs, and journalctl, and found nothing of note.

We saw some network bridging being deactivated and whatnot, but it seems to be standard behavior, so it isn't really worth reporting. I can get the kernel logs if they are of interest.

calm spoke
#

yeah, we ran into this on a couple of PRs last night so things started seeming a little suspect. thanks for reporting, it helped connect the dots 🙂 - still not sure what exactly is going on, but it doesn't seem like the simple answer (oom)

eager oxide
#

I won't do more in this regard for now as you're already aware of the issue. Though do let me know if you need some data or something. We'd be happy to help because we're pretty affected by this issue currently

daring shard
#

@eager oxide we just published a new version of Dagger, v0.11.7. Mind updating to that one and checking if things are better now?

eager oxide
#

I'll give it a shot tomorrow

eager oxide
#

We're rolling it out over the next few hours and will let you know if it works or not. Fingers crossed

eager oxide
#

0.11.7 has been fully deployed. The issue seems to have been fixed. We've run it for about 3 hours now and haven't seen builds crash because of this kind of stability issue.