Yep Vito. It was a little bit of a | Dagger | Page 1

eager oxide Jun 10, 2024, 1:35 PM

#

@calm spoke I am actually a little unsure if it is a memory issue or not. It started exactly after we upgraded from 0.10.2 -> 0.11.5 (we took a big step, we may roll back and do incremental upgrades until we see the issue being introduced)

error=\"map[error:process \\\"apt-get update\\\" did not complete successfully: exit code: 137 kind:*buildkit.ExecError stack:<nil>]\"

It has plenty of memory available on the node, so if it was killed it must've been inside dagger or buildkit.
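
For reference, exit code 137 follows the usual 128 + N convention for a process ended by signal N, so it decodes to signal 9 (SIGKILL). Below is a minimal Python sketch of that decoding. `describe_exit_code` is a name made up for illustration, and the signal names come from the platform's `signal` module (Linux assumed). It only shows that a SIGKILL arrived, not whether the OOM killer or something else sent it.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + N 'killed by signal N' convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name for this signal number on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number has no name on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```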

What would you like from me to show the issue further? We may have to slowly step the version up from 0.10.2 until we find the exact cause.

I am also a bit curious why nobody else sees this issue. It happens when run in Kubernetes, with a fairly recent runc and libcontainer. But we also run quite a few builds (probably 400-500 a day), and right now about 50 of them fail because of this issue. A retry on the same engine can get a successful build through, so it isn't because the engine has a bad cache or something

#

It should also be mentioned that the 137 is everywhere in the logs, both in CI and when run locally, as buildkit seemingly fails to release containers after a run. We initially thought that was part of the issue, but it has been occurring for months, with seemingly no effect

calm spoke Jun 10, 2024, 3:41 PM

#

hmm, well if it is the oom killer kicking in you should see something in kernel logs iirc

eager oxide Jun 10, 2024, 8:11 PM

#

I'll give it a go. 🕵️ I am guessing it is just the kernel logs on the node. Any other logs of interest?

calm spoke Jun 10, 2024, 8:14 PM

#

not that I know of! maybe @daring shard has more ideas

daring shard Jun 10, 2024, 8:19 PM

#

@eager oxide if you describe the killed pod do you have anything on the reason field?
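
If the pod's own container had been OOM-killed, Kubernetes would normally record that under `state.terminated` or `lastState.terminated` in the container status, with a reason such as `OOMKilled`. Below is a rough sketch that pulls those fields out of `kubectl get pod <pod> -o json` output. The helper name `terminated_reasons` is hypothetical, and the pod name is left as a placeholder.

```python
import json


def terminated_reasons(pod_json: str) -> list[tuple[str, str, int]]:
    """Return (container, reason, exit code) for every terminated state found.

    Expects the JSON text of a single pod, as printed by
    `kubectl get pod <pod> -o json`.
    """
    pod = json.loads(pod_json)
    found = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # Check both the current state and the previous (pre-restart) state.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated:
                found.append(
                    (
                        status.get("name", ""),
                        terminated.get("reason", ""),
                        terminated.get("exitCode", -1),
                    )
                )
    return found
```

An empty result means the pod's containers never terminated, which would be consistent with a kill happening further down, inside something the engine launched.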

eager oxide Jun 10, 2024, 8:21 PM

#

That is the wild thing. We get 147 internally but no pods were restarted. So no Kubernetes pods were OOM-killed themselves. Btw the engine itself keeps running and serves other jobs while we've got a job on it that is seemingly OOM-killed

daring shard Jun 10, 2024, 8:23 PM

#

ok, in that case I'd look for something like "Memory cgroup out of memory" in either dmesg or journalctl
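
Below is a small sketch of that search, assuming the kernel log has already been captured as text lines (for example from `dmesg` or `journalctl -k`). The pattern is an assumption about typical OOM-killer wording and may need adjusting for a given kernel version. `find_oom_lines` is a hypothetical helper.

```python
import re

# Phrases the kernel OOM killer commonly logs (assumed; wording varies by kernel).
OOM_PATTERN = re.compile(r"out of memory|oom-kill", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

It accepts any iterable of strings, so an open file or `sys.stdin` fed from a pipe would both work. An empty result suggests the kill did not come from the kernel OOM killer, or that the log has already rotated.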

#

the reason the engine pod is not being restarted is probably that the engine itself is not killed

#

what's being killed is what the engine launches

eager oxide Jun 10, 2024, 8:23 PM

#

Yep I'll see if I can get node access tomorrow and see what I can find

daring shard Jun 10, 2024, 8:24 PM

#

which is under the same cgroup hierarchy that the engine is in
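
One way to check this without the kernel log, assuming the node uses cgroup v2: each cgroup directory has a `memory.events` file whose `oom_kill` counter is, by default, hierarchical, so kills of processes in child cgroups also show up in the parent's count. Below is a sketch of reading it. The function names are made up for illustration, and the cgroup directory to pass in depends on the setup (inside a container it is often `/sys/fs/cgroup`, but that is an assumption).

```python
from pathlib import Path


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.events content ('key value' per line)."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a simple 'name count' pair.
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_kill_count(cgroup_dir: str) -> int:
    """Read the oom_kill counter for the given cgroup directory (0 if absent)."""
    text = Path(cgroup_dir, "memory.events").read_text()
    return parse_memory_events(text).get("oom_kill", 0)
```

A counter that rises while builds fail with 137 would point at a cgroup memory limit. A counter that stays at zero would suggest the SIGKILL comes from somewhere else.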

eager oxide Jun 10, 2024, 8:25 PM

#

Yep it makes sense. Definitely threw us for a loop as the engine itself also logs a 147 sigkill every time a service is supposed to be stopped. Happens locally as well. Which kind of hid this issue

daring shard Jun 10, 2024, 8:25 PM

#

👍

#

let us know if you still can't find anything so we can keep looking 🙏

eager oxide Jun 11, 2024, 1:53 PM

#

@slender flicker I saw in the recent commits you were also battling a few 137 errors. I mistyped above btw, we're also seeing quite a few 137 on 0.11.5

#

I've had a look in the kernel logs, and journalctl, and found nothing of note.

We had some networking bridging being deactivated and whatnot, but it seems to be standard behavior, so it isn't really worth reporting. I can get the kernel logs if it is of interest.

calm spoke Jun 11, 2024, 3:15 PM

#

yeah, we ran into this on a couple of PRs last night so things started seeming a little suspect. thanks for reporting, it helped connect the dots 🙂 - still not sure what exactly is going on, but it doesn't seem like the simple answer (oom)

eager oxide Jun 11, 2024, 4:36 PM

#

I won't do more in this regard for now as you're already aware of the issue. Though do let me know if you need some data or something. We'd be happy to help because we're pretty affected by this issue currently

daring shard Jun 11, 2024, 5:19 PM

#

@eager oxide we just published a new version of Dagger, v0.11.7. Mind updating to that one and checking if things are better now?

eager oxide Jun 11, 2024, 5:27 PM

#

I'll give it a shot tomorrow

eager oxide Jun 12, 2024, 6:46 AM

#

We're rolling it out over the next few hours and will let you know if it works or not. Fingers crossed

eager oxide Jun 12, 2024, 7:06 AM

#

https://github.com/dagger/dagger/issues/7630

GitHub

🐞 dagger.io/dagger v0.11.7 is not available through dagger.io · Iss...

What is the issue? I think your proxy was left out in the cold when you pushed the latest release to github Dagger version dagger v0.11.7 Steps to reproduce go get dagger.io/dagger => dagger.io/...

eager oxide Jun 13, 2024, 1:01 PM

#

0.11.7 has been fully deployed. The issue seems to have been fixed. We've run it for about 3 hours now and haven't seen builds crash because of this kind of stability issue.
