#anyone have an example github actions
it's kind of a moot point right now, i can't get my dagger based workflow to complete successfully on github actions ... x86 even ... it either abruptly aborts (and google seems to indicate this is from a process using too much cpu) or else it hangs indefinitely :/
GH aborts the action if it's consuming a lot of CPU? interesting..
i had several attempts simply cancel on their own with no error message in the standard web interface, but when i viewed the workflow debug raw output the last thing that it showed was that the workflow had been cancelled with SIGINT or something similar, i googled the exact message and the results seemed to indicate that yes, too much cpu and the runner cancels the workflow
but i'm more concerned about the workflow runs that ran for 30min (with no more debug output after about 15min) which i ended up cancelling myself
would it be possible to share the dagger plan with us in a way that we could run it without the actual project code?
not without a lot of auditing on my side first i think
when everything is cached, it takes about 2min to run the identical dagger plan on my local 3-year-old workstation
gonna run it with --no-cache here to see how long it takes
User time (seconds): 151.11
System time (seconds): 6.92
Percent of CPU this job got: 78%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:21.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 11848448
that's with --no-cache
We get some timeouts in our CI with GitHub Actions which forces us to re-run the workflows. I see that when reviewing PRs.
Seems to be a GHA thing
i never saw any such timeouts or aborts with our old build system... which is leading me to believe that dagger+buildx is just too intense for gha :/
I'm looking to see what other projects encounter this and perhaps what they do to work around:
@bleak terrace @fading sun do you have any insights into why GHA timeouts may pop up in CI?
@wary plover can you share what your GHA workflow looks like? Caching bits, etc
env:
  DAGGER_CACHE_BASE: dagger-ci-build
  DAGGER_LOG_LEVEL: debug
jobs:
  build-publish:
    name: Build image and publish
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Rentals-API
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
          path: src/Rentals-API
          ref: ${{ github.ref }}
      - name: Extract branch name
        shell: bash
        run: echo "GITHUB_BRANCH=$(echo ${GITHUB_REF#refs/heads/})" >> $GITHUB_ENV
      - name: Configure caches
        run: |
          echo "DAGGER_CACHE_TO=type=gha,mode=max,scope=${{env.DAGGER_CACHE_BASE}}-${{env.GITHUB_BRANCH}}" >> $GITHUB_ENV
          echo "DAGGER_CACHE_FROM=type=gha,scope=${{env.DAGGER_CACHE_BASE}}-${{env.GITHUB_BRANCH}}" >> $GITHUB_ENV
      - name: Dagger Release
        uses: dagger/dagger-for-github@v3
        with:
          workdir: src/Rentals-API
          cmds: |
            do verboseRelease
            do verboseRollout
      - name: Print Buildkitd Logs
        if: ${{ failure() }}
        run: |
          docker logs dagger-buildkitd
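For readers following the cache configuration in the workflow above, here is a minimal Python sketch of what the "Extract branch name" and "Configure caches" steps compute. The function name `cache_settings` is hypothetical and used only for illustration; the scope format is copied from the workflow as posted.

```python
def cache_settings(github_ref: str, cache_base: str) -> dict:
    """Mirror the 'Extract branch name' and 'Configure caches' steps."""
    prefix = "refs/heads/"
    # Shell's ${GITHUB_REF#refs/heads/} strips the prefix only when it is present.
    if github_ref.startswith(prefix):
        branch = github_ref[len(prefix):]
    else:
        branch = github_ref
    scope = f"{cache_base}-{branch}"
    return {
        "GITHUB_BRANCH": branch,
        "DAGGER_CACHE_TO": f"type=gha,mode=max,scope={scope}",
        "DAGGER_CACHE_FROM": f"type=gha,scope={scope}",
    }


settings = cache_settings("refs/heads/main", "dagger-ci-build")
print(settings["DAGGER_CACHE_TO"])
print(settings["DAGGER_CACHE_FROM"])
```

One consequence worth noting: for a ref that does not start with `refs/heads/` (a tag or pull request ref, for example), the prefix is left in place, so the cache scope contains the full ref string.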
oh sorry, should have used a pastebin
obviously that doesn't include any of our primary env vars
verboseRelease action is basically a docker.#Build that has 3 steps:
- emit start message via slack
- build everything and push to AWS ECR
- emit end message via slack
Maximum resident set size (kbytes): 11848448
If I'm reading that correctly, it's saying it used up to 11 GB of RSS when you ran locally. GHA runners appear to only have 7GB of RAM (https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources) so, it could be that you are entering swap hell in GHA and timing out the build?
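The arithmetic behind that reading can be sketched as follows. This assumes the "kbytes" reported by `/usr/bin/time -v` are units of 1024 bytes (as on Linux); the helper name `max_rss_gib` and the constant `RUNNER_RAM_GB` are illustrative only, with the 7 GB figure taken from the message above.

```python
import re


def max_rss_gib(time_output: str) -> float:
    """Pull 'Maximum resident set size' out of `time -v` output, in GiB."""
    match = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    # Assumption: 1 kbyte == 1024 bytes, so KiB -> GiB is a division by 1024 * 1024.
    return int(match.group(1)) / (1024 * 1024)


sample = "Maximum resident set size (kbytes): 11848448"  # line posted in the thread
RUNNER_RAM_GB = 7  # hosted runner RAM as stated in the thread

peak = max_rss_gib(sample)
print(f"peak RSS: {peak:.2f} GiB, runner RAM: {RUNNER_RAM_GB} GB")
print("exceeds runner RAM:", peak > RUNNER_RAM_GB)
```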
heh, yeah, i hadn't noticed that but i think you're right ... my local workstation has 64gb ram which is probably why i didn't notice anything
so...... this seems a problem with dagger ?
Possible, does 11 GB seem high for what you are actually trying to build/deploy?
absolutely... the build itself is basically along the lines of...
- fetch official python docker image
- install build-essential tools in case any of the deps require C lib or compilation
- setup virtualenv
- install this source package along with any of its dependencies
but considering that all runs in like 3min on my local workstation, i dunno
ok... my most minimal action which does...
- download latest debian-slim
- do an apt dist-upgrade
- install git
- mount local src
- run git in local src mount to get some info like branch name
uses 4.4gb ram ... something doesn't seem right
(this is all with --no-cache)
I also just realized that the data you posted must have been from the dagger binary itself (not BuildKit), right? In which case, there almost certainly must be something wrong since the client binary is not the one doing any sort of intensive work
it's actually running like this: /usr/bin/time -v make dagger ACTION="--no-cache report" ... all make is doing is loading a .env to setup necessary env vars and then invoking dagger with the ACTION param
Ah good to know, so yeah the time command will collect the usage data from the command being invoked and any subprocesses, which will include the dagger binary. However, BuildKit is not a subprocess of dagger (it's more like a server running in docker that the dagger binary is a client of), so I don't think the 11GB could include BuildKit.
If that's the case, then 11GB definitely seems like some sort of memory leak
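A rough way to see the point about subprocess accounting: the operating system reports resource usage only for child processes the caller has waited for, which is the same kind of data `time -v` prints. The sketch below is Unix-only (it uses Python's `resource` module), and the child workload is a hypothetical stand-in, not anything dagger does. A process started by a separate daemon, such as a container launched by Docker, would not show up in these numbers.

```python
import resource
import subprocess
import sys


def children_peak_rss() -> int:
    """Peak RSS among waited-for children (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Hypothetical workload: a child process that allocates and fills about 50 MiB.
child_code = "data = b'x' * (50 * 1024 * 1024)"
subprocess.run([sys.executable, "-c", child_code], check=True)

after = children_peak_rss()
print(f"peak RSS of waited-for children: {after}")
```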
(thinking of how best to debug this further)
well... i was just running top while running the same dagger command... and this was a quick cut/paste snapshot of the dagger entry when it was running (about halfway done)...
1082307 rocky 20 0 3582184 2.8g 24096 S 152.3 9.1 0:38.87 dagger
so... that's 152% of CPU as well which means dagger is definitely doing something intensive
i understand how you think buildx is being run via network connection in the docker daemon, but there must be more going on
Oh for sure, the data is clearly saying that the dagger binary is doing something intensive, I just am questioning if it actually should be or if there's some sort of bug causing this to happen
gotcha
well, perhaps it's the Cue interpreter process going haywire
dagger is a Go app right ? using an embedded Cue interpreter of sorts ?
Yes that's one suspect for sure. I'm thinking we may need to run the dagger binary w/ a profiler enabled so we can get insight into what is actually taking up so much memory+cpu.
well, atm it effectively means GHA (at least the github hosted runners) is a no-go for me atm 😦
i'm toying with possibly setting up our own AWS based runners, since that seems to be the best way to do our arm64 builds regardless
if there is a memory leak somewhere in dagger, we should fix it. We had similar issues with cue a while back. Is there a way for us to repro outside of GHA? Something we could dagger do locally would be awesome.
@slow peak for context ☝️
well, in fact the huge amounts of memory i'm seeing being used were all with local runs, not on gha... and those memory issues are what i think we currently suspect are causing the problems on gha
@wary plover have a branch w/ pprof profiling enabled here: https://github.com/sipsma/dagger/commit/a56bd9b2f7b800e2408b29542a27792147e1cdd8
If you run with that binary and then in the middle of the build when memory usage is high separately run go tool pprof http://localhost:6060/debug/pprof/heap you'll drop into a profiling shell from which you can run top to see what's using so much memory.
Are you comfortable with building dagger from that branch yourself? I don't have an x86 machine around at the moment
that's something i can toy around with .. yes... but it'll have to be in off-business hours (ie later this evening)
I can try and get you pre-compiled binaries in a few minutes, if you prefer
i should be fine building it myself, if i have trouble, i know where to scream
Started a build in parallel, which binaries do you need? (e.g. linux amd64?)
yep
This is the prebuilt binary for this
cool, got em
Also -- if you can't run the go tool pprof command on your end, you can just grab the heap information using curl: curl -s http://localhost:6060/debug/pprof/heap > ~/Downloads/base.heap, and send it our way -- we should be able to run pprof on our end
Either way, it's a point in time snapshot -- you should run go tool pprof or curl once you notice it's taking a bunch of memory
(basically it gives insight as to WHAT is taking memory, at that particular point in time)
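Because the heap profile is a point-in-time snapshot, one option is to watch the process's resident memory and grab the profile only once it is high. The sketch below is an assumption-laden illustration: it reads the Linux `/proc/<pid>/status` format, the helper names and the sample status text are hypothetical, and only the heap URL comes from the thread.

```python
import re
import urllib.request

HEAP_URL = "http://localhost:6060/debug/pprof/heap"  # endpoint quoted in the thread


def rss_kib(status_text: str) -> int:
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status on Linux."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found in status text")
    return int(match.group(1))


def should_snapshot(status_text: str, threshold_kib: int) -> bool:
    """True once resident memory has reached the chosen threshold."""
    return rss_kib(status_text) >= threshold_kib


def save_heap_snapshot(dest: str, url: str = HEAP_URL) -> None:
    """Download the current heap profile, like the curl command in the thread."""
    with urllib.request.urlopen(url, timeout=30) as response, open(dest, "wb") as out:
        out.write(response.read())


# Hypothetical status text for a process holding roughly 2.8 GiB resident.
sample_status = "Name:\tdagger\nVmRSS:\t 2936012 kB\n"
print(should_snapshot(sample_status, threshold_kib=2 * 1024 * 1024))
```

In use, a small loop would read `/proc/<pid>/status` for the running dagger process every second or so and call `save_heap_snapshot("base.heap")` the first time `should_snapshot` returns true.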
so i have the profiling dagger installed and i have go tool pprof installed, but when i run dagger now i get...
dagger do report --log-format plain
4:36PM ERROR system | failed to load plan: this plan requires dagger 0.2.21 or newer. Run `dagger version --check` to check for latest version
this plan requires dagger 0.2.21 or newer. Run `dagger version --check` to check for latest version
rocky@devwork:~/dev/rentals/src/Rentals-API$ dagger version
dagger v0.2.21-next (a56bd9b2) linux/amd64
i don't even recall where in my dagger setup/plan i declared it needed dagger >= 0.2.21
@slow peak perhaps the "-next" version suffix you gave it is confusing dagger ?
also, i just tried running a very very simple dagger plan on an amazon EC2 t3a.small instance and it completely froze the VM ... i'm guessing swap-hell
My bad, fixing it
here's the output from running a very basic "report" action that i have that just runs git to get branch info...
(pprof) top
Showing nodes accounting for 1521.79MB, 72.08% of 2111.16MB total
Dropped 206 nodes (cum <= 10.56MB)
Showing top 10 nodes out of 102
      flat  flat%   sum%        cum   cum%
  238.02MB 11.27% 11.27%   238.02MB 11.27%  cuelang.org/go/internal/core/adt.updateCyclic
  211.02MB 10.00% 21.27%   763.58MB 36.17%  cuelang.org/go/internal/core/adt.(*nodeContext).addStruct
  204.15MB  9.67% 30.94%   260.23MB 12.33%  cuelang.org/go/internal/core/adt.(*OpContext).NewPosf
  189.02MB  8.95% 39.89%   189.02MB  8.95%  cuelang.org/go/internal/core/adt.(*Vertex).GetArc
  176.54MB  8.36% 48.26%   176.54MB  8.36%  cuelang.org/go/internal/core/adt.(*Vertex).addConjunct (inline)
  144.02MB  6.82% 55.08%   311.04MB 14.73%  cuelang.org/go/internal/core/adt.(*ForClause).yield
  136.01MB  6.44% 61.52%   136.01MB  6.44%  cuelang.org/go/internal/core/adt.(*Vertex).AddStruct (inline)
   97.48MB  4.62% 66.14%    97.48MB  4.62%  cuelang.org/go/internal/core/adt.getScratch
      67MB  3.17% 69.31%       67MB  3.17%  cuelang.org/go/internal/core/adt.CloseInfo.SpawnRef (inline)
   58.52MB  2.77% 72.08%    58.52MB  2.77%  cuelang.org/go/internal/core/adt.(*ValueError).AddPosition
so it looks like it is Cue that is the memory hog
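To make that reading of the `top` output concrete, the flat allocations can be grouped by Go package. This is only a sketch: the rows are copied from the output above, the helper names are hypothetical, and the parsing assumes every size is reported in MB.

```python
from collections import defaultdict

# Rows copied from the (pprof) top output posted in the thread.
PPROF_ROWS = """\
238.02MB 11.27% 11.27% 238.02MB 11.27% cuelang.org/go/internal/core/adt.updateCyclic
211.02MB 10.00% 21.27% 763.58MB 36.17% cuelang.org/go/internal/core/adt.(*nodeContext).addStruct
204.15MB 9.67% 30.94% 260.23MB 12.33% cuelang.org/go/internal/core/adt.(*OpContext).NewPosf
189.02MB 8.95% 39.89% 189.02MB 8.95% cuelang.org/go/internal/core/adt.(*Vertex).GetArc
176.54MB 8.36% 48.26% 176.54MB 8.36% cuelang.org/go/internal/core/adt.(*Vertex).addConjunct (inline)
144.02MB 6.82% 55.08% 311.04MB 14.73% cuelang.org/go/internal/core/adt.(*ForClause).yield
136.01MB 6.44% 61.52% 136.01MB 6.44% cuelang.org/go/internal/core/adt.(*Vertex).AddStruct (inline)
97.48MB 4.62% 66.14% 97.48MB 4.62% cuelang.org/go/internal/core/adt.getScratch
67MB 3.17% 69.31% 67MB 3.17% cuelang.org/go/internal/core/adt.CloseInfo.SpawnRef (inline)
58.52MB 2.77% 72.08% 58.52MB 2.77% cuelang.org/go/internal/core/adt.(*ValueError).AddPosition
"""
TOTAL_MB = 2111.16  # "of 2111.16MB total" from the pprof header


def go_package(symbol: str) -> str:
    """Return the import path of a Go symbol such as 'a/b/pkg.(*T).Method'."""
    slash = symbol.rfind("/")
    dot = symbol.find(".", slash + 1)
    return symbol if dot == -1 else symbol[:dot]


def flat_mb_by_package(rows: str) -> dict:
    """Sum the 'flat' column per package (assumes all sizes are in MB)."""
    totals = defaultdict(float)
    for line in rows.strip().splitlines():
        fields = line.split()
        flat_mb = float(fields[0].replace("MB", ""))
        totals[go_package(fields[5])] += flat_mb
    return dict(totals)


totals = flat_mb_by_package(PPROF_ROWS)
for package, flat_mb in totals.items():
    print(f"{package}: {flat_mb:.2f}MB ({flat_mb / TOTAL_MB:.2%} of heap)")
```

All ten rows fall under the single package `cuelang.org/go/internal/core/adt`, which is consistent with the conclusion that CUE evaluation, not the build itself, holds most of the heap in this snapshot.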
i'm gonna continue testing, but take my project plan out of the equation (and all of its many associated custom actions) and try on a blank project
so at first glimpse, it appears to be the many deeply nested layers of actions with their deps that seem to be the culprit
almost makes me think there's a memory leak
fwiw, the absolute barest plan in a fresh project is still using 124mb ram according to /usr/bin/time
@young belfry @timber mist FYI ☝️
so here's issue #1 ... when i remove a bunch of actions that aren't needed for my simple "report" action and re-run the "report" action... the dagger memory consumption goes from 5gb to about 1gb and runs waaaay faster
so it's obviously parsing everything even when 95% isn't required
looks like indeed a mem leak in CUE, that's what we were suspecting yesterday with @slow peak
@timber mist do you know if it's a known issue on cue upstream, or if some mem leak was fixed in the latest release?
@bleak terrace I'm not aware of these sorts of cases being solved in cue yet. I suspect it won't happen until the cycle fixes are in.
And those changes have been taking a while to land.
i'm not sure what i can do on my end... i mean consuming 5gb or higher for a simple run makes running my build in GHA impractical and we depend on GHA :/
running this simple dagger command actually killed an AWS EC2 t3a.small vm on me... 2gb of ram
we need to investigate further, is there anything in your config that you can share? If we can reproduce locally, it'll help a lot
i'll see what i can do about extracting some portion, but for me right now, for every action i comment out in my main.cue plan file, memory usage drops quite noticeably ... seems like it's just the overall using/import of lots of cue files
We don't need a fully working config, just enough bits to make it slow and memory intensive
Chances are there's something innocent in there triggering massive amounts of RAM usage by CUE. Could be nesting, or references, or something like that
If you can share something close enough, we'd be happy to run the investigation on our end
yep, even a code "shape" that will allow us to repro? Lots of nested actions, or for loops, or nested definitions, etc. We don't need details/secrets/working scripts/etc.
i'm trying... it just seems like it's the amount of everything that's the central issue
Were you logged in to Dagger Cloud by any chance @wary plover ?
no
ok. if you were ever in doubt, you could dagger logout and run again.
i don't even know what dagger cloud is
helps us to work with folks on debug and such.
ah
i'm still trying to extract enough of this Cue code to have something reasonable that consumes far too much ram
Yeah, if you were logged in to it, on a current version of dagger you'd see something like this at the start of a run:
nope, not seeing anything like that, nor have i ever
Only certain Dagger folks (and yourself) could see the URL which will show the CUE file, how it was invoked, stats, errors.
Yep, you need to log in to activate it
so are you saying i should be using dagger cloud so you guys can see more ?
It could be helpful.
and it won't expose any secrets or anything ?
right. no secrets
ok, i'm cool with doing that... trying to manually extract things is painful and i can't just dump my source base somewhere
Dagger Cloud is under development, but we have just released the first telemetry feature!
does it only expose cue files? or does it also expose the entire source set of my project?
and you don't have any access to build artifacts ?
correct
so I'm hoping we can see enough in that main cue file to get a sense of what's happening for a repro
I know you've got lots of includes, etc
heh, dagger login is failing because i'm in a ssh terminal and the env i'm using has no gui :/
oof. I know there's a workaround. not sure if it's pretty. @shadow skiff ?
yeah... you can log in on your computer and copy the ~/.config/dagger/credentials file to the destination host. I know it's very hacky but it's the only way until we implement headless auth 😢
ok, got it working
@jade plover so if i share the run url here only me and the developers can access?
yes
got it. Thanks! Taking a look.
it looks like that doesn't show the many packages i wrote
Rentals-DaggerIO is not public
yep. it's a bit shallow at the moment.
i recently asked my company for permission to opensource Rentals-DaggerIO but haven't yet gotten a response 😦
for anyone still paying attention, i just removed all docker.#Build use from Rentals-DaggerIO and my plan ... cut the memory usage of my primary action down from approx 10gb to 4.2gb ... so progress
Still paying attention! we've been digging into this with the team. Thanks for the extra context. @slow peak spotted a big leak in cue, we're still on an old version. Knowing that docker.#Build contributes to this will help isolate it. Ideally we can repro on the latest cue version and send a repro upstream. I also see @young belfry is on this thread (Paul is co-author of cue), so clearly all the right people have their eyes on this!
i very much appreciate this... it's amazing knowing the level of support you folks are providing
Happy to help look at a repro
don't suppose there's been any developments on this? i apologize in advance if i'm being impatient π
unfortunately not much progress, I'd recommend relying on a workaround for now (rely less on docker.#Build with the steps array). There are multiple ways to do this, for instance replacing some of those base images with an inlined Dockerfile (https://docs.dagger.io/1241/docker#dockerdockerfile), using docker.#Dockerfile. That should cut your memory usage, although the problem won't disappear. We're working on a fix in the meantime and will update you of course.
The universe.dagger.io module is meant to provide higher level abstractions on top of core actions. Of these, the universe.dagger.io/docker package provides a general base for building and running docker images.
btw, if you need help adapting your config, let us know