#Invalid pointer nil exception from dagger ~0.18.16~ actually 0.18.19
Further context:
- the cache was at about 258gb in size
For now we have just cordoned this node. Thought I would submit these here in case it triggered some bright idea about what might be causing it
@last ember do you have the rest of the stack trace by any chance?
yes
struggling to get them off the logging cluster atm
I will have to reply on monday. Unfortunately the node with the logs disappeared
Ok thanks for trying.
I forwarded to the team, it's night time here, but when they're back online they will probably know where to start looking. We've been incrementally refactoring our buildkit layer (gradually removing it) for the last several months, so you may have hit an edge case related to that.
All good. Thank you very much Solomon!
Yes unfortunately we don't have any logging on the cluster apparently either.
We're getting some though so hope to see this done soon
@last ember can you double-check whether your engines have been upgraded to the latest version by any chance? The reason I ask is that the stack trace references files like /app/internal/buildkit/solver/cachemanager.go:310, which was only introduced in the past week (https://github.com/dagger/dagger/commits/v0.18.19/internal/buildkit/solver/cachemanager.go).
If you're using self-hosted engines, you need to wipe your cache between engine upgrades, as reusing an old cache might lead to issues like this
nit: we should add a log message at engine startup so it gets printed in the logs; I just realized we're not doing that
can confirm we are running 0.18.19
Do you know if you wiped the cache between engine upgrades? If you haven't, it's very likely that this is the issue
Since you're hosting your own engines, it's something you have to manually do every time you upgrade
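One way to make the wipe-on-upgrade rule hard to forget is to record which engine version last used a cache volume and compare it before starting a new engine. A minimal sketch, assuming a marker file kept next to the cache; the marker path, `cache_needs_wipe`, and `record_version` are hypothetical names for illustration, not part of Dagger, and the actual wipe is left to whatever manages the engine's storage:

```python
from pathlib import Path


def cache_needs_wipe(marker: Path, new_version: str) -> bool:
    """Return True when the recorded engine version differs from the one
    about to be started, or when no version was ever recorded."""
    if not marker.exists():
        # Unknown history: treat the cache as untrusted (assumption).
        return True
    return marker.read_text().strip() != new_version


def record_version(marker: Path, version: str) -> None:
    """Remember which engine version last used this cache."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(version + "\n")
```

A deploy script could call `cache_needs_wipe` before rolling the engine pod, wipe the volume when it returns true, and then call `record_version` with the new version.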
we did not ah interesting
okay thanks for the heads up
since I nuked the old dagger engine, haven't had this issue
Thanks for the heads up, I will reflect this back to the team
While things are still in v0 can we expect that the cache has to be wiped each time? When moving to v1 would there be more of a chance that the cache would stay?
We would prefer to keep the cache so we can upgrade frequently without disrupting developers, but understand that a lot is changing right now.
yes at the moment it's better to assume cache must be wiped. And yes for v1 that will change.
Note that in the meantime, in prod you can re-warm the cache as part of the upgrade. A few judiciously chosen dagger call invocations will go a long way
Okay good to know
By the way guys we actually saw this issue again today
But this is on a node with no prior cache this time and definitely been running 0.18.19 the whole time
I have attached the log file for the whole thing this time
Could this have been caused by using an 0.18.16 CLI against the 0.18.19 engine? We forgot to upgrade our CLI version until this morning...
It's been cut over now
Reason: span
Here is another log, thanks
yeah we are still finding this error is happening
and we are now in a clean state with dagger engine and CLI pointing to 0.18.19 and no prior cache
@last ember sorry about that... we'll have an escalation meeting first thing in our morning
all good. Appreciate it
@last ember quick update that we haven't forgotten about this. Someone from the team is looking into it
Cool cool thank you very much
@last ember so, Justin merged this PR today (https://github.com/dagger/dagger/pull/11169), which should give us more hints about where this error might be coming from. We tried to repro it for quite some time but weren't able to.
Is there any chance you could temporarily use this engine version from main by setting export _EXPERIMENTAL_DAGGER_RUNNER_HOST=docker-image://registry.dagger.io/engine:c0cf3c3532f38db629817218050e28219bb08129 and send us the logs when you come across the error again?
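For anyone scripting this, the value is just the engine image reference behind a `docker-image://` prefix. A small sketch of building it and handing it to a child process; `runner_host` and `env_with_engine` are hypothetical helper names, while the variable name and registry path are copied from the message above:

```python
import os

# Registry path as it appears in the message above.
ENGINE_IMAGE = "registry.dagger.io/engine"


def runner_host(commit: str) -> str:
    """Build the docker-image:// value for an engine image tagged with a commit."""
    return f"docker-image://{ENGINE_IMAGE}:{commit}"


def env_with_engine(commit: str) -> dict:
    """Copy the current environment and point the CLI at the given engine image."""
    env = dict(os.environ)
    env["_EXPERIMENTAL_DAGGER_RUNNER_HOST"] = runner_host(commit)
    return env
```

Passing the result as `env=` to `subprocess.run` keeps the override scoped to a single invocation instead of the whole shell session.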
I can put that through today yes
Ah I would need to get the team onto 0.19.0 first I believe
There are some breaking changes in that release that will affect some of our pipelines
It might take some time to move onto 0.19.0 and then after that I could potentially look at this
This should be ok since we have a backwards compatibility mode
No need to bump your modules or client to v0.19
Oh right
@last ember silent ping just to check if by any chance you might have any news about this thread?
@normal pecan bouncing this to you
I've updated to 0.19.3 today. I'll watch the logs for any errors.
thx Luke! feel free to ping if you find anything
We've run into an issue updating to 0.19.3. I updated the client and engine for CI, but left the modules on 0.18.16, and we have hit this error on downstream repos:
time="2025-10-26T23:25:49Z" level=debug msg="GetContentHashKey reusing ref s8590xju5tzldq62ptl3wp2ew with digest sha256:1b7eff97bb88d327c20f411a5118aaa70df471b96dd7bb72332daaffd9441868" client_hostname=dagger client_id=gwj73l7cnt1xifzbwh0qf5x0x session_id=hav2e4tmjrplyfiulanthqufu spanID=0d1d5806eade0eae traceID=ddc1ca19f9d73ec804288f1b3c09ee0b
time="2025-10-26T23:25:49Z" level=error msg="solve error: failed to get directory: failed to get snapshot: failed to snapshot: failed to receive stat message: rpc error: code = NotFound desc = get full root path: rpc error: code = NotFound desc = eval symlinks: lstat /src: no such file or directory\ngithub.com/dagger/dagger/internal/buildkit/solver.(*edge).execOp\n\t/app/internal/buildkit/solver/edge.go:912\ngithub.com/dagger/dagger/internal/buildkit/solver/internal/pipe.NewWithFunction[...].func2\n\t/app/internal/buildkit/solver/internal/pipe/pipe.go:78\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:1693" client_hostname=dagger client_id=gwj73l7cnt1xifzbwh0qf5x0x session_id=hav2e4tmjrplyfiulanthqufu spanID=c318f2706374cc09 traceID=ddc1ca19f9d73ec804288f1b3c09ee0b
I ran a prune on the cache which did not resolve the issue. I just pushed an update to all of the modules to bump their versions now. Waiting on results.
From the cli
This part is sticking out:
[0.0s] | fatal: Needed a single revision
Downstream dagger.json:
{
  "name": "service-data-transformation-platform-operations",
  "engineVersion": "v0.19.3",
  "blueprint": {
    "name": "workflow",
    "source": "git@github.com:nine-digital/library-ci-workflows.git/daggerflows/dbt-service",
    "pin": "main"
  }
}
Seems like the config there is wrong; changing to an inline pin on the source resolves it:
{
  "name": "service-data-transformation-platform-operations",
  "engineVersion": "v0.19.3",
  "blueprint": {
    "name": "workflow",
    "source": "git@github.com:nine-digital/library-ci-workflows.git/daggerflows/dbt-service@main"
  }
}
This is all good now. It was just a misconfiguration of the source field, which is now strictly validated in 0.19.x. I'll keep you updated on the nil pointer error.
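If several downstream repos carry the same pattern, the before/after change above could be applied mechanically. A rough sketch, assuming the only fix needed is folding a branch-style `pin` into `source` as an `@ref` suffix; `inline_blueprint_pin` is a hypothetical helper, and leaving 40-character hex pins untouched is an assumption of this sketch, not something confirmed in the thread:

```python
import copy
import re

# Assumption: a 40-char hex pin is a resolved commit and is left alone.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def inline_blueprint_pin(config: dict) -> dict:
    """Return a copy of a dagger.json dict with a branch-style blueprint pin
    moved into the source field as an @ref suffix."""
    result = copy.deepcopy(config)
    blueprint = result.get("blueprint")
    if not blueprint:
        return result
    pin = blueprint.get("pin")
    source = blueprint.get("source", "")
    # Only the last path segment counts: "git@github.com" also contains "@".
    has_ref = "@" in source.rsplit("/", 1)[-1]
    if pin and source and not has_ref and not FULL_SHA.match(pin):
        blueprint["source"] = f"{source}@{pin}"
        del blueprint["pin"]
    return result
```

The input dict is not modified, so the result can be diffed against the original before writing anything back to disk.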
thanks for the report!
Looks good now, no more nil pointer exceptions.
We are getting a lot of missing snapshot errors though: failed to get directory: failed to get snapshot: failed to snapshot: failed to receive stat message. I know we need to clear the cache when upgrading the engine; does that also apply when the engine pod is recreated without changes?
Full error:
level=error msg="solve error: failed to get directory: failed to get snapshot: failed to snapshot: failed to receive stat message: rpc error: code = NotFound desc = get full root path: rpc error: code = NotFound desc = eval symlinks: lstat /src: no such file or directory\ngithub.com/dagger/dagger/internal/buildkit/solver.(*edge).execOp\n\t/app/internal/buildkit/solver/edge.go:912\ngithub.com/dagger/dagger/internal/buildkit/solver/internal/pipe.NewWithFunction[...].func2\n\t/app/internal/buildkit/solver/internal/pipe/pipe.go:78\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:1693" client_hostname=dagger client_id=ell5fi9pmt2f4s3ufqlym1kuy session_id=qw6xpdo9296qxw25ok9jkllxx spanID=07ba5bf16694fb7e traceID=a754f9ac3e11cfd2b6f6fa9c0dbb7e2b
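Since the thread keeps coming back to collecting engine logs, here is a rough sketch for pulling the error-level records out of lines shaped like the one above. It assumes the lines are plain key=value pairs with optionally quoted values; `parse_log_line` and `error_lines` are hypothetical helpers, not part of any Dagger tooling:

```python
import shlex


def parse_log_line(line: str) -> dict:
    """Split a key=value log line (quoted values allowed) into a dict."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: not a line this sketch can parse.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def error_lines(lines):
    """Yield parsed records whose level field is 'error'."""
    for line in lines:
        record = parse_log_line(line)
        if record.get("level") == "error":
            yield record
```

Lines that are not key=value formatted, such as a Go panic banner, simply produce no record and are skipped.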
Hmm, I'm going to assume you do need to clear it. I just pushed a change to turn off debug logs and it panicked again:
2025-10-31 15:56:03.306
panic: runtime error: invalid memory address or nil pointer dereference
Wiping the cache now and kicking the engine pod.
It looks good again. I'll report back next week when the usage picks up. Seems like the panic occurs when a new engine is spun up with an old cache.
Some telemetry errors, but that is probably an issue for another thread:
time="2025-10-31T06:24:50Z" level=error msg="failed to emit telemetry" error="map[error:marshal log record body: string field contains invalid UTF-8 kind:*fmt.wrapError stack:<nil>]"
It looks good again. I'll report back next week when the usage picks up. Seems like the panic occurs when a new engine is spun up with an old cache.
This unfortunately is the case for now. Sharing the cache across Dagger versions is not officially supported and can lead to unexpected behavior. We generally recommend wiping the cache on engine upgrades for users who manage the engines themselves.
That's something I've raised to Chris in the past here: #1421017536660508683 message
Another crash today, same nil pointer/invalid memory address panic.
cc @flat mantle
hm, well at minimum I can make that an error case rather than crash and include debugging information in the error/logs. I'll do that now
@normal pecan if possible to share a cloud trace where this happened, that could be helpful. Also wondering if you have any modules where you do something like:
- have a Container return type but return it set to nil
- return a list of Containers
Those scenarios have codepaths that might trigger something like this, but unclear until more debugging
context for Erik: Last thing I recall was that Justin added better logging support here: https://github.com/dagger/dagger/pull/11169
ah okay, I think this panic is different. happening in a different place than what that PR touches and the log added there doesn't appear in their logs
didn't manage to repro the panic you were hitting, but turned that case into an error rather than panic and included debug info that should help figure it out if it's hit again: https://github.com/dagger/dagger/pull/11420
also while trying to repro managed to hit independent panics in different cases, so those ones are fixed
Thanks Erik, I don't have a trace handy at the moment but I'll see if I can find it. Thanks for taking a look
Thanks guys appreciate you looking into this
@flat mantle Here is the trace for the run that crashed https://dagger.cloud/nine/traces/a5968ac244deb29f11a16eb0fba641bd
I couldn't find any "obvious" places that might return a nil object, but I will keep looking. I did uncover an error that I have been seeing in our logs, but I don't think it is related to this. I will start another thread with info for that and link it here.
https://discord.com/channels/707636530424053791/1440182254126108763
@normal pecan any chance we could get the engine logs?
Unless I'm misunderstanding, these are the logs from the engine when the crash happened: #1421017536660508683 message
I crashed the engine about an hour ago, here is the log. The stack trace is different this time.
It crashed during the codegen we run on submodules, which is where the snapshot errors I reported here https://discord.com/channels/707636530424053791/1440182254126108763 are originating from. I'm not sure if it's related, but the last debug log spanID points to dagop.ctr Container.withExec which originates from the "go-lint" module I just updated.
For context, we have a library repo, which is a module itself, that depends on many submodules. During our PR check, we run dagger develop to generate code for these to pass to other validation tasks.
Hmm, could it be that when exporting metrics/logs/traces, one or more of the clients disappear while it's trying to do so? I noticed that when these errors occur, the number of connected clients is declining. The count gets quite high since we spin up many containers using ExperimentalPrivilegedNesting to run dagger develop, and it falls off rapidly as each one completes.
thanks for reporting! fix is here: https://github.com/dagger/dagger/pull/11450, will be in the next release soon
Thanks Erik!