#Invalid pointer nil exception from dagger ~0.18.16~ actually 0.18.19

last ember
#

Hi there. We are seeing the following error occurring on our self-hosted dagger runner running v0.18.16. It is causing frequent container restarts:

#

Further context:

  1. the cache was about 258 GB in size
#

For now we have just cordoned this node. Thought I would submit these here in case it triggered some bright idea about what might be causing it

exotic kraken
#

@last ember do you have the rest of the stack trace by any chance?

last ember
#

yes

#

struggling to get them off the logging cluster atm

#

I will have to reply on Monday. Unfortunately the node with the logs disappeared

exotic kraken
last ember
#

All good 🙂 Thank you very much Solomon!

#

Yes unfortunately we don't have any logging on the cluster apparently either.

#

We're getting some though so hope to see this done soon

balmy fractal
#

@last ember can you double check that your engines haven't been upgraded to the latest version by any chance? The reason I ask is that the stack trace references files such as /app/internal/buildkit/solver/cachemanager.go:310, which was only introduced in the past week (https://github.com/dagger/dagger/commits/v0.18.19/internal/buildkit/solver/cachemanager.go).

If you're using self-hosted engines, you need to wipe your cache between engine upgrades, as reusing an old cache across versions can lead to issues like this
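
One way to guard against reusing a cache across engine versions is to record the engine version next to the cache and compare it before starting the engine. A minimal sketch, assuming a hypothetical marker file maintained by your own deployment tooling (Dagger does not write this file; the function names and path are illustrative only):

```python
from pathlib import Path


def cache_needs_wipe(marker: Path, engine_version: str) -> bool:
    """Return True when the cache cannot be shown to match engine_version.

    `marker` is a hypothetical file maintained by your own deployment
    tooling, not something the Dagger engine writes.
    """
    if not marker.exists():
        # No record of which version produced the cache: treat it as stale.
        return True
    return marker.read_text().strip() != engine_version


def record_version(marker: Path, engine_version: str) -> None:
    """Write the marker after the cache has been wiped or freshly created."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(engine_version + "\n")
```

Treating a missing marker as stale errs on the side of wiping, which costs a cold cache but avoids mixing versions.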

balmy fractal
#

nit: we should add a log message at engine startup so the version gets printed in the logs, which I just realized we're not doing

balmy fractal
last ember
#

Will double check on Monday

#

We were planning to upgrade, not sure if it went through

last ember
#

can confirm we are running 0.18.19

balmy fractal
balmy fractal
last ember
#

we did not ah interesting

#

okay thanks for the heads up

#

since I nuked the old dagger engine, haven't had this issue

#

Thanks for the heads up, I will reflect this back to the team

#

While things are still in v0 can we expect that the cache has to be wiped each time? When moving to v1 would there be more of a chance that the cache would stay?

#

We would prefer to keep the cache so we can upgrade frequently without disrupting developers, but understand that a lot is changing right now.

exotic kraken
#

yes at the moment it's better to assume cache must be wiped. And yes for v1 that will change.

#

Note in the meantime, in prod you can re-warm the cache as part of the upgrade. A few judiciously chosen dagger call invocations will go a long way
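
Re-warming could be scripted as a post-upgrade step. A rough sketch, assuming hypothetical function names (`build`, `test`) exposed by your own module; only `dagger call` itself comes from the message above, and the runner is injectable so the logic can be exercised without an engine:

```python
import subprocess
from typing import Callable, Iterable, List

# Hypothetical module functions to run after an upgrade; substitute your own.
WARMUP_CALLS: List[List[str]] = [["build"], ["test"]]


def warm_cache(
    calls: Iterable[List[str]] = WARMUP_CALLS,
    run: Callable[[List[str]], int] = subprocess.call,
) -> List[List[str]]:
    """Run `dagger call <args>` for each entry and return the commands that failed."""
    failed = []
    for args in calls:
        cmd = ["dagger", "call", *args]
        # A non-zero exit code is recorded but does not stop the remaining calls.
        if run(cmd) != 0:
            failed.append(cmd)
    return failed
```

Returning the failed commands instead of raising lets the upgrade job decide whether a cold spot in the cache is worth blocking on.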

last ember
#

Okay good to know

#

By the way guys we actually saw this issue again today

#

But this is on a node with no prior cache this time and definitely been running 0.18.19 the whole time

#

I have attached the log file for the whole thing this time

#

Could this have been caused by using an 0.18.16 CLI against the 0.18.19 engine? We forgot to upgrade our CLI version until this morning...

#

It's been cut over now

normal pecan
last ember
#

yeah we are still finding this error is happening

#

and we are now in a clean state with dagger engine and CLI pointing to 0.18.19 and no prior cache

exotic kraken
#

@last ember sorry about that... we'll have an escalation meeting first thing in our morning

last ember
#

all good. Appreciate it 🙂

balmy fractal
#

@last ember quick update that we haven't forgotten about this. Someone from the team is looking into it 👀

last ember
#

Cool cool thank you very much

balmy fractal
#

@last ember so, Justin merged this PR today (https://github.com/dagger/dagger/pull/11169), which should give us more hints about where this error might be coming from. We tried for quite some time to reproduce it but weren't able to.

Is there any chance you could temporarily use this engine version from main by setting export _EXPERIMENTAL_DAGGER_RUNNER_HOST=docker-image://registry.dagger.io/engine:c0cf3c3532f38db629817218050e28219bb08129 and send us the logs when you come across the error again?
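
For CI jobs that launch the CLI from a script, the same override can be applied per process rather than exported globally. A small sketch; the variable name and image URL are taken from the message above, while the helper name `runner_env` is made up for illustration:

```python
import os
from typing import Dict, Mapping, Optional

ENGINE_IMAGE = "registry.dagger.io/engine"


def runner_env(commit: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment pointing the CLI at a specific engine image."""
    env = dict(os.environ if base is None else base)
    env["_EXPERIMENTAL_DAGGER_RUNNER_HOST"] = f"docker-image://{ENGINE_IMAGE}:{commit}"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run`, so only the jobs under investigation use the build from main.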

last ember
#

I can put that through today yes

#

Ah I would need to get the team onto 0.19.0 first I believe

#

There are some breaking changes in that release that will affect some of our pipelines

#

It might take some time to move onto 0.19.0 and then after that I could potentially look at this

balmy fractal
#

No need to bump your modules or client to v0.19

last ember
#

Oh right

balmy fractal
#

@last ember silent ping just to check if by any chance you might have any news about this thread?

last ember
#

@normal pecan bouncing this to you

normal pecan
#

I've updated to 0.19.3 today. I'll watch the logs for any errors.

balmy fractal
normal pecan
#

We've run into an issue updating to 0.19.3. I updated the client and engine for CI, but left the modules on 0.18.16, and we have hit this error on downstream repos:

time="2025-10-26T23:25:49Z" level=debug msg="GetContentHashKey reusing ref s8590xju5tzldq62ptl3wp2ew with digest sha256:1b7eff97bb88d327c20f411a5118aaa70df471b96dd7bb72332daaffd9441868" client_hostname=dagger client_id=gwj73l7cnt1xifzbwh0qf5x0x session_id=hav2e4tmjrplyfiulanthqufu spanID=0d1d5806eade0eae traceID=ddc1ca19f9d73ec804288f1b3c09ee0b
time="2025-10-26T23:25:49Z" level=error msg="solve error: failed to get directory: failed to get snapshot: failed to snapshot: failed to receive stat message: rpc error: code = NotFound desc = get full root path: rpc error: code = NotFound desc = eval symlinks: lstat /src: no such file or directory\ngithub.com/dagger/dagger/internal/buildkit/solver.(*edge).execOp\n\t/app/internal/buildkit/solver/edge.go:912\ngithub.com/dagger/dagger/internal/buildkit/solver/internal/pipe.NewWithFunction[...].func2\n\t/app/internal/buildkit/solver/internal/pipe/pipe.go:78\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:1693" client_hostname=dagger client_id=gwj73l7cnt1xifzbwh0qf5x0x session_id=hav2e4tmjrplyfiulanthqufu spanID=c318f2706374cc09 traceID=ddc1ca19f9d73ec804288f1b3c09ee0b

I ran a prune on the cache which did not resolve the issue. I just pushed an update to all of the modules to bump their versions now. Waiting on results.

#

This part is sticking out:

[0.0s] | fatal: Needed a single revision

#

Downstream dagger.json:

{
  "name": "service-data-transformation-platform-operations",
  "engineVersion": "v0.19.3",
  "blueprint": {
    "name": "workflow",
    "source": "git@github.com:nine-digital/library-ci-workflows.git/daggerflows/dbt-service",
    "pin": "main"
  }
}
#

Seems like the config there is wrong; changing to an inline pin on the source resolves it:

{
  "name": "service-data-transformation-platform-operations",
  "engineVersion": "v0.19.3",
  "blueprint": {
    "name": "workflow",
    "source": "git@github.com:nine-digital/library-ci-workflows.git/daggerflows/dbt-service@main"
  }
}
normal pecan
#

This is all good now. It was just a misconfiguration of the source field, which is now strictly validated in 0.19.x. I'll keep you updated on the nil pointer error.
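
For anyone with many downstream repos to update, the manual change above could be scripted. This is only a sketch that mirrors that edit, not an official migration: the helper name is hypothetical, and the check for an existing ref looks only at the last path segment, so it would misjudge a source with no `/` in it.

```python
import json
from typing import Any, Dict


def inline_blueprint_pin(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a dagger.json mapping with blueprint.pin folded into source."""
    result = json.loads(json.dumps(config))  # cheap deep copy for JSON data
    blueprint = result.get("blueprint")
    if not isinstance(blueprint, dict):
        return result
    # The separate pin field is dropped whether or not it ends up being used.
    pin = blueprint.pop("pin", None)
    source = blueprint.get("source", "")
    # An SSH URL starts with "git@", so only the last path segment is
    # inspected when deciding whether a ref is already attached.
    if pin and "@" not in source.rsplit("/", 1)[-1]:
        blueprint["source"] = f"{source}@{pin}"
    return result
```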

exotic kraken
#

thanks for the report!

normal pecan
#

Looks good now, no more nil pointer exceptions.

We are getting a lot of missing snapshot errors though: failed to get directory: failed to get snapshot: failed to snapshot: failed to receive stat message. I know we need to clear the cache when upgrading the engine; does that also apply when the engine pod is recreated without changes?

Full error:

level=error msg="solve error: failed to get directory: failed to get snapshot: failed to snapshot: failed to receive stat message: rpc error: code = NotFound desc = get full root path: rpc error: code = NotFound desc = eval symlinks: lstat /src: no such file or directory\ngithub.com/dagger/dagger/internal/buildkit/solver.(*edge).execOp\n\t/app/internal/buildkit/solver/edge.go:912\ngithub.com/dagger/dagger/internal/buildkit/solver/internal/pipe.NewWithFunction[...].func2\n\t/app/internal/buildkit/solver/internal/pipe/pipe.go:78\nruntime.goexit\n\t/usr/lib/go/src/runtime/asm_amd64.s:1693" client_hostname=dagger client_id=ell5fi9pmt2f4s3ufqlym1kuy session_id=qw6xpdo9296qxw25ok9jkllxx spanID=07ba5bf16694fb7e traceID=a754f9ac3e11cfd2b6f6fa9c0dbb7e2b
normal pecan
#

Hmm, I'm going to assume you do need to clear it. I just pushed a change to turn off debug logs and it panicked again:

2025-10-31 15:56:03.306    
panic: runtime error: invalid memory address or nil pointer dereference

Wiping the cache now and kicking the engine pod.

normal pecan
#

It looks good again. I'll report back next week when the usage picks up. Seems like the panic occurs when a new engine is spun up with an old cache.

Some telemetry errors, but that is probably an issue for another thread:

time="2025-10-31T06:24:50Z" level=error msg="failed to emit telemetry" error="map[error:marshal log record body: string field contains invalid UTF-8 kind:*fmt.wrapError stack:<nil>]"
balmy fractal
#

It looks good again. I'll report back next week when the usage picks up. Seems like the panic occurs when a new engine is spun up with an old cache.

This is unfortunately the case for now. Sharing the cache across Dagger versions is not officially supported and can lead to unexpected behavior. We generally recommend wiping the cache on engine upgrades for users who manage the engines themselves.

That's something I've raised to Chris in the past here: #1421017536660508683 message

normal pecan
flat mantle
#

hm, well at minimum I can make that an error case rather than crash and include debugging information in the error/logs. I'll do that now

@normal pecan if possible to share a cloud trace where this happened, that could be helpful. Also wondering if you have any modules where you do something like:

  1. have a Container return type but return it set to nil
  2. return a list of Containers

Those scenarios have codepaths that might trigger something like this, but unclear until more debugging
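
To check a module's functions for the two shapes listed above, a test helper along these lines might help. It is a sketch with a stand-in `Container` class rather than the real SDK type, and the helper name `classify_result` is hypothetical:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Container:
    """Stand-in for an SDK container type; not the real Dagger class."""
    name: str


def classify_result(result: Any) -> str:
    """Label a function result as 'nil', 'single', or 'list'."""
    if result is None:
        # Scenario 1: declared Container return type, but nothing returned.
        return "nil"
    if isinstance(result, Container):
        return "single"
    if isinstance(result, (list, tuple)):
        # Scenario 2: a list of Containers.
        return "list"
    raise TypeError(f"unexpected result type: {type(result).__name__}")
```

Functions labelled `nil` or `list` are the ones worth mentioning when reporting the panic.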

balmy fractal
flat mantle
flat mantle
#

didn't manage to repro the panic you were hitting, but turned that case into an error rather than panic and included debug info that should help figure it out if it's hit again: https://github.com/dagger/dagger/pull/11420

also while trying to repro managed to hit independent panics in different cases, so those ones are fixed 🙂

normal pecan
#

Thanks Erik, I don't have a trace handy at the moment but I'll see if I can find it. Thanks for taking a look 🙏

last ember
#

Thanks guys appreciate you looking into this

normal pecan
balmy fractal
normal pecan
normal pecan
#

I crashed the engine about an hour ago, here is the log. The stack trace is different this time.

It crashed during the codegen we run on submodules, which is where the snapshot errors I reported here https://discord.com/channels/707636530424053791/1440182254126108763 are originating from. I'm not sure if it's related, but the last debug log spanID points to dagop.ctr Container.withExec which originates from the "go-lint" module I just updated.

For context, we have a library repo, which is a module itself, that depends on many submodules. During our PR check, we run dagger develop to generate code for these to pass to other validation tasks.

normal pecan
#

Hmm, could it be that when exporting metrics/logs/traces, one or more of the clients disappear while it's trying to do so? I noticed that when these errors occur, the number of connected clients is declining. The count is quite high since we are spinning up many containers using ExperimentalPrivilegedNesting to run dagger develop, and it rapidly falls off as each completes.

flat mantle
normal pecan
#

Thanks Erik!