node startup times | Namada | Page 1

hidden pivot May 28, 2025, 9:08 PM

#

just starting a thread not to clutter the chat

#

do the same nodes use similar startup times or is it random on each start?

#

do the startup times correlate with cometbft db size?

jolly cloud May 28, 2025, 9:09 PM

#

no, it's completely random, never know if one is gonna take a few minutes or 10

hidden pivot May 28, 2025, 9:09 PM

#

hm that's a bit weird

#

and system load with other processes running does not seem to be influencing it?

jolly cloud May 28, 2025, 9:10 PM

#

nope

hidden pivot May 28, 2025, 9:10 PM

#

interesting

jolly cloud May 28, 2025, 9:10 PM

#

all hosts have the exact same services on them

hidden pivot May 28, 2025, 9:10 PM

#

have to admit I never paid attention to whether my fullnodes took 2 or 7 minutes - just that they take a while

jolly cloud May 28, 2025, 9:10 PM

#

I am not worried about it.. seen this on celestia, somm, nym, atom, etc. I think it's cometbft.

hidden pivot May 28, 2025, 9:11 PM

#

could be depending on db state when it shut down I guess

jolly cloud May 28, 2025, 9:11 PM

#

agreed

hidden pivot May 28, 2025, 9:11 PM

#

in any case, I think it's still correlated to db size - those who have pruned experience significantly shorter start times

#

I'm a little cautious of pruning, so I suffer the longer starts

#

🙂

jolly cloud May 28, 2025, 9:12 PM

#

well kind of, it could be a sign the DB is not shutting down properly.. and then has to patch state or something.

#

I also haven't pruned anything yet.

hidden pivot May 28, 2025, 9:12 PM

#

I know when shutting down a node, sometimes it exits cleanly and sometimes it's an appcrash.

#

haven't monitored for whether that makes a diff on next start

#

maybe worth doing some benchmarks on?

hidden pivot May 28, 2025, 9:13 PM

#

hidden pivot I know when shutting down a node, sometimes it exits cleanly and sometimes it's ...

most often an appcrash tbh

jolly cloud May 28, 2025, 9:13 PM

#

I will pay closer attention next time we do an upgrade ("app hash means longer startup time" is what I will be testing)

jolly cloud May 28, 2025, 9:14 PM

#

hidden pivot most often an appcrash tbh

This seems like a bug.. it should shut down cleanly..

#

cc @flint jasper to track ^

hidden pivot May 28, 2025, 9:15 PM

#

jolly cloud This seems like a bug.. it should shutdown cleanly..

in theory. it does so often though. never thought much of it - namada seems delightfully resilient to random reboots and that sort of thing, compared to other networks

#

not to mention the client appcrashes when it doesn't like parameter choices 🙂

#

I'm thinking it's a fair point. maybe it should be looked into if it can reduce startup-recovery times.

flint jasper May 28, 2025, 9:25 PM

#

jolly cloud cc @flint jasper to track ^

Heya 👋

#

@jolly cloud was TuDudes one of those nodes? If so, VN? Fullnode?

jolly cloud May 28, 2025, 10:34 PM

#

No those are some of our internal nodes. We see this on our relay and rpc nodes. Relay nodes are just telling our signer what to sign and then broadcasting.

thin maple May 29, 2025, 12:29 AM

#

hidden pivot in any case, I think it's still correlated to db size - those who have pruned ex...

came to the same conclusion here: it appeared out of nowhere, intermittent at first, then consistent as the db grew past cometbft's ability to initialise within 180 seconds

thin maple May 29, 2025, 12:31 AM

#

jolly cloud no its completely random, never know if one is gonna take a few minutes or 10

all the same hardware? I was assuming I was seeing it earlier than most due to my node being an old laptop with an i7 from 2012

#

I don't know a great deal about cometbft or cosmos/tendermint stuff, any ideas on what/where to look in cometbft data for signs of problems?

dawn lichen May 30, 2025, 10:17 AM

#

hey guys, I'm trying to restore from a Polkachu snapshot (I need an archive node), but my node is crashing in the crediting part - has anyone had a similar problem after the hard fork?

INFO namada_node::shell::init_chain: Crediting 106.880076 nam tokens to tnam1qzzzpksl4lym3wy2gp0j8c7rvgtjrqdhwusruc3s
INFO namada_node::shell::init_chain: Crediting 106.880076 nam tokens to tnam1qzzzqjs4kjcpghvsgdz0yjr7f5zmsxww8vl0gjgs
INFO namada_node::shell::init_chain: Crediting 106.880076 nam tokens to tnam1qzzzrfwwyv0kz5svprr428t05fwhu7rq5u89sxgm
INFO namada_node::shell::init_chain: Crediting 135.922579 nam tokens to tnam1qzzzt8mk8ynxfhyae6rcadtcwshssrkcw5j3a5kt

hidden pivot May 30, 2025, 11:09 AM

#

dawn lichen hey guys, i'm trying to restore from snapshot of Polkachu (I need an archive nod...

please create a separate thread or question for this. it's unrelated to this thread

jolly cloud May 30, 2025, 6:07 PM

#

thin maple all the same hardware? I was assuming I was seeing it earlier than most due to m...

Yes all the same hardware, exact setup, etc. Using custom pipelines for everything.

Not sure where to look. If we are indeed seeing some panics on shutdown, it's probably related to that. I have not been paying close attention to the shutdown messages.

hidden pivot Jun 1, 2025, 11:20 PM

#

jolly cloud Yes all the same hardware, exact setup, etc. Using custom pipelines for everythi...

just restarted a node on housefire. it exited gracefully and did not take a lot of time starting. I think there may be something to this, (though a bit speculative/anecdotal at this point).

#

guess I could try restarting the same a number of times for stats 😅
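For anyone doing repeated restarts for stats, here is a minimal Python sketch for tallying the results. It assumes you record by hand, per restart, whether the previous shutdown was clean and how many seconds startup took; `summarize_restarts` is a hypothetical helper, not part of any Namada tooling.

```python
from statistics import mean, median


def summarize_restarts(samples):
    """Group startup durations (seconds) by how the previous run ended.

    `samples` is a list of (clean_shutdown: bool, startup_seconds: float)
    pairs, one per restart.
    """
    groups = {"clean": [], "crashed": []}
    for clean, seconds in samples:
        groups["clean" if clean else "crashed"].append(seconds)

    summary = {}
    for name, values in groups.items():
        if values:  # skip a group with no observations
            summary[name] = {
                "n": len(values),
                "mean": mean(values),
                "median": median(values),
            }
    return summary
```

With only a handful of restarts per group the median is the more robust of the two numbers, since a single 10-minute outlier would dominate the mean.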

thin maple Jun 2, 2025, 5:22 PM

#

@hidden pivot @jolly cloud managed to do some brief investigation here on my mainnet node. I'm not seeing any variation in startup times, but I'm not sure how I can see whether namadan shut down cleanly or not (what do you guys see in logs to indicate an unclean shutdown?)

#

I did however realise that I got it inverted: it's cometbft waiting on namadan to finish initialising

#

I took a flamegraph of namadan during initialisation:

#

full zoomable flamegraph svg file:

#

ugh, not sure how to prevent discord from providing unwanted preview of text files...

jolly cloud Jun 2, 2025, 5:27 PM

#

2025-06-02T16:19:10.630013Z  INFO namada_node: Done loading MASP verifying keys.
2025-06-02T16:19:10.630604Z  INFO namada_node::storage::rocksdb: Using 1 compactions threads for RocksDB.
...
{"_msg":"abci.socketClient failed to connect to tcp://0.0.0.0:26658.  Retrying after 3s...","connection":"query","err":"dial tcp 0.0.0.0:26658: connect: connection refused","level":"error","module":"abci-client","ts":"2025-06-02T16:28:28.847284106Z"}
...
2025-06-02T16:28:34.111657Z  INFO tower_abci::v037::server: ABCI server starting on tcp socket addr=0.0.0.0:26658

I am noticing it taking forever to establish the connection.. Why it's happening, I am not sure yet.
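To put a number on that delay, here is a minimal Python sketch that measures the time between namadan opening RocksDB and the ABCI server starting. It assumes namadan log lines begin with an ISO-8601 UTC timestamp ending in `Z`, as in the excerpt above; the two marker strings are copied from that excerpt, and `abci_startup_delay` is a hypothetical helper.

```python
from datetime import datetime

# namadan timestamps look like 2025-06-02T16:19:10.630604Z (microseconds, UTC)
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(line):
    # The timestamp is the first whitespace-separated token of a namadan line.
    return datetime.strptime(line.split()[0], TS_FORMAT)


def abci_startup_delay(lines, start_marker="RocksDB", end_marker="ABCI server starting"):
    """Seconds between the first line containing `start_marker` and the first
    later line containing `end_marker`, or None if either is missing."""
    start = None
    for line in lines:
        if start is None and start_marker in line:
            start = parse_ts(line)
        elif start is not None and end_marker in line:
            return (parse_ts(line) - start).total_seconds()
    return None
```

On the excerpt above (16:19:10.630604 to 16:28:34.111657) this gives about 563 seconds, roughly 9.4 minutes. cometbft's JSON lines are skipped because they contain neither marker.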

thin maple Jun 2, 2025, 5:29 PM

#

those json log entries are coming from cometbft, right?

#

this is the same problem: cometbft unable to connect to namadan's ABCI server

#

just to clarify on my node I have:

```
tcp        0      0 127.0.0.1:26658         0.0.0.0:*               LISTEN      418840/namadan
tcp        0      0 127.0.0.1:26657         0.0.0.0:*               LISTEN      418881/cometbft
tcp6       0      0 :::26656                :::*                    LISTEN      418881/cometbft
```

#

i.e. namada listens on 26658
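To time how long it takes until something is listening on 26658, without opening a connection to the ABCI server itself, here is a minimal Linux-only Python sketch that reads the kernel socket tables in `/proc/net/tcp` and `/proc/net/tcp6`. The 3-second poll mirrors cometbft's "Retrying after 3s"; the helper names are hypothetical and the default port is the one from the netstat output above.

```python
import time

TCP_LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp and /proc/net/tcp6


def is_listening(proc_net_text, port):
    """True if /proc/net/tcp(6)-formatted text shows a LISTEN socket on `port`."""
    for line in proc_net_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)  # ports are hex-encoded
        if local_port == port and fields[3] == TCP_LISTEN:
            return True
    return False


def port_is_listening(port):
    """Check the kernel's IPv4 and IPv6 socket tables (Linux only)."""
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                if is_listening(f.read(), port):
                    return True
        except FileNotFoundError:
            continue  # e.g. IPv6 disabled
    return False


def wait_until_listening(port=26658, poll_seconds=3.0, timeout=900.0):
    """Seconds until something listens on `port`; raises TimeoutError."""
    start = time.monotonic()
    while not port_is_listening(port):
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"nothing listening on port {port} after {timeout}s")
        time.sleep(poll_seconds)
    return time.monotonic() - start
```

Started alongside `namadan ledger run`, `wait_until_listening()` returns roughly the same gap that cometbft spends retrying, to within one poll interval.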

#

I suspect you have the same issue and that the flamegraph above is relevant to your long connection time

jolly cloud Jun 2, 2025, 5:34 PM

#

seems accurate!

thin maple Jun 2, 2025, 5:36 PM

#

so the vast majority of time is spent in the sparse_merkle_tree::validate function

#

cc @vapid hull any idea if it is normal for namadan to do this validation on every startup?

hidden pivot Jun 2, 2025, 8:12 PM

#

thin maple @hidden pivot @jolly cloud managed to some brief investigation...

you can run it in a terminal 😈

hidden pivot Jun 2, 2025, 8:14 PM

#

thin maple so the vast majority of time is spent in the sparse_merkle_tree::validate functi...

isn't it cometbft's db compacting we are waiting for?

thin maple Jun 2, 2025, 8:16 PM

#

hidden pivot you can run it in a terminal 😈

how do you mean? I'm reading the logs but can't really spot anything that indicates unclean shutdown

#

not sure how terminal vs systemd launching changes that

hidden pivot Jun 2, 2025, 8:17 PM

#

thin maple how do you mean? I'm reading the logs but can't really spot anything that indica...

`namadan ledger run`

#

in a terminal

#

terminate with ctrl+c

#

sometimes it exits cleanly, often it crashes

#

we were wondering above if the longer startup times came after unclean exits

thin maple Jun 2, 2025, 8:18 PM

#

ahh ok, do you have an example of logs from a crash?

hidden pivot Jun 2, 2025, 8:19 PM

#

thin maple not sure how terminal vs systemd launching changes that

just easier to do hands-on, you can run it still via a service if you prefer

hidden pivot Jun 2, 2025, 8:19 PM

#

thin maple ahh ok, do you have an example of logs from a crash?

not atm but I am sure I can find some when it happens. it's not hard to tell when it does though, it's quite different output (with red even when you run it in certain terminals)

#

imma shut down all my nodes soon and see if we can find any 🙂

thin maple Jun 2, 2025, 8:21 PM

#

hidden pivot isn't it cometbft's db compacting we are waiting for?

according to my flamegraph the rocksdb stuff only takes a relatively short amount of time vs the merkle validation (which is the bulk of the time)
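To put a number on "the bulk of the time", here is a minimal Python sketch that computes what fraction of profiler samples include a given function. It assumes the samples are available in the collapsed ("folded") stack format used by flamegraph tooling such as Brendan Gregg's stackcollapse scripts, one `frame1;frame2;...;frameN count` entry per line; `frame_share` is a hypothetical helper and the frame names in any example are placeholders.

```python
def frame_share(folded_lines, needle):
    """Fraction of samples whose stack contains a frame matching `needle`.

    Each line is 'frame1;frame2;...;frameN <count>' (collapsed-stack format).
    """
    total = 0
    matching = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")  # count is the last token
        n = int(count)
        total += n
        if any(needle in frame for frame in stack.split(";")):
            matching += n
    return matching / total if total else 0.0
```

Running it once with `"sparse_merkle_tree::validate"` and once with `"rocksdb"` on the same folded file gives the two shares being compared here.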

hidden pivot Jun 2, 2025, 8:22 PM

#

that was quick lol:
`2025-06-02T20:21:02.676515Z INFO namada_node::shell::finalize_block: txs executed: 0
The application panicked (crashed).
2025-06-02T20:21:02.705991Z INFO namada_node: Tendermint node is no longer running.
Message: called Result::unwrap() on an Err value: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
2025-06-02T20:21:02.706038Z INFO namada_node: Namada ledger node has shut down.
Location: /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
2025-06-02T20:21:02.706071Z INFO namada_node: Shutting down ABCI server...

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.`
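To answer the "what do you see in logs" question mechanically, here is a minimal heuristic classifier in Python. The two marker strings are copied from the crash output above, where the panic and the normal shutdown line were both printed, so the panic marker takes precedence. Treating a log with neither marker as "unknown" (for example after a SIGKILL) is an assumption, and `classify_shutdown` is a hypothetical helper.

```python
PANIC_MARKER = "The application panicked (crashed)."
SHUTDOWN_MARKER = "Namada ledger node has shut down."


def classify_shutdown(log_text):
    """Classify the tail of a namadan log as 'panic', 'clean' or 'unknown'."""
    if PANIC_MARKER in log_text:
        return "panic"  # a panic wins even if the shutdown line was also printed
    if SHUTDOWN_MARKER in log_text:
        return "clean"
    return "unknown"  # e.g. the process was killed before logging anything
```

Paired with the startup duration of the following run, this gives the clean/crashed label needed to test whether dirty exits lead to slower starts.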

hidden pivot Jun 2, 2025, 8:22 PM

#

thin maple according to my flamegraph the rocksdb stuff only takes a relatively short amoun...

interesting. always assumed it was the rocksdb due to the log outputs. the more you know..

hidden pivot Jun 2, 2025, 8:23 PM

#

hidden pivot that was quick lol: `2025-06-02T20:21:02.676515Z INFO namada_node::shell::final...

let me try and see how long this takes to start up again

thin maple Jun 2, 2025, 8:24 PM

#

hidden pivot that was quick lol: `2025-06-02T20:21:02.676515Z INFO namada_node::shell::final...

was this after ctrl-c in terminal or after doing `systemctl stop`?

hidden pivot Jun 2, 2025, 8:24 PM

#

thin maple was this after ctrl-c in terminal or after doing `systemctl stop`?

terminal

#

yes it's dirty, I know.. 🥴

#

ok, startup time wasn't too bad: around 1m40s on a not very hw-specced node (aux node, unimportant use case)

hidden pivot Jun 2, 2025, 8:27 PM

#

hidden pivot yes it's dirty, I know.. 🥴

or is it? does systemctl stop offer more niceties than ctrl+c?

thin maple Jun 2, 2025, 8:28 PM

#

what about db size? what is the output of `du -sh $NAMADA_DIR/db`

hidden pivot Jun 2, 2025, 8:30 PM

#

different signals apparently, hm, interesting

hidden pivot Jun 2, 2025, 8:30 PM

#

thin maple what about db size? what is the output of `du -sh $NAMADA_DIR/db`

let me check, never pruned it

hidden pivot Jun 2, 2025, 8:30 PM

#

thin maple what about db size? what is the output of `du -sh $NAMADA_DIR/db`

you asking about namada db size or cometbft db size? as the latter tends to be way bigger

#

interesting again: namada db size is 24G

#

cometbft db size around 102G
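For comparing the two database directories from a script rather than with `du`, here is a minimal Python sketch. It sums apparent file sizes (like GNU `du -sb`), which can differ from the allocated disk usage that `du -sh` reports; the directory paths you pass in are your own, and the helper names are hypothetical.

```python
import os


def dir_size_bytes(path):
    """Total apparent size of regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def human(n):
    """Format a byte count with 1024-based suffixes, similar to `du -h`."""
    for unit in ["B", "K", "M", "G", "T"]:
        if n < 1024 or unit == "T":
            return f"{n:.0f}{unit}" if unit == "B" else f"{n:.1f}{unit}"
        n /= 1024
```

For example, `human(dir_size_bytes(os.path.expandvars("$NAMADA_DIR/db")))` prints the namada db size in the same style as `du -sh`.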

thin maple Jun 2, 2025, 8:38 PM

#

hidden pivot or is it? does systemctl stop offer more niceties than ctrl+c?

systemctl sends SIGTERM and ctrl-c in a terminal sends SIGINT; how to respond to either is up to the application, so not sure. Would need to look at the signal handling code in namadan

#

after `systemctl stop`, systemd then waits 90 seconds (the default TimeoutStopSec) and sends SIGKILL if the process has not exited, which is much more brutal
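To illustrate that both signals are up to the application, here is a minimal Python sketch of a process that routes SIGINT (ctrl-c) and SIGTERM (`systemctl stop`) to the same graceful-shutdown path. This is not namadan's code, which is Rust; it only shows the general pattern. SIGKILL, which systemd sends after the stop timeout, cannot be caught at all, so no handler runs in that case.

```python
import signal
import threading

shutdown_requested = threading.Event()
received_signals = []


def request_shutdown(signum, _frame):
    # Only record the request here; the main loop is expected to notice the
    # event, flush and close its databases, and then exit on its own.
    received_signals.append(signal.Signals(signum).name)
    shutdown_requested.set()


# Treat ctrl-c (SIGINT) and `systemctl stop` (SIGTERM) identically.
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, request_shutdown)
```

Keeping the handler tiny and doing the real cleanup in the main loop avoids running complex shutdown logic inside the signal handler itself.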

thin maple Jun 2, 2025, 8:43 PM

#

hidden pivot you asking about namada db size or cometbft db size? as the latter tends to be w...

namada db, as I think that's the one being operated on by the merkle-tree validation during initialisation

#

mine is similar, 25GB
