#node startup times
just starting a thread not to clutter the chat
do the same nodes use similar startup times or is it random on each start?
do the startup times correlate with cometbft db size?
no, it's completely random; never know if one is gonna take a few minutes or 10
hm that's a bit weird
and system load with other processes running does not seem to be influencing it?
nope
interesting
all hosts have the exact same services on them
have to admit I never paid attention to whether my fullnodes took 2 or 7 minutes - just that they take a while
I am not worried about it.. seen this on celestia, somm, nym, atom, etc. I think it's cometbft.
could be depending on db state when it shut down I guess
agreed
in any case, I think it's still correlated to db size - those who have pruned experience significantly shortened start times
I'm a little cautious of pruning, so I suffer the longer starts
well kind of, it could be a sign the DB is not shutting down properly and then has to patch state or something on the next start.
I also haven't pruned anything yet.
I know when shutting down a node, sometimes it exits cleanly and sometimes it's an appcrash.
haven't monitored for whether that makes a diff on next start
maybe worth doing some benchmarks on?
most often an appcrash tbh
I will pay closer attention next time we do an upgrade (appcrash means longer startup time <- is what I will be testing)
This seems like a bug.. it should shut down cleanly..
cc @flint jasper to track ^
in theory. it does so often though. never thought much of it - namada seems delightfully resilient to random reboots and that sort of thing, compared to other networks
not to mention the client appcrashes when it doesn't like parameter choices
I'm thinking it's a fair point. maybe it should be looked into if it can reduce startup-recovery times.
Heya
@jolly cloud was TuDudes one of those nodes? If so, VN? Fullnode?
No those are some of our internal nodes. We see this on our relay and rpc nodes. Relay nodes are just telling our signer what to sign and then broadcasting.
came to the same conclusion here: appeared out of nowhere, intermittent at first, then consistent as the db grew past cometbft's ability to initialise within 180 seconds
all the same hardware? I was assuming I was seeing it earlier than most due to my node being an old laptop with an i7 from 2012
I don't know a great deal about cometbft or cosmos/tendermint stuff, any ideas on what/where to look in cometbft data for signs of problems?
hey guys, i'm trying to restore from snapshot of Polkachu (I need an archive node), but my node is crashing in the crediting part - has anyone had a similar problem after the hard fork?
INFO namada_node::shell::init_chain: Crediting 106.880076 nam tokens to tnam1qzzzpksl4lym3wy2gp0j8c7rvgtjrqdhwusruc3s
INFO namada_node::shell::init_chain: Crediting 106.880076 nam tokens to tnam1qzzzqjs4kjcpghvsgdz0yjr7f5zmsxww8vl0gjgs
INFO namada_node::shell::init_chain: Crediting 106.880076 nam tokens to tnam1qzzzrfwwyv0kz5svprr428t05fwhu7rq5u89sxgm
INFO namada_node::shell::init_chain: Crediting 135.922579 nam tokens to tnam1qzzzt8mk8ynxfhyae6rcadtcwshssrkcw5j3a5kt
please create a separate thread or question for this. it's unrelated to this thread
Yes all the same hardware, exact setup, etc. Using custom pipelines for everything.
Not sure where to look. If we are indeed seeing some panics on shutdown, it's probably related to that. I have not been paying close attention to the shutdown messages.
just restarted a node on housefire. it exited gracefully and did not take a lot of time starting. I think there may be something to this (though a bit speculative/anecdotal at this point).
guess I could try restarting the same node a number of times for stats
@hidden pivot @jolly cloud managed to do some brief investigation here on my mainnet node. I'm not seeing any variation in startup times. Not sure how I can see whether namadan shut down cleanly or not, though (what do you guys see in logs to indicate unclean shutdown?)
I did however realise that I got it inverted: it's cometbft waiting on namadan to finish initialising
I took a flamegraph of namadan during initialisation:
full zoomable flamegraph svg file:
ugh, not sure how to prevent discord from providing unwanted preview of text files...
2025-06-02T16:19:10.630013Z INFO namada_node: Done loading MASP verifying keys.
2025-06-02T16:19:10.630604Z INFO namada_node::storage::rocksdb: Using 1 compactions threads for RocksDB.
...
{"_msg":"abci.socketClient failed to connect to tcp://0.0.0.0:26658. Retrying after 3s...","connection":"query","err":"dial tcp 0.0.0.0:26658: connect: connection refused","level":"error","module":"abci-client","ts":"2025-06-02T16:28:28.847284106Z"}
...
2025-06-02T16:28:34.111657Z INFO tower_abci::v037::server: ABCI server starting on tcp socket addr=0.0.0.0:26658
I am noticing it taking forever on creating the connection.. Why it's happening, I am not sure yet.
those json log entries are coming from cometbft, right?
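For anyone wanting to put a number on that wait, here is a rough Python sketch that measures the gap between two namadan log lines. The marker strings and timestamps are copied from the log excerpt above; the helper names (`parse_ts`, `startup_gap_seconds`) are made up for illustration, and it assumes every marker line starts with a timestamp in the format shown.

```python
from datetime import datetime

# Timestamp layout of the namadan lines pasted above (microseconds, trailing Z).
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(line):
    """Parse the leading timestamp of a namadan-style log line."""
    return datetime.strptime(line.split()[0], TS_FORMAT)


def startup_gap_seconds(lines, start_marker, end_marker):
    """Seconds between the first line containing start_marker and the
    first later line containing end_marker, or None if either is missing."""
    start = None
    for line in lines:
        if start is None and start_marker in line:
            start = parse_ts(line)
        elif start is not None and end_marker in line:
            return (parse_ts(line) - start).total_seconds()
    return None


log = [
    "2025-06-02T16:19:10.630013Z INFO namada_node: Done loading MASP verifying keys.",
    "2025-06-02T16:28:34.111657Z INFO tower_abci::v037::server: ABCI server starting on tcp socket addr=0.0.0.0:26658",
]
gap = startup_gap_seconds(log, "Done loading MASP verifying keys", "ABCI server starting")
print(f"{gap:.1f} s")  # about 563.5 s, i.e. roughly 9.4 minutes
```

Only lines containing one of the markers are parsed, so the cometbft JSON entries in between can be left in the input.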
this is the same problem, cometbft unable to connect to namadan's broadcaster service
just to clarify on my node I have:
```
tcp 0 0 127.0.0.1:26658 0.0.0.0:* LISTEN 418840/namadan
tcp 0 0 127.0.0.1:26657 0.0.0.0:* LISTEN 418881/cometbft
tcp6 0 0 :::26656 :::* LISTEN 418881/cometbft
```
i.e. namada listens on 26658
I suspect you have the same issue and that the flamegraph above is relevant to your long connection time
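A quick way to time this from the outside, without reading logs, is to poll the port until it accepts a connection. This is only a sketch: `wait_for_port` is a hypothetical helper, the 3 second retry mirrors the "Retrying after 3s" in the cometbft log above, and port 26658 is the one shown in the listing.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=900.0, retry_s=3.0):
    """Poll host:port until a TCP connect succeeds.

    Returns the seconds waited, or None if the next retry would exceed
    timeout_s.
    """
    began = time.monotonic()
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=retry_s):
                return time.monotonic() - began
        except OSError:
            # Connection refused or timed out: retry unless out of budget.
            if time.monotonic() - began + retry_s > timeout_s:
                return None
            time.sleep(retry_s)
```

Started right after launching the node, `wait_for_port("127.0.0.1", 26658)` would return roughly how long namadan took before it began listening.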
seems accurate!
so the vast majority of time is spent in the sparse_merkle_tree::validate function
cc @vapid hull any idea if it is normal for namadan to do this validation on every startup?
you can run it in a terminal
isn't it cometbft's db compacting we are waiting for?
how do you mean? I'm reading the logs but can't really spot anything that indicates unclean shutdown
not sure how terminal vs systemd launching changes that
namadan ledger run
in a terminal
terminate with ctrl+c
sometimes it exits cleanly, often it crashes
we were wondering above if the longer startup times were after unclean exits
ahh ok, do you have an example of logs from a crash?
just easier to do hands-on, you can run it still via a service if you prefer
not atm but I am sure I can find some when it happens. it's not hard to tell when it does though, it's quite different output (with red even when you run it in certain terminals)
imma shut down all my nodes soon and see if we can find any
according to my flamegraph the rocksdb stuff only takes a relatively short amount of time vs the merkle validation (which is the bulk of the time)
that was quick lol:
`2025-06-02T20:21:02.676515Z INFO namada_node::shell::finalize_block: txs executed: 0
The application panicked (crashed).
Message: called Result::unwrap() on an Err value: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
Location: /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tower-abci-0.19.1/src/v037/server.rs:179
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2025-06-02T20:21:02.705991Z INFO namada_node: Tendermint node is no longer running.
2025-06-02T20:21:02.706038Z INFO namada_node: Namada ledger node has shut down.
2025-06-02T20:21:02.706071Z INFO namada_node: Shutting down ABCI server...`
interesting. always assumed it was the rocksdb due to the log outputs. the more you know..
let me try and see how long this takes to start up again
was this after ctrl-c in terminal or after doing systemctl stop?
terminal
yes it's dirty, I know..
ok startup time wasn't too bad. around 1m40s on a not very well-specced node (aux node, unimportant use case)
or is it? does systemctl stop offer more niceties than ctrl+c?
what about db size? what is the output of du -sh $NAMADA_DIR/db
different system calls apparently hm interesting
let me check, never pruned it
you asking about namada db size or cometbft db size? as the latter tends to be way bigger
interesting again: namada db size is 24G
cometbft db size around 102G
systemctl sends SIGTERM, terminal ctrl-c sends SIGINT, both are up to application to respond to so not sure, would need to look at signal handling code in namadan
systemctl stop then waits 90 seconds by default and sends SIGKILL if the process has not exited, which is much more brutal
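To illustrate the point about both signals being up to the application: a process only gets a chance to shut down gracefully for the signals it explicitly registers handlers for. A generic Python sketch of handling SIGINT and SIGTERM the same way follows. This is not namadan's code (namadan is written in Rust and its signal handling has not been checked here), and `request_stop` and `stop_requested` are illustrative names.

```python
import signal

# Records which signal asked the process to stop (None until one arrives).
stop_requested = {"signal": None}


def request_stop(signum, frame):
    """Note the signal and let the main loop exit at a safe point."""
    stop_requested["signal"] = signal.Signals(signum).name


# ctrl+c delivers SIGINT, systemctl stop delivers SIGTERM; register both
# so either path triggers the same graceful shutdown logic.
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, request_stop)
```

SIGKILL cannot be caught at all, so whatever cleanup is needed has to finish before systemd's stop timeout expires.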
namada db, as I think that's the one being operated on during the merkle-tree validation during initialisation
mine is similar 25GB
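If anyone wants to record db size next to startup time for the correlation question, here is a small Python stand-in for `du -sh`. It sums apparent file sizes, so the figure can differ somewhat from what `du` reports (which counts disk blocks). `dir_size_bytes` is an illustrative name, and the path you pass in depends on your own setup.

```python
import os


def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            # Skip symlinks so a link's target is not counted.
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total
```

For example, `dir_size_bytes(os.path.join(os.environ["NAMADA_DIR"], "db")) / 1e9` gives the namada db size in GB, assuming `NAMADA_DIR` is set as in the `du` command above.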