who else except @Nick | u410 is | Namada | Page 1

native sierra Oct 2, 2025, 9:43 AM

#

hi guys! we are looking into this but its kind of a weird issue. Everything regarding the p2p network is handled for us by cometbft so this makes it even more difficult to debug. We received some logs from @worthy vessel and indeed something is looking weird: Jul 30 08:19:15 namada-mainnet-group-2-relay-1 fddc0a83fced[611]: {"_msg":"Timed out","dur":"-54.284923296s","height":2968922,"level":"info","module":"consensus","round":0,"step":"RoundStepNewHeight","ts":"2025-07-30T08:19:15.255705207Z"} the Time out value is negative which should be impossible.

#

Do you guys have more logs (with COMETBFT logs enabled) around one of those specific events? it would help us understand things better

#

One thing that might help with this is upgrading to the latest cometbft version. We are working on this but they changed some internal mechanism and we need to refactor some logic

desert mortar Oct 2, 2025, 10:58 AM

#

native sierra Do you guys have more logs (with COMETBFT logs enabled) around one of those spec...

Hey @native sierra, I'm looking through logs on my mainnet VN here and I think I'm seeing missed votes on blocks from time to time. I'm not 100% sure what to look for in comet logs though.

#

I've tried to find an example close to the example from 2025-09-17 that appears in both images above (my log entries don't go back as far as 2025-07-30 in your example above)/

#

Here it is with some extra annotation :

#

📎 message.txt

#

The things that stand out are the two extra timeouts and the subsequent 'resetting proposal info'

#

and then obviously all the failures to add votes due to 'unexpected step'

#

we could really do with pinning down the UTC timestamps for these events, trying to match against screenshots of graphs isn't great

#

or of course the bock heights of these events

#

btw comet logging levels for the excerpt above: CMT_LOG_LEVEL=p2p:none,pex:error,*:info

native sierra Oct 2, 2025, 11:35 AM

#

desert mortar btw comet logging levels for the excerpt above: `CMT_LOG_LEVEL=p2p:none,pex:erro...

this is perfect thanks

#

can you share the whole logs too?

desert mortar Oct 2, 2025, 11:38 AM

#

sure, it's all done via systemd and then output via journalctl, do you want everything I've got?

#

what I have goes back to 2025-08-17

#

(if that's what you mean by 'whole logs')

native sierra Oct 2, 2025, 11:43 AM

#

oh i mean logs 1 day before and 1 day after the big spikes in nick graph

#

desert mortar Oct 2, 2025, 11:49 AM

#

ok how about:
2025-09-16T00:00:00Z -> 2025-09-18T00:00:00Z
and,
2025-09-21T00:00:00Z -> 2025-09-23T00:00:00Z

#

I'm not sure about timezones so just sticking with UTC midnight to midnight to keep it simple

desert mortar Oct 2, 2025, 12:30 PM

#

native sierra can you share the whole logs too?

just a heads up, my node is local here running on fairly old hardware, 1gb/s internet over a wireguard VPN and can't listen for new peer connections (yes, bad I know...)

📎 bodski-mainnet-log-2025-09-16.xz 📎 bodski-mainnet-log-2025-09-21.xz

#

btw I ended up using 18:00 UTC +-24hrs for both spikes

native sierra Oct 2, 2025, 12:53 PM

#

thanks!

desert mortar Oct 2, 2025, 2:50 PM

#

I've archived the full set of logs up to current, now running with CMT_LOG_LEVEL=*:info in order to also have p2p and pex logging at info level

#

considerably more log entries so not sure how long before stuff gets rotated out, but might be useful to have some p2p entries if we see another spike of missed validator votes

native sierra Oct 2, 2025, 3:32 PM

#

@worthy vessel do u have logs (with cometbft logs) +/- 1 day around those block heights?

merry vault Oct 2, 2025, 5:14 PM

#

Not sure whether there is any clue in our log, but sharing it.

journalctl -u namadad -o cat   --since "2025-09-16 10:00:00"   --until "2025-09-18 23:00
:00" > namada-2025-09-16-2025-09-18.log

📎 namada-2025-09-16-2025-09-18.log.xz

merry vault Oct 2, 2025, 6:41 PM

#

let me know whether logs also from a public node could help (not validator)

desert mortar Oct 2, 2025, 8:25 PM

#

merry vault Not sure whether there is any clue in our log, but sharing it. ``` journalctl -u...

probably worth using -o short-iso-precise and --utc so that we have accurate timestamps and also not have to translate between different timezones

sturdy geode Oct 2, 2025, 9:05 PM

#

here are our logs from the same time period

📎 namada_logs.tar

native sierra Oct 3, 2025, 1:08 PM

#

thanks for the logs! we are looking into this but its kinda tricky. We added some more internal alerts in case this happens again

#

keep u guys posted

desert mortar Oct 6, 2025, 10:51 AM

#

native sierra thanks for the logs! we are looking into this but its kinda tricky. We added som...

more logs from last night with full p2p/pex entries: https://github.com/namada-net/namada/issues/4899

native sierra Oct 6, 2025, 10:55 AM

#

thanks! we will take a look

#

thansk for the issue too

rocky silo Oct 6, 2025, 11:46 AM

#

@native sierra you know if this issue is also a thing in the newer releases we run on housefire?

native sierra Oct 6, 2025, 11:53 AM

#

rocky silo <@269519418126630912> you know if this issue is also a thing in the newer releas...

it must be yeah

native sierra Oct 6, 2025, 1:44 PM

#

@worthy vessel did your node al expirienced some dagraded performance yesterday ~22/23pm utc ?

desert mortar Oct 6, 2025, 1:51 PM

#

I'm pretty sure I saw deNodes having trouble according to explorer75, amongst a few others

sturdy geode Oct 6, 2025, 2:48 PM

#

native sierra <@1009896203644833903> did your node al expirienced some dagraded performance ye...

looks like around that time we did experience a slight drop in peers, but not like what we saw a week ago. our validators were still able to keep signing

native sierra Oct 6, 2025, 2:49 PM

#

sturdy geode looks like around that time we did experience a slight drop in peers, but not li...

what about today between 10am to 10:30am utc ?

sturdy geode Oct 6, 2025, 2:50 PM

#

native sierra what about today between 10am to 10:30am utc ?

I don't see anything out of the ordinary

desert mortar Oct 6, 2025, 3:14 PM

#

native sierra what about today between 10am to 10:30am utc ?

Not between those times but I see a shortlived problem this morning at 09:20 UTC

#

📎 message.txt

#

I recognise the peer ID, it seems to crop up around this problem a lot: peer=b30dcb6e68f9ebac54ef6386905e71b8c95b1743

#

anyone know if it's possible to match peer_id -> validator_id and/or validator_id -> validator_name using information from an RPC endpoint?

#

or does that require an indexer?

worthy vessel Oct 6, 2025, 3:40 PM

#

depends on if its exposing endpoints or not, if it is, you can hit the /status endpoint and get the validator_info which has the pubkey of the validator(it might just be an auto-generated key and not a validator as well). To know if its a validator, you would need to take the pubkey and see if its in the active set

#

graph from last night as requested

Screenshot_2025-10-06_at_10.44.47_AM.png

desert mortar Oct 6, 2025, 3:57 PM

#

worthy vessel graph from last night as requested

thanks! not sure what timezone the x-axis has, problems I saw were at around 20:30 UTC -> 21:30 UTC last night
would either of those spikes correspond to that period?

native sierra Oct 6, 2025, 3:59 PM

#

worthy vessel graph from last night as requested

whats the difference between purple and orange? also i guess time is utc?

worthy vessel Oct 6, 2025, 4:02 PM

#

native sierra whats the difference between purple and orange? also i guess time is utc?

different nodes, time is Central Time

desert mortar Oct 6, 2025, 4:05 PM

#

funnily enough I was thinking Nick's graph UTC times because I'm seeing same kind of errors as last night starting at 04:20 UTC

#

so central time = UTC -6, then spikes on Nicks graph are at ~10:20 utc and ~13:40 utc?

#

oh but are you daylight saving right now?

#

so if the times are central daylight time then they are UTC -5h, meaning purple spike coincides with problems I saw in logs this morning
#1422959254180073515 message

#

as for orange spike I do see what might be a single missed block (timed out waiting for proposal)

#

2025-10-06T12:48:03.548719+00:00 namadan[1462068]: I[2025-10-06|13:48:03.548] committed state                              module=state height=3810719 num_txs=0 app_hash=6233D2D67EF58FDA21D9F812600AB862B40039957187B233094AEE27B73DDD68
2025-10-06T12:48:03.556438+00:00 namadan[1462068]: I[2025-10-06|13:48:03.556] indexed block events                         module=txindex height=3810719
2025-10-06T12:48:07.335323+00:00 namadan[1462068]: I[2025-10-06|13:48:07.335] Saving AddrBook to file                      module=p2p book=/home/bod/.local/share/namada/namada.5f5de2dd1b88cba30586420/cometbft/config/addrbook.json size=507
2025-10-06T12:48:07.337651+00:00 namadan[1462068]: I[2025-10-06|13:48:07.335] Ensure peers                                 module=pex numOutPeers=19 numInPeers=0 numDialing=0 numToDial=-9
2025-10-06T12:48:08.464810+00:00 namadan[1462068]: I[2025-10-06|13:48:08.464] Timed out                                    module=consensus dur=4.907003608s height=3810720 round=0 step=RoundStepNewHeight
2025-10-06T12:48:11.465166+00:00 namadan[1462068]: I[2025-10-06|13:48:11.465] Timed out                                    module=consensus dur=3s height=3810720 round=0 step=RoundStepPropose
2025-10-06T12:48:12.958325+00:00 namadan[1462068]: I[2025-10-06|13:48:12.958] Timed out                                    module=consensus dur=1s height=3810720 round=0 step=RoundStepPrecommitWait
2025-10-06T12:48:12.960045+00:00 namadan[1462068]: I[2025-10-06|13:48:12.958] resetting proposal info                      module=consensus height=3810720 round=1 proposer=D5BD3C7A12012FE5CCCD1672D6D8CB2596A3ABC5

desert mortar Oct 6, 2025, 4:36 PM

#

worthy vessel graph from last night as requested

can you share another graph centred on yesterday 20:30 UTC, i.e. yesterday 15:30 central daylight time?

worthy vessel Oct 6, 2025, 5:32 PM

#

nothing really to show at 1530 🤷‍♂️
What this might imply is that the root issue hits nodes at different times, and when its problematic for us, is when it hits both our relay nodes at the same time.

Screenshot_2025-10-06_at_12.30.04_PM.png

#

as we can see, it doesnt affect all of our nodes at the same time ^ -- also this is validators missing from consensus, ignore the title.

native sierra Oct 6, 2025, 6:58 PM

#

my current guess is that this might be due to the underlying p2p mesh

#

maybe if we try to setup some seed nodes and gradually switch every validator to use seeds nodes instead of persistent peers it might solve ht eissue

#

i guess right now we have a small subset of nodes with a high number of connection

#

ill think about this

native sierra Oct 7, 2025, 10:58 AM

#

okay so we took another look at our nodes/monitoring and cross checked with ur logs/data

#

we believe that this is not an issue with namada but with the underlying p2p mesh

#

we will spawn a seed node and see if things gets better

#

we will test first on housefire, later on mainnet

rocky silo Oct 7, 2025, 12:21 PM

#

native sierra we believe that this is not an issue with namada but with the underlying p2p mes...

what exactly does this mean? is that then an issue with how comet operates? (could is be solved by upgrading comet as well?)

native sierra Oct 7, 2025, 1:54 PM

#

its an issue with how we are bootstraping the network

rocky silo Oct 7, 2025, 1:55 PM

#

ok thanks

worthy vessel Oct 7, 2025, 3:00 PM

#

@native sierra we already run our own seed node and after taking a look, its slowly losing peers over time. In the past the seed node was also affected by this as well.

native sierra Oct 7, 2025, 4:00 PM

#

are u using any persistent peers to bootstrap that seed node?

worthy vessel Oct 7, 2025, 11:42 PM

#

yes we are using 5 persistent peers and 2 seed nodes on our seed node

native sierra Oct 8, 2025, 12:52 PM

#

can you guys share your cometbft p2p configuration for ur public facing nodes? In the naamda base dir, there is a file called config.toml can you share the p2p section?

worthy vessel Oct 8, 2025, 1:49 PM

#

this is our seed node

    [ledger.cometbft.p2p]
    laddr = "tcp://0.0.0.0:26656"
    external_address = "<redacted>"
    seeds = "tcp://75f923ea0a181a6ae917a00cfe81d780c3d1d11a@95.216.78.215:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
    persistent_peers = "tcp://05309c2cce2d163027a47c662066907e89cd6b99@104.251.123.123:26656,tcp://54386c1252ecabe5ba1fae2f083b37ca5ebd57dc@192.64.82.62:26656,tcp://96f7945f9470faacce66888d798bf1f131913b6c@namada-mainnet-peer.denodes.xyz:26656,tcp://ac1976c8d2b6fcd75643e0e17e44cb7b00b203d3@139.45.205.58:26656,tcp://e75f49448703666b26eb6f414e044fdb6842dd9a@37.27.56.233:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
    upnp = false
    addr_book_file = "config/addrbook.json"
    addr_book_strict = false
    max_num_inbound_peers = 25
    max_num_outbound_peers = 25
    unconditional_peer_ids = ""
    persistent_peers_max_dial_period = "0ms"
    flush_throttle_timeout = "100ms"
    max_packet_msg_payload_size = 1024
    send_rate = 5120000
    recv_rate = 5120000
    pex = true
    seed_mode = true
    private_peer_ids = ""
    allow_duplicate_ip = false
    handshake_timeout = "20000ms"
    dial_timeout = "3000ms"

#

this is our rpc node

    [ledger.cometbft.p2p]
    laddr = "tcp://0.0.0.0:26656"
    external_address = "<redacted>"
    seeds = "tcp://75f923ea0a181a6ae917a00cfe81d780c3d1d11a@95.216.78.215:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656,tcp://e72bd611b0b52da27cc0a54ccdf94bafca96f816@34.175.225.192:26656"
    persistent_peers = "tcp://05309c2cce2d163027a47c662066907e89cd6b99@104.251.123.123:26656,tcp://54386c1252ecabe5ba1fae2f083b37ca5ebd57dc@192.64.82.62:26656,tcp://96f7945f9470faacce66888d798bf1f131913b6c@namada-mainnet-peer.denodes.xyz:26656,tcp://ac1976c8d2b6fcd75643e0e17e44cb7b00b203d3@139.45.205.58:26656,tcp://e75f49448703666b26eb6f414e044fdb6842dd9a@37.27.56.233:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
    upnp = false
    addr_book_file = "config/addrbook.json"
    addr_book_strict = false
    max_num_inbound_peers = 25
    max_num_outbound_peers = 25
    unconditional_peer_ids = ""
    persistent_peers_max_dial_period = "0ms"
    flush_throttle_timeout = "100ms"
    max_packet_msg_payload_size = 1024
    send_rate = 5120000
    recv_rate = 5120000
    pex = true
    seed_mode = false
    private_peer_ids = ""
    allow_duplicate_ip = false
    handshake_timeout = "20000ms"
    dial_timeout = "3000ms"

sturdy geode Oct 15, 2025, 2:29 PM

#

hey team - just checking in to see if any progress was made or if you need anymore information, thanks.

native sierra Oct 15, 2025, 4:41 PM

#

we are still trying to figure it out, right now our best guess is that the network is not well connected and this makes p2p issues arise but its just a guess from what we can observe. We are testing seed nodes on testnet and we will distrubute them in the following days also for mainnet.

#

if you have more data (logs/grafana graphs, etc...) we are happy to take a look

sturdy geode Oct 15, 2025, 5:39 PM

#

I checked our monitoring, this did happen early UTC time on Oct 12. I've attached logs.
The 1a/b instance logs did NOT see the issue.
the 2a/b instance logs did see the issue.
According to our monitoring, issue started shortly after 2:00 UTC on the 12th and recovered around an hour later.

📎 2025-10-11-2025-10-12.logs.tar

native sierra Oct 16, 2025, 3:16 PM

#

hey guys, do u mind sharing rocksdb logs from a node that was affected by this issue? they are located in folder $chain-folder/db, the files we need are LOG and the 10 most recent LOG.old.$timestamp

sturdy geode Oct 16, 2025, 4:58 PM

#

do you need it for that particular date, or just the 10 most recent in general?

native sierra Oct 16, 2025, 6:28 PM

#

in general

sturdy geode Oct 16, 2025, 8:36 PM

#

sounds good, pinged you separately on how to transfer the logs as they're too large for discord

native sierra Oct 21, 2025, 12:32 PM

#

ehi guys, can you try to set NAMADA_ROCKSDB_COMPACTION_THREADS=X where X is half the number of cores you have (but must be >= 2) and see if things improve?

native sierra Oct 22, 2025, 2:45 PM

#

ehi, anybody got any time to try this ?

worthy vessel Oct 24, 2025, 11:34 AM

#

We are at an offsite this week, we will apply these changes next week! Thank you! 🙏

native sierra Oct 28, 2025, 2:14 PM

#

we think we have a possible fix for the issue, the think the culprit was compaction

#

obv the fix is one line https://github.com/namada-net/namada/pull/4946

rocky silo Oct 28, 2025, 3:33 PM

#

very good!

worthy vessel Oct 29, 2025, 4:34 PM

#

Thanks @native sierra! when do we plan on releasing this patch to mainnet?

#who else except @Nick | u410 is