#who else except @Nick | u410 is

1 messages ยท Page 1 of 1 (latest)

native sierra
#

hi guys! we are looking into this but its kind of a weird issue. Everything regarding the p2p network is handled for us by cometbft so this makes it even more difficult to debug. We received some logs from @worthy vessel and indeed something is looking weird: Jul 30 08:19:15 namada-mainnet-group-2-relay-1 fddc0a83fced[611]: {"_msg":"Timed out","dur":"-54.284923296s","height":2968922,"level":"info","module":"consensus","round":0,"step":"RoundStepNewHeight","ts":"2025-07-30T08:19:15.255705207Z"} the Time out value is negative which should be impossible.

#

Do you guys have more logs (with COMETBFT logs enabled) around one of those specific events? it would help us understand things better

#

One thing that might help with this is upgrading to the latest cometbft version. We are working on this but they changed some internal mechanism and we need to refactor some logic

desert mortar
#

I've tried to find an example close to the example from 2025-09-17 that appears in both images above (my log entries don't go back as far as 2025-07-30 in your example above)/

#

Here it is with some extra annotation :

#

The things that stand out are the two extra timeouts and the subsequent 'resetting proposal info'

#

and then obviously all the failures to add votes due to 'unexpected step'

#

we could really do with pinning down the UTC timestamps for these events, trying to match against screenshots of graphs isn't great

#

or of course the bock heights of these events

#

btw comet logging levels for the excerpt above: CMT_LOG_LEVEL=p2p:none,pex:error,*:info

native sierra
#

can you share the whole logs too?

desert mortar
#

sure, it's all done via systemd and then output via journalctl, do you want everything I've got?

#

what I have goes back to 2025-08-17

#

(if that's what you mean by 'whole logs')

native sierra
#

oh i mean logs 1 day before and 1 day after the big spikes in nick graph

desert mortar
#

ok how about:
2025-09-16T00:00:00Z -> 2025-09-18T00:00:00Z
and,
2025-09-21T00:00:00Z -> 2025-09-23T00:00:00Z

#

I'm not sure about timezones so just sticking with UTC midnight to midnight to keep it simple

desert mortar
#

btw I ended up using 18:00 UTC +-24hrs for both spikes

native sierra
#

thanks!

desert mortar
#

I've archived the full set of logs up to current, now running with CMT_LOG_LEVEL=*:info in order to also have p2p and pex logging at info level

#

considerably more log entries so not sure how long before stuff gets rotated out, but might be useful to have some p2p entries if we see another spike of missed validator votes

native sierra
#

@worthy vessel do u have logs (with cometbft logs) +/- 1 day around those block heights?

merry vault
#

Not sure whether there is any clue in our log, but sharing it.

journalctl -u namadad -o cat   --since "2025-09-16 10:00:00"   --until "2025-09-18 23:00
:00" > namada-2025-09-16-2025-09-18.log
merry vault
#

let me know whether logs also from a public node could help (not validator)

desert mortar
sturdy geode
native sierra
#

thanks for the logs! we are looking into this but its kinda tricky. We added some more internal alerts in case this happens again

#

keep u guys posted

native sierra
#

thanks! we will take a look

#

thansk for the issue too

rocky silo
#

@native sierra you know if this issue is also a thing in the newer releases we run on housefire?

native sierra
#

@worthy vessel did your node al expirienced some dagraded performance yesterday ~22/23pm utc ?

desert mortar
#

I'm pretty sure I saw deNodes having trouble according to explorer75, amongst a few others

sturdy geode
native sierra
sturdy geode
desert mortar
#

I recognise the peer ID, it seems to crop up around this problem a lot: peer=b30dcb6e68f9ebac54ef6386905e71b8c95b1743

#

anyone know if it's possible to match peer_id -> validator_id and/or validator_id -> validator_name using information from an RPC endpoint?

#

or does that require an indexer?

worthy vessel
#

depends on if its exposing endpoints or not, if it is, you can hit the /status endpoint and get the validator_info which has the pubkey of the validator(it might just be an auto-generated key and not a validator as well). To know if its a validator, you would need to take the pubkey and see if its in the active set

#

graph from last night as requested

desert mortar
native sierra
worthy vessel
desert mortar
#

funnily enough I was thinking Nick's graph UTC times because I'm seeing same kind of errors as last night starting at 04:20 UTC

#

so central time = UTC -6, then spikes on Nicks graph are at ~10:20 utc and ~13:40 utc?

#

oh but are you daylight saving right now?

#

so if the times are central daylight time then they are UTC -5h, meaning purple spike coincides with problems I saw in logs this morning
#1422959254180073515 message

#

as for orange spike I do see what might be a single missed block (timed out waiting for proposal)

#
2025-10-06T12:48:03.548719+00:00 namadan[1462068]: I[2025-10-06|13:48:03.548] committed state                              module=state height=3810719 num_txs=0 app_hash=6233D2D67EF58FDA21D9F812600AB862B40039957187B233094AEE27B73DDD68
2025-10-06T12:48:03.556438+00:00 namadan[1462068]: I[2025-10-06|13:48:03.556] indexed block events                         module=txindex height=3810719
2025-10-06T12:48:07.335323+00:00 namadan[1462068]: I[2025-10-06|13:48:07.335] Saving AddrBook to file                      module=p2p book=/home/bod/.local/share/namada/namada.5f5de2dd1b88cba30586420/cometbft/config/addrbook.json size=507
2025-10-06T12:48:07.337651+00:00 namadan[1462068]: I[2025-10-06|13:48:07.335] Ensure peers                                 module=pex numOutPeers=19 numInPeers=0 numDialing=0 numToDial=-9
2025-10-06T12:48:08.464810+00:00 namadan[1462068]: I[2025-10-06|13:48:08.464] Timed out                                    module=consensus dur=4.907003608s height=3810720 round=0 step=RoundStepNewHeight
2025-10-06T12:48:11.465166+00:00 namadan[1462068]: I[2025-10-06|13:48:11.465] Timed out                                    module=consensus dur=3s height=3810720 round=0 step=RoundStepPropose
2025-10-06T12:48:12.958325+00:00 namadan[1462068]: I[2025-10-06|13:48:12.958] Timed out                                    module=consensus dur=1s height=3810720 round=0 step=RoundStepPrecommitWait
2025-10-06T12:48:12.960045+00:00 namadan[1462068]: I[2025-10-06|13:48:12.958] resetting proposal info                      module=consensus height=3810720 round=1 proposer=D5BD3C7A12012FE5CCCD1672D6D8CB2596A3ABC5
desert mortar
worthy vessel
#

nothing really to show at 1530 ๐Ÿคทโ€โ™‚๏ธ
What this might imply is that the root issue hits nodes at different times, and when its problematic for us, is when it hits both our relay nodes at the same time.

#

as we can see, it doesnt affect all of our nodes at the same time ^ -- also this is validators missing from consensus, ignore the title.

native sierra
#

my current guess is that this might be due to the underlying p2p mesh

#

maybe if we try to setup some seed nodes and gradually switch every validator to use seeds nodes instead of persistent peers it might solve ht eissue

#

i guess right now we have a small subset of nodes with a high number of connection

#

ill think about this

native sierra
#

okay so we took another look at our nodes/monitoring and cross checked with ur logs/data

#

we believe that this is not an issue with namada but with the underlying p2p mesh

#

we will spawn a seed node and see if things gets better

#

we will test first on housefire, later on mainnet

rocky silo
native sierra
#

its an issue with how we are bootstraping the network

rocky silo
#

ok thanks

worthy vessel
#

@native sierra we already run our own seed node and after taking a look, its slowly losing peers over time. In the past the seed node was also affected by this as well.

native sierra
#

are u using any persistent peers to bootstrap that seed node?

worthy vessel
#

yes we are using 5 persistent peers and 2 seed nodes on our seed node

native sierra
#

can you guys share your cometbft p2p configuration for ur public facing nodes? In the naamda base dir, there is a file called config.toml can you share the p2p section?

worthy vessel
#

this is our seed node

    [ledger.cometbft.p2p]
    laddr = "tcp://0.0.0.0:26656"
    external_address = "<redacted>"
    seeds = "tcp://75f923ea0a181a6ae917a00cfe81d780c3d1d11a@95.216.78.215:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
    persistent_peers = "tcp://05309c2cce2d163027a47c662066907e89cd6b99@104.251.123.123:26656,tcp://54386c1252ecabe5ba1fae2f083b37ca5ebd57dc@192.64.82.62:26656,tcp://96f7945f9470faacce66888d798bf1f131913b6c@namada-mainnet-peer.denodes.xyz:26656,tcp://ac1976c8d2b6fcd75643e0e17e44cb7b00b203d3@139.45.205.58:26656,tcp://e75f49448703666b26eb6f414e044fdb6842dd9a@37.27.56.233:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
    upnp = false
    addr_book_file = "config/addrbook.json"
    addr_book_strict = false
    max_num_inbound_peers = 25
    max_num_outbound_peers = 25
    unconditional_peer_ids = ""
    persistent_peers_max_dial_period = "0ms"
    flush_throttle_timeout = "100ms"
    max_packet_msg_payload_size = 1024
    send_rate = 5120000
    recv_rate = 5120000
    pex = true
    seed_mode = true
    private_peer_ids = ""
    allow_duplicate_ip = false
    handshake_timeout = "20000ms"
    dial_timeout = "3000ms"
#

this is our rpc node

    [ledger.cometbft.p2p]
    laddr = "tcp://0.0.0.0:26656"
    external_address = "<redacted>"
    seeds = "tcp://75f923ea0a181a6ae917a00cfe81d780c3d1d11a@95.216.78.215:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656,tcp://e72bd611b0b52da27cc0a54ccdf94bafca96f816@34.175.225.192:26656"
    persistent_peers = "tcp://05309c2cce2d163027a47c662066907e89cd6b99@104.251.123.123:26656,tcp://54386c1252ecabe5ba1fae2f083b37ca5ebd57dc@192.64.82.62:26656,tcp://96f7945f9470faacce66888d798bf1f131913b6c@namada-mainnet-peer.denodes.xyz:26656,tcp://ac1976c8d2b6fcd75643e0e17e44cb7b00b203d3@139.45.205.58:26656,tcp://e75f49448703666b26eb6f414e044fdb6842dd9a@37.27.56.233:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
    upnp = false
    addr_book_file = "config/addrbook.json"
    addr_book_strict = false
    max_num_inbound_peers = 25
    max_num_outbound_peers = 25
    unconditional_peer_ids = ""
    persistent_peers_max_dial_period = "0ms"
    flush_throttle_timeout = "100ms"
    max_packet_msg_payload_size = 1024
    send_rate = 5120000
    recv_rate = 5120000
    pex = true
    seed_mode = false
    private_peer_ids = ""
    allow_duplicate_ip = false
    handshake_timeout = "20000ms"
    dial_timeout = "3000ms"
sturdy geode
#

hey team - just checking in to see if any progress was made or if you need anymore information, thanks.

native sierra
#

we are still trying to figure it out, right now our best guess is that the network is not well connected and this makes p2p issues arise but its just a guess from what we can observe. We are testing seed nodes on testnet and we will distrubute them in the following days also for mainnet.

#

if you have more data (logs/grafana graphs, etc...) we are happy to take a look

sturdy geode
#

I checked our monitoring, this did happen early UTC time on Oct 12. I've attached logs.
The 1a/b instance logs did NOT see the issue.
the 2a/b instance logs did see the issue.
According to our monitoring, issue started shortly after 2:00 UTC on the 12th and recovered around an hour later.

native sierra
#

hey guys, do u mind sharing rocksdb logs from a node that was affected by this issue? they are located in folder $chain-folder/db, the files we need are LOG and the 10 most recent LOG.old.$timestamp

sturdy geode
#

do you need it for that particular date, or just the 10 most recent in general?

native sierra
#

in general

sturdy geode
#

sounds good, pinged you separately on how to transfer the logs as they're too large for discord

native sierra
#

ehi guys, can you try to set NAMADA_ROCKSDB_COMPACTION_THREADS=X where X is half the number of cores you have (but must be >= 2) and see if things improve?

native sierra
#

ehi, anybody got any time to try this ?

worthy vessel
#

We are at an offsite this week, we will apply these changes next week! Thank you! ๐Ÿ™

native sierra
#

we think we have a possible fix for the issue, the think the culprit was compaction

rocky silo
#

very good!

worthy vessel
#

Thanks @native sierra! when do we plan on releasing this patch to mainnet?