#who else except @Nick | u410 is
1 messages ยท Page 1 of 1 (latest)
hi guys! we are looking into this but its kind of a weird issue. Everything regarding the p2p network is handled for us by cometbft so this makes it even more difficult to debug. We received some logs from @worthy vessel and indeed something is looking weird: Jul 30 08:19:15 namada-mainnet-group-2-relay-1 fddc0a83fced[611]: {"_msg":"Timed out","dur":"-54.284923296s","height":2968922,"level":"info","module":"consensus","round":0,"step":"RoundStepNewHeight","ts":"2025-07-30T08:19:15.255705207Z"} the Time out value is negative which should be impossible.
Do you guys have more logs (with COMETBFT logs enabled) around one of those specific events? it would help us understand things better
One thing that might help with this is upgrading to the latest cometbft version. We are working on this but they changed some internal mechanism and we need to refactor some logic
Hey @native sierra, I'm looking through logs on my mainnet VN here and I think I'm seeing missed votes on blocks from time to time. I'm not 100% sure what to look for in comet logs though.
I've tried to find an example close to the example from 2025-09-17 that appears in both images above (my log entries don't go back as far as 2025-07-30 in your example above)/
Here it is with some extra annotation :
The things that stand out are the two extra timeouts and the subsequent 'resetting proposal info'
and then obviously all the failures to add votes due to 'unexpected step'
we could really do with pinning down the UTC timestamps for these events, trying to match against screenshots of graphs isn't great
or of course the bock heights of these events
btw comet logging levels for the excerpt above: CMT_LOG_LEVEL=p2p:none,pex:error,*:info
this is perfect thanks
can you share the whole logs too?
sure, it's all done via systemd and then output via journalctl, do you want everything I've got?
what I have goes back to 2025-08-17
(if that's what you mean by 'whole logs')
ok how about:
2025-09-16T00:00:00Z -> 2025-09-18T00:00:00Z
and,
2025-09-21T00:00:00Z -> 2025-09-23T00:00:00Z
I'm not sure about timezones so just sticking with UTC midnight to midnight to keep it simple
just a heads up, my node is local here running on fairly old hardware, 1gb/s internet over a wireguard VPN and can't listen for new peer connections (yes, bad I know...)
btw I ended up using 18:00 UTC +-24hrs for both spikes
thanks!
I've archived the full set of logs up to current, now running with CMT_LOG_LEVEL=*:info in order to also have p2p and pex logging at info level
considerably more log entries so not sure how long before stuff gets rotated out, but might be useful to have some p2p entries if we see another spike of missed validator votes
@worthy vessel do u have logs (with cometbft logs) +/- 1 day around those block heights?
Not sure whether there is any clue in our log, but sharing it.
journalctl -u namadad -o cat --since "2025-09-16 10:00:00" --until "2025-09-18 23:00
:00" > namada-2025-09-16-2025-09-18.log
let me know whether logs also from a public node could help (not validator)
probably worth using -o short-iso-precise and --utc so that we have accurate timestamps and also not have to translate between different timezones
here are our logs from the same time period
thanks for the logs! we are looking into this but its kinda tricky. We added some more internal alerts in case this happens again
keep u guys posted
more logs from last night with full p2p/pex entries: https://github.com/namada-net/namada/issues/4899
@native sierra you know if this issue is also a thing in the newer releases we run on housefire?
it must be yeah
@worthy vessel did your node al expirienced some dagraded performance yesterday ~22/23pm utc ?
I'm pretty sure I saw deNodes having trouble according to explorer75, amongst a few others
looks like around that time we did experience a slight drop in peers, but not like what we saw a week ago. our validators were still able to keep signing
what about today between 10am to 10:30am utc ?
I don't see anything out of the ordinary
Not between those times but I see a shortlived problem this morning at 09:20 UTC
I recognise the peer ID, it seems to crop up around this problem a lot: peer=b30dcb6e68f9ebac54ef6386905e71b8c95b1743
anyone know if it's possible to match peer_id -> validator_id and/or validator_id -> validator_name using information from an RPC endpoint?
or does that require an indexer?
depends on if its exposing endpoints or not, if it is, you can hit the /status endpoint and get the validator_info which has the pubkey of the validator(it might just be an auto-generated key and not a validator as well). To know if its a validator, you would need to take the pubkey and see if its in the active set
graph from last night as requested
thanks! not sure what timezone the x-axis has, problems I saw were at around 20:30 UTC -> 21:30 UTC last night
would either of those spikes correspond to that period?
whats the difference between purple and orange? also i guess time is utc?
different nodes, time is Central Time
funnily enough I was thinking Nick's graph UTC times because I'm seeing same kind of errors as last night starting at 04:20 UTC
so central time = UTC -6, then spikes on Nicks graph are at ~10:20 utc and ~13:40 utc?
oh but are you daylight saving right now?
so if the times are central daylight time then they are UTC -5h, meaning purple spike coincides with problems I saw in logs this morning
#1422959254180073515 message
as for orange spike I do see what might be a single missed block (timed out waiting for proposal)
2025-10-06T12:48:03.548719+00:00 namadan[1462068]: I[2025-10-06|13:48:03.548] committed state module=state height=3810719 num_txs=0 app_hash=6233D2D67EF58FDA21D9F812600AB862B40039957187B233094AEE27B73DDD68
2025-10-06T12:48:03.556438+00:00 namadan[1462068]: I[2025-10-06|13:48:03.556] indexed block events module=txindex height=3810719
2025-10-06T12:48:07.335323+00:00 namadan[1462068]: I[2025-10-06|13:48:07.335] Saving AddrBook to file module=p2p book=/home/bod/.local/share/namada/namada.5f5de2dd1b88cba30586420/cometbft/config/addrbook.json size=507
2025-10-06T12:48:07.337651+00:00 namadan[1462068]: I[2025-10-06|13:48:07.335] Ensure peers module=pex numOutPeers=19 numInPeers=0 numDialing=0 numToDial=-9
2025-10-06T12:48:08.464810+00:00 namadan[1462068]: I[2025-10-06|13:48:08.464] Timed out module=consensus dur=4.907003608s height=3810720 round=0 step=RoundStepNewHeight
2025-10-06T12:48:11.465166+00:00 namadan[1462068]: I[2025-10-06|13:48:11.465] Timed out module=consensus dur=3s height=3810720 round=0 step=RoundStepPropose
2025-10-06T12:48:12.958325+00:00 namadan[1462068]: I[2025-10-06|13:48:12.958] Timed out module=consensus dur=1s height=3810720 round=0 step=RoundStepPrecommitWait
2025-10-06T12:48:12.960045+00:00 namadan[1462068]: I[2025-10-06|13:48:12.958] resetting proposal info module=consensus height=3810720 round=1 proposer=D5BD3C7A12012FE5CCCD1672D6D8CB2596A3ABC5
can you share another graph centred on yesterday 20:30 UTC, i.e. yesterday 15:30 central daylight time?
nothing really to show at 1530 ๐คทโโ๏ธ
What this might imply is that the root issue hits nodes at different times, and when its problematic for us, is when it hits both our relay nodes at the same time.
as we can see, it doesnt affect all of our nodes at the same time ^ -- also this is validators missing from consensus, ignore the title.
my current guess is that this might be due to the underlying p2p mesh
maybe if we try to setup some seed nodes and gradually switch every validator to use seeds nodes instead of persistent peers it might solve ht eissue
i guess right now we have a small subset of nodes with a high number of connection
ill think about this
okay so we took another look at our nodes/monitoring and cross checked with ur logs/data
we believe that this is not an issue with namada but with the underlying p2p mesh
we will spawn a seed node and see if things gets better
we will test first on housefire, later on mainnet
what exactly does this mean? is that then an issue with how comet operates? (could is be solved by upgrading comet as well?)
its an issue with how we are bootstraping the network
ok thanks
@native sierra we already run our own seed node and after taking a look, its slowly losing peers over time. In the past the seed node was also affected by this as well.
are u using any persistent peers to bootstrap that seed node?
yes we are using 5 persistent peers and 2 seed nodes on our seed node
can you guys share your cometbft p2p configuration for ur public facing nodes? In the naamda base dir, there is a file called config.toml can you share the p2p section?
this is our seed node
[ledger.cometbft.p2p]
laddr = "tcp://0.0.0.0:26656"
external_address = "<redacted>"
seeds = "tcp://75f923ea0a181a6ae917a00cfe81d780c3d1d11a@95.216.78.215:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
persistent_peers = "tcp://05309c2cce2d163027a47c662066907e89cd6b99@104.251.123.123:26656,tcp://54386c1252ecabe5ba1fae2f083b37ca5ebd57dc@192.64.82.62:26656,tcp://96f7945f9470faacce66888d798bf1f131913b6c@namada-mainnet-peer.denodes.xyz:26656,tcp://ac1976c8d2b6fcd75643e0e17e44cb7b00b203d3@139.45.205.58:26656,tcp://e75f49448703666b26eb6f414e044fdb6842dd9a@37.27.56.233:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
upnp = false
addr_book_file = "config/addrbook.json"
addr_book_strict = false
max_num_inbound_peers = 25
max_num_outbound_peers = 25
unconditional_peer_ids = ""
persistent_peers_max_dial_period = "0ms"
flush_throttle_timeout = "100ms"
max_packet_msg_payload_size = 1024
send_rate = 5120000
recv_rate = 5120000
pex = true
seed_mode = true
private_peer_ids = ""
allow_duplicate_ip = false
handshake_timeout = "20000ms"
dial_timeout = "3000ms"
this is our rpc node
[ledger.cometbft.p2p]
laddr = "tcp://0.0.0.0:26656"
external_address = "<redacted>"
seeds = "tcp://75f923ea0a181a6ae917a00cfe81d780c3d1d11a@95.216.78.215:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656,tcp://e72bd611b0b52da27cc0a54ccdf94bafca96f816@34.175.225.192:26656"
persistent_peers = "tcp://05309c2cce2d163027a47c662066907e89cd6b99@104.251.123.123:26656,tcp://54386c1252ecabe5ba1fae2f083b37ca5ebd57dc@192.64.82.62:26656,tcp://96f7945f9470faacce66888d798bf1f131913b6c@namada-mainnet-peer.denodes.xyz:26656,tcp://ac1976c8d2b6fcd75643e0e17e44cb7b00b203d3@139.45.205.58:26656,tcp://e75f49448703666b26eb6f414e044fdb6842dd9a@37.27.56.233:26656,tcp://eaa6bda69dc3281186b5f150c21e62e6e4325d34@152.53.104.210:26656"
upnp = false
addr_book_file = "config/addrbook.json"
addr_book_strict = false
max_num_inbound_peers = 25
max_num_outbound_peers = 25
unconditional_peer_ids = ""
persistent_peers_max_dial_period = "0ms"
flush_throttle_timeout = "100ms"
max_packet_msg_payload_size = 1024
send_rate = 5120000
recv_rate = 5120000
pex = true
seed_mode = false
private_peer_ids = ""
allow_duplicate_ip = false
handshake_timeout = "20000ms"
dial_timeout = "3000ms"
hey team - just checking in to see if any progress was made or if you need anymore information, thanks.
we are still trying to figure it out, right now our best guess is that the network is not well connected and this makes p2p issues arise but its just a guess from what we can observe. We are testing seed nodes on testnet and we will distrubute them in the following days also for mainnet.
if you have more data (logs/grafana graphs, etc...) we are happy to take a look
I checked our monitoring, this did happen early UTC time on Oct 12. I've attached logs.
The 1a/b instance logs did NOT see the issue.
the 2a/b instance logs did see the issue.
According to our monitoring, issue started shortly after 2:00 UTC on the 12th and recovered around an hour later.
hey guys, do u mind sharing rocksdb logs from a node that was affected by this issue? they are located in folder $chain-folder/db, the files we need are LOG and the 10 most recent LOG.old.$timestamp
do you need it for that particular date, or just the 10 most recent in general?
in general
sounds good, pinged you separately on how to transfer the logs as they're too large for discord
ehi guys, can you try to set NAMADA_ROCKSDB_COMPACTION_THREADS=X where X is half the number of cores you have (but must be >= 2) and see if things improve?
ehi, anybody got any time to try this ?
We are at an offsite this week, we will apply these changes next week! Thank you! ๐
we think we have a possible fix for the issue, the think the culprit was compaction
obv the fix is one line https://github.com/namada-net/namada/pull/4946
very good!
Thanks @native sierra! when do we plan on releasing this patch to mainnet?