#the elusive bug
the block in question had three IBC transactions:
- accepted
- failed (out of gas)
- failed (invalid event)
why was the event invalid in Transaction 3? it shouldn't have been. it's because the event from Transaction 2 wasn't dropped, and the node treated this invalid event as belonging to Transaction 3
this was a problem for the chain because Transaction 3 was only invalid in the newer client version, and in v0.31.6 the Transaction 2 event did not emit. wat?
the newer version's gas metering is a bit cheaper than v0.31.6's, so Transaction 2 was able to execute one more line in the newer version than in v0.31.6 before running out of gas, and this line emitted the invalid IBC event. so in v0.31.6 Transaction 3 is valid, and in the new version Transaction 3 is invalid
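A toy sketch of the gas point above, not Namada's actual metering: the step names, the gas limit and the per-step costs are all made up for illustration. It only shows how a slightly cheaper cost per step lets the same transaction get one step further before it runs out of gas.

```python
def run_tx(steps, gas_limit, cost_per_step):
    """Run steps in order, charging gas before each one.

    Returns (completed_steps, out_of_gas). Every name and number here
    is hypothetical; this is not Namada's gas model.
    """
    gas_used = 0
    completed = []
    for step in steps:
        gas_used += cost_per_step
        if gas_used > gas_limit:
            return completed, True  # out of gas before this step ran
        completed.append(step)
    return completed, False


# Hypothetical steps of the transaction that fails with out of gas.
STEPS = ["validate", "transfer", "emit_ibc_event", "finalize"]

# "Older" metering: each step costs more, so the tx stops earlier.
old_steps, old_oog = run_tx(STEPS, gas_limit=100, cost_per_step=45)

# "Newer" metering: each step is a bit cheaper, so one more step runs.
new_steps, new_oog = run_tx(STEPS, gas_limit=100, cost_per_step=30)

print(old_steps, old_oog)
print(new_steps, new_oog)
```

Both runs end in out of gas, but only the cheaper one reaches the step that emits the event, which is the kind of divergence between client versions described above.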
when a tx fails, we drop the write log (ie. don't write to storage) but we do not drop the IBC event (and we should). this is the bug: we must drop events when transactions fail
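A minimal sketch of the bug as described above, under loud assumptions: the shared event buffer, the `run_block` helper and the tuple layout are all hypothetical and are not how Namada is structured internally. It only models "write log dropped on failure, events not dropped" versus "both dropped".

```python
def run_block(txs, drop_events_on_failure):
    """Toy block executor (hypothetical, not Namada code).

    txs: list of (name, writes, events, ran_ok) tuples.
    Returns (outcomes, storage).
    """
    storage = {}
    event_buffer = []  # assumed shared buffer of not-yet-committed events
    outcomes = {}
    for name, writes, events, ran_ok in txs:
        write_log = dict(writes)
        event_buffer.extend(events)
        # A tx is accepted only if it ran to completion and every event
        # currently attributed to it is valid.
        ok = ran_ok and all(e != "invalid" for e in event_buffer)
        if ok:
            storage.update(write_log)  # commit the write log
            event_buffer.clear()       # events are committed with the tx
        elif drop_events_on_failure:
            event_buffer.clear()       # the fix: drop events too
        # On failure the write log is simply never applied to storage.
        outcomes[name] = ok
    return outcomes, storage


# Newer client: Transaction 2 emits an invalid event before running out of gas.
new_block = [
    ("tx1", {"a": 1}, ["ok"], True),
    ("tx2", {"b": 2}, ["invalid"], False),
    ("tx3", {"c": 3}, ["ok"], True),
]
# v0.31.6: Transaction 2 runs out of gas before emitting anything.
old_block = [
    ("tx1", {"a": 1}, ["ok"], True),
    ("tx2", {"b": 2}, [], False),
    ("tx3", {"c": 3}, ["ok"], True),
]

buggy_new = run_block(new_block, drop_events_on_failure=False)
buggy_old = run_block(old_block, drop_events_on_failure=False)
fixed_new = run_block(new_block, drop_events_on_failure=True)
```

In this toy model the buggy executor rejects tx3 for the newer client's block but accepts it for the v0.31.6 block, so the two versions disagree on state; with events dropped on failure, tx3 is accepted in both.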
better Thursday than Friday, right).
Unjailed's fixed, right?
@minor nexus Thank you for the breakdown!
Thanks Gavin for the explanation. Looking forward to SE resuming soon!
The context and detail is very helpful
nice
Ouch, that was a mouthful.
Well done team for catching that, and to the person who triggered it. I've seen some people spamming IBC txs, so that might explain
This is why we have testnets to catch devious bugs like this
Thanks for updating us
agreed !
yeah, thx for the update, at such a late hour !
don't think unjail gets fixed until the hardfork in 32.0?
does this mean we restart on 31.6 tmw or that we get a new version? good catch
the unjail fix will come with the governance proposal and hard fork
v0.31.9 should get cut tomorrow for the restart
v0.32.0 soon after for the hard fork
unjail will be fixed already in 0.32, it turns out.
Is the rollback command of any use? Seems to take a lot of time and CPU to rollback 1 block though.
rollback was tested last year, which confirmed the high CPU usage, but in the end it seemed to work.
#validator-testnet message
please fix the faucet
I actually did that transaction
Will I get roids if I show proof? I broke the chain and I'm likely to do it again
I think the team should not let everyone wait that long to debug next time. It's almost a week already. A revert should be considered quickly the first time. SE will go on for a long time if we debug issues like this
If the team wants to have a long testnet for debugging issues, better to create another testnet event for that than SE. That's my thought.
Well done team!
whaaat really? haha i was curious if we could track down who did it
i don't think that there's a ROID category for this, but yeah if you can break the chain again i will reward you with a bounty
When is shielded-expedition expected to end? thank sir
nice one ........ catchy bug indeed
yes there is, security "stop the chain"
let's be fair, I broke the faucet and reported it and I got points, so @snow crow deserves his points too if he can prove the tx
doubt it was done purposely
there's a security category for submitting proof of how to, not for actually halting the chain (and tbh given what was done here was specifically upgrading to a client we were told not to touch, I don't think it should be rewarded, nothing personal against favour)
we'll give at least one week notice, my guess is 2 weeks
talked with eng team, we think chain restart coordination could begin in ~6 hours
we've got ~4 hours more of syncing to do, then maybe ~1h to cut releases
Rollback was not possible?
Not that I am impatient, just wondering about the technical reason.
we got snapshots available if that can help speed the process
set your balance to 10000000 and create a snapshot 
ssshhhh
what's the block height of your snapshot?
still syncing but I'm at 74k approx atm, so can create one from there and any place up (once I get further up). @mortal sinew posted one in #shielded-expedition at around 90k I believe.
lmk if you need a cut at a certain block height..
there is also 90k snap (both w/ and w/o tx index)
#se-100 message
could you be so kind as to share it here for us plebs?
w index would be super useful
Snapshot at height 90000
https://namada-se-rpc.citadel.one/snap/namada-se-90000.tar.lz4
Contains db directory to be placed in chain id root dir and data directory to be placed under cometbft dir
priv_validator_state.json is removed from the snapshot to prevent validators from double signing by accident, as the node won't start without this file
Also made snapshot with tx index for RPC nodes:
https://namada-se-rpc.citadel.one/snap/namada-se-90000-tx_index.tar.lz4
excellent, the latter would be very helpful
if they don't decide to start chain below that height ofc
you have any idea how the indexer itself would behave to a slight rollback?
if we're going to restart same way as the last halt, we just need to re-execute a block we're currently trying to reach consensus on. And this block wasn't indexed yet
what client version are you syncing with?
what client version was this snapshot made with?
0.31.6
31.6
sure, I was thinking the plan may be to roll back some extra blocks, which would leave us with redundant blocks in the indexer. I'm not sure if it's made to handle that defensively or if we would need to fully resync (and/or do sketchy stuff in the db itself)
we can't rollback blocks without changing chain-id
which essentially will be a hard-fork and a new chain launch
hm ok
i mean rollback more than the current block
cometbft will connect to peers with the highest block and nodes will return to the same state we're in currently. Assuming they won't apphash trying to reach current block
fair
I suppose my question above would apply once we get to the hardfork then, but no need to cross that bridge until we get to it I suppose
missed conversation, yep, mine is with 0.31.6, but it is without tx.index for usual users, and will have with it as well soon
so, at least we have two sources that ppl can use
Are we planning to go with a new version or 0.31.6, now that the bug was identified?
new
Yeah, hence wondering if the snapshots will still be good for that
remains to be seen
maybe team will make snapshots made with new client available to us
if so, I hope they make one with indexer as well
For the testnet that is acceptable, but how would that play out with mainnet?
"Social consensus"?
How would the community verify the snapshots have not been tampered with?
maybe there's a theoretical s-class waiting for you there
Aha!
was it a question?
I'd try the same thing I did again and hope it breaks it again
yes, I am not sure about this since folks are talking about 0.31.6 currently
Wasn't on purpose but I'll do it again and see if it breaks
updated my comment above. I think I feel that given the only reason this happened was prematurely upgrading to a client version we were told not to touch because it could break the network, I lean towards thinking it shouldn't be rewarded.
otherwise it's pure mayhem whenever an unstable version is put out
but idk I'm not the one making the rules, gg on your part if you get rewarded
btw, will we be given a couple of hours warning before it all goes down?
I'd tend to lean on your side too pretoro. I don't know if "unintentionally breaking the chain" is a category that should be rewarded
Finding bugs, logging them through github, hell even putting up a PR if you have the skill is something I'd gladly reward if it were me. but an unintentional outage seems a bit far fetched
I think unintentional and pure luck are ok. I'm more wary if the cause is doing something we have been told would wreck the network and we should keep our hands off. but again just my opinion
correct
the only existing working now, yeah
all of this being said though, I'd say submit it and let team decide
not wanting to gatekeep people's points
snapshots will only work for our purposes if made with the unreleased v0.31.9
eng team is working on that now. Spork is going to use eng team's snapshot to test and then write instructions (to import from snapshot) so that we'll be ready to coordinate the restart
and yeah then maybe we can get blocks made without having to resync from genesis
fyi @finite rose
if tx in question was included in the latest pre-halt block, can't you just use a 0.31.6 snap and upgrade at some height after snap was done?
if 0.31.9 is not consensus-breaking, there is no need to resync whole state from genesis using it
idea of this 90k snap was to restore state using it, sync up to pre-halt block (or right away if it won't apphash between 90k and consensus block) and upgrade to binary version that fixes non-determinism
i'm just worried that we might not restart today if you're just starting to resync for a new snap. It took me 8h to sync with tx indexing enabled
we're at 70k / 90k on one machine, another (more powerful) machine is at 25k / 90k blocks
but yeah we gotta sync to test before cutting the v0.31.9 release
good stuff
are they syncing with indexer support?
otherwise it'll be a horrible mess for those of us running indexer
I didn't use a new version. I was on version 0.31.6 and I'm still there, I can show you
exactly what I thought preparing the snapshot
How long does it take to download the snapshot you provided? Thanks
Gavin explicitly said snapshots will only work for our purposes if made with the unreleased v0.31.9. That makes me wonder why a 0.31.6 snapshot up to 90043 is unusable. Is it because the engineering team wants to check if the changes in 0.31.9 yield the same state, or is the database changed?
In the latter case, can we still verify integrity?
The only thing that might potentially imply it is a rocksdb version bump
So hashes would still be intact, just a compatibility change
Otherwise i think it's perfectly fine to use snap from 0.31.6 and i'll test it on rpc node to see if prevote hashes will match with team's snap
argh! they don't think so
that's really bad for all apps running indexer
6gb roughly
the bug gavin described was caused by a tx made with the newer version I believe?
so I doubt you could have made that tx if you were still on 31.6
unless I totally got it all wrong
R.I.P. namascan
and simple votes checker and transactions checker
ok so it means the snapshots from v0.31.6 are useless / unusable? and we have to wait for yours
I'll try anyway
keep us posted.
need a binary to try, so waiting for the official announcement and then will see if it works with that snap
okay i hear you, i will chat with eng team
okay maybe we can use this! i'm hearing from eng team that you can use this snapshot (assuming that you trust @sour marten haha)
the issue is that we need to use v0.32.0 before cutting this as a release:
but for everyone else's purposes, the snapshot above should work
So should we wait for 0.32.0 to start our nodes? Or what? Do we relaunch today? coz tomorrow it's friday...
Maybe we need to break the spell and start on a friday
no pls
do we even need to use snapshot? won't turning it back on again after new consensus forms have node pick up where it should?
just made one
sorry for that piece of sarcasm)
hi, im using 31.6 latest height 90044, so we dont need this snap ? and just upgrade to 32?
everyone has the same as you, and it's exactly that stuck block that is bad. that's why you either resync or use a snapshot, which is so far only partially confirmed as a working solution
but will that be necessary once a new consensus forms?
ow ic. thx
there's a prerelease about to be published, you can use that
i imagine yes.. but let me check
What about the spreadsheet
which?
About the task submissions
Maybe
is 90044 the last block before the chain halt? if so, it's no good
if you could roll back, then you wouldn't have the problem block. and when we have v0.32.0, we'll be able to roll back
it is, but that block never finished I think? or is the issue it did and it's wrong?
there's a rollback command in cli that apparently works to roll back a single block....
Yes, takes about half an hour depending on your machine
Wait, the rollback in 0.31.6 is no good?
thanks! yes i will announce that
let me ask
Ok, so does this sound like a good path to prepare?
- stay on 0.31.6
- start sync from scratch or from snapshot
- stop sync before reaching 90044
- update to 0.32
- finalize sync to the tip
- wait for consensus and VP to increase
Am I right?
i'm gonna lock this thread in a minute, wanna make sure everyone in the other thread is in the loop
aren't we upgrading to 31.9? not 32?
why did I read it as "punished"?
this should work, except for the update version, which is the 0.31.9 prerelease (not v0.32.0)
make sure to use whatever command stops the chain before the latest block (people have said that latest height is 90044)
it's just time for you to get some sleep.)
k gonna lock, see y'all in https://discord.com/channels/833618405537218590/1215348664554360833