#Bug in SDK/env triggers panics on try calls (?)

96 messages · Page 1 of 1 (latest)

sacred spade
#

It seems like try_ calls are panicking anyways in certain scenarios. For example, this works as expected:

mod hello {
    soroban_sdk::contractimport!(
        file = "./target/wasm32-unknown-unknown/release/soroban_hello_world_contract.wasm"
    );
}

#[test]
fn test() {
    let env = Env::default();
    //    env.mock_all_auths();

    let contract_id = env.register_contract_wasm(None, hello::WASM);

    env.try_invoke_contract::<(), soroban_sdk::Error>(
        &contract_id,
        &Symbol::new(&env, "call"),
        vec![&env, Address::random(&env).to_val()],
    );
}

Where hello is simply requiring auth for the random addr and the auth mock is disabled.

But this will panic with data:["VM call failed: Func(MismatchingParameterLen)", run]:

mod game_broken {
    soroban_sdk::contractimport!(file = "broken.wasm");
}

#[test]
fn test() {
    let env = Env::default();
    //    env.mock_all_auths();

    let broken = env.register_contract_wasm(None, game_broken::WASM);

    env.try_invoke_contract::<(), soroban_sdk::Error>(
        &broken,
        &Symbol::new(&env, "run"),
        vec![&env, Address::random(&env).to_val()],
    );
}

Broken simply looks like this:

#[contractimpl]
impl Contract {
    pub fn run(_env: Env) {}
}

This looks like a bug in the env rather than in the sdk, unless I'm missing something as to why this is expected.

severe thunder
#

In the second example it looks like the function from the contract has no parameters, but in the test a single parameters, the random value, is being passed.

#

I believe that'l lbe why an error is being generated MismatchingParameterLen. The number of parameters don't match.

sacred spade
summer flame
#

Yes this worked prior to 20.0.0-rc

severe thunder
#

Oh right, that's a good point. Does seem like an env bug.

summer flame
#

Worked meaning it returned the error vs panic

severe thunder
#

@hazy bramble Could this be related to the recent error rewrite?

latent yoke
#

some 'system' errors were intentionally made non-recoverable (such as running out of budget). but maybe we're mis-classifying an error as 'non-recoverable' here

severe thunder
#

The only system error that's non-recoverable should be fee/budget related. Try is supposed to catch all others.

#

Actually, footprint+expiry are supposed to trap too. So not just fee/budget.

latent yoke
#

yeah, out-of-footprint access is no longer recoverable. internal env errors are also non-recoverable as we tried to isolate them to 'assertions' (== never happens).

#

but again, it seems like this is the case where we mis-classify something

#

ok, yeah, mismatching number of args results in an internal error, which is not intentional.

latent yoke
hazy bramble
#

wait, is there a specific wasmi error code for arg-count mismatch? I'd think there is

latent yoke
#

maybe? my PR introduces a general fix - I don't think any wasmi error should be treated as 'internal'. but we definitely can refine the 'known' codes further if needed

hazy bramble
#

some wasmi errors should be internal

#

like "we are using wasmi wrong" or "wasmi had an internal error"

latent yoke
#

does wasmi have some indication for that?

hazy bramble
#

yes, quite a lot of the wasmi error codes indicate failure to maintain invariants or misuse of APIs

#

eg. if we get an InvalidMemoryType back it doesn't mean "contract asked for too much memory" it means "the wasmi validator failed to prevent a number >2^16 from making it through validation"

latent yoke
#

then we probably should map them appropriately... if that's too risky I'm fine with making all the wasmi errors recoverable. there is seemingly less risk, as opposed to internal host errors. I would be concerned about leaving the host in a bad state, but VM is a separate instance, so it's not as bad; not sure about the VM caching though.

hazy bramble
#

hmm

#

ok I guess there's the logical question of "is this an internal error" and there's a more practical question of "is this an error that try_call ought to to catch or propagate" and since the VM is dead at the end of try_call anyways, I guess there's no harm in catching all of its errors.

#

it's not quite as precise but we're kinda overloading the concept of internal-ness. there are also a bunch we can refine further than "InvalidAction" (eg. type or arg-length mismatches probably ought to go to Value,UnexpectedLength or so?

latent yoke
#

yeah, the question is 'what should be the catch-all error'. I went for invalid action; we could go back to internal error as well, but make it recoverable. I don't really have a very strong opinion on this, though in most of the other contexts 'internal' errors correspond to the stuff we don't expect to happen (like exceptions in core or soft 'assertions' in host).

#

(I'd also be happy with a more specific name for 'implementation bug' type of errors, but it's probably too late for that)

hazy bramble
#

yeah I don't think there's a lot to be gained by polishing that terminology further. it's sufficient and any further variant would likely be differently-annoying 🙂

#

an argument in favour of lumping wasmi errors together is it makes it less likely we'll get stuck maintaining a fork of wasmi just to "produce the same specific errors" when replaying old stuff

#

i.e. avoid over-specificity on the mapping

#

(I think I've come around to wanting to stick with your patch as-is)

latent yoke
#

yeah, that's what I've been thinking about as well

#

btw are errors only observable if try_call is used? or do we hash them automatically somewhere?

#

if the former is true, then it probably would be safer to actually obfuscate the errors as much as possible and only distinguish between a 'generic' host error and specific contract errors (I would imagine logic that depends on precise host error values would be quite brittle). the specific error should still be in the diagnostic events, of course

#

then we only need to worry about whether the error is recoverable and we can tighten the set of non-recoverable errors to only the stuff we have full control over

#

otherwise the whole 'invoke host function trapped' thing is kind of pointless as one can canonicalize any host error via try_call

hazy bramble
#

sorry I don't know what you mean by "the whole invoke host function trapped thing"

latent yoke
#

we return a single coarse error for any failed soroban transaction as to not canonicalize errors

#

right?

hazy bramble
#

(I'm still thinking about wasmi -- if wasmi fails with an internal error, maybe we do want to unrecoverably fail a user's contract, it means essentially "there was a logic error beyond just one of the contracts having a bug", like they might be getting wrong data that's wrong in a way they can't possibly fix..)

#

oh, you mean the way the transaction result is coarser than the error codes in Error? I think that's somewhat accidental. if there's an Error result from a contract, I believe the whole Error code is hashed into history. lemme check that to be sure.

latent yoke
#

I'm reasonably certain we don't and I thought that's intentional...

hazy bramble
#

hmm no ok if the contract returns Error, it'll escalate that to Result, which will signal the tx to roll back and yeah you'll only see the specific error in the logs not the hashed result.

latent yoke
#

right... and I think that was a conscious choice, at least a while back when this has been discussed with Jon

hazy bramble
#

yeah. I mean we discussed it at two levels, both there and (the place I was thinking about) the fact that Error itself is fairly coarse (only 10x10 possible states besides contract errors)

#

I wasn't sure if we coarsened it further than Error at the tx level, but yeah we do.

latent yoke
#

so anyway, this decision is pretty much meaningless given the existence of try_call in the current form, as one can create a simple wrapper contract that would return errors as successful SCVals. maybe we could live with that inconsistency, but it seems like it's not that hard to remove it

#

(or enrich the transaction result with error and make UX a bit better at a cost of backwards compatibility risks)

hazy bramble
#

well .. Error having nontrivial substructure is a compromise between competing goals. I think the idea is that try_call callers may want to switch on a bit of its content, and/or observe a little bit more than just "something went wrong" in a context other than "you have to view diagnostic logs which might be turned off". Like they might want to surface it or store it or pass it to a 3rd party.

#

I don't know. I know jon literally wanted there to be no structure to the error codes at all, just "failed"

latent yoke
#

well, it's not clear if matching the host error contents is that useful (in my mind the most usefulness of try_call should come from the granular contract errors, but I could be wrong). like I have hard time imagining someone writing logic contingent on a wasmi failure for a good reason

hazy bramble
#

another thing to keep in mind -- this is implementation-imposed but it's way too late to change -- is that there are a lot of transitions in the system between different subsystems that can fail (manual guest code, in sdk-generated code, env-common codepaths) that don't have access to a thing they can use to inject more detail about a failure into a diagnostic log. so some of the medium-granularity of Error is like "to give people a fighting chance when debugging"

#

the existing categories give you a factor-N (somewhere between 1 and 100) improvement in error localization when trying to debug failures outside a diagnostics-emitting context

#

(that said I wouldn't necessarily mind changing try_call to map all recoverable non-ContractError categories into, say, Context,InvalidAction or whatever

#

)

latent yoke
#

yeah, that's what my suggestion basically is. otherwise we have this weird inconsistency where we kind of have sacrificed the error readability for the sake of better backwards compatibility, except we haven't really reached that goal

#

and we have neither UX, nor flexibility

hazy bramble
#

I'm not sure I agree

#

there are 2 levels of coarsening that have been done

#

for backward compat

#

one is "the error codes themselves are fairly coarse"

#

there's lots of flexibility given by that fact. we used to have many more (and could theoretically have hundreds of very precise codes, people have asked for that)

#

the other level is at the "collapse everything at the tx to a single tx result" level

latent yoke
#

that's fair, but I'm talking about the tx result level

hazy bramble
#

which you're right that someone could defeat by writing a wrapper that turns every call into a try_call

#

but the important question re: back compat is observed behaviour not possible behaviour. it's fairly likely nobody's going to write that, as it'll cost a whole cross-contract call .. for what? why bother?

latent yoke
#

better error reporting? because they can 🙂 ?

hazy bramble
#

if nobody's going to do it, there won't be a lot of instances of the thing you're worried about recorded in the history to worry about maintaining compatibility with

latent yoke
#

IDK really if that's going to happen at meaningful scale, but... who knows I guess

hazy bramble
#

they won't get better error reporting. the errors are already in the diagnostic log. it'll just give them a hash of a value that they have to fish out of the other part of the meta stream anyway.

#

a value they already have, but at higher cost and complexity. why would anyone bother?

#

shrug anyway I'm not arguing we shouldn't do some error-collapsing at try_call. I can see why we might want to for extra flexibility this way.

#

I just don't think it wins a lot either way. I don't think it's a big risk if we don't do it.

latent yoke
#

I guess the difference is in the debug log availability. I mean, you're definitely right from the logical standpoint. not sure about everyone else being logical...

hazy bramble
#

yeah, people will do silly things in contracts for no reason other than misunderstanding, it's fair.

#

I guess I am more arguing that I'd prefer not to completely gut Error (say, reduce it to a unit type) on the principle that "there might be another way to observe its values lurking somewhere"

#

I think it's already fairly coarse, and its non-triviality does buy us some debugging capacity in enough places to warrant keeping it nontrivial

#

(see yesterday where you eyeballed a failure of type WasmVm,InvalidInput that escaped to stellar-core configured for production i.e. with no diagnostic logs)

latent yoke
#

yeah, that's fair. I'm not actually sure if there are such ways besides try_call. just from the consistency standpoint host errors are traps most of the time, which forces host function api to be designed around that. so I think doing the narrowing is ok in general

#

where you eyeballed a failure of type WasmVm,InvalidInput that escaped to stellar-core configured for production
have I? wasn't that -ll debug? I don't think we can externalize it otherwise...

hazy bramble
#

sure. I think we should confirm with @severe thunder but I would be ok with that change.

latent yoke
#

sounds good. let's confirm tomorrow then

hazy bramble
#

stellar-core --ll debug doesn't turn on soroban diagnostic logging

latent yoke
#

right, but it simply logs the host return codes to console IIRC

#

[Tx DEBUG] invocation failed: HostError: Error(WasmVm, InvalidInput)
so no hashing here

hazy bramble
#

right, which is where I saw them, pasted them to console, and you saw them. I'm saying if Error had just been Error() it would have been .. harder to debug there. it's illustrative of the occasional utility of having a little more structure to Error

latent yoke
#

sure, I definitely don't argue about removing the error cases etc - the more debug info, the better.

severe thunder
latent yoke
#

converge all the host errors that try_call returns into a single host error. contract errors remain untouched

severe thunder
#

What's motivating the change?

hazy bramble
#

same basic reason we collapse all errors to "rollback" at tx level: flexibility in the future to not have to maintain as tight bit-for-bit error compatibility.

latent yoke
#

to add to above, that's currently the only way for guest to get a host error (to the best of my knowledge), so it's also pretty inconsistent in this way.

hazy bramble
#

assuming users who want to see details will look in diagnostic log and users who want to catch-and-continue won't have anything much useful to do with the different non-ContractError classes of error coming out of try_call

severe thunder
#

I thought we had discussed it working this way previously. I thought that was the plan. So sgtm.

latent yoke
#

cool. I'll work on a PR tomorrow then.