#@argonblue Please give me a ping, I'd

1 messages · Page 1 of 1 (latest)

hazy badger
#

sure, what's up?

valid sundial
#

Did you see the commit that resolves the user/IRQ race?

hazy badger
#

i looked briefly at it, and saw it had a number of problems, so i was confused as to how it could help

valid sundial
#

What were the problems that you saw?

hazy badger
#

it stores the interrupt state in a shared variable. the critical section entry can be called multiple times in a call chain, and isn’t reentrant. etc

valid sundial
#

I did think about that. In this particular instance it's not possible to attempt to enter the critical section more than once in user state. IRQ level entry while user state is in the critical section is also not possible for the obvious reason that interrupts are disabled. Updating of the shared variable happens inside the critical section.

hazy badger
#

i'd feel better if the interrupt state ep->cs were instead in a local variable in each caller, so there's no question about it being accessed from multiple places at once, and it's also automatically reentrant that way

#

(well, no way short of being hit by multiple cores at once)

valid sundial
#

OK, I can do that. For multiple cores, a spin lock will be needed, but then we get into some deadlock cases that need to be considered.

hazy badger
#

we could use the TinyUSB OSAL, or maybe something pico-specific, in the multicore case

valid sundial
#

Yes, that would be pico-specific spin lock hardware.

valid sundial
#

For pico, TinyUSB OSAL is implemented using the critical_section_xxx family of SDK routines. These routines utilize the hardware spinlocks as described in 2.3.1.3. Hardware Spinlocks in the RP2040 datasheet.

hazy badger
#

ah, there's also pico-sdk mutexes and semaphores implemented on top of those, which OSAL also uses

valid sundial
#

Since all we're after here is mutual exclusion, I think the simpler critical_section_xxx functions will be what we want.

hazy badger
#

CircuitPython is single-threaded, but what prevents the ISR from running on a different core?

#

oh, each core has an independent NVIC and masks? so if core 1 never unmasks interrupts, it'll never execute any ISRs?

valid sundial
#

That's a good question. The RP2040 implements an NVIC per core, so the question is: how does CP manage the NVICs?

hazy badger
#

ah, the boot ROM has Core 1 sitting in WFE waiting for Core 0 to signal it to launch code, so if CircuitPython never launches anything on Core 1, it probably keeps sleeping

valid sundial
#

Agreed on boot ROM operation. Further, dcd_rp2040 and hcd_rp2040 manage the NVIC's masking for USB IP core interrupts in such a way that multicores could break it badly.

#

But, with only a single core in use as it is today, only the NVIC on the active core will ever have USB interrupts enabled.

hazy badger
#

the other thing i'm not sure about your fix: before my patch, the ISR doesn't really do much for a SETUP received interrupt, other than resetting the data toggle and queueing an event for tud_task. so i'm not sure why disabling interrupts helps

valid sundial
#

Yes, that's the heart of it. The unprotected resources that get clobbered are the endpoint's control and buffer control registers.

hazy badger
#

but the ISR SETUP handler didn't touch those registers until my fix. it only reset the data toggle in ep->next_pid

sudden kestrel
#

This should be a fix for TinyUSB in general, not just CircuitPython, so you'll need to consider other dual-core uses

valid sundial
#

TinyUSB does not handle multiple RP2040 cores.

hazy badger
#

true, but TinyUSB already fails to handle the dual-core case in multiple ways, so that's a fairly big undertaking

sudden kestrel
#

i didn't know that - thanks

hazy badger
#

@valid sundial the main thing about your fix that might make things (accidentally) work is by delaying the ISR entry until a completion interrupt for the status IN transaction occurs immediately following the SETUP packet. the ISR prioritizes endpoint completion above SETUP, so the aborted control transfer gets its transaction completion in the following transfer instead

valid sundial
hazy badger
#

oh, i can see how it might fix the !ep->active panic, because the ISR does touch ep in that state, and prior to your fix, there are no memory or optimization barriers enforcing that ep->active gets set before the USB buffer control gets written to

hazy badger
#

@valid sundial your updated branch looks better, thanks! did you get a chance to try adding some tracing to see more details of what's going on?

valid sundial
#

By the way, TinyUSB issue #2322 looks like it is related.

hazy badger
#

huh, that does look similar. maybe Linux also aborts control transfers sometimes?

#

or it could be the task/IRQ race

#

also, running control transfer transactions from tasks means it's more likely to race with an incoming SETUP packet. maybe control transactions should also cancel if they check the interrupt flags and see a SETUP has been received?

valid sundial
#

Interesting thought. Races abound, but ultimately the bus serializes. The driving code should do likewise.

hazy badger
#

there would still be a race, but shorter. probably the right thing to do is on SETUP, have the hardware initiate an abort that overrides the buffer controls

valid sundial
#

I'm limiting my trace entries to 64 bytes, so it's not practical to capture the entire USB IP core register state for each trace entry. What do you consider to be the most useful registers to trace?

#

Likewise, what fields from the ep struct will be of most use?

hazy badger
#

INTS will show whether a SETUP has arrived while the previous control transfer is being handled

#

BUFF_STATUS is also useful to know whether a transaction intended for an aborted control transfer has completed anyway

#

basically, there are some unknowns about what exactly the hardware is doing when it receives an unexpected SETUP

#

EP0 IN and OUT buffer control (lower 16 bits of each is enough because TinyUSB doesn't double buffer EP0)

valid sundial
#

Thanks, I'll add those. I'll be AFK most of today, but will check back in later.

valid sundial
valid sundial
#

The thing is that according to the RP2040 datasheet only the .in register is implemented, the .out register is labeled Spare.

#

The endpoint control register appears to also be setup incorrectly.

#

Please have a look and let me know what you think.

hazy badger
valid sundial
#

Device.

hazy badger
#

"Table 394. DPSRAM layout"? the third column with the "Spare" cells is host; the second column is device

valid sundial
#

You're right. Thanks for taking a look.

valid sundial
#

@grand ivynblue I see that you and hatach are hashing this out in your pull request, so I'm going to leave it in your capable hands. I do have trace data to share that shows an out-of-order OUT on EP 80 and then the USB core hanging up with '0x0000f400` stuck in its buffer control after the failing SETUP. I can send you the trace if you'd like.

hazy badger
valid sundial
#

Hrm, that's weird. Who am I pinging?

sudden kestrel
valid sundial
#

Ah-ha, that looks my typo. Thanks.

sudden kestrel
#

I find the tab completion in discord for tagging to be unpredictable sometimes

valid sundial
#

@hazy badger The trace code is in this branch, https://github.com/eightycc/tinyusb/tree/issue_8824_trace
The heart of it is in rp2040_usb.h . Immediately after the trace structs are #defines that enable it and set the trace buffer size in 64-byte entries. The final two #defines make each of our patches conditional.
Trace entries are in pairs. I've annotated the trace a bit.

GitHub

An open source cross-platform USB stack for embedded system - GitHub - eightycc/tinyusb at issue_8824_trace

#

That last bad ping was not me. I didn't use tab completion, but Discord added the space after I hit enter.

#

Note that at offset +0x10 in each trace entry are some 1 byte fields from the ep struct. Because the trace is dumped as 32-bit little-endian words, gdb reverses the byte order of word.

valid sundial
#

@hazy badger You'll likely want some context. Here's the Beagle trace that corresponds to the trace.

hazy badger
#

this corresponds to tud_task running the deferred handler for the second SETUP request for CDC Set Control Line State, which attempts to enable a status IN transaction on EP0, which was already enabled by the previous SETUP, and panics

#

this is pretty close to confirming that the USB peripheral doesn't do anything to the buffer control state when receiving a SETUP packet

#

note that neither of the two Set Control Line State requests proceeds to a status stage. the host never polls EP0 IN for either request

#

the panic message gets as far as printing ep (with a trailing space). the host stops polling EP2 IN shortly afterward, which might be the kernel serial driver finally completing the port close

hazy badger
#

also note that you need to look at the timestamps very carefully in Data Center when in Class view; things are chronologically out of order there