#@argonblue Please give me a ping, I'd
1 messages · Page 1 of 1 (latest)
Did you see the commit that resolves the user/IRQ race?
i looked briefly at it, and saw it had a number of problems, so i was confused as to how it could help
What were the problems that you saw?
it stores the interrupt state in a shared variable. the critical section entry can be called multiple times in a call chain, and isn’t reentrant. etc
I did think about that. In this particular instance it's not possible to attempt to enter the critical section more than once in user state. IRQ level entry while user state is in the critical section is also not possible for the obvious reason that interrupts are disabled. Updating of the shared variable happens inside the critical section.
i'd feel better if the interrupt state ep->cs were instead in a local variable in each caller, so there's no question about it being accessed from multiple places at once, and it's also automatically reentrant that way
(well, no way short of being hit by multiple cores at once)
OK, I can do that. For multiple cores, a spin lock will be needed, but then we get into some deadlock cases that need to be considered.
we could use the TinyUSB OSAL, or maybe something pico-specific, in the multicore case
Yes, that would be pico-specific spin lock hardware.
For pico, TinyUSB OSAL is implemented using the critical_section_xxx family of SDK routines. These routines utilize the hardware spinlocks as described in 2.3.1.3. Hardware Spinlocks in the RP2040 datasheet.
ah, there's also pico-sdk mutexes and semaphores implemented on top of those, which OSAL also uses
Since all we're after here is mutual exclusion, I think the simpler critical_section_xxx functions will be what we want.
CircuitPython is single-threaded, but what prevents the ISR from running on a different core?
oh, each core has an independent NVIC and masks? so if core 1 never unmasks interrupts, it'll never execute any ISRs?
That's a good question. The RP2040 implements an NVIC per core, so the question is: how does CP manage the NVICs?
ah, the boot ROM has Core 1 sitting in WFE waiting for Core 0 to signal it to launch code, so if CircuitPython never launches anything on Core 1, it probably keeps sleeping
Agreed on boot ROM operation. Further, dcd_rp2040 and hcd_rp2040 manage the NVIC's masking for USB IP core interrupts in such a way that multicores could break it badly.
But, with only a single core in use as it is today, only the NVIC on the active core will ever have USB interrupts enabled.
the other thing i'm not sure about your fix: before my patch, the ISR doesn't really do much for a SETUP received interrupt, other than resetting the data toggle and queueing an event for tud_task. so i'm not sure why disabling interrupts helps
Yes, that's the heart of it. The unprotected resources that get clobbered are the endpoint's control and buffer control registers.
but the ISR SETUP handler didn't touch those registers until my fix. it only reset the data toggle in ep->next_pid
This should be a fix for TinyUSB in general, not just CircuitPython, so you'll need to consider other dual-core uses
TinyUSB does not handle multiple RP2040 cores.
true, but TinyUSB already fails to handle the dual-core case in multiple ways, so that's a fairly big undertaking
i didn't know that - thanks
@valid sundial the main thing about your fix that might make things (accidentally) work is by delaying the ISR entry until a completion interrupt for the status IN transaction occurs immediately following the SETUP packet. the ISR prioritizes endpoint completion above SETUP, so the aborted control transfer gets its transaction completion in the following transfer instead
I'm not sure that's right. I can find numerous paths, both IRQ and user-level, that modify endpoint buffer control registers. Since the bug is easy to reproduce, I'll add some tracing to a ring-buffer that should show which analysis is correct.
oh, i can see how it might fix the !ep->active panic, because the ISR does touch ep in that state, and prior to your fix, there are no memory or optimization barriers enforcing that ep->active gets set before the USB buffer control gets written to
@valid sundial your updated branch looks better, thanks! did you get a chance to try adding some tracing to see more details of what's going on?
Thanks for having a look. I made some progress on tracing with a working trace mechanism implemented, but other stuff came up in real life so I won't get back to it until Monday.
By the way, TinyUSB issue #2322 looks like it is related.
Operating System Linux Board Raspberry Pi PICO (RP2040) Firmware https://github.com/byteit101/test-tinyusb-pico What happened ? panic/assert/exit calls sometimes, but not always, on writes (I think...
huh, that does look similar. maybe Linux also aborts control transfers sometimes?
or it could be the task/IRQ race
also, running control transfer transactions from tasks means it's more likely to race with an incoming SETUP packet. maybe control transactions should also cancel if they check the interrupt flags and see a SETUP has been received?
Interesting thought. Races abound, but ultimately the bus serializes. The driving code should do likewise.
there would still be a race, but shorter. probably the right thing to do is on SETUP, have the hardware initiate an abort that overrides the buffer controls
I'm limiting my trace entries to 64 bytes, so it's not practical to capture the entire USB IP core register state for each trace entry. What do you consider to be the most useful registers to trace?
Likewise, what fields from the ep struct will be of most use?
INTS will show whether a SETUP has arrived while the previous control transfer is being handled
BUFF_STATUS is also useful to know whether a transaction intended for an aborted control transfer has completed anyway
basically, there are some unknowns about what exactly the hardware is doing when it receives an unexpected SETUP
EP0 IN and OUT buffer control (lower 16 bits of each is enough because TinyUSB doesn't double buffer EP0)
Thanks, I'll add those. I'll be AFK most of today, but will check back in later.
I'm updating the trace now. Let me know if you think of any other registers/fields of interest. The trace tries to disturb as little as possible, so it does not print or access any other I/O. To use it I examine the raw buffer using gdb.
Something that appears not quite right: In hw_endpoint_init the pointers to the endpoint's control and buffer control registers are initialized. Here's the code that initializes endpoint's buffer control register:
if ( dir == TUSB_DIR_IN )
{
ep->buffer_control = &usb_dpram->ep_buf_ctrl[num].in;
}
else
{
ep->buffer_control = &usb_dpram->ep_buf_ctrl[num].out;
}
The thing is that according to the RP2040 datasheet only the .in register is implemented, the .out register is labeled Spare.
The endpoint control register appears to also be setup incorrectly.
Please have a look and let me know what you think.
are you looking at the device or host side of that table?
Device.
"Table 394. DPSRAM layout"? the third column with the "Spare" cells is host; the second column is device
You're right. Thanks for taking a look.
@grand ivynblue I see that you and hatach are hashing this out in your pull request, so I'm going to leave it in your capable hands. I do have trace data to share that shows an out-of-order OUT on EP 80 and then the USB core hanging up with '0x0000f400` stuck in its buffer control after the failing SETUP. I can send you the trace if you'd like.
thanks, that would be great! note you’re occasionally pinging the wrong person
Hrm, that's weird. Who am I pinging?
@Argo
Ah-ha, that looks my typo. Thanks.
I find the tab completion in discord for tagging to be unpredictable sometimes
@hazy badger The trace code is in this branch, https://github.com/eightycc/tinyusb/tree/issue_8824_trace
The heart of it is in rp2040_usb.h . Immediately after the trace structs are #defines that enable it and set the trace buffer size in 64-byte entries. The final two #defines make each of our patches conditional.
Trace entries are in pairs. I've annotated the trace a bit.
That last bad ping was not me. I didn't use tab completion, but Discord added the space after I hit enter.
Note that at offset +0x10 in each trace entry are some 1 byte fields from the ep struct. Because the trace is dumped as 32-bit little-endian words, gdb reverses the byte order of word.
@hazy badger You'll likely want some context. Here's the Beagle trace that corresponds to the trace.
thanks! panic actually starts at 0x2001b1c0 which is the trace record of the first EP2 transaction containing CDC traffic of the string \r\n*** PANIC ***
this corresponds to tud_task running the deferred handler for the second SETUP request for CDC Set Control Line State, which attempts to enable a status IN transaction on EP0, which was already enabled by the previous SETUP, and panics
this is pretty close to confirming that the USB peripheral doesn't do anything to the buffer control state when receiving a SETUP packet
note that neither of the two Set Control Line State requests proceeds to a status stage. the host never polls EP0 IN for either request
the panic message gets as far as printing ep (with a trailing space). the host stops polling EP2 IN shortly afterward, which might be the kernel serial driver finally completing the port close
also note that you need to look at the timestamps very carefully in Data Center when in Class view; things are chronologically out of order there