@argonblue Please give me a ping, I'd | adafruit | Page 1

hazy badger Feb 29, 2024, 7:49 PM

#

sure, what's up?

valid sundial Feb 29, 2024, 7:53 PM

#

Did you see the commit that resolves the user/IRQ race?

hazy badger Feb 29, 2024, 8:07 PM

#

i looked briefly at it, and saw it had a number of problems, so i was confused as to how it could help

valid sundial Feb 29, 2024, 8:11 PM

#

What were the problems that you saw?

hazy badger Feb 29, 2024, 8:20 PM

#

it stores the interrupt state in a shared variable. the critical section entry can be called multiple times in a call chain, and isn’t reentrant. etc

valid sundial Feb 29, 2024, 8:28 PM

#

I did think about that. In this particular instance it's not possible to attempt to enter the critical section more than once in user state. IRQ level entry while user state is in the critical section is also not possible for the obvious reason that interrupts are disabled. Updating of the shared variable happens inside the critical section.

hazy badger Feb 29, 2024, 8:37 PM

#

i'd feel better if the interrupt state ep->cs were instead in a local variable in each caller, so there's no question about it being accessed from multiple places at once, and it's also automatically reentrant that way

#

(well, no way short of being hit by multiple cores at once)
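The suggestion above, keeping the saved interrupt state in a caller-local variable rather than in a shared `ep->cs` field, can be sketched roughly as follows. This is a host-runnable simulation, not TinyUSB or pico-sdk code: `sim_irq_enabled`, `sim_irq_save`, `sim_irq_restore`, `inner_update` and `outer_update` are hypothetical names, and a plain `bool` stands in for the core's interrupt mask.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated per-core interrupt enable flag (assumption: stands in for the real mask). */
static bool sim_irq_enabled = true;
static int shared_counter = 0;

/* Disable "interrupts" and hand the previous state back to the caller. */
bool sim_irq_save(void) {
    bool was_enabled = sim_irq_enabled;
    sim_irq_enabled = false;
    return was_enabled;
}

/* Restore whatever state the matching sim_irq_save() observed. */
void sim_irq_restore(bool was_enabled) {
    sim_irq_enabled = was_enabled;
}

/* Inner critical section: may be entered with interrupts already disabled. */
void inner_update(void) {
    bool state = sim_irq_save();   /* local copy, so nesting cannot clobber it */
    shared_counter++;
    sim_irq_restore(state);        /* restores the caller's state, not "enabled" */
}

/* Outer critical section that calls the inner one. */
void outer_update(void) {
    bool state = sim_irq_save();
    inner_update();
    assert(!sim_irq_enabled);      /* still masked after the nested exit */
    shared_counter++;
    sim_irq_restore(state);
}
```

Because each caller restores only the state it saved itself, the nested exit leaves interrupts masked until the outermost caller restores them. As noted above, this does nothing for the multicore case.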

valid sundial Feb 29, 2024, 8:39 PM

#

OK, I can do that. For multiple cores, a spin lock will be needed, but then we get into some deadlock cases that need to be considered.

hazy badger Feb 29, 2024, 8:39 PM

#

we could use the TinyUSB OSAL, or maybe something pico-specific, in the multicore case

valid sundial Feb 29, 2024, 8:40 PM

#

Yes, that would be pico-specific spin lock hardware.

valid sundial Feb 29, 2024, 8:58 PM

#

For pico, TinyUSB OSAL is implemented using the critical_section_xxx family of SDK routines. These routines use the hardware spinlocks described in section 2.3.1.3, "Hardware Spinlocks", of the RP2040 datasheet.

hazy badger Feb 29, 2024, 9:07 PM

#

ah, there's also pico-sdk mutexes and semaphores implemented on top of those, which OSAL also uses

valid sundial Feb 29, 2024, 9:10 PM

#

Since all we're after here is mutual exclusion, I think the simpler critical_section_xxx functions will be what we want.

hazy badger Feb 29, 2024, 9:14 PM

#

CircuitPython is single-threaded, but what prevents the ISR from running on a different core?

#

oh, each core has an independent NVIC and masks? so if core 1 never unmasks interrupts, it'll never execute any ISRs?

valid sundial Feb 29, 2024, 9:22 PM

#

That's a good question. The RP2040 implements an NVIC per core, so the question is: how does CP manage the NVICs?

hazy badger Feb 29, 2024, 9:25 PM

#

ah, the boot ROM has Core 1 sitting in WFE waiting for Core 0 to signal it to launch code, so if CircuitPython never launches anything on Core 1, it probably keeps sleeping

valid sundial Feb 29, 2024, 9:35 PM

#

Agreed on boot ROM operation. Further, dcd_rp2040 and hcd_rp2040 manage the NVIC's masking for USB IP core interrupts in such a way that multicore use could break it badly.

#

But, with only a single core in use as it is today, only the NVIC on the active core will ever have USB interrupts enabled.

hazy badger Feb 29, 2024, 9:49 PM

#

the other thing i'm not sure about with your fix: before my patch, the ISR doesn't really do much for a SETUP received interrupt, other than resetting the data toggle and queueing an event for tud_task. so i'm not sure why disabling interrupts helps

valid sundial Feb 29, 2024, 9:56 PM

#

Yes, that's the heart of it. The unprotected resources that get clobbered are the endpoint's control and buffer control registers.

hazy badger Feb 29, 2024, 9:57 PM

#

but the ISR SETUP handler didn't touch those registers until my fix. it only reset the data toggle in ep->next_pid

sudden kestrel Feb 29, 2024, 9:57 PM

#

This should be a fix for TinyUSB in general, not just CircuitPython, so you'll need to consider other dual-core uses

valid sundial Feb 29, 2024, 9:58 PM

#

TinyUSB does not handle multiple RP2040 cores.

hazy badger Feb 29, 2024, 9:58 PM

#

true, but TinyUSB already fails to handle the dual-core case in multiple ways, so that's a fairly big undertaking

sudden kestrel Feb 29, 2024, 9:58 PM

#

i didn't know that - thanks

hazy badger Feb 29, 2024, 10:00 PM

#

@valid sundial the main thing about your fix that might make things (accidentally) work is that it delays ISR entry until a completion interrupt for the status IN transaction occurs immediately following the SETUP packet. the ISR prioritizes endpoint completion above SETUP, so the aborted control transfer gets its transaction completion in the following transfer instead

valid sundial Feb 29, 2024, 10:27 PM

#

hazy badger @valid sundial the main thing about your fix that might make things (acci...

I'm not sure that's right. I can find numerous paths, both IRQ and user-level, that modify endpoint buffer control registers. Since the bug is easy to reproduce, I'll add some tracing to a ring-buffer that should show which analysis is correct.

hazy badger Feb 29, 2024, 10:28 PM

#

oh, i can see how it might fix the !ep->active panic, because the ISR does touch ep in that state, and prior to your fix, there are no memory or optimization barriers enforcing that ep->active gets set before the USB buffer control gets written to
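The ordering concern above, that `ep->active` must be set before the USB buffer control is written, can be illustrated with a small sketch. It is a host-side simulation under stated assumptions: `fake_ep_t`, `fake_buf_ctrl` and `fake_ep_start` are hypothetical names, and the C11 `atomic_signal_fence` used here constrains compiler reordering only. It is not what either patch does, and real hardware may need more than this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a memory-mapped buffer control register (host simulation). */
static volatile uint32_t fake_buf_ctrl = 0;

/* Hypothetical, trimmed-down endpoint state. */
typedef struct {
    bool active;
    volatile uint32_t *buffer_control;
} fake_ep_t;

/* Mark the endpoint active, then arm the buffer, in that order. */
void fake_ep_start(fake_ep_t *ep, uint32_t ctrl_value) {
    assert(ep->buffer_control);
    ep->active = true;
    /* Compiler-only fence: keeps the plain store above from being moved
       past the volatile register write below. */
    atomic_signal_fence(memory_order_seq_cst);
    *ep->buffer_control = ctrl_value;
}
```

Without some barrier, the compiler is free to move the plain store to `active` after the volatile write, so an ISR that fires right after the buffer is armed could still see the endpoint as inactive.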

hazy badger Mar 3, 2024, 4:40 PM

#

@valid sundial your updated branch looks better, thanks! did you get a chance to try adding some tracing to see more details of what's going on?

valid sundial Mar 3, 2024, 4:58 PM

#

hazy badger @valid sundial your updated branch looks better, thanks! did you get a ch...

Thanks for having a look. I made some progress on tracing with a working trace mechanism implemented, but other stuff came up in real life so I won't get back to it until Monday.

#

By the way, TinyUSB issue #2322 looks like it is related.

#

https://github.com/hathach/tinyusb/issues/2322

GitHub

USB MSC has some sort of race condition · Issue #2322 · hathach/tin...

Operating System Linux Board Raspberry Pi PICO (RP2040) Firmware https://github.com/byteit101/test-tinyusb-pico What happened ? panic/assert/exit calls sometimes, but not always, on writes (I think...

hazy badger Mar 3, 2024, 5:18 PM

#

huh, that does look similar. maybe Linux also aborts control transfers sometimes?

#

or it could be the task/IRQ race

#

also, running control transfer transactions from tasks means it's more likely to race with an incoming SETUP packet. maybe control transactions should also cancel if they check the interrupt flags and see a SETUP has been received?

valid sundial Mar 3, 2024, 5:28 PM

#

Interesting thought. Races abound, but ultimately the bus serializes. The driving code should do likewise.

hazy badger Mar 3, 2024, 5:30 PM

#

there would still be a race, but shorter. probably the right thing to do is on SETUP, have the hardware initiate an abort that overrides the buffer controls

valid sundial Mar 3, 2024, 5:33 PM

#

I'm limiting my trace entries to 64 bytes, so it's not practical to capture the entire USB IP core register state for each trace entry. What do you consider to be the most useful registers to trace?

#

Likewise, what fields from the ep struct will be of most use?

hazy badger Mar 3, 2024, 5:39 PM

#

INTS will show whether a SETUP has arrived while the previous control transfer is being handled

#

BUFF_STATUS is also useful to know whether a transaction intended for an aborted control transfer has completed anyway

#

basically, there are some unknowns about what exactly the hardware is doing when it receives an unexpected SETUP

#

EP0 IN and OUT buffer control (lower 16 bits of each is enough because TinyUSB doesn't double buffer EP0)
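A 64-byte trace record holding the registers listed above might look something like the sketch below. The layout, the field names, `TRACE_ENTRIES` and `trace_record` are all hypothetical and are not taken from the actual trace branch. The register values are passed in as arguments so the sketch runs on a host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TRACE_ENTRIES 8u   /* hypothetical size; must be a power of two */

/* Hypothetical 64-byte trace record; not the layout used in the real branch. */
typedef struct {
    uint32_t seq;              /* monotonically increasing entry number */
    uint32_t ints;             /* snapshot of INTS */
    uint32_t buff_status;      /* snapshot of BUFF_STATUS */
    uint16_t ep0_in_buf_ctrl;  /* low 16 bits of EP0 IN buffer control */
    uint16_t ep0_out_buf_ctrl; /* low 16 bits of EP0 OUT buffer control */
    uint8_t  pad[48];          /* space for ep struct fields, pads to 64 bytes */
} trace_entry_t;

static_assert(sizeof(trace_entry_t) == 64, "trace entry must be 64 bytes");

static trace_entry_t trace_buf[TRACE_ENTRIES];
static uint32_t trace_next = 0;

/* Record one entry; the oldest entry is overwritten once the buffer wraps. */
void trace_record(uint32_t ints, uint32_t buff_status,
                  uint32_t ep0_in_ctrl, uint32_t ep0_out_ctrl) {
    trace_entry_t *e = &trace_buf[trace_next & (TRACE_ENTRIES - 1u)];
    memset(e, 0, sizeof *e);
    e->seq = trace_next++;
    e->ints = ints;
    e->buff_status = buff_status;
    e->ep0_in_buf_ctrl = (uint16_t)(ep0_in_ctrl & 0xFFFFu);
    e->ep0_out_buf_ctrl = (uint16_t)(ep0_out_ctrl & 0xFFFFu);
}
```

Truncating the EP0 buffer control values to 16 bits follows the point above that TinyUSB does not double buffer EP0, which leaves more of the 64 bytes for fields from the ep struct.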

valid sundial Mar 3, 2024, 5:42 PM

#

Thanks, I'll add those. I'll be AFK most of today, but will check back in later.

valid sundial Mar 4, 2024, 4:58 PM

#

hazy badger INTS will show whether a SETUP has arrived while the previous control transfer i...

I'm updating the trace now. Let me know if you think of any other registers/fields of interest. The trace tries to disturb as little as possible, so it does not print or access any other I/O. To use it I examine the raw buffer using gdb.

valid sundial Mar 4, 2024, 6:17 PM

#

hazy badger INTS will show whether a SETUP has arrived while the previous control transfer i...

Something that appears not quite right: in hw_endpoint_init, the pointers to the endpoint's control and buffer control registers are initialized. Here's the code that initializes the endpoint's buffer control register:

if ( dir == TUSB_DIR_IN )
{
  ep->buffer_control = &usb_dpram->ep_buf_ctrl[num].in;
}
else
{
  ep->buffer_control = &usb_dpram->ep_buf_ctrl[num].out;
}

#

The thing is that according to the RP2040 datasheet only the .in register is implemented, the .out register is labeled Spare.

#

The endpoint control register also appears to be set up incorrectly.

#

Please have a look and let me know what you think.

hazy badger Mar 4, 2024, 6:25 PM

#

valid sundial The thing is that according to the RP2040 datasheet only the `.in` register is i...

are you looking at the device or host side of that table?

valid sundial Mar 4, 2024, 6:35 PM

#

Device.

hazy badger Mar 4, 2024, 6:43 PM

#

"Table 394. DPSRAM layout"? the third column with the "Spare" cells is host; the second column is device

valid sundial Mar 4, 2024, 6:46 PM

#

You're right. Thanks for taking a look.

valid sundial Mar 5, 2024, 5:10 PM

#

@grand ivynblue I see that you and hathach are hashing this out in your pull request, so I'm going to leave it in your capable hands. I do have trace data to share that shows an out-of-order OUT on EP 80 and then the USB core hanging up with `0x0000f400` stuck in its buffer control after the failing SETUP. I can send you the trace if you'd like.
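For reading a raw buffer control value such as the one quoted above, a small decoder can help. The bit masks below are written from memory of the pico-sdk `USB_BUF_CTRL_*` definitions and should be treated as an assumption to verify against the RP2040 datasheet. `decode_buf_ctrl` and `buf_ctrl_fields_t` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the low half of a buffer control register. */
#define BUF_CTRL_FULL      0x8000u
#define BUF_CTRL_LAST      0x4000u
#define BUF_CTRL_DATA1_PID 0x2000u
#define BUF_CTRL_SEL       0x1000u
#define BUF_CTRL_STALL     0x0800u
#define BUF_CTRL_AVAIL     0x0400u
#define BUF_CTRL_LEN_MASK  0x03FFu

typedef struct {
    bool full, last, data1_pid, sel, stall, avail;
    uint16_t len;
} buf_ctrl_fields_t;

/* Split a raw buffer control value into its individual flags and length. */
buf_ctrl_fields_t decode_buf_ctrl(uint32_t raw) {
    buf_ctrl_fields_t f;
    f.full      = (raw & BUF_CTRL_FULL) != 0;
    f.last      = (raw & BUF_CTRL_LAST) != 0;
    f.data1_pid = (raw & BUF_CTRL_DATA1_PID) != 0;
    f.sel       = (raw & BUF_CTRL_SEL) != 0;
    f.stall     = (raw & BUF_CTRL_STALL) != 0;
    f.avail     = (raw & BUF_CTRL_AVAIL) != 0;
    f.len       = (uint16_t)(raw & BUF_CTRL_LEN_MASK);
    return f;
}
```

Under those assumed masks, `0x0000f400` has the FULL, LAST, DATA1, SEL and AVAILABLE bits set, STALL clear, and a length of 0.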

hazy badger Mar 5, 2024, 5:14 PM

#

valid sundial @grand ivynblue I see that you and hathach are hashing this out in you...

thanks, that would be great! note you’re occasionally pinging the wrong person

valid sundial Mar 5, 2024, 5:15 PM

#

Hrm, that's weird. Who am I pinging?

sudden kestrel Mar 5, 2024, 5:19 PM

#

valid sundial Hrm, that's weird. Who am I pinging?

@Argo

valid sundial Mar 5, 2024, 5:20 PM

#

Ah-ha, that looks like my typo. Thanks.

sudden kestrel Mar 5, 2024, 5:22 PM

#

I find the tab completion in discord for tagging to be unpredictable sometimes

valid sundial Mar 5, 2024, 5:34 PM

#

@hazy badger The trace code is in this branch, https://github.com/eightycc/tinyusb/tree/issue_8824_trace
The heart of it is in rp2040_usb.h. Immediately after the trace structs are #defines that enable it and set the trace buffer size in 64-byte entries. The final two #defines make each of our patches conditional.
Trace entries are in pairs. I've annotated the trace a bit.

📎 usb_trace.txt

GitHub

GitHub - eightycc/tinyusb at issue_8824_trace

An open source cross-platform USB stack for embedded system - GitHub - eightycc/tinyusb at issue_8824_trace

#

That last bad ping was not me. I didn't use tab completion, but Discord added the space after I hit enter.

#

Note that at offset +0x10 in each trace entry are some 1-byte fields from the ep struct. Because the trace is dumped as 32-bit little-endian words, gdb shows the bytes within each word in reverse order.
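The byte-order note above can be made concrete with a short sketch. The helper names `le_word_from_bytes` and `byte_at_offset` are hypothetical, and the byte values used in the note below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Value printed for four consecutive bytes read as one little-endian word. */
uint32_t le_word_from_bytes(const uint8_t b[4]) {
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Recover the byte stored at memory offset idx (0..3) from the printed word. */
uint8_t byte_at_offset(uint32_t word, unsigned idx) {
    assert(idx < 4);
    return (uint8_t)(word >> (8u * idx));
}
```

For example, bytes `01 02 03 04` stored in ascending address order are printed as the word `0x04030201`, so the field at the lowest address is the rightmost pair of hex digits.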

valid sundial Mar 5, 2024, 7:06 PM

#

@hazy badger You'll likely want some context. Here's the Beagle trace that corresponds to the firmware trace in usb_trace.txt.

📎 trace.tdc

hazy badger Mar 5, 2024, 7:30 PM

#

valid sundial @hazy badger You'll likely want some context. Here's the Beagle trace ...

thanks! panic actually starts at 0x2001b1c0 which is the trace record of the first EP2 transaction containing CDC traffic of the string \r\n*** PANIC ***

#

this corresponds to tud_task running the deferred handler for the second SETUP request for CDC Set Control Line State, which attempts to enable a status IN transaction on EP0, which was already enabled by the previous SETUP, and panics

#

this is pretty close to confirming that the USB peripheral doesn't do anything to the buffer control state when receiving a SETUP packet

#

note that neither of the two Set Control Line State requests proceeds to a status stage. the host never polls EP0 IN for either request

#

the panic message gets as far as printing ep (with a trailing space). the host stops polling EP2 IN shortly afterward, which might be the kernel serial driver finally completing the port close

hazy badger Mar 5, 2024, 8:05 PM

#

also note that you need to look at the timestamps very carefully in Data Center when in Class view; things are chronologically out of order there
