#Looks like this is going to require


deft linden
#

You might check to see if displayio is spending a lot of time looking up RGB565 values for the source buffer's pixels as it's preparing the 16-bit framebuffer data to be transferred over SPI. In particular, I'd be kinda suspicious about anything that touches displayio.Colorspace or displayio.ColorConverter

junior atlas
#

I'm not using displayio. I'm using adafruit_rgb_display. I'm working on turning it into a solid low level rendering library. My framebuffer is 16-bit already, and for the most part it's being written directly to the display over SPI. Rotation requires some slicing tricks that increase the time cost by around 25ms per frame. Without rotation, it's averaging around 200ms to complete the write operation, for a framerate averaging 5 FPS.

Oh, there is one extra operation that happens on write. Ulab's default endianness is swapped from what the display expects (and unlike Numpy, it has no option for setting the endianness), so I have to do a .byteswap() on the buffer even when it isn't being rotated. I can remove that and swap the bytes as they are rendered to the framebuffer. For more complex rendering, I doubt it will make much difference, but since I'm currently just drawing on a black background (for testing), the vast majority of bytes don't need to be swapped. I doubt the byte swapping is adding much time, but I might be wrong, so I'll try that.

I'll report back on this. If it turns out the byteswap is adding significant time, I can do the swapping at render time instead of when flipping the buffer, though for full screen images it will come out in the wash. This is a place where a C implementation would help, since it would be easier to control the endianness.

#

Yeah, without rotation, it's landing right around 80ms. That's better than double the speed I was getting with the byte swap. It's still pathetically slow, at around 12.5 FPS, though. Now I have to see if I can get the versions with rotation working, as the return types for ulab ndarray functions aren't very consistent. (ndarray.transpose() evidently returns something that doesn't support the buffer protocol...)
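The frame rates quoted so far follow directly from the measured write times. A minimal sketch of that arithmetic (the function name is just for illustration):

```python
def fps_from_frame_time(frame_ms):
    """Return frames per second for a given per-frame write time in milliseconds."""
    return 1000.0 / frame_ms

# Write times reported in the thread: about 200 ms with the byteswap, about 80 ms without.
for label, frame_ms in (("with byteswap", 200.0), ("without byteswap", 80.0)):
    print(label, frame_ms, "ms ->", fps_from_frame_time(frame_ms), "FPS")
```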

junior atlas
#

I'm starting to think hardware rotation might be the best option again. As much as it would be good to be able to accommodate even display controllers without built-in rotation options, the cost of software rotation is just way too high.

The transpose functions, and every other one that would be useful for rotation, return only a view, which can't be treated as a buffer and thus can't be written directly to SPI. The only solution is to copy the buffer, which uses a lot of memory and costs around 100ms extra per frame.
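To illustrate why a rotated frame ends up needing a copy, here is a rough sketch in plain Python (no ulab). A 90 degree rotation reads the source in column order, so the result has to be written into a new contiguous buffer before it can be handed to SPI. The function name `rotate90_rgb565` is hypothetical, and a pure-Python loop like this would be far too slow for real use; it only shows the index math.

```python
def rotate90_rgb565(src, width, height):
    """Rotate a row-major 16-bit framebuffer 90 degrees clockwise into a new buffer.

    src holds width * height pixels, 2 bytes each. The result is a fresh,
    contiguous bytearray whose rows are `height` pixels wide.
    """
    dst = bytearray(len(src))
    for y in range(height):
        for x in range(width):
            # Source pixel (x, y) lands at column (height - 1 - y), row x.
            s = 2 * (y * width + x)
            d = 2 * (x * height + (height - 1 - y))
            dst[d] = src[s]
            dst[d + 1] = src[s + 1]
    return dst

# 3x2 test image with pixel values 1..6 (high byte 0, low byte = value).
src = bytes([0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6])
print(list(rotate90_rgb565(src, 3, 2)))
```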

Hardware rotation would avoid this problem entirely, allowing the driver hardware to do rotation in circuitry designed and optimized specifically for that purpose.

That said, this still only gets me up to 12.5 FPS maximum, so I'll probably have to implement at least the surface buffers and the SPI writes in C to achieve acceptable framerates. I'm not sure when I'll have time to do that. I've only done one C module for CircuitPython, it was a few years ago, and the main thing I remember is that it was a pain. (And I'm not sure I can optimize the SPI writes without having to rewrite each of the drivers in C as well, which gets into platform specific code, which creates even more issues.)

raw void
#

A logic analyzer could tell you how well you are saturating the SPI writes to the display. It may be something obvious like the SPI clock not being as high as you think.

#

It can also be useful for timing particular code paths: set a pin high before the code and low after it, then measure the pulse

junior atlas
#

I can't afford one right now, but that's not a bad idea.

raw void
#

I think there are instructions to use any rp2 as a slow logic analyzer

junior atlas
#

There are. I was just thinking about that earlier. I've got some Pi 3s and a Pi 4 I could probably use that might work even better. It's going to end up being a matter of how much time I can put into this project. I need a logic analyzer anyway though, so I might as well do that and then reevaluate my time once that's done.

deft linden
#

That's super interesting. If it turns out you don't have the time to work on this, maybe you could at least write up a GitHub issue with some of your findings and the outline of a plan for what you think a suitable implementation might look like. Maybe somebody else could pick it up.

#

Also, if you posted some test code and a hardware description, there are a variety of folks hanging around here who have logic analyzers. Maybe somebody could just run some captures and send you screenshots.

junior atlas
# deft linden That's super interesting. If it turns out you don't have the time to work on thi...

Actually, this is a really good idea. It might even be fairly easy for someone with more CP dev experience to port some of the Pygame code I'm modeling this on to CircuitPython. I know at least some parts could be used mostly as-is with very little extra modification. I'll make a writeup of what I was planning and post it as a feature request, with a note that I'm willing to help with it.

I'll also zip up my current code including the test program and post it here tomorrow with the hardware description. Even if no one is willing or able to check it with a logic analyzer, a few more eyes might still be useful.

junior atlas
#

Fruit Jam (https://www.adafruit.com/product/6200)
2.0" 240x320 TFT (https://www.adafruit.com/product/4311)

Using CircuitPython 10.0.3

code.py goes in the CIRCUITPY root

adafruit_rgb_display needs to be installed in lib, and the remaining files in this zip should go in the adafruit_rgb_display directory.

I've commented the code fairly comprehensively. The default SPI SCK and MOSI pins of the Fruit Jam are used, and the DC, RST, and CS pins are as noted in code.py.

If anyone can use a logic analyzer to check the actual speed of the SPI transmissions, to discover if the SPI is operating as expected, that would be appreciated.

Also, I've gone to some effort to optimize the code, but it wouldn't hurt for someone else to look over it and see if there's anything I've missed or overlooked that could be affecting the write speed.

deft linden
#

Looking over your code.py from the zip file, there are a couple obvious things you might try to get more speed:

  1. Cache the functions used inside the while loop as local variables. For example, above the loop do fill = framebuffer.fill then do fill() inside the loop.
  2. Try replacing the % modulo division with a construction using a bitwise and (&) or an if conditional. This might not be the case on RP2350, but sometimes division operations are considerably slower than other integer math operations.

For more details on techniques to get speed out of the CircuitPython (MicroPython) interpreter: https://docs.micropython.org/en/latest/reference/speed_python.html
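A minimal sketch of both suggestions, assuming a stand-in class in place of the real framebuffer object (`FakeFrameBuffer`, `run_frames`, and the wrap helpers are hypothetical names). The mask trick only works when the wrap size is a power of two; the conditional version works for any size.

```python
class FakeFrameBuffer:
    """Hypothetical stand-in for the real framebuffer; it only counts fill() calls."""
    def __init__(self):
        self.calls = 0

    def fill(self, color):
        self.calls += 1

def wrap_mask(n, size):
    # Equivalent to n % size, but only when size is a power of two.
    return n & (size - 1)

def wrap_conditional(n, size):
    # Equivalent to n % size when 0 <= n < 2 * size (e.g. a counter stepped by 1).
    return n - size if n >= size else n

def run_frames(framebuffer, frames, width=256):
    fill = framebuffer.fill  # look the method up once, outside the loop
    x = 0
    for _ in range(frames):
        fill(0)
        x = wrap_mask(x + 1, width)
    return x

fb = FakeFrameBuffer()
print(run_frames(fb, 300), fb.calls)
```

Whether either change is measurable on the RP2350 is something to confirm with the existing timing prints.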

#

Similarly, in display.flip(), you could save a lot of dictionary lookups by caching the values of disp.rotation, disp.width, and disp.height, and surface.buffer in local variables.

#

if you factored the flip() function apart into the 4 different possible types of flips and picked one of those at the top of your code.py while loop, you could also eliminate a bunch of conditional logic getting called each time through the loop. But, compared to all the pixel pushing, the time spent on comparisons might be insignificant.
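One way to picture that refactor is a lookup table that is consulted once, before the loop. This is only a sketch: the function names and the `write` callback are hypothetical, the 90 and 270 degree variants are left as placeholders, and none of it is taken from the actual display.flip() code.

```python
def flip_rot0(buf, write):
    # No rotation: hand the buffer straight to the writer.
    write(buf)

def flip_rot180(buf, write):
    # Reverse the pixel order while keeping each 2-byte pixel intact.
    n = len(buf)
    out = bytearray(n)
    for i in range(0, n, 2):
        out[n - 2 - i] = buf[i]
        out[n - 1 - i] = buf[i + 1]
    write(out)

def flip_not_implemented(buf, write):
    # Placeholder for the 90 and 270 degree variants.
    raise NotImplementedError("rotation variant not written in this sketch")

FLIP_BY_ROTATION = {
    0: flip_rot0,
    90: flip_not_implemented,
    180: flip_rot180,
    270: flip_not_implemented,
}

def select_flip(rotation):
    """Pick the flip function once, so the main loop has no rotation branches."""
    try:
        return FLIP_BY_ROTATION[rotation]
    except KeyError:
        raise ValueError("rotation must be 0, 90, 180 or 270")

sent = []
flip = select_flip(180)          # chosen once, above the loop
flip(bytes([1, 2, 3, 4]), sent.append)
print(sent)
```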

#

For the byte-swapping, would it work to just pre-swap your color values when you assign them to pixels. For example, whatever = 0x3412 instead of whatever = 0x1234?

#

you could potentially do that as a transformation on the color palette when you load it from a file or whatever
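As a sketch of the pre-swap idea, the two bytes of each 16-bit color can be exchanged once, when the color or palette is defined, so that nothing needs swapping at write time. The helper names are hypothetical.

```python
def swap16(value):
    """Exchange the high and low bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)

def preswap_palette(palette):
    """Return a new list of colors with each entry byte-swapped once, up front."""
    return [swap16(color) for color in palette]

# RGB565 red, green, blue.
palette = [0xF800, 0x07E0, 0x001F]
print([hex(c) for c in preswap_palette(palette)])
```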

junior atlas
#

The stuff outside of display.flip() isn't contributing a lot of time (I've checked most of it), but still good ideas to save some time here and there.

Caching values in display.flip() is a good idea. I haven't profiled every element of it, so maybe some of those 80ms are coming from something outside of the math. The catch with caching values is that it would be nice to have the option of using multiple buffer surfaces and multiple displays. It's still worth trying though, to measure what the benefits might be.

As far as the rotation goes, at this point I'm doubting software rotation is going to be viable, at least if it is not done at a lower level than Python. ulab does not have a rotation function (Numpy does, but it wasn't included in ulab), which is why I had to do strange things. Those strange things are much more computationally expensive than a low level rotation would be. At this point though, I think the best place to do rotation is in the display drivers, because most of the display drivers have initialization options that will do the rotation in the driver hardware. They aren't being used in adafruit_rgb_display, because there may be display drivers that don't have that option. At this point though, I think that should just be left as something for the user to deal with. If you use a display driver that doesn't have that feature, that's your choice. We can't just make it too slow to be viable for everyone because there might be hardware choices out there that don't have the feature natively. (Though basic transforms that can be done on surfaces are still a good idea, as there are a lot of reasons to rotate or flip images, and a user with a driver that doesn't do hardware rotation could still use that on the whole surface.)

Anyhow, I'll try some of your suggestions and see how they work out!

deft linden
#

IIRC, customizing display driver init code to set MADCTL values should be pretty easy.

#

if it isn't already a supported feature
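For reference, a sketch of how a MADCTL data byte could be assembled for each rotation. The bit names follow the ST7789 datasheet's MADCTL description; the mapping from rotation angle to bits is an assumption that depends on how the panel is mounted, so treat the table as a starting point to verify on hardware.

```python
# MADCTL bit flags (ST7789 datasheet naming).
MADCTL_MY = 0x80   # row address order
MADCTL_MX = 0x40   # column address order
MADCTL_MV = 0x20   # row/column exchange
MADCTL_BGR = 0x08  # 0 = RGB order, 1 = BGR order

# Assumed rotation-to-flags mapping; it may need adjusting for a given panel.
ROTATION_FLAGS = {
    0: 0x00,
    90: MADCTL_MX | MADCTL_MV,
    180: MADCTL_MX | MADCTL_MY,
    270: MADCTL_MY | MADCTL_MV,
}

def madctl_for_rotation(rotation, bgr=False):
    """Return the MADCTL data byte for a rotation in degrees."""
    if rotation not in ROTATION_FLAGS:
        raise ValueError("rotation must be 0, 90, 180 or 270")
    value = ROTATION_FLAGS[rotation]
    if bgr:
        value |= MADCTL_BGR
    return value

for angle in (0, 90, 180, 270):
    print(angle, hex(madctl_for_rotation(angle)))
```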

junior atlas
#

The new copy is contiguous in memory, so it can be used as a buffer.

deft linden
#

what CircuitPython version are you using, and where did you get the adafruit_rgb_display library? I'm trying to run your code, and I get this error:

Adafruit CircuitPython 10.0.1 on 2025-10-09; Adafruit Fruit Jam with rp2350b

soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Traceback (most recent call last):
File "code.py", line 4, in <module>
ImportError: no module named 'adafruit_rgb_display.display'

#

I think I fixed it. Problem was that you were doing stuff like from adafruit_rgb_display import display, rect, surface rather than just import display, import rect, and import surface.

#

Now I'm getting output like this:

Function flip Time = 79.498ms
Function flip Time = 79.895ms
Function flip Time = 79.742ms
Function flip Time = 80.048ms

deft linden
#

Saleae Logic 2 capture of your code running on a Fruit Jam with CircuitPython 10.0.3. This shows a zoomed out view highlighting the time between frames.

#

this one is zoomed in with the emphasis on SCK timing (which is near the edge of what my Logic 8's sample rate can capture effectively)

#

This one shows how the gaps between bytes of a frame are a bit variable but generally pretty short

#

FWIW, running the SPI clock at 24 MHz might not work reliably unless you plan to make a custom PCB. As I understand it, 10 MHz is about the max of what can be generally expected to work okay with breadboarded or DuPont wire connections.

#

Something in the range of 5 MHz or less might be more reliable

junior atlas
#

Ok, if I'm reading the first one right, it is taking around 80ms per frame to send the data (and ~600ms in between). Is this correct?

The reason I used 24MHz is that is what the CP guide for the display uses. (https://learn.adafruit.com/2-0-inch-320-x-240-color-ips-tft-display/python-usage) I assumed that if the guide uses that, it's reasonable. (The guide shows it using a breadboard. I'm just using stiff wires to connect directly.) I don't seem to be having any reliability issues though, which I expect I would get if the SPI was having issues.


deft linden
#

yeah, 10 MHz is just a rule of thumb. depending on circumstances, much faster may work fine

#

the time per frame is a little longer than 78 ms as there's some CS flip flopping at the start that didn't get included in the timing overlay

#

then it's about 607 ms between frames

junior atlas
#

I'll keep the speed thing in mind though, in case I have issues later.

So it looks like it is losing around 120ns between bytes. Given the number of bytes being sent, that would certainly cost a lot of extra time.
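A back-of-the-envelope sketch of what those gaps add up to for one 320x240 RGB565 frame. It assumes the SPI clock really runs at the requested 24 MHz and that every byte is followed by the same gap, neither of which the captures confirm exactly.

```python
WIDTH = 320
HEIGHT = 240
BYTES_PER_PIXEL = 2  # RGB565

def frame_time_ms(spi_hz, gap_ns, n_bytes=WIDTH * HEIGHT * BYTES_PER_PIXEL):
    """Return (clocking_ms, gap_ms) for one full frame."""
    clocking_ms = n_bytes * 8 / spi_hz * 1000.0  # time spent shifting bits out
    gap_ms = n_bytes * gap_ns * 1e-6             # time spent idle between bytes
    return clocking_ms, gap_ms

clocking_ms, gap_ms = frame_time_ms(24000000, 120)
print("clocking: %.1f ms, gaps: %.1f ms, total: %.1f ms"
      % (clocking_ms, gap_ms, clocking_ms + gap_ms))
```

Under those assumptions the gaps account for roughly 18 ms of a roughly 70 ms estimate, which is in the same range as the measured frame time of about 78 to 80 ms.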

#

The clock also looks very inconsistent...

deft linden
#

yeah. There's a lot of jitter. I'm not sure to what extent that's due to your chosen clock rate being near the limit of what my logic analyzer's sample rate can capture. Some of it may also be jitter coming from the Fruit Jam.

#

I've got the Saleae Logic 8

junior atlas
#

Ah, good point. It might indeed be due to the limits of the logic analyzer. It does look like the clock pulses are at least hitting the right timing pretty consistently, even if the width appears to be inconsistent.

I wonder how CircuitPython is handling the SPI. Those gaps make me wonder if CircuitPython is struggling to keep up with the rate that the SPI peripheral wants to send, causing the delays between bytes.

(Nice logic analyzer!)

deft linden
#

for digital signals, it's good up to about 25 MHz. There's some tradeoff depending on how many channels you're looking at, whether you're doing any analog inputs, etc.

junior atlas
#

That or maybe it's the ulab library...

Maybe I should try a Python bytearray or even a C style array. I was doing that originally (before testing timing), but I didn't want to have to manually rotate, because I knew it would end up being very expensive.

deft linden
#

here's a fresh capture with only 3 inputs enabled. So, this one is capturing at 100 MS/s vs 50 MS/s (megasamples per second) in the previous screenshots. This one should have somewhat less jitter from the analyzer sampling.

junior atlas
#

That looks a little cleaner. It does look like the clock signals vary in length, but they seem to be consistent otherwise. The 100ns delay between bytes seems to be the problem. That might be from ulab, if its array implementation is introducing read delays. It might be CircuitPython failing to keep up with the SPI.

I'll start by testing it with Python built-in types designed for data transmission. I believe I should be able to do a C array of 16-bit values, and if that doesn't help I'll try a byte array and just deal with having to split color values into two 8-bit values.

deft linden
#

Have you considered whether DVI output using the RP2350 HSTX peripheral would work for your intended purpose?

#

The advantage with that is you get more bandwidth since it's using the HDMI style signalling, and most of the DMA stuff happens in the background

junior atlas
#

The problem with that is price and screen size. I have considered, and I'm actually going to get a screen I should be able to use with it, but I want to try to make something handheld that's not super expensive. I actually haven't looked at DVI-only screens in a while though. Because the port is an HDMI port, I've only looked at HDMI screens that can handle DVI signals. A DVI-only screen with an adapter cable might work, if I can find a fairly small one that isn't too expensive.

deft linden
#

another thing, from what I can tell, your Function flip Time = ...ms print lines pretty well agree with the logic analyzer timing. So, probably you can just watch the debug prints to see how effective any potential improvements might be.

junior atlas
#

Right, it does look like the time reported by the timing code is within a millisecond or two of what the logic analyzer shows. That's definitely nice to know. Thanks for doing that for me. I really appreciate it!

deft linden
#

no problem. good luck working on this. looking forward to seeing where it goes.

junior atlas
#

Thanks! And me too. As much as I do enjoy working at very low levels with microcontrollers, sometimes I just want the project to go without a ton of hurdles. I guess I have to deal with what I have to deal with though. I'll continue reporting back at least for now. In the long run, I might end up writing a full module in C for what I want, but it won't be soon. Hopefully I can make this do what I want before that happens though.

deft linden
#

Probably your easiest win would be to patch whatever needs patching to get an ST7789 init sequence that sets up the hardware rotation you want.

#

You might also consider whether running at a lower color mode would be possible. Like, if the chipset allows you to do RGB332 instead of RGB565, that could help a lot, assuming you can tolerate the reduced color palette.

junior atlas
#

I'm definitely going to do the hardware rotation. I don't know if the controller can do lower color modes or not. It might be worth looking into though. As far as I'm aware, drivers that support lower color modes typically use palettized colors rather than something like RGB332. That would probably work fine for my uses, but I'd have to look up how to set the palette. That would double my framerate though, so it might be worth the effort.

deft linden
#

I just did some quick searching that suggests ST7789 might support an 8-bit indexed color palette mode. unsure of accuracy of what I found.

#

the thing I saw suggested you can set your own lookup table for the color indexing, so that would potentially give much better quality than vanilla RGB332

#

I mention RGB332 because that's what you get with the RP2350 HSTX. It has hardware bit extension for the low bits. So, it's fast, but the colors aren't very good.

#

the 16-bit and 24-bit modes look much better.

junior atlas
#

Interesting. I'll definitely have to mess with the HSTX once I get that screen.

But yeah, the display I'm currently using is small enough that anything I would do with it would basically be pixel graphics, and a 256 colors palette should easily be enough for anything I might want to do.

deft linden
#

I wonder how the bytes are getting sent to the SPI peripheral. If it's possible to have a bunch of stuff queued up by DMA vs one byte at a time, maybe that could make a difference?

junior atlas
#

I'll have to look into how the RP2350 handles SPI then. It should be possible, at least in C, to close the gap, if there's a significant SPI buffer. It might be an interrupt that is putting one byte at a time into the SPI buffer.

Yeah, that's exactly what I was thinking. I'll add that to my list of things to check.

deft linden
#

hmm... I may be wrong about the default clock. I remember 125 MHz from some HDMI timing stuff, but that might actually be slower than normal. Could be the default is actually 150 MHz or something.

#

in that case 150e6 * 100e-9 = 15, so not much effective difference.

#

I just re-read your feature request issue for Adafruit_CircuitPython_RGB_Display. Seems to me like, if CircuitPython were to have a portion of the PyGame API that could run at anything resembling a reasonable frame rate, it would probably need to be implemented in C as a module for the CircuitPython core.

junior atlas
#

Yep. Most of Pygame is in C, so it might not be too hard to port. That said, it has been a while since I did anything at the C level for CircuitPython, so I don't remember if the binding syntax is the same.

#

But yeah, the more I learn about this, the more it looks like it's going to take a hardware specific implementation in C to work for the RP2350. It probably isn't even feasible for anything slower than the RP2350.

deft linden
#

Another thing, if it would be possible to do that in a way where the module implemented the right interface such that it could be assigned to busdisplay.BusDisplay.root_group , then you might get a whole lot of display compatibility for less effort than with the pure python Adafruit_CircuitPython_RGB_Display route.

#

As I understand it, there's been a lot of optimization work over the years for low-level display SPI DMA stuff in the CircuitPython core (and more recently RP2350 HSTX). I think most of the slowness is related to displayio having been designed years ago to work on very memory-constrained microcontrollers.

#

If you (or whoever) provided some other type of object that could be assigned to root_group (i.e. not displayio.Group), then maybe it could be much faster.

junior atlas
#

I'll look into that too. If the SPI DMA stuff already exists in CircuitPython core, then the only hardware specific stuff I'd have to do is for the display driver itself. And even that might be amenable to enough abstraction that others could be adjusted to fit without too much work.

deft linden
#

it might even be better than that. I'm not quite sure what the interface is like for an object to be compatible with BusDisplay.root_group. But, my impression is that it's not much higher level than a framebuffer in a format that can be blasted out to the display over SPI, I2C, or whatever using MIPI commands that are somewhat shared between many displays.

#

busdisplay.BusDisplay abstracts a lot of the functionality that's common across many display chipsets.

#

so, if it works the way I think it does, you could potentially just implement a module with a PyGame-like API to update a group object (basically a framebuffer plus whatever sub-buffer compositing and graphics drawing primitives you might choose to build, e.g. displayio.Group uses TileGrid, Bitmap, etc.)

#

then you could do something like:

import supervisor

group = MyPyGameLikeClass(whatever, ...)  # hypothetical class, not an existing API
supervisor.runtime.display.root_group = group
while True:
    # draw stuff by calling group.whatever
    pass
#

or, if you weren't using a board with a built-in display, you could do

group = MyPyGameLikeClass(whatever, ...)  # hypothetical class, not an existing API
# set up an external display connected to GPIO pins
display = Whatever(...)
display.root_group = group
while True:
    # draw stuff by calling group.whatever
    pass
deft linden
#

Some relevant parts of the CircuitPython core implementation:

junior atlas
#

This is quite interesting. It indeed seems like it might be exactly what I'm looking for to build on. I don't want to use displayio, at least not as intended, but if I can hook into displayio at a much lower level than the public interface, that would be acceptable.

I'll add that stuff to my list and move up to the top. Maybe the correct solution is to build a low level interface into displayio, so that it can be used in either way or maybe even both at the same time.

deft linden
#

Maybe the correct solution is to build a low level interface into displayio, so that it can be used in either way or maybe even both at the same time.

Assuming you have the RAM for framebuffers, it would probably work to build low-speed GUI elements (maybe menus or something?) using a regular displayio group (e.g. group1 = displayio.Group()), then have a high-speed thing using some to-be-written module that can provide an interface-compatible implementation of displayio_group_t (e.g. group2 = MyFastGraphicsClass()). Then, you could switch between them by doing display.root_group = group1 or display.root_group = group2.

junior atlas
#

That's actually a good idea. The Fruit Jam has 8MB of RAM on top of the 520kB on the RP2350, so it should have plenty for that.

deft linden
#

one thing to be aware of is that if you want to do DMA on the RP2350, it needs to be from SRAM or various things will glitch

#

DMA that tried to touch the external PSRAM chip caused several bugs which Scott went to a bunch of trouble to fix

#

there are probably at least 2 or 3 pull requests for different instances of the PSRAM DMA thing as it occurred in different subsystems.

raw void
#

Yup, PSRAM isn't fast enough to hold the framebuffer

#

but there is one always in internal RAM for picodvi

#

you should be able to manipulate it directly

junior atlas
#

Yep, I'm aware that the PSRAM won't be fast enough. In fact, I've wondered if the 100ns delay might be caused by that. Is there a way in CircuitPython to control what RAM a buffer is in? Or is the picodvi one available on the Fruit Jam CP, and how would I access it if so?

junior atlas
#

Can this be done in pure Python, or does it have to be done at the C level?

raw void
#

it is only done in the C level. I think you can give the framebuffer to memoryview() to access the raw memory

junior atlas
#

Ok, so in theory: Could I use picodvi to get the framebuffer and then memoryview() to use it from Python code? If the framebuffer is intended for the HSTX, at 640x480, that should be plenty for a 320x240 display.

raw void
#

I think jepler had done some raw framebuffer manipulation with ulab iirc

junior atlas
#

Awesome. That might work. I don't know what the delay for reading the PSRAM is, but if it is fairly fast, 100ns or 12 instructions could be it. If that's what is happening, putting it in the on chip RAM should be plenty fast (and should allow DMA so the program can do other things while frames are being sent, though that might have to be a future concern).

Thanks! That really helps.

junior atlas
#

Alright, preliminary success at least: I've used picodvi to create a framebuffer. I used memoryview() to get array-like access to the framebuffer memory. I've modified byte-pairs of the memory using the array-like access. Then I sent the framebuffer to disp._block() (where disp is the display driver object, from adafruit_rgb_display.st7789). The modified bytes display correctly on the screen!
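The byte-pair access described above can be sketched with an ordinary bytearray standing in for the picodvi framebuffer, so the example runs without the hardware. The `set_pixel` helper and its `high_byte_first` flag are hypothetical. The ST7789 takes the high byte of each RGB565 pixel first over SPI, so that is the default here.

```python
WIDTH = 320
HEIGHT = 240

# Stand-in for the picodvi framebuffer memory (assumption: 2 bytes per pixel).
framebuffer = bytearray(WIDTH * HEIGHT * 2)
view = memoryview(framebuffer)

def set_pixel(view, width, x, y, color, high_byte_first=True):
    """Write one 16-bit color as a byte pair at pixel (x, y)."""
    i = 2 * (y * width + x)
    hi = (color >> 8) & 0xFF
    lo = color & 0xFF
    if high_byte_first:
        view[i] = hi
        view[i + 1] = lo
    else:
        view[i] = lo
        view[i + 1] = hi

set_pixel(view, WIDTH, 1, 0, 0xF800)  # red, second pixel of the first row
print(list(framebuffer[0:4]))
```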

The bytes are definitely flipped. This isn't a disaster, but it's certainly going to be a bit of a pain. For drawing, I can easily handle byte swapping once before drawing all of the pixels. I'm hoping that whatever I end up using for image loading has the same endianness, otherwise byte swapping could cost significant time. (It will mainly increase loading time, not rendering time, if I handle it in the right place.)

Anyhow, I don't know how this impacts speed yet. I just barely had the time to work out how to get this far. The next step will be adjusting my existing drawing code to use this new framebuffer. I'll report back once I've got some information on that!

junior atlas
#

@deft linden Hey, if you have time to run this through your logic analyzer to see if it still has the 100ns gaps, that would be awesome. This does not need any of the extra module elements I wrote. It only uses adafruit_rgb_display and stuff that is built in for the Fruit Jam.

If this works better, then I'll modify the other parts to use it instead of ulab. It might still not be enough to get good speed, without C level drawing functions, but at least it's a start. In fact, it might be possible to use something like Adafruit_GFX with this. (At this point, I'm also considering seeing if displayio has private functions and such I can hook into, to sidestep the mandatory GUI hierarchy. But this might be a lighter weight option...)

deft linden
#

@junior atlas see logic analyzer screenshots above. First one is a wide zoom of power up. Second one shows bytes from the first frame. It might not be obvious, but I've applied 4 measurements to the capture. The first one covers the entire CS low pulse for the first frame. The next 4 measurements are the first 4 gaps between bytes. (see right sidebar in screenshot)

junior atlas
#

Ok, so the gaps are still around 100ns, and the frames are still taking roughly 80ms to transmit. Since I'm using a picodvi buffer, which is supposed to be in on-chip RAM, the 100ns delays aren't being caused by the buffer being in the slower PSRAM.

Well, that brings me back around to: Probably gonna have to do most of this in C.

That's not a terrible surprise. Other parts of this, like drawing lines and such, are also taking so long that they should probably be done in C as well. And if any byte swapping has to happen during image loading, that should happen in C too.

I'm thinking I'm going to have to make this a long term project then. I'll have to take a break from this project before starting (hopefully not too long). If I can figure out where I put my servos, I'm going to teach one of my sons how to use them with the other Fruit Jam I bought.

That said, I'll still look into other options in the mean time. Maybe displayio is more optimized, and if it is, maybe I can hook into it through private functions that will give me lower level access at better speeds.

junior atlas
#

Hmm. MADCTL has a bit for setting the color order. I wonder if changing that bit would take care of the byte order issue...

junior atlas
#

RAMCTRL has a bit for setting endianness, but it also says it only works for little endian with 65K 8-bit and 9-bit interface. I assume that means a parallel interface, which is not supported by the board (and which I don't think the Fruit Jam has enough available GPIO pins for anyway).

junior atlas
#

@deft linden I see in Adafruit's ST7789 driver MADCTL is set to b"\x36\x01\x08". The first byte is the command to write to MADCTL. Your playground guide says the data being sent to MADCTL is the last part, 0x08. What is the intervening byte? Is the command actually 2 bytes, and the least significant bit has to be high? Is MADCTL 2 bytes long? The data sheet includes a D17-8 section for both the command and the data sent to it, but we aren't actually sending 18 bits, and even if we were, I can't see where the least significant bit of that part should be 1. I assume bits 8-17 refers to unused bits in 9-bit or 18-bit parallel interfaces, which aren't used here. Any idea?

deft linden
#

@junior atlas oh man... it's been a long time. I don't remember that stuff at all.

#

I also vaguely recall that CircuitPython's display driver stuff may interpret some of the bytes in the init sequence as commands for stuff like sleeping for some length of time before sending the next set of SPI bytes.

#

So, consider the possibility that, of the bytes you see in the init sequence byte array, it's possible some of them may not be sent to the display over SPI

#

you could track this down by looking at the CircuitPython driver, seeing what the display library class is that it uses (I forget the name), then look at the source and documentation for that thing. It should say somewhere if some of the init sequence bytes get treated specially.

junior atlas
#

Ok, that makes sense. The 0x01 isn't being sent in your graph. It's going straight from 0x36 to 0x08. I remember from writing a C driver for some other display that sometimes the timing is super sensitive, so the 0x01 might indeed be some kind of delay instruction to the SPI code. (I've already got that exact datasheet open in another tab. This question came up when I was comparing it with Adafruit's driver.) Anyhow, thanks for pointing out that there might be some timing related stuff in there. Knowing that, I should be able to manage from the datasheet and Adafruit's driver.

So my current plan: I'm getting one more Fruit Jam. I have two right now, one for my son to learn robotics and one for myself to play with CircuitPython on and also to help my son. I'd really like to drop down to C and live there for a while, while I figure this stuff out. I've looked at the specs, and the Fruit Jam should be a little bit more powerful across the board than an Intel 486. The clock rate is significantly faster (it's RISC, but it also has more general-purpose registers). Most 486 systems could handle up to 8MB of RAM (the one we had when I was in my teens did). I haven't checked the PSRAM speed, but I think it is somewhere in the ballpark of the top RAM speeds supported by the 486 architecture. In theory, I should be able to achieve similar performance, but not necessarily in Python, because it has fairly high overhead. I remember playing (and making) real-time games on that 486 with really decent framerates, including WarCraft and WarCraft 2.

So I think the best option right now is to work in C for a while (hence another Fruit Jam, so I don't have to keep swapping out CircuitPython on the current one). Once I've gotten my bearings for this particular microcontroller in C, I should have a much better idea of where there is room for C level optimization for CircuitPython. I might end up having to focus mainly on optimizations for the Fruit Jam though, and not other platforms.

#

Anyhow, thanks for the additional information. I don't think I'll have any trouble writing a C level driver for the display (it will be my third), and then maybe later I'll roll that up into a CP display module designed specifically for higher performance framerates.

raw void
#

The 0x01 is the data count

#

if bit 0x80 is set then there is a following delay parameter as well

junior atlas
#

Oh, so 0x01 is just the length of the data being sent to the command. That makes sense. Thanks! That might be useful for "decoding" other commands as well. That said, I'll rely on the datasheet a lot for timing constraints.
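
A minimal sketch of decoding one entry of such a packed init sequence, based only on the layout described above (command byte, then a count byte whose bit 0x80 flags a trailing delay parameter). The names `init_entry_t` and `init_seq_next` are made up for illustration, and the units of the delay byte are not stated in this thread, so the raw value is just passed through.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded entry of a packed init sequence (hypothetical type). */
typedef struct {
    uint8_t command;      /* command byte, e.g. 0x36 */
    uint8_t data_len;     /* low 7 bits of the count byte */
    const uint8_t *data;  /* points into the original sequence */
    int has_delay;        /* nonzero if bit 0x80 of the count byte was set */
    uint8_t delay;        /* raw delay parameter; units are an open question here */
} init_entry_t;

/* Decode the entry that starts at seq[pos].
 * Returns the offset of the next entry, or 0 when no complete entry remains. */
size_t init_seq_next(const uint8_t *seq, size_t len, size_t pos, init_entry_t *out)
{
    if (pos + 2 > len) {
        return 0;
    }
    uint8_t count = seq[pos + 1];
    out->command = seq[pos];
    out->data_len = count & 0x7F;
    out->has_delay = (count & 0x80) != 0;

    size_t need = 2 + (size_t)out->data_len + (out->has_delay ? 1u : 0u);
    if (pos + need > len) {
        return 0;
    }
    out->data = &seq[pos + 2];
    out->delay = out->has_delay ? seq[pos + 2 + out->data_len] : 0;
    return pos + need;
}
```

With the bytes from the graph, `0x36 0x01 0x08` decodes as command 0x36 with one data byte, 0x08, and no delay.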

junior atlas
#

Ok, the data sheet says the serial clock cycle minimum is 16ns for write and 150ns for read. That means SPI write should be good up to 62.5 MHz (which comes out to ~50fps on 16bit color). If I wanted to read, I would have to clock it down to ~6.66 MHz during read operations, but I don't need read capabilities in my driver, so this doesn't have to matter.
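
The arithmetic behind those numbers, as a rough sketch. It assumes a 240x320 panel (the resolution is not stated in this thread, but it is the one that reproduces the ~50fps figure) and ignores command overhead, so the result is an upper bound. The function names are made up.

```c
/* Maximum SPI clock in Hz for a minimum clock cycle given in nanoseconds. */
double spi_max_hz(double min_cycle_ns)
{
    return 1e9 / min_cycle_ns;
}

/* Upper bound on full-frame refreshes per second at a given SPI clock.
 * Only counts pixel bits; command bytes and gaps between frames are ignored. */
double max_fps(double spi_hz, unsigned width, unsigned height, unsigned bits_per_pixel)
{
    double bits_per_frame = (double)width * (double)height * (double)bits_per_pixel;
    return spi_hz / bits_per_frame;
}
```

`spi_max_hz(16.0)` gives 62.5 MHz and `spi_max_hz(150.0)` gives about 6.67 MHz. Under the 240x320 assumption, `max_fps(62.5e6, 240, 320, 16)` comes out just under 51.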

That said, the default SPI operation for the RP2350 C library is all blocking reads and writes, so 50fps is not actually achievable if the CPU is doing anything else at all (like generating new frames). So practical speeds without DMA are probably around 20 to 30 fps max. The Pico example code does include a SPI with DMA example. I don't fully understand it yet, but once I've got basic operation of the display down, I'll go back and see if I can "upgrade" my library to use non-blocking writes with DMA. That should allow framerates close to 50 fps, if the processor and other peripherals can keep up with game logic and rendering new frames fast enough. I expect it will be able to get at least 35 to 45 fps this way, and it might even be able to get very close to the full 50 fps the SPI is capable of. (The RP2350's SPI hardware maxes out at 62.5MHz, which seems pretty convenient to me.)

deft linden
#

Have you looked at the HSTX peripheral? Adafruit uses it for picodvi with balanced (I think) signals over HDMI connectors, but I wonder if it can be configured to do regular SPI?

junior atlas
#

It can. It can even do SPI faster than 62.5MHz (but the display I'm using can't). (From what I've read, the PIOs can't SPI that fast though.) It's significantly more complicated to use though, and if I can do SPI with DMA, that's the better option for this.

I do need to look into the HSTX though. I'd like to write (or more likely port from Arduino) a C driver for DVI. If I can use the HSTX for other pins (aside from those used for the HDMI port), and I can do large non-blocking writes, it would definitely be an option for this instead of SPI+DMA, but I suspect I'd have to write up all of the SPI timing code manually, which would be way more complicated than SPI+DMA.

(What would honestly be an ideal situation is if I could use this to build a device that will function as a handheld computer/console when used on its own, but which can be plugged into HDMI and will automatically use that instead of the LCD display if it is powered on while connected to HDMI. I don't think it is feasible to try to do both at the same time though, as a dual screen sort of thing, because the memory for two framebuffers wouldn't leave enough for much else.)

raw void
#

HSTX is on a fixed set of pins. You can change the function of each pin but you need them all. I think making SPI non-blocking is your best bet

#

so you can compute pixels while spitting them out

junior atlas
#

I suspected HSTX might be on a fixed set of pins. So yeah, unless I want to cannibalize an HDMI cable (and I really don't), better to stick with SPI+DMA.

Thanks for verifying that for me.

junior atlas
#

Alright, the base driver is done. I haven't added capabilities for accessing potentially useful commands, but it initializes correctly, and I can send it a framebuffer that gets displayed in the right colors without having to manually swap the bytes. (There are a few commands I plan to add functions to access. There's one for changing backlight brightness that eliminates the need for using the BL pin. It might also be useful to be able to change screen rotation dynamically, though this is not important to my specific use case.)

DMA won't work for sending commands (due to the need to switch the DC pin between the command and any arguments), but it should work for sending the framebuffer and for buffered writes to smaller regions of the screen. I think I can send the CASET, RASET, and RAMWR commands with blocking SPI writes and then set the DC pin to data, start the DMA transfer, and return from the function. I think I will have to do a block until DMA finishes at the beginning of most of the functions, in case they get called before the DMA transfer is complete, but that should be far cheaper than the savings achieved using DMA.
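
A sketch of the data that goes out before the DMA transfer starts. The command codes are the standard ST7789 ones (CASET 0x2A, RASET 0x2B, RAMWR 0x2C); the helper names are made up, and the actual SPI, DC pin, and DMA calls are left out so the block stays hardware independent.

```c
#include <stdint.h>

enum {
    ST7789_CASET = 0x2A, /* column address set */
    ST7789_RASET = 0x2B, /* row address set */
    ST7789_RAMWR = 0x2C  /* memory write: pixel data follows */
};

/* Pack an inclusive start/end pair as the 4 big-endian parameter bytes
 * that CASET and RASET each take. */
void st7789_pack_range(uint16_t start, uint16_t end, uint8_t out[4])
{
    out[0] = (uint8_t)(start >> 8);
    out[1] = (uint8_t)(start & 0xFF);
    out[2] = (uint8_t)(end >> 8);
    out[3] = (uint8_t)(end & 0xFF);
}

/* Number of RGB565 bytes the DMA transfer has to cover for a window
 * with inclusive corners (x0, y0) and (x1, y1). */
uint32_t st7789_window_bytes(uint16_t x0, uint16_t y0, uint16_t x1, uint16_t y1)
{
    return (uint32_t)(x1 - x0 + 1) * (uint32_t)(y1 - y0 + 1) * 2u;
}
```

The intended order would be: wait for any previous DMA to finish, send CASET and RASET with their packed ranges using blocking writes, send RAMWR, set DC to data, then start the DMA for `st7789_window_bytes(...)` bytes and return.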

Once this is done, I'll move on to the framebuffer interaction functions. I might be able to use Adafruit_GFX for some of this, though I might want to do some optimizations on it as well. Alternatively, I'm seriously considering an SDL-like interface, as that tends to be better suited to games and real-time applications in general, while the Adafruit libraries seem to be more oriented toward other kinds of applications (like status displays or digital picture frames).

junior atlas
#

Advice/feedback:

I'm seriously considering doing this similarly to an OS level driver. My reasoning is that I want the higher level graphics stuff to be driver agnostic (so that instead of ST7789, a different display could be used without having to change the graphics/rendering library), and it would also be nice if multiple displays could be used at the same time, without the user level code having to care which driver is being used for which.

More specifically, I want the device driver to provide a "universal" interface that the graphics/rendering code can just plug into, and I want the rendering code to be able to work with multiple instances of this interface.

What I'm currently thinking is for the driver to provide a struct of function pointers for accessing driver capabilities within the context of a well defined protocol. For example, one "slot" in the struct would be for a function that initializes the driver, another would be for a function that creates and returns a framebuffer, another would be for writing an arbitrary region of the display directly, and another would be for initiating a DMA blit of the framebuffer to the display. Capabilities like changing display rotation and backlight brightness would be there as well, and if a given display doesn't have one of those capabilities, that slot can be set to NULL to indicate lack of that feature. The cost of this is indirection in those function calls. The benefit is that drivers for other displays could easily be written to use this protocol, so that the graphics library doesn't need to know anything about hardware implementation of the display controller.
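
Roughly what that struct could look like. Every name here is hypothetical, and the slot signatures are only one possible choice; the `fake_` functions are a stand-in driver that exists only to exercise the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Table of driver entry points. Optional capabilities are NULL when absent. */
typedef struct display_driver {
    int (*init)(void);
    uint16_t *(*get_framebuffer)(void);
    void (*write_region)(uint16_t x, uint16_t y, uint16_t w, uint16_t h,
                         const uint16_t *pixels);
    void (*blit)(void);                   /* start a DMA blit of the framebuffer */
    int (*set_rotation)(uint8_t rotation); /* optional */
    int (*set_backlight)(uint8_t level);   /* optional */
} display_driver_t;

/* Graphics-side wrapper: returns -1 if the driver lacks the capability. */
int display_set_backlight(const display_driver_t *drv, uint8_t level)
{
    if (drv == NULL || drv->set_backlight == NULL) {
        return -1;
    }
    return drv->set_backlight(level);
}

/* Stand-in driver used only to demonstrate the table. */
static uint8_t fake_level;
static int fake_set_backlight(uint8_t level)
{
    fake_level = level;
    return 0;
}
```

The NULL check is where the cost of the indirection shows up: every optional call goes through a wrapper like `display_set_backlight` instead of a direct call.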

Thoughts? Is there a better way to do this? Are there alternatives that might be better in some situations?

deft linden
#

A significant downside of function pointers is that they make it difficult for static analysis tools to navigate control flow. For example, the arrangement you proposed might make it difficult for an IDE to let future-you, or somebody else, do the jump-to-definition thing to understand what's getting called from where.

#

If it were me, I'd probably start by getting one instance of the driver working for ST7789 as straightforwardly as possible. Then, when I got to the point of adding support for a second display type, spend more time thinking about abstraction layers. Lots of times people add hooks to code to support future expansion but then it doesn't end up getting built out the way they'd initially imagined and the resulting code is also more complex than necessary.

#

Some other things to consider:

  1. If you design your functions to take a drawing context pointer, that will help to have more than one display of the same chipset active in the same system (framebuffer pointer, display width & height, etc. come from context rather than globals)
  2. You could have different display drivers use a namespace prefix and common function suffixes so that each chipset driver implements the same functionality with almost the same function names (but the prefixes and context struct pointers let you use them together in the same system)
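
A minimal sketch of point 1, with made-up names (`draw_ctx_t`, `gfx_put_pixel`, `gfx_hline`). Nothing is global, so two displays just mean two contexts. For point 2, each chipset driver would add its own prefixed functions taking the same context, for example a hypothetical `st7789_blit(ctx)` next to an `ili9340_blit(ctx)`.

```c
#include <stdint.h>

/* Everything a drawing routine needs, passed in instead of read from globals. */
typedef struct {
    uint16_t *framebuffer; /* RGB565 pixels, row-major */
    uint16_t width;
    uint16_t height;
} draw_ctx_t;

/* Set one pixel; coordinates outside the surface are silently clipped. */
void gfx_put_pixel(draw_ctx_t *ctx, int x, int y, uint16_t color)
{
    if (x < 0 || y < 0 || x >= ctx->width || y >= ctx->height) {
        return;
    }
    ctx->framebuffer[(uint32_t)y * ctx->width + (uint32_t)x] = color;
}

/* Horizontal line of len pixels starting at (x, y), clipped per pixel. */
void gfx_hline(draw_ctx_t *ctx, int x, int y, int len, uint16_t color)
{
    for (int i = 0; i < len; ++i) {
        gfx_put_pixel(ctx, x + i, y, color);
    }
}
```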
#

You could follow an implementation path of:

  1. Get ST7789 driver working with simple C code
  2. Implement a Python module to provide access to the driver
  3. Get some other display chipset driver working using the same set of function suffixes
  4. Implement a Python module to provide access to the second driver (same set of class methods as the ST7789 python module (equivalent by duck-typing), but a different class name so they can be imported and instantiated as separate objects)
  5. Consider if the Python level equivalence of two display driver classes providing the same set of methods gives you all the display selection flexibility you need. If not, consider adding some kind of abstraction in the C layer.
#

TLDR, if you want something like polymorphic function dispatching within the context of CircuitPython, it might be simplest to do it with duck-typing at the Python class level rather than trying to build a C++ style virtual function table thing in C.

#

One possible drawback to such an arrangement is that it could result in duplicated code between chipset drivers. But, in that case, you could factor out some C library functions to share between the drivers.

junior atlas
#

I was thinking about this more this morning, and I agree that the best path forward would probably be to finish the current "pipeline" and then worry about supporting other drivers once that's done. Part of my problem here is that I have two closely related projects in my head right now. The primary one is putting together what is essentially a handheld game console library for the Fruit Jam, that switches to use HDMI when it is booted up connected to a TV. This is not difficult. The secondary one is trying to write a full desktop style CLI operating system for the RP2350, with the Fruit Jam as the initial target. I probably should finish the first before trying to support the second.

I am already using namespacing prefixes. That solves half of the problem, but it doesn't allow for dynamic switching. This is an OS concern though, not a game console concern. For the game console, I can assume only one LCD display, and since display choice is made exclusively at boot up, I don't have to worry about dynamically switching.

I have considered dealing with the polymorphic stuff purely in Python. I do want it to work in C as well though. Again though, this is OS project stuff that I really don't need to be worrying about at this stage.

I'm not too worried about duplicated code in drivers. It is certainly a drawback, but I don't think it is avoidable in any reasonable way.

I think the drawing context pointer is going to be the best way to do it. Instead of trying to connect the display driver into the graphics rendering system, the user can initialize the driver they want to use and then tell the graphics rendering system where the framebuffer is (by passing a pointer to it when calling rendering functions), and then instead of the graphics rendering system triggering blits from the framebuffer to the display, the blit function is part of the driver and the renderer doesn't need to know about it. Unbuffered writes would then be the user's responsibility.

#

That helps a lot. Thanks!

On a related side note, I went through some of my old microcontroller stuff. I have two other LCD displays that I used when learning to program a couple of older microcontrollers, and I checked what controllers they use. One is an HX8340B, and the other is an ILI9340C. I know Adafruit still has some ILIx controlled displays, and I don't know about the HXx one, but this means once I've got the rest of the graphics stuff done, I can write C drivers for those that will integrate into this system. I'm definitely not going to lose focus on those though. The current stuff I have working is the top priority right now!

junior atlas
#

Oh, another thought: I mentioned I want it to be able to automatically use HDMI, if it is connected. (I'm assuming there's an easy way to poll the HSTX to check, but I might be wrong.) Without more abstraction, this would end up being entirely manual, making it the user's responsibility to check and then call the blit function on the right driver.

What if a callback function is used instead? Some abstraction layer has a universal blit function, and if HDMI is connected, it automatically calls the built-in blit function for that. Otherwise it calls a user provided callback function. The callback function doesn't need to be provided with a pointer. It just needs to be named correctly and placed somewhere in one of the user source files.

For the Python module, it would need to be a function pointer, or maybe a built-in callback function with a switch statement that picks the correct function for the initialized LCD display.

I think I'll worry about this later (once I've got the HSTX driver in and actually need to implement the selection functionality) and be sure to code in a way that leaves room for this if it ends up being the right solution.

raw void
#

you can detect a plugin via the i2c bus that the cable carries

junior atlas
#

Awesome. Thanks! That helps. I'll note that for when I get there.

deft linden
#

something to keep in mind for the framebuffer and blitting stuff is that the whole system can potentially run a lot faster if you draw pixels in the same memory layout as what you will need for shoveling bits out SPI or HSTX to the display. For example, with Fruit Jam HDMI, you might end up working in RGB332, RGB565, or perhaps something else. It depends on the resolution and video mode. I'm not sure what memory layout(s) you'd need for framebuffers targeting ST7789, ILI9340C, etc.

#

So, for example, if you end up designing an API that tries to let everything share some of the blitting code, you might get backed into a corner where you have to do some byte order or color LUT translation stage before you can fire off a DMA to the display.

#

For that type of situation, to avoid the need for a translation stage, you could potentially do an enum typedef for your supported memory layouts, then put an enum variable in your drawing context struct. When needed, your drawing functions could do a switch/case statement to decide which of your blitting functions to use. (this would avoid function pointers in a way that would let IDE static analysis tools work properly)
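
Something along these lines, as a sketch. The type and function names are made up, and only two layouts are shown; RGB332 is packed as RRRGGGBB and RGB565 as RRRRRGGGGGGBBBBB.

```c
#include <stdint.h>

/* Supported framebuffer memory layouts. */
typedef enum { PIXFMT_RGB332, PIXFMT_RGB565 } pixfmt_t;

/* Drawing context that carries its own layout tag. */
typedef struct {
    void *framebuffer;
    uint16_t width;
    uint16_t height;
    pixfmt_t format;
} fmt_ctx_t;

static uint8_t pack_rgb332(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6));
}

static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Write one pixel in whatever layout the context says the display wants.
 * The switch replaces a function pointer, so tools can follow the calls. */
void fmt_put_pixel(fmt_ctx_t *ctx, uint16_t x, uint16_t y,
                   uint8_t r, uint8_t g, uint8_t b)
{
    if (x >= ctx->width || y >= ctx->height) {
        return;
    }
    uint32_t i = (uint32_t)y * ctx->width + x;
    switch (ctx->format) {
    case PIXFMT_RGB332:
        ((uint8_t *)ctx->framebuffer)[i] = pack_rgb332(r, g, b);
        break;
    case PIXFMT_RGB565:
        ((uint16_t *)ctx->framebuffer)[i] = pack_rgb565(r, g, b);
        break;
    }
}
```

In a real renderer the switch would more likely sit around whole blit or fill routines than around a single pixel write, so the branch is paid once per call instead of once per pixel.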

#

You could also potentially just say, "okay, it's gonna be RGB565 or nothing", which might make your life a lot simpler.

junior atlas
#

Yeah, while I'd love to keep my options open, RGB565 seems like the best option. Maybe for the OS project I'll reconsider that, but the OS project is also intended for the RP2350 specifically, and 24 or 32 bit color would cost too much memory. (Can't even do the framebuffer on external memory, due to the speed cost.) Even 18-bit is a bit much for memory, but it's possible. The problem with 18-bit (native for ST7789 and for at least one if not both of the other display controllers I have) is that it doesn't line up neatly. The cost of just rendering to the framebuffer would be way too high. (Also, I don't want to write the code for doing that. I recently finished writing a driver for the SSD1306 (monochrome), and working on individual bits was a horrible pain. This would be so much worse.) The ST7789 SPI interface actually supports 24-bit color, and then it just drops the bottom two bits of each color, but then it would be the same as 24-bit in memory cost.
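
To put rough numbers on the memory cost and on why 18-bit "doesn't line up", here is a small sketch. It assumes a 240x320 panel, which is not stated in this thread, and the helper names are made up.

```c
#include <stdint.h>

/* Bytes needed for a tightly packed framebuffer at a given color depth. */
uint32_t framebuffer_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return (width * height * bits_per_pixel + 7u) / 8u;
}

/* Where a pixel starts inside a tightly packed buffer. */
typedef struct {
    uint32_t byte_offset;
    uint8_t bit_offset; /* 0 means the pixel starts on a byte boundary */
} bit_pos_t;

bit_pos_t packed_pixel_pos(uint32_t index, uint32_t bits_per_pixel)
{
    uint32_t bit = index * bits_per_pixel;
    bit_pos_t pos = { bit / 8u, (uint8_t)(bit % 8u) };
    return pos;
}
```

At 240x320 this gives 153,600 bytes for 16-bit, 172,800 for packed 18-bit, and 230,400 for 24-bit. With 18 bits per pixel only every fourth pixel starts on a byte boundary (bit offsets cycle 0, 2, 4, 6), so almost every write needs shifting and masking across three bytes.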

Anyhow, yeah, I've really thought this through, and RGB565 only does seem to be the only reasonable option for a microcontroller with limited memory and a display controller with limited SPI speed. That or palette mapped 8-bit, but the ST7789 doesn't support that, and the cost of doing that in software is way too high.

Honestly, what would be really nice, is a video controller chip with a built in framebuffer that the RP2350 can render to and then trigger a blit to the display on. Basically, it would be nice if the Fruit Jam had a basic "video card" chip to handle some of this stuff in external hardware. I know I can't reasonably expect that though, especially without a large increase in price! This is just wishful thinking.

But yeah, unless I have some genius epiphany that solves this in some ultra-elegant way, it's going to be 16-bit color only, with no other options.

raw void
#

definitely start with 16 bit

#

anything more will cost you 32 bits because DMA won't do 24bits or finer grained

#

you could look at the ESP32-P4 because it has 2DDMA

#

I think it has a "pixel processing pipeline" too. Though maybe I'm thinking of imx

junior atlas
#

I don't think the DMA bit width will matter for SPI, as the SPI baud rate is the limiting factor, not the DMA speed. I should be able to run the DMA at 8-bits and still get the maximum SPI speed, and the display won't know or care what the DMA width is.

That said, 16 bits at the maximum baud the display controller (and the RP2350 hardware SPI) is capable of gets about the ideal framerate for anything real-time (I think I calculated 50fps). 24-bit would get around 33fps, if my math was right, and while that's not awful, it's not great, and it doesn't leave much flexibility for fine tuning performance. And of course, it would use around half the internal RAM for the framebuffer. I think 16-bit is probably the highest practical color depth for this, and the lower options have much higher CPU cost due to not being byte aligned, so they aren't terribly practical either.

"Pixel processing pipeline" sounds good. This particular project is specifically aimed at the Fruit Jam, due to its specific features, but I'll keep that in mind for future projects!

junior atlas
#

Ok, I just realized: I was designing an ESP32 Bluetooth project last week, and so I've been in the mindset of "two cores, but I can only use one". Today I realized that a) the Fruit Jam has a completely separate wifi coprocessor, and b) even if it didn't, it doesn't have native wifi, so I don't need to reserve the second core.

So, question: CircuitPython doesn't use the second core at all, right? It just ignores it? (I know this was the case several versions back, but I haven't kept up on version notes.)

If this is correct, I have another question: Is there any reason not to just turn the second core into a dedicated GPU? I happen to have some experience writing 2D drawing routines, and I've even written a basic 3D rendering engine once. I don't know how to use the second core yet, but I'm sure I can figure it out, and then I can use a FIFO with a mutex to send commands and data pointers to the second core for rendering. This would be a significant upgrade from my original plans, and it wouldn't take that much additional work, since I'm already planning to write a significant portion of the rendering stuff anyway (mainly the 2D stuff, but a simple 3D renderer really isn't that difficult).

junior atlas
#

Update: DMA is working. My test pattern is being displayed beautifully.

The next step is putting together graphics library code. How the graphics library is set up will depend on answers to the above questions, as I'll need to design the architecture a little differently if I eventually want to integrate it into a dedicated GPU using the second core.

raw void
#

The second core is used for USB host

#

We have some 2d drawing routines in bitmaptools too

junior atlas
#

Ah, that makes sense. Alright, I guess I won't do that then. All of my use cases need USB host for keyboard/mouse/game controller stuff. (And I'm sure I'm not the only one!)

I'll hold off on the 3D stuff until everything else is done then, and I'll do some benchmarking to see if it is feasible once that is done. (If Doom can run on it, I think basic 3D should work fine...)

I do plan on looking through Adafruit_GFX and some of the other existing stuff to see what I can get from there. I can write up all of the graphics stuff myself, but if I can save time by getting it from existing Adafruit (or similarly licensed) code, that would be awesome.

I think my top priority right now, aside from designing the graphics library architecture and doing a little clean up, is going to be loading PNG images. If you know of an existing library for that, that is written in C, that would be really helpful. (This also means SD card reading. I know CircuitPython has that, but I'll need to be able to do it in C during this phase.) If I don't have to, I'd rather not learn a bunch of image formats. If I have to for PNG, I will. I definitely won't write loading code for other formats myself (unless they are absurdly simple and fairly common), but I'm willing to add support if there is preexisting C code for doing it that I can use.

Once image loading is done, then I'll start on 2D drawing functions.

junior atlas
#

Just did some timing tests. I have a simple program that renders a horizontal line that moves down the screen, wrapping when it gets to the bottom.

If I ramp the SPI speed all the way up (62.5MHz, the max for both the RP2350 and the display), I can get around 22FPS consistently without losing time. I tried setting cmake to use Release mode (uses more aggressive compile optimizations), and I'm getting around 25FPS.

It might be possible to get slightly higher. My timing code uses a busy-wait calling time_us_64() every iteration. If the final call adds enough overhead, it could be costing as much as a frame or so. (I kind of doubt it though...) Setting a timer for the remaining frame time and blocking until it triggers might get better results, but I doubt they would be much better.
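
One way to structure that kind of frame pacing, sketched without any hardware calls so it can be checked off-target. The `frame_pacer_t` type and its functions are made up. The caller passes in the current time (for example from `time_us_64()`) and then busy-waits or sleeps for the returned number of microseconds. The deadline is advanced by a fixed step rather than recomputed from "now", so a late frame is counted instead of silently stretching the schedule.

```c
#include <stdint.h>

typedef struct {
    uint64_t next_deadline_us; /* absolute time the current frame should end */
    uint32_t frame_us;         /* target frame period in microseconds */
    uint32_t late_frames;      /* frames that finished after their deadline */
} frame_pacer_t;

void pacer_init(frame_pacer_t *p, uint64_t now_us, uint32_t frame_us)
{
    p->next_deadline_us = now_us + frame_us;
    p->frame_us = frame_us;
    p->late_frames = 0;
}

/* Call once per frame after all work is done.
 * Returns how long to wait before starting the next frame (0 if late). */
uint64_t pacer_end_frame(frame_pacer_t *p, uint64_t now_us)
{
    uint64_t wait_us = 0;
    if (now_us < p->next_deadline_us) {
        wait_us = p->next_deadline_us - now_us;
    } else if (now_us > p->next_deadline_us) {
        p->late_frames++;
    }
    p->next_deadline_us += p->frame_us;
    return wait_us;
}
```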

I was hoping for a bit higher, but 20FPS to 25FPS isn't bad for something like this that doesn't have a dedicated graphics chip. For fairly simple real-time games, 20FPS should leave enough CPU time for game logic and input handling, and given the limited speed of the RP2350, more complex games are probably not a viable option anyway.

The one thing I'm worried about is rendering raster graphics. If they have to be stored in the PSRAM, the speed of blitting them to the framebuffer might be a problem.

I think I'm going to start with algorithms for drawing line graphics and C-embedded images before worrying about PNGs on a MicroSD card. Blitting images from program memory should be faster than external PSRAM, and if that isn't especially fast, I don't see much point dealing with SD card graphics loaded into PSRAM. (If you can tolerate framerates that low, your application is almost certainly better suited to displayio anyway.)

#

Oh, I've added the beginning elements necessary to allow it to be used with different drivers.

I'm also considering adding the ability to set a "viewport" to render to, that will draw to a smaller portion of the screen. This will allow for getting better framerates when some parts of the screen need to be updated less often than others. (Similar to how displayio handles partial updates but using lower level mechanics and writing to the display using DMA.)

junior atlas
#

Just tried time_us_32() (and changed all of the math to 32 bit), to see if the overhead of time_us_64() is significant. They both start losing time above about 25.64FPS.

Another note: I don't know how much time is being spent blocking, waiting for DMA to finish. The way I have it set up, if it gets to sending the next frame and the previous one is not finished sending, I have it block until the DMA channel is done sending. Given the rate of sending, it should only very rarely be blocking, and if it's going at only 25FPS, it should never be blocking. If the functions for checking that have much overhead though, they could be contributing to the low framerates. If I can't get it above 25FPS though, there's no point testing in the first place, because the time between sending frames is long enough that the DMA can't still be sending data.

So there is likely still room for optimization. I'm going to start by commenting out the line drawing and framebuffer clearing to get an idea of how much of the time that is costing.

#

No difference. Overwriting the entire framebuffer with 0x00 (clearing it to black) and then calculating and rendering a horizontal line (a very simple for loop) doesn't make a single frame of difference.

I'll try commenting out the DMA blocks next.

#

No difference there either. Not a surprise. That means the functions for checking and blocking have minimal overhead when the DMA channel is idle.

There's one more place I might be able to recover some cycles. Before every frame is sent, I have to do some non-DMA SPI writes to reset the frame. Currently my code resets the dimensions, then sends RAMWR, but it only needs to reset the dimensions if they've changed. RAMWR should be enough on its own if the framebuffer dimensions haven't changed. I don't know that this will make a very big difference, but it's 10 bytes of blocking SPI write that can be avoided far more than 99% of the time.
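
A small sketch of that check, with made-up names. The driver remembers the last address window it sent and only resends the dimensions when they differ, so the usual full-frame case goes straight to RAMWR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Last address window sent to the display. Zero-initialize before first use. */
typedef struct {
    uint16_t x0, y0, x1, y1;
    bool valid; /* false until a window has been sent at least once */
} window_cache_t;

/* Returns true when the dimensions must be (re)sent before RAMWR,
 * and records the new window. Returns false when they are unchanged. */
bool window_needs_update(window_cache_t *c,
                         uint16_t x0, uint16_t y0, uint16_t x1, uint16_t y1)
{
    if (c->valid && c->x0 == x0 && c->y0 == y0 && c->x1 == x1 && c->y1 == y1) {
        return false;
    }
    c->x0 = x0;
    c->y0 = y0;
    c->x1 = x1;
    c->y1 = y1;
    c->valid = true;
    return true;
}
```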

#

Nope, no difference there either. 39000us per frame is good (just over 25FPS), but 38000us per frame (just over 26FPS) is losing time.

#

Well darn, I've checked the time before and after the blit function that triggers the DMA send, and it's returning an 8 to 9 microsecond time cost. So where's the rest of the time going?

I'm going to do the same thing with everything else in the loop to see where those 39000us are going. (At least I know that the DMA is doing its job though!)

junior atlas
#

At ~25FPS, I'm seeing most of the time (~25570us) spent in the busy wait. The fill operation (clearing framebuffer) is taking about 4,622us. That means there's plenty of time for game logic and a reasonable amount of graphics rendering.

Turning the framerate up to ~26FPS the blit cost increases to around 25500us, and the busy wait drops to 2 or 3. So yeah, this suggests that DMA is blocking because the previous frame isn't done. I'm going to profile the drivers to see exactly where the time is being spent.

junior atlas
#

Yeah, all of the time is going to dma_channel_wait_for_finish_blocking(dma_tx);

I'm not sure why all of the time goes to the wait. It seems to me like a lot of it should still be consumed by the busy wait, and then the DMA blocking should just take up whatever more is needed to finish, but I guess it doesn't make a huge difference. The framerate limit is 25FPS, but at that framerate the DMA is handling the frame transmission transparently in the background, leaving all of the CPU time but a few microseconds for rendering and game logic. NTSC is only about 30FPS, and that was fine for console games for several decades. 25FPS is still pretty good for this particular use case, especially given that almost none of the CPU time is being spent on the SPI transmission.

Alright, I think this is good. On to embedded bitmap graphics and vector drawing.

junior atlas
#

Embedded bitmap rendering works now, along with a large subset of the drawing functions I wanted. I still need to add filled rectangles (trivially easy) and filled triangles (not so trivial).

I think this is far enough along now to make the GitHub repo public. I've released it under the MIT license, so that it is license-compatible with existing Adafruit code.

https://github.com/Rybec/FruitJamOS

(Not technically an OS at this point, but since it's likely to end up being the long term repo for that project, that's what I've called it. CircuitPython adaptations will get their own repo once I get around to that.)

I haven't profiled the bitmap rendering yet, but I will eventually. This library includes transparency, where the transparent color is 0x20 (0 red, 0 blue, 1 green). I also wrote a Python program for converting image files into 16 bit C code. If it encounters opaque 0x20, it changes it to 0x40 (green 2) to preserve color as well as possible.
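
A minimal sketch of color-keyed blitting as described, not the code from the repo. The names are made up, and the remap step is shown in C here even though the actual converter is a Python program.

```c
#include <stdint.h>

#define GFX_KEY_COLOR      0x0020u /* transparent: green = 1, red = blue = 0 */
#define GFX_KEY_SUBSTITUTE 0x0040u /* nearest opaque stand-in: green = 2 */

/* Used at image conversion time: keep opaque pixels from colliding with the key. */
uint16_t gfx_avoid_key(uint16_t pixel)
{
    return (pixel == GFX_KEY_COLOR) ? (uint16_t)GFX_KEY_SUBSTITUTE : pixel;
}

/* Copy src onto dst with its top-left corner at (x0, y0).
 * Pixels equal to the key color are skipped; anything off-surface is clipped. */
void gfx_blit_keyed(uint16_t *dst, int dst_w, int dst_h,
                    const uint16_t *src, int src_w, int src_h,
                    int x0, int y0)
{
    for (int sy = 0; sy < src_h; ++sy) {
        int dy = y0 + sy;
        if (dy < 0 || dy >= dst_h) {
            continue;
        }
        for (int sx = 0; sx < src_w; ++sx) {
            int dx = x0 + sx;
            if (dx < 0 || dx >= dst_w) {
                continue;
            }
            uint16_t px = src[sy * src_w + sx];
            if (px != GFX_KEY_COLOR) {
                dst[dy * dst_w + dx] = px;
            }
        }
    }
}
```

The per-pixel compare is the part worth profiling: rows without any transparent pixels could be copied in bulk instead.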

I do need to add some documentation comments, but this is a solid start, and it's working very well on my Fruit Jam!

deft linden
junior atlas
#

Yes, I am. In fact, I'll add a README (I need to anyway) and note at the beginning that if people are looking for the official Adafruit one, they should go to that link.

#

(And maybe I'll eventually change the name, once I come up with something that is less likely to cause confusion.)

junior atlas
#

I've added the README file, with the aforementioned note very close to the top.

deft linden
#

once folks are back from holiday, maybe Dan and Scott could offer some ideas on how your work might fit in nicely with the existing namespacing. It would be cool if it ends up being a thing that could be used alongside of, or as an alternative to, the regular displayio options.

#

seems like what you're doing might have some potential to get merged into the core as a module.

#

In that case, it would potentially apply to a lot more than just Fruit Jam.

junior atlas
#

In theory it should work for any RP2350, though right now it only works on SPI1. I've already built in a little bit of stuff to allow the use of different controllers though, and I think making adjustments to allow SPI0 to be used wouldn't be too difficult. Overall, I'm doing my best not to paint things into a corner, so that it can be expanded in the future to cover more interfaces and more display controllers.

junior atlas
#

Found why the framerate is close to half what I expected:

SPI1 baud: 37500000

Supposedly both the RP2350 and the ST7789 support up to 62.5MHz, and it's running at only 37.5MHz.

I'm using whatever the default clock speed of the RP2350 is. I wonder if that clock speed is too slow to do SPI at maximum speed.

#

Nope, system clock is running at 150MHz.

junior atlas
#

Ok, I figured it out:

The SPI clock speed is set as a fraction of (if I understand right) the peripheral clock. By default, the peripheral clock is set to the same speed as the system clock: 150MHz. 62.5MHz is not selectable when the peripheral clock is set to 150MHz, because 150MHz cannot be divided by any integer power of 2 to get that speed. The SDK thus rounds down, and 150MHz / 4 = 37.5MHz.

If I set the SPI baud to 75MHz, then it works. But 75MHz isn't guaranteed to work, and it's also beyond what the display is rated for.
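
A simplified model of the divider, to show where 37.5MHz comes from. As far as I understand the PL022-style SPI block, the bit clock is the peripheral clock divided by an even prescaler (2 to 254) times a post-divider (1 to 256), so the smallest divisors are 2, 4, 6, and so on. This is a brute-force sketch with a made-up name, not the SDK's own search.

```c
#include <stdint.h>

/* Highest SPI clock that does not exceed target_hz, where the clock is
 * clk_peri_hz / (prescale * postdiv), prescale is even in 2..254 and
 * postdiv is in 1..256. Returns 0 if no setting is slow enough. */
uint32_t spi_best_baud(uint32_t clk_peri_hz, uint32_t target_hz)
{
    uint32_t best = 0;
    for (uint32_t prescale = 2; prescale <= 254; prescale += 2) {
        for (uint32_t postdiv = 1; postdiv <= 256; ++postdiv) {
            uint32_t baud = clk_peri_hz / (prescale * postdiv);
            if (baud <= target_hz && baud > best) {
                best = baud;
            }
        }
    }
    return best;
}
```

With a 150MHz peripheral clock, a 62.5MHz request lands on 150 / 4 = 37.5MHz because 150 / 2 = 75MHz is too fast, while a 75MHz request is met exactly. With a 125MHz peripheral clock, 125 / 2 = 62.5MHz is reachable.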

junior atlas
#

Ok, profiling tests:

First, I am getting some graphic corruption at the display, but only at very high frame rates.

I can get it up to 51 fps, with 10 to 15 horizontal lines that stay black instead of displaying the frame data. At 50 fps exactly, it's only 2 or 3 lines. At ~45 fps (22,000us per frame), there's no graphical corruption at all. I seem to be right at the limits of the controller, but since it is rated for 62.5MHz and not 75MHz, I wouldn't bet every other ST7789 has exactly the same limit. (Or: This suggests the corruption might be an artifact of the DMA controller rather than the display itself, since the SPI baud isn't actually changing. It only misses lines when the DMA blocking wait actually has to wait.)

So here's the question: What else depends on the peripheral clock? HSTX? I assume the wifi coprocessor is dependent on it, since it is on SPI0. How about the audio system? It's on I2S, isn't it? (Or is it using a separate DAC..., and if so, what is that using?) I assume it uses the peripheral clock as well. Does QSPI use the peripheral clock too?

Here's what I'm thinking: If I can clock the peripheral clock down to 125MHz, then I can get exactly 62.5MHz on SPI1. That is the maximum specification for both the RP2350 and the display, so it should be completely stable. There's enough overhead on the SPI/DMA pipeline that it won't get the full 50 fps, but it should be enough for 35 to 40 fps.

That's not quite doubling the framerate by reducing the peripheral clock speed. If everything else dependent on it can tolerate a 1/6th reduction in clock speed, that would be a huge improvement in framerate!

junior atlas
#

Actually, I'm getting the full 50fps out of it now, without DMA blocking, or at least without it blocking for long enough to be noticeable. I added code to measure the actual framerate, and once it has a little time to average out, it's hitting exactly 50.

So, if we can set the peripheral clock to 125MHz, it can get the full 50 fps without any trouble, and of course, that leaves just under 20ms per frame to handle rendering, other I/O, and game/program logic. At 150MHz, that's not bad at all!

(50 fps is only a little lower than the Game Boy Advance, which did just under 60 fps. On top of that, the RP2350 has more internal RAM than the GBA has total RAM, even if we include its dedicated VRAM for framebuffer, tiles, and sprites. The RP2350 also has a clock speed around 8 or 9 times as fast. So with 50 fps, the framerate is in the same ballpark and everything else is way better. That means this is 100% viable as a console system!

Oh right, I also found some stuff strongly suggesting that the RP2350 can do execute-in-place with the PSRAM, which would allow for loading executables from microSD cards and running them, though this would likely be significantly slower than loading into internal RAM and running from there.)

#

I've pushed an update that sets clk_peri to 125MHz and runs the display at 50 fps.

junior atlas
#

I've decided that I'm not going to do the hardware level viewport stuff in the graphics library. I've included some basic viewport code in the ST7789 driver (which might be useful for improving the speed of some of the stuff displayio does), but now that I've got it doing 50 fps with DMA, blitting smaller regions at a time has little value. It won't save any time. There's no need for faster than 50 fps, and due to the SPI speed limit, blitting smaller regions wouldn't be able to get that much faster than 50 fps anyway. It could save memory, but only if you are using the driver directly, instead of going mostly through the graphics library. Viewports in the graphics library can already be done by creating a new surface the size of the viewport you want and then rendering that to the framebuffer before blitting. I do need to add some stuff to make creating new surfaces easier though, so I will definitely do that.

I might add software viewports eventually, where a viewport is essentially a view into the memory of a sub-region of an existing surface. This would allow for some interesting rendering tricks, but it's not a priority.

junior atlas
#

Important note I should probably mention at this point:

This is something I was peripherally aware of, but it wasn't important until right around now. Using DMA to send the framebuffer contents to the display does not magically freeze a copy of the framebuffer memory. This means that if you render to the framebuffer memory while the DMA transfer is happening, you'll have high odds of tearing. In fact, I mentioned before that when I tried to push the game loop faster than the maximum framerate determined by the SPI speed, I was getting rows of black pixels. It's very likely this was caused by the next frame starting to render by clearing the framebuffer to black while the DMA was transferring the contents of that memory.

Traditionally this is resolved with double buffering. Two buffers are maintained: one is being blitted to the display while the other is being rendered to. When the next blit is triggered, their pointers are swapped, so that the one just rendered to is sent to the display and the one that just finished sending can be rendered to. The cost of this is doubling the framebuffer memory required, which puts the total memory cost in this instance at more than half of the internal RAM of the RP2350.
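A minimal sketch of the pointer swap, assuming a 320x240 RGB565 display. The type and function names (`framebuffers_t`, `framebuffers_swap`) are hypothetical and not taken from the actual driver; the DMA start and wait calls are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FB_WIDTH  320 /* assumed display size */
#define FB_HEIGHT 240

/* Two full-screen RGB565 buffers: one is drawn into while the other is
 * being sent to the display. */
static uint16_t buffer_a[FB_WIDTH * FB_HEIGHT];
static uint16_t buffer_b[FB_WIDTH * FB_HEIGHT];

typedef struct {
    uint16_t *front; /* currently being transferred to the display */
    uint16_t *back;  /* currently safe to render into */
} framebuffers_t;

void framebuffers_init(framebuffers_t *fb)
{
    assert(fb != NULL);
    fb->front = buffer_a;
    fb->back = buffer_b;
}

/* Swap roles once rendering is done and the previous transfer has finished.
 * Returns the buffer that should be handed to the next DMA transfer. */
uint16_t *framebuffers_swap(framebuffers_t *fb)
{
    uint16_t *finished = fb->back;
    fb->back = fb->front;
    fb->front = finished;
    return fb->front;
}
```

The two static arrays together take 307,200 bytes at this size, which is where the memory cost comes from.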

The reason tearing was not visible prior to this was essentially good race conditions. The very last thing my code does before the busy-wait for the next frame is trigger the DMA blit. Since there is no "game logic" and very limited rendering, the vast majority of the time is spent in the busy wait, and when it finishes, the blit is almost complete. If there was significant "game" and rendering logic though, obvious tearing would definitely be visible. This could be avoided by running at a lower framerate, so that the blit always finishes before the rendering starts (but after the "game" logic), but this requires wasting cycles. This means I'm going to have to implement a double buffering option, for things that need top speeds.

deft linden
#

You might consider if there's some way to shoehorn a mechanism like JavaScript's Window.requestAnimationFrame() into the CircuitPython VM. Maybe something that worked a lot like time.sleep() but that would wake up at a particular phase of the frame refresh cycle so you could use a single buffer beam-chasing style.

MDN Web Docs

The window.requestAnimationFrame() method tells the browser you wish to perform an animation. It requests the browser to call a user-supplied callback function before the next repaint.

junior atlas
#

Here's how I think it would have to work in CircuitPython: First, you would have to run it slower than 50 fps. How much slower depends on how long rendering takes. The longer rendering takes, the slower the framerate has to go.

I've got a function in the SPI driver that blocks until DMA is finished. The typical "game loop" does input handling first, game/program logic second, and then rendering third. (This minimizes input lag.) You would use this same pattern, but between the game logic step and the rendering step, you put in the DMA block. This allows input handling and game logic to happen in parallel with the DMA blit to the display, and then it stops until DMA is done before doing rendering, to avoid tearing. If you tune your framerate correctly, in theory you can get it so that you waste minimal cycles in the DMA block.
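The ordering described above can be sketched on a host machine with stub functions that only record which step ran. All function names here are hypothetical stand-ins, not the real driver API.

```c
#include <assert.h>
#include <string.h>

/* Trace of the steps executed, so the ordering can be inspected on a host. */
static char trace[64];

static void step(const char *name) { strcat(trace, name); }

/* Hypothetical stand-ins for the real functions. */
static void handle_input(void)      { step("I"); }
static void update_logic(void)      { step("L"); }
static void wait_for_dma_done(void) { step("W"); } /* would block until the blit ends */
static void render_frame(void)      { step("R"); }
static void start_dma_blit(void)    { step("B"); } /* would return immediately */

void run_frames(int count)
{
    assert(count >= 0 && count <= 12); /* keeps the trace inside its buffer */
    trace[0] = '\0';
    for (int i = 0; i < count; i++) {
        handle_input();       /* overlaps with the previous frame's blit */
        update_logic();       /* also overlaps with the blit */
        wait_for_dma_done();  /* framebuffer is safe to touch after this */
        render_frame();
        start_dma_blit();
    }
}
```

On real hardware only `wait_for_dma_done()` would cost idle cycles, and only when input and logic finish before the blit does.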

For most applications 30 fps is probably fine, but it depends largely on rendering load. I was measuring around 4ms just for clearing the screen. With the blit taking about 20ms, 40 fps (25ms per frame) only gives around 5ms of rendering time, while 30 fps gives 13.3ms. 35 fps gives about 8.5ms, so as long as the rendering math isn't much more expensive than just writing all 0s to the framebuffer, you get away with it.
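The arithmetic behind those numbers, as a small sketch. It assumes the blit takes about 20ms (the 50 fps ceiling) and that rendering must not overlap with it; `render_budget_us` is an illustrative name.

```c
#include <assert.h>

/* Time left for rendering in one frame, in microseconds, assuming the blit
 * takes blit_us and rendering must not overlap with it. */
long render_budget_us(long target_fps, long blit_us)
{
    assert(target_fps > 0);
    long frame_us = 1000000L / target_fps; /* frame period, truncated */
    return frame_us - blit_us;
}
```

With a 20,000us blit this gives 5,000us at 40 fps, 8,571us at 35 fps, and 13,333us at 30 fps.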

That said, I'm clearing the framebuffer with a for loop right now. For clearing to black, supposedly memset() is significantly faster on the RP2350. So I'm planning to add a conditional to the fill function that will use memset() if the upper and lower bytes of the color are the same. Any time that saves on buffer clearing can be added to the rendering time.
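A sketch of what that conditional could look like, assuming a 16-bit RGB565 framebuffer. The function name `fill_rgb565` is hypothetical, and whether memset() actually wins on the RP2350 would still need measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void fill_rgb565(uint16_t *fb, size_t pixel_count, uint16_t color)
{
    assert(fb != NULL);
    uint8_t hi = (uint8_t)(color >> 8);
    uint8_t lo = (uint8_t)(color & 0xFF);
    if (hi == lo) {
        /* Every byte of the buffer is identical, so memset() applies. */
        memset(fb, lo, pixel_count * sizeof(uint16_t));
    } else {
        /* Mixed bytes: fall back to writing one pixel at a time. */
        for (size_t i = 0; i < pixel_count; i++)
            fb[i] = color;
    }
}
```

Black (0x0000) and white (0xFFFF) both take the memset() path, along with any other color whose two bytes happen to match.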

So now I have another question: How much of the internal RAM does CircuitPython need to use? If it's more than ~200k, double buffering is not even an option in CircuitPython, because there just isn't enough memory. I'm still going to add it for the C side of my project though.

#

Of course, you are right that you could render to the top part of the screen while the DMA is finishing the bottom, if you can time it right for that.

junior atlas
#

One way to do something like you've described the JS thing doing is using interrupts. Instead of using one DMA channel for blitting, you might chain two together, one for the top half of the screen and the other for the bottom half. Then set up an interrupt that triggers when the top half finishes, and use that to signal rendering for the next frame of the top half. A second interrupt for the second channel could do the same for the bottom half. The downside is that rendering functions would need extra logic to handle anything crossing the midline, which would cost extra rendering cycles.

junior atlas
#

Actually, I've already got some driver level stuff that could be used to mitigate this issue. My driver supports hardware viewports. Instead of having a single framebuffer for the full screen, a user could create smaller "viewport" framebuffers, that can each be blitted to the display independently. This way they could divide the screen into arbitrary rectangles, each with a separate buffer, and while each one is blitting, do the rendering for a different one. If one finishes rendering before the previous finishes blitting, a DMA blocking wait will prevent issues.

I could even add driver functions to make this particular usage pattern a bit smoother. It will come at a small additional cost, as the display's write frame will have to be set at the beginning of each blit, but that's...I think 10 bytes of blocking SPI communication, so it's not a lot of lost time.
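For reference, a sketch of the bytes involved in setting the write window on an ST7789, using the controller's CASET (0x2A) and RASET (0x2B) commands. The function name `st7789_window_bytes` is hypothetical and the actual SPI writes and D/C pin handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ST7789_CASET 0x2A /* column address set */
#define ST7789_RASET 0x2B /* row address set */

/* Writes the CASET and RASET commands plus parameters for an inclusive
 * window into out[10]. Bytes 0 and 5 are commands (sent with D/C low);
 * the rest are big-endian start/end coordinates (sent with D/C high). */
size_t st7789_window_bytes(uint16_t x0, uint16_t y0, uint16_t x1, uint16_t y1,
                           uint8_t out[10])
{
    assert(x0 <= x1 && y0 <= y1);
    out[0] = ST7789_CASET;
    out[1] = (uint8_t)(x0 >> 8);
    out[2] = (uint8_t)(x0 & 0xFF);
    out[3] = (uint8_t)(x1 >> 8);
    out[4] = (uint8_t)(x1 & 0xFF);
    out[5] = ST7789_RASET;
    out[6] = (uint8_t)(y0 >> 8);
    out[7] = (uint8_t)(y0 & 0xFF);
    out[8] = (uint8_t)(y1 >> 8);
    out[9] = (uint8_t)(y1 & 0xFF);
    return 10;
}
```

A memory write command (RAMWR, 0x2C) follows before the pixel data, so the per-blit overhead comes to roughly 10 or 11 bytes.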

junior atlas
#

Alright, added double buffering option, activated using a flag when initializing the display.

I'm considering doing some refactoring of the SPI and ST7789 drivers to make this a bit cleaner. During DMA configuration, you normally set the buffer address, but you can change it to pretty much any valid address arbitrarily. It might make more sense to separate out the buffer address setting and handle it explicitly at each DMA write.

junior atlas
#

First step of refactoring done and pushed. I changed the SPI and ST7789 driver to make it easier to blit arbitrary buffers to the display. This makes things like double buffering and rendering to subregions of the display much easier. It also reduced the memory required for keeping track of driver level "private" variables.

The second step is going to be removing interface hardcoding. In the short run this will mean that someone using a non-Fruit Jam RP2350 board can choose which hardware SPI port to use when initializing the driver. In the long run this will make it possible to add drivers for other interface types (software SPI, I2C, HSTX...) and it should make it at least a little easier to port it to other chips without having to modify the higher level modules as much or perhaps at all.

junior atlas
#

Ok, refactoring done, and updates pushed! It should now be fairly easy to add new interfaces and display controllers that can be selected when initializing the graphics system.

I've also used a bunch of #ifdefs to make it easy to select the display capabilities you want, to avoid compiling in a ton of code that you'll never use. Because of the way switch statements are used to select the correct code for the chosen driver, it's likely that the compiler won't automatically exclude driver code that isn't used, so instead you have to #define (I did it in the CMake file) the drivers you want to include, and they will be compiled in while others won't.
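A small sketch of the pattern, with made-up macro and enum names (`GFX_WITH_ST7789`, `gfx_driver_t`, and so on) that are not from the real code. Each case in the switch only exists if its driver was enabled at build time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build flags; in practice these would come from the build
 * system (for example a compile definition in the CMake file). */
#define GFX_WITH_ST7789 1
/* #define GFX_WITH_HSTX_DVI 1 */

typedef enum { GFX_DRIVER_ST7789, GFX_DRIVER_HSTX_DVI } gfx_driver_t;

const char *gfx_driver_name(gfx_driver_t driver)
{
    switch (driver) {
#ifdef GFX_WITH_ST7789
    case GFX_DRIVER_ST7789:
        return "st7789";
#endif
#ifdef GFX_WITH_HSTX_DVI
    case GFX_DRIVER_HSTX_DVI:
        return "hstx-dvi";
#endif
    default:
        return NULL; /* driver not compiled in */
    }
}

/* Fails loudly if the requested driver was left out of the build. */
const char *gfx_driver_require(gfx_driver_t driver)
{
    const char *name = gfx_driver_name(driver);
    assert(name != NULL);
    return name;
}
```

Because the excluded cases are removed by the preprocessor, the compiler never sees the driver code behind them.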

Of course, I only have one display driver currently, so this has not been tested against any others, but there's no reason it shouldn't work, provided the driver code is written correctly.

Next, I'm going to work on the HSTX/DVI driver. I think I understand it well enough to write it (or maybe to adjust and adapt the Adafruit driver to support everything I want).

#

After that, I intend on writing a 3D rendering engine. I'm not sure if the RP2350 can do floating point fast enough, but if it can't I should be able to do it with fixed point. I'm not sure how far I'll get into it. Given how well it renders lines, rendering solid filled triangles shouldn't be an issue. Gradient filled triangles should also be extremely fast. Where I'm not sure is the per-triangle matrix math.

I already have solid C code for this, from a 3D renderer I wrote a long time ago. My current code uses 32-bit floating point. I think the precision of 32-bit floating point is massive overkill though, and I suspect 16-bit fixed point would be perfectly acceptable, not to mention much faster. If that works fast enough, it should be sufficient for basic 3D graphics.
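As an illustration of what 16-bit fixed point could look like, here is a sketch that assumes a Q8.8 split (8 integer bits, 8 fractional bits). The split and the names are assumptions; the actual renderer might pick a different format.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q8_8_t; /* signed, 8 integer bits, 8 fractional bits */

#define Q8_8_ONE 256 /* the value 1.0 in Q8.8 */

q8_8_t q8_8_from_int(int v)
{
    assert(v >= -128 && v <= 127); /* range of the integer part */
    return (q8_8_t)(v * Q8_8_ONE);
}

/* Multiply in 32 bits, then drop the extra 8 fractional bits.
 * The caller is responsible for keeping the result within range. */
q8_8_t q8_8_mul(q8_8_t a, q8_8_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (q8_8_t)(product / Q8_8_ONE);
}
```

For example, 1.5 (384) times 2.0 (512) comes out as 3.0 (768). The narrow range is the main tradeoff, so vertex coordinates would have to stay small.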

If it's fast enough for basic shaders, and there's enough memory for a z-buffer (maybe I can get away with a 4-bit z-buffer, instead of 8-bit, otherwise memory will be a problem), then I should be able to do basic lighting and maybe shadows, as well as textures. I don't expect it will handle a lot of polygons at once at this level of complexity, but it might be able to manage better than SNES StarFox 3D rendering at an acceptable speed.

Anyhow, my plan is to go step by step adding features until I hit a point where it clearly can't handle any more. If I'm lucky, I'll be able to manage low poly graphics with decent textures, lighting, and maybe even shadows.

junior atlas
#

Quick update: I wrote up an initial HSTX/DVI driver, but it's not working. In fact, it seems to be stalling the main program. I'm not sure if my interrupt code is taking too long (I can't imagine it would be, as it has no loops and isn't very long) or something else. It's possible it's interfering with USB serial and stalling the "heartbeat" printf() in my main loop. It's not displaying anything though, so something is wrong.

So rather than trying to debug a significant chunk of code all at once the hard way, I've taken the pico-examples DVI code, adjusted the pin mapping for the Fruit Jam, and now I'm working on modifying it step-by-step until it is identical to my driver, testing each step to figure out where it stops working. This is going well so far! From there I should be able to fix my driver and complete the final elements.

junior atlas
#

Somewhere way back in here we were talking about DMA chunk sizes. I tried doing 32-bit chunks for the ST7789 driver, and it failed, but 8-bit chunks worked. I've figured out what the problem was, and it's also relevant to HSTX pixel doubling.

The DMA does one read chunk per transaction for things like SPI. It reads into a 32-bit register, then the SPI (or probably also UART and I2C) peripheral takes what it needs for one transaction and sends it. Then the DMA does the same thing again. So, I'm sending 8-bits per SPI transaction. If I set the DMA chunk size to 32-bits, it will read 32 bits from my framebuffer, the SPI peripheral will transmit the bottom 8 bits to the display, then the DMA will read 32 more bits, and so on. Three bytes of every four are ignored this way. If I have the DMA do 8-bit chunks, it will read 8 bits, which will be copied into each byte of the 32-bit register, then the SPI will take the lowest 8 bits (which is the full 8 bits read) and send those, and so on, correctly sending all of the data.
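A conceptual host-side model of that behavior. It is only a simulation of the description above, not hardware code, and `dma_to_spi8` is a made-up name. A byte-sized read is replicated across the 32-bit value, a word-sized read fills it, and the SPI keeps only the low 8 bits each time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conceptual model: each DMA transfer reads `size` bytes (1 or 4) from src
 * and presents a 32-bit value to the SPI data register; the SPI, set up
 * for 8-bit frames, sends only the low 8 bits. Returns bytes "sent". */
size_t dma_to_spi8(const uint8_t *src, size_t len, size_t size, uint8_t *sent)
{
    assert(size == 1 || size == 4);
    size_t n = 0;
    for (size_t off = 0; off + size <= len; off += size) {
        uint32_t bus;
        if (size == 1) {
            /* A byte read is replicated across all four byte lanes. */
            bus = src[off] * 0x01010101u;
        } else {
            /* A word read (little-endian) fills the whole register. */
            bus = (uint32_t)src[off] | ((uint32_t)src[off + 1] << 8) |
                  ((uint32_t)src[off + 2] << 16) | ((uint32_t)src[off + 3] << 24);
        }
        sent[n++] = (uint8_t)(bus & 0xFF);
    }
    return n;
}
```

Feeding it eight bytes with `size` set to 1 sends all eight; with `size` set to 4 only the first byte of each word goes out.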

Now, I haven't had a ton of time recently to work on this, but I just started digging into doing pixel doubling with the HSTX. The DMA works similarly with it, except that you can set it to shift and send again with the same 32-bit register data before doing another DMA read. So, you can grab say, 32 bits of 16-bit pixel data, send the first 16 bits, rotate 16 bits, then send the second 16 bits before reloading. And you can do that all the way down to 1 bit at a time, with 32 rotates.

But, you could instead set the DMA to 16-bit chunks. Then it will load the next 16 bits into both the top and bottom half of the register. Then, if you wanted, you could send 16 bits, then rotate or not (it doesn't matter, since the data is the same), and send the same pixel again. You can't do this if you read 32 bits, because that would need a pattern of: send twice with no rotate, rotate 16 bits, then send twice again with no rotate. This works with 8 bits too.
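The same idea as a conceptual host-side model, assuming 16-bit pixels. It only mimics the replicate-then-send behavior described here and is not HSTX configuration code; `expand_line16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conceptual model of horizontal pixel multiplication: each 16-bit DMA read
 * leaves the same pixel in both halves of the 32-bit word, so the low
 * 16 bits can be sent `repeat` times before the next read. */
size_t expand_line16(const uint16_t *src, size_t pixels, unsigned repeat,
                     uint16_t *out)
{
    assert(repeat >= 1);
    size_t n = 0;
    for (size_t i = 0; i < pixels; i++) {
        uint32_t word = ((uint32_t)src[i] << 16) | src[i]; /* replicated */
        for (unsigned r = 0; r < repeat; r++) {
            out[n++] = (uint16_t)(word & 0xFFFF);
            word = (word >> 16) | (word << 16); /* rotate by 16: no change */
        }
    }
    return n;
}
```

Because both halves of the word hold the same pixel, the rotate has no visible effect, which is what makes the repeat count free.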

...

#

You could also set it to go three or four or more times with either 8 or 16, because only a single pixel of data is in the buffer, so you can send it as many times as you want (up to 32 times) before loading the next pixel.

Sadly, this still doesn't allow for things like packed 6 or 3 bit color. To do those, one would have to add a line buffer where each pixel is a power of 2 number of bits (or for pixel multiplying, 8, 16, or 32 bits), then in the DMA interrupt unpack the next line into this line buffer. Then the line buffer is DMAed instead of the framebuffer. This adds a lot of computation to the interrupt which can easily consume a lot of CPU time.

(For vertical pixel multiplication, you can just change the math for working out the next line of the buffer, so that it divides the scanline number by the multiplication factor and truncates. Either use integer division for this, or for powers of 2 use right shifts, though the compiler will likely reduce those divisions to shifts anyway.)
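A sketch of that line lookup, assuming a 16-bit framebuffer stored row by row. `row_for_scanline` is an illustrative name, not the driver's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the start of the framebuffer row to send for a given output
 * scanline when every source row is repeated `vscale` times. */
const uint16_t *row_for_scanline(const uint16_t *fb, size_t fb_width,
                                 unsigned scanline, unsigned vscale)
{
    assert(vscale >= 1);
    size_t src_row = scanline / vscale; /* integer division truncates */
    return fb + src_row * fb_width;
}
```

With `vscale` set to 2, output scanlines 0 and 1 both map to source row 0, and scanline 479 maps to source row 239.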

junior atlas
#

Anyhow, at this point, I think I'm going to be doing 320x240 (pixel doubled 640x480) in 16-bit and 8-bit color, and 640x480 in 4, 2, and 1 bit color. Pixel doubling can be done at no additional memory or CPU cost for only 32, 16, and 8-bit color.

32-bit color even at 320x240 requires a 307.2kB framebuffer, so it can't be double buffered at all on the RP2350, and double buffering is critical for games. I don't want to try to do tearing workarounds at this point. (I might add it later, but users who want to prevent tearing will be on their own.)

16-bit and 8-bit color can both be double buffered at 320x240, with enough memory to spare for games, at least in C. (Not sure how much CircuitPython requires... 16-bit requires 307.2kB for double buffered. 8-bit requires only 153.6kB.)

4, 2, and 1 bit can't be cheaply pixel multiplied (at least, not horizontally), since the DMA can't do chunk sizes that small. This means pixel doubling would require expensive interrupt line buffering, which I don't want to do. That said, at 640x480, 4-bit color requires 307.2kB for double buffering, 2-bit requires only 153.6kB, and 1-bit requires only 76.8kB. (Since these can technically be pixel doubled vertically, I could allow for 640x240, which is close to the common 640x200 CGA/EGA on some ancient systems, but I'm not sure it is worth it. I'll try to write my code to make this easy to add, in case there's future demand.)
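The memory figures above follow from width x height x bits per pixel / 8, times the number of buffers. A small helper for checking them (`framebuffer_bytes` is an illustrative name, and it assumes rows are packed with no padding):

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes needed for `buffer_count` tightly packed framebuffers. */
uint32_t framebuffer_bytes(uint32_t width, uint32_t height,
                           uint32_t bits_per_pixel, uint32_t buffer_count)
{
    /* Rows must come out to a whole number of bytes. */
    assert(bits_per_pixel > 0 && (width * bits_per_pixel) % 8 == 0);
    uint32_t row_bytes = width * bits_per_pixel / 8;
    return row_bytes * height * buffer_count;
}
```

For example, 320x240 at 16 bits double buffered and 640x480 at 4 bits double buffered both come to 307,200 bytes.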

I'm not going to include a base resolution higher than 640x480, because I don't want to require users to overclock to make any of my library work. (Even 800x600 60Hz, the very next one up, would require overclocking to 200MHz, since the TMDS clock must be 400MHz. That said, the way I'm doing this should make it easy for users to add their own TMDS setting blocks.)