Would it be better, then, to leave madctl alone and handle rotation in software? I've had some time to think about it, and while this would be more expensive, I think I can do it in a way that isn't massively more expensive than hardware rotation. Doing it this way, I think I could build an API that requires no changes at all to the existing code. Instead, rotation would be handled as part of the initialization of the rendering buffer/surface, and the blitting code would rotate during the blit by adjusting the ordering of the for loops.
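To make the loop-ordering idea concrete, here is a rough sketch rather than a finished implementation. The function name `rotated_blit` and its parameters are hypothetical, and it assumes one byte per pixel in a row-major buffer; a real display format such as RGB565 would need two bytes per pixel. The rotation is chosen once up front as a start offset and two step sizes, so the inner loop is the same for every orientation.

```python
def rotated_blit(src, src_w, src_h, rotation=0):
    """Copy src into a new buffer, rotated clockwise by 0/90/180/270 degrees.

    Hypothetical helper: assumes src is row-major with one byte per pixel.
    Returns (dst, dst_w, dst_h).
    """
    # For each rotation, pick where reading starts in src and how far to
    # move in src per destination column and per destination row.
    if rotation == 0:
        start, col_step, row_step = 0, 1, src_w
        dst_w, dst_h = src_w, src_h
    elif rotation == 90:
        start, col_step, row_step = (src_h - 1) * src_w, -src_w, 1
        dst_w, dst_h = src_h, src_w
    elif rotation == 180:
        start, col_step, row_step = src_w * src_h - 1, -1, -src_w
        dst_w, dst_h = src_w, src_h
    elif rotation == 270:
        start, col_step, row_step = src_w - 1, src_w, -1
        dst_w, dst_h = src_h, src_w
    else:
        raise ValueError("rotation must be 0, 90, 180 or 270")

    dst = bytearray(dst_w * dst_h)
    out = 0  # destination is always written sequentially
    for dy in range(dst_h):
        i = start + dy * row_step
        for _ in range(dst_w):
            dst[out] = src[i]
            i += col_step
            out += 1
    return dst, dst_w, dst_h


# A 3x2 source:  0 1 2
#                3 4 5
pixels = bytes([0, 1, 2, 3, 4, 5])
rotated, w, h = rotated_blit(pixels, 3, 2, 90)
```

The only per-pixel cost beyond a plain copy is one extra addition, which is why this should not be massively more expensive than an unrotated blit. The writes stay sequential, so the result can be streamed to the display in its native order.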
As for what I would do in CPython, I'd do it about the same way I'm already planning. I don't really design differently for embedded versus other platforms (in Python), except where necessary to fit within the limitations imposed by embedded devices. That said, I do design for optimization more than the average programmer, regardless of platform. I have some experience in video game programming starting in the early 1990s, in DOS/QBasic, where memory was limited to 640 kB and CPU speed was on par with mid-range microcontrollers today (286 through 486). If the goal is to ensure that it will also work for Blinka, I don't think that will be a problem at all. (I'd actually love to implement a lot of what I want to add in C, but I don't have the time for that, and it would be significantly harder to avoid compatibility issues. Maybe at some point in the future, once it's done in pure Python...)
I'm not familiar with LVGL, but a quick look at the GitHub page suggests it is promising. I didn't see much about lower-level functionality, but the hardware acceleration is certainly a big benefit. And if it does provide even some basic low-level primitives, it shouldn't be too hard to build something more complete on top of them.