#Sega Dreamcast
Connect your Jaguar and Jaguar CD and insert an unencrypted Jaguar CD. Set the switch of the CD so that the CD uses the developer BIOS. Set your Jaguar so that it uses the Stubulator BIOS. Now turn on Jaguar while holding the 'C' button of the joypad.
oic now.
It's a dev thingy. Apparently lets you read unencrypted CD-Rs, but I'm not sure if the CD dev BIOS is supposed to display anything without the drive attached?
I think the Stubulator BIOS might be the thing that lets you upload small chunks of code to RAM, via one of the joyports, but I could be wrong.
If you load the Stubulator as a BIOS, then load the Jag Dev BIOS like a Cart, then hold C during loading, it shows the normal Jag boot logo.
But then nothing else.
Not sure how that works, I just thought it might give some more useful debug info.
This new core seems WAY more stable, btw.
Doesn't necessarily run AvP reliably, but oh well. lol
This is a work-in-progress video for Jaguar CD Unleashed (JagCDU), which allows CD games to be played on the Atari Jaguar without a CD unit installed. It replaces the Atari BIOS with a custom BIOS in low Jaguar RAM which knows how to map CD track locations to physical cartridge locations.
To learn more about Songbird Productions, visit http://s...
I NEED to find that code. ^
Unfortunately, that isn't likely to have any low-level details.
You may need a Time Machine to 2003 when that post was made
But I think I just found the CD BIOS source code.
I won't post the direct link, as I'm not sure about some of this stuff.
But there is the "Atari Jaguar ~ Code & Dev" group on Facebook.
This is... the whole thing.
; We got power--let's start the boot up again
;
gotpow:
lea davesobj,a0
moveq #-1,d0
move.w d0,$c(a0) ;turn off the CD module arse
move.w d0,$4c(a0)
So very British.
move.w #$8000,d1
lea BUTCH,a0
chkCDpow:
move.l (a0),d0
andi.w #$2000,d0
bne.s powok
dbra d1,chkCDpow
;
; timed out..
;
powBAD:
move.l #$20000,BUTCH ;reset CD module
move.l #0,BUTCH+4 ;clear DSA
moveq #-1,d0 ;exit with error
rts
powok:
move.w DS_DATA(a0),d0
cmpi.w #$7001,d0 ;this indicates proper function
bne powBAD ;if not--put up power message
A mask (andi) of 0x2000 would be bit 13.
--x- ---- ---- ---- Response from CD drive pending
Strange how it seems to write Long words to the CD drive.
move.l #$180000,BUTCH ;set lid-up & cart-pull reset
; move.l #$80000,BUTCH ;set lid-up & cart-pull reset
cmp.l pad_now,d1 ;check joystick
bne.s .noxchk ;br if A * # not pressed together
They had a CD-ROM emulator thingy, which ran on the Atari Falcon.
Wonder why it checks for A * # to be pressed?
;****************************************************************
; CD_setup *
; *
; This call MUST be used to initialize the CD system before ANY *
; other calls can be made *
; *
; Input: NADA *
; *
; *
; Uses: NADA *
; *
; *
; Returns: NADA *
; *
;****************************************************************
setup:
move.l #$180000,BUTCH ; enable BUTCH
; move.l #0,BUTCH ; enable BUTCH
move.l #$10000,DSCNTRL ; enable DSA
move.l #7,I2CNTRL ; Enable I2S
move.l #1,I2CNTRL ; Enable I2S
move.w #$7001,DS_DATA ; Set non oversampled audio
rts
MAME...
0x70 Set DAC mode (?)
So now we know what the 0x7001 is.
The upper byte is the command, the lower byte is the param, or can toggle bits etc.
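A minimal sketch of that command/param split, assuming a plain 16-bit DSA word. The helper name `split_dsa_word` is made up for illustration and is not from the BIOS or MAME:

```python
def split_dsa_word(word):
    """Split a 16-bit DSA word into (command, parameter) bytes."""
    word &= 0xFFFF                      # DS_DATA is written as a 16-bit word
    command = (word >> 8) & 0xFF        # upper byte: command code
    param = word & 0xFF                 # lower byte: parameter / flag bits
    return command, param


cmd, param = split_dsa_word(0x7001)
print(f"cmd=0x{cmd:02X} param=0x{param:02X}")  # prints "cmd=0x70 param=0x01"
```

So the 0x7001 written in setup would be command 0x70 (Set DAC mode, per the MAME list) with a param of 0x01.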
Need to figure out when an Irq is triggered.
ie. if it's after every type of command, or only some.
I would imagine the IRQ line gets cleared again, once the IRQ register (0x00) gets read.
; Internal use only
; waits for transmit to occur
DSA_tx: ; set up as a polling loop
; move.w #$1000,d1 ; This is VOODOO
move.w #0,d1 ; This is VOODOO (new voodoo!)
.delay:
dbra d1,.delay
move.l BUTCH,d1 ; get Butch's ICR into d1
and.l #$1000,d1 ; mask for d12, DSA TX Intr. pending
beq.b DSA_tx ; nothing here yet, so wait for bit to set
move.l DSCNTRL,d1 ; read here to clear interrupt flag
rts
---x ---- ---- ---- Command to CD drive pending
; bit12 - Command to CD drive pending (trans buffer empty if 1)
; bit13 - Response from CD drive pending (rec buffer full if 1)
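A small sketch of how the two masks used in the BIOS code above ($2000 in chkCDpow, $1000 in DSA_tx) line up with those bit descriptions. The constant and function names are hypothetical, only the bit positions come from the comments:

```python
DSA_TX_PENDING = 1 << 12   # 0x1000: command to CD drive pending (trans buffer empty)
DSA_RX_PENDING = 1 << 13   # 0x2000: response from CD drive pending (rec buffer full)


def decode_icr(value):
    """Decode the two DSA-related flags from a Butch interrupt register value."""
    return {
        "tx_pending": bool(value & DSA_TX_PENDING),
        "rx_pending": bool(value & DSA_RX_PENDING),
    }


print(decode_icr(0x2000))  # only the "response pending" flag is set
```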
; Clear pending DSA interrupts
move.w DS_DATA,d0
move.l DSCNTRL,d0
MAME...
case 0x70: // Set DAC Mode
m_butch_regs[0] |= 0x2000;
m_butch_cmd_response[0] = 0x7000 | (m_butch_regs[offset] & 0xff);
m_butch_cmd_index = 0;
m_butch_cmd_size = 1;
break;
So that's setting bit 13 (Response pending) in the IRQ reg.
Then sends back the response as 0x7000, and copies the lower byte from the existing DSA reg value.
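Roughly what that MAME case does, as a Python sketch. This only mirrors the pasted snippet; the function name and the tuple return are assumptions for illustration, and it says nothing about what the real Butch chip does:

```python
ICR_RESPONSE_PENDING = 0x2000  # bit 13 of the IRQ reg


def set_dac_mode(irq_reg, dsa_word):
    """Model the pasted 'case 0x70' branch: flag a response and echo the param byte."""
    irq_reg |= ICR_RESPONSE_PENDING          # m_butch_regs[0] |= 0x2000
    response = 0x7000 | (dsa_word & 0xFF)    # 0x7000 | low byte of the written word
    return irq_reg, response


irq, resp = set_dac_mode(0x0000, 0x7001)
print(hex(irq), hex(resp))  # prints "0x2000 0x7001"
```

That would explain why powok expects to read back 0x7001 after the 0x7001 write.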
DSA_tx also waits until bit 12 is set.
So I've tried setting both bits (0x3000) in the IRQ reg, to see if it gets any further.
MAME doesn't have much actual code added, as far as I can see.
But it still forces the CD BIOS to start sending proper commands...
// 0x15 Set Mode
// 0x17 Clear Error
case 0x15: // Set Mode
m_butch_regs[0] |= 0x2000;
m_butch_cmd_response[0] = 0x1700 | (m_butch_regs[offset] & 0xff);
m_butch_cmd_index = 0;
m_butch_cmd_size = 1;
break;
So, most commands do set bit 13 (Response pending).
Doesn't look like the MAME code even hooks up the IRQ for the CD stuff.
So the BIOS must just poll the regs, most of the time?
Oh
DSA bus. ^
So that's what the Butch chip talks to.
And it receives the data from the SAA7345 decoder via I2S.
Down the DSA rabbit hole
So now we know.
Used in a lot of "Audiophile" CD player projects.
They have controller boards on AliExpress for it.
Don't really need to know as low-level as this, but it helps us understand what the CD BIOS is sending.
And I don't think the Butch chip is going to be overly complex with the command handling, it probably just forwards the commands almost as-is, to the cdt612 chip.
(plus it has a couple of FIFOs for the I2S data, both for data reading, and for audio playback.)
I think some of the reg reads are just duplicates.
// READ mux...
case ( {cpu_addr[5:2],2'b00} )
//6'h00: butch_dout = butch_irq_reg; // 0x00 IRQ reg.
6'h00: butch_dout = 16'h2000; // 0x00 IRQ reg. TESTING !!
//6'h02: butch_dout = butch_reg_02; // 0x02 Unknown? (CD BIOS seems to write to this offset?)
ie. I need to drop the two lower bits of cpu_addr.
So it reads register 0x00 (IRQ reg), even if the cpu_addr is 0 or 2, etc.
'cos MAME is reading from offset DFFF02, which doesn't make much sense, vs the actual MAME code.
void jaguarcd_state::butch_regs_w(offs_t offset, uint32_t data, uint32_t mem_mask)
{
COMBINE_DATA(&m_butch_regs[offset]);
switch(offset*4)
{
case 8: //DS DATA
Hard to explain.
I think it's just when the 68K is reading from the different Byte offsets.
But the Butch chip will ignore the two lower bits of the address.
So reading from address 0,1,2,3 still gives you "register 0".
Same for reading from 8,9,A,B, give you "register 8".
So the comment in MAME is slightly "wrong".
The CD BIOS often writes the command to Offset 0xA.
But the actual register address is at base 0x8.
6'h08: butch_cmd_reg <= cpu_din; // 0x08 DSA TX/RX (Command) reg.
Since reading/writing addr 8,9,A,B, will still decode as offset 8, when doing this...
case ( {cpu_addr[5:2],2'b00} )
That's why it's also at "case: 8" in the MAME code.
Also means there probably is no separate register 0xE.
6'h2c: butch_reg_2c <= cpu_din; // 0x2c Unknown? (used at start-up)
6'h2e: butch_reg_2e <= cpu_din; // 0x2e Unknown? (also written/read during MAME start-up?)
An access of C,D,E,F, still reads/writes the same reg.
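The same decode in Python, as a sketch. It assumes only address bits [5:2] select the register (matching the `{cpu_addr[5:2],2'b00}` case expression); the function name is made up:

```python
def butch_reg_offset(addr):
    """Map a byte address in the Butch window to its 32-bit register offset."""
    return addr & 0x3C   # keep bits [5:2], force bits [1:0] to zero


# 8, 9, A, B all land on register 8; C, D, E, F all land on register C.
print([hex(butch_reg_offset(a)) for a in (0x8, 0x9, 0xA, 0xB)])
print([hex(butch_reg_offset(a)) for a in (0xC, 0xD, 0xE, 0xF)])
```

With this decode, a read from DFFF02 is just a read of register 0x00, and a write to offset 0xA is a write to register 0x08.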
So then we have these...
case ( {cpu_addr[5:2],2'b00} )
6'h00: butch_irq_reg <= cpu_din; // 0x00 IRQ reg.
6'h04: butch_dsa_cont <= cpu_din; // 0x04 DSA Control reg.
6'h08: butch_cmd_reg <= cpu_din; // 0x08 DSA TX/RX (Command) reg.
6'h10: butch_i2s_cont <= cpu_din; // 0x10 I2S bus control.
6'h14: butch_sub_cont <= cpu_din; // 0x14 CD subcode control.
6'h18: butch_sub_rega <= cpu_din; // 0x18 Subcode data Reg A.
6'h1c: butch_sub_regb <= cpu_din; // 0x1c Subcode data Reg B.
6'h20: butch_sub_time <= cpu_din; // 0x20 Subcode time and compare enable.
6'h24: butch_i2s_fifo <= cpu_din; // 0x24 I2S FIFO data.
6'h28: butch_i2s_fifo_old <= cpu_din; // 0x28 I2S FIFO data (old)
6'h2c: butch_reg_2c <= cpu_din; // 0x2c Unknown? (used at start-up)
default: ;
endcase
Reg 0x0c is missing. I'll add one anyway.
Great work getting the BIOS booting! Am just catching up now
I think the Jag CD BIOS may be open source like the Jag one, wonder if someone can confirm
So what would need to be added to the core to get it to boot the Jag CD BIOS without it being altered, and in a way where down the road it could be used for the core to boot Jag CD games?
I started adding that option, to switch it to reading the Cart ROM as 8-bit wide.
But it's not quite done yet.
It's probably easier just to patch those four bytes on-the-fly, like it does for the Cart Checksum thing on the BIOS.
What would be the best/cleanest/most accurate implementation, or is that not a valid question in this situation?
It could probably be done to auto-detect when those bytes are set to anything other than 0x04.
ie. when it's actually loading the Cart ROM.
But then you don't really want to force stuff to modify the ROM, without good reason.
So might as well just keep the menu option.
After me saying the menu text needs to be kept short, I still made it a bit too long. lol
Also... it doesn't work yet. sigh
The curse of FPGA dev.
After messing with the DC sim again yesterday, and Quartus. It made me realize just how easy it is to screw-up the code.
Even with the very best of intentions, and taking notes etc.
So it's really not practical to have to keep doing Quartus compiles for every small tweak.
But also not so easy to simulate the whole MiSTer framework and ARM stuff.
I'll try to debug this now. It's probably me decoding the address wrong.
"O2,Cart Checksum Patch,Off,On;",
"OH,Patch Cart ROM 32-bit,Off,On;",
My fault. Broke it.
Old...
assign cart_q = (!abus_out[2]) ? DDRAM_DOUT[63:32] : DDRAM_DOUT[31:00];
New...
assign cart_q1 = cart_patch_32_cs ? 32'h04040404 : // Patch the Cart ROM, to force reading as 32-bit wide (Jag CD BIOS, etc.)
cart_read_32; // Else, allow normal reading of Cart data (as 32-bit wide).
I'm using cart_q1, which is the wrong reg.
Which gets used with cart_q, to see when the data has changed.
This code can probably be deleted.
cart_diff <= cart_q1 != cart_q;
Yep, I think that should fix it. Compiling again. sigh
That was very confusing.
It turns out the Cart ROM is loaded to/from SDRAM now.
But still being loaded into DDR3 at the same time.
Plus, I was playing some music. I bought a second-hand AVR for my Nephew, so I've been "testing" it. lol
that was the change for single ram
Yep, but I think the Cart was still in DDR3, until recently. So it confused me. lol
wire [31:00] cart_q_sdram; // Cart data from SDRAM.
wire cart_patch_32_cs = (abus_out[23:0]>=24'h800400 && abus_out[23:0]<=24'h800403 && status[17]); // Patch 0x400-0x403.
assign cart_q = cart_patch_32_cs ? 32'h04040404 : // Patch the Cart ROM, to force reading as 32-bit wide (Jag CD BIOS, etc.)
cart_q_sdram; // Else, allow normal reading of Cart data (as 32-bit wide).
Works with the "unpatched" JagCD BIOS now.
But you need to have both of those options turned on, before loading it.
And you have to load the Jag CD BIOS as if it's a Cart.
Jag CD won't load any ISOs yet, ofc. This is just a test.
Hence why I didn't put "Jag CD" in the RBF filename.
AvP doesn't run for me, on that build.
I didn't change any of the Quartus compilation settings, but I do have SignalTap enabled.
this was done to improve stability, it wasn't for single sdram purposes. ddr3 has unpredictable random read latency, the cart rom probably never should have been loaded into there
That version of the Jag CD BIOS also boots, but the Development one won't.
I don't think the dev thing is even a "cart" ROM. I think it only works with something like that Stubulator BIOS.
Yep, np.
I think I actually did have it loading from SDRAM when I did one of the first ports. lol
I tried all sorts before, swapping things around, and trying to save a few extra clock cycles.
It's a shame the FPGA doesn't just have about 8MB of on-chip SRAM, then it could even run N64 with only a cart hooked up.
How much sram does it have
Only around 600-700 KBytes, basically.
And the framework uses a small chunk of that.
That new menu option, btw, it's probably OK to just leave it on.
what new menu option?
AFAIK, the core can only boot Cart ROMs that are in 32-bit wide mode anyway, so those will already have the 0x04040404 thing.
@maiden granite #1315008851874938948 message
It's just to get the Jag CD "BIOS" to run atm. Nothing fancy, no actual ISOs can be loaded yet.
Quite a lot of work to do before that happens.
yeah seems a bit like putting the cart before the horse
Hence why I said it was "for testing". lol
Literally all the menu option does, is force those four bytes to 0x04040404 during reading.
But ofc most games will have that already set.
The Jag CD "BIOS" has it set to all 0x00, because it uses an 8-bit wide ROM.
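A behavioural sketch of the patch, following the `cart_patch_32_cs` logic pasted earlier (address range 0x800400-0x800403, gated by the menu option). The function and argument names are hypothetical:

```python
PATCH_START = 0x800400
PATCH_END = 0x800403
PATCH_VALUE = 0x04040404   # header value for "32-bit wide" cart reads


def cart_read(addr, rom_word, patch_enabled):
    """Return the data the 68K sees for a cart read, with the optional header patch."""
    addr &= 0xFFFFFF                               # abus_out[23:0]
    if patch_enabled and PATCH_START <= addr <= PATCH_END:
        return PATCH_VALUE                         # override the four header bytes
    return rom_word                                # otherwise pass the ROM data through


# The Jag CD "BIOS" has 0x00000000 here; with the option on it reads as 0x04040404.
print(hex(cart_read(0x800400, 0x00000000, patch_enabled=True)))
```

A game that already has 0x04040404 in its header reads the same value either way, which is why leaving the option on should be harmless.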
Ideally, the core should be tweaked, to support 8-bit, 16-bit, 32-bit mode.
Which would probably get more Homebrew running.
Iron Soldier 2 is doing that atm, which seems to be stuck in a routine in RAM.
That seems to be the same code, in MAME.
Seems to be called from here...
Maybe something to do with IRQ2 not working in the game?
And looks like it essentially loops, waiting for the data in RAM at 0x44F4 to change.
Which again suggests an IRQ handler isn't quite right, as it looks like that's the only way it could exit that loop.
Might possibly be related to the Vertical Int.
Not sure it's worth it. lol
I need a break.
AvP on this core, just before it crashes.
Would be nice to figure out what fails.
Wait, so not supporting these is the reason Iron Soldier 2 isn't running?
No, I don't know what is stopping that game from working.
Nobody knows for sure which part of AvP causes it not to work on certain core builds either.
Ah OK
But it must be a marginal timing thing, as you can have the same exact code between builds, or tweak one small setting, and AvP no longer runs, or has more graphical glitches.
Would it be hard/costly to support 8-bit, 16-bit, 32-bit modes?
Investigating why Iron Soldier 2 doesn't boot, is probably even harder.
Not too hard, no.
I'm just personally doing too many things at once again.
Also, the 32-bit thing I already added, might be enough to force some stuff to work anyway.
Hard to explain.
eg. if a specific game or Homebrew isn't booting, for the same reason as the Jag CD "BIOS" (because its cart header is set to 8-bit or 16-bit reads), then the new menu option should help.
There is something else going on with games like Iron Soldier 2 that's stopping it running, though.
I can't wait to some day play Shenmue on an FPGA
I am not sure anyone has collated a pack of the various Jag homebrew out there, that could be a good thing to get and test
Back onto Dreamcast today.
But it's kicking my butt, again.
I thought I'd done the git commit from when the core was doing half-decent renders.
But apparently not.
I don't know why that keeps happening, tbh.
I could swear I did the commit after the last decent renders.
And I can't seem to remember exactly what fixed it last time.
Hmm. Compilation variance? Just like AVP on the Jag? One build works, the next fails?
No, it's just my dumb coding, and forgetting to do git commits more often. lol
FPGA code back in the sim again.
Around 5.4 FPS in the sim, when it's not using the Codebook cache.
As I still need to fix that, so it will hopefully fit on the FPGA.
The Codebook cache is currently written in a way that can't easily be inferred to use Block RAMs.
Which means Quartus tries to implement it using registers, which uses an insane amount of logic.
(Most FPGAs are based on SRAM cells for the configuration bits, which tell it how to hook up the gates / logic chunks / ALMs. Some of those SRAMs are reserved to use as normal SRAM, for use as ROM / RAM / Cache, etc.)
The sim obviously doesn't care about logic "usage", as Verilator just turns the Verilog into a C-code logical model.
With the Codebook reading bypassed, it's doing nearly 14 FPS (single-frame time) in the sim...
Which is still quite slow, for some reason.
Oh yeah, latency.
I'm doing a very very rough approximation of DDR latency in the sim atm.
reg [2:0] valid_d;
always @(posedge clock or negedge reset_n)
if (!reset_n) begin
    valid_d <= 3'd0;
end
else begin
    valid_d <= {valid_d[1:0], isp_vram_rd};
end
wire vram_valid = valid_d[2];
Just a shift register, basically.
Which adds a delay between when isp_vram_rd pulses high, and when vram_valid goes High.
atm, that's set to only three clock cycles (isp_vram_rd shifts through valid_d, bits 0, 1, 2).
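A cycle-by-cycle sketch of that shift register in Python, just to show where the three-clock delay comes from. The `depth` parameter is an addition for experimenting with the fake latency; the Verilog above is fixed at 3 bits:

```python
def simulate_valid_delay(rd_pulses, depth=3):
    """Return vram_valid as seen after each clock edge, given isp_vram_rd per edge."""
    mask = (1 << depth) - 1
    valid_d = 0
    trace = []
    for rd in rd_pulses:
        # valid_d <= {valid_d[depth-2:0], isp_vram_rd}
        valid_d = ((valid_d << 1) | (rd & 1)) & mask
        trace.append((valid_d >> (depth - 1)) & 1)   # vram_valid = top bit
    return trace


# A single read pulse shows up on vram_valid after the third clock edge.
print(simulate_valid_delay([1, 0, 0, 0, 0]))           # prints [0, 0, 1, 0, 0]
print(simulate_valid_delay([1, 0, 0, 0, 0], depth=1))  # prints [1, 0, 0, 0, 0]
```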
The DDR on the DE10 is currently taking about 6-7 clock cycles for every single Word, atm.
And that's with the core running at only 16 MHz (again).
So you can imagine how many more clock cycles it would waste, once the core is running at 50 MHz or more. lol
I NEED to make use of DDR Burst transfers, to get anywhere close to reasonable frame rates.
It's a bit frustrating, as I know it's possible now, to get the rest of the logic rendering fast.
If I set the fake latency from 3 cycles down to 1 clock cycle, the "frame rate" goes from 13.8 to 18.8.
If I bypass Codebook and texel reading altogether, it jumps to 22.62 FPS.
26 FPS, if I skip a whole isp_state.
Not quite sure why it's not closer to 60 FPS, tbh.
52: begin
    if (prim_tag_out_old != prim_tag_out) begin // Check to see if the Tag has changed...
        prim_tag_out_old <= prim_tag_out;
        if (tex_base_word_addr_old != tcw_word_out[20:0]) begin // Check to see if the texture BASE address has changed...
            tex_base_word_addr_old <= tcw_word_out[20:0];
            // isp_inst[25]=texture. tcw_word[30]=vq_comp.
            if (isp_inst_out[25] && tcw_word_out[30]) begin // Check if VQ compressed.
                read_codebook <= 1'b1; // If so, read the new Codebook.
                isp_state <= 8'd100;
            end
        end
    end
    else begin // Tag has not changed, but check if the new pixel is flat-shaded/Gouraud, or textured...
        if (y_ps[4:0]==5'd31 && x_ps[4:0]==5'd31) begin // On the last (lower-right) pixel of the tile...
            tile_accum_done <= 1'b1; // Tell the RA we're done.
            isp_state <= 8'd0; // Back to idle state.
        end
        else begin
            x_ps[4:0] <= x_ps[4:0] + 5'd1; // Inc x_ps[4:0].
            if (x_ps[4:0]==5'd31) y_ps[4:0] <= y_ps[4:0] + 5'd1;
            if (isp_inst_out[25]) begin // If texture flag is set...
                isp_vram_rd <= 1'b1; // Read texel...
                isp_state <= isp_state + 8'd1;
            end
            else begin // Flat-shaded or Gouraud, no need to read a Texel Word...
                isp_state <= 8'd54;
            end
        end
    end
end
Doing quite a lot in one state now.
Iterating through each pixel in turn.
Well, reading each Tag.
And the Tag tells you which triangle/prim the pixel relates to.
So it has to load the triangle params for each pixel very quickly, from the param buffer.
(which happens in the previous state, but I can't paste that much at once on here.)
If the Tag number changes, then we know we need to check to see if the texture (base addr) has changed.
And if it's also using VQ-compressed textures, we need to read the new Codebook.
If it's not textured, we still need to write the pixel to the framebuffer, obviously, but skip the texture read.
So the pixel must then be either flat-shaded or Gouraud shaded.
All of these flags exist in the ISP/TSP/TCW words, for each triangle/prim
I'm sure I still have some off-by-one stuff, regarding how I increment to the new row, or check for the last pixel etc.
But I got rid of the horiz lines earlier, by fixing one of those bugs.
It still has the vertical lines atm. That must be delays in the params and texture stuff updating.
The Codebook on the FPGA (and sim) right now is just a basic chunk of BRAM, not a proper cache...
// VQ Code Book. 256 64-bit Words.
reg [63:0] code_book [0:255];
reg [8:0] cb_word_index;
always @(posedge clock or negedge reset_n)
if (!reset_n) begin
    cb_word_index <= 9'd256;
end
else begin
    // Handle VQ Code Book reading.
    if (read_codebook) begin
        cb_word_index <= 9'd0;
    end
    else if (codebook_wait) begin
        if (vram_valid) begin
            code_book[ cb_word_index ] <= vram_din;
            cb_word_index <= cb_word_index + 9'd1;
        end
    end
end
assign codebook_wait = !cb_word_index[8];
Which means it has to read 256 Words (64-bit wide) EVERY time the texture base address changes (and if it's also VQ compressed).
Which is why it's so incredibly slow at rendering atm. lol
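A rough cost model of that reload, walking the same 9-bit index. It assumes every word costs a fixed number of clocks with no burst or pipelining, which is a simplification; the function name is made up:

```python
def codebook_load_cycles(cycles_per_word):
    """Clocks spent refilling the 256-entry codebook at a fixed cost per 64-bit word."""
    cb_word_index = 0                      # read_codebook resets the index to 0
    cycles = 0
    while not (cb_word_index & 0x100):     # codebook_wait = !cb_word_index[8]
        cycles += cycles_per_word          # stall until vram_valid for this word
        cb_word_index += 1
    return cycles


print(codebook_load_cycles(1))  # prints 256
print(codebook_load_cycles(6))  # prints 1536
```

So at around 6 clocks per word, every texture change on a VQ texture would stall the renderer for roughly 1,500 cycles before a single texel gets read.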
The Param cache, Texture cache, and Codebook cache are the real key to getting the good frame rates.
That's one of the better FPGA renders in a while.
Well, since three days ago. lol
Working on the Codebook cache next.
@rain obsidian what kind of cpu does mister have?
You mean on the ARM side?... an ARM.
Actually, a Dual-Core ARM, very similar to what's on the Rasp Pi 2.
Runs at around 800-900 MHz on MiSTer.
But without any actual GPU.
So it has to use a Framebuffer.
The DDR3 gets shared between the ARM and FPGA sides.
And either can read (or write) the same memory.
So the ARM Linux stuff is set to use the lower 512MB.
And FPGA side can use the upper 512MB.
(although part of that DDR3 mem is used for the ASCAL upscaler framebuffers.)
ASCAL has the option to directly display the Linux framebuffer from DDR3, or display the native video being fed to it from the core itself.
It's not too bad to get the toolchain set up, to compile stuff for the ARM / Linux side.
But I think there were a few minor mistakes on the wiki.
Something to do with one of the paths for the export. It might have been fixed now?
I've never had any luck getting docker working on WSL / WSL2, btw.
I just install the gcc ARM toolchain directly.
New Codebook Cache.
The FPGA version should now force Quartus to instantiate Block memory.
If I pretend VRAM has zero latency on the sim, it renders the Daytona frame fast enough to hit 28.68 FPS.
should be able to run some stuff on reicast then
cortex class?
800MHz Dual-core ARM Cortex-A9 processor.
good ol' A9 lol
i already have it running on the ultra96 board
Yep, if you have reicast or something on the ultra96, and running under Linux, it might even run as-is on MiSTer?
Probably goes without saying - without a GPU. lol
Currently realizing just how much BRAM the Codebook cache might take.
You can reduce the number of entries, but unless it's about 512-1024, it won't be very fast.
With 1,024 entries, of 256 words, and 64-bit words.
That's 1,024 * 256 * 8 = 2 MBytes.
Way more than the Cyc V on the DE10 even has.
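The same arithmetic as a quick sketch, so other entry counts can be tried. It assumes each cache entry holds one full codebook of 256 x 64-bit words and ignores tag storage; the function name is hypothetical:

```python
def codebook_cache_bytes(entries, words_per_codebook=256, bytes_per_word=8):
    """Raw data storage for a codebook cache with the given number of entries."""
    return entries * words_per_codebook * bytes_per_word


for entries in (1024, 512, 128, 64):
    kib = codebook_cache_bytes(entries) // 1024
    print(f"{entries:5d} entries -> {kib} KiB")
# 1024 entries -> 2048 KiB, 512 -> 1024 KiB, 128 -> 256 KiB, 64 -> 128 KiB
```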
Trying 128 entries first.
I was wrong.
It looks like 128 would be plenty.
It's still doing sim frame times to hit 22 FPS (at 100 MHz), even with simulated DDR latency of 1 clock.
There just needs to be enough Cache entries to store all of the Codebooks for the typical number of different textures used in a scene.
64 entries...
128...
With only 64 entries, when there are more textures than that on the screen, the "Tag" address for the cache kinda wraps.
Causing it to use the wrong texture for some of the polys.
There should be a fix for that, too.
Where the logic knows when the incoming Tag is greater than the number of Cache entries, so it can force a cache miss.
Then re-read the Codebook for those polys.
It would just cause slower renders.
(a compromise between the Codebook cache size, vs speed.)
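A sketch of one way that fix could look: a direct-mapped cache that stores the full tag next to each entry and treats a mismatch as a miss. This is a generic cache technique and an assumption here, not the actual module from the core; the class and method names are made up:

```python
class CodebookCache:
    """Direct-mapped codebook cache model that detects tag aliasing."""

    def __init__(self, entries=64):
        self.entries = entries
        self.stored_tag = [None] * entries   # full tag currently held in each slot
        self.misses = 0

    def lookup(self, tag):
        """Return True on a hit; on a miss, record the reload and claim the slot."""
        index = tag % self.entries           # low bits of the tag pick the slot
        if self.stored_tag[index] == tag:
            return True                      # right codebook is already cached
        self.stored_tag[index] = tag         # wrong (or no) codebook: re-read from VRAM
        self.misses += 1
        return False


cache = CodebookCache(entries=64)
print(cache.lookup(5), cache.lookup(5), cache.lookup(69), cache.lookup(5))
# prints "False True False False": tags 5 and 69 fight over the same slot
```

Without the stored-tag compare, the lookup of tag 69 would silently hit slot 5 and use the wrong texture, which matches the glitch described above. With it, the result is correct but slower.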
That's with 128 entries for the Codebook cache.
And 1,024 entries for the param buffer.
That's not too bad on mem usage atm.
Taking 32 minutes to compile.
Haven't quite got it right yet, but close.
I actually think the Codebook cache is working OK on the FPGA.
But I couldn't help do another tweak to gain a bit of speed, and broke it again. lol
When it does the thing of skipping every-other line, it's usually due to the clock delay thing for Tag processing.
// Write triangle spans to Z / Tag buffer, checking 32 "pixels" at once for inTri AND depth_compare.
50: begin
isp_state <= 8'd90;
end
90: begin // Z-buff write is allowed in this state.
isp_state <= isp_state + 8'd1;
end
91: begin
y_ps[4:0] <= y_ps[4:0] + 5'd1;
if (y_ps[4:0]==5'd31) begin
isp_vram_addr <= isp_vram_addr_last;
isp_state <= 8'd48; // Done! - Load next PRIM.
end
else isp_state <= 8'd50; // Else, Jump back.
end
So I had to re-add those extra states.
It starts off with y_ps[4:0]==0, so pointing at the first row in the Tag buffer.
But it has to read the Z values from that row, have time to do the depth compare (on all 32 pixels/tags in parallel), then WRITE the new prim tags (for the current triangle) back to the Tag buffer.
It can't do that all within the same clock cycle / state.
As there is a delay of one clock cycle for reading the (current) Tags from memory.
So it has to go from isp_state 50 to 90, to give it time to read.
As it leaves isp_state 90, that triggers the writeback to the Tag/Z buffer row.
91 increments to the next row, and it repeats until it's done row 31.
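A tiny walk through those three states in Python, counting clocks per prim. It only models the 50 / 90 / 91 sequence pasted above (one clock per state, starting from row 0); everything else in the ISP state machine is left out:

```python
def tag_write_cycles(rows=32):
    """Count clocks spent in states 50 -> 90 -> 91 until the jump to state 48."""
    state, y_ps, cycles = 50, 0, 0
    while state != 48:
        cycles += 1
        if state == 50:
            state = 90                         # give the Tag/Z row read time to land
        elif state == 90:
            state = 91                         # leaving 90 triggers the row writeback
        elif state == 91:
            state = 48 if y_ps == rows - 1 else 50
            y_ps = (y_ps + 1) % rows           # y_ps[4:0] wraps after row 31
    return cycles


print(tag_write_cycles())  # prints 96
```

So each prim costs 96 clocks per tile for the Z / Tag pass. Dropping state 90 would bring that to 64, if the read timing allowed it.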
I might be able to ditch isp_state 90, but I can't figure out if that would break it again.
If we pretend that DC will never fit the MiSTer, the question is what would be the most worthwhile parts of the DC emulation to have on the FPGA side of things. What would be the one thing that could potentially bring in a correct timing etc.
The word "emulation" threw me for a second there. lol
But in this case, the term emulation is very fair. It's not going to be a super low-level GPU, with the way I've written it.
(and the lack of chip-level info on HOLLY / PVR2.)
Yep.
It's pretty much like a Rasp Pi 2, minus the GPU, and with two cores.
Thus having just the PowerVR emulated on the FPGA would be one possible choice
Correct timings, as in, making it fast enough without a lot of extra effort...
Would be having about 8MB of SRAM.
But SRAM is quite expensive.
And would use up at least one of the GPIO headers.
Not as such. Anything like that, has to be implemented on the FPGA side, AFAIK.
So the ARM even passes audio through the FPGA to the DAC + HDMI.
In fact, the ALSA audio from Linux gets written to a small buffer in DDR3 first.
Then some logic on the FPGA side reads the audio samples from that buffer, to output to the PWM "DAC", or I2S DAC, plus the HDMI chip.
Ok. Thus the PowerVR and sound emulation has to be done on the FPGA side
I'm not overly worried about the AICA sound stuff yet. I know a few devs who I think would be happy to help.
AFAIK, the AICA is quite similar to the SCSP on Saturn.
But with more channels, and obviously the ARM7 thingy.
I'm just happy to be making progress atm with the GPU, which, as you've seen, isn't always the case. lol
It usually makes me super grumpy.
I have no idea how many Quartus compiles I've done, just on this project alone.
But it must be like, 500 or more.
So around 250 hours, at least.
Yeah, probably way more, actually.
I started the project over a year ago, but I had at least six months where I barely looked at it.
Thanks for elaborating. So getting reicast itself to run on the arm is just half of the story because there is no graphics card to output to and no soundcard. Thus even if getting it to successfully run somehow it would do... nothing (nothing to see and nothing to hear).
It would probably be possible to emulate the sound on reicast.
And just output that via Linux / ALSA, like normal.
Assuming the ARM cores are even fast enough for that, as they're not especially great.
I think they lack NEON and other features that the Rasp Pi 2 has.
So not so good for decoding modern video formats, etc.
Very early-on in the life of MiSTer, some people did get things like ScummVM running quite well.
Although, back then, displaying the Linux framebuffer would have screen tearing.
I really hope they've fixed that now, but I haven't tried it in years.
booo!
Quartus has a habit of doing stuff like that, RIGHT when you're getting closer to a crux point.
And worst of all, is it won't even give you an estimate on the logic usage, as it stopped too early.
If there was an affordable Intel Agilex dev board right now, I'd buy it.
That's if it had decently fast DDR, and enough IO pins to hook up SDRAM etc.
And at least 50-100% more logic than the DE10 Cyc V.
I only just realized that I don't really NEED to check if the prim tag has changed.
You just assume it can change, then check if the texture base address has changed.
It iterates through every pixel in the tile anyway.
So if a pixel (triangle) has the texture flag set, you check to see if the Texel read address has changed, then read the new texel.
If it's flat-shaded or Gouraud, you can obviously skip the texel read entirely.
Contact terasic. For the low low price of over 2000 dollars you can get a de25-standard.
Exactly.
Really annoying that there are no cheaper Agilex boards out yet.
Like, I'd pay up to about £600, if needed, but I can't go to £2000+
Progress.
Reduced the CB cache to only 64 entries.
And reduced the param buffer to 512.
Could probably use the CB cache module as the generic cache for the param buffer, tbh.
tbh, I would implement them as normal addr-value caches
'cos atm, the param buffer is literally just storing every group of vertex/ISP/TSP/TCW params, for every triangle within each tile. lol
Yep, it pretty much is...
//.prim_tag( prim_tag_out ), // input [9:0] prim_tag
.prim_tag( tcw_word_out[15:5] ), // input [9:0] prim_tag
Using some bits of the Texture base addr as the "tag" atm.
Which I know isn't ideal.
Just needed enough bits to differentiate between textures.
Which does make sense, caching the Codebook across the whole frame.
Rather than per-tile.
Since there are only so many textures in use, in the entire image.
If it was per-tile, it would be constantly re-loading the cache.
i think the cache is really tiny on the real hw
Yep, I think so, too.
like 2kb or something like that
I think it just takes a hit (or miss. lol) if there are lots of textures in the frame, and it just has to fetch a bit more often from VRAM.
I do think logically, the "texel" cache and codebook cache must be separate, though.
And the texel cache mainly helps just to mitigate most of the VRAM latency.
So it can do burst reads into cache.
I'm very happy with that new render.
Another milestone.
It's still super slow, but that's now down to the lack of Burst transfers
git commit done
That's actually really good already, vs what I was expecting.
So now we have the Hidden Surface removal, done on 32 "pixels" at once, using Tags.
Then the param buffer, so it can fetch the params for each triangle, depending on the Tag value during tile rendering.
And now it has the Codebook cache in place, which erm, caches the Codebook(s). lol
So it can also swap to each Codebook, each time the texture base address changes (dependent on the current Tag).
OK, now what?
I think I need to process this for a while, and see where I'm up to.
well, if you're out of ideas
would be real fun to setup reicast
and render frames as they are sent from it
Around 2 seconds to render the frame above, at only 16 MHz.
With crappy 6-8 cycles of DDR3 latency, for EVERY PIXEL processed.
And for EVERY initial param fetch.
heh
Trust me, I'll be working on getting reicast running on the ARM now. haha
i guess we can call that mistercast
Oh no. lol
I really hate the fact Taki called it the "MiSTer Pi".
It was confusing enough, trying to explain to people the pros and cons of sw vs FPGA "emulation". lol
It's possibly THE most confusing name anyone could have ever chosen, for a device like that.
But hey, it's great to see some clones finally appearing.
640x480 pixels = 307,200
Best-case DDR3 latency, at a core clock of 16 MHz, looks to be close to 6 cycles.
So 307,200 x 6 = 1,843,200 cycles.
Each cycle = 62.5ns, at 16 MHz.
= 115.2 milliseconds. lol
THAT's why it's so slow atm.
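The same back-of-envelope sum as a sketch, so the clock and latency can be varied. It assumes exactly one DDR3 access per pixel and nothing else costing time, which is optimistic; the function name is made up:

```python
def ddr_wait_ms(width=640, height=480, cycles_per_pixel=6, clock_hz=16_000_000):
    """Milliseconds per frame spent purely waiting on per-pixel DDR3 accesses."""
    cycles = width * height * cycles_per_pixel
    return cycles / clock_hz * 1000.0


print(f"{ddr_wait_ms():.1f} ms")                     # prints "115.2 ms"
print(f"{ddr_wait_ms(cycles_per_pixel=1):.1f} ms")   # prints "19.2 ms"
```

A 60 FPS frame only has about 16.7 ms in total, so even one wasted cycle per pixel at 16 MHz is over budget.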
do you render to an internal buffer and then copy?
Not yet.
That will probably be the next thing.
So it can Burst write it to DDR3.
But...
Hypothetically, would the Dreamcast core need a "VGA mode" toggle? Some games don't support the VGA at all.
That brings up another interesting question about the DC...
Does it arrange the framebuffer where the tile pixels are linear?
Or does it just burst 32 pixels into the tile, then do another burst for the next row?
Probably the later.
@polar goblet that depends on how mister presents the frame?
I think we're maybe a bit early for that. lol. But all of that stuff would eventually be a simple menu option.
Also, unless the VRAM stuff is moved into SDRAM again, getting RGB output could get interesting.
You can use ASCAL on MiSTer, to output a DDR3 Framebuffer via RGB, though.
the framebuffer lives in SDRAM?
It used to, until a few weeks ago. hehe
mister uses hdmi?
has an urge to quit his job and work full time on this
Then writes the tile directly back to the framebuffer, overwriting the reicast frame(s).
lol
only if it paid the bills lol
I do think adding the tile output buffer makes the most sense for the next step.
I know the MiSTer having a Dreamcast core is just a pipe dream, just thinking about the edge cases that might complicate things.
Would literally just be a small BRAM...
Could steal the z_mem thing.
module z_mem (
    input clock,
    input [41:0] data,
    input [4:0] address,
    input wren,
    output reg [41:0] q
);
    reg [41:0] z_mem [0:31];
    always @(posedge clock) begin
        if (wren) z_mem[ address ] <= data;
        q <= z_mem[ address ];
    end
endmodule
41:0 lol
I'm guessing the tile accumulation buffer stores the Alpha as well.
The [41:0], is because the Z buffer also stores the Tags now.
So 32-bit Z (fixed-point), and 10-bit Tag.
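To make the 42-bit word concrete, here is a small Python sketch of packing and unpacking it. The chat only gives the widths (32-bit Z, 10-bit Tag); putting the Tag in bits [41:32] and Z in bits [31:0] is an assumption for illustration, as are the function names.

```python
Z_BITS = 32    # fixed-point depth, width from the chat
TAG_BITS = 10  # tag width from the chat


def pack_ztag(z, tag):
    """Pack Z and Tag into one 42-bit word (assumed layout: Tag above Z)."""
    if not 0 <= z < (1 << Z_BITS):
        raise ValueError("z out of range")
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag out of range")
    return (tag << Z_BITS) | z


def unpack_ztag(word):
    """Return (z, tag) from a 42-bit word packed by pack_ztag."""
    z = word & ((1 << Z_BITS) - 1)
    tag = (word >> Z_BITS) & ((1 << TAG_BITS) - 1)
    return z, tag
```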
@polar goblet I think mister is just slightly too small for dc
my goal was ultra96
and to run some parts in sw
Yep, I think maybe even the GPU might be doable, with a lot of help getting it to all fit, with all of the features.
And especially getting it to run faster, without breaking the timings.
I might try the next compile with it at say 40 MHz, and see what happens.
well still an 800 mhz A9 won't run all games fullspeed
even with all opts enabled
and just the cpu to care
It would be amazing, to see reicast on the ARM on MiSTer, with the TA writing directly to the Framebuffer.
But...
It would need a way to signal to the emu when each frame is done.
well it is not too complex, isn't there some mmio we can map?
Could just do that, by reading/writing DDR3 directly, like a handshake thing.
Or, talking via the HPS IO module, like how most cores do it.
Well, the OSD, disk/tape/cart loading.
Is all done via ONE 32-bit internal GPIO port.
That's all it has, between the ARM core(s) and FPGA side, unless you use AXI.
I was using AXI before, for testing some of laxer's old PS1 core.
But AXI can be a bit unstable, especially if you need to reload a new RBF all the time.
You can just mmap, to shove stuff into DDR3, yep.
well, I don't know the specifics, but i'm sure it can be figured out
Then access that same mem on the FPGA side.
you'd need a few things mapped
one for the regs access
one for the memory access
As long as reicast can be tweaked, to simply write the textures and TA output into DDR3 (mmapped), I think it's doable.
and a few bits for the interrupt lines
And as long as nothing in the emu code needs a lot of feedback from the GPU.
ie. it would need to "talk" to most of the existing GPU code, to keep reicast happy.
But also be able to wait for a LONG time, for the FPGA to render each frame. lol
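The "handshake through shared memory" idea could look roughly like this. It's only a sketch: the two word offsets are hypothetical, an anonymous mmap stands in for the real DDR3 region, and renderer_step() is a software stand-in for whatever the FPGA side would do. None of this is taken from reicast or the MiSTer framework.

```python
import mmap
import struct

REQ_OFFSET = 0   # hypothetical: emulator bumps this word to request a frame
DONE_OFFSET = 4  # hypothetical: renderer copies the request number here when done


def make_shared_region(size=4096):
    """Anonymous, zero-filled mapping standing in for a mapped DDR3 window."""
    return mmap.mmap(-1, size)


def read_u32(mem, off):
    return struct.unpack("<I", mem[off:off + 4])[0]


def write_u32(mem, off, value):
    mem[off:off + 4] = struct.pack("<I", value & 0xFFFFFFFF)


def request_frame(mem):
    """Emulator side: ask for the next frame, return its sequence number."""
    n = (read_u32(mem, REQ_OFFSET) + 1) & 0xFFFFFFFF
    write_u32(mem, REQ_OFFSET, n)
    return n


def renderer_step(mem):
    """Stand-in for the FPGA: 'render', then acknowledge the latest request."""
    write_u32(mem, DONE_OFFSET, read_u32(mem, REQ_OFFSET))


def frame_done(mem, n):
    """Emulator side: poll until this returns True, however long it takes."""
    return read_u32(mem, DONE_OFFSET) == n
```

Using a sequence number rather than a single flag means the emulator can wait a LONG time without missing or double-counting a frame.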
And no TA in Verilog done yet.
sure but we have lxdream's TA
Exactly.
thank god for lxdream
And that might be one of the ways to get around the whole "doesn't all fit the FPGA" problem.
I think the ARM core(s) might even have enough juice to run the AICA sound.
I mean, it can run stuff like Fluidsynth and mt32-pi.
Sort of. lol
That's one of the reasons the Pi version exists - the ARM cores on MiSTer were struggling a bit.
reicast could run several games in 600-700 mhz cortex a8
back in the day
not the most demanding ones, but on a single core (with gpu)
Believe it or not, behind the crud, that's a very good render.
With most of the speed-up logic in place, aside from Burst transfers, which is the final step.
That's taking about 3.5 seconds to render the frame, at 16 MHz.
So about 0.29 FPS.
Only half a second, so already at 2 FPS. A bit like DOOM on a 386.
Simpler frames, like the memory card screen, render in about 200ms.
I need to add an "FPS" counter to the core really.
4 seconds to render DOA2 Kasumi.
And she's lookin' pretty fine, considering all of the code I just munged together.
I don't even want to imagine how much logic Gouraud shading will use.
Definitely won't fit that atm.
Rendering from Tags actually helps fix a few glitches.
I think it's displaying at half horiz resolution atm.
Due to the ASCAL thing.
Obv the textures are shifted, too.
single-pixel changes on the raindrops etc. are not allowing the params to update fast enough.
I'm being dumb.
I can't work out the maths.
I think that's right - it's taking about 5-7 clock cycles extra, per-pixel.
So will be taking 5-7 times longer to render each frame.
And 5-7 times longer, to write the tile back to VRAM.
And the same for reading in a new Codebook.
And for reading the params.
(about 24-43 clock cycles to read a set of params for ONE triangle at one word per cycle, as they are variable-length. At 7 cycles per word, that would currently take around 301 clock cycles, at the most.)
It's still a fascinating read though!!1
Would rendering just "half res" in 320x240 increase the chance of getting this onto the MiSTer? I feel that many would prefer a low res game with scanlines
It is technically possible to render at a lower res, but hopefully that won't be too much of an issue soon.
If I can get the frame times down on the FPGA, it might be OK at 640x480.
The main reason it's displaying the frames at a lower res atm, is that it's just duplicating every other pixel.
As I couldn't figure out how to modify ASCAL the other night, to display it how I wanted.
(the Dreamcast's weird VRAM layout makes it a bit awkward to display things.)
Because of this...
 _________     _________
|         |   |         |
|         |   |         |
| 2nd 4MB |   | 1st 4MB |
|         |   |         |
|         |   |         |
|_________|   |_________|
  [63:32]       [31:0]
That's how I have to have the VRAM dumps loaded into DDR3.
And also need to do it that way anyway, because textures are read as the full 64-bit wide data.
(like the real PVR2)
The display framebuffer is only read from one half of memory at a time, though, so 32-bit.
(usually 16BPP, so two pixels in each 32-bit word.)
I'd have to modify ASCAL properly, to skip the opposite half when reading the framebuffer.
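As a rough model of what "skip the opposite half" means, here is a Python sketch that pulls 16BPP pixels out of one 4MB half of a VRAM dump stored as 64-bit words. The bank-to-bit mapping follows the diagram above; which pixel sits in the low 16 bits of each 32-bit word is an assumption, and the function name is made up.

```python
def read_fb_pixels(vram64, bank, first_word, n_words):
    """Read 16BPP pixels from one half of VRAM held as 64-bit words.

    bank 0 = "1st 4MB" in bits [31:0], bank 1 = "2nd 4MB" in bits [63:32].
    """
    shift = 32 if bank else 0
    pixels = []
    for word in vram64[first_word:first_word + n_words]:
        half = (word >> shift) & 0xFFFFFFFF  # ignore the opposite half
        # Assumption: the first pixel is in the low 16 bits of the 32-bit word.
        pixels.append(half & 0xFFFF)
        pixels.append(half >> 16)
    return pixels
```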
I'm more confident now, of getting decent frame rates (low frame times) from the FPGA version.
I just had to get this param buffer and cache stuff added.
Next up, it will be adding Burst transfers, which should make a huge difference.
So, instead of reading/writing just ONE 64-bit Word at a time, and waiting 5-7 clock cycles between each one, you just go...
"Read 256 Words, please. Starting from this address..."
The DDR controller will still have a few clock cycles of delay, but then it will burst those 256 words, pretty much on every clock cycle.
(it's not always contiguous, depending on whether the data is on the same Page (row) of memory. The FPGA also has to share access with ARM Linux, but DDR3 has a LOT of bandwidth, so it shouldn't be a problem.)
Put it this way, if the N64 can work mostly from DDR3, I think DC can as well.
Right now, just fetching those 256 words would take up to 7 times longer (due to the latency of accessing one word at a time), so 1,792 clocks.
If it were trying to write a completed tile into DDR3, it would need 1,024 clock cycles.
atm, that would take 7,168 cycles, just for ONE tile.
Multiply that by the 300 tiles (640x480), and you see why it's slow right now. lol
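The tile arithmetic above, as a small Python sketch. The helper names are made up; the 32x32 tile size, the 7 cycles per access and the 16 MHz clock come from the chat. The "burst" case is idealised as one pixel per cycle, ignoring the initial latency and any page misses.

```python
TILE = 32  # tile edge in pixels


def tiles_per_frame(width, height):
    return (width // TILE) * (height // TILE)


def frame_writeback_cycles(width, height, cycles_per_pixel):
    """Cycles to write every finished tile back, one pixel per access."""
    return tiles_per_frame(width, height) * TILE * TILE * cycles_per_pixel


def cycles_to_ms(cycles, clock_hz):
    return cycles / clock_hz * 1000.0
```

With the chat's figures, 300 tiles at 7,168 cycles each is 2,150,400 cycles, roughly 134 ms at 16 MHz, against 307,200 cycles for the idealised one-pixel-per-cycle case.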
Gouraud shading is disabled in the sim atm, as I'm testing the FPGA code on it.
I've yet to get the background poly working.
I know the lower bits of the register point directly at the ISP/TSP and verts stuff.
But in the VRAM dumps I've seen, the address seems to be off by a few words?
Like, for the menu render, the ISP_BACKGND_T reg, was showing an address of say 0x5D18 in the lower bits.
But it looked like the actual data started at 0x5D10 ?
And that was bit 3 set, so not any of the bits [2:0].
I know you might have to add PARAM_BASE to that addr, too.
But the offset in the reg has always been slightly off from where it's supposed to be.
I can get this example to "land" on what looks like a proper ISP param, but only if I shift the address left by three bits.
And then take only bits [19:0]...
(see the marker, in the VRAM dump editor window.)
I'll look at the reicast code later, to see how to decode that address.
It would be nice to get the Background poly working, as I've yet to see it work.
PARAM_BASE is 0 for that dump anyway, so the poly data should be in the lowest 1MB of VRAM.
union ISP_BACKGND_T_type
{
struct
{
u32 tag_offset : 3;
u32 tag_address : 21;
u32 skip : 3;
u32 shadow : 1;
u32 cache_bypass : 1;
};
u32 full;
};
OK, fair enough. I should have just had a proper look. lol
u32 strip_base=(param_base + ISP_BACKGND_T.tag_address*4) & 0x7FFFFF
The PDF did say the address was "in 32-bit units", but also showed it as bits 23:3.
I think that got it, it's probably the blue bits of the background.
It's just the Z precision is limited right now, so it's not rendering all of it.
I just had to shift right by 1 bit instead, but ends up the same as grabbing bits [23:3], then shifting left by two again.
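Here's the same decode as a Python sketch, mostly to show why "shift right by 1" and "grab bits [23:3], then shift left by two" agree. The field positions follow the union above, assuming the bitfields are allocated from bit 0 upwards (the usual little-endian compiler behaviour); the function names are made up.

```python
def decode_isp_backgnd_t(reg):
    """Split ISP_BACKGND_T into its fields (assumed LSB-first allocation)."""
    return {
        "tag_offset": reg & 0x7,                # bits [2:0]
        "tag_address": (reg >> 3) & 0x1FFFFF,   # bits [23:3], in 32-bit units
        "skip": (reg >> 24) & 0x7,              # bits [26:24]
        "shadow": (reg >> 27) & 0x1,            # bit 27
        "cache_bypass": (reg >> 28) & 0x1,      # bit 28
    }


def strip_base(reg, param_base=0):
    """Byte address of the background strip, as in the reicast line above."""
    fields = decode_isp_backgnd_t(reg)
    return (param_base + fields["tag_address"] * 4) & 0x7FFFFF


def strip_base_by_shift(reg, param_base=0):
    """Same result using one right shift, then clearing the low two bits."""
    return (param_base + ((reg & 0xFFFFFF) >> 1 & ~0x3)) & 0x7FFFFF
```

For the register value mentioned earlier, 0x5D18, both versions give 0x2E8C when PARAM_BASE is 0.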
Morning.
I just remembered - the framebuffer is usually 16BPP.
And stored in either the lower or upper 4MB of VRAM on the DC.
So I should be able to write two pixels at once, when writing the finished tile back to the FB.
That might.. complicate the code a bit more. lol
I might have to implement the tile accumulation buffer as odd and even pixels.
Or maybe not. Just needs to have Word Enable inputs.
Then read back as 32-bit, so it can write to mem.
Oh yeah - Alpha.
I would imagine the tile buffer is actually 32-bit ARGB.
No easy way to do Alpha blending otherwise.
So it will work a lot like it did in the C code on the sim.
When it reads the buffer back, it can do the 24-bit -> 565 conversion, before writing to VRAM.
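The 24-bit to 565 step is just dropping the low bits of each channel. A minimal Python sketch, assuming plain truncation (no rounding or dithering) and an ARGB word with Alpha in the top byte:

```python
def argb8888_to_rgb565(argb):
    """Convert one 32-bit ARGB pixel to 16-bit RGB565 by truncation."""
    r = (argb >> 16) & 0xFF
    g = (argb >> 8) & 0xFF
    b = argb & 0xFF
    # Keep the top 5 bits of red, 6 of green, 5 of blue; Alpha is discarded.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
```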
Once Burst writes are working for tiles, I can be more confident about using it for other stuff.
Then we should see the really big speed-ups in frame render time.
It's pretty cool how all the visual errors display the unique rendering method that the Dreamcast uses
That re-affirms the question about running the core much faster than about 20 MHz atm. lol
Quadrant rendering?
That was at 50 MHz. I was just curious.
Tile rendering.
32x32 pixels.
Which is now done by all modern GPUs.
It makes a lot of sense really, as it can do a lot of the rendering faster, once all of the info is on the chip itself.
On much faster memory, with zero wait-states, basically SRAM.
Yeah, the Dreamcast would render like multiple small framebuffers to draw the image instead of just lines.
Yes thank you, that's what it is
Ohhhhh I had no idea!
Then, once each tile is done, it can burst write that back to VRAM.
Imagine having to render multiple small tiles while the N64 renders the whole screen
PowerVR / Imagination Tech, were one of the first companies to really use Tile rendering / Deferred rendering.
Tiled rendering is the process of subdividing a computer graphics image by a regular grid in optical space and rendering each section of the grid, or tile, separately. The advantage to this design is that the amount of memory and bandwidth is reduced compared to immediate mode rendering systems that draw the entire frame at once. This has made t...
On PC gfx cards, etc.
Oh, not to be confused with tile-rendering on older consoles, btw.
Like arcade boards, NES, SNES, MD, etc.
It's not really related to that older stuff in any meaningful way. Just something to watch out for.
Probably better known as Deferred Rendering.
So you have to "collect" most of the info for rendering each tile first, then do all of the Hidden Surface Removal (as quickly as possible, hence why I'm doing it on 32 pixels at once now).
That gives it the main speed advantage, to compete with traditional "immediate" rendering.
Where they would just render whole triangles out to the framebuffer in VRAM, even if the triangle was HUGE.
Breaking it up into Tiles is far better, but also quite tricky, as I found out. lol
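A toy model of the "do Hidden Surface Removal first, shade later" idea, for one 32-pixel row of a tile. This is a sketch, not how the core does it: primitives are simplified to flat spans with a constant depth, "larger z is closer" is an assumption, and the inner loop stands in for the 32 parallel compares mentioned above.

```python
def hsr_row(prims, width=32):
    """Return the winning tag per pixel for one row.

    prims: list of (tag, x_start, x_end, z) spans, x_end exclusive.
    """
    z_buf = [0.0] * width
    tag_buf = [None] * width
    for tag, x0, x1, z in prims:
        for x in range(max(x0, 0), min(x1, width)):
            # Assumption: larger z means closer to the camera.
            if tag_buf[x] is None or z > z_buf[x]:
                z_buf[x] = z
                tag_buf[x] = tag
    return tag_buf


def count_param_fetches(tag_buf):
    """Count how often the tag changes along the row (one param fetch each)."""
    fetches = 0
    last = None
    for tag in tag_buf:
        if tag is None:
            continue  # nothing visible here, nothing to shade
        if tag != last:
            fetches += 1
            last = tag
    return fetches
```

Only the pixels that survive the depth test get textured, and the params are only fetched again when the tag changes along the row, which is why lots of single-pixel tag changes are expensive.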
I didn't think I'd even get this far, tbh. Even a few weeks ago, I almost ragequit again.
https://blog.imaginationtech.com/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
Great article specific to powervr rendering actually from the company that designed the chips
Rys looks at our PowerVR graphics architecture and describes how Tile-Based Rendering (TBR) works in practice.
I want to draw up some diagrams to explain it soon.
Yeah.
And PowerVR tech ended up in most of the older iPhones, for a long time.
Wow
The tile based rendering actually makes Dreamcast more viable on FPGA than other consoles of its same generation funnily enough
You can bet Apple still use a similar rendering method, because Apple rarely "invent" anything. lol
Pretty sure PowerVR made it into a few cars too
Because loading a whole 480p+ framebuffer is harder
@maiden granite Possibly from the POV of minimizing VRAM bandwidth, at least.
ur an affordable FPGA
I still intend to hook up a DC mobo VRAM to the Logic Analyzer soon.
Ur mom
tbh, I'm surprised nobody has done it yet.
Just to see the basics of how often it uses burst transfers, how much it pre-reads data into the texture cache, etc.
You can be sure it uses Burst transfers as much as possible, and you can only really do that when you have on-chip caches, too.
I sort of have Codebook cache and parameter "cache" now.
But no proper Texture cache yet.
Can probably just re-use most of the CB cache code for that.
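For what a Codebook (or later, texture) cache keyed on base address might look like in software terms, here is a small Python sketch. The least-recently-used replacement policy, the slot count and the class name are all assumptions for illustration; the chat doesn't say how the core picks a slot. The 256 x 64-bit codebook size matches the "Read 256 Words" example.

```python
from collections import OrderedDict

CODEBOOK_WORDS = 256  # assumed: 256 x 64-bit entries per VQ codebook


class CodebookCache:
    """Tiny cache of codebooks, keyed by texture base address."""

    def __init__(self, slots=2):
        self.slots = slots
        self.entries = OrderedDict()  # base address -> codebook words
        self.reloads = 0              # how many times VRAM had to be read

    def get(self, base_addr, fetch):
        """Return the codebook for base_addr, calling fetch() only on a miss."""
        if base_addr in self.entries:
            self.entries.move_to_end(base_addr)  # mark as most recently used
            return self.entries[base_addr]
        if len(self.entries) >= self.slots:
            self.entries.popitem(last=False)     # evict least recently used
        self.entries[base_addr] = fetch(base_addr)
        self.reloads += 1
        return self.entries[base_addr]
```

Each reload costs a full 256-word read from DDR3, so the number of slots directly affects how often a tile has to stall when the Tag (and so the texture) changes.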
If it wasn't for the latency of SDR/DDR SDRAM, none of that would be necessary. lol
Even just the handful of clock cycles (for a random Word read/write) on old-people SDRAM causes big problems.
'cos that handful of cycles is repeated for EVERY Word you access.
And the core is only at 20 MHz atm.
When it's running faster, there will be even more "wasted" clock (core) cycles when accessing DDR.
@maiden granite While I think of it - any idea how they managed to allow pics and videos more than 10MB on here?
It's one of the reasons I can post the GIF anims here, then send the link to other people. lol
Discord limitations. If you pay monthly for nitro it will greatly increase the cap.
10MB is the new limit for free users. Used to be 25MB
I get a cap of 500MB by comparison.
Oh, OK, so somebody does pay that then, for this server?
I didn't know it could be done server-side?
No, it's your personal account :p
Right, I think they had some kinda legacy grace period to allow the old upload limit still
Which was 25MB
Yeah, probably.
OK, I did think it was strange, as they probably couldn't allow people to just pay server-side, and then suddenly let everyone on the server get around the limit. lol
Yeah
I don't think the founders of Discord realized they were really starting a storage company.
The storage is probably more intensive than any of the actual UI stuff.
The server side rewards for boosting are increased voice quality, more emojis, more stickers, etc...
I think they knew because they started off with MongoDB from the start
They moved to Cassandra now I think.
Which was probably the most insane db migration in history
$3 a month, they really are tempting me.
It's just annoying, that it started off with a higher limit, then reduced to 10MB. That's just evil.
I remembered correctly.
Yeah they need to do stuff to become profitable. That's why the little ads were added.
I like the product so I pay for it hoping it doesn't get more enshittified.
The developer of BlastEm is a long time employee at discord too
I'm all Nitro'd up now. LOL
$30 a year, I can't really complain at that.
Only the 50MB one, though.
I use Discord probably more than any other site, except YouTube.

Worth it.
Yup hehe
Ive been paying for nitro since the first month it was available I think.
I am too much of a cheap ass
lol
I paid for YouTube premium for a month once.
But only to get rid of adverts on the phone and Firestick, which I don't use all that often.
I wouldn't mind if it didn't cost quite so much, but it's almost as much as I'm paying for Nutflicks 4K.
Adblock FTW
I've been paying for YouTube premium family for a long time and let my wife and some friends use it with the extra accounts allowed.
I haven't had to see a single video advert on my PCs for about the past ten years.
I'm still paying about $30 a month on Patreon.
Definitely worth it at that price point if you spread it out. Also been a YouTube watcher since the earliest days, so I like the product I pay for it kinda thing.
Sorg, GadgetUK, RMC, EEVblog, Srg320, Jotego, bitluni, etc.
Brother!!!
Discord could make more money if they redesigned their merch. It's mostly pretty gaudy stuff.
furrtek, MLiG, RetroRGB Bob, Tech Tangents, Retro Tech (CRT Steve), Artemio, and for some reason... Kreosan. lol
£12.99 here, for YouTube Premium.
It's not a huge amount of money, but it is just to ditch adverts, and get a bit of extra content.
vs what I'm paying for Amazon and Nutflicks already.
I did have disney+ for about two years, watched Mando, then scrubbed it. lol
disney+ AKA "Even more woke Netflix".
Disney own so much stuff now, eventually they will own ALL entertainment.
And all restaurants will become Taco Bell.
lol nutflicks
Useless trivia, btw...
I've only ever ordered Taco Bell... ONCE.
Also, in Demolition Man, Taco Bell was swapped for Pizza Hut, in some Countries.
Taco Bell here (UK) was... just OK.
Thanks, @maiden granite! you had to encourage him to get Nitro!
Look at what he did to me!!
I did almost get Nitro the other day. $30 is pretty good, even for the 50MB thing.
I couldn't even post 4K screenshots sometimes, of the SuperSPGA, etc.
And we all know how important THAT is.
Can I ask a hopefully not annoying question?
@floral vale Yip. I quite like some of the annoying ones.
How much bigger or more powerful would an FPGA need to be to run a Dreamcast core? From what I've gathered, the current MiSTer just isn't good enough
Or is it maybe possible on the current?
The biggest hurdle is the clock freq of the CPU.
As it runs at 200 MHz, and you're never likely to see any (complex) cores run much above 100 MHz on the Cyclone V.
But... it just might be possible (if some of the uber devs get together), to make an SH4 core that can execute more instructions in parallel, then only need to run at 100 MHz or less.
Right now, my crappo GPU test "core" is already using about 76% of the Cyc V on the DE10.
But I'm convinced a ton of that could be trimmed down, if I had some help with the maths calcs.
The inTri and interp blocks alone, are using about 30,000 LEs, or something.
Which is just insane. lol
From all of what is known so far, though, I'm far more confident about an eventual core being finished.
But it might need something with TWICE the logic of the DE10.
Like... I dunno... the Analogue 3D.
And yes, I know it won't (officially) have OpenFPGA.
I bought one mainly because it's relatively cheap, for something with 220,000 LEs.
Analogue 3D will be snagged if someone gets openFPGA on there. That's for sure
The Ana 3D is likely to have only DDR memory.
But that's fine, if I/we can manage to finish this GPU core.
Things like the sound and GD/ISO loading can happen later.
I think if I can get further with the GPU, and prove it can hit decent frame rates on the actual FPGA, it will get more interest from other devs to collab with.
Noice. I like the way that sounds. Great seeing a progress on how things are going!!
@floral vale I did this, back in 2019. It didn't take long...
I intend to do the same for the Ana 3D when it arrives, and use it as my new dev board.
To be able to use the FPGA like that, with an unknown pin mapping to the rest of the system, I had to of course first figure out the pin mapping...
At first, I tried using a simple counter on the FPGA, and have it output each bit onto every IO pin.
Which isn't ideal, because the signals could be clashing (contention) with other chips on the board.
That proved to be tricky, so instead I did it the opposite way...
I just set ALL of the FPGA pins as Inputs...
Then I used the 1 KHz test output on my o'scope, via a resistor, to probe every pin of the rest of the chips on the board.
Like the HDMI chip (which I think on the SuperNT is basically the same ADV7513 on the DE10, or very similar).
Then the buffers for the cart slot, joyports, etc.
Then I used SignalFlap to look at what each input pin was doing.
There was some crosstalk between signals, but it was very obvious which pin had the 1 KHz on it.
And that was it - template Quartus project for the SuperNT. lol
But... obviously I didn't touch the PIC32 on the SuperNT at all.
I didn't want to erase its FW etc. as I'd never get it back, and definitely didn't want to have to write my own.
Which makes it tricky to actually load any games into a core.
That could be done with a simple SD slot hooked up to some Cart or Joyport pins.
Or...
An Orange Pi, in a cart.
I already have main MiSTer running nicely on an Orange Pi (Allwinner SoC) for a previous project.
With a QMtech Cyc V module.
I don't think I'd mentioned that publicly before, because it was a whole thing.
But times have changed, prices have changed, and clone DE10s started arriving.
I would love to buy a new dev board soon, but really want an Agilex.
(Intel have apparently split from altera as a company now, so they are separate again. AFAIK, Intel have always made the chips for altera, though. Or at least TSMC do.)
I haven't seen any Agilex dev boards for less than about $2,000, and I can't justify that.
This is one of the "cheapest" on Terasic atm, and it's still ridiculous.
This is also $3,000 for a Cyclone 10, but looks like it uses almost the same chip as the Ana 3D, just with a higher pin count...
I'd like to think it would be relatively easy to get OpenFPGA on the Analogue 3D somehow
Not sure about that, but it would certainly be possible to load cores the other-other way.
since it's almost certainly not an SoC FPGA and instead a regular FPGA then handling all the CD data and other goodies on this has to take up logic space on the FPGA pretty much or be offloaded to a separate IC which can add complexity. It's all a tradeoff.
OpenFPGA is currently contingent upon the dual FPGA + microcontroller design of the Pocket. They would probably have to make significant changes to their framework and code to work with the new platform, and I doubt they would want to do it because it's competing with themselves.
Orange Pi.
I already had main MiSTer working nicely on an Opi / Allwinner SoC.
It can talk to the FPGA side via high-speed SPI, or a whole 8-bit bus.
So I wouldn't even need to attempt anything like OpenFPGA itself.
Again, though, likely only DDR3 or DDR4 on the Ana 3D.
And probably not quite enough cart slot pins for SDRAM and everything else, but we'll see.
(plus the fact the buffers on the cart slot would likely make it hard to have SDRAM working at any reasonable speed.)
Honestly, I'm even contemplating sacrificing the whole cart slot, and designing a PCB that can replace the buffer chips.
Don't really need a cart slot, when you can load from SD card.
Even though I was trying to champion adding a cart slot to MiSTer, a few years back. lol
1.432 MBytes of on-chip mem on the Ana 3D, probably.
About twice that of the DE10.
Still not a massive amount, but super useful.
Trying to fix the vert lines on the core.
Probably just what's highlighted there.
x_ps is the current screen pixel X coord.
But it's incrementing that at the same time as asserting DDRAM_WE, to write the pixel to the framebuffer.
x_ps (and y_ps) drive the whole texturing logic.
So x_ps has incremented by 1, by the time the DDRAM_WE pulse happens.
"Write Enable"
Don't know why the lines don't appear on the sim, but things work a bit differently there.
Shoving the fb_we (framebuffer write) stuff into the previous state makes things worse...
Unless you like 1990s wireframe movies.
I'll just add an extra state, and I think that will fix it.
Every extra clock cycle makes it render slower, but hey.
Good enough.
Larger param cache and CB cache (mostly) fixes the face...
Oh, jesus.
reddit - the hive of everything good about the World.
lol
tbf, most of the mentions I ever got on there were good.
Thanks, btw. It looked even better when I first got Gouraud shading into the sim...
Before...
After...
(minus the Offset colour, which is what gives it fake specular highlights.)
An example of the Tag buffer.
Can just about visualize a car at the top.
With it facing away from the camera.
Chonky wheels.
0x94 = All of these pixels belong to this specific triangle/prim, with this specific texture / shading values.
For the road. lol
I really wish people would make it clear it's not MY core.
It's very much Srg320's Saturn core, with some very tiny tweaks.
I do strive to mention it a lot, as you know. :p
I'm gonna have to read that reddit now, aren't I. lol
Oh, OK. Not too much said yet.
it won't matter, the YouTube hype merchants have you listed as doing it
I can't disagree with the reply about there not being many games for ST-V, at least not exclusives.
lol. True.
I can't really keep all projects as private every time, though.
It's no fun for me, unless I see other people enjoying reading about it.
Support the Channel
https://www.patreon.com/PixelCherryNinja
Join the Pixel Cherry Ninja Gaming Discord
https://discord.gg/aUFwCV6SCc
Special thanks to channel Patreon's
SmoMo
RGM @rgm4646
Marc Nuernberger
So many people contribute towards FPGA gaming so that we can enjoy our most precious games on devices such as the MiSTer FPGA and the Anal...
timestamped your bit !
Was watching part of it again earlier.
I missed some of the last bit the other day.
It's amazing to read what's going on. I don't have a clue what any of it means, but it looks impressive:)
Starting on the tile writeback buffer.
The buffer itself contains 1,024 pixels, each 32-bit.
(8-bit Alpha, plus 24-bit colour)
The Alpha will be needed for proper pixel blending later.
The final framebuffer in VRAM is (usually) only 16BPP.
So each 32-bit word written back to VRAM can hold two 16BPP pixels (very often in 565 format).
So I had to organize the tile output buffer as 512 words of 64-bit.
The tile buffer itself also holds two (32-bit ARGB) pixels per 64-bit word.
That means internally, it will ofc write only one ARGB pixel at a time to the buffer.
yep
But will then be able to Burst write TWO pixels at once into VRAM, taking only 512 clock cycles instead of 1,024.
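A software model of that buffer organisation, as a Python sketch. The 512 x 64-bit layout and the one-pixel-in, two-pixels-out behaviour are from the chat; putting even pixels in bits [63:32] and odd pixels in bits [31:0] is an assumption, and the names are made up.

```python
def pixel_to_word(x, y):
    """Map a pixel inside a 32x32 tile to (word address, odd flag)."""
    pix_addr = ((y & 0x1F) << 5) | (x & 0x1F)  # 10-bit pixel index
    return pix_addr >> 1, pix_addr & 1         # 9-bit word address, 0 or 1


class TileBuffer:
    """512 words of 64 bits, holding two 32-bit ARGB pixels each."""

    def __init__(self):
        self.words = [0] * 512

    def write_pixel(self, x, y, argb):
        addr, odd = pixel_to_word(x, y)
        # Assumption: even pixels use bits [63:32], odd pixels use bits [31:0].
        shift = 0 if odd else 32
        keep_mask = 0xFFFFFFFF << (32 - shift)  # the half we are NOT writing
        self.words[addr] = (self.words[addr] & keep_mask) | ((argb & 0xFFFFFFFF) << shift)
```

Writing fills one half of a word at a time; reading back walks the 512 words in order, so each word yields two pixels for the burst write to VRAM.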
It's just strange that most games only use a 16BPP framebuffer? I guess to save on VRAM and bandwidth mainly?
just to save vram
some games use 24bpp
some, even more rare, 32 bpp
24bpp is for example Soul Calibur
which also does other nice tricks, like YUV VQ
Ahh, OK. So some might use 24BPP for title screens etc?
and I love chatting with you about it
keep posting away!
MiSTer FPGA Documentation site built using Material for MkDocs. - MiSTer-devel/MkDocs_MiSTer
Multiple writes can also be issued without pausing when DDRAM_BURSTCNT = 1.
Wha?
Didn't know that. lol
Years ago, you had to explicitly set Burstcnt > 1, to well, burst more Words at a time.
That's confusing.
I don't get why it would need Burst Count at all, if it will just accept multiple writes quickly?
I suppose it does matter...
The DDR controller will accept a certain number of contiguous Writes, until its command FIFO fills up.
But it probably won't be doing actual Burst transfers to DDR unless Burstcnt > 1.
So it will assert DDRAM_BUSY much sooner, unless you're using actual Bursts.
Or something.
I've written most of the Tile writeback logic now.
And double-checked stuff with both ChatGippity and Claude AI.
But both of them are quite dumb.
They make some very obvious mistakes, but I get that's how the whole "large language model" works.
It's just strange, that most of them seem to know EXACTLY what's wrong (and why it's wrong), as soon as you point it out. lol
I think AI language models will rapidly get "smarter" in a very short time. It's already scary.
I just wish they didn't use the term "AI" to describe it.
I think what we used to know as real AI, would now be called Artificial general intelligence (AGI).
The internal DDR3 controller handles writes very efficiently, so burst writes are typically not required.
I mean, OK.
If the controller is handling Burst writes (on the DDR chip side) when you shove a lot of data in a contiguous chunk, then what's the point of keeping the Burstcnt input? lol
Tile ARGB buffer added.
Vertical lines fixed again.
It was a bit more involved than I expected.
No Alpha stuff being handled yet.
But the buffer does store ARGB values (8-bit Alpha, 24-bit colour).
It currently always converts to 16BPP (565), and writes two pixels at once (from the buffer) to VRAM.
Trying to do a Quartus compile with the changes, but it's already struggling.
No quick way to tell if Quartus will definitely infer a BRAM from the logic.
If it tries to use registers as "memory", it usually won't fit.
Even a relatively small chunk of memory in registers is super wasteful.
I left it running while I watched some YT vids.
It's still going, almost two hours later. That's when you KNOW it won't fit. lol
Added a tile ARGB buffer viewer thingy.
Just haven't figured out how to apply the texture sampler properly, to disable the linear sampling / blur.
// Create texture sampler
D3D11_SAMPLER_DESC sampDesc;
ZeroMemory(&sampDesc, sizeof(sampDesc));
//sampDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
sampDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_POINT; // point sampling, so no linear blur
sampDesc.AddressU = D3D11_TEXTURE_ADDRESS_WRAP;
sampDesc.AddressV = D3D11_TEXTURE_ADDRESS_WRAP;
sampDesc.AddressW = D3D11_TEXTURE_ADDRESS_WRAP;
sampDesc.MipLODBias = 0.f;
sampDesc.ComparisonFunc = D3D11_COMPARISON_ALWAYS;
sampDesc.MinLOD = 0.f;
sampDesc.MaxLOD = 0.f;
g_pd3dDevice->CreateSamplerState(&sampDesc, &g_pTileSampler);
Found some example code that renders polys, and applies the sampler.
g_pImmediateContext->PSSetSamplers(0, 1, &g_pSamplerLinear);
PSSetSamplers = Pixel Shader.
g_pd3dDeviceContext->PSSetSamplers(0, 1, &g_pTileSampler);
g_pd3dDeviceContext->UpdateSubresource(p_tile_tex, 0, NULL, tile_ptr, tile_tex_width*4, 0);
No idea what I'm doing.
hr = g_pd3dDevice->CreateSamplerState( &sampDesc, &g_pSamplerLinear );
Don't know what to use to set the sampler thingy, for a 2D texture.
I'll have to ask ChatGippity.
Surprisingly hard to find example code online, for something as "trivial" as displaying a texture using dx11.
It's taking over an HOUR to compile the core now. lol
Using 85% of the FPGA, and that's with the precision for the interp stuff reduced even further.
And yep, it's not inferring a BRAM properly, for the tile ARGB buffer.
I can fix that.
module tile_argb_mem (
    input clock,
    input [8:0] addr,
    input [63:0] din,
    input [1:0] be,      // be[1] = upper 32 bits, be[0] = lower 32 bits
    input we,
    output reg [63:0] dout
);
    reg [63:0] buff [0:511];   // 512 x 64-bit = 1,024 ARGB pixels
    always @(posedge clock) begin
        if (we) begin
            if (be[1]) buff[ addr ][63:32] <= din[63:32];
            if (be[0]) buff[ addr ][31:00] <= din[31:0];
        end
        dout <= buff[ addr ];
    end
endmodule
wire [9:0] pix_in_addr = {y_ps[4:0], x_ps[4:0]};   // pixel index within the 32x32 tile
wire [8:0] buff_addr = wb_active ? wb_word_cnt : pix_in_addr[9:1];
wire [1:0] buff_be = (!pix_in_addr[0]) ? 2'b10 : 2'b01;   // even pixel -> upper half, odd pixel -> lower half
wire [63:0] buff_dout;   // declared before the 565 conversion that uses it
wire [15:0] pix0_565 = {buff_dout[55:51], buff_dout[47:42], buff_dout[39:35]};
wire [15:0] pix1_565 = {buff_dout[23:19], buff_dout[15:10], buff_dout[07:03]};
assign twopix_out = {pix0_565, pix1_565}; // 32-bit Word, to write to the VRAM framebuffer.
tile_argb_mem tile_argb_mem_inst (
    .clock( clock ), // input clock
    .addr( buff_addr ), // input [8:0] addr
    .din( {argb_in, argb_in} ), // input [63:0] din
    .be( buff_be ), // input [1:0] be
    .we( wr_pix ), // input we
    .dout( buff_dout ) // output [63:0] dout
);
And I swear I didn't use AI for any of that. lol
Or, I could just stop being lazy, and instantiate an altsyncram.
Compile time back down to ~33 minutes.
Render looks a bit better than the last one, as it's not interleaved with the frame from reicast any more.
But it's still replicating every pair of pixels atm, which is giving it the chunky look.
And yes, the vertical lines are back. lol
It's just due to data delays.
I added the new tile buffer thing, and now there is a delay again.
Right now, I'm trying to get it to display the full horizontal resolution, then I'll fix the vert lines later.
The good news is, the Tile ARGB buffer and VRAM writeback seems to work fine so far.
It's already noticeably faster to render the above frame.
Maybe in the range of ~200-220ms ?
Most of the other images take just over 1 second to render.
But this is only at 16 MHz again.
And no Burst reads for textures yet, which should get the next BIG speed-up.
If I can just get the frame times low enough for say 15-20 FPS, I can try doing some anims on the FPGA.