#Sega Dreamcast


rain obsidian
#
Connect your Jaguar and Jaguar CD and insert an unencrypted Jaguar CD. Set the switch of the CD so that the CD uses the developer BIOS. Set your Jaguar so that it uses the Stubulator BIOS. Now turn on Jaguar while holding the 'C' button of the joypad.
#

oic now.

#

It's a dev thingy. Apparently lets you read unencrypted CD-Rs, but I'm not sure if the CD dev BIOS is supposed to display anything without the drive attached?

#

I think the Stubulator BIOS might be the thing that lets you upload small chunks of code to RAM, via one of the joyports, but I could be wrong.

#

If you load the Stubulator as a BIOS, then load the Jag Dev BIOS like a Cart, then hold C during loading, it shows the normal Jag boot logo.

#

But then nothing else.

#

Not sure how that works, I just thought it might give some more useful debug info.

#

This new core seems WAY more stable, btw.

#

Doesn't necessarily run AvP reliably, but oh well. lol

#

I NEED to find that code. ^

#

Unfortunately, that isn't likely to have any low-level details.

plain thorn
#

You may need a Time Machine to 2003 when that post was made

rain obsidian
#

But I think I just found the CD BIOS source code. 😉

#

I won't post the direct link, as I'm not sure about some of this stuff.

#

But there is the "Atari Jaguar ~ Code & Dev" group on Facebook.

#

This is... the whole thing.

#
;  We got power--let's start the boot up again
;
gotpow:
    lea    davesobj,a0
    moveq    #-1,d0
    move.w    d0,$c(a0)    ;turn off the CD module arse
    move.w    d0,$4c(a0)
#

So very British.

#
    move.w    #$8000,d1
    lea    BUTCH,a0
chkCDpow:
    move.l    (a0),d0
    andi.w    #$2000,d0
    bne.s    powok
    dbra    d1,chkCDpow
;
;  timed out..
;
powBAD:
    move.l    #$20000,BUTCH    ;reset CD module
    move.l    #0,BUTCH+4    ;clear DSA

    moveq    #-1,d0        ;exit with error
    rts
powok:
    move.w    DS_DATA(a0),d0
    cmpi.w    #$7001,d0    ;this indicates proper function
    bne    powBAD        ;if not--put up power message
#

A mask (andi) of 0x2000 would be bit 13.
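A rough Python sketch of that check, just for illustration. Only the 0x2000 mask comes from the BIOS listing; the constant and function names are made up here.

```python
# Hypothetical names; only the 0x2000 mask is taken from the BIOS listing.
RESPONSE_PENDING_MASK = 0x2000  # bit 13

def bit_index(mask):
    """Return the bit position of a single-bit mask."""
    assert mask != 0 and mask & (mask - 1) == 0, "expects exactly one bit set"
    return mask.bit_length() - 1

def response_pending(butch_status):
    """Mimic 'andi.w #$2000,d0 / bne': true when bit 13 is set."""
    return (butch_status & RESPONSE_PENDING_MASK) != 0

print(bit_index(RESPONSE_PENDING_MASK))  # position of the masked bit
```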

#
--x- ---- ---- ---- Response from CD drive pending
#

Strange how it seems to write Long words to the CD drive.

#
    move.l    #$180000,BUTCH    ;set lid-up & cart-pull reset
;    move.l    #$80000,BUTCH    ;set lid-up & cart-pull reset
#
    cmp.l    pad_now,d1    ;check joystick
    bne.s    .noxchk        ;br if A * # not pressed together
#

They had a CD-ROM emulator thingy, which ran on the Atari Falcon.

#

Wonder why it checks for A * # to be pressed?

#
;****************************************************************
;        CD_setup                    *
;                                *
; This call MUST be used to initialize the CD system before ANY    *
; other calls can be made                    *
;                                *
;    Input:  NADA                        *
;                                 *
;                                *
;    Uses: NADA                        *
;                                *
;                                *
;    Returns: NADA                        *
;                                *
;****************************************************************

setup:
    move.l    #$180000,BUTCH    ; enable BUTCH
;    move.l    #0,BUTCH    ; enable BUTCH
    move.l    #$10000,DSCNTRL    ; enable DSA
    move.l    #7,I2CNTRL    ; Enable I2S
    move.l    #1,I2CNTRL    ; Enable I2S
    move.w    #$7001,DS_DATA    ; Set non oversampled audio
    rts
#

MAME...

#
0x70 Set DAC mode (?)
#

So now we know what the 0x7001 is.

#

The upper byte is the command, the lower byte is the param, or can toggle bits etc.
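To make the split concrete, a tiny Python sketch (the function name is invented for illustration, and it assumes every DSA word is a plain 16-bit command/param pair):

```python
def split_dsa_word(word):
    """Split a 16-bit DSA word into (command byte, parameter byte)."""
    word &= 0xFFFF
    return (word >> 8) & 0xFF, word & 0xFF

cmd, param = split_dsa_word(0x7001)
print(f"cmd=0x{cmd:02X} param=0x{param:02X}")
```

So 0x7001 would be command 0x70 with a param of 0x01.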

#

Need to figure out when an Irq is triggered.

#

ie. if it's after every type of command, or only some.

#

I would imagine the IRQ line gets cleared again, once the IRQ register (0x00) gets read.

#
; Internal use only
; waits for transmit to occur


DSA_tx:                ; set up as a polling loop
;    move.w    #$1000,d1    ; This is VOODOO
    move.w    #0,d1        ; This is VOODOO (new voodoo!)
.delay:
    dbra    d1,.delay
    move.l    BUTCH,d1    ; get Butch's ICR into d1
    and.l    #$1000,d1    ; mask for d12, DSA TX Intr. pending
    beq.b    DSA_tx        ; nothing here yet, so wait for bit to set 
    move.l    DSCNTRL,d1    ; read here to clear interrupt flag
    rts
#
---x ---- ---- ---- Command to CD drive pending
#
;  bit12 - Command to CD drive pending (trans buffer empty if 1)
;  bit13 - Response from CD drive pending (rec buffer full if 1)
#
; Clear pending DSA interrupts

    move.w    DS_DATA,d0
    move.l    DSCNTRL,d0
#

MAME...

#
                case 0x70: // Set DAC Mode
                    m_butch_regs[0] |= 0x2000;
                    m_butch_cmd_response[0] = 0x7000 | (m_butch_regs[offset] & 0xff);
                    m_butch_cmd_index = 0;
                    m_butch_cmd_size = 1;
                    break;
#

So that's setting bit 13 (Response pending) in the IRQ reg.

#

Then sends back the response as 0x7000, and copies the lower byte from the existing DSA reg value.
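Here is the same behaviour as a small Python model. It only mirrors the MAME snippet above (not real hardware), and the function and constant names are made up.

```python
RESPONSE_PENDING = 0x2000  # bit 13 of the IRQ reg (offset 0x00)

def set_dac_mode(irq_reg, dsa_reg):
    """Model of MAME's 'case 0x70' handling.

    Returns (new_irq_reg, response_word).
    """
    new_irq = irq_reg | RESPONSE_PENDING       # flag a pending response
    response = 0x7000 | (dsa_reg & 0xFF)       # echo the param byte back
    return new_irq, response

irq, resp = set_dac_mode(0x0000, 0x7001)
print(hex(irq), hex(resp))
```

If the BIOS wrote 0x7001, the response read back would be 0x7001 again, which would line up with the `cmpi.w #$7001,d0` check in the boot code, if I'm reading it right.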

#

DSA_tx also waits until bit 12 is set.

#

So I've tried setting both bits (0x3000) in the IRQ reg, to see if it gets any further.

#

MAME doesn't have much actual code added, as far as I can see.

#

But it still forces the CD BIOS to start sending proper commands...

#
   // 0x15 Set Mode
   // 0x17 Clear Error

                case 0x15: // Set Mode
                    m_butch_regs[0] |= 0x2000;
                    m_butch_cmd_response[0] = 0x1700 | (m_butch_regs[offset] & 0xff);
                    m_butch_cmd_index = 0;
                    m_butch_cmd_size = 1;
                    break;
#

So, most commands do set bit 13 (Response pending).

#

Doesn't look like the MAME code even hooks up the IRQ for the CD stuff.

#

So the BIOS must just poll the regs, most of the time?

#

Oh

#

DSA bus. ^

#

So that's what the Butch chip talks to.

#

And it receives the data from the SAA7345 decoder via I2S.

#

Down the DSA rabbit hole

#

So now we know.

#

Used in a lot of "Audiophile" CD player projects.

#

They have controller boards on AliExpress for it.

#

Don't really need to know as low-level as this, but it helps us understand what the CD BIOS is sending.

#

And I don't think the Butch chip is going to be overly complex with the command handling, it probably just forwards the commands almost as-is, to the cdt612 chip.

#

(plus it has a couple of FIFOs for the I2S data, both for data reading, and for audio playback.)

rain obsidian
#

I think some of the reg reads are just duplicates.

#
    // READ mux...
    case ( {cpu_addr[5:2],2'b00} )
        //6'h00: butch_dout = butch_irq_reg;            // 0x00  IRQ reg.
        6'h00: butch_dout = 16'h2000;                // 0x00  IRQ reg. TESTING !!
        
        //6'h02: butch_dout = butch_reg_02;            // 0x02  Unknown? (CD BIOS seems to write to this offset?)
#

ie. I need to drop the two lower bits of cpu_addr.

#

So it reads register 0x00 (IRQ reg), even if the cpu_addr is 0 or 2, etc.

#

'cos MAME is reading from offset DFFF02, which doesn't make much sense, vs the actual MAME code.

#
void jaguarcd_state::butch_regs_w(offs_t offset, uint32_t data, uint32_t mem_mask)
{
    COMBINE_DATA(&m_butch_regs[offset]);

    switch(offset*4)
    {
        case 8: //DS DATA
#

Hard to explain.

#

I think it's just when the 68K is reading from the different Byte offsets.

#

But the Butch chip will ignore the two lower bits of the address.

#

So reading from address 0,1,2,3 still gives you "register 0".

#

Same for reading from 8,9,A,B, which gives you "register 8".
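A quick Python sketch of that decode, assuming Butch really does ignore address bits 1:0 and only looks at bits 5:2 (that part is my guess from the behaviour, and the function name is invented):

```python
def butch_reg_offset(cpu_addr):
    """Keep address bits 5:2 and zero bits 1:0, like {cpu_addr[5:2], 2'b00}."""
    return cpu_addr & 0x3C

for addr in (0x0, 0x2, 0x8, 0xA, 0xB, 0x2E):
    print(hex(addr), "->", hex(butch_reg_offset(addr)))
```

With that mask, a write to offset 0xA and a write to offset 0x8 land on the same register.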

#

So the comment in MAME is slightly "wrong".

#

The CD BIOS often writes the command to Offset 0xA.

#

But the actual register address is at base 0x8.

#
6'h08: butch_cmd_reg            <= cpu_din;    // 0x08  DSA TX/RX (Command) reg.
#

Since reading/writing addr 8,9,A,B, will still decode as offset 8, when doing this...

#
case ( {cpu_addr[5:2],2'b00} )
#

That's why it's also at "case: 8" in the MAME code.

#

Also means there probably is no separate register 0xE.

#
6'h2c: butch_reg_2c            <= cpu_din;    // 0x2c  Unknown? (used at start-up)
6'h2e: butch_reg_2e            <= cpu_din;    // 0x2e  Unknown? (also written/read during MAME start-up?)
#

An access of C,D,E,F, still reads/writes the same reg.

#

So then we have these...

#
case ( {cpu_addr[5:2],2'b00} )
    6'h00: butch_irq_reg         <= cpu_din;    // 0x00  IRQ reg.
    6'h04: butch_dsa_cont        <= cpu_din;    // 0x04  DSA Control reg.
    6'h08: butch_cmd_reg         <= cpu_din;    // 0x08  DSA TX/RX (Command) reg.
    6'h10: butch_i2s_cont        <= cpu_din;    // 0x10  I2S bus control.
    6'h14: butch_sub_cont        <= cpu_din;    // 0x14  CD subcode control.
    6'h18: butch_sub_rega        <= cpu_din;    // 0x18  Subcode data Reg A.
    6'h1c: butch_sub_regb        <= cpu_din;    // 0x1c  Subcode data Reg B.
    6'h20: butch_sub_time        <= cpu_din;    // 0x20  Subcode time and compare enable.
    6'h24: butch_i2s_fifo        <= cpu_din;    // 0x24  I2S FIFO data.
    6'h28: butch_i2s_fifo_old    <= cpu_din;    // 0x28  I2S FIFO data (old)
    6'h2c: butch_reg_2c          <= cpu_din;    // 0x2c  Unknown? (used at start-up)
    default: ;
endcase
#

Reg 0x0c is missing. I'll add one anyway.

forest echo
#

Great work getting the BIOS booting! Am just catching up now

#

I think the Jag CD BIOS may be open source like the Jag one, wonder if someone can confirm

forest echo
#

So what would need to be added to the core to get it to boot the Jag CD BIOS without it being altered, and in a way where down the road it could be used for the core to boot Jag CD games?

rain obsidian
#

I started adding that option, to switch it to reading the Cart ROM as 8-bit wide.

#

But it's not quite done yet.

#

It's probably easier just to patch those four bytes on-the-fly, like it does for the Cart Checksum thing on the BIOS.

forest echo
#

What would be the best/cleanest/most accurate implementation, or is that not a valid question in this situation?

rain obsidian
#

ie. when it's actually loading the Cart ROM.

#

But then you don't really want to force stuff to modify the ROM, without good reason.

#

So might as well just keep the menu option.

#

After me saying the menu text needs to be kept short, I still made it a bit too long. lol

#

Also... it doesn't work yet. sigh

#

The curse of FPGA dev.

#

After messing with the DC sim again yesterday, and Quartus, it made me realize just how easy it is to screw up the code.

#

Even with the very best of intentions, and taking notes etc.

#

So it's really not practical to have to keep doing Quartus compiles for every small tweak.

#

But also not so easy to simulate the whole MiSTer framework and ARM stuff.

#

I'll try to debug this now. It's probably me decoding the address wrong.

#
    "O2,Cart Checksum Patch,Off,On;",
    "OH,Patch Cart ROM 32-bit,Off,On;",
#

My fault. Broke it.

#

Old...

#
assign cart_q = (!abus_out[2]) ? DDRAM_DOUT[63:32] : DDRAM_DOUT[31:00];
#

New...

#
assign cart_q1 = cart_patch_32_cs ? 32'h04040404 :        // Patch the Cart ROM, to force reading as 32-bit wide (Jag CD BIOS, etc.)
                                    cart_read_32;        // Else, allow normal reading of Cart data (as 32-bit wide).
#

I'm using cart_q1, which is the wrong reg.

#

Which gets used with cart_q, to see when the data has changed.

#

This code can probably be deleted.

#
cart_diff <= cart_q1 != cart_q;
#

Yep, I think that should fix it. Compiling again. sigh

rain obsidian
#

That was very confusing.

#

It turns out the Cart ROM is loaded to/from SDRAM now.

#

But still being loaded into DDR3 at the same time.

#

Plus, I was playing some music. I bought a second-hand AVR for my Nephew, so I've been "testing" it. lol

crude bloom
#

that was the change for single ram

rain obsidian
#

Yep, but I think the Cart was still in DDR3, until recently. So it confused me. lol

#
wire [31:00] cart_q_sdram;    // Cart data from SDRAM.

wire cart_patch_32_cs = (abus_out[23:0]>=24'h800400 && abus_out[23:0]<=24'h800403 && status[17]);  // Patch 0x400-0x403.

assign cart_q = cart_patch_32_cs ? 32'h04040404 :        // Patch the Cart ROM, to force reading as 32-bit wide (Jag CD BIOS, etc.)
                                    cart_q_sdram;        // Else, allow normal reading of Cart data (as 32-bit wide).
crude bloom
rain obsidian
#

Works with the "unpatched" JagCD BIOS now.

#

But you need to have both of those options turned on, before loading it.

#

And you have to load the Jag CD BIOS as if it's a Cart.

#

Jag CD won't load any ISOs yet, ofc. This is just a test.

#

Hence why I didn't put "Jag CD" in the RBF filename.

#

AvP doesn't run for me, on that build.

#

I didn't change any of the Quartus compilation settings, but I do have SignalTap enabled.

maiden granite
rain obsidian
#

That version of the Jag CD BIOS also boots, but the Development one won't.
I don't think the dev thing is even a "cart" ROM. I think it only works with something like that Stubulator BIOS.

rain obsidian
#

I tried all sorts before, swapping things around, and trying to save a few extra clock cycles.

#

It's a shame the FPGA doesn't just have about 8MB of on-chip SRAM, then it could even run N64 with only a cart hooked up.

dense shard
#

How much sram does it have

maiden granite
#

a lot less than that

#

a6 is the one in the mister

rain obsidian
#

Only around 600-700 KBytes, basically.

#

And the framework uses a small chunk of that.

#

That new menu option, btw, it's probably OK to just leave it on.

maiden granite
#

what new menu option?

rain obsidian
#

AFAIK, the core can only boot Cart ROMs that are in 32-bit wide mode anyway, so those will already have the 0x04040404 thing.

#

@maiden granite #1315008851874938948 message

#

It's just to get the Jag CD "BIOS" to run atm. Nothing fancy, no actual ISOs can be loaded yet.

#

Quite a lot of work to do before that happens.

maiden granite
#

yeah seems a bit like putting the cart before the horse

rain obsidian
#

Hence why I said it was "for testing". lol

#

Literally all the menu option does, is force those four bytes to 0x04040404 during reading.
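In Python terms, the patch logic is roughly this. It's a sketch mirroring the cart_patch_32_cs Verilog above; the function and constant names are made up.

```python
PATCH_START = 0x800400   # first patched address (Cart offset 0x400)
PATCH_END = 0x800403     # last patched address
PATCH_VALUE = 0x04040404 # header value that selects 32-bit wide reads

def cart_read(addr, rom_word, patch_enabled):
    """Return the 32-bit word the core would see for a Cart read."""
    addr &= 0xFFFFFF  # 24-bit address, as in abus_out[23:0]
    if patch_enabled and PATCH_START <= addr <= PATCH_END:
        return PATCH_VALUE
    return rom_word   # everything else passes through untouched

print(hex(cart_read(0x800400, 0x00000000, True)))
```

Everything outside 0x800400-0x800403 is read as normal, so the ROM itself is never modified.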

#

But ofc most games will have that already set.

#

The Jag CD "BIOS" has it set to all 0x00, because it uses an 8-bit wide ROM.

#

Ideally, the core should be tweaked, to support 8-bit, 16-bit, 32-bit mode.

#

Which would probably get more Homebrew running.

#

Iron Soldier 2 is doing that atm, which seems to be stuck in a routine in RAM.

#

That seems to be the same code, in MAME.

#

Seems to be called from here...

#

Maybe something to do with IRQ2 not working in the game?

#

And looks like it essentially loops, waiting for the data in RAM at 0x44F4 to change.

#

Which again suggests an IRQ handler isn't quite right, as it looks like that's the only way it could exit that loop.

#

Might possibly be related to the Vertical Int.

#

Not sure it's worth it. lol

#

I need a break.

rain obsidian
#

AvP on this core, just before it crashes.

#

Would be nice to figure out what fails.

forest echo
rain obsidian
#

No, I don't know what is stopping that game from working.

#

Nobody knows for sure which part of AvP causes it not to work on certain core builds either.

forest echo
#

Ah OK

rain obsidian
#

But it must be a marginal timing thing, as you can have the same exact code between builds, or tweak one small setting, and AvP no longer runs, or has more graphical glitches.

forest echo
#

Would it be hard/costly to support 8-bit, 16-bit, 32-bit modes?

rain obsidian
#

Investigating why Iron Soldier 2 doesn't boot, is probably even harder.

#

Not too hard, no.

#

I'm just personally doing too many things at once again.

#

Also, the 32-bit thing I already added, might be enough to force some stuff to work anyway.

#

Hard to explain.

#

eg. if a specific game or Homebrew isn't booting, for the same reason as the Jag CD "BIOS" (because its cart header is set to 8-bit or 16-bit reads), then the new menu option should help.

#

There is something else going on with games like Iron Soldier 2 that's stopping it running, though.

floral vale
#

I can't wait to some day play Shenmue on an FPGA

forest echo
#

I am not sure anyone has collated a pack of the various Jag homebrew out there, that could be a good thing to get and test

rain obsidian
#

Back onto Dreamcast today.

#

But it's kicking my butt, again.

#

I thought I'd done the git commit from when the core was doing half-decent renders.

#

But apparently not.

#

I don't know why that keeps happening, tbh.

#

I could swear I did the commit after the last decent renders.

#

And I can't seem to remember exactly what fixed it last time.

rich kindle
#

Hmm. Compilation variance? Just like AVP on the Jag? One build works, the next fails?

rain obsidian
#

No, it's just my dumb coding, and forgetting to do a git commit far more often. lol

#

FPGA code back in the sim again.

#

Around 5.4 FPS in the sim, when it's not using the Codebook cache.

#

As I still need to fix that, so it will hopefully fit on the FPGA.

#

The Codebook cache is currently written in a way that can't easily be inferred to use Block RAMs.

#

Which means Quartus tries to implement it using registers, which uses an insane amount of logic.

#

(Most FPGAs are based on SRAM cells for the configuration bits, which tell it how to hook up the gates / logic chunks / ALMs. Some of those SRAMs are reserved to use as normal SRAM, for use as ROM / RAM / Cache, etc.)

#

The sim obviously doesn't care about logic "usage", as Verilator just turns the Verilog into a C-code logical model.

#

With the Codebook reading bypassed, it's doing nearly 14 FPS (single-frame time) in the sim...

#

Which is still quite slow, for some reason.

#

Oh yeah, latency.

#

I'm doing a very very rough approximation of DDR latency in the sim atm.

#
reg [2:0] valid_d;
always @(posedge clock or negedge reset_n)
if (!reset_n) begin
    valid_d <= 3'd0;
end
else begin
    valid_d <= {valid_d[1:0], isp_vram_rd};
end

wire vram_valid = valid_d[2];
#

Just a shift register, basically.

#

Which adds a delay between when isp_vram_rd pulses high, and when vram_valid goes High.

#

atm, that's set to only three clock cycles (isp_vram_rd shifts through valid_d, bits 0, 1, 2).
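The same delay line as a cycle-by-cycle Python model, in case it helps to see the timing. This is only a sketch of the shift register above; the function name and the `depth` parameter are mine.

```python
def simulate_valid_delay(rd_pulses, depth=3):
    """Clock a 'depth'-bit shift register; return vram_valid seen each cycle."""
    valid_d = 0
    mask = (1 << depth) - 1
    seen = []
    for rd in rd_pulses:
        # vram_valid is the top bit of the register before this clock edge.
        seen.append((valid_d >> (depth - 1)) & 1)
        # On the clock edge, shift left and bring in isp_vram_rd at bit 0.
        valid_d = ((valid_d << 1) | (rd & 1)) & mask
    return seen

print(simulate_valid_delay([1, 0, 0, 0, 0]))           # depth 3
print(simulate_valid_delay([1, 0, 0, 0, 0], depth=1))  # depth 1
```

A single read pulse in cycle 0 shows up as valid in cycle 3 with the 3-bit register, or in cycle 1 if the register is cut down to one bit.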

#

The DDR on the DE10 is currently taking about 6-7 clock cycles for every single Word, atm.

#

And that's with the core running at only 16 MHz (again).

#

So you can imagine how many more clock cycles it would waste, once the core is running at 50 MHz or more. lol

#

It NEEDS to make use of DDR Burst transfers, to get anywhere close to reasonable frame rates.

#

It's a bit frustrating, as I know it's possible now, to get the rest of the logic rendering fast.

#

If I set the fake latency from 3 cycles down to 1 clock cycle, the "frame rate" goes from 13.8 to 18.8.

#

If I bypass Codebook and texel reading altogether, it jumps to 22.62 FPS.

#

26 FPS, if I skip a whole isp_state.

#

Not quite sure why it's not closer to 60 FPS, tbh.

#
        52: begin
            if (prim_tag_out_old != prim_tag_out) begin        // Check to see if the Tag has changed...
                prim_tag_out_old <= prim_tag_out;
                
                if (tex_base_word_addr_old != tcw_word_out[20:0]) begin    // Check to see if the texture BASE address has changed...
                    tex_base_word_addr_old <= tcw_word_out[20:0];
                    // isp_inst[25]=texture.  tcw_word[30]=vq_comp.
                    if (isp_inst_out[25] && tcw_word_out[30]) begin    // Check if VQ compressed.
                        read_codebook <= 1'b1;                        // If so, read the new Codebook.
                        isp_state <= 8'd100;
                    end
                end
            end
            else begin        // Tag has not changed, but check if the new pixel is flat-shaded/Gouraud, or textured...
                if (y_ps[4:0]==5'd31 && x_ps[4:0]==5'd31) begin    // On the last (lower-right) pixel of the tile...
                    tile_accum_done <= 1'b1;    // Tell the RA we're done.
                    isp_state <= 8'd0;            // Back to idle state.
                end
                else begin
                    x_ps[4:0] <= x_ps[4:0] + 5'd1;            // Inc x_ps[4:0].
                    if (x_ps[4:0]==5'd31) y_ps[4:0] <= y_ps[4:0] + 5'd1;                    
                    if (isp_inst_out[25]) begin            // If texture flag is set...
                        isp_vram_rd <= 1'b1;            // Read texel...
                        isp_state <= isp_state + 8'd1;
                    end
                    else begin    // Flat-shaded or Gouraud, no need to read a Texel Word...
                        isp_state <= 8'd54;
                    end
                end
            end
        end
#

Doing quite a lot in one state now.

#

Iterating through each pixel in turn.

#

Well, reading each Tag.

#

And the Tag tells you which triangle/prim the pixel relates to.

#

So it has to load the triangle params for each pixel very quickly, from the param buffer.

#

(which happens in the previous state, but I can't paste that much at once on here.)

#

If the Tag number changes, then we know we need to check to see if the texture (base addr) has changed.

#

And if it's also using VQ-compressed textures, we need to read the new Codebook.

#

If it's not textured, we still need to write the pixel to the framebuffer, obviously, but skip the texture read.

#

So the pixel must then be either flat-shaded or Gouraud shaded.

#

All of these flags exist in the ISP/TSP/TCW words, for each triangle/prim

#

I'm sure I still have some off-by-one stuff, regarding how I increment to the new row, or check for the last pixel etc.

#

But I got rid of the horiz lines earlier, by fixing one of those bugs.

#

It still has the vertical lines atm. That must be delays in the params and texture stuff updating.

#

The Codebook on the FPGA (and sim) right now is just a basic chunk of BRAM, not a proper cache...

#
// VQ Code Book. 256 64-bit Words.
reg [63:0] code_book [0:255];
reg [8:0] cb_word_index;

always @(posedge clock or negedge reset_n)
if (!reset_n) begin
    cb_word_index <= 9'd256;
end
else begin
    // Handle VQ Code Book reading.
    if (read_codebook) begin
        cb_word_index <= 9'd0;
    end
    else if (codebook_wait) begin
        if (vram_valid) begin
            code_book[ cb_word_index ] <= vram_din;
            cb_word_index <= cb_word_index + 9'd1;
        end
    end
end

assign codebook_wait = !cb_word_index[8];
#

Which means it has to read 256 Words (64-bit wide) EVERY time the texture base address changes (and if it's also VQ compressed).
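Rough arithmetic for that reload cost, as a Python sketch. It assumes one 64-bit word per VRAM access with no bursts or overlap, which is my simplification. The 7 cycles per word is the upper end of the DE10 figure mentioned above, and the function names are made up.

```python
CODEBOOK_WORDS = 256  # 64-bit words per VQ Codebook

def codebook_reload_cycles(cycles_per_word):
    """Clock cycles to refill the whole Codebook, one word at a time."""
    return CODEBOOK_WORDS * cycles_per_word

def codebook_reload_us(cycles_per_word, clock_mhz):
    """Same cost in microseconds at a given core clock."""
    return codebook_reload_cycles(cycles_per_word) / clock_mhz

print(codebook_reload_cycles(7))    # cycles at 7 clocks per word
print(codebook_reload_us(7, 16.0))  # microseconds at 16 MHz
```

Under those assumptions, each reload costs 1,792 cycles, or about 112 microseconds at 16 MHz, every time the texture base changes.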

#

Which is why it's so incredibly slow at rendering atm. lol

#

The Param cache, Texture cache, and Codebook cache are the real key to getting the good frame rates.

rain obsidian
#

That's one of the better FPGA renders in a while.

#

Well, since three days ago. lol

rain obsidian
#

Working on the Codebook cache next.

pseudo tinsel
#

@rain obsidian what kind of cpu does mister have?

rain obsidian
#

You mean on the ARM side?... an ARM. 😛

#

Actually, a Dual-Core ARM, very similar to what's on the Rasp Pi 2.
Runs at around 800-900 MHz on MiSTer.

#

But without any actual GPU.

#

So it has to use a Framebuffer.

#

The DDR3 gets shared between the ARM and FPGA sides.

#

And either can read (or write. 😮 ) the same memory.

#

So the ARM Linux stuff is set to use the lower 512MB.

#

And FPGA side can use the upper 512MB.

#

(although part of that DDR3 mem is used for the ASCAL upscaler framebuffers.)
ASCAL has the option to directly display the Linux framebuffer from DDR3, or display the native video being fed to it from the core itself.

#

It's not too bad to get the toolchain set up, to compile stuff for the ARM / Linux side.

#

But I think there were a few minor mistakes on the wiki.

#

Something to do with one of the paths for the export. It might have been fixed now?

#

I've never had any luck getting docker working on WSL / WSL2, btw.

#

I just install the gcc ARM toolchain directly.

#

New Codebook Cache.

#

The FPGA version should now force Quartus to instantiate Block memory.

#

If I pretend VRAM has zero latency on the sim, it renders the Daytona frame fast enough to hit 28.68 FPS.

pseudo tinsel
#

cortex class?

#

800MHz Dual-core ARM Cortex-A9 processor.

#

good ol' A9 lol

#

i already have it running on the ultra96 board

rain obsidian
#

Yep, if you have reicast or something on the ultra96, and running under Linux, it might even run as-is on MiSTer?

#

Probably goes without saying - without a GPU. lol

#

Currently realizing just how much BRAM the Codebook cache might take.

#

You can reduce the number of entries, but unless it's about 512-1024, it won't be very fast.

#

With 1,024 entries, of 256 words, and 64-bit words.

#

That's 1,024 * 256 * 8 = 2 MBytes.
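The sizing as a small Python helper, to try other entry counts. It's just the multiplication above; the function name is invented.

```python
def codebook_cache_bytes(entries, words_per_codebook=256, bytes_per_word=8):
    """Raw storage for a Codebook cache, ignoring tags and valid bits."""
    return entries * words_per_codebook * bytes_per_word

for entries in (1024, 128, 64):
    kib = codebook_cache_bytes(entries) // 1024
    print(f"{entries} entries -> {kib} KiB")
```

That gives 2,048 KiB for 1,024 entries, 256 KiB for 128, and 128 KiB for 64, before counting any tag storage.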

#

Way more than the Cyc V on the DE10 even has.

#

Trying 128 entries first.

#

I was wrong.

#

It looks like 128 would be plenty.

#

It's still doing sim frame times to hit 22 FPS (at 100 MHz), even with simulated DDR latency of 1 clock.

#

There just needs to be enough Cache entries to store all of the Codebooks for the typical number of different textures used in a scene.

#

64 entries...

#

128...

#

With only 64 entries, when there are more textures than that on the screen, the "Tag" address for the cache kinda wraps.

#

Causing it to use the wrong texture for some of the polys.

#

There should be a fix for that, too.

#

Where the logic knows if the incoming Tag width is greater than the number of Cache entries, it can force a cache miss.

#

Then re-read the Codebook for those polys.
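One way the forced-miss idea could work, as a toy Python model: keep the full tag next to each cache slot and compare it on lookup. This is only a sketch of a generic direct-mapped cache, not how the core's cache module is written. The class and field names are invented.

```python
class CodebookCache:
    """Toy direct-mapped cache: slot = tag % entries, full tag kept per slot."""

    def __init__(self, entries):
        self.entries = entries
        self.stored_tag = [None] * entries  # which tag each slot holds
        self.reloads = 0                    # Codebook re-reads from VRAM

    def lookup(self, tag):
        """Return True on a hit; on a miss, reload the slot and count it."""
        slot = tag % self.entries
        hit = self.stored_tag[slot] == tag
        if not hit:
            self.stored_tag[slot] = tag     # pretend we re-read the Codebook
            self.reloads += 1
        return hit

cache = CodebookCache(64)
print(cache.lookup(5), cache.lookup(5), cache.lookup(69), cache.lookup(5))
print(cache.reloads)
```

Tags 5 and 69 share slot 5 in a 64-entry cache, so instead of silently using the wrong Codebook they just keep evicting each other, which costs reloads but keeps the textures right.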

#

It would just cause slower renders.

#

(a compromise between the Codebook cache size, vs speed.)

#

That's with 128 entries for the Codebook cache.

#

And 1,024 entries for the param buffer.

#

That's not too bad on mem usage atm.

#

Taking 32 minutes to compile.

#

Haven't quite got it right yet, but close.

#

I actually think the Codebook cache is working OK on the FPGA.

#

But I couldn't help do another tweak to gain a bit of speed, and broke it again. lol

#

When it does the thing of skipping every-other line, it's usually due to the clock delay thing for Tag processing.

#
        // Write triangle spans to Z / Tag buffer, checking 32 "pixels" at once for inTri AND depth_compare.
        50: begin
            isp_state <= 8'd90;
        end
        
        90: begin    // Z-buff write is allowed in this state.
            isp_state <= isp_state + 8'd1;
        end
        
        91: begin
            y_ps[4:0] <= y_ps[4:0] + 5'd1;
            if (y_ps[4:0]==5'd31) begin
                isp_vram_addr <= isp_vram_addr_last;
                isp_state <= 8'd48;        // Done! - Load next PRIM.
            end
            else isp_state <= 8'd50;    // Else, Jump back.
        end
#

So I had to re-add those extra states.

#

It starts off with y_ps[4:0]==0, so pointing at the first row in the Tag buffer.

#

But it has to read the Z values from that row, have time to do the depth compare (on all 32 pixels/tags in parallel), then WRITE the new prim tags (for the current triangle) back to the Tag buffer.

#

It can't do that all within the same clock cycle / state.

#

As there is a delay of one clock cycle for reading the (current) Tags from memory.

#

So it has to go from isp_state 50 to 90, to give it time to read.

#

As it leaves isp_state 90, that triggers the writeback to the Tag/Z buffer row.

#

91 increments to the next row, and it repeats until it's done row 31.
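The 50 -> 90 -> 91 loop as a Python trace, assuming one clock per state (my assumption) and a 5-bit y_ps that wraps. The function name is made up, and the jump to state 48 is just modelled as the end of the loop.

```python
def span_write_states(rows=32):
    """Return the (isp_state, y_ps) pairs visited while writing one prim."""
    y_ps = 0
    state = 50
    trace = []
    while True:
        trace.append((state, y_ps))
        if state == 50:
            state = 90                    # give the Tag/Z row time to read
        elif state == 90:
            state = 91                    # leaving 90 triggers the writeback
        else:                             # state 91
            last_row = (y_ps == rows - 1)
            y_ps = (y_ps + 1) % rows      # 5-bit counter wraps after row 31
            if last_row:
                break                     # would go to state 48 (next PRIM)
            state = 50
    return trace

trace = span_write_states()
print(len(trace), trace[0], trace[-1])
```

So that's three states per row, 96 in total for a 32-row tile. Dropping state 90 would take it down to 64, if the read timing still works out.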

#

I might be able to ditch isp_state 90, but I can't figure out if that would break it again.

rich kindle
#

If we pretend that DC will never fit the MiSTer, the question is what would be the most worthwhile parts of the DC emulation to have on the FPGA side of things. What would be the one thing that could potentially bring in a correct timing etc.

rain obsidian
#

The word "emulation" threw me for a second there. lol

rich kindle
#

If I understood you correctly the ARM part is just an ARM

#

Without GPU

rain obsidian
#

But in this case, the term emulation is very fair. It's not going to be a super low-level GPU, with the way I've written it.
(and the lack of chip-level info on HOLLY / PVR2.)

#

Yep.

#

It's pretty much like a Rasp Pi 2, minus the GPU, and with two cores.

rich kindle
#

Thus having just the PowerVR emulated on the FPGA would be one possible choice

rain obsidian
#

Correct timings, as in, making it fast enough without a lot of extra effort...

#

Would be having about 8MB of SRAM.

#

But SRAM is quite expensive.

#

And would use up at least one of the GPIO headers.

rich kindle
#

I guess the MiSTer doesn't have a soundcard either?

#

The ARM part?

rain obsidian
#

Not as such. Anything like that, has to be implemented on the FPGA side, AFAIK.

#

So the ARM even passes audio through the FPGA to the DAC + HDMI.

#

In fact, the ALSA audio from Linux gets written to a small buffer in DDR3 first.

#

Then some logic on the FPGA side reads the audio samples from that buffer, to output to the PWM "DAC", or I2S DAC, plus the HDMI chip.

rich kindle
#

Ok. Thus the PowerVR and sound emulation has to be done on the FPGA side

rain obsidian
#

I'm not overly worried about the AICA sound stuff yet. I know a few devs who I think would be happy to help.

#

AFAIK, the AICA is quite similar to the SCSP on Saturn.

#

But with more channels, and obviously the ARM7 thingy.

#

I'm just happy to be making progress atm with the GPU, which, as you've seen, isn't always the case. lol

#

It usually makes me super grumpy.

#

I have no idea how many Quartus compiles I've done, just on this project alone.

#

But it must be like, 500 or more.

#

So around 250 hours, at least.

#

Yeah, probably way more, actually.

#

I started the project over a year ago, but I had at least six months where I barely looked at it.

rich kindle
#

Thanks for elaborating. So getting reicast itself to run on the arm is just half of the story because there is no graphics card to output to and no soundcard. Thus even if getting it to successfully run somehow it would do... nothing (nothing to see and nothing to hear).

rain obsidian
#

It would probably be possible to emulate the sound on reicast.

#

And just output that via Linux / ALSA, like normal.

#

Assuming the ARM cores are even fast enough for that, as they're not especially great.

#

I think they lack NEON and other features that the Rasp Pi 2 has.

#

So not so good for decoding modern video formats, etc.

#

Very early-on in the life of MiSTer, some people did get things like ScummVM running quite well.

#

Although, back then, displaying the Linux framebuffer would have screen tearing.

#

I really hope they've fixed that now, but I haven't tried it in years.

#

booo!

#

Quartus has a habit of doing stuff like that, RIGHT when you're getting closer to a crux point.

#

And worst of all, is it won't even give you an estimate on the logic usage, as it stopped too early.

#

If there was an affordable Intel Agilex dev board right now, I'd buy it.

#

That's if it had decently fast DDR, and enough IO pins to hook up SDRAM etc.

#

And at least 50-100% more logic than the DE10 Cyc V.

#

I only just realized that I don't really NEED to check if the prim tag has changed.

#

You just assume it can change, then check if the texture base address has changed.

#

It iterates through every pixel in the tile anyway.

#

So if a pixel (triangle) has the texture flag set, you check to see if the Texel read address has changed, then read the new texel.

#

If it's flat-shaded or Gouraud, you can obviously skip the texel read entirely.
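
#

A tiny software model of that per-pixel decision, in case it helps. Everything here (the `Pixel` struct, `count_texel_reads`, the example row) is made up for illustration and isn't taken from the core:

```cpp
#include <cstdint>

// Hypothetical per-pixel info; none of these names come from the core.
struct Pixel {
    bool     textured;    // texture flag from the triangle's params
    uint32_t texel_addr;  // VRAM address of the texel this pixel needs
};

// Counts how many texel reads a run of pixels would trigger, if a read is
// only issued when the pixel is textured AND the texel address differs
// from the one that was read last.
constexpr int count_texel_reads(const Pixel* px, int n) {
    int      reads     = 0;
    bool     have_last = false;
    uint32_t last_addr = 0;
    for (int i = 0; i < n; ++i) {
        if (!px[i].textured) continue;                             // flat / Gouraud: no read
        if (have_last && px[i].texel_addr == last_addr) continue;  // same texel: reuse it
        last_addr = px[i].texel_addr;
        have_last = true;
        ++reads;
    }
    return reads;
}

// Example row: two pixels on one texel, an untextured pixel, the same
// texel again, then a new texel.
constexpr Pixel kExampleRow[] = {
    {true, 0x100}, {true, 0x100}, {false, 0}, {true, 0x100}, {true, 0x104},
};
static_assert(count_texel_reads(kExampleRow, 5) == 2, "only two reads needed");
```

Five pixels, two reads. The untextured pixel in the middle doesn't disturb the "last address" either, so the run after it still reuses the texel.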

maiden granite
rain obsidian
#

Exactly.

#

Really annoying that there are no cheaper Agilex boards out yet.

#

Like, I’d pay up to about £600, if needed, but I can’t go to £2000+

#

Progress.

#

Reduced the CB cache to only 64 entries.

#

And reduced the param buffer to 512.

#

Could probably use the CB cache module as the generic cache for the param buffer, tbh.

pseudo tinsel
#

tbh, I would implement them as normal addr-value caches

rain obsidian
#

'cos atm, the param buffer is literally just storing every group of vertex/ISP/TSP/TCW params, for every triangle within each tile. lol

#

Yep, it pretty much is...

#
    //.prim_tag( prim_tag_out ),        // input [9:0]  prim_tag
    .prim_tag( tcw_word_out[15:5] ),        // input [9:0]  prim_tag (note: [15:5] is 11 bits, so the top bit is dropped if the port really is 10 bits)
#

Using some bits of the Texture base addr as the "tag" atm.

#

Which I know isn't ideal.

#

Just needed enough bits to differentiate between textures.

#

Which does make sense, caching the Codebook across the whole frame.

#

Rather than per-tile.

#

Since there are only so many textures in use, in the entire image.

#

If it was per-tile, it would be constantly re-loading the cache.

pseudo tinsel
#

i think the cache is really tiny on the real hw

rain obsidian
#

Yep, I think so, too.

pseudo tinsel
#

like 2kb or something like that

rain obsidian
#

I think it just takes a hit (or miss. lol) if there are lots of textures in the frame, and it just has to fetch a bit more often from VRAM.

#

I do think logically, the "texel" cache and codebook cache must be separate, though.

#

And the texel cache mainly helps just to mitigate most of the VRAM latency.

#

So it can do burst reads into cache.

#

I'm very happy with that new render.

#

Another milestone.

#

It's still super slow, but that's now down to the lack of Burst transfers

#

git commit done

#

That's actually really good already, vs what I was expecting.

#

So now we have the Hidden Surface removal, done on 32 "pixels" at once, using Tags.

#

Then the param buffer, so it can fetch the params for each triangle, depending on the Tag value during tile rendering.

#

And now it has the Codebook cache in place, which erm, caches the Codebook(s). lol

#

So it can also swap to each Codebook, each time the texture base address changes (dependent on the current Tag).

#

OK, now what? 😛

#

I think I need to process this for a while, and see where I'm up to.

pseudo tinsel
#

well, if you're out of ideas

#

would be real fun to setup reicast

#

and render frames as they are sent from it

rain obsidian
#

Around 2 seconds to render the frame above, at only 16 MHz.

#

With crappy 6-8 cycles of DDR3 latency, for EVERY PIXEL processed.

#

And for EVERY initial param fetch.

pseudo tinsel
#

heh

rain obsidian
#

Trust me, I'll be working on getting reicast running on the ARM now. haha

pseudo tinsel
#

i guess we can call that mistercast

rain obsidian
#

Oh no. lol

pseudo tinsel
#

to keep in brand with the endless forks i have

#

xD

rain obsidian
#

I really hate the fact Taki called it the "MiSTer Pi".

#

It was confusing enough, trying to explain to people the pros and cons of sw vs FPGA "emulation". lol

#

It's possibly THE most confusing name anyone could have ever chosen, for a device like that.

#

But hey, it's great to see some clones finally appearing.

#

640x480 pixels = 307,200

#

Best-case DDR3 latency, at a core clock of 16 MHz, looks to be close to 6 cycles.

#

So 307,200 x 6 = 1,843,200 cycles.

#

Each cycle = 62.5ns, at 16 MHz.

#

= 115.2 milliseconds. lol

#

THAT's why it's so slow atm.
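
#

The same sum as a quick compile-time check, using only the numbers from the messages above. The helper names are just for illustration:

```cpp
#include <cstdint>

// Numbers quoted above.
constexpr uint64_t kWidth         = 640;
constexpr uint64_t kHeight        = 480;
constexpr uint64_t kLatencyCycles = 6;         // best-case DDR3 latency per access
constexpr uint64_t kClockHz       = 16000000;  // 16 MHz core clock

// One DDR3 access per pixel, each paying the full latency.
constexpr uint64_t total_cycles(uint64_t pixels, uint64_t cycles_per_pixel) {
    return pixels * cycles_per_pixel;
}

// Integer microseconds, to avoid any float rounding.
constexpr uint64_t cycles_to_us(uint64_t cycles, uint64_t clock_hz) {
    return cycles * 1000000 / clock_hz;
}

static_assert(kWidth * kHeight == 307200, "pixels per frame");
static_assert(total_cycles(kWidth * kHeight, kLatencyCycles) == 1843200, "cycles per frame");
static_assert(cycles_to_us(1843200, kClockHz) == 115200, "115.2 ms per frame");
```

That 115.2 ms is only the best-case latency on the pixel accesses. The param fetches pay the same per-word latency on top of that.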

pseudo tinsel
#

do you render to an internal buffer and then copy?

rain obsidian
#

Not yet.

#

That will probably be the next thing.

#

So it can Burst write it to DDR3.

#

But...

polar goblet
#

Hypothetically, would the Dreamcast core need a "VGA mode" toggle? Some games don't support the VGA at all.

rain obsidian
#

That brings up another interesting question about the DC...

#

Does it arrange the framebuffer where the tile pixels are linear?

#

Or does it just burst 32 pixels into the tile, then do another burst for the next row?

#

Probably the latter.

pseudo tinsel
#

@polar goblet that depends on how mister presents the frame?

rain obsidian
#

Also, unless the VRAM stuff is moved into SDRAM again, getting RGB output could get interesting.

#

You can use ASCAL on MiSTer, to output a DDR3 Framebuffer via RGB, though.

pseudo tinsel
#

the framebuffer lives in SDRAM?

rain obsidian
#

It used to, until a few weeks ago. hehe

pseudo tinsel
#

mister uses hdmi?

rain obsidian
#

Now it's all in DDR3.

#

Reads the verts and params from DDR3.

pseudo tinsel
#

has an urge to quit his job and work full time on this

rain obsidian
#

Then writes the tile directly back to the framebuffer, overwriting the reicast frame(s).

#

lol

pseudo tinsel
#

only if it paid the bills lol

rain obsidian
#

I do think adding the tile output buffer makes the most sense for the next step.

polar goblet
#

I know the MiSTer having a Dreamcast core is just a pipe dream, just thinking about the edge cases that might complicate things.

rain obsidian
#

Would literally just be a small BRAM...

#

Could steal the z_mem thing.

#
module z_mem (
  input clock,

  input [41:0] data,  
  input [4:0] address,
  input wren,

  output reg [41:0] q
);

reg [41:0] z_mem [0:31];

always @(posedge clock) begin
  if (wren) z_mem[ address ] <= data;
  q <= z_mem[ address ];
end

endmodule
pseudo tinsel
#

41:0 lol

rain obsidian
#

I'm guessing the tile accumulation buffer stores the Alpha as well.

pseudo tinsel
#

yeah

#

the hardware has two of em

rain obsidian
#

The [41:0], is because the Z buffer also stores the Tags now.

#

So 32-bit Z (fixed-point), and 10-bit Tag.
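
#

For anyone following along, this is roughly what packing those two fields into one 42-bit word looks like in software. The bit order is an assumption (Tag in the low 10 bits, Z above it), and the function names are hypothetical; the core may well order them the other way round:

```cpp
#include <cstdint>

constexpr int      kTagBits = 10;
constexpr uint64_t kTagMask = (1ull << kTagBits) - 1;  // 0x3FF

// Assumed layout: Tag in bits [9:0], 32-bit fixed-point Z in bits [41:10].
constexpr uint64_t pack_z_tag(uint32_t z, uint32_t tag) {
    return (static_cast<uint64_t>(z) << kTagBits) | (tag & kTagMask);
}

constexpr uint32_t unpack_z(uint64_t word) {
    return static_cast<uint32_t>(word >> kTagBits);
}

constexpr uint32_t unpack_tag(uint64_t word) {
    return static_cast<uint32_t>(word & kTagMask);
}

// All 42 bits set when both fields are at their maximum.
static_assert(pack_z_tag(0xFFFFFFFFu, 0x3FF) == 0x3FFFFFFFFFFull, "42 bits used");
```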

pseudo tinsel
#

@polar goblet I think mister is just slightly too small for dc

#

my goal was ultra96

#

and to run some parts in sw

rain obsidian
#

Yep, I think maybe even the GPU might be doable, with a lot of help getting it to all fit, with all of the features.

#

And especially getting it to run faster, without breaking the timings.

#

I might try the next compile with it at say 40 MHz, and see what happens.

pseudo tinsel
#

well still an 800 mhz A9 won't run all games fullspeed

#

even with all opts enabled

#

and just the cpu to care

rain obsidian
#

It would be amazing, to see reicast on the ARM on MiSTer, with the TA writing directly to the Framebuffer.

#

But...

#

It would need a way to signal to the emu when each frame is done.

pseudo tinsel
#

well it is not too complex, isn't there some mmio we can map?

rain obsidian
#

Could just do that, by reading/writing DDR3 directly, like a handshake thing.

#

Or, talking via the HPS IO module, like how most cores do it.

#

Well, the OSD, disk/tape/cart loading.

#

Is all done via ONE 32-bit internal GPIO port.

#

That's all it has, between the ARM core(s) and FPGA side, unless you use AXI.

#

I was using AXI before, for testing some of laxer's old PS1 core.

#

But AXI can be a bit unstable, especially if you need to reload a new RBF all the time.

rain obsidian
pseudo tinsel
#

well, I don't know the specifics, but i'm sure it can be figured out

rain obsidian
#

Then access that same mem on the FPGA side.

pseudo tinsel
#

you'd need a few things mapped

#

one for the regs access

#

one for the memory access

rain obsidian
#

As long as reicast can be tweaked, to simply write the textures and TA output into DDR3 (mmapped), I think it's doable.

pseudo tinsel
#

and a few bits for the interrupt lines

rain obsidian
#

And as long as nothing in the emu code needs a lot of feedback from the GPU.

#

ie. it would need to "talk" to most of the existing GPU code, to keep reicast happy.

#

But also be able to wait for a LONG time, for the FPGA to render each frame. lol

#

And no TA in Verilog done yet.

pseudo tinsel
#

sure but we have lxdream's TA

rain obsidian
#

Exactly.

pseudo tinsel
#

thank god for lxdream

rain obsidian
#

And that might be one of the ways to get around the whole "doesn't all fit the FPGA" problem.

#

I think the ARM core(s) might even have enough juice to run the AICA sound.

#

I mean, it can run stuff like Fluidsynth and mt32-pi.

#

Sort of. lol

#

That's one of the reasons the Pi version exists - the ARM cores on MiSTer were struggling a bit.

pseudo tinsel
#

reicast could run several games in 600-700 mhz cortex a8

#

back in the day

#

not the most demanding ones, but on a single core (with gpu)

rain obsidian
#

Believe it or not, behind the crud, that's a very good render.

#

With most of the speed-up logic in place, aside from Burst transfers, which is the final step.

#

That's taking about 3.5 seconds to render the frame, at 16 MHz.

#

So about 0.28 FPS.

pseudo tinsel
#

well zombies don't move fast anyway

#

😛

rain obsidian
#

Only half a second, so already at 2 FPS. A bit like DOOM on a 386.

#

Simpler frames, like the memory card screen, render in about 200ms.

#

I need to add an "FPS" counter to the core really.

#

4 seconds to render DOA2 Kasumi.

#

And she's lookin' pretty fine, considering all of the code I just munged together.

#

I don't even want to imagine how much logic Gouraud shading will use.

#

Definitely won't fit that atm.

rain obsidian
#

Rendering from Tags actually helps fix a few glitches.

#

I think it's displaying at half horiz resolution atm.

#

Due to the ASCAL thing.

#

Obv the textures are shifted, too.

#

single-pixel changes on the raindrops etc. are not allowing the params to update fast enough.

#

I'm being dumb.

#

I can't work out the maths.

#

I think that's right - it's taking about 5-7 clock cycles extra, per-pixel.

#

So will be taking 5-7 times longer to render each frame.

#

And 5-7 times longer, to write the tile back to VRAM.

#

And the same for reading in a new Codebook.

#

And for reading the params.

#

(about 24-43 clock cycles to read a set of params for ONE triangle, as they are variable-length. At up to 7 cycles per word instead of one, that would currently take around 301 clock cycles (43 x 7), at the most.)

halcyon creek
#

It's still a fascinating read though!!1 🙂

rich kindle
#

Would rendering just "half res" in 320x240 increase the chance of getting this onto the MiSTer? I feel that many would prefer a low res game with scanlines

rain obsidian
#

If I can get the frame times down on the FPGA, it might be OK at 640x480.

#

The main reason it's displaying the frames at a lower res atm, is that it's just duplicating every other pixel.

#

As I couldn't figure out how to modify ASCAL the other night, to display it how I wanted.

#

(the Dreamcast's weird VRAM layout makes it a bit awkward to display things.)

#

Because of this...

#
 _________    _________
|         |  |         |
|         |  |         |
| 2nd 4MB |  | 1st 4MB |
|         |  |         |
|         |  |         |
|_________|  |_________|
  [63:32]      [31:0]
#

That's how I have to have the VRAM dumps loaded into DDR3.

#

And also need to do it that way anyway, because textures are read as the full 64-bit wide data.

#

(like the real PVR2)

#

The display framebuffer is only read from one half of memory at a time, though, so 32-bit.

#

(usually 16BPP, so two pixels in each 32-bit word.)

#

I'd have to modify ASCAL properly, to skip the opposite half when reading the framebuffer.
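
#

Roughly how I picture that mapping in software. This assumes the layout in the diagram above (1st 4MB in bits [31:0], 2nd 4MB in bits [63:32] of each 64-bit DDR3 word) and little-endian byte order. The function names are hypothetical:

```cpp
#include <cstdint>

constexpr uint32_t kBankSize = 4u * 1024 * 1024;  // each half of VRAM is 4MB

// Index of the 64-bit DDR3 word that holds this (linear) VRAM byte address.
constexpr uint32_t ddr_word64(uint32_t vram_addr) {
    return (vram_addr % kBankSize) / 4;
}

// Which 32-bit lane of that word: 0 = [31:0] (1st 4MB), 1 = [63:32] (2nd 4MB).
constexpr uint32_t ddr_lane(uint32_t vram_addr) {
    return (vram_addr / kBankSize) & 1u;
}

// Byte offset into the interleaved dump as it sits in DDR3.
constexpr uint32_t ddr_byte_offset(uint32_t vram_addr) {
    return ddr_word64(vram_addr) * 8 + ddr_lane(vram_addr) * 4 + (vram_addr & 3u);
}

static_assert(ddr_byte_offset(0x000000) == 0,  "first word of the 1st 4MB");
static_assert(ddr_byte_offset(0x000004) == 8,  "next 32-bit word is 8 bytes on");
static_assert(ddr_byte_offset(0x400000) == 4,  "first word of the 2nd 4MB");
```

So consecutive 32-bit framebuffer words from one half land 8 bytes apart in DDR3, which is the "skip the opposite half" part that a linear scaler read doesn't do.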

#

I'm more confident now, of getting decent frame rates (low frame times) from the FPGA version.

#

I just had to get this param buffer and cache stuff added.

#

Next up, it will be adding Burst transfers, which should make a huge difference.

#

So, instead of reading/writing just ONE 64-bit Word at a time, and waiting 5-7 clock cycles between each one, you just go...

#

"Read 256 Words, please. Starting from this address..."

#

The DDR controller will still have a few clock cycles of delay, but then it will burst those 256 words, pretty much on every clock cycle.

#

(it's not always contiguous, depending on whether the data is on the same Page (row) of memory. The FPGA also has to share access with ARM Linux, but DDR3 has a LOT of bandwidth, so it shouldn't be a problem.)

#

Put it this way, if the N64 can work mostly from DDR3, I think DC can as well. 😉

#

Right now, just fetching those 256 words would take up to 7 times longer (due to the latency of accessing one word at a time), so 1,792 clocks.

#

If it were trying to write a completed tile into DDR3, it would need 1,024 clock cycles.

#

atm, that would take 7,168 cycles, just for ONE tile.

#

Multiply that by the 300 tiles (640x480), and you see why it's slow right now. lol
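
#

Those figures as a small model, to make the comparison explicit. The single-word numbers come straight from the messages above. The burst `setup` cost is a placeholder for the "few clock cycles of delay" and is an assumption, not a measured value:

```cpp
#include <cstdint>

// Number of 32x32 tiles needed to cover a frame.
constexpr uint64_t tile_count(uint64_t width, uint64_t height) {
    return ((width + 31) / 32) * ((height + 31) / 32);
}

// One word at a time: every word pays the full access latency.
constexpr uint64_t single_word_cycles(uint64_t words, uint64_t latency) {
    return words * latency;
}

// Burst: pay a setup delay once, then roughly one word per clock.
// (Ignores page misses and sharing the bus with the ARM side.)
constexpr uint64_t burst_cycles(uint64_t words, uint64_t setup) {
    return setup + words;
}

static_assert(tile_count(640, 480) == 300, "20 x 15 tiles");
static_assert(single_word_cycles(256, 7) == 1792, "256 words, no burst");
static_assert(single_word_cycles(1024, 7) == 7168, "one tile, no burst");
static_assert(tile_count(640, 480) * single_word_cycles(1024, 7) == 2150400,
              "whole frame of tile writes, no burst");
```

2,150,400 cycles is about 134 ms at 16 MHz, just for writing tiles back. With an assumed 7-cycle setup, `burst_cycles(1024, 7)` is 1,031 cycles per tile instead of 7,168.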

rain obsidian
#

sigh. I fixed the vertical lines in the sim.

#

Not quite the same on the FPGA.

pseudo tinsel
#

wow

#

that looks pretty

rain obsidian
#

Gouraud shading is disabled in the sim atm, as I'm testing the FPGA code on it.

#

I've yet to get the background poly working.

#

I know the lower bits of the register point directly at the ISP/TSP and verts stuff.

#

But in the VRAM dumps I've seen, the address seems to be off by a few words?

#

Like, for the menu render, the ISP_BACKGND_T reg, was showing an address of say 0x5D18 in the lower bits.

#

But it looked like the actual data started at 0x5D10 ?

#

And that was bit 3 set, so not any of the bits [2:0].

#

I know you might have to add PARAM_BASE to that addr, too.

#

But the offset in the reg has always been slightly off from where it's supposed to be.

#

I can get this example to "land" on what looks like a proper ISP param, but only if I shift the address left by three bits.

#

And then take only bits [19:0]...

#

(see the marker, in the VRAM dump editor window.)

#

I'll look at the reicast code later, to see how to decode that address.

#

It would be nice to get the Background poly working, as I've yet to see it work.

#

PARAM_BASE is 0 for that dump anyway, so the poly data should be in the lowest 1MB of VRAM.

#
union ISP_BACKGND_T_type
{
    struct
    {
        u32 tag_offset   : 3;
        u32 tag_address  : 21;
        u32 skip         : 3;
        u32 shadow       : 1;
        u32 cache_bypass : 1;
    };
    u32 full;
};
#

OK, fair enough. I should have just had a proper look. lol

#
u32 strip_base=(param_base + ISP_BACKGND_T.tag_address*4) & 0x7FFFFF;
#

The PDF did say the address was "in 32-bit units", but also showed it as bits 23:3.

#

I think that got it, it's probably the blue bits of the background.

#

It's just the Z precision is limited right now, so it's not rendering all of it.

#

I just had to shift right by 1 bit instead, but ends up the same as grabbing bits [23:3], then shifting left by two again.
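
#

Both ways of decoding it, side by side, with plain shifts and masks rather than the bitfield union. The function names are mine, and 0x5D18 is only used as an example input here, not as a claim about where the data sits in that dump:

```cpp
#include <cstdint>

// tag_address lives in bits [23:3] of ISP_BACKGND_T, in 32-bit units.
constexpr uint32_t tag_address(uint32_t reg) {
    return (reg >> 3) & 0x1FFFFFu;  // 21 bits
}

// Same calculation as the reicast line quoted above.
constexpr uint32_t strip_base(uint32_t param_base, uint32_t reg) {
    return (param_base + tag_address(reg) * 4) & 0x7FFFFFu;
}

// Shortcut: mask off everything outside [23:3], then shift right by one.
// That equals (bits [23:3]) << 2, so it gives the same byte address.
constexpr uint32_t strip_base_shortcut(uint32_t param_base, uint32_t reg) {
    return (param_base + ((reg & 0xFFFFF8u) >> 1)) & 0x7FFFFFu;
}

static_assert(strip_base(0, 0x5D18) == 0x2E8C, "example value");
static_assert(strip_base_shortcut(0, 0x5D18) == 0x2E8C, "shortcut agrees");
static_assert(strip_base(0, 0x5D1F) == strip_base(0, 0x5D18), "tag_offset bits ignored");
```

Shifts and masks are used here because bit-field layout is implementation-defined in C and C++, so the union only decodes correctly on compilers that lay the fields out LSB-first.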

rain obsidian
pseudo tinsel
#

woomp woomp

#

bgpoly

rain obsidian
#

Morning.

#

I just remembered - the framebuffer is usually 16BPP.

#

And stored in either the lower or upper 4MB of VRAM on the DC.

#

So I should be able to write two pixels at once, when writing the finished tile back to the FB.

#

That might.. complicate the code a bit more. lol

rain obsidian
#

I might have to implement the tile accumulation buffer as odd and even pixels.

#

Or maybe not. Just needs to have Word Enable inputs.

#

Then read back as 32-bit, so it can write to mem.

#

Oh yeah - Alpha.

#

I would imagine the tile buffer is actually 32-bit ARGB.

#

No easy way to do Alpha blending otherwise.

#

So it will work a lot like it did in the C code on the sim.

#

When it reads the buffer back, it can do the 24-bit -> 565 conversion, before writing to VRAM.
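
#

The conversion itself is only a few shifts. A minimal sketch, assuming plain truncation (no dithering) and assuming the even pixel goes in the low half of each 32-bit word; both of those are guesses, and the function names are made up:

```cpp
#include <cstdint>

// ARGB8888 (A[31:24] R[23:16] G[15:8] B[7:0]) -> RGB565, dropping alpha
// and keeping the top 5/6/5 bits of each channel.
constexpr uint16_t argb8888_to_rgb565(uint32_t argb) {
    return static_cast<uint16_t>(
        (((argb >> 19) & 0x1Fu) << 11) |  // top 5 bits of red
        (((argb >> 10) & 0x3Fu) << 5)  |  // top 6 bits of green
        ((argb >> 3) & 0x1Fu));           // top 5 bits of blue
}

// Two 16BPP pixels per 32-bit VRAM word (even pixel in the low half here).
constexpr uint32_t pack_pixel_pair(uint16_t even_px, uint16_t odd_px) {
    return (static_cast<uint32_t>(odd_px) << 16) | even_px;
}

static_assert(argb8888_to_rgb565(0xFFFF0000u) == 0xF800, "pure red");
static_assert(argb8888_to_rgb565(0xFF00FF00u) == 0x07E0, "pure green");
static_assert(argb8888_to_rgb565(0xFF0000FFu) == 0x001F, "pure blue");
```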

#

Once Burst writes are working for tiles, I can be more confident about using it for other stuff.

#

Then we should see the really big speed-ups in frame render time.

rain obsidian
dense shard
#

It’s pretty cool how all the visual errors display the unique rendering method that the Dreamcast uses

rain obsidian
#

That re-affirms the question about running the core much faster than about 20 MHz atm. lol

dense shard
#

Quadrant rendering?

rain obsidian
#

That was at 50 MHz. I was just curious.

#

Tile rendering.

#

32x32 pixels.

#

Which is now done by all modern GPUs. 😉

#

It makes a lot of sense really, as it can do a lot of the rendering faster, once all of the info is on the chip itself.

#

On much faster memory, with zero wait-states, basically SRAM.

maiden granite
dense shard
dense shard
rain obsidian
#

Then, once each tile is done, it can burst write that back to VRAM.

dense shard
rain obsidian
#

PowerVR / Imagination Tech, were one of the first companies to really use Tile rendering / Deferred rendering.

maiden granite
#

Tiled rendering is the process of subdividing a computer graphics image by a regular grid in optical space and rendering each section of the grid, or tile, separately. The advantage to this design is that the amount of memory and bandwidth is reduced compared to immediate mode rendering systems that draw the entire frame at once. This has made t...

rain obsidian
#

On PC gfx cards, etc.

#

Oh, not to be confused with tile-rendering on older consoles, btw.

#

Like arcade boards, NES, SNES, MD, etc.

#

It's not really related to that older stuff in any meaningful way. Just something to watch out for.

#

Probably better known as Deferred Rendering.

#

So you have to "collect" most of the info for rendering each tile first, then do all of the Hidden Surface Removal (as quickly as possible, hence why I'm doing it on 32 pixels at once now).

#

That gives it the main speed advantage, to compete with traditional "immediate" rendering.

#

Where they would just render whole triangles out to the framebuffer in VRAM, even if the triangle was HUGE.

#

Breaking it up into Tiles is far better, but also quite tricky, as I found out. lol
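
#

A very rough software model of the Hidden Surface Removal step on one 32-pixel row, just to show the idea of keeping a Tag per pixel instead of a colour. It assumes a bigger Z value means closer (flip the compare if not), and all of the names are invented for the example:

```cpp
#include <cstdint>

constexpr int kRowPixels = 32;

struct TileRow {
    uint32_t z[kRowPixels];    // closest depth seen so far, per pixel
    uint16_t tag[kRowPixels];  // which triangle "won" that pixel
};

// Depth-test one triangle against a whole row. On the FPGA all 32 compares
// happen in parallel; here it is just a loop.
constexpr void depth_test_row(TileRow& row, const uint32_t (&tri_z)[kRowPixels],
                              uint32_t coverage, uint16_t tri_tag) {
    for (int x = 0; x < kRowPixels; ++x) {
        const bool inside = ((coverage >> x) & 1u) != 0;  // pixel is in the triangle
        if (inside && tri_z[x] > row.z[x]) {              // assumption: larger = closer
            row.z[x]   = tri_z[x];
            row.tag[x] = tri_tag;
        }
    }
}

// Three overlapping triangles; returns the winning Tag at pixel x.
constexpr uint16_t demo_winning_tag(int x) {
    TileRow row{};  // z = 0 and tag = 0 everywhere
    uint32_t z_a[kRowPixels] = {}, z_b[kRowPixels] = {}, z_c[kRowPixels] = {};
    for (int i = 0; i < kRowPixels; ++i) { z_a[i] = 100; z_b[i] = 200; z_c[i] = 50; }
    depth_test_row(row, z_a, 0xFFFFFFFFu, 1);  // covers the whole row
    depth_test_row(row, z_b, 0x0000FFFFu, 2);  // closer, covers pixels 0-15 only
    depth_test_row(row, z_c, 0xFFFFFFFFu, 3);  // further away, loses everywhere
    return row.tag[x];
}

static_assert(demo_winning_tag(0) == 2 && demo_winning_tag(31) == 1, "tags per pixel");
```

Only after every triangle in the tile has been through that does anything get textured or shaded, using the params looked up from each pixel's winning Tag. Triangle 3 never costs a single texel read.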

#

I didn't think I'd even get this far, tbh. Even a few weeks ago, I almost ragequit again.

maiden granite
rain obsidian
#

I want to draw up some diagrams to explain it soon.

#

Yeah.

#

And PowerVR tech ended up in most of the older iPhones, for a long time.

dense shard
#

Wow

maiden granite
#

The tile based rendering actually makes Dreamcast more viable on FPGA than other consoles of its same generation funnily enough

rain obsidian
#

You can bet Apple still use a similar rendering method, because Apple rarely "invent" anything. lol

dense shard
#

Pretty sure PowerVR made it into a few cars too

maiden granite
#

Because loading a whole 480p+ framebuffer is harder

rain obsidian
#

@maiden granite Possibly from the POV of minimizing VRAM bandwidth, at least.

maiden granite
#

Yeah

#

Which is a major concern on affordable fpga

dense shard
#

ur an affordable FPGA

rain obsidian
#

I still intend to hook up a DC mobo VRAM to the Logic Analyzer soon.

maiden granite
#

Ur mom

rain obsidian
#

tbh, I'm surprised nobody has done it yet.

#

Just to see the basics of how often it uses burst transfers, how much it pre-reads data into the texture cache, etc.

#

You can be sure it uses Burst transfers as much as possible, and you can only really do that when you have on-chip caches, too.

#

I sort of have Codebook cache and parameter "cache" now.

#

But no proper Texture cache yet.

#

Can probably just re-use most of the CB cache code for that.

#

If it wasn't for the latency of SDR/DDR SDRAM, none of that would be necessary. lol

#

Even just the handful of clock cycles (for a random Word read/write) on old-people SDRAM causes big problems.

#

'cos that handful of cycles is repeated for EVERY Word you access.

#

And the core is only at 20 MHz atm.

#

When it's running faster, there will be even more "wasted" clock (core) cycles when accessing DDR.

#

@maiden granite While I think of it - any idea how they managed to allow pics and videos more than 10MB on here?

#

It's one of the reasons I can post the GIF anims here, then send the link to other people. lol

maiden granite
#

Discord limitations. If you pay monthly for nitro it will greatly increase the cap.

#

10MB is the new limit for free users. Used to be 25MB

#

I get a cap of 500MB by comparison.

rain obsidian
#

Oh, OK, so somebody does pay that then, for this server?

#

I didn't know it could be done server-side?

maiden granite
#

No, it's your personal account :p

rain obsidian
#

I don't have nitro, though.

#

And it lets me upload files > 10MB on here.

maiden granite
#

Right, I think they had some kinda legacy grace period to allow the old upload limit still

#

Which was 25MB

rain obsidian
#

Yeah, probably.

#

OK, I did think it was strange, as they probably couldn't allow people to just pay server-side, and then suddenly let everyone on the server get around the limit. lol

maiden granite
#

Yeah

rain obsidian
#

I don't think the founders of Discord realized they were really starting a storage company.

#

The storage is probably more intensive than any of the actual UI stuff.

maiden granite
#

The server side rewards for boosting are increased voice quality, more emojis, more stickers, etc...

#

I think they knew because they started off with MongoDB from the start

#

They moved to Cassandra now I think.

#

Which was probably the most insane db migration in history

rain obsidian
#

$3 a month, they really are tempting me.

maiden granite
rain obsidian
#

It's just annoying, that it started off with a higher limit, then reduced to 10MB. That's just evil.

maiden granite
#

I remembered correctly.

#

Yeah they need to do stuff to become profitable. That's why the little ads were added.

#

I like the product so I pay for it hoping it doesn't get more enshittified.

#

The developer of BlastEm is a long time employee at discord too

rain obsidian
#

I'm all Nitro'd up now. LOL

#

$30 a year, I can't really complain at that.

#

Only the 50MB one, though.

#

I use Discord probably more than any other site, except YouTube.

#

Worth it.

maiden granite
#

Yup hehe

#

I've been paying for nitro since the first month it was available I think.

dense shard
#

I am too much of a cheap ass

rain obsidian
#

lol

#

I paid for YouTube premium for a month once.

#

But only to get rid of adverts on the phone and Firestick, which I don't use all that often.

#

I wouldn't mind if it didn't cost quite so much, but it's almost as much as I'm paying for Nutflicks 4K.

#

Adblock FTW

maiden granite
#

I've been paying for YouTube premium family for a long time and let my wife and some friends use it with the extra accounts allowed.

rain obsidian
#

I haven't had to see a single video advert on my PCs for about the past ten years.

#

I'm still paying about $30 a month on Patreon.

maiden granite
#

Definitely worth it at that price point if you spread it out. Also been a YouTube watcher since the earliest days, so I like the product I pay for it kinda thing.

rain obsidian
#

Sorg, GadgetUK, RMC, EEVblog, Srg320, Jotego, bitluni, etc.

maiden granite
#

Discord could make more money if they redesigned their merch. It's mostly pretty gaudy stuff.

rain obsidian
#

furrtek, MLiG, RetroRGB Bob, Tech Tangents, Retro Tech (CRT Steve), Artemio, and for some reason... Kreosan. lol

#

£12.99 here, for YouTube Premium. 😦

#

It's not a huge amount of money, but it is just to ditch adverts, and get a bit of extra content.

#

vs what I'm paying for Amazon and Nutflicks already.

#

I did have disney+ for about two years, watched Mando, then scrubbed it. lol

#

disney+ AKA "Even more woke Netflix".

#

Disney own so much stuff now, eventually they will own ALL entertainment.

#

And all restaurants will become Taco Bell.

dense shard
rain obsidian
#

Useless trivia, btw...

#

I've only ever ordered Taco Bell... ONCE.

#

Also, in Demolition Man, Taco Bell was swapped for Pizza Hut, in some Countries.

#

Taco Bell here (UK) was... just OK.

dense shard
#

Thanks, @maiden granite! you had to encourage him to get Nitro!

#

Look at what he did to me!!

rain obsidian
#

I did almost get Nitro the other day. $30 is pretty good, even for the 50MB thing.

#

I couldn't even post 4K screenshots sometimes, of the SuperSPGA, etc.

#

And we all know how important THAT is.

floral vale
#

Can I ask a hopefully not annoying question?

rain obsidian
#

@floral vale Yip. I quite like some of the annoying ones.

floral vale
#

How much bigger or more powerful would an FPGA need to be to run a Dreamcast core? From what I’ve gathered, the current MiSTer just isn’t good enough

#

Or is it maybe possible on the current?

rain obsidian
#

The biggest hurdle is the clock freq of the CPU.

#

As it runs at 200 MHz, and you're never likely to see any (complex) cores run much above 100 MHz on the Cyclone V.

#

But... it just might be possible (if some of the uber devs get together), to make an SH4 core that can execute more instructions in parallel, then only need to run at 100 MHz or less.

#

Right now, my crappo GPU test "core" is already using about 76% of the Cyc V on the DE10.

#

But I'm convinced a ton of that could be trimmed down, if I had some help with the maths calcs.

#

The inTri and interp blocks alone, are using about 30,000 LEs, or something.

#

Which is just insane. lol

rain obsidian
#

From all of what is known so far, though, I'm far more confident about an eventual core being finished.

But it might need something with TWICE the logic of the DE10.

#

Like... I dunno... the Analogue 3D. 😉

#

And yes, I know it won't (officially) have OpenFPGA.

#

I bought one mainly because it's relatively cheap, for something with 220,000 LEs.

floral vale
#

Analogue 3D will be snagged if someone gets openFPGA on there. That’s for sure

rain obsidian
#

The Ana 3D is likely to have only DDR memory.

#

But that's fine, if I/we can manage to finish this GPU core.

#

Things like the sound and GD/ISO loading can happen later.

#

I think if I can get further with the GPU, and prove it can hit decent frame rates on the actual FPGA, it will get more interest from other devs to collab with.

floral vale
#

Noice. I like the way that sounds. Great seeing progress on how things are going!!

rain obsidian
#

@floral vale I did this, back in 2019. It didn't take long...

#

I intend to do the same for the Ana 3D when it arrives, and use it as my new dev board. 😉

#

To be able to use the FPGA like that, with an unknown pin mapping to the rest of the system, I had to of course first figure out the pin mapping...

#

At first, I tried using a simple counter on the FPGA, and had it output each bit onto every IO pin.

#

Which isn't ideal, because the signals could be clashing (contention) with other chips on the board.

#

That proved to be tricky, so instead I did it the opposite way...

#

I just set ALL of the FPGA pins as Inputs...

#

Then I used the 1 KHz test output on my o'scope, via a resistor, to probe every pin of the rest of the chips on the board.

#

Like the HDMI chip (which I think on the SuperNT is basically the same ADV7513 as on the DE10, or very similar).

#

Then the buffers for the cart slot, joyports, etc.

#

Then I used SignalFlap to look at what each input pin was doing.

#

There was some crosstalk between signals, but it was very obvious which pin had the 1 KHz on it.

#

And that was it - template Quartus project for the SuperNT. lol

#

But... obviously I didn't touch the PIC32 on the SuperNT at all.

#

I didn't want to erase its FW etc. as I'd never get it back, and definitely didn't want to have to write my own.

#

Which makes it tricky to actually load any games into a core.

#

That could be done with a simple SD slot hooked up to some Cart or Joyport pins.

#

Or...

#

An Orange Pi, in a cart.

#

I already have main MiSTer running nicely on an Orange Pi (Allwinner SoC) for a previous project.

#

With a QMtech Cyc V module.

#

I don't think I'd mentioned that publicly before, because it was a whole thing.

#

But times have changed, prices have changed, and clone DE10s started arriving.

#

I would love to buy a new dev board soon, but really want an Agilex.

#

(Intel have apparently split from altera as a company now, so they are separate again. AFAIK, Intel have always made the chips for altera, though. Or at least TSMC do.)

#

I haven't seen any Agilex dev boards for less than about $2,000, and I can't justify that.

#

This is one of the "cheapest" on Terasic atm, and it's still ridiculous.

#

This is also $3,000 for a Cyclone 10, but looks like it uses almost the same chip as the Ana 3D, just with a higher pin count...

floral vale
#

I’d like to think the Analogue 3D would be relatively easy to get OpenFPGA on there somehow

rain obsidian
#

Not sure about that, but it would certainly be possible to load cores the other-other way.

maiden granite
maiden granite
rain obsidian
#

I already had main MiSTer working nicely on an Opi / Allwinner SoC.

#

It can talk to the FPGA side via high-speed SPI, or a whole 8-bit bus.

#

So I wouldn't even need to attempt anything like OpenFPGA itself.

#

Again, though, likely only DDR3 or DDR4 on the Ana 3D.

#

And probably not quite enough cart slot pins for SDRAM and everything else, but we'll see.

#

(plus the fact the buffers on the cart slot would likely make it hard to have SDRAM working at any reasonable speed.)

#

Honestly, I'm even contemplating sacrificing the whole cart slot, and designing a PCB that can replace the buffer chips.

#

Don't really need a cart slot, when you can load from SD card.

#

Even though I was trying to champion adding a cart slot to MiSTer, a few years back. lol

#

1.432 MBytes of on-chip mem on the Ana 3D, probably.

#

About twice that of the DE10.

#

Still not a massive amount, but super useful.

rain obsidian
#

Trying to fix the vert lines on the core.

#

Probably just what's highlighted there.

#

x_ps is the current screen pixel X coord.

#

But it's incrementing that at the same time as asserting DDRAM_WE, to write the pixel to the framebuffer.

#

x_ps (and y_ps) drive the whole texturing logic.

#

So x_ps has incremented by 1, by the time the DDRAM_WE pulse happens.

#

"Write Enable"

#

Don't know why the lines don't appear on the sim, but things work a bit differently there.

#

Shoving the fb_we (framebuffer write) stuff into the previous state makes things worse...

#

Unless you like 1990s wireframe movies.

#

I'll just add an extra state, and I think that will fix it.

#

Every extra clock cycle makes it render slower, but hey.

#

Good enough.

#

Larger param cache and CB cache (mostly) fixes the face...

steady grail
#

Damn looks good!

#

LOL just got this reddit notification

rain obsidian
#

Oh, jesus.

#

reddit - the hive of everything good about the World.

#

lol

#

tbf, most of the mentions I ever got on there were good.

rain obsidian
#

It's very much Srg320's Saturn core, with some very tiny tweaks.

#

I do strive to mention it a lot, as you know. :p

#

I'm gonna have to read that reddit now, aren't I. lol

#

Oh, OK. Not too much said yet.

crude bloom
#

it won't matter, the YouTube hype merchants have you listed as doing it

rain obsidian
#

I can't disagree with the reply about there not being many games for ST-V, at least not exclusives.

#

lol. True.

#

I can't really keep all projects as private every time, though.

#

It's no fun for me, unless I see other people enjoying reading about it.

crude bloom
# rain obsidian lol. True.

Support the Channel
https://www.patreon.com/PixelCherryNinja

Join the Pixel Cherry Ninja Gaming Discord
https://discord.gg/aUFwCV6SCc

Special thanks to channel Patreon's
SmoMo
RGM @rgm4646
Marc Nuernberger

So many people contribute towards FPGA gaming so that we can enjoy our most precious games on devices such as the MiSTer FPGA and the Anal...

#

timestamped your bit !

rain obsidian
#

Was watching part of it again earlier.

#

I missed some of the last bit the other day.

edgy pilot
#

It's amazing to read what's going on. I don't have a clue what any of it means, but it looks impressive:)

rain obsidian
#

Starting on the tile writeback buffer.

#

The buffer itself contains 1,024 pixels, each 32-bit.

#

(8-bit Alpha, plus 24-bit colour)

#

The Alpha will be needed for proper pixel blending later.

#

The final framebuffer in VRAM is (usually) only 16BPP.

#

So each 32-bit word written back to VRAM can hold two 16BPP pixels (very often in 565 format).

#

So I had to organize the tile output buffer as 512 words of 64-bit.

#

The tile buffer itself also holds two (32-bit ARGB) pixels per 64-bit word.

#

That means internally, it will ofc write only one ARGB pixel at a time to the buffer.

rain obsidian
#

But will then be able to Burst write TWO pixels at once into VRAM, taking only 512 clock cycles instead of 1,024.
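
A rough software sketch of that layout, assuming a 32x32 tile (1,024 pixels) and assuming even pixels sit in the upper 32 bits of each 64-bit word. The `TileBuffer` type and its method names are made up for illustration; the real thing is a RAM on the FPGA.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model of a 32x32 tile buffer stored as 512 x 64-bit words.
// Lane order is an assumption: even pixels in the upper 32 bits.
struct TileBuffer {
    std::array<uint64_t, 512> words{};

    // 10-bit pixel address: y in the upper 5 bits, x in the lower 5 bits.
    static uint32_t pixelAddr(uint32_t x, uint32_t y) {
        assert(x < 32 && y < 32);
        return (y << 5) | x;
    }

    // Write a single ARGB pixel; only one 32-bit lane of the word changes.
    void writePixel(uint32_t x, uint32_t y, uint32_t argb) {
        const uint32_t addr = pixelAddr(x, y);
        uint64_t& word = words[addr >> 1];
        if ((addr & 1u) == 0) {
            word = (word & 0x00000000FFFFFFFFull) | (uint64_t{argb} << 32);
        } else {
            word = (word & 0xFFFFFFFF00000000ull) | uint64_t{argb};
        }
    }

    // Read a single ARGB pixel back out of its lane.
    uint32_t readPixel(uint32_t x, uint32_t y) const {
        const uint32_t addr = pixelAddr(x, y);
        const uint64_t word = words[addr >> 1];
        return (addr & 1u) == 0 ? static_cast<uint32_t>(word >> 32)
                                : static_cast<uint32_t>(word);
    }
};

// Writeback reads one 64-bit word (two pixels) per step.
constexpr uint32_t kPixelsPerTile = 32 * 32;              // 1,024
constexpr uint32_t kWritebackSteps = kPixelsPerTile / 2;  // 512
static_assert(kWritebackSteps == 512, "two pixels per word halves the step count");
```

Pixels go in one at a time through `writePixel`, but the writeback side only has to walk 512 words, which is where the 512 versus 1,024 cycle figure comes from.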

rain obsidian
# pseudo tinsel yep

It's just strange that most games only use a 16BPP framebuffer? I guess to save on VRAM and bandwidth mainly?

pseudo tinsel
#

just to save vram

#

some games use 24bpp

#

some, even more rare, 32 bpp

#

24bpp is for example Soul Calibur

#

which also does other nice tricks, like YUV VQ

rain obsidian
#

Ahh, OK. So some might use 24BPP for title screens etc?

faint oxide
#

keep posting away!

rain obsidian
fossil flameBOT
#
Multiple writes can also be issued without pausing when DDRAM_BURSTCNT = 1.
rain obsidian
#

Wha?

#

Didn't know that. lol

#

Years ago, you had to explicitly set Burstcnt > 1, to well, burst more Words at a time.

#

That's confusing.

#

I don't get why it would need Burst Count at all, if it will just accept multiple writes quickly?

#

I suppose it does matter...

#

The DDR controller will accept a certain number of contiguous Writes, until its command FIFO fills up.

#

But it probably won't be doing actual Burst transfers to DDR unless Burstcnt > 1.

#

So it will assert DDRAM_BUSY much sooner, unless you're using actual Bursts.

#

Or something.

#

I've written most of the Tile writeback logic now.

#

And double-checked stuff with both ChatGippity and Claude AI.

#

But both of them are quite dumb.

#

They make some very obvious mistakes, but I get that's how the whole "large language model" works.

#

It's just strange, that most of them seem to know EXACTLY what's wrong (and why it's wrong), as soon as you point it out. lol

#

I think AI language models will rapidly get "smarter" in a very short time. It's already scary.

#

I just wish they didn't use the term "AI" to describe it.

#

I think what we used to know as real AI, would now be called Artificial general intelligence (AGI).

#
The internal DDR3 controller handles writes very efficiently, so burst writes are typically not required.
#

I mean, OK.

#

If the controller is handling Burst writes (on the DDR chip side) when you shove a lot of data in a contiguous chunk, then what's the point of keeping the Burstcnt input? lol

rain obsidian
#

Tile ARGB buffer added.

#

Vertical lines fixed again.

#

It was a bit more involved than I expected.

#

No Alpha stuff being handled yet.

#

But the buffer does store ARGB values (8-bit Alpha, 24-bit colour).

#

It currently always converts to 16BPP (565), and writes two pixels at once (from the buffer) to VRAM.
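
The conversion itself is just bit truncation. A minimal C++ sketch, assuming the usual ARGB8888 layout (Alpha in bits 31..24, then R, G, B) and assuming the first pixel of each pair goes in the upper 16 bits of the output word. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Truncate the 8-bit channels to 5/6/5 bits; Alpha is dropped.
constexpr uint16_t argbToRgb565(uint32_t argb) {
    const uint32_t r = (argb >> 19) & 0x1Fu;  // bits 23..19
    const uint32_t g = (argb >> 10) & 0x3Fu;  // bits 15..10
    const uint32_t b = (argb >> 3) & 0x1Fu;   // bits 7..3
    return static_cast<uint16_t>((r << 11) | (g << 5) | b);
}

// Pack two 565 pixels into one 32-bit word for VRAM.
// Which pixel lands in the upper half is an assumption here.
constexpr uint32_t packTwoPixels(uint32_t argbFirst, uint32_t argbSecond) {
    return (uint32_t{argbToRgb565(argbFirst)} << 16) | argbToRgb565(argbSecond);
}

static_assert(argbToRgb565(0xFFFFFFFFu) == 0xFFFFu, "white stays white");
static_assert(argbToRgb565(0xFFFF0000u) == 0xF800u, "pure red");
```

Plain truncation loses the low 3 (or 2) bits of each channel, so smooth gradients can show banding; dithering would be a separate step.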

#

Trying to do a Quartus compile with the changes, but it's already struggling.

#

No quick way to tell if Quartus will definitely infer a BRAM from the logic.

#

If it tries to use registers as "memory", it usually won't fit.

#

Even a relatively small chunk of memory in registers is super wasteful.

rain obsidian
#

I left it running while I watched some YT vids.

#

It's still going, almost two hours later. That's when you KNOW it won't fit. lol

rain obsidian
#

Added a tile ARGB buffer viewer thingy.

#

Just haven't figured out how to apply the texture sampler properly, to disable the linear sampling / blur.

#
    // Create a point-filtered texture sampler, so the tile viewer shows hard pixel edges.
    // (Drop the next line if sampDesc is already declared earlier in the same scope.)
    D3D11_SAMPLER_DESC sampDesc;
    ZeroMemory(&sampDesc, sizeof(sampDesc));
    //sampDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;    // linear filtering is what causes the blur
    sampDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_POINT;
    sampDesc.AddressU = D3D11_TEXTURE_ADDRESS_WRAP;
    sampDesc.AddressV = D3D11_TEXTURE_ADDRESS_WRAP;
    sampDesc.AddressW = D3D11_TEXTURE_ADDRESS_WRAP;
    sampDesc.MipLODBias = 0.f;
    sampDesc.ComparisonFunc = D3D11_COMPARISON_ALWAYS;
    sampDesc.MinLOD = 0.f;
    sampDesc.MaxLOD = 0.f;
    g_pd3dDevice->CreateSamplerState(&sampDesc, &g_pTileSampler);
#

Found some example code that renders polys, and applies the sampler.

#
g_pImmediateContext->PSSetSamplers(0, 1, &g_pSamplerLinear);
#

PSSetSamplers = Pixel Shader.

#
        g_pd3dDeviceContext->PSSetSamplers(0, 1, &g_pTileSampler);
        g_pd3dDeviceContext->UpdateSubresource(p_tile_tex, 0, NULL, tile_ptr, tile_tex_width*4, 0);
#

No idea what I'm doing.

fossil flameBOT
#
hr = g_pd3dDevice->CreateSamplerState( &sampDesc, &g_pSamplerLinear );
rain obsidian
#

Don't know what to use to set the sampler thingy, for a 2D texture.

#

I'll have to ask ChatGippity.

#

Surprisingly hard to find example code online, for something as "trivial" as displaying a texture using dx11.

#

It's taking over an HOUR to compile the core now. lol

#

Using 85% of the FPGA, and that's with the precision for the interp stuff reduced even further.

#

And yep, it's not inferring a BRAM properly, for the tile ARGB buffer.

#

I can fix that.

#
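module tile_argb_mem (
    input clock,
    
    input [8:0] addr,
    input [63:0] din,
    input [1:0] be,        // one enable per 32-bit pixel lane
    input we,

    output reg [63:0] dout
);

reg [63:0] buff [0:511];    // 512 x 64-bit = 1,024 ARGB pixels

always @(posedge clock) begin
    if (we) begin
        // Slice din explicitly; assigning all 64 bits to a 32-bit lane silently truncates.
        if (be[1]) buff[ addr ][63:32] <= din[63:32];
        if (be[0]) buff[ addr ][31:0]  <= din[31:0];
    end

    dout <= buff[ addr ];    // registered read, so one clock of latency
end

endmodule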
#
wire [9:0] pix_in_addr = {y_ps[4:0], x_ps[4:0]};    // pixel index within the 32x32 tile

wire [63:0] buff_dout;    // declared before the part-selects below use it

// ARGB8888 to RGB565: keep the top 5/6/5 bits of R/G/B, drop the Alpha.
wire [15:0] pix0_565 = {buff_dout[55:51], buff_dout[47:42], buff_dout[39:35]};
wire [15:0] pix1_565 = {buff_dout[23:19], buff_dout[15:10], buff_dout[07:03]};
assign twopix_out = {pix0_565, pix1_565};    // 32-bit Word, to write to the VRAM framebuffer.

wire [8:0] buff_addr = wb_active ? wb_word_cnt : pix_in_addr[9:1];
wire [1:0] buff_be = (!pix_in_addr[0]) ? 2'b10 : 2'b01;    // even pixel goes to the upper lane

tile_argb_mem  tile_argb_mem_inst (
    .clock( clock ),                // input  clock
    
    .addr( buff_addr ),                // input [8:0]  addr
    .din( {argb_in, argb_in} ),        // input [63:0]  din
    .be( buff_be ),                    // input [1:0]  be
    .we( wr_pix ),                    // input  we
    
    .dout( buff_dout )                // output [63:0]  dout
);
#

And I swear I didn't use AI for any of that. lol

#

Or, I could just stop being lazy, and instantiate an altsyncram.

rain obsidian
#

Compile time back down to ~33 minutes.

#

Render looks a bit better than the last one, as it's not interleaved with the frame from reicast any more.

#

But it's still replicating every pair of pixels atm, which is giving it the chunky look.

#

And yes, the vertical lines are back. lol

#

It's just due to data delays.

#

I added the new tile buffer thing, and now there is a delay again.

#

Right now, I'm trying to get it to display the full horizontal resolution, then I'll fix the vert lines later.

#

The good news is, the Tile ARGB buffer and VRAM writeback seem to work fine so far.

#

It's already noticeably faster to render the above frame.

#

Maybe in the range of ~200-220ms ?

#

Most of the other images take just over 1 second to render.

#

But this is only at 16 MHz again.

#

And no Burst reads for textures yet, which should get the next BIG speed-up.

#

If I can just get the frame times low enough for say 15-20 FPS, I can try doing some anims on the FPGA.
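
Rough numbers for that target, as a small C++ sketch (helper names are made up for illustration). A 200 ms frame is 5 FPS and 220 ms is about 4.5 FPS; 15 to 20 FPS needs roughly 67 ms down to 50 ms per frame. At the current 16 MHz, that is a budget of about 1,066,666 down to 800,000 clock cycles per frame.

```cpp
#include <cassert>
#include <cstdint>

// Frames per second for a given frame time in milliseconds.
inline double fpsFromFrameMs(double frameMs) {
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}

// Whole clock cycles available per frame at a given clock rate and target FPS.
inline uint64_t cyclesPerFrame(uint64_t clockHz, uint64_t targetFps) {
    assert(targetFps > 0);
    return clockHz / targetFps;
}
```

For example, `cyclesPerFrame(16000000, 20)` gives the 800,000-cycle budget, so going from ~200 ms to 50 ms means roughly a 4x speed-up from some mix of clock rate and fewer cycles per pixel.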